I'm getting rx_fifo_errors and rx_dropped errors when receiving UDP packets. I have 8 applications, each receiving ~8000-byte UDP packets from 7 different pieces of hardware, each with its own IP address. The packet and data rates are identical for each application, totalling 440k packets/sec and 29 Gbit/sec respectively. The packets are all transmitted synchronously, at a rate of two 8000-byte packets every 1.5 ms from each of the 56 hardware cards.
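As a rough sanity check on those aggregate figures, here is a small bash sketch that derives the average packet size and the per-application share. The helper name rate_summary is hypothetical, and it assumes the load is split evenly across the 8 applications, as stated above.

```shell
#!/usr/bin/env bash
# rate_summary TOTAL_PPS TOTAL_GBPS N_APPS
# Hypothetical helper: derives the average packet size and the
# per-application share from the aggregate figures, assuming the load is
# split evenly across the applications.
rate_summary() {
    awk -v pps="$1" -v gbps="$2" -v apps="$3" 'BEGIN {
        # bits/sec divided by packets/sec gives bits per packet; /8 for bytes
        printf "avg bytes per packet: %.0f\n", gbps * 1e9 / pps / 8
        printf "per-application rate: %.0f packets/sec, %.3f Gbit/sec\n", pps / apps, gbps / apps
    }'
}

# Figures from the post: 440k packets/sec and 29 Gbit/sec over 8 applications
rate_summary 440000 29 8
```

This prints an average of about 8239 bytes per packet, roughly in line with the ~8000-byte packets described, and 55000 packets/sec (3.625 Gbit/sec) per application.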
In this mode, rx_dropped and rx_fifo_errors increased by a few tens of packets per second. Attached is a dump of the ethtool output. vma_stats shows no dropped packets. Each application is bound with numactl to NUMA node 1 (which is where the NIC is attached). top shows each core on that node running at < 40% CPU. The switch shows no dropped packets.
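To put a number on "a few tens of packets per second", the kernel's standard per-interface counters with the same names (under /sys/class/net/<interface>/statistics/) can be read twice and differenced. A minimal bash sketch; counter_rate, the IFACE variable and the script name drop_rate.sh are my own placeholders, not part of ethtool or libvma.

```shell
#!/usr/bin/env bash
# counter_rate FILE INTERVAL_SECONDS
# Reads a monotonically increasing counter twice and prints the average
# increase per second over the interval (integer division).
counter_rate() {
    local file=$1 interval=$2 before after
    before=$(<"$file")
    sleep "$interval"
    after=$(<"$file")
    echo $(( (after - before) / interval ))
}

# Only sample when IFACE is given, e.g. IFACE=<interface> ./drop_rate.sh
# (the interface name is not in the post; substitute the ConnectX-3 port).
if [ -n "${IFACE:-}" ]; then
    stats=/sys/class/net/$IFACE/statistics
    for c in rx_dropped rx_fifo_errors; do
        # each counter is sampled over its own 10 s window
        echo "$c: $(counter_rate "$stats/$c" 10) per second"
    done
fi
```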
My libvma configuration is shown below. I had the same problem without libvma (i.e. with vanilla Linux kernel packet processing).
Can anyone give me some hints on where to look to reduce the number of lost packets?
Many thanks in advance,
export VMA_MTU=9000 # shouldn't need to set this (VMA should work it out), but we'll set it anyway for now
export VMA_RX_BUFS=32768 # number of buffers, each 1 x MTU. Default is 200000 = 1 GB!
export VMA_RX_WRE=4096 # number of work requests
export VMA_RX_POLL=0 # don't waste CPU time polling; we don't need to
export VMA_TX_BUFS=256 # don't need many of these, so make it smaller
export VMA_TX_WRE=32 # don't need to tx, so make this small to save memory
export VMA_THREAD_MODE=0 # all socket processing is single threaded
export VMA_CQ_KEEP_QP_FULL=0 # this causes packet drops, according to the docs??
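A back-of-envelope look at what these RX settings buy per process, as a bash sketch. The helper vma_rx_budget is hypothetical and rests on two assumptions: each RX buffer holds one MTU-sized packet (as the VMA_RX_BUFS comment says), and each application receives 440k/8 = 55000 packets/sec.

```shell
#!/usr/bin/env bash
# vma_rx_budget RX_BUFS MTU PKTS_PER_SEC
# Hypothetical helper for a rough estimate only: assumes each RX buffer holds
# one MTU-sized packet and ignores any per-buffer overhead.
vma_rx_budget() {
    awk -v bufs="$1" -v mtu="$2" -v pps="$3" 'BEGIN {
        # memory in units of 1024*1024 bytes
        printf "buffer memory: %.0f MB\n", bufs * mtu / (1024 * 1024)
        # how long the buffers last at full rate if nothing is consumed
        printf "buffering at full rate: %.0f ms\n", bufs / pps * 1000
    }'
}

# Per process: VMA_RX_BUFS=32768, VMA_MTU=9000, and 440k packets/sec split
# evenly over 8 applications = 55000 packets/sec
vma_rx_budget 32768 9000 55000
```

This gives about 281 MB and 596 ms per process. Running it with 4096 (the VMA_RX_WRE value) in place of 32768 gives about 74 ms, if each work request likewise holds one packet.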
ban115@tethys:~$ lspci -v | grep -A 10 ellanox
84:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Subsystem: Mellanox Technologies MT27500 Family [ConnectX-3]
Flags: bus master, fast devsel, latency 0, IRQ 74, NUMA node 1
Memory at c9800000 (64-bit, non-prefetchable) [size=1M]
Memory at c9000000 (64-bit, prefetchable) [size=8M]
Expansion ROM at <ignored> [disabled]
Capabilities: <access denied>
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core
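lspci reports NUMA node 1 for the ConnectX-3. One way the same information might be read from sysfs and used to launch a receiver bound to that node with libvma preloaded is sketched below in bash. It assumes libvma.so is on the loader's search path; nic_node, run_on_nic_node, and the interface and program names in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# nic_node IFACE [SYSFS_ROOT]: print the NUMA node the kernel reports for IFACE.
# SYSFS_ROOT defaults to /sys; it is a parameter only so it can be redirected.
nic_node() {
    local node
    node=$(cat "${2:-/sys}/class/net/$1/device/numa_node") || return 1
    # -1 means the kernel has no NUMA locality information for the device
    [ "$node" -ge 0 ] || return 1
    echo "$node"
}

# run_on_nic_node IFACE COMMAND [ARGS...]: run COMMAND with CPUs and memory
# bound to the NIC's NUMA node, with libvma loaded via LD_PRELOAD.
run_on_nic_node() {
    local iface=$1 node
    shift
    node=$(nic_node "$iface") || { echo "no NUMA node for $iface" >&2; return 1; }
    numactl --cpunodebind="$node" --membind="$node" env LD_PRELOAD=libvma.so "$@"
}
```

Usage would look like run_on_nic_node eth2 ./receiver, where both eth2 and ./receiver are placeholders. Passing LD_PRELOAD through env keeps libvma out of the numactl process itself.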
ban115@tethys:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 15968 MB
node 0 free: 129 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 16114 MB
node 1 free: 2106 MB
node 0 1
0: 10 21
1: 21 10
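numactl reports only 2106 MB free on node 1. The bash sketch below reads the same per-node figure from sysfs and compares it with a rough libvma RX buffer footprint, assuming all 8 processes use VMA_RX_BUFS=32768 with one 9000-byte MTU per buffer. Whether those buffers are already counted in the free figure depends on whether the applications were running when this output was captured. node_free_mb is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# node_free_mb NODE [SYSFS_ROOT]
# Prints free memory on one NUMA node in units of 1024*1024 bytes, read from
# the per-node meminfo file (lines look like "Node 1 MemFree:  <n> kB").
# SYSFS_ROOT defaults to /sys; it is a parameter only so it can be redirected.
node_free_mb() {
    awk '$3 == "MemFree:" { printf "%d\n", $4 / 1024 }' \
        "${2:-/sys}/devices/system/node/node$1/meminfo"
}

# Estimated libvma RX buffer footprint for all 8 processes, assuming one
# MTU (9000 bytes) per buffer and VMA_RX_BUFS=32768 each.
need_mb=$(( 8 * 32768 * 9000 / 1024 / 1024 ))

if [ -r /sys/devices/system/node/node1/meminfo ]; then
    echo "node 1 free: $(node_free_mb 1) MB; libvma RX buffers (8 processes): ${need_mb} MB"
fi
```

With these values need_mb is 2250 MB, slightly more than the 2106 MB shown as free on node 1.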