
    ConnectX-4 RX performance issues on DPDK

    pawmal

      Are there any RX-side pps performance tips for the ConnectX-4 / mlx5 PMD family?

       

      Our use case requires optimising RX pps; I don't care about TX. Adding more receiving lcores actually decreases RX performance.

      After applying performance tips I am able to achieve 107 Mpps on the TX side (no RX) using a single 5-tuple, or around 92 Mpps using 16 5-tuples for better RSS hashing.

      However, I am not able to exceed 60 Mpps on the RX side in one very specific case, and I get around 18-37 Mpps in more typical cases. (Performance is heavily affected by increasing the number of queues above 4.)
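
      For context, the RX path being measured is essentially the loop below (a minimal sketch, not testpmd's actual code): each receiving lcore polls one RX queue and immediately frees the packets, so per-packet work is negligible. The port id, the one-queue-per-lcore mapping and the burst size are assumptions for illustration only.

      #include <stdint.h>
      #include <rte_ethdev.h>
      #include <rte_mbuf.h>

      #define BURST 64

      /* rxonly-style worker: poll one RX queue and drop everything.
       * Launched once per receiving lcore, e.g. via rte_eal_remote_launch(). */
      static int
      rx_loop(void *arg)
      {
          const uint16_t port  = 0;                        /* assumed port id */
          const uint16_t queue = (uint16_t)(uintptr_t)arg; /* one RX queue per lcore */
          struct rte_mbuf *pkts[BURST];

          for (;;) {
              uint16_t nb = rte_eth_rx_burst(port, queue, pkts, BURST);
              for (uint16_t i = 0; i < nb; i++)
                  rte_pktmbuf_free(pkts[i]);               /* count-and-drop only */
          }
          return 0;
      }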

       

      Running our DPDK application on 2x10G and 4x10G cards with different PMDs, we see much more predictable performance scaling. I would rather expect that with 8 RX lcores I would be close to 100M RX pps.

       

      Test setup details:

      • testpmd + dpdk-pktgen or dpdk-pktgen alone
      • DPDK 17.11
      • one 2x100G OEM card, Mellanox Technologies MT27700 Family [ConnectX-4] (mt4115), FW upgraded to 12.21
      • the two ports connected to each other via a 1 m MCP1600 copper cable
      • PCIe 3.0 x16 slot, DevCtl MaxPayload 256 bytes, MaxReadReq 1024 bytes
      • E5-2650 v4 @ 2.20GHz CPU (12 cores), turbo disabled

       

      I'm not expecting 148 Mpps here, but according to the performance results from http://fast.dpdk.org/doc/perf/DPDK_17_11_Mellanox_NIC_performance_report.pdf , the card should be able to do >90 Mpps full duplex using a single port.

      I use two ports, though: one port for RX and one for TX.
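
      For reference, 148 Mpps is simply the 64-byte line rate of a single 100G port; a quick back-of-the-envelope check (plain C, no DPDK; the extra 20 bytes per frame are preamble + inter-frame gap):

      #include <stdio.h>

      int main(void)
      {
          const double link_bps   = 100e9;        /* 100 GbE */
          const double wire_bytes = 64.0 + 20.0;  /* 64B frame + preamble/IFG */
          /* 100e9 / (84 * 8) ~= 148.8 Mpps per port at minimum frame size */
          printf("line rate: %.1f Mpps\n", link_bps / (wire_bytes * 8.0) / 1e6);
          return 0;
      }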

       

      Example commands:

      ./testpmd --file-prefix=820 --socket-mem=8192,8192 -l 12-23 -n 2 -w 0000:82:00.0,txq_inline=256 -- --port-topology=chained --forward-mode=rxonly --rss-udp --rxq=2 --txq=2 --nb-cores=8 --socket-num=1 --stats-period=1 --burst=128 --rxd=2048 --txd=512

      ./testpmd --file-prefix=820 --socket-mem=8192,8192 -l 12-23 -n 2 -w 0000:82:00.0,txq_inline=256 -- --port-topology=chained --forward-mode=rxonly --rss-udp --rxq=8 --txq=8 --nb-cores=8 --socket-num=1 --stats-period=1 --burst=128 --rxd=2048 --txd=512

      ./pktgen --file-prefix=both --socket-mem=28672,28672 -w 0000:82:00.0,txq_inline=256,txqs_min_inline=4 -w 0000:82:00.1,txq_inline=256,txqs_min_inline=4 -l 0-11,12-23 -n 4 -- -P -N -T -m "[1:12-15].0, [16-23:1].1"
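
      For anyone mapping these testpmd options onto application code: --rxq/--rss-udp roughly correspond to the ethdev configuration sketched below. This is only an illustration against the generic DPDK 17.11 API (per-queue setup and device start omitted); the port id and queue counts are placeholders.

      #include <rte_ethdev.h>

      /* Rough application-level equivalent of "--rxq=N --rss-udp":
       * N RX queues with RSS hashing over IP/UDP flows. */
      static int
      configure_rss(uint16_t port, uint16_t nb_rxq, uint16_t nb_txq)
      {
          struct rte_eth_conf conf = {
              .rxmode = { .mq_mode = ETH_MQ_RX_RSS },
              .rx_adv_conf = {
                  .rss_conf = {
                      .rss_key = NULL,                      /* default RSS key */
                      .rss_hf  = ETH_RSS_IP | ETH_RSS_UDP,  /* hash on IP + UDP */
                  },
              },
          };
          /* rte_eth_rx_queue_setup() and rte_eth_dev_start() would follow */
          return rte_eth_dev_configure(port, nb_rxq, nb_txq, &conf);
      }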

      ...

        • Re: ConnectX-4 RX performance issues on DPDK
          pawmal

          I will respond to myself; I hope somebody will find this useful.

          • moving the traffic from card0/port0 -> card0/port1 to card0/port0 -> card1/port0 helped a lot
          • dpdk-pktgen requires some code tuning: more mbufs, larger bursts, etc.
          • dpdk-pktgen's range traffic sometimes seems to be skewed / is not distributed equally by the RX side's RSS
          • I had a better experience with testpmd in TX mode, although its txonly mode does not randomise IP addresses and its flowgen mode is very slow (see the sketch at the end of this reply)
          • be careful: testpmd requires #RX cores = #TX cores (it silently uses the MIN of the two numbers), whereas in pktgen one can assign only 1 core to the RX-doing-nothing side, which was better for txonly performance: ./pktgen --file-prefix=second --socket-mem=128,16384 -w 0000:82:00.0,txq_inline=128 -l 0,12-23 -n 2 -- -N -T -m "[12:13-23].0"
          • all in all, I was able to reach:
            • around 85 Mpps of rxonly traffic using 8 cores (2.1 GHz, turbo off), and probably a little more is possible (there were spare CPU cycles), as it kept up with 100% of the generator's rate;
            • up to 107 Mpps txonly using 11 local-NUMA cores + some borrowed remote-NUMA cores
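
          The sketch mentioned above: this is roughly the kind of 5-tuple variation I mean when I say the generator should randomise addresses so the receiver's RSS spreads flows across its RX queues. It is only an illustration using the DPDK 17.11 packet headers (ether_hdr/ipv4_hdr/udp_hdr); the 10.0.0.x source addresses, the port base and the 16-flow count are placeholder assumptions, not what testpmd or pktgen actually do.

          #include <rte_ether.h>
          #include <rte_ip.h>
          #include <rte_udp.h>
          #include <rte_mbuf.h>
          #include <rte_byteorder.h>

          /* Rewrite the 5-tuple of an already-built UDP-in-IPv4 packet (no IP
           * options assumed) so the receiver's RSS spreads the flows across its
           * RX queues. Assumes the UDP checksum is left at 0 (legal for IPv4). */
          static void
          vary_5tuple(struct rte_mbuf *m, uint16_t seq)
          {
              struct ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
                                                            sizeof(struct ether_hdr));
              struct udp_hdr *udp = (struct udp_hdr *)(ip + 1);

              /* cycle through 16 source IP/port pairs (the "16 5-tuples" case) */
              ip->src_addr  = rte_cpu_to_be_32(IPv4(10, 0, 0, 1 + (seq & 0x0f)));
              udp->src_port = rte_cpu_to_be_16(1024 + (seq & 0x0f));

              /* the IPv4 header checksum must be recomputed after touching src_addr */
              ip->hdr_checksum = 0;
              ip->hdr_checksum = rte_ipv4_cksum(ip);
          }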