
    Achieving 40Gbps with Ethernet mode on ConnectX-3 VPI

    susinthiran

      Hi,

      I'm trying to get a better understanding of how to achieve near line speed of 40Gbps on the following adapter card:

      [root@compute8 scripts]# lspci -vv -s 07:00.0  | grep "Part number" -A 3

                  [PN] Part number: MCX354A-FCBT      

                  [EC] Engineering changes: A4

                  [SN] Serial number: MT1334U01416         

                  [V0] Vendor specific: PCIe Gen3 x8.
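
      The [V0] line says PCIe Gen3 x8, which works out to roughly 63 Gbit/s of raw PCIe bandwidth, so the slot itself shouldn't be the bottleneck for a single 40GbE port. A quick way to confirm the negotiated link (same PCI address as above) is:

      [root@compute8 scripts]# lspci -vv -s 07:00.0 | grep -E "LnkCap|LnkSta"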

      Some system info

      [root@compute8 scripts]# cat /etc/centos-release

      CentOS Linux release 7.3.1611 (Core)

      [root@compute8 scripts]# ofed_info -s

      MLNX_OFED_LINUX-4.1-1.0.2.0:

      [root@compute8 scripts]# uname -r

      3.10.0-514.26.2.el7.x86_64

       

      Two identical HP ProLiant DL360p Gen8 servers, each equipped with two quad-core Intel(R) Xeon(R) E5-2609 0 @ 2.40GHz CPUs and 32GB of RAM. The performance profile on both servers is network-throughput (set via tuned).
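
      For reference, this is roughly how the profile is applied and verified with tuned-adm on each host; the second command should report network-throughput as the active profile:

      [root@compute8 ~]# tuned-adm profile network-throughput

      [root@compute8 ~]# tuned-adm active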

      The ConnectX-3 cards are connected back to back (no switch) with a 1m Mellanox FDR copper cable. The ports have been put into Ethernet mode, and I've followed the recommended optimization guide, Performance Tuning for Mellanox Adapters.
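
      For anyone reproducing the setup: the VPI ports can be switched to Ethernet mode at runtime through sysfs, or persistently with mlxconfig (a sketch; the mst device path below is a typical ConnectX-3 name and may differ, 2 means ETH, and the firmware setting only takes effect after a driver restart or reboot):

      [root@compute8 ~]# echo eth > /sys/bus/pci/devices/0000:07:00.0/mlx4_port1

      [root@compute8 ~]# mst start && mlxconfig -d /dev/mst/mt4099_pci_cr0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2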

      The problem is achieving anything near the 40Gbps line speed. I've tested with iperf2, since the iperf, iperf2, iperf3 comparison recommends it (and advises against using iperf3):

      Server side:

      [root@compute7 ~]# iperf -s

      ..

      ..

       

      Client side:

      [root@compute8 scripts]# iperf   -c 192.168.100.1  -P2

      ------------------------------------------------------------

      Client connecting to 192.168.100.1, TCP port 5001

      TCP window size:  325 KByte (default)

      ------------------------------------------------------------

      [  3] local 192.168.100.2 port 54430 connected with 192.168.100.1 port 5001

      [  4] local 192.168.100.2 port 54432 connected with 192.168.100.1 port 5001

      [ ID] Interval       Transfer     Bandwidth

      [  3]  0.0-10.0 sec  17.1 GBytes  14.7 Gbits/sec

      [  4]  0.0-10.0 sec  17.1 GBytes  14.7 Gbits/sec

      [SUM]  0.0-10.0 sec  34.1 GBytes  29.3 Gbits/sec

      [root@compute8 scripts]# iperf   -c 192.168.100.1  -P3

      ------------------------------------------------------------

      Client connecting to 192.168.100.1, TCP port 5001

      TCP window size:  325 KByte (default)

      ------------------------------------------------------------

      [  5] local 192.168.100.2 port 54438 connected with 192.168.100.1 port 5001

      [  4] local 192.168.100.2 port 54434 connected with 192.168.100.1 port 5001

      [  3] local 192.168.100.2 port 54436 connected with 192.168.100.1 port 5001

      [ ID] Interval       Transfer     Bandwidth

      [  5]  0.0-10.0 sec  14.4 GBytes  12.4 Gbits/sec

      [  4]  0.0-10.0 sec  15.2 GBytes  13.1 Gbits/sec

      [  3]  0.0-10.0 sec  15.3 GBytes  13.1 Gbits/sec

      [SUM]  0.0-10.0 sec  44.9 GBytes  38.6 Gbits/sec

      [root@compute8 scripts]# iperf   -c 192.168.100.1  -P4

      ------------------------------------------------------------

      Client connecting to 192.168.100.1, TCP port 5001

      TCP window size:  325 KByte (default)

      ------------------------------------------------------------

      [  6] local 192.168.100.2 port 54446 connected with 192.168.100.1 port 5001

      [  4] local 192.168.100.2 port 54440 connected with 192.168.100.1 port 5001

      [  5] local 192.168.100.2 port 54444 connected with 192.168.100.1 port 5001

      [  3] local 192.168.100.2 port 54442 connected with 192.168.100.1 port 5001

      [ ID] Interval       Transfer     Bandwidth

      [  6]  0.0-10.0 sec  11.0 GBytes  9.47 Gbits/sec

      [  4]  0.0-10.0 sec  12.4 GBytes  10.6 Gbits/sec

      [  5]  0.0-10.0 sec  13.0 GBytes  11.2 Gbits/sec

      [  3]  0.0-10.0 sec  8.09 GBytes  6.95 Gbits/sec

      [SUM]  0.0-10.0 sec  44.5 GBytes  38.2 Gbits/sec

       

      So it seems a minimum of 3 threads is needed to get close to line speed; increasing the thread count above 3 doesn't improve anything. While running the above tests, I observed a lot of activity in /proc/interrupts for ens2 (port 1 of the ConnectX-3), which means interrupts are being generated and requesting CPU time. This should not happen when RDMA is in use, and I've confirmed RDMA is working using some of the tools (ib_send_bw, rping, udaddy, rdma_server, etc.) mentioned in HowTo Enable, Verify and Troubleshoot RDMA.
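
      For completeness, the RDMA bandwidth check looks roughly like this (ib_send_bw comes from the perftest package; mlx4_0 is the usual device name for a single ConnectX-3 as reported by ibv_devinfo):

      [root@compute7 ~]# ib_send_bw -d mlx4_0 -i 1 --report_gbits

      [root@compute8 ~]# ib_send_bw -d mlx4_0 -i 1 --report_gbits 192.168.100.1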

      Why do these Mellanox utilities perform as intended? Is the answer their built-in RDMA support?

       

      Furthermore, running iperf under perf stat gives me some more details:

      [root@compute8 scripts]# perf stat -e  cpu-migrations,context-switches,task-clock,cycles,instructions,cache-references,cache-misses iperf -c 192.168.100.1  -P4

      ------------------------------------------------------------

      Client connecting to 192.168.100.1, TCP port 5001

      TCP window size:  325 KByte (default)

      ------------------------------------------------------------

      [  6] local 192.168.100.2 port 54470 connected with 192.168.100.1 port 5001

      [  4] local 192.168.100.2 port 54464 connected with 192.168.100.1 port 5001

      [  3] local 192.168.100.2 port 54466 connected with 192.168.100.1 port 5001

      [  5] local 192.168.100.2 port 54468 connected with 192.168.100.1 port 5001

      [ ID] Interval       Transfer     Bandwidth

      [  6]  0.0-10.0 sec  10.6 GBytes  9.08 Gbits/sec

      [  4]  0.0-10.0 sec  11.5 GBytes  9.85 Gbits/sec

      [  3]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec

      [  5]  0.0-10.0 sec  10.1 GBytes  8.69 Gbits/sec

      [SUM]  0.0-10.0 sec  44.6 GBytes  38.3 Gbits/sec

       

      Performance counter stats for 'iperf -c 192.168.100.1 -P4':

       

                     126      cpu-migrations            #    0.005 K/sec               

                  11,934      context-switches          #    0.446 K/sec               

            26730.400620      task-clock (msec)         #    2.666 CPUs utilized       

          63,926,425,845      cycles                    #    2.392 GHz                 

          25,417,772,891      instructions              #    0.40  insn per cycle                                         

           1,786,983,037      cache-references          #   66.852 M/sec               

             446,840,327      cache-misses              #   25.005 % of all cache refs 

       

            10.025755759 seconds time elapsed

       

      For instance, I observe a high number of CPU context switches, which are very costly.
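
      A related knob worth checking when chasing context switches and cache misses is keeping the iperf threads and the NIC interrupts on the NUMA node local to the adapter, for example as follows (assuming the interface is ens2, the MLNX_OFED affinity scripts are installed, and node 0 is what the first command reports):

      [root@compute8 ~]# cat /sys/class/net/ens2/device/numa_node

      [root@compute8 ~]# set_irq_affinity_bynode.sh 0 ens2

      [root@compute8 ~]# numactl --cpunodebind=0 --membind=0 iperf -c 192.168.100.1 -P4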

       

      But then, after some research, I discovered http://ftp100.cewit.stonybrook.edu/rperf. Using the rperf server and client, I was able to achieve near line speed without any further effort or additional threads:

      Server side:

      [root@compute7 ~]# rperf -s -p 5001 -l 500M -H

      ...

      Client side:

      [root@compute8 scripts]# perf stat -e  cpu-migrations,context-switches,task-clock,cycles,instructions,cache-references,cache-misses rperf -c $IP -p 5001 -H -G pw -l 500M -i 2

      ------------------------------------------------------------

      RDMA Client connecting to 192.168.100.1, TCP port 5001

      TCP window size: -1.00 Byte (default)

      ------------------------------------------------------------

      [  4] local 192.168.100.2 port 40580 connected with 192.168.100.1 port 5001

      [ ID] Interval       Transfer     Bandwidth

      [  4]  0.0- 2.0 sec  8.79 GBytes  37.7 Gbits/sec

      [  4]  2.0- 4.0 sec  9.28 GBytes  39.8 Gbits/sec

      [  4]  4.0- 6.0 sec  8.79 GBytes  37.7 Gbits/sec

      [  4]  6.0- 8.0 sec  9.28 GBytes  39.8 Gbits/sec

      [  4]  8.0-10.0 sec  9.28 GBytes  39.8 Gbits/sec

      [  4]  0.0-10.1 sec  45.9 GBytes  39.1 Gbits/sec

       

      Performance counter stats for 'rperf -c 192.168.100.1 -p 5001 -H -G pw -l 500M -i 2':

       

                      13      cpu-migrations            #    0.007 K/sec               

                   1,348      context-switches          #    0.734 K/sec               

             1836.487997      task-clock (msec)         #    0.152 CPUs utilized       

           4,393,238,230      cycles                    #    2.392 GHz                 

           9,201,275,892      instructions              #    2.09  insn per cycle                                         

              26,855,320      cache-references          #   14.623 M/sec               

              23,419,862      cache-misses              #   87.208 % of all cache refs 

       

            12.084867922 seconds time elapsed

       

      Note that the number of CPU context switches is very low compared to the iperf run, as is the CPU utilization (task-clock in msec). Monitoring /proc/loadavg also showed low CPU utilization.

      Another important observation I made is that only a few interrupts are generated, as seen in /proc/interrupts, and more importantly, the transmitted and received packet counters in /proc/net/dev on the client side remain unchanged while rperf is running. This clearly indicates that RDMA is being used here to move data directly from the application into the server, bypassing the kernel.
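
      (For anyone wanting to repeat that observation, watching the per-interface counters during a run is enough; the packet counters should stay flat while rperf is running, and the interrupt counters give a direct comparison with the iperf runs:)

      [root@compute8 ~]# watch -d -n 1 "grep ens2 /proc/net/dev"

      [root@compute8 ~]# watch -d -n 1 "grep ens2 /proc/interrupts"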

      I've also found that tuning the MTU to 9000 is vital; with the default settings, even rperf performs pretty badly!
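
      For reference, the MTU was set roughly like this on both ends (on CentOS 7 it can be made persistent by adding MTU=9000 to the interface's ifcfg file):

      [root@compute8 ~]# ip link set dev ens2 mtu 9000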

       

      According to Vangelis' post Cannot get 40Gbps on Ethernet mode with ConnectX-3 VPI, he was able to achieve near line speed using iperf2 without needing multiple threads. Why am I no longer able to do so? Has RDMA support been removed from iperf2 (assuming it was there at some point earlier)?

       

      What's important in the end is not the benchmarks, but the actual applications the systems will be running on Linux with the setup described above. How, and will, they be able to take advantage of RDMA and achieve anything close to line speed? Does each running application have to explicitly support RDMA in order to achieve these high speeds?
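
      (For unmodified socket applications, one option appears to be the rsocket preload library that ships with librdmacm, which intercepts the socket calls and carries them over RDMA; the library path below is where librdmacm typically installs it and may differ on other systems:)

      [root@compute8 ~]# LD_PRELOAD=/usr/lib64/rsocket/librspreload.so iperf -c 192.168.100.1 -P2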