3 Replies Latest reply on Aug 18, 2016 12:48 PM by fangchin

    Is this the best our FDR adapters can do?

    fangchin

      We have a small test setup, illustrated below. I have done some ib_write_bw tests and got "decent" numbers, but not as fast as I anticipated.  First, some background on the setup:

       

      [Image: ipoib_for_the_network_layout_after.png]

       

      Two 1U storage servers each have an EDR HCA (MCX455A-ECAT). The other four each have a ConnectX-3 VPI FDR 40/56Gb/s mezzanine HCA OEMed by Mellanox for Dell.  The firmware version is 2.33.5040.  This is not the latest (2.36.5000 according to hca_self_test.ofed), but I am new to IB and still getting up to speed with Mellanox's firmware tools.  The EDR HCA firmware was updated when MLNX_OFED was installed.
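
 

      From what I've gathered so far, the firmware can at least be inspected with the MFT tools before attempting an update; a minimal sketch, where the mst device path is illustrative and host-specific:

 

      [root@sc2u0n0 ~]# mlxfwmanager --query                      # lists each HCA with its PSID and current firmware
      [root@sc2u0n0 ~]# mst start && mst status                   # exposes the /dev/mst/* devices used by flint
      [root@sc2u0n0 ~]# flint -d /dev/mst/mt4099_pci_cr0 query    # per-device query; a new image would then be burned with flint ... -i <image.bin> burn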

       

      All servers:

      CPU: 2 x Intel E5-2620v3, 2.4 GHz, 6 cores/12 HT

      RAM: 8 x 16 GiB DDR4 1866 MHz DIMMs

      OS: CentOS 7.2 Linux ... 3.10.0-327.28.2.el7.x86_64 #1 SMP Wed Aug 3 11:11:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

      OFED: MLNX_OFED_LINUX-3.3-1.0.4.0 (OFED-3.3-1.0.4)

       

      A typical ib_write_bw test:

       

      Server:

      [root@fs00 ~]# ib_write_bw -R

       

       

      ************************************

      * Waiting for client to connect... *

      ************************************

      ---------------------------------------------------------------------------------------

                          RDMA_Write BW Test

      Dual-port       : OFF Device         : mlx5_0

      Number of qps   : 1 Transport type : IB

      Connection type : RC Using SRQ      : OFF

      CQ Moderation   : 100

      Mtu             : 2048[B]

      Link type       : IB

      Max inline data : 0[B]

      rdma_cm QPs : ON

      Data ex. method : rdma_cm

      ---------------------------------------------------------------------------------------

      Waiting for client rdma_cm QP to connect

      Please run the same command with the IB/RoCE interface IP

      ---------------------------------------------------------------------------------------

      local address: LID 0x03 QPN 0x01aa PSN 0x23156

      remote address: LID 0x05 QPN 0x4024a PSN 0x28cd2e

      ---------------------------------------------------------------------------------------

      #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]

      65536      5000             6082.15            6081.07   0.097297

      ---------------------------------------------------------------------------------------

       

      Client:

      [root@sc2u0n0 ~]# ib_write_bw -d mlx4_0 -R 192.168.111.150

      ---------------------------------------------------------------------------------------

                          RDMA_Write BW Test

      Dual-port       : OFF Device         : mlx4_0

      Number of qps   : 1 Transport type : IB

      Connection type : RC Using SRQ      : OFF

      TX depth        : 128

      CQ Moderation   : 100

      Mtu             : 2048[B]

      Link type       : IB

      Max inline data : 0[B]

      rdma_cm QPs : ON

      Data ex. method : rdma_cm

      ---------------------------------------------------------------------------------------

      local address: LID 0x05 QPN 0x4024a PSN 0x28cd2e

      remote address: LID 0x03 QPN 0x01aa PSN 0x23156

      ---------------------------------------------------------------------------------------

      #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]

      65536      5000             6082.15            6081.07   0.097297

      ---------------------------------------------------------------------------------------

       

      Now 6082 MB/s ~ 48.65 Gbps.  Even taking the 64/66b encoding overhead into account, I would expect over 50 Gbps.  Is this the best the setup can do, or is there anything I can do to push the speed up further?  (Quick numbers below.)
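
 

      For the record, treating the reported MB as 10^6 bytes, with FDR 4x at 14.0625 Gb/s per lane and 64/66b encoding:

 

      awk 'BEGIN { printf "reported : %.2f Gbps\n", 6082 * 1e6 * 8 / 1e9 }'   # 48.66 Gbps as reported
      awk 'BEGIN { printf "FDR line : %.2f Gbps\n", 14.0625 * 4 * 64 / 66 }'  # 54.55 Gbps theoretical data rate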

       

      Looking forward to hearing experiences and observations from the more experienced camp!  Thanks!

        • Re: Is this the best our FDR adapters can do?
          praetzel

          One thing to keep in mind is that you'll hit the bandwidth of the PCIe bus.

          I've not used the ib_write_bw test myself, but I'm fairly sure it's not actually handling the data, just accepting it and tossing it away, so it's going to be a theoretical maximum.

          In real-life situations that bus is handling all data in and out of the CPU. On my oldest motherboards that maxes out at 25 Gb/s, which is what I hit with fio tests on QDR links.  I've heard that with PCIe gen 3 you'll get up to 35 Gb/s.

          Generally, whenever newer networking tech rolls out, there is nothing a single computer can do to saturate the link unless it's pushing junk data; the only way to really max it out is switch-to-switch (hardware-to-hardware) traffic.

          Of course, using IPoIB, and anything other than native IB traffic, is going to cost you performance.  In my case of NFS over IPoIB (with or without RDMA) I quickly slam into the bandwidth of my SSDs.  The only exception is the Oracle DB, where the low latency is what I'm after, as the database is small enough to fit in RAM.

            • Re: Is this the best our FDR adapters can do?
              fangchin

               

              Thanks for sharing your experience.  I did the following:

               

              [root@sc2u0n0 ~]# dmidecode |grep PCI

                Designation: PCIe Slot 1

                Type: x8 PCI Express 3 x16

                Designation: PCIe Slot 3

                Type: x8 PCI Express 3

               

              lspci -vv

              [...]

              02:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

              [...]

                              LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited

                                      ClockPM- Surprise- LLActRep- BwNot-

               

              So the theoretical PCIe ceiling should be 8 GT/s per lane x 8 lanes x 128b/130b ≈ 63 Gbps.  In fact, we just did an fio sweep using fio-2.12 (rough invocation listed after the write results).  The read numbers are quite reasonable; we are now investigating why the write is so low.
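
 

              Worked out quickly, plus the lspci line that confirms the negotiated link for the 02:00.0 device shown above:

 

              awk 'BEGIN { printf "PCIe gen3 x8: %.1f Gbps\n", 8 * 8 * 128 / 130 }'   # 63.0 Gbps, i.e. ~7.9 GB/s
              lspci -s 02:00.0 -vv | grep -E 'LnkCap|LnkSta'                          # LnkSta should show Speed 8GT/s, Width x8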

               

              A. Read test results

               

              • Chunk size = 2 MiB
              • Num. Jobs = 32
              • IO Depth = 128
              • File size = 500 GiB
              • Test time = 360 seconds
              Mode                Speed (Gbps)    IOPS
              psync, direct       47.77           2986
              psync, buffered     24.49           1530
              libaio, direct      49.17           3073

               

               

              B. Write test results

               

              • Chunk size = 2 MiB
              • Num. Jobs = 32
              • IO Depth = 128
              • File size = 500 GiB
              • Test time = 360 seconds
              Mode                Speed (Gbps)    IOPS
              psync, direct       24.14           1509
              psync, buffered     9.32            583
              libaio, direct      22.51           1407
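
 

              For reference, the fio runs were along these lines; the job name and target directory here are placeholders, --rw was switched between read and write, the psync runs used --ioengine=psync, and the buffered runs dropped --direct=1:

 

              fio --name=sweep --directory=/mnt/ib_target --rw=read \
                  --ioengine=libaio --direct=1 --bs=2M --size=500G \
                  --numjobs=32 --iodepth=128 --runtime=360 --time_based \
                  --group_reporting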

               


               

               

            • Re: Is this the best our FDR adapters can do?
              fangchin

              I think I have the answer now.  It's due to the prevalent and inconsistent use of MB vs. MiB across different software applications.

               

              When I ran ib_write_bw with the --report_gbits flag, I did see over 50 Gbps. That got me curious, so I assumed the MB/s output is actually MiB/s; then 6082 MiB/s = 51.02 Gbps, as anticipated.
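
 

              The same arithmetic with MiB, plus the client-side invocation with the flag (same host IP as the earlier run):

 

              awk 'BEGIN { printf "%.2f Gbps\n", 6082 * 2^20 * 8 / 1e9 }'        # 51.02 Gbps
              ib_write_bw -d mlx4_0 -R --report_gbits 192.168.111.150            # reports bandwidth in Gb/s rather than MB/s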