
    Poor bandwidth performance when running with large block size

    wonzhq

      Hi all,

       

      I have a cluster running ROCE on Mellanox NIC.

      # lspci | grep Mellanox

      03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

      03:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

      There is a problem when I run a large-block-size workload on it: the bandwidth is very poor. I tried the ib_xxxx_bw tools, and ib_read_bw shows the same issue, as shown below:

      Server:

      # ib_read_bw -d mlx5_1 -i 1 -s 131072 -n 10000 -F --report_gbits

       

       

      ************************************

      * Waiting for client to connect... *

      ************************************

      ---------------------------------------------------------------------------------------

                          RDMA_Read BW Test

      Dual-port       : OFF Device         : mlx5_1

      Number of qps   : 1 Transport type : IB

      Connection type : RC Using SRQ      : OFF

      CQ Moderation   : 100

      Mtu             : 1024[B]

      Link type       : Ethernet

      GID index       : 3

      Outstand reads  : 16

      rdma_cm QPs : OFF

      Data ex. method : Ethernet

      ---------------------------------------------------------------------------------------

      local address: LID 0000 QPN 0x08cc PSN 0x9f5907 OUT 0x10 RKey 0x08da1a VAddr 0x007fdfbb260000

      GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62

      remote address: LID 0000 QPN 0x0ec1 PSN 0xc25c5e OUT 0x10 RKey 0x0d9351 VAddr 0x007f60c8de0000

      GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61

      ---------------------------------------------------------------------------------------

      #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

      ^@ 131072     10000           0.354959            0.091025            0.000087

      ---------------------------------------------------------------------------------------

       

      Client:

      # ib_read_bw -d mlx5_1 -i 1 -s 131072 -n 10000 -F --report_gbits 10.252.4.62

      ---------------------------------------------------------------------------------------

                          RDMA_Read BW Test

      Dual-port       : OFF Device         : mlx5_1

      Number of qps   : 1 Transport type : IB

      Connection type : RC Using SRQ      : OFF

      TX depth        : 128

      CQ Moderation   : 100

      Mtu             : 1024[B]

      Link type       : Ethernet

      GID index       : 3

      Outstand reads  : 16

      rdma_cm QPs : OFF

      Data ex. method : Ethernet

      ---------------------------------------------------------------------------------------

      local address: LID 0000 QPN 0x0ec1 PSN 0xc25c5e OUT 0x10 RKey 0x0d9351 VAddr 0x007f60c8de0000

      GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61

      remote address: LID 0000 QPN 0x08cc PSN 0x9f5907 OUT 0x10 RKey 0x08da1a VAddr 0x007fdfbb260000

      GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62

      ---------------------------------------------------------------------------------------

      #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

      ^@ 131072     10000           0.354959            0.091025            0.000087

      ---------------------------------------------------------------------------------------

      As you can see, the average bandwidth is only about 0.09 Gb/s (roughly 91 Mb/s), which is clearly wrong for this hardware. While looking into possible causes, I found that the 'rx_discards_phy' counter keeps increasing while the test is running, as the repeated ethtool samples below show.

      # ethtool -S enp3s0f1 | grep discard

           rx_discards_phy: 19459329

           tx_discards_phy: 0

      # ethtool -S enp3s0f1 | grep discard

           rx_discards_phy: 19493876

           tx_discards_phy: 0

      # ethtool -S enp3s0f1 | grep discard

           rx_discards_phy: 19517948

           tx_discards_phy: 0

      # ethtool -S enp3s0f1 | grep discard

           rx_discards_phy: 19524980

           tx_discards_phy: 0

      # ethtool -S enp3s0f1 | grep discard

           rx_discards_phy: 19660462

           tx_discards_phy: 0

      # ethtool -S enp3s0f1 | grep discard

           rx_discards_phy: 19715074

           tx_discards_phy: 0
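
      A simpler way to sample this counter continuously (same interface, one-second interval) is:

      # watch -n 1 'ethtool -S enp3s0f1 | grep discards_phy'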

      From what I learned in another post, Understanding mlx5 ethtool Counters, this looks like the receive side constantly dropping packets because it runs out of port receive buffers. That post describes rx_discards_phy as:

      "The number of received packets dropped due to lack of buffers on a physical port. If this counter is increasing, it implies that the adapter is congested and cannot absorb the traffic coming from the network."

      I don't know how to dig further or how to solve this starting from here. I also tried increasing the NIC RX ring buffer with ethtool, but it did not help much.
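
      For reference, the ring-buffer change was along these lines (the ring size 8192 is only an illustration; the real limit is whatever 'ethtool -g' reports as the pre-set maximum):

      # ethtool -g enp3s0f1            # show current and maximum RX/TX ring sizes

      # ethtool -G enp3s0f1 rx 8192    # raise the RX ring towards its maximum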

      Another interesting point is that both ib_send_bw and ib_write_bw reach the expected bandwidth (about 92 Gb/s) and do not show this issue:

      # ib_send_bw -d mlx5_1 -i 1 -s 131072 -n 100000 -F --report_gbits 10.252.4.62

      ---------------------------------------------------------------------------------------

                          Send BW Test

      Dual-port       : OFF Device         : mlx5_1

      Number of qps   : 1 Transport type : IB

      Connection type : RC Using SRQ      : OFF

      TX depth        : 128

      CQ Moderation   : 100

      Mtu             : 1024[B]

      Link type       : Ethernet

      GID index       : 3

      Max inline data : 0[B]

      rdma_cm QPs : OFF

      Data ex. method : Ethernet

      ---------------------------------------------------------------------------------------

      local address: LID 0000 QPN 0x0ec4 PSN 0xb420ab

      GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61

      remote address: LID 0000 QPN 0x08cf PSN 0x98b61c

      GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62

      ---------------------------------------------------------------------------------------

      #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

      131072     100000           0.00               92.16     0.087890

      ---------------------------------------------------------------------------------------

      # ib_write_bw -d mlx5_1 -i 1 -s 131072 -n 100000 -F --report_gbits 10.252.4.62

      ---------------------------------------------------------------------------------------

                          RDMA_Write BW Test

      Dual-port       : OFF Device         : mlx5_1

      Number of qps   : 1 Transport type : IB

      Connection type : RC Using SRQ      : OFF

      TX depth        : 128

      CQ Moderation   : 100

      Mtu             : 1024[B]

      Link type       : Ethernet

      GID index       : 3

      Max inline data : 0[B]

      rdma_cm QPs : OFF

      Data ex. method : Ethernet

      ---------------------------------------------------------------------------------------

      local address: LID 0000 QPN 0x0ec5 PSN 0x31b13f RKey 0x0dbfc3 VAddr 0x007f59856a0000

      GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:61

      remote address: LID 0000 QPN 0x08d0 PSN 0x25cb57 RKey 0x091496 VAddr 0x007fd7f8e20000

      GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:252:04:62

      ---------------------------------------------------------------------------------------

      #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

      131072     100000           0.00               92.57     0.088281

      ---------------------------------------------------------------------------------------

      Does anyone have any clues on what might be causing the problem? Any suggestions are appreciated!