Getting Started with Socket Direct ConnectX-5 Adapters on a RoCE Network

Version 8

    This post describes the basic procedures for setting up Mellanox Socket Direct™ adapters on an Ethernet network with RoCE.

     

    Overview

    Mellanox Socket Direct™ is a unique form-factor network adapter offered as two PCIe cards, with the PCIe lanes split between the two cards. A key benefit this adapter brings to multi-socket servers is that it eliminates network traffic traversing the internal bus between the sockets, significantly reducing overhead and latency. The figures below show the Socket Direct adapter (component and print sides), which cost-effectively integrates a single network adapter silicon on a primary board, together with an auxiliary board and a harness connecting the two.

     

     

    A Socket Direct adapter uses one NIC ASIC that interfaces to the host over two PCIe x8 interfaces, and to the network over two shared physical ports.

    The idea is to run ~50Gb/s on each PCIe device, so two traffic flows (one on each device) together give 100Gb/s.

    Each device is limited by PCIe x8 bandwidth.
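
    As a rough sanity check of the ~50Gb/s figure (the numbers below are standard PCIe Gen3 parameters, not values taken from the adapter documentation): each x8 slot runs 8 lanes at 8 GT/s with 128b/130b encoding, giving about 63Gb/s raw, and after PCIe protocol overhead roughly 50+Gb/s of payload bandwidth remain per card.

    # awk 'BEGIN { printf "%.1f Gb/s raw per PCIe Gen3 x8 slot\n", 8 * 8 * 128/130 }'

    63.0 Gb/s raw per PCIe Gen3 x8 slot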

     

    Below are illustrations of a standard card configuration (left) and a Socket Direct card configuration (right). Notice that in a standard configuration, only one CPU (socket) connects to the PCIe bus and on to the adapter card, while the other CPU uses an inter-CPU communication bus (QPI) to send and receive network data. In contrast, the Socket Direct adapter gives each CPU socket its own PCIe interface, and the NIC ASIC handles network traffic for both CPUs, bypassing the inter-CPU communication bus.

    Std_and_Socket_Direct.png

     

    Installation in Pictures

    The following figure shows the primary board with the NIC ASIC (covered by a heatsink) and its PCIe x8 goldfingers.

     

     

    The following figure shows the auxiliary board (print side) and its PCIe x8 goldfingers.

     

     

    The figures below show the two PCIe cards inserted into two separate PCIe x8 slots of a server, and connected to each other using the special harness.

    Configuration

    1. Install the latest MLNX_OFED; see HowTo Install MLNX_OFED Driver.
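
    A minimal installation sketch, assuming the MLNX_OFED tarball for your distribution has already been downloaded and extracted (the directory name below is a placeholder; see the linked post for the full procedure and options):

    # cd MLNX_OFED_LINUX-<version>-<distro>-x86_64

    # ./mlnxofedinstall

    # /etc/init.d/openibd restart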

     

    2. For the dual-port ConnectX-5 Socket Direct adapter, you should see four PCI devices, two on each of the two PCI slots used. Use ibdev2netdev to display the installed Mellanox ConnectX-5 Socket Direct adapter and the mapping of logical ports to physical ports.

    # ibdev2netdev -v | grep -i MCX556M-ECAT-S25

    0000:84:00.0 mlx5_10 (MT4119 - MCX556M-ECAT-S25SN) CX556M - ConnectX-5 QSFP28 fw 16.22.0228 port 1 (DOWN  ) ==> p2p1 (Down)

    0000:84:00.1 mlx5_11 (MT4119 - MCX556M-ECAT-S25SN) CX556M - ConnectX-5 QSFP28 fw 16.22.0228 port 1 (DOWN  ) ==> p2p2 (Down)

    0000:05:00.0 mlx5_2 (MT4119 - MCX556M-ECAT-S25SN) CX556M - ConnectX-5 QSFP28 fw 16.22.0228 port 1 (DOWN  ) ==> p5p1 (Down)

    0000:05:00.1 mlx5_3 (MT4119 - MCX556M-ECAT-S25SN) CX556M - ConnectX-5 QSFP28 fw 16.22.0228 port 1 (DOWN  ) ==> p5p2 (Down)

    Note that each PCI card of the ConnectX-5 Socket Direct adapter has a different PCI address. In the example output above, the first two rows indicate that one card is installed in a PCI slot with PCI Bus address 84 (hexadecimal), PCI Device Number 00, and PCI Function Numbers 0 and 1. RoCE assigned mlx5_10 as the logical port, which is the same as netdevice p2p1, and both are mapped to the physical port of PCI function 0000:84:00.0.

    Note also that RoCE logical port mlx5_2 of the second PCI card (PCI Bus address 05) and netdevice p5p1 are mapped to the physical port of PCI function 0000:05:00.0, which is the same physical port as that of PCI function 0000:84:00.0.

    MT4119 is the PCI device ID of the Mellanox ConnectX-5 adapter family.
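
    You can cross-check the PCI enumeration with lspci (15b3 is the Mellanox PCI vendor ID); a Socket Direct adapter should show two PCI functions on each of the two slots, matching the four addresses above:

    # lspci -d 15b3: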

     

    3. Check which NUMA node each netdevice is connected to.

    Method 1: Check by interface name

    # cat /sys/class/net/p2p1/device/numa_node

    0

    # cat /sys/class/net/p5p1/device/numa_node

    1

    Method 2: If OFED is installed, the following command shows the NUMA node per interface.

    # mst status -v

    Note: If you get a negative NUMA value, see the Troubleshooting section below.
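
    To check all four example netdevices at once (the names are the ones used in this post; adjust to your system), a small shell loop over sysfs works as well:

    # for dev in p2p1 p2p2 p5p1 p5p2; do echo -n "$dev: NUMA node "; cat /sys/class/net/$dev/device/numa_node; done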

     

    4. Make sure that each of the two PCIe slots used for the Socket Direct cards is x8 wide and runs at 8GT/s.

    To check this, you can use the lspci or mlnx_tune commands; a sample lspci check is shown after the mlnx_tune output below.

    # mlnx_tune

     

    ...

     

     

    ConnectX-5 Device Status on PCI 81:00.0

    FW version 16.22.0228

    OK: PCI Width x8

    OK: PCI Speed 8GT/s

    PCI Max Payload Size 512

    PCI Max Read Request 512

    Local CPUs list [16, 17, 18, 19, 20, 21, 22, 23, 80, 81, 82, 83, 84, 85, 86, 87]

     

    ...

     

    ConnectX-5 Device Status on PCI 05:00.0

    FW version 16.22.0228

    OK: PCI Width x8

    OK: PCI Speed 8GT/s

    PCI Max Payload Size 512

    PCI Max Read Request 512

    Local CPUs list [48, 49, 50, 51, 52, 53, 54, 55, 112, 113, 114, 115, 116, 117, 118, 119]
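
    As an alternative to mlnx_tune, a quick lspci check (run as root; the PCI addresses are the example ones from step 2, and LnkSta should report Speed 8GT/s and Width x8):

    # lspci -s 0000:84:00.0 -vv | grep -E "LnkCap:|LnkSta:"

    # lspci -s 0000:05:00.0 -vv | grep -E "LnkCap:|LnkSta:"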

     

     

    5. Depending on your desired network profile (lossless or lossy, with or without QoS), perform additional configuration steps based on Recommended Network Configuration Examples for RoCE Deployment.

     

    Verification

    1. Assign or obtain IP addresses for the four netdevices. The following commands assign the example addresses used in this post (an iproute2 equivalent is shown after them).

    # ifconfig p2p1 2.1.203.1

    # ifconfig p2p2 3.1.203.1

    # ifconfig p5p1 1.1.203.1

    # ifconfig p5p2 4.1.203.1
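
    The same assignment with the iproute2 ip tool (the /8 prefixes match the default class-A netmask shown in the ifconfig output of step 2; adjust the addresses to your network):

    # ip addr add 2.1.203.1/8 dev p2p1

    # ip addr add 3.1.203.1/8 dev p2p2

    # ip addr add 1.1.203.1/8 dev p5p1

    # ip addr add 4.1.203.1/8 dev p5p2

    # ip link set p2p1 up; ip link set p2p2 up; ip link set p5p1 up; ip link set p5p2 up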

    2. Verify the IP assignment and show the MAC addresses using ifconfig.

    # ifconfig

     

    lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536

            inet 127.0.0.1  netmask 255.0.0.0

            inet6 ::1  prefixlen 128  scopeid 0x10<host>

            loop  txqueuelen 1  (Local Loopback)

            RX packets 48  bytes 3564 (3.4 KiB)

            RX errors 0  dropped 0  overruns 0  frame 0

            TX packets 48  bytes 3564 (3.4 KiB)

            TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

     

    p2p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

            inet 2.1.203.1  netmask 255.0.0.0  broadcast 2.255.255.255

            inet6 fe80::268a:7ff:fe9d:4624  prefixlen 64  scopeid 0x20<link>

            ether 24:8a:07:9d:46:24  txqueuelen 1000  (Ethernet)                (MAC ADDRESS)

            RX packets 745184235  bytes 1084195232540 (1009.7 GiB)

            RX errors 0  dropped 33118  overruns 0  frame 0

            TX packets 21605713  bytes 1428417776 (1.3 GiB)

            TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

    ...

    p5p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

            inet 1.1.203.1  netmask 255.0.0.0  broadcast 1.255.255.255

            inet6 fe80::268a:7ff:fe9d:4628  prefixlen 64  scopeid 0x20<link>

            ether 24:8a:07:9d:46:28  txqueuelen 1000  (Ethernet)             (MAC ADDRESS – xxxx28 instead of xxxx24)

            RX packets 788009522  bytes 1144066044824 (1.0 TiB)

            RX errors 0  dropped 6431  overruns 0  frame 0

            TX packets 21167209  bytes 1400390004 (1.3 GiB)

            TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

    ...

     

    3. Verify that the link is up and at the maximum speed rate.

    # ibstat mlx5_10

    CA 'mlx5_10'

            CA type: MT4119

            Number of ports: 1

            Firmware version: 16.22.0228

            Hardware version: 0

            Node GUID: 0x248a0703009d4624

            System image GUID: 0x248a0703009d4624

            Port 1:

                    State: Active

                    Physical state: LinkUp

                    Rate: 100

                    Base lid: 0

                    LMC: 0

                    SM lid: 0

                    Capability mask: 0x04010000

                    Port GUID: 0x268a07fffe9d4624

                    Link layer: Ethernet

    ...
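
    You can also confirm the link state and negotiated speed per netdevice with ethtool (p2p1 is the example interface from above; a 100GbE link should report a speed of 100000Mb/s and "Link detected: yes"):

    # ethtool p2p1 | grep -E "Speed|Link detected"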

    Running RoCE Example

    This section shows how to set up a host server with a ConnectX-5 Socket Direct adapter and a client server with a standard ConnectX-5 100GbE adapter, and to run ib_send_bw and ib_write_bw traffic between them. The message size is 64KB.

     

    1. Assume the following setup:

    • A dual-socket host server with a ConnectX-5 Socket Direct adapter (two PCIe x8 boards), using the IP addresses from the example above for its netdevices
    • A client server in the same subnet as the host server, with a standard ConnectX-5 PCIe x16 adapter running 100GbE
    • A cable connecting a port of the Socket Direct card on the host server to the network port of the client server

     

    2. Set up the host server side such that one netdevice on each PCIe card (different sockets: "-c 5" and "-c 8") listens for traffic on a different port ("-p 8005" and "-p 7005"). Using the netdevices above:

    taskset -c 5 ib_send_bw -d mlx5_10 -D 66 -p 8005 &

    taskset -c 8 ib_send_bw -d mlx5_2 -D 66 -p 7005 &

    3. Connect to the client (via ssh) and set it up to send traffic from its device (say mlx5_7) to both netdevices on the host over the two separate ports ("-p 8005" and "-p 7005"). The IP addresses are the ones assigned in the example above.

    # sleep 3; ssh <client>

    # taskset -c 7 ib_send_bw -d mlx5_7 -D 66 --report_gbits -p 8005 2.1.203.1 &

    # taskset -c 5 ib_send_bw -d mlx5_7 -D 66 --report_gbits -p 7005 1.1.203.1 &

    # ---------------------------------------------------------------------------------------

                        Send BW Test

    Dual-port       : OFF          Device         : mlx5_10

    Number of qps   : 1            Transport type : IB

    Connection type : RC           Using SRQ      : OFF

    TX depth        : 128

    CQ Moderation   : 100

    Mtu             : 1024[B]

    Link type       : Ethernet

    GID index       : 3

    Max inline data : 0[B]

    rdma_cm QPs     : OFF

    Data ex. method : Ethernet

    ---------------------------------------------------------------------------------------

    local address: LID 0000 QPN 0x04ba PSN 0xa8c234

    GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:44:01

    remote address: LID 0000 QPN 0x00bb PSN 0xd65141

    GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:45:01

    ---------------------------------------------------------------------------------------

                        Send BW Test

    Dual-port       : OFF          Device         : mlx5_2

    Number of qps   : 1            Transport type : IB

    Connection type : RC           Using SRQ      : OFF

    TX depth        : 128

    CQ Moderation   : 100

    Mtu             : 1024[B]

    Link type       : Ethernet

    GID index       : 3

    Max inline data : 0[B]

    rdma_cm QPs     : OFF

    Data ex. method : Ethernet

    ---------------------------------------------------------------------------------------

    local address: LID 0000 QPN 0x00ba PSN 0xf5fec3

    GID: 00:00:00:00:00:00:00:00:00:00:255:255:02:01:44:01

    remote address: LID 0000 QPN 0x00ba PSN 0x238dd0

    GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:45:01

    ---------------------------------------------------------------------------------------

    ---------------------------------------------------------------------------------------

    #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

    #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

    65536      3003100          0.00               46.42              0.088538

    ---------------------------------------------------------------------------------------

    65536      3001100          0.00               46.39              0.088479

    ---------------------------------------------------------------------------------------

    4. To run ib_write_bw, replace instances of ib_send_bw with ib_write_bw in steps 2 and 3 above.
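
    For example, the server side of step 2 becomes (same cores, devices, duration, and ports as above):

    taskset -c 5 ib_write_bw -d mlx5_10 -D 66 -p 8005 &

    taskset -c 8 ib_write_bw -d mlx5_2 -D 66 -p 7005 &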

     

    Running MPI

    Here is an example script for running HPC-X MPI over the ConnectX-5 Socket Direct adapter.

    • Make sure the two ports are defined (the HCAS list in the script below)
    • Use MXM_IB_MAP_MODE=round-robin
    • Note that each hostname appears twice, as we run two processes (one per device) on each host

    #!/bin/bash

     

    module use /opt/hpcx-v1.9.5-MLNX_OFED_LINUX-4.1-1.0.2.0-redhat7.2-x86_64/modulefiles

    module load hpcx

    HCAS="mlx5_10:1,mlx5_2:1"

     

    FLAGS="--host venus001,venus001,venus002,venus002 "

    FLAGS+="-mca btl_openib_warn_default_gid_prefix 0 "

    FLAGS+="-mca btl_openib_warn_no_device_params_found 0 "

    FLAGS+="--report-bindings --allow-run-as-root -bind-to core "

    FLAGS+="-mca coll_fca_enable 0 -mca coll_hcoll_enable 0 "

    FLAGS+="-mca pml yalla -mca mtl_mxm_np 0 -x MXM_TLS=ud,shm,self -x MXM_RDMA_PORTS=$HCAS "

    FLAGS+="-x MXM_LOG_LEVEL=ERROR -x MXM_IB_PORTS=$HCAS "

    FLAGS+="-x MXM_IB_MAP_MODE=round-robin -x MXM_IB_USE_GRH=y "

     

    mpirun -np 4 $FLAGS /opt/openmpi/osu-micro-benchmarks-5.3.2/install/osu_mbw_mr

    The result below shows a maximum bandwidth of ~92Gb/s (11447MB/s x 8 bits/byte = ~92Gb/s).

    # [ pairs: 2 ] [ window size: 64 ]
    # Size MB/s Messages/s
    1 5.78 5784787.82
    2 11.66 5832317.23
    4 23.44 5859021.86
    8 46.61 5826188.86
    16 92.46 5778794.08
    32 173.49 5421443.08
    64 352.42 5506600.82
    128 657.56 5137190.07
    256 1218.55 4759948.01
    512 2206.26 4309097.63
    1024 3872.19 3781436.51
    2048 8009.95 3911109.64
    4096 9815.54 2396372.34
    8192 10488.37 1280318.80
    16384 10904.51 665558.62
    32768 11097.61 338672.12
    65536 11321.88 172758.20
    131072 11378.61 86811.88
    262144 11401.47 43493.14
    524288 11432.23 21805.25
    1048576 11438.51 10908.61
    2097152 11443.92 5456.89
    4194304 11446.54 2729.07

     

    Troubleshooting

    • If you get a negative NUMA node value, please contact Mellanox Technologies for suggested workarounds. In most cases this is due to an outdated BIOS version, and a BIOS update will likely resolve the issue.