Getting Started with Socket Direct ConnectX-5 Adapters (InfiniBand)

Version 9

    This post describes the basic procedures for Socket Direct Adapters when using InfiniBand networks.

     


    Overview

    The Socket Direct adapter uses a single ASIC split across two PCIe x8 boards, connected to each other by a harness cable.

    The idea is to run ~50Gb/s on each PCIe device (mlx5_0, mlx5_1 ...), so that two traffic flows (one on each device) together deliver 100Gb/s.

    Each device is limited by the PCIe x8 bandwidth.


    Configuration

    1. Before you start, make sure that the BIOS is configured for maximum performance and that SR-IOV is enabled.

     

    2. OS considerations: Ubuntu 17.04 (kernel 4.10) or above is recommended. RHEL 7.4 requires a kernel upgrade to 4.10 or later.

     

    3. Set iommu=pt on the kernel command line in the GRUB configuration.

    For example:

    BOOT_IMAGE=/boot/vmlinuz-4.13.0-16-generic.efi.signed root=UUID=91e6c1c7-271e-405b-a498-ff799fd0fa05 ro crashkernel=auto rhgb console=tty0 console=ttyS0,115200 iommu=pt
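
    To make this setting persistent across reboots, one common approach (assuming a GRUB-based system; the file location and update command vary by distribution) is to append iommu=pt to GRUB_CMDLINE_LINUX in /etc/default/grub and then regenerate the GRUB configuration:

    # update-grub                                      (Ubuntu)

    # grub2-mkconfig -o /boot/grub2/grub.cfg           (RHEL)

    Reboot and check /proc/cmdline to confirm that iommu=pt is present.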

     

    4. Install the latest MLNX_OFED; see HowTo Install MLNX_OFED Driver.
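
    A typical installation flow looks like the following sketch (the ISO name is a placeholder; use the package that matches your OS and follow the HowTo above for details):

    # mount -o ro,loop MLNX_OFED_LINUX-<version>-x86_64.iso /mnt

    # /mnt/mlnxofedinstall

    # /etc/init.d/openibd restart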

     

    5. Virtualization support must be enabled on the SM. Edit the configuration file, creating it first if it does not exist:

    # opensm --create-config /etc/opensm/opensm.conf

    -------------------------------------------------

    OpenSM 4.9.0.MLNX20170607.280b8f7

    Command Line Arguments:

    Creating config file template '/etc/opensm/opensm.conf'.

    Log File: /var/log/opensm.log

    -------------------------------------------------

     

    6. Edit /etc/opensm/opensm.conf, change the value of virt_enabled to 2, and save the file.

     

    7. Restart the SM.
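
    For example, a minimal sketch of steps 6 and 7 from the shell (this assumes the generated template already contains a virt_enabled line and that OpenSM runs as the opensmd service; adjust to however the SM is started in your fabric):

    # sed -i 's/^virt_enabled.*/virt_enabled 2/' /etc/opensm/opensm.conf

    # /etc/init.d/opensmd restart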

     

    Verification

    1. If you have a dual-port Socket Direct adapter, you should see four PCI devices, two physical and two logical, on the two PCI slots used.

    # lspci | grep Mellanox

    21:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

    21:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

    61:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

    61:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

     

    2. The mst status output supplies the mapping of each device to its NUMA node. In this example, the two PCIe slots are mapped to different NUMA nodes (2 and 6).

    # mst status -v

    MST modules:

    ------------

        MST PCI module is not loaded

        MST PCI configuration module loaded

    PCI devices:

    ------------

    DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA 

    ConnectX5(rev:0)        /dev/mst/mt4119_pciconf1.1    61:00.1   mlx5_3          net-ib3                   6    

    ConnectX5(rev:0)        /dev/mst/mt4119_pciconf1      61:00.0   mlx5_2          net-ib2                   6    

    ConnectX5(rev:0)        /dev/mst/mt4119_pciconf0.1    21:00.1   mlx5_1          net-ib1                   2    

    ConnectX5(rev:0)        /dev/mst/mt4119_pciconf0      21:00.0   mlx5_0          net-ib0                   2    
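
    You can also read the NUMA node of each PCI function directly from sysfs (the bus addresses below are taken from the example above; adjust them to your system). The values should match the NUMA column reported by mst status:

    # cat /sys/bus/pci/devices/0000:21:00.0/numa_node

    # cat /sys/bus/pci/devices/0000:61:00.0/numa_node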

     

    3. The ibstat output should show State Active and Physical state LinkUp on two devices for each connected port. Each port gets a different base LID from the SM.

    # ibstat

    CA 'mlx5_0'

    CA type: MT4119

    Number of ports: 1

    Firmware version: 16.20.1010

    Hardware version: 0

    Node GUID: 0xec0d9a03002fae7e

    System image GUID: 0xec0d9a03002fae7e

    Port 1:

    State: Active

    Physical state: LinkUp

    Rate: 100

    Base lid: 1

    LMC: 0

    SM lid: 1

    Capability mask: 0x2651e84a

    Port GUID: 0xec0d9a03002fae7e

    Link layer: InfiniBand

     

    ...

     

    CA 'mlx5_2'

    CA type: MT4119

    Number of ports: 1

    Firmware version: 16.20.1010

    Hardware version: 0

    Node GUID: 0xec0d9a03002fae82

    System image GUID: 0xec0d9a03002fae7e

    Port 1:

    State: Active

    Physical state: LinkUp

    Rate: 100

    Base lid: 3

    LMC: 0

    SM lid: 1

    Capability mask: 0x2641e848

    Port GUID: 0xec0d9a03002fae82

    Link layer: InfiniBand

     

    4. Make sure that both PCIe slots used by the card run at width x8 and speed 8GT/s.

    To check this, you can use the lspci or mlnx_tune commands.

     

    # mlnx_tune

     

    ...

     

     

    ConnectX-5 Device Status on PCI 21:00.0

    FW version 16.20.1010

    OK: PCI Width x8

    OK: PCI Speed 8GT/s

    PCI Max Payload Size 512

    PCI Max Read Request 512

    Local CPUs list [16, 17, 18, 19, 20, 21, 22, 23, 80, 81, 82, 83, 84, 85, 86, 87]

     

    ...

     

    ConnectX-5 Device Status on PCI 61:00.0

    FW version 16.20.1010

    OK: PCI Width x8

    OK: PCI Speed 8GT/s

    PCI Max Payload Size 512

    PCI Max Read Request 512

    Local CPUs list [48, 49, 50, 51, 52, 53, 54, 55, 112, 113, 114, 115, 116, 117, 118, 119]
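
    A quick way to check the same link parameters directly with lspci (the bus addresses are taken from the example above; run as root so the link status fields are visible):

    # lspci -s 21:00.0 -vv | grep -E 'LnkCap|LnkSta'

    # lspci -s 61:00.0 -vv | grep -E 'LnkCap|LnkSta'

    Both slots should report Speed 8GT/s and Width x8.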

     

    MPI Considerations

    Here is an example script for running HPC-X MPI over the ConnectX-5 Socket Direct adapter.

    • Make sure that both ports (one per PCIe device) are defined.
    • Use MXM_IB_MAP_MODE=round-robin.

    • Note that each hostname appears twice, as we run two processes (one per device) on each host.

    #!/bin/bash

     

    module use /opt/hpcx-v1.9.5-MLNX_OFED_LINUX-4.1-1.0.2.0-redhat7.2-x86_64/modulefiles

    module load hpcx

    # One HCA per PCIe device of the Socket Direct adapter (port 1 on each)
    HCAS="mlx5_0:1,mlx5_2:1"

     

    FLAGS="--host venus001,venus001,venus002,venus002 "

    FLAGS+="-mca btl_openib_warn_default_gid_prefix 0 "

    FLAGS+="-mca btl_openib_warn_no_device_params_found 0 "

    FLAGS+="--report-bindings --allow-run-as-root -bind-to core "

    FLAGS+="-mca coll_fca_enable 0 -mca coll_hcoll_enable 0 "

    FLAGS+="-mca pml yalla -mca mtl_mxm_np 0 -x MXM_TLS=ud,shm,self -x MXM_RDMA_PORTS=$HCAS "

    FLAGS+="-x MXM_LOG_LEVEL=ERROR -x MXM_IB_PORTS=$HCAS "

    FLAGS+="-x MXM_IB_MAP_MODE=round-robin -x MXM_IB_USE_GRH=y "

     

    mpirun -np 4 $FLAGS /opt/openmpi/osu-micro-benchmarks-5.3.2/install/osu_mbw_mr

     

    The expected aggregate MPI bandwidth is close to 100Gb/s.

     

     

    Troubleshooting

    1. If you get an IO_PAGE_FAULT error in dmesg after boot, make sure that SR-IOV is enabled in the BIOS and that iommu=pt is set on the kernel command line (GRUB). See the example above.

    venus003: [  384.377165] mlx5_core 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xffffffffbb000000 flags=0x0030]
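
    To confirm that the running kernel actually picked up the parameter, check the current command line and verify that iommu=pt appears:

    # cat /proc/cmdline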