HowTo Configure SR-IOV for ConnectX-4/ConnectX-5 with KVM (Ethernet)

Version 23

    This post shows how to configure SR-IOV (Ethernet) with the Mellanox ConnectX-4/ConnectX-5 driver.

    Setting up a VM via KVM (virt-manager) is outside the scope of this post; refer to the virt-manager documentation.

     

     


     

    Overview

    SR-IOV configuration includes the following steps:

    1. Enable Virtualization (SR-IOV) in the BIOS (prerequisites)

    2. Enable SR-IOV in the firmware

    3. Enable SR-IOV in the MLNX_OFED Driver

    4. Set up the VM

     

    Setup and Prerequisites

    1. Two servers connected via an Ethernet switch

     

    2. KVM is installed on the servers

    # yum install kvm

    # yum install virt-manager libvirt libvirt-python python-virtinst

     

    3. Make sure that SR-IOV is enabled in the BIOS of the specific server. Each server has different BIOS configuration options for virtualization. See HowTo Set Dell PowerEdge R730 BIOS parameters to support SR-IOV for BIOS configuration examples.

     

    4. Make sure that intel_iommu=on is added to /boot/grub/grub.conf

    # cat /boot/grub/grub.conf

     

    default=0

    timeout=5

    splashimage=(hd0,0)/grub/splash.xpm.gz

    hiddenmenu

    title Red Hat Enterprise Linux (2.6.32-358.el6.x86_64)

      root (hd0,0)

      kernel /vmlinuz-2.6.32-358.el6.x86_64 ro root=UUID=4f9ed446-05fe-4db5-a079-56738f4ae05f rd_NO_LUKS  KEYBOARDTYPE=pc KEYTABLE=us LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_NO_LVM crashkernel=auto rhgb quiet rd_NO_DM rhgb quiet intel_iommu=on iommu=pt

      initrd /initramfs-2.6.32-358.el6.x86_64.img

     

    or to /boot/grub2/grub.cfg (depending on whether the system boots with GRUB legacy or GRUB 2).

     

    # cat /boot/grub2/grub.cfg

     

    ...

     

    menuentry 'CentOS Linux (3.10.0-229.11.1.el7.x86_64) 7 (Core)' --class rhel fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-229.el7.x86_64-advanced-7837218d-e353-4524-9141-782727d2f8ca' {

            load_video

            set gfxpayload=keep

            insmod gzio

            insmod part_msdos

            insmod ext2

            set root='hd0,msdos1'

            if [ x$feature_platform_search_hint = xy ]; then

              search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1'  c4e661a5-3f11-49a6-9a6b-be5a8e8e9881

            else

              search --no-floppy --fs-uuid --set=root c4e661a5-3f11-49a6-9a6b-be5a8e8e9881

            fi

            linux16 /vmlinuz-3.10.0-229.11.1.el7.x86_64 root=UUID=7837218d-e353-4524-9141-782727d2f8ca ro crashkernel=auto rhgb quiet LANG=en_US.UTF-8 systemd.debug intel_iommu=on

            initrd16 /initramfs-3.10.0-229.11.1.el7.x86_64.img

    }

     

    To learn more about iommu grub parameters refer to Understanding the iommu Linux grub File Configuration.
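
    As a quick check of the prerequisite above, the following sketch tests whether a kernel command line contains intel_iommu=on. The function name has_intel_iommu is an illustrative assumption, not part of MLNX_OFED or any other tool.

```shell
# Return 0 if the kernel command line passed as $1 contains the
# intel_iommu=on parameter, 1 otherwise.
# has_intel_iommu is a hypothetical helper name used only in this sketch.
has_intel_iommu() {
    # Pad with spaces so the parameter is matched as a whole word.
    case " $1 " in
        *" intel_iommu=on "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use on a running host (reads the live kernel command line):
#   has_intel_iommu "$(cat /proc/cmdline)" && echo "intel_iommu=on is set"
```

    Checking /proc/cmdline rather than the grub file confirms that the parameter reached the running kernel, which is only the case after a reboot.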

     

    5. Install the latest MLNX_OFED driver on the server and on the VM.

    # mlnxofedinstall
    ...

     

    Configuration

    I. Enable SR-IOV on the Firmware

     

    1. Start the MST service (part of MFT):

    # mst start

    Starting MST (Mellanox Software Tools) driver set

    Loading MST PCI module - Success

    Loading MST PCI configuration module - Success

    Create devices

     

    2. Locate the HCA device on the desired PCI slot.

    # mst status

    MST modules:

    ------------

        MST PCI module loaded

        MST PCI configuration module loaded

     

     

    MST devices:

    ------------

    /dev/mst/mt4103_pciconf0         - PCI configuration cycles access.

                                       domain:bus:dev.fn=0000:81:00.0 addr.reg=88 data.reg=92

                                       Chip revision is: 00

    /dev/mst/mt4103_pci_cr0          - PCI direct access.

                                       domain:bus:dev.fn=0000:81:00.0 bar=0xc8000000 size=0x100000

                                       Chip revision is: 00

    /dev/mst/mt4115_pciconf0         - PCI configuration cycles access.

                                       domain:bus:dev.fn=0000:05:00.0 addr.reg=88 data.reg=92

                                       Chip revision is: 00

     

    3. Query the status of the device:

    #  mlxconfig -d /dev/mst/mt4115_pciconf0 q

     

    Device #1:

    ----------

     

     

    Device type:    ConnectX4      

    PCI device:     /dev/mst/mt4115_pciconf0

     

     

    Configurations:                              Current

             SRIOV_EN                            0              

             NUM_OF_VFS                          0              

             LINK_TYPE_P1                        2              

             LINK_TYPE_P2                        2              

             INT_LOG_MAX_PAYLOAD_SIZE            0              

             LOG_DCR_HASH_TABLE_SIZE             14             

             DCR_LIFO_SIZE                       16384          

             ROCE_NEXT_PROTOCOL                  254            

             ROCE_CC_ALGORITHM_P1                0              

             ROCE_CC_PRIO_MASK_P1                0              

             ROCE_CC_ALGORITHM_P2                0              

             ROCE_CC_PRIO_MASK_P2                0              

             CLAMP_TGT_RATE_P1                   0              

             CLAMP_TGT_RATE_AFTER_TIME_INC_P1    1              

             RPG_TIME_RESET_P1                   5000           

             RPG_BYTE_RESET_P1                   150            

             RPG_THRESHOLD_P1                    5              

             RPG_MAX_RATE_P1                     0              

             RPG_AI_RATE_P1                      10             

             RPG_HAI_RATE_P1                     50             

             RPG_GD_P1                           7              

             RPG_MIN_DEC_FAC_P1                  50             

             RPG_MIN_RATE_P1                     1              

             RATE_TO_SET_ON_FIRST_CNP_P1         0              

             DCE_TCP_G_P1                        64             

             DCE_TCP_RTT_P1                      2              

             RATE_REDUCE_MONITOR_PERIOD_P1       2              

             INITIAL_ALPHA_VALUE_P1              3              

             MIN_TIME_BETWEEN_CNPS_P1            0              

             CNP_DSCP_P1                         0              

             CNP_802P_PRIO_P1                    7              

             CLAMP_TGT_RATE_P2                   0              

             CLAMP_TGT_RATE_AFTER_TIME_INC_P2    1              

             RPG_TIME_RESET_P2                   5000           

             RPG_BYTE_RESET_P2                   150            

             RPG_THRESHOLD_P2                    5              

             RPG_MAX_RATE_P2                     0              

             RPG_AI_RATE_P2                      10             

             RPG_HAI_RATE_P2                     50             

             RPG_GD_P2                           7              

             RPG_MIN_DEC_FAC_P2                  50             

             RPG_MIN_RATE_P2                     1              

             RATE_TO_SET_ON_FIRST_CNP_P2         0              

             DCE_TCP_G_P2                        64             

             DCE_TCP_RTT_P2                      2              

             RATE_REDUCE_MONITOR_PERIOD_P2       2              

             INITIAL_ALPHA_VALUE_P2              3              

             MIN_TIME_BETWEEN_CNPS_P2            0              

             CNP_DSCP_P2                         0              

             CNP_802P_PRIO_P2                    7      
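
    Because the query output is long, it can help to pull out only the parameters of interest. The sketch below assumes the two-column "name value" layout shown above (other MFT versions may print values differently); mlxconfig_value is a hypothetical helper name.

```shell
# Print the current value of one parameter from "mlxconfig ... q" output
# read on standard input. Assumes the layout shown in this post, where
# the parameter name is the first field and its value is the second.
# mlxconfig_value is a hypothetical helper name used only in this sketch.
mlxconfig_value() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Typical use:
#   mlxconfig -d /dev/mst/mt4115_pciconf0 q | mlxconfig_value SRIOV_EN
#   mlxconfig -d /dev/mst/mt4115_pciconf0 q | mlxconfig_value NUM_OF_VFS
```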

     

    4. Enable SR-IOV and set the desired number of VFs.

    • SRIOV_EN=1
    • NUM_OF_VFS=4   ; This is an example with 4 VFs

     

     

    # mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=4

     

     

    Device #1:

    ----------

     

     

    Device type:    ConnectX4      

    PCI device:     /dev/mst/mt4115_pciconf0

     

     

    Configurations:                              Current         New

             SRIOV_EN                            0               1              

             NUM_OF_VFS                          0               4              

             LINK_TYPE_P1                        2               2              

             LINK_TYPE_P2                        2               2              

             INT_LOG_MAX_PAYLOAD_SIZE            0               0              

             LOG_DCR_HASH_TABLE_SIZE             14              14             

             DCR_LIFO_SIZE                       16384           16384       

     

    ...

     

    Apply new Configuration? ? (y/n) [n] : y

    Applying... Done!

    -I- Please reboot machine to load new configurations.

     

     

    5. Reboot the server, or reset only the adapter firmware (faster):

    # mlxfwreset --device /dev/mst/mt4115_pciconf0 reset

     

    Minimal reset level for device, /dev/mst/mt4115_pciconf0:

     

    3: Driver restart and PCI reset

    Continue with reset?[y/N] y

    -I- Stopping Driver                         -Done

    -I- Sending Reset Command To Fw             -Done

    -I- Resetting PCI                           -Done

    -I- Starting Driver                         -Done

    -I- Restarting MST                          -Done

    -I- FW was loaded successfully.


     

    Note: At this point, the VFs do not yet appear in the lspci output. They become visible only after SR-IOV is enabled in the MLNX_OFED driver.

    # lspci -D | grep Mellanox

    0000:05:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

    0000:05:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

    0000:81:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

     

     

    II. Enable SR-IOV on the MLNX_OFED Driver

    1. Find the device. In this case, mlx5_1 is active and up on interface ens785f1.

    # ibstat

    ...

    CA 'mlx5_0'

      CA type: MT4115

      Number of ports: 1

      Firmware version: 12.12.0780

      Hardware version: 0

      Node GUID: 0xe41d2d0300f2a488

      System image GUID: 0xe41d2d0300f2a488

      Port 1:

      State: Down

      Physical state: Disabled

      Rate: 0

      Base lid: 0

      LMC: 0

      SM lid: 0

      Capability mask: 0x3c010000

      Port GUID: 0xe61d2dfffef2a488

      Link layer: Ethernet

    CA 'mlx5_1'

      CA type: MT4115

      Number of ports: 1

      Firmware version: 12.12.0780

      Hardware version: 0

      Node GUID: 0xe41d2d0300f2a489

      System image GUID: 0xe41d2d0300f2a488

      Port 1:

      State: Active

      Physical state: LinkUp

      Rate: 40

      Base lid: 0

      LMC: 0

      SM lid: 0

      Capability mask: 0x3c010000

      Port GUID: 0xe61d2dfffef2a489

      Link layer: Ethernet

     

    # ibdev2netdev

    mlx4_0 port 1 ==> ens817 (Up)

    mlx4_0 port 2 ==> ens817d1 (Up)

    mlx5_0 port 1 ==> ens785f0 (Down)

    mlx5_1 port 1 ==> ens785f1 (Up)
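
    When scripting the steps that follow, the interface name for a given RDMA device can be taken from the ibdev2netdev output instead of being hard-coded. This is a minimal sketch that assumes the line format shown above; netdev_of is a hypothetical helper name.

```shell
# Print the network interface(s) that ibdev2netdev maps to the RDMA
# device named in $1. Reads ibdev2netdev output on standard input and
# assumes lines of the form "<device> port <n> ==> <interface> (<state>)".
# netdev_of is a hypothetical helper name used only in this sketch.
netdev_of() {
    awk -v dev="$1" '$1 == dev && $4 == "==>" { print $5 }'
}

# Typical use:
#   ibdev2netdev | netdev_of mlx5_1
```

    A dual-port device such as mlx4_0 in the example yields one line per port.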

     

    2. Get the total VFs that are allowed and configured in the firmware.

     

    # cat /sys/class/net/ens785f1/device/sriov_totalvfs

    4

     

    Note: This is a read-only parameter. It should match the number configured in the firmware by the command above:  "mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=4"

     

    Note: If you do not see this parameter, intel_iommu=on was probably not added to the grub file, as described above.

     

    3. Get the current number of VFs on this device:

     

    There are several ways to do so:

    # cat /sys/class/infiniband/mlx5_1/device/mlx5_num_vfs

    0

     

    # cat /sys/class/net/ens785f1/device/sriov_numvfs

    0

     

    # cat /sys/class/net/ens785f1/device/mlx5_num_vfs

    0

     

    Note: If the command fails, the driver may not be loaded.

    Note: The mlx5_num_vfs file is always present, even if the OS did not load virtualization support (that is, intel_iommu was not added to the grub file). The sriov_numvfs file is available only if intel_iommu was added to the grub file. If you do not see the sriov_numvfs file, verify that intel_iommu was added to the grub file as described above.

    Note: Not all of the options above are available on every kernel version.
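
    Since not every kernel exposes every file, a script can try them in order. The sketch below reads the VF count from a PF device directory, preferring sriov_numvfs and falling back to mlx5_num_vfs; current_num_vfs is a hypothetical helper name.

```shell
# Print the current VF count found under the PF device directory $1.
# Tries the generic sriov_numvfs file first, then the driver-specific
# mlx5_num_vfs file. Returns non-zero if neither file is readable.
# current_num_vfs is a hypothetical helper name used only in this sketch.
current_num_vfs() {
    dev_dir=$1
    for name in sriov_numvfs mlx5_num_vfs; do
        if [ -r "$dev_dir/$name" ]; then
            cat "$dev_dir/$name"
            return 0
        fi
    done
    echo "no VF count file under $dev_dir (is the driver loaded?)" >&2
    return 1
}

# Typical use:
#   current_num_vfs /sys/class/net/ens785f1/device
```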

     

    4. Set the desired number of VFs:

        The number of VFs can be set via either of two parameters, depending on the kernel version.

        There are several ways to do so:

    # option 1:

    # echo 4 > /sys/class/infiniband/mlx5_1/device/mlx5_num_vfs

    # cat /sys/class/infiniband/mlx5_1/device/mlx5_num_vfs

    4

     

    # option 2:

    # echo 4 > /sys/class/net/ens785f1/device/sriov_numvfs

    # cat /sys/class/net/ens785f1/device/sriov_numvfs

    4

     

    # option 3:

     

    # echo 4 > /sys/class/net/ens785f1/device/mlx5_num_vfs

    # cat /sys/class/net/ens785f1/device/mlx5_num_vfs

    4

     

    Note: Changing the number of VFs is not persistent and does not survive a server reboot!
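
    The write in option 2 can be wrapped in a small guard that compares the request with sriov_totalvfs first. This is a sketch under the assumption that both sysfs files exist on the running kernel; set_num_vfs is a hypothetical helper name.

```shell
# Write the VF count $2 to sriov_numvfs under the PF device directory $1,
# after checking that it does not exceed sriov_totalvfs.
# set_num_vfs is a hypothetical helper name used only in this sketch.
set_num_vfs() {
    dev_dir=$1
    want=$2
    # sriov_totalvfs holds the maximum configured in the firmware.
    total=$(cat "$dev_dir/sriov_totalvfs") || return 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, but the firmware allows only $total" >&2
        return 1
    fi
    echo "$want" > "$dev_dir/sriov_numvfs"
}

# Typical use (as root):
#   set_num_vfs /sys/class/net/ens785f1/device 4
```

    Because the setting is lost on reboot, a call like this would need to run again at every boot.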

     

    5. Check the PCI bus:

    # lspci -D | grep Mellanox

    0000:05:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

    0000:05:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

    0000:05:00.6 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4 Virtual Function]

    0000:05:00.7 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4 Virtual Function]

    0000:05:01.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4 Virtual Function]

    0000:05:01.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4 Virtual Function]

    0000:81:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

     

     

    # ibdev2netdev -v

    0000:81:00.0 mlx4_0 (MT4103 - MT1521X02584) CX354A - ConnectX-3 Pro QSFP fw 2.33.5100 port 1 (ACTIVE) ==> ens817 (Up)

    0000:81:00.0 mlx4_0 (MT4103 - MT1521X02584) CX354A - ConnectX-3 Pro QSFP fw 2.33.5100 port 2 (ACTIVE) ==> ens817d1 (Up)

    0000:05:00.0 mlx5_0 (MT4115 - MT1530X08465) CX456A - ConnectX-4 QSFP fw 12.12.0780 port 1 (DOWN  ) ==> ens785f0 (Down)

    0000:05:00.1 mlx5_1 (MT4115 - MT1530X08465) CX456A - ConnectX-4 QSFP fw 12.12.0780 port 1 (ACTIVE) ==> ens785f1 (Up)

    0000:05:00.6 mlx5_2 (MT4116 - MT1530X08465) CX456A - ConnectX-4 QSFP fw 12.12.0780 port 1 (ACTIVE) ==> ens785f6 (Up)

    0000:05:00.7 mlx5_3 (MT4116 - MT1530X08465) CX456A - ConnectX-4 QSFP fw 12.12.0780 port 1 (ACTIVE) ==> ens785f7 (Up)

    0000:05:01.0 mlx5_4 (MT4116 - MT1530X08465) CX456A - ConnectX-4 QSFP fw 12.12.0780 port 1 (ACTIVE) ==> enp5s1 (Up)

    0000:05:01.1 mlx5_5 (MT4116 - MT1530X08465) CX456A - ConnectX-4 QSFP fw 12.12.0780 port 1 (ACTIVE) ==> enp5s1f1 (Up)

    At this point you can see 4 VFs and one PF.

     

     

    PCI Function    VF number
    0000:05:00.6    0
    0000:05:00.7    1
    0000:05:01.0    2
    0000:05:01.1    3

     

    Note: Functions 05:00.2, 05:00.3, 05:00.4 and 05:00.5 are reserved for the mlx5_0 device (the other port).
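
    The mapping between VF number and PCI function does not have to be worked out by hand: the kernel creates a virtfnN symlink for each VF under the PF device directory. The sketch below resolves one of them; vf_pci_addr is a hypothetical helper name.

```shell
# Print the PCI address of VF number $2 under the PF device directory $1,
# using the virtfn<N> symlinks that the kernel creates for each VF.
# Returns non-zero if that VF does not exist.
# vf_pci_addr is a hypothetical helper name used only in this sketch.
vf_pci_addr() {
    link=$(readlink "$1/virtfn$2") || return 1
    # The link target ends with the full PCI address of the VF.
    basename "$link"
}

# Typical use:
#   vf_pci_addr /sys/class/net/ens785f1/device 0
```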

     

    6. Check the VF configuration with the ip tool.

    # ip link show

    ...

    9: ens785f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000

        link/ether e4:1d:2d:f2:a4:89 brd ff:ff:ff:ff:ff:ff

        vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

        vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

        vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

        vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

    ...

     

    7. Set MAC address per VF.

    # echo 0000:05:00.6  > /sys/bus/pci/drivers/mlx5_core/unbind

    # ip link set ens785f1 vf 0 mac 00:22:33:44:55:66

    # echo 0000:05:00.6  > /sys/bus/pci/drivers/mlx5_core/bind

    # ip link show

    ...

    9: ens785f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT qlen 1000

        link/ether e4:1d:2d:f2:a4:89 brd ff:ff:ff:ff:ff:ff

        vf 0 MAC 00:22:33:44:55:66, spoof checking off, link-state auto

        vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

        vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

        vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

    ...

     

    Note: In this case, you must use the full PCI address retrieved via the lspci -D command.
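
    To verify the MAC assignment from a script, the VF lines can be extracted from the ip output. This sketch assumes the "vf N MAC address," line format shown above (other iproute2 versions may print VF lines differently); vf_macs is a hypothetical helper name.

```shell
# Print "<VF number> <MAC>" for every VF line in "ip link show" output
# read on standard input. Assumes VF lines of the form
# "vf 0 MAC 00:22:33:44:55:66, spoof checking off, ..." as in this post.
# vf_macs is a hypothetical helper name used only in this sketch.
vf_macs() {
    awk '$1 == "vf" && $3 == "MAC" {
        sub(/,$/, "", $4)   # drop the comma that follows the address
        print $2, $4
    }'
}

# Typical use:
#   ip link show ens785f1 | vf_macs
```

    A VF that still reports 00:00:00:00:00:00 has not been given a MAC address yet.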

     

    8. More configuration options:

     

    For more configuration options see: HowTo Set Virtual Network Attributes on a Virtual Function (SR-IOV)

     

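    Note: Some of these commands require unbinding the VF from the driver before the change and binding it again afterwards, as done for the MAC address in step 7.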

     

     

    III. VM Management

    1. Add PCI device to the VM.

        In our example, we will connect the VM to the PCI address 05:00.6

        Here is an example from virt-manager application. Note: Shut down the VM before adding the PCI host device.

     

    [Screenshot 6.PNG: adding the PCI host device to the VM in virt-manager]

     

    2. Connect to the VM Console and set IP address to the relevant interface.

     

    Note: Make sure that the VM has the latest MLNX_OFED.

     

    # ifconfig eth2 10.10.10.1/24 up

     

    3. Ping another server on the network.

     

     

    Troubleshooting

    1. The MLNX_OFED installation script has two flags related to virtualization and SR-IOV. There is no need to use these flags when installing for ConnectX-4/ConnectX-5 or Connect-IB; they are relevant to other adapter cards, such as ConnectX-3.

    # ./mlnxofedinstall --enable-sriov --hypervisor

     

    2. If the sysfs commands fail, the driver may not be loaded. Restart the driver with:

    # /etc/init.d/openibd restart

     

    3. Make sure intel_iommu is enabled on the grub config file.

     

    4. The mlx5_num_vfs parameter does not survive a reboot. Add this command to a startup script, or run it manually after each reboot.

     

    For example (4 VFs):

    # echo 4 > /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs

    # cat /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs

    4

     

    5. On older operating systems, the output of the lspci command may differ, because lspci takes device names from the local pci.ids file, which may be outdated.

     

    To locate the file:

    # locate pci.ids

    /usr/share/hwdata/pci.ids

    /usr/share/libosinfo/db/pci.ids

     

    Open the file and search for Mellanox

    ...

    15b3  Mellanox Technologies

            0191  MT25408 [ConnectX IB Flash Recovery]

            01f6  MT27500 Family [ConnectX-3 Flash Recovery]

            01ff  MT27600 Family [Connect-IB Flash Recovery]

            0209  MT27700 Family [ConnectX-4 Flash Recovery]

            020b  MT27710 Family [ConnectX-4 Lx Flash Recovery]

            020d  MT28800 Family [ConnectX-5 Flash Recovery]

            0262  MT27710 [ConnectX-4 Lx Programmable] EN

            0263  MT27710 [ConnectX-4 Lx Programmable Virtual Function] EN

            1002  MT25400 Family [ConnectX-2 Virtual Function]

            1003  MT27500 Family [ConnectX-3]

                    103c 1777  InfiniBand FDR/EN 10/40Gb Dual Port 544FLR-QSFP Adapter (Rev Cx)

                    103c 17c9  Infiniband QDR/Ethernet 10Gb 2-port 544i Adapter

                    103c 18ce  InfiniBand QDR/EN 10Gb Dual Port 544M Adapter

                    103c 18cf  InfiniBand FDR/EN 10/40Gb Dual Port 544M Adapter

                    103c 18d6  InfiniBand FDR/EN 10/40Gb Dual Port 544QSFP Adapter

            1004  MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

            1005  MT27510 Family

            1006  MT27511 Family

            1007  MT27520 Family [ConnectX-3 Pro]

                    103c 22f3  InfiniBand FDR/Ethernet 10Gb/40Gb 2-port 544+QSFP Adapter

                    103c 22f4  InfiniBand FDR/Ethernet 10Gb/40Gb 2-port 544+FLR-QSFP Adapter

                    117c 0090  FastFrame NQ41

                    117c 0091  FastFrame NQ42

                    117c 0092  FastFrame NQ11

                    117c 0093  FastFrame NQ12

            1009  MT27530 Family

            100a  MT27531 Family

            100b  MT27540 Family

            100c  MT27541 Family

            100d  MT27550 Family

            100e  MT27551 Family

            100f  MT27560 Family

            1010  MT27561 Family

            1011  MT27600 [Connect-IB]

            1012  MT27600 Family [Connect-IB Virtual Function]

            1013  MT27700 Family [ConnectX-4]

            1014  MT27700 Family [ConnectX-4 Virtual Function]

            1015  MT27710 Family [ConnectX-4 Lx]

            1016  MT27710 Family [ConnectX-4 Lx Virtual Function]

     

    Update the file if needed:

    # update-pciids
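
    Instead of searching the file by hand, a device name can be looked up by vendor and device ID. The sketch below assumes the standard pci.ids layout, in which vendor lines start at column 0 and device lines are indented with a single tab (the listing above shows that indentation as spaces); pci_ids_name is a hypothetical helper name.

```shell
# Print the device name for vendor ID $1 and device ID $2 from the
# pci.ids file given as $3. Prints nothing if the entry is missing.
# pci_ids_name is a hypothetical helper name used only in this sketch.
pci_ids_name() {
    awk -v vendor="$1" -v device="$2" '
        # Vendor line: four hex digits at column 0.
        /^[0-9a-f][0-9a-f][0-9a-f][0-9a-f][ \t]/ {
            in_vendor = ($1 == vendor)
            next
        }
        # Device line: exactly one leading tab (subsystem lines have two).
        in_vendor && /^\t[0-9a-f]/ && $1 == device {
            sub(/^\t[0-9a-f]+[ \t]+/, "")
            print
            exit
        }
    ' "$3"
}

# Typical use (15b3 is the Mellanox vendor ID, 1014 a ConnectX-4 VF):
#   pci_ids_name 15b3 1014 /usr/share/hwdata/pci.ids
```

    An empty result for a ConnectX-4 or ConnectX-5 ID suggests that the file predates the adapter and should be updated.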

     

    6. If you are using an ASUS BIOS and setting mlx5_num_vfs fails with the following message, see the solution below:

    # echo 4 > /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs

    bash: echo: write error: Cannot allocate memory

     

    The ASUS BIOS in question does not have explicit SR-IOV support, so ACPI does not know about the VFs and the kernel does not assign any resources to the 4 VFs. The fix is to pass pci=nocrs on the kernel command line, so that the kernel discards the PCI information from ACPI and performs the allocation again, including the discovered VFs.

     

     

    OpenStack Support

    For OpenStack SR-IOV support for ConnectX-4, refer to OpenStack SR-IOV Support for ConnectX-4.

     

    RDMA Considerations

    If you wish to run RDMA from the VM, make sure you set the node GUIDs. For more information, see HowTo Configure SR-IOV for Connect-IB/ConnectX-4 with KVM (InfiniBand).

     

    https://community.mellanox.com/docs/DOC-2372