HowTo Install MLNX_DPDK 2.1 with ConnectX-4 Adapter 


    This post describes the procedure for installing MLNX_DPDK-2.1_1.1 on a bare metal Linux server with Mellanox ConnectX-4/ConnectX-4 Lx adapters.

     

     


    What is MLNX_DPDK?

    MLNX_DPDK packages are intermediate DPDK packages that contain the DPDK code from dpdk.org, together with bug fixes and newly supported features for Mellanox NICs.

    Mellanox releases MLNX_DPDK packages to support new features and new adapters before the next DPDK release. For example, MLNX_DPDK v2.1 added support for ConnectX-4 and ConnectX-4 Lx adapters, which are not part of the upstream DPDK 2.1 code. Bug fixes and new features are then integrated upstream in the following DPDK release.

    Mellanox recommends using the MLNX_DPDK package when one is available; since it already contains the DPDK code from dpdk.org, no change is required in the application.

    More details can be found in the release notes of the package on the Mellanox PMD for DPDK page.

     

    MLNX OFED Installation

    Download and install MLNX_OFED 3.1-X from the Mellanox OFED download page.

    By default, the MLNX_OFED installation updates the adapter firmware if necessary.
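
    A minimal install sketch, assuming the MLNX_OFED 3.1 tarball for your distribution has already been downloaded (the exact file name below is illustrative):

    # tar -zxvf MLNX_OFED_LINUX-3.1-x.x.x-<distro>-x86_64.tgz

    # cd MLNX_OFED_LINUX-3.1-x.x.x-<distro>-x86_64/

    # ./mlnxofedinstall

    # /etc/init.d/openibd restart

    The installer resolves missing dependencies and, when needed, burns new firmware; restart the driver (or reboot the server) for a firmware update to take effect.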

     

    MLNX_DPDK Installation

    Download MLNX_DPDK-2.1_1.1 from the Mellanox PMD for DPDK page:

    # wget www.mellanox.com/downloads/Drivers/MLNX_DPDK-2.1_1.1.tar.gz

    # tar -zxvf MLNX_DPDK-2.1_1.1.tar.gz

     

    The default mlx5 configuration in config/common_linuxapp is the following:

    #

    # Compile burst-oriented Mellanox ConnectX-4 (MLX5) PMD

    #

    CONFIG_RTE_LIBRTE_MLX5_PMD=y

    CONFIG_RTE_LIBRTE_MLX5_DEBUG=n

    CONFIG_RTE_LIBRTE_MLX5_SGE_WR_N=1

    CONFIG_RTE_LIBRTE_MLX5_MAX_INLINE=0

    CONFIG_RTE_LIBRTE_MLX5_TX_MP_CACHE=8

    CONFIG_RTE_LIBRTE_MLX5_SOFT_COUNTERS=1

     

    Parameter | Description
    CONFIG_RTE_LIBRTE_MLX5_PMD=y | Enables mlx5 PMD compilation; must be set to "y" to use ConnectX-4 NICs.
    CONFIG_RTE_LIBRTE_MLX5_DEBUG=n | Enables/disables debug mode. For more details see the Quick Start Guide on the Mellanox PMD for DPDK page.
    CONFIG_RTE_LIBRTE_MLX5_SGE_WR_N=1 | Number of Scatter/Gather elements. For Jumbo Frame support, set to "4". Note: setting this parameter to "4" might hurt performance in some cases.
    CONFIG_RTE_LIBRTE_MLX5_MAX_INLINE=0 | Max packet size for inline send. Can improve performance in case the hardware is the bottleneck.
    CONFIG_RTE_LIBRTE_MLX5_TX_MP_CACHE=8 | Max number of cached memory pools per TX queue. This number should be sufficient; it is not recommended to change it.
    CONFIG_RTE_LIBRTE_MLX5_SOFT_COUNTERS=1 | Enables/disables software counters. Note: hardware counters are not supported.
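
    For example, if you need Jumbo Frame support, the CONFIG_RTE_LIBRTE_MLX5_SGE_WR_N value can be changed before compiling; a one-line sketch, run from the package root:

    # sed -i 's/CONFIG_RTE_LIBRTE_MLX5_SGE_WR_N=1/CONFIG_RTE_LIBRTE_MLX5_SGE_WR_N=4/' config/common_linuxapp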

     

    Compile DPDK

    # cd MLNX_DPDK-2.1_1.1/

    # make install T=x86_64-native-linuxapp-gcc

    For more advanced DPDK compilation options, refer to the DPDK documentation.
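
    If you later build your own DPDK application against this tree, the usual approach is to point RTE_SDK and RTE_TARGET at it; a sketch (adjust the path to wherever you extracted the package):

    # export RTE_SDK=/path/to/MLNX_DPDK-2.1_1.1

    # export RTE_TARGET=x86_64-native-linuxapp-gcc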

     

    Find the NUMA Configuration

    Identify the CPUs belonging to each NUMA node.
    Note: You can also run the "lscpu" command.

     

    # numactl --hardware

    available: 2 nodes (0-1)

    node 0 cpus: 0 1 2 3 4 5 6 14 15 16 17 18 19 20

    node 0 size: 32052 MB

    node 0 free: 30713 MB

    node 1 cpus: 7 8 9 10 11 12 13 21 22 23 24 25 26 27

    node 1 size: 32253 MB

    node 1 free: 31495 MB

    node distances:

    node   0   1

      0: 10  21

      1: 21  10
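
    As noted above, lscpu reports the same CPU-to-NUMA-node mapping in a more compact form; the "NUMA nodeX CPU(s)" lines should match the numactl output:

    # lscpu | grep NUMA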

     

    Find the NIC's NUMA Node

    Run the mst utility for detailed information.

    # mst start

     

    # mst status -v

    MST modules:

    ------------

        MST PCI module loaded

        MST PCI configuration module loaded

    PCI devices:

    ------------

    DEVICE_TYPE             MST                        PCI          RDMA         NET                      NUMA

    ConnectX4(rev:0)        /dev/mst/mt4115_pciconf0   08:00.0      mlx5_0       net-p1p1                 0

                                                       08:00.1      mlx5_1       net-p1p2                 0

     

    ConnectX3Pro(rev:0)     /dev/mst/mt4103_pciconf0

    ConnectX3Pro(rev:0)     /dev/mst/mt4103_pci_cr0    05:00.0      mlx4_0       net-eth1,net-p2p1        0

     

    # mst stop

    In the above example, the server has two NICs installed: a ConnectX-3 Pro (mlx4) and a ConnectX-4 (mlx5). The difference between the two outputs is that ConnectX-3 has a single PCI address per device, while ConnectX-4 has a PCI address per port.

     

    Another way to find the NUMA node is to query one of the ConnectX-4 ports directly:

    # cat /sys/class/net/p1p1/device/numa_node

    0

    Use ibdev2netdev to see the correlation between the NIC and the OS port:

    # ibdev2netdev

    mlx4_0 port 2 ==> eth1 (Down)

    mlx4_0 port 1 ==> p2p1 (Down)

    mlx5_0 port 1 ==> p1p1 (Up)

    mlx5_1 port 1 ==> p1p2 (Up)
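
    To print the NUMA node of every ConnectX-4 (mlx5) port in one shot, a small convenience loop over the ibdev2netdev output can be used (this script is an illustration, not part of MLNX_OFED):

    # for netdev in $(ibdev2netdev | awk '/^mlx5/ {print $5}'); do echo -n "$netdev: "; cat /sys/class/net/$netdev/device/numa_node; done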

     

    Hugepages Configuration

    Use the following commands to configure and mount 1024 pages of 2MB (2048kB) on NUMA node 0.

    # echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

    # mkdir -p /mnt/huge

    # mount -t hugetlbfs hugetlb /mnt/huge

     

    Use the following command to verify hugepages configuration.

    # cat /proc/meminfo | grep Huge
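
    With the configuration above, HugePages_Total should report 1024 and Hugepagesize 2048 kB. To keep the mount across reboots, you can optionally add an fstab entry such as the following (an assumption, adjust to your environment):

    # echo "nodev /mnt/huge hugetlbfs defaults 0 0" >> /etc/fstab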

     

    Verify Installation

    Note: When running an MLNX_OFED version other than 3.1-1.0.0 (use ofed_info -s to check the installed version), CQE compression is not enabled by default.

    Setting MLX5_ENABLE_CQE_COMPRESSION=1 can increase DPDK performance; however, it is not suitable for every application using MLNX_OFED, so it is advised to set this environment variable only for DPDK applications.

     

    1. Disable Pause Frames on the ports (optional; usually increases performance):

    # ethtool -A p1p1 rx off tx off

    # ethtool -A p1p2 rx off tx off
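
    The new pause settings can be confirmed with ethtool -a; both RX and TX should report "off":

    # ethtool -a p1p1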

     

    2. Verify that the Max_Read_Req parameter is set to 4K (a register value of 5XXX).

    To read the current setting, use setpci -s <NIC PCI address> 68.w:

    # setpci -s 80:00.0 68.w

    5936

    3. If the output is different from 5XXX, set it with setpci -s <NIC PCI address> 68.w=5XXX:

    # setpci -s 84:00.0 68.w=5936
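
    Since each ConnectX-4 port is a separate PCI function, apply the setting to both functions of the adapter (here using the example PCI addresses from the mst output above). Note that setpci changes are not persistent across a reboot:

    # setpci -s 08:00.0 68.w=5936

    # setpci -s 08:00.1 68.w=5936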

     

    4. Use a command line similar to the following to run the testpmd application:

    # MLX5_ENABLE_CQE_COMPRESSION=1 ./testpmd -c 0x7 -n 4 -w 08:00.0 -w 08:00.1 --socket-mem=2048,0 -- --port-numa-config=0,0,1,0 --socket-num=0 --burst=64 --txd=1024 --rxd=256 --mbcache=512 --rxq=1 --txq=1 --nb-cores=2 --i

     

    The above command line will start a forwarding process on two ports using two cores with one RX queue and one TX queue each.

    The default behavior with two ports is to forward traffic between the ports.

     

    Parameter | Description
    -c 0x7 | Use the first 3 cores (DPDK requires one extra core besides the cores used by the forwarding threads).
    -n 4 | Use 4 memory channels.
    -w 08:00.0 -w 08:00.1 | Use only the specified PCI devices, 08:00.0 and 08:00.1 (the ConnectX-4 ports in this case).
    --socket-mem=2048,0 | Allocate 2048 MB on NUMA node 0 and none on NUMA node 1.
    --port-numa-config=0,0,1,0 | Both ports are on NUMA node 0.
    --socket-num=0 | Set the socket from which all memory is allocated to socket 0.
    --burst=64 | Use 64-packet bursts.
    --txd=1024 | Number of TX descriptors (it is recommended to use more TX descriptors with ConnectX-4).
    --rxd=256 | Number of RX descriptors.
    --mbcache=512 | Set the cache size of the mbuf memory pools.
    --rxq=1 | One RX queue per port.
    --txq=1 | One TX queue per port.
    --nb-cores=2 | Use 2 forwarding cores.
    --i | Interactive mode.

     

     

    For more options, see the testpmd application guide.

     

    # cd x86_64-native-linuxapp-gcc/app/

    # MLX5_ENABLE_CQE_COMPRESSION=1 ./testpmd -c 0x7 -n 4 -w 08:00.0 -w 08:00.1 --socket-mem=2048,0 -- --port-numa-config=0,0,1,0 --socket-num=0 --burst=64 --txd=1024 --rxd=256 --mbcache=512 --rxq=1 --txq=1 --nb-cores=2 --i

     

    EAL: Detected lcore 0 as core 0 on socket 0

    EAL: Detected lcore 1 as core 2 on socket 0

    EAL: Detected lcore 2 as core 4 on socket 0

    EAL: Detected lcore 3 as core 6 on socket 0

    EAL: Detected lcore 4 as core 9 on socket 0

    EAL: Detected lcore 5 as core 11 on socket 0

    EAL: Detected lcore 6 as core 13 on socket 0

    EAL: Detected lcore 7 as core 0 on socket 1

    EAL: Detected lcore 8 as core 2 on socket 1

    EAL: Detected lcore 9 as core 4 on socket 1

    EAL: Detected lcore 10 as core 6 on socket 1

    EAL: Detected lcore 11 as core 9 on socket 1

    EAL: Detected lcore 12 as core 11 on socket 1

    EAL: Detected lcore 13 as core 13 on socket 1

    EAL: Detected lcore 14 as core 1 on socket 0

    EAL: Detected lcore 15 as core 3 on socket 0

    EAL: Detected lcore 16 as core 5 on socket 0

    EAL: Detected lcore 17 as core 8 on socket 0

    EAL: Detected lcore 18 as core 10 on socket 0

    EAL: Detected lcore 19 as core 12 on socket 0

    EAL: Detected lcore 20 as core 14 on socket 0

    EAL: Detected lcore 21 as core 1 on socket 1

    EAL: Detected lcore 22 as core 3 on socket 1

    EAL: Detected lcore 23 as core 5 on socket 1

    EAL: Detected lcore 24 as core 8 on socket 1

    EAL: Detected lcore 25 as core 10 on socket 1

    EAL: Detected lcore 26 as core 12 on socket 1

    EAL: Detected lcore 27 as core 14 on socket 1

    EAL: Support maximum 128 logical core(s) by configuration.

    EAL: Detected 28 lcore(s)

    EAL: VFIO modules not all loaded, skip VFIO support...

    EAL: Setting up physically contiguous memory...

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f4670400000 (size = 0x200000)

    EAL: Ask a virtual area of 0x12000000 bytes

    EAL: Virtual area found at 0x7f465e200000 (size = 0x12000000)

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f465de00000 (size = 0x200000)

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f465da00000 (size = 0x200000)

    EAL: Ask a virtual area of 0x6cc00000 bytes

    EAL: Virtual area found at 0x7f45f0c00000 (size = 0x6cc00000)

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f45f0800000 (size = 0x200000)

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f45f0400000 (size = 0x200000)

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f45f0000000 (size = 0x200000)

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f45efc00000 (size = 0x200000)

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f45ef800000 (size = 0x200000)

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f45ef400000 (size = 0x200000)

    EAL: Ask a virtual area of 0x200000 bytes

    EAL: Virtual area found at 0x7f45ef000000 (size = 0x200000)

    EAL: Requesting 1024 pages of size 2MB from socket 0

    EAL: TSC frequency is ~2596987 KHz

    EAL: Master lcore 0 is ready (tid=71e3e940;cpuset=[0])

    EAL: lcore 2 is ready (tid=edbf1700;cpuset=[2])

    EAL: lcore 1 is ready (tid=ee3f2700;cpuset=[1])

    EAL: PCI device 0000:08:00.0 on NUMA socket 0

    EAL:   probe driver: 15b3:1013 librte_pmd_mlx5

    PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_0" (VF: false)

    PMD: librte_pmd_mlx5: 1 port(s) detected

    PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:5c:f1:1c

    EAL: PCI device 0000:08:00.1 on NUMA socket 0

    EAL:   probe driver: 15b3:1013 librte_pmd_mlx5

    PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_1" (VF: false)

    PMD: librte_pmd_mlx5: 1 port(s) detected

    PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:5c:f1:1d

    Interactive-mode selected

    Configuring Port 0 (socket 0)

    PMD: librte_pmd_mlx5: 0x8bd7e0: TX queues number update: 0 -> 1

    PMD: librte_pmd_mlx5: 0x8bd7e0: RX queues number update: 0 -> 1

    Port 0: E4:1D:2D:5C:F1:1C

    Configuring Port 1 (socket 0)

    PMD: librte_pmd_mlx5: 0x8be828: TX queues number update: 0 -> 1

    PMD: librte_pmd_mlx5: 0x8be828: RX queues number update: 0 -> 1

    Port 1: E4:1D:2D:5C:F1:1D

    Checking link statuses...

    Port 0 Link Up - speed 100000 Mbps - full-duplex

    Port 1 Link Up - speed 100000 Mbps - full-duplex

    Done

     

    testpmd> set fwd io

    Set io packet forwarding mode

    testpmd> start

      io packet forwarding - CRC stripping disabled - packets/burst=64

      nb forwarding cores=2 - nb forwarding ports=2

      RX queues=1 - RX desc=256 - RX free threshold=0

      RX threshold registers: pthresh=0 hthresh=0 wthresh=0

      TX queues=1 - TX desc=1024 - TX free threshold=0

      TX threshold registers: pthresh=0 hthresh=0 wthresh=0

      TX RS bit threshold=0 - TXQ flags=0x0

     

    5. Start sending packets to the machine running testpmd.

        An easy option is to use the raw_ethernet_bw utility shipped with MLNX_OFED:

    # raw_ethernet_bw -s 64 -E E4:1D:2D:5C:F1:13 --client -D 10 -l 8 -d mlx5_1

       This command sends 64B L2-only packets in bursts of 8, with a DMAC of E4:1D:2D:5C:F1:13, for 10 seconds from the mlx5_1 device (use ibstat to list devices).

       Use raw_ethernet_bw -h for the full list of parameters.

     

    Note: When the machines are connected back-to-back, any MAC address will work because testpmd sets the ports to promiscuous mode by default. If there is a switch in the middle, you have to specify the port's MAC address, or a MAC address that is not associated with any other port on the switch.
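
    The MAC address of a given port on the testpmd machine can be read from sysfs (it also appears in the "Port 0:"/"Port 1:" lines of the testpmd output above):

    # cat /sys/class/net/p1p1/address

    e4:1d:2d:5c:f1:1c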

     

    testpmd> stop

    testpmd> quit

     

    Running testpmd Using Multiple Queues (RSS)

    # MLX5_ENABLE_CQE_COMPRESSION=1 ./testpmd -c 0x1F -n 4 -w 08:00.0 -w 08:00.1 --socket-mem=2048,0 -- --port-numa-config=0,0,1,0 --socket-num=0 --burst=64 --txd=1024 --rxd=256 --mbcache=512 --rxq=2 --txq=2 --nb-cores=4 --i

     

     

    Parameter | Description
    -c 0x1F | Use the first 5 cores (DPDK requires one extra core besides the cores used by the forwarding threads).
    --rxq=2 | Use 2 RX queues per port.
    --txq=2 | Use 2 TX queues per port.
    --nb-cores=4 | Use 4 forwarding cores.

     

     

    In this case it is important to create different flows so that RSS can distribute them effectively between the queues.

    A simple script using raw_ethernet_bw can achieve that (see the verification note after the parameter table below):

    #!/bin/bash

    raw_ethernet_bw -s 64 -E E4:1D:2D:5C:F1:13 -j 1.1.1.1 -J 2.2.2.2 --client -D 10 -l 8 -d mlx5_1 &

    raw_ethernet_bw -s 64 -E E4:1D:2D:5C:F1:13 -j 1.1.1.2 -J 2.2.2.2 --client -D 10 -l 8 -d mlx5_1 &

    raw_ethernet_bw -s 64 -E E4:1D:2D:5C:F1:13 -j 1.1.1.1 -J 2.2.2.2 --client -D 10 -l 8 -d mlx5_0 &

    raw_ethernet_bw -s 64 -E E4:1D:2D:5C:F1:13 -j 1.1.1.2 -J 2.2.2.2 --client -D 10 -l 8 -d mlx5_0

     

    The script will send 2 different IP flows from each port:

     

    Parameter | Description
    -j | Source IP
    -J | Destination IP
    -d | Device
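
    To confirm that traffic is being received and forwarded on both ports while the script runs, the standard testpmd statistics commands can be used; stopping forwarding also prints per-port and accumulated forwarding statistics:

    testpmd> show port stats all

    testpmd> stop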