HowTo Configure RoCE v2 for ConnectX-3 Pro using Mellanox SwitchX Switches

Version 27

    This is an archived document. Please refer to the more recent knowledge base article, Getting Started with RoCE Configuration.

     

    This post shows how to configure RoCE v2 end to end, using ConnectX-3 Pro adapters connected over Mellanox SwitchX-based switches configured with L3 (OSPF).

     

    Setup

    [Figure: setup topology diagram (25.png)]

    Network configuration:

     

    The network in this setup consists of four Mellanox switches, L3 enabled and configured with OSPF.

    1. The running config of the setup can be found in the post HowTo Configure OSPF on Mellanox Switches (Running-Config).

     

    2. In addition, it is recommended to enable PFC on all router ports for the lossless priority (e.g. 3) used for the RoCE application:

    switch (config) # dcb priority-flow-control enable

    switch (config) # dcb priority-flow-control priority 3 enable

    switch (config) # interface ethernet 1/1-1/36 dcb priority-flow-control mode on force

                          

    For additional information about PFC configuration refer to HowTo Run RoCE over L2 Enabled with PFC.

     

    3. By default, the router performs a fixed DSCP to PCP (L2 priority) mapping. To map the PCP of one network to the PCP of the other network instead (that is, to preserve the priority), run the following command on all switches:

    switch (config) # qos map dscp-to-pcp preserve-pcp

        

    Note: This command is applicable only for Mellanox switches based on SwitchX IC.

     

    4. The switches sx01 and sx02 in the example above perform ECMP (multi-path) load sharing. The default load-sharing hash function is based on the source IP and UDP/TCP port, the destination IP and UDP/TCP port, and the traffic class (the "all" option in the CLI).

    (Optional) To change the load-sharing function, use the switch command ip load-sharing:

    sx01 (config) # ip load-sharing ?

    source-ip-port                 source ip and TCP/UDP port

    destination-ip-port            destination ip and TCP/UDP port

    source-destination-ip-port     source & destination ip and TCP/UDP port

    traffic-class                  traffic class

    all                            all options

    sx01 (config) # show ip load-sharing

    Load sharing: all

    sx01 (config) #
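To illustrate why a hash over addresses and ports spreads flows across equal-cost paths while keeping every packet of one flow on the same path, here is a minimal shell sketch. It is not the hash function used by the switch: `ecmp_pick` is a hypothetical helper that uses the CRC from `cksum` as a stand-in hash, and the addresses and ports are example values (4791 is the IANA-assigned RoCE v2 UDP destination port).

```shell
#!/bin/sh
# Illustrative only: this is NOT the hash implemented by the switch.
# ecmp_pick SRC_IP SRC_PORT DST_IP DST_PORT N_PATHS
# Prints the index (0 .. N_PATHS-1) of the path selected for the flow.
ecmp_pick() {
    key="$1:$2>$3:$4"
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    crc=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
    echo $(( crc % $5 ))
}

# The same flow always maps to the same path. Flows that differ only in
# the UDP source port may map to different paths.
ecmp_pick 11.11.5.1 49152 11.11.6.1 4791 2
ecmp_pick 11.11.5.1 49153 11.11.6.1 4791 2
```

With a source-only or destination-only option, fewer fields would enter the key, so fewer distinct flows could be told apart.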

                         

     

    Server Configuration (Linux)

    1. To enable RoCE v2, make sure the following entry is present in the /etc/modprobe.d/mlx4_core.conf file (create the file if it does not exist):
    options mlx4_core roce_mode=2
    2. (Optional) To change the destination UDP port for RoCE v2 (to 23456, for example), add the following entry to the /etc/modprobe.d/mlx4.conf file:
    options mlx4_core roce_mode=2 rr_proto=23456
    3. Restart the driver after changing mlx4_core parameters:
    # /etc/init.d/openibd restart
    4. Set the IP address and route on each server. There are different ways to do this; here is one example:
    Server S1:
    # ifconfig eth2 11.11.5.1/24 up ; route add -net 11.11.0.0/16 gw 11.11.5.2
    Server S2:
    # ifconfig eth2 11.11.6.1/24 up ; route add -net 11.11.0.0/16 gw 11.11.6.2

    5. Configure QoS on the server. QoS is needed to map priorities to traffic classes (TC), as with RoCE v1. Refer to End-to-End QoS Configuration for Mellanox Switches (SwitchX) and Adapters for more details.

     

    6. To work with the RDMA_CM libraries, set the default RoCE mode of the device to RoCE v2 through configfs:

    # mount -t configfs none /sys/kernel/config
    # cd /sys/kernel/config/rdma_cm
    # mkdir mlx4_0
    # cd mlx4_0
    # echo RoCE V2 > default_roce_mode
    # cd ..
    # rmdir mlx4_0

    Note: The possible values for the default_roce_mode parameter are "IB/RoCE V1" and "RoCE V2".

    Note: In a mounted configfs file system, mkdir and rmdir do not act like ordinary directory operations: mkdir creates a configuration object in the kernel, and rmdir removes it.
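The module options from steps 1 and 2 can be written by a small helper instead of by hand. The following is a minimal sketch, not part of the driver package: `roce_options_line` and `write_roce_conf` are hypothetical names, and the sketch assumes that any earlier line containing `roce_mode=` in the target file should be replaced.

```shell
#!/bin/sh
# roce_options_line MODE [UDP_PORT]
# Prints the mlx4_core options line; rr_proto is added only when a
# destination UDP port is given.
roce_options_line() {
    line="options mlx4_core roce_mode=$1"
    if [ -n "$2" ]; then
        line="$line rr_proto=$2"
    fi
    echo "$line"
}

# write_roce_conf FILE MODE [UDP_PORT]
# Drops any previous roce_mode line from FILE, then appends the new one.
write_roce_conf() {
    conf="$1"
    shift
    tmp=$(mktemp) || return 1
    if [ -f "$conf" ]; then
        # grep -v exits non-zero when nothing is left; that is fine here.
        grep -v 'roce_mode=' "$conf" > "$tmp" || true
    fi
    roce_options_line "$@" >> "$tmp"
    # Copy the content back so FILE keeps its own permissions.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

For example, running `write_roce_conf /etc/modprobe.d/mlx4_core.conf 2` as root, followed by the driver restart from step 3, would correspond to step 1.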


    Basic Verification

    1. Check the current RoCE Mode:

    # cat /sys/module/mlx4_core/parameters/roce_mode
    2

     

    2. A basic verification test is to run one of the performance tests with the "-R" option, which connects the QPs through RDMA_CM.
    For example:

    Server S1:
    # ib_write_bw -R -d mlx4_0 -i 1 --report_gbits -D 10
    Server S2:
    # ib_write_bw -R -d mlx4_0 -i 1 --report_gbits -D 10 11.11.5.1
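The RoCE mode check from step 1 can be scripted so that the performance test only starts when the driver reports RoCE v2. This is a minimal sketch: `check_roce_mode` is a hypothetical helper, and the optional argument exists only so that the function can be pointed at another file.

```shell
#!/bin/sh
# check_roce_mode [PARAM_FILE]
# Returns 0 if the file holds the value 2 (RoCE v2), 1 if it holds
# another value, and 2 if the file cannot be read.
check_roce_mode() {
    param="${1:-/sys/module/mlx4_core/parameters/roce_mode}"
    if [ ! -r "$param" ]; then
        echo "cannot read $param (is mlx4_core loaded?)" >&2
        return 2
    fi
    mode=$(cat "$param")
    if [ "$mode" = "2" ]; then
        echo "RoCE v2 enabled"
    else
        echo "roce_mode is $mode, expected 2" >&2
        return 1
    fi
}
```

On server S1 this could be used as `check_roce_mode && ib_write_bw -R -d mlx4_0 -i 1 --report_gbits -D 10`.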