HowTo Enable, Verify and Troubleshoot RDMA

Version 7

    This is an archived document. Please refer to the more recent knowledge base articles on Recommended Network Configuration Examples for RoCE Deployment

     

    This post shows several ways to verify that RDMA is working properly and provides several troubleshooting guidelines. It applies to both Ethernet (RoCE) and InfiniBand link-layer networks.

    This post is based on HowTo Setup RDMA Connection using Inbox Driver (RHEL, Ubuntu) with some additions and updates.

     


    Setup

    • Make sure you have two servers equipped with Mellanox ConnectX-3/ConnectX-3 Pro adapter cards.
    • (Optional) Connect the two servers via an Ethernet switch. When using RoCE, you can use an access port (VLAN 1 by default).

     

    RDMA Drivers

    It is recommended to install the latest MLNX_OFED. However, it is also possible to use the RDMA inbox drivers.

     

    For RHEL/CentOS Installation:

    Run the following installation commands on both servers:

    # yum -y groupinstall "InfiniBand Support"

    # yum -y install perftest infiniband-diags      

    Make sure that RDMA is enabled on boot (RHEL7/CentOS7)

    # dracut --add-drivers "mlx4_en mlx4_ib mlx5_ib" -f

    # service rdma restart

    # systemctl enable rdma

    Make sure that RDMA is enabled on boot (RHEL6/CentOS6)

     

    # service rdma restart ; chkconfig rdma on

     

     

    For Ubuntu Installation:

    Run the following installation commands on both servers:

    # apt-get install libmlx4-1 infiniband-diags ibutils ibverbs-utils rdmacm-utils perftest

      

    For tgt target support install:

    # apt-get install tgt

      

    For LIO target support install:

    # apt-get install targetcli

      

    For iscsi client install:

    # apt-get install open-iscsi-utils open-iscsi

      

     

    Port type configuration:

     

    Follow this post to configure the port type.

     

    HowTo Change Port Type in Mellanox ConnectX-3 Adapter

     

     

    Configure port parameters:

    In order to find the exact mapping between the interface name and the actual adapter and port number, follow this post

     

    HowTo Find the Logical-to-Physical Port Mapping (Linux)

     

    Configure IP Address and enable the port.

    This can be done from the console, with network configuration scripts, or by any other method.

    For example, on the first server:
    # ifconfig eth2 12.12.12.1/24 up
    Make sure that both servers have IP addresses on the same network. On the second server:
    # ifconfig eth2 12.12.12.2/24 up
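The same-network requirement can be checked with a little shell arithmetic. The following is a minimal sketch and not part of the original procedure: the helper names ip_to_int and same_network are hypothetical, and it assumes plain dotted-quad IPv4 addresses without leading zeros.

```shell
# ip_to_int A.B.C.D
# Prints the IPv4 address as a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# same_network ADDR1 ADDR2 PREFIXLEN
# Exit status 0 if both addresses fall into the same subnet.
same_network() {
    prefix=$3
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    [ $(( a & mask )) -eq $(( b & mask )) ]
}

# Example with the addresses used in this post:
#   same_network 12.12.12.1 12.12.12.2 24 && echo "same network"
```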

    Kernel Modules:

    Make sure that the InfiniBand kernel modules are enabled. See this post:

    Mellanox Linux Driver Modules Relationship (MLNX_OFED)
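As a quick check that the modules are loaded, the output of lsmod can be filtered for the expected names. This is only a sketch: the helper name missing_modules is hypothetical, and the module list in the usage comment is an assumption for a ConnectX-3 setup, built from module names that appear elsewhere in this post.

```shell
# missing_modules NAME...
# Reads `lsmod` output on stdin and prints every requested module
# that does not appear in it (one per line). Prints nothing if all are loaded.
missing_modules() {
    # The first lsmod line is a header; column 1 holds the module name.
    loaded=$(awk 'NR > 1 { print $1 }')
    for m in "$@"; do
        printf '%s\n' "$loaded" | grep -qxF "$m" || printf '%s\n' "$m"
    done
}

# Example (module list is an assumption, adjust to your adapter):
#   lsmod | missing_modules mlx4_core mlx4_en mlx4_ib ib_core rdma_cm
```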

     

    Lossless Network:

    When RDMA runs over Ethernet (also known as RoCE), you need to make sure that the network is configured to be lossless, which means that either flow control (FC) or priority flow control (PFC) is enabled on the adapter ports and on the switch.

    For more info refer to Network Considerations for Global Pause, PFC and QoS with Mellanox Switches and Adapters.

     

    For basic RDMA testing (lab environment), Global Pause Flow Control (per port) may be sufficient. For a production environment, PFC is preferred.

     

    Global Pause Flow Control

    In a lab environment or a small setup, you can use this method to create a lossless environment.

     

    To check the current global pause configuration, use the following command (global pause is normally enabled by default).

    # ethtool -a eth2

    Pause parameters for eth2:

    Autonegotiate:  off

    RX:             on

    TX:             on

    In case it is disabled, run:

    # ethtool -A eth2 rx on tx on
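For scripted checks, the ethtool -a output shown above can be parsed instead of read by eye. A minimal sketch, assuming the output layout shown in this post; the helper name pause_enabled is hypothetical.

```shell
# pause_enabled
# Reads `ethtool -a <iface>` output on stdin.
# Exit status 0 only if both the RX and TX pause settings are "on".
pause_enabled() {
    awk '
        $1 == "RX:" { rx = $2 }
        $1 == "TX:" { tx = $2 }
        END { exit !(rx == "on" && tx == "on") }
    '
}

# Example:
#   ethtool -a eth2 | pause_enabled || ethtool -A eth2 rx on tx on
```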

    Important: make sure that Global Pause Flow Control is also enabled on the relevant switch ports.

    If it is a Mellanox switch (MLNX-OS), use the following commands to enable it:

    switch (config) # interface ethernet 1/1 flowcontrol receive on force

    switch (config) # interface ethernet 1/1 flowcontrol send on force

    If you use other switches, refer to the switch vendor user manual (the commands are similar).

     

    PFC

    For PFC configuration on the adapter refer to the following posts:

     

    PFC configuration for third-party switch vendors is located here: Solutions

     

    RoCE version

    Refer to RoCE v2 Considerations

    Make sure you have the same versions on the relevant servers running end to end.

     

    ----

     

    At this point RDMA should be able to run between the two servers.

     

    Test your setup at this point

    1. Verify that all relevant ports are in Up state (link is up)

    2. Check L3 IP connectivity (e.g. ping is running)

    3. Make sure that the network is configured to be lossless (either flow control or PFC)

    4. Make sure that you have the same RoCE version on the relevant servers.

    5. Make sure that the iptables service is stopped. If it is running, host firewall rules are likely to block the TCP/IP connection.

    6. Continue to the next section - RDMA verification
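The first item of the checklist (link is up) can be scripted by reading the operstate file that Linux exposes for each network interface under /sys/class/net. This is a sketch only: the helper name link_is_up is hypothetical, and the optional second argument exists purely so that the function can be pointed at a different sysfs root for testing.

```shell
# link_is_up IFACE [SYSFS_ROOT]
# Exit status 0 if the kernel reports the interface as operationally "up".
link_is_up() {
    iface=$1
    sysfs=${2:-/sys}
    state_file="$sysfs/class/net/$iface/operstate"
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "up" ]
}

# Example, followed by the L3 check from step 2:
#   link_is_up eth2 && ping -c 3 12.12.12.1
```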

    RDMA Verification

    To check basic RDMA CM you can simply use several testing scripts

    1. udaddy

    This script covers RDMA_CM UD connections. It establishes a set of unreliable RDMA datagram communication paths between two nodes using librdmacm, optionally transfers datagrams between the nodes, and then tears down the communication.
    Run the following command on one server (acting as the server):

     

    # udaddy

     

    Run the following command on the second server (act as a client)

    # udaddy -s 12.12.12.1

    udaddy: starting client

    udaddy: connecting

    initiating data transfers

    receiving data transfers

    data transfers complete

    test complete

    return status 0

    "return status 0" means a clean exit (RDMA is running).

    2. rdma_server, rdma_client commands

    Another option is to use the rdma_server and rdma_client commands.
    These commands perform a simple RDMA CM connection and ping-pong test. They use synchronous librdmacm calls to establish an RDMA connection between two nodes.
    Run the following command on one server (acting as the server):
    # rdma_server

     

    Run the following command on the second server (act as a client)

    # rdma_client -s 12.12.12.1

    rdma_client: start

    rdma_client: end 0

                                 

     

    "rdma_client: end 0" means good exit (RDMA is running).

    3. ib_send_bw (performance test)

    Run a performance test such as ib_send_bw, ib_read_bw or similar.

     

    For Example:

    Run the following command on one server (act as a server):

    # ib_send_bw -d mlx4_0 -i 1 -F --report_gbits

     

    Run the following command on the second server (act as a client):

    # ib_send_bw -d mlx4_0 -i 1 -F --report_gbits 12.12.12.1

    ---------------------------------------------------------------------------------------

                        Send BW Test

    Dual-port       : OFF          Device         : mlx4_0

    Number of qps   : 1            Transport type : IB

    Connection type : RC

    RX depth        : 512

    CQ Moderation   : 100

    Mtu             : 1024[B]

    Link type       : Ethernet

    Gid index       : 0

    Max inline data : 0[B]

    rdma_cm QPs     : OFF

    Data ex. method : Ethernet

    ---------------------------------------------------------------------------------------

    local address: LID 0000 QPN 0x0065 PSN 0xc8f367

    GID: 254:128:00:00:00:00:00:00:246:82:20:255:254:23:27:129

    remote address: LID 0000 QPN 0x005d PSN 0x884d7d

    GID: 254:128:00:00:00:00:00:00:246:82:20:255:254:23:31:225

    ---------------------------------------------------------------------------------------

    #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

    65536      1000           0.00               36.40                0.069428

    ---------------------------------------------------------------------------------------
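When the test is run repeatedly, it can help to pull the average bandwidth out of the report and compare it with an expected minimum. The sketch below assumes the column layout shown in the sample output above (average bandwidth in the fourth column of the first data row after the #bytes header); the helper names bw_average and bw_at_least are hypothetical.

```shell
# bw_average
# Reads ib_send_bw output on stdin and prints the "BW average" value
# from the first result row that follows the "#bytes" header line.
bw_average() {
    awk '
        header_seen && NF >= 5 && $1 ~ /^[0-9]+$/ { print $4; exit }
        $1 == "#bytes" { header_seen = 1 }
    '
}

# bw_at_least MIN
# Reads ib_send_bw output on stdin; exit status 0 if the average
# bandwidth is at least MIN (same unit as the report).
bw_at_least() {
    min=$1
    avg=$(bw_average)
    [ -n "$avg" ] && awk -v a="$avg" -v m="$min" 'BEGIN { exit !(a + 0 >= m + 0) }'
}

# Example (client side):
#   ib_send_bw -d mlx4_0 -i 1 -F --report_gbits 12.12.12.1 | bw_at_least 30
```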

                           

    4. rping

    This script covers RDMA_CM RC connections, but only userspace (It establishes a set of reliable RDMA connections between two nodes using the librdmacm, optionally transfers data between the nodes, then disconnects).

     

    Run the following on one of the servers (acting as the rping server):

     

    # rping -s -C 10 -v

    Run the following on the other server (acting as the rping client):

     

    # rping -c -a 12.12.12.1 -C 10 -v

    ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr

    ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs

    ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst

    ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu

    ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv

    ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw

    ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx

    ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy

    ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz

    ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA

    client DISCONNECT EVENT...
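Since the client was started with -C 10, ten "ping data" lines are expected. A small sketch that counts them, assuming the verbose output format shown above; the helper name rping_complete is hypothetical.

```shell
# rping_complete EXPECTED
# Reads verbose rping client output on stdin.
# Exit status 0 if exactly EXPECTED "ping data" lines were printed.
rping_complete() {
    expected=$1
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    seen=$(grep -c 'ping data: rdma-ping-[0-9][0-9]*:' || true)
    [ "$seen" -eq "$expected" ]
}

# Example:
#   rping -c -a 12.12.12.1 -C 10 -v | rping_complete 10 && echo "all pings seen"
```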

             

    5. ucmatose

    This script covers RDMA_CM RC connections, but only userspace (same as rping) (It establishes a set of reliable RDMA connections between two nodes using the librdmacm, optionally transfers data between the nodes, then disconnects).

     

    Run the following on one of the servers (act as a server)

    # ucmatose

     

    Run the following on the other server (act as a client)

    # ucmatose -s 12.12.12.1

    cmatose: starting client

    cmatose: connecting

    receiving data transfers

    sending replies

    data transfers complete

    test complete

    return status 0
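Both udaddy and ucmatose end their client output with a "return status" line, so a script can test for it rather than rely on a visual check. A minimal sketch based on the sample outputs in this post; the helper name cm_test_passed is hypothetical.

```shell
# cm_test_passed
# Reads udaddy or ucmatose client output on stdin.
# Exit status 0 if the output contains the line "return status 0".
cm_test_passed() {
    grep -q '^[[:space:]]*return status 0[[:space:]]*$'
}

# Example:
#   ucmatose -s 12.12.12.1 | cm_test_passed && echo "RDMA CM test passed"
```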

            

    6. krping

    The krping module is a kernel loadable module that utilizes the Open Fabrics verbs to implement a client/server ping/pong program.

    This module should be extracted and compiled on both servers.

    [Note: The package can be downloaded from here]

     

     

     

    # cd /tmp

    # tar xvzf krping.tgz

    ...

    # cd krping

    # make

    ...

    # make install

    ...

    # modinfo rdma_krping

    filename:       /lib/modules/3.10.0-123.el7.x86_64/extra/rdma_krping.ko

    license:        Dual BSD/GPL

    description:    RDMA ping server

    author:         Steve Wise

    srcversion:     C4533E67F73469BA240B78D

    depends:        ib_core,rdma_cm

    vermagic:       3.10.0-123.el7.x86_64 SMP mod_unload modversions

    parm:           debug:Debug level (0=none, 1=all) (int)

    # modprobe rdma_krping debug=1

          

     

    Run the following on one of the servers (act as a server)

    # echo "server,addr=12.12.12.1,port=9999,verbose" >/proc/krping

          

    Run the following on the other server (act as a client)

     

    # echo "client,addr=12.12.12.1,port=9999,count=100,verbose" >/proc/krping

          

    You can check the dmesg or /var/log/messages for debug output. Additional command options can be found in the README file within the package.

     

     

    RDMA Troubleshooting

    1. Port counters

    To see port counters use "ethtool -S <device>"

     

    # ethtool -S eth2

    NIC statistics:

        rx_packets: 64610

        rx_bytes: 70319145

        rx_multicast_packets: 573

        rx_broadcast_packets: 1

        rx_errors: 0

        rx_dropped: 0

        rx_length_errors: 0

        rx_over_errors: 0

        rx_crc_errors: 0

        ...
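On a busy port the counter list is long, so it can be useful to show only the error-type counters that are not zero. The sketch below is a heuristic: the helper name nonzero_error_counters is hypothetical, and matching counter names on "error", "drop" or "discard" is an assumption about naming, not a complete list of counters that matter.

```shell
# nonzero_error_counters
# Reads `ethtool -S <iface>` output on stdin and prints NAME=VALUE for
# every counter whose name contains error/drop/discard and whose value is > 0.
nonzero_error_counters() {
    awk -F': *' '
        $1 ~ /(error|drop|discard)/ && $2 + 0 > 0 {
            name = $1
            gsub(/^[ \t]+/, "", name)   # strip the leading indentation
            print name "=" ($2 + 0)
        }
    '
}

# Example:
#   ethtool -S eth2 | nonzero_error_counters
```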

     

    2. Traffic dump

    To capture traffic to a file, use the ibdump command.

    To be able to use ibdump, you need to enable flow steering.

     

    a. To enable flow steering:

        - create the /etc/modprobe.d/mlx4.conf file (if it does not exist) and add this line:

    options mlx4_core log_num_mgm_entry_size=-1

        - restart the driver

    # /etc/init.d/openibd restart

    (Make sure that you still have an IP address configured on the interface)
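Before restarting the driver, a script can confirm that the option line is actually present in the configuration file. A minimal sketch; the helper name flow_steering_configured is hypothetical, and it only inspects the file, not the value the loaded driver is currently using.

```shell
# flow_steering_configured [FILE]
# Exit status 0 if FILE (default: /etc/modprobe.d/mlx4.conf) contains an
# uncommented mlx4_core options line with log_num_mgm_entry_size=-1.
flow_steering_configured() {
    conf=${1:-/etc/modprobe.d/mlx4.conf}
    [ -r "$conf" ] && grep -Eq \
        '^[[:space:]]*options[[:space:]]+mlx4_core[[:space:]]+(.*[[:space:]])?log_num_mgm_entry_size=-1([[:space:]]|$)' \
        "$conf"
}

# Example:
#   flow_steering_configured || echo "flow steering option is missing"
```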

     

    b. Run some RDMA traffic (e.g. ib_send_bw or similar above)

     

    c. Run ibdump to create a *.pcap file.

     

    # ibdump

    Initiating resources ...

    searching for IB devices in host

    Port active_mtu=1024

    MR was registered with addr=0x61b8f0, lkey=0x10010d00, rkey=0x10010d00, flags=0x1

    ------------------------------------------------

    Device                         : "mlx4_0"

    Physical port                  : 1

    Link layer                     : Ethernet

    Dump file                      : sniffer.pcap

    Sniffer WQEs (max burst size)  : 4096

    ------------------------------------------------

     

    Ready to capture (Press ^c to stop):

    Captured:     82133 packets, 88626986 bytes    

     

    Interrupted (signal 2) - exiting ...

     

    Captured:     82133 packets, 88626986 bytes

    # ls

    sniffer.pcap

    #

     

    d. Open the pcap file using wireshark (or similar program)

    In this case, RoCE v1 was used (EtherType 0x8915).
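When reading a capture, the RoCE version can be told apart from the frame headers: RoCE v1 frames carry EtherType 0x8915, while RoCE v2 is carried in IP/UDP with destination port 4791 (the v2 detail is general background, not taken from this post). The sketch below only encodes that rule; the helper name roce_version is hypothetical, and the values are expected to be read off the capture by hand.

```shell
# roce_version ETHERTYPE [UDP_DST_PORT]
# Prints "RoCE v1", "RoCE v2" or "not RoCE" for the given header values.
roce_version() {
    ethertype=$1
    udp_dport=${2:-}
    case "$ethertype" in
        0x8915)
            echo "RoCE v1" ;;
        0x0800|0x86dd|0x86DD)          # IPv4 or IPv6
            if [ "$udp_dport" = "4791" ]; then
                echo "RoCE v2"
            else
                echo "not RoCE"
            fi ;;
        *)
            echo "not RoCE" ;;
    esac
}

# Example:
#   roce_version 0x8915
#   roce_version 0x0800 4791
```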

     
