This is an archived document. Please refer to the more recent knowledge base articles on Recommended Network Configuration Examples for RoCE Deployment
This post shows several ways to test that RDMA is running smoothly and provides several troubleshooting guidelines. It applies to networks with either an Ethernet (RoCE) or an InfiniBand link layer.
This post is based on HowTo Setup RDMA Connection using Inbox Driver (RHEL, Ubuntu) with some additions and updates.
References
- What is RDMA?
- HowTo Change Port Type in Mellanox ConnectX-3 Adapter
- HowTo Find the Logical-to-Physical Port Mapping (Linux)
- Mellanox Linux Driver Modules Relationship (MLNX_OFED)
- HowTo Setup RDMA Connection using Inbox Driver (RHEL, Ubuntu)
Setup
- Make sure you have two servers equipped with Mellanox ConnectX-3 or ConnectX-3 Pro adapter cards.
- (Optional) Connect the two servers via an Ethernet switch. When using RoCE, you can use an access port (VLAN 1 by default).
RDMA Drivers
Installing the latest MLNX_OFED is recommended; however, it is also possible to use the RDMA inbox drivers.
For RHEL/CentOS Installation:
# yum -y groupinstall "InfiniBand Support"
# yum -y install perftest infiniband-diags
Make sure that RDMA is enabled on boot (RHEL7/CentOS7)
# dracut --add-drivers "mlx4_en mlx4_ib mlx5_ib" -f
# service rdma restart
# systemctl enable rdma
Make sure that RDMA is enabled on boot (RHEL6/CentOS6)
# service rdma restart ; chkconfig rdma on
For Ubuntu Installation:
Run the following installation commands on both servers:
# apt-get install libmlx4-1 infiniband-diags ibutils ibverbs-utils rdmacm-utils perftest
For tgt target support install:
# apt-get install tgt
For LIO target support install:
# apt-get install targetcli
For iscsi client install:
# apt-get install open-iscsi-utils open-iscsi
Port type configuration:
Follow this post to configure the port type.
HowTo Change Port Type in Mellanox ConnectX-3 Adapter
Configure port parameters:
In order to find the exact mapping between the interface name and the actual adapter and port number, follow this post
HowTo Find the Logical-to-Physical Port Mapping (Linux)
Configure IP Address and enable the port.
This can be done with console commands such as the following, with network configuration scripts, or by any other method.
On the first server:
# ifconfig eth2 12.12.12.1/24 up
On the second server:
# ifconfig eth2 12.12.12.2/24 up
Kernel Modules:
Make sure that the InfiniBand kernel modules are enabled. See this post:
Mellanox Linux Driver Modules Relationship (MLNX_OFED)
Lossless Network:
When RDMA runs over Ethernet (also known as RoCE), you need to make sure that the network is configured to be lossless, which means that either flow control (FC) or priority flow control (PFC) is enabled on the adapter ports and the switch.
For more info refer to Network Considerations for Global Pause, PFC and QoS with Mellanox Switches and Adapters.
For basic RDMA testing (lab environment), Global Pause Flow Control (per port) may be sufficient. For production environments, PFC is preferred.
Global Pause Flow Control
In a lab environment or a small setup, you can use this method to create a lossless environment.
To check the global pause configuration, use the following command (it is normally enabled by default).
# ethtool -a eth2
Pause parameters for eth2:
Autonegotiate: off
RX: on
TX: on
In case it is disabled, run:
# ethtool -A eth2 rx on tx on
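The check above can also be scripted. The following is a minimal sketch, assuming the `ethtool -a` output has the same `RX:` / `TX:` layout as the sample shown here; the function name `pause_state` is hypothetical and not part of any Mellanox tool.

```shell
#!/bin/sh
# pause_state: read `ethtool -a <iface>` output on stdin and print
# "enabled" only when both the RX and TX pause lines report "on".
pause_state() {
    awk '
        $1 == "RX:" { rx = $2 }
        $1 == "TX:" { tx = $2 }
        END {
            if (rx == "on" && tx == "on") print "enabled"
            else print "disabled"
        }
    '
}

# Demo on the sample output from this post. On a live host, pipe the real
# command instead:  ethtool -a eth2 | pause_state
pause_state <<'EOF'
Pause parameters for eth2:
Autonegotiate: off
RX: on
TX: on
EOF
```

If the function prints "disabled", run the `ethtool -A` command above and check again.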
Important: make sure that Global Pause Flow Control is also enabled on the relevant switch ports.
On a Mellanox switch (MLNX-OS), use the following commands to enable it:
switch (config) # interface ethernet 1/1 flowcontrol receive on force
switch (config) # interface ethernet 1/1 flowcontrol send on force
If you use other switches, refer to the switch vendor user manual (the commands are similar).
PFC
For PFC configuration on the adapter refer to the following posts:
- HowTo Run RoCE over L2 Enabled with PFC
- HowTo Run RoCE and TCP over L2 Enabled with PFC
- HowTo Enable PFC on Mellanox Switches (SwitchX)
PFC configuration for other third-party switch vendors is located under Solutions.
RoCE version
Refer to RoCE v2 Considerations
Make sure the same RoCE version is running end to end on all relevant servers.
----
Test your setup at this point
1. Verify that all relevant ports are in Up state (link is up)
2. Check L3 IP connectivity (e.g. ping is running)
3. Make sure that the network is configured to be lossless (either flow control or PFC)
4. Make sure that you have the same RoCE version on the relevant servers
5. Make sure that the iptables service is stopped. If it is running, it is likely that host firewall rules are blocking the TCP/IP connection.
6. Continue to the next section - RDMA Verification
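The first item of the checklist (link state) can be checked from a script. This is a rough sketch, assuming a Linux host that exposes the interface state under /sys/class/net/<iface>/operstate. The function names `link_state` and `precheck` are hypothetical, and the optional second argument (an alternative sysfs root) exists only so the sketch can be tried without real hardware.

```shell
#!/bin/sh
# link_state: print the kernel's operational state ("up", "down", ...) for
# an interface, or "missing" if the interface is not present.
# $1 = interface name, $2 = sysfs root (hypothetical override, default /sys)
link_state() {
    iface=$1
    sysfs_root=${2:-/sys}
    state_file="$sysfs_root/class/net/$iface/operstate"
    if [ -r "$state_file" ]; then
        cat "$state_file"
    else
        echo "missing"
    fi
}

# precheck: return 0 only when the link is up, and print a one-line verdict.
precheck() {
    state=$(link_state "$1" "$2")
    if [ "$state" = "up" ]; then
        echo "$1: link up"
    else
        echo "$1: link not up (state: $state)"
        return 1
    fi
}

# Example on a live host (addresses as used in this post):
#   precheck eth2 && ping -c 3 12.12.12.1
```

The remaining items (lossless configuration, RoCE version, firewall) still need to be verified separately.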
RDMA Verification
1. udaddy
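Run the following on one of the servers (acts as a server):
# udaddy
Run the following on the other server (acts as a client):
# udaddy -s 12.12.12.1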
udaddy: starting client
udaddy: connecting
initiating data transfers
receiving data transfers
data transfers complete
test complete
return status 0
2. rdma_server, rdma_client commands
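Run the following on one of the servers (acts as a server):
# rdma_server
Run the following on the other server (acts as a client):
# rdma_client -s 12.12.12.1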
rdma_client: start
rdma_client: end 0
3. ib_send_bw (performance test)
Run a performance test such as ib_send_bw, ib_read_bw or similar.
For example:
Run the following command on one server (acts as a server):
# ib_send_bw -d mlx4_0 -i 1 -F --report_gbits
Run the following command on the second server (acts as a client):
# ib_send_bw -d mlx4_0 -i 1 -F --report_gbits 12.12.12.1
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC
RX depth : 512
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
Gid index : 0
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0065 PSN 0xc8f367
GID: 254:128:00:00:00:00:00:00:246:82:20:255:254:23:27:129
remote address: LID 0000 QPN 0x005d PSN 0x884d7d
GID: 254:128:00:00:00:00:00:00:246:82:20:255:254:23:31:225
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 1000 0.00 36.40 0.069428
---------------------------------------------------------------------------------------
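To compare runs, it can help to pull the average bandwidth out of the report. The sketch below assumes the results row has the five columns shown in the sample output (#bytes, #iterations, BW peak, BW average, MsgRate). The function names `avg_bw` and `bw_at_least` are hypothetical, and the minimum value is whatever you consider acceptable for your link.

```shell
#!/bin/sh
# avg_bw: read ib_send_bw output on stdin and print the "BW average" column
# (4th field) of the first results row, i.e. a row of five fields that
# starts with a plain integer message size.
avg_bw() {
    awk 'NF == 5 && $1 ~ /^[0-9]+$/ && $4 ~ /^[0-9.]+$/ { print $4; exit }'
}

# bw_at_least: succeed when the measured average ($1) meets a minimum ($2),
# both in the unit reported by the test (Gb/sec with --report_gbits).
bw_at_least() {
    awk -v got="$1" -v min="$2" 'BEGIN { exit !(got + 0 >= min + 0) }'
}

# Demo on the results row from the sample output in this post.
bw=$(printf '%s\n' '65536 1000 0.00 36.40 0.069428' | avg_bw)
echo "average: $bw Gb/sec"
```

On a live setup the client command can be piped directly, for example `ib_send_bw -d mlx4_0 -i 1 -F --report_gbits 12.12.12.1 | avg_bw`.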
4. rping
This test covers RDMA_CM RC connections in user space only. It establishes a set of reliable RDMA connections between two nodes using librdmacm, optionally transfers data between the nodes, and then disconnects.
Run the following on one of the servers (acts as the rping server):
# rping -s -C 10 -v
Run the following on the other server (acts as the rping client):
# rping -c -a 12.12.12.1 -C 10 -v
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
client DISCONNECT EVENT...
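The verbose client output above prints one "ping data" line per iteration, so the line count can be compared with the value passed to -C. This is a small sketch under that assumption; the function names `rping_iterations` and `rping_ok` are hypothetical.

```shell
#!/bin/sh
# rping_iterations: count the "ping data: rdma-ping-N:" lines that the
# rping client prints in verbose mode (-v).
rping_iterations() {
    grep -c '^[[:space:]]*ping data: rdma-ping-[0-9][0-9]*:'
}

# rping_ok: read client output on stdin and succeed only when the number of
# "ping data" lines equals the expected iteration count ($1).
rping_ok() {
    expected=$1
    seen=$(rping_iterations)
    [ "$seen" -eq "$expected" ]
}

# Example on a live setup:
#   rping -c -a 12.12.12.1 -C 10 -v | rping_ok 10 && echo "rping passed"
```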
5. ucmatose
This test covers RDMA_CM RC connections in user space only (same as rping). It establishes a set of reliable RDMA connections between two nodes using librdmacm, optionally transfers data between the nodes, and then disconnects.
Run the following on one of the servers (acts as a server):
# ucmatose
Run the following on the other server (acts as a client):
# ucmatose -s 12.12.12.1
cmatose: starting client
cmatose: connecting
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
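Both udaddy and ucmatose end their client output with a "return status" line, as shown in the samples in this post. The following sketch extracts that value so a wrapper script can act on it. The function name `cm_test_status` is hypothetical, and the sketch assumes the line format matches the samples.

```shell
#!/bin/sh
# cm_test_status: read udaddy or ucmatose client output on stdin and print
# the value from its last "return status N" line. Prints "unknown" when no
# such line is present (for example, if the tool stopped early).
cm_test_status() {
    awk '
        $1 == "return" && $2 == "status" { status = $3; found = 1 }
        END { if (found) print status; else print "unknown" }
    '
}

# Demo on the sample ucmatose output from this post.
cm_test_status <<'EOF'
cmatose: starting client
cmatose: connecting
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
EOF
```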
6. krping
The krping module is a loadable kernel module that uses the OpenFabrics verbs to implement a client/server ping/pong program.
This module should be unpacked and compiled on both servers.
[Note: The package can be downloaded from here]
# cd /tmp
# tar xvzf krping.tgz
...
# cd krping
# make
...
# make install
...
# modinfo rdma_krping
filename: /lib/modules/3.10.0-123.el7.x86_64/extra/rdma_krping.ko
license: Dual BSD/GPL
description: RDMA ping server
author: Steve Wise
srcversion: C4533E67F73469BA240B78D
depends: ib_core,rdma_cm
vermagic: 3.10.0-123.el7.x86_64 SMP mod_unload modversions
parm: debug:Debug level (0=none, 1=all) (int)
# modprobe rdma_krping debug=1
Run the following on one of the servers (acts as a server):
# echo "server,addr=12.12.12.1,port=9999,verbose" > /proc/krping
Run the following on the other server (acts as a client):
# echo "client,addr=12.12.12.1,port=9999,count=100,verbose" > /proc/krping
You can check the dmesg or /var/log/messages for debug output. Additional command options can be found in the README file within the package.
RDMA Troubleshooting
1. Port counters
To see port counters use "ethtool -S <device>"
# ethtool -S eth2
NIC statistics:
rx_packets: 64610
rx_bytes: 70319145
rx_multicast_packets: 573
rx_broadcast_packets: 1
rx_errors: 0
rx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
...
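The counter list is long, so it can be useful to show only the error and drop counters that are not zero. The sketch below assumes the `name: value` layout shown in the sample output and selects counters by name ("err" or "drop"), which is an illustrative choice rather than a complete list of counters worth watching. The function name `nonzero_errors` is hypothetical.

```shell
#!/bin/sh
# nonzero_errors: read `ethtool -S <device>` output on stdin and print only
# counters whose name contains "err" or "drop" and whose value is not zero.
nonzero_errors() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }   # strip the indentation ethtool adds
        ($1 ~ /err/ || $1 ~ /drop/) && $2 + 0 != 0 { print $1 ": " $2 }
    '
}

# Example on a live host:
#   ethtool -S eth2 | nonzero_errors
```

An empty result means none of the selected counters has incremented.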
2. Traffic dump
To capture traffic, use the ibdump command.
To be able to use ibdump, you need to enable flow steering.
a. To enable flow steering:
- Create the /etc/modprobe.d/mlx4.conf file if it does not exist and add this line:
options mlx4_core log_num_mgm_entry_size=-1
- Restart the driver:
# /etc/init.d/openibd restart
(Make sure that you still have an IP address configured on the interface.)
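Step a can be made repeatable so that running it twice does not duplicate the option line. This is a minimal sketch with a hypothetical helper named `ensure_line`. It only edits the file; the driver restart shown above is still needed afterwards.

```shell
#!/bin/sh
# ensure_line: append a line ($2) to a file ($1) only if that exact line is
# not already present, and report what was done.
ensure_line() {
    file=$1
    line=$2
    if [ -f "$file" ] && grep -qxF -- "$line" "$file"; then
        echo "already present"
    else
        printf '%s\n' "$line" >> "$file"
        echo "added"
    fi
}

# Example (as root), using the file and option from this post:
#   ensure_line /etc/modprobe.d/mlx4.conf \
#       'options mlx4_core log_num_mgm_entry_size=-1'
```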
b. Run some RDMA traffic (e.g. ib_send_bw or similar, as above).
c. Run ibdump to create a *.pcap file.
# ibdump
Initiating resources ...
searching for IB devices in host
Port active_mtu=1024
MR was registered with addr=0x61b8f0, lkey=0x10010d00, rkey=0x10010d00, flags=0x1
------------------------------------------------
Device : "mlx4_0"
Physical port : 1
Link layer : Ethernet
Dump file : sniffer.pcap
Sniffer WQEs (max burst size) : 4096
------------------------------------------------
Ready to capture (Press ^c to stop):
Captured: 82133 packets, 88626986 bytes
Interrupted (signal 2) - exiting ...
Captured: 82133 packets, 88626986 bytes
# ls
sniffer.pcap
#
d. Open the pcap file using Wireshark (or a similar program).
In this case RoCE v1 was used (EtherType 0x8915).
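When reading a capture, the RoCE version can be told apart from two header fields: RoCE v1 frames carry EtherType 0x8915, while RoCE v2 is carried over UDP/IP with destination port 4791. The sketch below encodes that rule. The function name `roce_version` is hypothetical, and the field values are assumed to be read by hand from the capture.

```shell
#!/bin/sh
# roce_version: classify a captured frame from two header fields.
#   $1  EtherType in hex (e.g. 0x8915)
#   $2  UDP destination port, if the frame carries UDP (optional)
roce_version() {
    ethertype=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    udp_dport=${2:-}
    case "$ethertype" in
        0x8915) echo "RoCE v1" ;;
        0x0800|0x86dd)
            # IPv4 or IPv6: RoCE v2 only if the UDP destination port is 4791
            if [ "$udp_dport" = "4791" ]; then
                echo "RoCE v2"
            else
                echo "not RoCE"
            fi ;;
        *) echo "not RoCE" ;;
    esac
}

# The capture in this post corresponds to:
roce_version 0x8915
```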