HowTo Configure RoCE over LAG (ConnectX-4)

Version 5

    This post discusses the usage and testing of RoCE over a Link Aggregation (LAG) interface for ConnectX-4 adpaters.

     

    References

     

    Background

    RoCE LAG is a feature meant for mimicking Ethernet bonding for IB devices and is available for dual port cards only.

     

    The LAG mode is entered when both the Ethernet interfaces that belong to the same card are only slaves to the same bond interface and the bonding mode is one of the following:

    • active-backup (mode 1)
    • balance-xor (mode 2)
    • 802.3ad (LACP) (mode 4)

     

    If there is any change of the bond configuration that negates one of the above rules (that is,  bonding mode is not 1, 2 or 4, or both Ethernet interfaces that belong to the same card are not the only slaves of the bond interface),  you will exit RoCE LAG mode and return to a normal InfiniBand (IB) device per port configuration.

     

    When in RoCE LAG mode, instead of having an IB device per physical port (for example mlx5_0 and mlx5_1), only one IB device will be present for both ports with 'bond' appended to its name (for example mlx5_bond_0). This device provides an aggregation of both IB ports, just as the bond interface provides an aggregation of both Ethernet interfaces.

     

    The method used by the IB bond device to distribute traffic depends on the Ethernet bond interface's mode, as follows:

    • active-backup (mode 1): QPs are always assigned to the active slave's physical port (chosen by the bond interface). When the active slave changes, all QPs are moved to the new active slave. Note that in order for the bond driver to act on link-down events and set the other interface to be the active slave, it must be configured with the "miimon" option (for example "miimon = 100").
    • balance-xor (mode 2) and 802.3ad (LACP) (mode 4): QPs are assigned to physical ports in a round-robin manner. Each time a QP is moved from RESET to the INIT state, it is assigned to one of the two physical ports at random.

     

    Note: The connected switch or (server) should be configured on the same way.

     

    Setup

    • Use two servers - each equipped with a dual port Mellanox ConnectX-4 adapter card, connected through a switch.
    • Configure the ports in Ethernet mode.
    • Install the MLNX_OFED Rel. 4.0 or later on both servers.
    • Use Kernel 4.9 or later.

     

     

    Configuration Example

    1. Make sure bonding is enabled on the server. Follow the distribution OS manuals to create a bond0 interface, see here for bonding example over RHEL OS.

     

    2. Set the bond0 interface on both servers, edit /etc/sysconfig/network-scripts/ifcfg-bond0 as follows:

    DEVICE=bond0

    NAME=bond0

    TYPE=bond

    BONDING_MASTER=yes

    IPADDR=22.22.22.6        #the other server should have different IP on the same subnet

    PREFIX=24

    BOOTPROTO=none

    ONBOOT=yes

    NM_CONTROLLED=no

    BONDING_OPTS="mode=active-backup miimon=100 updelay=100 downdelay=100"

     

    In this example, the bond mode is active-backup.

     

    3. Set the physical port of the Mellanox ConnectX-4 adapter, edit /etc/sysconfig/network-scripts/ifcfg-ens817.

    DEVICE=ens817

    TYPE=Ethernet

    ONBOOT=yes

    MASTER=bond0

    SLAVE=yes

    BOOTPROTO=none

     

    4. Set the other physical port of Mellanox ConnectX-4 adapter, edit /etc/sysconfig/network-scripts/ifcfg-ens817d1.

    DEVICE=ens817d1

    TYPE=Ethernet

    ONBOOT=yes

    MASTER=bond0

    SLAVE=yes

    BOOTPROTO=none

     

    5. Restart the port (or the driver)

    # ifdown bond0

    # ifup bond0

     

    or

    # /etc/init.d/network restart

    # /etc/init.d/openibd restart

     

     

    6. Make sure that you can ping between the two servers using the bond0 interface.

     

    7. Check the current status of bond0:

    # cat /proc/net/bonding/bond0

    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

     

    Bonding Mode: fault-tolerance (active-backup)

    Primary Slave: None

    Currently Active Slave: ens817

    MII Status: up

    MII Polling Interval (ms): 100

    Up Delay (ms): 100

    Down Delay (ms): 100

     

    Slave Interface: ens817

    MII Status: up

    Speed: 40000 Mbps

    Duplex: full

    Link Failure Count: 6

    Permanent HW addr: e4:1d:2d:26:3c:e1

    Slave queue ID: 0

     

     

    Slave Interface: ens817d1

    MII Status: up

    Speed: 40000 Mbps

    Duplex: full

    Link Failure Count: 6

    Permanent HW addr: e4:1d:2d:26:3c:e2

    Slave queue ID: 0

     

    If you use bond mode 802.3ad, your output looks like the example:

    # cat /proc/net/bonding/bond0

    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

     

    Bonding Mode: IEEE 802.3ad Dynamic link aggregation

    Transmit Hash Policy: layer2 (0)

    MII Status: up

    MII Polling Interval (ms): 100

    Up Delay (ms): 100

    Down Delay (ms): 100

     

    802.3ad info

    LACP rate: slow

    Min links: 0

    Aggregator selection policy (ad_select): stable

    Active Aggregator Info:

      Aggregator ID: 1

      Number of ports: 2

      Actor Key: 1

      Partner Key: 1

      Partner Mac Address: e4:1d:2d:26:33:31

     

    Slave Interface: ens817

    MII Status: up

    Speed: 40000 Mbps

    Duplex: full

    Link Failure Count: 0

    Permanent HW addr: e4:1d:2d:26:3c:e1

    Aggregator ID: 1

    Slave queue ID: 0

     

    Slave Interface: ens817d1

    MII Status: up

    Speed: 40000 Mbps

    Duplex: full

    Link Failure Count: 0

    Permanent HW addr: e4:1d:2d:26:3c:e2

    Aggregator ID: 1

    Slave queue ID: 0

     

     

    8. Run one of the perftests tools to check RDMA, for example use ib_send_bw.

     

    Run the ib_send_bw server on one host:

    #ib_send_bw -D60  -f --report_gbits

     

    and the ib_send_bw client on the other host:

    #ib_send_bw 22.22.22.6 -D60  -f --report_gbits

     

    Testing Procedure

    1. Use a simple testing procedure for active-backup mode, which involves running RoCE traffic over a bond0 interface. Within that time, close and open the ports to see that the connection is still open and traffic is running (while the traffic is being toggled between the ports).

     

    2. For active-active bonding mode it could be more tricky, as you will need to create multiple RoCE flows, each of which is load-balanced from a different port.