HowTo Enable RDMA and TCP over SONiC (OCP 2017 Demonstration)

Version 8

    This post describes the demonstration presented at the 2017 Open Compute Summit (OCP) in Santa Clara, CA.

    The purpose of this demonstration was to show two flows (RoCE and TCP) running simultaneously end-to-end over SONiC. For our demonstration we used two Mt. Olympus servers to send and receive the flows.

    Traffic QoS mapping was based on the Differentiated Services Code Point (DSCP) field.

    RDMA traffic was sent with DSCP 4 (mapped to a lossless queue (4)), while TCP traffic was sent with DSCP 0 (mapped to the lossy queue (0)).

    The bandwidth was measured on each of the egress ports used to connect the switches; the total measured bandwidth was 91Gb/s.

    Note: The "limited" bandwidth of 91Gb/s is due to a limitation of the user-mode tools we used to generate the RDMA traffic (for example, ib_write_bw).

     

     

     


    Setup

    • Three Mellanox Spectrum-based Ethernet switches (we used the SN2700 in our example) with SONiC installed.
    • Two Mt. Olympus servers, each with a ConnectX-4 adapter card installed.

     

     

     

    OCP-Topo.PNG

     

     

    Demonstration Video

     

      

     

    Configuration

     

    Connectivity

    In this demonstration we used 100GbE cables of various lengths to connect the servers and switches, as described in the following diagram and table:

     

    OCP-Ports.PNG

     

     

    From Device   Front Panel Port   Interface    To Device   Front Panel Port   Interface   Speed
    ToR-1         9                  Ethernet32   Server A    N/A                N/A         100Gb/s
    ToR-1         5                  Ethernet16   Leaf-1      1                  Ethernet0   100Gb/s
    ToR-2         9                  Ethernet32   Server B    N/A                N/A         100Gb/s
    ToR-2         6                  Ethernet20   Leaf-1      2                  Ethernet4   100Gb/s

     

    OS Details

    The setup used in this demonstration included two Mt. Olympus servers, each with a ConnectX-4 100G card and the following software versions:

                Version
    OS          Windows 2016 Datacenter
    Driver      WinOF-2 1.60
    Firmware    12.18.1000

    Host Configuration

     

    IP Configuration

    1. Click on the Windows icon and type 'adapter' in the search box at the bottom of the pop-up window.


       adapter-setting.PNG

     

    2. Open 'Network and Sharing Center'.

    adapter-setting2.PNG

     

    3. Click on the name of the adapter that is connected to the switch (in this example: 'Ethernet 15').

    adapter-setting3.PNG

     

    4. Click on 'Properties' to display the Ethernet 15 Properties dialog:

    adapter-setting4.PNG

     

    5. Click on 'Internet Protocol Version 4 (TCP/IPv4)' and open its 'Properties' dialog.

     

    Host 1 IP Configuration

    •     'IP Address' : 10.1.1.10
    •     'Subnet mask' : 255.255.255.0
    •     'Default gateway' : 10.1.1.1

     

    Host 2  IP Configuration

    • 'IP Address' : 20.1.1.10
    • 'Subnet mask' : 255.255.255.0
    • 'Default gateway' : 20.1.1.1
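
    The same addressing can also be applied from an elevated PowerShell prompt instead of the GUI. A sketch for Host 2, assuming the adapter alias is 'Ethernet 15' as in the screenshots (adjust the alias, address, and gateway per host):

    PS $  New-NetIPAddress -InterfaceAlias "Ethernet 15" -IPAddress 20.1.1.10 -PrefixLength 24 -DefaultGateway 20.1.1.1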

     

    adapter-setting5.PNG

     

    Get the Driver Key

    1. Display the Device Manager.

    Device-Manager.PNG

     

    2. Right-click a Mellanox network adapter (under the "Network adapters" list) and select Properties. Select the Details tab of the Properties sheet and then, in the Property drop-down, select "Driver key" (shown at the bottom).

    Driver-Key.PNG

    3. Copy the Driver Key value (highlighted below) and verify that it matches the values marked in green in the settings below. If needed, change the settings.

    driver-key-value.PNG

     

     

    Note: The settings below are attached as a PowerShell script (server1-setup.ps). Make sure to update the Driver Key, if needed.

     

    Configuring DSCP to Control PFC for RDMA Traffic

    1. Set the DSCP value used to mark RoCE packets that are assigned to a given CoS, when Priority Flow Control is enabled.

    In this demonstration we are using DSCP 4 to map into a lossless queue (4) in the SN2700 for RDMA traffic.

    PS $  New-ItemProperty -Path HKLM:\SYSTEM\CurrentControlSet\Control\Class\"{4d36e972-e325-11ce-bfc1-08002be10318}"\0002\ -Name "PriorityToDscpMappingTable_4" -PropertyType "String" -Value "4" -Force

     

    2. Map all untagged traffic to the lossless receive queue. The default is 0x0; for DSCP-based PFC, set it to 0x1.

    PS $  New-ItemProperty -Path HKLM:\SYSTEM\CurrentControlSet\Control\Class\"{4d36e972-e325-11ce-bfc1-08002be10318}"\0002\ -Name "RxUntaggedMapToLossless" -PropertyType "String" -Value "1" -Force

     

    3. Do not add an 802.1Q tag to transmitted packets that are assigned an 802.1p priority but no non-zero VLAN ID (that is, priority-tagged packets). The default is 0x0; for DSCP-based PFC, set it to 0x1.

    PS $  New-ItemProperty -Path HKLM:\SYSTEM\CurrentControlSet\Control\Class\"{4d36e972-e325-11ce-bfc1-08002be10318}"\0002\ -Name "TxUntagPriorityTag" -PropertyType "String" -Value "1" -Force
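
     

    The three registry values above can be read back to confirm they were applied; a quick check, assuming the same device instance path (0002) as in the commands above:

    PS $  Get-ItemProperty -Path HKLM:\SYSTEM\CurrentControlSet\Control\Class\"{4d36e972-e325-11ce-bfc1-08002be10318}"\0002\ | Select-Object PriorityToDscpMappingTable_4, RxUntaggedMapToLossless, TxUntagPriorityTag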

     

    4. Create a Quality of Service (QoS) policy and tag each type of traffic with the relevant priority.

    In this example we used RDMA on port 50000 with a CoS value of 4.

    PS $  New-NetQosPolicy -Name "RDMA_Policy01" -NetDirectPortMatchCondition 50000 -PriorityValue8021Action 4
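
    To confirm the policy was created, it can be read back:

    PS $  Get-NetQosPolicy -Name "RDMA_Policy01"
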

    Configuring Quality of Service for RDMA Traffic

    1. Remove any previous settings (if they exist).

    PS $  Remove-NetQosTrafficClass

    PS $  Write-host "Removed All Traffic Classes"

    PS $  Remove-NetQosPolicy -Confirm:$False -policystore ActiveStore

    PS $  Write-host "Removed all Network QOS policy"

     

    2. Install the Data Center Bridging (DCB) feature as follows:

    PS $  Install-WindowsFeature "data-center-bridging"

     

    3. Import the PowerShell modules that are required to configure DCB.

    PS $  import-module netqos

    PS $  import-module dcbqos

    PS $  import-module netadapter

     

    4. Enable Network Adapter QoS.

    PS $  Set-NetAdapterQos -Name "*" -Enabled $True

     

    5. Enable Priority Flow Control (PFC) on priorities 3 and 4.

    PS $  Enable-NetQosFlowControl -Priority 3,4
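
     

    The per-priority PFC state can be verified, and PFC can optionally be disabled on the remaining priorities; a sketch:

    PS $  Get-NetQosFlowControl

    PS $  Disable-NetQosFlowControl -Priority 0,1,2,5,6,7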

     

    Note: In order for the changes to take effect, restart the network adapter after changing the registry keys. The settings in this section are attached as a PowerShell script (server1-setup.ps). Make sure to update the Driver Key, if needed.
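
    The adapter restart can also be done from PowerShell; a sketch, assuming the adapter alias 'Ethernet 15' used earlier in this post:

    PS $  Restart-NetAdapter -Name "Ethernet 15"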

     

    Switch Configuration

    ToR-1 configuration

    1. Log into Quagga:

    ~$ vtysh

    Hello, this is Quagga (version 0.99.24.1).

    Copyright 1996-2005 Kunihiro Ishiguro, et al.

     

    2. Enter the router instance configuration mode:

    # conf terminal

    conf terminal

    (config)# router bgp 11

    router bgp 11

     

    3. Configure BGP so that it can redistribute the connected route (for better visibility and debugging purposes):

    (config-router)# redistribute connected

    redistribute connected

    (config-router)# end

    end

    ToR-2 Configuration

    1. Log into Quagga:

    ~$ vtysh

    Hello, this is Quagga (version 0.99.24.1).

    Copyright 1996-2005 Kunihiro Ishiguro, et al.

     

    2. Enter the router instance configuration mode:

    # conf terminal

    conf terminal

    (config)# router bgp 12

    router bgp 12

     

    3. Configure BGP so that it can redistribute the connected route (for better visibility and debugging purposes):

    (config-router)# redistribute connected

    redistribute connected

    (config-router)# end

     

    Leaf-1 Configuration

    1. Log into Quagga:

    ~$ vtysh

    Hello, this is Quagga (version 0.99.24.1).

    Copyright 1996-2005 Kunihiro Ishiguro, et al.

     

    2. Enter the router instance configuration mode:

    # conf terminal

    conf terminal

    (config)# router bgp 1

    router bgp 1

     

    3. Configure BGP so that it can redistribute the connected route (for better visibility and debugging purposes):

    (config-router)# redistribute connected

    redistribute connected

    (config-router)# end
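
     

    On each of the three switches, the resulting BGP sessions and redistributed routes can then be checked from vtysh; for example:

    ~$ vtysh -c "show ip bgp summary"

    ~$ vtysh -c "show ip route bgp"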

     

    Verification

    Running TCP/RDMA Traffic Flows

    The demonstration consists of two main flows (RDMA and TCP) running from a Host 1 ("Client") to Host 2 ("Server").

     

    RDMA Flow

    For the RDMA traffic we used nd_write_bw with the following parameters:

    •   -D <test duration in seconds>
    •   -S <server interface IP>   (either IPv4 or IPv6)
    •   -C <server interface IP>   (either IPv4 or IPv6)
    •   -p <port>   Listen on/connect to port <port> (default 6830)

     

    On the Server (Host 2):

    C:>start nd_write_bw -D 3600 -S 20.1.1.10 -p 50000

    On the Client (Host 1):

    C:>nd_write_bw -D 3600 -C 20.1.1.10 -p 50000

     

    See the following attached scripts: rdma_client.bat and rdma_server.bat.

     

    TCP Flow

    For the TCP traffic we used iperf3:

     

    On the Server (Host 2):

    C:>start iperf3 -s -p 5101

     

    On the Client (Host 1):

    C:>start iperf3 -c 20.1.1.10 -p 5101 -P 8 -t 3600

    See the attached scripts: iperf_client.bat and iperf_server.bat.

    Note: The scripts execute a number of iperf3 instances.

     

    Results

    1. To check the real-time traffic, run the 'sx_api_port_tc_dump.py' script in the 'syncd' container on a transmitting port.

    In this demonstration we are sending traffic from Host 1 ("Client") to Host 2 ("Server"). The transmitting ports on the switches are as follows:

    Switch   Port   Ethernet     Port_num
    ToR-1    5      Ethernet16   0x13500
    Leaf-1   2      Ethernet4    0x13f00
    ToR-2    9      Ethernet32   0x12d00

     

    2. Enter the 'syncd' Docker container, which contains the SDK APIs:

    admin@sonic:~$ sudo docker exec -it syncd bash

     

    3. Run the following script:

    sx_api_port_tc_dump.py <frame_size> <interval(Sec)> <duration(Sec)> <port_num> <tc#> <tc#> <T>

    root@sonic:/# sx_api_port_tc_dump.py 4096 1 3600 0x13500 0 4 T