3 Replies Latest reply on Jun 15, 2014 2:41 AM by ophirmaor

    Lossless Ethernet for RDMA over Converged Ethernet (RoCE)

      Hi all,

       

      I have several ConnectX-3 VPI adapters (model MCX354A-FCBT) and an SX1024 switch. The adapters are all in Ethernet mode and RoCE nominally appears to be working (based on, e.g., ib_send_bw and some MPI testing).  I am using MLNX_OFED version 2.2-1.0.1 on the hosts and MLNX-OS image 3.3.5006 on the switch. Rather than configuring Priority Flow Control (PFC) and VLANs, I have just enabled pause support on the adapters (via ethtool -A) and on the associated switch ports (via setting flowcontrol receive and send to on).

       

      While this appears to be working, I am not clear on how strong a guarantee this provides for lossless Ethernet (at layer 2). Does enabling pause / flow control guarantee that there will never be an overflow on the adapter or switch, which would manifest as dropped packets? I was under the impression that a pause frame requests a halt in transmission for a specified amount of time, but I could imagine instances where, e.g., that delay turns out to be too short to clear a buffer elsewhere. If lossless transfer is guaranteed, what portion of the stack (switch and adapter firmware, switch software, adapter driver) enforces such a guarantee?
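For reference on the "specified amount of time" above: an IEEE 802.3x PAUSE frame carries a 16-bit pause time counted in quanta of 512 bit times, so the longest pause a single frame can request depends on the link speed. The receiver can send another PAUSE frame before the timer expires to extend the pause, and a pause time of 0 tells the sender to resume. A minimal sketch of the arithmetic follows; the function name and the link rates in the loop are chosen for illustration and are not taken from any Mellanox tool.

```python
# Illustrative arithmetic only; names here are hypothetical helpers.
PAUSE_QUANTUM_BITS = 512   # one pause quantum = 512 bit times (IEEE 802.3x)
MAX_PAUSE_QUANTA = 0xFFFF  # the pause_time field is 16 bits wide


def pause_duration_seconds(quanta, link_bps):
    """Return how long a PAUSE frame with `quanta` halts a link of `link_bps`."""
    if not 0 <= quanta <= MAX_PAUSE_QUANTA:
        raise ValueError("pause quanta must fit in 16 bits")
    return quanta * PAUSE_QUANTUM_BITS / link_bps


if __name__ == "__main__":
    # Link rates below are examples, not measurements from this setup.
    for name, rate in (("10 GbE", 10e9), ("40 GbE", 40e9)):
        t = pause_duration_seconds(MAX_PAUSE_QUANTA, rate)
        print(f"{name}: longest pause per frame = {t * 1e6:.1f} us")
```

Because the pause can be refreshed or cancelled, the timer alone does not bound how long a sender stays paused; it only bounds how long one frame can hold it.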

       

      Regards,

      Thomas Benson

        • Re: Lossless Ethernet for RDMA over Converged Ethernet (RoCE)
          ophirmaor

          Hi Thomas,

          The switch and the adapter both have XON and XOFF thresholds (when to send the "pause sending" or the "continue sending" message). The thresholds are set so that the remaining buffer space can absorb the traffic that arrives between the moment the pause frame is sent and the moment the sender actually stops.

          In other words, once you enable Flow control globally on the port, you should not have any drops on the link.
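To make the threshold idea concrete, here is a rough sketch of how much buffer space must remain free above the XOFF threshold. It is a simplified model and not the values or formula used by the switch or adapter firmware. The assumptions are: signal propagation of about 5 ns per metre of cable, one maximum-size frame already being transmitted on each side when the pause is generated and when it is received, and an optional sender response time passed in as a parameter. All names are hypothetical.

```python
# Simplified headroom model; every constant and name here is an assumption.
PROPAGATION_S_PER_M = 5e-9  # assumed signal propagation delay per metre


def headroom_bytes(link_bps, cable_m, max_frame_bytes, response_s=0.0):
    """Estimate bytes that can still arrive after XOFF is triggered."""
    bytes_per_s = link_bps / 8
    # The pause frame travels to the sender, and data already on the
    # wire keeps arriving for one further cable length.
    round_trip_s = 2 * cable_m * PROPAGATION_S_PER_M
    in_flight = (round_trip_s + response_s) * bytes_per_s
    # One frame may delay the outgoing pause frame, and the sender may
    # have just started another frame when the pause frame arrives.
    return in_flight + 2 * max_frame_bytes


if __name__ == "__main__":
    # Example inputs only: 40 Gb/s link, 10 m cable, 1500-byte frames.
    print(f"{headroom_bytes(40e9, 10, 1500):.0f} bytes of headroom")
```

If the free space above XOFF is at least this large, the buffer does not overflow in this model, which is the sense in which the thresholds make the link lossless.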

           

          Thanks,

          Ophir.

            • Re: Lossless Ethernet for RDMA over Converged Ethernet (RoCE)

              Thanks ophirmaor, that's helpful. I guess my question is whether this is a solution that almost always works or one that always works. For example, what happens if the kernel is pushing packets to the card while it has been paused by the switch? Some number of frames could be buffered, but, e.g., does the driver block if the kernel attempts to push a frame that would overflow the transmit ring?

               

              Thanks,

              Thomas

                • Re: Lossless Ethernet for RDMA over Converged Ethernet (RoCE)
                  ophirmaor

                  Hi Thomas,

                  When the adapter needs more data, it fetches the data from the TX buffer. But if the port is in the paused state, the fetch won't happen if no memory is available in the TX buffer.

                  Another issue: note that a packet can be sent correctly but still reach the switch malformed.

                  Those packets will be dropped. So there is still a chance of dropped packets, and the application should handle it.

                   

                  Ophir.