Understanding PCIe Configuration for Maximum Performance

Version 3

    This post discusses PCIe configuration for Mellanox adapters.




    Why do we use PCIe?

    PCIe is used in any system for communication between different modules. Network adapters need to communicate with the CPU and memory (among other modules). To process network traffic efficiently, the devices that communicate over PCIe must therefore be well configured. When a network adapter is connected to a PCIe slot, the link auto-negotiates the maximum capabilities supported by both the network adapter and the CPU.




    PCIe Attributes

    Any PCIe device comes with certain attributes, some of which are critical for performance. The device's PCIe attributes are set by negotiation between the system's and the device's capabilities, which selects the highest value that both can support. Below, you can find an explanation of the relevant PCIe attributes, how to verify them, and their effect on performance.


    PCIe Width

    PCIe width determines the number of PCIe lanes that can be used in parallel by the device for communication. The width is marked as xA, where A is the number of lanes (e.g. x8 for 8 lanes). Mellanox adapters support x8 and x16 configurations, depending on their type.

    To verify the PCIe width, use the command lspci.


    In this example, a Mellanox adapter is installed at PCI address 04:00.0.

    # lspci -s 04:00.0 -vvv | grep Width

                 LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited

                 LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-


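    As you can see, lspci reports the capabilities the device advertised (under LnkCap) and the current link status (under LnkSta), which holds the values that were actually negotiated. If LnkSta shows a lower speed or width than LnkCap, the link is running below what the device supports.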
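
    The comparison between LnkCap and LnkSta can be scripted. The following is a minimal sketch, not an official tool: check_link is a hypothetical helper name, and it assumes the lspci output format shown above (a "Speed ...GT/s" and a "Width x..." field on each of the two lines).

```shell
# check_link (hypothetical helper): read `lspci -vvv` output for one device
# on stdin and compare the negotiated link (LnkSta) with the advertised
# capability (LnkCap). Prints a one-line verdict; exit status is 0 when they
# match, 1 when the link is degraded, 2 when the lines were not found.
check_link() {
    awk '
        function field(line, re, skip) {
            # Return the text matched by re, minus its leading label of length skip.
            if (match(line, re)) return substr(line, RSTART + skip, RLENGTH - skip)
            return ""
        }
        /LnkCap:/ { cap = field($0, "Speed [0-9.]+GT/s", 6) " " field($0, "Width x[0-9]+", 6) }
        /LnkSta:/ { sta = field($0, "Speed [0-9.]+GT/s", 6) " " field($0, "Width x[0-9]+", 6) }
        END {
            if (cap == "" || sta == "") { print "unknown"; exit 2 }
            if (cap == sta) { print "ok: " sta; exit 0 }
            print "degraded: running " sta ", capable of " cap
            exit 1
        }'
}
```

    It would be used as "lspci -s 04:00.0 -vvv | check_link". For the adapter in the example above, both lines report 8GT/s and x8, so the sketch reports the link as ok.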


    PCIe Speed

    PCIe speed determines the raw transfer rate of each lane. The speed is measured in GT/s, which stands for gigatransfers per second (billions of transfers per second). Together with the PCIe width, it determines the maximum PCIe bandwidth (speed * width).

    To verify the PCIe speed, use the command lspci.

    # lspci -s 04:00.0 -vvv | grep Speed

                 LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited

                 LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-


    Similar to the width parameter, both the device capabilities and status are reported.

    PCIe speeds are identified as "generations", where 2.5GT/s is referred to as "gen1", 5GT/s as "gen2", and 8GT/s as "gen3". Most Mellanox products support all generations. You can also view the PCIe generation by using the command lspci:

    # lspci -s 04:00.0 -vvv | grep PCIeGen

                            [V0] Vendor specific: PCIeGen3 x8


    Note: Besides the supported speed, the main difference between the generations is the encoding overhead on the link. Generations 1 and 2 use 8b/10b encoding, which costs 20% of the raw bit rate. Generation 3 uses 128b/130b encoding, which reduces the overhead to about 1.5% (2/130). See the PCIe bandwidth calculation below for more details.


    PCIe Max Payload Size

    The PCIe Max Payload Size determines the maximal size of a PCIe packet, or PCIe MTU (similar to networking protocols). This means that larger PCIe transactions are broken into PCIe MTU sized packets. This parameter is set only by the system and depends on the chipset architecture (e.g. x86_64, Power8, ARM, etc). You can view the PCIe Max Payload Size by using the command lspci (specified under DevCtl).

    # lspci -s 04:00.0 -vvv | grep DevCtl: -C 2

                    DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited

                            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+

                    DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-

                            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-

                            MaxPayload 256 bytes, MaxReadReq 4096 bytes
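
    To make the splitting concrete, the sketch below counts how many packets a transfer needs for a given Max Payload Size. tlp_count is a hypothetical helper name, and the calculation only counts payload bytes; it ignores the per-packet headers.

```shell
# tlp_count BYTES MPS (hypothetical helper): number of packets needed to
# carry BYTES of payload when each packet holds at most MPS bytes.
tlp_count() {
    local bytes=$1 mps=$2
    # Ceiling division using integer arithmetic only.
    echo $(( (bytes + mps - 1) / mps ))
}
```

    With the MaxPayload of 256 bytes reported under DevCtl above, "tlp_count 4096 256" gives 16 packets, while the 512 bytes the device is capable of (DevCap) would need only 8.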


    PCIe Max Read Request

    PCIe Max Read Request determines the maximal PCIe read request allowed. A PCIe device usually keeps track of the number of pending read requests due to having to prepare buffers for an incoming response. The size of the PCIe max read request may affect the number of pending requests (when using data fetch larger than the PCIe MTU). Again, use the command lspci in order to query for the Max Read Request value:

    # lspci -s 04:00.0 -vvv | grep MaxReadReq

                            MaxPayload 256 bytes, MaxReadReq 4096 bytes


    Unlike the other parameters discussed here, PCIe Max Read Request can be changed at runtime by using the command setpci.

    First, query the current value so that the other properties stored in the same register are not overwritten:

    # setpci -s 04:00.0 68.w


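    The command returns a four-digit hexadecimal value. Its first digit is the PCIe Max Read Request size selector. Offset 68 is the location of this register on the adapter in this example; it may differ on other devices.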


    Write the value back with the new selector index as the first digit, leaving the remaining digits unchanged:

    # setpci -s 04:00.0 68.w=2936


    Verify the updated value by using the command lspci:

    # lspci -s 04:00.0 -vvv | grep MaxReadReq

                            MaxPayload 256 bytes, MaxReadReq 512 bytes

    The acceptable selector values are: 0 - 128B, 1 - 256B, 2 - 512B, 3 - 1024B, 4 - 2048B and 5 - 4096B.


    Note: Specifying selector indexes outside this range might cause the system to crash.
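
    The value to write can be derived from the queried value instead of being edited by hand. The following is a minimal sketch under stated assumptions: devctl_with_mrrs and mrrs_bytes are hypothetical helper names, the selector is assumed to occupy the low three bits of the first hexadecimal digit (as in the PCIe Device Control register), and 5936 is an assumed query result that is consistent with the lspci output above (MaxReadReq 4096 bytes).

```shell
# mrrs_bytes SELECTOR (hypothetical helper): translate a selector index
# (0-5) into the Max Read Request size in bytes.
mrrs_bytes() {
    echo $(( 128 << $1 ))
}

# devctl_with_mrrs CURRENT SELECTOR (hypothetical helper): take the current
# 16-bit register value (four hex digits, as printed by setpci) and return
# it with only the Max Read Request selector replaced.
devctl_with_mrrs() {
    local current=$1 selector=$2
    case "$selector" in
        [0-5]) ;;   # only the documented selector values are accepted
        *) echo "selector must be between 0 and 5" >&2; return 1 ;;
    esac
    # Clear bits 14:12 (mask 0x8fff keeps everything else), then set the selector.
    printf '%04x\n' $(( (0x$current & 0x8fff) | (selector << 12) ))
}
```

    Under these assumptions, "devctl_with_mrrs 5936 2" prints 2936, the value written in the setpci example above, and "mrrs_bytes 2" prints 512, matching the MaxReadReq that lspci reports afterwards. Rejecting selectors outside 0-5 guards against the crash mentioned in the note.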


    Calculating PCIe Limitations

    As mentioned before, PCIe capabilities might affect the network adapter performance. It is good to understand the bandwidth limitation introduced by the PCIe. Below are the theoretical calculation and a few examples.

    The maximum possible PCIe bandwidth is calculated by multiplying the PCIe width by the PCIe speed. From that number we subtract ~1Gb/s for error correction protocols and the PCIe headers overhead. The overhead is determined by both the PCIe encoding (see PCIe Speed for details) and the PCIe MTU. In the formula, ENCODING is the fraction of the raw rate lost to encoding:

    Maximum PCIe Bandwidth = SPEED * WIDTH * (1 - ENCODING) - 1Gb/s.

    For example, a gen 3 PCIe device with x8 width will be limited to:

    Maximum PCIe Bandwidth = 8G * 8 * (1 - 2/130) - 1G = 64G * 0.985 - 1G = ~62Gb/s.

    Another example - a gen 2 PCIe device with x16 width will be limited to:

    Maximum PCIe Bandwidth = 5G * 16 * (1 - 1/5) - 1G = 80G * 0.8 - 1G = ~63Gb/s.
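
    The formula is easy to script for other generation and width combinations. The following is a minimal sketch: pcie_max_bw is a hypothetical helper name, it covers only the three generations discussed in this post, and it applies the same fixed ~1Gb/s deduction used above.

```shell
# pcie_max_bw GEN WIDTH (hypothetical helper): theoretical maximum PCIe
# bandwidth in Gb/s, computed as SPEED * WIDTH * (1 - ENCODING) - 1.
pcie_max_bw() {
    local gen=$1 width=$2 speed num den
    case "$gen" in
        1) speed=2.5; num=2; den=10  ;;   # 2.5GT/s, 20% encoding overhead
        2) speed=5;   num=2; den=10  ;;   # 5GT/s, 20% encoding overhead
        3) speed=8;   num=2; den=130 ;;   # 8GT/s, 2/130 encoding overhead
        *) echo "unsupported generation: $gen" >&2; return 1 ;;
    esac
    # awk is used because shell arithmetic is integer-only.
    awk -v s="$speed" -v w="$width" -v n="$num" -v d="$den" \
        'BEGIN { printf "%.1f\n", s * w * (1 - n / d) - 1 }'
}
```

    For the two examples above, "pcie_max_bw 3 8" prints 62.0 and "pcie_max_bw 2 16" prints 63.0.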


    Note: A PCIe transaction carries both the payload and the headers of the network packets, so both need to be taken into account when translating the PCIe limitation into a limit on network traffic.


    PCIe Max Read Request and Max Payload Size might also limit the transaction rate: smaller values increase the PCIe overhead and the number of pending transactions for the same load.