What is IRQ Affinity?


    This post is intended for users who wish to understand the Interrupt ReQuest Affinity (IRQ Affinity) concept.

    The post discusses various configuration scenarios related to performance tuning on the Mellanox ConnectX adapter family.

     


     

    An IRQ is an interrupt request sent from the hardware to the CPU. Upon receiving the interrupt, the CPU switches to interrupt context and runs an Interrupt Service Routine (ISR) to handle the incoming interrupt.

    The affinity of an interrupt request (IRQ Affinity) is defined as the set of CPU cores that can service that interrupt. To improve application scalability and latency, it is recommended to distribute IRQs between the available CPU cores.

     


     

    What is irqbalance?

     

    Irqbalance is a Linux daemon that helps balance the CPU load generated by interrupts across all CPUs. Irqbalance identifies the highest-volume interrupt sources and isolates each of them to a single CPU, so that the load is spread as much as possible over the entire processor set while minimizing cache misses for IRQ handlers. The /proc/interrupts file lists, for each IRQ number, the number of interrupts handled by each CPU core, the interrupt type, and a comma-delimited list of drivers that are registered to receive that interrupt.
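
    For example, on a systemd-based distribution (such as the CentOS 7 setup used later in this post), the state of the daemon and the interrupt counters it balances can be inspected with:

    # systemctl status irqbalance
    # cat /proc/interrupts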

     

    What is mlnx_tune?

    mlnx_tune is a performance tuning tool that replaces the older mlnx_affinity tool.

    Refer to HowTo Tune Your Linux Server for Best Performance Using the mlnx_tune Tool for more details.

     

    Which tool should I use?

    There can be specific workloads or configurations where stopping irqbalance and using mlnx_tune is more optimal and therefore recommended. mlnx_tune offers several different traffic profiles.

    Note that the default mlnx_tune profile does not stop irqbalance.
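
    A minimal sequence for such a case might look like the following (assuming a systemd-based distribution and an installed MLNX_OFED; the profile name below is only an illustration, run mlnx_tune -h to list the profiles available on your system):

    # systemctl stop irqbalance
    # mlnx_tune -p HIGH_THROUGHPUT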

     

    Interrupt moderation on Windows OS (WinOF Driver)

    There are several interrupt moderation profiles for Windows OS:

    • Low latency - low moderation values, instructing the driver to react as soon as possible to an incoming packet.
    • Aggressive - higher moderation values, typically used in high-throughput scenarios where it is acceptable to wait longer between interrupts.
    • Disabled - interrupt moderation can be turned off (issuing an interrupt for every packet). In this case latency is the best it can be, but the trade-off is higher CPU utilization. This mode is less suitable for production environments.
    • Polling - the driver constantly polls for incoming packets (no interrupts at all).

     

    Dynamic interrupt moderation - from WinOF 4.60 onward, the driver senses the traffic pattern dynamically and tunes the interrupt moderation values accordingly.

     

    Setup

    1. In the example below, CentOS 7 was used.

    2. MLNX_OFED 2.4 was used.

     

    For basic users it is recommended to use the mlnx_tune performance tool, which tunes all performance-related parameters (fast and easy).

     

    To view and tune IRQ Affinity specifically, follow the next steps.

     

    Tune IRQ Affinity

    1. Check how many CPUs there are on the server:

    # lscpu

    Architecture:          x86_64

    CPU op-mode(s):        32-bit, 64-bit

    Byte Order:            Little Endian

    CPU(s):                8

    On-line CPU(s) list:   0-7

    Thread(s) per core:    1

    Core(s) per socket:    4

    Socket(s):             2

    NUMA node(s):          1

    Vendor ID:             GenuineIntel

    CPU family:            6

    Model:                 23

    Model name:            Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz

    Stepping:              6

    CPU MHz:               1995.011

    BogoMIPS:              3989.83

    Virtualization:        VT-x

    L1d cache:             32K

    L1i cache:             32K

    L2 cache:              6144K

    NUMA node0 CPU(s):     0-7

     

    2. Map the adapter ports (in this example we will use the Ethernet port ens1):

    # ibdev2netdev

    mlx4_0 port 1 ==> ens1 (Up)

    mlx4_0 port 2 ==> ib0 (Up)
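
    To double-check the mapping, the driver behind the port can be confirmed with ethtool, and the NUMA node the adapter is attached to can be read from sysfs (useful when choosing which cores should service its interrupts). ens1 is the example port from the output above:

    # ethtool -i ens1 | grep -E "driver|bus-info"
    # cat /sys/class/net/ens1/device/numa_node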

     

    3. Get the IRQ numbers for the relevant port (e.g. ens1). The first column is the IRQ number associated with the port (e.g. 92, 93 ... 99).

    # cat /proc/interrupts  | grep ens1

                CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7

      92:     262357          0          1          0          0          0          0          0   PCI-MSI-edge      ens1-0

      93:          2    1427651          2          2          5          3          3          0   PCI-MSI-edge      ens1-1

      94:          0          0     302068          0          0          0          1          0   PCI-MSI-edge      ens1-2

      95:          0          0          0      14147          1          0          0          0   PCI-MSI-edge      ens1-3

      96:          0          0          0          1     499659          0          0          0   PCI-MSI-edge      ens1-4

      97:          0          0          0          1          0      17150          1          0   PCI-MSI-edge      ens1-5

      98:          0          0          0          0          0          0      51657          0   PCI-MSI-edge      ens1-6

      99:          1          2          1          0          1          0          0     322914   PCI-MSI-edge      ens1-7

    PCI-MSI (Message Signaled Interrupts) is one of the possible interrupt mechanisms a PCIe device can use to deliver interrupts to the CPU.
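
    To verify that the adapter indeed exposes MSI/MSI-X, the PCI capabilities of the device can be inspected; a sketch, reusing the bus address that ethtool reports for the example port ens1:

    # lspci -vv -s $(ethtool -i ens1 | awk '/bus-info/ {print $2}') | grep -i msi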

     

    For better performance it is recommended to map each interrupt to a different CPU, and to only one CPU. In most cases (such as this one) there is one interrupt (e.g. ens1-0 ... ens1-7) per CPU core (CPU0 ... CPU7).
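
    A quick sanity check is to compare the number of interrupt vectors allocated to the port with the number of online CPUs (the counts below match the example output above):

    # grep -c ens1 /proc/interrupts
    8
    # nproc
    8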

     

    Check the current SMP affinity of each IRQ:

    # cat /proc/irq/92/smp_affinity

    80

    # cat /proc/irq/93/smp_affinity

    10

    ...

     

    A faster way to do this is to use the show_irq_affinity.sh script.

    # show_irq_affinity.sh ens1

    92: 01

    93: 02

    94: 04

    95: 08

    96: 10

    97: 20

    98: 40

    99: 80

     

    The output is a hexadecimal bitmask over the 8 CPUs in this example (01 = 00000001 -> CPU0, 02 = 00000010 -> CPU1 ... ).
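
    A quick way to compute the mask for a given core is to shift 1 left by the core index and print the result in hex; for example, for CPU3 (compare with the "95: 08" line above):

    # printf '%x\n' $((1 << 3))
    8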

     

    Troubleshooting

    If the IRQ affinity is not tuned, the interrupts may be mapped differently. For example, if each interrupt is mapped to all CPUs, you will get something like "ff" for every IRQ when running show_irq_affinity.sh. This is not recommended, as in most cases only the lower CPU cores (e.g. CPU0, CPU1) will be used and become congested by the high volume of interrupts, while the higher CPU cores will rarely get to service interrupts.

    # show_irq_affinity.sh ens1

    92: ff

    93: ff

    94: ff

    95: ff

    96: ff

    97: ff

    98: ff

    99: ff
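
    To see which cores actually end up servicing the interrupts in such a case, the per-CPU counters can be watched while traffic is running, for example:

    # watch -n 1 "grep ens1 /proc/interrupts"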

     

    To fix this manually, simply write the CPU bitmask (01 = 00000001 -> CPU0, 02 = 00000010 -> CPU1 ... ) to the smp_affinity file of the relevant interrupt.

    # echo 1 >/proc/irq/92/smp_affinity

    # echo 2 >/proc/irq/93/smp_affinity

    # echo 4 >/proc/irq/94/smp_affinity

    ...
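
    The same assignment can be scripted. The sketch below assumes the 8-CPU example above (masks that fit in a single hex word, no comma-separated groups) and the example port name ens1; run as root, it walks the ens1 entries of /proc/interrupts and pins each IRQ to the next CPU core:

    i=0
    for irq in $(awk -F: '/ens1-/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
        # One-hot mask for CPU core $i (1 -> CPU0, 2 -> CPU1, 4 -> CPU2 ...)
        printf '%x' $((1 << i)) > /proc/irq/$irq/smp_affinity
        i=$((i + 1))
    done

    Note that if irqbalance is still running, it may later overwrite these settings.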

    A better option is simply to use one of the Mellanox tools, which perform all of this automatically. Refer to HowTo Tune Your Linux Server for Best Performance Using the mlnx_tune Tool.
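
    In addition, assuming MLNX_OFED is installed, the driver package also ships a set_irq_affinity.sh helper (the counterpart of the show_irq_affinity.sh script used above) that distributes the interrupts of a given interface across the CPU cores in a single command:

    # set_irq_affinity.sh ens1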