This post is meant for users who wish to understand the Interrupt ReQuest Affinity (IRQ Affinity) concept.
The post discusses various configuration scenarios related to performance tuning on the Mellanox ConnectX adapter family. Related posts:
- Interrupts and IRQ Tuning on RHEL
- Improved Linux SMP Scaling: User-directed Processor Affinity
- HowTo Tune Your Linux Server for Best Performance Using the mlnx_tune Tool
An IRQ is an interrupt request sent from the hardware to the CPU. Upon receiving an interrupt, the CPU switches to interrupt context and runs an Interrupt Service Routine (ISR) to handle it.
The affinity of an interrupt request (IRQ Affinity) is defined as the set of CPU cores that can service that interrupt. To improve application scalability and latency, it is recommended to distribute IRQs between the available CPU cores.
What is irqbalance?
Irqbalance is a Linux daemon that helps balance the CPU load generated by interrupts across all CPUs. Irqbalance identifies the highest-volume interrupt sources and isolates each of them to a single CPU, so that load is spread as much as possible over the entire processor set while maximizing cache hit rates for IRQ handlers. The /proc/interrupts file lists, for each IRQ number, the number of interrupts handled by each CPU core, the interrupt type, and a comma-delimited list of drivers that are registered to receive that interrupt.
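As a small illustration of the /proc/interrupts layout described above, the snippet below sums the per-CPU counts for one sample line (taken from the example later in this post) to get the total interrupt volume for that IRQ. The sample line is hard-coded so the parsing is reproducible; on a live system you would read /proc/interrupts directly.

```shell
# One line in /proc/interrupts format: IRQ number, per-CPU counts,
# interrupt type, and the driver name registered for it.
sample="92:  262357  0  1  0  0  0  0  0  PCI-MSI-edge  ens1-0"

# On this 8-CPU example, fields 2..9 are the per-CPU counts; sum them
# to see the total number of interrupts serviced for this IRQ.
total=$(echo "$sample" | awk '{s = 0; for (i = 2; i <= 9; i++) s += $i; print s}')
echo "IRQ 92 total interrupts: $total"   # prints: IRQ 92 total interrupts: 262358
```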
What is mlnx_tune?
mlnx_tune is a performance tuning tool that replaces the older mlnx_affinity tool.
Refer to HowTo Tune Your Linux Server for Best Performance Using the mlnx_tune Tool for more details.
Which tool should I use?
There could be specific workloads or configurations where stopping irqbalance and using mlnx_tune instead is more optimal and therefore recommended. mlnx_tune offers several different traffic profiles.
Note that the default mlnx_tune profile doesn't stop irqbalance.
IRQ Affinity moderation on Windows OS (WinOF Driver)
There are several profiles for interrupt moderation for Windows OS.
- Low latency - low moderation values, instructing the driver to issue an interrupt as soon as possible after a packet arrives.
- Aggressive - higher moderation values, usually for high-throughput scenarios where it is acceptable to wait longer between interrupts.
- Disabled - interrupt moderation can be disabled (issuing an interrupt for every packet). In this case latency will be the best it can be, but this is a trade-off for higher CPU utilization. This mode is less suitable for production environments.
- Polling mode - the driver always polls for incoming packets (no interrupts).
Dynamic interrupt moderation - since WinOF 4.60, the driver senses the traffic pattern dynamically and tunes the interrupt moderation values accordingly.
1. The example below uses CentOS 7.
2. The installed driver is MLNX_OFED 2.4.
For basic users it is recommended to use the mlnx_tune performance tool, which tunes all performance-related parameters (fast and easy).
To view and tune IRQ Affinity specifically, follow the next steps.
Tune IRQ Affinity
1. Check how many CPUs the server has (the output below is from lscpu):
# lscpu
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
CPU MHz: 1995.011
L1d cache: 32K
L1i cache: 32K
L2 cache: 6144K
NUMA node0 CPU(s): 0-7
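When scripting affinity assignments, it helps to expand an lscpu-style CPU list (such as the "0-7" shown above) into individual CPU ids. The sketch below does this with standard tools; expand_cpus is a hypothetical helper name introduced here for illustration, not a tool shipped with the driver.

```shell
# Expand an lscpu-style CPU list (e.g. "0-3,6-7") into one CPU id per line.
# expand_cpus is a hypothetical helper, not a standard utility.
expand_cpus() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        # A bare "5" has no upper bound, so fall back to the lower bound.
        seq "$lo" "${hi:-$lo}"
    done
}

expand_cpus "0-7"   # prints 0 through 7, one per line
```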
2. Map the adapter ports to network interfaces using the ibdev2netdev script (e.g. we will use the Ethernet port ens1):
# ibdev2netdev
mlx4_0 port 1 ==> ens1 (Up)
mlx4_0 port 2 ==> ib0 (Up)
3. Get the IRQ numbers for the relevant port (e.g. ens1). The first column is the IRQ number associated with the port (e.g. 92, 93 ... 99)
# cat /proc/interrupts | grep ens1
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
92: 262357 0 1 0 0 0 0 0 PCI-MSI-edge ens1-0
93: 2 1427651 2 2 5 3 3 0 PCI-MSI-edge ens1-1
94: 0 0 302068 0 0 0 1 0 PCI-MSI-edge ens1-2
95: 0 0 0 14147 1 0 0 0 PCI-MSI-edge ens1-3
96: 0 0 0 1 499659 0 0 0 PCI-MSI-edge ens1-4
97: 0 0 0 1 0 17150 1 0 PCI-MSI-edge ens1-5
98: 0 0 0 0 0 0 51657 0 PCI-MSI-edge ens1-6
99: 1 2 1 0 1 0 0 322914 PCI-MSI-edge ens1-7
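The IRQ numbers for a given interface can also be pulled out of /proc/interrupts with a one-liner. The sketch below runs the extraction against a shortened sample excerpt (an assumption made so the example is self-contained); on a live system, pipe /proc/interrupts itself through the same awk command.

```shell
# Shortened sample in /proc/interrupts format (per-CPU count columns trimmed).
sample='92:  262357  0  PCI-MSI-edge  ens1-0
93:  2  1427651  PCI-MSI-edge  ens1-1'

# Match lines for the interface and strip the trailing ":" from the
# first column to get the bare IRQ numbers.
echo "$sample" | awk '/ens1/ {sub(":", "", $1); print $1}'   # prints 92 and 93
```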
PCI-MSI (Message Signaled Interrupts) is one of the possible interrupt delivery mechanisms from a PCIe device to the CPU.
For better performance it is recommended to map each interrupt to a different CPU and only one CPU. In most cases (such as this case) there is an interrupt (e.g. ens1-0 ... ens1-7) per CPU core (CPU0 ... CPU7).
4. Check the current SMP affinity of each IRQ:
# cat /proc/irq/92/smp_affinity
# cat /proc/irq/93/smp_affinity
A faster option is to use the show_irq_affinity.sh script.
# show_irq_affinity.sh ens1
The output is basically a bit map for the 8 CPUs in this example (01 = 00000001 -> CPU0, 02 = 00000010 -> CPU1 ... ).
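The hex bitmap described above can be computed with shell arithmetic: bit N of smp_affinity set means CPU N may service the interrupt. A minimal sketch (assuming fewer than 63 CPUs, since shell arithmetic is 64-bit):

```shell
# smp_affinity is a hex bitmask: bit N set allows CPU N to service the IRQ.
cpu=5
printf '%x\n' $((1 << cpu))   # prints 20, i.e. binary 00100000 -> CPU5
```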
If the IRQ affinity is not tuned, the mapping will look different. For example, if each interrupt is mapped to all CPUs, show_irq_affinity.sh will print something like "ff" for every IRQ. This is not recommended: in most cases only the lower CPU cores (e.g. CPU0, CPU1) will service the interrupts and become congested under high interrupt volume, while the higher cores will rarely get to answer them.
To solve that manually, simply write the CPU bitmap value (01 = 00000001 -> CPU0, 02 = 00000010 -> CPU1, ...) to the proper interrupt:
# echo 1 >/proc/irq/92/smp_affinity
# echo 2 >/proc/irq/93/smp_affinity
# echo 4 >/proc/irq/94/smp_affinity
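The echo commands above can be generated for all of a port's IRQs with a short round-robin loop. The sketch below prints the commands rather than executing them (a dry run, since writing to /proc/irq requires root); gen_affinity_cmds is a hypothetical helper name, and the IRQ list matches the example in this post.

```shell
# Dry run: print one "echo mask > smp_affinity" command per IRQ,
# assigning CPUs round-robin. gen_affinity_cmds is a hypothetical helper.
gen_affinity_cmds() {
    local irqs="$1" ncpus="$2" cpu=0 irq mask
    for irq in $irqs; do
        mask=$(printf '%x' $((1 << cpu)))
        echo "echo $mask > /proc/irq/$irq/smp_affinity"
        cpu=$(( (cpu + 1) % ncpus ))
    done
}

gen_affinity_cmds "92 93 94 95 96 97 98 99" 8
```

To apply the assignments for real, pipe the output through a root shell instead of just printing it.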
A simpler option is to use one of the Mellanox tools that performs all of that automatically. Refer to HowTo Tune Your Linux Server for Best Performance Using the mlnx_tune Tool.