Understanding NUMA Node for Performance Benchmarks

Version 8

    Non-uniform memory access (NUMA) systems are server platforms with more than one system bus. These platforms can utilize multiple processors on a single motherboard, and all processors can access all the memory on the board. When a processor accesses memory that does not lie within its own node (remote memory), data must be transferred over the NUMA connection at a rate that is slower than it would be when accessing local memory. Thus, memory access times are not uniform and depend on the location (proximity) of the memory and the node from which it is accessed.




    Here is an example of a motherboard with two CPU sockets.



    To achieve high performance, you first need to determine which CPU will run the application and ensure that the memory used is the one closest to it.

    A Mellanox adapter installed on a PCIe link is connected to one of the CPUs. When performing benchmark tests, run the tests on the CPU attached to that PCIe link.




    Mapping between PCI, device driver, port and NUMA


    1. How do I map between a PCI address, device, port and NUMA node?

    The easiest way is to run "mst status -v".

    Here is an example of a server with two cards installed (ConnectX-4 and ConnectX-3 Pro), each connected to a different NUMA node.

    In the output below, the entry for PCI address 05:00.0 shows that mlx5_0 is the device, ens785f0 is the port, and the NUMA node is 0.

    # mst start


    # mst status -v

    MST modules:


        MST PCI module loaded

        MST PCI configuration module loaded

    PCI devices:


    DEVICE_TYPE             MST                           PCI       RDMA    NET                       NUMA

    ConnectX4(rev:0)        /dev/mst/mt4115_pciconf0.1    05:00.1   mlx5_1  net-ens785f1              0



    ConnectX4(rev:0)        /dev/mst/mt4115_pciconf0      05:00.0   mlx5_0  net-ens785f0              0



    ConnectX3Pro(rev:0)     /dev/mst/mt4103_pciconf0

    ConnectX3Pro(rev:0)     /dev/mst/mt4103_pci_cr0       81:00.0   mlx4_0  net-ens817d1,net-ens817   1


    2. How do I map a port to a CPU (numa_node)?

    On the same example, here is another way to find this information:

    # ibdev2netdev

    mlx4_0 port 1 ==> ens817 (Up)

    mlx4_0 port 2 ==> ens817d1 (Down)

    mlx5_0 port 1 ==> ens785f0 (Down)

    mlx5_1 port 1 ==> ens785f1 (Up)


    # cat /sys/class/net/ens785f0/device/numa_node

    0



    # cat /sys/class/net/ens817/device/numa_node

    1
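
    To check every interface in one pass instead of one cat at a time, a small shell sketch along these lines may help. The function name list_net_numa and its optional sysfs-root argument are illustrative assumptions, not part of any Mellanox tool; only the /sys/class/net/<iface>/device/numa_node path comes from the commands above.

```shell
#!/bin/sh
# Print "<interface> <numa_node>" for every network interface backed by a device.
# The optional first argument is a hypothetical override of the sysfs root
# (handy for testing against a copy); it defaults to /sys.
list_net_numa() {
    local root="${1:-/sys}" f iface
    for f in "$root"/class/net/*/device/numa_node; do
        [ -r "$f" ] || continue           # virtual interfaces have no device/ entry
        iface="${f#"$root"/class/net/}"   # strip the leading path
        iface="${iface%%/*}"              # keep only the interface name
        printf '%s %s\n' "$iface" "$(cat "$f")"
    done
}

list_net_numa
```

    A value of -1 next to an interface means the kernel has no NUMA information for that device.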



    3. How do I map the PCI (root and function) to a numa_node?

    On the same example, here is another way to find this information:

    # lspci -D | grep Mellanox

    0000:05:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

    0000:05:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

    0000:81:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]


    # cat /sys/devices/pci0000\:00/0000\:00\:05.1/numa_node



    # cat /sys/devices/pci0000\:00/0000\:00\:05.0/numa_node



    HINT: In most cases, an adapter installed at a PCI address starting with 8 (for example: 81) is on NUMA 1. An adapter at an address starting with 0 (for example: 05) is on NUMA 0.


    Note: When the system does not support NUMA architecture, the result is expected to be -1.
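
    As a rough alternative to combining lspci and cat by hand, the sketch below walks /sys/bus/pci/devices and prints the NUMA node of every PCI function whose vendor ID is 0x15b3 (the Mellanox PCI vendor ID). The function name list_mlx_pci_numa and its optional sysfs-root argument are assumptions made for illustration.

```shell
#!/bin/sh
# Print "<domain:bus:dev.fn> <numa_node>" for every Mellanox PCI function.
# The optional first argument is a hypothetical sysfs-root override
# (defaults to /sys) so the function can be exercised on a copy.
list_mlx_pci_numa() {
    local root="${1:-/sys}" dev
    for dev in "$root"/bus/pci/devices/*; do
        [ -r "$dev/vendor" ] && [ -r "$dev/numa_node" ] || continue
        [ "$(cat "$dev/vendor")" = "0x15b3" ] || continue   # Mellanox only
        printf '%s %s\n' "${dev##*/}" "$(cat "$dev/numa_node")"
    done
}

list_mlx_pci_numa
```

    On a system without NUMA support, each line is expected to end in -1, as noted above.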


    4. How do I map the CPU Cores to the NUMA node?

    Each CPU core is mapped to one of the NUMA nodes. In this example, by getting the CPU list (cpulist) we can see that cores 0-13 and 28-41 are mapped to NUMA 0, while the rest are mapped to NUMA 1.

    # cat /sys/devices/system/node/node0/cpulist

    0-13,28-41


    # cat /sys/devices/system/node/node1/cpulist

    14-27,42-55


    The cpumap file provides the same information as a bitmap.


    # cat /sys/devices/system/node/node0/cpumap

    000003ff,f0003fff      <-- 0-13 & 28-41 bits are ON


    # cat /sys/devices/system/node/node1/cpumap

    00fffc00,0fffc000      <-- 14-27 & 42-55 bits are ON
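
    To see how the bitmap corresponds to the core ranges, the following sketch decodes a cpumap string into a core list. It assumes the format shown above: comma-separated 32-bit hexadecimal words with the most significant word first. The function name cpumap_to_list is hypothetical.

```shell
#!/bin/sh
# Convert a cpumap string (32-bit hex words, most significant first)
# into a core list such as "0-13,28-41".
cpumap_to_list() {
    local old_ifs word rev val bit base cpu start prev out
    old_ifs=$IFS; IFS=,; set -- $1; IFS=$old_ifs      # split on commas
    rev=""
    for word in "$@"; do rev="$word $rev"; done       # least significant word first
    base=0; start=""; prev=""; out=""
    for word in $rev; do
        val=$((0x$word))
        bit=0
        while [ "$bit" -lt 32 ]; do
            if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
                cpu=$((base + bit))
                if [ -z "$start" ]; then
                    start=$cpu                        # first core of the first range
                elif [ "$cpu" -ne $((prev + 1)) ]; then
                    # Gap found: close the previous range and start a new one.
                    if [ "$start" -eq "$prev" ]; then out="$out$start,"; else out="$out$start-$prev,"; fi
                    start=$cpu
                fi
                prev=$cpu
            fi
            bit=$((bit + 1))
        done
        base=$((base + 32))
    done
    if [ -n "$start" ]; then
        if [ "$start" -eq "$prev" ]; then out="$out$start"; else out="$out$start-$prev"; fi
    fi
    printf '%s\n' "$out"
}

cpumap_to_list 000003ff,f0003fff   # prints 0-13,28-41
cpumap_to_list 00fffc00,0fffc000   # prints 14-27,42-55
```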


    Invoking Application on specific NUMA node


    1. How do I run applications on a specific NUMA node?

    Use the taskset application as follows:

    First, run the benchmark as a server in the background (ib_write_bw in this example) to get its PID.

    # ib_write_bw &

    [1] 45118


    Next, get the core affinity of that process.

    # taskset -p 45118

    pid 45118's current affinity mask: ffffffffffffff


    The mask shows that this task can run on all cores. In this example, the ConnectX-4 is connected to NUMA 0, so you can change the affinity to the list of cores that belong to NUMA 0 (0-13,28-41).

    # taskset -cp 0-13,28-41 45118

    pid 45118's current affinity list: 0-55

    pid 45118's new affinity list: 0-13,28-41


    # taskset -p 45118

    pid 45118's current affinity mask: 3fff0003fff
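
    The relation between a core list and the hexadecimal mask that taskset -p reports can be checked with a short sketch such as the one below. The function name cpulist_to_mask is hypothetical, and because it relies on 64-bit shell arithmetic it is only meant for core numbers below 63.

```shell
#!/bin/sh
# Convert a core list such as "0-13,28-41" into the hexadecimal
# affinity mask format printed by "taskset -p".
# Sketch only: relies on 64-bit shell arithmetic, so cores must be < 63.
cpulist_to_mask() {
    local list="$1" mask=0 range lo hi cpu old_ifs
    old_ifs=$IFS; IFS=,; set -- $list; IFS=$old_ifs   # split on commas
    for range in "$@"; do
        lo=${range%-*}      # text before the dash (or the whole item)
        hi=${range#*-}      # text after the dash (or the whole item)
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            mask=$(( mask | (1 << cpu) ))             # set the bit for this core
            cpu=$((cpu + 1))
        done
    done
    printf '%x\n' "$mask"
}

cpulist_to_mask 0-55          # prints ffffffffffffff
cpulist_to_mask 0-13,28-41    # prints 3fff0003fff
```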


    Alternatively, you can start a task directly on the cores of a specific NUMA node by passing the core list with the -c flag.

    # taskset -c 0-13,28-41 ib_send_bw &

    [1] 45292



    * Waiting for client to connect... *
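
    The lookup steps and the taskset invocation can be tied together so that the core list does not have to be typed by hand. The sketch below is one possible way to do that; the function name iface_cpulist and its optional sysfs-root argument are assumptions, and the fallback to all online cores for a -1 result is a design choice of this sketch rather than documented behavior of any tool.

```shell
#!/bin/sh
# Print the core list of the NUMA node that a network interface is attached to.
# Second argument: hypothetical sysfs-root override, defaults to /sys.
iface_cpulist() {
    local iface="$1" root="${2:-/sys}" node
    node=$(cat "$root/class/net/$iface/device/numa_node") || return 1
    if [ "$node" -lt 0 ]; then
        # -1: no NUMA information, so fall back to all online cores.
        cat "$root/devices/system/cpu/online"
    else
        cat "$root/devices/system/node/node$node/cpulist"
    fi
}

# Usage, with the interface name from the example above:
#   taskset -c "$(iface_cpulist ens785f0)" ib_send_bw &
```

    With the example server, iface_cpulist ens785f0 would be expected to return the NUMA 0 core list, so the benchmark starts on the same node as the ConnectX-4.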



    For more information about using taskset, run taskset -h or man taskset.