12 Replies Latest reply on Jan 10, 2017 7:21 AM by mkkang01

    CentOS 7 KVM-SR-IOV Performance?

    mkkang01

      Hello,

       

      I'm having serious trouble getting the expected performance with KVM virtualization (SR-IOV) on a CentOS 7 host.

      I'm using a ConnectX-3 card with MLNX_OFED_LINUX-3.0-2.0.1. The firmware is up to date (2.34.5000).

       

      I've verified that MPI performance is fine on CentOS 6.5, both on the bare-metal host and in a KVM image (InfiniBand and Ethernet).

      I used exactly the same OFED/firmware versions and the same applications for the CentOS 6.5 and CentOS 7 cases.

       

      But KVM SR-IOV performance is very poor on a CentOS 7 host.

      I tried different guest OSes (CentOS 6.5 and CentOS 7 VM images) on the CentOS 7 host, but the result is the same.

       

      Between two CentOS 7 bare-metal hosts, MPI performance is fine.

      Between two CentOS 6.5 or CentOS 7 KVM images on CentOS 7 hosts, MPI performance is very poor for message sizes <=32KB.

      For example, with a 4KB MPI message size, KVM SR-IOV reaches only 17% of the host bandwidth.

       

      I'm using the 3.10.0-229.4.2.el7.x86_64 kernel.

      To improve KVM performance on CentOS 7 I could upgrade the kernel to v4.0.1, but that kernel is not supported by OFED.

      I tuned the hypervisor following section 3.12.1 of "Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf".

      The BIOS setup is identical between the CentOS 6.5 and CentOS 7 tests.

       

      Is there any way to get similar performance (bandwidth/latency) between two KVM SR-IOV images on a CentOS 7 host as well? Any help is welcome!

        • Re: CentOS 7 KVM-SR-IOV Performance?
          blairo

          Hi Mikyung,

           

          Are you sure your issue is not related to other virtualisation factors? E.g., are you pinning your VMs to CPU cores and exposing the host NUMA topology to them? If your VMs have memory accesses that cross NUMA nodes (e.g., need to cross QPI) then that could explain performance degrading as the message size increases: the benefit of the CPU caches shrinks and memory access starts to dominate.

           

          Good luck!

            • Re: CentOS 7 KVM-SR-IOV Performance?
              mkkang01

              Thanks, Blair. Yes, I checked the NUMA topology. I have two NUMA nodes with 8 cores each. I measured MPI bandwidth while pinning the VM to different cores/NUMA nodes. For 1B~32KB messages, performance is very poor (<17% of the host result), even though the maximum bandwidth is fine for messages >=64KB.

                • Re: CentOS 7 KVM-SR-IOV Performance?
                  blairo

                  Hi Mikyung,

                   

                  What does the NUMA topology look like inside your VMs, i.e., are you pinning memory nodes as well as CPUs? Do you have cpu numa xml elements in your libvirt domain xml, e.g., like:

                    ...

                    <cpu>

                      ...

                      <numa>

                        <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>

                        <cell id='1' cpus='4-7' memory='512000' unit='KiB' memAccess='shared'/>

                      </numa>

                      ...

                    </cpu>

                    ...

                   

                  ?
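One way to answer the question above without reading the XML by eye is a small Python sketch that reports which NUMA and pinning elements a libvirt domain definition contains. It looks for guest NUMA cells (`<cpu><numa><cell>`), vCPU pins (`<cputune><vcpupin>`) and a memory policy (`<numatune><memory>`). The function name `pinning_summary` and the `EXAMPLE` fragment are invented for illustration and are not taken from anyone's real configuration.

```python
import xml.etree.ElementTree as ET

def pinning_summary(domain_xml: str) -> dict:
    """Report which NUMA/pinning-related elements a libvirt domain XML contains."""
    root = ET.fromstring(domain_xml)
    return {
        "guest_numa_cells": len(root.findall("./cpu/numa/cell")),
        "vcpu_pins": len(root.findall("./cputune/vcpupin")),
        "has_numatune_memory": root.find("./numatune/memory") is not None,
    }

# Hypothetical domain fragment, written for this example only.
EXAMPLE = """
<domain type='kvm'>
  <vcpu>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <cpu>
    <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
    </numa>
  </cpu>
</domain>
"""

if __name__ == "__main__":
    print(pinning_summary(EXAMPLE))
```

Feeding it the output of `virsh dumpxml <domain>` should show whether the guest is actually pinned or only has a guest-side NUMA description.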

                    • Re: CentOS 7 KVM-SR-IOV Performance?
                      mkkang01

                      Thanks for your help, Blair!

                      Yes, I added the NUMA information to the libvirt domain xml (libvirtd 1.2.8 / CentOS 7.1).

                      For a 4KB mpirun test over 40G Ethernet, the results look like this:

                      * hostA<->hostB (3797.05 MB/s)

                      * hostA<->hostB's VM (3521.03 MB/s)

                      * hostA's VM<->hostB's VM (830.20 MB/s)
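To put the three figures above side by side, here is a small Python sketch that expresses each one as a percentage of the host-to-host result. The numbers are copied from the post; the dictionary keys and the helper name `relative_bandwidth` are made up for illustration.

```python
# Bandwidth figures (MB/s) copied from the post above; the labels are illustrative.
results_mb_s = {
    "hostA<->hostB": 3797.05,
    "hostA<->hostB_vm": 3521.03,
    "hostA_vm<->hostB_vm": 830.20,
}

def relative_bandwidth(results, baseline_key):
    """Return each result as a percentage of the chosen baseline."""
    baseline = results[baseline_key]
    return {name: 100.0 * value / baseline for name, value in results.items()}

if __name__ == "__main__":
    for name, pct in relative_bandwidth(results_mb_s, "hostA<->hostB").items():
        print(f"{name:22s} {pct:6.1f}% of host-to-host")
```

With these inputs the host-to-VM case stays above 90% of the host-to-host figure, while VM-to-VM drops to a little over a fifth of it.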

                        • Re: CentOS 7 KVM-SR-IOV Performance?
                          blairo

                          Hi Mikyung,

                           

                          Your figures make it look like there might be a problem with hostB's VM... do you get the same results (hostA<->hostB's VM (3521.03 MB/s)) when reversing to hostA's VM<->hostB?

                           

                          It might be useful if you dump more of your config here, e.g., lscpu/numactl -H on the hosts and inside the VMs, the libvirt xml etc.

                            • Re: CentOS 7 KVM-SR-IOV Performance?
                              mkkang01

                              Thanks, Blair! I've pasted more detailed configuration and results for the hosts and VMs below.

                               

                               

                              * mpirun result (bandwidth, 4KB message size)

                               

                              - between 2 hosts

                               

                              mpirun -np 2 -host A,B           : 3798.62 MB/s

                              mpirun -np 2 -host B,A           : 3790.37 MB/s

                               

                               

                              - between 1 host and the other host's VM

                               

                              mpirun -np 2 -host A,B_vm      : 3554.95 MB/s

                              mpirun -np 2 -host B_vm,A      : 804.30 MB/s

                               

                              mpirun -np 2 -host B,A_vm      : 3433.93 MB/s

                              mpirun -np 2 -host A_vm,B      : 834.83 MB/s

                               

                               

                              - between 2 VMs on different hosts

                               

                              mpirun -np 2 -host A_vm,B_vm     : 796.67 MB/s

                              mpirun -np 2 -host B_vm,A_vm      : 789.85 MB/s

                               

                               

                               

                               

                              * A host

                               

                              [root@A tmp]# lscpu

                              Architecture:          x86_64

                              CPU op-mode(s):        32-bit, 64-bit

                              Byte Order:            Little Endian

                              CPU(s):                16

                              On-line CPU(s) list:   0-15

                              Thread(s) per core:    1

                              Core(s) per socket:    8

                              Socket(s):             2

                              NUMA node(s):          2

                              Vendor ID:             GenuineIntel

                              CPU family:            6

                              Model:                 45

                              Model name:            Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz

                              Stepping:              7

                              CPU MHz:               1200.000

                              BogoMIPS:              3993.96

                              Virtualization:        VT-x

                              L1d cache:             32K

                              L1i cache:             32K

                              L2 cache:              256K

                              L3 cache:              20480K

                              NUMA node0 CPU(s):     0-7

                              NUMA node1 CPU(s):     8-15

                               

                              [root@A tmp]# numactl -H

                              available: 2 nodes (0-1)

                              node 0 cpus: 0 1 2 3 4 5 6 7

                              node 0 size: 24541 MB

                              node 0 free: 484 MB

                              node 1 cpus: 8 9 10 11 12 13 14 15

                              node 1 size: 24575 MB

                              node 1 free: 21446 MB

                              node distances:

                              node   0   1

                                0:  10  20

                                1:  20  10

                               

                               

                               

                               

                              * B host

                               

                              [root@B tmp]# lscpu

                              Architecture:          x86_64

                              CPU op-mode(s):        32-bit, 64-bit

                              Byte Order:            Little Endian

                              CPU(s):                16

                              On-line CPU(s) list:   0-15

                              Thread(s) per core:    1

                              Core(s) per socket:    8

                              Socket(s):             2

                              NUMA node(s):          2

                              Vendor ID:             GenuineIntel

                              CPU family:            6

                              Model:                 45

                              Model name:            Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz

                              Stepping:              7

                              CPU MHz:               1200.000

                              BogoMIPS:              3993.95

                              Virtualization:        VT-x

                              L1d cache:             32K

                              L1i cache:             32K

                              L2 cache:              256K

                              L3 cache:              20480K

                              NUMA node0 CPU(s):     0-7

                              NUMA node1 CPU(s):     8-15

                               

                              [root@B tmp]# numactl -H

                              available: 2 nodes (0-1)

                              node 0 cpus: 0 1 2 3 4 5 6 7

                              node 0 size: 24541 MB

                              node 0 free: 7483 MB

                              node 1 cpus: 8 9 10 11 12 13 14 15

                              node 1 size: 24575 MB

                              node 1 free: 23911 MB

                              node distances:

                              node   0   1

                                0:  10  20

                                1:  20  10

                               

                               

                               

                              * A host's VM

                               

                              [root@A_vm]# lscpu

                              Architecture:          x86_64

                              CPU op-mode(s):        32-bit, 64-bit

                              Byte Order:            Little Endian

                              CPU(s):                8

                              On-line CPU(s) list:   0-7

                              Thread(s) per core:    1

                              Core(s) per socket:    1

                              Socket(s):             8

                              NUMA node(s):          1

                              Vendor ID:             GenuineIntel

                              CPU family:            6

                              Model:                 13

                              Stepping:              3

                              CPU MHz:               1995.192

                              BogoMIPS:              3990.38

                              Hypervisor vendor:     KVM

                              Virtualization type:   full

                              L1d cache:             32K

                              L1i cache:             32K

                              L2 cache:              4096K

                              NUMA node0 CPU(s):     0-7

                               

                              [root@A_vm]# numactl -H

                              available: 1 nodes (0)

                              node 0 cpus: 0 1 2 3 4 5 6 7

                              node 0 size: 15624 MB

                              node 0 free: 14788 MB

                              node distances:

                              node   0

                                0:  10

                               

                               

                               

                              * B host's VM

                               

                              [root@B_vm]# lscpu

                              Architecture:          x86_64

                              CPU op-mode(s):        32-bit, 64-bit

                              Byte Order:            Little Endian

                              CPU(s):                8

                              On-line CPU(s) list:   0-7

                              Thread(s) per core:    1

                              Core(s) per socket:    1

                              Socket(s):             8

                              NUMA node(s):          1

                              Vendor ID:             GenuineIntel

                              CPU family:            6

                              Model:                 13

                              Stepping:              3

                              CPU MHz:               1995.191

                              BogoMIPS:              3990.38

                              Hypervisor vendor:     KVM

                              Virtualization type:   full

                              L1d cache:             32K

                              L1i cache:             32K

                              L2 cache:              4096K

                              NUMA node0 CPU(s):     0-7

                               

                              [root@B_vm]# numactl -H

                              available: 1 nodes (0)

                              node 0 cpus: 0 1 2 3 4 5 6 7

                              node 0 size: 15624 MB

                              node 0 free: 14791 MB

                              node distances:

                              node   0

                                0:  10

                               

                               

                               

                               

                               

                              * libvirt xml

                               

                               

                              [root@hp4 pt2pt]# cat /tmp/test.xml

                              <domain type="kvm">

                                <uuid>ab14e717-90a9-4085-9a32-f0b24430b2c0</uuid>

                                <name>test</name>

                                <memory>16000000</memory>

                                <cpu>

                                <numa>

                                  <cell id='0' cpus='0-7' memory="16000000" unit='KiB'/>

                                </numa>

                                </cpu>

                                <vcpu>8</vcpu>

                                <sysinfo type="smbios">

                                  <system>

                                    <entry name="manufacturer">RDO Project</entry>

                                    <entry name="product">OpenStack Nova</entry>

                                    <entry name="version">2014.1.3-2.el7.centos</entry>

                                    <entry name="serial">16353439-3339-5553-4532-333845585934</entry>

                                    <entry name="uuid">ab14e717-90a9-4085-9a32-f0b24430b2c0</entry>

                                  </system>

                                </sysinfo>

                                <os>

                                  <type>hvm</type>

                                  <boot dev="hd"/>

                                  <smbios mode="sysinfo"/>

                                </os>

                                <features>

                                  <acpi/>

                                  <apic/>

                                </features>

                                <clock offset="utc">

                                  <timer name="pit" tickpolicy="delay"/>

                                  <timer name="rtc" tickpolicy="catchup"/>

                                  <timer name="hpet" present="no"/>

                                </clock>

                                <cpu mode="host-model" match="exact"/>

                                <devices>

                                  <disk type="file" device="disk">

                                    <driver name="qemu" type="qcow2" cache="none"/>

                                    <source file="/tmp/disk"/>

                                    <target bus="virtio" dev="vda"/>

                                  </disk>

                                  <interface type='hostdev' managed='yes'>

                                    <source>

                                      <address type='pci' bus="0x07" domain="0x0" function="0x2" slot="0x01"/>

                                    </source>

                                    <mac address='5a:16:3e:6c:d9:1f'/>

                                    <vlan>

                                      <tag id='1000'/>

                                    </vlan>

                                  </interface>

                                  <serial type='pty'>

                                  <target port='0'/>

                                  </serial>

                                  <console type='pty'>

                                  <target type='serial' port='0'/>

                                  </console>

                                </devices>

                              </domain>
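Since NUMA placement has come up, it may help to know which host NUMA node the passed-through VF sits on. The Python sketch below turns the `<address type='pci' .../>` attributes from the `<interface>` element in the XML above into the usual `domain:bus:slot.function` form and builds the sysfs path where Linux exposes the device's NUMA node. The helper names `pci_bdf` and `numa_node_path` are invented for this example, and the fragment is trimmed to the parts the sketch needs.

```python
import xml.etree.ElementTree as ET

# The <interface> element from the domain XML above, trimmed to source and MAC.
INTERFACE_XML = """
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' bus="0x07" domain="0x0" function="0x2" slot="0x01"/>
  </source>
  <mac address='5a:16:3e:6c:d9:1f'/>
</interface>
"""

def pci_bdf(interface_xml: str) -> str:
    """Return the passthrough device address as domain:bus:slot.function."""
    addr = ET.fromstring(interface_xml).find("./source/address")
    domain, bus, slot, func = (
        int(addr.get(key), 16) for key in ("domain", "bus", "slot", "function")
    )
    return f"{domain:04x}:{bus:02x}:{slot:02x}.{func:x}"

def numa_node_path(bdf: str) -> str:
    """sysfs file that holds the NUMA node of a PCI device on the host."""
    return f"/sys/bus/pci/devices/{bdf}/numa_node"

if __name__ == "__main__":
    bdf = pci_bdf(INTERFACE_XML)
    print(bdf)
    print(numa_node_path(bdf))
```

Reading that file on the hypervisor and comparing it with the node the VM's vCPUs and memory land on would show whether the VF and the guest share a NUMA node.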

                                • Re: CentOS 7 KVM-SR-IOV Performance?
                                  blairo

                                  Hi Mikyung,

                                   

                                  I don't see any obvious problems there, but IIRC there is a lot of config required to make this work... I guess for completeness you could show us how you've configured your host NICs and drivers and same for guests? And relevant flow control settings on your switch/es, host and guests (I believe you might need to be setting qos parameters in the guest if you are not already)? Are you using the same OFED inside the guests/VMs?

                                   

                                  Cheers,

                                    • Re: CentOS 7 KVM-SR-IOV Performance?
                                      mkkang01

                                      Thanks, Blair! I've set up exactly the same OFED/FW/NIC/Switch/OS versions on each host/VM (CentOS 6.5/7.1).

                                       

                                      ...

                                      Host Driver Version .................... MLNX_OFED_LINUX-3.0-2.0.1 (OFED-3.0-2.0.0): modules

                                      Firmware version: 2.34.5000

                                      vlan 1000 is setup on Switch/NIC

                                       

                                      <HOST>

                                      Linux xxx 3.10.0-229.4.2.el7.x86_64 #1 SMP Wed May 13 10:06:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

                                       

                                      6: ens2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000

                                          link/ether e4:1d:2d:01:12:40 brd ff:ff:ff:ff:ff:ff

                                          vf 0 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

                                          vf 1 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

                                          ...

                                          vf 9 MAC 5a:16:3e:6c:d9:2f, vlan 1000, spoof checking off, link-state auto

                                          ...

                                          vf 15 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

                                       

                                      07:01.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

                                       

                                      <VM>

                                      Linux mk-test-inst1.novalocal 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

                                       

                                      2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000

                                          link/ether 5a:16:3e:6c:d9:1f brd ff:ff:ff:ff:ff:ff

                                       

                                      00:04.0 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

                                      ...

                                       

                                       

                                      I'm using the same machines/setup to compare results on the two CentOS 6.5 hosts and the two CentOS 7.1 hosts. Even after building new VM images (CentOS 6.5/7.1), the bandwidth is still bad. Is there something extra I need to do on CentOS 7.1 (that isn't needed on a CentOS 6.5 host) to get reasonable bandwidth between two VMs? Could you please explain the QoS parameters in the guest? I tried the same guest/VM (default QoS) on CentOS 6.5 and CentOS 7.1. Only the VM on CentOS 7.1 has a problem.

                                    • Re: CentOS 7 KVM-SR-IOV Performance?
                                      alkx

                                      Given the output I would expect the same results in both cases; however, A->B_vm is good but B_vm->A is bad. Getting ~3.5G in one direction shows that there are no issues with IB communication, so maybe it is how the ranks are bound on the VM or the real host?

                                      mpirun -np 2 -host A,B_vm      : 3554.95 MB/s

                                      mpirun -np 2 -host B_vm,A      : 804.30 MB/s

                                       

                                      Try the ib_read_bw and ib_send_bw utilities before MPI. Also check that your CPUs are running at maximum speed; it seems they are not (1200 MHz/1995 MHz).
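The direction dependence is easier to see when all eight 4KB mpirun figures posted earlier in the thread are grouped by which endpoint is listed first in `-host`. The Python sketch below does only that grouping; the thread does not say which rank sends in the benchmark, so the sketch makes no claim about it. The names `RUNS` and `mean_by_first_endpoint` are invented here.

```python
# (first host, second host, MB/s) as passed to `mpirun -np 2 -host X,Y`,
# copied from the 4KB results posted earlier in the thread.
RUNS = [
    ("A", "B", 3798.62),
    ("B", "A", 3790.37),
    ("A", "B_vm", 3554.95),
    ("B_vm", "A", 804.30),
    ("B", "A_vm", 3433.93),
    ("A_vm", "B", 834.83),
    ("A_vm", "B_vm", 796.67),
    ("B_vm", "A_vm", 789.85),
]

def mean_by_first_endpoint(runs):
    """Average bandwidth, split by whether the first-listed endpoint is a VM."""
    groups = {"first is host": [], "first is VM": []}
    for first, _second, mb_s in runs:
        key = "first is VM" if first.endswith("_vm") else "first is host"
        groups[key].append(mb_s)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

if __name__ == "__main__":
    for key, mean in mean_by_first_endpoint(RUNS).items():
        print(f"{key:14s} {mean:8.2f} MB/s")
```

Every run with a VM listed first falls in the ~800 MB/s group, regardless of whether the peer is a host or another VM.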

                                        • Re: CentOS 7 KVM-SR-IOV Performance?
                                          mkkang01

                                          Thanks, alkx!

                                          I have already tested with ib_read_bw and ib_send_bw as well. Sample output is pasted below. VM->Host is bad, as expected, for sizes <32K.

                                          Let me check again after changing the CPU speed.

                                           

                                          [1] hostB -> vm@hostA

                                          # ib_write_bw -R -F $hostA_vm_IP -a

                                          #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]

                                          4096       5000             4313.10            4311.29                   1.103690

                                           

                                          [2] vm@hostA -> hostB

                                          # ib_write_bw -R -F $hostB_IP -a

                                          #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]

                                          4096       5000             806.55             802.18                    0.205359
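As a sanity check on the two output lines above, the MsgRate column can be recomputed from the average bandwidth and the message size. The Python sketch below assumes the tool's "MB" means 2**20 bytes; that assumption is inferred from the fact that it makes the reported columns agree, not from the tool's documentation. The names `msg_rate_mpps` and `SAMPLES` are invented for this example.

```python
def msg_rate_mpps(bw_mb_per_s: float, msg_bytes: int) -> float:
    """Message rate in millions of messages/s, taking 1 MB as 2**20 bytes."""
    return bw_mb_per_s * 2**20 / msg_bytes / 1e6

# (label, message size in bytes, average BW in MB/s, reported MsgRate in Mpps),
# copied from the ib_write_bw output above.
SAMPLES = [
    ("hostB -> vm@hostA", 4096, 4311.29, 1.103690),
    ("vm@hostA -> hostB", 4096, 802.18, 0.205359),
]

if __name__ == "__main__":
    for label, size, bw, reported in SAMPLES:
        print(f"{label}: computed {msg_rate_mpps(bw, size):.6f} Mpps, "
              f"reported {reported:.6f} Mpps")
    ratio = SAMPLES[1][2] / SAMPLES[0][2]
    print(f"VM->host / host->VM bandwidth: {100 * ratio:.1f}%")
```

Under that assumption the recomputed rates match the reported ones to within rounding, and the VM-to-host direction works out to just under a fifth of the host-to-VM bandwidth at 4096 bytes.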

                                          • Re: CentOS 7 KVM-SR-IOV Performance?
                                            mkkang01

                                            Even at maximum speed (cpufreq/scaling_governor: performance), I get a similar result: ib_write_bw performance (VM->host) is bad.
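For anyone repeating this check, a minimal Python sketch for confirming the governor on every core is below. It reads the standard `cpufreq/scaling_governor` files under sysfs and counts CPUs per governor. The function name `governor_counts` and its `cpu_root` parameter are invented for this example; the parameter exists only so the function can be pointed at a different directory.

```python
from collections import Counter
from pathlib import Path

def governor_counts(cpu_root="/sys/devices/system/cpu"):
    """Count CPUs per cpufreq scaling governor under the given sysfs root."""
    counts = Counter()
    for gov_file in Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[gov_file.read_text().strip()] += 1
    return dict(counts)

if __name__ == "__main__":
    # An empty result means no cpufreq directories were found on this machine.
    print(governor_counts())
```

On the hypervisor, a result with only `performance` entries would confirm the setting described above. Inside a guest the result may be empty if no cpufreq driver is exposed there.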

                              • Re: CentOS 7 KVM-SR-IOV Performance?
                                mkkang01

                                The patches provided by Red Hat at http://people.redhat.com/~alwillia/bz1299846/ solve the issue.

                                I downloaded and installed (yum install *.rpm) the 3 user-space packages (qemu-img, qemu-kvm and qemu-kvm-common) on the hypervisor.

                                Performance improved by as much as 90% and 65% for the 1KB and 4KB message sizes, respectively.

                                 

                                qemu-img-1.5.3-105.el7_2.1.bz1299846.0.x86_64.rpm

                                qemu-kvm-1.5.3-105.el7_2.1.bz1299846.0.x86_64.rpm

                                qemu-kvm-common-1.5.3-105.el7_2.1.bz1299846.0.x86_64.rpm

                                 

                                Performance Known Issues #783496: When using a VF over RH7.X KVM, low throughput is expected.

                                http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_Release_Notes_3_3-1_0_0_0.pdf