1 Reply Latest reply on Sep 18, 2014 6:50 AM by alkx

    InfiniBand RDMA latency test on Xen's dom0 crashes

      Hello.

       

      The short story: while setting up an InfiniBand connection between two servers, one of which is a Xen dom0, I cannot complete the RDMA latency test. The test crashes and even breaks the ssh connection to the dom0.

       

      The long story: the first server runs Xen 4.4 with Ubuntu 14.04 as dom0 (hostname xen). The second is an ordinary server with Ubuntu 14.04 (hostname node3). Both have Mellanox MT25208 HCAs connected through an IB switch. Both have all the kernel modules loaded and OpenSM installed. IPoIB works fine. A plain ibping works in both directions, xen -> node3 and node3 -> xen. The problem occurs when I try the ib_rdma_lat test. Here are the steps that lead to the ib_rdma_lat crash and the subsequent sshd crash on xen.

       

       

      1. On xen I run ib_rdma_lat.
      2. On node3 I run ib_rdma_lat xen.
      3. The ssh connection to xen closes.
      4. This is the output before the ssh connection closes:

       

       

      root@xen:~/tmp/22# ib_rdma_lat
      local address: LID 0x03 QPN 0x10406 PSN 0x9f903b RKey 0x40004000 VAddr 0x000000017e4001
      remote address: LID 0x01 QPN 0x10406 PSN 0xd8c16e RKey 0x20004000 VAddr 0x000000013fd001
      Connection to xen closed by remote host.
      Connection to xen closed.

       

       

      I googled, and the only thing I found to try was tuning the ib_mthca module parameters num_mtt and log_mtts_per_seg, as described in the article http://community.mellanox.com/docs/DOC-1120. I set them on both servers to num_mtt=4194304 and log_mtts_per_seg=4. I arrived at these values by experimenting until the ib_mthca module would load correctly.
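
For reference, here is a rough sketch of how much memory these two values should allow the HCA to register. It assumes the commonly quoted rule of thumb max_reg_mem = num_mtt * 2^log_mtts_per_seg * PAGE_SIZE applies to ib_mthca, and that the page size is 4096 bytes. The function name max_registrable_bytes is only illustrative and is not part of any driver or tool.

```python
# Estimate of registrable memory from the ib_mthca module parameters.
# Assumption: max_reg_mem = num_mtt * 2**log_mtts_per_seg * page_size
# Assumption: 4096-byte pages (the usual x86_64 default).

PAGE_SIZE = 4096  # bytes


def max_registrable_bytes(num_mtt, log_mtts_per_seg, page_size=PAGE_SIZE):
    """Return the estimated upper bound on registrable memory, in bytes."""
    entries_per_segment = 1 << log_mtts_per_seg  # 2**log_mtts_per_seg
    return num_mtt * entries_per_segment * page_size


if __name__ == "__main__":
    # Values set on both servers in this post.
    limit = max_registrable_bytes(num_mtt=4194304, log_mtts_per_seg=4)
    print(f"{limit / 2**30:.0f} GiB")  # prints "256 GiB"
```

If that formula holds, the limit comes out at 256 GiB, which is above the physical RAM of either server, so the MTT settings alone should not be what limits registration here.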

      But this didn't help. ib_rdma_lat still crashes on xen. Here's the log:

       

       

      Aug  4 00:12:52 localhost kernel: [ 4011.170180] ib_rdma_lat invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
      Aug  4 00:12:52 localhost kernel: [ 4011.170189] ib_rdma_lat cpuset=/ mems_allowed=0
      Aug  4 00:12:52 localhost kernel: [ 4011.170195] CPU: 0 PID: 2889 Comm: ib_rdma_lat Tainted: G    B   W    3.13.0-32-generic #57-Ubuntu
      Aug  4 00:12:52 localhost kernel: [ 4011.170198] Hardware name: Supermicro X9DRFF-iG+/-7G+/-iTG+/-7TG+/X9DRFF-iG+/-7G+/-iTG+/-7TG+, BIOS 3.0 07/29/2013
      Aug  4 00:12:52 localhost kernel: [ 4011.170202]  0000000000000000 ffff880f175ebc68 ffffffff8171bcb4 ffff880f1ae02fe0
      Aug  4 00:12:52 localhost kernel: [ 4011.170209]  ffff880f175ebcf0 ffffffff817165ef ffff880f1a96afe0 0000000000000000
      Aug  4 00:12:52 localhost kernel: [ 4011.170213]  00000000016ad5c1 ffff880f1a96afe0 ffffffff817246aa ffffffff8172417b
      Aug  4 00:12:52 localhost kernel: [ 4011.170217] Call Trace:
      Aug  4 00:12:52 localhost kernel: [ 4011.170236]  [] dump_stack+0x45/0x56
      Aug  4 00:12:52 localhost kernel: [ 4011.170242]  [] dump_header+0x7f/0x1f1
      Aug  4 00:12:52 localhost kernel: [ 4011.170248]  [] ? error_exit+0x2a/0x60
      Aug  4 00:12:52 localhost kernel: [ 4011.170253]  [] ? retint_restore_args+0x5/0x6
      Aug  4 00:12:52 localhost kernel: [ 4011.170260]  [] oom_kill_process+0x1ce/0x330
      Aug  4 00:12:52 localhost kernel: [ 4011.170269]  [] ? security_capable_noaudit+0x15/0x20
      Aug  4 00:12:52 localhost kernel: [ 4011.170273]  [] out_of_memory+0x414/0x450
      Aug  4 00:12:52 localhost kernel: [ 4011.170278]  [] pagefault_out_of_memory+0x6f/0x80
      Aug  4 00:12:52 localhost kernel: [ 4011.170284]  [] mm_fault_error+0x8e/0x180
      Aug  4 00:12:52 localhost kernel: [ 4011.170289]  [] __do_page_fault+0x4a1/0x560
      Aug  4 00:12:52 localhost kernel: [ 4011.170299]  [] ? __acct_update_integrals+0x76/0xe0
      Aug  4 00:12:52 localhost kernel: [ 4011.170305]  [] ? acct_account_cputime+0x1c/0x20
      Aug  4 00:12:52 localhost kernel: [ 4011.170312]  [] ? account_user_time+0x8b/0xa0
      Aug  4 00:12:52 localhost kernel: [ 4011.170316]  [] ? vtime_account_user+0x54/0x60
      Aug  4 00:12:52 localhost kernel: [ 4011.170320]  [] do_page_fault+0x1a/0x70
      Aug  4 00:12:52 localhost kernel: [ 4011.170324]  [] page_fault+0x28/0x30
      Aug  4 00:12:52 localhost kernel: [ 4011.170326] Mem-Info:
      Aug  4 00:12:52 localhost kernel: [ 4011.170329] Node 0 DMA per-cpu:
      Aug  4 00:12:52 localhost kernel: [ 4011.170334] CPU    0: hi:    0, btch:   1 usd:   0
      Aug  4 00:12:52 localhost kernel: [ 4011.170336] Node 0 DMA32 per-cpu:
      Aug  4 00:12:52 localhost kernel: [ 4011.170339] CPU    0: hi:  186, btch:  31 usd: 135
      Aug  4 00:12:52 localhost kernel: [ 4011.170341] Node 0 Normal per-cpu:
      Aug  4 00:12:52 localhost kernel: [ 4011.170344] CPU    0: hi:  186, btch:  31 usd: 124
      Aug  4 00:12:52 localhost kernel: [ 4011.170351] active_anon:7920 inactive_anon:23 isolated_anon:0
      Aug  4 00:12:52 localhost kernel: [ 4011.170351]  active_file:20177 inactive_file:37521 isolated_file:0
      Aug  4 00:12:52 localhost kernel: [ 4011.170351]  unevictable:8 dirty:0 writeback:0 unstable:0
      Aug  4 00:12:52 localhost kernel: [ 4011.170351]  free:15211440 slab_reclaimable:4583 slab_unreclaimable:8427
      Aug  4 00:12:52 localhost kernel: [ 4011.170351]  mapped:4644 shmem:408 pagetables:993 bounce:0
      Aug  4 00:12:52 localhost kernel: [ 4011.170351]  free_cma:0
      Aug  4 00:12:52 localhost kernel: [ 4011.170358] Node 0 DMA free:15888kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      Aug  4 00:12:52 localhost kernel: [ 4011.170367] lowmem_reserve[]: 0 1980 60135 60135
      Aug  4 00:12:52 localhost kernel: [ 4011.170372] Node 0 DMA32 free:2017364kB min:1032kB low:1288kB high:1548kB active_anon:992kB inactive_anon:4kB active_file:2596kB inactive_file:5756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2045472kB managed:2031128kB mlocked:0kB dirty:0kB writeback:0kB mapped:692kB shmem:32kB slab_reclaimable:428kB slab_unreclaimable:472kB kernel_stack:40kB pagetables:132kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Aug  4 00:12:52 localhost kernel: [ 4011.170381] lowmem_reserve[]: 0 0 58154 58154
      Aug  4 00:12:52 localhost kernel: [ 4011.170386] Node 0 Normal free:58812508kB min:30348kB low:37932kB high:45520kB active_anon:30688kB inactive_anon:88kB active_file:78112kB inactive_file:144328kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:60853112kB managed:59550432kB mlocked:32kB dirty:0kB writeback:0kB mapped:17884kB shmem:1600kB slab_reclaimable:17904kB slab_unreclaimable:33236kB kernel_stack:1704kB pagetables:3840kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Aug  4 00:12:52 localhost kernel: [ 4011.170394] lowmem_reserve[]: 0 0 0 0
      Aug  4 00:12:52 localhost kernel: [ 4011.170398] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15888kB
      Aug  4 00:12:52 localhost kernel: [ 4011.170416] Node 0 DMA32: 1*4kB (M) 12*8kB (UEM) 7*16kB (UE) 2*32kB (UM) 1*64kB (U) 2*128kB (UM) 0*256kB 1*512kB (E) 1*1024kB (E) 2*2048kB (ER) 491*4096kB (M) = 2017364kB
      Aug  4 00:12:52 localhost kernel: [ 4011.170434] Node 0 Normal: 67*4kB (UM) 34*8kB (UEM) 16*16kB (UEM) 38*32kB (UM) 26*64kB (UM) 22*128kB (UEM) 15*256kB (UEM) 2*512kB (M) 1*1024kB (M) 3*2048kB (UEM) 14354*4096kB (MR) = 58812508kB
      Aug  4 00:12:52 localhost kernel: [ 4011.170468] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      Aug  4 00:12:52 localhost kernel: [ 4011.170470] 58105 total pagecache pages
      Aug  4 00:12:52 localhost kernel: [ 4011.170473] 0 pages in swap cache
      Aug  4 00:12:52 localhost kernel: [ 4011.170476] Swap cache stats: add 0, delete 0, find 0/189
      Aug  4 00:12:52 localhost kernel: [ 4011.170478] Free swap  = 33517564kB
      Aug  4 00:12:52 localhost kernel: [ 4011.170480] Total swap = 33517564kB
      Aug  4 00:12:52 localhost kernel: [ 4011.170482] 15728639 pages RAM
      Aug  4 00:12:52 localhost kernel: [ 4011.170483] 0 pages HighMem/MovableOnly
      Aug  4 00:12:52 localhost kernel: [ 4011.170485] 325670 pages reserved
      Aug  4 00:12:52 localhost kernel: [ 4011.170487] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
      Aug  4 00:12:52 localhost kernel: [ 4011.170496] [  375]     0   375     4935      228      14        0             0 upstart-udev-br
      Aug  4 00:12:52 localhost kernel: [ 4011.170501] [  384]     0   384    12927      485      28        0         -1000 systemd-udevd
      Aug  4 00:12:52 localhost kernel: [ 4011.170505] [  571]   102   571     9887      391      23        0             0 dbus-daemon
      Aug  4 00:12:52 localhost kernel: [ 4011.170509] [  590]   101   590    63961      318      27        0             0 rsyslogd
      Aug  4 00:12:52 localhost kernel: [ 4011.170513] [  596]     0   596     4823      373      14        0             0 bluetoothd
      Aug  4 00:12:52 localhost kernel: [ 4011.170516] [  606]     0   606    18680      893      40        0             0 cupsd
      Aug  4 00:12:52 localhost kernel: [ 4011.170520] [  614]     0   614     5870      106      16        0             0 rpc.idmapd
      Aug  4 00:12:52 localhost kernel: [ 4011.170523] [  622]     0   622    10863      454      26        0             0 systemd-logind
      Aug  4 00:12:52 localhost kernel: [ 4011.170528] [  702]     0   702     3984      308      13        0             0 upstart-file-br
      Aug  4 00:12:52 localhost kernel: [ 4011.170531] [  877]     0   877     5855      275      18        0             0 rpcbind
      Aug  4 00:12:52 localhost kernel: [ 4011.170534] [  898]   111   898     5386      347      15        0             0 rpc.statd
      Aug  4 00:12:52 localhost kernel: [ 4011.170538] [  901]     0   901     3848      184      13        0             0 upstart-socket-
      Aug  4 00:12:52 localhost kernel: [ 4011.170541] [ 1300]   105  1300     7861      513      21        0             0 ntpd
      Aug  4 00:12:52 localhost kernel: [ 4011.170545] [ 1374]     0  1374     5268      237      13        0             0 getty
      Aug  4 00:12:52 localhost kernel: [ 4011.170548] [ 1378]     0  1378     5268      235      13        0             0 getty
      Aug  4 00:12:52 localhost kernel: [ 4011.170551] [ 1384]     0  1384     5268      237      13        0             0 getty
      Aug  4 00:12:52 localhost kernel: [ 4011.170555] [ 1385]     0  1385     5268      238      13        0             0 getty
      Aug  4 00:12:52 localhost kernel: [ 4011.170558] [ 1388]     0  1388     5268      238      13        0             0 getty
      Aug  4 00:12:52 localhost kernel: [ 4011.170561] [ 1427]     0  1427    15341      762      33        0         -1000 sshd
      Aug  4 00:12:52 localhost kernel: [ 4011.170564] [ 1443]     0  1443     5914      257      17        0             0 cron
      Aug  4 00:12:52 localhost kernel: [ 4011.170568] [ 1554]     0  1554     2750      242      11        0             0 xenstored
      Aug  4 00:12:52 localhost kernel: [ 4011.170571] [ 1566]     0  1566    22752      261      19        0             0 xenconsoled
      Aug  4 00:12:52 localhost kernel: [ 4011.170575] [ 1613]     0  1613    73631     1045      48        0             0 polkitd
      Aug  4 00:12:52 localhost kernel: [ 4011.170578] [ 1885]   113  1885     7052      249      18        0             0 dnsmasq
      Aug  4 00:12:52 localhost kernel: [ 4011.170581] [ 2004]     0  2004   148275      997      39        0             0 console-kit-dae
      Aug  4 00:12:52 localhost kernel: [ 4011.170585] [ 2166]     0  2166    23985      237      21        0             0 xl
      Aug  4 00:12:52 localhost kernel: [ 4011.170589] [ 2303]     0  2303     5268      237      13        0             0 getty
      Aug  4 00:12:52 localhost kernel: [ 4011.170592] [ 2378]     0  2378    82712      784      23        0             0 opensm
      Aug  4 00:12:52 localhost kernel: [ 4011.170595] [ 2379]     0  2379    65942      358      22        0             0 opensm
      Aug  4 00:12:52 localhost kernel: [ 4011.170598] [ 2450]   106  2450    91259     1269      74        0             0 whoopsie
      Aug  4 00:12:52 localhost kernel: [ 4011.170602] [ 2453]     0  2453    93762     3220     114        0             0 libvirtd
      Aug  4 00:12:52 localhost kernel: [ 4011.170605] [ 2634]     0  2634    26407     1058      54        0             0 sshd
      Aug  4 00:12:52 localhost kernel: [ 4011.170608] [ 2671]  1000  2671    26407      501      52        0             0 sshd
      Aug  4 00:12:52 localhost kernel: [ 4011.170612] [ 2672]  1000  2672     7041     1040      17        0             0 bash
      Aug  4 00:12:52 localhost kernel: [ 4011.170615] [ 2749]     0  2749    17566      547      36        0             0 sudo
      Aug  4 00:12:52 localhost kernel: [ 4011.170618] [ 2750]     0  2750     7063     1074      16        0             0 bash
      Aug  4 00:12:52 localhost kernel: [ 4011.170622] [ 2889]     0  2889     3732      213      12        0             0 ib_rdma_lat
      Aug  4 00:12:52 localhost kernel: [ 4011.170625] Out of memory: Kill process 2453 (libvirtd) score 0 or sacrifice child
      Aug  4 00:12:52 localhost kernel: [ 4011.170729] Killed process 2453 (libvirtd) total-vm:375048kB, anon-rss:4748kB, file-rss:8132kB
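
A quick sanity check on the Mem-Info figures in the log above, as a minimal sketch. The numbers are copied by hand from the log; the 4 kB page size is an assumption, and the helper names are only illustrative.

```python
# Cross-check the Mem-Info numbers from the oom-killer report above.
# Assumption: 4 kB pages.

PAGE_KB = 4

# Copied from the log: "free:15211440" and "15728639 pages RAM".
FREE_PAGES = 15211440
TOTAL_RAM_PAGES = 15728639

# Copied from the per-zone "Node 0 <zone> free:...kB" lines.
ZONE_FREE_KB = {"DMA": 15888, "DMA32": 2017364, "Normal": 58812508}


def pages_to_kb(pages, page_kb=PAGE_KB):
    """Convert a page count to kilobytes."""
    return pages * page_kb


def kb_to_gib(kb):
    """Convert kilobytes to GiB."""
    return kb / 2**20


if __name__ == "__main__":
    free_kb = pages_to_kb(FREE_PAGES)
    # The global free counter should match the sum of the zone counters.
    assert free_kb == sum(ZONE_FREE_KB.values())
    total_kb = pages_to_kb(TOTAL_RAM_PAGES)
    print(f"free: {kb_to_gib(free_kb):.1f} GiB of {kb_to_gib(total_kb):.1f} GiB")
```

By this count roughly 58 GiB of about 60 GiB was free, and no swap was in use, when the oom-killer ran. So this does not look like an ordinary shortage of RAM in dom0, although I cannot say from the log alone what made the page fault fail.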

       

       

      The xen server (dom0) has 60 GB of RAM, and node3 has 180 GB of RAM.

      Here are some logs and command outputs that I collected to help diagnose the problem.

       

       

      1. dmesg on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/dmesg.xen.log
      2. xl dmesg on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/xl-dmesg.xen.log
      3. parameters of the loaded ib_mthca on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ib_mthca.xen.log
      4. ibhosts on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ibhosts.xen.log
      5. ibstat on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ibstat.xen.log
      6. ibstatus on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ibstatus.xen.log
      7. lsmod | grep rdma on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/lsmod-rdma.xen.log
      8. lspci -s 04:00.0 -k on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/lspci.xen.log
      9. a cut from /var/log/syslog after ib_rdma_lat crash on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/syslog.xen

       

       

      Can anyone advise, please?