2 Replies Latest reply on Sep 24, 2014 2:48 AM by thug

    Windows VMs hang on NFSoRDMA on CentOS 6.5

      Hello. We are stuck on the following problem: four nodes are connected to a storage server via NFS over RDMA. The hardware is:

      Intel 2312WPQJR as a node

      Intel R2312GL4GS as a storage with Intel Infiniband 2 ports controller

      Infiniband Mellanox SwitchX IS5023 for switching.

       

      The nodes and the storage server run CentOS 6.5 with the built-in InfiniBand stack (Linux 2.6.32-431.el6.x86_64).

       

      On the storage server an array is created, which appears in the system as /storage/s01 and is exported via NFS. The nodes mount it with:

      /bin/mount -t nfs -o rdma,port=20049,rw,hard,timeo=600,retrans=5,async,nfsvers=3,intr 192.168.1.1:/storage/s01 /home/storage/sata/01

      mount shows:

      192.168.1.1:/storage/s01 on /home/storage/sata/01 type nfs

      (rw,rdma,port=20049,hard,timeo=600,retrans=5,nfsvers=3,intr,addr=192.168.1.1)
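For reference, the server side of such a setup on CentOS 6 is typically configured along these lines. This is only a sketch: the post shows just the client mount, so the exact /etc/exports options here are assumptions.

```shell
# Load the server-side NFS/RDMA transport module and tell nfsd to also
# listen for RDMA connections on port 20049 (the port used in the mount
# command above).
modprobe svcrdma
echo rdma 20049 > /proc/fs/nfsd/portlist

# Hypothetical export entry in /etc/exports (the real options are not
# shown in the post):
#   /storage/s01  192.168.1.0/24(rw,async,no_root_squash)
exportfs -ra
```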

       

      Then we create a virtual machine with virsh, using a virtio disk bus. Everything is fine until we start Windows on KVM. It may run for 2 hours or 2 days, but under heavy load it hangs the mount (i.e. /sata/02 and /sata/03 remain accessible, but any request to /sata/01 hangs the console completely). The only way out is a hardware reset of the node. If we mount without rdma, all is fine. All Linux VMs work fine, no problems.
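The disk definition in the domain XML was presumably something like the following libvirt fragment (hypothetical — the actual domain XML is not shown in the post, and the image filename is made up); the point is that the image file lives on the NFSoRDMA mount:

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/home/storage/sata/01/win-vm.img'/>
  <target dev='vda' bus='virtio'/>
</disk>
```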

       

      NFS tuning has been done; the logs at the time of the problem show:

      Mar 20 09:42:22 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
      Mar 20 09:42:42 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
      Mar 20 09:42:49 v0004 kernel: ------------[ cut here ]------------
      Mar 20 09:42:49 v0004 kernel: WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0() (Not tainted)
      Mar 20 09:42:49 v0004 kernel: Hardware name: S2600WP
      Mar 20 09:42:49 v0004 kernel: Modules linked in: act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb ebt_arp ebt_ip ebtable_nat ebtables xprtrdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 openvswitch(U) vhost_net macvtap macvlan tun kvm_intel kvm iTCO_wdt iTCO_vendor_support sr_mod cdrom sb_edac edac_core lpc_ich mfd_core igb i2c_algo_bit ptp pps_core sg i2c_i801 i2c_core ioatdma dca mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core ext4 jbd2 mbcache usb_storage sd_mod crc_t10dif ahci isci libsas scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      Mar 20 09:42:49 v0004 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.5.1.el6.x86_64 #1
      Mar 20 09:42:49 v0004 kernel: Call Trace:
      Mar 20 09:42:49 v0004 kernel: <IRQ> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
      Mar 20 09:42:49 v0004 kernel: [<ffffffff81071e7a>] ? warn_slowpath_null+0x1a/0x20
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a3ed>] ? local_bh_enable_ip+0x7d/0xb0
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8152a7fb>] ? _spin_unlock_bh+0x1b/0x20
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa04554f0>] ? rpc_wake_up_status+0x70/0x80 [sunrpc]
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa044e79c>] ? xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa05322fc>] ? rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa0535450>] ? rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa01c11cb>] ? mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa0166c54>] ? mlx4_qp_event+0x74/0xf0 [mlx4_core]
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa0154057>] ? mlx4_eq_int+0x557/0xcb0 [mlx4_core]
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa0455396>] ? rpc_wake_up_task_queue_locked+0x186/0x270 [sunrpc]
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa01547c4>] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
      Mar 20 09:42:49 v0004 kernel: [<ffffffff810e6eb0>] ? handle_IRQ_event+0x60/0x170
      Mar 20 09:42:49 v0004 kernel: [<ffffffff810e980e>] ? handle_edge_irq+0xde/0x180
      Mar 20 09:42:49 v0004 kernel: [<ffffffffa0153362>] ? mlx4_cq_completion+0x42/0x90 [mlx4_core]
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
      Mar 20 09:42:49 v0004 kernel: [<ffffffff815312ec>] ? do_IRQ+0x6c/0xf0
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a893>] ? __do_softirq+0x73/0x1e0
      Mar 20 09:42:49 v0004 kernel: [<ffffffff810e6eb0>] ? handle_IRQ_event+0x60/0x170
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a795>] ? irq_exit+0x85/0x90
      Mar 20 09:42:49 v0004 kernel: [<ffffffff815312f5>] ? do_IRQ+0x75/0xf0
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
      Mar 20 09:42:49 v0004 kernel: <EOI> [<ffffffff812e09ae>] ? intel_idle+0xde/0x170
      Mar 20 09:42:49 v0004 kernel: [<ffffffff812e0991>] ? intel_idle+0xc1/0x170
      Mar 20 09:42:49 v0004 kernel: [<ffffffff814268f7>] ? cpuidle_idle_call+0xa7/0x140
      Mar 20 09:42:49 v0004 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
      Mar 20 09:42:49 v0004 kernel: [<ffffffff8150cf1a>] ? rest_init+0x7a/0x80
      Mar 20 09:42:49 v0004 kernel: [<ffffffff81c26f8f>] ? start_kernel+0x424/0x430
      Mar 20 09:42:49 v0004 kernel: [<ffffffff81c2633a>] ? x86_64_start_reservations+0x125/0x129
      Mar 20 09:42:49 v0004 kernel: [<ffffffff81c26453>] ? x86_64_start_kernel+0x115/0x124
      Mar 20 09:42:49 v0004 kernel: ---[ end trace ddc1b92aa1d57ab7 ]---
      Mar 20 09:42:49 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
      Mar 20 09:43:19 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16

       

      On the storage side nothing is logged. The CentOS virt list couldn't help, so this community is our last place to ask.

        • Re: Windows VMs hang on NFSoRDMA on CentOS 6.5
          yairi

          Hi Nikolay,

          At the beginning of your post you wrote that you are using an "Intel Infiniband 2 ports controller", but later, in the dump (which I will assume was taken from the same client), I see mlx4 mentioned, which is the Mellanox adapter driver. Can you clarify your configuration?

           

          Generally speaking, I can't recall seeing many users doing NFSoRDMA; I've seen more users doing NFS over IPoIB. I suggest you try that first and see if things improve. Other than that, I don't have anything smart to say, but I will forward this to folks who are more familiar with NFS...
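The suggested IPoIB test amounts to dropping the rdma transport and mounting the same export over TCP on the IPoIB interface. A sketch, assuming the IPoIB interface is already up with the same 192.168.1.x addressing as in the original mount command:

```shell
# Same export and options as before, but proto=tcp instead of rdma --
# useful to check whether the hangs are specific to the xprtrdma transport.
mount -t nfs -o proto=tcp,rw,hard,timeo=600,retrans=5,async,nfsvers=3,intr \
    192.168.1.1:/storage/s01 /home/storage/sata/01
```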

           

          Cheers!

            • Re: Windows VMs hang on NFSoRDMA on CentOS 6.5

              Sorry for not answering - we thought the question had been ignored, and we lost the link.

              The model of the Intel IB cards is AXX1FDRIBIOM, as I've found out. As an update: NFSoRDMA hangs not only with Windows VMs but with any VM, with no way to predict it - there may be hundreds of "connection closed (-103)" errors and everything is fine, or just 1-2 such lines and a hung connection...