5 Replies Latest reply on Sep 11, 2013 12:55 PM by inbusiness

    ESX 5.1 IPoIB driver crash

      Hello,

       

      After two weeks of testing and firmware patching, I think we have found a major bug in the ESX 5.1 OFED 1.8.1.0 IPoIB driver. We are currently running a Fujitsu RX300 S6 (dual Xeon X5670) with a Mellanox ConnectX-2 MHRH2A (firmware 2.9.1200). The storage server runs Ubuntu 12.04 LTS with an older ConnectX (PCIe Gen2) card and Linux kernel 3.5. In between sits a 24-port DDR Flextronics IB CX4 switch, so our maximum MTU is limited to 2K, but that is no problem for us.

       

      On the ESX host, the InfiniBand card serves as both a VMkernel interface and a VM port group at the same time. A running VM has its "local" disks mounted over the VMkernel interface via IPoIB, and inside the VM we have mounted an NFS filesystem from the NFS server. So it looks like:

       

      vm:~ # df

      Filesystem           1K-blocks      Used Available Use% Mounted on

      /dev/sda1             61927388   3577888  55203784   7% /  (mounted by ESX)

      10.10.30.253:/var/nas/backup 11007961088 6360753152 4647207936  58% /backup (mounted inside VM)

       

      To reproduce the error, we copy data into the VM using SCP with /backup as the target. After copying a few gigabytes of data the InfiniBand card stops working and the ESX kernel logs the error message below. The situation cannot be resolved without an ESX reboot.

       

      WARNING: LinDMA: Linux_DMACheckContraints:149:Cannot

               map machine address = 0x15ffff37b0, length = 65160

               for device 0000:02:00.0; reason = buffer straddles

               device dma boundary (0xffffffff)

      <3>vmnic_ib1:ipoib_send:504: found skb where it does not belong

                                   tx_head = 323830, tx_tail =323830

      <3>vmnic_ib1:ipoib_send:505: netif_queue_stopped = 0

      Backtrace for current CPU #20, worldID=8212, ebp=0x41220051b028

      ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c4524aa, 0x4f0f5000000d

      ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c44bca8, 0x41000fe5d6c0

      ipoib_start_xmit@<None>#<None>+0x53 stack: 0x41220051b238, 0x41800c4

       

      In the process of eliminating the error we tried the following (without success):

       

      1) Updated the servers' firmware to the latest version

      2) Switched from ConnectX to ConnectX-2 card

      3) Switched from firmware 2.9.1000 to 2.9.1200

       

      Everything works fine if we use the InfiniBand card only as a VMkernel interface. More details in my first post: http://community.mellanox.com/message/2270

       

      Any help is appreciated.

        • Re: ESX 5.1 IPoIB driver crash
          yairi

          Hi Markus,

           

          Thank you for taking the time to post. I poked around with some smart engineers and, together with the data you provided, we have an idea of what is going on.

           

          The issue here is the SCSI mid-layer modifying the DMA device's dma_boundary attribute under IPoIB (from 64-bit down to 32-bit).

          This happens because SRP adds a new SCSI host while leaving the dma_boundary attribute of its scsi_host template at the default.

          In that case the SCSI mid-layer overrides the device's dma_boundary with the default (a 32-bit boundary), causing any IPoIB allocation that crosses a 32-bit boundary to fail and possibly crash.

           

          To avoid this problem, it is recommended to uninstall SRP (if you do not need it) using:

          $ esxcli software vib remove -n scsi-ib-srp

          $ reboot

           

           

          I hope that helps.

          Cheers!