18 Replies Latest reply on Jun 12, 2014 11:14 AM by thowie

    ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error

    xgrv

      Hello everyone,

       

      I'm using the 1.8.2.0 ib_srp drivers for ESXi 5.x on ESXi 5.0.0 build 1311175 servers, and every couple of days one of my initiators is disconnected from the storage with an error similar to "2013-11-29T14:09:51.001Z cpu36:8451)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x4125c10b3c00, 8256) to dev "eui.3731346538376162" on path "vmhba_mlx4_0.1.1:C0:T2:L4" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL". I've seen this since the 1.8.1.0 version came out, on various ESXi 5.0 builds (from U1 623860), with ConnectX-2 QDR (MT26428) and ConnectX-3 FDR10 (MT27500) HCAs, on HP and Dell blade servers, and in 8- and 16-node clusters. After a lot of digging I still have no clue what the cause could be. Please see the attached log.

       

      Modules are set as follows:

       

      ~ # esxcli system module parameters list -m ib_srp

      Name                  Type  Value  Description                                                                  

      --------------------  ----  -----  ------------------------------------------------------------------------------

      dead_state_time       int   3      Number of minutes a target can be in DEAD state before moving to REMOVED state

      debug_level           int          Set debug level (1)                                                          

      heap_initial          int          Initial heap size allocated for the driver.                                  

      heap_max              int          Maximum attainable heap size for the driver.                                 

      max_srp_targets       int   128    Max number of srp targets per scsi host (ie. HCA)                            

      max_vmhbas            int          Maximum number of vmhba(s) per physical port (0<x<8)                         

      mellanox_workarounds  int   1      Enable workarounds for Mellanox SRP target bugs if != 0                      

      srp_can_queue         int   256    Max number of commands can queue per scsi_host ie. HCA                       

      srp_cmd_per_lun       int   64     Max number of commands can queue per lun                                     

      srp_sg_tablesize      int   128    Max number of scatter lists supportted per IO - default is 32                

      topspin_workarounds   int          Enable workarounds for Topspin/Cisco SRP target bugs if != 0                 

      use_fmr               int   1      Enable/disable FMR support (1)
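To compare settings like these between hosts, the listing can be turned into a dictionary. The following is a minimal sketch, assuming the fixed-width layout shown above, where the dashed separator row marks the column boundaries. The function name `parse_module_params` is made up for illustration and is not part of any VMware or Mellanox tool.

```python
import re


def parse_module_params(text):
    """Parse `esxcli system module parameters list` output into {name: value}.

    Column boundaries are taken from the dashed separator row, so an empty
    Value cell (parameter left at its default) is returned as None.
    """
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    # The separator row is the only line made up of dashes and spaces.
    sep_idx = next(
        i for i, ln in enumerate(lines) if set(ln.strip()) <= {"-", " "}
    )
    spans = [m.span() for m in re.finditer(r"-+", lines[sep_idx])]
    params = {}
    for ln in lines[sep_idx + 1:]:
        # Only the first three columns (Name, Type, Value) are needed.
        name, type_, value = (ln[a:b].strip() for a, b in spans[:3])
        if not value:
            params[name] = None
        elif type_ == "int":
            params[name] = int(value)
        else:
            params[name] = value
    return params
```

Feeding it the output from two hosts and comparing the resulting dictionaries shows at a glance which parameters differ.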

        • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error

          Hi,

           

          Can you please increase the verbosity level and upload the logs?

          • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
            xgrv

            Thank you, rian and vlad. I've enabled debugging on my clusters and will get back to you if I get anything interesting.

            • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
              xgrv

              Please see the attached logs. The issue begins at 2013-12-13T20:28:25.785Z, with the first occurrence of the "WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:4:7 (driver name: ib_srp)" error. From this moment ESXi has queue problems ("Reduced the queue depth for device eui.3966613235306564 to 25, due to queue full/busy conditions"), and the only solution is to vMotion all the VMs off the host and reboot it. The relocated VMs don't cause this on their new hosts, and no other ESXi servers in the cluster have this problem at the same time.

                • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                  inbusiness

                  Hi!

                  This is my vSphere 5.5 configuration.

                   

                  -------------------------------

                  esxcli system module parameters set -m=ib_ipoib -p="ipoib_recvq_size=1024 ipoib_sendq_size=1024 ipoib_mac_type=0"

                  esxcli system module parameters set -m=mlx4_core -p="mtu_4k=1 msi_x=1"

                  esxcli system module parameters set -m=ib_srp -p="dead_state_time=5 max_vmhbas=1 srp_can_queue=1024 srp_cmd_per_lun=64 srp_sg_tablesize=32"

                  --------------------------------

                   

                  Can you try it?

                    • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                      xgrv

                      Hello inbusiness

                       

                      Your ib_ipoib and mlx4_core settings are identical to mine. I'll test with the different max_vmhbas and the default srp_sg_tablesize as you suggested. Thanks!

                        • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                          inbusiness

                          max_vmhbas=1

                          - This is the default value. It sets the number of vmhbas per IB port.

                           

                          srp_can_queue=1024

                          - This is the physical HCA's total QD (queue depth).

                          - Most of the latest FC HBAs can support 4096 logins.

                          - I don't know the maximum value, or the default either.

                          - 1024 is just half of what an old 4Gb FC HBA offers.

                           

                          srp_cmd_per_lun=64

                          - This is the maximum QD per LUN.

                          - The default value is 8, but I don't know the maximum value.

                          - From googling, I found that the default value is 63 on a general Linux system.

                           

                          srp_sg_tablesize=32

                          - This is the default value on vSphere and on a general Linux system.

                           

                          Queue depth is a very important factor in a vSphere environment with shared block storage. Every block storage system has a QD limit, and the number of hosts and each HBA's QD value are very important too.

                           

                          I think Mellanox should document the default and maximum value of every parameter in its manuals.
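The queue-depth reasoning above can be put into numbers. This is a rough sketch, assuming one vmhba per host and a single shared target. It takes the per-host bound from the parameter descriptions (`srp_can_queue` per HCA, `srp_cmd_per_lun` per LUN). The target-side limit and the LUN count in the usage note are hypothetical figures, not values reported in this thread.

```python
def max_outstanding_per_host(srp_can_queue, srp_cmd_per_lun, luns):
    """Upper bound on commands one host can have in flight on one HCA.

    Each LUN can hold at most srp_cmd_per_lun commands, and the HCA as a
    whole at most srp_can_queue, so the smaller of the two limits applies.
    """
    return min(srp_can_queue, srp_cmd_per_lun * luns)


def oversubscription(hosts, srp_can_queue, srp_cmd_per_lun, luns,
                     target_queue_limit):
    """Ratio of worst-case cluster demand to the target's queue limit.

    target_queue_limit is an assumed figure for the storage side; a ratio
    above 1.0 means the hosts can collectively queue more commands than
    the target accepts, which is when BUSY / QUEUE FULL responses appear.
    """
    demand = hosts * max_outstanding_per_host(
        srp_can_queue, srp_cmd_per_lun, luns
    )
    return demand / target_queue_limit
```

For example, with 16 hosts, `srp_can_queue=1024`, `srp_cmd_per_lun=64` and an assumed 8 LUNs, each host can queue up to 512 commands. Against a hypothetical target limit of 2048, `oversubscription(16, 1024, 64, 8, 2048)` gives 4.0.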

                    • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                      xgrv

                      Unfortunately nothing has changed; I still have connectivity issues with ib_srp.

                       

                      2014-01-03T05:57:25.944Z stratus203 vmkwarning: cpu39:9479)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:3 (driver name: ib_srp) - Message repeated 20 times

                      2014-01-03T05:57:25.944Z stratus203 vmkernel: cpu39:9479)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:3 (driver name: ib_srp) - Message repeated 20 times

                      2014-01-03T05:57:25.955Z stratus203 vmkernel: cpu35:8227)ScsiDeviceIO: 2311: Cmd(0x412580aae440) 0x2a, CmdSN 0x194e1d from world 9479 to dev "eui.3632656331666463" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T06:57:39.529Z stratus203 vmkwarning: cpu8:8200)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 25 times

                      2014-01-03T06:57:39.529Z stratus203 vmkernel: cpu8:8200)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 25 times

                      2014-01-03T06:57:39.529Z stratus203 vmkernel: cpu9:11701)ScsiDeviceIO: 2311: Cmd(0x4124c0fcafc0) 0x2a, CmdSN 0x800e0021 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T07:25:05.061Z stratus203 vmkwarning: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 9 times

                      2014-01-03T07:25:05.061Z stratus203 vmkernel: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 9 times

                      2014-01-03T07:25:05.063Z stratus203 vmkernel: cpu13:962303)ScsiDeviceIO: 2311: Cmd(0x4124c0c8dd00) 0x2a, CmdSN 0x800e0041 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x24 0x0.

                      2014-01-03T07:43:15.573Z stratus203 vmkwarning: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 5 times

                      2014-01-03T07:43:15.573Z stratus203 vmkernel: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 5 times

                      2014-01-03T07:43:15.575Z stratus203 vmkernel: cpu11:8203)ScsiDeviceIO: 2311: Cmd(0x4124c0fc90c0) 0x2a, CmdSN 0x800e006a from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T08:09:20.121Z stratus203 vmkwarning: cpu14:673009)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 9 times

                      2014-01-03T08:09:20.121Z stratus203 vmkernel: cpu14:673009)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 9 times

                      2014-01-03T08:09:20.122Z stratus203 vmkernel: cpu14:673009)ScsiDeviceIO: 2311: Cmd(0x4124c0bad8c0) 0x2a, CmdSN 0x800e0042 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T08:24:50.148Z stratus203 vmkwarning: cpu9:8201)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 10 times

                      2014-01-03T08:24:50.148Z stratus203 vmkernel: cpu9:8201)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 10 times

                      2014-01-03T08:24:50.149Z stratus203 vmkernel: cpu9:8201)ScsiDeviceIO: 2311: Cmd(0x4124c0eb3b80) 0x2a, CmdSN 0x800e006e from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T08:41:06.731Z stratus203 vmkwarning: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 15 times

                      2014-01-03T08:41:06.731Z stratus203 vmkernel: cpu11:8203)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 15 times

                      2014-01-03T08:41:06.732Z stratus203 vmkernel: cpu11:8203)ScsiDeviceIO: 2311: Cmd(0x4124c00fa7c0) 0x2a, CmdSN 0x800e0028 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T08:56:42.696Z stratus203 vmkwarning: cpu10:8202)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 7 times

                      2014-01-03T08:56:42.696Z stratus203 vmkernel: cpu10:8202)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 7 times

                      2014-01-03T08:56:42.696Z stratus203 vmkernel: cpu15:11702)ScsiDeviceIO: 2311: Cmd(0x4124c13ae5c0) 0x2a, CmdSN 0x800e0015 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T09:11:44.233Z stratus203 vmkwarning: cpu9:8201)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 13 times

                      2014-01-03T09:11:44.233Z stratus203 vmkernel: cpu9:8201)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 13 times

                      2014-01-03T09:11:44.234Z stratus203 vmkernel: cpu9:8201)ScsiDeviceIO: 2311: Cmd(0x4124c121b540) 0x2a, CmdSN 0x800e0003 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T09:33:48.769Z stratus203 vmkwarning: cpu12:8204)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 5 times

                      2014-01-03T09:33:48.769Z stratus203 vmkernel: cpu12:8204)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:4 (driver name: ib_srp) - Message repeated 5 times

                      2014-01-03T09:33:48.770Z stratus203 vmkernel: cpu12:8204)ScsiDeviceIO: 2311: Cmd(0x4124c0bab1c0) 0x2a, CmdSN 0x800e000f from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T09:51:31.157Z stratus203 vmkwarning: cpu50:8344)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:1 (driver name: ib_srp) - Message repeated 33 times

                      2014-01-03T09:51:31.157Z stratus203 vmkernel: cpu50:8344)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:1 (driver name: ib_srp) - Message repeated 33 times

                      2014-01-03T09:51:31.158Z stratus203 vmkernel: cpu54:8246)ScsiDeviceIO: 2311: Cmd(0x4126009b0980) 0x2a, CmdSN 0x3bb009 from world 8344 to dev "eui.623233346565652d" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T09:51:31.550Z stratus203 vmkernel: cpu51:8243)ScsiDeviceIO: 2311: Cmd(0x41260097df80) 0x2a, CmdSN 0x3bb0dd from world 8344 to dev "eui.623233346565652d" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T09:51:31.565Z stratus203 vmkernel: cpu54:8246)ScsiDeviceIO: 2311: Cmd(0x412600d83f40) 0x2a, CmdSN 0x3bb0e4 from world 8344 to dev "eui.623233346565652d" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T10:02:36.464Z stratus203 vmkernel: cpu8:8200)ScsiDeviceIO: 2311: Cmd(0x4124c07ee1c0) 0x2a, CmdSN 0x800e0023 from world 13211 to dev "eui.3435613932663332" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T10:02:36.465Z stratus203 vmkernel: cpu8:8200)ScsiDeviceIO: 2311: Cmd(0x4124c13acbc0) 0x2a, CmdSN 0x800e000f from world 13211 to dev "eui.3435613932663332" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T10:02:36.465Z stratus203 vmkernel: cpu8:8200)ScsiSched: 2147: Reduced the queue depth for device eui.3435613932663332 to 28, due to queue full/busy conditions. The queue depth could be reduced further if the condition persists.

                      2014-01-03T10:07:01.466Z stratus203 vmkwarning: cpu16:8310)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:2 (driver name: ib_srp) - Message repeated 124 times

                      2014-01-03T10:07:01.466Z stratus203 vmkernel: cpu16:8310)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:2 (driver name: ib_srp) - Message repeated 124 times

                      2014-01-03T10:07:01.467Z stratus203 vmkernel: cpu19:8211)ScsiDeviceIO: 2311: Cmd(0x4125009f6700) 0x2a, CmdSN 0x597ab3 from world 8310 to dev "eui.3138383164363939" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x25 0x0.

                      2014-01-03T10:11:54.815Z stratus203 vmkernel: cpu15:8207)ScsiDeviceIO: 2311: Cmd(0x4124c0c29580) 0x2a, CmdSN 0x800e004f from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T10:11:54.816Z stratus203 vmkernel: cpu15:8207)ScsiDeviceIO: 2311: Cmd(0x4124c00e6300) 0x2a, CmdSN 0x800e0056 from world 11698 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                      2014-01-03T10:11:54.817Z stratus203 vmkernel: cpu15:8207)ScsiSched: 2147: Reduced the queue depth for device eui.3731346538376162 to 1, due to queue full/busy conditions. The queue depth could be reduced further if the condition persists.
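A log excerpt like this is easier to read once it is aggregated per path and per device. Below is a minimal sketch, assuming the message formats shown above. Each warning appears twice (once tagged vmkwarning, once vmkernel), so entries with the same timestamp, path and status are counted once. Treating "Message repeated N times" as N failures is an assumption about how vmkernel throttles the message. The function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

QUEUE_FAIL = re.compile(
    r"queuecommand failed with status = (0x[0-9a-fA-F]+).*?"
    r"(vmhba\S+) \(driver name: (\S+)\)"
    r"(?: - Message repeated (\d+) times)?"
)
DEPTH = re.compile(r"Reduced the queue depth for device (\S+) to (\d+)")


def summarize(lines):
    """Return (failures per path, last reduced queue depth per device)."""
    seen = set()
    failures = Counter()
    depth = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stamp = line.split()[0]  # ISO timestamp is the first token
        m = QUEUE_FAIL.search(line)
        if m:
            status, path, _driver, repeated = m.groups()
            key = (stamp, path, status)
            if key not in seen:  # skip the vmkwarning/vmkernel duplicate
                seen.add(key)
                failures[path] += int(repeated) if repeated else 1
            continue
        m = DEPTH.search(line)
        if m:
            depth[m.group(1)] = int(m.group(2))
    return failures, depth
```

Calling `summarize(open("vmkernel.log"))` on a saved log returns a `Counter` keyed by path and a dictionary with the most recent throttled queue depth for each device.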

                      • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                        xgrv

                        The ESXi host with the queue problems has been in Maintenance Mode for 4 hours now, and after I issue an "esxcli storage core adapter rescan --all" command, this shows up in the vmkernel.log file:

                         

                        2014-01-03T14:10:11.389Z cpu56:14437)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:6 (driver name: ib_srp) - Message repeated 561 times

                        2014-01-03T14:11:31.334Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c0c0b300) 0x28, CmdSN 0x2b from world 1067687 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                        2014-01-03T14:11:31.366Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c084bf40) 0x28, CmdSN 0x2f from world 1067687 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x25 0x0.

                        2014-01-03T14:11:31.388Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c00e4fc0) 0x28, CmdSN 0x30 from world 1067687 to dev "eui.3731346538376162" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                        2014-01-03T14:11:31.755Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c0d69000) 0x28, CmdSN 0x32 from world 1067687 to dev "eui.3632656331666463" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x25 0x0.

                         

                        It looks like the ib_srp driver is stuck with some SCSI commands, which are retried even after the VM world is gone.

                          • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                            inbusiness

                            Okay!

                             

                            01. What's your HCA model and firmware?

                             

                            02. What's your IB switch?

                             

                            03. What's your IB SM?

                                  Does it reside on the switch or on a dedicated Linux host?

                                  Also, can you show me your SM configuration and ESXi configuration?

                             

                            04. What's your IB cable vendor and model?

                             

                            05. What's your SRP target and OS?

                                  And can you show me the number of LUNs per target?

                             

                            06. How many ESXi hosts do you have?

                             

                            I think there is a QUEUE FULL problem in your environment.

                             

                            I also have a question about the maximum value of each SRPT parameter.

                             

                            I'm using OmniOS ZFS SRPT now.

                             

                            If you give me some information about your setup, I'll try to help you solve your problems...

                              • Re: Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                                yairi

                                Hi Konrad,

                                I discussed your issue with some smart fellows from Mellanox. This is what they had to say about it:

                                - There is a similar internal record for this kind of issue, which was also discussed in the communities. That issue had to do with the SRP host receiving SCSI error status (0x18) for read/write commands and 0x2 (check condition) for REPORT LUNS commands.

                                The target also slowed down on some of the LUNs, resulting in SRP host aborts, which vmkernel reported as error "H:0x5 D:0x0 P:0x0" (H:0x5 is SG_ERR_DID_ABORT).

                                That specific user also noticed the following errors in the syslog:

                                2013-12-15T05:46:50.282Z stratus105.api.oktawave.corp vmkwarning: cpu11:4107)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:4:8 (driver name: ib_srp) - Message repeated 5956 times

                                For this particular error, the VMware community also reports the same issue with a different transport (SATA), caused by a busy LUN/device on ESX 5.1: https://communities.vmware.com/thread/445270

                                However, the resolution provided there is to downgrade to ESX 5.0 U2, which does not help.

                                 

                                Please also look at this VMware KB article, which explains the "H:0x5 D:0x0 P:0x0" error.

                                http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=289902


                                You didn't get a solution from me, and you may already have the pointers I provided. Sorry I couldn't help further. Let the group know if you find a solution or a workaround for this issue.


                                Thanks
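The status triplets quoted in this thread can be decoded mechanically. This is a minimal sketch that pulls the H:/D:/P: fields and the SCSI opcode out of a vmkernel line. The device status names are the standard SCSI status codes, and the opcodes 0x28 and 0x2a are READ(10) and WRITE(10). The host status names follow the Linux DID_* numbering, which matches H:0x5 being an abort as stated above. The table is deliberately partial. The `MLQUEUE` names are the Linux mid-layer constants, and it is an assumption that the ESXi Linux compatibility layer uses 0x1056 with the same meaning.

```python
import re

# Partial tables: only codes that appear in this thread or are well known.
HOST_STATUS = {0x0: "OK", 0x1: "NO_CONNECT", 0x2: "BUS_BUSY",
               0x3: "TIMEOUT", 0x5: "ABORT", 0x7: "ERROR", 0x8: "RESET"}
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
                 0x18: "RESERVATION CONFLICT", 0x28: "TASK SET FULL"}
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
# Linux mid-layer queuecommand return codes (assumed to carry over to ESXi).
MLQUEUE = {0x1055: "SCSI_MLQUEUE_HOST_BUSY",
           0x1056: "SCSI_MLQUEUE_DEVICE_BUSY"}

STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)
# Matches both "Cmd 0x2a (...)" (NMP) and "Cmd(0x...) 0x2a," (ScsiDeviceIO).
CMD_RE = re.compile(r"Cmd(?:\(0x[0-9a-fA-F]+\))? (0x[0-9a-fA-F]+)")


def _name(table, code):
    return table.get(code, "unknown (0x%x)" % code)


def decode(line):
    """Return the decoded opcode and host/device/plugin status, or None."""
    m = STATUS_RE.search(line)
    if not m:
        return None
    host, device, plugin = (int(x, 16) for x in m.groups())
    cmd = CMD_RE.search(line)
    opcode = int(cmd.group(1), 16) if cmd else None
    return {
        "opcode": _name(OPCODES, opcode) if opcode is not None else None,
        "host": _name(HOST_STATUS, host),
        "device": _name(DEVICE_STATUS, device),
        "plugin": plugin,
    }
```

Read this way, the original "H:0x5 D:0x0 P:0x0" is a host-side abort of a WRITE(10), while the later "H:0x0 D:0x8 P:0x0" lines report a BUSY device status with no host-side error.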

                            • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                              xgrv

                              Hello everyone,

                               

                              Thank you very much for the time you've spent trying to help. I'm not 100% sure, but I've probably found the solution for the "H:0x0 D:0x8 P:0x0" errors and the weird HCA behaviour, where errors occurred without any significant load on the storage or the target. Stress testing with fio never showed any problems, which is why I suspected that the hangs in the BUSY state were caused by something else. I'm using the SCST target stack, and I recently used ibdump to find out that the cause was indeed limited to the initiator side. My mistake was setting up the blades to use OS Control for power management: even with the policy set to High Performance in ESXi, the servers were throttling into C1/C1E states and probably interfering with PCI Express power management too. This is supposed to affect only latency, but explicitly disabling these features in the BIOS made my logs clean. Only 7 days have passed since the change, so it's still too early to be certain, but I have a good feeling.

                               

                              Thanks again!

                                • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                                  yairi

                                  Sounds good, my friend! Happy to see you were able to work around the issue.

                                  • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error

                                    Hello,

                                    Did the workaround help? I have the same error on ESXi 5.5.

                                     

                                    the vmkernel.log shows:

                                     

                                    2014-05-09T09:39:15.860Z cpu14:32819)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

                                    2014-05-09T09:39:25.950Z cpu12:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

                                    2014-05-09T09:39:32.602Z cpu12:32817)ScsiDeviceIO: 2324: Cmd(0x413680b08cc0) 0x28, CmdSN 0x10aa from world 35932 to dev "naa.600144f098a757880dff532425080001" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                                    2014-05-09T09:39:36.050Z cpu12:33445)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

                                    2014-05-09T09:39:46.141Z cpu14:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

                                    2014-05-09T09:39:48.304Z cpu18:32823)ScsiDeviceIO: 2324: Cmd(0x413680b08cc0) 0x28, CmdSN 0x10b1 from world 35932 to dev "naa.600144f098a757880dff532425080001" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                                    2014-05-09T09:39:51.006Z cpu22:32827)ScsiDeviceIO: 2324: Cmd(0x413680b08cc0) 0x28, CmdSN 0x10b2 from world 35932 to dev "naa.600144f098a757880dff532425080001" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                                    2014-05-09T09:39:56.260Z cpu14:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

                                    2014-05-09T09:40:06.351Z cpu18:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

                                    2014-05-09T09:40:16.441Z cpu12:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

                                    2014-05-09T09:40:26.532Z cpu17:32822)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

                                    2014-05-09T09:40:30.157Z cpu16:32821)ScsiDeviceIO: 2324: Cmd(0x4136856b1080) 0x28, CmdSN 0x10c0 from world 35932 to dev "naa.600144f098a757880dff532425080001" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

                                     

                                    and in the vmkwarning.log i see:

                                     

                                    2014-05-09T08:48:28.140Z cpu12:39950)WARNING: VSCSIFilter: 1428: Failed to issue ioctl to get unmap readback type: Inappropriate ioctl for device

                                    2014-05-09T08:53:38.629Z cpu0:4537932)WARNING: Hbr: 863: Failed to receive from 192.168.70.97 (groupID=GID-c4550a37-b6f0-41f6-b853-42ba3e028895): Broken pipe

                                    2014-05-09T09:00:18.639Z cpu15:34122)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 77318 times

                                    2014-05-09T09:15:18.651Z cpu16:4168540)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 14515 times

                                     

                                     

                                    The ESXi server controls the power management, and in the BIOS the ACPI C-states and P-states are enabled...

                                    I hope you have some information.

                                    THX

                                  • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error
                                    xgrv

                                    Hello again. I've managed to find a way to quickly trigger the "H:0x0 D:0x8 P:0x0" errors with some stress testing using fio. Try deploying 2-4 VMs and running the 'fio --verify=md5 --rw=write --size=8000m --bs=4k --loops=60 --runtime=60m --group_reporting --sync=1 --direct=1 --directory=/mnt/sdb1 --ioengine=libaio --numjobs=32 --thread --name=srp' command on them. The communication will work fine for 10-30 minutes, but after that you should see the 0x8 flood in the ESXi logs and the queue depths jumping. At first I was convinced that the "DEVICE BUSY" came from the target, which has logic to trigger such SCSI responses when the req_lim for the LUN is exceeded. But after dumping the traffic with the ibdump tool, it seems that no such responses are sent between the target and initiator LIDs. You can look for them yourself in the pcap dump files using Wireshark with the "infiniband.bth.opcode == 4 && data.data[0] == c1 && data.data[19] != 0" filter.
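The Wireshark filter selects SRP_RSP information units (first payload byte 0xc1) whose status byte at offset 19 is non-zero. The same check can be applied to raw payload bytes. This is a minimal sketch, assuming the SRP_RSP header layout used by the Linux `struct srp_rsp` (opcode, sol_not, 2 reserved bytes, 32-bit req_lim_delta, 64-bit tag, 2 reserved bytes, flags, status). The function names are made up for illustration.

```python
import struct

SRP_RSP = 0xC1  # opcode of an SRP response information unit


def parse_srp_rsp(payload):
    """Decode the first 20 bytes of an SRP_RSP, or return None.

    Layout (big-endian): opcode(1) sol_not(1) reserved(2)
    req_lim_delta(4, signed) tag(8) reserved(2) flags(1) status(1).
    """
    if len(payload) < 20 or payload[0] != SRP_RSP:
        return None
    _opcode, _sol_not, req_lim_delta, tag, flags, status = struct.unpack_from(
        ">BB2xiQ2xBB", payload
    )
    return {
        "req_lim_delta": req_lim_delta,
        "tag": tag,
        "flags": flags,
        "status": status,
    }


def is_error_response(payload):
    """Same condition as the filter: SRP_RSP with a non-zero SCSI status."""
    rsp = parse_srp_rsp(payload)
    return rsp is not None and rsp["status"] != 0
```

A response carrying status 0x08 (BUSY) from the target would make `is_error_response` return True. Finding none in the capture is what points away from the target.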

                                     

                                    Changing the ib_srp module parameters on ESXi doesn't help, while using a Linux initiator instead of ESXi triggers no such errors.

                                     

                                    The "H:0x5 D:0x0 P:0x0" error was corrected by optimizing the latency, but "H:0x0 D:0x8 P:0x0" looks like an ESXi module bug in how the initiator tracks SRP credits, according to a friend who helped me track down the issue. I hope it will be corrected in the upcoming ib_iser module.

                                     

                                    regards
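For readers unfamiliar with SRP credits, here is a toy model of the accounting an initiator has to do. It is only a sketch of the general mechanism: the target grants a request limit at login, each request sent consumes one credit, and each response returns credits through its req_lim_delta field. It is not the ESXi driver's code, and the class and method names are made up for illustration.

```python
class SrpCreditTracker:
    """Toy model of initiator-side SRP request-limit (credit) accounting."""

    def __init__(self, initial_req_lim):
        # Credits granted by the target at login (hypothetical value).
        self.req_lim = initial_req_lim

    def try_send(self):
        """Consume one credit; return False if none are left.

        A real initiator would report "busy" to the SCSI layer here so
        that the command is retried later.
        """
        if self.req_lim <= 0:
            return False
        self.req_lim -= 1
        return True

    def on_response(self, req_lim_delta):
        """Give back the credits carried by a response."""
        self.req_lim += req_lim_delta
```

If an initiator consumed credits on send but failed to add the delta back on some responses, `req_lim` would drift down to zero and every later `try_send` would fail even though the target is idle. That would match busy errors on the host with no BUSY responses on the wire, though whether this is what happens in the ESXi module remains the poster's hypothesis.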

                                    • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error

                                      Hello xgrv,

                                      Thanks for your reply. You are crazy ^^ I will test it with fio in the next few days... I use 3 Solaris ZFS targets with IB for 4 ESXi 5.5 servers. My parameters look like this:

                                       

                                      ~ #  esxcli system module parameters list -m ib_srp

                                      Name                  Type  Value  Description                                                                  

                                      --------------------  ----  -----  ------------------------------------------------------------------------------

                                      dead_state_time       int   5      Number of minutes a target can be in DEAD state before moving to REMOVED state

                                      debug_level           int          Set debug level (1)                                                          

                                      heap_initial          int          Initial heap size allocated for the driver.                                  

                                      heap_max              int          Maximum attainable heap size for the driver.                                 

                                      max_srp_targets       int          Max number of srp targets per scsi host (ie. HCA)                            

                                      max_vmhbas            int   1      Maximum number of vmhba(s) per physical port (0<x<8)                         

                                      mellanox_workarounds  int          Enable workarounds for Mellanox SRP target bugs if != 0                      

                                      srp_can_queue         int   1024   Max number of commands can queue per scsi_host ie. HCA                       

                                      srp_cmd_per_lun       int   64     Max number of commands can queue per lun                                     

                                      srp_sg_tablesize      int   32     Max number of scatter lists supportted per IO - default is 32                

                                      topspin_workarounds   int          Enable workarounds for Topspin/Cisco SRP target bugs if != 0                 

                                      use_fmr               int          Enable/disable FMR support (1)

                                       

                                      What did you mean by ""H:0x5 D:0x0 P:0x0 error" was corrected by optimizing the latency"?

                                      I think I had a similar problem with FC cards in ESXi. That was the reason we changed to InfiniBand, and it is great. Hopefully we will find a solution for this nasty problem.

                                      Thank you very much, xgrv. If I can help you, I will do what I can.

                                      Thanks

                                      Thomas

                                        • Re: ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error

                                          I've found another log line that looks interesting.

                                          2014-05-28T16:25:43.094Z cpu3:4727885)WARNING: LinuxSocket: 1854: UNKNOWN/UNSUPPORTED socketcall op (whichCall=0x12, args@0xffff8d3c)

                                          2014-05-28T16:32:47.418Z cpu5:4261167)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 137263 times

                                          2014-05-28T16:48:05.196Z cpu15:1214355)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 88409 times

                                          2014-05-28T17:03:05.197Z cpu7:32812)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 127160 times

                                          2014-05-28T17:07:47.885Z cpu11:4734778)WARNING: UserEpoll: 542: UNSUPPORTED events 0x40

                                          2014-05-28T17:07:48.729Z cpu0:4734778)WARNING: LinuxSocket: 1854: UNKNOWN/UNSUPPORTED socketcall op (whichCall=0x12, args@0xffdd3c9c)

                                          2014-05-28T17:18:05.205Z cpu8:36687)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 133406 times

                                          2014-05-28T17:33:05.221Z cpu5:45877)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 130912 times

                                          2014-05-28T17:48:05.235Z cpu2:47863)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 19074 times

                                          2014-05-28T18:03:23.899Z cpu11:4738029)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 119658 times
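To get a feel for how often the initiator rejects commands, the "Message repeated N times" counters in lines like the ones above can be summed per path. This is only a rough sketch that assumes the exact message layout shown in this post; `summarize_queue_failures` is a hypothetical helper name, and it only counts the repeats the log reports, not the first occurrence of each message.

```python
import re
from collections import defaultdict

# Assumes the vmkernel message layout shown in this thread.
REPEAT_RE = re.compile(
    r"queuecommand failed with status = (0x[0-9a-fA-F]+) .*?"
    r"(vmhba\S+) \(driver name: ([^)\s]+)\) - Message repeated (\d+) times"
)


def summarize_queue_failures(lines):
    """Sum the repeat counters per (driver, path, status) tuple."""
    totals = defaultdict(int)
    for line in lines:
        match = REPEAT_RE.search(line)
        if match is None:
            continue  # not a queuecommand failure line
        status, path, driver, count = match.groups()
        totals[(driver, path, status)] += int(count)
    return dict(totals)


if __name__ == "__main__":
    import sys

    # Usage (the log path is up to you): python summarize.py < vmkernel.log
    for key, total in summarize_queue_failures(sys.stdin).items():
        print(key, total)
```

Feeding it a vmkernel log on stdin prints one total per driver, path, and status code, which makes it easier to compare hosts or time windows.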

                                           

                                          ^^

                                          Bye

                                          Thomas