4 Replies Latest reply on Mar 4, 2017 11:32 AM by t3chyphil

    Storage Spaces Direct Windows Server 2016 (1607) BSOD - Mellanox ConnectX-3 Pro (Dell)

    t3chyphil

      Good afternoon,

       

      There is very little documentation specific to Windows Server 2016; much of the RDMA/RoCE documentation refers to Windows Server 2012 (R2) Storage Spaces. So I figured I'd start a conversation here to help others who are also looking at Microsoft Storage Spaces Direct (S2D) in Windows Server 2016.

       

      I currently have an open case with Dell ProSupport regarding a BSOD that my two-node cluster encounters. When I stress test the environment, either node will halt and then restart after 60 seconds. Each server is configured as follows...

      • Dell 13th Gen R730XD
      • 2x 120GB Intel SSDs SSDSC2BB120G6R (OS Mirror)
      • 6x 1.6TB SSDs SSDSC2BX016T4R
      • 6x 8TB HDDs ST8000NM0055-1RM112
      • 2x Intel DC P3700 800GB (Journal / Cache)
      • 256GB 2400MHz Memory
      • HBA330 Mini Controller
      • 1x Mellanox ConnectX-3 Pro (MT04103) Dual Port SFP+ 10GbE (Firmware Version: 2.26.50.80 / Driver Version: 2.25.12665.0)
      • Running Windows Server 2016 Datacenter 1607 Build 14393.693

       

      Each server has two links to a Dell N4032F Switch.

      To rule out a possible fault with my switch config, Dell advised that I connect the two nodes directly to each other. I know RDMA is engaged because I can see the traffic in Performance Monitor.

       

      Here's the order in which I've set up my environment...

      1. Install the OS and fully update/patch
      2. Set Windows Power Mode to Performance
      3. Install Windows Features - Hyper-V / File-Services / Failover-Clustering / Data-Center-Bridging
      4. Install Dell drivers for all hardware, including the Mellanox NICs. (I've tried both the Mellanox drivers and Dell's; they appear to be the same: MLNX_VPI_WinOF-5_25_All_Win2016_x64 / Driver Version: 2.25.12665.0)
      5. Perform the network configuration: create a Hyper-V SET switch bound to both ports of the Mellanox NIC, then create two vNICs connected to the new switch with a VLAN tag. (See attached file)
      6. Create the failover cluster and enable Storage Spaces Direct. (See attached file)

       

      Everything appears to be okay, and then a node will randomly crash. Below is the bugcheck analysis from a memory dump; I get the same result on either host. I want to upgrade the firmware, but the card has a Dell product code, so I'm stuck. It's been three weeks and we still don't have a working environment. I also have another debug output further below...

      *******************************************************************************

      *                                                                             *

      *                        Bugcheck Analysis                                    *

      *                                                                             *

      *******************************************************************************

       

      DRIVER_POWER_STATE_FAILURE (9f)

      A driver has failed to complete a power IRP within a specific time.

      Arguments:

      Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time

      Arg2: ffffa48778febe20, Physical Device Object of the stack

      Arg3: ffffc080258f4960, nt!TRIAGE_9F_POWER on Win7 and higher, otherwise the Functional Device Object of the stack

      Arg4: ffff9c8fe2328010, The blocked IRP

       

      Debugging Details:

      ------------------

       

      Implicit thread is now ffff9c8f`e23a8080

       

      DUMP_CLASS: 1

       

      DUMP_QUALIFIER: 401

       

      BUILD_VERSION_STRING:  14393.693.amd64fre.rs1_release.161220-1747

       

      SYSTEM_MANUFACTURER:  Dell Inc.

       

      SYSTEM_PRODUCT_NAME:  PowerEdge R730xd

       

      SYSTEM_SKU:  SKU=NotProvided;ModelName=PowerEdge R730xd

       

      BIOS_VENDOR:  Dell Inc.

       

      BIOS_VERSION:  2.3.4

       

      BIOS_DATE:  11/08/2016

       

      BASEBOARD_MANUFACTURER:  Dell Inc.

       

      BASEBOARD_PRODUCT:  0WCJNT

       

      BASEBOARD_VERSION:  A04

       

      DUMP_TYPE:  1

       

      BUGCHECK_P1: 3

       

      BUGCHECK_P2: ffffa48778febe20

       

      BUGCHECK_P3: ffffc080258f4960

       

      BUGCHECK_P4: ffff9c8fe2328010

       

      DRVPOWERSTATE_SUBCODE:  3

       

      FAULTING_THREAD:  e23a8080

       

      CPU_COUNT: 38

       

      CPU_MHZ: 960

       

      CPU_VENDOR:  GenuineIntel

       

      CPU_FAMILY: 6

       

      CPU_MODEL: 4f

       

      CPU_STEPPING: 1

       

      CPU_MICROCODE: 6,4f,1,0 (F,M,S,R)  SIG: B00001E'00000000 (cache) B00001E'00000000 (init)

       

      DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

       

      BUGCHECK_STR:  0x9F

       

      PROCESS_NAME:  System

       

      CURRENT_IRQL:  2

       

      ANALYSIS_SESSION_HOST:  PHALFORDPC

       

      ANALYSIS_SESSION_TIME:  01-26-2017 10:07:27.0372

       

      ANALYSIS_VERSION: 10.0.14321.1024 amd64fre

       

      LAST_CONTROL_TRANSFER:  from fffff800d1ce5f5c to fffff800d1dcf506

       

      STACK_TEXT: 

      ffffc080`2afcd6a0 fffff800`d1ce5f5c : 00000000`00000000 00000000`00000001 ffffa487`79d23801 fffff800`d1d47359 : nt!KiSwapContext+0x76

      ffffc080`2afcd7e0 fffff800`d1ce59ff : ffffa487`70040100 00000000`00000000 00000000`00000000 fffff800`00000000 : nt!KiSwapThread+0x17c

      ffffc080`2afcd890 fffff800`d1ce77c7 : ffffc080`00000000 fffff80d`41a33a01 ffffa487`70040130 00000000`00000000 : nt!KiCommitThreadWait+0x14f

      ffffc080`2afcd930 fffff80d`41a0aaba : ffffa487`790a6c90 ffffa487`00000000 fffff80d`41a44000 ffffa487`00000000 : nt!KeWaitForSingleObject+0x377

      ffffc080`2afcd9e0 fffff80d`3b05debf : 00000000`00000000 00000000`00000006 ffffa487`78fd3980 fffff80d`3b428bf9 : mlx4eth63+0x4aaba

      ffffc080`2afcda30 fffff80d`3b0f6f80 : ffffa487`71c971a0 00000000`00000000 ffff9c8f`e2328010 00000000`00000000 : NDIS!ndisMInvokeShutdown+0x53

      ffffc080`2afcda60 fffff80d`3b0b910a : ffffa487`71c971a0 00000000`00000000 0000007f`fffffff8 ffff9c8e`c5249bb0 : NDIS!ndisMShutdownMiniport+0xb4

      ffffc080`2afcda90 fffff80d`3b09d342 : 00000000`00000000 00000000`00000000 ffff9c8f`e2328010 ffffa487`71c971a0 : NDIS!ndisSetSystemPower+0x1bdc6

      ffffc080`2afcdb10 fffff80d`3b01fc28 : ffff9c8f`e2328010 ffffa487`78febe20 ffff9c8f`e2328200 ffffa487`71c97050 : NDIS!ndisSetPower+0x96

      ffffc080`2afcdb40 fffff800`d1d9a1c2 : ffff9c8f`e23a8080 ffffc080`2afcdbf0 fffff800`d1f80600 ffffa487`71c97050 : NDIS!ndisPowerDispatch+0xa8

      ffffc080`2afcdb70 fffff800`d1c82729 : ffffffff`fa0a1f00 fffff800`d1d99fe4 ffff9c8e`c9cb8120 00000000`000001d1 : nt!PopIrpWorker+0x1de

      ffffc080`2afcdc10 fffff800`d1dcfbb6 : ffffc080`25955180 ffff9c8f`e23a8080 fffff800`d1c826e8 00000000`00000000 : nt!PspSystemThreadStartup+0x41

      ffffc080`2afcdc60 00000000`00000000 : ffffc080`2afce000 ffffc080`2afc8000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

       

       

      STACK_COMMAND:  .thread 0xffff9c8fe23a8080 ; kb

       

      THREAD_SHA1_HASH_MOD_FUNC:  b7cf6cc0234897f6fd93ad4ead1f75c9e7fd9df1

       

      THREAD_SHA1_HASH_MOD_FUNC_OFFSET:  263f1d39481efd9f34c4df5786cc37534825cc6e

       

      THREAD_SHA1_HASH_MOD:  1de60aba82b9f9b6af56a445a099815cd801e5d9

       

      FOLLOWUP_IP:

      mlx4eth63+4aaba

      fffff80d`41a0aaba 488d152f050300  lea     rdx,[mlx4eth63+0x7aff0 (fffff80d`41a3aff0)]

       

      FAULT_INSTR_CODE:  2f158d48

       

      SYMBOL_STACK_INDEX:  4

       

      SYMBOL_NAME:  mlx4eth63+4aaba

       

      FOLLOWUP_NAME:  MachineOwner

       

      MODULE_NAME: mlx4eth63

       

      IMAGE_NAME:  mlx4eth63.sys

       

      DEBUG_FLR_IMAGE_TIMESTAMP:  57c2dc3b

       

      BUCKET_ID_FUNC_OFFSET:  4aaba

       

      FAILURE_BUCKET_ID:  0x9F_3_POWER_DOWN_mlx4eth63!unknown_function

       

      BUCKET_ID:  0x9F_3_POWER_DOWN_mlx4eth63!unknown_function

       

      PRIMARY_PROBLEM_CLASS:  0x9F_3_POWER_DOWN_mlx4eth63!unknown_function

       

      TARGET_TIME:  2017-01-26T09:54:25.000Z

       

      OSBUILD:  14393

       

      OSSERVICEPACK:  0

       

      SERVICEPACK_NUMBER: 0

       

      OS_REVISION: 0

       

      SUITE_MASK:  400

       

      PRODUCT_TYPE:  3

       

      OSPLATFORM_TYPE:  x64

       

      OSNAME:  Windows 10

       

      OSEDITION:  Windows 10 Server TerminalServer DataCenter SingleUserTS

       

      OS_LOCALE: 

       

      USER_LCID:  0

       

      OSBUILD_TIMESTAMP:  2016-12-21 06:50:57

       

      BUILDDATESTAMP_STR:  161220-1747

       

      BUILDLAB_STR:  rs1_release

       

      BUILDOSVER_STR:  10.0.14393.693.amd64fre.rs1_release.161220-1747

       

      ANALYSIS_SESSION_ELAPSED_TIME: 6ba

       

      ANALYSIS_SOURCE:  KM

       

      FAILURE_ID_HASH_STRING:  km:0x9f_3_power_down_mlx4eth63!unknown_function

       

      FAILURE_ID_HASH:  {476104f0-13a3-bd96-8e08-ff1f10ccd888}

       

      Followup:     MachineOwner

      This is another one...

       

       

      Microsoft (R) Windows Debugger Version 10.0.14321.1024 AMD64

      Copyright (c) Microsoft Corporation. All rights reserved.

       

       

       

       

      Loading Dump File [D:\MEMORY.DMP]

      Kernel Bitmap Dump File: Kernel address space is available, User address space may not be available.

       

       

      Symbol search path is: srv*

      Executable search path is:

      Windows 10 Kernel Version 14393 MP (56 procs) Free x64

      Product: Server, suite: TerminalServer DataCenter SingleUserTS

      Built by: 14393.693.amd64fre.rs1_release.161220-1747

      Machine Name:

      Kernel base = 0xfffff801`96a11000 PsLoadedModuleList = 0xfffff801`96d16060

      Debug session time: Fri Jan 20 16:16:45.177 2017 (UTC + 0:00)

      System Uptime: 0 days 1:40:08.946

      Loading Kernel Symbols

      ...............................................................

      ................................................................

      ..............................................

      Loading User Symbols

       

       

      Loading unloaded module list

      ..............................................

      *******************************************************************************

      *                                                                             *

      *                        Bugcheck Analysis                                    *

      *                                                                             *

      *******************************************************************************

       

       

      Use !analyze -v to get detailed debugging information.

       

       

      BugCheck 133, {1, 1e00, 0, 0}

       

       

      Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

      Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

      Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

      Probably caused by : mrxsmb.sys ( mrxsmb!SmbWskSend+1f2 )

       

       

      Followup:     MachineOwner

      ---------

       

       

      53: kd> !analyze -v

      *******************************************************************************

      *                                                                             *

      *                        Bugcheck Analysis                                    *

      *                                                                             *

      *******************************************************************************

       

       

      DPC_WATCHDOG_VIOLATION (133)

      The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL

      or above.

      Arguments:

      Arg1: 0000000000000001, The system cumulatively spent an extended period of time at

        DISPATCH_LEVEL or above. The offending component can usually be

        identified with a stack trace.

      Arg2: 0000000000001e00, The watchdog period.

      Arg3: 0000000000000000

      Arg4: 0000000000000000

       

       

      Debugging Details:

      ------------------

       

       

      Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

      Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

      Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

       

       

      DUMP_CLASS: 1

       

       

      DUMP_QUALIFIER: 401

       

       

      BUILD_VERSION_STRING:  14393.693.amd64fre.rs1_release.161220-1747

       

       

      SYSTEM_MANUFACTURER:  Dell Inc.

       

       

      SYSTEM_PRODUCT_NAME:  PowerEdge R730xd

       

       

      SYSTEM_SKU:  SKU=NotProvided;ModelName=PowerEdge R730xd

       

       

      BIOS_VENDOR:  Dell Inc.

       

       

      BIOS_VERSION:  2.3.4

       

       

      BIOS_DATE:  11/08/2016

       

       

      BASEBOARD_MANUFACTURER:  Dell Inc.

       

       

      BASEBOARD_PRODUCT:  0WCJNT

       

       

      BASEBOARD_VERSION:  A04

       

       

      DUMP_TYPE:  1

       

       

      BUGCHECK_P1: 1

       

       

      BUGCHECK_P2: 1e00

       

       

      BUGCHECK_P3: 0

       

       

      BUGCHECK_P4: 0

       

       

      DPC_TIMEOUT_TYPE:  DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED

       

       

      CPU_COUNT: 38

       

       

      CPU_MHZ: 960

       

       

      CPU_VENDOR:  GenuineIntel

       

       

      CPU_FAMILY: 6

       

       

      CPU_MODEL: 4f

       

       

      CPU_STEPPING: 1

       

       

      CPU_MICROCODE: 6,4f,1,0 (F,M,S,R)  SIG: B00001E'00000000 (cache) B00001E'00000000 (init)

       

       

      DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

       

       

      BUGCHECK_STR:  0x133

       

       

      PROCESS_NAME:  System

       

       

      CURRENT_IRQL:  d

       

       

      ANALYSIS_SESSION_HOST:  PHALFORDPC

       

       

      ANALYSIS_SESSION_TIME:  01-22-2017 02:23:17.0663

       

       

      ANALYSIS_VERSION: 10.0.14321.1024 amd64fre

       

       

      LAST_CONTROL_TRANSFER:  from fffff80196bb1000 to fffff80196b5b6f0

       

       

      STACK_TEXT: 

      ffffdb80`5a305d88 fffff801`96bb1000 : 00000000`00000133 00000000`00000001 00000000`00001e00 00000000`00000000 : nt!KeBugCheckEx

      ffffdb80`5a305d90 fffff801`96adc7e8 : 00001b8b`037a81b4 00001b8b`037a7f29 fffff780`00000320 fffff801`96b57cc0 : nt! ?? ::FNODOBFM::`string'+0x46470

      ffffdb80`5a305df0 fffff801`972344e5 : ffffcb86`adf28900 ffffcb86`adf28900 00000000`00000001 ffffcb86`adf28900 : nt!KeClockInterruptNotify+0xb8

      ffffdb80`5a305f40 fffff801`96a685d6 : ffffdb80`58adfd00 00000000`00000000 00000000`00000000 00000000`00000000 : hal!HalpTimerClockIpiRoutine+0x15

      ffffdb80`5a305f70 fffff801`96b5cd6a : ffffdb80`631d61d0 ffffd382`b0228cf0 00000000`000000b8 00000000`00000008 : nt!KiCallInterruptServiceRoutine+0x106

      ffffdb80`5a305fb0 fffff801`96b5d1b7 : 00000000`00000017 ffffdb80`631d6278 ffffdb80`631d65c0 ffffcb86`b65aaa40 : nt!KiInterruptSubDispatchNoLockNoEtw+0xea

      ffffdb80`631d6150 fffff801`96b61271 : ffffcb86`bd353c4e fffff801`96a7d923 00000000`00000200 00000000`00001000 : nt!KiInterruptDispatchNoLockNoEtw+0x37

      ffffdb80`631d62e0 fffff801`96a7d923 : 00000000`00000200 00000000`00001000 ffffcb86`bd353776 fffffeff`00000000 : nt!ExpInterlockedPopEntrySListEnd+0x11

      ffffdb80`631d62f0 fffff800`fc9aaf13 : ffffd383`d366120c ffffcb86`bbb68b40 ffffdb80`631d6540 ffffcb86`b09ada78 : nt!IoAllocateMdl+0x73

      ffffdb80`631d6340 fffff800`fc9a9c61 : 00000000`3337ddbe ffffd383`d378d260 00000000`00000000 ffffd383`d3661228 : tcpip!TcpSegmentTcbSend+0x223

      ffffdb80`631d6420 fffff800`fc9abc8d : 00000000`00000010 fffffff6`00000007 fffff800`fcb54210 00000000`0028ed91 : tcpip!TcpBeginTcbSend+0x481

      ffffdb80`631d6710 fffff800`fc9a95d5 : 00000000`00000000 ffffcb86`bbb68b40 00000000`00001000 ffffdb80`631d6b62 : tcpip!TcpTcbSend+0x25d

      ffffdb80`631d6ad0 fffff800`fc9a929a : 00000000`0059dc59 ffffdb80`631d6d60 ffffdb80`631d6d01 00000000`00000000 : tcpip!TcpEnqueueTcbSendOlmNotifySendComplete+0xa5

      ffffdb80`631d6b00 fffff800`fc9a8ddb : ffffdb80`00001300 ffffeb7f`760fa0e8 ffffdb80`631d6d01 fffff801`96a9f581 : tcpip!TcpEnqueueTcbSend+0x30a

      ffffdb80`631d6c00 fffff801`96a9f505 : ffffdb80`631d6d01 ffffdb80`631d6d00 ffffcb86`b433f010 fffff800`fc9a8db0 : tcpip!TcpTlConnectionSendCalloutRoutine+0x2b

      ffffdb80`631d6c80 fffff800`fc9f1aa6 : ffffd383`d1225010 00000000`00000000 00000000`00000000 ffffcb86`b08e7530 : nt!KeExpandKernelStackAndCalloutInternal+0x85

      ffffdb80`631d6cd0 fffff800`fc4e1d47 : ffffd383`d1225010 ffffdb80`631d6df0 00000000`00000000 00000000`00000000 : tcpip!TcpTlConnectionSend+0x76

      ffffdb80`631d6d40 fffff800`fdb59b02 : ffffd383`d1225010 fffff800`fc4fd090 ffffdb80`631d6df0 ffffcb86`b433f010 : afd!AfdWskDispatchInternalDeviceControl+0xf7

      ffffdb80`631d6db0 fffff800`fdb9c3d1 : ffffd383`d0bcb720 ffffcb86`b433f010 00000000`c000020c fffff800`fdbfad4e : mrxsmb!SmbWskSend+0x1f2

      ffffdb80`631d6ea0 fffff800`fdb9c2b8 : ffffd383`d4293eb8 fffff800`fdb5a53b fffff800`fdb8f000 00000000`00000000 : mrxsmb!RxCeSend+0xe1

      ffffdb80`631d6ff0 fffff800`fdb593dd : 00000000`00040070 ffffd383`d4293f28 ffffd383`d0bcb720 ffffd383`d4293eb8 : mrxsmb!VctSend+0x68

      ffffdb80`631d7040 fffff800`fdbfbdc1 : ffffd383`d4293d01 ffffd383`d40e07f0 ffffd383`d4293d28 00000000`00000000 : mrxsmb!SmbCseSubmitBufferContext+0x33d

      ffffdb80`631d7110 fffff800`fdb59f46 : ffffd383`d4293d00 ffffdb80`631d7200 ffffcb86`00800000 00000000`00000000 : mrxsmb20!Smb2Write_Start+0x1d1

      ffffdb80`631d71e0 fffff800`fdc24126 : ffffdb80`631d75a0 ffffd383`ccdbd810 ffffcb86`b436f7a0 00000000`00000004 : mrxsmb!SmbCeInitiateExchange+0x376

      ffffdb80`631d7540 fffff800`fc71755c : ffffd383`d4293d28 00000000`00000001 ffffd383`ccdbd810 fffff801`96a2e934 : mrxsmb20!MRxSmb2Write+0x126

      ffffdb80`631d75a0 fffff800`fc72a37d : fffff800`fc708000 ffffd383`ccdbd810 ffffcb86`bb2348b0 fffff800`fc708000 : rdbss!RxLowIoSubmit+0x17c

      ffffdb80`631d7610 fffff800`fc6e7a0c : 00000000`00000003 00000000`00000001 ffffcb86`bb2348b0 ffffcb86`bb2348b0 : rdbss!RxLowIoWriteShell+0x9d

      ffffdb80`631d7640 fffff800`fc72a289 : 00000000`00000000 ffffd383`d44b8800 ffffcb86`b0b1da40 00000000`00000001 : rdbss!RxCommonFileWrite+0x74c

      ffffdb80`631d7830 fffff800`fc6e299b : ffffd383`ccdbd810 ffffcb86`b48ed080 ffffcb86`bb2348b0 00000000`00000000 : rdbss!RxCommonWrite+0x59

      ffffdb80`631d7860 fffff800`fc71e6e6 : ffffd383`d44b8900 00000000`000371fd 00000000`00000000 00000000`00000002 : rdbss!RxFsdCommonDispatch+0x55b

      ffffdb80`631d79e0 fffff800`fdb990eb : 00000000`00000000 fffff801`96aa55bc 00000000`00000000 ffffcb86`ade77350 : rdbss!RxFsdDispatch+0x86

      ffffdb80`631d7a30 fffff800`fb8f72e7 : ffffd383`d2921600 00000000`00000001 00000000`00000102 ffffcb86`bb2348b0 : mrxsmb!MRxSmbFsdDispatch+0xeb

      ffffdb80`631d7a70 fffff800`fb8f65c8 : ffffd383`d2d9d040 ffffdb80`631d7ba0 00000000`00040000 ffffd383`d2921600 : clusport!ClusPortSendPassthruReadWriteRemote+0x227

      ffffdb80`631d7ac0 fffff800`fb8f4f21 : ffffd383`d2921600 ffffd383`d2921600 ffffd383`d2f0cbb0 ffffd383`d2921701 : clusport!ClusPortExecuteIrp+0x118

      ffffdb80`631d7b70 fffff800`fb8f4bfa : 00000000`00000001 fffff800`fb913a80 00000000`00000000 ffffd383`d2921760 : clusport!ClusPortIrpWorker+0x51

      ffffdb80`631d7ba0 fffff801`96a13729 : 00000000`00000000 ffffd383`d44b8800 00000000`00000080 fffff800`fb8f4ae0 : clusport!CsvFsThreadPoolWorkerRoutine+0x11a

      ffffdb80`631d7c10 fffff801`96b60bb6 : ffffdb80`59fc0180 ffffd383`d44b8800 fffff801`96a136e8 00000000`00000000 : nt!PspSystemThreadStartup+0x41

      ffffdb80`631d7c60 00000000`00000000 : ffffdb80`631d8000 ffffdb80`631d2000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

       

       

       

       

      STACK_COMMAND:  kb

       

       

      THREAD_SHA1_HASH_MOD_FUNC:  867e5a968da76728f7672cda902ce03b0094126c

       

       

      THREAD_SHA1_HASH_MOD_FUNC_OFFSET:  0ca44a1f529d8537ce142132cfbf564122925c1a

       

       

      THREAD_SHA1_HASH_MOD:  5f7fb32acfabea61ff02c84c0c3baa1fbe4b0b8d

       

       

      FOLLOWUP_IP:

      mrxsmb!SmbWskSend+1f2

      fffff800`fdb59b02 8bd8            mov     ebx,eax

       

       

      FAULT_INSTR_CODE:  8b49d88b

       

       

      SYMBOL_STACK_INDEX:  12

       

       

      SYMBOL_NAME:  mrxsmb!SmbWskSend+1f2

       

       

      FOLLOWUP_NAME:  MachineOwner

       

       

      MODULE_NAME: mrxsmb

       

       

      IMAGE_NAME:  mrxsmb.sys

       

       

      DEBUG_FLR_IMAGE_TIMESTAMP:  57cf9c38

       

       

      BUCKET_ID_FUNC_OFFSET:  1f2

       

       

      FAILURE_BUCKET_ID:  0x133_ISR_mrxsmb!SmbWskSend

       

       

      BUCKET_ID:  0x133_ISR_mrxsmb!SmbWskSend

       

       

      PRIMARY_PROBLEM_CLASS:  0x133_ISR_mrxsmb!SmbWskSend

       

       

      TARGET_TIME:  2017-01-20T16:16:45.000Z

       

       

      OSBUILD:  14393

       

       

      OSSERVICEPACK:  0

       

       

      SERVICEPACK_NUMBER: 0

       

       

      OS_REVISION: 0

       

       

      SUITE_MASK:  400

       

       

      PRODUCT_TYPE:  3

       

       

      OSPLATFORM_TYPE:  x64

       

       

      OSNAME:  Windows 10

       

       

      OSEDITION:  Windows 10 Server TerminalServer DataCenter SingleUserTS

       

       

      OS_LOCALE: 

       

       

      USER_LCID:  0

       

       

      OSBUILD_TIMESTAMP:  2016-12-21 06:50:57

       

       

      BUILDDATESTAMP_STR:  161220-1747

       

       

      BUILDLAB_STR:  rs1_release

       

       

      BUILDOSVER_STR:  10.0.14393.693.amd64fre.rs1_release.161220-1747

       

       

      ANALYSIS_SESSION_ELAPSED_TIME: e45

       

       

      ANALYSIS_SOURCE:  KM

       

       

      FAILURE_ID_HASH_STRING:  km:0x133_isr_mrxsmb!smbwsksend

       

       

      FAILURE_ID_HASH:  {f4239d18-f80c-7c1f-6289-34a57aa17a7d}

       

       

      Followup:     MachineOwner

      ---------

       

      Can somebody at Mellanox please help? I'm tempted to buy two off-the-shelf Mellanox cards just to rule Dell's firmware out of the equation.
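
For anyone comparing several dumps like the two above, a small script can pull out the handful of fields that usually matter. The following is only a sketch in Python, not part of the original analysis: the function names `summarize_analysis` and `image_timestamp_to_utc` and the choice of fields are illustrative assumptions, and reading `DEBUG_FLR_IMAGE_TIMESTAMP` as seconds since the Unix epoch assumes the driver image carries a conventional link timestamp.

```python
import re
from datetime import datetime, timezone

# Fields of interest in `!analyze -v` output (this selection is an assumption).
FIELDS = ("BUGCHECK_STR", "MODULE_NAME", "IMAGE_NAME",
          "DEBUG_FLR_IMAGE_TIMESTAMP", "FAILURE_BUCKET_ID")


def summarize_analysis(text: str) -> dict:
    """Return the first value seen for each field listed in FIELDS."""
    summary = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Z0-9_]+):\s+(\S.*)$", line)
        if match and match.group(1) in FIELDS:
            # Keep only the first occurrence of each field.
            summary.setdefault(match.group(1), match.group(2).strip())
    return summary


def image_timestamp_to_utc(hex_stamp: str) -> datetime:
    """Interpret a hex image timestamp as seconds since the Unix epoch (assumed)."""
    return datetime.fromtimestamp(int(hex_stamp, 16), tz=timezone.utc)


# Lines copied from the first dump in this post.
sample = """
BUGCHECK_STR:  0x9F
MODULE_NAME: mlx4eth63
IMAGE_NAME:  mlx4eth63.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  57c2dc3b
FAILURE_BUCKET_ID:  0x9F_3_POWER_DOWN_mlx4eth63!unknown_function
"""

summary = summarize_analysis(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
print(image_timestamp_to_utc(summary["DEBUG_FLR_IMAGE_TIMESTAMP"]).isoformat())
```

Under that timestamp assumption, `57c2dc3b` from the first dump decodes to 2016-08-28T12:42:35+00:00, which would be the build time of the mlx4eth63.sys image.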

        • Re: Storage Spaces Direct Windows Server 2016 (1607) BSOD - Mellanox ConnectX-3 Pro (Dell)
          ophirmaor

          Hi,

          Please contact Mellanox support about this; you can email support@mellanox.com

           

          Ophir.

          • Re: Storage Spaces Direct Windows Server 2016 (1607) BSOD - Mellanox ConnectX-3 Pro (Dell)
            thorir

            Hi,

             

            I have been having this same issue and have resolved it with the help of Mellanox and by pounding on Dell ProSupport.

             

            The problem is that, according to Mellanox, the supported configuration for the Mellanox ConnectX-3 is driver version 5.25 with firmware 2.36.5150 or higher. Dell has the driver, but their firmware version is 2.36.5080, which is nearly a year old.

            Mellanox wants you to talk to Dell support (which is right) to fix this. If you have Dell ProSupport, contact them; if not, go to this website, http://www.mellanox.com/page/custom_firmware_table (Dell card = MCX312A-XCB - MT_1080120023), and follow the instructions to create your own firmware image.
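
The version check described above can be sketched in a few lines of Python. This is a minimal illustration that assumes firmware versions are plain dotted integers; the helper names `parse_version` and `firmware_is_supported` are hypothetical, and nothing here queries the adapter itself.

```python
def parse_version(version: str) -> tuple:
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def firmware_is_supported(installed: str, minimum: str) -> bool:
    """Return True when the installed firmware is at or above the minimum."""
    return parse_version(installed) >= parse_version(minimum)


minimum = "2.36.5150"     # minimum firmware quoted above for driver 5.25
dell_build = "2.36.5080"  # Dell firmware version quoted above

print(firmware_is_supported(dell_build, minimum))   # False
print(firmware_is_supported("2.36.5150", minimum))  # True
```

Comparing integer tuples rather than raw strings avoids surprises when the fields have different digit counts.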

             

            Kind Regards,

            Thorir

              • Re: Storage Spaces Direct Windows Server 2016 (1607) BSOD - Mellanox ConnectX-3 Pro (Dell)
                t3chyphil

                Hi Thorir,

                 

                I managed to fix the issue. I had a support case open with Dell ProSupport for about 3 weeks. They too had issues trying to replicate the fault. I suggested the firmware was out of sync with the drivers they'd released. Anyway, they suggested trying BIOS settings. I then spent the next 3 weeks reinstalling Windows over and over, because the BSODs would occasionally corrupt the installation.

                 

                In the end I was able to resolve the issue. There's a BIOS setting called IO Non Posted Prefetching, which was enabled by default when the servers were delivered. I disabled this setting and was able to run VMFleet for a few days, hammering the system with no crashes. I fed this info back to Dell, who then closed the case. They did acknowledge the firmware is a problem but said they can't do anything about it other than raise a case for it to be updated. We just have to wait.

                 

                I think I'd buy Mellanox cards directly from Mellanox in future. I can't see a way of upgrading the firmware, as the firmware tools don't recognise the cards at all. There's no way to discover them because Dell have changed the identifiers that the MFT tools look for. Mellanox was very unhelpful: I tried to raise a case with them, only to be told I don't have support. I was pretty annoyed at the time. Dell won't give me a time or date for new firmware, or even say whether it's on the cards. Mellanox did not want to know unless I paid more. Anyway, I hope this helps others.

                 

                I should add that the servers ran fine for about a month, and now we're experiencing similar crashes again (though not as often). This time Microsoft have a case open, as I believe the Mellanox side of things is sorted. Who knows, Microsoft might turn around and say there's a firmware and driver mismatch on the Mellanox cards. It's been a nightmare.

                 

                Anyway, I hope that BIOS setting helps others.

                • Re: Storage Spaces Direct Windows Server 2016 (1607) BSOD - Mellanox ConnectX-3 Pro (Dell)
                  t3chyphil

                  Sorry I forgot to mention...

                   

                  You said you sorted it. How did you resolve it? Was it a BIOS setting or did you manage to make your own firmware? If that's the case, maybe my more recent crashes are still related...

                   

                  Many thanks