1 Reply Latest reply on Jan 16, 2017 6:22 AM by alkx

    mlx4_0 Initializing and... nothing, fails? on CentOS on Dell servers, MT25408

    lejeczek

      Hi all,

       

      I have a very basic setup: two boxes connected directly to each other via two MHEH28-XTC cards, and I cannot get the ports to become active.

      One peculiar thing is that I get the following in the kernel log (at random, and not often):

       

      [85947.090496] AMD-Vi: Event logged [

      [85947.090539] IO_PAGE_FAULT device=09:00.7 domain=0x0000 address=0x00000000f6ffb000 flags=0x0050]

      [85947.298509] AMD-Vi: Event logged [

      [85947.298550] IO_PAGE_FAULT device=09:00.7 domain=0x0000 address=0x00000000f6ffb000 flags=0x0050]

       

      which is the card itself, judging by the device ID.

      Would you have any thoughts to share, please?
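
      For reference, here is a rough sketch of how I matched the faulting device against the card. It assumes the fault line always carries a device=BB:DD.F field as in the log above; the helper names parse_bdf and same_slot are my own, not part of any tool. Note that the fault is logged against function 7, while mstflint was pointed at function 0, so the match is on bus and slot only.

```python
import re

# Matches the device=BB:DD.F field of an AMD-Vi IO_PAGE_FAULT line.
BDF_RE = re.compile(r"device=([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])")


def parse_bdf(text):
    """Return (bus, slot, function) as ints, or None if no device= field is found."""
    m = BDF_RE.search(text)
    if m is None:
        return None
    bus, slot, func = m.groups()
    return int(bus, 16), int(slot, 16), int(func, 16)


def same_slot(fault_line, hca_address):
    """True if the faulting device shares bus and slot with the HCA address."""
    fault = parse_bdf(fault_line)
    hca = parse_bdf("device=" + hca_address)
    if fault is None or hca is None:
        return False
    # Compare bus and slot only; the function number is ignored on purpose.
    return fault[:2] == hca[:2]


fault_line = (
    "[85947.090539] IO_PAGE_FAULT device=09:00.7 domain=0x0000 "
    "address=0x00000000f6ffb000 flags=0x0050]"
)
print(same_slot(fault_line, "09:00.0"))
```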

       

      $ ./flint/mstflint -d 09:00.0 q # for both cards

       

      -W- Running quick query - Skipping full image integrity checks.

       

      Image type:      FS2

      FW Version:      2.9.1000

      Device ID:       25408

      Description:     Node             Port1            Port2 Sys image

      GUIDs:           0008f104039a62a0 0008f104039a62a1 0008f104039a62a2 0008f104039a62a3

      MACs:                                 000000000000     000000000001

      VSD:

      PSID:            MT_04A0110001

       

      $ ibstat

      CA 'mlx4_0'

          CA type: MT25408

          Number of ports: 2

          Firmware version: 2.9.1000

          Hardware version: a0

          Node GUID: 0x0008f104039a08dc

          System image GUID: 0x0008f104039a08df

          Port 1:

              State: Initializing

              Physical state: LinkUp

              Rate: 10

              Base lid: 1

              LMC: 0

              SM lid: 1

              Capability mask: 0x0259086a

              Port GUID: 0x0008f104039a08dd

              Link layer: InfiniBand

          Port 2:

              State: Down

              Physical state: Polling

              Rate: 10

              Base lid: 0

              LMC: 0

              SM lid: 0

              Capability mask: 0x0259086a

              Port GUID: 0x0008f104039a08de

              Link layer: InfiniBand
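
      To make the port situation easier to compare between the two boxes, here is a small sketch that pulls only the State and Physical state fields out of ibstat text. It assumes the layout shown above ("Port N:" followed by "Key: value" lines); port_states is a hypothetical helper name of mine, and the sample text is a trimmed copy of the output above.

```python
import re


def port_states(ibstat_text):
    """Map port number -> {'State': ..., 'Physical state': ...} from ibstat output."""
    ports = {}
    current = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        # "Port 1:" starts a new port section; "Port GUID: ..." does not match.
        m = re.fullmatch(r"Port (\d+):", line)
        if m:
            current = int(m.group(1))
            ports[current] = {}
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            if key in ("State", "Physical state"):
                ports[current][key] = value.strip()
    return ports


sample = """
CA 'mlx4_0'
    Number of ports: 2
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Port GUID: 0x0008f104039a08dd
    Port 2:
        State: Down
        Physical state: Polling
        Port GUID: 0x0008f104039a08de
"""

for number, fields in sorted(port_states(sample).items()):
    print(number, fields["State"], fields["Physical state"])
```

      On this box it shows port 1 with the physical link up but the logical state stuck at Initializing, and port 2 with no link at all.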

       

      In the opensm log:

       

      Jan 06 17:00:28 817185 [F6D5A700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1cd1

      Jan 06 17:00:28 817200 [F6D5A700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120 Timeout while getting attribute 0x11 (NodeInfo); Possible mis-set mkey?

       

      Many thanks.