0 Replies Latest reply on Mar 19, 2014 11:16 AM by alexmercer

    A newbie problem with InfiniBand.

      Hi all =)

       

      I am a bit new to the forum, but I have been reading it for quite some time and the posts are very helpful. Thanks!

       

      So I decided that it is worth hopping on the InfiniBand wagon (it is clear why: the speed is awesome, and the performance boost for the price has no match). BUT...

       

      I have run into some problems setting up the InfiniBand fabric.

      Some information about my setup: an HP c7000 with 4 x ProLiant BL685c Gen1 blades, each with an HP 4x DDR dual-port mezzanine HCA. I also have 2 x HP 4x DDR IB Switch Modules (each with 16 downlink ports and 8 physical interfaces with CX4 connectors).

      I am running VMware ESXi 5.1.0

       

      ~ # esxcli system version get

         Product: VMware ESXi

         Version: 5.1.0

         Build: Releasebuild-799733

         Update: 0

       

      So far so good. I have installed the drivers needed:

       

       

      * Mellanox ESXi 5.0 driver ( esxcli software vib install -d /tmp/drivers/mlx4_en-mlnx-1.6.1.2-offline_bundle-471530.zip --no-sig-check )

      * Mellanox OFED driver ( esxcli software vib install -d /tmp/drivers/MLNX-OFED-ESX-1.8.1.0.zip --no-sig-check )

       

       

      # esxcli software vib list | grep Mellanox

      net-ib-cm                      1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

      net-ib-core                    1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

      net-ib-ipoib                   1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

      net-ib-mad                     1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

      net-ib-sa                      1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

      net-ib-umad                    1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

      net-memtrack                   2013.0131.1850-1OEM.500.0.0.472560    Mellanox         PartnerSupported  2014-03-18

      net-mlx4-core                  1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

      net-mlx4-en                    1.6.1.2-1OEM.500.0.0.406165           Mellanox         VMwareCertified   2014-03-18

      net-mlx4-ib                    1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18

      scsi-ib-srp                    1.8.1.0-1OEM.500.0.0.472560           Mellanox         PartnerSupported  2014-03-18


       

      After that I installed OpenSM ( esxcli software vib install -v /tmp/drivers/ib-opensm-3.3.15.x86_64.vib --no-sig-check )

       

      ~ # esxcli software vib list | grep open

      ib-opensm                      3.3.15                                Intel            VMwareAccepted    2014-03-18


      I also configured OpenSM per adapter port with a partitions.conf file (Default=0x7fff,ipoib,mtu=5:ALL=full;), which I put in the /scratch/opensm/adapter_1_hca/ and /scratch/opensm/adapter_2_hca/ directories (here the directories are named after the port GUIDs):

       

      /vmfs/volumes/530dc445-b2c469b5-adf0-0019bb3b460e/.locker/opensm # ls -la

      drwxr-xr-x    1 root     root           560 Feb 28 09:59 .

      drwxr-xr-x    1 root     root           980 Feb 28 09:59 ..

      drwxr-xr-x    1 root     root           420 Mar 18 12:31 0x00237dffff94d87d

      drwxr-xr-x    1 root     root           420 Mar 18 12:31 0x00237dffff94d87e

       

       

      /vmfs/volumes/530dc445-b2c469b5-adf0-0019bb3b460e/.locker/opensm/0x00237dffff94d87d # cat partitions.conf

      Default=0x7fff,ipoib,mtu=5:ALL=full;
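For anyone reading along, here is a minimal Python sketch of how that single partitions.conf line breaks down. It only handles this simple one-line form, not the full OpenSM partition grammar, and the function name parse_partition_line is made up for illustration. The MTU table reflects the standard InfiniBand MTU codes, so mtu=5 should correspond to 4096 bytes.

```python
# Standard InfiniBand MTU codes (code -> bytes).
IB_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def parse_partition_line(line):
    """Split one 'Name=pkey,flag,...:member=level,...;' line into its parts.

    Hypothetical helper: covers only the simple form used in this post.
    """
    definition, _, members = line.strip().rstrip(";").partition(":")
    name_and_pkey, *flags = [part.strip() for part in definition.split(",")]
    name, _, pkey_text = name_and_pkey.partition("=")

    result = {"name": name, "pkey": int(pkey_text, 16), "ipoib": False,
              "mtu_code": None, "members": {}}
    for flag in flags:
        key, _, value = flag.partition("=")
        if key == "ipoib":
            result["ipoib"] = True
        elif key == "mtu":
            result["mtu_code"] = int(value)
    for member in members.split(","):
        port, _, level = member.strip().partition("=")
        if port:
            result["members"][port] = level
    return result


parsed = parse_partition_line("Default=0x7fff,ipoib,mtu=5:ALL=full;")
parsed["mtu_bytes"] = IB_MTU_BYTES[parsed["mtu_code"]]
print(parsed)
```

So the line defines a partition named Default with P_Key 0x7fff, IPoIB enabled, an IB MTU of 4096 bytes, and all ports as full members.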

       

      I have been following these two tutorials:

      http://www.vladan.fr/homelab-storage-network-speedup/

      http://www.bussink.ch/?p=1183

       

      Now I can see the adapters:

       

      ~ # esxcli network nic list | grep Mellanox

      vmnic_ib0  0000:047:00.0  ib_ipoib  Up    20000  Full    00:23:7d:94:d8:7d  1500  Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]

      vmnic_ib1  0000:047:00.0  ib_ipoib  Up    20000  Full    00:23:7d:94:d8:7e  1500  Mellanox Technologies MT25418 [ConnectX VPI - 10GigE / IB DDR, PCIe 2.0 2.5GT/s]
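As a quick sanity check on the 20000 shown in the speed column, the arithmetic for a 4x DDR link can be sketched as below. The per-lane rate and the 8b/10b encoding are the usual figures for DDR InfiniBand. The variable names are only illustrative.

```python
LANES = 4                      # "4x" link width
DDR_GBPS_PER_LANE = 5.0        # DDR signalling rate per lane, in Gbit/s
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line encoding: 8 data bits per 10 line bits

signalling_gbps = LANES * DDR_GBPS_PER_LANE            # what the NIC list reports
data_gbps = signalling_gbps * ENCODING_EFFICIENCY      # usable bits before protocol overhead

print(f"signalling rate: {signalling_gbps:.0f} Gbit/s")
print(f"data rate:       {data_gbps:.0f} Gbit/s")
```

In other words, the reported 20000 Mbit/s matches the 4x DDR signalling rate, while the payload ceiling is closer to 16 Gbit/s before any IPoIB overhead.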

       

      Also, when I run ./ibstat I get this:

       

      /opt/opensm/bin # ./ibstat

      CA 'mlx4_0'

              CA type: MT25418

              Number of ports: 2

              Firmware version: 2.7.0

              Hardware version: a0

              Node GUID: 0x00237dffff94d87c

              System image GUID: 0x00237dffff94d87f

              Port 1:

                      State: Active

                      Physical state: LinkUp

                      Rate: 20

                      Base lid: 1

                      LMC: 0

                      SM lid: 6

                      Capability mask: 0x0251086a

                      Port GUID: 0x00237dffff94d87d

                      Link layer: InfiniBand

              Port 2:

                      State: Active

                      Physical state: LinkUp

                      Rate: 20

                      Base lid: 5

                      LMC: 0

                      SM lid: 6

                      Capability mask: 0x0251086a

                      Port GUID: 0x00237dffff94d87e

                      Link layer: InfiniBand
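To compare the ports at a glance, the ibstat output above can be reduced to a few fields with a small Python sketch like the one below. The sample text is a trimmed copy of the output in this post, and parse_ibstat_ports is a hypothetical helper, not part of any InfiniBand tool. One thing it makes visible: both ports report SM lid 6, which is neither of the local base LIDs (1 and 5), so the master subnet manager seems to be at some other port on the fabric.

```python
import re

# Trimmed copy of the ibstat output from this post.
IBSTAT_SAMPLE = """\
CA 'mlx4_0'
        Number of ports: 2
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 1
                SM lid: 6
                Port GUID: 0x00237dffff94d87d
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 5
                SM lid: 6
                Port GUID: 0x00237dffff94d87e
"""


def parse_ibstat_ports(text):
    """Return {port_number: {field: value}} for each 'Port N:' section."""
    ports, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"Port (\d+):", line)
        if match:
            current = ports.setdefault(int(match.group(1)), {})
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


ports = parse_ibstat_ports(IBSTAT_SAMPLE)
for number, fields in sorted(ports.items()):
    sm_on_this_port = fields["Base lid"] == fields["SM lid"]
    print(f"port {number}: state={fields['State']} lid={fields['Base lid']} "
          f"sm_lid={fields['SM lid']} sm_on_this_port={sm_on_this_port}")
```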

       

      So everything seems to be working, except it is not.

      When I try to ping from one host to the other, I get this:

       

      /opt/opensm/bin # ./ibping -S -dd

      ibwarn: [15174] umad_init: umad_init

      ibwarn: [15174] umad_open_port: ca (null) port 0

      ibwarn: [15174] umad_get_cas_names: max 32

      ibwarn: [15174] umad_get_cas_names: return 1 cas

      ibwarn: [15174] resolve_ca_name: checking ca 'mlx4_0'

      ibwarn: [15174] resolve_ca_port: checking ca 'mlx4_0'

      ibwarn: [15174] umad_get_ca: ca_name mlx4_0

      ibwarn: [15174] umad_get_ca: opened mlx4_0

      ibwarn: [15174] resolve_ca_port: checking port 0

      ibwarn: [15174] resolve_ca_port: checking port 1

      ibwarn: [15174] resolve_ca_port: found active port 1

      ibwarn: [15174] resolve_ca_name: found ca mlx4_0 with port 1 type 1

      ibwarn: [15174] resolve_ca_name: found ca mlx4_0 with active port 1

      ibwarn: [15174] umad_open_port: opening mlx4_0 port 1

      ibwarn: [15174] dev_to_umad_id: mapped mlx4_0 1 to 0

      ibwarn: [15174] umad_open_port: opened /dev/umad0 fd 3 portid 0

      ibwarn: [15174] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil)

      ibwarn: [15174] umad_register: fd 3 registered to use agent 0 qp 1

      ibwarn: [15174] umad_register_oui: fd 3 mgmt_class 50 rmpp_version 0 oui 0x0145 method_mask 0xffd0cca0

      ibwarn: [15174] umad_register_oui: fd 3 registered to use agent 1 qp 1 class 0x32 oui 0xffd0cc90

      ibdebug: [15174] ibping_serv: starting to serve...

      ibwarn: [15174] umad_recv: fd 3 umad 0x80579c0 timeout 4294967295

      ibwarn: [15174] umad_recv: read returned 4294967232 > sizeof umad 64 + length 256 (Resource temporarily unavailable)

      ibwarn: [15174] mad_receive_via: recv failed: Resource temporarily unavailable

      ibdebug: [15174] ibping_serv: server out
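A note on the two large numbers in that log: they look like negative values printed as unsigned 32-bit integers. The Python sketch below shows the reinterpretation, assuming a 32-bit userland (the short pointer 0x80579c0 hints at that, but it is an assumption). It only decodes the numbers and says nothing about why read() failed.

```python
def as_signed32(value):
    """Reinterpret an unsigned 32-bit integer as two's-complement signed."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & (1 << 31) else value


print(as_signed32(4294967232))  # "read returned ..." value from the log -> -64
print(as_signed32(4294967295))  # "timeout ..." value from the log      -> -1
```

So the timeout is most likely -1 (wait forever), and read() appears to have returned a negative value (-64) rather than a huge byte count.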

       

      For some reason I always get the "Resource temporarily unavailable" message. When I run ./ibping -L with the right LID or ./ibping -G with the right GUID, I always get this:

       

      /opt/opensm/bin # ./ibping -G 0x001b78ffff34b9c6

      ibwarn: [15237] _do_madrpc: recv failed: Resource temporarily unavailable

      ibwarn: [15237] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 6)

      ibwarn: [15237] ib_path_query_via: sa call path_query failed

      ./ibping: iberror: failed: can't resolve destination port 0x001b78ffff34b9c6

       

      So I would really appreciate any help with getting one node to ping the other.

       

      I am thinking that my problem might be the HP 4x IB Switch, but it shouldn't be, because even with it I should at least be able to get a point-to-point connection. The switch doesn't have an onboard subnet manager, but I am running OpenSM, so that shouldn't be the problem either.

      I want to use the InfiniBand connection for virtual storage between the ProLiants, but first I need to verify that there is a connection. Any help or suggestions would be welcome.

      Thanks in advance.

       

      Alex