1 Reply Latest reply on Oct 1, 2018 8:21 AM by alkx

    Best IPoIB settings for a mixed environment of ConnectX-2,3,4 (MT27500, MT26428, MT27700)?

    billbroadley

      I've got a cluster with a mix of IB cards (MT27500, MT26428, MT27700).  Part of the use is MPI + IB verbs for high-bandwidth, low-latency messages, and part is NFS over IPoIB.  Before the newest cards I used connected mode, which worked well.  But apparently there's a new "enhanced IPoIB" that uses datagram mode, so connected mode is disabled by default.


      Is that a purely software update?  Can it be used on the last few generations of Mellanox cards?  Or does it depend on the hardware?  If it depends on the hardware, how do I get connected mode enabled with the newest (100Gbit) cards?

       

      I installed the newest driver on Ubuntu 18.04 LTS and the install went well; it spit this out:

      Device #1:

      ----------

        Device Type:      ConnectX4

        Part Number:      MCX455A-ECA_Ax

        Description:      ConnectX-4 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; ROHS R6

        PSID:             MT_2180110032

        PCI Device Name:  42:00.0

        Base GUID:        506b4b0300f36e34

        Base MAC:         506b4bf36e34

        Versions:         Current        Available

           FW             12.23.1020     12.23.1020

           PXE            3.5.0504       3.5.0504

           UEFI           14.16.0017     14.16.0017

        Status:           Up to date

      Configuring /etc/security/limits.conf.

      Device (42:00.0):

              42:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

              Link Width: x16

              PCI Link Speed: 8GT/s

      Installation passed successfully

      To load the new driver, run:

      /etc/init.d/mlnx-en.d restart    

      This same node was working with the inbox drivers: IPoIB worked (in datagram mode) and ibstat was happy.  Now ibstat doesn't work and ifconfig doesn't find an ib0.

      On boot the device is found:

      # dmesg | grep mlx

      [    3.035024] mlxfw: loading out-of-tree module taints kernel.

      [    3.092878] mlxfw: module verification failed: signature and/or required key missing - tainting kernel

      [    3.245384] mlx5_core 0000:42:00.0: firmware version: 12.23.1020

      [    3.245413] mlx5_core 0000:42:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s

      [    3.245414] mlx5_core 0000:42:00.0: PCIe link width is x16, device supports x16

      [    5.832251] mlx5_core 0000:42:00.0: Port module event: module 0, Cable plugged

      [    6.813408] mlx5_core 0000:42:00.0: FW Tracer Owner

      If I restart the driver as the install mentions:

      [ 3852.805318] PKCS#7 signature not signed with a trusted key

      [ 3852.806388] Compat-mlnx-ofed backport release: ee7aa0e

      [ 3852.806389] Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git ee7aa0e

      [ 3852.806390] compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git

      [ 3852.812135] PKCS#7 signature not signed with a trusted key

      [ 3852.826716] PKCS#7 signature not signed with a trusted key

      [ 3852.837116] PKCS#7 signature not signed with a trusted key

      [ 3852.875073] PKCS#7 signature not signed with a trusted key

      [ 3852.882801] mlx5_core 0000:42:00.0: firmware version: 12.23.1020

      [ 3852.882833] mlx5_core 0000:42:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s

      [ 3852.882836] mlx5_core 0000:42:00.0: PCIe link width is x16, device supports x16

      [ 3855.470717] mlx5_port_module_event: 5 callbacks suppressed

      [ 3855.470724] mlx5_core 0000:42:00.0: Port module event: module 0, Cable plugged

      [ 3856.536266] mlx5_core 0000:42:00.0: FW Tracer Owner

      [ 3856.537406] PKCS#7 signature not signed with a trusted key

      Any ideas?

      Oh, the forums mentioned:

      # cat ib_ipoib.conf

      options ib_ipoib ipoib_enhanced=0

       

      Which resulted in

      [ 4128.198929] ib_ipoib: unknown parameter 'ipoib_enhanced' ignored

        • Re: Best IPoIB settings for a mixed environment of ConnectX-2,3,4 (MT27500, MT26428, MT27700)?
          alkx

          Hi,

          There are multiple questions, so let me answer them one by one.

          1. ib0 is missing

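          When using IPoIB, the stack that needs to be installed is Mellanox OFED, because Mellanox EN has no IB-related packages. The /etc/init.d/mlnx-en.d line in your install output suggests that the EN package is what got installed.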
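A quick way to confirm whether any IPoIB interface exists, and which mode it is in, is to read sysfs directly. The following is a minimal sketch, assuming a Linux host where IPoIB interfaces report link type 32 (ARPHRD_INFINIBAND) in /sys/class/net/<iface>/type and expose their mode in /sys/class/net/<iface>/mode. The function name ipoib_interfaces is made up for illustration.

```python
from pathlib import Path

ARPHRD_INFINIBAND = 32  # link-layer type the kernel reports for IPoIB interfaces


def ipoib_interfaces(sysfs_net="/sys/class/net"):
    """Return {interface_name: mode} for every IPoIB interface under sysfs_net."""
    found = {}
    root = Path(sysfs_net)
    if not root.is_dir():
        return found
    for iface in sorted(root.iterdir()):
        try:
            link_type = int((iface / "type").read_text().strip())
        except (OSError, ValueError):
            continue  # not a network interface entry, or unreadable
        if link_type != ARPHRD_INFINIBAND:
            continue
        try:
            # Expected values are "datagram" or "connected".
            mode = (iface / "mode").read_text().strip()
        except OSError:
            mode = "unknown"
        found[iface.name] = mode
    return found


if __name__ == "__main__":
    interfaces = ipoib_interfaces()
    if not interfaces:
        print("no IPoIB interfaces found")
    for name, mode in interfaces.items():
        print(f"{name}: {mode}")
```

An empty result on a node with an InfiniBand HCA is consistent with the IPoIB module not being installed or loaded, which is the situation described in the question.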

           

          2. Using mixed environment

          It is possible, but if you are going to mix different HCAs in one MPI job, don't expect good performance. The system will be as slow as its slowest component. In addition, the ConnectX-2 HCA is not supported by Mellanox OFED; it might work, but no troubleshooting or code changes will be done for it. ConnectX-2 and ConnectX-4 also use different link encodings: QDR uses 8b/10b (20% overhead), while EDR uses 64b/66b (about 3% overhead). See "InfiniBand Types and Speeds" from Advanced Clustering Technologies.
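The encoding difference can be worked out with a few lines of arithmetic. This is a small sketch, assuming the nominal per-lane signaling rates of 10 Gb/s for QDR and 25.78125 Gb/s for EDR on a 4x link. The helper names are made up for illustration, and real throughput will be lower once protocol overhead is included.

```python
def encoding_overhead(data_bits, total_bits):
    """Fraction of the raw signaling rate spent on line encoding."""
    return 1 - data_bits / total_bits


def effective_gbps(signaling_gbps, data_bits, total_bits):
    """Data rate left after line encoding, in Gb/s."""
    return signaling_gbps * data_bits / total_bits


# Nominal 4x link signaling rates (assumed values, see lead-in).
QDR_4X_SIGNALING = 4 * 10.0       # 8b/10b encoded
EDR_4X_SIGNALING = 4 * 25.78125   # 64b/66b encoded

if __name__ == "__main__":
    for name, rate, data, total in [
        ("QDR 4x", QDR_4X_SIGNALING, 8, 10),
        ("EDR 4x", EDR_4X_SIGNALING, 64, 66),
    ]:
        overhead = encoding_overhead(data, total) * 100
        data_rate = effective_gbps(rate, data, total)
        print(f"{name}: {overhead:.1f}% encoding overhead, {data_rate:.1f} Gb/s data rate")
```

With these figures a QDR link carries 32 Gb/s of data out of 40 Gb/s signaled, while an EDR link carries 100 Gb/s out of 103.125 Gb/s, so 8b/10b costs 2 bits in every 10 and 64b/66b costs 2 bits in every 66.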

           

          3. For any inbox-driver questions, work with the OS vendor. If there is an issue, the vendor will open a case with Mellanox if necessary.

           

          4. Enhanced mode - you seem to have opened another case, Support for "INBOX drivers?" for 18.04/connected mode?, and it might be better to keep the discussion there.

           

          5. Best IPoIB settings - everything depends on your setup and traffic pattern, and only benchmarks or real application results can tell you what is best. What works for one cluster may not work for another. The only way is to start with the defaults, establish a baseline, then change parameters one at a time, measuring and comparing after each change.
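The baseline-then-one-change-at-a-time procedure can be sketched as a small harness. This is only an illustration: the measure callable is hypothetical and would have to wrap whatever benchmark or application run matters on your cluster, and the parameter names ("mode", "mtu") and values are examples, not recommendations.

```python
import statistics


def one_at_a_time(baseline, candidates, measure, repeats=3):
    """Measure the baseline, then every single-parameter change against it.

    baseline:   dict of parameter name -> default value
    candidates: dict of parameter name -> list of values to try
    measure:    callable taking a settings dict, returning a score (higher is better)
    Returns (baseline_score, [(name, value, score, relative_change), ...]).
    """
    def median_score(settings):
        # Repeat each run and take the median to damp run-to-run noise.
        return statistics.median(measure(settings) for _ in range(repeats))

    base_score = median_score(baseline)
    results = []
    for name, values in candidates.items():
        for value in values:
            if baseline.get(name) == value:
                continue  # identical to the baseline, nothing to compare
            trial = {**baseline, name: value}  # change exactly one parameter
            score = median_score(trial)
            results.append((name, value, score, (score - base_score) / base_score))
    return base_score, results


if __name__ == "__main__":
    # Stand-in for a real benchmark; the numbers are made up for the demo.
    def fake_measure(settings):
        return 10.0 + (2.0 if settings["mode"] == "connected" else 0.0)

    base, rows = one_at_a_time(
        {"mode": "datagram", "mtu": 2044},
        {"mode": ["datagram", "connected"], "mtu": [4092]},
        fake_measure,
    )
    print(f"baseline score: {base}")
    for name, value, score, change in rows:
        print(f"{name}={value}: score {score} ({change:+.1%} vs baseline)")
```

Changing one parameter per trial keeps each result attributable to a single setting, which is the point of the procedure described above.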
