2 Replies Latest reply on Jan 3, 2017 2:42 PM by agi

    "Protocol not supported" when trying to add rdma to nfs portlist

    agi

      I am trying to configure NFS for our InfiniBand network and am following the instructions in "HowTo Configure NFS over RDMA (RoCE)".

      I installed the MLNX_OFED drivers on CentOS 6.8. I had originally configured the network and the IPoIB interface using the RHEL manual (Part II. InfiniBand and RDMA Networking) and was running NFS over IPoIB, but I was getting a lot of page allocation failures.

      I used the mlnxofedinstall script, which completed successfully and updated the firmware:

       

      ...

      Device (84:00.0):

          84:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

          Link Width: x8

          PCI Link Speed: 8GT/s

       

      Installation finished successfully.

       

      Preparing...                ########################################### [100%]

         1:mlnx-fw-updater        ########################################### [100%]

       

      Added 'RUN_FW_UPDATER_ONBOOT=no to /etc/infiniband/openib.conf

       

      Attempting to perform Firmware update...

      Querying Mellanox devices firmware ...

       

      Device #1:

      ----------

        Device Type:      ConnectX3

        Part Number:      MCX354A-FCB_A2-A5

        Description:      ConnectX-3 VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and 40GigE; PCIe3.0 x8 8GT/s; RoHS R6

        PSID:             MT_1090120019

        PCI Device Name:  84:00.0

        Port1 GUID:       e41d2d03006f89f1

        Port2 GUID:       e41d2d03006f89f2

        Versions:         Current        Available    

           FW             2.32.5100      2.36.5150    

           PXE            3.4.0306       3.4.0740     

       

        Status:           Update required

      ---------

      Found 1 device(s) requiring firmware update...

      Device #1: Updating FW ... Done

      Restart needed for updates to take effect.

      Log File: /tmp/MLNX_OFED_LINUX-3.4-1.0.0.0.17971.logs/fw_update.log

      Please reboot your system for the changes to take effect.

      To load the new driver, run:

      /etc/init.d/openibd restart

      #

       

      I rebooted the system and then ran the self test:

      # hca_self_test.ofed

       

      ---- Performing Adapter Device Self Test ----

      Number of CAs Detected ................. 1

      PCI Device Check ....................... PASS

      Kernel Arch ............................ x86_64

      Host Driver Version .................... MLNX_OFED_LINUX-3.4-1.0.0.0 (OFED-3.4-1.0.0): 2.6.32-642.el6.x86_64

      Host Driver RPM Check .................. PASS

      Firmware on CA #0 VPI .................. v2.36.5150

      Host Driver Initialization ............. PASS

      Number of CA Ports Active .............. 0

      Port State of Port #1 on CA #0 (VPI)..... INIT (InfiniBand)

      Port State of Port #2 on CA #0 (VPI)..... DOWN (InfiniBand)

      Error Counter Check on CA #0 (VPI)...... FAIL

          REASON: found errors in the following counters

            Errors in /sys/class/infiniband/mlx4_0/ports/1/counters

               port_rcv_errors: 93

      Kernel Syslog Check .................... PASS

      Node GUID on CA #0 (VPI) ............... e4:1d:2d:03:00:6f:89:f0

      ------------------ DONE ---------------------

      #

       

      As you can see, the error counter check fails on port_rcv_errors. Also, the port state for Port #1 remains at INIT until I start the subnet manager by hand (/etc/init.d/opensmd start), which we need because our switch is unmanaged. The subnet manager used to start automatically. So maybe the OFED installation wasn't completely successful?

       

      Additionally, I am unable to configure NFS for RDMA:

      # echo rdma 20049 > /proc/fs/nfsd/portlist

      -bash: echo: write error: Protocol not supported

      #
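As far as I understand it, the write to /proc/fs/nfsd/portlist can only succeed if the NFS server has an "rdma" transport available, which normally comes from the svcrdma module. A minimal sketch for checking which of the relevant modules are currently loaded is below. The function name module_loaded and the list of modules checked are my own choices for illustration, not something taken from the HowTo.

```shell
#!/bin/sh
# Sketch: report whether a kernel module appears in a /proc/modules-style listing.
# The second argument defaults to /proc/modules; it is a parameter only so that
# the function can also be pointed at a saved copy of the listing.
module_loaded() {
    mod="$1"
    listing="${2:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q "^${mod} " "$listing" 2>/dev/null
}

# Modules assumed to be relevant for NFS over RDMA on this kernel generation.
for m in svcrdma xprtrdma rdma_cm ib_core; do
    if module_loaded "$m"; then
        echo "$m: loaded"
    else
        echo "$m: not loaded"
    fi
done
```

If svcrdma shows up as not loaded, the "Protocol not supported" error is the expected result of the echo above.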

        • Re: hca_self_test.ofed found errors in the port_rcv_errors counters
          agi

          It seems the port_rcv_errors failure was caused by the subnet manager not running: the counter has not increased since OpenSM was started. I ran several RDMA verification tests, which were all successful. So I think that just leaves the RDMA support in NFS.
          The kernel is 2.6.32-642.11.1.el6.x86_64, and according to /boot/config-2.6.32-642.11.1.el6.x86_64, RDMA appears to be enabled:

          CONFIG_RDS_RDMA=m

          CONFIG_NET_9P_RDMA=m

          CONFIG_CARDMAN_4000=m

          CONFIG_CARDMAN_4040=m

          CONFIG_INFINIBAND_OCRDMA=m

          CONFIG_SUNRPC_XPRT_RDMA_CLIENT=m

          CONFIG_SUNRPC_XPRT_RDMA_SERVER=m
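The list above looks like a case-insensitive search for "rdma", which is why unrelated entries such as CONFIG_CARDMAN_4000 appear in it. A narrower check that prints only the SUNRPC RDMA transport options could be sketched as follows. The function name sunrpc_rdma_opts is hypothetical, and the default path assumes the usual /boot/config-<kernel release> naming.

```shell
#!/bin/sh
# Sketch: print only the SUNRPC RDMA transport options from a kernel config file.
# With no argument, the config of the running kernel is assumed to live at
# /boot/config-$(uname -r).
sunrpc_rdma_opts() {
    cfg="${1:-/boot/config-$(uname -r)}"
    # Matches the split _CLIENT/_SERVER options as well as a single combined
    # CONFIG_SUNRPC_XPRT_RDMA option; commented-out "is not set" lines are skipped.
    grep -E '^CONFIG_SUNRPC_XPRT_RDMA(_CLIENT|_SERVER)?=' "$cfg"
}

sunrpc_rdma_opts 2>/dev/null || echo "no SUNRPC RDMA options found (or config file missing)"
```

Note that "=m" only says the module was built for the distribution kernel. It does not say which copy of the module is actually loaded at runtime.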

           

          # modprobe svcrdma

          # /etc/init.d/nfs restart

          ...      [  OK  ]

          # echo rdma 20049 > /proc/fs/nfsd/portlist

          -bash: echo: write error: Protocol not supported

          #
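Since modprobe svcrdma returns without error but the write still fails, one thing worth checking is which file on disk actually provides the module. The sketch below uses modinfo -n to print the module's path and then classifies it. The classification rule is an assumption on my part: modules under kernel/ are treated as coming from the distribution kernel, and modules under extra/, updates/ or weak-updates/ as installed by a third-party package. The function name module_origin is hypothetical.

```shell
#!/bin/sh
# Sketch: guess where a module file came from based on its path.
# Assumed layout: /lib/modules/<release>/kernel/...   -> distribution kernel
#                 /lib/modules/<release>/extra/... etc -> third-party package
module_origin() {
    case "$1" in
        /lib/modules/*/extra/*|/lib/modules/*/updates/*|/lib/modules/*/weak-updates/*)
            echo "out-of-tree" ;;
        /lib/modules/*/kernel/*)
            echo "in-tree" ;;
        *)
            echo "unknown" ;;
    esac
}

# modinfo -n prints only the filename of the module that modprobe would load.
path=$(modinfo -n svcrdma 2>/dev/null) || path=""
echo "svcrdma: ${path:-not found} ($(module_origin "$path"))"
```

An "out-of-tree" result would suggest that the svcrdma being loaded is not the one built with the distribution kernel.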

          • Re: "Protocol not supported" when trying to add rdma to nfs portlist
            agi

            The solution was to remove MLNX_OFED and use the distribution's drivers/kernel modules.
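The exact steps are not listed above, so the following is only a rough sketch of what switching back to the distribution stack on CentOS/RHEL 6 might involve. It assumes the uninstall script that MLNX_OFED places at /usr/sbin/ofed_uninstall.sh, the "Infiniband Support" package group, and the rdma and opensm init scripts from the distribution packages. The script is a dry run by default and only prints the commands. The helper names run and plan are hypothetical.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set APPLY=1 in the environment to actually run them (as root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

plan() {
    run /usr/sbin/ofed_uninstall.sh               # remove MLNX_OFED (assumed path)
    run yum -y groupinstall "Infiniband Support"  # distribution RDMA stack
    run yum -y install opensm                     # subnet manager for an unmanaged switch
    run service rdma start
    run service opensm start
    run modprobe svcrdma                          # NFS server RDMA transport
    run service nfs restart
    run sh -c 'echo rdma 20049 > /proc/fs/nfsd/portlist'
}

plan
```

A reboot after the uninstall step may be needed before the distribution modules load cleanly. Once the last command succeeds, cat /proc/fs/nfsd/portlist should list an "rdma 20049" entry.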