Bring Up Ceph RDMA - Developer's Guide

Version 46

    This post provides bring-up examples for a Ceph RDMA cluster.

     


    Preparations

    1. (Optional): Install the latest MLNX_OFED and restart the openibd driver. You can verify that the RDMA device is visible as shown below.
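
    A quick check (assuming the libibverbs utilities shipped with MLNX_OFED are installed) is to run ibv_devinfo and confirm that the relevant port reports state PORT_ACTIVE:

    # ibv_devinfo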

     

    2. Ensure that rping runs between all nodes:

         Server: rping -s -v -a server_ip

         Client: rping -c -v -a server_ip

     

    Ceph

    1. Get the latest stable Ceph version with RDMA support from the following branch:

         https://github.com/Mellanox/ceph/tree/luminous-rdma

    This version is based on luminous 12.1.0 RC.
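
    If the repository is not cloned yet, a clone along these lines (repository and branch taken from the URL above) is assumed before the build steps:

    # git clone -b luminous-rdma https://github.com/Mellanox/ceph.git

    # cd ceph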

     

    2. Compile:

    # git submodule update --init --recursive

    # ./install-deps.sh

    # ./do_cmake.sh  -DCMAKE_INSTALL_PREFIX=/usr

    # cd build

    # time make -j16

    # sudo make install

     

    3. Ensure that your build has RDMA support:

    # strings /usr/bin/ceph-osd | grep -i rdma
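
    A slightly more targeted variant of the same check (a sketch, relying on the ms_async_rdma_* option names used below) is to look for the RDMA configuration options in the binary; empty output suggests the build lacks RDMA support:

    # strings /usr/bin/ceph-osd | grep -i ms_async_rdma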

     

    4. Kill all Ceph processes on all nodes:

    # sudo systemctl stop ceph-osd.target

    # sudo systemctl stop ceph-mon.target

     

         Or by using the "kill" command.

     

    5. Ensure that all Ceph processes are down on every Ceph node:

    # ps aux | grep ceph

    6. Bring up Ceph in TCP mode (the default async messenger), as shown below.
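
    On a systemd-based deployment, the same targets used later in step 11.2 can be started (ceph-mon/ceph-mgr on the monitor node, ceph-osd on the OSD nodes):

    # sudo systemctl start ceph-mon.target

    # sudo systemctl start ceph-mgr.target

    # sudo systemctl start ceph-osd.target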

     

    7. Verify that Ceph is up and running:

         # ceph -s

     

    8. Stop all Ceph processes again (as in step 4).

     

    9. Add the following to your Ceph configuration file under the [global] section:

    # To set both the frontend (public) and backend (cluster) networks to RDMA
    ms_type = async+rdma

    # To set only the backend (cluster) network to RDMA
    ms_cluster_type = async+rdma

     

    # Set the device name according to the InfiniBand or RoCE device used, e.g.
    ms_async_rdma_device_name = mlx5_0

    # Set the local GID of the RoCEv2 interface used by Ceph.
    # The GID corresponding to the IPv4 or IPv6 network should be taken
    # from the show_gids command output.
    # This parameter must be set uniquely per OSD server/client.
    # If this parameter is not defined, the network is limited to RoCEv1,
    # which means no routing and no congestion control (ECN).
    ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:6ea8:0138

     

    You can get the GID using the show_gids script; see Understanding show_gids Script.
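
    Putting these options together, a minimal [global] snippet for an RDMA-enabled cluster might look like the following (the device name and GID are the examples from this post and must match your own setup):

    [global]
    ms_type = async+rdma
    ms_async_rdma_device_name = mlx5_0
    ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:6ea8:0138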

     

    10. Update the configuration file on all Ceph nodes.

     

    11. If you are using systemd services:

    11.1     Validate that the following parameters are set in the relevant systemd unit files under /usr/lib/systemd/system/:

         ceph-disk@.service

              LimitMEMLOCK=infinity

         ceph-mds@.service

              LimitMEMLOCK=infinity

              PrivateDevices=no

         ceph-mgr@.service

              LimitMEMLOCK=infinity

         ceph-mon@.service

              LimitMEMLOCK=infinity

              PrivateDevices=no

         ceph-osd@.service

              LimitMEMLOCK=infinity

         ceph-radosgw@.service

              LimitMEMLOCK=infinity

              PrivateDevices=no

     

    Note: if you modify the systemd configuration for ceph-mon/ceph-osd, you need to run the following for the changes to take effect:

    # systemctl daemon-reload
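
    As an alternative to editing the unit files under /usr/lib/systemd/system/ directly, a systemd drop-in override can be used; a sketch for ceph-osd@.service (repeat for the other units listed above):

    # sudo systemctl edit ceph-osd@.service

    Then add the required settings in the drop-in, for example:

    [Service]
    LimitMEMLOCK=infinity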

     

    11.2     Restart all cluster processes on the monitor node:

    # sudo systemctl start ceph-mon.target    # also starts ceph-mgr

    # sudo systemctl start ceph-mgr.target

     

    On the OSD nodes:

    # sudo systemctl start ceph-osd.target

    or

    # for i in $(sudo ls /var/lib/ceph/osd/ | cut -d- -f2); do sudo systemctl start ceph-osd@$i; done
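
    To confirm that the OSD daemons actually came up, the same kind of check used in step 5 can be reused:

    # ps aux | grep ceph-osd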

            

    12. For manual start-up of Ceph processes:

    12.1     Open /etc/security/limits.conf and add the following lines. RDMA registers and pins buffers in physical memory, so the locked-memory (memlock) limits must be unlimited.

    * soft memlock unlimited
    * hard memlock unlimited
    root soft memlock unlimited
    root hard memlock unlimited
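
    The new limits apply to new login sessions; a quick sanity check after logging in again is:

    # ulimit -l

    which should print "unlimited".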

    12.2     Run the processes

         On the monitor node

    # sudo /usr/bin/ceph-mon --cluster ceph --id clx-ssp-056 --setuser ceph --setgroup ceph

    # sudo /usr/bin/ceph-mgr --cluster ceph --id clx-ssp-056 --setuser ceph --setgroup ceph

     

         On the OSD nodes

    # for i in $(sudo ls /var/lib/ceph/osd/ | cut -d- -f2); do sudo /usr/bin/ceph-osd --cluster ceph --id $i --setuser ceph --setgroup ceph & done

     

    Verification

    1. Check health:

    # ceph -s

     

    2. Check that RDMA is working as expected.

     

    The following command shows whether RDMA traffic occurs on the server hosting osd.0:

    # ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1

    {
        "AsyncMessenger::RDMAWorker-1": {
            "tx_no_mem": 0,
            "tx_parital_mem": 0,
            "tx_failed_post": 0,
            "rx_no_registered_mem": 0,
            "tx_chunks": 30063062,
            "tx_bytes": 1512924920228,
            "rx_chunks": 23115500,
            "rx_bytes": 480212597532,
            "pending_sent_conns": 0
        }
    }
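
    As an additional cross-check, RDMA traffic can also be observed at the NIC level; a sketch assuming the mlx5_0 device and port 1 from the configuration above (these sysfs counters should keep increasing while Ceph traffic flows):

    # cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data

    # cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data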