Bring up Ceph RDMA - Developer Guide

    This post provides bring-up examples for Ceph RDMA.

     

    Bring up your Ceph cluster in TCP mode (the default Async messenger) before switching to RDMA.

     


    Preparations

    1. Optional: install the latest MLNX_OFED and restart the openibd driver.

     

    2. Make sure rping runs successfully between all nodes:

    Server: rping -s -v -a server_ip

    Client: rping -c -v -a server_ip
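
    When the fabric is healthy, the verbose client typically prints ping lines of the following form (the exact payload pattern may differ between rping versions):

    ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
    ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs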

     

     

     

    Ceph

    1. Get the Ceph master branch from https://github.com/ceph/ceph.
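
    For example:

    # git clone https://github.com/ceph/ceph.git
    # cd ceph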

     

    2. Compile:

    # git submodule update --init --recursive

    # ./install-deps.sh

    # ./do_cmake.sh -DCMAKE_INSTALL_PREFIX=/usr

    # cd build

    # time make -j16

    # sudo make install

     

    3. Make sure RDMA support is compiled into your build:

    # strings /usr/bin/ceph-osd | grep -i rdma
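
    If RDMA support is compiled in, this prints RDMA-related symbols; empty output means the build lacks RDMA. A quick scripted check (a sketch; RDMAStack is one symbol name present in RDMA-enabled builds, though the exact strings vary by version):

    # strings /usr/bin/ceph-osd | grep -q RDMAStack && echo 'RDMA support present'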

     

    4. Stop all Ceph processes on all nodes:

    # sudo systemctl stop ceph-osd.target

    # sudo systemctl stop ceph-mon.target

     

    Alternatively, kill the processes directly.
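
    For example (a sketch, assuming pkill from procps is available):

    # sudo pkill ceph-osd
    # sudo pkill ceph-mon
    # sudo pkill ceph-mgr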

     

    5. Make sure everything is down on every Ceph node:

    # ps aux |grep ceph

     

    6. Add the following to your Ceph configuration file under the [global] section:

     

    # To set both the public (front-end) and cluster (back-end) networks to RDMA:

    ms_type = async+rdma

    # To set only the cluster (back-end) network to RDMA:

    ms_cluster_type = async+rdma

    # Set the device name according to the IB or RoCE device used, e.g.:

    ms_async_rdma_device_name = mlx5_0

    # Set the local GID for the RoCEv2 interface used by Ceph.
    # The GID corresponding to the IPv4 or IPv6 network
    # should be taken from the show_gids command output.
    # This parameter must be set uniquely per OSD server/client.
    # Not defining this parameter limits the network to RoCEv1,
    # which means no routing and no congestion control (ECN).

    ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:6ea8:0138

     

    You can get the GID value using the show_gids script; see Understanding show_gids Script.
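
    If show_gids is not available, the GID table can also be read directly from sysfs. A minimal sketch, assuming device mlx5_0, port 1, and GID index 3 (adjust all three to your setup):

    # ibv_devices
    # cat /sys/class/infiniband/mlx5_0/ports/1/gids/3
    0000:0000:0000:0000:0000:ffff:6ea8:0138
    # cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3
    RoCE v2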

     

    7. Update the configuration file on all Ceph nodes.
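
    For example, to push it to every node (a sketch; the host names are placeholders for your actual nodes):

    # for h in ceph-node-1 ceph-node-2 ceph-node-3; do scp /etc/ceph/ceph.conf $h:/etc/ceph/ceph.conf; done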

     

    8. If using systemd services, validate that the following parameters are set in the relevant unit files under /usr/lib/systemd/system/:

         ceph-disk@.service

              LimitMEMLOCK=infinity

         ceph-mds@.service

              LimitMEMLOCK=infinity

              PrivateDevices=no

         ceph-mgr@.service

              LimitMEMLOCK=infinity

         ceph-mon@.service

              LimitMEMLOCK=infinity

              PrivateDevices=no

         ceph-osd@.service

              LimitMEMLOCK=infinity

         ceph-radosgw@.service

              LimitMEMLOCK=infinity

              PrivateDevices=no

     

    and then run:

    # systemctl daemon-reload
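
    Instead of editing the packaged unit files in place, the same settings can be applied through systemd drop-in overrides, which survive package upgrades. A sketch for ceph-osd@.service (the file name rdma.conf is arbitrary; repeat for the other units, adding PrivateDevices=no where listed above):

    # sudo mkdir -p /etc/systemd/system/ceph-osd@.service.d
    # sudo tee /etc/systemd/system/ceph-osd@.service.d/rdma.conf <<EOF
    [Service]
    LimitMEMLOCK=infinity
    EOF
    # sudo systemctl daemon-reload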

     

    9. Restart all cluster processes

    On the monitor node:

    # sudo systemctl start ceph-mon.target    # may also start ceph-mgr

    # sudo systemctl start ceph-mgr.target

     

    On the OSD nodes:

    # sudo systemctl start ceph-osd.target

    or

    # for i in $(sudo ls /var/lib/ceph/osd/ | cut -d '-' -f 2); do sudo systemctl start ceph-osd@$i; done

            

    10. If not using systemd, open /etc/security/limits.conf and add the following lines. RDMA is tightly coupled to physical memory (buffers are pinned), so the memlock limit must be unlimited:

    * soft memlock unlimited
    * hard memlock unlimited
    root soft memlock unlimited
    root hard memlock unlimited
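
    After logging in again, you can verify that the new limit is in effect:

    # ulimit -l
    unlimited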

    Then start the daemons manually. On the monitor node:

    # sudo /usr/bin/ceph-mon --cluster ceph --id clx-ssp-056 --setuser ceph --setgroup ceph

    # sudo /usr/bin/ceph-mgr --cluster ceph --id clx-ssp-056 --setuser ceph --setgroup ceph

     

    On the OSD nodes:

    # for i in $(sudo ls /var/lib/ceph/osd/ | cut -d '-' -f 2); do sudo /usr/bin/ceph-osd --cluster ceph --id $i --setuser ceph --setgroup ceph & done

     

    11. Note that if you modify the systemd configuration for ceph-mon/ceph-osd, you may need to run:

    # systemctl daemon-reload 

     

     

    Verification

    1. Check the cluster health:

    # ceph -s

     

    2. Check that RDMA is working as expected.

    The following command shows whether RDMA traffic occurs on the server hosting osd.0 (non-zero tx/rx counters indicate RDMA traffic):

    # ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1

    {
        "AsyncMessenger::RDMAWorker-1": {
            "tx_no_mem": 0,
            "tx_parital_mem": 0,
            "tx_failed_post": 0,
            "rx_no_registered_mem": 0,
            "tx_chunks": 30063062,
            "tx_bytes": 1512924920228,
            "rx_chunks": 23115500,
            "rx_bytes": 480212597532,
            "pending_sent_conns": 0
        }
    }
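
    To confirm that traffic is actually flowing over RDMA, sample the byte counters twice under load and check that they increase; for example:

    # ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1 | grep -E '(tx|rx)_bytes'
    # sleep 10
    # ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1 | grep -E '(tx|rx)_bytes'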