1 Reply Latest reply on Jul 27, 2018 1:03 PM by fwmiller

    Various ping programs segfaulting

    fwmiller

      I have a build of rdma-core on kernel 4.17 using yocto for an Altera Arria10 with a dual-core A53 ARM processor.  The system is built and rxe configures correctly, i.e. I can run rxe_cfg start and rxe_cfg add eth0, and the ibv_devices output looks good:

       

      root@arria10:~# rxe_cfg status

        Name  Link  Driver   Speed  NMTU  IPv4_addr  RDEV  RMTU

        eth0  yes   st_gmac         1500  10.0.1.28  rxe0  1024  (3)

      root@arria10:~# ibv_devices

          device                 node GUID

          ------              ----------------

          rxe0                085697fffec1059b

      root@arria10:~# ibv_devinfo rxe0

      hca_id: rxe0

              transport:                      InfiniBand (0)

              fw_ver:                         0.0.0

              node_guid:                      0856:97ff:fec1:059b

              sys_image_guid:                 0000:0000:0000:0000

              vendor_id:                      0x0000

              vendor_part_id:                 0

              hw_ver:                         0x0

              phys_port_cnt:                  1

                      port:   1

                              state:                  PORT_ACTIVE (4)

                              max_mtu:                4096 (5)

                              active_mtu:             1024 (3)

                              sm_lid:                 0

                              port_lid:               0

                              port_lmc:               0x00

                              link_layer:             Ethernet

       

      This all looks good.  However, when I try to ping this machine against a PC running rdma-core, I'm getting some strange errors including a segfault when the Arria10 acts as server for udaddy.

       

      root@arria10:~# udaddy -s 10.0.1.16

      udaddy: starting client

      [ 1883.526301] rdma_rxe: null vaddr

      udaddy: connecting

      failed to reg MR

      udaddy: failed to create messages: -1

      test complete

      Segmentation fault

       

      I traced the first error, "rdma_rxe: null vaddr", to rxe_mem_init_user() in <kernel>/drivers/infiniband/sw/rxe/rxe_mr.c.  It appears that a page address lookup, perhaps a virtual-to-physical translation, is failing.  Any thoughts on how to solve this?

       

      Thanks,

      FM

        • Re: Various ping programs segfaulting
          fwmiller

          This turned out to be a nasty little bug.  It turns out there is a place where the rxe driver registers memory using an area of memory that is not available on the ARM processor we are using.  Here's the patch that made it work:

           

          2 files changed, 15 insertions(+), 2 deletions(-)

           

          diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c

          index 5c2684b..f2dc5a7 100644

          --- a/drivers/infiniband/sw/rxe/rxe_mr.c

          +++ b/drivers/infiniband/sw/rxe/rxe_mr.c

          @@ -31,6 +31,7 @@

            * SOFTWARE.

            */

           

          +#include <linux/highmem.h>

          #include "rxe.h"

          #include "rxe_loc.h"

           

          @@ -94,7 +95,15 @@ static void rxe_mem_init(int access, struct rxe_mem *mem)

          void rxe_mem_cleanup(struct rxe_pool_entry *arg)

          {

                  struct rxe_mem *mem = container_of(arg, typeof(*mem), pelem);

          -       int i;

          +       int i, entry;

          +       struct scatterlist *sg;

          +

          +       if (mem->kmap_occurred) {

          +               for_each_sg(mem->umem->sg_head.sgl, sg,

          +                           mem->umem->nmap, entry) {

          +                       kunmap(sg_page(sg));

          +               }

          +       }

           

                  if (mem->umem)

                          ib_umem_release(mem->umem);

          @@ -200,12 +209,14 @@ int rxe_mem_init_user(struct rxe_dev *rxe, struct rxe_pd *pd, u64 start,

                          buf = map[0]->buf;

           

                          for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {

          -                       vaddr = page_address(sg_page(sg));

          +                       // vaddr = page_address(sg_page(sg));

          +                       vaddr = kmap(sg_page(sg));

                                  if (!vaddr) {

                                          pr_warn("null vaddr\n");

                                          err = -ENOMEM;

                                          goto err1;

                                  }

          +                       mem->kmap_occurred = 1;

           

                                  buf->addr = (uintptr_t)vaddr;

                                  buf->size = BIT(umem->page_shift);

          diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h

          index af1470d..9bd7eac 100644

          --- a/drivers/infiniband/sw/rxe/rxe_verbs.h

          +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h

          @@ -343,6 +343,8 @@ struct rxe_mem {

                  u32                     num_map;

           

                  struct rxe_map          **map;

          +

          +       int                     kmap_occurred;

          };

           

          struct rxe_mc_grp {

          --

          2.7.4

           

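          The idea is that, to make this work on the ARM, you need to use kmap()/kunmap() rather than page_address() for these memory regions, which are shared between the kernel and user space.  page_address() only reports a kernel virtual address that already exists and can return NULL for a page with no kernel mapping, whereas kmap() creates the mapping on demand and kunmap() releases it.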
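To illustrate the difference outside the kernel, here is a minimal userspace sketch in C. It is a toy model under stated assumptions, not the kernel's implementation: struct toy_page, toy_page_address(), toy_kmap() and toy_kunmap() are hypothetical names invented for this example. They only mimic the behaviour relied on in the patch, namely that a lookup can return NULL for a page with no mapping, while an explicit map call creates one that must later be released.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy userspace model, NOT kernel code: every name here is hypothetical. */
#define TOY_PAGE_SIZE 4096

struct toy_page {
    int   highmem;    /* 1 if the page has no permanent kernel mapping */
    void *mapping;    /* current kernel-visible address, or NULL */
    int   map_count;  /* outstanding toy_kmap() calls on a highmem page */
};

/* Mimics page_address(): reports an existing mapping, never creates one. */
static void *toy_page_address(const struct toy_page *pg)
{
    return pg->mapping;
}

/* Mimics kmap(): creates a mapping on demand for highmem pages. */
static void *toy_kmap(struct toy_page *pg)
{
    if (!pg->highmem)
        return pg->mapping;            /* permanently mapped already */
    if (!pg->mapping)
        pg->mapping = malloc(TOY_PAGE_SIZE);
    if (pg->mapping)
        pg->map_count++;
    return pg->mapping;
}

/* Mimics kunmap(): drops one reference; the mapping goes away at zero. */
static void toy_kunmap(struct toy_page *pg)
{
    if (!pg->highmem || pg->map_count == 0)
        return;
    if (--pg->map_count == 0) {
        free(pg->mapping);
        pg->mapping = NULL;
    }
}
```

In this model, calling toy_page_address() on a fresh highmem page yields NULL, which corresponds to the "null vaddr" warning above. Calling toy_kmap() first yields a usable address, and each toy_kmap() needs a matching toy_kunmap(), which is the role of the loop added to rxe_mem_cleanup() in the patch.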

           

          Thanks,

          FM