I'm having a similar problem with the RHEL inbox drivers and NFS over RDMA between some newly installed machines: I'm getting Local Length Errors. While I don't have a solution for you specifically, how did you do the validation you mention (qp_access_flags, buffer sizes, etc.)? Some of the things you're checking might help with my problem.
Basically, by validating (qp_access_flags, buffer sizes, etc.) I mean I made sure they had the right values. For example, qp_access_flags had RDMA Read and RDMA Write enabled, the buffer size in the work request matched the size used during memory registration, and so on.
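For reference, the kind of check described above can be sketched with libibverbs' ibv_query_qp. This is a minimal illustration, assuming you already have a connected struct ibv_qp *; the attribute masks and flag names are the standard verbs ones:

```c
/* Sketch: verify a QP's remote-access flags and max_dest_rd_atomic
 * after the connection is established. Assumes an existing qp. */
#include <stdio.h>
#include <infiniband/verbs.h>

int check_qp_access(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    struct ibv_qp_init_attr init_attr;

    if (ibv_query_qp(qp, &attr,
                     IBV_QP_ACCESS_FLAGS | IBV_QP_MAX_DEST_RD_ATOMIC,
                     &init_attr))
        return -1;

    if (!(attr.qp_access_flags & IBV_ACCESS_REMOTE_READ))
        fprintf(stderr, "remote RDMA Read not enabled on this QP\n");
    if (!(attr.qp_access_flags & IBV_ACCESS_REMOTE_WRITE))
        fprintf(stderr, "remote RDMA Write not enabled on this QP\n");

    /* Zero here means incoming RDMA Reads to this QP will fail. */
    printf("max_dest_rd_atomic = %u\n", attr.max_dest_rd_atomic);
    return 0;
}
```

Checking that the registered memory region's access flags (in ibv_reg_mr) also include the needed remote permissions is a similar one-liner on the MR side.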
I was able to get this issue resolved. The problem was with the "max_dest_rd_atomic" QP attribute. Per the documentation, "max_dest_rd_atomic" is the "number of RDMA Reads outstanding at any time for this QP as a destination". Our code was using RDMACM for connection management. RDMACM sets "max_dest_rd_atomic" from the "responder_resources" field of the "rdma_conn_param" structure passed to "rdma_connect". That field's purpose was not obvious, so we left it unset, which caused RDMACM to set "max_dest_rd_atomic" to zero and made every RDMA Read targeting this node fail.
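A minimal sketch of the fix described above: explicitly set responder_resources (and, for reads this node initiates, initiator_depth) in rdma_conn_param before calling rdma_connect. This assumes an already-resolved struct rdma_cm_id *; the value 4 is illustrative, not taken from the original code:

```c
/* Sketch: zero the conn_param struct, then set the RD-atomic
 * fields so RDMACM does not leave max_dest_rd_atomic at zero. */
#include <string.h>
#include <rdma/rdma_cma.h>

int connect_with_rd_atomic(struct rdma_cm_id *id)
{
    struct rdma_conn_param param;

    memset(&param, 0, sizeof(param));
    /* Max outstanding RDMA Reads this node accepts as a
     * destination -> becomes this QP's max_dest_rd_atomic. */
    param.responder_resources = 4;
    /* Max outstanding RDMA Reads this node issues as an
     * initiator. */
    param.initiator_depth = 4;
    param.retry_count = 7;
    param.rnr_retry_count = 7;

    return rdma_connect(id, &param);
}
```

The passive side has the same pitfall: the server should set these fields in the rdma_conn_param it passes to rdma_accept (or derive them from the values in the incoming connection request event).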
Basically, the syndrome "Remote_Invalid_Request_Error" can stem from many issues that are not clearly defined, so it took us a while to pinpoint the exact cause. This is where I was hoping the "vendor syndrome" might come in handy for finding the root cause of "Remote_Invalid_Request_Error" or similar errors that have multiple possible failure reasons. Unfortunately, the "vendor syndrome" does not seem to be exported by Mellanox. It would help if Mellanox exported this error along with its corresponding description, so that Mellanox RDMA users could debug similar issues.