Our setup is:
1 x Mellanox MX354A dual-port FDR CX3 adapter w/ 1 x QSA adapter
1 x Xeon E5-2450 processor (8 cores, 2.1 GHz)
16GB memory (4 x 2GB RDIMMs, 1.6 GHz)
We have a 4-node cluster, and every node acts as both server and client.
On a write, a node splits the data into 4 pieces and writes them concurrently to the 4 nodes.
On a read, a node reads from the 4 nodes.
We expect this to scale with the number of clients.
When one node is reading, it gets 6.4 GB/s of bandwidth,
but when 2 nodes are reading, each gets only 5 GB/s, even though the aggregate bandwidth is sufficient.
There is only 1 CPU per node, so no NUMA discrepancy arises.
Suspecting possible NIC cache misses, we measured PCIe reads using pcm-pcie.
It appears that PCIe reads simply do not scale with an increasing number of clients, even though the available PCIe bandwidth is much higher.
There must be contention when multiple connections (QPs) read from a single server.
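For reference, a pcm-pcie run can look like the sketch below; the binary name and arguments vary between Intel PCM releases, so treat this as an assumption rather than the exact command we ran.

```shell
# Sketch: sample PCIe read/write events once per second with Intel PCM.
# (The binary may be named pcm-pcie.x in older PCM releases.)
sudo pcm-pcie 1
```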
Mellanox, can you pinpoint the root cause and suggest a possible solution for multi-client scalability?
It might be useful to translate your test into something that uses ib_read_bw/ib_write_bw (or iperf, if you are using TCP) and show the output.
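To make the scenario reproducible, something like the following could mimic "multiple clients reading from one server" with the perftest tools; the device name (mlx4_0), QP count, and SERVER_IP are placeholders for your setup.

```shell
# On the server node (the one being read from):
ib_read_bw -d mlx4_0 -q 4 -D 10

# On each reading node (start several concurrently to mimic multiple
# clients; SERVER_IP is a placeholder for the server's address):
ib_read_bw -d mlx4_0 -q 4 -D 10 SERVER_IP
```

Running the client with and without multiple QPs (-q) should show whether the per-client drop tracks the number of connections.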
Do you see any drops in the 'ethtool -S' output or in the device statistics (ifconfig, ip)?
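A quick way to check both (the interface name eth2 is a placeholder):

```shell
# Look for drop/discard/error counters on the ConnectX interface:
ethtool -S eth2 | grep -Ei 'drop|discard|err'

# Kernel-level RX/TX statistics for the same interface:
ip -s link show dev eth2
```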
Does the sender use a single CPU core when writing to two different clients? In other words, does it use the same thread?
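One way to verify this is per-thread CPU accounting, e.g. with the sysstat tools (the PID here is a placeholder for the sender process):

```shell
# Per-thread CPU usage of the sender, sampled every second:
pidstat -t -p <PID> 1

# Or check whether a single core is saturated while others are idle:
mpstat -P ALL 1
```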
You might also check the output of the 'mlnx_perf' command; note that it requires Mellanox OFED installed on the host.
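For example (the interface name is a placeholder, and option spelling may differ between MLNX_OFED versions):

```shell
# Sample the NIC's hardware counters for eth2 (requires MLNX_OFED):
mlnx_perf -i eth2
```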