VMA Improves MemCacheD Performance over 10GbE Networks


    This post presents latency and throughput benchmark results, measured with the memaslap benchmark, for MemCacheD running over a high-speed Mellanox Ethernet network with and without VMA acceleration.

     

    The Benchmark

    The memaslap benchmark is a command-line utility, developed alongside MemCacheD, for load generation and benchmarking of Key-Value databases.

    In our performance tests, we used memaslap to compare the latency and throughput of MemCacheD running with and without VMA.

    The results showed a significant improvement in favor of running MemCacheD on top of VMA compared to running on the native kernel network stack.

     

    MemCacheD without VMA saturates at around ~1M TPS (Transactions Per Second), while with VMA it saturates at around ~2M TPS.

     

    Each test was executed in two configurations:

    1. "No VMA" - Run over kernel network sockets without any acceleration
    2. "VMA on Server" side only acceleration - Client ran over the Linux sockets.

     

    The results were measured with a single MemCacheD server and a single memaslap client, issuing GET operations (with a 99.9% GET / 0.1% SET mix, per the configuration below) on 64-byte keys and values.
    The MemCacheD server ran with 7 threads, while memaslap ran with a varying number of threads and connections to achieve different TPS rates.
    The memaslap client always ran with VMA, in order to achieve high rates with a single client.

     


    Setup and Configuration

    Memaslap configuration file (fixed 64-byte keys and values; GET/SET command mix):

    key
    64 64 1
    value
    64 64 1
    cmd
    0 0.001
    1 0.999

    (GET ratio 99.9%, SET ratio 0.1%)
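
    For reference, memaslap can read the key/value sizes and command mix from a configuration file supplied with its -F (--cfg_cmd) option. The sketch below assumes the block above is saved as memaslap.cnf (a hypothetical file name); the exact command line used in this post is shown further below.

    # memaslap -s 2.2.2.4:11211 -F memaslap.cnf -T 8 -c 64 -t 30s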

    Tuning

    # service irqbalance stop

    # service iptables stop

    # service cpuspeed stop
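
    A quick way to sanity-check the tuning is to confirm the cores are running at their full clock frequency once cpuspeed is stopped (reported values depend on the machine):

    # grep MHz /proc/cpuinfo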

     

    MemCacheD Command Line

    Command Line for "No VMA"

    # LD_LIBRARY_PATH=/usr/local/lib memcached -m 12000 -l 2.2.2.4 -u root -t 7 -c 10000
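
    Here -m 12000 sets the cache memory limit in MB, -l the listen address, -u the user to run as, -t the number of worker threads, and -c the maximum number of simultaneous connections.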

    Command Line for "VMA on the Server"

    # LD_PRELOAD=libvma.so VMA_RING_ALLOCATION_LOGIC_TX=31 VMA_RING_ALLOCATION_LOGIC_RX=31 LD_LIBRARY_PATH=/usr/local/lib taskset -c 8-15 memcached -m 12000 -l 2.2.2.4 -u root -t 7 -c 10000
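
    A quick way to verify that the server is actually running on top of VMA is to check that libvma.so is mapped into the memcached process; a minimal sketch, with <memcached_pid> standing in for the actual process ID (VMA also typically prints a "VMA INFO" banner to stderr when it loads):

    # pgrep -x memcached
    # grep libvma /proc/<memcached_pid>/maps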

     

    Memaslap Command Line

    # VMA_RING_ALLOCATION_LOGIC_TX=20 VMA_RING_ALLOCATION_LOGIC_RX=20 LD_PRELOAD=libvma.so taskset -c 8-15 memaslap -s 2.2.2.4:11211 -T $1 -c $2 -t 30s -X 64  -S 1s

     

    Where $1 is the number of threads and $2 is the concurrency (number of connections) to simulate with the load, taken from: (1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (8,16), (8,24), (8,32), (8,64). A scripted way to run the whole sweep is sketched below.
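
    One simple approach is to wrap the memaslap command above in a small script taking $1 and $2 and loop over the pairs; a minimal sketch, with run_memaslap.sh as a hypothetical script name:

    # cat run_memaslap.sh
    #!/bin/bash
    # $1 = memaslap threads, $2 = concurrency (connections)
    VMA_RING_ALLOCATION_LOGIC_TX=20 VMA_RING_ALLOCATION_LOGIC_RX=20 LD_PRELOAD=libvma.so \
        taskset -c 8-15 memaslap -s 2.2.2.4:11211 -T $1 -c $2 -t 30s -X 64 -S 1s

    # for pair in 1,1 2,2 3,3 4,4 5,5 6,6 7,7 8,8 8,16 8,24 8,32 8,64; do ./run_memaslap.sh ${pair%,*} ${pair#*,}; done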

     

    Results

    1. Latency vs. Transaction Rate (GET & SET operations) [lower latency is better]

    2. Maximum Transaction Rate achieved while keeping latency below 100 usec [higher is better]

     

    Conclusions

    MemCacheD without VMA saturates at around ~1M TPS, while MemCacheD with VMA saturates at around ~2M TPS.