Benchmarking Memcached, Redis, and EMS

These benchmarks compare the performance of the highly optimized key-value stores Redis and Memcached to an unoptimized key-value store implemented in interpreted JavaScript using the Extended Memory Semantics (EMS) addon. EMS complements legacy serial programming languages and applications with support for shared-memory parallel programming. EMS targets data storage and analytics workloads that are too large for one core, but not large enough to justify a scalable cluster.

Download the Benchmark at GitHub

Introduction

A key-value store is similar to a file system in that a filename (key) is used to access shared data (values), but differs from a file system by adding a rich set of commands to modify or process data independently of the requester. Specialization of key-value stores has led to two dominant tools, Redis and Memcached, with different features and execution models.

The common functionality of reading and writing data allows the two applications to be compared to each other and to other load-store models like EMS. However, unlike Memcached and Redis, EMS has no server process the program communicates with; instead, EMS is based directly on the CPU's cache-coherent memory, where synchronization is a property of the data, not of the tasks.

EMS Based Applications

EMS extends the basic load/store model with synchronization primitives to support parallel programming. Access to memory happens by transitioning the memory through a finite state machine designed for parallel computing.

Access to each object in EMS memory is governed by a Finite State Machine (FSM) which is automatically transitioned when data is accessed.
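The full/empty style of these FSM transitions can be illustrated with a minimal pure-JavaScript simulation. This is a hypothetical sketch of the tag semantics only, not the EMS addon's actual implementation: in EMS the transitions happen atomically on shared memory and a disallowed access blocks, whereas this synchronous stand-in simply throws.

```javascript
// Illustrative simulation of EMS-style full/empty tags (NOT the EMS addon).
// A cell in the "empty" state accepts a write and becomes "full"; a cell in
// the "full" state accepts a read and becomes "empty". In real EMS an illegal
// transition blocks the task; here it throws instead.
class TaggedCell {
  constructor() {
    this.tag = "empty";
    this.value = undefined;
  }
  writeEF(v) {          // write when Empty, leave Full
    if (this.tag !== "empty") throw new Error("cell not empty");
    this.value = v;
    this.tag = "full";
    return v;
  }
  readFE() {            // read when Full, leave Empty
    if (this.tag !== "full") throw new Error("cell not full");
    this.tag = "empty";
    return this.value;
  }
}

const cell = new TaggedCell();
cell.writeEF(42);
console.log(cell.readFE()); // 42
```

Because the state travels with the datum, two tasks alternating writeEF/readFE on the same cell form a one-element producer-consumer queue with no separate lock object.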

Any number, kind, or combination of programs may use EMS memory simultaneously. A hybrid system running multiple web servers, search and recommendation engines, key-value stores, and message queues can be hosted on a single multi-core system, as shown in the following figure. Shared memory makes it possible to share key-value pairs between different types of servers, acting as a protocol translator. Because memory is not rate-limited or a potential bottleneck, access to EMS data does not need to be rationed or performed asynchronously.

Instead of several types and sizes of distributed systems, an integrated hybrid server hosts all the services and applications for an entire web site. A single copy of data is shared by all processes, eliminating duplicate storage, network traffic, and associated overhead. Replica servers provide resiliency.

Integrated Hardware-Software Platform

In addition to utilizing multi-core CPUs, hybrid systems may also exploit GPUs to power the number-crunching analytics of recommendation engines, and manycore accelerators to perform unstructured searching and analytics. Software services may communicate with other local processes via EMS shared memory, or use a network protocol to communicate with local or remote processes. Tuned hardware-software platforms can be time-shared by any number or combination of services or users, possibly using isolation and encryption provided directly by the OS or hardware.

Benchmarks

The experiment writes data to a key-value store and then reads it back, timing each operation and checking it for correctness. These experiments measure the latency of an operation (elapsed time from when the client initiates the request until a response is received) and the capacity (number of operations per second handled by the servers) of reading and writing a key-value pair.

50,000 key-value pairs are used; each key contains 100 randomly chosen characters, and each value is a unique 4096-byte string. Both the client and server programs (except Redis) are parallel, and the number of processes is varied from 1 to 40 to measure performance at different scales. The server platform was a 32-core x86 server with 60GB of RAM.

EMS Memcached Server

Because EMS lacks a server process, comparing the performance of EMS to a Redis or Memcached server required writing a rudimentary server that handles Memcached syntax in JavaScript, using EMS to share data between processes. Because this EMS server is a Memcached work-alike, the same client program is used for both the Memcached and EMScached experiments.
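As a sketch of what such a work-alike must handle, the following parses the Memcached text protocol's set and get request lines. A real server also handles the data block that follows a set, plus flags like noreply, multi-key gets, cas, and deletes; this parser is an illustration, not the EMScached server's code.

```javascript
// Minimal parser for two Memcached text-protocol request lines:
//   set <key> <flags> <exptime> <bytes>\r\n
//   get <key> [<key> ...]\r\n
function parseRequestLine(line) {
  const parts = line.trim().split(/\s+/);
  switch (parts[0]) {
    case "set": {
      const [, key, flags, exptime, bytes] = parts;
      return { cmd: "set", key, flags: +flags, exptime: +exptime, bytes: +bytes };
    }
    case "get":
      return { cmd: "get", keys: parts.slice(1) };
    default:
      return { cmd: "error", line };
  }
}

console.log(parseRequestLine("set mykey 0 0 4096"));
console.log(parseRequestLine("get mykey otherkey"));
```

Because both servers accept the same request lines, the client program needs no changes to switch between Memcached and EMScached.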

Memcached is multithreaded and by default uses 4 threads. The benchmarks varied the number of server processes, but more than 4 threads did not yield more performance for either Memcached or EMScached, so those results are not reported. EMS's parallel loops and synchronization primitives were used in both the Memcached and Redis client programs to create parallel worker tasks and enforce a barrier synchronization, making it possible to accurately measure the elapsed time of the different benchmark phases.

Oversubscription

Because no additional hardware resources must be allocated to use additional threads or processes, it is possible to oversubscribe the hardware with more threads than cores, relying on the operating system's time-sharing and load-balancing mechanisms to schedule task execution on the available resources. Bear in mind that the client and server processes in these experiments run on the same system, so their resources must be summed: 16 servers and 16 clients is a total of 32 tasks; 32 clients and 16 servers is 48 tasks, oversubscribing the 32-core system by 50%.
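The bookkeeping above reduces to a single ratio. The helper below is hypothetical, just making the arithmetic explicit: client and server tasks share one machine, so the oversubscription ratio is their sum divided by the core count.

```javascript
// Oversubscription ratio: total tasks (clients + servers) over cores.
// A ratio of 1 is fully subscribed; above 1 is oversubscribed.
function oversubscription(clients, servers, cores) {
  return (clients + servers) / cores;
}

console.log(oversubscription(16, 16, 32)); // prints 1   (fully subscribed)
console.log(oversubscription(32, 16, 32)); // prints 1.5 (oversubscribed 50%)
```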

A context switch is already part of using the network, making oversubscription a natural way to mask the unavoidable latency of network communication. However, once there is enough work to mask all the latency, additional oversubscription typically has a detrimental effect on performance, increasing execution time. This penalty is highly dependent on how tasks are scheduled onto machine resources.

The system used for benchmarking is a virtual machine, so it is not known whether the virtual cores are separate hardware resources, Simultaneous Multithreading (SMT) contexts, or purely virtual cores. Depending on how processes are scheduled, oversubscription may add more hardware resources or share existing ones. For example, doubling the number of processes could double the amount of cache or memory bandwidth available to the entire job, or halve the effective cache size and bandwidth per core.

All experiment results: EMS read and write are the blue lines, which exceed 200,000 ops/sec. At 1.5 million operations per second, the JavaScript EMS benchmark is fetching and validating data at over 6GB/sec. EMS writes are limited by the sequential memory allocator implemented in EMS, which could be replaced with a more sophisticated parallel allocator.

EMS local memory access is many times lower latency and higher bandwidth than data access via a network and server, permitting upper-layer functionality to be implemented on EMS using an interpreted language and still achieve performance rivaling optimized native code.

Operation Latency

The time to perform each operation (from initiation to response) is measured and the minimum time is recorded as the minimum latency. This measurement demonstrates the fastest observed execution of the critical path through the client-server pair, and represents the performance potential of the existing code. This timing can also be used as the Tserial component of Amdahl's law (the portion which cannot be sped up with additional parallelism).
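Using the minimum latency as the Tserial term, Amdahl's law bounds the speedup additional workers can deliver. The sketch below uses illustrative placeholder times, not measured values from these experiments.

```javascript
// Amdahl's law with explicit serial and parallelizable time components:
//   speedup(n) = (Tserial + Tparallel) / (Tserial + Tparallel / n)
// Tserial is the critical-path time that cannot be sped up by adding workers
// (here, the minimum observed operation latency); Tparallel divides across n.
function amdahlSpeedup(tSerial, tParallel, nWorkers) {
  return (tSerial + tParallel) / (tSerial + tParallel / nWorkers);
}

// Placeholder numbers: 1 unit serial, 99 units parallelizable.
console.log(amdahlSpeedup(1, 99, 10));       // ~9.17x with 10 workers
console.log(amdahlSpeedup(1, 99, Infinity)); // 100x asymptotic upper bound
```

Even a small serial term caps scaling: with 1% of the time serial, no number of workers can exceed 100x, which is why shrinking the critical path matters as much as adding processes.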

Fastest completion of a write operation. Shared-memory EMS response time is consistently under 2ms. Redis' latency increases with the number of clients, eventually becoming slower than Memcached and matching the speed of EMScached.



Fastest completion of a read operation. Shared-memory EMS response time is consistently under 2ms, and EMScached server responses are about 150ms.



Number of EMS write operations retired per second to shared memory or to different numbers of servers. EMS writes to shared memory are limited by the sequential memory allocator in EMS, which could be replaced with a more sophisticated allocator that supports concurrency. Scaling above 16 cores is limited by oversubscription (i.e., 32 clients connected to 32 servers is oversubscribed by a factor of two).



Number of Memcached and Redis writes retired by their respective servers. Peak rate for single threaded Redis is about 40,000 writes per second. Multithreaded Memcached performance stepping is clearly seen as threads are added up to a maximum of about 100,000 writes per second on 4 threads. The lower oversubscription ratio allows Memcached benchmarks to scale up to 24 clients on this 32-core system.



Reads per second from an EMScached server. EMScached latency is longer than Memcached's, so more concurrency is required to mask that latency. Oversubscription limits scaling past 16 server processes. EMS shared memory operates at 10x these rates (see the All Experiments figure above) and is omitted to keep the Y-axis readable.



Reads per second from Memcached and Redis servers. Because Memcached's latency is lower than EMScached's, each server thread can process more operations per second and only 4 server threads are needed, allowing the clients to scale up to 24 processes before the machine is oversubscribed.

Conclusions

When EMS is used for finding, synchronizing, and marshaling shared data, an untuned, interpreted-language implementation of a key-value store can rival the performance of a highly optimized, natively compiled key-value store, while enjoying all the benefits of high-level rapid-development tools. EMS encourages integration, customization, and acceleration of legacy applications into high-performance systems that are easier to scale, deploy, and maintain.