[ Download Benchmark from GitHub ] -- Updated 2016-03-05

Node.js Cluster is Faster and Easier Than Async

The purpose of asynchronous execution is to overlap computation with communication and I/O, which can be achieved in other ways. This benchmark evaluates multi-process parallelism from the Node.js Cluster module.

It is possible to implement non-blocking servers using many synchronous event loops instead of a single event loop with asynchronous execution.

Introduction

Asynchronous event-driven programming is at the heart of Node.js; however, it is also the root cause of Callback Hell.

No single programming or execution model is appropriate for all tasks - programming is making decisions about how best to complete a given task. Every program has a certain amount of irreducible complexity that no programming and execution model can magically make disappear; the only question is who has to deal with it.

In Node.js, that complexity includes managing concurrency in the form of async callbacks. This can be substituted with multi-process parallelism using Node's built-in Cluster and Child Process modules. The Cluster module is similar to the Child Process module but also transparently shares sockets between processes. Cluster's resource sharing can be used to mimic how a multi-threaded C program shares a single web server socket while executing concurrently on different CPU cores.

In this benchmark, a cluster of Node.js web server processes sharing the same port substitutes for Node's asynchronous execution model as the source of concurrency.

Pros and Cons of Async and Cluster

Async Execution
  Cons:
    • Callback Hell
    • Limits parallelism to shared-nothing tasks
    • Adds latency with parallel overhead:
      • Creating a closure
      • Copy-in/out of all data shared between threads
      • Scheduling work task execution
      • Scheduling callback execution
  Pros:
    • Callbacks help enforce difficult-to-implement synchronization
    • Node.js directly implements fundamental infrastructure like event loops

Cluster Execution
  Cons:
    • Larger memory footprint
    • Long latency to start new processes
  Pros:
    • Simpler, sequential control flow
    • Utilizes multiple cores
    • Allows for over-subscription of cores
    • Improved resiliency against abnormal server exit

Tradeoffs between the callback programming model with asynchronous execution and sequential programming with a cluster execution model.

Equivalency of Callbacks and Multi-threaded Synchronization

Synchronization is an irreducible complexity of all parallel programs, including those with asynchronous execution, even if it amounts to nothing more than a completion notification that makes it possible to determine that the program has terminated. Callback Hell is a consequence of managing parallelism: barriers and mutexes in multi-threaded code and callbacks in asynchronous code are manifestations of the same thing.

In an asynchronous execution model, synchronization occurs when the callback is executed: accidental execution of code after the synchronization point (read: callback) is not possible until the runtime calls the callback.

In multi-threaded or multi-process execution, execution stalls until a synchronization operation is complete. The program does not make progress at all.

Regardless of the language or execution model, the programmer must first make decisions about where synchronization occurs, then manually write code which causes some things to happen before the synchronization point, and other things to happen after the synchronization point.

Node.js Async I/O vs POSIX Asynchronous I/O


This example illustrates the equivalency of a C program using POSIX asynchronous I/O and a JavaScript program using asynchronous callbacks. Both programs write two files and then read them back; the only difference is who implements the completion-test event loop.

aio_write(&aiocb_a);
while (aio_error(&aiocb_a) == EINPROGRESS) sleep(1);
aio_write(&aiocb_b);
while (aio_error(&aiocb_b) == EINPROGRESS) sleep(1);
aio_read(&aiocb_a);
while (aio_error(&aiocb_a) == EINPROGRESS) sleep(1);
aio_read(&aiocb_b);
while (aio_error(&aiocb_b) == EINPROGRESS) sleep(1);
// Work involving a and b
C with POSIX Async I/O
  • Explicitly added by developer
  • Spin loop instead of event loop
  • Synchronization occurs at spin loop
  • Multithreading idiom (infinite loop is ended by another thread)
fs.write(a, function () {
    fs.write(b, function () {
        fs.read(function (a) {
            fs.read(function (b) {
                // Work involving a and b
            })
        })
    })
})
JavaScript with Async I/O
  • Explicitly added by developer
  • Event loop instead of spin loop
  • Synchronization occurs at callback
  • Pyramid of Doom idiom instead of multiple sync points

More Parallelism Does Not Always Require More Code

Node.js enforces sequential consistency of asynchronous operations by executing asynchronous work sequentially, thereby preventing race conditions and order of operation hazards.
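This serialization is easy to observe: callbacks queued on the event loop run one at a time, in the order they were queued, so shared state is never mutated by two callbacks at once (a minimal sketch):

```javascript
// Node executes queued callbacks one at a time, in the order queued;
// two callbacks never run simultaneously, so there is no data race on `log`.
var log = [];
setImmediate(function () { log.push('first'); });
setImmediate(function () { log.push('second'); });
setImmediate(function () { console.log(log.join(',')); }); // prints 'first,second'
```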

One consequence of this serialization is that it becomes impossible to exploit more than two degrees of parallelism (read: use more than two cores). In contrast, merely reordering instructions in the C program exposes additional concurrency.

Sequential Asynchronous Execution

aio_write(&aiocb_a);
// Do work not involving buffer A
while (aio_error(&aiocb_a) == EINPROGRESS) sleep(1);
aio_write(&aiocb_b);
// Do work not involving buffers A and B
while (aio_error(&aiocb_b) == EINPROGRESS) sleep(1);
aio_read(&aiocb_a);
// Do work not involving buffers A and B
while (aio_error(&aiocb_a) == EINPROGRESS) sleep(1);
aio_read(&aiocb_b);
// Do work using buffer A only
while (aio_error(&aiocb_b) == EINPROGRESS) sleep(1);
// Do work using buffers A and B
Sequential Execution of Asynchronous I/O
  • Typical Node.js use case
  • Higher latency than synchronous execution


Parallel Asynchronous Execution

aio_write(&aiocb_a);
aio_write(&aiocb_b);
// Do work not involving buffers A and B
while (aio_error(&aiocb_a) == EINPROGRESS) sleep(1);
aio_read(&aiocb_a);
// Do work not involving buffers A and B
while (aio_error(&aiocb_b) == EINPROGRESS) sleep(1);
aio_read(&aiocb_b);
// Do work not involving buffers A and B
while (aio_error(&aiocb_a) == EINPROGRESS) sleep(1);
// Do work using buffer A only
while (aio_error(&aiocb_b) == EINPROGRESS) sleep(1);
// Do work using buffers A and B
Using Parallelism to Mask Latency
  • A reordering of operations halves execution time... if the I/O can actually be performed in parallel. YMMV.
  • async.parallel() will execute multiple I/O operations with this type of concurrency, but it cannot execute the other code in the callbacks in parallel.

Asynchronous Execution Increases Latency, Parallelism Masks Latency

Parallel execution requires scheduling work, marshaling data, and synchronizing updates (callbacks), all in addition to the original sequential work. Asynchronous processing does not reduce the amount of work to be done or the time it takes; it does allow other work to overlap the time spent waiting for asynchronous tasks to complete.

The overhead of marshaling data, scheduling work, and coordinating a callback adds to the latency to complete an operation.

Unblocking the event loop does not reduce the time it takes to respond to an event; indeed, this example shows that asynchronous execution should be expected to take longer than a synchronous version. Furthermore, when events arrive faster than they can be processed, the asynchronous events are queued for sequential execution.

Benchmarks of Asynchronous I/O vs. Cluster Execution

A rudimentary web server (source code on GitHub) is implemented twice: once in the conventional async callback style, and again with synchronous file I/O, using Node's built-in Cluster module to provide non-blocking concurrency. A GET request to the server performs some work on the requested file and returns the results; POSTed data is appended to the requested file at the path defined by config.dataPath.

There is one client program used to benchmark both server programs. The client generates config.readsPerWrite GET requests for each POST request, receiving a response for each request before proceeding to the next. The config.nFiles files are read or appended to in round-robin order. The time of each operation is measured, and the minimum, maximum, total execution time, and number of operations are logged.

A GET request reads the file, splits its contents into a list on the string config.sortSplitString, sorts that list, concatenates the list back into a string, and returns the sorted result. Choosing a config.sortSplitString not found in the file minimizes the compute work per request; the empty string ('') maximizes it.
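The per-request work can be sketched as follows (a reconstruction from the description above, not the benchmark's actual source; processContents and its parameter names are illustrative):

```javascript
// Sketch of the work a GET request performs: split, sort, rejoin.
// 'sortSplitString' stands in for config.sortSplitString.
function processContents(contents, sortSplitString) {
    return contents
        .split(sortSplitString) // '' splits into single characters: maximum work
        .sort()
        .join(sortSplitString);
}

console.log(processContents('banana', ''));  // prints 'aaabnn'
console.log(processContents('banana', 'x')); // not found: returns 'banana' unchanged
```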

All Benchmarks were performed on a c4.8xlarge 36 virtual core system.

[chart: total execution time]

Total execution time to perform 140,000 GET and 60,000 PUT operations. The overhead of scheduling asynchronous tasks makes the conventional async execution model slower than synchronous execution in practically all cases. A nearly 10x speedup is achieved with only 4 server processes due to the combined benefits of multi-core execution and reduced scheduling overhead.

[chart: async server latency]

Minimum, maximum, and average latencies for the async server, measured by the 32 client processes. Average latency is ~5 seconds for write-append PUTs, and ~6 seconds to read, process, and respond to GET requests.

[chart: cluster server latency]

For the cluster server, average latency is approximately 1/10th of the asynchronous server's, corresponding to the nearly 10x difference in total execution time. Worst-case latency for PUTs is about half that of the asynchronous server.

The cluster server has lower latency than the async server in the average case, and the average case is closer to the minimum latency. Variability of the average operation time is lower for the cluster server.

Determining the Number of Cluster Processes to Use

Because synchronous cluster execution blocks the event loop, cluster execution must use enough processes so there is always at least one idle process ready to accept a new event. Little's Law from queuing theory can be adapted to predict the number of processes needed:

Number of Customers = Arrival Rate × Average Time Per Customer
Concurrency = Latency × Bandwidth
L = λ·W

The average server response time (latency) must be characterized experimentally and will vary with the specifics of the work being executed and the system it is executing on. Determining the number of processes to use is then a matter of multiplying the average task duration by the number of tasks to be performed and dividing by the desired response time.
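As a worked example of L = λ·W (the numbers are illustrative, not measurements from this benchmark): at 100 requests per second with an average latency of 0.25 seconds, roughly 25 concurrent requests are in flight, so about 25 worker processes keep at least one process free to accept new work.

```javascript
// Little's Law: concurrency (L) = arrival rate (lambda) x latency (W).
// The numbers below are illustrative, not measurements from the benchmark.
function workersNeeded(arrivalRatePerSec, avgLatencySec) {
    return Math.ceil(arrivalRatePerSec * avgLatencySec);
}

console.log(workersNeeded(100, 0.25)); // prints 25
```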

Because the OS transparently implements preemptive multi-tasking, the number of processor cores used is independent of the number of software tasks, allowing systems to be under- or over-subscribed. Cluster execution can also dynamically grow or shrink the pool of processes on demand in an ad-hoc fashion.

Complementary Use

Asynchronous I/O and Cluster execution are complementary forms of parallelism and may be mixed arbitrarily, allowing parallelism to be added incrementally with targeted responses to execution bottlenecks.

Conversion to Cluster execution requires the addition of a small wrapper that forks cluster processes; existing program logic is then executed normally by the child processes. The following example is found in the Node.js documentation:

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
    // Fork workers.
    for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
    }

    cluster.on('exit', function(worker, code, signal) {
        console.log('worker ' + worker.process.pid + ' exited');
    });
} else {
    // Workers can share any TCP connection
    // In this case it's an HTTP server
    http.createServer(function(req, res) {
        res.writeHead(200);
        res.end("hello world\n");
    }).listen(8000);
}

Mixed multi-processing and asynchronous execution.

Conclusions

Event-based asynchronous callbacks are unavoidable in Node.js, but "ASYNC ALL THE THINGS!!!!!" is neither necessary nor the best use of machine resources.

Node's event loop is lightweight and libuv is efficient, but the operating system's multitasking facilities they are built on are at least as fast and can also make use of more cores. Developers should consider the reduced complexity and improved performance returned by the small investment of adding Cluster.

The synthetic benchmark used for these experiments is available to experiment with and can be configured to mimic a range of computational and I/O loads. Adding Cluster to an existing application is minimally invasive and because all systems already have multiple cores, multi-processing requires no special equipment or permissions.

|   Download Benchmark from GitHub   |