Parallelism and Conservation of Complexity
Seattle Node.js | 2014-06-25
http://synsem.com/SeaNode-2014-06-25


Jace A Mogill
Synthetic Semantics - mogill@synsem.com
 
 

Why Hasn't Parallel Programming Happened Yet?

Why isn't there an API or library or compiler feature or magic?

  • Hardware vendors need you to want more cores... but don't make software.
  • Compiler authors want to optimize for parallel platforms... but static analysis is insufficient.
  • VM and OS developers can add hooks for parallelism... but can't make applications use them.
  • IT managers want to better utilize the hardware they have... but don't choose the applications.
  • Users want their computers to go faster... but they don't want to change anything.

Why hasn't anyone else hidden the complexity?
...Because only the application designer can deal with application parallelism.

The Law of Conservation of Complexity


Tesler's Law

Every application must have an inherent amount of irreducible complexity. The only question is who will have to deal with it.


If you use...             | But you care about...         | Sooner or later you must deal with...
Garbage Collected Memory  | Memory Leaks & GC Stalls      | Garbage collectors and/or explicitly allocating and freeing memory
TCP/IP or IB              | Network Latency and Bandwidth | Packet size, TTL, DNS, etc.
Compilers or Interpreters | Correctness                   | Language specifications and corner cases
Sequential Execution      | Scalability and Performance   | Parallel Programming

Complexity Conserved: Callback Hell & Async I/O

Callback Hell is a manifestation of multithreaded synchronization.
These two programs are equivalent:
aio_write(&aiocb_a);
while (aio_error(&aiocb_a) == EINPROGRESS)  sleep();
aio_write(&aiocb_b);
while (aio_error(&aiocb_b) == EINPROGRESS)  sleep();
aio_read(&aiocb_a);
while (aio_error(&aiocb_a) == EINPROGRESS)  sleep();
aio_read(&aiocb_b);
while (aio_error(&aiocb_b) == EINPROGRESS)  sleep();
C with POSIX Async I/O
  • Synchronization occurs at spin loop
  • Explicitly added by developer
  • Multithreading idiom (infinite loop is broken by another thread)
  • Spin loop instead of event loop
fs.write(a, function () {
  fs.write(b, function () {
    fs.read(function (a) {
      fs.read(function (b) {
          // Work involving a and b
      })
    })
  })
})
JavaScript with Async I/O
  • Synchronization occurs at callback
  • Explicitly added by developer
  • Pyramid of Doom idiom instead of multiple sync points
  • Event loop instead of spin loop

Sequential Async I/O vs. Parallel I/O


aio_write(&aiocb_a);
AIO_WAIT(aiocb_a);
aio_write(&aiocb_b);
AIO_WAIT(aiocb_b);
aio_read(&aiocb_a);
AIO_WAIT(aiocb_a);
aio_read(&aiocb_b);
AIO_WAIT(aiocb_b);
Sequential Execution of Asynchronous I/O
  • This is how Node.js typically executes
  • Higher latency than synchronous execution



aio_write(&aiocb_a);
aio_write(&aiocb_b);
AIO_WAIT(aiocb_a);
aio_read(&aiocb_a);
AIO_WAIT(aiocb_b);
aio_read(&aiocb_b);
AIO_WAIT(aiocb_a);
AIO_WAIT(aiocb_b);
Using Parallelism to Mask Latency
  • A simple reordering of operations halves execution time... if the I/O can actually be performed in parallel. YMMV.
  • JavaScript requires child processes for true (compute) parallelism
  • Parallel JS with async I/O is Callback Hell plus messaging (see the sketch below)
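
A minimal sketch of the same reordering within a single Node.js process (the file names and payloads below are made up): both writes are issued immediately, each read starts as soon as its own write finishes, and a completion counter acts as the final join, mirroring the reordered AIO_WAITs above. The I/O requests can overlap, but the JavaScript callbacks still share one event loop; compute parallelism would still require child processes and messaging.

var fs = require('fs');

var a, b;         // results of the two reads
var pending = 2;  // read callbacks still outstanding

function joined() {
  if (--pending === 0) {
    // Work involving a and b
  }
}

fs.writeFile('a.dat', 'payload A', function (err) {
  if (err) throw err;
  fs.readFile('a.dat', function (err, data) {
    if (err) throw err;
    a = data;
    joined();
  });
});

fs.writeFile('b.dat', 'payload B', function (err) {
  if (err) throw err;
  fs.readFile('b.dat', function (err, data) {
    if (err) throw err;
    b = data;
    joined();
  });
});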

Node.js Programming and Execution Models

Contrary Notions of Strong & Weak Scaling

 
                      | Strong Scaling                                      | Weak Scaling
Scaling Goal          | Solve the same problem, only faster                 | Solve a bigger problem in the same amount of time
Problem Size          | Stays constant while the number of processors grows | Grows with the number of processors
Scaling is limited by | Inter-process communication                         | Problem size
Resiliency            | A single failure causes the entire job to fail; SLAs achieved through efficient checkpoint-restart | Failed sub-tasks are detected and retried; SLAs achieved through fault resiliency
Programming Models    | MPI, GASNet, Chapel, X10, Co-Array Fortran, UPC     | Batch jobs, Map-Reduce

Amdahl's Law

The speedup of a parallel program, and with it the wall-clock time required to execute it, can be predicted by Amdahl's Law:

$${Speedup} = \frac{{T_{serial}}}{{T_{parallel}}} = { \frac{100\%}{{Serial\%+{ \frac{100\% - Serial\%}{\#Processors}}} } }$$
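
A worked example with illustrative numbers: a program that spends 10% of its time in unavoidably serial code, run on 8 processors, can speed up by at most

$${Speedup} = \frac{100\%}{10\% + \frac{100\% - 10\%}{8}} \approx 4.7\times$$

and no number of processors can push it past 10×, the reciprocal of the serial fraction.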

The Laws of the Land

The Node Commandment
Thou shalt not block thy event loop

Tesler's Law: Conservation of complexity
Amdahl's Law: Maximum speedup of a parallel program
$${Speedup} = \frac{{T_{serial}}}{{T_{parallel}}} = { \frac{100\%}{{Serial\%+{ \frac{100\% - Serial\%}{\#Processors}}} } }$$
Little's Law: Amount of concurrency needed to mask latency (L = λ·W)
$${Concurrency} = {Latency} \times {Bandwidth}$$
Backus' Law: The primary concern of programming is finding, marshaling, and synchronizing data through the von Neumann Bottleneck


Backus' Law: von Neumann Bottleneck

Can Programming Be Liberated from the von Neumann Style?
A Functional Style and Its Algebra of Programs

Surely there must be a less primitive way of making big changes in the store than by pushing vast numbers of words back and forth through the von Neumann bottleneck. Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself but where to find it.


John Backus, IBM Research Laboratory, San Jose
Communications of the ACM, August 1978, Volume 21, Number 8

Amdahl's Law & Little's Law

Restating Little's Law from queuing theory in terms of system resources gives:
Concurrency = Latency × Bandwidth (L = λ·W).
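
A worked example with illustrative numbers: to keep a storage device with 200 µs access latency busy at 50,000 operations per second, an application must maintain

$${Concurrency} = 200\,\mu s \times 50{,}000\ ops/s = 10$$

operations in flight at all times; with fewer outstanding requests the available bandwidth simply cannot be reached.
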
Programming & Execution
Strictly sequential % of application
Serial region tuning speedup
Parallel region tuning speedup
Execution time of serial region
Average data operation size








System A System B
Number of cores
Relative "Little" core performance
Maximum sequential rate (ops/sec)
Data bandwidth
Data latency
If the algorithm and dataset expose degrees of parallelism that execute for a total of ( per task), and the overhead for starting each task is , the maximum possible speedup on System A with cores is , or on a System B with cores.
Different Perspectives of Parallel Speedup
[Interactive example: relates a program's serial fraction and parallel work to the speedup reachable, and the number of System A cores needed to achieve a target speedup, per Amdahl's Law above.]

Confusion Makes Everything Harder

Parallel programming only gets harder with bad tools

A carpenter who is thinking about saws isn't thinking about building better houses.
A carpenter who is thinking about saws has a dull tool.

npm async does not behave as documented
series() and parallel() execute identically; defenders of this behavior must also be confused about the cost-benefit of the extra effort.
Redis and memcached have very weak synchronization models
The documentation conflates Check-and-Set with Compare-and-Swap (CAS), but what both actually implement is Load-Link/Store-Conditional (LL-SC).

LL-SC is a poor API choice:
- The lowest-level interface possible -- nothing is more tedious or error-prone
- Requires every program to implement its own data coherency scheme (see the retry-loop sketch below)
- Optimistic locking performance degrades quickly under contention, and the longer the "lock" is held the worse it gets
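
For example, Redis's WATCH/MULTI/EXEC, which its documentation describes as check-and-set, behaves like LL-SC: the transaction aborts if the watched key was written at all, so every client must wrap its own retry loop around each read-modify-write. A minimal sketch, assuming the node_redis client and that an aborted transaction reports a null reply set; the key name and increment logic are made up for illustration:

var redis = require('redis');
var client = redis.createClient();

function atomicIncrement(key, done) {
  client.watch(key, function (err) {           // "load-link" the key
    if (err) return done(err);
    client.get(key, function (err, value) {
      if (err) return done(err);
      var next = (parseInt(value, 10) || 0) + 1;
      client.multi()
        .set(key, next)                        // "store-conditional"
        .exec(function (err, replies) {
          if (err) return done(err);
          if (replies === null) {              // watched key changed:
            return atomicIncrement(key, done); // retry the whole transaction
          }
          done(null, next);
        });
    });
  });
}

atomicIncrement('counter', function (err, value) {
  if (err) throw err;
  console.log('counter is now', value);
});

Under contention, every conflicting writer sends the others back around the loop, which is exactly the degradation noted above.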

Extended Memory Semantics (EMS)

Consider using tools based on proven Supercomputer architectures:

Extended Memory Semantics (EMS)
- Unifies synchronization, communication, and storage
- Transactional memory and fine-grained synchronization (conceptual sketch below)
- Barriers, reductions, maps, parallel collective operations
- Bulk Synchronous Parallel, Fork-Join, and user parallelism
- Freely mix any number or kind of applications sharing data
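
A conceptual, single-process sketch of the full/empty-bit synchronization that EMS-style fine-grained synchronization builds on. This is not the EMS API (which shares memory between processes); the names here are made up: readFE() waits, by queuing its callback, until the cell is full, then empties it; writeEF() waits for the cell to be empty, then fills it and wakes one reader.

function FullEmptyCell() {
  this.full = false;
  this.value = undefined;
  this.readers = [];   // callbacks waiting for the cell to become full
  this.writers = [];   // pending values waiting for the cell to empty
}

// Read when full, atomically marking the cell empty again.
FullEmptyCell.prototype.readFE = function (cb) {
  if (this.full) {
    this.full = false;
    var v = this.value;
    // Let a queued writer refill the now-empty cell
    if (this.writers.length > 0) this.writeEF(this.writers.shift());
    cb(v);
  } else {
    this.readers.push(cb);
  }
};

// Write when empty, marking the cell full and waking one queued reader.
FullEmptyCell.prototype.writeEF = function (value) {
  if (!this.full) {
    this.full = true;
    this.value = value;
    if (this.readers.length > 0) this.readFE(this.readers.shift());
  } else {
    this.writers.push(value);
  }
};

// Producer/consumer example: the read callback runs only after the write.
var cell = new FullEmptyCell();
cell.readFE(function (v) { console.log('consumed', v); });
cell.writeEF(42);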

Conclusions

Conservation of Complexity:

Callback Hell is a consequence of parallelism --
      Multithreading and asynchronous callbacks are the same thing.

Using single threaded execution models to hide the complexities of multicore hardware does not remove the complexity of finding, marshaling, and synchronizing data in a parallel application.

The complexity never goes away.
It is your job to deal with it.

Thanks!

http://synsem.com/SeaNode-2014-06-25
mogill@synsem.com
 
