Potential Parallel Speedup


The wall clock time required to execute a parallel program can be predicted by Amdahl's Law, which divides execution time into serial and parallel regions and scales the parallel region by the number of processors used. Concurrency may be used as a substitute for sequential performance improvements, and it is required whenever application performance must exceed what sequential execution on a single core can deliver.

$$\text{Speedup} = \frac{T_{serial}}{T_{parallel}} = \frac{100\%}{\text{Serial}\% + \dfrac{100\% - \text{Serial}\%}{\#\text{Processors}}}$$
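As a minimal sketch of the formula (the function name and the percentage convention are illustrative, not part of the original model):

    def amdahl_speedup(serial_pct, processors):
        """Predicted speedup from Amdahl's Law.

        serial_pct: percentage of execution time that is strictly serial (0-100).
        processors: number of processors applied to the parallel region.
        """
        parallel_pct = 100.0 - serial_pct
        return 100.0 / (serial_pct + parallel_pct / processors)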

Amdahl's Law shows ideal scaling and should be interpreted as a "performance not to exceed" bound: real systems have bandwidth, latency, and processing-speed bottlenecks, as well as algorithmic limitations, that prevent unlimited perfect scaling.

The laws of parallelism govern scaling regardless of programming or execution model: Map-Reduce, message passing, multithreading, vectorization, and instruction-level parallelism, in any amount or combination, are ultimately limited by serial execution. A 1% serial program (99% parallel) can never be sped up by more than 100×, no matter how many processors are used.
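A one-line check of that ceiling: with unbounded processors the parallel term vanishes and the serial 1% alone sets the bound,

$$\lim_{\#\text{Processors} \to \infty} \frac{100\%}{1\% + \dfrac{99\%}{\#\text{Processors}}} = \frac{100\%}{1\%} = 100\times$$

and the approach to that ceiling is slow: using the sketch above, amdahl_speedup(1, 100) gives roughly 50×, and amdahl_speedup(1, 1000) only about 91×.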


Tuning vs. Architecture vs. Parallel Speedup

Consider the performance limitations of a hypothetical application with a given strictly serial fraction, executing on two systems: System A and System B, each with a configurable number of cores. If a processor implements both "big" and "little" cores, the individual little cores run at some fraction of the sequential rate of the "big" cores.
Additionally, consider tuning the code itself, speeding up the serial region and the parallel region by separate factors.
Paradoxically, less efficient code in the parallel region gives the appearance that the program scales better, because parallelism can be used to compensate for inefficiency. Parallel speedups are not free, and users typically expect a 10× speedup over sequential execution time before accepting the additional complexity that comes with vectorization, multithreading, FPGAs, GPUs, message passing, Map-Reduce, etc.
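A rough sketch of how these knobs interact (the function and parameter names are illustrative, mirroring the "serial region tuning speedup" and "parallel region tuning speedup" inputs below):

    def tuned_speedup(serial_pct, processors, serial_tune=1.0, parallel_tune=1.0):
        """Speedup relative to the original, untuned sequential program.

        serial_tune and parallel_tune are the tuning speedups applied to the
        serial and parallel regions; processors scales the parallel region.
        """
        parallel_pct = 100.0 - serial_pct
        t_parallel = (serial_pct / serial_tune
                      + parallel_pct / parallel_tune / processors)
        return 100.0 / t_parallel

Passing parallel_tune < 1 (less efficient parallel code) lowers the absolute speedup over the original program, even though the program appears to scale better against its own slower sequential baseline.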

Different Perspectives of Parallel Speedup
A program with a serial region occupying some fraction of the total execution time will also execute the complementary fraction of parallel work. The same relationship can then be read from several perspectives: the speedup attainable for a given serial fraction, the speedup delivered by a fixed number of System A or System B cores, or the number of cores required to achieve a target speedup.
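For the last perspective, Amdahl's Law can be inverted to give the core count needed for a target speedup; a minimal sketch (illustrative names, same percentage convention as above):

    def processors_needed(serial_pct, target_speedup):
        """Cores required to reach target_speedup under Amdahl's Law.

        Returns None when the target exceeds the 100 / serial_pct ceiling.
        """
        denom = 100.0 / target_speedup - serial_pct
        if denom <= 0:
            return None  # unreachable: the serial region alone forbids it
        return (100.0 - serial_pct) / denom

For example, a program that is 1% serial needs 99 cores for a 50× speedup, and no number of cores is enough for 150×.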

Adjustable model parameters (interactive controls):
Programming & Execution: strictly sequential % of the application, serial region tuning speedup, parallel region tuning speedup, execution time of the serial region, average data operation size.
System A / System B: number of cores, relative "little" core performance, maximum sequential rate (ops/sec), data bandwidth, data latency.

Example system configurations to use as a starting point
Single core, sequential execution.
Multi-core parallel execution. Dual-socket x86.
Many-core, all "little" cores.
Many-core, Big-Little.
CPU-GPU hybrid parallel execution.
MemCacheD serial execution.
MPI parallel execution.
Map-Reduce parallel execution.


Little's Law and Execution Bottlenecks

The execution bottleneck of a program has only one of three possible root causes: data bandwidth, processor speed, or data latency. The central premise of parallelism is that concurrency can be substituted for sequential speed, allowing the relationship between the three factors to be expressed as Concurrency = Latency × Bandwidth (L = λ·W), a restatement of Little's Law from queuing theory.
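A back-of-the-envelope application of Little's Law (the figures below are hypothetical, chosen only for illustration): to keep a memory system busy, the concurrency in flight must equal bandwidth times latency.

    # Little's Law: concurrency = bandwidth * latency
    # Hypothetical figures: 50 GB/s sustained bandwidth, 100 ns latency,
    # 64-byte cache-line transfers.
    bandwidth_bytes_per_s = 50e9
    latency_s = 100e-9
    line_bytes = 64

    bytes_in_flight = bandwidth_bytes_per_s * latency_s    # 5000 bytes
    outstanding_requests = bytes_in_flight / line_bytes    # ~78 requests

    print(f"{outstanding_requests:.0f} outstanding cache-line requests "
          "are needed to saturate the link")

If the hardware or the program cannot sustain that many concurrent requests, the achieved bandwidth drops proportionally.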
Because sequential execution uses only one resource at a time (bandwidth is being used to transmit data while the processor stalls, or the processor is operating on the data while bandwidth is unused), strictly sequential execution can never utilize all the bandwidth or processor capacity. Indeed, this is the fundamental motivation for a memory hierarchy.
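One way to state this, with notation introduced here for illustration: if each datum needs a transfer time t_xfer on the wire and a compute time t_compute in the processor, then under strictly sequential execution

$$\text{Bandwidth utilization} = \frac{t_{xfer}}{t_{xfer} + t_{compute}}, \qquad \text{Processor utilization} = \frac{t_{compute}}{t_{xfer} + t_{compute}}$$

The two utilizations sum to one, so neither resource can be saturated unless the other sits completely idle.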


Task Granularity and Per-Task Overheads

In practice, parallelism is implemented as a finite number of sequential tasks that execute concurrently. The consequences of this become important when the number of processors is large relative to the number of tasks, or the per-task overhead is large relative to the work performed by each task.

$$\text{Speedup} = \frac{T_{serial}}{T_{parallel}} = \frac{T_{serial} + \#\text{tasks} \cdot T_{task}}{T_{serial} + \dfrac{\#\text{tasks} \cdot (T_{overhead} + T_{task})}{\min(\#\text{processors}, \#\text{tasks})}}$$

If the algorithm and dataset expose a fixed number of parallel tasks, each executing for a given amount of time, and starting a task incurs a fixed overhead, the expression above bounds the maximum possible speedup on System A or System B. It bears repeating that because additional parallelism can compensate for inefficiency, less efficient code improves scaling and speedup, but wall clock execution time is longer.
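A sketch of the granularity-aware formula (illustrative names; the inputs correspond to the model parameters above, with hypothetical values in the example):

    def granular_speedup(t_serial, n_tasks, t_task, t_overhead, processors):
        """Speedup when the parallel work is split into n_tasks sequential
        tasks, each paying t_overhead of start-up cost (same time units).
        """
        t_sequential = t_serial + n_tasks * t_task
        t_parallel = (t_serial
                      + n_tasks * (t_overhead + t_task) / min(processors, n_tasks))
        return t_sequential / t_parallel

    # Hypothetical inputs: 1 unit of serial work, 64 tasks of 0.1 units each,
    # 0.01 units of start-up overhead per task, 256 processors available.
    print(granular_speedup(1.0, 64, 0.1, 0.01, 256))   # ~6.7x, capped by 64 tasks

Because min(processors, tasks) caps the divisor, cores beyond the number of tasks contribute nothing; exposing more, finer-grained tasks helps only until the per-task overhead dominates.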


Many More Many-Cores

Parallel programming is still not mainstream despite the ubiquity of multicore processors, meaning the demand for more cores is not driven by software. Instead, the decision to build processors with a larger number of smaller cores is dictated by power: dynamic power follows P = C·V²·f, and because supply voltage must rise with clock frequency, power grows superlinearly with clock speed, so a small reduction in clock speed makes possible a large increase in the number of cores. The coup de grâce is Moore's Law, which provides the transistor road map to build still more cores.
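A rough illustration, under the simplifying assumption that supply voltage scales in proportion to frequency (only approximately true in practice): dynamic power then grows with the cube of the clock,

$$P \propto C \cdot V^2 \cdot f \propto f^3, \qquad \frac{P(0.8f)}{P(f)} = 0.8^3 \approx 0.51$$

so a 20% reduction in clock speed roughly halves per-core power, leaving room for about twice as many cores in the same power envelope.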

Tuning not only helps sequential execution; it also multiplies the speedup delivered by parallelism. For this reason it is difficult to overstate the importance of efficient software. This was understood before the first computer was ever built:

As soon as an Analytical Engine exists, it will necessarily guide the future course of the science. Whenever any result is sought by its aid, the question will then arise — by what course of calculation can these results be arrived at by the machine in the shortest time?

"Of the Analytical Engine", chapter 8, The Life of a Philosopher, Charles Babbage (1864)

