
Parallel Computing Concepts

University of Pisa

Lecture 10 goals


First: what “parallel” means


Processes vs threads (Python)


What do we measure?


Speedup and efficiency

Let $T_1$ be the time on 1 core and $T_p$ the time on $p$ cores:

$$S(p) = \frac{T_1}{T_p} \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}$$
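To make the definitions concrete, here is a small helper (illustrative only, not part of the lab code) that turns measured wall-clock times into speedup and efficiency; the timings in the example are hypothetical:

def speedup_efficiency(t1: float, tp: float, p: int) -> tuple[float, float]:
    """Return (speedup, efficiency) from serial time t1 and time tp on p cores."""
    s = t1 / tp
    return s, s / p

# Hypothetical timings: 12.0 s serial, 3.5 s on 4 cores
s, e = speedup_efficiency(12.0, 3.5, 4)
print(f"S(4) = {s:.2f}, E(4) = {e:.2f}")   # S(4) ~ 3.43, E(4) ~ 0.86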

Strong scaling

Fix the problem size, increase cores:


Weak scaling

Increase the problem size with $p$, but keep the work per core roughly constant:


Amdahl’s Law (fixed workload)

Let $f$ be the serial fraction (the part of the work that cannot be parallelized).

$$T_p = T_1\left(f + \frac{1-f}{p}\right) \qquad\Rightarrow\qquad S_p = \frac{1}{f + \frac{1-f}{p}}$$

Key consequence:

$$\lim_{p\to\infty} S_p = \frac{1}{f}$$

Amdahl example

If $f = 0.05$ (5% serial):
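A quick way to see what this means in practice is to evaluate the formula for a few core counts (a minimal sketch, not from the lab code):

def amdahl_speedup(f: float, p: int) -> float:
    """Amdahl speedup for serial fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

for p in (1, 2, 4, 8, 16, 64):
    print(p, round(amdahl_speedup(0.05, p), 2))
# The speedup saturates: no matter how many cores, it never exceeds 1/f = 20.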


Gustafson’s Law (scaled workload)

Gustafson’s viewpoint: scale the problem size with the number of processors.

Normalize the time on the parallel system:

$$T_p = 1 = f + (1-f)$$

Hypothetical serial time (same serial part, parallel part un-split):

$$T_1 = f + p(1-f)$$

Using T1T_1 as baseline, the scaled speedup is:

$$S_G(p) = \frac{T_1}{T_p} = f + p(1-f) = p - f(p-1)$$

Interpretation: “How much bigger a problem can I solve in the same time as I add processors?”
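The contrast with Amdahl is easy to see numerically (again a small illustrative sketch, not lab code):

def gustafson_speedup(f: float, p: int) -> float:
    """Gustafson scaled speedup for serial fraction f on p processors."""
    return p - f * (p - 1)

for p in (1, 4, 16, 64):
    print(p, gustafson_speedup(0.05, p))
# Unlike Amdahl's limit of 1/f, the scaled speedup keeps growing roughly linearly in p.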


Estimating the serial fraction from data

Given a measured speedup $S(p) = T_1/T_p$, the Karp–Flatt metric estimates $f$:

$$f(p) = \frac{\frac{1}{S(p)} - \frac{1}{p}}{1 - \frac{1}{p}}$$
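A small helper makes this easy to apply to your own benchmark runs (a sketch; the function name and the measurement below are just for illustration):

def karp_flatt(speedup: float, p: int) -> float:
    """Experimentally determined serial fraction from a measured speedup on p cores."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

# Hypothetical measurement: speedup of 5.0 on 8 cores
print(round(karp_flatt(5.0, 8), 3))   # 0.086 -> roughly 8.6% behaves as serial/overhead

If the estimate grows as you add cores, the limiting factor is more likely overhead (communication, process startup) than a fixed serial fraction.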

Why speedup flattens (common causes)


Designing timing experiments


From numbers to conclusions

When you see:

Next action: profile and identify the dominant bottleneck.


Lab 10 code: Monte Carlo π benchmark

Goal: estimate π by random sampling. Draw points uniformly in the unit square; the fraction that falls inside the quarter circle of radius 1 tends to $\pi/4$, so

$$\pi \approx 4 \cdot \frac{\text{inside}}{\text{samples}}$$
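As a sanity check, a direct serial version of the estimator (a throwaway sketch; the lab code is structured differently, with a worker function and a Pool, as shown below):

import random

rng = random.Random(0)
samples = 1_000_000
inside = sum(1 for _ in range(samples)
             if rng.random()**2 + rng.random()**2 <= 1.0)
print(4 * inside / samples)   # ~ 3.14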

Why it’s a good first parallel example:


How the code is parallelized (high level)

In codes/lab10/python/bench_pi.py:

  1. Choose the number of processes $p$

  2. Split the total number of samples into $p$ chunks

  3. Each process runs the same function on its chunk (no shared state)

  4. Parent sums the partial counts and computes π

This is a classic map → reduce workflow.
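A plausible sketch of the splitting step (steps 1–2 above); the exact helper in bench_pi.py may differ, but `work` and `seeds` here mirror the variables used in the call shown later:

total_samples = 10_000_000   # hypothetical total sample count
p = 4                        # number of worker processes

# Split total_samples into p nearly equal chunks
base, rem = divmod(total_samples, p)
work = [base + (1 if i < rem else 0) for i in range(p)]

# A distinct seed per worker so each process uses an independent RNG stream
seeds = [1000 + i for i in range(p)]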


Serial vs parallel parts (Amdahl in the real world)

Even in a “parallel program”, some work is still serial:

Parallel part:

If the parallel work is too small, overhead dominates and scaling looks bad.
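One way to see this for yourself (an illustrative experiment, not part of the lab code; `busy` is a made-up function): compare a Pool against a plain loop for a tiny and a larger per-task workload. For the tiny case, process startup and data transfer dominate; for the larger one, that overhead is amortized.

import multiprocessing as mp
import time

def busy(n):
    s = 0
    for i in range(n):
        s += i * i
    return s

if __name__ == "__main__":
    for n in (1_000, 5_000_000):        # tiny vs. substantial work per task
        t0 = time.perf_counter()
        with mp.Pool(4) as pool:
            pool.map(busy, [n] * 4)
        t_par = time.perf_counter() - t0

        t0 = time.perf_counter()
        for _ in range(4):
            busy(n)
        t_ser = time.perf_counter() - t0

        print(f"n={n}: serial {t_ser:.3f}s, 4 processes {t_par:.3f}s")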


The worker function: pure computation

Each worker computes a local count and returns it:

import random

def _count_inside_circle(samples: int, seed: int) -> int:
    rng = random.Random(seed)    # private RNG per worker: reproducible and independent
    inside = 0
    for _ in range(samples):
        x = rng.random()
        y = rng.random()
        if x*x + y*y <= 1.0:     # point falls inside the quarter circle
            inside += 1
    return inside

The parallel call: multiprocessing.Pool

The parent process creates a pool and runs workers:

import multiprocessing as mp

with mp.Pool(processes=p) as pool:
    # One task per chunk: each worker receives (samples, seed)
    inside_parts = pool.starmap(_count_inside_circle, zip(work, seeds))
inside = sum(inside_parts)   # reduction in the parent process

What happens conceptually:


Do we need a mutex here?

No. Each worker writes only to its own local variables and returns a value; the parent combines the partial results, so there is no shared mutable state to protect.

Rule of thumb: avoid shared state if you can (it scales better and is simpler).


When you DO need a mutex (lock)

If multiple workers access shared data and at least one of them writes, the accesses must be synchronized with a lock (mutex).

Without a lock you can get race conditions: two workers read the same old value, both update it, and one of the updates is lost.


Lock example: shared counter (safe but slow)

This is not a good scaling strategy, but it illustrates locks:

import multiprocessing as mp

def worker(n, counter, lock):
    for _ in range(n):
        with lock:                # serializes every increment
            counter.value += 1

if __name__ == "__main__":
    counter = mp.Value("i", 0)    # shared integer
    lock = mp.Lock()
    procs = [mp.Process(target=worker, args=(10_000, counter, lock)) for _ in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    print(counter.value)          # 40000, but the lock forces workers to take turns

Better than locks: local work + reduction

Instead of incrementing one shared counter, let each worker keep a private local count and have the parent sum the partial counts at the end (a reduction).

This is exactly what the Monte Carlo π benchmark does.
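For contrast with the lock example, here is the same counter done with local work and a final reduction (illustrative, not lab code):

import multiprocessing as mp

def count_local(n):
    local = 0
    for _ in range(n):
        local += 1      # private variable: no lock needed
    return local

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        parts = pool.map(count_local, [10_000] * 4)
    print(sum(parts))   # 40000, same result as the locked version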


Multiprocessing patterns you’ll reuse

  1. Data parallelism: apply the same function to many items (map)

  2. Task parallelism: different tasks/processes coordinated via queues

  3. Pipelines: producer/consumer stages (often with Queue)

We’ll show small examples of each.


Example 1: simplest Pool.map

import multiprocessing as mp

def f(x):  # must be top-level for pickling
    return x * x

if __name__ == "__main__":   # guard required with the spawn start method (Windows/macOS)
    with mp.Pool(processes=4) as pool:
        ys = pool.map(f, [1, 2, 3, 4, 5])
    print(ys)   # [1, 4, 9, 16, 25]

Notes:


Example 2: multiple arguments with starmap

import multiprocessing as mp

def add(a, b):
    return a + b

if __name__ == "__main__":
    items = [(1, 10), (2, 20), (3, 30)]
    with mp.Pool(3) as pool:
        out = pool.starmap(add, items)   # each tuple is unpacked into add(a, b)
    print(out)   # [11, 22, 33]

This is the same API used by the Lab 10 π benchmark.


Example 3: Process + Queue (producer/consumer)

import multiprocessing as mp

def worker(q_in, q_out):
    # Consume items until the sentinel value None arrives
    for x in iter(q_in.get, None):
        q_out.put(x * x)

if __name__ == "__main__":
    q_in, q_out = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(q_in, q_out))
    p.start()
    for x in [1, 2, 3]:
        q_in.put(x)
    q_in.put(None)  # sentinel: stop
    p.join()
    print([q_out.get() for _ in range(3)])   # [1, 4, 9]

Use this when:


Practical “first parallel run” checklist


Lab 10 preview

You will: