Lecture 10 goals¶
Define speedup and efficiency
Contrast strong vs weak scaling
Use Amdahl's and Gustafson's laws to reason about scaling limits
Read timing data and diagnose bottlenecks
First: what “parallel” means¶
Parallelism: doing multiple computations at the same time on multiple CPU cores
Concurrency: multiple tasks in progress, possibly time-sliced on one core
A common pattern:
split work into independent tasks
run tasks on workers (cores/processes)
combine partial results (a “reduction”)
Processes vs threads (Python)¶
Threads share memory, but Python’s GIL usually prevents CPU-bound code from running in parallel
Processes have separate memory and can use multiple cores for CPU-bound work
In this course we use multiprocessing (process-based parallelism)
What do we measure?¶
Wall time (elapsed time): what users care about
CPU time: total compute time across cores (a measurement sketch follows this list)
Always specify:
problem size (input/work)
hardware (CPU model, cores, memory)
software (compiler/interpreter versions, flags)
measurement method (single run vs min/mean of many)
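Both clocks are available in the Python standard library; a minimal sketch (the summation is just a stand-in workload):

    import time

    t_wall = time.perf_counter()      # wall-clock (elapsed) time
    t_cpu = time.process_time()       # CPU time of this process (child processes not included)

    total = sum(i * i for i in range(10_000_000))   # stand-in workload

    print("wall time:", time.perf_counter() - t_wall)
    print("CPU time: ", time.process_time() - t_cpu)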
Speedup and efficiency¶
Let $T_1$ be the time on 1 core and $T_p$ the time on $p$ cores:
$$S(p) = \frac{T_1}{T_p} \qquad E(p) = \frac{S(p)}{p}$$
Ideal: $S(p) = p$, $E(p) = 1$
In practice: overheads + resource limits reduce $E(p)$
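The definitions translate directly into two helper functions; the timings in the comment are made-up numbers for illustration only:

    def speedup(t1: float, tp: float) -> float:
        """S(p) = T1 / Tp."""
        return t1 / tp

    def efficiency(t1: float, tp: float, p: int) -> float:
        """E(p) = S(p) / p."""
        return speedup(t1, tp) / p

    # Made-up illustration: T1 = 12.0 s, T4 = 3.75 s
    #   speedup(12.0, 3.75)       -> 3.2
    #   efficiency(12.0, 3.75, 4) -> 0.8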
Strong scaling¶
Fix the problem size and increase the number of cores $p$
Goal: reduce time-to-solution
Typical plot: $T_p$ vs $p$, or $S(p)$ vs $p$
Pain points show up quickly:
synchronization/communication
memory bandwidth
load imbalance
Weak scaling¶
Increase the problem size with $p$ so that the work per core stays roughly constant (e.g., $N \propto p$)
Goal: keep time roughly constant while scaling problem size
Typical plot: $T_p$ vs $p$ for the scaled workload
Weak scaling is often the relevant metric in HPC production runs
Amdahl’s Law (fixed workload)¶
Let $s$ be the serial fraction of the work (the part that cannot be parallelized). With the rest split perfectly across $p$ cores:
$$S(p) = \frac{1}{s + \frac{1-s}{p}}$$
Key consequence: $S(p) \le \dfrac{1}{s}$ no matter how many cores you add.
Amdahl example¶
If $s = 0.05$ (5% serial): $S(p) = \dfrac{1}{0.05 + 0.95/p}$
Upper bound: $S(p) \to 1/0.05 = 20$ as $p \to \infty$. Interpretation: after some point, more cores buy less.
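A quick way to see the flattening is to tabulate Amdahl's prediction; a small sketch in plain Python (not part of the lab code):

    def amdahl_speedup(s: float, p: int) -> float:
        """Amdahl prediction for serial fraction s on p cores."""
        return 1.0 / (s + (1.0 - s) / p)

    for p in (2, 4, 8, 16, 64, 1024):
        print(f"p={p:5d}  S(p)={amdahl_speedup(0.05, p):5.1f}")
    # p=2 -> 1.9, p=8 -> 5.9, p=16 -> 9.1, p=64 -> 15.4, p=1024 -> 19.6 (bound: 20)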
Gustafson’s Law (scaled workload)¶
Gustafson’s viewpoint: scale the problem size with the number of processors.
Normalize the time on the parallel system: $T_p = s + (1 - s) = 1$, where $s$ is the fraction of that time spent in the serial part
Hypothetical serial time (same serial part, parallel part un-split): $T_1 = s + p\,(1 - s)$
Using $T_p = 1$ as baseline, the scaled speedup is:
$$S_{\text{scaled}}(p) = \frac{T_1}{T_p} = s + p\,(1 - s) = p - (p - 1)\,s$$
Interpretation: “How much bigger a problem can I solve in the same time as I add processors?”
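For contrast with the Amdahl table above, the same serial fraction under Gustafson's scaled-workload assumption; again a sketch, not lab code:

    def gustafson_speedup(s: float, p: int) -> float:
        """Scaled speedup when the parallel part grows with p."""
        return p - (p - 1) * s

    for p in (2, 4, 8, 16, 64):
        print(f"p={p:3d}  S_scaled(p)={gustafson_speedup(0.05, p):6.2f}")
    # 1.95, 3.85, 7.65, 15.25, 60.85 (nearly linear growth)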
Estimating the serial fraction from data¶
Given the measured speedup $S(p)$ on $p$ cores, the Karp–Flatt metric estimates the serial fraction $e$:
$$e = \frac{1/S(p) - 1/p}{1 - 1/p}$$
If $e$ is roughly constant across $p$ → Amdahl model fits well
If $e$ grows with $p$ → overheads/limits dominate
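A small helper you could apply to your own timing data; t1 and tp are measured wall times, and the example numbers are illustrative:

    def karp_flatt(t1: float, tp: float, p: int) -> float:
        """Experimentally determined serial fraction from measured wall times."""
        s_measured = t1 / tp                     # measured speedup S(p)
        return (1.0 / s_measured - 1.0 / p) / (1.0 - 1.0 / p)

    # Illustration: T1 = 10.0 s, T8 = 1.6 s on 8 cores
    print(karp_flatt(10.0, 1.6, 8))   # ~0.04, i.e. about 4% serial-equivalent work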
Why speedup flattens (common causes)¶
Load imbalance: some workers finish early and wait
Parallel overhead: task creation, scheduling, communication
Synchronization: locks, barriers, reductions
Memory bandwidth: more cores, same memory channels
I/O: shared filesystem, serialization on output
Designing timing experiments¶
Warm up and repeat; report min/median over multiple trials (see the sketch after this list)
Control the environment:
pin threads/processes when possible
avoid background load
fix frequency scaling if relevant
Measure what matters:
end-to-end wall time
plus breakdowns from profiling (where time is spent)
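A minimal timing harness along these lines, using only the standard library; run_benchmark is a hypothetical placeholder for the code under test:

    import statistics
    import time

    def time_trials(fn, repeats: int = 5):
        """Call fn() several times; return (min, median) wall time in seconds."""
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn()
            times.append(time.perf_counter() - t0)
        return min(times), statistics.median(times)

    # best, typical = time_trials(run_benchmark)   # run_benchmark is a hypothetical workload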
From numbers to conclusions¶
When you see:
$E(p)$ dropping early → overheads or bandwidth limits
$S(p)$ stops improving → serial fraction or contention
big variance run-to-run → noise, scheduling, or unstable workload
Next action: profile and identify the dominant bottleneck.
Lab 10 code: Monte Carlo π benchmark¶
Goal: estimate π by random sampling.
Draw random points in the unit square
Count how many fall inside the quarter circle
Estimate: $\pi \approx 4 \cdot \dfrac{\text{points inside}}{\text{total points}}$
Why it’s a good first parallel example:
each sample is independent (embarrassingly parallel)
the only “combine step” is summing counts (reduction)
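For reference, a minimal serial estimator (a sketch, not the lab file itself):

    import random

    def estimate_pi(samples: int, seed: int = 0) -> float:
        rng = random.Random(seed)
        inside = 0
        for _ in range(samples):
            x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:          # point lands inside the quarter circle
                inside += 1
        return 4.0 * inside / samples         # area ratio times 4

    print(estimate_pi(1_000_000))             # ~3.14; error shrinks as samples grow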
How the code is parallelized (high level)¶
In codes/lab10/python/bench_pi.py:
Choose the number of processes $p$
Split the total number of samples into $p$ chunks (one possible split is sketched below)
Each process runs the same function on its chunk (no shared state)
Parent sums the partial counts and computes π
This is a classic map → reduce workflow.
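The split can be as simple as this sketch; the chunk sizes and seed scheme are illustrative assumptions, not necessarily what bench_pi.py does:

    def split_work(total_samples: int, p: int):
        """Split total_samples into p near-equal chunks, one RNG seed per chunk."""
        base, extra = divmod(total_samples, p)
        work = [base + (1 if i < extra else 0) for i in range(p)]
        seeds = [12345 + i for i in range(p)]    # distinct, reproducible seeds
        return work, seeds

    # work, seeds = split_work(10_000_000, 4)    # work = [2500000, 2500000, 2500000, 2500000]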
Serial vs parallel parts (Amdahl in the real world)¶
Even in a “parallel program”, some work is still serial:
argument parsing, input validation, printing results
creating worker processes / starting the pool
distributing tasks to workers
combining results (sum/reduction)
Parallel part:
the loop that generates random points and counts “inside”
If the parallel work is too small, overhead dominates and scaling looks bad.
The worker function: pure computation¶
Each worker computes a local count and returns it:
    import random

    def _count_inside_circle(samples: int, seed: int) -> int:
        rng = random.Random(seed)         # independent, seeded RNG per worker
        inside = 0
        for _ in range(samples):
            x = rng.random()
            y = rng.random()
            if x*x + y*y <= 1.0:          # point falls inside the quarter circle
                inside += 1
        return inside

No shared variables
Deterministic RNG per worker via a seed
Returns a single integer (easy to combine)
The parallel call: multiprocessing.Pool¶
The parent process creates a pool and runs workers:
    with mp.Pool(processes=p) as pool:
        inside_parts = pool.starmap(_count_inside_circle, zip(work, seeds))
    inside = sum(inside_parts)

What happens conceptually:
Pool starts $p$ worker processes
starmap sends the arguments to the workers (serialization / pickling)
workers run independently and return their results
parent collects the results and sums them
Do we need a mutex here?¶
No.
A mutex/lock is needed when multiple workers update the same shared data.
In this benchmark, workers do not share memory:
each returns a local count
the parent combines results at the end
Rule of thumb: avoid shared state if you can (it scales better and is simpler).
When you DO need a mutex (lock)¶
If multiple workers access shared data and at least one writes:
incrementing a shared counter
updating a shared dictionary/list
writing to the same file
Without a lock you can get race conditions (a small demo follows this list):
lost updates
corrupted data
non-deterministic results (sometimes “works”, sometimes not)
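A minimal sketch of a lost-update race; note the lock=False, which disables the lock that multiprocessing.Value normally wraps around the shared value:

    import multiprocessing as mp

    def bump(counter, n):
        for _ in range(n):
            counter.value += 1                   # read-modify-write, not atomic

    if __name__ == "__main__":
        counter = mp.Value("i", 0, lock=False)   # unsynchronized shared integer
        procs = [mp.Process(target=bump, args=(counter, 100_000)) for _ in range(4)]
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()
        print(counter.value)                     # usually well below 400000: lost updates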
Lock example: shared counter (safe but slow)¶
This is not a good scaling strategy, but it illustrates locks:
    import multiprocessing as mp

    def worker(n, counter, lock):
        for _ in range(n):
            with lock:                    # protect the shared increment
                counter.value += 1

    if __name__ == "__main__":
        counter = mp.Value("i", 0)        # shared integer
        lock = mp.Lock()
        procs = [mp.Process(target=worker, args=(10_000, counter, lock)) for _ in range(4)]
        for p in procs: p.start()
        for p in procs: p.join()
        print(counter.value)              # 40000

Correct because the increments are protected by the lock
Often slow because the lock becomes a bottleneck (it serializes the updates)
Better than locks: local work + reduction¶
Instead of incrementing one shared counter:
each worker counts locally
returns its local result
the parent sums at the end
This is exactly what the Monte Carlo π benchmark does; a lock-free rewrite of the counter example is sketched below.
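For comparison with the lock example, a sketch of the same counter computed as local work plus a reduction (no shared state, no lock):

    import multiprocessing as mp

    def count_local(n: int) -> int:
        """Count privately; no shared state, nothing to lock."""
        total = 0
        for _ in range(n):
            total += 1
        return total

    if __name__ == "__main__":
        with mp.Pool(processes=4) as pool:
            parts = pool.map(count_local, [100_000] * 4)
        print(sum(parts))                 # 400000, no lock needed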
Multiprocessing patterns you’ll reuse¶
Data parallelism: apply the same function to many items (map)
Task parallelism: different tasks/processes coordinated via queues
Pipelines: producer/consumer stages (often with Queue)
We’ll show small examples of each.
Example 1: simplest Pool.map¶
    import multiprocessing as mp

    def f(x):  # must be top-level for pickling
        return x * x

    if __name__ == "__main__":
        with mp.Pool(processes=4) as pool:
            ys = pool.map(f, [1, 2, 3, 4, 5])   # [1, 4, 9, 16, 25]

Notes:
map preserves input order
process startup + serialization cost is real → batch enough work
Example 2: multiple arguments with starmap¶
    import multiprocessing as mp

    def add(a, b):
        return a + b

    if __name__ == "__main__":
        items = [(1, 10), (2, 20), (3, 30)]
        with mp.Pool(3) as pool:
            out = pool.starmap(add, items)      # [11, 22, 33]

This is the same API used by the Lab 10 π benchmark.
Example 3: Process + Queue (producer/consumer)¶
    import multiprocessing as mp

    def worker(q_in, q_out):
        for x in iter(q_in.get, None):          # loop until the sentinel arrives
            q_out.put(x * x)

    if __name__ == "__main__":
        q_in, q_out = mp.Queue(), mp.Queue()
        p = mp.Process(target=worker, args=(q_in, q_out))
        p.start()
        for x in [1, 2, 3]:
            q_in.put(x)
        q_in.put(None)                          # sentinel: stop
        results = [q_out.get() for _ in range(3)]   # [1, 4, 9]
        p.join()

Use this when:
you want a pipeline or streaming workflow
tasks arrive over time, not as a fixed list
Practical “first parallel run” checklist¶
Start with $p = 1$ to validate correctness, then scale up
Use a workload big enough to amortize overhead
Avoid shared state; prefer “local work + reduction”
Expect non-ideal speedups:
process startup costs
OS scheduling noise
memory bandwidth limits
Lab 10 preview¶
You will:
collect strong/weak scaling data for a small benchmark
compute $S(p)$ and $E(p)$, and estimate the serial fraction $e$
compare measurements to Amdahl/Gustafson expectations
write a short interpretation of the scaling behavior