Solutions: 32 - Partition, don’t lock

Exercise 1 - Run the coordination exhibit

uv run code/measurement/coordination_patterns.py

pattern                            total (s)        msgs/sec  p50 jitter   p99 jitter
-------------------------------------------------------------------------------------
1. single shared Queue                4.48          31,283       54.7 µs    400.4 µs
2. per-worker Queue                  14.63           9,567      171.2 µs   3480.3 µs
3. shared numpy array                13.87          10,090        0.1 µs   5960.9 µs

Three readings of this particular run:

The single shared queue is fastest by throughput; the shared numpy array has the lowest p50 jitter (sub-microsecond) but higher p99 (multi-millisecond spikes when the OS preempts a spinning worker).
Per-worker queues are the slowest on every metric. The lock-contention argument from textbooks loses to the pipelining-through-one-queue effect at this workload size.
Numbers shift by machine and CPython version. On other hardware the shared-array pattern can run 5-20× faster on throughput too (the chapter prose’s numbers); your run will likely sit between these two.

Coordination events per 30 Hz tick budget (33 ms):

pattern	events/tick
single Queue	~1,000
per-worker Queue	~300
shared numpy array	~330

Any of them is enough for a batched simulator (10-100 phase signals per tick). None of them is enough for per-creature signalling (1M events per tick).

Exercise 2 - The batching threshold on your machine

At 30K msgs/sec coordination and ~30M ops/sec inner-loop numpy work (one motion update on 1M creatures in ~30 ms):

coordination cost per message: 33 µs
inner-loop work per message:   2 µs per creature × partition_size
For coordination ≤ 10% of work:
  33 µs ≤ 0.1 × 2 µs × partition_size
  partition_size ≥ 165 creatures

So partitions ≥ 200 creatures keep coordination cost under 10% of work cost. Below 200, coordination dominates; above, work dominates. For the simulator’s 1M creatures over 8 workers, each partition is 125,000 - three orders of magnitude past the threshold. Coordination is negligible.

The threshold matters when partition size shrinks - e.g., a focal sub-system that only acts on 100 creatures should not be partitioned across 8 workers (coordination would dominate); it should be run by one worker.

Exercise 3 - Pre-assigned partitions

# At startup
boundaries = [(i*N//W, (i+1)*N//W) for i in range(W)]

def init_worker(my_id, my_boundaries, shm_name):
    global _start, _end, _shm, _arr
    _start, _end = my_boundaries
    _shm = SharedMemory(shm_name)
    _arr = np.ndarray((NUM_COLUMNS, N), dtype=np.float32, buffer=_shm.buf)

# Per phase, the only signal is "run system X"
def run_phase(system_id):
    # _start, _end already known
    if system_id == 0:   _arr[0, _start:_end] += _arr[1, _start:_end] * DT
    elif system_id == 1: _arr[2, _start:_end] *= 0.99
    # ...

Compared to a re-sending version (pool.map(motion_worker, [(i*N//W, (i+1)*N//W) for i in range(W)])):

Pre-assigned: one signal per phase = one small int per worker (~1 µs)
Re-sending: tuple of two ints per worker, pickled and unpickled (~10-30 µs)

At 100 phases per tick × 8 workers, that’s 800-2400 µs vs ~800 µs. Real savings, but small in absolute terms - the architectural benefit (workers can keep state across phases, cached) matters more than the marginal IPC.

Exercise 4 - The DAG-as-array

# DAG_PROGRAM[phase, worker_id] = system_id_to_run (or 0 for "idle")
DAG_PROGRAM = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0],     # phase 0: only worker 0 runs system 1
    [2, 2, 2, 2, 2, 2, 2, 0],     # phase 1: 7 workers run system 2 (partitioned)
    [3, 4, 5, 0, 0, 0, 0, 0],     # phase 2: three different systems run in parallel
    [6, 6, 6, 6, 0, 0, 0, 0],     # phase 3: 4 workers run system 6
    [7, 0, 0, 0, 0, 0, 0, 0],     # phase 4: cleanup
], dtype=np.int8)

# main bumps a generation counter; workers spin until it matches their phase index
def worker_loop(my_id, gen_array, dag_array, shm_name):
    expected_phase = 0
    while True:
        while int(gen_array[0]) != expected_phase: pass     # spin-wait
        system_id = int(dag_array[expected_phase, my_id])
        if system_id != 0:
            run_system(system_id, my_id)
        # signal done by incrementing the worker's ack counter
        gen_array[1 + my_id] += 1
        expected_phase += 1

Correctness is testable: pin the DAG, run for N ticks under the shared-array implementation, then run the same with a for system in dag: run_system_serial(system) baseline. Compare world hashes. They must match.

Exercise 5 - Load-balanced partitioning

# Per-worker timestamps stamped at phase end
# After each tick, main reads them and recomputes boundaries

def rebalance(boundaries, last_phase_durations, total_n):
    """Give larger slices to faster workers (smaller durations)."""
    inv_speed = 1.0 / np.maximum(last_phase_durations, 1e-6)
    weight = inv_speed / inv_speed.sum()
    new_sizes = (weight * total_n).astype(np.int64)
    # Build cumulative boundaries from sizes
    cum = np.cumsum(new_sizes)
    new_boundaries = []
    start = 0
    for end in cum:
        new_boundaries.append((start, int(end)))
        start = int(end)
    new_boundaries[-1] = (new_boundaries[-1][0], total_n)        # fix trailing
    return new_boundaries

Run for 1000 ticks. Plot per-worker phase times tick by tick. The boundaries oscillate at first, then converge to a steady state where every worker finishes its phase at roughly the same wall time. The convergence rate depends on the workload’s stability - a flat-world uniform simulator converges fast; one with bursty events stays jittery.

This is closed-loop scheduling. Same pattern as TCP’s congestion control: observe, react, repeat. Main has the timestamps; main decides.

Exercise 6 - Workload heterogeneity

# Construct: 80% of the work in 20% of the partitions
def heavy_partition(i, start, end, _arr):
    # workers 0,1 do expensive work; the rest do cheap work
    work_factor = 10 if i < 2 else 1
    for _ in range(work_factor):
        _arr[0, start:end] += _arr[1, start:end] * DT

Fixed equal-sized partitioning: workers 0 and 1 take 10× longer per phase than workers 2-7. The phase wall time is dominated by workers 0 and 1 - the slowest worker sets the phase budget. Workers 2-7 sit idle, wasting cores.

Load-balanced version (from exercise 5): boundaries converge to small slices for workers 0 and 1, large slices for workers 2-7. Steady state: all workers finish in roughly the same wall time. The phase budget shrinks ~3× because the slow workers got less work.

This is the right shape for any simulator where workload is non-uniform across space (MMORPGs with cities, fluid simulations with turbulence, traffic with congestion). Static partitioning is a special case that works only when the work is uniform.

Exercise 7 - The boundary-builder lives in `main`

# Worker tries to compute its own slice - fragile
def bad_worker(my_id, n_workers, shm_name):
    s = SharedMemory(shm_name)
    arr = np.ndarray(SHAPE, dtype=DTYPE, buffer=s.buf)
    N = arr.shape[1]                          # ← reads N from the buffer
    start = my_id * N // n_workers
    end   = (my_id + 1) * N // n_workers
    arr[0, start:end] += arr[1, start:end] * DT

# Main mutates N mid-tick:
# anti-pattern: bad!
shm = SharedMemory(create=True, size=(2 * 1_000_000 * 4 + 64))
# ...
# tick 1 fires with N=1_000_000
# main resizes the array somehow during tick 2 (in reality you can't easily resize shared memory, but if N is read from a counter:
shm_n = ...                                    # shared counter
shm_n[0] = 2_000_000                            # mid-tick - chaos
# now workers think they own [0, 1_000_000/W) but the data layout changed

The disciplined form: __main__ computes boundaries once, writes them to a shared array, workers read their slice from the shared array. __main__ is the single writer of the boundaries; workers are read-only consumers. If __main__ wants to change boundaries (rebalance), it does so between phases, never during.

Letting workers compute their own slice from (my_id, n_workers, N) is fragile because N and n_workers must agree across all workers and main. Centralising the boundaries in __main__ eliminates the disagreement.

Exercise 8 - `Event` instead of busy-wait (stretch)

# Worker spins on shared array
while gen_array[0] != expected:
    pass

# Worker uses Event.wait()
event.wait()
event.clear()

event.wait() puts the worker to sleep at the kernel level. The wakeup involves an inter-process signal - typically 50-200 µs of overhead. Compared to the spin-loop (~0.1-1 µs latency), the Event-based pattern is 50-500× slower per round-trip.

But: the spinning worker pins a CPU core at 100% even when there’s no work. On a laptop, this means heat and battery drain. On a shared server, it crowds out other processes. Event-based wakeup is the right choice for low-frequency coordination (≤ a few hundred wakeups per second, e.g. background batch jobs). Spin-loop is right for high-frequency coordination on dedicated cores (a real-time simulator at 1 kHz).

Exercise 9 - The 1 kHz physics-engine question (stretch)

Tick budget at 1 kHz: 1 ms = 1000 µs

If coordination is 1 µs/event (shared array, no contention):
  budget allows 1000 coordination events / tick
  but a typical physics simulator wants ~50 system phases × 8 workers = 400 events / tick
  fits - coordination uses 40% of the budget

If coordination is 30 µs/event (queue-based):
  budget allows 33 events / tick
  same simulator needs 400 events → exceeds budget by 12×
  does not fit

At 1 kHz the simulator must use shared-array coordination and still has 40% of its budget consumed by coordination alone. Most physics engines run at 1 kHz or higher (game physics often at 240 Hz, control systems at 1-10 kHz). The arithmetic above is why those engines are usually in C++ or Rust - the per-event coordination cost in those languages is ~10-100 ns, leaving room for the actual physics.

The escalation: at the point where Python coordination eats the budget, the work shifts to maturin (Rust + PyO3) for the inner loop. Same architecture, same partition-don’t-lock pattern, but with sub-microsecond coordination via Rust’s crossbeam::channel or std::sync::atomic. The architecture is portable; the language is the tooling decision.

Keyboard shortcuts

An Introduction to Programming, using ECS & EBP in Python