Solutions: 4 - Cost is layout, and you have a budget

Exercise 1 - Pick your rates

system	target rate	budget per tick
card game	30 Hz (or event-driven)	33 ms
real-time strategy game	30-60 Hz	17-33 ms
market data feed	depends - 100 Hz - 1 MHz	10 ms - 1 µs
embedded sensor controller	1-10 kHz	100 µs - 1 ms
web API endpoint	per-request, ~10-200 ms	10-200 ms
offline batch (1B rows)	throughput, not Hz	minutes-to-hours total

The point of writing these down is that “this should be fast” is not a budget. “33 ms” is. The instant you have a number, every line of code in the inner loop is either spending bytes of that budget or it isn’t.

Exercise 2 - Count an operation

import timeit
d = {i: i*i for i in range(1_000_000)}
t = timeit.timeit("d[42]", globals={"d": d}, number=10_000_000)
print(f"dict[k] lookup: {t/10_000_000*1e9:.1f} ns")

dict[k] lookup: 15.1 ns

At 30 Hz (33 ms): ~2.2 million lookups per tick. At 1 kHz (1 ms): ~66,000 lookups per tick.

A 1-million-entity update that does one dict lookup per entity would cost 15 ms - half a 30 Hz budget on a single bookkeeping op. Two dict lookups per entity blows the budget on bookkeeping alone, with no actual simulation work done yet.

Exercise 3 - The layout difference

import time, numpy as np
n = 1_000_000
arr = np.arange(n, dtype=np.int64)
d   = {i: i for i in range(n)}
arr.sum(); sum(d.values())                     # warmup

t0 = time.perf_counter(); int(arr.sum());    t1 = time.perf_counter()
t2 = time.perf_counter(); sum(d.values());   t3 = time.perf_counter()

print(f"numpy sum:        {(t1-t0)*1e9/n:.2f} ns/elem")
print(f"sum(d.values()):  {(t3-t2)*1e9/n:.1f} ns/elem")

numpy sum:        0.20 ns/elem
sum(d.values()):  3.6 ns/elem
ratio: 18×

The dict version is interpreter-bound: the inner loop is a pure-Python for v in values: total += v, which pays bytecode dispatch + PyLong arithmetic + refcount work per element - about 3-6 ns. The numpy version is bandwidth-bound: a tight C loop reading int64s sequentially, the prefetcher loaded ahead, the L1 line warm. Same 1M int64 payload, two regimes apart, 18× cost gap.

Exercise 4 - The cliff

import time, numpy as np
for size in [100_000, 200_000, 1_000_000, 10_000_000]:
    a = np.ones(size, dtype=np.int64)
    a.sum()                                       # warmup
    best = float("inf")
    for _ in range(3):
        t0 = time.perf_counter(); a.sum(); t1 = time.perf_counter()
        if t1 - t0 < best: best = t1 - t0
    print(f"  N={size:>10,} ({size*8/1024:>7.0f} KB): {best*1e9/size:.2f} ns/elem")

  N=   100,000 (    781 KB):  0.11 ns/elem
  N=   200,000 (   1562 KB):  0.12 ns/elem
  N= 1,000,000 (   7812 KB):  0.11 ns/elem
  N=10,000,000 (  78125 KB):  0.20 ns/elem

On this machine the cliff between L2-fitting (200 KB - 1 MB) and L3-spilling (10 MB+) is shallow on the sequential sum (0.11 → 0.20 ns/elem, less than 2× slowdown). The prefetcher is doing its job: even with the working set in RAM, sequential-access numpy hovers near memory bandwidth limits. The dramatic cliff is on the gather column from §1; sequential numpy is forgiving.

This is why the chapter distinguishes bandwidth-bound from latency-bound: same N, same array, very different cliff depending on access pattern. The cliff exists; sequential numpy hides most of it.

Exercise 5 - Working backwards from the budget

Target 60 Hz (16.67 ms = 16,666 µs); 100,000 entities; one cache line touched per entity.

regime	per-element	for 100K entities	% of 60 Hz budget
compute-bound	~1 ns	100 µs	0.6%
bandwidth-bound	~0.2 ns	20 µs	0.1%
latency-bound	~12 ns	1,200 µs	7.2%
interpreter-bound	~5 ns	500 µs	3.0%

100K is small enough that even a Python-loop version fits comfortably. Scale to 10M:

regime	per-element	for 10M entities	% of 60 Hz budget
compute-bound	~1 ns	10,000 µs	60%
bandwidth-bound	~0.2 ns	2,000 µs	12%
latency-bound	~12 ns	120,000 µs	720% (over)
interpreter-bound	~5 ns	50,000 µs	300% (over)

At 10M entities, latency-bound and interpreter-bound layouts blow the budget by 3-7×. Bandwidth-bound finishes with 88% headroom. Same algorithm, same data, same machine.

Exercise 6 - A bad design

from dataclasses import dataclass

@dataclass
class Entity:
    x: float
    y: float
    vx: float
    vy: float

entities = [Entity(0.0, 0.0, 0.1, 0.1) for _ in range(1_000_000)]

# per tick:
for e in entities:
    e.x += e.vx
    e.y += e.vy

This is the canonical “obviously fast” Python design. Big-O is O(N); the inner work is two floating-point adds. Estimating from the regime table: interpreter-bound at ~5 ns × 4 attribute touches ≈ 20 ns/entity × 1M = 20 ms per tick, ~60% of a 30 Hz budget on simulation work alone. The exhibit tick_budget.py confirms this empirically:

  1,000,000  Python dataclass list      27.525 ms     30 Hz: 82.6%     60 Hz: 165% OVER
  1,000,000  numpy SoA                   0.278 ms     30 Hz:  0.8%     60 Hz:  1.7%

100× cost gap. The big-O is the same. The constant factor - the per-element interpreter dispatch through four attribute accesses on a heap-allocated dataclass - is what blows the budget.

Exercise 7 - Find your CPU’s TDP

Linux:

sudo dmidecode -t processor | grep -i 'power\|TDP'

Or look up the CPU model on the manufacturer’s spec sheet (Intel ARK, AMD product page, Apple silicon spec). Typical 2026 figures:

segment	sustained TDP
Raspberry Pi 5	~5 W
ultrabook (mobile)	15-28 W
desktop	65-125 W
workstation	125-280 W

Burst can run 1.5-2× higher for tens of seconds; sustained settles back to TDP. The number matters because it’s the ceiling for energy per tick on your machine - useful when budgeting battery life or cooling.

Exercise 8 - Battery budget

50 Wh laptop battery, simulator at 30 Hz:

8 W draw: 6.25 hours runtime.
14 W draw (after a layout change): 3.57 hours runtime.

The layout change cost 2.68 hours, or 43% of battery life. A change that adds memory loads to the inner loop is a change that shortens battery life by nearly half. For mobile, embedded, or any battery-powered work, this matters more than the wall-clock tick time.

Exercise 9 - Measure delta power

# Terminal 1: sustained sequential numpy sum
python3 -c "
import numpy as np
arr = np.arange(10_000_000, dtype=np.int64)
while True: _ = int(arr.sum())
"

# Terminal 2: read package energy over 30 seconds
sudo perf stat -a -e power/energy-pkg/ -- sleep 30

Repeat for the random-gather version (arr[idx].sum() with shuffled idx) and for an idle baseline. Convert each to average watts (J/30s = W).

Expected ordering: idle < sequential < gather. The gap between sequential and gather is the energy cost of breaking the prefetcher - same arithmetic, same data volume, but more memory-controller and DRAM activity per useful operation.

This exercise needs root for perf access to RAPL counters, and works on x86 Linux. On macOS, powermetrics is the analog. On bare-metal embedded, an external power meter is the honest answer.

Exercise 10 - Joules per access

Approximate energies per memory read:

level	energy per access
L1	~0.1 nJ
L2	~1 nJ
RAM	~30 nJ

For 10M int64 reads:

Sequential (mostly prefetched): assume mostly L1-equivalent cost. 10⁷ × 0.1 nJ = 1 mJ.
Random gather (mostly RAM misses): 10⁷ × 30 nJ = 300 mJ.

300× more energy. Convert: 1 mJ = 0.28 µWh; 300 mJ = 83 µWh. As a fraction of a 50 Wh battery: 5.6 × 10⁻⁹ vs 1.7 × 10⁻⁶ - both tiny in absolute terms. The ratio is what compounds across millions of ticks per day across millions of laptops, or across the lifetime power bill of a data centre. The disciplined layout is also the cheap one, twice over: faster and cooler per useful operation.

Exercise 11 - The budget is a curve

Hold the per-element cost fixed and the tick time is linear in the count: tick = c · N, so the sustainable rate is 1 / (c · N) - it falls as one over N, a hyperbola. At the interpreter-bound c ≈ 5 ns:

entities	tick	rate
100,000	0.5 ms	2000 Hz
1,000,000	5 ms	200 Hz
3,000,000	15 ms	67 Hz

The rate crosses 30 Hz at 33.3 ms / 5 ns ≈ 6.7M entities and 15 Hz at ≈ 13M. In the bandwidth-bound numpy regime (c ≈ 0.2 ns) the same crossings move out to ~166M and ~333M: leaving the interpreter slides the curve roughly 25× rightward without changing its shape. That is the Python lesson in one number - the curve is the same hyperbola whatever the regime; the regime decides how far out it crosses.

Two honest caveats. The per-element cost here is a single bookkeeping loop; the simulator’s real tick does far more per entity (a forage query, motion, the lifecycle), so its crossing sits at a much smaller N - hundreds of thousands, not millions. And c is your machine’s, not a universal constant. The shape is the durable claim; the crossing scale is something you measure.

Keyboard shortcuts

An Introduction to Programming, using ECS & EBP in Python