Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Solutions: 4 — Cost is layout, and you have a budget

Exercise 1 — Pick your rates

systemtarget ratebudget per tick
card game30 Hz (or event-driven)33 ms
real-time strategy game30-60 Hz17-33 ms
market data feeddepends — 100 Hz – 1 MHz10 ms – 1 µs
embedded sensor controller1-10 kHz100 µs – 1 ms
web API endpointper-request, ~10-200 ms10-200 ms
offline batch (1B rows)throughput, not Hzminutes-to-hours total

The point of writing these down is that “this should be fast” is not a budget. “33 ms” is. The instant you have a number, every line of code in the inner loop is either spending bytes of that budget or it isn’t.

Exercise 2 — Count an operation

import timeit
d = {i: i*i for i in range(1_000_000)}
t = timeit.timeit("d[42]", globals={"d": d}, number=10_000_000)
print(f"dict[k] lookup: {t/10_000_000*1e9:.1f} ns")
dict[k] lookup: 15.1 ns

At 30 Hz (33 ms): ~2.2 million lookups per tick. At 1 kHz (1 ms): ~66,000 lookups per tick.

A 1-million-entity update that does one dict lookup per entity would cost 15 ms — half a 30 Hz budget on a single bookkeeping op. Two dict lookups per entity blows the budget on bookkeeping alone, with no actual simulation work done yet.

Exercise 3 — The layout difference

import time, numpy as np
n = 1_000_000
arr = np.arange(n, dtype=np.int64)
d   = {i: i for i in range(n)}
arr.sum(); sum(d.values())                     # warmup

t0 = time.perf_counter(); int(arr.sum());    t1 = time.perf_counter()
t2 = time.perf_counter(); sum(d.values());   t3 = time.perf_counter()

print(f"numpy sum:        {(t1-t0)*1e9/n:.2f} ns/elem")
print(f"sum(d.values()):  {(t3-t2)*1e9/n:.1f} ns/elem")
numpy sum:        0.20 ns/elem
sum(d.values()):  3.6 ns/elem
ratio: 18×

The dict version is interpreter-bound: the inner loop is a pure-Python for v in values: total += v, which pays bytecode dispatch + PyLong arithmetic + refcount work per element — about 3-6 ns. The numpy version is bandwidth-bound: a tight C loop reading int64s sequentially, the prefetcher loaded ahead, the L1 line warm. Same 1M int64 payload, two regimes apart, 18× cost gap.

Exercise 4 — The cliff

import time, numpy as np
for size in [100_000, 200_000, 1_000_000, 10_000_000]:
    a = np.ones(size, dtype=np.int64)
    a.sum()                                       # warmup
    best = float("inf")
    for _ in range(3):
        t0 = time.perf_counter(); a.sum(); t1 = time.perf_counter()
        if t1 - t0 < best: best = t1 - t0
    print(f"  N={size:>10,} ({size*8/1024:>7.0f} KB): {best*1e9/size:.2f} ns/elem")
  N=   100,000 (    781 KB):  0.11 ns/elem
  N=   200,000 (   1562 KB):  0.12 ns/elem
  N= 1,000,000 (   7812 KB):  0.11 ns/elem
  N=10,000,000 (  78125 KB):  0.20 ns/elem

On this machine the cliff between L2-fitting (200 KB - 1 MB) and L3-spilling (10 MB+) is shallow on the sequential sum (0.11 → 0.20 ns/elem, less than 2× slowdown). The prefetcher is doing its job: even with the working set in RAM, sequential-access numpy hovers near memory bandwidth limits. The dramatic cliff is on the gather column from §1; sequential numpy is forgiving.

This is why the chapter distinguishes bandwidth-bound from latency-bound: same N, same array, very different cliff depending on access pattern. The cliff exists; sequential numpy hides most of it.

Exercise 5 — Working backwards from the budget

Target 60 Hz (16.67 ms = 16,666 µs); 100,000 entities; one cache line touched per entity.

regimeper-elementfor 100K entities% of 60 Hz budget
compute-bound~1 ns100 µs0.6%
bandwidth-bound~0.2 ns20 µs0.1%
latency-bound~12 ns1,200 µs7.2%
interpreter-bound~5 ns500 µs3.0%

100K is small enough that even a Python-loop version fits comfortably. Scale to 10M:

regimeper-elementfor 10M entities% of 60 Hz budget
compute-bound~1 ns10,000 µs60%
bandwidth-bound~0.2 ns2,000 µs12%
latency-bound~12 ns120,000 µs720% (over)
interpreter-bound~5 ns50,000 µs300% (over)

At 10M entities, latency-bound and interpreter-bound layouts blow the budget by 3-7×. Bandwidth-bound finishes with 88% headroom. Same algorithm, same data, same machine.

Exercise 6 — A bad design

from dataclasses import dataclass

@dataclass
class Entity:
    x: float
    y: float
    vx: float
    vy: float

entities = [Entity(0.0, 0.0, 0.1, 0.1) for _ in range(1_000_000)]

# per tick:
for e in entities:
    e.x += e.vx
    e.y += e.vy

This is the canonical “obviously fast” Python design. Big-O is O(N); the inner work is two floating-point adds. Estimating from the regime table: interpreter-bound at ~5 ns × 4 attribute touches ≈ 20 ns/entity × 1M = 20 ms per tick, ~60% of a 30 Hz budget on simulation work alone. The exhibit tick_budget.py confirms this empirically:

  1,000,000  Python dataclass list      27.525 ms     30 Hz: 82.6%     60 Hz: 165% OVER
  1,000,000  numpy SoA                   0.278 ms     30 Hz:  0.8%     60 Hz:  1.7%

100× cost gap. The big-O is the same. The constant factor — the per-element interpreter dispatch through four attribute accesses on a heap-allocated dataclass — is what blows the budget.

Exercise 7 — Find your CPU’s TDP

Linux:

sudo dmidecode -t processor | grep -i 'power\|TDP'

Or look up the CPU model on the manufacturer’s spec sheet (Intel ARK, AMD product page, Apple silicon spec). Typical 2026 figures:

segmentsustained TDP
Raspberry Pi 5~5 W
ultrabook (mobile)15-28 W
desktop65-125 W
workstation125-280 W

Burst can run 1.5-2× higher for tens of seconds; sustained settles back to TDP. The number matters because it’s the ceiling for energy per tick on your machine — useful when budgeting battery life or cooling.

Exercise 8 — Battery budget

50 Wh laptop battery, simulator at 30 Hz:

  • 8 W draw: 6.25 hours runtime.
  • 14 W draw (after a layout change): 3.57 hours runtime.

The layout change cost 2.68 hours, or 43% of battery life. A change that adds memory loads to the inner loop is a change that shortens battery life by nearly half. For mobile, embedded, or any battery-powered work, this matters more than the wall-clock tick time.

Exercise 9 — Measure delta power

# Terminal 1: sustained sequential numpy sum
python3 -c "
import numpy as np
arr = np.arange(10_000_000, dtype=np.int64)
while True: _ = int(arr.sum())
"

# Terminal 2: read package energy over 30 seconds
sudo perf stat -a -e power/energy-pkg/ -- sleep 30

Repeat for the random-gather version (arr[idx].sum() with shuffled idx) and for an idle baseline. Convert each to average watts (J/30s = W).

Expected ordering: idle < sequential < gather. The gap between sequential and gather is the energy cost of breaking the prefetcher — same arithmetic, same data volume, but more memory-controller and DRAM activity per useful operation.

This exercise needs root for perf access to RAPL counters, and works on x86 Linux. On macOS, powermetrics is the analog. On bare-metal embedded, an external power meter is the honest answer.

Exercise 10 — Joules per access

Approximate energies per memory read:

levelenergy per access
L1~0.1 nJ
L2~1 nJ
RAM~30 nJ

For 10M int64 reads:

  • Sequential (mostly prefetched): assume mostly L1-equivalent cost. 10⁷ × 0.1 nJ = 1 mJ.
  • Random gather (mostly RAM misses): 10⁷ × 30 nJ = 300 mJ.

300× more energy. Convert: 1 mJ = 0.28 µWh; 300 mJ = 83 µWh. As a fraction of a 50 Wh battery: 5.6 × 10⁻⁹ vs 1.7 × 10⁻⁶ — both tiny in absolute terms. The ratio is what compounds across millions of ticks per day across millions of laptops, or across the lifetime power bill of a data centre. The disciplined layout is also the cheap one, twice over: faster and cooler per useful operation.