26 — Hot/cold splits

Concept node: see the DAG and glossary entry 26.
The simulator’s creature table has six columns: pos, vel, energy, birth_t, id, gen. The motion system reads three of the six (pos, vel, energy). The starvation system reads only energy. The cleanup system reads id and gen. The births log reads birth_t. No system reads all six.
If the columns are stored together — same memory region, same prefetcher pulls — every load brings in fields the inner loop ignores. At cache-spilling sizes, the ignored fields cost real bandwidth.
The fix is a split: fields touched on the hot path go in one table; fields read rarely go in another. Two tables, same length, same id alignment.
#![allow(unused)]
fn main() {
struct CreatureHot {
pos: Vec<(f32, f32)>, // motion, next_event, apply_eat
vel: Vec<(f32, f32)>, // motion
energy: Vec<f32>, // motion, apply_eat, apply_starve
}
struct CreatureCold {
birth_t: Vec<f64>, // logging only
id: Vec<u32>, // cleanup, id_to_slot maintenance
gen: Vec<u32>, // cleanup
}
}
Motion reads only CreatureHot. Cleanup reads CreatureCold. The two systems’ cache traffic does not overlap.
The bandwidth math: pre-split, motion’s loop reads ~40 bytes per creature (the full row, prefetcher loads everything together). Post-split, motion reads ~24 bytes (just pos + vel + energy). Roughly 1.7× less bandwidth, which often translates to 1.5–2× faster wall-clock time at RAM-bound sizes.
The discipline carries cost. Two tables means two id-to-slot maps (or careful sharing of one). Cleanup must update both in lockstep when slots move. The split is a real architectural commitment — once made, every system that touches creatures must know which table it is touching.
When the split is wrong:
- All-fields workloads. A debug-inspect system that prints every field reads everything; the split adds overhead without reducing bandwidth.
- Tiny rows. If the full row is already 16-24 bytes (one or two fields per cache line), splitting a 4-byte field out adds more pointer traffic than it saves.
- Frequently rebalancing. If which fields are “hot” changes from tick to tick, a fixed split becomes unhelpful. Hot/cold is a static decision, made once for a given target workload.
The decision rests on measurement. Profile the simulator at the target size; identify the inner loop’s actual touched fields; split accordingly. The split is earned by data, not by aesthetics.
A useful test: name the split before writing it. “I am moving birth_t into a cold table because no inner loop reads it” is a sound design choice. “I am moving birth_t into a cold table because that’s how ECS engines do it” is not.
Exercises
These extend the simulator’s creature table from §0/§1.
- Audit access patterns. For each system in your simulator, list which fields it reads and which it writes. Fields read every tick are hot; the rest are cold.
- Build the split. Refactor
creatureintocreature_hotandcreature_cold. Both share the id allocator. Verify each row’s fields stay aligned across the two tables. - Time motion at 1M creatures. Pre-split: time motion. Post-split: time motion. Compare. The post-split version should be ~1.5–2× faster.
- Cleanup must touch both. Modify cleanup to
swap_removefrom bothcreature_hotandcreature_coldwhen a creature dies. Verify alignment after. - A bad split. Construct a split where the wrong fields go cold (e.g.
energyin cold). Time motion. The cost of the cache miss onenergyshould bury any savings elsewhere. - (stretch) The all-fields case. Write a system that reads every field (e.g. a serialiser). Time the split version. Discuss why the split’s overhead is real here, and why this is a fine tradeoff: most ticks do not run this system.
Reference notes in 26_hot_cold_splits_solutions.md.
What’s next
§27 — Working set vs cache puts numbers on the question this section was implicitly asking: how big is the inner loop’s footprint, and what cache level does it fit in?