29 — The wall at 10K → 1M
Concept node: see the DAG and glossary entry 29.

A simulator that runs cleanly at 10 000 creatures often grinds to a halt at 1 000 000. Not because the algorithm changed — because constant factors that were invisible at the smaller scale now bind.
This chapter is about finding the wall. The fixes are techniques you already have: hot/cold splits (§26), working-set discipline (§27), sort for locality (§28), pre-sized buffers, batched cleanup. The chapter’s job is to teach the reader to measure — to find which constant factors blew up.
Constant-factor bugs that bind at 10K → 1M:
- Reallocation. A
to_insert: Vec<CreatureRow>that grew lazily was fine at 100 pushes per tick (10K creatures × 1% reproduction). At 10K pushes per tick (1M × 1%), the reallocations dominate. Fix:Vec::with_capacity(estimated_max). - Linear scans.
hungry.iter().any(|&id| id == target_id)was 0.1 ms at 10K, but 10 ms at 1M. Fix: theid_to_slotmap (§23) plus parallel presence flags. - Cache spillover.
creatureworking set at 10K is 200 KB (L2-resident). At 1M it is 20 MB (L3-resident). Per-element time triples. Fix: hot/cold splits + narrower fields. HashMapiteration order. AHashMap<u32, _>iterated by systems that need deterministic order. At 10K the cost was tolerable; at 1M the bandwidth cost is high. Fix:BTreeMaporVec<(K, V)>.- Per-tick allocation. A system that allocates a fresh
Vecper tick was fine when theVecwas 1 KB. At 1M it is 100 KB; allocation latency starts to matter. Fix: reuse buffers across ticks. - Logging. A
println!per creature was tolerable at 10K. At 1M it is the simulator’s bottleneck. Fix: buffered logging, periodic snapshots, or simply turn it off.
The pattern: any cost that was O(1) per creature, multiplied by 1M, is no longer free. Anything that was O(N) per tick at 10K is now O(N²)-equivalent in wall time. The fixes are local — each cost is a single-line change — but finding them requires measurement.
The right tool is a profiler. cargo flamegraph (or perf record + perf report) tells you where the time goes. The same simulator at 10K and 1M produces different flame graphs; the wall is the difference.
A useful exercise: run your simulator at 10K for 1000 ticks; time it. Run at 1M for 100 ticks (same total entity-ticks); time it. The 1M version should take ~10× longer, not 100×. If it takes 100×, something has crossed a constant-factor wall and the profiler will show you what.
The fix is structural. Apply the techniques: hot/cold, working set, sort for locality, pre-sized buffers, batched cleanup, deterministic structures. Each is a chapter you have already read. The wall is the moment they all become non-optional.
Exercises
- Calibration. Run your simulator at N = 10K for 1000 ticks. Time it. Note the wall-clock total.
- Scale up. Run at N = 1M for 100 ticks (same total entity-ticks). Time it. Compute the ratio.
- Profile. Use
cargo flamegraph(orperf) on the 1M run. Identify the top three hottest functions. - Pre-size
to_insert. ApplyVec::with_capacityto your cleanup buffers. Re-run; re-profile. Did the hot list change? - Hot/cold split. Apply the §26 split. Re-run; re-profile. Working set should shrink visibly in the cache-miss counters.
- Use index maps. Replace any linear
iter().any()with the §23id_to_slotlookup. Re-run; re-profile. - (stretch) Find one new wall. Pick any system in your simulator and find one constant factor that scales worse than expected. The fix is usually one of the techniques above; identifying which one is the lesson.
Reference notes in 29_wall_10k_to_1m_solutions.md.
What’s next
§30 — Moving beyond the wall takes the next step: when even your fastest, tightest, hot/cold-split, sorted-for-locality simulator no longer fits in RAM, the architecture itself shifts.