Solutions: 28 — Sort for locality
Exercise 1 — Compute spatial cells
import numpy as np
def spatial_cell(pos_x: np.ndarray, pos_y: np.ndarray, cell_size: float) -> np.ndarray:
cx = (pos_x / cell_size).astype(np.int32)
cy = (pos_y / cell_size).astype(np.int32)
return ((cx & 0xFFFF) << 16) | (cy & 0xFFFF)
rng = np.random.default_rng(0)
n = 1_000
pos_x = rng.uniform(0, 100, n).astype(np.float32)
pos_y = rng.uniform(0, 100, n).astype(np.float32)
cells = spatial_cell(pos_x, pos_y, cell_size=10.0)
unique, counts = np.unique(cells, return_counts=True)
print(f"{len(unique)} cells occupied, max {counts.max()} creatures per cell")
print(f"first 5 cells: {unique[:5].tolist()}")
print(f"histogram of cell counts: {np.bincount(counts)[:20]}")
For uniformly distributed creatures in a 100×100 world with 10-unit cells, expect ~100 cells (10×10 grid), ~10 creatures per cell on average. The histogram is Poisson-shaped — most cells have 5-15 creatures, a few have 0 or 25+.
Exercise 2 — Sort by cell
def sort_for_locality(world, cell_size: float):
cells = spatial_cell(world.pos_x, world.pos_y, cell_size)
order = np.argsort(cells, kind="stable")
for col in ("pos_x", "pos_y", "vel_x", "vel_y", "energy", "id"):
arr = getattr(world, col)
arr[:] = arr[order]
# rebuild id_to_slot
world.id_to_slot[world.id[:world.n_active]] = np.arange(world.n_active, dtype=np.uint32)
sort_for_locality(world, cell_size=10.0)
print(f"first 10 positions after sort:")
for i in range(10):
print(f" ({world.pos_x[i]:.2f}, {world.pos_y[i]:.2f}) cell={spatial_cell(world.pos_x[i:i+1], world.pos_y[i:i+1], 10.0)[0]}")
After the sort, the first 10 positions belong to creatures in the same (or adjacent) cells — their (pos_x, pos_y) values cluster instead of scattering randomly.
Exercise 3 — Maintain id_to_slot
# Before sort: held_id is at some slot
held_id = int(world.id[42])
before_slot = 42
before_pos = (float(world.pos_x[42]), float(world.pos_y[42]))
sort_for_locality(world, cell_size=10.0)
# After sort: look up by id
after_slot = int(world.id_to_slot[held_id])
after_pos = (float(world.pos_x[after_slot]), float(world.pos_y[after_slot]))
print(f"before: slot={before_slot}, pos={before_pos}")
print(f"after: slot={after_slot}, pos={after_pos}")
assert before_pos == after_pos, "data moved but is the same value"
The held id resolves to a new slot. The position at the new slot equals the position at the old slot. The id_to_slot map is the bridge; without it, the held reference would dereference garbage.
Exercise 4 — Time next_event before and after
import time, numpy as np
def next_event_scan(pos_x, pos_y, radius=1.0):
"""For each creature, count neighbours within radius among the next 100 entries."""
n = len(pos_x)
count = np.zeros(n, dtype=np.uint32)
for i in range(n):
end = min(i + 100, n)
dx = pos_x[i+1:end] - pos_x[i]
dy = pos_y[i+1:end] - pos_y[i]
count[i] = int(np.sum(dx*dx + dy*dy < radius*radius))
return count
# Pre-sort timing
t = time.perf_counter()
next_event_scan(world.pos_x[:10_000], world.pos_y[:10_000])
t_pre = time.perf_counter() - t
sort_for_locality(world, cell_size=10.0)
# Post-sort timing
t = time.perf_counter()
next_event_scan(world.pos_x[:10_000], world.pos_y[:10_000])
t_post = time.perf_counter() - t
print(f"pre-sort: {t_pre*1000:.2f} ms")
print(f"post-sort: {t_post*1000:.2f} ms")
print(f"ratio: {t_pre/t_post:.2f}×")
Expect a ~1.5-3× speedup on the post-sort version. The reason: post-sort, the pos_x[i+1:end] slice is more likely to contain creatures in the same spatial cell — so the boolean mask has more True values clustered together, and the subsequent indexing operations are more cache-friendly.
The exact ratio depends on the spatial distribution and the scan-window size. A scan window of 100 might capture exactly one cell (if cells average 10 creatures) or several adjacent cells; the locality benefit is biggest when the scan window matches the typical cell occupancy.
Exercise 5 — Sort cadence
results = {}
for cadence in (1, 10, 100, 1_000_000): # last one = "never"
world = build_world(n=10_000)
t0 = time.perf_counter()
for tick in range(100):
motion(world, dt=1/30)
if tick % cadence == 0:
sort_for_locality(world, cell_size=10.0)
next_event_scan(world.pos_x, world.pos_y)
results[cadence] = time.perf_counter() - t0
for c, t in results.items():
print(f"sort every {c:>8} ticks: {t:.2f} s total")
Typical shape:
sort every 1 ticks: 0.85 s (sort cost dominates)
sort every 10 ticks: 0.62 s (sweet spot, often)
sort every 100 ticks: 0.75 s (scan cost grows as positions drift)
sort every 1000000 ticks: 1.20 s (no resort; scan cost stays high)
The optimum is wherever the sort’s amortised cost balances the scan’s per-tick savings. For most simulators that’s “resort every 10-100 ticks,” depending on motion speed. A re-sort triggered by accumulated drift (resort once total motion since last sort exceeds half a cell width) generalises this to scenarios with variable motion rates.
Exercise 6 — Z-order curve (stretch)
def _spread2(v: int) -> int:
"""Interleave 16 bits of v with zeros (Morton helper)."""
v &= 0xFFFF
v = (v | (v << 8)) & 0x00FF00FF
v = (v | (v << 4)) & 0x0F0F0F0F
v = (v | (v << 2)) & 0x33333333
v = (v | (v << 1)) & 0x55555555
return v
def morton_cell(pos_x: np.ndarray, pos_y: np.ndarray, cell_size: float) -> np.ndarray:
cx = np.clip((pos_x / cell_size).astype(np.int32), 0, 0xFFFF)
cy = np.clip((pos_y / cell_size).astype(np.int32), 0, 0xFFFF)
return np.array([(_spread2(int(x)) | (_spread2(int(y)) << 1)) for x, y in zip(cx, cy)],
dtype=np.uint32)
For pure-numpy efficiency, vectorise _spread2:
def spread_vec(v: np.ndarray) -> np.ndarray:
v = v & 0xFFFF
v = (v | (v << 8)) & 0x00FF00FF
v = (v | (v << 4)) & 0x0F0F0F0F
v = (v | (v << 2)) & 0x33333333
v = (v | (v << 1)) & 0x55555555
return v
def morton_cell(pos_x, pos_y, cell_size):
cx = np.clip((pos_x / cell_size).astype(np.int32), 0, 0xFFFF)
cy = np.clip((pos_y / cell_size).astype(np.int32), 0, 0xFFFF)
return spread_vec(cx) | (spread_vec(cy) << 1)
Compared to the simple (cx << 16) | cy packing, Z-order keeps cells (1,0), (0,1), (1,1) close to (0,0) in the linear order — instead of (1,0) being adjacent to (0,0) but (0,1) being far away. The result is that 2D adjacency is approximately preserved in 1D adjacency.
For typical simulator workloads where next_event_scan looks at horizontal-and-vertical neighbours, Z-order outperforms simple packing by 10-30%. The difference is biggest for densely-packed simulations where vertical neighbours within the scan window matter.
The full Hilbert curve preserves 2D locality even better but is more expensive to compute. For most simulators, Z-order is the sweet spot — close to optimal, vectorisable in numpy.