Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Solutions: 1 — The machine model

These exercises are about measuring your machine. Numbers vary; ratios are stable. Run them and write down what you see.

Exercise 1 — Cache sizes

Linux: lscpu | grep -E 'L1|L2|L3' or getconf -a | grep CACHE.

Typical desktop x86-64 in 2026: L1d 32-48 KB per core, L2 1-2 MB per core, L3 16-128 MB shared. Apple Silicon: larger L1, very large shared L2.

Exercise 2 — Sequential sum

use std::time::Instant;

fn main() {
    // Playground-scaled. Use 100_000_000 locally for the real number.
    let n = 10_000_000;
    let v: Vec<u64> = vec![1; n];
    let start = Instant::now();
    let sum: u64 = v.iter().sum();
    let elapsed = start.elapsed();
    let ns_per_elem = elapsed.as_nanos() as f64 / v.len() as f64;
    println!("sum = {sum}, {elapsed:?}, {ns_per_elem:.2} ns/elem");
}

Expect somewhere around 0.2-1 ns per element on modern hardware. The loop is memory-bandwidth bound; the CPU is mostly waiting for RAM to deliver lines.

Exercise 3 — Random-access sum

use std::time::Instant;

fn main() {
    // Playground-scaled. Use 100_000_000 locally for the real number.
    let n: usize = 10_000_000;
    let v: Vec<u64> = vec![1; n];
    let mut state = 0xDEAD_BEEFu64;
    let indices: Vec<usize> = (0..n)
        .map(|_| {
            state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            (state as usize) % n
        })
        .collect();

    let start = Instant::now();
    let mut sum = 0u64;
    for &i in &indices {
        sum += v[i];
    }
    let elapsed = start.elapsed();
    println!("sum = {sum}, {elapsed:?}, {:.2} ns/elem",
             elapsed.as_nanos() as f64 / n as f64);
}

Expect 30-100 ns per element — close to the RAM-latency cost. Each access misses cache.

Exercise 4 — Cache cliffs

The transitions you see roughly correspond to spilling out of L1 (~32 KB), L2 (~1-2 MB), L3 (~32 MB). Below L1 you should see ~0.1-0.3 ns/elem. In L3 maybe 0.5-1.5 ns. Past L3, 0.5-3 ns (sequential, since prefetcher helps even from RAM).

For random-access cliffs (a more dramatic plot), repeat exercise 3 at sizes 1K, 10K, 100K, 1M, 10M, 100M. The transitions are sharper.

Exercise 5 — Pointer chasing

use std::time::Instant;

struct Node { value: u64, next: Option<Box<Node>> }

fn build(n: usize) -> Box<Node> {
    let mut head = Box::new(Node { value: 1, next: None });
    for _ in 1..n {
        head = Box::new(Node { value: 1, next: Some(head) });
    }
    head
}

fn sum(mut head: &Node) -> u64 {
    let mut s = 0;
    loop {
        s += head.value;
        match &head.next {
            Some(next) => head = next,
            None => break s,
        }
    }
}

fn main() {
    // Playground-scaled. Use 1_000_000 locally for the real number.
    let n = 100_000;
    let head = build(n);
    let start = Instant::now();
    let s = sum(&head);
    let elapsed = start.elapsed();
    println!("sum = {s}, {elapsed:?}, {:.2} ns/elem",
             elapsed.as_nanos() as f64 / n as f64);
}

A Vec<u64> sum is roughly 1 ns/elem; the linked-list walk is roughly 50-100 ns/elem. The ratio is the L1-to-RAM gap from the prose.

Important: building the linked list with deep recursion would blow the stack at large N. The build function above uses a loop precisely so it can scale. The sum is also iterative — a recursive walk would blow the stack on the way down.

Exercise 6 — Reading lscpu against your benchmarks

The transitions are noisy because:

  • Cache levels overlap (a hot cache line might still be in L1 after spilling to L2).
  • Hardware prefetchers help sequential reads.
  • The OS may evict pages between runs.

If your noise is worse than your signal, run each measurement multiple times and take the median.