Solutions - An Introduction to Programming, using ECS & EBP in Rust

Exercise 1 - The pathological counter

#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Instant;

const ITERS: u64 = 10_000_000;
let counters: [AtomicU64; 8] = std::array::from_fn(|_| AtomicU64::new(0));

let t = Instant::now();
thread::scope(|s| {
    for c in counters.iter() {
        s.spawn(|| {
            for _ in 0..ITERS {
                c.fetch_add(1, Ordering::Relaxed);
            }
        });
    }
});
println!("8 threads on one cache line: {:?}", t.elapsed());
}

Compare with a single-threaded loop doing 8 × ITERS increments on a single counter. On most chips the single-threaded version is faster than the 8-thread version. True negative scaling.

Exercise 2 - The padded version

#![allow(unused)]
fn main() {
#[repr(align(64))]
struct Padded(AtomicU64);

let counters: [Padded; 8] = std::array::from_fn(|_| Padded(AtomicU64::new(0)));
}

Each Padded occupies a full cache line. The 8 padded counters span 8 cache lines. Re-time the parallel loop. It should now scale near-linearly - typically 6-8× faster than single-threaded.

Exercise 3 - Per-thread `to_remove`

The data inside each thread’s Vec<u32> lives on the heap, in regions the allocator distributes to be far apart (typically separated by at least one page = 4 KB). False sharing on the data is unlikely.

The Vec headers (the (ptr, len, cap) fields, 24 bytes each on 64-bit), if stored adjacent in a parent struct or array, can share a cache line. If you observe poor scaling, padding the parent struct fixes it:

#![allow(unused)]
fn main() {
#[repr(align(64))]
struct ThreadLocalVec(Vec<u32>);

let segments: [ThreadLocalVec; 8] = std::array::from_fn(|_| ThreadLocalVec(Vec::new()));
}

Exercise 4 - Adjacent struct fields

#![allow(unused)]
fn main() {
struct TwoCounters { a: AtomicU64, b: AtomicU64 } // 16 bytes, one cache line

let counters = TwoCounters { a: AtomicU64::new(0), b: AtomicU64::new(0) };

thread::scope(|s| {
    s.spawn(|| {
        for _ in 0..ITERS { counters.a.fetch_add(1, Ordering::Relaxed); }
    });
    s.spawn(|| {
        for _ in 0..ITERS { counters.b.fetch_add(1, Ordering::Relaxed); }
    });
});
}

Two threads, separate fields, same cache line. Performance is similar to two threads contending on one field - the line is invalidated on every write either way.

Fix: pad each field to its own cache line.

#![allow(unused)]
fn main() {
#[repr(align(64))]
struct PaddedAtomic(AtomicU64);

struct TwoCounters { a: PaddedAtomic, b: PaddedAtomic } // 128 bytes, two lines
}

Exercise 5 - Find your cache-line size

$ getconf LEVEL1_DCACHE_LINESIZE
64

x86 and most ARM Cortex-A: 64 bytes. Apple Silicon: 128 bytes for some cache levels. AArch64: variable; 64 or 128 depending on the chip. If padding to 64 bytes does not eliminate false sharing on a particular chip, try 128.

crossbeam_utils::CachePadded<T> checks the platform at compile time and pads to 128 on platforms that need it. For portable code, use it instead of hardcoded #[repr(align(64))].

Keyboard shortcuts

An Introduction to Programming, using ECS & EBP in Rust

Solutions: 33 - False sharing

Exercise 1 - The pathological counter

Exercise 2 - The padded version

Exercise 3 - Per-thread to_remove

Exercise 4 - Adjacent struct fields

Exercise 5 - Find your cache-line size

Exercise 3 - Per-thread `to_remove`