pjdurden/churn

churn

A deterministic, zero-GPU discrete-event simulator of elastic-membership DiLoCo-style training, plus a straggler-aware membership policy that handles the slow-but-alive nodes that existing decentralized-training stacks do not.

What it models: time, membership, and coordination — not tensors or gradient math. It is a testbed for policy logic and relative comparison, deliberately not an absolute-throughput predictor.

Why

Decentralized training over the public internet (e.g. Prime Intellect's prime-diloco + PCCL) already handles dead nodes — heartbeat eviction, elastic join/leave, mid-training onboarding. Two things are under-served:

  1. Stragglers — alive but slow. A single slow-but-alive node drags the whole DiLoCo round. There's no adaptive deadline, no partial-participation outer step, no graduated slow-node response.
  2. No deterministic churn/fault testbed. Reproducing a partition, a crash mid-collective, or a slow-node scenario requires real GPUs and is non-deterministic.

churn builds the reproducible testbed first, then uses it to build and prove the missing straggler policy.

Highlights

  • Deterministic engine. One logical clock, a (time, seq)-ordered event queue, all randomness from a seeded ChaCha8Rng. Same seed + scenario ⇒ byte-identical event trace. Enforced by a property test and committed golden traces.
  • Pluggable Policy seam. The engine owns time/membership/network/events; a policy owns only the round decisions.
    • BaselinePolicy mirrors PRIME/PCCL today: barrier waits for everyone, dead nodes evicted by heartbeat, stragglers block.
    • StragglerPolicy adds an adaptive per-round deadline (median + k·MAD over a rolling arrival-offset history), partial-participation quorum with sidelining, and a graduated slow-node response (transient slow → rejoin; persistently slow → evict).
  • A/B harness. Run the same scenario + seed under both policies and measure the delta.
  • Python wheel. Build scenarios, run, and compare from Python — Rust does the work.
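The determinism claim in the first highlight rests on two ingredients: a total order on events and a single seeded source of randomness. The sketch below illustrates that idea in miniature. It is not churn's engine: `ToySim`, `Event`, and `simulate` are hypothetical names, and Python's `random.Random` stands in for the seeded ChaCha8Rng.

```python
import heapq
import random
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    time: int                          # logical clock, in microseconds
    seq: int                           # insertion counter; breaks ties between equal times
    name: str = field(compare=False)   # payload, excluded from ordering


class ToySim:
    """Hypothetical miniature of a (time, seq)-ordered discrete-event loop."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # stand-in for the seeded ChaCha8Rng
        self.queue = []
        self.seq = 0
        self.trace = []

    def schedule(self, time, name):
        heapq.heappush(self.queue, Event(time, self.seq, name))
        self.seq += 1

    def run(self):
        # Pop events in (time, seq) order and record them in the trace.
        while self.queue:
            ev = heapq.heappop(self.queue)
            self.trace.append((ev.time, ev.seq, ev.name))
        return self.trace


def simulate(seed):
    sim = ToySim(seed)
    for worker in range(4):
        # Jitter comes only from the seeded RNG, never from wall-clock time.
        sim.schedule(1000 + sim.rng.randint(0, 50), f"worker{worker}:step_done")
    sim.schedule(1000, "heartbeat")
    return sim.run()


trace_a = simulate(42)
trace_b = simulate(42)
print(trace_a == trace_b)  # same seed, same scenario: identical trace
```

Because ties on `time` are broken by the insertion counter rather than by payload or memory address, two events scheduled for the same instant always replay in the order they were scheduled.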

Headline result

On a persistent-straggler scenario (4 workers, one 10× slow), StragglerPolicy vs BaselinePolicy:

metric        baseline     straggler
wall-clock    92,050 µs    20,053 µs   (4.59× faster)
utilization   0.62         0.89

Baseline blocks on the straggler every round; the straggler policy sidelines it twice, then evicts it.
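The "sidelines it twice, then evicts it" behaviour can be sketched with a median + k·MAD deadline and a per-worker strike counter. This is a minimal illustration under stated assumptions, not the StragglerPolicy implementation: `adaptive_deadline`, `decide_round`, the `k`, `evict_after`, and window-size values, and the arrival offsets are all hypothetical. Only `quorum=0.75` appears in the Python quickstart.

```python
from collections import deque
from statistics import median


def adaptive_deadline(history, k=3.0):
    """Deadline = median + k * MAD over recent arrival offsets (in µs)."""
    offsets = list(history)
    med = median(offsets)
    mad = median([abs(x - med) for x in offsets])
    return med + k * mad


def decide_round(arrivals, history, strikes, quorum=0.75, k=3.0, evict_after=2):
    """Return (participants, sidelined, evicted) for one outer round."""
    deadline = adaptive_deadline(history, k)
    on_time = sorted(w for w, t in arrivals.items() if t <= deadline)
    late = sorted(w for w, t in arrivals.items() if t > deadline)
    if len(on_time) / len(arrivals) < quorum:
        # Too few on-time workers: fall back to waiting for everyone.
        return sorted(arrivals), [], []
    sidelined, evicted = [], []
    for w in late:
        if strikes.get(w, 0) >= evict_after:
            evicted.append(w)            # persistently slow
        else:
            strikes[w] = strikes.get(w, 0) + 1
            sidelined.append(w)          # skipped this round, may rejoin later
    for w in on_time:
        strikes.pop(w, None)             # an on-time round clears earlier strikes
        history.append(arrivals[w])
    return on_time, sidelined, evicted


# Illustrative numbers only: three fast workers and one roughly 10x slower.
history = deque([2000, 2010, 1990, 2005, 1995], maxlen=8)  # rolling window
strikes = {}
arrivals = {0: 2000, 1: 2005, 2: 1995, 3: 20000}
outcomes = []
for round_no in range(3):
    participants, sidelined, evicted = decide_round(arrivals, history, strikes)
    outcomes.append("evicted" if 3 in evicted else "sidelined" if 3 in sidelined else "on time")
print(outcomes)
```

Using the median and MAD rather than the mean and standard deviation keeps a single extreme arrival from inflating the deadline it is being judged against.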

Layout

crates/
  churn-core/   # the deterministic simulation engine (pure Rust)
  churn-py/     # thin PyO3 bindings → the `churn` Python wheel (maturin)
python/churn/   # pure-Python package: scenario builders, typed results, run()/compare()
scenarios/      # declarative scenario files (the reproducible chaos library)
docs/superpowers/  # design specs + implementation plans

Quickstart — Rust

# Run a scenario under the baseline policy (trace → stdout, metrics → stderr)
cargo run -p churn-core --bin run-scenario -- scenarios/persistent_straggler.json

# A/B the same scenario under baseline vs straggler and print the metrics delta
cargo run -p churn-core --bin run-scenario -- --compare scenarios/persistent_straggler.json

Use the library directly:

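use churn_core::{Scenario, Simulator, StragglerConfig, StragglerPolicy};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a committed scenario (assumes the error types implement std::error::Error).
    let json = std::fs::read_to_string("scenarios/persistent_straggler.json")?;
    let scenario = Scenario::from_json(&json)?;
    let mut sim = Simulator::new(scenario);
    let mut policy = StragglerPolicy::new(StragglerConfig::default());
    sim.run(&mut policy); // drive the simulation under the straggler policy
    println!("{:?}", sim.metrics());
    Ok(())
}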

Quickstart — Python

The wheel is built with maturin:

python3 -m venv .venv && . .venv/bin/activate
pip install maturin
maturin develop            # build + install `churn` into the venv

Then build and run a scenario from Python:

import churn

sc = churn.Scenario(
    seed=42, inner_steps=2, target_outer_steps=5, horizon=5_000_000,
    heartbeat_period=1000, heartbeat_miss_threshold=5,
    base_latency=100, bandwidth_bpus=10, state_bytes=100,
    workers=[churn.WorkerSpec(id=i, join_at=0, inner_step_mean=1000, inner_step_jitter=0)
             for i in range(4)],
    injects=[churn.Slow(id=3, at=0, factor=10)],
)

res = churn.run(sc, churn.Straggler(churn.StragglerConfig(quorum=0.75)))
print(res.metrics.wall_clock, res.metrics.utilization())
print(len(res.trace), "events")          # trace is a list of typed Event objects

ab = churn.compare(sc, churn.StragglerConfig())
print(f"{ab.wall_clock_speedup:.2f}× faster, +{ab.utilization_gain:.3f} utilization")

# or load a committed scenario
churn.Scenario.from_file("scenarios/persistent_straggler.json")
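As a sanity check on the comparison output, the headline figures can be recomputed by hand. The sketch assumes that `wall_clock_speedup` is the ratio of baseline to straggler wall-clock time and that `utilization_gain` is the difference in utilization; the README does not spell out either definition, and the utilization inputs here are the rounded values from the headline table.

```python
# Headline figures from the persistent-straggler scenario.
baseline_wall_clock_us = 92_050
straggler_wall_clock_us = 20_053
baseline_utilization = 0.62      # rounded, as reported
straggler_utilization = 0.89     # rounded, as reported

# Assumed definitions: ratio for speedup, difference for gain.
speedup = baseline_wall_clock_us / straggler_wall_clock_us
gain = straggler_utilization - baseline_utilization

print(f"{speedup:.2f}× faster, +{gain:.3f} utilization")
```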

Build a release wheel with maturin build --release (output under target/wheels/).

Scenarios

A scenario is declarative JSON — the reproducible workload. Workers have per-inner-step timing; timed injects model membership/fault events (Crash, Leave, Slow, Restore, Partition, ClearPartition). Example (scenarios/persistent_straggler.json):

{
  "seed": 42,
  "workers": [ {"id": 0, "join_at": 0, "inner_step_mean": 1000, "inner_step_jitter": 0}, ... ],
  "injects": [ {"op": "Slow", "id": 3, "at": 0, "factor": 10} ],
  "inner_steps": 2, "target_outer_steps": 5, "horizon": 5000000,
  "heartbeat_period": 1000, "heartbeat_miss_threshold": 5,
  "base_latency": 100, "bandwidth_bpus": 10, "state_bytes": 100
}

A crash is a silent death — the worker keeps its membership slot until the heartbeat monitor detects the silence and evicts it (mirroring PRIME). A graceful leave removes it immediately.
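The difference between the two exits can be made concrete with the heartbeat settings from the example scenario (`heartbeat_period` 1000, `heartbeat_miss_threshold` 5). The sketch assumes heartbeats fire at `join_at + n * period` and that eviction happens at the instant the threshold-th consecutive beat is missed; the engine's exact rule may differ, and both function names are hypothetical.

```python
def last_heartbeat(join_at, crash_at, period):
    """Time of the last heartbeat sent at or before the crash."""
    return join_at + ((crash_at - join_at) // period) * period


def slot_release_time(event, at, join_at=0, period=1000, miss_threshold=5):
    """When the worker's membership slot is freed (assumed rule, in µs)."""
    if event == "Leave":
        return at  # graceful leave: removed immediately
    if event == "Crash":
        # Silent death: the slot is held until miss_threshold beats are missed.
        return last_heartbeat(join_at, at, period) + miss_threshold * period
    raise ValueError(f"unsupported event: {event}")


crash_release = slot_release_time("Crash", at=2500)
leave_release = slot_release_time("Leave", at=2500)
print(crash_release - 2500, leave_release - 2500)  # extra time the slot stays occupied
```

Under these assumptions a crashed worker occupies its slot for several heartbeat periods after it stops contributing, which is the window in which a barrier-style policy keeps waiting on it.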

Testing

cargo test --workspace        # engine unit tests + determinism property test + golden traces
cargo clippy --workspace --all-targets -- -D warnings
maturin develop && pytest python/tests/   # Python smoke tests (build + end-to-end + determinism)

Determinism is an invariant, not a feature: a proptest asserts identical traces across runs, and golden traces guard against drift (and demonstrate the baseline "one slow node drags the round" pathology that the straggler policy avoids).
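A golden-trace guard amounts to comparing canonical bytes against a committed file. The sketch below shows the mechanism with a toy trace. `toy_trace`, `serialize`, and `check_golden` are hypothetical helpers, the trace format is invented for illustration, and the "write the file on first run" step is an assumption about workflow rather than how churn's tests are wired.

```python
import json
import random
import tempfile
from pathlib import Path


def toy_trace(seed, events=8):
    """Generate a small trace whose only randomness is the seeded RNG."""
    rng = random.Random(seed)
    clock, trace = 0, []
    for seq in range(events):
        clock += rng.randint(1, 100)
        trace.append({"time": clock, "seq": seq, "kind": "tick"})
    return trace


def serialize(trace):
    """Canonical bytes: sorted keys and fixed separators, one event per line."""
    lines = [json.dumps(e, sort_keys=True, separators=(",", ":")) for e in trace]
    return "\n".join(lines).encode("utf-8")


def check_golden(path, trace_bytes):
    """Record the golden file if absent, otherwise require a byte-for-byte match."""
    if not path.exists():
        path.write_bytes(trace_bytes)
        return True
    return path.read_bytes() == trace_bytes


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "toy_seed42.trace"
    first = check_golden(golden, serialize(toy_trace(42)))             # records the golden
    second = check_golden(golden, serialize(toy_trace(42)))            # identical rerun
    drifted = check_golden(golden, serialize(toy_trace(42, events=9))) # behaviour changed
print(first, second, drifted)
```

Comparing serialized bytes rather than in-memory objects also catches drift in formatting or field order, not just in event content.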

Status & roadmap

  • Phase 1: deterministic engine + BaselinePolicy + CLI + golden traces.
  • Phase 2: StragglerPolicy + A/B harness (the headline win above).
  • Python wheel: churn bindings + scenario builders + run()/compare().
  • Joiner-stall modeling: late joiners pay a bandwidth-modeled state fetch (joiner_stall_time).
  • Real-stack validation: a torch.distributed/gloo DiLoCo harness (integration/) runs both policies on real code; the sim's straggler win reproduces directionally (see integration/results/REPORT.md).
  • CI gate: GitHub Actions runs cargo test/clippy, the determinism proptest, golden-trace diffs, and the Python smoke tests on every push.
  • Deferred: pseudo-gradient normalization-by-participant-count correctness; sidelined-node re-sync; parameter sweeps, published wheels, and plotting (see ROADMAP.md).
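For the joiner-stall item in the list above, a bandwidth-modeled state fetch is commonly approximated as a fixed latency plus transfer time. The sketch applies that approximation to the example scenario's values. The formula itself, the reading of `bandwidth_bpus` as bytes per microsecond, and the function name are assumptions, not taken from the engine.

```python
def toy_joiner_stall_us(base_latency, state_bytes, bandwidth_bpus):
    """Assumed model: one-way latency plus time to transfer the state, in µs."""
    transfer_us = -(-state_bytes // bandwidth_bpus)  # ceiling division
    return base_latency + transfer_us


# Values from the example scenario: base_latency=100, state_bytes=100, bandwidth_bpus=10.
stall = toy_joiner_stall_us(base_latency=100, state_bytes=100, bandwidth_bpus=10)
print(stall)  # 110
```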

See ROADMAP.md for the mission and the phased plan (calibration + real-PCCL validation, rigor/sweeps, adoption, research surface). Design specs and implementation plans for shipped work live under docs/superpowers/.

Citation

If you use churn or its straggler policy, please cite the accompanying paper (archived on Zenodo, DOI 10.5281/zenodo.20574905):

@misc{chittori2026stragglerpolicy,
  author    = {Chittori, Prajjwal},
  title     = {{StragglerPolicy: Straggler-Aware Elastic Membership for Decentralized Training}},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20574905},
  url       = {https://doi.org/10.5281/zenodo.20574905}
}

License

Dual-licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this work by you, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.