A deterministic, zero-GPU discrete-event simulator of elastic-membership DiLoCo-style training — plus a straggler-aware membership policy that handles the slow-but-alive nodes existing decentralized-training stacks don't.
What it models: time, membership, and coordination — not tensors or gradient math. It is a testbed for policy logic and relative comparison, deliberately not an absolute-throughput predictor.
Decentralized training over the public internet (e.g. Prime Intellect's prime-diloco + PCCL)
already handles dead nodes — heartbeat eviction, elastic join/leave, mid-training onboarding.
Two things are under-served:
- Stragglers — alive but slow. A single slow-but-alive node drags the whole DiLoCo round. There's no adaptive deadline, no partial-participation outer step, no graduated slow-node response.
- No deterministic churn/fault testbed. Reproducing a partition, a crash mid-collective, or a slow-node scenario requires real GPUs and is non-deterministic.
churn builds the reproducible testbed first, then uses it to build and prove the missing
straggler policy.
- Deterministic engine. One logical clock, a `(time, seq)`-ordered event queue, all randomness from a seeded `ChaCha8Rng`. Same seed + scenario ⇒ byte-identical event trace. Enforced by a property test and committed golden traces.
- Pluggable `Policy` seam. The engine owns time/membership/network/events; a policy owns only the round decisions. `BaselinePolicy` mirrors PRIME/PCCL today: barrier waits for everyone, dead nodes evicted by heartbeat, stragglers block. `StragglerPolicy` adds an adaptive per-round deadline (`median + k·MAD` over a rolling arrival-offset history), partial-participation quorum with sidelining, and a graduated slow-node response (transient slow → rejoin; persistently slow → evict).
- A/B harness. Run the same scenario + seed under both policies and measure the delta.
- Python wheel. Build scenarios, run, and compare from Python — Rust does the work.
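
The adaptive deadline and quorum check described for `StragglerPolicy` can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: the class `AdaptiveDeadline`, the function `quorum_met`, and the defaults `k=3.0` and `window=16` are hypothetical names and values chosen for the example.

```python
import math
from collections import deque
from statistics import median


class AdaptiveDeadline:
    """Rolling median + k*MAD deadline over recent arrival offsets (hypothetical helper)."""

    def __init__(self, k: float = 3.0, window: int = 16):
        self.k = k  # assumed default; how far past the median a worker may arrive
        self.history = deque(maxlen=window)  # rolling arrival offsets, in µs

    def observe(self, offset_us: float) -> None:
        self.history.append(offset_us)

    def deadline(self):
        if not self.history:
            return None  # no history yet: the caller would wait for everyone
        med = median(self.history)
        # MAD = median absolute deviation from the median; robust to a single outlier.
        mad = median(abs(x - med) for x in self.history)
        return med + self.k * mad


def quorum_met(arrived: int, members: int, quorum: float = 0.75) -> bool:
    """True when enough members arrived to take a partial-participation outer step."""
    return arrived >= math.ceil(quorum * members)


d = AdaptiveDeadline()
for offset in [2000, 2100, 1900, 2050]:
    d.observe(offset)
print(d.deadline())      # 2175.0
print(quorum_met(3, 4))  # True: 3 of 4 satisfies a 0.75 quorum
```

With this history, a worker arriving at 20,000 µs would miss the 2,175 µs deadline, while the remaining three of four still satisfy a 0.75 quorum.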
On a persistent-straggler scenario (4 workers, one 10× slow), StragglerPolicy vs BaselinePolicy:
| metric | baseline | straggler |
|---|---|---|
| wall-clock | 92,050 µs | 20,053 µs |
| utilization | 0.62 | 0.89 |
| wall-clock speedup | 1× (reference) | 4.59× faster |
Baseline blocks on the straggler every round; the straggler policy sidelines it twice, then evicts it.
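
The "sideline twice, then evict" behavior is an instance of the graduated slow-node response. A minimal sketch of that logic follows, assuming a simple consecutive-miss counter; `SlowNodeTracker` and `max_strikes` are hypothetical names, and the threshold of two is taken only from the outcome observed in this scenario, not from the policy's actual configuration.

```python
from dataclasses import dataclass, field


@dataclass
class SlowNodeTracker:
    """Graduated response: sideline on a miss, evict after repeated misses (hypothetical)."""

    max_strikes: int = 2  # assumed: consecutive misses tolerated before eviction
    strikes: dict = field(default_factory=dict)

    def on_round(self, worker_id: int, met_deadline: bool) -> str:
        if met_deadline:
            # Transient slowness: clear the counter so the worker rejoins normally.
            self.strikes.pop(worker_id, None)
            return "participate"
        count = self.strikes.get(worker_id, 0) + 1
        self.strikes[worker_id] = count
        # Persistently slow: sidelined while under the limit, evicted once past it.
        return "sideline" if count <= self.max_strikes else "evict"


tracker = SlowNodeTracker()
print([tracker.on_round(3, met_deadline=False) for _ in range(3)])
# ['sideline', 'sideline', 'evict']
```

A worker that recovers between misses has its counter reset, so a transient slowdown never accumulates toward eviction in this model.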
crates/
  churn-core/ # the deterministic simulation engine (pure Rust)
  churn-py/ # thin PyO3 bindings → the `churn` Python wheel (maturin)
python/churn/ # pure-Python package: scenario builders, typed results, run()/compare()
scenarios/ # declarative scenario files (the reproducible chaos library)
docs/superpowers/ # design specs + implementation plans
# Run a scenario under the baseline policy (trace → stdout, metrics → stderr)
cargo run -p churn-core --bin run-scenario -- scenarios/persistent_straggler.json
# A/B the same scenario under baseline vs straggler and print the metrics delta
cargo run -p churn-core --bin run-scenario -- --compare scenarios/persistent_straggler.json

Use the library directly:
use churn_core::{Scenario, Simulator, StragglerPolicy, StragglerConfig};
let scenario = Scenario::from_json(&std::fs::read_to_string("scenarios/persistent_straggler.json")?)?;
let mut sim = Simulator::new(scenario);
let mut policy = StragglerPolicy::new(StragglerConfig::default());
sim.run(&mut policy);
println!("{:?}", sim.metrics());

The wheel is built with maturin:
python3 -m venv .venv && . .venv/bin/activate
pip install maturin
maturin develop # build + install `churn` into the venv

import churn
sc = churn.Scenario(
seed=42, inner_steps=2, target_outer_steps=5, horizon=5_000_000,
heartbeat_period=1000, heartbeat_miss_threshold=5,
base_latency=100, bandwidth_bpus=10, state_bytes=100,
workers=[churn.WorkerSpec(id=i, join_at=0, inner_step_mean=1000, inner_step_jitter=0)
for i in range(4)],
injects=[churn.Slow(id=3, at=0, factor=10)],
)
res = churn.run(sc, churn.Straggler(churn.StragglerConfig(quorum=0.75)))
print(res.metrics.wall_clock, res.metrics.utilization())
print(len(res.trace), "events") # trace is a list of typed Event objects
ab = churn.compare(sc, churn.StragglerConfig())
print(f"{ab.wall_clock_speedup:.2f}× faster, +{ab.utilization_gain:.3f} utilization")
# or load a committed scenario
churn.Scenario.from_file("scenarios/persistent_straggler.json")

Build a release wheel with `maturin build --release` (output under `target/wheels/`).
A scenario is declarative JSON — the reproducible workload. Workers have per-inner-step timing;
timed injects model membership/fault events (Crash, Leave, Slow, Restore, Partition,
ClearPartition). Example (scenarios/persistent_straggler.json):
{
"seed": 42,
"workers": [ {"id": 0, "join_at": 0, "inner_step_mean": 1000, "inner_step_jitter": 0}, ... ],
"injects": [ {"op": "Slow", "id": 3, "at": 0, "factor": 10} ],
"inner_steps": 2, "target_outer_steps": 5, "horizon": 5000000,
"heartbeat_period": 1000, "heartbeat_miss_threshold": 5,
"base_latency": 100, "bandwidth_bpus": 10, "state_bytes": 100
}

A crash is a silent death: the worker keeps its membership slot until the heartbeat monitor detects the silence and evicts it (mirroring PRIME). A graceful leave removes it immediately.
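
The difference between a crash and a graceful leave can be made concrete with a small timing sketch. It assumes that eviction fires once `heartbeat_miss_threshold` consecutive heartbeat periods pass in silence; the engine's exact detection time may differ (for example by the phase of the heartbeat), and `slot_freed_at` is a hypothetical helper, not part of `churn`.

```python
def slot_freed_at(kind: str, at_us: int, heartbeat_period: int, miss_threshold: int) -> int:
    """Approximate time (µs) at which a worker's membership slot is released."""
    if kind == "Leave":
        return at_us  # graceful leave: removed immediately
    if kind == "Crash":
        # Assumed detection rule: the slot is held until `miss_threshold`
        # heartbeat periods have elapsed without a heartbeat.
        return at_us + heartbeat_period * miss_threshold
    raise ValueError(f"unsupported inject kind: {kind}")


# Using the example scenario's heartbeat_period=1000 and heartbeat_miss_threshold=5:
print(slot_freed_at("Leave", 3000, 1000, 5))  # 3000
print(slot_freed_at("Crash", 3000, 1000, 5))  # 8000
```

Under this assumption a crashed worker occupies its slot for roughly 5,000 µs longer than one that leaves gracefully, which is the window in which a barrier-style policy keeps waiting on it.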
cargo test --workspace # engine unit tests + determinism property test + golden traces
cargo clippy --workspace --all-targets -- -D warnings
maturin develop && pytest python/tests/ # Python smoke tests (build + end-to-end + determinism)

Determinism is an invariant, not a feature: a proptest asserts identical traces across runs, and golden traces guard against drift (and demonstrate the baseline "one slow node drags the round" pathology that the straggler policy avoids).
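
The determinism invariant rests on two ingredients: a totally ordered event queue and a single seeded source of randomness. The toy simulation below illustrates the idea in Python and is not the engine's code; `simulate` is a hypothetical function, `random.Random` stands in for the seeded `ChaCha8Rng`, and the 1000 µs step with up to 50 µs of jitter is an arbitrary choice.

```python
import heapq
import random


def simulate(seed: int, workers: int = 4, steps: int = 3) -> list:
    """Toy discrete-event loop: events are popped in (time, seq) order."""
    rng = random.Random(seed)  # stand-in for the seeded ChaCha8Rng
    queue, seq, trace = [], 0, []
    for w in range(workers):
        heapq.heappush(queue, (0, seq, w, 0))  # (time, seq, worker, step)
        seq += 1
    while queue:
        time, s, worker, step = heapq.heappop(queue)
        trace.append((time, s, worker, step))
        if step < steps:
            jitter = rng.randint(0, 50)  # all randomness flows through one RNG
            heapq.heappush(queue, (time + 1000 + jitter, seq, worker, step + 1))
            seq += 1
    return trace


# Same seed + same scenario parameters give an identical trace.
assert simulate(42) == simulate(42)
print(len(simulate(42)), "events")  # 16 events
```

The monotonically increasing `seq` breaks ties between events scheduled for the same time, so the pop order never depends on heap internals or insertion accidents.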
- ✅ Phase 1: deterministic engine + `BaselinePolicy` + CLI + golden traces.
- ✅ Phase 2: `StragglerPolicy` + A/B harness (the headline win above).
- ✅ Python wheel: `churn` bindings + scenario builders + `run()`/`compare()`.
- ✅ Joiner-stall modeling: late joiners pay a bandwidth-modeled state fetch (`joiner_stall_time`).
- ✅ Real-stack validation: a `torch.distributed`/gloo DiLoCo harness (`integration/`) runs both policies on real code; the sim's straggler win reproduces directionally (see `integration/results/REPORT.md`).
- ✅ CI gate: GitHub Actions runs `cargo test`/`clippy`, the determinism proptest, golden-trace diffs, and the Python smoke tests on every push.
- ⏳ Deferred: pseudo-gradient normalization-by-participant-count correctness; sidelined-node re-sync; parameter sweeps, published wheels, and plotting (see ROADMAP.md).
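
For intuition on the joiner-stall item, one plausible bandwidth model is a single latency hop plus the serialized transfer time of the state. The formula below is an assumption for illustration, not the engine's definition of `joiner_stall_time`; it also assumes `bandwidth_bpus` means bytes per microsecond, and `estimate_joiner_stall` is a hypothetical helper.

```python
def estimate_joiner_stall(state_bytes: int, bandwidth_bpus: int, base_latency: int) -> int:
    """Estimated stall (µs) for a late joiner fetching the current state."""
    if bandwidth_bpus <= 0:
        raise ValueError("bandwidth_bpus must be positive")
    # Assumed model: transfer time rounded up to whole microseconds, plus one latency hop.
    transfer_us = -(-state_bytes // bandwidth_bpus)  # ceiling division
    return base_latency + transfer_us


# With the example scenario's base_latency=100, bandwidth_bpus=10, state_bytes=100:
print(estimate_joiner_stall(state_bytes=100, bandwidth_bpus=10, base_latency=100))  # 110
```

Under this model the stall grows linearly with state size, so larger models make late joins proportionally more expensive at a fixed bandwidth.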
See ROADMAP.md for the mission and the phased plan (calibration + real-PCCL
validation, rigor/sweeps, adoption, research surface). Design specs and implementation plans for
shipped work live under docs/superpowers/.
If you use churn or its straggler policy, please cite the accompanying paper (archived on Zenodo, DOI 10.5281/zenodo.20574905):
@misc{chittori2026stragglerpolicy,
author = {Chittori, Prajjwal},
title = {{StragglerPolicy: Straggler-Aware Elastic Membership for Decentralized Training}},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.20574905},
url = {https://doi.org/10.5281/zenodo.20574905}
}

Dual-licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this work by you, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.