Releases: Denis-hamon/world-model-eval-lab
v0.11.0 - Multi-seed CPG sweep: capacity vs coverage
Multi-seed CPG sweep ships
The multi-seed extension promised in v0.9 and outlined as future work in the v0.10 paper now ships. We pooled three seeds at 50 episodes per arm per seed (n = 150 pooled) and swept the MLP's training-set size by a factor of 100 across {200, 2 000, 20 000} random-policy transitions on DMC Acrobot-swingup.
Headline result
In every cell:
| Quantity | Value |
|---|---|
| Oracle success | 40/150 = 0.267 |
| Learned success | 0/150 = 0.000 |
| Raw CPG | +0.267 |
| Agresti-Caffo 95% CI | [+0.191, +0.335] |
| Verdict | MODEL BOTTLENECK |
Validation MSE drops by ~150 times across the sweep (0.0651 → 0.0004); the learned planner's success rate stays at exactly zero across all 450 learned-arm benchmark episodes. The verdict has hardened from INCONCLUSIVE at n = 10 (the v0.10 paper headline) to a confident MODEL BOTTLENECK at n = 150 pooled.
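The interval in the table can be checked by hand. The sketch below is a standalone, stdlib-only restatement of the Agresti-Caffo plus-4 interval for a difference of two proportions: add one success and one failure to each arm, then apply the usual normal-approximation formula to the adjusted counts. The helper name `agresti_caffo_ci` is illustrative and is not the `wmel.metrics` API.

```python
from math import sqrt
from statistics import NormalDist


def agresti_caffo_ci(x1, n1, x2, n2, level=0.95):
    """Agresti-Caffo interval for p1 - p2 (illustrative helper, not the wmel API)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Adjusted proportions: one pseudo-success and one pseudo-failure per arm.
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Pooled counts from the table: oracle 40/150, learned 0/150.
oracle_successes, learned_successes, n = 40, 0, 150
raw_gap = oracle_successes / n - learned_successes / n
low, high = agresti_caffo_ci(oracle_successes, n, learned_successes, n)
print(f"raw CPG = {raw_gap:+.3f}, AC 95% CI = [{low:+.3f}, {high:+.3f}]")
```

With the pooled counts this reproduces the raw gap of +0.267 and the interval [+0.191, +0.335]. The interval is centred on the adjusted difference (about +0.263), not on the raw gap.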
What the paper now says (Sections 5.5 + 5.6)
A flat CPG curve under a monotonically improving prediction loss separates a model-capacity bottleneck (refuted: the MLP fits the training distribution to 4e-4 at 20 000 transitions) from a data-coverage bottleneck (consistent: random rollouts in Acrobot never reach the upright regime, so the model is extrapolating during planning).
Planner-side and score-function residuals are acknowledged as plausible second-order contributors. The recommended remediation is to change the data-collection policy (energy-aware exploration, relabelled trajectories), not to grow the model.
Adversarial-review findings addressed pre-tag
- F1 (CRITICAL): CI rounding. `0.33486...` rounds to `0.33` at two decimals, not `0.34`. The abstract and four tables read `[+0.19, +0.34]` in the first draft; all swept to `[+0.191, +0.335]` (3dp, matches paper convention).
- F2 (MAJOR): "~170×" overclaim. Real ratio is `0.0651 / 0.000433 ≈ 150.3`. Every "170" (LaTeX `\times`, plain text, HTML `×`) changed to `~150`.
- F3 (MAJOR): "Unambiguously" softened to "most parsimoniously"; added a paragraph in Section 5.6 acknowledging planner-side and score-function residuals as plausible second-order contributors.
- F10 (MINOR): Two `(current)` tags in `experiments/dmc_acrobot/README.md` (v0.8 and v0.9 both said current). Only v0.11 stays current.
Reproducibility
```bash
pip install -e ".[dev,control,learned]"
python -m experiments.dmc_acrobot.cpg_sweep \
    --data-sizes 200,2000,20000 --seeds 0,1,2 --episodes 50
# -> results/dmc_acrobot/cpg_sweep.json (about 6 hours on CPU)
python paper/build_figures.py  # prints Table 1 and Table 2 LaTeX values
```
Non-affiliation
This release is independent. It is not affiliated with the AMI program at Meta, the LeWorldModel project, the authors of any cited paper, or any current or past employer of the author.
114 tests pass on Python 3.11 / 3.12.
v0.10.0 - Short paper: Counterfactual Planning Gap
Short paper: Counterfactual Planning Gap
The methodological contribution that has been accumulating across the v0.6 / v0.8 / v0.9 releases now has a short paper to point at.
Title. Counterfactual Planning Gap: A Decision-Grade Metric for Decoupling Model Error from Planner Capacity in World Model Evaluation.
What ships in paper/:
- `main.tex`: ~7-page LaTeX source, plain `article` class. Sections: introduction, related work, the four-method evaluation contract + decision-grade taxonomy, the CPG definition with Agresti-Caffo plus-4 CI and the five-branch gated verdict, an empirical study on DMC Acrobot-swingup, discussion / limitations.
- `references.bib`: 23 entries spanning latent-dynamics world models (Dreamer-V3, MuZero, IRIS, Genie, PlaNet, TD-MPC2), the JEPA line (LeCun 2022, I-JEPA, V-JEPA, V-JEPA 2), model exploitation in offline MBRL (MOPO, MOReL), evaluation methodology (Wilson 1927, Agresti-Caffo 2000, Newcombe 1998, Henderson 2018, rliable), and benchmarks (DMC, OGBench, LIBERO, Schäfer-Acrobot).
- `Makefile`: `pdflatex` + `bibtex` build chain.
- `build_figures.py`: stdlib script that prints the LaTeX-ready table values directly from `results/dmc_acrobot/cpg.json` so the paper's numbers stay reproducible from a clean checkout.
- `README.md`: contents, reproducibility, non-affiliation.
Headline empirical result (DMC Acrobot-swingup, n = 10 per arm, seed 0):
| | Oracle dynamics | Learned MLP |
|---|---|---|
| Success rate | 0.30 (3/10) | 0.00 (0/10) |
| Avg. steps to success | 180.7 | n/a |
| Per-call planning latency (ms) | 77.3 | 65.3 |
| Compute per decision (rollout-units) | 407.1 | 157.3 |

| Counterfactual Planning Gap | |
|---|---|
| Raw | +0.300 |
| Agresti-Caffo 95% CI | [-0.059, +0.559] |
| Verdict | INCONCLUSIVE |
The paper argues that a metric returning the honest inconclusive at small n is the correctly-calibrated behaviour, not a defect, and outlines what a multi-seed extension would change.
Adversarial-review findings addressed pre-tag
- F1 (major): dropped misplaced `\citep{wilson1927,agresticaffo2000}` on the effective planning horizon sentence (those citations support proportion CIs, not the EPH metric).
- F2 (minor): Table 1 Learned compute/decision was `157.4`; the JSON source and `build_figures.py` both say `157.3`. Fixed.
- F3 (minor): abstract said ~200-line addition; the actual CPG-specific diff is ~160 lines. Reworded.
Other repo changes
- `CITATION.cff` bumped to 0.10.0, with a `preferred-citation` block pointing at the paper.
- `README.md`: new Paper section above Related Work. Roadmap refreshed so v0.7 / v0.8 / v0.9 descriptions match what the tags actually shipped, and v0.10 is the current entry.
Non-affiliation
This paper is independent. It is not affiliated with the AMI (Advanced Machine Intelligence) program at Meta, the LeWorldModel project, the authors of any cited paper, or any current or past employer of the author.
114 tests pass on Python 3.11/3.12.
v0.9.0 - Counterfactual Planning Gap metric
Phase 3 of the post-honesty-critique roadmap. The framework's first novel decision-grade metric: a scalar that decomposes "model error vs planner capacity" on a per-env basis by running the same planner twice on the same benchmark with different dynamics callables.
What's new
- `wmel.metrics.CPGResult` / `counterfactual_planning_gap` / `cpg_verdict`: the metric, packaged as a scalar with a 95% Agresti-Caffo CI on the gap and a five-branch verdict (MODEL BOTTLENECK / LEARNED OUTPERFORMS ORACLE / PLANNER BOTTLENECK / MODEL AS GOOD AS ORACLE / INCONCLUSIVE), gated on the AC lower/upper bound rather than the raw point estimate.
- `wmel.envs.dmc_acrobot.make_acrobot_oracle_dynamics`: a side-effect-free `(state, action) -> next_state` callable backed by the real Acrobot physics. Reproduces `env.step()` to numerical precision across a 50-step swept random rollout (verified in tests).
- `experiments/dmc_acrobot/cpg.py`: end-to-end pipeline producing `results/dmc_acrobot/cpg.json` with the committed full-run result.
The committed result, honestly stated
On DMC Acrobot-swingup, 10 episodes per arm, random-shooting MPC (50 candidates x 15-step horizon):
```
raw CPG = +0.300
AC 95% CI = [-0.059, +0.559] (crosses zero)
oracle: 30% success
learned MLP: 0% success
verdict = INCONCLUSIVE
```
The point estimate is positive and large, suggesting a model bottleneck. But with only 10 episodes per arm and one planner reporting 0/10, the Agresti-Caffo CI on the gap crosses zero. The honest verdict at n = 10 is INCONCLUSIVE.
This is exactly what a well-calibrated metric should do at small sample sizes. A v1.0 multi-seed sweep will either push the CI lower bound above zero (confirming MODEL BOTTLENECK) or flip the diagnosis. Either outcome is a real research finding that this framework will produce reproducibly.
Why Agresti-Caffo and not Wald
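An earlier draft used the Wald CI, which the adversarial review below flagged as degenerate; the verdict is now gated on the Agresti-Caffo interval instead.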
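To see why the choice of interval matters at this sample size, the sketch below computes both intervals for the committed counts (3/10 oracle, 0/10 learned). It is a stdlib-only illustration with hypothetical helper names, not the code in `wmel.metrics`. An arm at 0/10 contributes zero estimated variance to the Wald formula, so the Wald interval comes out narrower than ten episodes can justify; the plus-4 adjustment gives that arm a non-zero variance.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def wald_ci(x1, n1, x2, n2):
    """Plain Wald interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - Z95 * se, (p1 - p2) + Z95 * se


def agresti_caffo_ci(x1, n1, x2, n2):
    """Wald formula applied after adding one success and one failure per arm."""
    return wald_ci(x1 + 1, n1 + 2, x2 + 1, n2 + 2)


wald = wald_ci(3, 10, 0, 10)
ac = agresti_caffo_ci(3, 10, 0, 10)
print(f"Wald          : [{wald[0]:+.3f}, {wald[1]:+.3f}]")
print(f"Agresti-Caffo : [{ac[0]:+.3f}, {ac[1]:+.3f}]")
```

On these counts the Wald lower bound sits just above zero, which would support a confident verdict from ten episodes, while the Agresti-Caffo interval reproduces the committed [-0.059, +0.559] and crosses zero.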
Adversarial review caught and fixed (2 majors, 7 minors)
Wald CI degeneracy, verdict not gated on CI, version stamp drift, missing verdict tests, oracle round-trip tested at only one state, smoke verdict misleading, novelty claim too strong vs MOPO/MOReL, headline buried in JSON, score function missing regression test. All addressed before tag.
Tests
114 passing on Python 3.11/3.12/3.13 + a stdlib-only job. DMC tests skip on 3.13 (labmaze upstream Bazel issue).
Quickstart
```bash
git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev,control,learned]"
python -m experiments.dmc_acrobot.cpg
```
Takes about 5-10 minutes on CPU. Writes `results/dmc_acrobot/cpg.json`.
v0.8.0 - Learned MLP world model on DMC Acrobot
Phase 2 of the post-honesty-critique roadmap. The first non-toy environment (DMC Acrobot-swingup, shipped in v0.7 Phase 1) now has a learned-dynamics adapter on top.
What's new
- Markovian MLP world model in PyTorch, trained on random rollouts of Acrobot, plugged into `TabularWorldModelPlanner` via the existing `dynamics=` argument. Honestly named MLP not GRU because Acrobot is fully observed and Markov.
- Task-specific score function `acrobot_upright_score` derived from the actual DMC observation layout `(sin_upper, sin_lower, cos_upper, cos_lower)`, not the inverted version an earlier draft of this work used.
- Two committed scorecards under `results/dmc_acrobot/`: random baseline and learned baseline. Both report 0% success, both honestly.
- CI smoke-tests the learned pipeline on Python 3.11/3.12 in ~5 s.
What's honest
- val_mse on held-out transitions ~0.026 (one-step prediction is good).
- success_rate = 0 % on swing-up (the planner does not solve the task with this model + this score + 50-candidate / 15-horizon random shooting).
- The two metrics decouple. v0.9 will quantify which of (distribution shift, planner capacity, score approximation) is responsible via the Counterfactual Planning Gap metric.
Adversarial review caught (1 critical, before tag)
The score function indexed the flattened DMC observation as (cos_upper, sin_upper, cos_lower, sin_lower) and assumed relative joint angles. DMC actually emits (sin_upper, sin_lower, cos_upper, cos_lower) in world frame. The buggy score collapsed up and down to the same value, so the planner had no signal to swing up. The 0% success rate from the first run was therefore not a distribution-shift finding — it was an unsupported narrative on top of a broken cost function.
Fixed by re-deriving from dm_control/suite/acrobot.py:Physics.orientations, with new test vectors that satisfy sin² + cos² = 1. The corrected experiment was re-run before tagging; the JSON committed in results/ is the post-fix scorecard.
Quickstart
```bash
git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev,control,learned]"
python -m experiments.dmc_acrobot.learned_baseline
```
Takes about 3 minutes on CPU. Writes `results/dmc_acrobot/learned_random_rollouts.json`.
Tests
96 passing on Python 3.11/3.12/3.13 + a stdlib-only job. The DMC tests skip themselves on Python 3.13 because labmaze (transitive dep of dm-control) does not have a 3.13 wheel and its Bazel build fails upstream.
v0.7.0 - CLI, versioned JSON schema, perturbation-aware sweep
Three additions and one CI safety net. Each one was asked for in an external code review and each one is now backed by tests.
What ships
`wmel` console script installed alongside the package:
```bash
pip install -e ".[dev]"
wmel run --env maze_toy --policy tabular-world-model --episodes 30 --output run.json
wmel sweep --env maze_toy --plan-horizons "5,10,15,20,30" --output sweep.json
```
Both subcommands honour the same `Perturbation` library as the Python API. Pass `--perturbation env-default`, `--perturbation drop-next-2`, or `--perturbation composite:env-default+drop-next-3`.
Versioned JSON report envelope. Every report producer (CLI, example scripts, sweeps) now stamps the outer dict with:
```json
{
"schema_version": "1.0",
"wmel_version": "0.7.0",
"generated_at": "2026-05-15T...Z",
"metadata": { "env": ..., "policy": ..., "perturbation": ..., "seed": ... },
...
}
```
Downstream consumers (a future public scoreboard) can dispatch on `schema_version` to handle format evolution. The constant lives in `wmel.report.REPORT_SCHEMA_VERSION` and a helper `report_envelope_metadata()` lets any script stamp the metadata in one line.
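A downstream consumer might dispatch on `schema_version` roughly as follows. This is a hypothetical reader, not part of `wmel`: the function names and the choice to reject unknown major versions are assumptions, and the metadata values and timestamp in the example are placeholders. Only the envelope keys come from the example above.

```python
import json


def read_v1(report: dict) -> dict:
    """Pull the envelope fields out of a schema 1.x report."""
    return {
        "wmel_version": report["wmel_version"],
        "generated_at": report["generated_at"],
        "metadata": report.get("metadata", {}),
    }


# Major schema version -> reader. Add an entry when the format evolves.
READERS = {"1": read_v1}


def load_report(text: str) -> dict:
    report = json.loads(text)
    version = str(report.get("schema_version", ""))
    reader = READERS.get(version.split(".")[0])
    if reader is None:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return reader(report)


example = json.dumps({
    "schema_version": "1.0",
    "wmel_version": "0.7.0",
    "generated_at": "2026-05-15T00:00:00Z",  # placeholder timestamp
    "metadata": {"env": "maze_toy", "policy": "tabular-world-model",
                 "perturbation": None, "seed": 0},  # placeholder values
})
summary = load_report(example)
print(summary["wmel_version"], summary["metadata"]["env"])
```

Keying the reader table on the major version lets a consumer keep reading additive minor revisions and still fail loudly on a breaking one.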
`horizon_sweep(perturbation=...)` kwarg. The sweep machinery now accepts any `Perturbation` instance, forwarded to `BenchmarkRunner`. Default behaviour is unchanged.
Second CI job: stdlib-only. Installs `.[dev]` without torch, verifies torch is not importable, runs the test suite (learned tests skip via `pytest.importorskip`), and smoke-tests the non-learned example scripts. Locks in the no-torch runtime promise so a future commit cannot silently pull torch into a core import path.
Adversarial review pass before tag
An independent agent reviewed the v0.7 diff and surfaced 1 critical, 4 major, and 3 minor findings. All addressed:
- Critical: `wmel` console script crashed with `ModuleNotFoundError: No module named 'examples'` when run from anywhere except the repo root. The toy envs lived under `examples/` which is not part of the installed package. Fixed by moving the env classes into `src/wmel/envs/` (now installable) with a re-export shim under `examples/` for backward compatibility.
- Major: schema-version envelope was bypassed by three example scripts and by the CLI sweep (which hardcoded the literal instead of using the constant). All five report producers now use the shared `report_envelope_metadata()` helper.
- Major: `wmel run --env two_room_toy --policy greedy` silently scored 0% success because the CLI built `GreedyGridPolicy()` without a waypoint, while the example script always passed the doorway hint. CLI now special-cases the two-room env. Regression test added.
- Major: `--plan-horizons "5,a,15"` crashed with a bare `ValueError` traceback. Replaced with a clean `SystemExit` and a message naming the offending token.
- Major: `cmd_sweep` had a dead-code policy guard masked by argparse `choices=`; the corresponding test passed for the wrong reason. Dropped the body code; the test now asserts argparse's exit code 2.
- Minor: `--perturb-prob` accepted values outside `[0, 1]` silently. Now validated.
- Minor: `horizon_sweep(perturbation=...)` reuses the same Perturbation instance across horizon points. Docstring now warns that the Perturbation must be stateless to preserve per-horizon reproducibility.
- Minor: README `tests/` count was 64, now 80+.
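The `--plan-horizons` fix in the list above can be sketched as follows. `parse_plan_horizons` is a hypothetical helper written for illustration; the actual CLI code and the wording of its message may differ.

```python
def parse_plan_horizons(raw: str) -> list[int]:
    """Parse a comma-separated list such as "5,10,15" into integers.

    Exits with a message naming the bad token instead of letting a bare
    ValueError traceback reach the user.
    """
    horizons = []
    for token in raw.split(","):
        token = token.strip()
        try:
            horizons.append(int(token))
        except ValueError:
            raise SystemExit(
                f"--plan-horizons: expected comma-separated integers, got {token!r}"
            ) from None
    return horizons


print(parse_plan_horizons("5,10,15,20,30"))
```

Passing a string to `SystemExit` makes the interpreter print it to stderr and exit with status 1, so the user sees one line, not a stack trace.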
Tests
81 passing on Python 3.11 / 3.12 / 3.13. Two CI jobs run on every push: the full matrix (with torch) and the stdlib-only job.
Disclaimer
Independent study of evaluation methodology. Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors, and not an artifact of any current or past employer of the author.
v0.6.0 - Proof of contract for learned dynamics
The headline thesis of this framework - that any action-conditioned world model can plug into the same evaluation contract - is now backed by a working learned model.
What's new
A tiny PyTorch MLP trained on 64 maze transitions is a drop-in for `TabularWorldModelPlanner.dynamics`. Same MPC planner, same 30 episodes, same perturbation strategy. Captured output:
| Oracle (stdlib) | Learned MLP (PyTorch) | |
|---|---|---|
| success rate | 100% | 100% |
| avg steps to success | 33.8 | 33.8 |
| latency / call | 3.12 ms | 236.93 ms |
| compute / decision | ~256 rollout-units | ~256 rollout-units |
Same success, same steps, same nominal compute. 76 times the per-call latency. Without measuring latency per call, you would conclude "it works just as well!" while the actual deployment cost is two orders of magnitude higher. That is exactly the trade-off the framework is built to expose.
The horizon sweep extends the picture: success curves overlap at every horizon, latency curves diverge by 62-77x across horizons depending on plan depth. See the live site for the rendered chart and the captured terminal output.
How it works
- `src/wmel/adapters/learned_dynamics_torch.py` (60 lines of model code) ships a 2-layer MLP that predicts (dx, dy) from (state, action).
- `train_maze_dynamics(env)` enumerates the env's transition table and memorises it in 800 epochs on CPU (~5 seconds).
- `torch_dynamics(model, w, h)` wraps the trained model as a stateless `(state, action) -> next_state` callable.
- That callable is passed verbatim as `dynamics=` to the existing `TabularWorldModelPlanner`. The planner, the runner, the metrics, the perturbation library, the horizon sweep - none of them change.
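The drop-in property rests on the planner depending only on a `(state, action) -> next_state` callable. The sketch below illustrates that idea with a toy random-shooting planner on an open grid. Every name in it (`grid_dynamics`, `lookup_dynamics`, `plan_random_shooting`) is a hypothetical stand-in, not the `wmel` planner or the PyTorch adapter, and a memorised lookup table plays the role of the learned model.

```python
import random
from typing import Callable

State = tuple[int, int]
Dynamics = Callable[[State, int], State]
MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right
W = H = 5


def grid_dynamics(state: State, action: int) -> State:
    """Oracle: exact transition on an open W x H grid, clamped at the walls."""
    dx, dy = MOVES[action]
    x, y = state
    return (min(max(x + dx, 0), W - 1), min(max(y + dy, 0), H - 1))


# Stand-in for a learned model: memorise the full transition table.
TABLE = {((x, y), a): grid_dynamics((x, y), a)
         for x in range(W) for y in range(H) for a in MOVES}


def lookup_dynamics(state: State, action: int) -> State:
    return TABLE[(state, action)]


def plan_random_shooting(dynamics: Dynamics, start: State, goal: State,
                         candidates: int = 64, horizon: int = 8,
                         seed: int = 0) -> list[int]:
    """Sample action sequences, roll each out through `dynamics`, keep the best."""
    rng = random.Random(seed)
    best_plan, best_cost = [], float("inf")
    for _ in range(candidates):
        plan = [rng.randrange(4) for _ in range(horizon)]
        state = start
        for action in plan:
            state = dynamics(state, action)
        # Manhattan distance from the final predicted state to the goal.
        cost = abs(state[0] - goal[0]) + abs(state[1] - goal[1])
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan


plan_oracle = plan_random_shooting(grid_dynamics, (0, 0), (4, 4))
plan_learned = plan_random_shooting(lookup_dynamics, (0, 0), (4, 4))
print(plan_oracle == plan_learned)
```

Because the planner never inspects how `dynamics` is implemented, swapping the callable changes nothing else in the pipeline. Here the two callables agree on every transition, so with the same seed the two plans are identical.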
Adversarial review pass
An independent agent reviewed the diff before tagging. Five findings, all addressed:
- Major: docstring claimed `collect_transitions` worked on the two-room env when the two-room env did not expose the required attributes - fixed.
- Major: latency-ratio language was inconsistent across the policy card, the figcaption, and the SVG annotation, and the annotation used only the rightmost-horizon shortcut - all three now report a per-horizon range computed from the actual data.
- Minor: CI cost (torch CPU wheel ~150 MB per matrix entry) - documented in a workflow comment.
- Minor: learned sweep used the same `policy_name` as the oracle - now suffixed with `(learned-mlp)`.
- Minor: out-of-grid extrapolation behaviour of the wrapper was undocumented - called out in the docstring.
Compatibility
- Python 3.11+. Core runtime stays stdlib-only.
- New optional dep: `pip install -e ".[learned]"` pulls `torch>=2.0` (CPU). No CUDA required.
- 64 passing tests on Python 3.11 / 3.12 / 3.13. CI installs torch CPU on every matrix entry.
Quickstart
```bash
git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev,learned]"
python -m examples.maze_toy.run_learned_baseline
```
About 20 seconds on a laptop. No GPU required.
Disclaimer
Independent study of evaluation methodology. Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors, and not an artifact of any current or past employer of the author.
v0.5.0 - Pluggable perturbation library
A composable library for perturbation strategies. The "Perturbation Recovery" metric is no longer captive to whatever the environment's perturb() method happens to do.
What's new
`wmel.perturbations` ships an abstract `Perturbation` base with two override hooks:
- `apply_to_env(env)`: mutate environment state at the trigger moment.
- `transform_actions(remaining)`: return a (possibly shorter) queue of pending actions.

Plus three concrete subclasses that compose:
- `EnvPerturbation`: delegates to `env.perturb()`. The runner's default; pre-v0.5 behavior is preserved exactly.
- `DropNextActions(k)`: drops the next `k` queued actions, forcing the policy to replan. Models actuator drops, network gaps, command debouncing.
- `CompositePerturbation(*parts)`: chains the others. Hook ordering is pinned: all `apply_to_env` hooks first, then all `transform_actions` hooks.
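The pinned hook ordering can be illustrated with standalone stand-ins. The classes below mimic the two hooks described above but are written from scratch for this note; they are not the `wmel.perturbations` implementations, and `ToyEnv`, `Nudge`, `DropNext`, and `Composite` are hypothetical names.

```python
from collections import deque


class Perturbation:
    """Stand-in base class: both hooks default to no-ops."""

    def apply_to_env(self, env) -> None:
        pass

    def transform_actions(self, remaining: deque) -> deque:
        return deque(remaining)  # hand back a fresh queue, never the input


class DropNext(Perturbation):
    def __init__(self, k: int):
        self.k = k

    def transform_actions(self, remaining: deque) -> deque:
        # In this stand-in, dropping more actions than are queued empties the queue.
        return deque(list(remaining)[self.k:])


class Nudge(Perturbation):
    def apply_to_env(self, env) -> None:
        env.log.append("nudged")


class Composite(Perturbation):
    def __init__(self, *parts: Perturbation):
        self.parts = parts

    def apply_to_env(self, env) -> None:
        for part in self.parts:  # phase 1: every env hook, in order
            part.apply_to_env(env)

    def transform_actions(self, remaining: deque) -> deque:
        for part in self.parts:  # phase 2: every action hook, in order
            remaining = part.transform_actions(remaining)
        return remaining


class ToyEnv:
    def __init__(self):
        self.log = []


env = ToyEnv()
combo = Composite(DropNext(2), Nudge())
combo.apply_to_env(env)  # the caller runs the env phase first
queue = combo.transform_actions(deque(["a", "b", "c", "d"]))
print(env.log, list(queue))
```

Splitting the composite into two passes means the environment is always mutated before any queued action is dropped, regardless of the order in which the parts were listed.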
Runner changes (backward-compatible)
- New `perturbation: Perturbation | None = None` kwarg on `BenchmarkRunner`. Omitting it gives identical results to passing `EnvPerturbation()` (locked in by a backward-compat test).
- Inner action-execution loop switched to `collections.deque` + `popleft`, so action-level perturbations are O(1) per action and don't introduce per-call latency overhead.
Scorecard
`Scorecard.perturbation_name: str | None` records which perturbation strategy was selected. Two scorecards with the same policy but different perturbations are now distinguishable in JSON, Markdown, and printed output. Example scripts thread `perturbation_name="env-default"` for honesty about what was configured.
Adversarial review before tag
An independent agent reviewed the diff and surfaced six minor findings, all addressed before this release:
- Dead `Perturbation.name` in the reporting path → threaded into Scorecard.
- O(n) `list.pop(0)` per action → switched to `deque.popleft`.
- Composite ordering unspecified → pinned in docstring and test.
- ABC docstring overclaimed enforcement → clarified.
- `DropNextActions(k > len)` undocumented → documented.
- Fresh-list invariant untested on some paths → tested everywhere.
Tests
59 passing. Includes 12 perturbation-library tests, 3 new runner-correctness tests (backward-compat exact match, queue-shortening + replan, perturbed-only-when-fired with custom perturbation), and 2 markdown-export tests for perturbation_name.
Quickstart
```python
from wmel import (
    BenchmarkRunner,
    CompositePerturbation,
    DropNextActions,
    EnvPerturbation,
    compute_scorecard,
)

# MyEnv and MyPolicy are placeholders for your own environment class and policy.
runner = BenchmarkRunner(
    env_factory=MyEnv,
    policy=MyPolicy(),
    episodes=30,
    horizon=60,
    perturb_prob=0.3,
    perturbation=CompositePerturbation(EnvPerturbation(), DropNextActions(k=2)),
    seed=0,
)
results = runner.run()
sc = compute_scorecard(
    results,
    policy_name="my-policy",
    perturbation_name=runner.perturbation.name,
)
```
Disclaimer
Independent research-to-product exploration. Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors.
v0.4.0 - Markdown export and compute-per-decision
Two product-oriented additions: Markdown output that drops directly into PR bodies and docs, and the compute side of the latency / horizon / compute trade-off surface now tabular.
What's new
Markdown exporters - `wmel.report.to_markdown_scorecard`, `to_markdown_report`, and `wmel.experiments.to_markdown_horizon_sweep`. Output is paste-ready in a pull request description, a Notion page, or a Markdown doc.
Compute per decision - `PlannerPolicy.compute_per_plan_call` is a class attribute that subclasses set with their estimated cost per `plan()` call (model forward passes, rollouts, FLOPs). `compute_scorecard` derives `average_compute_per_decision = (compute_per_plan_call * total_plan_calls) / total_steps` so a policy that replans more often pays for it proportionally. `TabularWorldModelPlanner` declares `num_candidates * plan_horizon` rollout-units per `plan()` call.
Adversarial review - a fresh review pass before tagging caught a missing compute column in the horizon-sweep Markdown table and a test docstring with the wrong arithmetic. Both fixed before release. The latency / horizon / compute columns now appear together on one row.
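The compute-per-decision arithmetic works out as follows. The numbers are made up for illustration and are not taken from any committed scorecard; the function is a standalone restatement of the formula above, not the `compute_scorecard` implementation.

```python
def average_compute_per_decision(compute_per_plan_call: float,
                                 total_plan_calls: int,
                                 total_steps: int) -> float:
    """(compute_per_plan_call * total_plan_calls) / total_steps."""
    return compute_per_plan_call * total_plan_calls / total_steps


# Hypothetical planner: 100 candidates x 20-step horizon
# = 2000 rollout-units per plan() call.
per_call = 100 * 20

# Replanning every 10 steps over 300 executed steps -> 30 plan() calls.
sparse = average_compute_per_decision(per_call, total_plan_calls=30, total_steps=300)

# Replanning every 5 steps over the same 300 steps -> 60 plan() calls.
dense = average_compute_per_decision(per_call, total_plan_calls=60, total_steps=300)

print(sparse, dense)  # 200.0 400.0
```

Doubling the replanning frequency doubles the per-step cost, which is the proportional penalty the metric is meant to capture.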
Horizon sweep on the maze toy
```
plan_h | success | 95% CI       | steps | latency_ms | 95% CI (ms)  | compute/dec
-----------------------------------------------------------------------------------
     5 |   0.000 | [0.00, 0.11] |   n/a |       0.88 | [0.87, 0.89] |       368.3
    10 |   0.900 | [0.74, 0.97] |  31.3 |       1.58 | [1.55, 1.61] |       350.6
    15 |   1.000 | [0.89, 1.00] |  30.5 |       2.35 | [2.34, 2.36] |       278.7
    20 |   1.000 | [0.89, 1.00] |  33.8 |       3.10 | [3.07, 3.12] |       256.4
    30 |   1.000 | [0.89, 1.00] |  41.8 |       4.61 | [4.55, 4.68] |       277.5
```
Three columns telling the same story: success rate plateaus at h=15, per-call latency keeps rising past the plateau, compute per decision is bounded around 250-370 rollout-units (because the planner returns approximately plan_horizon actions per call).
Quickstart
```bash
git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev]"
pytest  # 41 passing
python -m examples.maze_toy.run_horizon_sweep
```
Compatibility
- Python 3.11+. Runtime is stdlib-only.
- API change: `EpisodeResult.planning_latency_ms: float` was renamed to `planning_latencies_ms: tuple[float, ...]` in v0.3.1. v0.4 keeps that shape.
- New optional kwarg on `compute_scorecard`: `compute_per_plan_call`. Defaults to None - existing callers unaffected.
Disclaimer
Independent research-to-product exploration. Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors.
v0.3.1 - Initial public release
A lightweight, product-oriented benchmark framework for evaluating action-conditioned world models beyond static AI benchmarks. Independent research-to-product exploration.
Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors. References to JEPA-style or LeWorldModel ideas are conceptual only.
What's here
- Two CPU-only toy environments: a two-room grid and a small maze.
- A concrete `TabularWorldModelPlanner` subclass of `LeWMAdapterStub` that demonstrates the full evaluation contract (`encode`, `rollout`, `score`, `plan`) end-to-end without any third-party dependency.
- `horizon_sweep` experiment with Wilson and normal confidence intervals.
- 36 passing tests, including regression tests for the two metric correctness invariants flagged in internal review (per-call planning latency, perturbation accounting).
Quickstart
```bash
git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev]"
pytest
python -m examples.maze_toy.run_horizon_sweep
```
Planning-horizon sweep on the maze toy (CPU)
```
Horizon sweep: tabular-world-model
plan_h | success | 95% CI       | steps | latency_ms | 95% CI (ms)
-----------------------------------------------------------------
     5 |   0.000 | [0.00, 0.11] |   n/a |       0.88 | [0.87, 0.89]
    10 |   0.900 | [0.74, 0.97] |  31.3 |       1.59 | [1.56, 1.62]
    15 |   1.000 | [0.89, 1.00] |  30.5 |       2.39 | [2.34, 2.44]
    20 |   1.000 | [0.89, 1.00] |  33.8 |       3.09 | [3.08, 3.09]
    30 |   1.000 | [0.89, 1.00] |  41.8 |       4.58 | [4.55, 4.60]
```
Per-call planning latency grows monotonically with horizon; success rate plateaus at h=15. Beyond the plateau, latency keeps rising while steps-to-success degrades - the "effective planning horizon" in one curve.
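The success-rate intervals in the table are consistent with Wilson score intervals at 30 episodes per horizon. The episode count is an assumption here (30 is the count quoted for the maze runs in later releases), and `wilson_ci` is an illustrative stdlib helper, not the `wmel` implementation.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


for successes in (0, 27, 30):  # success rates 0.000, 0.900, 1.000 at n = 30
    low, high = wilson_ci(successes, 30)
    print(f"{successes}/30 -> [{low:.2f}, {high:.2f}]")
```

Unlike the normal approximation, the Wilson interval stays informative at 0/30 and 30/30, which is why the h=5 row reports an upper bound of 0.11 and not a zero-width interval.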
License
MIT.