Releases · Denis-hamon/world-model-eval-lab

16 May 22:20

v0.11.0

1abe243

v0.11.0 - Multi-seed CPG sweep: capacity vs coverage

Latest

Multi-seed CPG sweep ships

The multi-seed extension promised in v0.9 and outlined as future work in the v0.10 paper now ships. We pooled three seeds at 50 episodes per arm per seed (n = 150 pooled) and swept the MLP's training-set size by a factor of 100 across {200, 2 000, 20 000} random-policy transitions on DMC Acrobot-swingup.

Headline result

In every cell:

| Quantity | Value |
| --- | --- |
| Oracle success | 40/150 = 0.267 |
| Learned success | 0/150 = 0.000 |
| Raw CPG | +0.267 |
| Agresti-Caffo 95% CI | [+0.191, +0.335] |
| Verdict | MODEL BOTTLENECK |

Validation MSE drops by ~150 times across the sweep (0.0651 → 0.0004); the learned planner's success rate stays at exactly zero across all 450 learned-arm benchmark episodes. The verdict has hardened from INCONCLUSIVE at n = 10 (the v0.10 paper headline) to a confident MODEL BOTTLENECK at n = 150 pooled.
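As a worked check of the interval in the headline table, here is the arithmetic under the standard Agresti-Caffo plus-4 construction (add one success and one failure to each arm, then apply the Wald formula with $z \approx 1.96$). The tilde notation is introduced here for illustration and is not taken from the paper:

$$
\tilde p_o = \frac{40 + 1}{150 + 2} \approx 0.26974, \qquad \tilde p_l = \frac{0 + 1}{150 + 2} \approx 0.00658
$$

$$
\mathrm{SE} = \sqrt{\frac{\tilde p_o (1 - \tilde p_o)}{152} + \frac{\tilde p_l (1 - \tilde p_l)}{152}} \approx 0.03659
$$

$$
(\tilde p_o - \tilde p_l) \pm 1.96\,\mathrm{SE} \approx 0.26316 \pm 0.07172 \approx [+0.191, +0.335]
$$

Note that the interval is centred on the adjusted difference (about 0.263), not on the raw CPG of 0.267.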

What the paper now says (Sections 5.5 + 5.6)

A flat CPG curve under a monotonically improving prediction loss separates a model-capacity bottleneck (refuted: the MLP fits the training distribution to a validation MSE of 4e-4 at 20 000 transitions) from a data-coverage bottleneck (consistent: random rollouts in Acrobot never reach the upright regime, so the model is extrapolating during planning).

Planner-side and score-function residuals are acknowledged as plausible second-order contributors. The recommended remediation is to change the data-collection policy (energy-aware exploration, relabelled trajectories), not to grow the model.

Adversarial-review findings addressed pre-tag

F1 (CRITICAL) — CI rounding. 0.33486... rounds to 0.33 at two decimals, not 0.34. The abstract and four tables read [+0.19, +0.34] in the first draft; all swept to [+0.191, +0.335] (3dp, matches paper convention).
F2 (MAJOR) — "~170×" overclaim. Real ratio is 0.0651 / 0.000433 ≈ 150.3. Every "170" (LaTeX \times, plain text, HTML ×) changed to ~150.
F3 (MAJOR) — "Unambiguously" softened to "most parsimoniously"; added a paragraph in Section 5.6 acknowledging planner-side and score-function residuals as plausible second-order contributors.
F10 (MINOR) — Two (current) tags in experiments/dmc_acrobot/README.md (v0.8 and v0.9 both said current). Only v0.11 stays current.

Reproducibility

pip install -e ".[dev,control,learned]"
python -m experiments.dmc_acrobot.cpg_sweep \
    --data-sizes 200,2000,20000 --seeds 0,1,2 --episodes 50
# -> results/dmc_acrobot/cpg_sweep.json (about 6 hours on CPU)
python paper/build_figures.py    # prints Table 1 and Table 2 LaTeX values

Non-affiliation

This release is independent. It is not affiliated with the AMI program at Meta, the LeWorldModel project, the authors of any cited paper, or any current or past employer of the author.

114 tests pass on Python 3.11 / 3.12.


16 May 12:50

Denis-hamon

v0.10.0

ba7c682

v0.10.0 - Short paper: Counterfactual Planning Gap

Short paper: Counterfactual Planning Gap

The methodological contribution that has been accumulating across the v0.6 / v0.8 / v0.9 releases now has a short paper to point at.

Title. Counterfactual Planning Gap: A Decision-Grade Metric for Decoupling Model Error from Planner Capacity in World Model Evaluation.

What ships in paper/:

main.tex — ~7-page LaTeX source, plain article class. Sections: introduction, related work, the four-method evaluation contract + decision-grade taxonomy, the CPG definition with Agresti-Caffo plus-4 CI and the five-branch gated verdict, an empirical study on DMC Acrobot-swingup, discussion / limitations.
references.bib — 23 entries spanning latent-dynamics world models (Dreamer-V3, MuZero, IRIS, Genie, PlaNet, TD-MPC2), the JEPA line (LeCun 2022, I-JEPA, V-JEPA, V-JEPA 2), model exploitation in offline MBRL (MOPO, MOReL), evaluation methodology (Wilson 1927, Agresti-Caffo 2000, Newcombe 1998, Henderson 2018, rliable), and benchmarks (DMC, OGBench, LIBERO, Schäfer-Acrobot).
Makefile — pdflatex + bibtex build chain.
build_figures.py — stdlib script that prints the LaTeX-ready table values directly from results/dmc_acrobot/cpg.json so the paper's numbers stay reproducible from a clean checkout.
README.md — contents, reproducibility, non-affiliation.

Headline empirical result (DMC Acrobot-swingup, n = 10 per arm, seed 0):

|  | Oracle dynamics | Learned MLP |
| --- | --- | --- |
| Success rate | 0.30 (3/10) | 0.00 (0/10) |
| Avg. steps to success | 180.7 | n/a |
| Per-call planning latency (ms) | 77.3 | 65.3 |
| Compute per decision (rollout-units) | 407.1 | 157.3 |

| Counterfactual Planning Gap |  |
| --- | --- |
| Raw $\hat\Delta$ | +0.300 |
| Agresti-Caffo 95% CI | [-0.059, +0.559] |
| Verdict | INCONCLUSIVE |

The paper argues that a metric that honestly returns INCONCLUSIVE at small n is showing correctly calibrated behaviour, not a defect, and outlines what a multi-seed extension would change.

Adversarial-review findings addressed pre-tag

F1 (major) — dropped misplaced \citep{wilson1927,agresticaffo2000} on the effective planning horizon sentence (those citations support proportion CIs, not the EPH metric).
F2 (minor) — Table 1 Learned compute/decision was 157.4; the JSON source and build_figures.py both say 157.3. Fixed.
F3 (minor) — abstract said ~200-line addition; the actual CPG-specific diff is ~160 lines. Reworded.

Other repo changes

CITATION.cff bumped to 0.10.0, with a preferred-citation block pointing at the paper.
README.md — new Paper section above Related Work. Roadmap refreshed so v0.7 / v0.8 / v0.9 descriptions match what the tags actually shipped, and v0.10 is the current entry.

Non-affiliation

This paper is independent. It is not affiliated with the AMI (Advanced Machine Intelligence) program at Meta, the LeWorldModel project, the authors of any cited paper, or any current or past employer of the author.

114 tests pass on Python 3.11/3.12.


16 May 12:33

Denis-hamon

v0.9.0

6ea3acf

v0.9.0 - Counterfactual Planning Gap metric

Phase 3 of the post-honesty-critique roadmap. The framework's first novel decision-grade metric: a scalar that decomposes "model error vs planner capacity" on a per-env basis by running the same planner twice on the same benchmark with different dynamics callables.

What's new

`wmel.metrics.CPGResult` / `counterfactual_planning_gap` / `cpg_verdict`: the metric, packaged as a scalar with a 95% Agresti-Caffo CI on the gap and a five-branch verdict (MODEL BOTTLENECK / LEARNED OUTPERFORMS ORACLE / PLANNER BOTTLENECK / MODEL AS GOOD AS ORACLE / INCONCLUSIVE), gated on the AC lower/upper bound rather than the raw point estimate.
`wmel.envs.dmc_acrobot.make_acrobot_oracle_dynamics`: a side-effect-free `(state, action) -> next_state` callable backed by the real Acrobot physics. Reproduces `env.step()` to numerical precision across a 50-step swept random rollout (verified in tests).
`experiments/dmc_acrobot/cpg.py`: end-to-end pipeline producing `results/dmc_acrobot/cpg.json` with the committed full-run result.
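The idea of running the same planner twice with different dynamics callables can be sketched on a toy problem. This is a minimal illustration, not the code in `wmel`: `random_shooting_plan`, `true_step`, `wrong_sign_model`, and `execute` are hypothetical names, the 1-D environment is invented for the example, and only the 50-candidate / 15-step defaults come from the release notes.

```python
import random
from typing import Callable, List, Sequence

# A dynamics callable maps (state, action) -> next_state.
Dynamics = Callable[[int, int], int]


def random_shooting_plan(
    state: int,
    dynamics: Dynamics,
    score: Callable[[int], float],
    actions: Sequence[int],
    num_candidates: int = 50,
    horizon: int = 15,
    seed: int = 0,
) -> List[int]:
    """Sample action sequences, roll each out through `dynamics`, keep the best."""
    rng = random.Random(seed)
    best_seq: List[int] = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        seq = [rng.choice(actions) for _ in range(horizon)]
        s = state
        for a in seq:
            s = dynamics(s, a)
        value = score(s)
        if value > best_score:
            best_seq, best_score = seq, value
    return best_seq


def true_step(state: int, action: int) -> int:
    """Toy 1-D environment (hypothetical): the state moves by the action."""
    return state + action


def wrong_sign_model(state: int, action: int) -> int:
    """Stand-in for a badly learned model: predicts the opposite direction."""
    return state - action


def execute(plan: Sequence[int], start: int = 0) -> int:
    """Run a plan open-loop in the true environment and return the final state."""
    s = start
    for a in plan:
        s = true_step(s, a)
    return s


def score(s: int) -> float:
    """Higher final state is better in this toy task."""
    return float(s)


# Same planner, same seed (hence the same candidate sequences), different dynamics.
oracle_plan = random_shooting_plan(0, true_step, score, actions=(-1, 1))
learned_plan = random_shooting_plan(0, wrong_sign_model, score, actions=(-1, 1))
oracle_outcome = execute(oracle_plan)
learned_outcome = execute(learned_plan)
print("oracle arm:", oracle_outcome, "learned arm:", learned_outcome)
```

Because both arms share the planner, the score function, and the sampled candidates, any difference in real outcome is attributable to the dynamics callable alone, which is the comparison the gap is built on.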

The committed result, honestly stated

On DMC Acrobot-swingup, 10 episodes per arm, random-shooting MPC (50 candidates x 15-step horizon):

```
raw CPG = +0.300
AC 95% CI = [-0.059, +0.559] (crosses zero)
oracle: 30% success
learned MLP: 0% success
verdict = INCONCLUSIVE
```

The point estimate is positive and large, suggesting a model bottleneck. But with only 10 episodes per arm and one planner reporting 0/10, the Agresti-Caffo CI on the gap crosses zero. The honest verdict at $n = 10$ is INCONCLUSIVE.

This is exactly what a well-calibrated metric should do at small sample sizes. A v1.0 multi-seed sweep will either push the CI lower bound above zero (confirming MODEL BOTTLENECK) or flip the diagnosis. Either outcome is a real research finding that this framework will produce reproducibly.

Why Agresti-Caffo and not Wald

An earlier draft used the Wald CI $\mathrm{gap} \pm z\sqrt{p_o(1-p_o)/n_o + p_l(1-p_l)/n_l}$, where $p_o$ and $p_l$ are the oracle and learned success rates and $n_o$, $n_l$ are the episode counts per arm. At $p_l = 0$ the learned-arm term vanishes and the Wald variance collapses to $p_o(1-p_o)/n_o$, producing a falsely tight CI of $[+0.016, +0.584]$ that does not cross zero. The pre-tag adversarial review caught this; AC's plus-4 adjustment fixes it.
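The degeneracy can be reproduced in a few lines. The sketch below assumes the textbook forms of both intervals (Wald on the raw proportions; Agresti-Caffo as Wald after adding one success and one failure to each arm). The function names `wald_ci` and `agresti_caffo_ci` are illustrative and are not the `wmel.metrics` API.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def _z(level: float) -> float:
    """Two-sided normal quantile, e.g. about 1.96 for level=0.95."""
    return NormalDist().inv_cdf(0.5 + level / 2)


def wald_ci(x_o: int, n_o: int, x_l: int, n_l: int, level: float = 0.95) -> Tuple[float, float]:
    """Wald interval for p_o - p_l on the raw proportions."""
    p_o, p_l = x_o / n_o, x_l / n_l
    se = sqrt(p_o * (1 - p_o) / n_o + p_l * (1 - p_l) / n_l)
    gap = p_o - p_l
    return gap - _z(level) * se, gap + _z(level) * se


def agresti_caffo_ci(x_o: int, n_o: int, x_l: int, n_l: int, level: float = 0.95) -> Tuple[float, float]:
    """Plus-4 interval: add one success and one failure per arm, then Wald."""
    p_o, p_l = (x_o + 1) / (n_o + 2), (x_l + 1) / (n_l + 2)
    se = sqrt(p_o * (1 - p_o) / (n_o + 2) + p_l * (1 - p_l) / (n_l + 2))
    gap = p_o - p_l
    return gap - _z(level) * se, gap + _z(level) * se


# v0.9 counts: oracle 3/10, learned 0/10.
wald = wald_ci(3, 10, 0, 10)
ac = agresti_caffo_ci(3, 10, 0, 10)
print("Wald:", [round(v, 3) for v in wald], "excludes zero:", wald[0] > 0)
print("AC:  ", [round(v, 3) for v in ac], "excludes zero:", ac[0] > 0)
```

With the learned arm at 0/10, the Wald interval excludes zero while the plus-4 interval does not, which is why gating the verdict on the AC bounds changes the outcome at this sample size.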

Adversarial review caught and fixed (2 majors, 7 minors)

Wald CI degeneracy, verdict not gated on CI, version stamp drift, missing verdict tests, oracle round-trip tested at only one state, smoke verdict misleading, novelty claim too strong vs MOPO/MOReL, headline buried in JSON, score function missing regression test. All addressed before tag.

Tests

114 passing on Python 3.11/3.12/3.13 + a stdlib-only job. DMC tests skip on 3.13 (labmaze upstream Bazel issue).

Quickstart

```bash
git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev,control,learned]"
python -m experiments.dmc_acrobot.cpg
```

Takes about 5-10 minutes on CPU. Writes `results/dmc_acrobot/cpg.json`.


16 May 08:49

Denis-hamon

v0.8.0

51db7bc

v0.8.0 - Learned MLP world model on DMC Acrobot

Phase 2 of the post-honesty-critique roadmap. The first non-toy environment (DMC Acrobot-swingup, shipped in v0.7 Phase 1) now has a learned-dynamics adapter on top.

What's new

Markovian MLP world model in PyTorch, trained on random rollouts of Acrobot, plugged into TabularWorldModelPlanner via the existing dynamics= argument. Named MLP rather than GRU because Acrobot is fully observed and Markovian, so no recurrent state is needed.
Task-specific score function acrobot_upright_score derived from the actual DMC observation layout (sin_upper, sin_lower, cos_upper, cos_lower), not the inverted version an earlier draft of this work used.
Two committed scorecards under results/dmc_acrobot/: random baseline and learned baseline. Both report 0% success, both honestly.
CI smoke-tests the learned pipeline on Python 3.11/3.12 in ~5 s.

What's honest

val_mse on held-out transitions ~0.026 (one-step prediction is good).
success_rate = 0 % on swing-up (the planner does not solve the task with this model + this score + 50-candidate / 15-horizon random shooting).
The two metrics decouple. v0.9 will quantify which of (distribution shift, planner capacity, score approximation) is responsible via the Counterfactual Planning Gap metric.

Adversarial review caught (1 critical, before tag)

The score function indexed the flattened DMC observation as (cos_upper, sin_upper, cos_lower, sin_lower) and assumed relative joint angles. DMC actually emits (sin_upper, sin_lower, cos_upper, cos_lower) in world frame. The buggy score collapsed up and down to the same value, so the planner had no signal to swing up. The 0% success rate from the first run was therefore not a distribution-shift finding — it was an unsupported narrative on top of a broken cost function.

Fixed by re-deriving from dm_control/suite/acrobot.py:Physics.orientations, with new test vectors that satisfy sin² + cos² = 1. The corrected experiment was re-run before tagging; the JSON committed in results/ is the post-fix scorecard.
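To make the layout issue concrete, here is a minimal sketch of a score that reads the observation in the order stated above, (sin_upper, sin_lower, cos_upper, cos_lower). It is not the repository's `acrobot_upright_score`: the function names are hypothetical, and the sketch assumes each angle is measured from the upward vertical in world frame, so that cos = +1 means the link points straight up.

```python
import math
from typing import Sequence, Tuple


def orientation_obs(theta_upper: float, theta_lower: float) -> Tuple[float, float, float, float]:
    """Build a test observation in the (sin_upper, sin_lower, cos_upper, cos_lower) order.

    Assumption: angles are world-frame and measured from the upward vertical.
    """
    return (
        math.sin(theta_upper),
        math.sin(theta_lower),
        math.cos(theta_upper),
        math.cos(theta_lower),
    )


def upright_score_sketch(obs: Sequence[float]) -> float:
    """Height proxy for two unit-length links: larger means closer to upright."""
    _sin_upper, _sin_lower, cos_upper, cos_lower = obs[:4]
    return cos_upper + cos_lower


up = orientation_obs(0.0, 0.0)
down = orientation_obs(math.pi, math.pi)
print("up:", upright_score_sketch(up), "down:", upright_score_sketch(down))
```

The point of the test vectors is the one made above: they satisfy sin² + cos² = 1 per link, and a usable score must separate the fully-up pose from the fully-down pose.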

Quickstart

```bash
git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev,control,learned]"
python -m experiments.dmc_acrobot.learned_baseline
```

Takes about 3 minutes on CPU. Writes `results/dmc_acrobot/learned_random_rollouts.json`.

Tests

96 passing on Python 3.11/3.12/3.13 + a stdlib-only job. The DMC tests skip themselves on Python 3.13 because labmaze (transitive dep of dm-control) does not have a 3.13 wheel and its Bazel build fails upstream.


15 May 20:51

Denis-hamon

v0.7.0

7bffaaa

v0.7.0 - CLI, versioned JSON schema, perturbation-aware sweep

Three additions and one CI safety net. Each one was asked for in an external code review and each one is now backed by tests.

What ships

`wmel` console script installed alongside the package:

```bash
pip install -e ".[dev]"
wmel run --env maze_toy --policy tabular-world-model --episodes 30 --output run.json
wmel sweep --env maze_toy --plan-horizons "5,10,15,20,30" --output sweep.json
```

Both subcommands honour the same `Perturbation` library as the Python API. Pass `--perturbation env-default`, `--perturbation drop-next-2`, or `--perturbation composite:env-default+drop-next-3`.

Versioned JSON report envelope. Every report producer (CLI, example scripts, sweeps) now stamps the outer dict with:

```json
{
  "schema_version": "1.0",
  "wmel_version": "0.7.0",
  "generated_at": "2026-05-15T...Z",
  "metadata": { "env": ..., "policy": ..., "perturbation": ..., "seed": ... },
  ...
}
```

Downstream consumers (a future public scoreboard) can dispatch on `schema_version` to handle format evolution. The constant lives in `wmel.report.REPORT_SCHEMA_VERSION` and a helper `report_envelope_metadata()` lets any script stamp the metadata in one line.
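A consumer that dispatches on `schema_version` might look like the sketch below. It is an assumption-laden illustration, not part of `wmel`: `read_report`, `read_v1`, and `READERS` are hypothetical names, dispatching on the major version only is a design choice made for the example, and the sample payload contains only a subset of the envelope fields with illustrative values.

```python
import json
from typing import Any, Callable, Dict


def read_v1(report: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the fields this hypothetical consumer cares about from a 1.x report."""
    return {
        "env": report["metadata"]["env"],
        "policy": report["metadata"]["policy"],
        "wmel_version": report["wmel_version"],
    }


# One reader per major schema version (assumption: minor bumps stay compatible).
READERS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {"1": read_v1}


def read_report(text: str) -> Dict[str, Any]:
    """Parse a report and route it to the reader for its schema version."""
    report = json.loads(text)
    version = str(report.get("schema_version", ""))
    reader = READERS.get(version.split(".")[0])
    if reader is None:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return reader(report)


# Illustrative payload; values are examples, not a committed result.
sample = json.dumps(
    {
        "schema_version": "1.0",
        "wmel_version": "0.7.0",
        "metadata": {"env": "maze_toy", "policy": "tabular-world-model", "seed": 0},
    }
)
print(read_report(sample))
```

Failing loudly on an unknown version, rather than guessing, keeps a scoreboard from silently misreading a future format.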

`horizon_sweep(perturbation=...)` kwarg. The sweep machinery now accepts any `Perturbation` instance, forwarded to `BenchmarkRunner`. Default behaviour is unchanged.

Second CI job: stdlib-only. Installs `.[dev]` without torch, verifies torch is not importable, runs the test suite (learned tests skip via `pytest.importorskip`), and smoke-tests the non-learned example scripts. Locks in the no-torch runtime promise so a future commit cannot silently pull torch into a core import path.

Adversarial review pass before tag

An independent agent reviewed the v0.7 diff and surfaced 1 critical, 4 major, and 3 minor findings. All addressed:

Critical: `wmel` console script crashed with `ModuleNotFoundError: No module named 'examples'` when run from anywhere except the repo root. The toy envs lived under `examples/` which is not part of the installed package. Fixed by moving the env classes into `src/wmel/envs/` (now installable) with a re-export shim under `examples/` for backward compatibility.
Major: schema-version envelope was bypassed by three example scripts and by the CLI sweep (which hardcoded the literal instead of using the constant). All five report producers now use the shared `report_envelope_metadata()` helper.
Major: `wmel run --env two_room_toy --policy greedy` silently scored 0% success because the CLI built `GreedyGridPolicy()` without a waypoint, while the example script always passed the doorway hint. CLI now special-cases the two-room env. Regression test added.
Major: `--plan-horizons "5,a,15"` crashed with a bare `ValueError` traceback. Replaced with a clean `SystemExit` and a message naming the offending token.
Major: `cmd_sweep` had a dead-code policy guard masked by argparse `choices=`; the corresponding test passed for the wrong reason. Dropped the body code; the test now asserts argparse's exit code 2.
Minor: `--perturb-prob` accepted values outside `[0, 1]` silently. Now validated.
Minor: `horizon_sweep(perturbation=...)` reuses the same Perturbation instance across horizon points. Docstring now warns that the Perturbation must be stateless to preserve per-horizon reproducibility.
Minor: README `tests/` count was 64, now 80+.

Tests

81 passing on Python 3.11 / 3.12 / 3.13. Two CI jobs run on every push: the full matrix (with torch) and the stdlib-only job.

Disclaimer

Independent study of evaluation methodology. Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors, and not an artifact of any current or past employer of the author.


15 May 19:33

Denis-hamon

v0.6.0

8c2b92f

v0.6.0 - Proof of contract for learned dynamics

The headline thesis of this framework - that any action-conditioned world model can plug into the same evaluation contract - is now backed by a working learned model.

What's new

A tiny PyTorch MLP trained on 64 maze transitions is a drop-in for `TabularWorldModelPlanner.dynamics`. Same MPC planner, same 30 episodes, same perturbation strategy. Captured output:

|  | Oracle (stdlib) | Learned MLP (PyTorch) |
| --- | --- | --- |
| success rate | 100% | 100% |
| avg steps to success | 33.8 | 33.8 |
| latency / call | 3.12 ms | 236.93 ms |
| compute / decision | ~256 rollout-units | ~256 rollout-units |

Same success, same steps, same nominal compute, but 76 times the per-call latency (236.93 ms vs 3.12 ms). Without measuring latency per call, you would conclude "it works just as well!" while the actual deployment cost is nearly two orders of magnitude higher. That is exactly the trade-off the framework is built to expose.

The horizon sweep extends the picture: success curves overlap at every horizon, while the latency ratio ranges from 62x to 77x depending on plan depth. See the live site for the rendered chart and the captured terminal output.

How it works

`src/wmel/adapters/learned_dynamics_torch.py` (60 lines of model code) ships a 2-layer MLP that predicts (dx, dy) from (state, action).
`train_maze_dynamics(env)` enumerates the env's transition table and memorises it in 800 epochs on CPU (~5 seconds).
`torch_dynamics(model, w, h)` wraps the trained model as a stateless `(state, action) -> next_state` callable.
That callable is passed verbatim as `dynamics=` to the existing `TabularWorldModelPlanner`. The planner, the runner, the metrics, the perturbation library, the horizon sweep - none of them change.

Adversarial review pass

An independent agent reviewed the diff before tagging. Five findings, all addressed:

Major: docstring claimed `collect_transitions` worked on the two-room env when the two-room env did not expose the required attributes - fixed.
Major: latency-ratio language was inconsistent across the policy card, the figcaption, and the SVG annotation, and the annotation used only the rightmost-horizon shortcut - all three now report a per-horizon range computed from the actual data.
Minor: CI cost (torch CPU wheel ~150 MB per matrix entry) - documented in a workflow comment.
Minor: learned sweep used the same `policy_name` as the oracle - now suffixed with `(learned-mlp)`.
Minor: out-of-grid extrapolation behaviour of the wrapper was undocumented - called out in the docstring.

Compatibility

Python 3.11+. Core runtime stays stdlib-only.
New optional dep: `pip install -e ".[learned]"` pulls `torch>=2.0` (CPU). No CUDA required.
64 passing tests on Python 3.11 / 3.12 / 3.13. CI installs torch CPU on every matrix entry.

Quickstart

```bash
git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev,learned]"
python -m examples.maze_toy.run_learned_baseline
```

About 20 seconds on a laptop. No GPU required.

Disclaimer

Independent study of evaluation methodology. Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors, and not an artifact of any current or past employer of the author.


15 May 14:41

Denis-hamon

v0.5.0

a663ef5

v0.5.0 - Pluggable perturbation library

A composable library for perturbation strategies. The "Perturbation Recovery" metric is no longer captive to whatever the environment's perturb() method happens to do.

What's new

wmel.perturbations ships an abstract Perturbation base with two override hooks:

apply_to_env(env) — mutate environment state at the trigger moment.
transform_actions(remaining) — return a (possibly shorter) queue of pending actions.

Plus three concrete subclasses that compose:

EnvPerturbation — delegates to env.perturb(). The runner's default; pre-v0.5 behavior is preserved exactly.
DropNextActions(k) — drops the next k queued actions, forcing the policy to replan. Models actuator drops, network gaps, command debouncing.
CompositePerturbation(*parts) — chains the others. Hook ordering is pinned: all apply_to_env hooks first, then all transform_actions hooks.
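The two hooks and the pinned composite ordering can be illustrated with a toy re-implementation. This is a sketch written for this note, not the source of `wmel.perturbations`: the class bodies, the `ToyEnv` helper, the `name` strings (which mirror the CLI spellings), and the behaviour when `k` exceeds the queue length (here: an empty queue) are all assumptions.

```python
from collections import deque
from typing import Deque


class Perturbation:
    """Base with two override hooks; both default to no-ops."""

    name = "noop"

    def apply_to_env(self, env) -> None:
        """Mutate environment state at the trigger moment."""

    def transform_actions(self, remaining: Deque[str]) -> Deque[str]:
        """Return a fresh (possibly shorter) queue of pending actions."""
        return deque(remaining)


class EnvPerturbation(Perturbation):
    name = "env-default"

    def apply_to_env(self, env) -> None:
        env.perturb()


class DropNextActions(Perturbation):
    def __init__(self, k: int) -> None:
        self.k = k
        self.name = f"drop-next-{k}"

    def transform_actions(self, remaining: Deque[str]) -> Deque[str]:
        return deque(list(remaining)[self.k:])


class CompositePerturbation(Perturbation):
    def __init__(self, *parts: Perturbation) -> None:
        self.parts = parts
        self.name = "composite:" + "+".join(p.name for p in parts)

    def apply_to_env(self, env) -> None:
        for part in self.parts:
            part.apply_to_env(env)

    def transform_actions(self, remaining: Deque[str]) -> Deque[str]:
        for part in self.parts:
            remaining = part.transform_actions(remaining)
        return remaining


class ToyEnv:
    """Hypothetical environment that only records whether it was perturbed."""

    def __init__(self) -> None:
        self.perturbed = False

    def perturb(self) -> None:
        self.perturbed = True


env = ToyEnv()
queue = deque(["up", "up", "right", "right"])
perturbation = CompositePerturbation(EnvPerturbation(), DropNextActions(2))

# Pinned ordering: every apply_to_env hook runs before any transform_actions hook.
perturbation.apply_to_env(env)
queue = perturbation.transform_actions(queue)
print(perturbation.name, env.perturbed, list(queue))
```

A caller that fires the composite therefore sees the environment mutated first and the action queue shortened second, regardless of how the parts are interleaved.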

Runner changes (backward-compatible)

New perturbation: Perturbation | None = None kwarg on BenchmarkRunner. Omitting it gives identical results to passing EnvPerturbation() (locked in by a backward-compat test).
Inner action-execution loop switched to collections.deque + popleft, so action-level perturbations are O(1) per action and don't introduce per-call latency overhead.

Scorecard

Scorecard.perturbation_name: str | None records which perturbation strategy was selected. Two scorecards with the same policy but different perturbations are now distinguishable in JSON, Markdown, and printed output. Example scripts thread perturbation_name="env-default" for honesty about what was configured.

Adversarial review before tag

An independent agent reviewed the diff and surfaced six minor findings, all addressed before this release:

Dead `Perturbation.name` in the reporting path → threaded into Scorecard.
O(n) `list.pop(0)` per action → switched to `deque.popleft`.
Composite ordering unspecified → pinned in docstring and test.
ABC docstring overclaimed enforcement → clarified.
`DropNextActions(k > len)` undocumented → documented.
Fresh-list invariant untested on some paths → tested everywhere.

Tests

59 passing. Includes 12 perturbation-library tests, 3 new runner-correctness tests (backward-compat exact match, queue-shortening + replan, perturbed-only-when-fired with custom perturbation), and 2 markdown-export tests for perturbation_name.

Quickstart

```python
from wmel import (
    BenchmarkRunner,
    CompositePerturbation,
    DropNextActions,
    EnvPerturbation,
    compute_scorecard,
)

runner = BenchmarkRunner(
    env_factory=MyEnv,
    policy=MyPolicy(),
    episodes=30,
    horizon=60,
    perturb_prob=0.3,
    perturbation=CompositePerturbation(EnvPerturbation(), DropNextActions(k=2)),
    seed=0,
)
results = runner.run()
sc = compute_scorecard(results, policy_name="my-policy", perturbation_name=runner.perturbation.name)
```

Disclaimer

Independent research-to-product exploration. Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors.


15 May 14:08

Denis-hamon

v0.4.0

48a01fe

v0.4.0 - Markdown export and compute-per-decision

Two product-oriented additions: Markdown output that drops directly into PR bodies and docs, and the compute side of the latency / horizon / compute trade-off surface, now reported in tabular form.

What's new

Markdown exporters - wmel.report.to_markdown_scorecard, to_markdown_report, and wmel.experiments.to_markdown_horizon_sweep. Output is paste-ready in a pull request description, a Notion page, or a Markdown doc.

Compute per decision - PlannerPolicy.compute_per_plan_call is a class attribute that subclasses set with their estimated cost per plan() call (model forward passes, rollouts, FLOPs). compute_scorecard derives average_compute_per_decision = (compute_per_plan_call * total_plan_calls) / total_steps so a policy that replans more often pays for it proportionally. TabularWorldModelPlanner declares num_candidates * plan_horizon rollout-units per plan() call.

Adversarial review - a fresh review pass before tagging caught a missing compute column in the horizon-sweep Markdown table and a test docstring with the wrong arithmetic. Both fixed before release. The latency / horizon / compute columns now appear together on one row.
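The compute-per-decision formula above is simple enough to check by hand. The sketch below restates it as a standalone function; the function is a hypothetical re-statement rather than the `compute_scorecard` internals, and the candidate count, horizon, plan-call count, and step count are made-up illustrative numbers, not values from any committed run.

```python
def average_compute_per_decision(
    compute_per_plan_call: float, total_plan_calls: int, total_steps: int
) -> float:
    """(compute_per_plan_call * total_plan_calls) / total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return compute_per_plan_call * total_plan_calls / total_steps


# Illustrative planner: 10 candidates x 5-step horizon = 50 rollout-units per plan() call.
per_call = 10 * 5

# Replanning every 4 steps over a 16-step episode: 4 plan() calls.
sparse = average_compute_per_decision(per_call, total_plan_calls=4, total_steps=16)

# Replanning on every step of the same episode: 16 plan() calls.
dense = average_compute_per_decision(per_call, total_plan_calls=16, total_steps=16)

print(sparse, dense)
```

The second case costs four times as much per decision, which is the sense in which a policy that replans more often pays for it proportionally.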

Horizon sweep on the maze toy

plan_h | success |    95% CI    | steps | latency_ms |  95% CI (ms) | compute/dec
-----------------------------------------------------------------------------------
     5 |  0.000  | [0.00, 0.11] |  n/a  |    0.88    | [0.87, 0.89] |    368.3
    10 |  0.900  | [0.74, 0.97] |  31.3 |    1.58    | [1.55, 1.61] |    350.6
    15 |  1.000  | [0.89, 1.00] |  30.5 |    2.35    | [2.34, 2.36] |    278.7
    20 |  1.000  | [0.89, 1.00] |  33.8 |    3.10    | [3.07, 3.12] |    256.4
    30 |  1.000  | [0.89, 1.00] |  41.8 |    4.61    | [4.55, 4.68] |    277.5

The three columns tell one story: success rate plateaus at h=15, per-call latency keeps rising past the plateau, and compute per decision stays within roughly 250-370 rollout-units (because the planner returns approximately plan_horizon actions per call).

Quickstart

git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev]"
pytest                  # 41 passing
python -m examples.maze_toy.run_horizon_sweep

Compatibility

Python 3.11+. Runtime is stdlib-only.
API change: EpisodeResult.planning_latency_ms: float was renamed to planning_latencies_ms: tuple[float, ...] in v0.3.1. v0.4 keeps that shape.
New optional kwarg on compute_scorecard: compute_per_plan_call. Defaults to None - existing callers unaffected.

Disclaimer

Independent research-to-product exploration. Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors.


15 May 13:54

Denis-hamon

v0.3.1

5fe30e1

v0.3.1 - Initial public release

A lightweight, product-oriented benchmark framework for evaluating action-conditioned world models beyond static AI benchmarks. Independent research-to-product exploration.

Not affiliated with AMI, Meta, the LeWorldModel project, or any of their authors. References to JEPA-style or LeWorldModel ideas are conceptual only.

What's here

Two CPU-only toy environments: a two-room grid and a small maze.
A concrete TabularWorldModelPlanner subclass of LeWMAdapterStub that demonstrates the full evaluation contract (encode, rollout, score, plan) end-to-end without any third-party dependency.
horizon_sweep experiment with Wilson and normal confidence intervals.
36 passing tests, including regression tests for the two metric correctness invariants flagged in internal review (per-call planning latency, perturbation accounting).

Quickstart

git clone https://github.com/Denis-hamon/world-model-eval-lab.git
cd world-model-eval-lab
pip install -e ".[dev]"
pytest
python -m examples.maze_toy.run_horizon_sweep

Planning-horizon sweep on the maze toy (CPU)

Horizon sweep: tabular-world-model
  plan_h | success |    95% CI    | steps | latency_ms |  95% CI (ms)
  -----------------------------------------------------------------
       5 |  0.000  | [0.00, 0.11] |  n/a  |    0.88    | [0.87, 0.89]
      10 |  0.900  | [0.74, 0.97] |  31.3 |    1.59    | [1.56, 1.62]
      15 |  1.000  | [0.89, 1.00] |  30.5 |    2.39    | [2.34, 2.44]
      20 |  1.000  | [0.89, 1.00] |  33.8 |    3.09    | [3.08, 3.09]
      30 |  1.000  | [0.89, 1.00] |  41.8 |    4.58    | [4.55, 4.60]

Per-call planning latency grows monotonically with horizon; success rate plateaus at h=15. Beyond the plateau, latency keeps rising while steps-to-success degrades - the "effective planning horizon" in one curve.
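The success-rate intervals in the table can be reproduced with the standard Wilson score interval. The sketch below is an independent re-derivation, not the `horizon_sweep` implementation; `wilson_ci` is a hypothetical name, and n = 30 episodes per horizon is an assumption (it matches the episode count quoted for the maze toy in later releases).

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_ci(successes: int, n: int, level: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Success counts implied by the table's rates under the n = 30 assumption.
for successes in (0, 27, 30):
    lo, hi = wilson_ci(successes, 30)
    print(f"{successes}/30 -> [{lo:.2f}, {hi:.2f}]")
```

Unlike a normal-approximation interval, the Wilson interval stays informative at 0/30 and 30/30, which is why the first and last rows of the table still report a non-degenerate range.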

License

MIT.

