experiments: Stage-2 offline metrics for the committed TD-MPC2 cells (keystone) by Denis-hamon · Pull Request #56 · Denis-hamon/world-model-eval-lab

Denis-hamon · 2026-06-14T16:38:47Z

What

Stage 2 of the offline->downstream keystone, and its discriminating test. The maze proof of concept (PR #55) lives in a regime where prediction and decision quality move together, so the naive one-step metric and the decision-aware ranking metric predict downstream success about equally. This adds the sharper test on the real cells: across the committed TD-MPC2 world models, does a decision-aware offline metric track the downstream Counterfactual Planning Gap where a naive prediction-error metric does not?

experiments/offline_downstream/tdmpc2_offline_metrics.py loads each committed TD-MPC2 cell's trained latent dynamics on CPU and computes, against the matched oracle on a fixed seeded state sample:

M1 one-step L2 error (naive foil)
M2 k-step open-loop L2 divergence (compounding)
M3 action-ranking agreement (decision-aware, unitless Kendall tau over the env score of each action's predicted successor)

then pairs them with that cell's committed downstream CPG (gap). The bundle feeds the already-merged correlate.py (--downstream gap).
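For readers without the script at hand, the three metrics can be sketched as below. This is a minimal stdlib-only illustration, not the code in tdmpc2_offline_metrics.py: the function names, the choice of Kendall tau-a, and the use of the final-step distance for M2 are all assumptions made for the example.

```python
import math
from itertools import combinations
from typing import Callable, Sequence

State = Sequence[float]
Action = Sequence[float]
Dynamics = Callable[[State, Action], State]  # (state, action) -> next_state


def l2(a: State, b: State) -> float:
    """Euclidean distance between two states of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_step_error(model: Dynamics, oracle: Dynamics,
                   states: Sequence[State], actions: Sequence[Action]) -> float:
    """M1 (assumed form): mean one-step L2 error over paired (state, action) samples."""
    errs = [l2(model(s, a), oracle(s, a)) for s, a in zip(states, actions)]
    return sum(errs) / len(errs)


def k_step_divergence(model: Dynamics, oracle: Dynamics,
                      state: State, action_seq: Sequence[Action]) -> float:
    """M2 (assumed form): L2 gap after rolling both open-loop on the same actions."""
    s_model, s_oracle = state, state
    for a in action_seq:
        s_model = model(s_model, a)    # the model feeds on its own predictions
        s_oracle = oracle(s_oracle, a)
    return l2(s_model, s_oracle)


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall tau-a: tied pairs contribute zero to the numerator."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two items to rank")
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


def ranking_agreement(model: Dynamics, oracle: Dynamics, state: State,
                      candidate_actions: Sequence[Action],
                      score: Callable[[State], float]) -> float:
    """M3 (assumed form): tau between action scores under model vs oracle successors."""
    model_scores = [score(model(state, a)) for a in candidate_actions]
    oracle_scores = [score(oracle(state, a)) for a in candidate_actions]
    return kendall_tau(model_scores, oracle_scores)
```

A toy case shows why M1 and M3 can disagree: a model that adds a constant offset to the oracle's successor has nonzero M1 and growing M2, yet ranks every candidate action exactly as the oracle does under a monotone score, so its M3 is 1.0.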

Compute / gating

CPU, not GPU. TD-MPC2 inference defaults to CPU and the planner is not run here -- only the (state, action) -> next_state callable is queried.
Checkpoint-gated. The .pt weights are gitignored and not in the repo. Cells without their checkpoint are reported as skipped, so the script runs on a bare checkout and names exactly which files are missing. Run it where the checkpoints live (the training box, or after scp-ing them into results/dmc_*/).
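The skip path can be pictured roughly as follows. This is a sketch under assumptions: the function name, the report keys, and the example paths are hypothetical and do not reflect the script's actual output schema.

```python
import json
from pathlib import Path


def partition_cells(cells: dict[str, Path]) -> dict:
    """Split cells into runnable ones and skipped ones, naming each missing file."""
    runnable, skipped = [], []
    for name, ckpt in sorted(cells.items()):
        if ckpt.is_file():
            runnable.append(name)
        else:
            # Report the exact path so the operator knows what to copy over.
            skipped.append({"cell": name, "missing": str(ckpt)})
    return {"cells": runnable, "skipped": skipped}


if __name__ == "__main__":
    # Hypothetical cell names and checkpoint locations, for illustration only.
    example = {
        "cell_a": Path("checkpoints_example/cell_a.pt"),
        "cell_b": Path("checkpoints_example/cell_b.pt"),
    }
    print(json.dumps(partition_cells(example), indent=2))
```

On a checkout without the weights, every cell lands in the skipped list and the output is still valid JSON, which is the behavior described above.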

Scale confound (pre-empting the maze-PoC review point)

M1/M2 are L2 distances in each env's own state space and are NOT comparable across environments; pooling Cartpole and Reacher M1/M2 would unfairly handicap them against the unitless, rank-based M3 -- the same apples-to-oranges trap flagged on the maze PoC. The fair head-to-head is within an environment, so the script prints a within-env Spearman for every env with >=3 cells (Cartpole has the widest gap spread, 6 cells); pooled cross-env correlate.py output is valid only for M3.
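The within-env comparison amounts to grouping cells by environment and rank-correlating each metric with the gap inside each group. The sketch below assumes a hypothetical row layout (dicts with "env", a metric key, and "gap"); the script's real data structures may differ.

```python
from collections import defaultdict
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(xs, ys):
    """Pearson correlation; NaN when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman rho as the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def within_env_spearman(rows, metric, min_cells=3):
    """Per-env Spearman of `metric` vs gap; envs below min_cells are omitted."""
    by_env = defaultdict(list)
    for row in rows:
        by_env[row["env"]].append(row)
    return {
        env: spearman([r[metric] for r in cells], [r["gap"] for r in cells])
        for env, cells in by_env.items()
        if len(cells) >= min_cells
    }
```

Because the correlation is computed on ranks within one environment, the scale of that environment's state space drops out, which is what makes the M1/M2 versus M3 comparison fair there and unfair when pooled.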

Testing

Pure metric helpers are stdlib-only at module scope (env/adapter/torch imports deferred), unit-tested with synthetic callables and closed-form expectations (tests/test_tdmpc2_offline_metrics.py, 10 tests).
Full suite green; graceful skip path verified on this checkout (0 cells, 10 skipped, valid JSON).
Adversarial review: GO, no must-fixes; the two flagged nits (stale README schema snippet, a wasted final sim step in state collection) are applied.
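A test in the "synthetic callable, closed-form expectation" style might look like the sketch below. The helper one_step_l2 is a stand-in written for this example, not the function under test in tests/test_tdmpc2_offline_metrics.py.

```python
import math
import unittest


def one_step_l2(model, oracle, pairs):
    """Stand-in helper: mean L2 distance between model and oracle successors."""
    dists = [math.dist(model(s, a), oracle(s, a)) for s, a in pairs]
    return sum(dists) / len(dists)


class OneStepL2Test(unittest.TestCase):
    # Synthetic oracle: shift the first coordinate by the action.
    oracle = staticmethod(lambda s, a: (s[0] + a[0], s[1]))
    pairs = [((0.0, 0.0), (1.0,)), ((2.0, -1.0), (0.5,))]

    def test_constant_offset_gives_offset_norm(self):
        # The model is off by (3, 4) everywhere, so the error is exactly 5.
        model = lambda s, a: (s[0] + a[0] + 3.0, s[1] + 4.0)
        self.assertAlmostEqual(one_step_l2(model, self.oracle, self.pairs), 5.0)

    def test_perfect_model_gives_zero(self):
        self.assertEqual(one_step_l2(self.oracle, self.oracle, self.pairs), 0.0)
```

No environment, adapter, or torch import is needed for a test of this shape, which is why deferring those imports keeps the helpers testable on a bare checkout.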

No new git tag; core wmel untouched (the script lives under experiments/). The full sweep is run on the box where the checkpoints live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…(keystone)

The maze proof of concept sits in a regime where prediction and decision quality move together, so the naive one-step metric and the decision-aware ranking metric predict downstream success about equally. This adds the discriminating test on the real cells: across the committed TD-MPC2 world models, does a decision-aware offline metric track the downstream CPG where a naive prediction-error metric does not?

tdmpc2_offline_metrics.py loads each committed TD-MPC2 cell's trained latent dynamics on CPU and computes, against the matched oracle on a fixed seeded state sample: M1 one-step L2 (naive foil), M2 k-step L2 divergence (compounding), M3 action-ranking agreement (decision-aware, unitless Kendall tau), then pairs them with that cell's committed downstream CPG (gap). Output feeds the merged correlate.py.

Compute is CPU (TD-MPC2 inference defaults to CPU; the planner is not run, only the dynamics callable is queried) and checkpoint-gated (the .pt weights are gitignored). Cells without their checkpoint are reported as skipped, so the script runs on a bare checkout and names the missing files.

Scale confound: M1/M2 are L2 in each env's own state space and are not comparable across envs; pooling them would unfairly handicap them against the unitless M3 (the apples-to-oranges trap the maze PoC review flagged). The fair head-to-head is within an environment, so the script prints a within-env Spearman for every env with >=3 cells; pooled cross-env output is valid only for M3.

Pure metric helpers are stdlib-only at module scope (heavy env/adapter/torch imports deferred) and unit-tested with synthetic callables and closed-form expectations; the full sweep is run on the box where the checkpoints live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Denis-hamon merged commit 2866477 into main Jun 14, 2026
4 checks passed

Denis-hamon merged 1 commit into main from feat-tdmpc2-offline-metrics
