experiments: Stage-2 offline metrics for the committed TD-MPC2 cells (keystone) #56

Merged: Denis-hamon merged 1 commit into main from feat-tdmpc2-offline-metrics on Jun 14, 2026

@Denis-hamon (Owner)

What

Stage 2 of the offline->downstream keystone, and its discriminating test. The maze proof of concept (PR #55) lives in a regime where prediction and decision quality move together, so the naive one-step metric and the decision-aware ranking metric predict downstream success about equally. This adds the sharper test on the real cells: across the committed TD-MPC2 world models, does a decision-aware offline metric track the downstream Counterfactual Planning Gap where a naive prediction-error metric does not?

experiments/offline_downstream/tdmpc2_offline_metrics.py loads each committed TD-MPC2 cell's trained latent dynamics on CPU and computes, against the matched oracle on a fixed seeded state sample:

  • M1 one-step L2 error (naive foil)
  • M2 k-step open-loop L2 divergence (compounding)
  • M3 action-ranking agreement (decision-aware, unitless Kendall tau over the env score of each action's predicted successor)

then pairs them with that cell's committed downstream CPG (gap). The bundle feeds the already-merged correlate.py (--downstream gap).
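
As a rough illustration of the three metrics, here is a minimal stdlib-only sketch. It is not the code in tdmpc2_offline_metrics.py: every function name, the tuple state representation, and the toy `oracle`/`model` callables are assumptions made for the example, and the tau shown is the plain tau-a variant. The toy model is the oracle plus a constant offset, which gives a non-zero M1 and a growing M2 while leaving the action ranking untouched.

```python
import math
from itertools import combinations


def l2(a, b):
    """Euclidean distance between two equal-length state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_step_error(model, oracle, states, actions):
    """M1 sketch: mean L2 gap between model and oracle successors."""
    errs = [l2(model(s, a), oracle(s, a)) for s, a in zip(states, actions)]
    return sum(errs) / len(errs)


def k_step_divergence(model, oracle, state, action_seq):
    """M2 sketch: L2 gap after rolling both callables open-loop."""
    s_model, s_oracle = state, state
    for a in action_seq:
        s_model = model(s_model, a)      # model consumes its own prediction
        s_oracle = oracle(s_oracle, a)   # oracle follows the true dynamics
    return l2(s_model, s_oracle)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    total = 0
    for i, j in combinations(range(n), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        total += (prod > 0) - (prod < 0)
    return total / (n * (n - 1) / 2)


def ranking_agreement(model, oracle, score, state, candidate_actions):
    """M3 sketch: tau between action rankings under model vs oracle."""
    model_scores = [score(model(state, a)) for a in candidate_actions]
    oracle_scores = [score(oracle(state, a)) for a in candidate_actions]
    return kendall_tau(model_scores, oracle_scores)


# Synthetic callables (hypothetical): the model is the oracle plus a bias.
def oracle(s, a):
    return (s[0] + a, s[1])


def model(s, a):
    return (s[0] + a + 0.1, s[1])


if __name__ == "__main__":
    start = (0.0, 0.0)
    print("M1", one_step_error(model, oracle, [start], [1.0]))
    print("M2", k_step_divergence(model, oracle, start, [1.0] * 5))
    print("M3", ranking_agreement(model, oracle, lambda s: s[0],
                                  start, [-1.0, 0.0, 1.0]))
```

In this toy case the constant bias shows up in M1 and compounds in M2, yet M3 reports perfect agreement because the bias never reorders the actions. That is the kind of separation the discriminating test is looking for on the real cells.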

Compute / gating

  • CPU, not GPU. TD-MPC2 inference defaults to CPU and the planner is not run here -- only the (state, action) -> next_state callable is queried.
  • Checkpoint-gated. The .pt weights are gitignored and not in the repo. Cells without their checkpoint are reported as skipped, so the script runs on a bare checkout and names exactly which files are missing. Run it where the checkpoints live (the training box, or after scp-ing them into results/dmc_*/).
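
The checkpoint gate could look roughly like the sketch below. The cell dictionaries, the `checkpoint` key, the example path, and the report layout are all hypothetical; the script's real schema is not shown in this PR. The point is only that a missing file turns into a named skip entry rather than an exception, so the output is still valid JSON on a bare checkout.

```python
import json
from pathlib import Path


def partition_cells(cells, root):
    """Split cells into runnable and skipped by checkpoint presence."""
    runnable, skipped = [], []
    for cell in cells:
        ckpt = Path(root) / cell["checkpoint"]
        if ckpt.is_file():
            runnable.append(cell)
        else:
            # Name the exact missing file so the user knows what to copy over.
            skipped.append({"cell": cell["name"], "missing": str(ckpt)})
    return runnable, skipped


def build_report(cells, root):
    """Assemble a JSON-serialisable summary (hypothetical layout)."""
    runnable, skipped = partition_cells(cells, root)
    return {"cells": [c["name"] for c in runnable], "skipped": skipped}


if __name__ == "__main__":
    # Hypothetical cell; on a bare checkout the file will not exist.
    demo = [{"name": "example_cell",
             "checkpoint": "results/dmc_example/model.pt"}]
    print(json.dumps(build_report(demo, root="."), indent=2))
```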

Scale confound (pre-empting the maze-PoC review point)

M1 and M2 are L2 distances in each env's own state space, so they are NOT comparable across environments. Pooling Cartpole and Reacher M1/M2 would unfairly handicap them against the unitless, rank-based M3, which is the same apples-to-oranges trap flagged on the maze PoC. The fair head-to-head is therefore within an environment: the script prints a within-env Spearman for every env with >=3 cells (Cartpole has the widest gap spread, 6 cells). Pooled cross-env correlate.py output is valid only for M3.
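
A within-env Spearman can be sketched in stdlib Python as below. The row layout (`env`, `gap`, and a metric key), the function names, and the numbers in the demo are assumptions for illustration; the values are synthetic and are not results from any cell.

```python
import math
from collections import defaultdict


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(xs, ys):
    """Pearson correlation; NaN when either side has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rho is Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def within_env_spearman(rows, metric, min_cells=3):
    """Correlate one metric with the gap separately inside each env."""
    by_env = defaultdict(list)
    for row in rows:
        by_env[row["env"]].append(row)
    result = {}
    for env, group in by_env.items():
        if len(group) >= min_cells:  # too few cells: no correlation reported
            result[env] = spearman([g[metric] for g in group],
                                   [g["gap"] for g in group])
    return result


if __name__ == "__main__":
    # Synthetic rows, not real measurements.
    rows = [
        {"env": "cartpole", "m1": 0.1, "gap": 1.0},
        {"env": "cartpole", "m1": 0.2, "gap": 2.0},
        {"env": "cartpole", "m1": 0.3, "gap": 3.0},
        {"env": "reacher", "m1": 5.0, "gap": 0.5},
        {"env": "reacher", "m1": 7.0, "gap": 0.4},
    ]
    print(within_env_spearman(rows, "m1"))
```

Because the correlation is taken on ranks inside a single environment, the state-space scale of M1/M2 drops out, which is what makes the within-env comparison against M3 fair.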

Testing

  • Pure metric helpers are stdlib-only at module scope (env/adapter/torch imports deferred), unit-tested with synthetic callables and closed-form expectations (tests/test_tdmpc2_offline_metrics.py, 10 tests).
  • Full suite green; graceful skip path verified on this checkout (0 cells, 10 skipped, valid JSON).
  • Adversarial review: GO, no must-fixes; the two flagged nits (stale README schema snippet, a wasted final sim step in state collection) are applied.
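
The deferred-import layout described in the first bullet might be arranged along these lines. This is a sketch under assumptions, not the module's actual contents: `load_dynamics`, `mean_one_step_l2`, and the test name are hypothetical, and what a real checkpoint contains is not specified here. Only the stdlib is imported at module scope, so the test runs without torch installed.

```python
import math


def l2(a, b):
    """Pure helper: Euclidean distance, stdlib only."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_one_step_l2(model, oracle, pairs):
    """Pure helper: mean L2 gap over (state, action) pairs."""
    errs = [l2(model(s, a), oracle(s, a)) for s, a in pairs]
    return sum(errs) / len(errs)


def load_dynamics(checkpoint_path):
    """Hypothetical loader; the heavy import happens only when called."""
    import torch  # deferred so importing this module never needs torch

    return torch.load(checkpoint_path, map_location="cpu")


def test_constant_offset_matches_closed_form():
    """A model that is off by (3, 4) everywhere has mean L2 error 5."""
    def oracle(s, a):
        return (s[0] + a, s[1])

    def model(s, a):
        return (s[0] + a + 3.0, s[1] + 4.0)

    pairs = [((0.0, 0.0), 1.0), ((2.0, -1.0), 0.5)]
    assert math.isclose(mean_one_step_l2(model, oracle, pairs), 5.0)
```

A test runner such as pytest would collect the `test_` function; nothing in it touches `load_dynamics`, so the graceful-skip path and the metric logic can be checked on a machine with no checkpoints and no torch.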

No new git tag; core wmel untouched (the script lives under experiments/). The full sweep is run on the box where the checkpoints live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…(keystone)

The maze proof of concept sits in a regime where prediction and decision
quality move together, so the naive one-step metric and the decision-aware
ranking metric predict downstream success about equally. This adds the
discriminating test on the real cells: across the committed TD-MPC2 world
models, does a decision-aware offline metric track the downstream CPG where a
naive prediction-error metric does not?

tdmpc2_offline_metrics.py loads each committed TD-MPC2 cell's trained latent
dynamics on CPU and computes, against the matched oracle on a fixed seeded
state sample: M1 one-step L2 (naive foil), M2 k-step L2 divergence
(compounding), M3 action-ranking agreement (decision-aware, unitless Kendall
tau), then pairs them with that cell's committed downstream CPG (gap). Output
feeds the merged correlate.py.

Compute is CPU (TD-MPC2 inference defaults to CPU; the planner is not run, only
the dynamics callable is queried) and checkpoint-gated (the .pt weights are
gitignored). Cells without their checkpoint are reported as skipped, so the
script runs on a bare checkout and names the missing files.

Scale confound: M1/M2 are L2 in each env's own state space and are not
comparable across envs; pooling them would unfairly handicap them against the
unitless M3 (the apples-to-oranges trap the maze PoC review flagged). The fair
head-to-head is within an environment, so the script prints a within-env
Spearman for every env with >=3 cells; pooled cross-env output is valid only
for M3.

Pure metric helpers are stdlib-only at module scope (heavy env/adapter/torch
imports deferred) and unit-tested with synthetic callables and closed-form
expectations; the full sweep is run on the box where the checkpoints live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Denis-hamon merged commit 2866477 into main on Jun 14, 2026
4 checks passed