experiments: Stage-2 offline metrics for the committed TD-MPC2 cells (keystone) #56
Merged
…(keystone)

The maze proof of concept sits in a regime where prediction and decision quality move together, so the naive one-step metric and the decision-aware ranking metric predict downstream success about equally. This adds the discriminating test on the real cells: across the committed TD-MPC2 world models, does a decision-aware offline metric track the downstream CPG where a naive prediction-error metric does not?

tdmpc2_offline_metrics.py loads each committed TD-MPC2 cell's trained latent dynamics on CPU and computes, against the matched oracle on a fixed seeded state sample: M1 one-step L2 (naive foil), M2 k-step L2 divergence (compounding), M3 action-ranking agreement (decision-aware, unitless Kendall tau), then pairs them with that cell's committed downstream CPG (gap). Output feeds the merged correlate.py.

Compute is CPU (TD-MPC2 inference defaults to CPU; the planner is not run, only the dynamics callable is queried) and checkpoint-gated (the .pt weights are gitignored). Cells without their checkpoint are reported as skipped, so the script runs on a bare checkout and names the missing files.

Scale confound: M1/M2 are L2 in each env's own state space and are not comparable across envs; pooling them would unfairly handicap them against the unitless M3 (the apples-to-oranges trap the maze PoC review flagged). The fair head-to-head is within an environment, so the script prints a within-env Spearman for every env with >=3 cells; pooled cross-env output is valid only for M3.

Pure metric helpers are stdlib-only at module scope (heavy env/adapter/torch imports deferred) and unit-tested with synthetic callables and closed-form expectations; the full sweep is run on the box where the checkpoints live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
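The three metrics named above (M1, M2, M3) can be sketched with stdlib-only helpers. This is an illustrative sketch, not the code in tdmpc2_offline_metrics.py: the function names, the `score` callable used to rank candidate actions, and the toy dynamics are all assumptions made for the example, and the real script may rank actions and map latents to states differently.

```python
import math
from typing import Callable, Sequence

State = Sequence[float]
Action = Sequence[float]
Dynamics = Callable[[State, Action], State]  # (state, action) -> next_state


def l2(a: State, b: State) -> float:
    """Euclidean distance between two state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_step_l2(model: Dynamics, oracle: Dynamics,
                states: Sequence[State], actions: Sequence[Action]) -> float:
    """M1 (hypothetical form): mean one-step L2 error over (state, action) pairs."""
    errors = [l2(model(s, a), oracle(s, a)) for s, a in zip(states, actions)]
    return sum(errors) / len(errors)


def k_step_l2(model: Dynamics, oracle: Dynamics,
              state: State, actions: Sequence[Action]) -> float:
    """M2 (hypothetical form): L2 divergence after an open-loop k-step rollout."""
    s_model, s_oracle = state, state
    for a in actions:
        s_model, s_oracle = model(s_model, a), oracle(s_oracle, a)
    return l2(s_model, s_oracle)


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two items to rank")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def action_ranking_agreement(model: Dynamics, oracle: Dynamics, state: State,
                             candidates: Sequence[Action],
                             score: Callable[[State], float]) -> float:
    """M3 (hypothetical form): tau between model-induced and oracle-induced action scores."""
    model_scores = [score(model(state, a)) for a in candidates]
    oracle_scores = [score(oracle(state, a)) for a in candidates]
    return kendall_tau(model_scores, oracle_scores)


# Toy 1-D dynamics (assumptions): the model carries a constant +0.5 bias.
def toy_oracle(s: State, a: Action) -> State:
    return [s[0] + a[0]]


def toy_model(s: State, a: Action) -> State:
    return [s[0] + a[0] + 0.5]
```

With these toy callables the model has a nonzero one-step error that compounds over a rollout, yet it orders candidate actions exactly as the oracle does. That is the kind of separation between prediction error and decision quality the discriminating test looks for.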
What
Stage 2 of the offline->downstream keystone, and its discriminating test. The maze proof of concept (PR #55) lives in a regime where prediction and decision quality move together, so the naive one-step metric and the decision-aware ranking metric predict downstream success about equally. This adds the sharper test on the real cells: across the committed TD-MPC2 world models, does a decision-aware offline metric track the downstream Counterfactual Planning Gap where a naive prediction-error metric does not?
experiments/offline_downstream/tdmpc2_offline_metrics.py loads each committed TD-MPC2 cell's trained latent dynamics on CPU and computes, against the matched oracle on a fixed seeded state sample:
- M1: one-step L2 (the naive foil)
- M2: k-step L2 divergence (compounding error)
- M3: action-ranking agreement (decision-aware, unitless Kendall tau)

It then pairs them with that cell's committed downstream CPG (gap). The bundle feeds the already-merged correlate.py (--downstream gap).

Compute / gating
- CPU: TD-MPC2 inference defaults to CPU; the planner is not run, only the (state, action) -> next_state callable is queried.
- Checkpoint-gated: the .pt weights are gitignored and not in the repo. Cells without their checkpoint are reported as skipped, so the script runs on a bare checkout and names exactly which files are missing. Run it where the checkpoints live (the training box, or after scp-ing them into results/dmc_*/).

Scale confound (pre-empting the maze-PoC review point)
M1/M2 are L2 distances in each env's own state space and are NOT comparable across environments; pooling Cartpole and Reacher M1/M2 would unfairly handicap them against the unitless, rank-based M3. This is the same apples-to-oranges trap flagged on the maze PoC. The fair head-to-head is within an environment, so the script prints a within-env Spearman for every env with >=3 cells (Cartpole has the widest gap spread, 6 cells); pooled cross-env correlate.py output is valid only for M3.

Testing
Pure metric helpers are stdlib-only at module scope (heavy env/adapter/torch imports deferred) and unit-tested with synthetic callables and closed-form expectations (tests/test_tdmpc2_offline_metrics.py, 10 tests).

No new git tag; core wmel untouched (the script lives under experiments/). The full sweep is run on the box where the checkpoints live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>