experiments: offline->downstream correlation, CPU proof-of-concept (keystone) #55
Merged
…eystone T2.1.b/c)
Runs the no-GPU portion of the external-validity keystone: does a cheap,
planner-free offline metric predict downstream planning success? The learned
MLP arms on DMC sit at the planning floor (success 0, no downstream spread), so
this uses the fast maze toy, where under-training the learned dynamics genuinely
degrades planning.
- experiments/offline_downstream/maze_quality_sweep.py (torch, in experiments/):
sweeps training epochs x seeds (24 cells), computes 3 planner-free offline
metrics vs the true maze dynamics -- M1 one-step mismatch (naive), M2 k-step
open-loop divergence (compounding), M3 action-ranking agreement (decision-aware,
Kendall tau of action-by-closeness-to-goal) -- plus downstream success_rate.
M3 is null (not NaN) for action-blind models; bundle written with allow_nan=False.
- experiments/offline_downstream/correlate.py (stdlib): per-metric rank
correlation vs success with bootstrap CIs, PLUS a common-subset (equal-n)
block -- the only fair cross-metric comparison -- and a comparability_note.
- results/offline_downstream/{maze_offline_scores,offline_vs_downstream}.json,
README, tests.
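
To make the M3 idea in the list above concrete, here is a minimal stdlib sketch. It is not the code in maze_quality_sweep.py: it assumes a wall-free grid maze with four moves and measures closeness as negative Manhattan distance, and the names true_step, closeness, kendall_tau_b and action_ranking_agreement are hypothetical. It ranks the actions by how close each resulting state is to the goal, once under the true dynamics and once under the model, and returns None when Kendall's tau is undefined, which is how an action-blind model would show up.

```python
from itertools import combinations
from math import sqrt

# Hypothetical action set for a grid maze: (row delta, column delta).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def true_step(state, action):
    """Assumed wall-free maze dynamics: move exactly one cell."""
    d_row, d_col = ACTIONS[action]
    return (state[0] + d_row, state[1] + d_col)


def closeness(state, goal):
    """Higher is better: negative Manhattan distance to the goal."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))


def kendall_tau_b(xs, ys):
    """Kendall tau-b with tie correction; None when either input is constant."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        return None  # undefined: one ranking has no variation at all
    return (concordant - discordant) / denom


def action_ranking_agreement(model_step, state, goal):
    """Tau between the true and the model-implied ranking of actions."""
    true_scores = [closeness(true_step(state, a), goal) for a in ACTIONS]
    model_scores = [closeness(model_step(state, a), goal) for a in ACTIONS]
    return kendall_tau_b(true_scores, model_scores)


def action_blind(state, action):
    """A model that ignores the action and predicts no movement."""
    return state


if __name__ == "__main__":
    print(action_ranking_agreement(true_step, (2, 2), (0, 4)))
    print(action_ranking_agreement(action_blind, (2, 2), (0, 4)))
```

Returning None instead of a float NaN keeps the undefined case explicit, so a later aggregation step has to decide how to handle it rather than silently propagating NaN.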
Result (honest): all three metrics predict success and clear zero; on the fair
common subset (n=17) they are essentially equal (M1 -0.90, M2 -0.91, M3 +0.90),
so the naive metric is NOT inferior to the decision-aware one here -- the
expected prediction-approx-decision signature of a small deterministic env. The
per-metric M3 (n=17) vs M1 (n=24) gap is an artifact of M3's smaller/easier
subset, NOT M3 superiority (stated in the README + comparability_note). The
discriminating M1-vs-M3 test needs Stage 2 (DMC/TD-MPC2 cells, GPU), where
capable models span non-floor downstream success.
Adversarial review GO after one round (NO-GO first caught the unfair
n=17-vs-24 comparison that manufactured a false "M3 strongest", invalid bare-NaN
JSON, and a mislabeled comment -- all fixed; the fair comparison vindicates the
prediction-approx-decision caveat). Full suite 239 passing; correlate.py
stdlib-only; torch confined to the sweep. No new git tag.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Tier 2.1.b/c (CPU slice of the keystone). You asked to launch (a): run the CPU part to get a first real offline→downstream correlation without the GPU. Here it is.
The question
Does a cheap, planner-free offline metric predict downstream planning success? The learned MLP arms on DMC sit at the planning floor (success 0 regardless of prediction quality — the dissociation the paper reports), so they give no downstream spread. The fast maze toy is the CPU-feasible setting where under-training the dynamics genuinely degrades planning, so success varies (0.0 → 0.30–0.60 → 1.0).
What it adds
- maze_quality_sweep.py (torch, in experiments/): 24 cells (epochs × seeds), three planner-free offline metrics vs the true maze dynamics: M1 one-step mismatch (naive), M2 k-step divergence (compounding), M3 action-ranking agreement (decision-aware), plus downstream success_rate.
- correlate.py (stdlib): per-metric rank correlation + bootstrap CIs, plus a fair common-subset (equal-n) block and a comparability_note. Reuses the merged primitives (PR feat(metrics): stdlib rank-correlation primitives (Spearman, Kendall, bootstrap CI) #54).
Result (honest)
All three metrics predict success and clear zero. On the fair common subset (n=17) they are essentially equal: M1 −0.90, M2 −0.91, M3 +0.90 — the naive metric is not inferior to the decision-aware one here. That's the expected prediction ≈ decision signature of a small deterministic env, and it validates the pipeline end-to-end.
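
The difference between the per-metric and the common-subset comparison can be sketched with the stdlib alone. This is an illustration under assumptions, not the contents of correlate.py: the function names (average_ranks, spearman, bootstrap_ci, compare) are hypothetical, the cells are made-up toy values rather than the PR's results, and the interval is a plain percentile bootstrap. The toy data mimics the situation described above: M3 is undefined (None) for the two failed cells, so its per-metric correlation is computed on fewer and easier cells than M1's.

```python
import random
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho as Pearson correlation of ranks; None if undefined."""
    if len(xs) < 2:
        return None
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho, resampling cells with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([xs[i] for i in idx], [ys[i] for i in idx])
        if rho is not None:  # skip degenerate resamples with a constant column
            stats.append(rho)
    if not stats:
        return None
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


def summarize(rows, metric, target):
    xs = [r[metric] for r in rows]
    ys = [r[target] for r in rows]
    return {"n": len(rows), "rho": spearman(xs, ys), "ci95": bootstrap_ci(xs, ys)}


def compare(cells, metrics, target="success_rate"):
    """Return (per-metric block, common-subset block)."""
    per_metric = {
        m: summarize([c for c in cells if c[m] is not None], m, target)
        for m in metrics
    }
    # Only cells where every metric is defined: equal n for all metrics.
    common = [c for c in cells if all(c[m] is not None for m in metrics)]
    common_subset = {m: summarize(common, m, target) for m in metrics}
    return per_metric, common_subset


# Made-up toy cells (not the PR's data); M3 is None where it is undefined.
cells = [
    {"M1": 0.90, "M3": None, "success_rate": 0.0},
    {"M1": 0.80, "M3": None, "success_rate": 0.0},
    {"M1": 0.50, "M3": 0.20, "success_rate": 0.3},
    {"M1": 0.40, "M3": 0.40, "success_rate": 0.5},
    {"M1": 0.20, "M3": 0.70, "success_rate": 0.6},
    {"M1": 0.10, "M3": 0.90, "success_rate": 1.0},
]

if __name__ == "__main__":
    per_metric, common_subset = compare(cells, ["M1", "M3"])
    print("per-metric:", per_metric)
    print("common subset:", common_subset)
```

In this toy, the per-metric block gives M1 a slightly weaker correlation than M3 only because M1 also has to rank the two tied failures. On the common subset both metrics are scored on the same cells, so any remaining gap would reflect the metrics and not the subset.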
Not a claim of M3 superiority. The per-metric M3 (n=17) vs M1 (n=24) gap is an artifact of M3's smaller/easier subset (M3 is undefined for action-blind failures), not a real advantage; this is stated in the README and the comparability_note. The discriminating M1-vs-M3 test needs Stage 2 (DMC/TD-MPC2 cells, GPU), where capable models span non-floor downstream success.
Verification
- correlate.py stdlib-only; torch confined to the sweep; bundles are strict-valid JSON (M3 undefined → null, not NaN).
- Adversarial review: GO after one round; the initial NO-GO caught the unfair n=17-vs-24 comparison, invalid bare-NaN JSON and a mislabeled comment. Caveats documented (M1/M2 rank-redundant ρ≈0.98; near-bimodal success; M2 single-sequence).
🤖 Generated with Claude Code
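
The strict-JSON point above can be illustrated with Python's json module. By default json.dumps writes a float NaN as the bare token NaN, which is not valid JSON; passing allow_nan=False makes it raise ValueError instead. The sketch below is an assumption about how such a bundle could be written, not the PR's writer: to_strict and dump_bundle are hypothetical names, and mapping infinities to null as well is a choice made here for simplicity.

```python
import json
import math


def to_strict(value):
    """Recursively replace non-finite floats with None (JSON null)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: to_strict(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_strict(item) for item in value]
    return value


def dump_bundle(bundle):
    """Serialize a results bundle; allow_nan=False guards against stray NaN."""
    return json.dumps(to_strict(bundle), allow_nan=False, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Toy bundle with an undefined M3 score.
    print(dump_bundle({"cell": 0, "M1": 0.12, "M3": float("nan")}))
```

Keeping allow_nan=False even after sanitizing means a NaN that slips through a new code path fails loudly at write time instead of producing a file that other JSON parsers reject.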