experiments: offline->downstream correlation, CPU proof-of-concept (keystone) by Denis-hamon · Pull Request #55 · Denis-hamon/world-model-eval-lab

Denis-hamon · 2026-06-14T15:39:04Z

Tier 2.1.b/c (CPU slice of the keystone). You asked to launch (a): run the CPU part to get a first real offline→downstream correlation without the GPU. Here it is.

The question

Does a cheap, planner-free offline metric predict downstream planning success? The learned MLP arms on DMC sit at the planning floor (success 0 regardless of prediction quality — the dissociation the paper reports), so they give no downstream spread. The fast maze toy is the CPU-feasible setting where under-training the dynamics genuinely degrades planning, so success varies (0.0 → 0.30–0.60 → 1.0).

What it adds

maze_quality_sweep.py (torch, in experiments/): 24 cells (epochs × seeds), three planner-free offline metrics vs the true maze dynamics — M1 one-step mismatch (naive), M2 k-step divergence (compounding), M3 action-ranking agreement (decision-aware) — plus downstream success_rate.
correlate.py (stdlib): per-metric rank correlation + bootstrap CIs, plus a fair common-subset (equal-n) block and a comparability_note. Reuses the merged primitives from PR #54 (feat(metrics): stdlib rank-correlation primitives (Spearman, Kendall, bootstrap CI)).
Committed bundles, README, tests.
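To make the three metric definitions concrete, here is a minimal stdlib sketch on a hypothetical 4x4 grid maze. It is not the code in maze_quality_sweep.py (which uses torch and a learned dynamics model). The grid layout, the function names, and the choice of Manhattan distance and Kendall tau-b are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy maze: every constant here is an illustrative assumption.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
SIZE, WALLS, GOAL = 4, {(1, 1), (2, 1)}, (3, 3)

def true_step(state, action):
    """Deterministic ground-truth dynamics: blocked moves leave the state unchanged."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
    return nxt if inside and nxt not in WALLS else state

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def free_states():
    return [s for s in product(range(SIZE), repeat=2) if s not in WALLS]

def m1_one_step_mismatch(model_step):
    """M1: fraction of (state, action) pairs where the model's next state is wrong."""
    pairs = [(s, a) for s in free_states() for a in ACTIONS]
    return sum(model_step(s, a) != true_step(s, a) for s, a in pairs) / len(pairs)

def m2_k_step_divergence(model_step, start, actions):
    """M2: open-loop rollout of one action sequence; distance between end states."""
    s_true = s_model = start
    for a in actions:
        s_true, s_model = true_step(s_true, a), model_step(s_model, a)
    return manhattan(s_true, s_model)

def kendall_tau_b(x, y):
    """Kendall tau-b; returns None when either ranking is constant (undefined)."""
    conc = disc = tied_x = tied_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                tied_x += 1
            elif dy == 0:
                tied_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = ((conc + disc + tied_x) * (conc + disc + tied_y)) ** 0.5
    return None if denom == 0 else (conc - disc) / denom

def m3_action_ranking_agreement(model_step):
    """M3: mean tau between true and model rankings of actions by closeness to GOAL."""
    taus = []
    for s in free_states():
        true_d = [manhattan(true_step(s, a), GOAL) for a in ACTIONS]
        model_d = [manhattan(model_step(s, a), GOAL) for a in ACTIONS]
        tau = kendall_tau_b(true_d, model_d)
        if tau is not None:
            taus.append(tau)
    return sum(taus) / len(taus) if taus else None

def action_blind(state, action):
    """A degenerate model that ignores the action entirely."""
    return state
```

Under these assumptions, a perfect model scores M1 = 0 and M3 = 1, while the action-blind model ranks every action identically, so its M3 comes out as None rather than a number. That is the kind of case the PR reports as a null M3.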

Result (honest)

All three metrics predict success and clear zero. M1 and M2 measure error, so they correlate negatively with success; M3 measures agreement, so it correlates positively. On the fair common subset (n=17) the magnitudes are essentially equal: M1 −0.90, M2 −0.91, M3 +0.90, so the naive metric is not inferior to the decision-aware one here. That's the expected prediction ≈ decision signature of a small deterministic env, and it validates the pipeline end-to-end.
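For readers who want to see what a rank correlation with a bootstrap CI involves, here is a minimal stdlib sketch. It is not the correlate.py implementation or the PR #54 primitives. The function names, the tie handling (average ranks), and the percentile bootstrap are assumptions.

```python
import random
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; None when either input has zero variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rho = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Spearman rho, resampling cells with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:  # skip degenerate resamples with no variance
            stats.append(rho)
    if not stats:
        raise ValueError("every bootstrap resample was degenerate")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi

# Toy values for illustration only; these are NOT the PR's data.
toy_error = [0.9, 0.7, 0.5, 0.3, 0.1, 0.05]
toy_success = [0.0, 0.0, 0.3, 0.6, 1.0, 1.0]
rho = spearman(toy_error, toy_success)
ci = bootstrap_ci(toy_error, toy_success, n_boot=500)
```

In this sketch, a metric "clears zero" when both ends of the interval fall on the same side of zero. With near-bimodal success values, many resamples contain heavy ties, which is one reason to treat such intervals with care.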

Not a claim of M3 superiority. The per-metric M3 (n=17) vs M1 (n=24) gap is an artifact of M3's smaller/easier subset (M3 is undefined for action-blind failures), not a real advantage — stated in the README and the comparability_note. The discriminating M1-vs-M3 test needs Stage 2 (DMC/TD-MPC2 cells, GPU), where capable models span non-floor downstream success.
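The difference between the per-metric view and the equal-n view can be sketched as follows. The cell schema (keys `M1`, `M3`, `success_rate`) and both function names are hypothetical, and the toy cells are invented for illustration; they are not taken from the committed bundles.

```python
def per_metric_subsets(cells, metric_keys, target="success_rate"):
    """Each metric keeps every cell where it is defined, so n can differ per metric."""
    return {
        k: [(c[k], c[target]) for c in cells if c[k] is not None]
        for k in metric_keys
    }

def common_subset(cells, metric_keys, target="success_rate"):
    """Keep only cells where ALL metrics are defined, so every metric sees the same n."""
    keep = [c for c in cells if all(c[k] is not None for k in metric_keys)]
    return {k: [(c[k], c[target]) for c in keep] for k in metric_keys}

# Toy cells (illustrative only): the worst model is action-blind, so its M3 is None.
toy_cells = [
    {"M1": 0.80, "M3": None, "success_rate": 0.0},
    {"M1": 0.40, "M3": 0.35, "success_rate": 0.3},
    {"M1": 0.10, "M3": 0.80, "success_rate": 1.0},
    {"M1": 0.05, "M3": 0.95, "success_rate": 1.0},
]

per_metric = per_metric_subsets(toy_cells, ["M1", "M3"])
fair = common_subset(toy_cells, ["M1", "M3"])
sizes_per_metric = {k: len(v) for k, v in per_metric.items()}
sizes_fair = {k: len(v) for k, v in fair.items()}
```

In the per-metric view M3 never has to rank the cell it is undefined on, so its correlation is computed on a different and possibly easier set than M1's. Restricting every metric to the shared cells removes that asymmetry before any coefficients are compared.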

Verification

239 tests pass; correlate.py stdlib-only; torch confined to the sweep; bundles are strict-valid JSON (M3 undefined → null, not NaN).
Adversarial review GO after one round. The first round (NO-GO) caught a real bug: the n=17-vs-24 comparison manufactured a false "M3 strongest" headline — fixed with the equal-n common-subset, which vindicates the prediction≈decision caveat. Also fixed invalid bare-NaN JSON and a mislabeled comment. Caveats documented (M1/M2 rank-redundant ρ≈0.98; near-bimodal success; M2 single-sequence).
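A minimal sketch of how a bundle can be kept strict-valid, assuming the undefined metric arrives as a float NaN. The helper names and the `M3` key are hypothetical; `json.dumps(..., allow_nan=False)` is the standard-library behaviour that raises `ValueError` on NaN or infinity instead of emitting a bare `NaN` token.

```python
import json
import math

def nan_to_none(obj):
    """Recursively replace float NaN with None so it serializes as JSON null."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {k: nan_to_none(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [nan_to_none(v) for v in obj]
    return obj

def dump_strict(obj):
    """Serialize with allow_nan=False so any remaining non-finite value fails loudly."""
    return json.dumps(nan_to_none(obj), allow_nan=False, sort_keys=True)

def is_rejected_without_cleanup(obj):
    """True when strict serialization refuses the raw object."""
    try:
        json.dumps(obj, allow_nan=False)
    except ValueError:
        return True
    return False

cell = {"M3": float("nan"), "success_rate": 0.0}
strict_text = dump_strict(cell)
```

Infinity is deliberately not mapped to null in this sketch, so it would still raise. Whether that is the right policy for the real bundles is a design decision the PR does not spell out.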

No new git tag. Paper section deferred to Stage 2 (this is an experiments/ artifact + README).

🤖 Generated with Claude Code

…eystone T2.1.b/c)

Runs the no-GPU portion of the external-validity keystone: does a cheap, planner-free offline metric predict downstream planning success? The learned MLP arms on DMC sit at the planning floor (success 0, no downstream spread), so this uses the fast maze toy, where under-training the learned dynamics genuinely degrades planning.

- experiments/offline_downstream/maze_quality_sweep.py (torch, in experiments/): sweeps training epochs x seeds (24 cells), computes 3 planner-free offline metrics vs the true maze dynamics -- M1 one-step mismatch (naive), M2 k-step open-loop divergence (compounding), M3 action-ranking agreement (decision-aware, Kendall tau of action-by-closeness-to-goal) -- plus downstream success_rate. M3 is null (not NaN) for action-blind models; bundle written with allow_nan=False.
- experiments/offline_downstream/correlate.py (stdlib): per-metric rank correlation vs success with bootstrap CIs, PLUS a common-subset (equal-n) block -- the only fair cross-metric comparison -- and a comparability_note.
- results/offline_downstream/{maze_offline_scores,offline_vs_downstream}.json, README, tests.

Result (honest): all three metrics predict success and clear zero; on the fair common subset (n=17) they are essentially equal (M1 -0.90, M2 -0.91, M3 +0.90), so the naive metric is NOT inferior to the decision-aware one here -- the expected prediction-approx-decision signature of a small deterministic env. The per-metric M3 (n=17) vs M1 (n=24) gap is an artifact of M3's smaller/easier subset, NOT M3 superiority (stated in the README + comparability_note). The discriminating M1-vs-M3 test needs Stage 2 (DMC/TD-MPC2 cells, GPU), where capable models span non-floor downstream success.

Adversarial review GO after one round (NO-GO first caught the unfair n=17-vs-24 comparison that manufactured a false "M3 strongest", invalid bare-NaN JSON, and a mislabeled comment -- all fixed; the fair comparison vindicates the prediction-approx-decision caveat).

Full suite 239 passing; correlate.py stdlib-only; torch confined to the sweep. No new git tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Denis-hamon merged commit a175f53 into main Jun 14, 2026
4 checks passed

Denis-hamon mentioned this pull request Jun 14, 2026

experiments: Stage-2 offline metrics for the committed TD-MPC2 cells (keystone) #56

Merged

Denis-hamon merged 1 commit into main from feat-offline-downstream-correlate
