
experiments: offline->downstream correlation, CPU proof-of-concept (keystone)#55

Merged
Denis-hamon merged 1 commit into main from feat-offline-downstream-correlate on Jun 14, 2026

@Denis-hamon

Owner

Tier 2.1.b/c (CPU slice of the keystone). You asked to run (a): the CPU part, to get a first real offline→downstream correlation without the GPU. Here it is.

The question

Does a cheap, planner-free offline metric predict downstream planning success? The learned MLP arms on DMC sit at the planning floor (success 0 regardless of prediction quality — the dissociation the paper reports), so they give no downstream spread. The fast maze toy is the CPU-feasible setting where under-training the dynamics genuinely degrades planning, so success varies (0.0 → 0.30–0.60 → 1.0).

What it adds

  • maze_quality_sweep.py (torch, in experiments/): 24 cells (epochs × seeds), three planner-free offline metrics vs the true maze dynamics — M1 one-step mismatch (naive), M2 k-step divergence (compounding), M3 action-ranking agreement (decision-aware) — plus downstream success_rate.
  • correlate.py (stdlib): per-metric rank correlation + bootstrap CIs, plus a fair common-subset (equal-n) block and a comparability_note. Reuses the primitives merged in PR #54, "feat(metrics): stdlib rank-correlation primitives (Spearman, Kendall, bootstrap CI)".
  • Committed bundles, README, tests.
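
For readers unfamiliar with the statistic, here is a minimal stdlib-only sketch of a Spearman rank correlation with a percentile bootstrap CI. It is an illustration, not the code in correlate.py or the PR #54 primitives: the function names (`average_ranks`, `spearman`, `bootstrap_ci`) and the defaults (`n_boot`, `alpha`, `seed`) are assumptions made for this example.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; None when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho = Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Spearman rho, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:  # skip degenerate (constant) resamples
            stats.append(rho)
    if not stats:
        raise ValueError("every bootstrap resample was degenerate")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

A metric "clears zero" in this sense when both ends of the interval returned by `bootstrap_ci` have the same sign. With near-bimodal success values, some resamples can be constant; this sketch skips them, which is one possible convention among several.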

Result (honest)

All three metrics predict success and clear zero. On the fair common subset (n=17) they are essentially equal: M1 −0.90, M2 −0.91, M3 +0.90 — the naive metric is not inferior to the decision-aware one here. That's the expected prediction ≈ decision signature of a small deterministic env, and it validates the pipeline end-to-end.

Not a claim of M3 superiority. The per-metric M3 (n=17) vs M1 (n=24) gap is an artifact of M3's smaller/easier subset (M3 is undefined for action-blind failures), not a real advantage — stated in the README and the comparability_note. The discriminating M1-vs-M3 test needs Stage 2 (DMC/TD-MPC2 cells, GPU), where capable models span non-floor downstream success.
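
The equal-n idea can be sketched in a few lines: restrict every metric to the cells where all metrics are defined before comparing them. This is a hedged illustration, not the logic in correlate.py; the field names (`m1`, `m2`, `m3`, `success_rate`) and the helper names are hypothetical.

```python
def per_metric_rows(cells, metric):
    """Cells usable for one metric alone (its value is defined)."""
    return [c for c in cells if c.get(metric) is not None]


def common_subset(cells, metrics):
    """Cells where every listed metric is defined: the equal-n comparison set."""
    return [c for c in cells if all(c.get(m) is not None for m in metrics)]


# Toy cells (made-up numbers); m3 is None for an action-blind failure.
cells = [
    {"m1": 0.90, "m2": 3.1, "m3": None, "success_rate": 0.0},
    {"m1": 0.40, "m2": 1.2, "m3": 0.35, "success_rate": 0.3},
    {"m1": 0.10, "m2": 0.4, "m3": 0.80, "success_rate": 1.0},
    {"m1": 0.05, "m2": 0.2, "m3": 0.95, "success_rate": 1.0},
]

metrics = ["m1", "m2", "m3"]
n_per_metric = {m: len(per_metric_rows(cells, m)) for m in metrics}
n_common = len(common_subset(cells, metrics))
```

In this toy data `n_per_metric` differs across metrics while `n_common` is the same for all of them, so any correlation computed on `common_subset(...)` compares the metrics on identical cells.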

Verification

  • 239 tests pass; correlate.py stdlib-only; torch confined to the sweep; bundles are strict-valid JSON (M3 undefined → null, not NaN).
  • Adversarial review GO after one round. The first round (NO-GO) caught a real bug: the n=17-vs-24 comparison manufactured a false "M3 strongest" headline — fixed with the equal-n common-subset, which vindicates the prediction≈decision caveat. Also fixed invalid bare-NaN JSON and a mislabeled comment. Caveats documented (M1/M2 rank-redundant ρ≈0.98; near-bimodal success; M2 single-sequence).
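
The strict-JSON point can be illustrated with the standard library alone. Python's `json.dumps` emits a bare `NaN` token by default, which is not valid JSON; passing `allow_nan=False` makes it raise instead, so undefined values have to be mapped to `None` (serialized as `null`) first. The `nan_to_none` helper below is a hypothetical name for this sketch, not a function from the PR.

```python
import json
import math


def nan_to_none(obj):
    """Recursively replace non-finite floats (NaN, +/-inf) with None."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {k: nan_to_none(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [nan_to_none(v) for v in obj]
    return obj


def dump_strict(obj):
    """Serialize to strict JSON; raises ValueError if a NaN/inf slipped through."""
    return json.dumps(obj, allow_nan=False)


cell = {"m1": 0.9, "m3": float("nan")}
strict_text = dump_strict(nan_to_none(cell))
```

Calling `dump_strict(cell)` directly raises `ValueError`, which is the failure mode that keeps a bare `NaN` from reaching a committed bundle.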

No new git tag. Paper section deferred to Stage 2 (this is an experiments/ artifact + README).

🤖 Generated with Claude Code

…eystone T2.1.b/c)

Runs the no-GPU portion of the external-validity keystone: does a cheap,
planner-free offline metric predict downstream planning success? The learned
MLP arms on DMC sit at the planning floor (success 0, no downstream spread), so
this uses the fast maze toy, where under-training the learned dynamics genuinely
degrades planning.

- experiments/offline_downstream/maze_quality_sweep.py (torch, in experiments/):
  sweeps training epochs x seeds (24 cells), computes 3 planner-free offline
  metrics vs the true maze dynamics -- M1 one-step mismatch (naive), M2 k-step
  open-loop divergence (compounding), M3 action-ranking agreement (decision-aware,
  Kendall tau of action-by-closeness-to-goal) -- plus downstream success_rate.
  M3 is null (not NaN) for action-blind models; bundle written with allow_nan=False.
- experiments/offline_downstream/correlate.py (stdlib): per-metric rank
  correlation vs success with bootstrap CIs, PLUS a common-subset (equal-n)
  block -- the only fair cross-metric comparison -- and a comparability_note.
- results/offline_downstream/{maze_offline_scores,offline_vs_downstream}.json,
  README, tests.

Result (honest): all three metrics predict success and clear zero; on the fair
common subset (n=17) they are essentially equal (M1 -0.90, M2 -0.91, M3 +0.90),
so the naive metric is NOT inferior to the decision-aware one here -- the
expected prediction-approx-decision signature of a small deterministic env. The
per-metric M3 (n=17) vs M1 (n=24) gap is an artifact of M3's smaller/easier
subset, NOT M3 superiority (stated in the README + comparability_note). The
discriminating M1-vs-M3 test needs Stage 2 (DMC/TD-MPC2 cells, GPU), where
capable models span non-floor downstream success.

Adversarial review GO after one round (NO-GO first caught the unfair
n=17-vs-24 comparison that manufactured a false "M3 strongest", invalid bare-NaN
JSON, and a mislabeled comment -- all fixed; the fair comparison vindicates the
prediction-approx-decision caveat). Full suite 239 passing; correlate.py
stdlib-only; torch confined to the sweep. No new git tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Denis-hamon Denis-hamon merged commit a175f53 into main Jun 14, 2026
4 checks passed