results: full observation-richness sweep on Cartpole + honest write-up (T2.3) by Denis-hamon · Pull Request #58 · Denis-hamon/world-model-eval-lab

Denis-hamon · 2026-06-14T18:42:27Z

What

Full CPU run of the observation-richness stress test merged in #57 (the pipeline only had smoke/floor data before). 21 cells: {redundant, high_freq} × widths {0,4,16,64} × 3 seeds.

Result

A popular "model quality" proxy — observation-space one-step MSE — swings by orders of magnitude purely from task-irrelevant observation design, while the closed-loop gap does not track it:

mse_total ranges 0.0000 → 0.283 (~5e4×; the floor is nonzero and only rounds to 0.0000; high_freq climbs monotonically 0.037 → 0.117 → 0.278 with width; redundant stays ~0).
gap mean +0.79, no monotone relation to width/kind. Spearman(mse_total, gap) = +0.09, 95% CI [−0.37, +0.51]; Spearman(mse_state, gap) = +0.03.
mse_state (decision-relevant slice) stays ~1e-5..3e-3 in every cell — orders of magnitude below mse_total.
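
A minimal sketch of why the two metrics can diverge, assuming mse_total averages the squared one-step error over every observation dimension and mse_state averages it over the leading state dimensions only. The function name `mse_split`, the `state_dim` argument, and the toy vectors are hypothetical illustrations, not the repository's implementation or data.

```python
def mse_split(pred, target, state_dim):
    """Return (mse_total, mse_state) for lists of observation vectors.

    Assumption: the first `state_dim` entries of each vector are the
    decision-relevant state; the rest are appended, task-irrelevant features.
    """
    n = len(target)
    width = len(target[0])
    sq_total = 0.0
    sq_state = 0.0
    for p_vec, t_vec in zip(pred, target):
        for d, (p, t) in enumerate(zip(p_vec, t_vec)):
            err = (p - t) ** 2
            sq_total += err
            if d < state_dim:
                sq_state += err
    return sq_total / (n * width), sq_state / (n * state_dim)


# Toy case: two state dims predicted exactly, two appended dims missed.
target = [[1.0, 2.0, 0.5, -0.5]]
pred = [[1.0, 2.0, 0.0, 0.0]]
mse_total, mse_state = mse_split(pred, target, state_dim=2)
print(mse_total, mse_state)
```

In this toy case all of the error sits in the appended dimensions, so mse_total is inflated while mse_state stays at zero, which is the pattern the high_freq arm shows at much larger scale.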

Honesty (three caveats in the README, no overclaim)

The learned MLP is at MODEL BOTTLENECK in nearly every cell (success 0 in 14/21; oracle 0.8–1.0) — the known DMC-MLP planning floor — so gap is oracle-pinned and the held-constant outcome is near-failure, not a graded success rate.
Because gap is near-constant, the low Spearman is consistent-with-decoupling, not positive proof of a zero relationship (CI is wide). Read it as "the MSE swing buys no detectable change in the gap."
The deflation direction did not materialise — the MLP already fits the clean state to ~0 MSE, so the smooth redundant features leave mse_total at the floor. Only the high_freq inflation arm moves it.
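
The second caveat can be made concrete with a small sketch of a rank correlation and its interval. The PR does not say how its 95% CI was computed; a percentile bootstrap over cells is assumed here as one common choice. The helper names (`ranks`, `spearman`, `bootstrap_ci`) and the synthetic data are hypothetical and are not the committed JSON.

```python
import random


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Spearman rho, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = spearman([x[i] for i in idx], [y[i] for i in idx])
        if r == r:  # drop degenerate resamples where rho is NaN
            stats.append(r)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Synthetic illustration only: 21 cells, MSE spread over orders of
# magnitude, gap nearly constant around a fixed value.
rng = random.Random(1)
mse = [10 ** rng.uniform(-5, -0.5) for _ in range(21)]
gap = [0.79 + rng.gauss(0, 0.05) for _ in range(21)]
lo, hi = bootstrap_ci(mse, gap)
print(f"rho={spearman(mse, gap):+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}]")
```

With only 21 points and an outcome that barely varies, the interval from a sketch like this tends to span a large part of [-1, +1], which is why a near-zero rho here supports "no detectable change" rather than a demonstrated zero relationship.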

The genuinely new content is not prediction≠decision itself (the keystone establishes that) but its invariance: a ~5e4× MSE swing buys no detectable change in the closed-loop gap.

Review

Adversarial review: GO. Every figure reproduces exactly from the committed JSON; the three caveats were added at its recommendation (it flagged the ceiling confound on "decoupled" and that the novelty is the invariance, not a fresh prediction≠decision demo). CPU only; core wmel untouched; no new git tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Full CPU run of the observation-richness stress test (21 cells: {redundant, high_freq} x widths {0,4,16,64} x 3 seeds).

Result: observation-space one-step MSE (mse_total) swings ~5e4x (0.0000 -> 0.283) purely from task-irrelevant observation design, while the CPG gap does not track it (mean +0.79, Spearman(mse_total, gap) = +0.09, 95% CI [-0.37, +0.51]); the decision-relevant mse_state stays orders of magnitude smaller (~1e-5..3e-3) in every cell.

Reported with three caveats so it is not overread:
(1) the learned MLP is at MODEL BOTTLENECK in nearly every cell (the known DMC-MLP planning floor), so the gap is oracle-pinned and the constant outcome is near-failure, not a graded success rate;
(2) because the gap is near-constant the low Spearman is consistent-with-decoupling, not positive proof of zero relationship (wide CI);
(3) the deflation direction did not materialise -- the MLP already fits the clean state to ~0 MSE, so the smooth redundant features leave mse_total at the floor (only the high_freq inflation arm moves it).

The genuinely new content is the INVARIANCE: a ~5e4x MSE swing buys no detectable change in the closed-loop gap.

Numbers cross-checked against the committed JSON by adversarial review (GO; all figures reproduce exactly; the three caveats were added at its recommendation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Denis-hamon merged commit 466cf47 into main Jun 14, 2026
4 checks passed

Denis-hamon merged 1 commit into main from result-obs-robustness
