results: full observation-richness sweep on Cartpole + honest write-up (T2.3)#58

Merged: Denis-hamon merged 1 commit into main from result-obs-robustness on Jun 14, 2026
@Denis-hamon

What

Full CPU run of the observation-richness stress test merged in #57 (the pipeline only had smoke/floor data before). 21 cells: {redundant, high_freq} × widths {0,4,16,64} × 3 seeds.
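
For readers reconstructing the grid: the plain product 2 kinds × 4 widths × 3 seeds would give 24 cells, not 21. One reading consistent with the stated count is that width 0 adds no extra features, so it is run once per seed as a shared baseline rather than once per kind. That is an assumption on my part, not something this PR states, and the seed values and the `"baseline"` label below are hypothetical. A minimal sketch of that enumeration:

```python
from itertools import product

KINDS = ("redundant", "high_freq")
WIDTHS = (0, 4, 16, 64)
SEEDS = (0, 1, 2)  # hypothetical seed values; the PR only says "3 seeds"


def enumerate_cells():
    """List (kind, width, seed) cells, sharing the width-0 cell across kinds.

    Assumption: at width 0 no distractor features are added, so the feature
    kind is irrelevant and a single baseline run per seed suffices.
    """
    cells = []
    for seed in SEEDS:
        cells.append(("baseline", 0, seed))
        for kind, width in product(KINDS, WIDTHS[1:]):
            cells.append((kind, width, seed))
    return cells


cells = enumerate_cells()
print(len(cells))  # (1 + 2 * 3) * 3 = 21
```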

Result

A popular "model quality" proxy — observation-space one-step MSE — swings by orders of magnitude purely from task-irrelevant observation design, while the closed-loop gap does not track it:

  • mse_total ranges 0.0000 → 0.283 (~5e4×, so the minimum is nonzero, roughly 6e-6, and only prints as 0.0000 at four decimals; high_freq climbs monotonically 0.037 → 0.117 → 0.278 with width; redundant stays ~0).
  • gap mean +0.79, no monotone relation to width/kind. Spearman(mse_total, gap) = +0.09, 95% CI [−0.37, +0.51]; Spearman(mse_state, gap) = +0.03.
  • mse_state (decision-relevant slice) stays ~1e-5..3e-3 in every cell — orders of magnitude below mse_total.
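
For anyone wanting to sanity-check a rank correlation and interval of this kind, here is a minimal stdlib-only sketch. It assumes a percentile bootstrap over cells; this PR does not say how the reported 95% CI was computed, so treat the method as an illustration rather than the pipeline's own. The data are synthetic (21 points, an MSE spread over several orders of magnitude, a near-constant gap with independent noise), not the committed JSON.

```python
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Spearman rho, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # drop NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Synthetic stand-in for the 21 cells (NOT the real results).
rng = random.Random(1)
mse_total = [10 ** rng.uniform(-5, -0.5) for _ in range(21)]
gap = [0.79 + rng.gauss(0, 0.1) for _ in range(21)]

rho = spearman(mse_total, gap)
ci_lo, ci_hi = bootstrap_ci(mse_total, gap)
print(f"rho={rho:+.2f}, 95% CI [{ci_lo:+.2f}, {ci_hi:+.2f}]")
```

With only 21 points the interval from a sketch like this tends to be wide, which is the same reason the reported CI cannot exclude a moderate relationship in either direction.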

Honesty (three caveats in the README, no overclaim)

  1. The learned MLP is at MODEL BOTTLENECK in nearly every cell (success 0 in 14/21; oracle 0.8–1.0) — the known DMC-MLP planning floor — so gap is oracle-pinned and the held-constant outcome is near-failure, not a graded success rate.
  2. Because gap is near-constant, the low Spearman is consistent-with-decoupling, not positive proof of a zero relationship (CI is wide). Read it as "the MSE swing buys no detectable change in the gap."
  3. The deflation direction did not materialise — the MLP already fits the clean state to ~0 MSE, so the smooth redundant features leave mse_total at the floor. Only the high_freq inflation arm moves it.
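
To make the mse_total versus mse_state distinction concrete, here is a toy sketch. Everything in it is an assumption for illustration: the observation layout (task state in the first `STATE_DIM` columns, distractors appended after it), the sinusoidal distractors, and a hypothetical model that predicts the state almost perfectly and outputs 0 for every distractor. It does not reproduce the numbers above; it only shows how appended, poorly predicted columns can inflate the whole-vector MSE while the state slice is unchanged.

```python
import math
import random

STATE_DIM = 4  # assumption: the first 4 columns carry the task state


def mse(pred_rows, true_rows, cols):
    """Mean squared error over the selected columns of every row."""
    total, count = 0.0, 0
    for p, t in zip(pred_rows, true_rows):
        for c in cols:
            total += (p[c] - t[c]) ** 2
            count += 1
    return total / count


def make_batch(width, n=200, seed=0):
    """Build (predictions, targets) with `width` appended distractor columns."""
    rng = random.Random(seed)
    true_rows, pred_rows = [], []
    for step in range(n):
        state = [rng.uniform(-1, 1) for _ in range(STATE_DIM)]
        # Fast-oscillating distractors that carry no task information.
        extra = [math.sin(50.0 * step + k) for k in range(width)]
        # Hypothetical model: small error on the state, zeros on distractors.
        pred_state = [s + rng.gauss(0, 0.01) for s in state]
        true_rows.append(state + extra)
        pred_rows.append(pred_state + [0.0] * width)
    return pred_rows, true_rows


results = {}
for width in (0, 4, 16, 64):
    pred, true = make_batch(width)
    total = mse(pred, true, range(STATE_DIM + width))
    state_only = mse(pred, true, range(STATE_DIM))
    results[width] = (total, state_only)
    print(f"width={width:2d}  mse_total={total:.4f}  mse_state={state_only:.6f}")
```

In this toy the state-slice error is identical at every width by construction, while the whole-vector error grows as distractor columns come to dominate the average.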

The genuinely new content is not prediction≠decision itself (the keystone establishes that) but its invariance: a ~5e4× MSE swing buys no detectable change in the closed-loop gap.

Review

Adversarial review: GO, every figure reproduces exactly from the committed JSON; the three caveats were added at its recommendation (it flagged the ceiling confound on "decoupled" and that the novelty is the invariance, not a fresh prediction≠decision demo). CPU only; core wmel untouched; no new git tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Full CPU run of the observation-richness stress test (21 cells:
{redundant, high_freq} x widths {0,4,16,64} x 3 seeds). Result: observation-
space one-step MSE (mse_total) swings ~5e4x (0.0000 -> 0.283) purely from
task-irrelevant observation design, while the CPG gap does not track it
(mean +0.79, Spearman(mse_total, gap) = +0.09, 95% CI [-0.37, +0.51]); the
decision-relevant mse_state stays orders of magnitude smaller (~1e-5..3e-3)
in every cell.

Reported with three caveats so it is not overread: (1) the learned MLP is at
MODEL BOTTLENECK in nearly every cell (the known DMC-MLP planning floor), so
the gap is oracle-pinned and the constant outcome is near-failure, not a graded
success rate; (2) because the gap is near-constant the low Spearman is
consistent-with-decoupling, not positive proof of zero relationship (wide CI);
(3) the deflation direction did not materialise -- the MLP already fits the
clean state to ~0 MSE, so the smooth redundant features leave mse_total at the
floor (only the high_freq inflation arm moves it). The genuinely new content is
the INVARIANCE: a ~5e4x MSE swing buys no detectable change in the closed-loop
gap.

Numbers cross-checked against the committed JSON by adversarial review (GO; all
figures reproduce exactly; the three caveats were added at its recommendation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Denis-hamon merged commit 466cf47 into main on Jun 14, 2026
4 checks passed