results: full observation-richness sweep on Cartpole + honest write-up (T2.3) #58
Merged
Full CPU run of the observation-richness stress test (21 cells:
{redundant, high_freq} x widths {0,4,16,64} x 3 seeds). Result: observation-
space one-step MSE (mse_total) swings ~5e4x (0.0000 -> 0.283) purely from
task-irrelevant observation design, while the CPG gap does not track it
(mean +0.79, Spearman(mse_total, gap) = +0.09, 95% CI [-0.37, +0.51]); the
decision-relevant mse_state stays orders of magnitude smaller (~1e-5..3e-3)
in every cell.
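The distinction between the two error metrics can be illustrated with a minimal sketch. It assumes the observation vector is laid out as the true state followed by the added task-irrelevant features, and that both metrics are plain means of squared error over their dimensions; the helper names, the `state_dim` value, and the toy numbers are all hypothetical and are not taken from the repository.

```python
def one_step_mse(pred, target, dims):
    """Mean squared error over the selected observation dimensions."""
    total = 0.0
    count = 0
    for p_row, t_row in zip(pred, target):
        for d in dims:
            total += (p_row[d] - t_row[d]) ** 2
            count += 1
    return total / count


def mse_report(pred, target, state_dim):
    """Return both metrics; assumes the first `state_dim` dims are the state."""
    obs_dim = len(target[0])
    return {
        "mse_total": one_step_mse(pred, target, range(obs_dim)),
        "mse_state": one_step_mse(pred, target, range(state_dim)),
    }


# Toy data (hypothetical): 4 state dims followed by 2 distractor dims.
target = [
    [0.1, 0.2, 0.3, 0.4, 1.0, -1.0],
    [0.0, 0.1, 0.2, 0.3, -1.0, 1.0],
]
# The prediction is exact on the state slice and off by 0.5 on each distractor.
pred = [
    [0.1, 0.2, 0.3, 0.4, 0.5, -0.5],
    [0.0, 0.1, 0.2, 0.3, -0.5, 0.5],
]

report = mse_report(pred, target, state_dim=4)
print(report)
```

In this toy case the state slice is predicted perfectly while the distractor error alone lifts `mse_total`, which is the kind of inflation the high_freq arm is described as producing.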
Reported with three caveats so it is not overread: (1) the learned MLP is at
MODEL BOTTLENECK in nearly every cell (the known DMC-MLP planning floor), so
the gap is oracle-pinned and the constant outcome is near-failure, not a graded
success rate; (2) because the gap is near-constant the low Spearman is
consistent-with-decoupling, not positive proof of zero relationship (wide CI);
(3) the deflation direction did not materialise -- the MLP already fits the
clean state to ~0 MSE, so the smooth redundant features leave mse_total at the
floor (only the high_freq inflation arm moves it). The genuinely new content is
the INVARIANCE: a ~5e4x MSE swing buys no detectable change in the closed-loop
gap.
Numbers cross-checked against the committed JSON by adversarial review (GO; all
figures reproduce exactly; the three caveats were added at its recommendation).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Full CPU run of the observation-richness stress test merged in #57 (the pipeline only had smoke/floor data before). 21 cells:
{redundant, high_freq} × widths {0,4,16,64} × 3 seeds.
Result
A popular "model quality" proxy, observation-space one-step MSE, swings by orders of magnitude purely from task-irrelevant observation design, while the closed-loop gap does not track it:
- mse_total ranges 0.0000 → 0.283 (~5e4×; high_freq climbs monotonically 0.037 → 0.117 → 0.278 with width; redundant stays ~0).
- gap mean +0.79, no monotone relation to width/kind. Spearman(mse_total, gap) = +0.09, 95% CI [−0.37, +0.51]; Spearman(mse_state, gap) = +0.03.
- mse_state (decision-relevant slice) stays ~1e-5..3e-3 in every cell, orders of magnitude below mse_total.
Honesty (three caveats in the README, no overclaim)
1. The learned MLP is at MODEL BOTTLENECK in nearly every cell (the known DMC-MLP planning floor), so gap is oracle-pinned and the held-constant outcome is near-failure, not a graded success rate.
2. Because gap is near-constant, the low Spearman is consistent-with-decoupling, not positive proof of a zero relationship (CI is wide). Read it as "the MSE swing buys no detectable change in the gap."
3. The deflation direction did not materialise: the MLP already fits the clean state to ~0 MSE, so the smooth redundant features leave mse_total at the floor. Only the high_freq inflation arm moves it.
The genuinely new content is not prediction≠decision itself (the keystone establishes that) but its invariance: a ~5e4× MSE swing buys no detectable change in the closed-loop gap.
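A rank correlation with a resampled interval, as quoted above, can be sketched as follows. This is an illustration under stated assumptions, not the repository's analysis code: the PR does not say how its 95% CI was computed, so the percentile bootstrap here is an assumption, and the function names and synthetic data are hypothetical.

```python
import math
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks; NaN if either side is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return math.nan
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over paired resamples (assumed method)."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(rho):  # skip degenerate (constant) resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Synthetic stand-in for 21 cells: x varies widely, y is near-constant noise.
data_rng = random.Random(1)
x = [data_rng.uniform(0.0, 0.3) for _ in range(21)]
y = [0.79 + data_rng.gauss(0.0, 0.01) for _ in range(21)]
rho = spearman(x, y)
lo, hi = bootstrap_ci(x, y)
print(f"rho={rho:+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}]")
```

With only 21 points and an outcome that barely varies, an interval of this kind tends to be wide, which is why the second caveat reads the low correlation as consistent with decoupling and not as proof of it.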
Review
Adversarial review: GO, every figure reproduces exactly from the committed JSON; the three caveats were added at its recommendation (it flagged the ceiling confound on "decoupled" and that the novelty is the invariance, not a fresh prediction≠decision demo). CPU only; core wmel untouched; no new git tag.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>