experiments: observation-richness robustness of the CPG verdict (reframed T2.3) by Denis-hamon · Pull Request #57 · Denis-hamon/world-model-eval-lab

Denis-hamon · 2026-06-14T17:13:21Z

What

Reframed T2.3. Does prediction != decision survive when the world model no longer sees the clean, privileged low-dimensional state, or is the dissociation an artifact of that tidy observation?

The roadmap's original T2.3 was "one DMC-from-pixels task." A pixel / field-of-view evaluation axis is the signature of the DINO-WM / stable-worldmodel agenda, and the lab keeps a deliberate non-affiliation guardrail against introducing that axis (GPU_ROADMAP.md:92). So T2.3 is reframed to deliver the same scientific payload with nuisance-augmented state instead of pixels: no pixel axis, no JEPA-adjacent baselines, same question answered.

Construction

observation_augmentation.py (pure, stdlib-only) wraps any BenchmarkEnvironment so the observation is true_state ++ nuisance:

- the simulator stays the oracle: the augmented oracle steps true physics on the state slice and reproduces the nuisance exactly, so it reproduces the augmented env step to the same precision the base oracle reproduces the base env;
- the planner scores only the true-state slice, so decisions depend on real physics, never on the nuisance;
- the learned MLP trains on the augmented observation, so its one-step error (the keystone's M1 foil) is moved by the nuisance design while the closed-loop gap should not move.
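
The three properties above can be sketched in a few lines of stdlib-only Python. This is a minimal illustration, not the code in observation_augmentation.py: `toy_step`, `tanh_nuisance`, `AugmentedEnv`, `augmented_oracle`, and `score` are hypothetical names, and the real BenchmarkEnvironment interface is not shown in this description, so a toy point mass stands in for it.

```python
import math
from typing import Callable, List, Sequence

StepFn = Callable[[Sequence[float], float], List[float]]
NuisanceFn = Callable[[Sequence[float]], List[float]]


def toy_step(state: Sequence[float], action: float, dt: float = 0.1) -> List[float]:
    """Hypothetical base physics: a 1-D point mass (position, velocity)."""
    pos, vel = state
    vel = vel + dt * action
    pos = pos + dt * vel
    return [pos, vel]


def tanh_nuisance(state: Sequence[float]) -> List[float]:
    """Deterministic nuisance computed from the true state only."""
    return [math.tanh(x) for x in state]


class AugmentedEnv:
    """Wraps base physics so the observation is true_state ++ nuisance."""

    def __init__(self, base_step: StepFn, nuisance_fn: NuisanceFn, init_state: Sequence[float]):
        self._base_step = base_step
        self._nuisance_fn = nuisance_fn
        self._state = list(init_state)

    def observe(self) -> List[float]:
        return list(self._state) + self._nuisance_fn(self._state)

    def step(self, action: float) -> List[float]:
        self._state = self._base_step(self._state, action)
        return self.observe()


def augmented_oracle(obs: Sequence[float], action: float, base_step: StepFn,
                     nuisance_fn: NuisanceFn, state_dim: int) -> List[float]:
    """Steps true physics on the state slice, then recomputes the nuisance."""
    next_state = base_step(obs[:state_dim], action)
    return list(next_state) + nuisance_fn(next_state)


def score(obs: Sequence[float], goal: Sequence[float], state_dim: int) -> float:
    """Planner score: negative squared distance, read from the state slice only."""
    return -sum((s - g) ** 2 for s, g in zip(obs[:state_dim], goal))
```

Because the nuisance is a deterministic function of the true state, the oracle can rebuild it from the state slice alone, and any change to the nuisance dimensions leaves `score` unchanged.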

Two nuisance kinds, both deterministic state features (hence reconstructable by the oracle), differing only in one-step learnability:

| kind | nuisance | expected one-step MSE |
| --- | --- | --- |
| `redundant` | smooth low-frequency `tanh(state)` | deflates (easy) |
| `high_freq` | `sin(K*state)`, K large | inflates (a finite smooth MLP cannot resolve K) |

This is a genuine one-step difficulty — the quantity M1 measures. A chaotic temporal map would not do: its one-step update is a fittable parabola and its divergence is multi-step only.
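
The two nuisance kinds might look like the following sketch. The way `width` is mapped onto state dimensions (cycling through them) and the default `k=50.0` are assumptions made for illustration; the description only states the functional forms `tanh(state)` and `sin(K*state)`.

```python
import math
from typing import List, Sequence


def redundant_nuisance(state: Sequence[float], width: int) -> List[float]:
    """Smooth low-frequency features: tanh of state dims, cycled to `width` entries."""
    return [math.tanh(state[i % len(state)]) for i in range(width)]


def high_freq_nuisance(state: Sequence[float], width: int, k: float = 50.0) -> List[float]:
    """High-frequency features: sin(k * state dim); k=50.0 is an assumed value."""
    return [math.sin(k * state[i % len(state)]) for i in range(width)]


def sensitivity(fn, state: Sequence[float], delta: float, width: int = 1) -> float:
    """Largest change in any nuisance feature when every state dim shifts by delta."""
    shifted = [x + delta for x in state]
    return max(abs(a - b) for a, b in zip(fn(shifted, width), fn(state, width)))
```

The slope of `tanh` never exceeds 1, while the slope of `sin(k*x)` can reach `k`, so the same small state change produces a much larger swing in the `high_freq` feature. That is one way to read "one-step difficulty" here: the target varies faster than a small smooth network can track, independent of any temporal dynamics. With `width=0` both functions return an empty list, matching the no-nuisance control.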

Falsifiable claim

CPG and mse_state stay ~flat across nuisance kind/width while mse_total moves (down for redundant, up for high_freq); width=0 is the no-nuisance control reproducing baseline Cartpole CPG. That would show a popular "model quality" proxy — observation-space one-step MSE — is movable by task-irrelevant observation design without touching the closed-loop verdict. If CPG shifts materially, the verdict is observation-form-dependent — itself a finding. Directions are measured per cell, not assumed (the mse_total/mse_state/mse_nuisance split is reported per cell); the input-distraction confound is disclosed and surfaced via the separate mse_state.
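
A minimal sketch of the per-cell MSE split, assuming each metric is a mean of squared errors over samples and over the dimensions in its slice. The exact normalisation used in the experiment code is not stated here, and `split_mse` is a hypothetical name.

```python
from typing import Dict, Optional, Sequence


def _mean_sq(preds: Sequence[Sequence[float]], targets: Sequence[Sequence[float]],
             lo: int, hi: Optional[int]) -> Optional[float]:
    """Mean squared error over dims [lo, hi) across all samples; None if the slice is empty."""
    total, count = 0.0, 0
    for p, t in zip(preds, targets):
        for a, b in zip(p[lo:hi], t[lo:hi]):
            total += (a - b) ** 2
            count += 1
    return total / count if count else None


def split_mse(preds: Sequence[Sequence[float]], targets: Sequence[Sequence[float]],
              state_dim: int) -> Dict[str, Optional[float]]:
    """Reports one-step error on the full observation, the state slice, and the nuisance slice."""
    return {
        "mse_total": _mean_sq(preds, targets, 0, None),
        "mse_state": _mean_sq(preds, targets, 0, state_dim),
        "mse_nuisance": _mean_sq(preds, targets, state_dim, None),
    }
```

Under this definition, a model that predicts the state perfectly but misses the nuisance has `mse_state == 0` and a nonzero `mse_total`, which is the dissociation the claim relies on. At `width=0` the nuisance slice is empty, so `mse_nuisance` is `None` and `mse_total` equals `mse_state`.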

Testing / compute

- CPU only, no GPU, no checkpoints.
- Pure algebra unit-tested with a synthetic env (oracle reproduces the augmented env exactly; score ignores nuisance; MSE split isolates state vs nuisance); full suite green.
- Smoke run confirms both directions empirically: redundant w=4 mse_total 0.029 < baseline 0.060; high_freq w=4 mse_total 0.169 > baseline; mse_state stays in a tight band.
- Adversarial review: initial NO-GO caught a real error. A chaotic temporal distractor's one-step update is a fittable parabola, so it would have deflated (not inflated) the one-step metric. Replaced with the high-frequency state feature; re-review GO, no remaining must-fixes.
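
The point caught in review can be illustrated with the logistic map, used here as an assumed stand-in since the description does not name the chaotic map that was originally tried. Its one-step update `r*x*(1-x)` is an exact parabola, so a near-perfect fit has tiny one-step error even though rollouts diverge. `fitted_model` is a hypothetical learned model with a slightly wrong coefficient.

```python
def logistic(x: float, r: float = 4.0) -> float:
    """True one-step update: r*x*(1-x), an exact parabola in x."""
    return r * x * (1.0 - x)


def fitted_model(x: float) -> float:
    """Hypothetical learned model: the same parabola with a slightly wrong coefficient."""
    return 3.999 * x * (1.0 - x)


def one_step_error(n: int = 1001) -> float:
    """Worst one-step absolute error over an even grid on [0, 1]."""
    grid = [i / (n - 1) for i in range(n)]
    return max(abs(logistic(x) - fitted_model(x)) for x in grid)


def rollout_error(x0: float = 0.2, steps: int = 50) -> float:
    """Worst absolute gap between true and model trajectories started from x0."""
    x_true = x_model = x0
    worst = 0.0
    for _ in range(steps):
        x_true = logistic(x_true)
        x_model = fitted_model(x_model)
        worst = max(worst, abs(x_true - x_model))
    return worst
```

Comparing `one_step_error()` with `rollout_error()` shows the gap: the one-step error is bounded by the coefficient mismatch, while the rollout error grows until the two trajectories are unrelated. A one-step metric therefore sees a chaotic distractor as easy, which is why it would deflate rather than inflate the metric.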

Core wmel untouched (everything under experiments/). No new git tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…amed T2.3)

Does prediction != decision survive when the world model no longer sees the clean, privileged low-dimensional state? The original T2.3 was "one DMC-from-pixels task", but a pixel/field-of-view axis is the signature of the DINO-WM / stable-worldmodel agenda and is held off by the lab's non-affiliation guardrail (experiments/GPU_ROADMAP.md:92). This delivers the same payload with nuisance-augmented state instead of pixels: no pixel axis, same question.

observation_augmentation.py (pure, stdlib-only) wraps any BenchmarkEnvironment so the observation is true_state ++ nuisance. The simulator stays the oracle (the augmented oracle steps true physics on the state slice and reproduces the nuisance exactly, so it reproduces the augmented env step to the same precision the base oracle reproduces the base env); the planner scores ONLY the state slice, so decisions depend on real physics, not the nuisance; the learned MLP trains on the augmented observation, so its one-step error (the keystone's M1 foil) is moved up or down by the nuisance design while the closed-loop gap should not move.

Two nuisance kinds, both deterministic state features (hence reconstructable by the oracle), differing only in one-step learnability: redundant = smooth low-frequency tanh features (easy -> expected to deflate one-step MSE); high_freq = sin(K*state) with K large (a finite smooth MLP cannot resolve the frequency -> expected to inflate it). This is a genuine one-step difficulty, not a temporal one -- a chaotic map's one-step update is a fittable parabola and its divergence is multi-step only, which this metric does not compute.

Falsifiable claim: CPG and mse_state stay ~flat across nuisance kind/width while mse_total moves; width=0 is the no-nuisance control reproducing baseline CPG. Directions are measured per cell, not assumed (the mse_total/mse_state/mse_nuisance split is reported for every cell), and the input-distraction confound (nuisance dims are also MLP inputs) is disclosed and surfaced via the separate mse_state.

CPU only, no GPU, no checkpoints. The pure augmentation algebra is unit-tested with a synthetic env (oracle reproduces the augmented env exactly; score ignores nuisance; MSE split isolates state vs nuisance); the smoke run confirms both directions (redundant deflates, high_freq inflates) and that mse_state stays in a tight band. Core wmel untouched (everything under experiments/).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Denis-hamon merged commit e9310cf into main Jun 14, 2026
4 checks passed

Denis-hamon mentioned this pull request Jun 14, 2026

results: full observation-richness sweep on Cartpole + honest write-up (T2.3) #58

Merged

Denis-hamon merged 1 commit into main from feat-obs-robustness
