v1.5 — paper-scale reruns + original-simulator follow-ups
Why
Of the 58 v1+v1.5 stubs (RESULTS.md):
- 32 reproduce paper claims (full or qualitative)
- 25 partial / qualitative — algorithm works, paper number not fully reached because of laptop/numpy-only budget OR synthetic-data substitution per the SPEC's RL-stub rule
- 1 honest non-replication (hq-learning-pomdp; HQ-vs-flat gap doesn't reproduce on 29-cell maze; mathematical analysis in §Open questions)
The 25 partials all have honest gaps that should close at paper config or at the original simulator. v1.5 reruns them at proper scale (correct architecture width, correct epoch count, correct dataset size, original simulator) on Modal or a single GPU. Goal: turn the partials into yes-reproductions, OR document them as genuine failures-at-scale.
This issue mirrors hinton-problems #46, adapted for Schmidhuber's lineage.
Scope
Three categories:
Category A: Need bigger budget (paper-scale reruns)
These hit the laptop's 5-min budget at smaller scale than the paper. Need GPU/Modal:
- mnist-deep-mlp — 535k MLP / 15 epochs → paper 12M weights / 800 epochs, target 0.35% test err
- mcdnn-image-bench — single-column MLP / 22s → paper 35-column ensemble / 60+ ep × 35 cols, target 0.23%; add GTSRB + CASIA Chinese loaders
- evolino-sines-mackey-glass — pop=40 / 80 gens / NRMSE@84=0.29 → full ESP / paper budget, target NRMSE@84 ≈ 1.9e-3
- pipe-6-bit-parity — 6-bit at 71.9% in 240s → paper budget, target 16/16 = 100%
- lstm-search-space-odyssey — 8 variants on adding-problem T=50 → full TIMIT/IAM/JSB battery at paper config (5,400 experiments)
- noise-free-long-lag — sub-variant (a) at p=50 only → (b)/(c) variants + p=100/500/1000 sweep
- timing-counting-spikes — MSD only at T=150 → MSD/GTS/PFG triple at T≥300
- hq-learning-pomdp — 29-cell maze (no replication) → paper's 62-cell maze; mathematical analysis predicts it should reproduce at scale
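For the evolino-sines-mackey-glass target, it may help to pin down what is being compared. The sketch below assumes NRMSE means root-mean-square error divided by the standard deviation of the targets; the stub or the paper may normalize differently, and how the "@84" evaluation points are selected (presumably predictions 84 steps ahead) is left to the stub. The function name `nrmse` is hypothetical, not taken from the repo.

```python
import math


def nrmse(targets, predictions):
    """RMSE normalized by the targets' standard deviation.

    Assumption: this normalization matches the one used for NRMSE@84;
    check the stub README before comparing against the paper number.
    """
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have equal length")
    n = len(targets)
    if n < 2:
        raise ValueError("need at least two points to normalize")

    mean_t = sum(targets) / n
    # Mean squared error between prediction and target.
    mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n
    # Population variance of the targets (the normalizer).
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    if var_t == 0.0:
        raise ValueError("targets are constant; NRMSE is undefined")
    return math.sqrt(mse / var_t)
```

Under this definition a model that always predicts the target mean scores 1.0, which puts the current 0.29 and the paper's ≈ 1.9e-3 on the same scale.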
Category B: Original-simulator paths for the 8 v1.5 synthetic substitutes
Wave 11 shipped these as numpy mini-environments per the SPEC's RL-stub rule. v1.5 reruns at the original env (will need infra beyond nix develop + numpy):
- world-models-carracing — currently numpy 2D track → gym CarRacing-v0
- world-models-vizdoom-dream — currently numpy 5×5 gridworld → VizDoom DoomTakeCover-v0
- torcs-vision-evolution — currently numpy oval → TORCS racing simulator
- timit-blstm-ctc — currently synthetic phoneme corpus → TIMIT phoneme set
- iam-handwriting — currently synthetic 10-char alphabet → IAM-OnDB / IAM-DB
- em-segmentation-isbi — currently synthetic Voronoi-EM → ISBI 2012 EM stack
- clockwork-rnn — currently synthetic sum-of-sines → raw-audio TIMIT word
- (8th synthetic substitute already covered above)
Category C: Architecture/capacity gap (mid-scope)
These need a width-bump or an extra slot but not full paper budget:
- neural-em-shapes — best test NMI 0.428 at K=3 / H=24 → paper AMI 0.96 with K+1 background slot + GRU M-step
- relational-nem-bouncing-balls — distribution shift at K=6 → larger train distribution + K-curriculum
- neural-data-router — +1 depth above chance → paper "100% length-gen" at vocab=8/8 + LayerNorm + alternating L→R/R→L heads
- self-referential-weight-matrix — 4-way boolean meta-learning at 99.6% → paper's larger sequence-task setup (continuous-pointer relaxation → discrete REINFORCE addresses)
- compete-to-compute — small-net regime noisy (LWTA wins 6/10) → paper 3-layer × 512 hidden, longer training, larger sequential split
Acceptance
For each stub:
- New PR with paper-config rerun results
- Update RESULTS.md row: partial → yes (or document genuine failure at scale)
- Update stub README §Results with paper-scale numbers
- Cost reported: GPU-hours or $ on Modal
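One possible way to keep the cost line consistent across PRs is a small helper like the one below. This is only a sketch: the `RunCost` class, its field names, and the hourly rate are hypothetical, and the rate should come from the actual Modal bill or GPU pricing rather than from this example.

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    """Cost record for one paper-scale rerun (hypothetical helper)."""

    stub: str
    wall_seconds: float            # measured wall-clock time of the rerun
    gpu_count: int = 1             # GPUs held for the whole run
    usd_per_gpu_hour: float = 0.0  # assumption: fill in from the real bill

    @property
    def gpu_hours(self) -> float:
        # GPU-hours = wall-clock hours times number of GPUs.
        return self.wall_seconds * self.gpu_count / 3600.0

    @property
    def usd(self) -> float:
        return self.gpu_hours * self.usd_per_gpu_hour

    def summary(self) -> str:
        return f"{self.stub}: {self.gpu_hours:.2f} GPU-hours, ${self.usd:.2f}"


# Placeholder numbers for illustration only, not a measured run.
example = RunCost("mnist-deep-mlp", wall_seconds=7200, usd_per_gpu_hour=2.0)
print(example.summary())  # mnist-deep-mlp: 2.00 GPU-hours, $4.00
```

Reporting both figures keeps single-GPU runs (GPU-hours only) and Modal runs (dollars) comparable in the same PR template.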
Stub-specific follow-ups (already filed)
- #3 — nbb-xor (η ablation, multi-subset arch, source verification). Different scope (rate-parameter ablation, not paper-scale).
agent-0bserver07 (Claude Code) on behalf of Yad