v2 — ByteDMD / data-movement instrumentation on v1+v1.5 baselines
Why
v1+v1.5 (PRs #4–#16) shipped 58 reproducible synthetic-learning baselines per the spec in #1. The actual research goal was always to measure these under a data-movement / energy lens — that's ByteDMD (reuse-distance proxy for memory-hierarchy energy cost).
The hypothesis (per Yaroslav, hinton-problems issue #1): backprop has a bad commute-to-compute ratio because every gradient step refetches all activations. Algorithmic alternatives that reuse weights more (NBB local rules, fast-weights/linear-attention, capsule routing, etc.) should look better under ByteDMD even when slower in wallclock. v1+v1.5 is the comparator floor; v2 makes the comparison.
This issue mirrors hinton-problems #45, adapted for Schmidhuber's algorithmic lineage.
What this issue tracks
Per-stub instrumentation of the 58 v1+v1.5 implementations with ByteDMD, producing a new column in RESULTS.md (or a dedicated RESULTS_V2.md):
| Stub | Run wallclock (v1) | ByteDMD cost (new) | Reuse-distance distribution (new) |
|---|---|---|---|
Scope
In scope:
- Wrap each stub's train() and final eval() with the ByteDMD tracer
- Output per-stub: total ByteDMD cost, reuse-distance distribution, breakdown by phase (forward / backward / weight update)
- Aggregate into a top-level table sorted by cost, comparable across algorithms
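To make the three requested outputs concrete (total cost, reuse-distance distribution, per-phase breakdown), here is a minimal sketch of a reuse-distance tracer in plain Python. It is not the ByteDMD library's API: `ToyReuseTracer`, `set_phase`, `read`, `write`, and `demo` are hypothetical names, and the per-read cost of ceil(sqrt(depth)) on an LRU stack, the free writes, and the cold-miss handling are assumptions made for illustration only.

```python
import math
from collections import defaultdict


class ToyReuseTracer:
    """Toy LRU-stack tracer (hypothetical, not the ByteDMD API)."""

    def __init__(self):
        self.stack = []                    # last element = most recently used
        self.phase = "untagged"
        self.cost = defaultdict(int)       # phase -> accumulated cost
        self.histogram = defaultdict(int)  # reuse distance -> number of reads

    def set_phase(self, name):
        self.phase = name

    def write(self, addr):
        # Assumption: a write moves the address to the top at no cost.
        if addr in self.stack:
            self.stack.remove(addr)
        self.stack.append(addr)

    def read(self, addr):
        if addr in self.stack:
            # Depth 1 means the address was the most recently used one.
            depth = len(self.stack) - self.stack.index(addr)
            self.stack.remove(addr)
        else:
            # Assumption: a cold read is charged as one past the stack bottom.
            depth = len(self.stack) + 1
        self.stack.append(addr)
        self.histogram[depth] += 1
        # ceil(sqrt(depth)) for integer depth >= 1, without floating point.
        self.cost[self.phase] += math.isqrt(depth - 1) + 1

    def total(self):
        return sum(self.cost.values())


def demo():
    """Read three weights in a forward phase, then again in a backward phase."""
    tracer = ToyReuseTracer()
    weights = ["w0", "w1", "w2"]
    tracer.set_phase("forward")
    for w in weights:
        tracer.read(w)
    tracer.set_phase("backward")
    for w in weights:
        tracer.read(w)
    return tracer
```

In this toy trace every backward read lands at depth 3, because the other two weights were touched in between. That is the refetch pattern the hypothesis is about, here at the smallest possible scale.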
Out of scope:
- Re-implementing any stub's algorithm
- Optimizing for ByteDMD (that's v3 — "find a solver that beats the v1 baseline on data movement")
- Closing the partial reproductions (that's v1.5 follow-up — a sibling issue to this one)
Priority order (per RESULTS.md "v2 filter recommendation")
Tier 1: clean reproductions + fastest runs (under 3s; lowest noise floor)
These give the cleanest data-movement signal:
- linear-transformers-fwp (0.08s): the cleanest pair, equivalent to wave-4's fast-weights-key-value (1992 ancestor); same numpy expression, different schedules. ByteDMD on both should expose whether the equivalence holds at the data-movement level too.
- predictable-stereo (0.08s)
- levin-add-positions (0.34s)
- lococode-ica (0.4s)
- compete-to-compute (0.8s)
- nbb-xor (0.85s)
- rs-two-sequence (0.94s)
- levin-count-inputs (1.0s)
- semilinear-pm-image-patches (1.2s)
- pipe-symbolic-regression (1.3s)
- em-segmentation-isbi (1.5s)
- ssa-bias-transfer-mazes (1.7s)
- chunker-22-symbol (1.86s)
- predictability-min-binary-factors (2.8s)
Tier 2: algorithmic-variant pairs on the same task
These let you compare data-movement properties of different algorithms on identical problems:
- adding-problem family: vanilla RNN vs LSTM (paper's contrast, both implemented in adding-problem and temporal-order-3bit). Direct backprop-cost comparison.
- embedded-reber family: original 1997 LSTM (no forget) vs forget-gate LSTM (continual-embedded-reber). Forget-gate adds compute; does it add commute too?
- LSTM ablation matrix (lstm-search-space-odyssey): 8 variants (V/NIG/NFG/NOG/NIAF/NOAF/CIFG/NP) on the same task. Direct architectural-variant data-movement comparison built in.
- Linear-attention ↔ FWP (linear-transformers-fwp ↔ fast-weights-key-value): the equivalence demo + the 1992 ancestor. ByteDMD on both should produce identical numbers (claim).
- Evolutionary methods: pipe-symbolic-regression (PIPE), evolino-sines-mackey-glass (Evolino), double-pole-no-velocity (ESP), torcs-vision-evolution (DCT-compressed natural ES). Gradient-free family vs gradient-based.
- Search methods: levin-count-inputs, levin-add-positions (Levin), oops-towers-of-hanoi (OOPS), rs-* (random search). All gradient-free; comparable data-movement profile.
- World models: world-models-carracing and world-models-vizdoom-dream share V+M+C decomposition: three distinct training stages with very different memory access patterns.
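As an illustration of the linear-attention ↔ FWP pair above, the sketch below computes the same outputs under two schedules. It assumes unnormalised causal linear attention with an identity feature map, which may differ from what the two stubs actually implement; the function names and shapes are hypothetical.

```python
import numpy as np


def attention_schedule(Q, K, V):
    """y_t = sum over i <= t of (q_t . k_i) * v_i, via a masked score matrix."""
    scores = np.tril(Q @ K.T)  # (T, T); lower triangle enforces causality
    return scores @ V          # (T, d_v)


def fast_weight_schedule(Q, K, V):
    """W_t = W_{t-1} + outer(v_t, k_t); y_t = W_t @ q_t, one step at a time."""
    T, d_k = K.shape
    d_v = V.shape[1]
    W = np.zeros((d_v, d_k))   # fast-weight matrix, reused at every step
    out = np.empty((T, d_v))
    for t in range(T):
        W += np.outer(V[t], K[t])
        out[t] = W @ Q[t]
    return out


rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3          # arbitrary illustrative sizes
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))

same_outputs = np.allclose(attention_schedule(Q, K, V),
                           fast_weight_schedule(Q, K, V))
```

Matching outputs only establish numerical equivalence. The first schedule materialises a T×T score matrix and reads all keys and values at once, while the second keeps a single d_v×d_k matrix live across steps, so the access order differs; whether the measured cost matches is what the instrumentation would show.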
Tier 3: deferred for v2
Stubs with run wallclock > 100s where v2 ByteDMD overhead would dominate; or partial reproductions where measuring data-movement on a non-converged solver isn't informative:
- pipe-6-bit-parity (240s 6-bit cap)
- evolino-sines-mackey-glass (140s)
- lstm-search-space-odyssey (145s; ablation matrix though, so still high-priority for tier-2 contrast)
- hq-learning-pomdp (paper's HQ-vs-flat gap doesn't reproduce on this maze size; resolve at paper-config first via v1.5)
Acceptance
- ByteDMD cost in RESULTS.md for every stub in tier 1 + tier 2
- A summary plot: ByteDMD cost vs algorithm class (gradient / search / evolutionary / local-rule / fast-weights), with the cost axis on a log scale
- One write-up per algorithmic-variant pair: "X commutes Y× more than Z on the same task; Δ comes from [phase]"
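The per-pair write-up can be derived mechanically from per-phase costs. The helper below is a minimal sketch: `pair_writeup` is a hypothetical name, the input format (a dict from phase name to cost) is an assumption about what the tracer output will look like, and the numbers in the usage line are made up purely to show the output shape.

```python
def pair_writeup(name_a, cost_a, name_b, cost_b):
    """Build the one-line summary for a pair.

    cost_a / cost_b: dict mapping phase name -> cost (assumed format).
    The algorithm with the larger total is reported first.
    """
    hi = (name_a, cost_a, sum(cost_a.values()))
    lo = (name_b, cost_b, sum(cost_b.values()))
    if hi[2] < lo[2]:
        hi, lo = lo, hi
    if lo[2] == 0:
        raise ValueError("cannot form a ratio against a zero total cost")

    ratio = hi[2] / lo[2]
    # Per-phase difference; a phase missing on one side counts as zero.
    phases = sorted(set(hi[1]) | set(lo[1]))
    deltas = {p: hi[1].get(p, 0) - lo[1].get(p, 0) for p in phases}
    dominant = max(phases, key=deltas.get)

    return (f"{hi[0]} commutes {ratio:.1f}x more than {lo[0]} "
            f"on the same task; Δ comes from {dominant}")


# Made-up numbers, for illustration only.
example = pair_writeup(
    "algo-A", {"forward": 10, "backward": 30, "weight update": 5},
    "algo-B", {"forward": 10, "backward": 10, "weight update": 5},
)
```

Reporting only the single dominant phase keeps the sentence in the requested form; the full `deltas` dict is the natural thing to keep alongside it when the gap is spread over several phases.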
Reference
agent-0bserver07 (Claude Code) on behalf of Yad