v2: ByteDMD / data-movement instrumentation on v1+v1.5 baselines

# v2 — ByteDMD / data-movement instrumentation on v1+v1.5 baselines

## Why

v1+v1.5 (PRs #4–#16) shipped 58 reproducible synthetic-learning baselines per the spec in #1. The research goal has always been to measure these baselines under a **data-movement / energy** lens, using [ByteDMD](https://github.com/cybertronai/ByteDMD), a reuse-distance proxy for memory-hierarchy energy cost.

The hypothesis (per Yaroslav, [hinton-problems issue #1](https://github.com/cybertronai/hinton-problems/issues/1#issuecomment-4363088986)): backprop has a bad commute-to-compute ratio because every gradient step refetches all activations. Algorithmic alternatives that reuse weights more (NBB local rules, fast-weights/linear-attention, capsule routing, etc.) should look better under ByteDMD even when slower in wallclock. v1+v1.5 is the comparator floor; v2 makes the comparison.
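To make the "reuse-distance" idea concrete, here is a minimal sketch of how two access schedules over the same data can differ in cost. It does not use ByteDMD's API. The helper names (`reuse_distances`, `toy_cost`) are hypothetical, and the `ceil(sqrt(depth))` cost is an assumption chosen for illustration; the ByteDMD repository is the authority on the real metric.

```python
import math

def reuse_distances(trace):
    """LRU stack depth of each access; None marks a first-touch (cold) access."""
    stack = []  # most recently used item sits at the end
    out = []
    for addr in trace:
        if addr in stack:
            out.append(len(stack) - stack.index(addr))  # 1 == top of stack
            stack.remove(addr)
        else:
            out.append(None)
        stack.append(addr)
    return out

def toy_cost(trace):
    # ASSUMPTION: each re-read costs ceil(sqrt(depth)); cold accesses are ignored.
    # math.isqrt(d - 1) + 1 equals ceil(sqrt(d)) for every integer d >= 1.
    return sum(math.isqrt(d - 1) + 1 for d in reuse_distances(trace) if d is not None)

# Same four items, each read twice, under two different schedules.
far = ["a", "b", "c", "d", "a", "b", "c", "d"]   # every re-read comes from depth 4
near = ["a", "a", "b", "b", "c", "c", "d", "d"]  # every re-read comes from depth 1

print(toy_cost(far), toy_cost(near))
```

Both traces perform the same number of reads, but the `far` schedule pays more per re-read. That is the kind of gap the hypothesis expects between backprop and weight-reusing alternatives.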

This issue mirrors [hinton-problems #45](https://github.com/cybertronai/hinton-problems/issues/45), adapted for Schmidhuber's algorithmic lineage.

## What this issue tracks

Per-stub instrumentation of the 58 v1+v1.5 implementations with ByteDMD, producing two new columns in `RESULTS.md` (or a dedicated `RESULTS_V2.md`):

| Stub | Run wallclock (v1) | **ByteDMD cost (new)** | **Reuse-distance distribution (new)** |
|---|---|---|---|

## Scope

**In scope:**
- Wrap each stub's `train()` and final `eval()` with the ByteDMD tracer
- Output per-stub: total ByteDMD cost, reuse-distance distribution, breakdown by phase (forward / backward / weight update)
- Aggregate into a top-level table sorted by cost, comparable across algorithms
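
One possible shape for the per-phase breakdown and the aggregate table is sketched below. It assumes only that the tracer exposes some cumulative cost counter. `PhaseLedger`, `read_cost`, and `markdown_table` are hypothetical names, not part of ByteDMD, and the demo uses a fake counter in place of a real tracer.

```python
from collections import defaultdict
from contextlib import contextmanager

class PhaseLedger:
    """Attributes increments of a running cost counter to named phases."""

    def __init__(self, read_cost):
        self.read_cost = read_cost  # hypothetical: returns cumulative tracer cost
        self.by_phase = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = self.read_cost()
        try:
            yield
        finally:
            self.by_phase[name] += self.read_cost() - start

    def total(self):
        return sum(self.by_phase.values())

def markdown_table(results, phases=("forward", "backward", "update")):
    """results: {stub_name: PhaseLedger}. Rows are sorted by total cost, highest first."""
    lines = ["| Stub | Total | " + " | ".join(phases) + " |",
             "|---|---|" + "---|" * len(phases)]
    for stub, ledger in sorted(results.items(), key=lambda kv: -kv[1].total()):
        cells = [str(ledger.by_phase.get(p, 0)) for p in phases]
        lines.append(f"| {stub} | {ledger.total()} | " + " | ".join(cells) + " |")
    return "\n".join(lines)

# Demo with a fake counter standing in for the tracer (numbers are made up).
counter = {"cost": 0}
ledger = PhaseLedger(lambda: counter["cost"])
for _ in range(2):  # two toy training steps
    with ledger.phase("forward"):
        counter["cost"] += 10
    with ledger.phase("backward"):
        counter["cost"] += 25
    with ledger.phase("update"):
        counter["cost"] += 5

print(markdown_table({"toy-stub": ledger}))
```

Sorting highest-cost first is a design choice for the sketch; the issue only asks that the table be sorted by cost.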

**Out of scope:**
- Re-implementing any stub's algorithm
- Optimizing for ByteDMD (that's v3 — "find a solver that beats the v1 baseline on data movement")
- Closing the partial reproductions (that's [v1.5 follow-up](#) — a sibling issue to this one)

## Priority order (per RESULTS.md "v2 filter recommendation")

### Tier 1: clean reproductions + sub-second runs (lowest noise floor)

These give the cleanest data-movement signal:

- `linear-transformers-fwp` (0.08s) — **the cleanest pair**: equivalent to wave-4's `fast-weights-key-value` (1992 ancestor); same numpy expression, different schedules. ByteDMD on both should expose whether the equivalence holds at the data-movement level too.
- `predictable-stereo` (0.08s)
- `levin-add-positions` (0.34s)
- `lococode-ica` (0.4s)
- `compete-to-compute` (0.8s)
- `nbb-xor` (0.85s)
- `rs-two-sequence` (0.94s)
- `levin-count-inputs` (1.0s)
- `semilinear-pm-image-patches` (1.2s)
- `pipe-symbolic-regression` (1.3s)
- `em-segmentation-isbi` (1.5s)
- `ssa-bias-transfer-mazes` (1.7s)
- `chunker-22-symbol` (1.86s)
- `predictability-min-binary-factors` (2.8s)

### Tier 2: algorithmic-variant pairs on the same task

These let you compare data-movement properties of different algorithms on identical problems:

- **adding-problem family**: vanilla RNN vs LSTM (paper's contrast, both implemented in `adding-problem` and `temporal-order-3bit`). Direct backprop-cost comparison.
- **embedded-reber family**: original 1997 LSTM (no forget) vs forget-gate LSTM (`continual-embedded-reber`). Forget-gate adds compute; does it add commute too?
- **LSTM ablation matrix** (`lstm-search-space-odyssey`): 8 variants (V/NIG/NFG/NOG/NIAF/NOAF/CIFG/NP) on the same task. Direct architectural-variant data-movement comparison built in.
- **Linear-attention ↔ FWP** (`linear-transformers-fwp` ↔ `fast-weights-key-value`): the equivalence demo and its 1992 ancestor. The claim to test: ByteDMD produces identical numbers on both.
- **Evolutionary methods**: `pipe-symbolic-regression` (PIPE), `evolino-sines-mackey-glass` (Evolino), `double-pole-no-velocity` (ESP), `torcs-vision-evolution` (DCT-compressed natural ES). Gradient-free family vs gradient-based.
- **Search methods**: `levin-count-inputs`, `levin-add-positions` (Levin), `oops-towers-of-hanoi` (OOPS), `rs-*` (random search). All gradient-free; comparable data-movement profile.
- **World models**: `world-models-carracing` and `world-models-vizdoom-dream` share V+M+C decomposition — three distinct training stages with very different memory access patterns.
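
For the linear-attention ↔ FWP pair, the sketch below shows why two schedules can compute the same outputs while touching memory differently. It assumes unnormalized causal linear attention with an identity feature map, which may not match what the stubs implement. It uses plain Python lists to stay dependency-free, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_schedule(keys, values, queries):
    """y_t = sum over s <= t of (k_s . q_t) * v_s, re-reading the whole k/v history."""
    outs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for s in range(t + 1):
            w = dot(keys[s], q)
            y = [yi + w * vi for yi, vi in zip(y, values[s])]
        outs.append(y)
    return outs

def fast_weight_schedule(keys, values, queries):
    """W += outer(v_t, k_t); y_t = W q_t, reading only the fixed-size matrix W."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    outs = []
    for k, v, q in zip(keys, values, queries):
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += v[i] * k[j]
        outs.append([dot(row, q) for row in W])
    return outs

# Toy inputs (made up); integer-valued floats keep the comparison exact.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[2.0], [3.0], [5.0]]
queries = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

print(attention_schedule(keys, values, queries))
print(fast_weight_schedule(keys, values, queries))
```

The attention schedule reads a history that grows with `t`, while the fast-weight schedule reads and writes one running matrix. The outputs agree, so any ByteDMD difference between the two stubs would come from the schedule rather than the math.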

### Tier 3: deferred for v2

Stubs whose run wallclock exceeds 100s, where ByteDMD tracing overhead would dominate, and partial reproductions where measuring data movement on a non-converged solver isn't informative:

- `pipe-6-bit-parity` (240s 6-bit cap)
- `evolino-sines-mackey-glass` (140s)
- `lstm-search-space-odyssey` (145s; ablation matrix though, so still high-priority for tier-2 contrast)
- `hq-learning-pomdp` (paper's HQ-vs-flat gap doesn't reproduce on this maze size; resolve at paper-config first via v1.5)

## Acceptance

- ByteDMD cost in `RESULTS.md` for every stub in tier 1 + tier 2
- A summary plot: ByteDMD cost vs algorithm class (gradient / search / evolutionary / local-rule / fast-weights), x-axis log-scale
- One write-up per algorithmic-variant pair: "X commutes Y× more than Z on the same task; Δ comes from [phase]"
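
The per-pair write-up sentence could be generated from two per-phase cost breakdowns, as in this minimal sketch. `pair_summary` is a hypothetical helper and the numbers in the demo are made up, not measurements. It assumes the second variant has a nonzero total cost.

```python
def pair_summary(name_x, cost_x, name_z, cost_z):
    """cost_*: {phase: cost}. Returns the write-up sentence for one variant pair."""
    total_x, total_z = sum(cost_x.values()), sum(cost_z.values())
    ratio = total_x / total_z  # assumes total_z > 0
    phases = sorted(set(cost_x) | set(cost_z))
    # Phase with the largest absolute cost difference between the two variants.
    top = max(phases, key=lambda p: abs(cost_x.get(p, 0) - cost_z.get(p, 0)))
    return (f"{name_x} commutes {ratio:.2f}x more than {name_z} on the same task; "
            f"largest delta comes from {top}")

# Made-up breakdowns for two placeholder variants.
variant_a = {"forward": 40, "backward": 100, "update": 10}
variant_b = {"forward": 30, "backward": 35, "update": 10}

print(pair_summary("variant_a", variant_a, "variant_b", variant_b))
```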

## Reference

- ByteDMD: https://github.com/cybertronai/ByteDMD
- Hinton companion: [hinton-problems #45](https://github.com/cybertronai/hinton-problems/issues/45)

---

_agent-0bserver07 (Claude Code) on behalf of Yad_
