
v2: ByteDMD / data-movement instrumentation on v1+v1.5 baselines #17


@0bserver07

v2 — ByteDMD / data-movement instrumentation on v1+v1.5 baselines

Why

v1+v1.5 (PRs #4–#16) shipped 58 reproducible synthetic-learning baselines per the spec in #1. The actual research goal was always to measure these baselines under a data-movement / energy lens. That is what ByteDMD provides: a reuse-distance proxy for memory-hierarchy energy cost.

The hypothesis (per Yaroslav, hinton-problems issue #1): backprop has a bad commute-to-compute ratio because every gradient step refetches all activations. Algorithmic alternatives that reuse weights more (NBB local rules, fast-weights/linear-attention, capsule routing, etc.) should look better under ByteDMD even when slower in wallclock. v1+v1.5 is the comparator floor; v2 makes the comparison.
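
To make "reuse-distance proxy" concrete, here is a minimal toy sketch, not the ByteDMD tracer itself. It assumes a cost of ceil(sqrt(depth)) per re-access, where depth is the item's position in an LRU stack; that cost rule, the function names, and both access traces are illustrative assumptions, and the real tracer may charge accesses differently.

```python
import math
from typing import Hashable, Iterable, List


def reuse_distances(trace: Iterable[Hashable]) -> List[int]:
    """Return the LRU stack depth (1 = most recently used) of every re-access.

    First-time accesses are skipped in this sketch; a real tracer has to
    decide how to charge them.
    """
    stack: List[Hashable] = []  # most recently used item is last
    depths: List[int] = []
    for item in trace:
        if item in stack:
            depths.append(len(stack) - stack.index(item))
            stack.remove(item)
        stack.append(item)
    return depths


def movement_cost(trace: Iterable[Hashable]) -> int:
    # ASSUMPTION: one re-access at depth d costs ceil(sqrt(d)).
    # math.isqrt(d - 1) + 1 equals ceil(sqrt(d)) for every integer d >= 1.
    return sum(math.isqrt(d - 1) + 1 for d in reuse_distances(trace))


# Two made-up traces over the same four buffers (not taken from any stub).
far_reuse = ["a0", "a1", "a2", "a3", "a3", "a2", "a1", "a0"]   # write all, then revisit
near_reuse = ["a0", "a0", "a1", "a1", "a2", "a2", "a3", "a3"]  # revisit immediately

far_cost = movement_cost(far_reuse)
near_cost = movement_cost(near_reuse)
```

Both traces perform the same number of accesses, yet the first one pays more because its re-accesses land deeper in the stack. That gap between operation count and movement cost is what the instrumentation is meant to expose.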

This issue mirrors hinton-problems #45, adapted for Schmidhuber's algorithmic lineage.

What this issue tracks

Per-stub instrumentation of the 58 v1+v1.5 implementations with ByteDMD, producing a new column in RESULTS.md (or a dedicated RESULTS_V2.md):

| Stub | Run wallclock (v1) | ByteDMD cost (new) | Reuse-distance distribution (new) |

Scope

In scope:

  • Wrap each stub's train() and final eval() with the ByteDMD tracer
  • Output per-stub: total ByteDMD cost, reuse-distance distribution, breakdown by phase (forward / backward / weight update)
  • Aggregate into a top-level table sorted by cost, comparable across algorithms
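
The in-scope steps above could be organized roughly as follows. This is a sketch under stated assumptions: `PhaseLedger` is a hypothetical stand-in for whatever the ByteDMD tracer actually exposes, the `charge` calls use placeholder numbers rather than measurements, and the ascending sort order is a guess, since the issue only says "sorted by cost".

```python
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List

PHASES = ("forward", "backward", "weight update")


class PhaseLedger:
    """Hypothetical accumulator: collects data-movement cost per named phase."""

    def __init__(self) -> None:
        self.costs: Dict[str, int] = defaultdict(int)
        self._phase = "untagged"

    @contextmanager
    def phase(self, name: str) -> Iterator["PhaseLedger"]:
        # Tag every charge made inside the with-block with this phase name.
        previous, self._phase = self._phase, name
        try:
            yield self
        finally:
            self._phase = previous

    def charge(self, cost: int) -> None:
        self.costs[self._phase] += cost

    @property
    def total(self) -> int:
        return sum(self.costs.values())


def results_table(ledgers: Dict[str, PhaseLedger]) -> List[str]:
    """Render one markdown row per stub, cheapest first (assumed order)."""
    header = "| Stub | ByteDMD cost | " + " | ".join(PHASES) + " |"
    rule = "|" + " --- |" * (len(PHASES) + 2)
    rows = [header, rule]
    for stub, ledger in sorted(ledgers.items(), key=lambda kv: kv[1].total):
        cells = [stub, str(ledger.total)]
        cells += [str(ledger.costs.get(p, 0)) for p in PHASES]
        rows.append("| " + " | ".join(cells) + " |")
    return rows


# Placeholder stubs and costs, for illustration only.
stub_a = PhaseLedger()
with stub_a.phase("forward"):
    stub_a.charge(10)
with stub_a.phase("backward"):
    stub_a.charge(25)
with stub_a.phase("weight update"):
    stub_a.charge(5)

stub_b = PhaseLedger()
with stub_b.phase("forward"):
    stub_b.charge(12)

table = results_table({"stub-a": stub_a, "stub-b": stub_b})
```

Keeping the phase tag in a context manager means a stub's `train()` only needs three `with` blocks around existing code, which fits the out-of-scope rule that no algorithm gets re-implemented.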

Out of scope:

  • Re-implementing any stub's algorithm
  • Optimizing for ByteDMD (that's v3 — "find a solver that beats the v1 baseline on data movement")
  • Closing the partial reproductions (that's v1.5 follow-up — a sibling issue to this one)

Priority order (per RESULTS.md "v2 filter recommendation")

Tier 1: clean reproductions + short runs (0.08s to 2.8s; lowest noise floor)

These give the cleanest data-movement signal:

  • linear-transformers-fwp (0.08s) — the cleanest pair: equivalent to wave-4's fast-weights-key-value (1992 ancestor); same numpy expression, different schedules. ByteDMD on both should expose whether the equivalence holds at the data-movement level too.
  • predictable-stereo (0.08s)
  • levin-add-positions (0.34s)
  • lococode-ica (0.4s)
  • compete-to-compute (0.8s)
  • nbb-xor (0.85s)
  • rs-two-sequence (0.94s)
  • levin-count-inputs (1.0s)
  • semilinear-pm-image-patches (1.2s)
  • pipe-symbolic-regression (1.3s)
  • em-segmentation-isbi (1.5s)
  • ssa-bias-transfer-mazes (1.7s)
  • chunker-22-symbol (1.86s)
  • predictability-min-binary-factors (2.8s)

Tier 2: algorithmic-variant pairs on the same task

These let you compare data-movement properties of different algorithms on identical problems:

  • adding-problem family: vanilla RNN vs LSTM (paper's contrast, both implemented in adding-problem and temporal-order-3bit). Direct backprop-cost comparison.
  • embedded-reber family: original 1997 LSTM (no forget) vs forget-gate LSTM (continual-embedded-reber). Forget-gate adds compute; does it add commute too?
  • LSTM ablation matrix (lstm-search-space-odyssey): 8 variants (V/NIG/NFG/NOG/NIAF/NOAF/CIFG/NP) on the same task. Direct architectural-variant data-movement comparison built in.
  • Linear-attention ↔ FWP (linear-transformers-fwp ↔ fast-weights-key-value): the equivalence demo + the 1992 ancestor. Claim to test: ByteDMD on both should produce identical numbers.
  • Evolutionary methods: pipe-symbolic-regression (PIPE), evolino-sines-mackey-glass (Evolino), double-pole-no-velocity (ESP), torcs-vision-evolution (DCT-compressed natural ES). Gradient-free family vs gradient-based.
  • Search methods: levin-count-inputs, levin-add-positions (Levin), oops-towers-of-hanoi (OOPS), rs-* (random search). All gradient-free; comparable data-movement profile.
  • World models: world-models-carracing and world-models-vizdoom-dream share V+M+C decomposition — three distinct training stages with very different memory access patterns.
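
For the linear-attention ↔ FWP pair, the "same expression, different schedules" idea can be sketched with the textbook unnormalized form. This is not the code from either stub: the function names, the absence of a feature map or normalization, and the tiny inputs are all assumptions, and pure Python is used only to keep the example dependency-free.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_schedule(keys: List[Vector], values: List[Vector],
                       queries: List[Vector]) -> List[Vector]:
    """y_t = sum over s <= t of (q_t . k_s) * v_s.

    Every step rereads all earlier keys and values.
    """
    outputs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for s in range(t + 1):
            weight = dot(q, keys[s])
            y = [yi + weight * vi for yi, vi in zip(y, values[s])]
        outputs.append(y)
    return outputs


def fast_weight_schedule(keys: List[Vector], values: List[Vector],
                         queries: List[Vector]) -> List[Vector]:
    """W += outer(v_t, k_t), then y_t = W q_t.

    Every step touches only W and the current key, value, and query.
    """
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += v[i] * k[j]
        outputs.append([dot(row, q) for row in W])
    return outputs


# Tiny made-up inputs; small integers keep the float arithmetic exact.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
queries = [[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]]

ys_attention = attention_schedule(keys, values, queries)
ys_fast_weights = fast_weight_schedule(keys, values, queries)
```

The two functions return the same outputs but read memory in different orders, so whether their data-movement numbers match depends on the schedule rather than on the math. That is the question the ByteDMD comparison of this pair is meant to answer.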

Tier 3: deferred for v2

Stubs with run wallclock > 100s, where v2 ByteDMD overhead would dominate, or partial reproductions, where measuring data movement on a non-converged solver isn't informative:

  • pipe-6-bit-parity (240s, 6-bit cap)
  • evolino-sines-mackey-glass (140s)
  • lstm-search-space-odyssey (145s; deferred on runtime, but it contains the ablation matrix, so it stays high priority for the tier-2 contrast)
  • hq-learning-pomdp (paper's HQ-vs-flat gap doesn't reproduce on this maze size; resolve at paper-config first via v1.5)

Acceptance

  • ByteDMD cost in RESULTS.md for every stub in tier 1 + tier 2
  • A summary plot: ByteDMD cost vs algorithm class (gradient / search / evolutionary / local-rule / fast-weights), x-axis log-scale
  • One write-up per algorithmic-variant pair: "X commutes Y× more than Z on the same task; Δ comes from [phase]"
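
The per-pair write-up could be generated from the phase breakdown along these lines. This is a sketch only: `pair_writeup` is a hypothetical helper, the input dictionaries are assumed to map phase names to costs, and the numbers below are placeholders rather than results.

```python
from typing import Dict


def pair_writeup(name_a: str, cost_a: Dict[str, int],
                 name_b: str, cost_b: Dict[str, int]) -> str:
    """Fill in the 'X commutes Y× more than Z' template for one pair."""
    total_a, total_b = sum(cost_a.values()), sum(cost_b.values())
    # Put the more expensive variant first so "more" always reads correctly.
    if total_a < total_b:
        name_a, cost_a, total_a, name_b, cost_b, total_b = (
            name_b, cost_b, total_b, name_a, cost_a, total_a)
    if total_b == 0:
        raise ValueError("cannot form a ratio against a zero-cost variant")
    phases = sorted(set(cost_a) | set(cost_b))
    deltas = {p: cost_a.get(p, 0) - cost_b.get(p, 0) for p in phases}
    # The phase with the largest absolute difference explains most of the gap.
    dominant = max(phases, key=lambda p: abs(deltas[p]))
    ratio = total_a / total_b
    return (f"{name_a} commutes {ratio:.2f}× more than {name_b} "
            f"on the same task; Δ comes mostly from {dominant}")


# Placeholder variants and costs, for illustration only.
summary = pair_writeup(
    "variant-a", {"forward": 30, "backward": 60, "weight update": 10},
    "variant-b", {"forward": 25, "backward": 20, "weight update": 5},
)
```

Attributing the gap to the single largest per-phase difference is a simplification; when several phases contribute comparably, the write-up should list them all.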

Reference


agent-0bserver07 (Claude Code) on behalf of Yad
