Luce-Org · davide221 · Jun 14, 2026 · Jun 12, 2026 · Jun 12, 2026 · Jun 12, 2026
diff --git a/optimizations/kvflash/DESIGN.md b/optimizations/kvflash/DESIGN.md
@@ -0,0 +1,221 @@
+# KVFlash design notes
+
+Mechanism details and tuning data behind [README.md](README.md); measured
+tables in [RESULTS.md](RESULTS.md).
+
+FlashMemory-style (arXiv 2606.09079) decode-time KV paging for the qwen35
+target, designed to compose with pflash. Goal: the GPU footprint of the
+full-attention KV cache is a hard O(pool) constant regardless of logical
+context length, with paged-out chunks recallable bit-exact from host.
+
+## Division of labor with pflash
+
+pflash and the pager own different resources and compose cleanly:
+
+| concern | owner |
+|---|---|
+| which prompt chunks the target ever elaborates | pflash (drafter scores, evict at prefill) |
+| which elaborated chunks occupy GPU slots | KvFlashPager (this module) |
+| prefill compute sparsity | pflash BSA kernels |
+| decode-time KV growth (generated tokens) | KvFlashPager (page out cold generated chunks) |
+
+pflash keeps the target from reading the huge context; the pager keeps
+what the target HAS elaborated inside a fixed VRAM budget and makes every
+eviction reversible. The drafter's chunk scores plug into
+`KvFlashPager::score_hook` as the residency policy (LRU fallback in the
+prototype).
+
+## Mechanism
+
+- Cache tensors are allocated at `pool_tokens` (e.g. 1024) instead of
+  `max_ctx` (e.g. 131072). That allocation delta IS the memory saving:
+  a mask over a full-size cache would save nothing.
+- Logical positions map to physical pool slots at 64-token chunk
+  granularity. The mapping rides the existing step-invariant
+  `ggml_set_rows` KV append (`kv_write_rows` carries the physical slot;
+  the `positions` input keeps the logical position for M-RoPE).
+- Decode FA spans the whole pool with an EXACT slot-validity mask
+  (`KvFlashPager::fill_slot_mask`): resident slots 0, free/paged-out -inf.
+  The host-side mask rebuilds only when the pager epoch moves; the device
+  upload happens before EVERY compute. That upload is mandatory, not an
+  optimization: input tensors live in the gallocr compute buffer, whose
+  regions are reused during graph execution, so a once-uploaded mask is
+  garbage by the next step (this masqueraded as a "fattn NaN kernel bug"
+  for a while — all-NaN logits from the second step on; production never
+  hit it because its prefill refills masks per chunk). `--no-mask` falls
+  back to maskless + zeroed freed slots (exp(-max) ~ 0, production's
+  padded-span approximation, measured ~1% argmax flips).
+- Page-out copies a chunk's quantized rows (per layer x K/V x head
+  segments) to a host backing store and zeroes the slots; page-in writes
+  them back. Quantized bytes + baked-in RoPE means the roundtrip is
+  bit-exact and relocation is position-independent.
+- Eviction protects sinks (first chunk) and the trailing window, mirrors
+  FlashMemory's always-resident floor (their last-8K + decoded window).
+  Unlike their sigmoid-threshold fetch (which leaks footprint at 500K,
+  their §3.3.1), a fixed slot pool is a hard budget by construction.
+- DeltaNet/conv recurrent state is fixed-size and never paged.
+
+## What the prototype verifies (test_kvflash)
+
+A. Baseline at logical ctx 128K: reference greedy sequence + KV bytes.
+B. Relocation proof: same workload in a small pool with SHUFFLED block
+   placement, teacher-forced — argmax must track the baseline.
+C. Live paging: pool ≪ prompt+gen, eviction engaged; bit-exact
+   page_out/page_in roundtrip; decode completes; KV bytes vs A ≥ 90% cut.
+
+## Reselect (τ-step lookahead)
+
+`KvFlashPager::reselect()` rebuilds the resident set as the top-pool chunks by
+`score_hook` over all materialized chunks (resident or host-backed),
+keeping sinks and the trailing window unconditionally. Page-outs run
+first so recalls always find free blocks. This is the FlashMemory τ=64
+loop's mechanism; the production caller invokes it every τ decoded
+tokens with fresh drafter scores. Verified in test run D: an evicted
+chunk recalled by a score flip, decode continues across the residency
+change.
+
+## Measured (lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV, 2026-06-11)
+
+All gates PASS (exit 0). 64 timed steps per profile row, junk KV so the
+FA span traffic is bandwidth-realistic:
+
+| config | FA span | ms/step p50 | tok/s |
+|---|---|---|---|
+| baseline 8K   | 8192   | 35.1 | 28.5 |
+| baseline 32K  | 32768  | 30.1 | 33.1 |
+| baseline 128K | 131072 | 45.1 | 22.1 |
+| pool 1K @128K logical | 1024 | 25.1 | 39.6 |
+| pool 4K @128K logical | 4096 | 25.7 | 38.7 |
+
+- attn-KV memory: 2304.0 -> 18.0 MiB (99.2% cut); whole cache buffer
+  2653.6 -> 217.6 MiB, confirmed by VRAM deltas.
+- At 128K-logical decode the pool is 1.8x FASTER than the full cache
+  (45.1 -> 25.1 ms/step): FA cost is span-bound, the pool caps the span.
+- Paging: page_out p50 1.26 ms, page_in p50 0.63 ms per 64-token chunk
+  (~2.2 MiB, synchronous); 12 evictions over 1200 generated tokens
+  amortize to ~0.01 ms/token. reselect() recalling with 20 page events
+  took 21.3 ms — at τ=64 that is ~1% of decode time worst-case.
+- Relocation equivalence: 0.83% argmax flips over 1200 teacher-forced
+  tokens at shuffled placement (gate: ≤1%).
+- Open harness question: the C-loop (live eviction) measured ~34 ms/step
+  vs 25 ms for the identical config in the E-loop; suspected interaction
+  of sustained-load GPU clocks with run ordering, not paging cost (12
+  sync page events explain only ~0.01 ms/token). Re-measure under the
+  production decode loop during integration.
+
+## Full LSA loop (drafter as Memory Indexer) — measured
+
+Test run F implements the paper's complete inference paradigm with the
+pflash drafter (Qwen3-0.6B, `/opt/lucebox/models/drafter/`) standing in
+for the trained indexer: prompt (2048) larger than the pool (1024) so
+prefill itself evicts, then every τ=64 decoded tokens the drafter
+rescores the full sequence (tail attention = indexer query, chunk means
+via `drafter_chunk_scores`), `score_hook` receives the fresh scores, and
+`reselect()` repages the pool.
+
+Measured (RTX 3090, target Qwen3.6-27B Q4_K_M + drafter co-resident):
+- 31.2 tok/s with the loop active; 12 rescores over 768 generated tokens
+- 43 genuine drafter-driven recalls of previously evicted context
+- indexer rescore p50 = 245 ms (full 0.6B re-prefill at ~2-2.8K tokens —
+  ~12% decode overhead at τ=64; drops to ~ms once the drafter's own KV
+  is persisted and only the new τ tokens are pushed through it)
+- reselect p50 = 7.5 ms
+
+vs the paper: their indexer is a trained <0.1% projection head (cheaper
+queries, backbone-supervised labels); ours is the existing 0.6B drafter
+(training-free, already shipped for pflash). Their sigmoid threshold
+leaks footprint at scale (their §3.3.1); our fixed pool is a hard cap.
+
+## Production integration (daemon)
+
+The pool is wired into the qwen35 backend behind `--kvflash <tokens>`
+(env `DFLASH_KVFLASH`; rounded to a 256 multiple) + `--kvflash-tau <N>`
+(env `DFLASH_KVFLASH_TAU`, default 64). Pieces:
+
+- `create_target_cache(..., ctx_alloc)`: attention tensors allocated at
+  pool capacity; `cache.max_ctx` stays the logical bound.
+- `do_prefill`: rows land identity-mapped (prompt must fit the pool —
+  with pflash the compressed prompt does; without it, size the pool);
+  `kvflash_sync_prefill` rebuilds the pager map per request/restore.
+- `do_ar_decode`: `build_target_step(..., kvflash_mask=true)` keeps the
+  step-invariant set_rows write active alongside the slot mask;
+  `kv_write_rows` carries the pool slot; the mask uploads per step;
+  every τ generated tokens `kvflash_maybe_reselect` rescores + repages.
+- Policy is agnostic by construction: `KvFlashScorer` (common/) is the
+  interface; with no scorer the pager runs pure LRU (zero pflash
+  dependency). When pflash loads its drafter, `KvFlashDrafterScorer`
+  (qwen3/) attaches automatically and reselect becomes drafter-driven.
+- Spec decode (chain mode) runs ON the pool: verify_batch slot-maps the
+  draft block via per-token kv_write_rows and builds a slot-space mask
+  (resident committed positions + causal among draft tokens). Rejected
+  drafts need no rollback: the pos < base_pos validity rule excludes
+  their slots until the replay rewrites them. All four spec KV-write
+  sites (verify, both replays, stall-prefix) route through this one
+  function. Verified on the daemon: accept_rate 15.4-15.6% pooled vs
+  15.3% pool-off (matched avg_commit 3.47 vs 3.45), coherent output
+  through a mid-generation pool wrap with live eviction. DDTree's
+  tree-verify is not pool-aware yet and falls back to AR.
+- LAYOUT TRAP (cost a day of debugging): kv_write_rows is
+  [n_tokens, n_head_kv] ne0-major — element (token i, head h) lives at
+  i + h*n_tokens (ggml_set_rows asserts b->ne[1] == c->ne[0]). A
+  transposed fill scrambles per-head row targets for every multi-token
+  write while single-token fills (all entries equal) hide the bug
+  completely.
+- Post-generation snapshots are skipped once cur_pos exceeds the pool
+  (pooled snapshots need page-table serialization; prefill-time
+  snapshots still work).
+
+## Production smokes (dflash_server on lucebox 3090, 2026-06-11)
+
+1. WITHOUT pflash (agnostic LRU): `dflash_server <27B> --kvflash 1024`.
+   41-token prompt + 1400 generated = 1441 logical through a 1024-slot
+   pool (live LRU eviction mid-request). Coherent story end to end,
+   36.9 tok/s, clean finish. Second request (per-request pager reset) ok.
+2. WITH pflash: `--kvflash 2048 --prefill-compression always
+   --prefill-threshold 256 --prefill-drafter <Qwen3-0.6B>`. Compression
+   1468 -> 60 tokens, then `[kvflash] drafter scorer attached (tau=64)`
+   automatically; 400 coherent tokens answering from the compressed
+   context. Same binary, zero pflash-specific configuration on the pool.
+
+Ops note: the init banner is flushed now, but generally `nohup` +
+redirected stdout block-buffers printf output — kill the process (atexit
+flush) before concluding a code path didn't run.
+
+## Quality matrix (synthetic NIAH, needle recall /16, teacher-forced)
+
+| context | residency | LRU d=10/50/90% | drafter d=10/50/90% | control |
+|---|---|---|---|---|
+| 8K   | 25%   | 0 / 0 / 16 | 15 / 15 / 16 | 16/16 |
+| 8K   | 9%    | 0 / 0 / 0  | 15 / 15 / 15 | 16/16 |
+| 32K  | 25%   | 0 / 0 / 16 | 15 / 15 / 16 | 16/16 |
+| 32K  | 9%    | 0 / 0 / 0  | 15 / 15 / 15 | 15-16/16 |
+| 256K | 6.25% | 0 (d=0.5); 16/16 in-window | 14 / 15 / 15 | (in-window LRU = control) |
+
+Drafter-scored residency retains 88-100% of perfect needle recall at every
+depth down to 6-9% residency from 8K to the model's native 256K maximum;
+recency-only LRU retains zero outside its tail window. 256K logistics on
+the RTX 3090: ~6.5 min linear pooled prefill, 4.22 GiB host backing,
+~18 GiB VRAM total, 46 s bisected rescore (drafter forward ceiling ~65K
+per segment).
+
+## Tuned defaults (from the matrix)
+
+- Ship drafter scoring whenever a drafter is available; pure-LRU mode is
+  recency-only and must be documented as such.
+- Pool ~25% of expected context is the conservative default; 9% measured
+  safe for retrieval-style work.
+- tau adapts: rescore costs ~0.11 ms/history-token, so the effective
+  reselect interval is max(configured tau, history/45), capping rescore
+  overhead near 15% of decode time.
+
+## Not in the prototype (next phases)
+
+1. Drafter KV persistence for the indexer (incremental rescore: push
+   only the new τ tokens through the drafter; kills the ~240 ms re-prefill).
+2. Pooled chunked prefill (prompt > pool with eviction during prefill).
+3. Spec-decode verify on the pool (block-aligned multi-token writes).
+4. Pooled snapshot save/restore (serialize the page table + host store).
+5. Async paging on a copy stream (currently synchronous
+   ggml_backend_tensor_get/set between steps).
+6. Quality benches through the harness (NIAH-64K, accept-rate) with the
+   drafter policy active.
diff --git a/optimizations/kvflash/README.md b/optimizations/kvflash/README.md
@@ -0,0 +1,87 @@
+<p align="left">
+  <a href="../README.md">← lucebox-hub</a>
+</p>
+
+<h1 align="center">Luce KVFlash</h1>
+
+<p align="center">
+  <strong>Lookahead sparse attention for dflash. Bounded KV residency on one GPU.</strong><br/>
+  The attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM, bit-exact and recallable.
+  With pflash, its drafter doubles as a Memory Indexer that recalls the context the generation needs next.<br/>
+  Qwen3.6-27B Q4_K_M on a single RTX 3090: <strong>native 256K context at 38.6 tok/s with 72 MiB of resident KV</strong>,
+  needle recall 88-100% at 6% residency, harness accuracy unchanged (32/32 vs full cache).
+</p>
+
+---
+
+```
+                         decode tok/s   KV in VRAM   needle (d=10/50/90%)
+full cache  @  64K            27.8        1152 MiB        16/16
+full cache  @ 128K            19.6        2304 MiB        16/16
+full cache  @ 256K            13.1        4608 MiB        16/16
+KVFlash 4K  @  64K            38.6          72 MiB        14/16
+KVFlash 4K  @ 128K            38.6          72 MiB        14/16
+KVFlash 4K  @ 256K            38.6          72 MiB        15/16
+```
+
+Decode speed is flat at any context length (the per-step KV read is pool-sized,
+not context-sized), prefill is up to 2.8x faster, and a 256K prompt that costs
+4.6 GiB of VRAM as a full cache costs 72 MiB resident + 4.2 GiB of host RAM.
+
+## Usage
+
+```bash
+dflash_server model.gguf --max-ctx 32768 --kvflash 8192          # LRU policy
+dflash_server model.gguf --max-ctx 32768 --kvflash 8192 \
+    --prefill-compression always --prefill-drafter qwen3-0.6b.gguf  # drafter policy
+```
+
+- `--kvflash <tokens>`: resident pool size (rounded to 256; clamped to
+  `--max-ctx`). Env: `DFLASH_KVFLASH`.
+- `--kvflash-tau <N>`: reselect interval floor (default 64; the effective
+  interval grows with history so rescore overhead stays ~15% of decode).
+  Env: `DFLASH_KVFLASH_TAU`.
+
+Sizing rule: without a drafter, pool >= prompt + generation headroom
+(LRU is recency-only memory — an undersized pool can evict the question
+itself). With pflash's drafter attached, 25% of the expected context is a
+conservative default and 6-9% is measured safe for retrieval workloads.
+
+## How it works
+
+- **Pool**: attention KV tensors are allocated at pool size; a pager maps
+  logical positions to slots at 64-token chunk granularity. Cold chunks
+  move to a host backing store (~0.6 ms/chunk) and return bit-exact.
+- **Mask**: attention spans the pool with a slot-validity mask, uploaded
+  before every compute. Exact, and free (25.10 vs 25.52 ms/step maskless).
+- **Reselect**: every tau decoded tokens the scorer re-ranks all chunks
+  (resident or host-backed) and `reselect()` repages the pool — the
+  lookahead loop from FlashMemory (arXiv 2606.09079), with the pflash
+  drafter standing in for their trained indexer, and a hard capacity cap
+  their threshold mechanism lacks.
+- **Spec decode**: chain-mode verify is slot-mapped (per-token
+  `kv_write_rows` + slot-space mask); rejected drafts need no rollback —
+  their slots are excluded by the validity rule until rewritten.
+  Acceptance parity with the full cache (15.4-15.6% vs 15.3%). DDTree
+  falls back to AR while KVFlash is active.
+- **Prefill**: prompts larger than the pool prefill in 64-token chunks at
+  constant VRAM (linear time; 256K in ~5.9 min on the 3090).
+
+Quality verdict (harness ground truth, base-vs-base control included):
+full results in [RESULTS.md](RESULTS.md). Outputs are not guaranteed
+byte-identical to the full cache on long generations (the masked kernel
+path rounds differently — a different deterministic lineage), but
+correctness is identical: 32/32 vs 32/32 across HumanEval, GSM, MATH, and
+agent suites.
+
+## Files
+
+- `server/src/common/kvflash_pager.h` — pool, page table, host store, reselect
+- `server/src/common/kvflash_scorer.h` — chunk-relevance policy interface
+- `server/src/qwen3/qwen3_kvflash_scorer.{h,cpp}` — pflash-drafter scorer
+  (tail attention; bisects on allocation pressure)
+- `server/src/qwen35/*` — cache `ctx_alloc`, masked pooled decode, slot-mapped
+  spec verify, daemon flags
+- `server/test/test_kvflash.cpp` — verification suite (A-F), `--niah`,
+  `--niah256`, `--longab`
+- [DESIGN.md](DESIGN.md) — mechanism details and tuning notes
diff --git a/optimizations/kvflash/RESULTS.md b/optimizations/kvflash/RESULTS.md
@@ -0,0 +1,81 @@
+# KVFlash — measured results
+
+All numbers: single RTX 3090 (24 GB), Qwen3.6-27B Q4_K_M target, Q8_0 KV,
+Qwen3-0.6B pflash drafter as the scorer. June 2026, `test_kvflash` +
+`dflash_server` + `harness/benchmarks`.
+
+## End-to-end long-prompt A/B (`--longab`; needle depth 0.25, 240-token timed free run)
+
+| context | mode | prefill | decode tok/s | needle /16 | KV in VRAM |
+|---|---|---|---|---|---|
+| 32K  | full    | 47.2 s  | 32.8 | 16 | 576 MiB |
+| 32K  | KVFlash 4K | 41.8 s | 29.0 | 15 | 72 MiB |
+| 64K  | full    | 130.6 s | 27.8 | 16 | 1152 MiB |
+| 64K  | KVFlash 4K | 87.5 s | **38.6** | 14 | **72 MiB** |
+| 128K | full    | 335.9 s | 19.6 | 16 | 2304 MiB |
+| 128K | KVFlash 4K | 177.8 s | **38.6** | 14 | **72 MiB** |
+| 256K | full    | 999.0 s | 13.1 | 16 | 4608 MiB |
+| 256K | KVFlash 4K | **354.9 s** | **38.6** | 15 | **72 MiB** |
+
+Decode is flat at 38.6 tok/s from 64K to native-max 256K (speedups 1.4x /
+2.0x / 2.9x); prefill speedups 1.5x / 1.9x / 2.8x. One drafter rescore per
+query: 9-70 s scaling with context (bisected above the drafter's ~65K
+single-pass ceiling).
+
+## Retrieval quality vs residency (synthetic NIAH, teacher-forced /16)
+
+| context | residency | LRU (d=10/50/90%) | drafter (d=10/50/90%) | full control |
+|---|---|---|---|---|
+| 8K   | 25%   | 0 / 0 / 16 | 15 / 15 / 16 | 16/16 |
+| 8K   | 9%    | 0 / 0 / 0  | 15 / 15 / 15 | 16/16 |
+| 32K  | 25%   | 0 / 0 / 16 | 15 / 15 / 16 | 16/16 |
+| 32K  | 9%    | 0 / 0 / 0  | 15 / 15 / 15 | 15-16/16 |
+| 256K | 6.25% | 0 (d=0.5); 16/16 in-window | 14 / 15 / 15 | (in-window LRU = control) |
+
+Drafter-scored residency retains 88-100% of perfect recall at every depth
+down to 6-9% residency; recency-only LRU retains zero outside its tail
+window (mirrors FlashMemory's Recency-Only ablation).
+
+## Harness ground truth (pool sized per the heuristic, vs full cache)
+
+| suite | baseline pass | KVFlash pass | exact text match |
+|---|---|---|---|
+| HumanEval | 10/10 | **10/10** | 10/10 |
+| GSM       | 10/10 | **10/10** | 8/10 |
+| MATH      | 10/10 | **10/10** | 4/10 |
+| agent (to 24K prompts) | 6/6 | **6/6** | 2/6 |
+
+Base-vs-base control: 16/16 byte-identical — the stack is deterministic.
+Text drift under KVFlash is the masked decode kernel's different (equally
+deterministic) rounding lineage, not noise and not a correctness effect.
+
+## Spec decode (chain mode, slot-mapped verify, daemon)
+
+| config | accept rate | avg_commit | output |
+|---|---|---|---|
+| full cache, 2400 tok | 15.3% | 3.45 | coherent |
+| KVFlash 2K, 1800 tok | 15.4% | 3.47 | coherent |
+| KVFlash 2K, 2400 tok (live eviction mid-spec) | 15.6% | 3.49 | coherent |
+
+## Microbenchmarks
+
+- Memory at 128K-logical: attn-KV 2304 -> 18 MiB (99.2%) with a 1K pool;
+  whole cache buffer 2654 -> 218 MiB, confirmed via VRAM deltas.
+- Exact slot mask is free: 25.10 ms/step masked vs 25.52 maskless.
+- Paging: page_out p50 1.27 ms / page_in 0.64 ms per 64-token chunk
+  (~2.2 MiB, synchronous); ~0.01 ms/token amortized at observed rates.
+- reselect() repaging 20 chunks: 21.3 ms.
+- Relocation equivalence (shuffled physical placement, teacher-forced
+  1200 tokens): ~99% argmax agreement; page_out/page_in roundtrip
+  bit-exact.
+
+## Known limits
+
+- DDTree tree-verify is not pool-aware (falls back to AR with KVFlash).
+- Post-generation snapshots are skipped once cur_pos exceeds the pool
+  (pooled snapshots need page-table serialization).
+- Paging is synchronous (copy-stream overlap is a follow-up).
+- Memory-dense tasks needing the entire context at once (MRCR-style) are
+  a paradigm limit shared with FlashMemory; size the pool up for those.
+- 512K+ requires RoPE scaling (model native max is 256K) — memory-side
+  KVFlash already scales (host backing is the only growth).