Skip to content
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
5ee93da
feat(common): KVFlash pager core + chunk-relevance scorer interface
davide221 Jun 12, 2026
a275c51
feat(qwen35): wire KVFlash into the daemon (--kvflash / --kvflash-tau)
davide221 Jun 12, 2026
48da33d
test: KVFlash verification suite (test_kvflash)
davide221 Jun 12, 2026
f268ffb
docs: optimizations/kvflash (README, RESULTS, DESIGN)
davide221 Jun 12, 2026
facecc1
feat(kvflash): port bounded KV residency to qwen35moe, laguna, gemma4
davide221 Jun 12, 2026
be16d30
fix(kvflash): address cubic review findings on PR #373
davide221 Jun 12, 2026
e2f4296
refactor(kvflash): consolidate per-backend duplication into common he…
davide221 Jun 12, 2026
2c9dffe
docs(kvflash): hub README card + hero + Q8_0 footnote on the 256K rows
davide221 Jun 12, 2026
5cb0606
docs(kvflash): state the KV quant in the table headers
davide221 Jun 12, 2026
7a849e0
docs: KVFlash flags in the main README server-flags reference
davide221 Jun 12, 2026
17f6cbc
docs: keep the main-README KVFlash intro model-agnostic
davide221 Jun 12, 2026
9db8472
feat(kvflash): pooled chunked prefill, --kvflash auto, drafter scorer…
davide221 Jun 12, 2026
f699376
feat(kvflash): drafter-scored residency is the default policy
davide221 Jun 12, 2026
a351091
feat(kvflash): cross-tokenizer drafter scoring for laguna/gemma4 + --…
davide221 Jun 12, 2026
5e79666
feat(kvflash): VRAM-aware auto pool sizing
davide221 Jun 12, 2026
321695c
fix(kvflash): pre-ship audit — cubic round 2 + doc refresh
davide221 Jun 12, 2026
470123b
ci: give the ROCm GPU job its own concurrency group
davide221 Jun 12, 2026
58d924d
ci: fail the ROCm job fast with a KFD diagnosis instead of hanging
davide221 Jun 12, 2026
9a17281
feat(kvflash): --ddtree runs on the pool (gate removed)
davide221 Jun 12, 2026
cc42811
ci: ROCm probe survives a D-state hang
davide221 Jun 12, 2026
abb4cf4
feat(kvflash): gemma4 spec decode on the pool + gemma draft-loader re…
davide221 Jun 12, 2026
feef3fd
fix(draft): convert_dflash_to_gguf reads arch from config.json (was 2…
davide221 Jun 13, 2026
273f280
feat(spark): GPU-resident cold experts for MoE spec-decode verify + c…
davide221 Jun 13, 2026
07af613
perf(spark): adaptive verify width for MoE spec decode
davide221 Jun 13, 2026
5186827
perf(spark): per-n_tokens cache for the all-hot batched FFN graph
davide221 Jun 14, 2026
ea9f870
fix(spark): address code review — spare-slot reads, snapshot UAF, kvf…
davide221 Jun 14, 2026
241893f
docs(kvflash): lead usage with explicit --prefill-drafter; correctnes…
davide221 Jun 14, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
221 changes: 221 additions & 0 deletions optimizations/kvflash/DESIGN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,221 @@
# KVFlash design notes

Mechanism details and tuning data behind [README.md](README.md); measured
tables in [RESULTS.md](RESULTS.md).

FlashMemory-style (arXiv 2606.09079) decode-time KV paging for the qwen35
target, designed to compose with pflash. Goal: the GPU footprint of the
full-attention KV cache is a hard O(pool) constant regardless of logical
context length, with paged-out chunks recallable bit-exact from host.

## Division of labor with pflash

pflash and the pager own different resources and compose cleanly:

| concern | owner |
|---|---|
| which prompt chunks the target ever elaborates | pflash (drafter scores, evict at prefill) |
| which elaborated chunks occupy GPU slots | KvFlashPager (this module) |
| prefill compute sparsity | pflash BSA kernels |
| decode-time KV growth (generated tokens) | KvFlashPager (page out cold generated chunks) |

pflash keeps the target from reading the huge context; the pager keeps
what the target HAS elaborated inside a fixed VRAM budget and makes every
eviction reversible. The drafter's chunk scores plug into
`KvFlashPager::score_hook` as the residency policy (LRU fallback in the
prototype).

## Mechanism

- Cache tensors are allocated at `pool_tokens` (e.g. 1024) instead of
`max_ctx` (e.g. 131072). That allocation delta IS the memory saving:
a mask over a full-size cache would save nothing.
- Logical positions map to physical pool slots at 64-token chunk
granularity. The mapping rides the existing step-invariant
`ggml_set_rows` KV append (`kv_write_rows` carries the physical slot;
the `positions` input keeps the logical position for M-RoPE).
- Decode FA spans the whole pool with an EXACT slot-validity mask
(`KvFlashPager::fill_slot_mask`): resident slots 0, free/paged-out -inf.
The host-side mask rebuilds only when the pager epoch moves; the device
upload happens before EVERY compute. That upload is mandatory, not an
optimization: input tensors live in the gallocr compute buffer, whose
regions are reused during graph execution, so a once-uploaded mask is
garbage by the next step (this masqueraded as a "fattn NaN kernel bug"
for a while — all-NaN logits from the second step on; production never
hit it because its prefill refills masks per chunk). `--no-mask` falls
back to maskless + zeroed freed slots (exp(-max) ~ 0, production's
padded-span approximation, measured ~1% argmax flips).
- Page-out copies a chunk's quantized rows (per layer x K/V x head
segments) to a host backing store and zeroes the slots; page-in writes
them back. Quantized bytes + baked-in RoPE means the roundtrip is
bit-exact and relocation is position-independent.
- Eviction protects sinks (first chunk) and the trailing window, mirrors
FlashMemory's always-resident floor (their last-8K + decoded window).
Unlike their sigmoid-threshold fetch (which leaks footprint at 500K,
their §3.3.1), a fixed slot pool is a hard budget by construction.
- DeltaNet/conv recurrent state is fixed-size and never paged.

## What the prototype verifies (test_kvflash)

A. Baseline at logical ctx 128K: reference greedy sequence + KV bytes.
B. Relocation proof: same workload in a small pool with SHUFFLED block
placement, teacher-forced — argmax must track the baseline.
C. Live paging: pool ≪ prompt+gen, eviction engaged; bit-exact
page_out/page_in roundtrip; decode completes; KV bytes vs A ≥ 90% cut.

## Reselect (τ-step lookahead)

`KvFlashPager::reselect()` rebuilds the resident set as the top-pool chunks by
`score_hook` over all materialized chunks (resident or host-backed),
keeping sinks and the trailing window unconditionally. Page-outs run
first so recalls always find free blocks. This is the FlashMemory τ=64
loop's mechanism; the production caller invokes it every τ decoded
tokens with fresh drafter scores. Verified in test run D: an evicted
chunk recalled by a score flip, decode continues across the residency
change.

## Measured (lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV, 2026-06-11)

All gates PASS (exit 0). 64 timed steps per profile row, junk KV so the
FA span traffic is bandwidth-realistic:

| config | FA span | ms/step p50 | tok/s |
|---|---|---|---|
| baseline 8K | 8192 | 35.1 | 28.5 |
| baseline 32K | 32768 | 30.1 | 33.1 |
| baseline 128K | 131072 | 45.1 | 22.1 |
| pool 1K @128K logical | 1024 | 25.1 | 39.6 |
| pool 4K @128K logical | 4096 | 25.7 | 38.7 |

- attn-KV memory: 2304.0 -> 18.0 MiB (99.2% cut); whole cache buffer
2653.6 -> 217.6 MiB, confirmed by VRAM deltas.
- At 128K-logical decode the pool is 1.8x FASTER than the full cache
(45.1 -> 25.1 ms/step): FA cost is span-bound, the pool caps the span.
- Paging: page_out p50 1.26 ms, page_in p50 0.63 ms per 64-token chunk
(~2.2 MiB, synchronous); 12 evictions over 1200 generated tokens
amortize to ~0.01 ms/token. reselect() recalling with 20 page events
took 21.3 ms — at τ=64 that is ~1% of decode time worst-case.
- Relocation equivalence: 0.83% argmax flips over 1200 teacher-forced
tokens at shuffled placement (gate: ≤1%).
- Open harness question: the C-loop (live eviction) measured ~34 ms/step
vs 25 ms for the identical config in the E-loop; suspected interaction
of sustained-load GPU clocks with run ordering, not paging cost (12
sync page events explain only ~0.01 ms/token). Re-measure under the
production decode loop during integration.

## Full LSA loop (drafter as Memory Indexer) — measured

Test run F implements the paper's complete inference paradigm with the
pflash drafter (Qwen3-0.6B, `/opt/lucebox/models/drafter/`) standing in
for the trained indexer: prompt (2048) larger than the pool (1024) so
prefill itself evicts, then every τ=64 decoded tokens the drafter
rescores the full sequence (tail attention = indexer query, chunk means
via `drafter_chunk_scores`), `score_hook` receives the fresh scores, and
`reselect()` repages the pool.

Measured (RTX 3090, target Qwen3.6-27B Q4_K_M + drafter co-resident):
- 31.2 tok/s with the loop active; 12 rescores over 768 generated tokens
- 43 genuine drafter-driven recalls of previously evicted context
- indexer rescore p50 = 245 ms (full 0.6B re-prefill at ~2-2.8K tokens —
~12% decode overhead at τ=64; drops to ~ms once the drafter's own KV
is persisted and only the new τ tokens are pushed through it)
- reselect p50 = 7.5 ms

vs the paper: their indexer is a trained <0.1% projection head (cheaper
queries, backbone-supervised labels); ours is the existing 0.6B drafter
(training-free, already shipped for pflash). Their sigmoid threshold
leaks footprint at scale (their §3.3.1); our fixed pool is a hard cap.

## Production integration (daemon)

The pool is wired into the qwen35 backend behind `--kvflash <tokens>`
(env `DFLASH_KVFLASH`; rounded to a 256 multiple) + `--kvflash-tau <N>`
(env `DFLASH_KVFLASH_TAU`, default 64). Pieces:

- `create_target_cache(..., ctx_alloc)`: attention tensors allocated at
pool capacity; `cache.max_ctx` stays the logical bound.
- `do_prefill`: rows land identity-mapped (prompt must fit the pool —
with pflash the compressed prompt does; without it, size the pool);
`kvflash_sync_prefill` rebuilds the pager map per request/restore.
- `do_ar_decode`: `build_target_step(..., kvflash_mask=true)` keeps the
step-invariant set_rows write active alongside the slot mask;
`kv_write_rows` carries the pool slot; the mask uploads per step;
every τ generated tokens `kvflash_maybe_reselect` rescores + repages.
- Policy is agnostic by construction: `KvFlashScorer` (common/) is the
interface; with no scorer the pager runs pure LRU (zero pflash
dependency). When pflash loads its drafter, `KvFlashDrafterScorer`
(qwen3/) attaches automatically and reselect becomes drafter-driven.
- Spec decode (chain mode) runs ON the pool: verify_batch slot-maps the
draft block via per-token kv_write_rows and builds a slot-space mask
(resident committed positions + causal among draft tokens). Rejected
drafts need no rollback: the pos < base_pos validity rule excludes
their slots until the replay rewrites them. All four spec KV-write
sites (verify, both replays, stall-prefix) route through this one
function. Verified on the daemon: accept_rate 15.4-15.6% pooled vs
15.3% pool-off (matched avg_commit 3.47 vs 3.45), coherent output
through a mid-generation pool wrap with live eviction. DDTree's
tree-verify is not pool-aware yet and falls back to AR.
- LAYOUT TRAP (cost a day of debugging): kv_write_rows is
[n_tokens, n_head_kv] ne0-major — element (token i, head h) lives at
i + h*n_tokens (ggml_set_rows asserts b->ne[1] == c->ne[0]). A
transposed fill scrambles per-head row targets for every multi-token
write while single-token fills (all entries equal) hide the bug
completely.
- Post-generation snapshots are skipped once cur_pos exceeds the pool
(pooled snapshots need page-table serialization; prefill-time
snapshots still work).

## Production smokes (dflash_server on lucebox 3090, 2026-06-11)

1. WITHOUT pflash (agnostic LRU): `dflash_server <27B> --kvflash 1024`.
41-token prompt + 1400 generated = 1441 logical through a 1024-slot
pool (live LRU eviction mid-request). Coherent story end to end,
36.9 tok/s, clean finish. Second request (per-request pager reset) ok.
2. WITH pflash: `--kvflash 2048 --prefill-compression always
--prefill-threshold 256 --prefill-drafter <Qwen3-0.6B>`. Compression
1468 -> 60 tokens, then `[kvflash] drafter scorer attached (tau=64)`
automatically; 400 coherent tokens answering from the compressed
context. Same binary, zero pflash-specific configuration on the pool.

Ops note: the init banner is flushed now, but generally `nohup` +
redirected stdout block-buffers printf output — kill the process (atexit
flush) before concluding a code path didn't run.

## Quality matrix (synthetic NIAH, needle recall /16, teacher-forced)

| context | residency | LRU d=10/50/90% | drafter d=10/50/90% | control |
|---|---|---|---|---|
| 8K | 25% | 0 / 0 / 16 | 15 / 15 / 16 | 16/16 |
| 8K | 9% | 0 / 0 / 0 | 15 / 15 / 15 | 16/16 |
| 32K | 25% | 0 / 0 / 16 | 15 / 15 / 16 | 16/16 |
| 32K | 9% | 0 / 0 / 0 | 15 / 15 / 15 | 15-16/16 |
| 256K | 6.25% | 0 (d=0.5); 16/16 in-window | 14 / 15 / 15 | (in-window LRU = control) |

Drafter-scored residency retains 88-100% of perfect needle recall at every
depth down to 6-9% residency from 8K to the model's native 256K maximum;
recency-only LRU retains zero outside its tail window. 256K logistics on
the RTX 3090: ~6.5 min linear pooled prefill, 4.22 GiB host backing,
~18 GiB VRAM total, 46 s bisected rescore (drafter forward ceiling ~65K
per segment).

## Tuned defaults (from the matrix)

- Ship drafter scoring whenever a drafter is available; pure-LRU mode is
recency-only and must be documented as such.
- Pool ~25% of expected context is the conservative default; 9% measured
safe for retrieval-style work.
- tau adapts: rescore costs ~0.11 ms/history-token, so the effective
reselect interval is max(configured tau, history/45), capping rescore
overhead near 15% of decode time.

## Not in the prototype (next phases)

1. Drafter KV persistence for the indexer (incremental rescore: push
only the new τ tokens through the drafter; kills the ~240 ms re-prefill).
2. Pooled chunked prefill (prompt > pool with eviction during prefill).
3. Spec-decode verify on the pool (block-aligned multi-token writes).
4. Pooled snapshot save/restore (serialize the page table + host store).
5. Async paging on a copy stream (currently synchronous
ggml_backend_tensor_get/set between steps).
6. Quality benches through the harness (NIAH-64K, accept-rate) with the
drafter policy active.
87 changes: 87 additions & 0 deletions optimizations/kvflash/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
<p align="left">
<a href="../README.md">← lucebox-hub</a>
</p>

<h1 align="center">Luce KVFlash</h1>

<p align="center">
<strong>Lookahead sparse attention for dflash. Bounded KV residency on one GPU.</strong><br/>
The attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM, bit-exact and recallable.
With pflash, its drafter doubles as a Memory Indexer that recalls the context the generation needs next.<br/>
Qwen3.6-27B Q4_K_M on a single RTX 3090: <strong>native 256K context at 38.6 tok/s with 72 MiB of resident KV</strong>,
needle recall 88-100% at 6% residency, harness accuracy unchanged (32/32 vs full cache).
</p>

---

```
decode tok/s KV in VRAM needle (d=10/50/90%)
full cache @ 64K 27.8 1152 MiB 16/16
full cache @ 128K 19.6 2304 MiB 16/16
full cache @ 256K 13.1 4608 MiB 16/16
KVFlash 4K @ 64K 38.6 72 MiB 14/16
KVFlash 4K @ 128K 38.6 72 MiB 14/16
KVFlash 4K @ 256K 38.6 72 MiB 15/16
```

Decode speed is flat at any context length (the per-step KV read is pool-sized,
not context-sized), prefill is up to 2.8x faster, and a 256K prompt that costs
4.6 GiB of VRAM as a full cache costs 72 MiB resident + 4.2 GiB of host RAM.

## Usage

```bash
dflash_server model.gguf --max-ctx 32768 --kvflash 8192 # LRU policy
dflash_server model.gguf --max-ctx 32768 --kvflash 8192 \
--prefill-compression always --prefill-drafter qwen3-0.6b.gguf # drafter policy
```

- `--kvflash <tokens>`: resident pool size (rounded to 256; clamped to
`--max-ctx`). Env: `DFLASH_KVFLASH`.
- `--kvflash-tau <N>`: reselect interval floor (default 64; the effective
interval grows with history so rescore overhead stays ~15% of decode).
Env: `DFLASH_KVFLASH_TAU`.

Sizing rule: without a drafter, pool >= prompt + generation headroom
(LRU is recency-only memory — an undersized pool can evict the question
itself). With pflash's drafter attached, 25% of the expected context is a
conservative default and 6-9% is measured safe for retrieval workloads.

## How it works

- **Pool**: attention KV tensors are allocated at pool size; a pager maps
logical positions to slots at 64-token chunk granularity. Cold chunks
move to a host backing store (~0.6 ms/chunk) and return bit-exact.
- **Mask**: attention spans the pool with a slot-validity mask, uploaded
before every compute. Exact, and free (25.10 vs 25.52 ms/step maskless).
- **Reselect**: every tau decoded tokens the scorer re-ranks all chunks
(resident or host-backed) and `reselect()` repages the pool — the
lookahead loop from FlashMemory (arXiv 2606.09079), with the pflash
drafter standing in for their trained indexer, and a hard capacity cap
their threshold mechanism lacks.
- **Spec decode**: chain-mode verify is slot-mapped (per-token
`kv_write_rows` + slot-space mask); rejected drafts need no rollback —
their slots are excluded by the validity rule until rewritten.
Acceptance parity with the full cache (15.4-15.6% vs 15.3%). DDTree
falls back to AR while KVFlash is active.
- **Prefill**: prompts larger than the pool prefill in 64-token chunks at
constant VRAM (linear time; 256K in ~5.9 min on the 3090).

Quality verdict (harness ground truth, base-vs-base control included):
full results in [RESULTS.md](RESULTS.md). Outputs are not guaranteed
byte-identical to the full cache on long generations (the masked kernel
path rounds differently — a different deterministic lineage), but
correctness is identical: 32/32 vs 32/32 across HumanEval, GSM, MATH, and
agent suites.

## Files

- `server/src/common/kvflash_pager.h` — pool, page table, host store, reselect
- `server/src/common/kvflash_scorer.h` — chunk-relevance policy interface
- `server/src/qwen3/qwen3_kvflash_scorer.{h,cpp}` — pflash-drafter scorer
(tail attention; bisects on allocation pressure)
- `server/src/qwen35/*` — cache `ctx_alloc`, masked pooled decode, slot-mapped
spec verify, daemon flags
- `server/test/test_kvflash.cpp` — verification suite (A-F), `--niah`,
`--niah256`, `--longab`
- [DESIGN.md](DESIGN.md) — mechanism details and tuning notes
81 changes: 81 additions & 0 deletions optimizations/kvflash/RESULTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
# KVFlash — measured results

All numbers: single RTX 3090 (24 GB), Qwen3.6-27B Q4_K_M target, Q8_0 KV,
Qwen3-0.6B pflash drafter as the scorer. June 2026, `test_kvflash` +
`dflash_server` + `harness/benchmarks`.

## End-to-end long-prompt A/B (`--longab`; needle depth 0.25, 240-token timed free run)

| context | mode | prefill | decode tok/s | needle /16 | KV in VRAM |
|---|---|---|---|---|---|
| 32K | full | 47.2 s | 32.8 | 16 | 576 MiB |
| 32K | KVFlash 4K | 41.8 s | 29.0 | 15 | 72 MiB |
| 64K | full | 130.6 s | 27.8 | 16 | 1152 MiB |
| 64K | KVFlash 4K | 87.5 s | **38.6** | 14 | **72 MiB** |
| 128K | full | 335.9 s | 19.6 | 16 | 2304 MiB |
| 128K | KVFlash 4K | 177.8 s | **38.6** | 14 | **72 MiB** |
| 256K | full | 999.0 s | 13.1 | 16 | 4608 MiB |
| 256K | KVFlash 4K | **354.9 s** | **38.6** | 15 | **72 MiB** |

Decode is flat at 38.6 tok/s from 64K to native-max 256K (speedups 1.4x /
2.0x / 2.9x); prefill speedups 1.5x / 1.9x / 2.8x. One drafter rescore per
query: 9-70 s scaling with context (bisected above the drafter's ~65K
single-pass ceiling).

## Retrieval quality vs residency (synthetic NIAH, teacher-forced /16)

| context | residency | LRU (d=10/50/90%) | drafter (d=10/50/90%) | full control |
|---|---|---|---|---|
| 8K | 25% | 0 / 0 / 16 | 15 / 15 / 16 | 16/16 |
| 8K | 9% | 0 / 0 / 0 | 15 / 15 / 15 | 16/16 |
| 32K | 25% | 0 / 0 / 16 | 15 / 15 / 16 | 16/16 |
| 32K | 9% | 0 / 0 / 0 | 15 / 15 / 15 | 15-16/16 |
| 256K | 6.25% | 0 (d=0.5); 16/16 in-window | 14 / 15 / 15 | (in-window LRU = control) |

Drafter-scored residency retains 88-100% of perfect recall at every depth
down to 6-9% residency; recency-only LRU retains zero outside its tail
window (mirrors FlashMemory's Recency-Only ablation).

## Harness ground truth (pool sized per the heuristic, vs full cache)

| suite | baseline pass | KVFlash pass | exact text match |
|---|---|---|---|
| HumanEval | 10/10 | **10/10** | 10/10 |
| GSM | 10/10 | **10/10** | 8/10 |
| MATH | 10/10 | **10/10** | 4/10 |
| agent (to 24K prompts) | 6/6 | **6/6** | 2/6 |

Base-vs-base control: 16/16 byte-identical — the stack is deterministic.
Text drift under KVFlash is the masked decode kernel's different (equally
deterministic) rounding lineage, not noise and not a correctness effect.

## Spec decode (chain mode, slot-mapped verify, daemon)

| config | accept rate | avg_commit | output |
|---|---|---|---|
| full cache, 2400 tok | 15.3% | 3.45 | coherent |
| KVFlash 2K, 1800 tok | 15.4% | 3.47 | coherent |
| KVFlash 2K, 2400 tok (live eviction mid-spec) | 15.6% | 3.49 | coherent |

## Microbenchmarks

- Memory at 128K-logical: attn-KV 2304 -> 18 MiB (99.2%) with a 1K pool;
whole cache buffer 2654 -> 218 MiB, confirmed via VRAM deltas.
- Exact slot mask is free: 25.10 ms/step masked vs 25.52 maskless.
- Paging: page_out p50 1.27 ms / page_in 0.64 ms per 64-token chunk
(~2.2 MiB, synchronous); ~0.01 ms/token amortized at observed rates.
- reselect() repaging 20 chunks: 21.3 ms.
- Relocation equivalence (shuffled physical placement, teacher-forced
1200 tokens): ~99% argmax agreement; page_out/page_in roundtrip
bit-exact.

## Known limits

- DDTree tree-verify is not pool-aware (falls back to AR with KVFlash).
- Post-generation snapshots are skipped once cur_pos exceeds the pool
(pooled snapshots need page-table serialization).
- Paging is synchronous (copy-stream overlap is a follow-up).
- Memory-dense tasks needing the entire context at once (MRCR-style) are
a paradigm limit shared with FlashMemory; size the pool up for those.
- 512K+ requires RoPE scaling (model native max is 256K) — memory-side
KVFlash already scales (host backing is the only growth).
Loading
Loading