KVFlash: bounded KV residency (lookahead sparse attention) for dflash by davide221 · Pull Request #373 · Luce-Org/lucebox-hub

davide221 · 2026-06-12T08:20:32Z

KVFlash: bounded KV residency (lookahead sparse attention) for dflash

FlashMemory-style (arXiv 2606.09079) decode-time KV paging behind a new --kvflash <tokens> flag. The full-attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM bit-exact and recallable. GPU KV footprint becomes a hard O(pool) constant at any logical context length.

Full docs in optimizations/kvflash/ (README, RESULTS, DESIGN).

Headline numbers (lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV)

| context | mode | prefill | decode tok/s | needle /16 | KV in VRAM |
|---|---|---|---|---|---|
| 64K | full cache | 130.6 s | 27.8 | 16 | 1152 MiB |
| 64K | KVFlash 4K | 87.5 s | 38.6 | 14 | 72 MiB |
| 128K | full cache | 335.9 s | 19.6 | 16 | 2304 MiB |
| 128K | KVFlash 4K | 177.8 s | 38.6 | 14 | 72 MiB |
| 256K | full cache | 999.0 s | 13.1 | 16 | 4608 MiB |
| 256K | KVFlash 4K | 354.9 s | 38.6 | 15 | 72 MiB |

Decode is flat at 38.6 tok/s from 64K to the model's native 256K maximum (1.4x / 2.0x / 2.9x over the full cache), prefill is up to 2.8x faster, and attn-KV memory drops 99.2% (2304 to 18 MiB at 128K with a 1K pool).
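
The footprint figures can be cross-checked from the table itself: the full-cache rows imply a fixed per-token KV cost, and the pooled rows are that cost times the pool size. The sketch below is only that arithmetic; the constant and function names are illustrative and not part of the PR.

```cpp
#include <cassert>
#include <cstdint>

// Per-token attention-KV cost implied by the full-cache rows of the table:
// 1152 MiB at 65536 tokens (Qwen3.6-27B, Q8_0 KV) is 18 KiB per token.
// All names here are hypothetical, chosen for this sketch.
constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kMiB = 1024 * kKiB;
constexpr std::uint64_t kBytesPerToken = 1152 * kMiB / 65536;

// Resident KV bytes for a given number of allocated token rows.
constexpr std::uint64_t kv_bytes(std::uint64_t tokens) {
    return tokens * kBytesPerToken;
}

static_assert(kBytesPerToken == 18 * kKiB, "18 KiB per token");
static_assert(kv_bytes(262144) == 4608 * kMiB, "256K full cache");
static_assert(kv_bytes(4096) == 72 * kMiB, "4K pool");
static_assert(kv_bytes(1024) == 18 * kMiB, "1K pool");
```

With a 1K pool against the 2304 MiB full cache at 128K, 18 / 2304 is about 0.8%, which matches the quoted 99.2% reduction.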

How it works

Attention KV tensors are allocated at pool size (create_target_cache gains ctx_alloc); cache.max_ctx stays the logical bound. The allocation delta IS the saving.
A pager (common/kvflash_pager.h) maps logical positions to pool slots at 64-token chunk granularity, riding the existing step-invariant set_rows KV append. RoPE is baked into K rows at write time, so relocation is legal; page-out/page-in moves raw quantized bytes and is bit-exact.
Decode attends over the pool with an exact slot-validity mask, re-uploaded before every compute (gallocr reuses input regions during graph execution). The mask is free: 25.10 vs 25.52 ms/step maskless.
Every tau decoded tokens (default 64, self-throttling) the scorer re-ranks all chunks and reselect() repages the pool: the paper's lookahead loop, with a hard capacity cap their sigmoid threshold lacks.
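
The position-to-slot mapping described above can be illustrated with a toy page table. This is a minimal sketch under simplifying assumptions, not the real KvFlashPager: it has no host backing store, no protected sinks or tail window, and uses plain LRU. The class and member names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical toy pager: maps logical positions to pool slots at 64-token
// chunk granularity and evicts the least recently used block when full.
class ToyPager {
public:
    static constexpr int kChunk = 64;

    explicit ToyPager(int pool_tokens)
        : owner_(pool_tokens / kChunk, -1), last_use_(pool_tokens / kChunk, 0) {}

    // Returns the pool slot (KV row) for a logical position, making the
    // position's chunk resident first if it is not already.
    int slot_for(int pos) {
        const int chunk = pos / kChunk;
        int block;
        auto it = block_of_.find(chunk);
        if (it != block_of_.end()) {
            block = it->second;
        } else {
            block = pick_victim();
            // A real pager would copy the victim's bytes to host RAM here.
            if (owner_[block] >= 0) block_of_.erase(owner_[block]);
            owner_[block] = chunk;
            block_of_[chunk] = block;
        }
        last_use_[block] = ++clock_;
        return block * kChunk + pos % kChunk;
    }

    bool resident(int pos) const { return block_of_.count(pos / kChunk) != 0; }

private:
    // Prefer a free block; otherwise take the least recently used one.
    int pick_victim() const {
        int best = 0;
        for (int b = 0; b < static_cast<int>(owner_.size()); ++b) {
            if (owner_[b] < 0) return b;
            if (last_use_[b] < last_use_[best]) best = b;
        }
        return best;
    }

    std::vector<int> owner_;                 // block -> logical chunk, -1 if free
    std::vector<std::uint64_t> last_use_;    // block -> last access tick
    std::unordered_map<int, int> block_of_;  // logical chunk -> block
    std::uint64_t clock_ = 0;
};
```

With a 128-token pool (two blocks), positions 0 and 70 land in slots 0 and 70; position 130 then evicts the chunk holding position 0 and reuses its block, landing in slot 2. Relocation like this is only safe because, as the PR notes, RoPE is already baked into the K rows when they are written.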

Policy is pluggable, pflash is optional

KvFlashScorer (common/) is the policy seam. With no scorer the pool runs pure LRU (zero pflash dependency, recency-only memory). When pflash loads its drafter, KvFlashDrafterScorer attaches automatically and reselect becomes relevance-driven: needle recall holds at 88-100% down to 6-9% residency from 8K to 256K, where LRU scores 0 outside its tail window.
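
A relevance-driven reselect under a hard capacity cap might look like the following. This is a sketch assuming that a few sink chunks and a recency tail are always kept and that the remaining capacity goes to the highest-scoring chunks; the function name and parameters are hypothetical, and the real reselect() may rank and tie-break differently.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical capacity-capped selection. Returns the sorted chunk indices
// that should stay resident in the pool.
std::vector<int> select_resident(const std::vector<float>& scores,
                                 int capacity_chunks, int sink_chunks,
                                 int tail_chunks) {
    const int n = static_cast<int>(scores.size());
    std::vector<char> keep(n, 0);
    int used = 0;
    // Sinks (oldest chunks) and the recency tail are always resident.
    for (int c = 0; c < n && c < sink_chunks; ++c) { keep[c] = 1; ++used; }
    for (int c = std::max(sink_chunks, n - tail_chunks); c < n; ++c) {
        keep[c] = 1;
        ++used;
    }
    // Fill the remaining capacity with the highest-scoring chunks.
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return scores[a] > scores[b]; });
    for (int c : order) {
        if (used >= capacity_chunks) break;
        if (!keep[c]) { keep[c] = 1; ++used; }
    }
    std::vector<int> out;
    for (int c = 0; c < n; ++c) {
        if (keep[c]) out.push_back(c);
    }
    return out;
}
```

Because the loop stops at `capacity_chunks`, the resident set can never outgrow the pool regardless of how many chunks score highly, which is the property the PR contrasts with a threshold-only policy.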

Spec decode runs on the pool

Chain-mode verify_batch slot-maps the draft block (per-token kv_write_rows, which is [n_tokens, n_head_kv] ne0-major) and builds a slot-space mask. Rejected drafts need no rollback: the pos < base_pos validity rule excludes their slots until rewritten. Acceptance parity measured on the daemon: 15.4-15.6% pooled vs 15.3% full cache. DDTree tree-verify is not pool-aware yet and falls back to AR with a one-time warning.
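
The reason rejected drafts need no rollback can be shown with the validity rule alone. The sketch below covers only the committed-history part of the mask (positions below base_pos); the rows written by the current step would additionally be unmasked causally. The function name and the `slot_pos` table are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// slot_pos[s] is the logical position stored in pool slot s, or -1 if free.
// Returns an additive attention mask: 0 for attendable, -inf for masked.
// Hypothetical helper illustrating the "pos < base_pos" rule.
std::vector<float> slot_validity_mask(const std::vector<int>& slot_pos,
                                      int base_pos) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(slot_pos.size(), neg_inf);
    for (std::size_t s = 0; s < slot_pos.size(); ++s) {
        // Free slots and slots holding not-yet-committed positions stay masked.
        if (slot_pos[s] >= 0 && slot_pos[s] < base_pos) mask[s] = 0.0f;
    }
    return mask;
}
```

A slot that still holds a rejected draft at position 131 is masked while base_pos is 131, and becomes visible only after a later step rewrites it and base_pos advances past it.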

Quality

Harness ground truth with the pool sized per the heuristic: HumanEval 10/10, GSM 10/10, MATH 10/10, agent 6/6, identical to the full-cache baseline (base-vs-base control: 16/16 byte-identical, so the stack is deterministic; text drift under KVFlash is the masked kernel's different deterministic rounding lineage, not a correctness effect).

Verification

test_kvflash suite A-F: full-cache baseline, shuffled-relocation equivalence (0.83% argmax flips, gate 2%), live paging with bit-exact roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode profile, full LSA loop with the drafter as Memory Indexer.
Daemon smokes: agnostic LRU (1441 logical tokens through a 1024-slot pool, live eviction mid-request, coherent at 36.9 tok/s), pflash + drafter scorer auto-attach, spec decode with mid-generation pool wrap, two-request pager reset.
Rebased onto current main (PRs 364/370/371); end-of-prefill snapshot block and kvflash prefill sync coexist, rebuilt and re-smoked on the 3090.

Known limits (documented in RESULTS.md)

DDTree falls back to AR while KVFlash is active.
Post-generation snapshots are skipped once cur_pos exceeds the pool (pooled snapshots need page-table serialization); prefill-time snapshots work.
Paging is synchronous; copy-stream overlap is a follow-up.
Memory-dense tasks that need the whole context at once (MRCR-style) are a paradigm limit shared with FlashMemory; size the pool up for those.

Usage

dflash_server model.gguf --max-ctx 262144 --kvflash 4096            # LRU policy
dflash_server model.gguf --max-ctx 262144 --kvflash 4096 \
    --prefill-compression always --prefill-drafter qwen3-0.6b.gguf  # drafter policy

🧙 Built with WOZCODE

KvFlashPager: bounded resident pool for the full-attention KV cache (FlashMemory-style lookahead sparse attention, arXiv 2606.09079). Logical positions map to physical pool slots at 64-token chunk granularity; cold chunks page to a host backing store bit-exact and recallable. GPU footprint is a hard O(pool) bound at any context length.

KvFlashScorer: dependency-free chunk-relevance policy interface. With no scorer the pager runs pure LRU; KvFlashDrafterScorer adapts the pflash Qwen3-0.6B drafter (tail-attention chunk scores, z-normalized, bisecting on allocation pressure) so reselect becomes relevance-driven.

Co-Authored-By: WOZCODE <contact@withwoz.com>

- create_target_cache gains ctx_alloc: attention KV tensors allocate at pool capacity while cache.max_ctx stays the logical bound.
- build_target_step gains kvflash_mask: pooled decode keeps the step-invariant set_rows KV append active alongside an exact slot-validity mask (uploaded before every compute; gallocr reuses input regions during graph execution, so a stale mask is garbage).
- do_ar_decode routes kv_write_rows through the pager slot, pushes history, and reselects every tau decoded tokens (effective interval max(tau, history/45) caps rescore overhead near 15%).
- Spec decode (chain) verifies ON the pool: verify_batch slot-maps the draft block (kv_write_rows is [n_tokens, n_head_kv] ne0-major) and builds a slot-space mask; rejected drafts need no rollback since the pos < base_pos validity rule excludes their slots until rewritten. DDTree tree-verify is not pool-aware and falls back to AR.
- pflash synergy: when the prefill drafter loads, KvFlashDrafterScorer attaches automatically; without it the pool runs LRU (fully agnostic).
- Post-generation snapshots are skipped once cur_pos exceeds the pool; prompts must fit the pool (clear error otherwise); pool size clamps to --max-ctx with a warning.

Co-Authored-By: WOZCODE <contact@withwoz.com>
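
The self-throttling interval quoted in this commit message, max(tau, history/45), can be written out directly. The function name is hypothetical and integer division is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>

// Effective reselect interval: never more frequent than tau, and stretched
// in proportion to the history length once history / 45 exceeds tau.
int effective_tau(int tau, int history_tokens) {
    return std::max(tau, history_tokens / 45);
}
```

With the default tau of 64 the interval stays at 64 until the history passes 2,880 tokens, then grows linearly; at a 262,144-token history it is 5,825 decoded tokens between reselects.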

Gated suite A-F: full-cache baseline, shuffled-relocation equivalence (<=2% argmax flips), live paging with bit-exact page_out/page_in roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode profile, and the full LSA loop with the drafter as Memory Indexer. Modes: --niah / --niah256 (needle recall vs residency), --longab (end-to-end long-prompt A/B, per-process configs for clean VRAM), --no-mask. Co-Authored-By: WOZCODE <contact@withwoz.com>

Measured on lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV: decode flat at 38.6 tok/s from 64K to native-max 256K (2.9x over full cache at 256K), 72 MiB resident KV vs 4608 MiB, prefill up to 2.8x faster, needle recall 88-100% at 6-9% residency with the drafter policy, harness ground truth 32/32 vs 32/32, spec acceptance at parity. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

6 issues found across 18 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/server/server_main.cpp">

<violation number="1" location="server/src/server/server_main.cpp:411">
P2: Missing input validation for --kvflash token count. The value is stored raw via setenv without any validation that it is a positive integer. Every other numeric flag in this block (--spark-slots, --ddtree-budget, --fa-window, --chunk, etc.) parses with std::atoi and validates. Passing non-numeric, zero, or negative input will silently set DFLASH_KVFLASH to garbage, deferring the failure to an opaque downstream atoi call rather than failing early with a clear error message.</violation>
</file>

<file name="server/src/qwen3/qwen3_kvflash_scorer.cpp">

<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.cpp:110">
P2: `score_chunks` divides by `chunk_tokens` without validating it, which can crash on invalid input.</violation>
</file>

<file name="server/src/qwen35/graph_builders.h">

<violation number="1" location="server/src/qwen35/graph_builders.h:71">
P3: Header comment for `kvflash_mask` incorrectly states it is "Only meaningful with n_tokens == 1", but the parameter is actively used with `n_tokens > 1` in the verify_batch/spec-decode path (qwen35_dflash_target.cpp:63), and the implementation in graph_builders.cpp:291-296 explicitly describes support for "multi-token ... forwards (decode AND spec verify)". The header constraint is misleading and contradicts both the implementation comment and actual usage.</violation>
</file>

<file name="server/src/common/kvflash_pager.h">

<violation number="1" location="server/src/common/kvflash_pager.h:70">
P1: `attach()` does not validate that pool capacity leaves at least one evictable chunk, so small pools can deadlock eviction and make `slot_for()` fail with `-1`.</violation>
</file>

<file name="server/src/qwen3/qwen3_kvflash_scorer.h">

<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.h:7">
P3: Stale documentation reference: the header comment says 'see common/kv_scorer.h' but no such file exists. The correct base-class header is `common/kvflash_scorer.h` (confirmed at `server/src/common/kvflash_scorer.h`). This will mislead developers looking for the dependency-free interface description.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:1304">
P1: `slot_for()` failure is unchecked, so kvflash can write to KV row `-1` when the pool has no evictable block.</violation>
</file>


The pager core is architecture-blind; this routes each backend's KV writes and masks through it so --kvflash works on every model family the daemon serves.

- qwen35moe (Qwen3.6-35B-A3B): the non-hybrid path inherits qwen35. The Spark pipelined hybrid decode gains a kv_slot parameter; the cached per-layer FA span clamps to the pool, so the cached graph stops rebuilding once the window reaches pool size. The pool span stays maskless like the rest of that path: the pager zeroes freed blocks (page-out + zero_free_blocks on request reset), the same zero-row approximation production padding already relies on. Hybrid spec decode (literal-offset KV writes) falls back to pipelined AR.
- laguna: all 40 layers pooled. laguna_step/_hybrid take a const pager; full + SWA masks are built in SLOT space via fill_slot_pos. SWA exactness from a protected tail >= sliding_window. Legacy per-layer hybrid decode and NO_KVPAD/PAD_CPY/no_mask ablations are refused under kvflash.
- gemma4: pools FULL-attention layers only (SWA layers already ring-buffer; KV-reuse layers share their source tensors). Slot-space full mask; FA span and mask width clamp to tensor capacity. Mutually exclusive with --fa-window; spec verify falls back to AR.
- pager: new const helpers slot_of / fill_slot_pos (slot-space mask construction) and zero_free_blocks (request-reset hygiene for maskless consumers); kvflash state in Qwen35Backend moved to protected for the MoE subclass.
- guards everywhere: prompt-fits-pool on every prefill/restore path, snapshots refused after the first relocation on laguna/gemma4.

Smoked on the 3090, pool 1024 / max-ctx 8192 with live LRU eviction mid-request: A3B Spark hybrid 101.6 tok/s, laguna 137.1, gemma4 119.0, all coherent; gemma4 no-flag control unchanged (120.2).

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T10:21:44Z

Update: KVFlash now covers every architecture the daemon serves

--kvflash was qwen35-only at PR open; this push ports it to the other three backends. The pager core (common/kvflash_pager.h) was already architecture-blind; each backend now routes its KV writes and masks through it:

| arch | model smoked | integration | decode |
|---|---|---|---|
| qwen35moe | Qwen3.6-35B-A3B (Spark hybrid, 9403 hot / 837 cold experts) | `pipelined_decode_one_token` gains `kv_slot`; cached per-layer FA span clamps to the pool (graph stops rebuilding at pool size); maskless pool span backed by pager-zeroed free blocks; hybrid spec falls back to pipelined AR | 101.6 tok/s coherent |
| laguna | Laguna-XS.2 (Spark hybrid, single-graph decode) | `laguna_step(_hybrid)` take a const pager; full + SWA masks built in SLOT space via the new `fill_slot_pos`; protected tail >= sliding_window keeps SWA exact; all 40 layers pooled | 137.1 tok/s coherent |
| gemma4 | Gemma4 26B-A4B | pools FULL-attention layers only (5 of them; SWA layers already ring-buffer, KV-reuse layers share source tensors); slot-space full mask; mutually exclusive with `--fa-window`; spec falls back to AR | 119.0 tok/s coherent |

All smokes: pool 1024 / max-ctx 8192, ~1.2K logical tokens so live LRU eviction engages mid-request, RTX 3090. A no-flag gemma4 control on the same build confirms the default path is unchanged. The qwen35 numbers in the PR body are unaffected.

Policy note: qwen35/qwen35moe attach the pflash drafter scorer automatically; laguna and gemma4 run LRU-only for now (the drafter is Qwen-tokenizer bound) with the KvFlashScorer seam open for their own indexers.

New pager helpers: slot_of / fill_slot_pos (const lookups for slot-space masks) and zero_free_blocks (request-reset hygiene for maskless consumers).


cubic-dev-ai

2 issues found across 18 files (changes from recent commits).


- Pool-deadlock guard (P1): KvFlashPager::min_pool_tokens() + attach() refusal when sinks + tail window leave no evictable block; every backend floors the requested pool at config read (512 for qwen-family and gemma4; laguna derives its floor from the resident SWA window) with a warning instead of a runtime eviction failure.
- Unchecked slot_for() in do_ar_decode (P1): a -1 slot now fails the request with a clear error instead of becoming a set_rows row index.
- --kvflash / --kvflash-tau (P2): validate as positive integers at the CLI and exit early instead of deferring garbage env values downstream.
- score_chunks (P2): guard chunk_tokens <= 0.
- Stale docs (P3 x2): kvflash_mask comment no longer claims n_tokens==1 only (it serves multi-token spec verify); kv_scorer.h rename leftover now points at common/kvflash_scorer.h.

Verified on the 3090: bad flag values rejected with clear messages; --kvflash 256 raises to the 512 floor and decodes coherently through live eviction in the tightest legal pool (8 blocks, 5 protected).

Co-Authored-By: WOZCODE <contact@withwoz.com>
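
The early CLI validation for --kvflash / --kvflash-tau amounts to a strict positive-integer parse. The helper below is a hypothetical sketch of that check using std::strtol; the actual code in server_main.cpp may be structured differently.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper: parse a strictly positive int, rejecting empty input,
// trailing junk, zero, negatives, and out-of-range values.
bool parse_positive_int(const char* s, int& out) {
    if (s == nullptr || *s == '\0') return false;
    errno = 0;
    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);
    // end == s means no digits were consumed; *end != 0 means trailing junk.
    if (errno == ERANGE || end == s || *end != '\0') return false;
    if (v <= 0 || v > INT_MAX) return false;
    out = static_cast<int>(v);
    return true;
}
```

Unlike a bare std::atoi, this distinguishes "0" and "12abc" from valid input, so the server can exit with a clear message before anything is written to the environment.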

davide221 · 2026-06-12T10:36:34Z

All 6 cubic findings were valid and are fixed in the latest push:

P1 pool-deadlock (attach() no evictable chunk): real — --kvflash 256 gave 4 chunks while sinks (1) + tail (4) protect 5, so eviction had no victim once the pool filled. Fix is two-layered: KvFlashPager::min_pool_tokens() + an attach() refusal with a clear message, and every backend's config read now floors the pool (512 for qwen-family/gemma4; laguna computes its floor from the SWA window it must keep resident) with a "raising" warning instead of a runtime failure. Verified live: --kvflash 256 logs requested pool 256 < minimum 512; raising and decodes correctly through eviction.
P1 unchecked slot_for() in do_ar_decode: real — a -1 would have become a set_rows row index. Now checked, logs, sets last_error, and fails the request. (The spec-verify path already checked; this was the one unchecked site.)
P2 --kvflash raw setenv: both --kvflash and --kvflash-tau now validate as positive integers and exit with a clear message, matching the other numeric flags.
P2 score_chunks division by chunk_tokens: guarded (chunk_tokens <= 0 returns false) alongside the existing entry validation.
P3 stale "Only meaningful with n_tokens == 1" comment: rewritten — the param serves both single-token decode and multi-token spec verify since the spec-on-pool phase landed.
P3 stale common/kv_scorer.h reference: rename leftover, now common/kvflash_scorer.h.
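
The 512-token floor is consistent with simple chunk arithmetic. The sketch below is a reconstruction, not the shipped min_pool_tokens(): the chunk size (64), the protected counts (1 sink + 4 tail) and the 256-token rounding all appear in this PR, but combining them this way is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>

constexpr int kChunkTokens = 64;  // pager chunk granularity from the PR
constexpr int kRoundTokens = 256; // pool-size rounding mentioned in the PR

// Smallest pool that leaves at least one evictable chunk beyond the
// protected sinks and tail, rounded up to the pool-size granularity.
constexpr int min_pool_tokens(int sink_chunks, int tail_chunks) {
    const int need = (sink_chunks + tail_chunks + 1) * kChunkTokens;
    return (need + kRoundTokens - 1) / kRoundTokens * kRoundTokens;
}

// Raise a too-small request to the floor instead of failing at runtime.
constexpr int floor_pool(int requested, int sink_chunks, int tail_chunks) {
    const int floor_tokens = min_pool_tokens(sink_chunks, tail_chunks);
    return requested < floor_tokens ? floor_tokens : requested;
}

static_assert(min_pool_tokens(1, 4) == 512, "6 chunks = 384, rounded to 512");
static_assert(floor_pool(256, 1, 4) == 512, "--kvflash 256 is raised");
```

A 256-token pool holds 4 chunks while 5 are protected, which is the deadlock described above; 512 tokens gives 8 blocks with 5 protected and 3 evictable.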

Rebuilt + re-smoked on the 3090 after the fixes (27B, pool floor path, coherent output through live eviction).


…lpers

The multi-arch port left three copies of the same plumbing; this pulls them into the kvflash layer so each backend integration reduces to wiring (net -32 lines):

- kvflash_pool_from_env(): the env read + 256-rounding + eviction floor + max_ctx clamp lived in three slightly diverging copies (qwen35 inline, laguna, gemma4). One reader, parameterized by the arch's KvFlashConfig; laguna passes its SWA-tail config via a new kvflash_config() so the floor and attach can never disagree.
- KvFlashPager::alloc_span(): the slot_for loop + exhaustion diagnostic existed in laguna, gemma4, and the qwen35moe restore replay; the backend helpers are now one-line delegates and the error message is single-sourced.
- kvflash_fill_rows_and_masks(): laguna's step-input filler and gemma4's inline rows + slot-space mask fill were the same algorithm; the shared helper builds append rows plus causal (and optional sliding-window) masks from the pager's slot map, so graph code no longer reimplements the slot-to-position conversion.

No behavior change: rebuilt on the 3090 and re-smoked the three affected archs through live eviction (laguna 138.0 tok/s, gemma4 119.4, qwen35 37.0, all coherent, banners unchanged).

Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

1 issue found across 8 files (changes from recent commits).


- assets/cards/kvflash_card.png registered in the README cards grid (DECODE 2.9x at 256K, CONTEXT 256K, KV VRAM -99%), linking to optimizations/kvflash/. - optimizations/kvflash/README.md gains the hero image (pflash layout). - README/RESULTS now state explicitly that the 256K full-cache baseline rows are measured, not extrapolated, and fit the 24 GB card only because the KV is Q8_0 (F16 KV would be 9.2 GiB and not fit); KVFlash holds 72 MiB resident either way. Co-Authored-By: WOZCODE <contact@withwoz.com>

The measured tables now carry the cache parameter on the column itself (KV in VRAM (Q8_0)) instead of relying on the prose footnote alone; the footnote keeps the why (F16 KV would not fit 256K on 24 GB at all). Co-Authored-By: WOZCODE <contact@withwoz.com>

New 'Bounded KV residency (KVFlash)' subsection after the KV cache block, mirroring the Spark pattern: one-paragraph intro + flag table (--kvflash / --kvflash-tau and their env equivalents) linking to optimizations/kvflash/. Co-Authored-By: WOZCODE <contact@withwoz.com>

The 38.6 tok/s / 72 MiB figures are Qwen3.6-27B at one pool size; the four model families land at different speeds. The flags reference now states the property (decode independent of context length, pool-sized resident KV) and points at optimizations/kvflash/ for per-model numbers. Co-Authored-By: WOZCODE <contact@withwoz.com>

… without compression

Three UX/capability gaps closed, all verified on the 3090:

- Pooled chunked prefill in the daemon (DESIGN follow-up #2): a prompt larger than the pool no longer refuses: do_prefill switches to pager-chunk-sized batches with slot-mapped set_rows writes, a slot-space mask per chunk (verify_batch recipe), and live eviction as the pool fills. Constant VRAM, linear time. Smoked: 6843-token prompt through a 2048 pool, coherent output, 35.1 tok/s decode. Restore offsets and boundary snapshots are refused in the pooled path.
- --kvflash auto: sizes the pool from --max-ctx (25% with a drafter configured, 50% LRU-only), same floor/clamp rails, all model families via the shared config reader. Smoked both sizings.
- Drafter scorer without compression: --prefill-drafter alone now arms the residency scorer. The server hands the path to the backend (DFLASH_KVFLASH_DRAFTER); kvflash_ensure_scorer() lazy-loads the drafter on the first reselect that needs it (never on the first tokens) and re-attaches after a draft-residency release. Previously the scorer only attached inside the pflash compress path, so this flag combination silently ran recency-only LRU. Smoked: attach fires mid-generation, banner announces the pending policy.
- Snapshot guards now use pager.is_identity() instead of cumulative page_outs stats: one eviction-heavy request no longer disables snapshots for the rest of the process (laguna/gemma4), and qwen35 refuses identity-copy snapshots of relocated pools.

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T17:34:42Z

Update: pooled chunked prefill + `--kvflash auto` + drafter scoring without compression

Three follow-ups landed in 9db8472, all smoked on the 3090:

Prompts larger than the pool now work through the daemon. do_prefill switches to pooled chunked prefill (64-token slot-mapped batches, slot-space mask per chunk, live eviction) instead of refusing — the harness recipe, now in the server. Smoked: a 6,843-token prompt through a 2,048-token pool, coherent output, 35.1 tok/s decode.
--kvflash auto: pool sized from --max-ctx — 25% when a drafter is configured, 50% LRU-only. Works on all four model families.
--prefill-drafter alone now arms the residency scorer (lazy-loaded at the first reselect). Previously the scorer only attached via the pflash compression path, so --kvflash + drafter with compression off silently ran recency-only LRU.
Bugfix: snapshot guards use is_identity() instead of cumulative page_outs, so one long request no longer disables snapshots for the rest of the process.

The intended one-liner UX is now real:

dflash_server model.gguf --max-ctx 262144 --kvflash auto --prefill-drafter qwen3-0.6b.gguf


High accuracy by default: when --kvflash is on and no --prefill-drafter was given, the qwen-family backend probes the well-known locations for the Qwen3-0.6B drafter (target's dir, drafter/, draft/, then /opt/lucebox/models/drafter/ — Spark's load-what-sits-next-to-the-model pattern) and arms the residency scorer from it. LRU is now the explicit FALLBACK when no drafter exists, and the banner says so ('lru (recency-only: no Qwen3-0.6B drafter found ...)') instead of presenting recency-only paging as a normal mode. Nothing turns kvflash itself on by default; this only picks the policy once the user asks for the pool. Smoked on the 3090 with ONLY '--kvflash auto': probe found the appliance drafter, auto sized 25% (drafter expected), scorer attached at the first reselect, coherent output. Co-Authored-By: WOZCODE <contact@withwoz.com>

…kvflash-policy

Relevance is a property of the text, not the tokenizer, so non-qwen targets no longer have to run recency-only residency:

- KvFlashCrossTokScorer: detokenize the target's history with its own tokenizer (loaded from the target GGUF), re-tokenize for the Qwen3-0.6B drafter (its GGUF), run the same tail-attention scoring, and map per-drafter-token scores back to the target's 64-token chunk boundaries by character spans. Tokenizers are host-only, lazy-loaded.
- laguna + gemma4 gain the full reselect loop (history, adaptive tau, lazy drafter load at the first reselect boundary, score_hook + repage). Drafter-scored residency is now the default on ALL four families; the probe + sizing live in the shared helpers.
- --kvflash-policy {drafter,lru}: the explicit opt-out the default was missing (no probe, no drafter load, recency-only paging).
- Shared kvflash_find_drafter() / kvflash_policy_is_lru() replace the per-backend probe; banners state the armed policy and how to change it.

Verified on the 3090 (gemma4 26B-A4B, pool 1024): cross-tok scorer attaches mid-generation, 18 drafter-driven reselects with page events, coherent 1.9K-token output. Stress needle A/B vs LRU: LRU degenerates and never recites; cross-tok stays coherent and recalls the correct prefix but not the exact code. Documented in RESULTS.md as functional but untuned (qwen-native scoring keeps its measured 14-16/16; the teacher-forced NIAH harness for non-qwen archs is the follow-up).

Co-Authored-By: WOZCODE <contact@withwoz.com>
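
The character-span mapping used by the cross-tokenizer scorer can be sketched as an interval-overlap pass. This is an illustration under assumptions: the commit does not say how scores from several drafter tokens are aggregated per chunk, so taking the maximum is a choice made for this example, and `Span` and `scores_to_chunks` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Span { std::size_t begin, end; };  // half-open character range

// chunk_spans: character range covered by each target-side 64-token chunk.
// tok_spans / tok_scores: character range and score of each drafter token.
// Each chunk receives the max score of any drafter token overlapping it
// (aggregation by max is an assumption of this sketch).
std::vector<float> scores_to_chunks(const std::vector<Span>& chunk_spans,
                                    const std::vector<Span>& tok_spans,
                                    const std::vector<float>& tok_scores) {
    std::vector<float> out(chunk_spans.size(), 0.0f);
    for (std::size_t t = 0; t < tok_spans.size() && t < tok_scores.size(); ++t) {
        for (std::size_t c = 0; c < chunk_spans.size(); ++c) {
            const bool overlap = tok_spans[t].begin < chunk_spans[c].end &&
                                 chunk_spans[c].begin < tok_spans[t].end;
            if (overlap) out[c] = std::max(out[c], tok_scores[t]);
        }
    }
    return out;
}
```

A drafter token that straddles a chunk boundary contributes to both chunks, which keeps the mapping well defined even though the two tokenizers split the text at different places.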

cubic-dev-ai

2 issues found across 12 files (changes from recent commits).


'auto' now sizes from the GPU instead of a fixed fraction of max_ctx: half of (device-free minus reserve) after the weights are resident, converted at the model's pooled-KV density, capped at the decode speed knee (16384 tokens default, DFLASH_KVFLASH_MAX_POOL to override) and at max_ctx. Rationale: a bigger pool means more resident chunks and fewer forced evictions of useful context (the relevance-crowding seen in the gemma4 needle stress), while the cap keeps the per-step KV read near the flat-decode optimum; on tight cards the VRAM term shrinks the pool automatically. Backends supply the budget (ggml_backend_dev_memory + per-arch density: qwen35 full-attn layers at resolve_kv_types' quant, laguna all layers at args.kv_type, gemma4 full-attn layers at F16 with per-layer dims); the reserve covers compute buffers plus the drafter when one is expected. The fraction heuristic survives only as the no-budget fallback. Smoked on the 3090 at max-ctx 131072: 27B picks 16384 (free 8.3 GiB, 14.0 KiB/token, speed-capped), gemma4 picks 16384 (7.5 GiB, 20.0 KiB/token), both banners report the full math, both decode coherently. Co-Authored-By: WOZCODE <contact@withwoz.com>
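
The VRAM-based sizing rule reads as a short chain of clamps. The following is a sketch of that rule as described in the commit message; the parameter names and the round-down to whole 64-token chunks are assumptions, and the eviction floor applied elsewhere in the PR is not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: half of (free - reserve), converted to tokens at the
// model's pooled-KV density, capped at the decode speed knee and at max_ctx.
std::int64_t auto_pool_tokens(std::int64_t free_bytes, std::int64_t reserve_bytes,
                              std::int64_t bytes_per_token, std::int64_t max_ctx,
                              std::int64_t speed_cap = 16384) {
    const std::int64_t budget =
        std::max<std::int64_t>(0, free_bytes - reserve_bytes) / 2;
    std::int64_t tokens = budget / bytes_per_token;
    tokens = std::min({tokens, speed_cap, max_ctx});
    return tokens / 64 * 64;  // whole chunks only (assumption)
}
```

On a roomy card the speed cap decides the result, as in the two smokes above; on a tight card the VRAM term takes over and the pool shrinks automatically.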

cubic-dev-ai · 2026-06-12T21:47:14Z

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

Four valid findings from cubic's later passes, all fixed:

- KvFlashCrossTokScorer: raw owning pimpl now has deleted copy ctor/assignment (double-free guard; held in unique_ptr everywhere, but the class shouldn't rely on that).
- KvFlashPager::slot_for: a failed allocation rolls cur_chunk_ back so the next eviction's tail window isn't computed from a chunk that never materialized.
- laguna unpark: kvflash_attach failure now frees the just-loaded weights + cache before returning (was leaking them while still reporting parked).
- kvflash_drafter_failed_ latch clears on unpark in all three backends: a transient drafter-load failure no longer downgrades residency to LRU for the process lifetime (still no per-tau retry spam).

Stale finding skipped: the cumulative page_outs snapshot guard was already replaced by is_identity() two rounds ago.

Docs brought up to shipped reality: DESIGN.md per-arch policy section (cross-tok default, --kvflash-policy, VRAM auto), do_prefill bullet (pooled chunked prefill), and the follow-ups list now separates done (pooled prefill, spec-on-pool, VRAM auto, cross-tok) from open (drafter KV persistence, laguna/gemma4 pooled prefill, pooled snapshots, async paging, non-qwen NIAH harness).

Full test_kvflash regression suite on this exact tree: ALL PASS (relocation 2% gate, bit-exact roundtrip, eviction decode, reselect recall, LSA loop, >=90% KV cut), exit 0.

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T22:02:03Z

Pre-ship audit complete (`321695c`)

Final sweep before merge:

Cubic round 2: 4 of 5 later findings valid, all fixed — copy guards on the cross-tok scorer's owning pointer (P1), cur_chunk_ rollback on failed allocation, laguna unpark resource cleanup on attach failure, and the drafter-failure latch now clears on unpark instead of downgrading to LRU for the process lifetime. The fifth (cumulative page_outs snapshot guard) was already fixed by is_identity() in an earlier round.
Docs reconciled with shipped reality: DESIGN.md policy section, pooled-prefill bullet, and a done-vs-open follow-ups split.
Full test_kvflash regression suite on the final tree: ALL PASS, exit 0 — relocation equivalence (2% gate), bit-exact page roundtrip, eviction decode, reselect recall, LSA loop, >=90% KV-memory cut.

Open follow-ups are documented in DESIGN.md (drafter KV persistence, pooled prefill on laguna/gemma4, pooled snapshots, async paging, non-qwen NIAH harness + cross-tok tuning). From our side this is ready to merge once the GPU checks land.


Both GPU jobs shared group lucebox3-gpu-runner, but a concurrency group holds only ONE waiting job: the CUDA job took the running slot, the Radeon job sat in the waiting slot, and every new job entering the group from any branch displaced it ('Canceling since a higher priority waiting request exists') — the Radeon leg was cancelled chronically while the 3090 leg passed. The combo box has two distinct GPUs, so the jobs never contended for a device; per-GPU groups keep cross-PR serialization where it matters and stop the cross-displacement. Co-Authored-By: WOZCODE <contact@withwoz.com>

rocminfo on a wedged KFD blocks in uninterruptible sleep until the 20-minute job timeout kills the run with zero evidence. Probe it under a 15 s timeout first; on hang, dump /dev/kfd holders, D-state processes, and recent amdgpu/kfd dmesg, then fail in seconds with the diagnosis on the job page. The smoke step reuses the healthy probe's output. Co-Authored-By: WOZCODE <contact@withwoz.com>

The 'DDTree falls back to AR under KVFlash' limitation guarded against a tree verify that does not exist in the daemon: the complete tree machinery (build_ddtree, build_target_step_tree, follow_verified_tree) is only called from test_dflash, the benchmark harness. In the server, --ddtree sizes the verify intermediates for budget+1 tokens and enables fast_rollback, then generation runs the same chain spec loop either way — and both pieces are already pool-compatible: chain verify_batch is slot-mapped (measured at acceptance parity), and fast_rollback's snapshot_kv/restore_kv only snapshot DeltaNet/conv recurrent state, which KVFlash never pages. Gate removed; docs corrected (the known-limit now names the harness-only tree graphs, not the daemon). A/B on the 3090 (27B + DFlash draft, --ddtree, 600 tokens): pooled 14.6% accept / avg_commit 3.33 / 33.5 tok/s vs full-cache 13.9% / 3.23 / 33.3 — parity, both coherent. Co-Authored-By: WOZCODE <contact@withwoz.com>

timeout(1) cannot kill a process in uninterruptible sleep, so the previous diagnostic step itself blocked for the full job timeout when KFD was wedged (observed live: 20 minutes of silence, no evidence printed). Probe rocminfo in the background with output to a file (no held pipe), enforce the 15 s deadline in the shell, and on hang print the probe's own D-state, /dev/kfd holders, and amdgpu dmesg before failing fast — without ever wait()ing on the corpse. Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T23:22:51Z

Update: `--ddtree` runs on the pool (`9a17281`)

Investigating the "DDTree falls back to AR" limitation dissolved it: the daemon never had tree verify — the tree machinery (build_ddtree/build_target_step_tree/follow_verified_tree) is only called from test_dflash, the benchmark harness. In the server, --ddtree sizes verify intermediates and enables fast_rollback, then runs the same chain spec loop — and both are pool-compatible (chain verify is slot-mapped; fast_rollback only snapshots DeltaNet state, which is never paged). Gate removed, docs corrected.

A/B on the 3090 (27B + DFlash draft, --ddtree, 600 tokens): pooled 14.6% accept / avg_commit 3.33 / 33.5 tok/s vs full-cache 13.9% / 3.23 / 33.3 — parity, both coherent.

Also in: the ROCm CI job now self-diagnoses a wedged KFD in ~15 s (D-state-proof background probe) instead of eating its 20-minute timeout in silence; the current Radeon failures are a driver wedge on the runner box, not this branch (zero PR code runs in that job).

🧙 Built with WOZCODE

…gression fix

Spec decode now runs on the pool everywhere it exists. gemma4 was the last gap:

- gemma4_verify_batch gains the kvflash path: set_rows kv-index inputs (full layers -> pool slots, SWA -> ring rows), slot-space causal mask via the shared helper, FA span + mask width clamped to the pool. Gemma4DFlashTarget allocates the verify block's slots up front; the spec loop's KV-truncation rejection maps directly onto the pool's validity rule (rejected slots hold future positions, masked until the next verify rewrites them). Both backend spec gates removed.
- Pre-existing regression fixed (blocks gemma spec on MAIN, not just here): PR #359's strict assert reads dflash.n_target_layers, which the published gemma draft fills with the TARGET layer count (30) while its fc tensor is sized for the 6 CAPTURE layers, so the draft refused to load at all. Per that PR's own weights-are-ground-truth rule, derive the capture count from fc when it divides n_embd and warn on the metadata mismatch; genuinely inconsistent shapes still fail.
- gemma4 accept_rate now reaches the HTTP usage block (was silently 0.0 while the loop logged the real rate; same reporting-only class as the PR #321 layer-split gap).

A/B on the 3090 (26B-A4B + published q8_0 draft, 600 tokens): pooled and full cache produce IDENTICAL acceptance (407/3104 = 13.1%, avg_commit 3.09) and identical text; usage reports 0.131 on both.

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T23:50:53Z

Update: gemma4 spec decode on the pool — spec now works everywhere it exists (`abb4cf4`)

The last spec-on-pool gap is closed. gemma4_verify_batch gains the slot-mapped path (set_rows kv indices, slot-space causal mask, pool-clamped span); gemma4's KV-truncation rejection semantics map directly onto the pool's validity rule. A/B on the 3090 with the published q8_0 draft: pooled and full cache produce identical acceptance (407/3104 = 13.1%, avg_commit 3.09) and identical text.
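As a rough illustration of the validity rule, a slot-space causal mask can be built from the logical position each pool slot currently holds. The sketch below is not the shared helper from the PR; `slot_causal_mask` and the use of `None` for an empty slot are assumptions made for the example.

```python
NEG_INF = float("-inf")


def slot_causal_mask(slot_pos, query_pos):
    """Build an additive attention mask over pool slots.

    slot_pos[s] is the logical position stored in slot s (None if the slot is empty).
    query_pos[i] is the logical position of query token i.
    A slot is visible to a query only if it is filled and not in the query's future.
    """
    mask = []
    for q in query_pos:
        row = [0.0 if (p is not None and p <= q) else NEG_INF for p in slot_pos]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Slots are in arbitrary order: paging relocates chunks freely.
    # Slot 4 still holds position 130 from a rejected draft token.
    slots = [128, 0, None, 129, 130]
    print(slot_causal_mask(slots, [129]))
    # [[0.0, 0.0, -inf, 0.0, -inf]]
```

A rejected draft token leaves a slot holding a position beyond the committed length, so the `p <= q` test masks it without any explicit truncation until a later verify overwrites the slot.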

Two pre-existing bugs found and fixed along the way (both affect main, not just this branch):

- The published gemma draft has been unable to load since PR #359 (feat(qwen35): derive scalars from weights, assert vs GGUF metadata): its dflash.n_target_layers metadata holds the target's 30 layers while the fc tensor is sized for the 6 capture layers, so the strict assert rejects it and gemma spec decode is silently AR-only. Fixed per that PR's own weights-are-ground-truth rule: derive the capture count from the tensor, warn on the metadata mismatch.
- gemma4 accept_rate never reached the HTTP usage block (reported 0.0 while the loop logged the real rate), the same reporting-only class as the PR #321 (feat(server): support DFlash with mixed-backend target layer split) layer-split gap. Wired through.
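The first fix can be sketched as below. The sketch assumes the fc tensor's input width is the capture-layer count times n_embd, which the PR text implies but does not spell out; `capture_layer_count` and its arguments are hypothetical names, and the n_embd value in the usage lines is made up.

```python
import warnings


def capture_layer_count(fc_in_features, n_embd, meta_n_target_layers):
    """Derive the capture-layer count from the fc tensor; metadata is advisory.

    Assumes fc_in_features == n_capture * n_embd (an assumption of this sketch).
    """
    if fc_in_features <= 0 or fc_in_features % n_embd != 0:
        # Genuinely inconsistent shapes still fail.
        raise ValueError(
            f"fc width {fc_in_features} is not a multiple of n_embd {n_embd}"
        )
    derived = fc_in_features // n_embd
    if derived != meta_n_target_layers:
        warnings.warn(
            f"metadata n_target_layers={meta_n_target_layers} disagrees with "
            f"fc-derived capture count {derived}; trusting the weights"
        )
    return derived


if __name__ == "__main__":
    n_embd = 1024  # hypothetical width, for illustration only
    print(capture_layer_count(6 * n_embd, n_embd, meta_n_target_layers=30))  # 6
```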

Spec-on-pool coverage is now: qwen35 chain ✓, qwen35 --ddtree config ✓, gemma4 chain ✓; the only exception remains MoE-hybrid spec (literal-offset writes, falls back to pipelined AR), and laguna has no spec decode to begin with.


…7B-hardcoded)

The converter stamped the qwen35-27B draft's scalars (n_head_kv=8, hidden=5120, n_layer=5, ff=17408, ...) onto every draft regardless of source, so any non-27B DFlash draft (A3B, gemma) converted to a GGUF with correct weights but wrong metadata, which the strict draft loader then rejected (blk.0 attn_k dim != n_head_kv*head_dim). Every MoE/A3B spec-decode attempt on main fails at draft load for this reason.

load_arch() now resolves the architecture from the source config.json (authoritative for transformer hparams) cross-checked against the tensor shapes (authoritative for the rest: head_dim from k_proj, intermediate from gate_proj, n_target_layers from fc, n_layer from the block count), falling back to the 27B constants only when config.json is absent. Verified: A3B draft converts to n_head_kv=4 n_layer=8 ff=6144 and loads clean.

This unblocks MoE speculative decode. Validated on the 3090: A3B MoE all-GPU with --ddtree + --kvflash 2048 runs spec decode on the pool (10.4% accept, avg_commit 2.66, coherent) vs full cache (11.5%, 2.84, coherent), so dflash + ddtree + kvflash compose on MoE. The qwen35moe --spark hybrid spec path has a separate pre-existing CUDA crash (see RESULTS Known limits); it was never reachable until drafts could load.

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-13T11:31:41Z

Update: MoE has dflash + ddtree on the pool (`feef3fd`)

A3B MoE, all-GPU (experts resident), --ddtree --kvflash 2048: spec decode runs on the pool — 10.4% accept, avg_commit 2.66, 59.5 tok/s, coherent vs full cache 11.5% / 2.84 / 64.6 (gap within the documented masked-kernel rounding variance). dflash + ddtree + kvflash compose on MoE. This path needs no new code — qwen35moe inherits the qwen35 spec loop already pool-validated.

Getting there required fixing the draft converter, which was broken for every non-27B draft on main: convert_dflash_to_gguf.py hardcoded the 27B scalars (n_head_kv=8, hidden=5120, ...), so A3B/gemma drafts converted with correct weights but 27B metadata and the strict loader rejected them. Now config-driven (cross-checked against tensor shapes); the A3B draft converts to n_head_kv=4 n_layer=8 ff=6144 and loads clean. This unblocks MoE/A3B spec decode on main, not just here.
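A config-driven resolver with a shape cross-check might look like the sketch below. It is not the converter's `load_arch()`: the tensor names, the config keys (`num_key_value_heads`, `hidden_size`, as commonly found in Hugging Face configs), and the example hidden size and head_dim are all assumptions for illustration. Only the 27B fallback constants and the A3B result (n_head_kv=4, n_layer=8, ff=6144) come from the PR.

```python
import re

# 27B constants quoted in the PR; used only when config.json is absent.
FALLBACK_27B = {"n_head_kv": 8, "hidden": 5120, "n_layer": 5, "ff": 17408}


def resolve_arch(config, shapes):
    """config: parsed config.json dict, or None. shapes: tensor name -> (rows, cols)."""
    if config is None:
        return dict(FALLBACK_27B)

    n_head_kv = config["num_key_value_heads"]
    k_rows = shapes["layers.0.self_attn.k_proj.weight"][0]
    if k_rows % n_head_kv != 0:
        # Cross-check: config and weights must agree before metadata is written.
        raise ValueError(f"k_proj rows {k_rows} not divisible by n_head_kv {n_head_kv}")

    # Count distinct block indices that appear in the tensor names.
    blocks = set()
    for name in shapes:
        match = re.match(r"layers\.(\d+)\.", name)
        if match:
            blocks.add(int(match.group(1)))

    return {
        "n_head_kv": n_head_kv,
        "hidden": config["hidden_size"],
        "head_dim": k_rows // n_head_kv,                    # from k_proj
        "ff": shapes["layers.0.mlp.gate_proj.weight"][0],   # from gate_proj
        "n_layer": len(blocks),                             # from the block count
    }


if __name__ == "__main__":
    hidden, head_dim = 2048, 128  # hypothetical values for the example
    shapes = {}
    for i in range(8):
        shapes[f"layers.{i}.self_attn.k_proj.weight"] = (4 * head_dim, hidden)
        shapes[f"layers.{i}.mlp.gate_proj.weight"] = (6144, hidden)
    print(resolve_arch({"num_key_value_heads": 4, "hidden_size": hidden}, shapes))
```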

One genuine pre-existing bug surfaced, filed in RESULTS Known limits: the qwen35moe --spark hybrid spec path crashes with a CUDA illegal-memory-access — independent of kvflash (crashes on the full cache too), never reachable before because no A3B draft could load. --spark spec falls back to pipelined AR under kvflash; that crash is its own follow-up.

I wrote pool-aware code for the hybrid path too but did not ship it — it can't be exercised until that crash is fixed, and I'm keeping the branch to validated code only.


…rash/correctness fixes

--spark + DFlash speculative decode on MoE targets (Qwen3.6-35B-A3B etc.) crashed, then produced garbage, then ran ~4x slower than plain --spark AR. Three root causes, all fixed:

1. Crash. The F32 shared-expert gate (ffn_gate_inp_shexp, M=1) routed to cublasSgemm, and the shipped CUDA 12.0 cublasLt is missing the gemv/split-K reduce kernels for small-M matmuls at N>1 (the verify/replay batches), poisoning the stream (surfaced downstream as an illegal access in MUL_MAT_ID). Compute the scalar gate cublas-free: broadcast elementwise mul + sum_rows.
2. Garbage / collapse-to-"the". The MoE path never allocated the DeltaNet ssm/conv rollback snapshot tensors (migrate_prefill_cache is dense-only), so snapshot/restore_ssm_state were silent no-ops and rejected draft tokens leaked permanently into the recurrent state. Add ensure_ssm_snapshot().
3. Speed. The verify re-evaluated cold experts on the CPU every step, while the AR pipelined path swaps a token's selected cold experts into GPU spare slots (LRU cache, ~21 slots/layer). Make the verify use the same moe_hybrid_cache_swap_in so every layer runs all-hot on GPU. Verify FFN 48->17 ms, decode 28->49 tok/s on a 3090.

Also cache the mixed batched FFN graphs per n_tokens (cache-full fallback) and argmax on the GPU. accept_rate is now plumbed to the HTTP usage field.

Validated on RTX 3090 (Qwen3.6-35B-A3B): 11% accept, coherent 500-token output, no crash.

Co-Authored-By: WOZCODE <contact@withwoz.com>
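Why a missing rollback snapshot corrupts output can be seen in a toy model. The sketch below replaces the DeltaNet ssm/conv tensors with a single float; `ToyRecurrentState` and `verify_and_commit` are hypothetical names, and the update rule is arbitrary.

```python
class ToyRecurrentState:
    """Stand-in for a recurrent state: one float, updated once per token."""

    def __init__(self):
        self.h = 0.0
        self._snap = None

    def step(self, token):
        self.h = 0.5 * self.h + token

    def snapshot(self):
        self._snap = self.h

    def restore(self):
        if self._snap is None:
            return False  # nothing was saved: restore silently does nothing
        self.h = self._snap
        return True


def verify_and_commit(state, draft, n_accepted):
    """Run the whole draft through the state, then keep only the accepted prefix."""
    state.snapshot()
    for tok in draft:
        state.step(tok)  # verify pass advances the state past rejected tokens
    state.restore()      # roll back to the pre-verify state
    for tok in draft[:n_accepted]:
        state.step(tok)  # replay only the committed tokens


if __name__ == "__main__":
    state = ToyRecurrentState()
    verify_and_commit(state, draft=[1, 2, 3, 4], n_accepted=2)
    print(state.h)  # 2.5, the same as stepping through [1, 2] only
```

Unlike attention KV, where a rejected row can simply be masked or overwritten, a recurrent state has already folded the rejected tokens in, so the only way back is a saved copy.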

The batched verify's cost scales with the distinct experts its tokens touch, and under --spark expert offload that means streaming extra cold experts for the draft tokens past the realized accept length. Those tokens are rejected, so cap the verify to the realized accept length plus a small margin (adaptive; DFLASH_VERIFY_WIDTH pins a fixed width). At AL~1.7 this verifies ~6 tokens instead of the full 16-token draft block at no acceptance cost: spec decode 49 -> 64 tok/s at full residency, 35 -> 53 tok/s at simulated-16GB offload. Co-Authored-By: WOZCODE <contact@withwoz.com>
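The capping rule can be sketched as below. `DFLASH_VERIFY_WIDTH` and the 16-token draft block come from the commit message; the function name, the way the accept length is rounded, and the margin of 4 are assumptions, with the margin chosen so that an accept length of about 1.7 yields the ~6 tokens quoted above.

```python
import math
import os


def verify_width(avg_accept_len, block_size=16, margin=4, env=None):
    """Pick how many draft tokens to verify this step.

    A pinned DFLASH_VERIFY_WIDTH wins; otherwise use the realized accept
    length plus a margin, clamped to [1, block_size].
    """
    env = os.environ if env is None else env
    pinned = env.get("DFLASH_VERIFY_WIDTH")
    if pinned:
        return max(1, min(block_size, int(pinned)))
    width = math.ceil(avg_accept_len) + margin
    return max(1, min(block_size, width))


if __name__ == "__main__":
    print(verify_width(1.7, env={}))                              # 6
    print(verify_width(1.7, env={"DFLASH_VERIFY_WIDTH": "8"}))    # 8
```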

eval_moe_hot_only_batched cached a single hot_batched_graph keyed by n_tokens. Spec decode alternates verify (verify_width) and replay (commit_n) batch sizes, so that single slot rebuilt all 40 layers' FFN graphs on every flip. Reuse the per-n_tokens hot_batched_mixed[] array so each batch size keeps its own graph; after warmup the verify/replay flip rebuilds nothing. 64 -> 68 tok/s. Co-Authored-By: WOZCODE <contact@withwoz.com>
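The difference between a single cached graph and one graph per batch size can be sketched with two small caches. The class names and the `build` callback are hypothetical; the real code caches ggml graphs, not strings.

```python
class SingleSlotCache:
    """One cached graph: any change in n_tokens forces a rebuild."""

    def __init__(self, build):
        self.build = build
        self.key = None
        self.graph = None
        self.rebuilds = 0

    def get(self, n_tokens):
        if self.key != n_tokens:
            self.graph = self.build(n_tokens)
            self.key = n_tokens
            self.rebuilds += 1
        return self.graph


class PerSizeCache:
    """One cached graph per n_tokens: each batch size is built once."""

    def __init__(self, build):
        self.build = build
        self.graphs = {}
        self.rebuilds = 0

    def get(self, n_tokens):
        if n_tokens not in self.graphs:
            self.graphs[n_tokens] = self.build(n_tokens)
            self.rebuilds += 1
        return self.graphs[n_tokens]


if __name__ == "__main__":
    single = SingleSlotCache(lambda n: f"graph[{n}]")
    per_size = PerSizeCache(lambda n: f"graph[{n}]")
    for _ in range(10):           # ten spec-decode steps
        for n in (6, 2):          # verify batch, then replay batch
            single.get(n)
            per_size.get(n)
    print(single.rebuilds, per_size.rebuilds)  # 20 2
```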

…lash spec history

Code review (codex) of the spark/spec changes surfaced:

- Dummy (unused) FFN slots were filled with `i % n_hot_stack`, which includes UNINITIALIZED cache-ring spare slots (hot_active..n_hot_stack). Their garbage Q4_K scale bits can dequantize to NaN (x weight 0 = NaN -> corruption). Dummy to pinned/initialized experts (hot_active) instead.
- ensure_ssm_snapshot() left dangling snapshot pointers into a freed rollback_ctx on alloc failure; null them so snapshot/restore_ssm_state skip rather than dereference.
- Speculative decode ran under --kvflash but never appended committed tokens to kvflash_history_ / called kvflash_maybe_reselect (the AR paths do), so the pool evicted on stale prompt-only history. Sync them per committed step.
- The cached cold batched FFN compute ignored its GGML_STATUS; check it.

Validated on RTX 3090: spec 69 tok/s coherent (no regression); --spark --draft --kvflash coherent, 51 tok/s, no crash.

Co-Authored-By: WOZCODE <contact@withwoz.com>
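The first finding rests on an IEEE 754 rule: NaN times zero is NaN, so a zero routing weight does not neutralize a garbage expert output. A minimal illustration, with `combine_experts` as a hypothetical stand-in for the weighted expert sum:

```python
import math


def combine_experts(outputs, weights):
    """Weighted sum of per-expert outputs, as in a mixture-of-experts FFN."""
    return sum(w * y for w, y in zip(weights, outputs))


if __name__ == "__main__":
    nan = float("nan")
    # Dummy slot points at uninitialized memory that dequantizes to NaN.
    print(math.isnan(combine_experts([2.0, nan], [1.0, 0.0])))  # True
    # Dummy slot points at an initialized expert: weight 0 really is a no-op.
    print(combine_experts([2.0, 3.0], [1.0, 0.0]))              # 2.0
```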

…s fixes

- Usage leads with the drafter-guaranteed form; --kvflash auto alone can silently fall back to LRU if no drafter is auto-probed.
- Harness accuracy 32/32 -> 36/36 (HumanEval+GSM+MATH+agent = 10+10+10+6).
- qwen35moe --spark hybrid spec decode now runs on the pool: drop the stale crash known-limit, add validated 69.2/51.4 tok/s numbers.
- DESIGN/README: daemon --ddtree runs on the pool; only test_dflash tree-verify graphs remain non-pool-aware.

Co-Authored-By: WOZCODE <contact@withwoz.com>

Brings KVFlash bounded KV residency (Luce-Org#373), spec-decode budget-hook fixes (Luce-Org#379), FlowKV multi-turn prefill (Luce-Org#372), oversized-prompt admission, the fa-window tool-call warning (Luce-Org#378), and the llama.cpp bump to 574be613.

Conflict resolutions:

- submodule: rebased our pool_trim commit onto 574be613 -> 07cee1dce.
- qwen35_backend.cpp: kept cache_max_verify_tokens_ alongside the KVFlash pool-budget block; gated tree verify off when the KVFlash pager is active (slot-space masks are not tree-position aware); Luce-Org#379 budget-hook fix merged cleanly into the spec-decode tail-off.
- http_server.cpp: kept our admission gate (clamps max_tokens for fixed-budget clients, tool requests bypass compression) over main's should_reject_oversized; ours subsumes the admit-when-compressible case.

Regression: chunked delta-net parity (CPU+CUDA) and the UTF-8 split-token test both pass post-merge.

davide221 and others added 4 commits June 12, 2026 10:15

cubic-dev-ai Bot reviewed Jun 12, 2026


Comment thread server/src/gemma4/gemma4_backend.cpp Outdated

Comment thread server/src/laguna/laguna_backend.cpp Outdated

cubic-dev-ai Bot reviewed Jun 12, 2026


Comment thread server/src/common/kvflash_pager.h

davide221 and others added 5 commits June 12, 2026 18:53

davide221 and others added 2 commits June 12, 2026 19:58

cubic-dev-ai Bot reviewed Jun 12, 2026


Comment thread server/src/qwen3/qwen3_kvflash_scorer.h

Comment thread server/src/gemma4/gemma4_backend.cpp

davide221 and others added 4 commits June 13, 2026 00:18

davide221 and others added 5 commits June 13, 2026 21:08

davide221 merged commit 437102e into main Jun 14, 2026
3 of 4 checks passed

dusterbloom mentioned this pull request Jun 14, 2026

feat(kvflash): target-QK residency scorer (--kvflash-policy qk) #385

Open
