KVFlash: bounded KV residency (lookahead sparse attention) for dflash #373
KvFlashPager: bounded resident pool for the full-attention KV cache (FlashMemory-style lookahead sparse attention, arXiv 2606.09079). Logical positions map to physical pool slots at 64-token chunk granularity; cold chunks page out to a host backing store and are recalled bit-exact. GPU footprint is a hard O(pool) bound at any context length.

KvFlashScorer: dependency-free chunk-relevance policy interface. With no scorer the pager runs pure LRU; KvFlashDrafterScorer adapts the pflash Qwen3-0.6B drafter (tail-attention chunk scores, z-normalized, bisecting on allocation pressure) so reselect becomes relevance-driven.

Co-Authored-By: WOZCODE <contact@withwoz.com>
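To make the chunk-granular mapping concrete, here is a minimal sketch of a page table with the pure-LRU policy. It is an illustration under stated assumptions, not the real KvFlashPager: the class name `ChunkPageTable`, its members, and the `page_outs()` counter are hypothetical, and the host copy that a real page-out performs is only marked by a comment.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

// Hypothetical sketch (not the real KvFlashPager): logical position -> pool
// slot at 64-token chunk granularity, with pure-LRU eviction.
class ChunkPageTable {
public:
    static constexpr int kChunkTokens = 64;  // chunk granularity from the PR

    explicit ChunkPageTable(int pool_tokens) {
        for (int b = pool_tokens / kChunkTokens - 1; b >= 0; --b) {
            free_blocks_.push_back(b);
        }
    }

    // Returns the physical slot for a logical position, allocating a block
    // (and evicting the coldest chunk if the pool is full). Returns -1 when
    // there is nothing to allocate or evict.
    int slot_for(int64_t pos) {
        const int64_t chunk = pos / kChunkTokens;
        auto it = table_.find(chunk);
        if (it == table_.end()) {
            int block = -1;
            if (!free_blocks_.empty()) {
                block = free_blocks_.back();
                free_blocks_.pop_back();
            } else if (!lru_.empty()) {
                const int64_t victim = lru_.back();  // least recently used
                lru_.pop_back();
                block = table_.at(victim).block;
                table_.erase(victim);
                ++page_outs_;  // a real pager would copy the bytes to host here
            } else {
                return -1;
            }
            lru_.push_front(chunk);
            it = table_.emplace(chunk, Entry{block, lru_.begin()}).first;
        } else {
            // Touch: move the chunk to the hot end of the LRU list.
            lru_.splice(lru_.begin(), lru_, it->second.lru_it);
        }
        return it->second.block * kChunkTokens +
               static_cast<int>(pos % kChunkTokens);
    }

    int page_outs() const { return page_outs_; }

private:
    struct Entry {
        int block;
        std::list<int64_t>::iterator lru_it;
    };
    std::vector<int> free_blocks_;
    std::list<int64_t> lru_;                    // front = most recently used
    std::unordered_map<int64_t, Entry> table_;  // logical chunk -> pool block
    int page_outs_ = 0;
};
```

With a 128-token pool (two blocks), touching positions 0, 70, 5 and then 130 evicts chunk 1, since chunk 0 was used more recently, and position 130 lands in the block chunk 1 vacated.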
- create_target_cache gains ctx_alloc: attention KV tensors allocate at pool capacity while cache.max_ctx stays the logical bound.
- build_target_step gains kvflash_mask: pooled decode keeps the step-invariant set_rows KV append active alongside an exact slot-validity mask (uploaded before every compute; gallocr reuses input regions during graph execution, so a stale mask is garbage).
- do_ar_decode routes kv_write_rows through the pager slot, pushes history, and reselects every tau decoded tokens (effective interval max(tau, history/45) caps rescore overhead near 15%).
- Spec decode (chain) verifies ON the pool: verify_batch slot-maps the draft block (kv_write_rows is [n_tokens, n_head_kv] ne0-major) and builds a slot-space mask; rejected drafts need no rollback since the pos < base_pos validity rule excludes their slots until rewritten. DDTree tree-verify is not pool-aware and falls back to AR.
- pflash synergy: when the prefill drafter loads, KvFlashDrafterScorer attaches automatically; without it the pool runs LRU (fully agnostic).
- Post-generation snapshots are skipped once cur_pos exceeds the pool; prompts must fit the pool (clear error otherwise); pool size clamps to --max-ctx with a warning.

Co-Authored-By: WOZCODE <contact@withwoz.com>
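The slot-space mask can be sketched as follows. This is an illustration, not the code in graph_builders.cpp: the function name `fill_slot_mask`, the `slot_pos` array (logical position held by each pool slot, -1 when free) and the additive 0 / -inf convention are assumptions. Committed history (pos < base_pos) is always visible, the block's own positions are visible causally, and any slot holding a later position, such as a leftover from a rejected draft, stays hidden until it is rewritten.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sketch of a slot-space causal mask.
// slot_pos[s] = logical position stored in pool slot s, or -1 if the slot is free.
// Returns a row-major [n_tokens x n_slots] additive mask: 0 = visible, -inf = hidden.
// Query i of the block sits at logical position base_pos + i.
std::vector<float> fill_slot_mask(const std::vector<int64_t>& slot_pos,
                                  int64_t base_pos, int n_tokens) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    const std::size_t n_slots = slot_pos.size();
    std::vector<float> mask(static_cast<std::size_t>(n_tokens) * n_slots, neg_inf);
    for (int i = 0; i < n_tokens; ++i) {
        const int64_t q = base_pos + i;  // logical position of this query
        for (std::size_t s = 0; s < n_slots; ++s) {
            const int64_t p = slot_pos[s];
            // Visible only if the slot is occupied and not in the query's future.
            if (p >= 0 && p <= q) {
                mask[static_cast<std::size_t>(i) * n_slots + s] = 0.0f;
            }
        }
    }
    return mask;
}
```

Because the mask is derived from the slot map on every call, rebuilding and re-uploading it before each compute is what keeps it exact.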
Gated suite A-F: full-cache baseline, shuffled-relocation equivalence (<=2% argmax flips), live paging with bit-exact page_out/page_in roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode profile, and the full LSA loop with the drafter as Memory Indexer. Modes: --niah / --niah256 (needle recall vs residency), --longab (end-to-end long-prompt A/B, per-process configs for clean VRAM), --no-mask. Co-Authored-By: WOZCODE <contact@withwoz.com>
Measured on lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV: decode flat at 38.6 tok/s from 64K to native-max 256K (2.9x over full cache at 256K), 72 MiB resident KV vs 4608 MiB, prefill up to 2.8x faster, needle recall 88-100% at 6-9% residency with the drafter policy, harness ground truth 32/32 vs 32/32, spec acceptance at parity. Co-Authored-By: WOZCODE <contact@withwoz.com>
6 issues found across 18 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/server/server_main.cpp">
<violation number="1" location="server/src/server/server_main.cpp:411">
P2: Missing input validation for --kvflash token count. The value is stored raw via setenv without any validation that it is a positive integer. Every other numeric flag in this block (--spark-slots, --ddtree-budget, --fa-window, --chunk, etc.) parses with std::atoi and validates. Passing non-numeric, zero, or negative input will silently set DFLASH_KVFLASH to garbage, deferring the failure to an opaque downstream atoi call rather than failing early with a clear error message.</violation>
</file>
<file name="server/src/qwen3/qwen3_kvflash_scorer.cpp">
<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.cpp:110">
P2: `score_chunks` divides by `chunk_tokens` without validating it, which can crash on invalid input.</violation>
</file>
<file name="server/src/qwen35/graph_builders.h">
<violation number="1" location="server/src/qwen35/graph_builders.h:71">
P3: Header comment for `kvflash_mask` incorrectly states it is "Only meaningful with n_tokens == 1", but the parameter is actively used with `n_tokens > 1` in the verify_batch/spec-decode path (qwen35_dflash_target.cpp:63), and the implementation in graph_builders.cpp:291-296 explicitly describes support for "multi-token ... forwards (decode AND spec verify)". The header constraint is misleading and contradicts both the implementation comment and actual usage.</violation>
</file>
<file name="server/src/common/kvflash_pager.h">
<violation number="1" location="server/src/common/kvflash_pager.h:70">
P1: `attach()` does not validate that pool capacity leaves at least one evictable chunk, so small pools can deadlock eviction and make `slot_for()` fail with `-1`.</violation>
</file>
<file name="server/src/qwen3/qwen3_kvflash_scorer.h">
<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.h:7">
P3: Stale documentation reference: the header comment says 'see common/kv_scorer.h' but no such file exists. The correct base-class header is `common/kvflash_scorer.h` (confirmed at `server/src/common/kvflash_scorer.h`). This will mislead developers looking for the dependency-free interface description.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:1304">
P1: `slot_for()` failure is unchecked, so kvflash can write to KV row `-1` when the pool has no evictable block.</violation>
</file>
The pager core is architecture-blind; this routes each backend's KV writes and masks through it so --kvflash works on every model family the daemon serves.
- qwen35moe (Qwen3.6-35B-A3B): the non-hybrid path inherits qwen35. The Spark pipelined hybrid decode gains a kv_slot parameter; the cached per-layer FA span clamps to the pool, so the cached graph stops rebuilding once the window reaches pool size. The pool span stays maskless like the rest of that path: the pager zeroes freed blocks (page-out + zero_free_blocks on request reset), the same zero-row approximation production padding already relies on. Hybrid spec decode (literal-offset KV writes) falls back to pipelined AR.
- laguna: all 40 layers pooled. laguna_step/_hybrid take a const pager; full + SWA masks are built in SLOT space via fill_slot_pos. SWA exactness comes from a protected tail >= sliding_window. Legacy per-layer hybrid decode and the NO_KVPAD/PAD_CPY/no_mask ablations are refused under kvflash.
- gemma4: pools FULL-attention layers only (SWA layers already ring-buffer; KV-reuse layers share their source tensors). Slot-space full mask; FA span and mask width clamp to tensor capacity. Mutually exclusive with --fa-window; spec verify falls back to AR.
- pager: new const helpers slot_of / fill_slot_pos (slot-space mask construction) and zero_free_blocks (request-reset hygiene for maskless consumers); kvflash state in Qwen35Backend moved to protected for the MoE subclass.
- guards everywhere: prompt-fits-pool on every prefill/restore path, snapshots refused after the first relocation on laguna/gemma4.

Smoked on the 3090, pool 1024 / max-ctx 8192 with live LRU eviction mid-request: A3B Spark hybrid 101.6 tok/s, laguna 137.1, gemma4 119.0, all coherent; gemma4 no-flag control unchanged (120.2).

Co-Authored-By: WOZCODE <contact@withwoz.com>
Update: KVFlash now covers every architecture the daemon serves
All smokes: pool 1024 / max-ctx 8192, ~1.2K logical tokens so live LRU eviction engages mid-request, RTX 3090. A no-flag gemma4 control on the same build confirms the default path is unchanged. The qwen35 numbers in the PR body are unaffected.
Policy note: qwen35/qwen35moe attach the pflash drafter scorer automatically; laguna and gemma4 run LRU-only for now (the drafter is Qwen-tokenizer bound) with the
New pager helpers:
🧙 Built with WOZCODE
2 issues found across 18 files (changes from recent commits).
- Pool-deadlock guard (P1): KvFlashPager::min_pool_tokens() + attach() refusal when sinks + tail window leave no evictable block; every backend floors the requested pool at config read (512 for qwen-family and gemma4; laguna derives its floor from the resident SWA window) with a warning instead of a runtime eviction failure.
- Unchecked slot_for() in do_ar_decode (P1): a -1 slot now fails the request with a clear error instead of becoming a set_rows row index.
- --kvflash / --kvflash-tau (P2): validate as positive integers at the CLI and exit early instead of deferring garbage env values downstream.
- score_chunks (P2): guard chunk_tokens <= 0.
- Stale docs (P3 x2): kvflash_mask comment no longer claims n_tokens==1 only (it serves multi-token spec verify); kv_scorer.h rename leftover now points at common/kvflash_scorer.h.

Verified on the 3090: bad flag values rejected with clear messages; --kvflash 256 raises to the 512 floor and decodes coherently through live eviction in the tightest legal pool (8 blocks, 5 protected).

Co-Authored-By: WOZCODE <contact@withwoz.com>
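For reference, the kind of early CLI check the --kvflash / --kvflash-tau bullet describes can be sketched like this. The helper name `parse_positive_int` is hypothetical and the actual server_main.cpp code may differ; the sketch only shows why strtol with an end-pointer check rejects inputs that a bare atoi would silently turn into 0 or a truncated number.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper: accept only a strictly positive decimal integer.
// Returns false (leaving `out` untouched) for empty strings, trailing junk,
// zero, negatives, and values that do not fit in int.
bool parse_positive_int(const char* s, int& out) {
    if (s == nullptr || *s == '\0') return false;
    errno = 0;
    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0') return false;  // overflow or junk
    if (v <= 0 || v > INT_MAX) return false;
    out = static_cast<int>(v);
    return true;
}
```

A caller would print a clear error and exit when this returns false, rather than exporting the raw string to the environment.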
All 6 cubic findings were valid and are fixed in the latest push:
Rebuilt + re-smoked on the 3090 after the fixes (27B, pool floor path, coherent output through live eviction).
…lpers

The multi-arch port left three copies of the same plumbing; this pulls them into the kvflash layer so each backend integration reduces to wiring (net -32 lines):
- kvflash_pool_from_env(): the env read + 256-rounding + eviction floor + max_ctx clamp lived in three slightly diverging copies (qwen35 inline, laguna, gemma4). One reader, parameterized by the arch's KvFlashConfig; laguna passes its SWA-tail config via a new kvflash_config() so the floor and attach can never disagree.
- KvFlashPager::alloc_span(): the slot_for loop + exhaustion diagnostic existed in laguna, gemma4, and the qwen35moe restore replay; the backend helpers are now one-line delegates and the error message is single-sourced.
- kvflash_fill_rows_and_masks(): laguna's step-input filler and gemma4's inline rows + slot-space mask fill were the same algorithm; the shared helper builds append rows plus causal (and optional sliding-window) masks from the pager's slot map, so graph code no longer reimplements the slot-to-position conversion.

No behavior change: rebuilt on the 3090 and re-smoked the three affected archs through live eviction (laguna 138.0 tok/s, gemma4 119.4, qwen35 37.0, all coherent, banners unchanged).

Co-Authored-By: WOZCODE <contact@withwoz.com>
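The rounding, floor, and clamp steps that kvflash_pool_from_env() unifies can be sketched as a pure function. The name `resolve_pool_tokens` is hypothetical, the env read is left out so the sketch stays testable, and rounding up (rather than down or to nearest) is an assumption; the commit message only says "256-rounding".

```cpp
#include <algorithm>

// Hypothetical sketch of the pool-size rails, assuming `requested` has
// already been validated as a positive integer of sane magnitude.
int resolve_pool_tokens(int requested, int floor_tokens, int max_ctx) {
    constexpr int kRound = 256;
    // Round up to a multiple of 256 (direction is an assumption).
    int pool = (requested + kRound - 1) / kRound * kRound;
    // Raise to the per-arch eviction floor so at least one block is evictable.
    pool = std::max(pool, floor_tokens);
    // Never exceed the logical context bound.
    pool = std::min(pool, max_ctx);
    return pool;
}
```

With the 512 floor quoted for the qwen family, a request of 256 resolves to 512, which matches the "--kvflash 256 raises to the 512 floor" smoke from the earlier fix round.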
1 issue found across 8 files (changes from recent commits).
- assets/cards/kvflash_card.png registered in the README cards grid (DECODE 2.9x at 256K, CONTEXT 256K, KV VRAM -99%), linking to optimizations/kvflash/.
- optimizations/kvflash/README.md gains the hero image (pflash layout).
- README/RESULTS now state explicitly that the 256K full-cache baseline rows are measured, not extrapolated, and fit the 24 GB card only because the KV is Q8_0 (F16 KV would be 9.2 GiB and not fit); KVFlash holds 72 MiB resident either way.

Co-Authored-By: WOZCODE <contact@withwoz.com>
The measured tables now carry the cache parameter on the column itself (KV in VRAM (Q8_0)) instead of relying on the prose footnote alone; the footnote keeps the why (F16 KV would not fit 256K on 24 GB at all). Co-Authored-By: WOZCODE <contact@withwoz.com>
New 'Bounded KV residency (KVFlash)' subsection after the KV cache block, mirroring the Spark pattern: one-paragraph intro + flag table (--kvflash / --kvflash-tau and their env equivalents) linking to optimizations/kvflash/. Co-Authored-By: WOZCODE <contact@withwoz.com>
The 38.6 tok/s / 72 MiB figures are Qwen3.6-27B at one pool size; the four model families land at different speeds. The flags reference now states the property (decode independent of context length, pool-sized resident KV) and points at optimizations/kvflash/ for per-model numbers. Co-Authored-By: WOZCODE <contact@withwoz.com>
… without compression

Three UX/capability gaps closed, all verified on the 3090:
- Pooled chunked prefill in the daemon (DESIGN follow-up #2): a prompt larger than the pool is no longer refused; do_prefill switches to pager-chunk-sized batches with slot-mapped set_rows writes, a slot-space mask per chunk (verify_batch recipe), and live eviction as the pool fills. Constant VRAM, linear time. Smoked: 6843-token prompt through a 2048 pool, coherent output, 35.1 tok/s decode. Restore offsets and boundary snapshots are refused in the pooled path.
- --kvflash auto: sizes the pool from --max-ctx (25% with a drafter configured, 50% LRU-only), same floor/clamp rails, all model families via the shared config reader. Smoked both sizings.
- Drafter scorer without compression: --prefill-drafter alone now arms the residency scorer. The server hands the path to the backend (DFLASH_KVFLASH_DRAFTER); kvflash_ensure_scorer() lazy-loads the drafter on the first reselect that needs it (never on the first tokens) and re-attaches after a draft-residency release. Previously the scorer only attached inside the pflash compress path, so this flag combination silently ran recency-only LRU. Smoked: attach fires mid-generation, banner announces the pending policy.
- Snapshot guards now use pager.is_identity() instead of cumulative page_outs stats: one eviction-heavy request no longer disables snapshots for the rest of the process (laguna/gemma4), and qwen35 refuses identity-copy snapshots of relocated pools.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Update: pooled chunked prefill +
High accuracy by default: when --kvflash is on and no --prefill-drafter
was given, the qwen-family backend probes the well-known locations for
the Qwen3-0.6B drafter (target's dir, drafter/, draft/, then
/opt/lucebox/models/drafter/ — Spark's load-what-sits-next-to-the-model
pattern) and arms the residency scorer from it. LRU is now the explicit
FALLBACK when no drafter exists, and the banner says so
('lru (recency-only: no Qwen3-0.6B drafter found ...)') instead of
presenting recency-only paging as a normal mode.
Nothing turns kvflash itself on by default; this only picks the policy
once the user asks for the pool.
Smoked on the 3090 with ONLY '--kvflash auto': probe found the
appliance drafter, auto sized 25% (drafter expected), scorer attached
at the first reselect, coherent output.
Co-Authored-By: WOZCODE <contact@withwoz.com>
…kvflash-policy
Relevance is a property of the text, not the tokenizer, so non-qwen
targets no longer have to run recency-only residency:
- KvFlashCrossTokScorer: detokenize the target's history with its own
tokenizer (loaded from the target GGUF), re-tokenize for the Qwen3-0.6B
drafter (its GGUF), run the same tail-attention scoring, and map
per-drafter-token scores back to the target's 64-token chunk
boundaries by character spans. Tokenizers are host-only, lazy-loaded.
- laguna + gemma4 gain the full reselect loop (history, adaptive tau,
lazy drafter load at the first reselect boundary, score_hook + repage).
Drafter-scored residency is now the default on ALL four families; the
probe + sizing live in the shared helpers.
- --kvflash-policy {drafter,lru}: the explicit opt-out the default was
missing (no probe, no drafter load, recency-only paging).
- Shared kvflash_find_drafter() / kvflash_policy_is_lru() replace the
per-backend probe; banners state the armed policy and how to change it.
Verified on the 3090 (gemma4 26B-A4B, pool 1024): cross-tok scorer
attaches mid-generation, 18 drafter-driven reselects with page events,
coherent 1.9K-token output. Stress needle A/B vs LRU: LRU degenerates
and never recites; cross-tok stays coherent and recalls the correct
prefix but not the exact code. Documented in RESULTS.md as functional
but untuned (qwen-native scoring keeps its measured 14-16/16; the
teacher-forced NIAH harness for non-qwen archs is the follow-up).
Co-Authored-By: WOZCODE <contact@withwoz.com>
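The character-span mapping in the cross-tok scorer can be illustrated with a small sketch. Everything here is hypothetical: the `Span` type, the function name `map_scores_to_chunks`, and in particular the overlap-weighted mean used to aggregate drafter-token scores into a chunk score. The commit message does not say how the scores are combined, so treat the aggregation as one plausible choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Half-open character range [begin, end) in the detokenized text.
struct Span {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical sketch: project per-drafter-token scores onto target chunks.
// chunk_spans[c]   = characters covered by target chunk c (64 target tokens)
// drafter_spans[t] = characters covered by drafter token t
// Each chunk receives the mean of the drafter scores, weighted by how many
// characters each drafter token shares with the chunk (assumed aggregation).
std::vector<float> map_scores_to_chunks(const std::vector<Span>& chunk_spans,
                                        const std::vector<Span>& drafter_spans,
                                        const std::vector<float>& drafter_scores) {
    std::vector<float> out(chunk_spans.size(), 0.0f);
    const std::size_t n_tok = std::min(drafter_spans.size(), drafter_scores.size());
    for (std::size_t c = 0; c < chunk_spans.size(); ++c) {
        double weighted = 0.0;
        double total = 0.0;
        for (std::size_t t = 0; t < n_tok; ++t) {
            const std::size_t lo = std::max(chunk_spans[c].begin, drafter_spans[t].begin);
            const std::size_t hi = std::min(chunk_spans[c].end, drafter_spans[t].end);
            if (hi > lo) {  // the token overlaps this chunk
                const double overlap = static_cast<double>(hi - lo);
                weighted += overlap * drafter_scores[t];
                total += overlap;
            }
        }
        if (total > 0.0) out[c] = static_cast<float>(weighted / total);
    }
    return out;
}
```

The quadratic loop is for clarity only; since both span lists are sorted by offset, a two-pointer sweep would do the same work in linear time.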
2 issues found across 12 files (changes from recent commits).
'auto' now sizes from the GPU instead of a fixed fraction of max_ctx: half of (device-free minus reserve) after the weights are resident, converted at the model's pooled-KV density, capped at the decode speed knee (16384 tokens default, DFLASH_KVFLASH_MAX_POOL to override) and at max_ctx. Rationale: a bigger pool means more resident chunks and fewer forced evictions of useful context (the relevance-crowding seen in the gemma4 needle stress), while the cap keeps the per-step KV read near the flat-decode optimum; on tight cards the VRAM term shrinks the pool automatically. Backends supply the budget (ggml_backend_dev_memory + per-arch density: qwen35 full-attn layers at resolve_kv_types' quant, laguna all layers at args.kv_type, gemma4 full-attn layers at F16 with per-layer dims); the reserve covers compute buffers plus the drafter when one is expected. The fraction heuristic survives only as the no-budget fallback. Smoked on the 3090 at max-ctx 131072: 27B picks 16384 (free 8.3 GiB, 14.0 KiB/token, speed-capped), gemma4 picks 16384 (7.5 GiB, 20.0 KiB/token), both banners report the full math, both decode coherently. Co-Authored-By: WOZCODE <contact@withwoz.com>
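The sizing rule reads as a short formula, sketched below. The function name `auto_pool_tokens` and its parameters are hypothetical, and the later floor and rounding steps are not shown. The 16384-token default cap is the one quoted above; the caller would supply the free-memory and density figures.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the VRAM-driven 'auto' pool size:
//   tokens = min( (free - reserve) / 2 / bytes_per_token, speed_cap, max_ctx )
// Returns 0 when there is no usable budget, in which case the caller would
// fall back to the fraction-of-max_ctx heuristic.
int64_t auto_pool_tokens(int64_t free_bytes, int64_t reserve_bytes,
                         int64_t bytes_per_token, int64_t max_ctx,
                         int64_t speed_cap = 16384) {
    if (bytes_per_token <= 0 || free_bytes <= reserve_bytes) return 0;
    const int64_t budget = (free_bytes - reserve_bytes) / 2;  // half of what is left
    const int64_t tokens = budget / bytes_per_token;          // pooled-KV density
    return std::min({tokens, speed_cap, max_ctx});
}
```

Plugging in the 27B smoke figures (8.3 GiB free, 14.0 KiB per token) with an assumed 1 GiB reserve gives a VRAM term of roughly 270K tokens, so the 16384 speed cap is the binding limit, consistent with the "speed-capped" banner.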
Four valid findings from cubic's later passes, all fixed:
- KvFlashCrossTokScorer: raw owning pimpl now has deleted copy ctor/assignment (double-free guard; held in unique_ptr everywhere, but the class shouldn't rely on that).
- KvFlashPager::slot_for: a failed allocation rolls cur_chunk_ back so the next eviction's tail window isn't computed from a chunk that never materialized.
- laguna unpark: kvflash_attach failure now frees the just-loaded weights + cache before returning (was leaking them while still reporting parked).
- kvflash_drafter_failed_ latch clears on unpark in all three backends: a transient drafter-load failure no longer downgrades residency to LRU for the process lifetime (still no per-tau retry spam).

Stale finding skipped: the cumulative page_outs snapshot guard was already replaced by is_identity() two rounds ago.

Docs brought up to shipped reality: DESIGN.md per-arch policy section (cross-tok default, --kvflash-policy, VRAM auto), do_prefill bullet (pooled chunked prefill), and the follow-ups list now separates done (pooled prefill, spec-on-pool, VRAM auto, cross-tok) from open (drafter KV persistence, laguna/gemma4 pooled prefill, pooled snapshots, async paging, non-qwen NIAH harness).

Full test_kvflash regression suite on this exact tree: ALL PASS (relocation 2% gate, bit-exact roundtrip, eviction decode, reselect recall, LSA loop, >=90% KV cut), exit 0.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Pre-ship audit complete (
Both GPU jobs shared group lucebox3-gpu-runner, but a concurrency group
holds only ONE waiting job: the CUDA job took the running slot, the
Radeon job sat in the waiting slot, and every new job entering the
group from any branch displaced it ('Canceling since a higher priority
waiting request exists') — the Radeon leg was cancelled chronically
while the 3090 leg passed. The combo box has two distinct GPUs, so the
jobs never contended for a device; per-GPU groups keep cross-PR
serialization where it matters and stop the cross-displacement.
Co-Authored-By: WOZCODE <contact@withwoz.com>
rocminfo on a wedged KFD blocks in uninterruptible sleep until the 20-minute job timeout kills the run with zero evidence. Probe it under a 15 s timeout first; on hang, dump /dev/kfd holders, D-state processes, and recent amdgpu/kfd dmesg, then fail in seconds with the diagnosis on the job page. The smoke step reuses the healthy probe's output. Co-Authored-By: WOZCODE <contact@withwoz.com>
The 'DDTree falls back to AR under KVFlash' limitation guarded against a tree verify that does not exist in the daemon: the complete tree machinery (build_ddtree, build_target_step_tree, follow_verified_tree) is only called from test_dflash, the benchmark harness. In the server, --ddtree sizes the verify intermediates for budget+1 tokens and enables fast_rollback, then generation runs the same chain spec loop either way — and both pieces are already pool-compatible: chain verify_batch is slot-mapped (measured at acceptance parity), and fast_rollback's snapshot_kv/restore_kv only snapshot DeltaNet/conv recurrent state, which KVFlash never pages. Gate removed; docs corrected (the known-limit now names the harness-only tree graphs, not the daemon). A/B on the 3090 (27B + DFlash draft, --ddtree, 600 tokens): pooled 14.6% accept / avg_commit 3.33 / 33.5 tok/s vs full-cache 13.9% / 3.23 / 33.3 — parity, both coherent. Co-Authored-By: WOZCODE <contact@withwoz.com>
timeout(1) cannot kill a process in uninterruptible sleep, so the previous diagnostic step itself blocked for the full job timeout when KFD was wedged (observed live: 20 minutes of silence, no evidence printed). Probe rocminfo in the background with output to a file (no held pipe), enforce the 15 s deadline in the shell, and on hang print the probe's own D-state, /dev/kfd holders, and amdgpu dmesg before failing fast — without ever wait()ing on the corpse. Co-Authored-By: WOZCODE <contact@withwoz.com>
Update:
…gression fix

Spec decode now runs on the pool everywhere it exists. gemma4 was the last gap:
- gemma4_verify_batch gains the kvflash path: set_rows kv-index inputs (full layers -> pool slots, SWA -> ring rows), slot-space causal mask via the shared helper, FA span + mask width clamped to the pool. Gemma4DFlashTarget allocates the verify block's slots up front; the spec loop's KV-truncation rejection maps directly onto the pool's validity rule (rejected slots hold future positions, masked until the next verify rewrites them). Both backend spec gates removed.
- Pre-existing regression fixed (blocks gemma spec on MAIN, not just here): PR #359's strict assert reads dflash.n_target_layers, which the published gemma draft fills with the TARGET layer count (30) while its fc tensor is sized for the 6 CAPTURE layers, so the draft refused to load at all. Per that PR's own weights-are-ground-truth rule, derive the capture count from fc when it divides n_embd and warn on the metadata mismatch; genuinely inconsistent shapes still fail.
- gemma4 accept_rate now reaches the HTTP usage block (it was silently 0.0 while the loop logged the real rate; same reporting-only class as the PR #321 layer-split gap).

A/B on the 3090 (26B-A4B + published q8_0 draft, 600 tokens): pooled and full cache produce IDENTICAL acceptance (407/3104 = 13.1%, avg_commit 3.09) and identical text; usage reports 0.131 on both.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Update: gemma4 spec decode on the pool — spec now works everywhere it exists (
…7B-hardcoded) The converter stamped the qwen35-27B draft's scalars (n_head_kv=8, hidden=5120, n_layer=5, ff=17408, ...) onto every draft regardless of source, so any non-27B DFlash draft (A3B, gemma) converted to a GGUF with correct weights but wrong metadata — which the strict draft loader then rejected (blk.0 attn_k dim != n_head_kv*head_dim). Every MoE/A3B spec-decode attempt on main fails at draft load for this reason. load_arch() now resolves the architecture from the source config.json (authoritative for transformer hparams) cross-checked against the tensor shapes (authoritative for the rest: head_dim from k_proj, intermediate from gate_proj, n_target_layers from fc, n_layer from the block count), falling back to the 27B constants only when config.json is absent. Verified: A3B draft converts to n_head_kv=4 n_layer=8 ff=6144 and loads clean. This unblocks MoE speculative decode. Validated on the 3090: A3B MoE all-GPU with --ddtree + --kvflash 2048 runs spec decode on the pool (10.4% accept, avg_commit 2.66, coherent) vs full cache (11.5%, 2.84, coherent) — so dflash + ddtree + kvflash compose on MoE. The qwen35moe --spark hybrid spec path has a separate pre-existing CUDA crash (see RESULTS Known limits); it was never reachable until drafts could load. Co-Authored-By: WOZCODE <contact@withwoz.com>
Update: MoE has dflash + ddtree on the pool (
…rash/correctness fixes

--spark + DFlash speculative decode on MoE targets (Qwen3.6-35B-A3B etc.) crashed, then produced garbage, then ran ~4x slower than plain --spark AR. Three root causes, all fixed:
1. Crash. The F32 shared-expert gate (ffn_gate_inp_shexp, M=1) routed to cublasSgemm, and the shipped CUDA 12.0 cublasLt is missing the gemv/split-K reduce kernels for small-M matmuls at N>1 (the verify/replay batches), poisoning the stream (surfaced downstream as an illegal access in MUL_MAT_ID). Compute the scalar gate cublas-free: broadcast elementwise mul + sum_rows.
2. Garbage / collapse-to-"the". The MoE path never allocated the DeltaNet ssm/conv rollback snapshot tensors (migrate_prefill_cache is dense-only), so snapshot/restore_ssm_state were silent no-ops and rejected draft tokens leaked permanently into the recurrent state. Add ensure_ssm_snapshot().
3. Speed. The verify re-evaluated cold experts on the CPU every step, while the AR pipelined path swaps a token's selected cold experts into GPU spare slots (LRU cache, ~21 slots/layer). Make the verify use the same moe_hybrid_cache_swap_in so every layer runs all-hot on GPU. Verify FFN 48->17 ms, decode 28->49 tok/s on a 3090.

Also cache the mixed batched FFN graphs per n_tokens (cache-full fallback) and argmax on the GPU. accept_rate is now plumbed to the HTTP usage field.

Validated on RTX 3090 (Qwen3.6-35B-A3B): 11% accept, coherent 500-token output, no crash.

Co-Authored-By: WOZCODE <contact@withwoz.com>
The batched verify's cost scales with the distinct experts its tokens touch, and under --spark expert offload that means streaming extra cold experts for the draft tokens past the realized accept length. Those tokens are rejected, so cap the verify to the realized accept length plus a small margin (adaptive; DFLASH_VERIFY_WIDTH pins a fixed width). At AL~1.7 this verifies ~6 tokens instead of the full 16-token draft block at no acceptance cost: spec decode 49 -> 64 tok/s at full residency, 35 -> 53 tok/s at simulated-16GB offload. Co-Authored-By: WOZCODE <contact@withwoz.com>
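A rough sketch of the adaptive cap: track the realized accept length and verify only that many tokens plus a margin. The class name `VerifyWidth`, the exponential moving average, its 0.1 smoothing factor, and the margin of 4 used in the usage note are all assumptions chosen to reproduce the "~6 tokens at AL~1.7" figure; the commit message does not specify how the realized accept length is tracked.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical sketch of an adaptive verify width.
class VerifyWidth {
public:
    // block_tokens: full draft block size (16 in the commit message)
    // margin:       extra tokens verified beyond the tracked accept length
    // pinned_width: > 0 pins a fixed width (the DFLASH_VERIFY_WIDTH role)
    VerifyWidth(int block_tokens, int margin, int pinned_width = 0)
        : block_(block_tokens), margin_(margin), pinned_(pinned_width),
          ema_(static_cast<double>(block_tokens)) {}  // start wide, then adapt

    // Feed the number of draft tokens accepted by the last verify.
    void observe(int accepted) {
        ema_ = kAlpha * accepted + (1.0 - kAlpha) * ema_;
    }

    // Tokens to verify on the next step, always within [1, block_tokens].
    int width() const {
        if (pinned_ > 0) return std::min(pinned_, block_);
        const int w = static_cast<int>(std::lround(ema_)) + margin_;
        return std::max(1, std::min(w, block_));
    }

private:
    static constexpr double kAlpha = 0.1;  // smoothing factor (assumed)
    int block_;
    int margin_;
    int pinned_;
    double ema_;
};
```

With a 16-token block and a margin of 4, a run of steps that accept about 2 tokens each settles at a width of 6, while the first steps still verify the full block.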
eval_moe_hot_only_batched cached a single hot_batched_graph keyed by n_tokens. Spec decode alternates verify (verify_width) and replay (commit_n) batch sizes, so that single slot rebuilt all 40 layers' FFN graphs on every flip. Reuse the per-n_tokens hot_batched_mixed[] array so each batch size keeps its own graph; after warmup the verify/replay flip rebuilds nothing. 64 -> 68 tok/s. Co-Authored-By: WOZCODE <contact@withwoz.com>
…lash spec history

Code review (codex) of the spark/spec changes surfaced:
- Dummy (unused) FFN slots were filled with `i % n_hot_stack`, which includes UNINITIALIZED cache-ring spare slots (hot_active..n_hot_stack). Their garbage Q4_K scale bits can dequantize to NaN (x weight 0 = NaN -> corruption). Dummy to pinned/initialized experts (hot_active) instead.
- ensure_ssm_snapshot() left dangling snapshot pointers into a freed rollback_ctx on alloc failure; null them so snapshot/restore_ssm_state skip rather than dereference.
- Speculative decode ran under --kvflash but never appended committed tokens to kvflash_history_ / called kvflash_maybe_reselect (the AR paths do), so the pool evicted on stale prompt-only history. Sync them per committed step.
- The cached cold batched FFN compute ignored its GGML_STATUS; check it.

Validated on RTX 3090: spec 69 tok/s coherent (no regression); --spark --draft --kvflash coherent, 51 tok/s, no crash.

Co-Authored-By: WOZCODE <contact@withwoz.com>
…s fixes

- Usage leads with the drafter-guaranteed form; --kvflash auto alone can silently fall back to LRU if no drafter is auto-probed.
- Harness accuracy 32/32 -> 36/36 (HumanEval+GSM+MATH+agent = 10+10+10+6).
- qwen35moe --spark hybrid spec decode now runs on the pool: drop the stale crash known-limit, add validated 69.2/51.4 tok/s numbers.
- DESIGN/README: daemon --ddtree runs on the pool; only test_dflash tree-verify graphs remain non-pool-aware.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Brings KVFlash bounded KV residency (Luce-Org#373), spec-decode budget-hook fixes (Luce-Org#379), FlowKV multi-turn prefill (Luce-Org#372), oversized-prompt admission, the fa-window tool-call warning (Luce-Org#378), and the llama.cpp bump to 574be613.

Conflict resolutions:
- submodule: rebased our pool_trim commit onto 574be613 -> 07cee1dce.
- qwen35_backend.cpp: kept cache_max_verify_tokens_ alongside the KVFlash pool-budget block; gated tree verify off when the KVFlash pager is active (slot-space masks are not tree-position aware); Luce-Org#379 budget-hook fix merged cleanly into the spec-decode tail-off.
- http_server.cpp: kept our admission gate (clamps max_tokens for fixed-budget clients, tool requests bypass compression) over main's should_reject_oversized; ours subsumes the admit-when-compressible case.

Regression: chunked delta-net parity (CPU+CUDA) and the UTF-8 split-token test both pass post-merge.
KVFlash: bounded KV residency (lookahead sparse attention) for dflash
FlashMemory-style (arXiv 2606.09079) decode-time KV paging behind a new `--kvflash <tokens>` flag. The full-attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM bit-exact and recallable. GPU KV footprint becomes a hard O(pool) constant at any logical context length.

Full docs in `optimizations/kvflash/` (README, RESULTS, DESIGN).

Headline numbers (lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV)
Decode is flat at 38.6 tok/s from 64K to the model's native 256K maximum (1.4x / 2.0x / 2.9x over the full cache), prefill is up to 2.8x faster, and attn-KV memory drops 99.2% (2304 to 18 MiB at 128K with a 1K pool).
How it works
- Attention KV tensors allocate at pool capacity (`create_target_cache` gains `ctx_alloc`); `cache.max_ctx` stays the logical bound. The allocation delta IS the saving.
- The pager (`common/kvflash_pager.h`) maps logical positions to pool slots at 64-token chunk granularity, riding the existing step-invariant `set_rows` KV append. RoPE is baked into K rows at write time, so relocation is legal; page-out/page-in moves raw quantized bytes and is bit-exact.
- Every tau decoded tokens, `reselect()` repages the pool: the paper's lookahead loop, with a hard capacity cap their sigmoid threshold lacks.

Policy is pluggable, pflash is optional
`KvFlashScorer` (`common/`) is the policy seam. With no scorer the pool runs pure LRU (zero pflash dependency, recency-only memory). When pflash loads its drafter, `KvFlashDrafterScorer` attaches automatically and reselect becomes relevance-driven: needle recall holds at 88-100% down to 6-9% residency from 8K to 256K, where LRU scores 0 outside its tail window.

Spec decode runs on the pool
Chain-mode `verify_batch` slot-maps the draft block (per-token `kv_write_rows`, which is `[n_tokens, n_head_kv]` ne0-major) and builds a slot-space mask. Rejected drafts need no rollback: the `pos < base_pos` validity rule excludes their slots until rewritten. Acceptance parity measured on the daemon: 15.4-15.6% pooled vs 15.3% full cache. DDTree tree-verify is not pool-aware yet and falls back to AR with a one-time warning.

Quality
Harness ground truth with the pool sized per the heuristic: HumanEval 10/10, GSM 10/10, MATH 10/10, agent 6/6, identical to the full-cache baseline (base-vs-base control: 16/16 byte-identical, so the stack is deterministic; text drift under KVFlash is the masked kernel's different deterministic rounding lineage, not a correctness effect).
Verification
`test_kvflash` suite A-F: full-cache baseline, shuffled-relocation equivalence (0.83% argmax flips, gate 2%), live paging with bit-exact roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode profile, full LSA loop with the drafter as Memory Indexer.

Known limits (documented in RESULTS.md)
- Post-generation snapshots are skipped once `cur_pos` exceeds the pool (pooled snapshots need page-table serialization); prefill-time snapshots work.

Usage