feat(kvflash): target-QK residency scorer (--kvflash-policy qk) #385

Open
dusterbloom wants to merge 2 commits into Luce-Org:main from dusterbloom:feat/kvflash-qk-scorer-clean

@dusterbloom dusterbloom commented Jun 14, 2026

Collaborator

KVFlash target-QK residency scorer (--kvflash-policy qk)

Adds a third KVFlash residency policy that reuses the target model's own KV keys as the relevance index — no extra model resident, no separate scoring forward pass.

The score of a sealed chunk is the mean, over the full-attention layers, of the maximum over (kv_head, group q-head) pairs of cos(decode-Q, mean-pooled chunk-K). Both vectors are taken post-RoPE, and post-FWHT when the K cache is rotated: the shared orthogonal transform preserves cosine, so the result equals the unrotated-basis score.
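As a rough illustration of the scoring rule above, here is a minimal sketch. The `LayerView` layout, the function names, and the grouping rule (q-head `h` belongs to kv-head `h / group`) are assumptions made for the example and are not taken from `kvflash_qk.h`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Cosine similarity; returns 0 when either vector has zero norm.
inline float cosine(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += double(a[i]) * b[i];
        na += double(a[i]) * a[i];
        nb += double(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0f;
    return float(dot / std::sqrt(na * nb));
}

// Hypothetical per-layer view of one chunk (assumed layout).
struct LayerView {
    std::vector<Vec> q_heads;   // decode-step query, one vector per q-head
    std::vector<Vec> pooled_k;  // mean-pooled chunk key, one vector per kv-head
};

// Mean over layers of the max cosine over (kv_head, q-head in its group).
inline float chunk_score(const std::vector<LayerView>& layers) {
    assert(!layers.empty());
    double sum = 0.0;
    for (const LayerView& l : layers) {
        assert(!l.pooled_k.empty());
        assert(l.q_heads.size() % l.pooled_k.size() == 0);
        const std::size_t group = l.q_heads.size() / l.pooled_k.size();
        float best = -1.0f;  // lowest value a cosine can take
        for (std::size_t kv = 0; kv < l.pooled_k.size(); ++kv) {
            for (std::size_t g = 0; g < group; ++g) {
                best = std::max(best, cosine(l.q_heads[kv * group + g], l.pooled_k[kv]));
            }
        }
        sum += best;
    }
    return float(sum / double(layers.size()));
}
```

Taking the max inside a layer lets a single well-aligned head surface a chunk, while the mean across layers damps a spurious match in any one layer.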

Why

The bounded KV pool (#373) needs a per-step "which chunks stay resident?" decision. Today's options:

  • LRU — free, but recalls nothing once a chunk is evicted.
  • drafter — drafter-grade recall, but pays a 0.6B forward pass per reselect.

The QK policy gets the drafter's selection quality at LRU's cost, because the keys it scores against were already computed and cached by the target.

Evidence — 3-arm gate

test_kvflash --qkbench, L=65536, pool=4096, Qwen3.6-35B-A3B-Q3_K_M, RTX 3090. One process per arm, arm= label verified in-log.

| arm | rescore_mean (s) | needle | facts | gold_res | decode tok/s |
|---|---|---|---|---|---|
| lru (control) | 0.00 | 0/16 | 0/64 | 0/7 | 124.3 |
| drafter | 166.44 | 14/16 | 63/64 | 7/7 | 99.5 |
| qk (this PR) | 0.30 | 14/16 | 63/64 | 7/7 | 126.6 |

QK matches drafter recall exactly (14/16 needle, 63/64 facts, 7/7 gold chunks resident) with a roughly 555× cheaper rescore, and it decodes faster than the drafter arm (no 0.6B model resident). LRU is the control: free, but it recalls nothing.
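As a quick check, the two ratios can be recomputed from the table. The inputs are the rounded values shown there, so the results are approximate:

$$
\frac{166.44\ \text{s}}{0.30\ \text{s}} \approx 554.8 \approx 555\times,
\qquad
\frac{126.6\ \text{tok/s}}{99.5\ \text{tok/s}} \approx 1.27\times
$$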

Scope

  • ~329 LOC source (kvflash_qk.h kernel = 195), ~382 LOC tests, +16 docs.
  • The pure scoring function is dependency-free and unit-tested with synthetic data (test_kvflash_qk.cpp); --qkbench is the end-to-end gate.
  • Wired into qwen35 only for now (the other KVFlash backends keep LRU/drafter).

Known limitations / follow-ups

  • Evicted chunks can't be paged back — the scorer can only re-rank chunks still in the pool. At pool=ctx/10 this is a decode/VRAM lever, not a memory-recall lever. The fix is critical-chunk pinning (sink + recent + tool-defs + flagged needles); separate PR.
  • The query-capture-and-pool lifecycle is open-coded as kvflash_qk_policy_ branches in the backend (not behind the scorer interface) — deliberately, since there's one query-capturing scorer. A second one should lift those into a KvFlashTargetScorer with seal/step/reselect hooks.
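To make the second follow-up concrete, here is one possible shape for that interface. Only the name `KvFlashTargetScorer` and the hook names seal/step/reselect come from the text above. Every signature, and the toy `DotScorer` used to exercise it, is a hypothetical sketch rather than code from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical interface: all signatures here are assumptions.
struct KvFlashTargetScorer {
    virtual ~KvFlashTargetScorer() = default;
    // Called once when a chunk is sealed; receives its pooled key.
    virtual void seal(int chunk_id, const std::vector<float>& pooled_key) = 0;
    // Called on a decode step with the captured query.
    virtual void step(const std::vector<float>& query) = 0;
    // Returns up to `budget` chunk ids to keep resident, best first.
    virtual std::vector<int> reselect(std::size_t budget) const = 0;
};

// Toy scorer for illustration only: ranks chunks by the dot product between
// the latest query and each pooled key. step() must be called before
// reselect() so that the query has the same length as the keys.
class DotScorer final : public KvFlashTargetScorer {
public:
    void seal(int chunk_id, const std::vector<float>& pooled_key) override {
        ids_.push_back(chunk_id);
        keys_.push_back(pooled_key);
    }
    void step(const std::vector<float>& query) override { query_ = query; }
    std::vector<int> reselect(std::size_t budget) const override {
        std::vector<float> score(keys_.size(), 0.0f);
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            assert(keys_[i].size() == query_.size());
            score[i] = std::inner_product(keys_[i].begin(), keys_[i].end(),
                                          query_.begin(), 0.0f);
        }
        std::vector<std::size_t> order(keys_.size());
        std::iota(order.begin(), order.end(), std::size_t{0});
        // Highest score first; stable so ties keep seal order.
        std::stable_sort(order.begin(), order.end(),
                         [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });
        std::vector<int> keep;
        for (std::size_t i = 0; i < order.size() && i < budget; ++i) {
            keep.push_back(ids_[order[i]]);
        }
        return keep;
    }
private:
    std::vector<int> ids_;
    std::vector<std::vector<float>> keys_;
    std::vector<float> query_;
};
```

With hooks like these, the backend would call `seal` when a chunk closes, `step` when a query is captured, and `reselect` when the pool needs a new residency decision, so a second query-capturing scorer would not require new branches in the backend.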

Branch sits directly on current main (ahead 1, behind 0).

…kbench

- common/kvflash_qk.h: Phase-0-validated scoring (pooled post-RoPE K x
  decode Q cosine, max-over-group-heads, layer-mean), seal-time pooling
- graph: per-layer post-RoPE/post-FWHT Q capture into cache.q_cap
- backend: policy=qk wiring (pool at seal, query at reselect)
- test_kvflash --qkbench: LRU/drafter/QK arms, needle /16 + 4-fact
  cold-history probe with decoys, real-token filler
@dusterbloom dusterbloom marked this pull request as ready for review June 15, 2026 12:15

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

2 issues found across 12 files

Comment thread server/src/common/kvflash_qk.h Outdated
Comment thread optimizations/kvflash/README.md Outdated
- kvflash_qk_chunk_scores missing_score default 0.0 -> -2.0 (below the
  [-1,1] cosine-mean floor) so a no-info chunk never outranks a real
  chunk with negative query correlation in reselect()
- unit test locks the default-sentinel ranking
- README: server/ prefix on test_kvflash_qk.cpp path
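The effect of the sentinel change can be sketched as follows. `ChunkInfo` and `rank_chunks` are hypothetical names for this illustration; they do not reproduce the signature of `kvflash_qk_chunk_scores`.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical record: a chunk either has a real cosine-mean score in
// [-1, 1] or carries no scoring information at all.
struct ChunkInfo {
    bool has_score;
    float score;
};

// Returns chunk indices ordered best first. Chunks without a score are
// assigned `missing_score` before sorting.
inline std::vector<std::size_t> rank_chunks(const std::vector<ChunkInfo>& chunks,
                                            float missing_score = -2.0f) {
    std::vector<float> filled(chunks.size());
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        filled[i] = chunks[i].has_score ? chunks[i].score : missing_score;
    }
    std::vector<std::size_t> order(chunks.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Stable sort so equal scores keep their original order.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return filled[a] > filled[b]; });
    return order;
}
```

For chunks scored 0.4, (no info), and -0.3, a sentinel of 0.0 ranks the no-info chunk second, ahead of the real chunk at -0.3. With -2.0, which lies below the lowest possible cosine mean, the no-info chunk ranks last.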