compose: FlowKV aged-history compression + drafter residency fix - 1.72x vs disk-cache baseline at <=64K #372

Merged
davide221 merged 13 commits into Luce-Org:main from dusterbloom:split/11-flowkv-compose on Jun 12, 2026
Conversation

@dusterbloom dusterbloom commented Jun 11, 2026


TL;DR

Baseline is current main, which already ships the #364 scoped disk prefix cache (merged). This PR adds FlowKV aged-history compression on top of that baseline (RTX 3090 24GB):

| | main baseline (#364 disk cache) | this PR (compose) | delta |
|---|---|---|---|
| 7-turn agentic session wall (N=3 mean) | 527.5s | 306.7s | 1.72x |
| worst-turn fresh prefill (26K tok) | 370 tok/s (73-77s) | 396 tok/s (66-73s) | parity+ |
| decode @63k context | 8.1-9.1 tok/s | 10.6-14.5 tok/s | +16-30% |
| tool-call validity | 16/21 | 18/21 | held |

Benchmark: goldgate_fix trace (real multi-turn agentic session, 34K-64K prompt tokens per turn), N=3 interleaved A/B on the same binary, same thermal window.

Re-validation + q8_0 KV (2026-06-12)

The compose win was re-validated on the final PR head (incl. the deps bump below), same trace, same binary/lib/window, on both the current default KV type (tq3_0) and q8_0:

| arm | wall (N=2 mean) | worst-turn fresh prefill | decode @63k | tool-calls |
|---|---|---|---|---|
| main baseline, tq3_0 | 437.1s | 403 tok/s | 13.7-14.3 tok/s | 12/14 |
| compose (this PR), tq3_0 | 257.6s | 400 tok/s | 14.8-15.5 tok/s | 12/14 |
| main baseline, q8_0 | 270.3s | 664 tok/s | 22.1-23.2 tok/s | 10/14 |
| compose (this PR), q8_0 | 192.2s | 673 tok/s | 21.9-23.6 tok/s | 13/14 |

  • compose wins 1.70x on tq3_0 and 1.41x on q8_0 (same-window N=2 means).
  • q8_0 KV roughly doubles decode at 63K vs tq3_0 (llama-bench, this fork: tq3_0 14.43 vs q8_0 28.3-29.4 tok/s tg32 @ d63488). Compose + q8_0 vs the main baseline at today's default KV (tq3_0) — same binary, lib, and window — is 2.27x wall end-to-end.
  • Serving q8_0 at max-ctx 65536 on 24GB requires: (a) the llama.cpp deps bump in this PR (llama.cpp-dflash-ggml#16 — see Limitations), and (b) building with -DDFLASH27B_FA_ALL_QUANTS=OFF (see Limitations). Flags: --cache-type-k q8_0 --cache-type-v q8_0.
  • The drafter-residency release in this PR is what buys the ~2GB that lets q8_0 KV fit at 65K on 24GB.
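
The speedup figures quoted here are plain ratios of the wall-clock means in the tables. A minimal sketch of that arithmetic follows; the helper names `speedup` and `round2` are illustrative and do not come from the PR.

```cpp
#include <cmath>

// Speedup of a candidate arm over a baseline arm, both in wall-clock seconds.
// Illustrative helper; not part of the PR.
double speedup(double baseline_s, double candidate_s) {
    return baseline_s / candidate_s;
}

// Round to two decimals, the precision used in the PR text.
double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

With the table values, `round2(speedup(437.1, 257.6))` gives 1.70 (tq3_0), `round2(speedup(270.3, 192.2))` gives 1.41 (q8_0), and `round2(speedup(437.1, 192.2))` gives 2.27 (compose + q8_0 against the tq3_0 baseline).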

Summary

Main's #364 disk prefix cache (merged) made warm agentic turns cheap by restoring a stable token prefix from disk. The remaining cost on long sessions is the aged conversation history that still has to be prefilled fresh whenever the prefix diverges, and the per-turn growth beyond the cached boundary. This PR composes FlowKV aged-history compression with that cache: messages older than a hot window are compressed (drafter-scored, anchor-preserving) while the system prompt stays verbatim as the cache anchor, so the disk-cache key remains stable across turns. A unified gate keeps the three paths exclusive — turn-1 verbatim, FlowKV on continuations, whole-prompt pFlash otherwise — and with compression disabled the request path is byte-identical to main.
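
The three-way exclusivity of the gate can be pictured as one routing decision per request. The sketch below is an illustration under stated assumptions: the enum, the struct, the function name, and the exact conditions (in particular the size threshold for whole-prompt pFlash) are hypothetical, since the PR description does not show the gate's code in http_server.cpp.

```cpp
// Hypothetical names throughout; the real gate is not shown in the PR text.
enum class PrefillRoute { Verbatim, FlowKV, WholePromptPFlash };

struct GateInput {
    bool compress_enabled;   // server-level compression switch
    bool is_continuation;    // request extends an earlier turn of the session
    int  prompt_tokens;      // original (uncompressed) prompt size
    int  threshold_tokens;   // assumed size gate for whole-prompt pFlash
};

// Decide once per request; exactly one path is taken.
PrefillRoute choose_route(const GateInput& in) {
    if (!in.compress_enabled) return PrefillRoute::Verbatim;  // same path as main
    if (in.is_continuation)   return PrefillRoute::FlowKV;    // compress aged history
    if (in.prompt_tokens >= in.threshold_tokens)
        return PrefillRoute::WholePromptPFlash;               // cold, oversized prompt
    return PrefillRoute::Verbatim;                            // e.g. an ordinary turn 1
}
```

Computing the decision in one place is what keeps the paths from shadowing each other: no branch can fire after another has already rewritten the prompt.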

Two fixes found during benchmarking turned the compose from a wash into the 1.72x above:

  • Drafter residency. The pflash scoring drafter (~2GB BF16) staying resident through the target's large prefill collapses prefill throughput 370 -> 121 tok/s on 24GB cards (allocator pressure at ~61K KV; verified by A/B, not capacity — the same prefill runs at full rate with the drafter released). The drafter is now released after its scoring pass and lazily reloaded (~2s). This is the auto residency default; --draft-residency persistent keeps the old behavior for >=32GB cards.
  • Admission ordering. The ingress context-length check rejected prompt+max_tokens > max_ctx before compression could run. Oversized requests are now admitted when compression will run, and the hard limit is enforced on the post-compress effective size instead.
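
The admission reordering amounts to two checks at different points in the request path. This is a minimal sketch: `should_reject_oversized()` is named in the PR, but its signature here is an assumption, and `overflows_after_compress()` is a hypothetical stand-in for the post-compress gate.

```cpp
// Ingress check. Assumed signature; the PR names the helper but does not show it.
bool should_reject_oversized(int prompt_tokens, int max_tokens, int max_ctx,
                             bool compression_will_run) {
    const bool oversized =
        static_cast<long long>(prompt_tokens) + max_tokens > max_ctx;
    // Oversized requests pass ingress only when compression may still shrink them.
    return oversized && !compression_will_run;
}

// Hard limit, applied later to the effective (post-compress) prompt size.
// Hypothetical helper name.
bool overflows_after_compress(int effective_prompt_tokens, int max_tokens, int max_ctx) {
    return static_cast<long long>(effective_prompt_tokens) + max_tokens > max_ctx;
}
```

A 70,000-token prompt at `max_ctx` 65536 is rejected at ingress when no compression will run, admitted when it will, and then rejected after compression only if the compressed prompt plus `max_tokens` still exceeds the limit.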

Changes

  • FlowKV aged-history compression composed with the #364 scoped disk cache behind a unified gate (http_server.cpp); compress off keeps main's behavior byte-identical.
  • auto draft residency releases the pflash drafter after compress scoring (placement/draft_residency.h).
  • Pure admission helper should_reject_oversized() + post-compress effective-size gate (server/admission.h).
  • Skip-park guard: --prefill-skip-park downgraded on <32GB GPUs at max_ctx>65536 (VMM VA-fragmentation crash class) (placement/skip_park_guard.h).
  • ee7 early-exit drafter, anchor-transitive cascade with expansion throttles, tail-capture guard for the chunk-boundary assert.
  • deps: llama.cpp bumped to luce-dflash 574be613 — picks up llama.cpp-dflash-ggml#16 (cuMemSetAccess retry during VMM pool growth).
  • Tests: 1926-assertion unit suite green; standalone suites for admission (7), skip-park guard (6), anchor cascade, early-exit score range, warm-path regression. ~55% of the diff is tests.

Limitations

  • VMM pool-growth crash class — corrected root cause. The long-standing cuMemSetAccess device not ready abort (>64K cold prompts, and reproducible at 65K with q8_0/f16 KV + decode draft resident, ~20GB load, no pFlash involved) is a transient NOT_READY during VMM pool growth that was masking the real condition. llama.cpp-dflash-ggml#16 (merged; included via this PR's deps bump) retries through the transient: the crash cell now either completes (verified 7/7 on the goldgate replay where the old pointer died turn-1) or reports an honest out-of-memory.
  • FA_ALL_QUANTS VRAM spike at high load. server/CMakeLists.txt defaults DFLASH27B_FA_ALL_QUANTS=ON, which compiles ~3x more fattn kernel instances; with CUDA lazy module loading their device code faults in at the first FlashAttention call mid-prefill. At >=~20GB load (q8_0 KV at 65K + decode draft) this spike tips pool growth into OOM — reproduced deterministically with the GPU otherwise idle, and eliminated by building with -DDFLASH27B_FA_ALL_QUANTS=OFF (same-type KV pairs incl. tq3_0/q8_0/q4_0 keep their fast vec kernels; only mixed K/V type pairs need ON). Changing the default is left to a follow-up.
  • Compression keeps ~93% on dense code (anchor-dominated) — known, separate lever.
  • GGML_CUDA_NO_VMM=1 as an environment variable is a no-op (compile-time option in this fork); scripts relying on it were never protected.
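
The behaviour described in the first limitation (retry through a transient not-ready status, then either succeed or surface the real error) follows a common pattern. The sketch below shows that pattern only; the `Status` enum, the function name, and the bounded attempt count are assumptions, and the actual retry policy in llama.cpp-dflash-ggml#16 is not described in this PR.

```cpp
#include <functional>

// Hypothetical status codes standing in for driver results.
enum class Status { Ok, NotReady, OutOfMemory };

// Retry a call while it reports a transient NotReady, up to max_attempts.
// Any other status (success, or a real failure such as OOM) is returned at once.
Status retry_while_not_ready(const std::function<Status()>& op, int max_attempts) {
    Status s = Status::NotReady;
    for (int attempt = 0; attempt < max_attempts && s == Status::NotReady; ++attempt) {
        s = op();
    }
    return s;
}
```

The point of the pattern is that a transient status no longer hides the terminal one: a call that would eventually succeed completes, and one that is truly out of memory reports that instead of aborting on the first not-ready.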

History

  1. 731561d1 compose FlowKV with #364 scoped cache; 0efdc33c gate compression as fallback so compose can't regress main; 6a848058 unified gate (FlowKV reachable + scoped save preserved).
  2. cefa3caf ee7 early-exit drafter + anchor-transitive cascade + tail-capture guard.
  3. 3fc6882f drafter auto-release after compress scoring (the 1.72x).
  4. 2ae98c0f compress-aware admission.
  5. 637fbdaf comment trim (-133 LOC, no logic changes).
  6. 1c562eb4 skip-park footprint guard.
  7. e542e908 review fixes (request-scope override no longer drops server compress; full-cache-hit admission); 26e0ee33 -1 sentinel for full-cache served tokens.
  8. 6a584981 deps bump to luce-dflash 574be613 (VMM pool-growth fix).

…rg#364 scoped cache

- Port 354e7b6 message-count freeze (aged[1..n-hot) compressed once, cached)
- Remove mutual-exclusion: FlowKV active → disk clamps to system_end (verbatim system anchor, stable cross-session key); Luce-Org#364 unchanged when compress=false
- WS1: non-continuation turns skip compression (cold-poison fix preserved)
- Inert-guard: aged band < 512 tokens → FlowKV-OFF
- Config: DiskPrefixCachePolicy::compress + --disk-prefix-cache-compress CLI
- Tests T1-T7: 1908 assertions, 0 failures
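
The aged band and the inert guard from this commit message can be sketched as an index split over the message list. Everything below is illustrative: the struct, the function name, and the per-message token vector are assumptions, with message 0 assumed to be the system prompt and the 512-token floor taken from the text.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Half-open range [aged_begin, aged_end) over the message list.
// Message 0 is assumed to be the system prompt (kept verbatim as the anchor).
struct HistorySplit {
    std::size_t aged_begin;  // first aged message (always 1 here)
    std::size_t aged_end;    // one past the last aged message == first hot message
    bool        flowkv_on;   // false when the aged band is too small to compress
};

// tokens_per_message[i] is the token count of message i; `hot` is the number of
// most recent messages kept verbatim. min_aged_tokens mirrors the inert guard.
HistorySplit split_history(const std::vector<int>& tokens_per_message,
                           std::size_t hot, int min_aged_tokens = 512) {
    const std::size_t n = tokens_per_message.size();
    HistorySplit s{1, 1, false};
    if (n <= 1 + hot) return s;              // nothing older than the hot window
    s.aged_end = n - hot;                    // aged = [1, n - hot)
    const long long aged_tokens = std::accumulate(
        tokens_per_message.begin() + 1,
        tokens_per_message.begin() + static_cast<std::ptrdiff_t>(s.aged_end), 0LL);
    s.flowkv_on = aged_tokens >= min_aged_tokens;
    return s;
}
```

For five messages with `hot = 2`, messages 1 and 2 form the aged band; FlowKV stays off in this sketch if together they hold fewer than 512 tokens.
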
… vs Luce-Org#364

FlowKV ran whenever disk_cache_policy.compress was set, with no size gate, so
every multi-turn agentic turn paid the full pFlash drafter-forward (~400s/session
at 59K) and re-expanded the prompt — making COMPOSE ~1.9x slower than the plain
Luce-Org#364 scoped disk cache it should improve on.

- Gate FlowKV on the original prompt size (same threshold as the pFlash gate),
  and skip it once pFlash has already compressed.
- Below threshold COMPOSE is byte-identical to Luce-Org#364 (full prefix-cache hits, no
  drafter tax); compression fires only when the conversation can't fit the KV.
- Keep the scoped-disk-re-prefill skip under compression (avoids turn-2 hang).

Validated on abc_cache_harness COMPOSE arm (auto, threshold=65000): goldgate_fix
total wall 846s -> 480s (~Luce-Org#364's 443s), zero compression on sub-threshold turns.
Activate via --prefill-compression auto --prefill-threshold ~max_ctx.
…g-42 tail-capture guard

ee7 truncates drafter forward at layer 7 of 28, scoring only those layers.
9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter).
Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF).
Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}.

5 unit tests included. Bench scripts split to follow-up PR.
…g#364 scoped save

47081e67 demoted FlowKV to a downstream else-if after whole-prompt pFlash,
gated on the same threshold — making FlowKV structurally unreachable (any
threshold that let it run made pFlash fire first; PFLASH_FREEZE_HISTORY went
dead). Replace with the unified gate (compute should_compress once; route
continuations to FlowKV-freeze with should_compress=false; whole-prompt pFlash
only for cold non-continuations), mirroring the working flowkv-standalone
structure. Re-enable Luce-Org#364's scoped disk save under compression (drop the
band-aid guard; the disk-clamp already pins the save to the stable system_end
prefix).

Paired A/B, same binary (cb458145), full 7-turn goldgate_fix, single-session:
COMPOSE_FLOWKV 615.9s vs pure-Luce-Org#364 713.7s (1.16x), decode 13.6 vs 6.7 tps,
tool-valid 85.7% vs 71.4%. FlowKV engages on continuations; ee7 keeps the
drafter forward cheap. Turn-4 transition cost (park/unpark + uncached
compressed-prefill) is the remaining lever, not the gate.

Resident drafter (~2GB) starves the target's large prefill on 24GB cards
(370 -> 121 tok/s on the freeze transition turn). Release after scoring,
lazy reload next turn (~2s). N=3 interleaved: 527.5s -> 306.7s (1.72x),
turn-4 prefill 217-269s -> 66-73s, quality held. persistent remains the
big-card opt-out.
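
The release-then-lazy-reload behaviour can be sketched as a small owning slot. This is an illustration, not the code in placement/draft_residency.h: `Drafter`, `DrafterSlot`, and the loader callback are hypothetical, and only the two mode names (auto, persistent) come from the PR.

```cpp
#include <functional>
#include <memory>
#include <utility>

enum class DraftResidency { Auto, Persistent };

// Stand-in for the ~2GB scoring drafter; the real type is not shown in the PR.
struct Drafter {};

class DrafterSlot {
public:
    DrafterSlot(DraftResidency mode, std::function<std::unique_ptr<Drafter>()> loader)
        : mode_(mode), loader_(std::move(loader)) {}

    // Load on first use, and again after a release (the lazy reload).
    Drafter& acquire() {
        if (!drafter_) drafter_ = loader_();
        return *drafter_;
    }

    // Call after the scoring pass, before the target's large prefill.
    void on_scoring_done() {
        if (mode_ == DraftResidency::Auto) drafter_.reset();  // give the memory back
    }

    bool resident() const { return drafter_ != nullptr; }

private:
    DraftResidency mode_;
    std::function<std::unique_ptr<Drafter>()> loader_;
    std::unique_ptr<Drafter> drafter_;
};
```

In `Auto` mode each scoring pass pays one reload (reported as about 2s in this PR) in exchange for freeing the drafter's memory during prefill; `Persistent` never releases, which matches the opt-out for larger cards.
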
…them

Ingress gate rejected prompt+max_tokens > max_ctx before compression ran,
making >max_ctx sessions unreachable even when FlowKV/pFlash could shrink
them. Extract pure should_reject_oversized() (admission.h): pass oversized
requests through when compression will run; enforce the hard limit on the
post-compress effective size in worker_loop. Oversized requests now get
compressed first and reject cleanly only if still over budget.

-133 net LOC, comments only — zero logic/string/assertion changes.
All suites re-verified green (1926 asserts + 4 standalone tests).

Dual-resident target+draft fragments VMM virtual address space; at
max_ctx=131072 the compute pool's cuMemSetAccess fails (device not
ready). Safe cell (<=65536, 10+ clean runs) keeps the fast no-park
path; dangerous cell parks. Note: GGML_CUDA_NO_VMM=1 env is compile-
time-only in this fork and never mitigated this.
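
The guard condition stated in this PR (downgrade `--prefill-skip-park` on GPUs under 32GB when `max_ctx` exceeds 65536) reduces to a pure predicate. In the sketch below the function name and signature are assumptions, as is reading "32GB" as 32 GiB of reported total memory; the real helper in placement/skip_park_guard.h may use a different cutoff.

```cpp
#include <cstdint>

// Thresholds from the PR text. Treating 32GB as 32 GiB is an assumption.
constexpr std::uint64_t kBigCardBytes = 32ull * 1024 * 1024 * 1024;
constexpr int kSafeMaxCtx = 65536;

// True when a requested skip-park should be downgraded to the parking path.
// Hypothetical name and signature, for illustration only.
bool should_downgrade_skip_park(std::uint64_t total_vram_bytes, int max_ctx) {
    return total_vram_bytes < kBigCardBytes && max_ctx > kSafeMaxCtx;
}
```

A 24GB card at `max_ctx` 131072 is downgraded, while the same card at 65536 keeps the fast no-park path.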

@cubic-dev-ai cubic-dev-ai Bot left a comment


8 issues found across 24 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/server/freeze_history.h">

<violation number="1" location="server/src/server/freeze_history.h:10">
P3: Unused include: `<vector>` is not used by any declaration in this header. Remove it to keep dependencies minimal.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:541">
P2: Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.</violation>
</file>

<file name="server/src/server/http_server.cpp">

<violation number="1" location="server/src/server/http_server.cpp:1904">
P1: FlowKV message-part extraction reads unvalidated JSON as strings; non-string `text` values can throw uncaught exceptions in the worker loop.</violation>
</file>

<file name="server/src/qwen3/anchor_scan.cpp">

<violation number="1" location="server/src/qwen3/anchor_scan.cpp:27">
P1: `search_end` clamping to 0 causes one invalid n-gram comparison when `body_end < ngram`, risking out-of-bounds reads and boundary violations.</violation>

<violation number="2" location="server/src/qwen3/anchor_scan.cpp:103">
P1: Transitive cascade loop exits early due to comparing `forced` against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.</violation>
</file>

<file name="server/src/qwen3/anchor_scan.h">

<violation number="1" location="server/src/qwen3/anchor_scan.h:18">
P2: max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.</violation>
</file>


{
const int n_chunks = (int)forced.size();
const int ngram = cfg.ngram;
const int search_end = std::max(0, body_end - ngram);


P1: search_end clamping to 0 causes one invalid n-gram comparison when body_end < ngram, risking out-of-bounds reads and boundary violations.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 27:

<comment>`search_end` clamping to 0 causes one invalid n-gram comparison when `body_end < ngram`, risking out-of-bounds reads and boundary violations.</comment>

<file context>
@@ -0,0 +1,164 @@
+{
+    const int n_chunks = (int)forced.size();
+    const int ngram    = cfg.ngram;
+    const int search_end = std::max(0, body_end - ngram);
+
+    for (int qi = 0; qi + ngram <= (int)query_pool.size(); ++qi) {
</file context>
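
One way to avoid the invalid comparison this finding describes is to return before the scan whenever the body is shorter than one n-gram, rather than clamping the end index to 0. The helper below is a hypothetical sketch of that bound handling and is not the code in anchor_scan.cpp; its name, parameters, and return convention are assumptions.

```cpp
#include <vector>

// Returns the first start position in body[0, body_end) where the n-gram
// beginning at query[qi] matches entirely inside the body, or -1.
// Hypothetical helper, for illustration only.
int find_ngram(const std::vector<int>& body, int body_end,
               const std::vector<int>& query, int qi, int ngram) {
    if (ngram <= 0) return -1;
    if (body_end > static_cast<int>(body.size())) body_end = static_cast<int>(body.size());
    if (body_end < ngram) return -1;                    // no valid start position at all
    if (qi < 0 || qi + ngram > static_cast<int>(query.size())) return -1;
    const int last_start = body_end - ngram;            // inclusive, >= 0 here
    for (int bi = 0; bi <= last_start; ++bi) {
        int k = 0;
        while (k < ngram && body[bi + k] == query[qi + k]) ++k;
        if (k == ngram) return bi;
    }
    return -1;
}
```

With the early return in place, `last_start` is never produced by clamping, so position 0 is compared only when a full n-gram fits before `body_end`.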

// Cascade loop: expand pool with tokens from newly-forced chunks and re-scan.
std::vector<uint8_t> prev_forced;
for (int it = 0; it < max_iters; ++it) {
prev_forced = forced;


P1: Transitive cascade loop exits early due to comparing forced against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 103:

<comment>Transitive cascade loop exits early due to comparing `forced` against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.</comment>

<file context>
@@ -0,0 +1,164 @@
+    // Cascade loop: expand pool with tokens from newly-forced chunks and re-scan.
+    std::vector<uint8_t> prev_forced;
+    for (int it = 0; it < max_iters; ++it) {
+        prev_forced = forced;
+
+        // Rare-token worklist: catches multi-hop cascades within a single outer iteration.
</file context>
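
A fixed-point loop of the kind this finding concerns takes its snapshot before each pass and compares only after the pass has run. The sketch below illustrates that shape under assumptions: `run_cascade` and the `expand_and_rescan` callback are hypothetical names, and the PR's actual loop body is not shown beyond the lines quoted above.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Runs expand_and_rescan until the forced set stops changing or max_iters is hit.
// Returns the number of passes that changed something.
int run_cascade(std::vector<std::uint8_t>& forced, int max_iters,
                const std::function<void(std::vector<std::uint8_t>&)>& expand_and_rescan) {
    int productive = 0;
    for (int it = 0; it < max_iters; ++it) {
        const std::vector<std::uint8_t> before = forced;  // snapshot before the pass
        expand_and_rescan(forced);                        // may set more entries to 1
        if (forced == before) break;                      // compare after the pass
        ++productive;
    }
    return productive;
}
```

If the comparison ran before any work had been done on `forced`, the two vectors would always be equal and the loop would stop after its first iteration, which is the early exit the reviewer describes.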

Comment thread server/src/server/http_server.cpp Outdated
const std::string ptype = part.value("type", "");
if (ptype == "text" || ptype == "input_text" ||
ptype == "output_text")
msg_content += part.value("text", "");


P1: FlowKV message-part extraction reads unvalidated JSON as strings; non-string text values can throw uncaught exceptions in the worker loop.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/http_server.cpp, line 1904:

<comment>FlowKV message-part extraction reads unvalidated JSON as strings; non-string `text` values can throw uncaught exceptions in the worker loop.</comment>

<file context>
@@ -1798,6 +1808,233 @@ void HttpServer::worker_loop() {
+                                        const std::string ptype = part.value("type", "");
+                                        if (ptype == "text" || ptype == "input_text" ||
+                                            ptype == "output_text")
+                                            msg_content += part.value("text", "");
+                                    }
+                                }
</file context>
Suggested change
msg_content += part.value("text", "");
if (part.contains("text") && part["text"].is_string()) msg_content += part["text"].get<std::string>();

{
size_t total_vram = 0;
int dev = 0;
cudaGetDevice(&dev);


P2: Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 541:

<comment>Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.</comment>

<file context>
@@ -534,7 +535,22 @@ bool Qwen35Backend::handle_compress(const std::string & line, const DaemonIO & i
+    {
+        size_t total_vram = 0;
+        int dev = 0;
+        cudaGetDevice(&dev);
+        cudaDeviceProp prop{};
+        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess)
</file context>

Comment thread server/src/server/disk_prefix_cache.h
int ngram = 4;
int rare_token_max_freq = 8; // tokens appearing <= this many times in body count as rare
int cascade_min_anchor_count = 0; // skip cascade if pass-1 forced >= this many chunks (0 = always cascade)
int max_forced_count = INT_MAX; // hard cap on total forced chunks


P2: max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.h, line 18:

<comment>max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.</comment>

<file context>
@@ -0,0 +1,42 @@
+    int ngram = 4;
+    int rare_token_max_freq = 8;        // tokens appearing <= this many times in body count as rare
+    int cascade_min_anchor_count = 0;   // skip cascade if pass-1 forced >= this many chunks (0 = always cascade)
+    int max_forced_count = INT_MAX;     // hard cap on total forced chunks
+};
+
</file context>

Comment thread server/src/server/freeze_history.h
…ze in post-compress gate

Two confirmed PR-review findings:
- request-level prefix_cache.scope override replaced the whole policy,
  silently dropping the server-level compress flag (FlowKV disabled for
  any client sending an explicit scope)
- post-compress context gate used the raw prompt size on pflash
  full-cache hits, falsely 400ing oversized repeats served from cached
  compressed state

Both extracted to pure helpers (apply_request_scope_override,
effective_prompt_overflows) with failing-test-first coverage.
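
The first finding is a whole-struct overwrite where a field-wise merge was intended. The sketch below illustrates the merge; `apply_request_scope_override` and `DiskPrefixCachePolicy::compress` are named in this PR, but the `scope` field's type and the function signature are assumptions.

```cpp
#include <optional>
#include <string>

// Field set is an assumption; only `compress` is named in the PR.
struct DiskPrefixCachePolicy {
    std::string scope;      // hypothetical scope identifier
    bool compress = false;  // server-level FlowKV switch
};

// Merge a per-request scope into the server policy without touching other fields.
DiskPrefixCachePolicy apply_request_scope_override(
        DiskPrefixCachePolicy server_policy,
        const std::optional<std::string>& request_scope) {
    if (request_scope) server_policy.scope = *request_scope;
    return server_policy;  // compress and any other server flag are preserved
}
```

Taking the server policy by value and changing one field makes it impossible for a client-supplied scope to reset `compress` as a side effect.
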
@dusterbloom


Review disposition (all 8 cubic findings verified against code + tests before fixing):

| finding | verdict | action |
|---|---|---|
| http_server.cpp:2175 raw size on full-cache hits | confirmed | fixed in e542e90 (effective_prompt_overflows helper, failing-test-first) |
| disk_prefix_cache.h:50 scope override drops compress | confirmed | fixed in e542e90 (apply_request_scope_override, failing-test-first) |
| http_server.cpp:1904 JSON throw | refuted | json::value(key, default) is type-safe (returns default on type mismatch, no throw); the .get<std::string>() at :1897 is guarded by is_string() |
| anchor_scan.cpp:27 / :103 / anchor_scan.h:18 | confirmed in the utility library | not reachable from production (the shipping drafter uses its own inline scan; these functions are exercised by tests only). Fixes queued as a follow-up batch rather than blocking this PR |
| qwen35_backend.cpp:541 current-device VRAM query | partial | latent multi-GPU-only issue; single-GPU is the only shipped config. 1-line fix queued with the follow-up batch |
| freeze_history.h:10 unused include | confirmed | queued with follow-up batch |

Suite after fixes: 1939 assertions green; admission standalone 12/12.

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 6 files (changes from recent commits).


Comment thread server/src/server/http_server.cpp Outdated
0 conflated 'no hit' with a zero-length hit; sentinel is now -1 and the
gate treats any >=0 value as served-from-cache.
@dusterbloom


cubic P2 (http_server.cpp:1798 sentinel conflation): a zero-length full-cache entry is not constructible today (admission floor + kept anchors guarantee non-empty compressed prompts), but the conflation class is now removed outright — sentinel is -1, gate treats >=0 as a hit. Failing-test-first (test_effective_overflows_zero_length_hit_is_a_hit), suite 13/13 + 1939 asserts green. Fixed in 26e0ee3.
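
The sentinel change can be illustrated with a small pure function. `effective_prompt_overflows` is named in this PR, but the signature below is an assumption: -1 stands for "no full-cache hit" and any value of 0 or more is taken as the size of the cached compressed prompt being served.

```cpp
// served_from_cache_tokens: -1 means no full-cache hit; any value >= 0 is the
// size of the cached compressed prompt. Signature is an assumption.
bool effective_prompt_overflows(int prompt_tokens, int served_from_cache_tokens,
                                int max_tokens, int max_ctx) {
    const bool cache_hit = served_from_cache_tokens >= 0;
    const long long effective = cache_hit ? served_from_cache_tokens : prompt_tokens;
    return effective + max_tokens > max_ctx;
}
```

With 0 as the sentinel, a zero-length hit would have been indistinguishable from a miss and the larger `prompt_tokens` value would have been used for the check; with -1 the two cases can no longer collide.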

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 12, 2026
…ter residency fix

Keep the current stack's qwen3 helper/test implementations where the PR overlapped, while taking the PR's server-side admission, skip-park, HTTP/server wiring, and test additions.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 12, 2026
Record the PR Luce-Org#372 integration, current head, and updated open-PR accounting.

Picks up llama.cpp-dflash-ggml#16: cuMemSetAccess retries on NOT_READY
during VMM pool growth instead of aborting. Removes the >19GB-load
crash class (q8_0/f16 KV at 65K, 131K reservations); verified 7/7 on
the goldgate replay where the previous pointer crashed turn-1.
@davide221 davide221 merged commit 7500f00 into Luce-Org:main Jun 12, 2026
11 of 13 checks passed
Rhonstin added a commit to Rhonstin/lucebox-hub that referenced this pull request Jun 14, 2026
Brings KVFlash bounded KV residency (Luce-Org#373), spec-decode budget-hook fixes
(Luce-Org#379), FlowKV multi-turn prefill (Luce-Org#372), oversized-prompt admission, the
fa-window tool-call warning (Luce-Org#378), and the llama.cpp bump to 574be613.

Conflict resolutions:
- submodule: rebased our pool_trim commit onto 574be613 -> 07cee1dce.
- qwen35_backend.cpp: kept cache_max_verify_tokens_ alongside the KVFlash
  pool-budget block; gated tree verify off when the KVFlash pager is active
  (slot-space masks are not tree-position aware); Luce-Org#379 budget-hook fix
  merged cleanly into the spec-decode tail-off.
- http_server.cpp: kept our admission gate (clamps max_tokens for
  fixed-budget clients, tool requests bypass compression) over main's
  should_reject_oversized — ours subsumes the admit-when-compressible case.

Regression: chunked delta-net parity (CPU+CUDA) and the UTF-8 split-token
test both pass post-merge.