Skip to content

Commit e2310b5

Browse files
committed
extension/llm/server: token-ID prompt segments for tool-use resume (V2b.1.5)
Warm resume (V2b.1) misses on agent loops: an assistant turn re-rendered from its parsed tool call almost never re-tokenizes to the tokens the model actually generated, so the resident state isn't an exact prefix and the worker resets. On BFCL multi_turn the warm-resume hit rate was 0%. Fix: carry the exact tokens instead of re-deriving them from text. The worker returns generated_token_ids on `done` and accepts a `prompt_segments` form of the prompt -- an ordered list of {"text"} chunks to tokenize and {"ids"} runs of literal token ids (mutually exclusive with the plain `prompt` string); the WorkerClient/SessionRuntime transport for that form was introduced with the SessionRuntime boundary, and this commit makes the worker assemble and emit it. The adapter-specific transcript glue lives in a new module, openai_transcript.py (OpenAITranscriptState): it stores one record per assistant turn ({fingerprint, ids, generation preamble}) and, on the next request, rebuilds the prompt as segments -- each prior assistant turn is replaced with a unique sentinel, the conversation is rendered once, and the rendered text is split on the sentinels with the stored ids spliced back in. Tool results stay text (they re-tokenize deterministically). This logic is the OpenAI adapter's concern, not the runtime's: SessionRuntime only sees a PromptInput (text or segments). The splice also reproduces the deterministic generation scaffold the worker prefills into resident KV. Qwen3's template appends a scaffold after the assistant header (no-think: `<think>\n\n</think>\n\n`; thinking: `<think>\n`), then strips it when re-rendering an assistant turn that precedes the last user message -- so without this, the resident state carried scaffold tokens the next prompt lacked and ordinary multi-turn chat reset (only tool-call turns, whose think the template preserves, ever hit exact_prefix). ChatTemplate. generation_preamble derives that scaffold for the request's mode; it is recorded per turn (so a mid-session enable_thinking switch still reproduces each turn's resident scaffold), and the segment assembly normalizes the scaffold region before each spliced run to the recorded preamble: it inserts the scaffold where history stripped it and replaces it where history preserved a different form (after the last user the template keeps the empty block, which a naive append would double-insert), falling back to text on an unrecognized region. Ordinary multi-turn chat now warm-resumes too. This is adapter-only -- no worker, runtime, or protocol change. Splicing is guarded so stale ids are never injected: a turn is substituted only when the incoming assistant message fingerprint-matches the response we returned (the fingerprint canonicalizes each tool call's JSON arguments before hashing, so a client that reserializes them with different whitespace or key order -- the same value -- still matches and resumes, rather than looking like an edited turn; an edited or branched history, or a session reused for another conversation -> text fallback; splicing stops at the first divergence and the now-stale tail is pruned, and a regenerated turn is recorded at its position so it replaces the stale record instead of shadowing later hits), and only when its ids faithfully decode to what the client saw -- a stop-string trim kept post-stop tokens resident but dropped them from the output, so the worker omits the ids and the turn is re-rendered as text. Sentinel collisions / dropped sentinels also fall back to text, and the worker's exact-token prefix check backstops the rest. The context-window preflight counts what the worker actually assembles: for a segment prompt it sums the literal {ids} run lengths and the tokenized {text} chunks (not the rendered string), so a near-limit request agrees with the worker rather than false-rejecting or failing mid-decode. On BFCL multi_turn (per-conversation sessions) this moves the reuse fraction from 0% to ~50% (exact_prefix hits where there were none); with the scaffold reproduction, ordinary multi-turn chat reaches exact_prefix on every append turn rather than re-prefilling the whole prompt. The single-turn AST suite is unchanged (no prior assistant turn -> plain text prompt). Review order: worker_loop.h (segment assembly + faithful generated_token_ids); then the control plane (the new openai_transcript.py store + fingerprint-guarded sentinel rendering + per-turn generation-scaffold normalization, chat_template. generation_preamble, and the serving_chat wiring that builds the segments, threads the preamble, and counts segments for the context preflight); then tests and docs. This change was authored with Claude Code. Part of #20001 ghstack-source-id: 861cb67 ghstack-comment-id: 4661784137 Pull-Request: #20161
1 parent cbf8def commit e2310b5

10 files changed

Lines changed: 1678 additions & 24 deletions

File tree

examples/models/qwen3_5_moe/README.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -243,6 +243,16 @@ Each `done` event reports
243243
(`new`/`exact_prefix`/`dirty`/`mismatch`/`equal`) for measuring the hit rate.
244244
`--no-warm-resume` forces a full prefill every request (for A/B comparison).
245245

246+
**Tool-call turns (token-ID continuation):** an assistant turn re-rendered from
247+
its parsed tool call rarely re-tokenizes to the tokens the model actually
248+
generated, so plain warm resume misses on agent loops. The server stores the
249+
exact generated token ids per session and, on the next turn, sends the prompt as
250+
segments (`{"text"}` / `{"ids"}`) that splice those ids back in for prior
251+
assistant turns instead of re-rendering them — so the resident state stays an
252+
exact token prefix and resume hits. Tool *results* remain text (re-tokenized
253+
deterministically). The worker's exact-token check still backstops everything, so
254+
a mismatch just falls back to a full prefill.
255+
246256
This is **isolation + warm resume, not concurrency**: execution is still
247257
synchronous (one in-flight request; `--num-runners > 1` is rejected since more
248258
workers would duplicate the weights). Fair interleaving across in-flight requests

extension/llm/server/cpp/CMakeLists.txt

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -95,3 +95,14 @@ target_include_directories(
9595
test_worker_prefill_plan PUBLIC ${_common_include_directories}
9696
)
9797
add_test(NAME worker_prefill_plan COMMAND test_worker_prefill_plan)
98+
99+
# Worker-loop harness (worker_handle_request + WorkerSessions) driven by a
100+
# scriptable fake LLMSession/Tokenizer/LLMEngine -- no model/GPU. It includes
101+
# the full worker_loop.h, so it needs the JSON include + the runtime/tokenizer
102+
# libs.
103+
add_executable(test_worker_loop test_worker_loop.cpp)
104+
target_include_directories(
105+
test_worker_loop PUBLIC ${_common_include_directories} ${_json_include}
106+
)
107+
target_link_libraries(test_worker_loop PUBLIC ${link_libraries})
108+
add_test(NAME worker_loop COMMAND test_worker_loop)

0 commit comments

Comments
 (0)