Commit e2310b5
committed
extension/llm/server: token-ID prompt segments for tool-use resume (V2b.1.5)
Warm resume (V2b.1) misses on agent loops: an assistant turn re-rendered from
its parsed tool call almost never re-tokenizes to the tokens the model actually
generated, so the resident state isn't an exact prefix and the worker resets.
On BFCL multi_turn the warm-resume hit rate was 0%.
Fix: carry the exact tokens instead of re-deriving them from text. The worker
returns generated_token_ids on `done` and accepts a `prompt_segments` form of
the prompt -- an ordered list of {"text"} chunks to tokenize and {"ids"} runs of
literal token ids (mutually exclusive with the plain `prompt` string); the
WorkerClient/SessionRuntime transport for that form was introduced with the
SessionRuntime boundary, and this commit makes the worker assemble and emit it.
The adapter-specific transcript glue lives in a new module, openai_transcript.py
(OpenAITranscriptState): it stores one record per assistant turn ({fingerprint,
ids, generation preamble}) and, on the next request, rebuilds the prompt as
segments -- each prior assistant turn is replaced with a unique sentinel, the
conversation is rendered once, and the rendered text is split on the sentinels
with the stored ids spliced back in. Tool results stay text (they re-tokenize
deterministically). This logic is the OpenAI adapter's concern, not the
runtime's: SessionRuntime only sees a PromptInput (text or segments).
The splice also reproduces the deterministic generation scaffold the worker
prefills into resident KV. Qwen3's template appends a scaffold after the
assistant header (no-think: `<think>\n\n</think>\n\n`; thinking: `<think>\n`),
then strips it when re-rendering an assistant turn that precedes the last user
message -- so without this, the resident state carried scaffold tokens the next
prompt lacked and ordinary multi-turn chat reset (only tool-call turns, whose
think the template preserves, ever hit exact_prefix). ChatTemplate.
generation_preamble derives that scaffold for the request's mode; it is recorded
per turn (so a mid-session enable_thinking switch still reproduces each turn's
resident scaffold), and the segment assembly normalizes the scaffold region
before each spliced run to the recorded preamble: it inserts the scaffold where
history stripped it and replaces it where history preserved a different form
(after the last user the template keeps the empty block, which a naive append
would double-insert), falling back to text on an unrecognized region. Ordinary
multi-turn chat now warm-resumes too. This is adapter-only -- no worker, runtime,
or protocol change.
Splicing is guarded so stale ids are never injected: a turn is substituted only
when the incoming assistant message fingerprint-matches the response we returned
(the fingerprint canonicalizes each tool call's JSON arguments before hashing, so
a client that reserializes them with different whitespace or key order -- the
same value -- still matches and resumes, rather than looking like an edited turn;
an edited or branched history, or a session reused for another conversation ->
text fallback; splicing stops at the first divergence and the now-stale tail is
pruned, and a regenerated turn is recorded at its position so it replaces the
stale record instead of shadowing later hits), and only when its ids faithfully
decode to what the client saw -- a stop-string trim kept post-stop tokens
resident but dropped them from the output, so the worker omits the ids and the
turn is re-rendered as text. Sentinel collisions / dropped sentinels also fall
back to text, and the worker's exact-token prefix check backstops the rest.
The context-window preflight counts what the worker actually assembles: for a
segment prompt it sums the literal {ids} run lengths and the tokenized {text}
chunks (not the rendered string), so a near-limit request agrees with the worker
rather than false-rejecting or failing mid-decode.
On BFCL multi_turn (per-conversation sessions) this moves the reuse fraction
from 0% to ~50% (exact_prefix hits where there were none); with the scaffold
reproduction, ordinary multi-turn chat reaches exact_prefix on every append
turn rather than re-prefilling the whole prompt. The single-turn AST suite is
unchanged (no prior assistant turn -> plain text prompt).
Review order: worker_loop.h (segment assembly + faithful generated_token_ids);
then the control plane (the new openai_transcript.py store + fingerprint-guarded
sentinel rendering + per-turn generation-scaffold normalization, chat_template.
generation_preamble, and the serving_chat wiring that builds the segments,
threads the preamble, and counts segments for the context preflight); then tests
and docs.
This change was authored with Claude Code.
Part of #20001
ghstack-source-id: 861cb67
ghstack-comment-id: 4661784137
Pull-Request: #201611 parent cbf8def commit e2310b5
10 files changed
Lines changed: 1678 additions & 24 deletions
File tree
- examples/models/qwen3_5_moe
- extension/llm/server
- cpp
- python
- tests
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
243 | 243 | | |
244 | 244 | | |
245 | 245 | | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
246 | 256 | | |
247 | 257 | | |
248 | 258 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
95 | 95 | | |
96 | 96 | | |
97 | 97 | | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
0 commit comments