Commit 22dd42e

[INITIAL] Update
[ghstack-poisoned]
1 parent b971abb commit 22dd42e

23 files changed

Lines changed: 250 additions & 245 deletions

backends/cuda/runtime/cuda_mutable_state.cpp

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -398,7 +398,7 @@ void mutable_state_set_active(MutableStateContext ctx, int token) {
398398
void mutable_state_note_handle(CudaDelegateHandle* handle) {
399399
MutableStateContext ctx = tl_loading_ctx;
400400
if (ctx == kInvalidMutableContext) {
401-
return; // not loading within a managed context (e.g. non-V2 path)
401+
return; // not loading within a managed context (single-session path)
402402
}
403403
auto& m = mgr();
404404
std::lock_guard<std::mutex> g(m.mu);

examples/models/qwen3_5_moe/README.md

Lines changed: 15 additions & 16 deletions
@@ -161,9 +161,12 @@ LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH \
161161
--data-path qwen35_moe_exports/aoti_cuda_blob.ptd \
162162
--tokenizer-path ~/models/Qwen3.5-35B-A3B/tokenizer.json \
163163
--hf-tokenizer ~/models/Qwen3.5-35B-A3B \
164-
--model-id qwen3.5-moe --no-think
164+
--model-id qwen3.5-moe --no-think --max-sessions 4
165165
```
166166

167+
`--max-sessions >= 2` is required for named sessions and warm resume; the default
168+
`1` is scratch-only (one slot is reserved for anonymous requests).
169+
167170
### Architecture (process isolation)
168171

169172
Two processes, one model load:
@@ -202,16 +205,16 @@ is safe under asyncio.
202205
### Sessions
203206

204207
One worker loads the weights once (~18 GB) and hosts multiple **isolated**
205-
sessions on that single allocation, each with its own KV/recurrent state, via
206-
CUDA per-session mutable rebinding. Set `--max-sessions N` (clamped to 1 if the
207-
backend cannot rebind); one slot is reserved for anonymous requests, so up to
208-
`N - 1` named `session_id`s are addressable.
208+
sessions on that single allocation, each with its own KV/recurrent state. Set
209+
`--max-sessions N` (clamped to 1 if the backend hosts a single session); one slot
210+
is reserved for anonymous requests, so up to `N - 1` named `session_id`s are
211+
addressable.
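
The slot arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `named_session_capacity` and the `backend_can_rebind` flag are hypothetical, not part of the server or worker API, and treating a non-positive `--max-sessions` as 1 is an assumption.

```python
def named_session_capacity(max_sessions: int, backend_can_rebind: bool) -> int:
    """Return how many named session_ids are addressable (illustrative only)."""
    # Clamp to a single session when the backend hosts only one.
    # Assumption: a non-positive request is treated as 1.
    effective = max(max_sessions, 1) if backend_can_rebind else 1
    # One slot is reserved for the anonymous scratch session.
    return effective - 1


print(named_session_capacity(4, True))   # named slots for --max-sessions 4
print(named_session_capacity(1, True))   # the default is scratch-only
print(named_session_capacity(4, False))  # clamped to one session
```

With `--max-sessions 4` on a backend that hosts multiple sessions, this leaves three named slots plus the scratch slot.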
209212

210213
Route a request to a persistent session with the `session_id` body field or, as
211214
aliases, the `X-ExecuTorch-Session-ID` / `session_id` / `x-session-affinity`
212215
headers (body wins, then that header order). The header aliases let a client that
213-
already emits a stable per-conversation affinity id (e.g. pi's
214-
`sendSessionAffinityHeaders`) route with no extra config. Requests without any
216+
emits a stable per-conversation affinity id route per conversation (for pi, set
217+
`compat.sendSessionAffinityHeaders: true` in models.json). Requests without any
215218
share a transient scratch session.
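
The precedence described above (body first, then the three headers in order) can be sketched as follows. This is a minimal illustration, not the server's actual code: `resolve_session_id` and `HEADER_ORDER` are hypothetical names, and treating an empty value as absent is an assumption.

```python
from typing import Mapping, Optional

# Header aliases in the documented precedence order (HTTP header names are
# case-insensitive, so they are compared in lowercase).
HEADER_ORDER = ("x-executorch-session-id", "session_id", "x-session-affinity")


def resolve_session_id(
    body: Mapping[str, object], headers: Mapping[str, str]
) -> Optional[str]:
    """Pick the session id for a request, or None for the scratch session."""
    body_id = body.get("session_id")
    if isinstance(body_id, str) and body_id:
        return body_id  # the body field wins over every header
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in HEADER_ORDER:
        value = lowered.get(name)
        if value:  # assumption: an empty header value counts as absent
            return value
    return None  # anonymous request: shares the transient scratch session


headers = {"X-Session-Affinity": "conv-7", "X-ExecuTorch-Session-ID": "conv-1"}
print(resolve_session_id({}, headers))
print(resolve_session_id({"session_id": "from-body"}, headers))
```

A client that only sends `x-session-affinity` still gets per-conversation routing, which is the case the header aliases exist for.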
216219

217220
```bash
@@ -243,15 +246,11 @@ Each `done` event reports
243246
(`new`/`exact_prefix`/`dirty`/`mismatch`/`equal`) for measuring the hit rate.
244247
`--no-warm-resume` forces a full prefill every request (for A/B comparison).
245248

246-
**Tool-call turns (token-ID continuation):** an assistant turn re-rendered from
247-
its parsed tool call rarely re-tokenizes to the tokens the model actually
248-
generated, so plain warm resume misses on agent loops. The server stores the
249-
exact generated token ids per session and, on the next turn, sends the prompt as
250-
segments (`{"text"}` / `{"ids"}`) that splice those ids back in for prior
251-
assistant turns instead of re-rendering them — so the resident state stays an
252-
exact token prefix and resume hits. Tool *results* remain text (re-tokenized
253-
deterministically). The worker's exact-token check still backstops everything, so
254-
a mismatch just falls back to a full prefill.
249+
**Tool-call turns** also warm-resume: an assistant turn re-rendered from its
250+
parsed tool call rarely re-tokenizes to the tokens the model generated, so the
251+
server replays the exact generated token ids for prior turns to keep the resident
252+
state an exact prefix (tool *results* stay text). A mismatch still falls back to a
253+
full prefill.
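
As a rough Python sketch of the exact-token prefix check behind warm resume: the worker's real logic lives in C++ (`plan_prefill()` in `worker_loop.h`), so `plan_prefill_sketch`, its signature, the order of the checks, and the reading of each `session_reset_reason` label (in particular `equal`) are assumptions made for illustration only.

```python
from typing import List, Tuple


def plan_prefill_sketch(
    resident: List[int], prompt: List[int], dirty: bool = False
) -> Tuple[str, int, List[int]]:
    """Return (session_reset_reason, reused_prompt_tokens, tokens_to_prefill)."""
    if not resident:
        return ("new", 0, prompt)  # nothing resident yet: full prefill
    if dirty:
        return ("dirty", 0, prompt)  # e.g. the last turn was stop-trimmed
    n = len(resident)
    if prompt[:n] != resident:
        return ("mismatch", 0, prompt)  # not a token extension: reset
    if len(prompt) == n:
        # Assumption: an identical prompt leaves no suffix to prefill, so
        # this sketch falls back to a full prefill.
        return ("equal", 0, prompt)
    return ("exact_prefix", n, prompt[n:])  # prefill only the new suffix


print(plan_prefill_sketch([1, 2, 3], [1, 2, 3, 4, 5]))
print(plan_prefill_sketch([1, 2, 3], [1, 9, 3, 4]))
```

The comparison is on token ids, never on re-tokenized text, which is why replaying the generated ids for prior tool-call turns keeps the resident state an exact prefix.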
255254

256255
This is **isolation + warm resume, not concurrency**: execution is still
257256
synchronous (one in-flight request; `--num-runners > 1` is rejected since more

examples/models/qwen3_5_moe/qwen35_moe_engine.cpp

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -95,8 +95,8 @@ Result<std::unique_ptr<Module>> build_qwen_module(
9595

9696
#ifdef EXECUTORCH_BUILD_CUDA
9797
// Backend options are read during backend init(), so they must be set before
98-
// load_method. (CUDA graph is intentionally not enabled: V2 rebinds each
99-
// session's mutable buffers before execute, which a captured graph's baked
98+
// load_method. (CUDA graph is intentionally not enabled: each session
99+
// rebinds its mutable buffers before execute, which a captured graph's baked
100100
// pointers would ignore.)
101101
{
102102
// Cross-method per-FQN weight sharing: prefill and decode reuse one weight
@@ -124,7 +124,7 @@ Error register_mutable_fqns(Module* module, int mutable_ctx) {
124124
ET_LOG(
125125
Error,
126126
"Qwen35MoEEngine: model has no get_mutable_buffer_metadata; re-export "
127-
"for V2 multi-session");
127+
"for multi-session");
128128
return res.error();
129129
}
130130
const auto& outs = res.get();
@@ -368,7 +368,7 @@ class Qwen35MoESession : public LLMSession {
368368
Error seek(int64_t pos) override {
369369
// The hybrid model carries recurrent/conv state that cannot be safely
370370
// rewound by logical position the way contiguous KV can. Fail closed so the
371-
// prefix cache falls back to reset + full prefill (V1).
371+
// prefix cache falls back to reset + full prefill.
372372
(void)pos;
373373
return Error::NotSupported;
374374
}

examples/models/qwen3_5_moe/qwen35_moe_engine.h

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@
1717
// isolated points are where an MLX runtime would slot in. MLX is NOT
1818
// implemented or validated here.
1919
//
20-
// V2 (CUDA): the ENGINE is multi-session — one shared Module (weights loaded
20+
// CUDA: the ENGINE is multi-session — one shared Module (weights loaded
2121
// once); create_session() hands out multiple logical sessions, each rebinding
2222
// its own GPU buffers for the model's mutable state (KV/conv/recurrent) before
2323
// execute, serialized by the engine lock. serving_capacity() reports how many
@@ -26,9 +26,9 @@
2626
// backends/cuda/runtime/cuda_mutable_state).
2727
//
2828
// The SERVING path (qwen3_5_moe_worker + control plane) exposes this over the
29-
// worker protocol: the worker routes requests to per-session_id state (V2a) and
29+
// worker protocol: the worker routes requests to per-session_id state and
3030
// reuses each session's resident context across requests (warm append-only
31-
// resume, V2b.1). Execution stays serialized (one in-flight request).
31+
// resume). Execution stays serialized (one in-flight request).
3232

3333
#pragma once
3434

@@ -53,7 +53,7 @@ struct Qwen35MoEConfig {
5353
std::string model_path; // .pte
5454
std::string data_path; // .ptd (CUDA delegate blob); empty if none
5555
std::string tokenizer_path; // HuggingFace tokenizer.json
56-
// V2 multi-session: max physical sessions to advertise when the backend can
56+
// Multi-session: max physical sessions to advertise when the backend can
5757
// host them without weight duplication (CUDA per-session mutable rebinding).
5858
// Clamped to 1 if the backend cannot rebind.
5959
int32_t max_sessions = 1;
@@ -74,7 +74,7 @@ class ET_EXPERIMENTAL Qwen35MoEEngine : public LLMEngine {
7474
::executorch::runtime::Result<std::unique_ptr<LLMSession>> create_session()
7575
override;
7676

77-
// CUDA V2: one shared Module (one weight allocation); each session rebinds
77+
// CUDA: one shared Module (one weight allocation); each session rebinds
7878
// its own GPU buffers for the model's mutable state. Reports
7979
// config.max_sessions when the backend supports per-session rebinding, else
8080
// fails closed to 1.

examples/models/qwen3_5_moe/qwen35_moe_worker.cpp

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -14,9 +14,8 @@
1414
// protocol and decode loop every worker uses (worker_loop.h); this file only
1515
// constructs the engine/session.
1616
//
17-
// Isolation rationale: executing the AOTI CUDA model inside a live asyncio HTTP
18-
// process segfaults in the int4 matmul (validated). Here the model runs in a
19-
// plain synchronous loop in its own process, which is reliable.
17+
// Model execution is isolated in this C++ worker for CUDA/AOTI reliability (see
18+
// the example README for the full rationale).
2019
//
2120
// Multi-session: the engine loads weights once and hosts multiple isolated
2221
// sessions on that one ~18GB allocation; the shared worker loop (worker_loop.h)

examples/models/qwen3_5_moe/serve.py

Lines changed: 3 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -12,10 +12,8 @@
1212
process (qwen3_5_moe_worker) that this process drives over JSONL via the generic
1313
WorkerClient — the same protocol the generic text_llm_worker speaks.
1414
15-
Why two processes: executing the AOTI CUDA model inside a live asyncio server
16-
process segfaults in the int4 matmul (validated by elimination — the trigger is
17-
CUDA execution while a live asyncio loop is resident). Isolating CUDA in a plain
18-
(no-asyncio) C++ worker process is the reliable shape, and it loads weights once.
15+
Model execution is isolated in the C++ worker for CUDA/AOTI reliability; the
16+
worker loads weights once. (See the example README for the full rationale.)
1917
2018
Sessions and constraints:
2119
* One worker hosts many isolated sessions on a single ~18GB weight load (CUDA
@@ -121,7 +119,7 @@ def _stop_worker():
121119

122120
def main() -> None:
123121
p = argparse.ArgumentParser(
124-
description="OpenAI-compatible LLM server for Qwen3.5 MoE (process-isolated, V1)"
122+
description="OpenAI-compatible LLM server for Qwen3.5 MoE (process-isolated)"
125123
)
126124
p.add_argument("--model-path", required=True, help="Path to the .pte model")
127125
p.add_argument(

examples/models/qwen3_5_moe/test_qwen35_moe_nobleed.cpp

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,11 +6,11 @@
66
* LICENSE file in the root directory of this source tree.
77
*/
88

9-
// GPU no-bleed integration proof for the CUDA V2 per-session mutable-state
9+
// GPU no-bleed integration proof for the CUDA per-session mutable-state
1010
// rebind -- the REAL guard for mutable-buffer completeness (an under-declared
1111
// buffer would be shared across sessions; only behavior catches that, not the
1212
// declared-subset-of-discovered bookkeeping check). This is the automated form
13-
// of the manual "A solo / A inter" proof in the V2 foundation commit.
13+
// of the manual "A solo / A inter" multi-session isolation proof.
1414
//
1515
// CRITICAL: sessions are interleaved at EXECUTE granularity (A prefill, B
1616
// prefill, A decode, B decode, ...). The mechanism under test is the

extension/llm/server/README.md

Lines changed: 9 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,11 @@ extension/llm/server/
1111
# cpp/ # future: no-Python single-binary server
1212
```
1313

14+
**Which entry point:** use `extension.llm.server.python.server` for generic
15+
TextLLM `.pte` models; use `examples.models.qwen3_5_moe.serve` for Qwen3.5-MoE
16+
CUDA (it needs the `.ptd` delegate blob, Qwen XML tool parsing, and the Qwen
17+
engine/session worker).
18+
1419
Why this layout: the OpenAI contract is identical across languages, so the
1520
**spec** and **conformance** suite are shared, and each language gets its own
1621
implementation directory. The real cross-language reuse comes from the C++
@@ -26,8 +31,8 @@ Hugging Face chat templates (`--hf-tokenizer`), `temperature` / `max_tokens` /
2631
(`<tool_call>...</tool_call>` JSON, complete calls only; model-specific launchers
2732
may select the Qwen XML format) with `tool_choice="none"`,
2833
structured API errors, and best-effort cancellation. One worker process with
29-
serialized execution; it hosts many isolated sessions on one weight load (warm
30-
append-only resume across turns). KV/prefix state lives inside the
34+
serialized execution; a worker can host isolated sessions on one weight load when its engine reports
35+
capacity > 1 (with warm append-only resume across turns). KV/prefix state lives inside the
3136
worker/session, not the control plane. Unsupported params (including `top_p`,
3237
`seed`, `n>1`, `reasoning_effort`, penalties, `logit_bias`, `response_format`,
3338
`logprobs`, and `tool_choice="required"`) are rejected with a structured 400
@@ -63,7 +68,8 @@ Point pi at the server via `~/.pi/agent/models.json`:
6368
```json
6469
{ "providers": { "executorch": {
6570
"baseUrl": "http://127.0.0.1:8000/v1", "api": "openai-completions",
66-
"apiKey": "x", "models": [ { "id": "<model-id>" } ] } } }
71+
"apiKey": "x", "models": [ { "id": "<model-id>",
72+
"compat": { "sendSessionAffinityHeaders": true } } ] } } }
6773
```
6874

6975
Other OpenAI-compatible clients use their own schema — generically: base URL

extension/llm/server/cpp/worker_loop.h

Lines changed: 50 additions & 59 deletions
Original file line numberDiff line numberDiff line change
@@ -8,67 +8,47 @@
88

99
#pragma once
1010

11-
// Shared model-worker generation loop + JSONL protocol, used by every model
12-
// worker (the generic text_llm_worker and model-specific workers like
13-
// qwen3_5_moe_worker). A worker only constructs its engine/tokenizer and calls
14-
// run_worker_stdio_loop(); the protocol, session management, and the decode
15-
// loop live here once, so protocol changes land in a single place.
11+
// Shared model-worker generation loop + JSONL protocol for every model worker
12+
// (the generic text_llm_worker and model-specific workers like
13+
// qwen3_5_moe_worker): a worker constructs its engine + tokenizer and calls
14+
// run_worker_stdio_loop(); the protocol, session routing, and decode loop live
15+
// here once.
1616
//
17-
// V2a (isolation): the worker owns one LLMEngine (weights loaded once) and
18-
// hands out multiple isolated LLMSessions keyed by session_id, each with its
19-
// own KV/recurrent state, up to the engine's serving capacity. Execution is
20-
// synchronous -- one in-flight request at a time, the control plane serializes.
17+
// The worker owns one LLMEngine (weights loaded once) and serves multiple
18+
// isolated LLMSessions keyed by session_id, up to the engine's serving
19+
// capacity; anonymous requests (no session_id) share one scratch session that
20+
// is reset every request. Execution is synchronous: one in-flight request at a
21+
// time.
2122
//
22-
// V2b.1 (warm append-only resume): a named session keeps its decoded context
23-
// across requests. On the next request the worker compares the new prompt's
24-
// token ids against the session's resident token ids; if the resident ids are
25-
// an exact prefix, it prefills ONLY the suffix (continuing the KV/recurrent
26-
// state at pos>0) instead of resetting and re-prefilling the whole prompt. The
27-
// check is exact-token (never string/retokenized text) and falls back to a full
28-
// reset+prefill whenever exact reuse can't be proven, so it is always correct;
29-
// the win is when the prompt is a genuine token extension of the prior turn.
23+
// Warm resume: a named session keeps its decoded context across requests. The
24+
// new prompt's token ids are matched against the session's resident token ids;
25+
// on an exact prefix only the suffix is prefilled (continuing at pos>0). The
26+
// match is exact-token (never retokenized text) and falls back to a full
27+
// reset+prefill whenever exact reuse can't be proven, so it is always correct.
3028
// See plan_prefill().
3129
//
32-
// Sessions:
33-
// - Named: an explicit session_id -> session + resident token ids, created on
34-
// first use (or via an `open` op), capped at max_named_sessions = capacity
35-
// - 1 (the scratch slot is reserved). 0 when the backend hosts one session.
36-
// Warm resume applies to named sessions (unless disabled).
37-
// - Scratch: one session for anonymous requests (no session_id), reset every
38-
// request -- distinct anonymous callers must never reuse each other's
39-
// state.
40-
//
41-
// Protocol (one JSON object per line; matches worker_client.py):
30+
// Protocol (one JSON object per line; matches worker_client.py). stdout carries
31+
// ONLY protocol JSON; logs go to stderr (ET_LOG):
4232
// worker -> stdout, once: {"ready": true, "max_sessions": int,
4333
// "max_named_sessions": int}
4434
// client -> stdin:
45-
// generate: {"max_new_tokens": int, "temperature": float,
46-
// "stop": [str, ...], "session_id"?: str,
47-
// and exactly one prompt form:
48-
// "prompt": str
49-
// "prompt_segments": [{"text": str} | {"ids": [int, ...]}]}
50-
// open: {"op": "open", "session_id": str}
51-
// close: {"op": "close", "session_id": str}
52-
// reset: {"op": "reset", "session_id": str} // clear context, keep
53-
// slot
35+
// generate: {"max_new_tokens": int, "temperature": float, "stop":
36+
// [str,...],
37+
// "session_id"?: str, and exactly one prompt form:
38+
// "prompt": str
39+
// "prompt_segments": [{"text": str} | {"ids": [int,...]}]}
40+
// open/close/reset: {"op": "open"|"close"|"reset", "session_id": str}
5441
// worker -> stdout:
55-
// generate: {"token": str} * (streamed)
56-
// {"done": true, "prompt_tokens": int, "completion_tokens":
57-
// int,
58-
// "finish_reason": "stop"|"length",
59-
// "reused_prompt_tokens": int, "prefilled_prompt_tokens": int,
60-
// "session_reset_reason": "new"|"exact_prefix"|"dirty"|
61-
// "mismatch"|"equal",
62-
// "generated_token_ids"?: [int, ...]} // omitted if
63-
// stop-trimmed
64-
// open: {"opened": true, "session_id": str}
65-
// close: {"closed": true, "session_id": str}
66-
// reset: {"reset": true, "session_id": str}
67-
// error: {"error": str, "code"?: str} // code: "capacity_exhausted",
68-
// // "unsupported_session"
69-
//
70-
// stdout carries ONLY protocol JSON; all logs go to stderr (ET_LOG). One
71-
// request at a time (the control plane serializes).
42+
// generate: {"token": str} * (streamed), then
43+
// {"done": true, "prompt_tokens": int, "completion_tokens": int,
44+
// "finish_reason": "stop"|"length",
45+
// "reused_prompt_tokens": int, "prefilled_prompt_tokens": int,
46+
// "session_reset_reason": str
47+
// (new|exact_prefix|mismatch|dirty|equal),
48+
// "generated_token_ids"?: [int,...]} // omitted if stop-trimmed
49+
// open/close/reset: {"opened"|"closed"|"reset": true, "session_id": str}
50+
// error: {"error": str, "code"?: str} // capacity_exhausted |
51+
// // unsupported_session
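
A client-side view of the JSONL protocol documented in the comment above, sketched in Python. The field names come from that comment; the helper names `build_generate_request` and `collect_generation` are hypothetical and are not the API of `worker_client.py`. The reply lines are simulated rather than read from a real worker.

```python
import json
from typing import Dict, Iterable, Optional, Sequence, Tuple


def build_generate_request(
    prompt: str,
    max_new_tokens: int,
    temperature: float = 0.0,
    stop: Sequence[str] = (),
    session_id: Optional[str] = None,
) -> str:
    """Serialize one generate request as a single JSON line."""
    req: Dict[str, object] = {
        "prompt": prompt,  # exactly one prompt form; this sketch uses "prompt"
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "stop": list(stop),
    }
    if session_id is not None:
        req["session_id"] = session_id  # omit for the anonymous scratch session
    return json.dumps(req)


def collect_generation(lines: Iterable[str]) -> Tuple[str, dict]:
    """Join streamed {"token": ...} events until the {"done": true} event."""
    pieces = []
    for line in lines:
        msg = json.loads(line)
        if "error" in msg:
            raise RuntimeError(f"{msg.get('code', 'error')}: {msg['error']}")
        if "token" in msg:
            pieces.append(msg["token"])
        elif msg.get("done"):
            return "".join(pieces), msg
    raise RuntimeError("stream ended before a done event")


# Simulated worker output for one request (values are made up).
reply = [
    '{"token": "Hel"}',
    '{"token": "lo"}',
    '{"done": true, "prompt_tokens": 5, "completion_tokens": 2, '
    '"finish_reason": "stop", "reused_prompt_tokens": 3, '
    '"prefilled_prompt_tokens": 2, "session_reset_reason": "exact_prefix"}',
]
text, done = collect_generation(reply)
print(build_generate_request("Hi", 16, session_id="conv-1"))
print(text, done["session_reset_reason"])
```

Because stdout carries only protocol JSON, a client can parse every line unconditionally and leave stderr to the logs.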
7252

7353
#include <nlohmann/json.hpp>
7454

@@ -242,7 +222,12 @@ inline void worker_handle_request(
242222
const auto& d = step_result.get();
243223
if (d.is_terminal) {
244224
finish = "stop";
245-
break; // terminal step (EOS / cooperative stop): not emitted or counted
225+
// Terminal step (EOS / cooperative stop): the terminal token is neither
226+
// emitted as text nor counted in num_generated -> completion_tokens. This
227+
// is intentional -- completion_tokens reflects the visible completion the
228+
// client received, not internal forward steps; an EOS the user never sees
229+
// is not part of that count.
230+
break;
246231
}
247232
// The token was forwarded into the cache (pos advanced); track it so the
248233
// resident-ids/position invariant holds. EOS/terminal tokens are not
@@ -264,9 +249,15 @@ inline void worker_handle_request(
264249
if (stop_hit) {
265250
finish = "stop"; // reached a stop string: drop it and everything after
266251
stop_string = true;
267-
// The emitted text was trimmed at the stop string, so the next turn's
268-
// rendered prompt won't be an exact token extension of resident: force a
269-
// reset rather than risk a false prefix match.
252+
// Trimming at the stop means the next turn's prompt won't be an exact
253+
// token extension of resident, so force a reset (no false prefix match).
254+
//
255+
// CONTRACT: every *string* stop is non-resumable this way (trim + dirty +
256+
// omit generated_token_ids) -- right for user/request and content-cleanup
257+
// stops, which change visible text. A clean turn terminator stays
258+
// warm-resumable only if the engine surfaces it as a terminal/EOS token
259+
// id (handled above via d.is_terminal; e.g. Qwen adds <|im_end|> to
260+
// eos_ids).
270261
st.dirty = true;
271262
break;
272263
}
@@ -400,8 +391,8 @@ class WorkerSessions {
400391
// Emit {"ready": true, ...}, then read JSONL requests from stdin and dispatch
401392
// each (generate / open / close / reset), reporting exceptions as
402393
// {"error": ...} and continuing to serve. Returns 0 when stdin closes.
403-
// enable_warm_resume gates V2b.1 warm suffix reuse for named sessions (off ->
404-
// every request resets, the V2a behavior; useful for A/B measurement).
394+
// enable_warm_resume gates warm suffix reuse for named sessions (off -> every
395+
// request resets and re-prefills; useful for A/B measurement).
405396
inline int run_worker_stdio_loop(
406397
LLMEngine& engine,
407398
::tokenizers::Tokenizer& tokenizer,
