@@ -69,7 +69,7 @@ Key flags:
 | `--allow-chatml-fallback` | opt into approximate ChatML when no HF tokenizer |
 | `--no-think` | default `enable_thinking=False` (e.g. Qwen3) |
 | `--max-context N` | reject over-long prompts with 400 instead of failing mid-gen |
-| `--num-runners N` | V1 supports **1 only** (single-slot: one worker serves one session; concurrent requests queue) |
+| `--num-runners N` | Worker processes — **1 only** (one worker hosts many isolated sessions on one weight load; more would duplicate weights) |
 | `--worker-bin PATH` | path to the `text_llm_worker` binary (default: `cmake-out/extension/llm/server/cpp/text_llm_worker`) |

 ## Use from an agent harness
@@ -101,16 +101,19 @@ pytest tests/
 OPENAI_BASE_URL=http://127.0.0.1:8000/v1 pytest ../conformance/test_openai_contract.py
 ```

-`tests/` builds a `RunnerPool` over a single `FakeRunner` worker handle, so the
+`tests/` builds a `SessionRuntime` over a single `FakeRunner` worker, so the
 real server/protocol/streaming code is tested over HTTP without a `.pte`. The
 worker JSONL protocol is covered separately by `tests/test_worker_client.py`.

 ## Architecture

-Control plane (this dir, Python): server, OpenAI protocol, chat templating,
-streaming bridge, tool parsing — no CUDA, no model, no pybind. Data plane (C++):
-a worker process (`text_llm_worker`) owns one model session and does all token
-stepping and KV mutation; it speaks one JSON object per line on stdin/stdout.
+Control plane (this dir, Python): an OpenAI adapter (`serving_chat`) over a
+stateful `SessionRuntime` over one `WorkerClient` — server, protocol, chat
+templating, streaming bridge, tool parsing — no CUDA, no model, no pybind. Data
+plane (C++): a worker process (`text_llm_worker`) that owns all model state
+(many isolated sessions on one weight load, warm-resume prefix logic) and does
+all token stepping and KV mutation; it speaks one JSON object per line on
+stdin/stdout.

 JSONL protocol (stdout carries protocol JSON only; logs go to stderr):

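As a rough illustration of the "one JSON object per line on stdin/stdout" transport described in this hunk, here is a minimal sketch. It does not use `text_llm_worker` or the real `WorkerClient`: the child process is a stand-in Python echo script, and `JsonlClient`, `FAKE_WORKER`, and the `id`/`payload`/`echo` fields are hypothetical names chosen for the example, not the project's actual message schema.

```python
import json
import subprocess
import sys

# Stand-in worker (NOT text_llm_worker): reads one JSON object per line on
# stdin, replies with one JSON object per line on stdout, logs to stderr.
FAKE_WORKER = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    print("got request", file=sys.stderr)  # logs never touch stdout
    reply = {"id": req["id"], "echo": req["payload"]}  # hypothetical fields
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class JsonlClient:
    """Drive a child process with one JSON object per line in each direction."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,  # a real client would forward logs
            text=True,
            bufsize=1,  # line-buffered pipes
        )

    def request(self, obj):
        # json.dumps escapes newlines inside strings, so one object = one line.
        self.proc.stdin.write(json.dumps(obj) + "\n")
        self.proc.stdin.flush()
        line = self.proc.stdout.readline()  # blocking read of one reply line
        if not line:
            raise RuntimeError("worker closed stdout")
        return json.loads(line)

    def close(self):
        self.proc.stdin.close()  # EOF ends the worker's read loop
        code = self.proc.wait(timeout=10)
        self.proc.stdout.close()
        return code


def demo():
    client = JsonlClient([sys.executable, "-c", FAKE_WORKER])
    try:
        return client.request({"id": 1, "payload": "hello\nworld"})
    finally:
        client.close()


if __name__ == "__main__":
    print(demo())
```

Keeping stdout reserved for protocol JSON is what makes the blocking `readline()` safe here: any log line on stdout would be parsed as a reply.
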
@@ -132,9 +135,9 @@ does blocking pipe I/O on its executor thread.
 | `server.py` | FastAPI app, routes, CLI entrypoint, worker spawn |
 | `protocol.py` | OpenAI request/response schemas |
 | `chat_template.py` | messages (+tools) → prompt string |
-| `worker_client.py` | spawn a worker process + drive it over JSONL |
-| `runner_pool.py` | worker pool (one in-flight request per worker) + async streaming bridge |
-| `serving_chat.py` | `/v1/chat/completions` (streaming + non-streaming, stop, tools) |
+| `worker_client.py` | spawn a worker process + drive it over JSONL (raw transport) |
+| `session_runtime.py` | stateful runtime over one worker: open/generate/reset/close + streaming bridge |
+| `serving_chat.py` | `/v1/chat/completions` OpenAI adapter (streaming + non-streaming, stop, tools) |
 | `tool_parsers/` | Hermes/Qwen `<tool_call>` parser only |
 | `cpp/text_llm_worker.cpp` | the generic C++ worker binary |

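The "streaming bridge" and "blocking pipe I/O on its executor thread" mentioned in this hunk could look roughly like the sketch below. It is not the code in `session_runtime.py`: `blocking_tokens`, `stream`, and `collect` are hypothetical names, the token source is a plain generator standing in for pipe reads, and using an `asyncio.Lock` to get "one in-flight request; concurrent requests queue" is an assumption about one way to serialize work.

```python
import asyncio

_DONE = object()  # sentinel marking end of stream


def blocking_tokens(prompt):
    """Stand-in for blocking pipe reads from a worker (hypothetical)."""
    for word in prompt.split():
        yield word


async def stream(prompt, lock):
    """Yield tokens asynchronously while the blocking read runs on a thread."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    def pump():
        # Runs on the executor thread; hand items to the loop thread safely.
        try:
            for tok in blocking_tokens(prompt):
                loop.call_soon_threadsafe(queue.put_nowait, tok)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    async with lock:  # one in-flight request; later callers wait here
        fut = loop.run_in_executor(None, pump)
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            yield item
        await fut  # surface any exception raised on the worker thread


async def collect(prompt, lock):
    return [tok async for tok in stream(prompt, lock)]


async def demo():
    lock = asyncio.Lock()
    # Two concurrent requests: the second queues behind the first.
    return await asyncio.gather(collect("a b c", lock), collect("d e", lock))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The event loop never blocks on the pipe: the thread does the blocking reads, and `call_soon_threadsafe` is the only point where the two sides touch.
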
@@ -151,11 +154,11 @@ imports an example. Backend specifics (CUDA/AOTI, Metal) stay inside the worker.
 ## Scope & caveats

 Deliberately narrow (reliability-first): Hermes/Qwen tool calling only;
-unsupported sampling params are rejected, not ignored. V1 is **single-slot**: one
-worker hosts one session, so `--num-runners` accepts 1 and concurrent requests
-queue. Serving capacity is worker capacity, chosen by the launcher (each worker
-is its own process with its own weights, so N workers cost N × the weight memory)
-— an operator decision, not something the pool infers.
+unsupported sampling params are rejected, not ignored. **One worker process,
+serialized execution** (one in-flight request; concurrent requests queue).
+Session capacity is determined by the worker/engine — a single worker hosts many
+isolated sessions on one weight load — so `--num-runners` accepts 1; extra worker
+processes would each carry their own copy of the weights.

 Cancellation is best-effort: a worker request runs to completion and is not
 interruptible mid-generation in V1, so `runner.stop()` means "the control plane