feat: streaming synthesis #7
Merged
A deep query's final synthesis is a 30–60s single LLM call. Until now the
user stared at a frozen terminal for the whole duration, then got the
entire answer at once. Now the tokens stream to stdout as soon as they
arrive.
Turned on by default when ALL of the following hold:
- not --json (JSON is buffered into the envelope)
- not --deep (intermediate rounds would print multiple full drafts
back-to-back)
- stdout.isTTY (piped output shouldn't interleave with progress events)
- not --no-stream (explicit opt-out)
Turned off otherwise. Env-var DEEPDIVE_FORCE_STREAM=1 bypasses the TTY
check for CI-style testing.
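
The gating rules above can be sketched as one pure function. This is a minimal illustration, not the code in src/config.ts: the StreamInputs shape and the name deriveStreamEnabled are hypothetical, and the sketch assumes the env var only replaces the TTY requirement and never overrides --json, --deep, or --no-stream.

```typescript
// Hypothetical input shape; field names are illustrative, not taken from src/config.ts.
interface StreamInputs {
  json: boolean;        // --json was passed
  deep: boolean;        // --deep was passed
  noStream: boolean;    // --no-stream was passed
  isTTY: boolean;       // value of stdout.isTTY
  forceStream: boolean; // DEEPDIVE_FORCE_STREAM=1 is set
}

// Returns true only when every condition in the list above holds.
function deriveStreamEnabled(o: StreamInputs): boolean {
  // Any of these three flags switches streaming off unconditionally.
  if (o.json || o.deep || o.noStream) return false;
  // The env var stands in for the TTY check only (assumption stated above).
  return o.isTTY || o.forceStream;
}
```

Taking the flags and environment as plain booleans keeps the derivation easy to test as a matrix, without touching process state.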
CLI output structure under streaming:
1. Write "# {question}\n\n" up front
2. Stream answer tokens straight to stdout as they arrive
3. After synthesis completes, write the "## Sources\n..." block
If --out is also set, the full markdown (re-rendered from the buffered
result) goes to the file so the file remains atomically complete.
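
A rough sketch of the three output phases, assuming a push-style renderer. The names createStreamingRenderer and sourcesBlock are hypothetical, and the one-bullet-per-source layout is an assumption; the actual formatting in src/cli.ts may differ.

```typescript
// Hypothetical helpers; the real rendering in src/cli.ts may differ.
type Write = (text: string) => void;

function sourcesBlock(sources: string[]): string {
  // Assumed format: one bullet per source under a "## Sources" heading.
  return `\n\n## Sources\n${sources.map((s) => `- ${s}`).join("\n")}\n`;
}

function createStreamingRenderer(question: string, write: Write) {
  let answer = ""; // buffered copy of every token streamed so far
  return {
    // Phase 1: header goes out before the first token arrives.
    begin(): void {
      write(`# ${question}\n\n`);
    },
    // Phase 2: each token is written immediately and also buffered.
    onToken(chunk: string): void {
      answer += chunk;
      write(chunk);
    },
    // Phase 3: sources block after synthesis completes. Returns the full
    // markdown rebuilt from the buffer, which a caller could write to --out.
    finish(sources: string[]): string {
      const block = sourcesBlock(sources);
      write(block);
      return `# ${question}\n\n${answer}${block}`;
    },
  };
}
```

Because finish rebuilds the document from the buffer rather than from what was printed, the --out file can be written in one step after the stream ends.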
Implementation:
- src/llm-stream.ts: callLLMStream + parseSSE async generator + pure
parseBlocks frame parser. Reuses the existing retry helper for the
initial connect; mid-stream failures propagate.
- src/synthesize.ts: added optional onToken param; passes through to
streaming variant when set, falls back to buffered callLLM otherwise.
- src/agent.ts: added AgentConfig.onSynthesizeToken — forwards chunks
with the current round number.
- src/cli.ts: picks streaming mode, writes header + sources around the
stream.
- src/config.ts: streamEnabled derived once and auto-off for json/deep/
env opt-out.
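
For readers unfamiliar with SSE framing, here is a minimal sketch of what a pure frame parser can look like. It is not the parseBlocks in src/llm-stream.ts, whose signature is not shown here: parseFrame and the Frame type are hypothetical, and the sketch assumes payloads are JSON with an optional [DONE] sentinel.

```typescript
// Hypothetical result type: a parsed JSON payload or the end-of-stream sentinel.
type Frame = { kind: "event"; data: unknown } | { kind: "done" };

// Parses the text of ONE SSE event (the lines between blank-line separators).
// Returns null for frames that carry no usable data.
function parseFrame(block: string): Frame | null {
  const dataLines: string[] = [];
  for (const line of block.split(/\r\n|\n|\r/)) {
    if (line === "" || line.startsWith(":")) continue; // blank line or comment
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
    if (field === "data") dataLines.push(value);       // other fields are ignored
  }
  if (dataLines.length === 0) return null;
  // Multiple data lines fold into one payload joined by newlines.
  const payload = dataLines.join("\n");
  if (payload === "[DONE]") return { kind: "done" };
  try {
    return { kind: "event", data: JSON.parse(payload) };
  } catch {
    return null; // malformed JSON is dropped rather than thrown
  }
}
```

Keeping this step free of I/O means it can be exercised with plain strings, separately from the network reader.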
Tests: 16 new assertions (180 total, up from 164 pre-branch, 96 before
the production-grade track started).
- parseBlocks (6): single-line, multi-line data folding (spec-compliant),
leading-space stripping, empty/comment blocks, [DONE] sentinel,
malformed-JSON drop
- parseSSE (4): multi-frame stream, chunk boundary splitting, trailing
event without blank line, CRLF line endings
- callLLMStream integration (4): token order + full-text aggregation +
usage parsing, 500-retry-then-succeed, 401-does-not-retry,
non-text_delta events ignored gracefully
- CLI/config (2): --no-stream flag, streamEnabled derivation matrix
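
The parseSSE cases above (chunk boundary splitting, CRLF, trailing event without a blank line) all come down to buffering. The sketch below shows that logic with a push-style splitter; it is a hypothetical illustration named createFrameSplitter, not the async generator in src/llm-stream.ts.

```typescript
// Hypothetical push-style splitter: feed it raw text chunks, get back the
// text of each complete SSE frame. Incomplete data stays buffered.
function createFrameSplitter() {
  let buffer = "";
  return {
    push(chunk: string): string[] {
      // Normalise CRLF after concatenation, so a "\r" at the end of one
      // chunk and a "\n" at the start of the next are still paired up.
      buffer = (buffer + chunk).replace(/\r\n/g, "\n");
      const frames: string[] = [];
      let idx = buffer.indexOf("\n\n");
      while (idx !== -1) {
        const frame = buffer.slice(0, idx);
        buffer = buffer.slice(idx + 2);
        if (frame.trim() !== "") frames.push(frame);
        idx = buffer.indexOf("\n\n");
      }
      return frames;
    },
    // Call once the stream ends: emits a trailing event that was never
    // terminated by a blank line. Surrounding whitespace is trimmed.
    flush(): string[] {
      const rest = buffer.trim();
      buffer = "";
      return rest === "" ? [] : [rest];
    },
  };
}
```

A frame is only emitted once its terminating blank line has been seen, which is why a split in the middle of a field name or a CRLF pair does not produce a partial event.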
Summary
A deep query's final synthesis is a single 30–60s LLM call. Until now the user stared at a frozen terminal for the whole duration, then got the full answer at once. Now tokens land on stdout as the model writes them.
When it's on
On by default when ALL of:
- not --json (JSON is buffered into the envelope)
- not --deep (intermediate rounds would print multiple full drafts back-to-back)
- stdout.isTTY (piped output shouldn't interleave with progress events)
- not --no-stream (explicit opt-out)
Off otherwise.
The DEEPDIVE_FORCE_STREAM=1 env var bypasses the TTY check for CI/testing.
CLI output structure under streaming
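1. Write "# {question}\n\n" up front
2. Stream answer tokens straight to stdout as they arrive
3. After synthesis completes, write the "## Sources\n..." block
If --out is also set, the full markdown (re-rendered from the buffered result) goes to the file, so the file stays atomically complete.
Implementation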
- src/llm-stream.ts: new callLLMStream with async-generator SSE parser. Reuses the existing retry helper for the initial connect; mid-stream failures propagate (can't undo already-emitted tokens).
- src/synthesize.ts: optional onToken param; streams when set, buffered callLLM path otherwise.
- src/agent.ts: new AgentConfig.onSynthesizeToken?: (chunk, round) => void hook.
- src/cli.ts: picks streaming mode, writes header + sources around the stream.
- src/config.ts: streamEnabled derived once; auto-off for JSON/deep/env-opt-out.
Test plan
- npm run build: clean under strict: true
- npm test: 180 pass (up from 164), 0 fail
- parseBlocks (6: single-line, multi-line data folding, leading-space stripping, empty/comment blocks, [DONE] sentinel, malformed-JSON drop), parseSSE (4: multi-frame, chunk boundary splits, trailing-event-without-blank-line, CRLF endings), callLLMStream integration (4: token order + text aggregation + usage parsing, 500→200 retry, 401 does-not-retry, non-text_delta events ignored), plus CLI --no-stream and config derivation matrix