fix: add timeout and abort retry for LLM API requests #261
Open
ycls2002 wants to merge 3 commits into
LLM API calls via fetchWithProxy had no timeout set, causing requests to hang indefinitely when the remote server drops the connection without response. This resulted in "The socket connection was closed unexpectedly" errors that crashed the pipeline.

Changes:
- Add AbortSignal.timeout(180_000) to both /responses and /chat/completions fetch calls (3-minute timeout)
- Add "aborted" and "The operation was aborted" to the transient error detection list so timeout-triggered aborts are retried (up to 2 retries, consistent with existing retry behavior)

This ensures requests fail gracefully on network issues instead of hanging forever, and transient failures are automatically retried.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

InkOS chapter generation involves multiple pipeline stages (creative writing → audit → revision), each streaming for 2-3 minutes. 180s timeout was too short, causing TimeoutError mid-pipeline. Increased to 600s (10 minutes) to accommodate the full generation cycle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Remove AbortSignal.timeout from fetch calls as it was too aggressive for long-running LLM pipeline stages. Keep "aborted" in transient error list for future compatibility.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
- LLM API calls via `fetchWithProxy` had no timeout set, causing requests to hang indefinitely when the remote server drops the connection without responding
- This resulted in `The socket connection was closed unexpectedly` errors that crashed the entire pipeline with no recovery

Root Cause
In `packages/core/src/llm/provider.ts`, both `/responses` and `/chat/completions` endpoint calls to `fetchWithProxy` lacked a `signal` parameter, meaning Node.js `fetch` would wait forever if the server never responded or dropped the TCP connection mid-stream.

Additionally, `AbortSignal.timeout()` errors ("The operation was aborted") were not in the transient error detection list, so even if a timeout were added, it would not trigger the existing retry mechanism.

Changes
1. Add `AbortSignal.timeout(180_000)` to LLM fetch calls

Both `fetchWithProxy` calls in `chatCompletionViaCustomOpenAICompatible` now include a 3-minute timeout.

2. Add abort errors to transient error detection list
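
As a rough illustration of how abort messages can feed a retry wrapper, here is a minimal sketch. It is not the code in `provider.ts`: `TRANSIENT_MARKERS`, `isTransientError`, and `retryTransient` are hypothetical names, and case-insensitive substring matching on the error message is an assumption about how the detection list works. Only the marker strings themselves come from this PR.

```typescript
// Hypothetical marker list; the real list in provider.ts may contain other entries.
const TRANSIENT_MARKERS: readonly string[] = [
  "ECONNRESET",
  "socket hang up",
  "aborted",
  "The operation was aborted",
];

// Assumption: an error counts as transient when its message contains a marker
// (compared case-insensitively).
function isTransientError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  const lower = message.toLowerCase();
  return TRANSIENT_MARKERS.some((marker) => lower.includes(marker.toLowerCase()));
}

// Hypothetical stand-in for a retry wrapper: one initial attempt plus up to
// `maxRetries` retries, and only for errors classified as transient.
async function retryTransient<T>(
  call: () => Promise<T>,
  maxRetries = 2,
): Promise<T> {
  let retriesUsed = 0;
  for (;;) {
    try {
      return await call();
    } catch (error) {
      if (retriesUsed >= maxRetries || !isTransientError(error)) {
        throw error; // non-transient, or retry budget exhausted
      }
      retriesUsed += 1;
    }
  }
}
```

With substring matching of this kind, the shorter `"aborted"` marker already covers `"The operation was aborted"`; listing both only makes the intent explicit.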
This ensures timeout-triggered aborts are caught by `withTransientLLMRetry` and retried up to 2 times (consistent with existing retry behavior for `ECONNRESET`, `socket hang up`, etc.).

Behavior After Fix
Notes
- `AbortSignal.timeout()` requires Node.js 17.3+ (InkOS already requires Node.js 18+)

Related: encountered this issue when using `mimo-v2.5-pro` via custom endpoint (`token-plan-cn.xiaomimimo.com`), where the server periodically drops connections during long-running stream completions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
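
For reference, the timeout approach described under Changes can be sketched as below. This is a minimal illustration, not the actual `fetchWithProxy` call: `fetchWithTimeout`, `FetchLike`, and the injectable `fetchImpl` parameter are hypothetical and exist only to keep the example self-contained. The 180_000 ms default mirrors the value in the description.

```typescript
// Hypothetical minimal fetch signature so a stand-in can be injected for testing.
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

// Attach a timeout signal to a request. When the timer fires, the signal
// aborts and the pending fetch is expected to reject with the signal's reason
// (a DOMException named "TimeoutError").
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs = 180_000, // 3 minutes, the value proposed in the description
  fetchImpl: FetchLike = fetch,
): Promise<Response> {
  return fetchImpl(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
}
```

The signal stays attached while the response body is being read, so a streamed completion that runs longer than `timeoutMs` is cut off mid-stream. That is consistent with the later commits finding 180s too short for multi-minute pipeline stages.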