fix: add timeout and abort retry for LLM API requests #261

Open
ycls2002 wants to merge 3 commits into Narcooo:master from ycls2002:fix/llm-request-timeout

@ycls2002

Summary

  • LLM API calls via fetchWithProxy had no timeout set, causing requests to hang indefinitely when the remote server drops the connection without responding
  • This surfaced as "The socket connection was closed unexpectedly" errors, which crashed the entire pipeline with no recovery

Root Cause

In packages/core/src/llm/provider.ts, both /responses and /chat/completions endpoint calls to fetchWithProxy lacked a signal parameter, meaning Node.js fetch would wait forever if the server never responded or dropped the TCP connection mid-stream.

Additionally, AbortSignal.timeout() errors ("The operation was aborted") were not in the transient error detection list, so even if a timeout were added, it would not trigger the existing retry mechanism.

Changes

1. Add AbortSignal.timeout(180_000) to LLM fetch calls

Both fetchWithProxy calls in chatCompletionViaCustomOpenAICompatible now include a 3-minute timeout:

// /responses endpoint (line ~792)
const response = await fetchWithProxy(url, {
  method: "POST",
  headers,
  body: JSON.stringify(payload),
  signal: AbortSignal.timeout(180_000),  // NEW
}, client.proxyUrl);

// /chat/completions endpoint (line ~880)
const response = await fetchWithProxy(url, {
  method: "POST",
  headers,
  body: JSON.stringify(payload),
  signal: AbortSignal.timeout(180_000),  // NEW
}, client.proxyUrl);
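
To illustrate how such a deadline surfaces to the caller, here is a minimal sketch. It assumes only that the underlying fetch function honors a standard `signal` option; `postJsonWithDeadline`, `FetchLike`, and `hangingFetch` are hypothetical names introduced for this example and are not part of provider.ts.

```typescript
// Sketch only: these names are hypothetical, not taken from the PR.
interface PostInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
  signal: AbortSignal;
}

// Any fetch-like function that accepts a URL and an init object with a signal.
type FetchLike<R> = (url: string, init: PostInit) => Promise<R>;

async function postJsonWithDeadline<R>(
  fetchImpl: FetchLike<R>,
  url: string,
  payload: unknown,
  timeoutMs: number,
): Promise<R> {
  // AbortSignal.timeout() aborts the signal after timeoutMs with a
  // DOMException whose name is "TimeoutError".
  return fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
    signal: AbortSignal.timeout(timeoutMs),
  });
}

// Stand-in for a server that never answers: the promise settles only
// when the signal aborts, and then rejects with the signal's reason.
const hangingFetch: FetchLike<never> = (_url, init) =>
  new Promise<never>((_resolve, reject) => {
    init.signal.addEventListener("abort", () => reject(init.signal.reason), {
      once: true,
    });
  });
```

Calling `postJsonWithDeadline(hangingFetch, url, payload, 50)` should reject after roughly 50 ms with an error named `TimeoutError` rather than staying pending. Passing the fetch function in as a parameter is a choice made here so the behavior can be exercised without a network.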

2. Add abort errors to transient error detection list

// isTransientLLMTransportError (line ~442)
"aborted",
"The operation was aborted",

This ensures timeout-triggered aborts are caught by withTransientLLMRetry and retried up to 2 times (consistent with existing retry behavior for ECONNRESET, socket hang up, etc.).
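
The interaction between the marker list and the retry loop can be sketched as follows. This is an assumed shape, not the actual implementation of isTransientLLMTransportError or withTransientLLMRetry: the substring matching, the `isTransient` and `withRetry` names, and the marker array are illustrative, and only the two abort strings and the limit of 2 retries come from this PR.

```typescript
// Sketch only: hypothetical stand-ins for the real helpers in provider.ts.
const TRANSIENT_MARKERS: readonly string[] = [
  "ECONNRESET",
  "socket hang up",
  "aborted", // added by this PR
  "The operation was aborted", // added by this PR
];

// Extract a message from anything that was thrown.
function errorMessage(error: unknown): string {
  if (typeof error === "object" && error !== null && "message" in error) {
    return String((error as { message: unknown }).message);
  }
  return String(error);
}

// Assumption: transient errors are detected by substring match on the message.
function isTransient(error: unknown): boolean {
  const message = errorMessage(error);
  return TRANSIENT_MARKERS.some((marker) => message.includes(marker));
}

// Retry transient failures up to maxRetries times, with no delay between attempts.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 2,
): Promise<T> {
  let retriesUsed = 0;
  while (true) {
    try {
      return await operation();
    } catch (error) {
      if (retriesUsed >= maxRetries || !isTransient(error)) throw error;
      retriesUsed += 1;
    }
  }
}
```

With `maxRetries = 2`, an operation runs at most three times in total: the first attempt plus two retries. A non-transient error is rethrown on the first failure.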

Behavior After Fix

| Scenario | Before | After |
| --- | --- | --- |
| Server drops socket | Crash immediately | Retry up to 2 times |
| Server hangs (no response) | Hang forever | Timeout at 3 min, retry |
| Transient network glitch | May crash | Auto-retry |

Notes

  • AbortSignal.timeout() requires Node.js 17.3+ (InkOS already requires Node.js 18+)
  • The 180s timeout is generous for LLM generation; adjust if needed for slower models
  • No backoff delay between retries (consistent with existing behavior)
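
If the timeout does need adjusting per deployment, one option is to resolve it from an override with a safe fallback. The sketch below is an assumption rather than anything in this PR: `resolveTimeoutMs` and `DEFAULT_LLM_TIMEOUT_MS` are hypothetical names, and where the raw override comes from (an environment variable, a config field) is left open.

```typescript
// Sketch only: hypothetical helper, not part of the PR.
const DEFAULT_LLM_TIMEOUT_MS = 180_000; // the 3-minute value used in this PR

// Parse an optional override; fall back when it is missing, empty,
// non-numeric, or not a positive number.
function resolveTimeoutMs(
  raw: string | undefined,
  fallback: number = DEFAULT_LLM_TIMEOUT_MS,
): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : fallback;
}

// Possible use at a call site (the override source is an assumption):
//   signal: AbortSignal.timeout(resolveTimeoutMs(overrideFromConfig))
```

Rejecting zero, negative, and non-numeric values keeps a misconfigured override from producing a signal that aborts immediately.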

Related: encountered this issue when using mimo-v2.5-pro via custom endpoint (token-plan-cn.xiaomimimo.com), where the server periodically drops connections during long-running stream completions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

huangzhenkun and others added 3 commits May 11, 2026 16:43
LLM API calls via fetchWithProxy had no timeout set, causing requests
to hang indefinitely when the remote server drops the connection
without response. This resulted in "The socket connection was closed
unexpectedly" errors that crashed the pipeline.

Changes:
- Add AbortSignal.timeout(180_000) to both /responses and
  /chat/completions fetch calls (3-minute timeout)
- Add "aborted" and "The operation was aborted" to the transient
  error detection list so timeout-triggered aborts are retried
  (up to 2 retries, consistent with existing retry behavior)

This ensures requests fail gracefully on network issues instead of
hanging forever, and transient failures are automatically retried.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

InkOS chapter generation involves multiple pipeline stages
(creative writing → audit → revision), each streaming for
2-3 minutes. 180s timeout was too short, causing TimeoutError
mid-pipeline. Increased to 600s (10 minutes) to accommodate
the full generation cycle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Remove AbortSignal.timeout from fetch calls as it was too
aggressive for long-running LLM pipeline stages. Keep
"aborted" in transient error list for future compatibility.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>