fix: add timeout and abort retry for LLM API requests by ycls2002 · Pull Request #261 · Narcooo/inkos

ycls2002 · 2026-05-11T09:37:50Z

Summary

- LLM API calls via fetchWithProxy had no timeout, so requests could hang indefinitely when the remote server dropped the connection without responding.
- This surfaced as "The socket connection was closed unexpectedly" errors that crashed the entire pipeline with no recovery.

Root Cause

In packages/core/src/llm/provider.ts, both /responses and /chat/completions endpoint calls to fetchWithProxy lacked a signal parameter, meaning Node.js fetch would wait forever if the server never responded or dropped the TCP connection mid-stream.

Additionally, AbortSignal.timeout() errors ("The operation was aborted") were not in the transient error detection list, so even if a timeout were added, it would not trigger the existing retry mechanism.

Changes

1. Add `AbortSignal.timeout(180_000)` to LLM fetch calls

Both fetchWithProxy calls in chatCompletionViaCustomOpenAICompatible now include a 3-minute timeout:

// /responses endpoint (line ~792)
const response = await fetchWithProxy(url, {
  method: "POST",
  headers,
  body: JSON.stringify(payload),
  signal: AbortSignal.timeout(180_000),  // NEW
}, client.proxyUrl);

// /chat/completions endpoint (line ~880)
const response = await fetchWithProxy(url, {
  method: "POST",
  headers,
  body: JSON.stringify(payload),
  signal: AbortSignal.timeout(180_000),  // NEW
}, client.proxyUrl);
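
For readers unfamiliar with `AbortSignal.timeout()`, the pattern above can be sketched in isolation. This is a minimal sketch, assuming the global `fetch` of Node.js 18+ rather than the project's `fetchWithProxy`; the names `fetchWithTimeout`, `isTimeoutError`, and `DEFAULT_TIMEOUT_MS` are hypothetical and do not appear in the InkOS codebase.

```typescript
// Hypothetical helper names; not taken from the InkOS codebase.
const DEFAULT_TIMEOUT_MS = 180_000; // 3 minutes, the value used in this PR

async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): Promise<Response> {
  // AbortSignal.timeout() returns a signal that aborts once timeoutMs elapse.
  // The deadline covers the whole request, including reading a streamed body.
  return fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
}

function isTimeoutError(err: unknown): boolean {
  // A timeout signal aborts with an error whose name is "TimeoutError".
  return err instanceof Error && err.name === "TimeoutError";
}
```

Because the signal is a total deadline rather than an idle timer, a response that keeps streaming past `timeoutMs` is aborted even though data is still arriving.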

2. Add abort errors to transient error detection list

// isTransientLLMTransportError (line ~442)
"aborted",
"The operation was aborted",

This ensures timeout-triggered aborts are caught by withTransientLLMRetry and retried up to 2 times (consistent with existing retry behavior for ECONNRESET, socket hang up, etc.).
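
How the marker list and the retry wrapper might fit together is sketched here. This is an illustration only: `TRANSIENT_MARKERS`, `isTransientError`, and `withRetry` are hypothetical stand-ins for `isTransientLLMTransportError` and `withTransientLLMRetry`, and the case-insensitive substring matching is an assumption about how the real check works.

```typescript
// Assumed marker list: only these four strings are named in this PR.
const TRANSIENT_MARKERS: string[] = [
  "ECONNRESET",
  "socket hang up",
  "aborted",
  "The operation was aborted",
];

// Assumption: matching is a case-insensitive substring test on the message.
function isTransientError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  const lower = message.toLowerCase();
  return TRANSIENT_MARKERS.some((marker) => lower.includes(marker.toLowerCase()));
}

// Runs fn once, then retries up to maxRetries times on transient errors,
// with no delay between attempts (matching the behavior described above).
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 2): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (!isTransientError(err)) throw err; // non-transient: fail fast
    }
  }
  throw lastError; // every attempt failed with a transient error
}
```

Under substring matching, the "aborted" entry alone would already match "The operation was aborted"; the longer entry is only needed if the real check compares messages exactly.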

Behavior After Fix

| Scenario | Before | After |
| --- | --- | --- |
| Server drops socket | Crash immediately | Retry up to 2 times |
| Server hangs (no response) | Hang forever | Timeout at 3 min, retry |
| Transient network glitch | May crash | Auto-retry |

Notes

- AbortSignal.timeout() requires Node.js 17.3+ (InkOS already requires Node.js 18+).
- The 180s timeout is generous for LLM generation; adjust if needed for slower models.
- No backoff delay between retries (consistent with existing behavior).

Related: I encountered this issue when using mimo-v2.5-pro via a custom endpoint (token-plan-cn.xiaomimimo.com), where the server periodically drops connections during long-running stream completions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

LLM API calls via fetchWithProxy had no timeout set, causing requests to hang indefinitely when the remote server drops the connection without response. This resulted in "The socket connection was closed unexpectedly" errors that crashed the pipeline.

Changes:
- Add AbortSignal.timeout(180_000) to both /responses and /chat/completions fetch calls (3-minute timeout)
- Add "aborted" and "The operation was aborted" to the transient error detection list so timeout-triggered aborts are retried (up to 2 retries, consistent with existing retry behavior)

This ensures requests fail gracefully on network issues instead of hanging forever, and transient failures are automatically retried.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

InkOS chapter generation involves multiple pipeline stages (creative writing → audit → revision), each streaming for 2-3 minutes. The 180s timeout was too short, causing TimeoutError mid-pipeline. Increased to 600s (10 minutes) to accommodate the full generation cycle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Remove AbortSignal.timeout from fetch calls as it was too aggressive for long-running LLM pipeline stages. Keep "aborted" in transient error list for future compatibility.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
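
A fixed deadline is aggressive for streaming because it counts total elapsed time, not time since the last chunk. One alternative, which this PR does not implement, is an idle timeout that restarts whenever data arrives. The following is a rough sketch under that assumption; `createIdleAbort` and `readStreamWithIdleTimeout` are hypothetical names, and the global `fetch` stands in for `fetchWithProxy`.

```typescript
// Hypothetical helper, not part of this PR: aborts only after idleMs of silence.
function createIdleAbort(idleMs: number) {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Restart the idle window; call this whenever data arrives.
  const touch = (): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(
      () => controller.abort(new Error("idle timeout: The operation was aborted")),
      idleMs,
    );
  };

  // Stop the timer once the request has finished.
  const cancel = (): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = undefined;
  };

  touch(); // the first window covers the wait for response headers
  return { signal: controller.signal, touch, cancel };
}

async function readStreamWithIdleTimeout(
  url: string,
  init: RequestInit,
  idleMs: number,
): Promise<string> {
  const idle = createIdleAbort(idleMs);
  try {
    const response = await fetch(url, { ...init, signal: idle.signal });
    idle.touch(); // headers arrived
    if (!response.body) return await response.text();

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let text = "";
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      idle.touch(); // a chunk arrived, so restart the idle window
      text += decoder.decode(value, { stream: true });
    }
    return text + decoder.decode(); // flush any buffered bytes
  } finally {
    idle.cancel();
  }
}
```

With this design a stage that streams for ten minutes is never cut off as long as chunks keep arriving, while a stalled connection still fails after `idleMs`. The abort message here deliberately contains "aborted" so that a substring-based transient check would treat it as retryable.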

huangzhenkun and others added 3 commits May 11, 2026 16:43

fix: remove timeout, keep abort retry entries only

16cdce8

ycls2002 wants to merge 3 commits into Narcooo:master from ycls2002:fix/llm-request-timeout
