feat: LLM retry with exponential backoff + per-call timeout #5

Merged: askalf merged 1 commit into master from feat/retry-timeout on Apr 23, 2026

@askalf (Owner) commented Apr 23, 2026

Summary

A 3-round deep query fires 30+ LLM calls. At a 1% per-call failure rate, the chance that all 30 calls succeed is only about 74% (0.99^30 ≈ 0.74). Retries with exponential backoff bring that above 99%.
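The arithmetic behind those figures can be checked with a short sketch. It assumes failures are independent across calls and across attempts, which real outages often violate, and `overallSuccessRate` is a hypothetical helper rather than something in this PR.

```typescript
// Probability that every one of `calls` LLM calls succeeds when each call
// may be attempted up to `attempts` times (independence is assumed).
function overallSuccessRate(
  perCallFailure: number,
  calls: number,
  attempts: number,
): number {
  // A call fails overall only if all of its attempts fail.
  const callFailure = Math.pow(perCallFailure, attempts);
  return Math.pow(1 - callFailure, calls);
}

// 30 calls, 1% failure rate, no retries: roughly 0.74.
const withoutRetry = overallSuccessRate(0.01, 30, 1);

// Same workload with the default of 3 attempts: comfortably above 0.99.
const withRetry = overallSuccessRate(0.01, 30, 3);

console.log(withoutRetry.toFixed(4), withRetry.toFixed(6));
```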

Behavior

callLLM now wraps every request in retry-with-backoff (default 3 attempts, 500ms base, 8s cap, ±25% jitter) and enforces a per-call timeout (default 120s).
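As a rough illustration of how those defaults could translate into wait times, here is a minimal sketch. The doubling factor, the choice to apply jitter after the cap, and the names `BackoffOptions` and `backoffDelayMs` are assumptions; the PR only states the base, cap, and jitter range.

```typescript
interface BackoffOptions {
  baseMs: number;       // 500 by default, per the PR
  capMs: number;        // 8000 by default, per the PR
  jitter: number;       // 0.25 means plus or minus 25%
  random: () => number; // returns a value in [0, 1); injectable for tests
}

// `attempt` is 1-based: the result is the wait after the attempt-th failure.
function backoffDelayMs(attempt: number, opts: BackoffOptions): number {
  // Assumed growth: base, 2x base, 4x base, ...
  const exponential = opts.baseMs * Math.pow(2, attempt - 1);
  const capped = Math.min(exponential, opts.capMs);
  // Map random() in [0, 1) to a factor in [1 - jitter, 1 + jitter).
  const factor = 1 + (opts.random() * 2 - 1) * opts.jitter;
  return Math.round(capped * factor);
}

const defaults: BackoffOptions = {
  baseMs: 500,
  capMs: 8000,
  jitter: 0.25,
  random: Math.random,
};

// With the default 3 attempts only the first two delays are ever used.
console.log(backoffDelayMs(1, defaults), backoffDelayMs(2, defaults));
```

Under these assumptions the cap only matters once the attempt count is raised to 6 or more, since the fifth failure is the first to reach 8s.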

Precise retry policy:

| Class | Retry? |
| --- | --- |
| HTTP 5xx | ✅ |
| HTTP 429 | ✅ |
| HTTP 4xx (non-429) | ❌ fail fast, malformed requests stay malformed |
| Fetch-level (network/DNS/TLS/timeout) | ✅ |
| User-initiated abort | ❌ short-circuits retry loop immediately |
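The policy above can be read as a single predicate. The sketch below uses a hypothetical `Failure` union to keep it self-contained; in the PR itself this decision is exposed through the `retriable` getter on `LLMError`, whose internals are not shown here.

```typescript
// Hypothetical classification of a failed attempt.
type Failure =
  | { kind: "http"; status: number }
  | { kind: "fetch" }       // network, DNS, TLS, or per-call timeout
  | { kind: "user-abort" }; // the caller's AbortSignal fired

function isRetriable(failure: Failure): boolean {
  switch (failure.kind) {
    case "http":
      // 429 and 5xx are treated as transient; every other 4xx fails fast.
      return (
        failure.status === 429 ||
        (failure.status >= 500 && failure.status <= 599)
      );
    case "fetch":
      return true;
    case "user-abort":
      return false;
  }
}

console.log(isRetriable({ kind: "http", status: 503 })); // true
console.log(isRetriable({ kind: "http", status: 401 })); // false
```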

New API surface

  • LLMError class with retriable: boolean getter
  • DEFAULT_LLM_TIMEOUT_MS, DEFAULT_LLM_ATTEMPTS constants
  • retry(fn, opts) in src/retry.ts — injectable sleep/random for tests, shouldRetry predicate, onRetry hook, abort-signal aware
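For readers who have not opened src/retry.ts, the following is a minimal sketch of what a helper with those injection points might look like. The option names and the `onRetry` argument order are assumptions, not the actual signature, and for brevity it only checks the abort signal before each attempt, not during the sleep.

```typescript
// Hypothetical option names; the real signature in src/retry.ts may differ.
interface RetryOptions {
  maxAttempts: number;
  baseMs: number;
  capMs: number;
  jitter: number;
  shouldRetry?: (err: unknown) => boolean;
  onRetry?: (err: unknown, attempt: number, delayMs: number) => void;
  sleep?: (ms: number) => Promise<void>; // injectable for deterministic tests
  random?: () => number;                 // injectable for deterministic tests
  signal?: AbortSignal;
}

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retry<T>(
  fn: (attempt: number) => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep = opts.sleep ?? realSleep;
  const random = opts.random ?? Math.random;
  const shouldRetry = opts.shouldRetry ?? (() => true);

  for (let attempt = 1; ; attempt++) {
    // Never start (or restart) work once the caller has aborted.
    if (opts.signal?.aborted) throw new Error("aborted");
    try {
      return await fn(attempt);
    } catch (err) {
      const exhausted = attempt >= opts.maxAttempts;
      if (exhausted || opts.signal?.aborted || !shouldRetry(err)) throw err;
      // Exponential growth, capped, then jittered by plus or minus `jitter`.
      const capped = Math.min(opts.baseMs * Math.pow(2, attempt - 1), opts.capMs);
      const delayMs = Math.round(capped * (1 + (random() * 2 - 1) * opts.jitter));
      opts.onRetry?.(err, attempt, delayMs);
      await sleep(delayMs);
    }
  }
}
```

Passing a `sleep` that records its argument and resolves immediately, together with a constant `random`, lets a test assert the exact delay sequence without waiting on real timers.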

New CLI flags

  • --llm-timeout-ms=<ms> / DEEPDIVE_LLM_TIMEOUT_MS (default 120000)
  • --llm-attempts=<n> / DEEPDIVE_LLM_ATTEMPTS (default 3)
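How a flag and its environment variable might be resolved is sketched below. The precedence (flag over env var over default), the validation rule, and the helper name `resolvePositiveInt` are assumptions; the PR only lists the flag names, env var names, and defaults.

```typescript
// Resolve a positive integer from `--<flag>=<n>`, then an env var, then a default.
function resolvePositiveInt(
  argv: string[],
  env: Record<string, string | undefined>,
  flag: string,
  envName: string,
  fallback: number,
): number {
  const prefix = `--${flag}=`;
  const fromFlag = argv.find((arg) => arg.startsWith(prefix))?.slice(prefix.length);
  const raw = fromFlag ?? env[envName];
  if (raw === undefined) return fallback;

  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`invalid value for --${flag}: "${raw}"`);
  }
  return value;
}

// Example with literal inputs instead of process.argv / process.env.
const attempts = resolvePositiveInt(
  ["--llm-attempts=5"],
  { DEEPDIVE_LLM_ATTEMPTS: "4" },
  "llm-attempts",
  "DEEPDIVE_LLM_ATTEMPTS",
  3,
);
console.log(attempts); // 5, because the flag is assumed to win over the env var
```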

LLMConfig gained optional timeoutMs + maxAttempts; resolveConfig sets defaults so library callers going through it pick them up for free (non-breaking).
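A minimal sketch of why this stays non-breaking: the two new fields are optional on the input type and always present on the resolved one. The `baseUrl` and `model` fields and the `ResolvedLLMConfig` type are placeholders invented for the example; only `timeoutMs`, `maxAttempts`, the constant names, and their values come from this PR.

```typescript
const DEFAULT_LLM_TIMEOUT_MS = 120000;
const DEFAULT_LLM_ATTEMPTS = 3;

interface LLMConfig {
  baseUrl: string;      // placeholder field, not from the PR
  model: string;        // placeholder field, not from the PR
  timeoutMs?: number;   // new, optional
  maxAttempts?: number; // new, optional
}

// After resolution both new fields are guaranteed to be set.
type ResolvedLLMConfig = LLMConfig & { timeoutMs: number; maxAttempts: number };

function resolveConfig(config: LLMConfig): ResolvedLLMConfig {
  return {
    ...config,
    timeoutMs: config.timeoutMs ?? DEFAULT_LLM_TIMEOUT_MS,
    maxAttempts: config.maxAttempts ?? DEFAULT_LLM_ATTEMPTS,
  };
}

// An existing caller that never heard of the new fields still compiles
// and picks up the defaults.
const legacy = resolveConfig({ baseUrl: "http://localhost:8080", model: "example" });
console.log(legacy.timeoutMs, legacy.maxAttempts);
```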

Test plan

  • npm run build — clean under strict: true
  • npm test — 159 pass (up from 141 pre-branch), 0 fail
  • 21 new assertions: retry helper math/behavior (12) + LLM integration covering first-try-OK, 500→500→200, 429→200, 400 does-not-retry, 401 does-not-retry, exhausted 5xx throws LLMError, per-call timeout on hung server, user-abort short-circuits (9)

A 3-round deep query fires 30+ LLM calls. At a 1% per-call failure rate
the overall success rate is 74%. Retries with exponential backoff bring
that above 99%.

callLLM now wraps every request in retry-with-backoff (default 3 attempts,
500ms base, 8s cap, ±25% jitter) and enforces a per-call timeout (default
120s). Retry policy:

- HTTP 5xx → retry
- HTTP 429 → retry
- HTTP 4xx (non-429) → fail fast, never retry (malformed requests stay
  malformed)
- Fetch-level errors (network, DNS, TLS, timeout) → retry, unless the
  failure was from a user-initiated abort (the user's AbortSignal firing
  short-circuits the retry loop immediately)
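The distinction between a per-call timeout (retriable) and a user-initiated abort (not retriable) is the subtle part, since both surface as an aborted request. One way to tell them apart is sketched below; `withTimeout` and `tagged` are hypothetical helpers, and the actual callLLM code may handle this differently.

```typescript
interface TaggedError extends Error {
  retriable: boolean;
}

function tagged(message: string, retriable: boolean): TaggedError {
  return Object.assign(new Error(message), { retriable });
}

// Run one attempt under its own deadline while still honoring the caller's signal.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  userSignal?: AbortSignal,
): Promise<T> {
  const controller = new AbortController();
  let timedOut = false;
  const timer = setTimeout(() => {
    timedOut = true;
    controller.abort();
  }, timeoutMs);

  // Forward the caller's abort to this attempt's own controller.
  const onUserAbort = () => controller.abort();
  if (userSignal?.aborted) controller.abort();
  userSignal?.addEventListener("abort", onUserAbort, { once: true });

  try {
    return await run(controller.signal);
  } catch (err) {
    // Check the user's signal first: a user abort must never be retried.
    if (userSignal?.aborted) throw tagged("aborted by caller", false);
    if (timedOut) throw tagged(`timed out after ${timeoutMs}ms`, true);
    throw err;
  } finally {
    clearTimeout(timer);
    userSignal?.removeEventListener("abort", onUserAbort);
  }
}
```

A caller would wrap each attempt, for example `withTimeout((signal) => fetch(url, { signal }), 120000, userSignal)`, and let the retry loop consult `retriable` on the thrown error.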

New exports:
- LLMError class with retriable: boolean getter
- DEFAULT_LLM_TIMEOUT_MS, DEFAULT_LLM_ATTEMPTS constants
- retry(fn, opts) helper in src/retry.ts — injectable sleep/random for
  deterministic tests, shouldRetry predicate, onRetry hook,
  abort-signal aware

New CLI flags + env vars:
- --llm-timeout-ms=<ms> / DEEPDIVE_LLM_TIMEOUT_MS (default 120000)
- --llm-attempts=<n> / DEEPDIVE_LLM_ATTEMPTS (default 3)

LLMConfig gained optional timeoutMs and maxAttempts; resolveConfig sets
defaults so library callers going through it pick them up for free
(non-breaking).

Tests: 21 new assertions (159 total, up from 141 pre-branch). Retry
helper: exponential + jitter math, attempt cap, shouldRetry predicate,
onRetry hook, abort signal at start and during sleep. LLM integration
(mock HTTP server): first-try-OK, 500→500→200, 429→200, 400 does-not-
retry, 401 does-not-retry, exhausted 5xx throws LLMError, per-call
timeout on hung server, user-abort short-circuits retry loop.
@askalf askalf enabled auto-merge (squash) April 23, 2026 01:09
@askalf askalf merged commit 5766385 into master Apr 23, 2026
4 checks passed
@askalf askalf deleted the feat/retry-timeout branch April 23, 2026 01:10