feat: LLM retry with exponential backoff + per-call timeout by askalf · Pull Request #5 · askalf/deepdive

askalf · 2026-04-23T01:08:48Z

Summary

A 3-round deep query fires 30+ LLM calls. At a 1% per-call failure rate the overall success rate is 74%. Retries with exponential backoff bring that above 99%.

Behavior

callLLM now wraps every request in retry-with-backoff (default 3 attempts, 500ms base, 8s cap, ±25% jitter) and enforces a per-call timeout (default 120s).

Precise retry policy:

Class	Retry?
HTTP 5xx	✅
HTTP 429	✅
HTTP 4xx (non-429)	❌ — fail fast, malformed requests stay malformed
Fetch-level (network/DNS/TLS/timeout)	✅
User-initiated abort	❌ — short-circuits retry loop immediately

New API surface

LLMError class with retriable: boolean getter
DEFAULT_LLM_TIMEOUT_MS, DEFAULT_LLM_ATTEMPTS constants
retry(fn, opts) in src/retry.ts — injectable sleep/random for tests, shouldRetry predicate, onRetry hook, abort-signal aware

New CLI flags

--llm-timeout-ms=<ms> / DEEPDIVE_LLM_TIMEOUT_MS (default 120000)
--llm-attempts=<n> / DEEPDIVE_LLM_ATTEMPTS (default 3)

LLMConfig gained optional timeoutMs + maxAttempts; resolveConfig sets defaults so library callers going through it pick them up for free (non-breaking).

Test plan

npm run build — clean under strict: true
npm test — 159 pass (up from 141 pre-branch), 0 fail
21 new assertions: retry helper math/behavior (12) + LLM integration covering first-try-OK, 500→500→200, 429→200, 400 does-not-retry, 401 does-not-retry, exhausted 5xx throws LLMError, per-call timeout on hung server, user-abort short-circuits (9)

A 3-round deep query fires 30+ LLM calls. At a 1% per-call failure rate the overall success rate is 74%. Retries with exponential backoff bring that above 99%. callLLM now wraps every request in retry-with-backoff (default 3 attempts, 500ms base, 8s cap, ±25% jitter) and enforces a per-call timeout (default 120s). Retry policy: - HTTP 5xx → retry - HTTP 429 → retry - HTTP 4xx (non-429) → fail fast, never retry (malformed requests stay malformed) - Fetch-level errors (network, DNS, TLS, timeout) → retry, unless the failure was from a user-initiated abort (the user's AbortSignal firing short-circuits the retry loop immediately) New exports: - LLMError class with retriable: boolean getter - DEFAULT_LLM_TIMEOUT_MS, DEFAULT_LLM_ATTEMPTS constants - retry(fn, opts) helper in src/retry.ts — injectable sleep/random for deterministic tests, shouldRetry predicate, onRetry hook, abort-signal aware New CLI flags + env vars: - --llm-timeout-ms=<ms> / DEEPDIVE_LLM_TIMEOUT_MS (default 120000) - --llm-attempts=<n> / DEEPDIVE_LLM_ATTEMPTS (default 3) LLMConfig gained optional timeoutMs and maxAttempts; resolveConfig sets defaults so library callers going through it pick them up for free (non-breaking). Tests: 21 new assertions (159 total, up from 141 pre-branch). Retry helper: exponential + jitter math, attempt cap, shouldRetry predicate, onRetry hook, abort signal at start and during sleep. LLM integration (mock HTTP server): first-try-OK, 500→500→200, 429→200, 400 does-not- retry, 401 does-not-retry, exhausted 5xx throws LLMError, per-call timeout on hung server, user-abort short-circuits retry loop.

askalf enabled auto-merge (squash) April 23, 2026 01:09

askalf merged commit 5766385 into master Apr 23, 2026
4 checks passed

askalf deleted the feat/retry-timeout branch April 23, 2026 01:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: LLM retry with exponential backoff + per-call timeout#5

feat: LLM retry with exponential backoff + per-call timeout#5
askalf merged 1 commit into
masterfrom
feat/retry-timeout

askalf commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

askalf commented Apr 23, 2026

Summary

Behavior

New API surface

New CLI flags

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant