feat: LLM retry with exponential backoff + per-call timeout #5
Merged
A 3-round deep query fires 30+ LLM calls. At a 1% per-call failure rate the overall success rate is 74%. Retries with exponential backoff bring that above 99%.

callLLM now wraps every request in retry-with-backoff (default 3 attempts, 500ms base, 8s cap, ±25% jitter) and enforces a per-call timeout (default 120s).

Retry policy:
- HTTP 5xx → retry
- HTTP 429 → retry
- HTTP 4xx (non-429) → fail fast, never retry (malformed requests stay malformed)
- Fetch-level errors (network, DNS, TLS, timeout) → retry, unless the failure was from a user-initiated abort (the user's AbortSignal firing short-circuits the retry loop immediately)

New exports:
- LLMError class with retriable: boolean getter
- DEFAULT_LLM_TIMEOUT_MS, DEFAULT_LLM_ATTEMPTS constants
- retry(fn, opts) helper in src/retry.ts: injectable sleep/random for deterministic tests, shouldRetry predicate, onRetry hook, abort-signal aware

New CLI flags + env vars:
- --llm-timeout-ms=<ms> / DEEPDIVE_LLM_TIMEOUT_MS (default 120000)
- --llm-attempts=<n> / DEEPDIVE_LLM_ATTEMPTS (default 3)

LLMConfig gained optional timeoutMs and maxAttempts; resolveConfig sets defaults so library callers going through it pick them up for free (non-breaking).

Tests: 21 new assertions (159 total, up from 141 pre-branch).
- Retry helper: exponential + jitter math, attempt cap, shouldRetry predicate, onRetry hook, abort signal at start and during sleep.
- LLM integration (mock HTTP server): first-try-OK, 500→500→200, 429→200, 400 does-not-retry, 401 does-not-retry, exhausted 5xx throws LLMError, per-call timeout on hung server, user-abort short-circuits retry loop.
Summary
A 3-round deep query fires 30+ LLM calls. At a 1% per-call failure rate the overall success rate is 74%. Retries with exponential backoff bring that above 99%.
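The arithmetic behind those figures can be checked with a short sketch. It assumes exactly 30 calls, failures that are independent of each other, and retries that fail independently with the same 1% probability; `overallSuccessRate` is a hypothetical helper for illustration, not part of this PR.

```typescript
// Probability that all `calls` independent LLM calls succeed when each call
// is tried up to `attempts` times (hypothetical helper, illustration only).
function overallSuccessRate(perCallFailure: number, calls: number, attempts: number): number {
  // A call fails overall only if every one of its attempts fails.
  const failureAfterRetries = Math.pow(perCallFailure, attempts);
  return Math.pow(1 - failureAfterRetries, calls);
}

const noRetry = overallSuccessRate(0.01, 30, 1);   // 0.99^30
const withRetry = overallSuccessRate(0.01, 30, 3); // (1 - 0.01^3)^30

console.log(noRetry.toFixed(2));   // 0.74
console.log(withRetry.toFixed(5)); // 0.99997
```

Real failures are often correlated (an outage affects consecutive attempts), so the with-retry figure is an optimistic bound rather than a guarantee.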
Behavior
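callLLM now wraps every request in retry-with-backoff (default 3 attempts, 500ms base, 8s cap, ±25% jitter) and enforces a per-call timeout (default 120s).
Precise retry policy:
- HTTP 5xx → retry
- HTTP 429 → retry
- HTTP 4xx (non-429) → fail fast, never retry (malformed requests stay malformed)
- Fetch-level errors (network, DNS, TLS, timeout) → retry, unless the failure was from a user-initiated abort (the user's AbortSignal firing short-circuits the retry loop immediately)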
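As a rough illustration of the delay schedule and the status-code half of the policy, here is a minimal sketch. The names `BackoffOptions`, `backoffDelayMs`, and `isRetriableStatus` are hypothetical, and two details are assumptions rather than facts from this PR: the delay doubles on each attempt, and jitter is applied after the cap.

```typescript
// Hypothetical option names; the default values mirror the ones stated above.
interface BackoffOptions {
  baseMs: number; // delay before the first retry
  capMs: number;  // upper bound on the un-jittered delay
  jitter: number; // fractional jitter; 0.25 means ±25%
}

const DEFAULT_BACKOFF: BackoffOptions = { baseMs: 500, capMs: 8000, jitter: 0.25 };

// Delay to wait after failed attempt number `attempt` (1-based).
// `random` is injectable so tests can make the jitter deterministic.
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions = DEFAULT_BACKOFF,
  random: () => number = Math.random,
): number {
  // Assumption: doubling schedule (500, 1000, 2000, ...), clamped to the cap.
  const exponential = Math.min(opts.capMs, opts.baseMs * 2 ** (attempt - 1));
  // Map random() in [0, 1) to a factor in [1 - jitter, 1 + jitter).
  const factor = 1 + opts.jitter * (2 * random() - 1);
  return Math.round(exponential * factor);
}

// Status-code part of the policy: 5xx and 429 retry, every other 4xx fails fast.
function isRetriableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}
```

With `random` fixed at `() => 0.5` the jitter factor is exactly 1, so the first three delays are 500, 1000, and 2000 ms. Because jitter is applied after the cap in this sketch, a capped delay can still land anywhere between 6000 and 10000 ms.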
New API surface
- LLMError class with retriable: boolean getter
- DEFAULT_LLM_TIMEOUT_MS, DEFAULT_LLM_ATTEMPTS constants
- retry(fn, opts) in src/retry.ts: injectable sleep/random for tests, shouldRetry predicate, onRetry hook, abort-signal aware
New CLI flags
- --llm-timeout-ms=<ms> / DEEPDIVE_LLM_TIMEOUT_MS (default 120000)
- --llm-attempts=<n> / DEEPDIVE_LLM_ATTEMPTS (default 3)
LLMConfig gained optional timeoutMs + maxAttempts; resolveConfig sets defaults so library callers going through it pick them up for free (non-breaking).
Test plan
- npm run build: clean under strict: true
- npm test: 159 pass (up from 141 pre-branch), 0 fail
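To show why an injectable sleep makes retry tests deterministic, here is a minimal sketch of a retry helper and a 500→500→200 scenario. The `RetryOptions` shape and this `retry` body are assumptions for illustration; the actual signature in src/retry.ts may differ. One simplification to note: this version checks the abort flag only before each attempt, not during the sleep.

```typescript
// Hypothetical option shape; the real retry(fn, opts) in src/retry.ts may differ.
interface RetryOptions {
  maxAttempts: number;
  delayMs: (attempt: number) => number;
  shouldRetry?: (err: unknown) => boolean;
  onRetry?: (err: unknown, attempt: number, delayMs: number) => void;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  signal?: { aborted: boolean };         // e.g. an AbortSignal
}

async function retry<T>(fn: (attempt: number) => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; ; attempt++) {
    // Simplification: abort is checked before each attempt only.
    if (opts.signal?.aborted) throw new Error("aborted");
    try {
      return await fn(attempt);
    } catch (err) {
      const exhausted = attempt >= opts.maxAttempts;
      const retriable = opts.shouldRetry ? opts.shouldRetry(err) : true;
      if (exhausted || !retriable) throw err;
      const delay = opts.delayMs(attempt);
      opts.onRetry?.(err, attempt, delay);
      await sleep(delay);
    }
  }
}

// Simulated 500 → 500 → 200 sequence; the fake sleep records delays instead of waiting.
async function demo(): Promise<{ result: string; slept: number[] }> {
  const responses = [500, 500, 200];
  const slept: number[] = [];
  const result = await retry(
    async (attempt) => {
      const status = responses[attempt - 1];
      if (status !== 200) throw new Error(`HTTP ${status}`);
      return "ok";
    },
    {
      maxAttempts: 3,
      delayMs: (attempt) => 500 * 2 ** (attempt - 1), // jitter omitted for clarity
      sleep: async (ms) => { slept.push(ms); },
    },
  );
  return { result, slept };
}
```

Calling `demo()` resolves to `{ result: "ok", slept: [500, 1000] }` without any real waiting, which is the property that lets a test assert on the backoff schedule directly.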