feat: surface swallowed response_future errors in interception proxy #1196
Open
Conversation
When the underlying model call for a streaming request fails (e.g. vLLM raised mid-call and `synthesize_stream(error=X)` was invoked), the post-loop `response_future` await in `_handle_streaming_response` raises. Previously that was only logged at debug, letting the chunk loop's clean `data: [DONE]` arrive at the agent. The agent then observes a syntactically-valid but empty completion, exits 0, and the rollout finishes with no `state["error"]` set — same silent-empty class that #1191 fixed for the mid-stream transport case. Funnel this failure into `state["error"]` as `StreamInterrupted` via `_set_rollout_error`. Narrow the catch to `except Exception` + explicit `except asyncio.CancelledError: raise` so normal teardown cancellation (via `unregister_rollout` → `future.cancel()`) is not misclassified as a stream error. Control flow is unchanged: the stream loop's `data: [DONE]` write and the subsequent `write_eof` + `return response` still run — only the error bookkeeping is new. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Counterpart to the streaming response_future fix for the non-streaming branch of `_handle_request`. When the upstream model call fails and `deliver_response` sets the future's exception, the await re-raises. Previously we only logged at debug + returned HTTP 500; now we also set `state["error"]` via `_set_rollout_error`. Add distinct `InterceptionError` class alongside `StreamInterrupted` so rubrics and metrics can tell the two failure shapes apart: a streaming cut leaves the agent with a truncated SSE body; a non-streaming failure returns HTTP 500 to the agent's OpenAI client (agent sees a normal API error). Keep the log level at debug — the exception message may include upstream internals (URLs with tokens, tracebacks) and shouldn't ship to centralized logs at ERROR severity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rasdani added a commit that referenced this pull request on Apr 19, 2026:
The two `logger.error` sites in `_handle_streaming_response` include the raw exception message from aiohttp's transport / the upstream model call. Those messages can contain URLs with tokens, Authorization headers in tracebacks, or other internals. At ERROR severity they propagate to centralized log aggregation where anyone with log access sees them verbatim. Drop both to `debug` — local triage still has access, but nothing sensitive ships to aggregators. Matches the treatment of the `response_future` error logs in #1196. The rollout itself still halts visibly: `_set_rollout_error` sets `state["error"]` and `MultiTurnEnv.has_error` fires at the next `is_completed()` check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
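The severity policy can be sketched with standard `logging`; the logger name and helper below are illustrative, not the repository's code:

```python
import logging

logger = logging.getLogger("interception_utils")


def log_transport_failure(exc: Exception) -> None:
    # Sketch of the demotion: the raw message can embed URLs with tokens or
    # Authorization headers from tracebacks, so it stays at DEBUG (local
    # triage only) rather than ERROR (shipped to centralized aggregation).
    # The rollout still halts visibly via state["error"], so no operational
    # signal is lost by quieting the log.
    logger.debug("chunk write failed: %s", exc)
```

With the logger at a typical aggregator threshold (INFO or above), the sensitive message is filtered out entirely; dropping to DEBUG locally restores it for triage.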
Summary
Companion to #1194 (SSE keepalive). Fixes two remaining error-swallow paths in `interception_utils.py`, analogous to the chunk-write failure path that #1191 already addressed.

Both paths currently caught exceptions from `await response_future` and silently let the handler return. On the streaming side, the chunk loop had already written `data: [DONE]` before the await, so the agent sees a clean "empty completion" → exits 0 → silent empty trajectory. On the non-streaming side, the handler returned HTTP 500 to the agent but never set `state["error"]`, so rubrics/metrics couldn't see the failure.

Changes

- `_handle_streaming_response` (post-loop): narrow `except BaseException` → `except Exception` with an explicit `except asyncio.CancelledError: raise` (normal teardown via `unregister_rollout` must not be misclassified as an error). Set `state["error"] = StreamInterrupted(...)` via `_set_rollout_error`.
- `_handle_request` (else branch): set `state["error"] = InterceptionError(...)` before returning HTTP 500.
- Add a distinct `InterceptionError` class alongside `StreamInterrupted` so rubrics/metrics can distinguish the two shapes (streaming cut → truncated SSE body; non-streaming failure → HTTP 500 to the agent's OpenAI client).
- Keep the log level at `debug` — messages may contain upstream internals (URLs with tokens, traceback fragments); ERROR severity would ship those to centralized log aggregation unnecessarily.

Tests
Adds two tests in `tests/test_interception_utils.py`:

- `test_streaming_response_future_failure_surfaces_to_state`
- `test_non_streaming_response_future_failure_surfaces_to_state`

7 tests total pass locally. Ruff + format clean.
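The point of keeping two exception classes is that downstream consumers can branch on the failure shape. A hypothetical rubric-side check (class docstrings and the `failure_shape` helper are illustrative, not from the repo):

```python
class StreamInterrupted(Exception):
    """Streaming failure: the agent saw a truncated SSE body."""


class InterceptionError(Exception):
    """Non-streaming failure: the agent's OpenAI client got HTTP 500."""


def failure_shape(state: dict) -> str:
    # How a rubric or metric might tell the two shapes apart.
    err = state.get("error")
    if isinstance(err, StreamInterrupted):
        return "truncated_stream"
    if isinstance(err, InterceptionError):
        return "api_error_500"
    return "none" if err is None else "other"
```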
Independent of #1194
This branch is off `main` and does not depend on the keepalive PR. Both can merge in either order.

Note
Medium Risk
Changes error propagation in the interception proxy for both streaming and non-streaming requests, which can alter rollout termination behavior and metrics capture. Risk is moderate because it touches async control flow and cancellation handling but is covered by new tests.
Overview
Ensures failures from `response_future` are no longer silently swallowed in the interception proxy, so rollouts halt visibly instead of producing empty/clean-looking completions.

For non-streaming requests, exceptions while awaiting the intercepted model result now set `state["error"]` (via a new `InterceptionError`) before returning HTTP 500. For streaming requests, exceptions from the post-stream `response_future` await now set `state["error"]` as `StreamInterrupted`, while `CancelledError` is explicitly re-raised to avoid misclassifying normal teardown.

Adds tests covering both streaming and non-streaming `response_future` failure paths.

Reviewed by Cursor Bugbot for commit 7c105f5.