
feat: surface swallowed response_future errors in interception proxy#1196

Open
rasdani wants to merge 2 commits into main from fix/interception-error-plumbing

Conversation


@rasdani rasdani commented Apr 19, 2026

Summary

Companion to #1194 (SSE keepalive). Fixes two remaining error-swallow paths in interception_utils.py analogous to the chunk-write failure path that #1191 already addressed.

Both paths currently catch exceptions from await response_future and silently let the handler return. On the streaming side, the chunk loop has already written data: [DONE] before the await, so the agent sees a clean "empty completion" → exits 0 → silent empty trajectory. On the non-streaming side, the handler returns HTTP 500 to the agent but never sets state["error"], so rubrics/metrics can't see the failure.
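The pre-fix shape reduces to a minimal sketch (the handler body is illustrative; only the response_future name and the data: [DONE] detail come from the PR):

```python
import asyncio

async def handler_before(response_future: asyncio.Future) -> str:
    """Pre-fix shape: the exception from the await is swallowed,
    so the caller (and the agent) sees a clean, empty completion."""
    try:
        await response_future
    except BaseException:
        pass  # swallowed: no state["error"] is ever set
    return "data: [DONE]"  # in the real code this was already written pre-await

async def main() -> str:
    fut: asyncio.Future = asyncio.get_running_loop().create_future()
    fut.set_exception(RuntimeError("upstream model call failed"))
    return await handler_before(fut)

result = asyncio.run(main())
```

Despite the RuntimeError set on the future, `result` is a clean `"data: [DONE]"` and the failure is invisible to everything downstream.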

Changes

  • Streaming path (_handle_streaming_response post-loop): narrow the catch from except BaseException to except Exception, plus an explicit except asyncio.CancelledError: raise (normal teardown via unregister_rollout must not be misclassified as an error). Set state["error"] = StreamInterrupted(...) via _set_rollout_error.
  • Non-streaming path (_handle_request else branch): set state["error"] = InterceptionError(...) before returning HTTP 500.
  • Add InterceptionError class alongside StreamInterrupted so rubrics/metrics can distinguish the two shapes (streaming cut → truncated SSE body; non-streaming failure → HTTP 500 to agent's OpenAI client).
  • Keep log levels at debug — messages may contain upstream internals (URLs with tokens, traceback fragments); ERROR severity would ship those to centralized log aggregation unnecessarily.
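Pattern-wise, the streaming fix looks roughly like this (StreamInterrupted and _set_rollout_error are the names the PR uses; their bodies here are minimal stand-ins):

```python
import asyncio

class StreamInterrupted(Exception):
    """Stand-in for the error class named in the PR."""

def _set_rollout_error(state: dict, err: Exception) -> None:
    # Minimal stand-in: the real helper lives in interception_utils.py.
    state["error"] = err

async def await_response_post_loop(response_future: asyncio.Future, state: dict) -> None:
    try:
        await response_future
    except asyncio.CancelledError:
        raise  # normal teardown via unregister_rollout, not an error
    except Exception as exc:
        # Surface the failure for rubrics/metrics; control flow continues,
        # since data: [DONE] was already written by the chunk loop.
        _set_rollout_error(state, StreamInterrupted(str(exc)))

async def main() -> dict:
    state: dict = {}
    fut: asyncio.Future = asyncio.get_running_loop().create_future()
    fut.set_exception(RuntimeError("vLLM raised mid-call"))
    await await_response_post_loop(fut, state)
    return state

state = asyncio.run(main())
```

Note the ordering: CancelledError must be caught before the broad Exception clause would matter on older interpreters, but since Python 3.8 CancelledError derives from BaseException, so `except Exception` alone would already let it propagate; the explicit re-raise documents the intent.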

Tests

Adds two tests in tests/test_interception_utils.py:

  • test_streaming_response_future_failure_surfaces_to_state
  • test_non_streaming_response_future_failure_surfaces_to_state

7 tests total pass locally. Ruff + format clean.
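A standalone approximation of the non-streaming test (the real fixtures in tests/test_interception_utils.py will differ; InterceptionError here is a stand-in):

```python
import asyncio

class InterceptionError(Exception):
    """Stand-in for the class the PR adds alongside StreamInterrupted."""

async def handle_non_streaming(response_future: asyncio.Future, state: dict) -> int:
    """Mimics the else branch of _handle_request: record the error in
    state before returning HTTP 500 to the agent."""
    try:
        await response_future
    except Exception as exc:
        state["error"] = InterceptionError(str(exc))
        return 500
    return 200

def test_non_streaming_response_future_failure_surfaces_to_state() -> None:
    async def run():
        state: dict = {}
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        fut.set_exception(RuntimeError("deliver_response set an exception"))
        return await handle_non_streaming(fut, state), state

    status, state = asyncio.run(run())
    assert status == 500
    assert isinstance(state["error"], InterceptionError)

test_non_streaming_response_future_failure_surfaces_to_state()
```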

Independent of #1194

This branch is off main and does not depend on the keepalive PR. Both can merge in either order.


Note

Medium Risk
Changes error propagation in the interception proxy for both streaming and non-streaming requests, which can alter rollout termination behavior and metrics capture. Risk is moderate because it touches async control flow and cancellation handling but is covered by new tests.

Overview
Ensures failures from response_future are no longer silently swallowed in the interception proxy, so rollouts halt visibly instead of producing empty/clean-looking completions.

For non-streaming requests, exceptions while awaiting the intercepted model result now set state["error"] (via a new InterceptionError) before returning HTTP 500. For streaming requests, exceptions from the post-stream response_future await now set state["error"] as StreamInterrupted, while CancelledError is explicitly re-raised to avoid misclassifying normal teardown.

Adds tests covering both streaming and non-streaming response_future failure paths.
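The distinction between the two error shapes can be consumed downstream roughly like this (classify_failure is a hypothetical metrics helper; only the two class names come from the PR):

```python
from typing import Optional

class StreamInterrupted(Exception):
    """Stand-in: streaming cut, agent saw a truncated SSE body."""

class InterceptionError(Exception):
    """Stand-in: non-streaming failure, agent's client saw HTTP 500."""

def classify_failure(error: Optional[Exception]) -> str:
    """Hypothetical rubric/metrics helper mapping state['error'] to a label."""
    if error is None:
        return "ok"
    if isinstance(error, StreamInterrupted):
        return "streaming_cut"
    if isinstance(error, InterceptionError):
        return "non_streaming_500"
    return "other"
```

A rubric would call something like classify_failure(state.get("error")) to bucket rollouts by failure shape.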

Reviewed by Cursor Bugbot for commit 7c105f5.

rasdani and others added 2 commits April 19, 2026 19:56
When the underlying model call for a streaming request fails (e.g. vLLM
raised mid-call and `synthesize_stream(error=X)` was invoked), the
post-loop `response_future` await in `_handle_streaming_response` raises.
Previously that was only logged at debug, letting the chunk loop's clean
`data: [DONE]` arrive at the agent. The agent then observes a
syntactically-valid but empty completion, exits 0, and the rollout
finishes with no `state["error"]` set — same silent-empty class that
#1191 fixed for the mid-stream transport case.

Funnel this failure into `state["error"]` as `StreamInterrupted` via
`_set_rollout_error`. Narrow the catch to `except Exception` + explicit
`except asyncio.CancelledError: raise` so normal teardown cancellation
(via `unregister_rollout` → `future.cancel()`) is not misclassified as
a stream error.

Control flow is unchanged: the stream loop's `data: [DONE]` write and
the subsequent `write_eof` + `return response` still run — only the
error bookkeeping is new.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Counterpart to the streaming response_future fix for the non-streaming
branch of `_handle_request`. When the upstream model call fails and
`deliver_response` sets the future's exception, the await re-raises.
Previously we only logged at debug + returned HTTP 500; now we also set
`state["error"]` via `_set_rollout_error`.

Add distinct `InterceptionError` class alongside `StreamInterrupted` so
rubrics and metrics can tell the two failure shapes apart: a streaming
cut leaves the agent with a truncated SSE body; a non-streaming failure
returns HTTP 500 to the agent's OpenAI client (agent sees a normal API
error). Keep the log level at debug — the exception message may include
upstream internals (URLs with tokens, tracebacks) and shouldn't ship to
centralized logs at ERROR severity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rasdani added a commit that referenced this pull request Apr 19, 2026
The two ``logger.error`` sites in ``_handle_streaming_response`` include
the raw exception message from aiohttp's transport / the upstream model
call. Those messages can contain URLs with tokens, Authorization headers
in tracebacks, or other internals. At ERROR severity they propagate to
centralized log aggregation where anyone with log access sees them
verbatim.

Drop both to ``debug`` — local triage still has access, but nothing
sensitive ships to aggregators. Matches the treatment of the
``response_future`` error logs in #1196.

The rollout itself still halts visibly: ``_set_rollout_error`` sets
``state["error"]`` and ``MultiTurnEnv.has_error`` fires at the next
``is_completed()`` check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rasdani rasdani requested a review from mikasenghaas April 20, 2026 15:03
Member

@mikasenghaas mikasenghaas left a comment


lgtm
