Plan: Mid-stream fallback support for FallbackModel#4953
Draft
sarth6 wants to merge 2 commits into pydantic:main
Conversation
Documents the architecture, proposed FallbackStreamedResponse wrapper design, and open questions for supporting mid-stream exception and response-based fallback in FallbackModel.request_stream(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use `@dataclass(init=False)` with custom `__init__` (matching `FallbackModel` pattern)
- Add `_get_event_iterator` stub (standard pattern, not an open question)
- Document `FinalResultEvent` behavior on fallback (proxy resets correctly)
- Document orphaned `PartEndEvent` on mid-stream failure
- Add `provider_response_id`, `provider_details`, `finish_reason` to proxy list
- Note redundant `_response_handlers` guard as intentional optimization
- Document `AsyncExitStack` lifecycle change (moved outside for-loop)
- Note `list.pop(0)` and list copy are intentional
- Change recommendation to opt-in flag per maintainer feedback
- Add test cases: first-chunk fallback, cancellation/`aclose()`
- Add open questions: `InstrumentedModel` interaction, `prepare_request` in fallback
- Add tracking issue placeholder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Mid-Stream Fallback for `FallbackModel`

## Problem Statement

`FallbackModel` supports fallback for streaming requests — but only when the exception occurs while opening the stream (the `model.request_stream(...)` call itself). Two scenarios are not handled:

1. An exception raised mid-stream, after the stream has opened and events have already been yielded.
2. A response handler (e.g. one inspecting `finish_reason`) rejects the completed response. Today response handlers are silently ignored for streaming.

The non-streaming `request()` path handles both. This plan brings streaming to parity.

## Background: How Streaming Flows Through the System

Understanding three layers is essential. Here's the full call chain from user code to provider:
### Layer 1: `StreamedResponse` (base class in `models/__init__.py`)

This is the abstract base class all provider-specific streams inherit from.

Key constraint: `__aiter__` is memoized — the iterator chain is created once and reused. You can't "restart" it.

### Layer 2: `AgentStream` (in `result.py`)

The user-facing wrapper. It holds a reference to the `StreamedResponse` and accesses these attributes directly:

| Attribute | Used by / delegates to |
| --- | --- |
| `__aiter__()` | `_get_usage_checking_stream_response(self._raw_stream_response, ...)` |
| `get()` | `self._raw_stream_response.get()` |
| `final_result_event` | `stream_output()` and `validate_response_output()` |
| `usage()` | `self._raw_stream_response.usage()` |
| `timestamp` | `self._raw_stream_response.timestamp` |

Any wrapper we create must correctly proxy all of these, plus the provider-set fields (`provider_response_id`, `provider_details`, `finish_reason`), which are set during iteration by provider-specific `_get_event_iterator` implementations.

### Layer 3: `FallbackModel.request_stream` (current implementation)

What's missing: No wrapping means no hook for mid-stream exceptions or post-stream response evaluation.
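The memoized `__aiter__` constraint from Layer 1 is why a failed stream can't simply be restarted. A stdlib-only sketch of that behavior (the class and names are illustrative stand-ins, not the real `StreamedResponse`):

```python
import asyncio
from collections.abc import AsyncIterator


class MemoizedStream:
    """Stand-in for a StreamedResponse-style class: __aiter__ builds its
    iterator chain once and then always returns the same object."""

    def __init__(self, items: list[str]):
        self._items = items
        self._event_iterator = None

    async def _iterate(self) -> AsyncIterator[str]:
        for item in self._items:
            yield item

    def __aiter__(self):
        # Memoized: created once, reused forever, so the stream cannot be
        # "restarted" from the outside after a failure.
        if self._event_iterator is None:
            self._event_iterator = self._iterate()
        return self._event_iterator


async def main() -> list[str]:
    stream = MemoizedStream(['a', 'b', 'c'])
    first = [e async for e in stream]
    second = [e async for e in stream]  # same exhausted iterator: empty
    return first + second


print(asyncio.run(main()))  # ['a', 'b', 'c']
```

Iterating the stream a second time yields nothing, because `__aiter__` hands back the already-exhausted iterator.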
## Recommended Solution: `FallbackStreamedResponse` Wrapper

### Design overview

#### Why override `__aiter__` (not `_get_event_iterator`)

There are three places we could intercept the stream. Only one works cleanly:

- `_get_event_iterator`: `__aiter__` wraps it with `PartEndEvent`/`FinalResultEvent` logic that uses `self._parts_manager`. On fallback, the parts manager state goes stale and those decorators break.
- `request_stream`: an `@asynccontextmanager` that yields once. Can't yield multiple `StreamedResponse`s. You'd still need a wrapper.
- `__aiter__`: delegate to the inner stream's `__aiter__` (which handles its own `PartEnd`/`FinalResult` logic). On fallback, swap to a new inner stream whose `__aiter__` handles the new events correctly. No base class state to manage.

Consequence of the `__aiter__` override: The wrapper bypasses the base class's `iterator_with_final_event` and `iterator_with_part_end` decorators entirely — those run inside each inner stream's own `__aiter__`. This means:

- `FinalResultEvent` on fallback: If model A emits a `FinalResultEvent` before failing, the property proxy (`self._inner.final_result_event`) pointed to model A's event. After swapping `_inner` to model B, it now points to model B's `final_result_event`, which is `None` until model B's stream produces one. `AgentStream.stream_output()` checks `final_result_event is None` on each loop iteration (`result.py:73`), so after the swap it correctly skips partial output validation until model B's `FinalResultEvent` arrives. This is correct behavior.
- Orphaned `PartEndEvent`: When model A fails mid-stream, its `iterator_with_part_end` never completes, so the last part from model A won't get a `PartEndEvent`. Consumers tracking part boundaries will see an orphaned `PartStartEvent` without a matching end. This is benign for all current consumers (`AgentStream` doesn't track part boundary state), but worth noting in documentation.
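The chosen approach (override `__aiter__`, swap the inner stream on failure) can be sketched with plain async iterators standing in for `StreamedResponse`s; all names here are illustrative, not the real pydantic-ai API:

```python
import asyncio
from collections.abc import AsyncIterator, Callable


class FallbackStreamWrapper:
    """Generic sketch of __aiter__-level fallback: iterate the current
    inner stream and, on failure, swap in the next one mid-iteration."""

    def __init__(self, stream_factories: list[Callable[[], AsyncIterator[str]]]):
        self._factories = list(stream_factories)
        self._inner = self._factories.pop(0)()

    async def _iterate_with_fallback(self) -> AsyncIterator[str]:
        while True:
            try:
                async for event in self._inner:
                    yield event
                return  # inner stream finished cleanly
            except Exception:
                if not self._factories:
                    raise  # no models left: re-raise the mid-stream error
                # Swap to the next inner stream; in the real design its own
                # __aiter__ handles Part/FinalResult bookkeeping correctly.
                self._inner = self._factories.pop(0)()

    def __aiter__(self) -> AsyncIterator[str]:
        return self._iterate_with_fallback()


async def flaky() -> AsyncIterator[str]:
    yield 'A1'
    raise ConnectionError('dropped mid-stream')


async def healthy() -> AsyncIterator[str]:
    yield 'B1'
    yield 'B2'


async def main() -> list[str]:
    wrapper = FallbackStreamWrapper([flaky, healthy])
    return [e async for e in wrapper]


print(asyncio.run(main()))  # ['A1', 'B1', 'B2']
```

Note that the consumer sees events from both models ('A1' then 'B1', 'B2'), which is exactly the "mixed events" problem discussed later.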
#### The wrapper class

`FallbackModel` uses `@dataclass(init=False)` with a custom `__init__`. `StreamedResponse` uses `field(init=False)` for internal state. The wrapper should follow the same `@dataclass(init=False)` pattern as `FallbackModel` to avoid exposing `_`-prefixed fields as dataclass constructor parameters.

Orphaned base class state: The wrapper inherits `_parts_manager`, `_usage`, `final_result_event` etc. from `StreamedResponse` as dataclass fields. These are never populated because all delegation goes through `self._inner`. This is an unavoidable cost of subclassing — `AgentStream` type-checks for `StreamedResponse`, so composition without inheritance isn't an option. The orphaned fields are inert and harmless: `_parts_manager` is only accessed inside provider `_get_event_iterator` implementations, never from `result.py` or `_agent_graph.py`. The `get()` override ensures the wrapper never uses its own `_parts_manager`.

#### `__aiter__` — the fallback loop

#### `_get_event_iterator` — abstract method stub

Since we override `__aiter__` and never use the base class's iteration chain, we still must implement the abstract method. This follows the same pattern as the base class itself (`models/__init__.py:1096-1098`).
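The two patterns above, `@dataclass(init=False)` with a custom `__init__` plus a concrete stub for the abstract method, can be sketched with stand-in classes (illustrative names, not the real pydantic-ai types):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class StreamBase(ABC):
    """Stand-in for the abstract base class: internal state is declared
    with field(init=False), so it never appears as a constructor param."""

    _state: list[str] = field(default_factory=list, init=False)

    @abstractmethod
    def _get_event_iterator(self):
        raise NotImplementedError()


@dataclass(init=False)
class FallbackWrapper(StreamBase):
    """init=False plus a custom __init__ keeps underscore-prefixed fields
    out of the generated signature, mirroring the FallbackModel pattern."""

    def __init__(self, inner: object):
        self._inner = inner
        self._state = []

    def _get_event_iterator(self):
        # Concrete stub: never called because __aiter__ is overridden in
        # the real design, but the abstract method must be implemented.
        raise NotImplementedError()


w = FallbackWrapper(inner='inner-stream')
print(w._inner)  # inner-stream
```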
#### Property/method delegation

All attributes that `AgentStream` or external consumers access must proxy to `self._inner`.

Note on `final_result_event`: The base class sets this during `__aiter__` via `iterator_with_final_event`. Since we delegate to the inner stream's `__aiter__`, it gets set on `self._inner`, not on `self`. The property proxy means: after a fallback swap, the property automatically points to the new inner stream's (initially `None`) `final_result_event`, which is correct — `AgentStream.stream_output()` will correctly pause partial output validation until the new model's `FinalResultEvent` arrives.
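A minimal sketch of the delegation pattern, showing how a fallback swap automatically re-points every proxied attribute (the classes and field values are hypothetical stand-ins):

```python
from __future__ import annotations

import datetime


class InnerStream:
    """Stand-in for a provider stream whose fields get set during iteration."""

    def __init__(self, name: str):
        self.model_name = name
        self.final_result_event: str | None = None
        self.timestamp = datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc)


class DelegatingWrapper:
    """Each proxied attribute is a property over self._inner, so swapping
    _inner re-points every attribute at the new stream at once."""

    def __init__(self, inner: InnerStream):
        self._inner = inner

    @property
    def model_name(self) -> str:
        return self._inner.model_name

    @property
    def final_result_event(self) -> str | None:
        return self._inner.final_result_event

    @property
    def timestamp(self) -> datetime.datetime:
        return self._inner.timestamp


wrapper = DelegatingWrapper(InnerStream('model-a'))
wrapper._inner.final_result_event = 'final-from-a'
wrapper._inner = InnerStream('model-b')  # the mid-stream fallback swap
print(wrapper.model_name, wrapper.final_result_event)  # model-b None
```

After the swap, `final_result_event` reads as `None` again until the new stream produces one, which is the behavior the note above describes.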
#### Opening the next model's stream

Note on `list.pop(0)`: This is O(n) but the list is tiny (typically 2-3 models). The list is a copy (sliced from `self.models[i + 1:]` in the constructor), so popping doesn't mutate `FallbackModel.models`.

Note on `prepare_request`/`_set_span_attributes`: The current code calls `model.prepare_request(...)` and `self._set_span_attributes(model, prepared_parameters)` for the initial model. `_open_next_stream` should also call `prepare_request` on each fallback model to ensure `ModelRequestParameters` are customized correctly. Span attributes are harder — the wrapper doesn't have span access. See open questions.

Why `AsyncExitStack`: Each model's `request_stream` is an async context manager that owns the HTTP connection. We use a shared `AsyncExitStack` so that:

- each newly opened stream is registered on the stack (`stack.enter_async_context`);
- everything is cleaned up together when `FallbackModel.request_stream`'s `async with` block exits.

Resource lifecycle change: The current implementation creates a new `AsyncExitStack` per model attempt inside the for-loop. The updated design moves it outside the loop so the wrapper can open new streams on the same stack. This means previously-failed streams' resources aren't cleaned up until the outermost `async with` exits, rather than immediately after each failed attempt. In practice this is benign — HTTP connections are lightweight and the stack cleans them all up promptly when `request_stream` exits — but it is a behavioral change worth noting.

### Updated `FallbackModel.request_stream`
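The shared-stack lifecycle described above can be demonstrated with the stdlib alone; `fake_stream` is a hypothetical stand-in for a provider's `request_stream` context manager:

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

closed: list[str] = []


@asynccontextmanager
async def fake_stream(name: str):
    # Stand-in for one provider's request_stream context manager, which
    # owns an HTTP connection in the real system.
    try:
        yield name
    finally:
        closed.append(name)


async def main() -> None:
    async with AsyncExitStack() as stack:
        # A failed attempt's stream stays registered on the shared stack;
        # nothing is cleaned up until the outer `async with` exits.
        await stack.enter_async_context(fake_stream('model-a'))
        await stack.enter_async_context(fake_stream('model-b'))
        assert closed == []
    # Exiting the stack unwinds registered contexts in LIFO order.


asyncio.run(main())
print(closed)  # ['model-b', 'model-a']
```

Both contexts survive until the stack exits, and cleanup runs last-in-first-out, which is the deferred-cleanup trade-off noted above.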
### The "Mixed Events" Problem

When fallback happens mid-stream, the caller has already received events from the failed model. The caller sees text fragments from two different models. The impact depends on the consumer:

| Consumer | Impact |
| --- | --- |
| `get()` at end (most agent workflows) | Returns only the successful model's response. |
| `stream_output()` (structured output) | `final_result_event` proxy resets to `None` on swap; partial validation pauses until the new model's `FinalResultEvent` arrives. But partial outputs from model A were already yielded. |
| `stream_text()` (frontend streaming) | Text from two models; `PartStartEvent` without `PartEndEvent` from the failed model. |
FallbackModel." Given that guidance:Recommendation: Opt-in via a flag (e.g.
stream_fallback: bool = False). This respects the maintainer's stated preference, avoids surprising frontend consumers by default, and still allows users who want the behavior to enable it. The implementation should be structured so that the flag is a singleifcheck — ifFalse, the wrapper is not created and current behavior is preserved.Alternatives Considered
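Structurally, the single-if-check shape could look like this sketch; `FallbackModelSketch` and its `request_stream` are illustrative stand-ins, not the real API:

```python
class FallbackModelSketch:
    """Hypothetical sketch of the proposed opt-in flag."""

    def __init__(self, *models: str, stream_fallback: bool = False):
        self.models = list(models)
        self.stream_fallback = stream_fallback

    def request_stream(self, raw_stream: object) -> object:
        # Single if-check: when the flag is off, no wrapper is created and
        # the current (open-time-only) fallback behavior is preserved.
        if not self.stream_fallback:
            return raw_stream
        return ('wrapped', raw_stream)  # stand-in for FallbackStreamedResponse


default = FallbackModelSketch('model-a', 'model-b')
opted_in = FallbackModelSketch('model-a', 'model-b', stream_fallback=True)
print(default.request_stream('s'))   # s
print(opted_in.request_stream('s'))  # ('wrapped', 's')
```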
## Alternatives Considered

### Buffering the full stream before yielding

Buffer all events from the provider, evaluate handlers, then yield events to the caller only if accepted.

Rejected: Defeats the purpose of streaming. Adds latency equal to the full response generation time. Would require a fundamentally different `StreamedResponse` that replays buffered events.

### Post-stream evaluation only (no mid-stream exception handling)

Only handle response-based fallback (evaluate after the stream completes). Let mid-stream exceptions propagate as today.

Rejected: Incomplete — mid-stream exceptions are the more common real-world failure mode (connection drops, timeouts). Response-based fallback alone is less useful.

### Sentinel event to signal "restart"

Emit a new `FallbackRestartEvent` when switching models, so frontend consumers can clear their buffer.

Interesting but premature: Requires a new event type (version policy consideration), and no existing consumers know how to handle it. Could be added later if the mixed events problem proves painful.
## Implementation Steps

### Files to modify

- `pydantic_ai_slim/pydantic_ai/models/fallback.py`: add the `FallbackStreamedResponse` class; update `request_stream` to wrap the response; add the `stream_fallback` parameter
- `tests/models/test_fallback.py`
- `docs/models/overview.md`
### Test cases

All tests use `FunctionModel` with `stream_function` to control behavior:

- Mid-stream exception triggers fallback; `all_messages()` shows model B's response
- All models fail: `FallbackExceptionGroup` with all exceptions
- Response handler rejects: `FallbackExceptionGroup` with `ResponseRejected`
- `stream_fallback=False`: exception propagates (current behavior)
- `get()` returns the correct model's response
- `model_name`, `usage`, `timestamp` come from the successful model
- `final_result_event` comes from the successful model
- Async response handlers (`async def handler(exc) -> bool`)
- Cancellation (`break` from `async for` / `aclose()`)
- `AsyncExitStack` cleans up; no resource leaks
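The cancellation/`aclose()` case hinges on async-generator cleanup semantics, which this stdlib-only sketch demonstrates (names illustrative):

```python
import asyncio

cleaned_up = False


async def stream_events():
    """Stand-in for a provider stream; the finally block mimics the
    resource cleanup the real AsyncExitStack would perform."""
    global cleaned_up
    try:
        yield 'chunk-1'
        yield 'chunk-2'
    finally:
        cleaned_up = True


async def main() -> None:
    agen = stream_events()
    async for _ in agen:
        break  # consumer abandons the stream mid-way
    # Breaking out of `async for` does not close the generator; an explicit
    # aclose() (or garbage collection) is what runs the finally block.
    await agen.aclose()


asyncio.run(main())
print('cleaned up:', cleaned_up)  # cleaned up: True
```

A leak test would assert that cleanup ran after the consumer breaks out and the stream is closed.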
### Documentation updates

- Update `docs/models/overview.md` to describe the `stream_fallback` parameter.
- Document the orphaned-part caveat (`PartStartEvent` without `PartEndEvent` from the failed model).
- Add an example of `stream_fallback=True` with `run_stream()`.
## Open Questions for Maintainers

1. Flag name and default: We propose `stream_fallback: bool = False` on `FallbackModel.__init__`. Is this the right name? Douwe suggested a flag is needed — confirm the default should be opt-in (`False`).
2. Usage accounting: Should `usage()` report only the successful model's usage, or accumulate across all attempts? Non-streaming `request()` returns only the successful model's usage, suggesting consistency. But accumulated usage is arguably more accurate for billing. Current plan: delegate to `self._inner.usage()` (successful model only) for consistency.
3. Span attributes on fallback: `_set_span_attributes` is called when the initial stream opens. If fallback switches models, the span still reflects the first model. The wrapper could accept a callback to update span attributes, or we could pass the span reference. What's the preferred approach?
4. `InstrumentedModel` interaction: When models in the fallback list are wrapped in `InstrumentedModel`, each model's `request_stream` has a `finally` block that calls `response_stream.get()` to finish the trace span. If we stop iterating a stream midway (on fallback), the `InstrumentedModel` finally-block will call `get()` on a partially-consumed stream, producing a partial/incomplete `ModelResponse` in the trace. Is this acceptable, or should we investigate suppressing the trace for failed streams?
5. `prepare_request` in `_open_next_stream`: Should `_open_next_stream` call `model.prepare_request(...)` for each fallback model? The initial loop does this. Currently omitted from the wrapper for simplicity, but could lead to incorrect `ModelRequestParameters` for fallback models that customize them.

🤖 Generated with Claude Code