Plan: Unified thinking follow-ups (Cohere, Mistral, profile flags) #4952
Draft
sarth6 wants to merge 2 commits into pydantic:main from
Covers the 5 items Douwe deferred from PR pydantic#4829:

- Cohere + Mistral `_translate_thinking()` implementation
- Profile flags for Meta, Qwen, MoonshotAI, Harmony, ZAI
- `thinking_mode` property on `ModelProfile`
- Google `google_supports_thinking_level` field rename

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Plan: Unified Thinking Follow-ups

## Background

The unified `thinking` setting landed across PRs #4640, #4812, and #4829. The design is simple: users set `thinking=True` or `thinking='high'` in `ModelSettings`, and each model's `_translate_thinking()` method converts that into the provider's native API format. The base `Model.prepare_request()` resolves the setting, strips it from `model_settings`, and places it on `model_request_parameters.thinking` for downstream consumption.

This works well for Anthropic, OpenAI, Google, Groq, Bedrock, xAI, OpenRouter, and Cerebras — all of which have `_translate_thinking()` implementations. But during review, Douwe identified several gaps that were explicitly deferred to follow-up PRs. This plan covers all five of them.

## 1. Cohere: Wire up `_translate_thinking()`

**Problem:** The Cohere profile sets `supports_thinking=True` and `thinking_always_enabled=True` for reasoning models, and the response handler already processes `ThinkingPart`. But the request side never reads `model_request_parameters.thinking` — the unified setting is silently ignored.

**Cohere's API:** The SDK (`cohere==5.20.6`) exposes `thinking: Thinking | None` on `AsyncClientV2.chat()`, where `Thinking` is a simple binary toggle with an optional token budget. There are no effort levels — `'low'`, `'medium'`, and `'high'` all map to the same `Thinking(type='enabled')`. Users who want fine-grained `token_budget` control can use a new `cohere_thinking` provider-specific setting.

**Why this is straightforward:** Cohere has no streaming support in pydantic-ai yet, so there's only one call site to modify (`_chat()`). The SDK sentinel is `OMIT`, already imported.

**Implementation:**
Add `_translate_thinking()` to `CohereModel`, following the same pattern as Groq (the simplest existing reference):
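A minimal, self-contained sketch of that translation — the `Thinking` dataclass and `OMIT` sentinel here are stand-ins for the real `cohere` SDK objects, and the method is shown as a plain function:

```python
from dataclasses import dataclass
from typing import Literal, Optional, Union

OMIT = object()  # stand-in for the SDK's "parameter not sent" sentinel


@dataclass
class Thinking:
    # stand-in for the SDK's Thinking type: binary toggle + optional budget
    type: Literal['enabled', 'disabled']
    token_budget: Optional[int] = None


def translate_thinking(thinking: Union[bool, str, None]):
    """Map the unified `thinking` setting to Cohere's native format."""
    if thinking is None:
        return OMIT  # not set: defer to the provider default
    if thinking is False:
        return Thinking(type='disabled')
    # True or any effort level: Cohere has no effort levels, so
    # 'low'/'medium'/'high' all collapse to a plain enable
    return Thinking(type='enabled')
```

The real method would return the SDK's `Thinking` type and be called from `_chat()` when building the request.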
client.chat()call alongside the existing parameters. Addcohere_thinking: ThinkingtoCohereModelSettingsfor users who need token budget control.True/ any effortThinking(type='enabled')FalseThinking(type='disabled')OMIT(provider default)Files:
models/cohere.py,tests/test_thinking.py,docs/thinking.md2. Mistral: Wire up
_translate_thinking()Problem: Same as Cohere — the profile flags are set for Magistral models, and
ThinkingPartis processed in responses viaMistralThinkChunk, but the request side never passes the thinking parameter.Mistral's API: The SDK (
mistralai==1.9.11) uses a different mechanism than most providers. Instead of a thinking-specific parameter, it usesprompt_mode: MistralPromptMode:Setting
Setting `prompt_mode='reasoning'` enables the reasoning system prompt for Magistral models. Omitting it (via `UNSET`) uses the default behavior. Like Cohere, this is purely binary — there are no effort levels.

**What makes Mistral trickier:** Unlike Cohere's single call site, Mistral has four places where chat calls are made:

- `_completions_create()` — non-streaming
- `_stream_completions_create()` — streaming with function tools
- `_stream_completions_create()` — streaming with output tools / JSON mode
- `_stream_completions_create()` — plain streaming (currently only passes `model`, `messages`, `stream`, `http_headers`)

All four need `prompt_mode=self._translate_thinking(...)` added. The SDK sentinel is `UNSET`, already imported.

**Implementation:**
For the provider-specific setting in `MistralModelSettings`, we use `Literal['reasoning']` rather than `MistralPromptMode` to avoid exposing the SDK's `UnrecognizedStr` union in our public API. This matches pydantic-ai's convention of using simple Literal types for provider-specific enums (see `GroqModelSettings.groq_reasoning_format`).

| Unified setting | Mistral value |
| --- | --- |
| `True` / any effort | `prompt_mode='reasoning'` |
| `False` (always-on) | `UNSET` (silently ignored — Magistral can't disable) |
| *(not set)* | `UNSET` (provider default) |

**Files:** `models/mistral.py`, `tests/test_thinking.py`, `docs/thinking.md`

## 3. Profile flags for profile-only providers
**Problem:** Several model families support thinking but their profile functions don't set `supports_thinking` or `thinking_always_enabled`. This means the unified `thinking` setting is silently dropped even when the model actually supports it — users get no feedback that their setting was ignored.

These are "profile-only" providers because they're accessed through hosting providers like OpenRouter, Ollama, Fireworks, or Together, which already have model implementations. The profile is the only touch point pydantic-ai has.

**Proposed changes:**
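Before the per-family details, a rough sketch of what the name-based flag detection could look like. The function names, the `ModelProfile` stand-in, and the exact substring checks are illustrative assumptions, not the real profile code:

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:  # simplified stand-in for pydantic_ai's ModelProfile
    supports_thinking: bool = False
    thinking_always_enabled: bool = False


def meta_profile_flags(model_name: str) -> ModelProfile:
    # Llama-4 reasoning variants: always-on thinking, detected by name
    is_reasoning = 'reasoning' in model_name.lower()
    return ModelProfile(supports_thinking=is_reasoning,
                        thinking_always_enabled=is_reasoning)


def qwen_profile_flags(model_name: str) -> ModelProfile:
    # QwQ family: always-on thinking
    is_qwq = model_name.lower().startswith('qwq')
    return ModelProfile(supports_thinking=is_qwq,
                        thinking_always_enabled=is_qwq)


def moonshotai_profile_flags(model_name: str) -> ModelProfile:
    # Kimi thinking models: optional thinking, so no always-on flag
    return ModelProfile(supports_thinking='thinking' in model_name.lower())
```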
### Meta (Llama-4)

Llama-4 reasoning variants include `"reasoning"` in the model name (e.g., `llama-4-scout-reasoning`). They use `<think>` tags for thinking output, which is already the default on `ModelProfile.thinking_tags`. These models always think — there's no API parameter to disable it.

### Qwen (QwQ)

QwQ models (e.g., `qwq-32b-preview`) are Qwen's reasoning-focused family. They also use `<think>` tags and always think. The existing `qwen_model_profile()` already handles multiple model families — QwQ detection needs to be added to the default return path.

### MoonshotAI (Kimi Thinking)

Kimi thinking models include `"thinking"` in the name (e.g., `kimi-thinking-preview`). Unlike Meta/Qwen, these are optional-thinking (not always-on), so only `supports_thinking=True` is needed.

### Harmony and ZAI

These need additional research before implementation. The original plan references `gpt-oss-120b` and `zai-glm-4.6` as thinking-capable, but we should verify this against current API documentation before setting flags. If the research doesn't clearly confirm thinking support, we skip them rather than risk setting incorrect profile flags.

**Files:**
`profiles/meta.py`, `profiles/qwen.py`, `profiles/moonshotai.py`, possibly `profiles/harmony.py` and `profiles/zai.py`, `tests/test_thinking.py`

## 4. `thinking_mode` property on `ModelProfile`

**Problem:** The base `prepare_request()` currently checks two booleans inline. This works but is a bit clunky. The two boolean flags (`supports_thinking`, `thinking_always_enabled`) represent a three-state enum that would be clearer as a derived property.

**Proposal:**
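A sketch of the derived property on a simplified stand-in of `ModelProfile`; the three mode names are illustrative, not settled API:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class ModelProfile:  # simplified stand-in
    supports_thinking: bool = False
    thinking_always_enabled: bool = False

    @property
    def thinking_mode(self) -> Literal['unsupported', 'optional', 'always_on']:
        # collapses the two boolean flags into the three valid states
        if self.thinking_always_enabled:
            return 'always_on'
        return 'optional' if self.supports_thinking else 'unsupported'
```

The fourth flag combination (`thinking_always_enabled=True` without `supports_thinking`) never occurs in practice, which is why two booleans reduce to three states.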
This is primarily a readability improvement. Douwe agreed it was worth adding but was lukewarm on refactoring `prepare_request()` to use it ("benefit's not super clear"), so the initial scope should be limited to adding the property and using it only where it clearly improves readability.

**Files:** `profiles/__init__.py`, optionally `models/__init__.py`, `tests/test_thinking.py`

## 5. Rename Google `google_supports_thinking_level`

**Problem:** The `GoogleModelProfile` field `google_supports_thinking_level: bool` is a boolean that distinguishes between Gemini 2.5 (budget-based) and Gemini 3+ (level-based) thinking APIs. A `Literal['budget', 'level'] | None` would be more expressive and self-documenting.

**Caution:** Douwe explicitly flagged backward compatibility concerns here ("I don't think this is worth doing, especially as it'd be backward incompatible"). Since `GoogleModelProfile` is a public dataclass, users may be constructing instances with `google_supports_thinking_level=True`.

**Approach:** Add the new `google_thinking_api: Literal['budget', 'level'] | None` field alongside the old one. Deprecate the old field but keep it working — in `__post_init__`, if the old field is set and the new one isn't, bridge them. This lets us migrate without breaking existing code.

**Files:**
`profiles/google.py`, `models/google.py`, `tests/test_thinking.py`

## PR Structure
These five items can be split into 2-3 PRs:
## Testing

Each item follows the existing test patterns in `tests/test_thinking.py`. The test classes instantiate the model with a specific profile, call `_translate_thinking()` directly, and assert the returned provider-native value.

For Cohere and Mistral, we also need to verify the values actually reach the SDK call. The existing test infrastructure for other providers (e.g., `TestGroqThinkingTranslation`) shows the pattern — construct model, prepare request, call the private method, check the output.

Profile flag tests use the profile factory functions directly (e.g., `meta_model_profile('llama-4-scout-reasoning').supports_thinking` should be `True`).