perf: add reasoning_effort passthrough to avoid thinking-mode latency in local models #1221
Conversation
Reasoning models (Gemma 4, Qwen 3, etc.) default to thinking mode, adding 10-40x latency for simple transcript cleanup. Pass through the OpenAI-compatible `reasoning_effort` parameter so users can disable thinking when speed matters more than deep reasoning.
- Add optional `reasoning_effort` field to `ChatCompletionRequest`
- Thread through `send_chat_completion` / `send_chat_completion_with_schema`
- New `set_post_process_reasoning_effort` Tauri command for persistence
- Dropdown in post-processing settings, Custom provider only
- i18n for en and ja
Could we simplify this further and just default to None, or off? I don't think anyone really wants reasoning for post-processing, as far as I can tell.
Thanks, that makes sense — I agree simpler is better here.
Addresses review feedback on the `reasoning_effort` passthrough: post-processing rarely benefits from reasoning, so a 5-way dropdown (default/none/low/medium/high) is overkill. Collapse to a single boolean toggle "Disable reasoning" that defaults to ON for Custom providers (sends `reasoning_effort: "none"`). Cloud providers continue to receive no `reasoning_effort` parameter.
- Replace `post_process_reasoning_effort: Option<String>` with `post_process_disable_reasoning: bool` (default true) in settings
- Map the bool to `Option<String>` at the request site (true -> "none", false -> omit) so the API contract is unchanged
- Rename the Tauri command to `set_post_process_disable_reasoning(disable)` so parameter naming matches the setting and avoids inversion bugs
- Swap the Dropdown for a ToggleSwitch in the Custom provider section
- Collapse five i18n strings to one label + description, and add the translation to all 18 additional locales so non-en/ja users don't fall back to English for this control
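The true -> "none", false -> omit mapping described above can be sketched roughly like this (the function name is illustrative, not the actual Handy identifier):

```rust
// Illustrative sketch of the setting-to-wire mapping (not the real Handy
// code): the stored setting stays a plain bool, and only at the request
// site does it become the optional wire parameter.
fn effort_param(disable_reasoning: bool) -> Option<&'static str> {
    if disable_reasoning {
        Some("none") // serialized as reasoning_effort: "none"
    } else {
        None // field omitted from the request entirely
    }
}
```

Keeping the inversion in exactly one place is what avoids the disable-vs-enable naming bugs the rename is guarding against.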
Force-pushed 25dd02d to 99c8299.
I think since this is an alpha feature, we can just disable reasoning entirely to simplify the settings. Let's hit issues directly if disabling reasoning causes problems. I think for this app it should have little to no impact.
Yeah, that makes sense. I do think reasoning control might get a bit inconsistent across models over time (params vs prompt-based control, etc.), so we might eventually want a bit more flexibility depending on provider behavior. But agree we shouldn't overcomplicate it now — I'll update the PR to just disable reasoning for post-processing.
…provider Per PR cjpais#1221 discussion with @cjpais: since this is an alpha feature, simplify the UI by always sending reasoning_effort: "none" for custom provider post-processing. If this causes issues with specific provider setups, we can revisit.
Done — removed the toggle entirely. Custom provider now always sends `reasoning_effort: "none"`.
Force-pushed e2e0f94 to d98521d.
Thank you! I will test this tomorrow or Tuesday and pull it in. This is a good change, so thank you for making it.
Tested this with Qwen 9B on my machine; it went from infinite processing to basically instant. This is a big win, thanks.

Before Submitting This PR
- Searched existing issues and PRs to ensure this isn't a duplicate
Human Written Description
I use Handy with Ollama for post-processing, and recently switched to
Gemma 4 because it handles mixed Japanese/English really well. The
problem is that Gemma 4 (and most recent open-weight models like Qwen 3)
ships with thinking mode enabled by default — so even for simple
transcript cleanup, the model spends a long time "reasoning" about
how to fix punctuation before it actually responds.
This hits all languages, but it's especially bad for Japanese — the
thinking phase generates far more tokens due to less efficient
tokenization. Post-processing matters more for Japanese too, since
Whisper is significantly weaker at mixed Japanese/English input,
making LLM cleanup essential rather than optional.
The fix is a one-field passthrough: the OpenAI-compatible API already has a `reasoning_effort` parameter for this. Setting it to `"none"` brings response time down to under a second with no observable quality difference for transcript cleanup tasks.
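On the wire, the important detail is that the field is omitted entirely when unset, rather than sent as null. A rough sketch (hand-rolled JSON purely for illustration; the real code presumably uses a serializer, and the model names are just examples):

```rust
// Illustrative only: build a minimal OpenAI-compatible request body by hand
// to show that reasoning_effort is omitted when unset, not serialized as null.
fn build_request_body(model: &str, reasoning_effort: Option<&str>) -> String {
    let mut body = format!(r#"{{"model":"{}","messages":[]"#, model);
    if let Some(effort) = reasoning_effort {
        body.push_str(&format!(r#","reasoning_effort":"{}""#, effort));
    }
    body.push('}');
    body
}
```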
Per discussion with @cjpais, this PR keeps things simple: Custom provider always sends `reasoning_effort: "none"` during post-processing, with no UI control. Post-processing rarely benefits from reasoning, and if a backend doesn't recognize the parameter it's typically ignored. Cloud providers are untouched — they handle this server-side or may reject unknown params.
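The resulting behavior collapses to a single decision at the request site; a sketch (the enum and function names here are illustrative, not the real identifiers):

```rust
// Illustrative: the only provider-specific branch this change introduces.
// Custom providers get reasoning_effort = "none"; all others omit the field.
enum Provider {
    Custom,
    OpenAi,
    Anthropic,
}

fn reasoning_effort_for(provider: &Provider) -> Option<&'static str> {
    match provider {
        Provider::Custom => Some("none"),
        _ => None,
    }
}
```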
Related: #1201 — @cjpais noted that thinking tokens should be stripped.
This prevents them from being generated in the first place (for the
Custom provider path).
Benchmark (Gemma 4 E4B, Apple Silicon, Ollama)
Output quality was identical across all test cases.
Changes
- Add optional `reasoning_effort` field to `ChatCompletionRequest` (omitted when unset, per OpenAI spec)
- Thread through `send_chat_completion` / `send_chat_completion_with_schema`
- `actions.rs`: always set `reasoning_effort: "none"` for Custom provider post-processing; `None` for all other providers

Known limitations
A couple of edge cases worth flagging — none are blockers for the
Ollama/Gemma target, but may surface later:
- `"none"` is an Ollama extension. The OpenAI spec only defines `minimal | low | medium | high`. If a user points Custom at OpenAI-proper with an o-series reasoning model, the request would be rejected. This is an edge case (o-series via Custom for post-processing is unusual), and we can add a fallback if anyone hits it.
- Qwen 3 (`/no_think`), Anthropic extended thinking, and some fine-tunes expect instructions in the system prompt rather than an API field. For those, `reasoning_effort` is silently ignored and users would need to disable thinking inside their prompt template. Not something this PR can solve generically.
- There is no escape hatch: if someone genuinely wants reasoning during post-processing with a local model, they'd need to change the source. Acceptable tradeoff given this is alpha and post-processing is a narrow use case.
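If the first limitation (`"none"` being an Ollama extension) ever bites, the fallback could be as simple as downgrading to the nearest OpenAI-spec value for backends known not to accept the extension. A hypothetical sketch — nothing like this is in the PR:

```rust
// Hypothetical fallback, not implemented in this PR: "none" is an Ollama
// extension, so downgrade it to "minimal" (the lowest OpenAI-spec effort)
// for backends that reject unknown effort values.
fn normalize_effort(effort: &str, backend_supports_none: bool) -> &str {
    if effort == "none" && !backend_supports_none {
        "minimal"
    } else {
        effort
    }
}
```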
Community Feedback
Issue: #1201
Testing
Tested locally on macOS (Apple Silicon) with Ollama + Gemma 4 E4B.
Verified `reasoning_effort: "none"` is sent for the Custom provider, other providers receive no `reasoning_effort` field, and post-processing latency drops to sub-second on Gemma 4 E4B.
AI Assistance
If AI was used:
All testing and verification done by me.