perf: add reasoning_effort passthrough to avoid thinking-mode latency in local models #1221
Conversation
Reasoning models (Gemma 4, Qwen 3, etc.) default to thinking mode, adding 10-40x latency for simple transcript cleanup. Pass through the OpenAI-compatible `reasoning_effort` parameter so users can disable thinking when speed matters more than deep reasoning.
- Add optional `reasoning_effort` field to `ChatCompletionRequest`
- Thread through `send_chat_completion` / `send_chat_completion_with_schema`
- New `set_post_process_reasoning_effort` Tauri command for persistence
- Dropdown in post-processing settings, Custom provider only
- i18n for en and ja
Could we simplify this further and just default to None, or off? I don't think anyone really wants reasoning for post-processing, as far as I can tell.
Thanks, that makes sense — I agree simpler is better here.
Addresses review feedback on the `reasoning_effort` passthrough: post-processing rarely benefits from reasoning, so a 5-way dropdown (default/none/low/medium/high) is overkill. Collapse to a single boolean toggle "Disable reasoning" that defaults to ON for Custom providers (sends `reasoning_effort: "none"`). Cloud providers continue to receive no `reasoning_effort` parameter.
- Replace `post_process_reasoning_effort: Option<String>` with `post_process_disable_reasoning: bool` (default true) in settings
- Map the bool to `Option<String>` at the request site (true -> "none", false -> omit) so the API contract is unchanged
- Rename the Tauri command to `set_post_process_disable_reasoning(disable)` so parameter naming matches the setting and avoids inversion bugs
- Swap the Dropdown for a ToggleSwitch in the Custom provider section
- Collapse five i18n strings to one label + description, and add the translation to all 18 additional locales so non-en/ja users don't fall back to English for this control
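The true -> "none", false -> omit mapping described above can be sketched roughly like this (the function name is illustrative, not the actual Handy identifier):

```rust
// Illustrative sketch of the setting-to-wire mapping (not the real Handy
// code): the stored setting stays a plain bool, and only at the request
// site does it become the optional wire parameter.
fn effort_param(disable_reasoning: bool) -> Option<&'static str> {
    if disable_reasoning {
        Some("none") // serialized as reasoning_effort: "none"
    } else {
        None // field omitted from the request entirely
    }
}
```

Keeping the inversion in exactly one place is what avoids the disable-vs-enable naming bugs the rename is guarding against.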
Force-pushed 25dd02d to 99c8299.
I think since this is an alpha feature, we can just disable reasoning entirely to simplify the settings. Let's hit issues directly if disabling reasoning causes problems. I think for this app it should have little to no impact.
Yeah, that makes sense. I do think reasoning control might get a bit inconsistent across models over time (params vs prompt-based control, etc.), so we might eventually want a bit more flexibility depending on provider behavior. But agree we shouldn't overcomplicate it now — I'll update the PR to just disable reasoning for post-processing.
…provider Per PR cjpais#1221 discussion with @cjpais: since this is an alpha feature, simplify the UI by always sending reasoning_effort: "none" for custom provider post-processing. If this causes issues with specific provider setups, we can revisit.
Done — removed the toggle entirely. Custom provider now always sends `reasoning_effort: "none"`.
Force-pushed e2e0f94 to d98521d.
Thank you! I will test this tomorrow or Tuesday and pull it in. This is a good change, so thank you for making it.
Tested this with Qwen 9B on my machine; it went from infinite processing to basically instant. This is a big win, thanks.

Before Submitting This PR
- Searched existing issues and PRs to ensure this isn't a duplicate
Human Written Description
I use Handy with Ollama for post-processing, and recently switched to
Gemma 4 because it handles mixed Japanese/English really well. The
problem is that Gemma 4 (and most recent open-weight models like Qwen 3)
ships with thinking mode enabled by default — so even for simple
transcript cleanup, the model spends a long time "reasoning" about
how to fix punctuation before it actually responds.
This hits all languages, but it's especially bad for Japanese — the
thinking phase generates far more tokens due to less efficient
tokenization. Post-processing matters more for Japanese too, since
Whisper is significantly weaker at mixed Japanese/English input,
making LLM cleanup essential rather than optional.
The fix is a one-field passthrough: the OpenAI-compatible API already has a `reasoning_effort` parameter for this. Setting it to `"none"` brings response time down to under a second with no observable quality difference for transcript cleanup tasks.
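On the wire, the important detail is that the field is omitted entirely when unset, rather than sent as null. A rough sketch (hand-rolled JSON purely for illustration; the real code presumably uses a serializer, and the model names are just examples):

```rust
// Illustrative only: build a minimal OpenAI-compatible request body by hand
// to show that reasoning_effort is omitted when unset, not serialized as null.
fn build_request_body(model: &str, reasoning_effort: Option<&str>) -> String {
    let mut body = format!(r#"{{"model":"{}","messages":[]"#, model);
    if let Some(effort) = reasoning_effort {
        body.push_str(&format!(r#","reasoning_effort":"{}""#, effort));
    }
    body.push('}');
    body
}
```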
Per discussion with @cjpais, this PR keeps things simple: Custom provider always sends `reasoning_effort: "none"` during post-processing, with no UI control. Post-processing rarely benefits from reasoning, and if a backend doesn't recognize the parameter it's typically ignored. Cloud providers are untouched — they handle this server-side or may reject unknown params.
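The resulting behavior collapses to a single decision at the request site; a sketch (the enum and function names here are illustrative, not the real identifiers):

```rust
// Illustrative: the only provider-specific branch this change introduces.
// Custom providers get reasoning_effort = "none"; all others omit the field.
enum Provider {
    Custom,
    OpenAi,
    Anthropic,
}

fn reasoning_effort_for(provider: &Provider) -> Option<&'static str> {
    match provider {
        Provider::Custom => Some("none"),
        _ => None,
    }
}
```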
Related: #1201 — @cjpais noted that thinking tokens should be stripped.
This prevents them from being generated in the first place (for the
Custom provider path).
Benchmark (Gemma 4 E4B, Apple Silicon, Ollama)
Output quality was identical across all test cases.
Changes
- Add optional `reasoning_effort` field to `ChatCompletionRequest` (omitted when unset, per OpenAI spec)
- Thread through `send_chat_completion` / `send_chat_completion_with_schema`
- `actions.rs`: always set `reasoning_effort: "none"` for Custom provider post-processing; `None` for all other providers

Known limitations
A couple of edge cases worth flagging — none are blockers for the
Ollama/Gemma target, but may surface later:
- `"none"` is an Ollama extension. The OpenAI spec only defines `minimal | low | medium | high`. If a user points Custom at OpenAI-proper with an o-series reasoning model, the request would be rejected. This is an edge case (o-series via Custom for post-processing is unusual), and we can add a fallback if anyone hits it.
- Qwen 3 (`/no_think`), Anthropic extended thinking, and some fine-tunes expect instructions in the system prompt rather than an API field. For those, `reasoning_effort` is silently ignored and users would need to disable thinking inside their prompt template. Not something this PR can solve generically.
- There is no escape hatch: if someone genuinely wants reasoning during post-processing with a local model, they'd need to change the source. Acceptable tradeoff given this is alpha and post-processing is a narrow use case.
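If the first limitation (`"none"` being an Ollama extension) ever bites, the fallback could be as simple as downgrading to the nearest OpenAI-spec value for backends known not to accept the extension. A hypothetical sketch — nothing like this is in the PR:

```rust
// Hypothetical fallback, not implemented in this PR: "none" is an Ollama
// extension, so downgrade it to "minimal" (the lowest OpenAI-spec effort)
// for backends that reject unknown effort values.
fn normalize_effort(effort: &str, backend_supports_none: bool) -> &str {
    if effort == "none" && !backend_supports_none {
        "minimal"
    } else {
        effort
    }
}
```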
Community Feedback
Issue: #1201
Testing
Tested locally on macOS (Apple Silicon) with Ollama + Gemma 4 E4B.
Verified `reasoning_effort: "none"` is sent for the Custom provider, other providers receive no `reasoning_effort` field, and post-processing latency drops to sub-second on Gemma 4 E4B.
AI Assistance
If AI was used:
All testing and verification done by me.