
perf: add reasoning_effort passthrough to avoid thinking-mode latency in local models#1221

Merged
cjpais merged 4 commits into cjpais:main from luoxi:feat/reasoning-effort
Apr 7, 2026

Conversation

@luoxi
Contributor

@luoxi luoxi commented Apr 4, 2026

Before Submitting This PR

  • I have searched existing issues and pull requests (including closed ones)
    to ensure this isn't a duplicate
  • I have read CONTRIBUTING.md

Human Written Description

I use Handy with Ollama for post-processing, and recently switched to
Gemma 4 because it handles mixed Japanese/English really well. The
problem is that Gemma 4 (and most recent open-weight models like Qwen 3)
ships with thinking mode enabled by default — so even for simple
transcript cleanup, the model spends a long time "reasoning" about
how to fix punctuation before it actually responds.

This hits all languages, but it's especially bad for Japanese — the
thinking phase generates far more tokens due to less efficient
tokenization. Post-processing matters more for Japanese too, since
Whisper is significantly weaker at mixed Japanese/English input,
making LLM cleanup essential rather than optional.

The fix is a one-field passthrough: the OpenAI-compatible API already
has a reasoning_effort parameter for this. Setting it to "none"
brings response time down to under a second with no observable quality
difference for transcript cleanup tasks.
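The passthrough amounts to a conditionally serialized field. A minimal std-only sketch of the idea (field and function names here are illustrative, not the PR's actual identifiers; the real implementation serializes a request struct):

```rust
// Build an OpenAI-compatible chat completion body, including
// "reasoning_effort" only when it is set. Omitting the field when unset
// matches the OpenAI convention for optional parameters.
fn build_request_body(model: &str, prompt: &str, reasoning_effort: Option<&str>) -> String {
    let mut body = format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}]"#,
        model, prompt
    );
    if let Some(effort) = reasoning_effort {
        body.push_str(&format!(r#","reasoning_effort":"{}""#, effort));
    }
    body.push('}');
    body
}

fn main() {
    // Custom provider: thinking disabled.
    let custom = build_request_body("gemma3:4b", "Fix punctuation.", Some("none"));
    assert!(custom.contains(r#""reasoning_effort":"none""#));

    // Cloud providers: field omitted entirely.
    let cloud = build_request_body("gpt-4o-mini", "Fix punctuation.", None);
    assert!(!cloud.contains("reasoning_effort"));
}
```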

Per discussion with @cjpais,
this PR keeps things simple: Custom provider always sends
reasoning_effort: "none"
during post-processing, with no UI control.
Post-processing rarely benefits from reasoning, and if a backend doesn't
recognize the parameter it's typically ignored. Cloud providers are
untouched — they handle this server-side or may reject unknown params.

Related: #1201, where @cjpais noted that thinking tokens should be stripped.
This prevents them from being generated in the first place (for the
Custom provider path).

Benchmark (Gemma 4 E4B, Apple Silicon, Ollama)

| Test case | Thinking ON | Thinking OFF |
| --- | --- | --- |
| Japanese (tech, mixed JP/EN) | 12.7s | 1.2s |
| Japanese (short command) | 19.0s | 0.4s |
| English (tech) | 12.4s | 1.7s |
| Japanese (CLI terms) | 18.3s | 0.8s |

Output quality was identical across all test cases.

Changes

  • Add optional reasoning_effort field to ChatCompletionRequest
    (omitted when unset, per OpenAI spec)
  • Thread through send_chat_completion / send_chat_completion_with_schema
  • actions.rs: always set reasoning_effort: "none" for Custom provider
    post-processing; None for all other providers
  • No new settings, no UI, no i18n strings
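The per-provider dispatch in the third bullet can be sketched as follows (the `Provider` enum and function name are illustrative assumptions, not the PR's actual identifiers):

```rust
// Decide the reasoning_effort value per provider during post-processing.
#[allow(dead_code)]
enum Provider {
    Custom,
    OpenAi,
    Anthropic,
}

fn reasoning_effort_for(provider: &Provider) -> Option<&'static str> {
    match provider {
        // Custom (local/OpenAI-compatible) endpoints: suppress thinking.
        Provider::Custom => Some("none"),
        // Cloud providers: omit the field entirely; they handle reasoning
        // server-side and may reject unknown parameters.
        _ => None,
    }
}

fn main() {
    assert_eq!(reasoning_effort_for(&Provider::Custom), Some("none"));
    assert_eq!(reasoning_effort_for(&Provider::OpenAi), None);
}
```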

Known limitations

A couple of edge cases worth flagging — none are blockers for the
Ollama/Gemma target, but may surface later:

  1. "none" is an Ollama extension. The OpenAI spec only defines
    minimal | low | medium | high. If a user points Custom at
    OpenAI-proper with an o-series reasoning model, the request would
    be rejected. This is an edge case (o-series via Custom for
    post-processing is unusual), and we can add a fallback if anyone
    hits it.
  2. Some models use prompt-based reasoning control, not params.
    Qwen 3 (/no_think), Anthropic extended thinking, and some
    fine-tunes expect instructions in the system prompt rather than
    an API field. For those, reasoning_effort is silently ignored
    and users would need to disable thinking inside their prompt
    template. Not something this PR can solve generically.
  3. No escape hatch to re-enable reasoning on Custom. If someone
    genuinely wants reasoning during post-processing with a local
    model, they'd need to change the source. Acceptable tradeoff
    given this is alpha and post-processing is a narrow use case.
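For case 2, the workaround lives in the prompt rather than the request params. Qwen 3, for example, documents a `/no_think` soft switch appended to the prompt; a hypothetical helper (not part of this PR) would look like:

```rust
// Append Qwen 3's "/no_think" soft switch to a prompt to disable thinking
// for models that ignore the reasoning_effort parameter. Hypothetical
// helper for illustration only.
fn with_no_think(system_prompt: &str) -> String {
    format!("{system_prompt} /no_think")
}

fn main() {
    let prompt = with_no_think("Clean up this transcript without changing meaning.");
    assert!(prompt.ends_with("/no_think"));
}
```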

Community Feedback

Issue: #1201

Testing

Tested locally on macOS (Apple Silicon) with Ollama + Gemma 4 E4B.
Verified reasoning_effort: "none" is sent for Custom provider, other
providers receive no reasoning_effort field, and post-processing
latency drops to sub-second on Gemma 4 E4B.

AI Assistance

  • AI was used (described below)

If AI was used:

  • Tools used: Claude Code
  • How extensively: Research and implementation guidance. Manual testing
    and verification done by me.

Reasoning models (Gemma 4, Qwen 3, etc.) default to thinking mode,
adding 10-40x latency for simple transcript cleanup. Pass through the
OpenAI-compatible reasoning_effort parameter so users can disable
thinking when speed matters more than deep reasoning.

- Add optional reasoning_effort field to ChatCompletionRequest
- Thread through send_chat_completion / send_chat_completion_with_schema
- New set_post_process_reasoning_effort Tauri command for persistence
- Dropdown in post-processing settings, Custom provider only
- i18n for en and ja
@cjpais
Owner

cjpais commented Apr 4, 2026

Could we simplify this further and just default to None. Or off? I don't think anyone really wants reasoning for post processing as far as I can tell

@luoxi
Contributor Author

luoxi commented Apr 5, 2026

Could we simplify this further and just default to None. Or off? I don't think anyone really wants reasoning for post processing as far as I can tell

Thanks, that makes sense — I agree simpler is better here.

Addresses review feedback on the reasoning_effort passthrough:
post-processing rarely benefits from reasoning, so a 5-way dropdown
(default/none/low/medium/high) is overkill. Collapse to a single
boolean toggle "Disable reasoning" that defaults to ON for Custom
providers (sends reasoning_effort: "none"). Cloud providers continue
to receive no reasoning_effort parameter.

- Replace post_process_reasoning_effort: Option<String> with
  post_process_disable_reasoning: bool (default true) in settings
- Map bool to Option<String> at the request site (true -> "none",
  false -> omit) so the API contract is unchanged
- Rename Tauri command to set_post_process_disable_reasoning(disable)
  so parameter naming matches the setting and avoids inversion bugs
- Swap the Dropdown for a ToggleSwitch in the Custom provider section
- Collapse five i18n strings to one label + description, and add the
  translation to all 18 additional locales so non-en/ja users don't
  fall back to English for this control
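The bool-to-option mapping in the second bullet is small but worth pinning down, since it keeps the API contract unchanged. A sketch under the commit's stated semantics (function name is illustrative):

```rust
// Map the interim settings bool to the request-level parameter:
// true -> Some("none") (disable reasoning), false -> None (omit the field).
fn effort_param(disable_reasoning: bool) -> Option<String> {
    if disable_reasoning {
        Some("none".to_string())
    } else {
        None
    }
}

fn main() {
    assert_eq!(effort_param(true).as_deref(), Some("none"));
    assert_eq!(effort_param(false), None);
}
```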
@luoxi luoxi force-pushed the feat/reasoning-effort branch from 25dd02d to 99c8299 on April 5, 2026 at 01:31
@luoxi
Contributor Author

luoxi commented Apr 5, 2026

Simplified to a single "Disable reasoning" toggle, default ON for custom providers (sends reasoning_effort: "none"), as reasoning doesn’t add much value in post-processing.

(Screenshot: 2026-04-05 at 10:35:38)

@cjpais
Owner

cjpais commented Apr 5, 2026

I think since this is an alpha feature, we can just disable reasoning in total to simplify the settings. Let's hit issues directly if reasoning disabled causes issues. I think for this app it should have little to no impact

@luoxi
Contributor Author

luoxi commented Apr 5, 2026

Yeah that makes sense.

I do think reasoning control might get a bit inconsistent across models over time (params vs prompt-based control, etc.), so we might eventually want a bit more flexibility depending on provider behavior.

But agree we shouldn’t overcomplicate it now — I’ll update the PR to just disable reasoning for post-processing.

luoxi added a commit to luoxi/Handy that referenced this pull request Apr 5, 2026
…provider

Per PR cjpais#1221 discussion with @cjpais: since this is an alpha feature,
simplify the UI by always sending reasoning_effort: "none" for custom
provider post-processing. If this causes issues with specific provider
setups, we can revisit.
@luoxi
Contributor Author

luoxi commented Apr 5, 2026

Done — removed the toggle entirely. Custom provider now always sends reasoning_effort: "none" for post-processing.

One small note: "none" is an Ollama-specific extension, so some strict OpenAI-compatible endpoints might not accept it. Probably not an issue for now, but happy to add a fallback if needed later.

@luoxi luoxi force-pushed the feat/reasoning-effort branch from e2e0f94 to d98521d on April 5, 2026 at 10:59
@cjpais
Owner

cjpais commented Apr 5, 2026

Thank you! I will test this tomorrow or Tuesday and pull it in. This is a good change, so thank you for making it

@cjpais
Owner

cjpais commented Apr 7, 2026

Tested this with Qwen 9B on my machine, went from infinite processing to basically instant, this is a big win. Thanks

@cjpais cjpais merged commit 84d88f9 into cjpais:main Apr 7, 2026
2 checks passed