llm_runner: plumb prefill temperature by mergennachin · Pull Request #20244 · pytorch/executorch

mergennachin · 2026-06-12T20:15:36Z

Session-based serving drives generation as prefill plus token steps instead of one monolithic generate call. For that path to be correct, the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop; otherwise requests using temperature can silently start greedily and then switch behavior on later tokens.

This threads optional temperature through TextPrefiller and exposes the existing TextTokenGenerator logit-processor application so token-step callers can reuse the same sampling preparation as generate(). The goal is to remove a divergence point before session-backed serving starts depending on these primitives.

Default behavior remains greedy, so existing callers that do not pass temperature keep the same semantics. The added tests focus on the new non-default path and on sharing the logit-processor logic rather than duplicating it.

#20001

pytorch-bot · 2026-06-12T20:15:40Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20244

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 1 New Failure, 4 Pending

As of commit 878f15c with merge base d7ca5db ():

NEW FAILURE - The following job has failed:

pull / test-llama-runner-qnn-linux (fp32, qnn_8a8w, qnn) / linux-job (gh)
RuntimeError: Command docker exec -t 4369b05f04918a93a5e8e6dab70b8a8ed76c8b709a855631d104f4ca3fa646a4 /exec failed with exit code 92

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-12T20:16:29Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

This PR threads sampling temperature through TextPrefiller so the first sampled token (produced during prefill in session-based serving) uses the same sampling inputs as subsequent decode steps, and exposes TextTokenGenerator’s logit-processor application to keep decode paths consistent.

Changes:

Expose TextTokenGenerator::apply_logit_processors() (and is_eos()) so token-step callers can reuse the same logit-processing logic as generate().
Extend TextPrefiller::prefill() / prefill_chunk() to accept an optional temperature, applied only to the final chunk’s sampled token.
Update TextPrefiller unit tests for the new prefill_chunk signature (but currently without asserting temperature behavior).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
extension/llm/runner/text_token_generator.h	Adds public helpers for applying logit processors and EOS checking; `generate()` now calls the helper.
extension/llm/runner/text_prefiller.h	Adds `temperature` parameter to prefill APIs (currently via virtual signature change + default arg).
extension/llm/runner/text_prefiller.cpp	Threads `temperature` into `logits_to_token()` and ensures only the last chunk uses non-greedy sampling.
extension/llm/runner/test/test_text_prefiller.cpp	Updates mocks/expectations for new `prefill_chunk` signature; does not yet assert temperature forwarding/last-chunk behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Session-based serving drives generation as prefill plus token steps instead of one monolithic generate call. For that path to be correct, the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop; otherwise requests using temperature can silently start greedily and then switch behavior on later tokens. This threads optional temperature through TextPrefiller and exposes the existing TextTokenGenerator logit-processor application so token-step callers can reuse the same sampling preparation as generate(). The goal is to remove a divergence point before session-backed serving starts depending on these primitives. Default behavior remains greedy, so existing callers that do not pass temperature keep the same semantics. The added tests focus on the new non-default path and on sharing the logit-processor logic rather than duplicating it.

mergennachin · 2026-06-12T21:20:39Z

@claude Review this PR

claude · 2026-06-12T21:21:21Z

Claude finished @mergennachin's task in 5m 57s —— View job

Review: `llm_runner: plumb prefill temperature`

Read changed files
Analyze correctness & API design
Check test coverage

Overall this is a clean, well-scoped change. The committed state already addresses the Copilot reviewer's concerns: prefill() / prefill_chunk() keep separate 2-arg (greedy) and 3-arg virtual overloads with no default argument, the 2-arg forwards to the 3-arg with 0.0f, and tests now assert temperature forwarding, the greedy default, and last-chunk-only sampling. Default behavior is unchanged for existing callers.

Main point worth discussing

Logit processors are not applied to the prefill-sampled token, so the divergence the PR targets is only partially closed.

The stated goal is that "the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop." generate() does:

apply_logit_processors(logits_tensor);   // grammar/penalties/top-k/top-p
cur_token = logits_to_token(logits_tensor, temperature);

but prefill_chunk() (text_prefiller.cpp:116, :153) samples with temperature only:

cur_token = text_decoder_runner_->logits_to_token(outputs_res.get(), temperature);

So a request that uses both temperature and a logit processor (grammar mask, repetition penalty, top-k/top-p) would still get a first token that diverges from the decode loop. This is presumably why apply_logit_processors() was made public — the session orchestrator is expected to apply it — but TextPrefiller has no handle to the processors, so the divergence isn't actually removed inside this layer. Worth a sentence in the PR description / a TODO clarifying that processors are applied by the session caller, not the prefiller, so the next reader doesn't assume prefill is fully consistent with decode.

Minor

Dual virtual overloads are slightly fragile. Because both the 2-arg and 3-arg prefill_chunk are virtual, a subclass that overrides only the 2-arg version will not have its override invoked by prefill() (which always calls the 3-arg form). The new TwoArgumentPrefillChunkOverrideStillDispatches test only checks direct 2-arg dispatch, not routing through prefill(). No in-tree subclass overrides these (only the test mocks do), so this is theoretical — but if back-compat for 2-arg overriders is the goal, that path isn't actually exercised. The cleaner alternative is a single virtual 3-arg method with a non-virtual inline 2-arg forwarder; the current shape is a fine compromise if you specifically want 2-arg mocks to keep working.
apply_logit_processors() and is_eos() are exposed with no in-tree caller yet. The doc comments reference a session decode_one() that doesn't exist in this PR. That's acceptable given the stated "remove a divergence point before session-backed serving" intent, just flagging that they're public API ahead of their consumer.

Tests

Coverage on the new path is good: temperature forwarding, greedy 2-arg default, last-chunk-only temperature, and error propagation are all asserted. The Copilot review's test requests are satisfied by the committed version.

No blocking issues. The logit-processor consistency note above is the one item I'd recommend clarifying before session serving starts depending on these primitives.
· branch llm-pr-d-prefill-temperature-logits

Copilot AI review requested due to automatic review settings June 12, 2026 20:15

mergennachin requested a review from larryliu0820 as a code owner June 12, 2026 20:15

mergennachin temporarily deployed to cadence June 12, 2026 20:15 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026

mergennachin requested review from kirklandsign and psiddh June 12, 2026 20:15

Copilot started reviewing on behalf of mergennachin June 12, 2026 20:16 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread extension/llm/runner/text_prefiller.h Outdated

Comment thread extension/llm/runner/text_prefiller.h Outdated

Comment thread extension/llm/runner/test/test_text_prefiller.cpp Outdated

Comment thread extension/llm/runner/test/test_text_prefiller.cpp

mergennachin force-pushed the llm-pr-d-prefill-temperature-logits branch from 7fde821 to 878f15c Compare June 12, 2026 21:12

mergennachin temporarily deployed to cadence June 12, 2026 21:12 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llm_runner: plumb prefill temperature#20244

llm_runner: plumb prefill temperature#20244
mergennachin wants to merge 1 commit into
mainfrom
llm-pr-d-prefill-temperature-logits

mergennachin commented Jun 12, 2026 •

edited

Loading

Uh oh!

pytorch-bot Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mergennachin commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mergennachin commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20244

❗ 1 Active SEVs

❌ 1 New Failure, 4 Pending

Uh oh!

github-actions Bot commented Jun 12, 2026

This PR needs a release notes: label

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mergennachin commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: llm_runner: plumb prefill temperature

Main point worth discussing

Minor

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mergennachin commented Jun 12, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 12, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 12, 2026 •

edited

Loading

Review: `llm_runner: plumb prefill temperature`