Skip to content

llm_runner: plumb prefill temperature#20244

Open
mergennachin wants to merge 1 commit into
mainfrom
llm-pr-d-prefill-temperature-logits
Open

llm_runner: plumb prefill temperature#20244
mergennachin wants to merge 1 commit into
mainfrom
llm-pr-d-prefill-temperature-logits

Conversation

@mergennachin

@mergennachin mergennachin commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Session-based serving drives generation as prefill plus token steps instead of one monolithic generate call. For that path to be correct, the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop; otherwise requests using temperature can silently start greedily and then switch behavior on later tokens.

This threads optional temperature through TextPrefiller and exposes the existing TextTokenGenerator logit-processor application so token-step callers can reuse the same sampling preparation as generate(). The goal is to remove a divergence point before session-backed serving starts depending on these primitives.

Default behavior remains greedy, so existing callers that do not pass temperature keep the same semantics. The added tests focus on the new non-default path and on sharing the logit-processor logic rather than duplicating it.

#20001

Copilot AI review requested due to automatic review settings June 12, 2026 20:15
@pytorch-bot

pytorch-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20244

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 1 New Failure, 4 Pending

As of commit 878f15c with merge base d7ca5db (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026
@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR threads sampling temperature through TextPrefiller so the first sampled token (produced during prefill in session-based serving) uses the same sampling inputs as subsequent decode steps, and exposes TextTokenGenerator’s logit-processor application to keep decode paths consistent.

Changes:

  • Expose TextTokenGenerator::apply_logit_processors() (and is_eos()) so token-step callers can reuse the same logit-processing logic as generate().
  • Extend TextPrefiller::prefill() / prefill_chunk() to accept an optional temperature, applied only to the final chunk’s sampled token.
  • Update TextPrefiller unit tests for the new prefill_chunk signature (but currently without asserting temperature behavior).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
extension/llm/runner/text_token_generator.h Adds public helpers for applying logit processors and EOS checking; generate() now calls the helper.
extension/llm/runner/text_prefiller.h Adds temperature parameter to prefill APIs (currently via virtual signature change + default arg).
extension/llm/runner/text_prefiller.cpp Threads temperature into logits_to_token() and ensures only the last chunk uses non-greedy sampling.
extension/llm/runner/test/test_text_prefiller.cpp Updates mocks/expectations for new prefill_chunk signature; does not yet assert temperature forwarding/last-chunk behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread extension/llm/runner/text_prefiller.h Outdated
Comment thread extension/llm/runner/text_prefiller.h Outdated
Comment thread extension/llm/runner/test/test_text_prefiller.cpp Outdated
Comment thread extension/llm/runner/test/test_text_prefiller.cpp
Session-based serving drives generation as prefill plus token steps instead of one monolithic generate call. For that path to be correct, the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop; otherwise requests using temperature can silently start greedily and then switch behavior on later tokens.

This threads optional temperature through TextPrefiller and exposes the existing TextTokenGenerator logit-processor application so token-step callers can reuse the same sampling preparation as generate(). The goal is to remove a divergence point before session-backed serving starts depending on these primitives.

Default behavior remains greedy, so existing callers that do not pass temperature keep the same semantics. The added tests focus on the new non-default path and on sharing the logit-processor logic rather than duplicating it.
@mergennachin

Copy link
Copy Markdown
Contributor Author

@claude Review this PR

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @mergennachin's task in 5m 57s —— View job


Review: llm_runner: plumb prefill temperature

  • Read changed files
  • Analyze correctness & API design
  • Check test coverage

Overall this is a clean, well-scoped change. The committed state already addresses the Copilot reviewer's concerns: prefill() / prefill_chunk() keep separate 2-arg (greedy) and 3-arg virtual overloads with no default argument, the 2-arg forwards to the 3-arg with 0.0f, and tests now assert temperature forwarding, the greedy default, and last-chunk-only sampling. Default behavior is unchanged for existing callers.

Main point worth discussing

Logit processors are not applied to the prefill-sampled token, so the divergence the PR targets is only partially closed.

The stated goal is that "the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop." generate() does:

apply_logit_processors(logits_tensor);   // grammar/penalties/top-k/top-p
cur_token = logits_to_token(logits_tensor, temperature);

but prefill_chunk() (text_prefiller.cpp:116, :153) samples with temperature only:

cur_token = text_decoder_runner_->logits_to_token(outputs_res.get(), temperature);

So a request that uses both temperature and a logit processor (grammar mask, repetition penalty, top-k/top-p) would still get a first token that diverges from the decode loop. This is presumably why apply_logit_processors() was made public — the session orchestrator is expected to apply it — but TextPrefiller has no handle to the processors, so the divergence isn't actually removed inside this layer. Worth a sentence in the PR description / a TODO clarifying that processors are applied by the session caller, not the prefiller, so the next reader doesn't assume prefill is fully consistent with decode.

Minor

  • Dual virtual overloads are slightly fragile. Because both the 2-arg and 3-arg prefill_chunk are virtual, a subclass that overrides only the 2-arg version will not have its override invoked by prefill() (which always calls the 3-arg form). The new TwoArgumentPrefillChunkOverrideStillDispatches test only checks direct 2-arg dispatch, not routing through prefill(). No in-tree subclass overrides these (only the test mocks do), so this is theoretical — but if back-compat for 2-arg overriders is the goal, that path isn't actually exercised. The cleaner alternative is a single virtual 3-arg method with a non-virtual inline 2-arg forwarder; the current shape is a fine compromise if you specifically want 2-arg mocks to keep working.

  • apply_logit_processors() and is_eos() are exposed with no in-tree caller yet. The doc comments reference a session decode_one() that doesn't exist in this PR. That's acceptable given the stated "remove a divergence point before session-backed serving" intent, just flagging that they're public API ahead of their consumer.

Tests

Coverage on the new path is good: temperature forwarding, greedy 2-arg default, last-chunk-only temperature, and error propagation are all asserted. The Copilot review's test requests are satisfied by the committed version.

No blocking issues. The logit-processor consistency note above is the one item I'd recommend clarifying before session serving starts depending on these primitives.
· branch llm-pr-d-prefill-temperature-logits

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants