llm_runner: plumb prefill temperature #20244
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20244
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV
There is 1 currently active SEV. If your PR is affected, please view it below:
❌ 1 New Failure, 4 Pending
As of commit 878f15c with merge base d7ca5db ( NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Pull request overview
This PR threads sampling temperature through TextPrefiller so the first sampled token (produced during prefill in session-based serving) uses the same sampling inputs as subsequent decode steps, and exposes TextTokenGenerator’s logit-processor application to keep decode paths consistent.
Changes:
- Expose TextTokenGenerator::apply_logit_processors() (and is_eos()) so token-step callers can reuse the same logit-processing logic as generate().
- Extend TextPrefiller::prefill() / prefill_chunk() to accept an optional temperature, applied only to the final chunk's sampled token.
- Update TextPrefiller unit tests for the new prefill_chunk signature (but currently without asserting temperature behavior).
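To make the "final chunk only" rule concrete, here is a minimal, self-contained sketch. It is not the ExecuTorch implementation: `ChunkCall` and `plan_prefill` are hypothetical names, and it assumes that a temperature of 0 means greedy sampling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one chunk dispatch: how many prompt tokens the chunk
// holds and which temperature its sampled token would use.
struct ChunkCall {
  std::size_t num_tokens;
  float temperature;
};

// Splits `prompt` into chunks of at most `max_chunk` tokens and picks the
// temperature for each chunk. Assumption: 0 means greedy. Intermediate chunks
// stay greedy; only the final chunk receives the caller's temperature.
std::vector<ChunkCall> plan_prefill(
    const std::vector<std::uint64_t>& prompt,
    std::size_t max_chunk,
    float temperature = 0.0f) {
  assert(max_chunk > 0);
  std::vector<ChunkCall> calls;
  for (std::size_t start = 0; start < prompt.size(); start += max_chunk) {
    const std::size_t remaining = prompt.size() - start;
    const std::size_t len = remaining < max_chunk ? remaining : max_chunk;
    const bool is_last = (start + len == prompt.size());
    calls.push_back({len, is_last ? temperature : 0.0f});
  }
  return calls;
}
```

With a five-token prompt, `max_chunk = 2`, and `temperature = 0.8f`, this plan has three chunks, and only the third carries 0.8. Omitting the temperature argument leaves every chunk greedy, which matches the stated goal of not changing existing callers.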
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| extension/llm/runner/text_token_generator.h | Adds public helpers for applying logit processors and EOS checking; generate() now calls the helper. |
| extension/llm/runner/text_prefiller.h | Adds temperature parameter to prefill APIs (currently via virtual signature change + default arg). |
| extension/llm/runner/text_prefiller.cpp | Threads temperature into logits_to_token() and ensures only the last chunk uses non-greedy sampling. |
| extension/llm/runner/test/test_text_prefiller.cpp | Updates mocks/expectations for new prefill_chunk signature; does not yet assert temperature forwarding/last-chunk behavior. |
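The table flags that the new parameter arrives "via virtual signature change + default arg". The review comments themselves are not reproduced here, so the following is only one plausible reading of why that combination deserves attention: in C++, a default argument on a virtual function is chosen from the static type of the call expression, not from the override that actually runs. The class and method names below are hypothetical and do not mirror the real TextPrefiller signatures.

```cpp
#include <cassert>

// Hypothetical base class; the default argument belongs to this declaration.
struct BasePrefiller {
  virtual ~BasePrefiller() = default;
  virtual float chunk_temperature(float temperature = 0.0f) {
    return temperature;
  }
};

// Hypothetical subclass that declares a different default on its override.
struct CustomPrefiller : BasePrefiller {
  float chunk_temperature(float temperature = 0.7f) override {
    return temperature;
  }
};

// The default is picked from the static type of the reference, while the
// function body is picked from the dynamic type of the object.
float call_through_base(BasePrefiller& p) { return p.chunk_temperature(); }
float call_through_derived(CustomPrefiller& p) { return p.chunk_temperature(); }

void check_default_binding() {
  CustomPrefiller p;
  assert(call_through_base(p) == 0.0f);     // base default, derived body
  assert(call_through_derived(p) == 0.7f);  // derived default, derived body
}
```

Separately, changing the parameter list of a virtual function means existing subclasses and mocks that override the old signature no longer override the new one until they are updated, which is consistent with the test file needing new mock expectations.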
Session-based serving drives generation as prefill plus token steps instead of one monolithic generate call. For that path to be correct, the first sampled token produced during prefill must honor the same sampling inputs as the rest of the decode loop; otherwise requests using temperature can silently start greedily and then switch behavior on later tokens.

This threads optional temperature through TextPrefiller and exposes the existing TextTokenGenerator logit-processor application so token-step callers can reuse the same sampling preparation as generate(). The goal is to remove a divergence point before session-backed serving starts depending on these primitives.

Default behavior remains greedy, so existing callers that do not pass temperature keep the same semantics. The added tests focus on the new non-default path and on sharing the logit-processor logic rather than duplicating it.
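The divergence described above comes down to which temperature reaches the sampler for the first token. The sketch below is a hypothetical stand-in for a logits-to-token step, not the runner's actual code: `sample_token` is an invented name, temperature 0 is assumed to mean greedy, and standard temperature-scaled softmax sampling is assumed for the non-greedy case.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sampler: returns the argmax when temperature <= 0, otherwise
// draws an index from softmax(logits / temperature).
std::size_t sample_token(
    const std::vector<float>& logits,
    float temperature,
    std::mt19937& rng) {
  assert(!logits.empty());
  const auto max_it = std::max_element(logits.begin(), logits.end());
  if (temperature <= 0.0f) {
    return static_cast<std::size_t>(max_it - logits.begin());
  }
  // Subtract the max logit before exponentiating for numerical stability.
  std::vector<double> weights;
  weights.reserve(logits.size());
  for (float logit : logits) {
    weights.push_back(std::exp((logit - *max_it) / temperature));
  }
  // discrete_distribution normalizes the weights internally.
  std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
  return dist(rng);
}
```

If the prefill path always called such a sampler with temperature 0 while decode steps passed the request's temperature, the first token would always be the argmax and only later tokens would be sampled. Passing the same temperature to both call sites removes that mismatch.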
7fde821 to 878f15c
@claude Review this PR
Claude finished @mergennachin's task in 5m 57s.
#20001