Support suppress_tokens / begin_suppress_tokens (HF generation parity) #2207

Draft
Copilot wants to merge 4 commits into main from copilot/support-suppress-tokens

Copilot AI commented Jun 9, 2026

onnxruntime-genai had no way to suppress specific token IDs during generation, so output diverged from transformers.generate() for models that depend on suppress_tokens / begin_suppress_tokens (e.g. Whisper, Gemma). These options were neither parsed from config nor applied as logits processors.

This adds static per-token suppression mirroring HF's SuppressTokensLogitsProcessor and SuppressTokensAtBeginLogitsProcessor:

  • Config (config.h/config.cpp): new suppress_tokens / begin_suppress_tokens int-array fields on Config::Search, parsed via a new Search_Element::OnArray.
  • Logits processor (search.h, search.cpp, cuda/search_cuda.cpp): new Search::ApplySuppressTokens virtual, implemented for CPU and CUDA, setting targeted logits to float lowest — following the existing ApplyMinLength / ApplyRepetitionPenalty pattern.
  • Application (generators.cpp, engine/request.cpp): suppress_tokens applied every step; begin_suppress_tokens only when current_length == prompt_length. Wired into both the standalone Generator loop and the batching Engine request loop.
  • API surface: additive C API OgaGeneratorParamsSetSearchTokensArray + C++ wrapper; Python set_search_options(...) now accepts list[int].
  • Model builder (builders/base.py, builders/whisper.py): auto-populates both from generation_config.json / model config via a shared Model.add_suppress_tokens_to_search_config helper, so existing HF models work without manual config edits.
  • Tests: C++ greedy-selection tests and Python API tests covering every-step and begin-only suppression.
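As a rough illustration of the model-builder bullet above, the auto-population step might look something like the sketch below. The function name and the dict-based search config are assumptions made for illustration; the PR's actual helper is Model.add_suppress_tokens_to_search_config, whose signature is not shown here.

```python
import json


def add_suppress_tokens_to_search(search_config, generation_config):
    """Copy suppression lists into a search-config dict (hypothetical helper)."""
    for key in ("suppress_tokens", "begin_suppress_tokens"):
        value = generation_config.get(key)
        if value:  # skip keys that are missing, null, or empty
            search_config[key] = [int(token_id) for token_id in value]
    return search_config


# Made-up stand-in for the contents of a generation_config.json file.
generation_config = json.loads(
    '{"suppress_tokens": [1, 2, 7],'
    ' "begin_suppress_tokens": [220, 50257],'
    ' "max_length": 448}'
)

search = add_suppress_tokens_to_search({"max_length": 448}, generation_config)
```

Skipping absent or null entries leaves the search config untouched for models that do not define either option.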
```python
# suppress_tokens: masked at every step; greedy falls through to the next unsuppressed token.
# begin_suppress_tokens: masked only on the first generated token.
params.set_search_options(suppress_tokens=[1, 3], begin_suppress_tokens=[50257])
```
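The suppression semantics described above can be sketched in plain Python. This is a minimal model of the behaviour, not the C++/CUDA implementation: the helper names are hypothetical, `LOWEST` is a double-precision stand-in for the "float lowest" value, and silently ignoring out-of-range IDs is an assumption.

```python
import sys

# Most negative finite double; stand-in for the "float lowest" value
# written into suppressed logits.
LOWEST = -sys.float_info.max


def apply_suppress_tokens(logits, token_ids):
    """Overwrite the logits of the given token IDs in place (hypothetical helper)."""
    for token_id in token_ids:
        if 0 <= token_id < len(logits):  # assumption: out-of-range IDs are ignored
            logits[token_id] = LOWEST


def process_logits(logits, current_length, prompt_length,
                   suppress_tokens=(), begin_suppress_tokens=()):
    """Apply every-step suppression, plus begin-only suppression on the first step."""
    apply_suppress_tokens(logits, suppress_tokens)
    if current_length == prompt_length:  # no tokens generated yet
        apply_suppress_tokens(logits, begin_suppress_tokens)
    return logits


def greedy_pick(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy 5-token vocabulary: token 3 scores highest, then token 1, then token 4.
raw = [0.1, 2.0, 0.5, 3.0, 1.5]

first = process_logits(list(raw), current_length=4, prompt_length=4,
                       suppress_tokens=[1, 3], begin_suppress_tokens=[4])
later = process_logits(list(raw), current_length=5, prompt_length=4,
                       suppress_tokens=[1, 3], begin_suppress_tokens=[4])

first_choice = greedy_pick(first)  # tokens 1, 3 and 4 are masked on the first step
later_choice = greedy_pick(later)  # only tokens 1 and 3 are masked afterwards
```

On the first step greedy selection lands on token 2; on later steps token 4 becomes eligible again and wins, which is the difference between the two options.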

bad_words_ids (multi-token sequence banning), listed as optional in the issue, is intentionally out of scope here.

Note on concurrency

The begin-step is tracked via a per-object length member, mutated non-atomically — consistent with the existing single-threaded-per-Generator/Request design (cf. computed_logits_, is_prefill_). No synchronization added.

Copilot AI changed the title from "[WIP] Add support for suppress_tokens and begin_suppress_tokens" to "Support suppress_tokens / begin_suppress_tokens (HF generation parity)" on Jun 9, 2026
Copilot AI requested a review from justinchuby June 9, 2026 23:48

Development

Successfully merging this pull request may close these issues.

Support suppress_tokens / begin_suppress_tokens (HF generation parity)

2 participants