Support suppress_tokens / begin_suppress_tokens (HF generation parity) #2201

Description

@justinchuby

Feature request: support suppress_tokens / begin_suppress_tokens (and ideally bad_words_ids)

Problem

onnxruntime-genai has no way to suppress specific token IDs during generation. HuggingFace transformers.generate() supports:

  • suppress_tokens — token IDs whose logits are set to -inf at every decoding step.
  • begin_suppress_tokens — token IDs suppressed only at the first generated step.
  • bad_words_ids — sequences that may never be generated.

Many models ship these in their generation_config.json and depend on them for correct output (e.g. Whisper suppresses non-speech/special tokens; Gemma4 base/IT suppresses end_of_image/end_of_audio placeholder-closing tokens). Because genai ignores them, its output diverges from HF for these models.
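
For reference, here is a minimal sketch of pulling these lists out of a generation_config.json payload. The function name read_suppression_lists is hypothetical and the token IDs in the example are made up; only the three key names come from HF's generation config.

```python
import json


def read_suppression_lists(generation_config_text):
    """Return (suppress_tokens, begin_suppress_tokens, bad_words_ids) from JSON text.

    Hypothetical helper for illustration; not part of onnxruntime-genai or transformers.
    """
    cfg = json.loads(generation_config_text)
    # Missing or null keys are treated as "nothing to suppress".
    suppress = [int(t) for t in (cfg.get("suppress_tokens") or [])]
    begin = [int(t) for t in (cfg.get("begin_suppress_tokens") or [])]
    bad_words = [[int(t) for t in seq] for seq in (cfg.get("bad_words_ids") or [])]
    return suppress, begin, bad_words


# Illustrative values only, not taken from any real model.
example = '{"suppress_tokens": [1, 2, 7], "begin_suppress_tokens": [220, 50256]}'
print(read_suppression_lists(example))
```

A config that omits a key, or sets it to null, yields an empty list for that entry, so callers can apply the result unconditionally.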

Evidence

The full set of recognized search options is enumerated in src/config.cpp (Search_Element::OnValue, ~L1245–1291):

min_length, max_length, batch_size, num_beams, num_return_sequences,
top_k, top_p, temperature, repetition_penalty, length_penalty,
no_repeat_ngram_size, diversity_penalty, random_seed, chunk_size,
do_sample, past_present_share_buffer, early_stopping, blank_penalty

There is no suppress_tokens / begin_suppress_tokens / bad_words_ids key, and grep -rni "suppress\|bad_word" src/ finds no related logic (only GC.SuppressFinalize etc.). A GuidanceLogitsProcessor exists for grammar-constrained decoding, but that is not a substitute for static per-token suppression driven by generation_config.

Impact

While validating mobius-exported Gemma4 models against HF, we had to re-implement token suppression in the test harness to reproduce HF's greedy output. model.generate() applies suppress_tokens, but a genai generation loop does not, so the two outputs do not match token-for-token unless suppression is added manually. Any downstream user running these models through genai can get subtly wrong generations with no error.

Proposed solution

  1. Parse suppress_tokens and begin_suppress_tokens (arrays of ints) in the search config (src/config.cpp) and from set_search_options.
  2. Add a logits processor that sets the corresponding logits to -inf: suppress_tokens on every step, begin_suppress_tokens only when current_length == prompt_length.
  3. Auto-populate these from the model's generation_config.json at build/convert time (or read them from genai_config.json) so existing HF models work out of the box.
  4. (Optional) bad_words_ids for multi-token sequence banning.
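
Step 2 above could look roughly like the following. This is a plain-Python sketch of the intended behaviour, not genai code: the class name SuppressTokensProcessor, its constructor arguments, and the greedy_pick helper are all hypothetical, and logits are modelled as one vocab-sized list of floats.

```python
import math


class SuppressTokensProcessor:
    """Hypothetical processor; illustrates the proposal, not an existing genai API."""

    def __init__(self, suppress_tokens=(), begin_suppress_tokens=(), prompt_length=0):
        self.suppress_tokens = list(suppress_tokens)
        self.begin_suppress_tokens = list(begin_suppress_tokens)
        self.prompt_length = prompt_length

    def __call__(self, current_length, logits):
        """Return a copy of `logits` with the banned token IDs set to -inf."""
        out = list(logits)
        banned = list(self.suppress_tokens)
        # begin_suppress_tokens apply only on the first generated step.
        if current_length == self.prompt_length:
            banned += self.begin_suppress_tokens
        for tok in banned:
            if 0 <= tok < len(out):  # ignore IDs outside the vocabulary
                out[tok] = -math.inf
        return out


def greedy_pick(logits):
    """Index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


proc = SuppressTokensProcessor(suppress_tokens=[3], begin_suppress_tokens=[1], prompt_length=4)
logits = [0.1, 2.0, 0.5, 3.0]
first = proc(4, logits)  # first generated step: IDs 3 and 1 are masked
later = proc(5, logits)  # later step: only ID 3 is masked
print(greedy_pick(first), greedy_pick(later))  # prints "2 1"
```

Without the processor, greedy decoding would pick ID 3 at both steps, which is the kind of silent divergence from HF described under Impact.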

This mirrors HF's SuppressTokensLogitsProcessor, SuppressTokensAtBeginLogitsProcessor, and NoBadWordsLogitsProcessor.
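
For the optional bad_words_ids part, the idea is to ban only the token that would complete a banned sequence, given the IDs produced so far. The sketch below is an assumption about the desired behaviour, written from scratch rather than copied from HF; ban_bad_words and its parameters are hypothetical names.

```python
import math


def ban_bad_words(ids_so_far, logits, bad_words_ids):
    """Return a copy of `logits` where tokens completing a banned sequence are -inf.

    Hypothetical helper for illustration; not an existing genai or transformers function.
    """
    out = list(logits)
    ids_so_far = list(ids_so_far)
    for seq in bad_words_ids:
        if not seq:
            continue
        prefix, last = list(seq[:-1]), seq[-1]
        # Single-token sequences have an empty prefix and are always banned.
        # Longer ones are banned only when their prefix matches the tail of the output.
        matches = not prefix or ids_so_far[-len(prefix):] == prefix
        if matches and 0 <= last < len(out):
            out[last] = -math.inf
    return out


masked = ban_bad_words([5, 6], [0.0] * 10, [[6, 7], [2], [9, 9, 9]])
print([i for i, v in enumerate(masked) if v == -math.inf])  # prints "[2, 7]"
```

Here [6, 7] is banned because the output already ends in 6, [2] is banned unconditionally, and [9, 9, 9] is left alone because its prefix has not been generated.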

Environment

  • onnxruntime-genai 0.14.0-dev (commit 9b875c3)
  • Observed while exporting/validating google/gemma-4-* (and applies to Whisper).
