Feature request: support suppress_tokens / begin_suppress_tokens (and ideally bad_words_ids)
Problem
onnxruntime-genai has no way to suppress specific token IDs during generation. HuggingFace transformers.generate() supports:
- suppress_tokens: token IDs whose logits are set to -inf at every decoding step.
- begin_suppress_tokens: token IDs suppressed only at the first generated step.
- bad_words_ids: token sequences that may never be generated.
Many models ship these in their generation_config.json and depend on them for correct output (e.g. Whisper suppresses non-speech/special tokens; Gemma4 base/IT suppresses end_of_image/end_of_audio placeholder-closing tokens). Because genai ignores them, its output diverges from HF for these models.
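For reference, a minimal sketch of pulling these three fields out of a generation_config.json. The JSON fragment and its token IDs are made up for illustration, and read_suppression_fields is a hypothetical helper, not an existing genai or HF function.

```python
import json

# Hypothetical generation_config.json fragment; the token IDs are invented.
SAMPLE = """
{
  "eos_token_id": 1,
  "suppress_tokens": [7, 8],
  "begin_suppress_tokens": [3],
  "bad_words_ids": [[4, 5]]
}
"""


def read_suppression_fields(config_text):
    """Return the three suppression-related fields, defaulting to empty lists.

    A missing key and an explicit null are both treated as "nothing to suppress".
    """
    cfg = json.loads(config_text)
    return {
        "suppress_tokens": list(cfg.get("suppress_tokens") or []),
        "begin_suppress_tokens": list(cfg.get("begin_suppress_tokens") or []),
        "bad_words_ids": [list(seq) for seq in (cfg.get("bad_words_ids") or [])],
    }


fields = read_suppression_fields(SAMPLE)
print(fields["suppress_tokens"])        # [7, 8]
print(fields["begin_suppress_tokens"])  # [3]
print(fields["bad_words_ids"])          # [[4, 5]]
```

A config without these keys yields three empty lists, so a consumer can apply the result unconditionally.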
Evidence
The full set of recognized search options is enumerated in src/config.cpp (Search_Element::OnValue, ~L1245–1291):
min_length, max_length, batch_size, num_beams, num_return_sequences,
top_k, top_p, temperature, repetition_penalty, length_penalty,
no_repeat_ngram_size, diversity_penalty, random_seed, chunk_size,
do_sample, past_present_share_buffer, early_stopping, blank_penalty
There is no suppress_tokens / begin_suppress_tokens / bad_words_ids key, and grep -rni "suppress\|bad_word" src/ finds no related logic (only GC.SuppressFinalize etc.). A GuidanceLogitsProcessor exists for grammar-constrained decoding, but that is not a substitute for static per-token suppression driven by generation_config.
Impact
Validating mobius-exported Gemma4 models against HF, we had to re-implement token suppression in the test harness to reproduce HF's greedy output — model.generate() applies suppress_tokens, but a genai generation loop does not, so the two disagree token-for-token unless suppression is added manually. Any downstream user running these models through genai gets subtly wrong generations with no error.
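The workaround in the harness looks roughly like the sketch below. It is self-contained and not genai's API: toy_logits is a hypothetical stand-in for the model forward pass (5-token vocabulary, made-up scores) and greedy is an illustrative loop. The only point is that masking logits before the argmax changes the greedy output.

```python
import math


def toy_logits(tokens):
    """Hypothetical stand-in for a forward pass over a 5-token vocabulary.

    Token 4 normally scores highest; after token 3 the toy model prefers EOS (0).
    """
    if tokens[-1] == 3:
        return [5.0, 0.0, 0.0, 0.0, 0.0]
    return [0.0, 1.0, 2.0, 3.0, 4.0]


def greedy(prompt, max_new_tokens, suppress=(), begin_suppress=(), eos=0):
    """Greedy decoding with optional static token suppression."""
    tokens = list(prompt)
    prompt_length = len(tokens)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)
        # suppress_tokens: masked on every step.
        for t in suppress:
            logits[t] = -math.inf
        # begin_suppress_tokens: masked only on the first generated step.
        if len(tokens) == prompt_length:
            for t in begin_suppress:
                logits[t] = -math.inf
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens[prompt_length:]


plain = greedy([1], 4)
suppressed = greedy([1], 4, suppress=[4], begin_suppress=[3])
print(plain)       # [4, 4, 4, 4]
print(suppressed)  # [2, 3, 0]
```

In this toy setup the unsuppressed loop never reaches EOS, while the suppressed loop terminates after three tokens, which is the same kind of token-for-token disagreement described above.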
Proposed solution
- Parse suppress_tokens and begin_suppress_tokens (arrays of ints) in the search config (src/config.cpp) and from set_search_options.
- Add a logits processor that sets the corresponding logits to -inf: suppress_tokens on every step, begin_suppress_tokens only when current_length == prompt_length.
- Auto-populate these from the model's generation_config.json at build/convert time (or read them in genai_config.json) so existing HF models work out of the box.
- (Optional) Support bad_words_ids for multi-token sequence banning.
This mirrors HF's SuppressTokensLogitsProcessor, SuppressTokensAtBeginLogitsProcessor, and NoBadWordsLogitsProcessor.
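The two single-token options are plain masks, so the less obvious piece is bad_words_ids. Below is a minimal sketch of the matching logic as I understand HF's behavior: a banned sequence's last token is masked whenever the tokens so far end with the rest of that sequence, and single-token entries are always masked. The function names are hypothetical and this is not a proposed genai interface.

```python
import math


def banned_next_tokens(tokens_so_far, bad_words_ids):
    """Return the token IDs that would complete a banned sequence if emitted next."""
    banned = set()
    for seq in bad_words_ids:
        if not seq:
            continue  # ignore empty entries
        prefix, last = list(seq[:-1]), seq[-1]
        if len(prefix) > len(tokens_so_far):
            continue  # not enough history to match this sequence yet
        # An empty prefix means a single-token entry, which is always banned.
        if not prefix or list(tokens_so_far[-len(prefix):]) == prefix:
            banned.add(last)
    return banned


def apply_bad_words(logits, tokens_so_far, bad_words_ids):
    """Return a copy of logits with sequence-completing tokens set to -inf."""
    out = list(logits)
    for t in banned_next_tokens(tokens_so_far, bad_words_ids):
        out[t] = -math.inf
    return out


print(banned_next_tokens([1, 4], [[4, 5], [9]]))  # {5, 9}
print(banned_next_tokens([1, 2], [[4, 5]]))       # set()
```

Whether matching should consider the prompt tokens or only generated tokens is left open here; the sketch simply matches against whatever list it is given.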
Environment
- onnxruntime-genai 0.14.0-dev (commit 9b875c3)
- Observed while exporting/validating google/gemma-4-* (and applies to Whisper).