Implement no_repeat_ngram_size for CPU search#2242
Open
mustjab wants to merge 1 commit into
Open
Conversation
The no_repeat_ngram_size search option was parsed from genai_config.json and settable via set_search_options / OgaGeneratorParamsSetSearchNumber, but no logits processor enforced it, so it had no effect. Greedy decoding could emit unbounded repeated n-grams (e.g. Marian translation looping to max_length on degenerate input). Add a CPU logits processor that bans any token which would complete an already-seen n-gram of the configured size, matching HuggingFace no_repeat_ngram_size semantics. Changes: - search.h: virtual ApplyNoRepeatNgram no-op default (CUDA/beam unaffected), Search_Cpu override. - search.cpp: ApplyNoRepeatNgram implementation. - generators.cpp, engine/request.cpp: invoke after ApplyRepetitionPenalty. - config.h: document the field. - sampling_tests.cpp: NoRepeatNgramCorrectnessCpu unit test.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR makes the existing no_repeat_ngram_size search option actually take effect for CPU decoding by adding a CPU-side logits processor that bans tokens which would complete a previously seen n-gram (matching HuggingFace semantics). This closes a functional gap where the option was parsed and settable but previously had no impact on generation behavior.
Changes:
- Add a
Search::ApplyNoRepeatNgram()virtual hook (default no-op) and implement the CPU override to ban tokens that would repeat an n-gram. - Invoke
ApplyNoRepeatNgram()during per-token scoring in both the direct generator path and the engine request path. - Add a CPU correctness unit test that forces greedy decoding to choose a fallback token when the highest-scoring token would repeat an n-gram.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| test/sampling_tests.cpp | Adds a CPU unit test validating greedy decoding respects no_repeat_ngram_size. |
| src/search.h | Introduces ApplyNoRepeatNgram() (default no-op) and declares CPU override. |
| src/search.cpp | Implements CPU n-gram blocking by banning tokens that would repeat a seen n-gram. |
| src/generators.cpp | Calls ApplyNoRepeatNgram() in the non-engine generation loop after repetition penalty. |
| src/engine/request.cpp | Calls ApplyNoRepeatNgram() in the engine request generation loop after repetition penalty. |
| src/config.h | Updates documentation comment for no_repeat_ngram_size to reflect actual behavior. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The no_repeat_ngram_size search option was parsed from genai_config.json and settable via set_search_options / OgaGeneratorParamsSetSearchNumber, but no logits processor enforced it, so it had no effect. Greedy decoding could emit unbounded repeated n-grams (e.g. Marian translation looping to max_length on degenerate input).
Add a CPU logits processor that bans any token which would complete an already-seen n-gram of the configured size, matching HuggingFace no_repeat_ngram_size semantics.
Changes: