
fix: preserve all token IDs when multiple BPE tokens decode to same string#1833

Merged
RobinPicard merged 2 commits into dottxt-ai:main from alvinttang:fix/vocab-token-id-overwrite
Mar 26, 2026

Conversation

@alvinttang

Summary

Fixes #1830

Bug: In OutlinesCoreBackend.create_outlines_core_vocabulary(), the line:

formatted_vocab[token_as_str] = [token_id]

overwrites previous entries when multiple BPE tokens decode to the same string via token_to_str(). For Mistral-7B, this silently drops ~17.2% of token IDs from the vocabulary, causing the structured generation FSM to miss valid token transitions.

Fix: Replace the assignment with:

formatted_vocab.setdefault(token_as_str, []).append(token_id)

This accumulates all token IDs that map to the same decoded string, preserving the complete vocabulary for the outlines_core.Vocabulary constructor.
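The difference between the two patterns can be seen in a minimal sketch (the token IDs and decoded strings below are made up for illustration, not taken from a real tokenizer):

```python
# Two distinct BPE token IDs whose decoded form is the same string.
decoded_tokens = [(5, "the"), (7, "dog"), (9, "the")]  # (token_id, decoded_str)

# Buggy version: plain assignment keeps only the last ID seen per string.
buggy = {}
for token_id, token_as_str in decoded_tokens:
    buggy[token_as_str] = [token_id]

# Fixed version: setdefault creates the list once, then every ID is appended.
fixed = {}
for token_id, token_as_str in decoded_tokens:
    fixed.setdefault(token_as_str, []).append(token_id)

print(buggy)  # {'the': [9], 'dog': [7]} (ID 5 was silently dropped)
print(fixed)  # {'the': [5, 9], 'dog': [7]} (all IDs preserved)
```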

Changes

  • outlines/backends/outlines_core.py: One-line fix in create_outlines_core_vocabulary
  • tests/backends/test_outlines_core.py: Two new unit tests:
    • test_create_vocabulary_preserves_duplicate_token_ids — verifies basic vocabulary building with sentencepiece-style tokens
    • test_create_vocabulary_duplicate_decoded_strings — verifies that when token_to_str maps multiple tokens to the exact same string, all IDs are accumulated (the core regression case)
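The core regression case can be sketched as follows; the stub `token_to_str` and the inline reimplementation of the vocabulary loop are illustrative stand-ins, not the actual outlines backend code:

```python
# Stub decoder: sentencepiece-style "▁cat" and plain "cat" both decode to "cat".
def token_to_str(token: str) -> str:
    return token.replace("\u2581", "")  # strip the sentencepiece word-boundary marker

vocab = {"\u2581cat": 11, "cat": 42}  # two token IDs, one decoded string

formatted_vocab: dict[str, list[int]] = {}
for token, token_id in vocab.items():
    formatted_vocab.setdefault(token_to_str(token), []).append(token_id)

# Both IDs must survive under the same decoded string (the regression case).
assert formatted_vocab == {"cat": [11, 42]}
```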

Test plan

  • New unit tests pass (pytest tests/backends/test_outlines_core.py::test_create_vocabulary_preserves_duplicate_token_ids tests/backends/test_outlines_core.py::test_create_vocabulary_duplicate_decoded_strings)
  • Existing backend tests still pass (pytest tests/backends/)
  • Manual verification: load a Mistral-7B tokenizer, count unique token IDs in the formatted vocab before and after the fix — all IDs should be preserved

Contributor

@RobinPicard RobinPicard left a comment


Hi, thanks for opening a PR! Another PR has already been merged to fix this issue though (#1831). What you could do is rebase on main to pick up the fix from the other PR, and then refocus this PR on more thorough testing of the feature, since you have more tests than there are on main.

Review comment on tests/backends/test_outlines_core.py:

    assert "name" in response


    def test_create_vocabulary_preserves_duplicate_token_ids():
Contributor


That's a good test to have, but the test body below does not actually include different tokens that decode to the same string!

@alvinttang
Author

Thanks for the review, Robin! I'll rebase on main to pick up the existing fix and refocus this PR on the improved test coverage. Good catch on the first test's description not matching its content — I'll fix that as well.

@alvinttang alvinttang force-pushed the fix/vocab-token-id-overwrite branch from de2c485 to 2aff36b on March 18, 2026 03:18
@RobinPicard
Contributor

Tests are failing!

…tring

In `create_outlines_core_vocabulary`, the assignment
`formatted_vocab[token_as_str] = [token_id]` overwrites previous entries
when multiple BPE tokens decode to the same string via `token_to_str`.
For Mistral-7B, this silently drops 17.2% of token IDs.

Replace with `formatted_vocab.setdefault(token_as_str, []).append(token_id)`
to accumulate all token IDs for each decoded string.

Fixes dottxt-ai#1830

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@RobinPicard RobinPicard force-pushed the fix/vocab-token-id-overwrite branch from 2aff36b to 05d198e on March 21, 2026 08:08
@RobinPicard RobinPicard merged commit f08900f into dottxt-ai:main Mar 26, 2026
5 of 6 checks passed


Development

Successfully merging this pull request may close these issues:

create_outlines_core_vocabulary silently drops token IDs when multiple tokens decode to the same string