
fix: preserve all token IDs when multiple BPE tokens decode to same string#1833

Merged
RobinPicard merged 2 commits into dottxt-ai:main from alvinttang:fix/vocab-token-id-overwrite
Mar 26, 2026

Conversation

@alvinttang

Summary

Fixes #1830

Bug: In OutlinesCoreBackend.create_outlines_core_vocabulary(), the line:

formatted_vocab[token_as_str] = [token_id]

overwrites previous entries when multiple BPE tokens decode to the same string via token_to_str(). For Mistral-7B, this silently drops ~17.2% of token IDs from the vocabulary, causing the structured generation FSM to miss valid token transitions.

Fix: Replace the assignment with:

formatted_vocab.setdefault(token_as_str, []).append(token_id)

This accumulates all token IDs that map to the same decoded string, preserving the complete vocabulary for the outlines_core.Vocabulary constructor.
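The difference between the two patterns can be seen in a minimal sketch (the token IDs and decoded strings below are made up for illustration, not taken from a real tokenizer):

```python
# Two distinct BPE token IDs whose decoded form is the same string.
decoded_tokens = [(5, "the"), (7, "dog"), (9, "the")]  # (token_id, decoded_str)

# Buggy version: plain assignment keeps only the last ID seen per string.
buggy = {}
for token_id, token_as_str in decoded_tokens:
    buggy[token_as_str] = [token_id]

# Fixed version: setdefault creates the list once, then every ID is appended.
fixed = {}
for token_id, token_as_str in decoded_tokens:
    fixed.setdefault(token_as_str, []).append(token_id)

print(buggy)  # {'the': [9], 'dog': [7]} (ID 5 was silently dropped)
print(fixed)  # {'the': [5, 9], 'dog': [7]} (all IDs preserved)
```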

Changes

  • outlines/backends/outlines_core.py: One-line fix in create_outlines_core_vocabulary
  • tests/backends/test_outlines_core.py: Two new unit tests:
    • test_create_vocabulary_preserves_duplicate_token_ids — verifies basic vocabulary building with sentencepiece-style tokens
    • test_create_vocabulary_duplicate_decoded_strings — verifies that when token_to_str maps multiple tokens to the exact same string, all IDs are accumulated (the core regression case)
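The core regression case can be sketched as follows; the stub `token_to_str` and the inline reimplementation of the vocabulary loop are illustrative stand-ins, not the actual outlines backend code:

```python
# Stub decoder: sentencepiece-style "▁cat" and plain "cat" both decode to "cat".
def token_to_str(token: str) -> str:
    return token.replace("\u2581", "")  # strip the sentencepiece word-boundary marker

vocab = {"\u2581cat": 11, "cat": 42}  # two token IDs, one decoded string

formatted_vocab: dict[str, list[int]] = {}
for token, token_id in vocab.items():
    formatted_vocab.setdefault(token_to_str(token), []).append(token_id)

# Both IDs must survive under the same decoded string (the regression case).
assert formatted_vocab == {"cat": [11, 42]}
```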

Test plan

  • New unit tests pass (pytest tests/backends/test_outlines_core.py::test_create_vocabulary_preserves_duplicate_token_ids tests/backends/test_outlines_core.py::test_create_vocabulary_duplicate_decoded_strings)
  • Existing backend tests still pass (pytest tests/backends/)
  • Manual verification: load a Mistral-7B tokenizer, count unique token IDs in the formatted vocab before and after the fix — all IDs should be preserved

Contributor

@RobinPicard RobinPicard left a comment


Hi, thanks for opening a PR! Another PR has already been merged to fix this issue though (#1831). What you could do is rebase on main to pick up the fix from the other PR, and then refocus this PR on more thorough testing of the feature, since you have more tests than there are on main.

Review comment on tests/backends/test_outlines_core.py:

    assert "name" in response


    def test_create_vocabulary_preserves_duplicate_token_ids():
Contributor


That's a good test to have, but the test body below does not actually include different tokens that decode to the same string!

@alvinttang
Author

Thanks for the review, Robin! I'll rebase on main to pick up the existing fix and refocus this PR on the improved test coverage. Good catch on the first test's description not matching its content — I'll fix that as well.

@alvinttang alvinttang force-pushed the fix/vocab-token-id-overwrite branch from de2c485 to 2aff36b on March 18, 2026 03:18
@RobinPicard
Contributor

Tests are failing!

…tring

In `create_outlines_core_vocabulary`, the assignment
`formatted_vocab[token_as_str] = [token_id]` overwrites previous entries
when multiple BPE tokens decode to the same string via `token_to_str`.
For Mistral-7B, this silently drops 17.2% of token IDs.

Replace with `formatted_vocab.setdefault(token_as_str, []).append(token_id)`
to accumulate all token IDs for each decoded string.

Fixes dottxt-ai#1830

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@RobinPicard RobinPicard force-pushed the fix/vocab-token-id-overwrite branch from 2aff36b to 05d198e on March 21, 2026 08:08
@RobinPicard RobinPicard merged commit f08900f into dottxt-ai:main Mar 26, 2026
5 of 6 checks passed


Development

Successfully merging this pull request may close these issues:

create_outlines_core_vocabulary silently drops token IDs when multiple tokens decode to the same string