Summary
When OCR'ing a scanned page of flowing English prose through the mlx-vlm server (/chat/completions) with mlx-community/GLM-OCR-bf16, the model locks into a repetition loop after the first ~500 correct characters and re-emits the same short paragraph dozens of times until max_tokens truncates. Setting repetition_penalty on the request does not prevent this.
The same source image processed by the same model weights via Ollama (GGUF quantisation, llama.cpp sampler) produces clean, coherent OCR with no loops — so the model itself and image understanding are fine; the issue is isolated to mlx-vlm's generation / sampling path.
Environment
mlx-vlm 0.4.4 (installed from git+https://github.com/Blaizzy/mlx-vlm.git; includes the GlmOcrProcessor fix from closed issue #886)
Model: mlx-community/GLM-OCR-bf16 (also reproduces with EZCon/GLM-OCR-mlx)
Reproduction
Start the server. Then render a page of flowing English prose at 200 DPI to PNG (image-only scanned PDF, typical size ~300–500 KB). A public-domain Project Gutenberg scan reproduces the issue on any page with an all-caps section heading.
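A sketch of the setup steps — launching the mlx-vlm server and rasterizing the page. The server flags and the choice of poppler's pdftoppm are assumptions, not taken from this report:

```shell
# 1. Start the mlx-vlm OpenAI-compatible server (flags are assumptions;
#    check `python -m mlx_vlm.server --help` for the real ones)
python -m mlx_vlm.server --port 8080 &

# 2. Rasterize page 1 of the scanned PDF to a 200 DPI PNG named page-1.png
#    (poppler's pdftoppm here; any 200 DPI renderer reproduces the issue)
pdftoppm -png -r 200 -f 1 -l 1 scan.pdf page
```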
POST to /chat/completions:
```json
{
  "model": "mlx-community/GLM-OCR-bf16",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
    {"type": "text", "text": "OCR this image. Extract ALL text preserving the original formatting, paragraphs, tables, and formulas. Output only the extracted text."}
  ]}],
  "max_tokens": 4096,
  "temperature": 0,
  "repetition_penalty": 1.1
}
```
Observe: the first ~500 characters of the response match the page content correctly. Then the model hits a short all-caps heading (e.g. a section header, ~3 words) and begins re-emitting the heading + the subsequent paragraph in a loop until max_tokens.
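The request can be scripted end to end with the standard library only; a minimal client sketch — the server URL is an assumption and should match your launch settings:

```python
import base64
import json
import urllib.request

# Assumed endpoint; adjust host/port to wherever the mlx-vlm server runs
SERVER = "http://localhost:8080/chat/completions"

PROMPT = ("OCR this image. Extract ALL text preserving the original formatting, "
          "paragraphs, tables, and formulas. Output only the extracted text.")

def build_payload(png_bytes: bytes) -> dict:
    """Assemble the /chat/completions request body used in this report."""
    data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": "mlx-community/GLM-OCR-bf16",
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": PROMPT},
        ]}],
        "max_tokens": 4096,
        "temperature": 0,
        "repetition_penalty": 1.1,
    }

def ocr_page(png_bytes: bytes) -> str:
    """POST the page and return the model's transcription."""
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(build_payload(png_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```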
Measured behaviour
Same 200-DPI rendering, single page of English prose, identical prompt:
| repetition_penalty | temperature | top_p | Output (chars) | Repeated-heading occurrences |
|---|---|---|---|---|
| unset | 0.0 | — | 33,688 | ~100 |
| 1.05 | 0.0 | — | 16,870 | 54 |
| 1.1 | 0.0 | — | 16,870 | 54 |
| 1.3 | 0.3 | — | 16,657 | 79 |
| 1.5 | 0.0 | — | 16,594 | 80 |
| 1.1 | 0.5 | 0.9 | 17,086 | 54 |
Ground truth (Ollama GGUF on same image, glm-ocr:latest, default sampler): 1,603 chars, heading appears once. Complete correct transcription of the page.
Observations:
- repetition_penalty values from 1.05 to 1.5 do not break the loop on this content
- Increasing repetition_penalty from 1.1 → 1.5 at temperature=0 actually increased the repeat count (54 → 80)
- temperature > 0 with top_p 0.9 did not help either
- The loop output consistently fills close to max_tokens (input ~5000 prompt tokens, output ~4096 generation tokens)
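The repeated-heading counts above were presumably obtained by counting occurrences of the heading string in each response; a minimal sketch (the heading text is a placeholder, since the actual page content is not in this report):

```python
def count_heading_repeats(output: str, heading: str) -> int:
    """Count how many times the looping heading appears in the OCR output.

    str.count counts non-overlapping occurrences, which matches a heading
    emitted once per loop iteration in the transcript.
    """
    return output.count(heading)

# A synthetic looped transcript: heading + paragraph repeated 54 times
looped = "SECTION HEADING\nThe paragraph that keeps re-emitting.\n" * 54
```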
Hypothesis
Either:
1. The repetition_penalty field on /chat/completions is accepted by the request model but not threaded into stream_generate / generate_step for VLM paths, or
2. it is applied, but under a codebase convention (e.g. a limited repetition context window) that renders it ineffective for this model's tokenisation of repeated all-caps headings, or
3. GLM-OCR's chat template's pre-filled <think></think> interacts with sampling in a way that bypasses the penalty.
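On hypothesis 2: samplers in the mlx ecosystem (e.g. mlx-lm's make_repetition_penalty) conventionally apply the penalty only to tokens within a recent context window, by default around 20 tokens. If the loop's period (heading + paragraph) is longer than that window, the repeated tokens are never penalised. A pure-Python sketch of that CTRL-style penalty, with the window default as an assumption:

```python
def apply_repetition_penalty(logits, generated_ids, penalty, context_size=20):
    """CTRL-style repetition penalty over the last `context_size` tokens.

    Positive logits are divided by `penalty`, negative logits multiplied,
    so any token inside the window becomes less likely. Tokens that last
    appeared more than `context_size` steps ago are untouched -- which is
    how a loop with a long period can survive even penalty=1.5.
    """
    out = list(logits)
    for tok in set(generated_ids[-context_size:]):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

If mlx-vlm threads the request field into such a sampler at all, verifying that the field reaches the sampler and whether the context window covers the loop period would be the first things to check.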
The counterfactual (Ollama/llama.cpp producing clean output from the same weights on the same image) strongly points at the mlx-vlm sampling path as the isolated cause.
Why this matters
GLM-OCR via mlx-vlm is otherwise an excellent fit for Apple Silicon OCR pipelines (~2× throughput vs. Ollama on benchmarks). This sampler issue makes it unusable today for anyone processing long-form prose; structured content (receipts, forms, equations) appears to mask the problem because short repetitive headers are rarer there.
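Until the sampler path is fixed, affected pipelines can at least detect and truncate the loop in post-processing. A stopgap sketch — it would also cut legitimately repeated paragraphs, so it is not a real fix:

```python
def truncate_at_loop(text: str) -> str:
    """Cut OCR output at the first paragraph already emitted verbatim.

    Handles the observed failure mode (heading + paragraph re-emitted in a
    cycle) regardless of the loop's period. Stopgap only: text the model
    never produced cannot be recovered, and genuine repeats are lost too.
    """
    seen, kept = set(), []
    for para in text.split("\n\n"):
        if para in seen:
            break
        seen.add(para)
        kept.append(para)
    return "\n\n".join(kept)
```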
Happy to test patches against the same reproduction and report back. I'm also skimming the sampling code to see whether a small PR is feasible; will link if I find something.
Related
#886 — GLM-OCR: images not processed (GlmOcrProcessor not loaded by AutoProcessor): same model, different failure mode. Resolved by installing torch + torchvision, but it exposed this downstream sampler issue.
Thanks for maintaining this project!