Summary
When OCR'ing a scanned page of flowing English prose through the mlx-vlm server (/chat/completions) with mlx-community/GLM-OCR-bf16, the model locks into a repetition loop after the first ~500 correct characters and re-emits the same short paragraph dozens of times until max_tokens truncates. Setting repetition_penalty on the request does not prevent this.
The same source image processed by the same model weights via Ollama (GGUF quantisation, llama.cpp sampler) produces clean, coherent OCR with no loops — so the model itself and image understanding are fine; the issue is isolated to mlx-vlm's generation / sampling path.
Environment
mlx-vlm 0.4.4 (installed from git+https://github.com/Blaizzy/mlx-vlm.git; includes the GlmOcrProcessor fix from closed issue #886)
Model: mlx-community/GLM-OCR-bf16 (also reproduces with EZCon/GLM-OCR-mlx)
Reproduction
Start the server. Then render a page of flowing English prose at 200 DPI to PNG (image-only scanned PDF, typical size ~300–500 KB). A public-domain Project Gutenberg scan reproduces the issue on any page with an all-caps section heading.
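A sketch of the setup steps — launching the mlx-vlm server and rasterizing the page. The server flags and the choice of poppler's pdftoppm are assumptions, not taken from this report:

```shell
# 1. Start the mlx-vlm OpenAI-compatible server (flags are assumptions;
#    check `python -m mlx_vlm.server --help` for the real ones)
python -m mlx_vlm.server --port 8080 &

# 2. Rasterize page 1 of the scanned PDF to a 200 DPI PNG named page-1.png
#    (poppler's pdftoppm here; any 200 DPI renderer reproduces the issue)
pdftoppm -png -r 200 -f 1 -l 1 scan.pdf page
```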
POST to /chat/completions:
```json
{
  "model": "mlx-community/GLM-OCR-bf16",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
    {"type": "text", "text": "OCR this image. Extract ALL text preserving the original formatting, paragraphs, tables, and formulas. Output only the extracted text."}
  ]}],
  "max_tokens": 4096,
  "temperature": 0,
  "repetition_penalty": 1.1
}
```
Observe: the first ~500 characters of the response match the page content correctly. Then the model hits a short all-caps heading (e.g. a section header, ~3 words) and begins re-emitting the heading + the subsequent paragraph in a loop until max_tokens.
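The request can be scripted end to end with the standard library only; a minimal client sketch — the server URL is an assumption and should match your launch settings:

```python
import base64
import json
import urllib.request

# Assumed endpoint; adjust host/port to wherever the mlx-vlm server runs
SERVER = "http://localhost:8080/chat/completions"

PROMPT = ("OCR this image. Extract ALL text preserving the original formatting, "
          "paragraphs, tables, and formulas. Output only the extracted text.")

def build_payload(png_bytes: bytes) -> dict:
    """Assemble the /chat/completions request body used in this report."""
    data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": "mlx-community/GLM-OCR-bf16",
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": PROMPT},
        ]}],
        "max_tokens": 4096,
        "temperature": 0,
        "repetition_penalty": 1.1,
    }

def ocr_page(png_bytes: bytes) -> str:
    """POST the page and return the model's transcription."""
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(build_payload(png_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```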
Measured behaviour
Same 200-DPI rendering, single page of English prose, identical prompt:
| repetition_penalty | temperature | top_p | Output (chars) | Repeated-heading occurrences |
|---|---|---|---|---|
| unset | 0.0 | — | 33,688 | ~100 |
| 1.05 | 0.0 | — | 16,870 | 54 |
| 1.1 | 0.0 | — | 16,870 | 54 |
| 1.3 | 0.3 | — | 16,657 | 79 |
| 1.5 | 0.0 | — | 16,594 | 80 |
| 1.1 | 0.5 | 0.9 | 17,086 | 54 |
Ground truth (Ollama GGUF on same image, glm-ocr:latest, default sampler): 1,603 chars, heading appears once. Complete correct transcription of the page.
Observations:
- repetition_penalty values from 1.05 to 1.5 do not break the loop on this content
- Increasing repetition_penalty from 1.1 → 1.5 at temperature=0 actually increased the repeat count (54 → 80)
- temperature > 0 with top_p 0.9 did not help either
- The loop output consistently fills close to max_tokens (input ~5000 prompt tokens, output ~4096 generation tokens)
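The repeated-heading counts above were presumably obtained by counting occurrences of the heading string in each response; a minimal sketch (the heading text is a placeholder, since the actual page content is not in this report):

```python
def count_heading_repeats(output: str, heading: str) -> int:
    """Count how many times the looping heading appears in the OCR output.

    str.count counts non-overlapping occurrences, which matches a heading
    emitted once per loop iteration in the transcript.
    """
    return output.count(heading)

# A synthetic looped transcript: heading + paragraph repeated 54 times
looped = "SECTION HEADING\nThe paragraph that keeps re-emitting.\n" * 54
```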
Hypothesis
Either:
1. The repetition_penalty field on /chat/completions is accepted by the request model but not threaded into stream_generate / generate_step for VLM paths, or
2. it is applied, but under a codebase convention (e.g. a limited repetition context window) that renders it ineffective for this model's tokenisation of repeated all-caps headings, or
3. GLM-OCR's chat template's pre-filled <think></think> interacts with sampling in a way that bypasses the penalty.
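On hypothesis 2: samplers in the mlx ecosystem (e.g. mlx-lm's make_repetition_penalty) conventionally apply the penalty only to tokens within a recent context window, by default around 20 tokens. If the loop's period (heading + paragraph) is longer than that window, the repeated tokens are never penalised. A pure-Python sketch of that CTRL-style penalty, with the window default as an assumption:

```python
def apply_repetition_penalty(logits, generated_ids, penalty, context_size=20):
    """CTRL-style repetition penalty over the last `context_size` tokens.

    Positive logits are divided by `penalty`, negative logits multiplied,
    so any token inside the window becomes less likely. Tokens that last
    appeared more than `context_size` steps ago are untouched -- which is
    how a loop with a long period can survive even penalty=1.5.
    """
    out = list(logits)
    for tok in set(generated_ids[-context_size:]):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

If mlx-vlm threads the request field into such a sampler at all, verifying that the field reaches the sampler and whether the context window covers the loop period would be the first things to check.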
The counterfactual (Ollama/llama.cpp producing clean output from the same weights on the same image) strongly points at the mlx-vlm sampling path as the isolated cause.
Why this matters
GLM-OCR via mlx-vlm is otherwise an excellent fit for Apple Silicon OCR pipelines (~2× throughput vs. Ollama on benchmarks). This sampler issue makes it unusable today for anyone processing long-form prose; structured content (receipts, forms, equations) appears to mask the problem because short repetitive headers are rarer there.
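Until the sampler path is fixed, affected pipelines can at least detect and truncate the loop in post-processing. A stopgap sketch — it would also cut legitimately repeated paragraphs, so it is not a real fix:

```python
def truncate_at_loop(text: str) -> str:
    """Cut OCR output at the first paragraph already emitted verbatim.

    Handles the observed failure mode (heading + paragraph re-emitted in a
    cycle) regardless of the loop's period. Stopgap only: text the model
    never produced cannot be recovered, and genuine repeats are lost too.
    """
    seen, kept = set(), []
    for para in text.split("\n\n"):
        if para in seen:
            break
        seen.add(para)
        kept.append(para)
    return "\n\n".join(kept)
```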
Happy to test patches against the same reproduction and report back. I'm also skimming the sampling code to see whether a small PR is feasible; will link if I find something.
Related
#886 — GLM-OCR: images not processed (GlmOcrProcessor not loaded by AutoProcessor): same model, different failure mode. Resolved by installing torch + torchvision, but it exposed this downstream sampler issue.
Thanks for maintaining this project!