GLM-OCR: /chat/completions produces repetition loops on prose; repetition_penalty does not prevent them #1021

@esaruoho

Description


Summary

When OCR'ing a scanned page of flowing English prose through the mlx-vlm server (/chat/completions) with mlx-community/GLM-OCR-bf16, the model locks into a repetition loop after the first ~500 correct characters and re-emits the same short paragraph dozens of times until max_tokens truncates. Setting repetition_penalty on the request does not prevent this.

The same source image processed by the same model weights via Ollama (GGUF quantisation, llama.cpp sampler) produces clean, coherent OCR with no loops — so the model itself and image understanding are fine; the issue is isolated to mlx-vlm's generation / sampling path.

Environment

  • mlx-vlm 0.4.4 (installed from git+https://github.com/Blaizzy/mlx-vlm.git)
  • Python 3.12.13 (Homebrew, macOS arm64)
  • macOS Darwin 24.6.0, Apple Silicon Mac Mini
  • torch 2.11.0 + torchvision 0.26.0 installed (required for GlmOcrProcessor via closed issue GLM-OCR: images not processed (GlmOcrProcessor not loaded by AutoProcessor) #886)
  • Model: mlx-community/GLM-OCR-bf16 (also reproduces with EZCon/GLM-OCR-mlx)

Reproduction

  1. Start the server:

    python -m mlx_vlm.server --trust-remote-code --port 8080
    
  2. Render a page of flowing English prose at 200 DPI to PNG (image-only scanned PDF, typical size ~300–500 KB). A public-domain Project Gutenberg scan reproduces the issue on any page with an all-caps section heading.

  3. POST to /chat/completions:

    {
      "model": "mlx-community/GLM-OCR-bf16",
      "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
        {"type": "text", "text": "OCR this image. Extract ALL text preserving the original formatting, paragraphs, tables, and formulas. Output only the extracted text."}
      ]}],
      "max_tokens": 4096,
      "temperature": 0,
      "repetition_penalty": 1.1
    }
  4. Observe: the first ~500 characters of the response match the page content correctly. Then the model hits a short all-caps heading (e.g. a section header, ~3 words) and begins re-emitting the heading + the subsequent paragraph in a loop until max_tokens.
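
Steps 2–4 can be scripted end-to-end with only the standard library. This is a convenience sketch, not part of the original report: the image path is a placeholder, and the endpoint/model/prompt are copied from the request body above.

```python
import base64
import json
import urllib.request


def build_request(image_path: str, max_tokens: int = 4096,
                  repetition_penalty: float = 1.1) -> dict:
    """Build the /chat/completions payload used in the reproduction."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": "mlx-community/GLM-OCR-bf16",
        "messages": [{"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text",
             "text": "OCR this image. Extract ALL text preserving the "
                     "original formatting, paragraphs, tables, and formulas. "
                     "Output only the extracted text."},
        ]}],
        "max_tokens": max_tokens,
        "temperature": 0,
        "repetition_penalty": repetition_penalty,
    }


if __name__ == "__main__":
    payload = build_request("page.png")  # placeholder: your 200-DPI render
    req = urllib.request.Request(
        "http://localhost:8080/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```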

Measured behaviour

Same 200-DPI rendering, single page of English prose, identical prompt:

| repetition_penalty | temperature | top_p | Output (chars) | Repeated-heading occurrences |
|---|---|---|---|---|
| unset | 0.0 | — | 33,688 | ~100 |
| 1.05 | 0.0 | — | 16,870 | 54 |
| 1.1 | 0.0 | — | 16,870 | 54 |
| 1.3 | 0.3 | — | 16,657 | 79 |
| 1.5 | 0.0 | — | 16,594 | 80 |
| 1.1 | 0.5 | 0.9 | 17,086 | 54 |

Ground truth (Ollama GGUF on same image, glm-ocr:latest, default sampler): 1,603 chars, heading appears once. Complete correct transcription of the page.

Observations:

  • repetition_penalty values from 1.05 to 1.5 do not break the loop on this content
  • Increasing repetition_penalty from 1.1 → 1.5 at temperature=0 actually increased repeat count (54 → 80)
  • Temperature > 0 + top_p 0.9 did not help either
  • The loop output consistently fills close to max_tokens (input ~5000 prompt tokens, output ~4096 generation tokens)
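
The "Repeated-heading occurrences" column was measured by substring counting over whitespace-normalised output. The exact measurement script isn't part of the report; this is an equivalent sketch (the heading text below is a placeholder):

```python
def count_heading_repeats(ocr_output: str, heading: str) -> int:
    """Count occurrences of a heading in OCR output.

    Whitespace is collapsed first so that line-wrapping or spacing
    differences between loop iterations do not hide repeats.
    """
    norm = " ".join(ocr_output.split())
    return norm.count(" ".join(heading.split()))


# A looping transcript re-emits the heading; a clean one contains it once.
looped = "intro text\nCHAPTER ONE\nbody\nCHAPTER ONE\nbody\nCHAPTER  ONE\nbody"
print(count_heading_repeats(looped, "CHAPTER ONE"))  # → 3
```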

Hypothesis

Either:

  1. The repetition_penalty field on /chat/completions is accepted by the request model but not threaded into stream_generate / generate_step for VLM paths, or
  2. It is applied, but only over a limited context window (or via some other codebase convention) that renders it ineffective for this model's tokenisation of repeated all-caps headings, or
  3. GLM-OCR's chat template's pre-filled <think></think> interacts with sampling in a way that bypasses the penalty.

The counterfactual (Ollama/llama.cpp producing clean output from the same weights on the same image) strongly points at the mlx-vlm sampling path as the isolated cause.
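
Note that hypothesis 2 does not even require a bug in the penalty itself: the standard per-token penalty (CTRL/HF-style: divide a positive logit by the penalty, multiply a negative one) can fail to break a loop when the model is sufficiently confident in the repeated token. A toy illustration, assuming that formula — explicitly not mlx-vlm's actual implementation:

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """CTRL/HF-style per-token repetition penalty (illustration only):
    for each token id already generated, divide its logit by `penalty`
    if positive, otherwise multiply it by `penalty`."""
    out = list(logits)
    for tid in set(generated_ids):
        out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
    return out


# Toy vocab: token 0 = the looping heading token, token 1 = the correct
# continuation. Even a strong penalty of 1.5 only scales 10.0 down to
# ~6.67, still above 2.0, so greedy (temperature=0) decoding loops anyway.
logits = [10.0, 2.0]
penalised = apply_repetition_penalty(logits, generated_ids=[0, 0, 0], penalty=1.5)
print(max(range(len(penalised)), key=lambda i: penalised[i]))  # → 0
```

This is consistent with the measurements above: raising the penalty shrinks total output length a little but never changes the argmax inside the loop.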

Why this matters

GLM-OCR via mlx-vlm is otherwise an excellent fit for Apple Silicon OCR pipelines (~2× throughput vs. Ollama on benchmarks). This sampler issue makes it unusable today for anyone processing long-form prose; structured content (receipts, forms, equations) appears to mask the problem because short repetitive headers are rarer there.

Happy to test patches against the same reproduction and report back. I'm also skimming the sampling code to see whether a small PR is feasible; will link if I find something.

Thanks for maintaining this project!
