pymupdf markdown mode inlines base64 images into chunk text → oversized chunks & Milvus insert failure #570

Description

@andyne13

Summary

pymupdf's default markdown mode renders images and embeds each one as a base64 data:image/... URI inside the page text. That base64 is never stripped from the text, so it flows into chunks → embeddings → Milvus. Result: bloated chunks, meaningless embeddings, and insert failures.

Evidence

Indexing a PDF with pymupdf fails at the store stage:

grpc: received message larger than max (~147 MB vs 64 MB)  →  StatusCode.RESOURCE_EXHAUSTED

Likely the same root cause as #61 (a single chunk's varchar text exceeding Milvus's 65535-char field limit) — both stem from base64 image data living in the text.
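
To confirm that base64 is what inflates a given chunk, a small audit helper could be run on chunk text before the store stage. This is a minimal sketch, not code from the repository: `audit_chunk` and `MILVUS_VARCHAR_LIMIT` are hypothetical names, and the limit is simply the 65535 figure quoted above (if your schema's limit is measured in bytes, compare against `len(text.encode("utf-8"))` instead).

```python
import re

# Limit quoted in this issue; an assumption, adjust to your collection schema.
MILVUS_VARCHAR_LIMIT = 65535

# A data:image/...;base64,... URI and the payload that follows it.
_BASE64_DATA_URI = re.compile(
    r"data:image/[a-zA-Z0-9.+-]+;base64,[A-Za-z0-9+/=]+"
)


def audit_chunk(text: str, limit: int = MILVUS_VARCHAR_LIMIT) -> dict:
    """Report how much of a chunk is inline base64 and whether it exceeds the limit."""
    base64_chars = sum(len(m.group(0)) for m in _BASE64_DATA_URI.finditer(text))
    return {
        "length": len(text),
        "base64_chars": base64_chars,
        "over_limit": len(text) > limit,
    }
```

A chunk whose `base64_chars` is close to its `length` is almost entirely image data, which would point to the same root cause for both failures.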

Root cause

core/indexing/parsers/pdf/pymupdf.py::_extract_markdown calls pymupdf4llm.to_markdown(..., embed_images=True, dpi=300) and appends the full page text (base64 and all). image_preprocessor.extract_data_uri_image_blocks only copies the data URIs into ImageBlocks — it does not remove them from the text. The base64 is removed only if the caption stage runs (substituting markdown_ref → caption); with captioning off it persists.

Suggested fix

  • Make pymupdf the lightweight backend: build in text mode (page.get_text(), no image rendering) — fast, no base64. Image-aware parsing is marker/docling's job.
  • And/or: in markdown mode, strip base64 from text (compact placeholder) regardless of whether captioning runs, so raw base64 never reaches chunks.
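
The second option could look roughly like the sketch below. It assumes the embedded images appear in the page text as markdown image links of the form `![alt](data:image/<fmt>;base64,<payload>)`; `strip_base64_images` and the placeholder format are hypothetical, not existing functions in the codebase.

```python
import re

# Markdown image whose target is an inline base64 data URI (assumed output shape).
_DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(\s*data:image/(?P<fmt>[a-zA-Z0-9.+-]+);base64,[A-Za-z0-9+/=\s]+\)"
)


def strip_base64_images(markdown: str) -> str:
    """Replace each inline base64 image with a compact, numbered placeholder."""
    counter = 0

    def _placeholder(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[image {counter}: {match.group('fmt')}]"

    return _DATA_URI_IMAGE.sub(_placeholder, markdown)
```

Running this on the page text after the data URIs have been copied into ImageBlocks, and before chunking, would keep raw base64 out of chunks whether or not captioning runs. Images referenced by a normal path or URL are left untouched.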

Related: #61, #183, #453. Area: backend (refactor/hexagonal).

Labels: bug
