Summary
pymupdf's default markdown mode renders images and embeds each one as a base64 data:image/... URI inside the page text. That base64 is never stripped from the text, so it flows into chunks → embeddings → Milvus. Result: bloated chunks, meaningless embeddings, and insert failures.
Evidence
Indexing a PDF with pymupdf fails at the store stage:
grpc: received message larger than max (~147 MB vs 64 MB) → StatusCode.RESOURCE_EXHAUSTED
Likely the same root cause as #61 (a single chunk's varchar text exceeding Milvus's 65535-char field limit) — both stem from base64 image data living in the text.
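A quick arithmetic check shows why one embedded image is enough to break the field limit. Base64 emits 4 characters for every 3 input bytes, so the sketch below (stdlib only) computes the encoded length of a payload. The 49152-byte payload and the helper name `base64_length` are illustrative assumptions, not values taken from the failing PDF.

```python
import base64

MILVUS_VARCHAR_LIMIT = 65535  # field limit quoted in this issue


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of raw data."""
    return 4 * ((n_bytes + 2) // 3)


# Hypothetical 48 KiB image payload (all zero bytes, contents do not matter).
payload = bytes(49152)
encoded = base64.b64encode(payload)

# The closed-form length agrees with the real encoder.
assert len(encoded) == base64_length(len(payload))

# A 48 KiB image alone already exceeds the varchar limit.
over_limit = len(encoded) > MILVUS_VARCHAR_LIMIT
```

Images rendered at dpi=300 are typically far larger than 48 KiB, so a page with several of them could plausibly account for the message size seen above.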
Root cause
core/indexing/parsers/pdf/pymupdf.py::_extract_markdown calls pymupdf4llm.to_markdown(..., embed_images=True, dpi=300) and appends the full page text (base64 and all). image_preprocessor.extract_data_uri_image_blocks only copies the data URIs into ImageBlocks — it does not remove them from the text. The base64 is removed only if the caption stage runs (substituting markdown_ref → caption); with captioning off it persists.
Suggested fix
- Make pymupdf the lightweight backend: build in text mode (page.get_text(), no image rendering), which is fast and produces no base64. Image-aware parsing is marker/docling's job.
- And/or: in markdown mode, strip base64 from text (compact placeholder) regardless of whether captioning runs, so raw base64 never reaches chunks.
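The second option could look roughly like the sketch below. It assumes the embedded images appear as standard markdown image links with a `data:image/...;base64,` target. The function name `strip_data_uri_images` and the placeholder format are hypothetical and not part of the existing codebase.

```python
import re

# Matches markdown images whose target is a base64 data URI, e.g.
# ![alt](data:image/png;base64,iVBORw0...)
_DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<fmt>[a-zA-Z0-9.+-]+);base64,[A-Za-z0-9+/=\s]+\)"
)


def strip_data_uri_images(markdown: str) -> str:
    """Replace each embedded base64 image with a compact numbered placeholder.

    Hypothetical helper: intended to run after the data URIs have been
    copied into ImageBlocks, whether or not captioning is enabled.
    """
    counter = 0

    def _placeholder(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[image {counter}: {match.group('fmt')}]"

    return _DATA_URI_IMAGE.sub(_placeholder, markdown)
```

Running this unconditionally after extract_data_uri_image_blocks would keep the image payloads available in the ImageBlocks while ensuring the chunker only ever sees the short placeholder. If the caption stage looks up its markdown_ref in the text, the placeholder would have to be chosen so that the substitution still finds it.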
Related: #61, #183, #453. Area: backend (refactor/hexagonal).