pymupdf markdown mode inlines base64 images into chunk text → oversized chunks & Milvus insert failure #570

### Summary
pymupdf's default `markdown` mode renders images and embeds each one as a base64 `data:image/...` URI **inside the page text**. Unless the caption stage runs, that base64 is never stripped from the text, so it flows into chunks → embeddings → Milvus. Result: bloated chunks, meaningless embeddings, and insert failures.

### Evidence
Indexing a PDF with pymupdf fails at the store stage:
```
grpc: received message larger than max (~147 MB vs 64 MB)  →  StatusCode.RESOURCE_EXHAUSTED
```
This is likely the same root cause as #61 (a single chunk's varchar text exceeding Milvus's 65535-char field limit): both failures stem from base64 image data living in the text.
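
As a rough illustration of how little image data it takes to hit the limit cited in #61, the arithmetic can be sketched as below. This is a back-of-the-envelope sketch, not code from the repository: `base64_len` and `max_raw_bytes` are hypothetical helper names, and it assumes standard padded base64 with the limit counted in characters.

```python
import base64
import math

MILVUS_VARCHAR_LIMIT = 65_535  # field limit cited in #61


def base64_len(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of `raw_bytes` input bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def max_raw_bytes(limit: int = MILVUS_VARCHAR_LIMIT) -> int:
    """Largest payload whose base64 encoding still fits in `limit` characters."""
    return (limit // 4) * 3


# Sanity check of the formula against the standard library encoder.
assert base64_len(1000) == len(base64.b64encode(b"\x00" * 1000))
```

Under these assumptions, a single embedded image of roughly 48 KiB or more already overflows the field on its own, before any surrounding page text or the `data:image/...` prefix is counted.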

### Root cause
`core/indexing/parsers/pdf/pymupdf.py::_extract_markdown` calls `pymupdf4llm.to_markdown(..., embed_images=True, dpi=300)` and appends the full page text (base64 and all). `image_preprocessor.extract_data_uri_image_blocks` only **copies** the data URIs into `ImageBlock`s — it does not remove them from the text. The base64 is removed only if the caption stage runs (substituting `markdown_ref → caption`); with captioning off it persists.

### Suggested fix
- Make pymupdf the lightweight backend: build in `text` mode (`page.get_text()`, no image rendering) — fast, no base64. Image-aware parsing is marker/docling's job.
- And/or: in markdown mode, strip base64 from text (compact placeholder) regardless of whether captioning runs, so raw base64 never reaches chunks.
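
The second option could look roughly like the sketch below. It is a minimal illustration, not the project's code: `strip_data_uri_images`, the `DATA_URI_IMAGE` pattern, and the placeholder format are all hypothetical, and it assumes the inlined images appear as standard markdown image links of the form `![alt](data:image/<fmt>;base64,...)`.

```python
import re

# Matches markdown images whose target is an inline base64 data URI,
# e.g. ![alt](data:image/png;base64,iVBORw0...)
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(\s*data:image/(?P<fmt>[a-zA-Z0-9.+-]+);base64,[A-Za-z0-9+/=\s]+\)"
)


def strip_data_uri_images(text: str) -> str:
    """Replace each inline base64 image with a short numbered placeholder."""
    counter = 0

    def _placeholder(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[image {counter}: {match.group('fmt')}]"

    return DATA_URI_IMAGE.sub(_placeholder, text)
```

Running this on the page text after the `ImageBlock`s have been extracted, and independently of the caption stage, would keep the image position visible in the text while ensuring raw base64 never reaches the chunker. Images referenced by ordinary URLs are left untouched.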

Related: #61, #183, #453. Area: backend (`refactor/hexagonal`).

