Summary
The indexing pipeline runs the parse stage with no timeout, so a slow or wedged parse can stall indexing indefinitely.
Root cause
services/workers/indexer_pool.py calls build_indexing_pipeline(...) without timeouts=, so PipelineTimeouts.parse defaults to None and parse_stage(..., timeout=None) never bounds the parse.
- marker is self-protected (its pool has its own MARKER_TIMEOUT + retry), so it survives a bad PDF.
- pymupdf/docling have no parse timeout. pymupdf additionally serializes all work on a single shared worker thread (_PYMUPDF_EXECUTOR, max_workers=1), so one wedged parse blocks every subsequent pymupdf parse, and the whole backend appears frozen.
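The head-of-line blocking described in the pymupdf bullet can be reproduced in isolation. This is a minimal sketch, not the project's code: `executor` is a hypothetical stand-in for _PYMUPDF_EXECUTOR, and `wedged_parse` / `quick_parse` are illustrative placeholders for real parse calls.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a shared single-thread executor (max_workers=1).
executor = ThreadPoolExecutor(max_workers=1)
release = threading.Event()


def wedged_parse():
    # Simulates a parse that hangs until something external releases it.
    release.wait()
    return "wedged done"


def quick_parse():
    # Simulates a healthy parse that would finish immediately.
    return "quick done"


first = executor.submit(wedged_parse)
second = executor.submit(quick_parse)  # queued behind the wedged job

time.sleep(0.2)
# The quick job has not even started: the only worker is still occupied.
blocked_while_wedged = not second.done()

release.set()  # un-wedge the first job so the queue can drain
quick_result = second.result(timeout=5)
executor.shutdown()
```

With a single worker, the second job cannot run until the first one returns, no matter how cheap it is.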
Suggested fix
Wire a parse-stage timeout so a pathological PDF fails that file (and is reported) instead of stalling the pipeline. Consider per-backend defaults.
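One possible shape for this, as a sketch under stated assumptions rather than the actual pipeline API: `DEFAULT_PARSE_TIMEOUTS`, `ParseOutcome`, and `parse_with_timeout` are hypothetical names, and the timeout values are placeholders, not measured recommendations.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical per-backend defaults in seconds (placeholder values).
DEFAULT_PARSE_TIMEOUTS: Dict[str, Optional[float]] = {
    "pymupdf": 60.0,
    "docling": 300.0,
}


@dataclass
class ParseOutcome:
    path: str
    ok: bool
    text: Optional[str] = None
    error: Optional[str] = None


def parse_with_timeout(
    parse_fn: Callable[[str], str],
    path: str,
    backend: str,
    executor: ThreadPoolExecutor,
    timeouts: Dict[str, Optional[float]] = DEFAULT_PARSE_TIMEOUTS,
) -> ParseOutcome:
    """Run parse_fn(path) and fail only this file if it exceeds the backend's bound."""
    timeout = timeouts.get(backend)  # None keeps the current unbounded behaviour
    future = executor.submit(parse_fn, path)
    try:
        return ParseOutcome(path, True, text=future.result(timeout=timeout))
    except FutureTimeout:
        future.cancel()  # only has an effect if the job has not started yet
        return ParseOutcome(path, False, error=f"{backend} parse exceeded {timeout}s")
    except Exception as exc:  # report ordinary parse failures per file as well
        return ParseOutcome(path, False, error=repr(exc))


# Demo with stand-in parsers: one hangs, one returns immediately.
stop = threading.Event()


def slow_parse(path: str) -> str:
    stop.wait()
    return "never used"


def fast_parse(path: str) -> str:
    return f"text of {path}"


demo_timeouts = {"pymupdf": 0.1, "docling": 5.0}
with ThreadPoolExecutor(max_workers=2) as pool:
    bad = parse_with_timeout(slow_parse, "bad.pdf", "pymupdf", pool, demo_timeouts)
    good = parse_with_timeout(fast_parse, "good.pdf", "docling", pool, demo_timeouts)
    stop.set()  # let the abandoned worker thread finish so the pool can exit
```

Note that a timeout on `future.result` only stops the caller from waiting; Python cannot forcibly stop the worker thread. With a max_workers=1 executor the wedged parse would still occupy the only worker, so a complete fix probably also needs the parse to run somewhere killable (for example a subprocess) or a way to replace the stuck executor.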
Area: backend (refactor/hexagonal).