Indexer parse stage has no timeout — a slow/wedged parse can hang indexing #571

Description

@andyne13

Summary

The indexing pipeline runs the parse stage with no timeout, so a slow or wedged parse can stall indexing indefinitely.

Root cause

services/workers/indexer_pool.py calls build_indexing_pipeline(...) without timeouts=, so PipelineTimeouts.parse defaults to None and parse_stage(..., timeout=None) never bounds the parse.

  • marker is self-protected (its pool has its own MARKER_TIMEOUT + retry), so it survives a bad PDF.
  • pymupdf/docling have no parse timeout. pymupdf additionally serializes all work on a single shared worker thread (_PYMUPDF_EXECUTOR, max_workers=1), so one wedged parse blocks every subsequent pymupdf parse — the whole backend appears frozen.
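The head-of-line blocking described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `executor` is a hypothetical stand-in for `_PYMUPDF_EXECUTOR`, and `slow_parse` / `fast_parse` are illustrative placeholders for real parse calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical stand-in for the shared single-thread executor named in the issue.
executor = ThreadPoolExecutor(max_workers=1)


def slow_parse(seconds):
    """Placeholder for a parse that takes far too long (or never returns)."""
    time.sleep(seconds)
    return "slow done"


def fast_parse():
    """Placeholder for a healthy parse that would normally return at once."""
    return "fast done"


def demo():
    # The slow job takes the only worker thread.
    executor.submit(slow_parse, 0.5)
    # The fast job is queued behind it and cannot start.
    fast = executor.submit(fast_parse)
    try:
        fast.result(timeout=0.1)
        blocked = False
    except FutureTimeout:
        # The fast job did not even begin within 0.1 s.
        blocked = True
    # It only completes once the slow job has released the worker.
    return blocked, fast.result()
```

With a truly wedged parse in place of the 0.5 s sleep, the final `fast.result()` would never return, which matches the "whole backend appears frozen" symptom.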

Suggested fix

Wire a parse-stage timeout so a pathological PDF fails that file (and is reported) instead of stalling the pipeline. Consider per-backend defaults.
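One possible shape for this, as a rough sketch under stated assumptions: `DEFAULT_PARSE_TIMEOUTS`, `timeout_for`, `ParseOutcome`, and `parse_with_timeout` are hypothetical names, the timeout values are placeholders, and the real wiring would presumably go through the `timeouts=` argument of `build_indexing_pipeline(...)` rather than a standalone helper.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical per-backend defaults in seconds; the values are illustrative only.
# None keeps the current unbounded behaviour (marker already has MARKER_TIMEOUT).
DEFAULT_PARSE_TIMEOUTS = {"pymupdf": 60.0, "docling": 300.0, "marker": None}


def timeout_for(backend: str) -> Optional[float]:
    """Look up the assumed default timeout for a backend (None = no bound)."""
    return DEFAULT_PARSE_TIMEOUTS.get(backend)


@dataclass
class ParseOutcome:
    path: str
    ok: bool
    text: Optional[str] = None
    error: Optional[str] = None


async def parse_with_timeout(
    path: str,
    parse_fn: Callable[[str], str],
    timeout: Optional[float],
) -> ParseOutcome:
    """Run a blocking parse in a thread and stop waiting after `timeout` seconds."""
    try:
        text = await asyncio.wait_for(asyncio.to_thread(parse_fn, path), timeout=timeout)
    except asyncio.TimeoutError:
        # Fail only this file and report it; the pipeline can move on.
        return ParseOutcome(path, ok=False, error=f"parse timed out after {timeout}s")
    return ParseOutcome(path, ok=True, text=text)
```

One caveat with this approach: a timeout around a thread only stops the caller from waiting; Python cannot interrupt the thread itself. With a single shared worker such as `_PYMUPDF_EXECUTOR`, a genuinely wedged parse would still occupy that worker after the timeout fires, so the fix may also need a replaceable executor or process-level isolation for pymupdf.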

Area: backend (refactor/hexagonal).

Labels: bug
