olmocr.bench scoring: partial_ratio falsely matches when candidate is near-empty (e.g. single \n) #461

Description

@tsensei

Summary

TextPresenceTest.run in olmocr/bench/tests.py inverts its decision when the candidate's md_content is much shorter than the test's text query. A candidate that produced literally no text (e.g. a one-byte file containing only "\n") makes most present tests falsely pass and most absent tests falsely fail, inflating overall scores by ~5x against the legitimate empty-output baseline.

The bug is silent and consequential: the bench cannot distinguish "produced perfect markdown" from "produced nothing at all" for any natural-language query that contains a space, which is essentially every present test in old_scans.jsonl.

Root cause

Three steps interact:

  1. The splitter or pipeline emits an effectively empty candidate file. This is common in real workloads when a page has no text layer and the parser has no working OCR. In our case the file is exactly "\n".
  2. normalize_text("\n") returns " " (a single space). The whitespace-collapsing regex re.sub(r"\s+", " ", md_content) turns the lone newline into a lone space.
  3. fuzz.partial_ratio(query, " ") returns 100 (that is, 1.0 on the 0-1 scale the threshold uses) for any query that contains a space character. partial_ratio aligns the shorter of its two arguments against the best-matching window of the longer one, so a single-character md whose character appears verbatim somewhere in the query scores a perfect match. With max_diffs=0 the threshold is 1.0, so:
    • PRESENT on a space-containing query → falsely PASS
    • ABSENT on a space-containing query → falsely FAIL
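
The three steps can be walked through with the standard library alone. The sketch below is a simplified stand-in, not the olmocr or rapidfuzz code: normalize_whitespace applies only the whitespace-collapsing regex quoted in step 2, simple_partial_ratio is a hypothetical sliding-window scorer on a 0-1 scale, and the threshold formula 1 - max_diffs / len(query) is an assumption chosen to agree with "max_diffs=0 gives 1.0".

```python
import re


def normalize_whitespace(text: str) -> str:
    # Only the whitespace-collapsing step from step 2; the real
    # normalize_text may apply further normalization.
    return re.sub(r"\s+", " ", text)


def lcs_length(a: str, b: str) -> int:
    # Dynamic-programming length of the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def simple_partial_ratio(s1: str, s2: str) -> float:
    # Hypothetical stand-in for fuzz.partial_ratio: slide the shorter
    # string over the longer one and keep the best window score.
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0.0  # arbitrary choice for this sketch
    n = len(shorter)
    return max(
        lcs_length(shorter, longer[i : i + n]) / n
        for i in range(len(longer) - n + 1)
    )


def query_matches(query: str, md_content: str, max_diffs: int = 0) -> bool:
    md = normalize_whitespace(md_content)
    threshold = 1.0 - max_diffs / max(len(query), 1)  # assumed formula
    return simple_partial_ratio(query, md) >= threshold


print(repr(normalize_whitespace("\n")))        # ' '
print(query_matches("TELEPHONE 478", "\n"))    # True  (false match on the space)
print(query_matches("4201", "\n"))             # False (no space in the query)
```

Because the shorter string is a single space, every window comparison reduces to "is this character a space?", which is why any query containing a space reaches the perfect score.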

Quick demonstration (any rapidfuzz install):

from rapidfuzz import fuzz
fuzz.partial_ratio("an expression of good will from you", " ") / 100  # → 1.0
fuzz.partial_ratio("TELEPHONE 478",                       " ") / 100  # → 1.0
fuzz.partial_ratio("4201",                                " ") / 100  # → 0.0  (no space in query — control)

Reproduction

A self-contained script that uses olmocr's own TextPresenceTest:

# repro_partial_ratio_bug.py
from olmocr.bench.tests import TextPresenceTest, TestType

candidate_md = "\n"  # what a downstream pipeline writes for "no extracted text"

present_test = TextPresenceTest(
    pdf="x.pdf", page=1, id="p1", type=TestType.PRESENT.value,
    text="an expression of good will from you for this occasion",
)
absent_test = TextPresenceTest(
    pdf="x.pdf", page=1, id="a1", type=TestType.ABSENT.value,
    text="TELEPHONE 478",
)

print("PRESENT:", present_test.run(candidate_md))  # (True,  '')   ← should be False
print("ABSENT: ", absent_test.run(candidate_md))   # (False, ...)  ← should be True

Both decisions are reversed from the test's stated intent.

Real-world impact

Caught while running olmOCR-Bench old_scans (n=98 pages, full set) against a text-layer-only parser (pymupdf4llm) on a container without Tesseract installed. The parser produced 0-byte output for every page; the splitter wrote "\n" per file; olmocr.bench.benchmark reported an aggregate pass_rate of 0.6183, with 11 of 98 pages scoring a perfect 1.0.

The legitimate empty-output baseline for old_scans.jsonl (computed by hand: absent_count / total_tests per page) is ~0.13. The score is roughly 5x inflated.
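
The hand computation can be sketched as follows. This is an illustration under stated assumptions, not the script used for the figure above: the test records are assumed to be dicts with "pdf" and "type" keys, the type strings "present" and "absent" are assumed, only absent tests are assumed to pass on empty output, and per-page ratios are averaged with equal weight. The sample data is invented.

```python
from collections import defaultdict


def empty_output_baseline(tests):
    # pdf -> [absent_count, total_tests]
    per_page = defaultdict(lambda: [0, 0])
    for test in tests:
        counts = per_page[test["pdf"]]
        counts[1] += 1
        if test["type"] == "absent":
            counts[0] += 1
    # An empty candidate should pass only the absent tests on each page.
    page_scores = [absent / total for absent, total in per_page.values()]
    return sum(page_scores) / len(page_scores)


# Invented toy data: page a has 1 absent test out of 4, page b has 0 out of 2.
toy_tests = [
    {"pdf": "a.pdf", "type": "present"},
    {"pdf": "a.pdf", "type": "present"},
    {"pdf": "a.pdf", "type": "present"},
    {"pdf": "a.pdf", "type": "absent"},
    {"pdf": "b.pdf", "type": "present"},
    {"pdf": "b.pdf", "type": "present"},
]

print(empty_output_baseline(toy_tests))  # 0.125
```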

After patching TextPresenceTest.run with a length guard (PR coming next), the same run on the same data drops from 0.6183 to 0.1749 — matching the empty-output baseline plus a small residual from PDFs that do have partial text layers.

Why this matters even outside our use case

The bug isn't specific to "parser produced empty output." Any candidate with only a few characters of real content (degraded scans, partial OCR, dropped pages) will inflate against this scoring path. A bench that can't distinguish "no output" from "perfect output" for queries containing a space cannot reliably rank pipelines on degraded inputs — which is exactly the regime old_scans is meant to discriminate.

Proposed fix

Add a length guard to TextPresenceTest.run before consulting partial_ratio:

min_required = len(reference_query) - self.max_diffs
if len(md_content) < min_required:
    if self.type == TestType.PRESENT.value:
        return False, f"Candidate too short ({len(md_content)} chars) ..."
    else:  # ABSENT
        return True, ""

If the candidate's content is shorter than the query minus allowed edits, the query cannot be plausibly contained — so present returns False and absent returns True without consulting partial_ratio.
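
To show how the guard changes the outcome, here is a standalone sketch of the same logic outside the class. Everything in it is hypothetical scaffolding: guarded_presence, its parameters, the injected scorer callable, the "present"/"absent" strings, the threshold formula, and the explanation messages are placeholders, not olmocr's actual code or wording.

```python
from typing import Callable, Tuple


def guarded_presence(
    query: str,
    md_content: str,
    test_type: str,
    max_diffs: int,
    scorer: Callable[[str, str], float],
) -> Tuple[bool, str]:
    # Length guard: a candidate shorter than the query minus the allowed
    # edits cannot plausibly contain it, so the scorer is never consulted.
    min_required = len(query) - max_diffs
    if len(md_content) < min_required:
        if test_type == "present":
            return False, "candidate shorter than query"  # placeholder message
        return True, ""

    threshold = 1.0 - max_diffs / max(len(query), 1)  # assumed formula
    matched = scorer(query, md_content) >= threshold
    if test_type == "present":
        return (True, "") if matched else (False, "query not found")
    return (False, "query unexpectedly found") if matched else (True, "")


# Scorer that always reports a perfect match, simulating the false positive.
always_perfect = lambda query, md: 1.0
# Crude exact-substring scorer for the case where the candidate is long enough.
substring_scorer = lambda query, md: 1.0 if query in md else 0.0

print(guarded_presence("TELEPHONE 478", " ", "present", 0, always_perfect))
print(guarded_presence("TELEPHONE 478", " ", "absent", 0, always_perfect))
print(guarded_presence("TELEPHONE 478", "call TELEPHONE 478 now", "present", 0, substring_scorer))
```

With the guard in place, the first two calls return (False, ...) and (True, '') even though the scorer claims a perfect match; the third call has a long enough candidate, so the guard does not fire and the scorer decides.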

A PR with the patch and two regression tests in tests/test_tests.py is on its way. The existing 148 tests in test_tests.py continue to pass; the two new tests cover empty, "\n", " ", and short candidates against space-containing queries.

Affected versions

Reproduced on the current main (commit f7cfe4c) of allenai/olmocr. The relevant code in olmocr/bench/tests.py:150-182 has been stable for several months, so all recent versions are likely affected.
