Summary
TextPresenceTest.run in olmocr/bench/tests.py inverts its decision when the candidate's md_content is much shorter than the test's text query. A candidate that produced literally no text (e.g. a one-byte file containing only "\n") makes most present tests falsely pass and most absent tests falsely fail, inflating overall scores by ~5x against the legitimate empty-output baseline.
The bug is silent and load-bearing: it makes the bench unable to distinguish "produced perfect markdown" from "produced nothing at all" for any natural-language query that happens to contain a space — which is essentially all present tests in old_scans.jsonl.
Root cause
Three steps interact:
- The splitter / pipeline emits an effectively empty candidate file. This is common in real workloads when a parser finds no text layer and has no working OCR for a page. In our case the file is exactly "\n".
- normalize_text("\n") returns " " (a single space). The whitespace-collapsing regex re.sub(r"\s+", " ", md_content) turns the lone newline into a lone space.
- fuzz.partial_ratio(query, " ") returns 100 (1.0 once divided by 100, as in the demonstration below) for any query that contains a space character. partial_ratio aligns the shorter of its two arguments against the longer one, so a single-character md whose character appears verbatim somewhere in the query scores a perfect match. With max_diffs=0 the threshold is 1.0, so:
  - PRESENT on a space-containing query → falsely PASS
  - ABSENT on a space-containing query → falsely FAIL
Quick demonstration (any rapidfuzz install):
from rapidfuzz import fuzz
fuzz.partial_ratio("an expression of good will from you", " ") / 100 # → 1.0
fuzz.partial_ratio("TELEPHONE 478", " ") / 100 # → 1.0
fuzz.partial_ratio("4201", " ") / 100 # → 0.0 (no space in query — control)
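For readers without rapidfuzz installed, the same effect can be sketched with the standard library alone. This is only an illustration: collapse_whitespace mirrors the regex quoted above, while sliding_window_ratio is a hypothetical, simplified stand-in for partial matching built on difflib, not rapidfuzz's actual algorithm.

```python
import re
from difflib import SequenceMatcher


def collapse_whitespace(text: str) -> str:
    """Apply the whitespace-collapsing regex quoted above (no stripping)."""
    return re.sub(r"\s+", " ", text)


def sliding_window_ratio(query: str, candidate: str) -> float:
    """Simplified stand-in for partial matching (assumption, not rapidfuzz).

    Slides the shorter string across the longer one and returns the best
    similarity in [0, 1] over windows of the shorter string's length.
    """
    shorter, longer = sorted((query, candidate), key=len)
    if not shorter:
        return 0.0
    width = len(shorter)
    return max(
        SequenceMatcher(None, shorter, longer[i:i + width]).ratio()
        for i in range(len(longer) - width + 1)
    )


candidate = collapse_whitespace("\n")  # the lone newline becomes a lone space
print(repr(candidate))                                   # ' '
print(sliding_window_ratio("TELEPHONE 478", candidate))  # 1.0 (query has a space)
print(sliding_window_ratio("4201", candidate))           # 0.0 (no space, control)
```

Because the score is taken over windows the size of the shorter string, a one-character candidate only has to appear once in the query to reach the maximum score.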
Reproduction
A self-contained script that uses olmocr's own TextPresenceTest:
# repro_partial_ratio_bug.py
from olmocr.bench.tests import TextPresenceTest, TestType

candidate_md = "\n"  # what a downstream pipeline writes for "no extracted text"

present_test = TextPresenceTest(
    pdf="x.pdf", page=1, id="p1", type=TestType.PRESENT.value,
    text="an expression of good will from you for this occasion",
)
absent_test = TextPresenceTest(
    pdf="x.pdf", page=1, id="a1", type=TestType.ABSENT.value,
    text="TELEPHONE 478",
)

print("PRESENT:", present_test.run(candidate_md))  # (True, '') ← should be False
print("ABSENT: ", absent_test.run(candidate_md))   # (False, ...) ← should be True
Both decisions are reversed from the test's stated intent.
Real-world impact
Caught while running olmOCR-Bench old_scans (n=98 pages, full set) against a text-layer-only parser (pymupdf4llm) on a container without Tesseract installed. The parser produced 0-byte output for every page; the splitter wrote "\n" per file; olmocr.bench.benchmark reported an aggregate pass_rate of 0.6183, with 11 of 98 pages scoring a perfect 1.0.
The legitimate empty-output baseline for old_scans.jsonl (computed by hand: absent_count / total_tests per page) is ~0.13. The score is roughly 5x inflated.
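The hand computation of the baseline can be sketched as a small function. This is a rough sketch under assumptions: the field names pdf, page and type are borrowed from the TextPresenceTest constructor in the reproduction, the type value "absent" and the averaging of per-page ratios over pages are guesses about how the bench aggregates, and the toy records are made up for illustration.

```python
from collections import defaultdict


def empty_output_baseline(tests: list[dict]) -> float:
    """Average, over pages, of absent_count / total_tests.

    Assumes only "absent" tests can legitimately pass on empty output.
    """
    per_page = defaultdict(lambda: [0, 0])  # (pdf, page) -> [absent, total]
    for t in tests:
        counts = per_page[(t["pdf"], t["page"])]
        counts[0] += t["type"] == "absent"
        counts[1] += 1
    if not per_page:
        return 0.0
    return sum(absent / total for absent, total in per_page.values()) / len(per_page)


# Toy data (hypothetical): one page with 1 absent test out of 4, one with 0 out of 2.
toy_tests = [
    {"pdf": "a.pdf", "page": 1, "type": "present"},
    {"pdf": "a.pdf", "page": 1, "type": "present"},
    {"pdf": "a.pdf", "page": 1, "type": "present"},
    {"pdf": "a.pdf", "page": 1, "type": "absent"},
    {"pdf": "b.pdf", "page": 1, "type": "present"},
    {"pdf": "b.pdf", "page": 1, "type": "present"},
]
print(empty_output_baseline(toy_tests))  # 0.125
```

Any reported score well above this figure for a candidate known to be empty points at false passes rather than real extraction quality.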
After patching TextPresenceTest.run with a length guard (PR coming next), the same run on the same data drops from 0.6183 to 0.1749, which is consistent with the ~0.13 empty-output baseline plus a small residual from PDFs that do have partial text layers.
Why this matters even outside our use case
The bug isn't specific to "parser produced empty output." Any candidate with only a few characters of real content (degraded scans, partial OCR, dropped pages) will inflate against this scoring path. A bench that can't distinguish "no output" from "perfect output" for queries containing a space cannot reliably rank pipelines on degraded inputs — which is exactly the regime old_scans is meant to discriminate.
Proposed fix
Add a length guard to TextPresenceTest.run before consulting partial_ratio:
min_required = len(reference_query) - self.max_diffs
if len(md_content) < min_required:
    if self.type == TestType.PRESENT.value:
        return False, f"Candidate too short ({len(md_content)} chars) ..."
    else:  # ABSENT
        return True, ""
If the candidate's content is shorter than the query minus the allowed edits, no substring of it can match the query within max_diffs edits, so present returns False and absent returns True without consulting partial_ratio.
PR with the patch + two regression tests in tests/test_tests.py is on its way. Existing 148 tests in test_tests.py continue to pass; the two new tests cover empty / "\n" / " " / short candidates against space-containing queries.
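The kind of cases such regression tests might cover can be sketched against a standalone version of the guard. This is not the patch itself: length_guard is a hypothetical free function, the "present" / "absent" strings stand in for the TestType values, and the failure message is a placeholder.

```python
from typing import Optional, Tuple


def length_guard(
    query: str, candidate: str, max_diffs: int, test_type: str
) -> Optional[Tuple[bool, str]]:
    """Return an early verdict if candidate is too short to contain query, else None."""
    min_required = len(query) - max_diffs
    if len(candidate) < min_required:
        if test_type == "present":
            return False, f"Candidate too short ({len(candidate)} chars)"
        return True, ""  # absent: the query cannot be in there
    return None  # long enough: defer to the fuzzy match


query = "an expression of good will from you"
for short in ["", "\n", " ", "good"]:
    # Too-short candidates must fail PRESENT and pass ABSENT.
    assert length_guard(query, short, 0, "present")[0] is False
    assert length_guard(query, short, 0, "absent") == (True, "")

# A candidate at least as long as the query is left to partial_ratio.
assert length_guard("4201", "page 4201 of 9000", 0, "present") is None
```

Returning None for long-enough candidates keeps the guard purely additive: existing behaviour is untouched whenever the candidate could plausibly contain the query.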
Affected versions
Reproduced on the current main (commit f7cfe4c) of allenai/olmocr. The relevant code in olmocr/bench/tests.py:150-182 has been stable for several months, so all recent versions are likely affected.