olmocr.bench scoring: `partial_ratio` falsely matches when candidate is near-empty (e.g. single `\n`) #461

## Summary

`TextPresenceTest.run` in [`olmocr/bench/tests.py`](https://github.com/allenai/olmocr/blob/main/olmocr/bench/tests.py) inverts its decision when the candidate's `md_content` is much shorter than the test's `text` query. A candidate that produced **literally no text** (e.g. a one-byte file containing only `"\n"`) makes most `present` tests **falsely pass** and most `absent` tests **falsely fail**, inflating overall scores by ~5x against the legitimate empty-output baseline.

The bug is silent and load-bearing: it makes the bench unable to distinguish "produced perfect markdown" from "produced nothing at all" for any natural-language query that happens to contain a space — which is essentially all `present` tests in `old_scans.jsonl`.

## Root cause

Three steps interact:

1. **Splitter / pipeline emits an effectively empty candidate file.** Common in real workloads when a parser has no text layer + no working OCR for a page. In our case the file is exactly `"\n"`.
2. **`normalize_text("\n")` returns `" "`** (a single space). The whitespace-collapsing regex `re.sub(r"\s+", " ", md_content)` turns the lone newline into a lone space.
3. **`fuzz.partial_ratio(query, " ")` returns a perfect score** (`100`, i.e. `1.0` once divided by 100) for any `query` that contains a space character. `partial_ratio` aligns the *shorter* of its two arguments against the best-matching substring of the longer one, so a single-character md whose character appears verbatim somewhere in the query scores a perfect match. With `max_diffs=0` the threshold is `1.0`, so:
   - `PRESENT` on a space-containing query → falsely **PASS**
   - `ABSENT` on a space-containing query → falsely **FAIL**

Quick demonstration (any rapidfuzz install):

```python
from rapidfuzz import fuzz
fuzz.partial_ratio("an expression of good will from you", " ") / 100  # → 1.0
fuzz.partial_ratio("TELEPHONE 478",                       " ") / 100  # → 1.0
fuzz.partial_ratio("4201",                                " ") / 100  # → 0.0  (no space in query — control)
```
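The three steps can also be traced end to end without rapidfuzz installed. The sketch below is an illustration only: `normalize_ws` reuses the whitespace regex quoted in step 2 but is not olmocr's full `normalize_text`, `windowed_ratio` is a simplified `difflib`-based stand-in for `fuzz.partial_ratio` (not rapidfuzz's algorithm), and the threshold formula is an assumption chosen to agree with the `max_diffs=0` → `1.0` case described above.

```python
import re
from difflib import SequenceMatcher


def normalize_ws(text: str) -> str:
    """Collapse whitespace runs to one space (only the regex from step 2)."""
    return re.sub(r"\s+", " ", text)


def windowed_ratio(query: str, candidate: str) -> float:
    """Best similarity of the shorter string against any equal-length window
    of the longer one, in [0, 1]. Hypothetical stand-in for partial_ratio."""
    shorter, longer = sorted((query, candidate), key=len)
    if not shorter:
        return 0.0
    width = len(shorter)
    return max(
        SequenceMatcher(None, shorter, longer[i : i + width]).ratio()
        for i in range(len(longer) - width + 1)
    )


def present_passes(query: str, md_content: str, max_diffs: int = 0) -> bool:
    """Unguarded PRESENT decision: score >= threshold."""
    threshold = 1.0 - max_diffs / max(len(query), 1)  # assumed formula
    return windowed_ratio(query, normalize_ws(md_content)) >= threshold


print(present_passes("an expression of good will from you", "\n"))  # True (false pass)
print(present_passes("4201", "\n"))  # False (no space in the query)
```

The stand-in reproduces the failure for the same reason as the real scorer: once the candidate collapses to one character, any window of the query equal to that character is a perfect match.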

## Reproduction

A self-contained script that uses olmocr's own `TextPresenceTest`:

```python
# repro_partial_ratio_bug.py
from olmocr.bench.tests import TextPresenceTest, TestType

candidate_md = "\n"  # what a downstream pipeline writes for "no extracted text"

present_test = TextPresenceTest(
    pdf="x.pdf", page=1, id="p1", type=TestType.PRESENT.value,
    text="an expression of good will from you for this occasion",
)
absent_test = TextPresenceTest(
    pdf="x.pdf", page=1, id="a1", type=TestType.ABSENT.value,
    text="TELEPHONE 478",
)

print("PRESENT:", present_test.run(candidate_md))  # (True,  '')   ← should be False
print("ABSENT: ", absent_test.run(candidate_md))   # (False, ...)  ← should be True
```

Both decisions are reversed from the test's stated intent.

## Real-world impact

Caught while running [`olmOCR-Bench`](https://huggingface.co/datasets/allenai/olmOCR-bench) `old_scans` (n=98 pages, full set) against a text-layer-only parser (pymupdf4llm) on a container without Tesseract installed. The parser produced 0-byte output for every page; the splitter wrote `"\n"` per file; `olmocr.bench.benchmark` reported an aggregate `pass_rate` of **0.6183**, with **11 of 98 pages scoring a perfect 1.0**.

The legitimate empty-output baseline for `old_scans.jsonl` (computed by hand: `absent_count / total_tests` per page) is **~0.13**. The reported score is roughly **5x inflated** (0.6183 / 0.13 ≈ 4.8).
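The hand computation can be reproduced in a few lines. This is a sketch under assumptions: the record keys `pdf`, `page`, and `type` mirror the `TextPresenceTest` constructor arguments used in the reproduction script, the literal `"absent"` is assumed to be the value of `TestType.ABSENT.value`, every non-`absent` test is assumed to fail on empty output, and `empty_output_baseline` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def empty_output_baseline(tests: list[dict]) -> dict[tuple[str, int], float]:
    """Per-page pass rate if the candidate is empty: absent_count / total_tests."""
    absent = defaultdict(int)
    total = defaultdict(int)
    for t in tests:
        key = (t["pdf"], t["page"])
        total[key] += 1
        if t["type"] == "absent":  # assumed value of TestType.ABSENT.value
            absent[key] += 1
    return {key: absent[key] / total[key] for key in total}


# Toy records, not taken from old_scans.jsonl.
toy = [
    {"pdf": "a.pdf", "page": 1, "type": "present"},
    {"pdf": "a.pdf", "page": 1, "type": "absent"},
    {"pdf": "b.pdf", "page": 1, "type": "present"},
    {"pdf": "b.pdf", "page": 1, "type": "present"},
]
per_page = empty_output_baseline(toy)
print(per_page)                 # {('a.pdf', 1): 0.5, ('b.pdf', 1): 0.0}
print(mean(per_page.values()))  # 0.25
```

For the real file, the records would come from parsing each line of `old_scans.jsonl` with `json.loads`. Averaging the per-page values is one possible aggregation and may not match how `olmocr.bench.benchmark` combines pages.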

After patching `TextPresenceTest.run` with a length guard (PR coming next), the same run on the same data drops from **0.6183** to **0.1749** — matching the empty-output baseline plus a small residual from PDFs that do have partial text layers.

## Why this matters even outside our use case

The bug isn't specific to "parser produced empty output." Any candidate with only a few characters of real content (degraded scans, partial OCR, dropped pages) will inflate against this scoring path. A bench that can't distinguish "no output" from "perfect output" for queries containing a space cannot reliably rank pipelines on degraded inputs — which is exactly the regime `old_scans` is meant to discriminate.

## Proposed fix

Add a length guard to `TextPresenceTest.run` before consulting `partial_ratio`:

```python
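# Inside TextPresenceTest.run, before the partial_ratio call.
# A candidate shorter than the query minus the allowed edits cannot contain it.
min_required = len(reference_query) - self.max_diffs
if len(md_content) < min_required:
    if self.type == TestType.PRESENT.value:
        return False, f"Candidate too short ({len(md_content)} chars) ..."
    else:  # ABSENT
        return True, ""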
```

If the candidate's content is shorter than the query minus allowed edits, the query cannot be plausibly contained — so `present` returns False and `absent` returns True without consulting `partial_ratio`.
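To make the guard's behavior concrete, here is a standalone sketch of the same check outside the class. The function name `length_guard`, the string literals `"present"` / `"absent"` (standing in for the `TestType` values), the shortened explanation message, and the `None` return meaning "fall through to `partial_ratio`" are illustrative assumptions, not olmocr's API.

```python
from typing import Optional, Tuple


def length_guard(
    query: str, md_content: str, max_diffs: int, test_type: str
) -> Optional[Tuple[bool, str]]:
    """Return a final (passed, explanation) verdict when the candidate is too
    short to contain the query, or None to defer to the fuzzy match."""
    min_required = len(query) - max_diffs
    if len(md_content) >= min_required:
        return None  # long enough: let the existing scoring decide
    if test_type == "present":
        return False, f"Candidate too short ({len(md_content)} chars)"
    return True, ""  # absent: a too-short candidate cannot contain the query


print(length_guard("TELEPHONE 478", " ", 0, "present"))  # (False, 'Candidate too short (1 chars)')
print(length_guard("TELEPHONE 478", " ", 0, "absent"))   # (True, '')
print(length_guard("4201", "page 4201 of 9", 0, "present"))  # None
```

Returning `None` for long-enough candidates keeps the guard purely additive: it only overrides the verdict in the regime where `partial_ratio` is known to misbehave.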

A PR with the patch + two regression tests in `tests/test_tests.py` is on its way. The 148 existing tests in `test_tests.py` continue to pass; the two new tests cover empty / `"\n"` / `" "` / short candidates against space-containing queries.

## Affected versions

Reproduced on the current `main` (commit `f7cfe4c`) of `allenai/olmocr`. The relevant code in `olmocr/bench/tests.py:150-182` has been stable for several months, so all recent versions are likely affected.

