fix(harness): consolidate duplicated math scoring into shared module by cheese-cakee · Pull Request #383 · Luce-Org/lucebox-hub

cheese-cakee · 2026-06-14T16:59:53Z

Context

_extract_boxed, _normalize_math, and _math_equiv are copy-pasted across three files with no shared import:

server/scripts/bench_llm.py:164-251
harness/client_test_runner.py:1365-1452
harness/benchmarks/generation_benchmark.py:55-119

bench_server.py:38 imports from bench_llm (shared, correct). client_test_runner.py and generation_benchmark.py each hold a fully independent copy.

bench_llm and client_test_runner are functionally identical. generation_benchmark is a stripped-down version that is missing:

| Feature | bench_llm | client_test_runner | generation_benchmark |
| --- | --- | --- | --- |
| Currency `$` strip (`$18` → `18`) | yes | yes | no |
| Unit stripping (cm, m, km, kg, ...) | yes (13 units) | yes (13 units) | no |
| `\frac` whitespace normalization | yes | yes | no |
| Mixed number matching (`1 \frac{1}{2}` → `1.5`) | yes | yes | no |

Bug

generation_benchmark runs benchmark comparisons against a llama.cpp baseline. Because its _normalize_math and _math_equiv are incomplete, it will score identical model outputs differently than the other two paths. Example: input "$18" normalizes to "18" in bench_llm/client_test_runner but stays "$18" in generation_benchmark. Inputs with \frac whitespace or mixed numbers will also disagree on equivalence.
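The disagreement described above can be reproduced with a minimal sketch. Both normalizers below are hypothetical stand-ins written for illustration, not the code from the repository, and the unit list is only the subset named in the table:

```python
import re

# Illustrative subset only; the PR mentions 13 units but does not list them all.
_UNIT_SUFFIX = re.compile(r"(?<=\d)\s*(?:cm|km|kg|m)$")


def normalize_full(answer: str) -> str:
    """Hypothetical superset normalizer: strips currency and trailing units."""
    text = answer.strip().replace("$", "")
    return _UNIT_SUFFIX.sub("", text)


def normalize_stripped(answer: str) -> str:
    """Hypothetical stripped-down normalizer: whitespace trimming only."""
    return answer.strip()


def agree(normalize, predicted: str, expected: str) -> bool:
    """Score a prediction by comparing normalized strings."""
    return normalize(predicted) == normalize(expected)


print(agree(normalize_full, "$18", "18"))      # True
print(agree(normalize_stripped, "$18", "18"))  # False
```

The same model output is graded correct by one path and incorrect by the other, which is the scoring disagreement this PR removes.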

Fix

Consolidate into a single shared module (harness/math_scoring.py) with the superset implementation. Import it in the three consumers and delete their local copies.

Changes

Added harness/math_scoring.py — canonical home for _extract_boxed, _normalize_math, _math_equiv
server/scripts/bench_llm.py — removed local definitions, imports from math_scoring
harness/client_test_runner.py — removed local definitions, imports from math_scoring
harness/benchmarks/generation_benchmark.py — removed local definitions, imports from math_scoring

Verification

python -c "from math_scoring import _extract_boxed, _normalize_math, _math_equiv" — imports resolve
bench_server.py import chain (bench_llm → math_scoring) verified
No local function definitions remain in any consumer file
-264 lines across the codebase (net removal of duplicated code)
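One way to check the "no local definitions remain" claim mechanically is to parse each consumer file and look for function definitions with the shared names. This is a sketch of such a check, not the procedure the author actually used; `local_definitions` is a hypothetical helper:

```python
import ast

SHARED_NAMES = {"_extract_boxed", "_normalize_math", "_math_equiv"}


def local_definitions(source: str, names=frozenset(SHARED_NAMES)) -> set:
    """Return the shared names that are still defined as functions in source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name in names
    }


# A consumer that only imports the helper has nothing left to delete.
clean = "from math_scoring import _math_equiv\n"
# A consumer that still carries a local copy is flagged.
stale = "def _math_equiv(a, b):\n    return a == b\n"

print(local_definitions(clean))  # set()
print(local_definitions(stale))  # {'_math_equiv'}
```

Running it over the contents of the three consumer files would be expected to return an empty set for each after this change.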

Notes

generation_benchmark now gets the full normalization stack (currency, units, \frac, mixed numbers) — this is a behavior change for that script, but it was a bug (scoring disagreement)
No tests exist for these functions — a follow-up could add unit tests for math_scoring.py
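A follow-up test file could start from something like the sketch below. The stub stands in for the real `_math_equiv` so the example runs on its own; in the repository the stub would be replaced by an import from `math_scoring`, and the cases shown are assumptions drawn from the feature table rather than verified behavior:

```python
import unittest


def _math_equiv_stub(pred: str, gold: str) -> bool:
    """Hypothetical stand-in; swap for the real import from math_scoring."""
    def clean(s: str) -> str:
        return s.strip().replace("$", "")
    return clean(pred) == clean(gold)


class MathEquivCases(unittest.TestCase):
    # staticmethod keeps the function from being bound as an instance method.
    equiv = staticmethod(_math_equiv_stub)

    def test_currency_prefix_is_ignored(self):
        self.assertTrue(self.equiv("$18", "18"))

    def test_different_values_do_not_match(self):
        self.assertFalse(self.equiv("18", "19"))


def run_suite() -> unittest.TestResult:
    """Load and run the cases, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MathEquivCases)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the function under test as a class attribute makes it easy to point the same cases at the shared implementation once it is importable.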

Closes #382

_extract_boxed, _normalize_math, and _math_equiv were copy-pasted across three files with no shared import. generation_benchmark's copy was missing currency/unit normalization, \frac whitespace handling, and mixed-number matching — causing benchmark scores to disagree on identical model outputs. Move the superset implementation (from bench_llm.py) into harness/math_scoring.py. Import from the three consumers and delete the local copies.

cubic-dev-ai

2 issues found across 4 files


…alence Identified by cubic: two issues in math scoring. 1. \tfrac and \dfrac were stripped destructively instead of normalizing to \frac, breaking equivalence for valid fractional answers (e.g. \tfrac{1}{2} == 0.5 returned False). 2. Fraction equivalence used unanchored substring matching, allowing composite expressions to be incorrectly graded equivalent to a scalar (e.g. \frac{1}{2} + \frac{3}{4} == 0.5 returned True).
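The two review findings can be illustrated with a short sketch. It is not the code that was merged; the function names are hypothetical, and it only handles integer numerators and denominators. It rewrites `\tfrac` and `\dfrac` to `\frac` instead of deleting them, and it requires the fraction pattern to match the whole expression rather than any substring:

```python
import re
from fractions import Fraction

# The pattern is applied with fullmatch, so it must cover the entire expression.
_FRAC = re.compile(r"\\frac\{(-?\d+)\}\{(-?\d+)\}")


def normalize_frac_variants(expr: str) -> str:
    """Rewrite \\tfrac and \\dfrac to \\frac rather than stripping them."""
    return re.sub(r"\\[td]frac", r"\\frac", expr)


def frac_value(expr: str):
    """Return the value of a lone \\frac{a}{b}, or None for anything else."""
    match = _FRAC.fullmatch(expr.strip())
    if match is None:
        return None
    numerator, denominator = int(match[1]), int(match[2])
    if denominator == 0:
        return None
    return Fraction(numerator, denominator)


def frac_equals_scalar(expr: str, scalar: str) -> bool:
    """Compare a single LaTeX fraction against a decimal or integer string."""
    value = frac_value(normalize_frac_variants(expr))
    if value is None:
        return False
    try:
        return value == Fraction(scalar)
    except ValueError:
        return False


print(frac_equals_scalar(r"\tfrac{1}{2}", "0.5"))               # True
print(frac_equals_scalar(r"\frac{1}{2} + \frac{3}{4}", "0.5"))  # False
```

Using `fullmatch` is what prevents the composite expression from matching on its first term alone.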

client_test_runner.py and generation_benchmark.py imported _normalize_math but only use it transitively via _math_equiv, so ruff F401 (in CI's select=[F,I,UP,B]) failed the lint gate. Drop the unused name from both imports. bench_llm.py keeps it (re-exported to bench_server.py) and is outside ruff's include list. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai Bot reviewed Jun 14, 2026


Comment thread harness/math_scoring.py Outdated

Comment thread harness/math_scoring.py Outdated

cheese-cakee and others added 2 commits June 14, 2026 22:44

davide221 merged commit 9d61083 into Luce-Org:main Jun 15, 2026
5 checks passed

cheese-cakee deleted the fix/math-scoring-duplication branch June 15, 2026 13:46

davide221 merged 3 commits into Luce-Org:main from cheese-cakee:fix/math-scoring-duplication
