fix(harness): consolidate duplicated math scoring into shared module #383

Merged: davide221 merged 3 commits into Luce-Org:main from cheese-cakee:fix/math-scoring-duplication on Jun 15, 2026

cheese-cakee (Contributor) commented Jun 14, 2026

Context

_extract_boxed, _normalize_math, and _math_equiv are copy-pasted across three files with no shared import:

  • server/scripts/bench_llm.py:164-251
  • harness/client_test_runner.py:1365-1452
  • harness/benchmarks/generation_benchmark.py:55-119

bench_server.py:38 already imports these functions from bench_llm, which is the correct shared pattern. The copies in client_test_runner.py and generation_benchmark.py are fully independent.

bench_llm and client_test_runner are functionally identical. generation_benchmark is a stripped-down version that is missing:

| Feature | bench_llm | client_test_runner | generation_benchmark |
| --- | --- | --- | --- |
| Currency $ strip ($18 → 18) | yes | yes | no |
| Unit stripping (cm, m, km, kg, ...) | yes (13 units) | yes (13 units) | no |
| \frac whitespace normalization | yes | yes | no |
| Mixed number matching (1 \frac{1}{2} = 1.5) | yes | yes | no |

Bug

generation_benchmark runs benchmark comparisons against a llama.cpp baseline. Because its _normalize_math and _math_equiv are incomplete, it will score identical model outputs differently than the other two paths. Example: input "$18" normalizes to "18" in bench_llm/client_test_runner but stays "$18" in generation_benchmark. Inputs with \frac whitespace or mixed numbers will also disagree on equivalence.
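
The disagreement can be illustrated with a minimal sketch. Both functions below are hypothetical stand-ins and not the code from this PR: `normalize_minimal` mimics a normalizer that lacks the extra steps, while `normalize_full` applies currency stripping, unit stripping (using an illustrative four-unit subset rather than the real 13), and \frac whitespace normalization.

```python
import re


def normalize_minimal(ans: str) -> str:
    """Hypothetical stand-in for the stripped-down copy: trims whitespace only."""
    return ans.strip()


def normalize_full(ans: str) -> str:
    """Hypothetical stand-in for the superset copy (illustrative rules only)."""
    s = ans.strip()
    # Currency: drop escaped and bare dollar signs.
    s = s.replace("\\$", "").replace("$", "")
    # Units: drop a trailing unit that follows a digit (assumed subset of units).
    s = re.sub(r"(?<=\d)\s*(?:cm|km|kg|m)$", "", s)
    # \frac whitespace: collapse spaces around and inside the two brace groups.
    s = re.sub(
        r"\\frac\s*\{\s*([^{}]*?)\s*\}\s*\{\s*([^{}]*?)\s*\}",
        r"\\frac{\1}{\2}",
        s,
    )
    return s.strip()


def scorers_agree(prediction: str, reference: str) -> bool:
    """True when both normalizers reach the same exact-match verdict."""
    minimal = normalize_minimal(prediction) == normalize_minimal(reference)
    full = normalize_full(prediction) == normalize_full(reference)
    return minimal == full
```

With these stand-ins, a prediction of "$18" against a reference of "18" is a match under `normalize_full` and a miss under `normalize_minimal`, which is the kind of scoring split described above.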

Fix

Consolidate into a single shared module (harness/math_scoring.py) with the superset implementation. Import from the three consumers. Delete the local copies.

Changes

  • Added harness/math_scoring.py — canonical home for _extract_boxed, _normalize_math, _math_equiv
  • server/scripts/bench_llm.py — removed local definitions, imports from math_scoring
  • harness/client_test_runner.py — removed local definitions, imports from math_scoring
  • harness/benchmarks/generation_benchmark.py — removed local definitions, imports from math_scoring

Verification

  • python -c "from math_scoring import _extract_boxed, _normalize_math, _math_equiv" — imports resolve
  • bench_server.py import chain (bench_llm → math_scoring) verified
  • No local function definitions remain in any consumer file
  • -264 lines across the codebase (net removal of duplicated code)
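
The "no local definitions remain" check could be automated along these lines. This is a sketch under assumptions, not the procedure used for this PR: `local_redefinitions` is a hypothetical helper that parses a consumer file's source with the standard `ast` module and reports any of the three shared names that are still defined locally.

```python
import ast

# The three names that should now live only in harness/math_scoring.py.
SHARED_NAMES = {"_extract_boxed", "_normalize_math", "_math_equiv"}


def local_redefinitions(source: str) -> set[str]:
    """Return the shared names that `source` still defines as functions."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name in SHARED_NAMES
    }
```

Running it over the text of each consumer file should return an empty set once the local copies are gone; importing a name does not count as a definition, so the new `from math_scoring import ...` lines are not flagged.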

Notes

  • generation_benchmark now gets the full normalization stack (currency, units, \frac, mixed numbers) — this is a behavior change for that script, but it was a bug (scoring disagreement)
  • No tests exist for these functions — a follow-up could add unit tests for math_scoring.py

Closes #382

_extract_boxed, _normalize_math, and _math_equiv were copy-pasted
across three files with no shared import. generation_benchmark's copy
was missing currency/unit normalization, \frac whitespace handling,
and mixed-number matching — causing benchmark scores to disagree on
identical model outputs.

Move the superset implementation (from bench_llm.py) into
harness/math_scoring.py. Import from the three consumers and delete
the local copies.

cubic-dev-ai (Bot, Contributor) left a comment

2 issues found across 4 files


Comment thread harness/math_scoring.py Outdated
Comment thread harness/math_scoring.py Outdated
cheese-cakee and others added 2 commits June 14, 2026 22:44
…alence

Identified by cubic: two issues in math scoring.

1. \tfrac and \dfrac were stripped destructively instead of normalizing
   to \frac, breaking equivalence for valid fractional answers
   (e.g. \tfrac{1}{2} == 0.5 returned False).

2. Fraction equivalence used unanchored substring matching, allowing
   composite expressions to be incorrectly graded equivalent to a
   scalar (e.g. \frac{1}{2} + \frac{3}{4} == 0.5 returned True).
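
Both issues can be sketched in a few lines. The code below is an illustration under assumptions, not the fix that was committed: `to_fraction` and `equiv` are hypothetical names, only integer numerators and denominators are handled, and anchoring is done with `re.fullmatch` so a fraction must make up the whole answer.

```python
import re
from fractions import Fraction

# A single fraction with integer parts, optionally negated.
_FRAC_RE = re.compile(r"(-?)\\frac\{(-?\d+)\}\{(-?\d+)\}")


def to_fraction(expr: str):
    """Parse a lone \\frac{a}{b} or a plain number; return None otherwise."""
    s = expr.strip()
    # Issue 1: rewrite \tfrac and \dfrac to \frac instead of deleting them.
    s = re.sub(r"\\[td]frac", r"\\frac", s)
    # Issue 2: fullmatch anchors the pattern, so "a + b" is not a scalar.
    m = _FRAC_RE.fullmatch(s)
    if m:
        sign, num, den = m.groups()
        if int(den) == 0:
            return None
        value = Fraction(int(num), int(den))
        return -value if sign else value
    try:
        return Fraction(s)  # accepts "0.5", "3", "1/2"
    except (ValueError, ZeroDivisionError):
        return None


def equiv(a: str, b: str) -> bool:
    """True only when both sides parse and denote the same rational value."""
    fa, fb = to_fraction(a), to_fraction(b)
    return fa is not None and fa == fb
```

Under this sketch, `equiv("\\tfrac{1}{2}", "0.5")` holds, while the composite `"\\frac{1}{2} + \\frac{3}{4}"` fails to parse as a single value and is never graded equal to 0.5.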
client_test_runner.py and generation_benchmark.py imported _normalize_math
but only used it transitively via _math_equiv, so ruff F401 (enabled by
CI's select=[F,I,UP,B]) failed the lint gate. Drop the unused name from
both imports. bench_llm.py keeps it (re-exported to bench_server.py) and
is outside ruff's include list.

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221 davide221 merged commit 9d61083 into Luce-Org:main Jun 15, 2026
5 checks passed
@cheese-cakee cheese-cakee deleted the fix/math-scoring-duplication branch June 15, 2026 13:46

Development

Successfully merging this pull request may close these issues.

harness: math scoring functions duplicated across 3 files with diverged implementations
