fix(harness): consolidate duplicated math scoring into shared module #383
Merged
davide221 merged 3 commits on Jun 15, 2026
_extract_boxed, _normalize_math, and _math_equiv were copy-pasted across three files with no shared import. generation_benchmark's copy was missing currency/unit normalization, \frac whitespace handling, and mixed-number matching — causing benchmark scores to disagree on identical model outputs. Move the superset implementation (from bench_llm.py) into harness/math_scoring.py. Import from the three consumers and delete the local copies.
2 issues found across 4 files
…alence
Identified by cubic: two issues in math scoring.
1. \tfrac and \dfrac were stripped destructively instead of being
normalized to \frac, which broke equivalence for valid fractional
answers (e.g. \tfrac{1}{2} == 0.5 returned False).
2. Fraction equivalence used unanchored substring matching, so a
composite expression could be incorrectly graded equivalent to a
scalar (e.g. \frac{1}{2} + \frac{3}{4} == 0.5 returned True).
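The two fixes described above can be sketched as follows. This is a minimal illustration, not the code in harness/math_scoring.py: the names normalize_frac_variants, to_number, and scalar_equiv are hypothetical, and the sketch handles only integer fractions and plain numeric strings.

```python
from __future__ import annotations

import re
from fractions import Fraction

# Hypothetical names for illustration; not the functions in harness/math_scoring.py.
# The pattern is applied with fullmatch, so it must cover the whole string.
_SINGLE_FRAC = re.compile(r"(-?)\\frac\{(-?\d+)\}\{(\d+)\}")


def normalize_frac_variants(s: str) -> str:
    # Map the display-size variants onto \frac rather than deleting them.
    return s.replace("\\tfrac", "\\frac").replace("\\dfrac", "\\frac")


def to_number(s: str) -> Fraction | None:
    # Return an exact value only when the entire string is one scalar.
    s = normalize_frac_variants(s).strip()
    m = _SINGLE_FRAC.fullmatch(s.replace(" ", ""))
    if m is not None:
        if int(m.group(3)) == 0:
            return None
        value = Fraction(int(m.group(2)), int(m.group(3)))
        return -value if m.group(1) else value
    try:
        return Fraction(s)  # accepts strings such as "0.5" or "1/2"
    except (ValueError, ZeroDivisionError):
        return None


def scalar_equiv(a: str, b: str) -> bool:
    # Two answers match only if both parse as scalars with the same value.
    x, y = to_number(a), to_number(b)
    return x is not None and y is not None and x == y
```

Because the fraction pattern is matched against the full string, a sum of two fractions never parses as a scalar, so it cannot be graded equal to 0.5.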
client_test_runner.py and generation_benchmark.py imported _normalize_math but only use it transitively via _math_equiv, so ruff F401 (in CI's select=[F,I,UP,B]) failed the lint gate. Drop the unused name from both imports. bench_llm.py keeps it (re-exported to bench_server.py) and is outside ruff's include list.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Context
_extract_boxed, _normalize_math, and _math_equiv are copy-pasted across three files with no shared import:
- server/scripts/bench_llm.py:164-251
- harness/client_test_runner.py:1365-1452
- harness/benchmarks/generation_benchmark.py:55-119

bench_server.py:38 imports from bench_llm (shared, correct). The other two are truly independent copies.

bench_llm and client_test_runner are functionally identical. generation_benchmark is a stripped-down version that is missing:
- $ strip ($18 → 18)
- \frac whitespace normalization
- mixed-number matching (1 \frac{1}{2} → 1.5)

Bug
generation_benchmark runs benchmark comparisons against a llama.cpp baseline. Because its _normalize_math and _math_equiv are incomplete, it scores identical model outputs differently from the other two paths. Example: input "$18" normalizes to "18" in bench_llm / client_test_runner but stays "$18" in generation_benchmark. Inputs with \frac whitespace or mixed numbers will also disagree on equivalence.

Fix
Consolidate into a single shared module (harness/math_scoring.py) with the superset implementation. Import from the three consumers. Delete the local copies.

Changes
- harness/math_scoring.py: canonical home for _extract_boxed, _normalize_math, _math_equiv
- server/scripts/bench_llm.py: removed local definitions, imports from math_scoring
- harness/client_test_runner.py: removed local definitions, imports from math_scoring
- harness/benchmarks/generation_benchmark.py: removed local definitions, imports from math_scoring

Verification
- python -c "from math_scoring import _extract_boxed, _normalize_math, _math_equiv": imports resolve
- bench_server.py import chain (bench_llm → math_scoring) verified
- -264 lines across the codebase (net removal of duplicated code)

Notes
- generation_benchmark now gets the full normalization stack (currency, units, \frac, mixed numbers). This is a behavior change for that script, but the old behavior was a bug (scoring disagreement).
- math_scoring.py

Closes #382
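As a rough illustration of the three normalization gaps listed under Context, the sketch below shows one way each step could behave. The function names normalize_answer and mixed_number_value are hypothetical, and the rules are assumptions based only on the examples in this description, not the actual _normalize_math implementation.

```python
from __future__ import annotations

import re

# Hypothetical pattern: an integer followed by a single \frac{a}{b}.
_MIXED = re.compile(r"(-?\d+)\s*\\frac\{(\d+)\}\{(\d+)\}")

# \frac with optional whitespace around and inside its two brace groups.
_FRAC_WS = re.compile(r"\\frac\s*\{\s*(.*?)\s*\}\s*\{\s*(.*?)\s*\}")


def normalize_answer(s: str) -> str:
    s = s.strip()
    # Currency strip: "$18" becomes "18" (also drops an escaped "\$").
    s = s.replace("\\$", "").replace("$", "")
    # \frac whitespace: "\frac {1} {2}" becomes "\frac{1}{2}".
    return _FRAC_WS.sub(lambda m: "\\frac{%s}{%s}" % (m.group(1), m.group(2)), s)


def mixed_number_value(s: str) -> float | None:
    # Mixed number: "1 \frac{1}{2}" is read as 1 + 1/2.
    m = _MIXED.fullmatch(normalize_answer(s))
    if m is None or int(m.group(3)) == 0:
        return None
    whole = int(m.group(1))
    part = int(m.group(2)) / int(m.group(3))
    return whole - part if m.group(1).startswith("-") else whole + part
```

A scorer that skips any of these steps compares "$18" against "18" as different strings, which is the kind of disagreement this PR removes by routing all three consumers through one module.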