feat(ifbench): add dataset and few-shot base-model task #13
jack-scitix-ai wants to merge 1 commit
ethan-scitix
left a comment
Review — IFBench (0-shot generative).
Verified empirically: the vendored scorer is byte-identical to upstream 1091c4c3 (only the import + NLTK-cache adaptations), score=loose prompt-level matches AllenAI's official headline metric, reasoning separation worked 300/300 on the author's runs (0 <think> leakage), float64 kwargs are checker-safe, deps/lock/registration are consistent, and the 4 new tests pass. Nice port. Two items need resolving before merge (1–2), plus two small follow-ups (3–4).
1. Question: the reproduction story isn't aligned at the official decoding. sieval/tasks/ifbench_0shot_gen.py:11-13 documents "official reproduction: temperature=0, 32768". At exactly those settings the run scores 34.0, not the 37.3 target. The only run that hits 37.3 uses temp=0.6 / top_p=0.95 / top_k=20, which is Qwen3's recommended sampling, not the decoding setup behind IFBench's leaderboard numbers (per allenai/IFBench#5 the 37.3 entry is temperature=0, no penalties). The gap comes from greedy degeneration (≈18/300 runaway generations truncate into empty answers), not from a port defect. Please either report 34.0 as the matched-decoding number with that caveat, or drop the 37.3 alignment claim; the temp=0.6 run shouldn't be presented as a reproduction.
2. Question: drop the file/dir/JSONL branching in sieval/datasets/ifbench.py:42-64. source is pinned to hf:allenai/IFBench_test@<rev>, so production always loads the staged parquet through the load_dataset fallback. The IFBench_test.jsonl branch loads an unpinned alternate artifact that bypasses the pinned-source contract and is never produced by the download pipeline — yet it's the only path the two unit tests exercise (the real parquet path is untested). Suggest matching the ifeval.py sibling (load_dataset + apply_eval_split) and testing that path.
3. Nit: Dockerfile is missing two NLTK corpora. IFBench's checkers need stopwords + averaged_perceptron_tagger_eng (pulled at import via instructions_util.download_nltk_resources()). Dockerfile:11-13 pre-seeds only punkt punkt_tab wordnet omw-1.4, so the prepared image still triggers a runtime network download for those two. Add them to the nltk.downloader line so the image stays self-contained (the IFEval sibling relies on this pre-seed rather than import-time download).
4. Nit: document the setuptools<81 ceiling. pyproject.toml:33 — the pin exists because syllapy imports pkg_resources (removed in setuptools 81). A one-line comment will stop a future bump from breaking import syllapy.
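For item 1, the empty-answer and `<think>`-leakage counts quoted above can be recomputed from saved generations with a few lines of Python. This is a minimal sketch, assuming the post-processed answers are available as a plain list of strings; `audit_answers` and the sample strings are hypothetical and not part of sieval or the PR's outputs.

```python
import re

# Matches an opening or closing think tag left over in an answer.
THINK_TAG = re.compile(r"</?think>", re.IGNORECASE)


def audit_answers(answers: list[str]) -> dict[str, int]:
    """Count empty answers and answers that still contain <think> markup."""
    empty = sum(1 for a in answers if not a.strip())
    leaked = sum(1 for a in answers if THINK_TAG.search(a))
    return {"total": len(answers), "empty": empty, "think_leak": leaked}


# Made-up strings for illustration only (not taken from the author's runs).
sample = ["Here is the answer.", "", "   ", "<think>hmm</think> final"]
report = audit_answers(sample)
print(report)
```

Running the same audit on the temperature=0 and temperature=0.6 outputs would make the 18-versus-1 empty-answer comparison easy to re-check.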
Type
Summary
hf:allenai/IFBench_test, pinned to revision 2e8a48de45ff3bf41242f927254ca81b59ca3ae2.
ifbench_0shot_gen, a 0-shot generative chat task for precise, verifiable instruction following (58 OOD constraints over held-out WildChat prompts).
evaluation_lib + instruction registry/checkers under sieval/community/ifbench/; the task reports prompt-level loose accuracy as the headline score, the metric the IFBench paper reports.
[ifbench] optional-dependency group (emoji, nltk, syllapy, setuptools<81).
max_gen_toks: 32768, stop_sequences: ["</answer>"], temperature: 0, process_output: "r1_style", thinking enabled.
temperature 0) with Qwen3 thinking enabled, the model can enter deterministic repetition loops in the reasoning trace, exhaust the token budget, and emit an empty answer, which fails every instruction. Reasoning-chain stripping is handled by the chat backend separating reasoning_content from content, so texts[0] is already the answer without the trace.
Test Plan
Automated
ruff check && ruff format --check)
ty check or mypy --strict)
pdm run pytest)
Manual
sieval dataset download ifbench succeeds).
ifbench_0shot_gen on the full IFBench test split (300 prompts) with Qwen/Qwen3-32B via a Scitix OpenAI-compatible endpoint.
1. Comparison to the leaderboard
This 37.3 is not an exact reproduction: it came out at temperature 0.6 / top_p 0.95 / top_k 20 / max_tokens 38912 (thinking enabled), and since the temperature is not 0 the result is inherently random. Two temperature-0.6 runs landed at 37.33 and 37.00 (the second with seed 42), so the precise match to 37.3 is one fortunate sample, not a deterministic outcome. Validated report (from output):
{"score": 37.33, "loose_prompt_level_accuracy": 37.33, "strict_prompt_level_accuracy": 30.00, "fails": 0}
2. Why we did not run at the official temperature 0
The IFBench authors posted their hyperparameters in allenai/IFBench#5: temperature 0, max_gen_toks 32768, stop ["</answer>"], process_output "r1_style", thinking enabled. We could not reproduce the score under these settings: with Qwen3 thinking enabled and temperature 0, the model reliably enters a degenerate reasoning-repetition loop that exhausts the token budget and returns an empty answer, which fails every instruction. This is experimentally confirmed (from output):
Raising max_tokens does not help: the greedy loop is deterministic and never terminates, and stop: ["</answer>"] never fires because generation never reaches the answer boundary. So the authors' posted settings appear problematic in practice on this endpoint, and we currently cannot explain how their implementation avoids the loop. We therefore fall back to non-greedy sampling, which breaks the loops (empty answers drop from 18 → 1).
3. Stability check: fixed config, 5 runs averaged
Because the 37.3 above is a noisy single sample, we separately verified the scoring/implementation is stable by fixing a clean config (thinking off, temperature 0, max_tokens 8192; no repetition bug when thinking is disabled) and running the full split 5 times (from output):
Mean 27.80, stdev 0.56, range 1.33, with 0 empty answers in all 5 runs. The small spread indicates the eval pipeline produces stable, reproducible scores, so the implementation itself is sound; the difficulty in matching 37.3 is a model/endpoint decoding issue, not a scorer problem.
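To put these spreads in terms of individual prompts, the percentages can be converted back to pass counts on the 300-prompt split. The sketch below is illustrative only: `passed_prompts` and `summarize` are hypothetical helper names, and since the per-run scores of the stability check are not listed here, the demo uses the two temperature-0.6 scores quoted in section 1.

```python
from statistics import mean, stdev

N_PROMPTS = 300  # size of the IFBench test split, per this PR description


def passed_prompts(score_pct: float, n_prompts: int = N_PROMPTS) -> int:
    """Convert a prompt-level accuracy (in percent) to a count of prompts."""
    return round(score_pct / 100 * n_prompts)


def summarize(scores: list[float]) -> dict[str, float]:
    """Mean, sample standard deviation, and range of repeated-run scores."""
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),
        "range": round(max(scores) - min(scores), 2),
    }


# The two temperature-0.6 runs reported in section 1.
temp06_runs = [37.33, 37.00]
counts = [passed_prompts(s) for s in temp06_runs]
summary = summarize(temp06_runs)

# A spread given in percentage points maps to a number of prompts the same way.
stability_range_prompts = passed_prompts(1.33)

print(counts, summary["range"], stability_range_prompts)
```

Read this way, the two temperature-0.6 runs differ by a single prompt, and the 1.33-point range of the stability check corresponds to four prompts out of 300.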