feat(mbpp): add dataset and few-shot base-model task #12
ethan-scitix
left a comment
Reviewed against lm-evaluation-harness MBPP @ 1dd9310. Reproduction verified empirically (76.6, 0 fails) — the few-shot set, prompt template, [DONE] stop, test_list[:3] scoring, and splits all match the reference. No blocking correctness bugs. Requesting changes on one architectural point plus a few consistency/fidelity items.
sieval/datasets/mbpp.py + downloaders/url.py
Question (architecture): lm-eval's mbpp.yaml loads google-research-datasets/mbpp config full from HF — parquet, no trust_remote_code, and it already provides the prompt/test/validation/train splits (10/500/90/374) that mbpp.py:94 rebuilds by hand. Switching to hf:google-research-datasets/mbpp@<sha> aligns with the reference and drops MBPP_JSONL_URL, _resolve_data_file, the split rebuild, and the empty-split guard. It also makes the url.py _has_compressed_content change unnecessary — MBPP is its only caller (mmlu/gpqa are identity-served), so please drop it or split it into its own PR.
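For reference, the hand-rolled split rebuild in question amounts to something like the sketch below. This is an illustration only: `SPLIT_RANGES`, `split_for_task_id`, and `rebuild_splits` are hypothetical names, not the code at mbpp.py:94, and the task_id ranges are the ones stated in the PR summary (prompt 1-10, test 11-510, validation 511-600, train 601-974).

```python
from __future__ import annotations

# Hypothetical names for illustration; not the actual code in sieval/datasets/mbpp.py.
# Ranges follow the task_id splits stated in the PR summary.
SPLIT_RANGES: dict[str, range] = {
    "prompt": range(1, 11),        # task_id 1-10
    "test": range(11, 511),        # task_id 11-510
    "validation": range(511, 601), # task_id 511-600
    "train": range(601, 975),      # task_id 601-974
}


def split_for_task_id(task_id: int) -> str:
    """Return the split name whose task_id range contains `task_id`."""
    for name, ids in SPLIT_RANGES.items():
        if task_id in ids:
            return name
    raise ValueError(f"task_id {task_id} is outside every known split")


def rebuild_splits(rows: list[dict]) -> dict[str, list[dict]]:
    """Group raw rows into splits by their `task_id` field."""
    splits: dict[str, list[dict]] = {name: [] for name in SPLIT_RANGES}
    for row in rows:
        splits[split_for_task_id(row["task_id"])].append(row)
    return splits


# Synthetic rows standing in for mbpp.jsonl (974 problems, task_id 1..974).
rows = [{"task_id": i} for i in range(1, 975)]
sizes = {name: len(items) for name, items in rebuild_splits(rows).items()}
print(sizes)
```

With the HF `full` config these four splits arrive pre-built, so a pinned `datasets.load_dataset("google-research-datasets/mbpp", "full", revision=<sha>)` call would likely replace all of the above; the exact call site in sieval is an assumption here.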
sieval/tasks/mbpp_kshot_base_gen.py
Question (mbpp_kshot_base_gen.py:80-122): Collapse num_shots + k-alias + conflict-check into a single k: int | None per the theoremqa sibling (PR #3); keep pass_k. The alias re-introduces the k ambiguity it tried to remove.
Nit (mbpp_kshot_base_gen.py:172-178): Drop dict[str, object], pass explicit kwargs like PR #6: return await self.model.agenerate(pre, n=self._n, stop=list(self._stop)).
Nit (mbpp_kshot_base_gen.py:86): Make stop non-optional (tuple[str, ...] = STOP_SEQUENCES, per PR #6) — [DONE] is prompt-required, and this removes the conditional above.
Nit (mbpp_kshot_base_gen.py:138): lm-eval doesn't strip the joined tests; .strip() drops a trailing space on 14/500 prompts (score-neutral, but an unannotated divergence in a strict-reproduction task). Drop it or annotate.
Nit (mbpp_kshot_base_gen.py:54-61, reference_impl notes): Add a half-sentence that DeepSeek-V3's 72.6 uses an unspecified protocol, so the +4.0 vs the Qwen-aligned 76.6 isn't read as an implementation error.
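To make the k / stop / kwargs suggestions above concrete, here is a rough sketch of the shape I have in mind. Everything in it is a stand-in: `FewShotTaskSketch`, `FakeModel`, and `default_shots` are hypothetical, the real `agenerate` signature may differ, and treating the sample count as equal to `pass_k` is an assumption made only to keep the example small.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass

# [DONE] is required by the prompt format, so it is the non-optional default.
STOP_SEQUENCES: tuple[str, ...] = ("[DONE]",)


class FakeModel:
    """Stand-in for the real model client; records the kwargs it receives."""

    def __init__(self) -> None:
        self.calls: list[dict] = []

    async def agenerate(self, prompt: str, *, n: int, stop: list[str]) -> list[str]:
        self.calls.append({"prompt": prompt, "n": n, "stop": stop})
        return [f"completion-{i}" for i in range(n)]


@dataclass
class FewShotTaskSketch:
    model: FakeModel
    k: int | None = None                   # few-shot count; None falls back to the default
    pass_k: int = 1                        # k in pass@k, kept separate from the shot count
    stop: tuple[str, ...] = STOP_SEQUENCES # never None, so no conditional at call time
    default_shots: int = 3                 # hypothetical default (the PR tests 3-shot)

    def __post_init__(self) -> None:
        if self.k is not None and self.k < 0:
            raise ValueError("k must be non-negative")
        if self.pass_k < 1:
            raise ValueError("pass_k must be at least 1")

    @property
    def num_shots(self) -> int:
        """Resolve the single `k` parameter to the effective shot count."""
        return self.default_shots if self.k is None else self.k

    async def generate(self, prompt: str) -> list[str]:
        # Explicit kwargs instead of building a dict[str, object].
        # Assumption: one sample per pass@k slot.
        return await self.model.agenerate(prompt, n=self.pass_k, stop=list(self.stop))


task = FewShotTaskSketch(model=FakeModel())
completions = asyncio.run(task.generate("example prompt"))
print(task.num_shots, completions, task.model.calls[0]["stop"])
```

With a single `k` there is no alias to reconcile and no conflict check to maintain, and a non-optional `stop` means the generate call needs no branching.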
sieval/community/mbpp.py
Nit (community/mbpp.py:5): Pin the URL to 1dd931087362abba74e0375c8c631295559f48b2 (the SHA the few-shot was copied from), matching the task's reference_impl.
Type
feature — new benchmark, task, or capability
Summary
MBPPDataset (loads mbpp.jsonl, rebuilds the official task_id splits: prompt 1-10 / test 11-510 / validation 511-600 / train 601-974) and MBPPFewShotBaseGenTask (few-shot base-model pass@k code-exec scoring). [BEGIN]/[DONE] prompt, [DONE] stop, fixed task_id 2/3/4 few-shot examples, exec against test_list[:3]. Few-shot set + splits trace to the original google-research MBPP README; that repo ships data only (no scorer), so lm-eval is the executable reference.
raw.githubusercontent.com serves gzip), which MBPP's url: source requires.
Test Plan
Automated
ruff check && ruff format --check
ty check or mypy --strict
pdm run pytest
Manual
Tested Qwen2.5-72B-Base, MBPP 3-shot greedy. Implementation aligns to lm-eval-harness.
Checklist
Required (all PRs)
type(scope): description
AI-Generated Code - <model> (<provider>) in module docstring
core/
If: New or Modified Benchmark
sieval dataset download <name> succeeds
__init__.py
If: community/ Changes