feat(mbpp): add dataset and few-shot base-model task by jack-scitix-ai · Pull Request #12 · scitix/sieval

jack-scitix-ai · 2026-06-22T16:52:56Z

Type

feature — new benchmark, task, or capability

Summary

Add MBPPDataset (loads mbpp.jsonl, rebuilds the official task_id splits: prompt 1-10 / test 11-510 / validation 511-600 / train 601-974) and MBPPFewShotBaseGenTask (few-shot base-model pass@k code-exec scoring).
Faithful to lm-evaluation-harness MBPP: [BEGIN]/[DONE] prompt, [DONE] stop, fixed task_id 2/3/4 few-shot examples, exec against test_list[:3]. Few-shot set + splits trace to the original google-research MBPP README; that repo ships data only (no scorer), so lm-eval is the executable reference.
Bundled fix: URL downloader skips the Content-Length truncation check on compressed transport (raw.githubusercontent.com serves gzip), which MBPP's url: source requires.
References: lm-eval MBPP (https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mbpp/mbpp.yaml), original dataset (https://github.com/google-research/google-research/tree/master/mbpp), DeepSeek-V3 report (arXiv:2412.19437), Qwen3 report (arXiv:2505.09388).
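For readers who want to see the split rebuild concretely, here is a minimal sketch of mapping task_id to the four ranges listed above. The names `SPLIT_RANGES`, `split_for_task_id`, and `rebuild_splits` are hypothetical and chosen for illustration; this is not the code in `sieval/datasets/mbpp.py`.

```python
from collections import Counter

# Inclusive task_id ranges for the MBPP splits, as listed in the summary.
SPLIT_RANGES = {
    "prompt": (1, 10),
    "test": (11, 510),
    "validation": (511, 600),
    "train": (601, 974),
}


def split_for_task_id(task_id: int) -> str:
    """Return the split whose inclusive range contains task_id."""
    for name, (lo, hi) in SPLIT_RANGES.items():
        if lo <= task_id <= hi:
            return name
    raise ValueError(f"task_id {task_id} is outside the MBPP range 1-974")


def rebuild_splits(records):
    """Group records (dicts with a 'task_id' key) by split name."""
    splits = {name: [] for name in SPLIT_RANGES}
    for record in records:
        splits[split_for_task_id(record["task_id"])].append(record)
    return splits


# Count how many task_ids fall into each split across the full id range.
sizes = Counter(split_for_task_id(i) for i in range(1, 975))
```

Under these ranges the counts come out to 10 / 500 / 90 / 374, and the fixed few-shot task_ids 2, 3, and 4 all fall in the prompt split.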

Test Plan

Automated

Lint/format clean (ruff check && ruff format --check)
Type check clean (ty check or mypy --strict)
Unit tests pass (pdm run pytest)

Manual

Tested Qwen2.5-72B-Base on MBPP, 3-shot, greedy decoding. The implementation aligns with lm-evaluation-harness.

Model	Expected	Actual	Diff
Qwen2.5-72B-Base	76.0 (Qwen3 report, Table 3)	76.6	+0.6 (+0.79%)
Qwen2.5-72B-Base	72.6 (DeepSeek-V3 report, Table 3)	76.6	+4.0 (+5.51%)
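The Diff column can be re-derived as the absolute difference in points, followed by that difference relative to the expected score. A small sketch, with `score_diff` as a hypothetical helper name:

```python
def score_diff(expected: float, actual: float) -> tuple[float, float]:
    """Return (absolute diff in points, relative diff as a percent of expected)."""
    absolute = actual - expected
    relative = absolute / expected * 100.0
    # Round to the precision used in the table.
    return round(absolute, 1), round(relative, 2)


qwen3_row = score_diff(76.0, 76.6)
deepseek_row = score_diff(72.6, 76.6)
```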

Checklist

Required (all PRs)

PR title follows conventional format (type(scope): description)
No internal paths, credentials, or personal info in committed files
AI-generated code has AI-Generated Code - <model> (<provider>) in module docstring
No new upper-layer dependencies added to core/
Deleted code verified — no remaining call sites depend on it

If: New or Modified Benchmark

Reference paper/repo linked in Summary
Score comparison table included (model, expected, actual, diff)
Dataset loading tested (sieval dataset download <name> succeeds)
Task registered in package-level __init__.py

If: community/ Changes

Upstream diff documented (what differs and why)
License attribution preserved

ethan-scitix

Reviewed against lm-evaluation-harness MBPP @ 1dd9310. Reproduction verified empirically (76.6, 0 fails) — the few-shot set, prompt template, [DONE] stop, test_list[:3] scoring, and splits all match the reference. No blocking correctness bugs. Requesting changes on one architectural point plus a few consistency/fidelity items.

`sieval/datasets/mbpp.py` + `downloaders/url.py`

Question (architecture): lm-eval's mbpp.yaml loads google-research-datasets/mbpp config full from HF — parquet, no trust_remote_code, and it already provides the prompt/test/validation/train splits (10/500/90/374) that mbpp.py:94 rebuilds by hand. Switching to hf:google-research-datasets/mbpp@<sha> aligns with the reference and drops MBPP_JSONL_URL, _resolve_data_file, the split rebuild, and the empty-split guard. It also makes the url.py _has_compressed_content change unnecessary — MBPP is its only caller (mmlu/gpqa are identity-served), so please drop it or split it into its own PR.
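To make the suggestion concrete, here is a rough sketch of loading the `full` config with the Hugging Face `datasets` library directly and checking the split sizes quoted above. It does not go through sieval's `hf:` source handling, which is not shown in this PR; `load_mbpp_full` and the constants are hypothetical names, and `revision` is left for the caller to pin.

```python
HF_REPO = "google-research-datasets/mbpp"
HF_CONFIG = "full"

# Split sizes quoted in the review comment.
EXPECTED_SPLIT_SIZES = {"prompt": 10, "test": 500, "validation": 90, "train": 374}


def load_mbpp_full(revision=None):
    """Load the `full` config and check it against the expected split sizes.

    Pass a pinned commit SHA as `revision` for reproducibility; None falls
    back to the repo's default branch.
    """
    # Imported lazily so the module imports cleanly without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(HF_REPO, HF_CONFIG, revision=revision)
    actual = {name: split.num_rows for name, split in dataset.items()}
    if actual != EXPECTED_SPLIT_SIZES:
        raise ValueError(f"unexpected MBPP split sizes: {actual}")
    return dataset
```

Calling `load_mbpp_full(revision=...)` needs network access and the `datasets` package. The size check stands in for the hand-rolled empty-split guard: a mismatch fails loudly rather than yielding a silently short split.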

`sieval/tasks/mbpp_kshot_base_gen.py`

Question (mbpp_kshot_base_gen.py:80-122): Collapse num_shots + k-alias + conflict-check into a single k: int | None per the theoremqa sibling (PR #3); keep pass_k. The alias re-introduces the k ambiguity it tried to remove.

Nit (mbpp_kshot_base_gen.py:172-178): Drop dict[str, object], pass explicit kwargs like PR #6: return await self.model.agenerate(pre, n=self._n, stop=list(self._stop)).

Nit (mbpp_kshot_base_gen.py:86): Make stop non-optional (tuple[str, ...] = STOP_SEQUENCES, per PR #6) — [DONE] is prompt-required, and this removes the conditional above.
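A sketch of what the two nits above add up to, assuming `agenerate` accepts `n` and `stop` as keyword arguments as in the one-liner quoted earlier. `EchoModel` and `FewShotTask` are hypothetical stand-ins, not sieval classes.

```python
import asyncio

STOP_SEQUENCES: tuple[str, ...] = ("[DONE]",)


class EchoModel:
    """Hypothetical stand-in for the model client; records the kwargs it gets."""

    def __init__(self):
        self.calls = []

    async def agenerate(self, prompt, *, n, stop):
        self.calls.append({"prompt": prompt, "n": n, "stop": stop})
        return [f"completion {i}" for i in range(n)]


class FewShotTask:
    """Hypothetical task skeleton, not the class from this PR."""

    def __init__(self, model, n=1, stop: tuple[str, ...] = STOP_SEQUENCES):
        self.model = model
        self._n = n
        # Always a tuple, so no `if stop is not None` branch is needed later.
        self._stop = stop

    async def generate(self, pre):
        # Explicit keyword arguments instead of assembling a dict[str, object].
        return await self.model.agenerate(pre, n=self._n, stop=list(self._stop))


model = EchoModel()
outputs = asyncio.run(FewShotTask(model, n=2).generate("prompt text"))
```

With an immutable tuple default the stop sequences cannot be mutated across instances, and `list(self._stop)` hands the client a fresh list on each call.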

Nit (mbpp_kshot_base_gen.py:138): lm-eval doesn't strip the joined tests; .strip() drops a trailing space on 14/500 prompts (score-neutral, but an unannotated divergence in a strict-reproduction task). Drop it or annotate.
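To illustrate the divergence, here is a toy example of joining three test assertions with and without `.strip()`. The assertions are made up and `join_tests` is a hypothetical helper; the real prompt template is not reproduced here.

```python
def join_tests(test_list, strip=False):
    """Join the first three test assertions, optionally stripping the result."""
    joined = "\n".join(test_list[:3])
    return joined.strip() if strip else joined


# Made-up example whose third assertion ends in a space.
tests = [
    "assert add(1, 2) == 3",
    "assert add(0, 0) == 0",
    "assert add(-1, 1) == 0 ",
]

unstripped = join_tests(tests)
stripped = join_tests(tests, strip=True)
```

Only whitespace at the outer edges of the joined string is affected, so the two prompts differ by a single trailing character while the interior lines are identical.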

Nit (mbpp_kshot_base_gen.py:54-61, reference_impl notes): Add a half-sentence that DeepSeek-V3's 72.6 uses an unspecified protocol, so the +4.0 vs the Qwen-aligned 76.6 isn't read as an implementation error.

`sieval/community/mbpp.py`

Nit (community/mbpp.py:5): Pin the URL to 1dd931087362abba74e0375c8c631295559f48b2 (the SHA the few-shot was copied from), matching the task's reference_impl.

jack-scitix-ai added 2 commits June 23, 2026 16:21

feat(mbpp): add dataset and few-shot base-model task

337381a

fix(mbpp): remove score refs from task docstring

18f7342

jack-scitix-ai force-pushed the feat/mbpp branch from 26997af to 18f7342 June 23, 2026 08:21

ethan-scitix requested changes Jun 23, 2026

jack-scitix-ai wants to merge 2 commits into scitix:main from jack-scitix-ai:feat/mbpp
