
feat(mbpp): add dataset and few-shot base-model task #12

Open
jack-scitix-ai wants to merge 4 commits into scitix:main from jack-scitix-ai:feat/mbpp


@jack-scitix-ai

Contributor

Type

feature — new benchmark, task, or capability

Summary

  • Add MBPPDataset (loads mbpp.jsonl, rebuilds the official task_id splits: prompt 1-10 / test 11-510 / validation 511-600 / train 601-974) and MBPPFewShotBaseGenTask (few-shot base-model pass@k code-exec scoring).
  • Faithful to lm-evaluation-harness MBPP: [BEGIN]/[DONE] prompt, [DONE] stop, fixed task_id 2/3/4 few-shot examples, exec against test_list[:3]. Few-shot set + splits trace to the original google-research MBPP README; that repo ships data only (no scorer), so lm-eval is the executable reference.
  • Bundled fix: URL downloader skips the Content-Length truncation check on compressed transport (raw.githubusercontent.com serves gzip), which MBPP's url: source requires.
  • References: lm-eval MBPP (https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mbpp/mbpp.yaml), original dataset (https://github.com/google-research/google-research/tree/master/mbpp), DeepSeek-V3 report (arXiv:2412.19437), Qwen3 report (arXiv:2505.09388).
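The task_id ranges in the first bullet can be sketched as a small lookup. This is an illustrative reconstruction, not the code in MBPPDataset; the names SPLIT_RANGES, split_for_task_id, and split_sizes are hypothetical.

```python
# Official MBPP task_id ranges as listed in the summary (inclusive bounds).
SPLIT_RANGES: dict[str, tuple[int, int]] = {
    "prompt": (1, 10),
    "test": (11, 510),
    "validation": (511, 600),
    "train": (601, 974),
}


def split_for_task_id(task_id: int) -> str:
    """Return the split whose inclusive range contains task_id."""
    for name, (lo, hi) in SPLIT_RANGES.items():
        if lo <= task_id <= hi:
            return name
    raise ValueError(f"task_id {task_id} is outside the MBPP range 1-974")


def split_sizes() -> dict[str, int]:
    """Number of task_ids covered by each range."""
    return {name: hi - lo + 1 for name, (lo, hi) in SPLIT_RANGES.items()}
```

Under these ranges the split sizes come out to 10/500/90/374, and the fixed few-shot examples (task_id 2/3/4) all fall in the prompt split.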

Test Plan

Automated

  • Lint/format clean (ruff check && ruff format --check)
  • Type check clean (ty check or mypy --strict)
  • Unit tests pass (pdm run pytest)

Manual

Tested Qwen2.5-72B-Base on MBPP with 3-shot prompting and greedy decoding. The implementation aligns with lm-evaluation-harness.

| Model | Expected | Actual | Diff |
|---|---|---|---|
| Qwen2.5-72B-Base | 76.0 (Qwen3 report, Table 3) | 76.6 | +0.6 (+0.79%) |
| Qwen2.5-72B-Base | 72.6 (DeepSeek-V3 report, Table 3) | 76.6 | +4.0 (+5.51%) |
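For readers checking the Diff column: the parenthesised percentage appears to be the absolute difference relative to the expected score. A minimal sketch of that arithmetic, assuming this interpretation (score_diff is a hypothetical helper, not part of the PR):

```python
def score_diff(expected: float, actual: float) -> tuple[float, float]:
    """Return (absolute diff in points, relative diff as a percent of expected)."""
    absolute = actual - expected
    relative = absolute / expected * 100.0
    # Round to the precision used in the table: one decimal for points,
    # two decimals for the percentage.
    return round(absolute, 1), round(relative, 2)


print(score_diff(76.0, 76.6))  # (0.6, 0.79)
print(score_diff(72.6, 76.6))  # (4.0, 5.51)
```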

Checklist

Required (all PRs)

  • PR title follows conventional format (type(scope): description)
  • No internal paths, credentials, or personal info in committed files
  • AI-generated code has AI-Generated Code - <model> (<provider>) in module docstring
  • No new upper-layer dependencies added to core/
  • Deleted code verified — no remaining call sites depend on it

If: New or Modified Benchmark

  • Reference paper/repo linked in Summary
  • Score comparison table included (model, expected, actual, diff)
  • Dataset loading tested (sieval dataset download <name> succeeds)
  • Task registered in package-level __init__.py

If: community/ Changes

  • Upstream diff documented (what differs and why)
  • License attribution preserved

@ethan-scitix left a comment

Collaborator


Reviewed against lm-evaluation-harness MBPP @ 1dd9310. Reproduction verified empirically (76.6, 0 fails) — the few-shot set, prompt template, [DONE] stop, test_list[:3] scoring, and splits all match the reference. No blocking correctness bugs. Requesting changes on one architectural point plus a few consistency/fidelity items.
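To make the [DONE] stop and test_list[:3] scoring concrete, here is a minimal sketch of the idea. It is not the code from this PR or from lm-evaluation-harness; truncate_at_stop and passes_tests are hypothetical names, and a real harness would sandbox execution and enforce a timeout.

```python
def truncate_at_stop(completion: str, stop: str = "[DONE]") -> str:
    """Keep only the text before the first stop marker, if one is present."""
    return completion.split(stop, 1)[0]


def passes_tests(code: str, test_list: list[str], max_tests: int = 3) -> bool:
    """Exec the candidate code, then the first max_tests assert statements.

    WARNING: exec runs arbitrary code. This sketch has no sandbox and no
    timeout, so it should only be fed trusted strings.
    """
    namespace: dict[str, object] = {}
    try:
        exec(code, namespace)
        for test in test_list[:max_tests]:
            exec(test, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a fail.
        return False
    return True
```

A completion such as `"def add(a, b):\n    return a + b\n[DONE]\ntrailing text"` would be cut at the marker before execution, and only the first three entries of test_list would be run against it.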

sieval/datasets/mbpp.py + downloaders/url.py

Question (architecture): lm-eval's mbpp.yaml loads google-research-datasets/mbpp config full from HF — parquet, no trust_remote_code, and it already provides the prompt/test/validation/train splits (10/500/90/374) that mbpp.py:94 rebuilds by hand. Switching to hf:google-research-datasets/mbpp@<sha> aligns with the reference and drops MBPP_JSONL_URL, _resolve_data_file, the split rebuild, and the empty-split guard. It also makes the url.py _has_compressed_content change unnecessary — MBPP is its only caller (mmlu/gpqa are identity-served), so please drop it or split it into its own PR.
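As a rough illustration of the suggested switch, the sketch below loads the `full` config with the Hugging Face `datasets` library at a pinned revision and checks the split sizes quoted above. It calls `datasets.load_dataset` directly rather than going through sieval's `hf:` source scheme, whose loader is not shown in this thread; load_mbpp and EXPECTED_SPLIT_SIZES are hypothetical names, and the revision is left as a parameter because the review does not specify one.

```python
# Split sizes quoted in the review comment for the `full` config.
EXPECTED_SPLIT_SIZES = {"prompt": 10, "test": 500, "validation": 90, "train": 374}


def load_mbpp(revision: str):
    """Load MBPP (`full` config) from the Hugging Face Hub at a pinned commit."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("google-research-datasets/mbpp", "full", revision=revision)
    for split, expected in EXPECTED_SPLIT_SIZES.items():
        actual = len(dataset[split])
        if actual != expected:
            raise ValueError(f"split {split!r}: expected {expected} rows, got {actual}")
    return dataset
```

Pinning `revision` to a commit SHA keeps the loaded data fixed even if the Hub repository changes later.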

sieval/tasks/mbpp_kshot_base_gen.py

Question (mbpp_kshot_base_gen.py:80-122): Collapse num_shots + k-alias + conflict-check into a single k: int | None per the theoremqa sibling (PR #3); keep pass_k. The alias re-introduces the k ambiguity it tried to remove.
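For context on why pass_k is kept separate from the shot count k: pass@k is commonly computed with the unbiased estimator from the Codex paper, 1 - C(n-c, k) / C(n, k), where n samples are drawn and c of them pass. Whether this task uses exactly that form is not stated in the thread, so the sketch below is an assumption for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With greedy decoding and a single sample (n = 1, k = 1) this reduces to 1.0 for a passing problem and 0.0 for a failing one, so the benchmark score is simply the fraction of problems solved.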

Nit (mbpp_kshot_base_gen.py:172-178): Drop dict[str, object], pass explicit kwargs like PR #6: return await self.model.agenerate(pre, n=self._n, stop=list(self._stop)).

Nit (mbpp_kshot_base_gen.py:86): Make stop non-optional (tuple[str, ...] = STOP_SEQUENCES, per PR #6) — [DONE] is prompt-required, and this removes the conditional above.

Nit (mbpp_kshot_base_gen.py:138): lm-eval doesn't strip the joined tests; .strip() drops a trailing space on 14/500 prompts (score-neutral, but an unannotated divergence in a strict-reproduction task). Drop it or annotate.
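The divergence can be seen with a test whose source line ends in a space. The sketch below assumes the tests are joined with newlines; format_tests is a hypothetical stand-in for the task's helper, with a strip flag added only to show both behaviours side by side.

```python
def format_tests(test_list: list[str], strip: bool = False) -> str:
    """Join the first three tests with newlines, optionally stripping the result."""
    joined = "\n".join(test_list[:3])
    return joined.strip() if strip else joined


tests = [
    "assert f(1) == 1",
    "assert f(2) == 2",
    "assert f(3) == 3 ",  # trailing space, as present in some dataset rows
]
unstripped = format_tests(tests)
stripped = format_tests(tests, strip=True)
print(repr(unstripped[-2:]))  # '3 '
print(repr(stripped[-2:]))    # ' 3'
```

The two prompts differ by one character, which is invisible in most diffs but changes the byte sequence fed to the tokenizer.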

Nit (mbpp_kshot_base_gen.py:54-61, reference_impl notes): Add a half-sentence that DeepSeek-V3's 72.6 uses an unspecified protocol, so the +4.0 vs the Qwen-aligned 76.6 isn't read as an implementation error.

sieval/community/mbpp.py

Nit (community/mbpp.py:5): Pin the URL to 1dd931087362abba74e0375c8c631295559f48b2 (the SHA the few-shot was copied from), matching the task's reference_impl.

@jack-scitix-ai

Contributor Author

Thanks for the detailed review — addressed all points.

datasets/mbpp.py + downloaders/url.py (architecture)

  • Switched to hf:google-research-datasets/mbpp@4bb6404fdc6cacfda99d4ac4205087b89d32030c config full. The repo natively ships prompt/test/validation/train (10/500/90/374) with a schema identical to MBPPDatasetSample, so I dropped MBPP_JSONL_URL, _resolve_data_file, the task_id-range split rebuild, and the empty-split guard. The loader is now the minimal HF form (matches human_eval/math_500).
  • With MBPP off the URL path, _has_compressed_content had no remaining caller (mmlu/gpqa are identity-served), so I reverted url.py to its pre-PR baseline (git diff against the base is empty) rather than splitting it out. Happy to reintroduce it as its own PR if you'd like the capability kept.

tasks/mbpp_kshot_base_gen.py

  • Collapsed num_shots + k-alias + conflict-check into a single k: int (kept pass_k), per the theoremqa sibling.
  • infer now passes explicit kwargs: agenerate(pre, n=self._n, stop=list(self._stop)).
  • stop is non-optional (tuple[str, ...] = STOP_SEQUENCES); removed the conditional.
  • Dropped .strip() in _format_tests so the joined tests match lm-eval byte-for-byte; added a comment noting the retained trailing space.
  • reference_impl notes now state DeepSeek-V3's 72.6 uses an unspecified protocol, so the gap to the Qwen-aligned number reads as a protocol difference, not an implementation error.
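The resulting constructor and infer shape might look roughly like the sketch below. Only the agenerate call and the `tuple[str, ...] = STOP_SEQUENCES` default are taken from this thread; the class names, the default values of k, pass_k, and n, and the EchoModel stub are assumptions for illustration.

```python
import asyncio

STOP_SEQUENCES: tuple[str, ...] = ("[DONE]",)


class EchoModel:
    """Stand-in for the real model client; echoes the arguments it receives."""

    async def agenerate(self, prompt: str, *, n: int, stop: list[str]) -> list[str]:
        return [f"{prompt}|n={n}|stop={stop}"] * n


class FewShotTask:
    def __init__(
        self,
        model: EchoModel,
        k: int = 3,  # number of few-shot examples (assumed default)
        pass_k: int = 1,  # k for pass@k scoring (assumed default)
        n: int = 1,  # samples per prompt (assumed default)
        stop: tuple[str, ...] = STOP_SEQUENCES,  # non-optional, as in the review
    ) -> None:
        self.model = model
        self.k = k
        self.pass_k = pass_k
        self._n = n
        self._stop = stop

    async def infer(self, pre: str) -> list[str]:
        # Explicit keyword arguments instead of a dict[str, object] of options.
        return await self.model.agenerate(pre, n=self._n, stop=list(self._stop))


outputs = asyncio.run(FewShotTask(EchoModel()).infer("prompt"))
```

Keeping stop as an immutable tuple default and converting it to a list at the call site avoids sharing a mutable default between task instances.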

community/mbpp.py

  • Pinned the upstream utils.py URL to 1dd931087362abba74e0375c8c631295559f48b2, matching the task's reference_impl.

Tests / sync

  • Removed test_mbpp.py — it only covered the now-deleted URL/split machinery (the other 12 HF loaders have no dataset unit test). Reverted the compressed-transport case in test_url.py. Task tests updated for the k rename.
