feat(mbpp): add dataset and few-shot base-model task #12

Open: jack-scitix-ai wants to merge 2 commits into scitix:main from jack-scitix-ai:feat/mbpp

@jack-scitix-ai

Contributor

Type

feature — new benchmark, task, or capability

Summary

  • Add MBPPDataset (loads mbpp.jsonl, rebuilds the official task_id splits: prompt 1-10 / test 11-510 / validation 511-600 / train 601-974) and MBPPFewShotBaseGenTask (few-shot base-model pass@k code-exec scoring).
  • Faithful to lm-evaluation-harness MBPP: [BEGIN]/[DONE] prompt, [DONE] stop, fixed task_id 2/3/4 few-shot examples, exec against test_list[:3]. Few-shot set + splits trace to the original google-research MBPP README; that repo ships data only (no scorer), so lm-eval is the executable reference.
  • Bundled fix: URL downloader skips the Content-Length truncation check on compressed transport (raw.githubusercontent.com serves gzip), which MBPP's url: source requires.
  • References: lm-eval MBPP (https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mbpp/mbpp.yaml), original dataset (https://github.com/google-research/google-research/tree/master/mbpp), DeepSeek-V3 report (arXiv:2412.19437), Qwen3 report (arXiv:2505.09388).
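The task_id ranges listed in the first bullet can be sketched as a small lookup. This is an illustration only: `SPLIT_RANGES` and `split_for` are hypothetical names and not the PR's actual `MBPPDataset` code.

```python
# Hypothetical helper illustrating the split boundaries named in the summary.
# The names SPLIT_RANGES and split_for are assumptions, not taken from the PR.
SPLIT_RANGES = {
    "prompt": range(1, 11),         # task_id 1-10
    "test": range(11, 511),         # task_id 11-510
    "validation": range(511, 601),  # task_id 511-600
    "train": range(601, 975),       # task_id 601-974
}


def split_for(task_id: int) -> str:
    """Return the name of the split that contains task_id."""
    for name, ids in SPLIT_RANGES.items():
        if task_id in ids:
            return name
    raise ValueError(f"task_id {task_id} is outside the MBPP range 1-974")


# Number of task ids covered by each range.
sizes = {name: len(ids) for name, ids in SPLIT_RANGES.items()}
```

Under these ranges the fixed few-shot examples (task_id 2/3/4) fall in the prompt split, so they never overlap with the test split.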

Test Plan

Automated

  • Lint/format clean (ruff check && ruff format --check)
  • Type check clean (ty check or mypy --strict)
  • Unit tests pass (pdm run pytest)

Manual

Tested Qwen2.5-72B-Base, MBPP 3-shot greedy. Implementation aligns to lm-eval-harness.

| Model | Expected | Actual | Diff |
| --- | --- | --- | --- |
| Qwen2.5-72B-Base | 76.0 (Qwen3 report, Table 3) | 76.6 | +0.6 (+0.79%) |
| Qwen2.5-72B-Base | 72.6 (DeepSeek-V3 report, Table 3) | 76.6 | +4.0 (+5.51%) |
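For readers checking the Diff column, the values appear to be the absolute difference followed by the difference relative to the expected score. A minimal sketch of that arithmetic, with `score_diff` as a hypothetical helper name:

```python
def score_diff(expected: float, actual: float) -> tuple[float, float]:
    """Return (absolute diff, diff as a percentage of the expected score)."""
    absolute = round(actual - expected, 1)
    relative = round(100 * (actual - expected) / expected, 2)
    return absolute, relative


qwen3_row = score_diff(76.0, 76.6)     # Qwen3 report reference
deepseek_row = score_diff(72.6, 76.6)  # DeepSeek-V3 report reference
```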

Checklist

Required (all PRs)

  • PR title follows conventional format (type(scope): description)
  • No internal paths, credentials, or personal info in committed files
  • AI-generated code has AI-Generated Code - <model> (<provider>) in module docstring
  • No new upper-layer dependencies added to core/
  • Deleted code verified — no remaining call sites depend on it

If: New or Modified Benchmark

  • Reference paper/repo linked in Summary
  • Score comparison table included (model, expected, actual, diff)
  • Dataset loading tested (sieval dataset download <name> succeeds)
  • Task registered in package-level __init__.py

If: community/ Changes

  • Upstream diff documented (what differs and why)
  • License attribution preserved

@ethan-scitix left a comment

Collaborator

Reviewed against lm-evaluation-harness MBPP @ 1dd9310. Reproduction verified empirically (76.6, 0 fails) — the few-shot set, prompt template, [DONE] stop, test_list[:3] scoring, and splits all match the reference. No blocking correctness bugs. Requesting changes on one architectural point plus a few consistency/fidelity items.
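As a rough illustration of the [DONE] stop and test_list[:3] scoring mentioned above, the sketch below truncates a completion at the stop marker and executes it against the first three assertions. The function names are hypothetical, the code is not taken from the PR or from lm-eval, and a real scorer would run candidates in a sandbox with a timeout rather than a bare `exec`.

```python
STOP = "[DONE]"


def extract_code(completion: str) -> str:
    """Keep only the text before the first stop marker."""
    return completion.split(STOP, 1)[0].strip()


def passes(completion: str, test_list: list[str]) -> bool:
    """Run the candidate followed by the first three test assertions."""
    program = extract_code(completion) + "\n" + "\n".join(test_list[:3])
    try:
        # No sandbox and no timeout here: illustration only.
        exec(program, {"__name__": "__mbpp_sketch__"})
    except Exception:
        return False
    return True


completion = "def add(a, b):\n    return a + b\n[DONE]\nanything after the stop is ignored"
tests = [
    "assert add(1, 2) == 3",
    "assert add(0, 0) == 0",
    "assert add(-1, 1) == 0",
    "assert add(0, 0) == 1",  # fourth test, not used by test_list[:3]
]
result = passes(completion, tests)
```

Only the first three assertions are executed, so the deliberately wrong fourth test does not affect `result`.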

sieval/datasets/mbpp.py + downloaders/url.py

Question (architecture): lm-eval's mbpp.yaml loads the "full" config of google-research-datasets/mbpp from HF. That source is parquet, needs no trust_remote_code, and already provides the prompt/test/validation/train splits (10/500/90/374) that mbpp.py:94 rebuilds by hand. Switching to hf:google-research-datasets/mbpp@<sha> aligns with the reference and drops MBPP_JSONL_URL, _resolve_data_file, the split rebuild, and the empty-split guard. It also makes the url.py _has_compressed_content change unnecessary: MBPP is its only caller (mmlu/gpqa are identity-served), so please drop it or split it into its own PR.
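For context on the url.py change discussed here, the sketch below shows why a Content-Length comparison can misfire on compressed transport: the header counts bytes on the wire, while a client that transparently decodes gzip hands back the larger decoded body. `should_check_length` is a hypothetical stand-in, not the PR's `_has_compressed_content`, and the plain dict lookup ignores header case-insensitivity for brevity.

```python
import gzip


def should_check_length(headers: dict[str, str]) -> bool:
    """Only compare byte counts when the body is served without compression."""
    encoding = headers.get("Content-Encoding", "identity").strip().lower()
    return encoding in ("", "identity")


# Simulated response: the payload is made up for illustration.
body = b'{"task_id": 1, "text": "example"}\n' * 100
wire = gzip.compress(body)
headers = {"Content-Encoding": "gzip", "Content-Length": str(len(wire))}

# A naive check compares the decoded size against Content-Length and would
# wrongly flag this complete download as truncated.
naive_check_fails = len(body) != int(headers["Content-Length"])
```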

sieval/tasks/mbpp_kshot_base_gen.py

Question (mbpp_kshot_base_gen.py:80-122): Collapse num_shots + k-alias + conflict-check into a single k: int | None per the theoremqa sibling (PR #3); keep pass_k. The alias re-introduces the k ambiguity it tried to remove.
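One possible shape for the collapsed parameters is sketched below. It assumes that `k` is the number of few-shot examples, that `None` falls back to the three fixed examples, and that `pass_k` is the k of pass@k. `FewShotConfig`, `DEFAULT_SHOTS`, and the validation bounds are hypothetical; the theoremqa sibling may differ.

```python
from __future__ import annotations

from dataclasses import dataclass

DEFAULT_SHOTS = 3  # assumption: the fixed task_id 2/3/4 examples


@dataclass(frozen=True)
class FewShotConfig:
    k: int | None = None  # number of few-shot examples; None means the default
    pass_k: int = 1       # k in pass@k, kept as a separate parameter

    def __post_init__(self) -> None:
        if self.k is not None and not 0 <= self.k <= DEFAULT_SHOTS:
            raise ValueError(f"k must be between 0 and {DEFAULT_SHOTS}, got {self.k}")
        if self.pass_k < 1:
            raise ValueError(f"pass_k must be at least 1, got {self.pass_k}")

    @property
    def shots(self) -> int:
        """Resolve the effective number of few-shot examples."""
        return DEFAULT_SHOTS if self.k is None else self.k
```

With a single `k` there is no alias to reconcile, so no conflict check is needed.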

Nit (mbpp_kshot_base_gen.py:172-178): Drop dict[str, object], pass explicit kwargs like PR #6: return await self.model.agenerate(pre, n=self._n, stop=list(self._stop)).

Nit (mbpp_kshot_base_gen.py:86): Make stop non-optional (tuple[str, ...] = STOP_SEQUENCES, per PR #6) — [DONE] is prompt-required, and this removes the conditional above.

Nit (mbpp_kshot_base_gen.py:138): lm-eval doesn't strip the joined tests; .strip() drops a trailing space on 14/500 prompts (score-neutral, but an unannotated divergence in a strict-reproduction task). Drop it or annotate.
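The divergence is easy to reproduce in isolation. In the sketch below the test strings are made up and the newline join is an assumption about how the task combines them; the point is only that `.strip()` removes a trailing space that a plain join preserves.

```python
# Made-up assertions; the last one ends with a trailing space, as the
# affected dataset rows reportedly do.
test_list = [
    "assert f(1) == 2",
    "assert f(2) == 3",
    "assert f(3) == 4 ",
]

joined = "\n".join(test_list)  # keeps the trailing space
stripped = joined.strip()      # drops it, so the prompt text changes by one character
```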

Nit (mbpp_kshot_base_gen.py:54-61, reference_impl notes): Add a half-sentence that DeepSeek-V3's 72.6 uses an unspecified protocol, so the +4.0 vs the Qwen-aligned 76.6 isn't read as an implementation error.

sieval/community/mbpp.py

Nit (community/mbpp.py:5): Pin the URL to 1dd931087362abba74e0375c8c631295559f48b2 (the SHA the few-shot was copied from), matching the task's reference_impl.
