feat(livecodebench): add dataset and few-shot base-model task by jack-scitix-ai · Pull Request #14 · scitix/sieval

jack-scitix-ai · 2026-06-23T08:08:38Z

Type

feature — new benchmark, task, or capability

Summary

Add livecodebench_code_generation_kshot_base_gen — a base-model (completions / GenModel) few-shot counterpart to the existing instruct task livecodebench_code_generation_0shot_gen.
Mirrors LiveCodeBench's official base-model prompt builder (get_base_model_question_template_answer), decomposed into a cacheable few-shot prefix + per-sample target block so the fixed prefix is built once in setup(), not per sample. For n_shot=1 the rendered prompt is byte-identical to upstream.
The defaults n_shot=3 and stop=("###",) target the DeepSeek-V3 Table 3 setting "LiveCodeBench-Base (Pass@1), 3-shot". Upstream ships only 2 few-shot examples per pool, so the 3rd example in each pool is sieval-authored (marked and documented).
Decoding params (temperature/top_p/max_tokens) are not set by the task — only n (pass@k) and the prompt-coupled stop are forwarded; decoding comes from model config / infer_args.
Pin the LiveCodeBench HF dataset to a fixed revision (@0fe84c39…) for reproducibility.
Refs: LiveCodeBench · DeepSeek-V3 report (Table 3).
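
The prefix/target decomposition described in the Summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `render_example`, `build_prefix`, `render_target`, `PromptCache`, and the `### Question` / `### Answer` layout are all hypothetical stand-ins, and the real upstream template is not reproduced here.

```python
from typing import Sequence

Example = tuple[str, str]  # (question, answer); hypothetical shape


def render_example(question: str, answer: str) -> str:
    """Render one solved few-shot example (illustrative layout only)."""
    return f"### Question\n{question}\n\n### Answer\n{answer}\n\n"


def build_prefix(examples: Sequence[Example], n_shot: int) -> str:
    """Concatenate the first n_shot examples into the fixed prefix."""
    return "".join(render_example(q, a) for q, a in examples[:n_shot])


def render_target(question: str, starter_code: str) -> str:
    """Render the per-sample block that the model is asked to complete."""
    return f"### Question\n{question}\n\n### Answer\n{starter_code}"


class PromptCache:
    """Build one prefix per pool up front, then reuse it for every sample."""

    def __init__(self, pools: dict[bool, Sequence[Example]], n_shot: int) -> None:
        # Keyed by "has starter code": True -> function pool, False -> stdin pool.
        self.prefix = {key: build_prefix(pool, n_shot) for key, pool in pools.items()}

    def prompt(self, question: str, starter_code: str) -> str:
        # Only the target block is rendered per sample.
        return self.prefix[bool(starter_code)] + render_target(question, starter_code)


pools = {
    True: [("q1", "a1"), ("q2", "a2"), ("q3", "a3")],
    False: [("s1", "b1"), ("s2", "b2"), ("s3", "b3")],
}
cache = PromptCache(pools, n_shot=3)
prompt = cache.prompt("Sum two numbers.", "def add(a, b):")
```

Because the prefix depends only on the pool and n_shot, the cached and uncached renderings are the same string; the cache only avoids rebuilding the fixed part for every sample.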

Test Plan

Automated

Lint/format clean (ruff check && ruff format --check)
Type check clean (ty check or mypy --strict)
Unit tests pass (pdm run pytest)

Manual

Model: Qwen2.5-72B (base), served via sglang, greedy (temperature=0, top_p=1, max_tokens=2000), n_shot=3.
Window: 2024-08-01 → 2024-11-01 (0801-1101), 83 problems.
Reproducibility: identical score across 3 independent runs (fails=0, timeouts=15).

Model	Expected	Actual	Diff
Qwen2.5-72B-Base	12.9 (DeepSeek-V3 report)	13.25	+0.35 (+2.7%)
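
The Diff column can be checked with a few lines of arithmetic. The sketch below assumes the relative difference is taken against the expected score, which is consistent with the numbers in the table; the variable names are illustrative.

```python
expected = 12.9   # DeepSeek-V3 report, Table 3
actual = 13.25    # score measured in this PR

abs_diff = actual - expected
rel_diff = 100 * abs_diff / expected  # relative to the expected score

summary = f"{abs_diff:+.2f} ({rel_diff:+.1f}%)"
print(summary)  # +0.35 (+2.7%)
```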

Checklist

Required (all PRs)

PR title follows conventional format (type(scope): description)
No internal paths, credentials, or personal info in committed files
AI-generated code has AI-Generated Code - <model> (<provider>) in module docstring
No new upper-layer dependencies added to core/
Deleted code verified — no remaining call sites depend on it

If: New or Modified Benchmark

Reference paper/repo linked in Summary
Score comparison table included (model, expected, actual, diff)
Dataset loading tested (sieval dataset download <name> succeeds)
Task registered in package-level __init__.py

If: community/ Changes

Upstream diff documented (what differs and why)
License attribution preserved

ethan-scitix

Verdict: approve — faithful reproduction, no blocking issues. Two non-blocking nits below.

Verified empirically:

Prompt builder byte-identical to upstream get_base_model_question_template_answer at n_shot=1 (both pools); func.json/stdin.json examples [0]/[1] are verbatim upstream, [2] is _source-tagged sieval-authored.
extract_code(lmstyle="GenericBase"), stop=("###",), max_tokens=2000 all match the upstream lcb_runner defaults; GenericBase ↔ base-model template coupling confirmed in upstream format_prompt_generation.
Reproduction: 3 deterministic Qwen2.5-72B-Base runs all land Pass@1 = 13.25 (22/166), 0 fails, 0 anomalies — vs DeepSeek-V3 Table 3 anchor 12.9 (+0.35 abs, slightly high, consistent with the documented sieval-authored 3rd few-shot). The 15 timeouts are genuine service-side subprocess timeouts (46–62s, CPU pegged ~100%), correctly scored as failures; no sample hit the HTTP-timeout/exception path.
Upstream URL + pinned HF revision resolve; check_tasks/check_datasets/check_dep_coverage and 12/12 unit tests pass.
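
As a sanity check on the reported score, the following sketch recomputes Pass@1 from the 22/166 counts above. It assumes Pass@1 here is simply the fraction of scored samples that pass, with timeouts counted among the failures as described; `pass_at_1` is a hypothetical helper, not a sieval function.

```python
def pass_at_1(outcomes: list[bool]) -> float:
    """Percentage of samples that passed; any non-pass (including a timeout) is False."""
    return 100 * sum(outcomes) / len(outcomes)


passed, total = 22, 166
# The 15 timeouts are among the False entries, since they are scored as failures.
outcomes = [True] * passed + [False] * (total - passed)

score = round(pass_at_1(outcomes), 2)
print(score)  # 13.25
```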

The deviation disclosure (n_shot generalization, the 3rd authored example, external eval sandbox) across the docstring, reference_impl.notes, index.json, and the dedicated pinning test is exemplary.

Nit: sieval/community/livecodebench/prompts/code_generation.py:5 — use importlib.resources instead of Path(__file__).parent. The latter breaks under non-filesystem loaders (zipimport) and diverges from the repo pattern (sieval/meta/loader.py:25 uses files(__package__).joinpath(...)). Since the path had to be adapted from upstream's cwd-relative open("lcb_runner/...") anyway, land the robust form:

import json
from importlib.resources import files

# Resolve the few-shot directory through the package loader instead of __file__.
_FEWSHOT = files(__package__).joinpath("few_shot_examples", "generation")
func = json.loads((_FEWSHOT / "func.json").read_text(encoding="utf-8"))
stdin = json.loads((_FEWSHOT / "stdin.json").read_text(encoding="utf-8"))

Nit: sieval/tasks/livecodebench_code_generation_kshot_base_gen.py:162-165 — dead cache-miss branch in preprocess; rely on setup(). has_starter is always a bool, and setup() (:154-156) pre-fills both keys before any sample runs (sieval/core/runners/runner.py:337), so this if never fires in the pipeline (only when preprocess is called without setup(), i.e. the standalone test). Drop it and index directly — also removes the in-preprocess mutation of shared state:

@override
async def preprocess(self, raw, ctx):
    has_starter = bool(raw["starter_code"])
    return self._fewshot_prefix[has_starter] + get_base_model_target_block(
        raw["question_content"], raw["starter_code"]
    )

Tests then need a setup() call first (matches the framework contract). Or drop setup() and go lazy-only like the gsm8k_kshot_base_gen sibling — either way, pick one mechanism.
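
To make the "setup() first" contract concrete, here is a toy sketch of the eager variant. `ToyKShotTask` is a hypothetical stand-in, not sieval's task base class; whether the real setup() is async, and its exact signature, are assumptions made for illustration.

```python
import asyncio


class ToyKShotTask:
    """Hypothetical stand-in for a few-shot task with an eager setup step."""

    def __init__(self) -> None:
        self._fewshot_prefix: dict[bool, str] = {}

    async def setup(self) -> None:
        # Fill both keys up front so preprocess never builds or mutates state.
        self._fewshot_prefix = {
            True: "<prefix for starter-code problems>",
            False: "<prefix for stdin problems>",
        }

    async def preprocess(self, raw: dict) -> str:
        has_starter = bool(raw["starter_code"])
        # Direct indexing: a missing key means setup() was skipped.
        return self._fewshot_prefix[has_starter] + raw["question_content"]


async def run_with_setup() -> str:
    task = ToyKShotTask()
    await task.setup()  # what the standalone test must now do first
    return await task.preprocess({"starter_code": "", "question_content": "Q"})


result = asyncio.run(run_with_setup())

# Skipping setup() now fails loudly instead of silently rebuilding the prefix.
try:
    asyncio.run(ToyKShotTask().preprocess({"starter_code": "", "question_content": "Q"}))
    raised = False
except KeyError:
    raised = True
```

With the cache-miss branch gone, a test that forgets setup() gets a KeyError rather than a quietly rebuilt prefix, which is the trade-off of picking the eager mechanism over the lazy one.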

jack-scitix-ai · 2026-06-25T10:12:17Z

Thanks, both addressed in the latest commit.

importlib.resources: switched the few-shot loader to files(__package__).joinpath(...) + read_text, matching sieval/meta/loader.py. Robust under zipimport now.
Dead cache-miss branch: dropped it; preprocess indexes self._fewshot_prefix directly (kept the setup() mechanism), which also removes the in-preprocess shared-state mutation. Updated the standalone test to call setup() first per the framework contract.

feat(lcb): add dataset and few-shot base-model task

cd28283

jack-scitix-ai force-pushed the feat/lcb-base branch from 087ba4a to cd28283 on June 23, 2026 08:14

ethan-scitix reviewed Jun 24, 2026


fix(lcb): use importlib.resources; drop dead preprocess cache branch

201f3fe

jack-scitix-ai wants to merge 2 commits into scitix:main from jack-scitix-ai:feat/lcb-base
