feat(livecodebench): add dataset and few-shot base-model task #14

Open
jack-scitix-ai wants to merge 2 commits into scitix:main from jack-scitix-ai:feat/lcb-base

@jack-scitix-ai

Contributor

Type

  • feature — new benchmark, task, or capability

Summary

  • Add livecodebench_code_generation_kshot_base_gen — a base-model (completions / GenModel) few-shot counterpart to the existing instruct task livecodebench_code_generation_0shot_gen.
  • Mirrors LiveCodeBench's official base-model prompt builder (get_base_model_question_template_answer), decomposed into a cacheable few-shot prefix + per-sample target block so the fixed prefix is built once in setup(), not per sample. For n_shot=1 the rendered prompt is byte-identical to upstream.
  • Default n_shot=3 + stop=("###",) target DeepSeek-V3 Table 3 "LiveCodeBench-Base (Pass@1), 3-shot". Upstream ships only 2 few-shot examples per pool, so a 3rd example per pool is sieval-authored (marked + documented).
  • Decoding params (temperature/top_p/max_tokens) are not set by the task — only n (pass@k) and the prompt-coupled stop are forwarded; decoding comes from model config / infer_args.
  • Pin the LiveCodeBench HF dataset to a fixed revision (@0fe84c39…) for reproducibility.
  • Refs: LiveCodeBench · DeepSeek-V3 report (Table 3).
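
The prefix/target decomposition described above can be sketched roughly as follows. This is a minimal illustration, not the task's code: the `### Question` / `### Answer` layout, `FewShotPromptBuilder`, `render_example`, and `render_target` are all hypothetical names chosen for the sketch, and only the `###` delimiter (the stop string) comes from this PR. The upstream template differs in its exact wording.

```python
from typing import Mapping, Sequence

# Hypothetical layout: only the "###" delimiter is taken from the PR (it is the stop string).
DELIM = "###"


def render_example(question: str, answer: str) -> str:
    """Render one solved few-shot example, terminated by a blank line."""
    return f"{DELIM} Question\n{question}\n\n{DELIM} Answer\n{answer}\n\n"


def render_target(question: str) -> str:
    """Render the unsolved target block; the model continues after 'Answer'."""
    return f"{DELIM} Question\n{question}\n\n{DELIM} Answer\n"


class FewShotPromptBuilder:
    """Builds each pool's fixed prefix once, then appends a per-sample target."""

    def __init__(
        self, pools: Mapping[bool, Sequence[tuple[str, str]]], n_shot: int
    ) -> None:
        self._prefix: dict[bool, str] = {}
        for has_starter, examples in pools.items():
            if n_shot > len(examples):
                raise ValueError(f"pool {has_starter!r} has fewer than {n_shot} examples")
            # The expensive, sample-independent part is rendered exactly once.
            self._prefix[has_starter] = "".join(
                render_example(q, a) for q, a in examples[:n_shot]
            )

    def build(self, question: str, has_starter: bool) -> str:
        # Per-sample work is a dictionary lookup plus one small string render.
        return self._prefix[has_starter] + render_target(question)


# Two pools, keyed by whether the problem ships starter code (placeholder content).
pools = {
    False: [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3")],
    True: [("F1", "B1"), ("F2", "B2"), ("F3", "B3")],
}
builder = FewShotPromptBuilder(pools, n_shot=3)
prompt = builder.build("Sum two integers.", has_starter=False)
```

Because every example and the target share the same delimiter, a completion that starts a new `###` block is cut off by the stop string, which is why the stop sequence is coupled to the prompt rather than to the decoding config.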

Test Plan

Automated

  • Lint/format clean (ruff check && ruff format --check)
  • Type check clean (ty check or mypy --strict)
  • Unit tests pass (pdm run pytest)

Manual

  • Model: Qwen2.5-72B (base), served via sglang, greedy (temperature=0, top_p=1, max_tokens=2000), n_shot=3.
  • Window: 2024-08-01 → 2024-11-01 (0801-1101), 83 problems.
  • Reproducibility: identical score across 3 independent runs (fails=0, timeouts=15).
| Model | Expected | Actual | Diff |
| --- | --- | --- | --- |
| Qwen2.5-72B-Base | 12.9 (DeepSeek-V3 report) | 13.25 | +0.35 (+2.7%) |

Checklist

Required (all PRs)

  • PR title follows conventional format (type(scope): description)
  • No internal paths, credentials, or personal info in committed files
  • AI-generated code has AI-Generated Code - <model> (<provider>) in module docstring
  • No new upper-layer dependencies added to core/
  • Deleted code verified — no remaining call sites depend on it

If: New or Modified Benchmark

  • Reference paper/repo linked in Summary
  • Score comparison table included (model, expected, actual, diff)
  • Dataset loading tested (sieval dataset download <name> succeeds)
  • Task registered in package-level __init__.py

If: community/ Changes

  • Upstream diff documented (what differs and why)
  • License attribution preserved

@ethan-scitix left a comment

Collaborator

Verdict: approve — faithful reproduction, no blocking issues. Two non-blocking nits below.

Verified empirically:

  • Prompt builder byte-identical to upstream get_base_model_question_template_answer at n_shot=1 (both pools); func.json/stdin.json examples [0]/[1] are verbatim upstream, [2] is _source-tagged sieval-authored.
  • extract_code(lmstyle="GenericBase"), stop=("###",), max_tokens=2000 all match the upstream lcb_runner defaults; GenericBase ↔ base-model template coupling confirmed in upstream format_prompt_generation.
  • Reproduction: 3 deterministic Qwen2.5-72B-Base runs all land Pass@1 = 13.25 (22/166), 0 fails, 0 anomalies — vs DeepSeek-V3 Table 3 anchor 12.9 (+0.35 abs, slightly high, consistent with the documented sieval-authored 3rd few-shot). The 15 timeouts are genuine service-side subprocess timeouts (46–62s, CPU pegged ~100%), correctly scored as failures; no sample hit the HTTP-timeout/exception path.
  • Upstream URL + pinned HF revision resolve; check_tasks/check_datasets/check_dep_coverage and 12/12 unit tests pass.
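
For reference, the Pass@1 arithmetic quoted above can be re-derived in a few lines. The counts (22 of 166) and the 12.9 anchor are the figures stated in this review; the variable names are only illustrative.

```python
# Figures quoted in the review: 22 passing out of 166, anchor 12.9.
passed, total = 22, 166
anchor = 12.9

pass_at_1 = round(passed / total * 100, 2)      # percentage, two decimals
abs_diff = round(pass_at_1 - anchor, 2)         # absolute gap in points
rel_diff = round(abs_diff / anchor * 100, 1)    # gap relative to the anchor, in percent

print(pass_at_1, abs_diff, rel_diff)
```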

The deviation disclosure (n_shot generalization, the 3rd authored example, external eval sandbox) across the docstring, reference_impl.notes, index.json, and the dedicated pinning test is exemplary.


Nit: sieval/community/livecodebench/prompts/code_generation.py:5 — use importlib.resources instead of Path(__file__).parent. The latter breaks under non-filesystem loaders (zipimport) and diverges from the repo pattern (sieval/meta/loader.py:25 uses files(__package__).joinpath(...)). Since the path had to be adapted from upstream's cwd-relative open("lcb_runner/...") anyway, land the robust form:

import json
from importlib.resources import files

# Resolve the bundled few-shot directory through the package loader, not the filesystem.
_FEWSHOT = files(__package__).joinpath("few_shot_examples", "generation")
func = json.loads((_FEWSHOT / "func.json").read_text(encoding="utf-8"))
stdin = json.loads((_FEWSHOT / "stdin.json").read_text(encoding="utf-8"))
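
To make the zipimport concern concrete, here is a small self-contained demonstration. It builds a throwaway package inside a zip archive, imports it from there, and compares the two lookup styles. The package name `lcb_demo_pkg` and its `few_shot/func.json` file are made up for the demo and are unrelated to the real sieval layout.

```python
import importlib
import json
import sys
import tempfile
import zipfile
from importlib.resources import files
from pathlib import Path

PKG = "lcb_demo_pkg"  # hypothetical package name used only for this demo

# Build a zip archive containing a tiny package with one bundled JSON resource.
archive = Path(tempfile.mkdtemp()) / "demo.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr(f"{PKG}/__init__.py", "")
    zf.writestr(f"{PKG}/few_shot/func.json", json.dumps([{"question": "q0"}]))

# Import the package straight from the archive (zipimport).
sys.path.insert(0, str(archive))
importlib.invalidate_caches()
module = importlib.import_module(PKG)

# Resource-based lookup goes through the loader, so it works inside the archive.
resource = files(PKG).joinpath("few_shot").joinpath("func.json")
loaded = json.loads(resource.read_text(encoding="utf-8"))

# File-path lookup points inside the archive, which is not a real directory.
try:
    (Path(module.__file__).parent / "few_shot" / "func.json").read_text(encoding="utf-8")
    path_lookup_ok = True
except OSError:
    path_lookup_ok = False

print(loaded, path_lookup_ok)
```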

Nit: sieval/tasks/livecodebench_code_generation_kshot_base_gen.py:162-165 — dead cache-miss branch in preprocess; rely on setup(). has_starter is always a bool, and setup() (:154-156) pre-fills both keys before any sample runs (sieval/core/runners/runner.py:337), so this if never fires in the pipeline (only when preprocess is called without setup(), i.e. the standalone test). Drop it and index directly — also removes the in-preprocess mutation of shared state:

@override
async def preprocess(self, raw, ctx):
    has_starter = bool(raw["starter_code"])
    return self._fewshot_prefix[has_starter] + get_base_model_target_block(
        raw["question_content"], raw["starter_code"]
    )

Tests then need a setup() call first (matches the framework contract). Or drop setup() and go lazy-only like the gsm8k_kshot_base_gen sibling — either way, pick one mechanism.
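
A rough sketch of what the setup-first test contract looks like, using a stand-in class in place of the real task. `DemoKShotTask`, its placeholder prefixes, and the synchronous `setup()` signature are assumptions made for illustration; the real task's `setup()` and `preprocess()` signatures may differ.

```python
import asyncio


class DemoKShotTask:
    """Stand-in for the task: setup() fills the cache, preprocess() only reads it."""

    def __init__(self) -> None:
        self._fewshot_prefix: dict[bool, str] = {}

    def setup(self) -> None:
        # Pre-fill both keys so preprocess never needs a cache-miss branch.
        self._fewshot_prefix = {False: "<stdin shots>", True: "<func shots>"}

    async def preprocess(self, raw: dict) -> str:
        has_starter = bool(raw["starter_code"])
        # Direct indexing: no mutation of shared state during per-sample work.
        return self._fewshot_prefix[has_starter] + raw["question_content"]


sample = {"question_content": "Q", "starter_code": ""}

# With setup() called first, preprocess is a pure lookup.
task = DemoKShotTask()
task.setup()
prompt = asyncio.run(task.preprocess(sample))

# Without setup(), the direct index fails loudly instead of silently rebuilding.
try:
    asyncio.run(DemoKShotTask().preprocess(sample))
    missing_setup_detected = False
except KeyError:
    missing_setup_detected = True
```

The loud `KeyError` is the point of picking one mechanism: a test that forgets `setup()` fails immediately rather than exercising a code path the pipeline never takes.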

@jack-scitix-ai

Contributor Author

Thanks, both addressed in the latest commit.

  1. importlib.resources: switched the few-shot loader to files(__package__).joinpath(...) + read_text, matching sieval/meta/loader.py. Robust under zipimport now.
  2. Dead cache-miss branch: dropped it; preprocess indexes self._fewshot_prefix directly (kept the setup() mechanism), which also removes the in-preprocess shared-state mutation. Updated the standalone test to call setup() first per the framework contract.
