feat(livecodebench): add dataset and few-shot base-model task #14
jack-scitix-ai wants to merge 2 commits into
ethan-scitix
left a comment
Verdict: approve — faithful reproduction, no blocking issues. Two non-blocking nits below.
Verified empirically:
- Prompt builder byte-identical to upstream get_base_model_question_template_answer at n_shot=1 (both pools); func.json / stdin.json examples [0]/[1] are verbatim upstream, [2] is _source-tagged sieval-authored.
- extract_code(lmstyle="GenericBase"), stop=("###",), max_tokens=2000 all match the upstream lcb_runner defaults; GenericBase ↔ base-model template coupling confirmed in upstream format_prompt_generation.
- Reproduction: 3 deterministic Qwen2.5-72B-Base runs all land Pass@1 = 13.25 (22/166), 0 fails, 0 anomalies, vs. the DeepSeek-V3 Table 3 anchor of 12.9 (+0.35 abs; slightly high, consistent with the documented sieval-authored 3rd few-shot). The 15 timeouts are genuine service-side subprocess timeouts (46–62 s, CPU pegged at ~100%), correctly scored as failures; no sample hit the HTTP-timeout/exception path.
- Upstream URL + pinned HF revision resolve; check_tasks / check_datasets / check_dep_coverage and 12/12 unit tests pass.
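For anyone re-checking the reproduction figures, the arithmetic can be sketched as below. The helper name pass_at_1 is made up for illustration and is not part of sieval; the inputs are the numbers quoted in the list above.

```python
def pass_at_1(passed: int, total: int) -> float:
    """Pass@1 as a percentage: share of samples whose single completion passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


score = pass_at_1(22, 166)   # 22 passes out of 166, as reported in the review
anchor = 12.9                # DeepSeek-V3 Table 3 value quoted in the review
delta = score - anchor

print(f"{score:.2f} ({delta:+.2f} abs)")  # prints "13.25 (+0.35 abs)"
```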
The deviation disclosure (n_shot generalization, the 3rd authored example, external eval sandbox) across the docstring, reference_impl.notes, index.json, and the dedicated pinning test is exemplary.
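A pinning test of the kind mentioned above might look roughly like the sketch below. This is not the test in the PR: the pool contents, the value of the _source tag, and the helper names are assumptions for illustration, and a real test would compare against a digest recorded from the upstream files rather than one computed on the spot.

```python
import hashlib
import json

# Hypothetical stand-in for one few-shot pool: two upstream entries plus one
# authored entry carrying a "_source" tag (the tag value is an assumption).
POOL = [
    {"question": "q0", "answer": "a0"},
    {"question": "q1", "answer": "a1"},
    {"question": "q2", "answer": "a2", "_source": "sieval-authored"},
]


def digest(examples) -> str:
    """Stable SHA-256 over a canonical JSON encoding of the examples."""
    blob = json.dumps(examples, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# In a real test this would be a hard-coded hex string recorded from upstream.
UPSTREAM_DIGEST = digest(POOL[:2])


def check_pool(pool) -> bool:
    """Fail if the upstream examples drifted or an unexpected entry is authored."""
    assert digest(pool[:2]) == UPSTREAM_DIGEST, "upstream examples drifted"
    authored = [i for i, ex in enumerate(pool) if "_source" in ex]
    assert authored == [2], "only the third example may be authored"
    return True
```

Hashing a canonical encoding keeps the check independent of key order in the JSON file while still catching any change to the upstream text.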
Nit (sieval/community/livecodebench/prompts/code_generation.py:5): use importlib.resources instead of Path(__file__).parent. The latter breaks under non-filesystem loaders (zipimport) and diverges from the repo pattern (sieval/meta/loader.py:25 uses files(__package__).joinpath(...)). Since the path had to be adapted from upstream's cwd-relative open("lcb_runner/...") anyway, land the robust form:

    import json
    from importlib.resources import files

    # Resolve the few-shot directory through the package loader, not the filesystem.
    _FEWSHOT = files(__package__).joinpath("few_shot_examples", "generation")
    func = json.loads((_FEWSHOT / "func.json").read_text(encoding="utf-8"))
    stdin = json.loads((_FEWSHOT / "stdin.json").read_text(encoding="utf-8"))

Nit (sieval/tasks/livecodebench_code_generation_kshot_base_gen.py:162-165): dead cache-miss branch in preprocess; rely on setup(). has_starter is always a bool, and setup() (:154-156) pre-fills both keys before any sample runs (sieval/core/runners/runner.py:337), so this if never fires in the pipeline (only when preprocess is called without setup(), i.e. the standalone test). Drop it and index directly, which also removes the in-preprocess mutation of shared state:

    @override
    async def preprocess(self, raw, ctx):
        has_starter = bool(raw["starter_code"])
        return self._fewshot_prefix[has_starter] + get_base_model_target_block(
            raw["question_content"], raw["starter_code"]
        )

Tests then need a setup() call first (matches the framework contract). Or drop setup() and go lazy-only like the gsm8k_kshot_base_gen sibling. Either way, pick one mechanism.
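To make the eager option concrete, here is a minimal self-contained sketch of a task whose setup() fills the prefix cache and whose preprocess() only reads it. EagerPrefixTask, _build_prefix, and the placeholder prefix strings are hypothetical and stand in for the real sieval classes and few-shot rendering; the mapping from has_starter to a pool name is likewise an assumption.

```python
import asyncio


class EagerPrefixTask:
    """Hypothetical task: setup() fills the cache, preprocess() only reads it."""

    def __init__(self, n_shot: int = 3):
        self.n_shot = n_shot
        self._fewshot_prefix: dict[bool, str] = {}

    def _build_prefix(self, has_starter: bool) -> str:
        # Placeholder for the real few-shot rendering.
        pool = "func" if has_starter else "stdin"
        return "".join(f"[{pool} example {i}]\n" for i in range(self.n_shot))

    def setup(self) -> None:
        # Pre-fill both keys once, before any sample is processed.
        for has_starter in (False, True):
            self._fewshot_prefix[has_starter] = self._build_prefix(has_starter)

    async def preprocess(self, raw: dict) -> str:
        has_starter = bool(raw["starter_code"])
        # Direct index: a KeyError here means setup() was skipped.
        return self._fewshot_prefix[has_starter] + raw["question_content"]


task = EagerPrefixTask(n_shot=1)
task.setup()
prompt = asyncio.run(task.preprocess({"starter_code": "", "question_content": "Q"}))
```

With this shape a test that forgets setup() fails loudly with a KeyError instead of silently taking a fallback path, which is the point of picking a single mechanism.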
Thanks, both addressed in the latest commit.
Type
Summary
- Adds livecodebench_code_generation_kshot_base_gen, a base-model (completions / GenModel) few-shot counterpart to the existing instruct task livecodebench_code_generation_0shot_gen.
- Prompt builder mirrors upstream (get_base_model_question_template_answer), decomposed into a cacheable few-shot prefix + per-sample target block so the fixed prefix is built once in setup(), not per sample. For n_shot=1 the rendered prompt is byte-identical to upstream.
- n_shot=3 + stop=("###",) target DeepSeek-V3 Table 3 "LiveCodeBench-Base (Pass@1), 3-shot". Upstream ships only 2 few-shot examples per pool, so a 3rd example per pool is sieval-authored (marked + documented).
- Decoding parameters (temperature / top_p / max_tokens) are not set by the task: only n (pass@k) and the prompt-coupled stop are forwarded; decoding comes from model config / infer_args.
- Dataset pinned to an HF revision (@0fe84c39…), for reproducibility.
Test Plan
Automated
- Lint and format (ruff check && ruff format --check)
- Type check (ty check or mypy --strict)
- Tests (pdm run pytest)
Manual
- Decoding: temperature=0, top_p=1, max_tokens=2000; n_shot=3.
- Problem subset 0801-1101, 83 problems.
- Result: fails=0, timeouts=15.
Checklist
Required (all PRs)
- Conventional title (type(scope): description)
- AI-Generated Code - <model> (<provider>) in module docstring
- core/
If: New or Modified Benchmark
- Dataset download (sieval dataset download <name> succeeds)
- __init__.py
If: community/ Changes