Add Hugging Face dataset artifact + builder for the 100-task corpus #6
Open
xmpuspus wants to merge 1 commit into
Stage the 100-task corpus as an HF dataset for publishing. Flattens each task YAML into one JSONL row (rubrics/constraints as JSON strings), with a dataset card that labels the tasks honestly as synthetic_overlay / contamination_risk high over 5 real pinned repos. push_dataset.sh handles auth + upload; build_hf_dataset.py regenerates the artifact.

Co-Authored-By: Xavier Puspus
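The flattening step described in the commit message might look roughly like the sketch below. It is not the code in build_hf_dataset.py: it assumes each task YAML has already been parsed into a dict (for example with PyYAML's yaml.safe_load), and the field names task_id and repo, as well as the helpers flatten_task and to_jsonl, are hypothetical. Only rubrics and constraints are taken from the description above.

```python
import json

# Fields serialized as JSON strings so every row has the same flat shape.
# "rubrics" and "constraints" come from the PR text; the rest is illustrative.
NESTED_FIELDS = ("rubrics", "constraints")


def flatten_task(task: dict) -> dict:
    """Return a flat row: nested fields become JSON strings, others pass through."""
    row = {}
    for key, value in task.items():
        if key in NESTED_FIELDS:
            # sort_keys keeps the serialized string stable across runs
            row[key] = json.dumps(value, sort_keys=True)
        else:
            row[key] = value
    return row


def to_jsonl(tasks: list[dict]) -> str:
    """Emit one JSON object per line, ordered by the (hypothetical) task_id."""
    ordered = sorted(tasks, key=lambda t: t["task_id"])
    return "".join(
        json.dumps(flatten_task(t), sort_keys=True) + "\n" for t in ordered
    )


# Hypothetical parsed task, standing in for one task YAML.
example = {
    "task_id": "demo-001",
    "repo": "fastapi",
    "rubrics": [{"name": "tests_pass", "weight": 1.0}],
    "constraints": {"max_files_changed": 3},
}
row = flatten_task(example)
```

Sorting the tasks and the keys is one way to keep regenerated output byte-identical between runs, which is what makes a builder like this deterministic. Consumers recover the nested structure with json.loads(row["rubrics"]).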
Publishes the AWB task corpus as a Hugging Face dataset and adds a deterministic builder so it can be regenerated each release.
What this adds
- scripts/build_hf_dataset.py: flattens every task YAML into one JSONL row (rubrics/constraints serialized as JSON strings), emits stats.json with distributions and the canonical task_set_hash.
- huggingface/dataset/: data/tasks.jsonl (100 rows, 21 columns), README.md dataset card, stats.json.
- huggingface/push_dataset.sh: auth + upload helper.

Provenance honesty
The dataset card labels all 100 tasks as synthetic_overlay / contamination_risk: high over 5 real pinned repos (FastAPI, httpx, Flask, Click, Starlette). Not marketed as harvested real PRs.

Verification
- task_set_hash matches AWB's SHA-256-over-task-YAMLs scheme

Note: huggingface/dataset/data/tasks.jsonl was force-added past the *.jsonl gitignore rule.

Reviewed by Xavier Puspus
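
For readers unfamiliar with the task_set_hash check under Verification, here is a minimal sketch of a SHA-256 digest computed over a set of task YAML files. The PR does not spell out AWB's exact scheme, so the details here are assumptions: the function name, hashing files in sorted-name order, and mixing the file name into the digest with NUL separators are all illustrative choices and may differ from what AWB does.

```python
import hashlib


def task_set_hash(files: dict[str, bytes]) -> str:
    """Hash a mapping of file name -> raw bytes in a stable, order-independent way.

    Assumed scheme (not confirmed by the PR): iterate names in sorted order and
    feed name and content into a single SHA-256, separated by NUL bytes.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[name])
        digest.update(b"\0")
    return digest.hexdigest()


# Hypothetical in-memory stand-ins for task YAML files.
a = {"t1.yaml": b"id: 1\n", "t2.yaml": b"id: 2\n"}
b = {"t2.yaml": b"id: 2\n", "t1.yaml": b"id: 1\n"}  # same files, different order
c = {"t1.yaml": b"id: 1\n", "t2.yaml": b"id: 3\n"}  # one file changed
```

Under this sketch, a and b hash to the same value while c does not, which is the property a published task_set_hash needs for a consumer to confirm that the dataset was built from the same task files.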