Add Hugging Face dataset artifact + builder for the 100-task corpus#6

Open
xmpuspus wants to merge 1 commit into main from feat/hf-dataset

@xmpuspus


Publishes the AWB task corpus as a Hugging Face dataset and adds a deterministic builder so it can be regenerated each release.

What this adds

  • scripts/build_hf_dataset.py — flattens every task YAML into one JSONL row (rubrics/constraints serialized as JSON strings), emits stats.json with distributions and the canonical task_set_hash.
  • huggingface/dataset/data/tasks.jsonl (100 rows, 21 columns), README.md dataset card, stats.json.
  • huggingface/push_dataset.sh — auth + upload helper.
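
The flattening step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/build_hf_dataset.py: the field names (`task_id`, `repo`, `rubrics`, `constraints`) and the helper names `flatten_task` and `to_jsonl` are assumptions, and the input is an already-parsed task mapping rather than a YAML file so the sketch needs only the standard library.

```python
import json


def flatten_task(task):
    """Flatten one parsed task mapping into a single flat row.

    Nested values (lists or dicts, e.g. rubrics and constraints) are
    serialized as JSON strings so that every column holds a scalar.
    """
    row = {}
    for key, value in task.items():
        if isinstance(value, (dict, list)):
            # sort_keys keeps the serialized string stable across runs
            row[key] = json.dumps(value, sort_keys=True)
        else:
            row[key] = value
    return row


def to_jsonl(tasks):
    """Render tasks as JSONL text, one row per line, in a stable order."""
    rows = sorted((flatten_task(t) for t in tasks), key=lambda r: r["task_id"])
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in rows)


# Hypothetical task; the real schema has 21 columns and may differ.
example = {
    "task_id": "example-001",
    "repo": "example/repo",
    "rubrics": [
        {"name": "correctness", "weight": 60},
        {"name": "style", "weight": 40},
    ],
    "constraints": {"max_files_changed": 3},
}

row = flatten_task(example)
```

Sorting both the rows and the keys is one way to keep the output byte-identical between runs, which is what makes a builder like this safe to re-run each release. A consumer recovers the nested structure with `json.loads(row["rubrics"])`.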

Provenance honesty

The dataset card labels all 100 tasks as synthetic_overlay / contamination_risk: high over 5 real pinned repos (FastAPI, httpx, Flask, Click, Starlette). The tasks are not marketed as harvested real PRs.

Verification

  • 100/100 rubrics sum to 100
  • every row carries a real pinned commit SHA
  • task_set_hash matches AWB's SHA-256-over-task-YAMLs scheme
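
Checks of this kind could be scripted along the following lines. This is a hedged sketch under several assumptions: each rubric entry is taken to carry a numeric `weight` field, the commit check only validates the shape of a full 40-character hex SHA (it cannot confirm the commit exists), and the hash shown here (sorted paths and file bytes, NUL-separated) is an illustrative scheme that may differ from AWB's actual SHA-256-over-task-YAMLs definition in ordering and separators. All function names are hypothetical.

```python
import hashlib
import json
import re


def rubric_total(rubrics_json):
    """Sum the weights of a rubric list stored as a JSON string."""
    return sum(item["weight"] for item in json.loads(rubrics_json))


def looks_like_full_sha(value):
    """True if value has the shape of a full 40-character hex commit SHA."""
    return re.fullmatch(r"[0-9a-f]{40}", value) is not None


def task_set_hash(yaml_files):
    """Hash a mapping of relative path -> raw YAML bytes.

    Assumed scheme: iterate paths in sorted order and feed each path and
    its contents into one SHA-256 digest, separated by NUL bytes.
    """
    digest = hashlib.sha256()
    for path in sorted(yaml_files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(yaml_files[path])
        digest.update(b"\0")
    return digest.hexdigest()


# Hypothetical inputs for illustration only.
rubrics = '[{"name": "correctness", "weight": 60}, {"name": "style", "weight": 40}]'
files = {"tasks/a.yaml": b"id: a\n", "tasks/b.yaml": b"id: b\n"}

total = rubric_total(rubrics)
set_hash = task_set_hash(files)
```

Iterating paths in sorted order makes the hash independent of filesystem listing order, so the same set of task files yields the same value on any machine.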

Note: huggingface/dataset/data/tasks.jsonl is matched by the *.jsonl gitignore rule and was force-added so that it is tracked despite that rule.

Reviewed by Xavier Puspus

Stage the 100-task corpus as an HF dataset for publishing. Flattens each
task YAML into one JSONL row (rubrics/constraints as JSON strings), with a
dataset card that labels the tasks honestly as synthetic_overlay /
contamination_risk high over 5 real pinned repos. push_dataset.sh handles
auth + upload; build_hf_dataset.py regenerates the artifact.

Co-Authored-By: Xavier Puspus
