Contributor workflow: SOP + enforced CI for submitting new benchmark tasks via PR (#75)

Description

@pranavraja99

Context

To scale the benchmark beyond the core team, we need a clear, enforced workflow for outside contributors to submit new tasks via PR — with automated guardrails so a reviewer can trust that a submitted task + verifier actually works before approving.

Scope

  • SOP doc: how to submit a new benchmark task via PR (task/prompt format, how to declare the simulated env, how to write a verifier, where files go)
  • Enforced CI tests on task-submission PRs that verify the verifier works:
    • the task loads and runs
    • the verifier passes on a known-good ("golden") trajectory and fails on a known-bad one (golden bracketing)
  • A test run on the PR so the reviewer can inspect the task + a sample trajectory and approve / request changes
  • Prefer a maintainer-gated, slash-command trigger for any paid test run (don't auto-spend on every PR)
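A minimal sketch of the golden-bracketing check described above, to make the CI contract concrete. Everything here is an assumption for illustration, not the repo's actual API: the `Trajectory` and `Verifier` types, the `check_bracketing` helper, and the toy verifier are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical types: a trajectory is a sequence of step records, and a
# verifier maps a trajectory to pass (True) or fail (False).
Trajectory = Sequence[dict]
Verifier = Callable[[Trajectory], bool]


@dataclass
class BracketResult:
    golden_passed: bool  # verifier accepted the known-good trajectory
    bad_failed: bool     # verifier rejected the known-bad trajectory

    @property
    def ok(self) -> bool:
        # The verifier "brackets" only if both conditions hold.
        return self.golden_passed and self.bad_failed


def check_bracketing(verifier: Verifier, golden: Trajectory, bad: Trajectory) -> BracketResult:
    """Run the verifier on both trajectories and record each outcome."""
    return BracketResult(
        golden_passed=bool(verifier(golden)),
        bad_failed=not bool(verifier(bad)),
    )


# Toy task (hypothetical): the agent must submit the answer "42".
def example_verifier(trajectory: Trajectory) -> bool:
    return any(
        step.get("action") == "submit" and step.get("answer") == "42"
        for step in trajectory
    )


golden = [{"action": "search"}, {"action": "submit", "answer": "42"}]
bad = [{"action": "search"}, {"action": "submit", "answer": "41"}]

result = check_bracketing(example_verifier, golden, bad)
```

In a CI job, the script would exit non-zero when `result.ok` is false, which is what blocks the PR. Recording the two outcomes separately lets the failure message say whether the verifier is too strict (rejects golden) or too lenient (accepts bad).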

Acceptance criteria

  • A documented, repeatable submission flow lives in the repo (README/CONTRIBUTING)
  • CI blocks a task PR whose verifier doesn't bracket (passes golden / fails bad)
  • Reviewer can see a sample trajectory for the submitted task from the PR
  • Dry-run the flow with one example task PR

References

  • validate-golden (golden fail→pass bracketing)
  • the /benchmark PR-comment trigger pattern
  • feedback_paid_ci_slash_command_triggers
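One way the maintainer-gated trigger could be decided, sketched under assumptions: the function name `should_run_paid_test` and the set of trusted roles are hypothetical, and the check assumes the CI job receives the comment body and the commenter's `author_association` value from the GitHub comment event. The existing /benchmark trigger may work differently.

```python
# Assumption: these GitHub author_association values count as "maintainer".
MAINTAINER_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}


def should_run_paid_test(
    comment_body: str,
    author_association: str,
    command: str = "/benchmark",
) -> bool:
    """Return True only if a maintainer's comment starts with the slash command."""
    stripped = comment_body.strip()
    if not stripped:
        return False
    # Only the first token of the first non-empty line counts as the command,
    # so a command quoted mid-sentence does not trigger a paid run.
    tokens = stripped.splitlines()[0].split()
    is_command = bool(tokens) and tokens[0] == command
    return is_command and author_association in MAINTAINER_ASSOCIATIONS
```

With this gate, a `/benchmark` comment from an outside contributor is ignored, so no paid run starts until a maintainer issues the command.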

Labels

  • documentation (Improvements or additions to documentation)
  • enhancement (New feature or request)
