Context
To scale the benchmark beyond the core team, we need a clear, enforced workflow for outside contributors to submit new tasks via PR — with automated guardrails so a reviewer can trust that a submitted task + verifier actually works before approving.
Scope
Acceptance criteria
References
validate-golden (golden fail→pass bracketing), the /benchmark PR-comment trigger pattern, feedback_paid_ci_slash_command_triggers
Context
To scale the benchmark beyond the core team, we need a clear, enforced workflow for outside contributors to submit new tasks via PR — with automated guardrails so a reviewer can trust that a submitted task + verifier actually works before approving.
Scope
Acceptance criteria
References
validate-golden(golden fail→pass bracketing), the/benchmarkPR-comment trigger pattern,feedback_paid_ci_slash_command_triggers