Scope
Turn benchmarks/webarena/stage0-manifest.json into executable pinned contracts against a reviewed local WebArena environment.
Why it matters
Auto Browser should not publish competitive benchmark numbers until the task subset, environment, and evidence requirements are reproducible.
Done when
- The local WebArena environment/source revision is pinned.
- The five manifest task classes map to executable contracts.
- Runs save trace, actions, screenshots, and model-decision evidence.
- Docs clearly state whether the lane is scored or still tracked-only.
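
One way to make the evidence criterion checkable is a small gate that inspects a run directory before its result is counted. The following is only a sketch under stated assumptions: the artifact names (`trace.json`, `actions.jsonl`, `screenshots/`, `model_decisions.jsonl`) and the function `missing_evidence` are hypothetical and are not taken from the manifest or from Auto Browser.

```python
from pathlib import Path

# Hypothetical artifact names; the real run layout is not specified in this issue.
REQUIRED_EVIDENCE = {
    "trace": "trace.json",
    "actions": "actions.jsonl",
    "screenshots": "screenshots",  # assumed to be a directory of images
    "model_decisions": "model_decisions.jsonl",
}


def missing_evidence(run_dir):
    """Return the sorted evidence kinds that are absent or empty in run_dir."""
    root = Path(run_dir)
    missing = []
    for kind, name in REQUIRED_EVIDENCE.items():
        path = root / name
        if path.is_dir():
            # A directory counts only if it holds at least one entry.
            present = any(path.iterdir())
        else:
            # A file counts only if it exists and is non-empty.
            present = path.is_file() and path.stat().st_size > 0
        if not present:
            missing.append(kind)
    return sorted(missing)
```

A run would count toward a scored lane only when `missing_evidence(run_dir)` returns an empty list. Otherwise it could be reported as tracked-only.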