Wire the WebArena Stage 0 benchmark manifest #39

Description

@LvcidPsyche

Scope

Turn benchmarks/webarena/stage0-manifest.json into executable pinned contracts against a reviewed local WebArena environment.
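
As a rough illustration of what a "pinned contract" could look like in code, here is a minimal Python sketch. The manifest schema is not shown in this issue, so the keys `environment_revision` and `task_classes`, the `PinnedContract` type, and the `load_contracts` helper are all hypothetical names chosen for the example, not the project's actual format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PinnedContract:
    """One task class bound to a specific environment revision (hypothetical shape)."""
    task_class: str
    environment_revision: str


def load_contracts(manifest_text: str) -> list[PinnedContract]:
    """Build contracts from manifest JSON, refusing to run against an unpinned environment.

    Assumes two hypothetical keys: "environment_revision" (a string) and
    "task_classes" (a list of task class names).
    """
    manifest = json.loads(manifest_text)
    revision = manifest.get("environment_revision")
    if not revision:
        raise ValueError("manifest does not pin an environment revision")
    return [
        PinnedContract(task_class=name, environment_revision=revision)
        for name in manifest.get("task_classes", [])
    ]
```

Failing loudly on a missing revision is one way to keep unpinned runs from producing numbers that look comparable but are not.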

Why it matters

Auto Browser should not publish competitive benchmark numbers until the task subset, environment, and evidence requirements are reproducible.

Done when

  • The local WebArena environment/source revision is pinned.
  • The five manifest task classes map to executable contracts.
  • Runs save trace, actions, screenshots, and model-decision evidence.
  • Docs clearly state whether the lane is scored or still tracked-only.
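
The evidence requirement in the list above could be enforced with a small completeness check per run. The sketch below is only an assumption about layout: the file and directory names (`trace.json`, `actions.jsonl`, `screenshots/`, `model_decisions.jsonl`) and the `missing_evidence` helper are hypothetical and not taken from the repository.

```python
from pathlib import Path

# Hypothetical artifact names, one per evidence kind listed in "Done when".
REQUIRED_EVIDENCE = {
    "trace": "trace.json",
    "actions": "actions.jsonl",
    "screenshots": "screenshots",
    "model_decisions": "model_decisions.jsonl",
}


def missing_evidence(run_dir: Path) -> list[str]:
    """Return the evidence kinds that are absent or empty in a run directory."""
    missing = []
    for kind, name in REQUIRED_EVIDENCE.items():
        path = run_dir / name
        if kind == "screenshots":
            # Screenshots are expected as a non-empty directory.
            present = path.is_dir() and any(path.iterdir())
        else:
            # Other evidence is expected as a non-empty file.
            present = path.is_file() and path.stat().st_size > 0
        if not present:
            missing.append(kind)
    return missing
```

A run would count toward a scored lane only when `missing_evidence(run_dir)` returns an empty list; anything else could stay tracked-only.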

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)