A simple, focused benchmark viewer built for iterative development.
Build incrementally:
- Start minimal - Core functionality only
- Works immediately - No broken features
- Iterates cleanly - Easy to add features one at a time
- Tests well - Each feature can be unit tested
- Stays simple - Resist feature creep
Progressive enhancement approach:
- Single HTML file
- Embedded JSON data or API calls
- Pure Alpine.js for reactivity
- Shows completed benchmark runs
What works:
- List of benchmark runs
- Task-level results (pass/fail)
- Step-by-step execution trace
- Screenshot display
- Basic metrics (success rate, avg steps)
Planned:
- Poll `/api/benchmark/status` every 5s
- Show progress bar for running evals
- Update metrics in real-time
- ETA calculation
- Cost tracking per run
- Worker status and utilization
- Performance charts (success rate over time)
- Domain filtering
- Failure clustering
- Regression detection
- Task difficulty ranking
- Model comparison side-by-side
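The basic metrics above are simple aggregates over per-task results. A minimal sketch of the computation, assuming each task record carries the `success` and `num_steps` fields used in this viewer's data format:

```python
def summarize(tasks: list[dict]) -> dict:
    """Aggregate per-task results into run-level metrics.

    Assumes each task dict carries `success` (bool) and `num_steps`
    (int), matching the run-data format the viewer consumes.
    """
    if not tasks:
        return {"num_tasks": 0, "num_success": 0,
                "success_rate": 0.0, "avg_steps": 0.0}
    num_success = sum(1 for t in tasks if t.get("success"))
    return {
        "num_tasks": len(tasks),
        "num_success": num_success,
        "success_rate": num_success / len(tasks),
        "avg_steps": sum(t.get("num_steps", 0) for t in tasks) / len(tasks),
    }
```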
```
openadapt-viewer/
├── viewers/
│   └── benchmark/
│       ├── minimal_viewer.html   # Single-file viewer (492 lines)
│       └── generator.py          # Python generator for embedding data
```
Single file design:
- Easy to open (no server needed for basic use)
- Easy to test (just open in browser)
- Easy to share (copy one file)
- Can still serve via HTTP for API calls
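One way a generator script can produce such a self-contained file is to inject the run data as a JSON literal into an HTML template. A hypothetical sketch of that approach (the `/*__DATA__*/` placeholder is an illustration, not necessarily the convention generator.py uses):

```python
import json

def embed_data(template: str, data: dict,
               placeholder: str = "/*__DATA__*/") -> str:
    """Replace a placeholder in an HTML template with a JSON literal.

    The placeholder name is hypothetical. Escaping `</` prevents a
    literal `</script>` inside the data from terminating the
    surrounding script tag early.
    """
    payload = json.dumps(data).replace("</", "<\\/")
    return template.replace(placeholder, payload)

template = "<script>const DATA = /*__DATA__*/;</script>"
html = embed_data(template, {"runs": []})
```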
The viewer consumes benchmark run data in this format:
```json
{
  "runs": [
    {
      "run_name": "waa_eval_20251217_test_real",
      "benchmark_name": "waa",
      "model_id": "openai-api",
      "num_tasks": 10,
      "num_success": 0,
      "success_rate": 0.0,
      "avg_steps": 3.5,
      "avg_time_seconds": 9.87,
      "tasks": [
        {
          "task_id": "browser_1",
          "success": false,
          "score": 0.0,
          "num_steps": 3,
          "error": null
        }
      ]
    }
  ]
}
```

Task execution details (from `execution.json`):
```json
{
  "task_id": "browser_1",
  "status": "completed",
  "steps": [
    {
      "step_number": 1,
      "timestamp": "2025-12-16T16:10:49.444888",
      "action": "CLICK(x=100, y=200)",
      "reasoning": "Click on the browser icon",
      "screenshot_path": "screenshots/step_001.png"
    }
  ]
}
```

The viewer uses these endpoints (implemented in openadapt-ml/openadapt_ml/cloud/local.py):
`GET /api/benchmark/runs` returns a list of all benchmark runs.

Response:

```json
[
  {
    "run_name": "waa_eval_20251217_test_real",
    "benchmark_name": "waa",
    "model_id": "openai-api",
    "num_tasks": 10,
    "num_success": 0,
    "success_rate": 0.0,
    "avg_steps": 3.5,
    "tasks": [...]
  }
]
```

Returns execution details for a specific task.
Response:
```json
{
  "task_id": "browser_1",
  "status": "completed",
  "steps": [
    {
      "step_number": 1,
      "action": "CLICK(...)",
      "reasoning": "...",
      "screenshot_path": "screenshots/step_001.png"
    }
  ]
}
```

Serves screenshot PNG files.
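For quick checks outside the browser, the runs endpoint can also be queried from Python. A small standard-library sketch (assumes the local server is running on port 8765, as in the serve command below):

```python
import json
from urllib.request import urlopen

def fetch_runs(base_url: str = "http://localhost:8765") -> list[dict]:
    """Fetch all benchmark runs from the viewer API."""
    with urlopen(f"{base_url}/api/benchmark/runs") as resp:
        return json.load(resp)

def format_run(run: dict) -> str:
    """One-line summary of a run, using fields from the runs response."""
    return (f"{run['run_name']}: {run['success_rate']:.0%} "
            f"({run['num_success']}/{run['num_tasks']})")
```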
```bash
# From openadapt-ml directory
cd /Users/abrichr/oa/src/openadapt-ml

# Serve with API endpoints
uv run python -m openadapt_ml.cloud.local serve --port 8765

# Open minimal viewer
open http://localhost:8765/minimal_benchmark.html
```

```bash
# From openadapt-viewer directory
cd /Users/abrichr/oa/src/openadapt-viewer

# Generate viewer with embedded data
python viewers/benchmark/generator.py \
    --results-dir /Users/abrichr/oa/src/openadapt-ml/benchmark_results \
    --run-name waa_eval_20251217_test_real \
    --output minimal_benchmark.html

# Open in browser (no server needed)
open minimal_benchmark.html
```

Copy viewer to a served directory:
```bash
# Copy to openadapt-ml training_output for serving
cp /Users/abrichr/oa/src/openadapt-viewer/viewers/benchmark/minimal_viewer.html \
   /Users/abrichr/oa/src/openadapt-ml/training_output/current/minimal_benchmark.html

# Open in browser
open http://localhost:8765/minimal_benchmark.html
```

The viewer is designed for easy iteration. Here's how to add features:
Adding a new metric:
- Update summary.json to include the new metric
- Add a metric card to the HTML:
```html
<div class="metric-card">
  <div class="metric-value" x-text="selectedRun?.new_metric || 0"></div>
  <div class="metric-label">New Metric</div>
</div>
```

Adding live progress:
- Add polling to the Alpine.js data:
```javascript
init() {
    this.loadRuns();
    setInterval(() => this.pollProgress(), 5000);
},

async pollProgress() {
    const response = await fetch('/api/benchmark/status');
    const status = await response.json();
    if (status.running) {
        // Update progress bar
        this.progress = status.tasks_completed / status.tasks_total;
    }
}
```

- Add progress UI:
```html
<div class="progress-bar" x-show="running">
  <div class="progress-fill" :style="`width: ${progress * 100}%`"></div>
</div>
```

Adding charts:
- Include Chart.js from a CDN:

```html
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
```

- Add a chart component:

```html
<canvas x-ref="successChart"></canvas>
```

- Initialize it in Alpine.js:
```javascript
init() {
    this.loadRuns();
    this.$nextTick(() => this.createChart());
},

createChart() {
    new Chart(this.$refs.successChart, {
        type: 'line',
        data: {
            labels: this.runs.map(r => r.run_name),
            datasets: [{
                label: 'Success Rate',
                data: this.runs.map(r => r.success_rate * 100)
            }]
        }
    });
}
```

The viewer can be tested without a server:
```bash
# Create test data
cat > test_data.json <<EOF
{
  "runs": [
    {
      "run_name": "test_run",
      "num_tasks": 5,
      "success_rate": 0.6
    }
  ]
}
EOF

# Embed test data in viewer
python viewers/benchmark/generator.py \
    --run-name test_run \
    --output test_viewer.html

# Open and verify
open test_viewer.html
```

Test with real benchmark data:
```bash
# Run a small benchmark
cd /Users/abrichr/oa/src/openadapt-ml
uv run python -m openadapt_ml.benchmarks.cli test-collection --tasks 3

# Verify data structure
ls benchmark_results/*/tasks/*/

# Load in viewer
uv run python -m openadapt_ml.cloud.local serve --open
# Navigate to minimal_benchmark.html
```

The MVP meets these criteria:
- ✅ Single HTML file under 500 lines (492 lines)
- ✅ Works with file:// protocol (no server needed for embedded data)
- ✅ Works with HTTP API (loads from `/api/benchmark/runs`)
- ✅ Shows real benchmark data
- ✅ Easy to understand code (Alpine.js + vanilla JS)
- ✅ Clear extension points for features
- ✅ Tested with actual benchmark runs
When adding features, follow this order:
Iteration 2: Live Progress (next)
- Add polling for `/api/benchmark/status`
- Show progress bar with ETA
- Auto-refresh on completion
- Estimated effort: 2 hours
Iteration 3: Enhanced Display
- Add cost display per run
- Add domain filtering
- Add search/filter for tasks
- Estimated effort: 3 hours
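Domain filtering can key off the task id prefix (e.g. `browser_1` suggests a `browser` domain). A sketch; the `<domain>_<number>` naming scheme is an assumption inferred from the sample data, not a documented contract:

```python
def task_domain(task_id: str) -> str:
    """Derive a domain from a task id like 'browser_1' -> 'browser'.

    Assumes ids follow a '<domain>_<number>' naming scheme.
    """
    return task_id.rsplit("_", 1)[0]

def filter_by_domain(tasks: list[dict], domain: str) -> list[dict]:
    """Keep only tasks whose derived domain matches."""
    return [t for t in tasks if task_domain(t["task_id"]) == domain]
```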
Iteration 4: Charts
- Success rate trend over time
- Domain breakdown chart
- Task difficulty distribution
- Estimated effort: 4 hours
Iteration 5: Analysis
- Failure pattern clustering
- Regression detection
- Model comparison view
- Estimated effort: 8 hours
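Failure clustering in Iteration 5 can start as simple grouping of normalized error messages from the task records. A sketch with a deliberately naive normalization (digits collapsed, so errors differing only in coordinates or ids land in the same bucket):

```python
import re
from collections import defaultdict

def cluster_failures(tasks: list[dict]) -> dict[str, list[str]]:
    """Group failed task ids by a normalized error message.

    Normalization just collapses digit runs to 'N', so messages that
    differ only in numeric details cluster together.
    """
    clusters: dict[str, list[str]] = defaultdict(list)
    for t in tasks:
        if t.get("success") or not t.get("error"):
            continue
        key = re.sub(r"\d+", "N", t["error"]).strip()
        clusters[key].append(t["task_id"])
    return dict(clusters)
```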
Why Alpine.js:
- Already used in other OpenAdapt viewers (consistency)
- No build step required
- Simple reactivity model
- Small footprint (~15KB)
Why a single file:
- Easier to debug (everything in one place)
- Can be opened without a server
- Easy to share/deploy
- Forces simplicity
Why minimal:
- No build tooling needed
- Faster iteration
- Smaller surface area for bugs
- Easy for others to modify
Why incremental:
- Ship working features immediately
- Each iteration adds value
- Easy to test each level independently
- Clear rollback points
The full benchmark viewer (benchmark_viewer.html) has:
- Live progress tracking
- Worker utilization
- Cost tracking
- Domain filtering
- Charts and analysis
The minimal viewer intentionally omits these to:
- Reduce complexity
- Make code easier to understand
- Allow incremental feature addition
- Ensure core features work perfectly
Both viewers can coexist. Use minimal viewer for:
- Quick results review
- Debugging benchmark runs
- Sharing results with stakeholders
- Embedding in documentation
Use full viewer for:
- Live monitoring during runs
- Performance analysis
- Cost tracking
- Team coordination