Benchmark Viewer Implementation Review

Date: January 17, 2026
Reviewer: Claude Code
Context: User reports the benchmark viewer at http://localhost:8765/benchmark.html shows "no evaluation running" and a mock data warning despite an Azure evaluation (agent aace3b9) being active.


Executive Summary

  1. The current viewer is a monolithic 3,065-line HTML file (568KB) generated from a 4,774-line Python module (172KB) with extensive features, most of which are unused or broken.

  2. Root cause of the "no evaluation running" issue: the viewer loads data from 48+ mock benchmark runs but labels them all "unknown - 0%", and it defaults to the most recent run, which had already completed (status: idle).

  3. The live Azure evaluation is NOT connected to this viewer: it runs in the openadapt-evals package with its own tracking, while this viewer polls benchmark_live.json, which has not been updated since January 9.

  4. Significant technical debt: inline HTML generation in Python, complex SSE/polling fallback logic, multiple overlapping panels, and tight coupling between viewer and server.

  5. Recommendation: deprecate and rewrite using the openadapt-evals viewer architecture (1,283 lines, focused, maintainable).


Current State Analysis

Files and Sizes

| File | Size | Lines | Purpose |
| --- | --- | --- | --- |
| openadapt_ml/training/benchmark_viewer.py | 172KB | 4,774 | Python generator with inline HTML/CSS/JS |
| training_output/current/benchmark.html | 568KB | 3,065 | Generated multi-run benchmark viewer |
| openadapt_evals/benchmarks/viewer.py | 43KB | 1,283 | Newer, focused viewer generator |

Complexity Metrics

benchmark_viewer.py (172KB):

  • Functions: 12 major functions
    • 6 panel generators (background tasks, live eval, Azure jobs, VM discovery, run benchmark)
    • 3 HTML generators (single run, multi-run, empty)
    • 3 utility functions
  • CSS Classes: 139 unique classes
  • JavaScript Functions: 40+ functions
  • Data Loading: SSE (EventSource) + polling fallback + stale connection detection

Generated benchmark.html (568KB):

  • Embedded Data: JSON data for 48+ benchmark runs inline
  • Polling Mechanisms:
    • SSE connection to /api/benchmark-sse?interval=2
    • Fallback polling to /api/benchmark-live every 2s
    • Background tasks polling to /api/tasks every 10s
    • Stale connection detection (60s timeout)
  • UI Panels: 5 major panels (Live Eval, Background Tasks, Azure Jobs, VM Discovery, Run Benchmark)
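
For context on the first mechanism: the server side of an SSE endpoint like /api/benchmark-sse can be very small. The sketch below is illustrative only (the actual handler was not reviewed for this report); it simply re-sends benchmark_live.json every two seconds in SSE framing:

# Illustrative SSE endpoint sketch; the real /api/benchmark-sse handler was
# not reviewed and may differ. Re-sends benchmark_live.json every 2 seconds.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

LIVE_JSON = Path("training_output/current/benchmark_live.json")

class BenchmarkSSEHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Cache-Control", "no-cache")
        self.end_headers()
        try:
            while True:
                # SSE frames are "data: <payload>\n\n"; payloads must be newline-free
                payload = LIVE_JSON.read_text().replace("\n", " ")
                self.wfile.write(f"data: {payload}\n\n".encode())
                self.wfile.flush()
                time.sleep(2)
        except BrokenPipeError:
            pass  # client disconnected

if __name__ == "__main__":
    HTTPServer(("localhost", 8765), BenchmarkSSEHandler).serve_forever()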

Feature Inventory

Features Implemented

| Feature | Status | Usage | Notes |
| --- | --- | --- | --- |
| Multi-run dropdown | Working | Used | Shows 48+ mock runs, all "unknown - 0%" |
| Live evaluation panel | Broken | Unused | Shows "no evaluation running" |
| Background tasks panel | Unknown | Unused | Polls /api/tasks endpoint |
| Azure jobs panel | Unknown | Unused | Not visible in current view |
| VM discovery panel | Unknown | Unused | Not visible in current view |
| Run benchmark panel | Unknown | Unused | Form to trigger new benchmarks |
| SSE connection | Broken | Unused | Tries to connect, falls back to polling |
| Task list with filters | Working | Partially | Can filter by domain/status |
| Step-by-step viewer | Working | Used | Shows screenshots, actions, reasoning |
| Playback controls | Working | Used | Prev/Next navigation |

Features Over-Engineered

  1. SSE + Polling + Stale Detection: Three layers of connection management for live updates, but live eval never works
  2. Multiple Panel System: 5 separate panels (Live Eval, Tasks, Azure, VM, Run) - most never visible
  3. Multi-run Support: Loads 48+ benchmark runs into single dropdown, but meaningful comparison is impossible
  4. Mock Data Banner: Warns about mock data, but ALL 48 runs show 0% - not helpful

Features Broken

  1. Live Evaluation Display: Shows "no evaluation running" despite active Azure eval
  2. Run Identification: All 48 mock runs show "unknown - 0%" - no useful metadata
  3. Server Connection: SSE fails, polling finds stale benchmark_live.json (last updated Jan 9)

Data Sources

Expected to load from:

  1. /api/benchmark-sse?interval=2 (SSE stream) → Fails, connection error
  2. /api/benchmark-live (polling fallback) → Returns old data from Jan 9
  3. /api/tasks (background tasks) → Unknown status
  4. benchmark_results/ directory → Loads 48+ mock runs from disk

Why isn't it showing the current evaluation?

The Azure evaluation running in openadapt-evals uses a different tracking system:

  • openadapt-evals: Uses LiveEvaluationTracker writing to live_eval_state.json
  • openadapt-ml viewer: Expects benchmark_live.json in training_output/current/
  • Result: No connection between the two systems

Current benchmark_live.json status:

{
  "status": "setup",
  "timestamp": "2026-01-09T12:12:53.252452",
  "tasks_completed": 0,
  "total_tasks": 0,
  "phase": "initializing",
  "detail": "Connecting to Azure VM..."
}

This file is 8 days old and stuck in the "setup" phase.
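
Any consumer of this file should at minimum check its age before presenting it as live status. A minimal sketch of such a guard (the current viewer performs no check like this, which is why an 8-day-old "setup" status renders as current):

# Staleness guard sketch: treat benchmark_live.json as "no evaluation
# running" when its timestamp is older than a few minutes.
import json
from datetime import datetime, timedelta
from pathlib import Path

def is_stale(live_json: Path, max_age: timedelta = timedelta(minutes=5)) -> bool:
    state = json.loads(live_json.read_text())
    age = datetime.now() - datetime.fromisoformat(state["timestamp"])
    return age > max_age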


Technical Debt Analysis

Issue 1: Inline HTML Generation in Python

Severity: High
Impact: UI changes require Python changes; HTML/CSS/JS are hard to test independently

def _generate_benchmark_viewer_html(
    metadata: dict,
    summary: dict,
    tasks: list[dict],
    benchmark_dir: Path,
    shared_header_css: str,
    shared_header_html: str,
) -> str:
    html = f"""<!DOCTYPE html>
<html lang="en">
<head>
    <style>
        .task-item {{ ... }}
        .task-header {{ ... }}
        /* 800+ lines of inline CSS */
    </style>
</head>
<body>
    <!-- 1000+ lines of inline HTML -->
    <script>
        // 500+ lines of inline JavaScript
    </script>
</body>
</html>
"""
    return html

Problem:

  • Can't use standard HTML/CSS/JS tooling (linters, formatters, live reload)
  • Changes require Python regeneration
  • Difficult to debug browser issues
  • No separation of concerns

Fix Complexity: Hard (requires architecture change)
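
For contrast, a template-based generator keeps markup in files that standard web tooling can lint, format, and live-reload. A minimal sketch, assuming Jinja2 (not a current dependency) and a hypothetical templates/ directory:

# Template-based alternative, sketched with Jinja2 (an assumption, not the
# current implementation). HTML/CSS/JS would live under templates/.
from pathlib import Path
from jinja2 import Environment, FileSystemLoader

def generate_viewer_html(metadata: dict, summary: dict, tasks: list[dict]) -> str:
    env = Environment(
        loader=FileSystemLoader(Path(__file__).parent / "templates"),
        autoescape=True,
    )
    # benchmark_viewer.html.j2 is hypothetical; it would hold the markup
    # currently embedded in the Python f-string.
    template = env.get_template("benchmark_viewer.html.j2")
    return template.render(metadata=metadata, summary=summary, tasks=tasks)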

Issue 2: Complex Multi-Run Support

Severity: Medium
Impact: UI becomes unusable with 48+ runs; dropdown doesn't provide meaningful comparison

Current behavior:

<select id="run-selector">
    <option value="0">unknown - 0% (waa-mock_eval_20260117_101209)</option>
    <option value="1">unknown - 0% (waa-mock_eval_20260117_101208)</option>
    <!-- ... 46 more identical entries ... -->
</select>

Problems:

  • No meaningful differentiation between runs
  • All show "unknown - 0%" - metadata not loaded properly
  • No side-by-side comparison (just switches active run)
  • Performance issues with 48+ runs loaded into memory

Fix Complexity: Medium (requires better metadata extraction + UI redesign)
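
The label half of the fix is mostly a matter of reading per-run metadata that already exists on disk. A sketch, assuming each run directory contains metadata.json and summary.json with the fields named below (the real schema may differ):

# Dropdown-label sketch; the metadata.json / summary.json field names are
# assumptions about the on-disk schema, not confirmed.
import json
from pathlib import Path

def run_label(run_dir: Path) -> str:
    """Build a human-readable label for one benchmark run directory."""
    meta = json.loads((run_dir / "metadata.json").read_text())
    summary = json.loads((run_dir / "summary.json").read_text())
    model = meta.get("model_id", "unknown")
    rate = summary.get("success_rate", 0.0)
    # e.g. "gpt-4o - 35% (waa-mock_eval_20260117_101209)" instead of "unknown - 0%"
    return f"{model} - {rate:.0%} ({run_dir.name})"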

Issue 3: Broken Live Evaluation Connection

Severity: Critical
Impact: Primary use case (monitoring live evaluations) doesn't work

Architecture mismatch:

openadapt-evals (running eval)
    └── LiveEvaluationTracker
        └── writes live_eval_state.json

openadapt-ml viewer (monitoring)
    └── Polls /api/benchmark-live
        └── Expects benchmark_live.json

Result: NO CONNECTION

Root causes:

  1. Different file names (live_eval_state.json vs benchmark_live.json)
  2. Different locations (eval runs in openadapt-evals context)
  3. No HTTP server bridging the two

Fix Complexity: Medium (need to unify tracking or create an API bridge; a bridge sketch appears under Immediate Actions below)

Issue 4: SSE Complexity with No Benefit

Severity: Low
Impact: Adds complexity but SSE connection never works, always falls back to polling

Code complexity:

class LiveEvaluationSSEClient {
    constructor() {
        this.eventSource = null;
        this.pollingInterval = null;
        this.staleCheckInterval = null;
        this.usePolling = false;
        this.reconnectAttempts = 0;
        this.maxReconnectAttempts = 5;
        this.reconnectDelay = 2000;
        this.lastHeartbeat = Date.now();
        // ... 200+ lines of connection management
    }
}

Reality: SSE connection never succeeds, immediately falls back to polling

Fix Complexity: Easy (remove SSE code, use polling only)

Issue 5: Unused Panel System

Severity: Low
Impact: Code maintains 5 panels but only 2-3 are ever visible

Panels:

  1. Live Evaluation Panel - Broken, shows "no evaluation running"
  2. Background Tasks Panel - Never visible
  3. Azure Jobs Panel - Never visible
  4. VM Discovery Panel - Never visible
  5. Run Benchmark Panel - Never visible

Generated code:

  • Each panel: ~300 lines CSS + 200 lines HTML + 100 lines JS
  • Total overhead: ~3,000 lines for unused features

Fix Complexity: Easy (remove unused panels)


Root Cause Analysis

Why Is This So Complex?

Historical context (inferred from code):

  1. Started simple: Single-run viewer showing one benchmark result
  2. Added multi-run: Dropdown to switch between runs (no comparison)
  3. Added live tracking: SSE + polling to show running evaluations
  4. Added Azure support: Panels for Azure jobs, VM discovery, background tasks
  5. Added run triggering: Form to start new benchmarks from UI

Result: Feature accretion without refactoring

Each feature was added by extending the monolithic HTML generator, not by modularizing.

Why Inline HTML Generation?

Theory: Quick prototyping turned permanent

# Easy to start:
def generate_viewer():
    return f"<html>...</html>"

# Hard to maintain:
def generate_viewer():
    return f"""
        <html>
            <head>
                <style>{_get_panel_css()}</style>
            </head>
            <body>
                {_get_shared_header()}
                {_get_live_eval_panel()}
                {_get_background_tasks_panel()}
                {_get_azure_jobs_panel()}
                <!-- ... 1000+ lines -->
            </body>
        </html>
    """

No clear migration path once HTML grows to 3,000+ lines.

Why Broken Live Connection?

Package split without coordination:

  1. openadapt-ml/benchmarks/ → openadapt-evals/ (code migrated)
  2. Live tracking stayed in openadapt-ml viewer
  3. openadapt-evals developed new LiveEvaluationTracker
  4. Result: Two separate tracking systems, incompatible

Comparison with Modern Approach

openadapt-evals/benchmarks/viewer.py

Size: 1,283 lines (vs 4,774)
Focus: Single purpose - benchmark results visualization
Architecture: Clean separation

def generate_benchmark_viewer(
    benchmark_dir: Path | str,
    output_path: Path | str | None = None,
) -> Path:
    # Load data
    metadata = load_benchmark_metadata(benchmark_dir)
    summary = load_benchmark_summary(benchmark_dir)
    tasks = load_task_results(benchmark_dir)

    # Generate HTML (still inline, but focused)
    html = _generate_benchmark_viewer_html(
        metadata=metadata,
        summary=summary,
        tasks=tasks,
        benchmark_dir=benchmark_dir,
    )

    # (resolution of a default output_path when None is elided here)
    output_path.write_text(html)
    return output_path
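
Usage is a single call (module path per the file table above; the run directory name is illustrative):

from pathlib import Path
from openadapt_evals.benchmarks.viewer import generate_benchmark_viewer

# Generate a static viewer for one completed run.
html_path = generate_benchmark_viewer(
    benchmark_dir=Path("benchmark_results/waa-mock_eval_20260117_101209"),
)
print(f"viewer written to {html_path}")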

Key improvements:

  1. Single run focus: No multi-run complexity
  2. No live tracking: Static viewer for completed runs
  3. Simpler data model: metadata.json + summary.json + tasks/
  4. Better metadata: Loads full task details, not "unknown - 0%"

Comparison table:

| Feature | openadapt-ml viewer | openadapt-evals viewer |
| --- | --- | --- |
| File size | 172KB (4,774 lines) | 43KB (1,283 lines) |
| Generated HTML | 568KB (3,065 lines) | ~100KB (~600 lines) |
| Multi-run support | Yes (48+ runs) | No (single run) |
| Live tracking | Broken (SSE + polling) | No (static only) |
| Extra panels | 5 panels | 0 panels |
| Data loading | 3 mechanisms | 1 mechanism (file loading) |
| Maintenance burden | High | Medium |

Issues Found (Categorized by Severity)

Critical

| Issue | Impact | Fix Complexity |
| --- | --- | --- |
| Live evaluation not connected | Primary use case broken | Medium (API bridge) |
| All runs show "unknown - 0%" | Cannot identify runs | Medium (metadata fix) |

High

| Issue | Impact | Fix Complexity |
| --- | --- | --- |
| Inline HTML in Python | Hard to maintain, test, debug | Hard (requires rewrite) |
| Monolithic 4,774-line module | Difficult to understand, modify | Hard (requires refactor) |

Medium

| Issue | Impact | Fix Complexity |
| --- | --- | --- |
| 48+ runs in dropdown | UI unusable, no comparison | Medium (redesign UI) |
| Stale benchmark_live.json | Shows 8-day-old "setup" status | Easy (delete or update) |
| SSE connection always fails | Complexity with no benefit | Easy (remove SSE) |

Low

| Issue | Impact | Fix Complexity |
| --- | --- | --- |
| 5 panels, 3 never visible | Code bloat | Easy (remove panels) |
| Mock data banner on all runs | Not helpful (all are mock) | Easy (remove or fix detection) |

User Experience Issues

Confusion

  1. "No evaluation running" - User expects to see Azure eval (aace3b9), sees nothing
  2. Mock data warning - Shows on all 48 runs, not actionable
  3. "unknown - 0%" - Every run looks identical, can't identify which is which
  4. 48 runs in dropdown - Overwhelming, no way to filter or compare

Missing Functionality

  1. No live Azure eval tracking - Must check Azure portal manually
  2. No run comparison - Can't compare two runs side-by-side
  3. No filtering - Can't filter runs by date, success rate, or model
  4. No search - Can't search tasks by ID or instruction

Performance

  1. 568KB HTML file - Slow to load, parse
  2. 48+ runs loaded in memory - Browser memory usage
  3. Multiple polling intervals - Battery drain, network overhead

Recommendation: Deprecate and Rewrite

Why Not Fix?

Effort to fix current viewer:

  1. Refactor inline HTML → separate files (2 days)
  2. Fix live evaluation connection (1 day)
  3. Fix metadata loading for runs (1 day)
  4. Remove unused panels (0.5 days)
  5. Simplify SSE/polling (0.5 days)
  6. Add run comparison UI (2 days)

Total: ~7 days to fix technical debt

Effort to rewrite based on openadapt-evals:

  1. Adapt openadapt-evals viewer (0.5 days)
  2. Add live tracking API bridge (1 day)
  3. Add multi-run comparison (optional) (1 day)
  4. Test and deploy (0.5 days)

Total: ~3 days for clean implementation

Rewrite Benefits

  1. Cleaner architecture: Build on proven openadapt-evals viewer
  2. Better maintainability: 1,283 lines vs 4,774 lines
  3. Proper separation: HTML/CSS/JS in separate files (future improvement)
  4. Unified tracking: Bridge to openadapt-evals LiveEvaluationTracker
  5. No technical debt: Fresh start without legacy complexity

Rewrite Risks

  1. Feature loss: Some multi-run features may need reimplementation
  2. Learning curve: Developers must learn new codebase
  3. Migration: Existing bookmarks/links need updating

Migration Path

Phase 1: Parallel deployment (Week 1)

  • Deploy new viewer at /benchmark-v2.html
  • Keep old viewer at /benchmark.html
  • Both accessible, users can compare

Phase 2: Feedback and iteration (Week 2)

  • Gather user feedback
  • Add missing features to new viewer
  • Fix any issues

Phase 3: Deprecation (Week 3)

  • Redirect /benchmark.html to /benchmark-v2.html
  • Add deprecation notice
  • Update documentation

Phase 4: Removal (Week 4)

  • Remove old viewer code
  • Clean up deprecated endpoints
  • Archive for reference

Immediate Actions (Short-term Fixes)

While planning rewrite, these quick fixes improve current state:

1. Fix "No Evaluation Running" Message

Problem: User runs Azure eval, viewer shows "no evaluation running"

Quick fix: Update benchmark_live.json manually during eval

# In openadapt-evals Azure runner, write status updates:
echo '{"status": "running", "current_task": {"task_id": "notepad_1", ...}}' > \
    /Users/abrichr/oa/src/openadapt-ml/training_output/current/benchmark_live.json

Fix Complexity: Easy (5 minutes)
Proper fix: Bridge LiveEvaluationTracker to benchmark_live.json (1 day)
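
A sketch of that bridge: poll the tracker's live_eval_state.json and mirror it into the file the viewer already polls. The evals-side field names are assumptions, not a confirmed schema; the output matches the benchmark_live.json format shown earlier:

# Bridge sketch: mirror openadapt-evals' live_eval_state.json into the
# benchmark_live.json this viewer polls. Evals-side field names are assumed;
# adjust to LiveEvaluationTracker's actual schema.
import json
import time
from pathlib import Path

EVALS_STATE = Path("live_eval_state.json")  # written by LiveEvaluationTracker
VIEWER_STATE = Path("training_output/current/benchmark_live.json")

def mirror_once() -> None:
    state = json.loads(EVALS_STATE.read_text())
    VIEWER_STATE.write_text(json.dumps({
        "status": state.get("status", "running"),
        "timestamp": state.get("timestamp"),
        "tasks_completed": state.get("tasks_completed", 0),
        "total_tasks": state.get("total_tasks", 0),
        "phase": state.get("phase", "running"),
        "detail": state.get("detail", ""),
    }, indent=2))

if __name__ == "__main__":
    while True:
        if EVALS_STATE.exists():
            mirror_once()
        time.sleep(2)  # match the viewer's 2s polling interval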

2. Fix "Unknown - 0%" Run Labels

Problem: All 48 runs show identical labels

Quick fix: Generate better metadata during mock runs

# In openadapt_evals/benchmarks/runner.py
from datetime import datetime  # needed for created_at below

metadata = {
    "run_name": f"{benchmark_name}_{timestamp}",  # benchmark_name/timestamp from the run context
    "model_id": agent.model_id or "unknown",  # read from the agent instead of hardcoding
    "created_at": datetime.now().isoformat(),
}

Fix Complexity: Easy (30 minutes)

3. Remove SSE Connection Code

Problem: Adds complexity, never works

Quick fix: Remove SSE client, use polling only

// Remove 200+ lines of SSE code
// Keep only:
async function fetchLiveEvaluationPolling() { ... }
setInterval(fetchLiveEvaluationPolling, 2000);

Fix Complexity: Easy (1 hour)


Long-term Vision

Unified Viewer Architecture

openadapt-viewer/ (new package)
├── src/
│   ├── components/
│   │   ├── TaskList.tsx       # Reusable task list
│   │   ├── StepViewer.tsx     # Step-by-step replay
│   │   ├── MetricsPanel.tsx   # Success rate, charts
│   │   └── Filters.tsx        # Domain, status filters
│   ├── pages/
│   │   ├── BenchmarkViewer.tsx  # Completed runs
│   │   ├── LiveTracker.tsx      # Running evaluations
│   │   └── Comparison.tsx       # Side-by-side runs
│   └── api/
│       ├── benchmark-api.ts   # Load benchmark data
│       └── live-api.ts        # Live eval updates
├── public/
│   └── benchmark.html         # Entry point
└── package.json               # React + Vite

Benefits:

  1. Component reuse: TaskList used in benchmark + live tracker
  2. Standard tooling: React DevTools, Vite HMR, TypeScript
  3. Easy testing: Jest + React Testing Library
  4. Real separation: HTML/CSS/JS in separate files
  5. Scalable: Add new views without modifying core

Timeline: 2-3 weeks for full React rewrite


Appendix: File Locations

| File | Path | Size |
| --- | --- | --- |
| Old viewer generator | /Users/abrichr/oa/src/openadapt-ml/openadapt_ml/training/benchmark_viewer.py | 172KB |
| Generated viewer | /Users/abrichr/oa/src/openadapt-ml/training_output/current/benchmark.html | 568KB |
| New viewer generator | /Users/abrichr/oa/src/openadapt-evals/openadapt_evals/benchmarks/viewer.py | 43KB |
| Live state (old) | /Users/abrichr/oa/src/openadapt-ml/training_output/current/benchmark_live.json | 180B |
| Benchmark results | /Users/abrichr/oa/src/openadapt-ml/benchmark_results/ | 48+ runs |

Conclusion

The current benchmark viewer represents a classic case of feature accretion without refactoring. What started as a simple single-run viewer grew to support multi-run comparison, live tracking, Azure integration, and benchmark triggering - all added as inline HTML generation in a monolithic Python module.

The viewer is not broken by bugs; it is broken by design. The architecture cannot support the intended use cases:

  1. Live evaluation tracking: Requires API bridge to openadapt-evals
  2. Multi-run comparison: Current UI can't handle 48+ runs meaningfully
  3. Maintainability: Inline HTML in a 4,774-line Python file is unmaintainable

Recommendation: Deprecate and rewrite based on the openadapt-evals viewer architecture. The ~3-day rewrite is a better investment than ~7 days of fixing technical debt, and it results in a cleaner, more maintainable codebase.

Immediate actions (if rewrite is delayed):

  1. Bridge LiveEvaluationTracker to benchmark_live.json (1 day)
  2. Fix metadata loading for runs (0.5 days)
  3. Remove SSE complexity (1 hour)

Long-term vision: React-based component architecture in new openadapt-viewer package, enabling reuse across benchmark, training, and capture viewers.