Date: January 17, 2026
Reviewer: Claude Code
Context: User reports benchmark viewer at http://localhost:8765/benchmark.html shows "no evaluation running" and mock data warning despite Azure evaluation (agent aace3b9) being active.
-
Current viewer is a monolithic 3,065-line HTML file (568KB) generated from 4,774-line Python module (172KB) with extensive features, most of which are unused or broken.
-
Root cause of "no evaluation running" issue: Viewer loads data from 48+ mock benchmark runs but shows "unknown - 0%" for all, displaying only the most recent run which was completed (status: idle).
-
The live Azure evaluation is NOT connected to this viewer - it's running in openadapt-evals package with separate tracking, while this viewer polls for
benchmark_live.jsonwhich hasn't been updated since January 9. -
Significant technical debt: Inline HTML generation in Python, complex SSE/polling fallback logic, multiple overlapping panels, tight coupling between viewer and server.
-
Recommendation: Deprecate and rewrite using openadapt-evals viewer architecture (1,283 lines, focused, maintainable).
| File | Size | Lines | Purpose |
|---|---|---|---|
openadapt_ml/training/benchmark_viewer.py |
172KB | 4,774 | Python generator with inline HTML/CSS/JS |
training_output/current/benchmark.html |
568KB | 3,065 | Generated multi-run benchmark viewer |
openadapt_evals/benchmarks/viewer.py |
43KB | 1,283 | Newer, focused viewer generator |
benchmark_viewer.py (172KB):
- Functions: 12 major functions
- 6 panel generators (background tasks, live eval, Azure jobs, VM discovery, run benchmark)
- 3 HTML generators (single run, multi-run, empty)
- 3 utility functions
- CSS Classes: 139 unique classes
- JavaScript Functions: 40+ functions
- Data Loading: SSE (EventSource) + polling fallback + stale connection detection
Generated benchmark.html (568KB):
- Embedded Data: JSON data for 48+ benchmark runs inline
- Polling Mechanisms:
- SSE connection to
/api/benchmark-sse?interval=2 - Fallback polling to
/api/benchmark-liveevery 2s - Background tasks polling to
/api/tasksevery 10s - Stale connection detection (60s timeout)
- SSE connection to
- UI Panels: 5 major panels (Live Eval, Background Tasks, Azure Jobs, VM Discovery, Run Benchmark)
| Feature | Status | Usage | Notes |
|---|---|---|---|
| Multi-run dropdown | Working | Used | Shows 48+ mock runs, all "unknown - 0%" |
| Live evaluation panel | Broken | Unused | Shows "no evaluation running" |
| Background tasks panel | Unknown | Unused | Polls /api/tasks endpoint |
| Azure jobs panel | Unknown | Unused | Not visible in current view |
| VM discovery panel | Unknown | Unused | Not visible in current view |
| Run benchmark panel | Unknown | Unused | Form to trigger new benchmarks |
| SSE connection | Broken | Unused | Tries to connect, falls back to polling |
| Task list with filters | Working | Partially | Can filter by domain/status |
| Step-by-step viewer | Working | Used | Shows screenshots, actions, reasoning |
| Playback controls | Working | Used | Prev/Next navigation |
- SSE + Polling + Stale Detection: Three layers of connection management for live updates, but live eval never works
- Multiple Panel System: 5 separate panels (Live Eval, Tasks, Azure, VM, Run) - most never visible
- Multi-run Support: Loads 48+ benchmark runs into single dropdown, but meaningful comparison is impossible
- Mock Data Banner: Warns about mock data, but ALL 48 runs show 0% - not helpful
- Live Evaluation Display: Shows "no evaluation running" despite active Azure eval
- Run Identification: All 48 mock runs show "unknown - 0%" - no useful metadata
- Server Connection: SSE fails, polling finds stale
benchmark_live.json(last updated Jan 9)
Expected to load from:
/api/benchmark-sse?interval=2(SSE stream) → Fails, connection error/api/benchmark-live(polling fallback) → Returns old data from Jan 9/api/tasks(background tasks) → Unknown statusbenchmark_results/directory → Loads 48+ mock runs from disk
Why isn't it showing the current evaluation?
The Azure evaluation running in openadapt-evals uses a different tracking system:
- openadapt-evals: Uses
LiveEvaluationTrackerwriting tolive_eval_state.json - openadapt-ml viewer: Expects
benchmark_live.jsonintraining_output/current/ - Result: No connection between the two systems
Current benchmark_live.json status:
{
"status": "setup",
"timestamp": "2026-01-09T12:12:53.252452",
"tasks_completed": 0,
"total_tasks": 0,
"phase": "initializing",
"detail": "Connecting to Azure VM..."
}This is 8 days old and stuck in "setup" phase.
Severity: High Impact: Makes UI changes require Python changes, hard to test HTML/CSS/JS independently
def _generate_benchmark_viewer_html(
metadata: dict,
summary: dict,
tasks: list[dict],
benchmark_dir: Path,
shared_header_css: str,
shared_header_html: str,
) -> str:
html = f"""<!DOCTYPE html>
<html lang="en">
<head>
<style>
.task-item {{ ... }}
.task-header {{ ... }}
# 800+ lines of inline CSS
</style>
</head>
<body>
# 1000+ lines of inline HTML
<script>
// 500+ lines of inline JavaScript
</script>
</body>
</html>
"""
return htmlProblem:
- Can't use standard HTML/CSS/JS tooling (linters, formatters, live reload)
- Changes require Python regeneration
- Difficult to debug browser issues
- No separation of concerns
Fix Complexity: Hard (requires architecture change)
Severity: Medium Impact: UI becomes unusable with 48+ runs, dropdown doesn't provide meaningful comparison
Current behavior:
<select id="run-selector">
<option value="0">unknown - 0% (waa-mock_eval_20260117_101209)</option>
<option value="1">unknown - 0% (waa-mock_eval_20260117_101208)</option>
<!-- ... 46 more identical entries ... -->
</select>Problems:
- No meaningful differentiation between runs
- All show "unknown - 0%" - metadata not loaded properly
- No side-by-side comparison (just switches active run)
- Performance issues with 48+ runs loaded into memory
Fix Complexity: Medium (requires better metadata extraction + UI redesign)
Severity: Critical Impact: Primary use case (monitoring live evaluations) doesn't work
Architecture mismatch:
openadapt-evals (running eval)
└── LiveEvaluationTracker
└── writes live_eval_state.json
openadapt-ml viewer (monitoring)
└── Polls /api/benchmark-live
└── Expects benchmark_live.json
Result: NO CONNECTION
Root causes:
- Different file names (
live_eval_state.jsonvsbenchmark_live.json) - Different locations (eval runs in openadapt-evals context)
- No HTTP server bridging the two
Fix Complexity: Medium (need to unify tracking or create API bridge)
Severity: Low Impact: Adds complexity but SSE connection never works, always falls back to polling
Code complexity:
class LiveEvaluationSSEClient {
constructor() {
this.eventSource = null;
this.pollingInterval = null;
this.staleCheckInterval = null;
this.usePolling = false;
this.reconnectAttempts = 0;
this.maxReconnectAttempts = 5;
this.reconnectDelay = 2000;
this.lastHeartbeat = Date.now();
// ... 200+ lines of connection management
}
}Reality: SSE connection never succeeds, immediately falls back to polling
Fix Complexity: Easy (remove SSE code, use polling only)
Severity: Low Impact: Code maintains 5 panels but only 2-3 are ever visible
Panels:
- Live Evaluation Panel - Broken, shows "no evaluation running"
- Background Tasks Panel - Never visible
- Azure Jobs Panel - Never visible
- VM Discovery Panel - Never visible
- Run Benchmark Panel - Never visible
Generated code:
- Each panel: ~300 lines CSS + 200 lines HTML + 100 lines JS
- Total overhead: ~3,000 lines for unused features
Fix Complexity: Easy (remove unused panels)
Historical context (inferred from code):
- Started simple: Single-run viewer showing one benchmark result
- Added multi-run: Dropdown to switch between runs (no comparison)
- Added live tracking: SSE + polling to show running evaluations
- Added Azure support: Panels for Azure jobs, VM discovery, background tasks
- Added run triggering: Form to start new benchmarks from UI
Result: Feature accretion without refactoring
Each feature was added by extending the monolithic HTML generator, not by modularizing.
Theory: Quick prototyping turned permanent
# Easy to start:
def generate_viewer():
return f"<html>...</html>"
# Hard to maintain:
def generate_viewer():
return f"""
<html>
<head>
<style>{_get_panel_css()}</style>
</head>
<body>
{_get_shared_header()}
{_get_live_eval_panel()}
{_get_background_tasks_panel()}
{_get_azure_jobs_panel()}
# ... 1000+ lines
</body>
</html>
"""No clear migration path once HTML grows to 3,000+ lines.
Package split without coordination:
openadapt-ml/benchmarks/→openadapt-evals/(code migrated)- Live tracking stayed in openadapt-ml viewer
- openadapt-evals developed new LiveEvaluationTracker
- Result: Two separate tracking systems, incompatible
Size: 1,283 lines (vs 4,774) Focus: Single purpose - benchmark results visualization Architecture: Clean separation
def generate_benchmark_viewer(
benchmark_dir: Path | str,
output_path: Path | str | None = None,
) -> Path:
# Load data
metadata = load_benchmark_metadata(benchmark_dir)
summary = load_benchmark_summary(benchmark_dir)
tasks = load_task_results(benchmark_dir)
# Generate HTML (still inline, but focused)
html = _generate_benchmark_viewer_html(
metadata=metadata,
summary=summary,
tasks=tasks,
benchmark_dir=benchmark_dir,
)
output_path.write_text(html)
return output_pathKey improvements:
- Single run focus: No multi-run complexity
- No live tracking: Static viewer for completed runs
- Simpler data model: metadata.json + summary.json + tasks/
- Better metadata: Loads full task details, not "unknown - 0%"
Comparison table:
| Feature | openadapt-ml viewer | openadapt-evals viewer |
|---|---|---|
| File size | 172KB (4,774 lines) | 43KB (1,283 lines) |
| Generated HTML | 568KB (3,065 lines) | ~100KB (~600 lines) |
| Multi-run support | Yes (48+ runs) | No (single run) |
| Live tracking | Broken (SSE + polling) | No (static only) |
| Extra panels | 5 panels | 0 panels |
| Data loading | 3 mechanisms | 1 mechanism (file loading) |
| Maintenance burden | High | Medium |
| Issue | Impact | Fix Complexity |
|---|---|---|
| Live evaluation not connected | Primary use case broken | Medium (API bridge) |
| All runs show "unknown - 0%" | Cannot identify runs | Medium (metadata fix) |
| Issue | Impact | Fix Complexity |
|---|---|---|
| Inline HTML in Python | Hard to maintain, test, debug | Hard (requires rewrite) |
| Monolithic 4,774-line module | Difficult to understand, modify | Hard (requires refactor) |
| Issue | Impact | Fix Complexity |
|---|---|---|
| 48+ runs in dropdown | UI unusable, no comparison | Medium (redesign UI) |
| Stale benchmark_live.json | Shows 8-day-old "setup" status | Easy (delete or update) |
| SSE connection always fails | Complexity with no benefit | Easy (remove SSE) |
| Issue | Impact | Fix Complexity |
|---|---|---|
| 5 panels, 3 never visible | Code bloat | Easy (remove panels) |
| Mock data banner on all runs | Not helpful (all are mock) | Easy (remove or fix detection) |
- "No evaluation running" - User expects to see Azure eval (aace3b9), sees nothing
- Mock data warning - Shows on all 48 runs, not actionable
- "unknown - 0%" - Every run looks identical, can't identify which is which
- 48 runs in dropdown - Overwhelming, no way to filter or compare
- No live Azure eval tracking - Must check Azure portal manually
- No run comparison - Can't compare two runs side-by-side
- No filtering - Can't filter runs by date, success rate, or model
- No search - Can't search tasks by ID or instruction
- 568KB HTML file - Slow to load, parse
- 48+ runs loaded in memory - Browser memory usage
- Multiple polling intervals - Battery drain, network overhead
Effort to fix current viewer:
- Refactor inline HTML → separate files (2 days)
- Fix live evaluation connection (1 day)
- Fix metadata loading for runs (1 day)
- Remove unused panels (0.5 days)
- Simplify SSE/polling (0.5 days)
- Add run comparison UI (2 days)
Total: ~7 days to fix technical debt
Effort to rewrite based on openadapt-evals:
- Adapt openadapt-evals viewer (0.5 days)
- Add live tracking API bridge (1 day)
- Add multi-run comparison (optional) (1 day)
- Test and deploy (0.5 days)
Total: ~3 days for clean implementation
- Cleaner architecture: Build on proven openadapt-evals viewer
- Better maintainability: 1,283 lines vs 4,774 lines
- Proper separation: HTML/CSS/JS in separate files (future improvement)
- Unified tracking: Bridge to openadapt-evals LiveEvaluationTracker
- No technical debt: Fresh start without legacy complexity
- Feature loss: Some multi-run features may need reimplementation
- Learning curve: Developers must learn new codebase
- Migration: Existing bookmarks/links need updating
Phase 1: Parallel deployment (Week 1)
- Deploy new viewer at
/benchmark-v2.html - Keep old viewer at
/benchmark.html - Both accessible, users can compare
Phase 2: Feedback and iteration (Week 2)
- Gather user feedback
- Add missing features to new viewer
- Fix any issues
Phase 3: Deprecation (Week 3)
- Redirect
/benchmark.htmlto/benchmark-v2.html - Add deprecation notice
- Update documentation
Phase 4: Removal (Week 4)
- Remove old viewer code
- Clean up deprecated endpoints
- Archive for reference
While planning rewrite, these quick fixes improve current state:
Problem: User runs Azure eval, viewer shows "no evaluation running"
Quick fix: Update benchmark_live.json manually during eval
# In openadapt-evals Azure runner, write status updates:
echo '{"status": "running", "current_task": {"task_id": "notepad_1", ...}}' > \
/Users/abrichr/oa/src/openadapt-ml/training_output/current/benchmark_live.jsonFix Complexity: Easy (5 minutes) Proper fix: Bridge LiveEvaluationTracker to benchmark_live.json (1 day)
Problem: All 48 runs show identical labels
Quick fix: Generate better metadata during mock runs
# In openadapt_evals/benchmarks/runner.py
metadata = {
"run_name": f"{benchmark_name}_{timestamp}",
"model_id": agent.model_id or "unknown", # Get from agent
"created_at": datetime.now().isoformat(),
}Fix Complexity: Easy (30 minutes)
Problem: Adds complexity, never works
Quick fix: Remove SSE client, use polling only
// Remove 200+ lines of SSE code
// Keep only:
async function fetchLiveEvaluationPolling() { ... }
setInterval(fetchLiveEvaluationPolling, 2000);Fix Complexity: Easy (1 hour)
openadapt-viewer/ (new package)
├── src/
│ ├── components/
│ │ ├── TaskList.tsx # Reusable task list
│ │ ├── StepViewer.tsx # Step-by-step replay
│ │ ├── MetricsPanel.tsx # Success rate, charts
│ │ └── Filters.tsx # Domain, status filters
│ ├── pages/
│ │ ├── BenchmarkViewer.tsx # Completed runs
│ │ ├── LiveTracker.tsx # Running evaluations
│ │ └── Comparison.tsx # Side-by-side runs
│ └── api/
│ ├── benchmark-api.ts # Load benchmark data
│ └── live-api.ts # Live eval updates
├── public/
│ └── benchmark.html # Entry point
└── package.json # React + Vite
Benefits:
- Component reuse: TaskList used in benchmark + live tracker
- Standard tooling: React DevTools, Vite HMR, TypeScript
- Easy testing: Jest + React Testing Library
- Real separation: HTML/CSS/JS in separate files
- Scalable: Add new views without modifying core
Timeline: 2-3 weeks for full React rewrite
| File | Path | Size |
|---|---|---|
| Old viewer generator | /Users/abrichr/oa/src/openadapt-ml/openadapt_ml/training/benchmark_viewer.py |
172KB |
| Generated viewer | /Users/abrichr/oa/src/openadapt-ml/training_output/current/benchmark.html |
568KB |
| New viewer generator | /Users/abrichr/oa/src/openadapt-evals/openadapt_evals/benchmarks/viewer.py |
43KB |
| Live state (old) | /Users/abrichr/oa/src/openadapt-ml/training_output/current/benchmark_live.json |
180B |
| Benchmark results | /Users/abrichr/oa/src/openadapt-ml/benchmark_results/ |
48+ runs |
The current benchmark viewer represents a classic case of feature accretion without refactoring. What started as a simple single-run viewer grew to support multi-run comparison, live tracking, Azure integration, and benchmark triggering - all added as inline HTML generation in a monolithic Python module.
The viewer is not broken by bugs, it's broken by design. The architecture cannot support the intended use cases:
- Live evaluation tracking: Requires API bridge to openadapt-evals
- Multi-run comparison: Current UI can't handle 48+ runs meaningfully
- Maintainability: Inline HTML in 4,774-line Python file is unmaintainable
Recommendation: Deprecate and rewrite based on openadapt-evals viewer architecture. The 3-day rewrite investment is better than 7 days fixing technical debt, and results in a cleaner, more maintainable codebase.
Immediate actions (if rewrite is delayed):
- Bridge LiveEvaluationTracker to benchmark_live.json (1 day)
- Fix metadata loading for runs (0.5 days)
- Remove SSE complexity (1 hour)
Long-term vision: React-based component architecture in new openadapt-viewer package, enabling reuse across benchmark, training, and capture viewers.