Skip to content

Latest commit

 

History

History
417 lines (327 loc) · 13.3 KB

File metadata and controls

417 lines (327 loc) · 13.3 KB

Benchmark Viewer Complexity Metrics

Quick reference for key findings from the benchmark viewer review.


Size Comparison

openadapt-ml/training/benchmark_viewer.py
├─ File size: 172KB
├─ Lines of code: 4,774
├─ Functions: 12
└─ Generated HTML: 568KB (3,065 lines)

openadapt-evals/benchmarks/viewer.py
├─ File size: 43KB
├─ Lines of code: 1,283
├─ Functions: 8
└─ Generated HTML: ~100KB (~600 lines)

Ratio: 3.7x larger (current vs modern)

Complexity Metrics

Current Viewer (openadapt-ml)

Metric Count Notes
CSS classes 139 Inline in Python strings
JavaScript functions 40+ Inline in Python f-strings
UI panels 5 Live Eval, Tasks, Azure, VM, Run Benchmark
Data loading mechanisms 3 SSE + polling + file loading
Polling intervals 3 SSE (2s), benchmark-live (2s), tasks (10s)
Lines of inline CSS ~800 Hard to lint or format
Lines of inline JavaScript ~500 No TypeScript, no testing
Embedded JSON data 48+ runs All loaded into memory

Modern Viewer (openadapt-evals)

Metric Count Notes
CSS classes ~50 Focused on task display
JavaScript functions ~15 Minimal interactivity
UI panels 0 Single-purpose viewer
Data loading mechanisms 1 File loading only
Polling intervals 0 Static viewer
Lines of inline CSS ~400 Still inline (room for improvement)
Lines of inline JavaScript ~200 Simple interactions
Embedded JSON data 1 run Per-run files

Feature Usage Analysis

Implemented vs Used vs Broken

Feature Status Visible Working Used
Multi-run dropdown
Task list with filters
Step-by-step viewer
Playback controls
Live evaluation panel
SSE connection
Background tasks panel
Azure jobs panel
VM discovery panel
Run benchmark panel
Mock data banner ❌ (not helpful)

Legend:

  • ✅ Yes
  • ❌ No
  • ❓ Unknown

Summary:

  • Implemented: 11 features
  • Visible: 5 features (45%)
  • Working: 4 features (36%)
  • Actually used: 4 features (36%)

Wasted effort: 64% of features never used


Load Time Analysis

Current Viewer

File size: 568KB
├─ HTML structure: ~50KB
├─ Inline CSS: ~100KB
├─ Inline JavaScript: ~50KB
├─ Embedded JSON (48 runs): ~368KB
└─ Base64 encoded images: 0KB (screenshots loaded separately)

Parse time: ~200ms (Chrome DevTools)
Memory usage: ~15MB (48 runs in memory)

Modern Viewer (estimated)

File size: ~100KB
├─ HTML structure: ~30KB
├─ Inline CSS: ~40KB
├─ Inline JavaScript: ~20KB
├─ Embedded JSON (1 run): ~10KB
└─ Base64 encoded images: 0KB (screenshots loaded separately)

Parse time: ~50ms (estimated)
Memory usage: ~2MB (1 run in memory)

Improvement: 5.7x smaller file, 4x faster parse, 7.5x less memory


Code Duplication

Panel Generators (openadapt-ml)

Each panel follows same pattern:

def _get_[panel_name]_panel_css() -> str:
    return """
        # 200-300 lines of CSS
    """

def _get_[panel_name]_panel_html() -> str:
    return """
        # 300-500 lines of HTML
    """

def _get_[panel_name]_panel_js(include_script_tags: bool = True) -> str:
    js_code = """
        # 100-200 lines of JavaScript
    """
    if include_script_tags:
        return f"<script>{js_code}</script>"
    return js_code

5 panels × 3 functions × ~300 lines = ~4,500 lines of repetitive code

Opportunity: Template-based or component-based architecture could reduce this to ~1,000 lines


Connection Management Complexity

Current SSE + Polling Implementation

class LiveEvaluationSSEClient {
    constructor() {
        this.eventSource = null;           // SSE connection
        this.pollingInterval = null;       // Polling timer
        this.staleCheckInterval = null;    // Stale detection timer
        this.usePolling = false;           // Fallback flag
        this.reconnectAttempts = 0;        // Reconnect counter
        this.maxReconnectAttempts = 5;     // Max reconnects
        this.reconnectDelay = 2000;        // Reconnect delay
        this.lastHeartbeat = Date.now();   // Heartbeat timestamp
        this.state = { ... };              // Shared state
    }

    connect() { /* 50 lines */ }
    handleStatusEvent(data) { /* 20 lines */ }
    handleProgressEvent(data) { /* 20 lines */ }
    handleTaskCompleteEvent(data) { /* 20 lines */ }
    handleConnectionError() { /* 30 lines */ }
    reconnect() { /* 20 lines */ }
    startPolling() { /* 20 lines */ }
    clearAllIntervals() { /* 10 lines */ }
    updateConnectionStatus(status) { /* 10 lines */ }
    updateTimestamp() { /* 5 lines */ }
}

// Total: ~200 lines for connection management

Simplified Polling-Only Implementation

async function fetchLiveEvaluation() {
    try {
        const response = await fetch('/api/benchmark-live?' + Date.now());
        if (response.ok) {
            const state = await response.json();
            renderLiveEvaluation(state);
        }
    } catch (e) {
        console.log('Live evaluation API unavailable');
    }
}

setInterval(fetchLiveEvaluation, 2000);

// Total: ~15 lines

Reduction: 93% less code (200 lines → 15 lines)


Data Flow Diagram

Current (Broken)

┌─────────────────────────────────────────────────────────────┐
│ Azure Evaluation (aace3b9) - Running in openadapt-evals    │
└───────────────────┬─────────────────────────────────────────┘
                    │
                    ↓
        ┌───────────────────────┐
        │ LiveEvaluationTracker │
        └───────────┬─────────────┘
                    │
                    ↓ writes to
         ┌─────────────────────┐
         │ live_eval_state.json│
         └─────────────────────┘
                    │
                    ↓
              NOT CONNECTED
                    ↑
         ┌─────────────────────┐
         │ benchmark_live.json │ ← stale (Jan 9)
         └───────────┬─────────┘
                    ↑ polls
        ┌───────────────────────┐
        │ HTTP Server (port 8765)│
        │ /api/benchmark-live    │
        └───────────┬───────────┘
                    ↑ polls every 2s
┌───────────────────────────────────────────────────────────┐
│ Browser: benchmark.html                                   │
│ Shows: "no evaluation running"                            │
└───────────────────────────────────────────────────────────┘

Proposed (Fixed)

┌─────────────────────────────────────────────────────────────┐
│ Azure Evaluation (aace3b9) - Running in openadapt-evals    │
└───────────────────┬─────────────────────────────────────────┘
                    │
                    ↓
        ┌───────────────────────┐
        │ LiveEvaluationTracker │
        └───────────┬─────────────┘
                    │
                    ↓ writes to BOTH
         ┌─────────────────────────────────────┐
         │  live_eval_state.json                │
         │         +                            │
         │  benchmark_live.json (symlink/copy)  │
         └─────────────┬───────────────────────┘
                      ↑ polls
        ┌───────────────────────┐
        │ HTTP Server (port 8765)│
        │ /api/benchmark-live    │
        └───────────┬───────────┘
                    ↑ polls every 2s
┌───────────────────────────────────────────────────────────┐
│ Browser: benchmark.html                                   │
│ Shows: Live evaluation progress ✓                         │
└───────────────────────────────────────────────────────────┘

Technical Debt Score

Using a simple scoring system:

Category Score (1-10) Weight Weighted Score
Code size 9 2 18
Complexity 8 3 24
Maintainability 9 3 27
Testability 9 2 18
Documentation 5 1 5
Performance 7 1 7

Total Debt Score: 99 / 120 (83%)

Interpretation:

  • 0-30: Healthy codebase
  • 31-60: Some technical debt
  • 61-90: Significant technical debt
  • 91-120: Critical technical debt (rewrite recommended)

Conclusion: Score of 99/120 indicates critical technical debt requiring rewrite


Refactor vs Rewrite Decision Matrix

Factor Refactor Rewrite Winner
Time to completion 7 days 3 days Rewrite
Risk of new bugs Low Medium Refactor
Final code quality Medium High Rewrite
Learning curve Low Medium Refactor
Long-term maintenance Medium High Rewrite
Feature parity High Medium Refactor
Architecture improvement Low High Rewrite
Team velocity impact Medium Low Rewrite

Score: Rewrite 5, Refactor 3

Recommendation: Rewrite (better ROI, cleaner result)


Migration Checklist

Phase 1: Parallel Deployment (Week 1)

  • Adapt openadapt-evals viewer to openadapt-ml structure
  • Add API bridge: LiveEvaluationTracker → benchmark_live.json
  • Deploy new viewer at /benchmark-v2.html
  • Test with current benchmark runs
  • Verify live tracking works with Azure eval
  • Document differences from old viewer

Phase 2: Feedback and Iteration (Week 2)

  • Gather user feedback (internal testing)
  • Identify critical missing features
  • Add multi-run comparison (if needed)
  • Fix any issues or bugs
  • Performance testing with 48+ runs
  • Documentation updates

Phase 3: Deprecation (Week 3)

  • Add deprecation banner to old viewer
  • Redirect /benchmark.html/benchmark-v2.html
  • Update all documentation links
  • Update README and quick start guides
  • Notify users via changelog

Phase 4: Removal (Week 4)

  • Remove old viewer code from repository
  • Clean up deprecated API endpoints
  • Archive old viewer for reference
  • Update tests to use new viewer
  • Final documentation cleanup

Quick Reference: File Locations

Current (openadapt-ml)

/Users/abrichr/oa/src/openadapt-ml/
├── openadapt_ml/training/benchmark_viewer.py (172KB)
├── training_output/current/benchmark.html (568KB)
└── training_output/current/benchmark_live.json (180B, stale)

Modern (openadapt-evals)

/Users/abrichr/oa/src/openadapt-evals/
├── openadapt_evals/benchmarks/viewer.py (43KB)
└── benchmark_results/
    └── waa-mock_eval_20260117_101209/
        ├── metadata.json
        ├── summary.json
        └── tasks/
            ├── browser_1/
            ├── coding_1/
            └── office_1/

Future (openadapt-viewer)

/Users/abrichr/oa/src/openadapt-viewer/
├── src/
│   ├── components/
│   ├── pages/
│   └── api/
├── public/
│   └── benchmark.html
└── package.json

Summary

Current state: 4,774-line monolithic Python generator creating 568KB HTML with 64% unused features

Recommendation: Rewrite based on openadapt-evals architecture (3 days vs 7 days to fix)

Key metrics:

  • 3.7x size reduction
  • 93% less connection management code
  • 5.7x smaller HTML files
  • 83% technical debt score (critical)

Next steps: See BENCHMARK_VIEWER_REVIEW.md for full analysis and migration path