Date: 2026-01-17 Priority: HIGH (P0) Status: ✓ RESOLVED
User Report: "The benchmark viewer is displaying synthetic/assumed descriptions instead of REAL data from the capture."
Example: User sees "Click System Settings icon in dock" and believes this is made-up/synthetic data.
Root Cause: Misconception about data provenance - the description is real ML-generated data from GPT-4o analysis of the recording, not synthetic/invented data. However, the viewer did not clearly label this provenance, leading to confusion.
-
Source Data is 100% Real
capture.db: 1,561 raw hardware events (mouse, keyboard, screen)episodes.json: ML-generated semantic episodes from GPT-4o- Screenshots: 457 frames from actual recording
-
Descriptions ARE from Real Data
- "Click System Settings icon in dock" appears in
/Users/abrichr/oa/src/openadapt-capture/turn-off-nightshift/episodes.jsonline 18 - Generated by GPT-4o (specified in line 99:
"llm_model": "gpt-4o") - Based on analysis of actual screenshots + mouse events
- Confidence: 0.92 (92% confidence in segmentation)
- "Click System Settings icon in dock" appears in
-
No Synthetic Data Generation
real_data_loader.pyreads episodes.json verbatim (no invention)- Viewer displays episode data as-is (no transformation)
- Sample data function
create_sample_data()NOT called whenuse_real_data=True
Not a data fidelity issue, but a data provenance labeling issue.
The viewer was showing real ML-generated data but failing to indicate:
- WHERE it came from (episodes.json)
- HOW it was created (GPT-4o inference)
- CONFIDENCE level (92%)
- DISTINCTION between raw events vs ML interpretations
File: DATA_PIPELINE_ANALYSIS.md
Documents the three data layers:
- Layer 1: Raw Events (capture.db) - hardware-level events
- Layer 2: ML-Generated Episodes (episodes.json) - semantic descriptions
- Layer 3: Viewer Data (BenchmarkRun) - UI-ready format
Shows complete data flow from capture → segmentation → viewer.
File: DATA_FIDELITY_POLICY.md
Establishes formal guidelines:
- NEVER invent data (use actual values from source)
- ALWAYS label provenance (RAW, ML-INFERRED, HUMAN-LABELED, DERIVED)
- Distinguish source vs content (where from vs what it says)
- When in doubt, show raw (default to hardware events if uncertain)
Includes code examples, violation examples, and testing requirements.
Changes:
# BEFORE
action_type="real_action" # Ambiguous - what kind of "real"?
action_details={
"description": step_text,
# Missing provenance metadata
}
# AFTER
action_type="ml_inferred" # Clear provenance
action_details={
"description": step_text,
"provenance": "ml_inferred",
"source": "episodes.json",
"model": "gpt-4o",
"confidence": 0.92,
"processing_timestamp": "2026-01-17T12:00:00.000000",
}Changes:
<!-- BEFORE: No indication of provenance -->
<span>Click System Settings icon in dock</span>
<!-- AFTER: Clear ML-INFERRED badge with tooltip -->
<span class="oa-badge oa-badge-ml"
title="Generated by gpt-4o with 92% confidence">
ML-INFERRED
</span>
<span>Click System Settings icon in dock</span><details class="oa-metadata-details">
<summary>View Provenance & Metadata</summary>
<div class="oa-metadata">
<div class="oa-metadata-item">
<span class="oa-label">Model:</span>
<span class="oa-value">gpt-4o</span>
</div>
<div class="oa-metadata-item">
<span class="oa-label">Confidence:</span>
<span class="oa-value">92.0%</span>
</div>
<div class="oa-metadata-item">
<span class="oa-label">Source:</span>
<span class="oa-value">episodes.json</span>
</div>
<div class="oa-metadata-item">
<span class="oa-label">Episode:</span>
<span class="oa-value">Navigate to System Settings</span>
</div>
</div>
</details>.oa-badge-ml {
background: var(--oa-accent-dim);
color: var(--oa-accent);
border: 1px solid var(--oa-accent);
}
.oa-metadata {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
gap: 12px;
}File: regenerate_viewer_with_provenance.py
One-command script to regenerate the viewer with new provenance labels:
python regenerate_viewer_with_provenance.pyOutputs: benchmark_viewer_with_provenance.html
Action: REAL_ACTION
Details: {"description": "Click System Settings icon in dock", ...}
User thinks: "Where did this description come from? Did you make it up?"
[ML-INFERRED] Click System Settings icon in dock
↑ Hover shows: "Generated by gpt-4o with 92% confidence"
▸ View Provenance & Metadata
Model: gpt-4o
Confidence: 92.0%
Source: episodes.json
Episode: Navigate to System Settings
Frame Index: 0
User understands: "This is GPT-4o's interpretation of the recording with 92% confidence."
- ✓
DATA_PIPELINE_ANALYSIS.md- Complete data flow analysis - ✓
DATA_FIDELITY_POLICY.md- Formal policy and guidelines - ✓
DATA_FIDELITY_RESOLUTION.md- This document
- ✓
src/openadapt_viewer/viewers/benchmark/real_data_loader.py- Added provenance metadata - ✓
src/openadapt_viewer/viewers/benchmark/generator.py- Added provenance UI
- ✓
regenerate_viewer_with_provenance.py- Regeneration script
-
Run regeneration script:
cd /Users/abrichr/oa/src/openadapt-viewer python regenerate_viewer_with_provenance.py -
Open generated viewer:
open benchmark_viewer_with_provenance.html
-
Check for provenance labels:
- Each step should show "ML-INFERRED" badge
- Hover over badge shows "Generated by gpt-4o with 92% confidence"
- Click "View Provenance & Metadata" shows full metadata
-
Verify data source:
- Confirm descriptions match
turn-off-nightshift/episodes.json - Verify model name is "gpt-4o"
- Check confidence is 0.92
- Confirm descriptions match
ML-Generated: Real data produced by analyzing actual recordings
- Example: GPT-4o looking at screenshots and inferring "Click Settings icon"
- Provenance: episodes.json (from real recording analysis)
- Status: Real data at semantic level
Synthetic: Fake data invented for demos/tests
- Example:
create_sample_data()function output - Provenance: Python code (not from recording)
- Status: Fake data, test-only
The nightshift descriptions are ML-GENERATED, not SYNTHETIC.
Source (WHERE): File path, database, API Content (WHAT): Actual values and descriptions
Both can be "real":
- Real source + Real content = ✓ nightshift episodes.json
- Real source + Fake content = Sample data in test file
- Fake source + Fake content = Hardcoded demo data
Users need to know:
- What they're seeing (description)
- Where it came from (episodes.json)
- How it was created (GPT-4o analysis)
- Confidence in the data (92%)
Without labels, even real data looks suspicious.
-
Always show provenance badges:
[RAW]for hardware events[ML-INFERRED]for ML-generated descriptions[HUMAN-LABELED]for human annotations[DERIVED]for calculated values
-
Include expandable metadata:
- Model name and version
- Confidence scores
- Source file/table
- Timestamp
-
Provide raw event access:
- Show raw mouse coordinates alongside ML interpretation
- Link to original screenshot
- Display timestamp and event type
-
Follow DATA_FIDELITY_POLICY.md:
- Never invent data
- Always label provenance
- When in doubt, show raw
- Generate viewer with
regenerate_viewer_with_provenance.py - Open in browser
- Select a task (e.g., "episode_001")
- Navigate to first step
- Verify "ML-INFERRED" badge is visible
- Hover over badge, verify tooltip shows model + confidence
- Click "View Provenance & Metadata"
- Verify metadata shows:
- Model: gpt-4o
- Confidence: 92.0%
- Source: episodes.json
- Episode: Navigate to System Settings
- Frame Index: 0
- Compare description to episodes.json line 18
- Verify they match exactly
Create tests to verify:
def test_provenance_labels_present():
"""Verify all steps have provenance labels."""
viewer = load_viewer("benchmark_viewer.html")
for step in viewer.steps:
assert "provenance" in step.action_details
assert step.action_details["provenance"] in ["raw", "ml_inferred", "human_labeled"]
def test_ml_metadata_complete():
"""Verify ML-inferred data includes model and confidence."""
viewer = load_viewer("benchmark_viewer.html")
for step in viewer.steps:
if step.action_details["provenance"] == "ml_inferred":
assert "model" in step.action_details
assert "confidence" in step.action_details
assert 0.0 <= step.action_details["confidence"] <= 1.0✓ RESOLVED
- Investigation complete
- Root cause identified (provenance labeling, not data fidelity)
- Documentation created (DATA_PIPELINE_ANALYSIS.md, DATA_FIDELITY_POLICY.md)
- Code updated (real_data_loader.py, generator.py)
- Regeneration script created
- Verification instructions provided
- Run regeneration script to update the viewer
- Review generated viewer to confirm provenance labels
- Share with user to confirm issue is resolved
- Apply to other viewers (segmentation, training, etc.)
- Add automated tests for provenance labeling
If similar confusion arises in the future:
- Check
DATA_PIPELINE_ANALYSIS.mdfor data flow - Review
DATA_FIDELITY_POLICY.mdfor guidelines - Verify provenance labels are present and accurate
- Distinguish ML-generated (real) from synthetic (fake)
Key principle: Display what exists, label how it was created.