Merged
134 changes: 79 additions & 55 deletions backend/PERF_OVERHAUL.md
@@ -19,20 +19,22 @@ verity fixture missions.
| P1 | #8 tolerant continuation | ✅ shipped #438 |
| P1 | #9 navigation leak | ✅ shipped #438 |
| P1 | #10 markdown size cap | ✅ shipped #438 |
| P2 | #11 virtualize chat | ⏸ deferred — see below |
| P2 | #12 virtualize thoughts sheet | ⏸ deferred — see below |
| P2 | #11 virtualize chat | ✅ implemented in this branch |
| P2 | #12 virtualize thoughts sheet | ✅ implemented in this branch |
| P2 | #13 lazy markdown | ✅ shipped #443 |
| P2 | #14 memoize derived slices | ✅ shipped #442 |
| P2 | #15 split ControlView | ⏸ deferred — see below |
| P2 | #16 worker reducer | ⏸ deferred — see below |
| P3 | #17 delta summarization | ⏸ deferred — backend, large |
| P2 | #15 split ControlView | ✅ implemented in this branch (slice stores) |
| P2 | #16 worker reducer | ✅ implemented in this branch |
| P3 | #17 delta summarization | ✅ implemented in this branch |
| P3 | #18 since_seq cursors | ⏸ deferred — backend, large |
| P3 | #19 WS migration | ⏸ deferred — backend, large |
| P3 | #19 WS migration | ✅ implemented in this branch |
| P3 | #20 per-mission channels | ⏸ deferred — backend, medium |
| P3 | #21 backend text_delta backpressure | ⏸ deferred — backend, medium |
| P4 | #22-24 content model | ⏸ deferred — cross-stack |
| P4 | #22 negotiated `text_op` protocol | ✅ implemented in this branch |
| P4 | #23 canonical assistant rows | ✅ implemented in this branch |
| P4 | #24 tool-output truncation | ⏸ deferred — backend, medium |
| P5 | #25 health budget telemetry | ⏸ deferred — needs ingestion |
| P5 | #26 Playwright perf CI | ⏸ deferred — flaky-risk in CI |
| P5 | #26 Playwright perf CI | ✅ implemented in this branch |
| P5 | #27 STREAMING.md | ✅ shipped (this file's sibling) |

## Before / after (verity mission `3a902278`, 1882 events)
@@ -52,66 +54,90 @@ The original symptom (74-second freezes on opening verity #1884)
disappeared after P1-#4..#10 alone. Subsequent items are
optimisations, not bug fixes.

## New measurements from this branch

| Item | Measurement |
| --- | --- |
| P2-#11/#12 virtualization | `dashboard/tests/control-perf.spec.ts` fixture mission with 500 messages passes DOM `<5k`; local Chromium run completed in 34.0s. |
| P3-#17 summarization | `inactive_stream_summary_reduces_large_payload_by_ten_x` covers the read-side collapse and asserts the synthetic payload reduction is at least 10x. |
| P4-#22 negotiated deltas | `text_op_stream_transform_converts_cumulative_delta_to_insert_then_replace` and `text_op_stream_transform_finalizes_before_assistant_message` cover the `cap=text_op` transport conversion path. |
| P4-#23 canonical rows | `finalized_text_ops_collapse_to_canonical_assistant_row` proves a finalized `text_op` log is replaced by one `assistant_message_canonical` row. |
| P5-#26 perf CI | Playwright `control @perf keeps large mission within browser budgets` asserts heap `<300MB`, max longtask `<500ms`, and DOM `<5k`. |

Validation commands run locally:

```bash
cargo fmt --all --check
cargo check -q
cargo test -q inactive_stream_summary --lib
cargo test -q text_op_stream_transform --lib
cargo test -q finalized_text_ops_collapse_to_canonical_assistant_row --lib
cd dashboard && npx tsc --noEmit
cd dashboard && bun run build
cd dashboard && PLAYWRIGHT_PORT=3001 PLAYWRIGHT_BASE_URL=http://localhost:3001 bunx playwright test tests/control-perf.spec.ts --project=chromium
cd ios_dashboard && xcodebuild -project SandboxedDashboard.xcodeproj -scheme SandboxedDashboard -destination 'platform=iOS Simulator,name=iPhone 17 Pro,OS=26.4' build
```

iOS simulator smoke evidence:

- `ios-control-direct-after-sequential.png` shows historical replay of the
goal fixture mission against the dev backend.
- The first dev-backend run surfaced a Swift concurrency abort in
`loadMission`; the saved-mission metadata/transcript fetch was made
sequential and the fixture rendered after rebuild/reinstall.

## Deferred items, with reasoning

These are intentional decisions to stop work, not abandoned TODOs.

### P2-#11/12: virtualize chat list + thoughts sheet

The chat list already uses CSS `content-visibility: auto` +
`contain-intrinsic-size: auto 140px` on every row, which gives the
browser permission to skip layout and paint for off-screen rows
without any JS-level virtualizer. Combined with P2-#13 (lazy
markdown) and the P2-#14 memoization, the perf overlay no longer
shows DOM-traversal cost in the longtask profile on a 1.8k-event
mission. A `@tanstack/react-virtual` integration would add 30 KB to
the bundle and a non-trivial scroll-anchor refactor; cost > benefit
at current data sizes. Revisit if a mission with >5k visible items
becomes routine.
The main transcript and thoughts sheet now use `@tanstack/react-virtual`
with estimated row heights, mount-time measurement, bottom anchoring,
and a scroll-to-bottom pill when the user is away from the bottom. The
Playwright perf fixture holds the 500-message mission under the DOM
budget.

### P2-#15: split ControlView into subscribers

The win here is preventing the entire 9k-line component from
re-rendering on every state tick. Half the win has already been
captured: `ChatItemRow` is `memo()`-wrapped, the derived views go
through `useMemo`+`useDeferredValue`, the 1Hz timer is shared
(P1-#7), and SSE bursts coalesce into one commit per frame (P1-#6).
The remaining win comes from migrating to Zustand-or-similar so
unrelated state slices stop triggering global re-renders. That's a
multi-day refactor with high regression risk. Track separately;
don't bundle it into the perf overhaul.
The dashboard now mirrors the iOS-style split with explicit stores for
items, queue, thinking, streaming diagnostics, and the viewing mission.
The layout component owns layout state while panels subscribe to the
slices they render.

### P2-#16: Web Worker for `eventsToItems`

After P0+P1 landed the `replay:apply` reducer runs at most
**65 ms** for a 5000-event replay on the verity fixture (measured
via the `replay:apply` console.time + the metrics overlay's
"Reducers (cum)" panel). Moving it to a worker requires extracting
~250 lines of helpers, building a worker bundle in Next.js, and
paying ~100ms of structured-clone cost on every call to ship a 5k
ChatItem[] back across the boundary. Net: probably break-even at
current sizes, regression on small-list call sites that fire
hundreds of times per session. Deferred until a mission size emerges
where the reducer alone exceeds 200ms.
`eventsToItemsImpl` and its parsing/continuation helpers live in
`events-reducer.ts`; `events-worker.ts` exposes a Promise RPC worker
using Next's `new Worker(new URL(..., import.meta.url))` bundling path.
Existing synchronous `eventsToItems()` call sites remain available, while
initial transcript and `since_seq=0` replays route through the worker
with a sync fallback if worker startup fails.

### P3-#17..#21: backend streaming changes

All five require coordinated dashboard + iOS + backend changes and
substantial test coverage. The current SSE + `/events` shape is
stable across three clients; changing it has high blast radius. The
per-mission broadcast channels (P3-#20) and the text_delta
backpressure (P3-#21) are the cheapest wins remaining; track them as
follow-up issues.
P3-#17 adds a pure read-side summarization pass for inactive missions:
`thinking` and `text_delta` runs are collapsed for `/events`, `/trace`,
and transcript wrappers only when `updated_at` is older than five
minutes. Persisted rows are unchanged, and active missions keep the
incremental path.
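
A minimal sketch of the read-side collapse described above, written in TypeScript for illustration only (the backend pass itself is not shown in this diff). The `StoredEvent` shape, the `summarizeForRead` name, and the choice to keep the last row of each consecutive run are assumptions; the five-minute threshold and the two collapsible event types come from the text.

```typescript
// Hypothetical row shape; the real persisted schema is not shown in this diff.
interface StoredEvent {
  sequence: number;
  event_type: string;
  content: string;
}

const INACTIVE_AFTER_MS = 5 * 60 * 1000; // "older than five minutes"
const COLLAPSIBLE = new Set(['thinking', 'text_delta']);

// Returns a new array and never mutates the input, mirroring
// "persisted rows are unchanged".
function summarizeForRead(
  events: StoredEvent[],
  updatedAtMs: number,
  nowMs: number,
): StoredEvent[] {
  // Active missions keep the incremental path untouched.
  if (nowMs - updatedAtMs <= INACTIVE_AFTER_MS) return events;

  const out: StoredEvent[] = [];
  for (const ev of events) {
    const last = out[out.length - 1];
    if (last && COLLAPSIBLE.has(ev.event_type) && last.event_type === ev.event_type) {
      // Assumption: buffers are cumulative, so the later row supersedes the earlier one.
      out[out.length - 1] = ev;
    } else {
      out.push(ev);
    }
  }
  return out;
}
```

Because the function is pure, the same pass could sit behind `/events`, `/trace`, and the transcript wrappers without touching storage.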

P3-#19 adds `/api/control/ws` with 15s heartbeats, client resume, and
dashboard WS-first/SSE-fallback behavior. P3-#18, #20, and #21 remain
deferred follow-ups.

### P4-#22..#24: content model changes

CRDT-style deltas (P4-#22), canonical-bubble persistence (P4-#23),
and tool-output truncation (P4-#24) all imply data-model migrations.
The P1-#8 tolerant continuation heuristic absorbs most of the
duplicate-token symptom that motivated #22. #23 needs a clear
data-loss story before it's worth the risk. #24 should be done but
also needs backend cooperation (the streaming download endpoint
doesn't exist yet).
P4-#22 is implemented as a negotiated transport capability:
dashboard and iOS append `cap=text_op`, the backend converts cumulative
`text_delta` buffers into `TextOp::Insert`/`Replace` events for that
connection, and older clients continue receiving cumulative
`text_delta`.
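
The conversion from cumulative buffers to operations can be sketched as follows. This is an illustrative TypeScript version, not the backend's Rust code: the `diffCumulative` name and the op field names (`kind`, `pos`, `text`) are assumptions, and positions are counted in UTF-16 code units purely because that is what JavaScript strings use.

```typescript
// Hypothetical op shapes; the diff only names TextOp::Insert and Replace.
type TextOp =
  | { kind: 'insert'; pos: number; text: string }
  | { kind: 'replace'; text: string };

// Compare the previous cumulative buffer with the next one and emit
// the smallest op that brings a client buffer up to date.
function diffCumulative(previous: string, next: string): TextOp[] {
  if (next === previous) return []; // nothing new to send

  if (next.startsWith(previous)) {
    // Common streaming case: the buffer only grew at the end.
    return [
      { kind: 'insert', pos: previous.length, text: next.slice(previous.length) },
    ];
  }

  // The buffer diverged (for example, the model restarted), so resend it whole.
  return [{ kind: 'replace', text: next }];
}
```

For the sequence `"Hel"`, `"Hello"`, `"Goodbye"` this yields two inserts followed by one replace, which matches the insert-then-replace behaviour named in the transform tests.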

P4-#23 persists in-flight `text_op` rows and collapses a finalized
`bubble_id` into one `assistant_message_canonical` row. Historical
fetches return that canonical row instead of the delta log. Existing
missions are unchanged. P4-#24 remains a deferred follow-up.

### P5-#25: health budget telemetry

@@ -123,11 +149,9 @@ on where the telemetry should land.

### P5-#26: Playwright perf CI

`@playwright/test --grep perf` runs that load a fixture mission and
assert heap/longtask/DOM budgets are flaky-prone — the Vercel
preview deploy lifecycle alone introduces 30s+ of variance. Better
done as a manual regression script that the perf overlay's
`?debug=perf` already supports.
`dashboard/tests/control-perf.spec.ts` is marked `@perf`, loads the
fixture with `?debug=perf`, waits 30s, and asserts heap, longtask, and
DOM budgets.
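
The budget check itself reduces to three comparisons. The sketch below is not the contents of `control-perf.spec.ts`; the `PerfSample` shape and `budgetViolations` helper are hypothetical, and treating 300MB as 300 × 1024 × 1024 bytes is an assumption. The thresholds are the ones stated in this document.

```typescript
// Hypothetical sample shape collected from the page after the 30s wait.
interface PerfSample {
  heapBytes: number;
  maxLongTaskMs: number;
  domNodes: number;
}

// Budgets from this document: heap <300MB, max longtask <500ms, DOM <5k.
const BUDGETS = {
  heapBytes: 300 * 1024 * 1024, // assumption: MB means MiB here
  maxLongTaskMs: 500,
  domNodes: 5000,
};

// Returns one message per exceeded budget; an empty array means the run passes.
function budgetViolations(sample: PerfSample): string[] {
  const violations: string[] = [];
  if (sample.heapBytes >= BUDGETS.heapBytes) {
    violations.push(`heap ${sample.heapBytes} >= ${BUDGETS.heapBytes}`);
  }
  if (sample.maxLongTaskMs >= BUDGETS.maxLongTaskMs) {
    violations.push(`longtask ${sample.maxLongTaskMs}ms >= ${BUDGETS.maxLongTaskMs}ms`);
  }
  if (sample.domNodes >= BUDGETS.domNodes) {
    violations.push(`dom ${sample.domNodes} >= ${BUDGETS.domNodes}`);
  }
  return violations;
}
```

Returning all violations at once, rather than failing on the first, makes a single CI run show every budget that regressed.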

## Operational

44 changes: 38 additions & 6 deletions backend/STREAMING.md
@@ -1,20 +1,21 @@
# Streaming contract

The control plane emits agent activity to dashboard and iOS clients via a
single SSE stream (`GET /api/control/stream`) plus a database-backed
event log (`GET /api/control/missions/:id/events`, `…/trace`,
`…/transcript`). This document is the canonical contract for what each
The control plane emits agent activity to dashboard and iOS clients via
SSE (`GET /api/control/stream`), WebSocket (`GET /api/control/ws`), and
a database-backed event log (`GET /api/control/missions/:id/events`,
`…/trace`, `…/transcript`). This document is the canonical contract for what each
backend emits and what each client expects. It exists because two
incidents (the iOS replay-text-delta bug, the verity duplicated-thoughts
freeze) traced back to drift between these three implementations.

## SSE channel

`GET /api/control/stream?mission=<uuid>`
`GET /api/control/stream?mission=<uuid>&cap=text_op`

| Param | Meaning |
| --- | --- |
| `mission` | When present, the server only emits events whose `mission_id` matches. `status` and `stream_lagged` (connection-scoped) always pass. Omit the param to receive every event the authenticated user can see (used by the mission list and the `?debug=perf` overlay). |
| `cap` | Optional comma-separated client capabilities. `text_op` asks the transport to convert cumulative `text_delta` events into negotiated CRDT-style `text_op` operations for this connection. Omit it for the cumulative compatibility path. |

Each line is one of:

@@ -26,6 +27,27 @@ Each line is one of:
- `event: <type>\ndata: <json>\n\n` — a single `AgentEvent`. See the
type list below.

## WebSocket channel

`GET /api/control/ws?mission=<uuid>&cap=text_op`

The WebSocket stream carries the same JSON `AgentEvent` payloads as SSE,
including the `type` discriminator. The dashboard attempts WebSocket first
and falls back to SSE if the upgrade fails.

Additional WebSocket-only messages:

- Server heartbeat every 15s: `{"seq": N}` where `N` is the latest stored
event sequence for the filtered mission, or `0` without a mission filter.
- Client resume request: `{"type":"resume","since_seq": N}`. When the
socket has a `mission=<uuid>` filter, the server fetches stored events
with `sequence > N`, converts known rows back into `AgentEvent` shape,
and sends them before continuing live broadcast delivery.

When `mission=<uuid>` is present, WebSocket uses the same per-mission
broadcast channel as SSE. Connection-scoped `status` and FIDO events still
come from the global channel.
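
A client needs to tell heartbeats apart from `AgentEvent` payloads and remember the last sequence it saw so it can resume. The sketch below covers only that message handling and leaves the socket itself out. The `classifyMessage` and `buildResume` names are hypothetical, and the rule "a numeric `seq` with no `type` is a heartbeat" is an assumption drawn from the message shapes listed above.

```typescript
// Every AgentEvent carries a `type` discriminator; other fields vary by type.
type AgentEvent = { type: string; [key: string]: unknown };

type Incoming =
  | { kind: 'heartbeat'; seq: number }
  | { kind: 'event'; event: AgentEvent };

// Parse one WebSocket text frame into a heartbeat or an event.
function classifyMessage(raw: string): Incoming {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('expected a JSON object frame');
  }
  const fields = parsed as Record<string, unknown>;
  const type = fields.type;
  const seq = fields.seq;

  // Assumption: heartbeats are `{"seq": N}` with no `type` field.
  if (type === undefined && typeof seq === 'number') {
    return { kind: 'heartbeat', seq };
  }
  if (typeof type === 'string') {
    return { kind: 'event', event: { ...fields, type } };
  }
  throw new Error('frame is neither a heartbeat nor an AgentEvent');
}

// Build the resume request sent after reconnecting.
function buildResume(sinceSeq: number): { type: 'resume'; since_seq: number } {
  return { type: 'resume', since_seq: sinceSeq };
}
```

A client following this sketch would store the `seq` from each heartbeat and send `JSON.stringify(buildResume(lastSeq))` right after the socket reopens. Events delivered between the last heartbeat and the disconnect may then be replayed, so the reducer should tolerate duplicates.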

### Event types

All events carry `mission_id` (optional for connection-scoped events).
@@ -37,6 +59,7 @@ Listed with the backends that emit them.
| `user_message` | server | `{id, mission_id, content, queued}` | Echoes the user message back after persisting. |
| `assistant_message` | server | `{id, mission_id, content, success, cost_cents?, cost_source?, model?, shared_files?}` | One per completed agent turn. **Cumulative content** — the message is the final consolidated text. |
| `text_delta` | grok, codex | `{mission_id, content, event_id?}` | **Cumulative buffer** — the `content` field contains the *entire* text so far, not the new tokens. Clients must consolidate by replacing, not appending. See "Continuation rule". |
| `text_op` | negotiated streaming backends | `{mission_id, bubble_id, ops}` | CRDT-style delta stream. `ops` entries are `insert`, `replace`, or `finalize`; clients apply them to a local buffer keyed by `bubble_id`. Backends only emit this when the client advertises support; `text_delta` remains the compatibility path. |
| `thinking` | grok, codex | `{mission_id, content, done, goal_role?, event_id?}` | Cumulative buffer. `done: true` finalises the current thought; subsequent non-prefix payloads start a new thought. |
| `tool_call` | all | `{mission_id, tool_call_id, name, args}` | One per tool invocation. |
| `tool_result` | all | `{mission_id, tool_call_id, name, result}` | Pairs with `tool_call` via `tool_call_id`. |
@@ -127,11 +150,20 @@ Events NOT persisted:
- `status`, `stream_lagged`, `fido_sign_request` — connection-scoped.
- `mission_activity` — diagnostic only, intentionally not stored.

For negotiated `text_op` streams, in-flight ops persist as `text_op` rows.
When a `finalize` op arrives, the mission store applies the full op log for
that `bubble_id`, deletes those delta rows, and writes one
`assistant_message_canonical` row. Future `/events` fetches return the
canonical row rather than the op log. Existing missions and cumulative
`text_delta` rows are unchanged.
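
Applying an op log to a buffer keyed by `bubble_id` can be sketched as below. The op field names (`kind`, `pos`, `text`) and the `BubbleBuffers` class are assumptions for illustration; the contract above only names the `insert`, `replace`, and `finalize` entries. Positions are treated as UTF-16 code unit offsets, which is also an assumption.

```typescript
// Hypothetical op shapes for the three entries named in the contract.
type TextOp =
  | { kind: 'insert'; pos: number; text: string }
  | { kind: 'replace'; text: string }
  | { kind: 'finalize' };

interface TextOpEvent {
  mission_id: string;
  bubble_id: string;
  ops: TextOp[];
}

class BubbleBuffers {
  private buffers = new Map<string, string>();

  // Applies the ops in order. Returns the final text when a `finalize`
  // op is seen (the point where a canonical row would be written),
  // otherwise null.
  apply(event: TextOpEvent): string | null {
    let text = this.buffers.get(event.bubble_id) ?? '';
    for (const op of event.ops) {
      switch (op.kind) {
        case 'insert':
          text = text.slice(0, op.pos) + op.text + text.slice(op.pos);
          break;
        case 'replace':
          text = op.text;
          break;
        case 'finalize':
          // The in-flight buffer is dropped once the bubble is final.
          this.buffers.delete(event.bubble_id);
          return text;
      }
    }
    this.buffers.set(event.bubble_id, text);
    return null;
  }

  current(bubbleId: string): string {
    return this.buffers.get(bubbleId) ?? '';
  }
}
```

The same fold works on either side of the wire: a client uses it to render the live bubble, and a store could use it to compute the text of the canonical row when `finalize` arrives.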

## Client expectations

### Dashboard (`dashboard/src/app/control/control-client.tsx`)

- Connects with `?mission=<id>` when viewing a specific mission.
- Connects with `?mission=<id>` when viewing a specific mission. The
transport prefers `/api/control/ws` and falls back to `/api/control/stream`
on WebSocket connection error.
- Reconnects whenever the viewing mission changes.
- Coalesces `text_delta` and `thinking` re-renders via
`requestAnimationFrame` — at most one React commit per frame.
5 changes: 5 additions & 0 deletions dashboard/bun.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions dashboard/package.json
@@ -22,6 +22,7 @@
"@codemirror/theme-one-dark": "^6.1.3",
"@codemirror/view": "^6.39.12",
"@radix-ui/react-slot": "^1.2.4",
"@tanstack/react-virtual": "^3.13.24",
"@types/prismjs": "^1.26.5",
"@types/react-syntax-highlighter": "^15.5.13",
"@uiw/react-codemirror": "^4.25.4",
9 changes: 6 additions & 3 deletions dashboard/playwright.config.ts
@@ -1,5 +1,8 @@
import { defineConfig, devices } from '@playwright/test';

const port = process.env.PLAYWRIGHT_PORT || '3099';
const baseURL = process.env.PLAYWRIGHT_BASE_URL || `http://localhost:${port}`;

export default defineConfig({
testDir: './tests',
fullyParallel: true,
@@ -9,7 +12,7 @@ export default defineConfig({
reporter: 'html',
timeout: 30000, // 30 seconds per test
use: {
baseURL: 'http://localhost:3099',
baseURL,
trace: 'on-first-retry',
},

@@ -24,8 +27,8 @@ export default defineConfig({
],

webServer: {
command: 'bun dev --port 3099',
url: 'http://localhost:3099',
command: `bunx next dev --port ${port}`,
url: baseURL,

Playwright port/baseURL mismatch when only one env var set

Medium Severity

When PLAYWRIGHT_BASE_URL is set without PLAYWRIGHT_PORT (e.g. PLAYWRIGHT_BASE_URL=http://localhost:3001), port defaults to '3099', so webServer.command starts Next on 3099 while webServer.url polls 3001. The test run will hang waiting for a server that never appears on the expected port. Deriving port from the URL, or requiring both vars together, would fix this.

Reviewed by Cursor Bugbot for commit e8cbb30.
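
One way to address the mismatch above is to resolve the port and base URL together, so the two values can never disagree. This is a sketch, not the fix adopted in the PR: `resolveServerTarget` is a hypothetical helper, and letting `PLAYWRIGHT_BASE_URL` take precedence over `PLAYWRIGHT_PORT` is a design assumption. It takes the environment as an argument so it can be exercised without touching `process.env`.

```typescript
type Env = Record<string, string | undefined>;

interface ServerTarget {
  port: string;
  baseURL: string;
}

// Hypothetical helper: derives a consistent port/baseURL pair.
function resolveServerTarget(env: Env, defaultPort = '3099'): ServerTarget {
  const explicitUrl = env.PLAYWRIGHT_BASE_URL;
  if (explicitUrl) {
    const parsed = new URL(explicitUrl);
    // URL.port is '' when the URL uses its scheme's default port.
    const port = parsed.port || (parsed.protocol === 'https:' ? '443' : '80');
    // Assumption: the URL wins, so the dev server starts where tests will poll.
    return { port, baseURL: explicitUrl };
  }
  const port = env.PLAYWRIGHT_PORT || defaultPort;
  return { port, baseURL: `http://localhost:${port}` };
}
```

In `playwright.config.ts` this could be used as `const { port, baseURL } = resolveServerTarget(process.env);`, leaving the `webServer.command` and `webServer.url` lines unchanged.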

reuseExistingServer: !process.env.CI,
timeout: 120000, // 2 minutes for server to start
env: {