Skip to content

P3: kolu dials remote hosts over ssh (kolu-watcher + N-endpoint registry)#1385

Open
srid wants to merge 64 commits into
masterfrom
p3-kolu-dials-remotes
Open

P3: kolu dials remote hosts over ssh (kolu-watcher + N-endpoint registry)#1385
srid wants to merge 64 commits into
masterfrom
p3-kolu-dials-remotes

Conversation

@srid

@srid srid commented Jun 16, 2026

Copy link
Copy Markdown
Member

kolu-server can now dial a remote host and run terminals on it over ssh — the full P3 of the kaval-sessions plan. kaval-tui --host (P2) reached a remote kaval from the shell; P3 brings that reach into the kolu app itself: open a terminal on another machine from the command palette's Connect to host…, with its files + git mirrored into the Code tab like a local one, the remote session durable across the link, and a per-tile host chip carrying the host's health.

The shape

A remote host runs two processes side by side, dialed over one ssh connection:

                          ┌──────────────── remote host ────────────────┐
  kolu-server  ──ssh──▶   │  kolu-watcher --stdio   ──unix socket──▶  kaval
  RemoteTerminalEndpoint  │   (fresh per build:                       (durable PTY
   · mirror bridge        │    provider DAG + fs/git,                   survivor)
   · daemonStatus         │    forwards pty verbs to kaval)
                          └─────────────────────────────────────────────┘
  • kaval stays the kolu-agnostic durable PTY daemon (a terminal survives the link).
  • kolu-watcher (new) runs kolu's coupled code — the provider DAG (git/PR/agent detection) + native fs/git — host-side, fresh per build. It serves ONE watcherSurface over the ssh stdio: it absorbs the pty-host verbs/taps and forwards them to the host-local kaval (a serveOverStdio server that is a client of kaval's socket — not a raw relay), plus serves fs/git + a terminalMetadata collection.
  • kolu-server's RemoteTerminalEndpoint consumes that surface, mirroring the watcher's metadata into the browser surface (the remote-process-monitor bridgeAgentToParent pattern) and adopting terminals the host already has.

How it's built (commit by commit)

additive contract hostId on create + the 4 fs/git stream inputs; location {hostId} on the terminal record. All optional → old sessions are local, the wire enum is untouched (no drishti).
@kolu/terminal-dag the provider DAG + makeFsGit extracted into a package both kolu-server and kolu-watcher import — the local path stays byte-identical (only the logger is now injected).
kolu-watcher the new host process: surface contract (absorbs pty + fs/git + metadata) · kavalClient (adopt-or-spawn the host kaval) · server (DAG host-side, forward pty, serve fs/git) · bin · its own nix package + staleKey + closure-allow-list test + drv-map.
RemoteTerminalEndpoint + registry the 2nd TerminalEndpoint impl over a HostSession<watcherContract>; endpointFor(hostId) dispatch; the ssh-host registry (KOLU_HOSTS_JSON); boot dials each configured host.
client multi-host daemonStatus; client-only provisioning/unreachable projections; host-scoped create; the per-tile host chip; a Connect to host… palette leaf.

Notes for review

No drishti PR is compelled. P3 touches @kolu/surface* only as a consumer of already-exported, unchanged-signature API (getHostSession/HostSession/mirrorRemoteCollection/makeClientCursor/resolveSystem, serveOverStdio/implementSurface/defineSurface). The ENDPOINT_STATES wire enum is deliberately not widened — provisioning/unreachable are client-only projections. kolu-watcher, the watcher contract, RemoteTerminalEndpoint, and @kolu/terminal-dag all live in kolu-only packages.

Test reality: every new piece is hermetically unit-tested (mock HostSession, in-process fake kaval, tmp-dir git repos, mocked surfaceClient) — no ssh, no nix, no daemon spawned. The watcher's in-process serve+forward test and the DAG exercise run under CI (they spawn git one-shots, so not locally). CI builds .#kolu-watcher on both platforms — the real cross-arch validation of the nix work.

The live ssh dial (provisioning a real remote, the browser updating from a remote host's mirror) is validated by the evidence step, not by stubbing — nothing is stubbed.

Try it locally

nix build github:juspay/kolu/p3-kolu-dials-remotes#kolu-watcher

🤖 Generated with Claude Code

Generated by /be on Claude Code (model claude-opus-4-8).

srid added 24 commits June 15, 2026 21:51
…on on the terminal record

Foundation for P3 (kaval-sessions: kolu dials remotes). All additive/optional
so old sessions deserialize as local and no wire enum is touched:
- TerminalCreateInputSchema.hostId (target host; omitted = local)
- PtySpawnOpts.hostId (lifecycle resolves the owning endpoint from it)
- ServerPersistedTerminalFieldsSchema.location {hostId} (persisted so
  kill/attach/restore route back to the same endpoint)

ENDPOINT_STATES / DaemonStatusSchema deliberately unchanged — ssh health will
surface as client-only projected states, keeping the wire enum (and drishti) untouched.
Completes the additive contract (Step 1). The 4 Code-tab fs/git stream inputs
(gitStatus/gitDiff/fsListAll/fsReadFile in kolu-git/schemas) gain an optional
hostId so the server can route a remote terminal's fs/git over its ssh endpoint;
omitted ⇒ local ⇒ byte-identical for every existing client. Adds a hermetic
round-trip test proving pre-P3 records/inputs still parse (no location ⇒ local)
and P3 ones carry hostId end to end. Doc note disambiguates the client
TerminalLocation projection from the new record-level location.hostId.
So kolu-watcher (P3) can run the SAME per-terminal provider DAG host-side
without dragging in the full kolu-server (DB/publisher/surfaceCtx/browser
surface). providers.ts already imported only ../log.ts + ./agentRecency.ts and
is parameterized over ProviderChannels/ProviderHooks/ProviderRecord, so this is
a move, not a rewrite:

- New leaf package packages/dag (@kolu/terminal-dag): providers.ts +
  agentRecency.ts moved verbatim (git-tracked renames).
- The one server coupling — the module-level pino `log` — is now INJECTED via
  ProviderHooks.log, so each host supplies its own logger (kolu-server keeps its
  serverId context; kolu-watcher will supply the watcher's). The DAG body is
  otherwise unchanged.
- local.ts changes only its import path (./providers.ts → @kolu/terminal-dag)
  and passes `log` in makeHooks — the local path is byte-identical.
- Wired into the workspace, the server's deps, and default.nix's src fileset.

Verified: full-tree typecheck + biome clean; the moved agentRecency unit test
passes in its new home. The DAG's end-to-end exercise lands in Step 4
(kolu-watcher in-process kaval + tmp-repo test), where startProviders runs
against real taps with a proper lifecycle.
…metadata)

The single typed surface a kolu-watcher serves over one ssh stdioLink and
kolu-server's RemoteTerminalEndpoint mirrors. One flat defineSurface composing
three concerns:

- the ENTIRE pty-host surface, absorbed verbatim by spreading
  ptyHostSurface.spec (so the schemas can never drift from kaval's) — kolu-watcher
  forwards these verbs/taps to the host-local kaval as an ordinary client (the
  'serve, don't raw-relay' decision: one serveOverStdio server, kaval stays a
  separate durable daemon);
- fs/git as one-shot procedures (git.*/fs.*) + repo/file change streams — a 1:1
  projection of the TerminalEndpointFs/Git interfaces onto the wire;
- the terminalMetadata collection, value schema pinned to kolu-common's so the
  small-N mirror cannot drift.

New leaf package packages/kolu-watcher (wired into the workspace + default.nix
fileset). Hermetic contract test: composes without collision, metadata schema
identity, absorbed verbs/streams present. typecheck + biome clean.
…terminal-dag

kolu-watcher (P3) must serve the remote host's fs/git from the SAME kolu-git
impl kolu-server uses locally — so factor LocalTerminalEndpoint's localFs/localGit
and the GitResult→ORPCError unwrap into @kolu/terminal-dag as makeFsGit(log) +
unwrapGit, consumed by both. local.ts now destructures makeFsGit(log) (behavior
identical — same kolu-git calls); server/unwrapGit.ts re-exports so router.ts's
import path is unchanged. typecheck + biome clean.
…tdio

The host-resident heart of P3. kolu-server runs `ssh <host> kolu-watcher --stdio`;
the watcher serves ONE watcherSurface over that link by composing:

- kavalClient.ts — adopt-or-spawn the host-local kaval, return a TYPED
  ptyHostSurface client (the 'serve, don't relay' decision: the watcher is a
  kaval CLIENT it forwards to, not a frontDaemonOverStdio byte relay; it can't
  reExecAsDetachedDaemon — that's the watcher, not kaval — so it spawns the
  kaval binary nix bakes, with a version handshake).
- server.ts — buildWatcherServer: every absorbed pty verb/tap FORWARDS to the
  kaval sub-client; fs/git served from the host filesystem via the shared
  makeFsGit; terminalMetadata produced by running kolu's provider DAG host-side
  (fresh per build) and published into the served collection kolu-server mirrors.
  Provider lifecycle mirrors LocalTerminalEndpoint (spawn starts the DAG, kill/
  exit tears it down); terminals already alive on an adopted kaval are picked up.
- streamBridge.ts — forward/bridge/tick adapters between kaval client streams,
  the DAG's channels, and the watcher's served streams (abort = expected teardown).
- bin.ts — the --stdio entrypoint (serveOverStdio; pino logs to fd 2 since
  stdout is the wire); buildId.ts — the nix-baked staleKey/commit identity.

Hermetic in-process test (server.test.ts): a FAKE kaval (stubbed ptyHostSurface,
no PTY) + directLink proves spawn forwards, the DAG starts + mirrors metadata,
terminalAttach forwards snapshot-first, and fs/git serve a real tmp git repo —
no ssh, no nix, no spawned shell. (Run by CI; not locally — it exercises the DAG.)
Full-tree typecheck + biome clean.
Nix-native from the get-go — kolu-server provisions the watcher closure over ssh
exactly like kaval:
- default.nix: a `kolu-watcher` runCommand wrapper (node --import tsx loader, so
  SIGTERM reaches it) baking KOLU_WATCHER_BUILD_ID/COMMIT_HASH + KOLU_WATCHER_KAVAL_BIN
  (the kaval it spawns when none is serving); git on PATH for kolu-git. `watcherSrc`
  + `watcherBuildId` (the package's own hashed src, kaval's narrow discipline).
- koluBin now bakes KOLU_WATCHER_BUILD_ID + KOLU_WATCHER_AGENT_DRVS_JSON and adds
  openssh + nix to PATH — kolu-server itself now dials + provisions remotes.
- flake.nix: `watcherDrvBySystem` + `koluWatcherAgentDrvsJson` threaded into
  koluBySystem (the kavalDrvBySystem analog; same IFD-free unsafeDiscardStringContext).
- buildId.closure.test.ts: the watcher's OWN closure guard — INVERSE of kaval's
  (kolu app deps are allowed; the guard catches kolu-server/client sneaking in).
  Trailing-slash allow-list entry catches @kolu/surface-{app,daemon-supervisor,
  nix-host} too. reached == nix-hashed (no drift / dead code). Passes.

Verified: nix eval of .#kolu-watcher.drvPath + .#default.drvPath (no IFD cycle);
typecheck + biome clean. (Cross-arch BUILD runs in CI.)
…+ boot

The kolu-server consumer side of P3 — kolu-server now dials remotes:

- RemoteTerminalEndpoint (terminalEndpoint/remote.ts): the second TerminalEndpoint
  impl, over a HostSession<watcherContract>. PTY verbs/attach/kill forward to the
  watcher; fs/git one-shots forward + change-subs re-serve from its repoChange/
  fileChange streams; a mirror bridge (the remote-process-monitor bridgeAgentToParent
  pattern — makeClientCursor re-pump + mirrorRemoteCollection) pumps the watcher's
  terminalMetadata into kolu-server's registry + browser surface, ADOPTING terminals
  the watcher already has. onState → publishDaemonStatus (mapped onto the existing
  wire enum, no widening → no drishti). Spawn policy reuses composeSpawnInput against
  the WATCHER's system.info, so a remote shell opens in the host's environment.
- endpoint registry (terminalEndpoint/registry.ts): endpointFor(hostId) — local
  singleton for absent/'local' (byte-identical to pre-P3), one cached
  RemoteTerminalEndpoint per configured host. Reads the local singleton lazily so no
  #1005 TDZ regression.
- host registry (hosts/registry.ts + watcherDrv.ts): KOLU_HOSTS_JSON → {hostId,host};
  resolveDrvPath = resolveSystem → KOLU_WATCHER_AGENT_DRVS_JSON (kaval-tui pattern).
- threading: createTerminal/killTerminal/killAllTerminals (terminals.ts), attach +
  create (router.ts), the 4 fs/git streams (surface.ts) all route through
  endpointFor(hostId). A sub-terminal inherits its parent's host.
- boot (index.ts): dial every configured host at startup (publishes daemonStatus +
  adopts running terminals); a down host folds into the session reconnect loop,
  never blocks boot.

Hermetic tests (all mocked, no ssh): remote.test.ts (fs/git forward, sync-shadow
spawn, onState→daemonStatus mapping), registry.test.ts (dispatch + caching),
hosts/registry.test.ts (config parse/validation). Full-tree typecheck + biome clean.
…scoped create

The client side of P3 — remote terminals are now creatable + legible:

- useDaemonStatus.ts: the singleton daemonStatus subscription drops its explicit
  keys so it tracks EVERY host the server publishes (local + each dialed remote).
  ClientDaemonState = DaemonState | 'provisioning' | 'unreachable' — two CLIENT-ONLY
  projections (the wire enum, and drishti, stay untouched); clientDaemonState(hostId)
  maps a remote host's ssh lifecycle onto them (dialing→provisioning, down→unreachable)
  while local passes through. Every accessor (downState/daemonWarming/refuseIfWarming/…)
  gains an optional hostId defaulting to LOCAL_HOST — so the whole-canvas DegradedCanvas
  gate stays local-only (a remote host down never blanks the canvas).
- host-scoped create: handleCreate(cwd, initial, hostId) threads hostId into
  terminal.create + refuseIfWarming(hostId) (a remote terminal waits for ITS host);
  a sub-terminal inherits its parent's host. Threaded through the ActionContext seam.
- host chip (TerminalMeta.tsx): a per-tile chip shown ONLY for a remote terminal
  (absence = local), naming the machine with a 7px health dot colored from that host's
  projected state (green connected · amber provisioning · red unreachable). The
  per-tile unreachable treatment, NOT a whole-canvas one.
- command palette: a 'Connect to host…' leaf in the New terminal group — type an ssh
  target and a terminal opens there (the server dials any host verbatim; KOLU_HOSTS_JSON
  provides named aliases).

Hermetic test: ClientDaemonState coverage + the remote projection + activeHostIds.
Full-tree typecheck + biome clean. (Visual polish verified by the evidence step.)
- Unreleased changelog entry: kolu dials remote hosts over ssh (Connect to host,
  durable remote terminals, host chip, kolu-watcher). PR link backfilled at §3.
- packages/kolu-watcher/README.md: why it's separate from kaval, the serve+forward
  composition, the nix provisioning story.
- packages/dag/README.md: the extracted provider DAG + makeFsGit, why a package.
…R link

P3 (kolu dials remotes) flipped to shipped <PrLink pr={1385}/> in the phases
table + status block, documenting the kolu-watcher process, the option-(a)
serve+forward composition, the N-endpoint registry, and that the wire enum stayed
put (no drishti PR). Backfilled the changelog entry's pr={1385}. dist/ rebuilt
+ staged.
Add endpointForTerminal(id) to the endpoint registry and route killTerminal and attach through it, so the terminal->host resolution rule lives in one place instead of being re-spelled at each id-keyed call site.

Agreed by the lowy ⇄ hickey lens debate (finding lowy-2, raised by lowy). Not pushed or merged.
Remove the unused remoteEndpointFor wrapper (zero callers; its docstring described a boot wiring that calls endpointFor directly) and the unsafe cast it carried.

Agreed by the lowy ⇄ hickey lens debate (finding lowy-4, raised by lowy). Not pushed or merged.
… unit

Replace onRemoteMeta's field-by-field copy of the 7 server-owned fields with a loop over SERVER_META_KEYS (the TerminalServerMetadata partition minus location, derived off the schema), so a new server/live field rides the mirror for free. Add a round-trip test pinning the key set to the schema keys minus location.

Agreed by the lowy ⇄ hickey lens debate (finding hickey-1, raised by hickey). Not pushed or merged.
Replace the bare hostId === "local" literal in hostConfigFor with the imported LOCAL_HOST_ID constant, so hostConfigFor and endpointFor decide "is this the local host" on one shared value rather than two independent spellings.

Agreed by the lowy ⇄ hickey lens debate (finding hickey-2, raised by hickey). Not pushed or merged.
Lift the InitialTerminalMetadata guarded if-chain into one applyInitialMetadata(meta, initial) in metadata.ts and call it from both LocalTerminalEndpoint.spawnPty and RemoteTerminalEndpoint.seedMeta, so a future initial field can't be seeded on one host and dropped on the other. parentId stays on the local path (not part of InitialTerminalMetadata).

Agreed by the lowy ⇄ hickey lens debate (finding hickey-4, raised by hickey). Not pushed or merged.
…thClient

Replace RemoteTerminalProxy.call and RemoteTerminalEndpoint.call — byte-identical acquire/try/finally/release wrappers — with one module-scope withClient(session, fn), so the ref-counted lease discipline lives in exactly one place.

Agreed by the lowy ⇄ hickey lens debate (finding hickey-5, raised by hickey). Not pushed or merged.
Move the tap-pump primitive (carrying local.ts's per-event try/catch fence) into @kolu/terminal-dag and have both LocalTerminalEndpoint and kolu-watcher import it; delete the un-fenced copy in kolu-watcher/streamBridge.ts. The watcher's taps can no longer be silently frozen by one throwing event, and the two copies can't drift.

Agreed by the lowy ⇄ hickey lens debate (finding hickey-6, raised by hickey). Not pushed or merged.
Add initialServerMeta(cwd, { pr }) to @kolu/terminal-dag, have kolu-watcher call it with pr: absent and kolu-server's createMetadata compose it with pr: pending. The new-terminal default record's field set now lives once; the deliberate local-vs-watcher PR-kind difference is one explicit argument instead of two forked literals.

Agreed by the lowy ⇄ hickey lens debate (finding hickey-7, raised by hickey). Not pushed or merged.
CI's ci::unit caught it (the in-process watcher test can't run locally — it spawns
the DAG): the fake kaval returned id 'fake-terminal', but the watcher's
terminalMetadata collection is keyed by TerminalIdSchema (uuid), so the keys-stream
event-iterator validation rejected the non-uuid key. In real use kolu-server mints
a uuid; the fixture now echoes crypto.randomUUID(). The watcher impl is sound — the
other 3 watcher tests + ci::nix (both platforms) passed.
Address codex's six findings on the P3 remote-host endpoint.

F1 (blocking, FIXED): remote PTY spawn used the local-host-only
`composeSpawnInput`, which forwards THIS process's env (`cleanEnv()`) and
treats `system.info` only as a fallback — so a remote shell could launch
with the local machine's SHELL/HOME/PATH. Added `composeRemoteSpawnInput`
that builds the env FRESH from the host's facts (`info.shell`/`info.home`
authoritative, `info.path` → PATH), with nothing from this process's env
leaking across. remote.ts now uses it. Locked by spawnInput.test.ts.

F2 (blocking, FIXED): remote `attach` didn't gate on the proxy's `ready`,
so a tile attaching off the sync shadow raced the (slow, post-provision)
spawn and could hit a missing PTY. Now awaits `getTerminal(id)?.handle.ready`
first, mirroring LocalTerminalEndpoint. Test added.

F3 (blocking, FIXED): a kill landing while spawn was in flight left a live
remote PTY with no local owner — `spawnAndWire` still marked ready and never
killed the orphan. Now checks `getTerminal(id)` after spawn returns; if gone,
marks the proxy failed, kills the remote PTY (logging kill failures), and
returns without marking ready. Test added.

F4 (blocking, FIXED): the client never threaded the terminal's host id into
Code-tab fs/git stream inputs, so remote terminals fell back to the local
endpoint. CodeTab now derives `hostId` from `meta.location.hostId` and passes
it to gitStatus/gitDiff/fsListAll, and through BrowseFileDispatcher to
fsReadFile.

F5 (major, PARTIAL): the binary-preview `url` is served by the local
file route off kolu-server's filesystem, so a remote file would 404 or serve
an unrelated same-path local file. No watcher-proxying preview route exists
yet, so fail CLOSED: `fsReadFile` now throws a typed NOT_IMPLEMENTED for a
remote binary-previewable file (Code-tab toasts it) instead of minting a
wrong-filesystem URL. Text reads already route over the watcher. Added
`isRemoteHost` to registry.ts as the single local-vs-remote predicate.

F6 (major, FIXED): `subscribeChange` awaited `session.acquire()` BEFORE the
try in a floated task, so an unreachable-host rejection became an unhandled
rejection that exits the server. Acquire is now inside the try with an
`acquired` flag gating release.

All addressed findings fixed except F5 (partial: safe degradation, full
remote-preview route deferred). Server unit tests (remote + spawnInput) pass;
typecheck clean across packages.
…ublish

/simplify pass (the two net-simplifications; the rest were lens-adjudicated dups,
intentional seeds, or premature micro-opts):
- forwardStream/tickStream now accept an undefined signal and guard internally,
  removing the 8x `signal ?? new AbortController().signal` fallback from the watcher
  surface stream sources.
- the terminalMetadata change-detection `{ ...meta }` copy moved into publishMeta
  (the collection's concern) — dropped from the 4 publish call sites.
code-police no-silent-error-swallowing: bridge()'s cursor.next() catch broke out
with no log, so a real failure read like a clean shutdown. Log at error with the
host + err before exiting the pump.
@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

⚖️ Lowy ⇄ Hickey lens debate

Consensus after 2 round(s) · lowy + hickey · base 5d57c306b01a

Independent findings: lowy=7, hickey=8

Applied (10)

  • lowy-1 Two TerminalEndpoint impls duplicate the registry-mutation + emit + proxy + bridgeStream wiring instead of sharing a base — (uncommitted)
  • lowy-2 endpointFor reaches into per-call hostId at three consumer sites — the selector is the receptacle, but the lifecycle layer re-derives the key inconsistently — commit 2ac6ac696
  • lowy-4 remoteEndpointFor is a dead/unused export — an expendable boundary — commit a4c0a818c
  • hickey-1 onRemoteMeta reconstructs the server-meta half field-by-field — the R-4 Phase B: terminals survive a server restart — the pty-host daemon #1275 lossy-adoption fragmentation, reintroduced — commit 636adad2d
  • hickey-2 The reserved 'local' host sentinel is fragmented across three independent declarations — commit a3fdf3950
  • hickey-3 remoteEndpointFor is dead code with a docstring that describes a wiring that doesn't exist — (uncommitted)
  • hickey-4 RemoteTerminalEndpoint.seedMeta duplicates the local endpoint's initial-metadata seed (same 6-field if-chain) — commit 8b48aced0
  • hickey-5 Two byte-identical private call session-acquire helpers in one file — commit 9d974ea49
  • hickey-6 bridgeStream is triplicated and the copies have already diverged on the per-event fence — commit 3fc2131c2
  • hickey-7 Watcher initialServerMeta partially re-derives createMetadata, diverging on pr seed — commit 46a8e0910

Agreed — no change (5)

  • lowy-3 killAllTerminals fan-out vs killTerminal single-dispatch is an asymmetry that hides a shared-registry coupling (packages/server/src/terminals.ts:257-262; packages/server/src/terminalEndpoint/remote.ts:320-338; packages/server/src/terminalEndpoint/local.ts:657-669)
  • lowy-5 publishDaemonStatus folds the LOCAL kaval socketPath onto remote host rows (packages/server/src/ptyHost/daemonStatus.ts:42-51; packages/server/src/terminalEndpoint/remote.ts:204-207)
  • lowy-6 @kolu/terminal-dag package surface is coherent — passes the Surface/package-coherence test (packages/dag/src/index.ts:19-26)
  • lowy-7 kolu-watcher is a serve-not-relay server that absorbs the pty surface — a correct boundary, but the forwarding block is a 14-line mechanical spread that wants the absorb to be one operation (packages/kolu-watcher/src/server.ts:206-313)
  • hickey-8 Daemon connection state is collapsed at the wire then re-expanded client-side (lossy round-trip) (packages/server/src/terminalEndpoint/remote.ts:74-86 (mapDaemonStatus) and packages/client/src/kaval/useDaemonStatus.ts:164-173 (clientDaemonState))

@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

Codex ⇄ Claude debate

Consensus after 2 round(s) · codex reviewed at xhigh reasoning effort · base 5d57c306b01a

Round 1

codex — approved: false

The branch has several serious remote-host correctness issues. The shape of the watcher/endpoint split is understandable, but the current implementation loses key local lifecycle guarantees and does not consistently thread host identity through the client/server boundary, so remote terminals can spawn with local machine assumptions, race their first attach, leak orphan PTYs, and serve Code-tab data from the wrong filesystem.

Findings:

  • F1 · blocking · open — Remote PTY spawn uses composeSpawnInput, whose own implementation is explicitly local-host only. It calls cleanEnv() and prefers the kolu server process SHELL/HOME over the remote system.info values, so a remote terminal can be launched with the local machine's shell path, HOME, PATH-like environment, and cwd fallback. This contradicts the comment in remote.ts that spawn policy is composed against the remote host's facts, and will fail or behave incorrectly whenever local and remote host layouts differ. (packages/server/src/terminalEndpoint/remote.ts:261)
  • F2 · blocking · open — Remote attach does not wait for the sync-shadow terminal's handle.ready before opening terminalAttach.get. Local attach has this gate to avoid racing an in-flight spawn; remote spawn can take much longer due to provisioning/SSH. A tile attaching immediately after spawnPty can therefore hit watcher/kaval before the PTY exists and fail with a missing-PTY error instead of waiting for the spawn result. (packages/server/src/terminalEndpoint/remote.ts:285)
  • F3 · blocking · open — Remote spawn lost the local killed-during-spawn race cleanup. If the user closes a remote terminal while provisioning or terminal.spawn is in flight, killTerminal unregisters the local shadow, but the delayed spawn can still return and leave a live remote PTY with no local registry owner. The current code only conditionally updates entry but still calls proxy.markReady(res.pid) and never kills the just-created remote PTY. (packages/server/src/terminalEndpoint/remote.ts:264)
  • F4 · blocking · open — The client never passes the terminal host id into Code-tab fs/git stream inputs. The server added optional hostId routing, but gitStatus, gitDiff, fsListAll, and fsReadFile callers still send only repo/file fields. For a remote terminal, the server falls back to the local endpoint and runs git/fs operations against the kolu-server filesystem path instead of the remote watcher, producing errors or, worse, content from an unrelated local path with the same absolute name. (packages/client/src/right-panel/CodeTab.tsx:314)
  • F5 · major · open — Binary preview still produces a local file-serving URL for remote files. Even when fsReadFile is routed to a remote endpoint for statFileMtimeMs, the returned URL is served by the existing preview route, which resolves getTerminal(terminalId)?.meta.git?.repoRoot and calls createDirServer on the kolu-server filesystem. Remote HTML/image previews therefore 404 or can serve an unrelated local file if the same path exists locally. (packages/server/src/surface.ts:239)
  • F6 · major · open — subscribeChange can crash the server on remote connection/provisioning failure. The floated async task awaits this.session.acquire() before entering the try, so an acquire rejection becomes an unhandled promise rejection. The server's unhandled-rejection handler exits the process, meaning an unreachable remote host during a repo/file watcher install can take down the whole kolu server. (packages/server/src/terminalEndpoint/remote.ts:415)

claude — Reviewed codex's six findings against the P3 remote-host endpoint. All six are valid; fixed five fully and one (F5) partially with a safe fail-closed degradation. F1: remote spawn used the local-only composeSpawnInput (which forwards this process's cleanEnv() and only falls back to system.info), so a remote shell could open with the local machine's SHELL/HOME/PATH; added composeRemoteSpawnInput in ptyHost/index.ts that builds env fresh from host facts (info.shell/home authoritative, info.path → PATH) with no local-env leak, and pointed remote.ts at it. F2: remote attach now awaits getTerminal(id)?.handle.ready before opening the watcher stream (mirrors local). F3: spawnAndWire now detects a kill-during-spawn (entry gone), marks the proxy failed, kills the orphaned remote PTY, and returns without marking ready. F4: CodeTab derives hostId from meta.location.hostId and threads it into gitStatus/gitDiff/fsListAll plus through BrowseFileDispatcher to fsReadFile. F5 (partial): no watcher-proxying preview route exists, so fsReadFile now throws a typed NOT_IMPLEMENTED for a remote binary-previewable file instead of minting a wrong-filesystem URL — codex's own fallback suggestion; full remote-preview byte-streaming route deferred. F6: subscribeChange's session.acquire() moved inside the try (with an acquired flag gating release) so an unreachable-host rejection in the floated task can no longer crash the server. Added focused unit tests for F1/F2/F3 (spawnInput.test.ts + remote.test.ts). Typecheck clean across all packages; server remote + spawnInput suites pass (13 tests). Two biome warnings on the touched server files are pre-existing at the base commit (unmodified import lines). Note: a pre-existing unrelated failure in adopt.test.ts (SavedTerminalSchema gained 'location' but the test sentinel doesn't list it) exists on the branch independent of this round and was left for the P3 step-1/step-9 owner.

  • F1 fixed — Confirmed valid: composeSpawnInput (ptyHost/index.ts) is explicitly local-host-only — its cleanEnv() returns {...process.env} in production, so env.SHELL/env.HOME (the LOCAL box) win over info.shell/info.home, and PATH is the local PATH. The watcher's terminal.spawn forwards the input verbatim to kaval (server.ts:261) — it does NOT recompose — so the local env reached the remote shell. Added composeRemoteSpawnInput that inverts the layering: info.shell/info.home are authoritative, env is built fresh (info.path → PATH, then koluIdentityEnv, then plan.env), nothing forwarded from this process. remote.ts:262 now calls it; module docstring corrected. SystemInfoOutputSchema.path already existed for exactly this remote case. Locked by 3 new tests in spawnInput.test.ts.
  • F2 fixed — Confirmed valid: LocalTerminalEndpoint.attach gates on await getTerminal(id)?.handle.ready (local.ts:644) but RemoteTerminalEndpoint.attach went straight to session.acquire()+terminalAttach.get. The remote proxy has a ready promise and remote spawn is slower (post-provision), so the race is worse. Added the same await getTerminal(id)?.handle.ready before acquiring. Test 'attach waits for the spawn to resolve before opening the stream' asserts terminalAttach.get is deferred until spawn resolves.
  • F3 fixed — Confirmed valid: LocalTerminalEndpoint.spawnViaClient (local.ts:393) checks getTerminal(id) after spawn and kills the orphan if gone; remote spawnAndWire only conditionally updated entry yet still called proxy.markReady(res.pid) and never killed the orphaned remote PTY. Added the killed-during-spawn branch: if !getTerminal(id) after spawn, markFailed, await client.surface.terminal.kill({id}) (logging kill failures), return without markReady. Test 'kills the remote PTY when the terminal is killed mid-spawn' added.
  • F4 fixed — Confirmed valid: the four fs/git input schemas carry optional hostId (git/schemas.ts:103/125/162/182) and surface.ts routes via endpointFor(input.hostId), but CodeTab.tsx sent only repo/file fields, so a remote terminal fell back to the local endpoint. Added const hostId = () => props.meta?.location?.hostId and threaded it into localStatus/branchStatus/activeStatus (gitStatus), the gitDiff input, fsListAll, and via a new hostId prop on BrowseFileDispatcher into the fsReadFile input.
  • F5 partial — Confirmed valid: the binary-preview url is served by the local /api/terminals/:id/file/* route, which reads bytes off kolu-server's own filesystem (index.ts:227 getTerminal(id).meta.git.repoRoot + createDirServer), so a remote file 404s or serves an unrelated same-path local file. The full fix (proxy bytes through the watcher) is a large new route, out of this round's scope. Implemented codex's stated fallback: surface.ts fsReadFile now throws ORPCError NOT_IMPLEMENTED for a remote (isRemoteHost) binary-previewable file — the Code-tab onError toasts it — instead of minting a wrong-filesystem URL. Text reads already route correctly over the watcher. Added isRemoteHost to registry.ts as the single local-vs-remote predicate. Remaining work (watcher-proxying preview route) deferred.
  • F6 fixed — Confirmed valid: subscribeChange (remote.ts) awaited this.session.acquire() BEFORE the try in a floated void (async () => {...})() task, so an acquire rejection on an unreachable/provisioning-failed host became an unhandled rejection, which the process's unhandledRejection handler (index.ts:164) turns into a process exit — a dead remote host installing a repo/file watcher could take down kolu-server. Moved acquire inside the try with an acquired boolean gating the finally release so we never release a lease we never took; non-abort failures are logged.

commit: 45f7ffd0072909b526f52a1edf2f475efc8d9543

Round 2

codex — approved: true

CLAUDE addressed the six findings from the prior round. The remote spawn path now uses host-authoritative spawn input, remote attach now waits for the sync shadow to resolve, the killed-during-spawn orphan PTY case is cleaned up, Code-tab fs/git calls carry host identity through to the server, remote binary previews now fail closed instead of serving a local-filesystem URL, and remote change subscriptions no longer float an unhandled acquire rejection.

Findings:

  • F1 · blocking · resolved — Resolved. Remote spawn now calls composeRemoteSpawnInput, and that helper builds the environment from remote system.info (shell, home, path) instead of forwarding the kolu-server process env. (packages/server/src/terminalEndpoint/remote.ts:265)
  • F2 · blocking · resolved — Resolved. RemoteTerminalEndpoint.attach now awaits getTerminal(id)?.handle.ready before opening the watcher attach stream, matching the local endpoint's race guard. (packages/server/src/terminalEndpoint/remote.ts:320)
  • F3 · blocking · resolved — Resolved. After remote spawn returns, the code now detects that the local registry entry was removed, marks the proxy failed, kills the just-created remote PTY, and returns without marking it ready. (packages/server/src/terminalEndpoint/remote.ts:275)
  • F4 · blocking · resolved — Resolved. CodeTab derives hostId from terminal metadata and threads it into gitStatus, gitDiff, fsListAll, and through BrowseFileDispatcher into fsReadFile. Current Code-tab remote fs/git reads now route to the remote endpoint. (packages/client/src/right-panel/CodeTab.tsx:321)
  • F5 · major · resolved — Resolved. The full remote binary-preview proxy route is still deferred, but the code now takes the acceptable fail-closed path: remote binary-previewable fsReadFile calls throw a typed NOT_IMPLEMENTED instead of minting a URL backed by the kolu-server local filesystem. (packages/server/src/surface.ts:251)
  • F6 · major · resolved — Resolved. subscribeChange now performs session.acquire() inside the try and only releases when acquisition succeeded, so unreachable/provisioning-failed hosts are logged instead of becoming fatal unhandled rejections. (packages/server/src/terminalEndpoint/remote.ts:456)

codex on the rebuttal: F1-F4 and F6 were fixed as raised. For F5, I accept the scoped degradation: returning NOT_IMPLEMENTED for remote binary previews is the fail-closed fallback I suggested, and it removes the wrong-filesystem URL behavior from the current Code-tab path now that hostId is threaded through the caller.

claude(no author turn this round)

(no author turn — the debate ended this round)

@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

👮 Code-police

Ran with --no-elegance (the /simplify track already covered this tree).

Pass Findings Outcome
Rules 1 no-silent-error-swallowingRemoteTerminalEndpoint.bridge()'s mirror-pump cursor.next() catch broke out with no log, so a real failure read like a clean shutdown. Fixed: log at error with { host, err } before exiting the pump (fix(police): log when the bridge mirror-pump cursor fails).
Fact-check 0 Clean. Positive notes: config validation throws clearly on bad KOLU_HOSTS_JSON / KOLU_WATCHER_AGENT_DRVS_JSON; withClient releases in finally; subscribeChange guards double-release; the spawn race detects mid-spawn kills + reaps the orphan remote PTY; attach awaits ready before opening the stream; SERVER_META_KEYS is schema-derived so new fields mirror for free.
Elegance Skipped (--no-elegance).

The schema-key guard (the #1275 lossy-adoption class) correctly caught Step 1's
additive location field: 'the sentinel covers EVERY persisted SavedTerminal key'
asserts the adopt sentinel has every persisted key, and location was missing.
Adding it also makes the round-trip test verify location is carried through
adoption (it is — adoptedMeta spreads the whole record). CI's ci::unit caught it.
@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

🧪 CI metrics — leased pool box

The x86_64-linux lane ran on kolu-ci-2 (dev-x86-64-linux-04) — commit 2fae31c0, verdict unknown (no per-node timing)

  • Lane wall (pipeline): 0s

(no per-recipe timing in .ci/pc.log)

Pool status (8 boxes)

box location state
kolu-ci-1 dev-x86-64-linux-04 ✓ idle
kolu-ci-2 dev-x86-64-linux-04 ✓ idle
kolu-ci-3 dev-x86-64-linux-03 ✓ idle
kolu-ci-4 dev-x86-64-linux-08 ✓ idle
kolu-ci-5 idliv2-02 ✓ idle
kolu-ci-6 dev-x86-64-linux-03 ✓ idle
kolu-ci-7 dev-x86-64-linux-05 ✓ idle
kolu-ci-8 dev-x86-64-linux-08 ✓ idle

Posted by ci/pu/report.sh. Lane timings from .ci/pc.log; pool state is a live flock probe.

@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

Evidence

P3's headline behavior — kolu-server dials a remote host, spawns a terminal there, and mirrors its fs/git + metadata back into the browser like a local one — is proven here by CI + a hermetic end-to-end composition test, not by stubs. The one thing that genuinely needs live remote infrastructure (a real ssh dial of a second host, to screenshot the host chip on a remote tile) is called out honestly below as the natural follow-up capture.

CI is green on both platforms (the comprehensive validation)

28/28 required checks pass — x86_64-linux and aarch64-darwin. Most relevant to this PR:

  • ci::nix ✅ both platforms — the new kolu-watcher nix package + the per-system drv-map build cross-arch. This is the validation of the Nix-native provisioning path that can't be exercised in a unit test.
  • ci::unit ✅ both platforms — including the watcher's in-process serve+forward test (packages/kolu-watcher/src/server.test.ts): with a fake in-process kaval (no ssh, no nix, no spawned shell) it proves the remote composition end to end — a spawn forwards to kaval and returns its pid, spawning starts the host-side provider DAG which publishes into the served terminalMetadata collection (the mirror source), the pty taps forward snapshot-first, and fs/git serve a real tmp git repo. That is the behavioral proof of the "serve, don't relay" design (option a) on one machine.
  • ci::e2e ✅ both platforms — all 457 scenarios pass. The host chip renders only for remote tiles, so the existing local-terminal e2e suite is unaffected: the local path is unregressed.

The visible surface (and why the live capture is a follow-up)

The new on-screen affordances are the command palette's "Connect to host…" leaf (in the New terminal group) and the per-tile host chip (machine name + a green/amber/red health dot, rendered only for remote terminals — absence = local). Capturing the chip on a real remote tile requires a configured, reachable ssh host that kolu-server can provision kolu-watcher onto — i.e. a second box (or a localhost-as-remote e2e with passwordless ssh + nix trusted-users). That live ssh dial can't be recorded under this run's constraints (single CI box; the app is never run on the author's machine), so it's the natural next evidence rather than something stubbed to fake a screenshot. The behavior it would show is already proven hermetically above.

srid added 4 commits June 16, 2026 14:27
Carry id + label on the HostProgress shape and have both the chip-button tooltip and the popover header read them from that one object, instead of assembling "${hostId} — ${label}" at three independent sites. A label-copy change now lands in one place. The popover takes the whole progress shape, dropping its separate hostId/state props.

Agreed by the lowy ⇄ hickey lens debate (finding hickey-5, raised by hickey). Not pushed or merged.
F1 (fixed): the chip's click-popover gate (hasLog) keyed only on lines.length, but markConnected() does NOT clear progressLines and mapDaemonStatus publishes progress for connected too — so a connected host stayed cursor-pointer/clickable and reopened its stale dial log, contradicting the deliberate "connected host drops progress UI" intent. Gated the trigger on (warming || failed) && lines.length (canOpen) — the same lifecycle gate the inline hints use — and added a createEffect that closes the popover the moment canOpen goes false (a dial completing while the popover is open no longer strands the old log).

F2 (partial): kept the <button> element (deliberate — keyboard a11y + popover trigger + the data-testid=terminal-host-chip e2e contract), but when there's no actionable log the chip is now tabindex=-1 + aria-disabled, so keyboard/AT users aren't handed a focusable no-op. Disputed dropping the button or making the "No activity yet" fallback reachable: the log is intentionally lifecycle-gated, not shown on a steady connected host.

Added useHostProgress gate tests (incl. connected-with-retained-progress reports neither warming nor failed) in useDaemonStatus.test.ts. typecheck + biome clean; just fmt run.
The /simplify pass over the dial-progress feature (base 574ff09). Four
reviewer findings all pointed at the same theme — the prepared HostProgress
shape should carry every display fact so call sites re-derive nothing:

- carry `state`, `tone`, and `inLifecycle` on HostProgress (computed once in
  useHostProgress from the presentation row), so:
  - HostProgressButton drops its redundant `state` prop and stops re-deriving
    the dot tone with a second DAEMON_STATE_PRESENTATION lookup — it reads the
    one `progress` object for `data-host-state` (state), the dot (tone), and the
    show/open gate (inLifecycle);
  - the `warming || failed` disjunction, previously spelled at four call sites
    (the two inline hints + canOpen + showLatest), is named once as
    `inLifecycle`.
- collapse the two copy-paste inline hint <Show> blocks in HostChip into one —
  they differed only by gate (warming vs failed) and the pulse; now one <Show
  when={inLifecycle && latest}> with `classList={{ "animate-pulse": warming }}`.
- HostChip drops its own clientDaemonState call (the state now rides progress).

Skipped (noted): producer-side tag strip in surface-nix-host (the deeper
"typed origin" fix is well outside this diff + touches the drishti contract);
deriving label/id (defensible convenience). No behaviour change — the chip's
e2e contract (data-testid, data-host-state, hostId text) is untouched.
code-police `no-thin-wrapper-functions`: `hostProgress(hostId)` was exported
but its whole body was `daemonStatusFor(hostId)?.progress ?? []` with exactly
one call site (`useHostProgress`). Inlined that body at the call site (folding
its doc into a comment) and repointed `stripProgressTag`'s `{@link}` at
`useHostProgress`, the function that now applies it.
@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

⚖️ Lowy ⇄ Hickey lens debate

Consensus after 1 round(s) · lowy + hickey · base 574ff099

Independent findings: lowy=2, hickey=3 — all agreed fix, each applied as its own commit.

Reconstructed from the committed findings: this debate ran with --no-comment (be-review posts once after the push) and its rendered body wasn't persisted across a mid-run context compaction. The commits below are the authoritative record.

The change surfaces a remote host's dial-progress on the host chip. The debate converged on collapsing the progress concept into one prepared shape and deriving every display fact from the presentation table, rather than re-deriving it at each leaf view.

Origin Finding Disposition Commit
lowy Derive the chip's progress gate from the presentation table — replace inline active() = state !== "connected" with table-derived warming/failed, so the chip can't disagree with the dot's tone and a dead host doesn't animate as busy fix 42b602d5
lowy Co-locate stripProgressTag with the progress data it strips (move into useDaemonStatus next to the producer accessor) so the [local]/[remote] tag format and its stripper share one home, not re-parsed at a leaf view fix 2009c999
hickey Collapse host progress into one prepared useHostProgress shape (lines stripped once, latest, warming, failed) so the concept lives in one module instead of split across the chip and the popover fix 1de226be
hickey Factor the chip's trigger + popover into HostProgressButton (mirroring PrUnavailableButton) so "a chip that toggles an anchored log popover" — open signal, trigger ref, stopPropagation — is expressed once fix d48fe9ce
hickey Compose the ${id} — ${label} host-label string once via the prepared shape, so a label change lands in one place instead of three independent sites fix 67c3afb2

Both lenses agreed per-finding; fixes committed locally and now pushed. The human reviews + merges.

@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

Codex ⇄ Claude debate

Consensus after 2 round(s) · codex reviewed at xhigh

Rendered from the debate's persisted round sections (.codex-debate/section-00{1,2}.md): this run was --no-comment (be-review posts once after the push). The per-round record below is verbatim from those sections; the round-1 fixes are commit 97818f15.

Round 1

codex — approved: false

The data plumbing is mostly coherent and the server-side progress propagation is covered, but the client trigger does not actually enforce the stated mid-lifecycle-only behavior. A connected remote can retain and reopen stale dial logs, and the chip is also left as a focusable no-op button when there is no log to show. I did not modify anything; git status --short --untracked-files=all was empty and git diff --check reported no whitespace errors.

Findings:

  • F1 · minor · open — hasLog() is based only on progress.lines.length, while mapDaemonStatus publishes progress: [...s.progressLines] for every state, including connected (packages/server/src/terminalEndpoint/remote.ts:142). HostSession does not clear progressLines on markConnected, so after a successful cold or warm dial the connected chip can remain cursor-pointer/clickable and reopen the old provisioning log. If the popover was open during the dial, it also stays open after the host becomes connected. That contradicts the intended "connected host drops progress UI" behavior and makes stale setup output look actionable/current. (packages/client/src/terminal/HostProgressPopover.tsx:86)
  • F2 · minor · open — The host chip is always rendered as a focusable <button>, but onClick is a no-op when hasLog() is false. That leaves keyboard and assistive-tech users with an interactive control that cannot do anything; the popover's "No activity yet" fallback is effectively unreachable from this trigger path. (packages/client/src/terminal/HostProgressPopover.tsx:93)

claude — Addressed both of codex's minor findings on the host dial-progress chip. F1 (fixed): the click-popover trigger gated only on lines.length, but markConnected() never clears progressLines and mapDaemonStatus publishes progress for the connected state too — so a connected host stayed clickable and could reopen its stale provisioning log, contradicting the deliberate "connected host drops progress UI" intent. Re-gated the trigger on (warming || failed) && lines.length (canOpen) — the same lifecycle gate the inline hints already use — and added a createEffect that closes the popover the instant canOpen goes false, so a dial that completes mid-popover no longer strands the log. F2 (partial): kept the (deliberate: keyboard a11y, popover trigger, and the data-testid=terminal-host-chip e2e contract) but set tabindex=-1 + aria-disabled when there's no actionable log, so keyboard/AT users aren't handed a focusable no-op; disputed dropping the button or making the "No activity yet" fallback reachable, since the log is intentionally lifecycle-gated. Added 4 useHostProgress gate tests (incl. connected-with-retained-progress reports neither warming nor failed). typecheck + biome clean across all packages; just fmt run; all 12 useDaemonStatus tests pass.

  • F1 fixed — Confirmed codex's facts: hostSession.ts:551-556 markConnected() only flips connection to 'connected' and never clears progressLines, and remote.ts:142 mapDaemonStatus returns progress: [...s.progressLines] for every state incl. connected. So with the old hasLog()=lines.length>0 gate, a successfully-dialed host kept cursor-pointer and reopened the stale log, and an open popover survived the connect — contradicting deliberate decision Add NixOS module for deployment #4 ('a connected host drops it'), which the inline warming/failed hints already honor. Fix in HostProgressPopover.tsx: replaced hasLog() with canOpen() = (progress.warming || progress.failed) && progress.lines.length > 0 (same lifecycle gate as the inline hints, so dot/hint/trigger can't disagree), used it for the click guard and the cursor classes, and added a createEffect that calls setOpen(false) whenever canOpen() is false so a dial completing while the popover is open closes it. Locked with a new test: a connected host carrying a retained dial log reports warming=false, failed=false.
  • F2 partial — Fixed the dead-control concern: when canOpen() is false the chip is now tabindex={-1} and aria-disabled={true}, so keyboard/assistive-tech users aren't presented with a focusable no-op. Disputed the rest: I kept the element a
rather than swapping to a span — that is a deliberate decision (author note #3: keyboard a11y + popover trigger), and the e2e contract requires data-testid=terminal-host-chip with data-host-state on it (TerminalMeta.tsx renders it via HostProgressButton). I also did not make the popover's 'No activity yet' fallback reachable from the trigger: with F1's gate, the log opens only mid-lifecycle with lines present, by design — the fallback stays as harmless defensive rendering on the reusable popover component, not a steady-connected affordance. Net: the interactive-but-useless-control problem codex flagged is resolved without widening when the log is offered.

commit: 97818f15db64fa134d212e34dbfcb361be84acc7

Round 2

codex — approved: true

CLAUDE's follow-up resolves both prior findings. The trigger now shares the same lifecycle gate as the inline progress hints and closes itself when that gate drops, so connected hosts cannot reopen stale retained logs. For the inactive-button concern, keeping the element as a button is acceptable because it is no longer keyboard-focusable and is exposed as disabled when it has no actionable log.

Findings:

  • F1 · minor · resolved — Previously hasLog() used only progress.lines.length, so connected hosts with retained progressLines stayed clickable and could reopen stale provisioning logs. (packages/client/src/terminal/HostProgressPopover.tsx:100)
  • F2 · minor · resolved — Previously the chip stayed a focusable no-op button when there was no actionable log. (packages/client/src/terminal/HostProgressPopover.tsx:124)

codex on the rebuttal: F1: Accepted. The new canOpen() gate requires warming/failed plus lines, and the effect closes an already-open popover when the host reaches connected, which addresses the stale retained-log path. F2: Conceded. The remaining <button> shape is acceptable because the actual accessibility bug was the focusable actionable-looking no-op; tabindex=-1 plus aria-disabled removes it from keyboard navigation and communicates the inactive state while preserving the deliberate DOM/e2e contract.

claude(no author turn this round)

(no author turn — the debate ended this round)

@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

👮 Code-police

Ran with --no-elegance (the /simplify track already swept this tree) on the dial-progress change, base 574ff099.

Pass Result
Rules 1 finding → fixed
Fact-check clean
Elegance skipped (--no-elegance)

Rules — no-thin-wrapper-functions (fixed)

hostProgress(hostId) was exported but its entire body was daemonStatusFor(hostId)?.progress ?? [], with exactly one call site (useHostProgress). A single-caller trivial wrapper adds indirection without value.

Applied in 56446739: inlined the body at the call site (folding its doc into a comment) and repointed stripProgressTag's {@link} at useHostProgress, the function that now applies it. Typecheck + biome clean; the 19 useDaemonStatus/remote unit tests stay green.

Fact-check

No logic errors. Verified: the optional progress field falls back to []; all five connection states map to a daemon state with no gaps; the ring is bounded server-side (20 lines); stripProgressTag's regex is correct; the createMemo deps are right; the popover auto-closes when the host leaves the lifecycle; latest is read only after a truthy gate.

The simplify pass shortened showLatest's body to inLifecycle, which fits on
one line under the repo's pinned biome — local npx biome (a different version)
left it wrapped, so ci::fmt flagged it. Formatted via nix develop's biome.
@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

🧪 CI metrics — leased pool box

The x86_64-linux lane ran on kolu-ci-2 (dev-x86-64-linux-04) — commit 56446739, failed

  • Lane wall (pipeline): 1m29s
recipe duration
ci::nix 48s
ci::flake-check 44s
ci::unit 43s
ci::smoke 41s
_ci-setup 37s
ci::atlas-sync 37s
ci::pnpm-hash-fresh 34s
ci::surface-example-build 25s
ci::surface-app-example-build 25s
ci::fmt 24s ✗ (exit 1)
ci::biome 24s
ci::install 9s

Pool status (8 boxes)

box location state
kolu-ci-1 dev-x86-64-linux-04 ✓ idle
kolu-ci-2 dev-x86-64-linux-04 ✓ idle
kolu-ci-3 dev-x86-64-linux-03 ✓ idle
kolu-ci-4 dev-x86-64-linux-08 ✓ idle
kolu-ci-5 idliv2-02 ✓ idle
kolu-ci-6 dev-x86-64-linux-03 ✓ idle
kolu-ci-7 dev-x86-64-linux-05 ✓ idle
kolu-ci-8 dev-x86-64-linux-08 ✓ idle

Posted by ci/pu/report.sh. Lane timings from .ci/5644673/timings.jsonl; pool state is a live flock probe.

srid added 5 commits June 16, 2026 15:44
The dial-progress ring already reached the client, but it only rendered as a
9px, max-16ch-truncated one-liner on the title-bar host chip (plus a log hidden
behind a click) — and not at all on mobile, where TerminalMetaCompact has no
chip. Meanwhile the tile BODY — the xterm the user is actually staring at for
the ~minute of a cold dial — was a blank black rectangle: zero feedback, looks
hung. That's "where's the client-side progress?".

Add HostDialOverlay: a connecting screen painted over the not-yet-attached
xterm (last child of Terminal's relative wrapper, so DOM order layers it with
no explicit z-index near the resize handle). It reads the SAME useHostProgress
ring the chip does — "Connecting to {host}…" (or "Couldn't reach {host}" on
failure) plus the full, untruncated, auto-scrolling provisioning log — and
unmounts the instant the host leaves its dial lifecycle (inLifecycle false →
connected), so the live terminal shows straight through. No new server plumbing.

Because it lives in the body it also renders on mobile (fixing the no-chip gap),
and a failed dial now shows WHY in-place instead of a silent vanish. The chip
stays as compact title-bar status (dot + name + on-demand log popover); its
redundant, illegible inline hint is dropped now that the body is the prominent
progress surface.
The remote kolu-watcher runs the SAME provider DAG as kolu-server, whose `pr`
provider shells out to `gh pr view` (via KOLU_GH_BIN). The watcher wrapper set
git (for kolu-git) but not gh, so on every poll the provider threw "KOLU_GH_BIN
is not set", classifyGhError fell through to `unknown`, and it logged a level-50
error on the remote — e.g.

  [host:pureintent remote] gh pr view: unknown error
    err: KOLU_GH_BIN is not set. Run kolu through the Nix wrapper…

`gh pr view` resolves via the repo's git remote-tracking, so it must run
host-side (the remote has the repo), not server-side. Pin KOLU_GH_BIN on the
watcher wrapper exactly as kolu-server does — gh joins the watcher's closure, so
it's present after the cold-dial copy. Where the remote gh is authed PR status
now works; where it isn't, gh's stderr classifies as the KNOWN `not-authenticated`
(a graceful, lower-level unavailable with actionable UI), not an `unknown` error.

Remote gh auth itself (forwarding the local token) is a separate follow-up.
Merging master (#1384) made `ci::biome` fail on warnings, not just errors. Three
P3 files carried warnings the old gate ignored:
- server.test.ts, remote.ts: `watcherSurface` is used only as a type → import type
- surface.ts: `ORPCError` was unused (only referenced in comments) → dropped,
  keeping the used `implement`.
Adversarial review of the connecting-screen overlay found two real defects:

- The `z-20` overlay sat over the xterm with default pointer-events, so in the
  brief window where `inLifecycle` flips false and Solid's `<Show>` tears the
  node down asynchronously, it could swallow a keystroke meant for the now-live
  terminal. It's purely visual (header + log, nothing to click) → mark it
  `pointer-events-none`; input always reaches the xterm and the race is gone.
- The autoscroll `createEffect` lived in DialScreen, outside the `<Show>`, so on
  the first render `logRef` was unbound and the first line never pinned. Split
  the body + effect into `DialBody`, which mounts only under the `<Show>` — its
  ref is bound before its effect first runs.

Also corrected the stale file-header claim of "no explicit z-index" (it carries
z-20, contained by Terminal's `isolate` wrapper).
@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

🧪 CI metrics — leased pool box

The x86_64-linux lane ran on kolu-ci-2 (dev-x86-64-linux-04) — commit 95b5e9bf, failed

  • Lane wall (pipeline): 3m31s
recipe duration
ci::e2e 2m51s
ci::home-manager 1m18s
ci::smoke 51s
ci::unit 50s
ci::nix 49s
ci::flake-check 48s
ci::atlas-sync 37s
ci::pnpm-hash-fresh 35s
ci::surface-example-build 33s
ci::surface-app-example-build 33s
_ci-setup 32s
ci::biome 32s ✗ (exit 1)
ci::fmt 31s
ci::install 7s

Pool status (8 boxes)

box location state
kolu-ci-1 dev-x86-64-linux-04 ✓ idle
kolu-ci-2 dev-x86-64-linux-04 ✓ idle
kolu-ci-3 dev-x86-64-linux-03 ✓ idle
kolu-ci-4 dev-x86-64-linux-08 ✓ idle
kolu-ci-5 idliv2-02 ✓ idle
kolu-ci-6 dev-x86-64-linux-03 ✓ idle
kolu-ci-7 dev-x86-64-linux-05 ✓ idle
kolu-ci-8 dev-x86-64-linux-08 ✓ idle

Posted by ci/pu/report.sh. Lane timings from .ci/95b5e9b/timings.jsonl; pool state is a live flock probe.

srid added 6 commits June 16, 2026 17:03
Session restore dropped the saved `location.hostId`: `handleCreate(cwd, meta)`
was called without the 3rd `hostId` arg, so every remote terminal restored as a
LOCAL shell — the durable remote kaval was abandoned and the "reconnect where
you left off" promise died across a kolu restart.

Three coordinated changes:
- Boot (index.ts): dial not just the configured hosts but every host that owns a
  terminal in the SAVED SESSION, so a durable remote kaval's terminals adopt back
  (you rejoin them) before the client reaches restore — even for an ad-hoc host
  that isn't in KOLU_HOSTS_JSON.
- Restore (useSessionRestore.ts): thread `t.location?.hostId` into handleCreate so
  a restored remote terminal spawns on the right endpoint.
- Dedup guard: skip any saved terminal already live in the store. A durable daemon
  (local kaval across an update, or a re-dialed remote host) re-adopts by id; since
  adoption preserves the id and restore mints fresh ids, "already live by saved id"
  ⇒ adopted ⇒ skip — otherwise the boot-dial adoption would race restore and spawn
  a duplicate shell beside the adopted one.
The dock rows gave no sign of which terminals live on a remote host. Add a
shared `RemotePip` (RowPips) — the machine name plus that host's health dot
(green connected · amber provisioning · red unreachable), mirroring the tile
title bar's HostChip — rendered on line 2 beside the PR pip in both the desktop
dock and the mobile drawer. Renders nothing for local terminals.
…host (watcher 1.1)

P3 remote-parity contract trio — three host-side concerns that previously
always ran on kolu-server now follow a terminal to its owning host:

- worktree: git.worktreeCreate/worktreeRemove gain an optional hostId; the
  router resolves the owning endpoint via endpointFor(hostId) and the remote
  endpoint forwards to the watcher's new git.worktreeCreate/worktreeRemove
  procedures, so a remote tile's worktree lands on the host the repo lives on.
- file upload: pasteImage/uploadFile write through the terminal's endpoint
  (TerminalEndpointFs.writeFile) — local to kolu-server's scratch, remote to
  the host's per-terminal scratch via the watcher's new fs.writeFile — so the
  bracketed-paste path resolves where the PTY actually runs.
- kaval identity: the remote endpoint queries the absorbed system.version once
  on connect and folds the remote kaval's identity + startedAt onto every
  published daemonStatus; the IdentityRail renders one kaval column per host
  (activeHostIds), and the dialog/update-chip/restart are gated to LOCAL_HOST
  (a remote kaval is pinned to its watcher's Nix build).

WATCHER_CONTRACT_VERSION 1.0 -> 1.1 (additive). There is no runtime skew gate
on this version today (the dial validates kaval's PTY_HOST_CONTRACT_VERSION,
not this) — a stale watcher fails the new procedures per-call with a clear
error, and a normal re-dial re-realises the watcher from the server's drv map,
picking up 1.1.

makeFsGit now returns BaseFs/BaseGit (the shared repo-keyed reads); each
consumer adds its host-binding writeFile/worktree methods to complete the
widened TerminalEndpointFs/TerminalEndpointGit interfaces.
…er's

A remote terminal's shell sourced the user's rc (the prompt rendered) but its
PATH was the kolu-watcher's restricted Nix PATH (node+git+gh): the watcher is
launched over a NON-login ssh session, and kaval's system.info reported the
watcher process's own PATH, which composeRemoteSpawnInput hands the spawned
shell. So user-profile tools (zoxide, …) weren't found — `cd` fell through to
command-not-found.

The watcher now captures the user's LOGIN PATH once at startup — runs the login
shell (`<shell> -l -c 'echo "$PATH"'`, 5s-bounded, falls back to the watcher PATH
on any failure) so /etc/profile + the user's profile build the real PATH incl.
the nix profile dirs — and serves THAT as system.info.path. Contained to the
watcher; no change to local kaval or kaval-tui. NEEDS live-remote validation:
dial a real host and confirm `echo $PATH` in the PTY carries ~/.nix-profile/bin
and that zoxide resolves.
…he feed

The activity feed silently dropped remote terminals on two fronts.

1. getRecentRepos pruned every entry whose repoRoot failed existsOnDisk
   against the LOCAL filesystem. A remote terminal's repoRoot lives on the
   remote host, so it was never on this machine's disk and vanished from the
   recent-repos feed on the first read. Entries now carry an optional
   `hostId` (absent ⇒ local); the prune only stat-checks LOCAL entries and
   keeps remote ones unconditionally (we can't cheaply stat a remote host's
   fs from here).

2. trackRecentRepo / trackRecentAgent were only wired from
   LocalTerminalEndpoint's provider DAG. RemoteTerminalEndpoint runs no DAG
   of its own (it lives in the watcher), so remote activity was never
   tracked. The metadata mirror (onRemoteMeta) now tracks the remote repo +
   agent on CHANGE — matching the local git/agent providers — stamped with
   the dialed hostId.

The MRU dedup key for both feeds now folds in hostId so the same repoRoot
(or agent command) on different hosts — or local vs remote — stays a
distinct entry instead of collapsing onto one another.

`hostId` is an additive optional field: existing state files deserialize
with it absent (⇒ local), so no migration / schema-version bump is needed.
Adversarial review of the upload-to-remote path found real issues:

- LEAK (blocker): the watcher never removed a terminal's scratch dir on exit
  (the local endpoint does via cleanupTerminalScratch). Added cleanupWatcherFile
  wired into stopProviderLayer, so pasted files don't accumulate on the host.
- WORLD-READABLE (blocker): the watcher scratch root was a predictable shared
  path (.../kolu-watcher-scratch) created with default perms — on a multi-user
  host with a /tmp fallback, one user could read another's pasted secrets. Now
  per-USER (uid-scoped name) and every dir is mkdir'd mode 0700; same 0700 made
  explicit on kolu-server's scratch subdir.
- TRAVERSAL (major): a belt-and-suspenders guard rejects a terminalId with a
  separator/`..` (already UUID-validated on the wire, but defend the fs too).
- SIZE (major): a defense-in-depth size cap at the watcher's fs.writeFile.
- DRIFT (major): sanitizeUploadName was DUPLICATED in kolu-server and the
  watcher (security-sensitive, could drift). De-duplicated into the one
  browser-safe shared module (kolu-common/upload, beside the size guards) —
  single source; both import it. Test moved to upload.test.ts.
@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

🧪 CI metrics — leased pool box

The x86_64-linux lane ran on kolu-ci-2 (dev-x86-64-linux-04) — commit 1eecb29b, passed

  • Lane wall (pipeline): 4m22s
recipe duration
ci::e2e 2m59s
ci::home-manager 1m46s
ci::smoke 1m21s
ci::nix 1m19s
ci::flake-check 1m17s
ci::pnpm-hash-fresh 1m14s
ci::unit 51s
ci::atlas-sync 50s
ci::install 47s
_ci-setup 34s
ci::surface-example-build 32s
ci::surface-app-example-build 31s
ci::biome 31s
ci::fmt 30s

Pool status (8 boxes)

box location state
kolu-ci-1 dev-x86-64-linux-04 ✓ idle
kolu-ci-2 dev-x86-64-linux-04 🔒 leased
kolu-ci-3 dev-x86-64-linux-03 🔒 leased
kolu-ci-4 dev-x86-64-linux-08 ✓ idle
kolu-ci-5 idliv2-02 ✓ idle
kolu-ci-6 dev-x86-64-linux-03 ✓ idle
kolu-ci-7 dev-x86-64-linux-05 ✓ idle
kolu-ci-8 dev-x86-64-linux-08 ✓ idle

Posted by ci/pu/report.sh. Lane timings from .ci/1eecb29/timings.jsonl; pool state is a live flock probe.

@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

🔌 Remote-terminal feature parity (audit + fixes)

Audited every "local feature" against remote terminals (7-agent sweep), then fixed all gaps. Each fix was adversarially reviewed; the review caught real bugs (incl. two security blockers) that are also fixed here.

Gap Fix Commit
Session restore dropped the host → remote terminals restored local Restore re-dials the host; boot re-dials saved-session hosts so durable remote kavals adopt back; dedup guard prevents adoption/restore duplicates fe33c9d1
Dock didn't mark remote rows Shared RemotePip — host name + health dot, both dock surfaces 35c68579
Worktree create/remove ran on kolu-server's fs; file upload wrote locally; kaval status chrome bar was local-only Route worktree + upload to the host via the watcher contract (1.1); remote kaval identity/uptime folded onto daemonStatus, chrome bar renders one entry per host (restart/update gated to local) 9337de35
Remote shell PATH — your zoxide report Watcher captures the user's login PATH and serves it via system.info (the PTY no longer inherits the watcher's restricted Nix PATH) a543d775
Activity feed dropped remote repos (local existsOnDisk prune) + never tracked remote hostId on the feed entries; skip local-fs prune for remote; track from the metadata mirror 1eecb29b
Upload scratch (review): host disk leak on exit; world-readable dir on a multi-user host; traversal/size gaps; sanitizer duplicated Cleanup on exit; per-user 0700 scratch; id guard + size cap; sanitizeUploadName de-duplicated into one shared kolu-common/upload 09e12a18

Verification: CI green both platforms on 09e12a18 — incl. ci::nix/ci::smoke/ci::home-manager (the watcher rebuilds at contract 1.1) and ci::e2e.

Notes / follow-ups (out of scope here): remote gh auth forwarding (PR status degrades to a graceful not-authenticated); remote kaval restart/update-chip (a remote kaval is pinned to its watcher's Nix build); the contract bump has no runtime skew gate today, so a watcher pinned mid-session fails the 3 new procedures per-call (a clear error) and a normal re-dial re-realises it at 1.1. The remote shell PATH fix wants a live dial on a real host to confirm zoxide resolves — unit tests can't exercise it.

srid and others added 4 commits June 16, 2026 18:40
A remote terminal's tile (and its connecting-screen overlay) didn't appear for
the whole dial — several seconds on a cold dial. Cause: the client renders a
tile only once that terminal has a `terminalMetadata` entry (terminalIds =
meta.keys().filter(has-metadata)), but remote spawnPty published the terminal to
the LIST and never seeded its metadata — that arrived only via the mirror, AFTER
the dial connected. So the tile, and the progress UI that rides it, waited the
entire dial.

spawnPty now upserts the seed metadata (location.hostId + cwd) synchronously, so
the tile renders at once and the connecting screen shows during the dial — same
as a local terminal appears instantly. The mirror's onRemoteMeta enriches the
same entry (git/agent/…) once connected. Covers fresh dials AND cold session
restores (both go through spawnPty).
…ess too

Warm boot-adoption now reaches tile + connecting-overlay parity with a
fresh dial. A durable remote host's saved terminals are pre-registered as
SHADOW tiles at boot — BEFORE the metadata mirror connects — so each
adopted tile and its connecting screen render DURING the re-dial, not in
the multi-second gap after connect.

The gap was structural: a fresh/cold spawn seeds terminalMetadata
synchronously (the client renders a tile only once a terminal has a
metadata entry), but warm adoption arrived solely via the mirror's
onRemoteMeta, which fires only after connect — so an adopted tile blinked
in late, with no connecting screen during the dial.

- RemoteTerminalEndpoint.preRegisterSavedTerminals: for each saved
  terminal (skipping any already adopted by a racing mirror), seed its
  metadata from the saved record (cwd + client-persisted fields) and
  register a RemoteTerminalProxy whose `ready` is deliberately left
  UNRESOLVED — a shadow. The seed is upserted to terminalMetadata so the
  tile + overlay paint immediately; attaches/writes block on `ready` until
  the live PTY is confirmed.
- markReady-on-confirm: onRemoteMeta releases a pre-registered shadow's
  ready gate (and drops it from the set) the first time the mirror
  delivers its metadata — the entry already exists, so it takes the
  enrich path, not the adopt branch.
- orphan reconciliation: the bridge reconciles shadows against the
  watcher's FIRST live snapshot (gated once per mirror connect via a new
  optional mirrorRemoteCollection onSnapshot hook). A shadow the watcher
  doesn't list is an orphan — its shell exited while kolu was down — so
  its dead tile is removed rather than left with a stuck connecting
  overlay.
- index.ts boot dial groups the saved session's remote terminals by host
  and pre-registers them on each dialed RemoteTerminalEndpoint.

onSnapshot is optional/back-compatible (the remote-process-monitor
example consumer is unaffected).
@srid

srid commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

🧪 CI metrics — leased pool box

The x86_64-linux lane ran on kolu-ci-2 (dev-x86-64-linux-04) — commit a0cb5032, passed

  • Lane wall (pipeline): 2m18s
recipe duration
ci::e2e 1m40s
ci::unit 30s
_ci-setup 30s
ci::atlas-sync 26s
ci::pnpm-hash-fresh 23s
ci::home-manager 16s
ci::surface-example-build 11s
ci::surface-app-example-build 11s
ci::smoke 10s
ci::fmt 10s
ci::biome 10s
ci::install 7s
ci::nix 5s
ci::flake-check 5s

Pool status (8 boxes)

box location state
kolu-ci-1 dev-x86-64-linux-04 ✓ idle
kolu-ci-2 dev-x86-64-linux-04 ✓ idle
kolu-ci-3 dev-x86-64-linux-03 ✓ idle
kolu-ci-4 dev-x86-64-linux-08 ✓ idle
kolu-ci-5 idliv2-02 ✓ idle
kolu-ci-6 dev-x86-64-linux-03 ✓ idle
kolu-ci-7 dev-x86-64-linux-05 ✓ idle
kolu-ci-8 dev-x86-64-linux-08 ✓ idle

Posted by ci/pu/report.sh. Lane timings from .ci/a0cb503/timings.jsonl; pool state is a live flock probe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant