Production kolu.service disrupted during agent /be run — harden host isolation & add a production health gate

## Production `kolu.service` disrupted during an agent `/be` run — harden host isolation & add a production health gate around evidence

### What happened
During a long autonomous `/be` session (PR #1331), the live production `kolu.service` went down/restarted, disconnecting the user's clients mid-session. The user experienced it as "production was killed."

### What the data shows
- `systemctl --user show kolu`: **`Result=success`, `NRestarts=1`**, `ActiveEnterTimestamp = 2026-06-13 13:45:38`. So the prior instance stopped **cleanly** (not OOM, not a signal/crash) and the unit restarted **once**.
- Journal at 13:45:32–33 (immediately before the new instance came up): a websocket `disconnected … code:1001` (client going away), git reflog/index/working-tree watchers *retired*, and an in-flight `terminal.attach` aborted with `AbortError`. These are **consequences of the restart**, not its cause. (The large `"NOT_SUPPORTED_ERR": 9` log flood is just the DOMException constant table being serialized for each logged error — not a separate error storm.)
- The PR-evidence capture that ran in the same window executed **on an ephemeral `pu` box**, not locally — the orchestration transcript shows 27 `pu connect kolu-pr-1331 …` calls, the e2e harness invoked as `pu connect kolu-pr-1331 -- "bash ~/run-evidence.sh"`, and `pu destroy kolu-pr-1331` afterward. So this was **not** the #1109 failure mode (an agent running bare `just dev` / `just test-quick` locally and binding production's fixed ports).

### Assessment
The restart was clean and I could not attribute it to a specific local port-bind or crash from the agent run. The most plausible contributing factor is **resource/host contention**: a long agent session on the *same host as production* runs heavy local work (repeated `just check` / `just fmt` → `nix develop` + `biome` + `tsc`, multiple gauntlet subagents, several `chrome-devtools-mcp` chromium launches), which can make the live instance sluggish enough to be (manually or automatically) restarted. Either way, the user's production was disrupted during agent activity, which should not happen.

### Proposed hardening
1. **Production health gate in the agent flow.** `/be` §5 (and `/do`) should snapshot `systemctl --user is-active kolu` + main PID **before** and **after** the evidence/CI steps, and surface any restart/disruption immediately instead of finishing "green" while the user's instance bounced.
2. **Keep heavy agent work off the production host.** Long autonomous sessions that hammer `nix develop` / `biome` / browser launches alongside a live `kolu.service` are a contention risk. Consider running the agent (or at least its build/test/evidence orchestration) on a separate box, or nice/cgroup-limiting agent-spawned builds so they can't starve production.
3. **Re-affirm and verify the pu-box rule.** Evidence ran remotely here (good), but the rule (`evidence must run on pu boxes`, never local `just dev`/`test-quick` — see #1109) should be machine-checked: the evidence skill could assert it never binds a local port and refuse to proceed if production `kolu.service` is live on the host.

### Environment
- PR: #1331 (`feat/code-scope-segments`), commit `9db5fefc`
- Host: `pureintent`, `kolu.service` (user unit) on `100.122.32.106:7692`, nix-store build `…-kolu-stamped`
- Time of restart: 2026-06-13 13:45:38 EDT
- Related: #1109 (agent bound production ports via local `just dev`), #1204 (pu-box issues log)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Production kolu.service disrupted during agent /be run — harden host isolation & add a production health gate #1334

Production `kolu.service` disrupted during an agent `/be` run — harden host isolation & add a production health gate around evidence

What happened

What the data shows

Assessment

Proposed hardening

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Production kolu.service disrupted during agent /be run — harden host isolation & add a production health gate #1334

Description

Production kolu.service disrupted during an agent /be run — harden host isolation & add a production health gate around evidence

What happened

What the data shows

Assessment

Proposed hardening

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Production `kolu.service` disrupted during an agent `/be` run — harden host isolation & add a production health gate around evidence