Production kolu.service disrupted during an agent /be run — harden host isolation & add a production health gate around evidence
What happened
During a long autonomous /be session (PR #1331), the live production kolu.service went down/restarted, disconnecting the user's clients mid-session. The user experienced it as "production was killed."
What the data shows
systemctl --user show kolu: Result=success, NRestarts=1, ActiveEnterTimestamp = 2026-06-13 13:45:38. So the prior instance stopped cleanly (not OOM, not a signal/crash) and the unit restarted once.
Journal at 13:45:32–33 (immediately before the new instance came up): a websocket disconnected … code:1001 (client going away), git reflog/index/working-tree watchers retired, and an in-flight terminal.attach aborted with AbortError. These are consequences of the restart, not its cause. (The large "NOT_SUPPORTED_ERR": 9 log flood is just the DOMException constant table being serialized for each logged error — not a separate error storm.)
The PR-evidence capture that ran in the same window executed on an ephemeral pu box, not locally: the orchestration transcript shows 27 pu connect kolu-pr-1331 … calls, the e2e harness invoked as pu connect kolu-pr-1331 -- "bash ~/run-evidence.sh", and pu destroy kolu-pr-1331 afterward. So this was not the failure mode from #1109 (Stop the Claude pill spinning forever on orphaned background tasks), in which an agent runs bare just dev / just test-quick locally and binds production's fixed ports.
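The unit facts and journal window cited above can be re-queried with standard systemd tooling. The following is a minimal sketch, not the procedure actually used for this report: the function names (unit_facts, summarize_facts, journal_window) are hypothetical, and the "clean stop" reading simply mirrors the interpretation of Result=success given above.

```sh
# Hypothetical helpers; only the systemctl/journalctl invocations are standard tooling.

# Print the three properties quoted above as Key=Value lines (unit defaults to kolu).
unit_facts() {
  systemctl --user show "${1:-kolu}" -p Result -p NRestarts -p ActiveEnterTimestamp
}

# Read Key=Value lines on stdin and print a one-line reading of them.
summarize_facts() {
  result="" restarts=""
  while IFS='=' read -r key value; do
    case "$key" in
      Result) result="$value" ;;
      NRestarts) restarts="$value" ;;
    esac
  done
  if [ "$result" = "success" ]; then
    echo "clean stop, restarts=$restarts"
  else
    echo "abnormal stop ($result), restarts=$restarts"
  fi
}

# Show the user unit's journal between two timestamps.
# Arguments: unit, since, until.
journal_window() {
  journalctl --user-unit="$1" --since "$2" --until "$3" --no-pager
}
```

For this incident, the calls would look like unit_facts kolu | summarize_facts and journal_window kolu "2026-06-13 13:45:00" "2026-06-13 13:46:00", where the one-minute window is an assumption chosen to bracket the 13:45:32–33 entries.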
Assessment
The restart was clean and I could not attribute it to a specific local port-bind or crash from the agent run. The most plausible contributing factor is resource/host contention: a long agent session on the same host as production runs heavy local work (repeated just check / just fmt → nix develop + biome + tsc, multiple gauntlet subagents, several chrome-devtools-mcp chromium launches), which can make the live instance sluggish enough to be (manually or automatically) restarted. Either way, the user's production was disrupted during agent activity, which should not happen.
Proposed hardening
Production health gate in the agent flow. /be §5 (and /do) should snapshot systemctl --user is-active kolu + main PID before and after the evidence/CI steps, and surface any restart/disruption immediately instead of finishing "green" while the user's instance bounced.
Keep heavy agent work off the production host. Long autonomous sessions that hammer nix develop / biome / browser launches alongside a live kolu.service are a contention risk. Consider running the agent (or at least its build/test/evidence orchestration) on a separate box, or nice/cgroup-limiting agent-spawned builds so they can't starve production.
Re-affirm and verify the pu-box rule. Evidence ran remotely here (good), but the rule (evidence must run on pu boxes, never local just dev/test-quick; see #1109, Stop the Claude pill spinning forever on orphaned background tasks) should be machine-checked: the evidence skill could assert it never binds a local port and refuse to proceed if production kolu.service is live on the host.
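The health gate and the "refuse if production is live" check proposed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the /be or evidence-skill implementation: all function names and the UNIT variable are hypothetical, and the snapshot uses systemctl show (ActiveState, MainPID, NRestarts) as a stand-in for the is-active + main PID pair described above.

```sh
# Hypothetical sketch; UNIT and every function name here are illustrative.
UNIT="${UNIT:-kolu}"

# Capture the unit's state as Key=Value lines.
snapshot_unit() {
  systemctl --user show "$UNIT" -p ActiveState -p MainPID -p NRestarts
}

# Compare two snapshots. Succeeds (and prints "ok") only if the unit is
# active afterwards and nothing changed, i.e. same PID and restart count.
check_gate() {
  if ! printf '%s\n' "$2" | grep -qx 'ActiveState=active'; then
    echo "DISRUPTED: $UNIT is not active after the run" >&2
    return 1
  fi
  if [ "$1" != "$2" ]; then
    echo "DISRUPTED: $UNIT changed state during the run" >&2
    return 1
  fi
  echo "ok"
}

# Run a command between two snapshots; fail if production bounced,
# otherwise return the wrapped command's own exit status.
with_health_gate() {
  gate_before="$(snapshot_unit)"
  gate_status=0
  "$@" || gate_status=$?
  gate_after="$(snapshot_unit)"
  check_gate "$gate_before" "$gate_after" || return 1
  return "$gate_status"
}

# Refuse local evidence runs while the production unit is active on this host.
preflight_local_evidence() {
  if systemctl --user is-active --quiet "$UNIT"; then
    echo "refusing local evidence: $UNIT is live on this host" >&2
    return 1
  fi
}
```

A run would then be wrapped as with_health_gate pu connect kolu-pr-1331 -- "bash ~/run-evidence.sh", so a changed MainPID or NRestarts fails the step even when the evidence itself passed. For the contention item, one possible form of the nice/cgroup limit is systemd-run --user --scope -p CPUWeight=20 nice -n 19 just check; the weight value is arbitrary, and it only takes effect if the CPU controller is delegated to the user session.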
Environment
feat/code-scope-segments), commit 9db5fefc
pureintent, kolu.service (user unit) on 100.122.32.106:7692, nix-store build …-kolu-stamped
just dev), [Log] pu issues #1204 (pu-box issues log)