Production kolu.service disrupted during agent /be run — harden host isolation & add a production health gate #1334

@srid

Production kolu.service disrupted during an agent /be run — harden host isolation & add a production health gate around evidence

What happened

During a long autonomous /be session (PR #1331), the live production kolu.service went down/restarted, disconnecting the user's clients mid-session. The user experienced it as "production was killed."

What the data shows

  • systemctl --user show kolu: Result=success, NRestarts=1, ActiveEnterTimestamp = 2026-06-13 13:45:38. So the prior instance stopped cleanly (not OOM, not a signal/crash) and the unit restarted once.
  • Journal at 13:45:32–33 (immediately before the new instance came up): a websocket disconnected … code:1001 (client going away), git reflog/index/working-tree watchers retired, and an in-flight terminal.attach aborted with AbortError. These are consequences of the restart, not its cause. (The large "NOT_SUPPORTED_ERR": 9 log flood is just the DOMException constant table being serialized for each logged error — not a separate error storm.)
  • The PR-evidence capture that ran in the same window executed on an ephemeral pu box, not locally: the orchestration transcript shows 27 pu connect kolu-pr-1331 … calls, the e2e harness invoked as pu connect kolu-pr-1331 -- "bash ~/run-evidence.sh", and pu destroy kolu-pr-1331 afterward. So this was not the #1109 failure mode ("Stop the Claude pill spinning forever on orphaned background tasks"), in which an agent runs bare just dev / just test-quick locally and binds production's fixed ports.
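The reading of the unit state in the first bullet can be reproduced mechanically. The following is a minimal sketch, assuming the properties are fetched with something like systemctl --user show kolu --property=Result,NRestarts,ActiveEnterTimestamp and passed in as text; the helper names parse_show_output and summarize are hypothetical, and the sample values are the ones quoted above.

```python
def parse_show_output(text: str) -> dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not KEY=VALUE
            props[key.strip()] = value.strip()
    return props


def summarize(props: dict[str, str]) -> str:
    """Turn the three properties into a one-line verdict (hypothetical wording)."""
    result = props.get("Result", "unknown")
    restarts = int(props.get("NRestarts") or 0)
    started = props.get("ActiveEnterTimestamp", "unknown")
    # Result=success means the last run did not end in a failure state
    # such as oom-kill, signal, or exit-code.
    exit_kind = "clean exit" if result == "success" else f"abnormal exit ({result})"
    return f"{exit_kind}; NRestarts={restarts}; current instance active since {started}"


# Values as reported in this issue.
sample = """\
Result=success
NRestarts=1
ActiveEnterTimestamp=2026-06-13 13:45:38
"""
print(summarize(parse_show_output(sample)))
```

This only classifies how the previous instance ended; it says nothing about what triggered the restart, which is the open question in the assessment below.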

Assessment

The restart was clean, and I could not attribute it to a specific local port-bind or crash from the agent run. The most plausible contributing factor is resource/host contention: a long agent session on the same host as production runs heavy local work (repeated just check / just fmt, nix develop + biome + tsc, multiple gauntlet subagents, several chrome-devtools-mcp chromium launches), which can make the live instance sluggish enough to be restarted, whether manually or automatically. Either way, the user's production was disrupted during agent activity, which should not happen.

Proposed hardening

  1. Production health gate in the agent flow. /be §5 (and /do) should snapshot systemctl --user is-active kolu + main PID before and after the evidence/CI steps, and surface any restart/disruption immediately instead of finishing "green" while the user's instance bounced.
  2. Keep heavy agent work off the production host. Long autonomous sessions that hammer nix develop / biome / browser launches alongside a live kolu.service are a contention risk. Consider running the agent (or at least its build/test/evidence orchestration) on a separate box, or nice/cgroup-limiting agent-spawned builds so they can't starve production.
  3. Re-affirm and verify the pu-box rule. Evidence ran remotely here (good), but the rule (evidence must run on pu boxes, never local just dev/test-quick; see #1109, "Stop the Claude pill spinning forever on orphaned background tasks") should be machine-checked: the evidence skill could assert it never binds a local port and refuse to proceed if production kolu.service is live on the host.
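As an illustration of items 1 and 3, here is a rough sketch of what a health gate could look like. It is not an existing part of /be or /do. It assumes the snapshot is taken with systemctl --user show and the ActiveState, MainPID and NRestarts properties (equivalent information to is-active plus the main PID), and every name in it (UnitSnapshot, take_snapshot, detect_disruption, production_is_live) is hypothetical.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitSnapshot:
    active_state: str
    main_pid: int
    n_restarts: int


def parse_snapshot(show_output: str) -> UnitSnapshot:
    """Build a snapshot from the KEY=VALUE output of `systemctl show`."""
    props = dict(
        line.split("=", 1) for line in show_output.splitlines() if "=" in line
    )
    return UnitSnapshot(
        active_state=props.get("ActiveState", "unknown"),
        main_pid=int(props.get("MainPID") or 0),
        n_restarts=int(props.get("NRestarts") or 0),
    )


def take_snapshot(unit: str = "kolu") -> UnitSnapshot:
    """Query the user unit; call once before and once after the evidence/CI steps."""
    out = subprocess.run(
        ["systemctl", "--user", "show", unit,
         "--property=ActiveState,MainPID,NRestarts"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_snapshot(out)


def production_is_live(snapshot: UnitSnapshot) -> bool:
    """Item 3: an evidence step could refuse to run locally when this is True."""
    return snapshot.active_state == "active"


def detect_disruption(before: UnitSnapshot, after: UnitSnapshot) -> list[str]:
    """Item 1: list every sign that production bounced between the snapshots."""
    problems: list[str] = []
    if before.active_state == "active" and after.active_state != "active":
        problems.append(f"unit went from active to {after.active_state}")
    if before.main_pid != after.main_pid:
        problems.append(f"main PID changed: {before.main_pid} -> {after.main_pid}")
    if after.n_restarts > before.n_restarts:
        problems.append(f"NRestarts rose: {before.n_restarts} -> {after.n_restarts}")
    return problems


# Illustrative values only; the PIDs are made up.
before = UnitSnapshot("active", 4242, 0)
after = UnitSnapshot("active", 5151, 1)
for problem in detect_disruption(before, after):
    print("production disrupted:", problem)
```

In a real flow, take_snapshot() would bracket the evidence/CI steps, and a non-empty result from detect_disruption would be surfaced to the user rather than letting the run finish "green".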

Environment

Labels: bug