Skip to content

feat(supervision): auto-resume orphaned worker missions once#540

Merged
Th0rgal merged 1 commit into
masterfrom
feat/auto-resume-orphan-workers
Jun 12, 2026
Merged

feat(supervision): auto-resume orphaned worker missions once#540
Th0rgal merged 1 commit into
masterfrom
feat/auto-resume-orphan-workers

Conversation

@Th0rgal

@Th0rgal Th0rgal commented Jun 12, 2026

Copy link
Copy Markdown
Owner

Problem

When a worker mission's runner dies — a deploy SIGTERM or a crash — the watchdog marks it interrupted (orphan_no_runner) and the orchestrator boss had to notice and call resume_worker manually. One campaign had 10 such manual resumes; every prod deploy I did needed hand-resumes of the running workers.

Fix

The stuck-mission watchdog now issues a single supervised ResumeMission for orphaned missions that have a parent_mission_id (i.e. workers), tracked in a once-per-process set so a worker that dies again stays interrupted for the boss to triage — no resume storms possible.

Scope / safety

  • Only Case 2 (Active in DB, no live runner = runner death, environmental). Deliberately not Case 1 (watchdog_stalled), which can be a genuine hang that should not auto-restart.
  • Workers only (parent_mission_id present); top-level missions are untouched.
  • Once per process; resume failure leaves the worker interrupted with a warn log.

cargo test --lib: 998 passed.


Note

Medium Risk
Changes mission lifecycle in supervision (automatic ResumeMission for workers); bounded by once-per-process and workers-only, but still affects orchestration behavior after deploys and crashes.

Overview
The stuck-mission watchdog now auto-resumes orphaned worker missions once when it marks them interrupted for Case 2 (DB still Active, no in-memory runner — e.g. deploy SIGTERM or runner crash).

After emitting orphan_no_runner / MissionStatusChanged, it enqueues ResumeMission only when parent_mission_id is set (workers, not top-level missions). A process-lifetime HashSet ensures at most one auto-resume per worker; a second death stays interrupted for manual triage. Failed resumes are logged and left interrupted. Case 1 (watchdog stall / watchdog_stalled) is unchanged — no auto-restart there.

Reviewed by Cursor Bugbot for commit c095fea. Bugbot is set up for automated code reviews on this repo. Configure here.

When a worker mission's runner dies (deploy SIGTERM or crash), the
watchdog marks it interrupted (orphan_no_runner) and the boss had to
notice and resume_worker by hand — 10 such manual resumes in one
campaign. The watchdog now issues a single supervised ResumeMission for
orphaned missions that have a parent_mission_id, tracked once-per-process
so a worker that dies again stays interrupted for the boss to triage.

Scoped to Case 2 (runner death, environmental) only — NOT Case 1
watchdog-stalls, which may be genuine hangs that should not auto-restart.
@vercel

vercel Bot commented Jun 12, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
sandboxed-dashboard Ready Ready Preview, Comment Jun 12, 2026 2:25pm
sandboxed-sh Ready Ready Preview, Comment Jun 12, 2026 2:25pm

Request Review

@Th0rgal Th0rgal merged commit 058c6e0 into master Jun 12, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant