feat(supervision): auto-resume orphaned worker missions once by Th0rgal · Pull Request #540 · Th0rgal/sandboxed.sh

Th0rgal · 2026-06-12T14:25:22Z

Problem

When a worker mission's runner dies (a deploy SIGTERM or a crash), the watchdog marks it interrupted (orphan_no_runner), and the orchestrator boss then has to notice and call resume_worker manually. One campaign needed 10 such manual resumes, and every prod deploy I did required hand-resuming the running workers.

Fix

The stuck-mission watchdog now issues a single supervised ResumeMission for orphaned missions that have a parent_mission_id (i.e. workers). Auto-resumed workers are tracked in a once-per-process set, so a worker that dies again stays interrupted for the boss to triage and resume storms cannot occur.
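
The once-per-process tracking described above could look roughly like the sketch below. It is an illustration only, not the code from this PR: `MissionId` (here a plain `u64`), `AutoResumeGuard`, and `try_claim` are hypothetical names, and the real watchdog may store the set differently.

```rust
use std::collections::HashSet;

/// Hypothetical mission identifier; the real type in the repo may differ.
type MissionId = u64;

/// Remembers which workers were already auto-resumed during this process's
/// lifetime. The set is never cleared, so it resets only on process restart.
#[derive(Default)]
struct AutoResumeGuard {
    resumed: HashSet<MissionId>,
}

impl AutoResumeGuard {
    /// Returns true only the first time a given mission id is claimed.
    fn try_claim(&mut self, id: MissionId) -> bool {
        // HashSet::insert returns false when the value was already present.
        self.resumed.insert(id)
    }
}
```

With this shape, the watchdog would enqueue a resume only when `try_claim` returns true; a second death of the same worker gets false and stays interrupted.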

Scope / safety

Only Case 2: the mission is Active in the DB but has no live runner, i.e. the runner died for environmental reasons. Deliberately not Case 1 (watchdog_stalled), which can be a genuine hang that should not auto-restart.
Workers only (parent_mission_id present); top-level missions are untouched.
Once per process; resume failure leaves the worker interrupted with a warn log.
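
The three scope rules above can be read as a single decision. The sketch below is a hedged illustration: the enum and function names (`InterruptReason`, `Action`, `decide`) and the `u64` id type are assumptions made for this example, not identifiers taken from the repo.

```rust
/// Why the watchdog interrupted a mission. Variant names mirror the reason
/// strings mentioned in this PR; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InterruptReason {
    /// Case 1: watchdog_stalled, possibly a genuine hang.
    WatchdogStalled,
    /// Case 2: orphan_no_runner, Active in the DB but no live runner.
    OrphanNoRunner,
}

#[derive(Debug, PartialEq, Eq)]
enum Action {
    LeaveInterrupted,
    AutoResume,
}

/// Auto-resume only orphaned workers that have not been resumed before.
fn decide(
    reason: InterruptReason,
    parent_mission_id: Option<u64>,
    already_resumed: bool,
) -> Action {
    match (reason, parent_mission_id, already_resumed) {
        // Case 2, worker (has a parent), first death in this process.
        (InterruptReason::OrphanNoRunner, Some(_), false) => Action::AutoResume,
        // Case 1, top-level missions, and repeat deaths stay interrupted.
        _ => Action::LeaveInterrupted,
    }
}
```

Every combination other than "Case 2, worker, first time" falls through to `LeaveInterrupted`, which keeps the default conservative.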

cargo test --lib: 998 passed.

Note

Medium Risk
Changes mission lifecycle in supervision (automatic ResumeMission for workers); bounded by once-per-process and workers-only, but still affects orchestration behavior after deploys and crashes.

Overview
The stuck-mission watchdog now auto-resumes orphaned worker missions once when it marks them interrupted for Case 2 (DB still Active, no in-memory runner — e.g. deploy SIGTERM or runner crash).

After emitting orphan_no_runner / MissionStatusChanged, it enqueues ResumeMission only when parent_mission_id is set (workers, not top-level missions). A process-lifetime HashSet ensures at most one auto-resume per worker; a second death stays interrupted for manual triage. Failed resumes are logged and left interrupted. Case 1 (watchdog stall / watchdog_stalled) is unchanged — no auto-restart there.
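
The failure handling described above (a failed resume is logged and the worker stays interrupted) might be sketched as follows. This assumes, purely for illustration, that commands are enqueued over a standard-library `mpsc` channel; the `Command` and `Outcome` types and the `enqueue_resume` function are hypothetical, and the real project may use a different channel type and logging facility.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical command type; only the ResumeMission name comes from the PR.
#[derive(Debug, PartialEq, Eq)]
enum Command {
    ResumeMission { mission_id: u64 },
}

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    ResumeEnqueued,
    LeftInterrupted,
}

/// Try to enqueue a resume. On failure, log a warning and do nothing else,
/// so the mission keeps the interrupted status it was already given.
fn enqueue_resume(tx: &Sender<Command>, mission_id: u64) -> Outcome {
    match tx.send(Command::ResumeMission { mission_id }) {
        Ok(()) => Outcome::ResumeEnqueued,
        Err(err) => {
            // Stand-in for a warn-level log line.
            eprintln!("warn: auto-resume of worker {mission_id} failed: {err}");
            Outcome::LeftInterrupted
        }
    }
}
```

Because the mission is marked interrupted before the resume is attempted, the error path needs no rollback.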

Reviewed by Cursor Bugbot for commit c095fea. Bugbot is set up for automated code reviews on this repo.

When a worker mission's runner dies (deploy SIGTERM or crash), the watchdog marks it interrupted (orphan_no_runner) and the boss has to notice and call resume_worker by hand; one campaign needed 10 such manual resumes. The watchdog now issues a single supervised ResumeMission for orphaned missions that have a parent_mission_id, tracked once per process so a worker that dies again stays interrupted for the boss to triage. Scoped to Case 2 (runner death, environmental) only, NOT Case 1 watchdog stalls, which may be genuine hangs that should not auto-restart.

vercel · 2026-06-12T14:25:23Z

Project	Deployment	Actions	Updated (UTC)
sandboxed-dashboard	Ready	Preview, Comment	Jun 12, 2026 2:25pm
sandboxed-sh	Ready	Preview, Comment	Jun 12, 2026 2:25pm

Th0rgal merged commit 058c6e0 into master Jun 12, 2026
10 checks passed

Th0rgal merged 1 commit into master from feat/auto-resume-orphan-workers