feat(supervision): auto-resume orphaned worker missions once #540
Merged
When a worker mission's runner dies (deploy SIGTERM or crash), the watchdog marks it interrupted (orphan_no_runner) and the boss had to notice and resume_worker by hand — 10 such manual resumes in one campaign. The watchdog now issues a single supervised ResumeMission for orphaned missions that have a parent_mission_id, tracked once-per-process so a worker that dies again stays interrupted for the boss to triage. Scoped to Case 2 (runner death, environmental) only — NOT Case 1 watchdog-stalls, which may be genuine hangs that should not auto-restart.
Problem
When a worker mission's runner dies (a deploy SIGTERM or a crash), the watchdog marks it interrupted (orphan_no_runner), and the orchestrator boss had to notice and call resume_worker manually. One campaign had 10 such manual resumes; every prod deploy I did needed hand-resumes of the running workers.
Fix
The stuck-mission watchdog now issues a single supervised ResumeMission for orphaned missions that have a parent_mission_id (i.e. workers), tracked in a once-per-process set so a worker that dies again stays interrupted for the boss to triage. Resume storms are not possible.
Scope / safety
Scoped to Case 2 (runner death, orphan_no_runner) only, not Case 1 (watchdog_stalled), which can be a genuine hang that should not auto-restart.
cargo test --lib: 998 passed.
Note
Medium Risk
Changes mission lifecycle in supervision (automatic ResumeMission for workers); bounded by once-per-process and workers-only, but still affects orchestration behavior after deploys and crashes.
Overview
The stuck-mission watchdog now auto-resumes orphaned worker missions once when it marks them interrupted for Case 2 (DB still Active, no in-memory runner, e.g. deploy SIGTERM or runner crash).
After emitting orphan_no_runner / MissionStatusChanged, it enqueues ResumeMission only when parent_mission_id is set (workers, not top-level missions). A process-lifetime HashSet ensures at most one auto-resume per worker; a second death stays interrupted for manual triage. Failed resumes are logged and left interrupted. Case 1 (watchdog stall / watchdog_stalled) is unchanged: no auto-restart there.
Reviewed by Cursor Bugbot for commit c095fea.
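For readers who want the decision rule in one place, here is a minimal sketch of the gating logic described above. It is not the code from this PR: the names InterruptReason, Mission, AutoResumeGuard and should_auto_resume are hypothetical, and the sketch assumes mission ids are plain integers. It only illustrates the three conditions (Case 2 only, workers only, at most once per process).

```rust
use std::collections::HashSet;

/// Why the watchdog interrupted a mission (hypothetical enum; variant
/// names mirror the reasons quoted in the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InterruptReason {
    /// Case 1: a runner exists but the mission stopped making progress.
    WatchdogStalled,
    /// Case 2: the DB still says Active but no in-memory runner exists.
    OrphanNoRunner,
}

/// Hypothetical, simplified mission record (ids assumed to be u64).
pub struct Mission {
    pub id: u64,
    /// Set for worker missions, None for top-level missions.
    pub parent_mission_id: Option<u64>,
}

/// Process-lifetime record of workers that already used their one auto-resume.
#[derive(Default)]
pub struct AutoResumeGuard {
    resumed: HashSet<u64>,
}

impl AutoResumeGuard {
    /// Returns true when the watchdog should enqueue a ResumeMission.
    pub fn should_auto_resume(&mut self, mission: &Mission, reason: InterruptReason) -> bool {
        // Case 1 stalls may be genuine hangs, so they are never auto-restarted.
        if reason != InterruptReason::OrphanNoRunner {
            return false;
        }
        // Only workers (missions with a parent) are eligible.
        if mission.parent_mission_id.is_none() {
            return false;
        }
        // HashSet::insert returns false if the id was already present,
        // which caps auto-resumes at one per worker per process.
        self.resumed.insert(mission.id)
    }
}
```

In this sketch the id is recorded when the decision is made, before the resume is attempted, so a resume that fails is not retried and the mission stays interrupted. Because the set lives in process memory, a restarted process starts with an empty set.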