feat: acquisition watchdog — Slack alert on prematurely-ended acquisitions#565

Open
Alpaca233 wants to merge 20 commits into master from acquisition-watchdog

@Alpaca233

Collaborator

Summary

Adds an independent acquisition_watchdog process that posts a single Slack alert when an acquisition ends prematurely — process crash / hang / kill, fatal error, or user abort — covering acquisitions launched from the GUI and from the MCP control server, on Ubuntu and Windows.

The core idea: a crashing process can't report its own death, so the in-process SlackNotifier can never catch a segfault/OOM-kill/power-loss/freeze. The watchdog runs out-of-process and watches on-disk breadcrumbs the engine leaves behind.

  • Breadcrumb protocol: a new dependency-free squid/acquisition_state.py writes a single run.json atomically. It records running at start, a throttled heartbeat (plus progress) during the run, and ended with a computed reason in the worker's finally block. Because the writes happen in the engine (MultiPointController/MultiPointWorker), GUI-launched and server-launched runs are both covered with no extra code.
  • Watchdog: acquisition_watchdog/ (config, alerts, monitor, CLI) polls run.json. It classifies running plus a dead PID or stale heartbeat as crash/hang, and ended with a reason of error, user_abort, or completed_with_errors as an alert; completed is silent. Alerts are de-duplicated per run_id (persisted, so de-duplication survives a restart). The package is lightweight and never imports control or Qt.
  • Notifier split: extracted a shared squid/slack.py sender; the in-process SlackNotifier now announces only clean completions (the watchdog owns premature-end alerts, so nothing is alerted twice). Both read credentials from the same cache/slack_settings.yaml that the GUI writes.
  • Shutdown: quitting mid-acquisition aborts and joins the worker so it records user_abort rather than looking like a crash.
  • Deployment: a systemd user unit (Linux) and a Task Scheduler install.ps1 (Windows) run it as an always-on service. The same code can later run as a remote monitor for power-loss coverage (see spec "Future work").
  • Design spec and implementation plan under docs/superpowers/.
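
A minimal sketch of the breadcrumb idea, not the PR's actual squid/acquisition_state.py: it writes run.json through a temporary file plus os.replace so a reader never sees a half-written file, and it throttles heartbeats. The class name RunStateWriter, the field names (heartbeat_at, started_at, progress), and the 5-second interval are assumptions made for illustration; only beat() and the running/ended states are named in this PR.

```python
import json
import os
import tempfile
import time


def write_json_atomic(path, payload):
    """Write payload to path via a temp file so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # swaps the new file in as a single step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


class RunStateWriter:
    """Hypothetical writer: 'running' at start, throttled beats, 'ended' at finish."""

    def __init__(self, path, run_id, min_interval_s=5.0, clock=time.time):
        self._path = path
        self._clock = clock
        self._min_interval_s = min_interval_s
        now = clock()
        self._last_beat = now
        self._state = {
            "run_id": run_id,
            "pid": os.getpid(),
            "status": "running",
            "started_at": now,
            "heartbeat_at": now,
        }
        write_json_atomic(path, self._state)

    def beat(self, progress=None):
        """Refresh the heartbeat; return False when throttled (no disk write)."""
        now = self._clock()
        if now - self._last_beat < self._min_interval_s:
            return False
        self._last_beat = now
        self._state["heartbeat_at"] = now
        if progress is not None:
            self._state["progress"] = progress  # updated only after the throttle check
        write_json_atomic(self._path, self._state)
        return True

    def end(self, reason):
        """Record the terminal state; intended to be called from a finally block."""
        self._state.update(status="ended", reason=reason, ended_at=self._clock())
        write_json_atomic(self._path, self._state)
```

Injecting the clock keeps the throttle testable without sleeping; the writer stays stdlib-only, in line with the dependency-free goal stated above.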

How it works

GUI/server process                     acquisition_watchdog (independent)
 run_acquisition → run.json(running)    poll run.json every ~5s
 worker loop     → heartbeat+progress   running + (pid dead | stale) → crash/hang → Slack
 finally         → run.json(ended,reason) ended + non-clean reason     → Slack
                                         ended + completed             → silent
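
The decision table in the diagram above can be sketched as a pure function. This is an illustration under assumptions, not the PR's monitor.py: the names classify and pid_alive, the 60-second staleness threshold, the heartbeat_at field, and the mapping of dead PID to "crash" and stale heartbeat to "hang" are all hypothetical. The PID probe is limited to POSIX here and returns None elsewhere, so the sketch falls back to heartbeat staleness alone when liveness is unknown.

```python
import os

# Reasons on an ended run that should raise an alert; "completed" stays silent.
ALERT_REASONS = {"error", "user_abort", "completed_with_errors"}


def pid_alive(pid):
    """Return True or False on POSIX, or None (unknown) on other platforms."""
    if os.name != "posix":
        return None  # degrade: the caller relies on heartbeat staleness alone
    try:
        os.kill(pid, 0)  # signal 0 probes for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def classify(state, now, stale_after_s=60.0, alive=pid_alive):
    """Map a run.json snapshot to an alert kind, or None when nothing is wrong."""
    status = state.get("status")
    if status == "ended":
        reason = state.get("reason")
        return reason if reason in ALERT_REASONS else None
    if status == "running":
        if alive(state["pid"]) is False:
            return "crash"
        if now - state["heartbeat_at"] > stale_after_s:
            return "hang"
    return None
```

Keeping the function free of I/O (the snapshot, the current time, and the liveness probe are all passed in) makes the whole table easy to unit-test.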

Test plan

  • Unit: breadcrumb writer (atomic write, throttled heartbeat), squid.slack sender, watchdog config (reads cache/slack_settings.yaml), alert formatting, monitor classify/dedup/PID-degrade, worker end-reason logic — all pass.
  • Integration (simulation): full running → ended breadcrumb lifecycle via the engine.
  • black --check . clean (224 files); targeted feature suite 104 passed; full suite 1434 passed / 8 skipped.
  • Real deployment check: configure Slack in the GUI, start an acquisition, kill -9 it (and separately abort it), confirm one alert per event.
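
The "one alert per event" expectation depends on de-duplication per run_id that survives a watchdog restart. A minimal sketch of that idea follows; the AlertLedger class, the maybe_alert helper, and the JSON-list file format are hypothetical and are not taken from the PR.

```python
import json
import os


class AlertLedger:
    """Hypothetical persisted set of run_ids that have already been alerted."""

    def __init__(self, path):
        self._path = path
        self._seen = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self._seen = set(json.load(f))

    def should_alert(self, run_id):
        return run_id not in self._seen

    def mark_alerted(self, run_id):
        self._seen.add(run_id)
        tmp = self._path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(sorted(self._seen), f)
        os.replace(tmp, self._path)  # replace the ledger file in one step


def maybe_alert(ledger, run_id, send):
    """Call send(run_id) at most once per run_id; record it only on success."""
    if not ledger.should_alert(run_id):
        return False
    if send(run_id):
        ledger.mark_alerted(run_id)
        return True
    return False
```

Marking the run only after a successful send means a failed Slack call is retried on the next poll instead of being silently dropped.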

Notes

  • CI caveat: on the dev machine the full pytest run exits 139 (segfault) at interpreter teardown, after all tests report pass. The faulting thread is in C-extension/Qt/cupy finalization (this repo already disables memory-profiling in CI to dodge a related fork+threads hazard). The two new Qt integration tests run a full simulated-acquisition teardown that the suite otherwise skips (the equivalent test_MultiPointWorker tests are marked @pytest.mark.skip for a QApplication.processEvents() issue). If CI exits 139, the fix is to skip-mark those two tests, consistent with that existing convention.
  • v1 scope: progress-stall detection and machine-power-loss coverage are intentionally deferred (spec "Future work").

🤖 Generated with Claude Code

Alpaca233 and others added 18 commits June 23, 2026 11:05
…on plan

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…remature alerts

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a .gitignore negation so the Windows Task Scheduler XML (a committed
deployment artifact) is tracked despite the repo-wide *.xml ignore rule.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l.ps1

Removes the binary UTF-16 Task Scheduler XML and the repo-root .gitignore
negation it required; install.ps1 now builds the task inline via
New-ScheduledTask* cmdlets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(the GUI's real source)

The watchdog previously read a non-existent [SlackNotifications] .ini section;
real credentials live in cache/slack_settings.yaml (bot_token/channel_id/enabled),
written by the GUI Slack dialog. Without this the watchdog never alerts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Generalize the worker's self-abort into _abort_due_to_error(): every worker
  self-abort is an error (user aborts arrive via the external flag), so tag the
  cause in one helper. This also fixes 6 error-abort paths (null capture info,
  null frame, job dispatch/exec failure, frame-wait timeouts) that previously
  left _abort_cause unset and were misclassified as user_abort.
- Drop the dead SlackConfig.enabled field (the watchdog gates on watchdog_enabled,
  independent of the GUI's enabled toggle).
- Remove the redundant NullRunStateWriter() re-init inside run_acquisition (the
  pre-try guard already covers the failure path).
- Cache Monitor._base instead of recomputing default_state_dir(); drop the
  redundant per-heartbeat expected_timepoints field (expected.timepoints is
  authoritative); update progress only after the beat() throttle check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
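
The end-reason logic described in this commit can be sketched as below. Only _abort_due_to_error, _abort_cause, and the four reason strings come from this PR; the WorkerSketch class, the abort_requested and error_count attributes, and the precedence order are assumptions for illustration.

```python
class WorkerSketch:
    """Hypothetical stand-in for the worker's abort and end-reason bookkeeping."""

    def __init__(self):
        self.abort_requested = False  # external flag, also set by user aborts
        self._abort_cause = None      # set only when the worker aborts itself
        self.error_count = 0          # hypothetical count of non-fatal errors

    def _abort_due_to_error(self):
        # Every worker self-abort is an error, so tag the cause in one place.
        self._abort_cause = "error"
        self.abort_requested = True

    def end_reason(self):
        if self._abort_cause == "error":
            return "error"
        if self.abort_requested:
            return "user_abort"
        if self.error_count > 0:
            return "completed_with_errors"
        return "completed"


# An error-driven abort is reported as "error", not misread as "user_abort".
worker = WorkerSketch()
worker._abort_due_to_error()
reason = worker.end_reason()
```

Checking the tagged cause before the abort flag is what keeps error paths from falling through to user_abort, which is the misclassification this commit fixes.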

Copilot AI left a comment

Contributor

Pull request overview

Adds an out-of-process “acquisition watchdog” that monitors an on-disk acquisition breadcrumb (run.json) and posts a single Slack alert when an acquisition ends prematurely (crash/hang/error/user abort), while trimming the in-process notifier to only announce clean completions.

Changes:

  • Introduces squid.acquisition_state (atomic run-state breadcrumbs) and a stdlib-only squid.slack sender shared by GUI notifier + watchdog.
  • Adds the standalone acquisition_watchdog package (config, monitor/dedup/classification, alert formatting, CLI) plus service install recipes (systemd/Windows).
  • Wires engine + shutdown behavior to write start/heartbeat/end breadcrumbs and avoid double-alerting; adds unit/integration tests covering the lifecycle.
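
For orientation, a stdlib-only sender for Slack's chat.postMessage Web API method might look like the sketch below. It is not the PR's squid/slack.py: the function signatures, the build_post_request helper, the opener parameter, and the 10-second timeout are assumptions. Slack reports application-level failures with "ok": false in the JSON reply, so the sketch checks that field instead of relying on the HTTP status alone.

```python
import json
import urllib.request

SLACK_POST_URL = "https://slack.com/api/chat.postMessage"


def build_post_request(bot_token, channel_id, text):
    """Build the HTTP request without sending it (easy to inspect in tests)."""
    body = json.dumps({"channel": channel_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {bot_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def post_message(bot_token, channel_id, text, timeout=10.0,
                 opener=urllib.request.urlopen):
    """Return True only when Slack replies with ok=true; never raise on failure."""
    request = build_post_request(bot_token, channel_id, text)
    try:
        with opener(request, timeout=timeout) as response:
            reply = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):  # network errors and malformed JSON
        return False
    return bool(reply.get("ok"))
```

Returning a boolean instead of raising lets both the GUI notifier and the watchdog decide for themselves whether to retry.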

Reviewed changes

Copilot reviewed 27 out of 29 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| software/squid/acquisition_state.py | New atomic breadcrumb writer/reader used by engine + watchdog. |
| software/squid/slack.py | New dependency-free Slack chat.postMessage helper shared across components. |
| software/acquisition_watchdog/__init__.py | Watchdog package marker. |
| software/acquisition_watchdog/__main__.py | CLI entry point for running the watchdog once/forever. |
| software/acquisition_watchdog/config.py | Loads Slack settings from cache/slack_settings.yaml. |
| software/acquisition_watchdog/monitor.py | Poll/classify/dedup logic; sends Slack alerts. |
| software/acquisition_watchdog/alerts.py | Slack alert text + Block Kit formatting. |
| software/acquisition_watchdog/README.md | Operational docs for running/installing the watchdog. |
| software/acquisition_watchdog/systemd/squid-acquisition-watchdog.service | systemd user unit recipe. |
| software/acquisition_watchdog/windows/install.ps1 | Windows Task Scheduler install script. |
| software/control/core/multi_point_controller.py | Writes “running” breadcrumb at acquisition start; passes writer into worker; closes breadcrumb on setup failure. |
| software/control/core/multi_point_worker.py | Heartbeat + end-reason computation; writes end breadcrumb; tags error-driven aborts. |
| software/control/slack_notifier.py | Delegates message send to squid.slack; gates finish messages to clean completion via reason. |
| software/main_hcs.py | On shutdown mid-acquisition, requests abort + joins worker to allow proper user_abort breadcrumb. |
| software/tests/control/conftest.py | Autouse fixture redirects watchdog state dir to tmp during tests. |
| software/tests/control/test_watchdog_breadcrumbs.py | Integration smoke test for “running” breadcrumb creation. |
| software/tests/control/test_watchdog_integration.py | End-to-end simulated acquisition breadcrumb lifecycle test. |
| software/tests/control/test_worker_reason.py | Unit tests for worker end-reason computation. |
| software/tests/control/test_slack_notifier_send.py | Verifies notifier delegates to squid.slack.post_message. |
| software/tests/control/test_notifier_trim.py | Verifies notifier only sends finish message on reason == completed. |
| software/tests/squid/test_slack.py | Unit tests for squid.slack.post_message. |
| software/tests/squid/test_acquisition_state.py | Unit tests for breadcrumb writer/read/throttled heartbeat. |
| software/tests/acquisition_watchdog/__init__.py | Test package marker. |
| software/tests/acquisition_watchdog/test_config.py | Tests Slack-settings path resolution + YAML parsing behavior. |
| software/tests/acquisition_watchdog/test_monitor.py | Tests classify table + dedup persistence behavior. |
| software/tests/acquisition_watchdog/test_alerts.py | Tests alert formatting includes key facts. |
| software/tests/acquisition_watchdog/test_cli.py | Tests CLI --once vs default run mode. |
| software/docs/superpowers/specs/2026-06-23-acquisition-watchdog-design.md | Design spec for the watchdog architecture/protocol. |
| software/docs/superpowers/plans/2026-06-23-acquisition-watchdog.md | Detailed implementation plan and test checklist. |

Comment on lines +47 to +51
return SlackConfig(
    bot_token=(data.get("bot_token") or None),
    channel_id=(data.get("channel_id") or None),
    watchdog_enabled=bool(data.get("watchdog_enabled", True)),
)
Comment thread software/main_hcs.py
Comment on lines +444 to +447
log.info("Acquisition in progress at shutdown; requesting abort before exit.")
mpc.request_abort_aquisition()
if getattr(mpc, "thread", None) is not None:
    mpc.thread.join(timeout=15.0)

Collaborator Author

[Claude Code] Fixed in 7186e26 — after the 15s join we now check thread.is_alive() and log a warning when the worker is still running, making the "exit-then-watchdog-reports-crash" case explicit and diagnosable.
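
The pattern behind this fix is that Thread.join(timeout=...) returns silently whether or not the thread finished, so the caller has to ask is_alive() afterwards. A minimal sketch follows; the stop_worker helper and its log wording are hypothetical and are not the code in main_hcs.py.

```python
import logging
import threading

log = logging.getLogger(__name__)


def stop_worker(request_abort, thread, timeout_s=15.0):
    """Request an abort, wait up to timeout_s, and report whether the worker stopped."""
    request_abort()
    if thread is None:
        return True
    thread.join(timeout=timeout_s)  # returns None whether or not the thread ended
    if thread.is_alive():
        log.warning(
            "Worker still running %.1fs after abort; exiting now will look "
            "like a crash to the watchdog.",
            timeout_s,
        )
        return False
    return True


# Usage: a worker that exits as soon as the stop event is set.
stop = threading.Event()
worker = threading.Thread(target=stop.wait, daemon=True)
worker.start()
stopped = stop_worker(stop.set, worker, timeout_s=2.0)
```

Returning the outcome lets the shutdown path log or branch on the "worker never stopped" case instead of exiting blind.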

Addresses Copilot review: the shutdown abort joins the worker with a 15s
timeout but didn't check the result. If the thread is still alive, os._exit()
follows and the watchdog will report a crash — now logged explicitly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>