feat: acquisition watchdog — Slack alert on prematurely-ended acquisitions#565
Alpaca233 wants to merge 20 commits into
…on plan Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…remature alerts Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a .gitignore negation so the Windows Task Scheduler XML (a committed deployment artifact) is tracked despite the repo-wide *.xml ignore rule. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l.ps1 Removes the binary UTF-16 Task Scheduler XML and the repo-root .gitignore negation it required; install.ps1 now builds the task inline via New-ScheduledTask* cmdlets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(the GUI's real source) The watchdog previously read a non-existent [SlackNotifications] .ini section; real credentials live in cache/slack_settings.yaml (bot_token/channel_id/enabled), written by the GUI Slack dialog. Without this the watchdog never alerts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Generalize the worker's self-abort into _abort_due_to_error(): every worker self-abort is an error (user aborts arrive via the external flag), so tag the cause in one helper. This also fixes 6 error-abort paths (null capture info, null frame, job dispatch/exec failure, frame-wait timeouts) that previously left _abort_cause unset and were misclassified as user_abort.
- Drop the dead SlackConfig.enabled field (the watchdog gates on watchdog_enabled, independent of the GUI's enabled toggle).
- Remove the redundant NullRunStateWriter() re-init inside run_acquisition (the pre-try guard already covers the failure path).
- Cache Monitor._base instead of recomputing default_state_dir(); drop the redundant per-heartbeat expected_timepoints field (expected.timepoints is authoritative); update progress only after the beat() throttle check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
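The abort-tagging fix described in this commit can be illustrated with a minimal sketch. Only `_abort_due_to_error`, `_abort_cause`, and the reason strings come from the PR text; the class name `WorkerSketch`, the `abort_requested` flag, `had_nonfatal_errors`, and `end_reason()` are hypothetical stand-ins, and the real worker may compute its reason differently.

```python
from typing import Optional


class WorkerSketch:
    """Hypothetical stand-in for the worker's abort bookkeeping."""

    def __init__(self) -> None:
        self.abort_requested = False              # external flag, set for user aborts
        self._abort_cause: Optional[str] = None   # set only by the error helper
        self.error_detail: Optional[str] = None
        self.had_nonfatal_errors = False

    def request_abort(self) -> None:
        # User aborts arrive through the external flag and carry no cause.
        self.abort_requested = True

    def _abort_due_to_error(self, detail: str) -> None:
        # Every self-abort goes through this one helper, so the cause is
        # always tagged before the abort flag is raised.
        self._abort_cause = "error"
        self.error_detail = detail
        self.abort_requested = True

    def end_reason(self) -> str:
        if self.abort_requested:
            # An untagged abort can only have come from the user.
            return self._abort_cause or "user_abort"
        return "completed_with_errors" if self.had_nonfatal_errors else "completed"
```

If an error path raised the abort flag without calling the helper, `end_reason()` would fall through to `user_abort`, which is the misclassification the commit says it fixed.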
Pull request overview
Adds an out-of-process “acquisition watchdog” that monitors an on-disk acquisition breadcrumb (run.json) and posts a single Slack alert when an acquisition ends prematurely (crash/hang/error/user abort), while trimming the in-process notifier to only announce clean completions.
Changes:
- Introduces `squid.acquisition_state` (atomic run-state breadcrumbs) and a stdlib-only `squid.slack` sender shared by GUI notifier + watchdog.
- Adds the standalone `acquisition_watchdog` package (config, monitor/dedup/classification, alert formatting, CLI) plus service install recipes (systemd/Windows).
- Wires engine + shutdown behavior to write start/heartbeat/end breadcrumbs and avoid double-alerting; adds unit/integration tests covering the lifecycle.
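The atomic breadcrumb write and throttled heartbeat mentioned in the overview might look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: `write_json_atomic`, `RunStateWriterSketch`, the JSON field names (`status`, `heartbeat_at`, `progress`, and so on), and the 5-second default interval are all hypothetical. The technique shown is the common write-to-temp-file-then-`os.replace` pattern.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write payload so a reader sees the old file or the new one, never a partial one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # rename over the target in a single step
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


class RunStateWriterSketch:
    """Hypothetical breadcrumb writer: running -> heartbeats -> ended."""

    def __init__(self, path: Path, run_id: str, min_interval_s: float = 5.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._path = Path(path)
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_beat = float("-inf")
        self._state = {"run_id": run_id, "pid": os.getpid()}

    def start(self) -> None:
        now = self._clock()
        self._last_beat = now
        self._state.update(status="running", started_at=now, heartbeat_at=now)
        write_json_atomic(self._path, self._state)

    def beat(self, progress: Optional[dict] = None) -> bool:
        now = self._clock()
        if now - self._last_beat < self._min_interval_s:
            return False  # throttled: no disk write, progress left untouched
        self._last_beat = now
        self._state["heartbeat_at"] = now
        if progress is not None:
            self._state["progress"] = progress
        write_json_atomic(self._path, self._state)
        return True

    def end(self, reason: str) -> None:
        self._state.update(status="ended", reason=reason, ended_at=self._clock())
        write_json_atomic(self._path, self._state)
```

Injecting the clock keeps the throttle testable without sleeping; a worker would call `beat()` as often as it likes and let the writer decide when to touch the disk.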
Reviewed changes
Copilot reviewed 27 out of 29 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| software/squid/acquisition_state.py | New atomic breadcrumb writer/reader used by engine + watchdog. |
| software/squid/slack.py | New dependency-free Slack chat.postMessage helper shared across components. |
| `software/acquisition_watchdog/__init__.py` | Watchdog package marker. |
| `software/acquisition_watchdog/__main__.py` | CLI entry point for running the watchdog once/forever. |
| software/acquisition_watchdog/config.py | Loads Slack settings from cache/slack_settings.yaml. |
| software/acquisition_watchdog/monitor.py | Poll/classify/dedup logic; sends Slack alerts. |
| software/acquisition_watchdog/alerts.py | Slack alert text + Block Kit formatting. |
| software/acquisition_watchdog/README.md | Operational docs for running/installing the watchdog. |
| software/acquisition_watchdog/systemd/squid-acquisition-watchdog.service | systemd user unit recipe. |
| software/acquisition_watchdog/windows/install.ps1 | Windows Task Scheduler install script. |
| software/control/core/multi_point_controller.py | Writes “running” breadcrumb at acquisition start; passes writer into worker; closes breadcrumb on setup failure. |
| software/control/core/multi_point_worker.py | Heartbeat + end-reason computation; writes end breadcrumb; tags error-driven aborts. |
| software/control/slack_notifier.py | Delegates message send to squid.slack; gates finish messages to clean completion via reason. |
| software/main_hcs.py | On shutdown mid-acquisition, requests abort + joins worker to allow proper user_abort breadcrumb. |
| software/tests/control/conftest.py | Autouse fixture redirects watchdog state dir to tmp during tests. |
| software/tests/control/test_watchdog_breadcrumbs.py | Integration smoke test for “running” breadcrumb creation. |
| software/tests/control/test_watchdog_integration.py | End-to-end simulated acquisition breadcrumb lifecycle test. |
| software/tests/control/test_worker_reason.py | Unit tests for worker end-reason computation. |
| software/tests/control/test_slack_notifier_send.py | Verifies notifier delegates to squid.slack.post_message. |
| software/tests/control/test_notifier_trim.py | Verifies notifier only sends finish message on reason == completed. |
| software/tests/squid/test_slack.py | Unit tests for squid.slack.post_message. |
| software/tests/squid/test_acquisition_state.py | Unit tests for breadcrumb writer/read/throttled heartbeat. |
| `software/tests/acquisition_watchdog/__init__.py` | Test package marker. |
| software/tests/acquisition_watchdog/test_config.py | Tests Slack-settings path resolution + YAML parsing behavior. |
| software/tests/acquisition_watchdog/test_monitor.py | Tests classify table + dedup persistence behavior. |
| software/tests/acquisition_watchdog/test_alerts.py | Tests alert formatting includes key facts. |
| software/tests/acquisition_watchdog/test_cli.py | Tests CLI --once vs default run mode. |
| software/docs/superpowers/specs/2026-06-23-acquisition-watchdog-design.md | Design spec for the watchdog architecture/protocol. |
| software/docs/superpowers/plans/2026-06-23-acquisition-watchdog.md | Detailed implementation plan and test checklist. |
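For the dependency-free `chat.postMessage` helper listed for `software/squid/slack.py`, a stdlib-only sender could be sketched as follows. The PR names `squid.slack.post_message` but does not show its signature here, so `build_request`, `post_message_sketch`, the injectable `opener` parameter, and the 10-second timeout are assumptions made for illustration. The endpoint URL, bearer-token header, and the `ok` field in the reply follow Slack's Web API.

```python
import json
import urllib.request
from typing import Callable, List, Optional

SLACK_POST_URL = "https://slack.com/api/chat.postMessage"


def build_request(bot_token: str, channel_id: str, text: str,
                  blocks: Optional[List[dict]] = None) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = {"channel": channel_id, "text": text}
    if blocks:
        body["blocks"] = blocks  # optional Block Kit payload
    return urllib.request.Request(
        SLACK_POST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {bot_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def post_message_sketch(bot_token: str, channel_id: str, text: str,
                        blocks: Optional[List[dict]] = None,
                        opener: Callable = urllib.request.urlopen,
                        timeout: float = 10.0) -> bool:
    """Return True only if Slack replied with {"ok": true}; never raise on network errors."""
    req = build_request(bot_token, channel_id, text, blocks)
    try:
        with opener(req, timeout=timeout) as resp:
            reply = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):  # URLError/timeouts are OSError; bad JSON is ValueError
        return False
    return isinstance(reply, dict) and bool(reply.get("ok"))
```

Passing the opener in as a parameter is one way to let both the GUI notifier and the watchdog share a sender that unit tests can exercise without touching the network.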
    return SlackConfig(
        bot_token=(data.get("bot_token") or None),
        channel_id=(data.get("channel_id") or None),
        watchdog_enabled=bool(data.get("watchdog_enabled", True)),
    )
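To make the behavior of that snippet concrete, here is a self-contained sketch built around the same three fields. The `SlackConfig` field names and the `return` expression come from the quoted diff; the frozen dataclass, the `config_from_mapping` wrapper, and the `can_alert` property are hypothetical additions, and `data` is assumed to be the dict already parsed from `cache/slack_settings.yaml`.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class SlackConfig:
    bot_token: Optional[str]
    channel_id: Optional[str]
    watchdog_enabled: bool = True

    @property
    def can_alert(self) -> bool:
        # Hypothetical convenience check, not taken from the PR.
        return bool(self.watchdog_enabled and self.bot_token and self.channel_id)


def config_from_mapping(data: Optional[Mapping[str, Any]]) -> SlackConfig:
    """Build a config from an already-parsed settings mapping (may be None or empty)."""
    data = data or {}
    return SlackConfig(
        bot_token=(data.get("bot_token") or None),    # empty string collapses to None
        channel_id=(data.get("channel_id") or None),
        watchdog_enabled=bool(data.get("watchdog_enabled", True)),  # defaults to on
    )
```

One caveat with the `bool(...)` coercion: a YAML boolean `false` disables the watchdog as expected, but a quoted string such as `"false"` is a non-empty string and would evaluate as enabled.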
    log.info("Acquisition in progress at shutdown; requesting abort before exit.")
    mpc.request_abort_aquisition()
    if getattr(mpc, "thread", None) is not None:
        mpc.thread.join(timeout=15.0)
[Claude Code] Fixed in 7186e26 — after the 15s join we now check thread.is_alive() and log a warning when the worker is still running, making the "exit-then-watchdog-reports-crash" case explicit and diagnosable.
Addresses Copilot review: the shutdown abort joins the worker with a 15s timeout but didn't check the result. If the thread is still alive, os._exit() follows and the watchdog will report a crash — now logged explicitly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
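The join-then-check pattern this commit describes can be sketched in isolation. `Thread.join(timeout=...)` always returns `None`, so the only way to learn whether the worker actually stopped is to call `is_alive()` afterwards. The helper name `join_or_warn`, the logger name, and the message wording are hypothetical; the actual change lives in the shutdown path and may differ.

```python
import logging
import threading

log = logging.getLogger("shutdown_sketch")


def join_or_warn(worker: threading.Thread, timeout_s: float) -> bool:
    """Wait up to timeout_s for the worker; return True if it finished in time."""
    worker.join(timeout=timeout_s)  # returns None whether or not the timeout expired
    if worker.is_alive():
        # The process is about to exit anyway, so the end breadcrumb may never be
        # written and an out-of-process watchdog would see this run as a crash.
        log.warning("Worker still running after %.1fs; exiting without a clean end state.", timeout_s)
        return False
    return True
```

A caller would request the abort first, then call `join_or_warn(mpc.thread, 15.0)` before exiting, so the log records why a later crash alert appeared.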
Summary
Adds an independent `acquisition_watchdog` process that posts a single Slack alert when an acquisition ends prematurely (process crash / hang / kill, fatal error, or user abort), covering acquisitions launched from the GUI and from the MCP control server, on Ubuntu and Windows.

The core idea: a crashing process can't report its own death, so the in-process `SlackNotifier` can never catch a segfault/OOM-kill/power-loss/freeze. The watchdog runs out-of-process and watches on-disk breadcrumbs the engine leaves behind.

- `squid/acquisition_state.py` writes a single `run.json` atomically: `running` at start, a throttled heartbeat (+progress) during the run, and `ended` with a computed `reason` in the worker's `finally`. Written in the engine (`MultiPointController`/`MultiPointWorker`), so GUI- and server-launched runs are both covered with no extra code.
- `acquisition_watchdog/` (config, alerts, monitor, CLI). Polls `run.json`; classifies `running` + dead-PID/stale-heartbeat → crash/hang, and `ended` + {error, user_abort, completed_with_errors} → alert; `completed` is silent. De-duplicates per `run_id` (persisted, survives restart). Lightweight: never imports `control`/Qt.
- … `squid/slack.py` sender; the in-process `SlackNotifier` now announces only clean completions (the watchdog owns premature-end alerts → no double-alerting). Both read credentials from the same `cache/slack_settings.yaml` the GUI writes.
- … `user_abort` rather than looking like a crash.
- … `install.ps1` (Windows) to run it as an always-on service. The same code can later run as a remote monitor for power-loss coverage (see spec "Future work").
- … `docs/superpowers/`.

How it works
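The poll, classify, and de-duplicate steps from the summary can be sketched as below. This is an illustration under stated assumptions: the field names in the state dict, the 60-second staleness threshold, the `classify` signature, and the `AlertedRuns` class are hypothetical, and mapping a dead PID to "crash" and a stale heartbeat to "hang" is one reading of "dead-PID/stale-heartbeat → crash/hang". The PID check is passed in as a callable so the sketch stays platform-neutral.

```python
import json
from pathlib import Path
from typing import Callable, Optional

ALERTING_END_REASONS = {"error", "user_abort", "completed_with_errors"}


def classify(state: dict, now: float, pid_alive: Callable[[int], bool],
             stale_after_s: float = 60.0) -> Optional[str]:
    """Return an alert kind for this breadcrumb, or None if nothing should be sent."""
    status = state.get("status")
    if status == "running":
        if not pid_alive(state["pid"]):
            return "crash"  # process is gone but never wrote an end breadcrumb
        if now - state["heartbeat_at"] > stale_after_s:
            return "hang"   # process exists but stopped heartbeating
        return None
    if status == "ended" and state.get("reason") in ALERTING_END_REASONS:
        return state["reason"]
    return None             # a clean "completed" run stays silent


class AlertedRuns:
    """Persisted set of run_ids that were already alerted, so restarts do not re-alert."""

    def __init__(self, path: Path) -> None:
        self._path = Path(path)
        try:
            self._ids = set(json.loads(self._path.read_text(encoding="utf-8")))
        except (OSError, ValueError):
            self._ids = set()  # missing or corrupt file: start empty

    def should_alert(self, run_id: str) -> bool:
        if run_id in self._ids:
            return False
        self._ids.add(run_id)
        self._path.write_text(json.dumps(sorted(self._ids)), encoding="utf-8")
        return True
```

This sketch records the `run_id` before any message is sent, which favors "at most one alert" over guaranteed delivery; recording only after a successful send would make the opposite trade-off.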
Test plan
- … `squid.slack` sender, watchdog config (reads `cache/slack_settings.yaml`), alert formatting, monitor classify/dedup/PID-degrade, worker end-reason logic: all pass.
- … `running → ended` breadcrumb lifecycle via the engine.
- `black --check .` clean (224 files); targeted feature suite 104 passed; full suite 1434 passed / 8 skipped.
- … `kill -9` it (and separately abort it), confirm one alert per event.

Notes
- … `test_MultiPointWorker` tests are `@pytest.mark.skip` for a `QApplication.processEvents()` issue). If CI exits 139, the fix is to skip-mark those two tests consistent with that existing convention.

🤖 Generated with Claude Code