fix: bound graceful shutdown drain to 2s by Isla-Liu · Pull Request #3393 · BloopAI/vibe-kanban

Isla-Liu · 2026-04-24T13:44:57Z

Summary

axum::serve(...).with_graceful_shutdown(...) waits for every in-flight connection to finish before dropping the listener. Long-lived streams (SSE, WebSocket, long-poll) do not drain on their own, so a single open stream can keep the drain future pending indefinitely after the shutdown signal fires.

In normal operation the tokio runtime drop at main() exit catches this by aborting the spawned serve tasks. On Windows, however, CTRL_CLOSE_EVENT (user closes the launcher / terminal window) grants the process only ~5 seconds before TerminateProcess. An unbounded drain leaves no time for perform_cleanup_actions or a clean listener drop — cleanup may never run, and the AFD kernel state of the half-closed listener can linger under a phantom PID.

This change bounds the drain to 2 seconds after shutdown_token.cancel(). If it doesn't complete, abort the listener tasks so their sockets drop before exit, leaving headroom for perform_cleanup_actions within the Windows CTRL_CLOSE_EVENT grace period.

Behaviour changes

Windows: graceful shutdown now completes within Windows' CTRL_CLOSE 5s grace window; perform_cleanup_actions runs reliably.
Linux: long-lived streams (SSE/WS) are aborted 2s after shutdown instead of blocking axum::serve until runtime drop. End-of-process semantics are unchanged — the difference is that cleanup now runs in a consistent bounded window rather than racing the runtime drop.
No new dependencies; no public API changes.

Test plan

cargo build --release --bin server on Windows 11 MSVC; binary boots, accepts traffic, and shuts down cleanly on Ctrl+C with live browser-side SSE / WS connections present
Captured tracing output shows "Shutdown signal received" immediately followed by perform_cleanup_actions logs — drain completes sub-millisecond when peers cooperate
The 2s timeout path is defensive: the tracing::warn!("Graceful shutdown exceeded ...") branch is there for pathological cases (peers that refuse to disconnect) so the listener is still dropped before process exit

Notes

Does not attempt to fully fix the separate Windows-specific quirk where the kernel (AFD) can retain LISTEN socket state under a phantom PID after process exit even with clean shutdown. That requires SO_REUSEADDR handling at the bind site with different security tradeoffs (Windows SO_REUSEADDR is port-hijack semantics) and is deliberately out of scope for this PR.
Reproducer for the overall Windows orphan-listener issue this partially addresses: run a release build, open the UI in Chrome (holds SSE), then close the launcher console. Before this change, subsequent server.exe runs hit Error: Io(Os { code: 10048, kind: AddrInUse ... }). After this change, cleanup is guaranteed to run and the race window is bounded, but a full fix requires SO_REUSEADDR which is left to application-level deployments with loopback-only bindings.

Note

Medium Risk
Changes server shutdown behavior by adding time-bounded drain and forced task aborts, which could affect long-lived connections and shutdown ordering if timeouts are too aggressive.

Overview
Implements a bounded graceful shutdown for the main server and preview proxy: after shutdown is signaled, the code now waits up to 2s for axum drains to complete, then aborts the listener tasks and briefly waits for abort propagation before continuing.

Refactors the shutdown select! to use &mut JoinHandles and captures AbortHandles up front to avoid re-polling completed join handles, ensuring cleanup (perform_cleanup_actions) runs reliably even with long-lived connections (e.g., SSE/WS), especially on Windows.

^{Reviewed by Cursor Bugbot for commit 1e62439. Bugbot is set up for automated code reviews on this repo. Configure here.}

axum::serve's with_graceful_shutdown waits for every in-flight connection to finish before dropping the listener. Long-lived streams (SSE, WebSocket, long-poll) do not drain on their own, so a single open stream can keep the drain future pending indefinitely after the shutdown signal fires. On Windows, CTRL_CLOSE_EVENT (user closes the launcher / terminal window) grants the process only ~5 seconds before TerminateProcess. An unbounded drain leaves no time for perform_cleanup_actions or a clean listener drop before the OS hard-kills the process. Bound drain to 2 seconds after shutdown_token.cancel(); if it doesn't complete, abort the listener tasks so their sockets drop before exit. On Linux, long-lived streams (SSE/WS) are aborted at shutdown instead of blocking the drain until runtime drop — end-of-process semantics are unchanged, just in a consistent bounded window.

Copilot

Pull request overview

Bounds axum::serve(...).with_graceful_shutdown(...) drain time during shutdown so long-lived connections (SSE/WS/long-poll) can’t block process exit indefinitely, improving reliability of cleanup—especially under Windows’ CTRL_CLOSE_EVENT time constraints.

Changes:

Introduces a 2s shutdown drain timeout after shutdown_token.cancel().
Aborts the main/proxy listener tasks if drain exceeds the timeout.
Adjusts task handle usage (&mut JoinHandle) to support bounded waiting and aborting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-24T13:48:24Z

+        tracing::warn!(
+            "Graceful shutdown exceeded {:?}; aborting listeners",
+            SHUTDOWN_DRAIN_TIMEOUT
+        );
+        main_handle.abort();
+        proxy_handle.abort();
+        // Wait for abort to propagate so the listener sockets are dropped.
+        let _ = (&mut main_handle).await;
+        let _ = (&mut proxy_handle).await;
+    }


After the timeout elapses, the code aborts the listener tasks but then awaits both JoinHandles without any bound. If a task fails to observe cancellation promptly, shutdown can still hang past the intended 2s window (and potentially exceed the Windows ~5s close-event grace period). Consider adding a second (short) timeout around the post-abort() joins or otherwise ensuring this path is also time-bounded.

Addressed in 1e62439 — post-abort wait is now wrapped in a second tokio::time::timeout(500ms). Worst-case total shutdown is now 2s (drain) + 0.5s (abort observe) = 2.5s, well inside the Windows CTRL_CLOSE ~5s grace window.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 78b6563. Configure here.}

Cursor Bugbot flagged that `JoinHandle::poll` panics if called after it has already returned `Ready`. The original patch tripped this in two ways: (1) when the `_ = &mut main_handle` branch of `tokio::select!` fires, subsequent `(&mut main_handle).await` inside the drain future re-polls the completed handle and panics; (2) in the timeout branch, if `main_handle` finishes inside the drain but `proxy_handle` does not, the post-abort `(&mut main_handle).await` panics for the same reason. Copilot separately noted the post-abort joins were unbounded, so a task slow to observe cancellation could push total shutdown past the Windows 5s CTRL_CLOSE_EVENT grace window. Switch to `AbortHandle::is_finished()` polling for completion observation — `AbortHandle` has no panic-on-re-poll hazard — and wrap the post-abort wait in a second `tokio::time::timeout(500ms)` so worst-case shutdown (2s drain + 0.5s post-abort) stays well inside the 5s grace window. Addresses: - BloopAI#3393 (comment) (cursor[bot]) - BloopAI#3393 (comment) (Copilot)

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Juanlucasbg · 2026-05-01T23:25:03Z

Aggressive review summary — PR #3393

68-line fix to crates/server/src/main.rs adding a bounded graceful-shutdown drain (2s timeout, then abort listener tasks, then a 500ms post-abort window). Verdict: clean — recommend merge.

Why the change is correct

axum::serve(...).with_graceful_shutdown(...) waits for every in-flight connection to finish before dropping the listener. Long-lived streams (SSE, WebSocket, long-poll) don't drain on their own, so a single open stream can hold the listener socket alive indefinitely. On Windows, when the console delivers CTRL_CLOSE_EVENT the OS grants ~5s before TerminateProcess — if the listener is still bound at that point, AFD kernel state can linger under the dead PID and block rebinding the port until reboot.

The PR's three-phase approach:

0-2s: Poll AbortHandle::is_finished() waiting for graceful drain.
2s+: If still draining, log a warning, abort the listener tasks.
2-2.5s: Give abort a 500ms window to propagate (some tasks observe cancellation lazily).

Total worst-case shutdown ≤ 2.5s, well inside the 5s Windows grace window.

Findings

Structural — PASS: The author noticed a real footgun: JoinHandle::poll panics if called after the handle has already returned Ready (which _ = &mut handle in select! does). The PR sidesteps this by capturing abort_handle() upfront and polling is_finished() instead of re-awaiting the join handle. Excellent comment-block (lines 187-208) walking the next reader through the rationale.
Adversarial — 0 HIGH: Edge cases — both servers already finished before drain → poll loop sees is_finished() && is_finished() immediately → exits without timeout. Both servers hung → 2s timeout fires → abort. One finished, one hung → loop waits for the second → 2s timeout. All safe.
Performance: 20ms poll interval gives 50 polls/sec, max ~100 polls in the 2s window — negligible.
Conventions — PASS: Constants are at module scope (SHUTDOWN_DRAIN_TIMEOUT, POST_ABORT_TIMEOUT, COMPLETION_POLL_INTERVAL); easy to tune.

NITs

Telemetry: When the timeout fires, only a single tracing::warn! is emitted. A counter or structured field (e.g. force_aborted=true) would let operators see how often this path triggers in production.
Test gap: A unit test would be hard for this path (it's main()), but a #[cfg(test)] shim that wraps the drain logic in a function and tests the three branches (drain completes / drain times out / both abort) would lock in the panic-avoidance discipline.
COMPLETION_POLL_INTERVAL = 20ms: Could be larger (e.g. 50ms) — the human-perceptible upper bound is fine. Minor.

Verdict

Approve.

— Reviewed by automated single-pass review (concurrency / shutdown-correctness; full 4-tool battery skipped — diff is well-commented and addresses a specific Windows port-rebind bug).

Copilot AI review requested due to automatic review settings April 24, 2026 13:44

Copilot started reviewing on behalf of Isla-Liu April 24, 2026 13:45 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

cursor Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread crates/server/src/main.rs Outdated

Isla-Liu requested a review from Copilot April 24, 2026 14:12

Copilot started reviewing on behalf of Isla-Liu April 24, 2026 14:12 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: bound graceful shutdown drain to 2s#3393

fix: bound graceful shutdown drain to 2s#3393
Isla-Liu wants to merge 2 commits into
BloopAI:mainfrom
Isla-Liu:fix/bounded-graceful-shutdown

Isla-Liu commented Apr 24, 2026 •

edited by cursor Bot

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Isla-Liu Apr 24, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Juanlucasbg commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Isla-Liu commented Apr 24, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Behaviour changes

Test plan

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Isla-Liu Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Juanlucasbg commented May 1, 2026

Aggressive review summary — PR #3393

Why the change is correct

Findings

NITs

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Isla-Liu commented Apr 24, 2026 •

edited by cursor Bot

Loading