Skip to content

fix: bound graceful shutdown drain to 2s#3393

Open
Isla-Liu wants to merge 2 commits into
BloopAI:mainfrom
Isla-Liu:fix/bounded-graceful-shutdown
Open

fix: bound graceful shutdown drain to 2s#3393
Isla-Liu wants to merge 2 commits into
BloopAI:mainfrom
Isla-Liu:fix/bounded-graceful-shutdown

Conversation

@Isla-Liu

@Isla-Liu Isla-Liu commented Apr 24, 2026

Copy link
Copy Markdown

Summary

axum::serve(...).with_graceful_shutdown(...) waits for every in-flight connection to finish before dropping the listener. Long-lived streams (SSE, WebSocket, long-poll) do not drain on their own, so a single open stream can keep the drain future pending indefinitely after the shutdown signal fires.

In normal operation the tokio runtime drop at main() exit catches this by aborting the spawned serve tasks. On Windows, however, CTRL_CLOSE_EVENT (user closes the launcher / terminal window) grants the process only ~5 seconds before TerminateProcess. An unbounded drain leaves no time for perform_cleanup_actions or a clean listener drop — cleanup may never run, and the AFD kernel state of the half-closed listener can linger under a phantom PID.

This change bounds the drain to 2 seconds after shutdown_token.cancel(). If it doesn't complete, abort the listener tasks so their sockets drop before exit, leaving headroom for perform_cleanup_actions within the Windows CTRL_CLOSE_EVENT grace period.

Behaviour changes

  • Windows: graceful shutdown now completes within Windows' CTRL_CLOSE 5s grace window; perform_cleanup_actions runs reliably.
  • Linux: long-lived streams (SSE/WS) are aborted 2s after shutdown instead of blocking axum::serve until runtime drop. End-of-process semantics are unchanged — the difference is that cleanup now runs in a consistent bounded window rather than racing the runtime drop.
  • No new dependencies; no public API changes.

Test plan

  • cargo build --release --bin server on Windows 11 MSVC; binary boots, accepts traffic, and shuts down cleanly on Ctrl+C with live browser-side SSE / WS connections present
  • Captured tracing output shows "Shutdown signal received" immediately followed by perform_cleanup_actions logs — drain completes sub-millisecond when peers cooperate
  • The 2s timeout path is defensive: the tracing::warn!("Graceful shutdown exceeded ...") branch is there for pathological cases (peers that refuse to disconnect) so the listener is still dropped before process exit

Notes

  • Does not attempt to fully fix the separate Windows-specific quirk where the kernel (AFD) can retain LISTEN socket state under a phantom PID after process exit even with clean shutdown. That requires SO_REUSEADDR handling at the bind site with different security tradeoffs (Windows SO_REUSEADDR is port-hijack semantics) and is deliberately out of scope for this PR.
  • Reproducer for the overall Windows orphan-listener issue this partially addresses: run a release build, open the UI in Chrome (holds SSE), then close the launcher console. Before this change, subsequent server.exe runs hit Error: Io(Os { code: 10048, kind: AddrInUse ... }). After this change, cleanup is guaranteed to run and the race window is bounded, but a full fix requires SO_REUSEADDR which is left to application-level deployments with loopback-only bindings.

Note

Medium Risk
Changes server shutdown behavior by adding time-bounded drain and forced task aborts, which could affect long-lived connections and shutdown ordering if timeouts are too aggressive.

Overview
Implements a bounded graceful shutdown for the main server and preview proxy: after shutdown is signaled, the code now waits up to 2s for axum drains to complete, then aborts the listener tasks and briefly waits for abort propagation before continuing.

Refactors the shutdown select! to use &mut JoinHandles and captures AbortHandles up front to avoid re-polling completed join handles, ensuring cleanup (perform_cleanup_actions) runs reliably even with long-lived connections (e.g., SSE/WS), especially on Windows.

Reviewed by Cursor Bugbot for commit 1e62439. Bugbot is set up for automated code reviews on this repo. Configure here.

axum::serve's with_graceful_shutdown waits for every in-flight connection
to finish before dropping the listener. Long-lived streams (SSE,
WebSocket, long-poll) do not drain on their own, so a single open stream
can keep the drain future pending indefinitely after the shutdown signal
fires.

On Windows, CTRL_CLOSE_EVENT (user closes the launcher / terminal window)
grants the process only ~5 seconds before TerminateProcess. An unbounded
drain leaves no time for perform_cleanup_actions or a clean listener
drop before the OS hard-kills the process.

Bound drain to 2 seconds after shutdown_token.cancel(); if it doesn't
complete, abort the listener tasks so their sockets drop before exit.
On Linux, long-lived streams (SSE/WS) are aborted at shutdown instead of
blocking the drain until runtime drop — end-of-process semantics are
unchanged, just in a consistent bounded window.
Copilot AI review requested due to automatic review settings April 24, 2026 13:44

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Bounds axum::serve(...).with_graceful_shutdown(...) drain time during shutdown so long-lived connections (SSE/WS/long-poll) can’t block process exit indefinitely, improving reliability of cleanup—especially under Windows’ CTRL_CLOSE_EVENT time constraints.

Changes:

  • Introduces a 2s shutdown drain timeout after shutdown_token.cancel().
  • Aborts the main/proxy listener tasks if drain exceeds the timeout.
  • Adjusts task handle usage (&mut JoinHandle) to support bounded waiting and aborting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/server/src/main.rs
Comment on lines +223 to +232
tracing::warn!(
"Graceful shutdown exceeded {:?}; aborting listeners",
SHUTDOWN_DRAIN_TIMEOUT
);
main_handle.abort();
proxy_handle.abort();
// Wait for abort to propagate so the listener sockets are dropped.
let _ = (&mut main_handle).await;
let _ = (&mut proxy_handle).await;
}

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the timeout elapses, the code aborts the listener tasks but then awaits both JoinHandles without any bound. If a task fails to observe cancellation promptly, shutdown can still hang past the intended 2s window (and potentially exceed the Windows ~5s close-event grace period). Consider adding a second (short) timeout around the post-abort() joins or otherwise ensuring this path is also time-bounded.

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 1e62439 — post-abort wait is now wrapped in a second tokio::time::timeout(500ms). Worst-case total shutdown is now 2s (drain) + 0.5s (abort observe) = 2.5s, well inside the Windows CTRL_CLOSE ~5s grace window.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 78b6563. Configure here.

Comment thread crates/server/src/main.rs Outdated
Cursor Bugbot flagged that `JoinHandle::poll` panics if called after it
has already returned `Ready`. The original patch tripped this in two
ways: (1) when the `_ = &mut main_handle` branch of `tokio::select!`
fires, subsequent `(&mut main_handle).await` inside the drain future
re-polls the completed handle and panics; (2) in the timeout branch,
if `main_handle` finishes inside the drain but `proxy_handle` does not,
the post-abort `(&mut main_handle).await` panics for the same reason.

Copilot separately noted the post-abort joins were unbounded, so a task
slow to observe cancellation could push total shutdown past the Windows
5s CTRL_CLOSE_EVENT grace window.

Switch to `AbortHandle::is_finished()` polling for completion observation
— `AbortHandle` has no panic-on-re-poll hazard — and wrap the post-abort
wait in a second `tokio::time::timeout(500ms)` so worst-case shutdown
(2s drain + 0.5s post-abort) stays well inside the 5s grace window.

Addresses:
- BloopAI#3393 (comment) (cursor[bot])
- BloopAI#3393 (comment) (Copilot)

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Juanlucasbg

Copy link
Copy Markdown

Aggressive review summary — PR #3393

68-line fix to crates/server/src/main.rs adding a bounded graceful-shutdown drain (2s timeout, then abort listener tasks, then a 500ms post-abort window). Verdict: clean — recommend merge.

Why the change is correct

axum::serve(...).with_graceful_shutdown(...) waits for every in-flight connection to finish before dropping the listener. Long-lived streams (SSE, WebSocket, long-poll) don't drain on their own, so a single open stream can hold the listener socket alive indefinitely. On Windows, when the console delivers CTRL_CLOSE_EVENT the OS grants ~5s before TerminateProcess — if the listener is still bound at that point, AFD kernel state can linger under the dead PID and block rebinding the port until reboot.

The PR's three-phase approach:

  1. 0-2s: Poll AbortHandle::is_finished() waiting for graceful drain.
  2. 2s+: If still draining, log a warning, abort the listener tasks.
  3. 2-2.5s: Give abort a 500ms window to propagate (some tasks observe cancellation lazily).

Total worst-case shutdown ≤ 2.5s, well inside the 5s Windows grace window.

Findings

  • Structural — PASS: The author noticed a real footgun: JoinHandle::poll panics if called after the handle has already returned Ready (which _ = &mut handle in select! does). The PR sidesteps this by capturing abort_handle() upfront and polling is_finished() instead of re-awaiting the join handle. Excellent comment-block (lines 187-208) walking the next reader through the rationale.
  • Adversarial — 0 HIGH: Edge cases — both servers already finished before drain → poll loop sees is_finished() && is_finished() immediately → exits without timeout. Both servers hung → 2s timeout fires → abort. One finished, one hung → loop waits for the second → 2s timeout. All safe.
  • Performance: 20ms poll interval gives 50 polls/sec, max ~100 polls in the 2s window — negligible.
  • Conventions — PASS: Constants are at module scope (SHUTDOWN_DRAIN_TIMEOUT, POST_ABORT_TIMEOUT, COMPLETION_POLL_INTERVAL); easy to tune.

NITs

  • Telemetry: When the timeout fires, only a single tracing::warn! is emitted. A counter or structured field (e.g. force_aborted=true) would let operators see how often this path triggers in production.
  • Test gap: A unit test would be hard for this path (it's main()), but a #[cfg(test)] shim that wraps the drain logic in a function and tests the three branches (drain completes / drain times out / both abort) would lock in the panic-avoidance discipline.
  • COMPLETION_POLL_INTERVAL = 20ms: Could be larger (e.g. 50ms) — the human-perceptible upper bound is fine. Minor.

Verdict

Approve.

— Reviewed by automated single-pass review (concurrency / shutdown-correctness; full 4-tool battery skipped — diff is well-commented and addresses a specific Windows port-rebind bug).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants