fix: bound graceful shutdown drain to 2s #3393
axum::serve's with_graceful_shutdown waits for every in-flight connection to finish before dropping the listener. Long-lived streams (SSE, WebSocket, long-poll) do not drain on their own, so a single open stream can keep the drain future pending indefinitely after the shutdown signal fires.
On Windows, CTRL_CLOSE_EVENT (user closes the launcher / terminal window) grants the process only ~5 seconds before TerminateProcess. An unbounded drain leaves no time for perform_cleanup_actions or a clean listener drop before the OS hard-kills the process.
Bound the drain to 2 seconds after shutdown_token.cancel(); if it doesn't complete, abort the listener tasks so their sockets drop before exit.
On Linux, long-lived streams (SSE/WS) are aborted at shutdown instead of blocking the drain until runtime drop; end-of-process semantics are unchanged, just in a consistent bounded window.
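The bounded-drain idea can be sketched with only the standard library, so it runs without tokio or axum. This is an analogue under stated assumptions, not the PR's code: a thread stands in for a serve task, a hypothetical completion channel stands in for the joined handles, and `recv_timeout` plays the role of the drain deadline. `DrainOutcome`, `drain_with_deadline`, and `simulate` are illustrative names.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Outcome of waiting for in-flight work to finish.
#[derive(Debug, PartialEq)]
enum DrainOutcome {
    Completed,
    TimedOut,
}

/// Wait at most `limit` for the worker to report that it has drained.
/// `done_rx` is a hypothetical completion channel standing in for the
/// joined serve tasks.
fn drain_with_deadline(done_rx: &mpsc::Receiver<()>, limit: Duration) -> DrainOutcome {
    match done_rx.recv_timeout(limit) {
        Ok(()) => DrainOutcome::Completed,
        // The deadline passed while work was still in flight.
        Err(RecvTimeoutError::Timeout) => DrainOutcome::TimedOut,
        // The sender was dropped, so the worker has already ended.
        Err(RecvTimeoutError::Disconnected) => DrainOutcome::Completed,
    }
}

/// Simulate one connection that finishes after `work`, drained within `limit`.
fn simulate(work: Duration, limit: Duration) -> DrainOutcome {
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(work);
        // The receiver may be gone if the drain already timed out.
        let _ = done_tx.send(());
    });
    drain_with_deadline(&done_rx, limit)
}

fn main() {
    // Short durations keep the demo fast; the PR uses a 2 s drain limit.
    println!("{:?}", simulate(Duration::from_millis(10), Duration::from_millis(500)));
    println!("{:?}", simulate(Duration::from_secs(2), Duration::from_millis(50)));
}
```

A cooperative peer finishes inside the limit; a long-lived one hits the deadline, and shutdown proceeds to cleanup instead of waiting on it.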
Pull request overview
Bounds axum::serve(...).with_graceful_shutdown(...) drain time during shutdown so long-lived connections (SSE/WS/long-poll) can’t block process exit indefinitely, improving reliability of cleanup—especially under Windows’ CTRL_CLOSE_EVENT time constraints.
Changes:
- Introduces a 2s shutdown drain timeout after `shutdown_token.cancel()`.
- Aborts the main/proxy listener tasks if drain exceeds the timeout.
- Adjusts task handle usage (`&mut JoinHandle`) to support bounded waiting and aborting.
        tracing::warn!(
            "Graceful shutdown exceeded {:?}; aborting listeners",
            SHUTDOWN_DRAIN_TIMEOUT
        );
        main_handle.abort();
        proxy_handle.abort();
        // Wait for abort to propagate so the listener sockets are dropped.
        let _ = (&mut main_handle).await;
        let _ = (&mut proxy_handle).await;
    }
After the timeout elapses, the code aborts the listener tasks but then awaits both JoinHandles without any bound. If a task fails to observe cancellation promptly, shutdown can still hang past the intended 2s window (and potentially exceed the Windows ~5s close-event grace period). Consider adding a second (short) timeout around the post-abort() joins or otherwise ensuring this path is also time-bounded.
Addressed in 1e62439 — post-abort wait is now wrapped in a second tokio::time::timeout(500ms). Worst-case total shutdown is now 2s (drain) + 0.5s (abort observe) = 2.5s, well inside the Windows CTRL_CLOSE ~5s grace window.
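The timing budget in this reply can be checked with a few lines of arithmetic. A minimal sketch: `SHUTDOWN_DRAIN_TIMEOUT` is the constant named in the diff (its 2 s value is taken from the PR title), while `POST_ABORT_TIMEOUT` and `WINDOWS_CLOSE_GRACE` are hypothetical names for the 500 ms and ~5 s figures quoted in the discussion.

```rust
use std::time::Duration;

// Drain limit stated in the PR.
const SHUTDOWN_DRAIN_TIMEOUT: Duration = Duration::from_secs(2);
// Hypothetical names; the values come from the discussion above.
const POST_ABORT_TIMEOUT: Duration = Duration::from_millis(500);
// Approximate grace period quoted for CTRL_CLOSE_EVENT.
const WINDOWS_CLOSE_GRACE: Duration = Duration::from_secs(5);

/// Longest time the two bounded phases can take together.
fn worst_case_shutdown() -> Duration {
    SHUTDOWN_DRAIN_TIMEOUT + POST_ABORT_TIMEOUT
}

/// Time left for cleanup inside the grace window, if any.
fn cleanup_headroom() -> Option<Duration> {
    WINDOWS_CLOSE_GRACE.checked_sub(worst_case_shutdown())
}

fn main() {
    println!("worst case: {:?}", worst_case_shutdown());
    println!("headroom:   {:?}", cleanup_headroom());
}
```

With these values both phases together take at most 2.5 s, leaving roughly 2.5 s for cleanup before the quoted deadline.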
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 78b6563.
Cursor Bugbot flagged that `JoinHandle::poll` panics if called after it has already returned `Ready`. The original patch tripped this in two ways:
1. When the `_ = &mut main_handle` branch of `tokio::select!` fires, the subsequent `(&mut main_handle).await` inside the drain future re-polls the completed handle and panics.
2. In the timeout branch, if `main_handle` finishes inside the drain but `proxy_handle` does not, the post-abort `(&mut main_handle).await` panics for the same reason.
Copilot separately noted the post-abort joins were unbounded, so a task slow to observe cancellation could push total shutdown past the Windows ~5s `CTRL_CLOSE_EVENT` grace window.
Switch to `AbortHandle::is_finished()` polling for completion observation (`AbortHandle` has no panic-on-re-poll hazard) and wrap the post-abort wait in a second `tokio::time::timeout(500ms)` so worst-case shutdown (2s drain + 0.5s post-abort) stays well inside the 5s grace window.
Addresses:
- BloopAI#3393 (comment) (cursor[bot])
- BloopAI#3393 (comment) (Copilot)
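The polling approach described in this commit message can be illustrated with a std-only analogue. This sketch assumes nothing about the PR's actual code: `std::thread::JoinHandle::is_finished()` stands in for tokio's `AbortHandle::is_finished()`, a shared stop flag stands in for `abort()` (std threads cannot be aborted), and `wait_finished` and `spawn_worker` are hypothetical helpers.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Poll `is_finished()` until the task ends or `limit` elapses.
/// The check is a plain boolean query, so repeating it after the
/// task has completed is safe.
fn wait_finished<T>(handle: &JoinHandle<T>, limit: Duration) -> bool {
    let deadline = Instant::now() + limit;
    while !handle.is_finished() {
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(5));
    }
    true
}

/// Spawn a worker that runs until `stop` is set (cooperative stand-in for abort).
fn spawn_worker(stop: Arc<AtomicBool>) -> JoinHandle<()> {
    thread::spawn(move || {
        while !stop.load(Ordering::Relaxed) {
            thread::sleep(Duration::from_millis(5));
        }
    })
}

fn main() {
    let stop = Arc::new(AtomicBool::new(false));
    let worker = spawn_worker(Arc::clone(&stop));

    // Phase 1: bounded drain. The worker is still running, so this times out.
    let drained = wait_finished(&worker, Duration::from_millis(50));
    // Phase 2: request stop, then observe completion within a second bound.
    stop.store(true, Ordering::Relaxed);
    let stopped = wait_finished(&worker, Duration::from_millis(500));
    // Checking again after completion is harmless.
    let again = wait_finished(&worker, Duration::from_millis(1));
    println!("drained={drained} stopped={stopped} again={again}");
}
```

Because completion is observed through a query rather than by consuming the handle's result, the same handle can be checked in both the drain phase and the post-abort phase without the re-poll hazard described above.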
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
Aggressive review summary: PR #3393
68-line fix to
Why the change is correct
The PR's three-phase approach:
Total worst-case shutdown ≤ 2.5s, well inside the 5s Windows grace window.
Findings
NITs
Verdict
Approve. Reviewed by automated single-pass review (concurrency / shutdown-correctness; full 4-tool battery skipped; diff is well-commented and addresses a specific Windows port-rebind bug).

Summary
`axum::serve(...).with_graceful_shutdown(...)` waits for every in-flight connection to finish before dropping the listener. Long-lived streams (SSE, WebSocket, long-poll) do not drain on their own, so a single open stream can keep the drain future pending indefinitely after the shutdown signal fires.
In normal operation the tokio runtime drop at `main()` exit catches this by aborting the spawned serve tasks. On Windows, however, `CTRL_CLOSE_EVENT` (user closes the launcher / terminal window) grants the process only ~5 seconds before `TerminateProcess`. An unbounded drain leaves no time for `perform_cleanup_actions` or a clean listener drop: cleanup may never run, and the AFD kernel state of the half-closed listener can linger under a phantom PID.
This change bounds the drain to 2 seconds after `shutdown_token.cancel()`. If it doesn't complete, abort the listener tasks so their sockets drop before exit, leaving headroom for `perform_cleanup_actions` within the Windows `CTRL_CLOSE_EVENT` grace period.
Behaviour changes
- […] `perform_cleanup_actions` runs reliably.
- […] `axum::serve` until runtime drop. End-of-process semantics are unchanged; the difference is that cleanup now runs in a consistent bounded window rather than racing the runtime drop.
Test plan
- `cargo build --release --bin server` on Windows 11 MSVC; binary boots, accepts traffic, and shuts down cleanly on `Ctrl+C` with live browser-side SSE / WS connections present
- `tracing` output shows `"Shutdown signal received"` immediately followed by `perform_cleanup_actions` logs; drain completes sub-millisecond when peers cooperate
- `tracing::warn!("Graceful shutdown exceeded ...")` branch is there for pathological cases (peers that refuse to disconnect) so the listener is still dropped before process exit
Notes
- […] `SO_REUSEADDR` handling at the bind site with different security tradeoffs (Windows `SO_REUSEADDR` is port-hijack semantics) and is deliberately out of scope for this PR.
- […] `server.exe` runs hit `Error: Io(Os { code: 10048, kind: AddrInUse ... })`. After this change, cleanup is guaranteed to run and the race window is bounded, but a full fix requires `SO_REUSEADDR` which is left to application-level deployments with loopback-only bindings.
Note
Medium Risk
Changes server shutdown behavior by adding time-bounded drain and forced task aborts, which could affect long-lived connections and shutdown ordering if timeouts are too aggressive.
Overview
Implements a bounded graceful shutdown for the main server and preview proxy: after shutdown is signaled, the code now waits up to `2s` for `axum` drains to complete, then aborts the listener tasks and briefly waits for abort propagation before continuing.
Refactors the shutdown `select!` to use `&mut JoinHandle`s and captures `AbortHandle`s up front to avoid re-polling completed join handles, ensuring cleanup (`perform_cleanup_actions`) runs reliably even with long-lived connections (e.g., SSE/WS), especially on Windows.
Reviewed by Cursor Bugbot for commit 1e62439.