Drain HTTP server before relaying termination signal to upstream #132

Open
alexspeller wants to merge 1 commit into basecamp:main from alexspeller:drain-before-signaling-upstream
@alexspeller

Contributor

Problem

On SIGTERM/SIGINT, Thruster relays the signal straight to the upstream process and only closes its own HTTP listener afterwards, via the deferred server.Stop() in Service.Run — which doesn't run until upstream.Run() returns, i.e. until the upstream has already exited.

That leaves a window, for the entire duration of the upstream's shutdown (typically tens of seconds while it drains in-flight requests), where Thruster still accepts new connections on its HTTP port but the upstream has stopped listening. Each such connection fails at the proxy hop with connection refused and is returned to the client as a 502. On a busy service this produces a burst of user-facing 502s on every deploy.

Change

Reverse the order: intercept the signal in the Service, drain the HTTP server first — closing the listener so new connections are refused at the TCP level (where an upstream load balancer routes them elsewhere) while in-flight requests finish against the still-running upstream — and only then relay the signal to the upstream.

  • Signal handling moves from UpstreamProcess to Service, which owns both the server and the upstream and so is the natural place to coordinate their shutdown order.
  • Server.Stop is made idempotent, since it is now called both from the signal handler and from the deferred cleanup in Run.
  • The drain timeout was previously hardcoded at 5s — which never mattered, since the listener wasn't closed until the upstream had already exited. Now that it is meaningful, it is configurable via HTTP_DRAIN_TIMEOUT, defaulting to 30s to match kamal-proxy's --drain-timeout.

Before / after

Wrapping a backend that releases its port on SIGTERM and then takes a few seconds to exit (as Puma does while draining), while hammering the front-end during the shutdown window:

  • Before: every request during the window → 502 (dial tcp ...: connect: connection refused)
  • After: every request during the window → TCP connection refused, no 5xx; in-flight requests continue to drain via http.Server.Shutdown

Tests

Adds service_test.go asserting the HTTP listener stops accepting connections before the upstream is signalled, plus config coverage for the new HTTP_DRAIN_TIMEOUT setting.
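
The contents of service_test.go are not shown in this conversation. As a rough sketch of how such an assertion could be probed, the following starts a server on a loopback port, drains it, and checks whether the port still accepts TCP connections at the point where the upstream would be signalled. The names `acceptsConnections` and `probeShutdownOrder` are hypothetical.

```go
package main

import (
	"context"
	"net"
	"net/http"
	"time"
)

// acceptsConnections reports whether a TCP dial to addr succeeds.
func acceptsConnections(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 200*time.Millisecond)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// probeShutdownOrder returns whether the listener was open before the drain
// and whether it was still open at the moment the upstream would be signalled.
func probeShutdownOrder() (openBefore, openAtSignal bool) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()

	srv := &http.Server{Handler: http.NotFoundHandler()}
	go srv.Serve(ln)

	// One real request proves the accept loop is running before the drain.
	resp, err := http.Get("http://" + addr)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	openBefore = acceptsConnections(addr)

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx) // drain first: closes the listener, then waits

	// Stand-in for the "signal upstream" step: the port should refuse dials.
	openAtSignal = acceptsConnections(addr)
	return openBefore, openAtSignal
}
```

A test built on this would expect `openBefore` to be true and `openAtSignal` to be false.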

Copilot AI review requested due to automatic review settings June 4, 2026 11:07

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Implements a more reliable graceful shutdown flow by draining the HTTP server before relaying termination signals to the wrapped upstream process, with a configurable drain timeout.

Changes:

  • Move SIGINT/SIGTERM handling from UpstreamProcess into Service to coordinate server drain + upstream signaling order.
  • Add HTTP_DRAIN_TIMEOUT/HttpDrainTimeout config and use it for Server.Stop() shutdown deadlines.
  • Add a regression test covering “stop accepting connections before signaling upstream”, plus README docs.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Summary per file

| File | Description |
| --- | --- |
| internal/upstream_process.go | Removes standalone signal forwarding from the upstream wrapper so shutdown can be coordinated at the service level. |
| internal/service.go | Adds signal handling + gracefulShutdown that drains the server before signaling upstream. |
| internal/server.go | Makes Stop() idempotent and uses configurable drain timeout. |
| internal/config.go | Adds HttpDrainTimeout with env var + default. |
| internal/config_test.go | Extends config tests to cover the new drain timeout. |
| internal/service_test.go | Adds a test asserting the server listener is closed before upstream signaling. |
| README.md | Documents HTTP_DRAIN_TIMEOUT. |

Comment thread internal/service.go Outdated
Comment thread internal/service.go Outdated
Comment thread internal/service.go
Comment thread internal/service_test.go Outdated
On SIGTERM/SIGINT, Thruster relayed the signal straight to the upstream
process and only closed its own HTTP listener afterwards, via the deferred
server.Stop() in Service.Run -- which doesn't run until upstream.Run()
returns, i.e. until the upstream has already exited.

That leaves a window for the entire duration of the upstream's shutdown
(typically tens of seconds while it drains in-flight requests) where
Thruster still accepts new connections on its HTTP port but the upstream
has stopped listening. Each such connection fails at the proxy hop with
'connection refused' and is returned to the client as a 502. On a busy
service this produces a burst of user-facing 502s on every deploy.

Reverse the order: intercept the signal in the Service, drain the HTTP
server first (closing the listener so new connections are refused at the
TCP level, where an upstream load balancer will route them elsewhere, while
in-flight requests finish against the still-running upstream), and only
then relay the signal to the upstream. Signal handling moves from
UpstreamProcess to Service, which owns both the server and the upstream and
so is the natural place to coordinate their shutdown order. Server.Stop is
made idempotent so it can be called both here and from the deferred cleanup.

The drain timeout was previously hardcoded at 5 seconds -- which never
mattered before, since the listener wasn't closed until the upstream had
already exited. Now that it is meaningful, make it configurable via
HTTP_DRAIN_TIMEOUT, defaulting to 30 seconds to match kamal-proxy's
drain-timeout.
@alexspeller alexspeller force-pushed the drain-before-signaling-upstream branch from 7353a0a to 310147e Compare June 4, 2026 11:24