Reload race condition causes persistent 500 responses on /aggregated_metrics endpoint #5310

@roryabraham

Description

Describe the bug

When fluentd receives SIGHUP (e.g. via systemctl reload fluentd.service) to reload its
configuration, worker threads restart in-place. A new worker thread hits Errno::EADDRINUSE
when trying to bind its port because the old thread has not yet fully released the socket. The
crashing thread terminates silently, leaving the fluent-plugin-prometheus HTTP server on port
24231 running but unable to collect stats from the dead thread. All subsequent requests to the
/aggregated_metrics endpoint return HTTP 500 with body
"Connection refused - connect(2) for 127.0.0.1:<port>".

To Reproduce

  1. Install fluent-package with the following configuration (see Your Configuration below)
  2. Start Fluentd: systemctl start fluentd
  3. Send a reload signal: systemctl reload fluentd.service (sends SIGHUP)
  4. Observe the warn log for EADDRINUSE on one of the worker threads
  5. Request http://localhost:24231/aggregated_metrics — it returns HTTP 500 with body
    "Connection refused - connect(2) for 127.0.0.1:<port>"
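The EADDRINUSE half of the race can be reproduced outside fluentd with plain Ruby sockets. This is a minimal sketch; the variable names are illustrative and not fluentd internals:

```ruby
require "socket"

# The "old worker" still holds the listening socket when the "new worker"
# tries to bind the same address, mirroring the reload race.
old_worker = TCPServer.new("127.0.0.1", 0)   # port 0: let the OS pick a free port
port = old_worker.addr[1]

err = nil
begin
  TCPServer.new("127.0.0.1", port)           # rebind before the old socket closes
rescue Errno::EADDRINUSE => e
  err = e                                    # the same exception class as in the log
end

old_worker.close
new_worker = TCPServer.new("127.0.0.1", port) # succeeds once the port is released
new_worker.close
```

Note that Ruby's TCPServer sets SO_REUSEADDR by default, which only helps for sockets in TIME_WAIT; it does not allow binding while the old listener is still open, which is exactly the window this race falls into.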

Expected behavior

After a reload (SIGHUP), all worker threads should successfully rebind their ports and Fluentd should continue serving /aggregated_metrics with HTTP 200. If a thread cannot rebind, it should either retry or fall back gracefully rather than crashing and leaving the supervisor in a broken state.
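A retry along these lines would cover the transient case. This is a hedged sketch of the expected fallback behavior, not fluentd's or the plugin's actual code; `bind_with_retry` is a hypothetical helper:

```ruby
require "socket"

# Hypothetical retry-on-EADDRINUSE helper: give the old worker's socket a
# short window to be released instead of crashing the task on first failure.
def bind_with_retry(host, port, attempts: 5, delay: 0.2)
  attempts.times do |i|
    begin
      return TCPServer.new(host, port)
    rescue Errno::EADDRINUSE
      raise if i == attempts - 1   # out of retries: surface the error
      sleep delay                  # wait for the old socket to close
    end
  end
end

# Simulate the reload: the old listener closes shortly after the new bind starts.
old_listener = TCPServer.new("127.0.0.1", 0)
port = old_listener.addr[1]
Thread.new { sleep 0.3; old_listener.close }

server = bind_with_retry("127.0.0.1", port)  # succeeds on a later attempt
server.close
```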

Your Environment

- Fluentd version: fluentd 1.19.1 (efdc4dca81c23480c9b55e13e55de6aa925b1cf5)
- Package version: fluent-package 6.0.1
- Operating system: Ubuntu 24.04.3
- Kernel version: 6.8.0-94-generic

Your Configuration

<system>
  workers 4
</system>

<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
</source>

<source>
  @type prometheus_monitor
</source>

<source>
  @type http
  bind 127.0.0.1
  port 24224
  <parse>
    @type json
  </parse>
</source>

<match **>
  @type null
</match>

Your Error Log

2026-03-31 22:47:09 +0000 [warn]: #3  0.07s: Async::Task
      | Task may have ended with unhandled exception.
      |   Errno::EADDRINUSE: Address already in use - bind
      |   → /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:152 in 'Socket#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:152 in 'IO::Endpoint::Wrapper#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:68 in 'block in IO::Endpoint::HostEndpoint#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Array#each'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Enumerator#each'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Enumerable#map'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'IO::Endpoint::HostEndpoint#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-http-0.89.0/lib/async/http/endpoint.rb:216 in 'Async::HTTP::Endpoint#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/generic.rb:82 in 'IO::Endpoint::Generic#accept'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-http-0.89.0/lib/async/http/server.rb:67 in 'block in Async::HTTP::Server#run'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-2.24.0/lib/async/task.rb:200 in 'block in Async::Task#run'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-2.24.0/lib/async/task.rb:438 in 'block in Async::Task#schedule'

Additional context

The crash leaves the fluent-plugin-prometheus /aggregated_metrics endpoint permanently
returning HTTP 500 (body: "Connection refused - connect(2) for 127.0.0.1:<port>") until
Fluentd is fully restarted. The affected thread is one of the per-worker Prometheus stats
collectors that the aggregation endpoint queries internally.

The stack trace points to io-endpoint-0.15.2 and async-http-0.89.0 — the new thread starts
its HTTP server bind before the old thread's socket is fully closed, suggesting either a missing
SO_REUSEPORT/SO_REUSEADDR option or insufficient drain time before rebinding during a
SIGHUP-triggered reload.
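For reference, on Linux SO_REUSEPORT would let the new listener bind while the old one is still open. A minimal sketch of that option (an assumption about a possible fix, not something fluent-plugin-prometheus currently does as far as this report goes; `reuseport_listener` is a hypothetical helper):

```ruby
require "socket"

# Hypothetical listener with SO_REUSEPORT set before bind (Linux-specific):
# two such sockets can listen on the same address concurrently, so a new
# worker could bind before the old one has fully released the port.
def reuseport_listener(host, port)
  sock = Socket.new(:INET, :STREAM)
  sock.setsockopt(:SOCKET, :REUSEPORT, true)
  sock.bind(Addrinfo.tcp(host, port))
  sock.listen(128)
  sock
end

old_worker = reuseport_listener("127.0.0.1", 0)
port = old_worker.local_address.ip_port
new_worker = reuseport_listener("127.0.0.1", port)  # no EADDRINUSE raised
old_worker.close
new_worker.close
```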

Our workaround is to avoid systemctl reload (SIGHUP) in favor of systemctl restart (full
stop + start), which guarantees the old process is dead and all sockets released before the new
one starts.
