Reload race condition causes persistent 500 responses on /aggregated_metrics endpoint #5310

@roryabraham

Description

Describe the bug

When fluentd receives SIGHUP (e.g. via systemctl reload fluentd.service) to reload its
configuration, worker threads restart in-place. A new worker thread hits Errno::EADDRINUSE
when trying to bind its port because the old thread has not yet fully released the socket. The
crashing thread terminates silently, leaving the fluent-plugin-prometheus HTTP server on port
24231 running but unable to collect stats from the dead thread. All subsequent requests to the
/aggregated_metrics endpoint return HTTP 500 with body
"Connection refused - connect(2) for 127.0.0.1:<port>".

To Reproduce

  1. Install fluent-package with the following configuration (see Your Configuration below)
  2. Start Fluentd: systemctl start fluentd
  3. Send a reload signal: systemctl reload fluentd.service (sends SIGHUP)
  4. Observe the warn log for EADDRINUSE on one of the worker threads
  5. Request http://localhost:24231/aggregated_metrics — it returns HTTP 500 with body
    "Connection refused - connect(2) for 127.0.0.1:<port>"
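The EADDRINUSE half of the race can be reproduced outside fluentd with plain Ruby sockets. This is a minimal sketch; the variable names are illustrative and not fluentd internals:

```ruby
require "socket"

# The "old worker" still holds the listening socket when the "new worker"
# tries to bind the same address, mirroring the reload race.
old_worker = TCPServer.new("127.0.0.1", 0)   # port 0: let the OS pick a free port
port = old_worker.addr[1]

err = nil
begin
  TCPServer.new("127.0.0.1", port)           # rebind before the old socket closes
rescue Errno::EADDRINUSE => e
  err = e                                    # the same exception class as in the log
end

old_worker.close
new_worker = TCPServer.new("127.0.0.1", port) # succeeds once the port is released
new_worker.close
```

Note that Ruby's TCPServer sets SO_REUSEADDR by default, which only helps for sockets in TIME_WAIT; it does not allow binding while the old listener is still open, which is exactly the window this race falls into.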

Expected behavior

After a reload (SIGHUP), all worker threads should successfully rebind their ports and Fluentd should continue serving /aggregated_metrics with HTTP 200. If a thread cannot rebind, it should either retry or fall back gracefully rather than crashing and leaving the supervisor in a broken state.
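A retry along these lines would cover the transient case. This is a hedged sketch of the expected fallback behavior, not fluentd's or the plugin's actual code; `bind_with_retry` is a hypothetical helper:

```ruby
require "socket"

# Hypothetical retry-on-EADDRINUSE helper: give the old worker's socket a
# short window to be released instead of crashing the task on first failure.
def bind_with_retry(host, port, attempts: 5, delay: 0.2)
  attempts.times do |i|
    begin
      return TCPServer.new(host, port)
    rescue Errno::EADDRINUSE
      raise if i == attempts - 1   # out of retries: surface the error
      sleep delay                  # wait for the old socket to close
    end
  end
end

# Simulate the reload: the old listener closes shortly after the new bind starts.
old_listener = TCPServer.new("127.0.0.1", 0)
port = old_listener.addr[1]
Thread.new { sleep 0.3; old_listener.close }

server = bind_with_retry("127.0.0.1", port)  # succeeds on a later attempt
server.close
```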

Your Environment

- Fluentd version: fluentd 1.19.1 (efdc4dca81c23480c9b55e13e55de6aa925b1cf5)
- Package version: fluent-package 6.0.1
- Operating system: Ubuntu 24.04.3
- Kernel version: 6.8.0-94-generic

Your Configuration

<system>
  workers 4
</system>

<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
</source>

<source>
  @type prometheus_monitor
</source>

<source>
  @type http
  bind 127.0.0.1
  port 24224
  <parse>
    @type json
  </parse>
</source>

<match **>
  @type null
</match>

Your Error Log

2026-03-31 22:47:09 +0000 [warn]: #3  0.07s: Async::Task
      | Task may have ended with unhandled exception.
      |   Errno::EADDRINUSE: Address already in use - bind
      |   → /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:152 in 'Socket#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:152 in 'IO::Endpoint::Wrapper#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:68 in 'block in IO::Endpoint::HostEndpoint#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Array#each'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Enumerator#each'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Enumerable#map'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'IO::Endpoint::HostEndpoint#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-http-0.89.0/lib/async/http/endpoint.rb:216 in 'Async::HTTP::Endpoint#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/generic.rb:82 in 'IO::Endpoint::Generic#accept'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-http-0.89.0/lib/async/http/server.rb:67 in 'block in Async::HTTP::Server#run'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-2.24.0/lib/async/task.rb:200 in 'block in Async::Task#run'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-2.24.0/lib/async/task.rb:438 in 'block in Async::Task#schedule'

Additional context

The crash leaves the fluent-plugin-prometheus /aggregated_metrics endpoint permanently
returning HTTP 500 (body: "Connection refused - connect(2) for 127.0.0.1:<port>") until
Fluentd is fully restarted. The affected thread is one of the per-worker Prometheus stats
collectors that the aggregation endpoint queries internally.

The stack trace points to io-endpoint-0.15.2 and async-http-0.89.0 — the new thread starts
its HTTP server bind before the old thread's socket is fully closed, suggesting either a missing
SO_REUSEPORT/SO_REUSEADDR option or insufficient drain time before rebinding during a
SIGHUP-triggered reload.
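For reference, on Linux SO_REUSEPORT would let the new listener bind while the old one is still open. A minimal sketch of that option (an assumption about a possible fix, not something fluent-plugin-prometheus currently does as far as this report goes; `reuseport_listener` is a hypothetical helper):

```ruby
require "socket"

# Hypothetical listener with SO_REUSEPORT set before bind (Linux-specific):
# two such sockets can listen on the same address concurrently, so a new
# worker could bind before the old one has fully released the port.
def reuseport_listener(host, port)
  sock = Socket.new(:INET, :STREAM)
  sock.setsockopt(:SOCKET, :REUSEPORT, true)
  sock.bind(Addrinfo.tcp(host, port))
  sock.listen(128)
  sock
end

old_worker = reuseport_listener("127.0.0.1", 0)
port = old_worker.local_address.ip_port
new_worker = reuseport_listener("127.0.0.1", port)  # no EADDRINUSE raised
old_worker.close
new_worker.close
```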

Our workaround is to avoid systemctl reload (SIGHUP) in favor of systemctl restart (full
stop + start), which guarantees the old process is dead and all sockets released before the new
one starts.
