fix: recover STDIO transport from crashed reading tasks by TejGandham · Pull Request #249 · cloudwalk/hermes-mcp

TejGandham · 2026-04-04T15:11:13Z

Summary

Task crash recovery: If the Task.async reading task crashes instead of returning {:error, :eof}, the transport's catch-all handle_info(_msg, state) silently drops the EXIT/DOWN signals. No new reading task is started — the transport stays alive but deaf to stdin. This adds explicit handle_info clauses for {:DOWN, ...} and {:EXIT, ...} from the reading task that log the error and start a new reading task.
Multi-message dispatch fix: handle_incoming_data passed the full message list to process_message/2 instead of iterating with Enum.each/2, causing a FunctionClauseError when a single IO.read returned multiple newline-delimited JSON-RPC messages.

Reproduction

Start an MCP server with STDIO transport handling long-running tool calls (500+ seconds)
After a few successful calls, the transport enters a state where it appears "Connected" but all tool calls hang indefinitely
No error output, no timeout — the BEAM process stays alive with a healthy-looking stdio pipe
Only fix is to kill the BEAM process manually

Test plan

Existing STDIO transport tests pass
Manual test: kill the reading task from iex — transport should recover and continue reading
Verify EXIT with :normal reason is still handled by the catch-all (no false recovery)

The STDIO transport uses Task.async to read from stdin. If the reading task crashes (rather than returning {:error, :eof}), the transport's catch-all handle_info silently drops the EXIT/DOWN signals and never starts a new reading task. The transport stays alive but can no longer read from stdin — it appears healthy to MCP health checks while silently dropping all tool calls. This adds explicit handle_info clauses for: - {:DOWN, ...} from a crashed reading task (process monitor signal) - {:EXIT, ...} with non-normal reason (process link signal, since the transport traps exits) Both clauses log the crash and start a new reading task, restoring the transport to a functional state. Also fixes handle_incoming_data passing the full message list to process_message instead of iterating with Enum.each — this caused a FunctionClauseError when a single read contained multiple newline-delimited JSON-RPC messages.

cloudwalk-review-agent

Summary

Two real bugs fixed correctly.

The :DOWN/:EXIT dual-handler approach is sound: Task.async both links and monitors, so both signals can fire for the same crash. The second signal to arrive won't match because state.reading_task is already updated to the new task's pid — no double-restart. The when reason != :normal guard lets normal exits (:eof result path) fall through to the catch-all cleanly.

The Enum.each fix is correct — process_message/2 is side-effect only, so discarding the return value is intentional. Note that with batched reads, the GenServer loop now blocks for up to N × request_timeout (each non-notification message triggers a synchronous GenServer.call), but this mirrors the pre-existing single-message behavior and is inherent to the synchronous design.

cloudwalk-review-agent · 2026-04-04T15:21:53Z

lib/hermes/server/transport/stdio.ex

@@ -144,6 +144,23 @@ defmodule Hermes.Server.Transport.STDIO do
    end
  end



Minor: _ref discards the monitor reference instead of matching it against task.ref. Since Task.async stores the monitor ref in %Task{ref: ref}, you could use {:DOWN, ref, :process, pid, reason}, %{reading_task: %Task{pid: pid, ref: ref}} to be fully defensive. In practice unique pids make this safe, but the stricter match is available for free.

cloudwalk-review-agent bot approved these changes Apr 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: recover STDIO transport from crashed reading tasks#249

fix: recover STDIO transport from crashed reading tasks#249
TejGandham wants to merge 1 commit intocloudwalk:mainfrom
TejGandham:fix/stdio-transport-task-recovery

TejGandham commented Apr 4, 2026

Uh oh!

cloudwalk-review-agent bot left a comment

Uh oh!

cloudwalk-review-agent bot Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@@ -144,6 +144,23 @@ defmodule Hermes.Server.Transport.STDIO do
		end
		end

Conversation

TejGandham commented Apr 4, 2026

Summary

Reproduction

Test plan

Uh oh!

cloudwalk-review-agent bot left a comment

Choose a reason for hiding this comment

Summary

Uh oh!

cloudwalk-review-agent bot Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant