Skip to content

fix: recover STDIO transport from crashed reading tasks#249

Open
TejGandham wants to merge 1 commit intocloudwalk:mainfrom
TejGandham:fix/stdio-transport-task-recovery
Open

fix: recover STDIO transport from crashed reading tasks#249
TejGandham wants to merge 1 commit intocloudwalk:mainfrom
TejGandham:fix/stdio-transport-task-recovery

Conversation

@TejGandham
Copy link
Copy Markdown

Summary

  • Task crash recovery: If the Task.async reading task crashes instead of returning {:error, :eof}, the transport's catch-all handle_info(_msg, state) silently drops the EXIT/DOWN signals. No new reading task is started — the transport stays alive but deaf to stdin. This adds explicit handle_info clauses for {:DOWN, ...} and {:EXIT, ...} from the reading task that log the error and start a new reading task.
  • Multi-message dispatch fix: handle_incoming_data passed the full message list to process_message/2 instead of iterating with Enum.each/2, causing a FunctionClauseError when a single IO.read returned multiple newline-delimited JSON-RPC messages.

Reproduction

  1. Start an MCP server with STDIO transport handling long-running tool calls (500+ seconds)
  2. After a few successful calls, the transport enters a state where it appears "Connected" but all tool calls hang indefinitely
  3. No error output, no timeout — the BEAM process stays alive with a healthy-looking stdio pipe
  4. Only fix is to kill the BEAM process manually

Test plan

  • Existing STDIO transport tests pass
  • Manual test: kill the reading task from iex — transport should recover and continue reading
  • Verify EXIT with :normal reason is still handled by the catch-all (no false recovery)

The STDIO transport uses Task.async to read from stdin. If the reading
task crashes (rather than returning {:error, :eof}), the transport's
catch-all handle_info silently drops the EXIT/DOWN signals and never
starts a new reading task. The transport stays alive but can no longer
read from stdin — it appears healthy to MCP health checks while
silently dropping all tool calls.

This adds explicit handle_info clauses for:
- {:DOWN, ...} from a crashed reading task (process monitor signal)
- {:EXIT, ...} with non-normal reason (process link signal, since
  the transport traps exits)

Both clauses log the crash and start a new reading task, restoring
the transport to a functional state.

Also fixes handle_incoming_data passing the full message list to
process_message instead of iterating with Enum.each — this caused
a FunctionClauseError when a single read contained multiple
newline-delimited JSON-RPC messages.
Copy link
Copy Markdown

@cloudwalk-review-agent cloudwalk-review-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Two real bugs fixed correctly.

The :DOWN/:EXIT dual-handler approach is sound: Task.async both links and monitors, so both signals can fire for the same crash. The second signal to arrive won't match because state.reading_task is already updated to the new task's pid — no double-restart. The when reason != :normal guard lets normal exits (:eof result path) fall through to the catch-all cleanly.

The Enum.each fix is correct — process_message/2 is side-effect only, so discarding the return value is intentional. Note that with batched reads, the GenServer loop now blocks for up to N × request_timeout (each non-notification message triggers a synchronous GenServer.call), but this mirrors the pre-existing single-message behavior and is inherent to the synchronous design.

@@ -144,6 +144,23 @@ defmodule Hermes.Server.Transport.STDIO do
end
end

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: _ref discards the monitor reference instead of matching it against task.ref. Since Task.async stores the monitor ref in %Task{ref: ref}, you could use {:DOWN, ref, :process, pid, reason}, %{reading_task: %Task{pid: pid, ref: ref}} to be fully defensive. In practice unique pids make this safe, but the stricter match is available for free.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant