fix: recover STDIO transport from crashed reading tasks #249
TejGandham wants to merge 1 commit into cloudwalk:main
The STDIO transport uses Task.async to read from stdin. If the reading
task crashes (rather than returning {:error, :eof}), the transport's
catch-all handle_info silently drops the EXIT/DOWN signals and never
starts a new reading task. The transport stays alive but can no longer
read from stdin — it appears healthy to MCP health checks while
silently dropping all tool calls.
This adds explicit handle_info clauses for:
- {:DOWN, ...} from a crashed reading task (process monitor signal)
- {:EXIT, ...} with non-normal reason (process link signal, since
the transport traps exits)
Both clauses log the crash and start a new reading task, restoring
the transport to a functional state.
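The recovery described above can be sketched as a small GenServer. This is a minimal illustration under stated assumptions, not the Hermes implementation: the module name `RecoverySketch`, the `read_fun` argument, the `restarts` counter, and the `:restarts` call are all hypothetical, and `IO.puts/2` to stderr stands in for whatever logging the transport really uses.

```elixir
defmodule RecoverySketch do
  # Hypothetical stand-in for the transport; not Hermes.Server.Transport.STDIO.
  use GenServer

  def start_link(read_fun), do: GenServer.start_link(__MODULE__, read_fun)

  @impl true
  def init(read_fun) do
    # Trap exits so a crashed linked task arrives as an {:EXIT, ...} message
    # instead of taking this process down with it.
    Process.flag(:trap_exit, true)
    {:ok, %{read_fun: read_fun, reading_task: Task.async(read_fun), restarts: 0}}
  end

  @impl true
  def handle_call(:restarts, _from, state), do: {:reply, state.restarts, state}

  @impl true
  # Monitor signal from the task currently tracked in state.
  def handle_info({:DOWN, _ref, :process, pid, reason}, %{reading_task: %Task{pid: pid}} = state) do
    {:noreply, restart(state, reason)}
  end

  # Link signal from the tracked task; only abnormal exits trigger recovery.
  def handle_info({:EXIT, pid, reason}, %{reading_task: %Task{pid: pid}} = state)
      when reason != :normal do
    {:noreply, restart(state, reason)}
  end

  # Catch-all: normal exits, task replies, and stale signals from replaced tasks.
  def handle_info(_msg, state), do: {:noreply, state}

  defp restart(state, reason) do
    # Stand-in for real logging (assumption).
    IO.puts(:stderr, "reading task crashed: #{inspect(reason)}")
    %{state | reading_task: Task.async(state.read_fun), restarts: state.restarts + 1}
  end
end
```

Because the state is updated to the new task before the next message is handled, whichever of the two signals arrives second no longer matches the tracked pid and falls through to the catch-all.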
Also fixes handle_incoming_data passing the full message list to
process_message instead of iterating with Enum.each — this caused
a FunctionClauseError when a single read contained multiple
newline-delimited JSON-RPC messages.
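A minimal sketch of the batching bug and its fix, assuming messages are already decoded into maps. The module name `BatchSketch`, the `sink` argument, and the shape of `process_message/2` here are hypothetical and chosen only to make the example self-contained; the real function's second argument is not shown in this PR text.

```elixir
defmodule BatchSketch do
  # Hypothetical stand-in: handles exactly one decoded message (a map).
  # It is side-effect only; here the side effect is a message to `sink`.
  def process_message(%{"method" => method}, sink) do
    send(sink, {:processed, method})
    :ok
  end

  # Buggy shape: the whole list is passed to a clause that expects one map,
  # so no clause matches and a FunctionClauseError is raised.
  def handle_incoming_data_buggy(messages, sink) do
    process_message(messages, sink)
  end

  # Fixed shape: iterate, discarding each return value.
  def handle_incoming_data(messages, sink) do
    Enum.each(messages, &process_message(&1, sink))
  end
end
```

With a two-element list, the fixed version delivers one `{:processed, method}` message per element, while the buggy version fails before processing anything.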
Summary
Two real bugs fixed correctly.
The :DOWN/:EXIT dual-handler approach is sound: Task.async both links and monitors, so both signals can fire for the same crash. The second signal to arrive won't match because state.reading_task is already updated to the new task's pid — no double-restart. The when reason != :normal guard lets normal exits (:eof result path) fall through to the catch-all cleanly.
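The claim that both signals can fire for the same crash can be checked in isolation. The sketch below is illustrative only (the module and function names are hypothetical): it traps exits in the calling process, starts a task that exits abnormally, and waits for the link signal and the monitor signal in turn.

```elixir
defmodule DualSignalSketch do
  # Returns {exit_reason, down_reason}; either is :timeout if that signal never arrives.
  def crash_signals do
    # Trap exits so the link signal becomes an ordinary message.
    previous = Process.flag(:trap_exit, true)
    %Task{pid: pid, ref: ref} = Task.async(fn -> exit(:boom) end)

    exit_reason =
      receive do
        {:EXIT, ^pid, reason} -> reason
      after
        1_000 -> :timeout
      end

    down_reason =
      receive do
        {:DOWN, ^ref, :process, ^pid, reason} -> reason
      after
        1_000 -> :timeout
      end

    # Restore the caller's original trap_exit setting.
    Process.flag(:trap_exit, previous)
    {exit_reason, down_reason}
  end
end
```

Neither element should come back as `:timeout`, which is why a handler that restarts on one signal also has to tolerate the other arriving afterwards.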
The Enum.each fix is correct — process_message/2 is side-effect only, so discarding the return value is intentional. Note that with batched reads, the GenServer loop now blocks for up to N × request_timeout (each non-notification message triggers a synchronous GenServer.call), but this mirrors the pre-existing single-message behavior and is inherent to the synchronous design.
@@ -144,6 +144,23 @@ defmodule Hermes.Server.Transport.STDIO do
Minor: _ref discards the monitor reference instead of matching it against task.ref. Since Task.async stores the monitor ref in %Task{ref: ref}, you could use {:DOWN, ref, :process, pid, reason}, %{reading_task: %Task{pid: pid, ref: ref}} to be fully defensive. In practice unique pids make this safe, but the stricter match is available for free.
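The stricter match suggested here can be sketched as a pair of function heads. This is an assumption-laden illustration, not code from the PR: `DownMatchSketch` and `classify/2` are hypothetical names, and the state is reduced to the single `reading_task` field.

```elixir
defmodule DownMatchSketch do
  # Matches only when both the pid and the monitor ref belong to the tracked task.
  # Reusing `pid` and `ref` across the two arguments forces them to be equal.
  def classify(
        {:DOWN, ref, :process, pid, _reason},
        %{reading_task: %Task{pid: pid, ref: ref}}
      ),
      do: :current_task

  # Anything else, including a :DOWN carrying a different ref, is ignored.
  def classify(_msg, _state), do: :other
end
```

A `:DOWN` message carrying the task's own `ref` classifies as `:current_task`, while one with the right pid but a different ref falls through to `:other`.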
Summary
- When the Task.async reading task crashes instead of returning {:error, :eof}, the transport's catch-all handle_info(_msg, state) silently drops the EXIT/DOWN signals. No new reading task is started, so the transport stays alive but deaf to stdin. This adds explicit handle_info clauses for {:DOWN, ...} and {:EXIT, ...} from the reading task that log the error and start a new reading task.
- handle_incoming_data passed the full message list to process_message/2 instead of iterating with Enum.each/2, causing a FunctionClauseError when a single IO.read returned multiple newline-delimited JSON-RPC messages.
Reproduction
Test plan
- iex: transport should recover and continue reading
- :normal reason is still handled by the catch-all (no false recovery)