fix(runtime): prevent tracked jobs hanging forever on broker disconnect #184
jonny981 wants to merge 1 commit into openai:main
Conversation
A Codex task can become stuck in `status: running` indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that `/codex:status` keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket. Four independent root causes combine to produce the wedge:

1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race against the client's `exitPromise`. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is ever delivered, so the await never resolves. Pending RPC requests are rejected by `handleExit`, but the turn-completion promise is unrelated to those, so it sits forever.
2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no timeout fallback. When `captureTurn` hangs, the entire companion process hangs, and the per-job state file is never transitioned to a terminal status.
3. `app-server-broker.mjs` never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale broker.json that points at a half-dead broker.
4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do not probe whether tracked PIDs are still alive. Once a job is wedged, every status query reports it as running.

Fixes:

- `captureTurn` now races `state.completion` against `client.exitPromise`. If the client closes before a terminal event arrives, the turn is rejected with the recorded exit error and progress emits a `failed` notification.
- `runTrackedJob` races the runner against a hard timeout (default 30 minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the `timeoutMs` option). On timeout the job transitions to `failed` with a clear error message instead of leaving the companion stuck.
- The broker now subscribes to its appClient's `exitPromise`. When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a `notifications/broker/shuttingDown` event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via `ensureBrokerSession` and respawn cleanly.
- A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)` to probe pid liveness without affecting the process.
- New `markDeadPidJobFailed` and `reconcileIfDead` helpers in lib/job-control.mjs reconcile any active job whose tracked PID is gone. Reconciliation now runs on every status read path, not only the `--wait` polling loop, so a single `/codex:status` call surfaces dead workers immediately.

Closes openai#183 (runTrackedJob/captureTurn hang in finalizing). Refs openai#176, openai#164 (dead PID detection on status reads).

Tests:

- tests/process.test.mjs: 3 new cases for `isProcessAlive`.
- tests/dead-pid-reconcile.test.mjs: 6 new cases covering `markDeadPidJobFailed` guards (active state, PID match) and the reconcile-on-status behaviour for both single-job and full snapshots.
- tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard timeout transitions a hung job to `failed` and that successful runners still complete normally.

All 11 new tests pass. The 5 failing tests under the existing suite (`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and `resolveStateDir`) are pre-existing on `main` and unrelated to this patch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
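The `captureTurn` and `runTrackedJob` fixes apply the same underlying pattern: never await a promise that may never settle without racing it against a failure signal. A minimal, self-contained sketch of that pattern (`raceWithExit` and `raceWithTimeout` are illustrative stand-ins, not the PR's actual code):

```javascript
// Reject if the client exits before the work settles (the captureTurn-style fix).
function raceWithExit(work, exitPromise) {
  return Promise.race([
    work,
    exitPromise.then(() => {
      throw new Error("client exited before a terminal event arrived");
    }),
  ]);
}

// Reject if the work takes longer than timeoutMs (the runTrackedJob-style fix).
function raceWithTimeout(work, timeoutMs) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`job exceeded ${timeoutMs}ms hard timeout`)),
      timeoutMs,
    );
  });
  // Always clear the timer so a finished job does not keep the process alive.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Demo: runners that never settle are converted into prompt failures.
const hung = new Promise(() => {});
raceWithExit(hung, Promise.resolve()).catch((err) => console.log(err.message));
raceWithTimeout(hung, 50).catch((err) => console.log(err.message));
```

Either race converts a silent hang into an ordinary rejection that the job-store can record as a terminal `failed` state.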
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d13e46207
```js
const rawJobs = sortJobsNewestFirst(filterJobsForCurrentSession(listJobs(workspaceRoot), options));
// Reconcile any active jobs whose tracked PID is dead before partitioning,
// so a single status read surfaces stuck workers immediately.
const jobs = rawJobs.map((job) => reconcileIfDead(workspaceRoot, job));
```
Re-sort jobs after dead-PID reconciliation
buildStatusSnapshot sorts once (rawJobs) and then mutates job states via reconcileIfDead, but it never re-sorts before computing latestFinished/recent. If a stale running job near the end of the list is reconciled to failed, it keeps its old position and can be dropped by the maxJobs slice, so a single /codex:status call may reconcile a dead worker without showing it anywhere in the output. I reproduced this with >8 newer completed jobs: the reconciled dead job disappeared from both running and recent despite being newly failed.
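A minimal, runnable illustration of the ordering bug and the suggested re-sort. The helpers here are simplified stand-ins for the PR's `sortJobsNewestFirst`/`reconcileIfDead`, with the assumption that reconciliation stamps a fresh `updatedAt` when it transitions a job:

```javascript
// Simplified stand-in: sort jobs newest-first by their update timestamp.
const sortJobsNewestFirst = (jobs) =>
  [...jobs].sort((a, b) => b.updatedAt - a.updatedAt);

// Simplified stand-in: a running job whose PID is gone becomes failed "now".
const reconcileIfDead = (job) =>
  job.status === "running" && !job.pidAlive
    ? { ...job, status: "failed", updatedAt: Date.now() }
    : job;

const rawJobs = sortJobsNewestFirst([
  { id: "a", status: "completed", updatedAt: 200, pidAlive: false },
  { id: "b", status: "running", updatedAt: 100, pidAlive: false }, // wedged
]);

// Without a re-sort, the newly failed job "b" keeps its old tail position,
// where a maxJobs slice can drop it from the snapshot.
const reconciledOnly = rawJobs.map(reconcileIfDead);

// With a re-sort after reconciliation, the newly failed job surfaces first.
const jobs = sortJobsNewestFirst(reconciledOnly);
console.log(jobs[0].id, jobs[0].status); // the reconciled job now leads
```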
Summary
A Codex task can become stuck in `status: running` indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that `/codex:status` keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket. This PR fixes four independent root causes that combine to produce the wedge.

Closes #183. Refs #176, #164.
Reproduction
Run any moderately heavy task that exercises the broker for ~3-5 minutes. In my case it was a Codex review of a ~4,500-line markdown corpus. After the codex CLI processed several `nl`/`jq`/`rg` shell commands and started reasoning about the result, the worker fell silent. The job-store kept reporting `phase: running` for 9+ minutes with no log progress. `/codex:cancel` returned `ECONNREFUSED /var/folders/.../cxc-XXXXXX/broker.sock`. Both the companion PID and the broker PID were dead by the time I cancelled.

Root causes
1. **`captureTurn` hangs forever on disconnection.** `lib/codex.mjs` awaits `state.completion` with no race against the client's `exitPromise`. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is delivered, so the await never resolves. Pending RPC requests are rejected by `handleExit`, but the turn-completion promise is unrelated to those, so it sits forever.
2. **`runTrackedJob` has no timeout fallback.** `lib/tracked-jobs.mjs` awaits the runner without a timeout. When `captureTurn` hangs, the entire companion process hangs and the per-job state file is never transitioned to a terminal status.
3. **The broker zombifies on app-server death.** `app-server-broker.mjs` never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale `broker.json` that points at a half-dead broker.
4. **Status reads do not probe PID liveness.** `buildStatusSnapshot` and `buildSingleJobSnapshot` never check whether tracked PIDs are still alive. Once a job is wedged, every `/codex:status` query reports it as running.

Fixes
- **`captureTurn` races against client exit (lib/codex.mjs).** A new `exitWatch` promise rejects `state.completion` with the recorded exit error when the client closes before a terminal event arrives. Progress emits a `failed` notification so the user sees what happened.
- **`runTrackedJob` enforces a hard timeout (lib/tracked-jobs.mjs).** Default 30 minutes, override via the `CODEX_COMPANION_JOB_TIMEOUT_MS` env var or the `timeoutMs` option. On timeout the job transitions to `failed` with a clear error message instead of leaving the companion stuck.
- **Broker subscribes to appClient exit (app-server-broker.mjs).** When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a `notifications/broker/shuttingDown` event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via `ensureBrokerSession` and respawn cleanly.
- **`isProcessAlive` helper (lib/process.mjs).** Uses `kill(pid, 0)` to probe pid liveness without affecting the process.
- **`markDeadPidJobFailed` + `reconcileIfDead` (lib/job-control.mjs).** Reconcile any active job whose tracked PID is gone. Reconciliation runs on every status read path now, not only the `--wait` polling loop, so a single `/codex:status` call surfaces dead workers immediately. This is broader than "Fix stuck running jobs by detecting dead review/task PIDs" #176, which only reconciled inside `waitForSingleJobSnapshot`.

Relationship to existing PRs/issues
- `status --wait`: if "Fix stuck running jobs by detecting dead review/task PIDs" #176 lands first, the overlap is small (the helper functions and a couple of import lines).

Test plan
- `tests/process.test.mjs`: 3 new cases for `isProcessAlive` (invalid input, current pid, exited child).
- `tests/dead-pid-reconcile.test.mjs`: 6 new cases covering `markDeadPidJobFailed` guards and reconcile-on-status for both single-job and full snapshots.
- `tests/tracked-jobs-timeout.test.mjs`: 2 new cases proving the hard timeout transitions a hung job to `failed` and that fast runners still complete normally.
- `node --test tests/*.test.mjs` → 91 pass, 5 fail. The 5 failures (`collectReviewContext` skips untracked directories, three `runtime.test.mjs` snapshot tests, and `resolveStateDir` uses a temp-backed per-workspace directory) are pre-existing on `main` and unrelated to this patch; I confirmed by running the same suite against pristine `main`.

Test plan (manual)
- `/codex:status` now reconciles a dead job to `failed` on a single call.
- `failed` status.
- `process.kill(pid, 0)` on Unix paths and the platform branch in `terminateProcessTree` is unchanged).

🤖 Generated with Claude Code
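For reference, a `kill(pid, 0)` liveness probe along the lines the PR describes can be sketched as follows (an illustrative version, not the PR's exact lib/process.mjs implementation):

```javascript
// Probe whether a pid refers to a live process without affecting it.
// Signal 0 performs the existence/permission checks but delivers nothing.
function isProcessAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we lack permission to signal it,
    // so it still counts as alive; ESRCH (and anything else) means dead.
    return err.code === "EPERM";
  }
}

console.log(isProcessAlive(process.pid)); // true: the current process is alive
console.log(isProcessAlive(-1)); // false: invalid pids are rejected up front
```

Guarding against non-positive pids matters because `kill(-1, 0)` would probe the caller's entire process group rather than a single process.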