
fix(runtime): prevent tracked jobs hanging forever on broker disconnect#184

Open
jonny981 wants to merge 1 commit into openai:main from jonny981:fix/captureturn-hang-and-dead-pid-detection

@jonny981 jonny981 commented Apr 8, 2026

Summary

A Codex task can become stuck in status: running indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that /codex:status keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next /codex:cancel reports ECONNREFUSED on the broker socket. This PR fixes four independent root causes that combine to produce the wedge.

Closes #183. Refs #176, #164.

Reproduction

Run any moderately heavy task that exercises the broker for ~3-5 minutes. In my case it was a Codex review of a ~4,500-line markdown corpus. After the codex CLI processed several nl/jq/rg shell commands and started reasoning about the result, the worker fell silent. The job-store kept reporting phase: running for 9+ minutes with no log progress. /codex:cancel returned ECONNREFUSED /var/folders/.../cxc-XXXXXX/broker.sock. Both the companion PID and the broker PID were dead by the time I cancelled.

Root causes

  1. captureTurn hangs forever on disconnection. lib/codex.mjs awaits state.completion with no race against the client's exitPromise. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is delivered, so the await never resolves. Pending RPC requests are rejected by handleExit but the turn-completion promise is unrelated to those, so it sits forever.

  2. runTrackedJob has no timeout fallback. lib/tracked-jobs.mjs awaits the runner without a timeout. When captureTurn hangs, the entire companion process hangs and the per-job state file is never transitioned to a terminal status.

  3. The broker zombifies on app-server death. app-server-broker.mjs never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale broker.json that points at a half-dead broker.

  4. Status reads do not probe PID liveness. buildStatusSnapshot and buildSingleJobSnapshot never check whether tracked PIDs are still alive. Once a job is wedged, every /codex:status query reports it as running.
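Root cause 1 can be shown in isolation. The following is a minimal sketch, not the code in lib/codex.mjs: `completion` stands in for `state.completion`, `exitPromise` for the client's exit signal, and the helper name `raceTurnAgainstExit` is hypothetical. It assumes the exit promise resolves with the recorded exit error.

```javascript
// Hypothetical helper: neither the name nor the shapes come from lib/codex.mjs.
function raceTurnAgainstExit(completion, exitPromise) {
  // If the client exits first, turn that into a rejection so the caller
  // is not left awaiting a promise that can no longer settle.
  const exitWatch = exitPromise.then((exitError) => {
    throw exitError ?? new Error("client exited before the turn completed");
  });
  return Promise.race([completion, exitWatch]);
}

// A turn whose terminal event never arrives: this promise never settles.
const neverCompletes = new Promise(() => {});

// Assumed for the demo: the client dies 50 ms into the turn.
const clientExit = new Promise((resolve) =>
  setTimeout(() => resolve(new Error("app-server exited with code 1")), 50)
);

raceTurnAgainstExit(neverCompletes, clientExit).then(
  () => console.log("turn completed"),
  (error) => console.log(`turn failed: ${error.message}`)
);
```

Awaiting `neverCompletes` directly would hang, which is the wedge described above. With the race, the caller sees a rejection as soon as the exit is observed.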

Fixes

  • captureTurn races against client exit (lib/codex.mjs). A new exitWatch promise rejects state.completion with the recorded exit error when the client closes before a terminal event arrives. Progress emits a failed notification so the user sees what happened.

  • runTrackedJob enforces a hard timeout (lib/tracked-jobs.mjs). Default 30 minutes, override via CODEX_COMPANION_JOB_TIMEOUT_MS env var or the timeoutMs option. On timeout the job transitions to failed with a clear error message instead of leaving the companion stuck.

  • Broker subscribes to appClient exit (app-server-broker.mjs). When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a notifications/broker/shuttingDown event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via ensureBrokerSession and respawn cleanly.

  • isProcessAlive helper (lib/process.mjs). Uses kill(pid, 0) to probe pid liveness without affecting the process.

  • markDeadPidJobFailed + reconcileIfDead (lib/job-control.mjs). Reconcile any active job whose tracked PID is gone. Reconciliation runs on every status read path now, not only the --wait polling loop, so a single /codex:status call surfaces dead workers immediately. This is broader than #176 (Fix stuck running jobs by detecting dead review/task PIDs), which only reconciled inside waitForSingleJobSnapshot.
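The last two fixes can be sketched together. This is an illustration under stated assumptions, not the PR's code: the job shape `{ id, status, pid }`, the `"running"` status check, and the `reconcileJob` name are hypothetical, and the real `reconcileIfDead` takes a workspace root and persists the change. Only `process.kill(pid, 0)` is taken from the description above.

```javascript
// Liveness probe in the spirit of isProcessAlive.
function isProcessAlive(pid) {
  // Reject non-integers and non-positive values; pid 0 or negative pids
  // would address process groups rather than a single process.
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0); // signal 0 performs the checks without sending a signal
    return true;
  } catch (error) {
    // EPERM means the process exists but we may not signal it.
    return error.code === "EPERM";
  }
}

// Hypothetical pure variant of reconciliation: returns a new job object
// instead of writing to the job store.
function reconcileJob(job, probe = isProcessAlive) {
  if (job.status !== "running" || job.pid == null || probe(job.pid)) return job;
  return {
    ...job,
    status: "failed",
    error: `worker pid ${job.pid} is no longer running`,
  };
}

console.log(isProcessAlive(process.pid)); // the current process is alive
console.log(reconcileJob({ id: "job-1", status: "running", pid: process.pid }).status);
// The probe is injected here so the demo does not depend on a real dead pid.
console.log(reconcileJob({ id: "job-2", status: "running", pid: 4321 }, () => false).status);
```

Injecting the probe keeps the reconciliation logic testable without spawning and killing child processes.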

Relationship to existing PRs/issues

Test plan

  • New unit tests (11 cases, all pass):
    • tests/process.test.mjs — 3 new cases for isProcessAlive (invalid input, current pid, exited child).
    • tests/dead-pid-reconcile.test.mjs — 6 new cases covering markDeadPidJobFailed guards and reconcile-on-status for both single-job and full snapshots.
    • tests/tracked-jobs-timeout.test.mjs — 2 new cases proving the hard timeout transitions a hung job to failed and that fast runners still complete normally.
  • Full suite: node --test tests/*.test.mjs → 91 pass, 5 fail. The 5 failures (collectReviewContext skips untracked directories, three runtime.test.mjs snapshot tests, and resolveStateDir uses a temp-backed per-workspace directory) are pre-existing on main and unrelated to this patch — I confirmed by running the same suite against pristine main.
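The two timeout cases can be approximated outside the test suite. The sketch below is not the `runTrackedJob` implementation: `withHardTimeout` and `resolveTimeoutMs` are hypothetical names, and the precedence (explicit option, then `CODEX_COMPANION_JOB_TIMEOUT_MS`, then the 30 minute default) is an assumption, since the description only says both overrides exist.

```javascript
// Hypothetical helper: race a runner against a timer and always clear the timer.
function withHardTimeout(runner, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`job exceeded hard timeout of ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  // Wrapping the call means a synchronous throw from runner() also becomes a rejection.
  const run = new Promise((resolve) => resolve(runner()));
  // Clear the timer on either outcome so it cannot keep the process alive.
  return Promise.race([run, timeout]).finally(() => clearTimeout(timer));
}

// Assumed precedence: explicit option, then env var, then the 30 minute default.
function resolveTimeoutMs(optionMs, env = process.env) {
  const fromEnv = Number(env.CODEX_COMPANION_JOB_TIMEOUT_MS);
  if (Number.isFinite(optionMs) && optionMs > 0) return optionMs;
  if (Number.isFinite(fromEnv) && fromEnv > 0) return fromEnv;
  return 30 * 60 * 1000;
}

const hungRunner = () => new Promise(() => {}); // never settles, like a wedged turn
const fastRunner = () => Promise.resolve("done");

withHardTimeout(hungRunner, 100).catch((error) => console.log(`failed: ${error.message}`));
withHardTimeout(fastRunner, 100).then((result) => console.log(`completed: ${result}`));
```

In the real code the rejection would be translated into a `failed` job status. Here it is only logged.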

Test plan (manual)

  • Reproduced the wedge with a long-running Codex task on macOS.
  • Verified /codex:status now reconciles a dead job to failed on a single call.
  • Verified the new hard timeout fires and writes a failed status.
  • Smoke-test on Windows (I do not have a Windows host; the changes use only process.kill(pid, 0) on Unix paths and the platform branch in terminateProcessTree is unchanged).

🤖 Generated with Claude Code

A Codex task can become stuck in `status: running` indefinitely with no
way to recover short of manually wiping the job-store and broker session.
The user-visible symptom is that `/codex:status` keeps reporting a job as
running for many minutes after the underlying codex CLI worker has died,
and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket.

Four independent root causes combine to produce the wedge:

1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race
   against the client's `exitPromise`. When the underlying codex
   app-server (or the broker socket) dies mid-turn, no terminal event is
   ever delivered, so the await never resolves. Pending RPC requests are
   rejected by `handleExit` but the turn-completion promise is unrelated
   to those, so it sits forever.

2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no
   timeout fallback. When `captureTurn` hangs, the entire companion
   process hangs, and the per-job state file is never transitioned to a
   terminal status.

3. `app-server-broker.mjs` never listens for the underlying app-server
   client's exit. When the codex CLI process dies, the broker keeps
   running with a dead client, accepting new connections but unable to
   serve them. The next companion sees a stale broker.json that points
   at a half-dead broker.

4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do
   not probe whether tracked PIDs are still alive. Once a job is wedged,
   every status query reports it as running.

Fixes:

- captureTurn now races `state.completion` against `client.exitPromise`.
  If the client closes before a terminal event arrives, the turn is
  rejected with the recorded exit error and progress emits a `failed`
  notification.

- runTrackedJob races the runner against a hard timeout (default 30
  minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the
  `timeoutMs` option). On timeout the job transitions to `failed` with
  a clear error message instead of leaving the companion stuck.

- The broker now subscribes to its appClient's `exitPromise`. When the
  underlying codex CLI exits unexpectedly the broker logs the reason,
  fans out a `notifications/broker/shuttingDown` event to any connected
  socket, tears down the unix socket and pid file, then exits with
  status 1. The next companion will detect a dead endpoint via
  `ensureBrokerSession` and respawn cleanly.

- A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)`
  to probe pid liveness without affecting the process.

- New `markDeadPidJobFailed` and `reconcileIfDead` helpers in
  lib/job-control.mjs reconcile any active job whose tracked PID is
  gone. Reconciliation now runs on every status read path, not only the
  `--wait` polling loop, so a single `/codex:status` call surfaces dead
  workers immediately.
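
The broker shutdown order described in the third fix (log, notify, clean up, exit) can be sketched with plain objects. Everything here is hypothetical apart from the notification name: the `watchAppClientExit` function, the `send()` method on sockets, the message shape, and the injected `cleanup`, `log`, and `exit` callbacks are assumptions made for illustration.

```javascript
// Hypothetical sketch of the shutdown sequence, not app-server-broker.mjs itself.
function watchAppClientExit({ appClient, sockets, cleanup, log = console.error, exit = process.exit }) {
  return appClient.exitPromise.then((reason) => {
    log(`app-server exited unexpectedly: ${reason}`);
    for (const socket of sockets) {
      try {
        // Assumed message shape; only the method name comes from the PR text.
        socket.send({
          method: "notifications/broker/shuttingDown",
          params: { reason: String(reason) },
        });
      } catch {
        // A client that already disconnected must not block shutdown.
      }
    }
    cleanup(); // stands in for removing the unix socket and pid file
    exit(1);
  });
}

// Demo with fakes: record the order of operations instead of really exiting.
const events = [];
watchAppClientExit({
  appClient: { exitPromise: Promise.resolve("codex CLI exited with code 1") },
  sockets: new Set([{ send: (message) => events.push(message.method) }]),
  cleanup: () => events.push("cleanup"),
  log: () => events.push("log"),
  exit: (code) => console.log(events.join(" -> "), `-> exit(${code})`),
});
```

Injecting `exit` keeps the sequence observable in tests, and a throwing socket does not prevent cleanup.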

Closes openai#183 (runTrackedJob/captureTurn hang in finalizing).
Refs openai#176, openai#164 (dead PID detection on status reads).

Tests:

- tests/process.test.mjs: 3 new cases for `isProcessAlive`.
- tests/dead-pid-reconcile.test.mjs: 6 new cases covering
  `markDeadPidJobFailed` guards (active state, PID match) and the
  reconcile-on-status behaviour for both single-job and full snapshots.
- tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard
  timeout transitions a hung job to `failed` and that successful
  runners still complete normally.

All 11 new tests pass. The 5 failing tests under the existing suite
(`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and
`resolveStateDir`) are pre-existing on `main` and unrelated to this
patch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jonny981 jonny981 requested a review from a team April 8, 2026 15:50

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d13e46207

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +312 to +315
const rawJobs = sortJobsNewestFirst(filterJobsForCurrentSession(listJobs(workspaceRoot), options));
// Reconcile any active jobs whose tracked PID is dead before partitioning,
// so a single status read surfaces stuck workers immediately.
const jobs = rawJobs.map((job) => reconcileIfDead(workspaceRoot, job));

P2: Re-sort jobs after dead-PID reconciliation

buildStatusSnapshot sorts once (rawJobs) and then mutates job states via reconcileIfDead, but it never re-sorts before computing latestFinished/recent. If a stale running job near the end of the list is reconciled to failed, it keeps its old position and can be dropped by the maxJobs slice, so a single /codex:status call may reconcile a dead worker without showing it anywhere in the output. I reproduced this with >8 newer completed jobs: the reconciled dead job disappeared from both running and recent despite being newly failed.
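
One way to address this is to sort after reconciliation rather than before. The sketch below is a simplified stand-in for `buildStatusSnapshot`, not the repository's code: the job shape `{ id, status, updatedAt }`, the `buildRecent` name, and the assumption that reconciliation stamps a fresh `updatedAt` are all hypothetical.

```javascript
// Hypothetical simplified snapshot builder: reconcile first, then sort, then slice.
function buildRecent(jobs, reconcile, maxJobs) {
  const reconciled = jobs.map(reconcile);
  // Sorting after reconciliation orders newly failed jobs by their fresh timestamp.
  const sorted = [...reconciled].sort((a, b) => b.updatedAt - a.updatedAt);
  return {
    running: sorted.filter((job) => job.status === "running"),
    recent: sorted.filter((job) => job.status !== "running").slice(0, maxJobs),
  };
}

// Nine newer completed jobs plus one stale job whose worker is dead.
const jobs = [];
for (let i = 0; i < 9; i += 1) {
  jobs.push({ id: `done-${i}`, status: "completed", updatedAt: 100 + i });
}
jobs.push({ id: "stale", status: "running", updatedAt: 1 });

// Assumed reconcile behaviour: mark the dead job failed and stamp a new update time.
const reconcile = (job) =>
  job.id === "stale" ? { ...job, status: "failed", updatedAt: 200 } : job;

const snapshot = buildRecent(jobs, reconcile, 8);
console.log(snapshot.recent.map((job) => job.id));
```

With this ordering the reconciled job sorts to the front of `recent` instead of being dropped by the `maxJobs` slice.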



Development

Successfully merging this pull request may close these issues.

runTrackedJob / captureTurn can hang in phase: finalizing indefinitely (no timeout)
