fix(runtime): prevent tracked jobs hanging forever on broker disconnect #184
jonny981 wants to merge 1 commit into openai:main
Conversation
A Codex task can become stuck in `status: running` indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that `/codex:status` keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket. Four independent root causes combine to produce the wedge:

1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race against the client's `exitPromise`. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is ever delivered, so the await never resolves. Pending RPC requests are rejected by `handleExit`, but the turn-completion promise is unrelated to those, so it sits forever.
2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no timeout fallback. When `captureTurn` hangs, the entire companion process hangs, and the per-job state file is never transitioned to a terminal status.
3. `app-server-broker.mjs` never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale broker.json that points at a half-dead broker.
4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do not probe whether tracked PIDs are still alive. Once a job is wedged, every status query reports it as running.

Fixes:

- `captureTurn` now races `state.completion` against `client.exitPromise`. If the client closes before a terminal event arrives, the turn is rejected with the recorded exit error and progress emits a `failed` notification.
- `runTrackedJob` races the runner against a hard timeout (default 30 minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the `timeoutMs` option). On timeout the job transitions to `failed` with a clear error message instead of leaving the companion stuck.
- The broker now subscribes to its appClient's `exitPromise`. When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a `notifications/broker/shuttingDown` event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via `ensureBrokerSession` and respawn cleanly.
- A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)` to probe pid liveness without affecting the process.
- New `markDeadPidJobFailed` and `reconcileIfDead` helpers in lib/job-control.mjs reconcile any active job whose tracked PID is gone. Reconciliation now runs on every status read path, not only the `--wait` polling loop, so a single `/codex:status` call surfaces dead workers immediately.

Closes openai#183 (runTrackedJob/captureTurn hang in finalizing). Refs openai#176, openai#164 (dead PID detection on status reads).

Tests:

- tests/process.test.mjs: 3 new cases for `isProcessAlive`.
- tests/dead-pid-reconcile.test.mjs: 6 new cases covering `markDeadPidJobFailed` guards (active state, PID match) and the reconcile-on-status behaviour for both single-job and full snapshots.
- tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard timeout transitions a hung job to `failed` and that successful runners still complete normally.

All 11 new tests pass. The 5 failing tests under the existing suite (`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and `resolveStateDir`) are pre-existing on `main` and unrelated to this patch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
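The `captureTurn` and `runTrackedJob` fixes apply the same underlying pattern: never await a promise that may never settle without racing it against a failure signal. A minimal, self-contained sketch of that pattern (`raceWithExit` and `raceWithTimeout` are illustrative stand-ins, not the PR's actual code):

```javascript
// Reject if the client exits before the work settles (the captureTurn-style fix).
function raceWithExit(work, exitPromise) {
  return Promise.race([
    work,
    exitPromise.then(() => {
      throw new Error("client exited before a terminal event arrived");
    }),
  ]);
}

// Reject if the work takes longer than timeoutMs (the runTrackedJob-style fix).
function raceWithTimeout(work, timeoutMs) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`job exceeded ${timeoutMs}ms hard timeout`)),
      timeoutMs,
    );
  });
  // Always clear the timer so a finished job does not keep the process alive.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Demo: runners that never settle are converted into prompt failures.
const hung = new Promise(() => {});
raceWithExit(hung, Promise.resolve()).catch((err) => console.log(err.message));
raceWithTimeout(hung, 50).catch((err) => console.log(err.message));
```

Either race converts a silent hang into an ordinary rejection that the job-store can record as a terminal `failed` state.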
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d13e46207
```js
const rawJobs = sortJobsNewestFirst(filterJobsForCurrentSession(listJobs(workspaceRoot), options));
// Reconcile any active jobs whose tracked PID is dead before partitioning,
// so a single status read surfaces stuck workers immediately.
const jobs = rawJobs.map((job) => reconcileIfDead(workspaceRoot, job));
```
Re-sort jobs after dead-PID reconciliation
buildStatusSnapshot sorts once (rawJobs) and then mutates job states via reconcileIfDead, but it never re-sorts before computing latestFinished/recent. If a stale running job near the end of the list is reconciled to failed, it keeps its old position and can be dropped by the maxJobs slice, so a single /codex:status call may reconcile a dead worker without showing it anywhere in the output. I reproduced this with >8 newer completed jobs: the reconciled dead job disappeared from both running and recent despite being newly failed.
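A minimal, runnable illustration of the ordering bug and the suggested re-sort. The helpers here are simplified stand-ins for the PR's `sortJobsNewestFirst`/`reconcileIfDead`, with the assumption that reconciliation stamps a fresh `updatedAt` when it transitions a job:

```javascript
// Simplified stand-in: sort jobs newest-first by their update timestamp.
const sortJobsNewestFirst = (jobs) =>
  [...jobs].sort((a, b) => b.updatedAt - a.updatedAt);

// Simplified stand-in: a running job whose PID is gone becomes failed "now".
const reconcileIfDead = (job) =>
  job.status === "running" && !job.pidAlive
    ? { ...job, status: "failed", updatedAt: Date.now() }
    : job;

const rawJobs = sortJobsNewestFirst([
  { id: "a", status: "completed", updatedAt: 200, pidAlive: false },
  { id: "b", status: "running", updatedAt: 100, pidAlive: false }, // wedged
]);

// Without a re-sort, the newly failed job "b" keeps its old tail position,
// where a maxJobs slice can drop it from the snapshot.
const reconciledOnly = rawJobs.map(reconcileIfDead);

// With a re-sort after reconciliation, the newly failed job surfaces first.
const jobs = sortJobsNewestFirst(reconciledOnly);
console.log(jobs[0].id, jobs[0].status); // the reconciled job now leads
```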
Summary
A Codex task can become stuck in `status: running` indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that `/codex:status` keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket. This PR fixes four independent root causes that combine to produce the wedge.

Closes #183. Refs #176, #164.
Reproduction
Run any moderately heavy task that exercises the broker for ~3-5 minutes. In my case it was a Codex review of a ~4,500-line markdown corpus. After the codex CLI processed several `nl`/`jq`/`rg` shell commands and started reasoning about the result, the worker fell silent. The job-store kept reporting `phase: running` for 9+ minutes with no log progress. `/codex:cancel` returned `ECONNREFUSED /var/folders/.../cxc-XXXXXX/broker.sock`. Both the companion PID and the broker PID were dead by the time I cancelled.

Root causes
1. **`captureTurn` hangs forever on disconnection.** `lib/codex.mjs` awaits `state.completion` with no race against the client's `exitPromise`. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is delivered, so the await never resolves. Pending RPC requests are rejected by `handleExit`, but the turn-completion promise is unrelated to those, so it sits forever.
2. **`runTrackedJob` has no timeout fallback.** `lib/tracked-jobs.mjs` awaits the runner without a timeout. When `captureTurn` hangs, the entire companion process hangs and the per-job state file is never transitioned to a terminal status.
3. **The broker zombifies on app-server death.** `app-server-broker.mjs` never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale `broker.json` that points at a half-dead broker.
4. **Status reads do not probe PID liveness.** `buildStatusSnapshot` and `buildSingleJobSnapshot` never check whether tracked PIDs are still alive. Once a job is wedged, every `/codex:status` query reports it as running.

Fixes
- **`captureTurn` races against client exit (lib/codex.mjs).** A new `exitWatch` promise rejects `state.completion` with the recorded exit error when the client closes before a terminal event arrives. Progress emits a `failed` notification so the user sees what happened.
- **`runTrackedJob` enforces a hard timeout (lib/tracked-jobs.mjs).** Default 30 minutes, override via the `CODEX_COMPANION_JOB_TIMEOUT_MS` env var or the `timeoutMs` option. On timeout the job transitions to `failed` with a clear error message instead of leaving the companion stuck.
- **Broker subscribes to appClient exit (app-server-broker.mjs).** When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a `notifications/broker/shuttingDown` event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via `ensureBrokerSession` and respawn cleanly.
- **`isProcessAlive` helper (lib/process.mjs).** Uses `kill(pid, 0)` to probe pid liveness without affecting the process.
- **`markDeadPidJobFailed` + `reconcileIfDead` (lib/job-control.mjs).** Reconcile any active job whose tracked PID is gone. Reconciliation runs on every status read path now, not only the `--wait` polling loop, so a single `/codex:status` call surfaces dead workers immediately. This is broader than "Fix stuck running jobs by detecting dead review/task PIDs" #176, which only reconciled inside `waitForSingleJobSnapshot`.

Relationship to existing PRs/issues
- `status --wait`: if "Fix stuck running jobs by detecting dead review/task PIDs" #176 lands first, the overlap is small (the helper functions and a couple of import lines).

Test plan
- `tests/process.test.mjs`: 3 new cases for `isProcessAlive` (invalid input, current pid, exited child).
- `tests/dead-pid-reconcile.test.mjs`: 6 new cases covering `markDeadPidJobFailed` guards and reconcile-on-status for both single-job and full snapshots.
- `tests/tracked-jobs-timeout.test.mjs`: 2 new cases proving the hard timeout transitions a hung job to `failed` and that fast runners still complete normally.
- `node --test tests/*.test.mjs` → 91 pass, 5 fail. The 5 failures (`collectReviewContext` skips untracked directories, three `runtime.test.mjs` snapshot tests, and `resolveStateDir` uses a temp-backed per-workspace directory) are pre-existing on `main` and unrelated to this patch; I confirmed by running the same suite against pristine `main`.

Test plan (manual)
- `/codex:status` now reconciles a dead job to `failed` on a single call.
- `failed` status.
- `process.kill(pid, 0)` on Unix paths and the platform branch in `terminateProcessTree` is unchanged).

🤖 Generated with Claude Code
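For reference, a `kill(pid, 0)` liveness probe along the lines the PR describes can be sketched as follows (an illustrative version, not the PR's exact lib/process.mjs implementation):

```javascript
// Probe whether a pid refers to a live process without affecting it.
// Signal 0 performs the existence/permission checks but delivers nothing.
function isProcessAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we lack permission to signal it,
    // so it still counts as alive; ESRCH (and anything else) means dead.
    return err.code === "EPERM";
  }
}

console.log(isProcessAlive(process.pid)); // true: the current process is alive
console.log(isProcessAlive(-1)); // false: invalid pids are rejected up front
```

Guarding against non-positive pids matters because `kill(-1, 0)` would probe the caller's entire process group rather than a single process.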