Fix stuck running jobs by detecting dead review/task PIDs #176

Open
Abdooo2235 wants to merge 3 commits into openai:main from Abdooo2235:fix/issue-164-dead-pid-detection

@Abdooo2235 Abdooo2235 commented Apr 7, 2026

Summary

Fixes stuck Codex jobs that stayed in queued/running after the tracked process died unexpectedly.

Closes #164.

Problem

When a review/task process exited unexpectedly, job state could remain running indefinitely.
That caused status --wait to keep polling until timeout with no actionable failure reason.
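
The failure mode can be sketched with a minimal polling loop. This is an illustration only, not the plugin's code: `waitForJob`, its injected `readJob`/`sleep`/`now` helpers, and the set of terminal statuses are all assumptions. The loop exits early only on a terminal status, so a job left in running by a dead process is polled until the deadline.

```javascript
// Assumed terminal statuses; the real plugin may define more.
const TERMINAL_STATUSES = new Set(["completed", "failed"]);

// Hypothetical wait loop. readJob, sleep, and now are injected so the
// sketch stays self-contained and can be driven by a fake clock.
async function waitForJob(readJob, { timeoutMs, pollMs, sleep, now }) {
  const deadline = now() + timeoutMs;
  for (;;) {
    const job = readJob();
    // Only a terminal status ends the wait early.
    if (TERMINAL_STATUSES.has(job.status)) return { job, timedOut: false };
    // A job stuck in "running" falls through to here on every poll.
    if (now() >= deadline) return { job, timedOut: true };
    await sleep(pollMs);
  }
}
```

If `readJob` reconciles dead PIDs before returning, the same loop ends on the first poll after the process dies.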

What Changed

  • Added PID liveness detection for tracked jobs.
  • Added reconciliation that converts stale active jobs with dead PIDs to failed.
  • Persisted reconciliation to both the state index and per-job file.
  • Stored a clear failure message and completion timestamp when a dead PID is detected.
  • Surfaced errorMessage in status output for easier diagnosis.
  • Included wait-timeout context in single-job status rendering.
  • Added a regression test for dead-PID plus status --wait behavior.
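
A minimal sketch of the liveness check and reconciliation, under stated assumptions: `isPidAlive` and `reconcileDeadPid` are hypothetical names, `completedAt` is an assumed field name for the completion timestamp, and the error text is illustrative. The check relies on Node's `process.kill(pid, 0)`, which tests for the process without delivering a signal.

```javascript
// Hypothetical helper; the PR's real function and error handling may differ.
function isPidAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    // Signal 0 checks that the process exists without sending a signal.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it, so count it as alive.
    // ESRCH or anything else: treat the process as gone.
    return err.code === "EPERM";
  }
}

const ACTIVE_STATUSES = new Set(["queued", "running"]);

// Returns the job unchanged unless it is active and its tracked PID is dead.
function reconcileDeadPid(job, { isAlive = isPidAlive, now = () => new Date() } = {}) {
  if (!ACTIVE_STATUSES.has(job.status)) return job;
  if (job.pid == null) return job; // nothing tracked yet, e.g. a queued job
  if (isAlive(job.pid)) return job;
  return {
    ...job,
    status: "failed",
    errorMessage: `Tracked process ${job.pid} is no longer running.`,
    completedAt: now().toISOString(), // assumed field name
  };
}
```

Treating `EPERM` as alive is a deliberately conservative choice in this sketch: it avoids failing a job whose process still exists but belongs to another user.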

Behavior After Fix

  • A dead worker/reviewer PID is detected promptly.
  • The job transitions to failed instead of remaining running.
  • status --wait returns quickly with a failure state rather than timing out.
  • The diagnostic reason is visible in status output and in persisted job artifacts.
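
For illustration, status rendering that surfaces the failure reason and wait-timeout context might look like the sketch below. The function name `renderJobStatus`, its options, and the output wording are assumptions; the plugin's actual output format is not shown in this description.

```javascript
// Hypothetical renderer; field names other than errorMessage are assumed.
function renderJobStatus(job, { waitTimedOut = false, waitTimeoutMs } = {}) {
  const lines = [`Job ${job.id}: ${job.status}`];
  // Show the persisted failure reason when one exists.
  if (job.errorMessage) lines.push(`Error: ${job.errorMessage}`);
  if (job.completedAt) lines.push(`Completed at: ${job.completedAt}`);
  // Explain why the wait ended if the job never reached a terminal state.
  if (waitTimedOut) {
    lines.push(`Wait timed out after ${waitTimeoutMs} ms; job is still ${job.status}.`);
  }
  return lines.join("\n");
}
```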

Test Plan

  • Ran full suite: npm test
  • Result: 67 passed, 0 failed.
  • Includes regression coverage for dead-PID reconciliation in status --wait.

  • detect dead tracked PIDs and reconcile queued/running jobs to failed
  • persist failure state to state index and job file with clear diagnostics
  • surface error details and wait timeout context in status output
  • add regression test covering dead-PID status --wait behavior
@Abdooo2235 Abdooo2235 requested a review from a team April 7, 2026 16:16

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39b560f530

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@Abdooo2235
Author

Addressed the Codex race-condition feedback in commit 5ff5fc5.

Changes made:

  • Updated markDeadPidJobFailed to re-read the latest persisted job before applying failedPatch.
  • Added guards so failure is only written when the latest persisted state is still active (queued/running) and PID still matches the originally observed PID.
  • This prevents downgrading legitimately completed/failed jobs during status polling races.
  • Added regression test: "status dead-pid reconciliation does not downgrade a concurrently completed job".
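
The guard logic can be sketched roughly as follows. This is not the PR's `markDeadPidJobFailed`. The name `markFailedIfStillStale` and the `store` object with `readJob`/`writeJob` are hypothetical stand-ins for the real persistence layer.

```javascript
const ACTIVE = new Set(["queued", "running"]);

// Applies failedPatch only if the freshly re-read job is still active and
// still tracks the PID that was observed dead. Otherwise returns the latest
// persisted job untouched.
function markFailedIfStillStale(store, observedJob, failedPatch) {
  const latest = store.readJob(observedJob.id); // re-read, do not trust the snapshot
  if (!latest) return observedJob;
  if (!ACTIVE.has(latest.status)) return latest; // finished concurrently
  if (latest.pid !== observedJob.pid) return latest; // now tracked under another PID
  const failed = { ...latest, ...failedPatch };
  store.writeJob(failed);
  return failed;
}
```

Without the re-read, a job that completed between the liveness check and the write would be overwritten with failed, which is the downgrade the regression test covers.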

Validation: npm test passes (68/68).

Author

Abdooo2235 commented Apr 7, 2026

Implemented follow-up fix for the race noted in Codex feedback (commit 5ff5fc5), including persisted-state + PID guard checks and a regression test.

Validation is green: npm test (68/68).

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ff5fc5d60

Author

Abdooo2235 commented Apr 7, 2026

Dead-PID Handling Update

Implemented on branch fix/issue-164-dead-pid-detection.

Summary

This PR now handles stale running/queued jobs whose tracked process has died by reconciling them into a consistent terminal state.

What changed

  • Added dead-PID detection and reconciliation in plugins/codex/scripts/lib/job-control.mjs.
  • Added race guards so terminal states are not overwritten during polling races.
  • Synced state.json from the latest persisted snapshot on early-return paths to prevent index/file split-brain.
  • Improved status rendering so failure details are visible in job output.
  • Added regression coverage for dead-PID timeout behavior and concurrent-completion race cases.
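
As a rough sketch of the index sync on early-return paths: when reconciliation finds that the per-job file has already moved on, the index entry is refreshed from that latest snapshot rather than left stale. The index shape `{ jobs: [...] }` and the name `syncIndexFromSnapshot` are assumptions, not the actual layout of state.json.

```javascript
// Hypothetical index shape: { jobs: [{ id, status, ... }] }.
// Returns a new index whose entry for latest.id mirrors the latest snapshot.
function syncIndexFromSnapshot(index, latest) {
  const jobs = index.jobs.map((entry) =>
    entry.id === latest.id ? { ...entry, ...latest } : entry
  );
  return { ...index, jobs };
}
```

Returning a new object instead of mutating the index in place keeps the sketch easy to test; the real code may write the file directly.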

Validation

npm test
  • 68 passed, 0 failed.

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Breezy!

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. More of your lovely PRs please.

Development

Successfully merging this pull request may close these issues.

Review jobs stuck in 'running' after process death — no dead-PID detection

1 participant