[aw-failures] Failure Investigator pre-fetch returns empty failed_run_ids while in-window agentic failures exist — discovery bli
[Content truncated due to length]

### Recommendation

**Fix the Failure Investigator's deterministic pre-fetch so its `failed_run_ids`/`failures` capture *run-level* failures, not just `agent`-job failures — it currently returns empty while real in-window agentic failures exist, so the investigator would call `noop` and miss them.**

### Problem statement

In the 2026-06-13 ~08:00Z run (6h window), `/tmp/gh-aw/agent/failure-investigator/prefetch.json` reported:

```json
{ "failed_run_ids": [], "failures": [] }
```

...but an independent `gh run list` over the same window found **two failed agentic workflow runs** plus two CI failures. Had the agent trusted the pre-fetch alone (as step 0 instructs: "Only call additional logs/list APIs when a required field is missing or stale"), it would have reported no failures and missed a P1.

### Evidence — in-window agentic failures the pre-fetch missed

| Run | Workflow | Created | Failing job | Note |
|---|---|---|---|---|
| [§27458229377](https://github.com/github/gh-aw/actions/runs/27458229377) | Daily Safe Outputs Git Simulator | 05:47Z | `push_repo_memory` | `agent` job **succeeded**; post-agent job failed → run = `failure` (tracked by #39024) |
| [§27458341448](https://github.com/github/gh-aw/actions/runs/27458341448) | Avenger | 05:52Z | `agent` | transient network + test timeout; self-recovered next run |

### Probable root cause

The pre-fetch appears to key on the **`agent` job conclusion** rather than the **run-level conclusion**. That has two gaps:

1. Runs where `agent` succeeds but a post-agent job (`push_repo_memory`, `safe_outputs`, `conclusion`) fails are dropped — exactly the Git Simulator case, which is also the highest-signal failure (a 5-day P1 streak).
2. Runs that emit a valid `noop`/safe-output but still exit the `agent` job as `failure` (Avenger) may also be filtered.

Net effect: a run whose GitHub conclusion is `failure` is not guaranteed to appear in `failed_run_ids`. This silently narrows discovery and makes a `noop` result look authoritative when it is not.

### Proposed remediation

1. Base `failed_run_ids` on the **workflow-run `conclusion`** (`failure | timed_out | startup_failure | cancelled`) for agentic workflows in-window, independent of any single job's status.
2. Annotate each entry with the failing job name(s) so clustering can distinguish agent-phase vs post-agent-phase (infra) failures.
3. Keep the existing agent-job signal as an additional field, not the filter.

### Success criteria / verification

1. For a window containing a run like 27458229377 (agent OK, `push_repo_memory` failed), the pre-fetch's `failed_run_ids` includes that run.
2. `failed_run_ids` count matches `gh run list` `failure`-conclusion agentic runs in the window.
3. Each failure entry carries the failing job name(s).

### References

- [§27458229377](https://github.com/github/gh-aw/actions/runs/27458229377) · [§27458341448](https://github.com/github/gh-aw/actions/runs/27458341448)
- Related: #39024 (the missed P1)







> Generated by [🔍 [aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/27460999651) · 233.2 AIC · ⌖ 14.2 AIC · ⊞ 5.1K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires  on Jun 20, 2026, 12:10 AM UTC-08:00






---

### Resolution — verified fixed

**Close this issue: the deterministic pre-fetch now keys on run-level conclusion and annotates failing job names, exactly as requested.**

The 2026-06-13 13:17Z Failure Investigator run loaded `/tmp/gh-aw/agent/failure-investigator/prefetch.json` and it returned **20 `failed_run_ids`** (not empty), each carrying the new `failed_job_names` and `conclusion` fields.

#### Success criteria — all met

1. **Run-level conclusion drives discovery** — the payload now includes runs whose `agent` job did not itself fail. Example: Smoke CI runs [§27466680204](https://github.com/github/gh-aw/actions/runs/27466680204) and [§27466673731](https://github.com/github/gh-aw/actions/runs/27466673731) appear with `failed_job_names` = `[activation, agent, pre_activation, push_repo_memory, safe_outputs]` — i.e. post-agent/non-agent job failures are now captured (the Git-Simulator class from this report).
2. **Failure entries carry failing job name(s)** — every entry has a populated `failed_job_names` array.
3. **Agent-job signal retained as an additional field** — `agent_job_conclusion` is present alongside the run-level `conclusion`, not used as the filter.

The original symptom (`failed_run_ids: []` while in-window failures existed) does not reproduce. Closing as fixed; reopen if a future window shows a run with a `failure`/`timed_out`/`startup_failure` conclusion missing from the pre-fetch.

_Verified by run [§27467779429](https://github.com/github/gh-aw/actions/runs/27467779429)._

> Generated by [🔍 [aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/27467779429) · 191.6 AIC · ⌖ 12.9 AIC · ⊞ 5.1K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[aw-failures] Failure Investigator pre-fetch returns empty failed_run_ids while in-window agentic failures exist — discovery bli [Content truncated due to length] #39037

Recommendation

Problem statement

Evidence — in-window agentic failures the pre-fetch missed

Probable root cause

Proposed remediation

Success criteria / verification

References

Resolution — verified fixed

Success criteria — all met

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Run	Workflow	Created	Failing job	Note
§27458229377	Daily Safe Outputs Git Simulator	05:47Z	`push_repo_memory`	`agent` job succeeded; post-agent job failed → run = `failure` (tracked by #39024)
§27458341448	Avenger	05:52Z	`agent`	transient network + test timeout; self-recovered next run

[aw-failures] Failure Investigator pre-fetch returns empty failed_run_ids while in-window agentic failures exist — discovery bli [Content truncated due to length] #39037

Description

Recommendation

Problem statement

Evidence — in-window agentic failures the pre-fetch missed

Probable root cause

Proposed remediation

Success criteria / verification

References

Resolution — verified fixed

Success criteria — all met

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions