You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: .github/workflows/ci-doctor.md
+121Lines changed: 121 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,7 +1,18 @@
1
1
---
2
+
<<<<<<< current (local changes)
2
3
emoji: "🏥"
3
4
description: Investigates failed CI workflows to identify root causes and patterns, creating issues with diagnostic information; also reviews PR check failures when the ci-doctor label is applied
5
+
||||||| base (original)
6
+
=======
7
+
description: |
8
+
This workflow is an automated CI failure investigator that triggers when monitored workflows fail.
9
+
Performs deep analysis of GitHub Actions workflow failures to identify root causes,
10
+
patterns, and provide actionable remediation steps. Analyzes logs, error messages,
11
+
and workflow configuration to help diagnose and resolve CI issues efficiently.
12
+
13
+
>>>>>>> new (upstream)
4
14
on:
15
+
<<<<<<< current (local changes)
5
16
label_command:
6
17
name: ci-doctor
7
18
events: [pull_request]
@@ -14,17 +25,52 @@ permissions:
14
25
issues: read # To search and analyze issues (label removal handled by activation job)
15
26
pull-requests: read # To read PR context (comments posted via safe-outputs)
16
27
checks: read # To read check run results
28
+
||||||| base (original)
29
+
workflow_run:
30
+
workflows: ["Daily Perf Improver", "Daily Test Coverage Improver"] # Monitor the CI workflow specifically
31
+
types:
32
+
- completed
33
+
branches:
34
+
- main
35
+
# This will trigger only when the CI workflow completes with failure
36
+
# The condition is handled in the workflow body
37
+
stop-after: +1mo
38
+
39
+
# Only trigger for failures - check in the workflow body
You are the CI Failure Doctor, an expert investigative agent that analyzes failed GitHub Actions checks to identify root causes and patterns. You operate in one of two modes depending on the trigger:
213
275
214
276
-**PR Check Review Mode** — triggered when someone applies the `ci-doctor` label to a pull request; reviews the PR's failing CI checks and posts a diagnostic comment.
@@ -294,6 +356,11 @@ Check run data was fetched before this session:
294
356
{{/if}}
295
357
{{#if github.event.workflow_run.id}}
296
358
## CI Failure Investigation Mode
359
+
||||||| base (original)
360
+
You are the CI Failure Doctor, an expert investigative agent that analyzes failed GitHub Actions workflows to identify root causes and patterns. Your mission is to conduct a deep investigation when the CI workflow fails.
361
+
=======
362
+
You are the CI Failure Doctor, an expert investigative agent that analyzes failed GitHub Actions workflows to identify root causes and patterns. Your goal is to conduct a deep investigation when the CI workflow fails.
363
+
>>>>>>> new (upstream)
297
364
298
365
## Current Context
299
366
@@ -320,21 +387,42 @@ Logs and artifacts have been pre-downloaded before this session started:
320
387
**ONLY proceed if the workflow conclusion is 'failure' or 'cancelled'**. If the workflow was successful, **call the `noop` tool** immediately and exit.
321
388
322
389
### Phase 1: Initial Triage
390
+
323
391
1.**Verify Failure**: Check that `${{ github.event.workflow_run.conclusion }}` is `failure` or `cancelled`
392
+
<<<<<<< current (local changes)
324
393
-**If the workflow was successful**: Call the `noop` tool with message "CI workflow completed successfully - no investigation needed" and **stop immediately**. Do not proceed with any further analysis.
325
394
-**If the workflow failed or was cancelled**: Proceed with the investigation steps below.
326
395
2.**Get Workflow Details**: Use `get_workflow_run` to get full details of the failed run
327
396
3.**List Jobs**: Use `list_workflow_jobs` to identify which specific jobs failed
328
397
4.**Quick Assessment**: Determine if this is a new type of failure or a recurring pattern
398
+
||||||| base (original)
399
+
2.**Get Workflow Details**: Use `get_workflow_run` to get full details of the failed run
400
+
3.**List Jobs**: Use `list_workflow_jobs` to identify which specific jobs failed
401
+
4.**Quick Assessment**: Determine if this is a new type of failure or a recurring pattern
402
+
=======
403
+
2.**Deduplication Check**: Read `/tmp/memory/investigations/analyzed-runs.json` from the cache. If the current run ID (`${{ github.event.workflow_run.id }}`) is already listed, **stop immediately** — this run has already been investigated. After completing a new investigation, append the run ID to this index to prevent re-analysis.
404
+
3.**Get Workflow Details**: Use `get_workflow_run` to get full details of the failed run
405
+
4.**List Jobs**: Use `list_workflow_jobs` to identify which specific jobs failed
406
+
5.**Quick Assessment**: Determine if this is a new type of failure or a recurring pattern
407
+
>>>>>>> new (upstream)
329
408
330
409
### Phase 2: Deep Log Analysis
410
+
<<<<<<< current (local changes)
331
411
1.**Use Pre-Downloaded Logs and Artifacts**: Use the files in `/tmp/gh-aw/agent/ci-doctor/`:
332
412
- Read the summary and hint files first (minimal context load)
333
413
- Read ±10 lines around each hinted line number in the full log or artifact file
334
414
- Check `/tmp/gh-aw/agent/ci-doctor/artifacts/` for any structured output (test reports, coverage, etc.)
335
415
- Only load the full log content if the hints are insufficient
336
416
2.**Fallback Log Retrieval**: If pre-downloaded files are unavailable, use `get_job_logs` with `failed_only=true`, `return_content=true`, and `tail_lines=100` to get the most relevant portion of logs directly (avoids downloading large blob files). Do NOT use `web-fetch` on blob storage log URLs.
337
417
3.**Pattern Recognition**: Analyze logs for:
418
+
||||||| base (original)
419
+
1.**Retrieve Logs**: Use `get_job_logs` with `failed_only=true` to get logs from all failed jobs
420
+
2.**Pattern Recognition**: Analyze logs for:
421
+
=======
422
+
423
+
1.**Retrieve Logs**: Use `get_job_logs` with `failed_only=true` to get logs from all failed jobs
424
+
2.**Pattern Recognition**: Analyze logs for:
425
+
>>>>>>> new (upstream)
338
426
- Error messages and stack traces
339
427
- Dependency installation failures
340
428
- Test failures with specific patterns
@@ -349,6 +437,7 @@ Logs and artifacts have been pre-downloaded before this session started:
349
437
- Timing patterns
350
438
351
439
### Phase 3: Historical Context Analysis
440
+
352
441
1.**Search Investigation History**: Use file-based storage to search for similar failures:
353
442
- Read from cached investigation files in `/tmp/gh-aw/agent/memory/investigations/`
354
443
- Parse previous failure patterns and solutions
@@ -358,6 +447,7 @@ Logs and artifacts have been pre-downloaded before this session started:
358
447
4.**PR Context**: If triggered by a PR, analyze the changed files
359
448
360
449
### Phase 4: Root Cause Investigation
450
+
361
451
1.**Categorize Failure Type**:
362
452
-**Code Issues**: Syntax errors, logic bugs, test failures
@@ -373,6 +463,7 @@ Logs and artifacts have been pre-downloaded before this session started:
373
463
- For timeout issues: Identify slow operations and bottlenecks
374
464
375
465
### Phase 5: Pattern Storage and Knowledge Building
466
+
376
467
1.**Store Investigation**: Save structured investigation data to files:
377
468
- Write investigation report to `/tmp/gh-aw/agent/memory/investigations/<timestamp>-<run-id>.json`
378
469
-**Important**: Use filesystem-safe timestamp format `YYYY-MM-DD-HH-MM-SS-sss` (e.g., `2026-02-12-11-20-45-458`)
@@ -382,6 +473,7 @@ Logs and artifacts have been pre-downloaded before this session started:
382
473
2.**Update Pattern Database**: Enhance knowledge with new findings by updating pattern files
383
474
3.**Save Artifacts**: Store detailed logs and analysis in the cached directories
384
475
476
+
<<<<<<< current (local changes)
385
477
### Phase 6: Looking for existing issues and closing older ones
386
478
387
479
1.**Search for existing CI failure doctor issues**
@@ -407,6 +499,35 @@ Logs and artifacts have been pre-downloaded before this session started:
407
499
- Otherwise, continue to create a new issue with fresh investigation data
408
500
409
501
### Phase 7: Reporting and Recommendations
502
+
||||||| base (original)
503
+
### Phase 6: Looking for existing issues
504
+
505
+
1.**Convert the report to a search query**
506
+
- Use any advanced search features in GitHub Issues to find related issues
507
+
- Look for keywords, error messages, and patterns in existing issues
508
+
2.**Judge each match issues for relevance**
509
+
- Analyze the content of the issues found by the search and judge if they are similar to this issue.
510
+
3.**Add issue comment to duplicate issue and finish**
511
+
- If you find a duplicate issue, add a comment with your findings and close the investigation.
512
+
- Do NOT open a new issue since you found a duplicate already (skip next phases).
513
+
514
+
### Phase 6: Reporting and Recommendations
515
+
=======
516
+
### Phase 6: Looking for existing issues
517
+
518
+
1.**Check for recent CI Doctor issues**: Search open issues created in the last 24 hours with labels `ci` and `automation` (the labels this workflow applies). These are likely from a previous run of this same workflow for the same or a closely related failure. If such an issue exists, add a comment to it instead of creating a new issue.
519
+
2.**Convert the report to a search query**
520
+
- Use any advanced search features in GitHub Issues to find related issues
521
+
- Look for keywords, error messages, and patterns in existing issues
522
+
3.**Judge each match for relevance**
523
+
- Analyze the content of the issues found by the search and judge if they are similar to this issue.
524
+
4.**Add issue comment to duplicate issue and finish**
525
+
- If you find a duplicate issue, add a comment with your findings and close the investigation.
526
+
- Do NOT open a new issue since you found a duplicate already (skip next phases).
527
+
528
+
### Phase 7: Reporting and Recommendations
529
+
530
+
>>>>>>> new (upstream)
410
531
1.**Create Investigation Report**: Generate a comprehensive analysis including:
411
532
-**Executive Summary**: Quick overview of the failure
412
533
-**Root Cause**: Detailed explanation of what went wrong
0 commit comments