ci(skills-eval): mint Brev auth per-run + fix silent cleanup leak by richa-nvidia · Pull Request #680 · NVIDIA-AI-Blueprints/rag

richa-nvidia · 2026-06-08T07:37:54Z

Adds a long-lived API-key auth step at the start of every workflow run so brev calls never fail on stale ~/.brev/credentials.json. The self-hosted runner's refresh_token expired after ~14 days, leaving brev delete unauthenticated, and the old cleanup loop swallowed the error silently (2>/dev/null + ||) and exited 0.

Also two cleanup fixes that are independent of the auth change:

if: success() -> if: always(). Agent crashes are exactly when GPU cleanup is most needed.
Replace sort -u | while ... done (loop runs in subshell, FAILED flag was lost) with while ... done < <(sort -u ...) so the loop body runs in the parent shell and exit $FAILED actually fails the step on any brev delete error.

Background: GHA run 26894711065 leaked an H100x2 (rag-eval-gpu-0491531) for 12+ hours because all three failure modes above stacked.

Requires repo secret BREV_API_KEY to be set; the step fails loudly with ::error::BREV_API_KEY secret not set if it's missing.

Description

Checklist

I am familiar with the Contributing Guidelines.
All commits are signed-off (git commit -s) and GPG signed (git commit -S).
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

Summary by CodeRabbit

Bug Fixes
- Improved GPU VM cleanup to run even when the workflow fails, with clearer failure reporting if any VM deletions error.
- Refined results artifact collection to package only the current run’s results, skipping cleanly when no results are present.
Chores
- Added a Brev CLI authentication validation step for Skills Eval using a long-lived API key, with early failure if credentials are missing or invalid.

Adds a long-lived API-key auth step at the start of every workflow run so brev calls never fail on stale ~/.brev/credentials.json. The self-hosted runner's refresh_token expired after ~14 days, leaving brev delete unauthenticated, and the old cleanup loop swallowed the error silently (2>/dev/null + ||) and exited 0. Also two cleanup fixes that are independent of the auth change: - if: success() -> if: always(). Agent crashes are exactly when GPU cleanup is most needed. - Replace sort -u | while ... done (loop runs in subshell, FAILED flag was lost) with while ... done < <(sort -u ...) so the loop body runs in the parent shell and exit $FAILED actually fails the step on any brev delete error. Background: GHA run 26894711065 leaked an H100x2 (rag-eval-gpu-0491531) for 12+ hours because all three failure modes above stacked. Requires repo secret BREV_API_KEY to be set; the step fails loudly with ::error::BREV_API_KEY secret not set if it's missing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: richa <ricsingh@nvidia.com>

copy-pr-bot · 2026-06-08T07:37:57Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-08T07:38:07Z

📝 Walkthrough

Walkthrough

This PR enhances the Skills Eval workflow with Brev CLI authentication for non-push triggers or skill changes, improves pre-checkout Docker cleanup to handle root-owned volume chains, narrows results archiving to run-specific subdirectories, and strengthens GPU VM cleanup to run always while properly propagating deletion failures via process substitution and error logging.

Changes

Skills Eval Workflow: Brev CLI Authentication and Robust Cleanup

Layer / File(s)	Summary
Pre-checkout Docker volume cleanup `.github/workflows/skills-eval.yml`	Reworks cleanup to mount parent directory and remove target by basename to handle root-owned chains; expands volume directory list and adjusts existence test semantics.
Brev CLI authentication setup `.github/workflows/skills-eval.yml`	New conditional authentication step validates `BREV_API_KEY`, logs in to Brev CLI with org ID, and verifies login by running `brev ls` for non-push events or skill-related changes.
Results archiving to run-specific subdir `.github/workflows/skills-eval.yml`	Updates result collection to archive only `/tmp/skill-eval/results/<run_id>/`, checks for `result.json` presence, and exits cleanly when subdir or results are absent.
GPU VM cleanup with robust failure handling `.github/workflows/skills-eval.yml`	Condition changes from `success()` to `always()`; replaces piped while-loop with process-substitution loop to accumulate deletion failures, emit `::error::` on failures, and exit with aggregated status.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

A rabbit hops through CI lanes,
Adding keys to unlock the way,
Then catches stumbles cleanup makes—
Errors logged in bright array! 🐰✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the main changes: adding per-run Brev authentication and fixing silent cleanup issues in the CI workflow.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ci/skills-eval-brev-api-key

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/skills-eval.yml:
- Around line 132-133: The Brev CLI calls (e.g., the commands shown via "brev
login --api-key \"$BREV_API_KEY\" --org-id \"$BREV_ORG_ID\"" and "brev ls") need
per-call hard timeouts to avoid hung jobs; wrap each Brev invocation with the
POSIX timeout utility (or an equivalent timeout function) and capture
non-zero/timeouts as failures while continuing to run remaining cleanup steps,
and apply the same change to the other Brev calls referenced around the later
block (the calls in the 217-223 region) so each CLI call fails fast and
aggregates failure status rather than blocking the whole job.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1d48ea5a-a203-4a85-b49d-4863e227d9b0

📥 Commits

Reviewing files that changed from the base of the PR and between 329875a and 9262458.

📒 Files selected for processing (1)

.github/workflows/skills-eval.yml

coderabbitai · 2026-06-08T07:42:03Z

+          brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
+          if ! brev ls >/dev/null 2>&1; then


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add hard timeouts for Brev CLI calls.

A hung Brev API/CLI call can consume the entire job timeout and prevent deleting remaining instances. Add per-call timeout guards and keep aggregating failures.

Suggested patch

- brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID" - if ! brev ls >/dev/null 2>&1; then + timeout 60s brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID" + if ! timeout 30s brev ls >/dev/null 2>&1; then echo "::error::Brev auth failed: brev ls returned non-zero after login" exit 1 fi @@ - if brev delete "$INSTANCE"; then + if timeout 180s brev delete "$INSTANCE"; then echo "OK Deleted $INSTANCE" else - echo "::error::Failed to delete $INSTANCE (exit $?)" + rc=$? + if [ "$rc" -eq 124 ]; then + echo "::error::Timed out deleting $INSTANCE" + else + echo "::error::Failed to delete $INSTANCE (exit $rc)" + fi FAILED=1 fi

Also applies to: 217-223

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/skills-eval.yml around lines 132 - 133, The Brev CLI calls (e.g., the commands shown via "brev login --api-key \"$BREV_API_KEY\" --org-id \"$BREV_ORG_ID\"" and "brev ls") need per-call hard timeouts to avoid hung jobs; wrap each Brev invocation with the POSIX timeout utility (or an equivalent timeout function) and capture non-zero/timeouts as failures while continuing to run remaining cleanup steps, and apply the same change to the other Brev calls referenced around the later block (the calls in the 217-223 region) so each CLI call fails fast and aggregates failure status rather than blocking the whole job.

…esults Two cleanup-pipeline fixes uncovered while investigating why every nightly since 2026-06-04 has been uploading a byte-identical 06-03 artifact: 1. Pre-checkout cleanup couldn't remove root-owned orphan dirs. The 06-03 trial agent wrote a custom compose file with a relative bind mount source `./deploy/compose/seaweedfs-config/s3.json` from inside `ci/`. Docker (running as root) did mkdir -p on the chain, creating empty root-owned `ci/deploy/` and `ci/deploy/compose/`. The old cleanup did `docker run -v $TARGET:/target alpine rm -rf /target/*` (clears contents, fine) then `rm -rf $TARGET || true` (fails — ubuntu can't write to a root-owned parent, error swallowed). Result: `actions/checkout clean: true` hits EACCES on rmdir, every nightly checkout has failed for 12+ days. Fix: mount the PARENT dir into alpine and `rm -rf /parent/<name>`. Docker is root inside the container → can remove root-owned dirs by name regardless of parent ownership. Also add `ci/deploy` to the loop so the orphan parent gets cleaned alongside its child. 2. Collect step tarred ALL of /tmp/skill-eval/results/, not just this run's subdir. The runner's disk persists between jobs. If checkout fails → agent skipped → no fresh results, but the Collect step still found 06-03's leftover subdirs on disk, tarred them up, and the Upload step labeled the artifact with today's run_id. Nine days of fake "passing" nightly artifacts. Fix: tar only `/tmp/skill-eval/results/${GITHUB_RUN_ID}/`. If the agent didn't write that subdir (because it never ran), there's nothing to archive and the step exits 0 with a clear message. Upload's `if-no-files-found: ignore` handles the missing tar. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: richa <ricsingh@nvidia.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/skills-eval.yml:
- Around line 187-199: The issue is that a stale
`/tmp/skills-eval-results.tar.gz` file can persist on self-hosted runners from
previous runs. When the conditional block checking for result.json files finds
no results and skips archive creation, the old tarball from a prior run can
still be uploaded at line 211. Add a command at the beginning of this step to
remove any existing `/tmp/skills-eval-results.tar.gz` file before the
conditional check, ensuring that only newly created archives get uploaded or
nothing gets uploaded if no results exist in the current run.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9331150d-c6a3-490e-b059-5198dc8287be

📥 Commits

Reviewing files that changed from the base of the PR and between 9262458 and 4886eb1.

📒 Files selected for processing (1)

.github/workflows/skills-eval.yml

coderabbitai · 2026-06-15T08:33:29Z

+          RUN_DIR="/tmp/skill-eval/results/${{ github.run_id }}"
+          if [ ! -d "$RUN_DIR" ]; then
+            echo "no results subdir for this run (${{ github.run_id }}) — nothing to archive"
            exit 0
          fi
-          RESULTS=$(find /tmp/skill-eval/results -maxdepth 3 -name "result.json" 2>/dev/null | head -50 || true)
+          RESULTS=$(find "$RUN_DIR" -maxdepth 3 -name "result.json" 2>/dev/null | head -50 || true)
          if [ -n "$RESULTS" ]; then
-            tar czf /tmp/skills-eval-results.tar.gz -C /tmp/skill-eval results
-            echo "archived $(echo "$RESULTS" | wc -l) result.json files"
+            tar czf /tmp/skills-eval-results.tar.gz \
+              -C /tmp/skill-eval/results "${{ github.run_id }}"
+            echo "archived $(echo "$RESULTS" | wc -l) result.json files for run ${{ github.run_id }}"
          else
-            echo "results dir exists but empty — nothing to archive"
+            echo "results dir exists for this run but no result.json files — nothing to archive"
          fi


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear stale tarball before conditional archive creation.

At Line 187, this step may skip creating a new archive, but /tmp/skills-eval-results.tar.gz can persist on a self-hosted runner and then be uploaded as current-run output at Line 211.

Suggested patch

- name: Collect results for artifact if: always() && (github.event_name != 'push' || steps.changes.outputs.skills == 'true') run: | + rm -f /tmp/skills-eval-results.tar.gz RUN_DIR="/tmp/skill-eval/results/${{ github.run_id }}" if [ ! -d "$RUN_DIR" ]; then echo "no results subdir for this run (${{ github.run_id }}) — nothing to archive" exit 0 fi

🧰 Tools

🪛 zizmor (1.25.2)

[warning] 187-187: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[warning] 189-189: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[warning] 195-195: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[warning] 195-195: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/skills-eval.yml around lines 187 - 199, The issue is that a stale `/tmp/skills-eval-results.tar.gz` file can persist on self-hosted runners from previous runs. When the conditional block checking for result.json files finds no results and skips archive creation, the old tarball from a prior run can still be uploaded at line 211. Add a command at the beginning of this step to remove any existing `/tmp/skills-eval-results.tar.gz` file before the conditional check, ensuring that only newly created archives get uploaded or nothing gets uploaded if no results exist in the current run.

…icker (#701) Two independent fixes in one file: 1. Schedule trigger now exports MANUAL_FULL_SWEEP=1 (like workflow_dispatch already does). Before this, the schedule event fell through to the PR-mode code path in skills_eval_agent.py and died with "FATAL: PR_NUMBER not set in environment". Also default MANUAL_SKILLS_FILTER to "*" when INPUT_SKILLS is empty (schedule events don't populate `inputs`). 2. workflow_dispatch picker now lists the 3 skills that actually exist on disk (rag-blueprint, rag-eval, rag-perf) instead of 10 phantom names (rag-deploy-blueprint, rag-ingest-documents, etc.) that were left over from an old skill taxonomy. The LLM agent already enumerates skills/*/eval/*.json itself, so the stale picker was a UX trap only — anyone picking one of the old names got a silent BLOCKED result. ONBOARDING.md:487-492 documents this as a known gotcha; this PR closes the gotcha. Discovered while diagnosing GHA run 28214488174 (first nightly after PR #680). The Brev auth step and cleanup step from #680 both passed cleanly — this is a separate, pre-existing bug surfaced by them running on schedule for the first time. Signed-off-by: richa <ricsingh@nvidia.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

smasurekar approved these changes Jun 15, 2026

View reviewed changes

smasurekar merged commit 3241f06 into main Jun 15, 2026
9 of 10 checks passed

smasurekar deleted the ci/skills-eval-brev-api-key branch June 15, 2026 09:58

richa-nvidia mentioned this pull request Jun 29, 2026

ci(skills-eval): wire schedule to MANUAL_FULL_SWEEP + sync dispatch picker #701

Merged

3 tasks

richa-nvidia mentioned this pull request Jul 2, 2026

fix(skills): add resources.platforms.cpu to rag-blueprint nvidia_hosted spec #705

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci(skills-eval): mint Brev auth per-run + fix silent cleanup leak#680

ci(skills-eval): mint Brev auth per-run + fix silent cleanup leak#680
smasurekar merged 2 commits into
mainfrom
ci/skills-eval-brev-api-key

richa-nvidia commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented Jun 8, 2026

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
		if ! brev ls >/dev/null 2>&1; then

Uh oh!

Conversation

richa-nvidia commented Jun 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jun 8, 2026

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

richa-nvidia commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading