Skip to content

ci(skills-eval): mint Brev auth per-run + fix silent cleanup leak#680

Merged
smasurekar merged 2 commits into
mainfrom
ci/skills-eval-brev-api-key
Jun 15, 2026
Merged

ci(skills-eval): mint Brev auth per-run + fix silent cleanup leak#680
smasurekar merged 2 commits into
mainfrom
ci/skills-eval-brev-api-key

Conversation

@richa-nvidia

@richa-nvidia richa-nvidia commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Adds a long-lived API-key auth step at the start of every workflow run so brev calls never fail on stale ~/.brev/credentials.json. The self-hosted runner's refresh_token expired after ~14 days, leaving brev delete unauthenticated, and the old cleanup loop swallowed the error silently (2>/dev/null + ||) and exited 0.

Also two cleanup fixes that are independent of the auth change:

  • if: success() -> if: always(). Agent crashes are exactly when GPU cleanup is most needed.

  • Replace sort -u | while ... done (loop runs in subshell, FAILED flag was lost) with while ... done < <(sort -u ...) so the loop body runs in the parent shell and exit $FAILED actually fails the step on any brev delete error.

Background: GHA run 26894711065 leaked an H100x2 (rag-eval-gpu-0491531) for 12+ hours because all three failure modes above stacked.

Requires repo secret BREV_API_KEY to be set; the step fails loudly with ::error::BREV_API_KEY secret not set if it's missing.

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • All commits are signed-off (git commit -s) and GPG signed (git commit -S).
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

Summary by CodeRabbit

  • Bug Fixes

    • Improved GPU VM cleanup to run even when the workflow fails, with clearer failure reporting if any VM deletions error.
    • Refined results artifact collection to package only the current run’s results, skipping cleanly when no results are present.
  • Chores

    • Added a Brev CLI authentication validation step for Skills Eval using a long-lived API key, with early failure if credentials are missing or invalid.

Adds a long-lived API-key auth step at the start of every workflow run
so brev calls never fail on stale ~/.brev/credentials.json. The
self-hosted runner's refresh_token expired after ~14 days, leaving
brev delete unauthenticated, and the old cleanup loop swallowed the
error silently (2>/dev/null + ||) and exited 0.

Also two cleanup fixes that are independent of the auth change:

  - if: success() -> if: always(). Agent crashes are exactly when
    GPU cleanup is most needed.

  - Replace sort -u | while ... done (loop runs in subshell, FAILED
    flag was lost) with while ... done < <(sort -u ...) so the loop
    body runs in the parent shell and exit $FAILED actually fails the
    step on any brev delete error.

Background: GHA run 26894711065 leaked an H100x2 (rag-eval-gpu-0491531)
for 12+ hours because all three failure modes above stacked.

Requires repo secret BREV_API_KEY to be set; the step fails loudly with
::error::BREV_API_KEY secret not set if it's missing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: richa <ricsingh@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR enhances the Skills Eval workflow with Brev CLI authentication for non-push triggers or skill changes, improves pre-checkout Docker cleanup to handle root-owned volume chains, narrows results archiving to run-specific subdirectories, and strengthens GPU VM cleanup to run always while properly propagating deletion failures via process substitution and error logging.

Changes

Skills Eval Workflow: Brev CLI Authentication and Robust Cleanup

Layer / File(s) Summary
Pre-checkout Docker volume cleanup
.github/workflows/skills-eval.yml
Reworks cleanup to mount parent directory and remove target by basename to handle root-owned chains; expands volume directory list and adjusts existence test semantics.
Brev CLI authentication setup
.github/workflows/skills-eval.yml
New conditional authentication step validates BREV_API_KEY, logs in to Brev CLI with org ID, and verifies login by running brev ls for non-push events or skill-related changes.
Results archiving to run-specific subdir
.github/workflows/skills-eval.yml
Updates result collection to archive only /tmp/skill-eval/results/<run_id>/, checks for result.json presence, and exits cleanly when subdir or results are absent.
GPU VM cleanup with robust failure handling
.github/workflows/skills-eval.yml
Condition changes from success() to always(); replaces piped while-loop with process-substitution loop to accumulate deletion failures, emit ::error:: on failures, and exit with aggregated status.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

A rabbit hops through CI lanes,
Adding keys to unlock the way,
Then catches stumbles cleanup makes—
Errors logged in bright array! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main changes: adding per-run Brev authentication and fixing silent cleanup issues in the CI workflow.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ci/skills-eval-brev-api-key

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/skills-eval.yml:
- Around line 132-133: The Brev CLI calls (e.g., the commands shown via "brev
login --api-key \"$BREV_API_KEY\" --org-id \"$BREV_ORG_ID\"" and "brev ls") need
per-call hard timeouts to avoid hung jobs; wrap each Brev invocation with the
POSIX timeout utility (or an equivalent timeout function) and capture
non-zero/timeouts as failures while continuing to run remaining cleanup steps,
and apply the same change to the other Brev calls referenced around the later
block (the calls in the 217-223 region) so each CLI call fails fast and
aggregates failure status rather than blocking the whole job.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1d48ea5a-a203-4a85-b49d-4863e227d9b0

📥 Commits

Reviewing files that changed from the base of the PR and between 329875a and 9262458.

📒 Files selected for processing (1)
  • .github/workflows/skills-eval.yml

Comment on lines +132 to +133
brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
if ! brev ls >/dev/null 2>&1; then

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add hard timeouts for Brev CLI calls.

A hung Brev API/CLI call can consume the entire job timeout and prevent deleting remaining instances. Add per-call timeout guards and keep aggregating failures.

Suggested patch
-          brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
-          if ! brev ls >/dev/null 2>&1; then
+          timeout 60s brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
+          if ! timeout 30s brev ls >/dev/null 2>&1; then
             echo "::error::Brev auth failed: brev ls returned non-zero after login"
             exit 1
           fi
@@
-            if brev delete "$INSTANCE"; then
+            if timeout 180s brev delete "$INSTANCE"; then
               echo "OK Deleted $INSTANCE"
             else
-              echo "::error::Failed to delete $INSTANCE (exit $?)"
+              rc=$?
+              if [ "$rc" -eq 124 ]; then
+                echo "::error::Timed out deleting $INSTANCE"
+              else
+                echo "::error::Failed to delete $INSTANCE (exit $rc)"
+              fi
               FAILED=1
             fi

Also applies to: 217-223

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/skills-eval.yml around lines 132 - 133, The Brev CLI calls
(e.g., the commands shown via "brev login --api-key \"$BREV_API_KEY\" --org-id
\"$BREV_ORG_ID\"" and "brev ls") need per-call hard timeouts to avoid hung jobs;
wrap each Brev invocation with the POSIX timeout utility (or an equivalent
timeout function) and capture non-zero/timeouts as failures while continuing to
run remaining cleanup steps, and apply the same change to the other Brev calls
referenced around the later block (the calls in the 217-223 region) so each CLI
call fails fast and aggregates failure status rather than blocking the whole
job.

…esults

Two cleanup-pipeline fixes uncovered while investigating why every nightly
since 2026-06-04 has been uploading a byte-identical 06-03 artifact:

  1. Pre-checkout cleanup couldn't remove root-owned orphan dirs.
     The 06-03 trial agent wrote a custom compose file with a relative
     bind mount source `./deploy/compose/seaweedfs-config/s3.json` from
     inside `ci/`. Docker (running as root) did mkdir -p on the chain,
     creating empty root-owned `ci/deploy/` and `ci/deploy/compose/`.
     The old cleanup did `docker run -v $TARGET:/target alpine rm -rf
     /target/*` (clears contents, fine) then `rm -rf $TARGET || true`
     (fails — ubuntu can't write to a root-owned parent, error swallowed).
     Result: `actions/checkout clean: true` hits EACCES on rmdir, every
     nightly checkout has failed for 12+ days.

     Fix: mount the PARENT dir into alpine and `rm -rf /parent/<name>`.
     Docker is root inside the container → can remove root-owned dirs
     by name regardless of parent ownership. Also add `ci/deploy` to
     the loop so the orphan parent gets cleaned alongside its child.

  2. Collect step tarred ALL of /tmp/skill-eval/results/, not just
     this run's subdir.
     The runner's disk persists between jobs. If checkout fails →
     agent skipped → no fresh results, but the Collect step still
     found 06-03's leftover subdirs on disk, tarred them up, and the
     Upload step labeled the artifact with today's run_id. Nine days
     of fake "passing" nightly artifacts.

     Fix: tar only `/tmp/skill-eval/results/${GITHUB_RUN_ID}/`. If the
     agent didn't write that subdir (because it never ran), there's
     nothing to archive and the step exits 0 with a clear message.
     Upload's `if-no-files-found: ignore` handles the missing tar.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: richa <ricsingh@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/skills-eval.yml:
- Around line 187-199: The issue is that a stale
`/tmp/skills-eval-results.tar.gz` file can persist on self-hosted runners from
previous runs. When the conditional block checking for result.json files finds
no results and skips archive creation, the old tarball from a prior run can
still be uploaded at line 211. Add a command at the beginning of this step to
remove any existing `/tmp/skills-eval-results.tar.gz` file before the
conditional check, ensuring that only newly created archives get uploaded or
nothing gets uploaded if no results exist in the current run.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9331150d-c6a3-490e-b059-5198dc8287be

📥 Commits

Reviewing files that changed from the base of the PR and between 9262458 and 4886eb1.

📒 Files selected for processing (1)
  • .github/workflows/skills-eval.yml

Comment on lines +187 to 199
RUN_DIR="/tmp/skill-eval/results/${{ github.run_id }}"
if [ ! -d "$RUN_DIR" ]; then
echo "no results subdir for this run (${{ github.run_id }}) — nothing to archive"
exit 0
fi
RESULTS=$(find /tmp/skill-eval/results -maxdepth 3 -name "result.json" 2>/dev/null | head -50 || true)
RESULTS=$(find "$RUN_DIR" -maxdepth 3 -name "result.json" 2>/dev/null | head -50 || true)
if [ -n "$RESULTS" ]; then
tar czf /tmp/skills-eval-results.tar.gz -C /tmp/skill-eval results
echo "archived $(echo "$RESULTS" | wc -l) result.json files"
tar czf /tmp/skills-eval-results.tar.gz \
-C /tmp/skill-eval/results "${{ github.run_id }}"
echo "archived $(echo "$RESULTS" | wc -l) result.json files for run ${{ github.run_id }}"
else
echo "results dir exists but empty — nothing to archive"
echo "results dir exists for this run but no result.json files — nothing to archive"
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear stale tarball before conditional archive creation.

At Line 187, this step may skip creating a new archive, but /tmp/skills-eval-results.tar.gz can persist on a self-hosted runner and then be uploaded as current-run output at Line 211.

Suggested patch
       - name: Collect results for artifact
         if: always() && (github.event_name != 'push' || steps.changes.outputs.skills == 'true')
         run: |
+          rm -f /tmp/skills-eval-results.tar.gz
           RUN_DIR="/tmp/skill-eval/results/${{ github.run_id }}"
           if [ ! -d "$RUN_DIR" ]; then
             echo "no results subdir for this run (${{ github.run_id }}) — nothing to archive"
             exit 0
           fi
🧰 Tools
🪛 zizmor (1.25.2)

[warning] 187-187: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[warning] 189-189: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[warning] 195-195: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[warning] 195-195: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/skills-eval.yml around lines 187 - 199, The issue is that
a stale `/tmp/skills-eval-results.tar.gz` file can persist on self-hosted
runners from previous runs. When the conditional block checking for result.json
files finds no results and skips archive creation, the old tarball from a prior
run can still be uploaded at line 211. Add a command at the beginning of this
step to remove any existing `/tmp/skills-eval-results.tar.gz` file before the
conditional check, ensuring that only newly created archives get uploaded or
nothing gets uploaded if no results exist in the current run.

@smasurekar smasurekar merged commit 3241f06 into main Jun 15, 2026
9 of 10 checks passed
@smasurekar smasurekar deleted the ci/skills-eval-brev-api-key branch June 15, 2026 09:58
shubhadeepd pushed a commit that referenced this pull request Jun 30, 2026
…icker (#701)

Two independent fixes in one file:

1. Schedule trigger now exports MANUAL_FULL_SWEEP=1 (like
   workflow_dispatch already does). Before this, the schedule event
   fell through to the PR-mode code path in skills_eval_agent.py and
   died with "FATAL: PR_NUMBER not set in environment". Also default
   MANUAL_SKILLS_FILTER to "*" when INPUT_SKILLS is empty (schedule
   events don't populate `inputs`).

2. workflow_dispatch picker now lists the 3 skills that actually exist
   on disk (rag-blueprint, rag-eval, rag-perf) instead of 10 phantom
   names (rag-deploy-blueprint, rag-ingest-documents, etc.) that were
   left over from an old skill taxonomy. The LLM agent already
   enumerates skills/*/eval/*.json itself, so the stale picker was a
   UX trap only — anyone picking one of the old names got a silent
   BLOCKED result. ONBOARDING.md:487-492 documents this as a known
   gotcha; this PR closes the gotcha.

Discovered while diagnosing GHA run 28214488174 (first nightly after
PR #680). The Brev auth step and cleanup step from #680 both passed
cleanly — this is a separate, pre-existing bug surfaced by them
running on schedule for the first time.

Signed-off-by: richa <ricsingh@nvidia.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants