ci(skills-eval): mint Brev auth per-run + fix silent cleanup leak#680
Conversation
Adds a long-lived API-key auth step at the start of every workflow run
so brev calls never fail on stale ~/.brev/credentials.json. The
self-hosted runner's refresh_token expired after ~14 days, leaving
brev delete unauthenticated, and the old cleanup loop swallowed the
error silently (2>/dev/null + ||) and exited 0.
Also two cleanup fixes that are independent of the auth change:
- if: success() -> if: always(). Agent crashes are exactly when
GPU cleanup is most needed.
- Replace sort -u | while ... done (loop runs in subshell, FAILED
flag was lost) with while ... done < <(sort -u ...) so the loop
body runs in the parent shell and exit $FAILED actually fails the
step on any brev delete error.
Background: GHA run 26894711065 leaked an H100x2 (rag-eval-gpu-0491531)
for 12+ hours because all three failure modes above stacked.
Requires repo secret BREV_API_KEY to be set; the step fails loudly with
::error::BREV_API_KEY secret not set if it's missing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: richa <ricsingh@nvidia.com>
📝 WalkthroughWalkthroughThis PR enhances the Skills Eval workflow with Brev CLI authentication for non-push triggers or skill changes, improves pre-checkout Docker cleanup to handle root-owned volume chains, narrows results archiving to run-specific subdirectories, and strengthens GPU VM cleanup to run always while properly propagating deletion failures via process substitution and error logging. ChangesSkills Eval Workflow: Brev CLI Authentication and Robust Cleanup
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/skills-eval.yml:
- Around line 132-133: The Brev CLI calls (e.g., the commands shown via "brev
login --api-key \"$BREV_API_KEY\" --org-id \"$BREV_ORG_ID\"" and "brev ls") need
per-call hard timeouts to avoid hung jobs; wrap each Brev invocation with the
POSIX timeout utility (or an equivalent timeout function) and capture
non-zero/timeouts as failures while continuing to run remaining cleanup steps,
and apply the same change to the other Brev calls referenced around the later
block (the calls in the 217-223 region) so each CLI call fails fast and
aggregates failure status rather than blocking the whole job.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 1d48ea5a-a203-4a85-b49d-4863e227d9b0
📒 Files selected for processing (1)
.github/workflows/skills-eval.yml
| brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID" | ||
| if ! brev ls >/dev/null 2>&1; then |
There was a problem hiding this comment.
Add hard timeouts for Brev CLI calls.
A hung Brev API/CLI call can consume the entire job timeout and prevent deleting remaining instances. Add per-call timeout guards and keep aggregating failures.
Suggested patch
- brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
- if ! brev ls >/dev/null 2>&1; then
+ timeout 60s brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
+ if ! timeout 30s brev ls >/dev/null 2>&1; then
echo "::error::Brev auth failed: brev ls returned non-zero after login"
exit 1
fi
@@
- if brev delete "$INSTANCE"; then
+ if timeout 180s brev delete "$INSTANCE"; then
echo "OK Deleted $INSTANCE"
else
- echo "::error::Failed to delete $INSTANCE (exit $?)"
+ rc=$?
+ if [ "$rc" -eq 124 ]; then
+ echo "::error::Timed out deleting $INSTANCE"
+ else
+ echo "::error::Failed to delete $INSTANCE (exit $rc)"
+ fi
FAILED=1
fiAlso applies to: 217-223
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/skills-eval.yml around lines 132 - 133, The Brev CLI calls
(e.g., the commands shown via "brev login --api-key \"$BREV_API_KEY\" --org-id
\"$BREV_ORG_ID\"" and "brev ls") need per-call hard timeouts to avoid hung jobs;
wrap each Brev invocation with the POSIX timeout utility (or an equivalent
timeout function) and capture non-zero/timeouts as failures while continuing to
run remaining cleanup steps, and apply the same change to the other Brev calls
referenced around the later block (the calls in the 217-223 region) so each CLI
call fails fast and aggregates failure status rather than blocking the whole
job.
…esults
Two cleanup-pipeline fixes uncovered while investigating why every nightly
since 2026-06-04 has been uploading a byte-identical 06-03 artifact:
1. Pre-checkout cleanup couldn't remove root-owned orphan dirs.
The 06-03 trial agent wrote a custom compose file with a relative
bind mount source `./deploy/compose/seaweedfs-config/s3.json` from
inside `ci/`. Docker (running as root) did mkdir -p on the chain,
creating empty root-owned `ci/deploy/` and `ci/deploy/compose/`.
The old cleanup did `docker run -v $TARGET:/target alpine rm -rf
/target/*` (clears contents, fine) then `rm -rf $TARGET || true`
(fails — ubuntu can't write to a root-owned parent, error swallowed).
Result: `actions/checkout clean: true` hits EACCES on rmdir, every
nightly checkout has failed for 12+ days.
Fix: mount the PARENT dir into alpine and `rm -rf /parent/<name>`.
Docker is root inside the container → can remove root-owned dirs
by name regardless of parent ownership. Also add `ci/deploy` to
the loop so the orphan parent gets cleaned alongside its child.
2. Collect step tarred ALL of /tmp/skill-eval/results/, not just
this run's subdir.
The runner's disk persists between jobs. If checkout fails →
agent skipped → no fresh results, but the Collect step still
found 06-03's leftover subdirs on disk, tarred them up, and the
Upload step labeled the artifact with today's run_id. Nine days
of fake "passing" nightly artifacts.
Fix: tar only `/tmp/skill-eval/results/${GITHUB_RUN_ID}/`. If the
agent didn't write that subdir (because it never ran), there's
nothing to archive and the step exits 0 with a clear message.
Upload's `if-no-files-found: ignore` handles the missing tar.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: richa <ricsingh@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/skills-eval.yml:
- Around line 187-199: The issue is that a stale
`/tmp/skills-eval-results.tar.gz` file can persist on self-hosted runners from
previous runs. When the conditional block checking for result.json files finds
no results and skips archive creation, the old tarball from a prior run can
still be uploaded at line 211. Add a command at the beginning of this step to
remove any existing `/tmp/skills-eval-results.tar.gz` file before the
conditional check, ensuring that only newly created archives get uploaded or
nothing gets uploaded if no results exist in the current run.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 9331150d-c6a3-490e-b059-5198dc8287be
📒 Files selected for processing (1)
.github/workflows/skills-eval.yml
| RUN_DIR="/tmp/skill-eval/results/${{ github.run_id }}" | ||
| if [ ! -d "$RUN_DIR" ]; then | ||
| echo "no results subdir for this run (${{ github.run_id }}) — nothing to archive" | ||
| exit 0 | ||
| fi | ||
| RESULTS=$(find /tmp/skill-eval/results -maxdepth 3 -name "result.json" 2>/dev/null | head -50 || true) | ||
| RESULTS=$(find "$RUN_DIR" -maxdepth 3 -name "result.json" 2>/dev/null | head -50 || true) | ||
| if [ -n "$RESULTS" ]; then | ||
| tar czf /tmp/skills-eval-results.tar.gz -C /tmp/skill-eval results | ||
| echo "archived $(echo "$RESULTS" | wc -l) result.json files" | ||
| tar czf /tmp/skills-eval-results.tar.gz \ | ||
| -C /tmp/skill-eval/results "${{ github.run_id }}" | ||
| echo "archived $(echo "$RESULTS" | wc -l) result.json files for run ${{ github.run_id }}" | ||
| else | ||
| echo "results dir exists but empty — nothing to archive" | ||
| echo "results dir exists for this run but no result.json files — nothing to archive" | ||
| fi |
There was a problem hiding this comment.
Clear stale tarball before conditional archive creation.
At Line 187, this step may skip creating a new archive, but /tmp/skills-eval-results.tar.gz can persist on a self-hosted runner and then be uploaded as current-run output at Line 211.
Suggested patch
- name: Collect results for artifact
if: always() && (github.event_name != 'push' || steps.changes.outputs.skills == 'true')
run: |
+ rm -f /tmp/skills-eval-results.tar.gz
RUN_DIR="/tmp/skill-eval/results/${{ github.run_id }}"
if [ ! -d "$RUN_DIR" ]; then
echo "no results subdir for this run (${{ github.run_id }}) — nothing to archive"
exit 0
fi🧰 Tools
🪛 zizmor (1.25.2)
[warning] 187-187: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[warning] 189-189: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[warning] 195-195: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[warning] 195-195: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/skills-eval.yml around lines 187 - 199, The issue is that
a stale `/tmp/skills-eval-results.tar.gz` file can persist on self-hosted
runners from previous runs. When the conditional block checking for result.json
files finds no results and skips archive creation, the old tarball from a prior
run can still be uploaded at line 211. Add a command at the beginning of this
step to remove any existing `/tmp/skills-eval-results.tar.gz` file before the
conditional check, ensuring that only newly created archives get uploaded or
nothing gets uploaded if no results exist in the current run.
…icker (#701) Two independent fixes in one file: 1. Schedule trigger now exports MANUAL_FULL_SWEEP=1 (like workflow_dispatch already does). Before this, the schedule event fell through to the PR-mode code path in skills_eval_agent.py and died with "FATAL: PR_NUMBER not set in environment". Also default MANUAL_SKILLS_FILTER to "*" when INPUT_SKILLS is empty (schedule events don't populate `inputs`). 2. workflow_dispatch picker now lists the 3 skills that actually exist on disk (rag-blueprint, rag-eval, rag-perf) instead of 10 phantom names (rag-deploy-blueprint, rag-ingest-documents, etc.) that were left over from an old skill taxonomy. The LLM agent already enumerates skills/*/eval/*.json itself, so the stale picker was a UX trap only — anyone picking one of the old names got a silent BLOCKED result. ONBOARDING.md:487-492 documents this as a known gotcha; this PR closes the gotcha. Discovered while diagnosing GHA run 28214488174 (first nightly after PR #680). The Brev auth step and cleanup step from #680 both passed cleanly — this is a separate, pre-existing bug surfaced by them running on schedule for the first time. Signed-off-by: richa <ricsingh@nvidia.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Adds a long-lived API-key auth step at the start of every workflow run so brev calls never fail on stale ~/.brev/credentials.json. The self-hosted runner's refresh_token expired after ~14 days, leaving brev delete unauthenticated, and the old cleanup loop swallowed the error silently (2>/dev/null + ||) and exited 0.
Also two cleanup fixes that are independent of the auth change:
if: success() -> if: always(). Agent crashes are exactly when GPU cleanup is most needed.
Replace sort -u | while ... done (loop runs in subshell, FAILED flag was lost) with while ... done < <(sort -u ...) so the loop body runs in the parent shell and exit $FAILED actually fails the step on any brev delete error.
Background: GHA run 26894711065 leaked an H100x2 (rag-eval-gpu-0491531) for 12+ hours because all three failure modes above stacked.
Requires repo secret BREV_API_KEY to be set; the step fails loudly with ::error::BREV_API_KEY secret not set if it's missing.
Description
Checklist
git commit -s) and GPG signed (git commit -S).Summary by CodeRabbit
Bug Fixes
Chores