Skip to content
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 36 additions & 5 deletions .github/workflows/skills-eval.yml
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,27 @@ jobs:
- '.github/workflows/skills-eval.yml'
- 'ci/run_skill_eval.sh'

- name: Authenticate Brev CLI (long-lived API key)
# Mints fresh Brev auth at the start of every run so the workflow
# does not depend on persistent ~/.brev/credentials.json state on
# the self-hosted runner. Fails loudly upfront if the secret is
# missing or wrong — instead of failing 4h later at cleanup.
if: github.event_name != 'push' || steps.changes.outputs.skills == 'true'
env:
BREV_API_KEY: ${{ secrets.BREV_API_KEY }}
BREV_ORG_ID: org-2wW9qP0o1LFbMHyE2UnAFPIil8H
run: |
if [ -z "$BREV_API_KEY" ]; then
echo "::error::BREV_API_KEY secret not set"
exit 1
fi
brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
if ! brev ls >/dev/null 2>&1; then
Comment on lines +139 to +140

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add hard timeouts for Brev CLI calls.

A hung Brev API/CLI call can consume the entire job timeout and prevent deleting remaining instances. Add per-call timeout guards and keep aggregating failures.

Suggested patch
-          brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
-          if ! brev ls >/dev/null 2>&1; then
+          timeout 60s brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"
+          if ! timeout 30s brev ls >/dev/null 2>&1; then
             echo "::error::Brev auth failed: brev ls returned non-zero after login"
             exit 1
           fi
@@
-            if brev delete "$INSTANCE"; then
+            if timeout 180s brev delete "$INSTANCE"; then
               echo "OK Deleted $INSTANCE"
             else
-              echo "::error::Failed to delete $INSTANCE (exit $?)"
+              rc=$?
+              if [ "$rc" -eq 124 ]; then
+                echo "::error::Timed out deleting $INSTANCE"
+              else
+                echo "::error::Failed to delete $INSTANCE (exit $rc)"
+              fi
               FAILED=1
             fi

Also applies to: 217-223

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/skills-eval.yml around lines 132 - 133, The Brev CLI calls
(e.g., the commands shown via "brev login --api-key \"$BREV_API_KEY\" --org-id
\"$BREV_ORG_ID\"" and "brev ls") need per-call hard timeouts to avoid hung jobs;
wrap each Brev invocation with the POSIX timeout utility (or an equivalent
timeout function) and capture non-zero/timeouts as failures while continuing to
run remaining cleanup steps, and apply the same change to the other Brev calls
referenced around the later block (the calls in the 217-223 region) so each CLI
call fails fast and aggregates failure status rather than blocking the whole
job.

echo "::error::Brev auth failed: brev ls returned non-zero after login"
exit 1
fi
echo "Brev auth OK"

- name: Run skills eval agent
id: agent
if: github.event_name != 'push' || steps.changes.outputs.skills == 'true'
Expand Down Expand Up @@ -176,8 +197,11 @@ jobs:
if-no-files-found: ignore
retention-days: 7

- name: Delete GPU Brev VMs on success
if: success() && (github.event_name != 'push' || steps.changes.outputs.skills == 'true')
- name: Delete GPU Brev VMs
# Run on success AND failure — agent crashes are exactly when GPU
# cleanup is most needed. Process substitution (not pipe-to-while)
# so FAILED=1 inside the loop is visible after the loop exits.
if: always() && (github.event_name != 'push' || steps.changes.outputs.skills == 'true')
run: |
RECORD="/tmp/brev/started-by-${{ github.run_id }}.txt"
if [ ! -f "$RECORD" ]; then
Expand All @@ -186,11 +210,18 @@ jobs:
fi
echo "Waiting 5 min cooldown before deleting VMs..."
sleep 300
sort -u "$RECORD" | while read -r INSTANCE; do
FAILED=0
while read -r INSTANCE; do
[ -z "$INSTANCE" ] && continue
echo "Deleting Brev VM: $INSTANCE"
brev delete "$INSTANCE" 2>/dev/null && echo "Deleted $INSTANCE" || echo "Could not delete $INSTANCE (may already be gone)"
done
if brev delete "$INSTANCE"; then
echo "OK Deleted $INSTANCE"
else
echo "::error::Failed to delete $INSTANCE (exit $?)"
FAILED=1
fi
done < <(sort -u "$RECORD")
exit $FAILED

- name: Skip note when no skills/ changes
if: github.event_name == 'push' && steps.changes.outputs.skills != 'true'
Expand Down
Loading