Changes from 2 commits
@@ -0,0 +1,49 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: buildkit-autoscaling-alerts
  namespace: monitoring
  labels:
    app.kubernetes.io/part-of: osdc-monitoring
spec:
  groups:
    - name: buildkit-autoscaling
      rules:
        # KEDA can't read the scale metric — if it persists past the ScaledObject's
        # failureThreshold, KEDA drops to the fixed fallback pool instead of scaling.
        - alert: BuildkitKedaScalerErrors
          expr: |
            sum by (scaledObject) (increase(keda_scaler_errors_total[15m])) > 0
          for: 10m
          labels:
            severity: warning
            team: pytorch-dev-infra
            priority: P3
          annotations:
            summary: "KEDA can't read the scale metric for {{ $labels.scaledObject }}"
            description: "KEDA scaler errors for {{ $labels.scaledObject }} over the last 15m; sustained errors trip the fallback to the fixed BuildKit pool."

        - alert: BuildkitKedaScaledObjectErrors
          expr: |
            sum by (scaledObject) (increase(keda_scaledobject_errors_total[15m])) > 0
          for: 10m
          labels:
            severity: warning
            team: pytorch-dev-infra
            priority: P3
          annotations:
            summary: "KEDA ScaledObject {{ $labels.scaledObject }} reconcile errors"
            description: "KEDA failed to reconcile ScaledObject {{ $labels.scaledObject }} in the last 15m; autoscaling for that arch may be stale."

        # Builds stuck waiting for a pod — the pool isn't scaling up fast enough.
        - alert: BuildkitQueueBacklog
          expr: |
            haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 0
          for: 15m

Contributor


I suspect this one will misfire during peak. Say batches of 2 arrive at minutes 0, 3, 10, and 14.

The alert will fire even if every batch is scaled for within 7-8 minutes, because overlapping batches keep the queue above 0 for the whole 15m.

Contributor


Maybe we should use haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 20, or some other threshold.

Contributor Author


Done, bumped it to > 20. With the 15m `for:` duration, normal burst churn won't fire it (your 0/3/10/14 example stays well under 20 and drains in minutes); it only trips on a real, sustained backlog that the pool isn't clearing.
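
For reference, a sketch of how the revised rule would read, assuming only the threshold changed and the `for:` duration and labels stay as in the diff shown here (the annotations are omitted and assumed unchanged):

```yaml
# Sketch of the revised rule; assumes only the threshold moved from 0 to 20.
- alert: BuildkitQueueBacklog
  expr: |
    haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 20
  # The expression must hold at every evaluation for 15m before the alert fires;
  # any evaluation where the queue is at or below 20 resets the pending timer.
  for: 15m
  labels:
    severity: warning
    team: pytorch-dev-infra
    priority: P3
  # annotations: assumed unchanged from the original rule
```

Since the expression is evaluated per `proxy` series, each of `bk_amd64` and `bk_arm64` would need its own queue above 20 for the full 15m to fire.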

          labels:
            severity: warning
            team: pytorch-dev-infra
            priority: P3
          annotations:
            summary: "BuildKit {{ $labels.proxy }} has builds queued for 15m"
            description: "Builds have been waiting in the {{ $labels.proxy }} queue for 15m — the pool isn't scaling up fast enough (or is at max) to meet demand."
@@ -3,6 +3,7 @@ kind: Kustomization

resources:
- arc-alerts.yaml
- buildkit-autoscaling-alerts.yaml
- infrastructure-alerts.yaml
- gpu-alerts.yaml
- node-compactor-alerts.yaml