@@ -0,0 +1,51 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: buildkit-autoscaling-alerts
  namespace: monitoring
  labels:
    app.kubernetes.io/part-of: osdc-monitoring
spec:
  groups:
    - name: buildkit-autoscaling
      rules:
        # KEDA can't read the scale metric; if that persists past the ScaledObject's
        # failureThreshold, KEDA drops to the fixed fallback pool instead of scaling.
        - alert: BuildkitKedaScalerErrors
          expr: |
            sum by (scaledObject) (increase(keda_scaler_detail_errors_total[15m])) > 0
          for: 10m
          labels:
            severity: warning
            team: pytorch-dev-infra
            priority: P3
          annotations:
            summary: "KEDA can't read the scale metric for {{ $labels.scaledObject }}"
            description: "KEDA scaler errors for {{ $labels.scaledObject }} over the last 15m; sustained errors trip the fallback to the fixed BuildKit pool."

        - alert: BuildkitKedaScaledObjectErrors
          expr: |
            sum by (scaledObject) (increase(keda_scaled_object_errors_total[15m])) > 0
          for: 10m
          labels:
            severity: warning
            team: pytorch-dev-infra
            priority: P3
          annotations:
            summary: "KEDA ScaledObject {{ $labels.scaledObject }} reconcile errors"
            description: "KEDA failed to reconcile ScaledObject {{ $labels.scaledObject }} in the last 15m; autoscaling for that arch may be stale."

        # A real backlog the pool can't keep up with. The >20 threshold (not >0)
        # avoids firing on normal burst churn, where small batches keep the queue
        # briefly non-zero but still drain within minutes as pods scale up.
        - alert: BuildkitQueueBacklog
          expr: |
            haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 20
          for: 15m

Contributor

I suspect this one will misfire during peak. Say batches of 2 arrive at minutes 0, 3, 10, and 14.

The alert will fire even though scaling catches up with each batch within 7-8 minutes.

Contributor

Maybe we should use haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 20, or some other threshold.

Contributor Author

Done — bumped it to > 20. With the 15m window that won't fire on normal burst churn (your 0/3/10/14 example stays well under 20 and drains in minutes); it only trips on a real sustained backlog the pool isn't clearing.
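
For anyone following the thread, here is a rough Python sketch of the scenario discussed above. It is not how Prometheus is implemented; it only approximates a `for:` clause by requiring the expression to hold at every one-minute evaluation for the whole window. The function name `first_firing_minute`, the one-minute evaluation interval, and the drain model (every build in a batch leaves the queue a fixed number of minutes after it arrives, taking 8 minutes as one reading of the "7-8 minutes" above) are all assumptions made for illustration.

```python
def first_firing_minute(arrivals, drain_minutes, threshold, for_minutes, horizon=60):
    """Return the first minute at which the alert would fire, or None.

    arrivals: {arrival_minute: batch_size}. Hypothetical model: every build
    in a batch stays queued for exactly drain_minutes, then leaves.
    """
    pending_since = None  # minute at which the condition last became true
    for minute in range(horizon + 1):
        # Queue depth = builds from all batches that have arrived but not drained.
        depth = sum(
            size
            for start, size in arrivals.items()
            if start <= minute < start + drain_minutes
        )
        if depth > threshold:
            if pending_since is None:
                pending_since = minute
            # Fire once the condition has held for the full `for` window.
            if minute - pending_since >= for_minutes:
                return minute
        else:
            # Any evaluation below the threshold resets the pending state.
            pending_since = None
    return None


# The reviewer's example: batches of 2 at minutes 0, 3, 10 and 14.
batches = {0: 2, 3: 2, 10: 2, 14: 2}

print(first_firing_minute(batches, drain_minutes=8, threshold=0, for_minutes=15))
print(first_firing_minute(batches, drain_minutes=8, threshold=20, for_minutes=15))
```

Under these assumptions the overlapping batches keep the queue non-zero without a gap, so a `> 0` threshold reaches the 15m window, while the depth never exceeds 4 and a `> 20` threshold stays silent. Real drain times will vary, so treat this as a way to reason about the threshold rather than a prediction.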

          labels:
            severity: warning
            team: pytorch-dev-infra
            priority: P3
          annotations:
            summary: "BuildKit {{ $labels.proxy }} backlog: >20 builds queued for 15m"
            description: "More than 20 builds have been waiting in the {{ $labels.proxy }} queue for 15m, beyond normal burst churn; the pool isn't scaling up fast enough (or is at max)."
@@ -3,6 +3,7 @@ kind: Kustomization

 resources:
   - arc-alerts.yaml
+  - buildkit-autoscaling-alerts.yaml
   - infrastructure-alerts.yaml
   - gpu-alerts.yaml
   - node-compactor-alerts.yaml
@@ -20,4 +20,4 @@ spec:
 # Keep only operationally important HAProxy metrics
 - action: keep
   sourceLabels: [__name__]
-  regex: "haproxy_server_status|haproxy_server_current_sessions|haproxy_server_connection_errors_total|haproxy_backend_current_sessions"
+  regex: "haproxy_server_status|haproxy_server_current_sessions|haproxy_server_connection_errors_total|haproxy_backend_current_sessions|haproxy_backend_current_queue"
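
The relabeling change matters because a `keep` action drops every series whose metric name does not match the regex, so without it the `BuildkitQueueBacklog` expression would have no data to evaluate. The following Python sketch illustrates that under one assumption worth stating: Prometheus anchors relabel regexes at both ends, which `re.fullmatch` mirrors here. Prometheus uses RE2 rather than Python's `re`, but for a plain alternation like this one the two should agree. The helper name `is_kept` is hypothetical.

```python
import re

# Regex before the change, copied from the diff above.
OLD_KEEP = (
    "haproxy_server_status|haproxy_server_current_sessions|"
    "haproxy_server_connection_errors_total|haproxy_backend_current_sessions"
)
# Regex after the change: the queue metric is appended as a new alternative.
NEW_KEEP = OLD_KEEP + "|haproxy_backend_current_queue"


def is_kept(metric_name, keep_regex):
    """Approximate a `keep` relabel action on __name__ (fully anchored match)."""
    return re.fullmatch(keep_regex, metric_name) is not None


metric = "haproxy_backend_current_queue"
print(is_kept(metric, OLD_KEEP))  # dropped at scrape time before the change
print(is_kept(metric, NEW_KEEP))  # retained after the change
```

A quick check like this is also a cheap way to confirm that the alert expression and the scrape filter reference the same metric name.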