Commit 3606074

monitoring: add buildkit autoscaling alerts

KEDA scaler/scaledobject errors (fallback risk) + HAProxy queue backlog (pool not scaling fast enough). Uses metrics from the KEDA ServiceMonitor and the existing buildkit-haproxy scrape.

ghstack-source-id: 3400f35
Pull-Request: #727
1 parent e1e5703 commit 3606074

3 files changed

Lines changed: 53 additions & 1 deletion

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: buildkit-autoscaling-alerts
+  namespace: monitoring
+  labels:
+    app.kubernetes.io/part-of: osdc-monitoring
+spec:
+  groups:
+    - name: buildkit-autoscaling
+      rules:
+        # KEDA can't read the scale metric — if it persists past the ScaledObject's
+        # failureThreshold, KEDA drops to the fixed fallback pool instead of scaling.
+        - alert: BuildkitKedaScalerErrors
+          expr: |
+            sum by (scaledObject) (increase(keda_scaler_detail_errors_total[15m])) > 0
+          for: 10m
+          labels:
+            severity: warning
+            team: pytorch-dev-infra
+            priority: P3
+          annotations:
+            summary: "KEDA can't read the scale metric for {{ $labels.scaledObject }}"
+            description: "KEDA scaler errors for {{ $labels.scaledObject }} over the last 15m; sustained errors trip the fallback to the fixed BuildKit pool."
+
+        - alert: BuildkitKedaScaledObjectErrors
+          expr: |
+            sum by (scaledObject) (increase(keda_scaled_object_errors_total[15m])) > 0
+          for: 10m
+          labels:
+            severity: warning
+            team: pytorch-dev-infra
+            priority: P3
+          annotations:
+            summary: "KEDA ScaledObject {{ $labels.scaledObject }} reconcile errors"
+            description: "KEDA failed to reconcile ScaledObject {{ $labels.scaledObject }} in the last 15m; autoscaling for that arch may be stale."
+
+        # A real backlog the pool can't keep up with. The >20 threshold (not >0)
+        # avoids firing on normal burst churn, where small batches keep the queue
+        # briefly non-zero but still drain within minutes as pods scale up.
+        - alert: BuildkitQueueBacklog
+          expr: |
+            haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 20
+          for: 15m
+          labels:
+            severity: warning
+            team: pytorch-dev-infra
+            priority: P3
+          annotations:
+            summary: "BuildKit {{ $labels.proxy }} backlog: >20 builds queued for 15m"
+            description: "More than 20 builds have been waiting in the {{ $labels.proxy }} queue for 15m — beyond normal burst churn; the pool isn't scaling up fast enough (or is at max)."
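
The threshold reasoning in the BuildkitQueueBacklog comment can be illustrated with a small Python model of how a `for:` clause gates an alert. This is a simplified sketch, not Prometheus code. It assumes one rule evaluation per minute (the evaluation interval is not stated in this commit), and the function name `firing_states` and the sample queue depths are hypothetical.

```python
from typing import List


def firing_states(samples: List[float], threshold: float, for_steps: int) -> List[str]:
    """Model an alert with `expr: metric > threshold` and a `for:` clause.

    Assumption: one sample per evaluation, one evaluation per minute, so
    `for_steps=15` stands in for `for: 15m`.
    """
    states = []
    held = 0  # consecutive evaluations where the expression was true
    for value in samples:
        if value > threshold:
            held += 1
            # The first true evaluation starts the pending period; the alert
            # fires once the condition has held for `for_steps` further intervals.
            states.append("firing" if held > for_steps else "pending")
        else:
            held = 0  # any false evaluation resets the pending timer
            states.append("inactive")
    return states


# Hypothetical queue depths, one per minute.
burst_churn = [3, 8, 2, 5] * 5   # small batches: non-zero for 20 minutes, never above 20
real_backlog = [30] * 20         # sustained backlog above the threshold

churn_at_20 = firing_states(burst_churn, threshold=20, for_steps=15)
churn_at_0 = firing_states(burst_churn, threshold=0, for_steps=15)
backlog_at_20 = firing_states(real_backlog, threshold=20, for_steps=15)
```

In this model a `> 0` threshold would eventually fire on the churn series, while `> 20` stays inactive for it and still fires on the sustained backlog.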

osdc/modules/monitoring/kubernetes/alerts/kustomization.yaml

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ kind: Kustomization
 
 resources:
   - arc-alerts.yaml
+  - buildkit-autoscaling-alerts.yaml
   - infrastructure-alerts.yaml
   - gpu-alerts.yaml
   - node-compactor-alerts.yaml

osdc/modules/monitoring/kubernetes/monitors/servicemonitors/buildkit-haproxy.yaml

Lines changed: 1 addition & 1 deletion
@@ -20,4 +20,4 @@ spec:
         # Keep only operationally important HAProxy metrics
         - action: keep
           sourceLabels: [__name__]
-          regex: "haproxy_server_status|haproxy_server_current_sessions|haproxy_server_connection_errors_total|haproxy_backend_current_sessions"
+          regex: "haproxy_server_status|haproxy_server_current_sessions|haproxy_server_connection_errors_total|haproxy_backend_current_sessions|haproxy_backend_current_queue"
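
The one-line regex change matters because a `keep` relabel action drops every series whose name does not match, so the queue alert would have had no data under the old list. The Python sketch below illustrates that effect. It is an approximation: Prometheus uses RE2 and anchors relabel regexes at both ends, which `re.fullmatch` mimics for plain alternations like these. The helper name `kept` is hypothetical.

```python
import re

# Keep-list before and after this commit, copied from the diff above.
OLD_KEEP = (
    "haproxy_server_status|haproxy_server_current_sessions|"
    "haproxy_server_connection_errors_total|haproxy_backend_current_sessions"
)
NEW_KEEP = OLD_KEEP + "|haproxy_backend_current_queue"


def kept(metric_name: str, keep_regex: str) -> bool:
    """Return True if a `keep` rule on __name__ would retain this metric.

    Prometheus anchors relabel regexes (as if wrapped in ^(?:...)$), so a
    full match is required; a prefix or substring match is not enough.
    """
    return re.fullmatch(keep_regex, metric_name) is not None


queue_metric = "haproxy_backend_current_queue"
dropped_before = not kept(queue_metric, OLD_KEEP)  # alert expr would see no series
kept_after = kept(queue_metric, NEW_KEEP)          # series now reaches Prometheus
```

Any HAProxy metric outside the list, for example `haproxy_backend_http_responses_total`, is still dropped under both versions.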
