monitoring: add buildkit autoscaling alerts #727
tofu plan (arc-cbr-production-uw1): ✅ Plan succeeded
tofu plan (meta-prod-aws-ue1): ✅ Plan succeeded
- alert: BuildkitQueueBacklog
  expr: |
    haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 0
  for: 15m
I suspect this one will misfire during peak. Say batches of 2 arrive at minutes 0, 3, 10, and 14.
The alert will fire even though they all scale within 7-8 minutes.
Maybe we should use haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 20, or some other value.
Done — bumped it to > 20. With the 15m window that won't fire on normal burst churn (your 0/3/10/14 example stays well under 20 and drains in minutes); it only trips on a real sustained backlog the pool isn't clearing.
Please review the alerts: make sure they use the correct metrics and don't misfire in regular non-failure scenarios.
Claude finished @jeanschmidt's task in 6m 43s
Review: buildkit autoscaling alerts
I focused on two things you asked about: are the metrics correct, and will they misfire on normal load. I found one definite blocker (queue alert can never fire), two likely-wrong KEDA metric names, and confirmed jeanschmidt's misfire concerns are valid. Details below.
🔴 1.
| Alert | Used (line) | Likely correct |
|---|---|---|
| BuildkitKedaScalerErrors | keda_scaler_errors_total (:16) | keda_scaler_errors |
| BuildkitKedaScaledObjectErrors | keda_scaledobject_errors_total (:28) | keda_scaled_object_errors |
Why this matters:
- keda_scaler_errors_total does exist but is the unlabeled grand total (no scaledObject label). sum by (scaledObject)(...) would still fire, but {{ $labels.scaledObject }} in the summary/description renders empty. The labeled per-scaler counter is keda_scaler_errors (labels include scaledObject, scaler, metric).
- keda_scaledobject_errors_total almost certainly doesn't exist: the subsystem is scaled_object, so the series is keda_scaled_object_errors. If the name is wrong the alert silently never fires (same failure mode as #1, but harder to notice).
Neither metric is affected by a scrape filter (the keda ServiceMonitor keeps keda_.*), so this is purely a naming issue. Please confirm the exact names, either by querying Mimir (group by(__name__)({__name__=~"keda_scaler.*|keda_scaled.*"})) or by running kubectl port-forward to the keda-operator metrics port and grepping /metrics, then update both alerts.
🟠 3. Misfire risk — jeanschmidt is right, and it applies to all three alerts
BuildkitQueueBacklog (once #1 is fixed). haproxy_backend_current_queue is the instantaneous queue depth (qcur). With maxconn 1 per server (buildkit/.../haproxy.yaml:74-82), a request queues the moment all pods are busy — that's the designed behavior, not a failure. > 0 for 15m only requires the queue to be continuously non-empty, which jeanschmidt's example (batches at min 0/3/10/14) hits during healthy peak even though each build scales in 7–8 min. I'd take his suggestion of a depth threshold:
haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 20
A threshold says "more work is piling up than the pool can absorb," which is the actual symptom you care about (pool at/near max and not keeping up), rather than "the queue did its job." Tune the 20 to roughly maxReplicaCount × targetValue headroom per arch.
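As a rough sketch, the rule with the depth threshold applied could look like the fragment below. The alert name, expression, and 15m window come from this PR; the severity value and the annotation wording are placeholders of mine, and 20 is the untuned starting point discussed above rather than a measured value.

```yaml
# Sketch only: one entry in a PrometheusRule group's rules list.
- alert: BuildkitQueueBacklog
  expr: |
    haproxy_backend_current_queue{proxy=~"bk_amd64|bk_arm64"} > 20
  # The queue must stay above 20 for the whole window,
  # so short bursts that drain in a few minutes do not fire.
  for: 15m
  labels:
    severity: warning  # placeholder, not taken from the PR
  annotations:
    # proxy is the label already used in the selector above.
    summary: "Buildkit queue depth above 20 on {{ $labels.proxy }} for 15m"
```

Because the comparison keeps the proxy label, the alert fires per architecture, which fits tuning the threshold per arch.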
BuildkitKedaScalerErrors / ...ScaledObjectErrors. increase(...[15m]) > 0 combined with for: 10m fires on a single transient error — one error keeps increase above 0 for the full 15m window, which trivially satisfies for: 10m. But the fallback you're trying to detect only trips after failureThreshold: 3 (generate_buildkit.py:524). So a single recovered blip (e.g. one missed scrape of the LB metrics endpoint during a pod roll) pages even though autoscaling never degraded. Align the threshold with what actually causes harm:
sum by (scaledObject) (increase(keda_scaler_errors[15m])) > 3
That matches the failureThreshold that trips the fallback and filters out single transient errors.
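A minimal sketch of the scaler-error rule with that threshold, assuming keda_scaler_errors turns out to be the correct labeled counter once checked against the live operator. The for: 10m is carried over from the current rule; the severity value and annotation text are placeholders.

```yaml
# Sketch only: metric name is pending confirmation against the live operator.
- alert: BuildkitKedaScalerErrors
  expr: |
    sum by (scaledObject) (increase(keda_scaler_errors[15m])) > 3
  for: 10m
  labels:
    severity: warning  # placeholder, not taken from the PR
  annotations:
    # scaledObject survives the aggregation because of the "sum by" clause.
    summary: "KEDA scaler errors for {{ $labels.scaledObject }}"
```

One caveat: increase() extrapolates across the window, so the result can be fractional and the comparison is against an estimate rather than an exact error count.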
✅ What looks good
- Routing the alerts through a PrometheusRule CRD + adding to alerts/kustomization.yaml is the right wiring (synced to Grafana Cloud via mimir.rules.kubernetes).
- 60s-friendly: alerts evaluate on existing scraped series, no new sub-60s monitors. Consistent with the module's severity/team/priority label convention.
- keda_.* is already kept by the keda ServiceMonitor, so once the names in #2 are corrected the data path is fine.
Suggested order
- Add haproxy_backend_current_queue to the buildkit-haproxy ServiceMonitor allowlist (#1).
- Confirm & fix the two KEDA metric names against the live operator (#2).
- Switch the queue alert to a depth threshold and bump the KEDA error threshold to > 3 (#3).
Note: I reviewed only; I haven't changed any files. Happy to apply any of the above if you want — just say which.
tofu plan (arc-cbr-production): ✅ Plan succeeded
Thanks — all three fixed (verified against KEDA
Stack from ghstack (oldest at bottom):
KEDA scaler/scaledobject errors (fallback risk) + HAProxy queue backlog
(pool not scaling fast enough). Uses metrics from the KEDA ServiceMonitor
and the existing buildkit-haproxy scrape.