BuildKit autoscaling (staging + prod): in-cluster KEDA + LB queue + warm baseline #701
Closed
huydhn wants to merge 4 commits into
github-merge-queue bot pushed a commit that referenced this pull request on Jun 6, 2026
Adds optional per-arch `buildkit.{amd64,arm64}_replicas` and
`..._pods_per_node` overrides (fall back to the shared
`replicas_per_arch`/`pods_per_node`, so other clusters render
unchanged), then resizes the prod fleet:
- **amd64:** `m6id.24xlarge`, 2/node, 42 vCPU/155 GiB — **32 replicas**
(was 12)
- **arm64:** `m7gd.16xlarge`, 4/node, ~14 vCPU/51 GiB — **8 replicas**
(smaller pods, more of them; ≈ the pre-OSDC `m7g.4xlarge` build runner)
NodePool limits scale per-arch automatically. `just test` pass (97%
cov), `just lint` 13/13.
Independent of the autoscaling work on #701.
---------
Signed-off-by: Huy Do <huydo@meta.com>
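The per-arch fallback described in the commit above can be sketched as follows. This is a minimal illustration, not the code from `generate_buildkit.py`: the flat dict layout, the helper names `resolve_per_arch` and `nodes_needed`, and the shared values in the example are assumptions. Only the key names and the prod replica counts come from the commit message.

```python
import math


def resolve_per_arch(buildkit: dict, arch: str) -> tuple[int, int]:
    """Return (replicas, pods_per_node) for one arch.

    A per-arch override wins when present; otherwise the shared value is
    used, so clusters that set no overrides render unchanged.
    """
    def pick(override_key: str, shared_key: str) -> int:
        value = buildkit.get(override_key)
        return buildkit[shared_key] if value is None else value

    replicas = pick(f"{arch}_replicas", "replicas_per_arch")
    pods_per_node = pick(f"{arch}_pods_per_node", "pods_per_node")
    return replicas, pods_per_node


def nodes_needed(replicas: int, pods_per_node: int) -> int:
    """Nodes required to place all replicas (assumed basis for NodePool limits)."""
    return math.ceil(replicas / pods_per_node)


# Illustrative config: shared defaults are made up, overrides match the commit.
prod = {
    "replicas_per_arch": 12,
    "pods_per_node": 2,
    "amd64_replicas": 32,
    "arm64_replicas": 8,
    "arm64_pods_per_node": 4,
}
other_cluster = {"replicas_per_arch": 4, "pods_per_node": 2}

amd64 = resolve_per_arch(prod, "amd64")
arm64 = resolve_per_arch(prod, "arm64")
unchanged = resolve_per_arch(other_cluster, "amd64")
```

With these inputs amd64 resolves to 32 replicas at 2/node and arm64 to 8 replicas at 4/node, while the cluster without overrides keeps its shared values.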
Contributor
This does not solve the burst problem: when a change comes in and triggers a Docker rebuild on multiple CIs, all the connections are dispatched to the existing buildkit pods and the newer ones sit idle. We need to find a way to fix this.
54d6925 to 4a87e0a
4a87e0a to d27514a
…seline

**Impact:** OSDC arc-staging buildkit only (autoscaling is opt-in; other clusters unchanged).
**Risk:** low

Absorb ciflow/docker bursts without overloading existing pods, and scale back to a small warm per-arch baseline when idle.

- HAProxy `server maxconn 1` + `timeout queue`: one build per pod; excess builds queue and flow onto new pods as they register, instead of stacking on busy pods (so scaled-up pods don't sit idle).
- KEDA ScaledObject per arch via `metrics-api` scraping the LB's own metrics (haproxy_backend_current_sessions); no Grafana / external metrics backend.
- Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps the burst and sizes the NodePool limits.
- preStop drain + PDB + long terminationGracePeriodSeconds for kill-free scale-down.

staging: amd64 m6id.24xlarge @ 2/node (min 2), arm64 m7gd.16xlarge @ 4/node (min 4).

Runner-side connect retry (separate pytorch/pytorch change) lets a build tolerate waiting for a pod from a cold/queued pool.

Testing: just lint 13/13, just test pass (generate_buildkit.py 98%).

Signed-off-by: Huy Do <huydo@meta.com>
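The queueing behaviour this commit relies on can be modelled with a toy in Python. It is a sketch under stated assumptions, not the HAProxy or KEDA implementation: the `Backend` class and `desired_replicas` helper are hypothetical, demand is taken as running plus queued builds (the PR does not say whether `haproxy_backend_current_sessions` counts queued connections), and a target of one session per pod is assumed to mirror `maxconn 1`.

```python
import math
from collections import deque


class Backend:
    """Toy model of one LB backend where every server has maxconn 1."""

    def __init__(self, pods):
        self.running = {pod: None for pod in pods}  # pod -> build, or None if idle
        self.queue = deque()  # builds waiting for a free pod

    def submit(self, build):
        self.queue.append(build)
        self._dispatch()

    def register(self, pod):
        """A scaled-up pod joins the backend and immediately takes queued work."""
        self.running[pod] = None
        self._dispatch()

    def finish(self, pod):
        self.running[pod] = None
        self._dispatch()

    def _dispatch(self):
        # Hand queued builds to idle pods only; busy pods never get a second build.
        for pod, build in self.running.items():
            if build is None and self.queue:
                self.running[pod] = self.queue.popleft()

    def demand(self):
        busy = sum(build is not None for build in self.running.values())
        return busy + len(self.queue)


def desired_replicas(demand, minimum, maximum, per_pod=1):
    """Clamp ceil(demand / per_pod) to [minimum, maximum] (assumed scaling rule)."""
    return max(minimum, min(maximum, math.ceil(demand / per_pod)))


lb = Backend(["amd64-0", "amd64-1"])  # warm baseline of 2 pods
for i in range(8):
    lb.submit(f"build-{i}")
queued_before = len(lb.queue)  # 2 builds running, the rest wait
lb.register("amd64-2")         # new pod picks up the head of the queue
target = desired_replicas(lb.demand(), minimum=2, maximum=8)
```

The point of the model is the contrast with the reviewer's concern above: without a per-server connection cap all eight builds would land on the two existing pods, whereas here the six waiting builds stay in the queue until new pods register.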
d7a9356 to d7c3bb4
Signed-off-by: Huy Do <huydo@meta.com>
d7c3bb4 to 76502b1
Same min per arch as staging (amd64 2 / arm64 4). Max sized from 14-day docker-build concurrency: amd64 128 (peak 105 + headroom), arm64 16 (peak 8, likely capped by the old fixed pool).
Burst 8 parallel buildctl builds per arch (each holds a maxconn=1 slot ~10m via sleep). With amd64_min=2 they serialize ~43m > timeout 30m and fail unless KEDA scales the pool up; one wave ~18m when it does.
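The timing argument in the scale-test commit is wave arithmetic, sketched below. The helper names are hypothetical and the result is only a lower bound on hold time: the ~43m and ~18m figures quoted in the commit also include overhead such as scheduling and scale-up, which this sketch does not model.

```python
import math


def waves(builds: int, slots: int) -> int:
    """Number of sequential rounds when each slot runs one build at a time."""
    return math.ceil(builds / slots)


def min_hold_minutes(builds: int, slots: int, hold_minutes: int) -> int:
    """Lower bound on wall time: rounds multiplied by the per-build hold."""
    return waves(builds, slots) * hold_minutes


TIMEOUT_MINUTES = 30

# amd64 without scale-up: 8 builds on the 2 warm pods, ~10m each.
no_scale = min_hold_minutes(builds=8, slots=2, hold_minutes=10)
# After KEDA scales the pool to 8 pods: a single wave.
scaled = min_hold_minutes(builds=8, slots=8, hold_minutes=10)
# arm64 starts from a warm baseline of 4 pods.
arm64_no_scale = min_hold_minutes(builds=8, slots=4, hold_minutes=10)

fails_without_scaling = no_scale > TIMEOUT_MINUTES
```

Four waves give at least 40 minutes of hold time on amd64, already past the 30-minute timeout before any overhead is added; one wave needs at least 10. By the same arithmetic, arm64 at its baseline of 4 needs two waves, a lower bound of 20 minutes.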
Contributor
Author
Staging validation run
Drove a balanced burst of 8 amd64 + 8 arm64 builds against the staging pool (each held a
BuildKit nodes / pods during the run:
Observations:
Contributor
Author
Closing this in favour of the stack at #725.
Autoscale the per-arch buildkitd pools so `ciflow/docker` bursts don't overload the existing pods, scaling back to a small warm baseline when idle. Opt-in via `buildkit.autoscaling.enabled`; clusters without it are unchanged.

How

- HAProxy `server maxconn 1` (matches `max-parallelism=1`) + `timeout queue 60m`. Excess builds queue and flow onto new pods as they register, instead of stacking on busy pods, so scaled-up pods don't sit idle (the burst problem).
- KEDA `ScaledObject` per arch via `metrics-api` on the LB's own metrics (`haproxy_backend_current_sessions`). No Grafana / external metrics backend.
- Warm baseline: `amd64_min=2` / `arm64_min=4` ≈ 1 physical node each; `*_max` caps the burst and sizes the NodePool limits.
- Kill-free scale-down: `preStop` drain (waits for the pod's `:1234` to go idle) + PDB + long `terminationGracePeriodSeconds`.

Needs the `keda` module (CRDs).

Enabled on

- Staging (`arc-staging`): amd64 `m6id.24xlarge` @ 2/node, arm64 `m7gd.16xlarge` @ 4/node. min 2 / 4, max 8 / 8.
- Prod (`arc-cbr-production`): same instances/min. Max sized from 14-day docker-build concurrency: amd64 128 (peak 105 + headroom), arm64 16 (peak 8, likely capped by the old fixed pool). Replaces the old fixed 32 / 8.

Integration test

`osdc/integration-tests` gains a scale test (`build-image-scale.yaml`): bursts 8 parallel `buildctl` builds per arch, each holding a `maxconn=1` slot ~10m via a sleep. With `min=2` the builds serialize (~43m) and the job times out (30m) unless KEDA scales up (~18m, one wave), so the suite fails if autoscaling doesn't happen.

Companion change

A runner-side connect retry in `pytorch/pytorch` `.ci/docker/build.sh` lets a build wait for a pod from a cold/queued pool (drafted separately).

`just lint` 13/13 · `just test` pass.
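The companion change lives in a shell script that is not shown in this PR, so the following is only a generic retry-with-delay sketch in Python of the idea it describes. The function name `retry_connect`, its defaults, and the choice of which exceptions count as retryable are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_connect(
    connect: Callable[[], T],
    attempts: int = 10,
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts run out.

    Connection failures and timeouts are treated as "no pod yet" and retried
    after a fixed delay; the last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < attempts:
                sleep(delay_s)  # give the pool time to scale up
    assert last_error is not None
    raise last_error


# Example with a fake endpoint that refuses twice, then accepts.
calls = {"n": 0}
sleeps: list[float] = []


def fake_connect() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionRefusedError("no pod available yet")
    return "connected"


result = retry_connect(fake_connect, attempts=5, delay_s=1.0, sleep=sleeps.append)
```

Passing `sleep` as a parameter keeps the example fast to run; a real caller would leave the default and pick a delay that fits the pool's scale-up time.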