BuildKit autoscaling (staging + prod): in-cluster KEDA + LB queue + warm baseline #701

Closed
huydhn wants to merge 4 commits into pytorch:main from huydhn:buildkit-autoscaling-keda

@huydhn huydhn commented Jun 5, 2026

Contributor

Autoscale the per-arch buildkitd pools so ciflow/docker bursts don't overload the existing pods, scaling back to a small warm baseline when idle. Opt-in via buildkit.autoscaling.enabled; clusters without it are unchanged.

How

  • One build per pod — HAProxy server maxconn 1 (matches max-parallelism=1) + timeout queue 60m. Excess builds queue and flow onto new pods as they register, instead of stacking on busy pods — so scaled-up pods don't sit idle (the burst problem).
  • In-cluster scale signal — KEDA ScaledObject per arch via metrics-api on the LB's own metrics (haproxy_backend_current_sessions). No Grafana / external metrics backend.
  • Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps the burst and sizes the NodePool limits.
  • Kill-free scale-down: preStop drain (waits for the pod's :1234 to go idle) + PDB + long terminationGracePeriodSeconds.
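
The one-build-per-pod queueing in the first bullet can be illustrated with a toy model. This is only a sketch in Python, not HAProxy's actual scheduler: the `Backend` class, the pod names, and the first-free-pod selection are assumptions made for illustration. It shows excess builds waiting in a queue and moving onto a new pod as soon as it registers, rather than stacking on busy pods.

```python
from collections import deque


class Backend:
    """Toy model of an LB backend where each server (pod) has a per-server maxconn."""

    def __init__(self, pods, maxconn=1):
        self.maxconn = maxconn
        self.active = {pod: [] for pod in pods}  # pod -> builds in flight
        self.queue = deque()                     # builds waiting for a free slot

    def _free_pod(self):
        # First pod with a spare slot, or None when every pod is at maxconn.
        for pod, builds in self.active.items():
            if len(builds) < self.maxconn:
                return pod
        return None

    def _drain_queue(self):
        # Move queued builds onto pods for as long as a slot is free.
        while self.queue:
            pod = self._free_pod()
            if pod is None:
                break
            self.active[pod].append(self.queue.popleft())

    def submit(self, build):
        self.queue.append(build)
        self._drain_queue()

    def register(self, pod):
        # A scaled-up pod joins the backend and picks up queued work immediately.
        self.active[pod] = []
        self._drain_queue()


lb = Backend(["pod-0", "pod-1"], maxconn=1)
for i in range(5):
    lb.submit(f"build-{i}")
print(lb.active, list(lb.queue))

lb.register("pod-2")
print(lb.active, list(lb.queue))
```

With two pods and five builds, two builds run and three wait; registering `pod-2` moves the oldest queued build onto it and leaves two in the queue.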

Needs the keda module (CRDs).
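
As a rough sketch of how a session-count signal could translate into a replica count, the snippet below applies the usual Kubernetes HPA average-value rule (desired = ceil(total metric / target per pod)), clamped to the pool's min and max. The target of 1 session per pod is an assumption chosen to match one build per pod; the real ScaledObject target, and whether queued connections are counted in the metric, are not stated in this PR. HPA tolerance and stabilization windows are ignored.

```python
import math


def desired_replicas(metric_total, target_per_pod, min_replicas, max_replicas):
    """Simplified HPA-style calculation for an average-value target."""
    raw = math.ceil(metric_total / target_per_pod)
    # Clamp to the warm baseline (min) and the burst cap (max).
    return max(min_replicas, min(max_replicas, raw))


# Staging amd64 pool from this PR: min 2, max 8. Target of 1 is an assumption.
for sessions in (0, 5, 20):
    print(sessions, desired_replicas(sessions, 1, min_replicas=2, max_replicas=8))
```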

Enabled on

  • Staging (arc-staging): amd64 m6id.24xlarge @ 2/node, arm64 m7gd.16xlarge @ 4/node. min 2 / 4, max 8 / 8.
  • Prod (arc-cbr-production): same instances/min. Max sized from 14-day docker-build concurrency — amd64 128 (peak 105 + headroom), arm64 16 (peak 8, likely capped by the old fixed pool). Replaces the old fixed 32 / 8.
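
The pod and node numbers above can be cross-checked with a little arithmetic. The sketch below uses only figures from this description (pods per node, min, max, and the 105-build peak); the helper name `nodes_needed` is made up for illustration, and how the NodePool limits are actually derived in the module is not shown here.

```python
import math

PODS_PER_NODE = {"amd64": 2, "arm64": 4}  # m6id.24xlarge @ 2/node, m7gd.16xlarge @ 4/node


def nodes_needed(arch, pods):
    """Nodes required to host `pods` buildkitd pods of the given arch."""
    return math.ceil(pods / PODS_PER_NODE[arch])


pools = {
    "baseline": {"amd64": 2, "arm64": 4},
    "staging max": {"amd64": 8, "arm64": 8},
    "prod max": {"amd64": 128, "arm64": 16},
}

for name, pool in pools.items():
    print(name, {arch: nodes_needed(arch, n) for arch, n in pool.items()})

# Headroom of the prod amd64 max over the observed 14-day peak of 105.
headroom = pools["prod max"]["amd64"] / 105 - 1
print(f"amd64 headroom: {headroom:.0%}")
```

The baseline comes out at one node per arch, the staging max at 4 amd64 + 2 arm64 nodes, and the prod max at 64 amd64 + 4 arm64 nodes, with roughly 22% amd64 headroom over the peak.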

Integration test

osdc/integration-tests gains a scale test (build-image-scale.yaml): bursts 8 parallel buildctl builds per arch, each holding a maxconn=1 slot ~10m via a sleep. With min=2 the builds serialize (~43m) and the job times out (30m) unless KEDA scales up (~18m, one wave) — so the suite fails if autoscaling doesn't happen.
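
The pass/fail logic of the scale test follows from counting waves. A minimal sketch, assuming each build holds its slot for exactly 10 minutes and ignoring scale-up and node provisioning time (which presumably account for the gap between these idealized figures and the ~43m and ~18m quoted above):

```python
import math

TIMEOUT_MIN = 30  # job timeout from the scale test


def burst_minutes(builds, pods, hold_minutes):
    """Idealized wall-clock time: builds run in waves of `pods` at one build per pod."""
    waves = math.ceil(builds / pods)
    return waves * hold_minutes


no_scaling = burst_minutes(8, 2, 10)    # pool stuck at min=2
with_scaling = burst_minutes(8, 8, 10)  # pool scaled to 8, one wave

print(no_scaling, no_scaling > TIMEOUT_MIN)
print(with_scaling, with_scaling > TIMEOUT_MIN)
```

Four waves of 10 minutes already exceed the 30-minute timeout before any overhead, while a single wave leaves about 20 minutes of slack for scale-up.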

Companion change

A runner-side connect retry in pytorch/pytorch .ci/docker/build.sh lets a build wait for a pod from a cold/queued pool (drafted separately).
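
The companion change itself lives in a shell script and is not shown in this PR. Purely as an illustration of the idea, a connect-retry loop might look like the Python sketch below; the function name `wait_for_builder`, the default deadline, and the polling interval are all hypothetical.

```python
import socket
import time


def wait_for_builder(host, port, deadline_s=3600.0, interval_s=10.0):
    """Return True once a TCP connect to host:port succeeds, False after deadline_s."""
    give_up_at = time.monotonic() + deadline_s
    while True:
        try:
            # A successful connect means a pod (or the LB queue) accepted us.
            with socket.create_connection((host, port), timeout=5.0):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() + interval_s > give_up_at:
                return False
            time.sleep(interval_s)
```

A runner would call something like `wait_for_builder(lb_host, 1234)` before starting the build and fail the job if it returns False; `lb_host` stands in for whatever address the runner uses.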

just lint 13/13 · just test pass.

@huydhn huydhn requested a review from jeanschmidt as a code owner June 5, 2026 03:33
@huydhn huydhn marked this pull request as draft June 5, 2026 03:35
github-merge-queue Bot pushed a commit that referenced this pull request Jun 6, 2026
Adds optional per-arch `buildkit.{amd64,arm64}_replicas` and
`..._pods_per_node` overrides (fall back to the shared
`replicas_per_arch`/`pods_per_node`, so other clusters render
unchanged), then resizes the prod fleet:

- **amd64:** `m6id.24xlarge`, 2/node, 42 vCPU/155 GiB — **32 replicas**
(was 12)
- **arm64:** `m7gd.16xlarge`, 4/node, ~14 vCPU/51 GiB — **8 replicas**
(smaller pods, more of them; ≈ the pre-OSDC `m7g.4xlarge` build runner)

NodePool limits scale per-arch automatically. `just test` pass (97%
cov), `just lint` 13/13.

Independent of the autoscaling work on #701.

---------

Signed-off-by: Huy Do <huydo@meta.com>

jeanschmidt commented Jun 6, 2026

Contributor

This does not solve the burst problem: when a change comes in and triggers Docker rebuilds on multiple CIs, all the connections will be dispatched to the existing buildkit pods and the newer ones will sit idle. We need to find a way to fix this.

@huydhn huydhn force-pushed the buildkit-autoscaling-keda branch 2 times, most recently from 54d6925 to 4a87e0a Compare June 9, 2026 18:36
@huydhn huydhn changed the title from "Autoscale BuildKit builders on docker-build job count via KEDA (staging)" to "BuildKit autoscaling on staging: in-cluster KEDA + LB queue + warm baseline" Jun 9, 2026
@huydhn huydhn force-pushed the buildkit-autoscaling-keda branch from 4a87e0a to d27514a Compare June 9, 2026 19:54
…seline

**Impact:** OSDC arc-staging buildkit only (autoscaling is opt-in; other
clusters unchanged).
**Risk:** low

Absorb ciflow/docker bursts without overloading existing pods, and scale back to
a small warm per-arch baseline when idle.

- HAProxy `server maxconn 1` + `timeout queue`: one build per pod; excess builds
  queue and flow onto new pods as they register, instead of stacking on busy
  pods (so scaled-up pods don't sit idle).
- KEDA ScaledObject per arch via `metrics-api` scraping the LB's own metrics
  (haproxy_backend_current_sessions) — no Grafana / external metrics backend.
- Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps
  the burst and sizes the NodePool limits.
- preStop drain + PDB + long terminationGracePeriodSeconds for kill-free
  scale-down.

staging: amd64 m6id.24xlarge @ 2/node (min 2), arm64 m7gd.16xlarge @ 4/node
(min 4). Runner-side connect retry (separate pytorch/pytorch change) lets a build
tolerate waiting for a pod from a cold/queued pool.

Testing: just lint 13/13, just test pass (generate_buildkit.py 98%).
Signed-off-by: Huy Do <huydo@meta.com>
@huydhn huydhn force-pushed the buildkit-autoscaling-keda branch 6 times, most recently from d7a9356 to d7c3bb4 Compare June 10, 2026 08:03
Signed-off-by: Huy Do <huydo@meta.com>
@huydhn huydhn force-pushed the buildkit-autoscaling-keda branch from d7c3bb4 to 76502b1 Compare June 10, 2026 08:04
huydhn added 2 commits June 10, 2026 01:35
Same min per arch as staging (amd64 2 / arm64 4). Max sized from 14-day
docker-build concurrency: amd64 128 (peak 105 + headroom), arm64 16
(peak 8, likely capped by the old fixed pool).
Burst 8 parallel buildctl builds per arch (each holds a maxconn=1 slot
~10m via sleep). With amd64_min=2 they serialize ~43m > timeout 30m and
fail unless KEDA scales the pool up; one wave ~18m when it does.
@huydhn huydhn changed the title from "BuildKit autoscaling on staging: in-cluster KEDA + LB queue + warm baseline" to "BuildKit autoscaling (staging + prod): in-cluster KEDA + LB queue + warm baseline" Jun 10, 2026
huydhn commented Jun 10, 2026

Contributor Author

Staging validation run

Drove a balanced burst of 8 amd64 + 8 arm64 builds against the staging pool (each held a maxconn=1 slot for ~5m), the same shape as the new integration-test scale test: https://github.com/pytorch/pytorch-canary/actions/runs/27247502866 (16/16 builds succeeded).

BuildKit nodes / pods during the run:

| Phase | amd64 pods | arm64 pods | buildkit nodes |
| --- | --- | --- | --- |
| Baseline (before) | 2 | 4 | 2 (1 amd64 + 1 arm64) |
| Mid-burst (~T+9m) | scaling 2→8 | scaling 4→8 | climbing; 1 pod already draining |
| Peak | 8 | 8 | 6 (4× m6id.24xlarge + 2× m7gd.16xlarge) |
| After (~T+18m) | back to 2 | back to 4 | trailing down |

Observations:

  • Queue worked as intended — every queued buildctl connected and rode the queue onto new pods as they registered; no connect timeouts, so no runner-side wait was needed for this burst.
  • Scale-up — KEDA brought both arches to max (8/8) off the in-cluster HAProxy session metric.
  • Kill-free scale-down — pods drained (preStop waited for :1234 to go idle) and returned to the 2/4 baseline with zero failed builds.
  • Node consolidation lag (expected) — with consolidationPolicy: WhenEmpty, survivor pods left some nodes half-full, so the node count trails the pod count back down rather than dropping immediately. This is the deliberate trade documented in the module README (some idle node cost for zero build disruption).
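
How the preStop hook decides that :1234 is idle is not shown in this thread. One way such a check could work, sketched here in Python as an assumption rather than the actual hook, is to count ESTABLISHED connections on local port 1234 in `/proc/net/tcp` and wait until there are none. The function names, the sample table, and the polling parameters are hypothetical, and IPv6 (`/proc/net/tcp6`) is not covered.

```python
import time

ESTABLISHED = "01"  # TCP state code used in /proc/net/tcp


def active_connections(proc_net_tcp, port=1234):
    """Count ESTABLISHED sockets whose local port matches `port`."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address is HEXIP:HEXPORT, e.g. 0A000005:04D2 for port 1234.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port and fields[3] == ESTABLISHED:
            count += 1
    return count


def drain(read_table, port=1234, poll_s=5.0, max_wait_s=3600.0):
    """Block until the port is idle (True) or max_wait_s elapses (False)."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if active_connections(read_table(), port) == 0:
            return True
        time.sleep(poll_s)
    return False


# Made-up sample: one listener (state 0A) and one in-flight build (state 01).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:04D2 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1
   1: 0A000005:04D2 0A000009:C350 01 00000000:00000000 00:00000000 00000000     0        0 2
"""
print(active_connections(SAMPLE))
```

In a pod, `read_table` would be something like `lambda: open("/proc/net/tcp").read()`, and `max_wait_s` would need to stay below terminationGracePeriodSeconds so the drain finishes before the kubelet kills the container.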

huydhn commented Jun 10, 2026

Contributor Author

Closing this in favour of the stack at #725

@huydhn huydhn closed this Jun 10, 2026