BuildKit autoscaling (staging + prod): in-cluster KEDA + LB queue + warm baseline by huydhn · Pull Request #701 · pytorch/ci-infra

huydhn · 2026-06-05T03:33:45Z

Autoscale the per-arch buildkitd pools so ciflow/docker bursts don't overload the existing pods, scaling back to a small warm baseline when idle. Opt-in via buildkit.autoscaling.enabled; clusters without it are unchanged.

How

- One build per pod: HAProxy server maxconn 1 (matches max-parallelism=1) + timeout queue 60m. Excess builds queue and flow onto new pods as they register, instead of stacking on busy pods, so scaled-up pods don't sit idle (the burst problem).
- In-cluster scale signal: KEDA ScaledObject per arch via metrics-api on the LB's own metrics (haproxy_backend_current_sessions). No Grafana / external metrics backend.
- Warm baseline: amd64_min=2 / arm64_min=4 ≈ 1 physical node each; *_max caps the burst and sizes the NodePool limits.
- Kill-free scale-down: preStop drain (waits for the pod's :1234 to go idle) + PDB + long terminationGracePeriodSeconds.
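The queueing behaviour in the first bullet can be illustrated with a toy model. This is only a sketch of the idea (one build per pod, a FIFO queue, new pods taking queued work as they register); the `Pool` class and the pod and build names are hypothetical and are not taken from the HAProxy config or the module code.

```python
from collections import deque


class Pool:
    """Toy model of one backend: each pod serves at most one build (maxconn 1)."""

    def __init__(self, pods):
        self.running = {pod: None for pod in pods}  # pod name -> build or None
        self.queue = deque()                        # builds waiting for a free pod

    def _dispatch(self):
        # Hand queued builds to idle pods, oldest build first.
        for pod, build in self.running.items():
            if build is None and self.queue:
                self.running[pod] = self.queue.popleft()

    def submit(self, build):
        # A new build always enters the queue; it only runs if a pod is idle.
        self.queue.append(build)
        self._dispatch()

    def register(self, pod):
        # A newly scaled-up pod joins idle and immediately takes queued work.
        self.running[pod] = None
        self._dispatch()

    def finish(self, pod):
        # The pod's build ended; free the slot and pull the next queued build.
        self.running[pod] = None
        self._dispatch()


pool = Pool(["amd64-0", "amd64-1"])      # warm baseline of 2 pods
for i in range(8):
    pool.submit(f"build-{i}")
queued_before = len(pool.queue)          # 6 builds wait instead of stacking

for i in range(2, 8):
    pool.register(f"amd64-{i}")          # scale-up adds 6 pods
queued_after = len(pool.queue)           # 0: each new pod took one build
```

In this model the six excess builds never land on the two busy pods, which is the property the reviewer's burst concern is about.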

Needs the keda module (CRDs).

Enabled on

- Staging (arc-staging): amd64 m6id.24xlarge @ 2/node, arm64 m7gd.16xlarge @ 4/node. Min 2 / 4, max 8 / 8 (amd64 / arm64).
- Prod (arc-cbr-production): same instances and min as staging. Max sized from 14-day docker-build concurrency: amd64 128 (peak 105 + headroom), arm64 16 (peak 8, likely capped by the old fixed pool). Replaces the old fixed 32 / 8.
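The pod and node numbers above can be cross-checked with a little arithmetic. This is a sketch that assumes pods pack perfectly at the stated pods-per-node density; the helper names are hypothetical, and how the module actually derives the NodePool limits is not shown in this PR.

```python
import math


def nodes_needed(pods: int, pods_per_node: int) -> int:
    """Nodes required to host `pods` at a fixed per-node density."""
    return math.ceil(pods / pods_per_node)


def headroom_fraction(peak: int, cap: int) -> float:
    """Spare capacity above the observed peak, as a fraction of the peak."""
    return (cap - peak) / peak


# Warm baseline: 2 amd64 pods at 2/node and 4 arm64 pods at 4/node.
baseline_nodes = (nodes_needed(2, 2), nodes_needed(4, 4))      # (1, 1)

# Staging max 8 / 8.
staging_peak_nodes = nodes_needed(8, 2) + nodes_needed(8, 4)   # 4 + 2 = 6

# Prod max 128 / 16.
prod_max_nodes = (nodes_needed(128, 2), nodes_needed(16, 4))   # (64, 4)

# amd64 prod cap of 128 against the 14-day peak of 105.
amd64_headroom = headroom_fraction(105, 128)                   # about 0.22
```

The staging figure of 6 nodes agrees with the peak row of the staging validation run (4× `m6id.24xlarge` + 2× `m7gd.16xlarge`).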

Integration test

osdc/integration-tests gains a scale test (build-image-scale.yaml): bursts 8 parallel buildctl builds per arch, each holding a maxconn=1 slot ~10m via a sleep. With min=2 the builds serialize (~43m) and the job times out (30m) unless KEDA scales up (~18m, one wave) — so the suite fails if autoscaling doesn't happen.
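The pass/fail logic of the scale test follows from counting waves. A minimal sketch, assuming each build holds its slot for exactly 10 minutes and ignoring scheduling and scale-up overhead (so the results are lower bounds on the ~43m and ~18m figures quoted above); `burst_minutes` is a hypothetical helper, not part of the test suite.

```python
import math


def burst_minutes(builds: int, pods: int, hold_minutes: float) -> float:
    """Lower bound on burst duration when each pod runs one build at a time."""
    waves = math.ceil(builds / pods)
    return waves * hold_minutes


JOB_TIMEOUT_MINUTES = 30

without_scaling = burst_minutes(builds=8, pods=2, hold_minutes=10)  # 4 waves -> 40
with_scaling = burst_minutes(builds=8, pods=8, hold_minutes=10)     # 1 wave  -> 10

fails_without_scaling = without_scaling > JOB_TIMEOUT_MINUTES       # True
passes_with_scaling = with_scaling < JOB_TIMEOUT_MINUTES            # True
```

The gap between the 10-minute single wave and the observed ~18m is presumably scale-up time (new pods and nodes coming up); that attribution is an assumption, not something the PR states.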

Companion change

A runner-side connect retry in pytorch/pytorch .ci/docker/build.sh lets a build wait for a pod from a cold/queued pool (drafted separately).
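The actual retry lives in a shell script and is not shown here. As an illustration of the idea only, a runner-side wait could look like the Python sketch below; the function name, attempt count, delay, and the host used in the usage note are all hypothetical.

```python
import socket
import time


def wait_for_builder(host, port, attempts=30, delay_s=10.0,
                     connect=socket.create_connection, sleep=time.sleep):
    """Retry a TCP connect until a builder accepts it.

    Returns the number of attempts used. Raises TimeoutError if every
    attempt fails. `connect` and `sleep` are injectable for testing.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            conn = connect((host, port), timeout=5)
            conn.close()  # probe only; the real build opens its own connection
            return attempt
        except OSError as exc:  # refused, unreachable, timed out, ...
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)
    raise TimeoutError(
        f"no builder reachable at {host}:{port} after {attempts} attempts"
    ) from last_error
```

A caller would run something like `wait_for_builder("buildkit.example.internal", 1234)` before starting the build; the host is a placeholder, and using port 1234 for the load balancer is an assumption borrowed from the pod port mentioned in the description.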

just lint 13/13 · just test pass.

Adds optional per-arch `buildkit.{amd64,arm64}_replicas` and `..._pods_per_node` overrides (fall back to the shared `replicas_per_arch`/`pods_per_node`, so other clusters render unchanged), then resizes the prod fleet:

- **amd64:** `m6id.24xlarge`, 2/node, 42 vCPU/155 GiB, **32 replicas** (was 12)
- **arm64:** `m7gd.16xlarge`, 4/node, ~14 vCPU/51 GiB, **8 replicas** (smaller pods, more of them; ≈ the pre-OSDC `m7g.4xlarge` build runner)

NodePool limits scale per-arch automatically. `just test` pass (97% cov), `just lint` 13/13. Independent of the autoscaling work on #701.

Signed-off-by: Huy Do <huydo@meta.com>

jeanschmidt · 2026-06-06T02:24:19Z

This does not solve the burst problem: when a change lands and triggers Docker rebuilds on multiple CIs, all the connections are dispatched to the existing buildkit pods and the newer ones sit idle. We need to find a way to fix this.

…seline

**Impact:** OSDC arc-staging buildkit only (autoscaling is opt-in; other clusters unchanged).
**Risk:** low

Absorb ciflow/docker bursts without overloading existing pods, and scale back to a small warm per-arch baseline when idle.

- HAProxy `server maxconn 1` + `timeout queue`: one build per pod; excess builds queue and flow onto new pods as they register, instead of stacking on busy pods (so scaled-up pods don't sit idle).
- KEDA ScaledObject per arch via `metrics-api` scraping the LB's own metrics (haproxy_backend_current_sessions); no Grafana / external metrics backend.
- Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps the burst and sizes the NodePool limits.
- preStop drain + PDB + long terminationGracePeriodSeconds for kill-free scale-down.

staging: amd64 m6id.24xlarge @ 2/node (min 2), arm64 m7gd.16xlarge @ 4/node (min 4).

Runner-side connect retry (separate pytorch/pytorch change) lets a build tolerate waiting for a pod from a cold/queued pool.

Testing: just lint 13/13, just test pass (generate_buildkit.py 98%).

Signed-off-by: Huy Do <huydo@meta.com>


huydhn · 2026-06-10T08:40:13Z

Staging validation run

Drove a balanced burst of 8 amd64 + 8 arm64 builds against the staging pool (each held a maxconn=1 slot ~5m), the same shape as the new integration-test scale test: https://github.com/pytorch/pytorch-canary/actions/runs/27247502866 — 16/16 builds succeeded.

BuildKit nodes / pods during the run:

| Phase | amd64 pods | arm64 pods | buildkit nodes |
|---|---|---|---|
| Baseline (before) | 2 | 4 | 2 (1 amd64 + 1 arm64) |
| Mid-burst (~T+9m) | scaling 2→8 | scaling 4→8 | climbing; 1 pod already draining |
| Peak | 8 | 8 | 6 (4× `m6id.24xlarge` + 2× `m7gd.16xlarge`) |
| After (~T+18m) | back to 2 | back to 4 | trailing down |

Observations:

- Queue worked as intended: every queued buildctl connected and rode the queue onto new pods as they registered; no connect timeouts, so no runner-side wait was needed for this burst.
- Scale-up: KEDA brought both arches to max (8/8) off the in-cluster HAProxy session metric.
- Kill-free scale-down: pods drained (preStop waited for :1234 to go idle) and returned to the 2/4 baseline with zero failed builds.
- Node consolidation lag (expected): with consolidationPolicy: WhenEmpty, survivor pods left some nodes half-full, so the node count trails the pod count back down rather than dropping immediately. This is the deliberate trade documented in the module README (some idle node cost for zero build disruption).
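The preStop drain is described only as waiting for the pod's :1234 to go idle; the hook itself is not shown in this PR. One way such a wait could work, sketched here under the assumption that "idle" means no ESTABLISHED IPv4 TCP sockets on local port 1234 as listed in Linux's `/proc/net/tcp` (the function names, poll interval, and deadline are hypothetical):

```python
import time

ESTABLISHED = "01"  # TCP state code used in /proc/net/tcp


def count_established(proc_net_tcp: str, port: int) -> int:
    """Count ESTABLISHED sockets whose local port is `port`."""
    wanted = f"{port:04X}"  # ports are upper-case hex in the table, 1234 -> 04D2
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_port = fields[1].rsplit(":", 1)[1]
        if local_port == wanted and fields[3] == ESTABLISHED:
            count += 1
    return count


def read_proc_net_tcp() -> str:
    """Read the kernel's IPv4 TCP socket table (Linux only)."""
    with open("/proc/net/tcp") as f:
        return f.read()


def wait_until_idle(read_table=read_proc_net_tcp, port=1234, poll_s=5.0,
                    deadline_s=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until no build is connected, or give up at the deadline.

    Returns True if the port went idle, False if the deadline passed.
    """
    start = clock()
    while clock() - start < deadline_s:
        if count_established(read_table(), port) == 0:
            return True
        sleep(poll_s)
    return False
```

In a real hook the deadline would need to stay below terminationGracePeriodSeconds so the drain finishes before the kubelet force-kills the pod. IPv6 sockets (`/proc/net/tcp6`) are not covered by this sketch.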

huydhn · 2026-06-10T08:55:30Z

Closing this in favour of the stack at #725

huydhn requested a review from jeanschmidt as a code owner June 5, 2026 03:33

huydhn marked this pull request as draft June 5, 2026 03:35

huydhn mentioned this pull request Jun 5, 2026

Support per-arch BuildKit replica counts and resize prod fleet #702

Merged

huydhn force-pushed the buildkit-autoscaling-keda branch 2 times, most recently from 54d6925 to 4a87e0a Compare June 9, 2026 18:36

huydhn changed the title ~~Autoscale BuildKit builders on docker-build job count via KEDA (staging)~~ BuildKit autoscaling on staging: in-cluster KEDA + LB queue + warm baseline Jun 9, 2026

huydhn force-pushed the buildkit-autoscaling-keda branch from 4a87e0a to d27514a Compare June 9, 2026 19:54

huydhn force-pushed the buildkit-autoscaling-keda branch 6 times, most recently from d7a9356 to d7c3bb4 Compare June 10, 2026 08:03

Merge branch 'main' into buildkit-autoscaling-keda

76502b1

Signed-off-by: Huy Do <huydo@meta.com>

huydhn force-pushed the buildkit-autoscaling-keda branch from d7c3bb4 to 76502b1 Compare June 10, 2026 08:04

huydhn added 2 commits June 10, 2026 01:35

buildkit: enable autoscaling on prod (arc-cbr-production)

a3b704b

Same min per arch as staging (amd64 2 / arm64 4). Max sized from 14-day docker-build concurrency: amd64 128 (peak 105 + headroom), arm64 16 (peak 8, likely capped by the old fixed pool).

integration-tests: add buildkit autoscaling scale test

b54d95a

Burst 8 parallel buildctl builds per arch (each holds a maxconn=1 slot ~10m via sleep). With amd64_min=2 they serialize ~43m > timeout 30m and fail unless KEDA scales the pool up; one wave ~18m when it does.

huydhn changed the title ~~BuildKit autoscaling on staging: in-cluster KEDA + LB queue + warm baseline~~ BuildKit autoscaling (staging + prod): in-cluster KEDA + LB queue + warm baseline Jun 10, 2026

huydhn closed this Jun 10, 2026

huydhn wants to merge 4 commits into pytorch:main from huydhn:buildkit-autoscaling-keda
