Commit e7c5415

Migrate A100 CUDA CI jobs to OSDC runners (#20212)
Moves the A100-dependent CUDA CI jobs from `pytorch/test-infra` `linux_job_v2` (AWS) to `linux_job_v3` (OSDC/ARC), and remaps their runner labels per `pytorch/.github/arc.yaml`.

### Migrated jobs (now on OSDC / `linux_job_v3`)

- `cuda.yml`: `export-model-cuda-artifact`, `test-model-cuda-e2e`
- `cuda-perf.yml`: `export-models`, `benchmark-cuda`

### Runner label mapping

| AWS label | OSDC label |
|---|---|
| `linux.aws.a100` | `mt-l-x86iavx512-11-125-a100` |
| `linux.g5.4xlarge.nvidia.gpu` (A10G fallback branch) | `mt-l-x86aavx2-29-113-a10g` |

The A10G fallback branch in each conditional runner expression had to move to an OSDC label too, since `linux_job_v3` requires ARC labels and that branch belongs to the same A100-dependent jobs.

### Left unchanged

Jobs that never run on A100 stay on `linux_job_v2` / `linux.g5.4xlarge.nvidia.gpu`: `test-cuda-builds`, `test-models-cuda`, `unittest-cuda`, `test-cuda-pybind`.

`linux_job_v3` resolves the docker image and `--gpus all` identically to v2 for these jobs (none set `docker-image`), so build/runtime behavior is unchanged.

Authored with Claude Code.
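
The runner label mapping in the commit message can be read as a simple lookup. The sketch below is illustrative only and not part of the commit: the function name `osdc_label_for` is hypothetical, and the only label pairs it knows are the two listed in the mapping table.

```shell
# Hypothetical helper (not in the repository): translate an AWS runner label
# to the OSDC label given in the commit message's mapping table.
osdc_label_for() {
  case "$1" in
    linux.aws.a100)
      echo "mt-l-x86iavx512-11-125-a100" ;;
    linux.g5.4xlarge.nvidia.gpu)
      echo "mt-l-x86aavx2-29-113-a10g" ;;
    *)
      # Any other label has no mapping recorded in this commit.
      echo "no OSDC mapping recorded for: $1" >&2
      return 1 ;;
  esac
}
```

Calling `osdc_label_for linux.aws.a100` prints the A100 OSDC label; an unlisted label makes the function return a non-zero status instead of guessing.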
1 parent 1388200 commit e7c5415

2 files changed

Lines changed: 40 additions & 8 deletions

.github/workflows/cuda-perf.yml

Lines changed: 20 additions & 4 deletions
@@ -124,7 +124,7 @@ jobs:
  export-models:
  name: export-models
  needs: set-parameters
- uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+ uses: pytorch/test-infra/.github/workflows/linux_job_v3.yml@main
  permissions:
  id-token: write
  contents: read
@@ -135,7 +135,7 @@ jobs:
  with:
  timeout: 90
  secrets-env: EXECUTORCH_HF_TOKEN
- runner: ${{ contains(matrix.model, 'Qwen3.5-35B-A3B') && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
+ runner: ${{ contains(matrix.model, 'Qwen3.5-35B-A3B') && 'mt-l-x86iavx512-11-125-a100' || 'mt-l-x86aavx2-29-113-a10g' }}
  gpu-arch-type: cuda
  gpu-arch-version: "13.0"
  use-custom-docker-registry: false
@@ -145,6 +145,14 @@ jobs:
  script: |
  set -eux
  echo "::group::Setup ExecuTorch"
+ # OSDC runners can't reach the public PyPI CDN that download.pytorch.org's
+ # transitive deps resolve to. Pre-install torch's pure-python deps from the
+ # in-cluster pypi-cache and drop the default cpu extra-index so the cuda
+ # torch wheel is the only candidate.
+ export PIP_EXTRA_INDEX_URL=
+ # fsspec is pinned to satisfy datasets' fsspec[http]<=2025.3.0 so the later
+ # examples install doesn't try to downgrade it from the public CDN.
+ pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 "fsspec[http]<=2025.3.0" numpy pillow
  # Disable MKL to avoid duplicate target error when conda has multiple MKL installations
  export USE_MKL=OFF
  ./install_executorch.sh
@@ -192,7 +200,7 @@ jobs:
  contains(needs.changed-files.outputs.changed-files, '.ci/scripts/test_model_e2e.sh') ||
  needs.run-decision.outputs.is-full-run == 'true'
  )
- uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+ uses: pytorch/test-infra/.github/workflows/linux_job_v3.yml@main
  permissions:
  id-token: write
  contents: read
@@ -201,7 +209,7 @@ jobs:
  fail-fast: false
  with:
  timeout: 90
- runner: ${{ contains(matrix.model, 'Qwen3.5-35B-A3B') && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
+ runner: ${{ contains(matrix.model, 'Qwen3.5-35B-A3B') && 'mt-l-x86iavx512-11-125-a100' || 'mt-l-x86aavx2-29-113-a10g' }}
  gpu-arch-type: cuda
  gpu-arch-version: "13.0"
  use-custom-docker-registry: false
@@ -212,6 +220,14 @@ jobs:
  script: |
  set -eux
  echo "::group::Setup environment"
+ # OSDC runners can't reach the public PyPI CDN that download.pytorch.org's
+ # transitive deps resolve to. Pre-install torch's pure-python deps from the
+ # in-cluster pypi-cache and drop the default cpu extra-index so the cuda
+ # torch wheel is the only candidate.
+ export PIP_EXTRA_INDEX_URL=
+ # fsspec is pinned to satisfy datasets' fsspec[http]<=2025.3.0 so the later
+ # examples install doesn't try to downgrade it from the public CDN.
+ pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 "fsspec[http]<=2025.3.0" numpy pillow
  ./install_requirements.sh
  pip list
  echo "::endgroup::"
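
The `runner:` expressions in this file use the `condition && 'a' || 'b'` idiom: models whose name contains `Qwen3.5-35B-A3B` go to the A100 label, and everything else falls back to the A10G label. A rough shell equivalent is sketched below for readers who want to reason about the selection locally. It is not part of the workflow; the function name `select_runner` and the model names in the usage note are hypothetical. The match is lowercased first because GitHub's `contains()` ignores case when comparing strings.

```shell
# Hypothetical local equivalent of the workflow's runner expression:
#   contains(matrix.model, 'Qwen3.5-35B-A3B') && '<a100 label>' || '<a10g label>'
select_runner() {
  # Lowercase the input so the substring match is case-insensitive.
  model_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$model_lc" in
    *qwen3.5-35b-a3b*)
      echo "mt-l-x86iavx512-11-125-a100" ;;  # A100 runner
    *)
      echo "mt-l-x86aavx2-29-113-a10g" ;;    # A10G fallback
  esac
}
```

For example, `select_runner "example-org/Qwen3.5-35B-A3B-variant"` prints the A100 label, while `select_runner "example-org/small-model"` prints the A10G label.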

.github/workflows/cuda.yml

Lines changed: 20 additions & 4 deletions
@@ -229,7 +229,7 @@ jobs:
  contains(needs.changed-files.outputs.changed-files, '.ci/scripts/test_model_e2e.sh') ||
  needs.run-decision.outputs.is-full-run == 'true'
  )
- uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+ uses: pytorch/test-infra/.github/workflows/linux_job_v3.yml@main
  permissions:
  id-token: write
  contents: read
@@ -342,7 +342,7 @@ jobs:
  with:
  timeout: 150
  secrets-env: EXECUTORCH_HF_TOKEN
- runner: ${{ (matrix.model.name == 'Qwen3.5-35B-A3B-HQQ-INT4' || matrix.model.name == 'gemma-4-31B-it-HQQ-INT4') && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
+ runner: ${{ (matrix.model.name == 'Qwen3.5-35B-A3B-HQQ-INT4' || matrix.model.name == 'gemma-4-31B-it-HQQ-INT4') && 'mt-l-x86iavx512-11-125-a100' || 'mt-l-x86aavx2-29-113-a10g' }}
  gpu-arch-type: cuda
  gpu-arch-version: "13.0"
  use-custom-docker-registry: false
@@ -353,6 +353,14 @@ jobs:
  set -eux

  echo "::group::Setup ExecuTorch"
+ # OSDC runners can't reach the public PyPI CDN that download.pytorch.org's
+ # transitive deps resolve to. Pre-install torch's pure-python deps from the
+ # in-cluster pypi-cache and drop the default cpu extra-index so the cuda
+ # torch wheel is the only candidate.
+ export PIP_EXTRA_INDEX_URL=
+ # fsspec is pinned to satisfy datasets' fsspec[http]<=2025.3.0 so the later
+ # examples install doesn't try to downgrade it from the public CDN.
+ pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 "fsspec[http]<=2025.3.0" numpy pillow
  # Disable MKL to avoid duplicate target error when conda has multiple MKL installations
  export USE_MKL=OFF
  ./install_executorch.sh
@@ -390,7 +398,7 @@ jobs:
  contains(needs.changed-files.outputs.changed-files, '.ci/scripts/test_model_e2e.sh') ||
  needs.run-decision.outputs.is-full-run == 'true'
  )
- uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+ uses: pytorch/test-infra/.github/workflows/linux_job_v3.yml@main
  permissions:
  id-token: write
  contents: read
@@ -494,14 +502,22 @@ jobs:
  quant: "non-quantized"
  with:
  timeout: 90
- runner: ${{ (matrix.model.name == 'Qwen3.5-35B-A3B-HQQ-INT4' || matrix.model.name == 'gemma-4-31B-it-HQQ-INT4') && 'linux.aws.a100' || 'linux.g5.4xlarge.nvidia.gpu' }}
+ runner: ${{ (matrix.model.name == 'Qwen3.5-35B-A3B-HQQ-INT4' || matrix.model.name == 'gemma-4-31B-it-HQQ-INT4') && 'mt-l-x86iavx512-11-125-a100' || 'mt-l-x86aavx2-29-113-a10g' }}
  gpu-arch-type: cuda
  gpu-arch-version: "13.0"
  use-custom-docker-registry: false
  submodules: recursive
  download-artifact: ${{ matrix.model.repo }}-${{ matrix.model.name }}-cuda-${{ matrix.quant }}
  ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
  script: |
+ # OSDC runners can't reach the public PyPI CDN that download.pytorch.org's
+ # transitive deps resolve to. Pre-install torch's pure-python deps from the
+ # in-cluster pypi-cache and drop the default cpu extra-index so the cuda
+ # torch wheel is the only candidate.
+ export PIP_EXTRA_INDEX_URL=
+ # fsspec is pinned to satisfy datasets' fsspec[http]<=2025.3.0 so the later
+ # examples install doesn't try to downgrade it from the public CDN.
+ pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 "fsspec[http]<=2025.3.0" numpy pillow
  source .ci/scripts/test_model_e2e.sh cuda "${{ matrix.model.repo }}/${{ matrix.model.name }}" "${{ matrix.quant }}" "${RUNNER_ARTIFACT_DIR}"

  test-cuda-pybind:
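
Each migrated script clears `PIP_EXTRA_INDEX_URL` with `export PIP_EXTRA_INDEX_URL=` before the pre-install step, which leaves the variable set but empty and overrides any value inherited from the runner environment. A small guard along the following lines could make that assumption explicit in a script. It is a sketch only: the function name `require_empty_extra_index` is hypothetical and not part of this commit, and it inspects just the environment variable, not any pip configuration file.

```shell
# Hypothetical guard (not in the commit): fail fast if an extra index is
# still configured through the environment when the pre-install runs.
require_empty_extra_index() {
  # ${VAR:-} expands to an empty string when VAR is unset or empty,
  # so this check is also safe under `set -u`.
  if [ -n "${PIP_EXTRA_INDEX_URL:-}" ]; then
    echo "PIP_EXTRA_INDEX_URL is still set: ${PIP_EXTRA_INDEX_URL}" >&2
    return 1
  fi
}

# Same assignment as in the workflow scripts, followed by the guard.
export PIP_EXTRA_INDEX_URL=
require_empty_extra_index
```

Under `set -eux`, as used by most of the scripts in this diff, a non-zero return from the guard would stop the job before `pip install` runs.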
