[ZenCPU] Changes with respect to docker build and relevant cpu tests#38703
[ZenCPU] Changes with respect to docker build and relevant cpu tests#38703Chinmay-Kulkarni-AMD wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive CI support for AMD Zen CPUs, adding new Buildkite test pipelines, a dedicated Docker image build process, and specific test skips for models requiring float16 compute. The review feedback highlights several critical improvements for CI robustness and efficiency: ensuring that changes to C++ source and build files trigger the relevant tests, avoiding redundant Docker builds by leveraging pre-built images, using unique job identifiers in image tags to prevent race conditions, and enhancing security by removing unnecessary privileged execution flags in the test environment.
| # allow to bind to different cores | ||
| CORE_RANGE=${CORE_RANGE:-48-95} | ||
| NUMA_NODE=${NUMA_NODE:-1} | ||
| IMAGE_NAME="cpu-test-$NUMA_NODE" |
There was a problem hiding this comment.
Using a static IMAGE_NAME based only on the NUMA_NODE can lead to race conditions if multiple CI jobs run concurrently on the same agent and share the same Docker daemon. It is safer to include a unique identifier like BUILDKITE_JOB_ID in the tag.
| IMAGE_NAME="cpu-test-$NUMA_NODE" | |
| IMAGE_NAME="cpu-test-$NUMA_NODE-$BUILDKITE_JOB_ID" |
There was a problem hiding this comment.
prefix the name with amd.
IMAGE_NAME="amd-cpu-text-$NUMA_NODE-$BUILDKITE_JOB_ID"
|
|
||
| # building the docker image | ||
| echo "--- :docker: Building Docker image" | ||
| docker build --progress plain --tag "$IMAGE_NAME" --target vllm-zen-test -f docker/Dockerfile.cpu . |
There was a problem hiding this comment.
This script performs a full docker build every time it is executed. Since there are multiple test steps in cpu.yaml calling this script, the image is rebuilt multiple times per CI run, which is highly inefficient. The script should ideally pull and use the image already built and pushed by the image-build-amd-cpu step in the pipeline.
There was a problem hiding this comment.
Keeping the structure similar to the existing image-build-cpu.sh file
| docker build --progress plain --tag "$IMAGE_NAME" --target vllm-zen-test -f docker/Dockerfile.cpu . | ||
|
|
||
| # Run the image, setting --shm-size=4g for tensor parallel. | ||
| docker run --rm --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN -e VLLM_CPU_KVCACHE_SPACE=16 -e VLLM_CPU_CI_ENV=1 -e VLLM_CPU_SIM_MULTI_NUMA=1 --shm-size=4g "$IMAGE_NAME" \ |
There was a problem hiding this comment.
Running the container with --privileged=true is a significant security risk and should be avoided unless absolutely necessary. The required functionality for CPU pinning (--cpuset-cpus, --cpuset-mems) and shared memory (--shm-size) does not require elevated privileges. Please remove this flag or replace it with specific capabilities if needed.
| docker run --rm --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN -e VLLM_CPU_KVCACHE_SPACE=16 -e VLLM_CPU_CI_ENV=1 -e VLLM_CPU_SIM_MULTI_NUMA=1 --shm-size=4g "$IMAGE_NAME" \ | |
| docker run --rm --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" -v ~/.cache/huggingface:/root/.cache/huggingface -e HF_TOKEN -e VLLM_CPU_KVCACHE_SPACE=16 -e VLLM_CPU_CI_ENV=1 -e VLLM_CPU_SIM_MULTI_NUMA=1 --shm-size=4g "$IMAGE_NAME" \ |
There was a problem hiding this comment.
Keeping the structure similar to the existing image-build-cpu.sh file
2d70961 to
1ad34b3
Compare
|
Adding @andy-neuma @tlrmchlsmth for reviews |
717d9ce to
1b5d7fe
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
cdbb683 to
9979290
Compare
592d4fd to
ad8b8a9
Compare
ad8b8a9 to
47beb61
Compare
| - label: AMD-CPU-Kernel Tests | ||
| depends_on: [] | ||
| soft_fail: false | ||
| device: amd-zen-cpu-inference |
There was a problem hiding this comment.
please update to match "zen5" ...
There was a problem hiding this comment.
Made the required change.
4fda385 to
3e7343d
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
|
hi @AndreasKaratzas we have raised this PR to enable vLLM CPU CI on Zen CPU instances. Can you please review and help with the merge? |
AndreasKaratzas
left a comment
There was a problem hiding this comment.
I would advise an RFC for the addition of a new platform, the Zen platform, on CI. It is clear that you want to contribute on this stack, and hence this is not going to be your last contribution. So the wider community should probably advise on how to do that. For example my opinion is to separate from "CPU" platform as much as possible (trust me you'll thank me later). But this is exactly a matter of personal opinion and I am not the only committer. So probably an RFC would be the way to go.
|
|
||
| ######################### ZEN CPU TEST IMAGE ######################### | ||
| FROM vllm-openai-zen AS vllm-zen-test | ||
|
|
||
| WORKDIR /vllm-workspace | ||
|
|
||
| COPY --from=vllm-test-deps /vllm-workspace/requirements/test/cpu.txt requirements/test/cpu.txt | ||
|
|
||
| RUN --mount=type=cache,target=/root/.cache/uv \ | ||
| uv pip install -r requirements/test/cpu.txt | ||
|
|
||
| ADD ./tests/ ./tests/ | ||
| ADD ./examples/ ./examples/ | ||
| ADD ./benchmarks/ ./benchmarks/ | ||
| ADD ./vllm/collect_env.py . | ||
| ADD ./docker/ ./docker/ | ||
| ADD ./.buildkite/ ./.buildkite/ | ||
|
|
||
| RUN --mount=type=cache,target=/root/.cache/uv \ | ||
| uv pip install -e tests/vllm_test_utils | ||
|
|
||
| # enable fast downloads from hf (for testing) | ||
| ENV HF_XET_HIGH_PERFORMANCE=1 | ||
| # increase timeout for hf downloads (for testing) | ||
| ENV HF_HUB_DOWNLOAD_TIMEOUT=60 | ||
|
|
||
|
|
||
| ENTRYPOINT [] |
There was a problem hiding this comment.
I think that this should be in its own dockerfile. This is a standalone stage not inheriting from any other in this dockerfile, so that is why I am thinking that it does not belong in the same dockerfile.
Also I think that for this exact reason, there should be an RFC first. I'm positive for this and this PR (with a bit of modification) is a good start, but the process would be an RFC to lay out the plan of adding Zen CPUs in CI and the areas that Zen is going to be busy with (I remember for example the zentorch quantized PR which is very recent).
There was a problem hiding this comment.
Hello Andreas, the restructuring of files and snippets into zen specific files is done and pushed. RFC is also created for introducing the Zen platform into the vLLM and vLLM CI which can be viewed here: #44421
| pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py" | ||
|
|
||
| - label: AMD-CPU-Kernel Tests | ||
| depends_on: [] |
There was a problem hiding this comment.
From one engineer to another, I am thinking that it's best if you start separating Zen test yamls/dockerfiles/scripts from the CPU platform. It will be very hard to do it later on, and trust me it will be inevitable. Also, keep the naming as "zen" even for the filenames instead of "amd" since AMD can mislead to GPU deployment too (for example one of the file names in this PR that has this "amd" is the introduced run-amd-cpu-test.sh -- btw the first thing that you might tell me is "why your team (ROCm) are calling it AMD then and not ROCm or something"? Well, the answer is in my second sentence above).
There was a problem hiding this comment.
All the zen specific code and the build snippets are carved out into "zen"-specific files.
| bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh dp_tp" | ||
|
|
||
| - label: AMD-CPU-Distributed Tests | ||
| depends_on: [] |
There was a problem hiding this comment.
After you define a Dockerfile.zen make sure that you add that docker image build in the dependency list here.
There was a problem hiding this comment.
Currently keeping the parity with the cpu.yaml. All of the tests are currently building images locally and using them to run the tests. Do you suggest to build the image once and use it across every test unlike the current cpu.yaml?
There was a problem hiding this comment.
So, take a look at vllm/.buildkite/test_areas/entrypoints.yaml line 75. For AMD GPU builds we have to let the pipeline generator know that we depend on the amd docker build rather than the cuda one. I assume that you would need to do the same. But I might be wrong. Check it out and lmk :)
There was a problem hiding this comment.
zen.yaml follows the hardware_tests/cpu.yaml model: no_plugin: true steps where the run script builds the image locally and runs it, so there's no pre-built image to depend on — that's why depends_on: []. The entrypoints.yaml example is the docker-plugin model where the step pulls a pre-built image, hence the image-build-amd dependency. I can switch Zen to that pull-based model (depend on image-build-zen-cpu + pull instead of local build) if you'd prefer the build-once approach — it'd diverge from cpu.yaml but avoid the redundant build. However, explicit testing would be required if docker pull "$REGISTRY/$REPO:$BUILDKITE_COMMIT-zen-cpu" works as expected, if we want to build once and use it across every test.
Is it possible to make sure that the current setup works properly and then push the "build once and pull for every test run" feature?
There was a problem hiding this comment.
I thought that for Zen you need to invoke your own image cause of zen specific features like zentorch, i m also not sure i understand the "local build" idea. CI workflow for GPUs works like this:
bootstrap --> docker build on buildkite --> test groups begin after they pull the correct container from the previous step
That is for both upstream CI and AMD ROCm CI. What is the workflow for CPU CI? Does the main CPU image contain the dependencies you need for Zen tests?
There was a problem hiding this comment.
The reason I went with the local-build model is that I have no way to confirm if the zen5 agents on zen5 queue configuration have ECR pull credentials. Let me lay out how the current setup works.
The Zen test steps build the image in-step, with no_plugin: true — there's no pre-built image to depend on:
# .buildkite/hardware_tests/zen.yaml
- label: Zen-CPU-Kernel Tests
depends_on: []
soft_fail: false
device: zen5
no_plugin: true
...
commands:
- |
bash .buildkite/scripts/hardware_ci/run-zen-cpu-test.sh 20m "
pytest -x -v -s tests/model_executor/test_cpu_unquantized_gemm_dispatch.py"
The runner builds the CPU base + Zen image from the checkout and runs it — no docker pull of a registry image anywhere:
# .buildkite/scripts/hardware_ci/run-zen-cpu-test.sh
# Build CPU base image (Dockerfile.zen layers on top of this).
docker build --progress plain \
--tag "$CPU_BASE_IMAGE" \
--target vllm-openai \
-f docker/Dockerfile.cpu .
# Build the Zen CPU test image on top of the CPU base.
docker build --progress plain \
--build-arg BASE_IMAGE="$CPU_BASE_IMAGE" \
--tag "$IMAGE_NAME" \
--target vllm-zen-test \
-f docker/Dockerfile.zen .
# Run tests inside the just-built image.
docker run --rm ... "$IMAGE_NAME" \
timeout "$TIMEOUT_VAL" bash -c "... ${TEST_COMMAND}"
Because everything is built from the repo, this needs no ECR pull credentials on the zen5 agents, so it doesn't depend on registry access being wired up on that queue. This is the same model hardware_tests/cpu.yaml uses :
# .buildkite/hardware_tests/cpu.yaml
- label: CPU-Kernel Tests
depends_on: []
device: intel_cpu
no_plugin: true
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "..."
# .buildkite/scripts/hardware_ci/run-cpu-test.sh
docker build --progress plain --tag "$IMAGE_NAME" --target vllm-test -f docker/Dockerfile.cpu .
docker run --rm ... "$IMAGE_NAME" timeout "$TIMEOUT_VAL" bash -c "... ${TEST_COMMAND}"
To confirm your point directly: yes, Zen needs its own image — the main CPU image has no zentorch (grep zentorch docker/Dockerfile.cpu returns nothing); zentorch is installed only in Dockerfile.zen, layered on top of the CPU vllm-openai base.
The build-once-pull model you described (build → push → tests pull) is cleaner, and the pieces for it already exist — image_build_zen_cpu.sh builds and pushes the -zen-cpu image:
# .buildkite/image_build/image_build_zen_cpu.sh
ZEN_TAG="$REGISTRY/$REPO:$BUILDKITE_COMMIT-zen-cpu"
...
docker build --file docker/Dockerfile.zen \
--build-arg BASE_IMAGE="$CPU_BASE_TAG" \
--tag "$ZEN_TAG" \
--target vllm-zen-test \
--progress plain .
docker push "$ZEN_TAG"
To switch the tests to consume it, I'd add depends_on: [image-build-zen-cpu] to the steps and change the runner from the two docker builds above to docker pull "$REGISTRY/$REPO:$BUILDKITE_COMMIT-zen-cpu". The only blocker is that this needs the zen5 agents to be able to pull from ECR, which I can't verify from my side (the always-pull: true in amd_zen5_plugin_template from ci-infra#343 suggests they're meant to).
| # authenticate with AWS ECR | ||
| aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" |
There was a problem hiding this comment.
I see that this is in parity (near-exact verbatim) with the image_build_cpu.sh script. Word of warning: Since you are going to be setting the infra side, be very careful of tokens and login credentials here. You must make sure that they are protected such that if there is a PR up that attempts to print them during CI evaluation, you are not compromised. Make sure and be very careful with your infra team about this. I know that with Github secrets it would have been easy, but with Buildkite agents, I don't know the process to be honest with you, so I can just raise a word of warning.
| docker build --progress plain --tag "$IMAGE_NAME" --target vllm-zen-test -f docker/Dockerfile.cpu . | ||
|
|
||
| # Run the image, setting --shm-size=4g for tensor parallel. | ||
| docker run --rm --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN -e VLLM_CPU_KVCACHE_SPACE=16 -e VLLM_CPU_CI_ENV=1 -e VLLM_CPU_SIM_MULTI_NUMA=1 --shm-size=4g "$IMAGE_NAME" \ |
There was a problem hiding this comment.
Same warning for the HF token that I gave in a different spot for AWS credentials.
3e7343d to
264080c
Compare
474171d to
4b92a0d
Compare
Signed-off-by: Chinmay Kulkarni <Chinmay.Kulkarni@amd.com>
Signed-off-by: Chinmay Kulkarni <Chinmay.Kulkarni@amd.com>
…erence tests Signed-off-by: Chinmay Kulkarni <Chinmay.Kulkarni@amd.com>
Signed-off-by: Chinmay Kulkarni <Chinmay.Kulkarni@amd.com>
Signed-off-by: Chinmay Kulkarni <Chinmay.Kulkarni@amd.com>
…vidual files Signed-off-by: Chinmay Kulkarni <Chinmay.Kulkarni@amd.com>
4b92a0d to
19c0dd7
Compare
Purpose
Add AMD Zen CPU CI infrastructure and fix test compatibility for the AMD Zen CPU platform.
AMD Zen CPUs do not support float16 compute natively (
zen_cpu.pyonly advertises[bfloat16, float32]as supported dtypes). Models whose native dtype is float16are automatically converted to bfloat16 at load time. This causes numerical
divergence for certain models:
bigscience/bloom-560m: ALiBi attention produces degenerate (all-identical)logprob distributions in bfloat16, causing complete output divergence from
HuggingFace reference. The model works correctly in float16 and float32.
Qwen/Qwen2.5-Math-PRM-7B: Test hardcodesdtype="half", which isunsupported on AMD Zen CPUs.
This PR:
distributed tests), mirroring the existing Intel CPU test matrix.
vllm-zen-testDocker build target for AMD Zen CPU test images.bloom-560mandQwen2.5-Math-PRM-7Btests on AMD Zen CPUs withclear reason annotations.
Test Plan
Run the following on an AMD Zen CPU machine:
Test Result
Generation tests: 14 passed, 2 skipped (bloom-560m), 80 deselected ==== 14 passed, 2 skipped, 80 deselected, 94 warnings in 654.79s (0:10:54) =====Pooling tests: 5 passed, 2 skipped (Qwen2.5-Math-PRM-7B), 78 deselected ===== 5 passed, 2 skipped, 78 deselected, 33 warnings in 468.85s (0:07:48) =====Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.