Modify TheRock manylinux Dockerfile to take an arch build arg. #470
| ### Container Build Arguments: | ||
| # TheRock nightly index URL (GPU-family-specific). | ||
| ARG THEROCK_INDEX_URL=https://rocm.nightlies.amd.com/v2/gfx950-dcgpu/ |
There was a problem hiding this comment.
This is still using a per-family index as the default, and that index does not include packages supporting any device-* extras (pip will warn but not error when you try to install an extra that does not exist).
I'd suggest switching to https://rocm.nightlies.amd.com/whl-multi-arch/.
There was a problem hiding this comment.
This is just a default which is never used. The multi-arch index is passed in through CI so we never actually use the single-arch index anymore. This shouldn't affect functionality. I can clean this up though. You are right that this shouldn't be around, even as a leftover artifact.
| "rocm[libraries,devel,device-gfx950]${THEROCK_VERSION:+==}${THEROCK_VERSION}" && \ | ||
| "rocm[libraries,devel,device-${GFX_ARCH}]${THEROCK_VERSION:+==}${THEROCK_VERSION}" && \ |
There was a problem hiding this comment.
There are many cases where you may want to install support for multiple devices, but this only allows a single device- extra. At least it can take all as the extra (device-all), but another expected scenario is device-gfx942,device-gfx950 for MI300 + MI355.
See the docs at https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-multi-arch-rocm-python-packages.
There was a problem hiding this comment.
I guess you could technically set the arch arg to gfx942,device-gfx950 to install multiple? Only the first device would need to omit the device- prefix...
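To make the workaround concrete, here is a minimal sketch of how the requirement string from the diff expands for a few values. The helper name `rocm_requirement` is hypothetical and the version number is made up. Only the template itself comes from the Dockerfile line above.

```shell
# Hypothetical helper: reproduces the requirement string from the Dockerfile's
# pip install line so the expansion can be inspected without running pip.
# The body runs in a subshell, so the variables do not leak to the caller.
rocm_requirement() (
  GFX_ARCH=$1
  THEROCK_VERSION=${2:-}
  # ${THEROCK_VERSION:+==} expands to "==" only when THEROCK_VERSION is non-empty.
  printf '%s\n' "rocm[libraries,devel,device-${GFX_ARCH}]${THEROCK_VERSION:+==}${THEROCK_VERSION}"
)

rocm_requirement gfx950                  # rocm[libraries,devel,device-gfx950]
rocm_requirement all                     # rocm[libraries,devel,device-all]
rocm_requirement "gfx942,device-gfx950"  # rocm[libraries,devel,device-gfx942,device-gfx950]
rocm_requirement gfx950 1.2.3            # rocm[libraries,devel,device-gfx950]==1.2.3 (made-up version)
```

The third call shows why only the first device omits the device- prefix: the template already supplies it once.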
There was a problem hiding this comment.
@ScottTodd We usually only run tests in our CI with a single arch at a time. Would this be a blocker for TheRock CI? I understand that the workaround is a little unwieldy. I am making a PR to clean up the other comments but none of those will change current functionality.
There was a problem hiding this comment.
Here's some of what we do with gfx arch lists:
- CI builds on pull_request and push events use amdgpu_family_info_matrix_presubmit from https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/amdgpu_family_matrix.py. Currently that means building ROCm artifacts (e.g. hipBLASLt) for a few GPU "families": [gfx942], [gfx1100,gfx1101,gfx1102,gfx1103], [gfx1151], [gfx1200,gfx1201], and then ROCm python packages and PyTorch python packages for all of those individual targets together: [gfx942,gfx1100,gfx1101,gfx1102,gfx1103,gfx1151,gfx1200,gfx1201]
- CI builds can also opt in to building additional GPU families or "all archs" using opt-in GitHub labels (https://github.com/ROCm/TheRock/blob/main/docs/development/ci_behavior_manipulation.md#pull-request)
- Release builds use the full list of supported GPU families/targets. Like the CI builds, we build ROCm artifacts for GPU families, then build ROCm python packages, PyTorch python packages, JAX python packages, etc. for the combined set of targets in a single build command (e.g. gfx942,gfx1100,gfx1101,gfx1102,gfx1103,gfx1151,gfx1200,gfx1201,gfx950,gfx900,gfx906,gfx908,gfx90a,gfx1010,gfx1011,gfx1012,gfx1030,gfx1031,gfx1032,gfx1033,gfx1034,gfx1035,gfx1036,gfx1150,gfx1152,gfx1153, see https://github.com/ROCm/rockrel/actions/runs/27930498942/job/82641334022#step:14:16)
- Test workflow runs will sometimes build for a single GPU target to reduce the load on the build/test runners
Practically speaking, in the CI context when we build for a specific list of targets that list will be what those packages treat as "all" (in tarballs, in python package device extras, etc.), so jobs like the pytorch and jax builds could either use an explicit list of all targets that were built or they can look up the list from the provided rocm packages / use "all".
So I don't think this would necessarily be blocking for integration with TheRock's CI/CD as device-all may be enough (that's what @erman-gurses has on ROCm/TheRock#6054 right now).
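For the "explicit list of all targets" option, a job could derive the device extras from a comma-separated target list. This is only a sketch under that assumption: the helper name `device_extras` is hypothetical, and it is not what this PR or TheRock's CI is known to do.

```shell
# Hypothetical helper: turns "gfx942,gfx950" into "device-gfx942,device-gfx950".
# Passing "all" yields "device-all". Runs in a subshell so the IFS and
# globbing changes do not affect the caller.
device_extras() (
  targets=$1
  extras=""
  set -f      # disable globbing while the list is split unquoted
  IFS=','     # split the target list on commas
  for target in $targets; do
    # Prepend a comma only when extras already holds an entry.
    extras="${extras:+$extras,}device-${target}"
  done
  printf '%s\n' "$extras"
)

echo "rocm[libraries,devel,$(device_extras gfx942,gfx950)]"
echo "rocm[libraries,devel,$(device_extras all)]"
```

With a helper like this the caller passes bare target names, so nobody has to remember which entry drops the device- prefix.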
There was a problem hiding this comment.
@ScottTodd I have just opened a PR here: #473 to fix this just in case.
| FROM quay.io/pypa/manylinux_2_28_x86_64 | ||
| ### Container Build Arguments: | ||
| # TheRock nightly index URL (GPU-family-specific). |
There was a problem hiding this comment.
The multi-arch index is not GPU-family-specific.
Motivation
The TheRock manylinux Docker image currently builds only for the gfx950 architecture. This PR adds a build argument to the Dockerfile so the image can be built for the desired architecture. If no argument is provided, it defaults to gfx950, preserving existing behavior.
Changes
All changes are confined to the TheRock manylinux Dockerfile:
docker/manylinux/Dockerfile.jax-manylinux_2_28-therock
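As a rough illustration of how the new argument might be passed, the sketch below prints a docker build command instead of running it. It makes several assumptions: the build arg is named GFX_ARCH (inferred from the variable in the diff), the command runs from the repository root, and the image tag `jax-manylinux-therock` is purely hypothetical. The multi-arch index URL is the one suggested in review.

```shell
# Hypothetical dry-run helper: prints the docker build command for review.
# Arg 1: GPU arch (defaults to gfx950, matching the PR's stated default).
# Arg 2: package index URL (defaults to the multi-arch index from the review).
print_build_cmd() (
  gfx_arch=${1:-gfx950}
  index_url=${2:-https://rocm.nightlies.amd.com/whl-multi-arch/}
  printf 'docker build --build-arg GFX_ARCH=%s --build-arg THEROCK_INDEX_URL=%s -f %s -t %s .\n' \
    "$gfx_arch" "$index_url" \
    docker/manylinux/Dockerfile.jax-manylinux_2_28-therock \
    "jax-manylinux-therock:${gfx_arch}"   # tag name is an assumption
)

print_build_cmd          # default arch
print_build_cmd gfx942   # explicit arch
```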