Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 5 additions & 2 deletions docker/manylinux/Dockerfile.jax-manylinux_2_28-therock
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@ FROM quay.io/pypa/manylinux_2_28_x86_64
ARG THEROCK_INDEX_URL=https://rocm.nightlies.amd.com/v2/gfx950-dcgpu/

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still using a per-family index as the default, and that index does not include packages supporting any device-* extras (pip will warn but not error when you try to install an extra that does not exist).

I'd suggest switching to https://rocm.nightlies.amd.com/whl-multi-arch/.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a default which is never used. The multi-arch index is passed in through CI so we never actually use the single-arch index anymore. This shouldn't affect functionality. I can clean this up though. You are right that this shouldn't be around, even as a leftover artifact.

# TheRock version number (e.g. "7.13.0a20260401" or "" for latest).
ARG THEROCK_VERSION=""
# GPU architecture / device family (selects the rocm[device-<arch>] extra).
ARG GFX_ARCH=gfx950

# Install patchelf and headers for numactl
RUN --mount=type=cache,target=/var/cache/dnf \
Expand All @@ -20,7 +22,7 @@ RUN --mount=type=cache,target=/var/cache/dnf \
RUN --mount=type=cache,mode=0755,target=/root/.cache/pip \
/opt/python/cp312-cp312/bin/python3 -m pip install \
--pre --index-url "${THEROCK_INDEX_URL}" \
"rocm[libraries,devel,device-gfx950]${THEROCK_VERSION:+==}${THEROCK_VERSION}" && \
"rocm[libraries,devel,device-${GFX_ARCH}]${THEROCK_VERSION:+==}${THEROCK_VERSION}" && \
Comment on lines -23 to +25

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many cases where you may want to install support for multiple devices, but this only allows a single device-. At least this can take all as the extra, but another expected scenario is device-gfx942,device-gfx950 for MI300 + MI355.

See the docs at https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-multi-arch-rocm-python-packages.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you could technically set the gfx942,device-gfx950 to install multiple? Only the first device would need to omit the device- prefix...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScottTodd We usually only run tests in our CI with a single arch at a time. Would this be a blocker for TheRock CI? I understand that the workaround is a little unwieldy. I am making a PR to clean up the other comments but none of those will change current functionality.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's some of what we do with gfx arch lists:

Practically speaking, in the CI context when we build for a specific list of targets that list will be what those packages treat as "all" (in tarballs, in python package device extras, etc.), so jobs like the pytorch and jax builds could either use an explicit list of all targets that were built or they can look up the list from the provided rocm packages / use "all".

So I don't think this would necessarily be blocking for integration with TheRock's CI/CD as device-all may be enough (that's what @erman-gurses has on ROCm/TheRock#6054 right now),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScottTodd I have just opened a PR here: #473 to fix this just in case.

ln -sf /opt/python/cp312-cp312/bin/rocm-sdk /usr/local/bin/rocm-sdk && \
rocm-sdk init

Expand Down Expand Up @@ -61,4 +63,5 @@ RUN --mount=type=cache,target=/var/cache/dnf \
dnf clean all

LABEL com.amdgpu.therock_index_url="$THEROCK_INDEX_URL" \
com.amdgpu.therock_version="$THEROCK_VERSION"
com.amdgpu.therock_version="$THEROCK_VERSION" \
com.amdgpu.gfx_arch="$GFX_ARCH"
Loading