Skip to content

Modify TheRock manylinux Dockerfile to take an arch build arg.#470

Merged
tsrw2048 merged 1 commit into
rocm-jax-infrafrom
add-arch-build-arg
Jun 19, 2026
Merged

Modify TheRock manylinux Dockerfile to take an arch build arg.#470
tsrw2048 merged 1 commit into
rocm-jax-infrafrom
add-arch-build-arg

Conversation

@tsrw2048

Copy link
Copy Markdown
Contributor

Motivation

TheRock manylinux docker image currently builds only for the gfx950 architecture. This PR has the Dockerfile take an input argument instead and builds the image with the desired architecture. If no argument is provided it defaults to gfx950, preserving existing behavior.

Changes

All changes are contained to TheRock manylinux dockerfile: docker/manylinux/Dockerfile.jax-manylinux_2_28-therock

@mminutoli mminutoli left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tsrw2048 tsrw2048 merged commit 1482406 into rocm-jax-infra Jun 19, 2026
13 checks passed

### Container Build Arguments:
# TheRock nightly index URL (GPU-family-specific).
ARG THEROCK_INDEX_URL=https://rocm.nightlies.amd.com/v2/gfx950-dcgpu/

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still using a per-family index as the default, and that index does not include packages supporting any device-* extras (pip will warn but not error when you try to install an extra that does not exist).

I'd suggest switching to https://rocm.nightlies.amd.com/whl-multi-arch/.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a default which is never used. The multi-arch index is passed in through CI so we never actually use the single-arch index anymore. This shouldn't affect functionality. I can clean this up though. You are right that this shouldn't be around, even as a leftover artifact.

Comment on lines -23 to +25
"rocm[libraries,devel,device-gfx950]${THEROCK_VERSION:+==}${THEROCK_VERSION}" && \
"rocm[libraries,devel,device-${GFX_ARCH}]${THEROCK_VERSION:+==}${THEROCK_VERSION}" && \

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many cases where you may want to install support for multiple devices, but this only allows a single device-. At least this can take all as the extra, but another expected scenario is device-gfx942,device-gfx950 for MI300 + MI355.

See the docs at https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-multi-arch-rocm-python-packages.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you could technically set the gfx942,device-gfx950 to install multiple? Only the first device would need to omit the device- prefix...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScottTodd We usually only run tests in our CI with a single arch at a time. Would this be a blocker for TheRock CI? I understand that the workaround is a little unwieldy. I am making a PR to clean up the other comments but none of those will change current functionality.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's some of what we do with gfx arch lists:

Practically speaking, in the CI context when we build for a specific list of targets that list will be what those packages treat as "all" (in tarballs, in python package device extras, etc.), so jobs like the pytorch and jax builds could either use an explicit list of all targets that were built or they can look up the list from the provided rocm packages / use "all".

So I don't think this would necessarily be blocking for integration with TheRock's CI/CD as device-all may be enough (that's what @erman-gurses has on ROCm/TheRock#6054 right now),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScottTodd I have just opened a PR here: #473 to fix this just in case.

FROM quay.io/pypa/manylinux_2_28_x86_64

### Container Build Arguments:
# TheRock nightly index URL (GPU-family-specific).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multi-arch index is not GPU-family-specific.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants