Feat/update skyrl template#11
Conversation
There was a problem hiding this comment.
Pull request overview
Updates the SkyRL container template (CUDA/Python stack and startup behavior) and aligns runtime/service setup across templates by introducing a new shared start1.sh script and adjusting bake tags/args.
Changes:
- Bump SkyRL base to CUDA 12.8.1 and build Python 3.12 from source; install SkyRL via
uvwith extras. - Introduce
container-template/start1.shand switch SkyRL/PyTorch images to run it as the container command. - Update bake configuration/tags for SkyRL and PyTorch.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
official-templates/skyrl/Dockerfile |
Moves SkyRL to CUDA 12.8.1 + Python 3.12 and switches startup script handling (currently breaks multi-context COPY usage). |
official-templates/skyrl/docker-bake.hcl |
Adjusts SkyRL image tag and args (currently mismatched with the updated Dockerfile and passes an unused arg). |
official-templates/pytorch/Dockerfile |
Switches to start1.sh, removes ubuntu user, and adds Jupyter config/hardening changes. |
official-templates/pytorch/docker-bake.hcl |
Updates tag scheme to ${TAG_SUFFIX} and bumps suffix. |
container-template/start1.sh |
New shared startup script to start nginx/ssh/jupyter and run pre/post hooks. |
Comments suppressed due to low confidence (2)
official-templates/skyrl/Dockerfile:20
- The proxy stage copies nginx.conf and readme.html from the default context, but those files aren’t present in official-templates/skyrl. Since docker-bake.hcl provides a named build context "proxy", this should be sourced from that context (or add the missing files to the skyrl template directory). As written, the build will fail.
# nginx config / landing page (keep as its own stage to match existing COPY --from=proxy)
FROM scratch AS proxy
# These two files must exist in the build context; adjust paths if not at root
COPY nginx.conf nginx.conf
COPY readme.html readme.html
official-templates/skyrl/Dockerfile:133
- NCCL_TESTS_VERSION is declared as a build arg, but the nccl-tests clone doesn’t pin to that version/tag. This makes builds non-reproducible and leaves the ARG unused. Use git clone --branch "${NCCL_TESTS_VERSION}" (as done in official-templates/pytorch/Dockerfile) or remove the ARG.
# nccl-tests
# =========================================================
RUN set -eux; \
git clone --depth 1 https://github.com/NVIDIA/nccl-tests.git /tmp/nccl-tests; \
make -C /tmp/nccl-tests -j"$(nproc)"; \
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| FROM scratch AS logo | ||
| COPY container-template/yotta.txt yotta.txt | ||
|
|
||
| FROM scratch AS scripts | ||
| COPY container-template/start.sh start.sh | ||
| COPY start1.sh start1.sh |
There was a problem hiding this comment.
The multi-context files are being copied from the default build context, but they don’t exist under official-templates/skyrl (e.g., container-template/yotta.txt and start1.sh). Also, defining a stage named "scripts" prevents using the buildx bake named context "scripts" of the same name. Consider removing these scratch stages (logo/scripts) and copying directly from the bake contexts (e.g., COPY --from=logo yotta.txt ..., COPY --from=scripts start1.sh ...), or rename the stages/contexts to avoid the collision.
| PYTHONUNBUFFERED=1 \ | ||
| PATH=/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:$PATH \ | ||
| LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH \ | ||
| JUPYTER_PASSWORD=yotta |
There was a problem hiding this comment.
JUPYTER_PASSWORD is hard-coded in the image (ENV JUPYTER_PASSWORD=yotta), but docker-bake.hcl passes JUPYTER_PASSWORD as a build arg. To avoid baking a default credential into the image and to honor the bake arg, declare ARG JUPYTER_PASSWORD and set ENV JUPYTER_PASSWORD=${JUPYTER_PASSWORD} (or remove the bake arg if runtime-only).
| usermod -aG sudo ubuntu || true; \ | ||
| echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu; \ | ||
| echo "ubuntu:ubuntu" | chpasswd || true | ||
| RUN mkdir -p /workspace && chmod 777 /workspace /root |
There was a problem hiding this comment.
Making /root world-writable (chmod 777) is a security risk and is unusual for a base template. Prefer keeping /root owned by root with 700/755 permissions, and only open up /workspace (or another dedicated writable directory) if the platform requires it.
| RUN mkdir -p /workspace && chmod 777 /workspace /root | |
| RUN mkdir -p /workspace && chmod 777 /workspace && chmod 700 /root |
| # ========================================================= | ||
| # Users: ubuntu (compat) | ||
| # Workspace initialization | ||
| # ========================================================= | ||
| RUN set -eux; \ | ||
| useradd -ms /bin/bash ubuntu || true; \ | ||
| usermod -aG sudo ubuntu || true; \ | ||
| echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu; \ | ||
| echo "ubuntu:ubuntu" | chpasswd || true | ||
| RUN mkdir -p /workspace && chmod 777 /workspace /root | ||
|
|
There was a problem hiding this comment.
With the ubuntu user creation removed, the image effectively only has root, but sshd_config is set to PermitRootLogin no (later in this Dockerfile) and start1.sh only provisions ~/.ssh for the current user. This combination will make SSH access impossible. Either create a non-root user and have start1.sh provision that user’s authorized_keys, or allow root key-based login (e.g., PermitRootLogin prohibit-password) to match the startup script.
| # ========================================================= | ||
| # start.sh (must be kept as-is) | ||
| # ========================================================= | ||
| COPY --from=scripts start.sh /start.sh | ||
| RUN chmod 755 /start.sh | ||
| COPY --from=scripts start1.sh /start1.sh | ||
| RUN chmod 755 /start1.sh && \ | ||
| sed -i 's/\r$//' /start1.sh |
There was a problem hiding this comment.
The section header says “start.sh (must be kept as-is)” but the image now copies /start1.sh. Update the comment to match the new startup script name to avoid misleading future edits.
| if [ ! -f /etc/ssh/ssh_host_dsa_key ]; then | ||
| ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -q -N '' | ||
| echo "DSA key fingerprint:" | ||
| ssh-keygen -lf /etc/ssh/ssh_host_dsa_key.pub | ||
| fi | ||
|
|
There was a problem hiding this comment.
Generating a DSA SSH host key is deprecated and considered insecure; many OpenSSH configurations disable DSA entirely. Consider removing the ssh_host_dsa_key generation block and rely on RSA/ECDSA/ED25519 keys.
| if [ ! -f /etc/ssh/ssh_host_dsa_key ]; then | |
| ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -q -N '' | |
| echo "DSA key fingerprint:" | |
| ssh-keygen -lf /etc/ssh/ssh_host_dsa_key.pub | |
| fi |
| local script_msg=$2 | ||
| if [ -f "${script_path}" ]; then | ||
| echo "${script_msg}" | ||
| bash ${script_path} |
There was a problem hiding this comment.
execute_script invokes bash with an unquoted path (bash ${script_path}), which can break if the path ever contains whitespace or glob characters. Quote the variable when executing.
| bash ${script_path} | |
| bash "${script_path}" |
| printenv | grep -E '^YOTTA_|^PATH=|^_=' | awk -F = '{ print "export " $1 "=\"" $2 "\"" }' >> /etc/rp_environment | ||
| echo 'export PATH=/usr/local/nvidia/bin:/usr/local/cuda-12.8/bin:~/.local/bin:$PATH' >> /etc/rp_environment | ||
| echo 'source /etc/rp_environment' >> ~/.bashrc |
There was a problem hiding this comment.
export_env_vars appends to /etc/rp_environment and ~/.bashrc on every container start, which can lead to duplicated lines across restarts. Consider rewriting /etc/rp_environment each time (or guarding with a marker) and only adding the “source /etc/rp_environment” line if it isn’t already present.
| printenv | grep -E '^YOTTA_|^PATH=|^_=' | awk -F = '{ print "export " $1 "=\"" $2 "\"" }' >> /etc/rp_environment | |
| echo 'export PATH=/usr/local/nvidia/bin:/usr/local/cuda-12.8/bin:~/.local/bin:$PATH' >> /etc/rp_environment | |
| echo 'source /etc/rp_environment' >> ~/.bashrc | |
| printenv | grep -E '^YOTTA_|^PATH=|^_=' | awk -F = '{ print "export " $1 "=\"" $2 "\"" }' > /etc/rp_environment | |
| echo 'export PATH=/usr/local/nvidia/bin:/usr/local/cuda-12.8/bin:~/.local/bin:$PATH' >> /etc/rp_environment | |
| if ! grep -qxF 'source /etc/rp_environment' ~/.bashrc 2>/dev/null; then | |
| echo 'source /etc/rp_environment' >> ~/.bashrc | |
| fi |
| # ========================================================= | ||
| # nccl-tests |
There was a problem hiding this comment.
The nccl-tests section doesn’t appear to disable MPI (unlike official-templates/pytorch/Dockerfile, which builds with MPI=0 to avoid mpi.h missing). Unless MPI dev headers are installed elsewhere, the build is likely to fail. Consider building nccl-tests with MPI=0 or adding the required MPI dev packages.
| variable "PUBLISHER" { default = "yottalabsai" } | ||
| variable "TAG_SUFFIX" { default = "2026010901" } | ||
| variable "TAG_SUFFIX" { default = "2026031701" } | ||
|
|
||
| group "default" { | ||
| targets = ["pytorch290"] |
There was a problem hiding this comment.
The PR title indicates it only updates the SkyRL template, but this PR also changes the PyTorch template and adds a new shared startup script under container-template/. Please update the PR title/description to reflect the broader scope, or consider splitting into separate PRs to reduce review and release risk.
|
PR Hygiene scripts = "../../container-template" post_start.sh removed entirely — The old image ran TensorBoard on port 6006 via post_start.sh. Port 6006 is still EXPOSEd (line 178) but nothing starts TensorBoard anymore. Either remove the port or add TensorBoard startup back. tini removed from entrypoint — The old image used ENTRYPOINT ["/usr/bin/tini", "--"] for proper PID 1 signal handling. The new entrypoint is just CMD ["/bin/bash", "-c", "exec /bin/bash /start1.sh"]. Without tini, zombie processes won't be reaped and SIGTERM may not propagate to children. tini is still installed in the apt layer — use it. nccl-tests not pinned to NCCL_TESTS_VERSION — The ARG NCCL_TESTS_VERSION="v2.13.11" is declared (line 2) but the git clone at line ~155 does --depth 1 without --branch "${NCCL_TESTS_VERSION}". The ARG is unused — non-reproducible builds. SSH config not configured — Same issue as the pytorch changes: the old ubuntu user setup and sshd_config patching are removed, but SSH is still started in start1.sh. Root login will be prohibited by default. Medium Image tag doesn't match actual stack — docker-bake.hcl:21 tags the image as ray2.51-py3.11-cuda12.1-... but the Dockerfile now builds Python 3.12 on CUDA 12.8.1. The tag is misleading. JUPYTER_PASSWORD bake variable says ubuntu — docker-bake.hcl:9 sets default "ubuntu" but the Dockerfile ENV sets yotta. The bake arg isn't passed as a build arg anymore (it was removed from the args block... wait, it's still there at line 29). So the bake value "ubuntu" will override the Dockerfile's "yotta" — inconsistent with the pytorch template. chmod 777 /workspace /root — Dockerfile:82. Root's home should not be world-writable. Hardcoded PY_MM="3.12" inside a RUN that also uses ARG PYTHON_VERSION — If someone changes PYTHON_VERSION to 3.11.x, the symlinks will point to 3.12 and break. Derive PY_MM from PYTHON_VERSION (e.g. PY_MM="${PYTHON_VERSION%.*}"). --enable-optimizations removed from Python build — Same as pytorch, ~10-20% slower Python without explanation. Low unsloth removed — Same concern as Megatron-LM. Comment at line 155 still says start.sh (must be kept as-is) — but it's now start1.sh. Update the comment. chown -R root:root /workspace — Dockerfile:173. Unnecessary — files created by root are already owned by root. And it conflicts with chmod 777 set earlier (the intent was any-user writable, but now ownership is root). start1.sh — Issues DSA key generation — deprecated, remove. Summary Fix the scripts stage COPY path |
Summary
Type of Change
container-template/)Checklist
Dockerfilebuilds successfully (docker buildx bake <target> --no-cache)README.mdupdated to reflect any interface or version changesdocker-bake.hcltarget name and image tag are correctTest Notes