Skip to content

Feat/update skyrl template#11

Open
AanningZoeng wants to merge 5 commits into
mainfrom
feat/update-skyrl-template
Open

Feat/update skyrl template#11
AanningZoeng wants to merge 5 commits into
mainfrom
feat/update-skyrl-template

Conversation

@AanningZoeng

Copy link
Copy Markdown
Collaborator

Summary

Type of Change

  • New template
  • Update to existing template (version bump, new packages, etc.)
  • Bug fix
  • Shared infrastructure change (container-template/)
  • Documentation only

Checklist

  • Dockerfile builds successfully (docker buildx bake <target> --no-cache)
  • Container starts and all services are reachable
  • README.md updated to reflect any interface or version changes
  • No secrets, tokens, or internal hostnames committed
  • docker-bake.hcl target name and image tag are correct

Test Notes

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the SkyRL container template (CUDA/Python stack and startup behavior) and aligns runtime/service setup across templates by introducing a new shared start1.sh script and adjusting bake tags/args.

Changes:

  • Bump SkyRL base to CUDA 12.8.1 and build Python 3.12 from source; install SkyRL via uv with extras.
  • Introduce container-template/start1.sh and switch SkyRL/PyTorch images to run it as the container command.
  • Update bake configuration/tags for SkyRL and PyTorch.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
official-templates/skyrl/Dockerfile Moves SkyRL to CUDA 12.8.1 + Python 3.12 and switches startup script handling (currently breaks multi-context COPY usage).
official-templates/skyrl/docker-bake.hcl Adjusts SkyRL image tag and args (currently mismatched with the updated Dockerfile and passes an unused arg).
official-templates/pytorch/Dockerfile Switches to start1.sh, removes ubuntu user, and adds Jupyter config/hardening changes.
official-templates/pytorch/docker-bake.hcl Updates tag scheme to ${TAG_SUFFIX} and bumps suffix.
container-template/start1.sh New shared startup script to start nginx/ssh/jupyter and run pre/post hooks.
Comments suppressed due to low confidence (2)

official-templates/skyrl/Dockerfile:20

  • The proxy stage copies nginx.conf and readme.html from the default context, but those files aren’t present in official-templates/skyrl. Since docker-bake.hcl provides a named build context "proxy", this should be sourced from that context (or add the missing files to the skyrl template directory). As written, the build will fail.
# nginx config / landing page (keep as its own stage to match existing COPY --from=proxy)
FROM scratch AS proxy
# These two files must exist in the build context; adjust paths if not at root
COPY nginx.conf nginx.conf
COPY readme.html readme.html

official-templates/skyrl/Dockerfile:133

  • NCCL_TESTS_VERSION is declared as a build arg, but the nccl-tests clone doesn’t pin to that version/tag. This makes builds non-reproducible and leaves the ARG unused. Use git clone --branch "${NCCL_TESTS_VERSION}" (as done in official-templates/pytorch/Dockerfile) or remove the ARG.
# nccl-tests
# =========================================================
RUN set -eux; \
    git clone --depth 1 https://github.com/NVIDIA/nccl-tests.git /tmp/nccl-tests; \
    make -C /tmp/nccl-tests -j"$(nproc)"; \

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread official-templates/skyrl/Dockerfile Outdated
Comment on lines +10 to +14
FROM scratch AS logo
COPY container-template/yotta.txt yotta.txt

FROM scratch AS scripts
COPY container-template/start.sh start.sh
COPY start1.sh start1.sh

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multi-context files are being copied from the default build context, but they don’t exist under official-templates/skyrl (e.g., container-template/yotta.txt and start1.sh). Also, defining a stage named "scripts" prevents using the buildx bake named context "scripts" of the same name. Consider removing these scratch stages (logo/scripts) and copying directly from the bake contexts (e.g., COPY --from=logo yotta.txt ..., COPY --from=scripts start1.sh ...), or rename the stages/contexts to avoid the collision.

Copilot uses AI. Check for mistakes.
Comment on lines +37 to +40
PYTHONUNBUFFERED=1 \
PATH=/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:$PATH \
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH \
JUPYTER_PASSWORD=yotta

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JUPYTER_PASSWORD is hard-coded in the image (ENV JUPYTER_PASSWORD=yotta), but docker-bake.hcl passes JUPYTER_PASSWORD as a build arg. To avoid baking a default credential into the image and to honor the bake arg, declare ARG JUPYTER_PASSWORD and set ENV JUPYTER_PASSWORD=${JUPYTER_PASSWORD} (or remove the bake arg if runtime-only).

Copilot uses AI. Check for mistakes.
usermod -aG sudo ubuntu || true; \
echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu; \
echo "ubuntu:ubuntu" | chpasswd || true
RUN mkdir -p /workspace && chmod 777 /workspace /root

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making /root world-writable (chmod 777) is a security risk and is unusual for a base template. Prefer keeping /root owned by root with 700/755 permissions, and only open up /workspace (or another dedicated writable directory) if the platform requires it.

Suggested change
RUN mkdir -p /workspace && chmod 777 /workspace /root
RUN mkdir -p /workspace && chmod 777 /workspace && chmod 700 /root

Copilot uses AI. Check for mistakes.
Comment on lines 72 to 76
# =========================================================
# Users: ubuntu (compat)
# Workspace initialization
# =========================================================
RUN set -eux; \
useradd -ms /bin/bash ubuntu || true; \
usermod -aG sudo ubuntu || true; \
echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu; \
echo "ubuntu:ubuntu" | chpasswd || true
RUN mkdir -p /workspace && chmod 777 /workspace /root

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the ubuntu user creation removed, the image effectively only has root, but sshd_config is set to PermitRootLogin no (later in this Dockerfile) and start1.sh only provisions ~/.ssh for the current user. This combination will make SSH access impossible. Either create a non-root user and have start1.sh provision that user’s authorized_keys, or allow root key-based login (e.g., PermitRootLogin prohibit-password) to match the startup script.

Copilot uses AI. Check for mistakes.
Comment thread official-templates/skyrl/Dockerfile Outdated
Comment on lines +155 to +160
# =========================================================
# start.sh (must be kept as-is)
# =========================================================
COPY --from=scripts start.sh /start.sh
RUN chmod 755 /start.sh
COPY --from=scripts start1.sh /start1.sh
RUN chmod 755 /start1.sh && \
sed -i 's/\r$//' /start1.sh

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The section header says “start.sh (must be kept as-is)” but the image now copies /start1.sh. Update the comment to match the new startup script name to avoid misleading future edits.

Copilot uses AI. Check for mistakes.
Comment on lines +39 to +44
if [ ! -f /etc/ssh/ssh_host_dsa_key ]; then
ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -q -N ''
echo "DSA key fingerprint:"
ssh-keygen -lf /etc/ssh/ssh_host_dsa_key.pub
fi

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generating a DSA SSH host key is deprecated and considered insecure; many OpenSSH configurations disable DSA entirely. Consider removing the ssh_host_dsa_key generation block and rely on RSA/ECDSA/ED25519 keys.

Suggested change
if [ ! -f /etc/ssh/ssh_host_dsa_key ]; then
ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -q -N ''
echo "DSA key fingerprint:"
ssh-keygen -lf /etc/ssh/ssh_host_dsa_key.pub
fi

Copilot uses AI. Check for mistakes.
local script_msg=$2
if [ -f "${script_path}" ]; then
echo "${script_msg}"
bash ${script_path}

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

execute_script invokes bash with an unquoted path (bash ${script_path}), which can break if the path ever contains whitespace or glob characters. Quote the variable when executing.

Suggested change
bash ${script_path}
bash "${script_path}"

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +72
printenv | grep -E '^YOTTA_|^PATH=|^_=' | awk -F = '{ print "export " $1 "=\"" $2 "\"" }' >> /etc/rp_environment
echo 'export PATH=/usr/local/nvidia/bin:/usr/local/cuda-12.8/bin:~/.local/bin:$PATH' >> /etc/rp_environment
echo 'source /etc/rp_environment' >> ~/.bashrc

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

export_env_vars appends to /etc/rp_environment and ~/.bashrc on every container start, which can lead to duplicated lines across restarts. Consider rewriting /etc/rp_environment each time (or guarding with a marker) and only adding the “source /etc/rp_environment” line if it isn’t already present.

Suggested change
printenv | grep -E '^YOTTA_|^PATH=|^_=' | awk -F = '{ print "export " $1 "=\"" $2 "\"" }' >> /etc/rp_environment
echo 'export PATH=/usr/local/nvidia/bin:/usr/local/cuda-12.8/bin:~/.local/bin:$PATH' >> /etc/rp_environment
echo 'source /etc/rp_environment' >> ~/.bashrc
printenv | grep -E '^YOTTA_|^PATH=|^_=' | awk -F = '{ print "export " $1 "=\"" $2 "\"" }' > /etc/rp_environment
echo 'export PATH=/usr/local/nvidia/bin:/usr/local/cuda-12.8/bin:~/.local/bin:$PATH' >> /etc/rp_environment
if ! grep -qxF 'source /etc/rp_environment' ~/.bashrc 2>/dev/null; then
echo 'source /etc/rp_environment' >> ~/.bashrc
fi

Copilot uses AI. Check for mistakes.
Comment on lines 128 to 129
# =========================================================
# nccl-tests

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nccl-tests section doesn’t appear to disable MPI (unlike official-templates/pytorch/Dockerfile, which builds with MPI=0 to avoid mpi.h missing). Unless MPI dev headers are installed elsewhere, the build is likely to fail. Consider building nccl-tests with MPI=0 or adding the required MPI dev packages.

Copilot uses AI. Check for mistakes.
Comment on lines 1 to 5
variable "PUBLISHER" { default = "yottalabsai" }
variable "TAG_SUFFIX" { default = "2026010901" }
variable "TAG_SUFFIX" { default = "2026031701" }

group "default" {
targets = ["pytorch290"]

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title indicates it only updates the SkyRL template, but this PR also changes the PyTorch template and adds a new shared startup script under container-template/. Please update the PR title/description to reflect the broader scope, or consider splitting into separate PRs to reduce review and release risk.

Copilot uses AI. Check for mistakes.
@dadaism

dadaism commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

PR Hygiene
Summary empty, test notes empty — same pattern as previous PRs.
Fourth PR with the same pytorch changes — PRs #6, #9, #10, and now #11 all carry identical pytorch Dockerfile/bake diffs. These should be landed once and rebased onto, not copy-pasted into every feature branch.
Type of change says "Update to existing template" — but this also adds start1.sh (shared infrastructure) and modifies pytorch. Check both boxes or split.
SkyRL Dockerfile — Issues
Critical
COPY start1.sh start1.sh in the scripts stage uses wrong context — Dockerfile:14. The old line was COPY container-template/start.sh start.sh, which copies from the build context root. The new line COPY start1.sh start1.sh assumes start1.sh is at the skyrl directory root, but it lives at container-template/start1.sh. The bake file's scripts context points to ../../container-template, so this should be COPY start1.sh start1.sh only if the named context overrides it — but the bake file's contexts block still has:

scripts = "../../container-template"
So the COPY in the scripts scratch stage (which runs before named contexts apply) will look for start1.sh in the default build context (the skyrl directory), not in container-template/. This will likely fail to build. Needs to be COPY container-template/start1.sh start1.sh or the bake context needs restructuring.

post_start.sh removed entirely — The old image ran TensorBoard on port 6006 via post_start.sh. Port 6006 is still EXPOSEd (line 178) but nothing starts TensorBoard anymore. Either remove the port or add TensorBoard startup back.

tini removed from entrypoint — The old image used ENTRYPOINT ["/usr/bin/tini", "--"] for proper PID 1 signal handling. The new entrypoint is just CMD ["/bin/bash", "-c", "exec /bin/bash /start1.sh"]. Without tini, zombie processes won't be reaped and SIGTERM may not propagate to children. tini is still installed in the apt layer — use it.

nccl-tests not pinned to NCCL_TESTS_VERSION — The ARG NCCL_TESTS_VERSION="v2.13.11" is declared (line 2) but the git clone at line ~155 does --depth 1 without --branch "${NCCL_TESTS_VERSION}". The ARG is unused — non-reproducible builds.

SSH config not configured — Same issue as the pytorch changes: the old ubuntu user setup and sshd_config patching are removed, but SSH is still started in start1.sh. Root login will be prohibited by default.

Medium
SkyRL cloned without pinned commit — Dockerfile:105. git clone --depth 1 of HEAD. Not reproducible — a broken commit upstream will break your image.

Image tag doesn't match actual stack — docker-bake.hcl:21 tags the image as ray2.51-py3.11-cuda12.1-... but the Dockerfile now builds Python 3.12 on CUDA 12.8.1. The tag is misleading.

JUPYTER_PASSWORD bake variable says ubuntu — docker-bake.hcl:9 sets default "ubuntu" but the Dockerfile ENV sets yotta. The bake arg isn't passed as a build arg anymore (it was removed from the args block... wait, it's still there at line 29). So the bake value "ubuntu" will override the Dockerfile's "yotta" — inconsistent with the pytorch template.

chmod 777 /workspace /root — Dockerfile:82. Root's home should not be world-writable.

Hardcoded PY_MM="3.12" inside a RUN that also uses ARG PYTHON_VERSION — If someone changes PYTHON_VERSION to 3.11.x, the symlinks will point to 3.12 and break. Derive PY_MM from PYTHON_VERSION (e.g. PY_MM="${PYTHON_VERSION%.*}").

--enable-optimizations removed from Python build — Same as pytorch, ~10-20% slower Python without explanation.

Low
Megatron-LM removed without comment — The old image installed Megatron-LM. If SkyRL users depend on it, this is a breaking change.

unsloth removed — Same concern as Megatron-LM.

Comment at line 155 still says start.sh (must be kept as-is) — but it's now start1.sh. Update the comment.

chown -R root:root /workspace — Dockerfile:173. Unnecessary — files created by root are already owned by root. And it conflicts with chmod 777 set earlier (the intent was any-user writable, but now ownership is root).

start1.sh — Issues
(Same as PR #10 review, this appears to be the fixed version with shebang on line 1)

DSA key generation — deprecated, remove.
Unquoted variables — $JUPYTER_PASSWORD, * in jupyter command will glob-expand.
/etc/rp_environment — RunPod artifact, not Yotta.
Hardcoded cuda-12.8 — use /usr/local/cuda/bin symlink.
chmod 700 -R ~/.ssh — should be 700 on dir, 600 on authorized_keys.
Script name — start1.sh is opaque; rename to start.sh or entrypoint.sh.
Pytorch Dockerfile — Recurring Issues
Same as PRs #6/#9/#10 — all still apply (hardcoded Jupyter token, removed SSH config, removed uv/conda, chmod 777, duplicate dir creation, etc.).

Summary
The SkyRL rewrite has the right direction (building Python from source, using uv for dependency resolution with overrides), but has several build-breaking issues: the start1.sh COPY context is likely wrong, the image tag doesn't match the actual stack versions, and the JUPYTER_PASSWORD bake/Dockerfile values conflict. The removal of tini as PID 1 is a regression. I'd recommend:

Fix the scripts stage COPY path
Restore tini as ENTRYPOINT
Correct the image tag to reflect Python 3.12 / CUDA 12.8.1
Add SSH config (PermitRootLogin yes) since the ubuntu user is gone
Split pytorch changes into their own PR (land once, stop duplicating)

@dadaism dadaism mentioned this pull request Apr 15, 2026
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants