Feat/update skyrl template by AanningZoeng · Pull Request #11 · yottalabsai/container

AanningZoeng · 2026-04-13T08:07:43Z

Summary

Type of Change

New template
Update to existing template (version bump, new packages, etc.)
Bug fix
Shared infrastructure change (container-template/)
Documentation only

Checklist

Dockerfile builds successfully (docker buildx bake <target> --no-cache)
Container starts and all services are reachable
README.md updated to reflect any interface or version changes
No secrets, tokens, or internal hostnames committed
docker-bake.hcl target name and image tag are correct

Test Notes

Copilot

Pull request overview

Updates the SkyRL container template (CUDA/Python stack and startup behavior) and aligns runtime/service setup across templates by introducing a new shared start1.sh script and adjusting bake tags/args.

Changes:

Bump SkyRL base to CUDA 12.8.1 and build Python 3.12 from source; install SkyRL via uv with extras.
Introduce container-template/start1.sh and switch SkyRL/PyTorch images to run it as the container command.
Update bake configuration/tags for SkyRL and PyTorch.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 16 comments.

| File | Description |
| --- | --- |
| `official-templates/skyrl/Dockerfile` | Moves SkyRL to CUDA 12.8.1 + Python 3.12 and switches startup script handling (currently breaks multi-context COPY usage). |
| `official-templates/skyrl/docker-bake.hcl` | Adjusts SkyRL image tag and args (currently mismatched with the updated Dockerfile and passes an unused arg). |
| `official-templates/pytorch/Dockerfile` | Switches to `start1.sh`, removes ubuntu user, and adds Jupyter config/hardening changes. |
| `official-templates/pytorch/docker-bake.hcl` | Updates tag scheme to `${TAG_SUFFIX}` and bumps suffix. |
| `container-template/start1.sh` | New shared startup script to start nginx/ssh/jupyter and run pre/post hooks. |

Comments suppressed due to low confidence (2)

official-templates/skyrl/Dockerfile:20

The proxy stage copies nginx.conf and readme.html from the default context, but those files aren’t present in official-templates/skyrl. Since docker-bake.hcl provides a named build context "proxy", this should be sourced from that context (or add the missing files to the skyrl template directory). As written, the build will fail.

# nginx config / landing page (keep as its own stage to match existing COPY --from=proxy)
FROM scratch AS proxy
# These two files must exist in the build context; adjust paths if not at root
COPY nginx.conf nginx.conf
COPY readme.html readme.html

official-templates/skyrl/Dockerfile:133

NCCL_TESTS_VERSION is declared as a build arg, but the nccl-tests clone doesn’t pin to that version/tag. This makes builds non-reproducible and leaves the ARG unused. Use git clone --branch "${NCCL_TESTS_VERSION}" (as done in official-templates/pytorch/Dockerfile) or remove the ARG.

# nccl-tests
# =========================================================
RUN set -eux; \
    git clone --depth 1 https://github.com/NVIDIA/nccl-tests.git /tmp/nccl-tests; \
    make -C /tmp/nccl-tests -j"$(nproc)"; \

Copilot · 2026-04-13T08:13:55Z

 FROM scratch AS logo
 COPY container-template/yotta.txt yotta.txt

 FROM scratch AS scripts
-COPY container-template/start.sh start.sh
+COPY start1.sh start1.sh


The multi-context files are being copied from the default build context, but they don’t exist under official-templates/skyrl (e.g., container-template/yotta.txt and start1.sh). Also, defining a stage named "scripts" prevents using the buildx bake named context "scripts" of the same name. Consider removing these scratch stages (logo/scripts) and copying directly from the bake contexts (e.g., COPY --from=logo yotta.txt ..., COPY --from=scripts start1.sh ...), or rename the stages/contexts to avoid the collision.

Copilot · 2026-04-13T08:13:56Z

+    PYTHONUNBUFFERED=1 \
+    PATH=/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:$PATH \
+    LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH \
+    JUPYTER_PASSWORD=yotta


JUPYTER_PASSWORD is hard-coded in the image (ENV JUPYTER_PASSWORD=yotta), but docker-bake.hcl passes JUPYTER_PASSWORD as a build arg. To avoid baking a default credential into the image and to honor the bake arg, declare ARG JUPYTER_PASSWORD and set ENV JUPYTER_PASSWORD=${JUPYTER_PASSWORD} (or remove the bake arg if runtime-only).

Copilot · 2026-04-13T08:13:56Z

-    usermod -aG sudo ubuntu || true; \
-    echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu; \
-    echo "ubuntu:ubuntu" | chpasswd || true
+RUN mkdir -p /workspace && chmod 777 /workspace /root


Making /root world-writable (chmod 777) is a security risk and is unusual for a base template. Prefer keeping /root owned by root with 700/755 permissions, and only open up /workspace (or another dedicated writable directory) if the platform requires it.

Suggested change

-RUN mkdir -p /workspace && chmod 777 /workspace /root
+RUN mkdir -p /workspace && chmod 777 /workspace && chmod 700 /root

Copilot · 2026-04-13T08:13:56Z

 # =========================================================
-# Users: ubuntu (compat)
+# Workspace initialization
 # =========================================================
-RUN set -eux; \
-    useradd -ms /bin/bash ubuntu || true; \
-    usermod -aG sudo ubuntu || true; \
-    echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu; \
-    echo "ubuntu:ubuntu" | chpasswd || true
+RUN mkdir -p /workspace && chmod 777 /workspace /root



With the ubuntu user creation removed, the image effectively only has root, but sshd_config is set to PermitRootLogin no (later in this Dockerfile) and start1.sh only provisions ~/.ssh for the current user. This combination will make SSH access impossible. Either create a non-root user and have start1.sh provision that user’s authorized_keys, or allow root key-based login (e.g., PermitRootLogin prohibit-password) to match the startup script.

Copilot · 2026-04-13T08:13:57Z

 # =========================================================
 # start.sh (must be kept as-is)
 # =========================================================
-COPY --from=scripts start.sh /start.sh
-RUN chmod 755 /start.sh
+COPY --from=scripts start1.sh /start1.sh
+RUN chmod 755 /start1.sh && \
+    sed -i 's/\r$//' /start1.sh


The section header says “start.sh (must be kept as-is)” but the image now copies /start1.sh. Update the comment to match the new startup script name to avoid misleading future edits.

Copilot · 2026-04-13T08:13:58Z

+    if [ ! -f /etc/ssh/ssh_host_dsa_key ]; then
+        ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -q -N ''
+        echo "DSA key fingerprint:"
+        ssh-keygen -lf /etc/ssh/ssh_host_dsa_key.pub
+    fi
+


Generating a DSA SSH host key is deprecated and considered insecure; many OpenSSH configurations disable DSA entirely. Consider removing the ssh_host_dsa_key generation block and rely on RSA/ECDSA/ED25519 keys.

Suggested change

-    if [ ! -f /etc/ssh/ssh_host_dsa_key ]; then
-        ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -q -N ''
-        echo "DSA key fingerprint:"
-        ssh-keygen -lf /etc/ssh/ssh_host_dsa_key.pub
-    fi

Copilot · 2026-04-13T08:13:59Z

+    local script_msg=$2
+    if [ -f "${script_path}" ]; then
+        echo "${script_msg}"
+        bash ${script_path}


execute_script invokes bash with an unquoted path (bash ${script_path}), which can break if the path ever contains whitespace or glob characters. Quote the variable when executing.

Suggested change

-        bash ${script_path}
+        bash "${script_path}"
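
A minimal sketch of why the quoting matters, assuming a hook script whose path contains a space. The function mirrors the shape of execute_script from the diff; the temporary directory and the hook file name are hypothetical and exist only for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the shape of execute_script from the diff above.

execute_script() {
    local script_path=$1
    local script_msg=$2
    if [ -f "${script_path}" ]; then
        echo "${script_msg}"
        # Quoted, so a path with spaces or glob characters stays one argument.
        bash "${script_path}"
    fi
}

# Hypothetical hook living in a directory whose name contains a space.
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/my hooks"
printf 'echo "hook ran"\n' > "${demo_dir}/my hooks/pre_start.sh"

execute_script "${demo_dir}/my hooks/pre_start.sh" "Running pre-start script..."
rm -rf "${demo_dir}"
```

With the unquoted form, bash would receive the path split at the space and fail to find the script.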

Copilot · 2026-04-13T08:13:59Z

+    printenv | grep -E '^YOTTA_|^PATH=|^_=' | awk -F = '{ print "export " $1 "=\"" $2 "\"" }' >> /etc/rp_environment
+    echo 'export PATH=/usr/local/nvidia/bin:/usr/local/cuda-12.8/bin:~/.local/bin:$PATH' >> /etc/rp_environment
+    echo 'source /etc/rp_environment' >> ~/.bashrc


export_env_vars appends to /etc/rp_environment and ~/.bashrc on every container start, which can lead to duplicated lines across restarts. Consider rewriting /etc/rp_environment each time (or guarding with a marker) and only adding the “source /etc/rp_environment” line if it isn’t already present.

Suggested change

-    printenv | grep -E '^YOTTA_|^PATH=|^_=' | awk -F = '{ print "export " $1 "=\"" $2 "\"" }' >> /etc/rp_environment
-    echo 'export PATH=/usr/local/nvidia/bin:/usr/local/cuda-12.8/bin:~/.local/bin:$PATH' >> /etc/rp_environment
-    echo 'source /etc/rp_environment' >> ~/.bashrc
+    printenv | grep -E '^YOTTA_|^PATH=|^_=' | awk -F = '{ print "export " $1 "=\"" $2 "\"" }' > /etc/rp_environment
+    echo 'export PATH=/usr/local/nvidia/bin:/usr/local/cuda-12.8/bin:~/.local/bin:$PATH' >> /etc/rp_environment
+    if ! grep -qxF 'source /etc/rp_environment' ~/.bashrc 2>/dev/null; then
+        echo 'source /etc/rp_environment' >> ~/.bashrc
+    fi
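
The idempotency idea can also be sketched as a standalone function. This is an illustration under assumptions, not the script from the PR: the two file paths are hypothetical parameters (defaulting to the paths in the diff) so it can be tried outside the container, it covers only the YOTTA_ and PATH variables, and it omits the extra PATH line. It also splits each line at the first `=` only, because `awk -F =` printing `$2` would cut a value such as `a=b` at its second `=`. Values that contain newlines are not handled.

```shell
#!/usr/bin/env bash
# Sketch only: env_file and rc_file are hypothetical parameters; the
# defaults are the paths used in the diff above.

export_env_vars() {
    local env_file=${1:-/etc/rp_environment}
    local rc_file=${2:-"${HOME}/.bashrc"}

    # Truncate first so restarts do not accumulate duplicate exports.
    : > "${env_file}"
    printenv | grep -E '^YOTTA_|^PATH=' | while IFS= read -r line; do
        name=${line%%=*}     # text before the first '='
        value=${line#*=}     # everything after the first '=', later '=' included
        # %q (a bash printf feature) shell-quotes the value so the file is safe to source.
        printf 'export %s=%q\n' "${name}" "${value}" >> "${env_file}"
    done

    # Add the source line only once.
    if ! grep -qxF "source ${env_file}" "${rc_file}" 2>/dev/null; then
        echo "source ${env_file}" >> "${rc_file}"
    fi
}
```

Calling the function twice leaves one export per variable in the environment file and a single source line in the rc file.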

Copilot · 2026-04-13T08:13:59Z

 # =========================================================
 # nccl-tests


The nccl-tests section doesn’t appear to disable MPI (unlike official-templates/pytorch/Dockerfile, which builds with MPI=0 to avoid mpi.h missing). Unless MPI dev headers are installed elsewhere, the build is likely to fail. Consider building nccl-tests with MPI=0 or adding the required MPI dev packages.
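
A hedged sketch of what a pinned, MPI-free nccl-tests build step could look like when written as a shell function. The function names, the DRY_RUN switch, and the argument order are hypothetical; the repository URL is passed in rather than assumed, and the version would come from NCCL_TESTS_VERSION.

```shell
#!/usr/bin/env bash
# Sketch only: run(), build_nccl_tests(), and DRY_RUN are hypothetical helpers.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

build_nccl_tests() {
    local repo_url=$1 version=$2 dest=$3
    local jobs
    jobs=$(nproc 2>/dev/null || echo 1)

    # --branch accepts a tag name, so the clone is pinned to one release.
    run git clone --depth 1 --branch "${version}" "${repo_url}" "${dest}"
    # MPI=0 builds without MPI, so mpi.h is not required.
    run make -C "${dest}" -j"${jobs}" MPI=0
}
```

Running it with DRY_RUN=1 prints the two commands without touching the network, which makes it easy to confirm that the tag and the MPI flag are actually passed through.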

Copilot · 2026-04-13T08:13:59Z

 variable "PUBLISHER"  { default = "yottalabsai" }
-variable "TAG_SUFFIX" { default = "2026010901" }
+variable "TAG_SUFFIX" { default = "2026031701" }

 group "default" {
  targets = ["pytorch290"]


The PR title indicates it only updates the SkyRL template, but this PR also changes the PyTorch template and adds a new shared startup script under container-template/. Please update the PR title/description to reflect the broader scope, or consider splitting into separate PRs to reduce review and release risk.

dadaism · 2026-04-15T08:56:45Z

PR Hygiene

Summary empty, test notes empty: same pattern as previous PRs.
Fourth PR with the same pytorch changes: PRs #6, #9, #10, and now #11 all carry identical pytorch Dockerfile/bake diffs. These should be landed once and rebased onto, not copy-pasted into every feature branch.
Type of change says "Update to existing template", but this also adds start1.sh (shared infrastructure) and modifies pytorch. Check both boxes or split.

SkyRL Dockerfile: Issues

Critical

COPY start1.sh start1.sh in the scripts stage uses the wrong context (Dockerfile:14). The old line was COPY container-template/start.sh start.sh, which copies from the build context root. The new line assumes start1.sh sits at the root of the default build context, but it lives at container-template/start1.sh. The bake file's contexts block still has:

scripts = "../../container-template"

A COPY without --from inside the scripts scratch stage reads from the default build context (the skyrl directory), not from the named scripts context, so it will look for start1.sh in the wrong place. This will likely fail to build. It needs to be COPY container-template/start1.sh start1.sh, or the bake contexts need restructuring.

post_start.sh removed entirely — The old image ran TensorBoard on port 6006 via post_start.sh. Port 6006 is still EXPOSEd (line 178) but nothing starts TensorBoard anymore. Either remove the port or add TensorBoard startup back.

tini removed from entrypoint — The old image used ENTRYPOINT ["/usr/bin/tini", "--"] for proper PID 1 signal handling. The new entrypoint is just CMD ["/bin/bash", "-c", "exec /bin/bash /start1.sh"]. Without tini, zombie processes won't be reaped and SIGTERM may not propagate to children. tini is still installed in the apt layer — use it.

nccl-tests not pinned to NCCL_TESTS_VERSION — The ARG NCCL_TESTS_VERSION="v2.13.11" is declared (line 2) but the git clone at line ~155 does --depth 1 without --branch "${NCCL_TESTS_VERSION}". The ARG is unused — non-reproducible builds.

SSH config not configured — Same issue as the pytorch changes: the old ubuntu user setup and sshd_config patching are removed, but SSH is still started in start1.sh. Root login will be prohibited by default.

Medium

SkyRL cloned without a pinned commit (Dockerfile:105): git clone --depth 1 of HEAD. Not reproducible; a broken commit upstream will break your image.

Image tag doesn't match actual stack — docker-bake.hcl:21 tags the image as ray2.51-py3.11-cuda12.1-... but the Dockerfile now builds Python 3.12 on CUDA 12.8.1. The tag is misleading.

JUPYTER_PASSWORD bake variable says ubuntu: docker-bake.hcl:9 sets the default to "ubuntu", but the Dockerfile ENV sets yotta. The variable is still passed in the args block (line 29), so the bake value "ubuntu" conflicts with the Dockerfile's "yotta" and is inconsistent with the pytorch template.

chmod 777 /workspace /root — Dockerfile:82. Root's home should not be world-writable.

Hardcoded PY_MM="3.12" inside a RUN that also uses ARG PYTHON_VERSION — If someone changes PYTHON_VERSION to 3.11.x, the symlinks will point to 3.12 and break. Derive PY_MM from PYTHON_VERSION (e.g. PY_MM="${PYTHON_VERSION%.*}").

--enable-optimizations removed from Python build — Same as pytorch, ~10-20% slower Python without explanation.
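
For the PY_MM point above, the suggested parameter expansion can be checked in isolation. A minimal sketch, assuming PYTHON_VERSION always has three dot-separated components; the function name and the sample version strings are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: derive the major.minor series from a full version string.

py_major_minor() {
    local python_version=$1
    # "%.*" removes the shortest trailing ".<something>", i.e. the patch component.
    printf '%s\n' "${python_version%.*}"
}

py_major_minor "3.12.7"   # prints 3.12
py_major_minor "3.11.9"   # prints 3.11
```

Note that a two-component value such as 3.12 would collapse to 3, so this only holds while the ARG carries a full major.minor.patch version.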

Low

Megatron-LM removed without comment: the old image installed Megatron-LM. If SkyRL users depend on it, this is a breaking change.

unsloth removed — Same concern as Megatron-LM.

Comment at line 155 still says start.sh (must be kept as-is) — but it's now start1.sh. Update the comment.

chown -R root:root /workspace — Dockerfile:173. Unnecessary — files created by root are already owned by root. And it conflicts with chmod 777 set earlier (the intent was any-user writable, but now ownership is root).

start1.sh: Issues

(Same as the PR #10 review; this appears to be the fixed version, with the shebang on line 1.)

DSA key generation: deprecated, remove.
Unquoted variables: $JUPYTER_PASSWORD will word-split, and the bare * in the jupyter command will glob-expand.
/etc/rp_environment: RunPod artifact, not Yotta.
Hardcoded cuda-12.8: use the /usr/local/cuda/bin symlink.
chmod 700 -R ~/.ssh: should be 700 on the directory and 600 on authorized_keys.
Script name: start1.sh is opaque; rename to start.sh or entrypoint.sh.
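
The permissions point in the list above can be sketched as a small provisioning function. This is an illustration rather than the code in start1.sh: the function name, the key argument, and the ssh_dir parameter are hypothetical, and the directory defaults to ~/.ssh for the current user.

```shell
#!/usr/bin/env bash
# Sketch only: setup_authorized_keys and its parameters are hypothetical.

setup_authorized_keys() {
    local public_key=$1
    local ssh_dir=${2:-"${HOME}/.ssh"}

    mkdir -p "${ssh_dir}"
    touch "${ssh_dir}/authorized_keys"
    # Append the key only if it is not already present (safe across restarts).
    if ! grep -qxF -- "${public_key}" "${ssh_dir}/authorized_keys"; then
        printf '%s\n' "${public_key}" >> "${ssh_dir}/authorized_keys"
    fi
    # Directory: owner-only rwx. Key file: owner-only rw.
    chmod 700 "${ssh_dir}"
    chmod 600 "${ssh_dir}/authorized_keys"
}
```

Setting the two modes explicitly avoids the recursive 700, which would mark authorized_keys as executable.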

Pytorch Dockerfile: Recurring Issues

Same as PRs #6/#9/#10; all still apply (hardcoded Jupyter token, removed SSH config, removed uv/conda, chmod 777, duplicate dir creation, etc.).

Summary

The SkyRL rewrite has the right direction (building Python from source, using uv for dependency resolution with overrides), but has several blocking issues: the start1.sh COPY context is likely wrong, the image tag doesn't match the actual stack versions, and the JUPYTER_PASSWORD bake/Dockerfile values conflict. The removal of tini as PID 1 is a regression. I'd recommend:

Fix the scripts stage COPY path
Restore tini as ENTRYPOINT
Correct the image tag to reflect Python 3.12 / CUDA 12.8.1
Add SSH config (PermitRootLogin yes) since the ubuntu user is gone
Split pytorch changes into their own PR (land once, stop duplicating)

AanningZoeng added 3 commits March 17, 2026 02:43

e2b5627 update
632016c Merge branch 'main' of github.com:yottalabsai/container
5295b32 feat: update skyrl template and add start1.sh to container-template

AanningZoeng requested a review from Copilot April 13, 2026 08:07

Copilot started reviewing on behalf of AanningZoeng April 13, 2026 08:08

Copilot AI reviewed Apr 13, 2026

dadaism mentioned this pull request Apr 15, 2026

Feat/update unsloth template #12

Open

10 tasks

AanningZoeng added 2 commits April 15, 2026 22:27

d85d626 update the start script
7904f61 Merge branch 'main' into feat/update-skyrl-template

AanningZoeng wants to merge 5 commits into main from feat/update-skyrl-template
