ci(docker-new): split base-cuda layer and restructure CI pipelines #7025

Draft
xmfcx wants to merge 10 commits into main from ci/docker-new-layer-restructure

@xmfcx xmfcx commented Apr 17, 2026

This PR is being split into smaller, focused PRs for easier review. It stays open in draft while the stack is prepared; it will be closed once every successor lands on main.

Split stack

Each PR is stacked on top of the previous one. Review from the bottom of the stack up, starting with item 1. Branches are already pushed; PRs are opened one at a time so that each merges before the next is opened.

  1. ci(apt): harden apt with retries + Azure mirror + archive fallback
  2. ci: gate setup-universe behind run:health-check, unify label reusable
  3. ci(health-check): run on push to main + split into pr / main / reusable
  4. ci(health-check): rewire cache to match docker-new restructure
  5. ci(docker-new): split base-cuda layer and restructure CI pipelines
  6. docs(docker-new): add examples (basic cpu/dri/nvidia + awsim + planning-simulator)

Original combined description preserved in the commit body of feat/split at 135bd8256.

github-actions bot commented Apr 17, 2026

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

Restructures the docker-new image graph and CI topology:

- Add base-cuda-runtime / base-cuda-devel stages as a dedicated CUDA
  base, and rebase universe-cuda off them instead of universe.
- Drop universe-runtime-dependencies; fold into universe.
- Split rmw and nvidia ansible roles into standalone playbooks so
  Dockerfiles bind-mount only what they need.
- Switch every BuildKit RUN mount to named IDs
  (apt/ccache/pip/pipx, ROS_DISTRO-scoped where relevant).
- Split per-distro CI into per-arch jobs; extract multi-arch manifest
  stitching into docker-manifest-new.yaml.
- Persist BuildKit mount caches across runs via actions/cache +
  buildkit-cache-dance, with a size guard and lineage pruning.
- Add docker-new/examples/ (basic cpu/dri/nvidia + awsim + planning
  simulator demos).

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
@xmfcx xmfcx force-pushed the ci/docker-new-layer-restructure branch from e757246 to 92c674e Compare April 17, 2026 05:16
@xmfcx xmfcx added the run:health-check Run health-check label Apr 17, 2026
xmfcx added 2 commits April 17, 2026 08:31
- Switch buildcache-new tag to its own scope
  `health-check-<build-type>-humble-<platform>-<ref>` (instead of the
  ci-universe tags that no longer populate), with current-ref + main
  fallback reads and a current-ref write, mirroring
  docker-build-new.yaml's pattern.
- Add read-only `buildkit-cache-dance` injection fed by the
  `buildkit-mounts-ci-universe-humble-<platform>-*` tarballs the main
  pipeline saves, so health-check reuses the same apt / ccache / pip /
  pipx mount state without writing back.
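
The tag scope described in the first bullet above can be sketched as a small shell helper. This is only an illustration: the function names and the sample ref `pr-7025` are hypothetical, and the real workflow may assemble these strings differently.

```shell
#!/bin/sh
# Hypothetical helper: compose a cache tag in the layout named in the commit
# message, health-check-<build-type>-humble-<platform>-<ref>.
cache_tag() {
  # $1 = build type, $2 = platform, $3 = ref
  printf 'health-check-%s-humble-%s-%s\n' "$1" "$2" "$3"
}

# Reads: the current ref first, then the main fallback.
cache_from_tags() {
  cache_tag "$1" "$2" "$3"
  cache_tag "$1" "$2" main
}

# Writes: the current ref only.
cache_to_tag() {
  cache_tag "$1" "$2" "$3"
}

# Sample invocation with an illustrative ref name.
cache_from_tags main amd64 pr-7025
```

Keeping the write side limited to the current ref means a PR build cannot overwrite the `main` cache that other PRs fall back to.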

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
- Put setup-universe behind the same `run:health-check` label the
  docker-new health-check workflow uses. setup-universe runs the
  setup-dev-env.sh script on a self-hosted runner, so triggering it on
  every PR is expensive; a single label now toggles both heavyweight
  pre-merge checks together.
- Switch health-check from `make-sure-label-is-present.yaml@v1` to
  `require-label.yaml@v1`, matching the sub-repo build-and-test
  workflows and the new setup-universe gate.

Also drops an unused `load: false` in the bake step (false is the
default).

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
xmfcx added a commit to autowarefoundation/autoware-spell-check-dict that referenced this pull request Apr 17, 2026
Words flagged by spell-check in autowarefoundation/autoware#7025:

- argjson        (jq --argjson flag)
- containerimage (docker buildx output type)
- imagetools     (docker buildx subcommand)
- llvmpipe       (mesa software renderer)
- radeonsi       (mesa AMD Gallium driver)
- redownloading  (compound word)
- tzdata         (ubuntu timezone package)

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
xmfcx added 5 commits April 17, 2026 10:01
Transient TCP timeouts to archive.ubuntu.com have been failing the
docker build during ansible's `apt install` step. Configure libapt to
retry 5x with 30s HTTP(S) timeouts via `/etc/apt/apt.conf.d/99-retries`
in the base image, inherited by every downstream stage.
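
A minimal sketch of the retry configuration described above, assuming the three standard apt options `Acquire::Retries`, `Acquire::http::Timeout`, and `Acquire::https::Timeout`. The `APT_CONF_DIR` variable is a hypothetical override so the snippet can run without root; in the image the file lives under `/etc/apt/apt.conf.d`.

```shell
#!/bin/sh
set -eu

# Hypothetical override: defaults to a scratch directory instead of
# /etc/apt/apt.conf.d so the sketch can be tried unprivileged.
APT_CONF_DIR="${APT_CONF_DIR:-$(mktemp -d)}"

# Retry failed downloads 5 times and cap HTTP(S) waits at 30 seconds.
cat > "$APT_CONF_DIR/99-retries" <<'EOF'
Acquire::Retries "5";
Acquire::http::Timeout "30";
Acquire::https::Timeout "30";
EOF

cat "$APT_CONF_DIR/99-retries"
```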

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
Add the Azure-hosted Canonical mirror as the primary apt source, keeping
archive.ubuntu.com as a built-in failsafe. apt tries mirrors in order
per URI, so Azure serves the fast happy path (especially for GHA runners
on Azure) and archive.ubuntu.com takes over if Azure has a blip.

Handles both classic /etc/apt/sources.list (jammy / humble base) and the
deb822 /etc/apt/sources.list.d/ubuntu.sources (noble / jazzy base).

Strictly more reliable than the previous single-mirror config: both
mirrors would have to be unreachable for the build to fail, instead of
just archive.ubuntu.com. Combined with Acquire::Retries, this covers
both steady-state mirror outages and transient TCP flakes.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
Mirror the reliability hardening from docker-new/base.Dockerfile into
the setup-universe job, which runs in a bare ubuntu:22.04 / ubuntu:24.04
container. Applied before the first apt call so the initial
`apt-get update` already benefits:

- Acquire::Retries "5" with 30s HTTP(S) timeouts to survive transient
  TCP flakes against archive.ubuntu.com.
- Azure mirror (azure.archive.ubuntu.com) as the primary apt source with
  archive.ubuntu.com as the failsafe. Handles both classic
  /etc/apt/sources.list (jammy) and deb822
  /etc/apt/sources.list.d/ubuntu.sources (noble) formats.

Uses only tools already present in the base image (echo, sed), so no
package installs need to succeed before the mirror config is in place.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
The workflow's docker-bake step reads from a `:...-main` registry cache
ref as a fallback, populated only when the workflow runs on `main`.
Previously it fired only on PR/schedule/dispatch, so the main cache
never got written on merges, starving PR builds of warm layers.

- Add `push` trigger on `main`.
- Skip `require-label` for non-PR events (push/schedule/dispatch).
- Use `always() && (... || github.event_name != 'pull_request')` on
  docker-build so it runs even when require-label was skipped.
- Simplify per-step gates from `schedule || workflow_dispatch` to
  `!= 'pull_request'`, which naturally covers push too.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
Disentangle the three-mode gating in a single file by splitting it
into dedicated workflows per trigger and a shared reusable job.

- health-check-reusable.yaml: `workflow_call` with the matrix docker
  build job. Step-level `if:`s reduced to the genuinely orthogonal
  matrix filters (`build-type != 'main'`, `platform == 'arm64'`,
  `build-type == 'nightly'`). No label or paths logic.
- health-check-pr.yaml: `pull_request` trigger with `paths:` filter at
  the trigger level (replaces the per-step `changed-files` gate). Uses
  the `require-label` reusable workflow which exits 1 on missing
  label, so a watched-path PR without `run:health-check` turns red
  instead of silently skipping. Calls the reusable on success.
- health-check.yaml: trimmed to `push: main`, `schedule`, and
  `workflow_dispatch`. No label check, no paths filter; directly
  calls the reusable. Purpose is to keep populating the shared
  `:health-check-*-main` BuildKit registry cache for PRs to read.

Net: ~10 repeated `if:` conditionals removed, the `changed-files`
step removed, each trigger mode expressed in its own file.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>
@xmfcx xmfcx force-pushed the ci/docker-new-layer-restructure branch from 0700e43 to 9af1585 Compare April 17, 2026 09:20
Previously the classic /etc/apt/sources.list (jammy) was handled by
duplicating `deb` lines with azure.archive.ubuntu.com first and
archive.ubuntu.com second. apt treats parallel deb lines as independent
sources and fetches InRelease/Packages from both, wasting ~35 MB per
update rather than falling back only when needed.

Switch both the Dockerfile and setup-universe workflow to apt's
`mirror+file://` transport:

- Write /etc/apt/ubuntu-mirrors.list with Azure first, archive second.
- Replace each `http://archive.ubuntu.com/ubuntu` reference (classic or
  deb822) with `mirror+file:///etc/apt/ubuntu-mirrors.list`.

apt reads the list and tries URLs in order, using the first that
responds, so we get true first-win fallback in both sources formats.
security.ubuntu.com is left untouched (separate host, not mirrored on
Azure).
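
The rewrite described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual script: `ROOT` is a hypothetical scratch prefix standing in for `/`, and the sample `sources.list` content is made up for the demo.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch prefix standing in for / so this runs unprivileged.
ROOT="${ROOT:-$(mktemp -d)}"
mkdir -p "$ROOT/etc/apt/sources.list.d"

# Illustrative classic sources.list (jammy style).
cat > "$ROOT/etc/apt/sources.list" <<'EOF'
deb http://archive.ubuntu.com/ubuntu/ jammy main restricted
deb http://security.ubuntu.com/ubuntu/ jammy-security main restricted
EOF

# Mirror list: Azure first, archive second.
printf '%s\n' \
  'http://azure.archive.ubuntu.com/ubuntu/' \
  'http://archive.ubuntu.com/ubuntu/' \
  > "$ROOT/etc/apt/ubuntu-mirrors.list"

# Rewrite whichever sources format exists. security.ubuntu.com does not
# match the pattern, so it is left untouched.
for f in "$ROOT/etc/apt/sources.list" \
         "$ROOT/etc/apt/sources.list.d/ubuntu.sources"; do
  if [ -f "$f" ]; then
    sed 's|http://archive\.ubuntu\.com/ubuntu/*|mirror+file:///etc/apt/ubuntu-mirrors.list|g' \
      "$f" > "$f.tmp"
    mv "$f.tmp" "$f"
  fi
done

cat "$ROOT/etc/apt/sources.list"
```

The sketch writes through a temporary file rather than `sed -i` only to stay portable across sed implementations.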

The file-existence guard uses `if [ -f "$f" ]; then ...; fi` rather
than `[ -f "$f" ] && sed ...`: under `sh -e` (the default for GHA
`run:` steps) the `&&` chain short-circuits and returns 1 when the
file doesn't exist, tripping errexit. Only one of the two sources
formats is present on any given Ubuntu version, so the non-matching
iteration would otherwise kill the step.
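
The difference between the two guard styles comes down to the exit status they leave behind when the file is absent. A minimal sketch, using a hypothetical missing path and placeholder `echo` calls instead of the real `sed`:

```shell
#!/bin/sh
# Hypothetical path that does not exist.
missing="/nonexistent/ubuntu.sources"

# "&&" form: when the test fails, the whole list returns the test's status.
guard_with_and() { [ -f "$1" ] && echo "would run sed on $1"; }

# "if" form: when no branch runs, the construct returns 0.
guard_with_if() { if [ -f "$1" ]; then echo "would run sed on $1"; fi; }

guard_with_and "$missing"; and_status=$?
guard_with_if "$missing"; if_status=$?

echo "&& form: $and_status, if form: $if_status"
```

With the file missing, the `&&` form leaves status 1 and the `if` form leaves status 0, so only the `if` form is safe to use where a non-zero status would fail the step.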

Verified locally by baking both the humble (jammy / classic
sources.list) and jazzy (noble / deb822) base images end to end; the
apt-get install inside the ansible rmw role succeeded via
mirror+file:// on both.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>

style(pre-commit): autofix
@xmfcx xmfcx force-pushed the ci/docker-new-layer-restructure branch from b4f16f0 to bf848fa Compare April 17, 2026 09:21
`apt-transport-mirror` treats unannotated peer URLs in a mirrorlist as
equal, and spreads each `apt-get update` request across them for load
balancing. That combined with `mirror+file://` in bf848fa means one
`InRelease` can come from azure while the matching `Packages.gz` comes
from archive.ubuntu.com (or vice versa), and when the two hosts are
mid-sync apt fails with:

  File has unexpected size (4263777 != 4263737). Mirror sync in progress?

Observed on an earlier PR run at step:15 line 881 — archive served an
`InRelease` that disagreed with the `Packages.gz` its mirror had just
rotated. `Acquire::Retries` doesn't help here because the bad file is
consistently returned until the mirror finishes its push.

Annotate the mirrorlist with `priority:` (lower = preferred) so every
request goes to azure first and archive is touched only when azure
fails. This eliminates the cross-host version-skew race entirely;
the only remaining case is a mid-push on azure alone, which is rare
and typically covered by `Acquire::Retries`.
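
An annotated mirror list along the lines described above might be generated like this. The sketch assumes the mirrorlist layout from the `apt-transport-mirror` manual (one URI per line, followed by tab-separated metadata). The specific priority values 1 and 2 and the `MIRROR_LIST` override are illustrative assumptions, not taken from the PR.

```shell
#!/bin/sh
set -eu

# Hypothetical override; the commits write /etc/apt/ubuntu-mirrors.list.
MIRROR_LIST="${MIRROR_LIST:-$(mktemp)}"

# Lower priority number = preferred. Azure is tried first; archive is
# used only when Azure fails.
printf '%s\tpriority:%s\n' \
  'http://azure.archive.ubuntu.com/ubuntu/' 1 \
  'http://archive.ubuntu.com/ubuntu/' 2 \
  > "$MIRROR_LIST"

cat "$MIRROR_LIST"
```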

Secondary benefit: on GHA runners (which live inside Azure's network)
azure.archive.ubuntu.com is an order of magnitude faster than the
public archive.ubuntu.com, so pinning azure as primary is also a
throughput win — "load balancing" over an unevenly fast pair was a
pessimization, not a speedup.

Signed-off-by: Mete Fatih Cırıt <mfc@autoware.org>