CI: run dl-gpu GPU tests on Lyris pre-cluster via generic trigger pipeline #1830

Draft
Alexey-Rivkin wants to merge 2 commits into ai-dynamo:main from Alexey-Rivkin:ci/lyris-gpu-exec

Alexey-Rivkin commented Jun 24, 2026

Contributor

What?

Run the nixl-ci-dl-gpu GPU tests on the Lyris pre-cluster instead of dlcluster. Jenkins still owns the image build, the matrix, and result reporting; only the GPU execution step changes.

Why?

Pre-cluster onboarding forbids CI service accounts (personal credentials only), and the sanctioned execution path is a project runner on the Lyris frontend. The old svc-nixl SSH + salloc/srun model therefore can't be lifted as-is.

How?

  • New step triggers a generic lyris-exec GitLab pipeline (via the vendored .ci/scripts/trigger_and_wait.sh), which runs all four tests in one Lyris allocation and reports back. Removes the dlcluster salloc/srun/scancel steps, the svc-nixl SSH credential, and the dlcluster SLURM_* env. The per-test env for cpp (UCX_IB_REG_METHODS=rcache) and nixlbench (HAS_GPU=false) is kept inline.
  • Draft: depends on the precluster-poc pipeline (MR !1) landing and on live Lyris validation. Needs the Jenkins credentials lyris-trigger-token and lyris-api-token. LYRIS_PIPELINE_REF flips to main once that MR merges. The check starts as advisory before it gates.
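
For reviewers unfamiliar with GitLab trigger pipelines, here is a rough sketch of the kind of request the trigger step boils down to. It is not the contents of trigger_and_wait.sh: the host, the project ID, and the TEST_CMDS_FILE variable name are hypothetical placeholders, and the sketch only prints the curl command so it can run offline. The endpoint shape (POST to /projects/:id/trigger/pipeline with token, ref, and variables[...] form fields) follows GitLab's documented trigger API.

```shell
# Hypothetical placeholders: none of these values come from this PR.
GITLAB_API="https://gitlab.example.com/api/v4"
PROJECT_ID="123"
# Assumption: default to main, the ref the description says is used after the MR merges.
PIPELINE_REF="${LYRIS_PIPELINE_REF:-main}"

# Print the request instead of sending it. The token stays a literal
# "$LYRIS_TRIGGER_TOKEN" reference so the secret is never echoed.
trigger_request() {
    printf 'curl --fail --silent --show-error --request POST \\\n'
    printf '  --form "token=$LYRIS_TRIGGER_TOKEN" \\\n'
    printf '  --form "ref=%s" \\\n' "$PIPELINE_REF"
    printf '  --form "variables[TEST_CMDS_FILE]=%s" \\\n' "$1"
    printf '  "%s/projects/%s/trigger/pipeline"\n' "$GITLAB_API" "$PROJECT_ID"
}

trigger_request "test-cmds.txt"
```

Printing rather than executing keeps the sketch free of credentials and network access; a real run would also need to read the pipeline ID from the JSON response in order to poll it.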

Replace the slurmCI allocation + 4 individual srun steps (and the
pipeline_stop teardown) with a single "Run DL tests on Lyris" step that
writes a test-cmds file and calls the vendored trigger_and_wait.sh helper.

The helper triggers the generic lyris-exec pipeline on precluster-poc,
polls to completion, fetches artifacts, and exits 0/1/90 so Jenkins can
gate the PR check correctly. Infra failures retry up to 2 times.
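
A minimal sketch of how the exit codes and the retry could fit together. The commit message does not spell out what 0/1/90 mean or where the retry lives, so this assumes 0 = tests passed, 1 = tests failed, 90 = infrastructure failure, and it puts the retry in a hypothetical run_with_retries wrapper. fake_helper is a stand-in for trigger_and_wait.sh, not the real script.

```shell
# Assumed meaning of the helper's exit codes (not stated in the commit):
#   0 = tests passed, 1 = tests failed, 90 = infrastructure failure.
INFRA_EXIT=90
MAX_RETRIES=2

# Run "$@" and re-run it only on the infra code, at most MAX_RETRIES times.
run_with_retries() {
    attempt=0
    while :; do
        rc=0
        "$@" || rc=$?
        if [ "$rc" -ne "$INFRA_EXIT" ] || [ "$attempt" -ge "$MAX_RETRIES" ]; then
            return "$rc"
        fi
        attempt=$((attempt + 1))
        echo "infra failure (exit $rc), retry $attempt of $MAX_RETRIES" >&2
    done
}

# Hypothetical stand-in for trigger_and_wait.sh: exits 90 twice, then 0.
calls_file=$(mktemp)
echo 0 > "$calls_file"
fake_helper() {
    calls=$(( $(cat "$calls_file") + 1 ))
    echo "$calls" > "$calls_file"
    if [ "$calls" -lt 3 ]; then return 90; fi
    return 0
}

final_rc=0
run_with_retries fake_helper || final_rc=$?
echo "final exit code: $final_rc, helper calls: $(cat "$calls_file")"
```

With these assumptions the stand-in is called three times (one initial attempt plus two retries) and the wrapper returns 0. A test failure (exit 1) is returned immediately without retrying, so only infrastructure noise is absorbed.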

Per-test env nuances (UCX_IB_REG_METHODS=rcache for cpp,
HAS_GPU=false for nixlbench) are expressed as inline shell env-var
prefixes in the test command, which is equivalent to the old slurmEnv
injection.
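
To illustrate the inline prefix form, here is a small sketch. The test-cmds file format (one shell command per line) is an assumption, and the real test binaries are replaced by `sh -c` one-liners that print the variables they see.

```shell
# Start from a clean slate so the demo does not depend on the caller's env.
unset UCX_IB_REG_METHODS HAS_GPU

# Hypothetical test-cmds file: one shell command per line (assumed format).
cmds_file=$(mktemp)
cat > "$cmds_file" <<'EOF'
UCX_IB_REG_METHODS=rcache sh -c 'echo "cpp: UCX_IB_REG_METHODS=${UCX_IB_REG_METHODS:-unset} HAS_GPU=${HAS_GPU:-unset}"'
HAS_GPU=false sh -c 'echo "nixlbench: UCX_IB_REG_METHODS=${UCX_IB_REG_METHODS:-unset} HAS_GPU=${HAS_GPU:-unset}"'
EOF

# Run each line in its own shell. A VAR=value prefix is exported to that
# one command only and does not leak into the next line.
output=$(while IFS= read -r cmd; do sh -c "$cmd" < /dev/null; done < "$cmds_file")
printf '%s\n' "$output"
rm -f "$cmds_file"
```

Each variable is visible only to the command it prefixes: the cpp line sees UCX_IB_REG_METHODS=rcache but no HAS_GPU, and the nixlbench line sees HAS_GPU=false but no UCX_IB_REG_METHODS.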

Credentials LYRIS_TRIGGER_TOKEN and LYRIS_API_TOKEN must be created in
Jenkins before this job runs (see TODO comments in the YAML).
@github-actions


👋 Hi Alexey-Rivkin! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀
