CI: run dl-gpu GPU tests on Lyris pre-cluster via generic trigger pipeline #1830
Draft
Alexey-Rivkin wants to merge 2 commits into
Replace the slurmCI allocation + 4 individual srun steps (and the pipeline_stop teardown) with a single "Run DL tests on Lyris" step that writes a test-cmds file and calls the vendored trigger_and_wait.sh helper. The helper triggers the generic lyris-exec pipeline on precluster-poc, polls to completion, fetches artifacts, and exits 0/1/90 so Jenkins can gate the PR check correctly. Infra failures retry up to 2 times. Per-test env nuances (UCX_IB_REG_METHODS=rcache for cpp, HAS_GPU=false for nixlbench) are expressed as inline shell env-var prefixes in the test command, which is equivalent to the old slurmEnv injection. Credentials LYRIS_TRIGGER_TOKEN and LYRIS_API_TOKEN must be created in Jenkins before this job runs (see TODO comments in the YAML).
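The exit-code gating described above can be sketched as a small retry wrapper. This is a minimal sketch, not the vendored helper itself: it assumes exit 0 means the tests passed, 1 means a test failed, and 90 means an infrastructure failure, and `run_with_infra_retry` is a hypothetical name introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. Assumed exit-code meanings (not confirmed by the PR text):
#   0 = tests passed, 1 = tests failed, 90 = infrastructure failure.
# run_with_infra_retry is a hypothetical wrapper, not part of the PR.

INFRA_EXIT=90    # assumed "infra failure" exit code
MAX_RETRIES=2    # "retry up to 2 times", per the commit message

run_with_infra_retry() {
    local attempt=0 rc=0
    while :; do
        rc=0
        "$@" || rc=$?
        # Pass (0) and test failure (1) are final verdicts: never retried.
        if [ "$rc" -ne "$INFRA_EXIT" ]; then
            return "$rc"
        fi
        # Infra failure: give up once the retry budget is spent.
        if [ "$attempt" -ge "$MAX_RETRIES" ]; then
            return "$rc"
        fi
        attempt=$((attempt + 1))
        echo "infra failure (exit $rc), retry $attempt/$MAX_RETRIES" >&2
    done
}
```

A caller would wrap whatever command triggers the pipeline, for example `run_with_infra_retry <helper invocation>`. Retrying only on the infra code keeps a genuine test failure from being masked by a later lucky run.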
👋 Hi Alexey-Rivkin! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
What?
Run the `nixl-ci-dl-gpu` GPU tests on the Lyris pre-cluster instead of dlcluster. Jenkins still owns the image build, the matrix, and result reporting; only the GPU execution step changes.

Why?
Pre-cluster onboarding forbids CI service accounts (personal credentials only), and the sanctioned execution path is a project runner on the Lyris frontend, so the old `svc-nixl` SSH + `salloc`/`srun` model can't be lifted as-is.

How?
- Triggers the generic `lyris-exec` GitLab pipeline (via the vendored `.ci/scripts/trigger_and_wait.sh`) that runs all four tests in one Lyris allocation and reports back. Removed the dlcluster `salloc`/`srun`/`scancel` steps, the `svc-nixl` SSH credential, and the dlcluster `SLURM_*` env. The cpp (`UCX_IB_REG_METHODS=rcache`) and nixlbench (`HAS_GPU=false`) per-test env is kept inline.
- Requires the Jenkins credentials `lyris-trigger-token`/`lyris-api-token`; `LYRIS_PIPELINE_REF` flips to `main` once that merges; starts advisory before it gates.
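To illustrate why an inline env-var prefix can stand in for a per-step environment block, here is a minimal sketch. The `print_env` one-liner is a stand-in for a real test binary, an assumption made purely for illustration; only the two variable settings come from the PR.

```shell
#!/usr/bin/env bash
# Sketch only: print_env stands in for a real test command (assumption).
print_env='echo "UCX_IB_REG_METHODS=${UCX_IB_REG_METHODS:-unset} HAS_GPU=${HAS_GPU:-unset}"'

# Start from a clean slate so the scoping is visible.
unset UCX_IB_REG_METHODS HAS_GPU

# A VAR=value prefix is exported to that one child process only.
cpp_out=$(UCX_IB_REG_METHODS=rcache sh -c "$print_env")
bench_out=$(HAS_GPU=false sh -c "$print_env")

# A later command without a prefix sees neither setting.
after_out=$(sh -c "$print_env")

echo "cpp:       $cpp_out"
echo "nixlbench: $bench_out"
echo "after:     $after_out"
```

Because each prefix is scoped to a single command, the cpp and nixlbench settings cannot leak into the other tests that share the same allocation.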