Nixl ep ci adding ep to dlcluster without ci changes - draft #1813
Draft
lishapira wants to merge 26 commits into
Conversation
- Use 4 processes instead of 8 for the elastic EP CI test.
- Assert that rank failures only occur for ranks marked for kill in the plan, catching unexpected crashes during any phase.
- Verify the SIGTERM exit count matches the plan's expected kills, catching cases where the fault-tolerance kill mechanism fails.
- Add get_total_killed_ranks() helper to Plan class.

Made-with: Cursor
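The two checks listed above can be sketched roughly as follows. This is only an illustration: the `Phase` and `Plan` dataclasses, the `ranks_marked_for_kill()` helper, and `validate_exit_codes()` are hypothetical stand-ins rather than the actual classes in this PR, and the sketch assumes each rank's exit code is available as an integer, with a signal death reported as a negative signal number (as Python's `multiprocessing` and `subprocess` do).

```python
from __future__ import annotations

import signal
from dataclasses import dataclass, field


@dataclass
class Phase:
    """Hypothetical phase: the ranks that run and the subset scheduled for a kill."""
    ranks: list[int]
    kill: set[int] = field(default_factory=set)


@dataclass
class Plan:
    """Hypothetical plan: an ordered list of phases."""
    phases: list[Phase]

    def get_total_killed_ranks(self) -> int:
        # Number of kills the plan schedules across all phases.
        return sum(len(phase.kill) for phase in self.phases)

    def ranks_marked_for_kill(self) -> set[int]:
        # Union of every rank that any phase marks for a kill.
        return set().union(*(phase.kill for phase in self.phases))


def validate_exit_codes(plan: Plan, exit_codes: dict[int, int]) -> None:
    """Check per-rank exit codes (rank -> exit code) against the plan."""
    failed = {rank for rank, code in exit_codes.items() if code != 0}
    unexpected = failed - plan.ranks_marked_for_kill()
    assert not unexpected, f"unexpected rank failures: {sorted(unexpected)}"

    # A process terminated by SIGTERM reports exit code -SIGTERM.
    sigterm_exits = sum(1 for code in exit_codes.values() if code == -signal.SIGTERM)
    expected = plan.get_total_killed_ranks()
    assert sigterm_exits == expected, (
        f"expected {expected} SIGTERM exits, got {sigterm_exits}"
    )
```

The first assertion catches crashes in ranks the plan never intended to kill; the second catches the opposite problem, where a planned kill never happened.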
The last phase now uses [0, 3] instead of [0, 1], introducing a rank-index gap (ranks 1 and 2 absent) to exercise sparse rank handling.

Made-with: Cursor
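To make the gap concrete, here is a small illustrative helper (hypothetical, not part of the PR) that lists the rank indices missing below the highest active rank:

```python
def rank_gaps(active_ranks):
    """Return the rank indices below the highest active rank that are absent."""
    present = set(active_ranks)
    return [rank for rank in range(max(present) + 1) if rank not in present]


print(rank_gaps([0, 1]))  # []
print(rank_gaps([0, 3]))  # [1, 2]
```

The old last phase is dense, so it has no gaps; the new one leaves ranks 1 and 2 unused.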
Made-with: Cursor
…flag

Address review comment: elastic.py now exposes an optional --validate-plan flag that enables plan-specific assertions (unexpected failure rejection and SIGTERM count verification). test_python.sh passes the flag for its specific plan.

Made-with: Cursor
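An opt-in flag of this kind is typically a `store_true` argument. The sketch below is an assumption about the shape only: `--validate-plan` is the flag named in the commit, while `--plan` and `build_parser()` are hypothetical and may not match elastic.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="elastic EP test (sketch)")
    # Hypothetical argument: path to the plan JSON file.
    parser.add_argument("--plan", required=True)
    # Off by default, so existing callers keep their current behavior.
    parser.add_argument(
        "--validate-plan",
        action="store_true",
        help="enable plan-specific assertions on rank failures and SIGTERM count",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.validate_plan:
        print("plan-specific assertions enabled")
```

Keeping the flag off by default means only callers that know their plan, such as test_python.sh, pay for the stricter checks.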
The global meson.build sets -rdc=true for all CUDA targets, which causes nvlink to enforce register limits across call boundaries at link time. nixl_ep kernels use --register-usage-level=10 and call nixlPut(), which uses 215 registers, exceeding the nvlink limit and failing at link time.

Adding -rdc=false overrides the global setting for nixl_ep only, so nixlPut gets inlined at compile time instead of being linked separately by nvlink. This matches the standalone setup.py build behavior.

Made-with: Cursor
Install DOCA GPUNetIO dev packages when PRE_INSTALLED_ENV skips the full apt bootstrap but BUILD_NIXL_EP=true requires them. UCX device headers include doca_gpunetio_dev_verbs_qp.cuh, which fails compilation without the dev package installed.

Unset UCX_NET_DEVICES in elastic test subshell so UCX auto-selects a GPU-capable transport. When set by the CI environment, UCX is restricted to a device without GPU peer memory support, causing the RDMA path to fail with "no lane found" errors.

Bump CI_IMAGE_TAG to 20260421-1 (build.sh changed).

Made-with: Cursor
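The commit does this in a shell subshell. The same idea, dropping one variable for a child process without touching the parent environment, can be sketched in Python as below. The helper names are hypothetical, and the device value in the demo is made up.

```python
import os
import subprocess
import sys


def env_without(name, base=None):
    """Return a copy of the environment (os.environ by default) without `name`."""
    env = dict(os.environ if base is None else base)
    env.pop(name, None)
    return env


def run_without_ucx_net_devices(cmd):
    # Only the child sees the modified environment; the parent is unchanged.
    return subprocess.run(
        cmd,
        env=env_without("UCX_NET_DEVICES"),
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    os.environ["UCX_NET_DEVICES"] = "mlx5_0:1"  # made-up value for the demo
    probe = [sys.executable, "-c", "import os; print('UCX_NET_DEVICES' in os.environ)"]
    print(run_without_ucx_net_devices(probe).stdout.strip())
```

Scoping the change to the child keeps the restriction in place for every other CI step that may rely on it.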
DOCA device headers (.cuh) may be installed to a path that nvcc does
not search by default, and meson may not find a pkg-config file for
doca-gpunetio to add the include path automatically. Copy all .cuh
files from the DOCA installation directory to ${CUDA_HOME}/include/
so nvcc can find doca_gpunetio_dev_verbs_qp.cuh, which is included
transitively via UCX device headers.
Made-with: Cursor
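The header staging described above lives in build.sh. As a rough Python equivalent of the same step (a sketch under assumptions: the function name is hypothetical, and the DOCA installation directory is left as a parameter because the commit does not state it):

```python
import shutil
from pathlib import Path


def stage_cuh_headers(doca_root: Path, cuda_include: Path) -> list[Path]:
    """Copy every .cuh file found under doca_root into cuda_include (flat)."""
    cuda_include.mkdir(parents=True, exist_ok=True)
    staged = []
    for header in sorted(doca_root.rglob("*.cuh")):
        destination = cuda_include / header.name
        shutil.copy2(header, destination)  # also preserves timestamps
        staged.append(destination)
    return staged
```

A call would look like `stage_cuh_headers(Path(doca_dir), Path(os.environ["CUDA_HOME"]) / "include")`, where `doca_dir` is whatever directory the DOCA packages install into. Because the copy is flat, two headers with the same file name in different subdirectories would overwrite each other.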
The UCX-master build activates UCX GPU Device API which triggers the full gdaki.cuh include chain requiring both .cuh and .h DOCA GPUNetIO headers. Add a wildcard search for all doca_gpunetio* files (any extension) from /usr/include and /opt. Bump CI_IMAGE_TAG to 20260421-3 to trigger Docker image rebuild.
…hout GPU API

- Export UCX_VERSION in Dockerfile.gpu-test for test_ep.sh.
- Fail EP elastic step when UCX_VERSION=master and BUILD_NIXL_EP=true but nixl_ep_cpp is missing; keep skipping on other UCX versions.
- Skip examples/device/ep in Meson when UCX GPU Device API is unavailable.
- Bump CI_IMAGE_TAG to 20260427-1 in build and test matrices.
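The fail-versus-skip rule in the second bullet amounts to a small decision table. The real logic is shell in test_ep.sh; the function below is a hypothetical Python restatement of it:

```python
def ep_step_action(ucx_version: str, build_nixl_ep: bool, ep_module_present: bool) -> str:
    """Decide what the EP elastic step should do: 'run', 'fail', or 'skip'."""
    if ep_module_present:
        return "run"
    if ucx_version == "master" and build_nixl_ep:
        # The module was supposed to be built, so its absence is a real error.
        return "fail"
    # Other UCX versions are not expected to provide the module.
    return "skip"
```

Failing only where the module is expected avoids a silent skip hiding a broken build.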
Convert "Run DL EP elastic tests" from a raw sudo+ssh shell command to the slurmCI module format used by all other test steps. This fixes two bugs introduced after the rebase onto c83d742:

1. Wrong job ID file path: the old step read from /mnt/pvc/dl_job_id_<ver>_<build>.txt but the Allocate step now writes to ${JOB_ID_FILE_ROOT}/job_id_<ver>_<build>.txt, causing --slurm_job_id to be empty.
2. SSH Permission denied: the raw sudo -u svc-nixl approach never loaded the Jenkins SSH credential (svc-nixl-ssh_key), so SSH to dlcluster.nvidia.com failed with permission denied. Using slurmCI with credentialsId injects the key automatically.

Bump CI_IMAGE_TAG to 20260428-1 in all three matrix YAML files to trigger a fresh base image build that incorporates the current build.sh (DOCA GPUNetIO headers block) and the PyTorch CUDA-version alignment from main (commit 1200fe5).

Made-with: Cursor
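The first bug is a case of an empty value flowing silently into a command line. As an illustration of failing early instead, here is a sketch that builds the new-style path and rejects an empty file. The function names are hypothetical; `ver` and `build` stand for the `<ver>` and `<build>` placeholders in the commit message.

```python
from pathlib import Path


def job_id_file(root: str, ver: str, build: str) -> Path:
    """Path the Allocate step is described as writing to."""
    return Path(root) / f"job_id_{ver}_{build}.txt"


def read_job_id(path: Path) -> str:
    """Read the job ID, raising instead of returning an empty string."""
    if not path.is_file():
        raise FileNotFoundError(f"job ID file not found: {path}")
    job_id = path.read_text().strip()
    if not job_id:
        raise ValueError(f"job ID file is empty: {path}")
    return job_id
```

With a check like this, a path mismatch surfaces as an explicit error rather than as an empty --slurm_job_id.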
basic.json and no_expansion.json are identical; remove the duplicate and use no_expansion.json consistently in all elastic test calls. Made-with: Cursor
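One way to confirm that two plan files are duplicates regardless of whitespace or key order is to compare the parsed JSON rather than the raw bytes. A minimal sketch (the helper name is hypothetical):

```python
import json
from pathlib import Path


def same_json(first: Path, second: Path) -> bool:
    """True when both files parse to equal JSON values."""
    return json.loads(first.read_text()) == json.loads(second.read_text())
```

For example, `same_json(Path("basic.json"), Path("no_expansion.json"))` would have returned `True` before the duplicate was removed, according to the commit message.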
elastic.py imports the nixl_ep package, not nixl_ep_cpp directly. Use "import nixl_ep" so the check matches the actual runtime import path. Made-with: Cursor
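A probe that mirrors the runtime import path can be as simple as attempting the import itself. The helper below is a hypothetical sketch; the actual check in the test script may be a one-line `python -c "import nixl_ep"`.

```python
import importlib


def can_import(module_name: str) -> bool:
    """True when importing the module succeeds, as the test itself would do."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print(can_import("nixl_ep"))
```

Probing the package rather than the extension module exercises the same import machinery that elastic.py depends on.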
Remove the EP-local -rdc=false workaround and instead use a target-local buildtype=custom override so Meson does not add CUDA -G. This avoids nvlink register-count failures while keeping global buildtype=debug and RDC enabled. Bump CI_IMAGE_TAG to rebuild CI images.
Update the CI matrix image tag to the latest EP test image. mypy resolves nixl_ep through the meta-dispatcher added for CUDA-specific EP wheels. That dispatcher exports backend attributes dynamically, so topk_idx_t exists at runtime but is invisible to static analysis unless it is declared under TYPE_CHECKING. Declare topk_idx_t as a torch.dtype for type checking and keep call sites using nixl_ep.topk_idx_t directly. This only changes the static typing surface and does not affect runtime behavior.
Remove EP-specific DOCA block in build.sh:

1. Delete duplicated DOCA install (exists in the base-image bootstrap).
2. Move DOCA header-copy into the base-image bootstrap, adjacent to the DOCA install, so PR builds inherit the staged headers for free.

Bump CI_IMAGE_TAG to 20260622-1 in all four matrices so the base image rebuilds with the new bootstrap.
* test_ep.sh: change elastic test arguments; drop UCX_RNDV_THRESH=inf
* examples/device/ep/meson.build: replace ['buildtype=custom', 'optimization=3'] with ['buildtype=release'] so the debug override matches the release build flag bundle; drop the ucx_gpu_device_api_available skip.
👋 Hi lishapira! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
Contributor
Author
/build
No description provided.