Nixl ep ci adding ep to dlcluster without ci changes - draft#1813

Draft
lishapira wants to merge 26 commits into ai-dynamo:main from lishapira:nixl_ep_ci_adding_ep_to_dlcluster_without_ci_changes

Conversation

@lishapira

Contributor

No description provided.

lishapira added 26 commits June 23, 2026 07:49
- Use 4 processes instead of 8 for the elastic EP CI test.
- Assert that rank failures occur only for ranks marked for kill
  in the plan, catching unexpected crashes during any phase.
- Verify the SIGTERM exit count matches the plan's expected kills,
  catching cases where the fault-tolerance kill mechanism fails.
- Add get_total_killed_ranks() helper to Plan class.

Made-with: Cursor
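
The two checks described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in elastic.py: the plan layout (`Phase` with explicit `ranks` and `kill` lists), the helper `get_killed_ranks()`, the function `validate_exit_codes()`, and the `(rank, exit_code)` input format are all assumptions made for the example. Only `get_total_killed_ranks()` is named in the commit message. The sketch also assumes a process terminated by SIGTERM reports a negative exit code, as Python's `multiprocessing` and `subprocess` do on POSIX.

```python
import signal
from dataclasses import dataclass, field


@dataclass
class Phase:
    """One phase of a hypothetical plan: active ranks plus ranks to kill."""
    ranks: list
    kill: list = field(default_factory=list)


@dataclass
class Plan:
    """Hypothetical plan container; the real Plan class may differ."""
    phases: list

    def get_killed_ranks(self):
        # Every rank that some phase marks for kill (hypothetical helper).
        return {rank for phase in self.phases for rank in phase.kill}

    def get_total_killed_ranks(self):
        # Total number of kills the plan expects across all phases.
        return sum(len(phase.kill) for phase in self.phases)


def validate_exit_codes(plan, exits):
    """Check (rank, exit_code) pairs against the plan.

    Raises AssertionError if a rank not marked for kill failed, or if the
    number of SIGTERM exits differs from the number of planned kills.
    """
    killed = plan.get_killed_ranks()
    unexpected = sorted(
        {rank for rank, code in exits if code != 0 and rank not in killed}
    )
    if unexpected:
        raise AssertionError(f"unexpected failures on ranks {unexpected}")

    # A child terminated by signal N reports exit code -N on POSIX.
    sigterm_exits = sum(1 for _, code in exits if code == -signal.SIGTERM)
    expected = plan.get_total_killed_ranks()
    if sigterm_exits != expected:
        raise AssertionError(
            f"expected {expected} SIGTERM exits, observed {sigterm_exits}"
        )
```

The first check catches crashes in ranks that were supposed to survive; the second catches the opposite problem, where a planned kill never actually happened.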
The last phase now uses [0, 3] instead of [0, 1], introducing a
rank-index gap (ranks 1 and 2 absent) to exercise sparse rank handling.

Made-with: Cursor
…flag

Address review comment: elastic.py now exposes an optional
--validate-plan flag that enables plan-specific assertions
(unexpected failure rejection and SIGTERM count verification).
test_python.sh passes the flag for its specific plan.

Made-with: Cursor
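
An opt-in flag of this kind is usually a `store_true` argument, so existing callers that do not pass it keep the old behavior. The sketch below assumes an `argparse`-based command line; the `--plan` argument and the `build_parser()` helper are hypothetical, and only the `--validate-plan` flag name comes from the commit message.

```python
import argparse


def build_parser():
    """Build a parser with an optional --validate-plan switch (sketch)."""
    parser = argparse.ArgumentParser(description="elastic EP test driver (sketch)")
    # Hypothetical argument: path to the JSON plan to execute.
    parser.add_argument("--plan", required=True, help="path to the plan file")
    # Off by default, so callers that omit the flag see no new assertions.
    parser.add_argument(
        "--validate-plan",
        action="store_true",
        help="enable plan-specific assertions on rank failures and SIGTERM counts",
    )
    return parser
```

argparse exposes the flag as `args.validate_plan`, which the driver can consult once all ranks have exited.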
The global meson.build sets -rdc=true for all CUDA targets, which causes
nvlink to enforce register limits across call boundaries at link time.
nixl_ep kernels use --register-usage-level=10 and call nixlPut() which
uses 215 registers, exceeding the nvlink limit and failing at link time.

Adding -rdc=false overrides the global setting for nixl_ep only, so
nixlPut gets inlined at compile time instead of being linked separately
by nvlink. This matches the standalone setup.py build behavior.

Made-with: Cursor
Install DOCA GPUNetIO dev packages when PRE_INSTALLED_ENV skips the full
apt bootstrap but BUILD_NIXL_EP=true requires them. UCX device headers
include doca_gpunetio_dev_verbs_qp.cuh which fails compilation without
the dev package installed.

Unset UCX_NET_DEVICES in elastic test subshell so UCX auto-selects a
GPU-capable transport. When set by the CI environment, UCX is restricted
to a device without GPU peer memory support, causing the RDMA path to
fail with "no lane found" errors.

Bump CI_IMAGE_TAG to 20260421-1 (build.sh changed).

Made-with: Cursor
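
The effect of unsetting a variable in a subshell is that only the child process sees the change, while the rest of the CI job keeps its environment. The commit does this in a shell script; the Python sketch below shows the same idea under that assumption. The helper names `env_without()` and `run_without_ucx_net_devices()` are hypothetical.

```python
import os
import subprocess


def env_without(names, base=None):
    """Return a copy of the environment with the given variables removed."""
    source = os.environ if base is None else base
    return {key: value for key, value in source.items() if key not in names}


def run_without_ucx_net_devices(cmd, **kwargs):
    """Run cmd in a child process that does not inherit UCX_NET_DEVICES.

    The parent's environment is left untouched, mirroring an `unset`
    performed inside a shell subshell.
    """
    child_env = env_without({"UCX_NET_DEVICES"})
    return subprocess.run(cmd, env=child_env, check=False, **kwargs)
```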
DOCA device headers (.cuh) may be installed to a path that nvcc does
not search by default, and meson may not find a pkg-config file for
doca-gpunetio to add the include path automatically. Copy all .cuh
files from the DOCA installation directory to ${CUDA_HOME}/include/
so nvcc can find doca_gpunetio_dev_verbs_qp.cuh, which is included
transitively via UCX device headers.

Made-with: Cursor
The UCX-master build activates UCX GPU Device API which triggers the full
gdaki.cuh include chain requiring both .cuh and .h DOCA GPUNetIO headers.
Add a wildcard search for all doca_gpunetio* files (any extension) from
/usr/include and /opt.
Bump CI_IMAGE_TAG to 20260421-3 to trigger Docker image rebuild.
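
The header staging described in this commit and the previous one amounts to a recursive wildcard search followed by a flat copy into a directory the compiler already searches. The real change lives in build.sh; the Python sketch below only illustrates the logic. The function name `stage_doca_headers()` is hypothetical, and flattening all matches into one destination directory is an assumption about how the copy is laid out.

```python
import shutil
from pathlib import Path


def stage_doca_headers(search_roots, dest, pattern="doca_gpunetio*"):
    """Copy every file matching pattern under search_roots into dest.

    Returns the list of destination paths that were written. Missing
    roots are skipped so the step is a no-op where DOCA is absent.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for root in map(Path, search_roots):
        if not root.is_dir():
            continue
        # sorted() materializes the matches before any copying starts.
        for src in sorted(root.rglob(pattern)):
            if not src.is_file():
                continue
            target = dest / src.name
            if target.exists() and src.resolve() == target.resolve():
                continue  # already in place; avoid copying a file onto itself
            shutil.copy2(src, target)
            copied.append(target)
    return copied
```

With the paths from the commit message, a call would look like `stage_doca_headers(["/usr/include", "/opt"], os.path.join(os.environ["CUDA_HOME"], "include"))`.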
…hout GPU API

- Export UCX_VERSION in Dockerfile.gpu-test for test_ep.sh.
- Fail EP elastic step when UCX_VERSION=master and BUILD_NIXL_EP=true but
  nixl_ep_cpp is missing; keep skipping on other UCX versions.
- Skip examples/device/ep in Meson when UCX GPU Device API is unavailable.
- Bump CI_IMAGE_TAG to 20260427-1 in build and test matrices.
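
The skip-versus-fail rule in the second bullet can be written as a small decision function. The real logic is in test_ep.sh; this Python sketch only restates the rule. The names `Action`, `decide_ep_step()`, and `module_is_importable()` are hypothetical, and treating "master without BUILD_NIXL_EP" as a skip is an assumption the commit message does not spell out.

```python
import importlib.util
from enum import Enum


class Action(Enum):
    RUN = "run"
    SKIP = "skip"
    FAIL = "fail"


def module_is_importable(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def decide_ep_step(ucx_version, build_nixl_ep, module_available):
    """Decide whether the EP elastic step should run, skip, or fail."""
    if module_available:
        return Action.RUN
    # The module was supposed to be built in this configuration, so its
    # absence is a build problem rather than a reason to skip quietly.
    if ucx_version == "master" and build_nixl_ep:
        return Action.FAIL
    return Action.SKIP
```

Failing instead of skipping in the one configuration where the module is expected keeps a broken EP build from passing CI unnoticed.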
Convert "Run DL EP elastic tests" from a raw sudo+ssh shell command to
the slurmCI module format used by all other test steps. This fixes two
bugs introduced after the rebase onto c83d742:

1. Wrong job ID file path: the old step read from
   /mnt/pvc/dl_job_id_<ver>_<build>.txt but the Allocate step now
   writes to ${JOB_ID_FILE_ROOT}/job_id_<ver>_<build>.txt, causing
   --slurm_job_id to be empty.

2. SSH Permission denied: the raw sudo -u svc-nixl approach never
   loaded the Jenkins SSH credential (svc-nixl-ssh_key), so SSH to
   dlcluster.nvidia.com failed with permission denied. Using slurmCI
   with credentialsId injects the key automatically.

Bump CI_IMAGE_TAG to 20260428-1 in all three matrix YAML files to
trigger a fresh base image build that incorporates the current build.sh
(DOCA GPUNetIO headers block) and the PyTorch CUDA-version alignment
from main (commit 1200fe5).

Made-with: Cursor
basic.json and no_expansion.json are identical; remove the duplicate
and use no_expansion.json consistently in all elastic test calls.

Made-with: Cursor
elastic.py imports the nixl_ep package, not nixl_ep_cpp directly.
Use "import nixl_ep" so the check matches the actual runtime import path.

Made-with: Cursor
Remove the EP-local -rdc=false workaround and instead use a target-local
buildtype=custom override so Meson does not add CUDA -G. This avoids
nvlink register-count failures while keeping global buildtype=debug and
RDC enabled. Bump CI_IMAGE_TAG to rebuild CI images.
Update the CI matrix image tag to the latest EP test image.

mypy resolves nixl_ep through the meta-dispatcher added for CUDA-specific EP wheels. That dispatcher exports backend attributes dynamically, so topk_idx_t exists at runtime but is invisible to static analysis unless it is declared under TYPE_CHECKING.

Declare topk_idx_t as a torch.dtype for type checking and keep call sites using nixl_ep.topk_idx_t directly. This only changes the static typing surface and does not affect runtime behavior.
Remove EP-specific DOCA block in build.sh:
1. Delete duplicated DOCA install (exists in the base-image bootstrap).
2. Move DOCA header-copy into the base-image bootstrap, adjacent to the
   DOCA install, so PR builds inherit the staged headers for free.

Bump CI_IMAGE_TAG to 20260622-1 in all four matrices so the base image
rebuilds with the new bootstrap.
* test_ep.sh: change elastic test arguments; drop UCX_RNDV_THRESH=inf
* examples/device/ep/meson.build:
  - Replace ['buildtype=custom', 'optimization=3'] with
    ['buildtype=release'] so the debug override matches the release
    build flag bundle.
  - Drop the ucx_gpu_device_api_available skip.
@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions


👋 Hi lishapira! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@lishapira

Contributor Author

/build
