Nixl ep ci adding ep to dlcluster without ci changes - draft by lishapira · Pull Request #1813 · ai-dynamo/nixl

lishapira · 2026-06-23T14:54:29Z

No description provided.

- Use 4 processes instead of 8 for the elastic EP CI test.
- Assert that rank failures only occur for ranks marked for kill in the plan, catching unexpected crashes during any phase.
- Verify the SIGTERM exit count matches the plan's expected kills, catching cases where the fault-tolerance kill mechanism fails.
- Add get_total_killed_ranks() helper to Plan class.

Made-with: Cursor
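The two checks described in this commit can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in elastic.py: the `Phase` layout, the `validate_phase` and `validate_plan` functions, and the shape of the exit-code dictionaries are hypothetical. Only the `Plan` class and the `get_total_killed_ranks()` name come from the commit message. The sketch relies on the standard Python convention that a child process terminated by SIGTERM reports an exit code of `-signal.SIGTERM`.

```python
import signal
from dataclasses import dataclass, field


@dataclass
class Phase:
    """One plan phase (hypothetical layout): active ranks and ranks marked for kill."""
    ranks: list[int]
    kill: list[int] = field(default_factory=list)


@dataclass
class Plan:
    phases: list[Phase]

    def get_total_killed_ranks(self) -> int:
        # Number of kills the plan schedules across all phases.
        return sum(len(phase.kill) for phase in self.phases)


def validate_phase(phase: Phase, exit_codes: dict[int, int]) -> int:
    """Reject failures of ranks not marked for kill; return the SIGTERM exit count."""
    failed = {rank for rank, code in exit_codes.items() if code != 0}
    unexpected = failed - set(phase.kill)
    assert not unexpected, f"unexpected rank failures: {sorted(unexpected)}"
    return sum(1 for code in exit_codes.values() if code == -signal.SIGTERM)


def validate_plan(plan: Plan, per_phase_exit_codes: list[dict[int, int]]) -> None:
    """Check every phase, then compare the SIGTERM exits with the planned kills."""
    sigterm_exits = sum(
        validate_phase(phase, codes)
        for phase, codes in zip(plan.phases, per_phase_exit_codes)
    )
    expected = plan.get_total_killed_ranks()
    assert sigterm_exits == expected, (
        f"expected {expected} SIGTERM exits, observed {sigterm_exits}"
    )
```

Checking each phase against its own kill list means a crash of an unmarked rank is reported in the phase where it happened, while the final count catches the opposite problem: a planned kill that never took effect.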

Last phase now uses [0, 3] instead of [0, 1], introducing a rank-index gap (ranks 1 and 2 absent) to exercise sparse rank handling.

Made-with: Cursor

Made-with: Cursor

…flag

Address review comment: elastic.py now exposes an optional --validate-plan flag that enables plan-specific assertions (unexpected failure rejection and SIGTERM count verification). test_python.sh passes the flag for its specific plan.

Made-with: Cursor
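An opt-in flag of this kind is typically wired up with argparse. The sketch below is an assumption about the shape, not the actual elastic.py parser: only `--validate-plan` is named in the commit message, while the `--plan` argument and the `build_parser` function are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="elastic EP test (sketch)")
    # Hypothetical argument: path to the plan file that drives the test.
    parser.add_argument("--plan", required=True, help="path to the plan JSON file")
    # Off by default, so callers using other plans keep the old behavior.
    parser.add_argument(
        "--validate-plan",
        action="store_true",
        help="enable plan-specific assertions (unexpected failures, SIGTERM count)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.validate_plan:
        print(f"plan-specific assertions enabled for {args.plan}")
```

Because the flag uses `store_true`, omitting it leaves `args.validate_plan` as `False`, which keeps the stricter assertions limited to the invocation that knows which plan it runs.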

Made-with: Cursor

The global meson.build sets -rdc=true for all CUDA targets, which causes nvlink to enforce register limits across call boundaries at link time. nixl_ep kernels use --register-usage-level=10 and call nixlPut(), which uses 215 registers, exceeding the nvlink limit and failing at link time.

Adding -rdc=false overrides the global setting for nixl_ep only, so nixlPut gets inlined at compile time instead of being linked separately by nvlink. This matches the standalone setup.py build behavior.

Made-with: Cursor

Install DOCA GPUNetIO dev packages when PRE_INSTALLED_ENV skips the full apt bootstrap but BUILD_NIXL_EP=true requires them. UCX device headers include doca_gpunetio_dev_verbs_qp.cuh, so compilation fails without the dev package installed.

Unset UCX_NET_DEVICES in the elastic test subshell so UCX auto-selects a GPU-capable transport. When set by the CI environment, UCX is restricted to a device without GPU peer memory support, causing the RDMA path to fail with "no lane found" errors.

Bump CI_IMAGE_TAG to 20260421-1 (build.sh changed).

Made-with: Cursor

DOCA device headers (.cuh) may be installed to a path that nvcc does not search by default, and meson may not find a pkg-config file for doca-gpunetio to add the include path automatically. Copy all .cuh files from the DOCA installation directory to ${CUDA_HOME}/include/ so nvcc can find doca_gpunetio_dev_verbs_qp.cuh, which is included transitively via UCX device headers. Made-with: Cursor

The UCX-master build activates UCX GPU Device API which triggers the full gdaki.cuh include chain requiring both .cuh and .h DOCA GPUNetIO headers. Add a wildcard search for all doca_gpunetio* files (any extension) from /usr/include and /opt. Bump CI_IMAGE_TAG to 20260421-3 to trigger Docker image rebuild.

…hout GPU API

- Export UCX_VERSION in Dockerfile.gpu-test for test_ep.sh.
- Fail EP elastic step when UCX_VERSION=master and BUILD_NIXL_EP=true but nixl_ep_cpp is missing; keep skipping on other UCX versions.
- Skip examples/device/ep in Meson when UCX GPU Device API is unavailable.
- Bump CI_IMAGE_TAG to 20260427-1 in build and test matrices.

Convert "Run DL EP elastic tests" from a raw sudo+ssh shell command to the slurmCI module format used by all other test steps. This fixes two bugs introduced after the rebase onto c83d742:

1. Wrong job ID file path: the old step read from /mnt/pvc/dl_job_id_<ver>_<build>.txt but the Allocate step now writes to ${JOB_ID_FILE_ROOT}/job_id_<ver>_<build>.txt, causing --slurm_job_id to be empty.
2. SSH Permission denied: the raw sudo -u svc-nixl approach never loaded the Jenkins SSH credential (svc-nixl-ssh_key), so SSH to dlcluster.nvidia.com failed with permission denied. Using slurmCI with credentialsId injects the key automatically.

Bump CI_IMAGE_TAG to 20260428-1 in all three matrix YAML files to trigger a fresh base image build that incorporates the current build.sh (DOCA GPUNetIO headers block) and the PyTorch CUDA-version alignment from main (commit 1200fe5).

Made-with: Cursor

basic.json and no_expansion.json are identical; remove the duplicate and use no_expansion.json consistently in all elastic test calls. Made-with: Cursor

elastic.py imports the nixl_ep package, not nixl_ep_cpp directly. Use "import nixl_ep" so the check matches the actual runtime import path. Made-with: Cursor
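The idea of probing the same module the test will actually import can be sketched in Python as below. This is a generic illustration, not the check in test_ep.sh (which is a shell script); the `can_import` helper is hypothetical.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if importing module_name succeeds in this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers a missing package as well as a package whose own imports fail.
        return False
    return True


if __name__ == "__main__":
    # Probe the package that elastic.py imports, not its extension module.
    print("nixl_ep available:", can_import("nixl_ep"))
```

Probing the top-level package exercises the same import path as the runtime, so the check and the test agree on whether the package is usable.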

…v tuning

Remove the EP-local -rdc=false workaround and instead use a target-local buildtype=custom override so Meson does not add CUDA -G. This avoids nvlink register-count failures while keeping global buildtype=debug and RDC enabled. Bump CI_IMAGE_TAG to rebuild CI images.

Update the CI matrix image tag to the latest EP test image. mypy resolves nixl_ep through the meta-dispatcher added for CUDA-specific EP wheels. That dispatcher exports backend attributes dynamically, so topk_idx_t exists at runtime but is invisible to static analysis unless it is declared under TYPE_CHECKING. Declare topk_idx_t as a torch.dtype for type checking and keep call sites using nixl_ep.topk_idx_t directly. This only changes the static typing surface and does not affect runtime behavior.

Remove EP-specific DOCA block in build.sh:

1. Delete duplicated DOCA install (exists in the base-image bootstrap).
2. Move DOCA header-copy into the base-image bootstrap, adjacent to the DOCA install, so PR builds inherit the staged headers for free.

Bump CI_IMAGE_TAG to 20260622-1 in all four matrices so the base image rebuilds with the new bootstrap.

* test_ep.sh: change elastic test arguments; drop UCX_RNDV_THRESH=inf
* examples/device/ep/meson.build: replace ['buildtype=custom', 'optimization=3'] with ['buildtype=release'] so the debug override matches the release build flag bundle; drop the ucx_gpu_device_api_available skip.

copy-pr-bot · 2026-06-23T14:54:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-23T14:54:39Z

👋 Hi lishapira! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

lishapira · 2026-06-23T14:55:26Z

/build

lishapira added 26 commits June 23, 2026 07:49

nixl_ep: Add elastic test to CI pipeline

9344a42

nixl_ep: update plan for CI test

fd27923

Last phase now uses [0, 3] instead of [0, 1], introducing a rank-index gap (ranks 1 and 2 absent) to exercise sparse rank handling.

Made-with: Cursor

nixl_ep: scope PYTHONPATH to elastic CI test invocation

ae54f31

Made-with: Cursor

nixl_ep ci: add basic plan and NVLink/RDMA test variants for each plan

5257a18

Made-with: Cursor

nixl_ep_ci: run elastic EP tests from DL cluster job

dac7a48

nixl_ep ci: replace basic.json with no_expansion.json in test_ep.sh

df50cdc

basic.json and no_expansion.json are identical; remove the duplicate and use no_expansion.json consistently in all elastic test calls. Made-with: Cursor

nixl_ep ci: switch test_ep.sh to /bin/bash

dc75428

nixl_ep ci: fix nixl_ep availability check in test_ep.sh

51a12f3

elastic.py imports the nixl_ep package, not nixl_ep_cpp directly. Use "import nixl_ep" so the check matches the actual runtime import path. Made-with: Cursor

nixl_ep ci: NVLink-only EP elastic tests, system info logging, UCX en…

ff41f8b

…v tuning

nixl_ep ci: remove redundant ENV vars, update CI_IMAGE_TAG

ba0ca6d

nixl_ep ci: shorten test_ep.sh comments

30729d8

nixl_ep ci: Restore blank line and update CI_IMAGE_TAG

76e4973

nixl_ep ci: make elastic failure validation phase-local

ac57432

nixl_ep CI: remove unused start_etcd_server call

2e55368

pull-request-size Bot added the size/L label Jun 23, 2026

github-actions Bot added the external-contribution label Jun 23, 2026

nv-nmailhot force-pushed the main branch from b775042 to b11012e Compare June 24, 2026 06:28

Nixl ep ci adding ep to dlcluster without ci changes - draft #1813
lishapira wants to merge 26 commits into ai-dynamo:main from lishapira:nixl_ep_ci_adding_ep_to_dlcluster_without_ci_changes
