Releases: NVIDIA/cloudai
v1.7.0-3
What's Changed
- [CI] use uv by @podkidyshev in #925
- [CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter by @rutayan-nv in #933
- Bump pydantic-settings from 2.12.0 to 2.14.2 by @dependabot[bot] in #940
- Bump tornado from 6.5.5 to 6.5.7 by @dependabot[bot] in #935
- Bump starlette from 1.0.1 to 1.3.1 by @dependabot[bot] in #936
Full Changelog: v1.7.0-2...v1.7.0-3
v1.7.0-2
What's Changed
- NIXL EP: fix rank removal by @podkidyshev in #913
- Add UCC/NCCL alltoallv, deepEP v1/v2 and moe benchmark by @ybenvidia in #891
- add DSE support for AI Dynamo + LMCache aiperf workload by @saivishal1999 in #914
- [NIXL] Fix NIXL UCX worker node placement by @podkidyshev in #922
- Handle OSError from git subprocess calls in GitRepo installable by @shreyaskommuri in #916
- Fix standalone sleep dry-run command generation by @shreyaskommuri in #921
- Bump starlette from 0.52.1 to 1.0.1 by @dependabot[bot] in #912
- [vLLM/SGLang] multi-node by @podkidyshev in #918
New Contributors
- @shreyaskommuri made their first contribution in #916
Full Changelog: v1.7.0-1...v1.7.0-2
v1.7.0-1
What's Changed
- Installables: allow custom ones by @podkidyshev in #885
- NIXL EP: add single sbatch support by @podkidyshev in #889
- Append trajectory row on cache hits by @rutayan-nv in #888
- Ipod/custom srun bash by @podkidyshev in #896
- [Configurator] Make select_action observation-aware by @rutayan-nv in #892
- feat(dynamo_mocker): add GPU-free LLM inference simulation workload by @saivishal1999 in #895
- Bump idna from 3.11 to 3.15 by @dependabot[bot] in #897
- Bump python-dotenv from 1.2.1 to 1.2.2 by @dependabot[bot] in #878
- Bump urllib3 from 2.6.3 to 2.7.0 by @dependabot[bot] in #887
- vLLM/SGLANG: add semantic degradation support by @podkidyshev in #890
- feat(ai_dynamo): add aiperf workload support by @saivishal1999 in #898
- AIDynamo: add semantic degradation evaluation support by @podkidyshev in #903
- AIDynamo: enable LMCache by @podkidyshev in #906
- AIDynamo: enable multiple AIPerf runs during a single test run by @podkidyshev in #907
- AIDynamo: Optional restart of DynamoRouter between AIPerf re-runs by @podkidyshev in #908
- AIDynamo: shared node disagg inference by @podkidyshev in #909
- vLLM/SGLang: comparison report by @podkidyshev in #904
- NIXL EP: comparison report by @podkidyshev in #911
New Contributors
- @saivishal1999 made their first contribution in #895
Full Changelog: v1.6.1...v1.7.0-1
v1.6.1
New Changes
- Added support for the following workloads:
  - vLLM - LLM serving benchmark support with Slurm execution, disaggregated prefill/decode mode, multi-node serving, reporting, DSE metrics, and NIXL-related options
  - SGLang - LLM serving benchmark support sharing the common vLLM/SGLang serving flow, reporting, health checks, and multi-node execution
  - NIXL EP - NIXL Expert Parallelism workload with Slurm command generation, log parsing, reporting, and tests
- Added DSE reporting, including richer visualization of design-space exploration results and best-configuration selection
- Added report generation for MegatronRun and OSU benchmarks
- Added support for CNI specification configuration for NCCL and AI Dynamo workloads on Kubernetes
Backward Compatibility Notes
- AI Dynamo configuration schema
  - Worker settings now use explicit prefill_worker and decode_worker blocks with nested args.
  - Older fields such as prefill-cmd, decode-cmd, top-level worker parallelism keys, run_script, and huggingface_home_container_path should be migrated to the new schema.
- Megatron-Bridge configuration schema
  - model_family_name and model_recipe_name replace the earlier model_name and model_size fields.
  - time_limit is now taken from the test run rather than cmd_args.
  - A Megatron-Bridge git repo only overrides the container copy when mount_as = "/opt/Megatron-Bridge" is set.
- Custom workload implementations
  - Custom workloads that override constraint_check(self, tr) should update the method signature to accept the new system argument.
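The signature change for custom workloads can be sketched as follows. This is a minimal illustration, not CloudAI's actual API: FakeSystem, FakeTestRun, MyWorkload, their fields, and the position and type of the system parameter are all assumptions, so check the base class in your installed version before copying the signature.

```python
from dataclasses import dataclass


@dataclass
class FakeSystem:
    """Stand-in for a system object (hypothetical, for illustration only)."""

    gpus_per_node: int


@dataclass
class FakeTestRun:
    """Stand-in for a test run object (hypothetical, for illustration only)."""

    num_nodes: int
    tensor_parallel: int


class MyWorkload:
    # Before: def constraint_check(self, tr) -> bool
    # After: the override also accepts the system. The parameter order and
    # types used here are assumptions, not taken from the CloudAI source.
    def constraint_check(self, tr: FakeTestRun, system: FakeSystem) -> bool:
        # Example constraint: total GPU count must be divisible by the TP size.
        total_gpus = tr.num_nodes * system.gpus_per_node
        return total_gpus % tr.tensor_parallel == 0
```

With the system passed in, a constraint can use values such as the GPU count per node directly instead of duplicating them in the workload arguments.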
LLM Serving Improvements
CloudAI now includes first-class support for vLLM and SGLang serving workloads. The implementation includes shared serving infrastructure, Slurm command generation, result reporting, disaggregated prefill/decode support, two-node serving flows, custom health check endpoints, and more robust startup, shutdown, and cleanup handling. vLLM also supports DSE metrics, NIXL thread options, boolean flag handling, and constraint checks.
Megatron and Megatron-Bridge Improvements
Megatron-Bridge support was updated for r0.3.0 recipes and improved configuration handling. GPU counts can be derived from the system configuration, time limits are managed by the test run, VP parameters are handled more reliably, and status checks reduce false passes. MegatronRun now has report generation support and improved success detection, including timeout handling.
NIXL, Kubernetes, and Networking
NIXL workloads gained a new EP workload, updated CLI argument handling, support for separate ETCD containers, improved ETCD failure handling, safer mount cleanup, and installable fixes around nested Docker image paths and submodules. Kubernetes support was improved with CNI spec handling for NCCL and AI Dynamo, while NCCL Kubernetes tests were refactored for better reuse and temporary-resource management.
Reporting, Configuration, and Parsing
Reporting now includes DSE reports, OSU benchmark reports, MegatronRun reports, and reward override support for constraint failures. Configuration handling is more robust with improved duplicate-key errors, system config detection, path expansion/storage, first-sweep messaging, and agent configuration/caching updates.
Architecture, Reliability, and Tooling
Job monitoring no longer relies on asyncio, heavy imports are blocked at module level, and command shell checks no longer run during object creation. Slurm handling was improved around node exclusion, reservation nodes, GPU resource requesting, and propagation of extra Slurm arguments. Tooling was refreshed with pre-commit, updated CI workflows, uv usage in CI, Node 24-compatible GitHub Actions, broader tests organized by system/workload, and dependency updates.
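For code that has to follow the rule against heavy module-level imports, one common pattern is to defer the import to the function that needs it. The sketch below is an assumption about how such a rule is typically satisfied, not a description of how CloudAI enforces it; render_report is a hypothetical function and statistics is only a standard-library stand-in for a heavy dependency.

```python
def render_report(values: list[float]) -> str:
    """Format the mean of the given values."""
    # Deferred import: the module is loaded on the first call rather than
    # when this file is imported. `statistics` stands in for a heavy package.
    import statistics

    return f"mean={statistics.mean(values):.2f}"
```

Importing the module that defines render_report stays cheap; the cost of the dependency is paid only when a report is actually rendered.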
Documentation
Documentation was expanded for vLLM, SGLang, NIXL EP, Systems, workload requirements, reporting, troubleshooting, and tutorial/user guide content. Workload pages and release configurations were updated to match the new workloads and configuration flows.
All Changed
- Bump to v1.6 + upgrade dependencies by @amaslenn in #798
- Upgrade GitHub Actions to latest versions by @salmanmkc in #751
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #750
- Ban "heavy" imports on module level by @amaslenn in #801
- Remove asyncio usage in jobs monitoring by @amaslenn in #796
- Bump pillow from 12.1.0 to 12.1.1 by @dependabot[bot] in #802
- Add report generation strategy for the MegatronRun by @juntaowww in #787
- Fix accidentally reverted version bump by @amaslenn in #805
- Add support for running vLLM by @amaslenn in #799
- Unit-tests per system/workload by @podkidyshev in #808
- Fix nsys subfield merging behavior by @juntaowww in #795
- Add support for setting NIXL num threads for vLLM CLI by @amaslenn in #809
- Fix base_tr fixture dependency by @podkidyshev in #810
- Fixes CLOUDAI-15: Updated copyright check by @podkidyshev in #811
- Add report generation for OSU Benchmark by @allkoow in #807
- Single sbatch + NIXL + ETCD issues by @podkidyshev in #812
- Support separate ETCD container for NIXL workloads by @amaslenn in #813
- Yet another attempt on the right copyright by @podkidyshev in #815
- Refactor NCCL k8s test cases to improve re-use and temp resources management by @amaslenn in #817
- Support DSE metrics for vLLM by @amaslenn in #816
- Agent configs by @podkidyshev in #818
- AI Dynamo updates by @karya0 in #814
- Avoid silent failure when commit hash is invalid by @juntaowww in #820
- Warning on using first sweep by @podkidyshev in #822
- Update CLI args format for NIXL bench by @amaslenn in #823
- Fix commit verification: commit/branch/tag support by @podkidyshev in #824
- Megatron-Bridge updates by @podkidyshev in #821
- pre-commit by @podkidyshev in #827
- Add documentation for Systems by @amaslenn in #826
- Bump werkzeug from 3.1.5 to 3.1.6 by @dependabot[bot] in #828
- Address doc issues by @amaslenn in #831
- Use uv in ci by @podkidyshev in #835
- Bump tornado from 6.5.4 to 6.5.5 by @dependabot[bot] in #833
- Add SGLang workload by @amaslenn in #834
- Merge common part of vLLM and SGLang by @amaslenn in #836
- NIXL update: filepath and device_list by @podkidyshev in #829
- Agents caching by @podkidyshev in #837
- Add support for x2 nodes serving for vLLM and SGLang by @amaslenn in #839
- Megatron-Bridge r0.3.0 enhancement by @juntaowww in #830
- Avoid real system calls by @amaslenn in #842
- Do not run CommandShell check during object creation by @amaslenn in #843
- Cleanup NIXL file mounts by @podkidyshev in #840
- Formatting changes by @RulaHallak in #838
- Add NIXL EP workload by @amaslenn in #845
- DSE reporting by @podkidyshev in #846
- Support CNI spec for NCCL over k8s by @amaslenn in #848
- Bump requests from 2.32.5 to 2.33.0 by @dependabot[bot] in #852
- MBridge: time limit managed by test run by @podkidyshev in #849
- CNI spec support for Dynamo @ k8s by @amaslenn in #854
- MBridge: using gpus-per-node from system by @podkidyshev in #847
- Update CODEOWNERS by @amaslenn in #856
- VLLM: boolean flags and constraints by @podkidyshev in #857
- Allow profiling ranks in string format with comma as separator by @juntaowww in #855
- MBridge: fix vp parameter handling by @podkidyshev in #858
- Bump pygments from 2.19.2 to 2.20.0 by @dependabot[bot] in #853
- MBridge: revert metrics parsing by @podkidyshev in #862
- Installables: nested docker image path by @podkidyshev in #861
- Megatron Run: status check by @podkidyshev in #859
- Fix path expansion/...
v1.6.1-3
What's Changed
- Fix broken duplicate test name detection in TestParser.parse_all() by @rutayan-nv in #875
- Parsing: enhance error handling by @podkidyshev in #876
- fix various vLLM/SGLang bugs by @podkidyshev in #877
- vLLM, SGLang: fix long server start by @podkidyshev in #879
- vLLM, SGLang: cleanup fix for single-sbatch by @podkidyshev in #880
- Parsing: fix system config detection by @podkidyshev in #881
- vLLM, SGLang: custom healthcheck endpoint by @podkidyshev in #882
- fix secret scan false positive by @podkidyshev in #883
Full Changelog: v1.6.1-2...v1.6.1-3
v1.6.1-2
What's Changed
- Constraint failure reward override by @alexmanle in #865
- Amanley/reward overrides by @alexmanle in #869
- MegatronRun: fix .load test + allow timeouts by @podkidyshev in #866
- Bump pytest from 9.0.2 to 9.0.3 by @dependabot[bot] in #871
- Bump pillow from 12.1.1 to 12.2.0 by @dependabot[bot] in #870
- Bump uv from 0.10.0 to 0.11.6 by @dependabot[bot] in #867
- Installables: submodules fix by @podkidyshev in #872
Full Changelog: v1.6.1-1...v1.6.1-2
v1.6.1-1
What's Changed
- Bump pygments from 2.19.2 to 2.20.0 by @dependabot[bot] in #853
- MBridge: revert metrics parsing by @podkidyshev in #862
- Installables: nested docker image path by @podkidyshev in #861
- Megatron Run: status check by @podkidyshev in #859
- Fix path expansion/storage by @amaslenn in #864
Full Changelog: v1.6.0b7...v1.6.1-1
v1.6 TP7
What's Changed
- Support CNI spec for NCCL over k8s by @amaslenn in #848
- Bump requests from 2.32.5 to 2.33.0 by @dependabot[bot] in #852
- MBridge: time limit managed by test run by @podkidyshev in #849
- CNI spec support for Dynamo @ k8s by @amaslenn in #854
- MBridge: using gpus-per-node from system by @podkidyshev in #847
- Update CODEOWNERS by @amaslenn in #856
- VLLM: boolean flags and constraints by @podkidyshev in #857
- Allow profiling ranks in string format with comma as separator by @juntaowww in #855
- MBridge: fix vp parameter handling by @podkidyshev in #858
Important changes
Megatron-Bridge
- time_limit was removed from MegatronBridge cmd_args. It is now set at the scenario level, as for other workloads.
- gpus_per_node was removed from MegatronBridge cmd_args. The value is now taken from the system config, as in other workloads.
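A quick way to spot configurations affected by these removals is to check an already-parsed test definition for the two keys. This is a rough sketch under stated assumptions: the helper name find_removed_cmd_args and the sample dictionary are hypothetical, and only the key names time_limit, gpus_per_node, and cmd_args come from the note above.

```python
# Keys named in the note above as removed from Megatron-Bridge cmd_args.
REMOVED_CMD_ARGS = ("time_limit", "gpus_per_node")


def find_removed_cmd_args(test_config: dict) -> list[str]:
    """Return the removed keys that are still present under cmd_args."""
    cmd_args = test_config.get("cmd_args", {})
    return [key for key in REMOVED_CMD_ARGS if key in cmd_args]


# Hypothetical parsed test definition; the other field names are illustrative.
old_config = {
    "name": "example",
    "cmd_args": {"time_limit": "00:30:00", "gpus_per_node": 8},
}
print(find_removed_cmd_args(old_config))  # ['time_limit', 'gpus_per_node']
```

Any key reported here should be dropped from cmd_args, with the time limit moved to the scenario and the GPU count left to the system config.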
Full Changelog: v1.6.beta6...v1.6.0b7
v1.6.beta6
What's Changed
- Megatron-Bridge r0.3.0 enhancement by @juntaowww in #830
- Avoid real system calls by @amaslenn in #842
- Do not run CommandShell check during object creation by @amaslenn in #843
- Cleanup NIXL file mounts by @podkidyshev in #840
- Formatting changes by @RulaHallak in #838
- Add NIXL EP workload by @amaslenn in #845
- DSE reporting by @podkidyshev in #846
Full Changelog: v1.6.beta5...v1.6.beta6
v1.6.beta5
What's Changed
- Use uv in ci by @podkidyshev in #835
- Bump tornado from 6.5.4 to 6.5.5 by @dependabot[bot] in #833
- Add SGLang workload by @amaslenn in #834
- Merge common part of vLLM and SGLang by @amaslenn in #836
- NIXL update: filepath and device_list by @podkidyshev in #829
- Agents caching by @podkidyshev in #837
- Add support for x2 nodes serving for vLLM and SGLang by @amaslenn in #839
Full Changelog: v1.6.beta4...v1.6.beta5