Releases · nv-legate/cupynumeric

24 Jun 19:48

v26.06.00

ae639f3

v26.06.00 Latest

This is a beta release of cuPyNumeric.

Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/ and https://pypi.org/project/nvidia-cupynumeric-cu12/, for Linux (x86-64 and ARM64, with CUDA 12/13 and single-node multi-GPU support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA 12/13 and multi-node support). Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/26.06/.

Support matrix changes

Add CUDA 13 support to pip wheels
Add support for Python 3.14 to all packages
Add macOS pip wheel for the Legate profiler
Remove networking support from pip wheels. Due to limitations of the wheel distribution channel, the networking functionality of the wheels was not robust; use conda packages or build from source instead.

Highlights

New features

cupynumeric.linalg.cho_factor
cupynumeric.linalg.cho_solve
cupynumeric.linalg.inv
cupynumeric.linalg.solve_triangular
cupynumeric.mgrid
cupynumeric.ndimage.convolve
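
As a rough illustration of two of the additions above, the sketch below builds coordinate grids with mgrid and inverts a small matrix with linalg.inv. It assumes the cuPyNumeric versions follow the NumPy call signatures, and it falls back to NumPy when cuPyNumeric is not installed; the matrix values are made up for the example.

```python
try:
    import cupynumeric as np  # assumption: cuPyNumeric 26.06 or newer is installed
except ImportError:
    import numpy as np  # fallback so the sketch also runs without cuPyNumeric

# mgrid turns slice notation into dense coordinate grids.
grid = np.mgrid[0:3, 0:3]
rows, cols = grid[0], grid[1]

# Build a small invertible matrix from the grids (illustrative values only).
a = np.minimum(rows, cols) + 1.0 + 2.0 * np.eye(3)

# Invert it and check how far A @ inv(A) is from the identity.
a_inv = np.linalg.inv(a)
residual = float(np.abs(a @ a_inv - np.eye(3)).max())
print("max |A @ inv(A) - I| =", residual)
```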

Performance improvements

Fix parallelization of cupynumeric.pad
Improved Thrust- and NCCL-based implementations for advanced indexing operations
Merge index array zipping with subsequent scatter/gather kernel in advanced indexing operations (single-GPU only)
Add better heuristics for picking between cuSolver getrf/getrs and batched cuBLAS APIs
Avoid some extraneous type conversions in reduction operations
Reduce some Python overheads in ufunc operations

Examples

Add a microbenchmark suite (ufunc, gemm, gemv, sort, reduction, indexing, FFT, etc.)
Add examples showcasing different options for interoperating with PyTorch
Add CFD example based on https://github.com/barbagroup/CFDPython

Tooling

Add an Nsight Systems recipe that measures the degree of task-level parallelism
Add a tutorial on profiling & debugging
Add more anti-patterns to "cuPyNumeric Doctor" detector tool

UX improvements

Remove fallback to NumPy for missing APIs and small arrays
Add an option to disable bounds checking in various operations, which removes a source of blocking
Initial support for using cuPyNumeric ndarrays inside a Python Legate task (creation of task-local ndarrays is not supported yet)
Initial implementation of the Python Array API

Known issues

As of October 2025, Perlmutter jobs that request more than 32 GB of device memory (for example, --fbmem 64000) must include REALM_DEFAULT_ARGS='-gex:bindcuda 0'. Otherwise the OFI provider aborts with Unexpected error 12 (Cannot allocate memory) from fi_mr_regattr().
We are aware of performance regressions with cupynumeric.einsum on Blackwell GPUs, starting with cuBLAS 13.2. These are under investigation.

With recent versions of UCX you might see warning messages like these:

ib_md.c:296  UCX  ERROR ibv_reg_mr(address=(nil), length=134217728, access=0xf) failed: Bad address
ucp_mm.c:81   UCX  ERROR failed to register address (nil) (cuda) length 134217728 on md[6]=mlx5_0: Input/output error (md supports: host|cuda)
0.000000 {5}{ucp}: ucp_mem_map failed

These can be ignored (they are not fatal, and don't appear to have an effect on performance). We are investigating how they can be addressed / removed.

We are aware of possible hangs when calling APIs that use cuSolverMp (e.g. multi-GPU cupynumeric.linalg.solve). We are in contact with the cuSolverMp maintainers to address these. In the meantime, depending on the underlying cause, one or more of the following workarounds should resolve the hangs:
- export NCCL_PXN_DISABLE=1
- export CUDA_MODULE_LOADING=EAGER
- run in rank-per-gpu mode
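
For launchers written in Python, the two environment-variable workarounds above could be applied to a child process roughly as follows. This is only a sketch: the helper name with_hang_workarounds is hypothetical, the child command is a stand-in for a real cuPyNumeric program, and the rank-per-gpu workaround is not covered here.

```python
import os
import subprocess
import sys


def with_hang_workarounds(base_env):
    """Return a copy of base_env with the two environment-variable workarounds set."""
    env = dict(base_env)
    env["NCCL_PXN_DISABLE"] = "1"
    env["CUDA_MODULE_LOADING"] = "EAGER"
    return env


env = with_hang_workarounds(os.environ)

# Stand-in child process: a real run would launch the cuPyNumeric program here.
child = "import os; print(os.environ['NCCL_PXN_DISABLE'], os.environ['CUDA_MODULE_LOADING'])"
result = subprocess.run(
    [sys.executable, "-c", child], env=env, capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```

Copying the environment instead of mutating os.environ keeps the workarounds scoped to the launched job.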

This release is missing CUDA 13 wheels for Python 3.14 (conda packages for this combination are available). We are working to add this combination.

Full Changelog: v26.01.00...v26.06.00

31 Jan 11:59

manopapad

v26.01.00

ae1c787

v26.01.00

This is a beta release of cuPyNumeric.

Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA 12 and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA 12/13 and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/26.01/.

Highlights

Added functionality

Implement cupynumeric.pad.
Implement cupynumeric.linalg.pinv (single-CPU/GPU only).
Implement from_dlpack for exporting cuPyNumeric ndarrays through the DLPack interface.
Detect when an object being used to initialize a cuPyNumeric ndarray implements the DLPack interface, and use it if possible.

Bugfixes

Ensure unimplemented stub functions always return cuPyNumeric ndarrays.

Known issues

We are aware of hangs when using cuSolverMp-based APIs on 4+ Perlmutter nodes. This appears to be a cluster-specific issue, which we are investigating.
We are aware of performance regressions with cupynumeric.einsum on Blackwell GPUs, starting with cuBLAS 13.2. These are under investigation.

Full Changelog: v25.11.00...v26.01.00

27 Nov 06:25

manopapad

v25.11.00

de7cf3f

v25.11.00

This is a beta release of cuPyNumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.11/.

Highlights

Support matrix changes

Start distributing conda packages for CUDA 13.
Port to cuSolverMp 0.7 (now the new required minimum).
Validate cuPyNumeric on DGX Spark.

Note that currently the pip wheels do not include CUDA 13 support, nor cuSolverMp support (linear solve / matrix decomposition APIs are constrained to single-GPU execution when using the wheels).

Added functionality

cupynumeric.histogram2d and cupynumeric.histogramdd
cupynumeric.lexsort
cupynumeric.isin
Multi-GPU & multi-node implementation of QR factorization, based on cuSolverMp
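
The sketch below shows lexsort and isin on small made-up arrays, assuming the cuPyNumeric functions follow the NumPy signatures (with a NumPy fallback when cuPyNumeric is not installed).

```python
try:
    import cupynumeric as np  # assumption: cuPyNumeric 25.11 or newer is installed
except ImportError:
    import numpy as np  # fallback so the sketch also runs without cuPyNumeric

primary = np.array([2, 1, 2, 1, 3])
secondary = np.array([9, 8, 7, 6, 5])

# lexsort sorts by the LAST key first, so `primary` is the major sort key
# and `secondary` breaks ties.
order = np.lexsort((secondary, primary))

# isin tests each element for membership in a set of values.
mask = np.isin(primary, np.array([1, 3]))

print(order, mask)
```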

Performance improvements

Accelerate axis-wise reductions on GPUs by combining multiple kernel invocations into one.
Parallelize specialized implementation for cupynumeric.take, and use it in more cases, including cupynumeric.take_along_axis.
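
For reference, the two indexing functions mentioned above can be used as in this sketch, which assumes NumPy-compatible signatures and falls back to NumPy when cuPyNumeric is not installed. The data is made up for the example.

```python
try:
    import cupynumeric as np  # assumption: cuPyNumeric is installed
except ImportError:
    import numpy as np  # fallback so the sketch also runs without cuPyNumeric

a = np.array([[30, 10, 20], [5, 15, 0]])

# take: pick the same columns (0 and 2) from every row.
first_and_last = np.take(a, np.array([0, 2]), axis=1)

# take_along_axis: pick a different permutation per row, here the sorting order.
order = np.argsort(a, axis=1)
sorted_rows = np.take_along_axis(a, order, axis=1)

print(first_and_last)
print(sorted_rows)
```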

UX improvements

I/O functions (e.g. hdf5 to_file) and memory offloading (e.g. offload_to) functions from Legate now accept cuPyNumeric ndarrays directly.

Known issues

We are aware of hangs when using cuSolverMp-based APIs on 4+ Perlmutter nodes. This appears to be a cluster-specific issue, which we are investigating.
We are aware of hangs when using UCX 1.19 with the CUDA 13 conda packages. These are typically accompanied by an error message like this:
```
ib_md.c:287  UCX  ERROR ibv_reg_mr(address=(nil), length=134217728, access=0xf) failed: Bad address
ucp_mm.c:76   UCX  ERROR failed to register address (nil) (cuda) length 134217728 on md[6]=mlx5_0: Input/output error (md supports: host|cuda)
```
We are investigating a proper fix. For the time being, setting UCX_MEMTYPE_CACHE=no in the environment appears to resolve the hang, at the cost of potentially decreasing UCX performance.

Full Changelog: v25.10.00...v25.11.00

30 Oct 21:23

manopapad

v25.10.00

66d872d

v25.10.00

This is a beta release of cuPyNumeric.

Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.10/.

Highlights

Added functionality

Implement cupynumeric.in1d.
Add DLPack import/export support to cuPyNumeric ndarrays.
Allow batched input for cupynumeric.linalg.solve.
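
A batched solve could look like the sketch below: a stack of three 2x2 systems is solved in one call. It assumes cupynumeric.linalg.solve accepts the same stacked shapes as numpy.linalg.solve, and it falls back to NumPy when cuPyNumeric is not installed.

```python
try:
    import cupynumeric as np  # assumption: cuPyNumeric 25.10 or newer is installed
except ImportError:
    import numpy as np  # fallback so the sketch also runs without cuPyNumeric

# A stack of three independent 2x2 coefficient matrices, shape (3, 2, 2).
a = np.array(
    [
        [[2.0, 0.0], [0.0, 4.0]],
        [[1.0, 1.0], [0.0, 1.0]],
        [[3.0, 1.0], [1.0, 2.0]],
    ]
)

# One right-hand-side column per system, shape (3, 2, 1).
b = np.ones((3, 2, 1))

# All three systems are solved in a single call.
x = np.linalg.solve(a, b)

max_error = float(np.abs(a @ x - b).max())
print(x.shape, max_error)
```

Keeping the right-hand side three-dimensional avoids any ambiguity about whether it is a stack of vectors or of matrices.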

Performance improvements

Optimized implementation for the special axis= case of cupynumeric.take.
Improve heuristics for choosing between batched and unbatched matrix multiplication.
Improved implementation of cupynumeric.nonzero that uses no additional scratch space.
Identify special cases of advanced indexing that can be executed faster using cupynumeric.einsum.

Documentation / profiling

Add a tutorial on using Legate Tasks to extend cuPyNumeric.
Add a user warning when an operation (e.g. printing to the console) causes a sharded array to be gathered onto a single memory.
Add sub-boxes to the Legate profiler, showing how long the Python interpreter spends inside cuPyNumeric API calls.

Breaking changes

Move nightly conda packages to a dedicated channel, -c legate-nightly.

Known issues

We are aware of hangs occurring under certain platforms and UCC configurations when using cuSolverMp-backed multi-GPU operations (Cholesky factorization and linear solve). We expect these to be fixed by the 25.11 release, which updates to cuSolverMp 0.7.

Full Changelog: v25.08.00...v25.10.00

05 Sep 07:38

manopapad

v25.08.00

7146e78

v25.08.00

This is a beta release of cuPyNumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.08/.

New features

Added functionality

Multi-node multi-GPU capable SVD, specialized for tall-skinny matrices
cupynumeric.cross
cupynumeric.insert
cupynumeric.logspace
cupynumeric.real_if_close
cupynumeric.roots
cupynumeric.ravel_multi_index
cupynumeric.copyto
cupynumeric.diagflat
cupynumeric.delete
cupynumeric.nan_to_num
Support multi-axis reductions
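
The sketch below illustrates a multi-axis reduction and cupynumeric.cross, assuming NumPy-compatible semantics (with a NumPy fallback when cuPyNumeric is not installed).

```python
try:
    import cupynumeric as np  # assumption: cuPyNumeric 25.08 or newer is installed
except ImportError:
    import numpy as np  # fallback so the sketch also runs without cuPyNumeric

x = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

# Reduce over axes 0 and 2 in one call, leaving only the middle axis.
multi = x.sum(axis=(0, 2))

# The same result, written as two chained single-axis reductions.
chained = x.sum(axis=2).sum(axis=0)

# Cross product of two unit vectors.
z = np.cross(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))

print(multi, z)
```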

Performance Improvements

Improve robustness & speed of cupynumeric.sort, by combining allocations where possible, and adding synchronization barriers around NCCL collectives.
Remove some extraneous blocking that was only necessary to match the behavior of NumPy 1.x.
Improve performance of NumPy fallback, in particular removing extraneous array copies, and adding special cases for quick fallback to functions such as cupynumeric.concatenate.

Miscellaneous

Unify all environment variables that control cuPyNumeric's NumPy fallback heuristics into a single one, CUPYNUMERIC_MAX_EAGER_VOLUME.
Allow any available BLAS implementation to be used in a source build.

Full Changelog: v25.07.00...v25.08.00

09 Jul 18:36

marcinz

v25.07.00

6132d84

v25.07.00

This is a beta release of cuPyNumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.07/.

New features

Added functionality

Multi-node multi-GPU capable cupynumeric.linalg.solve and cupynumeric.linalg.cholesky, backed by cuSolverMp.
Single-GPU cupynumeric.linalg.eigh/eigvalsh, backed by cuSolver.
cupynumeric.round

Support matrix changes

macOS wheels are now available on PyPI.
Add support for Blackwell CUDA architecture and MNNVL.
Drop support for Python 3.10 and add support for Python 3.13.
Remove NumPy 1.X restriction from packages (now compatible with NumPy 2.X).

Tuning

Add an optional "doctor" mode that detects some common anti-patterns that cause poor performance. Enable it with CUPYNUMERIC_DOCTOR=1; see https://docs.nvidia.com/cupynumeric/25.07/api/settings.html#doctor for more information.

Documentation

A basic cuPyNumeric tutorial is available, see https://docs.nvidia.com/cupynumeric/25.07/user/tutorial.html.
Start publishing nightly doc builds to https://nv-legate.github.io/cupynumeric.

Full Changelog: v25.03.02...v25.07.00

Known issues

Multi-node runs can occasionally segfault at exit. This issue is under investigation. Preliminary investigation suggests that the issue depends on the ordering between cuPyNumeric and OpenBLAS teardown. There is no impact to the correctness of the computation and subsequent GPU usage.
If the user explicitly forces multi-GPU execution of a sorting operation on very small arrays (about as many elements as the number of GPUs) this can result in CUDA errors. In normal conditions cuPyNumeric would not be GPU-accelerating operations of this size. A fix for this issue is in development and will be made available in an upcoming nightly build.

09 Apr 19:08

marcinz

v25.03.02

1fa4560

v25.03.02

This is a beta release of cuPyNumeric.

Linux x86 and ARM builds for Python 3.10 - 3.12 are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, and as conda packages at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.03/.

New features

PIP install support

With this release, Linux x86 and ARM builds of cuPyNumeric are available for Python 3.10 - 3.12 as Python wheels on PyPI in addition to conda.

cuPyNumeric can be installed with:
```
pip install nvidia-cupynumeric
```
See https://docs.nvidia.com/cupynumeric/25.03/installation.html#installing-pypi-packages for further instructions.
These wheels support multi-node execution through UCX.
See https://docs.nvidia.com/legate/25.03/networking-wheels.html for more details.

17 Mar 23:04

manopapad

v25.03.00

e6be689

v25.03.00

This is a beta release of cuPyNumeric.

Linux x86 and ARM conda packages are available for this release at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.03/.

New features

Licensing

With this release the Legate framework, on which cuPyNumeric is based, becomes open-source, under the Apache-2.0 license. This makes the entire cuPyNumeric stack (anything above the CUDA library level) open-source.

Added functionality

Matrix exponential: cupynumeric.linalg.expm
Batched eigendecomposition: cupynumeric.linalg.eigvals & cupynumeric.linalg.eig

Performance improvements

Avoid unnecessary streaming when running matrix multiplication on a single processor/GPU.

UX improvements

Add the legate.core.ProfileRange Python context manager, to annotate sub-spans within a larger task span on the profiler visualization.
Add the local_task_array helper function, that can be used in Python tasks to create a view over a Store/Array argument, using a NumPy or CuPy array as appropriate based on the type of memory where the data is located.

Documentation improvements

Add a user guide chapter on accelerating multi-GPU HDF5 workloads.

Known issues

We are aware of possible performance regressions when using UCX 1.18. We are temporarily restricting our packages to UCX <= 1.17 while we investigate this.

08 Feb 06:20

marcinz

v25.01.00

0464776

v25.01.00

This is a beta release of cuPyNumeric.

Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.01/.

New features

Added functionality

Add the method parameter to cupynumeric.convolve.
Increase the maximum array dimension from 4 to 6.
Experimental support for NumPy 2.0 (not reflected in the package constraints yet).

Memory management enhancements

Updates to take advantage of the deferred-eager pool unification in Legate. This change has the potential to increase the effective available memory capacity by up to 100% for many use cases. It also removes the need for the user to adjust --eager-alloc-percentage.
Add the offload_to() API, that allows a user to offload an array to a particular memory kind, such that any copies in other memories are discarded. This can be useful e.g. to evict an array from GPU memory onto system memory, freeing up space for subsequent GPU tasks.

I/O improvements

Use cuFile to accelerate HDF5 reads on the GPU.
Add support for reading "binary" HDF5 datasets (in particular useful for reading boolean-type datasets).

UX Improvements

Consider NUMA node topology when allocating CPU cores and memory during automatic machine configuration.
Add environment variable LEGATE_LIMIT_STDOUT, to only print out the output from one of the copies of the top-level program in a multi-process execution.
Remove an extraneous warning about __buffer__ being unimplemented.

Deprecations

Drop support for the Maxwell GPU architecture. cuPyNumeric now requires at least Pascal (sm_60).

07 Dec 06:44

marcinz

v24.11.02

0bc7ba6

v24.11.02

This is a patch release of cuPyNumeric.

Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/24.11/.

Packaging Changes

Update for Legate v24.11.01
