v26.06.00

@manopapad manopapad released this 24 Jun 19:48
ae639f3

This is a beta release of cuPyNumeric.

Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/ and https://pypi.org/project/nvidia-cupynumeric-cu12/, for Linux (x86-64 and ARM64, with CUDA 12/13 and single-node multi-GPU support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA 12/13 and multi-node support). Windows is currently supported through WSL.

Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/26.06/.

Support matrix changes

  • Add CUDA 13 support to pip wheels
  • Add support for Python 3.14 to all packages
  • Add macOS pip wheel for the Legate profiler
  • Remove networking support from pip wheels - due to limitations of the wheel distribution channel, the networking functionality of the wheels was not robust; use conda packages or build from source instead

Highlights

New features

  • cupynumeric.linalg.cho_factor
  • cupynumeric.linalg.cho_solve
  • cupynumeric.linalg.inv
  • cupynumeric.linalg.solve_triangular
  • cupynumeric.mgrid
  • cupynumeric.ndimage.convolve
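As a rough illustration of two of the additions above, the sketch below assumes that cupynumeric.mgrid and cupynumeric.linalg.inv follow the indexing syntax and signature of their NumPy counterparts; these notes do not spell out the signatures, so treat that as an assumption. The fallback to NumPy is there only so the snippet runs on a machine without cuPyNumeric installed.

```python
# Assumption: cuPyNumeric mirrors NumPy's interface for these two functions.
try:
    import cupynumeric as np
except ImportError:  # fallback so the sketch runs without cuPyNumeric
    import numpy as np

# mgrid builds a dense coordinate grid from slice notation.
grid = np.mgrid[0:3, 0:4]
rows, cols = grid[0], grid[1]

# Invert a small, well-conditioned matrix and measure how far
# a @ a_inv is from the identity.
a = np.array([[4.0, 1.0], [2.0, 3.0]])
a_inv = np.linalg.inv(a)
residual = np.abs(a @ a_inv - np.eye(2)).max()

print(grid.shape)  # (2, 3, 4)
print(float(residual) < 1e-12)
```

The SciPy-style additions (cho_factor, cho_solve, solve_triangular) are left out here because they have no NumPy fallback.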

Performance improvements

  • Fix parallelization of cupynumeric.pad
  • Improved Thrust- and NCCL-based implementations for advanced indexing operations
  • Merge index array zipping with subsequent scatter/gather kernel in advanced indexing operations (single-GPU only)
  • Add better heuristics for picking between cuSolver getrf/getrs and batched cuBLAS APIs
  • Avoid some extraneous type conversions in reduction operations
  • Reduce some Python overheads in ufunc operations

Examples

  • Add a microbenchmark suite (ufunc, gemm, gemv, sort, reduction, indexing, FFT, etc.)
  • Add examples showcasing different options for interoperating with PyTorch
  • Add CFD example based on https://github.com/barbagroup/CFDPython

Tooling

  • Add an Nsight Systems recipe that measures the degree of task-level parallelism
  • Add a tutorial on profiling & debugging
  • Add more anti-patterns to the "cuPyNumeric Doctor" detector tool

UX improvements

  • Remove fallback to NumPy for missing APIs and small arrays
  • Add an option to disable bounds checking in various operations, which removes a source of blocking
  • Initial support for using cuPyNumeric ndarrays inside a Python Legate task (creation of task-local ndarrays is not supported yet)
  • Initial implementation of the Python Array API

Known issues

  • As of October 2025, Perlmutter jobs that request more than 32 GB of device memory (for example, --fbmem 64000) must include REALM_DEFAULT_ARGS='-gex:bindcuda 0'. Otherwise the OFI provider aborts with Unexpected error 12 (Cannot allocate memory) from fi_mr_regattr().
  • We are aware of performance regressions with cupynumeric.einsum on Blackwell GPUs, starting with cuBLAS 13.2. These are under investigation.
  • With recent versions of UCX you might see warning messages like these:
    ib_md.c:296  UCX  ERROR ibv_reg_mr(address=(nil), length=134217728, access=0xf) failed: Bad address
    ucp_mm.c:81   UCX  ERROR failed to register address (nil) (cuda) length 134217728 on md[6]=mlx5_0: Input/output error (md supports: host|cuda)
    0.000000 {5}{ucp}: ucp_mem_map failed
    
    These can be ignored (they are not fatal, and don't appear to have an effect on performance). We are investigating how they can be addressed / removed.
  • We are aware of possible hangs when calling APIs that use cuSolverMp (e.g. multi-GPU cupynumeric.linalg.solve). We are in contact with the cuSolverMp maintainers to address these. In the meantime, depending on the underlying cause, one or more of the following workarounds should resolve the hangs:
    • export NCCL_PXN_DISABLE=1
    • export CUDA_MODULE_LOADING=EAGER
    • run in rank-per-gpu mode
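For jobs started from a Python driver script, the two environment-variable workarounds for the cuSolverMp hangs can be applied to a child process as sketched below. The helper name env_with_workarounds is hypothetical and is not part of cuPyNumeric or Legate; the launch command is deployment-specific and deliberately left out, and the rank-per-gpu option is not covered.

```python
import os

# Environment variables taken from the cuSolverMp workaround list.
CUSOLVERMP_WORKAROUNDS = {
    "NCCL_PXN_DISABLE": "1",
    "CUDA_MODULE_LOADING": "EAGER",
}


def env_with_workarounds(base=None):
    """Return a copy of `base` (default: os.environ) with the workarounds set.

    Hypothetical helper for illustration; not a cuPyNumeric API.
    """
    env = dict(os.environ if base is None else base)
    env.update(CUSOLVERMP_WORKAROUNDS)
    return env


if __name__ == "__main__":
    env = env_with_workarounds()
    # Pass `env` to subprocess.run(..., env=env) when launching the
    # application; the launch command itself is not shown here.
    for name in CUSOLVERMP_WORKAROUNDS:
        print(f"{name}={env[name]}")
```

Building a copy of the environment, rather than mutating os.environ, keeps the workarounds scoped to the launched job.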

This release is missing CUDA 13 wheels for Python 3.14 (conda packages for this combination are available). We are working to add this combination.

Full Changelog: v26.01.00...v26.06.00