CUB 2.0.0
Summary
The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous debugging flags.
This release also includes several new features. DeviceHistogram now supports __half and better handles various edge cases. WarpReduce now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort learned to handle the case where begin_bit == end_bit.
Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.
Breaking Changes
- #448 Add libcu++ dependency (v1.8.0+).
- #448: The following macros are no longer defined by default. They can be re-enabled by defining
CUB_PROVIDE_LEGACY_ARCH_MACROS. These will be completely removed in a future release.CUB_IS_HOST_CODE: Replace withNV_IF_TARGET.CUB_IS_DEVICE_CODE: Replace withNV_IF_TARGET.CUB_INCLUDE_HOST_CODE: Replace withNV_IF_TARGET.CUB_INCLUDE_DEVICE_CODE: Replace withNV_IF_TARGET.
- #486: CUB’s CUDA Runtime support macros have been updated to support
NV_IF_TARGET. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.CUB_RUNTIME_FUNCTION: Execution space annotations for functions that invoke CUDA Runtime APIs.- Old behavior:
- RDC enabled: Defined to
__host__ __device__ - RDC not enabled:
- NVCC host pass: Defined to
__host__ __device__ - NVCC device pass: Defined to
__host__
- NVCC host pass: Defined to
- RDC enabled: Defined to
- New behavior:
- RDC enabled: Defined to
__host__ __device__ - RDC not enabled: Defined to
__host__
- RDC enabled: Defined to
- Old behavior:
CUB_RUNTIME_ENABLED: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:- RDC enabled: Macro is defined.
- RDC not enabled:
- NVCC host pass: Macro is defined.
- NVCC device pass: Macro is not defined.
CUB_RDC_ENABLED: New macro, may be combined withNV_IF_TARGETto replace most usages ofCUB_RUNTIME_ENABLED. Behavior:- RDC enabled: Macro is defined.
- RDC not enabled: Macro is not defined.
- #509: A compile-time error is now emitted when a
__device__-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).- Due to limitations in the CUDA programming model, the result of this query is unreliable, and will silently return an incorrect result. This leads to difficult to debug errors.
- When using libcu++ 1.9.0, an error will be emitted with information about work-arounds:
- Use a named function object with a
__device__-only implementation ofoperator(). - Use a
__host__ __device__lambda. - Use
cuda::proclaim_return_type(Added in libcu++ 1.9.0)
- Use a named function object with a
- #509: Use the result type of the binary reduction operator for accumulating intermediate results in the
DeviceReducealgorithm, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for the
Dispatch*Reducelayer have changed:DispatchReduce:- Now accepts accumulator type as last parameter.
- Now accepts initializer type instead of output iterator value type.
- Constructor now accepts
initas initial type instead of output iterator value type.
DispatchSegmentedReduce:- Accepts accumulator type as last parameter.
- Accepts initializer type instead of output iterator value type.
- Thread operators now accept parameters using different types:
Equality,Inequality,InequalityWrapper,Sum,Difference,Division,Max,ArgMax,Min,ArgMin. ThreadReducenow accepts accumulator type and uses a different type forprefix.
- #511: Use the result type of the binary operator for accumulating intermediate results in the
DeviceScan,DeviceScanByKey, andDeviceReduceByKeyalgorithms, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for the
Dispatchlayer have changed:DispatchScannow accepts accumulator type as a template parameter.DispatchScanByKeynow accepts accumulator type as a template parameter.DispatchReduceByKeynow accepts accumulator type as the last template parameter.
- #527: Deprecate the
debug_synchronousflags on device algorithms.- This flag no longer has any effect. Define
CUB_DEBUG_SYNCduring compilation to enable these checks. - Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.
- This flag no longer has any effect. Define
New Features
- #514: Support
__halfinDeviceHistogram. - #516: Add support for single-threaded invocations of
WarpReduce. - #516: Use
__reduce_add_synchardware acceleration forWarpReduceon supported architectures.
Bug Fixes
- #481: Fix the device-wide radix sort implementations to simply copy the input to the output when
begin_bit == end_bit. - #487: Fix
DeviceHistogram::Evenfor a variety of edge cases:- Bin ids are now correctly computed when mixing different types for
SampleTandLevelT. - Bin ids are now correctly computed when
LevelTis an integral type and the number of levels does not evenly divide the level range.
- Bin ids are now correctly computed when mixing different types for
- #508: Ensure that
temp_storage_bytesis properly set in theAdjacentDifferenceCopydevice algorithms. - #508: Remove excessive calls to the binary operator given to the
AdjacentDifferenceCopydevice algorithms. - #533: Fix debugging utilities when RDC is disabled.
Other Enhancements
- #448: Removed special case code for unsupported CUDA architectures.
- #448: Replace several usages of
__CUDA_ARCH__with<nv/target>to handle host/device code divergence. - #448: Mark unused PTX arch parameters as legacy.
- #476: Enabled additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
- #480: Add
CUB_DISABLE_BF16_SUPPORTto avoid including thecuda_bf16.hheader or using the__nv_bfloat16type. - #486: Add debug log messages for post-kernel debug synchronizations.
- #490: Clarify documentation for in-place usage of
DeviceScanalgorithms. - #494: Clarify documentation for in-place usage of
DeviceHistogramalgorithms. - #495: Clarify documentation for in-place usage of
DevicePartitionalgorithms. - #499: Clarify documentation for in-place usage of
Device*Sortalgorithms. - #500: Clarify documentation for in-place usage of
DeviceReducealgorithms. - #501: Clarify documentation for in-place usage of
DeviceRunLengthEncodealgorithms. - #503: Clarify documentation for in-place usage of
DeviceSelectalgorithms. - #518: Fix typo in
WarpMergeSortdocumentation. - #519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.