- Fused DISCO kernel for serial convolution: fuses the sparse psi contraction and weight multiplication into a single autograd region, avoiding the K-expanded intermediate activation in the graph and reducing memory footprint by a factor of K. Enabled via
fused=TrueonDiscreteContinuousConvS2. - Distributed DISCO convolution now uses
reduce_scatterinstead ofall_reduce+scatterfor the polar reduction, cutting communicated data volume in half. - Added fused distributed DISCO convolution variant which reduces activation storage.
- Improved performance for CPU based attention kernels.
- Converted all Python assert statements to torch._check for better torch.compile friendliness. All asserts in cosntructors were converted into ValueErrors for streamlined and clear error handling. In C++ and CUDA compiled code, all dynamic asserts were changed to TORCH_CHECK calls.
- Improved performance of distributed attention kernels achieved by splitting the kernel into two different ones for dense and less dense rows. This happens behind the scenes and the distributed attention API is unchanged.
- Improved distributed tests: tests now only print on rank 0 and test states are broadcast to all ranks before being triggered, to ensure clean failure on all ranks in case of failing tests.
- Splitting logic in distributed SHT improved. Now the SHT splits all leading dims up to the spatial dims when performing the all-to-all. It also automatically applies padding if the split tensor dim is smaller than the number of ranks it is split across. Tests were added to cover these cases.
- Fixed a problem with squeezing singleton dimensions in the
GaussianRandomFieldS2noise tensor. - Breaking: default
basis_norm_modeforDistributedDiscreteContinuousConvS2andDistributedDiscreteContinuousConvTransposeS2changed from"mean"to"nodal"to match the serialDiscreteContinuousConvS2/DiscreteContinuousConvTransposeS2defaults. Distributed and serial DISCO layers now share the same default normalization unless explicitly overridden.
- Fourier-Bessel filter basis; Hann window basis with per-type init factors via
get_init_factors - Standardized L2 normalization on the unit disk (harmonic, Zernike, Fourier-Bessel); on a disk of radius R the norm equals R via the Jacobian
- New DISCO basis normalization modes
modal(mean-subtracted, reduces spectral leakage) andgeometric(spherical cap area measure) - Deprecated
basis_norm_mode="individual"→"nodal"and"area ratio"→"geometric"(old names emitDeprecationWarning) - Faster DISCO sparsity-pattern setup; OpenMP forward/backward kernels with up to ~55x speedup in some configurations
- Cross-attention (
key != value != query) inAttentionS2,NeighborhoodAttentionS2, andDistributedNeighborhoodAttentionS2 - Serial attention upsampling when
nlon_out % nlon_in == 0: CPU/CUDA/torch upsample kernels and matching reference DistributedNeighborhoodAttentionS2for self-attention and downsampling (distributed upsample not yet implemented)- Optional per-head QK RMS norm (
use_qknorm) forAttentionS2andNeighborhoodAttentionS2; shape checks across attention layers - Fixed Q/K/V projection gain when input dim != embedding dim
- Breaking: default
NeighborhoodAttentionS2scale changed from1/sqrt(k_channels)to1/sqrt(k_channels // num_heads)to match standard MHA head-dim scaling (num_heads > 1) - Faster Legendre coefficient precomputation for SHT layers
- Differentiable
polar_halo_exchangeandget_group_neighborsfor distributed attention - More robust distributed transpose;
_reduceclones beforeall_reducefortorch.compilecompatibility - Fixed Galewsky initial condition NaN from overflow; convolution adapter for mismatched residual channel counts
- Midpoint rule for filter-basis L2 norm integration (O(h^2)); improved
_precompute_convolution_tensor_s2docstring - Expanded attention tests (including upsample); new
tests/test_filter_basis.py; broader layer integrity coverage
- New CPU backend (OpenMP-accelerated) for both DISCO convolution and attention layers
- Pre-compiled manylinux wheels for multiple PyTorch and CUDA versions, available on PyPI and pypi.nvidia.com
- Revised truncation logic for the SHT: centralized in new
truncation.pymodule, enforcing triangular truncation (lmax = min(lmax, mmax)) across all SHT classes. Note: truncation for equiangular/equiangular-trapezoidal grids changed fromnlatto(nlat+1)//2 - SHT performance improvements: contraction dimensions are now transposed to be stride-1 before einsum, and real/imaginary parts are split into separate contiguous tensors
- New
fft.pywrapper module with proper Hermitian symmetry enforcement inirfftand explicit mode truncation inrfft - Full PyTorch 2 custom operator compatibility for DISCO and attention layers using
torch.libraryregistration, enablingtorch.compileandtorch.export - Restructured DISCO convolution and attention code into proper subpackages (
torch_harmonics/disco/,torch_harmonics/attention/) - Added double precision support for DISCO convolution
- Fixed Schmidt normalization for derivatives of associated Legendre polynomials
- Fixed up/downsampling in attention layers when input and output shapes differ
- Fixed
GaussianRandomFieldS2to useisht.lmax/isht.mmaxfor compatibility with revised truncation logic - Distributed module: added shape verification for transpose and gather operations, controllable via
TORCH_HARMONICS_DISTRIBUTED_DEBUG - Distributed module: fixed
finalize()bug where process group was not properly destroyed - Query functions
torch_harmonics.disco.optimized_kernels_is_availableandtorch_harmonics.attention.optimized_kernels_is_availablefor checking optimized layer availability - Quadrature helper functions
precompute_latitudesandprecompute_longitudesare now public API - added new tests:
- Comprehensive SHT test suite now covering vector SHT, Schmidt normalization, batch dimensions, and multiple grid types
- New test suites for
SpectralConvS2,QuadratureS2,GaussianRandomFieldS2, andResampleS2Enhanced DISCO convolution tests covering different input/output channel counts and double precision - Enhanced attention tests with up/downsampling and
opcheckintegration - New distributed tests for primitives, quadrature, and spectral convolution
- Shared test utilities module (
testutils.py)
- Adding Driscoll-Healy (spectral) convolutions
- Adding QuadratureS2 method which allows to integrate a spherical field over one of the supported grids
- Adding tests for QuadratureS2 and Driscoll-Healy spectral convolutions
- Improving setup for distributed tests, refactoring and code re-use for distributed and serial tests
- Decreasing problem sizes for some tests, allowing for faster execution
- Adding an additional caching test based of contents of a torch tensor
- DistributedRealVectorSHT now does truncation correctly, previously this was not guaranteed
- Double precision support for DISCO convolution
- Revised the truncation logic for the SHT
- Restructuring torch-harmonics module to streamline usage of the new attention and DISCO layers
- Full PyTorch 2 custom operator compatibility, allowing for exporting models with torch-harmonics layers using torch.export
- Added OpenMP accelerated CPU backend for DISCO and attention layers
- New query functions
torch_harmonics.disco.optimized_kernels_is_availableandtorch_harmonics.attention.optimized_kernels_is_availablefor optimized layers availability - More tests for DISCO and attention layers
- Cleaned up notebooks
- Adding spherical attention and spherical neighborhood attention
- Custom CUDA kerneles for spherical neighborhood attention
- New datasets for segmentation and depth estimation on the sphere based on the 2D3DS dataset
- added new spherical architectures and corresponding baselines
- S2 Transformer
- S2 Segformer
- S2 U-Net
- Reorganized examples folder, including new examples based on the 2d3ds dataset
- Added spherical loss functions to examples
- Added plotting module
- Updated docstrings
- Adding cache for precomoputed tensors such as weight tensors for DISCO and SHT
- Cache is returning copies of tensors and not references. Users are still encouraged to re-use those tensors manually in their models because this will also save memory. However, the cache will help with model setup speed.
- Adding test which ensures that cache is working correctly
- New normalization mode
supportfor DISCO convolutions - More efficient computation of Morlet filter basis
- Changed default for Morlet filter basis to a Hann window function
- New filter basis normalization in DISCO convolutions
- More robust pre-computation of DISCO convolution tensor
- Reworked DISCO filter basis datastructure
- Support for new filter basis types
- Added Zernike polynomial basis on a disk
- Added Morlet wavelet basis functions on a spherical disk
- Cleaning up the SFNO example and adding new Local Spherical Neural Operator model
- Updated resampling module to extend input signal to the poles if needed
- Added slerp interpolation to the resampling module
- Added distributed resampling module
- Changing default grid in all SHT routines to
equiangular - Hotfix to the numpy version requirements
- Added resampling modules for convenience
- Changing behavior of distributed SHT to use
dim=-3as channel dimension - Fixing SHT unittests to test SHT and ISHT individually, rather than the roundtrip
- Changing the way custom CUDA extensions are handled
- Hotfix to AMP in SFNO example
- CUDA-accelerated DISCO convolutions
- Updated DISCO convolutions to support even number of collocation points across the diameter
- Distributed DISCO convolutions
- Fused quadrature into multiplication with the Psi tensor to lower memory footprint
- Removed DISCO convolution in the plane to focus on the sphere
- Updated unit tests which now include tests for the distributed convolutions
- Discrete-continuous (DISCO) convolutions on the sphere and in two dimensions
- DISCO supports isotropic and anisotropic kernel functions parameterized as hat functions
- Supports regular and transpose convolutions
- Accelerated spherical DISCO convolutions on GPU via Triton implementation
- Unittests for DISCO convolutions in
tests/test_convolution.py
- Reworking distributed to allow for uneven split tensors, effectively removing the necessity of padding the transformed tensors
- Distributed SHT tests are now using unittest. Test extended to vector SHT versions
- Tests are defined in
torch_harmonics/distributed/distributed_tests.py - Base pytorch container version bumped up to 23.11 in Dockerfile
- Adding gradient check in unit tests
- Temporary work-around for NCCL contiguous issues with distributed SHT
- Refactored examples and documentation
- Updated SFNO example
- Adding github CI
- Changed SHT modules to convert dtype dynamically when computing the SHT/ISHT
- Bugfixes to fix importing examples
- Minor bugfixes to export SFNO code
- Readme should now render correctly in PyPI
- Added SFNO example
- Added Shallow Water Equations Dataset for SFNO training
- Cleanup of the repository and added PyPI
- Updated Readme
- Reworked distributed SHT
- Module for sampling Gaussian Random Fields on the sphere
- Computation of associated Legendre polynomials
- changed algorithm to compute the associated Legendre polynomials for improved stability
- Improved Readme
- Vector Spherical Harmonic Transforms
- projects vector-valued fields onto the vector Spherical Harmonics
- supports computation of div and curl on the sphere
- New quadrature rules
- Clenshaw-Curtis quadrature rule
- Fejér quadrature rule
- Legendre-Gauss-Lobatto quadrature
- New notebooks
- complete with differentiable Shallow Water Solver
- notebook on quadrature and interpolation
- Unit tests
- Refactor of the API
- Renaming from torch_sht to torch_harmonics
- Adding distributed SHT support
- New logo
- Single GPU forward and backward transform
- Minimal code example and notebook