Summary
The non-scattering (clear-sky) adding method in ADA_Module.f90 (lines 228-246) can be parallelized using a parallel prefix scan with affine tuples, reducing sequential depth from O(N_layers) to O(log N_layers).
The radiance recurrence Rad_UP[k-1] = source_up[k] + trans[k] * Rad_UP[k] is an affine map, and affine map composition is associative — enabling a Hillis-Steele parallel scan on GPU.
Benchmark (RTX 3060, CUDA 13.2)
| Configuration |
Profiles |
Layers |
Angles |
GPU Time |
Max Rel Error |
Status |
| IR sounder |
10,000 |
60 |
6 |
0.18 ms |
4.1e-07 |
PASS |
| IR sounder |
50,000 |
60 |
6 |
0.87 ms |
4.3e-07 |
PASS |
| High-res |
10,000 |
100 |
6 |
0.32 ms |
4.3e-07 |
PASS |
| MW sounder |
100,000 |
60 |
2 |
0.59 ms |
3.5e-07 |
PASS |
| Multi-stream |
10,000 |
60 |
16 |
0.44 ms |
4.4e-07 |
PASS |
| Deep atm |
1,000 |
200 |
6 |
0.06 ms |
3.1e-07 |
PASS |
Zero NaN, zero Inf across all tests.
Scope & Limitations
What works: The scalar (non-scattering) adding path where layer transmittance and reflectance are diagonal. Each discrete ordinate angle is processed independently.
What is NOT tested: The full scattering path (lines 198-226) which involves nZ×nZ matrix operations with matinv at every layer. The matrix Redheffer star product is theoretically still associative, but the parallel scan with matrix inverses inside the operator is more complex and may have numerical issues. This needs further research.
Hardware limitation: Tested on RTX 3060 12GB only. Production CRTM workloads on WCOSS2 would need A100/H100 validation.
Mathematical Basis
Same technique as applied to rte-rrtmgp (earth-system-radiation/rte-rrtmgp#393):
- Affine tuple operator:
(a₂,b₂) ∘ (a₁,b₁) = (a₂·a₁, a₂·b₁+b₂) (Martin & Cundy, ICLR 2018)
- Associativity proven by Grant & Hunt (1969) for RT layer operators
Code
CUDA kernel: https://github.com/consigcody94/parallel-prefix-rt/blob/master/benchmarks/cuda/crtm_scalar_adding.cu
Summary
The non-scattering (clear-sky) adding method in
ADA_Module.f90(lines 228-246) can be parallelized using a parallel prefix scan with affine tuples, reducing sequential depth from O(N_layers) to O(log N_layers).The radiance recurrence
Rad_UP[k-1] = source_up[k] + trans[k] * Rad_UP[k]is an affine map, and affine map composition is associative — enabling a Hillis-Steele parallel scan on GPU.Benchmark (RTX 3060, CUDA 13.2)
Zero NaN, zero Inf across all tests.
Scope & Limitations
What works: The scalar (non-scattering) adding path where layer transmittance and reflectance are diagonal. Each discrete ordinate angle is processed independently.
What is NOT tested: The full scattering path (lines 198-226) which involves nZ×nZ matrix operations with
matinvat every layer. The matrix Redheffer star product is theoretically still associative, but the parallel scan with matrix inverses inside the operator is more complex and may have numerical issues. This needs further research.Hardware limitation: Tested on RTX 3060 12GB only. Production CRTM workloads on WCOSS2 would need A100/H100 validation.
Mathematical Basis
Same technique as applied to rte-rrtmgp (earth-system-radiation/rte-rrtmgp#393):
(a₂,b₂) ∘ (a₁,b₁) = (a₂·a₁, a₂·b₁+b₂)(Martin & Cundy, ICLR 2018)Code
CUDA kernel: https://github.com/consigcody94/parallel-prefix-rt/blob/master/benchmarks/cuda/crtm_scalar_adding.cu