GPU parallel prefix scan for clear-sky adding method — 6/6 tests pass

## Summary

The non-scattering (clear-sky) adding method in `ADA_Module.f90` (lines 228-246) can be parallelized using a parallel prefix scan with affine tuples, reducing sequential depth from O(N_layers) to O(log N_layers).

The radiance recurrence `Rad_UP[k-1] = source_up[k] + trans[k] * Rad_UP[k]` is an affine map, and affine map composition is associative — enabling a Hillis-Steele parallel scan on GPU.

## Benchmark (RTX 3060, CUDA 13.2)

| Configuration | Profiles | Layers | Angles | GPU Time | Max Rel Error | Status |
|---|---|---|---|---|---|---|
| IR sounder | 10,000 | 60 | 6 | 0.18 ms | 4.1e-07 | PASS |
| IR sounder | 50,000 | 60 | 6 | 0.87 ms | 4.3e-07 | PASS |
| High-res | 10,000 | 100 | 6 | 0.32 ms | 4.3e-07 | PASS |
| MW sounder | 100,000 | 60 | 2 | 0.59 ms | 3.5e-07 | PASS |
| Multi-stream | 10,000 | 60 | 16 | 0.44 ms | 4.4e-07 | PASS |
| Deep atm | 1,000 | 200 | 6 | 0.06 ms | 3.1e-07 | PASS |

Zero NaN, zero Inf across all tests.

## Scope & Limitations

**What works:** The scalar (non-scattering) adding path where layer transmittance and reflectance are diagonal. Each discrete ordinate angle is processed independently.

**What is NOT tested:** The full scattering path (lines 198-226) which involves nZ×nZ matrix operations with `matinv` at every layer. The matrix Redheffer star product is theoretically still associative, but the parallel scan with matrix inverses inside the operator is more complex and may have numerical issues. This needs further research.

**Hardware limitation:** Tested on RTX 3060 12GB only. Production CRTM workloads on WCOSS2 would need A100/H100 validation.

## Mathematical Basis

Same technique as applied to rte-rrtmgp (earth-system-radiation/rte-rrtmgp#393):
- Affine tuple operator: `(a₂,b₂) ∘ (a₁,b₁) = (a₂·a₁, a₂·b₁+b₂)` (Martin & Cundy, ICLR 2018)
- Associativity proven by Grant & Hunt (1969) for RT layer operators

## Code

CUDA kernel: https://github.com/consigcody94/parallel-prefix-rt/blob/master/benchmarks/cuda/crtm_scalar_adding.cu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU parallel prefix scan for clear-sky adding method — 6/6 tests pass #298

Summary

Benchmark (RTX 3060, CUDA 13.2)

Scope & Limitations

Mathematical Basis

Code

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Configuration	Profiles	Layers	Angles	GPU Time	Max Rel Error	Status
IR sounder	10,000	60	6	0.18 ms	4.1e-07	PASS
IR sounder	50,000	60	6	0.87 ms	4.3e-07	PASS
High-res	10,000	100	6	0.32 ms	4.3e-07	PASS
MW sounder	100,000	60	2	0.59 ms	3.5e-07	PASS
Multi-stream	10,000	60	16	0.44 ms	4.4e-07	PASS
Deep atm	1,000	200	6	0.06 ms	3.1e-07	PASS

GPU parallel prefix scan for clear-sky adding method — 6/6 tests pass #298

Description

Summary

Benchmark (RTX 3060, CUDA 13.2)

Scope & Limitations

Mathematical Basis

Code

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions