Changes from 28 of 33 commits
6ce0b9f
Move adstack primal/adjoint storage onto a per-dispatch heap buffer.
duburcqa Apr 17, 2026
1942733
Extend heap-backed SPIR-V adstack to i32/u1 and reject other primitiv…
duburcqa Apr 17, 2026
01b6fb4
[SPIRV] Fix empty-dispatch heap allocation, ctx_buffers_ use-after-fr…
duburcqa Apr 17, 2026
4382ee1
[SPIRV] Remove dead AdStackHeapKind::function_scope path and align mi…
duburcqa Apr 17, 2026
f0f9281
[SPIRV] Eagerly emit ad_stack_heap_thread_base_* at alloca site to av…
duburcqa Apr 17, 2026
bfac9bb
[SPIRV] Skip adstack reject-unsupported-type test when device lacks s…
duburcqa Apr 17, 2026
d185383
[SPIRV] Widen adstack heap invoc_id*stride to u64 when Int64 is avail…
duburcqa Apr 18, 2026
95707dc
fixup! Move adstack primal/adjoint storage onto a per-dispatch heap b…
duburcqa Apr 18, 2026
77220a7
fixup! [SPIRV] Widen adstack heap invoc_id*stride to u64 when Int64 i…
duburcqa Apr 18, 2026
d1b7853
[SPIRV] Split cached-field comment: buffers are lazy, thread_base eager
duburcqa Apr 18, 2026
e767d2f
[LLVM] Move adstack primal/adjoint storage onto a per-runtime heap bu…
duburcqa Apr 19, 2026
74cfbeb
Raise default_ad_stack_size from 32 to 256 on the heap-backed adstack.
duburcqa Apr 19, 2026
945cf37
Expose default_ad_stack_size as a qd.init knob and document rationale.
duburcqa Apr 19, 2026
3aa9c26
[LLVM] Size adstack heap by end-begin for range_for and zero-init on …
duburcqa Apr 19, 2026
622a200
fixup! [LLVM] Size adstack heap by end-begin for range_for and zero-i…
duburcqa Apr 19, 2026
d9813a3
fixup! Raise default_ad_stack_size from 32 to 256 on the heap-backed …
duburcqa Apr 19, 2026
17a5ea1
fixup! Expose default_ad_stack_size as a qd.init knob and document ra…
duburcqa Apr 19, 2026
9d9e0c3
fixup! [LLVM] Size adstack heap by end-begin for range_for and zero-i…
duburcqa Apr 19, 2026
324fed2
fixup! [LLVM] Size adstack heap by end-begin for range_for and zero-i…
duburcqa Apr 19, 2026
4a49d88
fixup! [LLVM] Move adstack primal/adjoint storage onto a per-runtime …
duburcqa Apr 19, 2026
f016a05
[LLVM] Host-manage the adstack heap and size it tightly per task.
duburcqa Apr 19, 2026
73ebd99
[AMDGPU] Free RuntimeContext per launch; restore tight adstack sizing.
duburcqa Apr 19, 2026
4b2c22f
Fix stale preallocate_runtime_memory comment referencing removed adst…
duburcqa Apr 19, 2026
b9152b7
Address review comments: dead code + stale docs.
duburcqa Apr 19, 2026
56187e7
[CUDA] Reject graph=True on kernels that use the reverse-mode autodif…
duburcqa Apr 19, 2026
1039ce3
Address review comments: docs memory-cost formula + stale LLVM codege…
duburcqa Apr 19, 2026
60e613d
Reflow adstack-related comments to fill 120-col lines (no content cha…
duburcqa Apr 19, 2026
e9b5e48
[SPIRV] Amortized doubling when growing the adstack heap buffers.
duburcqa Apr 19, 2026
96fc260
Fix ensure_adstack_heap release-safety comment (AMDGPU does not hipFr…
duburcqa Apr 20, 2026
df52025
Scope cross-launch safety note to AMDGPU (CUDA has no device-side con…
duburcqa Apr 20, 2026
ecdf71c
Merge branch 'duburcqa/fix_adstack_perf' into duburcqa/heap_backed_ad…
hughperkins Apr 20, 2026
cda3a6a
[SPIRV/Metal] Cap advisory thread count to runtime-resolved ndrange i…
duburcqa Apr 20, 2026
6cada5b
[SPIRV/Tests] Pin grad-over-ndarray-shape ndrange does not oversize p…
duburcqa Apr 20, 2026
33 changes: 30 additions & 3 deletions docs/source/user_guide/autodiff.md
@@ -234,9 +234,37 @@ The pattern often hides inside in-place time-stepping updates like `x[i] = x[i]

### Adstack overflow

Each adstack has a fixed capacity baked into the compiled kernel. Note that the capacity is fixed at compile time: it cannot be modified at runtime. When the compiler can prove the worst-case loop trip count, that value is used for the capacity; otherwise it falls back to a conservative default. Pass `ad_stack_size=N` to `qd.init()` to override the fallback. On SPIR-V backends (Metal, Vulkan) the allocation lives in per-thread on-chip memory, which the driver caps at a few kilobytes, so the fallback default stays small.
Each adstack has a capacity that is fixed at compile time and cannot be modified at runtime. When the compiler can prove the worst-case loop trip count, that value is used; otherwise it falls back to a conservative default.

If a kernel overflows its adstack at runtime, Quadrants raises a Python `RuntimeError` naming the overflow at the next `qd.sync()`; if the default is already too large for the target driver, pipeline creation itself fails with a similar exception at kernel-launch time. Heap-backed SPIR-V adstack, which would lift the per-thread ceiling, is left for future work.
If a kernel overflows its adstack at runtime, Quadrants raises a Python `RuntimeError` naming the overflow at the next `qd.sync()`.

**Tuning the capacity.** Two `qd.init()` knobs control adstack sizing:

- `default_ad_stack_size=N` (default `256`): the fallback capacity for loops whose trip count the compiler cannot prove statically. Every adstack whose max_size was not deducible shares this value. Prefer tuning this knob, since it only affects the branch where the compiler needed to guess.
- `ad_stack_size=N` (default `0 = adaptive`): a hard override that forces every adstack in the program to exactly `N` slots, regardless of what the compiler proved. Prefer this knob only when a targeted experiment needs uniform sizing (e.g. stress-testing the runtime heap path).

**How to pick `default_ad_stack_size`.** The reverse pass of a `K`-iteration dynamic loop emits `K + 2` pushes per adstack (the trip count plus two setup pushes: one for the initial adjoint slot and one for the primal's starting value). Size the default at the flat trip count of the deepest unprovable dynamic loop in the program, plus that headroom. Common shapes:

- A single `qd.ndrange(n, m)` whose bounds come from a field: worst case is `n_max * m_max` iterations. Pick `N >= n_max * m_max + 2`. At `max_n_dofs_per_entity = 16`, 16 x 16 = 256 hits the default exactly.
- Nested `for i in range(a[None]): for j in range(b[None]):`: worst case is `a_max * b_max`, same rule.
- A single dynamic `for i in range(a[None])`: `N >= a_max + 2`.
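The `K + 2` rule above can be sketched as a tiny helper (illustrative only; `min_default_ad_stack_size` is not a Quadrants API):

```python
def min_default_ad_stack_size(worst_case_trip_count: int) -> int:
    """Smallest safe fallback capacity for one dynamic loop.

    A K-iteration dynamic loop emits K pushes plus two setup pushes
    (one for the initial adjoint slot, one for the primal's starting value).
    """
    return worst_case_trip_count + 2

# ndrange(n, m) whose bounds come from a field, each capped at 16:
print(min_default_ad_stack_size(16 * 16))  # 258
```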

**Memory cost.** The adstack pipeline allocates one small scratch buffer per loop-carried variable that the reverse pass has to remember. For example, a kernel whose dynamic loop reads and updates one float accumulator needs 1 adstack; a kernel whose loop updates four different floats needs 4. Integer counters and boolean branch flags used by the reverse pass also count (typically one each per dynamic `if` or nested loop). The total memory Quadrants allocates across all those buffers is roughly

```
num_threads * stack_size * bytes_per_element * num_loop_carried_variables
```

where `bytes_per_element` depends on the element type and the backend. On the LLVM backends (CPU / CUDA / AMDGPU) each adstack slot stores both a primal and an adjoint value, so f32 costs 8, i32 costs 8, and bool costs 2 bytes per slot. On the SPIR-V backends (Metal / Vulkan) integer adstacks only store the primal (the reverse pass does not accumulate integer adjoints), and bool is widened to i32 at storage time because SPIR-V has no defined layout for `OpTypeBool`, so f32 costs 8, i32 costs 4, and bool costs 4 bytes per slot. The buffer lives on the device on GPU and in host RAM on CPU. `num_threads` is the number of threads the kernel actually dispatches, not a worst-case grid: on CPU this is the thread pool size (tens of threads), so the memory footprint stays small; on GPU it is the dispatched ndrange. The buffer grows on demand to match the largest size any launch has needed so far and is then reused across subsequent launches, so you do not need to reserve memory up front.
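The formula can be evaluated directly as a sanity check. This is a plain-Python sketch, not a Quadrants API; the helper name is made up, and the per-slot byte costs are the ones stated in the paragraph above:

```python
# Bytes per adstack slot, per backend, as described above.
BYTES_PER_SLOT = {
    "llvm": {"f32": 8, "i32": 8, "bool": 2},   # each slot stores primal + adjoint
    "spirv": {"f32": 8, "i32": 4, "bool": 4},  # integer adstacks store primal only; bool widened to i32
}

def adstack_heap_bytes(num_threads, stack_size, elem_type, num_vars, backend="llvm"):
    """Rough total backing-buffer size across all loop-carried variables."""
    return num_threads * stack_size * BYTES_PER_SLOT[backend][elem_type] * num_vars

# ndrange(1024, 1024) on GPU, stack_size 256, four f32 loop-carried variables:
print(adstack_heap_bytes(1024 * 1024, 256, "f32", 4))  # 8589934592, i.e. 8 GiB
```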

Two shapes blow up the cost:

- A big `ndrange` on GPU. A kernel over `ndrange(1024, 1024)` with a `default_ad_stack_size` of `256` and four f32 loop-carried variables allocates roughly `1024 * 1024 * 256 * 8 * 4 bytes = 8 GB`, enough to overrun a consumer GPU's memory budget.
- A large `default_ad_stack_size`. The backing buffer scales linearly with the stack size, so this is the first knob to lower when you hit an out-of-memory error.

If the GPU driver returns an out-of-memory error, the order of remedies is: drop `default_ad_stack_size` toward the real worst-case trip count of your dynamic loops, reduce the number of loop-carried variables the reverse pass has to remember (split a kernel, checkpoint manually, or fold two accumulators into one), then raise `device_memory_fraction` or `device_memory_GB` at `qd.init()` if your GPU has headroom.
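Put together, a hedged `qd.init()` sketch of the first and last remedies (the module name is assumed from the repository layout, and the values are placeholders for your own workload):

```python
import quadrants as qd  # assumed import name

qd.init(
    default_ad_stack_size=102,  # real worst case: a_max = 100 iterations + 2 setup pushes
    device_memory_GB=6,         # raise only if the GPU actually has headroom
)
```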

**When the default is too small.** The runtime surfaces overflow as a `QuadrantsAssertionError` (LLVM backends) or `RuntimeError` (SPIR-V backends) on the next `qd.sync()`, and the message recommends bumping `default_ad_stack_size`. Pick `N` using the rules above and retry.

## Backend support

@@ -247,7 +275,6 @@ The adstack pipeline is behind `ad_stack_experimental_enabled=True`. Enable it w
## Known limitations

- Adstack overflow is reported as a Python-level exception on every backend, but asynchronously: the offending kernel writes to a host-polled SSBO flag during execution, and the next `qd.sync()` (explicit, or implicit via a host read like `to_numpy()` / `to_torch()`) reads the flag and raises. This follows the same pattern as CUDA async errors so every launch does not pay a per-launch sync. If you want the exception to land exactly at the offending kernel rather than at the next sync, call `qd.sync()` right after the kernel, or enable `qd.init(debug=True)` on LLVM backends to poll after every launch.
- On SPIR-V backends (Metal, Vulkan) the adstack is allocated as per-thread on-chip memory, which the driver's shader compiler caps at a few kilobytes. Kernels whose combined adstack demand exceeds that cap fail to compile and Quadrants raises a Python `RuntimeError` at kernel-launch time. LLVM backends (CPU, CUDA, AMDGPU) allocate on the heap and do not hit this limit. Lifting the SPIR-V limit by moving the adstack off on-chip memory is tracked for future work.
- Adstack trades compile time for generality. Kernels with many loop-carried variables, nested dynamic loops, or large inner-loop bodies produce visibly slow compile times - seconds stretching into minutes, and on SPIR-V backends sometimes into the territory where the driver's shader compiler gives up. Budget compile-time accordingly when migrating existing reverse-mode AD workloads.
- Reverse-mode AD does not propagate gradients through integer casts or non-real operations. No error is raised; the gradient simply stops at the cast and silently reads as zero upstream. Cast to `qd.f32` / `qd.f64` before the differentiable section.
- Backward passes on non-trivial kernels run noticeably slower than the corresponding forward pass, sometimes by an order of magnitude on SPIR-V.
16 changes: 16 additions & 0 deletions quadrants/codegen/amdgpu/codegen_amdgpu.cpp
@@ -353,6 +353,22 @@ class TaskCodeGenAMDGPU : public TaskCodeGenLLVM {
current_task->block_dim = stmt->block_dim;
QD_ASSERT(current_task->grid_dim != 0);
QD_ASSERT(current_task->block_dim != 0);
// Host-side adstack sizing, same scheme as codegen_cuda: tight `grid_dim * block_dim` for
// non-range_for and const-bound range_for, dynamic resolution via gtmps DtoH memcpy for
// dynamic-bound range_for. See llvm_compiled_data.h::AdStackSizingInfo for the resolution
// rule the kernel launcher applies.
if (current_task->ad_stack.per_thread_stride > 0) {
current_task->ad_stack.static_num_threads =
static_cast<std::size_t>(current_task->grid_dim) * static_cast<std::size_t>(current_task->block_dim);
if (stmt->task_type == Type::range_for && !(stmt->const_begin && stmt->const_end)) {
current_task->ad_stack.dynamic_gpu_range_for = true;
current_task->ad_stack.begin_const_value = stmt->const_begin ? stmt->begin_value : 0;
current_task->ad_stack.end_const_value = stmt->const_end ? stmt->end_value : 0;
current_task->ad_stack.begin_offset_bytes =
stmt->const_begin ? -1 : static_cast<std::int32_t>(stmt->begin_offset);
current_task->ad_stack.end_offset_bytes = stmt->const_end ? -1 : static_cast<std::int32_t>(stmt->end_offset);
}
}
offloaded_tasks.push_back(*current_task);
current_task = nullptr;
}
11 changes: 11 additions & 0 deletions quadrants/codegen/cpu/codegen_cpu.cpp
@@ -175,6 +175,17 @@ class TaskCodeGenCPU : public TaskCodeGenLLVM {
call("LLVMRuntime_profiler_stop", get_runtime());
}
finalize_offloaded_task_function();
// Host-side adstack sizing: on CPU the adstack slot is indexed by `cpu_thread_id` in
// [0, num_cpu_threads), so sizing is independent of iteration count. Serial tasks run on the
// launcher with `cpu_thread_id == 0`. Dynamic offsets stay -1 because CPU never reads begin/end
// from gtmps for sizing - the thread pool bound is always tight.
if (current_task->ad_stack.per_thread_stride > 0) {
int cpu_threads = 1;
if (stmt->task_type != OffloadedStmt::TaskType::serial && stmt->num_cpu_threads > 0) {
cpu_threads = stmt->num_cpu_threads;
}
current_task->ad_stack.static_num_threads = static_cast<std::size_t>(cpu_threads);
}
offloaded_tasks.push_back(*current_task);
current_task = nullptr;
current_offload = nullptr;
18 changes: 18 additions & 0 deletions quadrants/codegen/cuda/codegen_cuda.cpp
@@ -640,6 +640,24 @@ class TaskCodeGenCUDA : public TaskCodeGenLLVM {
current_task->dynamic_shared_array_bytes = dynamic_shared_array_bytes;
QD_ASSERT(current_task->grid_dim != 0);
QD_ASSERT(current_task->block_dim != 0);
// Host-side adstack sizing. For non-range_for and for const-bound range_for the launcher uses
// `grid_dim * block_dim` directly, which is tight because codegen above caps grid_dim to
// ceil((end-begin)/block_dim) for const range_for and non-range_for tasks fan out over the full
// dispatch. For dynamic-bound range_for we record const values and gtmps byte offsets so the
// launcher resolves begin/end at launch time (via i32 DtoH memcpy from runtime->temporaries)
// and sizes the heap to exactly `(end - begin) * per_thread_stride`.
if (current_task->ad_stack.per_thread_stride > 0) {
current_task->ad_stack.static_num_threads =
static_cast<std::size_t>(current_task->grid_dim) * static_cast<std::size_t>(current_task->block_dim);
if (stmt->task_type == Type::range_for && !(stmt->const_begin && stmt->const_end)) {
current_task->ad_stack.dynamic_gpu_range_for = true;
current_task->ad_stack.begin_const_value = stmt->const_begin ? stmt->begin_value : 0;
current_task->ad_stack.end_const_value = stmt->const_end ? stmt->end_value : 0;
current_task->ad_stack.begin_offset_bytes =
stmt->const_begin ? -1 : static_cast<std::int32_t>(stmt->begin_offset);
current_task->ad_stack.end_offset_bytes = stmt->const_end ? -1 : static_cast<std::int32_t>(stmt->end_offset);
}
}
offloaded_tasks.push_back(*current_task);
current_task = nullptr;
}