
[AutoDiff] Autodiff 9: Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap#493

Open
duburcqa wants to merge 33 commits into duburcqa/fix_adstack_perf from duburcqa/heap_backed_adstack

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Apr 17, 2026

Summary

Moves the SPIR-V adstack's primal and adjoint arrays off Function-scope (per-thread on-chip memory) onto a shared, per-dispatch StorageBuffer sliced per invocation. Lifts the Metal/MoltenVK per-thread shader footprint cap that bounded the previous implementation to a few dozen adstack slots.

Non-f32 adstacks (i32, u1, f64, ...) continue to use Function-scope arrays because they are typically small (loop indices, if-condition flags) and fit comfortably; only f32 goes on the heap, where the cap mattered in practice.

What changed

  • TaskCodegen::visit(AdStackAllocaStmt) picks heap vs Function-scope based on ret_type == f32. Push/Pop/LoadTop/AccAdjoint visitors branch on a new use_heap flag.
  • TaskCodegen::run() pre-scans the IR for every f32-valued AdStackAllocaStmt up front so get_ad_stack_heap_thread_base() captures the final per-thread stride. Without the pre-scan the lazy invoc_id * stride bakes a stale stride into the base expression, later allocas raise the stride, and threads overlap into each other's slices.
  • TaskAttributes::ad_stack_heap_per_thread_stride is written at the end of codegen and read by the runtime.
  • GfxRuntime allocates one StorageBuffer per launch sized at stride * (group_x * block_dim) * sizeof(f32). Reused across launches with grow-on-demand; on grow the previous buffer is moved into ctx_buffers_ rather than destroyed synchronously, so an in-flight cmdlist still referencing it stays valid until the stream drains.
  • Replaces the former test_adstack_shader_compile_failure_raises (which asserted Metal would reject large stacks) with test_adstack_large_capacity_heap_backed (4096-slot kernel that the old Function-scope path could not compile).
  • Adds test_adstack_mixed_f32_and_non_f32 exercising both paths (f32 primal on the heap, i32 loop index on Function-scope) in one kernel.

Stacked on #490.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e59c3d40a7


Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from e59c3d4 to 559dcd4 Compare April 17, 2026 11:12
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from 5f87046 to c7e3de9 Compare April 17, 2026 11:19
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from 559dcd4 to b3bb3f1 Compare April 17, 2026 11:20
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from c7e3de9 to f5d18d0 Compare April 17, 2026 11:37
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from b3bb3f1 to 190fc9c Compare April 17, 2026 11:37
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from f5d18d0 to 36973d4 Compare April 17, 2026 11:44
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from 190fc9c to 3117982 Compare April 17, 2026 11:44
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from 36973d4 to 2424db0 Compare April 17, 2026 11:53
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from 3117982 to 486ef11 Compare April 17, 2026 11:53
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from 2424db0 to 4f9c08f Compare April 17, 2026 12:12
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from 486ef11 to 138b1f4 Compare April 17, 2026 12:12
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from 4f9c08f to ea66a16 Compare April 17, 2026 12:18
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from 138b1f4 to 56b29f5 Compare April 17, 2026 12:18
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from ea66a16 to 9120b6c Compare April 17, 2026 12:29
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from 56b29f5 to ab56aae Compare April 17, 2026 12:29
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from 9120b6c to 72df355 Compare April 17, 2026 12:31
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from ab56aae to 2dcf9b4 Compare April 17, 2026 12:31
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from 72df355 to a231bc8 Compare April 17, 2026 12:43
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from 2dcf9b4 to 816900b Compare April 17, 2026 12:43
@duburcqa duburcqa changed the title [SPIRV] Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap [SPIRV] Autodiff 9: Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap Apr 17, 2026
@duburcqa duburcqa changed the title [SPIRV] Autodiff 9: Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap [AutoDiff] Autodiff 9: Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap Apr 17, 2026
@duburcqa duburcqa force-pushed the duburcqa/fix_adstack_perf branch from a231bc8 to ff11ee2 Compare April 17, 2026 15:57
@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from 816900b to 3f8a0fe Compare April 17, 2026 15:57
@duburcqa
Contributor Author

@claude review

Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
Comment thread quadrants/codegen/spirv/kernel_utils.h Outdated

@claude claude bot left a comment


The GPU-side grid-stride aliasing concern I raised in my prior comment was a misunderstanding — each outer iteration runs a complete forward+backward sweep within a single func(context, tls_ptr, idx) invocation, so the per-iteration stack_init reset is correct and harmless. All previously flagged bugs have been addressed. The PR is large (14 files, new heap-backed adstack across both LLVM and SPIR-V backends) and touches critical autodiff infrastructure, so a human pass before merge is warranted.

Extended reasoning...

Overview

PR #493 moves the SPIR-V adstack's primal/adjoint storage from Function-scope (per-thread on-chip) to per-dispatch StorageBuffers sliced by invocation ID, and simultaneously moves the LLVM backend's adstack from worker-thread stack allocas to a per-runtime heap slab. It touches 14 files across codegen, runtime, tests, and documentation.

Resolution of prior concerns

All 12+ bugs found across previous review rounds have been addressed in follow-up commits. The final open item — my comment at 2026-04-19T10:33:53Z claiming a GPU-side gradient-aliasing bug — was incorrect. I assumed the forward and backward sweeps are separate kernel launches that need to share adstack data across outer iterations. Inspection of runtime.cpp:1601-1604 shows that each call to func(context, tls_ptr, idx) runs a complete forward+backward sweep for one element's inner dynamic loop; the adstack is initialized and fully drained within that single call. The author's defense was correct.

Security risks

No auth, crypto, or permission surfaces are touched. The primary risk is gradient correctness (silent wrong gradients on incorrect indexing or use-after-free GPU memory). The previous reviews exercised those paths thoroughly and all identified issues have fixes in the commit log.

Level of scrutiny

This is production-critical autodiff infrastructure shared across LLVM (CPU/CUDA/AMDGPU) and SPIR-V (Metal/Vulkan) backends. The changes are non-trivial: new heap buffer lifecycle management, SPIR-V capability handling, SPIR-V SSA dominance constraints, and u32-overflow guards for large-stride products. The PR went through 10+ rounds of review with all identified bugs fixed, but the scope warrants a final human read before merge.

Other factors

The bug hunting system found no remaining bugs. Test coverage is comprehensive (new tests for heap-backed large-capacity, mixed f32/non-f32 stacks, many non-f32 stacks, boundary capacity, and the old stack-budget regression). The CapabilityInt64Indexes concern I raised was correctly declined by the author — that capability does not exist in the SPIR-V grammar.

@duburcqa duburcqa force-pushed the duburcqa/heap_backed_adstack branch from 26f4f68 to f016a05 Compare April 19, 2026 13:52
Comment thread quadrants/runtime/llvm/llvm_runtime_executor.cpp Outdated
duburcqa and others added 2 commits April 19, 2026 08:16
`LlvmRuntimeExecutor::synchronize()` on AMDGPU was draining
`AMDGPUContext::kernel_arg_pointer_` via `free_kernel_arg_pointer()`,
which hipFreed the `context_pointer` still in use by the launch currently
in flight. `fetch_result_uint64()` calls `synchronize()` at its top, and
`ensure_adstack_heap()` calls `fetch_result_uint64()` on its first run to
cache `&runtime->adstack_heap_buffer` / `_size`. Net result: the
`RuntimeContext` device allocation was freed mid-launch, HIP's allocator
recycled that address for the adstack heap when the heap was small
enough to fit (e.g. tight `end - begin` sizing for dynamic range_for),
and `hipMemset`/`hipMemcpy` on the heap overwrote the `RuntimeContext`
bytes. The next task's kernel then loaded `ctx->runtime` = 0 and
dereferenced it at the initial scalar-load chain, yielding the
`hipErrorIllegalAddress` we saw at PC offset 0x38 of the gfx11 range_for
kernel. The same bug does not fire with `grid_dim * block_dim` sizing
because the 800+ MB heap is never placed at a recycled small-allocation
address. It does not fire on CUDA because the CUDA sync path has no
equivalent free-list drain.

Free `context_pointer` directly at the end of `launch_llvm_kernel`,
using the same pattern as `device_result_buffer` / `device_arg_buffer`
-- `hipFree` synchronizes implicitly with pending kernels on the device,
so this works for async launches too. Drop the now-unused
`push_back_kernel_arg_pointer` / `free_kernel_arg_pointer` /
`kernel_arg_pointer_` trio from `AMDGPUContext`.

With the underlying bug fixed, `resolve_num_threads` can return the
tight `end - begin` that `runtime/cuda/kernel_launcher.cpp` already uses,
instead of the previous `grid_dim * block_dim` workaround that was
masking this crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ack_heap_ensure.

The comment above the `preallocate_runtime_memory()` call justified why the
bump-allocator chunk is always populated on CUDA / AMDGPU by naming
`adstack_heap_ensure` as the device-side caller of `allocate_aligned`. That
helper was removed earlier in this PR (adstack heap management now lives on the
host via `LlvmRuntimeExecutor::ensure_adstack_heap`; GPU kernels only read
`runtime->adstack_heap_buffer`, they do not allocate). The preallocation is
still necessary for the surviving device-side callers (sparse SNode activation
through `NodeManager`, `runtime_initialize_rand_states_*`, etc.), so the
behavior is right but the named example was misleading. Update the comment to
list accurate callers.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread tests/python/test_adstack.py
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp Outdated
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp Outdated
Comment thread quadrants/runtime/cuda/kernel_launcher.cpp
duburcqa and others added 2 commits April 19, 2026 11:57
Three independent cleanups from the latest review round, no behavior change:

1. tests/python/test_adstack.py:516 — the prose comment said the kernel's
   per-thread adstack footprint is "one byte past the old 262,144 byte
   ceiling". The arithmetic `8 * (8 + 4096 * 8) = 262,208` is correct but
   262,208 - 262,144 = 64, not 1. Fixed the prose.

2. quadrants/runtime/llvm/runtime_module/runtime.cpp:733-744 — the
   `runtime_set_adstack_heap_buffer` / `runtime_set_adstack_heap_size`
   wrappers and the comment block above them documented a JIT-dispatch
   setter path that was replaced by the `runtime_get_adstack_heap_field_ptrs`
   + `memcpy_host_to_device` path earlier in this PR; the two wrappers were
   dead code with no callers. Deleted them along with the stale comment
   block. Also updated the matching `adstack_heap_buffer` field comment in
   `LLVMRuntime` and the stale duplicate block at
   `llvm_runtime_executor.cpp:589-592` plus the `ensure_adstack_heap` doc
   in `llvm_runtime_executor.h:85-90` so all four sites now describe the
   actual cached-field-pointer + memcpy publish mechanism.

3. quadrants/codegen/spirv/spirv_codegen.cpp — the `if (info.elem_type.id
   != backing_type.id)` cast blocks in `visit(AdStackLoadTopAdjStmt)` and
   `visit(AdStackAccAdjointStmt)` were dead: the `heap_kind != heap_int`
   asserts at the top of both visitors exclude the only case where
   `ad_stack_backing_type()` returns a type different from
   `info.elem_type` (the u1->i32 promotion on the int heap). Removed the
   dead blocks and added a brief comment explaining why no cast is needed
   here, unlike the primal visitors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f stack.

Review comment flagged that the CUDA graph launch path in GraphManager::try_launch
bakes every offloaded task into a single graph, but LlvmRuntimeExecutor::ensure_adstack_heap
has to run on the host in between the serial range_for-bounds kernel (which stores
`end_value` into `runtime->temporaries`) and the range_for kernel itself (which reads
that value back to size the heap). Baking both into the same graph means the host never
gets a chance to call ensure_adstack_heap, leaving `runtime->adstack_heap_buffer = nullptr`
and letting every thread write into a near-null address.

The scenario is not reachable through the public API today: `@qd.kernel(graph=True)`
only sets `use_graph` on the primal kernel; `.grad()` returns a separate kernel with
`use_graph = False`, so reverse-mode launches never take the graph path. The existing
`grad_ptr != nullptr` guard in `resolve_ctx_ndarray_ptrs` already rejects the obvious
autograd entry point. Still worth failing loudly if a future caller manages to combine
the two through an internal path, instead of silently running with a nullptr heap.

Guard each task in `try_launch` with a clear error message naming the offending task.
No test -- the combination is not reachable through the public API, and fabricating
one would require the same internal-attribute poke the guard is meant to shield against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread docs/source/user_guide/autodiff.md Outdated
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp Outdated
Comment thread docs/source/user_guide/autodiff.md Outdated
duburcqa and others added 2 commits April 19, 2026 14:35
…n refs.

Three independent comment-level cleanups, no behavior change:

1. docs/source/user_guide/autodiff.md:258,262 -- the memory-cost formula
   used `bytes_per_element = 4 for f32 / i32 and 1 for bool`, which was
   wrong in two ways. On the LLVM backends every slot stores primal +
   adjoint, so f32 costs 8, i32 costs 8, bool costs 2. On the SPIR-V
   backends integer adstacks store only the primal (no adjoint
   accumulation for non-real types) and bool is widened to i32 (since
   `OpTypeBool` has no defined storage layout under LogicalAddressing),
   so f32 costs 8, i32 costs 4, bool costs 4. Rewrote the paragraph to
   state the per-backend breakdown explicitly, and updated the worked
   example (1024x1024 ndrange, default_ad_stack_size=256, four f32
   loop-carried variables) from 4 GB to 8 GB to reflect the missing
   primal+adjoint factor.

2. quadrants/codegen/llvm/{codegen_llvm.h,codegen_llvm.cpp} and
   quadrants/program/compile_config.h -- five comments referenced a
   nonexistent `adstack_heap_ensure` helper. That device-side sizer was
   removed earlier in this PR; the heap is now sized by
   `LlvmRuntimeExecutor::ensure_adstack_heap` on the host before each
   dispatch, and the kernel just reads `runtime->adstack_heap_buffer`
   via the `LLVMRuntime_get_adstack_heap_buffer` getter. Rewrote the
   affected comments to name the actual helpers (host-side sizer, JIT
   getter) and to make the layer boundary explicit: codegen emits a
   load of the published pointer, it does not emit a sizing call.

3. quadrants/codegen/llvm/codegen_llvm.cpp:2158 -- a sixth comment
   pointed at the `runtime_set_adstack_heap_*` wrappers as the publish
   path. Those wrappers were also deleted earlier in this PR; the
   actual path caches the two field device-addresses once via
   `runtime_get_adstack_heap_field_ptrs` and publishes subsequent grows
   through plain `memcpy_host_to_device` writes. Corrected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nge).

Three comment blocks I edited in 1039ce3 / b9152b7 had been wrapped
conservatively around 95-110 cols; this rewraps them to fill up to 115
cols of content (120 - the `// ` / `  // ` prefix) so lines only break
when they have to:

- quadrants/program/compile_config.h: default_ad_stack_size comment.
- quadrants/codegen/llvm/codegen_llvm.h: ad_stack_per_thread_stride_
  field doc and ensure_ad_stack_heap_base_llvm() function doc.
- quadrants/codegen/llvm/codegen_llvm.cpp: ensure_ad_stack_heap_base_llvm
  definition comment and the pre-scan block comment inside
  init_offloaded_task_function.

clang-format now passes cleanly on the touched files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread quadrants/runtime/gfx/runtime.cpp
Review comment flagged that the SPIR-V heap grow branches allocated exactly
`required` bytes, while the LLVM counterpart in
`LlvmRuntimeExecutor::ensure_adstack_heap` uses
`std::max(needed_bytes, 2 * current)`. In a workload that issues K launches
with monotonically increasing dispatch sizes between `synchronize()` calls,
every launch moves the old buffer into `ctx_buffers_` (the deferred-free list
the use-after-free fix added earlier in this PR), accumulating O(K^2 * N)
bytes of live-but-unused GPU memory before the next sync drains the stream.
Doubling bounds the reallocations at O(log K) and the live memory at O(K * N),
matching the LLVM behavior.

Applied to both the float and int heaps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread quadrants/runtime/llvm/llvm_runtime_executor.cpp Outdated
…ee).

Review comment flagged that the comment at `ensure_adstack_heap` above the
`adstack_heap_alloc_ = std::move(new_guard)` line falsely claims
AMDGPU's `dealloc_memory` routes to `hipFree` and implicitly synchronizes.
It does not: `AmdgpuDevice::dealloc_memory` calls
`DeviceMemoryPool::release(release_raw=false)`, which pools the allocation
through `CachingAllocator` without any `hipFree` and without any GPU sync.
CUDA's path does sync via `cuMemFree_v2` (release_raw=true there), so the
CUDA half of the comment was correct.

The code is still safe today -- not because of dealloc_memory, but because
`launch_llvm_kernel` ends with `hipFree(context_pointer)` /
`cuMemFree_v2(context_pointer)` which synchronizes with every kernel
launched during the call. By the time the *next* `launch_llvm_kernel`
reaches `ensure_adstack_heap` and can destroy the previous guard, all
kernels from the prior call have already completed. Spell that actual
invariant out in the comment so a future change to the launcher structure
cannot lean on the false dealloc-sync claim as justification.

Comment-only; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread quadrants/runtime/llvm/llvm_runtime_executor.cpp Outdated
duburcqa and others added 2 commits April 20, 2026 00:20
…text_pointer).

Review comment flagged that the AMDGPU bullet at lines 629-636 attributed
cross-launch adstack safety to both `hipFree(context_pointer)` AND
`cuMemFree_v2(context_pointer)`, but `cuda::KernelLauncher::launch_llvm_kernel`
never allocates a device-side `context_pointer` in the first place — it
passes the host-side `RuntimeContext` by reference via
`{&ctx.get_context()}`. CUDA's safety is already complete at the preceding
CUDA bullet: `CudaDevice::dealloc_memory` passes `release_raw=true`, so the
old heap slab's free goes through `cuMemFree_v2` directly and synchronizes
before returning. Only AMDGPU needs the extra hop (because its
`DeviceMemoryPool::release(release_raw=false)` path pools the slab without
any sync), and the trailing `hipFree(context_pointer)` in
`amdgpu::KernelLauncher::launch_llvm_kernel` provides that sync.

Scoped the paragraph to AMDGPU only and added a sentence noting why CUDA
doesn't need (and doesn't have) a device-side `context_pointer`.

Comment-only; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hughperkins
Collaborator

hughperkins commented Apr 20, 2026

User: look at diff between duburcqa/heap_backed_adstack and origin/duburcqa/fix_adstack_perf

Assistant:

Diff between duburcqa/heap_backed_adstack and origin/duburcqa/fix_adstack_perf: 31 commits + merge, 25 files, +1265 / -200.

Theme: move adstack primal/adjoint storage from per-thread on-chip / stack memory to per-runtime heap buffers, on every backend

This PR closes the "future work" item that fix_adstack_perf explicitly deferred ("Heap-backed SPIR-V adstack, which would lift the per-thread ceiling, is tracked as follow-up"). It also moves LLVM adstacks off the worker-thread stack, which fix_adstack_perf had bounded at 256 KB via a codegen-time guard.

What's in the PR

1. LLVM heap-backed adstack (CPU / CUDA / AMDGPU)

  • runtime.cpp (LLVMRuntime struct): new Ptr adstack_heap_buffer = nullptr; u64 adstack_heap_size = 0; plus STRUCT_FIELD getters. Heap is host-managed; device kernels only read these fields.
  • codegen_llvm.cpp::visit(AdStackAllocaStmt): removed the old create_entry_block_alloca path entirely (and with it the 256 KB function-scope budget guard from fix_adstack_perf). Replaced with base = heap + thread_slot * stride + offset, where heap comes from LLVMRuntime_get_adstack_heap_buffer(runtime) (cached in entry_block to dominate every AdStack* in the task), thread_slot = linear_thread_idx(context) (same helper as rand_states indexing), stride is the per-task sum of all AdStackAllocaStmt::size_in_bytes() rounded up to 8, and offset is this stmt's slot within the per-thread slice.
  • Pre-scan in init_offloaded_task_function: walks the task body before any codegen runs to populate ad_stack_offsets_ and ad_stack_per_thread_stride_. Comment explicitly explains why: growing stride lazily would bake a stale value into earlier allocas' thread_slot * stride math.
  • AdStackSizingInfo in llvm_compiled_data.h: new per-task struct in OffloadedTask that propagates sizing info to the host launcher: per_thread_stride, static_num_threads, dynamic_gpu_range_for, plus const values and gtmps byte offsets for begin/end so the launcher can resolve dynamic range_for bounds at launch time.
  • Per-arch codegen tweaks (codegen_cpu.cpp, codegen_cuda.cpp, codegen_amdgpu.cpp): each backend fills in current_task->ad_stack after its block_dim/grid_dim are finalized. CPU sizes by cpu_thread_id range (independent of iteration count); GPU uses grid_dim * block_dim for const-bound, end - begin resolved at launch for dynamic-bound range_for.
  • LlvmRuntimeExecutor::ensure_adstack_heap(needed_bytes): new method called by each kernel launcher before dispatch. Amortized doubling growth, allocates via the device allocator, publishes new pointer/size to runtime->{adstack_heap_buffer, adstack_heap_size} via host→device memcpy (cached field-pointer addresses on first grow → zero kernel launches per grow).
  • Per-arch launcher changes:
    • CPU (runtime/cpu/kernel_launcher.cpp): adds parallel ad_stack_needed_bytes vector to Context (precomputed at register time since CPU sizing is static). launch_offloaded_tasks calls ensure_adstack_heap per task.
    • CUDA (runtime/cuda/kernel_launcher.cpp): resolve_num_threads resolves dynamic-bound range_for via 4-byte DtoH memcpy from runtime->temporaries.
    • AMDGPU (runtime/amdgpu/kernel_launcher.cpp): same as CUDA, plus a critical lifecycle change — the per-launch RuntimeContext device allocation is now hipFree'd at the end of launch_llvm_kernel rather than deferred to synchronize(). The deferred path collided with ensure_adstack_heap's field-pointer query (which calls synchronize indirectly via fetch_result_uint64), causing HIP to recycle the freed context_pointer address as the new adstack heap allocation and clobber the next task's RuntimeContext. Removes AMDGPUContext::push_back_kernel_arg_pointer / free_kernel_arg_pointer entirely.
    • CUDA graph manager (runtime/cuda/graph_manager.cpp): adds a hard QD_ERROR_IF rejecting graph=True on kernels whose tasks have per_thread_stride > 0. Reason: graph baking precludes the host-side ensure_adstack_heap between the bounds-resolving serial kernel and the range_for kernel itself.
  • materialize_runtime: changed from if (!use_device_memory_pool()) preallocate_runtime_memory() to always preallocate on CUDA / AMDGPU, with a new comment explaining that device-side runtime helpers (NodeManager, random-state init) call LLVMRuntime::allocate_aligned from a kernel and the host-only host_allocator falls through with an invalid PC if the bump-allocator chunk is empty. Latent bug surfaced by this PR's heavier use of the runtime-side allocator path.
  • finalize: releases adstack_heap_alloc_ before device teardown so its DeviceAllocationGuard runs while the RHI device is still valid.

2. SPIR-V heap-backed adstack (Metal / Vulkan)

  • Two new buffer types: BufferType::AdStackHeapFloat (backs f32 adstacks) and BufferType::AdStackHeapInt (backs i32 and u1 adstacks; u1 stored as i32 to match the IRBuilder::get_array_type bool→int promotion). Other primitive types (f64, i64, ...) are hard-errored by the codegen — no Function-scope fallback exists anymore. Commit 4382ee14e removed the dead fallback path.
  • Pre-scan in TaskCodegen::run (mirroring the LLVM pre-scan): walks the task body and accumulates ad_stack_heap_per_thread_stride_float_ and ad_stack_heap_per_thread_stride_int_ before any visitor runs. Notable: int adstacks count only primal storage, no adjoint, because auto_diff.cpp only emits AdStackAccAdjoint/LoadTopAdj on real-typed stacks (gradients silently read as zero through integer casts per the docs).
  • Eager invoc_id * stride emission at the alloca site, not lazy at first push (commit f0f928155). Comment explicitly explains the SPIR-V dominance violation that lazy emission would create: two sibling inner loops would reuse an SSA id defined in the first loop's body, which doesn't dominate the second.
  • u32-vs-u64 index arithmetic (commit d185383b6): when the device exposes Int64, invoc_id * stride + offset + count is widened to u64; otherwise the codegen emits u32 OpIMul and the runtime asserts at launch time that stride * dispatched_threads <= u32_max to catch silent wrap-around aliasing into another thread's slice.
  • GFX runtime (runtime/gfx/runtime.cpp/h): per-dispatch adstack_heap_buffer_float_ and adstack_heap_buffer_int_. Same amortized-doubling growth as the LLVM side. Empty-dispatch guard binds kDeviceNullAllocation instead of asking the RHI for a zero-sized buffer (which trips RHI_ASSERT(size > 0)). Old buffers on grow are moved into ctx_buffers_ (deferred-free) rather than freed synchronously, so in-flight cmdlists referencing them stay valid.
  • flush() no longer clears ctx_buffers_ (commit 01b6fb4f1) — submit() returns when the cmdlist is queued, not when execution completes; the deferred-free buffers may still be referenced by in-flight commands. Now only synchronize() (after wait_idle() drains the stream) clears the vector. Latent use-after-free fix.
  • buffers_name() extended to cover all the new BufferTypes, fixing a QD_ERROR("unrecognized buffer type") that would have triggered on the new bindings.

3. Sizing knob & defaults

  • default_ad_stack_size raised from 32 → 256 (commits 74cfbeb67, d9813a306). Justified by: the per-thread on-chip / worker-stack budget that bounded the previous Function-scope implementation no longer applies, and 256 covers the flat trip count of a typical qd.ndrange(16, 16).
  • Comment on default_ad_stack_size rewritten to reflect the new heap-backing reality (drops the "SPIR-V Function-scope cap" justification).
  • default_ad_stack_size exposed as a qd.init() knob (commits 945cf37ad, 17a5ea112, plus a Python binding entry in export_lang.cpp).

4. Documentation (docs/source/user_guide/autodiff.md)

  • Drops the "SPIR-V on-chip cap" limitation from the Known limitations list.
  • New "Tuning the capacity" section explaining default_ad_stack_size vs ad_stack_size, when to prefer each, and the K + 2-pushes-per-iteration rule for picking N.
  • New "Memory cost" section with the num_threads * stack_size * bytes_per_element * num_loop_carried_variables formula and per-backend element-size table (LLVM = 8 B for f32 and i32 because primal+adjoint; SPIR-V = 8 for f32, 4 for i32 because primal-only, 4 for bool widened to i32). Concrete worked example: ndrange(1024, 1024) × default_ad_stack_size=256 × 4 f32 vars ≈ 8 GB.
  • Order of remedies for OOM errors (drop default_ad_stack_size, reduce loop-carried vars, raise device_memory_fraction).

5. Tests (tests/python/test_adstack.py, +327 / -42)

Substantial test additions exercising the new heap path: per-dispatch grow-on-demand, dynamic-range_for sizing, the graph=True rejection on CUDA, the SPIR-V Int64-or-overflow assert, the unsupported-primitive-type rejection (with a shaderInt8 skip per commit bfac9bb3b), reuse across launches, etc.

Quick read

This is the architectural follow-up to the entire prior chain. The earlier PRs fixed correctness bugs in adstack on the existing on-chip / stack-frame storage model and ported it to SPIR-V; this PR replaces the storage model on every backend with a host-grown heap, then deletes the workarounds the old model required (the LLVM 256 KB codegen guard, the SPIR-V Function-scope cap docs, the AMDGPU deferred-free of context_pointer, and the SPIR-V Function-scope fallback path for non-real types). It also catches three latent bugs along the way: (a) the AMDGPU context_pointer recycle race, (b) flush() clearing ctx_buffers_ while in-flight commands still reference them, and (c) materialize_runtime skipping the bump-allocator preallocation when a memory pool was reported. The CUDA graph=True rejection is a deliberate scope cut — graph baking and host-side ensure_adstack_heap between dispatches are fundamentally incompatible — and the docs add the missing user-facing capacity-tuning guidance that becomes meaningful now that the cap is no longer a hard ceiling.

Roughly the same blast radius as fix_adstack_perf (25 files, +1265 lines), and similarly cross-cutting — touches LLVM codegen, SPIR-V codegen, three per-arch launchers, the runtime executor, the gfx runtime, and the docs. Same splittability question applies.

User: thoughts on orthogonal splittable changes in this pr?

Assistant:

Orthogonal splits that land cleanly

Several chunks here are genuinely independent of the heap migration itself. I'd pull these out first because they have their own story and their own test, and they unclutter the review of the main change:

Pre-existing bug fixes (land before anything heap-related)

  1. AMDGPUContext kernel-arg deferred-free → synchronous hipFree in launch_llvm_kernel. Commits 73ebd99ba, 96fc260e2, df52025af, plus the amdgpu_context.h cleanup. The bug is real today: LlvmRuntimeExecutor::synchronize() drains the deferred-free list, and anything that calls synchronize transitively (result fetches, the new adstack field-pointer query, a debug flush) can recycle a live context_pointer address. A good standalone test is "call runtime_jit->call<void *>(...) mid-launch and verify the next kernel's RuntimeContext wasn't clobbered." ~1 commit's worth, high correctness value, zero tie to adstack.
  2. GfxRuntime::flush() stops clearing ctx_buffers_. Commit 01b6fb4f1 (partially). Also latent: submit() returns when queued, not when executed; any deferred-free buffer put into ctx_buffers_ today (not just the new heap ones) is vulnerable. This is a one-line move from flush into synchronize + a test that holds a ref across a submit and forces a DtoH. Zero heap dependency.
  3. materialize_runtime always preallocates on CUDA/AMDGPU. Removal of the if (!use_device_memory_pool()) gate. The comment spells it out — the device-side LLVMRuntime::allocate_aligned hits host_allocator with an invalid PC when the bump chunk is empty. Surfaced by this PR's heavier runtime-side allocation, but the underlying landmine is pre-existing. Lands standalone with a regression test that exercises NodeManager / sparse activation on a memory-pool-reporting driver.
  4. buffers_name() covers ListGen/ExtArr/AdStackOverflow. Part of commit 01b6fb4f1. The AdStackOverflow entry was added in llvm_adstack_safety but the function never got the corresponding if branch, so any debug log path that called buffers_name on one of those buffer types was hitting QD_ERROR("unrecognized buffer type"). Trivially independent.

Together these four are a small "adstack-adjacent bug fixes" PR (~150 lines). Reviewers can merge them without reading any codegen.

The knob exposure

  1. default_ad_stack_size exposed via qd.init() + docs section on tuning capacity. Commits 945cf37ad, 17a5ea112, the export_lang.cpp binding, and the docs "Tuning the capacity" block. This works on top of the current Function-scope storage too — the knob just changes the fallback default. It's a usability fix that the heap-backing PR later leans on, but it doesn't require heap-backing to be correct or useful. Ships with a test that sets the knob from Python and verifies the C++ default_ad_stack_size follows.

The two heap migrations themselves

  1. LLVM heap-backed adstack. codegen_llvm.*, llvm_compiled_data.h, runtime.cpp, llvm_runtime_executor.*, compile_config.h (default bump), per-arch codegen tweaks, per-arch launchers, CUDA graph=True rejection, docs updates for the LLVM side.
  2. SPIR-V heap-backed adstack. spirv_codegen.*, kernel_utils.*, runtime/gfx/runtime.*, docs updates for the SPIR-V side (drop the on-chip cap limitation, add bool→i32 element-size note), plus the shaderInt8 test-skip.

These two don't share code. They share a design pattern (pre-scan per-task stride, per-thread slice via invoc_id * stride + offset), but the LLVM side publishes via a LLVMRuntime struct field and the SPIR-V side via a dedicated StorageBuffer binding. They could ship in either order. Splitting them cuts the review surface roughly in half (~600 lines each) and, more importantly, lets the author write focused tests — the LLVM PR's test asserts behavior across adstacks that used to OOM the worker-thread stack, and the SPIR-V PR's test asserts behavior across adstacks that used to blow past the Function-scope driver cap. Those are two different failure modes and the current combined test file blurs them.

The SPIR-V fallback removal

  1. Hard-error non-{f32,i32,u1} primitive types in SPIR-V adstack. Commit 4382ee14e. Technically this is coupled to the SPIR-V heap PR (the fallback path exists to cover types the heap doesn't handle), but it's worth calling out as its own logical hunk in the commit structure so a reviewer doesn't have to guess why f64 suddenly became a compile-time error.

What stays bundled

These three must travel together; splitting them creates a broken intermediate state:

  • LLVM heap-backing + per-arch launcher ensure_adstack_heap calls + AdStackSizingInfo propagation. The codegen produces IR that reads from a heap pointer the launcher hasn't sized yet if you ship either half alone.
  • default_ad_stack_size bump 32 → 256. Without the heap this turns a Metal pipeline-creation failure into... still a Metal pipeline-creation failure, just sooner. Ships with the SPIR-V heap PR.
  • CUDA graph=True rejection. Only makes sense once the CUDA path actually calls ensure_adstack_heap between tasks. Ships with the LLVM heap PR.

Suggested order

1. amdgpu-kernel-arg-synchronous-free      (correctness, standalone)
2. gfx-flush-no-clear-ctx-buffers           (correctness, standalone)
3. llvm-always-preallocate-bump-chunk       (correctness, standalone)
4. spirv-buffers-name-coverage              (trivial, standalone)
5. expose-default-ad-stack-size-knob + docs (UX, no storage change)
6. llvm-heap-backed-adstack                 (core, ~600 lines)
7. spirv-heap-backed-adstack                (core, ~600 lines; reject non-primitive)

1–4 are review-in-10-minutes. 5 is a small API surface change with an obvious test. 6 and 7 are the two big ones and they're genuinely independent of each other, which is the main win: if 7 turns up a driver-specific issue on AMD Vulkan, 6 doesn't get held up, and vice versa.

The one split I'd not try to force is separating the LLVM codegen change from its launcher changes — AdStackSizingInfo is the handshake between them and there's no coherent PR boundary that keeps only one side.

@hughperkins
Collaborator

Thoughts on splitting as follows?:

  1. amdgpu-kernel-arg-synchronous-free (correctness, standalone)
  2. gfx-flush-no-clear-ctx-buffers (correctness, standalone)
  3. llvm-always-preallocate-bump-chunk (correctness, standalone)
  4. spirv-buffers-name-coverage (trivial, standalone)
  5. expose-default-ad-stack-size-knob + docs (UX, no storage change)
  6. llvm-heap-backed-adstack (core, ~600 lines)
  7. spirv-heap-backed-adstack (core, ~600 lines; reject non-primitive)

…teration bound; surface Metal alloc failures instead of silent nil
Comment on lines +591 to +605
size_t new_size = std::max(required, 2 * adstack_heap_buffer_float_size_);
auto [buf, res] = device_->allocate_memory_unique(
    {new_size, /*host_write=*/false, /*host_read=*/false, /*export_sharing=*/false, AllocUsage::Storage});
QD_ASSERT_INFO(res == RhiResult::success, "Failed to allocate adstack heap float buffer (size={})",
               new_size);
// Defer the old buffer's free until the current cmdlist is submitted and synced: the previous launch
// may still be in flight and referencing the old allocation, so freeing it synchronously here (via
// `DeviceAllocationGuard`'s destructor, which runs on the `std::move` reassignment below) would
// produce a GPU-side use-after-free. `ctx_buffers_` is cleared in `synchronize()` /
// `device_to_host(...)` after the stream has drained, which is exactly the lifetime we need.
if (adstack_heap_buffer_float_) {
  ctx_buffers_.push_back(std::move(adstack_heap_buffer_float_));
}
adstack_heap_buffer_float_ = std::move(buf);
adstack_heap_buffer_float_size_ = new_size;

🔴 The amortized-doubling grow paths for both the float heap (AdStackHeapFloat) and int heap (AdStackHeapInt) in gfx/runtime.cpp compute new_size = max(required, 2 * old_size) and immediately assert RhiResult::success with no fallback retry at exactly required bytes. On Metal, where maxBufferLength is capped (~256 MB on many devices), this crashes the process with a spurious assertion even when the smaller required allocation would have succeeded — e.g. old_size=150 MB, required=165 MB yields new_size=300 MB, which fails with out_of_memory, aborting the program despite a 165 MB allocation being viable. The fix is to retry with new_size = required when the doubled allocation fails and new_size > required, before calling QD_ASSERT_INFO.

Extended reasoning...

What the bug is and how it manifests

In quadrants/runtime/gfx/runtime.cpp, both the float-heap grow branch and the int-heap grow branch added by this PR use amortized doubling to reduce reallocation frequency:

size_t new_size = std::max(required, 2 * adstack_heap_buffer_float_size_);
auto [buf, res] = device_->allocate_memory_unique({new_size, ...});
QD_ASSERT_INFO(res == RhiResult::success, "Failed to allocate adstack heap float buffer (size={})", new_size);

There is no fallback path that retries with exactly required bytes if the doubled allocation fails. On Metal, every StorageBuffer is subject to the device's maxBufferLength cap (~256 MB on older Apple Silicon/Intel iGPU configurations, as low as ~128 MB on some). When old_size > maxBufferLength/2 but required <= maxBufferLength, the doubled new_size exceeds the cap and allocate_memory_unique returns RhiResult::out_of_memory — triggering the assertion and aborting the process, even though the exact required allocation would have succeeded.

The specific code path that triggers it

This bug was introduced by the amortized-doubling commit (e9b5e48) that added std::max(required, 2 * current_size). That commit correctly fixed the O(K^2 * N) accumulated GPU memory problem, but the new allocation strategy has no recovery path for the case where 2 * old_size > maxBufferLength >= required. Notably, the Metal nil-check fix added by this same PR to metal_device.mm now causes newBufferWithLength: returning nil to propagate as RhiResult::out_of_memory rather than silently wrapping nil — which means this assertion is now reachable and crash-inducing on Metal, whereas previously it manifested as silent NaN gradient corruption.

Why existing code does not prevent it

The guard condition adstack_heap_buffer_float_size_ < required correctly identifies that growth is needed. The doubled new_size is computed, passed to allocate_memory_unique, and the result is immediately asserted with no intervening check for res != RhiResult::success. There is no conditional block of the form "if the doubled size fails and required < new_size, retry with new_size = required". The same structural deficiency exists identically in the int-heap grow branch.

What the impact is

Any Metal workload that triggers adstack heap growth from a size above maxBufferLength/2 but below maxBufferLength will abort with a QD_ASSERT_INFO failure, even though the actually-needed allocation would have succeeded. Concrete example: device with maxBufferLength = 256 MB, first kernel dispatch sizes the heap at 150 MB; a second slightly larger dispatch (165 MB) triggers the grow, requests 300 MB, gets rejected by Metal, and crashes — despite a 165 MB request being well within the budget.

How to fix it

Add a fallback retry at exactly required bytes when the doubled allocation fails and new_size > required, before the assertion:

size_t new_size = std::max(required, 2 * adstack_heap_buffer_float_size_);
auto [buf, res] = device_->allocate_memory_unique({new_size, ...});
if (res != RhiResult::success && new_size > required) {
    new_size = required;
    std::tie(buf, res) = device_->allocate_memory_unique({new_size, ...});
}
QD_ASSERT_INFO(res == RhiResult::success, "Failed to allocate adstack heap float buffer (size={})", new_size);

The same fix applies to the int-heap branch. This mirrors the standard two-phase allocator pattern: try the amortized size first; fall back to the minimum viable size if the device cap prevents the doubled request.

Step-by-step proof

  1. First kernel dispatch: adstack_heap_buffer_float_size_ = 0, required = 150 MB. new_size = max(150, 0) = 150 MB. Allocation succeeds. Float heap is 150 MB.
  2. Second kernel dispatch: required = 165 MB > 150 MB, so the grow branch fires. new_size = max(165, 2*150) = 300 MB.
  3. allocate_memory_unique({300 MB, ...}) calls metal_device.mm's newBufferWithLength:300MB. This exceeds maxBufferLength = 256 MB, returns nil. The nil-check fix in this PR returns RhiResult::out_of_memory.
  4. QD_ASSERT_INFO(res == RhiResult::success, ...) fires. Process aborts with the allocation failure message.
  5. A 165 MB allocation would have succeeded, but it is never attempted. The process has terminated unnecessarily.

Comment on lines +172 to +176
// Per-runtime adstack heap slab, owned here. `ensure_adstack_heap` grows via the driver allocator and
// publishes the new pointer/size into the LLVMRuntime struct; replacing `adstack_heap_alloc_` releases the
// previous allocation via `DeviceAllocationGuard`, which calls `llvm_device()->dealloc_memory`. On CUDA / AMDGPU
// `cuMemFree_v2` / `hipFree` synchronize implicitly with respect to pending kernels on the device, so the old
// slab cannot be freed while an in-flight kernel still holds its cached base pointer.

🟡 The comment at llvm_runtime_executor.h:174-176 (added by this PR) incorrectly claims that on AMDGPU, hipFree synchronizes implicitly with pending kernels when dealloc_memory releases the old adstack slab, but AmdgpuDevice::dealloc_memory() actually routes through DeviceMemoryPool::release(release_raw=false) -> CachingAllocator::release without calling hipFree and without any GPU synchronization. The same incorrect claim in llvm_runtime_executor.cpp was corrected by commit df52025af, but the matching header comment was not updated, leaving a maintenance trap: a developer reading only the header could conclude dealloc_memory provides the AMDGPU sync barrier and safely remove the hipFree(context_pointer) at the end of amdgpu::KernelLauncher::launch_llvm_kernel, which would break AMDGPU cross-launch memory safety.

Extended reasoning...

What the bug is. The comment at llvm_runtime_executor.h:172-176 reads: "On CUDA / AMDGPU cuMemFree_v2 / hipFree synchronize implicitly with respect to pending kernels on the device, so the old slab cannot be freed while an in-flight kernel still holds its cached base pointer." This claim is accurate for CUDA (CudaDevice::dealloc_memory routes through DeviceMemoryPool::release(release_raw=true) -> cuMemFree_v2, which is synchronizing), but factually wrong for AMDGPU.

The specific code path that contradicts it. AmdgpuDevice::dealloc_memory() for adstack slab allocations (created with use_cached=false, use_preallocated=false) reaches DeviceMemoryPool::get_instance(Arch::amdgpu, false).release(info.size, info.ptr) with release_raw defaulting to false. DeviceMemoryPool::release with release_raw=false calls allocator_->release(size, ptr) in CachingAllocator, which simply pools the address for reuse. No hipFree is called; no GPU synchronization occurs.

Why the code is nonetheless safe today. AMDGPU cross-launch memory safety comes from a completely different mechanism: amdgpu::KernelLauncher::launch_llvm_kernel ends with AMDGPUDriver::get_instance().mem_free(context_pointer), and hipFree synchronizes implicitly with every kernel launched during that call. By the time the next launch_llvm_kernel reaches ensure_adstack_heap and could destroy the previous DeviceAllocationGuard, no in-flight GPU kernel from the prior call is still referencing the old slab. The correct explanation is in the .cpp comment that was updated by commit df52025af, but the matching .h comment was overlooked.

The maintenance trap. A developer auditing the AMDGPU memory-safety invariant reads llvm_runtime_executor.h:174-176 and sees that dealloc_memory provides the synchronization barrier. Armed with that belief, they might conclude that the final hipFree(context_pointer) in launch_llvm_kernel is redundant (since dealloc_memory supposedly syncs already) and remove it as cleanup. After that change the actual synchronization barrier is gone, but the header comment still asserts it exists — producing a real use-after-free on AMDGPU with no static or runtime check to catch it. The .cpp comment (fixed in df52025af) correctly explains that CachingAllocator::release does not call hipFree; the header and the implementation comments are now directly contradictory.

How to fix it. Update llvm_runtime_executor.h:172-176 to match the .cpp comment: state that CUDA deallocation routes through cuMemFree_v2 (synchronizing), while AMDGPU deallocation routes through CachingAllocator::release (non-synchronizing, physical memory stays mapped), and that the AMDGPU cross-launch safety invariant comes from the hipFree(context_pointer) at the end of each launch_llvm_kernel call.
