[AutoDiff] Autodiff 9: Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap #493
duburcqa wants to merge 33 commits into duburcqa/fix_adstack_perf from duburcqa/heap_backed_adstack
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e59c3d40a7
@claude review
The GPU-side grid-stride aliasing concern I raised in my prior comment was a misunderstanding — each outer iteration runs a complete forward+backward sweep within a single func(context, tls_ptr, idx) invocation, so the per-iteration stack_init reset is correct and harmless. All previously flagged bugs have been addressed. The PR is large (14 files, new heap-backed adstack across both LLVM and SPIR-V backends) and touches critical autodiff infrastructure, so a human pass before merge is warranted.
Extended reasoning...
Overview
PR #493 moves the SPIR-V adstack's primal/adjoint storage from Function-scope (per-thread on-chip) to per-dispatch StorageBuffers sliced by invocation ID, and simultaneously moves the LLVM backend's adstack from worker-thread stack allocas to a per-runtime heap slab. It touches 14 files across codegen, runtime, tests, and documentation.
Resolution of prior concerns
All 12+ bugs found across previous review rounds have been addressed in follow-up commits. The final open item — my comment at 2026-04-19T10:33:53Z claiming a GPU-side gradient-aliasing bug — was incorrect. I assumed the forward and backward sweeps are separate kernel launches that need to share adstack data across outer iterations. Inspection of runtime.cpp:1601-1604 shows each call to `func(context, tls_ptr, idx)` runs a complete forward+backward sweep for one element's inner dynamic loop; the adstack is initialized and fully drained within that single call. The author's defense was correct.
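The per-invocation lifecycle described above can be sketched as a small Python model (hypothetical names — `run_element` and `grid_stride_sweep` are illustrative, not the actual runtime code): each outer grid-stride iteration initializes its adstack, the forward sweep pushes, the backward sweep drains, so nothing survives into the next iteration.

```python
def run_element(stack, n):
    # forward sweep: push loop-carried primal values
    for i in range(n):
        stack.append(float(i))
    # backward sweep: pop everything back off in reverse order
    total = 0.0
    while stack:
        total += stack.pop()
    return total

def grid_stride_sweep(thread_id, stride, num_elements):
    results = {}
    for idx in range(thread_id, num_elements, stride):
        stack = []                       # per-iteration stack_init reset
        results[idx] = run_element(stack, idx + 1)
        assert not stack                 # fully drained before the next element
    return results

# one thread of a hypothetical 4-thread dispatch over 10 elements
out = grid_stride_sweep(thread_id=1, stride=4, num_elements=10)
```

Because the stack is empty on exit from every `run_element` call, the per-iteration reset is a no-op rather than a data-loss hazard — which is why the aliasing concern did not apply.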
Security risks
No auth, crypto, or permission surfaces are touched. The primary risk is gradient correctness (silent wrong gradients on incorrect indexing or use-after-free GPU memory). The previous reviews exercised those paths thoroughly and all identified issues have fixes in the commit log.
Level of scrutiny
This is production-critical autodiff infrastructure shared across LLVM (CPU/CUDA/AMDGPU) and SPIR-V (Metal/Vulkan) backends. The changes are non-trivial: new heap buffer lifecycle management, SPIR-V capability handling, SPIR-V SSA dominance constraints, and u32-overflow guards for large-stride products. The PR went through 10+ rounds of review with all identified bugs fixed, but the scope warrants a final human read before merge.
Other factors
The bug hunting system found no remaining bugs. Test coverage is comprehensive (new tests for heap-backed large-capacity, mixed f32/non-f32 stacks, many non-f32 stacks, boundary capacity, and the old stack-budget regression). The CapabilityInt64Indexes concern I raised was correctly declined by the author — that capability does not exist in the SPIR-V grammar.
`LlvmRuntimeExecutor::synchronize()` on AMDGPU was draining `AMDGPUContext::kernel_arg_pointer_` via `free_kernel_arg_pointer()`, which hipFreed the `context_pointer` still in use by the launch currently in flight. `fetch_result_uint64()` calls `synchronize()` at its top, and `ensure_adstack_heap()` calls `fetch_result_uint64()` on its first run to cache `&runtime->adstack_heap_buffer` / `_size`. Net result: the `RuntimeContext` device allocation was freed mid-launch, HIP's allocator recycled that address for the adstack heap when the heap was small enough to fit (e.g. tight `end - begin` sizing for dynamic range_for), and `hipMemset`/`hipMemcpy` on the heap overwrote the `RuntimeContext` bytes. The next task's kernel then loaded `ctx->runtime` = 0 and dereferenced it at the initial scalar-load chain, yielding the `hipErrorIllegalAddress` we saw at PC offset 0x38 of the gfx11 range_for kernel.

The same bug does not fire with `grid_dim * block_dim` sizing because the 800+ MB heap is never placed at a recycled small-allocation address. It does not fire on CUDA because the CUDA sync path has no equivalent free-list drain.

Free `context_pointer` directly at the end of `launch_llvm_kernel`, using the same pattern as `device_result_buffer` / `device_arg_buffer` -- `hipFree` synchronizes implicitly with pending kernels on the device, so this works for async launches too. Drop the now-unused `push_back_kernel_arg_pointer` / `free_kernel_arg_pointer` / `kernel_arg_pointer_` trio from `AMDGPUContext`.

With the underlying bug fixed, `resolve_num_threads` can return the tight `end - begin` that `runtime/cuda/kernel_launcher.cpp` already uses, instead of the previous `grid_dim * block_dim` workaround that was masking this crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
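The address-recycling failure mode in this commit can be reproduced with a toy free-list allocator (entirely hypothetical — `ToyAllocator` only mimics the reuse-freed-blocks-first behavior attributed to HIP's allocator above): freeing the context while a launch is conceptually still in flight lets the next small allocation land at the same address, so a memset on the heap would clobber the context.

```python
class ToyAllocator:
    """Toy address-recycling allocator: reuses freed blocks before bumping."""
    def __init__(self):
        self.next_addr = 0x1000
        self.free_list = []              # freed (addr, size) pairs

    def alloc(self, size):
        for i, (addr, sz) in enumerate(self.free_list):
            if sz >= size:               # recycle a freed block first
                self.free_list.pop(i)
                return addr
        addr = self.next_addr
        self.next_addr += size
        return addr

    def free(self, addr, size):
        self.free_list.append((addr, size))

# buggy order: context freed mid-launch, then a small heap allocation follows
buggy = ToyAllocator()
ctx = buggy.alloc(256)                   # RuntimeContext device allocation
buggy.free(ctx, 256)                     # premature free (launch still in flight)
heap = buggy.alloc(64)                   # small adstack heap fits the recycled block
clobbered = (heap == ctx)                # a memset on `heap` would overwrite ctx

# fixed order: context stays alive until the launch completes
fixed = ToyAllocator()
ctx2 = fixed.alloc(256)
heap2 = fixed.alloc(64)                  # heap lands at a fresh address
fixed.free(ctx2, 256)                    # freed only after the kernel finished
safe = (heap2 != ctx2)
```

This also shows why the large `grid_dim * block_dim` heap masked the bug: a request bigger than any freed block never recycles the context's address.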
…ack_heap_ensure.

The comment above the `preallocate_runtime_memory()` call justified why the bump-allocator chunk is always populated on CUDA / AMDGPU by naming `adstack_heap_ensure` as the device-side caller of `allocate_aligned`. That helper was removed earlier in this PR (adstack heap management now lives on the host via `LlvmRuntimeExecutor::ensure_adstack_heap`; GPU kernels only read `runtime->adstack_heap_buffer`, they do not allocate).

The preallocation is still necessary for the surviving device-side callers (sparse SNode activation through `NodeManager`, `runtime_initialize_rand_states_*`, etc.), so the behavior is right but the named example was misleading. Update the comment to list accurate callers. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three independent cleanups from the latest review round, no behavior change:

1. tests/python/test_adstack.py:516 — the prose comment said the kernel's per-thread adstack footprint is "one byte past the old 262,144 byte ceiling". The arithmetic `8 * (8 + 4096 * 8) = 262,208` is correct, but 262,208 - 262,144 = 64, not 1. Fixed the prose.

2. quadrants/runtime/llvm/runtime_module/runtime.cpp:733-744 — the `runtime_set_adstack_heap_buffer` / `runtime_set_adstack_heap_size` wrappers and the comment block above them documented a JIT-dispatch setter path that was replaced by the `runtime_get_adstack_heap_field_ptrs` + `memcpy_host_to_device` path earlier in this PR; the two wrappers were dead code with no callers. Deleted them along with the stale comment block. Also updated the matching `adstack_heap_buffer` field comment in `LLVMRuntime`, the stale duplicate block at `llvm_runtime_executor.cpp:589-592`, and the `ensure_adstack_heap` doc in `llvm_runtime_executor.h:85-90`, so all four sites now describe the actual cached-field-pointer + memcpy publish mechanism.

3. quadrants/codegen/spirv/spirv_codegen.cpp — the `if (info.elem_type.id != backing_type.id)` cast blocks in `visit(AdStackLoadTopAdjStmt)` and `visit(AdStackAccAdjointStmt)` were dead: the `heap_kind != heap_int` asserts at the top of both visitors exclude the only case where `ad_stack_backing_type()` returns a type different from `info.elem_type` (the u1->i32 promotion on the int heap). Removed the dead blocks and added a brief comment explaining why no cast is needed here, unlike the primal visitors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
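The footprint arithmetic from item 1 checks out directly (the slot layout below — 8 stacks, each an 8-byte header plus 4096 slots of 8 bytes — is taken from the commit message, not from the source tree):

```python
num_stacks = 8
header_bytes = 8
capacity = 4096        # slots per stack
slot_bytes = 8         # primal + adjoint, 4 bytes each

# per-thread adstack footprint of the test kernel
footprint = num_stacks * (header_bytes + capacity * slot_bytes)

# distance past the old per-thread ceiling
overshoot = footprint - 262_144
```

So the test kernel overshoots the old ceiling by 64 bytes, not one, which is exactly what the prose fix corrects.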
…f stack.

Review comment flagged that the CUDA graph launch path in GraphManager::try_launch bakes every offloaded task into a single graph, but LlvmRuntimeExecutor::ensure_adstack_heap has to run on the host in between the serial range_for-bounds kernel (which stores `end_value` into `runtime->temporaries`) and the range_for kernel itself (which reads that value back to size the heap). Baking both into the same graph means the host never gets a chance to call ensure_adstack_heap, leaving `runtime->adstack_heap_buffer = nullptr` and letting every thread write into a near-null address.

The scenario is not reachable through the public API today: `@qd.kernel(graph=True)` only sets `use_graph` on the primal kernel; `.grad()` returns a separate kernel with `use_graph = False`, so reverse-mode launches never take the graph path. The existing `grad_ptr != nullptr` guard in `resolve_ctx_ndarray_ptrs` already rejects the obvious autograd entry point. Still worth failing loudly if a future caller manages to combine the two through an internal path, instead of silently running with a nullptr heap. Guard each task in `try_launch` with a clear error message naming the offending task.

No test -- the combination is not reachable through the public API, and fabricating one would require the same internal-attribute poke the guard is meant to shield against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
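A guard of the shape this commit describes might look like the following sketch (hypothetical names throughout — the real code lives in C++ in `GraphManager::try_launch`; `uses_adstack_heap` is an invented task attribute standing in for whatever marker the actual guard inspects):

```python
def try_launch(tasks):
    """Refuse to bake adstack-heap tasks into a single graph.

    Heap sizing must run on the host between the bounds kernel and the
    range_for kernel, which a monolithic graph launch cannot accommodate.
    """
    for task in tasks:
        if task.get("uses_adstack_heap"):
            raise RuntimeError(
                f"task '{task['name']}' needs host-side adstack heap sizing "
                "between kernels and cannot be captured into a CUDA graph")
    return "launched"

status = try_launch([{"name": "primal_range_for"}])
```

Failing loudly here converts a silent near-null write into a diagnosable error naming the offending task, which is the whole point of the guard.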
…n refs.
Three independent comment-level cleanups, no behavior change:
1. docs/source/user_guide/autodiff.md:258,262 -- the memory-cost formula
used `bytes_per_element = 4 for f32 / i32 and 1 for bool`, which was
wrong in two ways. On the LLVM backends every slot stores primal +
adjoint, so f32 costs 8, i32 costs 8, bool costs 2. On the SPIR-V
backends integer adstacks store only the primal (no adjoint
accumulation for non-real types) and bool is widened to i32 (since
`OpTypeBool` has no defined storage layout under LogicalAddressing),
so f32 costs 8, i32 costs 4, bool costs 4. Rewrote the paragraph to
state the per-backend breakdown explicitly, and updated the worked
example (1024x1024 ndrange, default_ad_stack_size=256, four f32
loop-carried variables) from 4 GB to 8 GB to reflect the missing
primal+adjoint factor.
2. quadrants/codegen/llvm/{codegen_llvm.h,codegen_llvm.cpp} and
quadrants/program/compile_config.h -- five comments referenced a
nonexistent `adstack_heap_ensure` helper. That device-side sizer was
removed earlier in this PR; the heap is now sized by
`LlvmRuntimeExecutor::ensure_adstack_heap` on the host before each
dispatch, and the kernel just reads `runtime->adstack_heap_buffer`
via the `LLVMRuntime_get_adstack_heap_buffer` getter. Rewrote the
affected comments to name the actual helpers (host-side sizer, JIT
getter) and to make the layer boundary explicit: codegen emits a
load of the published pointer, it does not emit a sizing call.
3. quadrants/codegen/llvm/codegen_llvm.cpp:2158 -- a sixth comment
pointed at the `runtime_set_adstack_heap_*` wrappers as the publish
path. Those wrappers were also deleted earlier in this PR; the
actual path caches the two field device-addresses once via
`runtime_get_adstack_heap_field_ptrs` and publishes subsequent grows
through plain `memcpy_host_to_device` writes. Corrected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
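The corrected worked example from item 1 can be verified with a few lines of arithmetic (the LLVM-backend cost model — every slot stores primal + adjoint at 4 bytes each for f32 — is taken from the commit message above):

```python
threads = 1024 * 1024      # 1024x1024 ndrange, one adstack set per thread
stack_slots = 256          # default_ad_stack_size
num_stacks = 4             # four f32 loop-carried variables
bytes_per_slot = 8         # primal + adjoint, 4 bytes each (LLVM backends)

total_bytes = threads * num_stacks * stack_slots * bytes_per_slot
total_gib = total_bytes / 2**30
```

The old doc's 4 GB figure corresponds to counting only the primal (4 bytes per slot); including the adjoint doubles it to 8 GiB, matching the updated example.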
…nge). Three comment blocks I edited in 1039ce3 / b9152b7 had been wrapped conservatively around 95-110 cols; this rewraps them to fill up to 115 cols of content (120 - the `// ` / ` // ` prefix) so lines only break when they have to: - quadrants/program/compile_config.h: default_ad_stack_size comment. - quadrants/codegen/llvm/codegen_llvm.h: ad_stack_per_thread_stride_ field doc and ensure_ad_stack_heap_base_llvm() function doc. - quadrants/codegen/llvm/codegen_llvm.cpp: ensure_ad_stack_heap_base_llvm definition comment and the pre-scan block comment inside init_offloaded_task_function. clang-format now passes cleanly on the touched files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review comment flagged that the SPIR-V heap grow branches allocated exactly `required` bytes, while the LLVM counterpart in `LlvmRuntimeExecutor::ensure_adstack_heap` uses `std::max(needed_bytes, 2 * current)`. In a workload that issues K launches with monotonically increasing dispatch sizes between `synchronize()` calls, every launch moves the old buffer into `ctx_buffers_` (the deferred-free list the use-after-free fix added earlier in this PR), accumulating O(K^2 * N) bytes of live-but-unused GPU memory before the next sync drains the stream. Doubling bounds the reallocations at O(log K) and the live memory at O(K * N), matching the LLVM behavior. Applied to both the float and int heaps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
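The asymptotic claim is easy to demonstrate with a small simulation (illustrative only; `grow` models just the capacity bookkeeping, with `deferred` standing in for bytes parked on the deferred-free list between syncs):

```python
def grow(required_sizes, double):
    """Simulate K launches against a grow-on-demand heap.

    Returns (number of reallocations, total bytes parked on the
    deferred-free list before the next synchronize()).
    """
    cap, reallocs, deferred = 0, 0, 0
    for required in required_sizes:
        if required > cap:
            new_cap = max(required, 2 * cap) if double else required
            deferred += cap          # old buffer deferred, not freed
            cap = new_cap
            reallocs += 1
    return reallocs, deferred

# 1000 launches, each requiring one byte more than the last
sizes = list(range(1, 1001))
exact_reallocs, exact_deferred = grow(sizes, double=False)
doubled_reallocs, doubled_deferred = grow(sizes, double=True)
```

With exact sizing every launch reallocates (1000 reallocations, ~K^2/2 deferred bytes); with doubling only 11 reallocations occur and the deferred total stays close to the final capacity, matching the O(log K) / O(K * N) bounds stated above.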
…ee). Review comment flagged that the comment at `ensure_adstack_heap` above the `adstack_heap_alloc_ = std::move(new_guard)` line falsely claims AMDGPU's `dealloc_memory` routes to `hipFree` and implicitly synchronizes. It does not: `AmdgpuDevice::dealloc_memory` calls `DeviceMemoryPool::release(release_raw=false)`, which pools the allocation through `CachingAllocator` without any `hipFree` and without any GPU sync. CUDA's path does sync via `cuMemFree_v2` (release_raw=true there), so the CUDA half of the comment was correct.

The code is still safe today -- not because of dealloc_memory, but because `launch_llvm_kernel` ends with `hipFree(context_pointer)` / `cuMemFree_v2(context_pointer)` which synchronizes with every kernel launched during the call. By the time the *next* `launch_llvm_kernel` reaches `ensure_adstack_heap` and can destroy the previous guard, all kernels from the prior call have already completed. Spell that actual invariant out in the comment so a future change to the launcher structure cannot lean on the false dealloc-sync claim as justification.

Comment-only; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…text_pointer).
Review comment flagged that the AMDGPU bullet at lines 629-636 attributed
cross-launch adstack safety to both `hipFree(context_pointer)` AND
`cuMemFree_v2(context_pointer)`, but `cuda::KernelLauncher::launch_llvm_kernel`
never allocates a device-side `context_pointer` in the first place — it
passes the host-side `RuntimeContext` by reference via
`{&ctx.get_context()}`. CUDA's safety is already complete at the preceding
CUDA bullet: `CudaDevice::dealloc_memory` passes `release_raw=true`, so the
old heap slab's free goes through `cuMemFree_v2` directly and synchronizes
before returning. Only AMDGPU needs the extra hop (because its
`DeviceMemoryPool::release(release_raw=false)` path pools the slab without
any sync), and the trailing `hipFree(context_pointer)` in
`amdgpu::KernelLauncher::launch_llvm_kernel` provides that sync.
Scoped the paragraph to AMDGPU only and added a sentence noting why CUDA
doesn't need (and doesn't have) a device-side `context_pointer`.
Comment-only; no behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User: look at diff between duburcqa/heap_backed_adstack and origin/duburcqa/fix_adstack_perf
Assistant: Theme: move adstack primal/adjoint storage from per-thread on-chip / stack memory to per-runtime heap buffers, on every backend. This PR closes the "future work" item.
What's in the PR:
1. LLVM heap-backed adstack (CPU / CUDA / AMDGPU)
2. SPIR-V heap-backed adstack (Metal / Vulkan)
3. Sizing knob & defaults
4. Documentation
Thoughts on splitting as follows?:
…teration bound; surface Metal alloc failures instead of silent nil
…er-thread adstack heap on Metal/Vulkan
```cpp
size_t new_size = std::max(required, 2 * adstack_heap_buffer_float_size_);
auto [buf, res] = device_->allocate_memory_unique(
    {new_size, /*host_write=*/false, /*host_read=*/false, /*export_sharing=*/false, AllocUsage::Storage});
QD_ASSERT_INFO(res == RhiResult::success, "Failed to allocate adstack heap float buffer (size={})",
               new_size);
// Defer the old buffer's free until the current cmdlist is submitted and synced: the previous launch
// may still be in flight and referencing the old allocation, so freeing it synchronously here (via
// `DeviceAllocationGuard`'s destructor, which runs on the `std::move` reassignment below) would
// produce a GPU-side use-after-free. `ctx_buffers_` is cleared in `synchronize()` /
// `device_to_host(...)` after the stream has drained, which is exactly the lifetime we need.
if (adstack_heap_buffer_float_) {
  ctx_buffers_.push_back(std::move(adstack_heap_buffer_float_));
}
adstack_heap_buffer_float_ = std::move(buf);
adstack_heap_buffer_float_size_ = new_size;
```
🔴 The amortized-doubling grow paths for both the float heap (AdStackHeapFloat) and int heap (AdStackHeapInt) in gfx/runtime.cpp compute new_size = max(required, 2 * old_size) and immediately assert RhiResult::success with no fallback retry at exactly required bytes. On Metal, where maxBufferLength is capped (~256 MB on many devices), this crashes the process with a spurious assertion even when the smaller required allocation would have succeeded — e.g. old_size=150 MB, required=165 MB yields new_size=300 MB, which fails with out_of_memory, aborting the program despite a 165 MB allocation being viable. The fix is to retry with new_size = required when the doubled allocation fails and new_size > required, before calling QD_ASSERT_INFO.
Extended reasoning...
What the bug is and how it manifests
In quadrants/runtime/gfx/runtime.cpp, both the float-heap grow branch and the int-heap grow branch added by this PR use amortized doubling to reduce reallocation frequency:
```cpp
size_t new_size = std::max(required, 2 * adstack_heap_buffer_float_size_);
auto [buf, res] = device_->allocate_memory_unique({new_size, ...});
QD_ASSERT_INFO(res == RhiResult::success, "Failed to allocate adstack heap float buffer (size={})", new_size);
```
There is no fallback path that retries with exactly `required` bytes if the doubled allocation fails. On Metal, every StorageBuffer is subject to the device's maxBufferLength cap (~256 MB on older Apple Silicon/Intel iGPU configurations, as low as ~128 MB on some). When old_size > maxBufferLength/2 but required <= maxBufferLength, the doubled new_size exceeds the cap and allocate_memory_unique returns RhiResult::out_of_memory — triggering the assertion and aborting the process, even though the exact required allocation would have succeeded.
The specific code path that triggers it
This bug was introduced by the amortized-doubling commit (e9b5e48) that added std::max(required, 2 * current_size). That commit correctly fixed the O(K^2 * N) accumulated GPU memory problem, but the new allocation strategy has no recovery path for the case where 2 * old_size > maxBufferLength >= required. Notably, the Metal nil-check fix added by this same PR to metal_device.mm now causes newBufferWithLength: returning nil to propagate as RhiResult::out_of_memory rather than silently wrapping nil — which means this assertion is now reachable and crash-inducing on Metal, whereas previously it manifested as silent NaN gradient corruption.
Why existing code does not prevent it
The guard condition adstack_heap_buffer_float_size_ < required correctly identifies that growth is needed. The doubled new_size is computed, passed to allocate_memory_unique, and the result is immediately asserted with no intervening check for res != RhiResult::success. There is no conditional block of the form "if the doubled size fails and required < new_size, retry with new_size = required". The same structural deficiency exists identically in the int-heap grow branch.
What the impact is
Any Metal workload that triggers adstack heap growth from a size above maxBufferLength/2 but below maxBufferLength will abort with a QD_ASSERT_INFO failure, even though the actually-needed allocation would have succeeded. Concrete example: device with maxBufferLength = 256 MB, first kernel dispatch sizes the heap at 150 MB; a second slightly larger dispatch (165 MB) triggers the grow, requests 300 MB, gets rejected by Metal, and crashes — despite a 165 MB request being well within the budget.
How to fix it
Add a fallback retry at exactly required bytes when the doubled allocation fails and new_size > required, before the assertion:
```cpp
size_t new_size = std::max(required, 2 * adstack_heap_buffer_float_size_);
auto [buf, res] = device_->allocate_memory_unique({new_size, ...});
if (res != RhiResult::success && new_size > required) {
  new_size = required;
  std::tie(buf, res) = device_->allocate_memory_unique({new_size, ...});
}
QD_ASSERT_INFO(res == RhiResult::success, "Failed to allocate adstack heap float buffer (size={})", new_size);
```
The same fix applies to the int-heap branch. This mirrors the standard two-phase allocator pattern: try the amortized size first; fall back to the minimum viable size if the device cap prevents the doubled request.
Step-by-step proof
1. First kernel dispatch: `adstack_heap_buffer_float_size_ = 0`, `required = 150 MB`. `new_size = max(150, 0) = 150 MB`. Allocation succeeds. Float heap is 150 MB.
2. Second kernel dispatch: `required = 165 MB > 150 MB`, so the grow branch fires. `new_size = max(165, 2*150) = 300 MB`. `allocate_memory_unique({300 MB, ...})` calls `metal_device.mm`'s `newBufferWithLength:` with 300 MB. This exceeds `maxBufferLength = 256 MB` and returns nil. The nil-check fix in this PR returns `RhiResult::out_of_memory`. `QD_ASSERT_INFO(res == RhiResult::success, ...)` fires. Process aborts with the allocation failure message.
3. A 165 MB allocation would have succeeded, but it is never attempted. The process has terminated unnecessarily.
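The proposed two-phase retry is easy to sanity-check against a simulated capped allocator (everything here is illustrative: sizes are in MB, and `MAX_BUFFER_LENGTH` stands in for Metal's per-buffer cap):

```python
MAX_BUFFER_LENGTH = 256  # MB, simulated Metal maxBufferLength cap

def allocate(size):
    """Pretend device allocator: fails above the per-buffer cap."""
    return ("ok", size) if size <= MAX_BUFFER_LENGTH else ("out_of_memory", None)

def grow_with_fallback(current, required):
    # phase 1: try the amortized-doubling size
    new_size = max(required, 2 * current)
    res, buf = allocate(new_size)
    # phase 2: retry at the minimum viable size before asserting
    if res != "ok" and new_size > required:
        new_size = required
        res, buf = allocate(new_size)
    assert res == "ok", f"allocation failed (size={new_size})"
    return new_size

# the exact scenario from the proof: 150 MB heap, 165 MB required
final_size = grow_with_fallback(current=150, required=165)
```

With the fallback, the 300 MB doubled request fails but the 165 MB retry succeeds, so the process keeps running instead of aborting on a spurious assertion.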
```cpp
// Per-runtime adstack heap slab, owned here. `ensure_adstack_heap` grows via the driver allocator and
// publishes the new pointer/size into the LLVMRuntime struct; replacing `adstack_heap_alloc_` releases the
// previous allocation via `DeviceAllocationGuard`, which calls `llvm_device()->dealloc_memory`. On CUDA / AMDGPU
// `cuMemFree_v2` / `hipFree` synchronize implicitly with respect to pending kernels on the device, so the old
// slab cannot be freed while an in-flight kernel still holds its cached base pointer.
```
🟡 The comment at llvm_runtime_executor.h:174-176 (added by this PR) incorrectly claims that on AMDGPU, hipFree synchronizes implicitly with pending kernels when dealloc_memory releases the old adstack slab, but AmdgpuDevice::dealloc_memory() actually routes through DeviceMemoryPool::release(release_raw=false) -> CachingAllocator::release without calling hipFree and without any GPU synchronization. The same incorrect claim in llvm_runtime_executor.cpp was corrected by commit df52025af, but the matching header comment was not updated, leaving a maintenance trap: a developer reading only the header could conclude dealloc_memory provides the AMDGPU sync barrier and safely remove the hipFree(context_pointer) at the end of amdgpu::KernelLauncher::launch_llvm_kernel, which would break AMDGPU cross-launch memory safety.
Extended reasoning...
What the bug is. The comment at llvm_runtime_executor.h:172-176 reads: "On CUDA / AMDGPU cuMemFree_v2 / hipFree synchronize implicitly with respect to pending kernels on the device, so the old slab cannot be freed while an in-flight kernel still holds its cached base pointer." This claim is accurate for CUDA (CudaDevice::dealloc_memory routes through DeviceMemoryPool::release(release_raw=true) → cuMemFree_v2, which is synchronizing), but factually wrong for AMDGPU.
The specific code path that contradicts it. AmdgpuDevice::dealloc_memory() for adstack slab allocations (created with use_cached=false, use_preallocated=false) reaches DeviceMemoryPool::get_instance(Arch::amdgpu, false).release(info.size, info.ptr) with release_raw defaulting to false. DeviceMemoryPool::release with release_raw=false calls allocator_->release(size, ptr) in CachingAllocator, which simply pools the address for reuse. No hipFree is called; no GPU synchronization occurs.
Why the code is nonetheless safe today. AMDGPU cross-launch memory safety comes from a completely different mechanism: amdgpu::KernelLauncher::launch_llvm_kernel ends with AMDGPUDriver::get_instance().mem_free(context_pointer), and hipFree synchronizes implicitly with every kernel launched during that call. By the time the next launch_llvm_kernel reaches ensure_adstack_heap and could destroy the previous DeviceAllocationGuard, no in-flight GPU kernel from the prior call is still referencing the old slab. The correct explanation is in the .cpp comment that was updated by commit df52025af, but the matching .h comment was overlooked.
The maintenance trap. A developer auditing the AMDGPU memory-safety invariant reads llvm_runtime_executor.h:174-176 and sees that dealloc_memory provides the synchronization barrier. Armed with that belief, they might conclude that the final hipFree(context_pointer) in launch_llvm_kernel is redundant (since dealloc_memory supposedly syncs already) and remove it as cleanup. After that change the actual synchronization barrier is gone, but the header comment still asserts it exists — producing a real use-after-free on AMDGPU with no static or runtime check to catch it. The .cpp comment (fixed in df52025af) correctly explains that CachingAllocator::release does not call hipFree; the header and the implementation comments are now directly contradictory.
How to fix it. Update llvm_runtime_executor.h:172-176 to match the .cpp comment: state that CUDA deallocation routes through cuMemFree_v2 (synchronizing), while AMDGPU deallocation routes through CachingAllocator::release (non-synchronizing, physical memory stays mapped), and that the AMDGPU cross-launch safety invariant comes from the hipFree(context_pointer) at the end of each launch_llvm_kernel call.
Summary
Moves the SPIR-V adstack's primal and adjoint arrays off Function-scope (per-thread on-chip memory) onto a shared, per-dispatch StorageBuffer sliced per invocation. Lifts the Metal/MoltenVK per-thread shader footprint cap that bounded the previous implementation to a few dozen adstack slots.
Non-f32 adstacks (i32, u1, f64, ...) continue to use Function-scope arrays because they are typically small (loop indices, if-condition flags) and fit comfortably; only f32 goes on the heap, where the cap mattered in practice.
What changed
- `TaskCodegen::visit(AdStackAllocaStmt)` picks heap vs Function-scope on `ret_type == f32`. Push/Pop/LoadTop/AccAdjoint visitors branch on a new `use_heap` flag.
- `TaskCodegen::run()` pre-scans the IR for every f32-valued `AdStackAllocaStmt` up front so `get_ad_stack_heap_thread_base()` captures the final per-thread stride. Without the pre-scan the lazy `invoc_id * stride` bakes a stale stride into the base expression, later allocas raise the stride, and threads overlap into each other's slices. `TaskAttributes::ad_stack_heap_per_thread_stride` is written at the end of codegen and read by the runtime.
- `GfxRuntime` allocates one StorageBuffer per launch sized at `stride * (group_x * block_dim) * sizeof(f32)`. Reused across launches with grow-on-demand; on grow the previous buffer is moved into `ctx_buffers_` rather than destroyed synchronously, so an in-flight cmdlist still referencing it stays valid until the stream drains.
- Replaces `test_adstack_shader_compile_failure_raises` (which asserted Metal would reject large stacks) with `test_adstack_large_capacity_heap_backed` (a 4096-slot kernel that the old Function-scope path could not compile). Adds `test_adstack_mixed_f32_and_non_f32` exercising both paths (f32 primal on the heap, i32 loop index on Function-scope) in one kernel.

Stacked on #490.