[AutoDiff] Autodiff 7: Surface LLVM adstack push/pop overflow as a Python exception#495
[AutoDiff] Autodiff 7: Surface LLVM adstack push/pop overflow as a Python exception#495duburcqa wants to merge 13 commits intoduburcqa/fix_ad_correctnessfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5d3009268
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
e281a6e to
196977b
Compare
a5d3009 to
79a56b0
Compare
196977b to
c73cb3d
Compare
91f44c5 to
6abc4aa
Compare
c73cb3d to
3d0ecaf
Compare
6abc4aa to
bedaa69
Compare
3d0ecaf to
7b52cc8
Compare
bedaa69 to
6ba6b0a
Compare
7b52cc8 to
2a30384
Compare
6ba6b0a to
7e7b15d
Compare
2a30384 to
b390eb0
Compare
7e7b15d to
d68e626
Compare
b390eb0 to
4984a2b
Compare
d68e626 to
bd14f25
Compare
|
Opus review of test factoriztaion: Looking at the four tests side by side:
Tests 2, 3, and 4 share identical setup/teardown/assertion structure and only differ in two dimensions: dynamic vs static loop, and fixed vs varying coefficient. These three could collapse into a single parametrized test using @pytest.mark.parametrize("use_static_loop", [True, False])
@pytest.mark.parametrize("use_varying_coeff", [True, False])
@test_utils.test(require=qd.extension.adstack, ad_stack_experimental_enabled=True)
def test_adstack_sum_linear(use_static_loop, use_varying_coeff):
x = qd.field(qd.f32)
y = qd.field(qd.f32)
qd.root.dense(qd.i, 1).place(x, x.grad)
qd.root.place(y, y.grad)
@qd.kernel
def compute():
for i in x:
v = x[i]
if qd.static(use_static_loop):
r = qd.static(range(3))
else:
r = range(3)
for a in r:
if qd.static(use_varying_coeff):
y[None] += v * qd.cast(a + 1, qd.f32)
else:
y[None] += v
x[0] = 1.0
y[None] = 0.0
compute()
y.grad[None] = 1.0
x.grad[0] = 0.0
compute.grad()
expected = sum((a + 1) for a in range(3)) if use_varying_coeff else 3.0
assert x.grad[0] == test_utils.approx(expected, rel=1e-4)This gives you 4 combinations (including static + varying coeff, which is new free coverage) in one test, and eliminates ~50 lines of duplication.
|
|
Please could you address Opus's suggetsions on test factroization. |
| // The stack index `n` is clamped on read so that overflow (push past capacity) does not let subsequent pops and | ||
| // top-accesses underflow it and index far out of bounds. The corresponding stack_push sets | ||
| // `runtime->adstack_overflow_flag` and skips the increment instead of trapping, so the host-side launcher | ||
| // surfaces the failure as a Python exception rather than killing the process via __builtin_trap. When n == 0 |
|
Asking Opus about any runtime impact: Hot-path branches in stack operations. stack_push, stack_pop, and stack_top_primal are called on every iteration of every AD loop that uses the adstack. This PR adds a new • stack_push: bounds check if (n + 1 > max_num_elements) before every push, plus an extra LLVMRuntime *runtime pointer argument threaded through every call site. In normal (non-overflowing) operation the branches are always not-taken, so CPU branch prediction should handle them well. On GPU, all threads in a warp would take the same synchronize() poll. Every qd.sync() now calls check_adstack_overflow(), which makes a JIT call into the runtime to read and reset the flag, plus a result buffer fetch. This is In practice, the cost is probably negligible — a couple of predictable branches per stack operation. But if someone wanted to benchmark it, the stack push/pop overhead in a |
|
Opus response for test coverage: Code changes in this PR:
Untested functionality implied by the design:
|
|
Please could you add tests for untested fucntionality implied by the design:
|
|
Oh, also please add tests for the rows in the table above with 'no', ie:
|
4984a2b to
1cb9ad7
Compare
bd14f25 to
d1f04cc
Compare
1cb9ad7 to
91db488
Compare
d1f04cc to
ed092f0
Compare
There was a problem hiding this comment.
LGTM — the critical teardown fix (pre_finalize hook) and all prior review comments have been addressed; the two new inline nits are documentation/hygiene only.
Extended reasoning...
Overview
This PR adds LLVM adstack overflow detection across all LLVM backends (x64, arm64, CUDA, AMDGPU). It adds a dedicated adstack_overflow_flag field to LLVMRuntime, bounds-checks in stack_push/stack_pop/stack_top_primal, a new check_adstack_overflow() called unconditionally on every sync, and a pre_finalize() virtual hook that sets finalizing_ = true before the two teardown syncs so overflow exceptions cannot escape into ~Program(). Tests cover the overflow, flag-reset, multi-threaded, large-capacity, and teardown paths.
Security risks
No security-sensitive code is touched. The change adds a flag and a bounds check inside the LLVM JIT runtime; the overflow flag is a simple i64 field with no auth or crypto involvement.
Level of scrutiny
Medium. The change is logic-heavy (new virtual hook ordering relative to Program::finalize call sequence), but this was the critical issue I flagged in the prior review and it was correctly fixed: Program::finalize() now calls pre_finalize() before the two synchronize() calls, and LlvmProgramImpl::pre_finalize() correctly sets finalizing_ = true. The teardown test was also tightened to exercise the actual unsynced-then-exit path rather than the already-safe pre-cleared-flag path.
Other factors
Both bugs now being posted as inline comments are minor nits: (1) the finalizing_ field comment says 'Flipped on by finalize()' when pre_finalize() is the authoritative setter — documentation only, no logic impact; (2) a 136-byte heap allocation in test_stack() is never freed — pre-existing, OS-reclaimed at exit, inconsequential in CI. Neither warrants blocking the PR. All prior review comments were addressed.
91db488 to
32d3599
Compare
ed092f0 to
c776e9d
Compare
The pre-existing LLVM runtime `stack_push` had a bare `TODO: assert n <= max_elements` so loops that exceeded the adstack capacity silently incremented `n` past the allocated region, writing out-of-bounds and handing back wrong gradients (or crashing). `stack_pop` was also unguarded and happily underflowed `n` into UINT_MAX. Make the overflow visible: - `stack_push` takes the runtime pointer, skips the store/increment on overflow, and flips a new `runtime->adstack_overflow_flag`. Codegen (`codegen_llvm.cpp`) passes `get_runtime()` at every push site. - `stack_pop` no-ops when `n == 0`; `stack_top_primal` clamps `idx` to 0 so reads on an underflowed stack stay in-bounds (garbage value, but the host raises before it is consumed). - `LlvmRuntimeExecutor::check_adstack_overflow()` polls the flag via a runtime-retrieve JIT call and throws `QuadrantsAssertionError`. Unlike `check_runtime_error`, this runs on every `synchronize()` (not gated on `compile_config.debug`) because silent wrong gradients are not acceptable to hide. - `LlvmProgramImpl::synchronize()` calls it after the regular sync. The result buffer is cached on `materialize_runtime` so internal polls do not need to thread the pointer through the public API. Covered by `test_adstack_overflow_raises`: a 64-iteration dynamic loop on a `default_ad_stack_size=32` adstack is expected to raise either `AssertionError` (LLVM via `check_runtime_error`) or `RuntimeError` (future SPIR-V path via `QD_ERROR`) at the next `qd.sync()`.
…d, and multi-threaded raise.
…, and push overflow coverage.
…then-exit, drop subprocess timeout.
… teardown-subprocess test docstring.
…rdown syncs; pin ad_stack_size on overflow tests; tighten subprocess test to drop inner sync so teardown path is exercised.
… and plug test_stack() heap leak.
…subprocess test cleanly when adstack unavailable.
32d3599 to
c6cca68
Compare
c776e9d to
c855375
Compare
There was a problem hiding this comment.
LGTM — all previously raised issues have been addressed.
Extended reasoning...
Overview
This PR adds LLVM adstack overflow detection across all LLVM backends (x64, arm64, CUDA, AMDGPU). It introduces a dedicated adstack_overflow_flag field, stack_push bounds checking with atomic writes, stack_pop/stack_top_primal underflow clamping, a check_adstack_overflow() polling method called unconditionally at every sync, and a pre_finalize() virtual hook to suppress the poll during Program teardown.
Security risks
No security-sensitive code paths (auth, crypto, permissions) are touched. The data race concern on the overflow flag was correctly addressed with __atomic_store_n + __atomic_exchange_n.
Level of scrutiny
This PR is runtime-critical (JIT execution path for every AD kernel) and received very thorough review. Multiple rounds of comments identified real bugs — the finalizing_ flag timing issue, the missing pre_finalize() hook, incorrect inline comments, the heap leak in test_stack(), and the missing adstack guard on the teardown subprocess test — all of which the author addressed. The one flagged pre-existing issue (AdStackAllocaStmt::size_in_bytes() using a 4-byte header vs the runtime's 8-byte header) predates this PR and is silently absorbed by LLVM alignment padding on all supported platforms; it warrants a follow-up fix in statements.h but does not affect correctness here.
Other factors
Test coverage is comprehensive: basic overflow raise, flag reset after catch, large-capacity workaround, multithreaded overflow, and teardown subprocess. All previous reviewer concerns are resolved.
…ime, was sized as int32 (4B too small).
There was a problem hiding this comment.
LGTM — all issues from previous review rounds addressed.
Extended reasoning...
Overview
This PR adds adstack-overflow detection to the LLVM backend: stack_push sets a dedicated adstack_overflow_flag on overflow (rather than trapping), check_adstack_overflow() polls the flag unconditionally on every synchronize(), and LlvmProgramImpl::pre_finalize() sets finalizing_=true before the two teardown syncs so overflow exceptions cannot escape into ~Program(). Ten files changed across codegen, IR, runtime, program lifecycle, and tests.
Security risks
No security-sensitive code paths (no auth, crypto, or permissions). The only risk was the original silent UB (out-of-bounds write on overflow), which this PR fixes.
Level of scrutiny
This PR went through four rounds of review, surfacing: comment inaccuracies, a real finalizing-flag timing bug (std::terminate risk), a heap leak in test_stack(), a missing adstack-extension skip guard, a C++11 memory-model data race, and a 4-byte header size mismatch in size_in_bytes(). All were fixed. The current code matches the invariants documented in each comment, the tests exercise the critical paths (flag reset, large capacity, multithreaded, teardown subprocess), and the bug hunting system found no remaining issues.
Other factors
Test coverage is thorough: overflow raises, flag resets after catch, large capacity resolves overflow, multithreaded overflow, teardown-without-explicit-sync subprocess. The pre_finalize() hook comment explicitly warns that moving the assignment back into finalize() alone would reintroduce the std::terminate bug — a good maintenance guard. The atomic fixup (__ATOMIC_RELAXED store + exchange) is the minimal correct change with zero runtime cost on x86-64/ARM64.
…rker-thread stack budget
… the worker-thread stack budget
There was a problem hiding this comment.
LGTM — all review issues addressed, including the CPU-only arch guard on the 256 KB adstack budget check.
Extended reasoning...
Overview
The PR adds adstack overflow detection to the LLVM autodiff runtime: stack_push now sets runtime->adstack_overflow_flag via a relaxed atomic store on overflow, stack_pop/stack_top_primal have underflow/clamping guards, LlvmRuntimeExecutor::check_adstack_overflow() polls the flag on every synchronize(), and LlvmProgramImpl::pre_finalize() suppresses the poll during teardown. Additionally AdStackAllocaStmt::size_in_bytes() is corrected from sizeof(int32) to sizeof(int64) to match the runtime layout. 11 files touched across codegen, IR, runtime, and tests.
Security risks
No security-sensitive surfaces (no auth, crypto, or permission code). The relaxed-atomic fix for the data race in stack_push is a correctness improvement.
Level of scrutiny
This is production-critical runtime code (autodiff correctness on all LLVM backends), but the changes are well-bounded safety additions rather than logic rewrites. Every bug raised across multiple review rounds was addressed:
- Comment inaccuracies (error_code vs adstack_overflow_flag) — fixed
- finalizing_ timing (pre_finalize() before teardown syncs) — fixed
- sizeof header mismatch — fixed
- Atomic write for multi-threaded overflow — fixed
- CPU-only arch guard on the 256 KB budget check — fixed (fixup commit 851d8fd)
- test_stack heap leak — fixed
- Test ad_stack_size pins and extension guards — fixed
Other factors
Test coverage is comprehensive: overflow raises, flag reset, teardown safety, multi-threaded overflow, and large-capacity resolution are all exercised. The bug hunting system found no new issues in the final state. The unresolved inline thread on the GPU budget check is moot because the arch guard is present in the submitted diff.
|
Opus says this PR is three ~independent fixes. Pelase could we split into three PRs? These three fixes a are orthogonal and could be pr'd sepraately?Yes — all three are independently mergeable, with one soft ordering preference. Sketch: Fix 3: AdStackAllocaStmt::size_in_bytes header size • Scope: 1 file, ~4 lines (quadrants/ir/statements.h). Fix 1: runtime overflow → Python exception + teardown safety • Scope: ~8 files (runtime.cpp, internal_functions.h, llvm_runtime_executor.{cpp,h}, llvm_program.h, program.cpp, program_impl.h, codegen_llvm.cpp for the stack_push signature Fix 2: codegen per-task stack-budget guard • Scope: 2 files (codegen_llvm.cpp, codegen_llvm.h). Suggested order
Why they're getting reviewed together anyway They share a theme — "stop silent wrong gradients in the adstack path" — and they were probably discovered together while chasing the same symptom. That's normal. But for it sounds like we need the test from fix 1 for fix 3?Yes — kind of. Let me be precise about what "needs" means here. Fix 3 is provably correct by inspection The runtime reads/writes the header through *(u64 *)stack at every site (stack_init, stack_push, stack_top_primal). The alloca was sized as if the header were int32. Changing it But it has no executable validation in isolation Without fix 1, the only test exercising the adstack at the C++ level is the old test_stack, which does 4 pushes into a max_size=16 stack — nowhere near the corrupted region. The That's why I said "obviously correct by inspection but with no executable proof" — and why fix 1's rewritten test_stack (which pushes to max_size and reads *(u64 *)stack) is the What this means for splitting Three options, in increasing test discipline:
If splitting, option 2 is the cleanest — a one-liner fix plus a one-line assertion is still a tiny PR, and it gives fix 3 its own regression pin instead of relying on fix 1's The general principle: a fix that's "obviously right" but covered only by an unrelated test is one accidental refactor away from being silently undone. A 5-line dedicated test is |
|
So lets' split into the following PRs please;
|
| // Non-owning cache of the Program-owned result buffer so internal polls (adstack overflow, etc.) can be | ||
| // invoked from `synchronize()` without threading the pointer through the public API. Ownership stays with | ||
| // `Program` for its lifetime; reallocating or repointing `Program::result_buffer_` mid-run would invalidate | ||
| // this cache, so avoid that. | ||
| uint64 *result_buffer_cache_{nullptr}; |
There was a problem hiding this comment.
🟡 The comment on result_buffer_cache_ (llvm_runtime_executor.h:143) warns against reallocating Program::result_buffer_ (with trailing underscore), but the actual field in Program is result_buffer (no trailing underscore). A developer grepping for result_buffer_ in Program will find nothing, defeating the purpose of the warning. Change to Program::result_buffer.
Extended reasoning...
What the bug is and how it manifests
The newly added field-level comment on result_buffer_cache_ (llvm_runtime_executor.h lines 141-145) reads: "reallocating or repointing Program::result_buffer_ mid-run would invalidate this cache". The trailing underscore is wrong: the actual field in Program is declared as result_buffer (no underscore), as seen in program.cpp (result_buffer = nullptr). This is a documentation inaccuracy introduced by this PR.
The specific code path
The comment was added alongside the result_buffer_cache_ field declaration in this PR. The intent is to warn maintainers that the non-owning cache pointer will be invalidated if the Program-owned result buffer is ever reallocated or reassigned. The warning is meaningful and correct in intent — but the identifier it names, Program::result_buffer_, does not exist.
Why existing code does not prevent it
This is a pure documentation error. The implementation is entirely correct: result_buffer_cache_ is populated at materialize_runtime time and the cache is valid for the program's lifetime. The only problem is the field name in the English comment.
Impact on maintainability
A developer who reads the warning and then searches the codebase for Program::result_buffer_ to understand the lifetime contract will find zero results (the field result_buffer_ does not exist in Program). They may then incorrectly conclude either (a) the comment is stale and can be ignored, or (b) the cache is safe to keep even when Program manages its result buffer differently. The intended warning loses all grep-navigability.
How to fix it
Change Program::result_buffer_ to Program::result_buffer (drop the trailing underscore) in the comment at llvm_runtime_executor.h:143.
Step-by-step proof
- Read llvm_runtime_executor.h lines 141-145 (added by this PR): comment says
Program::result_buffer_. - Search program.cpp for
result_buffer: field is declared asuint64 *result_buffer{nullptr}(no underscore) and accessed asresult_buffer = nullptr,&result_buffer, etc. throughout the file. - Grep for
result_buffer_as aProgrammember: no match anywhere in the codebase. - Conclusion: the comment uses the wrong identifier;
Program::result_bufferis the correct name.
Summary
Split out of #490. LLVM-side AD runtime safety fixes that are orthogonal to the SPIR-V adstack enablement. Apply on every LLVM backend (x64, arm64, CUDA, AMDGPU) regardless of whether the Metal/Vulkan path is enabled.
Before this PR,
stack_pushin the LLVM runtime module had a bareTODO: assert n <= max_elements- loops that exceeded the adstack capacity silently incremented the counter past the allocated region, writing out-of-bounds and handing back wrong gradients (or crashing).stack_popwas also unguarded and happily underflowed the counter.What changed
stack_pushnow takes the runtime pointer, skips the store/increment on overflow, and flips a newruntime->adstack_overflow_flag. Codegen passesget_runtime()at every push site.stack_popno-ops whenn == 0;stack_top_primalclamps the index to 0 so reads on an underflowed stack stay in-bounds (garbage value, but the host raises before it is consumed).LlvmRuntimeExecutor::check_adstack_overflow()polls the flag and throwsQuadrantsAssertionError. Unlikecheck_runtime_error, this runs on everysynchronize()(not gated oncompile_config.debug).LlvmProgramImpl::synchronize()calls the new check after the regular sync. The result buffer is cached onmaterialize_runtimeso internal polls do not need to thread the pointer through the public API.LlvmProgramImpl::finalize()sets afinalizing_flag that suppresses the poll during teardown, so an adstack-overflow raise cannot escape into~Program()and terminate the process.Covered by
test_adstack_overflow_raises(lands here): 64 iterations withdefault_ad_stack_size=32raiseAssertionErrorat the nextqd.sync().Stacked on #491. Feeds into #490.