[AutoDiff] Autodiff 7: Surface LLVM adstack push/pop overflow as a Python exception by duburcqa · Pull Request #495 · Genesis-Embodied-AI/quadrants

duburcqa · 2026-04-17T11:20:24Z

Summary

Split out of #490. LLVM-side AD runtime safety fixes that are orthogonal to the SPIR-V adstack enablement. They apply on every LLVM backend (x64, arm64, CUDA, AMDGPU) regardless of whether the Metal/Vulkan path is enabled.

Before this PR, stack_push in the LLVM runtime module had only a bare TODO (assert n <= max_elements), so loops that exceeded the adstack capacity silently incremented the counter past the allocated region, writing out-of-bounds and handing back wrong gradients (or crashing). stack_pop was also unguarded and happily underflowed the counter.

What changed

stack_push now takes the runtime pointer, skips the store/increment on overflow, and flips a new runtime->adstack_overflow_flag. Codegen passes get_runtime() at every push site.
stack_pop no-ops when n == 0; stack_top_primal clamps the index to 0 so reads on an underflowed stack stay in-bounds (garbage value, but the host raises before it is consumed).
LlvmRuntimeExecutor::check_adstack_overflow() polls the flag and throws QuadrantsAssertionError. Unlike check_runtime_error, this runs on every synchronize() (not gated on compile_config.debug).
LlvmProgramImpl::synchronize() calls the new check after the regular sync. The result buffer is cached on materialize_runtime so internal polls do not need to thread the pointer through the public API.
LlvmProgramImpl::finalize() sets a finalizing_ flag that suppresses the poll during teardown, so an adstack-overflow raise cannot escape into ~Program() and terminate the process.

Covered by test_adstack_overflow_raises (lands here): 64 iterations with default_ad_stack_size=32 raise AssertionError at the next qd.sync().
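
The guard semantics listed above can be pictured with a small pure-Python model. This is only an illustrative sketch: `AdStackModel` and `check_overflow` are hypothetical names, the list-backed storage is an assumption, and none of it is the actual runtime code in runtime.cpp.

```python
class AdStackModel:
    """Toy model of the guarded adstack (hypothetical, not the real runtime)."""

    def __init__(self, max_num_elements):
        self.max_num_elements = max_num_elements
        self.slots = [0.0] * max_num_elements
        self.n = 0                   # number of live entries
        self.overflow_flag = False   # stands in for runtime->adstack_overflow_flag

    def push(self, value):
        # On overflow: skip the store and the increment, just raise the flag.
        if self.n + 1 > self.max_num_elements:
            self.overflow_flag = True
            return
        self.slots[self.n] = value
        self.n += 1

    def pop(self):
        # No-op on an empty stack instead of underflowing the counter.
        if self.n > 0:
            self.n -= 1

    def top(self):
        # Clamp the index to 0 so the read stays in bounds (value may be garbage).
        idx = self.n - 1 if self.n > 0 else 0
        return self.slots[idx]


def check_overflow(stack):
    """Host-side poll: read and reset the flag, then raise if it was set."""
    was_set, stack.overflow_flag = stack.overflow_flag, False
    if was_set:
        raise AssertionError("adstack overflow")


# 64 pushes into a 32-slot stack, followed by 64 top/pop pairs.
stack = AdStackModel(max_num_elements=32)
for i in range(64):
    stack.push(float(i))
for _ in range(64):
    stack.top()
    stack.pop()
```

In this model the second half of the pops hits the empty-stack guard, which mirrors why the pop and top guards are only reachable after an overflow has already been flagged.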

Stacked on #491. Feeds into #490.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5d3009268

hughperkins · 2026-04-17T13:41:14Z

Opus review of test factorization:

Looking at the four tests side by side:

| Test | Loop type | Body | Expected grad |
| --- | --- | --- | --- |
| `test_adstack_sum_fixed_coeff` | dynamic `range(3)` | `y += v` | `3.0` |
| `test_adstack_sum_constant_coeffs` | dynamic `range(3)` | `y += v * (a+1)` | `6.0` |
| `test_adstack_sum_static_loop_correct` | static `qd.static(range(3))` | `y += v` | `3.0` |
| `test_adstack_basic_gradient` | dynamic `range(10)` | `v = v * 0.95 + 0.01` | `0.95^10` |

The three test_adstack_sum_* tests share identical setup/teardown/assertion structure and only differ in two dimensions: dynamic vs static loop, and fixed vs varying coefficient. These three could collapse into a single parametrized test using if qd.static(...) to switch at compile time:

@pytest.mark.parametrize("use_static_loop", [True, False])
@pytest.mark.parametrize("use_varying_coeff", [True, False])
@test_utils.test(require=qd.extension.adstack, ad_stack_experimental_enabled=True)
def test_adstack_sum_linear(use_static_loop, use_varying_coeff):
    x = qd.field(qd.f32)
    y = qd.field(qd.f32)
    qd.root.dense(qd.i, 1).place(x, x.grad)
    qd.root.place(y, y.grad)

    @qd.kernel
    def compute():
        for i in x:
            v = x[i]
            if qd.static(use_static_loop):
                r = qd.static(range(3))
            else:
                r = range(3)
            for a in r:
                if qd.static(use_varying_coeff):
                    y[None] += v * qd.cast(a + 1, qd.f32)
                else:
                    y[None] += v

    x[0] = 1.0
    y[None] = 0.0
    compute()
    y.grad[None] = 1.0
    x.grad[0] = 0.0
    compute.grad()

    expected = sum((a + 1) for a in range(3)) if use_varying_coeff else 3.0
    assert x.grad[0] == test_utils.approx(expected, rel=1e-4)

This gives you 4 combinations (including static + varying coeff, which is new free coverage) in one test, and eliminates ~50 lines of duplication.

test_adstack_basic_gradient should stay separate: it's the only one with loop-carried state mutation (v = v * 0.95 + 0.01), which is a structurally different pattern. It could, however, benefit from a brief comment explaining that distinction.
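
As a sanity check on the "Expected grad" column, the three loop bodies can be reproduced in plain Python and differentiated numerically. This is a sketch under assumptions: the function names and the central-difference helper are made up for illustration, and nothing here touches the quadrants API.

```python
def sum_fixed(x):
    # y += v, three times
    y = 0.0
    for _ in range(3):
        y += x
    return y


def sum_varying(x):
    # y += v * (a + 1) for a in 0..2
    y = 0.0
    for a in range(3):
        y += x * (a + 1)
    return y


def loop_carried(x):
    # v = v * 0.95 + 0.01, ten times (loop-carried state)
    v = x
    for _ in range(10):
        v = v * 0.95 + 0.01
    return v


def central_diff(f, x, h=1e-3):
    """Central finite difference; exact up to rounding for affine functions."""
    return (f(x + h) - f(x - h)) / (2 * h)


grads = {
    "sum_fixed": central_diff(sum_fixed, 1.0),
    "sum_varying": central_diff(sum_varying, 1.0),
    "loop_carried": central_diff(loop_carried, 1.0),
}
```

All three functions are affine in x, so the finite difference should agree with the analytic values 3, 1 + 2 + 3 = 6, and 0.95 ** 10 to within floating-point rounding.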

hughperkins · 2026-04-17T13:41:36Z

Please could you address Opus's suggestions on test factorization.

hughperkins · 2026-04-17T13:42:49Z

+// The stack index `n` is clamped on read so that overflow (push past capacity) does not let subsequent pops and
+// top-accesses underflow it and index far out of bounds. The corresponding stack_push sets
+// `runtime->adstack_overflow_flag` and skips the increment instead of trapping, so the host-side launcher
+// surfaces the failure as a Python exception rather than killing the process via __builtin_trap. When n == 0


hughperkins · 2026-04-17T13:47:10Z

Asking Opus about any runtime impact:

Hot-path branches in stack operations. stack_push, stack_pop, and stack_top_primal are called on every iteration of every AD loop that uses the adstack. This PR adds a new branch to each:

• stack_push: bounds check if (n + 1 > max_num_elements) before every push, plus an extra LLVMRuntime *runtime pointer argument threaded through every call site.
• stack_pop: underflow guard if (n > 0) before every pop.
• stack_top_primal: conditional n > 0 ? n - 1 : 0 instead of a bare n - 1.

In normal (non-overflowing) operation the branches are always not-taken, so CPU branch prediction should handle them well. On GPU, all threads in a warp would take the same path (not-taken), so no divergence cost. But they do add instruction count and register pressure to what is essentially the innermost loop of the backward pass.

synchronize() poll. Every qd.sync() now calls check_adstack_overflow(), which makes a JIT call into the runtime to read and reset the flag, plus a result buffer fetch. This is once per sync (not per iteration), so it's unlikely to be measurable unless the user syncs at very high frequency.

In practice, the cost is probably negligible: a couple of predictable branches per stack operation. But if someone wanted to benchmark it, the stack push/pop overhead in a tight AD loop with many iterations would be the place to look.

hughperkins · 2026-04-17T13:51:22Z

Opus response for test coverage:

Code changes in this PR:

| Change | Exercised by test? |
| --- | --- |
| `stack_push` overflow check + early return | Yes: the 64-iteration kernel triggers it |
| `stack_push` sets `runtime->adstack_overflow_flag = 1` | Yes: the test expects the resulting exception |
| `stack_push` new `LLVMRuntime *runtime` parameter | Yes: indirectly, the kernel wouldn't compile otherwise |
| `stack_pop` underflow guard (`if (n > 0)`) | No: only reachable after overflow, when the reversed loop pops more times than were successfully pushed. The test raises before observing this path. |
| `stack_top_primal` clamping (`n > 0 ? n - 1 : 0`) | No: same situation, only reachable post-overflow. |
| `LLVMRuntime::adstack_overflow_flag` field | Yes |
| `runtime_retrieve_and_reset_adstack_overflow` | Yes: called by `check_adstack_overflow` |
| `LlvmRuntimeExecutor::check_adstack_overflow()` | Yes |
| `check_adstack_overflow` early return when `llvm_runtime_ == nullptr` | No: only reachable during early teardown before `materialize_runtime` |
| `result_buffer_cache_` caching in `materialize_runtime` | Yes: indirectly, needed for `check_adstack_overflow` to work |
| `LlvmProgramImpl::synchronize()` calls `check_adstack_overflow` | Yes |
| `LlvmProgramImpl::finalize()` sets `finalizing_ = true` | No: no test verifies that overflow during teardown doesn't crash |
| `codegen_llvm.cpp` passes `get_runtime()` to `stack_push` | Yes: indirectly |
| `internal_functions.h` `test_stack` updated | No: no test calls `test_internal_func_args` / `test_stack` |

Untested functionality implied by the design:

1. Flag reset after catch. The flag is reset in runtime_retrieve_and_reset_adstack_overflow, but no test verifies that after catching the overflow exception, a subsequent qd.sync() does not raise again. A second sync after the pytest.raises block would cover this.
2. Increasing ad_stack_size resolves the overflow. The error message tells users to pass ad_stack_size=N to qd.init(). No test verifies that doing so actually makes the same kernel succeed.
3. Overflow on SPIR-V. The test accepts both AssertionError and RuntimeError, and the docstring describes the SPIR-V path, but the SPIR-V code path isn't in this diff. If SPIR-V doesn't have equivalent overflow detection, the test would silently pass on LLVM-only CI without ever validating the SPIR-V claim.
4. Multi-threaded overflow. The comment in stack_push explains the race is benign (all threads write the same sentinel). No test uses a multi-element field where multiple threads overflow simultaneously.
5. Gradients are actually wrong without the safety check. The test only checks that an exception is raised, not that the gradients would have been silently wrong without it. A companion test at a just-under-capacity iteration count showing correct gradients would strengthen the argument.
6. Teardown safety. The finalizing_ flag exists specifically so that an overflow during ~Program() → finalize() → synchronize() doesn't throw into a destructor and terminate the process. No test covers this path, e.g., triggering overflow and then letting the qd.init() scope exit without an explicit sync.
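
The "flag reset after catch" and "multi-threaded overflow" gaps can be illustrated with a pure-Python model of a read-and-reset flag. This is a hedged sketch: `RuntimeModel` and `sync_raises` are hypothetical names, and the `threading.Lock` merely stands in for whatever synchronization the real runtime uses around the flag.

```python
import threading


class RuntimeModel:
    """Toy stand-in for the runtime-side overflow flag (hypothetical)."""

    def __init__(self):
        self._flag = 0
        self._lock = threading.Lock()  # assumption: models an atomic store/exchange

    def mark_overflow(self):
        # Every overflowing thread writes the same sentinel value.
        with self._lock:
            self._flag = 1

    def retrieve_and_reset(self):
        # Read the flag and clear it in one step.
        with self._lock:
            value, self._flag = self._flag, 0
        return value


def sync_raises(runtime):
    """Model of a sync: poll the flag once and report whether it would raise."""
    try:
        if runtime.retrieve_and_reset():
            raise AssertionError("adstack overflow; consider a larger ad_stack_size")
    except AssertionError:
        return True
    return False


runtime = RuntimeModel()
threads = [threading.Thread(target=runtime.mark_overflow) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

first_sync_raised = sync_raises(runtime)   # overflow is reported once
second_sync_raised = sync_raises(runtime)  # flag was cleared by the first poll
```

A real regression test would presumably do the same thing against the actual kernel: catch the exception from the first sync, then assert that a second sync completes quietly.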

hughperkins · 2026-04-17T13:52:34Z

Please could you add tests for untested functionality implied by the design:

1. Flag reset after catch. The flag is reset in runtime_retrieve_and_reset_adstack_overflow, but no test verifies that after catching the overflow exception, a subsequent qd.sync() does not raise again. A second sync after the pytest.raises block would cover this.
2. Increasing ad_stack_size resolves the overflow. The error message tells users to pass ad_stack_size=N to qd.init(). No test verifies that doing so actually makes the same kernel succeed.
3. Overflow on SPIR-V. The test accepts both AssertionError and RuntimeError, and the docstring describes the SPIR-V path, but the SPIR-V code path isn't in this diff. If SPIR-V doesn't have equivalent overflow detection, the test would silently pass on LLVM-only CI without ever validating the SPIR-V claim.
4. Multi-threaded overflow. The comment in stack_push explains the race is benign (all threads write the same sentinel). No test uses a multi-element field where multiple threads overflow simultaneously.
5. Gradients are actually wrong without the safety check. The test only checks that an exception is raised, not that the gradients would have been silently wrong without it. A companion test at a just-under-capacity iteration count showing correct gradients would strengthen the argument.
6. Teardown safety. The finalizing_ flag exists specifically so that an overflow during ~Program() → finalize() → synchronize() doesn't throw into a destructor and terminate the process. No test covers this path, e.g., triggering overflow and then letting the qd.init() scope exit without an explicit sync.

hughperkins · 2026-04-17T14:25:16Z

Oh, also, please add tests for the rows marked 'No' in the table above, i.e.:

stack_pop underflow guard (if (n > 0))
stack_top_primal clamping (n > 0 ? n - 1 : 0)
check_adstack_overflow early return when llvm_runtime_ == nullptr
LlvmProgramImpl::finalize() sets finalizing_ = true
internal_functions.h test_stack updated

claude

LGTM — the critical teardown fix (pre_finalize hook) and all prior review comments have been addressed; the two new inline nits are documentation/hygiene only.

Extended reasoning...

Overview

This PR adds LLVM adstack overflow detection across all LLVM backends (x64, arm64, CUDA, AMDGPU). It adds a dedicated adstack_overflow_flag field to LLVMRuntime, bounds-checks in stack_push/stack_pop/stack_top_primal, a new check_adstack_overflow() called unconditionally on every sync, and a pre_finalize() virtual hook that sets finalizing_ = true before the two teardown syncs so overflow exceptions cannot escape into ~Program(). Tests cover the overflow, flag-reset, multi-threaded, large-capacity, and teardown paths.

Security risks

No security-sensitive code is touched. The change adds a flag and a bounds check inside the LLVM JIT runtime; the overflow flag is a simple i64 field with no auth or crypto involvement.

Level of scrutiny

Medium. The change is logic-heavy (new virtual hook ordering relative to Program::finalize call sequence), but this was the critical issue I flagged in the prior review and it was correctly fixed: Program::finalize() now calls pre_finalize() before the two synchronize() calls, and LlvmProgramImpl::pre_finalize() correctly sets finalizing_ = true. The teardown test was also tightened to exercise the actual unsynced-then-exit path rather than the already-safe pre-cleared-flag path.

Other factors

Both bugs now being posted as inline comments are minor nits: (1) the finalizing_ field comment says 'Flipped on by finalize()' when pre_finalize() is the authoritative setter — documentation only, no logic impact; (2) a 136-byte heap allocation in test_stack() is never freed — pre-existing, OS-reclaimed at exit, inconsequential in CI. Neither warrants blocking the PR. All prior review comments were addressed.
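
The ordering issue behind the pre_finalize() hook can be sketched in a few lines of Python. This is only a model under assumptions: `ProgramImplModel` and `teardown_is_clean` are hypothetical, and the two back-to-back syncs simply mirror the "two teardown syncs" mentioned in the review.

```python
class ProgramImplModel:
    """Toy model of the program impl's sync/teardown state (hypothetical)."""

    def __init__(self):
        self.overflow_flag = False
        self.finalizing = False

    def pre_finalize(self):
        # Must run before the teardown syncs to suppress the overflow poll.
        self.finalizing = True

    def synchronize(self):
        if self.finalizing:
            return  # poll suppressed during teardown
        if self.overflow_flag:
            self.overflow_flag = False
            raise AssertionError("adstack overflow")


def teardown_is_clean(call_pre_finalize):
    """Run the modelled teardown with a pending, unsynced overflow."""
    impl = ProgramImplModel()
    impl.overflow_flag = True  # overflow happened, user never synced
    try:
        if call_pre_finalize:
            impl.pre_finalize()
        impl.synchronize()  # first teardown sync
        impl.synchronize()  # second teardown sync
    except AssertionError:
        return False  # in C++ this would escape into a destructor
    return True


with_hook = teardown_is_clean(call_pre_finalize=True)
without_hook = teardown_is_clean(call_pre_finalize=False)
```

Without the hook the first teardown sync raises in the model, which corresponds to the exception escaping into ~Program() in the real code path.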

The pre-existing LLVM runtime `stack_push` had a bare `TODO: assert n <= max_elements` so loops that exceeded the adstack capacity silently incremented `n` past the allocated region, writing out-of-bounds and handing back wrong gradients (or crashing). `stack_pop` was also unguarded and happily underflowed `n` into UINT_MAX.

Make the overflow visible:

- `stack_push` takes the runtime pointer, skips the store/increment on overflow, and flips a new `runtime->adstack_overflow_flag`. Codegen (`codegen_llvm.cpp`) passes `get_runtime()` at every push site.
- `stack_pop` no-ops when `n == 0`; `stack_top_primal` clamps `idx` to 0 so reads on an underflowed stack stay in-bounds (garbage value, but the host raises before it is consumed).
- `LlvmRuntimeExecutor::check_adstack_overflow()` polls the flag via a runtime-retrieve JIT call and throws `QuadrantsAssertionError`. Unlike `check_runtime_error`, this runs on every `synchronize()` (not gated on `compile_config.debug`) because silent wrong gradients are not acceptable to hide.
- `LlvmProgramImpl::synchronize()` calls it after the regular sync. The result buffer is cached on `materialize_runtime` so internal polls do not need to thread the pointer through the public API.

Covered by `test_adstack_overflow_raises`: a 64-iteration dynamic loop on a `default_ad_stack_size=32` adstack is expected to raise either `AssertionError` (LLVM via `check_runtime_error`) or `RuntimeError` (future SPIR-V path via `QD_ERROR`) at the next `qd.sync()`.

…d, and multi-threaded raise.

…t-code check.

…, and push overflow coverage.

…then-exit, drop subprocess timeout.

… teardown-subprocess test docstring.

…rdown syncs; pin ad_stack_size on overflow tests; tighten subprocess test to drop inner sync so teardown path is exercised.

… and plug test_stack() heap leak.

…subprocess test cleanly when adstack unavailable.

claude

LGTM — all previously raised issues have been addressed.

Extended reasoning...

Overview

This PR adds LLVM adstack overflow detection across all LLVM backends (x64, arm64, CUDA, AMDGPU). It introduces a dedicated adstack_overflow_flag field, stack_push bounds checking with atomic writes, stack_pop/stack_top_primal underflow clamping, a check_adstack_overflow() polling method called unconditionally at every sync, and a pre_finalize() virtual hook to suppress the poll during Program teardown.

Security risks

No security-sensitive code paths (auth, crypto, permissions) are touched. The data race concern on the overflow flag was correctly addressed with __atomic_store_n + __atomic_exchange_n.

Level of scrutiny

This PR is runtime-critical (JIT execution path for every AD kernel) and received very thorough review. Multiple rounds of comments identified real bugs — the finalizing_ flag timing issue, the missing pre_finalize() hook, incorrect inline comments, the heap leak in test_stack(), and the missing adstack guard on the teardown subprocess test — all of which the author addressed. The one flagged pre-existing issue (AdStackAllocaStmt::size_in_bytes() using a 4-byte header vs the runtime's 8-byte header) predates this PR and is silently absorbed by LLVM alignment padding on all supported platforms; it warrants a follow-up fix in statements.h but does not affect correctness here.

Other factors

Test coverage is comprehensive: basic overflow raise, flag reset after catch, large-capacity workaround, multithreaded overflow, and teardown subprocess. All previous reviewer concerns are resolved.

…ime, was sized as int32 (4B too small).

claude

LGTM — all issues from previous review rounds addressed.

Extended reasoning...

Overview

This PR adds adstack-overflow detection to the LLVM backend: stack_push sets a dedicated adstack_overflow_flag on overflow (rather than trapping), check_adstack_overflow() polls the flag unconditionally on every synchronize(), and LlvmProgramImpl::pre_finalize() sets finalizing_=true before the two teardown syncs so overflow exceptions cannot escape into ~Program(). Ten files changed across codegen, IR, runtime, program lifecycle, and tests.

Security risks

No security-sensitive code paths (no auth, crypto, or permissions). The only risk was the original silent UB (out-of-bounds write on overflow), which this PR fixes.

Level of scrutiny

This PR went through four rounds of review, surfacing: comment inaccuracies, a real finalizing-flag timing bug (std::terminate risk), a heap leak in test_stack(), a missing adstack-extension skip guard, a C++11 memory-model data race, and a 4-byte header size mismatch in size_in_bytes(). All were fixed. The current code matches the invariants documented in each comment, the tests exercise the critical paths (flag reset, large capacity, multithreaded, teardown subprocess), and the bug hunting system found no remaining issues.

Other factors

Test coverage is thorough: overflow raises, flag resets after catch, large capacity resolves overflow, multithreaded overflow, teardown-without-explicit-sync subprocess. The pre_finalize() hook comment explicitly warns that moving the assignment back into finalize() alone would reintroduce the std::terminate bug — a good maintenance guard. The atomic fixup (__ATOMIC_RELAXED store + exchange) is the minimal correct change with zero runtime cost on x86-64/ARM64.

…rker-thread stack budget

… the worker-thread stack budget

claude

LGTM — all review issues addressed, including the CPU-only arch guard on the 256 KB adstack budget check.

Extended reasoning...

Overview

The PR adds adstack overflow detection to the LLVM autodiff runtime: stack_push now sets runtime->adstack_overflow_flag via a relaxed atomic store on overflow, stack_pop/stack_top_primal have underflow/clamping guards, LlvmRuntimeExecutor::check_adstack_overflow() polls the flag on every synchronize(), and LlvmProgramImpl::pre_finalize() suppresses the poll during teardown. Additionally AdStackAllocaStmt::size_in_bytes() is corrected from sizeof(int32) to sizeof(int64) to match the runtime layout. 11 files touched across codegen, IR, runtime, and tests.

Security risks

No security-sensitive surfaces (no auth, crypto, or permission code). The relaxed-atomic fix for the data race in stack_push is a correctness improvement.

Level of scrutiny

This is production-critical runtime code (autodiff correctness on all LLVM backends), but the changes are well-bounded safety additions rather than logic rewrites. Every bug raised across multiple review rounds was addressed:

Comment inaccuracies (error_code vs adstack_overflow_flag) — fixed
finalizing_ timing (pre_finalize() before teardown syncs) — fixed
sizeof header mismatch — fixed
Atomic write for multi-threaded overflow — fixed
CPU-only arch guard on the 256 KB budget check — fixed (fixup commit 851d8fd)
test_stack heap leak — fixed
Test ad_stack_size pins and extension guards — fixed

Other factors

Test coverage is comprehensive: overflow raises, flag reset, teardown safety, multi-threaded overflow, and large-capacity resolution are all exercised. The bug hunting system found no new issues in the final state. The unresolved inline thread on the GPU budget check is moot because the arch guard is present in the submitted diff.

…_safety

hughperkins · 2026-04-20T13:18:14Z

Opus says this PR is three ~independent fixes. Please could we split it into three PRs?

These three fixes are orthogonal and could be PR'd separately?

Yes — all three are independently mergeable, with one soft ordering preference. Sketch:

Fix 3: AdStackAllocaStmt::size_in_bytes header size

• Scope: 1 file, ~4 lines (quadrants/ir/statements.h).
• Dependencies: none. Pre-existing bug — runtime always used u64 for the header, LLVM alloca was 4 B short.
• Test: none new required (existing tests now allocate the correct size, and the extended test_stack in fix 1 would surface it, but the fix stands on its own).
• Risk: near-zero. Just makes the alloca 4 B larger.
• Could ship: tomorrow. Smallest, cleanest, highest-confidence change in the PR.

Fix 1: runtime overflow → Python exception + teardown safety

• Scope: ~8 files (runtime.cpp, internal_functions.h, llvm_runtime_executor.{cpp,h}, llvm_program.h, program.cpp, program_impl.h, codegen_llvm.cpp for the stack_push signature change + Python tests).
• Dependencies: none semantically, but technically the stack_push(runtime, stack, max, elem_size) signature change touches codegen_llvm.cpp too, so any PR landing in parallel that touches that callsite needs rebasing.
• Self-contained test coverage: yes, all five new Python tests + the extended test_stack in internal_functions.h.
• Risk: moderate. New virtual hook (pre_finalize), new runtime flag, changes the always-on path in synchronize(). The teardown subprocess test is specifically pinning one worst-case interaction.
• Could ship: as its own PR. It's the biggest piece but internally coherent; the push-bounds-check, pop-clamp, top-of-stack clamp, flag-poll, and pre_finalize teardown hook are all one story: "overflow should surface as an exception, not as silent corruption, even in the unsynced-exit path."

Fix 2: codegen per-task stack-budget guard

• Scope: 2 files (codegen_llvm.cpp, codegen_llvm.h).
• Dependencies: ideally lands after fix 3 so the 256 KB budget math sums accurate per-adstack sizes (not 4 B-too-small figures). Not a correctness issue at this budget granularity, but technically cleaner.
• Self-contained test coverage: none in this PR. The PR's five new tests all exercise fix 1 (runtime overflow). Fix 2 is the codegen-time check and should have at least one test that deliberately crosses 256 KB at ad_stack_size=4096 and asserts the QD_ERROR_IF fires. Splitting this out as its own PR makes the absence of that test more visible and easier to address.
• Risk: moderate. CPU-only gating with a hard-coded constant; a kernel that works fine today but happens to straddle 256 KB will start failing at compile time. Worth a dedicated review pass.
• Could ship: as its own PR, ideally after fix 3, and ideally with a test added first.
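
To make the budget arithmetic concrete, here is a rough Python sketch of a per-task check. Everything in it is an assumption for illustration: the helper names, the 8-byte header plus `max_size * entry_bytes` sizing formula, the 8-byte entry (an f32 primal plus an f32 adjoint), and the use of RuntimeError in place of QD_ERROR_IF. Only the 256 KB figure, the CPU-only gating, and ad_stack_size=4096 come from the discussion above.

```python
STACK_BUDGET_BYTES = 256 * 1024  # budget quoted in the review


def adstack_bytes(max_size, entry_bytes, header_bytes=8):
    """Assumed layout: one header followed by max_size fixed-size entries."""
    return header_bytes + max_size * entry_bytes


def check_task_budget(adstacks, is_cpu=True):
    """adstacks: list of (max_size, entry_bytes) pairs for one task.

    Returns the total byte count; raises if a CPU task exceeds the budget.
    """
    total = sum(adstack_bytes(m, e) for m, e in adstacks)
    if is_cpu and total > STACK_BUDGET_BYTES:
        raise RuntimeError(
            f"adstacks need {total} bytes, budget is {STACK_BUDGET_BYTES}"
        )
    return total


# With ad_stack_size=4096 and 8-byte entries, each adstack takes 32776 bytes.
seven_stacks = check_task_budget([(4096, 8)] * 7)

try:
    check_task_budget([(4096, 8)] * 8)
    eight_stacks_rejected = False
except RuntimeError:
    eight_stacks_rejected = True

# The modelled guard is CPU-only, so the same task passes on a GPU arch.
eight_stacks_gpu = check_task_budget([(4096, 8)] * 8, is_cpu=False)
```

Under these assumed sizes, seven adstacks stay below the budget and the eighth crosses it by a small margin, which is the kind of boundary a dedicated test for fix 2 would want to pin.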

Suggested order

1. Fix 3 first (trivial, unblocks the others). One-line PR.
2. Fix 1 second (big, self-contained, carries the Python tests that are directly visible to users).
3. Fix 2 third (smallest of the two behavior changes, but wants a test the current PR doesn't include; easier to add that test in a dedicated PR than to stretch this one further).

Why they're getting reviewed together anyway

They share a theme, "stop silent wrong gradients in the adstack path", and they were probably discovered together while chasing the same symptom. That's normal. But for review, they land in different files with different risk profiles and different test needs, and reviewers can sign off on them independently. If CI time per PR is cheap, splitting is a straight win for reviewability and bisectability. If the team prefers bigger atomic changes, the current shape is defensible, but in that case the PR description should at least be explicit that fix 2 is untested in this PR.

it sounds like we need the test from fix 1 for fix 3?

Yes — kind of. Let me be precise about what "needs" means here.

Fix 3 is provably correct by inspection

The runtime reads/writes the header through *(u64 *)stack at every site (stack_init, stack_push, stack_top_primal). The alloca was sized as if the header were int32. Changing it to int64 makes the alloca size match the layout the runtime has always used. There's no semantic ambiguity to argue over.

But it has no executable validation in isolation

Without fix 1, the only test exercising the adstack at the C++ level is the old test_stack, which does 4 pushes into a max_size=16 stack, nowhere near the corrupted region. The corruption from the int32 sizing is in the last 4 bytes of the alloca, which only gets touched when pushes approach max_size. So an old-test_stack-only validation regime would happily pass with fix 3 reverted.

That's why I said "obviously correct by inspection but with no executable proof", and why fix 1's rewritten test_stack (which pushes to max_size and reads *(u64 *)stack) is the test that incidentally pins fix 3.
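
The size mismatch can be worked through numerically. The sketch below assumes a layout of one header followed by `max_size` fixed-size entries and picks an 8-byte entry purely for illustration; the helper name and the entry size are not taken from the PR.

```python
def alloca_bytes(max_size, entry_bytes, header_bytes):
    """Assumed layout: header followed by max_size entries."""
    return header_bytes + max_size * entry_bytes


max_size = 16     # matches the max_size=16 stack used by the old test_stack
entry_bytes = 8   # assumption for illustration only

allocated_old = alloca_bytes(max_size, entry_bytes, header_bytes=4)  # int32 header
required = alloca_bytes(max_size, entry_bytes, header_bytes=8)       # u64 header
shortfall = required - allocated_old

# The runtime places entry k at byte offset 8 + k * entry_bytes, so this many
# entries fit entirely inside the old, too-small allocation:
safe_pushes = (allocated_old - 8) // entry_bytes
```

With these numbers the allocation is 4 bytes short and only the final push (number 16) reaches past it, so a test that stops after 4 pushes would not notice the difference.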

What this means for splitting

Three options, in increasing test discipline:

1. Ship fix 3 alone, no test. Justified on "matches runtime layout, strictly increases alloca size by 4 bytes, can't make anything worse." Defensible but leaves the bug class (alloca-size vs runtime-layout drift) without a regression pin.
2. Ship fix 3 alone, with a 5-line test. Add a minimal C++ test (or piggyback on existing test_stack) that pushes to max_size and asserts *(u64 *)stack == max_size. That single assertion would have failed under the old int32 sizing because the high 4 bytes of the header would have been clobbered by the first slot's writes. Smallest validating change.
3. Ship fix 3 + fix 1 together (current shape, sort of). Fix 1's test_stack rewrite covers fix 3 as a side effect. You don't need to write a separate test, but you also can't bisect a fix-3 regression to fix 3 itself.

If splitting, option 2 is the cleanest: a one-liner fix plus a one-line assertion is still a tiny PR, and it gives fix 3 its own regression pin instead of relying on fix 1's coverage. Then fix 1 lands on top with its more thorough overflow-path tests.

The general principle: a fix that's "obviously right" but covered only by an unrelated test is one accidental refactor away from being silently undone. A 5-line dedicated test is cheap insurance.

hughperkins · 2026-04-20T13:19:18Z

So let's split into the following PRs, please:

AdStackAllocaStmt::size_in_bytes header size (with small 5-line test)
runtime overflow → Python exception + teardown safety
codegen per-task stack-budget guard

claude · 2026-04-20T13:43:59Z

+  // Non-owning cache of the Program-owned result buffer so internal polls (adstack overflow, etc.) can be
+  // invoked from `synchronize()` without threading the pointer through the public API. Ownership stays with
+  // `Program` for its lifetime; reallocating or repointing `Program::result_buffer_` mid-run would invalidate
+  // this cache, so avoid that.
+  uint64 *result_buffer_cache_{nullptr};


🟡 The comment on result_buffer_cache_ (llvm_runtime_executor.h:143) warns against reallocating Program::result_buffer_ (with trailing underscore), but the actual field in Program is result_buffer (no trailing underscore). A developer grepping for result_buffer_ in Program will find nothing, defeating the purpose of the warning. Change to Program::result_buffer.

Extended reasoning...

What the bug is and how it manifests

The newly added field-level comment on result_buffer_cache_ (llvm_runtime_executor.h lines 141-145) reads: "reallocating or repointing Program::result_buffer_ mid-run would invalidate this cache". The trailing underscore is wrong: the actual field in Program is declared as result_buffer (no underscore), as seen in program.cpp (result_buffer = nullptr). This is a documentation inaccuracy introduced by this PR.

The specific code path

The comment was added alongside the result_buffer_cache_ field declaration in this PR. The intent is to warn maintainers that the non-owning cache pointer will be invalidated if the Program-owned result buffer is ever reallocated or reassigned. The warning is meaningful and correct in intent — but the identifier it names, Program::result_buffer_, does not exist.

Why existing code does not prevent it

This is a pure documentation error. The implementation is entirely correct: result_buffer_cache_ is populated at materialize_runtime time and the cache is valid for the program's lifetime. The only problem is the field name in the English comment.

Impact on maintainability

A developer who reads the warning and then searches the codebase for Program::result_buffer_ to understand the lifetime contract will find zero results (the field result_buffer_ does not exist in Program). They may then incorrectly conclude either (a) the comment is stale and can be ignored, or (b) the cache is safe to keep even when Program manages its result buffer differently. The intended warning loses all grep-navigability.

How to fix it

Change Program::result_buffer_ to Program::result_buffer (drop the trailing underscore) in the comment at llvm_runtime_executor.h:143.

Step-by-step proof

Read llvm_runtime_executor.h lines 141-145 (added by this PR): comment says Program::result_buffer_.

Search program.cpp for result_buffer: field is declared as uint64 *result_buffer{nullptr} (no underscore) and accessed as result_buffer = nullptr, &result_buffer, etc. throughout the file.

Grep for result_buffer_ as a Program member: no match anywhere in the codebase.

Conclusion: the comment uses the wrong identifier; Program::result_buffer is the correct name.

chatgpt-codex-connector bot reviewed Apr 17, 2026

Comment thread quadrants/runtime/program_impls/llvm/llvm_program.h Outdated

claude bot reviewed Apr 17, 2026

Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp Outdated

Comment thread tests/python/test_adstack.py Outdated

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from e281a6e to 196977b Compare April 17, 2026 11:37

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from a5d3009 to 79a56b0 Compare April 17, 2026 11:37

duburcqa mentioned this pull request Apr 17, 2026

[AutoDiff] Autodiff 8: implement adstack for SPIR-V #490

Open

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 196977b to c73cb3d Compare April 17, 2026 11:44

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch 2 times, most recently from 91f44c5 to 6abc4aa Compare April 17, 2026 11:53

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from c73cb3d to 3d0ecaf Compare April 17, 2026 12:12

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 6abc4aa to bedaa69 Compare April 17, 2026 12:12

duburcqa mentioned this pull request Apr 17, 2026

[AutoDiff] Autodiff 1: Add baseline adstack regression test for unary_collections #500

Open

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 3d0ecaf to 7b52cc8 Compare April 17, 2026 12:18

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from bedaa69 to 6ba6b0a Compare April 17, 2026 12:18

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 7b52cc8 to 2a30384 Compare April 17, 2026 12:29

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 6ba6b0a to 7e7b15d Compare April 17, 2026 12:29

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 2a30384 to b390eb0 Compare April 17, 2026 12:31

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 7e7b15d to d68e626 Compare April 17, 2026 12:31

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from b390eb0 to 4984a2b Compare April 17, 2026 12:42

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from d68e626 to bd14f25 Compare April 17, 2026 12:43

duburcqa mentioned this pull request Apr 17, 2026

[Bug]: loss.backward() hangs indefinitely for articulated robots with freejoint + child joints Genesis-Embodied-AI/Genesis#2537

Open

hughperkins changed the title ~~[AutoDiff] Surface LLVM adstack push/pop overflow as a Python exception~~ [AutoDiff] Autodiff 7: Surface LLVM adstack push/pop overflow as a Python exception Apr 17, 2026

hughperkins reviewed Apr 17, 2026

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 4984a2b to 1cb9ad7 Compare April 17, 2026 15:57

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from bd14f25 to d1f04cc Compare April 17, 2026 15:57

claude bot reviewed Apr 17, 2026

Comment thread quadrants/runtime/program_impls/llvm/llvm_program.h

Comment thread tests/python/test_adstack.py

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 1cb9ad7 to 91db488 Compare April 17, 2026 20:31

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from d1f04cc to ed092f0 Compare April 17, 2026 20:31

claude bot reviewed Apr 17, 2026

Comment thread quadrants/runtime/program_impls/llvm/llvm_program.h Outdated

Comment thread quadrants/runtime/llvm/runtime_module/internal_functions.h

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 91db488 to 32d3599 Compare April 17, 2026 21:36

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from ed092f0 to c776e9d Compare April 17, 2026 21:36

claude bot reviewed Apr 17, 2026

Comment thread tests/python/test_adstack.py

Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp Outdated

duburcqa added 9 commits April 18, 2026 00:29

[AutoDiff] Cover adstack overflow flag reset, ad_stack_size workaround, and multi-threaded raise.

9d5a462

[AutoDiff] Cover adstack overflow during teardown with subprocess exit-code check.

f542a9f

[Runtime] Extend test_stack with pop underflow, top-of-stack clamping, and push overflow coverage.

94b7d18
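
The guarded semantics this commit covers (push skips the store and increment on overflow and sets a flag, pop is a no-op on an empty stack, top-of-stack reads clamp the index to 0) can be sketched as a small pure-Python model. This is an illustration only, not the runtime's C++ code: the class `GuardedAdStack`, the helper `check_overflow`, and the reset-on-raise behaviour are hypothetical names and assumptions.

```python
class GuardedAdStack:
    """Pure-Python model of a fixed-capacity AD stack with guarded push/pop."""

    def __init__(self, max_elements):
        self.max_elements = max_elements
        self.n = 0  # number of live entries
        self.primal = [0.0] * max_elements
        # Stands in for runtime->adstack_overflow_flag (name taken from the PR summary).
        self.overflow_flag = False

    def push(self, value):
        if self.n >= self.max_elements:
            # Overflow: skip the store and the increment, only record the failure.
            self.overflow_flag = True
            return
        self.primal[self.n] = value
        self.n += 1

    def pop(self):
        # Underflow guard: popping an empty stack leaves the counter at 0.
        if self.n > 0:
            self.n -= 1

    def top_primal(self):
        # Clamp the index to 0 so a read on an empty stack stays in bounds.
        return self.primal[max(self.n - 1, 0)]


def check_overflow(stack):
    """Host-side poll. Resetting the flag before raising is an assumption."""
    if stack.overflow_flag:
        stack.overflow_flag = False
        raise AssertionError("adstack overflow")
```

Pushing 64 values onto `GuardedAdStack(32)` leaves `n` at 32 with the flag set, which mirrors the 64-iteration, `default_ad_stack_size=32` scenario described in the PR summary.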

[AutoDiff] Subprocess teardown test: write kernel to tempfile, catch-then-exit, drop subprocess timeout.

024df79

[AutoDiff] Reinstate user-facing preamble + Internal details split on teardown-subprocess test docstring.

2f8dea3

[AutoDiff] Pre-finalize hook sets finalizing_ flag before Program teardown syncs; pin ad_stack_size on overflow tests; tighten subprocess test to drop inner sync so teardown path is exercised.

a3462ac

[AutoDiff] Fix finalizing_ field comment (pre_finalize is the setter) and plug test_stack() heap leak.

d6fa831

[AutoDiff] Atomicize adstack_overflow_flag read/write; skip teardown-subprocess test cleanly when adstack unavailable.

c855375

duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 32d3599 to c6cca68 Compare April 17, 2026 22:34

duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from c776e9d to c855375 Compare April 17, 2026 22:34

claude bot reviewed Apr 17, 2026

Comment thread quadrants/runtime/llvm/runtime_module/internal_functions.h

[Runtime] Fix AdStackAllocaStmt::size_in_bytes: header is u64 in runtime, was sized as int32 (4B too small).

389c2d3
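
The 4-byte shortfall named in this commit title follows from the header widths alone. A minimal sketch of the arithmetic, assuming a layout of one header followed by `max_size` entries that each hold a primal and an adjoint value (the layout and the helper `adstack_size_in_bytes` are assumptions for illustration, not taken from the source):

```python
import struct

HEADER_I32 = struct.calcsize("<i")  # width the compiler side previously assumed
HEADER_U64 = struct.calcsize("<Q")  # width the runtime actually uses for the header


def adstack_size_in_bytes(max_size, element_size, header_size):
    # Assumed layout: header, then max_size entries of (primal, adjoint).
    entry_size = 2 * element_size
    return header_size + max_size * entry_size


# For any capacity, the two header choices differ by exactly HEADER_U64 - HEADER_I32 bytes.
shortfall = (adstack_size_in_bytes(32, 4, HEADER_U64)
             - adstack_size_in_bytes(32, 4, HEADER_I32))
```

Under these assumptions the shortfall is independent of `max_size` and `element_size`, so an allocation sized with the int32 header is always 4 bytes short of what the runtime addresses.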

claude bot reviewed Apr 18, 2026

[AutoDiff/LLVM] Raise at codegen when per-task adstack exceeds the worker-thread stack budget

7381267

claude bot reviewed Apr 18, 2026

Comment thread quadrants/codegen/llvm/codegen_llvm.cpp Outdated

fixup! [AutoDiff/LLVM] Raise at codegen when per-task adstack exceeds the worker-thread stack budget

851d8fd

claude bot reviewed Apr 18, 2026

duburcqa mentioned this pull request Apr 19, 2026

[SPIRV] Use native float view in load/store_buffer to avoid aliasing with atomics #513

Open

Merge branch 'duburcqa/fix_ad_correctness' into duburcqa/llvm_adstack_safety

2e738b1

claude bot reviewed Apr 20, 2026

Conversation

hughperkins commented Apr 20, 2026

These three fixes are orthogonal and could be PR'd separately?

It sounds like we need the test from fix 1 for fix 3?
