Skip to content

[AutoDiff] Autodiff 7: Surface LLVM adstack push/pop overflow as a Python exception#495

Open
duburcqa wants to merge 13 commits intoduburcqa/fix_ad_correctnessfrom
duburcqa/llvm_adstack_safety
Open

[AutoDiff] Autodiff 7: Surface LLVM adstack push/pop overflow as a Python exception#495
duburcqa wants to merge 13 commits intoduburcqa/fix_ad_correctnessfrom
duburcqa/llvm_adstack_safety

Conversation

@duburcqa
Copy link
Copy Markdown
Contributor

@duburcqa duburcqa commented Apr 17, 2026

Summary

Split out of #490. LLVM-side AD runtime safety fixes that are orthogonal to the SPIR-V adstack enablement. Apply on every LLVM backend (x64, arm64, CUDA, AMDGPU) regardless of whether the Metal/Vulkan path is enabled.

Before this PR, stack_push in the LLVM runtime module had a bare TODO: assert n <= max_elements - loops that exceeded the adstack capacity silently incremented the counter past the allocated region, writing out-of-bounds and handing back wrong gradients (or crashing). stack_pop was also unguarded and happily underflowed the counter.

What changed

  • stack_push now takes the runtime pointer, skips the store/increment on overflow, and flips a new runtime->adstack_overflow_flag. Codegen passes get_runtime() at every push site.
  • stack_pop no-ops when n == 0; stack_top_primal clamps the index to 0 so reads on an underflowed stack stay in-bounds (garbage value, but the host raises before it is consumed).
  • LlvmRuntimeExecutor::check_adstack_overflow() polls the flag and throws QuadrantsAssertionError. Unlike check_runtime_error, this runs on every synchronize() (not gated on compile_config.debug).
  • LlvmProgramImpl::synchronize() calls the new check after the regular sync. The result buffer is cached on materialize_runtime so internal polls do not need to thread the pointer through the public API.
  • LlvmProgramImpl::finalize() sets a finalizing_ flag that suppresses the poll during teardown, so an adstack-overflow raise cannot escape into ~Program() and terminate the process.

Covered by test_adstack_overflow_raises (lands here): 64 iterations with default_ad_stack_size=32 raise AssertionError at the next qd.sync().

Stacked on #491. Feeds into #490.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5d3009268

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread quadrants/runtime/program_impls/llvm/llvm_program.h Outdated
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp Outdated
Comment thread tests/python/test_adstack.py Outdated
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from e281a6e to 196977b Compare April 17, 2026 11:37
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from a5d3009 to 79a56b0 Compare April 17, 2026 11:37
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 196977b to c73cb3d Compare April 17, 2026 11:44
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch 2 times, most recently from 91f44c5 to 6abc4aa Compare April 17, 2026 11:53
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from c73cb3d to 3d0ecaf Compare April 17, 2026 12:12
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 6abc4aa to bedaa69 Compare April 17, 2026 12:12
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 3d0ecaf to 7b52cc8 Compare April 17, 2026 12:18
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from bedaa69 to 6ba6b0a Compare April 17, 2026 12:18
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 7b52cc8 to 2a30384 Compare April 17, 2026 12:29
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 6ba6b0a to 7e7b15d Compare April 17, 2026 12:29
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 2a30384 to b390eb0 Compare April 17, 2026 12:31
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 7e7b15d to d68e626 Compare April 17, 2026 12:31
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from b390eb0 to 4984a2b Compare April 17, 2026 12:42
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from d68e626 to bd14f25 Compare April 17, 2026 12:43
@hughperkins hughperkins changed the title [AutoDiff] Surface LLVM adstack push/pop overflow as a Python exception [AutoDiff] Autodiff 7: Surface LLVM adstack push/pop overflow as a Python exception Apr 17, 2026
@hughperkins
Copy link
Copy Markdown
Collaborator

Opus review of test factoriztaion:

Looking at the four tests side by side:

Test Loop type Body Expected grad
test_adstack_sum_fixed_coeff dynamic range(3) y += v 3.0
test_adstack_sum_constant_coeffs dynamic range(3) y += v * (a+1) 6.0
test_adstack_sum_static_loop_correct static qd.static(range(3)) y += v 3.0
test_adstack_basic_gradient dynamic range(10) v = v * 0.95 + 0.01 0.95^10

Tests 2, 3, and 4 share identical setup/teardown/assertion structure and only differ in two dimensions: dynamic vs static loop, and fixed vs varying coefficient. These three could collapse into a single parametrized test using if qd.static(...) to switch at compile time:

@pytest.mark.parametrize("use_static_loop", [True, False])
@pytest.mark.parametrize("use_varying_coeff", [True, False])
@test_utils.test(require=qd.extension.adstack, ad_stack_experimental_enabled=True)
def test_adstack_sum_linear(use_static_loop, use_varying_coeff):
    x = qd.field(qd.f32)
    y = qd.field(qd.f32)
    qd.root.dense(qd.i, 1).place(x, x.grad)
    qd.root.place(y, y.grad)

    @qd.kernel
    def compute():
        for i in x:
            v = x[i]
            if qd.static(use_static_loop):
                r = qd.static(range(3))
            else:
                r = range(3)
            for a in r:
                if qd.static(use_varying_coeff):
                    y[None] += v * qd.cast(a + 1, qd.f32)
                else:
                    y[None] += v

    x[0] = 1.0
    y[None] = 0.0
    compute()
    y.grad[None] = 1.0
    x.grad[0] = 0.0
    compute.grad()

    expected = sum((a + 1) for a in range(3)) if use_varying_coeff else 3.0
    assert x.grad[0] == test_utils.approx(expected, rel=1e-4)

This gives you 4 combinations (including static + varying coeff, which is new free coverage) in one test, and eliminates ~50 lines of duplication.

test_adstack_basic_gradient should stay separate — it's the only one with loop-carried state mutation (v = v * 0.95 + 0.01), which is a structurally different pattern. Though it could benefit from a brief comment explaining that distinction.

@hughperkins
Copy link
Copy Markdown
Collaborator

Please could you address Opus's suggetsions on test factroization.

// The stack index `n` is clamped on read so that overflow (push past capacity) does not let subsequent pops and
// top-accesses underflow it and index far out of bounds. The corresponding stack_push sets
// `runtime->adstack_overflow_flag` and skips the increment instead of trapping, so the host-side launcher
// surfaces the failure as a Python exception rather than killing the process via __builtin_trap. When n == 0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@hughperkins
Copy link
Copy Markdown
Collaborator

Asking Opus about any runtime impact:

Hot-path branches in stack operations. stack_push, stack_pop, and stack_top_primal are called on every iteration of every AD loop that uses the adstack. This PR adds a new
branch to each:

• stack_push: bounds check if (n + 1 > max_num_elements) before every push, plus an extra LLVMRuntime *runtime pointer argument threaded through every call site.
• stack_pop: underflow guard if (n > 0) before every pop.
• stack_top_primal: conditional n > 0 ? n - 1 : 0 instead of a bare n - 1.

In normal (non-overflowing) operation the branches are always not-taken, so CPU branch prediction should handle them well. On GPU, all threads in a warp would take the same
path (not-taken), so no divergence cost. But they do add instruction count and register pressure to what is essentially the innermost loop of the backward pass.

synchronize() poll. Every qd.sync() now calls check_adstack_overflow(), which makes a JIT call into the runtime to read and reset the flag, plus a result buffer fetch. This is
once per sync (not per iteration), so it's unlikely to be measurable unless the user syncs at very high frequency.

In practice, the cost is probably negligible — a couple of predictable branches per stack operation. But if someone wanted to benchmark it, the stack push/pop overhead in a
tight AD loop with many iterations would be the place to look.

@hughperkins
Copy link
Copy Markdown
Collaborator

Opus response for test coverage:

Code changes in this PR:

Change Exercised by test?
stack_push overflow check + early return Yes — the 64-iteration kernel triggers it
stack_push sets runtime->adstack_overflow_flag = 1 Yes — the test expects the resulting exception
stack_push new LLVMRuntime *runtime parameter Yes — indirectly, the kernel wouldn't compile otherwise
stack_pop underflow guard (if (n > 0)) No — only reachable after overflow, when the reversed loop pops more times than were successfully pushed. The test raises before observing this path.
stack_top_primal clamping (n > 0 ? n - 1 : 0) No — same situation, only reachable post-overflow.
LLVMRuntime::adstack_overflow_flag field Yes
runtime_retrieve_and_reset_adstack_overflow Yes — called by check_adstack_overflow
LlvmRuntimeExecutor::check_adstack_overflow() Yes
check_adstack_overflow early return when llvm_runtime_ == nullptr No — only reachable during early teardown before materialize_runtime
result_buffer_cache_ caching in materialize_runtime Yes — indirectly, needed for check_adstack_overflow to work
LlvmProgramImpl::synchronize() calls check_adstack_overflow Yes
LlvmProgramImpl::finalize() sets finalizing_ = true No — no test verifies that overflow during teardown doesn't crash
codegen_llvm.cpp passes get_runtime() to stack_push Yes — indirectly
internal_functions.h test_stack updated No — no test calls test_internal_func_args / test_stack

Untested functionality implied by the design:

  1. Flag reset after catch. The flag is reset in runtime_retrieve_and_reset_adstack_overflow, but no test verifies that after catching the overflow exception, a subsequent qd.sync() does not raise again. A second sync after the pytest.raises block would cover this.

  2. Increasing ad_stack_size resolves the overflow. The error message tells users to pass ad_stack_size=N to qd.init(). No test verifies that doing so actually makes the same kernel succeed.

  3. Overflow on SPIR-V. The test accepts both AssertionError and RuntimeError, and the docstring describes the SPIR-V path, but the SPIR-V code path isn't in this diff. If SPIR-V doesn't have equivalent overflow detection, the test would silently pass on LLVM-only CI without ever validating the SPIR-V claim.

  4. Multi-threaded overflow. The comment in stack_push explains the race is benign (all threads write the same sentinel). No test uses a multi-element field where multiple threads overflow simultaneously.

  5. Gradients are actually wrong without the safety check. The test only checks that an exception is raised, not that the gradients would have been silently wrong without it. A companion test at a just-under-capacity iteration count showing correct gradients would strengthen the argument.

  6. Teardown safety. The finalizing_ flag exists specifically so that an overflow during ~Program() → finalize() → synchronize() doesn't throw into a destructor and terminate the process. No test covers this path — e.g., triggering overflow and then letting the qd.init() scope exit without an explicit sync.

@hughperkins
Copy link
Copy Markdown
Collaborator

hughperkins commented Apr 17, 2026

Please could you add tests for untested fucntionality implied by the design:

  1. Flag reset after catch. The flag is reset in runtime_retrieve_and_reset_adstack_overflow, but no test verifies that after catching the overflow exception, a subsequent qd.sync() does not raise again. A second sync after the pytest.raises block would cover this.

  2. Increasing ad_stack_size resolves the overflow. The error message tells users to pass ad_stack_size=N to qd.init(). No test verifies that doing so actually makes the same kernel succeed.

  3. Overflow on SPIR-V. The test accepts both AssertionError and RuntimeError, and the docstring describes the SPIR-V path, but the SPIR-V code path isn't in this diff. If SPIR-V doesn't have equivalent overflow detection, the test would silently pass on LLVM-only CI without ever validating the SPIR-V claim.

  4. Multi-threaded overflow. The comment in stack_push explains the race is benign (all threads write the same sentinel). No test uses a multi-element field where multiple threads overflow simultaneously.

5. Gradients are actually wrong without the safety check. The test only checks that an exception is raised, not that the gradients would have been silently wrong without it. A companion test at a just-under-capacity iteration count showing correct gradients would strengthen the argument.

  1. Teardown safety. The finalizing_ flag exists specifically so that an overflow during ~Program() → finalize() → synchronize() doesn't throw into a destructor and terminate the process. No test covers this path — e.g., triggering overflow and then letting the qd.init() scope exit without an explicit sync.

@hughperkins
Copy link
Copy Markdown
Collaborator

Oh, also please add tests for the rows in the table above with 'no', ie:

  • stack_pop underflow guard (if (n > 0))
  • stack_top_primal clamping (n > 0 ? n - 1 : 0)
  • check_adstack_overflow early return when llvm_runtime_ == nullptr
  • LlvmProgramImpl::finalize() sets finalizing_ = true
  • internal_functions.h test_stack updated

@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 4984a2b to 1cb9ad7 Compare April 17, 2026 15:57
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from bd14f25 to d1f04cc Compare April 17, 2026 15:57
Comment thread quadrants/runtime/program_impls/llvm/llvm_program.h
Comment thread tests/python/test_adstack.py
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 1cb9ad7 to 91db488 Compare April 17, 2026 20:31
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from d1f04cc to ed092f0 Compare April 17, 2026 20:31
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — the critical teardown fix (pre_finalize hook) and all prior review comments have been addressed; the two new inline nits are documentation/hygiene only.

Extended reasoning...

Overview

This PR adds LLVM adstack overflow detection across all LLVM backends (x64, arm64, CUDA, AMDGPU). It adds a dedicated adstack_overflow_flag field to LLVMRuntime, bounds-checks in stack_push/stack_pop/stack_top_primal, a new check_adstack_overflow() called unconditionally on every sync, and a pre_finalize() virtual hook that sets finalizing_ = true before the two teardown syncs so overflow exceptions cannot escape into ~Program(). Tests cover the overflow, flag-reset, multi-threaded, large-capacity, and teardown paths.

Security risks

No security-sensitive code is touched. The change adds a flag and a bounds check inside the LLVM JIT runtime; the overflow flag is a simple i64 field with no auth or crypto involvement.

Level of scrutiny

Medium. The change is logic-heavy (new virtual hook ordering relative to Program::finalize call sequence), but this was the critical issue I flagged in the prior review and it was correctly fixed: Program::finalize() now calls pre_finalize() before the two synchronize() calls, and LlvmProgramImpl::pre_finalize() correctly sets finalizing_ = true. The teardown test was also tightened to exercise the actual unsynced-then-exit path rather than the already-safe pre-cleared-flag path.

Other factors

Both bugs now being posted as inline comments are minor nits: (1) the finalizing_ field comment says 'Flipped on by finalize()' when pre_finalize() is the authoritative setter — documentation only, no logic impact; (2) a 136-byte heap allocation in test_stack() is never freed — pre-existing, OS-reclaimed at exit, inconsequential in CI. Neither warrants blocking the PR. All prior review comments were addressed.

Comment thread quadrants/runtime/program_impls/llvm/llvm_program.h Outdated
Comment thread quadrants/runtime/llvm/runtime_module/internal_functions.h
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 91db488 to 32d3599 Compare April 17, 2026 21:36
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from ed092f0 to c776e9d Compare April 17, 2026 21:36
Comment thread tests/python/test_adstack.py
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp Outdated
The pre-existing LLVM runtime `stack_push` had a bare `TODO: assert n <= max_elements`
so loops that exceeded the adstack capacity silently incremented `n` past the allocated
region, writing out-of-bounds and handing back wrong gradients (or crashing). `stack_pop`
was also unguarded and happily underflowed `n` into UINT_MAX.

Make the overflow visible:

- `stack_push` takes the runtime pointer, skips the store/increment on overflow, and
  flips a new `runtime->adstack_overflow_flag`. Codegen (`codegen_llvm.cpp`) passes
  `get_runtime()` at every push site.
- `stack_pop` no-ops when `n == 0`; `stack_top_primal` clamps `idx` to 0 so reads on
  an underflowed stack stay in-bounds (garbage value, but the host raises before it is
  consumed).
- `LlvmRuntimeExecutor::check_adstack_overflow()` polls the flag via a runtime-retrieve
  JIT call and throws `QuadrantsAssertionError`. Unlike `check_runtime_error`, this runs
  on every `synchronize()` (not gated on `compile_config.debug`) because silent wrong
  gradients are not acceptable to hide.
- `LlvmProgramImpl::synchronize()` calls it after the regular sync. The result buffer
  is cached on `materialize_runtime` so internal polls do not need to thread the pointer
  through the public API.

Covered by `test_adstack_overflow_raises`: a 64-iteration dynamic loop on a
`default_ad_stack_size=32` adstack is expected to raise either `AssertionError` (LLVM
via `check_runtime_error`) or `RuntimeError` (future SPIR-V path via `QD_ERROR`) at
the next `qd.sync()`.
…rdown syncs; pin ad_stack_size on overflow tests; tighten subprocess test to drop inner sync so teardown path is exercised.
…subprocess test cleanly when adstack unavailable.
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 32d3599 to c6cca68 Compare April 17, 2026 22:34
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from c776e9d to c855375 Compare April 17, 2026 22:34
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — all previously raised issues have been addressed.

Extended reasoning...

Overview

This PR adds LLVM adstack overflow detection across all LLVM backends (x64, arm64, CUDA, AMDGPU). It introduces a dedicated adstack_overflow_flag field, stack_push bounds checking with atomic writes, stack_pop/stack_top_primal underflow clamping, a check_adstack_overflow() polling method called unconditionally at every sync, and a pre_finalize() virtual hook to suppress the poll during Program teardown.

Security risks

No security-sensitive code paths (auth, crypto, permissions) are touched. The data race concern on the overflow flag was correctly addressed with __atomic_store_n + __atomic_exchange_n.

Level of scrutiny

This PR is runtime-critical (JIT execution path for every AD kernel) and received very thorough review. Multiple rounds of comments identified real bugs — the finalizing_ flag timing issue, the missing pre_finalize() hook, incorrect inline comments, the heap leak in test_stack(), and the missing adstack guard on the teardown subprocess test — all of which the author addressed. The one flagged pre-existing issue (AdStackAllocaStmt::size_in_bytes() using a 4-byte header vs the runtime's 8-byte header) predates this PR and is silently absorbed by LLVM alignment padding on all supported platforms; it warrants a follow-up fix in statements.h but does not affect correctness here.

Other factors

Test coverage is comprehensive: basic overflow raise, flag reset after catch, large-capacity workaround, multithreaded overflow, and teardown subprocess. All previous reviewer concerns are resolved.

Comment thread quadrants/runtime/llvm/runtime_module/internal_functions.h
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — all issues from previous review rounds addressed.

Extended reasoning...

Overview

This PR adds adstack-overflow detection to the LLVM backend: stack_push sets a dedicated adstack_overflow_flag on overflow (rather than trapping), check_adstack_overflow() polls the flag unconditionally on every synchronize(), and LlvmProgramImpl::pre_finalize() sets finalizing_=true before the two teardown syncs so overflow exceptions cannot escape into ~Program(). Ten files changed across codegen, IR, runtime, program lifecycle, and tests.

Security risks

No security-sensitive code paths (no auth, crypto, or permissions). The only risk was the original silent UB (out-of-bounds write on overflow), which this PR fixes.

Level of scrutiny

This PR went through four rounds of review, surfacing: comment inaccuracies, a real finalizing-flag timing bug (std::terminate risk), a heap leak in test_stack(), a missing adstack-extension skip guard, a C++11 memory-model data race, and a 4-byte header size mismatch in size_in_bytes(). All were fixed. The current code matches the invariants documented in each comment, the tests exercise the critical paths (flag reset, large capacity, multithreaded, teardown subprocess), and the bug hunting system found no remaining issues.

Other factors

Test coverage is thorough: overflow raises, flag resets after catch, large capacity resolves overflow, multithreaded overflow, teardown-without-explicit-sync subprocess. The pre_finalize() hook comment explicitly warns that moving the assignment back into finalize() alone would reintroduce the std::terminate bug — a good maintenance guard. The atomic fixup (__ATOMIC_RELAXED store + exchange) is the minimal correct change with zero runtime cost on x86-64/ARM64.

Comment thread quadrants/codegen/llvm/codegen_llvm.cpp Outdated
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — all review issues addressed, including the CPU-only arch guard on the 256 KB adstack budget check.

Extended reasoning...

Overview

The PR adds adstack overflow detection to the LLVM autodiff runtime: stack_push now sets runtime->adstack_overflow_flag via a relaxed atomic store on overflow, stack_pop/stack_top_primal have underflow/clamping guards, LlvmRuntimeExecutor::check_adstack_overflow() polls the flag on every synchronize(), and LlvmProgramImpl::pre_finalize() suppresses the poll during teardown. Additionally AdStackAllocaStmt::size_in_bytes() is corrected from sizeof(int32) to sizeof(int64) to match the runtime layout. 11 files touched across codegen, IR, runtime, and tests.

Security risks

No security-sensitive surfaces (no auth, crypto, or permission code). The relaxed-atomic fix for the data race in stack_push is a correctness improvement.

Level of scrutiny

This is production-critical runtime code (autodiff correctness on all LLVM backends), but the changes are well-bounded safety additions rather than logic rewrites. Every bug raised across multiple review rounds was addressed:

  • Comment inaccuracies (error_code vs adstack_overflow_flag) — fixed
  • finalizing_ timing (pre_finalize() before teardown syncs) — fixed
  • sizeof header mismatch — fixed
  • Atomic write for multi-threaded overflow — fixed
  • CPU-only arch guard on the 256 KB budget check — fixed (fixup commit 851d8fd)
  • test_stack heap leak — fixed
  • Test ad_stack_size pins and extension guards — fixed

Other factors

Test coverage is comprehensive: overflow raises, flag reset, teardown safety, multi-threaded overflow, and large-capacity resolution are all exercised. The bug hunting system found no new issues in the final state. The unresolved inline thread on the GPU budget check is moot because the arch guard is present in the submitted diff.

@hughperkins
Copy link
Copy Markdown
Collaborator

Opus says this PR is three ~independent fixes. Pelase could we split into three PRs?

These three fixes a are orthogonal and could be pr'd sepraately?

Yes — all three are independently mergeable, with one soft ordering preference. Sketch:

Fix 3: AdStackAllocaStmt::size_in_bytes header size

• Scope: 1 file, ~4 lines (quadrants/ir/statements.h).
• Dependencies: none. Pre-existing bug — runtime always used u64 for the header, LLVM alloca was 4 B short.
• Test: none new required (existing tests now allocate the correct size, and the extended test_stack in fix 1 would surface it, but the fix stands on its own).
• Risk: near-zero. Just makes the alloca 4 B larger.
• Could ship: tomorrow. Smallest, cleanest, highest-confidence change in the PR.

Fix 1: runtime overflow → Python exception + teardown safety

• Scope: ~8 files (runtime.cpp, internal_functions.h, llvm_runtime_executor.{cpp,h}, llvm_program.h, program.cpp, program_impl.h, codegen_llvm.cpp for the stack_push signature
change + Python tests).
• Dependencies: none semantically, but technically the stack_push(runtime, stack, max, elem_size) signature change touches codegen_llvm.cpp too — any PR landing in parallel
that touches that callsite needs rebasing.
• Self-contained test coverage: yes — all five new Python tests + the extended test_stack in internal_functions.h.
• Risk: moderate. New virtual hook (pre_finalize), new runtime flag, changes the always-on path in synchronize(). The teardown subprocess test is specifically pinning one
worst-case interaction.
• Could ship: as its own PR. It's the biggest piece but internally coherent; the push-bounds-check, pop-clamp, top-of-stack clamp, flag-poll, and pre_finalize teardown hook
are all one story: "overflow should surface as an exception, not as silent corruption, even in the unsynced-exit path."

Fix 2: codegen per-task stack-budget guard

• Scope: 2 files (codegen_llvm.cpp, codegen_llvm.h).
• Dependencies: ideally lands after fix 3 so the 256 KB budget math sums accurate per-adstack sizes (not 4 B-too-small figures). Not a correctness issue at this budget
granularity, but technically cleaner.
• Self-contained test coverage: none in this PR. The PR's five new tests all exercise fix 1 (runtime overflow). Fix 2 is the codegen-time check and should have at least one
test that deliberately crosses 256 KB at ad_stack_size=4096 and asserts the QD_ERROR_IF fires. Splitting this out as its own PR makes the absence of that test more visible
and easier to address.
• Risk: moderate. CPU-only gating with a hard-coded constant; a kernel that works fine today but happens to straddle 256 KB will start failing at compile time. Worth a
dedicated review pass.
• Could ship: as its own PR, ideally after fix 3, and ideally with a test added first.

Suggested order

  1. Fix 3 first (trivial, unblocks the others). One-line PR.
  2. Fix 1 second (big, self-contained, carries the Python tests that are directly visible to users).
  3. Fix 2 third (smallest of the two behavior changes, but wants a test the current PR doesn't include — easier to add that test in a dedicated PR than to stretch this one
    further).

Why they're getting reviewed together anyway

They share a theme — "stop silent wrong gradients in the adstack path" — and they were probably discovered together while chasing the same symptom. That's normal. But for
review, they land in different files with different risk profiles and different test needs, and reviewers can sign off on them independently. If CI time per PR is cheap,
splitting is a straight win for reviewability and bisectability. If the team prefers bigger atomic changes, the current shape is defensible, but in that case the PR description
should at least be explicit that fix 2 is untested in this PR.

it sounds like we need the test from fix 1 for fix 3?

Yes — kind of. Let me be precise about what "needs" means here.

Fix 3 is provably correct by inspection

The runtime reads/writes the header through *(u64 *)stack at every site (stack_init, stack_push, stack_top_primal). The alloca was sized as if the header were int32. Changing it
to int64 makes the alloca size match the layout the runtime has always used. There's no semantic ambiguity to argue over.

But it has no executable validation in isolation

Without fix 1, the only test exercising the adstack at the C++ level is the old test_stack, which does 4 pushes into a max_size=16 stack — nowhere near the corrupted region. The
corruption from the int32 sizing is in the last 4 bytes of the alloca, which only gets touched when pushes approach max_size. So an old-test_stack-only validation regime would
happily pass with fix 3 reverted.

That's why I said "obviously correct by inspection but with no executable proof" — and why fix 1's rewritten test_stack (which pushes to max_size and reads *(u64 *)stack) is the
test that incidentally pins fix 3.

What this means for splitting

Three options, in increasing test discipline:

  1. Ship fix 3 alone, no test. Justified on "matches runtime layout, strictly increases alloca size by 4 bytes, can't make anything worse." Defensible but leaves the bug class
    (alloca-size vs runtime-layout drift) without a regression pin.

  2. Ship fix 3 alone, with a 5-line test. Add a minimal C++ test (or piggyback on existing test_stack) that pushes to max_size and asserts *(u64 *)stack == max_size. That single
    assertion would have failed under the old int32 sizing because the high 4 bytes of the header would have been clobbered by the first slot's writes. Smallest validating
    change.

  3. Ship fix 3 + fix 1 together (current shape, sort of). Fix 1's test_stack rewrite covers fix 3 as a side effect. You don't need to write a separate test, but you also can't
    bisect a fix-3 regression to fix 3 itself.

If splitting, option 2 is the cleanest — a one-liner fix plus a one-line assertion is still a tiny PR, and it gives fix 3 its own regression pin instead of relying on fix 1's
coverage. Then fix 1 lands on top with its more thorough overflow-path tests.

The general principle: a fix that's "obviously right" but covered only by an unrelated test is one accidental refactor away from being silently undone. A 5-line dedicated test is
cheap insurance.

@hughperkins
Copy link
Copy Markdown
Collaborator

So lets' split into the following PRs please;

  • AdStackAllocaStmt::size_in_bytes header size (with small 5-line test)
  • runtime overflow → Python exception + teardown safety
  • codegen per-task stack-budget guard

Comment on lines +141 to +145
// Non-owning cache of the Program-owned result buffer so internal polls (adstack overflow, etc.) can be
// invoked from `synchronize()` without threading the pointer through the public API. Ownership stays with
// `Program` for its lifetime; reallocating or repointing `Program::result_buffer_` mid-run would invalidate
// this cache, so avoid that.
uint64 *result_buffer_cache_{nullptr};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The comment on result_buffer_cache_ (llvm_runtime_executor.h:143) warns against reallocating Program::result_buffer_ (with trailing underscore), but the actual field in Program is result_buffer (no trailing underscore). A developer grepping for result_buffer_ in Program will find nothing, defeating the purpose of the warning. Change to Program::result_buffer.

Extended reasoning...

What the bug is and how it manifests

The newly added field-level comment on result_buffer_cache_ (llvm_runtime_executor.h lines 141-145) reads: "reallocating or repointing Program::result_buffer_ mid-run would invalidate this cache". The trailing underscore is wrong: the actual field in Program is declared as result_buffer (no underscore), as seen in program.cpp (result_buffer = nullptr). This is a documentation inaccuracy introduced by this PR.

The specific code path

The comment was added alongside the result_buffer_cache_ field declaration in this PR. The intent is to warn maintainers that the non-owning cache pointer will be invalidated if the Program-owned result buffer is ever reallocated or reassigned. The warning is meaningful and correct in intent — but the identifier it names, Program::result_buffer_, does not exist.

Why existing code does not prevent it

This is a pure documentation error. The implementation is entirely correct: result_buffer_cache_ is populated at materialize_runtime time and the cache is valid for the program's lifetime. The only problem is the field name in the English comment.

Impact on maintainability

A developer who reads the warning and then searches the codebase for Program::result_buffer_ to understand the lifetime contract will find zero results (the field result_buffer_ does not exist in Program). They may then incorrectly conclude either (a) the comment is stale and can be ignored, or (b) the cache is safe to keep even when Program manages its result buffer differently. The intended warning loses all grep-navigability.

How to fix it

Change Program::result_buffer_ to Program::result_buffer (drop the trailing underscore) in the comment at llvm_runtime_executor.h:143.

Step-by-step proof

  1. Read llvm_runtime_executor.h lines 141-145 (added by this PR): comment says Program::result_buffer_.
  2. Search program.cpp for result_buffer: field is declared as uint64 *result_buffer{nullptr} (no underscore) and accessed as result_buffer = nullptr, &result_buffer, etc. throughout the file.
  3. Grep for result_buffer_ as a Program member: no match anywhere in the codebase.
  4. Conclusion: the comment uses the wrong identifier; Program::result_buffer is the correct name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants