36 changes: 36 additions & 0 deletions quadrants/codegen/llvm/codegen_llvm.cpp
@@ -1727,6 +1727,10 @@ std::string TaskCodeGenLLVM::init_offloaded_task_function(OffloadedStmt *stmt, s
current_loop_reentry = nullptr;
current_while_after_loop = nullptr;

// Reset the adstack function-scope accumulator for this task. The budget is per-task (per LLVM
// function), so the count must not carry over from the previous offloaded stmt.
ad_stack_fn_scope_bytes_ = 0;

task_function_type =
llvm::FunctionType::get(llvm::Type::getVoidTy(*llvm_context), {llvm::PointerType::get(context_ty, 0)}, false);

@@ -2106,6 +2110,38 @@ void TaskCodeGenLLVM::visit(InternalFuncStmt *stmt) {
void TaskCodeGenLLVM::visit(AdStackAllocaStmt *stmt) {
QD_ASSERT_INFO(stmt->max_size > 0, "Adaptive autodiff stack's size should have been determined.");
auto type = llvm::ArrayType::get(llvm::Type::getInt8Ty(*llvm_context), stmt->size_in_bytes());

// Guard against LLVM worker-thread stack overflow before silent memory corruption ensues.
// Gated on CPU arches because only there do LLVM allocas become worker-thread stack frame slots bounded by
// the OS thread-stack limit. On CUDA / AMDGPU the same LLVM allocas are lowered to per-thread GPU local
// memory (a separate address space sized by the driver, not shared with the CPU call stack), so the 256 KB
// CPU-stack budget is not meaningful there and the check would falsely reject valid GPU kernels with
// f64 loop-carried variables (4 adstacks at `ad_stack_size=4096` already cross 256 KB).
//
// Adstacks are allocated at function entry (`create_entry_block_alloca`) so they are live for the entire
// task invocation and their sizes sum directly into the LLVM stack frame. A kernel that exceeds the thread
// stack does not fault at the push - it simply trashes adjacent stack memory, and downstream reverse-mode
// accumulators read zero, producing silently-wrong gradients that look indistinguishable from a broken
// backward chain. Fail loudly with a message that tells the user how to unblock: either lower
// `ad_stack_size`, shrink the per-kernel adstack count by shifting some dynamic loops back to
// `qd.static(range(...))` unrolls, or use a backend that heap-backs adstacks.
//
// Budget: 256 KB leaves headroom inside the ~512 KB default macOS secondary-thread stack for other locals
// and nested call frames. Linux defaults are larger (~8 MB), so the same limit is strictly conservative
// there.
if (arch_is_cpu(current_arch())) {
constexpr std::size_t kFnScopeAdStackBudgetBytes = 256 * 1024;
ad_stack_fn_scope_bytes_ += stmt->size_in_bytes();
QD_ERROR_IF(ad_stack_fn_scope_bytes_ > kFnScopeAdStackBudgetBytes,
"LLVM autodiff-stack budget exceeded: cumulative `AdStackAllocaStmt` size {} bytes in task "
"'{}' crosses the {} byte function-scope budget. Every adstack is allocated on the worker "
"thread stack, so scaling past this point silently corrupts the stack frame and zeros the "
"reverse-mode gradient without raising. Options: lower `ad_stack_size=N` in `qd.init()`, "
"reduce the number of loop-carried values in dynamic reverse-mode loops, or keep the "
"existing `qd.static(range(...))` unrolls on the reverse-mode path.",
ad_stack_fn_scope_bytes_, kernel_name, kFnScopeAdStackBudgetBytes);
Comment on lines +2135 to +2142
🟡 The QD_ERROR_IF in visit(AdStackAllocaStmt*) passes kernel_name (e.g., my_kernel) as the second format argument but the message says "in task '{}'", so for a kernel compiled with multiple offloaded backward tasks the diagnostic does not identify which specific task crossed the 256 KB budget. Replace kernel_name with current_task->name (e.g., my_kernel_0_range_for_body), which is guaranteed non-null at this call site and already contains the full task-function identifier.

Extended reasoning...

What the bug is and how it manifests

In codegen_llvm.cpp lines 2135-2142, the QD_ERROR_IF call has this format string:

"... cumulative AdStackAllocaStmt size {} bytes in task '{}' ..."

with positional arguments ad_stack_fn_scope_bytes_, kernel_name, kFnScopeAdStackBudgetBytes. The second positional argument binds to kernel_name, which is the top-level kernel identifier (e.g. my_kernel). The label says 'task', implying a task-scoped identifier, but the value supplied is kernel-scoped.

The specific code path that triggers it

init_offloaded_task_function (codegen_llvm.cpp:1727+) resets ad_stack_fn_scope_bytes_ to zero at the start of each offloaded task and initialises current_task with a fully-qualified name built from kernel name + codegen id + loop name + task type (e.g. my_kernel_0_range_for_body). visit(AdStackAllocaStmt*) is only reached inside that task body, so current_task is guaranteed non-null at this call site.

Why existing code does not prevent it

There is no validation that the format argument matches the 'task' label in the message. Both kernel_name and current_task->name compile cleanly; the compiler has no way to flag the semantic mismatch.

Impact

For a backward-pass kernel compiled with multiple offloaded tasks (e.g., struct-for body, init-args task, range-for backward body), every offloaded task shares the same kernel_name. If the budget guard fires, the error message names the kernel but not which of its several tasks exceeded the limit, forcing the user to guess rather than act on the message directly.

How to fix it

Replace kernel_name with current_task->name in the QD_ERROR_IF argument list. Optionally update the format label from 'task' to 'task function'. Because current_task->name already contains kernel_name as a prefix (e.g. my_kernel_0_range_for_body), the message remains informative at the kernel level too.
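The effect of the suggested argument swap can be illustrated with a standalone mock (`QD_ERROR_IF` and `TaskCodeGenLLVM` are not modeled here; the only point shown is which identifier the diagnostic carries, with the task name taken from this comment's own example):

```python
# Standalone mock of the budget guard; the 256 KB constant and the task-scoped
# name "my_kernel_0_range_for_body" come from the review comment above.
FN_SCOPE_BUDGET_BYTES = 256 * 1024

def check_budget(cumulative_bytes, task_name):
    # Mirrors the QD_ERROR_IF condition: raise once the running total
    # crosses the function-scope budget, naming the offending task.
    if cumulative_bytes > FN_SCOPE_BUDGET_BYTES:
        raise RuntimeError(
            f"autodiff-stack budget exceeded: {cumulative_bytes} bytes in task "
            f"'{task_name}' crosses the {FN_SCOPE_BUDGET_BYTES} byte budget"
        )

try:
    # Five 65,544-byte adstacks, as in the proof below; with the task-scoped
    # name the error pins down the offending offloaded task directly.
    check_budget(5 * 65544, "my_kernel_0_range_for_body")
except RuntimeError as e:
    msg = str(e)
    print(msg)
```

With `kernel_name` in that slot the same message would only say `my_kernel`, which is the ambiguity this comment describes.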

Step-by-step proof

  1. A user compiles a kernel my_kernel with two offloaded tasks: my_kernel_0_range_for_body (five f64 adstacks) and my_kernel_1_init_args (one adstack).
  2. The backward body sums to 5 x 65,544 bytes = 327,720 bytes, exceeding the 262,144-byte (256 KB) budget; QD_ERROR_IF fires during codegen of that task.
  3. The message printed is: '... size 327720 bytes in task my_kernel crosses the 262144 byte budget.'
  4. The user has no information about which of the kernel's tasks is the culprit.
  5. With current_task->name, the message would read 'in task my_kernel_0_range_for_body', directly naming the offending task function.
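The byte figures in steps 2-3 can be reproduced with a small model. The 8-byte header and 16-byte slot are assumptions chosen to match the 65,544-byte per-stack size quoted in this review, not layout facts taken from the diff:

```python
# Hypothetical adstack frame-size model reproducing the figures quoted above.
HEADER_BYTES = 8       # assumed per-stack bookkeeping (e.g., a top counter)
SLOT_BYTES = 16        # assumed: one f64 primal + one f64 adjoint per entry
AD_STACK_SIZE = 4096   # entries, as set via ad_stack_size=4096

per_stack = HEADER_BYTES + AD_STACK_SIZE * SLOT_BYTES  # 65,544 bytes
budget = 256 * 1024                                    # 262,144 bytes

print(per_stack)                # 65544
print(4 * per_stack > budget)   # True: four adstacks already cross 256 KB
print(5 * per_stack)            # 327720, the figure in the proof above
```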

}

auto alloca = create_entry_block_alloca(type, sizeof(int64));
llvm_val[stmt] = builder->CreateBitCast(alloca, llvm::PointerType::getUnqual(*llvm_context));
call("stack_init", llvm_val[stmt]);
8 changes: 8 additions & 0 deletions quadrants/codegen/llvm/codegen_llvm.h
@@ -63,6 +63,14 @@ class TaskCodeGenLLVM : public IRVisitor, public LLVMModuleBuilder {
// The task_codegen_id represents the id of the offloaded task
int task_codegen_id{0};

// Running total of bytes reserved by `AdStackAllocaStmt`s emitted via `create_entry_block_alloca` in
// the current task. Every adstack lives at function scope on the worker-thread stack, so the sum of
// their sizes adds directly to the LLVM stack frame. If the sum exceeds the worker thread's stack
// (~512 KB on macOS, 8 MB on Linux by default) the frame silently clobbers adjacent stack pages,
// which has shown up in Genesis-style kernels as zero gradients with no SIGBUS. We raise before
// codegen emits anything that cannot run correctly.
std::size_t ad_stack_fn_scope_bytes_{0};
Comment on lines 63 to +72
🟡 The field comment for ad_stack_fn_scope_bytes_ in codegen_llvm.h (lines 66–71) makes two universally-scoped claims — "Every adstack lives at function scope on the worker-thread stack" and "We raise before codegen emits anything that cannot run correctly" — that are only accurate for CPU backends; on CUDA/AMDGPU, LLVM allocas lower to per-thread GPU local memory, not the CPU call stack, and the raise is gated on arch_is_cpu() in the .cpp implementation so it never fires for GPU. Prefix the comment with "On CPU arches only" and note that the raise is conditional on arch_is_cpu() to match the guard that is already in place in the .cpp.

Extended reasoning...

What the bug is and how it manifests

The field comment for ad_stack_fn_scope_bytes_ (codegen_llvm.h:66–71) reads:

Every adstack lives at function scope on the worker-thread stack, so the sum of their sizes adds directly to the LLVM stack frame. … We raise before codegen emits anything that cannot run correctly.

Both statements are worded universally but are only true for CPU arches. On CUDA and AMDGPU, LLVM allocas produced by create_entry_block_alloca are lowered by the NVPTX/AMDGPU backends to per-thread device-local memory — a separate address space sized by the GPU driver that has no relationship to any CPU worker-thread stack. "The worker-thread stack" claim is therefore factually wrong for those backends. And "We raise before codegen emits anything that cannot run correctly" is false because the raise is inside an if (arch_is_cpu(current_arch())) guard in the .cpp — GPU backends skip the check entirely.

The specific code path that matters

TaskCodeGenLLVM is the base class for TaskCodeGenCUDA and TaskCodeGenAMDGPU. Both GPU subclasses inherit the field without overriding it. A GPU backend developer reading only the header declaration (as is common when quickly auditing a field's purpose or deciding whether a guard is necessary) will see the field comment and conclude both that adstacks always use the CPU call stack and that the budget guard fires unconditionally. The .cpp implementation does carry a detailed and accurate CPU-only rationale in the block comment above the arch_is_cpu() guard, but that comment is not visible from the header.

Why existing code does not prevent it

This is a pure documentation inaccuracy. The .cpp implementation is fully correct: the budget check is properly gated on arch_is_cpu(current_arch()) so GPU kernels are never rejected by it. The problem exists only in the field-level summary in the header.

What the impact would be

A GPU backend developer reading the header comment could conclude: (a) the arch_is_cpu() guard in the .cpp is unnecessary overhead since the comment says the raise fires universally, leading them to remove it; or (b) adstacks on CUDA/AMDGPU really do sit on the CPU worker-thread stack, leading to confusion when investigating GPU adstack performance or correctness. Either misunderstanding can propagate into incorrect changes downstream.

How to fix it

Prefix the comment with "On CPU arches only:" and add a parenthetical clarifying that the raise fires inside the arch_is_cpu() guard. For example:
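The elided example presumably reads along these lines (a sketch reassembled from this comment's own prescription, not the committed text):

```cpp
// On CPU arches only: running total of bytes reserved by `AdStackAllocaStmt`s
// emitted via `create_entry_block_alloca` in the current task. On CPU backends
// every adstack lives at function scope on the worker-thread stack, so the sum
// adds directly to the LLVM stack frame. The over-budget raise in the .cpp is
// gated on `arch_is_cpu()`; it never fires on CUDA/AMDGPU, where allocas lower
// to per-thread GPU local memory instead of the CPU call stack.
std::size_t ad_stack_fn_scope_bytes_{0};
```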

Step-by-step proof

  1. Read codegen_llvm.h:66–71: comment says "Every adstack lives at function scope on the worker-thread stack" and "We raise before codegen emits anything that cannot run correctly" — both universal in scope.
  2. Read codegen_llvm.cpp (this diff): the budget accumulation and QD_ERROR_IF are wrapped in if (arch_is_cpu(current_arch())) { … } — the raise does NOT fire for CUDA/AMDGPU.
  3. Note that TaskCodeGenCUDA and TaskCodeGenAMDGPU inherit TaskCodeGenLLVM without overriding visit(AdStackAllocaStmt*) — the field is present and visible in both GPU subclasses.
  4. On CUDA/AMDGPU, LLVM allocas are lowered to per-thread GPU local memory by the NVPTX/AMDGPU backends — not to the CPU worker-thread call stack.
  5. Conclusion: the header comment's two universal claims are false for GPU backends; the .cpp already has the correct scoped comment; only the .h field summary needs updating.


std::unordered_map<const Stmt *, std::vector<llvm::Value *>> loop_vars_llvm;

std::unordered_map<Function *, llvm::Function *> func_map;
65 changes: 65 additions & 0 deletions tests/python/test_adstack.py
@@ -546,3 +546,68 @@
@test_utils.test(require=[qd.extension.adstack, qd.extension.data64], default_fp=qd.f64)
def test_adstack_sum_linear_f64(use_static_loop, use_varying_coeff, n_iter):
_run_sum_linear(qd.f64, use_static_loop, use_varying_coeff, n_iter, rel_tol=1e-14)


def test_adstack_codegen_budget_guard_runs_in_child_process(tmp_path):
# Per-task codegen guard: the sum of `AdStackAllocaStmt::size_in_bytes()` in a single LLVM task must not cross
# the ~256 KB CPU worker-thread stack budget. Beyond that the frame silently clobbers adjacent stack memory and
# the reverse pass returns zero / garbage gradients. The guard runs inside the LLVM compilation worker thread
# pool; the underlying `QD_ERROR_IF` throws across a thread boundary that does not propagate the exception
# back to Python, so it surfaces as a loud `std::terminate` / SIGABRT rather than a catchable Python
# exception. The test runs the overflowing kernel in a child process and asserts the child aborts with a
# non-zero exit code and the guard message reaches stderr; that is enough to prove the guard fires and does
# not let silent stack-frame clobbering through.
if not is_extension_supported(qd.cpu, qd.extension.adstack):
pytest.skip("adstack extension not available on cpu")
if not is_extension_supported(qd.cpu, qd.extension.data64):
pytest.skip("f64 extension not available on cpu")

child_script = textwrap.dedent(
"""
import quadrants as qd

qd.init(arch=qd.cpu, ad_stack_experimental_enabled=True, ad_stack_size=4096, default_fp=qd.f64)

n = 4
x = qd.field(qd.f64, shape=n, needs_grad=True)
y = qd.field(qd.f64, shape=(), needs_grad=True)
n_iter = qd.field(qd.i32, shape=())

@qd.kernel
def compute():
for i in x:
v1 = x[i]
v2 = x[i]
v3 = x[i]
v4 = x[i]
v5 = x[i]
for _ in range(n_iter[None]):
v1 = qd.sin(v1)
v2 = qd.sin(v2)
v3 = qd.sin(v3)
v4 = qd.sin(v4)
v5 = qd.sin(v5)
y[None] += v1 + v2 + v3 + v4 + v5

for i in range(n):
x[i] = 0.1 + 0.1 * i
n_iter[None] = 3
y[None] = 0.0
compute()
y.grad[None] = 1.0
for i in range(n):
x.grad[i] = 0.0
compute.grad()
"""
)
script_path = tmp_path / "budget_guard_child.py"
script_path.write_text(child_script)
result = subprocess.run([sys.executable, str(script_path)], capture_output=True, check=False)

Check warning on line 605 in tests/python/test_adstack.py


Claude / Claude Code Review

subprocess.run in budget-guard test lacks timeout, can hang CI indefinitely

Both subprocess.run calls in test_adstack.py lack a timeout argument; if the child process does not exit for any reason (e.g., the budget guard fails to fire due to misconfiguration or a build variant), the test will block the entire CI suite indefinitely. Add timeout=60 to the subprocess.run at the budget-guard test (line 605) and defensively to the teardown test as well.

Extended reasoning...

What the bug is

Both subprocess.run calls in test_adstack.py — one in test_adstack_codegen_budget_guard_runs_in_child_process (~line 605) and one in test_adstack_overflow_during_teardown_does_not_abort (~line 482) — are missing a timeout= argument. The critical case is the budget-guard test: it expects the child process to terminate with a non-zero exit code (via QD_ERROR_IF firing during LLVM compilation). If the guard fails to fire for any reason, subprocess.run will block forever.

Why the original spinning mechanism is not the real concern

The original bug description argued that __builtin_unreachable() could cause the child to spin indefinitely. The refutations are correct that this specific path is impossible: Logger::error() has raise_exception=true by default, so it throws a std::string before QD_UNREACHABLE is ever reached. That thrown exception inside the LLVM worker thread triggers std::terminate() -> std::abort() -> SIGABRT, which is reliable non-zero-exit termination. The __builtin_unreachable() is dead code in this path.

The valid concern: guard fails to fire

The real risk is broader: if the budget guard fails to fire for any reason — a build variant where the guard is compiled out, a future refactor that accidentally breaks the arch_is_cpu gate, an early return added upstream of visit(AdStackAllocaStmt*), or a misconfigured test environment — the child process will not abort, the parent subprocess.run has no timeout, and CI blocks indefinitely.

Why existing code does not prevent it

subprocess.run(..., check=False) with no timeout= will wait forever for the child. There is no watchdog or other mechanism that would unblock the parent. The skip guards at the top only protect the extension-not-compiled case; they do not protect against a guard that is compiled in but silently broken.

Impact

If this test hangs, the entire test suite hangs behind it. On most CI systems this means a full job timeout (30-60 min) before the runner kills it, with no actionable error message.

How to fix it

Add timeout=60 (or timeout=120 for slow machines) to the subprocess.run call on the budget-guard test. The teardown test (expecting returncode 0) is less critical but should also get a timeout for the same defensive reason.

Step-by-step proof

  1. The budget-guard child script calls compute.grad(), which triggers LLVM compilation of the backward kernel. If arch_is_cpu() returns False for any reason, the QD_ERROR_IF block is skipped, no abort occurs, compute.grad() and qd.sync() return normally, and the child exits with returncode 0.
  2. The parent test asserts result.returncode != 0 only AFTER subprocess.run returns. Without a timeout, subprocess.run never returns if the child hangs (e.g., in a JIT stall or LLVM thread pool deadlock).
  3. A timeout=60 turns a potential indefinite hang into a subprocess.TimeoutExpired exception within 60 seconds, making CI failures actionable.
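A sketch of the hardened call, assuming the reviewer's suggested 60-second value; the inline child command here is a trivial stand-in for the real `budget_guard_child.py` script:

```python
import subprocess
import sys

# Stand-in child process; the real test launches the generated script.
# timeout=60 converts a hung child into subprocess.TimeoutExpired instead
# of blocking the whole CI suite indefinitely.
result = subprocess.run(
    [sys.executable, "-c", "print('child ok')"],
    capture_output=True,
    timeout=60,
    check=False,
)
print(result.returncode)
```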

assert (
result.returncode != 0
), "child exited with returncode 0 but the budget guard was expected to terminate the process"
combined = (result.stdout + result.stderr).decode()
assert "autodiff-stack budget exceeded" in combined, (
f"expected guard message in child output; got:\nstdout:\n{result.stdout.decode()}\n"
f"stderr:\n{result.stderr.decode()}"
)