-
Notifications
You must be signed in to change notification settings - Fork 15
[AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget #495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: duburcqa/split_llvm_adstack_runtime_overflow
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -63,6 +63,14 @@ class TaskCodeGenLLVM : public IRVisitor, public LLVMModuleBuilder { | |
| // The task_codegen_id represents the id of the offloaded task | ||
| int task_codegen_id{0}; | ||
|
|
||
| // Running total of bytes reserved by `AdStackAllocaStmt`s emitted via `create_entry_block_alloca` in | ||
| // the current task. Every adstack lives at function scope on the worker-thread stack, so the sum of | ||
| // their sizes adds directly to the LLVM stack frame. If the sum exceeds the worker thread's stack | ||
| // (~512 KB on macOS, 8 MB on Linux by default) the frame silently clobbers adjacent stack pages, | ||
| // which has shown up in Genesis-style kernels as zero gradients with no SIGBUS. We raise before | ||
| // codegen emits anything that cannot run correctly. | ||
| std::size_t ad_stack_fn_scope_bytes_{0}; | ||
|
Comment on lines
63
to
+72
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🟡 The field comment for Extended reasoning...What the bug is and how it manifests The field comment for
Both statements are worded universally but are only true for CPU arches. On CUDA and AMDGPU, LLVM allocas produced by The specific code path that matters
Why existing code does not prevent it This is a pure documentation inaccuracy. The What the impact would be A GPU backend developer reading the header comment could conclude: (a) the How to fix it Prefix the comment with "On CPU arches only:" and add a parenthetical clarifying that the raise fires inside the Step-by-step proof
|
||
|
|
||
| std::unordered_map<const Stmt *, std::vector<llvm::Value *>> loop_vars_llvm; | ||
|
|
||
| std::unordered_map<Function *, llvm::Function *> func_map; | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -546,3 +546,68 @@ | |
| @test_utils.test(require=[qd.extension.adstack, qd.extension.data64], default_fp=qd.f64) | ||
| def test_adstack_sum_linear_f64(use_static_loop, use_varying_coeff, n_iter): | ||
| _run_sum_linear(qd.f64, use_static_loop, use_varying_coeff, n_iter, rel_tol=1e-14) | ||
|
|
||
|
|
||
| def test_adstack_codegen_budget_guard_runs_in_child_process(tmp_path): | ||
| # Per-task codegen guard: the sum of `AdStackAllocaStmt::size_in_bytes()` in a single LLVM task must not cross | ||
| # the ~256 KB CPU worker-thread stack budget. Beyond that the frame silently clobbers adjacent stack memory and | ||
| # the reverse pass returns zero / garbage gradients. The guard runs inside the LLVM compilation worker thread | ||
| # pool; the underlying `QD_ERROR_IF` throws across a thread boundary that does not propagate the exception | ||
| # back to Python, so it surfaces as a loud `std::terminate` / SIGABRT rather than a catchable Python | ||
| # exception. The test runs the overflowing kernel in a child process and asserts the child aborts with a | ||
| # non-zero exit code and the guard message reaches stderr; that is enough to prove the guard fires and does | ||
| # not let silent stack-frame clobbering through. | ||
| if not is_extension_supported(qd.cpu, qd.extension.adstack): | ||
| pytest.skip("adstack extension not available on cpu") | ||
| if not is_extension_supported(qd.cpu, qd.extension.data64): | ||
| pytest.skip("f64 extension not available on cpu") | ||
|
|
||
| child_script = textwrap.dedent( | ||
| """ | ||
| import quadrants as qd | ||
|
|
||
| qd.init(arch=qd.cpu, ad_stack_experimental_enabled=True, ad_stack_size=4096, default_fp=qd.f64) | ||
|
|
||
| n = 4 | ||
| x = qd.field(qd.f64, shape=n, needs_grad=True) | ||
| y = qd.field(qd.f64, shape=(), needs_grad=True) | ||
| n_iter = qd.field(qd.i32, shape=()) | ||
|
|
||
| @qd.kernel | ||
| def compute(): | ||
| for i in x: | ||
| v1 = x[i] | ||
| v2 = x[i] | ||
| v3 = x[i] | ||
| v4 = x[i] | ||
| v5 = x[i] | ||
| for _ in range(n_iter[None]): | ||
| v1 = qd.sin(v1) | ||
| v2 = qd.sin(v2) | ||
| v3 = qd.sin(v3) | ||
| v4 = qd.sin(v4) | ||
| v5 = qd.sin(v5) | ||
| y[None] += v1 + v2 + v3 + v4 + v5 | ||
|
|
||
| for i in range(n): | ||
| x[i] = 0.1 + 0.1 * i | ||
| n_iter[None] = 3 | ||
| y[None] = 0.0 | ||
| compute() | ||
| y.grad[None] = 1.0 | ||
| for i in range(n): | ||
| x.grad[i] = 0.0 | ||
| compute.grad() | ||
| """ | ||
| ) | ||
| script_path = tmp_path / "budget_guard_child.py" | ||
| script_path.write_text(child_script) | ||
| result = subprocess.run([sys.executable, str(script_path)], capture_output=True, check=False) | ||
|
Check warning on line 605 in tests/python/test_adstack.py
|
||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🟡 Both subprocess.run calls in test_adstack.py lack a timeout argument; if the child process does not exit for any reason (e.g., the budget guard fails to fire due to misconfiguration or a build variant), the test will block the entire CI suite indefinitely. Add timeout=60 to the subprocess.run at the budget-guard test (line 605) and defensively to the teardown test as well. Extended reasoning...What the bug is Both subprocess.run calls in test_adstack.py — one in test_adstack_codegen_budget_guard_runs_in_child_process (~line 605) and one in test_adstack_overflow_during_teardown_does_not_abort (~line 482) — are missing a timeout= argument. The critical case is the budget-guard test: it expects the child process to terminate with a non-zero exit code (via QD_ERROR_IF firing during LLVM compilation). If the guard fails to fire for any reason, subprocess.run will block forever. Why the original spinning mechanism is not the real concern The original bug description argued that __builtin_unreachable() could cause the child to spin indefinitely. The refutations are correct that this specific path is impossible: Logger::error() has raise_exception=true by default, so it throws a std::string before QD_UNREACHABLE is ever reached. That thrown exception inside the LLVM worker thread triggers std::terminate() -> std::abort() -> SIGABRT, which is reliable non-zero-exit termination. The __builtin_unreachable() is dead code in this path. The valid concern: guard fails to fire The real risk is broader: if the budget guard fails to fire for any reason — a build variant where the guard is compiled out, a future refactor that accidentally breaks the arch_is_cpu gate, an early return added upstream of visit(AdStackAllocaStmt*), or a misconfigured test environment — the child process will not abort, the parent subprocess.run has no timeout, and CI blocks indefinitely. Why existing code does not prevent it subprocess.run(..., check=False) with no timeout= will wait forever for the child. There is no watchdog or other mechanism that would unblock the parent. The skip guards at the top only protect the extension-not-compiled case; they do not protect against a guard that is compiled in but silently broken. Impact If this test hangs, the entire test suite hangs behind it. On most CI systems this means a full job timeout (30-60 min) before the runner kills it, with no actionable error message. How to fix it Add timeout=60 (or timeout=120 for slow machines) to the subprocess.run call on the budget-guard test. The teardown test (expecting returncode 0) is less critical but should also get a timeout for the same defensive reason. Step-by-step proof
|
||
| assert ( | ||
| result.returncode != 0 | ||
| ), "child exited with returncode 0 but the budget guard was expected to terminate the process" | ||
| combined = (result.stdout + result.stderr).decode() | ||
| assert "autodiff-stack budget exceeded" in combined, ( | ||
| f"expected guard message in child output; got:\nstdout:\n{result.stdout.decode()}\n" | ||
| f"stderr:\n{result.stderr.decode()}" | ||
| ) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 The QD_ERROR_IF in visit(AdStackAllocaStmt*) passes kernel_name (e.g., my_kernel) as the second format argument but the message says "in task '{}'", so for a kernel compiled with multiple offloaded backward tasks the diagnostic does not identify which specific task crossed the 256 KB budget. Replace kernel_name with current_task->name (e.g., my_kernel_0_range_for_body), which is guaranteed non-null at this call site and already contains the full task-function identifier.
Extended reasoning...
What the bug is and how it manifests
In codegen_llvm.cpp lines 2135-2142, the QD_ERROR_IF call has this format string:
"... cumulative AdStackAllocaStmt size {} bytes in task '{}' ..."
with positional arguments ad_stack_fn_scope_bytes_, kernel_name, kFnScopeAdStackBudgetBytes. The second positional argument binds to kernel_name, which is the top-level kernel identifier (e.g. my_kernel). The label says 'task', implying a task-scoped identifier, but the value supplied is kernel-scoped.
The specific code path that triggers it
init_offloaded_task_function (codegen_llvm.cpp:1727+) resets ad_stack_fn_scope_bytes_ to zero at the start of each offloaded task and initialises current_task with a fully-qualified name built from kernel name + codegen id + loop name + task type (e.g. my_kernel_0_range_for_body). visit(AdStackAllocaStmt*) is only reached inside that task body, so current_task is guaranteed non-null at this call site.
Why existing code does not prevent it
There is no validation that the format argument matches the 'task' label in the message. Both kernel_name and current_task->name compile cleanly; the compiler has no way to flag the semantic mismatch.
Impact
For a backward-pass kernel compiled with multiple offloaded tasks (e.g., struct-for body, init-args task, range-for backward body), every offloaded task shares the same kernel_name. If the budget guard fires, the error message names the kernel but not which of its several tasks exceeded the limit, forcing the user to guess rather than act on the message directly.
How to fix it
Replace kernel_name with current_task->name in the QD_ERROR_IF argument list. Optionally update the format label from 'task' to 'task function'. Because current_task->name already contains kernel_name as a prefix (e.g. my_kernel_0_range_for_body), the message remains informative at the kernel level too.
Step-by-step proof