[SPIRV] Use native float view in load/store_buffer to avoid aliasing with atomics #513

Open
duburcqa wants to merge 1 commit into main from duburcqa/fix_spirv_float_atomic_aliasing

Conversation

@duburcqa (Contributor) commented Apr 19, 2026

Problem

On Vulkan devices that expose VK_EXT_shader_atomic_float with shaderBufferFloat32AtomicAdd (AMD RX 7900 XTX and other recent AMD/NVIDIA hardware), reverse-mode kernels that emit the load-and-clear adjoint pattern against a field return a zero gradient. Reproduction:

QD_OFFLINE_CACHE=0 pytest tests/python/test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan]

fails asserting 0.0 == ((0 + 1) * (0 + 1)) on n.grad[0][0, 0]: the adjoint never propagates from m.grad through the multiplication to n.grad.

Root cause

The SPIR-V codegen exposes each buffer via a typed view of the underlying storage buffer. Each (BufferInfo, element_type) pair gets its own OpVariable with a fresh DescriptorSet / Binding decoration, so the same VkBuffer is exposed through multiple descriptor bindings (one per access type).

For the reverse-mode pattern emitted for dst = expr * src followed by loss += dst:

  1. OpAtomicFAddEXT on dst.grad - uses the f32 view of the root buffer (binding N).
  2. OpLoad of dst.grad - load_buffer routes f32 through the u32 view (binding N+1) with an OpBitcast back to float.
  3. OpStore of 0 to dst.grad - same u32 view (binding N+1).
  4. OpAtomicFAddEXT on src.grad += loaded_value * d(expr) - back to the f32 view (binding N).

Without an Aliased decoration, a Vulkan driver is free to treat bindings N and N+1 as non-aliasing descriptor variables. The plain OpLoad at step 2 is then not ordered against the preceding OpAtomicFAddEXT at step 1 - it sees the stale zero initial value, so loaded_value = 0, and the contribution to src.grad at step 4 is zero. The final gradient is always zero regardless of the forward value.

The same descriptor-binding split exists for non-float primitives, but for integers (i32/u32/i16/u16/i8/u8) both the atomic and the uint-punned plain access already use the same uint view, so there is no cross-view aliasing. The mismatch is specific to real types on the native-atomic path.

Why CI has been green

The CI Vulkan runner (gpu-t4-4-core, NVIDIA T4) does not expose shaderBufferFloat32AtomicAdd with its driver/SDK combination. On that path the codegen falls back to the CAS-loop emulation, which routes the atomic through the same uint view as the plain load/store, and the aliasing gap does not exist. The bug only surfaces on Vulkan devices that advertise the extension.

This means the regression has been latent since native-atomic lowering landed in the SPIR-V path (pre-rename, at least as old as [SPIRV] Feature Parity Atomics & Shared Array (#432)) - it is not introduced by the heap-backed adstack work (PR #493) or the adstack overflow detection (PR #495), both of which are LLVM-only.

Fix

Route plain float loads/stores through the native element-type view, so they share the descriptor binding used by OpAtomicFAddEXT. A new pick_buffer_access_type(dt, ptr_val, ir) helper returns:

  • dt itself when dt is a real primitive (f32/f64/f16), or when the pointer is already a native-typed physical buffer pointer;
  • u8 for u1 (unchanged; u1 has no native SPIR-V storage type);
  • get_quadrants_uint_type(dt) for everything else.

load_buffer and store_buffer adopt the helper. The bitcast path is a no-op when the view matches dt, so the emitted SPIR-V for the integer and physical-pointer paths is unchanged.

No decoration / no memory-barrier changes - the fix is purely in which descriptor binding is used, and aligns reads/writes with the atomic op the hardware is already natively executing.

@hughperkins (Collaborator)

Semi-orthogonal, but related: I wonder if we should start considering a CI that runs on a better GPU. If we want CI for Genesis, we will certainly need this. AMD's GPU cloud does provide such GPUs, I think (or there is packet.ai, which @v01dXYZ discovered).

@hughperkins (Collaborator)

Opus summary:

Summary

Fixes a SPIR-V codegen bug where plain loads/stores of float buffer elements could alias incorrectly with atomic float operations (OpAtomicFAddEXT) on the same memory,
causing reverse-mode autodiff to read stale values on Vulkan.
The fix changes load_buffer / store_buffer to access primitive float types through their native float view of the storage buffer, rather than the uint-punned view.
The view-selection logic is extracted into a new helper pick_buffer_access_type(dt, ptr_val, ir) shared by both functions.

Root cause

In SPIR-V / Vulkan, each (descriptor_set, binding) is a distinct variable. at_buffer creates a new binding per (buffer, element_type) pair, so the u32 view and the
f32 view of the same buffer are different variables aliasing the same memory. Without an Aliased decoration, the driver / SPIRV-Tools is free to assume they do not alias,
so an OpLoad through the u32 view is not ordered against a preceding OpAtomicFAddEXT through the f32 view at the same address.
The reverse-mode pattern

m.grad[i][j, k] += loss.grad
tmp = m.grad[i][j, k]
m.grad[i][j, k] = 0
n.grad += tmp * factor

hits this exactly: the load reads the stale zero initial value, tmp == 0, and the adjoint never propagates.
test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan] asserts 0.0 == 1.0 as a result.

Behavior matrix

| dt | Before | After |
| --- | --- | --- |
| f16 / f32 / f64 | uint view (u32) | native float view |
| i* / u* (≥ 8-bit) | uint view | uint view (unchanged) |
| u1 | u8 load / i8 store | u8 load / i8 store (unchanged) |
| 64-bit pointer path | dt directly | dt directly (unchanged) |

Good points

  • Targeted fix for a real, reproducible miscompile (autodiff returning 0 instead of the gradient on Vulkan).
  • Removes the aliasing question entirely rather than papering over it with Aliased decorations or memory barriers — plain load/store and the atomic now share a single
    binding.
  • Small, contained diff (one file, +30 / −11) that touches only the view-selection logic; the surrounding load/store machinery is unchanged.
  • Refactor improves readability: the previously duplicated chain of ifs in load_buffer and store_buffer is now a single pick_buffer_access_type helper, making it
    obvious that the two paths agree.
  • Existing carve-outs preserved: u1 still maps to u8/i8, and the u64 pointer path still uses dt directly, so no regressions on those code paths.
  • Documented: a substantial comment explains the SPIR-V aliasing model, why the bug occurred, and which test reproduces it — useful for future readers and for anyone
    tempted to "simplify" the helper.

Bad points / risks

  • Asymmetry between load and store for u1: load uses u8, store uses i8. This matches the previous behavior, but the helper does not encode it — store_buffer still
    has to override the helper's result with a special-case if for u1. A cleaner design would push the load/store distinction into the helper (or at least into a named
    constant) so the two sites can't drift.
  • No new test added. The fix relies on the existing test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan] to catch regressions; a more direct unit test
    of the SPIR-V output (e.g. checking that the load and the atomic resolve to the same binding) would harden against future refactors.
  • Possible compatibility surface for native float storage views. Switching f16 / f64 loads/stores to native views requires the corresponding SPIR-V capabilities
    (StorageBuffer16BitAccess, Float64, etc.) to be requested wherever those types are used. If any code path emits f16/f64 load/store without already requesting these
    capabilities, this change could surface a validation error on devices that previously worked via the u32 punning path. Worth confirming the capability-request logic covers
    all is_real(dt) cases.
  • Increased binding count. Because at_buffer allocates a binding per (buffer, element_type) pair, kernels that previously only used the u32 view for floats will now
    also allocate the native float binding. This is almost certainly negligible, but on drivers with tight descriptor limits it's a (very small) extra cost.
  • is_real(dt) is the trigger — if any non-IEEE "real-like" type is added later (e.g. bf16 with no native SPIR-V storage type), the helper will route it down the
    native-view path and break. A short assertion or whitelist (f16/f32/f64) would be safer than the open-ended predicate.
  • Fix is Vulkan/SPIR-V only. If the same aliasing pattern exists in other backends that pun float buffers through integer views, those remain unfixed; the PR doesn't claim
    to address them, but it's worth noting for triage.

@hughperkins (Collaborator)

Most of the bad points surfaced by Opus above look like AI-flation to me, but I'd like your thoughts on the following observation:

"Asymmetry between load and store for u1: load uses u8, store uses i8. This matches the previous behavior, but the helper does not encode it — store_buffer still
has to override the helper's result with a special-case if for u1. A cleaner design would push the load/store distinction into the helper (or at least into a named
constant) so the two sites can't drift."

Comment on lines +2005 to +2016
static DataType pick_buffer_access_type(DataType dt,
                                        const spirv::Value &ptr_val,
                                        spirv::IRBuilder &ir) {
  if (dt->is_primitive(PrimitiveTypeID::u1)) {
    return PrimitiveType::u8;
  }
  if (ptr_val.stype.dt == PrimitiveType::u64) {
    return dt;
  }
  if (is_real(dt)) {
    return dt;
  }
  return ir.get_quadrants_uint_type(dt);
}

🔴 pick_buffer_access_type unconditionally routes all float loads/stores through the native float descriptor binding, but visit(AtomicOpStmt) only uses the native float binding for add-capable native atomics — for CAS-emulated add (e.g., NVIDIA T4, which lacks spirv_has_atomic_float_add) and for all non-add float atomics (min/max/mul on any device), it still calls at_buffer(stmt->dest, ir_->get_quadrants_uint_type(dt)), binding to the uint view. This recreates the exact cross-descriptor aliasing hazard the PR was designed to fix: after the PR, plain loads/stores use the float binding while atomics use the uint binding on both of these paths. The PR description claims the T4/CAS path is unaffected, but it is not — load/store bindings changed while atomic bindings did not.

Extended reasoning...

Root cause: pick_buffer_access_type (lines 2005-2016) returns dt unconditionally for all is_real(dt) types. This makes load_buffer and store_buffer always use the native float descriptor binding (e.g., f32 view = binding N) for float fields. However, visit(AtomicOpStmt) chooses its descriptor binding conditionally:

  • f32: uses at_buffer(dest, dt) → float view ONLY when spirv_has_atomic_float_add AND op_type==add (lines 1397-1401). Otherwise falls through to at_buffer(dest, get_quadrants_uint_type(dt)) → u32 view.
  • f16: same pattern with spirv_has_atomic_float16_add (lines 1406-1410).
  • f64: same pattern with spirv_has_atomic_float64_add (lines 1391-1395).
  • Non-add ops (min/max/mul): always take the uint-view path regardless of device capabilities (line 1475 for mul; min/max similarly).

Case 1 — CAS fallback on NVIDIA T4 (CI runner): The T4 does not expose shaderBufferFloat32AtomicAdd, so visit(AtomicOpStmt) for f32 add uses at_buffer(dest, get_quadrants_uint_type(dt)) → u32 binding. Before this PR, load_buffer and store_buffer also used the u32 view (binding N). After this PR they use the f32 view (binding N+1). The CAS loop in atomic_operation (spirv_ir_builder.cpp:973-978, comment: 'Device-buffer pointers are uint-typed (from at_buffer), so CAS uses uint') issues OpAtomicLoad and OpAtomicCompareExchange on the u32 binding, while the subsequent plain OpLoad (load_buffer) now issues on the f32 binding. Without an Aliased decoration the driver is free to treat these as non-aliasing; the load reads the stale zero value. The PR description explicitly marks 'CI remains green on T4 (CAS emulation path is unchanged)' as a tested invariant, but this invariant is violated: load/store changed bindings even though the atomic did not.

Case 2 — Non-add float atomics on any device: For op_type == min, max, or mul on f32/f16/f64 fields, visit(AtomicOpStmt) always takes the uint view path regardless of device capabilities (e.g., line 1475 for mul). After this PR, the plain load/store for the same field use the native float view. A kernel mixing f32.min atomics with plain f32 loads at the same buffer address — which is a legitimate pattern — now has the atomic on binding N (u32) and the load on binding N+1 (f32). This applies to every Vulkan device including the AMD RX 7900 XTX the PR was validated on.

Concrete proof for Case 2 on AMD RX 7900 XTX: Suppose we have a gradient descent kernel that does: (a) atomic f32.min to compute running minimum in buffer B at index i, then (b) plain f32 load from buffer B at index i to read back the result. Before this PR: both operations use u32 binding N → consistent. After this PR: (a) uses u32 binding N (unchanged), (b) uses f32 binding N+1 (changed). The driver may serve the load from a register cached from the f32 binding, which has never been written by the atomic, returning a stale zero.

Concrete proof for Case 1 on NVIDIA T4: Reverse-mode pattern: (1) atomic f32 add to m.grad[i] → u32 binding N. (2) load m.grad[i] → now f32 binding N+1 (stale zero). (3) store 0 to m.grad[i] → f32 binding N+1 (no effect on u32 binding). (4) atomic f32 add to n.grad += loaded * factor → loaded is 0, so n.grad never accumulates the contribution. The gradient is silently zero.

Fix: pick_buffer_access_type should mirror the binding selection logic in visit(AtomicOpStmt). It cannot be a stateless helper because the correct binding for a float field depends on both the device capability and the op type. One correct approach: keep load/store on the u32 view (reverting to the pre-PR behavior for loads/stores), and instead fix the aliasing by adding an Aliased decoration when a buffer is exposed through multiple views. Another approach: pass the op type and capability set to pick_buffer_access_type and return the native float type only when the caller guarantees that all atomics for this field will also use the native float view.
