[SPIRV] Use native float view in load/store_buffer to avoid aliasing with atomics#513
Conversation
Semi-orthogonal, but related: I wonder if we should start considering a CI that runs on a better GPU. If we want CI for Genesis, we will certainly need this. AMD GPU cloud does provide such GPUs, I think. (Or there is packet.ai, which @v01dXYZ discovered.)
Opus summary:

Summary

Fixes a SPIR-V codegen bug where plain loads/stores of float buffer elements could alias incorrectly with atomic float operations.

Root cause

In SPIR-V / Vulkan, each `(BufferInfo, element_type)` pair gets its own descriptor binding, so the same buffer is exposed through multiple typed views. The reverse-mode pattern

```
m.grad[i][j, k] += loss.grad
tmp = m.grad[i][j, k]
m.grad[i][j, k] = 0
n.grad += tmp * factor
```

hits this exactly: the load reads the stale zero initial value.

Behavior matrix
Good points
Bad points / risks
Most of the bad points surfaced by Opus above look like AI-flation to me. But I'm wondering about your thoughts on the following observation: "Asymmetry between load and store for u1: load uses u8, store uses i8. This matches the previous behavior, but the helper does not encode it — store_buffer still …"
```cpp
static DataType pick_buffer_access_type(DataType dt,
                                        const spirv::Value &ptr_val,
                                        spirv::IRBuilder &ir) {
  if (dt->is_primitive(PrimitiveTypeID::u1)) {
    return PrimitiveType::u8;
  }
  if (ptr_val.stype.dt == PrimitiveType::u64) {
    return dt;
  }
  if (is_real(dt)) {
    return dt;
  }
  return ir.get_quadrants_uint_type(dt);
}
```
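For readers skimming the diff, the helper's decision table can be modeled in plain Python. This is a sketch only: the type names and the uint-quadrant mapping stand in for the real `DataType`/`IRBuilder` machinery.

```python
# Toy model of the view selection performed by pick_buffer_access_type.
# Type names are strings; the real code works on DataType handles.

REAL = {"f16", "f32", "f64"}
UINT_QUADRANT = {"i8": "u8", "u8": "u8", "i16": "u16", "u16": "u16",
                 "i32": "u32", "u32": "u32", "i64": "u64", "u64": "u64"}

def pick_buffer_access_type(dt, ptr_is_physical_u64):
    if dt == "u1":
        return "u8"              # u1 has no native SPIR-V storage
    if ptr_is_physical_u64:
        return dt                # native-typed physical buffer pointer
    if dt in REAL:
        return dt                # NEW in this PR: native float view
    return UINT_QUADRANT[dt]     # integers keep the uint-punned view
```

Before this PR the `dt in REAL` branch did not exist, so f32/f16/f64 fell through to the uint-quadrant view like the integer types.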
🔴 pick_buffer_access_type unconditionally routes all float loads/stores through the native float descriptor binding, but visit(AtomicOpStmt) only uses the native float binding for add-capable native atomics — for CAS-emulated add (e.g., NVIDIA T4, which lacks spirv_has_atomic_float_add) and for all non-add float atomics (min/max/mul on any device), it still calls at_buffer(stmt->dest, ir_->get_quadrants_uint_type(dt)), binding to the uint view. This recreates the exact cross-descriptor aliasing hazard the PR was designed to fix: after the PR, plain loads/stores use the float binding while atomics use the uint binding on both of these paths. The PR description claims the T4/CAS path is unaffected, but it is not — load/store bindings changed while atomic bindings did not.
Extended reasoning...
Root cause: pick_buffer_access_type (lines 2005-2016) returns dt unconditionally for all is_real(dt) types. This makes load_buffer and store_buffer always use the native float descriptor binding (e.g., f32 view = binding N) for float fields. However, visit(AtomicOpStmt) chooses its descriptor binding conditionally:
- f32: uses at_buffer(dest, dt) → float view ONLY when spirv_has_atomic_float_add AND op_type==add (lines 1397-1401). Otherwise falls through to at_buffer(dest, get_quadrants_uint_type(dt)) → u32 view.
- f16: same pattern with spirv_has_atomic_float16_add (lines 1406-1410).
- f64: same pattern with spirv_has_atomic_float64_add (lines 1391-1395).
- Non-add ops (min/max/mul): always take the uint-view path regardless of device capabilities (line 1475 for mul; min/max similarly).
Case 1 — CAS fallback on NVIDIA T4 (CI runner): The T4 does not expose shaderBufferFloat32AtomicAdd, so visit(AtomicOpStmt) for f32 add uses at_buffer(dest, get_quadrants_uint_type(dt)) → u32 binding. Before this PR, load_buffer and store_buffer also used the u32 view (binding N). After this PR they use the f32 view (binding N+1). The CAS loop in atomic_operation (spirv_ir_builder.cpp:973-978, comment: 'Device-buffer pointers are uint-typed (from at_buffer), so CAS uses uint') issues OpAtomicLoad and OpAtomicCompareExchange on the u32 binding, while the subsequent plain OpLoad (load_buffer) now issues on the f32 binding. Without an Aliased decoration the driver is free to treat these as non-aliasing; the load reads the stale zero value. The PR description explicitly marks 'CI remains green on T4 (CAS emulation path is unchanged)' as a tested invariant, but this invariant is violated: load/store changed bindings even though the atomic did not.
Case 2 — Non-add float atomics on any device: For op_type == min, max, or mul on f32/f16/f64 fields, visit(AtomicOpStmt) always takes the uint view path regardless of device capabilities (e.g., line 1475 for mul). After this PR, the plain load/store for the same field use the native float view. A kernel mixing f32.min atomics with plain f32 loads at the same buffer address — which is a legitimate pattern — now has the atomic on binding N (u32) and the load on binding N+1 (f32). This applies to every Vulkan device including the AMD RX 7900 XTX the PR was validated on.
Concrete proof for Case 2 on AMD RX 7900 XTX: Suppose we have a gradient descent kernel that does: (a) atomic f32.min to compute running minimum in buffer B at index i, then (b) plain f32 load from buffer B at index i to read back the result. Before this PR: both operations use u32 binding N → consistent. After this PR: (a) uses u32 binding N (unchanged), (b) uses f32 binding N+1 (changed). The driver may serve the load from a register cached from the f32 binding, which has never been written by the atomic, returning a stale zero.
Concrete proof for Case 1 on NVIDIA T4: Reverse-mode pattern: (1) atomic f32 add to m.grad[i] → u32 binding N. (2) load m.grad[i] → now f32 binding N+1 (stale zero). (3) store 0 to m.grad[i] → f32 binding N+1 (no effect on u32 binding). (4) atomic f32 add to n.grad += loaded * factor → loaded is 0, so n.grad never accumulates the contribution. The gradient is silently zero.
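The four-step failure above can be simulated with a toy per-binding caching model. This is purely illustrative, not real driver behavior: it just models a driver that, absent an `Aliased` decoration, legally keeps separate cached copies of a buffer behind each descriptor binding.

```python
# Toy model: two descriptor bindings onto the same logical buffer,
# kept incoherent because they are not declared Aliased.

class TwoViewBuffer:
    def __init__(self):
        self.u32_view = 0.0    # binding N: uint view (CAS atomics on T4)
        self.f32_view = 0.0    # binding N+1: float view (plain load/store)

    def atomic_add_u32(self, x):   # CAS loop issued on the u32 binding
        self.u32_view += x

    def load_f32(self):            # plain OpLoad on the f32 binding
        return self.f32_view

    def store_f32(self, x):        # plain OpStore on the f32 binding
        self.f32_view = x

m_grad, n_grad, factor = TwoViewBuffer(), TwoViewBuffer(), 1.0

m_grad.atomic_add_u32(1.0)             # (1) adjoint lands on binding N
tmp = m_grad.load_f32()                # (2) stale zero from binding N+1
m_grad.store_f32(0.0)                  # (3) clear never touches binding N
n_grad.atomic_add_u32(tmp * factor)    # (4) propagates 0.0

print(tmp)  # 0.0 - the gradient contribution is silently lost
```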
Fix: pick_buffer_access_type should mirror the binding selection logic in visit(AtomicOpStmt). It cannot be a stateless helper because the correct binding for a float field depends on both the device capability and the op type. One correct approach: keep load/store on the u32 view (reverting to the pre-PR behavior for loads/stores), and instead fix the aliasing by adding an Aliased decoration when a buffer is exposed through multiple views. Another approach: pass the op type and capability set to pick_buffer_access_type and return the native float type only when the caller guarantees that all atomics for this field will also use the native float view.
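The second fix direction suggested above can be sketched as follows. Everything here is hypothetical: the function names, parameters, and capability strings are illustrative, not the actual codegen API, and it assumes the caller can determine which atomic ops target the field, which in practice requires per-field analysis.

```python
# Sketch: an op- and capability-aware view picker that mirrors the
# binding choice made in visit(AtomicOpStmt) for float atomics.

UINT_QUADRANT = {"f16": "u16", "f32": "u32", "f64": "u64"}
NATIVE_ADD_CAP = {"f16": "spirv_has_atomic_float16_add",
                  "f32": "spirv_has_atomic_float_add",
                  "f64": "spirv_has_atomic_float64_add"}

def atomic_float_view(dt, op, caps):
    """Mirror of the binding choice in visit(AtomicOpStmt)."""
    if op == "add" and NATIVE_ADD_CAP[dt] in caps:
        return dt                # native OpAtomicFAddEXT, float view
    return UINT_QUADRANT[dt]     # CAS loop / min / max / mul, uint view

def load_store_float_view(dt, op, caps):
    """Plain load/store adopt whatever view the atomics will use."""
    return atomic_float_view(dt, op, caps)

# RX 7900 XTX (native f32 add): both sides agree on the f32 view.
# T4 (no native add): both sides agree on the u32 view.
```

The point of the sketch is the invariant, not the mechanism: whichever rule the atomics use, the plain accesses must compute the same answer from the same inputs.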
Problem
On Vulkan devices that expose `VK_EXT_shader_atomic_float` with `shaderBufferFloat32AtomicAdd` (AMD RX 7900 XTX and other recent AMD/NVIDIA hardware), reverse-mode kernels that emit the load-and-clear adjoint pattern against a field return a zero gradient. Reproduction: asserts `0.0 == ((0 + 1) * (0 + 1))` on `n.grad[0][0, 0]` - the adjoint never propagates from `m.grad` through the multiplication to `n.grad`.

Root cause
The SPIR-V codegen exposes each buffer via a typed view of the underlying storage buffer. Each `(BufferInfo, element_type)` pair gets its own `OpVariable` with a fresh `DescriptorSet`/`Binding` decoration, so the same VkBuffer is exposed through multiple descriptor bindings (one per access type).

For the reverse-mode pattern emitted for `dst = expr * src` followed by `loss += dst`:

1. `OpAtomicFAddEXT` on `dst.grad` - uses the f32 view of the root buffer (binding N).
2. `OpLoad` of `dst.grad` - `load_buffer` routes f32 through the u32 view (binding N+1) with an `OpBitcast` back to float.
3. `OpStore` of 0 to `dst.grad` - same u32 view (binding N+1).
4. `OpAtomicFAddEXT` on `src.grad += loaded_value * d(expr)` - back to the f32 view (binding N).

Without an `Aliased` decoration, a Vulkan driver is free to treat bindings N and N+1 as non-aliasing descriptor variables. The plain `OpLoad` at step 2 is then not ordered against the preceding `OpAtomicFAddEXT` at step 1 - it sees the stale zero initial value, so `loaded_value = 0`, and the contribution to `src.grad` at step 4 is zero. The final gradient is always zero regardless of the forward value.

The same descriptor-binding split exists for non-float primitives, but for integers (`i32`/`u32`/`i16`/`u16`/`i8`/`u8`) both the atomic and the uint-punned plain access already use the same uint view, so there is no cross-view aliasing. The mismatch is specific to real types on the native-atomic path.

Why CI has been green
The CI Vulkan runner (`gpu-t4-4-core`, NVIDIA T4) does not expose `shaderBufferFloat32AtomicAdd` with its driver/SDK combination. On that path the codegen falls back to the CAS-loop emulation, which routes the atomic through the same uint view as the plain load/store, and the aliasing gap does not exist. The bug only surfaces on Vulkan devices that advertise the extension.

This means the regression has been latent since native-atomic lowering landed in the SPIR-V path (pre-rename, at least as old as [SPIRV] Feature Parity Atomics & Shared Array (#432)) - it is not introduced by the heap-backed adstack work (PR #493) or the adstack overflow detection (PR #495), both of which are LLVM-only.

Fix
Route plain float loads/stores through the native element-type view, so they share the descriptor binding used by `OpAtomicFAddEXT`. A new `pick_buffer_access_type(dt, ptr_val, ir)` helper returns:

- `dt` itself when `dt` is a real primitive (f32/f64/f16) or when the pointer is already a native-typed physical buffer pointer,
- `u8` for `u1` (unchanged - u1 has no native SPIR-V storage), and
- `get_quadrants_uint_type(dt)` for everything else.

`load_buffer` and `store_buffer` adopt the helper; the bitcast path is a no-op when the view matches `dt`, so the emitted SPIR-V for integer and physical-pointer paths is unchanged.

No decoration / no memory-barrier changes - the fix is purely in which descriptor binding is used, and aligns reads/writes with the atomic op the hardware is already natively executing.
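Under the same kind of toy per-binding model used to explain the bug, the fix amounts to all four accesses going through one binding, so the ordering question disappears. A sketch, not the real codegen:

```python
# After the fix: plain loads/stores and OpAtomicFAddEXT all use the
# native f32 view, i.e. a single descriptor binding. Toy model only.

class OneViewBuffer:
    def __init__(self):
        self.f32_view = 0.0       # the only binding ever touched

    def atomic_fadd(self, x):     # OpAtomicFAddEXT, f32 binding
        self.f32_view += x

    def load(self):               # plain OpLoad, same binding
        return self.f32_view

    def store(self, x):           # plain OpStore, same binding
        self.f32_view = x

m_grad, n_grad, factor = OneViewBuffer(), OneViewBuffer(), 1.0

m_grad.atomic_fadd(1.0)               # adjoint accumulated
tmp = m_grad.load()                   # reads 1.0, no stale view
m_grad.store(0.0)                     # clear is visible to the atomics
n_grad.atomic_fadd(tmp * factor)      # adjoint propagates correctly
```

This models the native-atomic path the PR targets; as the review comment above notes, any path where the atomic still binds the uint view remains split.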