Skip to content

Commit 51a6c18

Browse files
jlebarclaude
andcommitted
x86: +1 reproducible bug (#157 DSE redundant-store drops nontemporal)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 69ba115 commit 51a6c18

11 files changed

Lines changed: 615 additions & 1 deletion

x86/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -180,3 +180,4 @@ Goal: find ≥100 real bugs in the x86 path through the default LLVM pass pipeli
180180
| 154 | [154-simplifycfg-sink-merges-two-volatile-seqcst-atomicrmw](bugs/154-simplifycfg-sink-merges-two-volatile-seqcst-atomicrmw/) | SimplifyCFG SinkCommonCodeFromPredecessors | volatile seq_cst `atomicrmw` instructions in mutually-exclusive branches sunk into one (sibling of #152) | confirmed (opt diff) |
181181
| 155 | [155-frexp-i64-libcall-stack-slot-overrun](bugs/155-frexp-i64-libcall-stack-slot-overrun/) | TargetLowering expandMultipleResultFPLibCall | `llvm.frexp.f64.i64` allocates 8-byte slot, libcall writes 4 (int), load reads 8 — uninitialized upper 4 bytes (info leak + wrong value) | confirmed (asm) |
182182
| 156 | [156-instcombine-fcmp-nnan-with-nan-folds-to-bool](bugs/156-instcombine-fcmp-nnan-with-nan-folds-to-bool/) | InstCombine fcmp w/ nnan + NaN constant | should be `poison` per LangRef nnan rules; instead folds to `true`/`false` | confirmed (opt diff) |
183+
| 157 | [157-dse-redundant-stores-of-existing-values-drops-nontemporal](bugs/157-dse-redundant-stores-of-existing-values-drops-nontemporal/) | DSE eliminateRedundantStoresOfExistingValues | `isIdenticalToWhenDefined` ignores metadata; merging two identical stores drops `!nontemporal` (different code path from #149/#153) | confirmed (opt diff) |
Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
file: llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:2450-2507
2+
(DSEState::eliminateRedundantStoresOfExistingValues)
3+
4+
When two stores write the same value to the same location ("redundant
5+
existing value" elimination), DSE picks one store to keep and deletes
6+
the other without performing any metadata merge or transfer. The
7+
identity check uses `Instruction::isIdenticalToWhenDefined(...,
8+
/*IntersectAttrs=*/true)`, which ignores instruction metadata. So a
9+
plain `store` and `store ..., !nontemporal !0` are treated as
10+
identical, and the lower one (the iteration's `DefInst`) is dropped
11+
via `deleteDeadInstruction(DefInst)`. The kept store retains only its
12+
own metadata.
13+
14+
This is reachable through the main `eliminateDeadStores` path too: a
15+
later store of an identical value with !nontemporal can also be
16+
killed as a "dead" store, losing the user's nontemporal hint.
17+
18+
Reproducer:
19+
20+
target triple = "x86_64-unknown-linux-gnu"
21+
22+
define void @f(ptr %p, i32 %v) {
23+
entry:
24+
store i32 %v, ptr %p, align 4, !nontemporal !0
25+
br label %next
26+
next:
27+
store i32 %v, ptr %p, align 4
28+
ret void
29+
}
30+
31+
!0 = !{i32 1}
32+
33+
opt -passes=dse output:
34+
35+
define void @f(ptr %p, i32 %v) {
36+
entry:
37+
store i32 %v, ptr %p, align 4
38+
ret void
39+
}
40+
41+
llc -mtriple=x86_64-- -mattr=+sse2 codegen diff:
42+
43+
Without DSE:
44+
movntil %edx, (%rdi) ; nontemporal write
45+
movl %edx, (%rdi)
46+
With DSE:
47+
movl %edx, (%rdi) ; nontemporal hint LOST
48+
49+
Same-block reproducer also miscompiles (both upper-nontemporal-lower-
50+
plain and upper-plain-lower-nontemporal lose `!nontemporal` in at
51+
least one ordering, depending on which path catches it first).
52+
53+
Fix: in `eliminateRedundantStoresOfExistingValues`, after picking the
54+
survivor, intersect/merge MD_nontemporal, MD_invariant_group,
55+
MD_alias_scope, MD_noalias, MD_tbaa (similar to combineMetadata in
56+
Local.cpp), or refuse to delete when the deleted store carries
57+
attributes/metadata the survivor lacks. Same fix applies to the
58+
"redundant identical store" branch in the main loop.
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
#!/usr/bin/env bash
2+
OPT=/home/orenamd@semianalysis.com/FuzzX/amdgpu/build/llvm-fuzzer/bin/opt
3+
echo "===== DSE keeps later non-nontemporal store, deletes earlier nontemporal — !nontemporal dropped ====="
4+
"$OPT" -passes=dse -S repro.ll | grep -E "define|store|ret"
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
target triple = "x86_64-unknown-linux-gnu"
2+
define void @f(ptr %p) {
3+
store i32 0, ptr %p, align 4, !nontemporal !1 ; nontemporal store of 0
4+
store i32 0, ptr %p, align 4 ; redundant store of same value 0, no nontemporal
5+
ret void
6+
}
7+
!1 = !{i32 1}
Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,73 @@
1+
# w66: LowerMatrixIntrinsics fused dot-product FlattenArg drops volatile
2+
3+
## Root cause
4+
`LowerMatrixIntrinsics::lowerDotProduct::FlattenArg` lowers a
5+
`llvm.matrix.column.major.load` intrinsic to a plain `Builder.CreateLoad`
6+
without propagating the i1 `isVolatile` operand (ArgIndex 2 of the intrinsic).
7+
8+
```
9+
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp:1718-1726
10+
auto FlattenArg = [&Builder, ...](Value *Op) {
11+
...
12+
if (match(Op, m_Intrinsic<Intrinsic::matrix_column_major_load>(
13+
m_Value(Arg)))) {
14+
auto *NewLoad = Builder.CreateLoad(Op->getType(), Arg); // <-- no volatile
15+
Op->replaceAllUsesWith(NewLoad);
16+
eraseFromParentAndRemoveFromShapeMap(cast<Instruction>(Op));
17+
return;
18+
}
19+
...
20+
};
21+
```
22+
23+
The `m_Intrinsic<matrix_column_major_load>(m_Value(), m_One())` precondition
24+
on `CanBeFlattened` only forces stride==1; it does NOT constrain the
25+
`isVolatile` arg. Therefore an intrinsic with `i1 true` for isVolatile is
26+
matched and lowered to a plain `load`.
27+
28+
## Trigger condition (x86)
29+
Dot product (1xN * Nx1) with `reassoc` fast-math flag.
30+
31+
```
32+
target triple = "x86_64-unknown-linux-gnu"
33+
34+
define <1 x double> @matmul_vol(ptr %a, ptr %b) {
35+
entry:
36+
%A = call <4 x double> @llvm.matrix.column.major.load.v4f64.i64(
37+
ptr %a, i64 1, i1 true, i32 1, i32 4) ; <-- volatile
38+
%B = call <4 x double> @llvm.matrix.column.major.load.v4f64.i64(
39+
ptr %b, i64 4, i1 false, i32 4, i32 1)
40+
%M = call reassoc <1 x double> @llvm.matrix.multiply.v1f64.v4f64.v4f64(
41+
<4 x double> %A, <4 x double> %B, i32 1, i32 4, i32 1)
42+
ret <1 x double> %M
43+
}
44+
```
45+
46+
After `opt -passes=lower-matrix-intrinsics -S`:
47+
48+
```
49+
define <1 x double> @matmul_vol(ptr %a, ptr %b) {
50+
entry:
51+
%col.load = load <4 x double>, ptr %b, align 8 ; <-- non-volatile (originally non-volatile B - OK)
52+
%0 = load <4 x double>, ptr %a, align 32 ; <-- non-volatile, but A WAS volatile
53+
%1 = fmul <4 x double> %0, %col.load
54+
%2 = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> %1)
55+
%3 = insertelement <1 x double> poison, double %2, i64 0
56+
ret <1 x double> %3
57+
}
58+
```
59+
60+
`%A` was created via `matrix.column.major.load(..., i1 true, ...)` (volatile
61+
MMIO-style read) but the lowered `%0 = load <4 x double>, ptr %a` is no
62+
longer volatile.
63+
64+
## Fix
65+
Pass `match`'s captured volatile arg to `Builder.CreateAlignedLoad` via the
66+
`isVolatile` parameter, or call `NewLoad->setVolatile(...)` after.
67+
The `match` pattern needs to either reject non-volatile constraint or
68+
capture it for forwarding.
69+
70+
## Why this matters
71+
A user-marked `volatile` matrix load (e.g., reading from a hardware
72+
co-processor's memory-mapped matrix register) gets converted to a plain
73+
load that any later DCE / hoisting pass may elide.
Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
1+
# worker-82 investigation notes (2026-05-21)
2+
3+
No confirmed reproducible miscompiles in ~12 minute window for
4+
`llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp`. Patterns investigated and
5+
verified correct (rule-outs to spare future workers re-deriving):
6+
7+
## 1. simplifyTernarylogic (lines 669-1734)
8+
9+
Wrote a Python verifier that parses each of the 256 `case 0xNN:` entries and
10+
evaluates the expression against the canonical A=0xf0, B=0xcc, C=0xaa truth-
11+
table constants. ALL entries verify (script: /tmp/w82/verify_ternlog.py). The
12+
in-source assertion at line 1732 already enforces this at runtime; the table is
13+
correct by construction.
14+
15+
## 2. simplifyX86vpermilvar (lines 2068-2113)
16+
17+
For PS: keeps bits [1:0] of each i32 mask element. For PD: shifts right by 1
18+
(extracting bit 1 of each i64 element). The +lane_offset adjustment yields
19+
correct global shuffle indices. Tested all-bits-set mask (yields max index per
20+
lane), all-zero mask (yields lane base), and "bit 1 vs bit 0" PD variants. All
21+
results match hardware semantics. The SimplifyDemandedBits calls at 3098 (mask
22+
0b00011 for PS) and 3111 (mask 0b00010 for PD) are also correct.
23+
24+
## 3. simplifyX86pshufb (lines 2024-2065) and demand-bits mask 0b10001111
25+
26+
Per-lane low-nibble index + sign-bit-zero behavior matches hardware. Demand
27+
mask 0x8F (bits 0,1,2,3,7) ignores the always-ignored bits 4,5,6. Tested with
28+
0x70 (mid-bits set, no sign) → select index 0; 0xF0 (sign + mid-bits) → zero.
29+
30+
## 4. simplifyX86varShift (lines 297-431) - psllv/psrlv/psrav with OOR/undef
31+
32+
Tested constant shift vectors with all-OOR, all-undef, and mixed OOR + in-range
33+
for arithmetic and logical variants. Arithmetic shifts clamp OOR amounts to
34+
BitWidth-1 (correct sign-bit splat). Logical shifts return zero or bail on
35+
mixed cases (correct). The lambda `OutOfRange = [Idx<0 || BitWidth<=Idx]` is
36+
correct for both branches given how arithmetic clamps to BitWidth-1.
37+
38+
## 5. simplifyX86immShift constant-vector path (lines 247-291)
39+
40+
For PSLLW, the low 64 bits of the shift vector are read as a 64-bit count. The
41+
code correctly concatenates elements [0..3] for i16 lanes, [0..1] for i32 lanes,
42+
and uses [0] alone for i64 lanes. Verified that Count.uge(BitWidth) handles all
43+
out-of-range cases (returns 0 for logical, AShr by BitWidth-1 for arithmetic).
44+
45+
## 6. simplifyX86pmadd (lines 557-609) PMADDWD and PMADDUBSW
46+
47+
PMADDWD signed-overflow case (e.g., 32768 inputs giving sum = 2^31) wraps in
48+
i32, matching hardware. PMADDUBSW uses sadd_sat for saturating addition,
49+
matching hardware. Undef-element propagation (whole-vector undef → zero per
50+
policy) matches code comment.
51+
52+
## 7. simplifyX86pmulh (lines 499-555) PMULH, PMULHU, PMULHRSW
53+
54+
The `LShr(Mul, 14)` + `Trunc i18` trick for PMULHRSW correctly preserves sign
55+
behavior via integer wraparound — verified manually for several negative inputs
56+
(-1, -16384, -16385) giving the same low-16 result as hardware arithmetic
57+
shift. The m_One signed/unsigned paths produce correct AShr by 15 / zero.
58+
m_One does NOT match `<1, ..., undef, undef>` mixed vectors because undef ≠
59+
poison and getSplatValue(false) requires exact match — verified.
60+
61+
## 8. simplifyX86FPMaxMin (lines 1737-1783)
62+
63+
The Forbidden0/Forbidden1 with NaN|Inf|Subnormal (+NegZero on Arg1 for max,
64+
Arg0 for min) correctly handles all x86_max/min vs LLVM maxnum/minnum
65+
differences. Subnormal forbidden is needed for DAZ-input case (subnormal → 0
66+
on input flushes the comparison). Confirmed equivalence in each case.
67+
68+
## 9. simplifyX86insertps (lines 1785-1840) - INSERTPS with ZMask & arg0==arg1
69+
70+
The "shuffle with zero vector" path correctly handles both arg0==arg1 (where
71+
arg0[SourceLane] == arg1[SourceLane]) and the case where ZMask zeros out the
72+
destination lane (in which case the inserted value is immediately overridden
73+
to 0 in the ZMask loop). Verified by tracing through all branch combinations.
74+
75+
## 10. simplifyX86pack (lines 433-497) PACKSS/PACKUS
76+
77+
Saturation logic: PACKSS uses signed clamp [SIntMin, SIntMax] of dst type.
78+
PACKUS uses [0, UIntMax of dst type]. Both use signed comparisons for input
79+
clamping (matching hardware semantics where negative→0 for unsigned and
80+
input>maxint→maxint). Per-lane pack mask: PackMask shuffles in
81+
(X[lo..hi], Y[lo..hi]) per 128-bit lane, then truncates. Correct.
82+
83+
## 11. simplifyX86VPERMMask demand-bits (lines 2186-2199)
84+
85+
IdxSizeInBits = Log2_32(IsBinary ? 2*NumElts : NumElts). For permvar_qi_512
86+
(NumElts=64): bottom 6 bits demanded. For vpermi2var_qi_512 (binary, NumElts=
87+
64): bottom 7 bits demanded. Cross-checked against simplifyX86vpermv masking
88+
(`Index &= Size - 1`) and simplifyX86vpermv3 (`Index &= 2*Size - 1`).
89+
Consistent with hardware.
90+
91+
## 12. PCLMULQDQ demand-elt (lines 2765-2807)
92+
93+
DemandedElts1 = getSplat(VWidth, APInt(2, bit_for_op1)) where bit_for_op1 is
94+
01 (low qword) or 10 (high qword). Splatting a 2-bit value across VWidth
95+
yields the per-128-bit-lane pattern. For 256-bit: 0101 or 1010. For 512-bit:
96+
01010101 or 10101010. Correct (per-lane qword selection).
97+
98+
## 13. simplifyX86movmsk (lines 611-640)
99+
100+
For 4-doubles → i32 result: NumElts=4, IntegerTy=i4, bitcast through <4 x i64>,
101+
isneg, bitcast to i4, zext to i32. Correct (bits [3:0] of i32 hold sign bits).
102+
Similar for all sizes (16, 32 bytes; 4 floats, 8 floats; 2 doubles, 4 doubles).
103+
104+
## 14. BMI BEXTR/BZHI/PEXT/PDEP folds (lines 2212-2349)
105+
106+
- BEXTR: Length=0 or Shift>=BitWidth → zero. Constant fold uses
107+
`(Src >> Shift) & maskTrailingOnes(min(Length, BitWidth))`. Correct.
108+
- BZHI: Index>=BitWidth → Arg0 pass-through (matches Intel SDM: "if index >=
109+
operand size, DEST = SRC"). Index=0 → zero. Constant fold correct.
110+
- PEXT shifted-mask: (Input & mask) >> MaskIdx is correct expression for
111+
shifted-mask layout.
112+
- PDEP shifted-mask: (Input << MaskIdx) & mask correctly deposits Input's low
113+
bits at mask positions.
114+
115+
## 15. SSE4A EXTRQ/EXTRQI/INSERTQ/INSERTQI (lines 1842-2021)
116+
117+
Byte-aligned (Length%8==0 && Index%8==0) → byte shuffle. Constant-fold of
118+
bit field extraction/insertion using APInt arithmetic. Index+Length>64 →
119+
undef (matches AMD spec "results undefined"). Length=0 → Length=64 (matches
120+
AMD spec "field length 0 means 64"). All correct.
121+
122+
## Summary
123+
124+
12 minutes of careful analysis covered the major fold paths in the file. No
125+
reproducible miscompile was found. The simplifications I examined are all
126+
correct under careful semantic analysis. The Python ternlog verifier confirms
127+
the 256-entry table is structurally sound.
128+
129+
Areas where I could not find issues but also could not exhaustively cover:
130+
- Per-element undef propagation in PMULH (m_One mixed-undef does not match,
131+
so that path is safe).
132+
- KnownBits-based folding for variable shifts when shift amounts come from
133+
complex computations (would require runtime testing).
134+
- AVX-512 mask-register operations are handled in target-independent IR
135+
(plain `and i16` etc.), not in this file.
136+
137+
## Confidence
138+
139+
No new candidates submitted.
Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
file: llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:2683-2708
2+
(partial-store-merging branch in eliminateDeadStores)
3+
4+
When `OR == OW_PartialEarlierWithFullLater` and partial-store-merging
5+
is enabled (default), DSE folds the killing store's bytes into the
6+
dead store via `tryToMergePartialOverlappingStores`. The dead store's
7+
stored constant is updated with `DeadSI->setOperand(0, Merged)` and
8+
the killing store is removed with `deleteDeadInstruction(KillingSI)`.
9+
10+
The killing store's metadata is discarded entirely. If the killing
11+
store carries `!nontemporal`, the user's nontemporal hint for that
12+
slice of memory is silently dropped from the merged store.
13+
14+
Reproducer:
15+
16+
target triple = "x86_64-unknown-linux-gnu"
17+
18+
define void @f(ptr %p) {
19+
entry:
20+
store i64 0, ptr %p, align 8 ; dead
21+
store i32 -1, ptr %p, align 4, !nontemporal !0 ; killing (nontemporal)
22+
ret void
23+
}
24+
25+
!0 = !{i32 1}
26+
27+
opt -passes=dse output:
28+
29+
define void @f(ptr %p) {
30+
entry:
31+
store i64 4294967295, ptr %p, align 8
32+
ret void
33+
}
34+
35+
llc -mtriple=x86_64-- -mattr=+sse4.1 codegen diff:
36+
37+
Without DSE:
38+
movq $0, (%rdi)
39+
movl $-1, %eax
40+
movntil %eax, (%rdi) ; nontemporal write of bytes 0..3
41+
With DSE:
42+
movl $4294967295, %eax
43+
movq %rax, (%rdi) ; plain temporal write, no nontemporal
44+
45+
The merged 8-byte temporal store is observably different: the
46+
low 4 bytes were requested as nontemporal (write-combining /
47+
cache-bypass), but the resulting code uses a regular store.
48+
49+
Fix: before performing the merge, refuse if the killing store has
50+
metadata that cannot survive being attached to the (wider) dead
51+
store unchanged, in particular MD_nontemporal. Alternatively,
52+
propagate MD_nontemporal to DeadSI when KillingSI has it (and
53+
require all merged-in killing stores to agree). Same care is needed
54+
for MD_invariant_group, MD_alias_scope/noalias, and MD_tbaa.

0 commit comments

Comments
 (0)