SemiAnalysisAI
diff --git a/x86/README.md b/x86/README.md
diff --git a/x86/bugs/157-dse-redundant-stores-of-existing-values-drops-nontemporal/NOTES.md b/x86/bugs/157-dse-redundant-stores-of-existing-values-drops-nontemporal/NOTES.md
diff --git a/x86/bugs/157-dse-redundant-stores-of-existing-values-drops-nontemporal/cmd.sh b/x86/bugs/157-dse-redundant-stores-of-existing-values-drops-nontemporal/cmd.sh
diff --git a/x86/bugs/157-dse-redundant-stores-of-existing-values-drops-nontemporal/repro.ll b/x86/bugs/157-dse-redundant-stores-of-existing-values-drops-nontemporal/repro.ll
diff --git a/x86/candidates/w66-lower-matrix-intrinsics-flatten-drops-volatile.md b/x86/candidates/w66-lower-matrix-intrinsics-flatten-drops-volatile.md
diff --git a/x86/candidates/w82-investigation-notes.md b/x86/candidates/w82-investigation-notes.md
diff --git a/x86/candidates/w93-dse-partial-overlap-merge-drops-nontemporal.md b/x86/candidates/w93-dse-partial-overlap-merge-drops-nontemporal.md
@@ -180,3 +180,4 @@ Goal: find ≥100 real bugs in the x86 path through the default LLVM pass pipeli
 | 154 | [154-simplifycfg-sink-merges-two-volatile-seqcst-atomicrmw](bugs/154-simplifycfg-sink-merges-two-volatile-seqcst-atomicrmw/) | SimplifyCFG SinkCommonCodeFromPredecessors | volatile seq_cst `atomicrmw` instructions in mutually-exclusive branches sunk into one (sibling of #152) | confirmed (opt diff) |
 | 155 | [155-frexp-i64-libcall-stack-slot-overrun](bugs/155-frexp-i64-libcall-stack-slot-overrun/) | TargetLowering expandMultipleResultFPLibCall | `llvm.frexp.f64.i64` allocates 8-byte slot, libcall writes 4 (int), load reads 8 — uninitialized upper 4 bytes (info leak + wrong value) | confirmed (asm) |
 | 156 | [156-instcombine-fcmp-nnan-with-nan-folds-to-bool](bugs/156-instcombine-fcmp-nnan-with-nan-folds-to-bool/) | InstCombine fcmp w/ nnan + NaN constant | should be `poison` per LangRef nnan rules; instead folds to `true`/`false` | confirmed (opt diff) |
+| 157 | [157-dse-redundant-stores-of-existing-values-drops-nontemporal](bugs/157-dse-redundant-stores-of-existing-values-drops-nontemporal/) | DSE eliminateRedundantStoresOfExistingValues | `isIdenticalToWhenDefined` ignores metadata; merging two identical stores drops `!nontemporal` (different code path from #149/#153) | confirmed (opt diff) |
@@ -0,0 +1,58 @@
+file: llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:2450-2507
+(DSEState::eliminateRedundantStoresOfExistingValues)
+
+When two stores write the same value to the same location ("redundant
+existing value" elimination), DSE picks one store to keep and deletes
+the other without performing any metadata merge or transfer. The
+identity check uses `Instruction::isIdenticalToWhenDefined(...,
+/*IntersectAttrs=*/true)`, which ignores instruction metadata. So a
+plain `store` and `store ..., !nontemporal !0` are treated as
+identical, and the lower one (the iteration's `DefInst`) is dropped
+via `deleteDeadInstruction(DefInst)`. The kept store retains only its
+own metadata.
+
+This is reachable through the main `eliminateDeadStores` path too: a
+later store of an identical value with !nontemporal can also be
+killed as a "dead" store, losing the user's nontemporal hint.
+
+Reproducer:
+
+  target triple = "x86_64-unknown-linux-gnu"
+
+  define void @f(ptr %p, i32 %v) {
+  entry:
+    store i32 %v, ptr %p, align 4, !nontemporal !0
+    br label %next
+  next:
+    store i32 %v, ptr %p, align 4
+    ret void
+  }
+
+  !0 = !{i32 1}
+
+opt -passes=dse output:
+
+  define void @f(ptr %p, i32 %v) {
+  entry:
+    store i32 %v, ptr %p, align 4
+    ret void
+  }
+
+llc -mtriple=x86_64-- -mattr=+sse2 codegen diff:
+
+  Without DSE:
+    movntil %esi, (%rdi)    ; nontemporal write
+    movl    %esi, (%rdi)
+  With DSE:
+    movl    %esi, (%rdi)    ; nontemporal hint LOST
+
+A same-block reproducer also miscompiles: of the two orderings (nontemporal
+store first, or plain store first), at least one loses `!nontemporal`,
+depending on which path catches it first.
+
+Fix: in `eliminateRedundantStoresOfExistingValues`, after picking the
+survivor, intersect/merge MD_nontemporal, MD_invariant_group,
+MD_alias_scope, MD_noalias, MD_tbaa (similar to combineMetadata in
+Local.cpp), or refuse to delete when the deleted store carries
+attributes/metadata the survivor lacks. Same fix applies to the
+"redundant identical store" branch in the main loop.
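
The failure mode and the "refuse to delete" variant of the fix can be illustrated with a toy model in Python. This is only a sketch under loud assumptions: `Store`, `same_ignoring_metadata`, `drop_redundant`, and `drop_redundant_guarded` are hypothetical names, not LLVM APIs. The model always keeps the earlier store and ignores intervening memory accesses, so it shows only why an identity check that skips metadata loses `!nontemporal`, not which store the real pass keeps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """Toy stand-in for an IR store; all fields are illustrative."""
    ptr: str
    value: str
    align: int
    metadata: frozenset = frozenset()


def same_ignoring_metadata(a, b):
    # Mirrors the idea of an identity check that does not look at metadata.
    return (a.ptr, a.value, a.align) == (b.ptr, b.value, b.align)


def drop_redundant(stores):
    """Keep the first of each group of identical stores, metadata ignored."""
    kept = []
    for s in stores:
        if any(same_ignoring_metadata(k, s) for k in kept):
            continue  # s is deleted together with whatever metadata it had
        kept.append(s)
    return kept


def drop_redundant_guarded(stores):
    """Same, but refuse to delete a store whose metadata the survivor lacks."""
    kept = []
    for s in stores:
        if any(same_ignoring_metadata(k, s) and s.metadata <= k.metadata
               for k in kept):
            continue
        kept.append(s)
    return kept
```

With a plain store followed by a nontemporal one, `drop_redundant` returns only the plain store, while `drop_redundant_guarded` keeps both.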
@@ -0,0 +1,4 @@
+#!/usr/bin/env bash
+OPT="${OPT:-/home/orenamd@semianalysis.com/FuzzX/amdgpu/build/llvm-fuzzer/bin/opt}"
+echo "===== DSE keeps later non-nontemporal store, deletes earlier nontemporal — !nontemporal dropped ====="
+"$OPT" -passes=dse -S repro.ll | grep -E "define|store|ret"
@@ -0,0 +1,7 @@
+target triple = "x86_64-unknown-linux-gnu"
+define void @f(ptr %p) {
+  store i32 0, ptr %p, align 4, !nontemporal !1   ; nontemporal store of 0
+  store i32 0, ptr %p, align 4                    ; redundant store of same value 0, no nontemporal
+  ret void
+}
+!1 = !{i32 1}
@@ -0,0 +1,73 @@
+# w66: LowerMatrixIntrinsics fused dot-product FlattenArg drops volatile
+
+## Root cause
+`LowerMatrixIntrinsics::lowerDotProduct::FlattenArg` lowers a
+`llvm.matrix.column.major.load` intrinsic to a plain `Builder.CreateLoad`
+without propagating the i1 `isVolatile` operand (ArgIndex 2 of the intrinsic).
+
+```
+llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp:1718-1726
+auto FlattenArg = [&Builder, ...](Value *Op) {
+  ...
+  if (match(Op, m_Intrinsic<Intrinsic::matrix_column_major_load>(
+                    m_Value(Arg)))) {
+    auto *NewLoad = Builder.CreateLoad(Op->getType(), Arg);  // <-- no volatile
+    Op->replaceAllUsesWith(NewLoad);
+    eraseFromParentAndRemoveFromShapeMap(cast<Instruction>(Op));
+    return;
+  }
+  ...
+};
+```
+
+The `m_Intrinsic<matrix_column_major_load>(m_Value(), m_One())` precondition
+on `CanBeFlattened` only forces stride==1; it does NOT constrain the
+`isVolatile` arg. Therefore an intrinsic with `i1 true` for isVolatile is
+matched and lowered to a plain `load`.
+
+## Trigger condition (x86)
+Dot product (1xN * Nx1) with `reassoc` fast-math flag.
+
+```
+target triple = "x86_64-unknown-linux-gnu"
+
+define <1 x double> @matmul_vol(ptr %a, ptr %b) {
+entry:
+  %A = call <4 x double> @llvm.matrix.column.major.load.v4f64.i64(
+        ptr %a, i64 1, i1 true,  i32 1, i32 4)   ; <-- volatile
+  %B = call <4 x double> @llvm.matrix.column.major.load.v4f64.i64(
+        ptr %b, i64 4, i1 false, i32 4, i32 1)
+  %M = call reassoc <1 x double> @llvm.matrix.multiply.v1f64.v4f64.v4f64(
+        <4 x double> %A, <4 x double> %B, i32 1, i32 4, i32 1)
+  ret <1 x double> %M
+}
+```
+
+After `opt -passes=lower-matrix-intrinsics -S`:
+
+```
+define <1 x double> @matmul_vol(ptr %a, ptr %b) {
+entry:
+  %col.load = load <4 x double>, ptr %b, align 8        ; <-- non-volatile (originally non-volatile B - OK)
+  %0        = load <4 x double>, ptr %a, align 32       ; <-- non-volatile, but A WAS volatile
+  %1 = fmul <4 x double> %0, %col.load
+  %2 = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> %1)
+  %3 = insertelement <1 x double> poison, double %2, i64 0
+  ret <1 x double> %3
+}
+```
+
+`%A` was created via `matrix.column.major.load(..., i1 true, ...)` (volatile
+MMIO-style read) but the lowered `%0 = load <4 x double>, ptr %a` is no
+longer volatile.
+
+## Fix
+Capture the intrinsic's `isVolatile` operand in the `match` pattern and
+forward it to `Builder.CreateAlignedLoad` via its `isVolatile` parameter, or
+call `NewLoad->setVolatile(...)` afterwards. Alternatively, have the `match`
+pattern reject volatile loads so they are never flattened.
+
+## Why this matters
+A user-marked `volatile` matrix load (e.g., reading from a hardware
+co-processor's memory-mapped matrix register) gets converted to a plain
+load that later passes are free to delete, hoist, or merge.
@@ -0,0 +1,139 @@
+# worker-82 investigation notes (2026-05-21)
+
+No confirmed reproducible miscompiles in ~12 minute window for
+`llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp`. Patterns investigated and
+verified correct (rule-outs to spare future workers re-deriving):
+
+## 1. simplifyTernarylogic (lines 669-1734)
+
+Wrote a Python verifier that parses each of the 256 `case 0xNN:` entries and
+evaluates the expression against the canonical A=0xf0, B=0xcc, C=0xaa truth-
+table constants. ALL entries verify (script: /tmp/w82/verify_ternlog.py). The
+in-source assertion at line 1732 already enforces this at runtime; the table is
+correct by construction.
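
The verification idea can be sketched in a few lines of Python. This is not the script referenced above: the sample expressions below are illustrative choices, not entries copied from the LLVM source, and `imm_from_expr`, `ternlog_bit`, and `entry_is_consistent` are hypothetical helper names. The sketch evaluates an expression on the constants A=0xf0, B=0xcc, C=0xaa and also checks it bit by bit against the immediate.

```python
# Canonical truth-table constants: bit i of A/B/C is bit 2/1/0 of index i.
A, B, C = 0xF0, 0xCC, 0xAA


def imm_from_expr(expr):
    """Immediate byte implied by a bitwise expression over (a, b, c)."""
    return expr(A, B, C) & 0xFF


def ternlog_bit(imm, a, b, c):
    """Result bit selected by the immediate for one combination of input bits."""
    return (imm >> ((a << 2) | (b << 1) | c)) & 1


def entry_is_consistent(imm, expr):
    """True if expr matches imm both via the constants and bit by bit."""
    if imm_from_expr(expr) != imm:
        return False
    return all(
        ternlog_bit(imm, a, b, c) == (expr(a, b, c) & 1)
        for a in (0, 1) for b in (0, 1) for c in (0, 1)
    )


# Illustrative entries only (assumed, not taken from the source file).
SAMPLE_ENTRIES = {
    0x80: lambda a, b, c: a & b & c,
    0x96: lambda a, b, c: a ^ b ^ c,
    0xCA: lambda a, b, c: (a & b) | (~a & c),
    0xFE: lambda a, b, c: a | b | c,
}
```

Running `entry_is_consistent` over every `(imm, expr)` pair in a table gives the same kind of check for all 256 cases.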
+
+## 2. simplifyX86vpermilvar (lines 2068-2113)
+
+For PS: keeps bits [1:0] of each i32 mask element. For PD: shifts right by 1
+(extracting bit 1 of each i64 element). The +lane_offset adjustment yields
+correct global shuffle indices. Tested all-bits-set mask (yields max index per
+lane), all-zero mask (yields lane base), and "bit 1 vs bit 0" PD variants. All
+results match hardware semantics. The SimplifyDemandedBits calls at 3098 (mask
+0b00011 for PS) and 3111 (mask 0b00010 for PD) are also correct.
+
+## 3. simplifyX86pshufb (lines 2024-2065) and demand-bits mask 0b10001111
+
+Per-lane low-nibble index + sign-bit-zero behavior matches hardware. Demand
+mask 0x8F (bits 0,1,2,3,7) ignores the always-ignored bits 4,5,6. Tested with
+0x70 (mid-bits set, no sign) → select index 0; 0xF0 (sign + mid-bits) → zero.
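
A small Python model of one 128-bit lane makes the two test cases concrete. It is a sketch of the byte-select rule as described here (bit 7 zeroes the result, bits 3:0 pick the source byte); `pshufb_lane` is a hypothetical helper name.

```python
def pshufb_lane(src, ctrl):
    """Model PSHUFB on one 16-byte lane.

    src and ctrl are lists of 16 byte values. A control byte with bit 7 set
    yields 0; otherwise its low nibble selects a source byte. Bits 4..6 are
    ignored.
    """
    assert len(src) == 16 and len(ctrl) == 16
    return [0 if c & 0x80 else src[c & 0x0F] for c in ctrl]


def mask_demanded_bits(ctrl):
    """Clear the control bits that the 0x8F demand mask does not cover."""
    return [c & 0x8F for c in ctrl]
```

Clearing bits 4, 5, and 6 of every control byte leaves the result unchanged, which is what the 0x8F demand mask relies on.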
+
+## 4. simplifyX86varShift (lines 297-431) - psllv/psrlv/psrav with OOR/undef
+
+Tested constant shift vectors with all-OOR, all-undef, and mixed OOR + in-range
+for arithmetic and logical variants. Arithmetic shifts clamp OOR amounts to
+BitWidth-1 (correct sign-bit splat). Logical shifts return zero or bail on
+mixed cases (correct). The lambda `OutOfRange = [Idx<0 || BitWidth<=Idx]` is
+correct for both branches given how arithmetic clamps to BitWidth-1.
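
The per-element out-of-range behavior can be written down as a short Python model. This is a sketch of the semantics described in this section, not of the LLVM code; the function names are hypothetical and the shift amount is treated as an unsigned integer.

```python
def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def logical_shift_right(x, amt, bits=32):
    """Per-element PSRLV-style shift: out-of-range amounts give zero."""
    mask = (1 << bits) - 1
    return 0 if amt >= bits else (x & mask) >> amt


def logical_shift_left(x, amt, bits=32):
    """Per-element PSLLV-style shift: out-of-range amounts give zero."""
    mask = (1 << bits) - 1
    return 0 if amt >= bits else (x << amt) & mask


def arithmetic_shift_right(x, amt, bits=32):
    """Per-element PSRAV-style shift: amounts clamp to bits - 1 (sign splat)."""
    mask = (1 << bits) - 1
    amt = min(amt, bits - 1)
    return (to_signed(x, bits) >> amt) & mask
```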
+
+## 5. simplifyX86immShift constant-vector path (lines 247-291)
+
+For PSLLW, the low 64 bits of the shift vector are read as a 64-bit count. The
+code correctly concatenates elements [0..3] for i16 lanes, [0..1] for i32 lanes,
+and uses [0] alone for i64 lanes. Verified that Count.uge(BitWidth) handles all
+out-of-range cases (returns 0 for logical, AShr by BitWidth-1 for arithmetic).
+
+## 6. simplifyX86pmadd (lines 557-609) PMADDWD and PMADDUBSW
+
+PMADDWD signed-overflow case (e.g., all four i16 inputs equal to -32768, giving
+sum = 2^31) wraps in i32, matching hardware. PMADDUBSW uses sadd_sat for
+saturating addition,
+matching hardware. Undef-element propagation (whole-vector undef → zero per
+policy) matches code comment.
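
The two edge cases can be reproduced with plain integer arithmetic. The Python sketch below models one output element of each instruction; `wrap_i32`, `sat_i16`, `pmaddwd_pair`, and `pmaddubsw_pair` are hypothetical helper names, and the overflow example uses -32768 (the most negative i16) for all four inputs.

```python
def wrap_i32(x):
    """Wrap an integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def sat_i16(x):
    """Clamp an integer to the signed 16-bit range."""
    return max(-32768, min(32767, x))


def pmaddwd_pair(a0, b0, a1, b1):
    """One PMADDWD output: two i16 x i16 products summed, wrapping in i32."""
    return wrap_i32(a0 * b0 + a1 * b1)


def pmaddubsw_pair(u0, s0, u1, s1):
    """One PMADDUBSW output: u8 x i8 products summed with signed saturation."""
    return sat_i16(u0 * s0 + u1 * s1)
```

With all inputs at -32768 the PMADDWD sum is 2^31, which wraps to -2^31, while PMADDUBSW clamps instead of wrapping.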
+
+## 7. simplifyX86pmulh (lines 499-555) PMULH, PMULHU, PMULHRSW
+
+The `LShr(Mul, 14)` + `Trunc i18` trick for PMULHRSW correctly preserves sign
+behavior via integer wraparound — verified manually for several negative inputs
+(-1, -16384, -16385) giving the same low-16 result as hardware arithmetic
+shift. The m_One signed/unsigned paths produce correct AShr by 15 / zero.
+m_One does NOT match `<1, ..., undef, undef>` mixed vectors because undef ≠
+poison and getSplatValue(false) requires exact match — verified.
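
The manual check for PMULHRSW can be repeated mechanically. The Python sketch below compares a reference that uses arithmetic shifts with a model of the logical-shift sequence. The exact sequence (wrapping i32 multiply, `lshr` by 14, truncate to i18, add 1, `lshr` by 1, truncate to i16) is an assumption reconstructed from the description above, and both function names are hypothetical.

```python
def pmulhrsw_reference(x, y):
    """Reference: (((x * y) >> 14) + 1) >> 1 with arithmetic shifts, low 16 bits."""
    return ((((x * y) >> 14) + 1) >> 1) & 0xFFFF


def pmulhrsw_via_i18(x, y):
    """Assumed fold sequence using only logical shifts and truncation."""
    prod = (x * y) & 0xFFFFFFFF   # i32 multiply, wrapping
    t = (prod >> 14) & 0x3FFFF    # lshr 14, then truncate to i18
    t = (t + 1) & 0x3FFFF         # rounding add, wrapping in i18
    return (t >> 1) & 0xFFFF      # lshr 1, then truncate to i16


# Signed i16 edge values, including the negative inputs mentioned above.
EDGE_VALUES = [-32768, -16385, -16384, -1, 0, 1, 16383, 16384, 32767]
```

Both functions return the result as an unsigned 16-bit pattern, so they can be compared directly over every pair in `EDGE_VALUES`.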
+
+## 8. simplifyX86FPMaxMin (lines 1737-1783)
+
+The Forbidden0/Forbidden1 with NaN|Inf|Subnormal (+NegZero on Arg1 for max,
+Arg0 for min) correctly handles all x86_max/min vs LLVM maxnum/minnum
+differences. Subnormal forbidden is needed for DAZ-input case (subnormal → 0
+on input flushes the comparison). Confirmed equivalence in each case.
+
+## 9. simplifyX86insertps (lines 1785-1840) - INSERTPS with ZMask & arg0==arg1
+
+The "shuffle with zero vector" path correctly handles both arg0==arg1 (where
+arg0[SourceLane] == arg1[SourceLane]) and the case where ZMask zeros out the
+destination lane (in which case the inserted value is immediately overridden
+to 0 in the ZMask loop). Verified by tracing through all branch combinations.
+
+## 10. simplifyX86pack (lines 433-497) PACKSS/PACKUS
+
+Saturation logic: PACKSS uses signed clamp [SIntMin, SIntMax] of dst type.
+PACKUS uses [0, UIntMax of dst type]. Both use signed comparisons for input
+clamping (matching hardware semantics where negative→0 for unsigned and
+input>maxint→maxint). Per-lane pack mask: PackMask shuffles in
+(X[lo..hi], Y[lo..hi]) per 128-bit lane, then truncates. Correct.
+
+## 11. simplifyX86VPERMMask demand-bits (lines 2186-2199)
+
+IdxSizeInBits = Log2_32(IsBinary ? 2*NumElts : NumElts). For permvar_qi_512
+(NumElts=64): bottom 6 bits demanded. For vpermi2var_qi_512 (binary, NumElts=
+64): bottom 7 bits demanded. Cross-checked against simplifyX86vpermv masking
+(`Index &= Size - 1`) and simplifyX86vpermv3 (`Index &= 2*Size - 1`).
+Consistent with hardware.
+
+## 12. PCLMULQDQ demand-elt (lines 2765-2807)
+
+DemandedElts1 = getSplat(VWidth, APInt(2, bit_for_op1)) where bit_for_op1 is
+01 (low qword) or 10 (high qword). Splatting a 2-bit value across VWidth
+yields the per-128-bit-lane pattern. For 256-bit: 0101 or 1010. For 512-bit:
+01010101 or 10101010. Correct (per-lane qword selection).
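
The splat arithmetic is easy to check by hand or with a few lines of Python. In this sketch `splat` is a hypothetical helper that only mimics repeating a narrow bit pattern across a wider mask; the width is the number of qword elements in the vector.

```python
def splat(width, pattern, pattern_bits):
    """Repeat a pattern_bits-wide pattern until it fills `width` bits."""
    assert width % pattern_bits == 0
    out = 0
    for shift in range(0, width, pattern_bits):
        out |= pattern << shift
    return out
```

A 256-bit vector has four qword elements, so `splat(4, 0b01, 2)` gives the low-qword mask and `splat(4, 0b10, 2)` the high-qword mask.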
+
+## 13. simplifyX86movmsk (lines 611-640)
+
+For 4-doubles → i32 result: NumElts=4, IntegerTy=i4, bitcast through <4 x i64>,
+isneg, bitcast to i4, zext to i32. Correct (bits [3:0] of i32 hold sign bits).
+Similar for all sizes (16, 32 bytes; 4 floats, 8 floats; 2 doubles, 4 doubles).
+
+## 14. BMI BEXTR/BZHI/PEXT/PDEP folds (lines 2212-2349)
+
+- BEXTR: Length=0 or Shift>=BitWidth → zero. Constant fold uses
+  `(Src >> Shift) & maskTrailingOnes(min(Length, BitWidth))`. Correct.
+- BZHI: Index>=BitWidth → Arg0 pass-through (matches Intel SDM: "if index >=
+  operand size, DEST = SRC"). Index=0 → zero. Constant fold correct.
+- PEXT shifted-mask: (Input & mask) >> MaskIdx is correct expression for
+  shifted-mask layout.
+- PDEP shifted-mask: (Input << MaskIdx) & mask correctly deposits Input's low
+  bits at mask positions.
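
The four folds above can be cross-checked against bit-by-bit reference models. The Python below is a sketch, not the LLVM code: all function names are hypothetical, the control layout for BEXTR (shift in bits 7:0, length in bits 15:8) and the use of the low 8 bits of the BZHI index are stated here as assumptions, and a "shifted mask" means a single contiguous run of ones starting at `mask_idx`.

```python
def bextr(src, control, bits=32):
    """BEXTR model: shift = control[7:0], length = control[15:8] (assumed)."""
    shift = control & 0xFF
    length = (control >> 8) & 0xFF
    if length == 0 or shift >= bits:
        return 0
    return (src >> shift) & ((1 << min(length, bits)) - 1)


def bzhi(src, index, bits=32):
    """BZHI model: zero bits at positions >= index; pass through if index >= bits."""
    index &= 0xFF
    return src if index >= bits else src & ((1 << index) - 1)


def pext_ref(src, mask):
    """Reference PEXT: gather the src bits selected by mask into the low bits."""
    out, k, bit = 0, 0, 0
    while mask >> bit:
        if (mask >> bit) & 1:
            out |= ((src >> bit) & 1) << k
            k += 1
        bit += 1
    return out


def pdep_ref(src, mask):
    """Reference PDEP: scatter the low bits of src to the positions set in mask."""
    out, k, bit = 0, 0, 0
    while mask >> bit:
        if (mask >> bit) & 1:
            out |= ((src >> k) & 1) << bit
            k += 1
        bit += 1
    return out


def pext_shifted_mask(src, mask, mask_idx):
    return (src & mask) >> mask_idx


def pdep_shifted_mask(src, mask, mask_idx):
    return (src << mask_idx) & mask
```

For a contiguous mask such as `0x0FF0` with `mask_idx = 4`, the shortcut expressions agree with the reference loops for any input.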
+
+## 15. SSE4A EXTRQ/EXTRQI/INSERTQ/INSERTQI (lines 1842-2021)
+
+Byte-aligned (Length%8==0 && Index%8==0) → byte shuffle. Constant-fold of
+bit field extraction/insertion using APInt arithmetic. Index+Length>64 →
+undef (matches AMD spec "results undefined"). Length=0 → Length=64 (matches
+AMD spec "field length 0 means 64"). All correct.
+
+## Summary
+
+About 12 minutes of analysis covered the major fold paths in the file. No
+reproducible miscompile was found, and every simplification examined matches
+the hardware semantics described above. The Python ternlog verifier confirms
+all 256 table entries.
+
+Areas where I could not find issues but also could not exhaustively cover:
+- Per-element undef propagation in PMULH (m_One mixed-undef does not match,
+  so that path is safe).
+- KnownBits-based folding for variable shifts when shift amounts come from
+  complex computations (would require runtime testing).
+- AVX-512 mask-register operations are handled in target-independent IR
+  (plain `and i16` etc.), not in this file.
+
+## Confidence
+
+No new candidates submitted.
@@ -0,0 +1,54 @@
+file: llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp:2683-2708
+(partial-store-merging branch in eliminateDeadStores)
+
+When `OR == OW_PartialEarlierWithFullLater` and partial-store-merging
+is enabled (default), DSE folds the killing store's bytes into the
+dead store via `tryToMergePartialOverlappingStores`. The dead store's
+stored constant is updated with `DeadSI->setOperand(0, Merged)` and
+the killing store is removed with `deleteDeadInstruction(KillingSI)`.
+
+The killing store's metadata is discarded entirely. If the killing
+store carries `!nontemporal`, the user's nontemporal hint for that
+slice of memory is silently dropped from the merged store.
+
+Reproducer:
+
+  target triple = "x86_64-unknown-linux-gnu"
+
+  define void @f(ptr %p) {
+  entry:
+    store i64 0, ptr %p, align 8                       ; dead
+    store i32 -1, ptr %p, align 4, !nontemporal !0     ; killing (nontemporal)
+    ret void
+  }
+
+  !0 = !{i32 1}
+
+opt -passes=dse output:
+
+  define void @f(ptr %p) {
+  entry:
+    store i64 4294967295, ptr %p, align 8
+    ret void
+  }
+
+llc -mtriple=x86_64-- -mattr=+sse4.1 codegen diff:
+
+  Without DSE:
+    movq    $0, (%rdi)
+    movl    $-1, %eax
+    movntil %eax, (%rdi)        ; nontemporal write of bytes 0..3
+  With DSE:
+    movl    $4294967295, %eax
+    movq    %rax, (%rdi)        ; plain temporal write, no nontemporal
+
+The merged 8-byte temporal store is observably different: the
+low 4 bytes were requested as nontemporal (write-combining /
+cache-bypass), but the resulting code uses a regular store.
+
+Fix: before performing the merge, refuse if the killing store has
+metadata that cannot survive being attached to the (wider) dead
+store unchanged, in particular MD_nontemporal. Alternatively,
+propagate MD_nontemporal to DeadSI when KillingSI has it (and
+require all merged-in killing stores to agree). Same care is needed
+for MD_invariant_group, MD_alias_scope/noalias, and MD_tbaa.
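
The merged constant in the reproducer above can be checked by hand: on a little-endian target, overwriting bytes 0..3 of an i64 zero with an i32 -1 gives 0x00000000FFFFFFFF, which is 4294967295. The Python sketch below does that arithmetic; `merge_partial_store` is a hypothetical helper, not the LLVM function, and it assumes little-endian byte order.

```python
def merge_partial_store(wide_value, wide_bytes, narrow_value, narrow_bytes,
                        byte_offset):
    """Fold a narrower store into a wider constant (little-endian assumed).

    byte_offset is where the narrow store starts inside the wide one.
    """
    assert byte_offset + narrow_bytes <= wide_bytes
    wide_mask = (1 << (8 * wide_bytes)) - 1
    narrow_mask = (1 << (8 * narrow_bytes)) - 1
    shift = 8 * byte_offset
    # Clear the overwritten bytes, then insert the narrow value there.
    cleared = wide_value & ~(narrow_mask << shift)
    return (cleared | ((narrow_value & narrow_mask) << shift)) & wide_mask
```

The same helper shows that a nontemporal store to bytes 4..7 would merge into the high half instead, so the lost hint can apply to any slice of the wider store.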