| 1 | +# worker-82 investigation notes (2026-05-21) |
| 2 | + |
| 3 | +No confirmed reproducible miscompiles in a ~12-minute window for |
| 4 | +`llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp`. Patterns investigated and |
| 5 | +verified correct (rule-outs to spare future workers re-deriving): |
| 6 | + |
| 7 | +## 1. simplifyTernarylogic (lines 669-1734) |
| 8 | + |
| 9 | +Wrote a Python verifier that parses each of the 256 `case 0xNN:` entries and |
| 10 | +evaluates the expression against the canonical A=0xf0, B=0xcc, C=0xaa truth- |
| 11 | +table constants. ALL entries verify (script: /tmp/w82/verify_ternlog.py). The |
| 12 | +in-source assertion at line 1732 already enforces this at runtime; the table is |
| 13 | +correct by construction. |
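The check described above can be sketched as follows. This is a minimal illustration, not the contents of /tmp/w82/verify_ternlog.py: the `SAMPLE_CASES` table and the helper names are hypothetical, and only four hand-picked immediates are shown rather than the 256 parsed from the source.

```python
# Canonical truth-table constants: bit i of each constant is that input's
# value in truth-table row i, so evaluating an expression on them yields
# the 8-bit immediate the expression implements.
A, B, C = 0xF0, 0xCC, 0xAA


def imm_of(expr):
    """Return the ternlog immediate implemented by expr(a, b, c)."""
    return expr(A, B, C) & 0xFF


# Hypothetical hand-picked entries (assumption: the real table is parsed
# from the `case 0xNN:` labels in the C++ source).
SAMPLE_CASES = {
    0x80: lambda a, b, c: a & b & c,
    0x96: lambda a, b, c: a ^ b ^ c,
    0xCA: lambda a, b, c: (a & b) | (~a & c),  # a ? b : c
    0xFE: lambda a, b, c: a | b | c,
}


def verify(cases):
    """Return the immediates whose expression does not reproduce them."""
    return [imm for imm, expr in cases.items() if imm_of(expr) != imm]


print(verify(SAMPLE_CASES))  # an empty list means every entry verified
```

Masking with 0xFF matters because Python's `~` produces a negative integer; the mask restores the 8-bit view.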
| 14 | + |
| 15 | +## 2. simplifyX86vpermilvar (lines 2068-2113) |
| 16 | + |
| 17 | +For PS: keeps bits [1:0] of each i32 mask element. For PD: shifts right by 1 |
| 18 | +(extracting bit 1 of each i64 element). The +lane_offset adjustment yields |
| 19 | +correct global shuffle indices. Tested all-bits-set mask (yields max index per |
| 20 | +lane), all-zero mask (yields lane base), and "bit 1 vs bit 0" PD variants. All |
| 21 | +results match hardware semantics. The SimplifyDemandedBits calls at 3098 (mask |
| 22 | +0b00011 for PS) and 3111 (mask 0b00010 for PD) are also correct. |
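A small Python model of the index computation described in this section, written as a sketch under the assumption that the control vector is given as a plain list of integers. The function name is hypothetical and does not appear in the LLVM source.

```python
def vpermilvar_indices(mask, is_pd):
    """Map per-element control values to global shuffle indices.

    mask:  list of control elements (i32 values for PS, i64 values for PD).
    is_pd: True for the PD variants, False for PS.
    """
    lane_elts = 2 if is_pd else 4      # elements per 128-bit lane
    out = []
    for i, m in enumerate(mask):
        idx = m & 0b11                 # only the low two bits are ever read
        if is_pd:
            idx >>= 1                  # PD selects with bit 1, not bit 0
        lane_offset = (i // lane_elts) * lane_elts
        out.append(idx + lane_offset)
    return out


print(vpermilvar_indices([0xFFFFFFFF] * 8, is_pd=False))
print(vpermilvar_indices([2, 1, 2, 1], is_pd=True))
```

The second call exercises the "bit 1 vs bit 0" PD case: a control value of 1 has bit 0 set but bit 1 clear, so it selects the lane base.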
| 23 | + |
| 24 | +## 3. simplifyX86pshufb (lines 2024-2065) and demand-bits mask 0b10001111 |
| 25 | + |
| 26 | +Per-lane low-nibble index + sign-bit-zero behavior matches hardware. Demand |
| 27 | +mask 0x8F keeps bits 0,1,2,3,7; hardware never reads bits 4,5,6. Tested with |
| 28 | +0x70 (mid-bits set, no sign) → select index 0; 0xF0 (sign + mid-bits) → zero. |
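The per-byte PSHUFB behavior can be modeled in a few lines. This is a sketch with hypothetical helper names. It also checks that masking the control byte with 0x8F never changes the result, which is the property the demanded-bits mask relies on.

```python
def pshufb_select(ctrl, byte_pos):
    """Source byte index chosen by one control byte, or None when the
    sign bit forces the destination byte to zero."""
    if ctrl & 0x80:
        return None
    lane_base = (byte_pos // 16) * 16   # selection never crosses a 128-bit lane
    return lane_base + (ctrl & 0x0F)


DEMANDED = 0x8F  # bits 0-3 (index) and bit 7 (zeroing)

# Masking off bits 4-6 must never change the outcome.
mismatches = [
    (c, p)
    for c in range(256)
    for p in (0, 17)
    if pshufb_select(c, p) != pshufb_select(c & DEMANDED, p)
]
print(mismatches)
```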
| 29 | + |
| 30 | +## 4. simplifyX86varShift (lines 297-431) - psllv/psrlv/psrav with OOR/undef |
| 31 | + |
| 32 | +Tested constant shift vectors with all-OOR, all-undef, and mixed OOR + in-range |
| 33 | +for arithmetic and logical variants. Arithmetic shifts clamp OOR amounts to |
| 34 | +BitWidth-1 (correct sign-bit splat). Logical shifts return zero or bail on |
| 35 | +mixed cases (correct). The lambda `OutOfRange = [Idx<0 || BitWidth<=Idx]` is |
| 36 | +correct for both branches given how arithmetic clamps to BitWidth-1. |
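A per-element reference model of the out-of-range rules described in this section, as a sketch. The function names mirror the instruction mnemonics but are hypothetical helpers, and the shift amount is passed as a Python integer so that the `Idx<0` half of the predicate can be exercised too.

```python
def to_signed(x, bits):
    """Reinterpret the low `bits` bits of x as a two's-complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def out_of_range(amt, bits):
    return amt < 0 or bits <= amt


def psrav(x, amt, bits=32):
    """Arithmetic right shift: out-of-range amounts clamp to bits - 1."""
    if out_of_range(amt, bits):
        amt = bits - 1
    return (to_signed(x, bits) >> amt) & ((1 << bits) - 1)


def psrlv(x, amt, bits=32):
    """Logical right shift: out-of-range amounts give zero."""
    if out_of_range(amt, bits):
        return 0
    return (x & ((1 << bits) - 1)) >> amt


def psllv(x, amt, bits=32):
    """Logical left shift: out-of-range amounts give zero."""
    if out_of_range(amt, bits):
        return 0
    return (x << amt) & ((1 << bits) - 1)


print(hex(psrav(0x80000000, 100)), hex(psrlv(0x80000000, 100)))
```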
| 37 | + |
| 38 | +## 5. simplifyX86immShift constant-vector path (lines 247-291) |
| 39 | + |
| 40 | +For PSLLW, the low 64 bits of the shift vector are read as a 64-bit count. The |
| 41 | +code correctly concatenates elements [0..3] for i16 lanes, [0..1] for i32 lanes, |
| 42 | +and uses [0] alone for i64 lanes. Verified that Count.uge(BitWidth) handles all |
| 43 | +out-of-range cases (returns 0 for logical, AShr by BitWidth-1 for arithmetic). |
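The element concatenation can be illustrated with a short sketch. `shift_count` is a hypothetical helper, and elements are assumed to be given as unsigned integers with element 0 least significant.

```python
def shift_count(elts, elt_bits):
    """Concatenate the low 64 bits of a shift vector into a single count.

    elts:     unsigned element values, element 0 being least significant.
    elt_bits: element width in bits (16, 32 or 64).
    """
    elt_mask = (1 << elt_bits) - 1
    count = 0
    for i in range(64 // elt_bits):
        count |= (elts[i] & elt_mask) << (i * elt_bits)
    return count


# i16 lanes: elements [0..3] form the count, the upper four are ignored.
print(shift_count([3, 0, 0, 0, 7, 7, 7, 7], 16))
# A nonzero element 1 pushes the count far past the 16-bit lane width.
print(hex(shift_count([3, 1, 0, 0, 0, 0, 0, 0], 16)))
```

The second call shows why the comparison has to be made on the full 64-bit count: looking at element 0 alone would give 3, which is in range.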
| 44 | + |
| 45 | +## 6. simplifyX86pmadd (lines 557-609) PMADDWD and PMADDUBSW |
| 46 | + |
| 47 | +PMADDWD signed-overflow case (all four i16 inputs = -32768, sum = 2^31) wraps |
| 48 | +in i32, matching hardware. PMADDUBSW uses sadd_sat for saturating addition, |
| 49 | +matching hardware. Undef-element propagation (whole-vector undef → zero per |
| 50 | +policy) matches code comment. |
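A scalar sketch of one output element of each instruction, to make the wrap-versus-saturate distinction concrete. The helper names are hypothetical. The overflow case uses -32768 (bit pattern 0x8000) for all four i16 inputs.

```python
def wrap_i32(x):
    """Wrap an integer to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def sat_i16(x):
    """Saturate an integer to the signed 16-bit range."""
    return max(-32768, min(32767, x))


def pmaddwd_pair(a0, b0, a1, b1):
    """One i32 result: sum of two signed i16 x i16 products, wrapping."""
    return wrap_i32(a0 * b0 + a1 * b1)


def pmaddubsw_pair(u0, s0, u1, s1):
    """One i16 result: unsigned i8 x signed i8 products, saturating add."""
    return sat_i16(u0 * s0 + u1 * s1)


print(pmaddwd_pair(-32768, -32768, -32768, -32768))  # 2^31 wraps around
print(pmaddubsw_pair(255, 127, 255, 127))            # saturates high
```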
| 51 | + |
| 52 | +## 7. simplifyX86pmulh (lines 499-555) PMULH, PMULHU, PMULHRSW |
| 53 | + |
| 54 | +The `LShr(Mul, 14)` + `Trunc i18` trick for PMULHRSW correctly preserves sign |
| 55 | +behavior via integer wraparound. Verified manually for several negative inputs |
| 56 | +(-1, -16384, -16385): each gives the same low-16 result as a hardware arithmetic |
| 57 | +shift. The m_One signed/unsigned paths produce correct AShr by 15 / zero. |
| 58 | +m_One does NOT match `<1, ..., undef, undef>` mixed vectors because undef ≠ |
| 59 | +poison and getSplatValue(false) requires exact match (verified). |
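The manual PMULHRSW check can be reproduced with a short script. This is a sketch: `pmulhrsw_ref` follows the usual definition (bits [16:1] of `((a*b) >> 14) + 1` with an arithmetic shift), while `pmulhrsw_i18` models an assumed sequence of mul, logical shift right by 14, truncate to i18, add 1, logical shift right by 1, truncate to i16. Both function names are hypothetical.

```python
def pmulhrsw_ref(a, b):
    """Reference: arithmetic shifts on the signed 32-bit product."""
    return ((((a * b) >> 14) + 1) >> 1) & 0xFFFF


def pmulhrsw_i18(a, b):
    """Logical-shift variant that keeps only 18 bits after the first shift."""
    i18_mask = (1 << 18) - 1
    mul = (a * b) & 0xFFFFFFFF    # i32 product as an unsigned bit pattern
    mul >>= 14                    # logical shift right
    mul &= i18_mask               # trunc to i18
    mul = (mul + 1) & i18_mask    # add wraps within 18 bits
    mul >>= 1                     # logical shift right
    return mul & 0xFFFF           # trunc to i16


samples = [-32768, -16385, -16384, -1, 0, 1, 16384, 32767]
mismatches = [
    (a, b)
    for a in samples
    for b in range(-32768, 32768, 257)
    if pmulhrsw_ref(a, b) != pmulhrsw_i18(a, b)
]
print(mismatches)
```

The two agree because bits 14..31 of the product are identical under logical and arithmetic shifts, and only bits 1..16 of the incremented 18-bit value reach the result.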
| 60 | + |
| 61 | +## 8. simplifyX86FPMaxMin (lines 1737-1783) |
| 62 | + |
| 63 | +The Forbidden0/Forbidden1 with NaN|Inf|Subnormal (+NegZero on Arg1 for max, |
| 64 | +Arg0 for min) correctly handles all x86_max/min vs LLVM maxnum/minnum |
| 65 | +differences. Subnormal is forbidden because of the DAZ case: a subnormal input |
| 66 | +is treated as 0, which changes the comparison. Confirmed equivalence in each case. |
| 67 | + |
| 68 | +## 9. simplifyX86insertps (lines 1785-1840) - INSERTPS with ZMask & arg0==arg1 |
| 69 | + |
| 70 | +The "shuffle with zero vector" path correctly handles both arg0==arg1 (where |
| 71 | +arg0[SourceLane] == arg1[SourceLane]) and the case where ZMask zeros out the |
| 72 | +destination lane (in which case the inserted value is immediately overridden |
| 73 | +to 0 in the ZMask loop). Verified by tracing through all branch combinations. |
| 74 | + |
| 75 | +## 10. simplifyX86pack (lines 433-497) PACKSS/PACKUS |
| 76 | + |
| 77 | +Saturation logic: PACKSS uses signed clamp [SIntMin, SIntMax] of dst type. |
| 78 | +PACKUS uses [0, UIntMax of dst type]. Both use signed comparisons for input |
| 79 | +clamping (matching hardware semantics where negative→0 for unsigned and |
| 80 | +input>maxint→maxint). Per-lane pack mask: PackMask shuffles in |
| 81 | +(X[lo..hi], Y[lo..hi]) per 128-bit lane, then truncates. Correct. |
| 82 | + |
| 83 | +## 11. simplifyX86VPERMMask demand-bits (lines 2186-2199) |
| 84 | + |
| 85 | +IdxSizeInBits = Log2_32(IsBinary ? 2*NumElts : NumElts). For permvar_qi_512 |
| 86 | +(NumElts=64): bottom 6 bits demanded. For vpermi2var_qi_512 (binary, NumElts= |
| 87 | +64): bottom 7 bits demanded. Cross-checked against simplifyX86vpermv masking |
| 88 | +(`Index &= Size - 1`) and simplifyX86vpermv3 (`Index &= 2*Size - 1`). |
| 89 | +Consistent with hardware. |
| 90 | + |
| 91 | +## 12. PCLMULQDQ demand-elt (lines 2765-2807) |
| 92 | + |
| 93 | +DemandedElts1 = getSplat(VWidth, APInt(2, bit_for_op1)) where bit_for_op1 is |
| 94 | +01 (low qword) or 10 (high qword). Splatting a 2-bit value across VWidth |
| 95 | +yields the per-128-bit-lane pattern. For 256-bit: 0101 or 1010. For 512-bit: |
| 96 | +01010101 or 10101010. Correct (per-lane qword selection). |
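The splat patterns listed above can be reproduced with a tiny helper. This is a sketch: `splat_pattern` is a hypothetical stand-in for the splat operation, with the width counted in i64 elements (2, 4 or 8).

```python
def splat_pattern(width, pattern, pattern_bits=2):
    """Repeat a pattern_bits-wide pattern until it fills `width` bits."""
    out = 0
    for pos in range(0, width, pattern_bits):
        out |= pattern << pos
    return out & ((1 << width) - 1)


for vwidth in (2, 4, 8):  # 128-, 256- and 512-bit vectors of i64
    low = splat_pattern(vwidth, 0b01)
    high = splat_pattern(vwidth, 0b10)
    print(vwidth, format(low, f"0{vwidth}b"), format(high, f"0{vwidth}b"))
```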
| 97 | + |
| 98 | +## 13. simplifyX86movmsk (lines 611-640) |
| 99 | + |
| 100 | +For 4-doubles → i32 result: NumElts=4, IntegerTy=i4, bitcast through <4 x i64>, |
| 101 | +isneg, bitcast to i4, zext to i32. Correct (bits [3:0] of i32 hold sign bits). |
| 102 | +Similar for all sizes (16, 32 bytes; 4 floats, 8 floats; 2 doubles, 4 doubles). |
| 103 | + |
| 104 | +## 14. BMI BEXTR/BZHI/PEXT/PDEP folds (lines 2212-2349) |
| 105 | + |
| 106 | +- BEXTR: Length=0 or Shift>=BitWidth → zero. Constant fold uses |
| 107 | + `(Src >> Shift) & maskTrailingOnes(min(Length, BitWidth))`. Correct. |
| 108 | +- BZHI: Index>=BitWidth → Arg0 pass-through (matches Intel SDM: "if index >= |
| 109 | + operand size, DEST = SRC"). Index=0 → zero. Constant fold correct. |
| 110 | +- PEXT shifted-mask: (Input & mask) >> MaskIdx is correct expression for |
| 111 | + shifted-mask layout. |
| 112 | +- PDEP shifted-mask: (Input << MaskIdx) & mask correctly deposits Input's low |
| 113 | + bits at mask positions. |
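The four folds listed above can be cross-checked against bit-by-bit reference models. The following is a sketch with hypothetical helper names, assuming 32-bit operands passed as unsigned Python integers.

```python
def pext(src, mask, bits=32):
    """Gather the src bits selected by mask into the low bits of the result."""
    out, k = 0, 0
    for i in range(bits):
        if (mask >> i) & 1:
            out |= ((src >> i) & 1) << k
            k += 1
    return out


def pdep(src, mask, bits=32):
    """Scatter the low bits of src to the positions selected by mask."""
    out, k = 0, 0
    for i in range(bits):
        if (mask >> i) & 1:
            out |= ((src >> k) & 1) << i
            k += 1
    return out


def bextr(src, shift, length, bits=32):
    if length == 0 or shift >= bits:
        return 0
    return (src >> shift) & ((1 << min(length, bits)) - 1)


def bzhi(src, index, bits=32):
    if index >= bits:
        return src                      # pass-through
    return src & ((1 << index) - 1)


# Shifted mask: one contiguous run of ones starting at bit MASK_IDX.
MASK, MASK_IDX = 0x00FF0000, 16
x = 0xDEADBEEF
print(hex(pext(x, MASK)), hex((x & MASK) >> MASK_IDX))
print(hex(pdep(x, MASK)), hex((x << MASK_IDX) & MASK))
```

The shift-and-mask forms only hold for contiguous masks; for a sparse mask such as 0x0F0F the reference loops are still valid but the shortcut expressions are not.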
| 114 | + |
| 115 | +## 15. SSE4A EXTRQ/EXTRQI/INSERTQ/INSERTQI (lines 1842-2021) |
| 116 | + |
| 117 | +Byte-aligned (Length%8==0 && Index%8==0) → byte shuffle. Constant-fold of |
| 118 | +bit field extraction/insertion using APInt arithmetic. Index+Length>64 → |
| 119 | +undef (matches AMD spec "results undefined"). Length=0 → Length=64 (matches |
| 120 | +AMD spec "field length 0 means 64"). All correct. |
| 121 | + |
| 122 | +## Summary |
| 123 | + |
| 124 | +The ~12 minutes of analysis covered the major fold paths in the file. No |
| 125 | +reproducible miscompile was found, and every simplification examined matched |
| 126 | +the hardware semantics it models. The Python ternlog verifier confirms that all |
| 127 | +256 table entries evaluate to their case immediates. |
| 128 | + |
| 129 | +Areas where I could not find issues but also could not exhaustively cover: |
| 130 | +- Per-element undef propagation in PMULH (m_One mixed-undef does not match, |
| 131 | + so that path is safe). |
| 132 | +- KnownBits-based folding for variable shifts when shift amounts come from |
| 133 | + complex computations (would require runtime testing). |
| 134 | +- AVX-512 mask-register operations are handled in target-independent IR |
| 135 | + (plain `and i16` etc.), not in this file. |
| 136 | + |
| 137 | +## Confidence |
| 138 | + |
| 139 | +No new candidates submitted. |