Use an up-to-date LLVM by relying on an external llc. #3162

Merged: maleadt merged 1 commit into main from tb/external_llc on Jun 4, 2026

maleadt (Member) commented Jun 4, 2026

No description provided.

codecov (Bot) commented Jun 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.32%. Comparing base (81d7397) to head (b2a389e).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3162   +/-   ##
=======================================
  Coverage   16.32%   16.32%           
=======================================
  Files         124      124           
  Lines        9875     9875           
=======================================
  Hits         1612     1612           
  Misses       8263     8263           

github-actions (Bot, Contributor) left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current: b2a389e | Previous: 81d7397 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 99930 ns | 99264 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 76181 ns | 75495 ns | 1.01 |
| array/accumulate/Float32/dims=1L | 1596372 ns | 1585711 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 141346 ns | 141086 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 654489 ns | 653695 ns | 1.00 |
| array/accumulate/Int64/1d | 118532 ns | 117278 ns | 1.01 |
| array/accumulate/Int64/dims=1 | 79605 ns | 79261 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1708585 ns | 1700035 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 154630 ns | 150883 ns | 1.02 |
| array/accumulate/Int64/dims=2L | 960074 ns | 959294 ns | 1.00 |
| array/broadcast | 18560 ns | 18328 ns | 1.01 |
| array/construct | 1208.7 ns | 1190.1 ns | 1.02 |
| array/copy | 16932 ns | 16650 ns | 1.02 |
| array/copyto!/cpu_to_gpu | 215028 ns | 212593 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 281151 ns | 280762 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10633 ns | 10344 ns | 1.03 |
| array/iteration/findall/bool | 134130 ns | 131321 ns | 1.02 |
| array/iteration/findall/int | 147788 ns | 144983 ns | 1.02 |
| array/iteration/findfirst/bool | 112584 ns | 68482 ns | 1.64 |
| array/iteration/findfirst/int | 112793 ns | 70611 ns | 1.60 |
| array/iteration/findmin/1d | 68493 ns | 64808 ns | 1.06 |
| array/iteration/findmin/2d | 101107 ns | 100918 ns | 1.00 |
| array/iteration/logical | 193020 ns | 189296 ns | 1.02 |
| array/iteration/scalar | 66592 ns | 65562 ns | 1.02 |
| array/permutedims/2d | 49476 ns | 49741 ns | 0.99 |
| array/permutedims/3d | 50355 ns | 50479 ns | 1.00 |
| array/permutedims/4d | 50481 ns | 50560 ns | 1.00 |
| array/random/rand/Float32 | 12216 ns | 11506 ns | 1.06 |
| array/random/rand/Int64 | 23999 ns | 24008 ns | 1.00 |
| array/random/rand!/Float32 | 8012.666666666667 ns | 8226.333333333334 ns | 0.97 |
| array/random/rand!/Int64 | 18798 ns | 20629 ns | 0.91 |
| array/random/randn/Float32 | 35588 ns | 35503 ns | 1.00 |
| array/random/randn!/Float32 | 24028 ns | 23995 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 33948 ns | 33262 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1 | 38244 ns | 38127 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 50348 ns | 50245 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 55691 ns | 55632 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 67478 ns | 67294 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 39806 ns | 38973 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=1 | 41326 ns | 40807 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 86625 ns | 86255 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57998 ns | 58105 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 83141 ns | 82675 ns | 1.01 |
| array/reductions/reduce/Float32/1d | 33851 ns | 33096 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 38543 ns | 38054 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 50188 ns | 50467 ns | 0.99 |
| array/reductions/reduce/Float32/dims=2 | 55738 ns | 55386 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 69128 ns | 67642 ns | 1.02 |
| array/reductions/reduce/Int64/1d | 39495 ns | 38841 ns | 1.02 |
| array/reductions/reduce/Int64/dims=1 | 41166 ns | 40658 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 86691 ns | 86419 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 58031 ns | 57814 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82660 ns | 82725 ns | 1.00 |
| array/reverse/1d | 15967 ns | 16783 ns | 0.95 |
| array/reverse/1dL | 67821 ns | 67518 ns | 1.00 |
| array/reverse/1dL_inplace | 65311 ns | 65186 ns | 1.00 |
| array/reverse/1d_inplace | 8397 ns | 8208 ns | 1.02 |
| array/reverse/2d | 20190 ns | 19927 ns | 1.01 |
| array/reverse/2dL | 71908 ns | 71784 ns | 1.00 |
| array/reverse/2dL_inplace | 65204 ns | 65074 ns | 1.00 |
| array/reverse/2d_inplace | 9648 ns | 9571 ns | 1.01 |
| array/sorting/1d | 2651128 ns | 2723816 ns | 0.97 |
| array/sorting/2d | 1040488 ns | 1065468 ns | 0.98 |
| array/sorting/by | 3193815 ns | 3266956 ns | 0.98 |
| cuda/synchronization/context/auto | 1142 ns | 1139.2 ns | 1.00 |
| cuda/synchronization/context/blocking | 909.5526315789474 ns | 911.4444444444445 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 6082 ns | 5970.4 ns | 1.02 |
| cuda/synchronization/stream/auto | 996.4545454545455 ns | 1010.8 ns | 0.99 |
| cuda/synchronization/stream/blocking | 810.7263157894737 ns | 824.0833333333334 ns | 0.98 |
| cuda/synchronization/stream/nonblocking | 5990.714285714285 ns | 6016 ns | 1.00 |
| integration/byval/reference | 143418 ns | 143217 ns | 1.00 |
| integration/byval/slices=1 | 145491 ns | 145187 ns | 1.00 |
| integration/byval/slices=2 | 283990 ns | 283773 ns | 1.00 |
| integration/byval/slices=3 | 422462 ns | 422013 ns | 1.00 |
| integration/cudadevrt | 101813 ns | 101637 ns | 1.00 |
| integration/volumerhs | 9072804 ns | 8886664 ns | 1.02 |
| kernel/indexing | 12687 ns | 12614 ns | 1.01 |
| kernel/indexing_checked | 13504 ns | 13361 ns | 1.01 |
| kernel/launch | 2093.777777777778 ns | 2137.777777777778 ns | 0.98 |
| kernel/occupancy | 691.6041666666666 ns | 759.3831775700935 ns | 0.91 |
| kernel/rand | 16548 ns | 14411 ns | 1.15 |
| latency/import | 3840684761 ns | 3875005359 ns | 0.99 |
| latency/precompile | 4625464529 ns | 4629977402 ns | 1.00 |
| latency/ttfp | 4486224788 ns | 4513776589 ns | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

maleadt merged commit aa47d7a into main on Jun 4, 2026 (2 checks passed)
maleadt deleted the tb/external_llc branch on June 4, 2026 17:03
maleadt (Member, Author) commented Jun 5, 2026

Quite some performance regressions here. Looking into them.
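
For orientation, the Ratio column in the benchmark comment is the current timing divided by the previous one, so values above 1 are slowdowns. The sketch below recomputes it for a handful of rows copied from that table and keeps those above a cutoff. The 1.10 cutoff and the choice of rows are assumptions made for illustration; the thread does not say which entries were investigated.

```julia
# (name, current ns, previous ns), copied from the benchmark table.
rows = [
    ("array/iteration/findfirst/bool", 112584.0, 68482.0),
    ("array/iteration/findfirst/int",  112793.0, 70611.0),
    ("array/iteration/findmin/1d",      68493.0, 64808.0),
    ("array/random/rand!/Int64",        18798.0, 20629.0),
    ("kernel/rand",                     16548.0, 14411.0),
]

# Assumed cutoff: treat anything more than 10% slower as a regression.
threshold = 1.10

# Keep (name, ratio) pairs whose current/previous ratio exceeds the cutoff.
regressions = [(name, round(cur / prev; digits = 2))
               for (name, cur, prev) in rows if cur / prev > threshold]

for (name, ratio) in regressions
    println(name, ": ", ratio, "x")
end
```

With these rows, the two `findfirst` benchmarks and `kernel/rand` pass the cutoff, while `findmin/1d` (1.06) and `rand!/Int64` (0.91) do not.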

maleadt (Member, Author) commented Jun 5, 2026

Looks like an LLVM regression with respect to small/under-aligned allocas. Workaround in GPUCompiler:

diff --git a/src/ptx.jl b/src/ptx.jl
index 7c4ea5c..ffd178e 100644
--- a/src/ptx.jl
+++ b/src/ptx.jl
@@ -343,9 +343,49 @@ function finish_ir!(@nospecialize(job::CompilerJob{PTXCompilerTarget}),
         end
     end

+    raise_alloca_alignment!(mod)
+
     return entry
 end

+# raise the alignment of under-aligned allocas
+#
+# SelectionDAG normally raises the alignment of under-aligned stack objects when
+# expanding small memcpys (`DstAlignCanChange` in `getMemcpyLoadsAndStores`), but that
+# requires the destination to be a bare `FrameIndex` node. The NVPTX back-end wraps
+# allocas in `addrspacecast`s (NVPTXLowerAlloca), hiding the frame object, so the
+# expansion has to honor the IR-level alignment. That lowers e.g. the 7-byte padding
+# copies SROA generates for non-power-of-two aggregates to byte-wise loads and stores.
+# Compensate by raising the alignment of small allocas ourselves before codegen.
+function raise_alloca_alignment!(mod::LLVM.Module)
+    changed = false
+    @tracepoint "raise alloca alignment" begin
+
+    dl = datalayout(mod)
+    for f in functions(mod), bb in blocks(f), inst in instructions(bb)
+        inst isa LLVM.AllocaInst || continue
+
+        # only static allocas
+        array_size = operands(inst)[1]
+        array_size isa LLVM.ConstantInt || continue
+
+        alloca_type = LLVMType(LLVM.API.LLVMGetAllocatedType(inst))
+        size = abi_size(dl, alloca_type) * convert(Int, array_size)
+        size > 0 || continue
+
+        # match what `DstAlignCanChange` would have done: align to the largest
+        # power-of-two memory access this object could be copied with.
+        align = min(nextpow(2, size), 16)
+        if alignment(inst) < align
+            alignment!(inst, align)
+            changed = true
+        end
+    end
+
+    end
+    return changed
+end
+
 @unlocked function mcgen(@nospecialize(job::CompilerJob{PTXCompilerTarget}),
                          mod::LLVM.Module, format=LLVM.API.LLVMAssemblyFile)
     if !isavailable(NVPTX_LLVM_Backend_jll) || !NVPTX_LLVM_Backend_jll.is_available()

Let's first try to upstream this though: llvm/llvm-project#201772
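
To make the alignment rule in the diff concrete, here is a minimal sketch of the same arithmetic in plain Julia, with no LLVM.jl involved. `target_alignment` is a hypothetical helper name used only for illustration; it is not part of GPUCompiler. It mirrors the `align = min(nextpow(2, size), 16)` line and, like the diff, assumes `size > 0`.

```julia
# Hypothetical helper (not part of GPUCompiler): round the object size in bytes
# up to the next power of two, capped at 16 bytes, as in the diff.
target_alignment(size::Integer; max_align::Integer = 16) =
    min(nextpow(2, size), max_align)

# Sizes in bytes: a single byte, the 7-byte padding copy mentioned in the
# comment, and a few larger objects.
for size in (1, 7, 8, 12, 24)
    println(size, " bytes -> align ", target_alignment(size))
end
```

Under this rule a 7-byte object is aligned to 8 bytes, and anything larger than 8 bytes is aligned to the 16-byte cap. The pass in the diff only ever raises alignment: an alloca that is already aligned at least this strictly is left untouched.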
