Use an up-to-date LLVM by relying on an external llc.#3162
Merged
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #3162 +/- ##
=======================================
Coverage 16.32% 16.32%
=======================================
Files 124 124
Lines 9875 9875
=======================================
Hits 1612 1612
Misses 8263 8263
Contributor
CUDA.jl Benchmarks
| Benchmark suite | Current: b2a389e | Previous: 81d7397 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 99930 ns | 99264 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 76181 ns | 75495 ns | 1.01 |
| array/accumulate/Float32/dims=1L | 1596372 ns | 1585711 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 141346 ns | 141086 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 654489 ns | 653695 ns | 1.00 |
| array/accumulate/Int64/1d | 118532 ns | 117278 ns | 1.01 |
| array/accumulate/Int64/dims=1 | 79605 ns | 79261 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1708585 ns | 1700035 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 154630 ns | 150883 ns | 1.02 |
| array/accumulate/Int64/dims=2L | 960074 ns | 959294 ns | 1.00 |
| array/broadcast | 18560 ns | 18328 ns | 1.01 |
| array/construct | 1208.7 ns | 1190.1 ns | 1.02 |
| array/copy | 16932 ns | 16650 ns | 1.02 |
| array/copyto!/cpu_to_gpu | 215028 ns | 212593 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 281151 ns | 280762 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10633 ns | 10344 ns | 1.03 |
| array/iteration/findall/bool | 134130 ns | 131321 ns | 1.02 |
| array/iteration/findall/int | 147788 ns | 144983 ns | 1.02 |
| array/iteration/findfirst/bool | 112584 ns | 68482 ns | 1.64 |
| array/iteration/findfirst/int | 112793 ns | 70611 ns | 1.60 |
| array/iteration/findmin/1d | 68493 ns | 64808 ns | 1.06 |
| array/iteration/findmin/2d | 101107 ns | 100918 ns | 1.00 |
| array/iteration/logical | 193020 ns | 189296 ns | 1.02 |
| array/iteration/scalar | 66592 ns | 65562 ns | 1.02 |
| array/permutedims/2d | 49476 ns | 49741 ns | 0.99 |
| array/permutedims/3d | 50355 ns | 50479 ns | 1.00 |
| array/permutedims/4d | 50481 ns | 50560 ns | 1.00 |
| array/random/rand/Float32 | 12216 ns | 11506 ns | 1.06 |
| array/random/rand/Int64 | 23999 ns | 24008 ns | 1.00 |
| array/random/rand!/Float32 | 8012.666666666667 ns | 8226.333333333334 ns | 0.97 |
| array/random/rand!/Int64 | 18798 ns | 20629 ns | 0.91 |
| array/random/randn/Float32 | 35588 ns | 35503 ns | 1.00 |
| array/random/randn!/Float32 | 24028 ns | 23995 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 33948 ns | 33262 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1 | 38244 ns | 38127 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 50348 ns | 50245 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 55691 ns | 55632 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 67478 ns | 67294 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 39806 ns | 38973 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=1 | 41326 ns | 40807 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 86625 ns | 86255 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57998 ns | 58105 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 83141 ns | 82675 ns | 1.01 |
| array/reductions/reduce/Float32/1d | 33851 ns | 33096 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 38543 ns | 38054 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 50188 ns | 50467 ns | 0.99 |
| array/reductions/reduce/Float32/dims=2 | 55738 ns | 55386 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 69128 ns | 67642 ns | 1.02 |
| array/reductions/reduce/Int64/1d | 39495 ns | 38841 ns | 1.02 |
| array/reductions/reduce/Int64/dims=1 | 41166 ns | 40658 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 86691 ns | 86419 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 58031 ns | 57814 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82660 ns | 82725 ns | 1.00 |
| array/reverse/1d | 15967 ns | 16783 ns | 0.95 |
| array/reverse/1dL | 67821 ns | 67518 ns | 1.00 |
| array/reverse/1dL_inplace | 65311 ns | 65186 ns | 1.00 |
| array/reverse/1d_inplace | 8397 ns | 8208 ns | 1.02 |
| array/reverse/2d | 20190 ns | 19927 ns | 1.01 |
| array/reverse/2dL | 71908 ns | 71784 ns | 1.00 |
| array/reverse/2dL_inplace | 65204 ns | 65074 ns | 1.00 |
| array/reverse/2d_inplace | 9648 ns | 9571 ns | 1.01 |
| array/sorting/1d | 2651128 ns | 2723816 ns | 0.97 |
| array/sorting/2d | 1040488 ns | 1065468 ns | 0.98 |
| array/sorting/by | 3193815 ns | 3266956 ns | 0.98 |
| cuda/synchronization/context/auto | 1142 ns | 1139.2 ns | 1.00 |
| cuda/synchronization/context/blocking | 909.5526315789474 ns | 911.4444444444445 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 6082 ns | 5970.4 ns | 1.02 |
| cuda/synchronization/stream/auto | 996.4545454545455 ns | 1010.8 ns | 0.99 |
| cuda/synchronization/stream/blocking | 810.7263157894737 ns | 824.0833333333334 ns | 0.98 |
| cuda/synchronization/stream/nonblocking | 5990.714285714285 ns | 6016 ns | 1.00 |
| integration/byval/reference | 143418 ns | 143217 ns | 1.00 |
| integration/byval/slices=1 | 145491 ns | 145187 ns | 1.00 |
| integration/byval/slices=2 | 283990 ns | 283773 ns | 1.00 |
| integration/byval/slices=3 | 422462 ns | 422013 ns | 1.00 |
| integration/cudadevrt | 101813 ns | 101637 ns | 1.00 |
| integration/volumerhs | 9072804 ns | 8886664 ns | 1.02 |
| kernel/indexing | 12687 ns | 12614 ns | 1.01 |
| kernel/indexing_checked | 13504 ns | 13361 ns | 1.01 |
| kernel/launch | 2093.777777777778 ns | 2137.777777777778 ns | 0.98 |
| kernel/occupancy | 691.6041666666666 ns | 759.3831775700935 ns | 0.91 |
| kernel/rand | 16548 ns | 14411 ns | 1.15 |
| latency/import | 3840684761 ns | 3875005359 ns | 0.99 |
| latency/precompile | 4625464529 ns | 4629977402 ns | 1.00 |
| latency/ttfp | 4486224788 ns | 4513776589 ns | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.
Member
Author
Quite some performance regressions here. Looking into them.
This was referenced Jun 5, 2026
Member
Author
Looks like an LLVM regression with respect to small/under-aligned allocas. Workaround in GPUCompiler:

diff --git a/src/ptx.jl b/src/ptx.jl
index 7c4ea5c..ffd178e 100644
--- a/src/ptx.jl
+++ b/src/ptx.jl
@@ -343,9 +343,49 @@ function finish_ir!(@nospecialize(job::CompilerJob{PTXCompilerTarget}),
end
end
+ raise_alloca_alignment!(mod)
+
return entry
end
+# raise the alignment of under-aligned allocas
+#
+# SelectionDAG normally raises the alignment of under-aligned stack objects when
+# expanding small memcpys (`DstAlignCanChange` in `getMemcpyLoadsAndStores`), but that
+# requires the destination to be a bare `FrameIndex` node. The NVPTX back-end wraps
+# allocas in `addrspacecast`s (NVPTXLowerAlloca), hiding the frame object, so the
+# expansion has to honor the IR-level alignment. That lowers e.g. the 7-byte padding
+# copies SROA generates for non-power-of-two aggregates to byte-wise loads and stores.
+# Compensate by raising the alignment of small allocas ourselves before codegen.
+function raise_alloca_alignment!(mod::LLVM.Module)
+ changed = false
+ @tracepoint "raise alloca alignment" begin
+
+ dl = datalayout(mod)
+ for f in functions(mod), bb in blocks(f), inst in instructions(bb)
+ inst isa LLVM.AllocaInst || continue
+
+ # only static allocas
+ array_size = operands(inst)[1]
+ array_size isa LLVM.ConstantInt || continue
+
+ alloca_type = LLVMType(LLVM.API.LLVMGetAllocatedType(inst))
+ size = abi_size(dl, alloca_type) * convert(Int, array_size)
+ size > 0 || continue
+
+ # match what `DstAlignCanChange` would have done: align to the largest
+ # power-of-two memory access this object could be copied with.
+ align = min(nextpow(2, size), 16)
+ if alignment(inst) < align
+ alignment!(inst, align)
+ changed = true
+ end
+ end
+
+ end
+ return changed
+end
+
@unlocked function mcgen(@nospecialize(job::CompilerJob{PTXCompilerTarget}),
mod::LLVM.Module, format=LLVM.API.LLVMAssemblyFile)
if !isavailable(NVPTX_LLVM_Backend_jll) || !NVPTX_LLVM_Backend_jll.is_available()

Let's first try to upstream this though: llvm/llvm-project#201772
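
For readers who want to see what the `min(nextpow(2, size), 16)` rule in the workaround produces, here is a minimal standalone sketch using only Julia's `Base`. The names `target_alignment` and `raised_alignment` are hypothetical and not part of GPUCompiler or LLVM.jl. The sketch covers only the arithmetic, not the IR traversal, and it assumes the 16-byte cap from the diff.

```julia
# Hypothetical helper mirroring the `min(nextpow(2, size), 16)` rule from the diff.
# `size` is the total size of the stack object in bytes; `cap` is the assumed
# upper bound on the alignment (16 bytes in the patch).
function target_alignment(size::Integer; cap::Integer=16)
    size > 0 || throw(ArgumentError("size must be positive"))
    return min(nextpow(2, size), cap)
end

# The workaround only ever raises an alignment, it never lowers one.
raised_alignment(current::Integer, size::Integer) =
    max(current, target_alignment(size))

# Print the alignment chosen for a few object sizes, including the 7-byte
# case mentioned in the comment above `raise_alloca_alignment!`.
for size in (1, 3, 7, 8, 12, 24, 100)
    println("size = ", size, " bytes -> align ", target_alignment(size))
end
```

Under this rule a 7-byte object is aligned to 8 bytes. Anything larger than 8 bytes gets the 16-byte cap. An alloca that already has an equal or larger alignment is left unchanged.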