Use an up-to-date LLVM by relying on an external `llc`. by maleadt · Pull Request #3162 · JuliaGPU/CUDA.jl

maleadt · 2026-06-04T14:30:48Z

No description provided.

codecov · 2026-06-04T16:59:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.32%. Comparing base (81d7397) to head (b2a389e).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3162   +/-   ##
=======================================
  Coverage   16.32%   16.32%           
=======================================
  Files         124      124           
  Lines        9875     9875           
=======================================
  Hits         1612     1612           
  Misses       8263     8263

github-actions

CUDA.jl Benchmarks

Benchmark suite	Current: `b2a389e`	Previous: `81d7397`	Ratio
`array/accumulate/Float32/1d`	`99930` ns	`99264` ns	`1.01`
`array/accumulate/Float32/dims=1`	`76181` ns	`75495` ns	`1.01`
`array/accumulate/Float32/dims=1L`	`1596372` ns	`1585711` ns	`1.01`
`array/accumulate/Float32/dims=2`	`141346` ns	`141086` ns	`1.00`
`array/accumulate/Float32/dims=2L`	`654489` ns	`653695` ns	`1.00`
`array/accumulate/Int64/1d`	`118532` ns	`117278` ns	`1.01`
`array/accumulate/Int64/dims=1`	`79605` ns	`79261` ns	`1.00`
`array/accumulate/Int64/dims=1L`	`1708585` ns	`1700035` ns	`1.01`
`array/accumulate/Int64/dims=2`	`154630` ns	`150883` ns	`1.02`
`array/accumulate/Int64/dims=2L`	`960074` ns	`959294` ns	`1.00`
`array/broadcast`	`18560` ns	`18328` ns	`1.01`
`array/construct`	`1208.7` ns	`1190.1` ns	`1.02`
`array/copy`	`16932` ns	`16650` ns	`1.02`
`array/copyto!/cpu_to_gpu`	`215028` ns	`212593` ns	`1.01`
`array/copyto!/gpu_to_cpu`	`281151` ns	`280762` ns	`1.00`
`array/copyto!/gpu_to_gpu`	`10633` ns	`10344` ns	`1.03`
`array/iteration/findall/bool`	`134130` ns	`131321` ns	`1.02`
`array/iteration/findall/int`	`147788` ns	`144983` ns	`1.02`
`array/iteration/findfirst/bool`	`112584` ns	`68482` ns	`1.64`
`array/iteration/findfirst/int`	`112793` ns	`70611` ns	`1.60`
`array/iteration/findmin/1d`	`68493` ns	`64808` ns	`1.06`
`array/iteration/findmin/2d`	`101107` ns	`100918` ns	`1.00`
`array/iteration/logical`	`193020` ns	`189296` ns	`1.02`
`array/iteration/scalar`	`66592` ns	`65562` ns	`1.02`
`array/permutedims/2d`	`49476` ns	`49741` ns	`0.99`
`array/permutedims/3d`	`50355` ns	`50479` ns	`1.00`
`array/permutedims/4d`	`50481` ns	`50560` ns	`1.00`
`array/random/rand/Float32`	`12216` ns	`11506` ns	`1.06`
`array/random/rand/Int64`	`23999` ns	`24008` ns	`1.00`
`array/random/rand!/Float32`	`8012.666666666667` ns	`8226.333333333334` ns	`0.97`
`array/random/rand!/Int64`	`18798` ns	`20629` ns	`0.91`
`array/random/randn/Float32`	`35588` ns	`35503` ns	`1.00`
`array/random/randn!/Float32`	`24028` ns	`23995` ns	`1.00`
`array/reductions/mapreduce/Float32/1d`	`33948` ns	`33262` ns	`1.02`
`array/reductions/mapreduce/Float32/dims=1`	`38244` ns	`38127` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=1L`	`50348` ns	`50245` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2`	`55691` ns	`55632` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2L`	`67478` ns	`67294` ns	`1.00`
`array/reductions/mapreduce/Int64/1d`	`39806` ns	`38973` ns	`1.02`
`array/reductions/mapreduce/Int64/dims=1`	`41326` ns	`40807` ns	`1.01`
`array/reductions/mapreduce/Int64/dims=1L`	`86625` ns	`86255` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2`	`57998` ns	`58105` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2L`	`83141` ns	`82675` ns	`1.01`
`array/reductions/reduce/Float32/1d`	`33851` ns	`33096` ns	`1.02`
`array/reductions/reduce/Float32/dims=1`	`38543` ns	`38054` ns	`1.01`
`array/reductions/reduce/Float32/dims=1L`	`50188` ns	`50467` ns	`0.99`
`array/reductions/reduce/Float32/dims=2`	`55738` ns	`55386` ns	`1.01`
`array/reductions/reduce/Float32/dims=2L`	`69128` ns	`67642` ns	`1.02`
`array/reductions/reduce/Int64/1d`	`39495` ns	`38841` ns	`1.02`
`array/reductions/reduce/Int64/dims=1`	`41166` ns	`40658` ns	`1.01`
`array/reductions/reduce/Int64/dims=1L`	`86691` ns	`86419` ns	`1.00`
`array/reductions/reduce/Int64/dims=2`	`58031` ns	`57814` ns	`1.00`
`array/reductions/reduce/Int64/dims=2L`	`82660` ns	`82725` ns	`1.00`
`array/reverse/1d`	`15967` ns	`16783` ns	`0.95`
`array/reverse/1dL`	`67821` ns	`67518` ns	`1.00`
`array/reverse/1dL_inplace`	`65311` ns	`65186` ns	`1.00`
`array/reverse/1d_inplace`	`8397` ns	`8208` ns	`1.02`
`array/reverse/2d`	`20190` ns	`19927` ns	`1.01`
`array/reverse/2dL`	`71908` ns	`71784` ns	`1.00`
`array/reverse/2dL_inplace`	`65204` ns	`65074` ns	`1.00`
`array/reverse/2d_inplace`	`9648` ns	`9571` ns	`1.01`
`array/sorting/1d`	`2651128` ns	`2723816` ns	`0.97`
`array/sorting/2d`	`1040488` ns	`1065468` ns	`0.98`
`array/sorting/by`	`3193815` ns	`3266956` ns	`0.98`
`cuda/synchronization/context/auto`	`1142` ns	`1139.2` ns	`1.00`
`cuda/synchronization/context/blocking`	`909.5526315789474` ns	`911.4444444444445` ns	`1.00`
`cuda/synchronization/context/nonblocking`	`6082` ns	`5970.4` ns	`1.02`
`cuda/synchronization/stream/auto`	`996.4545454545455` ns	`1010.8` ns	`0.99`
`cuda/synchronization/stream/blocking`	`810.7263157894737` ns	`824.0833333333334` ns	`0.98`
`cuda/synchronization/stream/nonblocking`	`5990.714285714285` ns	`6016` ns	`1.00`
`integration/byval/reference`	`143418` ns	`143217` ns	`1.00`
`integration/byval/slices=1`	`145491` ns	`145187` ns	`1.00`
`integration/byval/slices=2`	`283990` ns	`283773` ns	`1.00`
`integration/byval/slices=3`	`422462` ns	`422013` ns	`1.00`
`integration/cudadevrt`	`101813` ns	`101637` ns	`1.00`
`integration/volumerhs`	`9072804` ns	`8886664` ns	`1.02`
`kernel/indexing`	`12687` ns	`12614` ns	`1.01`
`kernel/indexing_checked`	`13504` ns	`13361` ns	`1.01`
`kernel/launch`	`2093.777777777778` ns	`2137.777777777778` ns	`0.98`
`kernel/occupancy`	`691.6041666666666` ns	`759.3831775700935` ns	`0.91`
`kernel/rand`	`16548` ns	`14411` ns	`1.15`
`latency/import`	`3840684761` ns	`3875005359` ns	`0.99`
`latency/precompile`	`4625464529` ns	`4629977402` ns	`1.00`
`latency/ttfp`	`4486224788` ns	`4513776589` ns	`0.99`

This comment was automatically generated by workflow using github-action-benchmark.

maleadt · 2026-06-05T06:54:10Z

Quite some performance regressions here. Looking into them.
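To make the regressions easier to spot, here is a minimal sketch that recomputes the `Ratio` column (current time divided by previous time) for a handful of rows copied from the benchmark table and lists the ones above a cutoff. The 1.10 cutoff and the names `rows`, `ratio`, and `regressions` are assumptions for illustration; they are not part of the benchmark workflow.

```julia
# (name, current ns, previous ns), copied from the benchmark table in this thread.
rows = [
    ("array/iteration/findfirst/bool", 112584, 68482),
    ("array/iteration/findfirst/int",  112793, 70611),
    ("array/iteration/findmin/1d",      68493, 64808),
    ("kernel/rand",                     16548, 14411),
    ("array/random/rand!/Int64",        18798, 20629),
]

# Assumed cutoff: treat anything at least 10% slower as a regression.
threshold = 1.10

# The table's ratio is current / previous, rounded to two decimals.
ratio(current, previous) = round(current / previous; digits=2)

# Keep only the rows whose ratio reaches the cutoff.
regressions = [(name, ratio(cur, prev))
               for (name, cur, prev) in rows
               if ratio(cur, prev) >= threshold]

for (name, r) in regressions
    println(rpad(name, 34), r)
end
```

With these rows and this cutoff, the two `findfirst` benchmarks and `kernel/rand` are the ones listed.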

maleadt · 2026-06-05T07:59:54Z

Looks like an LLVM regression with respect to small, under-aligned allocas. Workaround in GPUCompiler:

diff --git a/src/ptx.jl b/src/ptx.jl
index 7c4ea5c..ffd178e 100644
--- a/src/ptx.jl
+++ b/src/ptx.jl
@@ -343,9 +343,49 @@ function finish_ir!(@nospecialize(job::CompilerJob{PTXCompilerTarget}),
         end
     end

+    raise_alloca_alignment!(mod)
+
     return entry
 end

+# raise the alignment of under-aligned allocas
+#
+# SelectionDAG normally raises the alignment of under-aligned stack objects when
+# expanding small memcpys (`DstAlignCanChange` in `getMemcpyLoadsAndStores`), but that
+# requires the destination to be a bare `FrameIndex` node. The NVPTX back-end wraps
+# allocas in `addrspacecast`s (NVPTXLowerAlloca), hiding the frame object, so the
+# expansion has to honor the IR-level alignment. That lowers e.g. the 7-byte padding
+# copies SROA generates for non-power-of-two aggregates to byte-wise loads and stores.
+# Compensate by raising the alignment of small allocas ourselves before codegen.
+function raise_alloca_alignment!(mod::LLVM.Module)
+    changed = false
+    @tracepoint "raise alloca alignment" begin
+
+    dl = datalayout(mod)
+    for f in functions(mod), bb in blocks(f), inst in instructions(bb)
+        inst isa LLVM.AllocaInst || continue
+
+        # only static allocas
+        array_size = operands(inst)[1]
+        array_size isa LLVM.ConstantInt || continue
+
+        alloca_type = LLVMType(LLVM.API.LLVMGetAllocatedType(inst))
+        size = abi_size(dl, alloca_type) * convert(Int, array_size)
+        size > 0 || continue
+
+        # match what `DstAlignCanChange` would have done: align to the largest
+        # power-of-two memory access this object could be copied with.
+        align = min(nextpow(2, size), 16)
+        if alignment(inst) < align
+            alignment!(inst, align)
+            changed = true
+        end
+    end
+
+    end
+    return changed
+end
+
 @unlocked function mcgen(@nospecialize(job::CompilerJob{PTXCompilerTarget}),
                          mod::LLVM.Module, format=LLVM.API.LLVMAssemblyFile)
     if !isavailable(NVPTX_LLVM_Backend_jll) || !NVPTX_LLVM_Backend_jll.is_available()

Let's first try to upstream this though: llvm/llvm-project#201772
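For a feel of what the `min(nextpow(2, size), 16)` rule in the workaround does, here is a small standalone sketch that applies the same arithmetic to a few object sizes without touching LLVM. The helper name `raised_alignment` and its `cap` keyword are hypothetical; only the formula comes from the diff.

```julia
# Hypothetical helper mirroring the alignment rule from the diff:
# round the object size up to a power of two, capped at 16 bytes.
# `size` must be positive, matching the `size > 0 || continue` guard in the diff.
raised_alignment(size::Integer; cap::Integer=16) = min(nextpow(2, size), cap)

# Example sizes, including the 7-byte case mentioned in the diff's comment.
for size in (1, 3, 7, 8, 12, 24)
    println(size, " bytes -> align ", raised_alignment(size))
end
```

Under this rule a 7-byte alloca would be raised to 8-byte alignment, and anything larger than 8 bytes would be capped at 16. The workaround only applies the new value when it exceeds the alloca's existing alignment.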

Use an up-to-date LLVM by relying on an external llc.

b2a389e

github-actions Bot reviewed Jun 4, 2026

maleadt merged commit aa47d7a into main Jun 4, 2026
2 checks passed

maleadt deleted the tb/external_llc branch June 4, 2026 17:03

This was referenced Jun 5, 2026

Add dp4a device intrinsic #3163

Open

[SelectionDAG] Look through addrspacecasts when raising stack object alignment llvm/llvm-project#201772

Open
