Add dp4a device intrinsic by JohnCobbler · Pull Request #3163 · JuliaGPU/CUDA.jl

JohnCobbler · 2026-06-04T17:39:01Z

Adds CUDACore.dp4a, the packed 4-element int8/uint8 dot product with a 32-bit accumulator (a single PTX dp4a instruction, available on sm_61+), in its four signedness variants:

dp4a(a::Int32,  b::Int32,  c::Int32)  -> Int32   # dp4a.s32.s32
dp4a(a::Int32,  b::UInt32, c::Int32)  -> Int32   # dp4a.s32.u32
dp4a(a::UInt32, b::Int32,  c::Int32)  -> Int32   # dp4a.u32.s32
dp4a(a::UInt32, b::UInt32, c::UInt32) -> UInt32  # dp4a.u32.u32

The function is not exported; it is marked @device_function and added to @public, following the popc/byte_perm precedent. It is useful for quantized int8 inference and similar integer-heavy kernels.
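
To make the semantics of the four variants concrete, here is a minimal CPU-side sketch in plain Julia. It is neither the PR's implementation nor its test reference: `lane_type`, `lane`, `pack` and `dp4a_ref` are hypothetical names introduced for illustration. The sketch assumes that lane 0 is the least-significant byte of each packed word and that the 32-bit accumulate wraps on overflow.

```julia
# Illustrative CPU reference for dp4a semantics (hypothetical helper names,
# not code from the PR).

# An Int32 operand carries four Int8 lanes; a UInt32 operand carries four UInt8 lanes.
lane_type(::Int32) = Int8
lane_type(::UInt32) = UInt8

# Extract lane i (0-based, least-significant byte first) and reinterpret it as T.
lane(::Type{T}, x::Union{Int32,UInt32}, i::Integer) where {T<:Union{Int8,UInt8}} =
    ((x % UInt32) >> (8i)) % UInt8 % T

# Pack four bytes into one 32-bit word: Int8 lanes give Int32, UInt8 lanes give UInt32.
function pack(bytes::NTuple{4,T}) where {T<:Union{Int8,UInt8}}
    w = UInt32(0)
    for i in 0:3
        w |= UInt32(bytes[i + 1] % UInt8) << (8i)
    end
    return T === Int8 ? w % Int32 : w
end

# d = c + sum of the four lane products, accumulated in 64 bits and then
# truncated to the accumulator type (assumed wrap-around on overflow).
function dp4a_ref(a::Union{Int32,UInt32}, b::Union{Int32,UInt32}, c::Union{Int32,UInt32})
    # Mirror the signatures above: a UInt32 accumulator only with two unsigned operands.
    (c isa UInt32) == (a isa UInt32 && b isa UInt32) ||
        throw(ArgumentError("accumulator type does not match operand signedness"))
    acc = Int64(c)
    for i in 0:3
        acc += Int64(lane(lane_type(a), a, i)) * Int64(lane(lane_type(b), b, i))
    end
    return acc % typeof(c)
end

a = pack(Int8.((1, -2, 3, -4)))        # signed lanes
b = pack(UInt8.((10, 20, 30, 255)))    # unsigned lanes
dp4a_ref(a, b, Int32(5))               # 10 - 40 + 90 - 1020 + 5 == -955
```

Passing a zero word for either operand returns the accumulator unchanged, which is the "accumulator pass-through" property a test of this kind would typically check.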

Implementation

Per the review discussion: on LLVM 21+ (which added @llvm.nvvm.idp4a.[us].[us]) the implementation uses the intrinsics via ccall; on older LLVM it falls back to inline PTX via @asmcall (same approach as nanosleep). Gated with @static if LLVM.version() >= v"21", following the existing pattern in math.jl.
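
The version gate described above can be illustrated without a GPU. The snippet below only demonstrates the `@static if` pattern: it uses `Base.libllvm_version` (the LLVM version Julia itself was built against) as a stand-in for the `LLVM.version()` call the PR uses, and `dp4a_path` is a hypothetical name. In the real code the two branches would hold the intrinsic ccall and the @asmcall fallback rather than symbols.

```julia
# Sketch of compile-time path selection (hypothetical names; the PR gates on
# LLVM.version() from LLVM.jl, Base.libllvm_version is used here as a stand-in).
@static if Base.libllvm_version >= v"21"
    # LLVM 21+: the @llvm.nvvm.idp4a.* intrinsics are available.
    dp4a_path() = :intrinsic
else
    # Older LLVM: fall back to inline PTX.
    dp4a_path() = :inline_ptx
end

println("LLVM ", Base.libllvm_version, " selects the ", dp4a_path(), " path")
```

Because `@static` evaluates the condition at macro-expansion time, only the selected branch becomes part of the compiled code.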

Testing

test/core/device/intrinsics/math.jl: GPU kernels for all four variants against a pure-Julia reference over edge cases (±127, -128, 255, all-ones, accumulator pass-through, mixed signs), plus a @device_code_ptx check pinning actual dp4a instruction selection (no emulation).
Both paths verified on a Quadro RTX 6000 (sm_75): asm path on Julia 1.11.9 (LLVM 16), intrinsic path on nightly (LLVM 21.1.8). Confirmed via @device_code_llvm that each run took the intended path; identical dp4a instruction selection and identical results (4608 cases per path, zero mismatches).
Instruction selection for the intrinsics additionally checked at the LLVM level: all four lower to their dp4a.* forms with the NVPTX_LLVM_Backend_jll llc (22.1.7).

Possible follow-up: dp2a (2-element int16×int8 dot product) — same pattern, separate PR if there's interest.

maleadt · 2026-06-05T07:36:22Z

I guess that's one remaining limitation of #3162: intrinsics still have to exist in Julia's LLVM version. I guess we could relax ccall's semantics, but using @asmcall is fine for now.

maleadt

Let's wait for CI to come back first.

vchuravy · 2026-06-05T08:36:23Z

I would prefer a version check and using the LLVM intrinsic when they become available in the host compiler.

maleadt · 2026-06-05T08:49:47Z

We could also do something funky like emitting extern @llvm_new.foobar, rewriting in GPUCompiler, but that seems sketchy.

github-actions

CUDA.jl Benchmarks

Benchmark suite	Current: `f4117a9`	Previous: `642cf8d`	Ratio
`array/accumulate/Float32/1d`	`98912` ns	`99547` ns	`0.99`
`array/accumulate/Float32/dims=1`	`74506` ns	`76355` ns	`0.98`
`array/accumulate/Float32/dims=1L`	`1594515` ns	`1597531` ns	`1.00`
`array/accumulate/Float32/dims=2`	`139761` ns	`140903` ns	`0.99`
`array/accumulate/Float32/dims=2L`	`652929` ns	`655167` ns	`1.00`
`array/accumulate/Int64/1d`	`118243` ns	`118905` ns	`0.99`
`array/accumulate/Int64/dims=1`	`78569` ns	`80219` ns	`0.98`
`array/accumulate/Int64/dims=1L`	`1708093` ns	`1709632` ns	`1.00`
`array/accumulate/Int64/dims=2`	`152787` ns	`155189` ns	`0.98`
`array/accumulate/Int64/dims=2L`	`959258` ns	`961346` ns	`1.00`
`array/broadcast`	`18573` ns	`18574` ns	`1.00`
`array/construct`	`1211.7` ns	`1233.9` ns	`0.98`
`array/copy`	`16394` ns	`16492` ns	`0.99`
`array/copyto!/cpu_to_gpu`	`213529` ns	`213722` ns	`1.00`
`array/copyto!/gpu_to_cpu`	`278357` ns	`280514` ns	`0.99`
`array/copyto!/gpu_to_gpu`	`10368` ns	`10351` ns	`1.00`
`array/iteration/findall/bool`	`132797` ns	`134807` ns	`0.99`
`array/iteration/findall/int`	`146941` ns	`148125` ns	`0.99`
`array/iteration/findfirst/bool`	`69609` ns	`70484` ns	`0.99`
`array/iteration/findfirst/int`	`70978` ns	`72098` ns	`0.98`
`array/iteration/findmin/1d`	`66035` ns	`70609` ns	`0.94`
`array/iteration/findmin/2d`	`100451` ns	`100855` ns	`1.00`
`array/iteration/logical`	`190428` ns	`196042` ns	`0.97`
`array/iteration/scalar`	`65953` ns	`65566` ns	`1.01`
`array/permutedims/2d`	`49235` ns	`50013` ns	`0.98`
`array/permutedims/3d`	`50900` ns	`51630` ns	`0.99`
`array/permutedims/4d`	`50549` ns	`51043` ns	`0.99`
`array/random/rand/Float32`	`12272` ns	`11830` ns	`1.04`
`array/random/rand/Int64`	`23643` ns	`23140` ns	`1.02`
`array/random/rand!/Float32`	`9786` ns	`9790.666666666666` ns	`1.00`
`array/random/rand!/Int64`	`20509` ns	`20689` ns	`0.99`
`array/random/randn/Float32`	`35193` ns	`35912` ns	`0.98`
`array/random/randn!/Float32`	`27247` ns	`24135` ns	`1.13`
`array/reductions/mapreduce/Float32/1d`	`33200` ns	`34130` ns	`0.97`
`array/reductions/mapreduce/Float32/dims=1`	`38206` ns	`38318` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=1L`	`49972` ns	`50390` ns	`0.99`
`array/reductions/mapreduce/Float32/dims=2`	`55224` ns	`55724` ns	`0.99`
`array/reductions/mapreduce/Float32/dims=2L`	`67278` ns	`67834` ns	`0.99`
`array/reductions/mapreduce/Int64/1d`	`39408` ns	`39600` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=1`	`40947` ns	`41356` ns	`0.99`
`array/reductions/mapreduce/Int64/dims=1L`	`86433` ns	`86763` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2`	`57977` ns	`58074` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2L`	`83108` ns	`83222` ns	`1.00`
`array/reductions/reduce/Float32/1d`	`33128` ns	`34137` ns	`0.97`
`array/reductions/reduce/Float32/dims=1`	`38135` ns	`38527` ns	`0.99`
`array/reductions/reduce/Float32/dims=1L`	`49967` ns	`50380` ns	`0.99`
`array/reductions/reduce/Float32/dims=2`	`55464` ns	`55782` ns	`0.99`
`array/reductions/reduce/Float32/dims=2L`	`68986` ns	`69268` ns	`1.00`
`array/reductions/reduce/Int64/1d`	`39307` ns	`40084` ns	`0.98`
`array/reductions/reduce/Int64/dims=1`	`40849` ns	`40899` ns	`1.00`
`array/reductions/reduce/Int64/dims=1L`	`86455` ns	`86709` ns	`1.00`
`array/reductions/reduce/Int64/dims=2`	`57301` ns	`57670` ns	`0.99`
`array/reductions/reduce/Int64/dims=2L`	`82630` ns	`82979` ns	`1.00`
`array/reverse/1d`	`16823` ns	`17183` ns	`0.98`
`array/reverse/1dL`	`67734` ns	`68153` ns	`0.99`
`array/reverse/1dL_inplace`	`65101` ns	`65350` ns	`1.00`
`array/reverse/1d_inplace`	`8132.333333333333` ns	`9109.333333333334` ns	`0.89`
`array/reverse/2d`	`20088` ns	`20317` ns	`0.99`
`array/reverse/2dL`	`71955` ns	`72231` ns	`1.00`
`array/reverse/2dL_inplace`	`65063` ns	`65096` ns	`1.00`
`array/reverse/2d_inplace`	`9524` ns	`10596` ns	`0.90`
`array/sorting/1d`	`2657913` ns	`2651245` ns	`1.00`
`array/sorting/2d`	`1039223` ns	`1041142` ns	`1.00`
`array/sorting/by`	`3193153` ns	`3194553` ns	`1.00`
`cuda/synchronization/context/auto`	`1144.4` ns	`1150.7` ns	`0.99`
`cuda/synchronization/context/blocking`	`915.2857142857143` ns	`927.2857142857143` ns	`0.99`
`cuda/synchronization/context/nonblocking`	`6151` ns	`6099.4` ns	`1.01`
`cuda/synchronization/stream/auto`	`1007.9` ns	`1003.1818181818181` ns	`1.00`
`cuda/synchronization/stream/blocking`	`814.2247191011236` ns	`815.8571428571429` ns	`1.00`
`cuda/synchronization/stream/nonblocking`	`6013.4` ns	`5833.6` ns	`1.03`
`integration/byval/reference`	`143145` ns	`143277` ns	`1.00`
`integration/byval/slices=1`	`145339` ns	`145740` ns	`1.00`
`integration/byval/slices=2`	`283807` ns	`284135` ns	`1.00`
`integration/byval/slices=3`	`422136` ns	`422704` ns	`1.00`
`integration/cudadevrt`	`101540` ns	`101755` ns	`1.00`
`integration/volumerhs`	`9099128` ns	`9086362` ns	`1.00`
`kernel/indexing`	`12507` ns	`12648` ns	`0.99`
`kernel/indexing_checked`	`13294` ns	`13470` ns	`0.99`
`kernel/launch`	`2023.111111111111` ns	`2063.8888888888887` ns	`0.98`
`kernel/occupancy`	`711.0335570469799` ns	`719.1268656716418` ns	`0.99`
`kernel/rand`	`14419` ns	`13890` ns	`1.04`
`latency/import`	`3863775132` ns	`3854502988` ns	`1.00`
`latency/precompile`	`4645361901` ns	`4637764247` ns	`1.00`
`latency/ttfp`	`4523735136` ns	`4501425569` ns	`1.00`

This comment was automatically generated by workflow using github-action-benchmark.

Add CUDACore.dp4a with the four signedness variants of the PTX dp4a instruction (packed 4-element int8/uint8 dot product with 32-bit accumulate), available on sm_61 and later. On LLVM 21 and later the implementation uses the @llvm.nvvm.idp4a.[us].[us] intrinsics, which were added in that release; on older versions it falls back to inline PTX via @asmcall. Both paths were verified on sm_75, with identical dp4a instruction selection and bit-identical results against a byte-wise reference, on Julia 1.11 (LLVM 16, asm path) and nightly (LLVM 21, intrinsic path).

codecov · 2026-06-05T13:06:06Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.32%. Comparing base (aa47d7a) to head (99c2eed).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3163      +/-   ##
==========================================
- Coverage   16.33%   16.32%   -0.02%     
==========================================
  Files         124      124              
  Lines        9875     9875              
==========================================
- Hits         1613     1612       -1     
- Misses       8262     8263       +1

maleadt · 2026-06-05T16:39:06Z

+            # Verify the backend emits the actual dp4a instruction, not a
+            # software emulation sequence.
+            buf = CuArray{Int32}(undef, 1)
+            ptx = sprint(io->(@device_code_ptx io=io @cuda launch=false kernel_ss(buf, Int32(0), Int32(0), Int32(0))))


Please make this a CUDA.code_ptx or so, no need to actually launch kernels just to inspect generated code.

JohnCobbler · 2026-06-10T16:57:40Z

Switched the PTX check to CUDA.code_ptx; verified locally.

maleadt force-pushed the dp4a-intrinsic branch from eb198ae to 682ea43 on June 5, 2026 07:50

maleadt approved these changes Jun 5, 2026

github-actions Bot reviewed Jun 5, 2026

JohnCobbler force-pushed the dp4a-intrinsic branch from 682ea43 to 99c2eed on June 5, 2026 10:49

JohnCobbler mentioned this pull request Jun 5, 2026: cublasXt device-pointer tests intermittently read back all-NaN on multi-GPU CI #3165 (Open)

maleadt reviewed Jun 5, 2026

Commit f4117a9: test: inspect dp4a PTX via code_ptx instead of launching a kernel

JohnCobbler wants to merge 2 commits into JuliaGPU:main from JohnCobbler:dp4a-intrinsic


3 participants
