Add dp4a device intrinsic #3163

Open: JohnCobbler wants to merge 2 commits into JuliaGPU:main from JohnCobbler:dp4a-intrinsic

JohnCobbler (Contributor) commented Jun 4, 2026

Adds CUDACore.dp4a, the packed 4-element int8/uint8 dot product with a 32-bit accumulator (a single PTX dp4a instruction, available on sm_61+), in its four signedness variants:

dp4a(a::Int32,  b::Int32,  c::Int32)  -> Int32   # dp4a.s32.s32
dp4a(a::Int32,  b::UInt32, c::Int32)  -> Int32   # dp4a.s32.u32
dp4a(a::UInt32, b::Int32,  c::Int32)  -> Int32   # dp4a.u32.s32
dp4a(a::UInt32, b::UInt32, c::UInt32) -> UInt32  # dp4a.u32.u32
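
For readers unfamiliar with the instruction, the semantics of the four signatures above can be sketched on the CPU. This is only an illustration: `dp4a_ref`, `_dp4a_ref`, `byte_at`, and `pack` are hypothetical names that do not appear in the PR, and the sketch assumes the 32-bit accumulation wraps on overflow.

```julia
# CPU-only reference sketch; `dp4a_ref`, `byte_at`, and `pack` are hypothetical
# names used for illustration and are not part of CUDA.jl.

# Byte `i` (0 = least significant) of a 32-bit word, widened to Int32:
# sign-extended for Int32 inputs, zero-extended for UInt32 inputs.
byte_at(x::Int32, i::Integer) = Int32((x >> (8i)) % Int8)
byte_at(x::UInt32, i::Integer) = Int32((x >> (8i)) % UInt8)

# c plus the sum of the four byte products, assuming wrap-around modulo 2^32.
function _dp4a_ref(a, b, c::T) where {T<:Union{Int32,UInt32}}
    acc = c
    for i in 0:3
        acc += (byte_at(a, i) * byte_at(b, i)) % T
    end
    return acc
end

# The four signedness combinations listed above.
dp4a_ref(a::Int32, b::Int32, c::Int32) = _dp4a_ref(a, b, c)
dp4a_ref(a::Int32, b::UInt32, c::Int32) = _dp4a_ref(a, b, c)
dp4a_ref(a::UInt32, b::Int32, c::Int32) = _dp4a_ref(a, b, c)
dp4a_ref(a::UInt32, b::UInt32, c::UInt32) = _dp4a_ref(a, b, c)

# Pack four bytes (least significant first) into one 32-bit word.
pack(::Type{T}, bytes) where {T<:Union{Int32,UInt32}} =
    reduce(|, (UInt32(b % UInt8) << (8 * (i - 1)) for (i, b) in enumerate(bytes))) % T

a = pack(Int32, Int8[1, 2, 3, 4])
b = pack(Int32, Int8[5, 6, 7, 8])
dp4a_ref(a, b, Int32(10))  # 10 + (1*5 + 2*6 + 3*7 + 4*8) = 80
```

The mixed variants differ only in how each operand's bytes are widened, which is why a single helper dispatching on the operand type covers all four.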

Non-exported, @device_function, added to @public — following the popc/byte_perm precedent. Useful for quantized int8 inference and similar integer-heavy kernels.

Implementation

Per the review discussion: on LLVM 21+ (which added @llvm.nvvm.idp4a.[us].[us]) the implementation uses the intrinsics via ccall; on older LLVM it falls back to inline PTX via @asmcall (same approach as nanosleep). Gated with @static if LLVM.version() >= v"21", following the existing pattern in math.jl.
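
The compile-time gating described above might look roughly like the following sketch. It is not the PR's code: `dp4a_codegen_path` is a hypothetical name, and `Base.libllvm_version` is used as a stand-in for `LLVM.version()` so that the snippet runs without LLVM.jl or a GPU. The branches return a symbol where the real implementation would contain the `ccall` or `@asmcall`.

```julia
# Compile-time selection between two code paths, keyed on the LLVM version
# Julia was built with. `Base.libllvm_version` is a stand-in assumption here;
# the PR itself gates on `LLVM.version()` from LLVM.jl.
function dp4a_codegen_path()
    @static if Base.libllvm_version >= v"21"
        return :intrinsic    # would call llvm.nvvm.idp4a.* via ccall
    else
        return :inline_ptx   # would emit the dp4a PTX via @asmcall
    end
end

dp4a_codegen_path()
```

Because `@static` evaluates the condition at macro-expansion time, only the selected branch is ever compiled, so a reference to an intrinsic unknown to an older LLVM never reaches the compiler.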

Testing

  • test/core/device/intrinsics/math.jl: GPU kernels for all four variants against a pure-Julia reference over edge cases (±127, -128, 255, all-ones, accumulator pass-through, mixed signs), plus a @device_code_ptx check pinning actual dp4a instruction selection (no emulation).
  • Both paths verified on a Quadro RTX 6000 (sm_75): asm path on Julia 1.11.9 (LLVM 16), intrinsic path on nightly (LLVM 21.1.8). Confirmed via @device_code_llvm that each run took the intended path; identical dp4a instruction selection and identical results (4608 cases per path, zero mismatches).
  • Instruction selection for the intrinsics additionally checked at the LLVM level: all four lower to their dp4a.* forms with the NVPTX_LLVM_Backend_jll llc (22.1.7).
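
The edge-case comparison described in the first bullet can be sketched without a GPU by cross-checking two independent CPU formulations of the signed-by-signed variant. This is an assumption-laden illustration, not the PR's test: the names `ref_shift`, `ref_bytes`, `bytes_s8`, `words`, and `accs` are hypothetical, the operand set is much smaller than the one in the PR, and the second CPU formulation stands in for the GPU kernel.

```julia
# Two independent CPU formulations of the signed-by-signed variant. In the PR,
# one side of the comparison is a GPU kernel; this sketch substitutes a second
# CPU formulation so it runs without a GPU. All names are hypothetical.

# Formulation 1: shift and truncate, accumulating in wrapping Int32 arithmetic.
function ref_shift(a::Int32, b::Int32, c::Int32)
    acc = c
    for i in 0:3
        acc += Int32((a >> (8i)) % Int8) * Int32((b >> (8i)) % Int8)
    end
    return acc
end

# Formulation 2: view each word as four Int8 values, sum in Int64, then wrap.
bytes_s8(x::Int32) = Int64.(reinterpret(Int8, [x]))
ref_bytes(a::Int32, b::Int32, c::Int32) =
    (Int64(c) + sum(bytes_s8(a) .* bytes_s8(b))) % Int32

# Edge-case operands: extreme bytes, all-ones, and mixed signs.
words = Int32[0, 1, -1, typemax(Int32), typemin(Int32),
              0x7f7f7f7f % Int32, 0x80808080 % Int32, 0x81ff007f % Int32]
accs = Int32[0, 1, -1, typemax(Int32), typemin(Int32)]

# Count disagreements over the full cross product of operands and accumulators.
mismatches = count(ref_shift(a, b, c) != ref_bytes(a, b, c)
                   for a in words, b in words, c in accs)
```

Summing in Int64 and truncating at the end gives the second formulation a different overflow path from the first, so agreement between the two also exercises the wrap-around assumption.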

Possible follow-up: dp2a (2-element int16×int8 dot product) — same pattern, separate PR if there's interest.

maleadt (Member) commented Jun 5, 2026

I guess that's one remaining limitation of #3162: intrinsics still have to exist in Julia's LLVM version. I guess we could relax ccall's semantics, but using @asmcall is fine for now.

maleadt (Member) left a comment

Let's wait for CI to come back first.

vchuravy (Member) commented Jun 5, 2026

I would prefer a version check, using the LLVM intrinsics once they become available in the host compiler.

maleadt (Member) commented Jun 5, 2026

We could also do something funky like emitting extern @llvm_new.foobar, rewriting in GPUCompiler, but that seems sketchy.

github-actions (bot, Contributor) left a comment

CUDA.jl Benchmarks

Benchmark suite, Current (f4117a9), Previous (642cf8d), Ratio (current/previous)
array/accumulate/Float32/1d 98912 ns 99547 ns 0.99
array/accumulate/Float32/dims=1 74506 ns 76355 ns 0.98
array/accumulate/Float32/dims=1L 1594515 ns 1597531 ns 1.00
array/accumulate/Float32/dims=2 139761 ns 140903 ns 0.99
array/accumulate/Float32/dims=2L 652929 ns 655167 ns 1.00
array/accumulate/Int64/1d 118243 ns 118905 ns 0.99
array/accumulate/Int64/dims=1 78569 ns 80219 ns 0.98
array/accumulate/Int64/dims=1L 1708093 ns 1709632 ns 1.00
array/accumulate/Int64/dims=2 152787 ns 155189 ns 0.98
array/accumulate/Int64/dims=2L 959258 ns 961346 ns 1.00
array/broadcast 18573 ns 18574 ns 1.00
array/construct 1211.7 ns 1233.9 ns 0.98
array/copy 16394 ns 16492 ns 0.99
array/copyto!/cpu_to_gpu 213529 ns 213722 ns 1.00
array/copyto!/gpu_to_cpu 278357 ns 280514 ns 0.99
array/copyto!/gpu_to_gpu 10368 ns 10351 ns 1.00
array/iteration/findall/bool 132797 ns 134807 ns 0.99
array/iteration/findall/int 146941 ns 148125 ns 0.99
array/iteration/findfirst/bool 69609 ns 70484 ns 0.99
array/iteration/findfirst/int 70978 ns 72098 ns 0.98
array/iteration/findmin/1d 66035 ns 70609 ns 0.94
array/iteration/findmin/2d 100451 ns 100855 ns 1.00
array/iteration/logical 190428 ns 196042 ns 0.97
array/iteration/scalar 65953 ns 65566 ns 1.01
array/permutedims/2d 49235 ns 50013 ns 0.98
array/permutedims/3d 50900 ns 51630 ns 0.99
array/permutedims/4d 50549 ns 51043 ns 0.99
array/random/rand/Float32 12272 ns 11830 ns 1.04
array/random/rand/Int64 23643 ns 23140 ns 1.02
array/random/rand!/Float32 9786 ns 9790.666666666666 ns 1.00
array/random/rand!/Int64 20509 ns 20689 ns 0.99
array/random/randn/Float32 35193 ns 35912 ns 0.98
array/random/randn!/Float32 27247 ns 24135 ns 1.13
array/reductions/mapreduce/Float32/1d 33200 ns 34130 ns 0.97
array/reductions/mapreduce/Float32/dims=1 38206 ns 38318 ns 1.00
array/reductions/mapreduce/Float32/dims=1L 49972 ns 50390 ns 0.99
array/reductions/mapreduce/Float32/dims=2 55224 ns 55724 ns 0.99
array/reductions/mapreduce/Float32/dims=2L 67278 ns 67834 ns 0.99
array/reductions/mapreduce/Int64/1d 39408 ns 39600 ns 1.00
array/reductions/mapreduce/Int64/dims=1 40947 ns 41356 ns 0.99
array/reductions/mapreduce/Int64/dims=1L 86433 ns 86763 ns 1.00
array/reductions/mapreduce/Int64/dims=2 57977 ns 58074 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 83108 ns 83222 ns 1.00
array/reductions/reduce/Float32/1d 33128 ns 34137 ns 0.97
array/reductions/reduce/Float32/dims=1 38135 ns 38527 ns 0.99
array/reductions/reduce/Float32/dims=1L 49967 ns 50380 ns 0.99
array/reductions/reduce/Float32/dims=2 55464 ns 55782 ns 0.99
array/reductions/reduce/Float32/dims=2L 68986 ns 69268 ns 1.00
array/reductions/reduce/Int64/1d 39307 ns 40084 ns 0.98
array/reductions/reduce/Int64/dims=1 40849 ns 40899 ns 1.00
array/reductions/reduce/Int64/dims=1L 86455 ns 86709 ns 1.00
array/reductions/reduce/Int64/dims=2 57301 ns 57670 ns 0.99
array/reductions/reduce/Int64/dims=2L 82630 ns 82979 ns 1.00
array/reverse/1d 16823 ns 17183 ns 0.98
array/reverse/1dL 67734 ns 68153 ns 0.99
array/reverse/1dL_inplace 65101 ns 65350 ns 1.00
array/reverse/1d_inplace 8132.333333333333 ns 9109.333333333334 ns 0.89
array/reverse/2d 20088 ns 20317 ns 0.99
array/reverse/2dL 71955 ns 72231 ns 1.00
array/reverse/2dL_inplace 65063 ns 65096 ns 1.00
array/reverse/2d_inplace 9524 ns 10596 ns 0.90
array/sorting/1d 2657913 ns 2651245 ns 1.00
array/sorting/2d 1039223 ns 1041142 ns 1.00
array/sorting/by 3193153 ns 3194553 ns 1.00
cuda/synchronization/context/auto 1144.4 ns 1150.7 ns 0.99
cuda/synchronization/context/blocking 915.2857142857143 ns 927.2857142857143 ns 0.99
cuda/synchronization/context/nonblocking 6151 ns 6099.4 ns 1.01
cuda/synchronization/stream/auto 1007.9 ns 1003.1818181818181 ns 1.00
cuda/synchronization/stream/blocking 814.2247191011236 ns 815.8571428571429 ns 1.00
cuda/synchronization/stream/nonblocking 6013.4 ns 5833.6 ns 1.03
integration/byval/reference 143145 ns 143277 ns 1.00
integration/byval/slices=1 145339 ns 145740 ns 1.00
integration/byval/slices=2 283807 ns 284135 ns 1.00
integration/byval/slices=3 422136 ns 422704 ns 1.00
integration/cudadevrt 101540 ns 101755 ns 1.00
integration/volumerhs 9099128 ns 9086362 ns 1.00
kernel/indexing 12507 ns 12648 ns 0.99
kernel/indexing_checked 13294 ns 13470 ns 0.99
kernel/launch 2023.111111111111 ns 2063.8888888888887 ns 0.98
kernel/occupancy 711.0335570469799 ns 719.1268656716418 ns 0.99
kernel/rand 14419 ns 13890 ns 1.04
latency/import 3863775132 ns 3854502988 ns 1.00
latency/precompile 4645361901 ns 4637764247 ns 1.00
latency/ttfp 4523735136 ns 4501425569 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Add CUDACore.dp4a with the four signedness variants of the PTX dp4a
instruction (packed 4-element int8/uint8 dot product with 32-bit
accumulate), available on sm_61 and later.

On LLVM 21 and later the implementation uses the @llvm.nvvm.idp4a.[us].[us]
intrinsics added in LLVM 21; on older versions it falls back to inline PTX
via @asmcall. Both paths verified on sm_75: identical dp4a instruction
selection and bit-identical results against a byte-wise reference, on
Julia 1.11 (LLVM 16, asm path) and nightly (LLVM 21, intrinsic path).

codecov (bot) commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.32%. Comparing base (aa47d7a) to head (99c2eed).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3163      +/-   ##
==========================================
- Coverage   16.33%   16.32%   -0.02%     
==========================================
  Files         124      124              
  Lines        9875     9875              
==========================================
- Hits         1613     1612       -1     
- Misses       8262     8263       +1     

Comment thread on test/core/device/intrinsics/math.jl (outdated)
# Verify the backend emits the actual dp4a instruction, not a
# software emulation sequence.
buf = CuArray{Int32}(undef, 1)
ptx = sprint(io->(@device_code_ptx io=io @cuda launch=false kernel_ss(buf, Int32(0), Int32(0), Int32(0))))

Member

Please make this a CUDA.code_ptx or so, no need to actually launch kernels just to inspect generated code.

JohnCobbler (Contributor, Author)

Switched the PTX check to CUDA.code_ptx; verified locally.
