Add dp4a device intrinsic #3163
I guess that's one remaining limitation of #3162: intrinsics still have to exist in Julia's LLVM version. I guess we could relax
maleadt left a comment:
Let's wait for CI to come back first.
I would prefer a version check and using the LLVM intrinsics when they become available in the host compiler.

We could also do something funky like emitting
CUDA.jl Benchmarks
| Benchmark suite | Current: f4117a9 | Previous: 642cf8d | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 98912 ns | 99547 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 74506 ns | 76355 ns | 0.98 |
| array/accumulate/Float32/dims=1L | 1594515 ns | 1597531 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 139761 ns | 140903 ns | 0.99 |
| array/accumulate/Float32/dims=2L | 652929 ns | 655167 ns | 1.00 |
| array/accumulate/Int64/1d | 118243 ns | 118905 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 78569 ns | 80219 ns | 0.98 |
| array/accumulate/Int64/dims=1L | 1708093 ns | 1709632 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 152787 ns | 155189 ns | 0.98 |
| array/accumulate/Int64/dims=2L | 959258 ns | 961346 ns | 1.00 |
| array/broadcast | 18573 ns | 18574 ns | 1.00 |
| array/construct | 1211.7 ns | 1233.9 ns | 0.98 |
| array/copy | 16394 ns | 16492 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 213529 ns | 213722 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 278357 ns | 280514 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 10368 ns | 10351 ns | 1.00 |
| array/iteration/findall/bool | 132797 ns | 134807 ns | 0.99 |
| array/iteration/findall/int | 146941 ns | 148125 ns | 0.99 |
| array/iteration/findfirst/bool | 69609 ns | 70484 ns | 0.99 |
| array/iteration/findfirst/int | 70978 ns | 72098 ns | 0.98 |
| array/iteration/findmin/1d | 66035 ns | 70609 ns | 0.94 |
| array/iteration/findmin/2d | 100451 ns | 100855 ns | 1.00 |
| array/iteration/logical | 190428 ns | 196042 ns | 0.97 |
| array/iteration/scalar | 65953 ns | 65566 ns | 1.01 |
| array/permutedims/2d | 49235 ns | 50013 ns | 0.98 |
| array/permutedims/3d | 50900 ns | 51630 ns | 0.99 |
| array/permutedims/4d | 50549 ns | 51043 ns | 0.99 |
| array/random/rand/Float32 | 12272 ns | 11830 ns | 1.04 |
| array/random/rand/Int64 | 23643 ns | 23140 ns | 1.02 |
| array/random/rand!/Float32 | 9786 ns | 9790.666666666666 ns | 1.00 |
| array/random/rand!/Int64 | 20509 ns | 20689 ns | 0.99 |
| array/random/randn/Float32 | 35193 ns | 35912 ns | 0.98 |
| array/random/randn!/Float32 | 27247 ns | 24135 ns | 1.13 |
| array/reductions/mapreduce/Float32/1d | 33200 ns | 34130 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=1 | 38206 ns | 38318 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 49972 ns | 50390 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 55224 ns | 55724 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 67278 ns | 67834 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 39408 ns | 39600 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 40947 ns | 41356 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1L | 86433 ns | 86763 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57977 ns | 58074 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 83108 ns | 83222 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 33128 ns | 34137 ns | 0.97 |
| array/reductions/reduce/Float32/dims=1 | 38135 ns | 38527 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 49967 ns | 50380 ns | 0.99 |
| array/reductions/reduce/Float32/dims=2 | 55464 ns | 55782 ns | 0.99 |
| array/reductions/reduce/Float32/dims=2L | 68986 ns | 69268 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 39307 ns | 40084 ns | 0.98 |
| array/reductions/reduce/Int64/dims=1 | 40849 ns | 40899 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 86455 ns | 86709 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 57301 ns | 57670 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2L | 82630 ns | 82979 ns | 1.00 |
| array/reverse/1d | 16823 ns | 17183 ns | 0.98 |
| array/reverse/1dL | 67734 ns | 68153 ns | 0.99 |
| array/reverse/1dL_inplace | 65101 ns | 65350 ns | 1.00 |
| array/reverse/1d_inplace | 8132.333333333333 ns | 9109.333333333334 ns | 0.89 |
| array/reverse/2d | 20088 ns | 20317 ns | 0.99 |
| array/reverse/2dL | 71955 ns | 72231 ns | 1.00 |
| array/reverse/2dL_inplace | 65063 ns | 65096 ns | 1.00 |
| array/reverse/2d_inplace | 9524 ns | 10596 ns | 0.90 |
| array/sorting/1d | 2657913 ns | 2651245 ns | 1.00 |
| array/sorting/2d | 1039223 ns | 1041142 ns | 1.00 |
| array/sorting/by | 3193153 ns | 3194553 ns | 1.00 |
| cuda/synchronization/context/auto | 1144.4 ns | 1150.7 ns | 0.99 |
| cuda/synchronization/context/blocking | 915.2857142857143 ns | 927.2857142857143 ns | 0.99 |
| cuda/synchronization/context/nonblocking | 6151 ns | 6099.4 ns | 1.01 |
| cuda/synchronization/stream/auto | 1007.9 ns | 1003.1818181818181 ns | 1.00 |
| cuda/synchronization/stream/blocking | 814.2247191011236 ns | 815.8571428571429 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 6013.4 ns | 5833.6 ns | 1.03 |
| integration/byval/reference | 143145 ns | 143277 ns | 1.00 |
| integration/byval/slices=1 | 145339 ns | 145740 ns | 1.00 |
| integration/byval/slices=2 | 283807 ns | 284135 ns | 1.00 |
| integration/byval/slices=3 | 422136 ns | 422704 ns | 1.00 |
| integration/cudadevrt | 101540 ns | 101755 ns | 1.00 |
| integration/volumerhs | 9099128 ns | 9086362 ns | 1.00 |
| kernel/indexing | 12507 ns | 12648 ns | 0.99 |
| kernel/indexing_checked | 13294 ns | 13470 ns | 0.99 |
| kernel/launch | 2023.111111111111 ns | 2063.8888888888887 ns | 0.98 |
| kernel/occupancy | 711.0335570469799 ns | 719.1268656716418 ns | 0.99 |
| kernel/rand | 14419 ns | 13890 ns | 1.04 |
| latency/import | 3863775132 ns | 3854502988 ns | 1.00 |
| latency/precompile | 4645361901 ns | 4637764247 ns | 1.00 |
| latency/ttfp | 4523735136 ns | 4501425569 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Force-pushed from 682ea43 to 99c2eed.
Add CUDACore.dp4a with the four signedness variants of the PTX dp4a instruction (packed 4-element int8/uint8 dot product with 32-bit accumulate), available on sm_61 and later. On LLVM 21 and later the implementation uses the @llvm.nvvm.idp4a.[us].[us] intrinsics added in LLVM 21; on older versions it falls back to inline PTX via @asmcall. Both paths verified on sm_75: identical dp4a instruction selection and bit-identical results against a byte-wise reference, on Julia 1.11 (LLVM 16, asm path) and nightly (LLVM 21, intrinsic path).
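The compile-time gate described in this commit message can be illustrated with a small host-side sketch. It is not the PR's code: `dp4a_path` is a hypothetical name, and the check uses `Base.libllvm_version` (the LLVM version Julia was built against) as a stand-in for the `LLVM.version()` call mentioned in the PR, so that the snippet runs without any packages.

```julia
# Sketch only: `dp4a_path` is a hypothetical helper, and Base.libllvm_version
# stands in for the LLVM.version() check named in the PR.

@static if Base.libllvm_version >= v"21"
    # Newer LLVM: this is where the @llvm.nvvm.idp4a.* intrinsics would be used.
    dp4a_path() = :intrinsic
else
    # Older LLVM: this is where the inline-PTX fallback would be used.
    dp4a_path() = :inline_ptx
end

println("LLVM ", Base.libllvm_version, " selects the ", dp4a_path(), " path")
```

Because `@static if` evaluates its condition when the code is expanded, only one of the two method definitions exists in a given session, so the unused branch never has to compile against an LLVM that lacks the intrinsic.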
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| | main | #3163 | +/- |
|---|---|---|---|
| Coverage | 16.33% | 16.32% | -0.02% |
| Files | 124 | 124 | |
| Lines | 9875 | 9875 | |
| Hits | 1613 | 1612 | -1 |
| Misses | 8262 | 8263 | +1 |
    # Verify the backend emits the actual dp4a instruction, not a
    # software emulation sequence.
    buf = CuArray{Int32}(undef, 1)
    ptx = sprint(io->(@device_code_ptx io=io @cuda launch=false kernel_ss(buf, Int32(0), Int32(0), Int32(0))))
Please make this a CUDA.code_ptx or so, no need to actually launch kernels just to inspect generated code.
switched PTX check to CUDA.code_ptx; verified locally |
Adds `CUDACore.dp4a`: the packed 4-element int8/uint8 dot product with 32-bit accumulate (a single PTX `dp4a` instruction, sm_61+), in its four signedness variants.

Non-exported, `@device_function`, added to `@public`, following the `popc`/`byte_perm` precedent. Useful for quantized int8 inference and similar integer-heavy kernels.

Implementation

Per the review discussion: on LLVM 21+ (which added `@llvm.nvvm.idp4a.[us].[us]`) the implementation uses the intrinsics via `ccall`; on older LLVM it falls back to inline PTX via `@asmcall` (same approach as `nanosleep`). Gated with `@static if LLVM.version() >= v"21"`, following the existing pattern in `math.jl`.

Testing

- `test/core/device/intrinsics/math.jl`: GPU kernels for all four variants against a pure-Julia reference over edge cases (±127, -128, 255, all-ones, accumulator pass-through, mixed signs), plus a `@device_code_ptx` check pinning actual `dp4a` instruction selection (no emulation).
- […] `@device_code_llvm` that each run took the intended path; identical `dp4a` instruction selection and identical results (4608 cases per path, zero mismatches).
- […] `dp4a.*` forms with the `NVPTX_LLVM_Backend_jll` llc (22.1.7).

Possible follow-up: `dp2a` (2-element int16×int8 dot product), same pattern, separate PR if there's interest.
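For readers unfamiliar with the operation, the kind of pure-Julia byte-wise reference mentioned under Testing might look roughly like the sketch below. It is not the test code from the PR: `pack4`, `unpack4`, and `dp4a_ref` are hypothetical names, and for simplicity the accumulator is always an `Int32`, which is bit-identical to an unsigned accumulate under 32-bit wraparound.

```julia
# Sketch only: pack4, unpack4 and dp4a_ref are hypothetical helpers,
# not part of CUDA.jl or of this PR.

# Pack four bytes (least-significant first) into one 32-bit word.
pack4(bytes::NTuple{4,<:Union{Int8,UInt8}}) =
    reduce(|, ntuple(i -> UInt32(bytes[i] % UInt8) << (8 * (i - 1)), 4))

# Split a 32-bit word into four bytes, reinterpreting each as T (Int8 or UInt8).
unpack4(::Type{T}, x::Union{Int32,UInt32}) where {T<:Union{Int8,UInt8}} =
    ntuple(i -> (x >> (8 * (i - 1))) % T, 4)

# Reference dot product: c + sum(a[i] * b[i]) with wrapping Int32 arithmetic.
# TA and TB select the signedness of the bytes in a and b respectively.
function dp4a_ref(::Type{TA}, ::Type{TB}, a, b, c::Int32) where {TA,TB}
    as = unpack4(TA, a)
    bs = unpack4(TB, b)
    acc = c
    for i in 1:4
        acc += Int32(as[i]) * Int32(bs[i])
    end
    return acc
end

a = pack4((Int8(-128), Int8(127), Int8(-1), Int8(0)))
b = pack4((Int8(-128), Int8(127), Int8(-1), Int8(1)))

# The same bit patterns give different results depending on signedness.
println(dp4a_ref(Int8,  Int8,  a, b, Int32(0)))   # signed x signed
println(dp4a_ref(UInt8, UInt8, a, b, Int32(0)))   # unsigned x unsigned
println(dp4a_ref(UInt8, Int8,  a, b, Int32(0)))   # unsigned x signed
```

The same packed words yield different sums under each interpretation, which is why edge cases such as -128 and 255 are useful when comparing a device result against a reference like this one.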