cuBLASXt: only grant pool access to peer-capable devices #3166
Merged
cuMemPoolSetAccess on a fresh pool can succeed even when the devices are not peer capable, deferring the failure to a later allocation. Check can_access_peer before granting, and warn about devices that cannot access the pool, mirroring maybe_enable_peer_access in CUDACore.
Codecov Report
❌ Patch coverage is
Additional details and impacted files: Coverage Diff

| | main | #3166 | +/- |
|---|---|---|---|
| Coverage | 16.33% | 16.34% | +0.01% |
| Files | 124 | 124 | |
| Lines | 9875 | 9880 | +5 |
| Hits | 1613 | 1615 | +2 |
| Misses | 8262 | 8265 | +3 |
CUDA.jl Benchmarks
| Benchmark suite | Current: 434b095 | Previous: aa47d7a | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 99232 ns | 98976 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 75352 ns | 75470 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1596782 ns | 1595788 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 140288 ns | 140419 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 652706 ns | 653444 ns | 1.00 |
| array/accumulate/Int64/1d | 117981 ns | 118155 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 78945 ns | 78907 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1708040 ns | 1709506 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 154051 ns | 153939 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 959710 ns | 959330 ns | 1.00 |
| array/broadcast | 18251 ns | 18270 ns | 1.00 |
| array/construct | 1176.7 ns | 1198.4 ns | 0.98 |
| array/copy | 16211 ns | 16676 ns | 0.97 |
| array/copyto!/cpu_to_gpu | 211348 ns | 211135 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 279211 ns | 278832 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10288 ns | 10531 ns | 0.98 |
| array/iteration/findall/bool | 133016 ns | 131993 ns | 1.01 |
| array/iteration/findall/int | 146365 ns | 146745 ns | 1.00 |
| array/iteration/findfirst/bool | 111605 ns | 111631 ns | 1.00 |
| array/iteration/findfirst/int | 112190 ns | 111858 ns | 1.00 |
| array/iteration/findmin/1d | 65661 ns | 66902 ns | 0.98 |
| array/iteration/findmin/2d | 100624 ns | 100550 ns | 1.00 |
| array/iteration/logical | 190452 ns | 189124 ns | 1.01 |
| array/iteration/scalar | 64165 ns | 66015 ns | 0.97 |
| array/permutedims/2d | 49195 ns | 49598 ns | 0.99 |
| array/permutedims/3d | 50866 ns | 50240 ns | 1.01 |
| array/permutedims/4d | 50357 ns | 50411 ns | 1.00 |
| array/random/rand/Float32 | 11522 ns | 11982 ns | 0.96 |
| array/random/rand/Int64 | 23755 ns | 23515 ns | 1.01 |
| array/random/rand!/Float32 | 7917 ns | 8122 ns | 0.97 |
| array/random/rand!/Int64 | 20468 ns | 20501 ns | 1.00 |
| array/random/randn/Float32 | 34471 ns | 34458 ns | 1.00 |
| array/random/randn!/Float32 | 23757 ns | 24130 ns | 0.98 |
| array/reductions/mapreduce/Float32/1d | 32648 ns | 33763 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=1 | 37942 ns | 38228 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1L | 49702 ns | 50249 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 55322 ns | 55439 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 67093 ns | 67163 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 39301 ns | 40187 ns | 0.98 |
| array/reductions/mapreduce/Int64/dims=1 | 40840 ns | 40738 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 86135 ns | 86458 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57359 ns | 57724 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 82502 ns | 82751 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 33303 ns | 33392 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 37906 ns | 38091 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 49787 ns | 49963 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 55267 ns | 55491 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 68712 ns | 68898 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 38577 ns | 40022 ns | 0.96 |
| array/reductions/reduce/Int64/dims=1 | 40581 ns | 40672 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 86251 ns | 86301 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 57424 ns | 57311 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82572 ns | 82411 ns | 1.00 |
| array/reverse/1d | 16566 ns | 16807 ns | 0.99 |
| array/reverse/1dL | 67404 ns | 67676 ns | 1.00 |
| array/reverse/1dL_inplace | 65033 ns | 65187 ns | 1.00 |
| array/reverse/1d_inplace | 8102 ns | 9321.666666666666 ns | 0.87 |
| array/reverse/2d | 19833 ns | 19959 ns | 0.99 |
| array/reverse/2dL | 71728 ns | 71879 ns | 1.00 |
| array/reverse/2dL_inplace | 64897 ns | 65104 ns | 1.00 |
| array/reverse/2d_inplace | 9461 ns | 11067 ns | 0.85 |
| array/sorting/1d | 2650181 ns | 2655417 ns | 1.00 |
| array/sorting/2d | 1035694 ns | 1038734 ns | 1.00 |
| array/sorting/by | 3178013 ns | 3192232 ns | 1.00 |
| cuda/synchronization/context/auto | 1144.2 ns | 1131.5 ns | 1.01 |
| cuda/synchronization/context/blocking | 923.7 ns | 952.2173913043479 ns | 0.97 |
| cuda/synchronization/context/nonblocking | 6161.6 ns | 6097.8 ns | 1.01 |
| cuda/synchronization/stream/auto | 986.9285714285714 ns | 1004.5 ns | 0.98 |
| cuda/synchronization/stream/blocking | 800.8586956521739 ns | 825.6363636363636 ns | 0.97 |
| cuda/synchronization/stream/nonblocking | 5866.8 ns | 6045.333333333333 ns | 0.97 |
| integration/byval/reference | 143021 ns | 143141 ns | 1.00 |
| integration/byval/slices=1 | 145015 ns | 145110 ns | 1.00 |
| integration/byval/slices=2 | 283532 ns | 283495 ns | 1.00 |
| integration/byval/slices=3 | 421546 ns | 422045 ns | 1.00 |
| integration/cudadevrt | 101462 ns | 101557 ns | 1.00 |
| integration/volumerhs | 9080052 ns | 9077766 ns | 1.00 |
| kernel/indexing | 12566 ns | 12534 ns | 1.00 |
| kernel/indexing_checked | 13329 ns | 13291 ns | 1.00 |
| kernel/launch | 2048.5555555555557 ns | 2072.5555555555557 ns | 0.99 |
| kernel/occupancy | 715.9782608695652 ns | 716.9097744360902 ns | 1.00 |
| kernel/rand | 13632 ns | 13723 ns | 0.99 |
| latency/import | 3875036193 ns | 3841987489 ns | 1.01 |
| latency/precompile | 4621167021 ns | 4621684240 ns | 1.00 |
| latency/ttfp | 4491013832 ns | 4482964065 ns | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.
As offered in #3165:
`xt_handle()` grants pool access to all participating devices without checking `can_access_peer`, unlike `maybe_enable_peer_access` in CUDACore. Per the CUDA docs, an unchecked `cuMemPoolSetAccess` on a fresh pool can succeed on non-peer-capable devices, deferring the failure to a later allocation.

This checks peer capability first, warns (once) about devices that can't access a pool, and only grants access where valid.
Caveat up front: I only have a single GPU, so this is untested on actual multi-GPU hardware, and per the issue it likely isn't the whole story behind the NaN readbacks. Feel free to close if you'd rather investigate the root cause first.
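
For readers without the diff at hand, the check-then-grant logic described above can be sketched in plain Julia. This is only an illustration of the idea, not the code in this PR: `split_by_peer_access` is a hypothetical helper, devices are plain integers, and `can_access` is an injected stand-in for a real peer-capability query, so the sketch runs without a GPU.

```julia
# Hypothetical helper: split candidate devices by whether they can reach the
# memory of the device that owns the pool. `can_access` is an injected
# predicate standing in for a real peer-capability query.
function split_by_peer_access(can_access, owner, devices)
    granted = eltype(devices)[]
    skipped = eltype(devices)[]
    for dev in devices
        # The owning device can always use its own pool.
        if dev == owner || can_access(dev, owner)
            push!(granted, dev)
        else
            push!(skipped, dev)
        end
    end
    return granted, skipped
end

# Toy topology (an assumption for the example): devices 0 and 1 are peers,
# device 2 is isolated.
peers = Set([(0, 1), (1, 0)])
can_access(dev, owner) = (dev, owner) in peers

granted, skipped = split_by_peer_access(can_access, 0, [0, 1, 2])

# `maxlog=1` makes this call site emit the warning only once.
isempty(skipped) || @warn "Devices $(skipped) cannot access the pool on device 0" maxlog=1

println("grant access to: ", granted)
println("skipped: ", skipped)
```

In a real handle constructor, only the `granted` devices would then be passed to the pool-access call, so a non-peer-capable device is reported up front instead of failing at a later allocation.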