
cuBLASXt: only grant pool access to peer-capable devices #3166

Merged: maleadt merged 1 commit into JuliaGPU:main from JohnCobbler:fix/cublasxt-peer-check on Jun 10, 2026


@JohnCobbler

Contributor

As offered in #3165: xt_handle() grants pool access to all participating devices without checking can_access_peer, unlike maybe_enable_peer_access in CUDACore. Per the CUDA docs, an unchecked cuMemPoolSetAccess on a fresh pool can succeed on non-peer-capable devices, deferring the failure to a later allocation.

This checks peer capability first, warns (once) about devices that can't access a pool, and only grants access where valid.
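As a rough illustration of the check-then-grant logic described above (not the actual CUDA.jl diff), here is a minimal Python sketch. The names `grant_pool_access`, `can_access_peer`, and `grant_access` are hypothetical: the two callables stand in for the peer-capability query and for the `cuMemPoolSetAccess` call, and the three-device topology is made up.

```python
import warnings
from typing import Callable, Iterable, List


def grant_pool_access(
    owner: int,
    devices: Iterable[int],
    can_access_peer: Callable[[int, int], bool],  # hypothetical: (accessor, owner) -> bool
    grant_access: Callable[[int, int], None],     # hypothetical stand-in for cuMemPoolSetAccess
) -> List[int]:
    """Grant access to `owner`'s pool only where the accessor is peer capable.

    Returns the devices that were skipped because they cannot access the pool.
    """
    skipped = []
    for dev in devices:
        if dev == owner:
            continue  # the owning device needs no peer grant for its own pool
        if can_access_peer(dev, owner):
            grant_access(owner, dev)
        else:
            skipped.append(dev)
    if skipped:
        # One warning per call, listing every device that was left out.
        warnings.warn(f"devices {skipped} cannot access the pool of device {owner}")
    return skipped


# Made-up topology: devices 0 and 1 are peers, device 2 is isolated.
peers = {(0, 1), (1, 0)}
granted = []
skipped = grant_pool_access(
    0,
    [0, 1, 2],
    can_access_peer=lambda accessor, own: (accessor, own) in peers,
    grant_access=lambda own, dev: granted.append((own, dev)),
)
# skipped == [2], granted == [(0, 1)]
```

Checking before granting means a non-peer-capable device surfaces as a warning at setup time instead of as a failure in a later allocation.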

Caveat up front: I only have a single GPU, so this is untested on actual multi-GPU hardware, and per the issue it likely isn't the whole story behind the NaN readbacks. Feel free to close if you'd rather investigate the root cause first.

cuMemPoolSetAccess on a fresh pool can succeed even when the devices are
not peer capable, deferring the failure to a later allocation. Check
can_access_peer before granting, and warn about devices that cannot
access the pool, mirroring maybe_enable_peer_access in CUDACore.
@codecov

codecov Bot commented Jun 5, 2026


Codecov Report

❌ Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 16.34%. Comparing base (aa47d7a) to head (434b095).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lib/cublas/src/cuBLAS.jl | 66.66% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3166      +/-   ##
==========================================
+ Coverage   16.33%   16.34%   +0.01%     
==========================================
  Files         124      124              
  Lines        9875     9880       +5     
==========================================
+ Hits         1613     1615       +2     
- Misses       8262     8265       +3     


@github-actions github-actions Bot left a comment

Contributor


CUDA.jl Benchmarks

| Benchmark suite | Current: 434b095 | Previous: aa47d7a | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 99232 ns | 98976 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 75352 ns | 75470 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1596782 ns | 1595788 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 140288 ns | 140419 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 652706 ns | 653444 ns | 1.00 |
| array/accumulate/Int64/1d | 117981 ns | 118155 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 78945 ns | 78907 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1708040 ns | 1709506 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 154051 ns | 153939 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 959710 ns | 959330 ns | 1.00 |
| array/broadcast | 18251 ns | 18270 ns | 1.00 |
| array/construct | 1176.7 ns | 1198.4 ns | 0.98 |
| array/copy | 16211 ns | 16676 ns | 0.97 |
| array/copyto!/cpu_to_gpu | 211348 ns | 211135 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 279211 ns | 278832 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10288 ns | 10531 ns | 0.98 |
| array/iteration/findall/bool | 133016 ns | 131993 ns | 1.01 |
| array/iteration/findall/int | 146365 ns | 146745 ns | 1.00 |
| array/iteration/findfirst/bool | 111605 ns | 111631 ns | 1.00 |
| array/iteration/findfirst/int | 112190 ns | 111858 ns | 1.00 |
| array/iteration/findmin/1d | 65661 ns | 66902 ns | 0.98 |
| array/iteration/findmin/2d | 100624 ns | 100550 ns | 1.00 |
| array/iteration/logical | 190452 ns | 189124 ns | 1.01 |
| array/iteration/scalar | 64165 ns | 66015 ns | 0.97 |
| array/permutedims/2d | 49195 ns | 49598 ns | 0.99 |
| array/permutedims/3d | 50866 ns | 50240 ns | 1.01 |
| array/permutedims/4d | 50357 ns | 50411 ns | 1.00 |
| array/random/rand/Float32 | 11522 ns | 11982 ns | 0.96 |
| array/random/rand/Int64 | 23755 ns | 23515 ns | 1.01 |
| array/random/rand!/Float32 | 7917 ns | 8122 ns | 0.97 |
| array/random/rand!/Int64 | 20468 ns | 20501 ns | 1.00 |
| array/random/randn/Float32 | 34471 ns | 34458 ns | 1.00 |
| array/random/randn!/Float32 | 23757 ns | 24130 ns | 0.98 |
| array/reductions/mapreduce/Float32/1d | 32648 ns | 33763 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=1 | 37942 ns | 38228 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1L | 49702 ns | 50249 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 55322 ns | 55439 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 67093 ns | 67163 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 39301 ns | 40187 ns | 0.98 |
| array/reductions/mapreduce/Int64/dims=1 | 40840 ns | 40738 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 86135 ns | 86458 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57359 ns | 57724 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 82502 ns | 82751 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 33303 ns | 33392 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 37906 ns | 38091 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 49787 ns | 49963 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 55267 ns | 55491 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 68712 ns | 68898 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 38577 ns | 40022 ns | 0.96 |
| array/reductions/reduce/Int64/dims=1 | 40581 ns | 40672 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 86251 ns | 86301 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 57424 ns | 57311 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82572 ns | 82411 ns | 1.00 |
| array/reverse/1d | 16566 ns | 16807 ns | 0.99 |
| array/reverse/1dL | 67404 ns | 67676 ns | 1.00 |
| array/reverse/1dL_inplace | 65033 ns | 65187 ns | 1.00 |
| array/reverse/1d_inplace | 8102 ns | 9321.666666666666 ns | 0.87 |
| array/reverse/2d | 19833 ns | 19959 ns | 0.99 |
| array/reverse/2dL | 71728 ns | 71879 ns | 1.00 |
| array/reverse/2dL_inplace | 64897 ns | 65104 ns | 1.00 |
| array/reverse/2d_inplace | 9461 ns | 11067 ns | 0.85 |
| array/sorting/1d | 2650181 ns | 2655417 ns | 1.00 |
| array/sorting/2d | 1035694 ns | 1038734 ns | 1.00 |
| array/sorting/by | 3178013 ns | 3192232 ns | 1.00 |
| cuda/synchronization/context/auto | 1144.2 ns | 1131.5 ns | 1.01 |
| cuda/synchronization/context/blocking | 923.7 ns | 952.2173913043479 ns | 0.97 |
| cuda/synchronization/context/nonblocking | 6161.6 ns | 6097.8 ns | 1.01 |
| cuda/synchronization/stream/auto | 986.9285714285714 ns | 1004.5 ns | 0.98 |
| cuda/synchronization/stream/blocking | 800.8586956521739 ns | 825.6363636363636 ns | 0.97 |
| cuda/synchronization/stream/nonblocking | 5866.8 ns | 6045.333333333333 ns | 0.97 |
| integration/byval/reference | 143021 ns | 143141 ns | 1.00 |
| integration/byval/slices=1 | 145015 ns | 145110 ns | 1.00 |
| integration/byval/slices=2 | 283532 ns | 283495 ns | 1.00 |
| integration/byval/slices=3 | 421546 ns | 422045 ns | 1.00 |
| integration/cudadevrt | 101462 ns | 101557 ns | 1.00 |
| integration/volumerhs | 9080052 ns | 9077766 ns | 1.00 |
| kernel/indexing | 12566 ns | 12534 ns | 1.00 |
| kernel/indexing_checked | 13329 ns | 13291 ns | 1.00 |
| kernel/launch | 2048.5555555555557 ns | 2072.5555555555557 ns | 0.99 |
| kernel/occupancy | 715.9782608695652 ns | 716.9097744360902 ns | 1.00 |
| kernel/rand | 13632 ns | 13723 ns | 0.99 |
| latency/import | 3875036193 ns | 3841987489 ns | 1.01 |
| latency/precompile | 4621167021 ns | 4621684240 ns | 1.00 |
| latency/ttfp | 4491013832 ns | 4482964065 ns | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt maleadt left a comment

Member


LGTM, thanks.

@maleadt maleadt merged commit fdb0f83 into JuliaGPU:main Jun 10, 2026
2 checks passed