Conversation
|
Your PR no longer requires formatting changes. Thank you for your contribution! |
lkdvos
left a comment
There was a problem hiding this comment.
This doesn't fully fix the issue I think, that code path shouldn't ever be reached by the GPU arrays since it is guarded by an isblasmatrix call that checks pointer(A) isa Ptr.
I think this really needs a more proper rewrite that dispatches to a gemm function that then indeed can determine the proper driver.
Note also that the current fallback is using the Strided machinery to manually write out the kernel, which is actually equivalent to what the generic_matmatmul! function does anyways
|
|
It seems to have unblocked the v1 AMD stuff on TO 🤷 . But if someone wants to make a higher performance version, go ahead. The rocBLAS gemm doesn't work well if the stride in the first dimension isn't 1, I think. |
Codecov Report❌ Patch coverage is
... and 1 file with indirect coverage changes 🚀 New features to boost your workflow:
|
|
Would it then not make more sense to fix the |
|
So I think the CUDA one doesn't ever touch this because the result of |
|
Yes, exactly, I think my argument is to either: |
|
Finally got back to this, looks like it should be ok now? |
|
After a Zulip discussion, we decided to try inserting a new |
lkdvos
left a comment
There was a problem hiding this comment.
Only have some comments about code coverage, otherwise looks very clean!
|
Added some more tests to trigger the fallback to the GPUArrays kernels |
|
Can we just exhaustively do all the mul cases that trigger the BLAS path too? I don't think we're using the adjoint path right now for example |
|
I apologize strongly for this very inefficient back-and-forth, I'll be at my desk again next week so then things should hopefully be better. I think now the non-BLAS codepaths are indeed checked, but I was really mostly talking about exhaustively testing the various BLAS paths with the different values of transposed, conjugated etc, since that seems like the more brittle part (basically replacing the |
|
It's cool, no worries. I can also test a BLAS one (with all 1 strides) for each f1, f2 combo? |
|
Yeah, I think I was mostly worried about the transpose and adjoint versions, where the |
|
I'll try to get to it tonight or tomorrow :) |
|
Good call as I ended up finding and fixing a bug |
lkdvos
left a comment
There was a problem hiding this comment.
Thanks for adding the tests!
Overall looks good to me, left a small note about the gemm call for AMD, but otherwise good to go!
| A2a = Base.unsafe_wrap(ROCMatrix{T}, pointer(A2), size(A2)) | ||
| B2a = Base.unsafe_wrap(ROCMatrix{T}, pointer(B2), size(B2)) | ||
| C2a = Base.unsafe_wrap(ROCMatrix{T}, pointer(C2), size(C2)) |
There was a problem hiding this comment.
I just checked the actual blas wrapper implementation, do you think it would be worth it to simply call that directly, so we can actually deal with the cases where the stride is not equal to the size? (https://github.com/JuliaGPU/AMDGPU.jl/blob/f49923a4c13b06325ff32696952b34d5ec73998f/src/blas/wrappers.jl#L547-L566)
Alternatively, we could also keep that for MatrixAlgebraKit and future implementations, I'm happy to already get this in as well.
There was a problem hiding this comment.
I considered doing that but then we have to handle the library handle etc ourselves, which I'd prefer not to do. I'd say merge for now and we can revisit if needed.
There was a problem hiding this comment.
So that this silently fail for matrices that do not have stride(A, 1) = size(A, 1) ?
Would it have been an option to do something like
A2a = view(Base.unsafe_wrap(ROCMatrix{T}, pointer(A2), (stride(A2, 1), size(A2, 2)), 1:size(A2, 1), :)There was a problem hiding this comment.
Oh I see this is at least checked in a specialized isblasmatrix version.
This will help us use the new support for generic
GPUArraystrided views in a way that bypasses some really awful ambiguity warnings.