Minimal GPU example which preserves the element routine #1291
termi-official wants to merge 37 commits into master from
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## master #1291 +/- ##
==========================================
- Coverage 94.25% 92.59% -1.66%
==========================================
Files 40 47 +7
Lines 6750 6997 +247
==========================================
+ Hits 6362 6479 +117
- Misses 388 518 +130
ed8eb72 to 2ce407b (Compare)
fdff415 to b12d500 (Compare)
FunctionValues

```diff
- struct FunctionValues{DiffOrder, IP, N_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
+ struct FunctionValues{DiffOrder, IP, Nx_t, Nξ_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
```
This is the only intrusive part of this PR. I have decided to allow the arrays to be stored here directly, to avoid adding helper structs that just mirror the fields as arrays. Note, however, that passing an array here breaks almost all calls on the struct.
We can also explore the path of having helper structs and introduce some syntactic sugar to query the parts via e.g. fv[i] instead of get_substruct.
… integrated as a full test. I will split it up later.
KnutAM
left a comment
This is really cool and I'm really excited for this feature!
I've now focused on the how-to to get a bit of understanding of how this would look from a user perspective. So on the user-facing side we are introducing these APIs:
1. `adapt` dispatches for `DofHandler` and `ConstraintHandler`
2. Support for allocation and assembly of `CuSparseMatrixCSC`
3. `CellCacheContainer`, `CellValuesContainer`, and `FacetValuesContainer` (assembler?)

(Where my suggestion is to replace nr 3 with `distribute_to_workers` or similar.)
Did I miss any new API?
```julia
ch_gpu = adapt(backend, ch)
apply!(K_gpu, f_gpu, ch_gpu)
u_gpu = SparseMatrixCSC(K_gpu) \ Vector(f_gpu)
```
Can we use LinearSolve.jl instead to avoid transferring? Just that it is nice :D
Not sure if I would do that here. We can point the user towards LinearSolve being a possibility. But we do not have optimal preconditioners in LinearSolve.jl yet. To not make the PR too large, should I leave this for now as-is and add direct support for AMGX in a subsequent PR?
No, I agree that we shouldn't add that here. I was more thinking of using a direct solver that seems to be supported, e.g.

```julia
using CUDSS
sol = LS.solve(prob, LS.LUFactorization())
```

But not important.
```julia
# Technically we can also just get one Ke or fe per worker, but for demonstration
# purposes we allocate the full block here for element-assembly style matrix-free GPU
# usage.
Kes = KA.zeros(backend, Float32, getncells(grid), getnbasefunctions(cv), getnbasefunctions(cv))
fes = KA.zeros(backend, Float32, getncells(grid), getnbasefunctions(cv))
```
How is this better than allocating the full matrix? Seems like this would allocate a huge amount of memory compared to the global matrix? Is it simply to avoid locks/coloring?
In the final version probably nice to stick with one approach as a first step for users to do GPU-FEM - and have other how-tos build upon that.
Let me answer here in reverse order.
In the final version probably nice to stick with one approach as a first step.
Please note that this is not a tutorial, and I think we need to educate users here a bit. No matter what you do, if you use CSR or CSC on the GPU your solve performance will not be great, because the arithmetic intensity of the spmv kernel is far too low. Using CSR/CSC is not recommended. However, if the user does not know how to precondition their iterative solver, it is typically better to use these formats anyway to get things started, so the solve step can be optimized separately later. I can clarify this further in the how-to if you want.
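To make the arithmetic-intensity point concrete, here is a back-of-the-envelope sketch. The numbers are mine, not from the PR, and assume Float64 values with Int32 column indices while ignoring the reads of `x` and writes of `y`:

```julia
# Rough arithmetic-intensity estimate for a CSR/CSC spmv kernel.
# Per stored nonzero we perform 2 flops (multiply + add) but must load at
# least one matrix value and one column index from memory.
function csr_spmv_intensity(; value_bytes = 8, index_bytes = 4)
    flops_per_nnz = 2.0
    bytes_per_nnz = value_bytes + index_bytes
    return flops_per_nnz / bytes_per_nnz
end

csr_spmv_intensity()  # ≈ 0.17 flops/byte, far below the balance point of modern GPUs
```

At that intensity the kernel is firmly memory-bound, which is why CSR/CSC is treated here only as a starting point before better-preconditioned or matrix-free approaches.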
How is this better than allocating the full matrix?
I essentially just want to show users here that we can very easily build standard matrix-free data structures. The 3-dimensional Ke block is really just what is called "element-assembly" and is already sufficient to set up fast matrix-vector products on the GPU. Since in the past I was repeatedly tasked to chunk up PRs which do several things at once (to see how things integrate), this one is now an isolated change which does just a single thing: port the heat example with minimal effort (for the users) to the GPU.
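To illustrate what element-assembly buys: with the `(ncells × n × n)` block of `Ke`s, the product `y = K * u` can be formed without ever assembling the global matrix. A minimal CPU-only sketch with a hypothetical `dofs` layout (not the PR's actual kernel):

```julia
# Sketch: matrix-free y = K*u via element-assembly. `Kes` is a
# (ncells × ndofs_per_cell × ndofs_per_cell) array and `dofs` maps each
# cell to its global dof indices. Names are illustrative, not Ferrite API.
function element_matvec!(y, Kes, dofs, u)
    fill!(y, 0)
    ncells, n, _ = size(Kes)
    for c in 1:ncells
        cdofs = view(dofs, c, :)
        ue = u[cdofs]            # gather the local solution
        Ke = view(Kes, c, :, :)
        y[cdofs] .+= Ke * ue     # scatter-add the local contribution
    end
    return y
end
```

On the GPU the cell loop becomes the kernel grid, so this maps directly onto the per-worker layout allocated above.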
```julia
cv_gpu = CellValuesContainer(backend, n_workers, cv)
cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
```
As discussed on slack, I don't think we should add all these different versions/types as user-facing, but rather provide an API that can be overloaded with the specific type, which returns an AbstractArray whose element type is similar to the type of the argument. Putting it here for others to comment on.
```diff
- cv_gpu = CellValuesContainer(backend, n_workers, cv)
- cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
+ cv_gpu = distribute_to_workers(backend, n_workers, cv)
+ cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu))
```
"""
distribute_to_workers([backend], n_workers, x::Tx)
Create an `AbstractArray{T}` with length `n_workers` where each item is can be mutated without race conditions between different workers / tasks. While `T == Tx` is not guaranteed, instances of `T` will typically support the same functionality as the original `x`.
Ferrite provides support for the following `x`s:
* [`CellCache`](@ref)
* [`CellValues`](@ref)
* [`FacetValues`](@ref)
* `AbstractAssembler` (I assume it will be needed due to the `#FIXME buffers` in the code (also no ref as no docs for this one I think?)
If no `backend` is provided, `T == Tx`, which is suitable for standard multithreading. Distributing for GPUs requires loading the relevant GPU package that provides a `backend`, currently, the following backends are implemented:
* `CUDABackend()` (Via the `CUDA.jl` extension)
"""
function distribute_to_workers endxref: #1070 (CC @fredrikekre) for the non-GPU case, which will be solved by this. Of course the same issue as there with name, some alternatives to the above could be
distributedistribute_to_tasks
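For reference, the plain-multithreading fallback of the proposed semantics could be as simple as one copy per worker. This is my sketch of the idea, not code from the PR:

```julia
# Hypothetical CPU fallback for the proposed API: without a GPU backend,
# each worker simply gets its own deep copy of `x`, so `T == Tx` holds and
# workers can mutate their copies without racing.
distribute_to_workers(n_workers::Int, x) = [deepcopy(x) for _ in 1:n_workers]

buffers = distribute_to_workers(4, zeros(3))
buffers[1][1] = 42.0   # mutating one worker's copy...
buffers[2][1] == 0.0   # ...leaves the others untouched
```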
I am still not sure about this design, as I also started with it, but it just didn't turn out nicely so far.
For more context:

- `cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu))` does not work so easily. First, the Vector types are hardcoded, so the `CellCache(gpu_sdh)` constructor would already have to return something that is not a `CellCache`, or we transfer the data back and forth, or the `CellCache` is just some other placeholder type.
- If we widen the types here, the struct is still mutable, and we cannot make it immutable without breaking public API (`reinit!` on `CellCache`).
- I cannot guarantee this signature will work for everything we need. How would you, for example, handle the assembler, given that the existing ones are also not GPU compatible? And if we make them GPU compatible, why would I not directly construct them with the correct buffer sizes?
In general, the way I read the tutorial, the approach for many objects is to (1) set up as usual (e.g. Grid, DofHandler, ConstraintHandler, etc.) and (2) adapt to the GPU device. The distribute-to-workers interface is doing exactly that. The alternative, a new public type for each item (CellValues, FacetValues, MultiFieldCellValues, CellCache, FacetCache, CSCAssembler, CSRAssembler), potentially different between backends, seems less attractive to me wrt. what we define and the number of objects the user has to deal with, compared to abstracting this away.
- ... or we transfer the data back and forth, or the CellCache is just some other placeholder type.
I think this transfer is OK, since it is only during setup. And this follows the same logic of adapt in my view.
- I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible?
The new assembler should be documented as thread-safe (as long as coloring is used), so there is no need to distribute it to workers; it can be used by all workers.
- ... And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?
I would argue for the simplicity of the user interface, same pattern everywhere: Create as standard, adapt to GPU backend. Items with buffers/cache that are modified must be distributed to workers, just as with multithreading, and we provide a unified infrastructure to do this via distribute_to_workers.
Thanks for the review, Knut! Let me answer your questions in order first before I get to the comments.
From the user-facing API, that's it. Then there is right now the internal API with
Yes, I think that makes perfect sense! (Just API in the sense that the Ferrite dispatches are part of the documented API.)
Aha, so in that case shouldn't we have the same logic here as for the cell values and cache? (in my notation
Yes, but with the proposed

```julia
struct DistributedVals{CV, CV_internal} <: AbstractVector{CV}
    num::Int
    cv::CV_internal
    function DistributedVals{CV}(cv_internal::CV_internal, num::Int) where {CV, CV_internal}
        return new{CV, CV_internal}(num, cv_internal)
    end
end

function distribute_to_workers(backend, n_workers, cv::CellValues)
    cva = as_structure_of_arrays(backend, n_workers, cv)
    CV = typeof(get_substruct(cva, 1))
    return DistributedVals{CV}(cva, n_workers)
end

function Base.getindex(d::DistributedVals{CV}, i) where {CV}
    checkbounds(Bool, 1:d.num, i) || throw(BoundsError(d, i))
    return get_substruct(d.cv, i)::CV
end
```

The reason I'm against exposing
```julia
# NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the
# immutable variant. Making the CellCache immutable is considered breaking due to the
# reinit! API integration.
struct ImmutableCellCache{G <: AbstractGrid, SDH, IVT, VX}
    flags::UpdateFlags
    grid::G
    cellid::Int
    nodes::IVT
    coords::VX
    sdh::SDH
    dofs::IVT
end
```
NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.
I think we can, by changing lines 35 to 46 in 34c457e to
```julia
mutable struct Mutable{T}
    v::T
end

struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CID, NodeV <: AbstractVector{<:Integer}, CoordV <: AbstractVector{X}, DofV <: AbstractVector{<:Integer}}
    flags::UpdateFlags
    grid::G
    # Pretty useless to store this since you have it already for the reinit! call, but
    # needed for the CellIterator(...) workflow since the user doesn't necessarily control
    # the loop order in the cell subset.
    cellid::CID
    nodes::NodeV
    coords::CoordV
    dh::DH
    dofs::DofV
end
```

where `CID` normally is a `Mutable{Int}`, but for GPU this can be `adapt`:ed to a view into a `CuVector`?
This basically re-introduces the old design, which I think makes sense here instead of duplicating?
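The pattern under discussion — an immutable struct whose "mutable" field is a one-element container, which can then be `adapt`:ed to a view into a device array — can be sketched with toy types (hypothetical names, not Ferrite's actual structs):

```julia
# Toy illustration: an immutable struct stays GPU-friendly, but a field
# holding a one-element vector (or a view into a larger array) can still
# be updated in place, mimicking a mutable `cellid::Int` field.
struct ToyCellCache{CID <: AbstractVector{Int}}
    cellid::CID
end

toy_reinit!(cc::ToyCellCache, i::Int) = (cc.cellid[1] = i; cc)
toy_cellid(cc::ToyCellCache) = cc.cellid[1]

backing = zeros(Int, 8)                 # e.g. one slot per worker
cc = ToyCellCache(view(backing, 3:3))   # worker 3's slot
toy_reinit!(cc, 17)
toy_cellid(cc)   # 17, and backing[3] == 17
```

The trade-off raised below — the extra round trip through global memory for the cellid — is exactly the cost of the backing array here.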
I would go for the duplication here. Having everything funneled into a single data structure seems to be quite painful to maintain. My working patch here is
diff --git a/ext/FerriteKAExt/adapt_core.jl b/ext/FerriteKAExt/adapt_core.jl
index cbccc539a..969a9ed75 100644
--- a/ext/FerriteKAExt/adapt_core.jl
+++ b/ext/FerriteKAExt/adapt_core.jl
@@ -1,6 +1,7 @@
# This file contains adapt rules for all relevant data structures in Ferrite.jl which do
# not need customized GPU data structures.
+Adapt.@adapt_structure CellCache
Adapt.@adapt_structure CellValues
Adapt.@adapt_structure Ferrite.GeometryMapping
Adapt.@adapt_structure Ferrite.FunctionValues
diff --git a/ext/FerriteKAExt/iterator.jl b/ext/FerriteKAExt/iterator.jl
index 3f89ecc4a..d20f7f00f 100644
--- a/ext/FerriteKAExt/iterator.jl
+++ b/ext/FerriteKAExt/iterator.jl
@@ -34,9 +34,23 @@ function as_structure_of_arrays(backend, outer_dim, ::Type{CellCache}, sdh::Devi
N = Ferrite.nnodes_per_cell(sdh)
nodes = KA.zeros(backend, Int, outer_dim, N)
coords = KA.zeros(backend, Vec{dim, get_coordinate_eltype(grid)}, outer_dim, N)
+ cellids = KA.zeros(backend, Int, outer_dim, 1)
dofs = KA.zeros(backend, Int, outer_dim, n)
end
- return ImmutableCellCache(flags, grid, -1, nodes, coords, sdh, dofs)
+ return CellCache(flags, grid, cellids, nodes, coords, sdh, dofs)
+end
+function get_substruct(i, cc::CellCache, cellid)
+ return CellCache(
+ cc.flags, cc.grid, view(cc.cellid, i, :),
+ view(cc.nodes, i, :), view(cc.coords, i, :), cc.dh, view(cc.dofs, i, :)
+ )
+end
+function Ferrite.reinit!(cc::CellCache{<:Any,<:DeviceGrid}, cellid::Int)
+ cc.cellid[1] = cellid # TODO remove this in future versions
+ cc.flags.nodes && Ferrite.cellnodes!(cc.nodes, cc.grid, cellid)
+ cc.flags.coords && Ferrite.getcoordinates!(cc.coords, cc.grid, cellid)
+ cc.dh !== nothing && cc.flags.dofs && Ferrite.celldofs!(cc.dofs, cc.dh, cellid)
+ return cc
end
function Ferrite.CellCache(backend, dh::HostDofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}
diff --git a/src/iterators.jl b/src/iterators.jl
index f432455fa..7eb3f244a 100644
--- a/src/iterators.jl
+++ b/src/iterators.jl
@@ -32,24 +32,24 @@ cell. The cache is updated for a new cell by calling `reinit!(cache, cellid)` wh
See also [`CellIterator`](@ref).
"""
-mutable struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}}
- const flags::UpdateFlags
- const grid::G
+struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CIDType, IVType, CVType <: AbstractArray{X}}
+ flags::UpdateFlags
+ grid::G
# Pretty useless to store this since you have it already for the reinit! call, but
# needed for the CellIterator(...) workflow since the user doesn't necessarily control
# the loop order in the cell subset.
- cellid::Int
- const nodes::Vector{Int}
- const coords::Vector{X}
- const dh::DH
- const dofs::Vector{Int}
+ cellid::CIDType
+ nodes::IVType
+ coords::CVType
+ dh::DH
+ dofs::IVType
end
function CellCache(grid::Grid{dim, C, T}, flags::UpdateFlags = UpdateFlags()) where {dim, C, T}
N = nnodes_per_cell(grid, 1) # nodes and coords will be resized in `reinit!`
nodes = zeros(Int, N)
coords = zeros(Vec{dim, T}, N)
- return CellCache(flags, grid, -1, nodes, coords, nothing, Int[])
+ return CellCache(flags, grid, [-1], nodes, coords, nothing, Int[])
end
function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}
@@ -58,7 +58,7 @@ function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) wher
nodes = zeros(Int, N)
coords = zeros(Vec{dim, get_coordinate_eltype(get_grid(dh))}, N)
celldofs = zeros(Int, n)
- return CellCache(flags, get_grid(dh), -1, nodes, coords, dh, celldofs)
+ return CellCache(flags, get_grid(dh), [-1], nodes, coords, dh, celldofs)
end
function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
@@ -67,7 +67,7 @@ function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
end
function reinit!(cc::CellCache, i::Int)
- cc.cellid = i
+ cc.cellid[1] = i # TODO remove this in future versions
if cc.flags.nodes
resize!(cc.nodes, nnodes_per_cell(cc.grid, i))
cellnodes!(cc.nodes, cc.grid, i)
@@ -97,10 +97,10 @@ end
getnodes(cc::CellCache) = cc.nodes
getcoordinates(cc::CellCache) = cc.coords
celldofs(cc::CellCache) = cc.dofs
-cellid(cc::CellCache) = cc.cellid
+cellid(cc::CellCache) = cc.cellid[1]
# TODO: These should really be replaced with something better...
-nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cc.cellid))
+nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cellid(cc)))
"""
@@ -121,20 +121,20 @@ calling `reinit!(cache, fi::FacetIndex)`.
See also [`FacetIterator`](@ref).
"""
-mutable struct FacetCache{CC <: CellCache}
- const cc::CC # const for julia > 1.8
- const dofs::Vector{Int} # aliasing cc.dofs
- current_facet_id::Int
+struct FacetCache{CC <: CellCache, DVType, CFType}
+ cc::CC
+ dofs::DVType # aliasing cc.dofs
+ current_facet_id::CFType
end
function FacetCache(args...)
cc = CellCache(args...)
- return FacetCache(cc, cc.dofs, 0)
+ return FacetCache(cc, cc.dofs, [0])
end
function reinit!(fc::FacetCache, facet::BoundaryIndex)
cellid, facetid = facet
reinit!(fc.cc, cellid)
- fc.current_facet_id = facetid
+ fc.current_facet_id[1] = facetid
return nothing
end
@@ -148,7 +148,7 @@ for op in (:getnodes, :getcoordinates, :cellid, :celldofs)
end
@inline function reinit!(fv::FacetValues, fc::FacetCache)
- return reinit!(fv, fc.cc, fc.current_facet_id)
+ return reinit!(fv, fc.cc, fc.current_facet_id[1])
end
"""
diff --git a/docs/src/howto/gpu_heat_howto_literate.jl b/docs/src/howto/gpu_heat_howto_literate.jl
index 6efd99cdb..6931d0b25 100644
--- a/docs/src/howto/gpu_heat_howto_literate.jl
+++ b/docs/src/howto/gpu_heat_howto_literate.jl
@@ -68,7 +68,8 @@ end
cv_i = cv[task_index]
## Query work item cell cache. The call on the item initializes replaces the reinit! call.
- cc_i = cc[task_index](cellid)
+ cc_i = cc[task_index]
+ Ferrite.reinit!(cc_i, cellid)
## Query assembly buffer.
Ke = view(Kes, i, :, :)

If this is more about keeping the reinit! workflow on the GPU I can add it, but it will increase memory pressure, because we store the cellid in global memory just to eventually load it back from global memory, instead of keeping it locally.
Note that we cannot do any shenanigans like resizing on the GPU.
This PR has a fundamentally different design from the previous PRs. Here I try to preserve the assembly routine as-is and modify the assembly loop around them. This PR is really just an MWE to show the direction for discussion purposes.
The main vehicle for this PR is a SOA transformation of the CellCache and CellValues objects. I actually like it, but it might not be the best one performance-wise.
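The SOA transformation in miniature: one struct of `n_workers`-sized buffers replaces `n_workers` structs of scalar-sized buffers, and each worker receives a view-based slice. Toy types; the names `as_structure_of_arrays` and `get_substruct` mirror the PR's internals, but this is not the actual implementation:

```julia
# Toy SoA transformation: a single struct holding (n_workers × ndofs)
# storage replaces per-worker copies; `get_substruct` hands worker i a
# slice that aliases the shared storage (illustrative, not Ferrite API).
struct WorkerBuffers{M <: AbstractMatrix{Float64}}
    fe::M    # (n_workers × ndofs) block of element vectors
end

as_structure_of_arrays(n_workers, ndofs) = WorkerBuffers(zeros(n_workers, ndofs))
get_substruct(wb::WorkerBuffers, i) = view(wb.fe, i, :)

wb = as_structure_of_arrays(4, 3)
fe1 = get_substruct(wb, 1)
fe1 .= 1.0   # worker 1 fills its slice; the other rows are untouched
```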
Closes #628.