Minimal GPU example which preserves the element routine#1291

Open
termi-official wants to merge 37 commits intomasterfrom
do/gpu-mwe
Conversation

@termi-official
Member

@termi-official termi-official commented Feb 27, 2026

This PR has a fundamentally different design from the previous PRs. Here I try to preserve the assembly routine as-is and modify the assembly loop around them. This PR is really just an MWE to show the direction for discussion purposes.

The main vehicle for this PR is a SOA transformation of the CellCache and CellValues objects. I actually like it, but it might not be the best one performance-wise.

Closes #628 .

@codecov

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 56.14618% with 132 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.59%. Comparing base (236eb50) to head (280f35d).
⚠️ Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
ext/FerriteCudaExt.jl 33.33% 76 Missing ⚠️
ext/FerriteKAExt/iterator.jl 58.53% 17 Missing ⚠️
src/soa_utils.jl 50.00% 16 Missing ⚠️
ext/FerriteKAExt/dof_handler.jl 66.66% 9 Missing ⚠️
ext/FerriteKAExt/adapt_core.jl 73.33% 4 Missing ⚠️
ext/FerriteKAExt/soa_core.jl 70.00% 3 Missing ⚠️
src/Grid/grid.jl 85.00% 3 Missing ⚠️
ext/FerriteKAExt/device_grid.jl 66.66% 2 Missing ⚠️
src/Dofs/ConstraintHandler.jl 95.00% 1 Missing ⚠️
src/FEValues/FacetValues.jl 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1291      +/-   ##
==========================================
- Coverage   94.25%   92.59%   -1.66%     
==========================================
  Files          40       47       +7     
  Lines        6750     6997     +247     
==========================================
+ Hits         6362     6479     +117     
- Misses        388      518     +130     

☔ View full report in Codecov by Sentry.

FunctionValues

struct FunctionValues{DiffOrder, IP, N_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
struct FunctionValues{DiffOrder, IP, Nx_t, Nξ_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
Member Author

This is the only intrusive part of this PR. I have decided to allow the arrays to be stored here directly, to avoid adding helper structs that merely mirror the fields as arrays. Note, however, that passing an array here breaks almost all calls on the struct.

We can also explore the path of having helper structs and introduce some syntactic sugar to query the parts via e.g. fv[i] instead of get_substruct.
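A rough illustration of that sugar (all names here — `SoAValues`, the `N` field — are stand-ins for this comment, not the actual Ferrite internals): indexing can simply forward to the substruct query.

```julia
# Hypothetical SoA container: each worker owns one row of the field arrays.
struct SoAValues{M <: AbstractMatrix}
    N::M  # shape (n_workers, n_basefuncs)
end

# The internal query...
get_substruct(fv::SoAValues, i::Int) = view(fv.N, i, :)

# ...and the proposed sugar, so users can write fv[i] instead.
Base.getindex(fv::SoAValues, i::Int) = get_substruct(fv, i)
```

With this, `fv[2]` returns the worker-local view without exposing `get_substruct` to users.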

@termi-official termi-official marked this pull request as ready for review February 28, 2026 01:24
Member

@KnutAM KnutAM left a comment

Cool 🚀

Initial round of feedback (excluding the parts regarding the generalized vector types in #1252 ), and didn't look at the example so carefully yet.

Member

@KnutAM KnutAM left a comment

This is really cool and I'm really excited for this feature!

I've now focused on the how-to to get a bit of understanding of how this would look from a user perspective. So on the user-facing side we are introducing these APIs?

  1. adapt dispatches for DofHandler and ConstraintHandler
  2. Support for allocation and assembly of CuSparseMatrixCSC
  3. CellCacheContainer, CellValuesContainer, and FacetValuesContainer (assembler?)

(Where my suggestion is to replace nr 3 with distribute_to_workers or similar)

Did I miss any new API?


ch_gpu = adapt(backend, ch)
apply!(K_gpu, f_gpu, ch_gpu)
u_gpu = SparseMatrixCSC(K_gpu) \ Vector(f_gpu)
Member

Can we use LinearSolve.jl instead to avoid transferring? Just that it is nice :D

Member Author

Not sure if I would do that here. We can point the user towards LinearSolve being a possibility. But we do not have optimal preconditioners in LinearSolve.jl yet. To not make the PR too large, should I leave this for now as-is and add direct support for AMGX in a subsequent PR?

Member

No, I agree that we shouldn't add that here. Was more thinking to use a direct solver that seemed to be supported, e.g.

using CUDSS
sol = LS.solve(prob, LS.LUFactorization())

But not important.

Comment on lines +184 to +188
# Technically we can also just get one Ke or fe per worker, but for demonstration
# purposes we allocate the full block here for element-assembly style matrix-free GPU
# usage.
Kes = KA.zeros(backend, Float32, getncells(grid), getnbasefunctions(cv), getnbasefunctions(cv))
fes = KA.zeros(backend, Float32, getncells(grid), getnbasefunctions(cv))
Member

How is this better than allocating the full matrix? Seems like this would allocate a huge amount of memory compared to the global matrix? Is it simply to avoid locks/coloring?

In the final version probably nice to stick with one approach as a first step for users to do GPU-FEM - and have other how-tos build upon that.

Member Author

Let me answer here in reverse order.

In the final version probably nice to stick with one approach as a first step.

Please note that this is not a tutorial and I think we need to educate users here a bit. No matter what you do, if you use CSR or CSC on the GPU your solve performance will not be great at all, because the arithmetic intensity is way too low for the spmv kernel. Using CSR/CSC is not recommended. However, if you do not know how to precondition your iterative solver, then it is typically better for the user to use these formats anyway to get things started, so they can optimize the solve step later separately. I can clarify this further in the how-to if you want.

How is this better than allocating the full matrix?

I essentially just want to show users here that we can very easily build standard matrix-free data structures. The 3-dimensional Ke array is really just what is called "element-assembly" and is already sufficient to set up fast matrix-vector products on the GPU. Since in the past I was repeatedly asked to chunk up PRs which do several things at once (to see how things integrate), this one is now an isolated change which does just a single thing: port the heat example to the GPU with minimal effort (for the users).
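To make the "element-assembly" point concrete, here is a hedged sketch of how a 3-dimensional `Kes` block enables a matrix-free operator application; `ea_matvec!` and the `celldofs` map are illustrative names for this comment, and a real GPU version would be a kernel over cells rather than this serial loop:

```julia
# y = K*u without ever forming the global sparse K: gather u per cell, apply the
# dense local matrix Kes[cell, :, :], and scatter the result back into y.
function ea_matvec!(y, Kes, celldofs, u)
    fill!(y, 0)
    ncells, ndofs_per_cell, _ = size(Kes)
    for cell in 1:ncells
        dofs = celldofs[cell]  # global dof indices of this cell
        for i in 1:ndofs_per_cell, j in 1:ndofs_per_cell
            y[dofs[i]] += Kes[cell, i, j] * u[dofs[j]]
        end
    end
    return y
end
```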

Comment on lines +182 to +183
cv_gpu = CellValuesContainer(backend, n_workers, cv)
cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
Member

As discussed on slack, I don't think we should add all these different versions/types as user-facing, but provide an API that can be overloaded with the specific type, which returns an AbstractArray whose element type is similar to the type of the argument. Putting it here for others to comment on.

Suggested change
cv_gpu = CellValuesContainer(backend, n_workers, cv)
cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
cv_gpu = distribute_to_workers(backend, n_workers, cv)
cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu))
"""
    distribute_to_workers([backend], n_workers, x::Tx)

Create an `AbstractArray{T}` with length `n_workers` where each item can be mutated without race conditions between different workers / tasks. While `T == Tx` is not guaranteed, instances of `T` will typically support the same functionality as the original `x`.

Ferrite provides support for the following `x`s:

* [`CellCache`](@ref)
* [`CellValues`](@ref)
* [`FacetValues`](@ref)
* `AbstractAssembler` (I assume it will be needed due to the `#FIXME buffers` in the code; also no ref, since there are no docs for this one I think)

If no `backend` is provided, `T == Tx`, which is suitable for standard multithreading. Distributing for GPUs requires loading the relevant GPU package that provides a `backend`, currently, the following backends are implemented:

* `CUDABackend()` (Via the `CUDA.jl` extension)

"""
function distribute_to_workers end

xref: #1070 (CC @fredrikekre) for the non-GPU case, which will be solved by this. Of course the same issue as there with name, some alternatives to the above could be

  • distribute
  • distribute_to_tasks
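A trivial CPU fallback consistent with the proposed docstring might look like this (a sketch only — `distribute_to_workers` does not exist yet, and the real GPU methods would dispatch on the backend):

```julia
# No backend given: T == Tx, one independent deep copy per worker/task, which is
# exactly the per-task buffer pattern used for standard multithreaded assembly.
distribute_to_workers(n_workers::Int, x) = [deepcopy(x) for _ in 1:n_workers]
```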

Member Author

@termi-official termi-official Mar 6, 2026

I am still not sure about this design, as I also started with it, but it just didn't turn out nicely so far.

For more context:

  1. cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu)) does not work so easily. First, the Vector types are hardcoded, so the CellCache(gpu_sdh) constructor would already return something that is not a CellCache, or we transfer the data back and forth, or the CellCache is just some other placeholder type.
  2. if we would widen the type here, it is still mutable and we cannot make the struct immutable without breaking public API (reinit! in CellCache).
  3. I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible? And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?

Member

  1. Minimal GPU example which preserves the element routine #1291 (comment)

In general, the way I read the tutorial, the approach for many objects is (1) set up as usual (e.g. Grid, DofHandler, ConstraintHandler, etc.) and (2) adapt to the GPU device. So having the distribute-to-workers interface does exactly that. Abstracting this away seems more attractive to me, wrt. what we define and the number of objects the user has to deal with, than the alternative of having a new public type for each item (CellValues, FacetValues, MultiFieldCellValues, CellCache, FacetCache, CSCAssembler, CSRAssembler), potentially different between backends.

  1. ... or we transfer the data back and forth, or the CellCache is just some other placeholder type.

I think this transfer is OK, since it is only during setup. And this follows the same logic of adapt in my view.

  1. I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible?

The new assembler should be documented as thread-safe (as long as coloring is used), so no need to distribute it to workers, can be used by all workers.

  1. ... And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?

I would argue for the simplicity of the user interface, same pattern everywhere: Create as standard, adapt to GPU backend. Items with buffers/cache that are modified must be distributed to workers, just as with multithreading, and we provide a unified infrastructure to do this via distribute_to_workers.

@termi-official
Member Author

Thanks for the review Knut! Let me answer your questions in order first before I get to the comments.

  1. adapt comes from Adapt.jl and is the recommended way to transfer data structures into GPU kernels. Essentially it does some switching like CuVector -> CuDeviceVector under the hood. You add this to the structs which will be used inside kernels, and the transformation is applied to all fields of your structs recursively (so that we also reach inner structs). So this is not really a new API which we introduce here, but more a way to add compatibility with the standard Julia GPU workflow. A goal here was specifically to enable users to build their own stuff with KernelAbstractions.
  2. Technically speaking the assembly into a CSC matrix would work out of the box with what we have. The blocker here is that the assembler structs are not thread-safe, so I added a hotfix for now by adding a function which directly assembles without the additional sort step. Note that I can also add the CSR format in this PR, as it is significantly faster for parallel applications (i.e. in solvers).
  3. Yes, these three structs are essentially the workhorses here. Given the refactoring that we (and especially you, Knut, for the CellValues) did over the last year, porting most parts became significantly easier than in previous attempts, and we can reuse many existing element routines out of the box.
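The generic Adapt.jl pattern referred to in point 1 looks roughly like this (`ElementData` is a made-up example struct; the macro itself is the documented Adapt.jl API):

```julia
using Adapt

# A struct that will be captured inside a kernel.
struct ElementData{V}
    coords::V
end

# Generates an adapt_structure method so that adapt(to, ::ElementData) rebuilds
# the struct with every field converted recursively, e.g. CuArray ->
# CuDeviceArray at kernel launch.
Adapt.@adapt_structure ElementData
```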

From the user-facing API, that's it. Then there is currently the internal API with get_substruct and as_structure_of_arrays powering the new user-facing structs. This API is subject to change, but the pattern should be clear. For the GPU it is really beneficial to have structure-of-arrays data types -- while on the CPU we usually prefer arrays of structs, as we already propose in the threaded assembly tutorial. The difference in "optimal" data structures stems from the significant differences in hardware architecture. Does that make sense?
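The AoS-vs-SoA distinction can be sketched with toy types (hypothetical names; the real containers in this PR hold the full CellValues fields):

```julia
# Array-of-structs: each CPU task gets its own independent buffer (good locality
# per task, as in the threaded assembly tutorial).
struct WorkBufferAoS
    dNdx::Vector{Float64}
end

# Struct-of-arrays: one shared struct whose fields carry a leading worker
# dimension; on the GPU, neighboring threads then access neighboring memory
# (coalesced loads), which is what makes this layout attractive there.
struct WorkBufferSoA
    dNdx::Matrix{Float64}  # (n_workers, n_basefuncs)
end

worker_view(b::WorkBufferSoA, i::Int) = view(b.dNdx, i, :)
```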

@KnutAM
Member

KnutAM commented Mar 6, 2026

(1) So this is not really a new API which we introduce here, but more a way to add compatibility with the standard Julia GPU workflow.

Yes, I think that makes perfect sense! (Just API in the sense that the Ferrite dispatches is part of the documented API)

(2) Technically speaking the assembly into a CSC matrix would work out of the box with what we have.

Aha, so in that case shouldn't we have the same logic here as for the cell values and cache (in my notation distribute_to_workers(backend, n_workers, assembler), though no need to add it here)? This PR then adds an assembler that doesn't need to be distributed to workers since it doesn't have caches. So not necessarily dispatch on start_assemble for the matrix type, but rather it could be explicitly constructed, something like start_assemble(CacheFreeAssembler, K, [f])?

(3) The difference here in "optimal" data structures stems from the significant differences in hardware architecture. Does that make sense?

Yes, but with the proposed distribute function that is what we would get. I think we didn't mention this in our slack conversation, but just to have it here:

struct DistributedVals{CV, CV_internal} <: AbstractVector{CV}
    num::Int
    cv::CV_internal
    function DistributedVals{CV}(cv_internal::CV_internal, num::Int) where {CV, CV_internal}
        return new{CV, CV_internal}(num, cv_internal)
    end
end

Base.size(d::DistributedVals) = (d.num,)

function distribute_to_workers(backend, n_workers, cv::CellValues)
    cva = as_structure_of_arrays(backend, n_workers, cv)
    CV = typeof(get_substruct(cva, 1))
    return DistributedVals{CV}(cva, n_workers)
end

function Base.getindex(d::DistributedVals{CV}, i::Int) where {CV}
    checkbounds(Bool, 1:d.num, i) || throw(BoundsError(d, i))
    return get_substruct(d.cv, i)::CV
end

The reason I'm against exposing cvs = as_structure_of_arrays(backend, n_workers, cv) to users is that we then have e.g. a CellValues-like object, cvs, that doesn't support the methods defined for CellValues (or, even worse, in some cases supports them but silently gives the wrong result). As a pragmatic approach to avoid code duplication, I like the idea to (mis)use the structures to create structs of arrays, but users should never work with such adapted structures IMO.

Comment on lines +1 to +11
# NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the
# immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.
struct ImmutableCellCache{G <: AbstractGrid, SDH, IVT, VX}
flags::UpdateFlags
grid::G
cellid::Int
nodes::IVT
coords::VX
sdh::SDH
dofs::IVT
end
Member

NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.

I think we can, by changing

mutable struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}}
const flags::UpdateFlags
const grid::G
# Pretty useless to store this since you have it already for the reinit! call, but
# needed for the CellIterator(...) workflow since the user doesn't necessarily control
# the loop order in the cell subset.
cellid::Int
const nodes::Vector{Int}
const coords::Vector{X}
const dh::DH
const dofs::Vector{Int}
end
to

mutable struct Mutable{T}
    v::T
end
struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CID, NodeV <: AbstractVector{<:Integer}, CoordV <: AbstractVector{X}, DofV <: AbstractVector{<:Integer}}
    flags::UpdateFlags
    grid::G
    # Pretty useless to store this since you have it already for the reinit! call, but
    # needed for the CellIterator(...) workflow since the user doesn't necessarily control
    # the loop order in the cell subset.
    cellid::CID
    nodes::NodeV
    coords::CoordV
    dh::DH
    dofs::DofV
end

where CID normally is a Mutable{Int}, but for GPU this can be adapted to a view into a CuVector?

This basically re-introduces the old design, which I think makes sense here instead of duplicating?

Member Author

I would go for the duplication here. Having everything funneled into a single data structure seems to be quite painful to maintain. My working patch here is

diff --git a/ext/FerriteKAExt/adapt_core.jl b/ext/FerriteKAExt/adapt_core.jl
index cbccc539a..969a9ed75 100644
--- a/ext/FerriteKAExt/adapt_core.jl
+++ b/ext/FerriteKAExt/adapt_core.jl
@@ -1,6 +1,7 @@
 # This file contains adapt rules for all relevant data structures in Ferrite.jl which do
 # not need customized GPU data structures.
 
+Adapt.@adapt_structure CellCache
 Adapt.@adapt_structure CellValues
 Adapt.@adapt_structure Ferrite.GeometryMapping
 Adapt.@adapt_structure Ferrite.FunctionValues
diff --git a/ext/FerriteKAExt/iterator.jl b/ext/FerriteKAExt/iterator.jl
index 3f89ecc4a..d20f7f00f 100644
--- a/ext/FerriteKAExt/iterator.jl
+++ b/ext/FerriteKAExt/iterator.jl
@@ -34,9 +34,23 @@ function as_structure_of_arrays(backend, outer_dim, ::Type{CellCache}, sdh::Devi
         N = Ferrite.nnodes_per_cell(sdh)
         nodes = KA.zeros(backend, Int, outer_dim, N)
         coords = KA.zeros(backend, Vec{dim, get_coordinate_eltype(grid)}, outer_dim, N)
+        cellids = KA.zeros(backend, Int, outer_dim, 1)
         dofs = KA.zeros(backend, Int, outer_dim, n)
     end
-    return ImmutableCellCache(flags, grid, -1, nodes, coords, sdh, dofs)
+    return CellCache(flags, grid, cellids, nodes, coords, sdh, dofs)
+end
+function get_substruct(i, cc::CellCache, cellid)
+    return CellCache(
+        cc.flags, cc.grid, view(cc.cellid, i, :),
+        view(cc.nodes, i, :), view(cc.coords, i, :), cc.dh, view(cc.dofs, i, :)
+    )
+end
+function Ferrite.reinit!(cc::CellCache{<:Any,<:DeviceGrid}, cellid::Int)
+    cc.cellid[1] = cellid # TODO remove this in future versions
+    cc.flags.nodes  && Ferrite.cellnodes!(cc.nodes, cc.grid, cellid)
+    cc.flags.coords && Ferrite.getcoordinates!(cc.coords, cc.grid, cellid)
+    cc.dh !== nothing && cc.flags.dofs && Ferrite.celldofs!(cc.dofs, cc.dh, cellid)
+    return cc
 end
 
 function Ferrite.CellCache(backend, dh::HostDofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}

diff --git a/src/iterators.jl b/src/iterators.jl
index f432455fa..7eb3f244a 100644
--- a/src/iterators.jl
+++ b/src/iterators.jl
@@ -32,24 +32,24 @@ cell. The cache is updated for a new cell by calling `reinit!(cache, cellid)` wh
 
 See also [`CellIterator`](@ref).
 """
-mutable struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}}
-    const flags::UpdateFlags
-    const grid::G
+struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CIDType, IVType, CVType <: AbstractArray{X}}
+    flags::UpdateFlags
+    grid::G
     # Pretty useless to store this since you have it already for the reinit! call, but
     # needed for the CellIterator(...) workflow since the user doesn't necessarily control
     # the loop order in the cell subset.
-    cellid::Int
-    const nodes::Vector{Int}
-    const coords::Vector{X}
-    const dh::DH
-    const dofs::Vector{Int}
+    cellid::CIDType
+    nodes::IVType
+    coords::CVType
+    dh::DH
+    dofs::IVType
 end
 
 function CellCache(grid::Grid{dim, C, T}, flags::UpdateFlags = UpdateFlags()) where {dim, C, T}
     N = nnodes_per_cell(grid, 1) # nodes and coords will be resized in `reinit!`
     nodes = zeros(Int, N)
     coords = zeros(Vec{dim, T}, N)
-    return CellCache(flags, grid, -1, nodes, coords, nothing, Int[])
+    return CellCache(flags, grid, [-1], nodes, coords, nothing, Int[])
 end
 
 function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}
@@ -58,7 +58,7 @@ function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) wher
     nodes = zeros(Int, N)
     coords = zeros(Vec{dim, get_coordinate_eltype(get_grid(dh))}, N)
     celldofs = zeros(Int, n)
-    return CellCache(flags, get_grid(dh), -1, nodes, coords, dh, celldofs)
+    return CellCache(flags, get_grid(dh), [-1], nodes, coords, dh, celldofs)
 end
 
 function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
@@ -67,7 +67,7 @@ function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
 end
 
 function reinit!(cc::CellCache, i::Int)
-    cc.cellid = i
+    cc.cellid[1] = i # TODO remove this in future versions
     if cc.flags.nodes
         resize!(cc.nodes, nnodes_per_cell(cc.grid, i))
         cellnodes!(cc.nodes, cc.grid, i)
@@ -97,10 +97,10 @@ end
 getnodes(cc::CellCache) = cc.nodes
 getcoordinates(cc::CellCache) = cc.coords
 celldofs(cc::CellCache) = cc.dofs
-cellid(cc::CellCache) = cc.cellid
+cellid(cc::CellCache) = cc.cellid[1]
 
 # TODO: These should really be replaced with something better...
-nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cc.cellid))
+nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cellid(cc)))
 
 
 """
@@ -121,20 +121,20 @@ calling `reinit!(cache, fi::FacetIndex)`.
 
 See also [`FacetIterator`](@ref).
 """
-mutable struct FacetCache{CC <: CellCache}
-    const cc::CC  # const for julia > 1.8
-    const dofs::Vector{Int} # aliasing cc.dofs
-    current_facet_id::Int
+struct FacetCache{CC <: CellCache, DVType, CFType}
+    cc::CC
+    dofs::DVType # aliasing cc.dofs
+    current_facet_id::CFType
 end
 function FacetCache(args...)
     cc = CellCache(args...)
-    return FacetCache(cc, cc.dofs, 0)
+    return FacetCache(cc, cc.dofs, [0])
 end
 
 function reinit!(fc::FacetCache, facet::BoundaryIndex)
     cellid, facetid = facet
     reinit!(fc.cc, cellid)
-    fc.current_facet_id = facetid
+    fc.current_facet_id[1] = facetid
     return nothing
 end
 
@@ -148,7 +148,7 @@ for op in (:getnodes, :getcoordinates, :cellid, :celldofs)
 end
 
 @inline function reinit!(fv::FacetValues, fc::FacetCache)
-    return reinit!(fv, fc.cc, fc.current_facet_id)
+    return reinit!(fv, fc.cc, fc.current_facet_id[1])
 end
 
 """
 
 diff --git a/docs/src/howto/gpu_heat_howto_literate.jl b/docs/src/howto/gpu_heat_howto_literate.jl
index 6efd99cdb..6931d0b25 100644
--- a/docs/src/howto/gpu_heat_howto_literate.jl
+++ b/docs/src/howto/gpu_heat_howto_literate.jl
@@ -68,7 +68,8 @@ end
         cv_i = cv[task_index]
 
         ## Query work item cell cache. The call on the item initializes replaces the reinit! call.
-        cc_i = cc[task_index](cellid)
+        cc_i = cc[task_index]
+        Ferrite.reinit!(cc_i, cellid)
 
         ## Query assembly buffer.
         Ke = view(Kes, i, :, :)

If this is more about keeping the reinit! workflow on the GPU I can add it, but it will increase memory pressure, because we store the cellid in global memory, just to eventually load it later from global memory instead of keeping it locally.

Member Author

Note that we cannot do any shenanigans like resizing on the GPU.



Development

Successfully merging this pull request may close these issues.

GPU assembly example

3 participants