Minimal GPU example which preserves the element routine #1291
termi-official wants to merge 37 commits into master from
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## master #1291 +/- ##
==========================================
- Coverage 94.25% 92.59% -1.66%
==========================================
Files 40 47 +7
Lines 6750 6997 +247
==========================================
+ Hits 6362 6479 +117
- Misses 388 518 +130
ed8eb72 to 2ce407b (Compare)
fdff415 to b12d500 (Compare)
FunctionValues

```diff
- struct FunctionValues{DiffOrder, IP, N_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
+ struct FunctionValues{DiffOrder, IP, Nx_t, Nξ_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
```
This is the only intrusive part of this PR. I have decided to allow the arrays to be stored here directly, to avoid adding helper structs that just mirror the fields as arrays. Note, however, that passing an array here breaks almost all calls on the struct.
We can also explore the path of having helper structs and introduce some syntactic sugar to query the parts via e.g. fv[i] instead of get_substruct.
… integrated as a full test. I will split it up later.
KnutAM
left a comment
This is really cool and I'm really excited for this feature!
I've now focused on the how-to to get a bit of understanding of how this would look from a user perspective. So on the user-facing side we are introducing these APIs:
1. `adapt` dispatches for `DofHandler` and `ConstraintHandler`
2. Support for allocation and assembly of `CuSparseMatrixCSC`
3. `CellCacheContainer`, `CellValuesContainer`, and `FacetValuesContainer` (assembler?)

(Where my suggestion is to replace nr 3 with `distribute_to_workers` or similar.)
Did I miss any new API?
```julia
ch_gpu = adapt(backend, ch)
apply!(K_gpu, f_gpu, ch_gpu)
u_gpu = SparseMatrixCSC(K_gpu) \ Vector(f_gpu)
```
Can we use LinearSolve.jl instead to avoid transferring? Just that it is nice :D
Not sure if I would do that here. We can point the user towards LinearSolve being a possibility. But we do not have optimal preconditioners in LinearSolve.jl yet. To not make the PR too large, should I leave this for now as-is and add direct support for AMGX in a subsequent PR?
No, I agree that we shouldn't add that here. I was more thinking of using a direct solver that seems to be supported, e.g.

```julia
using CUDSS
sol = LS.solve(prob, LS.LUFactorization())
```

But not important.
```julia
# Technically we can also just get one Ke or fe per worker, but for demonstration
# purposes we allocate the full block here for element-assembly style matrix-free GPU
# usage.
Kes = KA.zeros(backend, Float32, getncells(grid), getnbasefunctions(cv), getnbasefunctions(cv))
fes = KA.zeros(backend, Float32, getncells(grid), getnbasefunctions(cv))
```
How is this better than allocating the full matrix? Seems like this would allocate a huge amount of memory compared to the global matrix? Is it simply to avoid locks/coloring?
In the final version probably nice to stick with one approach as a first step for users to do GPU-FEM - and have other how-tos build upon that.
Let me answer here in reverse order.
In the final version probably nice to stick with one approach as a first step.
Please note that this is not a tutorial, and I think we need to educate users here a bit. No matter what you do, if you use CSR or CSC on the GPU your solve performance will not be great, because the arithmetic intensity of the spmv kernel is far too low. Using CSR/CSC is not recommended. However, if the user does not know how to precondition their iterative solver, it is typically better to use these formats anyway to get things started, so the solve step can be optimized separately later. I can clarify this further in the how-to if you want.
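To make the arithmetic-intensity point concrete, here is a back-of-the-envelope sketch. The numbers are mine, not from the PR, and assume Float64 values with Int32 column indices while ignoring the reads of `x` and writes of `y`:

```julia
# Rough arithmetic-intensity estimate for a CSR/CSC spmv kernel.
# Per stored nonzero we perform 2 flops (multiply + add) but must load at
# least one matrix value and one column index from memory.
function csr_spmv_intensity(; value_bytes = 8, index_bytes = 4)
    flops_per_nnz = 2.0
    bytes_per_nnz = value_bytes + index_bytes
    return flops_per_nnz / bytes_per_nnz
end

csr_spmv_intensity()  # ≈ 0.17 flops/byte, far below the balance point of modern GPUs
```

At that intensity the kernel is firmly memory-bound, which is why CSR/CSC is treated here only as a starting point before better-preconditioned or matrix-free approaches.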
How is this better than allocating the full matrix?
I essentially just want to show users here that we can very easily build standard matrix-free data structures. The 3-dimensional Ke block is really just what is called "element-assembly" and is already sufficient to set up fast matrix-vector products on the GPU. Since in the past I was repeatedly tasked to chunk up PRs which do several things at once (to see how things integrate), this one is now an isolated change which does just a single thing: port the heat example with minimal effort (for the users) to the GPU.
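To illustrate what element-assembly buys: with the `(ncells × n × n)` block of `Ke`s, the product `y = K * u` can be formed without ever assembling the global matrix. A minimal CPU-only sketch with a hypothetical `dofs` layout (not the PR's actual kernel):

```julia
# Sketch: matrix-free y = K*u via element-assembly. `Kes` is a
# (ncells × ndofs_per_cell × ndofs_per_cell) array and `dofs` maps each
# cell to its global dof indices. Names are illustrative, not Ferrite API.
function element_matvec!(y, Kes, dofs, u)
    fill!(y, 0)
    ncells, n, _ = size(Kes)
    for c in 1:ncells
        cdofs = view(dofs, c, :)
        ue = u[cdofs]            # gather the local solution
        Ke = view(Kes, c, :, :)
        y[cdofs] .+= Ke * ue     # scatter-add the local contribution
    end
    return y
end
```

On the GPU the cell loop becomes the kernel grid, so this maps directly onto the per-worker layout allocated above.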
```julia
cv_gpu = CellValuesContainer(backend, n_workers, cv)
cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
```
As discussed on slack, I don't think we should add all these different versions/types as user-facing, but rather provide an API that can be overloaded with the specific type, which returns an AbstractArray whose element type is similar to the type of the argument. Putting it here for others to comment on.
```diff
- cv_gpu = CellValuesContainer(backend, n_workers, cv)
- cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
+ cv_gpu = distribute_to_workers(backend, n_workers, cv)
+ cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu))
```
"""
distribute_to_workers([backend], n_workers, x::Tx)
Create an `AbstractArray{T}` with length `n_workers` where each item is can be mutated without race conditions between different workers / tasks. While `T == Tx` is not guaranteed, instances of `T` will typically support the same functionality as the original `x`.
Ferrite provides support for the following `x`s:
* [`CellCache`](@ref)
* [`CellValues`](@ref)
* [`FacetValues`](@ref)
* `AbstractAssembler` (I assume it will be needed due to the `#FIXME buffers` in the code (also no ref as no docs for this one I think?)
If no `backend` is provided, `T == Tx`, which is suitable for standard multithreading. Distributing for GPUs requires loading the relevant GPU package that provides a `backend`, currently, the following backends are implemented:
* `CUDABackend()` (Via the `CUDA.jl` extension)
"""
function distribute_to_workers endxref: #1070 (CC @fredrikekre) for the non-GPU case, which will be solved by this. Of course the same issue as there with name, some alternatives to the above could be
distributedistribute_to_tasks
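For reference, the plain-multithreading fallback of the proposed semantics could be as simple as one copy per worker. This is my sketch of the idea, not code from the PR:

```julia
# Hypothetical CPU fallback for the proposed API: without a GPU backend,
# each worker simply gets its own deep copy of `x`, so `T == Tx` holds and
# workers can mutate their copies without racing.
distribute_to_workers(n_workers::Int, x) = [deepcopy(x) for _ in 1:n_workers]

buffers = distribute_to_workers(4, zeros(3))
buffers[1][1] = 42.0   # mutating one worker's copy...
buffers[2][1] == 0.0   # ...leaves the others untouched
```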
I am still not sure about this design, as I also started with it, but it just didn't turn out nicely so far.
For more context:

- `cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu))` does not work so easily. First, the Vector types are hardcoded, so the `CellCache(gpu_sdh)` constructor would already have to return something that is not a `CellCache`, or we transfer the data back and forth, or the `CellCache` is just some other placeholder type.
- If we widen the types here, the struct is still mutable, and we cannot make it immutable without breaking public API (`reinit!` on `CellCache`).
- I cannot guarantee this signature will work for everything we need. How would you, for example, handle the assembler, given that the existing ones are also not GPU compatible? And if we make them GPU compatible, why would I not directly construct them with the correct buffer sizes?
In general, the way I read the tutorial, the approach for many objects is to (1) set up as usual (e.g. Grid, DofHandler, ConstraintHandler, etc.) and (2) adapt to the GPU device. The distribute-to-workers interface is doing exactly that. The alternative, a new public type for each item (CellValues, FacetValues, MultiFieldCellValues, CellCache, FacetCache, CSCAssembler, CSRAssembler), potentially different between backends, seems less attractive to me wrt. what we define and the number of objects the user has to deal with, compared to abstracting this away.
- ... or we transfer the data back and forth, or the CellCache is just some other placeholder type.
I think this transfer is OK, since it is only during setup. And this follows the same logic of adapt in my view.
- I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible?
The new assembler should be documented as thread-safe (as long as coloring is used), so there is no need to distribute it to workers; it can be used by all workers.
- ... And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?
I would argue for the simplicity of the user interface, same pattern everywhere: Create as standard, adapt to GPU backend. Items with buffers/cache that are modified must be distributed to workers, just as with multithreading, and we provide a unified infrastructure to do this via distribute_to_workers.
Thanks for the review, Knut! Let me answer your questions in order first before I get to the comments.
From the user-facing API, that's it. Then there is right now the internal API with
Yes, I think that makes perfect sense! (Just API in the sense that the Ferrite dispatches are part of the documented API.)
Aha, so in that case shouldn't we have the same logic here as for the cell values and cache? (in my notation
Yes, but with the proposed

```julia
struct DistributedVals{CV, CV_internal} <: AbstractVector{CV}
    num::Int
    cv::CV_internal
    function DistributedVals{CV}(cv_internal::CV_internal, num::Int) where {CV, CV_internal}
        return new{CV, CV_internal}(num, cv_internal)
    end
end

function distribute_to_workers(backend, n_workers, cv::CellValues)
    cva = as_structure_of_arrays(backend, n_workers, cv)
    CV = typeof(get_substruct(cva, 1))
    return DistributedVals{CV}(cva, n_workers)
end

function Base.getindex(d::DistributedVals{CV}, i) where {CV}
    checkbounds(Bool, 1:d.num, i) || throw(BoundsError(d, i))
    return get_substruct(d.cv, i)::CV
end
```

The reason I'm against exposing
```julia
# NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the
# immutable variant. Making the CellCache immutable is considered breaking due to the
# reinit! API integration.
struct ImmutableCellCache{G <: AbstractGrid, SDH, IVT, VX}
    flags::UpdateFlags
    grid::G
    cellid::Int
    nodes::IVT
    coords::VX
    sdh::SDH
    dofs::IVT
end
```
NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.
I think we can, by changing lines 35 to 46 in 34c457e to
```julia
mutable struct Mutable{T}
    v::T
end

struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CID, NodeV <: AbstractVector{<:Integer}, CoordV <: AbstractVector{X}, DofV <: AbstractVector{<:Integer}}
    flags::UpdateFlags
    grid::G
    # Pretty useless to store this since you have it already for the reinit! call, but
    # needed for the CellIterator(...) workflow since the user doesn't necessarily control
    # the loop order in the cell subset.
    cellid::CID
    nodes::NodeV
    coords::CoordV
    dh::DH
    dofs::DofV
end
```

where `CID` normally is a `Mutable{Int}`, but for GPU this can be `adapt`:ed to a view into a `CuVector`?
This basically re-introduces the old design, which I think makes sense here instead of duplicating?
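The pattern under discussion — an immutable struct whose "mutable" field is a one-element container, which can then be `adapt`:ed to a view into a device array — can be sketched with toy types (hypothetical names, not Ferrite's actual structs):

```julia
# Toy illustration: an immutable struct stays GPU-friendly, but a field
# holding a one-element vector (or a view into a larger array) can still
# be updated in place, mimicking a mutable `cellid::Int` field.
struct ToyCellCache{CID <: AbstractVector{Int}}
    cellid::CID
end

toy_reinit!(cc::ToyCellCache, i::Int) = (cc.cellid[1] = i; cc)
toy_cellid(cc::ToyCellCache) = cc.cellid[1]

backing = zeros(Int, 8)                 # e.g. one slot per worker
cc = ToyCellCache(view(backing, 3:3))   # worker 3's slot
toy_reinit!(cc, 17)
toy_cellid(cc)   # 17, and backing[3] == 17
```

The trade-off raised below — the extra round trip through global memory for the cellid — is exactly the cost of the backing array here.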
I would go for the duplication here. Having everything funneled into a single data structure seems to be quite painful to maintain. My working patch here is
diff --git a/ext/FerriteKAExt/adapt_core.jl b/ext/FerriteKAExt/adapt_core.jl
index cbccc539a..969a9ed75 100644
--- a/ext/FerriteKAExt/adapt_core.jl
+++ b/ext/FerriteKAExt/adapt_core.jl
@@ -1,6 +1,7 @@
# This file contains adapt rules for all relevant data structures in Ferrite.jl which do
# not need customized GPU data structures.
+Adapt.@adapt_structure CellCache
Adapt.@adapt_structure CellValues
Adapt.@adapt_structure Ferrite.GeometryMapping
Adapt.@adapt_structure Ferrite.FunctionValues
diff --git a/ext/FerriteKAExt/iterator.jl b/ext/FerriteKAExt/iterator.jl
index 3f89ecc4a..d20f7f00f 100644
--- a/ext/FerriteKAExt/iterator.jl
+++ b/ext/FerriteKAExt/iterator.jl
@@ -34,9 +34,23 @@ function as_structure_of_arrays(backend, outer_dim, ::Type{CellCache}, sdh::Devi
N = Ferrite.nnodes_per_cell(sdh)
nodes = KA.zeros(backend, Int, outer_dim, N)
coords = KA.zeros(backend, Vec{dim, get_coordinate_eltype(grid)}, outer_dim, N)
+ cellids = KA.zeros(backend, Int, outer_dim, 1)
dofs = KA.zeros(backend, Int, outer_dim, n)
end
- return ImmutableCellCache(flags, grid, -1, nodes, coords, sdh, dofs)
+ return CellCache(flags, grid, cellids, nodes, coords, sdh, dofs)
+end
+function get_substruct(i, cc::CellCache, cellid)
+ return CellCache(
+ cc.flags, cc.grid, view(cc.cellid, i, :),
+ view(cc.nodes, i, :), view(cc.coords, i, :), cc.dh, view(cc.dofs, i, :)
+ )
+end
+function Ferrite.reinit!(cc::CellCache{<:Any,<:DeviceGrid}, cellid::Int)
+ cc.cellid[1] = cellid # TODO remove this in future versions
+ cc.flags.nodes && Ferrite.cellnodes!(cc.nodes, cc.grid, cellid)
+ cc.flags.coords && Ferrite.getcoordinates!(cc.coords, cc.grid, cellid)
+ cc.dh !== nothing && cc.flags.dofs && Ferrite.celldofs!(cc.dofs, cc.dh, cellid)
+ return cc
end
function Ferrite.CellCache(backend, dh::HostDofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}
diff --git a/src/iterators.jl b/src/iterators.jl
index f432455fa..7eb3f244a 100644
--- a/src/iterators.jl
+++ b/src/iterators.jl
@@ -32,24 +32,24 @@ cell. The cache is updated for a new cell by calling `reinit!(cache, cellid)` wh
See also [`CellIterator`](@ref).
"""
-mutable struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}}
- const flags::UpdateFlags
- const grid::G
+struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CIDType, IVType, CVType <: AbstractArray{X}}
+ flags::UpdateFlags
+ grid::G
# Pretty useless to store this since you have it already for the reinit! call, but
# needed for the CellIterator(...) workflow since the user doesn't necessarily control
# the loop order in the cell subset.
- cellid::Int
- const nodes::Vector{Int}
- const coords::Vector{X}
- const dh::DH
- const dofs::Vector{Int}
+ cellid::CIDType
+ nodes::IVType
+ coords::CVType
+ dh::DH
+ dofs::IVType
end
function CellCache(grid::Grid{dim, C, T}, flags::UpdateFlags = UpdateFlags()) where {dim, C, T}
N = nnodes_per_cell(grid, 1) # nodes and coords will be resized in `reinit!`
nodes = zeros(Int, N)
coords = zeros(Vec{dim, T}, N)
- return CellCache(flags, grid, -1, nodes, coords, nothing, Int[])
+ return CellCache(flags, grid, [-1], nodes, coords, nothing, Int[])
end
function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}
@@ -58,7 +58,7 @@ function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) wher
nodes = zeros(Int, N)
coords = zeros(Vec{dim, get_coordinate_eltype(get_grid(dh))}, N)
celldofs = zeros(Int, n)
- return CellCache(flags, get_grid(dh), -1, nodes, coords, dh, celldofs)
+ return CellCache(flags, get_grid(dh), [-1], nodes, coords, dh, celldofs)
end
function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
@@ -67,7 +67,7 @@ function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
end
function reinit!(cc::CellCache, i::Int)
- cc.cellid = i
+ cc.cellid[1] = i # TODO remove this in future versions
if cc.flags.nodes
resize!(cc.nodes, nnodes_per_cell(cc.grid, i))
cellnodes!(cc.nodes, cc.grid, i)
@@ -97,10 +97,10 @@ end
getnodes(cc::CellCache) = cc.nodes
getcoordinates(cc::CellCache) = cc.coords
celldofs(cc::CellCache) = cc.dofs
-cellid(cc::CellCache) = cc.cellid
+cellid(cc::CellCache) = cc.cellid[1]
# TODO: These should really be replaced with something better...
-nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cc.cellid))
+nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cellid(cc)))
"""
@@ -121,20 +121,20 @@ calling `reinit!(cache, fi::FacetIndex)`.
See also [`FacetIterator`](@ref).
"""
-mutable struct FacetCache{CC <: CellCache}
- const cc::CC # const for julia > 1.8
- const dofs::Vector{Int} # aliasing cc.dofs
- current_facet_id::Int
+struct FacetCache{CC <: CellCache, DVType, CFType}
+ cc::CC
+ dofs::DVType # aliasing cc.dofs
+ current_facet_id::CFType
end
function FacetCache(args...)
cc = CellCache(args...)
- return FacetCache(cc, cc.dofs, 0)
+ return FacetCache(cc, cc.dofs, [0])
end
function reinit!(fc::FacetCache, facet::BoundaryIndex)
cellid, facetid = facet
reinit!(fc.cc, cellid)
- fc.current_facet_id = facetid
+ fc.current_facet_id[1] = facetid
return nothing
end
@@ -148,7 +148,7 @@ for op in (:getnodes, :getcoordinates, :cellid, :celldofs)
end
@inline function reinit!(fv::FacetValues, fc::FacetCache)
- return reinit!(fv, fc.cc, fc.current_facet_id)
+ return reinit!(fv, fc.cc, fc.current_facet_id[1])
end
"""
diff --git a/docs/src/howto/gpu_heat_howto_literate.jl b/docs/src/howto/gpu_heat_howto_literate.jl
index 6efd99cdb..6931d0b25 100644
--- a/docs/src/howto/gpu_heat_howto_literate.jl
+++ b/docs/src/howto/gpu_heat_howto_literate.jl
@@ -68,7 +68,8 @@ end
cv_i = cv[task_index]
## Query work item cell cache. The call on the item initializes replaces the reinit! call.
- cc_i = cc[task_index](cellid)
+ cc_i = cc[task_index]
+ Ferrite.reinit!(cc_i, cellid)
## Query assembly buffer.
Ke = view(Kes, i, :, :)

If this is more about keeping the reinit! workflow on the GPU I can add it, but it will increase memory pressure, because we store the cellid in global memory just to eventually load it back from global memory, instead of keeping it locally.
Note that we cannot do any shenanigans like resizing on the GPU.
This PR has a fundamentally different design from the previous PRs. Here I try to preserve the assembly routine as-is and modify the assembly loop around them. This PR is really just an MWE to show the direction for discussion purposes.
The main vehicle for this PR is a SOA transformation of the CellCache and CellValues objects. I actually like it, but it might not be the best one performance-wise.
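The SOA transformation in miniature: one struct of `n_workers`-sized buffers replaces `n_workers` structs of scalar-sized buffers, and each worker receives a view-based slice. Toy types; the names `as_structure_of_arrays` and `get_substruct` mirror the PR's internals, but this is not the actual implementation:

```julia
# Toy SoA transformation: a single struct holding (n_workers × ndofs)
# storage replaces per-worker copies; `get_substruct` hands worker i a
# slice that aliases the shared storage (illustrative, not Ferrite API).
struct WorkerBuffers{M <: AbstractMatrix{Float64}}
    fe::M    # (n_workers × ndofs) block of element vectors
end

as_structure_of_arrays(n_workers, ndofs) = WorkerBuffers(zeros(n_workers, ndofs))
get_substruct(wb::WorkerBuffers, i) = view(wb.fe, i, :)

wb = as_structure_of_arrays(4, 3)
fe1 = get_substruct(wb, 1)
fe1 .= 1.0   # worker 1 fills its slice; the other rows are untouched
```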
Closes #628.