I was looking into ways to accelerate dense matrix-vector products (the naive $\mathcal{O}(N^2)$ approach) for matrices defined through a kernel, and I think it would be very useful to support a matrix-free implementation that can execute the double for loop efficiently across different hardware backends, particularly GPUs.
Libraries such as KeOps demonstrate that this approach can be highly effective. Some preliminary benchmarks on my Apple M3 GPU using Metal suggest that, for simple kernels such as Laplace, a GPU-accelerated matrix-free implementation can be competitive with hierarchical matrices or FMM up to surprisingly large problem sizes, extending to hundreds of thousands of DOFs.
After some poking around, it seems that implementing this through KernelAbstractions should not be too difficult.
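
A minimal sketch of what this could look like, assuming KernelAbstractions 0.9 or later. The names `laplace`, `matvec_kernel!`, and `kernel_matvec!` are hypothetical and only for illustration, and points are stored as `NTuple{3,T}` to keep the example free of other dependencies. Each work-item handles one row and runs the inner loop over all columns, so the matrix is never assembled:

```julia
using KernelAbstractions

# Hypothetical kernel function: 3D Laplace Green's function 1/(4π|x-y|),
# set to zero when the two points coincide.
@inline function laplace(x::NTuple{3,T}, y::NTuple{3,T}) where {T}
    d2 = (x[1] - y[1])^2 + (x[2] - y[2])^2 + (x[3] - y[3])^2
    return d2 == 0 ? zero(T) : inv(4 * T(pi) * sqrt(d2))
end

# One work-item per row i: accumulate sum_j K(X[i], Y[j]) * v[j].
@kernel function matvec_kernel!(out, X, Y, v)
    i = @index(Global)
    xi = X[i]
    acc = zero(eltype(out))
    for j in 1:length(Y)
        acc += laplace(xi, Y[j]) * v[j]
    end
    out[i] = acc
end

# Launch on whichever backend `out` lives on (plain `Array`s use the CPU backend).
function kernel_matvec!(out, X, Y, v; workgroupsize = 64)
    backend = KernelAbstractions.get_backend(out)
    matvec_kernel!(backend, workgroupsize)(out, X, Y, v; ndrange = length(out))
    KernelAbstractions.synchronize(backend)
    return out
end

# Small CPU example, checked against the explicitly assembled dense matrix.
N = 100
X = [(rand(), rand(), rand()) for _ in 1:N]
v = rand(N)
out = kernel_matvec!(zeros(N), X, X, v)
A = [laplace(x, y) for x in X, y in X]
```

In principle the same code should run on a GPU by passing device arrays (for example `Float32` data in an `MtlArray` from Metal.jl) instead of `Array`s, though this sketch has only been written with the CPU backend in mind. A tuned version would likely tile the columns through local memory, which is where most of the performance would come from.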