Apple has highly optimized DL primitives in Metal Performance Shaders (MPS). By partially decompiling MPS, I saw an enormous number of kernel permutations; they optimized for all sorts of edge cases. I also saw the term "winograd" once or twice in function names, which suggests they use Winograd convolution for at least some cases. I tried comparing my own custom Metal shaders to Apple's MPS and mine were terrible, but I imagine you have more time to thoroughly investigate the performance deltas.
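For context on the "winograd" name: it usually refers to Winograd's minimal filtering algorithms, which trade multiplications for additions in small convolutions. I don't know which variant or tile size Apple uses, so the following is only a minimal sketch of the smallest textbook case, F(2,3), which computes two outputs of a 3-tap 1D filter with 4 multiplies instead of 6. The function names are mine, for illustration only.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap 1D correlation, computed directly (6 multiplies)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3) (4 multiplies on the data path)."""
    # Filter transform: depends only on g, so it can be precomputed once.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform (additions only) followed by the 4 multiplies.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    # Output transform (additions only).
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]   # 4 input samples
    g = [0.5, -1.0, 2.0]       # 3 filter taps
    print(direct_f23(d, g), winograd_f23(d, g))
```

The 2D versions used for 3x3 convolutions nest this transform in both dimensions, which is where the larger savings come from, at the cost of extra additions and some numerical error.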
Given that Metal works on most AMD and Intel GPUs, it would be wise to run your OpenCL code on macOS and compare your performance to Apple's on the same hardware. That would show how close your kernels come to utilizing the GPU as much as physically possible. Another suggestion is to try comparing against DirectML, although I suspect that Apple's is more optimized due to the sheer number of permutations they created. You could also look through DirectML's source code, to the extent it is published, to see whether Microsoft takes the permutation approach too.
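A comparison like this could be set up with a small timing harness along the following lines. This is only a sketch: `run` is a hypothetical callable that launches one convolution (your OpenCL kernel, or the MPS/DirectML equivalent) and blocks until the GPU has finished, and the peak FLOP/s figure is something you would have to look up for your specific GPU.

```python
import statistics
import time


def median_seconds(run, warmup=3, iters=20):
    """Median wall-clock time of run(), after a few untimed warm-up calls."""
    for _ in range(warmup):
        run()  # lets drivers compile/cache kernels before timing starts
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run()  # assumed to block until the GPU work is complete
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def conv2d_flops(n, c_in, c_out, h_out, w_out, k_h, k_w):
    """FLOPs of a direct 2D convolution, counting 2 per multiply-accumulate."""
    return 2 * n * c_out * h_out * w_out * c_in * k_h * k_w


def utilization(flops, seconds, peak_flops_per_s):
    """Fraction of the GPU's theoretical peak that was achieved."""
    return flops / seconds / peak_flops_per_s
```

Feeding the same layer shape to both implementations and comparing the two `utilization` values gives a per-shape delta. Since the FLOP count assumes a direct convolution, an implementation that uses Winograd or similar tricks can report a higher effective number than its real arithmetic rate.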