Suggestion: profiling #12

@philipturner

Apple has highly optimized DL primitives in Metal Performance Shaders (MPS). By partially decompiling MPS, I saw an insane number of kernel permutations; they optimized for all sorts of edge cases. I saw the term "winograd" once or twice in function names. I tried comparing my own custom Metal shaders to Apple's MPS and mine were terrible, but I imagine you have more time to thoroughly investigate performance deltas.

Given that Metal works on most AMD and Intel GPUs, it would be wise to run your OpenCL code on macOS and compare its performance to Apple's. That would show how close your kernels come to utilizing the GPU as much as physically possible. Another suggestion is to compare against DirectML, although I suspect that Apple's library is more optimized due to the sheer number of permutations they created. You can examine the DirectML source code to see if Microsoft takes the permutation approach too.
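
To make the comparison concrete, here is a minimal sketch of how the numbers could be collected and normalized. It is not tied to any particular library: `run` stands for a hypothetical callable that launches one convolution on the backend under test (your OpenCL kernel or the MPS equivalent) and blocks until the GPU has finished, and all function names are my own. The FLOP count assumes a direct dense convolution, so a backend that really does use Winograd may report an "effective" GFLOPS figure above what a direct kernel could reach on the same hardware.

```python
import statistics
import time


def conv2d_flops(batch, c_in, c_out, h_out, w_out, k_h, k_w):
    """FLOPs of a direct dense 2D convolution (one multiply-add = 2 FLOPs)."""
    return 2 * batch * c_out * h_out * w_out * c_in * k_h * k_w


def median_seconds(run, warmup=3, repeats=10):
    """Median wall-clock time of run().

    run() is a hypothetical callable; it must block until the GPU work
    is complete, otherwise only the launch overhead is measured.
    """
    for _ in range(warmup):
        run()  # warm-up iterations absorb compilation and cache effects
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def gflops(flops, seconds):
    """Achieved throughput in GFLOPS for a given FLOP count and time."""
    return flops / seconds / 1e9


def relative_throughput(ours_seconds, reference_seconds):
    """Fraction of the reference backend's speed that our kernel reaches."""
    return reference_seconds / ours_seconds
```

Running `median_seconds` once per backend on the same layer shape and passing both results to `relative_throughput` gives a single number per shape, which makes it easy to spot the edge cases where the gap is largest.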
