A synthetic benchmarking tool that measures the peak capabilities of Vulkan devices. It only measures the peak metrics achievable with vector operations and does not represent a real-world use case.
Download Windows/Linux/macOS executables for Intel/AMD/NVIDIA/Apple GPUs
https://github.com/nihui/vkpeak/releases
vkpeak.exe
vkpeak will choose the default Vulkan device (device id 0).
If you need to specify a device id, pass it as the first argument:
vkpeak.exe 0
The device_id parameter is optional and defaults to 0.
By default, if you do not pass a scenario argument, vkpeak runs all available scenarios.
You can optionally select one scenario:
vkpeak.exe fp16-matrix
Or with an explicit device id:
vkpeak.exe 0 fp16-matrix
Or run a comma-separated list of scenarios:
vkpeak.exe 0 fp16-matrix,int8-matrix,copy-d2d
You can also explicitly request all scenarios:
vkpeak.exe 0 all
Available scenario names:
- fp32-scalar
- fp32-vec4
- fp16-scalar
- fp16-vec4
- fp16-matrix
- fp64-scalar
- fp64-vec4
- int32-scalar
- int32-vec4
- int16-scalar
- int16-vec4
- int64-scalar
- int64-vec4
- int8-dotprod
- int8-matrix
- bf16-dotprod
- bf16-matrix
- fp8-matrix
- bf8-matrix
- copy-h2h
- copy-h2d
- copy-d2h
- copy-d2d
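The invocation forms above can also be assembled from a script. The following Python sketch is not part of vkpeak: `build_vkpeak_command` and `SCENARIOS` are hypothetical names introduced here, and the executable name `vkpeak.exe` is simply the one used in the examples above. It only checks scenario names against the list in this README and builds an argument list; it does not run the benchmark.

```python
# Hypothetical helper, not part of vkpeak. It mirrors the invocation forms
# shown in this README: vkpeak.exe [device_id] [scenario[,scenario...]]

# Scenario names copied from the list in this README.
SCENARIOS = [
    "fp32-scalar", "fp32-vec4", "fp16-scalar", "fp16-vec4", "fp16-matrix",
    "fp64-scalar", "fp64-vec4", "int32-scalar", "int32-vec4",
    "int16-scalar", "int16-vec4", "int64-scalar", "int64-vec4",
    "int8-dotprod", "int8-matrix", "bf16-dotprod", "bf16-matrix",
    "fp8-matrix", "bf8-matrix",
    "copy-h2h", "copy-h2d", "copy-d2h", "copy-d2d",
]


def build_vkpeak_command(device_id=0, scenarios=None, exe="vkpeak.exe"):
    """Return an argv list for the given device id and optional scenarios."""
    if device_id < 0:
        raise ValueError("device_id must be non-negative")
    argv = [exe, str(device_id)]
    if scenarios:
        # "all" is accepted alongside the individual scenario names.
        unknown = [s for s in scenarios if s != "all" and s not in SCENARIOS]
        if unknown:
            raise ValueError("unknown scenario(s): " + ", ".join(unknown))
        # Multiple scenarios are passed as one comma-separated argument.
        argv.append(",".join(scenarios))
    return argv


print(build_vkpeak_command(0, ["fp16-matrix", "int8-matrix", "copy-d2d"]))
```

The returned list can be handed to `subprocess.run` as-is. On Linux or macOS, pass a different `exe` value matching the name of the downloaded binary.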
If you encounter a crash or error, try upgrading your GPU driver:
- Intel: https://downloadcenter.intel.com/product/80939/Graphics-Drivers
- AMD: https://www.amd.com/en/support
- NVIDIA: https://www.nvidia.com/Download/index.aspx
- Clone this project with all submodules
git clone https://github.com/nihui/vkpeak.git
cd vkpeak
git submodule update --init --recursive
- Build with CMake
- On macOS, you can pass the -DVulkan_LIBRARY=<path to your macOS/lib/MoltenVK.xcframework/macos-arm64_x86_64/libMoltenVK.a> option to link the static MoltenVK library. MoltenVK is part of the Vulkan SDK from https://vulkan.lunarg.com/
mkdir build
cd build
cmake ..
cmake --build . -j 4
NVIDIA RTX5060Ti 16GB
device = NVIDIA GeForce RTX 5060 Ti
fp32-scalar = 17137.46 GFLOPS
fp32-vec4 = 16910.07 GFLOPS
fp16-scalar = 12730.03 GFLOPS
fp16-vec4 = 12715.02 GFLOPS
fp16-matrix = 101485.35 GFLOPS
fp64-scalar = 398.59 GFLOPS
fp64-vec4 = 394.08 GFLOPS
int32-scalar = 12703.68 GIOPS
int32-vec4 = 12181.98 GIOPS
int16-scalar = 12690.05 GIOPS
int16-vec4 = 12208.29 GIOPS
int64-scalar = 3104.59 GIOPS
int64-vec4 = 2666.86 GIOPS
int8-dotprod = 16101.59 GIOPS
int8-matrix = 202947.80 GIOPS
bf16-dotprod = 0.00 GFLOPS
bf16-matrix = 0.00 GFLOPS
fp8-matrix = 0.00 GFLOPS
bf8-matrix = 0.00 GFLOPS
copy-h2h = 18.17 GBPS
copy-h2d = 17.93 GBPS
copy-d2h = 18.09 GBPS
copy-d2d = 190.70 GBPS
AMD RX9060XT 16GB
device = AMD Radeon Graphics (RADV GFX1200)
fp32-scalar = 17606.54 GFLOPS
fp32-vec4 = 12155.22 GFLOPS
fp16-scalar = 16921.16 GFLOPS
fp16-vec4 = 27833.48 GFLOPS
fp16-matrix = 105337.66 GFLOPS
fp64-scalar = 442.80 GFLOPS
fp64-vec4 = 437.55 GFLOPS
int32-scalar = 2804.59 GIOPS
int32-vec4 = 2796.74 GIOPS
int16-scalar = 15034.62 GIOPS
int16-vec4 = 26356.38 GIOPS
int64-scalar = 932.14 GIOPS
int64-vec4 = 768.53 GIOPS
int8-dotprod = 53893.32 GIOPS
int8-matrix = 194476.41 GIOPS
bf16-dotprod = 24427.68 GFLOPS
bf16-matrix = 105099.82 GFLOPS
fp8-matrix = 205061.72 GFLOPS
bf8-matrix = 208234.02 GFLOPS
copy-h2h = 21.05 GBPS
copy-h2d = 21.17 GBPS
copy-d2h = 23.70 GBPS
copy-d2d = 145.23 GBPS
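Result listings like the two above follow a regular `name = value UNIT` layout, so they are easy to post-process. The following Python sketch is an illustration only: `parse_results` is a hypothetical helper, and it assumes the output format matches the samples shown here (units `GFLOPS`, `GIOPS`, or `GBPS`). Lines that do not fit the pattern, such as the `device = ...` line, are skipped.

```python
import re

# Matches lines such as "fp32-scalar = 17137.46 GFLOPS".
# Assumption: the units are limited to those seen in the samples above.
LINE_RE = re.compile(
    r"^\s*([\w-]+)\s*=\s*([0-9]+(?:\.[0-9]+)?)\s+(GFLOPS|GIOPS|GBPS)\s*$"
)


def parse_results(text):
    """Map each scenario name to a (value, unit) tuple."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            name, value, unit = m.groups()
            results[name] = (float(value), unit)
    return results


# A few lines taken from the RTX 5060 Ti listing above.
sample = """device = NVIDIA GeForce RTX 5060 Ti
fp32-scalar = 17137.46 GFLOPS
fp16-matrix = 101485.35 GFLOPS
bf16-matrix = 0.00 GFLOPS
copy-d2d = 190.70 GBPS
"""

results = parse_results(sample)

# Ratio of fp16 matrix throughput to fp32 scalar throughput.
ratio = results["fp16-matrix"][0] / results["fp32-scalar"][0]
print(f"fp16-matrix / fp32-scalar = {ratio:.2f}")

# Scenarios that reported a non-zero figure.
print(sorted(name for name, (value, _) in results.items() if value > 0))
```

Keeping the unit next to each value avoids accidentally comparing a GFLOPS figure with a GBPS one.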
- https://github.com/Tencent/ncnn for fast neural network inference on ALL PLATFORMS