There are two problems with the SSE tests. One is fixable, but the other is more troubling:
- GCC will replace the SSE (128-bit) intrinsics in the `sse_*` tests with AVX instructions when AVX is enabled at build time, presumably because it sees no drawback to using AVX over SSE. (Why they would do this I have no idea...)
  This is annoying, but it could be solved by replacing any `-mavx` build flags with `-msse` in the `sse*.c` builds.
- Even when explicit SSE instructions are used, the performance behaves as if they were AVX instructions! If (the misnamed) `VADDPS_LATENCY` is set to the SSE latency, then the test produces the expected SSE performance (FLOP/s = 4 x freq). But increasing this "latency" (which is actually the loop unrolling factor) continues to raise the FLOP rate, up to the AVX rate (8 x freq). When registers are depleted, the rate starts to drop (as expected).
  I have counted the number of FLOPs via perf, and it reports the correct number: if `r_max` is hard-coded to 100 million with a latency of 5, then the test produces ~2 billion FLOPs (4 x 5 x 0.1e9 = 2e9).
  I cannot find any error in the `sse.c` timings, so I am worried that it may be an optimization happening inside the CPU. I am very reluctant to pursue this possibility, so I am just noting it here and moving on.
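
To make the FLOP arithmetic in the second item concrete, here is a minimal sketch of an unrolled SSE add kernel. It is not the actual `sse.c` code: `UNROLL` is a hypothetical stand-in for `VADDPS_LATENCY`, and `sse_add_kernel` is an illustrative name. It assumes an x86 target with SSE available. Each iteration issues `UNROLL` independent 4-wide adds, so the count is 4 x `UNROLL` x `r_max`.

```c
#include <stdint.h>
#include <xmmintrin.h>   /* SSE: __m128, _mm_set1_ps, _mm_add_ps, _mm_storeu_ps */

#define UNROLL 5   /* hypothetical stand-in for VADDPS_LATENCY */

/* Runs r_max iterations of UNROLL independent packed adds, stores the
   folded result in out[0..3], and returns the number of single-precision
   additions performed inside the timed loop. */
static uint64_t sse_add_kernel(uint64_t r_max, float *out)
{
    __m128 acc[UNROLL];
    const __m128 inc = _mm_set1_ps(1.0f);

    for (int i = 0; i < UNROLL; i++)
        acc[i] = _mm_setzero_ps();

    for (uint64_t r = 0; r < r_max; r++)
        for (int i = 0; i < UNROLL; i++)       /* independent dependency chains */
            acc[i] = _mm_add_ps(acc[i], inc);

    /* Fold the chains so the compiler cannot discard the work. */
    __m128 sum = acc[0];
    for (int i = 1; i < UNROLL; i++)
        sum = _mm_add_ps(sum, acc[i]);
    _mm_storeu_ps(out, sum);

    return 4u * UNROLL * r_max;   /* 4 lanes per 128-bit add */
}
```

With `r_max` set to 100 million, the returned count is 4 x 5 x 1e8 = 2e9, which is the figure perf reports above. The count says nothing about how many of these adds the CPU can retire per cycle, which is the open question here.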
If these problems are not resolved, then it may be better to just disable the SSE timing test for now.
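
For the first problem, one possible safeguard is a compile-time check on `__AVX__`, which GCC and Clang predefine when AVX code generation is enabled (for example by `-mavx`). The sketch below is only an illustration; `built_with_avx` is a hypothetical helper, not something in the existing tests.

```c
/* Returns 1 if this translation unit was compiled with AVX enabled
   (the compiler predefines __AVX__ in that case), otherwise 0. */
static int built_with_avx(void)
{
#ifdef __AVX__
    return 1;
#else
    return 0;
#endif
}
```

The SSE test could print a warning when this returns 1, or the `sse*.c` sources could instead use `#ifdef __AVX__` with `#error` to refuse an AVX build outright.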