Add MicroBenchmark for Small Trip Count Loop vectorization #404
Stylie777 wants to merge 3 commits into
For targets where getMinTripCountTailFoldingThreshold returns a value greater than zero, llvm/llvm-project#195823 has enabled better vectorization of loops where applicable. This microbenchmark is intended to show the impact of these changes on the relevant targets. For targets where getMinTripCountTailFoldingThreshold returns zero, there will be no effect on runtime when comparing scalar vs. vector.
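For readers without the diff at hand, here is a minimal sketch of the kind of scalar/vector pair the benchmark compares. It assumes a Clang or GCC style toolchain. The macro definitions and the `LOOP_VECTORIZE_DISABLE` name are hypothetical stand-ins (the real helpers in the test-suite may be defined differently), and `loopTc5Scalar` is assumed to mirror `loopTc5Vector` from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the helper macros used in the diff; the real
// definitions may differ.
#define NOINLINE __attribute__((noinline))
#define LOOP_VECTORIZE_ENABLE _Pragma("clang loop vectorize(enable)")
#define LOOP_VECTORIZE_DISABLE                                                 \
  _Pragma("clang loop vectorize(disable) interleave(disable)")

// Scalar baseline: the pragma asks Clang not to vectorize this loop.
template <typename Ty>
NOINLINE void loopTc5Scalar(const Ty *__restrict A, Ty *__restrict B) {
  LOOP_VECTORIZE_DISABLE
  for (uint64_t I = 0; I != 5; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}

// Vector candidate: same body, but the pragma asks Clang to vectorize.
// Whether and how the loop is vectorized still depends on the target.
template <typename Ty>
NOINLINE void loopTc5Vector(const Ty *__restrict A, Ty *__restrict B) {
  LOOP_VECTORIZE_ENABLE
  for (uint64_t I = 0; I != 5; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}
```

Both functions compute the same result, so any runtime difference between them comes from the code the compiler generates for the loop rather than from the work performed.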
    g_small_loop_trip_count_sum ^= checksum(B);
    benchmark::DoNotOptimize(g_small_loop_trip_count_sum);
    State.SetItemsProcessed(State.iterations() * 5);
Would be good to comment why this is needed
It's not. I missed this when first reviewing the Codex-generated benchmark. I've removed it.
        B[I] = A[I] + static_cast<Ty>(1);
    }

    NOINLINE void loopTc5I64InterleaveCount2Vector(const uint64_t *__restrict A,
Is there a reason to not use the templated version for this one as well?
No, there isn't. I have made it consistent now.
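As a rough illustration of what a templated interleave-count-2 variant could look like, here is a sketch. The function name, the loop body, and the use of Clang's `interleave_count(2)` loop pragma are assumptions made for illustration; the actual code in the PR may request interleaving differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the NOINLINE helper macro used in the diff.
#define NOINLINE __attribute__((noinline))

// Assumed shape: the same trip-count-5 loop as the other variants, with an
// additional request to interleave by 2. Templating on Ty lets one
// definition serve i8 through i64 instead of a dedicated uint64_t function.
template <typename Ty>
NOINLINE void loopTc5InterleaveCount2Vector(const Ty *__restrict A,
                                            Ty *__restrict B) {
  _Pragma("clang loop vectorize(enable) interleave_count(2)")
  for (uint64_t I = 0; I != 5; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}
```

With this shape, the i64 case is just the `uint64_t` instantiation, which keeps it consistent with the other templated loops.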
    BENCHMARK_TEMPLATE(benchTc5Scalar, uint16_t)->Name("tc5/i16/scalar");
    BENCHMARK_TEMPLATE(benchTc5Vector, uint32_t)->Name("tc5/i32/vector");
    BENCHMARK_TEMPLATE(benchTc5Scalar, uint32_t)->Name("tc5/i32/scalar");
    BENCHMARK_TEMPLATE(benchTc5Vector, uint64_t)->Name("tc5/i64/vector");
I think the potential worst case would be i64 with TC=3. Could you also cover this?
I have added cases for all data types for TC=3 for full coverage.
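One possible way to cover several trip counts without duplicating the loop is to make the trip count a template parameter. The sketch below is not necessarily how the PR does it: `loopSmallTcVector` and `matchesReference` are hypothetical names, and the macro is a stand-in for the helper used in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the NOINLINE helper macro used in the diff.
#define NOINLINE __attribute__((noinline))

// TC is a compile-time constant, so each instantiation has a known small
// trip count, just like the hand-written TC=5 loop.
template <typename Ty, uint64_t TC>
NOINLINE void loopSmallTcVector(const Ty *__restrict A, Ty *__restrict B) {
  _Pragma("clang loop vectorize(enable)")
  for (uint64_t I = 0; I != TC; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}

// Hypothetical helper: runs one instantiation and compares each element
// against the expected A[I] + 1.
template <typename Ty, uint64_t TC> bool matchesReference() {
  Ty A[TC], B[TC];
  for (uint64_t I = 0; I != TC; ++I) {
    A[I] = static_cast<Ty>(I + 10);
    B[I] = 0;
  }
  loopSmallTcVector<Ty, TC>(A, B);
  for (uint64_t I = 0; I != TC; ++I)
    if (B[I] != static_cast<Ty>(A[I] + 1))
      return false;
  return true;
}
```

For example, `matchesReference<uint64_t, 3>()` exercises the i64, TC=3 case raised above, and `matchesReference<uint8_t, 3>()` the narrowest element type at the same trip count.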
    NOINLINE void loopTc5Vector(const Ty *__restrict A, Ty *__restrict B) {
      LOOP_VECTORIZE_ENABLE
      for (uint64_t I = 0; I != 5; ++I)
        B[I] = A[I] + static_cast<Ty>(1);
This is a case where there is basically no overhead for the vector code compared to the scalar code.
Would be good to also include cases where there is some overhead from the vector code compared to scalar, e.g. some scalarization.
I have added an example that has scalarization in the loop. If this is not what you meant, please let me know!
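To illustrate the kind of loop that can involve scalarization, here is a sketch using integer division. It is an assumption, not the example added in the PR: on targets without a vector integer divide instruction the vectorizer typically has to scalarize the division, but whether that happens depends on the target and element type. `loopTc5DivVector` is a hypothetical name and the macros are stand-ins.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the helper macros used in the diff.
#define NOINLINE __attribute__((noinline))
#define LOOP_VECTORIZE_ENABLE _Pragma("clang loop vectorize(enable)")

// The division by a loop-varying divisor is the operation that may be
// scalarized: each lane is extracted, divided as a scalar, and reinserted.
// Callers must pass non-zero divisors in C.
template <typename Ty>
NOINLINE void loopTc5DivVector(const Ty *__restrict A, const Ty *__restrict C,
                               Ty *__restrict B) {
  LOOP_VECTORIZE_ENABLE
  for (uint64_t I = 0; I != 5; ++I)
    B[I] = A[I] / C[I];
}
```

Compared with the `A[I] + 1` loop, a case like this gives the vector version extra per-lane work, which is where overhead relative to the scalar loop could show up.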
Assisted-by: Codex