perf : Optimize count distinct using bitmaps instead of hashsets for smaller datatypes#21456
perf : Optimize count distinct using bitmaps instead of hashsets for smaller datatypes#21456coderfender wants to merge 17 commits intoapache:mainfrom
Conversation
|
benchmark results : It seems like we are 25x faster for u16 bitmap based accumulators (or I am sleepy :) ) |
|
I think we can do the same for 16 bit types, it is just 65_536 bytes 8192 if we use a bitmap. |
|
Oh wait, you're already doing that :) |
|
Query 0 in clickbench_extended dataset (which uses (Other queries are faster but I believe that is more around variance ) |
|
cc : @neilconway , @alamb , @martin-g . Please take a look whenever you get a chance |
This comment has been minimized.
This comment has been minimized.
alamb
left a comment
There was a problem hiding this comment.
This looks like a great idea. Thank you @coderfender
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
Outdated
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
93acd98 to
7e67e2e
Compare
|
run benchmark count_distinct |
|
run benchmark clickbench |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize_count_distinct (a13bcaa) to fbdf770 (merge-base) diff File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize_count_distinct (a13bcaa) to fbdf770 (merge-base) diff File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
3db92e3 to
df90ef5
Compare
|
Update benchmarks after rebase with main Command : |
|
run benchmark count_distinct |
|
run benchmark clickbench_partitioned |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize_count_distinct (61dc8e1) to eaf0a41 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize_count_distinct (61dc8e1) to eaf0a41 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmark clickbench_extended |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize_count_distinct (61dc8e1) to eaf0a41 (merge-base) diff using: clickbench_extended File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_extended — base (merge-base)
clickbench_extended — branch
File an issue against this benchmark runner |
THe benchmark runner doesn't seem to be able to reproduce that result 🤔 Also the microbenchmarks seem to still show a 10% slowdown: #21456 (comment) For i64 which isn't changed by this patch |
|
@alamb , we added bitmaps for only 8 and 16 bit datatypes . The benchmarks show a 8x - 25x speedup compared to main |
|
Investigating why untouched paths are running slower |
|
Recent benchmarks : ( not sure if this is due to variance with the build env ) |
|
run benchmark count_distinct |
datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
Outdated
Show resolved
Hide resolved
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize_count_distinct (61dc8e1) to eaf0a41 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
Okay this still seems to be an issue. Let me try and see if I can add additional hints to the compiler and see if that helps not regress existing hotpaths for i64 |
|
Hi @coderfender, thanks for the request (#21456 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
Which issue does this PR close?
Remove hashset based accumulators for smaller int data types and use bitmaps. Follow up of : #21453
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?