Commit c74ed91
Estimate aggregate output rows using existing NDV statistics (#20926)
## Which issue does this PR close?
Part of #20766
## Rationale for this change
Grouped aggregations currently estimate output rows as input_rows,
ignoring available NDV (number of distinct values) statistics. Spark's
AggregateEstimation and Trino's AggregationStatsRule both use NDV
products to tighten this estimate. This PR draws heavily on both.
- [Spark reference](https://github.com/apache/spark/blob/e8d8e6a8d040d26aae9571e968e0c64bda0875dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala#L38-L61)
- [Trino reference](https://github.com/trinodb/trino/blob/43c8c3ba8bff814697c5926149ce13b9532f030b/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java#L92-L101)
## What changes are included in this PR?
- Estimate aggregate output rows as min(input_rows, product(NDV_i +
null_adj_i) * grouping_sets)
- Cap the estimate by the Top K limit when one is active, since the
output row count cannot exceed K
- Propagate distinct_count from child stats to group-by output columns
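
The row estimate described in the list above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: `KeyStats` and `estimate_output_rows` are hypothetical names, `null_adj_i` is assumed to be 1 when a key column may contain NULLs (NULL forms its own group) and 0 otherwise, and falling back to `input_rows` when any key lacks an NDV is also an assumption.

```rust
/// Hypothetical per-column statistics for one group-by key.
/// These are NOT DataFusion's actual types.
#[derive(Debug, Clone, Copy)]
struct KeyStats {
    /// Number of distinct non-null values, if known.
    ndv: Option<u64>,
    /// Whether the column may contain NULLs (assumed to add one extra group).
    has_nulls: bool,
}

/// Estimate aggregate output rows as
/// min(input_rows, product(NDV_i + null_adj_i) * grouping_sets),
/// then cap the result by an optional Top K limit.
fn estimate_output_rows(
    input_rows: u64,
    keys: &[KeyStats],
    grouping_sets: u64,
    top_k: Option<u64>,
) -> u64 {
    // Multiply (NDV_i + null_adj_i) over all keys; yields None as soon as
    // one key has no NDV statistic. Saturating math avoids overflow.
    let product = keys.iter().try_fold(1u64, |acc, k| {
        let ndv = k.ndv?;
        let null_adj = u64::from(k.has_nulls);
        Some(acc.saturating_mul(ndv.saturating_add(null_adj)))
    });

    // Assumption: without complete NDV statistics, keep the old estimate.
    let estimate = match product {
        Some(p) => p.saturating_mul(grouping_sets).min(input_rows),
        None => input_rows,
    };

    // A Top K limit bounds the output regardless of the NDV product.
    match top_k {
        Some(k) => estimate.min(k),
        None => estimate,
    }
}
```

For example, with 1000 input rows and two keys (NDV 10 without NULLs, NDV 5 with NULLs), the product is 10 * (5 + 1) = 60, so the estimate drops from 1000 to 60; a Top K limit of 20 would lower it further to 20.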
## Are these changes tested?
Yes, by existing and new tests that cover different scenarios and edge cases.
## Are there any user-facing changes?
No.

Parent commit: 8a48a87
2 files changed (+652, -73), under `datafusion/core/tests/physical_optimizer` and `datafusion/physical-plan/src/aggregates`.