Search before asking
Motivation
metadata.stats-mode rendered useless for Trino
Users who configure metadata.stats-mode = 'full' or metadata.stats-mode = 'truncate(16)' receive no benefit in Trino. Statistics are written to manifests correctly but discarded before they can be used for pruning.
Solution
Summary
TrinoSplitManager calls .dropStats() before newScan().plan(), which strips manifest-level column statistics from DataFileMeta entries before predicate evaluation can use them. As a result, Trino cannot use metadata.stats-mode min/max statistics for file-level skipping — all files become splits regardless of predicate filters.
.dropStats() was added deliberately in commit 5144ad9 when upgrading from Paimon 0.8.0 to 1.0-SNAPSHOT, likely to reduce split serialisation overhead or fix a serialisation error. The fix is not to remove it, but to move it to after predicate evaluation — so statistics are used for pruning during plan() and then stripped before splits are sent to workers.
Root Cause
File: src/main/java/org/apache/paimon/trino/TrinoSplitManager.java, line 86
// Current — dropStats() called BEFORE scan planning; stats unavailable for predicate evaluation
List<Split> splits = readBuilder.dropStats().newScan().plan().splits();
.dropStats() flags the scan to zero out SimpleStats (min/max/null-count per column per file) on each DataFileMeta entry. Because it is called before newScan().plan(), the predicate filter wired via readBuilder.withFilter() has no statistics to evaluate against during plan(). Every file passes the statistics check (vacuously, since stats are empty) and becomes a split.
History
.dropStats() was introduced in Paimon core in #4506 (November 2024) as an explicit optimisation to reduce the size of split objects sent to workers — splits carry DataFileMeta entries, which include column statistics that are not needed by workers after planning. Spark added dropStats() in #5093 in a secondary path. When paimon-trino upgraded to 1.0-SNAPSHOT in 5144ad9, .dropStats() was added to TrinoSplitManager — the commit message ("Update Paimon core to 1.0-SNAPSHOT / fix") gives no further explanation, but the intent is the same serialisation optimisation.
The problem is placement: statistics must be dropped after plan() completes, not before. Calling .dropStats() before plan() eliminates the serialisation overhead but also eliminates all statistics-based file pruning.
What dropStats() does
ReadBuilder.dropStats() sets a flag that causes AbstractFileStoreScan to call DataFileMeta.copyWithoutStats() on each entry, replacing column statistics with EMPTY_STATS before returning results. The flag is evaluated during plan(), so setting it on the builder before plan() means stats are zeroed before predicate evaluation. If the statistics are instead stripped from the returned splits after plan() completes, they are zeroed only for serialisation, after pruning has already occurred.
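To make the ordering effect concrete, here is a minimal, self-contained sketch. It is a toy model, not Paimon code: FileEntry, mayContain and plan are hypothetical names invented for illustration, and the single int column with an equality predicate is an assumption. It only shows why dropping min/max statistics before the filter keeps every file, while dropping them after the filter still yields stats-free splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StatsPruningSketch {

    /** Hypothetical stand-in for a data file entry with min/max stats on one int column. */
    static final class FileEntry {
        final String name;
        final Integer min; // null means the stats were dropped
        final Integer max;

        FileEntry(String name, Integer min, Integer max) {
            this.name = name;
            this.min = min;
            this.max = max;
        }

        FileEntry withoutStats() {
            return new FileEntry(name, null, null);
        }

        boolean hasStats() {
            return min != null && max != null;
        }
    }

    /** Keeps a file unless its stats prove that no row can equal the given value. */
    static boolean mayContain(FileEntry file, int value) {
        if (!file.hasStats()) {
            return true; // nothing can be proven without stats, so the file is kept
        }
        return file.min <= value && value <= file.max;
    }

    /** Toy planner: the returned entries never carry stats; only the drop order differs. */
    static List<FileEntry> plan(List<FileEntry> files, int value, boolean dropBeforeFilter) {
        List<FileEntry> splits = new ArrayList<>();
        for (FileEntry file : files) {
            FileEntry candidate = dropBeforeFilter ? file.withoutStats() : file;
            if (mayContain(candidate, value)) {
                splits.add(candidate.withoutStats());
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<FileEntry> files = Arrays.asList(
                new FileEntry("f1", 0, 99),
                new FileEntry("f2", 100, 199),
                new FileEntry("f3", 200, 299));

        // Stats dropped before filtering: every file survives.
        System.out.println(plan(files, 150, true).size());  // prints 3
        // Stats dropped after filtering: only f2 survives.
        System.out.println(plan(files, 150, false).size()); // prints 1
    }
}
```

In both cases the resulting entries carry no statistics, so the serialisation saving is preserved; only the number of splits changes.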
Anything else?
No response
Are you willing to submit a PR?