[Feature] TrinoSplitManager calls dropStats() before scan planning, preventing manifest-level file pruning #8257

@henrik-donuts

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

metadata.stats-mode rendered useless for Trino

Users who configure metadata.stats-mode = 'full' or metadata.stats-mode = 'truncate(16)' receive no benefit in Trino. Statistics are written to manifests correctly but discarded before they can be used for pruning.

Solution

Summary

TrinoSplitManager calls .dropStats() before newScan().plan(), which strips manifest-level column statistics from DataFileMeta entries before predicate evaluation can use them. As a result, Trino cannot use metadata.stats-mode min/max statistics for file-level skipping — all files become splits regardless of predicate filters.

.dropStats() was added deliberately in commit 5144ad9 when upgrading from Paimon 0.8.0 to 1.0-SNAPSHOT, likely to reduce split serialisation overhead or fix a serialisation error. The fix is not to remove the stats-dropping step but to apply it after predicate evaluation: statistics are used for pruning during plan() and stripped only before splits are sent to workers.


Root Cause

File: src/main/java/org/apache/paimon/trino/TrinoSplitManager.java, line 86

// Current — dropStats() called BEFORE scan planning; stats unavailable for predicate evaluation
List<Split> splits = readBuilder.dropStats().newScan().plan().splits();

.dropStats() flags the scan to zero out SimpleStats (min/max/null-count per column per file) on each DataFileMeta entry. Because it is called before newScan().plan(), the predicate filter wired via readBuilder.withFilter() has no statistics to evaluate against during plan(). Every file passes the statistics check (vacuously, since stats are empty) and becomes a split.
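The vacuous pass described above can be illustrated with a small self-contained sketch. It does not use any Paimon classes: FileStats, mightContain and countSurvivors are hypothetical stand-ins for SimpleStats and the real predicate evaluation, and the predicate is simplified to an equality test on a single integer column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for per-file column statistics (not Paimon's SimpleStats).
final class FileStats {
    final Integer min; // null means "statistic not available"
    final Integer max;

    FileStats(Integer min, Integer max) {
        this.min = min;
        this.max = max;
    }

    // Models statistics that have been dropped.
    static final FileStats EMPTY = new FileStats(null, null);

    // Conservative check for "col = value": a file may only be skipped when
    // its statistics prove that the value cannot be present.
    boolean mightContain(int value) {
        if (min == null || max == null) {
            return true; // no stats: nothing can be proven, so the file is kept
        }
        return min <= value && value <= max;
    }
}

final class VacuousPassDemo {
    // Counts the files that survive the statistics check for "col = value".
    static int countSurvivors(List<FileStats> files, int value) {
        int kept = 0;
        for (FileStats stats : files) {
            if (stats.mightContain(value)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FileStats> withStats = new ArrayList<>();
        withStats.add(new FileStats(0, 99));
        withStats.add(new FileStats(100, 199));
        withStats.add(new FileStats(200, 299));

        // Same three files after their statistics have been dropped.
        List<FileStats> dropped = new ArrayList<>();
        for (int i = 0; i < withStats.size(); i++) {
            dropped.add(FileStats.EMPTY);
        }

        System.out.println(countSurvivors(withStats, 150)); // only one range covers 150
        System.out.println(countSurvivors(dropped, 150));   // no file can be skipped
    }
}
```

With statistics present, only the file whose range covers the value survives. With empty statistics the check has to be conservative, so every file is kept.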


History

.dropStats() was introduced in Paimon core in #4506 (November 2024) as an explicit optimisation to reduce the size of the split objects sent to workers: splits carry DataFileMeta entries, whose column statistics workers do not need after planning. Spark added dropStats() in #5093 in a secondary path. When paimon-trino upgraded to 1.0-SNAPSHOT in 5144ad9, .dropStats() was added to TrinoSplitManager. The commit message ("Update Paimon core to 1.0-SNAPSHOT / fix") gives no further explanation, but the intent is presumably the same serialisation optimisation.

The problem is placement: statistics must be dropped after plan() has evaluated the predicate, not before. Dropping them before plan() eliminates the serialisation overhead but also eliminates all statistics-based file pruning.


What dropStats() does

ReadBuilder.dropStats() sets a flag that causes AbstractFileStoreScan to call DataFileMeta.copyWithoutStats() on each entry, replacing its column statistics with EMPTY_STATS before results are returned. The flag is evaluated during plan(). Because it is set before plan(), stats are zeroed before predicate evaluation. If the statistics were instead stripped from the splits returned by plan(), they would be zeroed only for serialisation, after pruning has already occurred.
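The difference between the two orderings can be sketched with a toy model. This is not Paimon code: FileEntry, dropThenFilter and filterThenDrop are hypothetical names, and only copyWithoutStats() mirrors a method named in this issue. The sketch assumes an equality predicate on one integer column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a data file entry (not Paimon's DataFileMeta).
final class FileEntry {
    final String name;
    final Integer min; // null models "stats dropped"
    final Integer max;

    FileEntry(String name, Integer min, Integer max) {
        this.name = name;
        this.min = min;
        this.max = max;
    }

    // Returns a copy that carries no column statistics.
    FileEntry copyWithoutStats() {
        return new FileEntry(name, null, null);
    }

    boolean hasStats() {
        return min != null && max != null;
    }

    // Conservative check for "col = value": without stats the file is kept.
    boolean mightContain(int value) {
        return !hasStats() || (min <= value && value <= max);
    }
}

final class PlanOrderDemo {
    // Ordering described in this issue: stats are dropped, then the filter runs.
    static List<FileEntry> dropThenFilter(List<FileEntry> files, int value) {
        List<FileEntry> out = new ArrayList<>();
        for (FileEntry f : files) {
            FileEntry stripped = f.copyWithoutStats();
            if (stripped.mightContain(value)) {
                out.add(stripped);
            }
        }
        return out;
    }

    // Proposed ordering: filter using stats, then strip them from the survivors.
    static List<FileEntry> filterThenDrop(List<FileEntry> files, int value) {
        List<FileEntry> out = new ArrayList<>();
        for (FileEntry f : files) {
            if (f.mightContain(value)) {
                out.add(f.copyWithoutStats());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<FileEntry> files = new ArrayList<>();
        files.add(new FileEntry("a", 0, 99));
        files.add(new FileEntry("b", 100, 199));
        files.add(new FileEntry("c", 200, 299));

        System.out.println(dropThenFilter(files, 150).size()); // 3: nothing pruned
        System.out.println(filterThenDrop(files, 150).size()); // 1: only "b" survives
    }
}
```

In both orderings the returned entries carry no statistics, so the serialisation saving is the same. Only the second ordering prunes files.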

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Labels

  • enhancement (New feature or request)
