[GLUTEN-11050] Regenerate input partitions for small files #11051

Merged

marin-ma merged 1 commit into apache:main from marin-ma:input-partition-coalesce on Nov 13, 2025

marin-ma commented Nov 7, 2025

Contributor

This PR regenerates input partitions so that small files are spread more evenly across the table scan's output partitions.

The strategy consists of the following steps:

1. Get the target number of partitions from Spark's original logic, FilePartition.getFilePartitions.
2. Assign the small files, starting from the smallest, each to the partition that ranks lowest by file count, then by total file size.
3. Assign the remaining files, starting from the largest, each to the partition that ranks lowest by total file size, then by file count.
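
The assignment steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's FilePartitionCoalescer code: InputFile, Bucket, and RebalanceSketch are hypothetical names, the split into small and remaining files is taken as given, and reading "file count + total file size" as a primary key with a tie-breaker is an assumption.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for Spark's PartitionedFile: only path and size matter here.
final case class InputFile(path: String, length: Long)

// Hypothetical mutable accumulator for one output partition.
final class Bucket {
  val files: ArrayBuffer[InputFile] = ArrayBuffer.empty[InputFile]
  var totalSize: Long = 0L
  def add(f: InputFile): Unit = { files += f; totalSize += f.length }
}

object RebalanceSketch {
  // numPartitions is assumed to come from Spark's own partitioning (step 1).
  def rebalance(
      small: Seq[InputFile],
      remaining: Seq[InputFile],
      numPartitions: Int): Seq[Seq[InputFile]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val buckets = IndexedSeq.fill(numPartitions)(new Bucket)

    // Step 2: smallest small file first, into the bucket with the fewest files;
    // ties are broken by the smaller total size (assumed ordering).
    small.sortBy(_.length).foreach { f =>
      buckets.minBy(b => (b.files.size, b.totalSize)).add(f)
    }

    // Step 3: largest remaining file first, into the bucket with the smallest
    // total size; ties are broken by the lower file count (assumed ordering).
    remaining.sortBy(f => -f.length).foreach { f =>
      buckets.minBy(b => (b.totalSize, b.files.size)).add(f)
    }

    // Empty buckets are kept so the result always has numPartitions entries;
    // whether the real implementation drops them is not stated in this PR.
    buckets.map(_.files.toSeq)
  }
}
```

Handling the small files by count first keeps any single partition from collecting a long tail of tiny files, while the size-first pass for the larger files keeps the byte totals close.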

The share of the input treated as small files is configured with spark.gluten.sql.columnar.smallFileThreshold, which specifies the percentage of the total input file size that the small files account for.
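
One way to read this threshold is sketched below. It assumes the value is a fraction between 0 and 1 and that files are taken smallest first until their cumulative size would exceed that share of the total; neither the value format nor the exact selection rule is confirmed here. SizedFile and SmallFileSplitSketch are hypothetical names.

```scala
// Hypothetical stand-in for an input file; not a Spark or Gluten class.
final case class SizedFile(path: String, length: Long)

object SmallFileSplitSketch {
  // Returns (small, remaining). `fraction` plays the role of the threshold,
  // assumed here to be a value in [0, 1].
  def split(files: Seq[SizedFile], fraction: Double): (Seq[SizedFile], Seq[SizedFile]) = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val sorted = files.sortBy(_.length)
    // Byte budget that the small files together may occupy.
    val budget = (files.map(_.length).sum * fraction).toLong
    // Running total of sizes in ascending order, one entry per file.
    val cumulative = sorted.scanLeft(0L)(_ + _.length).tail
    // Take files while the running total stays within the budget.
    val smallCount = cumulative.takeWhile(_ <= budget).size
    sorted.splitAt(smallCount)
  }
}
```

With sizes 10, 20, 30, 40, and 100 and a fraction of 0.5, the budget is 100 bytes, so the four smallest files count as small and the 100-byte file is left for the size-based pass.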

Related issue: #11050

github-actions Bot added the CORE label

github-actions Bot commented Nov 7, 2025

Run Gluten Clickhouse CI on x86

github-actions Bot added the DOCS label

FelixYBW commented Nov 12, 2025

Contributor

@marin-ma can you describe the new partition strategy?

zhztheplayer commented Nov 12, 2025

Member

This isn't necessarily blocking the PR, but since FilePartitionCoalescer is relatively independent code, I think it's reasonable to add a FilePartitionCoalescerSuite to both safeguard and demonstrate its basic functionality.

marin-ma changed the title ~~[GLUTEN-11050] Optimize the coalesce input splits methods for small files~~ [GLUTEN-11050] Regenerate input partitions for small files


          regenerate input partitions

ea02219

marin-ma force-pushed the input-partition-coalesce branch from 6a072b9 to ea02219 on November 12, 2025 19:06

FelixYBW approved these changes


marin-ma commented Nov 13, 2025

Contributor Author

Run Gluten Clickhouse CI on x86

marin-ma merged commit 93c066a into apache:main

99 of 101 checks passed

LuciferYang reviewed

gluten-substrait/src/test/scala/org/apache/gluten/utils/PartitionsUtilSuite.scala

+  private def makePartitionedFile(path: String, length: Long): PartitionedFile =
+    PartitionedFileUtilShim.makePartitionedFileFromPath(path, length)
+  private def makeFilePartitions(

LuciferYang Dec 2, 2025

Contributor

It seems that sometimes the length of the output from makeFilePartitions does not match the specified numPartitions value in the input. For instance, consider the following code:

val files = (1 to 50).map(i => makePartitionedFile(s"f$i", i * 10))
val initialPartitions = makeFilePartitions(files, 20)

The length of initialPartitions is 17 instead of 20.

marin-ma mentioned this pull request

[VL] Rebalance iceberg table scan partitions #11946

Closed
