[GLUTEN-11050] Regenerate input partitions for small files#11051

Merged
marin-ma merged 1 commit into
apache:mainfrom
marin-ma:input-partition-coalesce
Nov 13, 2025

@marin-ma marin-ma commented Nov 7, 2025

Contributor

This PR regenerates input partitions to spread small files more evenly across table scan output partitions.

The strategy has the following steps:

  1. Get the number of target partitions from Spark's original logic, FilePartition.getFilePartitions.
  2. Assign the small files, starting from the smallest, each to the partition that is minimal by file count and then by total file size.
  3. Assign the remaining files, starting from the largest, each to the partition that is minimal by total file size and then by file count.

The total size of the files treated as small can be configured using spark.gluten.sql.columnar.smallFileThreshold, which specifies the percentage of the total input file size that small files may account for.
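
The steps above can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions, not the code from this PR: FileStub, Bucket, CoalesceSketch and the smallFileRatio parameter are hypothetical names, the ratio is assumed to be a fraction between 0 and 1, and "minimum file count + total file size" is read as "order by file count, break ties by total size" (and the reverse for large files). The target partition count is passed in directly rather than derived from Spark.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for Spark's PartitionedFile (only path and length).
final case class FileStub(path: String, length: Long)

// Hypothetical stand-in for an output partition being filled.
final class Bucket(val index: Int) {
  val files: ArrayBuffer[FileStub] = ArrayBuffer.empty
  var totalSize: Long = 0L
  def add(f: FileStub): Unit = { files += f; totalSize += f.length }
}

object CoalesceSketch {
  // smallFileRatio is an assumed fraction (0.0 to 1.0) of the total input size
  // that the smallest files may add up to before files count as "large".
  def coalesce(files: Seq[FileStub], numPartitions: Int, smallFileRatio: Double): Seq[Bucket] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val buckets = Vector.tabulate(numPartitions)(i => new Bucket(i))
    val ascending = files.sortBy(_.length)
    val budget = (files.map(_.length).sum * smallFileRatio).toLong

    // Walk up from the smallest file while the running total fits the budget.
    val cumulative = ascending.scanLeft(0L)(_ + _.length).tail
    val smallCount = cumulative.takeWhile(_ <= budget).size
    val (small, large) = ascending.splitAt(smallCount)

    // Small files, smallest first: fewest files wins, ties broken by total size.
    small.foreach(f => buckets.minBy(b => (b.files.size, b.totalSize)).add(f))

    // Remaining files, largest first: smallest total size wins, ties broken by file count.
    large.reverse.foreach(f => buckets.minBy(b => (b.totalSize, b.files.size)).add(f))

    buckets
  }
}
```

For example, CoalesceSketch.coalesce((1 to 10).map(i => FileStub(s"f$i", i * 10L)), 3, 0.2) treats the four smallest files as small and distributes all ten files over three buckets. Spreading small files by count first keeps any one partition from collecting most of them, while the size-first pass for large files evens out the bytes per partition.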

Related issue: #11050

@github-actions github-actions Bot added the CORE works for Gluten Core label Nov 7, 2025

github-actions Bot commented Nov 7, 2025


Run Gluten Clickhouse CI on x86

@github-actions github-actions Bot added the DOCS label Nov 11, 2025

@FelixYBW

Contributor

@marin-ma can you describe the new partition strategy?


@zhztheplayer

Member

This doesn't necessarily block the PR, but since FilePartitionCoalescer is relatively independent in the code, I think it's reasonable to add a FilePartitionCoalescerSuite to both safeguard and demonstrate its basic functionality.


@marin-ma marin-ma changed the title from "[GLUTEN-11050] Optimize the coalesce input splits methods for small files" to "[GLUTEN-11050] Regenerate input partitions for small files" Nov 12, 2025
@marin-ma marin-ma force-pushed the input-partition-coalesce branch from 6a072b9 to ea02219 Compare November 12, 2025 19:06

@marin-ma

Contributor Author

Run Gluten Clickhouse CI on x86

@marin-ma marin-ma merged commit 93c066a into apache:main Nov 13, 2025
99 of 101 checks passed
private def makePartitionedFile(path: String, length: Long): PartitionedFile =
  PartitionedFileUtilShim.makePartitionedFileFromPath(path, length)

private def makeFilePartitions(

Contributor

It seems that sometimes the length of the output from makeFilePartitions does not match the specified numPartitions value in the input. For instance, consider the following code:

val files = (1 to 50).map(i => makePartitionedFile(s"f$i", i * 10))
val initialPartitions = makeFilePartitions(files, 20)

The length of initialPartitions is 17 instead of 20.

Labels

CORE (works for Gluten Core), DOCS

4 participants