[GLUTEN-11050] Regenerate input partitions for small files #11051
Run Gluten Clickhouse CI on x86 |
@marin-ma Can you describe the new partition strategy?
It isn't necessarily blocking the PR, but since |
6a072b9 to ea02219
private def makePartitionedFile(path: String, length: Long): PartitionedFile =
  PartitionedFileUtilShim.makePartitionedFileFromPath(path, length)

private def makeFilePartitions(
It seems that the number of partitions returned by makeFilePartitions does not always match the numPartitions value passed in. For instance, consider the following code:
val files = (1 to 50).map(i => makePartitionedFile(s"f$i", i * 10))
val initialPartitions = makeFilePartitions(files, 20)
Here the length of initialPartitions is 17 instead of 20.
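To illustrate why a size-capped packing does not guarantee the requested partition count, here is a minimal, self-contained sketch. It is not the `makeFilePartitions` helper from this PR and is not expected to reproduce the count of 17 reported above. `PackingSketch` and `packNextFit` are hypothetical names, and the packing rule (sort by size descending, close the current partition once the next file would exceed `ceil(total / numPartitions)` bytes) is an assumption chosen only to show that the resulting count depends on the data.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration only; not the helper under review.
object PackingSketch {
  // Packs file sizes into partitions with a next-fit rule.
  // The per-partition cap is ceil(total / numPartitions) bytes (an assumption).
  def packNextFit(sizes: Seq[Long], numPartitions: Int): Seq[Seq[Long]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val maxBytes = math.ceil(sizes.sum.toDouble / numPartitions).toLong
    val partitions = ArrayBuffer.empty[Seq[Long]]
    var current = ArrayBuffer.empty[Long]
    var currentSize = 0L
    // Largest files first, so that small files fill the remaining gaps.
    sizes.sortBy(-_).foreach { size =>
      // Close the current partition when the next file would overflow the cap.
      if (current.nonEmpty && currentSize + size > maxBytes) {
        partitions += current.toSeq
        current = ArrayBuffer.empty[Long]
        currentSize = 0L
      }
      current += size
      currentSize += size
    }
    if (current.nonEmpty) partitions += current.toSeq
    partitions.toSeq
  }

  // Same file sizes as the example in the review comment: 10, 20, ..., 500.
  val sizes: Seq[Long] = (1 to 50).map(i => i * 10L)
  val parts: Seq[Seq[Long]] = packNextFit(sizes, 20)
}
```

With these inputs the number of partitions produced is determined by how the sizes happen to fit under the cap, so it differs from the requested 20. A test that relies on an exact count would need either an assertion on `parts.length` or a packing rule that enforces the count explicitly.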
Regenerate input partitions to spread small files more evenly across table scan output partitions.
The strategy is designed with the following steps:
FilePartition.getFilePartitions
The total size of small files can be configured using spark.gluten.sql.columnar.smallFileThreshold, which specifies the percentage of the total input file size represented by small files.
Related issue: #11050
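One possible reading of the threshold description is sketched below: sort files by size and treat the smallest ones as "small" for as long as their cumulative size stays within the configured fraction of the total input size. This is an assumption about the semantics, not the PR's implementation. `SmallFileSketch` and `splitBySmallFileThreshold` are hypothetical names, and the threshold is modeled as a fraction between 0.0 and 1.0.

```scala
// Hypothetical illustration only; the actual selection logic in the PR may differ.
object SmallFileSketch {
  // Returns (smallFiles, remainingFiles), both sorted by ascending size.
  // `threshold` is the assumed fraction of total input bytes that small files may occupy.
  def splitBySmallFileThreshold(
      sizes: Seq[Long],
      threshold: Double): (Seq[Long], Seq[Long]) = {
    require(threshold >= 0.0 && threshold <= 1.0, "threshold must be in [0, 1]")
    val budget = (sizes.sum * threshold).toLong
    val sorted = sizes.sorted
    // Running totals of the sorted sizes: prefix(i) = sorted(0) + ... + sorted(i).
    val prefix = sorted.scanLeft(0L)(_ + _).tail
    // Count how many of the smallest files fit within the byte budget.
    val smallCount = prefix.takeWhile(_ <= budget).length
    sorted.splitAt(smallCount)
  }

  // Example input: file sizes 10, 20, ..., 500 with a 50% threshold.
  val sizes: Seq[Long] = (1 to 50).map(i => i * 10L)
  val (small, large) = splitBySmallFileThreshold(sizes, 0.5)
}
```

Under this reading, the files in `small` would be the candidates for redistribution across scan partitions, while the files in `large` would keep their original placement.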