[KYUUBI #6943][2/2] OrcScan and ParquetScan support DPP by maomaodev · Pull Request #7476 · apache/kyuubi

maomaodev · 2026-05-26T04:04:01Z

Why are the changes needed?

Part 2 of 2 to add KSHC support for dynamic partition pruning (DPP). See #6943.

Add DPP support in HiveScan for non-Parquet/ORC tables.
Add DPP support in ParquetScan / OrcScan for Parquet/ORC tables.

How was this patch tested?

1. UT & TPC-DS benchmark

Unit tests
Manual test: TPC-DS benchmark (10 GB dataset, ORC and Parquet run separately). Spark configuration used for the benchmark (Spark 3.5.7, Kyuubi 1.12.0-SNAPSHOT):

spark.driver.cores            1
spark.driver.memory           4g
spark.executor.cores          1
spark.executor.instances      10
spark.executor.memory         4g
spark.master                  yarn
spark.shuffle.service.enabled true
spark.yarn.appMasterEnv.JAVA_HOME /usr/local/jdk-17
spark.executorEnv.JAVA_HOME       /usr/local/jdk-17

2. ORC benchmark

Overall performance (sum of all 99 queries)

Dimension	Vanilla Spark	KSHC Before	KSHC Now
Total time	2481.85 s	3353.43 s	2197.26 s
vs. Vanilla Spark	—	+35.12%	−11.47%
vs. KSHC Before	—	—	−34.48%

DPP hit subset (73/99)

A query counts as a DPP hit when a runtime partition filter entry appears in its driver logs.

3,4,5,6,7,8,10,11,12,13,14,15,17,18,19,20,23,25,26,27,29,30,31,32,33,
35,36,38,40,42,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,60,61,63,
64,65,66,67,68,69,70,71,72,74,75,77,78,79,80,81,83,85,86,87,89,91,92,
97,98
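
The detection step could be scripted roughly as follows. This is a minimal sketch, not the script used for the benchmark: the PR only states that "runtime partition filter" was matched in the driver logs, so the regular expression, the `dpp_hit_queries` helper, and the per-query log mapping are all assumptions, and the sample log text is synthetic.

```python
import re
from typing import Dict, List

# Assumed marker: the exact log text is not given in the PR, so this
# pattern is a guess based on the phrase "runtime partition filter".
DPP_MARKER = re.compile(r"runtime partition filter", re.IGNORECASE)


def dpp_hit_queries(driver_logs: Dict[int, str]) -> List[int]:
    """Return the sorted query numbers whose driver log contains the marker."""
    return sorted(q for q, text in driver_logs.items() if DPP_MARKER.search(text))


# Synthetic placeholder log text, for illustration only (not real Spark output).
sample_logs = {
    1: "plan built, no pruning information logged",
    3: "plan built, runtime partition filter present",
    4: "plan built, Runtime Partition Filter present",
}
hits = dpp_hit_queries(sample_logs)
print(hits)
```

Keying the logs by query number keeps the result directly comparable with the list above.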

Dimension	Vanilla Spark	KSHC Before	KSHC Now
Subset total time	1823.48 s	2642.11 s	1484.39 s
vs. Vanilla Spark	—	+44.89%	−18.60%
vs. KSHC Before	—	—	−43.82%

On the DPP-hit subset, KSHC Now cuts total time by 43.82% relative to KSHC Before, noticeably more than the overall 34.48% reduction, which indicates that the performance benefit comes mainly from queries where DPP is triggered.
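
The percentage rows can be re-derived from the reported totals, which also makes the attribution explicit. The sketch below uses only the ORC figures from the tables above; the helper names `pct_change` and `table_rows` are illustrative. Subtracting the subset from the overall totals gives the time spent on the 26 queries outside the DPP-hit subset.

```python
def pct_change(value: float, baseline: float) -> float:
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Total times in seconds, copied from the ORC tables above.
overall = {"vanilla": 2481.85, "before": 3353.43, "now": 2197.26}
subset = {"vanilla": 1823.48, "before": 2642.11, "now": 1484.39}


def table_rows(t: dict) -> dict:
    """Recompute the three percentage cells of one table."""
    return {
        "before_vs_vanilla": round(pct_change(t["before"], t["vanilla"]), 2),
        "now_vs_vanilla": round(pct_change(t["now"], t["vanilla"]), 2),
        "now_vs_before": round(pct_change(t["now"], t["before"]), 2),
    }


overall_rows = table_rows(overall)
subset_rows = table_rows(subset)

# Remaining 26 queries = overall total minus DPP-hit subset total.
rest_before = round(overall["before"] - subset["before"], 2)
rest_now = round(overall["now"] - subset["now"], 2)

print(overall_rows)
print(subset_rows)
print(rest_before, rest_now)
```

The remainder is 711.32 s before and 712.87 s now, so for ORC essentially the entire saving is concentrated in the 73 DPP-hit queries.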

3. Parquet benchmark

Overall performance (sum of all 99 queries)

Dimension	Vanilla Spark	KSHC Before	KSHC Now
Total time	2325.13 s	3363.57 s	2152.18 s
vs. Vanilla Spark	—	+44.66%	−7.44%
vs. KSHC Before	—	—	−36.02%

DPP hit subset (73/99)

A query counts as a DPP hit when a runtime partition filter entry appears in its driver logs.

3,4,5,6,7,8,10,11,12,13,14,15,17,18,19,20,23,25,26,27,29,30,31,32,33,
35,36,38,40,42,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,60,61,63,
64,65,66,67,68,69,70,71,72,74,75,77,78,79,80,81,83,85,86,87,89,91,92,
97,98

Dimension	Vanilla Spark	KSHC Before	KSHC Now
Subset total time	1619.13 s	2487.20 s	1369.55 s
vs. Vanilla Spark	—	+53.61%	−15.41%
vs. KSHC Before	—	—	−44.94%

On the DPP-hit subset, KSHC Now cuts total time by 44.94% relative to KSHC Before, noticeably more than the overall 36.02% reduction, which indicates that the performance benefit comes mainly from queries where DPP is triggered.
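
The same Parquet totals can be restated as speedup factors (time before divided by time now), together with the share of the total saving that falls inside the DPP-hit subset. This is a small sketch over the numbers in the tables above; the helper names `speedup` and `saving_share` are illustrative.

```python
def speedup(before: float, now: float) -> float:
    """Speedup factor: how many times faster 'now' is than 'before'."""
    return before / now


def saving_share(subset_before: float, subset_now: float,
                 total_before: float, total_now: float) -> float:
    """Fraction of the overall time saving that comes from the subset."""
    return (subset_before - subset_now) / (total_before - total_now)


# Total times in seconds, copied from the Parquet tables above.
overall_speedup = round(speedup(3363.57, 2152.18), 2)
subset_speedup = round(speedup(2487.20, 1369.55), 2)
share_pct = round(saving_share(2487.20, 1369.55, 3363.57, 2152.18) * 100, 1)

print(overall_speedup, subset_speedup, share_pct)
```

For Parquet this works out to roughly 1.56x overall and 1.82x on the subset, with about 92.3% of the saved time coming from the DPP-hit queries.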

4. Result correctness

Compared each of the 99 result files between KSHC Now and Vanilla Spark for both ORC and Parquet. ORC: 94/99 byte-identical and 98/99 row-multiset-identical; Parquet: identical figures. The 4 row-order-only diffs (q31/q65/q71/q79) come from queries whose ORDER BY clause does not totally order the output. The single multiset diff (q39) is sub-ULP floating-point rounding in stddev-style aggregates and is also present between KSHC Before and Vanilla Spark, so it is unrelated to this PR. No correctness regression introduced.
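
The three comparison outcomes can be told apart with a check like the one below. It is a sketch of the classification idea only, assuming one row per line in each result file; the `classify` function and its labels are illustrative, not the tooling used for this PR.

```python
from collections import Counter


def classify(expected: bytes, actual: bytes) -> str:
    """Classify two result files by how strictly they match."""
    if expected == actual:
        return "byte-identical"
    # Same rows with the same multiplicities, possibly in a different order.
    if Counter(expected.splitlines()) == Counter(actual.splitlines()):
        return "row-order-only"
    return "multiset-diff"


print(classify(b"1|a\n2|b\n", b"1|a\n2|b\n"))
print(classify(b"1|a\n2|b\n", b"2|b\n1|a\n"))
print(classify(b"1|a\n2|b\n", b"1|a\n2|c\n"))
```

Under this scheme the 94 byte-identical files plus the 4 row-order-only files give the 98 row-multiset-identical results, and q39 falls in the last bucket.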

5. Spark 4.0.1 benchmark

The same TPC-DS benchmark was also run against Spark 4.0.1 with KSHC. The results align with the Spark 3.5.7 numbers: KSHC matches or outperforms the native Hive path on DPP-eligible queries and produces identical result sets. The full Spark 4.0.1 benchmark results are omitted here to keep the report compact; they can be shared on request.

Was this patch authored or co-authored using generative AI tooling?

Partially assisted by Claude Code (Claude Opus 4.7) for unit tests, code style fixes, and analysis of TPC-DS benchmark results. Core design and implementation are human-authored.

maomaodev · 2026-05-26T04:07:24Z

                <module>extensions/spark/kyuubi-spark-connector-hive</module>
            </modules>
            <properties>
+                <maven.compiler.release>17</maven.compiler.release>


The existing spark-4.0 profile is missing this property, while spark-4.1 already has it. Without it, scalac reports Class java.lang.Record not found once a module references a JDK-17-only Spark 4.0 API (e.g. Aggregation's Record type from SPARK-45919).
I am inlining this one-line fix here to unblock CI; happy to split it out into a follow-up PR if preferred.

maomaodev · 2026-05-26T06:28:47Z

Two of the CI checks failed, but the failures seem unrelated to this PR.

maomaodev · 2026-05-27T07:34:36Z

@pan3793 Could you please take a look when you have time? Thanks!

[KYUUBI apache#6943][2/2] OrcScan and ParquetScan support DPP

90f2382

github-actions Bot added module:spark kind:build module:extensions labels May 26, 2026

maomaodev wants to merge 1 commit into apache:master from maomaodev:kyuubi_6943
