
[KYUUBI #6943][2/2] OrcScan and ParquetScan support DPP #7476

Open
maomaodev wants to merge 1 commit into apache:master from maomaodev:kyuubi_6943


@maomaodev

Contributor

Why are the changes needed?

Part 2 of 2, adding dynamic partition pruning (DPP) support to the Kyuubi Spark Hive Connector (KSHC). See #6943.

  • Add DPP support in HiveScan for non-Parquet/ORC tables.
  • Add DPP support in ParquetScan / ORCScan for Parquet/ORC tables.

How was this patch tested?

1. Unit tests and TPC-DS benchmark

  • Unit tests
  • Manual test: TPC-DS benchmark (10 GB dataset, ORC and Parquet run separately). Spark configuration used for the benchmark (Spark 3.5.7, Kyuubi 1.12.0-SNAPSHOT):
spark.driver.cores            1
spark.driver.memory           4g
spark.executor.cores          1
spark.executor.instances      10
spark.executor.memory         4g
spark.master                  yarn
spark.shuffle.service.enabled true
spark.yarn.appMasterEnv.JAVA_HOME /usr/local/jdk-17
spark.executorEnv.JAVA_HOME       /usr/local/jdk-17

2. ORC benchmark

  • Overall performance (sum of 99)

| Dimension | Vanilla Spark | KSHC Before | KSHC Now |
| --- | --- | --- | --- |
| Total time | 2481.85 s | 3353.43 s | 2197.26 s |
| vs. Vanilla Spark | | +35.12% | −11.47% |
| vs. KSHC Before | | | −34.48% |

  • DPP hit subset (73/99)

A DPP hit was detected by matching the runtime partition filter in the driver logs.

3,4,5,6,7,8,10,11,12,13,14,15,17,18,19,20,23,25,26,27,29,30,31,32,33,
35,36,38,40,42,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,60,61,63,
64,65,66,67,68,69,70,71,72,74,75,77,78,79,80,81,83,85,86,87,89,91,92,
97,98
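
The detection step above could be scripted roughly as follows. This is only a sketch: the PR does not show the exact log message or how the logs were stored, so the marker phrase `runtime partition filter` and the one-log-per-query layout (`q<N>.log`) are both assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumption: a driver log line that signals DPP contains this phrase.
# The PR only mentions "runtime partition filter"; the real log text is not shown.
DPP_MARKER = re.compile(r"runtime partition filter", re.IGNORECASE)


def log_shows_dpp(log_text: str) -> bool:
    """Return True if the driver log text contains the DPP marker."""
    return DPP_MARKER.search(log_text) is not None


def dpp_hit_queries(log_dir: str) -> list[int]:
    """Collect query numbers whose driver log shows a DPP hit.

    Hypothetical layout: one driver log per query, named q<N>.log.
    """
    hits = []
    for path in Path(log_dir).glob("q*.log"):
        match = re.fullmatch(r"q(\d+)", path.stem)
        if match and log_shows_dpp(path.read_text(errors="replace")):
            hits.append(int(match.group(1)))
    return sorted(hits)
```

Calling `dpp_hit_queries("logs/orc")` on such a directory would return a sorted list of query numbers comparable to the one above.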

| Dimension | Vanilla Spark | KSHC Before | KSHC Now |
| --- | --- | --- | --- |
| Subset total time | 1823.48 s | 2642.11 s | 1484.39 s |
| vs. Vanilla Spark | | +44.89% | −18.60% |
| vs. KSHC Before | | | −43.82% |

On the DPP-hit subset, KSHC Now cuts total time by 43.82% relative to KSHC Before, noticeably more than the 34.48% reduction overall, which indicates that the performance benefit comes mainly from queries where DPP is triggered.
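
For reference, the percentages in the ORC tables can be reproduced from the raw totals. The sketch below uses only the timings reported above; the helper names are illustrative and not part of the PR.

```python
def pct_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base` in percent (negative means faster)."""
    return (new - base) / base * 100.0


def speedup_factor(new: float, base: float) -> float:
    """How many times faster `new` is than `base` (base time / new time)."""
    return base / new


# ORC totals in seconds, copied from the tables above.
orc_overall = {"vanilla": 2481.85, "before": 3353.43, "now": 2197.26}
orc_dpp_subset = {"vanilla": 1823.48, "before": 2642.11, "now": 1484.39}

for label, t in (("overall", orc_overall), ("DPP subset", orc_dpp_subset)):
    change = pct_change(t["now"], t["before"])
    factor = speedup_factor(t["now"], t["before"])
    print(f"{label}: {change:+.2f}% vs. KSHC Before ({factor:.2f}x)")
```

Note that the table figures are reductions in wall-clock time; a 43.82% reduction corresponds to roughly a 1.78x speedup.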

3. Parquet benchmark

  • Overall performance (sum of 99)

| Dimension | Vanilla Spark | KSHC Before | KSHC Now |
| --- | --- | --- | --- |
| Total time | 2325.13 s | 3363.57 s | 2152.18 s |
| vs. Vanilla Spark | | +44.66% | −7.44% |
| vs. KSHC Before | | | −36.02% |

  • DPP hit subset (73/99)

A DPP hit was detected by matching the runtime partition filter in the driver logs.

3,4,5,6,7,8,10,11,12,13,14,15,17,18,19,20,23,25,26,27,29,30,31,32,33,
35,36,38,40,42,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,60,61,63,
64,65,66,67,68,69,70,71,72,74,75,77,78,79,80,81,83,85,86,87,89,91,92,
97,98

| Dimension | Vanilla Spark | KSHC Before | KSHC Now |
| --- | --- | --- | --- |
| Subset total time | 1619.13 s | 2487.20 s | 1369.55 s |
| vs. Vanilla Spark | | +53.61% | −15.41% |
| vs. KSHC Before | | | −44.94% |

On the DPP-hit subset, KSHC Now cuts total time by 44.94% relative to KSHC Before, noticeably more than the 36.02% reduction overall, which indicates that the performance benefit comes mainly from queries where DPP is triggered.
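
A complementary check is to look at the 26 queries outside the DPP-hit subset, whose totals can be derived by subtracting the subset totals from the overall totals. The sketch below does that for both formats. It assumes the subset and overall figures come from the same benchmark runs, which the report implies but does not state explicitly.

```python
# Totals in seconds, copied from the ORC and Parquet tables above.
TOTALS = {
    "ORC": {
        "overall": {"before": 3353.43, "now": 2197.26},
        "subset": {"before": 2642.11, "now": 1484.39},
    },
    "Parquet": {
        "overall": {"before": 3363.57, "now": 2152.18},
        "subset": {"before": 2487.20, "now": 1369.55},
    },
}


def remainder_change(fmt: str) -> tuple[float, float, float]:
    """Times for the non-DPP queries (before, now) and their percent change."""
    overall, subset = TOTALS[fmt]["overall"], TOTALS[fmt]["subset"]
    before = overall["before"] - subset["before"]
    now = overall["now"] - subset["now"]
    return before, now, (now - before) / before * 100.0


for fmt in TOTALS:
    before, now, change = remainder_change(fmt)
    print(f"{fmt} non-DPP queries: {before:.2f} s -> {now:.2f} s ({change:+.2f}%)")
```

With these figures the ORC remainder is essentially unchanged (about +0.2%), while the Parquet remainder drops by roughly 10.7%. In both cases the DPP-hit subset still accounts for most of the time saved.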

4. Result correctness

Compared each of the 99 result files between KSHC Now and Vanilla Spark for both ORC and Parquet. ORC: 94/99 byte-identical and 98/99 row-multiset-identical; Parquet: identical figures. The 4 row-order-only diffs (q31/q65/q71/q79) come from queries whose ORDER BY clause does not totally order the output. The single multiset diff (q39) is sub-ULP floating-point rounding in stddev-style aggregates and is also present between KSHC Before and Vanilla Spark, so it is unrelated to this PR. No correctness regression introduced.
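
The three outcome classes used above (byte-identical, row-order-only diff, multiset diff) can be told apart with a short script. The PR does not show how the comparison was actually run, so the following is just one plausible way to do it; the function names and the one-row-per-line file format are assumptions.

```python
from collections import Counter


def classify(expected: bytes, actual: bytes) -> str:
    """Classify two result files by how closely they match."""
    if expected == actual:
        return "byte-identical"
    # Same rows with the same multiplicities, but in a different order.
    if Counter(expected.splitlines()) == Counter(actual.splitlines()):
        return "row-order-only"
    return "multiset-diff"


def classify_files(expected_path: str, actual_path: str) -> str:
    """Read both result files as bytes and classify them."""
    with open(expected_path, "rb") as f_exp, open(actual_path, "rb") as f_act:
        return classify(f_exp.read(), f_act.read())
```

A query whose `ORDER BY` does not totally order the output would typically land in `row-order-only`, while a rounding difference in a single aggregate value would show up as `multiset-diff`.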

5. Spark 4.0.1 benchmark

The same TPC-DS benchmark was also run against Spark 4.0.1 with KSHC. Results align with the Spark 3.5.7 numbers: KSHC matches or outperforms the native Hive path on DPP-eligible queries and produces identical result sets. The full Spark 4.0.1 benchmark results are omitted here to keep the report compact; they can be shared on request.

Was this patch authored or co-authored using generative AI tooling?

Partially assisted by Claude Code (Claude Opus 4.7) for unit tests, code style fixes, and analysis of TPC-DS benchmark results. Core design and implementation are human-authored.

Comment thread pom.xml
<module>extensions/spark/kyuubi-spark-connector-hive</module>
</modules>
<properties>
<maven.compiler.release>17</maven.compiler.release>

Contributor Author


The existing spark-4.0 profile is missing this property, while spark-4.1 already has it. Without it, scalac reports `Class java.lang.Record not found` once a module references a JDK-17-only Spark 4.0 API (e.g. Aggregation's Record type from SPARK-45919).
I am inlining this one-line fix here to unblock CI, and am happy to split it out into a follow-up PR if preferred.


@maomaodev

Contributor Author

Two of the CI checks failed, but the failures seem unrelated to this PR.

@maomaodev

Contributor Author

@pan3793 Could you please take a look when you have time? Thanks!
