[KYUUBI #6943][1/2] HiveScan supports DPP by maomaodev · Pull Request #7436 · apache/kyuubi

maomaodev · 2026-05-08T04:13:06Z

Why are the changes needed?

Part 1 of 2 to add KSHC support for dynamic partition pruning (DPP). See #6943.

Add DPP support in HiveScan for non-Parquet/ORC tables.
Add DPP support in ParquetScan / ORCScan for Parquet/ORC tables.

How was this patch tested?

Unit tests
Manual test: TPC-DS benchmark (11 GB text dataset).

Spark configuration used for the benchmark (Spark 3.5.7, Kyuubi 1.12.0-SNAPSHOT):

spark.driver.cores    1
spark.driver.memory    4g
spark.executor.cores    1
spark.executor.instances    10
spark.executor.memory    4g
spark.master    yarn
spark.shuffle.service.enabled    true
spark.yarn.appMasterEnv.JAVA_HOME /usr/local/jdk-17
spark.executorEnv.JAVA_HOME /usr/local/jdk-17

Overall performance (sum of all 99 queries)

| Dimension | Vanilla Spark | KSHC Before | KSHC Now |
|---|---|---|---|
| Total time | 5950.10 s | 2836.49 s | 2691.95 s |
| vs. Vanilla Spark | — | −52.33% | −54.76% |
| vs. KSHC Before | — | — | −5.10% |

KSHC Now provides a 5.10% (~144 s) speedup over KSHC Before, with no correctness regression.

DPP hit subset (70/99)

A query counts as a DPP hit when a runtime partition filter appears in its driver logs.

3,4,5,6,7,8,10,11,12,13,14,15,17,18,19,20,23,25,26,27,29,30,31,32,33,
35,36,38,40,42,45,46,47,48,49,50,51,52,53,54,55,56,57,58,60,63,64,65,
66,67,69,70,71,72,74,75,77,78,79,80,81,83,85,86,87,89,91,92,97,98

| Dimension | Vanilla Spark | KSHC Before | KSHC Now |
|---|---|---|---|
| Subset total time | 3418.34 s | 2180.60 s | 2028.51 s |
| vs. Vanilla Spark | — | −36.21% | −40.66% |
| vs. KSHC Before | — | — | −6.97% |

On the DPP-hit subset, KSHC Now provides a 6.97% speedup over KSHC Before, noticeably larger than the overall 5.10%, indicating the performance benefit mainly comes from queries where DPP is triggered.
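
The percentages in both tables can be re-derived from the reported totals. A minimal Scala sketch follows; the object and helper names are illustrative and not part of the PR:

```scala
object SpeedupCheck {
  // Relative change of `now` against `base`, in percent (negative means faster).
  def pctChange(base: Double, now: Double): Double = (now - base) / base * 100.0

  def main(args: Array[String]): Unit = {
    // Totals in seconds, copied from the two tables above:
    // (Vanilla Spark, KSHC Before, KSHC Now)
    val rows = Seq(
      "all 99 queries" -> ((5950.10, 2836.49, 2691.95)),
      "DPP-hit subset" -> ((3418.34, 2180.60, 2028.51)))

    for ((label, (vanilla, before, now)) <- rows) {
      println(f"$label: before vs vanilla ${pctChange(vanilla, before)}%.2f%%, " +
        f"now vs vanilla ${pctChange(vanilla, now)}%.2f%%, " +
        f"now vs before ${pctChange(before, now)}%.2f%%")
    }
  }
}
```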

Was this patch authored or co-authored using generative AI tooling?

Partially assisted by Claude Code (Claude Opus 4.7) for unit test, code style fixes, and analysis of TPC-DS benchmark results. Core design and implementation are human-authored.

pan3793 · 2026-05-08T05:44:35Z

How can KSHC be faster than Vanilla Spark?

maomaodev · 2026-05-08T13:59:53Z

> How can KSHC be faster than Vanilla Spark?

Thanks for the review! After investigation, KSHC being faster than vanilla Spark on TEXT-format Hive tables comes mainly from two factors. They partially overlap, and together they explain essentially all of the gap:
1. Different file-splitting strategies

Vanilla Spark reads via HadoopRDD + mapred.FileInputFormat.getSplits(JobConf, numSplits):

splitSize = max(minSize, min(goalSize, blockSize)),  goalSize = totalSize / numSplits

With the defaults (minSize=1B, blockSize=128MB, numSplits=2), splitSize is only ~2MB, so each small file (1–4MB) is split into ~2 tasks, and scheduling overhead dominates.
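
The formula above can be evaluated directly. The sketch below is a standalone re-statement in Scala, not the Hadoop code itself; it assumes a hypothetical 4 MB of input under one path (so goalSize is 2 MB), and the object and function names are illustrative:

```scala
object MapredSplitSize {
  val MB: Long = 1024L * 1024L

  // splitSize = max(minSize, min(goalSize, blockSize)), goalSize = totalSize / numSplits
  def splitSize(totalSize: Long, numSplits: Int, minSize: Long, blockSize: Long): Long = {
    val goalSize = totalSize / math.max(numSplits, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  def main(args: Array[String]): Unit = {
    // Assumed input: 4 MB in total, numSplits = 2, blockSize = 128 MB.
    val withDefaults = splitSize(4 * MB, 2, 1L, 128 * MB)
    // Same input with the minimum split size raised to 128 MB.
    val withMinSize = splitSize(4 * MB, 2, 128 * MB, 128 * MB)
    println(s"defaults: ${withDefaults / MB} MB, minsize=128M: ${withMinSize / MB} MB")
  }
}
```

Under these assumptions the default split size is smaller than a 4 MB file, while a 128 MB minimum makes the split size larger than any of the small files.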

KSHC uses DataSource V2 FileScan + FilePartition:

maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))

With the defaults (maxPartitionBytes=128MB, openCostInBytes=4MB), each small file becomes at most one task.
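
A similar sketch for the DataSource V2 side. It assumes that bytesPerCore is the total file size, padded with openCostInBytes per file, divided by the minimum partition number; the file sizes, the parallelism of 10, and all names are hypothetical:

```scala
object FilePartitionSplit {
  val MB: Long = 1024L * 1024L

  // maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))
  def maxSplitBytes(
      fileSizes: Seq[Long],
      maxPartitionBytes: Long,
      openCostInBytes: Long,
      minPartitionNum: Int): Long = {
    // Assumption: every file is padded with the open cost before dividing.
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / minPartitionNum
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical small files of 1, 2 and 4 MB.
    val files = Seq(1 * MB, 2 * MB, 4 * MB)
    val split = maxSplitBytes(files, 128 * MB, 4 * MB, 10)
    println(s"maxSplitBytes = ${split / MB} MB")
  }
}
```

Because the result never drops below openCostInBytes, a file of at most 4 MB is never cut into more than one split in this model.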

Validation (full 99 TPC-DS queries): setting mapreduce.input.fileinputformat.split.minsize=128M on vanilla Spark to align splitting with KSHC drops the total from 5427.47s → 3560.15s (saving ~1867s, about 65% of the 2853s gap).

2. KSHC reuses FileStatus via FileStatusCache

Vanilla Spark goes through HiveMetastoreCatalog → HadoopFsRelation on every scan and calls listStatus again for each partition every time. KSHC's HiveCatalogFileIndex reuses FileStatus entries across scans within a session.
Validation (full 99 TPC-DS queries): replacing KSHC's fileStatusCache with NoopCache increases the total from 2574.12s → 4334.32s (adding ~1760s, about 62% of the 2853s gap).
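
The effect of reusing listings can be modeled with a toy cache. The sketch below is self-contained Scala with hypothetical types (`FileEntry`, `SharedListingCache`, `NoopListingCache`); it does not use the Spark or KSHC classes and only counts how often a directory listing is executed:

```scala
import scala.collection.mutable

object FileStatusCacheSketch {
  // Hypothetical stand-in for a listing entry; not Hadoop's FileStatus.
  final case class FileEntry(path: String, length: Long)

  trait ListingCache {
    def getOrList(dir: String)(list: String => Seq[FileEntry]): Seq[FileEntry]
  }

  // Keeps listings for as long as the cache lives (for example, one session).
  final class SharedListingCache extends ListingCache {
    private val entries = mutable.Map.empty[String, Seq[FileEntry]]
    def getOrList(dir: String)(list: String => Seq[FileEntry]): Seq[FileEntry] =
      entries.getOrElseUpdate(dir, list(dir))
  }

  // Stores nothing, so every scan lists every partition again.
  object NoopListingCache extends ListingCache {
    def getOrList(dir: String)(list: String => Seq[FileEntry]): Seq[FileEntry] = list(dir)
  }

  // Number of listing calls triggered by `scans` scans over `partitions`.
  def countListings(cache: ListingCache, partitions: Seq[String], scans: Int): Int = {
    var calls = 0
    def listDir(dir: String): Seq[FileEntry] = {
      calls += 1
      Seq(FileEntry(s"$dir/part-00000", 1L))
    }
    for (_ <- 1 to scans; p <- partitions) cache.getOrList(p)(listDir)
    calls
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq("ds=1", "ds=2", "ds=3")
    println(s"shared cache: ${countListings(new SharedListingCache, partitions, 4)} listings")
    println(s"no cache:     ${countListings(NoopListingCache, partitions, 4)} listings")
  }
}
```

With the shared cache the number of listings depends only on the number of distinct partitions; without it the count grows with every additional scan.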

Note on ORC/Parquet
We also tested 10GB TPC-DS in ORC/Parquet format. There vanilla Spark is faster than KSHC, so this issue does not appear.

On the Spark 3.3 CI failure
SupportsRuntimeV2Filtering was introduced in Spark 3.4, so cross-version compilation against 3.3 fails. ORC/Parquet DPP additionally relies on Scan.columnarSupportMode (Spark 3.5+); without it, the DPP benefit is fully cancelled by a pre-DPP full-table listing (see apache/spark#42099). Any suggestions on the preferred direction? Or should we just support Spark 3.5+?

maomaodev · 2026-05-12T11:08:19Z

Gentle ping @pan3793. f66216b says KSHC guarantees binary compatibility from Spark 3.3 onwards. Do we need to keep that, or is it acceptable to bump to 3.5+ in the next release? Both Spark 3.3 and 3.4 are upstream EOL.

pan3793 · 2026-05-12T11:53:27Z

maybe it's time to deprecate Spark 3.3 and 3.4 support

maomaodev · 2026-05-12T12:06:20Z

> maybe it's time to deprecate Spark 3.3 and 3.4 support

Thanks. I see #7285 just deprecated Spark 3.3 / 3.4. I'll revisit this PR once they're actually dropped.

pan3793 · 2026-05-12T12:08:15Z

oh, we already did that, but forgot to mention it in the "Kyuubi Migration Guide" ...

pan3793 · 2026-05-12T12:30:12Z

+   * Translate Spark's runtime V2 `IN` predicates into catalyst `InSet(attr, Set[Any])`
+   * expressions bound to the given partition attributes.
+   */
+  def toCatalystPartitionFilters(


This makes code compatibility very fragile: catalyst code is treated as an internal implementation detail and does not provide any compatibility guarantee.

Given the situation, why choose to implement SupportsRuntimeV2Filtering and translate the V2 Predicate to the catalyst Filter instead of just implementing SupportsRuntimeFiltering?

Good point. I’ll use SupportsRuntimeFiltering in the latest commit, and try to maintain compatibility with Spark 3.3+.

I've replaced SupportsRuntimeV2Filtering with SupportsRuntimeFiltering and updated the PR description; ready for further review.
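
To illustrate the pruning step this thread is about: with a runtime `IN` filter on a partition column, a scan can drop non-matching partitions before listing or reading them. The sketch below is a simplified, self-contained model in Scala; `InFilter`, `HivePartition` and `prune` are hypothetical names, not Spark or KSHC classes, and the real HiveScan implementation may differ:

```scala
object RuntimePruningSketch {
  // Hypothetical, simplified stand-ins for a runtime IN filter and a Hive partition.
  final case class InFilter(attribute: String, values: Set[String])
  final case class HivePartition(spec: Map[String, String], location: String)

  // Keep a partition only if every filter on one of its partition columns matches.
  // Filters on columns outside the partition spec are ignored, since they cannot prune.
  def prune(partitions: Seq[HivePartition], filters: Seq[InFilter]): Seq[HivePartition] =
    partitions.filter { p =>
      filters.forall { f =>
        p.spec.get(f.attribute).forall(f.values.contains)
      }
    }

  def main(args: Array[String]): Unit = {
    val partitions = Seq("2024-01-01", "2024-01-02", "2024-01-03").map { ds =>
      HivePartition(Map("ds" -> ds), s"/warehouse/t/ds=$ds")
    }
    // Values that would come from the dimension side of the join at runtime.
    val runtimeFilter = InFilter("ds", Set("2024-01-02"))
    prune(partitions, Seq(runtimeFilter)).foreach(p => println(p.location))
  }
}
```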

wangzhigang1999 · 2026-05-14T11:01:07Z

 * Since Kyuubi 1.12, session configurations in REST API responses are redacted by default using `kyuubi.server.redaction.regex`. Use `kyuubi.server.conf.retrieveMode` to control this behavior: `REDACTED` (default), `ORIGINAL` (no redaction), or `NONE` (omit configs entirely).
 * Since Kyuubi 1.12, `GET /api/v1/sessions` returns only sessions owned by the authenticated user instead of all sessions on the server. To restore the previous behavior, set `kyuubi.frontend.rest.legacy.v1.sessionsReturnAllUsers=true`.
 * Since Kyuubi 1.12, the configuration `spark.sql.kyuubi.hive.connector.dropTableAsPurgeTable` is introduced by Kyuubi Spark Hive connector(KSHC) to control whether DROP TABLE command completely remove its data by skipping HDFS trash. The default value is false. To restore the legacy behavior, set it to true.
+* Since Kyuubi 1.12, the configuration `spark.sql.kyuubi.hive.connector.read.runtimeFilter.enabled` is introduced by Kyuubi Spark Hive connector(KSHC) to control whether partition columns are exposed as runtime filter attributes, which is required for Spark Dynamic Partition Pruning (DPP). The default value is true. To restore the legacy behavior, set it to true.


I think in “To restore the legacy behavior, set it to true.” it should be false instead of true?

Oh, my mistake, thanks for the review! I've already made the changes.

pan3793 · 2026-05-17T16:22:17Z

      .createWithDefault(false)
+
+  val READ_RUNTIME_FILTER_ENABLED =
+    buildConf("spark.sql.kyuubi.hive.connector.read.runtimeFilter.enabled")


I feel we don't need a config for this - 1) Spark has a global config to control this feature, 2) DPP is a general optimization that has no obvious drawbacks.

Makes sense — agreed on both points. I've removed the config and updated the PR, please take another look. Thanks!

pan3793 · 2026-05-17T16:23:24Z

+ * `SupportsRuntimeV2Filtering` to keep this connector compilable against
+ * Spark 3.3, where `SupportsRuntimeV2Filtering` was introduced in Spark 3.4.
+ */
+object HiveRuntimeFilterSupport extends Logging {


the logic inside it becomes simple after moving to v1 SupportsRuntimeFiltering, do we still need such a helper class? do you have a plan to reuse the methods in the follow-up PRs?

Yes, I'm planning to reuse these methods in follow-up PRs for ParquetScan / ORCScan.

pan3793 · 2026-05-17T16:24:55Z

overall lgtm, leave some nits

pan3793 · 2026-05-18T05:32:12Z

thanks, merged to master

[KYUUBI apache#6943][1/2]HiveScan support dpp

50afc82

github-actions Bot added module:spark module:extensions labels May 8, 2026

fix ut

a32b4e8

pan3793 reviewed May 12, 2026

wangzhigang1999 mentioned this pull request May 13, 2026

[FEATURE] Add AGENTS.md to guide AI coding agents contributing to Kyuubi #7445

Closed

4 tasks

use SupportsRuntimeFiltering

824df8e

github-actions Bot added the kind:documentation Documentation is a feature! label May 13, 2026

fix ut

7829bf8

wangzhigang1999 reviewed May 14, 2026

fix doc

a014134

pan3793 reviewed May 17, 2026

pan3793 changed the title ~~[KYUUBI #6943][1/2]HiveScan support dpp~~ [KYUUBI #6943][1/2] HiveScan supports DPP May 17, 2026

remove config

c7ed368

github-actions Bot removed the kind:documentation Documentation is a feature! label May 18, 2026

fix style

a77a1d0

pan3793 approved these changes May 18, 2026

pan3793 assigned maomaodev May 18, 2026

pan3793 added this to the v1.12.0 milestone May 18, 2026

pan3793 closed this in ae352b8 May 18, 2026

maomaodev deleted the kyuubi-6943 branch May 22, 2026 04:28

3 participants
