[KYUUBI #6943][1/2] HiveScan supports DPP #7436

Closed
maomaodev wants to merge 7 commits into apache:master from maomaodev:kyuubi-6943

@maomaodev maomaodev commented May 8, 2026

Contributor

Why are the changes needed?

Part 1 of 2 to add KSHC support for dynamic partition pruning (DPP). See #6943.

  • Add DPP support in HiveScan for non-Parquet/ORC tables.
  • Add DPP support in ParquetScan / ORCScan for Parquet/ORC tables.

How was this patch tested?

  1. Unit tests
  2. Manual test: TPC-DS benchmark (11 GB text dataset).
  • Spark configuration used for the benchmark (Spark 3.5.7, Kyuubi 1.12.0-SNAPSHOT):
spark.driver.cores    1
spark.driver.memory    4g
spark.executor.cores    1
spark.executor.instances    10
spark.executor.memory    4g
spark.master    yarn
spark.shuffle.service.enabled    true
spark.yarn.appMasterEnv.JAVA_HOME /usr/local/jdk-17
spark.executorEnv.JAVA_HOME /usr/local/jdk-17
  • Overall performance (sum of 99)
| Dimension | Vanilla Spark | KSHC Before | KSHC Now |
|---|---|---|---|
| Total time | 5950.10 s | 2836.49 s | 2691.95 s |
| vs. Vanilla Spark | | −52.33% | −54.76% |
| vs. KSHC Before | | | −5.10% |

KSHC Now provides a 5.10% (~144 s) speedup over KSHC Before, with no correctness regression.

  • DPP hit subset (70/99)

DPP hits were detected by matching runtime partition filter entries in the driver logs.

3,4,5,6,7,8,10,11,12,13,14,15,17,18,19,20,23,25,26,27,29,30,31,32,33,
35,36,38,40,42,45,46,47,48,49,50,51,52,53,54,55,56,57,58,60,63,64,65,
66,67,69,70,71,72,74,75,77,78,79,80,81,83,85,86,87,89,91,92,97,98
| Dimension | Vanilla Spark | KSHC Before | KSHC Now |
|---|---|---|---|
| Subset total time | 3418.34 s | 2180.60 s | 2028.51 s |
| vs. Vanilla Spark | | −36.21% | −40.66% |
| vs. KSHC Before | | | −6.97% |

On the DPP-hit subset, KSHC Now provides a 6.97% speedup over KSHC Before, noticeably larger than the overall 5.10%, indicating the performance benefit mainly comes from queries where DPP is triggered.
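
For anyone re-deriving the percentages in the two tables above, here is a minimal sketch of the arithmetic. `RelativeChange` and `pctChange` are illustrative names, not part of the patch; the inputs are the totals reported in the tables.

```scala
object RelativeChange {
  // Relative change of `now` versus `baseline`, in percent.
  // A negative value means `now` is faster than `baseline`.
  def pctChange(baseline: Double, now: Double): Double =
    (now - baseline) / baseline * 100.0

  def main(args: Array[String]): Unit = {
    // KSHC Now vs. KSHC Before, all 99 queries.
    val overall = pctChange(baseline = 2836.49, now = 2691.95)
    // KSHC Now vs. KSHC Before, 70-query DPP-hit subset.
    val subset = pctChange(baseline = 2180.60, now = 2028.51)
    println(f"overall: $overall%.2f%%, DPP-hit subset: $subset%.2f%%")
  }
}
```

The same function reproduces the "vs. Vanilla Spark" rows when the vanilla totals are passed as `baseline`.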

Was this patch authored or co-authored using generative AI tooling?

Partially assisted by Claude Code (Claude Opus 4.7) for unit tests, code style fixes, and analysis of TPC-DS benchmark results. Core design and implementation are human-authored.

pan3793 commented May 8, 2026

Member

How can KSHC be faster than Vanilla Spark?

maomaodev commented May 8, 2026

Contributor Author

> How can KSHC be faster than Vanilla Spark?

Thanks for the review! After investigation, KSHC being faster than vanilla Spark on TEXT-format Hive tables mainly comes down to two factors. They partially overlap, and together they explain essentially all of the gap:
1. Different file-splitting strategies

  • Vanilla Spark reads via HadoopRDD + mapred.FileInputFormat.getSplits(JobConf, numSplits):
splitSize = max(minSize, min(goalSize, blockSize)),  goalSize = totalSize / numSplits

With the defaults (minSize=1B, blockSize=128MB, numSplits=2), splitSize is only ~2MB, so each small file (1–4MB) is split into ~2 tasks, and scheduling overhead dominates.

  • KSHC uses DataSource V2 FileScan + FilePartition:
maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))

With the defaults (maxPartitionBytes=128MB, openCostInBytes=4MB), each small file becomes at most one task.

  • Validation (full 99 TPC-DS queries): setting mapreduce.input.fileinputformat.split.minsize=128M on vanilla Spark to align splitting with KSHC drops the total from 5427.47s → 3560.15s (saving ~1867s, about 65% of the 2853s gap).
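
To make the two formulas above concrete, here is a minimal sketch in plain Scala with no Spark or Hadoop dependency. `SplitSizing` and its method names are illustrative only. `bytesPerCore = 1 MB` is an assumed value chosen for the example, and the split count ignores Hadoop's 10% slop on the last split.

```scala
object SplitSizing {
  val MB: Long = 1024L * 1024L

  // mapred.FileInputFormat-style sizing:
  // goalSize = totalSize / numSplits, splitSize = max(minSize, min(goalSize, blockSize)).
  def mapredSplitSize(totalSize: Long, numSplits: Int, minSize: Long, blockSize: Long): Long = {
    val goalSize = totalSize / math.max(numSplits, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  // FilePartition-style sizing:
  // maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)).
  def dsv2MaxSplitBytes(maxPartitionBytes: Long, openCostInBytes: Long, bytesPerCore: Long): Long =
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

  // Number of splits for one file (ceiling division; ignores the 10% slop).
  def splitsFor(fileSize: Long, splitSize: Long): Long =
    if (fileSize == 0L) 0L else (fileSize + splitSize - 1) / splitSize

  def main(args: Array[String]): Unit = {
    val file = 4 * MB // one small file in a partition directory

    val v1 = mapredSplitSize(totalSize = file, numSplits = 2, minSize = 1L, blockSize = 128 * MB)
    val v1Aligned = mapredSplitSize(totalSize = file, numSplits = 2, minSize = 128 * MB, blockSize = 128 * MB)
    val v2 = dsv2MaxSplitBytes(maxPartitionBytes = 128 * MB, openCostInBytes = 4 * MB, bytesPerCore = 1 * MB)

    println(s"mapred default:       splitSize=${v1 / MB}MB splits=${splitsFor(file, v1)}")
    println(s"mapred minsize=128M:  splitSize=${v1Aligned / MB}MB splits=${splitsFor(file, v1Aligned)}")
    println(s"DSv2 FilePartition:   maxSplitBytes=${v2 / MB}MB splits=${splitsFor(file, v2)}")
  }
}
```

Under these assumptions, raising `minSize` to 128 MB brings the mapred path down to one split per small file, which is the alignment used in the validation run.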

2. KSHC reuses FileStatus via FileStatusCache

  • Vanilla Spark goes through HiveMetastoreCatalogHadoopFsRelation on every scan and calls listStatus again for each partition. KSHC's HiveCatalogFileIndex reuses FileStatus across scans within a session.
  • Validation (full 99 TPC-DS queries): replacing KSHC's fileStatusCache with NoopCache increases the total from 2574.12s → 4334.32s (adding ~1760s, about 62% of the 2853s gap).
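
The effect of reusing listings across scans can be illustrated with a small, self-contained sketch. This is not Spark's `FileStatusCache` API; `ListingCache` and `FileEntry` are hypothetical stand-ins that only show the idea of listing each partition directory once per session.

```scala
import scala.collection.mutable

object ListingCacheSketch {
  // Hypothetical stand-in for Hadoop's FileStatus.
  final case class FileEntry(path: String, length: Long)

  // Session-scoped cache keyed by partition directory.
  // `list` is the expensive call (listStatus on the file system in a real setup).
  final class ListingCache(list: String => Seq[FileEntry]) {
    private val cache = mutable.Map.empty[String, Seq[FileEntry]]
    var listCalls = 0 // counts how often the expensive call actually ran

    def leafFiles(dir: String): Seq[FileEntry] = synchronized {
      cache.getOrElseUpdate(dir, { listCalls += 1; list(dir) })
    }
  }

  def main(args: Array[String]): Unit = {
    val cache = new ListingCache(dir => Seq(FileEntry(s"$dir/part-00000", 4L * 1024 * 1024)))
    // Two scans over the same partition trigger a single listing.
    cache.leafFiles("/warehouse/t/ds=1")
    cache.leafFiles("/warehouse/t/ds=1")
    println(s"listing calls: ${cache.listCalls}")
  }
}
```

A no-op cache corresponds to calling `list` on every scan, which is the configuration used in the validation run above.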

Note on ORC/Parquet
We also tested a 10 GB TPC-DS dataset in ORC/Parquet format. There, vanilla Spark is faster than KSHC, so this gap does not appear.

On the Spark 3.3 CI failure
SupportsRuntimeV2Filtering was introduced in Spark 3.4, so cross-version compilation against 3.3 fails. ORC/Parquet DPP additionally relies on Scan.columnarSupportMode (Spark 3.5+); without it, the DPP benefit is fully cancelled by a pre-DPP full-table listing (see apache/spark#42099). Any suggestions on the preferred direction? Or should we just support Spark 3.5+?

@maomaodev

Contributor Author

Gentle ping @pan3793: f66216b says KSHC guarantees binary compatibility from Spark 3.3 onwards. Do we need to keep that, or is it acceptable to bump to 3.5+ in the next release? Both Spark 3.3 and 3.4 are EOL upstream.

pan3793 commented May 12, 2026

Member

maybe it's time to deprecate Spark 3.3 and 3.4 support

@maomaodev

Contributor Author

> maybe it's time to deprecate Spark 3.3 and 3.4 support

Thanks. I see #7285 just deprecated Spark 3.3 / 3.4. I'll revisit this PR once they're actually dropped.

pan3793 commented May 12, 2026

Member

oh, we already did that, but forgot to mention it in the "Kyuubi Migration Guide" ...

* Translate Spark's runtime V2 `IN` predicates into catalyst `InSet(attr, Set[Any])`
* expressions bound to the given partition attributes.
*/
def toCatalystPartitionFilters(

Member

This makes code compatibility very fragile: catalyst code is treated as an internal implementation detail and does not provide any compatibility guarantee.

Given the situation, why implement SupportsRuntimeV2Filtering and translate the V2 Predicate to a catalyst Filter instead of just implementing SupportsRuntimeFiltering?

Contributor Author

Good point. I’ll use SupportsRuntimeFiltering in the latest commit, and try to maintain compatibility with Spark 3.3+.

Contributor Author

I've replaced SupportsRuntimeV2Filtering with SupportsRuntimeFiltering and updated the PR description; ready for further review.
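
For readers following this thread, here is a rough, dependency-free sketch of what a runtime `IN` filter does to a partition list once it reaches the scan. It is not the code in this PR. `InFilter` is a hypothetical stand-in for Spark's V1 source filter `org.apache.spark.sql.sources.In`, and `Partition` and `prune` are illustrative names.

```scala
object RuntimeInFilterSketch {
  // Hypothetical stand-in for org.apache.spark.sql.sources.In(attribute, values).
  final case class InFilter(attribute: String, values: Set[Any])

  // Hypothetical partition descriptor: partition column values plus a location.
  final case class Partition(spec: Map[String, Any], location: String)

  // Keep a partition only if every filter on one of its partition columns matches.
  // Filters on attributes that are not partition columns are ignored here.
  def prune(partitions: Seq[Partition], filters: Seq[InFilter]): Seq[Partition] =
    partitions.filter { p =>
      filters.forall { f =>
        p.spec.get(f.attribute).forall(f.values.contains)
      }
    }

  def main(args: Array[String]): Unit = {
    val parts = Seq(1, 2, 3).map(d => Partition(Map("ds" -> d), s"/warehouse/t/ds=$d"))
    // Values as they might come from the dimension side of the join at runtime.
    val kept = prune(parts, Seq(InFilter("ds", Set[Any](1, 3))))
    kept.foreach(p => println(p.location))
  }
}
```

Working on plain values like this avoids touching catalyst expressions, which is the compatibility concern raised above.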

@github-actions github-actions Bot added the kind:documentation Documentation is a feature! label May 13, 2026
Comment thread docs/deployment/migration-guide.md Outdated
* Since Kyuubi 1.12, session configurations in REST API responses are redacted by default using `kyuubi.server.redaction.regex`. Use `kyuubi.server.conf.retrieveMode` to control this behavior: `REDACTED` (default), `ORIGINAL` (no redaction), or `NONE` (omit configs entirely).
* Since Kyuubi 1.12, `GET /api/v1/sessions` returns only sessions owned by the authenticated user instead of all sessions on the server. To restore the previous behavior, set `kyuubi.frontend.rest.legacy.v1.sessionsReturnAllUsers=true`.
* Since Kyuubi 1.12, the configuration `spark.sql.kyuubi.hive.connector.dropTableAsPurgeTable` is introduced by Kyuubi Spark Hive connector(KSHC) to control whether DROP TABLE command completely remove its data by skipping HDFS trash. The default value is false. To restore the legacy behavior, set it to true.
* Since Kyuubi 1.12, the configuration `spark.sql.kyuubi.hive.connector.read.runtimeFilter.enabled` is introduced by Kyuubi Spark Hive connector(KSHC) to control whether partition columns are exposed as runtime filter attributes, which is required for Spark Dynamic Partition Pruning (DPP). The default value is true. To restore the legacy behavior, set it to true.

Contributor

I think in “To restore the legacy behavior, set it to true.” it should be false instead of true?

Contributor Author

Oh, my mistake, thanks for the review! I've already made the change.

.createWithDefault(false)

val READ_RUNTIME_FILTER_ENABLED =
buildConf("spark.sql.kyuubi.hive.connector.read.runtimeFilter.enabled")

Member

I feel we don't need a config for this - 1) Spark has a global config to control this feature, 2) DPP is a general optimization that has no obvious drawbacks.

Contributor Author

Makes sense — agreed on both points. I've removed the config and updated the PR, please take another look. Thanks!

* `SupportsRuntimeV2Filtering` to keep this connector compilable against
* Spark 3.3, where `SupportsRuntimeV2Filtering` was introduced in Spark 3.4.
*/
object HiveRuntimeFilterSupport extends Logging {

@pan3793 pan3793 May 17, 2026

Member

the logic inside it becomes simple after moving to v1 SupportsRuntimeFiltering, do we still need such a helper class? do you have a plan to reuse the methods in the follow-up PRs?

Contributor Author

Yes, I'm planning to reuse these methods in follow-up PRs for ParquetScan / ORCScan.

@pan3793 pan3793 changed the title [KYUUBI #6943][1/2]HiveScan support dpp [KYUUBI #6943][1/2] HiveScan supports DPP May 17, 2026

pan3793 commented May 17, 2026

Member

overall lgtm, leave some nits

@github-actions github-actions Bot removed the kind:documentation Documentation is a feature! label May 18, 2026
@pan3793 pan3793 added this to the v1.12.0 milestone May 18, 2026
@pan3793 pan3793 closed this in ae352b8 May 18, 2026

pan3793 commented May 18, 2026

Member

thanks, merged to master

@maomaodev maomaodev deleted the kyuubi-6943 branch May 22, 2026 04:28