[KYUUBI #6943][1/2] HiveScan supports DPP #7436
How can KSHC be faster than vanilla Spark?
Thanks for the review! After investigation, KSHC being faster than vanilla Spark on TEXT-format Hive tables mainly comes from two factors. They partially overlap, and together they explain essentially all of the gap:
1. Split sizing differs.
   - Vanilla Spark: with the Hadoop `FileInputFormat` defaults (minSize=1B, blockSize=128MB, numSplits=2), splitSize is only ~2MB, so each small file (1–4MB) is split into ~2 tasks and scheduling overhead dominates.
   - KSHC: with the Spark file-source defaults (maxPartitionBytes=128MB, openCostInBytes=4MB), each small file becomes at most one task.
2. KSHC reuses FileStatus via
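The split-size arithmetic described above can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes the Hadoop `mapred` rule `splitSize = max(minSize, min(goalSize, blockSize))` with `goalSize = totalSize / numSplits` and a 1.1 slop factor, and the Spark file-source rule `maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))`. The 4MB single-file input and the minimum partition count of 8 are made-up inputs chosen only for the example.

```python
import math

MB = 1024 * 1024


def hadoop_split_size(total_bytes, num_splits, min_size=1, block_size=128 * MB):
    """Hadoop-style rule: max(minSize, min(goalSize, blockSize))."""
    goal_size = total_bytes // max(num_splits, 1)
    return max(min_size, min(goal_size, block_size))


def hadoop_split_count(file_bytes, split_size, slop=1.1):
    """Count splits for one file; the last split may be up to 10% oversized."""
    count = 0
    remaining = file_bytes
    while remaining / split_size > slop:
        count += 1
        remaining -= split_size
    if remaining > 0:
        count += 1
    return count


def spark_max_split_bytes(file_sizes, min_partition_num,
                          max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Spark-style rule: min(maxPartitionBytes, max(openCost, bytesPerCore))."""
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // min_partition_num
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def spark_split_count(file_bytes, max_split_bytes):
    """A splittable file is cut into chunks of at most max_split_bytes."""
    return math.ceil(file_bytes / max_split_bytes)


# Assumed input: a partition directory holding a single 4MB text file.
file_size = 4 * MB
hadoop_size = hadoop_split_size(total_bytes=file_size, num_splits=2)
hadoop_tasks = hadoop_split_count(file_size, hadoop_size)

# Assumed minimum partition count of 8 (for example, the default parallelism).
spark_size = spark_max_split_bytes([file_size], min_partition_num=8)
spark_tasks = spark_split_count(file_size, spark_size)

print(hadoop_size // MB, hadoop_tasks, spark_size // MB, spark_tasks)
```

Because `openCostInBytes` acts as a floor on the Spark-side split size, any file of 4MB or less stays in one chunk under these defaults, whereas the Hadoop-side goal size shrinks with the total input size.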
Note on ORC/Parquet
On the Spark 3.3 CI failure
|
maybe it's time to deprecate Spark 3.3 and 3.4 support |
Thanks. I see #7285 just deprecated Spark 3.3 / 3.4. I'll revisit this PR once they're actually dropped. |
|
Oh, we already did that, but forgot to mention it in the "Kyuubi Migration Guide" ...
| * Translate Spark's runtime V2 `IN` predicates into catalyst `InSet(attr, Set[Any])` | ||
| * expressions bound to the given partition attributes. | ||
| */ | ||
| def toCatalystPartitionFilters( |
This makes compatibility very fragile: Catalyst code is treated as an internal implementation detail and provides no compatibility guarantee.
Given that, why implement SupportsRuntimeV2Filtering and translate the V2 Predicate into a Catalyst filter instead of just implementing SupportsRuntimeFiltering?
Good point. I’ll use SupportsRuntimeFiltering in the latest commit, and try to maintain compatibility with Spark 3.3+.
I've replaced SupportsRuntimeV2Filtering with SupportsRuntimeFiltering and updated the PR description; ready for further review.
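As a rough illustration of what runtime filtering does on the scan side, the sketch below prunes a partition list with `IN` filters. It is a language-neutral model in Python, not the connector's Scala code: `InFilter`, `prune_partitions`, and the dictionary-based partition representation are hypothetical names invented for this example. It assumes the V1 `SupportsRuntimeFiltering` contract, in which the scan declares its filterable attributes and later receives source filters to apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InFilter:
    """Hypothetical stand-in for a V1 `In(attribute, values)` source filter."""
    attribute: str
    values: frozenset


def prune_partitions(partitions, filters, partition_columns):
    """Keep partitions that satisfy every IN filter on a partition column.

    Filters on non-partition columns are ignored, since only partition
    columns would be exposed as runtime filter attributes.
    """
    usable = [f for f in filters if f.attribute in partition_columns]
    return [
        p for p in partitions
        if all(p.get(f.attribute) in f.values for f in usable)
    ]


# Made-up partition metadata for a table partitioned by `dt`.
all_partitions = [
    {"dt": "2024-01-01"},
    {"dt": "2024-01-02"},
    {"dt": "2024-01-03"},
]
runtime_filters = [
    InFilter("dt", frozenset({"2024-01-02", "2024-01-03"})),
    InFilter("price", frozenset({10})),  # not a partition column, so skipped
]
kept = prune_partitions(all_partitions, runtime_filters, {"dt"})
print(kept)
```

Working directly on the source filters avoids any dependency on Catalyst expression classes, which is the compatibility concern raised in the review.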
| * Since Kyuubi 1.12, session configurations in REST API responses are redacted by default using `kyuubi.server.redaction.regex`. Use `kyuubi.server.conf.retrieveMode` to control this behavior: `REDACTED` (default), `ORIGINAL` (no redaction), or `NONE` (omit configs entirely). | ||
| * Since Kyuubi 1.12, `GET /api/v1/sessions` returns only sessions owned by the authenticated user instead of all sessions on the server. To restore the previous behavior, set `kyuubi.frontend.rest.legacy.v1.sessionsReturnAllUsers=true`. | ||
| * Since Kyuubi 1.12, the configuration `spark.sql.kyuubi.hive.connector.dropTableAsPurgeTable` is introduced by Kyuubi Spark Hive connector(KSHC) to control whether DROP TABLE command completely remove its data by skipping HDFS trash. The default value is false. To restore the legacy behavior, set it to true. | ||
| * Since Kyuubi 1.12, the configuration `spark.sql.kyuubi.hive.connector.read.runtimeFilter.enabled` is introduced by Kyuubi Spark Hive connector(KSHC) to control whether partition columns are exposed as runtime filter attributes, which is required for Spark Dynamic Partition Pruning (DPP). The default value is true. To restore the legacy behavior, set it to true. |
I think in “To restore the legacy behavior, set it to true.” it should be false instead of true?
Oh, my mistake, thanks for the review! I've made the change.
| .createWithDefault(false) | ||
|
|
||
| val READ_RUNTIME_FILTER_ENABLED = | ||
| buildConf("spark.sql.kyuubi.hive.connector.read.runtimeFilter.enabled") |
I feel we don't need a config for this - 1) Spark has a global config to control this feature, 2) DPP is a general optimization that has no obvious drawbacks.
Makes sense, agreed on both points. I've removed the config and updated the PR. Please take another look, thanks!
| * `SupportsRuntimeV2Filtering` to keep this connector compilable against | ||
| * Spark 3.3, where `SupportsRuntimeV2Filtering` was introduced in Spark 3.4. | ||
| */ | ||
| object HiveRuntimeFilterSupport extends Logging { |
The logic inside becomes simple after moving to the v1 SupportsRuntimeFiltering. Do we still need such a helper class? Do you plan to reuse the methods in follow-up PRs?
Yes, I'm planning to reuse these methods in follow-up PRs for ParquetScan / ORCScan.
|
overall lgtm, leave some nits |
|
thanks, merged to master |
Why are the changes needed?
Part 1 of 2 to add KSHC support for dynamic partition pruning (DPP). See #6943.
- `HiveScan` for non-Parquet/ORC tables.
- `ParquetScan`/`ORCScan` for Parquet/ORC tables.

How was this patch tested?
KSHC Now provides a 5.10% (~144 s) speedup over KSHC Before, with no correctness regression.
DPP triggering was detected by matching `runtime partition filter` in the driver logs.
On the DPP-hit subset, KSHC Now provides a 6.97% speedup over KSHC Before, noticeably larger than the overall 5.10%, indicating that the performance benefit mainly comes from queries where DPP is triggered.
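As a quick consistency check on the overall figure, the sketch below back-computes the total runtimes implied by "5.10% (~144 s)". It assumes the speedup is defined as saved time divided by the KSHC Before total, which the description does not state explicitly. Because 144 s is itself approximate, the results are rough estimates, not measured values.

```python
def implied_runtimes(speedup_fraction, saved_seconds):
    """Return (before, now) totals, assuming speedup = saved / before."""
    before = saved_seconds / speedup_fraction
    return before, before - saved_seconds


# Inputs taken from the reported overall speedup: 5.10% and about 144 s.
before_s, now_s = implied_runtimes(0.0510, 144.0)
print(f"implied before: {before_s:.0f} s, implied now: {now_s:.0f} s")
```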
Was this patch authored or co-authored using generative AI tooling?
Partially assisted by Claude Code (Claude Opus 4.7) for unit tests, code style fixes, and analysis of TPC-DS benchmark results. Core design and implementation are human-authored.