[KYUUBI #6943][1/2] HiveScan supports DPP

maomaodev · pan3793 · commit ae352b8afe53 · 2026-05-18T13:31:51.000+08:00
### Why are the changes needed?

Part 1 of 2 to add KSHC support for dynamic partition pruning (DPP). See #6943.

- [x] Add DPP support in `HiveScan` for non-Parquet/ORC tables.
- [ ] Add DPP support in `ParquetScan` / `ORCScan` for Parquet/ORC tables.

### How was this patch tested?

1. Unit tests
2. Manual test: TPC-DS benchmark (11 GB text dataset).

- **Spark configuration used for the benchmark (Spark 3.5.7, Kyuubi 1.12.0-SNAPSHOT):**

  ```
  spark.driver.cores 1
  spark.driver.memory 4g
  spark.executor.cores 1
  spark.executor.instances 10
  spark.executor.memory 4g
  spark.master yarn
  spark.shuffle.service.enabled true
  spark.yarn.appMasterEnv.JAVA_HOME /usr/local/jdk-17
  spark.executorEnv.JAVA_HOME /usr/local/jdk-17
  ```

- **Overall performance (sum of 99)**

  | Dimension         | Vanilla Spark | KSHC Before |    KSHC Now |
  | ----------------- | ------------: | ----------: | ----------: |
  | Total time        |     5950.10 s |   2836.49 s |   2691.95 s |
  | vs. Vanilla Spark |             — |     −52.33% |     −54.76% |
  | vs. KSHC Before   |             — |           — |  **−5.10%** |

  KSHC Now provides a 5.10% (~144 s) speedup over KSHC Before, with no correctness regression.

- **DPP hit subset (70/99)**

  DPP trigger was detected by matching `runtime partition filter` in the driver logs.

  ```
  3,4,5,6,7,8,10,11,12,13,14,15,17,18,19,20,23,25,26,27,29,30,31,32,33,
  35,36,38,40,42,45,46,47,48,49,50,51,52,53,54,55,56,57,58,60,63,64,65,
  66,67,69,70,71,72,74,75,77,78,79,80,81,83,85,86,87,89,91,92,97,98
  ```

  | Dimension         | Vanilla Spark | KSHC Before |    KSHC Now |
  | ----------------- | ------------: | ----------: | ----------: |
  | Subset total time |     3418.34 s |   2180.60 s |   2028.51 s |
  | vs. Vanilla Spark |             — |     −36.21% |     −40.66% |
  | vs. KSHC Before   |             — |           — |  **−6.97%** |

  On the DPP-hit subset, KSHC Now provides a 6.97% speedup over KSHC Before, noticeably larger than the overall 5.10%, indicating the performance benefit mainly comes from queries where DPP is triggered.

### Was this patch authored or co-authored using generative AI tooling?

Partially assisted by Claude Code (Claude Opus 4.7) for unit test, code style fixes, and analysis of TPC-DS benchmark results. Core design and implementation are human-authored.

Closes #7436 from maomaodev/kyuubi-6943.

Closes #6943

a77a1d0 [lifumao] fix style
c7ed368 [lifumao] remove config
a014134 [lifumao] fix doc
7829bf8 [lifumao] fix ut
824df8e [lifumao] use SupportsRuntimeFiltering
a32b4e8 [lifumao] fix ut
50afc82 [lifumao] [KYUUBI #6943][1/2]HiveScan support dpp

Authored-by: lifumao <lifumao@tencent.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>

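
For readers skimming the diff, the filter-acceptance rule that this patch adds in `HiveRuntimeFilterSupport.toCatalystPartitionFilters` can be illustrated with a minimal, Spark-free sketch. It is not the patch's code: `SimpleFilter`, `InFilter`, `EqualToFilter`, `RuntimeFilterSketch` and `acceptPartitionFilters` are hypothetical stand-ins invented here so the example runs without Spark on the classpath. It only mirrors the idea that an `IN` filter is kept when its attribute resolves to a partition column (case-insensitively unless case-sensitive analysis is on) and everything else is dropped.

```scala
import java.util.Locale

// Hypothetical stand-ins for Spark's `org.apache.spark.sql.sources.Filter` hierarchy.
// They exist only so that this sketch compiles without Spark.
sealed trait SimpleFilter
final case class InFilter(attribute: String, values: Seq[Any]) extends SimpleFilter
final case class EqualToFilter(attribute: String, value: Any) extends SimpleFilter

object RuntimeFilterSketch {

  // Lower-case the name unless the session is case sensitive.
  private def normalize(name: String, caseSensitive: Boolean): String =
    if (caseSensitive) name else name.toLowerCase(Locale.ROOT)

  /**
   * Keep only `IN` filters whose attribute resolves to a known partition column.
   * The returned filters carry the partition column's canonical name.
   */
  def acceptPartitionFilters(
      filters: Seq[SimpleFilter],
      partitionColumns: Seq[String],
      caseSensitive: Boolean): Seq[InFilter] = {
    val byName: Map[String, String] =
      partitionColumns.map(c => normalize(c, caseSensitive) -> c).toMap

    filters.flatMap {
      case InFilter(attribute, values) =>
        // Unknown attributes yield None and are silently dropped.
        byName.get(normalize(attribute, caseSensitive)).map(col => InFilter(col, values))
      case _ => None
    }
  }
}
```

Under these assumptions, calling `RuntimeFilterSketch.acceptPartitionFilters` with an `InFilter("DT", ...)`, an `InFilter("id", ...)` and an `EqualToFilter("dt", ...)` against the partition column `dt` keeps only the first filter when `caseSensitive = false`, and keeps nothing when `caseSensitive = true`. The real implementation additionally binds each accepted filter to a catalyst `AttributeReference` and logs dropped filters at DEBUG.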
diff --git a/extensions/spark/kyuubi-spark-connector-hive/src/main/scala/org/apache/kyuubi/spark/connector/hive/read/HiveFileIndex.scala b/extensions/spark/kyuubi-spark-connector-hive/src/main/scala/org/apache/kyuubi/spark/connector/hive/read/HiveFileIndex.scala
@@ -52,6 +52,18 @@ class HiveCatalogFileIndex(
 
   private val baseLocation: Option[URI] = table.storage.locationUri
 
+  // Align with Spark's built-in CatalogFileIndex by explicitly overriding equals.
+  // This keeps `BatchScanExec#equals` stable and enables BroadcastExchange reuse under DPP.
+  override def equals(other: Any): Boolean = other match {
+    case that: HiveCatalogFileIndex =>
+      this.hiveCatalog.name == that.hiveCatalog.name &&
+      this.catalogTable.identifier == that.catalogTable.identifier
+    case _ => false
+  }
+
+  override def hashCode(): Int =
+    31 * hiveCatalog.name.hashCode + catalogTable.identifier.hashCode
+
   override def partitionSchema: StructType = table.partitionSchema
 
   override def listFiles(
diff --git a/extensions/spark/kyuubi-spark-connector-hive/src/main/scala/org/apache/kyuubi/spark/connector/hive/read/HiveRuntimeFilterSupport.scala b/extensions/spark/kyuubi-spark-connector-hive/src/main/scala/org/apache/kyuubi/spark/connector/hive/read/HiveRuntimeFilterSupport.scala
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kyuubi.spark.connector.hive.read
+
+import java.util.Locale
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, In, Literal}
+import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
+import org.apache.spark.sql.hive.kyuubi.connector.HiveBridgeHelper.StructTypeHelper
+import org.apache.spark.sql.sources.{Filter, In => FilterIn}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * Helpers for a Hive-backed V2 [[org.apache.spark.sql.connector.read.Scan]] to
+ * implement [[org.apache.spark.sql.connector.read.SupportsRuntimeFiltering]]
+ * for Dynamic Partition Pruning (DPP).
+ *
+ * Spark's `DataSourceV2Strategy` currently only emits the `IN` form as a DPP
+ * runtime filter, so translation here handles `In` only. Any filter whose
+ * attribute does not match a known partition column is dropped; drops are
+ * logged at DEBUG.
+ *
+ * We deliberately use the V1 `SupportsRuntimeFiltering` instead of the newer
+ * `SupportsRuntimeV2Filtering` to keep this connector compilable against
+ * Spark 3.3; `SupportsRuntimeV2Filtering` was only introduced in Spark 3.4.
+ */
+object HiveRuntimeFilterSupport extends Logging {
+
+  /**
+   * Build the runtime-filterable attribute array. Only partition columns are exposed
+   * because DPP is only beneficial at the partition directory granularity.
+   */
+  def filterAttributes(partitionColumnNames: Seq[String]): Array[NamedReference] = {
+    partitionColumnNames.map(Expressions.column).toArray
+  }
+
+  /**
+   * Translate Spark's runtime V1 `In` filters into catalyst [[In]] expressions
+   * bound to the given partition attributes.
+   *
+   * A filter is accepted only when it is a [[FilterIn]] whose attribute resolves
+   * to a known partition column.
+   */
+  def toCatalystPartitionFilters(
+      filters: Array[Filter],
+      partitionSchema: StructType,
+      isCaseSensitive: Boolean): Seq[Expression] = {
+    val attrByName: Map[String, AttributeReference] =
+      partitionSchema.toAttributes
+        .map(a => normalize(a.name, isCaseSensitive) -> a).toMap
+
+    val accepted = filters.toSeq.flatMap {
+      case FilterIn(attribute, values) =>
+        attrByName.get(normalize(attribute, isCaseSensitive)).map { attr =>
+          In(attr, values.toSeq.map(v => Literal.create(v, attr.dataType)))
+        }
+      case _ => None
+    }
+
+    if (accepted.length < filters.length) {
+      logDebug(
+        s"Dropped ${filters.length - accepted.length} of ${filters.length} runtime " +
+          s"filter(s) not applicable to partition columns " +
+          s"[${partitionSchema.fieldNames.mkString(", ")}]")
+    }
+    accepted
+  }
+
+  private def normalize(name: String, isCaseSensitive: Boolean): String =
+    if (isCaseSensitive) {
+      name
+    } else {
+      name.toLowerCase(Locale.ROOT)
+    }
+}
diff --git a/extensions/spark/kyuubi-spark-connector-hive/src/main/scala/org/apache/kyuubi/spark/connector/hive/read/HiveScan.scala b/extensions/spark/kyuubi-spark-connector-hive/src/main/scala/org/apache/kyuubi/spark/connector/hive/read/HiveScan.scala
@@ -27,7 +27,8 @@ import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTablePartition}
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
 import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
-import org.apache.spark.sql.connector.read.PartitionReaderFactory
+import org.apache.spark.sql.connector.expressions.NamedReference
+import org.apache.spark.sql.connector.read.{PartitionReaderFactory, SupportsRuntimeFiltering}
 import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile}
 import org.apache.spark.sql.execution.datasources.v2.FileScan
 import org.apache.spark.sql.hive.kyuubi.connector.HiveBridgeHelper.HiveClientImpl
@@ -46,13 +47,29 @@ case class HiveScan(
     readPartitionSchema: StructType,
     pushedFilters: Array[Filter] = Array.empty,
     partitionFilters: Seq[Expression] = Seq.empty,
-    dataFilters: Seq[Expression] = Seq.empty) extends FileScan {
+    dataFilters: Seq[Expression] = Seq.empty) extends FileScan
+  with SupportsRuntimeFiltering {
 
   private val isCaseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
 
   private val partFileToHivePartMap: mutable.Map[PartitionedFile, CatalogTablePartition] =
     mutable.Map()
 
+  private var runtimeFilters: Seq[Expression] = Seq.empty
+
+  // Align with Spark's built-in ParquetScan/OrcScan by explicitly overriding equals.
+  // This keeps `BatchScanExec#equals` stable and enables BroadcastExchange reuse under DPP.
+  override def equals(obj: Any): Boolean = obj match {
+    case other: HiveScan =>
+      super.equals(other) &&
+      catalogTable.identifier == other.catalogTable.identifier &&
+      dataSchema == other.dataSchema &&
+      equivalentFilters(pushedFilters, other.pushedFilters)
+    case _ => false
+  }
+
+  override def hashCode(): Int = getClass.hashCode()
+
   override def isSplitable(path: Path): Boolean = {
     catalogTable.provider.map(_.toUpperCase(Locale.ROOT)).exists {
       case "PARQUET" => true
@@ -83,8 +100,9 @@ case class HiveScan(
   }
 
   override protected def partitions: Seq[FilePartition] = {
+    val effectivePartitionFilters = partitionFilters ++ runtimeFilters
     val (selectedPartitions, partDirToHivePartMap) =
-      fileIndex.listHiveFiles(partitionFilters, dataFilters)
+      fileIndex.listHiveFiles(effectivePartitionFilters, dataFilters)
     val maxSplitBytes = FilePartition.maxSplitBytes(sparkSession, selectedPartitions)
     val partitionAttributes = toAttributes(fileIndex.partitionSchema)
     val attributeMap = partitionAttributes.map(a => normalizeName(a.name) -> a).toMap
@@ -157,4 +175,25 @@ case class HiveScan(
 
   def toAttributes(structType: StructType): Seq[AttributeReference] =
     structType.map(f => AttributeReference(f.name, f.dataType, f.nullable, f.metadata)())
+
+  // -------------------------------------------------------------------------------
+  // SupportsRuntimeFiltering implementation
+  // -------------------------------------------------------------------------------
+
+  override def filterAttributes(): Array[NamedReference] = {
+    HiveRuntimeFilterSupport.filterAttributes(readPartitionSchema.fieldNames.toSeq)
+  }
+
+  override def filter(filters: Array[Filter]): Unit = {
+    runtimeFilters = HiveRuntimeFilterSupport.toCatalystPartitionFilters(
+      filters,
+      fileIndex.partitionSchema,
+      isCaseSensitive)
+    if (runtimeFilters.nonEmpty) {
+      logInfo(s"Received ${runtimeFilters.length} runtime partition filter(s) for " +
+        s"${catalogTable.identifier}")
+      logDebug(s"Runtime partition filter(s) for ${catalogTable.identifier}: " +
+        s"${runtimeFilters.mkString(", ")}")
+    }
+  }
 }
diff --git a/extensions/spark/kyuubi-spark-connector-hive/src/test/scala/org/apache/kyuubi/spark/connector/hive/DynamicPartitionPruningSuite.scala b/extensions/spark/kyuubi-spark-connector-hive/src/test/scala/org/apache/kyuubi/spark/connector/hive/DynamicPartitionPruningSuite.scala
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kyuubi.spark.connector.hive
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.expressions.DynamicPruningExpression
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec
+import org.apache.spark.sql.execution.datasources.v2.BatchScanExec
+
+import org.apache.kyuubi.spark.connector.hive.read.HiveScan
+
+class DynamicPartitionPruningSuite extends KyuubiHiveTest {
+
+  private def findBatchScanExec(
+      spark: SparkSession,
+      sql: String,
+      tableNameHint: String): BatchScanExec = {
+    // Match on `HiveScan.catalogTable` rather than the node's `toString` because
+    // `BatchScanExec.toString` shape differs across Spark versions.
+    def matchesHint(b: BatchScanExec): Boolean = b.scan match {
+      case h: HiveScan => h.catalogTable.identifier.table == tableNameHint
+      case _ => false
+    }
+
+    @tailrec
+    def findBatchScan(plan: SparkPlan): Option[BatchScanExec] = plan match {
+      case aqe: AdaptiveSparkPlanExec => findBatchScan(aqe.inputPlan)
+      case _ => plan.collectFirst {
+          case b: BatchScanExec if matchesHint(b) => b
+        }
+    }
+
+    val exec = findBatchScan(spark.sql(sql).queryExecution.executedPlan)
+    assert(exec.isDefined)
+    exec.get
+  }
+
+  test("HiveScan supports DPP runtime filtering on partition columns") {
+    Seq(true, false).foreach { enabled =>
+      withSparkSession(Map(
+        "hive.exec.dynamic.partition.mode" -> "nonstrict",
+        "spark.sql.optimizer.dynamicPartitionPruning.enabled" -> enabled.toString)) { spark =>
+        val suffix = if (enabled) "on" else "off"
+        val fact = s"hive.default.dpp_fact_$suffix"
+        val dim = s"hive.default.dpp_dim_$suffix"
+
+        withTable(fact, dim) {
+          spark.sql(
+            s"""
+               | CREATE TABLE $fact (id INT, v STRING) PARTITIONED BY (dt STRING)
+               | STORED AS TEXTFILE
+               |""".stripMargin).collect()
+          spark.sql(s"INSERT INTO $fact PARTITION (dt='2026-01-01') VALUES (1, 'a'), (2, 'b')")
+          spark.sql(s"INSERT INTO $fact PARTITION (dt='2026-05-01') VALUES (3, 'c'), (4, 'd')")
+          spark.sql(s"INSERT INTO $fact PARTITION (dt='2026-09-01') VALUES (5, 'e'), (6, 'f')")
+
+          spark.sql(
+            s"""
+               | CREATE TABLE $dim (dt STRING, tag STRING)
+               | STORED AS TEXTFILE
+               |""".stripMargin).collect()
+          spark.sql(s"INSERT INTO $dim VALUES ('2026-05-01', 'target')")
+
+          val sql =
+            s"""
+               | SELECT f.id, f.v, f.dt
+               | FROM $fact f JOIN $dim d ON f.dt = d.dt
+               | WHERE d.tag = 'target'
+               |""".stripMargin
+
+          checkAnswer(
+            spark.sql(sql),
+            Seq(
+              Row(3, "c", "2026-05-01"),
+              Row(4, "d", "2026-05-01")))
+
+          // DPP being actually applied is observable as a `DynamicPruningExpression`
+          // injected into `BatchScanExec.runtimeFilters`.
+          val exec = findBatchScanExec(spark, sql, fact.split('.').last)
+          val hasDpp = exec.runtimeFilters.exists(_.isInstanceOf[DynamicPruningExpression])
+          assert(hasDpp == enabled)
+        }
+      }
+    }
+  }
+}
diff --git a/extensions/spark/kyuubi-spark-connector-hive/src/test/scala/org/apache/kyuubi/spark/connector/hive/read/HiveRuntimeFilterSupportSuite.scala b/extensions/spark/kyuubi-spark-connector-hive/src/test/scala/org/apache/kyuubi/spark/connector/hive/read/HiveRuntimeFilterSupportSuite.scala