[GLUTEN-10511][VL][Delta] Document logical/physical split + future cleanup

sezruby · claude · sezruby · commit abedd16cff46 · 2026-06-04T21:27:42.000-07:00
Expand the docstring of `transformColumnMappingPlan` to explain why some scan
node fields stay logical while others become physical, and note the longer-term
cleanup direction: defer all physical translation to Substrait emission, which
would remove both the alias-back `ProjectExecTransformer` and the `scanFilters`
override.

Mirror that context in the comment on the `DeltaScanTransformer.scanFilters`
override so each site can be read on its own.
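
For readers without the Gluten sources at hand, the exprId-based translation
that the `scanFilters` override performs can be sketched roughly as below. This
is a minimal, self-contained model and not the Gluten code: `ExprId`, `Attr`,
`Ref`, `Lit`, `GreaterThan`, `And`, and `ScanFilterTranslation` are hypothetical
stand-ins for Spark's Catalyst classes, and the physical name `col-7f3a` is made
up for illustration.

```scala
// Hypothetical stand-ins for Catalyst's ExprId / AttributeReference / Expression.
final case class ExprId(id: Long)
final case class Attr(name: String, exprId: ExprId)

sealed trait Expr
final case class Ref(attr: Attr) extends Expr
final case class Lit(value: Any) extends Expr
final case class GreaterThan(left: Expr, right: Expr) extends Expr
final case class And(left: Expr, right: Expr) extends Expr

object ScanFilterTranslation {
  // Replace every attribute in `filter` with the attribute from `output` that
  // shares its exprId. Attributes with no match in `output` are left unchanged.
  def toPhysical(filter: Expr, output: Seq[Attr]): Expr = {
    val physicalByExprId = output.map(a => a.exprId -> a).toMap
    def rewrite(e: Expr): Expr = e match {
      case Ref(a)            => Ref(physicalByExprId.getOrElse(a.exprId, a))
      case GreaterThan(l, r) => GreaterThan(rewrite(l), rewrite(r))
      case And(l, r)         => And(rewrite(l), rewrite(r))
      case lit: Lit          => lit
    }
    rewrite(filter)
  }

  def main(args: Array[String]): Unit = {
    // Same exprId, different names: logical on the filter, physical on the output.
    val logicalAge  = Attr("age", ExprId(1))
    val physicalAge = Attr("col-7f3a", ExprId(1))

    val scanFilter     = GreaterThan(Ref(logicalAge), Lit(21))
    val pushDownFilter = GreaterThan(Ref(physicalAge), Lit(21))

    // Name-sensitive equality: the untranslated filter does not match.
    assert(scanFilter != pushDownFilter)
    // After translation by exprId, the two sides line up.
    assert(toPhysical(scanFilter, Seq(physicalAge)) == pushDownFilter)
  }
}
```

The model uses case-class equality, which compares names as well as exprIds, to
mimic the name-sensitive match described in the docstring; the real classes
carry more fields than shown here.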

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/gluten-delta/src/main/scala/org/apache/gluten/execution/DeltaScanTransformer.scala b/gluten-delta/src/main/scala/org/apache/gluten/execution/DeltaScanTransformer.scala
@@ -56,13 +56,25 @@ case class DeltaScanTransformer(
 
   override lazy val fileFormat: ReadFileFormat = ReadFileFormat.ParquetReadFormat
 
-  // For Delta column-mapping tables, `dataFilters` are kept LOGICAL on the scan node so that
-  // `PreparedDeltaFileIndex` (Delta's file index, which uses logical names for partition pruning
-  // and stats-based file skipping) resolves them correctly. The native side, however, must see
-  // PHYSICAL names. `output` and `dataSchema` are physical, and `BasicScanExecTransformer`
-  // matches `scanFilters` against `pushDownFilters` (which are derived from a Filter referencing
-  // the physical-named scan output) by AttributeReference equality, which checks names. Translate
-  // the logical filter attrs to their physical counterparts in `output` so the two sets line up.
+  // For Delta column-mapping tables, `dataFilters` on the scan node are LOGICAL-named so Delta's
+  // file index (`PreparedDeltaFileIndex.matchingFiles`, `Snapshot.filesForScan`) can do partition
+  // pruning and stats-based file skipping -- both resolve filter attrs against logical schemas.
+  //
+  // The native (Velox) side, however, must see PHYSICAL names: `output` and `dataSchema` are
+  // physical (so the parquet reader finds the right column), and `BasicScanExecTransformer`
+  // matches `scanFilters` against `pushDownFilters` (built from a `Filter` that references the
+  // physical-named scan output) by `AttributeReference.equals`, which compares names. Without
+  // this override, the logical-named `scanFilters` and physical-named `pushDownFilters` would
+  // never match, causing duplicate filter evaluation in the substrait plan.
+  //
+  // Translate by exprId match against `output` rather than by re-running Delta's column-mapping
+  // helpers; exprIds are stable across the post-transform rewrite and don't require a second
+  // metadata lookup.
+  //
+  // See `DeltaPostTransformRules.transformColumnMappingPlan` for the full picture of which
+  // fields stay logical vs. become physical, and the longer-term cleanup direction (do all
+  // physical translation at substrait emission time so this override and the alias-back
+  // ProjectExec both go away).
   override def scanFilters: Seq[Expression] = relation.fileFormat match {
     case d: DeltaParquetFileFormat if d.columnMappingMode != NoMapping =>
       val physicalByExprId = output.collect { case ar: AttributeReference => ar.exprId -> ar }.toMap
diff --git a/gluten-delta/src/main/scala/org/apache/gluten/extension/DeltaPostTransformRules.scala b/gluten-delta/src/main/scala/org/apache/gluten/extension/DeltaPostTransformRules.scala
@@ -164,16 +164,43 @@ object DeltaPostTransformRules {
   }
 
   /**
-   * This method is only used for Delta ColumnMapping FileFormat(e.g. nameMapping and idMapping)
-   * transform the metadata of Delta into Parquet's, each plan should only be transformed once.
+   * Used for Delta ColumnMapping FileFormat (nameMapping and idMapping). Each plan is transformed
+   * at most once; the first run is tagged so re-runs are no-ops.
    *
-   * Partition and data filters on the scan node stay LOGICAL so that Delta's
-   * `PreparedDeltaFileIndex` can do partition pruning and file-level data skipping (its partition
-   * schema and column-stats schema both use logical names). Reader-facing pieces (`output`,
-   * `dataSchema`, and the data fields of `requiredSchema`) become physical so the parquet reader
-   * and Velox find the right columns in the file. Filter binding to the native side is by exprId,
-   * not by name, so logical-named filter attributes still resolve correctly against the
-   * physical-named `output`.
+   * Background: with column mapping, Delta files are written with PHYSICAL column names while
+   * Delta's metadata (partition schema, column stats) keeps LOGICAL names. Vanilla Spark + Delta
+   * resolves this asymmetry inside `DeltaParquetFileFormat.buildReaderWithPartitionValues`:
+   * everything on the scan node stays logical, and physical translation happens just-in-time when
+   * handing data and filters to the parquet reader. Gluten bypasses that hook (it goes to native
+   * via Substrait), so the translation has to live somewhere on our side.
+   *
+   * What this rule produces is listed below; the parts that diverge from vanilla Spark are also
+   * commented at each site. The split by consumer is asymmetric on purpose:
+   *
+   *   - `output`, `dataSchema`, and the data fields of `requiredSchema` ==> PHYSICAL. These flow
+   *     into the substrait `NamedStruct` that Velox uses to look up columns in the parquet file.
+   *     The parquet column name is the physical name, so Velox needs the physical name on the
+   *     schema side. A `ProjectExecTransformer` is added below to alias these back to logical names
+   *     for downstream Spark operators.
+   *   - `partitionSchema`, `partitionFilters`, `dataFilters`, partition fields of `requiredSchema`
+   *     ==> LOGICAL. These are consumed by Delta's `PreparedDeltaFileIndex.matchingFiles` and
+   *     `Snapshot.filesForScan`, which resolve filters and partition values against
+   *     `metadata.partitionSchema` and the column-stats schema -- both LOGICAL. Rewriting any of
+   *     these to physical names was the cause of issue #10511 (partition pruning silently no-op'd)
+   *     and would also disable file-level stats skipping.
+   *   - `DeltaScanTransformer.scanFilters` (override) ==> PHYSICAL, translated from `dataFilters`
+   *     by exprId match against `output`. Substrait binds filters by exprId rather than name, so it
+   *     would be tempting to pass logical-named filters straight through; but
+   *     `BasicScanExecTransformer.filterExprs()` does a name-and-exprId equality check
+   *     (`scanFilters.partition(pushDownFilters.contains(_))`) against the physical-named
+   *     `pushDownFilters` from the upstream `Filter`. The override ensures both sides match.
+   *
+   * Future cleanup (out of scope for this fix): the cleaner shape is to mirror vanilla Spark
+   * exactly -- keep EVERYTHING on the scan node logical, and do physical translation only at
+   * substrait emission time (e.g. inside the `NamedStruct`/`ReadRel` build in
+   * `BasicScanExecTransformer.doTransform`). That removes the alias-back project below and the
+   * `scanFilters` override, but it requires plumbing Delta-specific physical-name lookup into the
+   * substrait emitter and is a multi-module refactor.
    */
   private def transformColumnMappingPlan(plan: SparkPlan): SparkPlan = plan match {
     case plan: DeltaScanTransformer =>