@@ -164,16 +164,43 @@ object DeltaPostTransformRules {
164164 }
165165
166166 /**
167- * This method is only used for Delta ColumnMapping FileFormat(e.g. nameMapping and idMapping)
168- * transform the metadata of Delta into Parquet's, each plan should only be transformed once.
167+ * Used for Delta ColumnMapping FileFormat (nameMapping and idMapping). Each plan is transformed
168+ * at most once; the first run is tagged so re-runs are no-ops.
169169 *
170- * Partition and data filters on the scan node stay LOGICAL so that Delta's
171- * `PreparedDeltaFileIndex` can do partition pruning and file-level data skipping (its partition
172- * schema and column-stats schema both use logical names). Reader-facing pieces (`output`,
173- * `dataSchema`, and the data fields of `requiredSchema`) become physical so the parquet reader
174- * and Velox find the right columns in the file. Filter binding to the native side is by exprId,
175- * not by name, so logical-named filter attributes still resolve correctly against the
176- * physical-named `output`.
170+ * Background: with column mapping, Delta files are written with PHYSICAL column names while
171+ * Delta's metadata (partition schema, column stats) keeps LOGICAL names. Vanilla Spark + Delta
172+ * resolves this asymmetry inside `DeltaParquetFileFormat.buildReaderWithPartitionValues`:
173+ * everything on the scan node stays logical, and physical translation happens just-in-time when
174+ * handing data and filters to the parquet reader. Gluten bypasses that hook (it goes to native
175+ * via Substrait), so the translation has to live somewhere on our side.
176+ *
177+ * What this rule produces -- the parts that diverge from vanilla Spark are commented at each
178+ * site. The split-by-consumer is asymmetric on purpose:
179+ *
180+ * - `output`, `dataSchema`, and the data fields of `requiredSchema` ==> PHYSICAL. These flow
181+ * into the substrait `NamedStruct` that Velox uses to look up columns in the parquet file.
182+ * The parquet column name is the physical name, so Velox needs the physical name on the
183+ * schema side. A `ProjectExecTransformer` is added below to alias these back to logical names
184+ * for downstream Spark operators.
185+ * - `partitionSchema`, `partitionFilters`, `dataFilters`, partition fields of `requiredSchema`
186+ * ==> LOGICAL. These are consumed by Delta's `PreparedDeltaFileIndex.matchingFiles` and
187+ * `Snapshot.filesForScan`, which resolve filters and partition values against
188+ * `metadata.partitionSchema` and the column-stats schema -- both LOGICAL. Rewriting any of
189+ * these to physical names was the cause of issue #10511 (partition pruning silently no-op'd)
190+ * and would also disable file-level stats skipping.
191+ * - `DeltaScanTransformer.scanFilters` (override) ==> PHYSICAL, translated from `dataFilters`
192+ * by exprId match against `output`. Substrait binds filters by exprId rather than name, so it
193+ * would be tempting to pass logical-named filters straight through; but
194+ * `BasicScanExecTransformer.filterExprs()` does a name-and-exprId equality check
195+ * (`scanFilters.partition(pushDownFilters.contains(_))`) against the physical-named
196+ * `pushDownFilters` from the upstream `Filter`. The override ensures both sides match.
197+ *
198+ * Future cleanup (out of scope for this fix): the cleaner shape is to mirror vanilla Spark
199+ * exactly -- keep EVERYTHING on the scan node logical, and do physical translation only at
200+ * substrait emission time (e.g. inside the `NamedStruct`/`ReadRel` build in
201+ * `BasicScanExecTransformer.doTransform`). That removes the alias-back project below and the
202+ * `scanFilters` override, but it requires plumbing Delta-specific physical-name lookup into the
203+ * substrait emitter and is a multi-module refactor.
177204 */
178205 private def transformColumnMappingPlan(plan: SparkPlan): SparkPlan = plan match {
179206 case plan: DeltaScanTransformer =>