
Commit 62bc6c9

[CORE] Unbundle Arrow memory + vector from gluten-velox-bundle
Mark arrow-memory-{core,unsafe,netty} and arrow-vector as scope=provided in gluten-arrow/pom.xml. They are bundled in Spark's distribution ($SPARK_HOME/jars/ for Spark 3.x; declared in Spark 4.x's pom), so the user's classpath already has them at runtime; gluten does not need to ship its own copy.

Effects:

* The gluten-velox bundle no longer ships ANY org.apache.arrow.memory.* or org.apache.arrow.vector.* classes. The class-shadowing problem from #12225 goes away by construction: there is no gluten-shipped copy left to shadow the user's vanilla Arrow.
* The org.apache.arrow shade-relocation block in package/pom.xml becomes redundant and is removed: arrow-memory/vector are no longer in the bundle to relocate, and arrow-c-data / arrow-dataset (still bundled) were already excluded from relocation because their JNI binds to the original class names.
* arrow-c-data and arrow-dataset remain at scope=compile in gluten-arrow. Spark does NOT ship those, so gluten still bundles them. With the relocation block gone, their public method signatures naturally bind to the user's vanilla org.apache.arrow.memory.BufferAllocator / arrow-vector types, exactly matching what every other Arrow C-Data caller on the classpath expects.

Compile-classpath touch-ups:

* backends-velox/pom.xml: re-declare arrow-memory-core and arrow-vector at scope=provided. The transitive route through gluten-arrow no longer carries them after the scope flip, so backends-velox needs its own provided declaration to compile.
* gluten-ut/* and backends-clickhouse already declare arrow at provided scope locally, so they are unaffected.

Caveats:

* Spark 3.5 and earlier do NOT declare arrow-memory/arrow-vector in their Maven POM (they ship them inside the binary distribution only). gluten builds against the version pinned in `arrow.version`. Maintainers should keep `arrow.version` aligned with the lowest-common-denominator Arrow version across supported Spark distros (DBR 16.4 ships Arrow 12.0.1 with Spark 3.5; vanilla Spark 3.5.x ships 15.0.0). The 15.0.0 default here is fine for vanilla Spark 3.5 but may need a compat profile for DBR/Cloudera flavors.
* dev/check-arrow-c-shading.sh added in #12226 still passes: the bundle still contains org/apache/arrow/c/* classes whose method signatures now reference unshaded org.apache.arrow.memory.* / org.apache.arrow.vector.* types (which are no longer in the bundle, but resolve at runtime from Spark's Arrow).

Builds on #12244 (drop the 15.0.0-gluten Arrow version rename).

Addresses the follow-up direction from #12226 discussion: "remove Arrow from the bundled Gluten Jar and let users rely on Spark's bundled Arrow".
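The caveat about a compat profile for DBR/Cloudera flavors might be addressed along the following lines. This is only a sketch: the profile id `arrow-12-compat` is hypothetical and is not part of this commit or the repository. Whether gluten compiles cleanly against Arrow 12.0.1 is an open assumption. The only values taken from the commit message are the `arrow.version` property name and the 12.0.1 version reported for DBR 16.4.

```xml
<!-- Hypothetical sketch: the profile id is illustrative and does not exist in the repository. -->
<profiles>
  <profile>
    <id>arrow-12-compat</id>
    <properties>
      <!-- DBR 16.4 ships Arrow 12.0.1 per the commit message; compatibility with gluten is untested here. -->
      <arrow.version>12.0.1</arrow.version>
    </properties>
  </profile>
</profiles>
```

If such a profile existed in the parent POM, it would be activated with `-Parrow-12-compat` on the Maven command line. The provided-scope Arrow dependencies would then resolve against 12.0.1 at compile time instead of the 15.0.0 default.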
1 parent 62da6bf commit 62bc6c9

3 files changed

Lines changed: 37 additions & 24 deletions


backends-velox/pom.xml

Lines changed: 18 additions & 0 deletions
@@ -79,6 +79,24 @@
       <version>${project.version}</version>
       <scope>compile</scope>
     </dependency>
+    <!--
+      arrow-memory-core / arrow-vector are scope=provided in gluten-arrow so
+      they are not bundled. Re-declare them at provided scope here so the
+      compile classpath still resolves them. At runtime they come from
+      Spark's distribution.
+    -->
+    <dependency>
+      <groupId>org.apache.arrow</groupId>
+      <artifactId>arrow-memory-core</artifactId>
+      <version>${arrow.version}</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.arrow</groupId>
+      <artifactId>arrow-vector</artifactId>
+      <version>${arrow.version}</version>
+      <scope>provided</scope>
+    </dependency>
     <dependency>
       <groupId>com.github.ben-manes.caffeine</groupId>
       <artifactId>caffeine</artifactId>

gluten-arrow/pom.xml

Lines changed: 10 additions & 2 deletions
@@ -85,17 +85,24 @@
       <version>${spark.version}</version>
       <scope>provided</scope>
     </dependency>
+    <!--
+      Arrow memory + vector come from Spark's distribution (declared in Spark
+      4.x's pom; bundled in $SPARK_HOME/jars/ for Spark 3.x). gluten compiles
+      against the same arrow.version it expects at runtime, but the artifacts
+      are scope=provided so they are NOT bundled into gluten-velox-bundle —
+      avoiding the #12225 class-shadowing problem entirely.
+    -->
     <dependency>
       <groupId>org.apache.arrow</groupId>
       <artifactId>${arrow-memory.artifact}</artifactId>
       <version>${arrow.version}</version>
-      <scope>runtime</scope>
+      <scope>provided</scope>
     </dependency>
     <dependency>
       <groupId>org.apache.arrow</groupId>
       <artifactId>arrow-memory-core</artifactId>
       <version>${arrow.version}</version>
-      <scope>compile</scope>
+      <scope>provided</scope>
       <exclusions>
         <exclusion>
           <groupId>io.netty</groupId>
@@ -111,6 +118,7 @@
       <groupId>org.apache.arrow</groupId>
       <artifactId>arrow-vector</artifactId>
       <version>${arrow.version}</version>
+      <scope>provided</scope>
       <exclusions>
         <exclusion>
           <groupId>io.netty</groupId>

package/pom.xml

Lines changed: 9 additions & 22 deletions
@@ -118,28 +118,15 @@
                     <include>com.google.gson.**</include>
                   </includes>
                 </relocation>
-                <relocation>
-                  <pattern>org.apache.arrow</pattern>
-                  <shadedPattern>${gluten.shade.packageName}.org.apache.arrow</shadedPattern>
-                  <!--
-                    arrow's C and dataset wrappers refer to the original class
-                    path, so they must not be relocated. Their public APIs also
-                    take and return org.apache.arrow.memory.* and
-                    org.apache.arrow.vector.* types, so those packages must also
-                    stay unshaded — otherwise the bundled (unshaded)
-                    ArrowArrayStream/ArrowSchema get compiled against the
-                    relocated BufferAllocator/VectorSchemaRoot, producing
-                    `NoSuchMethodError` for any caller passing a vanilla
-                    Apache Arrow allocator. See #12225.
-                  -->
-                  <excludes>
-                    <exclude>org.apache.arrow.c.*</exclude>
-                    <exclude>org.apache.arrow.c.jni.*</exclude>
-                    <exclude>org.apache.arrow.memory.**</exclude>
-                    <exclude>org.apache.arrow.vector.**</exclude>
-                    <exclude>org.apache.arrow.dataset.**</exclude>
-                  </excludes>
-                </relocation>
+                <!--
+                  org.apache.arrow.memory.* and org.apache.arrow.vector.* are
+                  now scope=provided in gluten-arrow/pom.xml — they come from
+                  Spark's distribution at runtime, so there is nothing to
+                  relocate. arrow-c-data and arrow-dataset are still bundled
+                  but never relocated (their JNI binds to the original class
+                  names), so no shade-relocation entry is needed for Arrow.
+                  See #12225 / #12226 for the historical context.
+                -->
                 <relocation>
                   <pattern>com.google.flatbuffers</pattern>
                   <shadedPattern>${gluten.shade.packageName}.com.google.flatbuffers</shadedPattern>
