[VL] Reduce Velox scan SQL metrics by default to mitigate driver OOM #12127
Roughly 0.5% to 1% of our top 2000 resource-intensive jobs are affected by this issue.
@FelixYBW, can we go ahead and merge this change? The detailed scan metrics will need to be enabled via a configuration setting afterward.
@lifulong Does it impact only driver memory, or executor memory as well? We usually use driver nodes with more memory and cores, so driver OOM rarely happens. Can we set the config to true by default and disable it in your case?
@FelixYBW Thanks for the review. I discovered this issue while troubleshooting driver OOMs. In theory it has little impact on executor memory usage, but I haven't analyzed actual executor memory usage data.
What changes are proposed in this pull request?
Gluten jobs on the Velox backend are more prone to driver memory pressure than vanilla Spark in some production workloads. Investigation points to scan operators registering too many SQL metrics (accumulators).
Each BatchScanExecTransformer / FileSourceScanExecTransformer / HiveTableScanExecTransformer previously registered 30+ executor-side metrics per scan node.
Vanilla Spark is much leaner: for example, BatchScanExec exposes only numOutputRows (plus connector customMetrics), and FileSourceScanExec adds a small set of driver metrics (numFiles, metadataTime, etc.).
This gap increases driver heap usage and can contribute to driver OOM, especially on scan-heavy queries.
(Driver heap dump analysis at the time of the OOM: the largest memory-consuming object is LiveStageMetrics.)

(Gluten failed in the first scan stage, while vanilla Spark finished successfully with the same 12g of driver memory.)
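
To make the size of the gap concrete, here is a rough back-of-the-envelope sketch. It assumes that the per-task metric values the driver tracks for a stage grow roughly with (scan nodes) x (metrics per scan node) x (tasks). The function name, the scan-node count, and the task count are illustrative assumptions, not measurements from this PR; only the 30 and 9 metric counts come from the description.

```python
def tracked_metric_values(scan_nodes: int, metrics_per_scan: int, tasks: int) -> int:
    """Rough count of per-task metric values held for one stage (assumed model)."""
    return scan_nodes * metrics_per_scan * tasks


# Illustrative workload (assumption): 4 scan nodes, 100,000 tasks.
full = tracked_metric_values(scan_nodes=4, metrics_per_scan=30, tasks=100_000)
minimal = tracked_metric_values(scan_nodes=4, metrics_per_scan=9, tasks=100_000)

print(f"full metrics:    {full:,} values")
print(f"minimal metrics: {minimal:,} values")
print(f"reduction:       {1 - minimal / full:.0%}")
```

Under this simplified model, going from 30 to 9 executor metrics per scan node cuts the tracked values by about 70%, regardless of the assumed node and task counts.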

Introduce a Velox-only minimal scan metrics set by default, with an opt-in switch for full metrics collection (debugging / advanced troubleshooting).
spark.gluten.sql.scan.detailedMetrics.enabled
The ClickHouse backend is unchanged; this config does not affect CH scan metrics.
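
As a usage sketch, the snippet below builds a `spark-submit` command line that opts in to full scan metrics for a single job. The helper name `spark_submit_args` and the application file `my_job.py` are hypothetical. It also assumes the default stays `false`, as this description proposes.

```python
import shlex

# Config key introduced by this PR.
DETAILED_SCAN_METRICS_KEY = "spark.gluten.sql.scan.detailedMetrics.enabled"


def spark_submit_args(app: str, detailed_scan_metrics: bool = False) -> list[str]:
    """Build spark-submit arguments with the scan metrics switch set explicitly."""
    value = "true" if detailed_scan_metrics else "false"
    return ["spark-submit", "--conf", f"{DETAILED_SCAN_METRICS_KEY}={value}", app]


# Opt in to full metrics for a debugging run (my_job.py is a placeholder).
command = shlex.join(spark_submit_args("my_job.py", detailed_scan_metrics=True))
print(command)
```

Setting the switch per job keeps the minimal set as the cluster-wide default while still allowing detailed metrics for troubleshooting.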
Default minimal metrics (Velox)
BatchScan (9 executor metrics):
rawInputRows, rawInputBytes, numOutputRows, outputBytes, scanTime, wallNanos, peakMemoryBytes, ioWaitTime, storageReadBytes
FileSourceScan / HiveTableScan (the above plus Spark-aligned driver metrics):
numFiles, metadataTime, filesSize, numPartitions, pruningTime
Moved to full collection only (when detailed metrics enabled)
Examples include: numInputRows, inputVectors, inputBytes, outputVectors, cpuCount, numMemoryAllocations, skippedSplits, processedSplits, numDynamicFiltersAccepted, loadLazyVectorTime, skippedStrides, processedStrides, connector timing (preloadSplits, pageLoadTime, dataSourceAddSplitTime, dataSourceReadTime), storage cache details (storageReads, localReadBytes, ramReadBytes), etc.
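
The selection logic described above can be sketched as follows. This is a simplified model and not the actual Gluten implementation: the function `scan_metric_names` and the constant names are hypothetical, and the detailed-only tuple contains just a few of the examples listed above rather than the complete set.

```python
# Minimal executor metrics registered by default (from the list above).
MINIMAL_EXECUTOR_METRICS = (
    "rawInputRows", "rawInputBytes", "numOutputRows", "outputBytes", "scanTime",
    "wallNanos", "peakMemoryBytes", "ioWaitTime", "storageReadBytes",
)

# Spark-aligned driver metrics for FileSourceScan / HiveTableScan.
DRIVER_METRICS = ("numFiles", "metadataTime", "filesSize", "numPartitions", "pruningTime")

# A partial sample of the metrics that move behind the detailed switch.
DETAILED_ONLY_EXAMPLES = (
    "numInputRows", "inputVectors", "inputBytes", "outputVectors", "cpuCount",
    "numMemoryAllocations", "skippedSplits", "processedSplits",
)


def scan_metric_names(scan_kind: str, detailed_enabled: bool) -> list[str]:
    """Return the metric names a scan node would register in this simplified model."""
    names = list(MINIMAL_EXECUTOR_METRICS)
    if scan_kind in ("FileSourceScan", "HiveTableScan"):
        names.extend(DRIVER_METRICS)
    if detailed_enabled:
        names.extend(DETAILED_ONLY_EXAMPLES)
    return names


print(len(scan_metric_names("BatchScan", detailed_enabled=False)))       # 9
print(len(scan_metric_names("FileSourceScan", detailed_enabled=False)))  # 14
```

Keeping the minimal set as the base and appending the detailed metrics only when the switch is on means the default path never allocates the extra accumulators.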
How was this patch tested?
WIP: being validated in our production environment.
Was this patch authored or co-authored using generative AI tooling?
Co-authored using Cursor.