[VL] Reduce Velox scan SQL metrics by default to mitigate driver OOM by lifulong · Pull Request #12127 · apache/gluten

lifulong · 2026-05-22T07:41:13Z

What changes are proposed in this pull request?

Gluten jobs on the Velox backend are more prone to driver memory pressure than vanilla Spark in some production workloads. Investigation points to scan operators registering too many SQL metrics (accumulators).

Each BatchScanExecTransformer / FileSourceScanExecTransformer / HiveTableScanExecTransformer previously registered 30+ executor-side metrics per scan node.

Vanilla Spark is much leaner—for example, BatchScanExec only exposes numOutputRows (+ connector customMetrics), and FileSourceScanExec adds a small set of driver metrics (numFiles, metadataTime, etc.).

This gap increases driver heap usage and can contribute to driver OOM, especially on scan-heavy queries.

(Driver heap dump analysis at the time of the OOM: the largest memory-consuming object is LiveStageMetrics.)

(Gluten failed in the first scan stage, while vanilla Spark finished successfully with the same 12g of driver memory.)
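
To give a feel for why the per-scan metric count matters, here is a rough back-of-the-envelope sketch. It assumes the driver keeps about one 8-byte value per metric per task for a running stage and ignores all object overhead; both the model and the task count are illustrative assumptions, not measurements from this PR.

```python
# Rough estimate of driver-side metric bookkeeping for one scan stage.
# ASSUMPTION: one 8-byte value per (metric, task); JVM object overhead is ignored.
BYTES_PER_VALUE = 8


def estimated_metric_bytes(metrics_per_scan: int, scan_nodes: int, tasks: int) -> int:
    """Return the estimated bytes held for metric values under the assumption above."""
    return metrics_per_scan * scan_nodes * tasks * BYTES_PER_VALUE


# Hypothetical stage: one scan node, 100,000 tasks.
full = estimated_metric_bytes(metrics_per_scan=30, scan_nodes=1, tasks=100_000)
minimal = estimated_metric_bytes(metrics_per_scan=9, scan_nodes=1, tasks=100_000)
print(full, minimal)
```

Under this simplified model the footprint scales linearly with the number of registered metrics, so going from 30 to 9 metrics per scan node cuts it to 30% of the original.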

Introduce a Velox-only minimal scan metrics set by default, with an opt-in switch for full metrics collection (debugging / advanced troubleshooting): spark.gluten.sql.scan.detailedMetrics.enabled
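
As a small illustration of how the switch might be passed to a job, the sketch below builds the `--conf` argument for `spark-submit`. Only the config key comes from this PR; the helper name `scan_metrics_conf_args` is hypothetical. The default value changed during review, so the sketch always sets the flag explicitly rather than relying on it.

```python
from typing import List

# Config key taken from this PR; everything else is illustrative.
DETAILED_SCAN_METRICS_KEY = "spark.gluten.sql.scan.detailedMetrics.enabled"


def scan_metrics_conf_args(detailed: bool) -> List[str]:
    """Return spark-submit arguments that set the scan metrics flag explicitly."""
    value = "true" if detailed else "false"
    return ["--conf", f"{DETAILED_SCAN_METRICS_KEY}={value}"]


# Request the reduced metric set, e.g. for a job hitting driver memory pressure.
print(" ".join(scan_metrics_conf_args(detailed=False)))
```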

ClickHouse backend is unchanged—this config does not affect CH scan metrics.

Default minimal metrics (Velox)
BatchScan (9 executor metrics):
rawInputRows, rawInputBytes, numOutputRows, outputBytes, scanTime, wallNanos, peakMemoryBytes, ioWaitTime, storageReadBytes

FileSourceScan / HiveTableScan — above plus Spark-aligned driver metrics:
numFiles, metadataTime, filesSize, numPartitions, pruningTime

Moved to full collection only (when detailed metrics enabled)
Examples include: numInputRows, inputVectors, inputBytes, outputVectors, cpuCount, numMemoryAllocations, skippedSplits, processedSplits, numDynamicFiltersAccepted, loadLazyVectorTime, skippedStrides, processedStrides, connector timing (preloadSplits, pageLoadTime, dataSourceAddSplitTime, dataSourceReadTime), storage cache details (storageReads, localReadBytes, ramReadBytes), etc.
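
The split described above can be sketched as a simple selection function. This is a minimal illustration, not the actual Gluten implementation: the metric names are copied from the lists above, the detailed-only list covers just the examples named there, and `scan_metric_names` is a hypothetical helper.

```python
from typing import List

# Metric names copied from the PR description.
MINIMAL_EXECUTOR_METRICS = (
    "rawInputRows", "rawInputBytes", "numOutputRows", "outputBytes", "scanTime",
    "wallNanos", "peakMemoryBytes", "ioWaitTime", "storageReadBytes",
)
DRIVER_METRICS = ("numFiles", "metadataTime", "filesSize", "numPartitions", "pruningTime")
# Only the examples named in the description; the real detailed set is larger.
DETAILED_ONLY_EXAMPLES = (
    "numInputRows", "inputVectors", "inputBytes", "outputVectors", "cpuCount",
    "numMemoryAllocations", "skippedSplits", "processedSplits",
    "numDynamicFiltersAccepted", "loadLazyVectorTime", "skippedStrides",
    "processedStrides", "preloadSplits", "pageLoadTime", "dataSourceAddSplitTime",
    "dataSourceReadTime", "storageReads", "localReadBytes", "ramReadBytes",
)


def scan_metric_names(scan_kind: str, detailed: bool) -> List[str]:
    """Return the metric names a scan node would register in this sketch."""
    if scan_kind not in ("BatchScan", "FileSourceScan", "HiveTableScan"):
        raise ValueError(f"unknown scan kind: {scan_kind}")
    names = list(MINIMAL_EXECUTOR_METRICS)
    if scan_kind in ("FileSourceScan", "HiveTableScan"):
        names.extend(DRIVER_METRICS)  # Spark-aligned driver metrics
    if detailed:
        names.extend(DETAILED_ONLY_EXAMPLES)
    return names


print(len(scan_metric_names("BatchScan", detailed=False)))
print(len(scan_metric_names("FileSourceScan", detailed=False)))
```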

How was this patch tested?

WIP: being tested in our production environment.

Was this patch authored or co-authored using generative AI tooling?

Co-authored using Cursor.

github-actions · 2026-05-22T07:41:42Z

Run Gluten Clickhouse CI on x86

lifulong · 2026-05-29T08:44:27Z

Roughly 0.5% to 1% of our top 2000 resource-intensive jobs are affected by this issue.

rui-mo

Thanks!

rui-mo · 2026-06-04T01:26:31Z

@FelixYBW, can we go ahead and merge this change? The detailed scan metrics will need to be enabled via a configuration setting afterward.

FelixYBW · 2026-06-04T05:19:12Z

@lifulong Does it only impact driver memory, or executor memory as well? We usually use driver nodes with more memory and cores, so driver OOM rarely happens.

Can we set the config to true by default and disable it in your case?

lifulong · 2026-06-04T14:26:30Z

> @lifulong Does it only impact driver memory, or executor memory as well? We usually use driver nodes with more memory and cores, so driver OOM rarely happens.
>
> Can we set the config to true by default and disable it in your case?

@FelixYBW Thanks for the review. I discovered this issue when troubleshooting driver OOM issues. Theoretically it has little impact on executor memory usage, but I haven’t analyzed actual executor memory usage data.
I’ll adjust the default value of spark.gluten.sql.scan.detailedMetrics.enabled: keep all metrics collected by default, and allow reducing collected metrics via this configuration toggle.

github-actions Bot added CORE works for Gluten Core VELOX labels May 22, 2026

lifulong marked this pull request as draft May 22, 2026 07:41

lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 86f7772 to 09c0f07 Compare May 22, 2026 07:52

lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch 2 times, most recently from a8e8cab to 67c52c7 Compare May 22, 2026 10:04

github-actions Bot added the DOCS label May 22, 2026

lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 67c52c7 to 6bbb6e8 Compare May 22, 2026 10:16

lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch 2 times, most recently from 4bb6c9a to c621483 Compare May 22, 2026 11:19

lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from c621483 to 329b93a Compare May 25, 2026 02:07

lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 329b93a to 636c627 Compare May 25, 2026 02:23

Gluten driver oom while spark ok with same driver memory

295b307

lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 636c627 to 295b307 Compare May 25, 2026 05:24

lifulong marked this pull request as ready for review May 25, 2026 06:06

zhouyuan requested a review from rui-mo May 29, 2026 06:17

rui-mo approved these changes Jun 3, 2026

lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from e677597 to 204c9d7 Compare June 4, 2026 14:25

Change default spark.gluten.sql.scan.detailedMetrics.enabled conf to …

a74de87

…true

lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 204c9d7 to a74de87 Compare June 4, 2026 14:29

FelixYBW merged commit aa82d4a into apache:main Jun 5, 2026
62 checks passed
