[VL] Reduce Velox scan SQL metrics by default to mitigate driver OOM#12127

Merged
FelixYBW merged 2 commits into
apache:mainfrom
lifulong:gluten_driver_oom_while_spark_ok_use_same_driver_memory
Jun 5, 2026
Conversation

@lifulong

@lifulong lifulong commented May 22, 2026

Contributor

What changes are proposed in this pull request?

Gluten jobs on the Velox backend are more prone to driver memory pressure than vanilla Spark in some production workloads. Investigation points to scan operators registering too many SQL metrics (accumulators).

Each BatchScanExecTransformer / FileSourceScanExecTransformer / HiveTableScanExecTransformer previously registered 30+ executor-side metrics per scan node.

Vanilla Spark is much leaner. For example, BatchScanExec exposes only numOutputRows (plus connector customMetrics), and FileSourceScanExec adds a small set of driver metrics (numFiles, metadataTime, etc.).

This gap increases driver heap usage and can contribute to driver OOM, especially on scan-heavy queries.
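
To make the scaling concrete, here is a rough back-of-envelope sketch. It assumes, as an upper bound, that every task reports one value for every registered scan metric and that the driver keeps those values while the stage is live. The task count and scan-node count are hypothetical; only the "30+" and 9 metric counts come from this description.

```python
# Rough count of per-task metric values the driver may have to track for one
# scan stage. Inputs marked "hypothetical" are illustrative, not measured.

def tracked_metric_values(scan_nodes: int, metrics_per_scan: int, tasks: int) -> int:
    """Upper bound on (task, metric) values reported to the driver for a stage."""
    return scan_nodes * metrics_per_scan * tasks

tasks = 200_000   # hypothetical task count for a large scan stage
scan_nodes = 4    # hypothetical number of scan nodes in that stage

full = tracked_metric_values(scan_nodes, 30, tasks)    # "30+" metrics per scan node
minimal = tracked_metric_values(scan_nodes, 9, tasks)  # 9 executor metrics (minimal set)

print(f"full: {full:,}  minimal: {minimal:,}  ratio: {full / minimal:.1f}x")
```

Under these assumptions the count grows linearly with the number of metrics per scan node, so trimming the set from 30 to 9 cuts the tracked values by roughly a factor of three.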

(Driver heap dump analysis at the time of the OOM: the largest memory-consuming object is LiveStageMetrics.)
[Image: WeCom screenshot 7f05f208-9f83-472b-b638-0aa70650abfc]

(Gluten failed in the first scan stage, while vanilla Spark finished successfully with the same 12g of driver memory.)
[Image: WeCom screenshot 0f06b928-eff5-4ba8-a1ae-6f87aca571be]

Introduce a Velox-only minimal scan metrics set by default, with an opt-in switch for full metrics collection (for debugging and advanced troubleshooting):
spark.gluten.sql.scan.detailedMetrics.enabled

The ClickHouse backend is unchanged: this config does not affect CH scan metrics.

Default minimal metrics (Velox)
BatchScan (9 executor metrics):
rawInputRows, rawInputBytes, numOutputRows, outputBytes, scanTime, wallNanos, peakMemoryBytes, ioWaitTime, storageReadBytes

FileSourceScan / HiveTableScan (the above, plus Spark-aligned driver metrics):
numFiles, metadataTime, filesSize, numPartitions, pruningTime

Moved to full collection only (when detailed metrics enabled)
Examples include: numInputRows, inputVectors, inputBytes, outputVectors, cpuCount, numMemoryAllocations, skippedSplits, processedSplits, numDynamicFiltersAccepted, loadLazyVectorTime, skippedStrides, processedStrides, connector timing (preloadSplits, pageLoadTime, dataSourceAddSplitTime, dataSourceReadTime), storage cache details (storageReads, localReadBytes, ramReadBytes), etc.
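
The split above can be sketched as a simple selection function. This is an illustrative model only, not the code in this patch: the metric names are taken from the lists above, while `scan_metric_names` and the constant names are hypothetical, and the detailed list contains only the examples named here rather than the full set.

```python
from typing import List

# Metric names copied from the PR description; everything else is hypothetical.
MINIMAL_EXECUTOR_METRICS = [
    "rawInputRows", "rawInputBytes", "numOutputRows", "outputBytes", "scanTime",
    "wallNanos", "peakMemoryBytes", "ioWaitTime", "storageReadBytes",
]
DRIVER_METRICS = ["numFiles", "metadataTime", "filesSize", "numPartitions", "pruningTime"]
DETAILED_ONLY_EXAMPLES = [
    "numInputRows", "inputVectors", "inputBytes", "outputVectors", "cpuCount",
    "numMemoryAllocations", "skippedSplits", "processedSplits",
    "numDynamicFiltersAccepted", "loadLazyVectorTime", "skippedStrides",
    "processedStrides", "preloadSplits", "pageLoadTime", "dataSourceAddSplitTime",
    "dataSourceReadTime", "storageReads", "localReadBytes", "ramReadBytes",
]

def scan_metric_names(scan_kind: str, detailed_enabled: bool) -> List[str]:
    """Return the metric names a scan node would register in this model."""
    if scan_kind not in ("BatchScan", "FileSourceScan", "HiveTableScan"):
        raise ValueError(f"unknown scan kind: {scan_kind}")
    names = list(MINIMAL_EXECUTOR_METRICS)
    if scan_kind != "BatchScan":
        # File-based scans also carry the Spark-aligned driver metrics.
        names += DRIVER_METRICS
    if detailed_enabled:
        names += DETAILED_ONLY_EXAMPLES
    return names

print(len(scan_metric_names("BatchScan", detailed_enabled=False)))
print(len(scan_metric_names("FileSourceScan", detailed_enabled=False)))
```

In this model a BatchScan registers 9 names and a FileSourceScan or HiveTableScan registers 14 when detailed metrics are off.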

How was this patch tested?

WIP in our production environment.

Was this patch authored or co-authored using generative AI tooling?

Co-authored using Cursor.

@github-actions github-actions Bot added CORE works for Gluten Core VELOX labels May 22, 2026
@lifulong lifulong marked this pull request as draft May 22, 2026 07:41
@github-actions

Run Gluten Clickhouse CI on x86

@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 86f7772 to 09c0f07 Compare May 22, 2026 07:52
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch 2 times, most recently from a8e8cab to 67c52c7 Compare May 22, 2026 10:04
@github-actions github-actions Bot added the DOCS label May 22, 2026
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 67c52c7 to 6bbb6e8 Compare May 22, 2026 10:16
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch 2 times, most recently from 4bb6c9a to c621483 Compare May 22, 2026 11:19
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from c621483 to 329b93a Compare May 25, 2026 02:07
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 329b93a to 636c627 Compare May 25, 2026 02:23
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 636c627 to 295b307 Compare May 25, 2026 05:24
@lifulong lifulong marked this pull request as ready for review May 25, 2026 06:06
@zhouyuan zhouyuan requested a review from rui-mo May 29, 2026 06:17
@lifulong

Contributor Author

Roughly 0.5% to 1% of our top 2000 resource-intensive jobs are affected by this issue.

@rui-mo rui-mo left a comment

Contributor

Thanks!

@rui-mo

rui-mo commented Jun 4, 2026

Contributor

@FelixYBW, can we go ahead and merge this change? The detailed scan metrics will need to be enabled via a configuration setting afterward.

@FelixYBW

FelixYBW commented Jun 4, 2026

Contributor

@lifulong Does it only impact driver memory, or executor memory as well? We usually use driver nodes with more memory and cores, so driver OOM rarely happens.

Can we set the config to true by default and disable it in your case?

@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from e677597 to 204c9d7 Compare June 4, 2026 14:25

@lifulong

lifulong commented Jun 4, 2026

Contributor Author

> @lifulong Does it only impact driver memory, or executor memory as well? We usually use driver nodes with more memory and cores, so driver OOM rarely happens.
>
> Can we set the config to true by default and disable it in your case?

@FelixYBW Thanks for the review. I discovered this issue when troubleshooting driver OOM issues. Theoretically it has little impact on executor memory usage, but I haven’t analyzed actual executor memory usage data.
I’ll adjust the default value of spark.gluten.sql.scan.detailedMetrics.enabled: keep all metrics collected by default, and allow reducing collected metrics via this configuration toggle.
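
For jobs that hit driver OOM, opting out of the detailed metrics might look like the sketch below. It assumes that setting the toggle to `false` selects the reduced metric set, as described in this comment. The helper `spark_submit_conf_args` is hypothetical and only builds standard `spark-submit --conf key=value` arguments.

```python
from typing import Dict, List

# Config key from this PR; the value semantics are assumed from the discussion:
# "true" keeps all scan metrics, "false" reduces them.
DETAILED_SCAN_METRICS_KEY = "spark.gluten.sql.scan.detailedMetrics.enabled"

def spark_submit_conf_args(conf: Dict[str, str]) -> List[str]:
    """Turn a config mapping into spark-submit '--conf key=value' arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

args = spark_submit_conf_args({DETAILED_SCAN_METRICS_KEY: "false"})
print(" ".join(args))
```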

@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 204c9d7 to a74de87 Compare June 4, 2026 14:29

@FelixYBW FelixYBW merged commit aa82d4a into apache:main Jun 5, 2026
62 checks passed
Labels

CORE works for Gluten Core DOCS VELOX

4 participants