
[BugFix] Abort BE vacuum tasks once the FE caller's timeout elapses #74694

Open
starrocks-xupeng wants to merge 2 commits into StarRocks:main from starrocks-xupeng:vacuum_task_timeout

@starrocks-xupeng starrocks-xupeng commented Jun 11, 2026

Contributor

Why I'm doing:

The FE gives up waiting for a vacuum RPC after its brpc timeout (LakeService.TIMEOUT_VACUUM, 1 hour), marks the partition vacuum as failed and re-dispatches it shortly after. But the BE-side vacuum task is never cancelled: it keeps running as a zombie, occupying one of the few workers of the RELEASE_SNAPSHOT thread pool (5 by default) for hours, while nobody reads its response. On clusters with partitions that accumulated a huge number of versions, zombie tasks can exhaust the whole pool: newly dispatched vacuum requests pile up in the queue, vacuum throughput collapses cluster-wide, and the version backlog keeps growing.

What I'm doing:

  • Add optional int64 timeout_ms = 11 to VacuumRequest (gensrc/proto/lake_service.proto): the maximum duration the FE caller waits for the request.
  • FE: AutovacuumDaemon#vacuumPartitionImpl fills timeoutMs with LakeService.TIMEOUT_VACUUM, the brpc timeout of the vacuum RPC, so the BE deadline matches exactly how long the FE actually waits.
  • BE: LakeServiceImpl::vacuum anchors an absolute deadline (butil::gettimeofday_ms() + timeout_ms) at the time the request is received and passes it to lake::vacuum (new deadline_ms parameter, default 0 = no deadline, all other callers unchanged).
  • BE: vacuum_impl checks the deadline once at entry, so a task that already exceeded the deadline while waiting in the thread pool queue aborts without doing any work. collect_files_to_vacuum checks it on each iteration of the version-chain walk (the dominant cost for high-version-count partitions) and aborts with Status::TimedOut, freeing the worker. Aborting between walk iterations leaves the metadata chain untouched, so the next vacuum round resumes from the same state.
  • BE: new mutable config lake_vacuum_enable_task_timeout (default true) gates the deadline: when set to false the BE ignores timeout_ms and vacuum tasks always run to completion.
  • Requests without timeout_ms (older FE versions) carry no deadline and run to completion, exactly as before.
  • UT:
    • test_vacuum_deadline_expired_mid_walk: the deadline expires mid-walk via a mocked clock on the new vacuum:check_deadline sync point; nothing is deleted, and a follow-up run without a deadline converges normally.
    • test_vacuum_task_deadline_exceeded: the handler threads timeout_ms into the task and returns TIMEOUT; a request without the field is unaffected, and so is any request when lake_vacuum_enable_task_timeout is off.
    • testVacuumRequestCarriesTimeout: the FE fills the field.
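
For readers who want the shape of the change without opening the diff, the deadline handling described in the list above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: `WalkStatus`, `WalkResult`, `anchor_deadline_ms`, `deadline_exceeded` and `walk_version_chain` are hypothetical names, the status enum stands in for the real `Status` class, and the injected clock plays the role that the mocked clock plays in the unit test. Whether the real check uses `>` or `>=` against the deadline is an assumption here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical stand-in for a status code; this is not the StarRocks Status class.
enum class WalkStatus { kOk, kTimedOut };

struct WalkResult {
    WalkStatus status;
    std::vector<int64_t> visited;  // versions inspected before the walk returned
};

// Turn the caller's relative timeout into an absolute deadline, anchored at
// the time the request was received. Returns 0 ("no deadline") when the field
// is absent (older caller) or when the feature gate is off.
int64_t anchor_deadline_ms(std::optional<int64_t> timeout_ms, int64_t received_at_ms,
                           bool enable_task_timeout) {
    if (!enable_task_timeout || !timeout_ms.has_value()) {
        return 0;
    }
    return received_at_ms + *timeout_ms;
}

// Assumption of this sketch: the deadline counts as exceeded only strictly after it.
bool deadline_exceeded(int64_t deadline_ms, int64_t now_ms) {
    return deadline_ms > 0 && now_ms > deadline_ms;
}

// Walk versions from newest to oldest, checking the deadline before each step.
// The clock is injected so that a test can control time.
WalkResult walk_version_chain(int64_t newest, int64_t oldest, int64_t deadline_ms,
                              const std::function<int64_t()>& now_ms) {
    assert(now_ms && "a callable clock is required");
    WalkResult result{WalkStatus::kOk, {}};
    for (int64_t version = newest; version >= oldest; --version) {
        if (deadline_exceeded(deadline_ms, now_ms())) {
            result.status = WalkStatus::kTimedOut;  // abort between iterations
            return result;
        }
        result.visited.push_back(version);
    }
    return result;
}
```

Because the check sits at the top of each iteration, an abort never interrupts a step halfway, which mirrors the claim that the metadata chain is left untouched and a later round can start from the same state.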

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5

@CelerData-Reviewer

@codex review

@github-actions github-actions Bot requested review from meegoo and xiangguangyxg June 11, 2026 08:31

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9620972e97

Comment thread be/src/storage/lake/vacuum.cpp
The FE gives up waiting for a vacuum RPC after its brpc timeout (1 hour),
but the BE task keeps running as a zombie: it occupies one of the few
RELEASE_SNAPSHOT workers and races with re-dispatched vacuums of the same
partition for hours, while nobody reads its response.

Carry the FE timeout in VacuumRequest.timeout_ms. The BE vacuum handler
anchors an absolute deadline when the request is received and threads it
through the vacuum execution; the version-chain walk checks the deadline
on each iteration and aborts with Status::TimedOut once it passes. The
check at vacuum entry also kills tasks that already exceeded the deadline
while waiting in the thread pool queue. Requests from older FEs without
the field carry no deadline and run to completion as before.
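
The entry check for tasks that expired while queued can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions: `QueuedTask`, `Outcome` and `drain_queue` are hypothetical names, a single simulated worker replaces the real thread pool, time advances only by each task's simulated cost, and a `timeout_ms` of 0 is used here to model a request that carries no deadline.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

struct QueuedTask {
    int64_t received_at_ms;  // when the handler accepted the request
    int64_t timeout_ms;      // caller's wait budget; 0 models "no deadline" in this sketch
    int64_t work_ms;         // simulated cost of running the task to completion
};

enum class Outcome { kCompleted, kAbortedAtEntry };

// Drain a FIFO queue with one simulated worker. Only the check at task entry
// is modeled; the per-iteration check inside the walk is deliberately left out.
std::vector<Outcome> drain_queue(std::deque<QueuedTask> queue, int64_t start_ms) {
    std::vector<Outcome> outcomes;
    int64_t now_ms = start_ms;
    while (!queue.empty()) {
        const QueuedTask task = queue.front();
        queue.pop_front();
        assert(task.work_ms >= 0 && "simulated work cannot be negative");
        const int64_t deadline_ms =
                task.timeout_ms > 0 ? task.received_at_ms + task.timeout_ms : 0;
        if (deadline_ms > 0 && now_ms > deadline_ms) {
            // The caller has already stopped waiting: skip the work entirely
            // so the worker is free for the next queued task.
            outcomes.push_back(Outcome::kAbortedAtEntry);
            continue;
        }
        now_ms += task.work_ms;
        outcomes.push_back(Outcome::kCompleted);
    }
    return outcomes;
}
```

Because the deadline is anchored at receipt rather than at the start of execution, time spent waiting in the queue counts against the caller's budget. A task that starts just before its deadline still runs past it in this simulation, which is the gap the per-iteration check in the walk is meant to close.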
@github-actions

Contributor

No new undocumented parameters detected by the param-drift check.

@github-actions

Contributor

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

Contributor

[BE Incremental Coverage Report]

pass : 20 / 20 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 be/src/service/service_be/lake_service.cpp | 4 | 4 | 100.00% | [] |
| 🔵 be/src/storage/lake/vacuum.cpp | 16 | 16 | 100.00% | [] |

@github-actions

Contributor

[FE Incremental Coverage Report]

pass : 1 / 1 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 com/starrocks/lake/vacuum/AutovacuumDaemon.java | 1 | 1 | 100.00% | [] |

3 participants