Fix stale reference in loadAllPushes() causing stuck push after controller restart by haoxu07 · Pull Request #2701 · linkedin/venice

haoxu07 · 2026-04-07T01:06:31Z

Summary

Fix a bug in AbstractPushMonitor.loadAllPushes() where checkPushStatus() uses a stale loop variable instead of the refreshed object from topicToPushMap, causing batch pushes to get permanently stuck at END_OF_PUSH_RECEIVED after a controller restart.
Add LoadAllPushesStaleReferenceTest to reproduce and verify the fix.

Problem

During the controller STANDBY→LEADER transition, loadAllPushes() does the following:

Line 176: Bulk-loads OfflinePushStatus with partition statuses from ZK (snapshot T1)
Line 190: Registers ZK watchers for partition status changes
Line 195: updateOfflinePush() reads fresh data from ZK (snapshot T2) — all replicas now COMPLETED — and replaces the entry in topicToPushMap
Lines 199-204: checkPushStatus() is called with the stale loop variable (T1), not the refreshed object (T2)
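
The four steps can be replayed without ZooKeeper. What follows is a minimal sketch under stated assumptions, not the Venice code: `PushSnapshot`, `fakeZk`, `replay()`, and the topic name `store_v1` are hypothetical stand-ins for OfflinePushStatus, the ZK store, the body of loadAllPushes(), and a real version topic. It only illustrates that once the map entry is replaced, the loop variable and the map entry are different objects with different partition data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StaleLoopVariableSketch {
  /** Hypothetical stand-in for OfflinePushStatus: an immutable snapshot of replica progress. */
  static final class PushSnapshot {
    final String topic;
    final int completedReplicas;
    final int totalReplicas;

    PushSnapshot(String topic, int completedReplicas, int totalReplicas) {
      this.topic = topic;
      this.completedReplicas = completedReplicas;
      this.totalReplicas = totalReplicas;
    }

    boolean allCompleted() {
      return completedReplicas == totalReplicas;
    }
  }

  /**
   * Replays the four steps with the bug present. Returns {staleView, mapView}: whether the
   * loop variable and the refreshed map entry each report all replicas as completed.
   */
  static boolean[] replay() {
    Map<String, PushSnapshot> fakeZk = new HashMap<>();         // stand-in for ZK
    Map<String, PushSnapshot> topicToPushMap = new HashMap<>(); // in-memory map
    String topic = "store_v1";                                  // made-up topic name
    fakeZk.put(topic, new PushSnapshot(topic, 2, 3));

    // Step 1: bulk load (snapshot T1).
    List<PushSnapshot> loaded = new ArrayList<>(fakeZk.values());
    for (PushSnapshot p : loaded) {
      topicToPushMap.put(p.topic, p);
    }

    // The last replica finishes after T1, before the per-push loop runs.
    fakeZk.put(topic, new PushSnapshot(topic, 3, 3));

    boolean staleView = false;
    boolean mapView = false;
    for (PushSnapshot offlinePushStatus : loaded) {
      // Step 2 (watcher registration) is omitted; it does not change object identity.
      // Step 3: the refresh replaces the map entry with snapshot T2.
      topicToPushMap.put(offlinePushStatus.topic, fakeZk.get(offlinePushStatus.topic));
      // Step 4: the loop variable still points at T1.
      staleView = offlinePushStatus.allCompleted();
      mapView = topicToPushMap.get(offlinePushStatus.topic).allCompleted();
    }
    return new boolean[] { staleView, mapView };
  }

  public static void main(String[] args) {
    boolean[] views = replay();
    System.out.println("loop variable sees all completed: " + views[0]);
    System.out.println("map entry sees all completed: " + views[1]);
  }
}
```

In this model the loop variable reports an incomplete push while the map entry reports a complete one, which is the divergence the stale reference creates.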

If replicas complete between T1 and T2, the stale object has incomplete partition statuses. Since no ZK watcher callbacks fire for already-completed partitions (their final write happened before watcher registration), the push is stuck at END_OF_PUSH_RECEIVED permanently, blocking all future batch pushes for the store.
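
The reason no later callback rescues the push can be shown with a toy change-notification model. This is a sketch under assumptions: `WatchedNode` is a hypothetical class, not a ZooKeeper or Venice API, and it models only one property, namely that a subscriber is notified of writes made after it subscribes and never of writes that preceded it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class LateWatcherSketch {
  /** Hypothetical node that notifies subscribers only about writes made after they subscribe. */
  static final class WatchedNode {
    private String value;
    private final List<Consumer<String>> listeners = new ArrayList<>();

    WatchedNode(String initial) {
      this.value = initial;
    }

    void subscribe(Consumer<String> listener) {
      listeners.add(listener);
    }

    void write(String newValue) {
      value = newValue;
      for (Consumer<String> listener : listeners) {
        listener.accept(newValue);
      }
    }

    String read() {
      return value;
    }
  }

  /** Returns how many callbacks a subscriber receives when it registers after the final write. */
  static int callbacksAfterLateSubscribe() {
    WatchedNode partition = new WatchedNode("STARTED");
    partition.write("COMPLETED");   // final write happens before any subscription
    List<String> seen = new ArrayList<>();
    partition.subscribe(seen::add); // registration comes too late to observe it
    return seen.size();
  }

  public static void main(String[] args) {
    System.out.println("callbacks received: " + callbacksAfterLateSubscribe());
  }
}
```

With no further writes to the node, the subscriber is never invoked, so the only chance to observe the completed state is an explicit read such as the one updateOfflinePush() performs.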

Evidence

Observed in EI on 2026-04-03:

heartbeat_inc_push_mt-0 v2420 on mt-0: all 60 partitions × 3 replicas COMPLETED in ZK, but controller's in-memory OfflinePushStatus.currentStatus = END_OF_PUSH_RECEIVED (confirmed via heap dump, ODP event 4075972)
Controller logs showed DeferredVersionSwapService: "Skipping store as parent version 2420 status is PARTIALLY_ONLINE" for 8+ hours
Required manual --kill-job to unblock

Fix

One-line change after updateOfflinePush():

offlinePushStatus = topicToPushMap.get(offlinePushStatus.getKafkaTopic());

This ensures checkPushStatus() evaluates the latest partition data from ZK.
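
To see the effect of the re-read in isolation, here is a minimal sketch, assuming a much simplified model. Only the assignment inside the `if (reRead)` branch mirrors the line from this PR; `Push`, `Status`, the simplified `checkPushStatus()`, the `reRead` flag, and the topic name are hypothetical and exist only so that both behaviors can be run side by side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReReadFixSketch {
  enum Status { END_OF_PUSH_RECEIVED, COMPLETED }

  /** Hypothetical stand-in for OfflinePushStatus. */
  static final class Push {
    private final String kafkaTopic;
    final boolean allReplicasCompleted;
    Status currentStatus = Status.END_OF_PUSH_RECEIVED;

    Push(String kafkaTopic, boolean allReplicasCompleted) {
      this.kafkaTopic = kafkaTopic;
      this.allReplicasCompleted = allReplicasCompleted;
    }

    String getKafkaTopic() {
      return kafkaTopic;
    }
  }

  /** Simplified check: moves the given object to COMPLETED only if its own data says so. */
  static void checkPushStatus(Push push) {
    if (push.allReplicasCompleted) {
      push.currentStatus = Status.COMPLETED;
    }
  }

  /** Runs the load loop and returns the status of the object left in topicToPushMap. */
  static Status loadAllPushes(boolean reRead) {
    Map<String, Push> topicToPushMap = new HashMap<>();
    String topic = "store_v1";            // made-up topic name
    Push t1 = new Push(topic, false);     // snapshot T1: replicas still in progress
    Push t2 = new Push(topic, true);      // snapshot T2: all replicas completed
    topicToPushMap.put(topic, t1);

    List<Push> loaded = new ArrayList<>(topicToPushMap.values());
    for (Push offlinePushStatus : loaded) {
      // Stand-in for updateOfflinePush(): replace the map entry with fresh data.
      topicToPushMap.put(offlinePushStatus.getKafkaTopic(), t2);
      if (reRead) {
        // The re-read from the PR: point the loop variable at the refreshed object.
        offlinePushStatus = topicToPushMap.get(offlinePushStatus.getKafkaTopic());
      }
      checkPushStatus(offlinePushStatus);
    }
    return topicToPushMap.get(topic).currentStatus;
  }

  public static void main(String[] args) {
    System.out.println("without re-read: " + loadAllPushes(false));
    System.out.println("with re-read:    " + loadAllPushes(true));
  }
}
```

Without the re-read, the check runs against the T1 object and the map entry stays at END_OF_PUSH_RECEIVED; with it, the check runs against the T2 object and the entry moves to COMPLETED.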

Test plan

LoadAllPushesStaleReferenceTest.testLoadAllPushesStaleReferenceAfterUpdateOfflinePush — reproduces the race condition (fails without fix, passes with fix)
All existing AbstractPushMonitorTest inherited tests pass
CI

🤖 Generated with Claude Code

END_OF_PUSH_RECEIVED after controller restart

During controller STANDBY→LEADER transition, loadAllPushes() bulk-loads OfflinePushStatus objects from ZK (snapshot T1), then for each push:

1. Registers ZK watchers for partition status changes
2. Calls updateOfflinePush() which reads fresh data from ZK (snapshot T2) and replaces the entry in topicToPushMap
3. Calls checkPushStatus() with the **stale** loop variable (T1), not the refreshed object from topicToPushMap (T2)

If replicas complete between T1 and T2, the stale object has incomplete partition statuses. checkPushStatus() returns non-terminal, and since no ZK watcher callbacks fire for already-completed partitions, the push is stuck at END_OF_PUSH_RECEIVED permanently. This blocks all future batch pushes for the store ("future version X exists").

Fix: After updateOfflinePush(), re-read the object from topicToPushMap so checkPushStatus() evaluates the latest partition data from ZK.

Added LoadAllPushesStaleReferenceTest to reproduce the race condition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Fixes a controller failover race in AbstractPushMonitor.loadAllPushes() where checkPushStatus() could evaluate a stale OfflinePushStatus instance (loaded before watcher registration) instead of the refreshed object after updateOfflinePush(), leaving batch pushes stuck at END_OF_PUSH_RECEIVED after restart.

Changes:

Refresh the offlinePushStatus loop variable from topicToPushMap immediately after updateOfflinePush() so subsequent status evaluation uses the latest ZK snapshot.
Add LoadAllPushesStaleReferenceTest to reproduce the stale-reference scenario and verify the push transitions to COMPLETED.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `services/venice-controller/src/main/java/com/linkedin/venice/pushmonitor/AbstractPushMonitor.java` | Re-reads the refreshed `OfflinePushStatus` from `topicToPushMap` after `updateOfflinePush()` to avoid using stale partition status data. |
| `services/venice-controller/src/test/java/com/linkedin/venice/pushmonitor/LoadAllPushesStaleReferenceTest.java` | Adds a regression test that simulates stale (T1) vs refreshed (T2) snapshots around watcher subscription and verifies completion + version swap. |


...ontroller/src/test/java/com/linkedin/venice/pushmonitor/LoadAllPushesStaleReferenceTest.java

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 7, 2026 01:06

Copilot started reviewing on behalf of haoxu07 April 7, 2026 01:07

Copilot AI reviewed Apr 7, 2026


Remove hard-coded line numbers from test Javadoc

fc3454f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

haoxu07 wants to merge 2 commits into linkedin:main from haoxu07:fix-loadAllPushes-stale-reference
