Skip to content

Fix stale reference in loadAllPushes() causing stuck push after controller restart#2701

Open
haoxu07 wants to merge 2 commits intolinkedin:mainfrom
haoxu07:fix-loadAllPushes-stale-reference
Open

Fix stale reference in loadAllPushes() causing stuck push after controller restart#2701
haoxu07 wants to merge 2 commits intolinkedin:mainfrom
haoxu07:fix-loadAllPushes-stale-reference

Conversation

@haoxu07
Copy link
Copy Markdown
Contributor

@haoxu07 haoxu07 commented Apr 7, 2026

Summary

  • Fix a bug in AbstractPushMonitor.loadAllPushes() where checkPushStatus() uses a stale loop variable instead of the refreshed object from topicToPushMap, causing batch pushes to get permanently stuck at END_OF_PUSH_RECEIVED after a controller restart.
  • Add LoadAllPushesStaleReferenceTest to reproduce and verify the fix.

Problem

During controller STANDBY→LEADER transition, loadAllPushes():

  1. Line 176: Bulk-loads OfflinePushStatus with partition statuses from ZK (snapshot T1)
  2. Line 190: Registers ZK watchers for partition status changes
  3. Line 195: updateOfflinePush() reads fresh data from ZK (snapshot T2) — all replicas now COMPLETED — and replaces the entry in topicToPushMap
  4. Line 199-204: checkPushStatus() is called with the stale loop variable (T1), not the refreshed object (T2)

If replicas complete between T1 and T2, the stale object has incomplete partition statuses. Since no ZK watcher callbacks fire for already-completed partitions (their final write happened before watcher registration), the push is stuck at END_OF_PUSH_RECEIVED permanently, blocking all future batch pushes for the store.

Evidence

Observed in EI on 2026-04-03:

  • heartbeat_inc_push_mt-0 v2420 on mt-0: all 60 partitions × 3 replicas COMPLETED in ZK, but controller's in-memory OfflinePushStatus.currentStatus = END_OF_PUSH_RECEIVED (confirmed via heap dump, ODP event 4075972)
  • Controller logs showed DeferredVersionSwapService: "Skipping store as parent version 2420 status is PARTIALLY_ONLINE" for 8+ hours
  • Required manual --kill-job to unblock

Fix

One-line change after updateOfflinePush():

offlinePushStatus = topicToPushMap.get(offlinePushStatus.getKafkaTopic());

This ensures checkPushStatus() evaluates the latest partition data from ZK.

Test plan

  • LoadAllPushesStaleReferenceTest.testLoadAllPushesStaleReferenceAfterUpdateOfflinePush — reproduces the race condition (fails without fix, passes with fix)
  • All existing AbstractPushMonitorTest inherited tests pass
  • CI

🤖 Generated with Claude Code

END_OF_PUSH_RECEIVED after controller restart

During controller STANDBY→LEADER transition, loadAllPushes() bulk-loads
OfflinePushStatus objects from ZK (snapshot T1), then for each push:
1. Registers ZK watchers for partition status changes
2. Calls updateOfflinePush() which reads fresh data from ZK (snapshot T2)
   and replaces the entry in topicToPushMap
3. Calls checkPushStatus() with the **stale** loop variable (T1), not the
   refreshed object from topicToPushMap (T2)

If replicas complete between T1 and T2, the stale object has incomplete
partition statuses. checkPushStatus() returns non-terminal, and since no
ZK watcher callbacks fire for already-completed partitions, the push is
stuck at END_OF_PUSH_RECEIVED permanently. This blocks all future batch
pushes for the store ("future version X exists").

Fix: After updateOfflinePush(), re-read the object from topicToPushMap
so checkPushStatus() evaluates the latest partition data from ZK.

Added LoadAllPushesStaleReferenceTest to reproduce the race condition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 7, 2026 01:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a controller failover race in AbstractPushMonitor.loadAllPushes() where checkPushStatus() could evaluate a stale OfflinePushStatus instance (loaded before watcher registration) instead of the refreshed object after updateOfflinePush(), leaving batch pushes stuck at END_OF_PUSH_RECEIVED after restart.

Changes:

  • Refresh the offlinePushStatus loop variable from topicToPushMap immediately after updateOfflinePush() so subsequent status evaluation uses the latest ZK snapshot.
  • Add LoadAllPushesStaleReferenceTest to reproduce the stale-reference scenario and verify the push transitions to COMPLETED.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
services/venice-controller/src/main/java/com/linkedin/venice/pushmonitor/AbstractPushMonitor.java Re-reads the refreshed OfflinePushStatus from topicToPushMap after updateOfflinePush() to avoid using stale partition status data.
services/venice-controller/src/test/java/com/linkedin/venice/pushmonitor/LoadAllPushesStaleReferenceTest.java Adds a regression test that simulates stale (T1) vs refreshed (T2) snapshots around watcher subscription and verifies completion + version swap.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants