Summary
HAKeeper LogService checker/operator paths can treat any running replica of the same shard as satisfying work for a specific desired replica. This is unsafe when a LogStore has multiple generations of the same shard, for example desired shard=1/replica=262145 while the store heartbeat reports shard=1/replica=275385.
Steps to Reproduce
Construct a HAKeeper LogState where:
    LogState.Shards[1].Replicas:
        262145 -> log-1
    LogState.Stores["log-1"].Replicas:
        shard=1, replica=275385
Then run the LogService checker/operator finish checks for starting shard=1/replica=262145.
Actual Behavior
The checker/operator can treat the store as already having shard 1 started, because several paths compare only ShardID and ignore ReplicaID.
Expected Behavior
The checker/operator should require an exact (ShardID, ReplicaID) match. shard=1/replica=275385 must not satisfy Start shard=1/replica=262145.
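The difference between the two checks can be sketched as follows. This is a minimal illustration, not the actual HAKeeper code: the types `ReplicaInfo` and `StoreInfo` and the functions `shardRunning` and `replicaRunning` are hypothetical stand-ins for the store heartbeat data and the finish checks described above.

```go
package main

import "fmt"

// ReplicaInfo is a hypothetical stand-in for one replica reported
// in a LogStore heartbeat.
type ReplicaInfo struct {
	ShardID   uint64
	ReplicaID uint64
}

// StoreInfo is a hypothetical stand-in for an entry in LogState.Stores.
type StoreInfo struct {
	Replicas []ReplicaInfo
}

// shardRunning models the unsafe check: any replica of the shard
// on the store is accepted, regardless of its ReplicaID.
func shardRunning(store StoreInfo, shardID uint64) bool {
	for _, r := range store.Replicas {
		if r.ShardID == shardID {
			return true
		}
	}
	return false
}

// replicaRunning models the expected check: only an exact
// (ShardID, ReplicaID) match is accepted.
func replicaRunning(store StoreInfo, shardID, replicaID uint64) bool {
	for _, r := range store.Replicas {
		if r.ShardID == shardID && r.ReplicaID == replicaID {
			return true
		}
	}
	return false
}

func main() {
	// The store heartbeat reports shard=1/replica=275385,
	// while the desired replica is shard=1/replica=262145.
	store := StoreInfo{Replicas: []ReplicaInfo{{ShardID: 1, ReplicaID: 275385}}}

	fmt.Println(shardRunning(store, 1))           // true: wrongly satisfied
	fmt.Println(replicaRunning(store, 1, 262145)) // false: start has not converged
	fmt.Println(replicaRunning(store, 1, 275385)) // true: only the reported replica matches
}
```

With the shard-only check, the start of shard=1/replica=262145 appears finished even though the store runs a different replica generation. The exact check keeps the operation pending.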
Impact
This can let HAKeeper believe a replacement/start operation has converged while the store is running a different replica generation. In a mixed-generation state, this can contribute to HAKeeper repeatedly targeting an old replica and treating a newer replica of the same shard as a zombie.
Additional Context
This fix prevents the shard-level identity confusion. It does not by itself recover already-persisted inconsistent production data where HAKeeper state, local metadata, and Dragonboat WAL/bootstrap records have diverged.