[Bug]: HAKeeper logservice checker treats different replica generations as the same shard #24911

@LeftHandCold

Summary

Several HAKeeper LogService checker/operator paths can treat any running replica of a shard as satisfying work that targets one specific desired replica of that shard. This is unsafe when a LogStore has multiple generations of the same shard, for example when the desired replica is shard=1/replica=262145 while the store heartbeat reports shard=1/replica=275385.

Steps to Reproduce

Construct a HAKeeper LogState where:

LogState.Shards[1].Replicas:
  262145 -> log-1

LogState.Stores["log-1"].Replicas:
  shard=1, replica=275385

Then run the LogService checker/operator finish checks for starting shard=1/replica=262145.

Actual Behavior

The checker/operator can treat the store as already having shard 1 started, because several paths compare only ShardID and ignore ReplicaID.

Expected Behavior

The checker/operator should require an exact (ShardID, ReplicaID) match. shard=1/replica=275385 must not satisfy Start shard=1/replica=262145.
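
The difference between the two checks can be sketched in Go as follows. This is a minimal illustration, not the actual HAKeeper code: `ReplicaInfo`, `StoreInfo`, `hasShard`, and `hasReplica` are hypothetical, simplified stand-ins for the heartbeat state and the finish checks described in this issue.

```go
package main

import "fmt"

// ReplicaInfo is a hypothetical, simplified stand-in for one replica entry
// reported in a LogStore heartbeat. It is not the real MatrixOne type.
type ReplicaInfo struct {
	ShardID   uint64
	ReplicaID uint64
}

// StoreInfo is a hypothetical stand-in for the per-store state HAKeeper keeps.
type StoreInfo struct {
	Replicas []ReplicaInfo
}

// hasShard mirrors the unsafe pattern: any replica of the shard counts.
func hasShard(store StoreInfo, shardID uint64) bool {
	for _, r := range store.Replicas {
		if r.ShardID == shardID {
			return true
		}
	}
	return false
}

// hasReplica requires an exact (ShardID, ReplicaID) match.
func hasReplica(store StoreInfo, shardID, replicaID uint64) bool {
	for _, r := range store.Replicas {
		if r.ShardID == shardID && r.ReplicaID == replicaID {
			return true
		}
	}
	return false
}

func main() {
	// State from the reproduction steps: log-1 reports shard=1/replica=275385,
	// while HAKeeper wants shard=1/replica=262145 on that store.
	log1 := StoreInfo{Replicas: []ReplicaInfo{{ShardID: 1, ReplicaID: 275385}}}

	fmt.Println(hasShard(log1, 1))           // true: shard-only check is satisfied
	fmt.Println(hasReplica(log1, 1, 262145)) // false: desired replica is not running
	fmt.Println(hasReplica(log1, 1, 275385)) // true: only the reported replica matches
}
```

With the shard-only check, a Start operation for shard=1/replica=262145 would be considered finished on log-1; with the exact-match check it would remain pending.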

Impact

This can let HAKeeper believe a replacement/start operation has converged while the store is running a different replica generation. In a mixed-generation state, this can contribute to HAKeeper repeatedly targeting an old replica and treating a newer same-shard replica as a zombie.

Additional Context

This fix prevents the shard-level identity confusion. It does not by itself recover already-persisted inconsistent production data where HAKeeper state, local metadata, and Dragonboat WAL/bootstrap records have diverged.

Labels: kind/bug (Something isn't working), severity/s1 (High impact: Logical errors or data errors that must occur)
