fix: RepairDisconnectedMasters does not heal failed followers, causing infinite Failed state after pod restarts #1705

Open
svils wants to merge 2 commits into OT-CONTAINER-KIT:main from svils:fix/repair-disconnected-followers

@svils commented Mar 9, 2026

Summary

Fixes #1692

  • Broaden repairDisconnectedMasters → RepairDisconnectedNodes to handle both failed masters and failed slaves. For slaves, issue CLUSTER MEET (fix gossip) + CLUSTER REPLICATE (re-establish replication so the follower resolves its master's current IP from gossip).
  • Add RepairStaleReplication to detect connected followers whose master_link_status is down and re-issue CLUSTER REPLICATE. This catches the subtler case where gossip is fine but replication is still pointed at a stale IP.
  • Harden cluster config defaults to reduce the likelihood and blast radius of the problem:
    • tcp-keepalive 60s (Redis recommends this; faster dead connection detection)
    • cluster-node-timeout 15000ms, configurable via CLUSTER_NODE_TIMEOUT env var (more gossip recovery time before marking nodes failed)
    • cluster-allow-reads-when-down yes (keep clients unblocked while the operator repairs nodes)
    • Probe TimeoutSeconds=5 / FailureThreshold=5 (prevent Kubernetes from prematurely evicting pods during transient blips)

Both CLUSTER MEET and CLUSTER REPLICATE are idempotent and safe to retry.
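
As a rough illustration of the repair rule described above (not the code in this PR), the sketch below scans `CLUSTER NODES` output for entries flagged `fail` and plans `CLUSTER MEET` for masters, plus `CLUSTER REPLICATE` for slaves. The names `failedNode`, `failedNodes`, and `planRepair` are hypothetical. The sketch also assumes the pod's current IP is looked up elsewhere and that only the `fail` flag (not `fail?`) triggers a repair.

```go
package main

import (
	"fmt"
	"strings"
)

// failedNode is a hypothetical record for one CLUSTER NODES entry flagged "fail".
type failedNode struct {
	ID        string
	IsReplica bool
	MasterID  string // empty for masters
}

// failedNodes scans CLUSTER NODES output and returns the entries whose
// comma-separated flags include "fail". Entries flagged only "fail?" are
// skipped (an assumption of this sketch).
func failedNodes(clusterNodes string) []failedNode {
	var out []failedNode
	for _, line := range strings.Split(clusterNodes, "\n") {
		// Layout: <id> <addr> <flags> <master-id> <ping> <pong> <epoch> <link-state> [slots...]
		fields := strings.Fields(line)
		if len(fields) < 8 {
			continue
		}
		failed, replica := false, false
		for _, flag := range strings.Split(fields[2], ",") {
			switch flag {
			case "fail":
				failed = true
			case "slave":
				replica = true
			}
		}
		if !failed {
			continue
		}
		n := failedNode{ID: fields[0], IsReplica: replica}
		if replica && fields[3] != "-" {
			n.MasterID = fields[3]
		}
		out = append(out, n)
	}
	return out
}

// planRepair lists the commands this sketch would issue for one failed node:
// CLUSTER MEET for every node, plus CLUSTER REPLICATE for slaves.
func planRepair(n failedNode, currentIP string, port int) []string {
	cmds := []string{fmt.Sprintf("CLUSTER MEET %s %d", currentIP, port)}
	if n.IsReplica && n.MasterID != "" {
		cmds = append(cmds, "CLUSTER REPLICATE "+n.MasterID)
	}
	return cmds
}

func main() {
	// Made-up sample output: one healthy master, one failed slave, one failed master.
	sample := "aaa 10.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-5460\n" +
		"bbb 10.0.0.2:6379@16379 slave,fail aaa 0 0 1 disconnected\n" +
		"ccc 10.0.0.3:6379@16379 master,fail - 0 0 2 disconnected 5461-10922\n"
	for _, n := range failedNodes(sample) {
		// 10.0.0.9 stands in for the restarted pod's new IP.
		fmt.Println(n.ID, planRepair(n, "10.0.0.9", 6379))
	}
}
```

Which node each command is sent to, and how the operator discovers the new pod IP, are deliberately left out of the sketch.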

Test plan

  • Existing tests updated for the rename (RepairDisconnectedMasters → RepairDisconnectedNodes)
  • New unit tests for replicationLinkUp (up, down, master node, empty)
  • New unit tests for RepairStaleReplication (replication down → repaired, replication up → no-op)
  • Updated statefulset_test.go expectations for new probe defaults
  • All tests pass (go test ./internal/k8sutils/...)
  • Deploy to a test cluster, restart follower pods, verify automatic recovery
  • Deploy to a test cluster, restart leader pods, verify followers re-resolve master IP
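
For readers unfamiliar with the check behind the replicationLinkUp cases above (up, down, master node, empty), here is a minimal sketch of one way to read `master_link_status` from `INFO replication` output. The helpers `infoField` and `needsReplicationRepair` are hypothetical and may not match the PR's actual implementation. In particular, treating a master node and an empty response as "no repair needed" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// infoField returns the value for key in Redis INFO output ("key:value" lines)
// and whether the key was present.
func infoField(info, key string) (string, bool) {
	for _, line := range strings.Split(info, "\n") {
		line = strings.TrimSpace(line) // INFO lines are terminated by \r\n
		if k, v, ok := strings.Cut(line, ":"); ok && k == key {
			return v, true
		}
	}
	return "", false
}

// needsReplicationRepair reports whether the node is a follower whose
// replication link is down. Masters and empty output yield false
// (an assumption of this sketch).
func needsReplicationRepair(info string) bool {
	role, _ := infoField(info, "role")
	status, ok := infoField(info, "master_link_status")
	return role == "slave" && ok && status == "down"
}

func main() {
	cases := map[string]string{
		"link up":   "# Replication\r\nrole:slave\r\nmaster_link_status:up\r\n",
		"link down": "# Replication\r\nrole:slave\r\nmaster_link_status:down\r\n",
		"master":    "# Replication\r\nrole:master\r\nconnected_slaves:1\r\n",
		"empty":     "",
	}
	for name, info := range cases {
		fmt.Printf("%s: repair=%v\n", name, needsReplicationRepair(info))
	}
}
```

Only the "link down" case asks for a repair, which mirrors the "replication down → repaired, replication up → no-op" pairing in the test plan.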

@svils force-pushed the fix/repair-disconnected-followers branch 4 times, most recently from 340b07d to d57ffa9, Mar 9, 2026 14:06
@svils changed the title from "fix: Repair disconnected followers after pod restarts" to "fix: RepairDisconnectedMasters does not heal failed followers, causing infinite Failed state after pod restarts", Mar 9, 2026
…g infinite Failed state after pod restarts

Broaden repairDisconnectedMasters to handle both failed masters and
failed slaves (renamed to RepairDisconnectedNodes). For slaves, issue
CLUSTER MEET to fix gossip and CLUSTER REPLICATE to re-establish
replication so the follower resolves its master's current IP.

Add RepairStaleReplication to detect connected followers whose
master_link_status is down and re-issue CLUSTER REPLICATE.

Harden cluster config defaults:
- tcp-keepalive 60s (faster dead connection detection)
- cluster-node-timeout 15000ms (more gossip recovery time, configurable)
- cluster-allow-reads-when-down yes (unblock clients during repair)
- probe TimeoutSeconds=5 / FailureThreshold=5 (prevent premature pod eviction)

Fixes OT-CONTAINER-KIT#1692

Signed-off-by: svils <63684363+svils@users.noreply.github.com>
@svils force-pushed the fix/repair-disconnected-followers branch from d57ffa9 to 11722dd, Mar 9, 2026 15:46
@svils (Author) commented Apr 7, 2026

@shubham-cmyk hey, you asked me to open this PR back on #1692 — it's been about a month now with no review. we're running this in production across multiple EKS clusters and keep hitting these bugs on every node rotation.

this PR (#1705) fixes the follower repair loop, but while debugging in prod i ran into two more issues that are closely related:

all three are straightforward fixes with tests, and they've been running stable in our production builds. would be great to get some eyes on these so they can land in the next release.

@joepizza1

@shubham-cmyk @drivebyer can you please review and get this fix in? We're facing this issue in a few environments, and it would be really helpful to have this fixed.

Development

Successfully merging this pull request may close these issues.

RepairDisconnectedMasters does not heal failed followers, causing infinite Failed state after pod restarts
