fix: RepairDisconnectedMasters does not heal failed followers, causing infinite Failed state after pod restarts by svils · Pull Request #1705 · OT-CONTAINER-KIT/redis-operator

svils · 2026-03-09T13:44:13Z

Summary

Broaden repairDisconnectedMasters → RepairDisconnectedNodes to handle both failed masters and failed slaves. For slaves, issue CLUSTER MEET (fix gossip) + CLUSTER REPLICATE (re-establish replication so the follower resolves its master's current IP from gossip).
Add RepairStaleReplication to detect connected followers whose master_link_status is down and re-issue CLUSTER REPLICATE. This catches the subtler case where gossip is fine but replication is still pointed at a stale IP.
Harden cluster config defaults to reduce the likelihood and blast radius of the problem:
- tcp-keepalive 60s (Redis recommends this; faster dead connection detection)
- cluster-node-timeout 15000ms, configurable via CLUSTER_NODE_TIMEOUT env var (more gossip recovery time before marking nodes failed)
- cluster-allow-reads-when-down yes (keep clients unblocked while the operator repairs nodes)
- Probe TimeoutSeconds=5 / FailureThreshold=5 (prevent Kubernetes from prematurely evicting pods during transient blips)
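A minimal sketch of how the configurable timeout could be resolved and the three Redis settings rendered as `redis.conf` directives. The helper names (`clusterNodeTimeoutMs`, `clusterConfigLines`) and the injected `getenv` parameter are hypothetical and are not taken from the PR's diff; only the directive names, the values, and the `CLUSTER_NODE_TIMEOUT` variable come from the summary above. Falling back to the default on an invalid value is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultClusterNodeTimeoutMs mirrors the 15000ms default named in the summary.
const defaultClusterNodeTimeoutMs = 15000

// clusterNodeTimeoutMs returns CLUSTER_NODE_TIMEOUT when it parses as a
// positive integer and the default otherwise (assumed fallback behaviour).
// getenv is injected so the function can be exercised without touching the
// process environment.
func clusterNodeTimeoutMs(getenv func(string) string) int {
	raw := getenv("CLUSTER_NODE_TIMEOUT")
	if ms, err := strconv.Atoi(raw); err == nil && ms > 0 {
		return ms
	}
	return defaultClusterNodeTimeoutMs
}

// clusterConfigLines renders the hardened settings as redis.conf directives.
func clusterConfigLines(nodeTimeoutMs int) []string {
	return []string{
		"tcp-keepalive 60",
		fmt.Sprintf("cluster-node-timeout %d", nodeTimeoutMs),
		"cluster-allow-reads-when-down yes",
	}
}

func main() {
	for _, line := range clusterConfigLines(clusterNodeTimeoutMs(os.Getenv)) {
		fmt.Println(line)
	}
}
```

Injecting `getenv` keeps the default and override paths easy to cover in a unit test.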

Both CLUSTER MEET and CLUSTER REPLICATE are idempotent and safe to retry.
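The stale-replication check described above can be sketched roughly as follows, assuming the operator already has the text of `INFO replication` from a follower. All function names here (`masterLinkStatus`, `replicationStale`, `meetArgs`, `replicateArgs`) are hypothetical and do not reflect the signatures in the PR; the node ID and IP in `main` are placeholders. Which node receives each command is left to the caller, since this page does not spell that out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// masterLinkStatus extracts master_link_status from INFO replication output.
// The second return value is false when the field is absent, which is the
// case for a master node or for empty output.
func masterLinkStatus(info string) (string, bool) {
	const key = "master_link_status:"
	for _, line := range strings.Split(info, "\n") {
		line = strings.TrimSpace(line) // also drops the trailing \r
		if strings.HasPrefix(line, key) {
			return strings.TrimPrefix(line, key), true
		}
	}
	return "", false
}

// replicationStale reports whether a follower explicitly says its link to
// the master is down. A missing field is treated as "nothing to repair",
// which is an assumption of this sketch.
func replicationStale(info string) bool {
	status, found := masterLinkStatus(info)
	return found && status == "down"
}

// meetArgs builds a CLUSTER MEET command for the given peer address.
func meetArgs(ip string, port int) []string {
	return []string{"CLUSTER", "MEET", ip, strconv.Itoa(port)}
}

// replicateArgs builds a CLUSTER REPLICATE command for the given master ID.
func replicateArgs(masterNodeID string) []string {
	return []string{"CLUSTER", "REPLICATE", masterNodeID}
}

func main() {
	info := "# Replication\r\nrole:slave\r\nmaster_link_status:down\r\n"
	if replicationStale(info) {
		fmt.Println(strings.Join(replicateArgs("<master-node-id>"), " "))
	}
	fmt.Println(strings.Join(meetArgs("10.0.0.12", 6379), " "))
}
```

Returning the raw status together with a `found` flag keeps the "master node" and "empty" cases distinct from an explicit `down`, so the caller decides how to treat them.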

Test plan

Existing tests updated for the rename (RepairDisconnectedMasters → RepairDisconnectedNodes)
New unit tests for replicationLinkUp (up, down, master node, empty)
New unit tests for RepairStaleReplication (replication down → repaired, replication up → no-op)
Updated statefulset_test.go expectations for new probe defaults
All tests pass (go test ./internal/k8sutils/...)
Deploy to a test cluster, restart follower pods, verify automatic recovery
Deploy to a test cluster, restart leader pods, verify followers re-resolve master IP

…g infinite Failed state after pod restarts

Broaden repairDisconnectedMasters to handle both failed masters and failed slaves (renamed to RepairDisconnectedNodes). For slaves, issue CLUSTER MEET to fix gossip and CLUSTER REPLICATE to re-establish replication so the follower resolves its master's current IP.

Add RepairStaleReplication to detect connected followers whose master_link_status is down and re-issue CLUSTER REPLICATE.

Harden cluster config defaults:
- tcp-keepalive 60s (faster dead connection detection)
- cluster-node-timeout 15000ms (more gossip recovery time, configurable)
- cluster-allow-reads-when-down yes (unblock clients during repair)
- probe TimeoutSeconds=5 / FailureThreshold=5 (prevent premature pod eviction)

Fixes OT-CONTAINER-KIT#1692

Signed-off-by: svils <63684363+svils@users.noreply.github.com>

svils · 2026-04-07T19:20:32Z

@shubham-cmyk hey, you asked me to open this PR back on #1692 — it's been about a month now with no review. we're running this in production across multiple EKS clusters and keep hitting these bugs on every node rotation.

this PR (#1705) fixes the follower repair loop, but while debugging in prod i ran into two more issues that are closely related:

- #1706, "fix: Batch CLUSTER ADDSLOTS for single-leader RedisCluster to avoid exec URL limit": single-leader clusters get stuck in Bootstrap forever because ADDSLOTS exceeds the k8s exec URL limit
- #1734, "fix: reset follower with stale state before add-node to prevent 'Node is not empty' loop": after the leader-only CLUSTER RESET, add-node fails with "Node is not empty" because the follower still has stale state (same root cause as #1407, "RedisCluster node cannot init/rejoin after cluster scale-in and scale-out due to already stored configuration")
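To illustrate the batching idea behind #1706, here is a rough sketch that splits the 16384 Redis Cluster hash slots into fixed-size batches, so that each `CLUSTER ADDSLOTS` invocation carries a bounded number of arguments. The helper names and the batch size of 1000 are assumptions made for illustration; the actual batch size and implementation in #1706 are not shown on this page.

```go
package main

import (
	"fmt"
	"strconv"
)

// totalSlots is the fixed number of hash slots in a Redis Cluster.
const totalSlots = 16384

// slotBatches splits [0, totalSlots) into consecutive half-open ranges
// [start, end) holding at most batchSize slots each.
func slotBatches(batchSize int) [][2]int {
	if batchSize <= 0 {
		return nil
	}
	var batches [][2]int
	for start := 0; start < totalSlots; start += batchSize {
		end := start + batchSize
		if end > totalSlots {
			end = totalSlots
		}
		batches = append(batches, [2]int{start, end})
	}
	return batches
}

// addSlotsArgs builds one CLUSTER ADDSLOTS command for the range [start, end).
func addSlotsArgs(start, end int) []string {
	args := []string{"CLUSTER", "ADDSLOTS"}
	for slot := start; slot < end; slot++ {
		args = append(args, strconv.Itoa(slot))
	}
	return args
}

func main() {
	const batchSize = 1000 // hypothetical value, chosen only for the example
	for _, b := range slotBatches(batchSize) {
		args := addSlotsArgs(b[0], b[1])
		fmt.Printf("slots %d-%d: %d arguments\n", b[0], b[1]-1, len(args))
	}
}
```

Each batch would be sent as its own command, which keeps any single exec request short regardless of how many slots one leader owns.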

all three are straightforward fixes with tests, and they've been running stable in our production builds. would be great to get some eyes on these so they can land in the next release.

joepizza1 · 2026-04-13T17:20:29Z

@shubham-cmyk @drivebyer can you please review and get this fix in? We're facing this issue in a few environments and it would be really helpful to have this fixed.

svils requested review from drivebyer, iamabhishek-dubey and shubham-cmyk as code owners March 9, 2026 13:44

svils force-pushed the fix/repair-disconnected-followers branch 4 times, most recently from 340b07d to d57ffa9 Compare March 9, 2026 14:06

svils changed the title ~~fix: Repair disconnected followers after pod restarts~~ fix: RepairDisconnectedMasters does not heal failed followers, causing infinite Failed state after pod restarts Mar 9, 2026

svils force-pushed the fix/repair-disconnected-followers branch from d57ffa9 to 11722dd Compare March 9, 2026 15:46

Merge branch 'main' into fix/repair-disconnected-followers

7aebfa9

svils wants to merge 2 commits into OT-CONTAINER-KIT:main from svils:fix/repair-disconnected-followers
