fix: skip unprobeable pods so failover convergence doesn't stall for minutes #2

Merged: mvasilenko merged 1 commit into main from fix/redis-operator-skip-unprobeable-pods on Mar 31, 2026

Conversation

@mvasilenko
Owner

Cherry-pick of OT-CONTAINER-KIT#1712 (Sina Sadeghi).

  • GetRedisNodesByRole now skips pods that are Pending, not Ready, or have no PodIP, instead of returning an error on the first TCP timeout
  • The role-label healer likewise skips unprobeable pods
  • Replication and Sentinel are not reconfigured while the observed topology is incomplete
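
To illustrate the skip-instead-of-fail behaviour described above, here is a minimal sketch. It is not the code from the upstream patch: `PodInfo`, `RoleProber`, `isProbeable`, and `nodesByRole` are hypothetical names, the pod is reduced to a plain struct rather than the real Kubernetes type, and treating a failed probe as "skipped" is an assumption made for the example.

```go
package main

// PodInfo is a hypothetical, trimmed-down stand-in for the pod fields that
// matter here; the real operator works with Kubernetes pod objects.
type PodInfo struct {
	Name  string
	Phase string // e.g. "Pending" or "Running"
	Ready bool   // true when the pod's Ready condition is true
	PodIP string // empty until the pod has been assigned an IP
}

// RoleProber is a hypothetical callback that asks a pod for its Redis role
// ("master" or "slave"). In a real operator this would be a network call.
type RoleProber func(p PodInfo) (string, error)

// isProbeable reports whether opening a connection to the pod is worthwhile.
// Pending, non-Ready, and IP-less pods are considered unprobeable.
func isProbeable(p PodInfo) bool {
	return p.Phase == "Running" && p.Ready && p.PodIP != ""
}

// nodesByRole returns the names of pods reporting the requested role, plus
// the names of pods that were skipped. Unprobeable pods are skipped without
// being contacted; a probe error also skips the pod (an assumption of this
// sketch) rather than aborting the whole scan.
func nodesByRole(pods []PodInfo, role string, probe RoleProber) (matched, skipped []string) {
	for _, p := range pods {
		if !isProbeable(p) {
			skipped = append(skipped, p.Name)
			continue
		}
		got, err := probe(p)
		if err != nil {
			skipped = append(skipped, p.Name)
			continue
		}
		if got == role {
			matched = append(matched, p.Name)
		}
	}
	return matched, skipped
}
```

A caller could treat a non-empty `skipped` list as a sign that the topology is incomplete and postpone replication or Sentinel reconfiguration until a later reconcile, which is the intent of the third bullet.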

Verified on k3d (1 server + 3 agents): the node-failure outage dropped from 323s/no-recovery on stock v0.24.0 to ~39s with this fix.

Fixes OT-CONTAINER-KIT#1711 / OT-CONTAINER-KIT#1719

@mvasilenko mvasilenko changed the title fix: skip unprobeable pods so failover convergence doesn't stall for minutes fix: skip unprobeable pods so failover convergence doesn't stall for minutes Mar 30, 2026
…minutes

Signed-off-by: Michael Vasilenko <mvasilenko@gmail.com>
@mvasilenko mvasilenko force-pushed the fix/redis-operator-skip-unprobeable-pods branch from b3ca756 to 2e09e54 Compare March 30, 2026 21:49
@mvasilenko mvasilenko merged commit 3becb31 into main Mar 31, 2026
16 checks passed

Development

Successfully merging this pull request may close these issues.

RedisReplication controller cannot converge after Sentinel failover when one Redis pod is Pending/unreachable (master service remains stale)
