Skip to content

fix: skip unprobeable pods so failover convergence doesn’t back off f…#1712

Open
sinner- wants to merge 2 commits intoOT-CONTAINER-KIT:mainfrom
sinner-:fix_issue_1711
Open

fix: skip unprobeable pods so failover convergence doesn’t back off f…#1712
sinner- wants to merge 2 commits intoOT-CONTAINER-KIT:mainfrom
sinner-:fix_issue_1711

Conversation

@sinner-
Copy link
Copy Markdown

@sinner- sinner- commented Mar 13, 2026

Description

Summary

Fix RedisReplication failover convergence when one Redis pod is unreachable during Sentinel-led master promotion.

Previously, GetRedisNodesByRole() would fail the reconcile as soon as it hit an unreachable Redis pod. In the reported scenario, that meant the operator could not update redis-role labels or RedisReplication.status.masterNode after Sentinel had already promoted a new master. As a result, the <name>-master Service kept pointing at the wrong pod and clients remained stuck until controller-runtime eventually retried after exponential backoff.

This change updates the controller to reason over the probable Redis topology instead of aborting on a single unavailable ordinal.

What changed

  • Skip Redis pods that are not probeable yet when discovering roles.
    • Pending pods
    • non-Ready pods
    • pods without a PodIP
  • Reuse pod data directly when building Redis clients for role checks.
  • Update the role-label healer to skip unprobeable pods as well.
  • Avoid replication or Sentinel reconfiguration when the observed topology is incomplete and the current master cannot be identified safely.
  • Add focused unit coverage for:
    • skipping unprobeable pods in role discovery
    • continuing status reconciliation when one pod is unobserved
    • avoiding unsafe replication/Sentinel changes on incomplete topology
    • preserving existing behavior when topology is complete

Why this fixes the issue

With this change, a failed or pending old master no longer causes reconcile to return an error just because that ordinal cannot be probed. The operator can continue to observe the surviving ready Redis pods, identify the Sentinel-elected master, update status.masterNode, and refresh redis-role labels. That allows the <name>-master Service to move to the new master quickly instead of waiting for error-driven exponential backoff to unwind.

In short:

  • client recovery now follows the healthy promoted master
  • the reconcile loop no longer enters the long backoff path for this failure mode
  • unsafe reconfiguration is avoided while topology is incomplete

Validation

Executed:

  • env GOCACHE=/tmp/redis-operator-go-build-cache go test ./internal/k8sutils
  • env GOCACHE=/tmp/redis-operator-go-build-cache go test ./internal/controller/common/redis
  • env GOCACHE=/tmp/redis-operator-go-build-cache go test ./internal/controller/redisreplication -run 'TestReconcileRedis|TestReconcileStatus'

Fixes #1711

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • Tests have been added/modified and all tests pass.
  • Functionality/bugs have been confirmed to be unchanged or fixed.
  • I have performed a self-review of my own code.
  • Documentation has been updated or added where necessary.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 57.28155% with 44 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@cca67a6). Learn more about missing BASE report.

Files with missing lines Patch % Lines
internal/k8sutils/redis.go 34.09% 29 Missing ⚠️
...er/redisreplication/redisreplication_controller.go 73.68% 10 Missing and 5 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1712   +/-   ##
=======================================
  Coverage        ?   32.11%           
=======================================
  Files           ?       83           
  Lines           ?     6803           
  Branches        ?        0           
=======================================
  Hits            ?     2185           
  Misses          ?     4402           
  Partials        ?      216           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sinner-
Copy link
Copy Markdown
Author

sinner- commented Mar 19, 2026

hey @shubham-cmyk @drivebyer @iamabhishek-dubey any guidance on how I can best contribute to moving this forward?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RedisReplication controller cannot converge after Sentinel failover when one Redis pod is Pending/unreachable (master service remains stale)

1 participant