fix: skip unprobeable pods so failover convergence doesn’t back off f… by sinner- · Pull Request #1712 · OT-CONTAINER-KIT/redis-operator

sinner- · 2026-03-13T02:59:28Z

Description

Summary

Fix RedisReplication failover convergence when one Redis pod is unreachable during Sentinel-led master promotion.

Previously, GetRedisNodesByRole() would fail the reconcile as soon as it hit an unreachable Redis pod. In the reported scenario, that meant the operator could not update redis-role labels or RedisReplication.status.masterNode after Sentinel had already promoted a new master. As a result, the <name>-master Service kept pointing at the wrong pod and clients remained stuck until controller-runtime eventually retried after exponential backoff.

This change updates the controller to reason over the probable Redis topology instead of aborting on a single unavailable ordinal.

What changed

Skip Redis pods that are not probeable yet when discovering roles.
- Pending pods
- non-Ready pods
- pods without a PodIP
Reuse pod data directly when building Redis clients for role checks.
Update the role-label healer to skip unprobeable pods as well.
Avoid replication or Sentinel reconfiguration when the observed topology is incomplete and the current master cannot be identified safely.
Add focused unit coverage for:
- skipping unprobeable pods in role discovery
- continuing status reconciliation when one pod is unobserved
- avoiding unsafe replication/Sentinel changes on incomplete topology
- preserving existing behavior when topology is complete

Why this fixes the issue

With this change, a failed or pending old master no longer causes reconcile to return an error just because that ordinal cannot be probed. The operator can continue to observe the surviving ready Redis pods, identify the Sentinel-elected master, update status.masterNode, and refresh redis-role labels. That allows the <name>-master Service to move to the new master quickly instead of waiting for error-driven exponential backoff to unwind.

In short:

client recovery now follows the healthy promoted master
the reconcile loop no longer enters the long backoff path for this failure mode
unsafe reconfiguration is avoided while topology is incomplete

Validation

Executed:

env GOCACHE=/tmp/redis-operator-go-build-cache go test ./internal/k8sutils
env GOCACHE=/tmp/redis-operator-go-build-cache go test ./internal/controller/common/redis
env GOCACHE=/tmp/redis-operator-go-build-cache go test ./internal/controller/redisreplication -run 'TestReconcileRedis|TestReconcileStatus'

Fixes #1711

Type of change

Bug fix (non-breaking change which fixes an issue)

Checklist

Tests have been added/modified and all tests pass.
Functionality/bugs have been confirmed to be unchanged or fixed.
I have performed a self-review of my own code.
Documentation has been updated or added where necessary.

…or minutes

codecov · 2026-03-13T17:18:57Z

Codecov Report

❌ Patch coverage is 57.28155% with 44 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@cca67a6). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
internal/k8sutils/redis.go	34.09%	29 Missing ⚠️
...er/redisreplication/redisreplication_controller.go	73.68%	10 Missing and 5 partials ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1712   +/-   ##
=======================================
  Coverage        ?   32.11%           
=======================================
  Files           ?       83           
  Lines           ?     6803           
  Branches        ?        0           
=======================================
  Hits            ?     2185           
  Misses          ?     4402           
  Partials        ?      216

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sinner- · 2026-03-19T22:15:41Z

hey @shubham-cmyk @drivebyer @iamabhishek-dubey any guidance on how I can best contribute to moving this forward?

sinner- requested review from drivebyer, iamabhishek-dubey and shubham-cmyk as code owners March 13, 2026 02:59

fix: skip unprobeable pods so failover convergence doesn’t back off f…

d1365ef

…or minutes

shubham-cmyk force-pushed the fix_issue_1711 branch from c95d16c to d1365ef Compare March 13, 2026 17:07

Merge branch 'main' into fix_issue_1711

2d5d42b

This was referenced Mar 30, 2026

fix: skip unprobeable pods so failover convergence doesn't stall for minutes mvasilenko/opstree-redis-operator#2

Merged

fix: skip unprobeable pods so failover convergence doesn't stall for minutes - ha split-brain and self-recovery #1728

Open

joaojuliosap mentioned this pull request Apr 23, 2026

fix: skip unprobeable pods so failover convergence doesn’t back off f… #1755

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: skip unprobeable pods so failover convergence doesn’t back off f…#1712

fix: skip unprobeable pods so failover convergence doesn’t back off f…#1712
sinner- wants to merge 2 commits intoOT-CONTAINER-KIT:mainfrom
sinner-:fix_issue_1711

sinner- commented Mar 13, 2026

Uh oh!

codecov Bot commented Mar 13, 2026 •

edited

Loading

Uh oh!

sinner- commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

sinner- commented Mar 13, 2026

Summary

What changed

Why this fixes the issue

Validation

Uh oh!

codecov Bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sinner- commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented Mar 13, 2026 •

edited

Loading