
fix: Batch CLUSTER ADDSLOTS for single-leader RedisCluster to avoid exec URL limit#1706

Open
svils wants to merge 7 commits into OT-CONTAINER-KIT:main from svils:fix/single-leader-addslots-batch

Conversation

@svils

@svils svils commented Mar 9, 2026

Summary

Fixes #1704

CreateSingleLeaderRedisCommand built a single redis-cli CLUSTER ADDSLOTS 0 1 2 ... 16383 command with all 16384 slot numbers as individual arguments. When executed via the Kubernetes pod exec API (SPDY), the arguments are encoded as URL query parameters, exceeding the URL length limit. The connection upgrade fails and the cluster stays stuck in Bootstrap forever.

  • Redis 7+ (default): use CLUSTER ADDSLOTSRANGE 0 16383 — a single compact command with just a start-end range pair, avoiding the URL length issue entirely
  • Redis <7 fallback: batch CLUSTER ADDSLOTS into chunks of 1000 per exec call. CLUSTER ADDSLOTS is idempotent for unassigned slots, so partial retries on the next reconcile are safe
  • Auth (-a) and TLS flags are handled per-call since the single-leader path now executes independently
  • The default: (multi-leader) path is unchanged in behavior
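The two paths described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `addSlotsRangeArgs`, `addSlotsBatches`, and `totalSlots` are hypothetical names, and the sketch only builds argument lists without executing anything against a pod.

```go
package main

import (
	"fmt"
	"strconv"
)

// totalSlots is the number of hash slots in a Redis Cluster (0..16383).
const totalSlots = 16384

// addSlotsRangeArgs (hypothetical name) builds the compact Redis 7+ command:
// one start/end pair instead of 16384 separate arguments.
func addSlotsRangeArgs() []string {
	return []string{"CLUSTER", "ADDSLOTSRANGE", "0", strconv.Itoa(totalSlots - 1)}
}

// addSlotsBatches (hypothetical name) builds the fallback for older Redis:
// one CLUSTER ADDSLOTS argument list per batch of at most batchSize slots.
func addSlotsBatches(batchSize int) [][]string {
	if batchSize <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < totalSlots; start += batchSize {
		end := start + batchSize
		if end > totalSlots {
			end = totalSlots
		}
		args := make([]string, 0, 2+end-start)
		args = append(args, "CLUSTER", "ADDSLOTS")
		for slot := start; slot < end; slot++ {
			args = append(args, strconv.Itoa(slot))
		}
		batches = append(batches, args)
	}
	return batches
}

func main() {
	fmt.Println(addSlotsRangeArgs()) // [CLUSTER ADDSLOTSRANGE 0 16383]

	batches := addSlotsBatches(1000)
	last := batches[len(batches)-1]
	// 16 full batches of 1000 slots plus a final batch of 384 slots.
	fmt.Println(len(batches), len(last)-2) // 17 384
}
```

With a batch size of 1000, each exec call carries at most 1002 arguments, which is what keeps the encoded exec URL short in the fallback path.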

Test plan

  • New unit test TestSingleLeaderAddSlotsBatching validates all 16384 slots are covered across 17 batches with no gaps or overlaps (v6 fallback path)
  • All existing tests pass (go test ./internal/k8sutils/...)
  • Deploy a RedisCluster with clusterSize: 1 on Redis 7, verify it bootstraps with ADDSLOTSRANGE
  • Deploy a RedisCluster with clusterSize: 3, verify multi-leader bootstrap is unaffected
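For illustration, the "no gaps or overlaps" property from the first test-plan item could be checked along these lines. This is a sketch under assumptions and not the actual `TestSingleLeaderAddSlotsBatching`: `chunkSlots` and `checkCoverage` are hypothetical helpers, and batches are modelled as plain integer slices.

```go
package main

import "fmt"

// chunkSlots (hypothetical) splits slots 0..total-1 into batches of at most size slots.
func chunkSlots(total, size int) [][]int {
	var batches [][]int
	for start := 0; start < total; start += size {
		end := start + size
		if end > total {
			end = total
		}
		batch := make([]int, 0, end-start)
		for slot := start; slot < end; slot++ {
			batch = append(batch, slot)
		}
		batches = append(batches, batch)
	}
	return batches
}

// checkCoverage (hypothetical) returns an error if any slot in 0..total-1 is
// missing, appears more than once, or falls outside the valid range.
func checkCoverage(batches [][]int, total int) error {
	seen := make([]bool, total)
	for i, batch := range batches {
		for _, slot := range batch {
			if slot < 0 || slot >= total {
				return fmt.Errorf("batch %d: slot %d out of range", i, slot)
			}
			if seen[slot] {
				return fmt.Errorf("batch %d: slot %d assigned twice", i, slot)
			}
			seen[slot] = true
		}
	}
	for slot, ok := range seen {
		if !ok {
			return fmt.Errorf("slot %d not covered", slot)
		}
	}
	return nil
}

func main() {
	batches := chunkSlots(16384, 1000)
	fmt.Println(len(batches), checkCoverage(batches, 16384)) // 17 <nil>
}
```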

@svils svils force-pushed the fix/single-leader-addslots-batch branch from 33be324 to 9dac93a Compare March 9, 2026 14:28
…xec URL limit

CreateSingleLeaderRedisCommand built a single redis-cli CLUSTER ADDSLOTS
command with all 16384 slot numbers as arguments. This exceeds the
Kubernetes pod exec SPDY URL length limit, causing the connection
upgrade to fail and the cluster to stay stuck in Bootstrap forever.

Replace with executeSingleLeaderAddSlots:
- Redis 7+ (default): use CLUSTER ADDSLOTSRANGE 0 16383 — a single
  compact command that takes a start-end range pair
- Redis <7 fallback: batch CLUSTER ADDSLOTS into chunks of 1000 per
  exec call to stay within the URL length limit

Fixes OT-CONTAINER-KIT#1704

Signed-off-by: svils <63684363+svils@users.noreply.github.com>
@svils svils force-pushed the fix/single-leader-addslots-batch branch from 9dac93a to 1d10833 Compare March 9, 2026 14:31
@codecov

codecov Bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 0% with 42 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@6864c09). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/k8sutils/redis.go | 0.00% | 42 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1706   +/-   ##
=======================================
  Coverage        ?   29.88%           
=======================================
  Files           ?       83           
  Lines           ?     6743           
  Branches        ?        0           
=======================================
  Hits            ?     2015           
  Misses          ?     4532           
  Partials        ?      196           


@svils svils changed the title from "fix: Batch CLUSTER ADDSLOTS for single-leader RedisCluster" to "fix: Batch CLUSTER ADDSLOTS for single-leader RedisCluster to avoid exec URL limit" on Mar 9, 2026
svils and others added 5 commits March 9, 2026 17:46
…g infinite Failed state after pod restarts

Broaden repairDisconnectedMasters to handle both failed masters and
failed slaves (renamed to RepairDisconnectedNodes). For slaves, issue
CLUSTER MEET to fix gossip and CLUSTER REPLICATE to re-establish
replication so the follower resolves its master's current IP.

Add RepairStaleReplication to detect connected followers whose
master_link_status is down and re-issue CLUSTER REPLICATE.

Harden cluster config defaults:
- tcp-keepalive 60s (faster dead connection detection)
- cluster-node-timeout 15000ms (more gossip recovery time, configurable)
- cluster-allow-reads-when-down yes (unblock clients during repair)
- probe TimeoutSeconds=5 / FailureThreshold=5 (prevent premature pod eviction)

Fixes OT-CONTAINER-KIT#1692

Signed-off-by: svils <63684363+svils@users.noreply.github.com>
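
The stale-replication check described in the commit above (a connected follower whose `master_link_status` is down) might look roughly like the sketch below. It assumes the operator already has the text of `INFO replication` for the follower; `parseInfo` and `needsReplicate` are hypothetical names and not the PR's `RepairStaleReplication` implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// parseInfo (hypothetical) turns the "key:value" lines of a Redis INFO reply
// into a map, skipping blank lines and "# Section" headers.
func parseInfo(info string) map[string]string {
	fields := make(map[string]string)
	for _, line := range strings.Split(info, "\n") {
		line = strings.TrimSpace(line) // also drops the trailing "\r"
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		fields[key] = value
	}
	return fields
}

// needsReplicate (hypothetical) reports whether a node is a replica whose link
// to its master is down, i.e. a candidate for re-issuing CLUSTER REPLICATE.
func needsReplicate(info string) bool {
	fields := parseInfo(info)
	return fields["role"] == "slave" && fields["master_link_status"] == "down"
}

func main() {
	// Sample INFO replication text; the address is made up for the example.
	stale := "# Replication\r\nrole:slave\r\nmaster_host:10.0.0.5\r\nmaster_link_status:down\r\n"
	healthy := "# Replication\r\nrole:slave\r\nmaster_host:10.0.0.5\r\nmaster_link_status:up\r\n"
	fmt.Println(needsReplicate(stale), needsReplicate(healthy)) // true false
}
```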
… is not empty" loop

ExecuteRedisReplicationCommand uses `redis-cli --cluster add-node` to
join followers to the cluster. This command requires the target node to
be completely empty — no cluster state and no keys in database 0.

After a leader-only CLUSTER RESET during single-node bootstrap (via
executeFailoverCommand), the follower retains its previous cluster
state and replicated data. The operator then enters an infinite error
loop:

  [ERR] Node <cluster>-follower-0...is not empty. Either the node
  already knows other nodes (check with CLUSTER NODES) or contains
  some key in database 0.

Add resetFollowerIfNotEmpty: before running add-node, check whether the
follower knows other nodes or has keys in db0. If so, issue CLUSTER
RESET HARD to clear both cluster state and data so add-node can succeed.
The follower will re-sync from the master via full sync after joining.

Related: OT-CONTAINER-KIT#1407

Signed-off-by: svils <63684363+svils@users.noreply.github.com>
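
The emptiness check described in the commit above could be approximated as follows. This is a sketch under assumptions, not the PR's `resetFollowerIfNotEmpty`: it assumes the caller has already fetched the follower's `CLUSTER NODES` output and its `DBSIZE` for database 0, and `followerNeedsReset` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// followerNeedsReset (hypothetical) reports whether a follower would be rejected
// by `redis-cli --cluster add-node` as "not empty": it either knows nodes other
// than itself or holds at least one key in database 0.
//
// clusterNodes is the raw CLUSTER NODES reply, one line per known node
// (including the node itself). dbSize is the DBSIZE result for database 0.
func followerNeedsReset(clusterNodes string, dbSize int64) bool {
	known := 0
	for _, line := range strings.Split(clusterNodes, "\n") {
		if strings.TrimSpace(line) != "" {
			known++
		}
	}
	return known > 1 || dbSize > 0
}

func main() {
	// Node IDs and addresses below are made up for the example.
	self := "aaa 10.0.0.2:6379@16379 myself,master - 0 0 0 connected\n"
	withPeer := self + "bbb 10.0.0.1:6379@16379 master - 0 0 1 connected 0-16383\n"

	fmt.Println(followerNeedsReset(self, 0))     // false: fresh node, add-node can proceed
	fmt.Println(followerNeedsReset(self, 3))     // true: keys left in db0
	fmt.Println(followerNeedsReset(withPeer, 0)) // true: stale cluster state
}
```

When the check returns true, the commit's approach is to issue CLUSTER RESET HARD on the follower before retrying add-node.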
@svils svils force-pushed the fix/single-leader-addslots-batch branch from 6f7025e to cbec96a Compare April 9, 2026 12:45

Development

Successfully merging this pull request may close these issues.

RedisCluster with clusterSize 1 stuck in Bootstrap forever
