
controllerutil: guard ResourceWatcher.watched with a RWMutex and use pointer receivers #1747

Open
SAY-5 wants to merge 1 commit into OT-CONTAINER-KIT:main from SAY-5:fix/resource-watcher-mutex-1739

SAY-5 commented Apr 20, 2026

Addresses the first root cause (ResourceWatcher data race) called out in #1739.

Problem

ResourceWatcher.Watch and handleEvent read and mutate the shared watched map with no synchronisation. As soon as MAX_CONCURRENT_RECONCILES is raised above 1 (the issue report used 20 to drive 30-100 RedisReplication CRs), multiple reconcilers race on the map. Go's runtime can then abort the operator with a fatal "concurrent map writes" or "concurrent map read and map write" error, the race detector (-race) flags the accesses, and in production the race shows up as non-deterministic duplicate enqueues and skipped dependents.

Fix

  • Add a sync.RWMutex. Watch takes it exclusively while appending; handleEvent takes the read lock to snapshot the dependent list and releases it before enqueueing so the queue.Add loop does not hold the lock while controller-runtime is doing work.
  • Promote every receiver to a pointer. The previous value receivers happened to mutate the map through its header, because a copied struct still points at the same underlying map. A sync.RWMutex held by value, however, would be copied on every call, which defeats the lock and is what go vet's copylocks check reports. Pointer receivers keep the mutex tied to the one struct the reconciler constructed via NewResourceWatcher.
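
For reviewers who want the shape of the change without opening the diff, here is a minimal, self-contained sketch of the locking pattern described above. It is not the code in this PR: the Key type, the mu field name, the method signatures and the enqueue callback (standing in for queue.Add) are all assumptions made so the example runs without controller-runtime.

```go
package main

import (
	"fmt"
	"sync"
)

// Key is a hypothetical stand-in for the namespaced name the operator tracks.
type Key struct {
	Namespace, Name string
}

// ResourceWatcher is a simplified illustration. Only the names watched,
// Watch, handleEvent and NewResourceWatcher come from the PR text; the
// field layout and signatures are assumed.
type ResourceWatcher struct {
	mu      sync.RWMutex
	watched map[Key][]Key
}

// NewResourceWatcher returns a pointer so every caller shares one mutex.
func NewResourceWatcher() *ResourceWatcher {
	return &ResourceWatcher{watched: make(map[Key][]Key)}
}

// Watch takes the write lock while appending a dependent for target.
func (w *ResourceWatcher) Watch(target, dependent Key) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.watched[target] = append(w.watched[target], dependent)
}

// handleEvent copies the dependent list under the read lock, releases the
// lock, and only then calls enqueue, so no lock is held during enqueueing.
func (w *ResourceWatcher) handleEvent(target Key, enqueue func(Key)) {
	w.mu.RLock()
	deps := append([]Key(nil), w.watched[target]...)
	w.mu.RUnlock()

	for _, d := range deps {
		enqueue(d)
	}
}

func main() {
	w := NewResourceWatcher()
	secret := Key{Namespace: "default", Name: "redis-secret"}

	// Simulate 20 concurrent reconcilers registering and handling events.
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			w.Watch(secret, Key{Namespace: "default", Name: fmt.Sprintf("replication-%d", i)})
			w.handleEvent(secret, func(Key) {}) // concurrent readers
		}(i)
	}
	wg.Wait()

	count := 0
	w.handleEvent(secret, func(Key) { count++ })
	fmt.Println("dependents enqueued:", count) // prints "dependents enqueued: 20"
}
```

Changing any of these methods to a value receiver would make each call operate on its own copy of mu, so the lock would no longer exclude anything. That is the reason the receivers and the mutex have to change together.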

This only covers the ResourceWatcher data race. The other two root causes in the issue (TCP connection leaks in GetRedisNodesByRole and redundant network I/O across reconcile phases) are better addressed in separate, reviewable PRs.

…pointer receivers

ResourceWatcher.Watch and handleEvent read and mutate the shared
watched map with no synchronisation. As soon as
MAX_CONCURRENT_RECONCILES is raised above 1 (the issue used 20 to
drive 30-100 RedisReplication CRs), multiple reconcilers race on the
map. Go's runtime can then abort the operator with a fatal
"concurrent map writes" or "concurrent map read and map write" error,
the race detector (-race) flags the accesses, and production sees
non-deterministic duplicate enqueues / skipped deps
(OT-CONTAINER-KIT#1739).

Add a sync.RWMutex: Watch takes it exclusively while appending,
handleEvent takes the read lock to snapshot the dependent list and
releases it before enqueueing so the queue.Add loop does not hold
the lock while controller-runtime is doing work.

Promote every receiver to a pointer. The previous value receivers
happened to mutate the map through its header, but a sync.RWMutex
held by value would be copied on every call, which defeats the lock
and is what `go vet`'s copylocks check reports. Pointer receivers
keep the mutex tied to the one struct the reconciler constructed via
NewResourceWatcher.

Fixes OT-CONTAINER-KIT#1739