controllerutil: guard ResourceWatcher.watched with a RWMutex and use pointer receivers#1747
Open
SAY-5 wants to merge 1 commit into OT-CONTAINER-KIT:main from
Conversation
…pointer receivers

ResourceWatcher.Watch and handleEvent read and mutate the shared watched map with no synchronisation. The moment MAX_CONCURRENT_RECONCILES is raised above 1 (the issue used 20 to drive 30-100 RedisReplication CRs), multiple reconcilers race on the map: Go's runtime aborts the operator with concurrent map write/read panics under -race, and in production the race produces non-deterministic duplicate enqueues and skipped dependents (OT-CONTAINER-KIT#1739).

Add a sync.RWMutex: Watch takes it exclusively while appending; handleEvent takes the read lock to snapshot the dependent list and releases it before enqueueing, so the queue.Add loop does not hold the lock while controller-runtime is doing work.

Promote every receiver to a pointer. The previous value receivers happened to mutate the map through its header, but a sync.Mutex by value is copied on every call, which defeats the mutex and is the standard `go vet` error. Pointer receivers keep the mutex tied to the one struct the reconciler constructed via NewResourceWatcher.

Fixes OT-CONTAINER-KIT#1739
Addresses the first root cause (ResourceWatcher data race) called out in #1739.
Problem
ResourceWatcher.Watch and handleEvent read and mutate the shared watched map with no synchronisation. The moment MAX_CONCURRENT_RECONCILES is raised above 1 (the issue report used 20 to drive 30-100 RedisReplication CRs), multiple reconcilers race on the map and Go's runtime aborts the operator with concurrent map write/read panics under -race, and produces non-deterministic duplicate enqueues / skipped dependents in production.

Fix
Add a sync.RWMutex: Watch takes it exclusively while appending; handleEvent takes the read lock to snapshot the dependent list and releases it before enqueueing, so the queue.Add loop does not hold the lock while controller-runtime is doing work.

Promote every receiver to a pointer: a sync.Mutex by value is copied on every call, which defeats the mutex and is the standard go vet error. Pointer receivers keep the mutex tied to the one struct the reconciler constructed via NewResourceWatcher.

This only covers the ResourceWatcher data race. The other two root causes in the issue (TCP connection leaks in GetRedisNodesByRole and redundant network I/O across reconcile phases) are better addressed in separate, reviewable PRs.