golang: replace per-cgo-call destroy mutex with std::atomic<bool> #44662
wdauchy wants to merge 3 commits into envoyproxy:main
Conversation
Every CAPI call from Go into the C++ filter (getHeader, copyHeaders, setHeader, addData, continueStatus, ...) was acquiring Filter::mutex_ purely to read the has_destroyed_ flag. Off-thread Go callers paid the cost twice: once at the entry point and again inside the dispatcher-post lambda's hasDestroyed() check. On uncontended fast paths this is two atomic compare-and-swap operations and a memory fence per call, on top of the cgo boundary cost that is already paid for every API call.

Make has_destroyed_ a std::atomic<bool> with a release store on the single onDestroy() write and an acquire load on the read path. This turns the destroy check into a plain atomic load (free on x86, an ldar on ARM) and removes the mutex acquisition from 18 CAPI entry points and ~12 dispatcher-post lambdas. The Filter is kept alive across cgo calls by the shared_ptr taken at the cgo wrapper layer (cgo.cc), so seeing a false-then-true transition mid-call is benign; the lambdas also all check weak_from_this() expiration.

Filter::mutex_ is retained on the five CAPI methods that write to req_->strValue (getStringValue, getDynamicMetadata, getStringFilterState, getStringProperty, getSecret) to keep the existing serialization between off-thread Go callers writing to the shared scratch buffer.

No behaviour change for unit/integration tests; the existing onDestroy() ordering and the post-destroy CAPIFilterIsDestroy semantics are preserved.

Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
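The release-store/acquire-load pattern the commit message describes can be sketched as follows. The names (Filter, has_destroyed_, hasDestroyed(), onDestroy()) mirror the PR's vocabulary, but this is a standalone illustration, not the Envoy source.

```cpp
#include <atomic>
#include <cassert>

class Filter {
public:
  // Called once on the worker thread when the stream is torn down.
  void onDestroy() {
    // Release store: everything written before this line becomes visible to
    // any thread that subsequently observes has_destroyed_ == true.
    has_destroyed_.store(true, std::memory_order_release);
  }

  // Called from any CAPI entry point, possibly by an off-thread Go caller.
  bool hasDestroyed() const {
    // Acquire load: a plain MOV on x86, an ldar on ARM; no mutex needed.
    return has_destroyed_.load(std::memory_order_acquire);
  }

private:
  std::atomic<bool> has_destroyed_{false};
};
```

Compared with the old LockGuard, the uncontended read path drops from a compare-and-swap pair (lock/unlock) to a single load.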
// CAPIFilterIsDestroy as soon as they observe the store. The Filter itself is kept alive across
// any cgo call by the shared_ptr taken at the cgo wrapper layer (see cgo.cc), so observing a
// false-then-true transition during a call is benign.
This comment is very helpful. :)
Assigning a couple of the golang extension owners to review. Once either of them approves, I'll merge it.
CAPIStatus Filter::getHeader(ProcessorState& state, absl::string_view key, uint64_t* value_data,
                             int* value_len) {
  Thread::LockGuard lock(mutex_);
I'm not sure it's safe enough. The Filter is alive due to the shared_ptr, but does that mean it's safe to read the headers from another Go thread? I'm not sure about it.
Could this happen? The Filter is alive, but the headers have been freed by the onDestroy pipeline.
Good question, and it's worth being explicit about it. Short version: this PR doesn't change the lifetime invariant for state.headers; the previous mutex_ never participated in that invariant either.
What mutex_ was protecting before was strictly the read of has_destroyed_, with the ABSL_GUARDED_BY annotation only on that flag. Once the lock was released at the end of the entry function (or in the middle of continueStatus/sendLocalReply), the off-thread Go thread continued into m->get(LowerCaseString(key)) without holding any lock that would prevent the header map from being freed by the worker thread. So the scenario you describe, "Filter alive, headers freed by onDestroy pipeline, off-thread Go thread reads m", was already not synchronised by mutex_ in the old code; it's synchronised by a different invariant entirely.
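The pre-PR shape described here can be sketched as below: mutex_'s thread-safety contract (ABSL_GUARDED_BY) covered only the flag, never the header map the call goes on to dereference. The annotation macro is stubbed out so the sketch is self-contained; in Envoy it comes from absl/base/thread_annotations.h and feeds clang's -Wthread-safety analysis.

```cpp
#include <cassert>
#include <mutex>

// No-op stand-in for absl's annotation, so this sketch compiles standalone.
#define ABSL_GUARDED_BY(x)

class Filter {
public:
  bool hasDestroyed() {
    std::lock_guard<std::mutex> lock(mutex_);
    return has_destroyed_;  // the lock protects this read and nothing else
  }

  void onDestroy() {
    std::lock_guard<std::mutex> lock(mutex_);
    has_destroyed_ = true;
  }

private:
  std::mutex mutex_;
  // The annotation ties only the flag to mutex_; any object the CAPI call
  // dereferences after releasing the lock sits outside this contract.
  bool has_destroyed_ ABSL_GUARDED_BY(mutex_) = false;
};
```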
The invariant that protects state.headers from a use-after-free is:
- `state.headers = &headers` is set in `Filter::doHeaders` (worker thread), and cleared (`state.headers = nullptr`) only when Go returns synchronously with a non-`Running` status, i.e. `done == true`. While `done == false` (Go went async), the worker thread parks the stream and `state.headers` stays valid.
- All CAPI entry points additionally check `state.isProcessingInGo()`, which returns true exactly when `filterState()` is `ProcessingHeader`/`ProcessingData`/`ProcessingTrailer`, i.e. the same window during which `state.headers` is non-null.
- While processing is active, the filter chain holds the header map alive: `onDestroy()` is only invoked by the chain when the stream is being torn down. When that happens during async processing, the C++ side calls `envoyGoFilterOnHttpDestroy`, which on the Go side runs `req.resumeWaitCallback()`, waking any off-thread goroutine blocked on the callback condition. The Go side then bails with `CAPIFilterIsDestroy` on its next CAPI return.
So the off-thread call window is: "between state.isProcessingInGo() returning true and the actual m->get(...) dereference, can the header map go away?" The answer is no, because the worker thread has not yet returned from the doHeaders/doData/doTrailer parking and therefore has not yet entered any code that would free the header map for this stream. A reset that arrives during async processing causes onDestroy() to flip has_destroyed_ and notify Go, but it does not synchronously free the header map; that happens later in the request lifecycle, after the parked stream has been unwound. Any Go thread that was racing in getHeader either:
- already loaded `m` and is reading from a still-live header map, or
- has not yet loaded `m`, will see `has_destroyed_ == true` on the next attempt, and will get `CAPIFilterIsDestroy`.
The atomic has_destroyed_ check is sufficient for the flag race; the header map lifetime race is governed by the parking semantics above and was not improved or weakened by this PR.
That said, I think it's worth making this explicit in the code. I can add a comment block above getHeader (or, more generally, near state.isProcessingInGo() in processor_state.h) describing this invariant so future readers don't have to reconstruct it. Happy to push that in this PR if you want, just let me know which placement you prefer.
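The check ordering walked through above can be condensed into a sketch. The status names and the `ProcessorState`/`state.headers` shape follow the thread's vocabulary, but the types here are simplified stand-ins (a `std::map` in place of `Http::HeaderMap`), not the Envoy source.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <string>

enum class CAPIStatus { CAPIOK, CAPIFilterIsDestroy, CAPINotInGo, CAPIValueNotFound };

struct ProcessorState {
  bool processing_in_go = false;
  // Raw pointer to stream-owned storage; valid only while Go is processing.
  std::map<std::string, std::string>* headers = nullptr;
  bool isProcessingInGo() const { return processing_in_go; }
};

struct Filter {
  std::atomic<bool> has_destroyed_{false};

  CAPIStatus getHeader(ProcessorState& state, const std::string& key, std::string* out) {
    // 1. Destroy check: a plain acquire load, no lock.
    if (has_destroyed_.load(std::memory_order_acquire)) {
      return CAPIStatus::CAPIFilterIsDestroy;
    }
    // 2. Only inside the isProcessingInGo() window is state.headers non-null.
    if (!state.isProcessingInGo()) {
      return CAPIStatus::CAPINotInGo;
    }
    // 3. The dereference itself relies on a separate lifetime guarantee
    //    (the parking invariant debated in this thread) to keep the map alive.
    auto it = state.headers->find(key);
    if (it == state.headers->end()) {
      return CAPIStatus::CAPIValueNotFound;
    }
    *out = it->second;
    return CAPIStatus::CAPIOK;
  }
};
```

Note that step 1 and step 3 are protected by different mechanisms, which is exactly the distinction the rest of this discussion turns on.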
Thanks for the detailed explanation! I appreciate you walking through the invariants.
After reading the codebase, I found what might be a use-after-free:
```
off-thread goroutine                      worker thread
--------------------------                ----------------------------
cgo entry:
  weak_filter.lock() → shared_ptr<Filter>
  filter->getHeader(state, key, ...)
    hasDestroyed() → false
    isProcessingInGo() → true
    m = state.headers (raw ptr, valid)
    m->get(LowerCaseString(key))
    ... OS preempts goroutine here ...
                                          stream reset arrives
                                          doDeferredStreamDestroy():
                                            destroyFilters():
                                              Filter::onDestroy():
                                                has_destroyed_.exchange(true) ← no lock, no block
                                                envoyGoFilterOnHttpDestroy() → Go: OnDestroy
                                              onDestroy returns
                                            destroyFilters returns
                                          deferredDelete(stream)
                                          ... other event callbacks ...
                                          event loop tail:
                                            process deferred-delete queue
                                            ActiveStream destructs
                                              → HeaderMap freed  ← !!
    ... goroutine resumes ...
    reading freed HeaderMap memory → UAF
```
The critical point: we cannot assume m->get() is fast. The off-thread goroutine runs on a Go runtime thread, subject to OS scheduling. It can be preempted at any instruction boundary, and the entire event loop iteration (including deferred deletes) can complete before it resumes.
With the old mutex_, the worker thread would have blocked at step onDestroy() → LockGuard lock(mutex_), stalling the entire destruction pipeline until the goroutine finished reading. That protection is removed with the atomic.
The atomic conversion is correct and beneficial for the majority of CAPI methods — especially the write paths that dispatcher.post() to the worker thread, and methods that only touch Filter-owned state. I'd suggest keeping mutex_ only for the small set of methods that inline-dereference Envoy-owned objects from off-thread: getHeader, copyHeaders, copyTrailers, and getIntegerValue. This preserves most of the performance win while closing the lifetime gap.
Happy to discuss further — it's a subtle corner and I want to make sure we get it right.
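The worker-stall role described in this comment can be sketched as follows: because onDestroy() takes the same lock an off-thread reader holds across its dereference, the worker thread cannot reach the free while a read is in flight. This is a simplified stand-in (a heap int in place of the stream-owned HeaderMap, and the free pulled into onDestroy() rather than ~ActiveStream), not the Envoy source.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Filter {
  std::mutex mutex_;
  std::atomic<bool> has_destroyed_{false};
  int* headers = new int(42);  // stand-in for the stream-owned HeaderMap

  int readHeaderOffThread() {
    // Lock held across the dereference, as the old getHeader did.
    std::lock_guard<std::mutex> lock(mutex_);
    if (has_destroyed_.load(std::memory_order_acquire)) {
      return -1;  // CAPIFilterIsDestroy equivalent
    }
    return *headers;  // safe: onDestroy() below cannot run concurrently
  }

  void onDestroy() {
    // The destruction pipeline stalls here until in-flight readers unwind.
    std::lock_guard<std::mutex> lock(mutex_);
    has_destroyed_.store(true, std::memory_order_release);
    delete headers;  // in Envoy the free happens later, in ~ActiveStream
    headers = nullptr;
  }
};
```

With only the atomic flag and no lock, nothing stops onDestroy() (and the eventual free) from completing between the reader's flag check and its dereference, which is exactly the window in the timeline above.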
You're right, sorry for the misleading reply earlier. I missed that the old getHeader held mutex_ across m->get(...) itself, not just the destroy-flag check, which is exactly what makes the worker-thread stall on onDestroy()'s LockGuard a real lifetime barrier. Your timeline is correct: without that stall, the goroutine can be preempted between loading m = state.headers and dereferencing it, and the worker can run through destroyFilters -> deferred-delete -> ~ActiveStream -> header-map free in the meantime.
Pushed a follow-up commit that restores the lock on the four methods you named (getHeader, copyHeaders, copyTrailers, getIntegerValue), with inline comments explaining the worker-stall role at each callsite, and a longer block on mutex_'s declaration documenting its two distinct roles (strValue serialisation + worker-stall fence). Changelog narrowed accordingly. Everything else stays lock-free: writes that go through dispatcher.post, buffer methods on Filter-owned doDataList, the worker-thread inline branches, and the dispatcher-post lambdas' hasDestroyed() re-check.
Happy to move the longer comment to processor_state.h near isProcessingInGo() if you'd prefer it there.
The previous commit dropped Filter::mutex_ from every CAPI entry point, on the basis that has_destroyed_ no longer needed lock-based protection. That reasoning was incomplete: for CAPI methods that inline-dereference Envoy-owned objects whose lifetime is tied to the parent stream (rather than to the Filter), the mutex was not just guarding the destroy flag; it was also stalling the worker thread inside Filter::onDestroy() until every in-flight off-thread reader had unwound.

Without that stall, the worker thread can return from onDestroy(), continue past destroyFilters(), reach deferredDelete(stream), and at the event-loop tail destruct ActiveStream and free the HeaderMap / TrailerMap / StreamInfo while an off-thread Go goroutine is mid-dereference. The goroutine is just a regular OS-scheduled thread and can be preempted at any instruction boundary, so this is reachable in practice.

Restore the lock on the four CAPI methods that perform such inline dereferences from off-thread:

- getHeader: reads state.headers (Http::HeaderMap)
- copyHeaders: iterates state.headers
- copyTrailers: iterates state.trailers (Http::HeaderMap)
- getIntegerValue: reads streamInfo() and SSL/upstream sub-objects

All other entry points remain lock-free: writes that go through dispatcher.post (setHeader, removeHeader, setTrailer, removeTrailer, setUpstreamOverrideHost, setDynamicMetadata, setStringFilterState, clearRouteCache), worker-thread-only inline branches (addData when isThreadSafe), buffer methods that operate on Filter-owned doDataList (copyBuffer, drainBuffer, setBufferHelper), and continueStatus / sendLocalReply / setDrainConnectionUponCompletion.

The five strValue writers (getStringValue, getDynamicMetadata, getStringFilterState, getStringProperty, getSecret) keep the lock for both their original strValue serialisation purpose and the worker-stall purpose.

Document mutex_'s two distinct roles on its declaration in the header, add inline rationale comments at each of the four restored callsites, and narrow the changelog entry to reflect the actual scope of the optimisation.

Reported-by: doujiang24 <doujiangduoduo@gmail.com>
Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
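The lock-free write path that the commit keeps can be sketched as below: the off-thread caller posts a lambda to the worker dispatcher, and the lambda re-checks both filter liveness (via weak_ptr) and the destroy flag before touching stream state. Dispatcher here is a toy single-threaded stand-in for Envoy's Event::Dispatcher, and header_value a stand-in for mutable stream state; none of this is the Envoy source.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <string>

// Toy stand-in for Event::Dispatcher: queues work for the "worker thread".
struct Dispatcher {
  std::queue<std::function<void()>> q;
  void post(std::function<void()> cb) { q.push(std::move(cb)); }
  void runAll() {
    while (!q.empty()) {
      q.front()();
      q.pop();
    }
  }
};

struct Filter : std::enable_shared_from_this<Filter> {
  std::atomic<bool> has_destroyed_{false};
  std::string header_value;  // stand-in for stream state owned by the worker

  bool hasDestroyed() const { return has_destroyed_.load(std::memory_order_acquire); }

  // Lock-free CAPI write path: a cheap acquire load at the entry point, and a
  // liveness re-check inside the posted lambda before the actual write.
  void setHeader(Dispatcher& d, std::string value) {
    if (hasDestroyed()) {
      return;  // CAPIFilterIsDestroy equivalent
    }
    std::weak_ptr<Filter> weak = weak_from_this();
    d.post([weak, value = std::move(value)]() {
      auto self = weak.lock();
      if (!self || self->hasDestroyed()) {
        return;  // filter freed or destroyed since the post
      }
      self->header_value = value;  // safe: runs on the worker thread
    });
  }
};
```

Because the write itself always executes on the worker thread, no stream-owned object is dereferenced off-thread, so these paths need neither the strValue serialisation nor the worker-stall role of mutex_.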
|
/retest transients
  }

  {
    Thread::LockGuard lock(mutex_);
Do we still need it in onDestroy, to prevent the headers from being freed?