
golang: replace per-cgo-call destroy mutex with std::atomic<bool> #44662

Open

wdauchy wants to merge 3 commits into envoyproxy:main from wdauchy:optimize/golang-filter-atomic-destroyed

Conversation

wdauchy (Contributor) commented Apr 26, 2026

Commit Message:
Every CAPI call from Go into the C++ filter (getHeader, copyHeaders, setHeader, addData, continueStatus, ...) was acquiring Filter::mutex_ purely to read the has_destroyed_ flag. Off-thread Go callers paid the cost twice: once at the entry point and again inside the dispatcher-post lambda's hasDestroyed() check. On uncontended fast paths this is two atomic compare-and-swap operations and a memory fence per call, on top of the cgo boundary cost that's already paid for every API call.

Make has_destroyed_ a std::atomic<bool> with release-store on the single onDestroy() write and acquire-load on the read path. This turns the destroy check into a plain atomic load (free on x86, an ldar on ARM) and removes the mutex acquisition from 18 CAPI entry points and ~12 dispatcher-post lambdas. The Filter is kept alive across cgo calls by the shared_ptr taken at the cgo wrapper layer (cgo.cc), so seeing a false-then-true transition mid-call is benign; the lambdas all also check weak_from_this() expiration.

Filter::mutex_ is retained on the five CAPI methods that write to req_->strValue (getStringValue, getDynamicMetadata, getStringFilterState, getStringProperty, getSecret) to keep the existing serialization between off-thread Go callers writing to the shared scratch buffer.
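
For illustration, the overall shape is roughly the following (a sketch, not the actual filter code: has_destroyed_, hasDestroyed(), onDestroy(), mutex_, req_->strValue and CAPIFilterIsDestroy are the names discussed above, while the CAPIOK value, the string lookup, and the use of plain std::mutex in place of Envoy's Thread::LockGuard are simplifications):

  #include <atomic>
  #include <cstdint>
  #include <mutex>
  #include <string>

  enum class CAPIStatus { CAPIOK, CAPIFilterIsDestroy };  // value names assumed

  struct RequestSketch {
    std::string strValue;  // shared scratch buffer written by several CAPI getters
  };

  class FilterSketch {
  public:
    // Read path: every CAPI entry point checks the flag with an acquire load,
    // which pairs with the release store in onDestroy(); no mutex is needed
    // just to observe the flag.
    bool hasDestroyed() const { return has_destroyed_.load(std::memory_order_acquire); }

    // Single writer: the worker thread flips the flag once during stream teardown.
    void onDestroy() {
      has_destroyed_.store(true, std::memory_order_release);
      // ... notify the Go side so blocked goroutines can unwind ...
    }

    // One of the five strValue writers: the lock is retained here, not for the
    // destroy flag, but to serialise off-thread Go callers that all write into
    // the same scratch buffer.
    CAPIStatus getStringValueSketch(uint64_t* value_data, int* value_len) {
      std::lock_guard<std::mutex> lock(mutex_);
      if (hasDestroyed()) {
        return CAPIStatus::CAPIFilterIsDestroy;
      }
      req_.strValue = "looked-up value";  // stands in for the real lookup
      *value_data = reinterpret_cast<uint64_t>(req_.strValue.data());
      *value_len = static_cast<int>(req_.strValue.length());
      return CAPIStatus::CAPIOK;
    }

  private:
    std::atomic<bool> has_destroyed_{false};
    std::mutex mutex_;
    RequestSketch req_;
  };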

No behaviour change for unit/integration tests; existing onDestroy() ordering and the post-destroy CAPIFilterIsDestroy semantics are preserved.

Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #44662 was opened by wdauchy.

see: more, trace.

@wdauchy wdauchy marked this pull request as ready for review April 26, 2026 09:24
ravenblackx previously approved these changes Apr 27, 2026
Comment on lines +419 to +421
// CAPIFilterIsDestroy as soon as they observe the store. The Filter itself is kept alive across
// any cgo call by the shared_ptr taken at the cgo wrapper layer (see cgo.cc), so observing a
// false-then-true transition during a call is benign.
Contributor

This comment is very helpful. :)

ravenblackx (Contributor)

Assigning a couple of the golang extension owners to review. Once either of them approves, I'll merge it.


CAPIStatus Filter::getHeader(ProcessorState& state, absl::string_view key, uint64_t* value_data,
int* value_len) {
Thread::LockGuard lock(mutex_);
Member

I'm not sure it's safe enough. The Filter is kept alive by the shared_ptr, but does that mean it's safe to read the headers from another Go thread? I'm not sure about it.

Could this happen: the Filter is alive, but the headers have been freed by the onDestroy pipeline?

wdauchy (Contributor Author) commented Apr 28, 2026

Good question, and it's worth being explicit about it. Short version: this PR doesn't change the lifetime invariant for state.headers; the previous mutex_ never participated in that invariant either.

What mutex_ was protecting before was strictly the read of has_destroyed_, with the ABSL_GUARDED_BY annotation only on that flag. Once the lock was released at the end of the entry function (or in the middle of continueStatus/sendLocalReply), the off-thread Go thread continued into m->get(LowerCaseString(key)) without holding any lock that would prevent the header map from being freed by the worker thread. So the scenario you describe, "Filter alive, headers freed by onDestroy pipeline, off-thread Go thread reads m", was already not synchronised by mutex_ in the old code; it's synchronised by a different invariant entirely.

The invariant that protects state.headers from a use-after-free is:

  1. state.headers = &headers is set in Filter::doHeaders (worker thread), and cleared (state.headers = nullptr) only when Go returns synchronously with a non-Running status, i.e. done == true. While done == false (Go went async), the worker thread parks the stream and state.headers stays valid.
  2. All CAPI entry points additionally check state.isProcessingInGo(), which returns true exactly when filterState() is ProcessingHeader/ProcessingData/ProcessingTrailer, i.e. the same window during which state.headers is non-null.
  3. While processing is active, the filter chain holds the header map alive: onDestroy() is only invoked by the chain when the stream is being torn down. When that happens during async processing, the C++ side calls envoyGoFilterOnHttpDestroy which on the Go side runs req.resumeWaitCallback(), waking any off-thread goroutine blocked on the callback condition. The Go side then bails with CAPIFilterIsDestroy on its next CAPI return.

So the question for the off-thread call window is: between state.isProcessingInGo() returning true and the actual m->get(...) dereference, can the header map go away? The answer is no, because the worker thread has not yet returned from the doHeaders/doData/doTrailer parking and therefore has not yet entered any code that would free the header map for this stream. A reset that arrives during async processing causes onDestroy() to flip has_destroyed_ and notify Go, but it does not synchronously free the header map; that happens later in the request lifecycle, after the parked stream has been unwound. Any Go thread that was racing in getHeader either:

  • already loaded m and is reading from a still-live header map, or
  • has not yet loaded m, will see has_destroyed_ == true on the next attempt, and will get CAPIFilterIsDestroy.

The atomic has_destroyed_ check is sufficient for the flag race; the header map lifetime race is governed by the parking semantics above and was not improved or weakened by this PR.
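
For concreteness, the check order on the off-thread read path described above is roughly the following (a self-contained sketch with mocked types; the real code uses Filter, ProcessorState and Http::HeaderMap, and the status names are assumed):

  #include <map>
  #include <string>

  // Mocked stand-ins for the Envoy objects discussed above.
  struct HeaderMapSketch {
    std::map<std::string, std::string> entries;
  };

  struct ProcessorStateSketch {
    bool processing_in_go = false;        // mirrors state.isProcessingInGo()
    HeaderMapSketch* headers = nullptr;   // set in doHeaders, cleared when Go returns done
  };

  enum class StatusSketch { Ok, FilterIsDestroy, NotInGo, NotFound };

  StatusSketch getHeaderSketch(bool has_destroyed, const ProcessorStateSketch& state,
                               const std::string& key, std::string& out) {
    if (has_destroyed) {
      return StatusSketch::FilterIsDestroy;       // 1. atomic destroy flag observed first
    }
    if (!state.processing_in_go || state.headers == nullptr) {
      return StatusSketch::NotInGo;               // 2. only valid while the stream is parked in Go
    }
    auto it = state.headers->entries.find(key);   // 3. the dereference under discussion
    if (it == state.headers->entries.end()) {
      return StatusSketch::NotFound;
    }
    out = it->second;
    return StatusSketch::Ok;
  }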

That said, I think it's worth making this explicit in the code. I can add a comment block above getHeader (or, more generally, near state.isProcessingInGo() in processor_state.h) describing this invariant so future readers don't have to reconstruct it. Happy to push that in this PR if you want, just let me know which placement you prefer.

Member

Thanks for the detailed explanation! I appreciate you walking through the invariants.
After reading the codebase, I found there might be a use-after-free:

  off-thread goroutine                          worker thread
  --------------------------                    ----------------------------
  cgo entry:
    weak_filter.lock() → shared_ptr<Filter>
    filter->getHeader(state, key, ...)                                                                                                                                                
      hasDestroyed() → false
      isProcessingInGo() → true                                                                                                                                                       
      m = state.headers   (raw ptr, valid)                            
      m->get(LowerCaseString(key))                                                                                                                                                  
        ... OS preempts goroutine here ...
                                                stream reset arrives
                                                doDeferredStreamDestroy():
                                                  destroyFilters():
                                                    Filter::onDestroy():
                                                      has_destroyed_.exchange(true) ← no lock, no block
                                                      envoyGoFilterOnHttpDestroy() → Go: OnDestroy
                                                    onDestroy returns
                                                  destroyFilters returns
                                                  deferredDelete(stream)
                                                ... other event callbacks ...
                                                event loop tail:
                                                  process deferred-delete queue
                                                  ActiveStream destructs
                                                    → HeaderMap freed ← !!

        ... goroutine resumes ...
        reading freed HeaderMap memory → UAF

The critical point: we cannot assume m->get() is fast. The off-thread goroutine runs on a Go runtime thread, subject to OS scheduling. It can be preempted at any instruction boundary, and the entire event loop iteration (including deferred deletes) can complete before it resumes.

With the old mutex_, the worker thread would have blocked at step onDestroy() → LockGuard lock(mutex_), stalling the entire destruction pipeline until the goroutine finished reading. That protection is removed with the atomic.
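
A minimal sketch of that stall, assuming std::mutex in place of Envoy's mutex type and with all names illustrative:

  #include <atomic>
  #include <mutex>

  // Old locking discipline: the off-thread reader holds mutex_ across the whole
  // dereference, so onDestroy() — which also takes mutex_ — cannot return, and
  // the destruction pipeline cannot reach deferredDelete()/~ActiveStream, until
  // the in-flight reader has unwound.
  struct StallSketch {
    std::mutex mutex_;
    std::atomic<bool> has_destroyed_{false};

    void readHeadersLocked() {                     // off-thread Go caller
      std::lock_guard<std::mutex> lock(mutex_);
      if (has_destroyed_.load(std::memory_order_acquire)) {
        return;                                    // CAPIFilterIsDestroy in the real code
      }
      // ... m->get(...) happens here, still under the lock ...
    }

    void onDestroy() {                             // worker thread, during teardown
      std::lock_guard<std::mutex> lock(mutex_);    // blocks until any reader releases
      has_destroyed_.store(true, std::memory_order_release);
    }
  };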

The atomic conversion is correct and beneficial for the majority of CAPI methods — especially the write paths that dispatcher.post() to the worker thread, and methods that only touch Filter-owned state. I'd suggest keeping mutex_ only for the small set of methods that inline-dereference Envoy-owned objects from off-thread: getHeader, copyHeaders, copyTrailers, and getIntegerValue. This preserves most of the performance win while closing the lifetime gap.

Happy to discuss further — it's a subtle corner and I want to make sure we get it right.

wdauchy (Contributor Author)

You're right, sorry for the misleading reply earlier. I missed that the old getHeader held mutex_ across m->get(...) itself, not just the destroy-flag check, which is exactly what makes the worker-thread stall on onDestroy()'s LockGuard a real lifetime barrier. Your timeline is correct: without that stall, the goroutine can be preempted between loading m = state.headers and dereferencing it, and the worker can run through destroyFilters -> deferred-delete -> ~ActiveStream -> header-map free in the meantime.

Pushed a follow-up commit that restores the lock on the four methods you named (getHeader, copyHeaders, copyTrailers, getIntegerValue), with inline comments explaining the worker-stall role at each callsite, and a longer block on mutex_'s declaration documenting its two distinct roles (strValue serialisation + worker-stall fence). Changelog narrowed accordingly. Everything else stays lock-free: writes that go through dispatcher.post, buffer methods on Filter-owned doDataList, the worker-thread inline branches, and the dispatcher-post lambdas' hasDestroyed() re-check.

Happy to move the longer comment to processor_state.h near isProcessingInGo() if you'd prefer it there.

The previous commit dropped Filter::mutex_ from every CAPI entry point,
on the basis that has_destroyed_ no longer needed lock-based protection.
That reasoning was incomplete: for CAPI methods that inline-dereference
Envoy-owned objects whose lifetime is tied to the parent stream (rather
than to the Filter), the mutex was not just guarding the destroy flag,
it was also stalling the worker thread inside Filter::onDestroy() until
every in-flight off-thread reader had unwound. Without that stall, the
worker thread can return from onDestroy(), continue past
destroyFilters(), reach deferredDelete(stream), and at the event-loop
tail destruct ActiveStream and free the HeaderMap / TrailerMap /
StreamInfo while an off-thread Go goroutine is mid-dereference. The
goroutine is just a regular OS-scheduled thread and can be preempted at
any instruction boundary, so this is reachable in practice.

Restore the lock on the four CAPI methods that perform such inline
dereferences from off-thread:

  - getHeader:       reads state.headers (Http::HeaderMap)
  - copyHeaders:     iterates state.headers
  - copyTrailers:    iterates state.trailers (Http::HeaderMap)
  - getIntegerValue: reads streamInfo() and SSL/upstream sub-objects

All other entry points remain lock-free: writes that go through
dispatcher.post (setHeader, removeHeader, setTrailer, removeTrailer,
setUpstreamOverrideHost, setDynamicMetadata, setStringFilterState,
clearRouteCache), worker-thread-only inline branches (addData when
isThreadSafe), buffer methods that operate on Filter-owned doDataList
(copyBuffer, drainBuffer, setBufferHelper), and continueStatus /
sendLocalReply / setDrainConnectionUponCompletion. The five strValue
writers (getStringValue, getDynamicMetadata, getStringFilterState,
getStringProperty, getSecret) keep the lock for both their original
strValue serialisation purpose and the worker-stall purpose.
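
The lock-free write-path shape described above looks roughly like this (a sketch; dispatcher.post, weak_from_this and the hasDestroyed re-check are the mechanisms named in this commit message, everything else is illustrative):

  #include <atomic>
  #include <functional>
  #include <memory>
  #include <string>

  // Minimal stand-in for Event::Dispatcher: the real post() queues the callback
  // onto the worker thread's event loop instead of running it inline.
  struct DispatcherSketch {
    void post(std::function<void()> cb) { cb(); }
  };

  class WritePathSketch : public std::enable_shared_from_this<WritePathSketch> {
  public:
    bool hasDestroyed() const { return has_destroyed_.load(std::memory_order_acquire); }

    // e.g. setHeader: no mutex at the entry point. The mutation is posted to the
    // worker thread, and the lambda re-checks liveness (weak_from_this) and the
    // destroy flag before touching any stream-owned state.
    void setHeaderSketch(DispatcherSketch& dispatcher, std::string key, std::string value) {
      if (hasDestroyed()) {
        return;  // CAPIFilterIsDestroy in the real code
      }
      auto weak = weak_from_this();
      dispatcher.post([weak, key = std::move(key), value = std::move(value)]() {
        auto self = weak.lock();
        if (self == nullptr || self->hasDestroyed()) {
          return;  // filter already gone or destroyed; nothing to touch
        }
        // ... mutate the header map with key/value here, on the worker thread ...
      });
    }

  private:
    std::atomic<bool> has_destroyed_{false};
  };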

Document mutex_'s two distinct roles on its declaration in the header,
add inline rationale comments at each of the four restored callsites,
and narrow the changelog entry to reflect the actual scope of the
optimisation.

Reported-by: doujiang24 <doujiangduoduo@gmail.com>
Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
@wdauchy wdauchy force-pushed the optimize/golang-filter-atomic-destroyed branch from 6919563 to 814a7b5 on April 28, 2026 13:15
wdauchy (Contributor Author) commented Apr 28, 2026

/retest transients

}

{
Thread::LockGuard lock(mutex_);
Member

We still need it in onDestroy, to prevent the headers from being freed?
