fix(sdk): prevent overdrain race in batch span/log export #3441
xofyarg wants to merge 1 commit into open-telemetry:main
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##            main   #3441    +/-   ##
=====================================
  Coverage   83.2%   83.2%
=====================================
  Files        128     128
  Lines      25048   25086     +38
=====================================
+ Hits       20859   20896     +37
- Misses      4189    4190      +1
```
This avoids the busy loop, but it also changes the existing behavior. PS: you also need to sign the CLA.
Now I think I've figured out the real bug. Quoting the commit message: with this fix, the behavior is kept the same. We would have to add retry in … Please let me know what you think @lalitb; I'm working on the CLA at the moment.
This makes sense to me now. My earlier concern was about the old break-on-empty change, because that changed the existing behavior. I agree there is still a separate existing window, but I will approve this once the CLA is fixed :)
On the producer side, there is a window between enqueuing an item and incrementing `current_batch_size`. During that window, the consumer can drain more items from the channel than are reflected in the snapshotted target, causing `fetch_sub` to underflow and wrap the global counter. A later export cycle can then observe the wrapped counter value and spin when `try_recv()` makes no progress. Fix this by capping each drain iteration to the remaining snapshotted target so the helper never subtracts more items than were counted. Also strengthen the counter synchronization with acquire/acqrel ordering.
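A minimal, self-contained sketch of the capping approach the commit message describes (not the actual SDK code; `drain_capped` and the plain `mpsc` channel are illustrative stand-ins): each drain cycle snapshots `current_batch_size`, never takes more than that many items, and subtracts exactly the number drained, so `fetch_sub` cannot underflow even when the channel momentarily holds more items than have been counted.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;

// Hypothetical simplified drain helper: cap each iteration to the
// snapshotted target so we never subtract more than was counted.
fn drain_capped<T>(rx: &mpsc::Receiver<T>, current_batch_size: &AtomicUsize) -> Vec<T> {
    // Acquire pairs with the producer's Release increment.
    let target = current_batch_size.load(Ordering::Acquire);
    let mut batch = Vec::with_capacity(target);
    while batch.len() < target {
        match rx.try_recv() {
            Ok(item) => batch.push(item),
            Err(_) => break, // channel empty or disconnected
        }
    }
    // Subtract only what was actually drained. No underflow is possible
    // because batch.len() <= target <= counter value at the snapshot.
    current_batch_size.fetch_sub(batch.len(), Ordering::AcqRel);
    batch
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let counter = AtomicUsize::new(0);

    // Simulate the producer-side window: two items are already in the
    // channel, but only one has been counted so far.
    tx.send(1u32).unwrap();
    tx.send(2u32).unwrap();
    counter.fetch_add(1, Ordering::Release);

    let batch = drain_capped(&rx, &counter);
    // Only the counted item is drained; the counter stays at 0 instead
    // of wrapping to usize::MAX. The second item is picked up once the
    // producer finishes its increment and the next cycle runs.
    assert_eq!(batch, vec![1]);
    assert_eq!(counter.load(Ordering::Acquire), 0);
}
```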
Signed CLA and updated the commit log title to match the pattern.
LGTM. We can wait a couple of days for more eyes before merging.
A race between the atomic `current_batch_size` increment (`Ordering::Relaxed`) and the channel send in `on_end()`/`emit()` can cause `get_spans_and_export` (and `get_logs_and_export`) to spin on `try_recv()`/`Instant::now()` with no progress. Break out when `try_recv()` drains nothing; remaining items are picked up on the next export cycle.
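A minimal sketch of the break-on-empty fix described above (not the actual SDK code; `collect_batch` and the plain `mpsc` channel are illustrative stand-ins): when the snapshotted target overstates what is in the channel, the loop exits on the first empty `try_recv()` instead of busy-polling `try_recv()`/`Instant::now()` until the deadline.

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

// Simplified export drain loop: stop either at the target, at the
// deadline, or as soon as the channel has nothing to give.
fn collect_batch<T>(rx: &mpsc::Receiver<T>, target: usize, deadline: Duration) -> Vec<T> {
    let start = Instant::now();
    let mut batch = Vec::new();
    while batch.len() < target && start.elapsed() < deadline {
        match rx.try_recv() {
            Ok(item) => batch.push(item),
            // Without this break, an overstated (e.g. wrapped) target
            // makes the loop spin here with no progress until the
            // deadline expires.
            Err(_) => break,
        }
    }
    batch
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send("span-a").unwrap();

    // Target far exceeds what is in the channel (as after a counter
    // wrap); the loop returns immediately instead of spinning.
    let t = Instant::now();
    let batch = collect_batch(&rx, 1_000, Duration::from_secs(5));
    assert_eq!(batch.len(), 1);
    assert!(t.elapsed() < Duration::from_secs(1));
}
```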
Fixes #
After upgrading `opentelemetry_sdk` from 0.26.0 to 0.31.0, we had an issue where `span_processor::BatchSpanProcessor::export_batch_sync` showed up in a flamegraph with very high CPU usage. The application hadn't changed, so the increase is unlikely to come from more tracing spans. Although I cannot explain exactly how, it looks like a race condition leads to a spin loop.

Changes
Add a condition check to break the spin loop when `spans_receiver.try_recv` returns an error. This is safe and won't lose spans; the loop simply returns early.

Merge requirement checklist
`CHANGELOG.md` files updated for non-trivial, user-facing changes