
fix(sdk): prevent overdrain race in batch span/log export#3441

Open
xofyarg wants to merge 1 commit into open-telemetry:main from xofyarg:spin-loop-fixup

Conversation

@xofyarg

@xofyarg xofyarg commented Mar 31, 2026

A race between the atomic current_batch_size increment (Ordering::Relaxed) and the channel send in on_end()/emit() can cause get_spans_and_export (and get_logs_and_export) to spin on try_recv()/Instant::now() with no progress. Break out when try_recv() drains nothing; remaining items are picked up on the next export cycle.

Fixes #
After upgrading opentelemetry_sdk from 0.26.0 to 0.31.0, span_processor::BatchSpanProcessor::export_batch_sync showed up in a flamegraph with very high CPU usage. The application hasn't changed, so the increase is unlikely to come from more tracing spans. Although I cannot explain exactly how, it looks like a race condition is leading to a spin loop.

[flamegraph screenshot]

Changes

Add a condition check to break the spin loop when spans_receiver.try_recv returns an error. This is safe: no spans are lost; the helper simply returns early, and remaining items are picked up on the next export cycle.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@xofyarg xofyarg requested a review from a team as a code owner March 31, 2026 06:19
@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 31, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: xofyarg / name: anb (0f1a1e1)

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 95.65217% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.2%. Comparing base (67e7dff) to head (0f1a1e1).

Files with missing lines Patch % Lines
opentelemetry-sdk/src/logs/batch_log_processor.rs 87.5% 1 Missing ⚠️
opentelemetry-sdk/src/trace/span_processor.rs 97.3% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main   #3441   +/-   ##
=====================================
  Coverage   83.2%   83.2%           
=====================================
  Files        128     128           
  Lines      25048   25086   +38     
=====================================
+ Hits       20859   20896   +37     
- Misses      4189    4190    +1     


@lalitb
Member

lalitb commented Apr 2, 2026

This avoids the busy loop, but it also changes the force_flush() / shutdown() contract: the same counter-versus-channel visibility race can now cause the helper to return Ok(()) without exporting all pending items. We should establish ordering between the successful enqueue and current_batch_size observation, rather than treating drained == 0 as a successful completion condition. In practice, that likely means using release/acquire semantics around the counter update/load and ensuring the flush/shutdown paths retry or wait when target > 0 but no items were drained yet.

P.S. You also need to sign the EasyCLA before any action can be taken on this PR.

@xofyarg
Author

xofyarg commented Apr 3, 2026

Now I think I've figured out the real bug. Quoting the commit message:

On the producer side, there is a window between enqueuing an item and
incrementing `current_batch_size`. During that window, the consumer can
drain more items from the channel than are reflected in the snapshotted
target, causing `fetch_sub` to underflow and wrap the global counter.

A later export cycle can then observe the wrapped counter value and spin
when `try_recv()` makes no progress.

Fix this by capping each drain iteration to the remaining snapshotted
target so the helper never subtracts more items than were counted.
Also strengthen the counter synchronization with acquire/acqrel
ordering.

With this fix, the behavior is kept the same. We would have to add a retry in force_flush() / shutdown() to make sure they can drain all the items from the channel, but that looks a bit out of scope.

Please let me know what you think @lalitb, and I'm working on the CLA at the moment.

@lalitb
Member

lalitb commented Apr 5, 2026

Please let me know what you think @lalitb

This makes sense to me now. My earlier concern was about the old break-on-empty change because that changed force_flush() / shutdown() behavior. The current patch is fixing a different bug: draining more items than were counted, underflowing current_batch_size, and later spinning on the bad counter value. Limiting the drain to the snapshotted target, plus the stronger ordering, looks right for that.

I agree there is still a separate existing window between try_send() and fetch_add() where force_flush() / shutdown() can miss an already-enqueued item, but that seems like a follow-up issue, not something this PR introduces. Would be good to create a separate issue for this.

Will approve this once the CLA is fixed :)

@xofyarg xofyarg changed the title fix spin loop in get_[spans,logs]_and_export fix(sdk): prevent overdrain race in batch span/log export Apr 6, 2026
@xofyarg
Author

xofyarg commented Apr 6, 2026

Signed the CLA and updated the commit title to match the pattern.

@lalitb
Member

lalitb commented Apr 7, 2026

LGTM. We can wait a couple of days for more eyes before merging.

