
fix(sdk): prevent overdrain race in batch span/log export#3441

Open
xofyarg wants to merge 1 commit into open-telemetry:main from xofyarg:spin-loop-fixup

Conversation

@xofyarg

@xofyarg xofyarg commented Mar 31, 2026

A race between the atomic current_batch_size increment (Ordering::Relaxed) and the channel send in on_end()/emit() can cause get_spans_and_export (and get_logs_and_export) to spin on try_recv()/Instant::now() with no progress. Break out when try_recv() drains nothing; remaining items are picked up on the next export cycle.

Fixes #
After upgrading opentelemetry_sdk from 0.26.0 to 0.31.0, span_processor::BatchSpanProcessor::export_batch_sync showed up in a flamegraph with very high CPU usage. The application hasn't changed, so the increase is unlikely to come from more tracing spans. Although I cannot explain exactly how, it looks like a race condition is leading to a spin loop.

[flamegraph screenshot]

Changes

Add a condition check to break the spin loop when spans_receiver.try_recv returns an error. This is safe: no spans are lost; the helper simply returns early, and remaining items are picked up on the next export cycle.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@xofyarg xofyarg requested a review from a team as a code owner March 31, 2026 06:19
@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 31, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: xofyarg / name: anb (0f1a1e1)

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 95.65217% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.2%. Comparing base (67e7dff) to head (0f1a1e1).

Files with missing lines Patch % Lines
opentelemetry-sdk/src/logs/batch_log_processor.rs 87.5% 1 Missing ⚠️
opentelemetry-sdk/src/trace/span_processor.rs 97.3% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main   #3441   +/-   ##
=====================================
  Coverage   83.2%   83.2%           
=====================================
  Files        128     128           
  Lines      25048   25086   +38     
=====================================
+ Hits       20859   20896   +37     
- Misses      4189    4190    +1     


@lalitb
Member

lalitb commented Apr 2, 2026

This avoids the busy loop, but it also changes the force_flush() / shutdown() contract: the same counter-versus-channel visibility race can now cause the helper to return Ok(()) without exporting all pending items. We should establish ordering between the successful enqueue and current_batch_size observation, rather than treating drained == 0 as a successful completion condition. In practice, that likely means using release/acquire semantics around the counter update/load and ensuring the flush/shutdown paths retry or wait when target > 0 but no items were drained yet.

P.S. You also need to sign the EasyCLA before any action can be taken on this PR.

@xofyarg
Author

xofyarg commented Apr 3, 2026

Now I think I've figured out the real bug. Quoting the commit message:

On the producer side, there is a window between enqueuing an item and
incrementing `current_batch_size`. During that window, the consumer can
drain more items from the channel than are reflected in the snapshotted
target, causing `fetch_sub` to underflow and wrap the global counter.

A later export cycle can then observe the wrapped counter value and spin
when `try_recv()` makes no progress.

Fix this by capping each drain iteration to the remaining snapshotted
target so the helper never subtracts more items than were counted.
Also strengthen the counter synchronization with acquire/acqrel
ordering.

With this fix, the behavior is kept the same. We would have to add a retry in force_flush() / shutdown() to make sure they can drain all the items from the channel, but that looks a bit out of scope.

Please let me know what you think @lalitb, and I'm working on the CLA at the moment.

@lalitb
Member

lalitb commented Apr 5, 2026

Please let me know what you think @lalitb

This makes sense to me now. My earlier concern was about the old break-on-empty change because that changed force_flush() / shutdown() behavior. The current patch is fixing a different bug: draining more items than were counted, underflowing current_batch_size, and later spinning on the bad counter value. Limiting the drain to the snapshotted target, plus the stronger ordering, looks right for that.

I agree there is still a separate existing window between try_send() and fetch_add() where force_flush() / shutdown() can miss an already-enqueued item, but that seems like a follow-up issue, not something this PR introduces. Would be good to create a separate issue for this.

Will approve this once the CLA is fixed :)

@xofyarg xofyarg changed the title fix spin loop in get_[spans,logs]_and_export fix(sdk): prevent overdrain race in batch span/log export Apr 6, 2026
@xofyarg
Author

xofyarg commented Apr 6, 2026

Signed the CLA and updated the commit title to match the pattern.

@lalitb
Member

lalitb commented Apr 7, 2026

LGTM. We can wait a couple of days for more eyes before merging.

