
[fast-client] Replace boxed Integer with AtomicInteger for pending request counter in InstanceHealthMonitor#2672

Open
ayush571995 wants to merge 3 commits into linkedin:main from ayush571995:optimize/pending-counter-atomic-integer

Conversation

@ayush571995
Contributor

Problem Statement

InstanceHealthMonitor tracks in-flight requests per server instance using:

private final Map<String, Integer> pendingRequestCounterMap = new VeniceConcurrentHashMap<>();

Every increment (on request send) and decrement (on response) goes through
ConcurrentHashMap.compute(), which has two problems on the hot path:

  1. Segment lock on every update — all threads dispatching to or receiving
    responses from the same server instance serialize through the same lock.
    At 16 threads with 30 instances (realistic fast client concurrency), this
    causes 6x latency inflation on this operation alone.

  2. Integer boxing on every update — compute() returns a boxed Integer,
    allocating a new heap object on every increment and decrement for counter
    values > 127 (outside the JVM integer cache). At 50k req/s this generates
    ~1.6 MB/s of short-lived Integer garbage, adding continuous GC pressure.

This runs twice per request — once on send and once on completion — making it
one of the highest-frequency operations in the fast client read path.
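For reference, the pattern described above can be sketched as follows. This is a minimal reconstruction of the hot path, not the actual Venice code: a plain `ConcurrentHashMap` stands in for `VeniceConcurrentHashMap`, and the method names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BoxedCounterSketch {
  // Stand-in for the original field; VeniceConcurrentHashMap wraps ConcurrentHashMap.
  private final Map<String, Integer> pendingRequestCounterMap = new ConcurrentHashMap<>();

  // Every call takes the bin lock for this key, and the lambda's return value
  // is a freshly boxed Integer once the counter leaves the -128..127 cache.
  void incrementPendingCounter(String instance) {
    pendingRequestCounterMap.compute(instance, (k, v) -> v == null ? 1 : v + 1);
  }

  // Floor-at-zero decrement, again under the bin lock on every call.
  void decrementPendingCounter(String instance) {
    pendingRequestCounterMap.compute(instance, (k, v) -> (v == null || v <= 0) ? 0 : v - 1);
  }

  int getPendingRequestCounter(String instance) {
    return pendingRequestCounterMap.getOrDefault(instance, 0);
  }
}
```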

Solution

Replace Map<String, Integer> with Map<String, AtomicInteger>:

  • computeIfAbsent only locks on the first encounter of a new instance;
    all subsequent increments/decrements bypass the map lock entirely
  • Increments become a single lock-free CAS via AtomicInteger.incrementAndGet()
  • Zero per-update heap allocation after the AtomicInteger is in the map
  • The decrement is extracted into decrementPendingCounter() using
    getAndUpdate(v -> Math.max(0, v - 1)), keeping the floor-at-zero check
    and the decrement as a single atomic operation — same correctness guarantee
    as the original compute() block, without a lock
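The bullets above translate to roughly the following shape. This is a hedged sketch of the approach the PR describes, not the merged code; the null-handling and error logging in the real `decrementPendingCounter()` are simplified to a comment.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterSketch {
  private final Map<String, AtomicInteger> pendingRequestCounterMap = new ConcurrentHashMap<>();

  // computeIfAbsent only locks on the first encounter of a new instance;
  // afterwards this is a lock-free map read plus one CAS.
  void incrementPendingCounter(String instance) {
    pendingRequestCounterMap.computeIfAbsent(instance, k -> new AtomicInteger()).incrementAndGet();
  }

  // getAndUpdate keeps the floor-at-zero check and the decrement inside a
  // single CAS loop, matching the correctness of the old compute() block.
  void decrementPendingCounter(String instance) {
    AtomicInteger counter = pendingRequestCounterMap.get(instance);
    if (counter == null) {
      return; // decrement-without-increment; the production code logs an error here
    }
    counter.getAndUpdate(v -> Math.max(0, v - 1));
  }

  int getPendingRequestCounter(String instance) {
    AtomicInteger counter = pendingRequestCounterMap.get(instance);
    return counter == null ? 0 : counter.get();
  }
}
```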

JMH benchmark (PendingRequestCounterBenchmark) added to venice-test-common
simulating one full request lifecycle (increment + decrement):

| Threads | Instances | compute() (before) | AtomicInteger (after) | Speedup |
|--------:|----------:|-------------------:|----------------------:|--------:|
| 1 | 30 | 83 ns/op | 46 ns/op | 1.8x |
| 4 | 30 | 421 ns/op | 141 ns/op | 3.0x |
| 16 | 30 | 1844 ns/op | 305 ns/op | 6.0x |
| 32 | 30 | 4272 ns/op | 817 ns/op | 5.2x |

GC allocation rate (30 instances): compute() → 16 bytes/op, AtomicInteger → 0 bytes/op.

Code changes

  • No new code behind a config — the change is purely internal, so there are
    no config names to list.
  • No new log lines introduced, so no rate limiting is needed.
    • Existing error log lines are preserved verbatim; they only fire on buggy
      decrement-without-increment scenarios, not on the normal path.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
    • getAndUpdate() is a single atomic CAS — the floor-at-zero check and
      decrement are not separable by another thread.
  • Proper synchronization mechanisms are used where needed.
    • AtomicInteger CAS replaces the ConcurrentHashMap segment lock;
      computeIfAbsent handles the one-time insertion safely.
  • No blocking calls inside critical sections.
    • CAS is non-blocking by definition; the map lock is only held during
      computeIfAbsent on first instance encounter.
  • Verified thread-safe collections are used.
    • VeniceConcurrentHashMap retained; value type changed from Integer to AtomicInteger.
  • Validated proper exception handling in multi-threaded code.
    • No exception paths changed; error cases log and return cleanly.
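The "not separable by another thread" claim above can be exercised directly: racing many decrement threads against a counter that starts at zero should never observe a negative value. This standalone demo is not from the PR; it uses `updateAndGet` (rather than the production `getAndUpdate`) purely so each thread can inspect the post-update value, with the same CAS-loop semantics.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FloorAtZeroDemo {
  // Races 8 threads of 100k floor-at-zero decrements each against a counter
  // starting at zero; returns {finalCounterValue, timesObservedBelowZero}.
  static int[] raceDecrements() throws InterruptedException {
    AtomicInteger counter = new AtomicInteger(0);
    AtomicInteger negatives = new AtomicInteger(0);
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        for (int j = 0; j < 100_000; j++) {
          // Floor check and decrement execute inside one CAS loop, so no
          // interleaving can push the value negative.
          int after = counter.updateAndGet(v -> Math.max(0, v - 1));
          if (after < 0) {
            negatives.incrementAndGet();
          }
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    return new int[] { counter.get(), negatives.get() };
  }

  public static void main(String[] args) throws InterruptedException {
    int[] result = raceDecrements();
    System.out.println("final=" + result[0] + " negatives=" + result[1]);
  }
}
```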

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
    • Existing InstanceHealthMonitorTest covers increment/decrement correctness,
      block threshold enforcement, and error logging — all pass unchanged since
      public method signatures (getPendingRequestCounter, getBlockedInstanceCount)
      are unmodified.
  • Verified backward compatibility (if applicable).
    • No config, API, or behavior changes. Internal implementation change only.
  • JMH microbenchmark PendingRequestCounterBenchmark added to document and
    reproduce the performance comparison under varying thread counts and instance counts.

Does this PR introduce any user-facing or breaking changes?

  • No. Internal implementation change to InstanceHealthMonitor. No public
    APIs, configs, or observable behavior is changed.


Copilot AI left a comment


Pull request overview

This PR optimizes the fast-client hot path in InstanceHealthMonitor by switching the per-instance pending request counter from boxed Integer updates via ConcurrentHashMap.compute() to AtomicInteger-based lock-free increments/decrements, and adds a JMH microbenchmark to quantify the improvement.

Changes:

  • Replaced Map<String, Integer> pending-request counters with Map<String, AtomicInteger> and switched updates to computeIfAbsent(...).incrementAndGet() + atomic floor-at-zero decrement.
  • Refactored decrement logic into a dedicated helper (decrementPendingCounter) while preserving existing error logging behavior.
  • Added PendingRequestCounterBenchmark JMH benchmark to compare the old vs new approaches.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| clients/venice-client/src/main/java/com/linkedin/venice/fastclient/meta/InstanceHealthMonitor.java | Switches pending request counters to AtomicInteger and uses lock-free CAS-based updates. |
| internal/venice-test-common/src/jmh/java/com/linkedin/venice/benchmark/PendingRequestCounterBenchmark.java | Adds a microbenchmark for old (compute + boxed Integer) vs new (AtomicInteger) counter update paths. |


Contributor

@sushantmane sushantmane left a comment


Thanks for the contribution, @ayush571995! Welcome to the Venice project. 🎉

The code change itself looks correct — AtomicInteger with computeIfAbsent is the more idiomatic pattern for concurrent counters, and the getAndUpdate(v -> Math.max(0, v - 1)) nicely preserves the floor-at-zero guarantee as a single atomic CAS.

A couple of things I'd like to discuss:

How impactful is this in practice?

The compute() on increment runs on the request submission thread, but it sits between System.nanoTime(), URL string composition, and transportClient.get() — an actual network round-trip that typically costs millions of nanoseconds. The decrement in the normal case runs in the whenComplete callback on the response-handling thread, but alongside much heavier operations like decompression and deserialization that dominate the cost. (Only in the timeout case is the decrement scheduled off the response path entirely, on the TimeoutProcessor thread.)

The JMH benchmark isolates the counter operations from everything else, which makes the speedup look dramatic (6x at 16 threads). But in context, even the worst-case saving (~3.5µs) is a tiny fraction of the end-to-end request cost dominated by network I/O. Similarly, ~1.6 MB/s of short-lived Integer objects is comfortably handled by young-gen GC in modern JVMs.

Could you share your thinking on whether this showed up in any application-level profiling (flame graphs, p99 latency, GC logs), or if this was more of a code-quality improvement? Either motivation is fine — just want to make sure the PR description sets accurate expectations for future readers.

Benchmark methodology notes

A few things that would make the benchmark more representative:

  1. Increment and decrement hit different instances — pickInstance(ts) advances the index on each call, so in boxedCompute_t01, the increment hits instance N and decrement hits instance N+1. A real request would increment and decrement the same instance. Consider picking the instance once per iteration.

  2. Decrement uses decrementAndGet() instead of getAndUpdate() — The production code uses getAndUpdate(v -> Math.max(0, v - 1)) which involves a CAS loop with a lambda. The benchmark using the simpler decrementAndGet() slightly overstates the improvement on the decrement side.

  3. Increment skips computeIfAbsent — The production code calls computeIfAbsent(...).incrementAndGet() but the benchmark uses atomicMap.get(instance).incrementAndGet(). Since the map is pre-populated this is likely negligible, but for completeness it's worth matching the production code path.

These don't change the directional conclusion (AtomicInteger is faster), but would make the numbers more trustworthy.
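Point 1 above (pick the instance once, then run the full lifecycle against it) can be sketched like this. The instance list, URL format, and random selection are illustrative stand-ins for the benchmark's actual state and `pickInstance(ts)` logic, not the real benchmark code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

public class SameInstanceIterationSketch {
  private final String[] instances = new String[30];
  private final Map<String, AtomicInteger> atomicMap = new ConcurrentHashMap<>();

  public SameInstanceIterationSketch() {
    for (int i = 0; i < instances.length; i++) {
      instances[i] = "https://host" + i + ":1234"; // illustrative instance names
      atomicMap.put(instances[i], new AtomicInteger());
    }
  }

  // One benchmark iteration: pick the instance ONCE, then run the full request
  // lifecycle (increment + decrement) against that same instance, as a real
  // request would. Also matches the production code paths: increment goes
  // through computeIfAbsent, decrement through getAndUpdate with the floor.
  public int oneIteration() {
    String instance = instances[ThreadLocalRandom.current().nextInt(instances.length)];
    int inFlight = atomicMap.computeIfAbsent(instance, k -> new AtomicInteger()).incrementAndGet();
    atomicMap.get(instance).getAndUpdate(v -> Math.max(0, v - 1));
    return inFlight;
  }
}
```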

Overall: the code change is clean and correct. I think it's a nice improvement — just want to make sure we calibrate the performance narrative accurately. Looking forward to your thoughts!

Address review feedback on PR linkedin#2672:
- Use same instance for both inc and dec per iteration
- Atomic decrement uses getAndUpdate(v -> Math.max(0, v-1))
  to match decrementPendingCounter() exactly
- Remove stale counterResetConsumer comment
@ayush571995
Contributor Author


Thanks for the thorough review @sushantmane. You're right on all three benchmark issues — fixed in the latest commit. Here's a summary of what changed and the updated numbers.

Benchmark fixes applied:

  • Both increment and decrement now operate on the same instance per iteration, matching the production pattern
  • Atomic decrement now uses getAndUpdate(v -> Math.max(0, v - 1)) to mirror decrementPendingCounter() exactly
  • Removed stale counterResetConsumer comment

Updated results (corrected benchmark):

| Threads | Instances | compute() | AtomicInteger | Speedup |
|--------:|----------:|----------:|--------------:|--------:|
| 1 | 30 | 60 ns/op | 43 ns/op | 1.4x |
| 4 | 30 | 160 ns/op | 81 ns/op | 2.0x |
| 16 | 30 | 1033 ns/op | 152 ns/op | 6.8x |
| 32 | 30 | 2208 ns/op | 604 ns/op | 3.7x |

The GC numbers also corrected themselves — since both operations now hit the same instance, the counter
oscillates between 0–1 which stays inside the JVM integer cache, so allocation is near zero for both.
The lock contention story remains the primary argument: at realistic fast client concurrency
(16–32 threads, 30 instances), compute() is 4–7x slower per operation.
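The integer-cache point above rests on `Integer.valueOf` behavior, which this small demo illustrates. The JLS only guarantees caching for -128..127; the behavior outside that range is typical HotSpot behavior, not a spec guarantee, which is why only the in-cache case is stated as fact.

```java
public class IntegerCacheDemo {
  // Integer.valueOf must cache -128..127 per the JLS, so a counter
  // oscillating between 0 and 1 boxes to the same cached objects
  // and allocates nothing.
  static boolean inCacheSame() {
    return Integer.valueOf(1) == Integer.valueOf(1); // guaranteed true
  }

  // Outside the cache, each boxing typically yields a fresh heap object
  // on stock HotSpot, though the spec permits wider caching.
  static boolean outOfCacheSame() {
    return Integer.valueOf(1000) == Integer.valueOf(1000);
  }

  public static void main(String[] args) {
    System.out.println("in-cache identical: " + inCacheSame());
    System.out.println("out-of-cache identical: " + outOfCacheSame());
  }
}
```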

On the real-world impact question:

You're right that a single operation saving ~900 ns is invisible against a network RTT. The case here is cumulative:

  • This runs twice per request (increment on send, decrement on completion) on every GET and every
    scatter route in batch GET
  • Under high QPS, the lock serializes threads that could otherwise proceed independently — it's not just
    latency per-op but queueing delay stacking up across concurrent requests
  • The fix has zero risk (same semantics, existing tests pass, no config change) and the code is simpler —
    the Consumer<String> field is replaced by a straightforward private method

Agreed that without a flame graph showing this as a hotspot, the latency claim is speculative. Happy to
mark this as a low-priority cleanup rather than a performance fix if that framing is more appropriate.
