[fast-client] Replace boxed Integer with AtomicInteger for pending request counter in InstanceHealthMonitor#2672
Conversation
Pull request overview
This PR optimizes the fast-client hot path in `InstanceHealthMonitor` by switching the per-instance pending request counter from boxed `Integer` updates via `ConcurrentHashMap.compute()` to `AtomicInteger`-based lock-free increments and decrements, and adds a JMH microbenchmark to quantify the improvement.
Changes:
- Replaced `Map<String, Integer>` pending-request counters with `Map<String, AtomicInteger>` and switched updates to `computeIfAbsent(...).incrementAndGet()` plus an atomic floor-at-zero decrement.
- Refactored decrement logic into a dedicated helper (`decrementPendingCounter`) while preserving existing error logging behavior.
- Added `PendingRequestCounterBenchmark`, a JMH benchmark comparing the old and new approaches.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| clients/venice-client/src/main/java/com/linkedin/venice/fastclient/meta/InstanceHealthMonitor.java | Switches pending request counters to AtomicInteger and uses lock-free CAS-based updates. |
| internal/venice-test-common/src/jmh/java/com/linkedin/venice/benchmark/PendingRequestCounterBenchmark.java | Adds a microbenchmark for old (compute + boxed Integer) vs new (AtomicInteger) counter update paths. |
Thanks for the contribution, @ayush571995! Welcome to the Venice project. 🎉
The code change itself looks correct — AtomicInteger with computeIfAbsent is the more idiomatic pattern for concurrent counters, and the getAndUpdate(v -> Math.max(0, v - 1)) nicely preserves the floor-at-zero guarantee as a single atomic CAS.
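That atomicity is easy to demonstrate with a small stress test (illustrative code, not from this PR): because the floor check and the decrement commit in a single CAS, no interleaving of threads can drive the value negative.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FloorAtZeroDemo {
    // Eight threads race 1000 decrements each against a counter that starts
    // at 1. The Math.max floor and the decrement are applied inside one
    // getAndUpdate CAS, so the value can never go below zero.
    static int race() {
        AtomicInteger counter = new AtomicInteger(1);
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    counter.getAndUpdate(v -> Math.max(0, v - 1));
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(race()); // 0 — floored, never negative
    }
}
```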
A couple of things I'd like to discuss:
How impactful is this in practice?
The compute() on increment runs on the request submission thread, but it sits between System.nanoTime(), URL string composition, and transportClient.get() — an actual network round-trip that typically costs millions of nanoseconds. The decrement in the normal case runs in the whenComplete callback on the response-handling thread, but alongside much heavier operations like decompression and deserialization that dominate the cost. (Only in the timeout case is the decrement scheduled off the response path entirely, on the TimeoutProcessor thread.)
The JMH benchmark isolates the counter operations from everything else, which makes the speedup look dramatic (6x at 16 threads). But in context, even the worst-case saving (~3.5µs) is a tiny fraction of the end-to-end request cost dominated by network I/O. Similarly, ~1.6 MB/s of short-lived Integer objects is comfortably handled by young-gen GC in modern JVMs.
Could you share your thinking on whether this showed up in any application-level profiling (flame graphs, p99 latency, GC logs), or if this was more of a code-quality improvement? Either motivation is fine — just want to make sure the PR description sets accurate expectations for future readers.
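For reference, the ~1.6 MB/s allocation figure falls out of simple arithmetic, assuming the 16 bytes per boxed `Integer` that the benchmark's GC profiling reports and two counter updates (increment + decrement) per request:

```java
public class AllocationEstimate {
    // Bytes allocated per second by boxed Integer counter updates:
    // requests/s x bytes per boxed Integer x updates per request.
    static long bytesPerSecond(long requestsPerSecond, long bytesPerObject, long updatesPerRequest) {
        return requestsPerSecond * bytesPerObject * updatesPerRequest;
    }

    public static void main(String[] args) {
        // 50,000 req/s x 16 bytes x 2 updates = 1,600,000 bytes/s = 1.6 MB/s
        System.out.println(bytesPerSecond(50_000, 16, 2) / 1_000_000.0 + " MB/s");
    }
}
```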
Benchmark methodology notes
A few things that would make the benchmark more representative:
- Increment and decrement hit different instances — `pickInstance(ts)` advances the index on each call, so in `boxedCompute_t01` the increment hits instance N and the decrement hits instance N+1. A real request would increment and decrement the same instance. Consider picking the instance once per iteration.
- Decrement uses `decrementAndGet()` instead of `getAndUpdate()` — the production code uses `getAndUpdate(v -> Math.max(0, v - 1))`, which involves a CAS loop with a lambda. The benchmark's simpler `decrementAndGet()` slightly overstates the improvement on the decrement side.
- Increment skips `computeIfAbsent` — the production code calls `computeIfAbsent(...).incrementAndGet()`, but the benchmark uses `atomicMap.get(instance).incrementAndGet()`. Since the map is pre-populated this is likely negligible, but for completeness it's worth matching the production code path.
These don't change the directional conclusion (AtomicInteger is faster), but would make the numbers more trustworthy.
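To make the first two points concrete, here is a minimal sketch (the `pickInstance` stand-in and class are hypothetical, not the actual benchmark code) contrasting the flawed and corrected iteration shapes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class IterationShapeSketch {
    static final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    static int index = 0; // stand-in for the benchmark's advancing pickInstance(ts)

    static String pickInstance() {
        return "server-" + (index++ % 4);
    }

    // Flawed shape: each call advances the index, so the increment and the
    // decrement land on DIFFERENT instances and the counters drift.
    static void flawedIteration() {
        counters.computeIfAbsent(pickInstance(), k -> new AtomicInteger()).incrementAndGet();
        counters.computeIfAbsent(pickInstance(), k -> new AtomicInteger())
                .getAndUpdate(v -> Math.max(0, v - 1));
    }

    // Corrected shape: pick the instance once, use it for both operations,
    // and decrement exactly the way production does.
    static void correctedIteration() {
        String instance = pickInstance();
        counters.computeIfAbsent(instance, k -> new AtomicInteger()).incrementAndGet();
        counters.get(instance).getAndUpdate(v -> Math.max(0, v - 1));
    }

    public static void main(String[] args) {
        flawedIteration();
        System.out.println(counters); // server-0 stuck at 1, server-1 floored at 0
        correctedIteration();
        System.out.println(counters.get("server-2").get()); // back to 0, a full lifecycle
    }
}
```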
Overall: the code change is clean and correct. I think it's a nice improvement — just want to make sure we calibrate the performance narrative accurately. Looking forward to your thoughts!
Address review feedback on PR linkedin#2672:
- Use same instance for both inc and dec per iteration
- Atomic decrement uses getAndUpdate(v -> Math.max(0, v-1)) to match decrementPendingCounter() exactly
- Remove stale counterResetConsumer comment
Thanks for the thorough review @sushantmane. You're right on all three benchmark issues — fixed in the latest commit. Here's a summary of what changed and the updated numbers.

Benchmark fixes applied: …

Updated results (corrected benchmark): …

The GC numbers also corrected themselves — since both operations now hit the same instance, the counter …

On the real-world impact question: you're right that a single operation saving ~900 ns is invisible against a network RTT. The case here is cumulative: …

Agreed that without a flame graph showing this as a hotspot, the latency claim is speculative. Happy to …
Problem Statement

`InstanceHealthMonitor` tracks in-flight requests per server instance in a `Map<String, Integer>`. Every increment (on request send) and decrement (on response) goes through `ConcurrentHashMap.compute()`, which has two problems on the hot path:

1. A lock on every update — `compute()` takes the map's per-bin lock, so all threads dispatching to or receiving responses from the same server instance serialize through the same lock. At 16 threads with 30 instances (realistic fast-client concurrency), this causes 6x latency inflation on this operation alone.
2. `Integer` boxing on every update — `compute()` returns a boxed `Integer`, allocating a new heap object on every increment and decrement for counter values outside the JVM integer cache (above 127). At 50k req/s this generates ~1.6 MB/s of short-lived `Integer` garbage, adding continuous GC pressure.

This runs twice per request — once on send and once on completion — making it one of the highest-frequency operations in the fast client read path.
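A minimal sketch of the boxed-`Integer` pattern described above (field and method names are illustrative, not the actual `InstanceHealthMonitor` code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BoxedCounterSketch {
    static final Map<String, Integer> pending = new ConcurrentHashMap<>();

    // Every call takes the map's per-bin lock and returns a boxed Integer,
    // allocating a fresh object whenever the value falls outside the JVM's
    // Integer cache (-128..127).
    static int increment(String instance) {
        return pending.compute(instance, (k, v) -> v == null ? 1 : v + 1);
    }

    // Floor-at-zero decrement, also performed under the bin lock.
    static int decrement(String instance) {
        return pending.compute(instance, (k, v) -> (v == null || v <= 0) ? 0 : v - 1);
    }

    public static void main(String[] args) {
        increment("server-1");
        increment("server-1");
        System.out.println(decrement("server-1")); // 2 -> 1
        System.out.println(decrement("server-1")); // 1 -> 0
        System.out.println(decrement("server-1")); // stays 0, never negative
    }
}
```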
Solution

Replace `Map<String, Integer>` with `Map<String, AtomicInteger>`:

- `computeIfAbsent` only locks on the first encounter of a new instance; once the `AtomicInteger` is in the map, all subsequent increments and decrements bypass the map lock entirely via `AtomicInteger.incrementAndGet()`.
- Decrement is refactored into `decrementPendingCounter()` using `getAndUpdate(v -> Math.max(0, v - 1))`, keeping the floor-at-zero check and the decrement as a single atomic operation — the same correctness guarantee as the original `compute()` block, without a lock.

A JMH benchmark (`PendingRequestCounterBenchmark`) is added to `venice-test-common`, simulating one full request lifecycle (increment + decrement) for `compute()` (before) vs `AtomicInteger` (after). GC allocation rate (30 instances): `compute()` → 16 bytes/op; `AtomicInteger` → 0 bytes/op.

Code changes
Error logging is preserved; it fires only in decrement-without-increment scenarios, not on the normal path.
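A sketch of the resulting counter helpers as described above (names and structure are illustrative; the real `decrementPendingCounter()` lives in `InstanceHealthMonitor`, and the interpretation of a zero return as the error-logging case is an assumption):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterSketch {
    static final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    // Locks only on the first encounter of a new instance; after that the
    // update is a single CAS on the existing AtomicInteger.
    static int increment(String instance) {
        return pending.computeIfAbsent(instance, k -> new AtomicInteger()).incrementAndGet();
    }

    // Floor-at-zero decrement as one atomic CAS loop. getAndUpdate returns
    // the PREVIOUS value, so a return of 0 indicates a decrement without a
    // matching increment (assumed here to be the error-logging case).
    static int decrement(String instance) {
        AtomicInteger counter = pending.get(instance);
        return counter == null ? 0 : counter.getAndUpdate(v -> Math.max(0, v - 1));
    }

    public static void main(String[] args) {
        increment("server-1");
        System.out.println(decrement("server-1")); // previous value: 1
        System.out.println(decrement("server-1")); // previous value: 0 -> would log
        System.out.println(pending.get("server-1").get()); // still 0, never negative
    }
}
```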
Concurrency-Specific Checks

Both reviewer and PR author to verify:

- `getAndUpdate()` is a single atomic CAS — the floor-at-zero check and the decrement are not separable by another thread.
- The `AtomicInteger` CAS replaces the `ConcurrentHashMap` per-bin lock; `computeIfAbsent` handles the one-time insertion safely.
- Map mutation happens only via `computeIfAbsent` on first instance encounter.
- `VeniceConcurrentHashMap` retained; value type changed from `Integer` to `AtomicInteger`.

How was this PR tested?
- Existing `InstanceHealthMonitorTest` covers increment/decrement correctness, block threshold enforcement, and error logging — all pass unchanged, since public method signatures (`getPendingRequestCounter`, `getBlockedInstanceCount`) are unmodified.
- `PendingRequestCounterBenchmark` added to document and reproduce the performance comparison under varying thread and instance counts.
Does this PR introduce any user-facing or breaking changes?
No. This is an internal implementation change to `InstanceHealthMonitor`. No public APIs, configs, or observable behavior are changed.