
[fast-client] Replace boxed Integer with AtomicInteger for pending request counter in InstanceHealthMonitor#2672

Open
ayush571995 wants to merge 3 commits into linkedin:main from ayush571995:optimize/pending-counter-atomic-integer

Conversation

@ayush571995
Contributor

Problem Statement

InstanceHealthMonitor tracks in-flight requests per server instance using:

private final Map<String, Integer> pendingRequestCounterMap = new VeniceConcurrentHashMap<>();

Every increment (on request send) and decrement (on response) goes through
ConcurrentHashMap.compute(), which has two problems on the hot path:

  1. Segment lock on every update — all threads dispatching to or receiving
    responses from the same server instance serialize through the same lock.
    At 16 threads with 30 instances (realistic fast client concurrency), this
    causes 6x latency inflation on this operation alone.

  2. Integer boxing on every update — compute() returns a boxed Integer,
    allocating a new heap object on every increment and decrement for counter
    values > 127 (outside the JVM integer cache). At 50k req/s this generates
    ~1.6 MB/s of short-lived Integer garbage, adding continuous GC pressure.

This runs twice per request — once on send and once on completion — making it
one of the highest-frequency operations in the fast client read path.
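For reference, the pattern described above can be sketched as follows. This is a minimal reconstruction of the hot path, not the actual Venice code: a plain `ConcurrentHashMap` stands in for `VeniceConcurrentHashMap`, and the method names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BoxedCounterSketch {
  // Stand-in for the original field; VeniceConcurrentHashMap wraps ConcurrentHashMap.
  private final Map<String, Integer> pendingRequestCounterMap = new ConcurrentHashMap<>();

  // Every call takes the bin lock for this key, and the lambda's return value
  // is a freshly boxed Integer once the counter leaves the -128..127 cache.
  void incrementPendingCounter(String instance) {
    pendingRequestCounterMap.compute(instance, (k, v) -> v == null ? 1 : v + 1);
  }

  // Floor-at-zero decrement, again under the bin lock on every call.
  void decrementPendingCounter(String instance) {
    pendingRequestCounterMap.compute(instance, (k, v) -> (v == null || v <= 0) ? 0 : v - 1);
  }

  int getPendingRequestCounter(String instance) {
    return pendingRequestCounterMap.getOrDefault(instance, 0);
  }
}
```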

Solution

Replace Map<String, Integer> with Map<String, AtomicInteger>:

  • computeIfAbsent only locks on the first encounter of a new instance;
    all subsequent increments/decrements bypass the map lock entirely
  • Increments become a single lock-free CAS via AtomicInteger.incrementAndGet()
  • Zero per-update heap allocation after the AtomicInteger is in the map
  • The decrement is extracted into decrementPendingCounter() using
    getAndUpdate(v -> Math.max(0, v - 1)), keeping the floor-at-zero check
    and the decrement as a single atomic operation — same correctness guarantee
    as the original compute() block, without a lock
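The bullets above translate to roughly the following shape. This is a hedged sketch of the approach the PR describes, not the merged code; the null-handling and error logging in the real `decrementPendingCounter()` are simplified to a comment.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterSketch {
  private final Map<String, AtomicInteger> pendingRequestCounterMap = new ConcurrentHashMap<>();

  // computeIfAbsent only locks on the first encounter of a new instance;
  // afterwards this is a lock-free map read plus one CAS.
  void incrementPendingCounter(String instance) {
    pendingRequestCounterMap.computeIfAbsent(instance, k -> new AtomicInteger()).incrementAndGet();
  }

  // getAndUpdate keeps the floor-at-zero check and the decrement inside a
  // single CAS loop, matching the correctness of the old compute() block.
  void decrementPendingCounter(String instance) {
    AtomicInteger counter = pendingRequestCounterMap.get(instance);
    if (counter == null) {
      return; // decrement-without-increment; the production code logs an error here
    }
    counter.getAndUpdate(v -> Math.max(0, v - 1));
  }

  int getPendingRequestCounter(String instance) {
    AtomicInteger counter = pendingRequestCounterMap.get(instance);
    return counter == null ? 0 : counter.get();
  }
}
```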

JMH benchmark (PendingRequestCounterBenchmark) added to venice-test-common
simulating one full request lifecycle (increment + decrement):

| Threads | Instances | compute() (before) | AtomicInteger (after) | Speedup |
|--------:|----------:|-------------------:|----------------------:|--------:|
| 1 | 30 | 83 ns/op | 46 ns/op | 1.8x |
| 4 | 30 | 421 ns/op | 141 ns/op | 3.0x |
| 16 | 30 | 1844 ns/op | 305 ns/op | 6.0x |
| 32 | 30 | 4272 ns/op | 817 ns/op | 5.2x |

GC allocation rate (30 instances): compute() → 16 bytes/op, AtomicInteger → 0 bytes/op.

Code changes

  • No new code behind a config — the change is purely internal, so there are
    no config names to list.
  • No new log lines introduced, so no rate limiting is needed.
    • Existing error log lines are preserved verbatim; they only fire on buggy
      decrement-without-increment scenarios, not on the normal path.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
    • getAndUpdate() is a single atomic CAS — the floor-at-zero check and
      decrement are not separable by another thread.
  • Proper synchronization mechanisms are used where needed.
    • AtomicInteger CAS replaces the ConcurrentHashMap segment lock;
      computeIfAbsent handles the one-time insertion safely.
  • No blocking calls inside critical sections.
    • CAS is non-blocking by definition; the map lock is only held during
      computeIfAbsent on first instance encounter.
  • Verified thread-safe collections are used.
    • VeniceConcurrentHashMap retained; value type changed from Integer to AtomicInteger.
  • Validated proper exception handling in multi-threaded code.
    • No exception paths changed; error cases log and return cleanly.
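The "not separable by another thread" claim above can be exercised directly: racing many decrement threads against a counter that starts at zero should never observe a negative value. This standalone demo is not from the PR; it uses `updateAndGet` (rather than the production `getAndUpdate`) purely so each thread can inspect the post-update value, with the same CAS-loop semantics.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FloorAtZeroDemo {
  // Races 8 threads of 100k floor-at-zero decrements each against a counter
  // starting at zero; returns {finalCounterValue, timesObservedBelowZero}.
  static int[] raceDecrements() throws InterruptedException {
    AtomicInteger counter = new AtomicInteger(0);
    AtomicInteger negatives = new AtomicInteger(0);
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        for (int j = 0; j < 100_000; j++) {
          // Floor check and decrement execute inside one CAS loop, so no
          // interleaving can push the value negative.
          int after = counter.updateAndGet(v -> Math.max(0, v - 1));
          if (after < 0) {
            negatives.incrementAndGet();
          }
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    return new int[] { counter.get(), negatives.get() };
  }

  public static void main(String[] args) throws InterruptedException {
    int[] result = raceDecrements();
    System.out.println("final=" + result[0] + " negatives=" + result[1]);
  }
}
```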

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
    • Existing InstanceHealthMonitorTest covers increment/decrement correctness,
      block threshold enforcement, and error logging — all pass unchanged since
      public method signatures (getPendingRequestCounter, getBlockedInstanceCount)
      are unmodified.
  • Verified backward compatibility (if applicable).
    • No config, API, or behavior changes. Internal implementation change only.
  • JMH microbenchmark PendingRequestCounterBenchmark added to document and
    reproduce the performance comparison under varying thread counts and instance counts.

Does this PR introduce any user-facing or breaking changes?

  • No. Internal implementation change to InstanceHealthMonitor. No public
    APIs, configs, or observable behavior is changed.


Copilot AI left a comment


Pull request overview

This PR optimizes the fast-client hot path in InstanceHealthMonitor by switching the per-instance pending request counter from boxed Integer updates via ConcurrentHashMap.compute() to AtomicInteger-based lock-free increments/decrements, and adds a JMH microbenchmark to quantify the improvement.

Changes:

  • Replaced Map<String, Integer> pending-request counters with Map<String, AtomicInteger> and switched updates to computeIfAbsent(...).incrementAndGet() + atomic floor-at-zero decrement.
  • Refactored decrement logic into a dedicated helper (decrementPendingCounter) while preserving existing error logging behavior.
  • Added PendingRequestCounterBenchmark JMH benchmark to compare the old vs new approaches.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| clients/venice-client/src/main/java/com/linkedin/venice/fastclient/meta/InstanceHealthMonitor.java | Switches pending request counters to AtomicInteger and uses lock-free CAS-based updates. |
| internal/venice-test-common/src/jmh/java/com/linkedin/venice/benchmark/PendingRequestCounterBenchmark.java | Adds a microbenchmark for old (compute + boxed Integer) vs new (AtomicInteger) counter update paths. |


Contributor

@sushantmane sushantmane left a comment


Thanks for the contribution, @ayush571995! Welcome to the Venice project. 🎉

The code change itself looks correct — AtomicInteger with computeIfAbsent is the more idiomatic pattern for concurrent counters, and the getAndUpdate(v -> Math.max(0, v - 1)) nicely preserves the floor-at-zero guarantee as a single atomic CAS.

A couple of things I'd like to discuss:

How impactful is this in practice?

The compute() on increment runs on the request submission thread, but it sits between System.nanoTime(), URL string composition, and transportClient.get() — an actual network round-trip that typically costs millions of nanoseconds. The decrement in the normal case runs in the whenComplete callback on the response-handling thread, but alongside much heavier operations like decompression and deserialization that dominate the cost. (Only in the timeout case is the decrement scheduled off the response path entirely, on the TimeoutProcessor thread.)

The JMH benchmark isolates the counter operations from everything else, which makes the speedup look dramatic (6x at 16 threads). But in context, even the worst-case saving (~3.5µs) is a tiny fraction of the end-to-end request cost dominated by network I/O. Similarly, ~1.6 MB/s of short-lived Integer objects is comfortably handled by young-gen GC in modern JVMs.

Could you share your thinking on whether this showed up in any application-level profiling (flame graphs, p99 latency, GC logs), or if this was more of a code-quality improvement? Either motivation is fine — just want to make sure the PR description sets accurate expectations for future readers.

Benchmark methodology notes

A few things that would make the benchmark more representative:

  1. Increment and decrement hit different instances — pickInstance(ts) advances the index on each call, so in boxedCompute_t01, the increment hits instance N and decrement hits instance N+1. A real request would increment and decrement the same instance. Consider picking the instance once per iteration.

  2. Decrement uses decrementAndGet() instead of getAndUpdate() — The production code uses getAndUpdate(v -> Math.max(0, v - 1)) which involves a CAS loop with a lambda. The benchmark using the simpler decrementAndGet() slightly overstates the improvement on the decrement side.

  3. Increment skips computeIfAbsent — The production code calls computeIfAbsent(...).incrementAndGet() but the benchmark uses atomicMap.get(instance).incrementAndGet(). Since the map is pre-populated this is likely negligible, but for completeness it's worth matching the production code path.

These don't change the directional conclusion (AtomicInteger is faster), but would make the numbers more trustworthy.
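Point 1 above (pick the instance once, then run the full lifecycle against it) can be sketched like this. The instance list, URL format, and random selection are illustrative stand-ins for the benchmark's actual state and `pickInstance(ts)` logic, not the real benchmark code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

public class SameInstanceIterationSketch {
  private final String[] instances = new String[30];
  private final Map<String, AtomicInteger> atomicMap = new ConcurrentHashMap<>();

  public SameInstanceIterationSketch() {
    for (int i = 0; i < instances.length; i++) {
      instances[i] = "https://host" + i + ":1234"; // illustrative instance names
      atomicMap.put(instances[i], new AtomicInteger());
    }
  }

  // One benchmark iteration: pick the instance ONCE, then run the full request
  // lifecycle (increment + decrement) against that same instance, as a real
  // request would. Also matches the production code paths: increment goes
  // through computeIfAbsent, decrement through getAndUpdate with the floor.
  public int oneIteration() {
    String instance = instances[ThreadLocalRandom.current().nextInt(instances.length)];
    int inFlight = atomicMap.computeIfAbsent(instance, k -> new AtomicInteger()).incrementAndGet();
    atomicMap.get(instance).getAndUpdate(v -> Math.max(0, v - 1));
    return inFlight;
  }
}
```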

Overall: the code change is clean and correct. I think it's a nice improvement — just want to make sure we calibrate the performance narrative accurately. Looking forward to your thoughts!

Address review feedback on PR linkedin#2672:
- Use same instance for both inc and dec per iteration
- Atomic decrement uses getAndUpdate(v -> Math.max(0, v-1))
  to match decrementPendingCounter() exactly
- Remove stale counterResetConsumer comment
@ayush571995
Contributor Author


Thanks for the thorough review @sushantmane. You're right on all three benchmark issues — fixed in the latest commit. Here's a summary of what changed and the updated numbers.

Benchmark fixes applied:

  • Both increment and decrement now operate on the same instance per iteration, matching the production pattern
  • Atomic decrement now uses getAndUpdate(v -> Math.max(0, v - 1)) to mirror decrementPendingCounter() exactly
  • Removed stale counterResetConsumer comment

Updated results (corrected benchmark):

| Threads | Instances | compute() | AtomicInteger | Speedup |
|--------:|----------:|----------:|--------------:|--------:|
| 1 | 30 | 60 ns/op | 43 ns/op | 1.4x |
| 4 | 30 | 160 ns/op | 81 ns/op | 2.0x |
| 16 | 30 | 1033 ns/op | 152 ns/op | 6.8x |
| 32 | 30 | 2208 ns/op | 604 ns/op | 3.7x |

The GC numbers also corrected themselves — since both operations now hit the same instance, the counter
oscillates between 0–1 which stays inside the JVM integer cache, so allocation is near zero for both.
The lock contention story remains the primary argument: at realistic fast client concurrency
(16–32 threads, 30 instances), compute() is 4–7x slower per operation.
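The integer-cache point above rests on `Integer.valueOf` behavior, which this small demo illustrates. The JLS only guarantees caching for -128..127; the behavior outside that range is typical HotSpot behavior, not a spec guarantee, which is why only the in-cache case is stated as fact.

```java
public class IntegerCacheDemo {
  // Integer.valueOf must cache -128..127 per the JLS, so a counter
  // oscillating between 0 and 1 boxes to the same cached objects
  // and allocates nothing.
  static boolean inCacheSame() {
    return Integer.valueOf(1) == Integer.valueOf(1); // guaranteed true
  }

  // Outside the cache, each boxing typically yields a fresh heap object
  // on stock HotSpot, though the spec permits wider caching.
  static boolean outOfCacheSame() {
    return Integer.valueOf(1000) == Integer.valueOf(1000);
  }

  public static void main(String[] args) {
    System.out.println("in-cache identical: " + inCacheSame());
    System.out.println("out-of-cache identical: " + outOfCacheSame());
  }
}
```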

On the real-world impact question:

You're right that a single operation saving ~900 ns is invisible against a network RTT. The case here is cumulative:

  • This runs twice per request (increment on send, decrement on completion) on every GET and every
    scatter route in batch GET
  • Under high QPS, the lock serializes threads that could otherwise proceed independently — it's not just
    latency per-op but queueing delay stacking up across concurrent requests
  • The fix has zero risk (same semantics, existing tests pass, no config change) and the code is simpler —
    the Consumer<String> field is replaced by a straightforward private method

Agreed that without a flame graph showing this as a hotspot, the latency claim is speculative. Happy to
mark this as a low-priority cleanup rather than a performance fix if that framing is more appropriate.
