Conversation
* <p>This class is designed to run on a single-threaded {@link ScheduledExecutorService}
* and requires no synchronization.</p>
*/
private class EventLoop
It would be best to avoid modifying preexisting code and instead create a new class with the different behavior. It would be even better if we can add a config flag to control ramping.
* If there are more tasks than the max in the buffer, they'll be immediately executed to align with the limit
* <p>
* Event loop is meant to be run in a single-threaded setting.
* <p>Permits are refreshed every {@link Rate#getPeriodRaw()} milliseconds using fractional
Could you comment on the distributed behavior of this new algorithm? Consider that there are multiple client hosts. For example, if the target QPS is 1 and there are 4 hosts on the client side, will we see 4 calls arrive simultaneously at the end of a 4-second period? Is this behavior the same as the current algorithm's?
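One way to make the scenario in this question concrete (illustrative arithmetic only, not library code): if the cluster-wide target QPS is split evenly across client hosts, each host's fixed-period scheduler fires once every hosts/target seconds. Whether those dispatches actually cluster at period boundaries depends on each host's timer phase.

```java
// Per-host dispatch period when a cluster-wide target QPS is split evenly
// across client hosts. Illustrative only; whether calls arrive
// simultaneously depends on how the hosts' timers are phased.
final class PerHostPeriod {
  static double periodSeconds(double clusterTargetQps, int hosts) {
    // Each host runs at clusterTargetQps / hosts QPS, i.e. one dispatch
    // every hosts / clusterTargetQps seconds.
    return hosts / clusterTargetQps;
  }
}
```

With a target of 1 QPS and 4 hosts, each host dispatches once every 4 seconds, matching the 4-second period in the question.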
* @param enablePrecisePeriodTracking When {@code true}, uses double-precision period and permit tracking
* to eliminate millisecond quantization errors. When {@code false}, uses
* integer-rounded period and permit values. Defaults to {@code false}.
If the issue is just integer rounding of milliseconds, wouldn't the simplest solution be to change the time source from milliseconds to nanoseconds (and maybe change integers to longs)?
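The reviewer's point can be illustrated with standalone arithmetic (not the library's code): a 750 QPS period cannot be expressed in whole milliseconds, but a nanosecond time source represents it with negligible error.

```java
// Compares the dispatch period at millisecond vs nanosecond granularity.
// Pure arithmetic to illustrate the review comment; not library code.
final class PeriodUnits {
  // Ideal period rounded to whole milliseconds (loses sub-ms precision).
  static long periodMillis(double qps) {
    return Math.round(1000.0 / qps);
  }

  // Same period in nanoseconds (sub-ms remainder preserved).
  static long periodNanos(double qps) {
    return Math.round(1_000_000_000.0 / qps);
  }
}
```

At 750 QPS the ideal period is 1.333 ms: rounding to 1 ms overshoots the rate by 33%, while the nanosecond representation (1,333,333 ns) is off by well under a part per million.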
if (_enablePrecisePeriodTracking)
{
  // Advance by exact fractional period so sub-millisecond remainders accumulate.
  _permitTime += period;
How large is the period? Could this cause drift?
if (_enablePrecisePeriodTracking)
{
  _permitAvailableCount = Math.max(_permitAvailableCount, 1.0);
}
else
{
  _permitAvailableCount++;
Why is the precise branch different from the non-precise branch?
else if (_enablePrecisePeriodTracking)
{
  // Round up so the scheduler never fires before the fractional boundary.
  nextRunRelativeTime = Math.max(1, (long) Math.ceil(_permitTime + period - now));
}
else
{
  nextRunRelativeTime = Math.max(0, (long) (_permitTime + period) - now);
Why is the precise branch different?
_permitAvailableCount = rate.getEvents();
_permitsInTimeFrame = rate.getEvents();
_permitAvailableCount = _enablePrecisePeriodTracking ? rate.getEventsRaw() : rate.getEvents();
_permitsInTimeFrame = _enablePrecisePeriodTracking ? rate.getEventsRaw() : rate.getEvents();
What is the difference between the raw and non-raw APIs?
Problem Statement
When setting a target QPS for dark cluster forking, the observed dispatch rate does not match the configured target. The error could be as high as +33% or -25%, depending on the target QPS value.
Test Setup
Source cluster: IRPS-irps-feedstorage-test17-0 (1 pod)
Dark cluster: IRPS-irps-feedstorage-test9-0 (1 pod)
Incoming traffic: ~350-380 req/s (consistent across all tests)
Buffer: size=2000, TTL=10 seconds for ConstantQpsRateLimiter
Cluster size ratio: 1:1
Root Cause Analysis
3.1 How the Rate Limiter Works
The dispatch chain is:
IrpRcService.trafficRecord()
→ ConstantQpsForkingStrategy.handleRequest() [~350/s incoming]
→ ConstantQPSDarkClusterStrategy.handleUnaryRequest()
→ rateLimiter.submit(callback) [adds to circular buffer]
→ EventLoop dispatches from buffer at rate [observed QPS]
→ IrpBaseDarkClusterDispatcher.unaryCall() [actual dark cluster call]
3.2 Key Finding: Buffer Replays Requests
The EvictingCircularBuffer.get() does not remove items from the buffer. Items stay until they expire (TTL=10s) or are overwritten by newer items. The rate limiter's event loop circles through the buffer, re-dispatching the same requests to maintain the target rate even when incoming traffic is lower than the target.
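The replay semantics described above can be sketched with a simplified buffer. This is illustrative only: TTL-based eviction is omitted, and the names here are not the actual EvictingCircularBuffer API.

```java
// Simplified sketch of replay semantics: get() cycles through stored items
// without removing them, so the same request can be re-dispatched when
// incoming traffic is below the target rate. TTL eviction omitted; names
// are illustrative, not the real EvictingCircularBuffer API.
final class ReplayBuffer<T> {
  private final Object[] items;
  private int size;        // total puts so far
  private int readCursor;  // advances on every get(), wrapping around

  ReplayBuffer(int capacity) {
    this.items = new Object[capacity];
  }

  void put(T item) {
    // Newer items overwrite the oldest slot once the buffer is full.
    items[size % items.length] = item;
    size++;
  }

  @SuppressWarnings("unchecked")
  T get() {
    // Does NOT remove the item: wraps around and replays older entries.
    T item = (T) items[readCursor % Math.min(size, items.length)];
    readCursor++;
    return item;
  }
}
```

With two items buffered, repeated get() calls cycle a, b, a, b, ... which is what lets the event loop hold the target dispatch rate above the incoming rate.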
4.1 Requirement
The rate limiter must dispatch at rates that are not constrained to integer-ms periods. For a target of 750 QPS, the ideal period is 1.333ms -- the fix must achieve sub-millisecond timing precision without changing the SI library's public API.
5.2 Algorithm: Fractional Permit Accumulation
Instead of refilling a fixed burst of permits every Math.round(period) ms, accumulate fractional permits each millisecond based on the exact rate.
The key insight: run the event loop at a fixed 1ms tick, but track permits as a double and accumulate targetQps / 1000.0 permits per tick.
permitsPerMs = targetQps / 1000.0
Every 1ms tick:
permitBalance += permitsPerMs
while permitBalance >= 1.0:
dispatch one request
permitBalance -= 1.0
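The tick loop above can be written as a small self-contained class. The class and method names here are illustrative, not the actual ConstantQpsRateLimiter API:

```java
// Minimal sketch of fractional permit accumulation: the event loop ticks
// every 1ms, permits accumulate as a double, and whole permits are spent
// on dispatches. Names are illustrative, not the SI library's API.
final class FractionalPermitClock {
  private final double permitsPerMs; // targetQps / 1000.0
  private double permitBalance;      // fractional permits carried across ticks

  FractionalPermitClock(double targetQps) {
    this.permitsPerMs = targetQps / 1000.0;
  }

  /** Called once per 1ms tick; returns how many requests to dispatch now. */
  int onTick() {
    permitBalance += permitsPerMs;
    int dispatched = 0;
    while (permitBalance >= 1.0) {
      dispatched++;
      permitBalance -= 1.0;
    }
    return dispatched;
  }
}
```

Over 1000 ticks at a 750 QPS target this dispatches exactly 750 requests: the 0.75-permit remainder left after each tick carries over instead of being rounded away, which is the whole point of the algorithm.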