
Stable Constant Qps#1144

Open
jcleezer wants to merge 17 commits into master from oyadav/stable-constant-qps

Conversation

@jcleezer
Contributor

  1. Problem Statement
    When setting a target QPS for dark cluster forking, the observed dispatch rate does not match the configured target. The error could be as high as +33% or -25%, depending on the target QPS value.

  2. Test Setup
    Source cluster: IRPS-irps-feedstorage-test17-0 (1 pod)
    Dark cluster: IRPS-irps-feedstorage-test9-0 (1 pod)
    Incoming traffic: ~350-380 req/s (consistent across all tests)
    Buffer: size=2000, TTL=10 seconds for ConstantQpsRateLimiter
    Cluster size ratio: 1:1

  3. Root Cause Analysis
    3.1 How the Rate Limiter Works
    The dispatch chain is:

IrpRcService.trafficRecord()
→ ConstantQpsForkingStrategy.handleRequest() [~350/s incoming]
→ ConstantQPSDarkClusterStrategy.handleUnaryRequest()
→ rateLimiter.submit(callback) [adds to circular buffer]
→ EventLoop dispatches from buffer at rate [observed QPS]
→ IrpBaseDarkClusterDispatcher.unaryCall() [actual dark cluster call]
3.2 Key Finding: Buffer Replays Requests
The EvictingCircularBuffer.get() method does not remove items from the buffer. Items stay until they expire (TTL=10s) or are overwritten by newer items. The rate limiter's event loop cycles through the buffer, re-dispatching the same requests to maintain the target rate even when incoming traffic is lower than the target.
From the SI source code comment:

"should only be used in cases where the user demands a constant rate of callback execution, and it's not important that all callbacks are executed, or executed only once."
This is by design for dark cluster testing.
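To make the replay semantics concrete, here is a minimal sketch of a ring buffer with a non-consuming get(). The class name and fields are hypothetical simplifications; the real EvictingCircularBuffer in the SI library also handles the TTL-based expiry described above, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a replaying ring buffer: get() returns an element
 * WITHOUT removing it, so a reader that keeps calling get() cycles over
 * the stored items indefinitely. TTL expiry is omitted for brevity.
 */
class ReplayingRingBuffer<T>
{
  private final List<T> _items;
  private final int _capacity;
  private int _writeIndex = 0;
  private int _readIndex = 0;

  ReplayingRingBuffer(int capacity)
  {
    _capacity = capacity;
    _items = new ArrayList<>(capacity);
  }

  /** Overwrites the oldest slot once the buffer is full. */
  void put(T item)
  {
    if (_items.size() < _capacity)
    {
      _items.add(item);
    }
    else
    {
      _items.set(_writeIndex, item);
    }
    _writeIndex = (_writeIndex + 1) % _capacity;
  }

  /** Returns the next element without removing it; wraps at the end. */
  T get()
  {
    T item = _items.get(_readIndex);
    _readIndex = (_readIndex + 1) % _items.size();
    return item;
  }
}
```

With two stored requests and repeated get() calls, the reader sees r1, r2, r1, r2, ... — which is how the event loop can sustain a dispatch rate above the incoming request rate.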

  4. Proposed Fix for SI Library
    4.1 Requirement
    The rate limiter must dispatch at rates that are not constrained to integer-ms periods. For a target of 750 QPS, the ideal period is 1.333ms -- the fix must achieve sub-millisecond timing precision without changing the SI library's public API.
    4.2 Algorithm: Fractional Permit Accumulation
    Instead of refilling a fixed burst of permits every Math.round(period) ms, accumulate fractional permits each millisecond based on the exact rate.
    The key insight: run the event loop at a fixed 1ms tick, but track permits as a double and accumulate targetQps / 1000.0 permits per tick.

permitsPerMs = targetQps / 1000.0
every 1ms tick:
    permitBalance += permitsPerMs
    while permitBalance >= 1.0:
        dispatch one request
        permitBalance -= 1.0
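The tick loop above can be sketched as a deterministic simulation. The class and method names here are hypothetical (the real change lives in the SI EventLoop, which also handles scheduling and burst behavior); this only illustrates the accumulation arithmetic. Note that with the old approach, a 750 QPS target rounds its 1.333ms period to 1ms and dispatches 1000/s (+33%), while the fractional balance hits the target exactly.

```java
/**
 * Sketch of the fractional-permit accumulation loop: the event loop runs
 * at a fixed 1 ms tick, tracks the permit balance as a double, and only
 * dispatches whole permits, carrying the fractional remainder forward.
 */
class FractionalPermitSimulator
{
  static long dispatchesOver(double targetQps, int ticksMs)
  {
    double permitsPerMs = targetQps / 1000.0;
    double permitBalance = 0.0;
    long dispatched = 0;
    for (int tick = 0; tick < ticksMs; tick++)
    {
      permitBalance += permitsPerMs;  // accumulate fractional permits
      while (permitBalance >= 1.0)    // dispatch whole permits only
      {
        dispatched++;
        permitBalance -= 1.0;
      }
    }
    return dispatched;
  }
}
```

For a 750 QPS target simulated over one second (1000 ticks), the loop dispatches 750 requests: 0.75 of a permit accrues per tick, and the sub-millisecond remainder is never discarded.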

* <p>This class is designed to run on a single-threaded {@link ScheduledExecutorService}
* and requires no synchronization.</p>
*/
private class EventLoop
Contributor


It would be best to avoid modifying preexisting code and instead create a new class with the different behavior. It would be even better if we can add a config flag to control ramping.

* If there are more tasks than the max in the buffer, they'll be immediately executed to align with the limit
* <p>
* Event loop is meant to be run in a single-threaded setting.
* <p>Permits are refreshed every {@link Rate#getPeriodRaw()} milliseconds using fractional
Contributor


Could you comment on the distributed behavior of this new algorithm? Consider that there are multiple client hosts: e.g. if the target QPS is 1 and there are 4 hosts on the client side, will we see 4 calls arrive simultaneously at the end of each 4-second period? Is this behavior the same as the current algorithm's?

Jason Leezer and others added 3 commits March 26, 2026 14:17
…ges of smoothRatelimiter for ramp (#1157)

Co-authored-by: Omprakash Yadav <oyadav@linkedin.com>
Comment on lines +139 to +141
* @param enablePrecisePeriodTracking When {@code true}, uses double-precision period and permit tracking
* to eliminate millisecond quantization errors. When {@code false}, uses
* integer-rounded period and permit values. Defaults to {@code false}.
Contributor


If the issue is just integer rounding of milliseconds, wouldn't the simplest solution be to change the time source from milliseconds to nanoseconds (and maybe change integers to longs)?

if (_enablePrecisePeriodTracking)
{
// Advance by exact fractional period so sub-millisecond remainders accumulate.
_permitTime += period;
Contributor


How large is the period? Could this cause drift?

Comment on lines +360 to +366
if (_enablePrecisePeriodTracking)
{
_permitAvailableCount = Math.max(_permitAvailableCount, 1.0);
}
else
{
_permitAvailableCount++;
Contributor


Why is the precise branch different from the non-precise branch here?

Comment on lines +415 to +422
else if (_enablePrecisePeriodTracking)
{
// Round up so the scheduler never fires before the fractional boundary.
nextRunRelativeTime = Math.max(1, (long) Math.ceil(_permitTime + period - now));
}
else
{
nextRunRelativeTime = Math.max(0, (long) (_permitTime + period) - now);
Contributor


Why is the precise branch different here?

_permitAvailableCount = rate.getEvents();
_permitsInTimeFrame = rate.getEvents();
_permitAvailableCount = _enablePrecisePeriodTracking ? rate.getEventsRaw() : rate.getEvents();
_permitsInTimeFrame = _enablePrecisePeriodTracking ? rate.getEventsRaw() : rate.getEvents();
Contributor


What is the difference between the raw and non-raw APIs?

