
Stable Constant Qps#1144

Open
jcleezer wants to merge 17 commits into master from oyadav/stable-constant-qps

Conversation

@jcleezer
Contributor

  1. Problem Statement
    When setting a target QPS for dark cluster forking, the observed dispatch rate does not match the configured target. The error could be as high as +33% or -25%, depending on the target QPS value.

  2. Test Setup
    Source cluster: IRPS-irps-feedstorage-test17-0 (1 pod)
    Dark cluster: IRPS-irps-feedstorage-test9-0 (1 pod)
    Incoming traffic: ~350-380 req/s (consistent across all tests)
    Buffer: size=2000, TTL=10 seconds for ConstantQpsRateLimiter
    Cluster size ratio: 1:1

  3. Root Cause Analysis
    3.1 How the Rate Limiter Works
    The dispatch chain is:

IrpRcService.trafficRecord()
→ ConstantQpsForkingStrategy.handleRequest() [~350/s incoming]
→ ConstantQPSDarkClusterStrategy.handleUnaryRequest()
→ rateLimiter.submit(callback) [adds to circular buffer]
→ EventLoop dispatches from buffer at rate [observed QPS]
→ IrpBaseDarkClusterDispatcher.unaryCall() [actual dark cluster call]
3.2 Key Finding: Buffer Replays Requests
The EvictingCircularBuffer.get() method does not remove items from the buffer. Items stay until they expire (TTL=10s) or are overwritten by newer items. The rate limiter's event loop cycles through the buffer, re-dispatching the same requests to maintain the target rate even when incoming traffic is lower than the target.
From the SI source code comment:

"should only be used in cases where the user demands a constant rate of callback execution, and it's not important that all callbacks are executed, or executed only once."
This is by design for dark cluster testing.
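To make the replay semantics concrete, here is a minimal sketch of a ring buffer with a non-consuming get(). The class name and fields are hypothetical simplifications; the real EvictingCircularBuffer in the SI library also handles the TTL-based expiry described above, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a replaying ring buffer: get() returns an element
 * WITHOUT removing it, so a reader that keeps calling get() cycles over
 * the stored items indefinitely. TTL expiry is omitted for brevity.
 */
class ReplayingRingBuffer<T>
{
  private final List<T> _items;
  private final int _capacity;
  private int _writeIndex = 0;
  private int _readIndex = 0;

  ReplayingRingBuffer(int capacity)
  {
    _capacity = capacity;
    _items = new ArrayList<>(capacity);
  }

  /** Overwrites the oldest slot once the buffer is full. */
  void put(T item)
  {
    if (_items.size() < _capacity)
    {
      _items.add(item);
    }
    else
    {
      _items.set(_writeIndex, item);
    }
    _writeIndex = (_writeIndex + 1) % _capacity;
  }

  /** Returns the next element without removing it; wraps at the end. */
  T get()
  {
    T item = _items.get(_readIndex);
    _readIndex = (_readIndex + 1) % _items.size();
    return item;
  }
}
```

With two stored requests and repeated get() calls, the reader sees r1, r2, r1, r2, ... — which is how the event loop can sustain a dispatch rate above the incoming request rate.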

  4. Proposed Fix for SI Library
    4.1 Requirement
    The rate limiter must dispatch at rates that are not constrained to integer-ms periods. For a target of 750 QPS, the ideal period is 1.333ms -- the fix must achieve sub-millisecond timing precision without changing the SI library's public API.
    4.2 Algorithm: Fractional Permit Accumulation
    Instead of refilling a fixed burst of permits every Math.round(period) ms, accumulate fractional permits each millisecond based on the exact rate.
    The key insight: run the event loop at a fixed 1ms tick, but track permits as a double and accumulate targetQps / 1000.0 permits per tick.

permitsPerMs = targetQps / 1000.0
every 1ms tick:
    permitBalance += permitsPerMs
    while permitBalance >= 1.0:
        dispatch one request
        permitBalance -= 1.0
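The tick loop above can be sketched as a deterministic simulation. The class and method names here are hypothetical (the real change lives in the SI EventLoop, which also handles scheduling and burst behavior); this only illustrates the accumulation arithmetic. Note that with the old approach, a 750 QPS target rounds its 1.333ms period to 1ms and dispatches 1000/s (+33%), while the fractional balance hits the target exactly.

```java
/**
 * Sketch of the fractional-permit accumulation loop: the event loop runs
 * at a fixed 1 ms tick, tracks the permit balance as a double, and only
 * dispatches whole permits, carrying the fractional remainder forward.
 */
class FractionalPermitSimulator
{
  static long dispatchesOver(double targetQps, int ticksMs)
  {
    double permitsPerMs = targetQps / 1000.0;
    double permitBalance = 0.0;
    long dispatched = 0;
    for (int tick = 0; tick < ticksMs; tick++)
    {
      permitBalance += permitsPerMs;  // accumulate fractional permits
      while (permitBalance >= 1.0)    // dispatch whole permits only
      {
        dispatched++;
        permitBalance -= 1.0;
      }
    }
    return dispatched;
  }
}
```

For a 750 QPS target simulated over one second (1000 ticks), the loop dispatches 750 requests: 0.75 of a permit accrues per tick, and the sub-millisecond remainder is never discarded.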

* <p>This class is designed to run on a single-threaded {@link ScheduledExecutorService}
* and requires no synchronization.</p>
*/
private class EventLoop
Contributor


It would be best to avoid modifying preexisting code and instead create a new class with the different behavior. It would be even better if we can add a config flag to control ramping.

* If there are more tasks than the max in the buffer, they'll be immediately executed to align with the limit
* <p>
* Event loop is meant to be run in a single-threaded setting.
* <p>Permits are refreshed every {@link Rate#getPeriodRaw()} milliseconds using fractional
Contributor


Could you comment on the distributed behavior of this new algorithm? Consider that there are multiple client hosts: e.g. if the target QPS is 1 and there are 4 hosts on the client side, will we see 4 calls arrive simultaneously at the end of each 4-second period? Is this behavior the same as the current algorithm's?

Jason Leezer and others added 3 commits March 26, 2026 14:17
…ges of smoothRatelimiter for ramp (#1157)

Co-authored-by: Omprakash Yadav <oyadav@linkedin.com>
Comment on lines +139 to +141
* @param enablePrecisePeriodTracking When {@code true}, uses double-precision period and permit tracking
* to eliminate millisecond quantization errors. When {@code false}, uses
* integer-rounded period and permit values. Defaults to {@code false}.
Contributor


If the issue is just integer rounding of milliseconds, wouldn't the simplest solution be to change the time source from milliseconds to nanoseconds (and maybe change integers to longs)?

if (_enablePrecisePeriodTracking)
{
// Advance by exact fractional period so sub-millisecond remainders accumulate.
_permitTime += period;
Contributor


How large is the period? Could this cause drift?

Comment on lines +360 to +366
if (_enablePrecisePeriodTracking)
{
_permitAvailableCount = Math.max(_permitAvailableCount, 1.0);
}
else
{
_permitAvailableCount++;
Contributor


Why is the precise branch different from the non-precise branch here?

Comment on lines +415 to +422
else if (_enablePrecisePeriodTracking)
{
// Round up so the scheduler never fires before the fractional boundary.
nextRunRelativeTime = Math.max(1, (long) Math.ceil(_permitTime + period - now));
}
else
{
nextRunRelativeTime = Math.max(0, (long) (_permitTime + period) - now);
Contributor


Why is the precise branch different here?

_permitAvailableCount = rate.getEvents();
_permitsInTimeFrame = rate.getEvents();
_permitAvailableCount = _enablePrecisePeriodTracking ? rate.getEventsRaw() : rate.getEvents();
_permitsInTimeFrame = _enablePrecisePeriodTracking ? rate.getEventsRaw() : rate.getEvents();
Contributor


What is the difference between the raw and non-raw APIs?

