Add TTL jitter to avoid synchronized cache expiration (thundering herd) #154

Open
acaliskol wants to merge 1 commit into spiritix:master from acaliskol:feat/ttl-jitter

@acaliskol

Contributor

Why

When many cache entries are written within the same second, they expire in lockstep one TTL later, producing a synchronized DB miss wave — the classic thundering herd. With TTLs in the hours-to-days range (Lada's typical use), a single warm-up burst (deploy, scheduled job, traffic spike) can put 100–1000× the steady-state read load on the database at expiration time.

What

Perturb each positive TTL by ±N% before SET EX, spreading expirations uniformly over a ~(2 × N)% window of the configured TTL.

// Conceptually:
$effectiveTtl = $this->applyJitter($this->expirationTime);
$this->redis->set($key, $value, 'EX', $effectiveTtl);

Math sketch

With 1000 keys written in the same second and TTL = 3600s:

| Configuration | Peak DB miss rate at expiration |
| --- | --- |
| pct = 0 (current behavior) | 1000 / 1s = 1000 req/s ← spike |
| pct = 15 | 1000 / 1080s ≈ 0.93 req/s ← flat |
| pct = 50 | 1000 / 3600s ≈ 0.28 req/s |

The ±15% default trades a worst-case +15% staleness window for a ~3-order-of-magnitude reduction in peak DB pressure.
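
The arithmetic in the table can be reproduced with a few lines of PHP. This is only a sketch: `expectedMissRate()` is a hypothetical helper (not part of Lada or this patch), and it assumes expirations spread uniformly over a window of 2 × pct% of the TTL, with one second as the smallest possible window.

```php
<?php
declare(strict_types=1);

/**
 * Average DB miss rate (requests per second) when $keys entries written in
 * the same second expire uniformly over a window of 2 * pct% of the TTL.
 * Hypothetical helper for illustration only.
 */
function expectedMissRate(int $keys, int $ttlSeconds, int $pct): float
{
    // With pct = 0 the window collapses to the single second of the write burst.
    $windowSeconds = max(1, intdiv(2 * $pct * $ttlSeconds, 100));

    return $keys / (float) $windowSeconds;
}

foreach ([0, 15, 50] as $pct) {
    printf("pct = %2d: %.2f req/s\n", $pct, expectedMissRate(1000, 3600, $pct));
}
```

For 1000 keys and a 3600s TTL this gives windows of 1s, 1080s and 3600s, matching the three rows above.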

Behavior

| TTL | jitterPct | Result |
| --- | --- | --- |
| <= 0 (persist forever) | any | unchanged |
| > 0 | 0 | unchanged (legacy / disabled) |
| > 0 | (0, 100] | random_int(ttl - delta, ttl + delta), floor at 1 |
| > 0 | < 0 | clamped to 0 (disabled), no crash |
| > 0 | > 100 | clamped to 100 |

Floor of 1 guarantees SET EX never silently degrades into "persist forever" when random_int returns a value that would zero the TTL.
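
As a rough illustration of the rules above, the behavior could be sketched as a standalone function. The name `applyTtlJitter()`, the free-function form, and the integer rounding of `delta` are assumptions made for this example; the actual method in `src/Cache.php` may differ.

```php
<?php
declare(strict_types=1);

/**
 * Sketch of the jitter rules described above (assumed, not the PR's code).
 */
function applyTtlJitter(int $ttl, int $jitterPct): int
{
    // A non-positive TTL means "persist forever": pass it through untouched.
    if ($ttl <= 0) {
        return $ttl;
    }

    // Clamp the percentage into [0, 100]; negative values disable jitter.
    $pct = max(0, min(100, $jitterPct));
    if ($pct === 0) {
        return $ttl;
    }

    // Assumption: delta is the integer part of ttl * pct / 100.
    $delta = intdiv($ttl * $pct, 100);

    // Floor at 1 so the jittered TTL can never become 0.
    return max(1, random_int($ttl - $delta, $ttl + $delta));
}

var_dump(applyTtlJitter(0, 15));    // persist-forever TTL passes through
var_dump(applyTtlJitter(3600, -5)); // negative pct is clamped to 0 (disabled)
```

With `ttl = 3600` and `pct = 15`, `delta` is 540, so every result lies in the inclusive range 3060 to 4140.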

Configuration

// config/lada-cache.php
'ttl_jitter_pct' => (int) env('LADA_CACHE_TTL_JITTER_PCT', 15),
  • 0 → disabled (deterministic, legacy)
  • 15 → default (±15%)
  • 100 → maximum spread
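
To make the three settings concrete, the sketch below prints the inclusive TTL range each one would allow for a 3600s TTL. `jitterBounds()` is a hypothetical helper, and it assumes a positive TTL, `delta = intdiv(ttl * pct, 100)`, and the floor of 1 described under Behavior.

```php
<?php
declare(strict_types=1);

/**
 * Inclusive [min, max] effective TTL for a positive TTL and a jitter percentage.
 * Hypothetical helper for illustration; the rounding of delta is an assumption.
 */
function jitterBounds(int $ttl, int $pct): array
{
    $pct = max(0, min(100, $pct));      // clamp to [0, 100]
    $delta = intdiv($ttl * $pct, 100);  // half-width of the jitter window

    return [max(1, $ttl - $delta), $ttl + $delta];
}

foreach ([0, 15, 100] as $pct) {
    [$min, $max] = jitterBounds(3600, $pct);
    printf("ttl_jitter_pct = %3d: %d..%d s\n", $pct, $min, $max);
}
```

Under these assumptions the ranges are 3600..3600, 3060..4140 and 1..7200 seconds respectively.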

Default rationale

ttl_jitter_pct = 15 (on by default) follows Lada's existing production-safe & beneficial defaults convention — active, consider_rows, and enable_debugbar are all defaulted ON for the same reason.

Jitter sits cleanly in that class because Lada's scope is Eloquent query-result cache:

  • Exact-TTL primitives (distributed locks, OTPs, session expiry, rate limit windows) go through Laravel's Cache::lock() / Cache::put() / RateLimiter — not through Lada.
  • Time-bound business logic (auction end, coupon expiry, cooldown windows) is checked via wall-clock SQL conditions (expires_at < NOW()) — not Lada TTL.

So for the intended Lada use case, jitter is semantically transparent: a row that's "fresh enough" at 3600s is still fresh enough at 4140s, and an admin invalidate beats either timing anyway.

Users with exact-TTL needs in Lada's cache scope (rare) can opt out cleanly with 'ttl_jitter_pct' => 0. The CHANGELOG and config comment both flag this.

Compatibility

  • ✅ No breaking public API change. Cache::set() signature unchanged.
  • ✅ Constructor gains an optional 4th parameter (?int $jitterPct = null); existing instantiation calls keep working.
  • ✅ Existing tests pass unchanged.
  • ✅ Behavior change is observable only in TTL values, never in correctness — every cached row is still discoverable and invalidatable.
  • ⚠️ Users on PHP < 7.0 won't have random_int — but Lada already requires PHP 8.1+, so this is a non-issue.

Tests

7 new test methods in tests/Unit/CacheTest.php, all passing locally (PHP 8.5, PHPUnit 11.5, Redis 7):

PHPUnit 11.5.55 by Sebastian Bergmann and contributors.

.............                                                     13 / 13 (100%)

Time: 00:00.462, Memory: 40.50 MB
Tests: 13, Assertions: 290

Covers: disabled-passthrough, in-band sampling (50 iterations), non-degenerate randomness, persist-forever passthrough, negative-clamp, excessive-clamp, sub-second floor.
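
For readers who want a feel for what the in-band, non-degenerate and floor checks assert, here is a plain-PHP sketch without PHPUnit. The `$jitter` closure is a stand-in for the code under test and encodes an assumed rule (±pct% of the TTL, floor at 1); it is not the test code from this PR.

```php
<?php
declare(strict_types=1);

// Stand-in for the code under test (assumed rule: ±pct% of TTL, floor at 1).
$jitter = static function (int $ttl, int $pct): int {
    $delta = intdiv($ttl * max(0, min(100, $pct)), 100);

    return max(1, random_int($ttl - $delta, $ttl + $delta));
};

// In-band sampling: 50 draws for 3600s ±15% must all fall in [3060, 4140].
$samples = [];
for ($i = 0; $i < 50; $i++) {
    $samples[] = $jitter(3600, 15);
}
$inBand = min($samples) >= 3060 && max($samples) <= 4140;

// Non-degenerate randomness: the draws are not all the same value.
$distinct = count(array_unique($samples));

// Floor: even a 1s TTL with 100% jitter never drops below 1.
$floorHolds = $jitter(1, 100) >= 1;

var_dump($inBand, $distinct > 1, $floorHolds);
```

The non-degeneracy check is probabilistic, but with 1081 possible values and 50 draws a false failure is vanishingly unlikely.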

Notes

I'm running this patch in production on a fork (Laravel 12, 1.8M users, Octane + MariaDB + Redis). After turning it on, the DB CPU spikes around the daily cache cliff disappeared. Happy to share metrics if useful.

When many cache entries are written within the same second, they expire in lockstep one TTL later, producing a synchronized DB miss wave. This perturbs each positive TTL by a configurable percentage before SET EX, spreading expirations over the configured TTL window.

The branch is rebased onto upstream master so the Unreleased changelog keeps the current upstream entries and adds the TTL jitter note without a merge commit.

Tested: rg -n '<<<<<<<|=======|>>>>>>>' CHANGELOG.md config/lada-cache.php src/Cache.php tests/Unit/CacheTest.php; git diff --cached --check
Not-tested: Package test suite after squash
@codecov

codecov Bot commented May 28, 2026


Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 78.05%. Comparing base (b1a216f) to head (2aa094a).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/Cache.php | 91.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #154      +/-   ##
============================================
+ Coverage     77.90%   78.05%   +0.15%     
- Complexity      270      274       +4     
============================================
  Files            26       26              
  Lines           792      802      +10     
============================================
+ Hits            617      626       +9     
- Misses          175      176       +1     


@spiritix

Owner

Tagging some contributors here to discuss this feature proposal. Let me know what you guys think! Is this the right approach in your opinion? I'd like to get the community more involved for directional changes like this.

@kontainer-dam-pim @Tim-streamline @zgetro @duyphuongn @MGApcDev @michael-rubel @ogunsakin01 @diegotibi

@michael-rubel


@spiritix Seems like a reasonable feature to avoid spikes in production. I'd make it optional, though.
