Add TTL jitter to avoid synchronized cache expiration (thundering herd) #154
Open
acaliskol wants to merge 1 commit into
When many cache entries are written within the same second, they expire in lockstep one TTL later, producing a synchronized DB miss wave. This perturbs each positive TTL by a configurable percentage before SET EX, spreading expirations over the configured TTL window. The branch is rebased onto upstream master so the Unreleased changelog keeps the current upstream entries and adds the TTL jitter note without a merge commit.
Tested: rg -n '<<<<<<<|=======|>>>>>>>' CHANGELOG.md config/lada-cache.php src/Cache.php tests/Unit/CacheTest.php; git diff --cached --check
Not-tested: Package test suite after squash
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #154 +/- ##
============================================
+ Coverage 77.90% 78.05% +0.15%
- Complexity 270 274 +4
============================================
Files 26 26
Lines 792 802 +10
============================================
+ Hits 617 626 +9
- Misses 175 176 +1
120dd87 to 2aa094a
Owner
Tagging some contributors here to discuss this feature proposal. Let me know what you guys think! Is this the right approach in your opinion? I'd like to get the community more involved for directional changes like this. @kontainer-dam-pim @Tim-streamline @zgetro @duyphuongn @MGApcDev @michael-rubel @ogunsakin01 @diegotibi
@spiritix Seems like a reasonable feature to avoid spikes in production. I'd make it optional, though.
Why
When many cache entries are written within the same second, they expire in lockstep one TTL later, producing a synchronized DB miss wave — the classic thundering herd. With TTLs in the hours-to-days range (Lada's typical use), a single warm-up burst (deploy, scheduled job, traffic spike) can put 100–1000× the steady-state read load on the database at expiration time.
What
Perturb each positive TTL by ±N% before SET EX, spreading expirations uniformly over a ~(2 × N)% window of the configured TTL.
Math sketch
With 1000 keys written in the same second and TTL = 3600s:
The ±15% default trades a worst-case +15% staleness window for a ~3-order-of-magnitude reduction in peak DB pressure.
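The arithmetic behind this example can be sketched as follows. This is an illustration, not code from the PR: it assumes delta is computed as ttl × pct / 100 and that jittered TTLs are drawn uniformly from whole seconds in the resulting window.

```php
<?php
declare(strict_types=1);

// Inputs taken from the example: 1000 keys, TTL = 3600s, jitter of ±15%.
$keys = 1000;
$ttl  = 3600;
$pct  = 15;

// Assumption: delta = ttl * pct / 100 (integer seconds).
$delta  = intdiv($ttl * $pct, 100);  // 540
$min    = $ttl - $delta;             // 3060
$max    = $ttl + $delta;             // 4140
$window = $max - $min + 1;           // 1081 possible whole-second TTLs (inclusive)

// Average expirations per second once the keys are spread over the window.
$avgPerSecond = $keys / $window;

printf("window: %d..%d s (%d distinct values)\n", $min, $max, $window);
printf("without jitter: %d expirations in a single second\n", $keys);
printf("with jitter: about %.2f expirations per second on average\n", $avgPerSecond);
```

Under these assumptions the 1000 simultaneous expirations become roughly one per second across an 18-minute window, which is where the three-orders-of-magnitude figure comes from. The upper bound of 4140s is also the worst-case staleness the default accepts.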
Behavior
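TTL <= 0 (persist forever): passed through unchanged.
TTL > 0, jitter 0: passed through unchanged (jitter disabled).
TTL > 0, jitter in (0, 100]: random_int(ttl - delta, ttl + delta), floor at 1.
TTL > 0, jitter < 0: jitter percentage is clamped.
TTL > 0, jitter > 100: jitter percentage is clamped.
The floor of 1 guarantees SET EX never silently degrades into "persist forever" when random_int returns a value that would zero the TTL.
Configuration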
0 → disabled (deterministic, legacy)
15 → default (±15%)
100 → maximum spread
Default rationale
ttl_jitter_pct = 15 (on by default) follows Lada's existing convention of production-safe, beneficial defaults: active, consider_rows, and enable_debugbar are all defaulted ON for the same reason.
Jitter sits cleanly in that class because Lada's scope is the Eloquent query-result cache:
… Cache::lock() / Cache::put() / RateLimiter, not through Lada.
… expires_at < NOW()), not Lada TTL.
So for the intended Lada use case, jitter is semantically transparent: a row that's "fresh enough" at 3600s is still fresh enough at 4140s, and an admin invalidate beats either timing anyway.
Users with exact-TTL needs in Lada's cache scope (rare) can opt out cleanly with 'ttl_jitter_pct' => 0. The CHANGELOG and config comment both flag this.
Compatibility
Cache::set() signature unchanged.
… (?int $jitterPct = null); existing instantiation calls keep working.
… random_int, but Lada already requires PHP 8.1+, so this is a non-issue.
Tests
7 new test methods in tests/Unit/CacheTest.php, all passing locally (PHP 8.5, PHPUnit 11.5, Redis 7).
Covers: disabled-passthrough, in-band sampling (50 iterations), non-degenerate randomness, persist-forever passthrough, negative-clamp, excessive-clamp, sub-second floor.
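For readers who want to see the cases above in one place, here is a minimal standalone sketch. The helper name jitterTtl is hypothetical and this is not the code in src/Cache.php. It assumes out-of-range percentages are clamped into [0, 100] and that delta is the integer part of ttl × pct / 100.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not the PR's implementation): applies ±$pct% jitter to a TTL.
 */
function jitterTtl(int $ttl, int $pct): int
{
    // Assumption: negative or excessive percentages are clamped into [0, 100].
    $pct = max(0, min(100, $pct));

    // Non-positive TTLs (persist forever) and disabled jitter pass through untouched.
    if ($ttl <= 0 || $pct === 0) {
        return $ttl;
    }

    // Assumption: delta is the integer part of ttl * pct / 100.
    $delta = intdiv($ttl * $pct, 100);

    // Floor at 1 so a positive TTL can never be sampled down to 0.
    return max(1, random_int($ttl - $delta, $ttl + $delta));
}

// In-band sampling: 50 draws at ±15% of 3600 must stay within 3060..4140.
for ($i = 0; $i < 50; $i++) {
    $value = jitterTtl(3600, 15);
    if ($value < 3060 || $value > 4140) {
        throw new RuntimeException("TTL {$value} fell outside the expected band");
    }
}

echo "all 50 samples in band\n";
```

The sub-second floor matters for very short TTLs: with a TTL of 1 and 100% jitter, random_int(0, 2) can return 0, and the max(1, ...) guard keeps that from turning into a non-expiring key.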
Notes
I'm running this patch in production on a fork (Laravel 12, 1.8M users, Octane + MariaDB + Redis). After turning it on we observed DB CPU spikes around the daily cache cliff disappearing — happy to share metrics if useful.