[metric] Add version swap disk size drop alert to detect unexpected data loss #2684
jingy-li wants to merge 3 commits into linkedin:main
…r leak, improve tests
- Add VersionStatus.PUSHED guard to prevent false positive alerts during STARTED status when future version is still ingesting partial data
- Fix sensor memory leak in handleStoreDeleted by calling metricsRepository.removeSensor() for the disk-size-drop alert sensor
- Change exception logging from DEBUG to WARN for safety mechanism visibility
- Add tests: STARTED status guard, sensor cleanup on deletion, exception path, and alert-fires-then-resets lifecycle
- Fix existing test to explicitly stub getVersions() and getVersion()
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
// Only check when the future version has completed ingestion (PUSHED status).
// During STARTED, disk data is partial and would cause false positive alerts.
Version futureVersionObj = store.getVersion(futureVersion);
if (futureVersionObj == null || futureVersionObj.getStatus() != VersionStatus.PUSHED) {
can we also check for online status?
The futureVersion from the parent class (AbstractVeniceAggVersionedStats.applyVersionInfo) is only set for versions with STARTED or PUSHED status. Once a version becomes ONLINE, it's tracked as currentVersion, not futureVersion, so getFutureVersion() returns NON_EXISTING_VERSION and we already exit early at lines 122-125. The PUSHED check here is specifically to filter out STARTED (partial data during ingestion). As a result, an ONLINE version can never reach this point.
That is not the right behavior. A version's status can be ONLINE while it is still not the current version.
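To make the two positions in this thread concrete, here is a minimal sketch of both guards side by side. The `Status` enum is a simplified stand-in for `VersionStatus`, and the class and method names are hypothetical; this is not the PR's code, only an illustration of what "also check for ONLINE" could look like if the reviewer's point holds.

```java
/** Illustrative sketch only: names are hypothetical, not taken from the PR. */
final class StatusGuardSketch {
  /** Simplified stand-in for VersionStatus; the real enum has more values. */
  enum Status { STARTED, PUSHED, ONLINE }

  private StatusGuardSketch() {}

  /** Guard as shown in the diff: only a PUSHED future version is compared. */
  static boolean passesPushedOnlyGuard(Status status) {
    return status == Status.PUSHED; // null and STARTED are both rejected
  }

  /** Variant along the lines of the review comment: PUSHED or ONLINE is compared. */
  static boolean passesPushedOrOnlineGuard(Status status) {
    return status == Status.PUSHED || status == Status.ONLINE;
  }
}
```

Both variants still reject STARTED, so partial data during ingestion would not be compared in either case. They differ only for a version that is ONLINE but not yet current.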
long futureSize = futureStats.getDiskUsageInBytes();

// Only alert if both versions have meaningful data
if (currentSize <= 0 || futureSize <= 0) {
so if futureSize == 0, we will return and never catch total data loss?
ah, nice catch! Updated in the next commit.
… bytes
Remove the futureSize <= 0 early-return guard. Since the PUSHED status guard already ensures ingestion is complete, a future version with 0 bytes is a genuine data loss signal (the exact scenario from ACTIONITEM-16176: 40G -> MB) and should trigger the alert. Only skip when currentSize <= 0 (e.g., first version of a store with no baseline to compare).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
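The effect of this commit can be sketched by comparing the early-return condition before and after the change. The class and method names below are hypothetical; only the two conditions come from the diff and the commit message.

```java
/** Illustrative sketch only: contrasts the early-return condition before and after the fix. */
final class EarlyReturnGuardSketch {
  private EarlyReturnGuardSketch() {}

  /** Original condition from the diff: skip when either size is non-positive. */
  static boolean skipsBeforeFix(long currentSize, long futureSize) {
    return currentSize <= 0 || futureSize <= 0;
  }

  /** Condition described in the commit message: skip only when there is no baseline. */
  static boolean skipsAfterFix(long currentSize, long futureSize) {
    // futureSize is intentionally ignored: 0 bytes after PUSHED is treated as a data loss signal.
    return currentSize <= 0;
  }
}
```

With a 40 GB current version and a 0-byte future version, the original condition returns early and the drop is never evaluated, while the revised condition lets the comparison run.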
Problem Statement
During a recent incident (ACTIONITEM-16176), Venice disk usage dropped from 40GB to near zero (on the order of megabytes) with no alert or notification. The system had no mechanism to detect unexpected disk size drops between version swaps, so operators got no early warning when a new version contained significantly less data than expected.
Solution
Added a disk-size-drop alert metric (version_swap_disk_size_drop_alert) in AggVersionedStorageEngineStats that compares the current serving version's disk size against the incoming future version's disk size during version swap. When the future version's size drops below a configurable threshold (default 50%) of the current version, the metric records 1, enabling alerting via InGraph/Observe.
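As a rough illustration of the comparison described above, the following sketch computes the 0/1 value the metric would record. It is a simplified stand-alone version under stated assumptions: the class name, method name, and constant are hypothetical, and the real logic lives in AggVersionedStorageEngineStats with the threshold read from configuration.

```java
/** Illustrative sketch only: names are hypothetical, not the PR's actual code. */
final class DiskSizeDropAlertSketch {
  /** Default from the PR description: alert when the future version is below 50% of current. */
  static final double DEFAULT_THRESHOLD_RATIO = 0.5;

  private DiskSizeDropAlertSketch() {}

  /**
   * Returns 1 when the future version's disk size is below thresholdRatio times the
   * current version's disk size, otherwise 0.
   */
  static int alertValue(long currentSizeBytes, long futureSizeBytes, double thresholdRatio) {
    if (currentSizeBytes <= 0) {
      return 0; // no baseline to compare against, e.g. the first version of a store
    }
    return futureSizeBytes < currentSizeBytes * thresholdRatio ? 1 : 0;
  }
}
```

For a 40 GB current version and the default ratio, any future version under 20 GB yields 1, including a future version of 0 bytes; a future version of exactly 20 GB or more yields 0.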
Key design decisions:
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
… synchronized, RWLock) are used where needed.
… ConcurrentHashMap, CopyOnWriteArrayList).
How was this PR tested?
Does this PR introduce any user-facing or breaking changes?