Skip to content

[metric] Add version swap disk size drop alert to detect unexpected data loss#2684

Open
jingy-li wants to merge 3 commits intolinkedin:mainfrom
jingy-li:add-alert
Open

[metric] Add version swap disk size drop alert to detect unexpected data loss#2684
jingy-li wants to merge 3 commits intolinkedin:mainfrom
jingy-li:add-alert

Conversation

@jingy-li
Copy link
Copy Markdown
Contributor

@jingy-li jingy-li commented Apr 1, 2026

Problem Statement

During a recent incident (ACTIONITEM-16176), Venice disk usage dropped from 40GB to near-zero (MB) with no alert or notification. The system lacked any mechanism to detect unexpected disk size drops between version swaps, meaning operators had no early warning when a new version contained significantly less data than expected.

Solution

Added a disk-size-drop alert metric (version_swap_disk_size_drop_alert) in AggVersionedStorageEngineStats that compares the current serving version's disk size against the incoming future version's disk size during version swap. When the future version's size drops below a configurable threshold (default 50%) of the current version, the metric records 1, enabling alerting via InGraph/Observe.

Key design decisions:

  • Server-side placement: The alert lives in AggVersionedStorageEngineStats because it requires direct access to RocksDB storage engine disk sizes, which are only available on the server (the controller has no disk size access).
  • PUSHED status guard: The comparison only runs when the future version has status PUSHED (ingestion complete), preventing false positives from partial data during STARTED status.
  • Configurable threshold: Controlled via SERVER_VERSION_SWAP_DISK_SIZE_DROP_ALERT_THRESHOLD config key, defaulting to 0.5 (50%).
  • Sensor lifecycle management: Alert sensors are properly deregistered from MetricsRepository on store deletion to prevent memory leaks.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Jingyan Li and others added 2 commits April 1, 2026 13:44
…r leak,

improve tests

- Add VersionStatus.PUSHED guard to prevent false positive alerts during
  STARTED status when future version is still ingesting partial data
- Fix sensor memory leak in handleStoreDeleted by calling
  metricsRepository.removeSensor() for the disk-size-drop alert sensor
- Change exception logging from DEBUG to WARN for safety mechanism visibility
- Add tests: STARTED status guard, sensor cleanup on deletion, exception
  path, and alert-fires-then-resets lifecycle
- Fix existing test to explicitly stub getVersions() and getVersion()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
// Only check when the future version has completed ingestion (PUSHED status).
// During STARTED, disk data is partial and would cause false positive alerts.
Version futureVersionObj = store.getVersion(futureVersion);
if (futureVersionObj == null || futureVersionObj.getStatus() != VersionStatus.PUSHED) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also check for online status?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The futureVersion from the parent class (AbstractVeniceAggVersionedStats.applyVersionInfo) is only set for versions with STARTED or PUSHED status. Once a version becomes ONLINE, it's tracked as currentVersion, not futureVersion — so getFutureVersion() returns NON_EXISTING_VERSION and we already exit early at line 122-125. The PUSHED check here is specifically to filter out STARTED (partial data during ingestion). Resulting ONLINE can never reach this point.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not right behavior. A version status can be ONLINE but still not current.

long futureSize = futureStats.getDiskUsageInBytes();

// Only alert if both versions have meaningful data
if (currentSize <= 0 || futureSize <= 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if futureSize == 0, we will return and never catch total data loss?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, nice catch! Updated in the next commit.

… bytes

Remove the futureSize <= 0 early-return guard. Since the PUSHED status
guard already ensures ingestion is complete, a future version with 0 bytes
is a genuine data loss signal (the exact scenario from ACTIONITEM-16176:
40G -> MB) and should trigger the alert. Only skip when currentSize <= 0
(e.g., first version of a store with no baseline to compare).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jingy-li jingy-li requested a review from majisourav99 April 2, 2026 16:50
@jingy-li jingy-li enabled auto-merge (squash) April 8, 2026 17:56
@jingy-li jingy-li disabled auto-merge April 8, 2026 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants