The piece_cleanup job (added in #346) deletes the oldest pieces from each SP once that SP exceeds MAX_DATASET_STORAGE_SIZE_BYTES. It already emits the signals we need (runs, deletions, runtime, no_progress events), but the ops dashboard does not surface them.
This issue extends the dashboard pattern from #89 (now closed) with cleanup-specific charts and alerts on BetterStack, so on-call can see a stuck or runaway cleanup loop without grepping logs.
Impact
Today, a cleanup problem surfaces only through downstream symptoms: storage growth, deal failures, or rising over-quota counts. By then the root cause is several layers removed. Direct visibility shrinks the gap from "alert fired" to "we know why".
Proposed Approach
Add to the existing BetterStack dealbot ops dashboard.
Charts:
Pieces deleted per SP per hour
piece_cleanup_no_progress events per SP per hour
Cleanup job runtime distribution, with a marker at MAX_PIECE_CLEANUP_RUNTIME_SECONDS
Active over-quota SPs (gauge of SPs currently above MAX_DATASET_STORAGE_SIZE_BYTES)
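As a rough illustration of the aggregations behind these charts, the sketch below buckets cleanup events per SP per hour and computes the over-quota gauge. The event shape (`ts`, `sp`, `event`), the `piece_deleted` event name, and the helper names are assumptions made for illustration; only `piece_cleanup_no_progress` and `MAX_DATASET_STORAGE_SIZE_BYTES` come from this issue, and the real log fields and BetterStack queries may look different.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical event records; the actual piece_cleanup log schema may differ.
EVENTS = [
    {"ts": "2025-01-01T10:05:00+00:00", "sp": "sp-a", "event": "piece_deleted"},
    {"ts": "2025-01-01T10:40:00+00:00", "sp": "sp-a", "event": "piece_deleted"},
    {"ts": "2025-01-01T11:02:00+00:00", "sp": "sp-a", "event": "piece_deleted"},
    {"ts": "2025-01-01T10:15:00+00:00", "sp": "sp-b", "event": "piece_cleanup_no_progress"},
]


def hour_bucket(ts: str) -> str:
    """Truncate an ISO-8601 timestamp to its UTC hour."""
    dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:00Z")


def count_per_sp_per_hour(events, event_name):
    """Count events of one type, keyed by (sp, hour bucket)."""
    counts = Counter()
    for e in events:
        if e["event"] == event_name:
            counts[(e["sp"], hour_bucket(e["ts"]))] += 1
    return dict(counts)


def over_quota_sps(usage_bytes, max_dataset_storage_size_bytes):
    """Return the SPs whose current usage is above the configured limit."""
    return sorted(
        sp for sp, used in usage_bytes.items()
        if used > max_dataset_storage_size_bytes
    )
```

The length of the list returned by `over_quota_sps` would feed the gauge, while the two per-hour counts map onto the first two charts.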
Alerts (route to the same Slack channel as the existing #89 alerts):
Acceptance