add monitoring to dealbot ops dashboard for cleanup job #500

@SgtPooki

Description

The piece_cleanup job (added in #346) deletes the oldest pieces from each SP once that SP exceeds MAX_DATASET_STORAGE_SIZE_BYTES. It already emits the signals we need (runs, deletions, runtime, no_progress events), but the ops dashboard does not surface them.

This issue extends the dashboard pattern from #89 (now closed) with cleanup-specific charts and alerts on BetterStack, so on-call can see a stuck or runaway cleanup loop without grepping logs.

Impact

Today, a problem with cleanup only surfaces through downstream symptoms: storage growth, deal failures, or rising over-quota counts. Root cause is several layers removed by then. Direct visibility shrinks the gap from "alert fired" to "we know why".

Proposed Approach

Add to the existing BetterStack dealbot ops dashboard.

Charts:

  1. Pieces deleted per SP per hour
  2. piece_cleanup_no_progress events per SP per hour
  3. Cleanup job runtime distribution, with a marker at MAX_PIECE_CLEANUP_RUNTIME_SECONDS
  4. Active over-quota SPs (gauge of SPs currently above MAX_DATASET_STORAGE_SIZE_BYTES)
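To illustrate what charts 1 and 2 would aggregate, here is a minimal TypeScript sketch that buckets events by SP and hour. The event shape (`CleanupEvent`, `spId`, `timestampMs`, `count`) is hypothetical and not the job's actual emitted format; on BetterStack this would most likely be a dashboard query rather than application code.

```typescript
// Hypothetical event shape; the real signal names and fields emitted by
// piece_cleanup may differ.
interface CleanupEvent {
  spId: string;
  timestampMs: number; // Unix epoch milliseconds
  count: number; // pieces deleted, or 1 for a single no_progress event
}

const HOUR_MS = 60 * 60 * 1000;

// Sum `count` per (SP, hour bucket). Keys have the form "<spId>|<hourStartMs>",
// where hourStartMs is the start of the hour in epoch milliseconds.
function bucketPerSpPerHour(events: CleanupEvent[]): Map<string, number> {
  const buckets = new Map<string, number>();
  for (const e of events) {
    const hourStart = Math.floor(e.timestampMs / HOUR_MS) * HOUR_MS;
    const key = `${e.spId}|${hourStart}`;
    buckets.set(key, (buckets.get(key) ?? 0) + e.count);
  }
  return buckets;
}
```

The same bucketing serves both charts: feed it deletion counts for chart 1, or one event per `piece_cleanup_no_progress` occurrence for chart 2.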

Alerts (route to the same Slack channel as the existing #89 alerts):

  • 3 consecutive piece_cleanup_no_progress events for the same SP (per #346, cleanup runs once per hour per SP by default, so this fires after ~3 hours stuck)
  • Job runtime reaches MAX_PIECE_CLEANUP_RUNTIME_SECONDS three times in any 6-hour window
  • An SP stays over MAX_DATASET_STORAGE_SIZE_BYTES for more than 24 hours
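The three alert conditions can be sketched as pure predicates. This is only an illustration under assumptions: the `CleanupRun` shape and all function names are hypothetical, the real alerts would be configured in BetterStack rather than written in code, and "any 6-hour window" is approximated here by a sliding window ending at the evaluation time.

```typescript
// Hypothetical record of one cleanup run; field names are illustrative only.
interface CleanupRun {
  finishedAtMs: number; // Unix epoch milliseconds
  runtimeSeconds: number;
  noProgress: boolean; // true if the run emitted piece_cleanup_no_progress
}

const MS_PER_HOUR = 60 * 60 * 1000;

// Rule 1: the most recent `threshold` runs for one SP all made no progress.
// `runs` must belong to a single SP and be ordered oldest first.
function isStuck(runs: CleanupRun[], threshold = 3): boolean {
  if (runs.length < threshold) return false;
  return runs.slice(-threshold).every((r) => r.noProgress);
}

// Rule 2: at least `threshold` runs reached the runtime cap within the
// window that ends at `nowMs`.
function hitsRuntimeCap(
  runs: CleanupRun[],
  maxRuntimeSeconds: number,
  nowMs: number,
  windowMs = 6 * MS_PER_HOUR,
  threshold = 3,
): boolean {
  const windowStart = nowMs - windowMs;
  const capped = runs.filter(
    (r) =>
      r.finishedAtMs > windowStart &&
      r.finishedAtMs <= nowMs &&
      r.runtimeSeconds >= maxRuntimeSeconds,
  );
  return capped.length >= threshold;
}

// Rule 3: the SP has been continuously over quota for longer than `limitMs`.
// `overQuotaSinceMs` is null when the SP is currently under quota.
function overQuotaTooLong(
  overQuotaSinceMs: number | null,
  nowMs: number,
  limitMs = 24 * MS_PER_HOUR,
): boolean {
  return overQuotaSinceMs !== null && nowMs - overQuotaSinceMs > limitMs;
}
```

Keeping the thresholds (3 events, 6 hours, 24 hours) as parameters makes it easy to tune them if the default cleanup interval changes.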

Acceptance

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)
    ready-for-work (Triaged: scope, plan, and DoD are clear; contributor can pick up)

    Projects

    Status
    🐱 Todo
