Skip to content

Latest commit

 

History

History
87 lines (60 loc) · 2.71 KB

File metadata and controls

87 lines (60 loc) · 2.71 KB

🚨 Error Budget Management

This guide provides information about error budget management in the ML Systems Evaluation Framework, including definitions, calculation, monitoring, and best practices.

🎯 What are Error Budgets?

Error budgets define the acceptable amount of time a system can be unavailable or not meeting Service Level Objectives (SLOs). They help teams make informed decisions about deployments, changes, and risk-taking.

📊 Error Budget Concepts

📋 Error Budget Definition

An error budget is the inverse of your SLO target. For example:

  • If your SLO is 99.9% availability, your error budget is 0.1%
  • If your SLO is 95% accuracy, your error budget is 5%

🧮 Error Budget Calculation

# Error budget calculation
# Always inferred from SLO target
error_budget = 1 - slo_target

⚙️ Error Budget Configuration

Note: You do not specify error_budget in your configuration. It is always inferred from the SLO target.

🔧 Example SLO Configuration

slos:
  availability:
    target: 0.9999
    window: "30d"
    description: "System availability"
  accuracy:
    target: 0.95
    window: "30d"
    description: "Model accuracy"

📊 Error Budget Monitoring

The framework automatically tracks error budget consumption, burn rate, and remaining budget for each SLO based on the target value.

🚨 Error Budget Alerts

You can configure alerts based on error budget consumption, burn rate, or remaining budget, but you do not specify error_budget in SLO config.

🎯 Error Budget Management Strategies

  • Conservative: Deploy only if a large portion of the error budget remains.
  • Aggressive: Deploy with a smaller remaining error budget for faster iteration.
  • Balanced: Balance stability and innovation based on error budget consumption.

🏭 Error Budget by Industry

Error budgets are always calculated as 1.0 - target for each SLO, regardless of industry.

Error Budget Reporting

The framework provides dashboards and reports showing error budget consumption, burn rate, and remaining budget for each SLO.

Error Budget Best Practices

  • Set realistic SLO targets based on historical performance
  • Monitor error budget consumption and burn rate
  • Use error budget status to inform deployment and risk decisions
  • Communicate error budget status to stakeholders
  • Automate alerting and reporting based on error budget status

Error Budget Examples

slos:
  availability:
    target: 0.999
    window: "30d"
    description: "System availability"
  accuracy:
    target: 0.95
    window: "30d"
    description: "Model accuracy"

The error budget for each SLO is always calculated as 1.0 - target (e.g., 0.001 for 99.9% availability).