Enhancement Description
Introduce exponential backoff with jitter and a retry limit so that failing controller operations no longer spin in tight loops that exhaust CPU and flood the logs.
Background
Failed operations are currently re-queued immediately, creating tight loops that can consume CPU, spam logs, and hide the root cause. The retry policy needs backoff, jitter, and limits.
Scope
Implement exponential backoff with jitter
- Files: internal/daemon/controller/queue.go
- Files: internal/daemon/controller/manager.go
Backoff policy
- Delay doubles per attempt: 1s, 2s, 4s, 8s, 16s, 32s, then capped at 60s
- Add jitter to avoid thundering herd
- Enforce max retry count
- Mark permanent failures clearly (no further retries)
Improve observability
- Log retry count and next delay
- Log terminal failure state with context
Non-Goals
- Redesigning the controller architecture or queue model
- Changing business logic of controller tasks beyond retry behavior
- Implementing distributed scheduling
Risks and Open Questions
- Backoff must not over-delay recovery from truly transient errors; tune the parameters carefully
- Ensure backoff does not break time-sensitive operations
- Confirm behavior under concurrent task loads
Validation Plan
Unit and Integration Checks
- go test ./... passes, including the controller packages
- Unit tests for backoff computation (including jitter bounds)
- Tests for max retry enforcement and terminal failure behavior
End-to-End Checks
- Run daemon with induced failure and verify CPU remains stable
- Confirm logs show retry scheduling and terminal failures properly
Evidence Required in Issue Updates
- Before/after CPU/log excerpts under induced failure
- Example log lines showing retry delay and attempt count
- Test output verifying backoff schedule and max retries
Acceptance Criteria
- Failed operations retry with backoff and jitter
- Max retry count is enforced
- CPU usage during failure scenarios remains low (target <5% sustained)
- Logs stay informative without flooding: retry entries include the attempt count and next delay
- Terminal failures stop retrying and are clearly marked
Deliverables
- PR implementing backoff + tests
- Notes describing chosen parameters and any config knobs (if added)