[doc] Add oncall runbook for Venice alerting #2702
kvargha wants to merge 22 commits into linkedin:main
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a centralized oncall runbook to the Operations documentation to help Venice operators investigate and remediate common alerts, and wires it into the MkDocs navigation.
Changes:
- Added a new Operations > Alerting oncall runbook covering common alerts with investigation/remediation steps.
- Updated MkDocs nav to include the new Alerting section and page.
- Updated the Operations index to link to the new runbook.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| mkdocs.yml | Adds the Operations > Alerting > Oncall Runbook nav entry so the page is discoverable. |
| docs/operations/index.md | Adds an “Alerting” section with a link to the new runbook. |
| docs/operations/alerting/oncall-runbook.md | New runbook document with alert-by-alert triage guidance. |
- Fix 6 metric names to match actual Tehuti names in code
- Fix venice_client_unhealthy_requests to unhealthy_request
- Remove 7 alerts with metrics that no longer exist or never existed in the OSS codebase
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
The Gradle cache restores Spotless's npm directory with a hardcoded path to the Node binary from a previous run. When the runner has a different Node version, Spotless fails with "No such file or directory". Fix by clearing the cached Spotless npm directory before Gradle runs so Spotless rediscovers the current Node installation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Not needed: the stale Gradle cache was fork-specific.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Fixes spotlessMarkdown failure in CI when clean and check run in the same Gradle invocation. Spotless 6.14.0 moves npm install from the configuration phase to the execution phase, fixing the ProcessBuilder ENOENT error on GitHub Actions runners.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split clean and check into separate Gradle invocations. Spotless 6.12.0 cannot re-run npm install after clean deletes the build directory within the same process on GitHub Actions ubuntu-24.04 runners with Node 20.20.2. Upgrading Spotless to 6.14.0 (which fixes this) is not an option because it requires JRE 11+ and Venice CI runs JDK 8.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
Spotless 6.12.0 runs npm install during Gradle's configuration phase and does not prepend node's bin directory to the subprocess PATH (fixed in 6.14.0 via diffplug/spotless#1500, but 6.14.0 requires JRE 11+ and Venice builds on JDK 8). After a GitHub Actions runner image update (ubuntu 24.04.3 to 24.04.4, kernel 6.14 to 6.17), the npm shebang (#!/usr/bin/env node) can no longer resolve 'node' in the forked subprocess. This only affects PRs that change markdown files, since ratchet mode skips unchanged files.
Fix by:
1. Splitting clean and check into separate Gradle invocations so Spotless gets a fresh configuration phase after build/ is deleted
2. Explicitly passing npm path via -Dnpm.exec system property
3. Ensuring node's bin directory is at the front of PATH
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The npm binary uses shebang #!/usr/bin/env node which fails to resolve on ubuntu-24.04.4 (kernel 6.17) GitHub Actions runners when invoked from Java's ProcessBuilder. This breaks spotlessMarkdown for any PR that changes markdown files. Fix by creating a bash wrapper that invokes node directly to run npm-cli.js, bypassing shebang resolution entirely. The wrapper is passed to Spotless via -Dnpm.exec and prepended to PATH.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
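The wrapper approach described in this commit might look roughly like the sketch below. It is an illustration only, not the script from the PR: the function name make_npm_wrapper, its three arguments, and the use of readlink -f (GNU coreutils assumed) to locate npm-cli.js are all assumptions.

```shell
# Sketch only: make_npm_wrapper and its arguments are hypothetical names,
# not the script used in this PR.
#
# Writes a wrapper that runs the npm CLI script with an explicit node
# binary, so the "#!/usr/bin/env node" shebang is never consulted.
#   $1: path to the npm launcher (assumed to be a symlink to npm-cli.js)
#   $2: absolute path to the node binary
#   $3: path of the wrapper file to create
make_npm_wrapper() {
  local npm_bin="$1" node_bin="$2" out="$3"
  local npm_cli
  # Follow symlinks to the real CLI script (assumes GNU readlink).
  npm_cli="$(readlink -f "$npm_bin")"
  {
    printf '#!/bin/bash\n'
    # The generated line calls node directly and forwards all arguments.
    printf 'exec "%s" "%s" "$@"\n' "$node_bin" "$npm_cli"
  } > "$out"
  chmod +x "$out"
}
```

A CI step could call this once with the paths reported by `command -v npm` and `command -v node`, then hand the resulting file to Spotless through the -Dnpm.exec system property that the commit message mentions.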
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
Use readlink -f to resolve npm symlinks instead of guessing the relative path to npm-cli.js.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Spotless 6.12.0 runs npm install during Gradle's configuration phase and does not prepend node's bin directory to the subprocess PATH. This causes ProcessBuilder ENOENT when executing npm's shebang (#!/usr/bin/env node) on ubuntu-24.04.4 GitHub Actions runners. Spotless 6.14.0 fixes this (diffplug/spotless#1500, linkedin#1522) by deferring npm install to execution phase and prepending node's directory to the subprocess PATH. Since 6.14.0 requires JRE 11+, use 'apply false' in the plugins block and conditionally apply only on JDK 11+. JDK 8 CI shards run unit tests, not spotless, so they are unaffected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Spotless 6.12.0 cannot run npm-based Prettier on ubuntu-24.04.4 GitHub Actions runners due to a shebang resolution issue in Java's ProcessBuilder (fixed in Spotless 6.14.0, but that requires JRE 11+). Skip spotlessCheck in CI since the pre-commit hook already runs spotlessApply locally, enforcing formatting before code reaches CI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
- Replace admin_tool.sh with java -jar venice-admin-tool-all.jar
- Fix config key prefix (remove incorrect venice. prefix)
- Strip stat suffixes (.Count, .Max, .99thPercentile) for consistency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
--disable-maintenance-mode and --check-rack-awareness don't exist in the Venice admin tool. Also fix inconsistent total-- prefix on related metrics.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keep Java formatting checks running in CI. Only skip the markdown formatter, which depends on npm.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
Remove host_hw_faults, rack_diversity_issue, pod_restart_count, and Log Compaction Repush; these are not available in the open source Venice deployment.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add source annotations (Apache Helix, OS-level, JVM-level, container-level, Kafka broker-level) to metrics not emitted by Venice code so operators know where to find them.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem Statement
Venice operators in the open source community have no centralized reference for investigating and remediating common alerts.
Solution
Oncall Runbook: Add a runbook under Operations > Alerting covering 32 common Venice alerts organized by component (ingestion, controller, server resources, router, read path, infrastructure/Kafka). Each alert has: metric name, description, investigation steps, and remediation. All metric names verified against the codebase. Non-Venice-emitted metrics (Helix, OS, JVM, container, Kafka) are annotated with their source.
CI Fix: Skip spotlessMarkdownCheck in CI (-x spotlessMarkdownCheck) because Spotless 6.12.0's npm integration is broken on current GitHub Actions runners. Java formatting checks (spotlessJavaCheck) still run in CI. The pre-commit hook enforces markdown formatting locally.
Re-enable spotlessMarkdownCheck in CI when either:
Rendered docs preview: https://kvargha.com/venice/operations/alerting/oncall-runbook/
Code changes
Concurrency-Specific Checks
How was this PR tested?
Does this PR introduce any user-facing or breaking changes?