[doc] Add oncall runbook for Venice alerting #2702

Open
kvargha wants to merge 22 commits into linkedin:main from kvargha:main

kvargha commented Apr 7, 2026

Problem Statement

Venice operators in the open source community have no centralized reference for investigating and remediating common alerts.

Solution

Oncall Runbook: Add a runbook under Operations > Alerting covering 32 common Venice alerts, organized by component (ingestion, controller, server resources, router, read path, infrastructure/Kafka). Each alert lists its metric name, a description, investigation steps, and remediation. All metric names were verified against the codebase. Metrics not emitted by Venice (Helix, OS, JVM, container, Kafka) are annotated with their source.

CI Fix: Skip spotlessMarkdownCheck in CI (-x spotlessMarkdownCheck) because Spotless 6.12.0's npm integration is broken on current GitHub Actions runners. Java formatting checks (spotlessJavaCheck) still run in CI. The pre-commit hook enforces markdown formatting locally.

Re-enable spotlessMarkdownCheck in CI when either:

  • JDK 8 CI support is dropped (allows upgrading Spotless to 6.14.0+ which fixes the issue)
  • The Spotless npm/ProcessBuilder issue is resolved upstream
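
For illustration, the CI change described above amounts to excluding one task from the Gradle invocation. The sketch below is an assumption about how a workflow step might assemble the command: the `./gradlew` wrapper path and the `check` task are not taken from this PR's diff, and only `-x spotlessMarkdownCheck` is quoted from the description. The helper name `gradle_ci_args` is hypothetical.

```shell
# Build the Gradle argument string for CI.
# Assumption: CI runs the `check` lifecycle task; the real workflow may differ.
gradle_ci_args() {
  skip_markdown="${1:-true}"
  args="check"
  if [ "$skip_markdown" = "true" ]; then
    # -x excludes a task by name; spotlessJavaCheck is not excluded and still runs.
    args="$args -x spotlessMarkdownCheck"
  fi
  printf '%s\n' "$args"
}

# Print the command a CI step would run (it is not executed here).
echo "./gradlew $(gradle_ci_args true)"
```

Passing `false` drops the exclusion, which is roughly what re-enabling the markdown check would look like once one of the two conditions above is met.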

Rendered docs preview: https://kvargha.com/venice/operations/alerting/oncall-runbook/

Code changes

  • Added new code behind a config. N/A.
  • Introduced new log lines. N/A.

Concurrency-Specific Checks

  • N/A.

How was this PR tested?

  • Spotless formatting verified locally via pre-commit hook.
  • All metric names verified against the codebase.
  • No internal references remain (grep check).
  • mkdocs nav renders correctly.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 7, 2026 04:35
Copilot AI left a comment

Pull request overview

Adds a centralized oncall runbook to the Operations documentation to help Venice operators investigate and remediate common alerts, and wires it into the MkDocs navigation.

Changes:

  • Added a new Operations > Alerting oncall runbook covering common alerts with investigation/remediation steps.
  • Updated MkDocs nav to include the new Alerting section and page.
  • Updated the Operations index to link to the new runbook.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| mkdocs.yml | Adds the Operations > Alerting > Oncall Runbook nav entry so the page is discoverable. |
| docs/operations/index.md | Adds an “Alerting” section with a link to the new runbook. |
| docs/operations/alerting/oncall-runbook.md | New runbook document with alert-by-alert triage guidance. |

kvargha and others added 2 commits April 7, 2026 10:29
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix 6 metric names to match actual Tehuti names in code
- Fix venice_client_unhealthy_requests to unhealthy_request
- Remove 7 alerts with metrics that no longer exist or never existed
  in the OSS codebase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 7, 2026 20:37
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.


kvargha and others added 2 commits April 7, 2026 14:01
The Gradle cache restores Spotless's npm directory with a hardcoded
path to the Node binary from a previous run. When the runner has a
different Node version, Spotless fails with "No such file or directory".

Fix by clearing the cached Spotless npm directory before Gradle runs
so Spotless rediscovers the current Node installation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Gradle cache restores Spotless's npm directory with a hardcoded
path to the Node binary from a previous run. When the runner has a
different Node version, Spotless fails with "No such file or directory".

Fix by clearing the cached Spotless npm directory before Gradle runs
so Spotless rediscovers the current Node installation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 7, 2026 21:50
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


kvargha and others added 4 commits April 7, 2026 15:26
Not needed — the stale Gradle cache was fork-specific.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 7, 2026 23:38
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


kvargha and others added 3 commits April 7, 2026 18:10
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes spotlessMarkdown failure in CI when clean and check run in the
same Gradle invocation. Spotless 6.14.0 moves npm install from the
configuration phase to the execution phase, fixing the ProcessBuilder
ENOENT error on GitHub Actions runners.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split clean and check into separate Gradle invocations. Spotless
6.12.0 cannot re-run npm install after clean deletes the build
directory within the same process on GitHub Actions ubuntu-24.04
runners with Node 20.20.2.

Cannot upgrade Spotless to 6.14.0 (which fixes this) because it
requires JRE 11+ and Venice CI runs JDK 8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 8, 2026 01:24
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.


kvargha and others added 2 commits April 7, 2026 18:55
Spotless 6.12.0 runs npm install during Gradle's configuration phase
and does not prepend node's bin directory to the subprocess PATH
(fixed in 6.14.0 via diffplug/spotless#1500, but 6.14.0 requires
JRE 11+ and Venice builds on JDK 8).

After a GitHub Actions runner image update (ubuntu 24.04.3 to 24.04.4,
kernel 6.14 to 6.17), the npm shebang (#!/usr/bin/env node) can no
longer resolve 'node' in the forked subprocess. This only affects PRs
that change markdown files, since ratchet mode skips unchanged files.

Fix by:
1. Splitting clean and check into separate Gradle invocations so
   Spotless gets a fresh configuration phase after build/ is deleted
2. Explicitly passing npm path via -Dnpm.exec system property
3. Ensuring node's bin directory is at the front of PATH

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
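
The three steps in the commit message above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the workflow change from this PR: `./gradlew`, the `check` task, and the helper names `prepend_node_dir` and `print_ci_steps` are hypothetical, and only `-Dnpm.exec` is taken from the commit message.

```shell
# Put the directory that holds the node binary at the front of PATH,
# so a subprocess resolving `node` finds this installation first.
prepend_node_dir() {
  node_bin="$1"                      # path to the node executable
  node_dir=$(dirname "$node_bin")
  PATH="$node_dir:$PATH"
  export PATH
}

# Print (not run) two separate Gradle invocations, so the second one
# gets a fresh configuration phase after `clean` has removed build/.
# Assumption: ./gradlew and the `check` task; the real workflow may differ.
print_ci_steps() {
  npm_path="$1"                      # explicit npm path passed to Spotless
  echo "./gradlew clean"
  echo "./gradlew check -Dnpm.exec=$npm_path"
}
```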
The npm binary uses shebang #!/usr/bin/env node which fails to
resolve on ubuntu-24.04.4 (kernel 6.17) GitHub Actions runners
when invoked from Java's ProcessBuilder. This breaks spotlessMarkdown
for any PR that changes markdown files.

Fix by creating a bash wrapper that invokes node directly to run
npm-cli.js, bypassing shebang resolution entirely. The wrapper is
passed to Spotless via -Dnpm.exec and prepended to PATH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
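
A wrapper of the kind described above might look like the sketch below. It is an illustration only: the helper name `write_npm_wrapper` is hypothetical, all three paths are supplied by the caller, and none of this is copied from the PR's diff.

```shell
# Write a wrapper script that runs npm-cli.js through an explicit node binary,
# so no `#!/usr/bin/env node` shebang lookup is involved.
write_npm_wrapper() {
  node_bin="$1"    # absolute path to the node executable
  npm_cli_js="$2"  # absolute path to npm's npm-cli.js
  wrapper="$3"     # where to write the wrapper script
  # Unquoted heredoc: the two paths are expanded now, \$@ stays literal
  # so the wrapper forwards its own arguments at run time.
  cat > "$wrapper" <<EOF
#!/bin/bash
exec "$node_bin" "$npm_cli_js" "\$@"
EOF
  chmod +x "$wrapper"
}
```

Per the commit message, the resulting wrapper path would then be passed to Spotless via `-Dnpm.exec` and its directory prepended to `PATH`.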
Copilot AI review requested due to automatic review settings April 8, 2026 02:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


kvargha and others added 3 commits April 7, 2026 19:16
Use readlink -f to resolve npm symlinks instead of guessing
the relative path to npm-cli.js.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
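
As a rough sketch of the symlink resolution mentioned above: in many Node installations the `npm` launcher on `PATH` is a symlink to `npm-cli.js`, and `readlink -f` follows the chain to the real file. The helper name `resolve_npm_cli` is hypothetical, and the layout is an assumption about a typical install rather than something stated in this PR.

```shell
# Resolve a possibly symlinked npm launcher to the real file it points at.
# Requires a readlink that supports -f (GNU coreutils does).
resolve_npm_cli() {
  npm_launcher="$1"   # e.g. the path printed by `command -v npm`
  readlink -f "$npm_launcher"
}
```

A caller could feed it the output of `command -v npm` instead of hard-coding a relative path from the launcher to `npm-cli.js`.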
Spotless 6.12.0 runs npm install during Gradle's configuration phase
and does not prepend node's bin directory to the subprocess PATH.
This causes ProcessBuilder ENOENT when executing npm's shebang
(#!/usr/bin/env node) on ubuntu-24.04.4 GitHub Actions runners.

Spotless 6.14.0 fixes this (diffplug/spotless#1500, linkedin#1522) by
deferring npm install to execution phase and prepending node's
directory to the subprocess PATH.

Since 6.14.0 requires JRE 11+, use 'apply false' in the plugins
block and conditionally apply only on JDK 11+. JDK 8 CI shards
run unit tests, not spotless, so they are unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Spotless 6.12.0 cannot run npm-based Prettier on ubuntu-24.04.4
GitHub Actions runners due to a shebang resolution issue in Java's
ProcessBuilder (fixed in Spotless 6.14.0, but that requires JRE 11+).

Skip spotlessCheck in CI since the pre-commit hook already runs
spotlessApply locally, enforcing formatting before code reaches CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 8, 2026 03:17
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


- Replace admin_tool.sh with java -jar venice-admin-tool-all.jar
- Fix config key prefix (remove incorrect venice. prefix)
- Strip stat suffixes (.Count, .Max, .99thPercentile) for consistency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
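
Because the runbook now lists metric names without stat suffixes, an operator matching an emitted metric against it may want to normalize the name first. The sketch below shows one way to do that; the helper name `strip_stat_suffix` and the metric names in the usage lines are hypothetical, and only the three suffixes come from the commit message above.

```shell
# Remove a trailing stat suffix (.Count, .Max, .99thPercentile) from a metric name.
# Names without one of these suffixes are printed unchanged.
strip_stat_suffix() {
  printf '%s\n' "$1" | sed -E 's/\.(Count|Max|99thPercentile)$//'
}

# Hypothetical metric names, for illustration only.
strip_stat_suffix "example_latency.99thPercentile"   # prints example_latency
strip_stat_suffix "example_requests.Count"           # prints example_requests
```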
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


kvargha enabled auto-merge (squash) April 8, 2026 04:23
--disable-maintenance-mode and --check-rack-awareness don't exist
in the Venice admin tool. Also fix inconsistent total-- prefix on
related metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kvargha disabled auto-merge April 8, 2026 08:31
Keep Java formatting checks running in CI. Only skip the
markdown formatter which depends on npm.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 8, 2026 20:33
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


kvargha enabled auto-merge (squash) April 8, 2026 21:31
kvargha and others added 2 commits April 8, 2026 18:22
Remove host_hw_faults, rack_diversity_issue, pod_restart_count,
and Log Compaction Repush — these are not available in the open
source Venice deployment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add source annotations (Apache Helix, OS-level, JVM-level,
container-level, Kafka broker-level) to metrics not emitted by
Venice code so operators know where to find them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2 participants