docs(skills): add perf-investigation skill by rhefner1 · Pull Request #185 · gooddata/gooddata-cn-terraform

rhefner1 · 2026-06-11T15:22:42Z

What

Adds a new agent skill, perf-investigation, for diagnosing performance bottlenecks in a deployed GoodData.CN stack from observability data — given a slow trace ("why is trace abc123 taking 13s?") or a time-bounded report ("dashboards were slow ~10 min ago"). It is post-hoc diagnosis, not load-test running.

Why

Distilled from a long real performance investigation. The recurring lesson: the symptom is rarely where the cause is, and most gates are small hardcoded concurrency constants throttling a service that sits at low CPU — so the reflex to "add pods/CPU" is usually wrong. The skill encodes a repeatable method so the next investigation doesn't rediscover this the hard way.
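The "small concurrency constant, idle CPU" pattern can be sanity-checked with Little's law (in-flight requests = throughput × latency). The sketch below is illustrative only: the function names and the numbers are assumptions for this example, not values taken from the skill or from GoodData.CN.

```python
def implied_concurrency(throughput_rps: float, mean_latency_s: float) -> float:
    """Little's law: average requests in flight L = lambda * W."""
    if throughput_rps < 0 or mean_latency_s < 0:
        raise ValueError("throughput and latency must be non-negative")
    return throughput_rps * mean_latency_s


def max_throughput(concurrency_limit: int, mean_latency_s: float) -> float:
    """Little's law rearranged: a cap of L in-flight requests bounds lambda at L / W."""
    if concurrency_limit <= 0 or mean_latency_s <= 0:
        raise ValueError("concurrency_limit and mean_latency_s must be positive")
    return concurrency_limit / mean_latency_s


# Hypothetical observation: throughput plateaus at 4 req/s while each request
# spends 2 s inside the service and CPU stays low.
in_flight = implied_concurrency(4.0, 2.0)  # 8.0 requests in flight
ceiling = max_throughput(8, 2.0)           # 4.0 req/s with a cap of 8
```

If the implied in-flight count sits at a small round number no matter how much load is offered, that is consistent with a fixed concurrency limit rather than a CPU shortage, which is when reading the image source for the constant is worth the effort.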

Contents

SKILL.md: the method. Pin the symptom (ask the user when ambiguous) → reproduce evidence before theorizing → map the request path → localize via trace decomposition → classify with a signature table (idle CPU + deep queue = a concurrency constant; cause vs. victim; architectural ceilings; pool limits; burstable-credit cliffs) → prove (Little's law, read the image source) → name the fix and its lever class. It also covers discipline (pre-register hypotheses, change one variable at a time, watch for environmental drift) and a workload-realism check before declaring an architectural wall.
references/grafana-queries.md: a Prometheus/Tempo/Loki query cookbook covering CPU and CFS throttling, latency, Pulsar unacked backlog (a common non-obvious culprit), the jq span-timeline decomposition, slow-trace search, log correlation, and the pull-the-image-and-read-the-source recipe.
references/request-flow.md: the GoodData.CN request flow (gateway routing, AFM execution chain), annotated with where bottlenecks live to support cause-vs-victim reasoning.
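The trace-decomposition step can be sketched as a self-time calculation: for each span, subtract the time covered by its children, so the span that actually holds the latency stands out from the parents that merely wait on it. This is a minimal sketch, not the jq recipe from the skill. The span shape (`span_id`, `parent_id`, `name`, `start_ms`, `end_ms`) is a simplified assumption, not Tempo's response schema, and the span names and timings are made up.

```python
from collections import defaultdict


def self_times(spans):
    """Return (name, self_ms) pairs sorted by self time, largest first.

    Self time = span duration minus the union of its children's intervals,
    with each child interval clipped to the parent span.
    """
    children = defaultdict(list)
    for span in spans:
        if span["parent_id"] is not None:
            children[span["parent_id"]].append(span)

    rows = []
    for span in spans:
        intervals = sorted(
            (max(c["start_ms"], span["start_ms"]), min(c["end_ms"], span["end_ms"]))
            for c in children[span["span_id"]]
        )
        covered = 0
        cursor = span["start_ms"]
        for lo, hi in intervals:
            lo = max(lo, cursor)  # skip any part already counted (overlapping children)
            if hi > lo:
                covered += hi - lo
                cursor = hi
        rows.append((span["name"], (span["end_ms"] - span["start_ms"]) - covered))
    return sorted(rows, key=lambda row: -row[1])


# Hypothetical 13 s trace: the time is in a queue wait, not in the gateway.
example_spans = [
    {"span_id": "a", "parent_id": None, "name": "gateway", "start_ms": 0, "end_ms": 13000},
    {"span_id": "b", "parent_id": "a", "name": "afm-exec", "start_ms": 100, "end_ms": 12900},
    {"span_id": "c", "parent_id": "b", "name": "queue-wait", "start_ms": 200, "end_ms": 12000},
    {"span_id": "d", "parent_id": "b", "name": "sql-query", "start_ms": 12000, "end_ms": 12800},
]
breakdown = self_times(example_spans)
```

In this made-up trace the top entry is `queue-wait` with 11800 ms of self time, while `gateway` and `afm-exec` each hold only 200 ms: the outer spans are victims, and the wait is the place to look for a cause.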

Public/customer-facing: generic service names only, no internal cluster/org/repo references.

Notes for reviewers

The routing patterns and Pulsar topic names in request-flow.md are the bits most worth a second look before this is treated as public-facing docs.
Skill content only — no Terraform/infra changes.

🤖 Generated with Claude Code

A methodology skill for root-causing performance bottlenecks in a deployed GoodData.CN stack from observability data (Grafana/Prometheus/Loki/Tempo), triggered by a slow trace or an "X was slow at time T" report. Leads with the method (symptom != cause, decompose the trace, classify with a signature table, prove before fixing) and keeps the Grafana query patterns and the request-flow map in references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

rhefner1 marked this pull request as draft June 12, 2026 08:08

rhefner1 wants to merge 1 commit into master from docs/perf-investigation-skill
