Skip to content

Latest commit

 

History

History
267 lines (197 loc) · 8.71 KB

File metadata and controls

267 lines (197 loc) · 8.71 KB

Agent Configuration Guide

This document covers the configuration of the Comms Agent and PIR Agent — the two output-facing specialist agents in the On-Call Copilot pipeline. It also provides a full reference for all four agents.


Overview

All four specialist agents are defined as plain Python strings (*_INSTRUCTIONS) in app/agents/. Each constant is injected as the instruction prompt when the Agent is created with the shared FoundryChatClient. To change agent behaviour, edit the corresponding instruction constant and rebuild/redeploy the container.

Agent File Output keys
Triage app/agents/triage.py suspected_root_causes, immediate_actions, missing_information, runbook_alignment
Summary app/agents/summary.py summary.what_happened, summary.current_status
Comms app/agents/comms.py comms.slack_update, comms.stakeholder_update
PIR app/agents/pir.py post_incident_report.timeline, post_incident_report.customer_impact, post_incident_report.prevention_actions

All agents run concurrently via ConcurrentBuilder and each returns a JSON-only response — no prose, no markdown wrapping.


Comms Agent

File: app/agents/comms.py

Purpose

Translates raw incident signals into two audience-appropriate communications:

  • A Slack channel update for the on-call engineering team
  • A stakeholder summary for executives and product managers

Output Schema

{
  "comms": {
    "slack_update": "<string>",
    "stakeholder_update": "<string>"
  }
}

Slack Update Format

Element Detail
Emoji prefix :rotating_light: for active SEV1/2, :warning: for degraded, :white_check_mark: for resolved
Required fields Incident ID, severity, one-line summary, affected services, next update ETA
Tone Calm, factual, action-oriented

Example output for an active SEV1:

:rotating_light: *SEV1 ACTIVE* | INC-2024-0847
*Summary:* Checkout API degraded — P95 latency 8s, 62% requests timing out
*Services:* checkout-api, order-service (AKS us-east-1)
*Status:* Investigating AZ node capacity exhaustion
*Next update:* 14:30 UTC

Stakeholder Update Format

Element Detail
Audience Executives, product managers — no technical jargon
Focus Customer experience, business impact, what the team is doing, next update time
Prohibited Blame language, acronyms without explanation, raw log/metric data

Configuration Options

Behaviour How to change
Emoji set for Slack Edit the emoji prefix rules in COMMS_INSTRUCTIONS
Add a third channel (e.g. Teams) Add a third key to the output schema in COMMS_INSTRUCTIONS and handle the new key in downstream consumers
Change update interval wording Edit the "next update" guidance text
Enforce message length limit Add a character-count constraint to the guidelines section
Add incident priority/SLA language Add a "Priority" line to the Slack format rules

Customisation Example

To add a pagerduty_note field alongside the existing keys, update the task block in COMMS_INSTRUCTIONS:

COMMS_INSTRUCTIONS = """\
...
## Task
Return a single JSON object with ONLY this key:

{
  "comms": {
    "slack_update": "...",
    "stakeholder_update": "...",
    "pagerduty_note": "string – one-line note for PagerDuty incident timeline"
  }
}
...
"""

PIR Agent

File: app/agents/pir.py

Purpose

Constructs a post-incident report from resolved or partially-resolved incident signals. Produces a structured timeline, quantified customer impact, and actionable prevention measures.

The PIR Agent is only meaningful when the incident payload contains sufficient signal history. Use demo 3 (scripts/demos/) or scenario 5 (scripts/scenarios/) for best results.

Output Schema

{
  "post_incident_report": {
    "timeline": [
      { "time": "HH:MMZ or ISO 8601", "event": "<string>" }
    ],
    "customer_impact": "<string>",
    "prevention_actions": [
      "<string>"
    ]
  }
}

Timeline Construction

Rule Detail
Source Derived from alerts[].timestamp, logs[].timestamp, metrics[].timestamp in input
Ordering Strictly chronological; earliest signal is the start anchor
Ongoing incidents Final entry uses "time": "ONGOING"
Time format Prefer HH:MMZ for same-day incidents; ISO 8601 for multi-day

Customer Impact

The agent is instructed to quantify impact wherever the input data supports it:

Data present Expected output
revenue_impact in payload Dollar figure included verbatim
Error rate metrics Percentage quoted
Affected service with SLA Impact framed against SLA
No customer-facing impact Explicit "No customer-visible impact" statement

Prevention Actions

Each prevention action should be:

  • Specific — names the exact system, config, or process to change
  • Actionable — describes the change, not just the goal
  • Owned — suggests a responsible role (e.g. Platform Engineering, DBA, SRE)

Example:

"Add Redis maxmemory-policy alert at 70% threshold – owner: Platform Engineering"

Configuration Options

Behaviour How to change
Number of prevention actions Add "Provide at least N prevention actions" to the guidelines
Timeline timestamp format Specify format preference in the Timeline guidelines section
Include root cause section Add "root_cause": "string" to the output schema in PIR_INSTRUCTIONS
Suppress revenue data Add "Do not include revenue estimates" to guidelines
Add blameless postmortem framing Add a tone guideline: "Use blameless postmortem language throughout"
Add a lessons-learned section Extend the schema with "lessons_learned": ["string"]

Customisation Example

To add a root_cause field and enforce blameless language:

PIR_INSTRUCTIONS = """\
...
## Task
Return a single JSON object with ONLY this key:

{
  "post_incident_report": {
    "timeline": [...],
    "root_cause": "string – concise description of the confirmed root cause",
    "customer_impact": "...",
    "prevention_actions": [...]
  }
}

## Guidelines
...
- **Blameless language**: Focus on systems and processes, not individuals.
- **Root cause**: State the confirmed root cause. If not yet confirmed, write
  \"Under investigation\" and list the leading hypotheses.
...
"""

Triage Agent (reference)

File: app/agents/triage.py

Output keys: suspected_root_causes (array with hypothesis, evidence, confidence), immediate_actions (array with step, owner_role, priority), missing_information, runbook_alignment.

Key guardrails: credentials are redacted as [REDACTED]; if data is too sparse, confidence is set to 0 and missing_information is populated rather than hallucinating a root cause.


Summary Agent (reference)

File: app/agents/summary.py

Output keys: summary.what_happened (2–4 sentence factual summary), summary.current_status (prefixed with ONGOING / MITIGATED / MONITORING / RESOLVED).

Status is inferred from the payload: presence of timeframe.end implies resolved; absence implies ongoing.


Rebuilding and Redeploying After Changes

After editing any *_INSTRUCTIONS constant:

# 1. Rebuild the container image
docker build -t oncall-copilot:v9 .

# 2. Push to your registry
docker push <registry>/oncall-copilot:v9

# 3. Create a new agent version in Microsoft Foundry
az cognitiveservices agent create-version \
  --account-name <account> \
  --project-name <project> \
  --name oncall-copilot \
  --image <registry>/oncall-copilot:v9

# 4. Start the new version
az cognitiveservices agent start \
  --account-name <account> \
  --project-name <project> \
  --name oncall-copilot \
  --agent-version 9

See CONTRIBUTING.md for the full build and deployment guide.


Testing Agent Output

Smoke test against live Foundry deployment:

python scripts/invoke.py --demo 3   # rich payload — exercises comms + PIR
python scripts/invoke.py --scenario 5  # resolved storage incident — PIR-focused

Direct model test (no container needed):

python scripts/test_agents_direct.py          # tests comms + PIR with demo 3
python scripts/test_agents_direct.py --demo 1 # use a different demo payload

Validate JSON schema only:

python scripts/validate.py