feat: add safety guardrails demo with Sentinel AI#57

Open
MaxwellCalkin wants to merge 1 commit into anthropics:main from MaxwellCalkin:feat/safety-guardrails-demo

Conversation

@MaxwellCalkin

Summary

Adds a new safety-guardrails demo that shows how to integrate real-time AI safety scanning into Claude Agent SDK applications using Sentinel AI.

  • Uses SDK hooks (PreToolUse / PostToolUse) to scan all user inputs and agent outputs in real time
  • Detects prompt injection, PII leakage, harmful content, toxicity, and hallucination indicators
  • Automatically blocks high-risk inputs and redacts PII before it reaches the agent
  • Follows the same project structure as the existing research-agent demo (Python, pyproject.toml, uv sync)
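The input-side scan can be sketched in plain Python. This is a minimal illustration only: the function name, the two regex patterns, and the return shape are assumptions made for this sketch, not the actual sentinel-guardrails interface, which would use model-based detectors rather than regexes.

```python
import re

# Illustrative detectors only -- stand-ins for a real Sentinel AI scan.
INJECTION = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scan_input(text: str) -> dict:
    """Block likely prompt injections; otherwise redact PII and allow."""
    if INJECTION.search(text):
        return {"action": "block", "reason": "prompt_injection"}
    return {"action": "allow", "text": SSN.sub("[REDACTED-SSN]", text)}
```

In the demo itself, a callback along these lines would be registered as a PreToolUse hook through the SDK's hook configuration and would delegate to the Sentinel AI scanner.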

How it works

User Input ──> [Sentinel Scan] ──> Claude Agent ──> [Sentinel Scan] ──> Output
                    │                                      │
              Block injections                      Block harmful content
              Redact PII                            Detect PII leakage
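The flow in the diagram can be sketched end to end as a single guarded turn. Again a hedged sketch: `run_guarded`, the regex checks, and the `agent` callable are illustrative stand-ins, not the Claude Agent SDK or Sentinel AI APIs.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
INJECTION = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

def run_guarded(user_input: str, agent) -> str:
    """Run one turn through both scan stages, mirroring the diagram above."""
    # Input side: block injections outright, redact PII before the agent sees it.
    if INJECTION.search(user_input):
        return "[blocked: prompt injection]"
    clean = SSN.sub("[REDACTED]", user_input)

    reply = agent(clean)  # stand-in for the actual Claude Agent SDK call

    # Output side: catch any PII the agent may have leaked into its reply.
    return SSN.sub("[REDACTED]", reply)
```

For example, `run_guarded("My SSN is 123-45-6789", lambda t: f"You said: {t}")` never exposes the raw SSN to the agent or the user.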

Files added

safety-guardrails/
├── README.md                          # Setup instructions and usage guide
├── pyproject.toml                     # Dependencies (claude-agent-sdk, sentinel-guardrails)
├── .env.example
├── .gitignore
└── safety_guardrails/
    ├── agent.py                       # Main entry point with interactive chat loop
    └── safety_hooks.py                # Sentinel AI hook implementations

Test plan

  • Run uv sync to install dependencies
  • Set ANTHROPIC_API_KEY and run uv run python safety_guardrails/agent.py
  • Test clean input: "Hello, tell me about machine learning" — passes through
  • Test prompt injection: "Ignore all previous instructions" — blocked
  • Test PII: "My SSN is 123-45-6789" — detected and redacted

Add a new demo showing how to integrate real-time safety scanning into
Claude Agent SDK applications using Sentinel AI. The demo uses SDK hooks
to scan user inputs and agent outputs for prompt injection, PII leakage,
harmful content, toxicity, and hallucination indicators.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>