sample-nvidia-nemotron-cascade-workshop

Sample code for two-tier LLM inference — also called LLM cascading or confidence-based model routing — on Amazon Bedrock. NVIDIA Nemotron Nano handles the easy, high-volume support-ticket classification path; harder or higher-stakes tickets escalate to Anthropic Claude Sonnet on the same Bedrock API surface.

The app classifies support tickets, but ticket triage is only the example. The pattern applies to lead scoring, moderation, alert routing, document classification, and other workloads where most requests are routine and a small tail needs a stronger model.

Clone the repo, point it at an AWS account with Bedrock access, and try it. It also works as a workshop scaffold; the structure supports both uses.

What's In The Repo

The repo already includes:

A single-ticket baseline endpoint: POST /api/triage
A one-ticket cascade demo: POST /api/triage/cascade
A /bulk UI that lights up once POST /api/triage/bulk is implemented
Shared Bedrock Converse API client code, Zod schemas, prompts, tests, and synthetic ticket data
A bake-off harness comparing Sonnet-only, Nano-only, and cascade modes

The bulk endpoint is intentionally left as an exercise. Implementing it applies the same cascade to many tickets: Nano runs first on every ticket, Claude Sonnet is called when an escalation rule fires, and NDJSON rows stream back to the /bulk table as each ticket finishes. The spec is in docs/phase2-change.md.

Architecture

flowchart TB
    subgraph CLIENT["Client"]
        DEMO["/ Routing Demo\none ticket cascade"]
        BULK["/bulk Strategy Comparison\nSonnet vs Nano vs Cascade"]
    end

    subgraph API["Next.js API Routes"]
        SINGLE["POST /api/triage\nSonnet-only baseline"]
        CASCADE["POST /api/triage/cascade\nSSE: Nano to Claude"]
        BULKAPI["POST /api/triage/bulk\nimplement to light up /bulk"]
    end

    subgraph LIB["Shared Library"]
        CLIENT_LIB["lib/bedrock/client.ts\nConverse API + tool calling"]
        SCHEMA["lib/triage/schema.ts\nZod validation"]
        PROMPTS["lib/triage/prompts.ts\nSystem + user prompts"]
    end

    subgraph BEDROCK["Amazon Bedrock us-west-2"]
        direction LR
        NANO["Nemotron Nano 30B\nfast first pass"]
        SONNET["Claude Sonnet 4.6\nescalation + baseline"]
    end

    DEMO --> CASCADE
    BULK --> BULKAPI
    CASCADE --> CLIENT_LIB
    SINGLE --> CLIENT_LIB
    BULKAPI --> CLIENT_LIB
    CLIENT_LIB --> SCHEMA
    CLIENT_LIB --> PROMPTS
    CLIENT_LIB --> BEDROCK
    NANO -.->|"confidence < 0.7\nOR P0/P1\nOR needs_human"| SONNET
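
The dotted edge from Nano to Sonnet encodes the routing rule. A minimal sketch of that rule, using only the three conditions on the edge label, could look like the following. The `TriageResult` shape, the `P2`/`P3` priority values, and the names `escalationReasons` and `shouldEscalateSketch` are assumptions for illustration; the repo's own rule lives in scripts/bakeoff.ts and may differ.

```typescript
// Hypothetical shape of Nano's triage output; the real fields are defined
// by the Zod schemas in lib/triage/schema.ts and may differ.
type Priority = "P0" | "P1" | "P2" | "P3";

interface TriageResult {
  category: string;
  priority: Priority;
  confidence: number; // assumed to be in [0, 1]
  needs_human: boolean;
}

// Threshold taken from the diagram edge label.
const CONFIDENCE_THRESHOLD = 0.7;

// Returns every reason the ticket should go to Claude Sonnet.
// An empty array means Nano's answer stands.
function escalationReasons(nano: TriageResult): string[] {
  const reasons: string[] = [];
  if (nano.confidence < CONFIDENCE_THRESHOLD) reasons.push("low_confidence");
  if (nano.priority === "P0" || nano.priority === "P1") reasons.push("high_priority");
  if (nano.needs_human) reasons.push("needs_human");
  return reasons;
}

function shouldEscalateSketch(nano: TriageResult): boolean {
  return escalationReasons(nano).length > 0;
}
```

Returning the list of reasons rather than a bare boolean makes it easier to log why a ticket escalated, which helps when tuning the threshold later.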

How To Use This Repo

| Step | What you do | Result |
| --- | --- | --- |
| 1. Spin up | Install deps, configure AWS credentials, call `/api/triage` | Proves Bedrock access works |
| 2. See the cascade | Use `/` to watch Nano resolve or escalate one ticket | Makes the routing pattern concrete |
| 3. Build bulk triage | Implement `POST /api/triage/bulk` using your preferred AI coding assistant | Scales the same pattern to many tickets |
| 4. Compare strategies | Run `/bulk` and `npm run bakeoff -- --dry-run --all` | Cost, latency, and agreement side by side |
| 5. Productionize | Review retries, throttling, guardrails, evals, and rollout strategy | Turns sample code into a deployment pattern |

The repo is intentionally partner-agnostic above the application layer. Use any coding assistant, IDE, or PR review workflow you want — or run it as a live workshop with your team. What it teaches is the runtime architecture: one AWS credential chain, one Bedrock endpoint, multiple models chosen by workload shape.

Bake-Off Numbers

Live results from npm run bakeoff -- --all --limit=30 against 30 synthetic B2B support tickets:

| Config | Total cost (30 tickets) | Avg latency | Agreement vs Opus | Escalation rate |
| --- | --- | --- | --- | --- |
| Sonnet 4.6 only | $0.0608 | 4,614 ms | 93.3% (28/30) | n/a |
| Nemotron Nano 30B only | $0.0041 | 654 ms | 83.3% (25/30) | n/a |
| Partnership cascade (Nano + Claude) | $0.0514 | 4,084 ms | 93.3% (28/30) | 76.7% (23/30) |

The judge is Claude Opus 4.7. It labels each ticket once as a proxy answer key; the strategies are compared against those labels. Opus is not called by the live app.
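
As an illustration of how an agreement figure such as 28/30 can be computed, the sketch below counts the tickets where a strategy's category matches the judge's label. Comparing on category alone is an assumption (the bake-off harness may compare other fields as well), and the `Labeled` type and `agreement` function are hypothetical names.

```typescript
// Hypothetical minimal record: one label per ticket id.
interface Labeled {
  id: string;
  category: string;
}

interface Agreement {
  matched: number;
  total: number;
  rate: number; // matched / total, or 0 when nothing was comparable
}

// Compares a strategy's labels against the judge's labels by ticket id.
// Tickets the judge never labeled are skipped rather than counted as misses.
function agreement(strategy: Labeled[], judge: Labeled[]): Agreement {
  const judgeById = new Map(judge.map((j) => [j.id, j.category] as const));
  let matched = 0;
  let total = 0;
  for (const row of strategy) {
    const expected = judgeById.get(row.id);
    if (expected === undefined) continue;
    total += 1;
    if (expected === row.category) matched += 1;
  }
  return { matched, total, rate: total === 0 ? 0 : matched / total };
}
```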

How to read these numbers

This is a deployment-strategy comparison, not a model comparison. Nemotron Nano and Claude Sonnet are doing different jobs in the cascade — Nano handles the high-volume routing pass on every ticket; Claude handles the long tail that the routing logic flags as needing a stronger model. The numbers above show three legitimate deployment shapes you might choose:

Nano alone is ~15× cheaper and ~7× faster than Sonnet alone, at 83% agreement with the Opus answer key. Strong fit when latency or cost matters more than the last 10 points of category accuracy, or when downstream consumers can tolerate occasional re-classification.
Partnership cascade matches Sonnet-only agreement (93%) at lower cost (~15% savings on this workload). Nano handles every ticket; Claude is invoked only when Nano's output trips a domain-tuned escalation rule. Higher savings come from better-tuned escalation against your own data.
Sonnet alone is the simplest deployment when you don't yet have data to tune escalation against and budget isn't tight.
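
The ratios quoted above follow directly from the table. The short calculation below reproduces them from the published totals so they can be rechecked after a fresh bake-off run; the `bakeoff` object and variable names are illustrative only.

```typescript
// Totals copied from the bake-off table (30 tickets).
const bakeoff = {
  sonnet: { costUsd: 0.0608, avgLatencyMs: 4614 },
  nano: { costUsd: 0.0041, avgLatencyMs: 654 },
  cascade: { costUsd: 0.0514, avgLatencyMs: 4084 },
};

// How many times cheaper Nano-only is than Sonnet-only (about 14.8).
const nanoCostRatio = bakeoff.sonnet.costUsd / bakeoff.nano.costUsd;

// How many times faster Nano-only is than Sonnet-only (about 7.1).
const nanoSpeedup = bakeoff.sonnet.avgLatencyMs / bakeoff.nano.avgLatencyMs;

// Fraction of Sonnet-only cost saved by the cascade (about 0.155).
const cascadeSavings = 1 - bakeoff.cascade.costUsd / bakeoff.sonnet.costUsd;

// Fraction of Sonnet-only latency saved by the cascade (about 0.115).
const cascadeLatencyReduction = 1 - bakeoff.cascade.avgLatencyMs / bakeoff.sonnet.avgLatencyMs;

console.log({
  nanoCostRatio: nanoCostRatio.toFixed(1),
  nanoSpeedup: nanoSpeedup.toFixed(1),
  cascadeSavingsPct: (cascadeSavings * 100).toFixed(1),
  cascadeLatencyReductionPct: (cascadeLatencyReduction * 100).toFixed(1),
});
```

With 23 of 30 tickets escalating, most of the cascade's cost is still Sonnet calls, which is why the savings on this workload are modest.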

How much the cascade gains over Sonnet alone depends entirely on the escalation logic. Confidence and stakes signals alone do not catch every Nano mistake: Nano can be confidently wrong on adjacent categories (e.g., integration vs feature_request). The example escalation in scripts/bakeoff.ts adds a hardcoded high-disagreement category list as a workshop starting point. Production teams would typically replace this with a learned router trained on their own historical Nano-vs-Claude disagreements; see Production Caveats below.

These numbers should be refreshed before publication with:

npm run bakeoff -- --label --limit=30
npm run bakeoff -- --all --limit=30

Quickstart

npm install
cp .env.example .env.local
npm run dev

Open http://localhost:3000.

Smoke test the baseline endpoint:

curl -X POST http://localhost:3000/api/triage \
  -H 'Content-Type: application/json' \
  -d "$(jq '.[0]' data/sample-tickets.json)"

The app uses the AWS SDK default credential chain. Whatever makes this command work will also make the app work:

aws sts get-caller-identity

Credentials

Set the region and the named profile in .env.local:

AWS_REGION=us-west-2
AWS_PROFILE=default

If you are using temporary STS credentials, use:

AWS_REGION=us-west-2
AWS_ACCESS_KEY_ID=<temporary-access-key-id>
AWS_SECRET_ACCESS_KEY=<temporary-secret-access-key>
AWS_SESSION_TOKEN=<temporary-session-token>

Do not commit .env.local or any real credentials.

Build The Bulk Endpoint

Add POST /api/triage/bulk.

Request:

{ "tickets": [{ "id": "T-001", "subject": "...", "body": "...", "customer_tier": "pro" }] }

Response: application/x-ndjson, one JSON object per line.
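
One way to produce that response shape is to emit a line as soon as each ticket's work settles, rather than waiting for the whole batch. The sketch below shows only the ordering and line encoding, using plain async generators; `inCompletionOrder`, `toNdjsonLine`, and `ndjsonLines` are hypothetical helpers, not functions from the repo.

```typescript
// Yields each task's value as soon as it settles, regardless of input order.
// If any task rejects, the generator throws; a real endpoint would more
// likely turn the failure into an error row for that ticket.
async function* inCompletionOrder<T>(tasks: Promise<T>[]): AsyncGenerator<T> {
  const pending = new Map<number, Promise<{ index: number; value: T }>>();
  tasks.forEach((task, index) => {
    pending.set(index, task.then((value) => ({ index, value })));
  });
  while (pending.size > 0) {
    const { index, value } = await Promise.race(pending.values());
    pending.delete(index);
    yield value;
  }
}

// NDJSON: exactly one JSON document per line, terminated by a newline.
function toNdjsonLine(row: unknown): string {
  return JSON.stringify(row) + "\n";
}

// Combines the two: one NDJSON line per finished task, in completion order.
async function* ndjsonLines<T>(tasks: Promise<T>[]): AsyncGenerator<string> {
  for await (const row of inCompletionOrder(tasks)) {
    yield toNdjsonLine(row);
  }
}
```

In a route handler, lines like these would typically be written into a streamed response body with a Content-Type of application/x-ndjson; that wiring is omitted here because it depends on the framework version.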

Required behavior:

Validate every input ticket with TicketSchema.
Call Nemotron Nano first for every ticket.
Escalate to Claude Sonnet when Nano's output trips an escalation rule. Match the logic in scripts/bakeoff.ts:shouldEscalate — confidence, stakes (P0/P1, needs_human, abuse), and a domain-tuned high-disagreement category list. Production deployments replace the category list with a learned router; see Production Caveats.
Cap the number of in-flight Bedrock calls with a bounded concurrency limit.
Retry throttling and transient 5xx failures with exponential backoff and jitter.
Attach optional Bedrock Guardrails when BEDROCK_GUARDRAIL_ID and BEDROCK_GUARDRAIL_VERSION are configured.
Validate every model output with Zod before streaming it.
Add tests that mirror tests/triage.test.ts.
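
The concurrency requirement in the list above can be met with a small worker pool. This is a minimal sketch under the assumption that each ticket is processed by one async function; `mapWithConcurrency` is a hypothetical helper, and the right limit depends on your Bedrock quotas.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order. A rejected worker rejects the whole
// call; already-started workers are not cancelled.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims items one at a time until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Because JavaScript runs the claim step (`next++`) synchronously, two runners never pick the same item, so no locking is needed.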

The full implementation spec is in docs/phase2-change.md.
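
For the retry requirement, a common approach is exponential backoff with full jitter: the delay cap doubles on each attempt and the actual wait is a random fraction of that cap. The sketch below is one way to write it; `withRetry`, `RetryOptions`, and `isThrottleOrServerError` are hypothetical names, and the error classifier assumes AWS SDK v3 style errors that carry `name` and `$metadata.httpStatusCode`. The SDK also has its own built-in retry behavior, so check how the two interact before layering this on top.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first
  baseDelayMs: number; // delay cap for the first retry
  maxDelayMs: number; // upper bound on any single delay
  isRetryable: (error: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number; // injectable for tests, must return [0, 1)
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries `fn` on retryable errors, waiting random() * min(max, base * 2^(n-1))
// before retry n. Non-retryable errors and the final failure are rethrown.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  const random = opts.random ?? Math.random;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= opts.maxAttempts || !opts.isRetryable(error)) throw error;
      const cap = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1));
      await sleep(random() * cap);
    }
  }
}

// Assumed classifier: throttling by name or HTTP 429, plus any 5xx status.
function isThrottleOrServerError(error: unknown): boolean {
  const e = error as { name?: string; $metadata?: { httpStatusCode?: number } } | null;
  const status = e?.$metadata?.httpStatusCode ?? 0;
  return e?.name === "ThrottlingException" || status === 429 || status >= 500;
}
```

Jitter matters in the bulk case because many tickets tend to hit a throttle at the same moment; without it they would all retry in lockstep.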

Scripts

| Command | What it does |
| --- | --- |
| `npm run dev` | Start the Next.js dev server |
| `npm run typecheck` | Run TypeScript checking |
| `npm test` | Run Vitest |
| `npm run generate-tickets` | Regenerate `data/synthetic-1k.json` |
| `npm run bakeoff -- --label` | Generate judge labels |
| `npm run bakeoff -- --all` | Run strategy comparison |
| `npm run bakeoff -- --dry-run --all` | Replay cached comparison results |

Region And Models

The sample runs against AWS_REGION=us-west-2. Model IDs live in lib/bedrock/models.ts; reference them through the MODELS constant.

Current pins:

CLAUDE_SONNET:  "us.anthropic.claude-sonnet-4-6"
NEMOTRON_NANO:  "nvidia.nemotron-nano-3-30b"
NEMOTRON_SUPER: "nvidia.nemotron-super-3-120b"
OPUS_JUDGE:     "us.anthropic.claude-opus-4-7"

NEMOTRON_SUPER remains available for experimentation, but the default cascade is Nano to Claude Sonnet.
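
To show what referencing IDs through a constant can look like, here is a sketch of a `MODELS` object built from the pins above. The actual layout of lib/bedrock/models.ts is not shown in this README, so the `as const` typing, the `ModelKey` type, and the `DEFAULT_CASCADE` and `modelIdFor` names are assumptions.

```typescript
// Model IDs copied from the pins listed above.
const MODELS = {
  CLAUDE_SONNET: "us.anthropic.claude-sonnet-4-6",
  NEMOTRON_NANO: "nvidia.nemotron-nano-3-30b",
  NEMOTRON_SUPER: "nvidia.nemotron-super-3-120b",
  OPUS_JUDGE: "us.anthropic.claude-opus-4-7",
} as const;

// Only the four keys above are valid; a typo fails at compile time.
type ModelKey = keyof typeof MODELS;

// Hypothetical: the default cascade expressed as an ordered list of keys,
// first pass first, escalation target second.
const DEFAULT_CASCADE: readonly ModelKey[] = ["NEMOTRON_NANO", "CLAUDE_SONNET"];

// Resolves a key to the model ID string passed to Bedrock.
function modelIdFor(key: ModelKey): string {
  return MODELS[key];
}
```

Keeping the cascade as a list of keys means swapping the escalation tier (for example to NEMOTRON_SUPER for an experiment) is a one-line change.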

Future: a third cascade tier

NVIDIA released Nemotron 3 Ultra (550B / 55B-active MoE) on June 4, 2026 as a frontier reasoning tier above Super, pitched for long-running agentic workloads — sustained multi-turn planning, sub-agent delegation, and deep reasoning. The cascade pattern in this repo extends naturally to a third rung: Nano (volume) → Claude Sonnet or Super (escalation) → Ultra (frontier reasoning for the long tail of agent-orchestration tasks). Once Ultra lands on Amazon Bedrock, the addition is a single constant in lib/bedrock/models.ts and one branch in the shouldEscalate function.

Reference: NVIDIA developer blog.

Production Caveats

This is sample code, not a turnkey production routing policy. Before shipping the pattern, run a domain-specific eval, shadow the cascade beside your existing model, define escalation thresholds with real error costs, set cost and latency budgets, and monitor retry rate, escalation rate, category drift, and human override rate.

Repository Status

This repository is intended for publication as an aws-samples sample after OpenSourcerer self-certification and Public Content Security Review.

Suggested public repo name:

sample-nvidia-nemotron-cascade-workshop

Security

See SECURITY.md for reporting security issues.

Contributing

See CONTRIBUTING.md.

License

This sample is licensed under the MIT-0 License. See LICENSE.
