Skip to content

Commit f2aa540

Browse files
Feature/add responses api to costing sample (#209)
1 parent 3aca976 commit f2aa540

13 files changed

Lines changed: 749 additions & 136 deletions

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -67,7 +67,7 @@ It's quick and easy to get started!
6767
| [AuthX][sample-authx] | Authentication and role-based authorization in a mock HR API. | All infrastructures |
6868
| [AuthX Pro][sample-authx-pro] | Authentication and role-based authorization in a mock product with multiple APIs and policy fragments. | All infrastructures |
6969
| [Azure Maps][sample-azure-maps] | Proxying calls to Azure Maps with APIM policies. | All infrastructures |
70-
| [Costing][sample-costing] | Track and allocate API costs per business unit using APIM subscriptions, Entra ID application tracking, and AI Gateway token/PTU tracking including streaming (SSE) token usage, which is not simple to capture correctly in APIM. | All infrastructures |
70+
| [Costing][sample-costing] | Track and allocate API costs per business unit using APIM subscriptions, Entra ID application tracking, and AI Gateway token/PTU tracking across **both** Azure OpenAI Chat Completions and Responses APIs, including streaming (SSE) token usage which is not simple to capture correctly in APIM. | All infrastructures |
7171
| [Dynamic CORS][sample-dynamic-cors] | Dynamic per-API CORS origin validation using custom policy fragments and a maintainable origin mapping. | All infrastructures |
7272
| [Egress Control][sample-egress-control] | Control APIM outbound internet traffic by routing it through a Network Virtual Appliance (NVA) in a hub/spoke topology. | appgw-apim, appgw-apim-pe |
7373
| [General][sample-general] | Basic demo of APIM sample setup and policy usage. | All infrastructures |

assets/APIM-Samples-Slide-Deck.html

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1118,7 +1118,7 @@ <h4>Azure Maps</h4>
11181118
</div>
11191119
<div class="arch-card">
11201120
<h4>Costing</h4>
1121-
<p>Track API costs per business unit via subscriptions, Entra ID apps, and AI Gateway tokens, <em>including streaming (SSE) token usage</em> (not simple to capture correctly in APIM).</p>
1121+
<p>Track API costs per business unit via subscriptions, Entra ID apps, and AI Gateway tokens across <em>both Azure OpenAI Chat Completions and Responses APIs</em>, including streaming (SSE) token usage (not simple to capture correctly in APIM).</p>
11221122
</div>
11231123
<div class="arch-card">
11241124
<h4>Dynamic CORS</h4>

docs/index.html

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -447,7 +447,7 @@ <h3>Azure Maps</h3>
447447

448448
<a class="sample-card" href="https://github.com/Azure-Samples/Apim-Samples/tree/main/samples/costing" target="_blank" rel="noopener">
449449
<h3>Costing</h3>
450-
<p>Track and allocate API costs per business unit using subscriptions, Entra ID application tracking, and AI Gateway token/PTU tracking <em>including streaming (SSE) token usage</em>, which is not simple to capture correctly in APIM.</p>
450+
<p>Track and allocate API costs per business unit using subscriptions, Entra ID application tracking, and AI Gateway token/PTU tracking across <em>both Azure OpenAI Chat Completions and Responses APIs</em>, including streaming (SSE) token usage which is not simple to capture correctly in APIM.</p>
451451
<span class="infra-tag">All infrastructures</span>
452452
</a>
453453

samples/costing/README.md

Lines changed: 19 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ This sample demonstrates how to track and allocate API costs using Azure API Man
1616
6. **Enable cost governance** - Establish patterns for consistent tagging and naming conventions
1717
7. **Enable budget alerts** - Create scheduled query alerts when callers exceed configurable thresholds
1818
8. **Track AI token consumption per client** - When APIM is used as an AI Gateway, capture prompt, completion, and total token usage per calling application, enabling per-client cost attribution for PTU or pay-as-you-go OpenAI deployments
19-
9. **Real AOAI interactions via Foundry** (optional) - Deploy a full Microsoft Foundry environment (Hub + Project + Azure AI Services) and route real Azure OpenAI chat completions through APIM, demonstrating accurate token tracking for both non-streaming and streaming (SSE) responses
19+
9. **Real AOAI interactions via Foundry** (optional) - Deploy a full Microsoft Foundry environment (Hub + Project + Azure AI Services) and route real Azure OpenAI traffic through APIM across **both the Chat Completions and Responses APIs**, demonstrating accurate token tracking for non-streaming, streaming (SSE), and stateless (`store: false`) requests
2020

2121
> **Note on non-OpenAI models**: This sample deploys an Azure OpenAI model only (default: `gpt-5-mini`). Other model families on Azure AI Services - such as Anthropic Claude via the Azure Marketplace - are gated by separate quota that is granted through a manual approval process, which puts them beyond the scope of a self-service sample. If you have approved quota for another provider, you can extend the sample by adding a second deployment in `main.bicep`; the token-tracking policy and workbook queries are model-agnostic.
2222
@@ -86,6 +86,23 @@ The workbook surfaces **both** streaming variants side-by-side so you can see ex
8686

8787
The **AI Gateway** tab's *Streaming vs Non-Streaming Breakdown* and the **Per-Request Detail** tab's `AI Delivery Mode` + `Usage Provenance` columns both render this distinction, so you can confirm token capture works regardless of whether the client or APIM supplied the usage option.
8888

89+
### AI Surface Coverage (Chat Completions + Responses API)
90+
91+
The notebook exercises **six** AI request modes per business unit per model so you can see APIM token tracking work across both Azure OpenAI surfaces and every streaming variant. Mode is chosen by `j % 6` for the `j`-th request within a business unit, giving a deterministic, even mix:
92+
93+
| Mode | API surface | Streaming | Notes |
94+
| --- | --- | --- | --- |
95+
| 0 | Chat Completions | No | Baseline non-streaming chat. |
96+
| 1 | Chat Completions | Yes | Client sends `stream_options.include_usage = true`; APIM forwards unchanged. |
97+
| 2 | Chat Completions | Yes | Client omits `stream_options`; the `pf-ensure-stream-include-usage.xml` fragment injects it and emits an `IncludeUsageInjected` trace. |
98+
| 3 | Responses API | No | Stateful (`store` defaults to `true`); uses `input` + `max_output_tokens`. |
99+
| 4 | Responses API | Yes | Streaming Responses; the policy fragment is a no-op for this surface. |
100+
| 5 | Responses API | No | Stateless variant with `store: false` to demonstrate ephemeral usage. |
101+
102+
The Chat Completions and Responses APIs use different api-versions (`2024-10-21` vs `2025-03-01-preview`), different routes (`/deployments/{id}/chat/completions` vs `/responses`), and different request shapes (`messages` + `max_completion_tokens` vs `input` + `max_output_tokens`). They share the same `aoai-backend` and the same APIM AI logger, so `ApiManagementGatewayLlmLog` rows from both surfaces flow into the same workspace and are split by `OperationId` (`chat-completions-create` vs `responses-create`) in the workbook.
103+
104+
The `pf-ensure-stream-include-usage.xml` fragment short-circuits for the Responses API: it only inspects the body when `messages` is present, so Responses requests pass through untouched. The workbook's *Streaming vs Non-Streaming Breakdown*, *Token Counts by Business Unit & Delivery Mode* table, and *Per-Request Detail* tab all surface an `API Surface` column / slice (`Chat` vs `Responses`) so you can verify each mode produced its expected rows.
105+
89106
> **Business unit attribution**: Join `ApiManagementGatewayLlmLog` with `ApiManagementGatewayLogs` on `CorrelationId` to map token counts to `ApimSubscriptionId` (business unit). See `bu-token-usage.kql` for a ready-to-use query.
90107
91108
### Context Propagation
@@ -114,7 +131,7 @@ This lab deploys and configures:
114131
- **Azure Monitor Workbook** - Pre-built tabbed dashboard with:
115132
- **Subscription-Based Costing tab**: Cost allocation table (base + variable cost per BU), base vs variable cost stacked bar chart, cost breakdown by API, request count and distribution charts, success/error rate analysis, response code distribution, business unit drill-down
116133
- **Entra ID Application Costing tab**: Usage by caller ID (bar chart + table), cost allocation by caller (table + pie chart), hourly request trend by caller
117-
- **AI Gateway Token/PTU tab**: Three rows of summary tiles grouped under **APIM Inbound** (total APIM requests, AI APIM requests, inbound), **AI Backend** (backend requests, successful, throttled, failed), and **Tokens** (total tokens), followed by a request-funnel table, scope-reconciliation explainer + table, token cost allocation table with configurable per-1K-token rates, model and streaming pie charts, streaming vs non-streaming breakdown table, token-share pie, and hourly token-type trend chart
134+
- **AI Gateway Token/PTU tab**: Summary tiles grouped under **APIM Inbound** (AI Requests across all subs, AI Requests per BU) and **AI Backend** (a Successful row with `Successful (all 2xx)`, `Successful (2xx, with tokens)`, `Successful (no tokens)`, and an Errors row with `Throttled (429)`, `Client Errors (4xx)`, `Server Errors (5xx)`), then a **Tokens** row (total tokens), followed by a request-funnel table, a Token Coverage Investigation drill-in for `Successful (no tokens)`, scope-reconciliation explainer + table, token cost allocation table with configurable per-1K-token rates, model and streaming pie charts, streaming vs non-streaming breakdown table, token-share pie, and hourly token-type trend chart
118135
- **SKU-Based Pricing** - Automatically derives base monthly cost, overage rate, and included request allowance from the deployed APIM SKU using built-in pricing data (sourced from the [Azure API Management pricing page](https://azure.microsoft.com/pricing/details/api-management/), March 2026)
119136
- **Budget Alerts** (optional) - Per-BU scheduled query alerts when request thresholds are exceeded
120137

0 commit comments

Comments
 (0)