13 changes: 13 additions & 0 deletions charts/ai-models/templates/_helpers.tpl
@@ -2,6 +2,19 @@
{{- printf "%d" (int (round (mulf (default 0 .) 1000) 0)) -}}
{{- end -}}

{{/* Resolve a plan's monthly budget for a model in micro-USD:
     modelBudgets.overrides.<model> when set, otherwise monthlyBudgetUsd. */}}
{{- define "ai-models.modelBudgetMicroUsd" -}}
{{- $planName := .planName -}}
{{- $modelName := .modelName -}}
{{- $planValue := .planValue -}}
{{- $result := $planValue.monthlyBudgetUsd -}}
{{- if and (hasKey $planValue "modelBudgets") (hasKey $planValue.modelBudgets "overrides") -}}
{{- if hasKey $planValue.modelBudgets.overrides $modelName -}}
{{- $result = index $planValue.modelBudgets.overrides $modelName -}}
{{- end -}}
{{- end -}}
{{- printf "%d" (int64 (round (mulf $result 1000000) 0)) -}}
{{- end -}}

{{- define "ai-models.weightedCostBranch" -}}
{{- $pricing := . -}}
{{- $inputScaled := include "ai-models.priceScale" $pricing.inputPer1M -}}
4 changes: 2 additions & 2 deletions charts/ai-models/templates/backendtrafficpolicy.yaml
@@ -30,8 +30,8 @@ spec:
value: {{ $routeName }}
limit:
# `llm_custom_total_cost` is emitted as an integer in micro-USD (USD * 1e6).
# Budget: use modelBudgets.overrides.<model> if defined, else monthlyBudgetUsd.
requests: {{ include "ai-models.modelBudgetMicroUsd" (dict "planName" $planName "modelName" $routeName "planValue" $planValue) }}
unit: Month
cost:
request:
17 changes: 16 additions & 1 deletion charts/ai-models/values.yaml
@@ -1,14 +1,29 @@
# Monthly estimated-spend guardrails.
# Each plan value is expressed in real USD, but the chart converts it to micro-USD because
# `llm_custom_total_cost` is emitted by Envoy AI Gateway as an integer in micro-USD.
#
# Structure:
# - plans.<plan>.monthlyBudgetUsd: Default budget for all models (USD).
# - plans.<plan>.modelBudgets.overrides.<model>: Per-model budget override (USD).
#
# Example:
#   free:
#     monthlyBudgetUsd: 30       # Default: $30 per month for all models
#     modelBudgets:
#       overrides:
#         gpt-5-mini: 10         # Override: $10 for this specific model
#         gemini-2.5-pro: 50     # Override: $50 for this specific model
rateLimitBudgeting:
  plans:
    free:
      monthlyBudgetUsd: 30
      modelBudgets:
        overrides: {}
    pro:
      monthlyBudgetUsd: 200
      modelBudgets:
        overrides: {}

gatewayRef:
  name: core-gateway
35 changes: 33 additions & 2 deletions docs/models-chart-docs/README.md
@@ -116,6 +116,14 @@ For the ticket-focused investigation of the full pipeline, deployed versions, an

Docs: [rate-limit-investigation.md](./rate-limit-investigation.md)

That guide explains:

- what `weighted`, `flat`, and `tieredWeighted` mean
- the difference between input, cached input, and output tokens
- how `standard` and `longContext` pricing work
- how the math turns token usage into `llm_custom_total_cost`
- how monthly budgets interact with the fallback requests-per-minute rule

### Gateway Reference

| Parameter | Description | Default |
@@ -128,22 +136,40 @@ Docs: [rate-limit-investigation.md](./rate-limit-investigation.md)
| Parameter | Description | Default |
|-----------|-------------|---------|
| `rateLimitBudgeting.plans.<plan>.monthlyBudgetUsd` | Monthly estimated spend guard per account, plan, and model | `free=30`, `pro=200` |
| `rateLimitBudgeting.plans.<plan>.modelBudgets.overrides.<model>` | Per-model budget override (USD) | none (uses `monthlyBudgetUsd`) |
| `models.<name>.pricing.strategy` | Cost model used to compute `llm_custom_total_cost` | `weighted`, `flat`, `tieredWeighted` |

The chart uses two rate-limit controls:

1. A monthly budget rule based on estimated request cost.
2. A fallback requests-per-minute rule that guards against bursts.

The budget rule matches `x-account-id + x-billing-plan + x-ai-eg-model`.
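
To make that match concrete, here is a hand-written sketch of what the generated selector amounts to, expressed in Envoy Gateway `clientSelectors` form (the `Distinct` type and the literal plan/model values are illustrative assumptions, not the chart's rendered output):

```yaml
clientSelectors:
  - headers:
      - name: x-account-id
        type: Distinct        # a separate bucket per account
      - name: x-billing-plan
        value: free           # the chart renders one rule per plan
      - name: x-ai-eg-model
        value: gpt-5-mini     # and one policy per model route
```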

The budget is decremented using the cost reported in response metadata, so the request that crosses the budget can still succeed. Once Redis contains an exhausted bucket from earlier responses, the next matching request is rejected before it reaches the upstream provider.

Budget resolution: `modelBudgets.overrides.<model>` if defined, else `monthlyBudgetUsd`.

Example with per-model budgets:

```yaml
rateLimitBudgeting:
  plans:
    free:
      monthlyBudgetUsd: 30        # Default: $30 per month for all models
      modelBudgets:
        overrides:
          gpt-5-mini: 10          # Override: $10 for this specific model
          gemini-2.5-pro: 50      # Override: $50 for this specific model
```

### Backend Traffic Policy

| Parameter | Description | Default |
|-----------|-------------|---------|
| `BackendTrafficPolicy` target | One `BackendTrafficPolicy` is rendered per model route | each `HTTPRoute` |
| Monthly budget selector | `x-account-id`, `x-billing-plan`, `x-ai-eg-model` | generated |
| Budget limit unit | Same unit as `llm_custom_total_cost` | micro-USD |
| Fallback selector | `x-api-key-id`, `x-ai-eg-model` | generated |

`BackendTrafficPolicy` does not calculate cost by itself. It reads the `llm_custom_total_cost` value produced by `AIGatewayRoute` and uses that response metadata as the cost of the request.
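
As a sketch, that wiring looks roughly like the following (the `cost.response.from: Metadata` shape follows the Envoy Gateway `BackendTrafficPolicy` API; treat the exact field layout as an assumption and compare it against the chart's rendered manifest):

```yaml
rateLimit:
  type: Global
  global:
    rules:
      - limit:
          requests: 30000000            # monthly budget in micro-USD ($30)
          unit: Month
        cost:
          response:
            from: Metadata              # charge the cost computed on the response path
            metadata:
              namespace: io.envoy.ai_gateway
              key: llm_custom_total_cost
```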

@@ -195,8 +221,12 @@ rateLimitBudgeting:
  plans:
    free:
      monthlyBudgetUsd: 30
      modelBudgets:
        overrides: {}
    pro:
      monthlyBudgetUsd: 200
      modelBudgets:
        overrides: {}

backends:
  gpt-01:
@@ -270,8 +300,9 @@ Instead, rate limiting works like this:

1. `AIGatewayRoute` computes `llm_custom_total_cost` from token usage and the model's pricing block.
2. `BackendTrafficPolicy` charges that value against the monthly budget for the account, billing plan, and model.
3. A separate fallback requests-per-minute rule protects against bursts.

The budget is decremented using the cost reported in response metadata, so the request that crosses the budget can still succeed. Once Redis contains an exhausted bucket from earlier responses, the next matching request is rejected before it reaches the upstream provider.

If you need to tune behavior, update:

49 changes: 42 additions & 7 deletions docs/models-chart-docs/cost-tracking.md
@@ -239,12 +239,13 @@ The important detail is that step 4 happens on the response path.
That means:

1. the request that crosses the monthly budget can still succeed
2. the next matching request is the one that gets blocked once Redis contains the exhausted bucket

This delayed enforcement is why the chart also supports a simple requests-per-minute fallback.

## Who Gets Limited

There are two different rules in `BackendTrafficPolicy`.

### Monthly budget rule

@@ -260,12 +261,25 @@ Meaning:
2. the bucket depends on the billing plan
3. the bucket is separate for each model

### Fallback burst rule

This rule matches on:

1. `x-api-key-id`
2. `x-ai-eg-model`

Meaning:

1. each API key gets its own burst limit
2. the burst limit is separate for each model
3. this rule protects against spikes, not monthly spend fairness
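
In the same `clientSelectors` form as the budget rule, the fallback match can be sketched as follows (the `Distinct` types are assumptions; check the rendered policy):

```yaml
clientSelectors:
  - headers:
      - name: x-api-key-id
        type: Distinct   # one burst bucket per API key
      - name: x-ai-eg-model
        type: Distinct   # and a separate bucket per model
```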

## The Main Rate-Limit Settings In `values.yaml`

| Setting | What it means | Example |
| --- | --- | --- |
| `rateLimitBudgeting.plans.<plan>.monthlyBudgetUsd` | Default budget for all models | `30` |
| `rateLimitBudgeting.plans.<plan>.modelBudgets.overrides.<model>` | Per-model budget override | `gpt-5-mini: 10` |
| `models.<name>.pricing.strategy` | Which pricing formula to use | `weighted`, `flat`, `tieredWeighted` |
| `models.<name>.pricing.standard.inputPer1M` | Fresh input token price | `0.75` |
| `models.<name>.pricing.standard.cachedInputPer1M` | Cached input token price | `0.075` |
@@ -274,6 +288,28 @@
| `models.<name>.pricing.thresholdTokens` | Threshold that activates `longContext` pricing | `200000` |
| `models.<name>.pricing.longContext.*` | Prices used after the threshold is crossed | see Gemini profiles |
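
Assembled from the settings above, a complete pricing block might look like this (the numbers are the example values from the table, not a live price list; `example-model` and the single `longContext.inputPer1M` key are illustrative stand-ins):

```yaml
models:
  example-model:                  # hypothetical model name
    pricing:
      strategy: tieredWeighted
      thresholdTokens: 200000     # switch to longContext prices past this
      standard:
        inputPer1M: 0.75          # USD per 1M fresh input tokens
        cachedInputPer1M: 0.075   # USD per 1M cached input tokens
      longContext:
        inputPer1M: 1.5           # illustrative higher price after the threshold
```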

### Per-Model Budget Configuration

Each billing plan supports per-model budget overrides. The resolution order is:

1. Check `rateLimitBudgeting.plans.<plan>.modelBudgets.overrides.<model>`
2. Fall back to `rateLimitBudgeting.plans.<plan>.monthlyBudgetUsd`

Example configuration:

```yaml
rateLimitBudgeting:
  plans:
    free:
      monthlyBudgetUsd: 30      # Default: $30 per month for all models
      modelBudgets:
        overrides:
          gpt-5-mini: 10        # Expensive model: $10
          gemini-2.5-pro: 50    # Popular model: $50
```

This allows fine-grained control over spend allocation across different models while maintaining a sensible default.

## Where The Math Lives In The Chart

If you need to trace the implementation:
@@ -301,5 +337,4 @@ These limits should be understood by anyone operating the chart:
2. Cached-input discounting only works when the provider reports cached token usage.
3. Image models can still be approximate because providers may publish separate text-input and image-input prices while the current Envoy metadata only gives aggregate `input_tokens` and `output_tokens`.
4. Very small requests can round down to `0` micro-USD because Envoy CEL uses integer math.
5. The metadata key must stay exactly `io.envoy.ai_gateway/llm_custom_total_cost`, or the monthly budget rule will stop working.
29 changes: 12 additions & 17 deletions docs/models-chart-docs/rate-limit-investigation.md
@@ -257,29 +257,26 @@ The budget rules match on:
If a request bypasses the `core-gateway` auth path, these headers may be missing. In that case the
budget rule does not apply.

## Review Of The `30 req/min` Fallback

The fallback rule is currently:

- per `x-api-key-id`
- per `x-ai-eg-model`
- default `30` requests per minute
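
Since the limit is values-driven, its configuration presumably looks something like the sketch below (the key names here are hypothetical; check `charts/ai-models/values.yaml` for the real ones):

```yaml
# Hypothetical values.yaml shape for the fallback limiter:
rateLimitFallback:
  requestsPerMinute: 30   # default burst limit per API key and model
```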

This rule is intentionally separate from monthly budgets:

- the monthly budget is decremented on the response path, so the request that crosses the budget can still succeed
- the fallback exists as a coarse burst guard while that delayed enforcement catches up

Limitations:

- it is not cost-aware
- it is plan-agnostic
- it can reject requests even when monthly budget remains

In this repo, that trade-off is accepted because it is explicitly framed as burst protection, not spend fairness.

## Changes Made In This Branch

@@ -288,7 +285,7 @@ Besides documenting the current behavior, this branch makes four chart changes:
1. Model pricing now lives under an explicit `pricing.strategy` block in `charts/ai-models/values.yaml`.
2. Vendor prices were refreshed from the current Fireworks, OpenAI, and Gemini pricing pages.
3. Gemini long-context tiers are now represented explicitly with `tieredWeighted` pricing.
4. The fallback request-rate limit is values-driven instead of hardcoded.

Rationale:

@@ -298,15 +295,13 @@
`tieredWeighted`
- `BackendTrafficPolicy` still consumes a single `llm_custom_total_cost` metadata key, so this keeps
the enforcement path stable while improving the estimate

## Proposed Improvements

| Proposal | Rationale | Complexity |
| --- | --- | --- |
| Keep cost-based limiting in CEL, but treat it as an estimate | This repo now has explicit `weighted`, `flat`, and `tieredWeighted` pricing strategies, which is better than raw token limits. The right framing is “estimated spend guardrail”, not “exact billing.” | Low |
| Add a gateway-level burst-abuse policy alongside route-level budget rules | Envoy Gateway supports layered `BackendTrafficPolicy` behavior. A gateway-level abuse limit plus route-level budget rules is easier to reason about than one hardcoded per-route fallback. | Medium |
| Build budget observability from existing access logs | The access log already exports `gen_ai.usage.custom_total_cost`, `account_id`, `billing_plan`, and `api_key_id`. A dashboard can show budget burn before users hit `429`. | Medium |
| Validate cached-token telemetry per provider | Cached pricing is where the estimate can diverge the most. This should be verified against live responses from OpenAI, Gemini proxy paths, Fireworks, and Vertex/OpenAI-compatible backends. | Medium |
| Do not plan around `tokenBudget` yet | I could not find `tokenBudget` in the Envoy Gateway `v1.7.x` or Envoy AI Gateway `v0.5.0` docs queried via Context7/Tavily. Upgrade exploration is needed before treating it as a viable option. | Medium |