feat(ai-proxy): add cooldownDuration support for failover token recovery by wydream · Pull Request #3700 · higress-group/higress

wydream · 2026-04-10T08:55:56Z

Ⅰ. Describe what this PR did

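This PR adds a cooldownDuration option to the ai-proxy failover mechanism. After an API key is removed from the available list, it is restored automatically once the cooldown period has elapsed, with no active health check required.

Background

Currently, once the failover mechanism marks an API key as unavailable, the only recovery path is the active health check, which periodically sends real ChatCompletion requests to the provider. This has two problems:

Extra token consumption: health checks incur real token costs
High configuration barrier: healthCheckModel must be configured, otherwise failover cannot be enabled

For keys removed because of rate limiting (429), restoring them automatically after a waiting period is the more reasonable strategy.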

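Main changes

failover struct extension (provider/failover.go):
- Add a cooldownDuration int64 field (milliseconds) that configures the cooldown recovery time
- Add the ctxApiTokenUnavailableSince shared data key, which records the timestamp at which each token was removed
Config parsing and validation (provider/failover.go):
- FromJson parses cooldownDuration
- Validate is relaxed to require either healthCheckModel or cooldownDuration; both may also be configured together
Cooldown recovery logic (provider/failover.go):
- When a token is removed, time.Now().UnixMilli() is recorded in shared data
- The timer callback checks cooldown recovery first: it iterates over the unavailable tokens and restores a token directly if now - unavailableSince >= cooldownDuration
- Tokens already restored by cooldown do not enter the health check flow
- The two recovery modes can be used independently or configured together
CAS-safe helper functions (provider/failover.go):
- getApiTokenUnavailableSince: reads the removal timestamp of a token
- setApiTokenUnavailableSince: records the removal timestamp of a token (with CAS retry)
- removeApiTokenUnavailableSince: clears the timestamp of a restored token (with CAS retry)
Bug fix (provider/failover.go):
- Fix initApiTokens hardcoding CAS to 0, which caused writes to fail after resetSharedData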
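
The cooldown recovery rule listed above (record a removal timestamp, then restore once now - unavailableSince >= cooldownDuration) can be sketched as follows. This is a minimal illustration, not the code in provider/failover.go: the cooldownTracker type and its methods are hypothetical names, and a plain in-memory map stands in for the SharedData entry behind ctxApiTokenUnavailableSince.

```go
package main

import (
	"fmt"
	"sort"
)

// cooldownTracker is a hypothetical in-memory stand-in for the shared data
// the plugin keeps; it is not the plugin's real type.
type cooldownTracker struct {
	cooldownMs       int64            // cooldownDuration, in milliseconds
	unavailableSince map[string]int64 // token -> removal time (Unix ms)
}

func newCooldownTracker(cooldownMs int64) *cooldownTracker {
	return &cooldownTracker{cooldownMs: cooldownMs, unavailableSince: map[string]int64{}}
}

// markUnavailable records when a token was removed from the available list.
func (c *cooldownTracker) markUnavailable(token string, nowMs int64) {
	c.unavailableSince[token] = nowMs
}

// tick returns the tokens whose cooldown has elapsed (sorted for stable
// output) and clears their timestamps so they are not restored twice.
func (c *cooldownTracker) tick(nowMs int64) []string {
	var restored []string
	for token, since := range c.unavailableSince {
		if nowMs-since >= c.cooldownMs {
			restored = append(restored, token)
			delete(c.unavailableSince, token)
		}
	}
	sort.Strings(restored)
	return restored
}

func main() {
	tracker := newCooldownTracker(60000)
	tracker.markUnavailable("sk-exhausted-key", 1000)
	fmt.Println(tracker.tick(30000)) // cooldown has not elapsed yet
	fmt.Println(tracker.tick(61000)) // 60000 ms have passed since removal
}
```

Timestamps are passed in explicitly rather than read from time.Now().UnixMilli() so that the sketch stays deterministic; the PR description says the plugin records the wall-clock value at removal time.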

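Configuration examples

# Cooldown recovery only (no health check, zero token consumption)
failover:
  enabled: true
  failureThreshold: 3
  cooldownDuration: 60000   # restore automatically after 60 seconds
  failoverOnStatus:
    - "429"

# Cooldown recovery + health check (cooldown takes priority; tokens whose cooldown has not expired go through the health check)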
failover:
  enabled: true
  failureThreshold: 3
  cooldownDuration: 60000
  healthCheckModel: "gpt-3.5-turbo"
  failoverOnStatus:
    - "429"
    - "5.*"
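
When both modes are configured, as in the second example, the description says cooldown is checked first and only the remaining tokens go through the health check. A rough sketch of that ordering, assuming a hypothetical splitForRecovery helper (the name, signature, and in-memory map are illustrative and not taken from the PR):

```go
package main

import "fmt"

// splitForRecovery is a hypothetical helper. It partitions unavailable tokens
// into those restored by cooldown and those left for the active health check.
// unavailableSince maps token -> removal time in Unix milliseconds.
func splitForRecovery(tokens []string, unavailableSince map[string]int64, cooldownMs, nowMs int64) (restored, toHealthCheck []string) {
	for _, token := range tokens {
		since, ok := unavailableSince[token]
		// Cooldown is considered first; cooldownMs == 0 means it is not configured.
		if cooldownMs > 0 && ok && nowMs-since >= cooldownMs {
			restored = append(restored, token)
			continue
		}
		// Not cooled down yet (or no timestamp): leave it to the health check.
		toHealthCheck = append(toHealthCheck, token)
	}
	return restored, toHealthCheck
}

func main() {
	since := map[string]int64{"key-a": 0, "key-b": 50000}
	restored, pending := splitForRecovery([]string{"key-a", "key-b"}, since, 60000, 70000)
	fmt.Println("restored by cooldown:", restored)
	fmt.Println("left for health check:", pending)
}
```

Whether the second group is actually probed depends on healthCheckModel being set; with cooldown only, those tokens would simply wait for a later tick.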

Ⅱ. Does this pull request fix one issue?

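It closes the following functional gaps in the failover mechanism:

After an API key is removed because of rate limiting (429), it can only be restored through the active health check, which consumes extra tokens
failover cannot be enabled without configuring healthCheckModel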

Ⅲ. Why don't you add test cases (unit test/integration test)?

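Unit tests and end-to-end tests have been added:

Unit tests (`provider/provider_test.go`)

✅ FromJson parsing of cooldownDuration (default value 0, custom value)
✅ Validate with four combinations (healthCheckModel only / cooldownDuration only / both configured / neither configured)
✅ Validate boundary case with a negative cooldownDuration

End-to-end tests (`test/cooldown.go`, 17 cases)

✅ Config parsing tests
- Cooldown-only config starts normally
- Cooldown + healthCheck config starts normally
- failover config without any recovery mechanism fails to start
- Single token + cooldown config starts normally
✅ Response trigger tests
- 429 triggers failover
- 200 does not trigger failover
- A non-matching status code (500) does not trigger failover (only 429 configured)
- Multi-status matching (500 triggers failover when 5.* is configured)
✅ Cooldown recovery tests
- A tick restores the token after the cooldown expires
- A tick does not restore the token before the cooldown expires
- With failureThreshold > 1, a single failure does not remove the token
- When all tokens are unavailable, an unavailable token is used as a fallback
- New requests work normally after cooldown recovery
- After one token is removed, the other one continues to be used
- A successful request resets the failure count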

Ⅳ. Describe how to verify it

Option 1: Run the unit tests

cd plugins/wasm-go/extensions/ai-proxy
go test -v -run "TestCooldown|TestFailover" ./...

Option 2: Manual verification

Configure an OpenAI provider with two API keys, one of which has exhausted its quota:

provider:
  type: openai
  apiTokens:
    - "sk-valid-key"
    - "sk-exhausted-key"
  modelMapping:
    "*": "gpt-3.5-turbo"
  failover:
    enabled: true
    failureThreshold: 1
    cooldownDuration: 60000
    failoverOnStatus:
      - "429"

Send requests and watch the logs:
- When sk-exhausted-key returns 429, the log prints failover: apiToken sk-exhausted-key is unavailable now
- Subsequent requests use only sk-valid-key
- After 60 seconds, the log prints cooldown recovery: apiToken sk-exhausted-key has cooled down for xxxms, restoring to available list
- After recovery, both keys take part in load balancing

Ⅴ. Special notes for reviews

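Compatibility: when cooldownDuration is not configured, behavior is identical to before, and existing configurations are unaffected
Validation change: Validate used to require healthCheckModel and now requires at least one of healthCheckModel or cooldownDuration. Users who already configure healthCheckModel are unaffected
Cooldown first: when both recovery modes are configured, cooldown recovery runs first, and tokens it restores do not enter the health check
Instance granularity: like the health check, cooldown recovery is based on Envoy's in-process SharedData, is computed independently by each Envoy instance, and reuses the lease-based leader election mechanism
Bug fix: the CAS fix in initApiTokens addresses an existing bug found during testing. After resetSharedData the CAS value is no longer 0, which caused subsequent writes to fail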
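
The relaxed validation rule described in these notes can be sketched roughly as below. The failoverConfig struct and validate method are hypothetical stand-ins for the plugin's own types. Treating a non-positive cooldownDuration as "not configured" is this sketch's assumption: the PR tests a negative-value boundary case, but the description does not state the expected outcome.

```go
package main

import (
	"errors"
	"fmt"
)

// failoverConfig is a hypothetical, trimmed-down view of the failover settings.
type failoverConfig struct {
	Enabled          bool
	HealthCheckModel string
	CooldownDuration int64 // milliseconds
}

// validate requires at least one recovery mechanism when failover is enabled.
// Assumption: a non-positive CooldownDuration counts as "not configured".
func (f failoverConfig) validate() error {
	if !f.Enabled {
		return nil
	}
	if f.HealthCheckModel == "" && f.CooldownDuration <= 0 {
		return errors.New("failover requires healthCheckModel or cooldownDuration")
	}
	return nil
}

func main() {
	configs := []failoverConfig{
		{Enabled: true, HealthCheckModel: "gpt-3.5-turbo"},
		{Enabled: true, CooldownDuration: 60000},
		{Enabled: true, HealthCheckModel: "gpt-3.5-turbo", CooldownDuration: 60000},
		{Enabled: true},
	}
	for _, cfg := range configs {
		fmt.Printf("%+v -> %v\n", cfg, cfg.validate())
	}
}
```

The four entries in main mirror the four combinations that the unit tests are said to cover.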

Ⅵ. AI Coding Tool Usage Checklist (if applicable)

Please check all applicable items:

For regular updates/changes (not new plugins):
- I have included the AI Coding summary below

AI Coding Summary

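Root cause:

Token recovery in the failover mechanism supports only the active health check, which consumes extra tokens
failover cannot be enabled without configuring healthCheckModel

Fix:

Add a cooldownDuration option that supports time-based passive recovery
Record a timestamp in SharedData when a token is removed, and check in the timer callback whether the cooldown has expired
Relax the Validate check so that either healthCheckModel or cooldownDuration is sufficient
Add CAS-safe helper functions for reading and writing the timestamps
Fix the hardcoded CAS bug in initApiTokens
Add 22 test cases (5 unit tests + 17 end-to-end tests)

Scope of impact:

provider/failover.go: core implementation (struct extension, config parsing, cooldown recovery logic, helper functions, bug fix)
provider/provider_test.go: new unit tests for cooldownDuration parsing and the Validate combinations
test/cooldown.go: new end-to-end test file (17 cases)
main_test.go: wires in the Cooldown test entry point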
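
A "CAS-safe" helper usually means a read-modify-write loop that retries when another writer got in first. The sketch below shows that general pattern only. casStore is a hypothetical in-memory store with strict version matching; it does not model how the proxy-wasm host treats special CAS values, and setUnavailableSince is an illustrative name, not the PR's setApiTokenUnavailableSince.

```go
package main

import (
	"errors"
	"fmt"
)

var errCasMismatch = errors.New("cas mismatch")

// casStore is a hypothetical single-entry store with compare-and-swap writes.
type casStore struct {
	value   map[string]int64
	version uint32
}

// get returns a copy of the stored map together with its current version.
func (s *casStore) get() (map[string]int64, uint32) {
	snapshot := make(map[string]int64, len(s.value))
	for k, v := range s.value {
		snapshot[k] = v
	}
	return snapshot, s.version
}

// set writes only if the caller saw the latest version.
func (s *casStore) set(v map[string]int64, cas uint32) error {
	if cas != s.version {
		return errCasMismatch
	}
	s.value = v
	s.version++
	return nil
}

// setUnavailableSince re-reads and retries when the write loses a race.
func setUnavailableSince(s *casStore, token string, nowMs int64, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		current, cas := s.get()
		current[token] = nowMs
		err := s.set(current, cas)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errCasMismatch) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxRetries, errCasMismatch)
}

func main() {
	store := &casStore{value: map[string]int64{}}
	if err := setUnavailableSince(store, "sk-exhausted-key", 1000, 3); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	current, version := store.get()
	fmt.Println(current["sk-exhausted-key"], version)
}
```

The key point is that each attempt uses the version it has just read rather than a fixed value.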

Add a time-based passive recovery mechanism for API tokens marked as unavailable during failover. When configured, tokens are automatically restored after the cooldown period elapses without sending health check requests, eliminating extra token consumption.

- Add cooldownDuration field to failover config (milliseconds)
- Record unavailable timestamp when token is removed from available list
- Check cooldown expiry in tick callback and restore tokens automatically
- Allow either healthCheckModel or cooldownDuration (or both) in validation
- Fix CAS mismatch bug in initApiTokens after resetSharedData
- Add 22 test cases (5 unit + 17 e2e) covering config, failover trigger, cooldown recovery, threshold, fallback, and failure count reset

Change-Id: I20630159aca6ad2a938a3b0c157366cedd9ef494

codecov-commenter · 2026-04-10T09:05:01Z

Codecov Report

❌ Patch coverage is 90.98940% with 51 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...s/wasm-go/extensions/ai-proxy/provider/failover.go	48.35%	36 Missing and 11 partials ⚠️
...ugins/wasm-go/extensions/ai-proxy/test/cooldown.go	99.15%	2 Missing and 2 partials ⚠️


johnlanni · 2026-04-14T11:13:25Z

cc @cr7258

Change-Id: I50ac7d777f100581d9636f647684226943f0f862

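- Replace the complex CAS retry logic in resetSharedData with a direct write using cas=0
- Add a comment explaining that cas=0 is used to clear the shared data state unconditionally
- Update the comments about resetting shared data on config updates
- Remove the provider/failover_test.go test file and its test functions

Change-Id: If13da8a4d2b3e8d16c355485352cbd8a5c9cdd11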

wydream requested review from johnlanni and rinfx as code owners April 10, 2026 08:55

Merge branch 'main' into feat/failover-cooldown-duration

dfe7b7f

wydream added 3 commits April 14, 2026 19:59

Fix failover shared data reset on config updates

0d97db6

Change-Id: I50ac7d777f100581d9636f647684226943f0f862

Merge branch 'main' into feat/failover-cooldown-duration

ba268d3

wydream wants to merge 5 commits into higress-group:main from wydream:feat/failover-cooldown-duration
