feat(ai-proxy): add cooldownDuration support for failover token recovery #3700
Open
wydream wants to merge 5 commits into higress-group:main from
Add a time-based passive recovery mechanism for API tokens marked as
unavailable during failover. When configured, tokens are automatically
restored after the cooldown period elapses without sending health check
requests, eliminating extra token consumption.
- Add cooldownDuration field to failover config (milliseconds)
- Record unavailable timestamp when token is removed from available list
- Check cooldown expiry in tick callback and restore tokens automatically
- Allow either healthCheckModel or cooldownDuration (or both) in validation
- Fix CAS mismatch bug in initApiTokens after resetSharedData
- Add 22 test cases (5 unit + 17 e2e) covering config, failover trigger,
cooldown recovery, threshold, fallback, and failure count reset
Change-Id: I20630159aca6ad2a938a3b0c157366cedd9ef494
cc @cr7258
Change-Id: I50ac7d777f100581d9636f647684226943f0f862
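- Replace the complex CAS retry logic in resetSharedData with a direct write using cas=0
- Add a comment explaining that cas=0 is used to clear the shared data state unconditionally
- Update the comments about resetting shared data on config updates
- Remove the provider/failover_test.go test file and its related test functions
Change-Id: If13da8a4d2b3e8d16c355485352cbd8a5c9cdd11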
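The cas=0 behavior mentioned in this commit can be illustrated with a toy store. This is only a sketch: `casStore`, `setWithRetry`, and `errCasMismatch` are hypothetical names, and the rule that `cas == 0` skips the version check is an assumption taken from the commit message, not a statement about any particular host implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var errCasMismatch = errors.New("cas mismatch")

type entry struct {
	value string
	cas   uint32 // version counter, bumped on every successful write
}

// casStore is a toy, single-threaded model of a CAS-guarded key-value store.
// Assumption: passing cas == 0 to set means "write unconditionally".
type casStore struct{ data map[string]entry }

// get returns the current value and its cas (both zero for a missing key).
func (s *casStore) get(key string) (string, uint32) {
	e := s.data[key]
	return e.value, e.cas
}

// set writes value if cas is 0 or matches the stored cas.
func (s *casStore) set(key, value string, cas uint32) error {
	cur := s.data[key]
	if cas != 0 && cas != cur.cas {
		return errCasMismatch
	}
	s.data[key] = entry{value: value, cas: cur.cas + 1}
	return nil
}

// setWithRetry re-reads the current cas before each attempt and retries on mismatch.
func setWithRetry(s *casStore, key, value string, attempts int) error {
	for i := 0; i < attempts; i++ {
		_, cas := s.get(key)
		if err := s.set(key, value, cas); err == nil {
			return nil
		}
	}
	return errCasMismatch
}

func main() {
	s := &casStore{data: map[string]entry{}}
	_ = s.set("tokens", "a,b", 0)
	_, stale := s.get("tokens")
	_ = s.set("tokens", "a", stale) // succeeds and bumps the stored cas

	fmt.Println(s.set("tokens", "b", stale))          // stale cas is rejected
	fmt.Println(s.set("tokens", "", 0))               // unconditional reset
	fmt.Println(setWithRetry(s, "tokens", "a,b", 3)) // read-then-write retry loop
}
```

A retry loop is useful when concurrent writers must not overwrite each other, while an unconditional write suits a full reset where the previous state no longer matters.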
Ⅰ. Describe what this PR did
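This PR adds a cooldownDuration option to the ai-proxy failover mechanism. An API key that has been removed is restored automatically once the cooldown period has elapsed, with no active health check required.
Background
In the current failover mechanism, once an API key is marked unavailable, the only recovery path is an active health check that periodically sends a real ChatCompletion request to the provider. This has two problems:
- Every health check is a real request, so it consumes tokens
- healthCheckModel must be configured, otherwise failover cannot be enabled
For keys removed because of rate limiting (429), waiting for a while and then restoring them automatically is a more reasonable strategy.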
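Main changes
failover struct extension (provider/failover.go):
- Add a cooldownDuration int64 field (milliseconds) that configures the cooldown recovery time
- Add a ctxApiTokenUnavailableSince shared data key that records the timestamp at which each token was removed
Config parsing and validation (provider/failover.go):
- FromJson parses cooldownDuration
- Validate is relaxed to require either healthCheckModel or cooldownDuration; both may also be configured together
Cooldown recovery logic (provider/failover.go):
- When a token is removed, record time.Now().UnixMilli() in shared data
- In the tick callback, if now - unavailableSince >= cooldownDuration, restore the token directly
CAS-safe helper functions (provider/failover.go):
- getApiTokenUnavailableSince: read the removal timestamp of a token
- setApiTokenUnavailableSince: record the removal timestamp of a token (with CAS retry)
- removeApiTokenUnavailableSince: clear the timestamp of a restored token (with CAS retry)
Bug fix (provider/failover.go):
- Fix writes failing after resetSharedData because the CAS value in initApiTokens was hard-coded to 0
Configuration example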
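A minimal sketch of a cooldown-only failover config and the relaxed validation rule, written as a small Go program. Everything here is illustrative: `failoverConfig`, `validate`, and `parseFailover` are hypothetical names, `encoding/json` is used for brevity rather than the plugin's own parsing, and the `enabled` field as well as the rejection of negative values are assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// failoverConfig is a hypothetical mirror of the failover settings,
// reduced to the fields relevant to this PR.
type failoverConfig struct {
	Enabled          bool   `json:"enabled"`
	HealthCheckModel string `json:"healthCheckModel"`
	CooldownDuration int64  `json:"cooldownDuration"` // milliseconds; 0 means not configured
}

// validate accepts healthCheckModel, cooldownDuration, or both.
// Assumption: a negative cooldownDuration is treated as invalid.
func (f *failoverConfig) validate() error {
	if !f.Enabled {
		return nil
	}
	if f.CooldownDuration < 0 {
		return errors.New("cooldownDuration must not be negative")
	}
	if f.HealthCheckModel == "" && f.CooldownDuration == 0 {
		return errors.New("either healthCheckModel or cooldownDuration must be set")
	}
	return nil
}

// parseFailover decodes a JSON fragment and validates it.
func parseFailover(raw string) (*failoverConfig, error) {
	var cfg failoverConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return nil, err
	}
	if err := cfg.validate(); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	// Cooldown-only configuration: restore removed keys after 60 seconds.
	cfg, err := parseFailover(`{"enabled": true, "cooldownDuration": 60000}`)
	fmt.Println(cfg, err)

	// Neither recovery path configured: validation fails.
	_, err = parseFailover(`{"enabled": true}`)
	fmt.Println(err)
}
```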
Ⅱ. Does this pull request fix one issue?
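It closes functional gaps in the failover mechanism in the following scenarios:
- Failover cannot be enabled unless healthCheckModel is configured
Ⅲ. Why don't you add test cases (unit test/integration test)?
Full unit tests and end-to-end tests have been added:
Unit tests (provider/provider_test.go)
- FromJson parsing of cooldownDuration (default value 0, custom value)
- The four Validate combinations (healthCheckModel only / cooldownDuration only / both configured / neither configured)
- Validate boundary case for a negative cooldownDuration
End-to-end tests (test/cooldown.go, 17 cases)
✅ Config parsing tests
✅ Response-triggered tests
5.*)
✅ Cooldown recovery tests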
Ⅳ. Describe how to verify it
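Option 1: run the unit tests
Option 2: manual verification
- When sk-exhausted-key returns 429, the log prints: failover: apiToken sk-exhausted-key is unavailable now
- Subsequent requests use sk-valid-key
- After the cooldown elapses, the log prints: cooldown recovery: apiToken sk-exhausted-key has cooled down for xxxms, restoring to available list
Ⅴ. Special notes for reviews
- When cooldownDuration is not configured, behavior is identical to before, so existing configurations are unaffected
- Validate changes from requiring healthCheckModel to requiring at least one of healthCheckModel or cooldownDuration. Users who already configure healthCheckModel are unaffected
- The CAS fix in initApiTokens addresses an existing bug found during testing: after resetSharedData the CAS is no longer 0, which caused subsequent writes to fail
Ⅵ. AI Coding Tool Usage Checklist (if applicable)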
Please check all applicable items:
AI Coding Summary
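Root cause:
- Failover cannot be enabled unless healthCheckModel is configured
Fix:
- Add a cooldownDuration config option that supports time-based passive recovery
- Relax the Validate check to accept either healthCheckModel or cooldownDuration
- Fix the hard-coded CAS bug in initApiTokens
Scope of impact:
- provider/failover.go: core implementation (struct extension, config parsing, cooldown recovery logic, helper functions, bug fix)
- provider/provider_test.go: new unit tests for cooldownDuration parsing and the Validate combinations
- test/cooldown.go: new end-to-end test file (17 cases)
- main_test.go: wires in the Cooldown test entry point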