Skip to content

feat(src/operator): add blueking-apigateway-operator into blueking-ap…#2767

Merged
wklken merged 6 commits into
TencentBlueKing:masterfrom
wklken:ft_project_operator
May 25, 2026
Merged

feat(src/operator): add blueking-apigateway-operator into blueking-ap…#2767
wklken merged 6 commits into
TencentBlueKing:masterfrom
wklken:ft_project_operator

Conversation

@wklken

@wklken wklken commented May 22, 2026

Copy link
Copy Markdown
Collaborator

…igateway project

Description

https://github.com/TencentBlueKing/blueking-apigateway-operator 最新代码合入 src/operator

Fixes # (issue)

Checklist

  • 填写 PR 描述及相关 issue (write PR description and related issue)
  • 代码风格检查通过 (code style check passed)
  • PR 中包含单元测试 (include unit test)
  • 单元测试通过 (unit test passed)
  • 本地开发联调环境验证通过 (local development environment verification passed)

@wklken

wklken commented May 22, 2026

Copy link
Copy Markdown
Collaborator Author

PR Review

Review Scope

变更概览

  • 这次 PR 新增 src/operator,把 TencentBlueKing/blueking-apigateway-operator 当前 HEAD a239893b7e90b1489eb4dbb4f4e5f5265787f995 导入到主仓库。
  • 已把 Go module/import path 从 github.com/TencentBlueKing/blueking-apigateway-operator 改为 operator,并新增 .github/workflows/operator.yaml 让 CI 在 src/operator 下执行。
  • 和外部 operator 仓库做了归一化对比:除 module/import path、CI 路径适配、gci 配置、pprof BasicAuth 修正和新增 server_test.go 外,未看到大范围业务逻辑改写。
  • 变更规模是 131 个文件、约 33073 行新增;以下结论是风险优先 review,不是逐行穷尽检查。

历史评论核对

  • 已修复/不再适用: 无历史评论。
  • 仍然存在: 无历史评论。
  • 无法从当前代码确认: 无。

发现的问题

Blocking

  • 未发现 blocking 问题。

Important

  • [【新】] src/operator/Makefile:81 / src/operator/Dockerfile:8
    • make integration 会先执行 docker-build,而 CI 里该命令的 working directory 是 src/operator;此时 Docker build context 只有 src/operator,不会带上父仓库的 .git
    • 但镜像构建里 RUN make build 会继续执行 src/operator/Makefile:73git describe --tags --abbrev=0git rev-parse HEAD,在没有 .git 的容器里这两个命令会返回空,随后 -X operator/pkg/version.Version= / Commit= 会把默认值覆盖成空串。
    • 这里是否要沿用 src/core-api / src/mcp-proxy 的模式:在 docker build 外层从真实 checkout 计算 VERSION/COMMIT,再通过 build args 传进 Dockerfile?否则 CI 集成测试用的 operator 镜像以及后续发布镜像的 operator version 都会缺少可追踪版本信息。

Suggestion

  • [【新】] .github/workflows/operator.yaml:6

    • 当前 path filter 只包含 src/operator/**。如果后续只改 .github/workflows/operator.yaml,这个 workflow 自己不会触发。
    • 这里是否也把 .github/workflows/operator.yaml 加进 push / pull_request 的 paths,避免 CI 配置变更无法自检?
  • [【新】] src/operator/docs/debug/README.md:10 / src/operator/pkg/utils/schema/validate_test.go:410

    • 我跑了 git diff --check upstream/master..HEAD,当前会因为 debug README 的 trailing whitespace、validate_test.go raw JSON 中的 space-before-tab、以及 tests/integration/docker-compose.yml 的 EOF 空行失败。
    • 这里是否要在合入前清一下这些空白字符,避免后续如果接入 git diff --check 或类似 pre-commit gate 时直接红?

优点

  • src/operator 和外部 operator 仓库当前 HEAD 基本一致,module rename 范围清晰,没有把导入改成一组难追踪的业务重写。
  • pprof 路由从原来的 root router 注册修正为 BasicAuth group 注册,并补了 TestPprofRequiresBasicAuth,这个安全边界比外部仓库当前版本更清楚。
  • operator CI 已经按 monorepo 子目录补了 working-directory 和 Go cache path,方向是对的。

测试与文档建议

  • 已做:git fetch upstream --prune、PR head worktree review、外部 blueking-apigateway-operator HEAD 对比、git diff --stat/name-only/numstatgit diff --check
  • 未跑 make lint / make testsrc/operator/AGENTS.md 要求先 source .envrc 并使用 Go 1.25.5;当前 worktree 没有 .envrc,本机是 go1.25.8,所以我没有冒然跑会受 runtime 影响的检查。

结论

整体看,这个 PR 的导入和 module rename 是可读、可追踪的;建议合入前优先确认 Docker 镜像版本信息注入问题,其他两个是低风险清理项。

[from-codex local repo]

@wklken

wklken commented May 22, 2026

Copy link
Copy Markdown
Collaborator Author

PR Review Follow-up

我又挑了 operator 的关键运行路径看了一轮:启动/配置、leader/agent、registry/store/synchronizer、committer、schema validator。PR head 仍是 7b5d9c1f951d8d637c2eee29dd717f3a935d1b94

发现的问题

Important

  • [【新】] src/operator/pkg/utils/groutine.go:41

    • GoroutineWithRecovery 在 recover 分支里直接调用 sentry.CurrentHub().Client().CaptureMessage(...)。但默认配置和模板里 sentry.dsn 是空的(src/operator/config.yaml.tpl:63-66),cmd/init.go:77-89 也会在 DSN 为空时跳过 sentry.Init
    • 在这种情况下,sentry-go 的 current hub 没有绑定 client,Client() 可能是 nil;一旦被包装的 goroutine panic,recover handler 自己会因为 nil client 再次 panic,达不到“兜底恢复”的目的,反而可能直接把进程打挂。
    • 建议改成 safe wrapper,例如 sentry.CaptureMessage(msg),或者先判断 client := sentry.CurrentHub().Client(); client != nil 后再发。
  • [【新】] src/operator/pkg/core/committer/committer.go:146-150

    • global resource 分支里 wg.Done() 放在 commitGlobalResource 调用之后,且没有 defer。如果 commitGlobalResource 中发生 panic,外层 GoroutineWithRecovery 会 recover,但 wg.Done() 不会执行,commitGroup 会一直卡在 wg.Wait()src/operator/pkg/core/committer/committer.go:161)。
    • stage resource 分支的实际提交 goroutine 已经用 defer wg.Done()src/operator/pkg/core/committer/committer.go:177-183),global resource 这里建议同样在 goroutine 开头 defer wg.Done(),避免一次 global plugin metadata 异常阻塞后续提交。

补充说明

  • schema validator 里 PluginMetadata 起初看起来可疑,但 PluginMetadataConf.UnmarshalJSON 会用 id 把配置转换为 map[pluginName]rawConf,当前 metadata schema 查找路径是对的,我没有把它列为问题。
  • 仍未跑 make lint / make testsrc/operator/AGENTS.md 要求先 source .envrc 并使用 Go 1.25.5;当前 worktree 没有 .envrc,本机 Go 是 1.25.8

[from-codex local repo]

wklken added 3 commits May 24, 2026 09:15
Why this change was needed:
The root AGENTS.md project tree did not list the operator subproject, so agents starting from the repository root could miss the local src/operator guidance.

What changed:
- Added src/operator to the root project structure
- Pointed readers to src/operator/AGENTS.md

Problem solved:
Future repo-level agent work can discover the operator handbook from the root project overview.
Why this change was needed:
CI builds the operator image from src/operator, so the Docker build context does not include the repository .git directory. The previous image build recalculated git metadata inside the builder container and could overwrite version fields with empty strings.

What changed:
- Define VERSION and COMMIT in the operator Makefile from the real checkout
- Pass those values into docker build as build args
- Make the Dockerfile pass VERSION and COMMIT through to make build
- Normalize trailing whitespace in operator debug docs and schema tests

Problem solved:
Operator images built by integration CI and release pipelines keep traceable version and commit metadata even when the Docker context excludes .git, with the related operator files committed together as requested.
@wklken

wklken commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

fixed:

  • [【新】] src/operator/Makefile:81 / src/operator/Dockerfile:8
  • [【新】] src/operator/pkg/utils/groutine.go:41
  • [【新】] src/operator/pkg/core/committer/committer.go:146-150

Why this change was needed:
Recovered goroutine panics are meant to be a last-resort safety path, but two operator paths could still fail after recovery: disabled Sentry left the hub without a client, and a global resource panic could leave commitGroup blocked on its WaitGroup.

What changed:
- Report recovered panic messages through sentry.CaptureMessage so a missing client is handled by sentry-go.
- Defer WaitGroup completion in the global resource commit goroutine.
- Add regression tests for Sentry-disabled recovery and global resource commit panic handling.

Problem solved:
A panic in wrapped operator goroutines no longer risks a secondary recovery panic or a stuck global plugin metadata commit batch.
@wklken

wklken commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

PR Review

Review Scope

变更概览

  • 这次 PR 将 TencentBlueKing/blueking-apigateway-operator 合入主仓库的 src/operator,并补了 monorepo 下的 CI、Docker、AGENTS 说明和若干修复提交。
  • 当前 diff 仍是 133 个文件、约 33k 行新增;以下结论是风险优先 review,不是逐行穷尽检查。
  • 旧 review 提到的 Docker 镜像版本信息、Sentry nil client panic、global commit wg.Done() 三个 Important 问题已在当前 head 修复。

历史评论核对

  • 已修复/不再适用:
    • src/operator/Makefile:83 / src/operator/Dockerfile:7-10: docker-build 已从宿主 checkout 传入 VERSIONCOMMIT build args,容器内 make build -e 不再依赖 .git
    • src/operator/pkg/utils/groutine.go:43-49: panic recovery 已改用 sentry.CaptureMessage,并新增 nil client 测试。
    • src/operator/pkg/core/committer/committer.go:146-150: global resource 分支已 defer wg.Done(),并新增 panic 不阻塞 commitGroup 的测试。
  • 仍然存在:
    • git diff --check upstream/master...HEAD 仍会报 src/operator/docs/debug/README_EN.md trailing whitespace、src/operator/pkg/utils/schema/validate_test.go space-before-tab、src/operator/tests/integration/docker-compose.yml EOF 空行。
    • .github/workflows/operator.yaml 的 path filter 仍未包含 workflow 文件本身。
  • 无法从当前代码确认:
    • 无。

发现的问题

Blocking

  • [【新】] src/operator/pkg/core/registry/apigw.go:281 / src/operator/pkg/core/registry/apigw.go:454
    • 当前 global plugin_metadata 解析写死 len(matches) == 5ValueToGlobalResource 也要求 key segment 数量必须等于 5。
    • 但默认配置和模板的 dashboard.etcd.keyPrefix/bk-gateway-apigw/defaultsrc/operator/pkg/config/config.go:209src/operator/config.yaml.tpl:13),实际 global key 会像 /bk-gateway-apigw/default/v2/global/plugin_metadata/<id>,去掉首 / 后是 6 段。
    • 这样按默认配置发布 global plugin metadata 时,会被解析为非法 key,ListGlobalResources -> commitGlobalResource -> SyncGlobal 走不到 APISIX plugin_metadata 写入,也不会刷新 virtual stage。这里是否应该像 stage resource 一样按尾部结构解析,而不是要求整条 key 正好 5 段,并补一个 keyPrefix 带子路径的回归测试?

Important

  • [【新】] src/operator/pkg/utils/tls.go:42

    • NewClientTLSConfig 加载了 RootCAs 和客户端证书,但同时设置 InsecureSkipVerify: truesrc/operator/pkg/core/runner/common.go:79-81 在 etcd 配了 CA/cert/key 时会使用这个 TLS 配置。
    • 这会跳过 etcd 服务端证书校验,使 CA 配置形同虚设。这里是否应该移除 InsecureSkipVerify,并在需要时通过 ServerName 或 endpoint host 做显式校验?
  • [【新】] src/operator/pkg/client/apisix.go:100

    • 每次 GetReleaseVersion() 都会写包级全局变量 retry.Evaluator,但 src/operator/pkg/eventreporter/reporter.go:217 允许多个 version probe 并发执行。
    • gentleman-retry 在插件安装时读取这个全局 evaluator;并发探测会互相覆盖闭包里的 gateway/stage/publishID/resp 指针,可能按错误 publish_id 判断探测结果,也会触发 data race。这里是否应该改成每个请求独立 evaluator,或避免并发写这个全局变量?
  • [【新】] src/operator/pkg/core/registry/apigw.go:139

    • control-plane watch 遇到 ErrCompacted / ErrFutureRev 时只记录 “need full sync to recover” 后直接 return;defer 会把 currentRevision 清零并关闭通道。
    • 上层 src/operator/pkg/core/agent/agent.go:85-93 对所有 watch channel close 都只是重建 watch,没有执行全量同步。compaction gap 中的控制面变更会被跳过,APISIX 侧可能一直停留在旧配置,直到同 stage 后续再次触发事件。这里是否应该在 compacted/future revision 分支触发 full sync,或让 agent/runner 进入可观测的失败恢复路径?
  • [【新】] src/operator/pkg/eventreporter/reporter.go:297

    • addEvent() 同步写 reporter.eventChain,committer 主路径会在 parse/apply 阶段直接调用这些上报函数(例如 src/operator/pkg/core/committer/committer.go:219233234248)。
    • reporter worker 调 CoreAPI 时使用 context.TODO()src/operator/pkg/client/core_api.go:78-85 也没有把 ctx 或 timeout 传给 HTTP 请求。CoreAPI 连接挂住时,reportChain 填满后 Start() 不再 drain eventChain,最终会把 committer 卡在事件上报发送处,阻塞后续 APISIX 同步。这里是否应该给上报链路加有界 timeout/context,并避免发布主路径被事件上报硬阻塞?
  • [【新】] src/operator/pkg/config/config.go:298 / src/operator/config.yaml.tpl:1

    • 模板默认 debug: true,而配置初始化会 json.Marshal(c) 并把完整配置打印到 stdout;同一个模板里包含 etcd password、CoreAPI instance secret、trace token、HTTP BasicAuth password。
    • 按模板启动会把敏感配置写进容器日志。这里是否应该把模板默认改为 debug: false,并且即使 debug 打印也要对 password/secret/token 做脱敏?
  • [【新】] src/operator/pkg/config/config.go:205 / src/operator/config.yaml.tpl:45-50

    • 默认配置和模板都使用公开固定的 httpServer.authPassword: DebugModel@bk,且模板默认监听 0.0.0.0[::]
    • /debug/pprof/v1/open/* 只靠 BasicAuth 保护(src/operator/pkg/server/server.go:93-97src/operator/pkg/server/router.go:50-55),open API 能返回 APISIX/网关资源,SSL 结构里还包含私钥字段。这里是否应该要求显式配置非默认强密码,或至少让模板默认只监听本地/集群内地址?
  • [【新】] src/operator/pkg/entity/entity.go:443

    • ResourceMetadata.Labels 是指针字段,但 IsEmpty() 直接访问 rm.Labels.Gateway / rm.Labels.Stage;watch 热路径 src/operator/pkg/core/agent/agent.go:149 会先调用 event.IsEmpty()
    • control-plane etcd 中如果出现缺少 labels 的 route/service JSON,watch 会解析出 Labels == nil 后 panic,而不是把坏事件作为解析/校验错误处理。这里是否应该给 IsEmpty()GetReleaseID()GetReleaseInfo() 等热路径方法统一加 nil labels guard,并在 APIGW registry 边界校验 labels/id/apisix-version 与 key path 一致?

Suggestion

  • [【新】] src/operator/pkg/apis/open/serializer/common.go:27 / src/operator/pkg/client/resource_slz.go:29

    • debug/list API 的响应结构把 SSL 字段定义成 json:"ssl,omitempty",但真实实体是 ApisixStageResource.SSLs json:"ssls,omitempty"。handler 通过 marshal/unmarshal 转换时会丢掉 ssls 字段,list-apigw / list-apisix 看不到 SSL 资源。是否要统一成 SSLs ... json:"ssls,omitempty" 并补一个包含 SSL 的序列化测试?
  • [【已提及】] .github/workflows/operator.yaml:6 / .github/workflows/operator.yaml:9

    • 当前 path filter 只包含 src/operator/**。如果后续只改 .github/workflows/operator.yaml,这个 workflow 自己不会触发。是否要把 workflow 文件本身也加入 push / pull_request 的 paths?
  • [【已提及】] src/operator/docs/debug/README_EN.md:10 / src/operator/pkg/utils/schema/validate_test.go:410

    • git diff --check upstream/master...HEAD 仍会因为 trailing whitespace、space-before-tab、EOF 空行失败。这个不需要 blocking,但合入前顺手清理会降低后续接入 diff/check hooks 的噪音。

优点

  • 旧 review 的三个较高风险问题都有定向修复,并补了相关测试。
  • pprof 已挂在 BasicAuth group 下,当前 go test ./pkg/server 覆盖了无认证 401、有认证 200。
  • operator 子项目的 AGENTS 说明把 watch/commit/sync 关键路径写清楚了,后续维护成本比纯代码导入低。

测试与文档建议

  • 已跑并通过:go test ./pkg/utils ./pkg/core/committer
  • 已跑并通过:go test ./pkg/server ./pkg/apis/open/...
  • 已跑并通过:go test ./pkg/utils/schema ./pkg/core/differ ./pkg/entity
  • 已跑并通过:go test ./pkg/core/registry ./pkg/client
  • 已跑并通过:go test ./pkg/core/agent ./pkg/eventreporter
  • 已跑并通过:go test ./pkg/core/store ./pkg/core/synchronizer
  • 已跑并通过:go test ./pkg/leaderelection
  • 已跑但未作为通过结论:go test $(go list ./... | grep -v '/tests/integration')。这条因为多个 package 并行时 operator/tests/util.StartEmbedEtcdClient 固定监听 localhost:1234/2345,和 pkg/leaderelection 的 embed etcd 测试抢端口而失败;单独跑 pkg/leaderelection 可通过。
  • 已跑并失败:git diff --check upstream/master...HEAD,失败原因是上面 Suggestion 里的空白问题。
  • 未跑 make lint / make test:本 worktree 没有 src/operator/bin/ginkgomake lint 还会带 --fix 修改工作区;我用 go test 分组覆盖了主要 package。

结论

建议先处理 Blocking 的 global resource key parsing 问题,再处理 TLS 校验、debug/默认密码、version probe 并发和 event reporter 背压这些 Important 项。旧 review 的三个历史 Important 已经修复,但当前 head 仍不建议直接合入。

[from-codex local repo]

@wklken

wklken commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

处理:

[【新】] src/operator/pkg/client/apisix.go:100

[【新】] src/operator/pkg/core/registry/apigw.go:281 / src/operator/pkg/core/registry/apigw.go:454

[【新】] src/operator/pkg/config/config.go:298 / src/operator/config.yaml.tpl:1

[【新】] src/operator/pkg/config/config.go:205 / src/operator/config.yaml.tpl:45-50

[【新】] src/operator/pkg/apis/open/serializer/common.go:27 / src/operator/pkg/client/resource_slz.go:29


剩余的已经录单,下个小版本处理

@wklken wklken requested a review from cszmzh May 25, 2026 03:58
Why this change was needed:
Operator review found a few debug/runtime correctness gaps: local debug defaults drifted from the deployed dashboard prefix, version probes mutated a package-level retry evaluator while probes can run concurrently, and debug resource list DTOs used ssl instead of the ssls field emitted by the real APISIX resource entity.

What changed:
- Updated operator debug defaults for dashboard etcd prefix and loopback HTTP binding.
- Reworked APISIX release-version probing to use request-local retry evaluation instead of the global gentleman-retry evaluator.
- Aligned debug list serializers and CLI output with the ssls field, with regression tests for SSL resources.
- Cleaned up debug command documentation and integration compose formatting.

Problem solved:
Debug/list APIs now preserve SSL resources, concurrent version probes no longer race through global retry state, and local operator defaults better match the intended debug deployment shape.
@wklken wklken merged commit bd366c9 into TencentBlueKing:master May 25, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants