Conversation
👋 Review Summary
Nice work on this feature — the leader discovery design is clean, the design doc is thorough, and the test matrix covering unit, gRPC stub, and multi-node integration scenarios is solid. A few things caught my eye that are worth discussing before merging.
🛡️ Key Risks & Issues
1. TOCTOU race between leader_node_id and leader_endpoint (inline comment)
Both GetClusterInfo and GetManagerClusterInfo read GetLeaderNodeID() once to set the response field, then call GetLeaderNodeInfo() which reads it again internally. If the leader changes between these two reads, the response will contain an inconsistent pair — leader_node_id pointing to the old leader but leader_endpoint containing the new leader's connection info. This could confuse clients that validate consistency between the two fields.
2. Silent failure of SetSelfNodeInfo at startup (inline comment)
If the coordination backend write fails during CreateLeaderElector, the node proceeds normally. If it later becomes leader, all cluster-wide GetClusterInfo calls will return INTERNAL_ERROR because no peer can look up its endpoint. This breaks the leader discovery contract silently.
3. Hard error on transient endpoint lookup failure (inline comment)
Both discovery APIs return INTERNAL_ERROR when endpoint lookup fails, discarding the already-available leader_node_id. For an API designed to be retried, returning OK with a partial response (omitting the optional leader_endpoint) would be more resilient.
4. Duplicate NodeEndpoint protobuf definition
NodeEndpoint is defined identically in both meta_service.proto (lines 286-293) and admin_service.proto (lines 449-456). These compile to distinct C++ types (proto::meta::NodeEndpoint vs proto::admin::NodeEndpoint), requiring duplicate conversion code in both service implementations. If a field is later added to one but not the other, the APIs will silently diverge. Consider extracting it to a shared common_types.proto.
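Extracted, the shared definition might look roughly like this (the file name, package, and field numbers below are placeholders, not taken from the patch; the field set follows the endpoint fields listed in the PR summary):

```proto
// common_types.proto (name illustrative) — single source of truth,
// imported by both meta_service.proto and admin_service.proto so the
// two services can never silently diverge.
syntax = "proto3";
package proto.common;

message NodeEndpoint {
  string host = 1;
  int32 meta_rpc_port = 2;
  int32 meta_http_port = 3;
  int32 admin_rpc_port = 4;
  int32 admin_http_port = 5;
}
```

Each service's .proto would then `import "common_types.proto";` and reference `proto.common.NodeEndpoint`, and the duplicate C++ conversion code collapses to one helper.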
🧪 Verification Advice
- `NetUtil::GetLocalIp()` has no unit test coverage. Since it's the default code path when `advertised_host` is not configured, consider adding at least a basic test (returns a valid IPv4, doesn't return 127.0.0.1 on non-loopback systems). The integration tests sidestep it via `advertised_host=127.0.0.1`.
- The "no leader elected yet" scenario (empty `leader_node_id` during cluster startup or failover) isn't covered by any test. A `GrpcStubTest` with a `LeaderElector` that hasn't won election yet would verify the response shape in that transient state.
- `NodeEndpointInfo` deserialization with partial/corrupt JSON isn't tested. Given this data comes from a distributed KV store (Redis), partial writes are possible.
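The two properties such a basic test would assert can be captured in a small standalone helper. A minimal sketch — the name `IsUsableAdvertisedIp` is illustrative, not from the patch:

```cpp
#include <arpa/inet.h>
#include <string>

// Illustrative helper: the two properties a basic GetLocalIp() test
// would check — the string parses as IPv4 and is not a loopback address.
bool IsUsableAdvertisedIp(const std::string& ip) {
  in_addr addr{};
  if (inet_pton(AF_INET, ip.c_str(), &addr) != 1) {
    return false;  // not a well-formed IPv4 address
  }
  // 127.0.0.0/8 is the loopback range; s_addr is in network byte order.
  return (ntohl(addr.s_addr) >> 24) != 127;
}
```

A unit test for `GetLocalIp()` would then assert `IsUsableAdvertisedIp(NetUtil::GetLocalIp())` on a host with a real NIC.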
💡 Thoughts & Suggestions
- The overall architecture — node registration at startup, KV-based endpoint storage, local-cache optimization for self-is-leader — is well designed and practical.
- The `advertised_host` config is the right escape hatch for multi-NIC environments. Worth documenting prominently that it should be set in any multi-node deployment (the design doc mentions it but could be more emphatic about the `GetLocalIp()` non-determinism risk).
- The design doc's "待完善功能" (features to be completed) section correctly identifies the lack of node-info TTL/cleanup. Worth tracking as a follow-up, since stale entries could accumulate in the coordination backend after node restarts with new node_ids.
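One possible shape for that follow-up, assuming the Redis-like backend supports per-key expiry: register the node info with a TTL and refresh it periodically (e.g. from a heartbeat), so records for dead node_ids age out on their own. All names below are illustrative:

```cpp
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

// Toy KV store with per-key expiry, mimicking Redis "SET ... EX" semantics.
struct NodeInfoStore {
  struct Entry { std::string info; Clock::time_point expires_at; };
  std::map<std::string, Entry> entries;

  // Write (or refresh) a node's endpoint info with a TTL. Called once at
  // startup and then re-called periodically from the node's heartbeat loop.
  void PutWithTtl(const std::string& node_id, const std::string& info,
                  std::chrono::seconds ttl, Clock::time_point now) {
    entries[node_id] = {info, now + ttl};
  }

  // Lookup treats expired entries as missing — a node that restarted with
  // a new node_id leaves no permanently stale record behind.
  bool Get(const std::string& node_id, Clock::time_point now,
           std::string& out) const {
    auto it = entries.find(node_id);
    if (it == entries.end() || now >= it->second.expires_at) return false;
    out = it->second.info;
    return true;
  }
};
```

The trade-off is that the TTL must comfortably exceed the refresh interval, or a slow heartbeat would make a live leader temporarily undiscoverable.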
🤖 Generated by Qoder
```cpp
std::string leader_node_id = leader_elector_->GetLeaderNodeID();

response->set_self_node_id(self_node_id);
response->set_leader_node_id(leader_node_id);

// Try to read the leader's node connection info
if (!leader_node_id.empty()) {
    NodeEndpointInfo node_info;
    ErrorCode ec = leader_elector_->GetLeaderNodeInfo(node_info);
```
There's a TOCTOU race between the two reads of leader_node_id. At line 618, GetLeaderNodeID() is called to populate response->leader_node_id. Then at line 626, GetLeaderNodeInfo() internally calls GetLeaderNodeID() again (leader_elector.cc:584). If the leader changes between these two reads, the response will contain leader_node_id = "nodeA" but leader_endpoint pointing to nodeB's host and ports.
A client receiving this inconsistent response would connect to nodeB's address but believe it's talking to nodeA, which could break reconnection logic or consistency checks.
The same pattern exists in AdminServiceImpl::GetManagerClusterInfo (lines 846-856).
Consider passing the already-obtained leader_node_id into the lookup rather than re-reading it internally — something like GetNodeInfo(leader_node_id, node_info) directly, with the self-cache optimization handled by comparing leader_node_id == GetSelfNodeID() at the call site.
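The single-read pattern could be sketched like this. The `GetNodeInfo(node_id, out)` signature is the suggested change, not the current API, and the elector below is a toy stand-in, not the real `LeaderElector`:

```cpp
#include <map>
#include <string>
#include <utility>

// Toy stand-in for LeaderElector, just to illustrate the call-site pattern.
struct Elector {
  std::string leader_id;
  std::string self_id;
  std::map<std::string, std::string> kv;  // node_id -> endpoint info
  std::string self_info;                  // local cache of own endpoint

  std::string GetLeaderNodeID() const { return leader_id; }
  std::string GetSelfNodeID() const { return self_id; }

  // Suggested API: look up by an explicit node_id instead of re-reading the
  // leader id internally, so the id/endpoint pair stays consistent.
  bool GetNodeInfo(const std::string& node_id, std::string& out) const {
    if (node_id == self_id) { out = self_info; return true; }  // self-cache
    auto it = kv.find(node_id);
    if (it == kv.end()) return false;
    out = it->second;
    return true;
  }
};

// Call site: read the leader id exactly once and reuse it everywhere.
std::pair<std::string, std::string> BuildClusterInfo(const Elector& e) {
  const std::string leader_node_id = e.GetLeaderNodeID();  // single read
  std::string endpoint;
  if (!leader_node_id.empty() && e.GetNodeInfo(leader_node_id, endpoint)) {
    return {leader_node_id, endpoint};  // consistent by construction
  }
  return {leader_node_id, ""};
}
```

Even if the leader changes mid-call, the response now describes one coherent snapshot: the endpoint always belongs to the id that was read.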
kv_cache_manager/service/server.cc
```cpp
ErrorCode ec = leader_elector_->SetSelfNodeInfo(node_info);
if (ec != EC_OK) {
    KVCM_LOG_WARN("failed to write node info for node_id[%s], ec=%d", node_id.c_str(), ec);
```
If SetSelfNodeInfo fails here, the node proceeds normally (WARN log only, CreateLeaderElector returns true). If this node later wins the election and becomes leader, no peer can look up its endpoint info — every GetClusterInfo and GetManagerClusterInfo call across the cluster will return INTERNAL_ERROR ("leader exists but node info lookup failed").
This effectively breaks the entire leader discovery feature silently. The node appears healthy and serves as leader, but no client can discover how to connect to it.
Should this be treated as a fatal startup error (return false) instead? The coordination backend write is a prerequisite for the leader discovery contract to work.
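The fail-fast version would look roughly like this. `SetSelfNodeInfo` and the logging macro are stubbed out here; only the control flow is the point:

```cpp
#include <cstdio>
#include <string>

enum ErrorCode { EC_OK = 0, EC_BACKEND_WRITE_FAILED = 1 };

// Stub standing in for the coordination-backend registration write.
ErrorCode SetSelfNodeInfo(bool backend_healthy) {
  return backend_healthy ? EC_OK : EC_BACKEND_WRITE_FAILED;
}

// Sketch of CreateLeaderElector's tail: a failed registration write becomes
// a fatal startup error instead of a WARN-and-continue.
bool CreateLeaderElector(const std::string& node_id, bool backend_healthy) {
  ErrorCode ec = SetSelfNodeInfo(backend_healthy);
  if (ec != EC_OK) {
    // Without this record, no peer can resolve our endpoint if we later win
    // the election — refuse to start rather than break discovery silently.
    std::fprintf(stderr, "failed to write node info for node_id[%s], ec=%d\n",
                 node_id.c_str(), ec);
    return false;  // fatal: the discovery contract cannot be honored
  }
  return true;
}
```

An alternative middle ground would be retrying the write with backoff before giving up, but proceeding with only a WARN leaves the cluster in the broken state described above.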
```cpp
} else {
    KVCM_LOG_WARN("[traceId: %s] GetClusterInfo: failed to read node info for leader %s, ec=%d",
                  request->trace_id().c_str(),
                  leader_node_id.c_str(),
                  ec);
    status->set_code(proto::meta::INTERNAL_ERROR);
    status->set_message("Leader endpoint unavailable: leader exists but node info lookup failed");
    request_context->set_status_code(status->code());
    SET_SPAN_TRACER_STR_IN_HEADER(request_context);
    return;
```
When the endpoint lookup fails (e.g., a transient Redis timeout), this returns INTERNAL_ERROR and discards the already-obtained leader_node_id and self_node_id. For a discovery API that clients are expected to retry (per section 5.4 of the design doc), this seems unnecessarily harsh.
Since leader_endpoint is an optional protobuf field, an alternative would be to return OK with leader_node_id populated but leader_endpoint omitted, letting the client retry to get the full info. This way a transient backend hiccup doesn't completely block discovery — the client at least knows who the leader is.
The same pattern applies in AdminServiceImpl::GetManagerClusterInfo (lines 865-874).
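A degraded-response fallback could look roughly like this. Plain structs stand in for the protobuf messages; the field names mirror this review, not the actual generated code:

```cpp
#include <optional>
#include <string>

// Plain stand-ins for the protobuf response types.
struct Endpoint { std::string host; int meta_rpc_port = 0; };
struct ClusterInfoResponse {
  int code = 0;                             // 0 = OK
  std::string self_node_id;
  std::string leader_node_id;
  std::optional<Endpoint> leader_endpoint;  // optional proto field
};

// Sketch: on a transient lookup failure, keep the ids we already have and
// simply omit the optional endpoint — the client learns who the leader is
// and can retry for the full info, instead of getting INTERNAL_ERROR.
ClusterInfoResponse BuildResponse(const std::string& self_id,
                                  const std::string& leader_id,
                                  const std::optional<Endpoint>& lookup) {
  ClusterInfoResponse resp;
  resp.self_node_id = self_id;
  resp.leader_node_id = leader_id;
  if (lookup) {
    resp.leader_endpoint = *lookup;  // full response on successful lookup
  }
  // lookup failed: code stays OK, leader_endpoint stays unset
  return resp;
}
```

A WARN log at the omission point would still preserve observability of backend hiccups.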
```proto
message GetClusterInfoRequest {
  string trace_id = 1;
  string instance_id = 2;
```
The `GetClusterInfoRequest.instance_id` field is defined but never used in the implementation; consider removing it?
My design suggests that clients should pass the `instance_id` whenever possible. However, in scenarios where an `instance_id` is not available, it's acceptable to omit it or pass an empty value.

Adding `instance_id` has two motivations:

- Primarily alignment at the interface level. There's a macro `API_CONTEXT_GET_COLLECTOR_AND_INIT_GRPC/HTTP` that relies on `instance_id` to gather metrics.
- For the future: using the `instance_id` passed by the client, we can seamlessly route requests from different instances to different clusters, providing cluster services with no visible difference.

However, your review reminded me of what to do if a user passes a non-existent `instance_id`. The `instance_id` might not exist before registration; I'll refine this part.
LGTM
Force-pushed: 3d64796 → c416310 → 2cd6c5a
Summary

- Add a `GetClusterInfo` API to MetaService and enhance `GetManagerClusterInfo` in AdminService with a `leader_endpoint` field, allowing any node (leader or follower) to return the current leader's connection endpoint (`host`, `meta_rpc_port`, `meta_http_port`, `admin_rpc_port`, `admin_http_port`). Clients can query any node to discover and directly connect to the leader.
- `NodeEndpointInfo` and node registration: each node writes its endpoint information to the CoordinationBackend KV store at startup. A new `NodeEndpointInfo` class handles serialization/deserialization. Leader endpoint lookups use a local-cache optimization when the queried node is itself the leader.
- `NetUtil` and `advertised_host` config: introduce `NetUtil::GetLocalIp()` to auto-detect the node's non-loopback IPv4 address, and add a `kvcm.service.advertised_host` config option to explicitly set the externally reachable address (takes precedence over auto-detection).
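The client-side discovery flow this enables can be sketched as follows. The types and the `query` callback below are toy stand-ins; a real client would use the generated gRPC stubs for `GetClusterInfo`:

```cpp
#include <optional>
#include <string>
#include <vector>

struct LeaderInfo { std::string node_id; std::string host; int rpc_port = 0; };

// Toy stand-in for issuing GetClusterInfo against one node address.
using QueryFn = std::optional<LeaderInfo> (*)(const std::string& node_addr);

// Discovery loop: ask any reachable seed node; every node (leader or
// follower) can answer with the leader's endpoint, so the client can then
// connect directly to the leader.
std::optional<LeaderInfo> DiscoverLeader(const std::vector<std::string>& seeds,
                                         QueryFn query) {
  for (const auto& addr : seeds) {
    if (auto info = query(addr)) return info;  // first answer wins
  }
  return std::nullopt;  // no node reachable / no leader elected yet
}
```

This is why the feature only needs any one node to be reachable, rather than requiring clients to know the leader in advance.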
Test plan

- Unit tests: `LeaderElectorTest` — `SetSelfNodeInfoAndGetNodeInfo`, `ReadNonExistentNodeInfo`, `LeaderDiscoveryEndToEnd`, `LeaderDiscoveryTwoNodes`, `NodeInfoOverwrite`
- Unit tests: `GrpcStubTest` — `TestGetClusterInfoNoLeaderElector`, `TestGetClusterInfoWithLeaderElector`
- Integration tests: `MetaServiceTestBase::test_get_cluster_info` (gRPC + HTTP)
- Integration tests: `AdminServiceTestBase::test_get_manager_cluster_info` — enhanced with `leader_endpoint` validation
- Integration tests: `AdminServiceLeaderElectionTest` — cross-node leader endpoint discovery with `advertised_host=127.0.0.1`

🤖 Generated with [Qoder](https://qoder.com)