Conversation
👋 Review Summary
Nice work on this feature — the leader discovery design is clean, the design doc is thorough, and the test matrix covering unit, gRPC stub, and multi-node integration scenarios is solid. A few things caught my eye that are worth discussing before merging.
🛡️ Key Risks & Issues
1. TOCTOU race between leader_node_id and leader_endpoint (inline comment)
Both GetClusterInfo and GetManagerClusterInfo read GetLeaderNodeID() once to set the response field, then call GetLeaderNodeInfo() which reads it again internally. If the leader changes between these two reads, the response will contain an inconsistent pair — leader_node_id pointing to the old leader but leader_endpoint containing the new leader's connection info. This could confuse clients that validate consistency between the two fields.
2. Silent failure of SetSelfNodeInfo at startup (inline comment)
If the coordination backend write fails during CreateLeaderElector, the node proceeds normally. If it later becomes leader, all cluster-wide GetClusterInfo calls will return INTERNAL_ERROR because no peer can look up its endpoint. This breaks the leader discovery contract silently.
3. Hard error on transient endpoint lookup failure (inline comment)
Both discovery APIs return INTERNAL_ERROR when endpoint lookup fails, discarding the already-available leader_node_id. For an API designed to be retried, returning OK with a partial response (omitting the optional leader_endpoint) would be more resilient.
4. Duplicate NodeEndpoint protobuf definition
NodeEndpoint is defined identically in both meta_service.proto (lines 286-293) and admin_service.proto (lines 449-456). These compile to distinct C++ types (proto::meta::NodeEndpoint vs proto::admin::NodeEndpoint), requiring duplicate conversion code in both service implementations. If a field is later added to one but not the other, the APIs will silently diverge. Consider extracting it to a shared common_types.proto.
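Extracted, the shared definition might look roughly like this (the file name, package, and field numbers below are placeholders, not taken from the patch; the field set follows the endpoint fields listed in the PR summary):

```proto
// common_types.proto (name illustrative) — single source of truth,
// imported by both meta_service.proto and admin_service.proto so the
// two services can never silently diverge.
syntax = "proto3";
package proto.common;

message NodeEndpoint {
  string host = 1;
  int32 meta_rpc_port = 2;
  int32 meta_http_port = 3;
  int32 admin_rpc_port = 4;
  int32 admin_http_port = 5;
}
```

Each service's .proto would then `import "common_types.proto";` and reference `proto.common.NodeEndpoint`, and the duplicate C++ conversion code collapses to one helper.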
🧪 Verification Advice
- `NetUtil::GetLocalIp()` has no unit test coverage. Since it's the default code path when `advertised_host` is not configured, consider adding at least a basic test (returns a valid IPv4, doesn't return 127.0.0.1 on non-loopback systems). The integration tests sidestep it via `advertised_host=127.0.0.1`.
- The "no leader elected yet" scenario (empty `leader_node_id` during cluster startup or failover) isn't covered by any test. A `GrpcStubTest` with a `LeaderElector` that hasn't won election yet would verify the response shape in that transient state.
- `NodeEndpointInfo` deserialization with partial/corrupt JSON isn't tested. Given this data comes from a distributed KV store (Redis), partial writes are possible.
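The two properties such a basic test would assert can be captured in a small standalone helper. A minimal sketch — the name `IsUsableAdvertisedIp` is illustrative, not from the patch:

```cpp
#include <arpa/inet.h>
#include <string>

// Illustrative helper: the two properties a basic GetLocalIp() test
// would check — the string parses as IPv4 and is not a loopback address.
bool IsUsableAdvertisedIp(const std::string& ip) {
  in_addr addr{};
  if (inet_pton(AF_INET, ip.c_str(), &addr) != 1) {
    return false;  // not a well-formed IPv4 address
  }
  // 127.0.0.0/8 is the loopback range; s_addr is in network byte order.
  return (ntohl(addr.s_addr) >> 24) != 127;
}
```

A unit test for `GetLocalIp()` would then assert `IsUsableAdvertisedIp(NetUtil::GetLocalIp())` on a host with a real NIC.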
💡 Thoughts & Suggestions
- The overall architecture — node registration at startup, KV-based endpoint storage, local-cache optimization for self-is-leader — is well designed and practical.
- The `advertised_host` config is the right escape hatch for multi-NIC environments. Worth documenting prominently that it should be set in any multi-node deployment (the design doc mentions it but could be more emphatic about the `GetLocalIp()` non-determinism risk).
- The design doc's "待完善功能" (features to be completed) section correctly identifies the lack of node-info TTL/cleanup. Worth tracking as a follow-up, since stale entries could accumulate in the coordination backend after node restarts with new node_ids.
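One possible shape for that follow-up, assuming the Redis-like backend supports per-key expiry: register the node info with a TTL and refresh it periodically (e.g. from a heartbeat), so records for dead node_ids age out on their own. All names below are illustrative:

```cpp
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

// Toy KV store with per-key expiry, mimicking Redis "SET ... EX" semantics.
struct NodeInfoStore {
  struct Entry { std::string info; Clock::time_point expires_at; };
  std::map<std::string, Entry> entries;

  // Write (or refresh) a node's endpoint info with a TTL. Called once at
  // startup and then re-called periodically from the node's heartbeat loop.
  void PutWithTtl(const std::string& node_id, const std::string& info,
                  std::chrono::seconds ttl, Clock::time_point now) {
    entries[node_id] = {info, now + ttl};
  }

  // Lookup treats expired entries as missing — a node that restarted with
  // a new node_id leaves no permanently stale record behind.
  bool Get(const std::string& node_id, Clock::time_point now,
           std::string& out) const {
    auto it = entries.find(node_id);
    if (it == entries.end() || now >= it->second.expires_at) return false;
    out = it->second.info;
    return true;
  }
};
```

The trade-off is that the TTL must comfortably exceed the refresh interval, or a slow heartbeat would make a live leader temporarily undiscoverable.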
🤖 Generated by Qoder
```cpp
std::string leader_node_id = leader_elector_->GetLeaderNodeID();

response->set_self_node_id(self_node_id);
response->set_leader_node_id(leader_node_id);

// Try to read the leader's node connection info
if (!leader_node_id.empty()) {
    NodeEndpointInfo node_info;
    ErrorCode ec = leader_elector_->GetLeaderNodeInfo(node_info);
```
There's a TOCTOU race between the two reads of leader_node_id. At line 618, GetLeaderNodeID() is called to populate response->leader_node_id. Then at line 626, GetLeaderNodeInfo() internally calls GetLeaderNodeID() again (leader_elector.cc:584). If the leader changes between these two reads, the response will contain leader_node_id = "nodeA" but leader_endpoint pointing to nodeB's host and ports.
A client receiving this inconsistent response would connect to nodeB's address but believe it's talking to nodeA, which could break reconnection logic or consistency checks.
The same pattern exists in AdminServiceImpl::GetManagerClusterInfo (lines 846-856).
Consider passing the already-obtained leader_node_id into the lookup rather than re-reading it internally — something like GetNodeInfo(leader_node_id, node_info) directly, with the self-cache optimization handled by comparing leader_node_id == GetSelfNodeID() at the call site.
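The single-read pattern could be sketched like this. The `GetNodeInfo(node_id, out)` signature is the suggested change, not the current API, and the elector below is a toy stand-in, not the real `LeaderElector`:

```cpp
#include <map>
#include <string>
#include <utility>

// Toy stand-in for LeaderElector, just to illustrate the call-site pattern.
struct Elector {
  std::string leader_id;
  std::string self_id;
  std::map<std::string, std::string> kv;  // node_id -> endpoint info
  std::string self_info;                  // local cache of own endpoint

  std::string GetLeaderNodeID() const { return leader_id; }
  std::string GetSelfNodeID() const { return self_id; }

  // Suggested API: look up by an explicit node_id instead of re-reading the
  // leader id internally, so the id/endpoint pair stays consistent.
  bool GetNodeInfo(const std::string& node_id, std::string& out) const {
    if (node_id == self_id) { out = self_info; return true; }  // self-cache
    auto it = kv.find(node_id);
    if (it == kv.end()) return false;
    out = it->second;
    return true;
  }
};

// Call site: read the leader id exactly once and reuse it everywhere.
std::pair<std::string, std::string> BuildClusterInfo(const Elector& e) {
  const std::string leader_node_id = e.GetLeaderNodeID();  // single read
  std::string endpoint;
  if (!leader_node_id.empty() && e.GetNodeInfo(leader_node_id, endpoint)) {
    return {leader_node_id, endpoint};  // consistent by construction
  }
  return {leader_node_id, ""};
}
```

Even if the leader changes mid-call, the response now describes one coherent snapshot: the endpoint always belongs to the id that was read.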
kv_cache_manager/service/server.cc
```cpp
ErrorCode ec = leader_elector_->SetSelfNodeInfo(node_info);
if (ec != EC_OK) {
    KVCM_LOG_WARN("failed to write node info for node_id[%s], ec=%d", node_id.c_str(), ec);
```
If SetSelfNodeInfo fails here, the node proceeds normally (WARN log only, CreateLeaderElector returns true). If this node later wins the election and becomes leader, no peer can look up its endpoint info — every GetClusterInfo and GetManagerClusterInfo call across the cluster will return INTERNAL_ERROR ("leader exists but node info lookup failed").
This effectively breaks the entire leader discovery feature silently. The node appears healthy and serves as leader, but no client can discover how to connect to it.
Should this be treated as a fatal startup error (return false) instead? The coordination backend write is a prerequisite for the leader discovery contract to work.
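The fail-fast version would look roughly like this. `SetSelfNodeInfo` and the logging macro are stubbed out here; only the control flow is the point:

```cpp
#include <cstdio>
#include <string>

enum ErrorCode { EC_OK = 0, EC_BACKEND_WRITE_FAILED = 1 };

// Stub standing in for the coordination-backend registration write.
ErrorCode SetSelfNodeInfo(bool backend_healthy) {
  return backend_healthy ? EC_OK : EC_BACKEND_WRITE_FAILED;
}

// Sketch of CreateLeaderElector's tail: a failed registration write becomes
// a fatal startup error instead of a WARN-and-continue.
bool CreateLeaderElector(const std::string& node_id, bool backend_healthy) {
  ErrorCode ec = SetSelfNodeInfo(backend_healthy);
  if (ec != EC_OK) {
    // Without this record, no peer can resolve our endpoint if we later win
    // the election — refuse to start rather than break discovery silently.
    std::fprintf(stderr, "failed to write node info for node_id[%s], ec=%d\n",
                 node_id.c_str(), ec);
    return false;  // fatal: the discovery contract cannot be honored
  }
  return true;
}
```

An alternative middle ground would be retrying the write with backoff before giving up, but proceeding with only a WARN leaves the cluster in the broken state described above.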
```cpp
} else {
    KVCM_LOG_WARN("[traceId: %s] GetClusterInfo: failed to read node info for leader %s, ec=%d",
                  request->trace_id().c_str(),
                  leader_node_id.c_str(),
                  ec);
    status->set_code(proto::meta::INTERNAL_ERROR);
    status->set_message("Leader endpoint unavailable: leader exists but node info lookup failed");
    request_context->set_status_code(status->code());
    SET_SPAN_TRACER_STR_IN_HEADER(request_context);
    return;
```
When the endpoint lookup fails (e.g., a transient Redis timeout), this returns INTERNAL_ERROR and discards the already-obtained leader_node_id and self_node_id. For a discovery API that clients are expected to retry (per section 5.4 of the design doc), this seems unnecessarily harsh.
Since leader_endpoint is an optional protobuf field, an alternative would be to return OK with leader_node_id populated but leader_endpoint omitted, letting the client retry to get the full info. This way a transient backend hiccup doesn't completely block discovery — the client at least knows who the leader is.
The same pattern applies in AdminServiceImpl::GetManagerClusterInfo (lines 865-874).
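A degraded-response fallback could look roughly like this. Plain structs stand in for the protobuf messages; the field names mirror this review, not the actual generated code:

```cpp
#include <optional>
#include <string>

// Plain stand-ins for the protobuf response types.
struct Endpoint { std::string host; int meta_rpc_port = 0; };
struct ClusterInfoResponse {
  int code = 0;                             // 0 = OK
  std::string self_node_id;
  std::string leader_node_id;
  std::optional<Endpoint> leader_endpoint;  // optional proto field
};

// Sketch: on a transient lookup failure, keep the ids we already have and
// simply omit the optional endpoint — the client learns who the leader is
// and can retry for the full info, instead of getting INTERNAL_ERROR.
ClusterInfoResponse BuildResponse(const std::string& self_id,
                                  const std::string& leader_id,
                                  const std::optional<Endpoint>& lookup) {
  ClusterInfoResponse resp;
  resp.self_node_id = self_id;
  resp.leader_node_id = leader_id;
  if (lookup) {
    resp.leader_endpoint = *lookup;  // full response on successful lookup
  }
  // lookup failed: code stays OK, leader_endpoint stays unset
  return resp;
}
```

A WARN log at the omission point would still preserve observability of backend hiccups.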
```proto
message GetClusterInfoRequest {
  string trace_id = 1;
  string instance_id = 2;
```
The `GetClusterInfoRequest.instance_id` field is defined but never used in the implementation; consider removing it?
My design suggests that clients should pass the `instance_id` whenever possible. However, in scenarios where an `instance_id` is not available, it's acceptable to omit it or pass an empty value.

Adding `instance_id` has two motivations:

- Primarily alignment at the interface level. There's a macro `API_CONTEXT_GET_COLLECTOR_AND_INIT_GRPC/HTTP` that relies on `instance_id` to gather metrics.
- For the future: using the `instance_id` passed by the client, we can seamlessly route requests from different instances to different clusters, providing cluster services with no visible difference.

However, your review reminded me of what to do if a user passes a non-existent `instance_id`. The `instance_id` might not exist before registration; I'll refine this part.
LGTM
Force-pushed: 3d64796 → c416310 → 2cd6c5a
Summary

- Add a `GetClusterInfo` API to MetaService and enhance `GetManagerClusterInfo` in AdminService with a `leader_endpoint` field, allowing any node (leader or follower) to return the current leader's connection endpoint (`host`, `meta_rpc_port`, `meta_http_port`, `admin_rpc_port`, `admin_http_port`). Clients can query any node to discover and directly connect to the leader.
- `NodeEndpointInfo` and node registration: each node writes its endpoint information to the CoordinationBackend KV store at startup. A new `NodeEndpointInfo` class handles serialization/deserialization. Leader endpoint lookups use a local-cache optimization when the queried node is itself the leader.
- `NetUtil` and `advertised_host` config: introduce `NetUtil::GetLocalIp()` to auto-detect the node's non-loopback IPv4 address, and add a `kvcm.service.advertised_host` config option to explicitly set the externally reachable address (takes precedence over auto-detection).
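The client-side discovery flow this enables can be sketched as follows. The types and the `query` callback below are toy stand-ins; a real client would use the generated gRPC stubs for `GetClusterInfo`:

```cpp
#include <optional>
#include <string>
#include <vector>

struct LeaderInfo { std::string node_id; std::string host; int rpc_port = 0; };

// Toy stand-in for issuing GetClusterInfo against one node address.
using QueryFn = std::optional<LeaderInfo> (*)(const std::string& node_addr);

// Discovery loop: ask any reachable seed node; every node (leader or
// follower) can answer with the leader's endpoint, so the client can then
// connect directly to the leader.
std::optional<LeaderInfo> DiscoverLeader(const std::vector<std::string>& seeds,
                                         QueryFn query) {
  for (const auto& addr : seeds) {
    if (auto info = query(addr)) return info;  // first answer wins
  }
  return std::nullopt;  // no node reachable / no leader elected yet
}
```

This is why the feature only needs any one node to be reachable, rather than requiring clients to know the leader in advance.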
Test plan

- Unit tests: `LeaderElectorTest` — `SetSelfNodeInfoAndGetNodeInfo`, `ReadNonExistentNodeInfo`, `LeaderDiscoveryEndToEnd`, `LeaderDiscoveryTwoNodes`, `NodeInfoOverwrite`
- Unit tests: `GrpcStubTest` — `TestGetClusterInfoNoLeaderElector`, `TestGetClusterInfoWithLeaderElector`
- Integration tests: `MetaServiceTestBase::test_get_cluster_info` (gRPC + HTTP)
- Integration tests: `AdminServiceTestBase::test_get_manager_cluster_info` — enhanced with `leader_endpoint` validation
- Integration tests: `AdminServiceLeaderElectionTest` — cross-node leader endpoint discovery with `advertised_host=127.0.0.1`

🤖 Generated with [Qoder](https://qoder.com)