You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We audited all tonic/gRPC usage in the workspace (engine server, MetaServer, the
internal MetaServer client, the PyO3 engine client, and build.rs) against tonic
0.14 best practices and how production Rust systems (GreptimeDB, qdrant, RisingWave,
databend-meta, InfluxDB IOx) configure tonic. Reference + full audit: docs/tonic-best-practices.md.
Stack: tonic 0.14, tonic-prost(-build) 0.14, default features (no TLS, plaintext h2c).
This issue tracks the gaps. Most are about inheriting unsafe tonic defaults
(4 MiB decode limit, no keepalive, no per-RPC deadline).
What's already good (no action)
PyO3 engine client (python/src/lib.rs) is the reference: connect_timeout, tcp_nodelay, HTTP/2 keepalive + keep_alive_while_idle, 64 MiB message limits,
channel cloned per call.
Engine server sets 64 MiB message limits; both servers use serve_with_shutdown.
Centralized Status mapping, boundary validation, no unwrap/panic in prod handlers.
(High) No per-RPC deadline on the data path. The PyO3 client sets only connect_timeout (handshake); load/save/query have no timeout, so a wedged server
hangs the connector worker forever. The query_prefetchDeadlineExceeded
fail-open (python/src/lib.rs:454) can never trigger because no deadline is ever
set. Fix: Request::set_timeout(...) per call (emits server-visible grpc-timeout),
with a per-RPC budget.
(Medium) MetaServer server inherits the 4 MiB decode default. pegaflow-metaserver/src/lib.rs uses a bare MetaServerServer::new(...). InsertBlockHashes batches drain a queue (depth 256) and can exceed 4 MiB, failing
with a cryptic decode error — while the engine server allows 64 MiB. Set max_decoding/encoding_message_size to match.
(Medium) Internal MetaServer client lacks keepalive/nodelay/timeout. Two
inconsistent channels to the same peer: query_client via connect_lazy(), the
registration loop via eager connect(); neither sets HTTP/2 keepalive, tcp_nodelay, connect_timeout, or per-RPC timeout (unlike the PyO3 client). A
half-open MetaServer is detected slowly. Fix: one shared Endpoint builder
(keepalive + keep_alive_while_idle + connect_timeout + tcp_nodelay) reused for
both paths, plus per-RPC set_timeout. Reference: GreptimeDB ChannelConfig.
(Medium) Engine server has no HTTP/2 keepalive / concurrency limits / handler
timeout.Server::builder() sets none of http2_keepalive_interval, max_concurrent_streams, concurrency_limit_per_connection, timeout. The session stream is long-lived; without server keepalive, a client that died without
a clean RST is detected slowly, delaying CUDA IPC cleanup.
(Low, perf) build.rs compiles protos without bytes(["."]).Not recommended — retracted, see comment below.repeated bytes block_hashes / wrapper_bytes decode to Vec<Vec<u8>> with a heap copy per
hash on the hot save/load/query path. tonic_prost_build::configure().bytes(["."])
→ zero-copy Bytes.
(Low) metaserver_client.rs panics on bad CLI addr. .expect("valid metaserver_addr URI") is reachable from --metaserver-addr; should
surface as a startup Result, not a panic deep in client construction.
(Non-blocking) No tonic-health / tonic-reflection. A custom health RPC
exists; standard grpc.health.v1 + reflection would help k8s probes and grpcurl.
(Non-blocking) No TLS. All gRPC (engine, MetaServer, inter-node P2P) is
plaintext h2c. Acceptable on a trusted cluster network — flagged so it's a conscious
decision. Enable tls-ring/tls-aws-lc if links ever leave the trust boundary.
Suggested order
Per-RPC deadlines (real wedge risk).
MetaServer client + server (message size + keepalive) as one PR; extract a shared Endpoint/config helper.
Summary
We audited all
tonic/gRPC usage in the workspace (engine server, MetaServer, theinternal MetaServer client, the PyO3 engine client, and
build.rs) against tonic0.14 best practices and how production Rust systems (GreptimeDB, qdrant, RisingWave,
databend-meta, InfluxDB IOx) configure tonic. Reference + full audit:
docs/tonic-best-practices.md.Stack:
tonic 0.14,tonic-prost(-build) 0.14, default features (no TLS, plaintext h2c).This issue tracks the gaps. Most are about inheriting unsafe tonic defaults
(4 MiB decode limit, no keepalive, no per-RPC deadline).
What's already good (no action)
python/src/lib.rs) is the reference:connect_timeout,tcp_nodelay, HTTP/2 keepalive +keep_alive_while_idle, 64 MiB message limits,channel cloned per call.
serve_with_shutdown.Statusmapping, boundary validation, nounwrap/panic in prod handlers.sessionserver-stream uses bounded mpsc +ReceiverStream+ client-disconnectcleanup via
closed().Findings to track
connect_timeout(handshake); load/save/query have no timeout, so a wedged serverhangs the connector worker forever. The
query_prefetchDeadlineExceededfail-open (
python/src/lib.rs:454) can never trigger because no deadline is everset. Fix:
Request::set_timeout(...)per call (emits server-visiblegrpc-timeout),with a per-RPC budget.
pegaflow-metaserver/src/lib.rsuses a bareMetaServerServer::new(...).InsertBlockHashesbatches drain a queue (depth 256) and can exceed 4 MiB, failingwith a cryptic decode error — while the engine server allows 64 MiB. Set
max_decoding/encoding_message_sizeto match.inconsistent channels to the same peer:
query_clientviaconnect_lazy(), theregistration loop via eager
connect(); neither sets HTTP/2 keepalive,tcp_nodelay,connect_timeout, or per-RPC timeout (unlike the PyO3 client). Ahalf-open MetaServer is detected slowly. Fix: one shared
Endpointbuilder(keepalive +
keep_alive_while_idle+connect_timeout+tcp_nodelay) reused forboth paths, plus per-RPC
set_timeout. Reference: GreptimeDBChannelConfig.timeout.
Server::builder()sets none ofhttp2_keepalive_interval,max_concurrent_streams,concurrency_limit_per_connection,timeout. Thesessionstream is long-lived; without server keepalive, a client that died withouta clean RST is detected slowly, delaying CUDA IPC cleanup.
(Low, perf)Not recommended — retracted, see comment below.build.rscompiles protos withoutbytes(["."]).repeated bytes block_hashes/wrapper_bytesdecode toVec<Vec<u8>>with a heap copy perhash on the hot save/load/query path.
tonic_prost_build::configure().bytes(["."])→ zero-copy
Bytes.metaserver_client.rspanics on bad CLI addr..expect("valid metaserver_addr URI")is reachable from--metaserver-addr; shouldsurface as a startup
Result, not a panic deep in client construction.tonic-health/tonic-reflection. A customhealthRPCexists; standard
grpc.health.v1+ reflection would help k8s probes andgrpcurl.plaintext h2c. Acceptable on a trusted cluster network — flagged so it's a conscious
decision. Enable
tls-ring/tls-aws-lcif links ever leave the trust boundary.Suggested order
Endpoint/config helper.bytes, addr panic, health/reflection, TLS decision).