Skip to content

gRPC/tonic hardening: per-RPC deadlines, keepalive, message-size limits #353

Description

@xiaguan

Summary

We audited all tonic/gRPC usage in the workspace (engine server, MetaServer, the
internal MetaServer client, the PyO3 engine client, and build.rs) against tonic
0.14 best practices and how production Rust systems (GreptimeDB, qdrant, RisingWave,
databend-meta, InfluxDB IOx) configure tonic. Reference + full audit:
docs/tonic-best-practices.md.

Stack: tonic 0.14, tonic-prost(-build) 0.14, default features (no TLS, plaintext h2c).

This issue tracks the gaps. Most are about inheriting unsafe tonic defaults
(4 MiB decode limit, no keepalive, no per-RPC deadline).

What's already good (no action)

  • PyO3 engine client (python/src/lib.rs) is the reference: connect_timeout,
    tcp_nodelay, HTTP/2 keepalive + keep_alive_while_idle, 64 MiB message limits,
    channel cloned per call.
  • Engine server sets 64 MiB message limits; both servers use serve_with_shutdown.
  • Centralized Status mapping, boundary validation, no unwrap/panic in prod handlers.
  • session server-stream uses bounded mpsc + ReceiverStream + client-disconnect
    cleanup via closed().

Findings to track

  • (High) No per-RPC deadline on the data path. The PyO3 client sets only
    connect_timeout (handshake); load/save/query have no timeout, so a wedged server
    hangs the connector worker forever. The query_prefetch DeadlineExceeded
    fail-open (python/src/lib.rs:454) can never trigger because no deadline is ever
    set. Fix: Request::set_timeout(...) per call (emits server-visible grpc-timeout),
    with a per-RPC budget.
  • (Medium) MetaServer server inherits the 4 MiB decode default.
    pegaflow-metaserver/src/lib.rs uses a bare MetaServerServer::new(...).
    InsertBlockHashes batches drain a queue (depth 256) and can exceed 4 MiB, failing
    with a cryptic decode error — while the engine server allows 64 MiB. Set
    max_decoding/encoding_message_size to match.
  • (Medium) Internal MetaServer client lacks keepalive/nodelay/timeout. Two
    inconsistent channels to the same peer: query_client via connect_lazy(), the
    registration loop via eager connect(); neither sets HTTP/2 keepalive,
    tcp_nodelay, connect_timeout, or per-RPC timeout (unlike the PyO3 client). A
    half-open MetaServer is detected slowly. Fix: one shared Endpoint builder
    (keepalive + keep_alive_while_idle + connect_timeout + tcp_nodelay) reused for
    both paths, plus per-RPC set_timeout. Reference: GreptimeDB ChannelConfig.
  • (Medium) Engine server has no HTTP/2 keepalive / concurrency limits / handler
    timeout.
    Server::builder() sets none of http2_keepalive_interval,
    max_concurrent_streams, concurrency_limit_per_connection, timeout. The
    session stream is long-lived; without server keepalive, a client that died without
    a clean RST is detected slowly, delaying CUDA IPC cleanup.
  • (Low, perf) build.rs compiles protos without bytes(["."]). Not recommended — retracted, see comment below. repeated bytes block_hashes / wrapper_bytes decode to Vec<Vec<u8>> with a heap copy per
    hash on the hot save/load/query path. tonic_prost_build::configure().bytes(["."])
    → zero-copy Bytes.
  • (Low) metaserver_client.rs panics on bad CLI addr.
    .expect("valid metaserver_addr URI") is reachable from --metaserver-addr; should
    surface as a startup Result, not a panic deep in client construction.
  • (Non-blocking) No tonic-health / tonic-reflection. A custom health RPC
    exists; standard grpc.health.v1 + reflection would help k8s probes and grpcurl.
  • (Non-blocking) No TLS. All gRPC (engine, MetaServer, inter-node P2P) is
    plaintext h2c. Acceptable on a trusted cluster network — flagged so it's a conscious
    decision. Enable tls-ring/tls-aws-lc if links ever leave the trust boundary.

Suggested order

  1. Per-RPC deadlines (real wedge risk).
  2. MetaServer client + server (message size + keepalive) as one PR; extract a shared
    Endpoint/config helper.
  3. Engine server keepalive / limits.
  4. Remaining cleanup (bytes, addr panic, health/reflection, TLS decision).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions