Query requests are silently dropped when using shared memory when receiver's RX thread stalls #2628

@zygfrydw

Describe the bug

When the transport-level shared-memory optimization (transport/shared_memory/transport_optimization) is enabled, query requests are auto-promoted into watchdog-protected SHM buffers. If the receiving side's RX thread is descheduled for longer than the watchdog TTL (~100–200 ms) between accepting consecutive frames, the SHM chunk is invalidated by the sender's watchdog before the receiver maps it and is able to confirm the SHM message.
The receiver then silently drops the query (it returns Ok(()) after only a tracing::debug! log), and the querier receives neither a reply nor any indication that the query was dropped. As a result, the client's get() hangs for the full queries_default_timeout (default 600 s).

Mechanism (root cause)

  1. The client issues a session.get(...) request.
  2. On the wire path, map_zmsg_to_partner auto-promotes the Request::Query payload into an SHM chunk.
  3. Every allocated chunk is registered with the watchdog subsystem:
    • GLOBAL_CONFIRMATOR (period 50 ms, commons/zenoh-shm/src/watchdog/confirmator.rs) — the owner of a live ConfirmedDescriptor keeps "kicking" the chunk.
    • GLOBAL_VALIDATOR (period 100 ms, commons/zenoh-shm/src/watchdog/validator.rs) — if a chunk's bit hasn't been kicked since the last tick, it sets watchdog_invalidated = true.
  4. After the sender's transport writes the frame to TCP and releases its chunk handle, only the receiver can keep the chunk alive, and only after it successfully calls read_shmbuf.
  5. If the receiver's RX thread is stalled (CPU contention, etc.) and does not drain the socket within the watchdog window, the sender-side validator invalidates the chunk.
  6. When the RX thread finally runs, map_zmsg_to_shmbuf → read_shmbuf → is_valid() fails (commons/zenoh-shm/src/lib.rs: !watchdog_invalidated && generation == info.generation) and returns bail!("Buffer is invalidated") (commons/zenoh-shm/src/reader.rs:79).
  7. The transport RX swallows it:
 if let Some(shm_context) = &self.shm_context {
     if let Err(e) =
         crate::shm::map_zmsg_to_shmbuf(msg.as_mut(), &shm_context.shm_reader)
     {
         tracing::debug!("Error receiving SHM buffer: {e}");
         return Ok(());
     }
 }
  8. The query request never reaches the queryable callback. The querier has a matched queryable, so it does not fire the "finished with 0 replies" drop path; instead it waits for the full queries_default_timeout (default 600 s).
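
The race in the steps above can be sketched as a small deterministic model. This is only an illustration under stated assumptions: it uses logical milliseconds instead of real threads, it takes the 50 ms and 100 ms periods from the description above, and it assumes the chunk starts out freshly kicked. The function name invalidated_at and its parameters are hypothetical and do not exist in zenoh.

```rust
// Simplified, deterministic model of the watchdog race (logical milliseconds,
// no real threads). All names here are hypothetical, not zenoh's API.

const CONFIRM_PERIOD_MS: u64 = 50; // confirmator period quoted in the issue
const VALIDATE_PERIOD_MS: u64 = 100; // validator period quoted in the issue

/// Returns the time at which the validator invalidates the chunk, or `None`
/// if the chunk is still valid at `horizon_ms`.
///
/// * `sender_release_ms`: the sender drops its chunk handle after this time.
/// * `receiver_map_ms`: the receiver's RX thread maps the chunk at this time.
fn invalidated_at(sender_release_ms: u64, receiver_map_ms: u64, horizon_ms: u64) -> Option<u64> {
    let mut kicked = true; // assumption: the chunk starts out freshly kicked
    for now in 1..=horizon_ms {
        let sender_owns = now <= sender_release_ms;
        let receiver_owns = now >= receiver_map_ms;
        // Confirmator tick: only a live owner keeps kicking the chunk.
        if now % CONFIRM_PERIOD_MS == 0 && (sender_owns || receiver_owns) {
            kicked = true;
        }
        // Validator tick: no kick since the previous tick means invalidation.
        if now % VALIDATE_PERIOD_MS == 0 {
            if !kicked {
                return Some(now);
            }
            kicked = false;
        }
    }
    None
}

fn main() {
    // Receiver drains the socket promptly: the chunk stays valid.
    println!("prompt receiver:  {:?}", invalidated_at(10, 60, 1_000));
    // Receiver stalled for 5 s: the chunk is invalidated long before it maps it.
    println!("stalled receiver: {:?}", invalidated_at(10, 5_000, 10_000));
}
```

In this model a chunk released at 10 ms is invalidated at 200 ms (190 ms later), and one released at 100 ms is also invalidated at 200 ms (100 ms later). Both results fall inside the ~100–200 ms window described above: the exact TTL depends on where the release lands relative to the validator tick.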

Impact

We hit this issue in production using rmw_zenoh, where it strongly affects ROS 2 services. RMW treats service calls as reliable and implements no retry logic for queries, so a single silently-dropped request (per the mechanism above) translates directly into a hung service call with no error surfaced to the application.

Our system comprises more than 100 ROS 2 nodes, with several grouped into composable containers. A composable container loads its components by issuing a sequence of ROS 2 service requests to load each component into the process. At system startup, all of these load requests are issued in a short burst while the machine is under heavy CPU contention — a thread storm competing for resources until the system stabilizes.
During this startup window, some of the component-load queries are silently dropped. Because rmw_zenoh does not retry and the get() simply hangs (up to queries_default_timeout), the affected containers are left partially initialized, with a non-deterministic set of components never loaded. The system comes up in a broken state, and the components that are missing vary from boot to boot.

Workaround

Disable shared memory transport optimization and use shared memory explicitly only for topics that tolerate lost messages.
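
A possible config fragment for the first half of this workaround is sketched below. The transport/shared_memory/transport_optimization path comes from the description above; the enabled field names are assumptions and should be checked against the DEFAULT_CONFIG.json5 shipped with your Zenoh version.

```json5
{
  transport: {
    shared_memory: {
      // Assumed field: keep SHM itself available for explicit use on loss-tolerant topics.
      enabled: true,
      transport_optimization: {
        // Assumed field: stop implicit promotion of payloads (including queries) into SHM.
        enabled: false,
      },
    },
  },
}
```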

To reproduce

I attach the source code for shm_query_client and shm_query_server and their respective configs.

  1. Run the service: RUST_LOG=zenoh=debug ./shm_query_server configs/shm_query_server.json5
    It prints: [client] loaded config=shm_query_server.json
  2. Run the client: RUST_LOG=zenoh=debug ./shm_query_client shm_query_client.json5 8192 10000
    It makes the first request over TCP (lazy shared memory initialization) and then waits for [ENTER].
  3. Suspend the service process with Ctrl+Z to simulate CPU congestion.
  4. Press Enter in shm_query_client to send the second query.
  5. Go back to the shm_query_server terminal and run fg 1 to resume the service.
  6. When the RX thread in the service resumes, it prints a debug log reporting that the SHM buffer was invalidated:
2026-06-01T09:57:09.852157Z DEBUG  rx-0 ThreadId(07) zenoh_transport::unicast::universal::rx: Error receiving SHM buffer: Buffer is invalidated at /home/runner/.cargo/git/checkouts/zenoh-9c599d5ef3e0480e/81c6c93/commons/zenoh-shm/src/reader.rs:79.
2026-06-01T09:57:11.623704Z DEBUG  rx-1 ThreadId(08) zenoh_transport::unicast::universal::link: RX task failed: Read error on TCP link 127.0.0.1:7448 => 127.0.0.1:39906: early eof at /home/runner/.cargo/git/checkouts/zenoh-9c599d5ef3e0480e/81c6c93/io/zenoh-links/zenoh-link-tcp/src/unicast.rs:159.
  7. The shm_query_client will be frozen until the query timeout elapses.

shm_query_client.cpp
shm_query_server.cpp
shm_query_client.json
shm_query_server.json

System info

  • Zenoh version: zenohd v1.9.0, originally reproduced on 1.6.2 (b81e253)
  • Platform / OS: Ubuntu 24.04.4 LTS (Noble Numbat)
  • Kernel: Linux 6.17.0-29-generic, PREEMPT_DYNAMIC, x86_64
  • CPU: Intel(R) Core(TM) i7-14700KF — 20 cores / 28 threads
  • Architecture: x86_64
  • Toolchain: gcc 13.3.0, cmake 3.28.3

Labels: bug