Query requests are silently dropped over shared memory when the receiver's RX thread stalls #2628

### Describe the bug

When the transport-level shared-memory optimization (`transport/shared_memory/transport_optimization`) is enabled, query requests are auto-promoted into watchdog-protected SHM buffers. If the receiver's RX thread is descheduled for longer than the watchdog TTL (~100–200 ms) between accepting consecutive frames, the **sender's** watchdog invalidates the SHM chunk before the receiver can map it and confirm the SHM message.
The receiver then **silently drops the query** (`return Ok(())` with only a `tracing::debug!`), and the querier never receives a reply or any indication that the query was dropped. As a result, the client's `get()` hangs for the full `queries_default_timeout` (default **600 s**).

#### Mechanism (root cause)
1. The client issues a `session.get(...)` request.
2. On the wire path, `map_zmsg_to_partner` auto-promotes the `Request::Query` payload into an SHM chunk.
3. Every allocated chunk is registered with the **watchdog subsystem**:
   - `GLOBAL_CONFIRMATOR` (period **50 ms**, `commons/zenoh-shm/src/watchdog/confirmator.rs`) — the owner of a live `ConfirmedDescriptor` keeps "kicking" the chunk.
   - `GLOBAL_VALIDATOR` (period **100 ms**, `commons/zenoh-shm/src/watchdog/validator.rs`) — if a chunk's bit hasn't been kicked since the last tick, it sets `watchdog_invalidated = true`.
4. After the sender's transport writes the frame to TCP and releases its chunk handle, only the **receiver** can keep the chunk alive, and only **after** it successfully calls `read_shmbuf`.
5. If the receiver's RX thread is stalled (CPU contention, etc.) and does not drain the socket within the watchdog window, the sender-side validator invalidates the chunk.
6. When the RX thread finally runs, `map_zmsg_to_shmbuf` → `read_shmbuf` → `is_valid()` fails (`commons/zenoh-shm/src/lib.rs`: `!watchdog_invalidated && generation == info.generation`) and returns `bail!("Buffer is invalidated")` (`commons/zenoh-shm/src/reader.rs:79`).
7. The transport RX path swallows the error:
   ```rust
   if let Some(shm_context) = &self.shm_context {
       if let Err(e) =
           crate::shm::map_zmsg_to_shmbuf(msg.as_mut(), &shm_context.shm_reader)
       {
           tracing::debug!("Error receiving SHM buffer: {e}");
           return Ok(());
       }
   }
   ```
8. The query request never reaches the queryable callback. Because the querier has a matched queryable, it does not take the "finished with 0 replies" drop path; it waits for the full `queries_default_timeout` (default **600 s**).
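
The race in steps 3 to 6 can be illustrated with a toy model. This is only a sketch under stated assumptions: `Chunk` and `receiver_sees_valid_chunk` are hypothetical names, not zenoh types or functions; time is stepped in logical milliseconds; the chunk is assumed to start out "kicked" at allocation; and both watchdog timers are assumed to be phase-aligned at t = 0. Only the two periods and the shape of the validity check come from the description above.

```rust
// Toy model of the watchdog race; these are NOT zenoh's real types.
const CONFIRMATOR_PERIOD_MS: u64 = 50; // period quoted for GLOBAL_CONFIRMATOR
const VALIDATOR_PERIOD_MS: u64 = 100; // period quoted for GLOBAL_VALIDATOR

/// Hypothetical stand-in for one watchdog-protected chunk.
struct Chunk {
    kicked: bool,               // set by the confirmator, cleared by the validator
    watchdog_invalidated: bool, // latched by the validator, never cleared
    generation: u32,
}

impl Chunk {
    /// Same shape as the check quoted in step 6.
    fn is_valid(&self, expected_generation: u32) -> bool {
        !self.watchdog_invalidated && self.generation == expected_generation
    }
}

/// Steps through logical milliseconds. The sender holds its handle until
/// `sender_release_ms`; the receiver first touches the chunk at
/// `receiver_map_ms`. Returns whether the receiver's validity check passes.
fn receiver_sees_valid_chunk(sender_release_ms: u64, receiver_map_ms: u64) -> bool {
    // Assumption: a freshly allocated chunk starts out kicked.
    let mut chunk = Chunk { kicked: true, watchdog_invalidated: false, generation: 0 };
    for t in 1..=receiver_map_ms {
        // Confirmator tick: only a live owner kicks the chunk.
        if t % CONFIRMATOR_PERIOD_MS == 0 && t < sender_release_ms {
            chunk.kicked = true;
        }
        // Validator tick: invalidate if nobody kicked since the previous tick.
        if t % VALIDATOR_PERIOD_MS == 0 {
            if !chunk.kicked {
                chunk.watchdog_invalidated = true;
            }
            chunk.kicked = false;
        }
    }
    chunk.is_valid(0)
}
```

In this model, with the sender releasing its handle almost immediately, `receiver_sees_valid_chunk(1, 150)` passes while `receiver_sees_valid_chunk(1, 250)` fails, which is consistent with the ~100–200 ms window mentioned in the bug description. The exact cutoff in a real system depends on the relative phase of the two timers.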

#### Impact
We hit this issue in production using [rmw_zenoh](https://github.com/ros2/rmw_zenoh), where it strongly affects ROS 2 services. RMW treats service calls as reliable and implements no retry logic for queries, so a single silently-dropped request (per the mechanism above) translates directly into a hung service call with no error surfaced to the application.

Our system comprises more than 100 ROS 2 nodes, with several grouped into composable containers. A composable container loads its components by issuing a sequence of ROS 2 service requests to load each component into the process. At system startup, all of these load requests are issued in a short burst while the machine is under heavy CPU contention — a thread storm competing for resources until the system stabilizes. 
During this startup window, some of the component-load queries are silently dropped. Because rmw_zenoh does not retry and the `get()` simply hangs for up to `queries_default_timeout`, the affected containers are left partially initialized, with a non-deterministic set of components never loaded. The system comes up in a broken state, and the missing components vary from boot to boot.

#### Workaround
Disable shared memory transport optimization and use shared memory explicitly only for topics that tolerate lost messages. 
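
A configuration fragment along these lines should turn the optimization off. It is a sketch that assumes the `transport_optimization` block exposes an `enabled` flag, as it does in the default configuration shipped with recent Zenoh 1.x releases; check the `DEFAULT_CONFIG.json5` of the version in use before relying on it.

```json5
{
  transport: {
    shared_memory: {
      // Explicit SHM publication stays available...
      enabled: true,
      transport_optimization: {
        // ...but messages are no longer auto-promoted into SHM buffers.
        // Assumption: the `enabled` field exists in the deployed Zenoh version.
        enabled: false,
      },
    },
  },
}
```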

### To reproduce

I attach the source code for shm_query_client and shm_query_server and their respective configs.

1. Run the service: `RUST_LOG=zenoh=debug ./shm_query_server configs/shm_query_server.json5`
   It logs `[client] loaded config=shm_query_server.json`.
2. Run the client: `RUST_LOG=zenoh=debug ./shm_query_client shm_query_client.json5 8192 10000`
   It makes the first request over TCP (lazy shared memory initialization) and then waits for [ENTER].
3. Suspend the service process with Ctrl+Z to simulate CPU congestion.
4. Press Enter in `shm_query_client` to send the second query
5. Go back to the shm_query_server terminal and execute `fg 1` to resume service.
6. When the service's RX thread resumes, it prints a debug log reporting that the SHM buffer was lost:
```
2026-06-01T09:57:09.852157Z DEBUG  rx-0 ThreadId(07) zenoh_transport::unicast::universal::rx: Error receiving SHM buffer: Buffer is invalidated at /home/runner/.cargo/git/checkouts/zenoh-9c599d5ef3e0480e/81c6c93/commons/zenoh-shm/src/reader.rs:79.
2026-06-01T09:57:11.623704Z DEBUG  rx-1 ThreadId(08) zenoh_transport::unicast::universal::link: RX task failed: Read error on TCP link 127.0.0.1:7448 => 127.0.0.1:39906: early eof at /home/runner/.cargo/git/checkouts/zenoh-9c599d5ef3e0480e/81c6c93/io/zenoh-links/zenoh-link-tcp/src/unicast.rs:159.
``` 
7. `shm_query_client` stays blocked until the query timeout elapses.

[shm_query_client.cpp](https://github.com/user-attachments/files/28462260/shm_query_client.cpp)
[shm_query_server.cpp](https://github.com/user-attachments/files/28462259/shm_query_server.cpp)
[shm_query_client.json](https://github.com/user-attachments/files/28462273/shm_query_client.json)
[shm_query_server.json](https://github.com/user-attachments/files/28462274/shm_query_server.json)

### System info

- **Zenoh version:** zenohd v1.9.0, originally reproduced on 1.6.2 (b81e253b3)
- **Platform / OS:** Ubuntu 24.04.4 LTS (Noble Numbat)
- **Kernel:** Linux 6.17.0-29-generic, PREEMPT_DYNAMIC, x86_64
- **CPU:** Intel(R) Core(TM) i7-14700KF — 20 cores / 28 threads
- **Architecture:** x86_64
- **Toolchain:** gcc 13.3.0, cmake 3.28.3
