Worker upload timeout hangs worker indefinitely

This one is a bit hard to describe / reproduce, so apologies in advance if I'm not quite correct about some of the claims here 😓 

**Expected behaviour**: "Upload results timeout" causes an error and fails / retries / does _something_ sensible
**Actual behaviour**: "Upload results timeout" (intermittently?) causes the worker to stall
**Steps to reproduce**: Not sure... see below

**More details**

So, I'm not sure whether this is an interplay between a few settings, a more generally dodgy config, or both. Despite it happening fairly frequently, I can't actually _reliably_ reproduce it.

The symptom: during a build, one or more actions "hang" indefinitely (I observed an action running for over 28000s as I left it running overnight!). Killing and re-running the build _sometimes_ fixes it (I assume because it gets run on a different worker 🤷).

One general point regarding triage - it's not immediately obvious to me how to tell which worker node a given remote build ran on? As far as I can tell, there's nothing on the "client", and nothing useful in the scheduler either? With a worker pool of about 15 nodes, this can make it awkward to pinpoint if it's a particular worker that's being troublesome... anyway...

I noticed that _some_ nodes log the following warning:

```
  2026-06-04T23:18:32.877022Z  WARN nativelink_worker::running_actions_manager: Upload results timeout, operation_id: e95c9fc5-65a6-4c3b-8699-60d3881327f7, timeout: 12
```

However, what I _don't_ see after this warning is a `DeadlineExceeded` `ERROR` log (or any `ERROR` log at all).
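In case it helps anyone else triage this, here is a rough sketch of the kind of script that could pull these warnings out of a worker log and flag the ones with no later `ERROR` line. It is not part of NativelLink's tooling: the function and class names are made up, the regex is based only on the single warning line above, and it assumes a follow-up `ERROR` line would mention the same `operation_id`, which may not be true.

```python
import re
from dataclasses import dataclass

# Pattern derived from the one warning line in this report; other log
# formats (e.g. JSON output) would need a different pattern.
TIMEOUT_RE = re.compile(
    r"^\s*(?P<ts>\S+)\s+WARN\s+\S+\s+Upload results timeout, "
    r"operation_id: (?P<op>[0-9a-f-]+), timeout: (?P<timeout>\d+)"
)


@dataclass
class UploadTimeout:
    timestamp: str
    operation_id: str
    timeout_secs: int
    followed_by_error: bool


def find_upload_timeouts(lines):
    """Return one UploadTimeout per matching WARN line.

    ASSUMPTION: an ERROR line for the same action would contain the
    operation_id somewhere in its text.
    """
    lines = list(lines)
    found = []
    for i, line in enumerate(lines):
        m = TIMEOUT_RE.match(line)
        if not m:
            continue
        op = m["op"]
        followed = any(
            " ERROR " in later and op in later for later in lines[i + 1:]
        )
        found.append(UploadTimeout(m["ts"], op, int(m["timeout"]), followed))
    return found


SAMPLE_LOG = [
    "  2026-06-04T23:18:32.877022Z  WARN nativelink_worker::running_actions_manager: "
    "Upload results timeout, operation_id: e95c9fc5-65a6-4c3b-8699-60d3881327f7, timeout: 12",
]

if __name__ == "__main__":
    for hit in find_upload_timeouts(SAMPLE_LOG):
        print(hit)
```

Running this over each worker's log separately would also give a per-node count of timeouts, which is a crude way to spot a worker that is misbehaving more than the others.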

After this warning is logged, the worker seems to have one fewer free slot out of its `max_inflight_tasks`. We set this to `5`, so if this happens a few times, the worker effectively stops accepting new jobs.
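To make the expected behaviour concrete, here is a minimal sketch of what I would expect to happen, written as a toy model rather than anything taken from the NativeLink source. The `Worker` class, `stuck_upload`, and the returned strings are all hypothetical; only the value `5` comes from our config. The point is that the slot is handed back on every exit path, including the timeout one.

```python
import asyncio


class Worker:
    """Toy model of a worker with a fixed number of in-flight slots."""

    def __init__(self, max_inflight_tasks):
        self.max_inflight_tasks = max_inflight_tasks
        self.in_flight = 0

    @property
    def free_slots(self):
        return self.max_inflight_tasks - self.in_flight

    async def run_action(self, upload, upload_timeout):
        self.in_flight += 1
        try:
            # wait_for cancels the upload and raises once the timeout passes.
            await asyncio.wait_for(upload(), timeout=upload_timeout)
            return "ok"
        except asyncio.TimeoutError:
            # Expected: surface the timeout as an error the scheduler can act on.
            return "DeadlineExceeded"
        finally:
            # The slot is released whether the upload succeeded or timed out.
            self.in_flight -= 1


async def stuck_upload():
    # Stand-in for an upload that never finishes.
    await asyncio.sleep(3600)


async def demo():
    worker = Worker(max_inflight_tasks=5)
    results = await asyncio.gather(
        *(worker.run_action(stuck_upload, 0.01) for _ in range(3))
    )
    return results, worker.free_slots


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this model three timed-out uploads still leave all five slots free afterwards. What I observe on the real workers looks like the opposite: each timeout appears to leave a slot occupied.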

I've turned debug logging on for _some_ worker nodes (but not all) but I can't see anything particularly interesting in those either. 

(It's worth noting that the 12s timeout is intentionally low - our cache is on the same 10G network switch as the workers, so we're not expecting things to take very long... the problem still seems to happen even if we have the default 600s timeout)

Edit:

Found this log on the cache at ~the same time as one occurrence of the error on a worker:
```
  2026-06-05T10:45:29.285920Z ERROR nativelink_service::bytestream_server: error: status: InvalidArgument, message: "Expected WriteRequest struct in stream (got None) : In ByteStreamServer::write", details: [], metadata: MetadataMap { headers: {} }
    at nativelink-service/src/bytestream_server.rs:1068
    in nativelink_service::bytestream_server::write with request: Streaming
    in nativelink_util::task::http_executor
    in nativelink::services::http_connection with remote_addr: 10.81.75.96:53878, socket_addr: 0.0.0.0:50051
```
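Since I only matched these up by eye, here is a rough sketch of how worker warnings and cache errors could be paired by timestamp. The function names and the 5 second window are arbitrary choices of mine, and it assumes every relevant log line starts with an RFC 3339 timestamp as in the excerpts in this issue.

```python
from datetime import datetime, timedelta


def parse_ts(line):
    """Parse the leading timestamp of a log line (e.g. 2026-06-05T10:45:29.285920Z)."""
    token = line.split()[0]
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(token.replace("Z", "+00:00"))


def correlate(worker_lines, cache_lines, window=timedelta(seconds=5)):
    """Pair each worker timeout warning with cache ERROR lines close in time."""
    warnings = [
        (parse_ts(line), line.strip())
        for line in worker_lines
        if "Upload results timeout" in line
    ]
    # Continuation lines ("    at ...", "    in ...") have no " ERROR " marker
    # and are skipped before any timestamp parsing is attempted.
    errors = [
        (parse_ts(line), line.strip())
        for line in cache_lines
        if " ERROR " in line
    ]
    pairs = []
    for warn_ts, warn_line in warnings:
        for err_ts, err_line in errors:
            if abs(err_ts - warn_ts) <= window:
                pairs.append((warn_line, err_line))
    return pairs
```

Feeding it the lines of one worker log and the cache log returns the warning/error pairs that fall within the window. A match only shows the two events were close in time, not that one caused the other.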
