This one is a bit hard to describe / reproduce, so apologies in advance if I'm not quite correct about some of the claims here 😓
Expected behaviour: "Upload results timeout" causes an error and fails / retries / does something sensible
Actual behaviour: "Upload results timeout" (intermittently?) causes the worker to stall
Steps to reproduce: Not sure... see below
More details
I'm not sure whether this is an interplay between a few settings, a more generally dodgy config, or both. Despite it happening fairly frequently, I can't reliably reproduce it.
The symptom: during a build, one or more actions "hang" indefinitely (I observed an action running for over 28000s when I left it running overnight!). Killing and re-running the build sometimes fixes it (I assume because it gets run on a different worker 🤷).
One general point regarding triage - it's not immediately obvious to me how to tell which worker node a given remote build ran on? As far as I can tell, there's nothing on the "client", and nothing useful in the scheduler either? With a worker pool of about 15 nodes, this can make it awkward to pinpoint if it's a particular worker that's being troublesome... anyway...
I noticed that some nodes log the following warning:
2026-06-04T23:18:32.877022Z WARN nativelink_worker::running_actions_manager: Upload results timeout, operation_id: e95c9fc5-65a6-4c3b-8699-60d3881327f7, timeout: 12
However, what I don't see after this is a DeadlineExceeded ERROR log (or any ERROR log at all).
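For anyone else trying to triage this across a worker pool, here is a rough sketch of how the warning can be pulled out of a worker's log so the affected operation IDs can be compared against the cache logs. The regex assumes the exact line layout quoted above, and `find_upload_timeouts` is a name I made up for illustration, not anything from NativeLink.

```python
import re

# Assumption: log lines look exactly like the WARN line quoted in this issue.
UPLOAD_TIMEOUT_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+WARN\s+nativelink_worker::running_actions_manager: "
    r"Upload results timeout, operation_id: (?P<operation_id>[0-9a-f-]+), "
    r"timeout: (?P<timeout>\d+)"
)


def find_upload_timeouts(lines):
    """Return (timestamp, operation_id, timeout_seconds) for each matching line."""
    hits = []
    for line in lines:
        match = UPLOAD_TIMEOUT_RE.match(line.strip())
        if match:
            hits.append(
                (match["timestamp"], match["operation_id"], int(match["timeout"]))
            )
    return hits
```

Feeding each worker's log through this and counting the hits per node gives a quick way to see whether the stalls cluster on particular workers.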
After this warning is logged, the worker seems to have 1 fewer available max_inflight_tasks. By default we set this to 5, so if this happens a few times, the worker effectively stops accepting new jobs.
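To make the expected behaviour concrete, here is a minimal sketch in Python (not NativeLink's actual Rust implementation, which I haven't traced through). It assumes a semaphore playing the role of `max_inflight_tasks`; `DeadlineExceeded`, `run_action` and `demo` are hypothetical names for illustration. The point is that a timed-out upload surfaces an error and the slot is handed back, so repeated timeouts never shrink the worker's capacity.

```python
import asyncio


class DeadlineExceeded(Exception):
    """Hypothetical stand-in for a DeadlineExceeded status."""


async def run_action(slots, upload, timeout_s):
    # `slots` caps concurrent actions, like max_inflight_tasks (assumption).
    # `async with` releases the slot even when the upload times out.
    async with slots:
        try:
            return await asyncio.wait_for(upload(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            raise DeadlineExceeded(f"upload exceeded {timeout_s}s") from exc


async def demo():
    slots = asyncio.Semaphore(5)

    async def stuck_upload():
        await asyncio.sleep(3600)  # simulates an upload that never finishes

    errors = 0
    for _ in range(6):  # more timeouts than there are slots
        try:
            await run_action(slots, stuck_upload, 0.01)
        except DeadlineExceeded:
            errors += 1
    # If slots leaked, the sixth call would block forever instead of erroring.
    return errors, slots.locked()
```

Running `asyncio.run(demo())` should report six errors and an unlocked semaphore, which is the opposite of what I'm observing on the workers.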
I've turned debug logging on for some worker nodes (but not all) but I can't see anything particularly interesting in those either.
(It's worth noting that the 12s timeout is intentionally low - our cache is on the same 10G network switch as the workers, so we're not expecting things to take very long... the problem still seems to happen even if we have the default 600s timeout)
Edit:
Found this log on the cache at ~the same time as one occurrence of the error on a worker:
2026-06-05T10:45:29.285920Z ERROR nativelink_service::bytestream_server: error: status: InvalidArgument, message: "Expected WriteRequest struct in stream (got None) : In ByteStreamServer::write", details: [], metadata: MetadataMap { headers: {} }
at nativelink-service/src/bytestream_server.rs:1068
in nativelink_service::bytestream_server::write with request: Streaming
in nativelink_util::task::http_executor
in nativelink::services::http_connection with remote_addr: 10.81.75.96:53878, socket_addr: 0.0.0.0:50051