Worker upload timeout hangs worker indefinitely #2403

@swarren12

This one is a bit hard to describe / reproduce, so apologies in advance if I'm not quite correct about some of the claims here 😓

Expected behaviour: "Upload results timeout" causes an error and fails / retries / does something sensible
Actual behaviour: "Upload results timeout" (intermittently?) causes the worker to stall
Steps to reproduce: Not sure... see below

More details

So, I'm not sure if this is an interplay between a few settings, more generally a dodgy config, or both. Despite this happening fairly frequently, I can't reliably reproduce it.

The symptom: during a build, one or more actions "hang" indefinitely (I observed an action running for over 28000s as I left it running overnight!). Killing and re-running the build sometimes fixes it (I assume because it gets run on a different worker 🤷).

One general point regarding triage - it's not immediately obvious to me how to tell which worker node a given remote build ran on? As far as I can tell, there's nothing on the "client", and nothing useful in the scheduler either? With a worker pool of about 15 nodes, this can make it awkward to pinpoint if it's a particular worker that's being troublesome... anyway...

I noticed that some nodes log the following warning:

  2026-06-04T23:18:32.877022Z  WARN nativelink_worker::running_actions_manager: Upload results timeout, operation_id: e95c9fc5-65a6-4c3b-8699-60d3881327f7, timeout: 12

However, what I don't see after this warning is a DeadlineExceeded ERROR log (or any ERROR log at all).
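For anyone trying to triage the same thing across a worker pool, here is a rough sketch of a log scan in Rust. It only assumes the line shape shown in the warning above; the function names are made up for illustration and nothing here is part of NativeLink. It pulls out the operation IDs that hit "Upload results timeout" and checks whether any ERROR-level line follows the first such warning.

```rust
/// Operation IDs taken from "Upload results timeout" warnings in one worker's log.
/// Assumes the `operation_id: <id>,` layout seen in the warning quoted above.
fn timed_out_operations(log: &str) -> Vec<String> {
    log.lines()
        .filter(|line| line.contains("Upload results timeout"))
        .filter_map(|line| {
            // Take the text after "operation_id: " up to the next comma.
            let rest = line.split("operation_id: ").nth(1)?;
            let id = rest.split(',').next()?.trim();
            if id.is_empty() {
                None
            } else {
                Some(id.to_string())
            }
        })
        .collect()
}

/// True if any ERROR-level line appears after the first timeout warning.
fn error_follows_timeout(log: &str) -> bool {
    log.lines()
        .skip_while(|line| !line.contains("Upload results timeout"))
        .skip(1) // skip the warning line itself
        .any(|line| line.contains(" ERROR "))
}
```

Running both functions over each worker's log and comparing the number of timed-out operations against max_inflight_tasks should show which workers are close to refusing new jobs.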

After this warning is logged, the worker seems to have one fewer available slot out of its max_inflight_tasks. By default we set this to 5, so if this happens a few times, the worker effectively stops accepting new jobs.
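To make the expected behaviour concrete, here is a minimal model in Rust of what I would expect on a timeout: the action fails with a DeadlineExceeded-style error and its inflight slot is handed back. This is only a sketch using the standard library. `Slots`, `SlotGuard`, `UploadError` and `run_with_upload_timeout` are hypothetical names, and I am not claiming this is how NativeLink's running_actions_manager is structured.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a worker's pool of inflight task slots.
struct Slots {
    available: AtomicUsize,
}

/// RAII guard: the slot is returned when the guard is dropped, on every exit path.
struct SlotGuard {
    slots: Arc<Slots>,
}

impl Slots {
    fn new(max_inflight: usize) -> Arc<Self> {
        Arc::new(Slots { available: AtomicUsize::new(max_inflight) })
    }

    fn available(&self) -> usize {
        self.available.load(Ordering::SeqCst)
    }
}

/// Take one slot, or return None if the worker is already full.
fn acquire(slots: &Arc<Slots>) -> Option<SlotGuard> {
    slots
        .available
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| n.checked_sub(1))
        .ok()
        .map(|_| SlotGuard { slots: Arc::clone(slots) })
}

impl Drop for SlotGuard {
    fn drop(&mut self) {
        self.slots.available.fetch_add(1, Ordering::SeqCst);
    }
}

#[derive(Debug, PartialEq)]
enum UploadError {
    DeadlineExceeded,
}

/// Simulate an upload that takes `upload_takes`, bounded by `timeout`.
fn run_with_upload_timeout(
    slots: &Arc<Slots>,
    upload_takes: Duration,
    timeout: Duration,
) -> Result<(), UploadError> {
    let _guard = acquire(slots).expect("no free slot");
    let (tx, rx) = mpsc::channel();
    // The simulated upload runs on its own thread and is abandoned on timeout.
    thread::spawn(move || {
        thread::sleep(upload_takes);
        let _ = tx.send(());
    });
    // Any failure to receive in time is reported as DeadlineExceeded.
    rx.recv_timeout(timeout)
        .map_err(|_| UploadError::DeadlineExceeded)
    // `_guard` drops here, so the slot is released on success and on timeout.
}
```

In this model a timed-out upload returns `Err(UploadError::DeadlineExceeded)` and `slots.available()` goes back to its original value. What I'm seeing looks like the opposite: the warning is logged, no error surfaces, and the slot never comes back.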

I've turned debug logging on for some worker nodes (but not all) but I can't see anything particularly interesting in those either.

(It's worth noting that the 12s timeout is intentionally low - our cache is on the same 10G network switch as the workers, so we're not expecting things to take very long... the problem still seems to happen even if we have the default 600s timeout)

Edit:

Found this log on the cache at ~the same time as one occurrence of the error on a worker:

  2026-06-05T10:45:29.285920Z ERROR nativelink_service::bytestream_server: error: status: InvalidArgument, message: "Expected WriteRequest struct in stream (got None) : In ByteStreamServer::write", details: [], metadata: MetadataMap { headers: {} }
    at nativelink-service/src/bytestream_server.rs:1068
    in nativelink_service::bytestream_server::write with request: Streaming
    in nativelink_util::task::http_executor
    in nativelink::services::http_connection with remote_addr: 10.81.75.96:53878, socket_addr: 0.0.0.0:50051
