s3 remote cache: data race / nil pointer panic in cache/remotecache/s3.readerAtCloser under concurrent multipart upload #6674

@kuzaxak

Description

Bug description

cache/remotecache/s3.readerAtCloser implements io.ReaderAt, whose documented contract explicitly allows concurrent callers:

Clients of ReadAt can execute parallel ReadAt calls on the same input source.

However, readerAtCloser has no synchronization protecting its mutable state (rc, ra, offset, closed). When used as the body of an S3 upload via manager.Uploader, the AWS SDK v2 S3 upload manager spawns DefaultUploadConcurrency = 5 worker goroutines (upload.go#L601-605), each given its own io.NewSectionReader slice that shares the same underlying body. Each worker's Read eventually becomes a concurrent ReadAt call on the same *readerAtCloser at a different offset, triggering a data race and a nil pointer dereference panic that crashes buildkitd.

This happens when a cache layer is read through this S3-backed readerAtCloser and then uploaded through the AWS multipart uploader, so the common case is cache-from type=s3 together with cache-to type=s3 for layers larger than DefaultUploadPartSize = 5 MiB. The pod/container dies, the build fails, and no retry logic in the client (buildctl) can recover because the daemon itself is gone.

PR #5597 addressed a related offset handling bug in s3Client.getReader, but did not address the underlying thread-safety violation. The bug is still present on master as of 05fdd002b (2025-10-15) and a243ce438 (current master, 2026-04-06) — cache/remotecache/s3/readerat.go has not been modified since it was first added in 09c5a7c0e (2022-05-13).

Related but not duplicate:

Observed in production

We first hit this on a build that pulls base cache from a branch prefix, reads additional cache from master prefix, and exports to the branch prefix — a standard cache layout. The buildkit image used was moby/buildkit:master pulled in mid-October 2025, which contains the #5597 fix; this rules out the offset bug as the cause.

Sanitized buildctl log fragment and stack trace (truncated):

+ buildctl --addr unix:///run/buildkit/buildkitd.sock build \
    --frontend dockerfile.v0 --local context=. --local dockerfile=. --opt filename=Dockerfile \
    --secret id=GITHUB_USERNAME --secret id=GITHUB_TOKEN --secret id=GITHUB_HEADER \
    --opt build-arg:APP_VERSION=<sha> \
    --output type=image,name=<registry>/<image>:<tag>,push=true \
    --import-cache type=s3,endpoint_url=https://<s3-compatible>,region=<region>,bucket=<bucket>,blobs_prefix=blobs/,manifests_prefix=manifests/,access_key_id=<redacted>,secret_access_key=<redacted>,prefix=s3cache/branch/<image>/,name=<pr-branch> \
    --import-cache type=s3,endpoint_url=https://<s3-compatible>,region=<region>,bucket=<bucket>,blobs_prefix=blobs/,manifests_prefix=manifests/,access_key_id=<redacted>,secret_access_key=<redacted>,prefix=s3cache/master/<image>/,name=master \
    --export-cache type=s3,endpoint_url=https://<s3-compatible>,region=<region>,bucket=<bucket>,blobs_prefix=blobs/,manifests_prefix=manifests/,access_key_id=<redacted>,secret_access_key=<redacted>,prefix=s3cache/branch/<image>/,name=<pr-branch>

#1  [internal] load build definition from Dockerfile                                  DONE 0.2s
#2  [internal] load metadata for docker.io/library/node:20-bookworm                   DONE 1.8s
...
#19 exporting to image                                                                DONE 0.8s
#20 exporting cache to Amazon S3
#20 preparing build cache for export
#20 writing layer sha256:56784d23502ebb983d6ce883fb06c5be67d52b73c2d27cf4a4bd52072954e01a

buildkit container terminated (exit code 2)

time="2026-04-09T06:28:08Z" level=warning msg="failed to update distribution source for layer sha256:<redacted>: content digest sha256:<redacted>: not found" span="exporting to image"
[... 7 more identical warnings for other layers ...]

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x12d560b]

goroutine 2421 [running]:
github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt(0xc0006112c0, {0xc006080000, 0x80000, 0x80000}, 0x1400000)
    /src/cache/remotecache/s3/readerat.go:52 +0x1cb
io.(*SectionReader).ReadAt(0xc0009eaa28?, {0xc006080000?, 0x0?, 0x427b25?}, 0x40?)
    /usr/local/go/src/io/io.go:555 +0x93
io.(*SectionReader).Read(0xc0009382d0, {0xc006080000?, 0x2b980c0?, 0xc0009eab28?})
    /usr/local/go/src/io/io.go:516 +0x4f
github.com/aws/smithy-go/transport/http/internal/io.(*safeReadCloser).Read(0x458269?, {0xc006080000?, 0x17f0cc0?, 0x1?})
    /src/vendor/github.com/aws/smithy-go/transport/http/internal/io/safe.go:59 +0x11d
net/http.(*readTrackingBody).Read(0x2b33be8?, {0xc006080000?, 0x20d?, 0x1000?})
    /usr/local/go/src/net/http/transport.go:657 +0x27
net/http.(*http2clientStream).writeRequestBody(0xc005558180, 0xc00554ec60)
    /usr/local/go/src/net/http/h2_bundle.go:8799 +0x43f
net/http.(*http2clientStream).writeRequest(0xc005558180, 0xc00554ec60)
    /usr/local/go/src/net/http/h2_bundle.go:8506 +0x852
net/http.(*http2clientStream).doRequest(0xc005558180, 0x5?)
    /usr/local/go/src/net/http/h2_bundle.go:8392 +0x18
created by net/http.(*http2ClientConn).RoundTrip in goroutine 2235
    /usr/local/go/src/net/http/h2_bundle.go:8298 +0x2ed

A few details from the panic that cross-check with the root cause analysis below:

  • crash address is 0x20 — not 0x0 — indicating a field access on a nil *T embedded inside an interface, not a nil interface dispatch. Consistent with hrs.rc being a non-nil interface whose concrete value was closed (its internal body pointer nilled) by another goroutine before this goroutine's Read finished dispatching.
  • ReadAt was called with off = 0x1400000 = 20 MiB and len(p) = 0x80000 = 512 KiB. 20 MiB is exactly the offset of part #5 in a multipart upload with the default PartSize = 5 MiB, confirming the AWS SDK upload manager was mid-multipart upload with at least 5 parts in flight.
  • the panic PC (pc=0x12d560b) lands at readerat.go:52, which is the nn, err = hrs.rc.Read(p) line inside the else branch of the io.ReaderAt type assertion.

Root cause

Annotated cache/remotecache/s3/readerat.go with the race points marked:

func (hrs *readerAtCloser) ReadAt(p []byte, off int64) (n int, err error) {
    if hrs.closed {                             // (1) unsynchronised read
        return 0, io.EOF
    }
    if hrs.ra != nil {                          // (2) unsynchronised read
        return hrs.ra.ReadAt(p, off)
    }
    if hrs.rc == nil || off != hrs.offset {     // (3) unsynchronised read of rc, offset
        if hrs.rc != nil {
            hrs.rc.Close()                      // (4) another goroutine may still be Read()ing this rc
            hrs.rc = nil                        // (5) transient nil — visible to concurrent readers
        }
        rc, err := hrs.open(off)
        if err != nil { return 0, err }
        hrs.rc = rc                             // (6) unsynchronised write; torn read possible
    }
    if ra, ok := hrs.rc.(io.ReaderAt); ok {
        hrs.ra = ra                             // (7) unsynchronised write
        n, err = ra.ReadAt(p, off)
    } else {
        for {
            var nn int
            nn, err = hrs.rc.Read(p)            // (8) line 52 — crash site; hrs.rc may be nil here
            n += nn
            p = p[nn:]
            if nn == len(p) || err != nil { break }
        }
    }
    hrs.offset += int64(n)                      // (9) unsynchronised read-modify-write
    return
}

Concrete interleaving that produces the panic:

  1. Goroutine A is deep inside hrs.rc.Read(p) at (8), reading from the S3 GET response body for offset X.
  2. Goroutine B enters ReadAt for offset Y ≠ X, reaches (3) hrs.rc != nil, enters the reopen path, calls hrs.rc.Close() at (4) — closing the very body A is currently reading from — then sets hrs.rc = nil at (5).
  3. Goroutine A's in-flight Read returns with either an error or short data, and control comes back to the loop at (8). The loop re-reads hrs.rc from memory on its next iteration, and now observes either nil (interface header zeroed by B at (5)) or a half-written interface header (torn read against the unsynchronised write at (6)).
  4. Either way, calling .Read on a nil or partially-zeroed interface dereferences an invalid pointer → SIGSEGV at readerat.go:52.

The addr=0x20 in the panic signal is consistent with accessing a field at offset 0x20 of a struct pointed to by the interface's value pointer — the second scenario above, where the interface header still has a type pointer but the value pointer has been concurrently invalidated.

Even when the scheduler does not interleave into the crash path, the -race detector flags multiple data races on rc, offset, and ra across every concurrent ReadAt call.

Reproduction

A self-contained reproduction as a Go unit test in cache/remotecache/s3/readerat_race_test.go (stdlib only, no S3 or network needed):

package s3

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"runtime"
	"sync"
	"sync/atomic"
	"testing"
)

// sequentialReadCloser simulates an S3 GetObject response body.
// Its Read yields the scheduler before and after to widen the race window.
type sequentialReadCloser struct {
	r      io.Reader
	closed atomic.Bool
}

func (s *sequentialReadCloser) Read(p []byte) (int, error) {
	runtime.Gosched()
	if s.closed.Load() {
		return 0, io.ErrClosedPipe
	}
	n, err := s.r.Read(p)
	runtime.Gosched()
	return n, err
}

func (s *sequentialReadCloser) Close() error {
	s.closed.Store(true)
	return nil
}

// mimics s3Client.getReader: opener serves bytes from the requested offset
// exactly as a real S3 GET with `Range: bytes=N-` would.
func newTestBlob() (func(int64) (io.ReadCloser, error), *atomic.Int64) {
	const blobSize = 25 * 1024 * 1024
	data := make([]byte, blobSize)
	for i := range data {
		data[i] = byte(i & 0xff)
	}
	var openCount atomic.Int64
	open := func(offset int64) (io.ReadCloser, error) {
		openCount.Add(1)
		if offset < 0 || offset > int64(len(data)) {
			return nil, io.EOF
		}
		return &sequentialReadCloser{r: bytes.NewReader(data[offset:])}, nil
	}
	return open, &openCount
}

// readPart simulates one AWS SDK upload manager worker: reads an entire
// 5 MiB part from the shared readerAtCloser using 512 KiB read buffers,
// matching the production stack trace values.
func readPart(rac ReaderAtCloser, partIdx int, partSize int64, readBuf int) error {
	partOff := int64(partIdx) * partSize
	buf := make([]byte, readBuf)
	var total int64
	for total < partSize {
		want := partSize - total
		if want > int64(readBuf) {
			want = int64(readBuf)
		}
		n, err := rac.ReadAt(buf[:want], partOff+total)
		total += int64(n)
		if err != nil {
			if errors.Is(err, io.EOF) {
				return nil
			}
			return err
		}
		if n == 0 {
			return io.ErrUnexpectedEOF
		}
	}
	return nil
}

// Under `-race`, this test fails because unsynchronised access to
// readerAtCloser fields is flagged. Without `-race`, the test may also
// panic or return corrupted data depending on scheduler timing.
func TestReaderAtCloser_ConcurrentDataRace(t *testing.T) {
	open, openCount := newTestBlob()
	rac := toReaderAtCloser(open)
	defer rac.Close()

	const (
		concurrency = 5               // DefaultUploadConcurrency
		partSize    = 5 * 1024 * 1024 // DefaultUploadPartSize
		readBufLen  = 512 * 1024      // 0x80000 from production stack trace
	)

	var wg sync.WaitGroup
	panics := make(chan any, concurrency)

	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(partIdx int) {
			defer wg.Done()
			defer func() {
				if r := recover(); r != nil {
					panics <- fmt.Errorf("goroutine %d: %v", partIdx, r)
				}
			}()
			_ = readPart(rac, partIdx, partSize, readBufLen)
		}(i)
	}
	wg.Wait()
	close(panics)

	for p := range panics {
		t.Errorf("panic observed: %v", p)
	}
	t.Logf("openCount=%d (reader was reopened per concurrent offset change)", openCount.Load())
}

// Runs the race up to 50 times to reliably reproduce the panic
// without needing `-race`.
func TestReaderAtCloser_ConcurrentPanicReproduction(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping slow reproduction in -short mode")
	}
	for attempt := 1; attempt <= 50; attempt++ {
		open, _ := newTestBlob()
		rac := toReaderAtCloser(open)

		var wg sync.WaitGroup
		var gotPanic atomic.Bool
		var panicMsg atomic.Value

		for i := 0; i < 5; i++ {
			wg.Add(1)
			go func(partIdx int) {
				defer wg.Done()
				defer func() {
					if r := recover(); r != nil {
						gotPanic.Store(true)
						panicMsg.Store(fmt.Sprintf("%v", r))
					}
				}()
				_ = readPart(rac, partIdx, 5*1024*1024, 512*1024)
			}(i)
		}
		wg.Wait()
		_ = rac.Close()

		if gotPanic.Load() {
			msg, _ := panicMsg.Load().(string)
			t.Fatalf("reproduced nil-pointer panic on attempt %d: %s", attempt, msg)
		}
	}
}

Running it

Against current master (a243ce438) with no patches:

$ go test -v -race -run TestReaderAtCloser_Concurrent -timeout 240s ./cache/remotecache/s3/
=== RUN   TestReaderAtCloser_ConcurrentDataRace
==================
WARNING: DATA RACE
Write at 0x... by goroutine 24:
  github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt()
      cache/remotecache/s3/readerat.go:61
Previous read at 0x... by goroutine 22:
  github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt()
      cache/remotecache/s3/readerat.go:35
...
==================
==================
WARNING: DATA RACE
Write at 0x... by goroutine 24:
  github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt()
      cache/remotecache/s3/readerat.go:52
...
==================
    readerat_race_test.go:183: openCount=45 (reader was reopened per concurrent offset change)
    testing.go:1617: race detected during execution of test
--- FAIL: TestReaderAtCloser_ConcurrentDataRace (0.12s)

=== RUN   TestReaderAtCloser_ConcurrentPanicReproduction
    readerat_race_test.go:... reproduced panic on attempt 2: runtime error: invalid memory address or nil pointer dereference
--- FAIL: TestReaderAtCloser_ConcurrentPanicReproduction (0.02s)

FAIL
FAIL	github.com/moby/buildkit/cache/remotecache/s3	0.533s

Note openCount=45 with 5 goroutines reading 5 parts — each part triggers ~9 reopens because each concurrent ReadAt at a different offset flips the off != hrs.offset branch and closes + reopens the underlying S3 body. This is both a correctness (race) and a performance (thrashing) problem.

Proposed minimum fix

A minimal sync.Mutex restores the io.ReaderAt thread-safety contract:

--- a/cache/remotecache/s3/readerat.go
+++ b/cache/remotecache/s3/readerat.go
@@ -2,6 +2,7 @@ package s3

 import (
 	"io"
+	"sync"
 )

 type ReaderAtCloser interface {
@@ -10,6 +11,7 @@ type ReaderAtCloser interface {
 }

 type readerAtCloser struct {
+	mu     sync.Mutex
 	offset int64
 	rc     io.ReadCloser
 	ra     io.ReaderAt
@@ -24,6 +26,9 @@ func toReaderAtCloser(open func(offset int64) (io.ReadCloser, error)) ReaderAtCl
 }

 func (hrs *readerAtCloser) ReadAt(p []byte, off int64) (n int, err error) {
+	hrs.mu.Lock()
+	defer hrs.mu.Unlock()
+
 	if hrs.closed {
 		return 0, io.EOF
 	}
@@ -63,6 +68,9 @@ func (hrs *readerAtCloser) ReadAt(p []byte, off int64) (n int, err error) {
 }

 func (hrs *readerAtCloser) Close() error {
+	hrs.mu.Lock()
+	defer hrs.mu.Unlock()
+
 	if hrs.closed {
 		return nil
 	}

With this diff applied, both tests above pass reliably: go test -race -count=5 ./cache/remotecache/s3/ reports ok, all 50 panic-reproduction attempts complete, and no data race is flagged.

Performance consideration and follow-up

The mutex is the minimum correctness fix but leaves a performance cliff: concurrent readers at different offsets still trigger close-then-reopen cycles, which thrashes S3 connections. With 5 worker goroutines from manager.Uploader each targeting a different 5 MiB part, every ReadAt flips the off != hrs.offset condition and reopens the underlying GetObject body. Under the mutex this becomes serialised per readerAtCloser: correct, but potentially slower for a single large blob that is being re-exported from S3 back to S3. This is not a global BuildKit lock: different cache layers can still upload in parallel, but one blob backed by this reader loses multipart read parallelism and still pays the reopen churn.

A proper fix would also eliminate the shared-state optimisation that causes the thrashing, for example by having ReaderAt open an independent reader per call (stateless), or by keeping a small pool of per-offset readers. That optimisation can be follow-up work after the correctness fix, or part of the #3993 refactor (which would also need a mutex added to the replacement contentutil.readerAt).

Version information

buildkit image: moby/buildkit:master, pulled from an upstream registry mirror on 2025-10-15, corresponding to master ~05fdd002b ("hack: use bake to build buildkit binaries"). Bug also reproduces against current master a243ce438b (2026-04-06).

cache/remotecache/s3/readerat.go has been byte-identical on master since commit 09c5a7c0e ("Add s3 remote cache", 2022-05-13). No changes after PR #5597.

Environment where the production crash was observed:

  • buildkitd running as the buildkit container of a Kubernetes pod with privileged: true
  • buildctl client invoking build --import-cache type=s3,... --export-cache type=s3,... with S3-compatible endpoint
  • S3-compatible object storage (Yandex Cloud Object Storage, storage.yandexcloud.net) with HTTP/2 transport
  • Layer sizes well above the 5 MiB DefaultUploadPartSize (Docker image layers — node_modules, Chromium, etc.)
