s3 remote cache: data race / nil pointer panic in cache/remotecache/s3.readerAtCloser under concurrent multipart upload #6674

@kuzaxak

Description

Bug description

cache/remotecache/s3.readerAtCloser implements io.ReaderAt, whose documented contract explicitly allows concurrent callers:

Clients of ReadAt can execute parallel ReadAt calls on the same input source.

However, readerAtCloser has no synchronization protecting its mutable state (rc, ra, offset, closed). When used as the body of an S3 upload via manager.Uploader, the AWS SDK v2 S3 upload manager spawns DefaultUploadConcurrency = 5 worker goroutines (upload.go#L601-605), each given its own io.NewSectionReader slice that shares the same underlying body. Each worker's Read eventually becomes a concurrent ReadAt call on the same *readerAtCloser at a different offset, triggering a data race and a nil pointer dereference panic that crashes buildkitd.

This happens when a cache layer is read through this S3-backed readerAtCloser and then uploaded through the AWS multipart uploader, so the common case is cache-from type=s3 together with cache-to type=s3 for layers larger than DefaultUploadPartSize = 5 MiB. The pod/container dies, the build fails, and no retry logic in the client (buildctl) can recover because the daemon itself is gone.

PR #5597 addressed a related offset handling bug in s3Client.getReader, but did not address the underlying thread-safety violation. The bug is still present on master as of 05fdd002b (2025-10-15) and a243ce438 (current master, 2026-04-06) — cache/remotecache/s3/readerat.go has not been modified since it was first added in 09c5a7c0e (2022-05-13).

Related but not duplicate:

Observed in production

We first hit this on a build that pulls base cache from a branch prefix, reads additional cache from master prefix, and exports to the branch prefix — a standard cache layout. The buildkit image used was moby/buildkit:master pulled in mid-October 2025, which contains the #5597 fix; this rules out the offset bug as the cause.

Sanitized buildctl log fragment and stack trace (truncated):

+ buildctl --addr unix:///run/buildkit/buildkitd.sock build \
    --frontend dockerfile.v0 --local context=. --local dockerfile=. --opt filename=Dockerfile \
    --secret id=GITHUB_USERNAME --secret id=GITHUB_TOKEN --secret id=GITHUB_HEADER \
    --opt build-arg:APP_VERSION=<sha> \
    --output type=image,name=<registry>/<image>:<tag>,push=true \
    --import-cache type=s3,endpoint_url=https://<s3-compatible>,region=<region>,bucket=<bucket>,blobs_prefix=blobs/,manifests_prefix=manifests/,access_key_id=<redacted>,secret_access_key=<redacted>,prefix=s3cache/branch/<image>/,name=<pr-branch> \
    --import-cache type=s3,endpoint_url=https://<s3-compatible>,region=<region>,bucket=<bucket>,blobs_prefix=blobs/,manifests_prefix=manifests/,access_key_id=<redacted>,secret_access_key=<redacted>,prefix=s3cache/master/<image>/,name=master \
    --export-cache type=s3,endpoint_url=https://<s3-compatible>,region=<region>,bucket=<bucket>,blobs_prefix=blobs/,manifests_prefix=manifests/,access_key_id=<redacted>,secret_access_key=<redacted>,prefix=s3cache/branch/<image>/,name=<pr-branch>

#1  [internal] load build definition from Dockerfile                                  DONE 0.2s
#2  [internal] load metadata for docker.io/library/node:20-bookworm                   DONE 1.8s
...
#19 exporting to image                                                                DONE 0.8s
#20 exporting cache to Amazon S3
#20 preparing build cache for export
#20 writing layer sha256:56784d23502ebb983d6ce883fb06c5be67d52b73c2d27cf4a4bd52072954e01a

buildkit container terminated (exit code 2)

time="2026-04-09T06:28:08Z" level=warning msg="failed to update distribution source for layer sha256:<redacted>: content digest sha256:<redacted>: not found" span="exporting to image"
[... 7 more identical warnings for other layers ...]

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x12d560b]

goroutine 2421 [running]:
github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt(0xc0006112c0, {0xc006080000, 0x80000, 0x80000}, 0x1400000)
    /src/cache/remotecache/s3/readerat.go:52 +0x1cb
io.(*SectionReader).ReadAt(0xc0009eaa28?, {0xc006080000?, 0x0?, 0x427b25?}, 0x40?)
    /usr/local/go/src/io/io.go:555 +0x93
io.(*SectionReader).Read(0xc0009382d0, {0xc006080000?, 0x2b980c0?, 0xc0009eab28?})
    /usr/local/go/src/io/io.go:516 +0x4f
github.com/aws/smithy-go/transport/http/internal/io.(*safeReadCloser).Read(0x458269?, {0xc006080000?, 0x17f0cc0?, 0x1?})
    /src/vendor/github.com/aws/smithy-go/transport/http/internal/io/safe.go:59 +0x11d
net/http.(*readTrackingBody).Read(0x2b33be8?, {0xc006080000?, 0x20d?, 0x1000?})
    /usr/local/go/src/net/http/transport.go:657 +0x27
net/http.(*http2clientStream).writeRequestBody(0xc005558180, 0xc00554ec60)
    /usr/local/go/src/net/http/h2_bundle.go:8799 +0x43f
net/http.(*http2clientStream).writeRequest(0xc005558180, 0xc00554ec60)
    /usr/local/go/src/net/http/h2_bundle.go:8506 +0x852
net/http.(*http2clientStream).doRequest(0xc005558180, 0x5?)
    /usr/local/go/src/net/http/h2_bundle.go:8392 +0x18
created by net/http.(*http2ClientConn).RoundTrip in goroutine 2235
    /usr/local/go/src/net/http/h2_bundle.go:8298 +0x2ed

A few details from the panic that cross-check with the root cause analysis below:

  • crash address is 0x20 — not 0x0 — indicating a field access on a nil *T embedded inside an interface, not a nil interface dispatch. Consistent with hrs.rc being a non-nil interface whose concrete value was closed (its internal body pointer nilled) by another goroutine before this goroutine's Read finished dispatching.
  • ReadAt was called with off = 0x1400000 = 20 MiB and len(p) = 0x80000 = 512 KiB. 20 MiB is exactly the offset of part #5 in a multipart upload with the default PartSize = 5 MiB, confirming the AWS SDK upload manager was mid-multipart upload with at least 5 parts in flight.
  • the panic PC (pc=0x12d560b) lands at readerat.go:52, which is the nn, err = hrs.rc.Read(p) line inside the else branch of the io.ReaderAt type assertion.

Root cause

Annotated cache/remotecache/s3/readerat.go with the race points marked:

func (hrs *readerAtCloser) ReadAt(p []byte, off int64) (n int, err error) {
    if hrs.closed {                             // (1) unsynchronised read
        return 0, io.EOF
    }
    if hrs.ra != nil {                          // (2) unsynchronised read
        return hrs.ra.ReadAt(p, off)
    }
    if hrs.rc == nil || off != hrs.offset {     // (3) unsynchronised read of rc, offset
        if hrs.rc != nil {
            hrs.rc.Close()                      // (4) another goroutine may still be Read()ing this rc
            hrs.rc = nil                        // (5) transient nil — visible to concurrent readers
        }
        rc, err := hrs.open(off)
        if err != nil { return 0, err }
        hrs.rc = rc                             // (6) unsynchronised write; torn read possible
    }
    if ra, ok := hrs.rc.(io.ReaderAt); ok {
        hrs.ra = ra                             // (7) unsynchronised write
        n, err = ra.ReadAt(p, off)
    } else {
        for {
            var nn int
            nn, err = hrs.rc.Read(p)            // (8) line 52 — crash site; hrs.rc may be nil here
            n += nn
            p = p[nn:]
            if nn == len(p) || err != nil { break }
        }
    }
    hrs.offset += int64(n)                      // (9) unsynchronised read-modify-write
    return
}

Concrete interleaving that produces the panic:

  1. Goroutine A is deep inside hrs.rc.Read(p) at (8), reading from the S3 GET response body for offset X.
  2. Goroutine B enters ReadAt for offset Y ≠ X, reaches (3) hrs.rc != nil, enters the reopen path, calls hrs.rc.Close() at (4) — closing the very body A is currently reading from — then sets hrs.rc = nil at (5).
  3. Goroutine A's in-flight Read returns with either an error or short data, and control comes back to the loop at (8). The loop re-reads hrs.rc from memory on its next iteration, and now observes either nil (interface header zeroed by B at (5)) or a half-written interface header (torn read against the unsynchronised write at (6)).
  4. Either way, calling .Read on a nil or partially-zeroed interface dereferences an invalid pointer → SIGSEGV at readerat.go:52.

The addr=0x20 in the panic signal is consistent with accessing a field at offset 0x20 of a struct pointed to by the interface's value pointer — the second scenario above, where the interface header still has a type pointer but the value pointer has been concurrently invalidated.

Even when the scheduler does not interleave into the crash path, the -race detector flags multiple data races on rc, offset, and ra across every concurrent ReadAt call.

Reproduction

A self-contained reproduction as a Go unit test in cache/remotecache/s3/readerat_race_test.go (stdlib only, no S3 or network needed):

package s3

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"runtime"
	"sync"
	"sync/atomic"
	"testing"
)

// sequentialReadCloser simulates an S3 GetObject response body.
// Its Read yields the scheduler before and after to widen the race window.
type sequentialReadCloser struct {
	r      io.Reader
	closed atomic.Bool
}

func (s *sequentialReadCloser) Read(p []byte) (int, error) {
	runtime.Gosched()
	if s.closed.Load() {
		return 0, io.ErrClosedPipe
	}
	n, err := s.r.Read(p)
	runtime.Gosched()
	return n, err
}

func (s *sequentialReadCloser) Close() error {
	s.closed.Store(true)
	return nil
}

// mimics s3Client.getReader: opener serves bytes from the requested offset
// exactly as a real S3 GET with `Range: bytes=N-` would.
func newTestBlob() (func(int64) (io.ReadCloser, error), *atomic.Int64) {
	const blobSize = 25 * 1024 * 1024
	data := make([]byte, blobSize)
	for i := range data {
		data[i] = byte(i & 0xff)
	}
	var openCount atomic.Int64
	open := func(offset int64) (io.ReadCloser, error) {
		openCount.Add(1)
		if offset < 0 || offset > int64(len(data)) {
			return nil, io.EOF
		}
		return &sequentialReadCloser{r: bytes.NewReader(data[offset:])}, nil
	}
	return open, &openCount
}

// readPart simulates one AWS SDK upload manager worker: reads an entire
// 5 MiB part from the shared readerAtCloser using 512 KiB read buffers,
// matching the production stack trace values.
func readPart(rac ReaderAtCloser, partIdx int, partSize int64, readBuf int) error {
	partOff := int64(partIdx) * partSize
	buf := make([]byte, readBuf)
	var total int64
	for total < partSize {
		want := partSize - total
		if want > int64(readBuf) {
			want = int64(readBuf)
		}
		n, err := rac.ReadAt(buf[:want], partOff+total)
		total += int64(n)
		if err != nil {
			if errors.Is(err, io.EOF) {
				return nil
			}
			return err
		}
		if n == 0 {
			return io.ErrUnexpectedEOF
		}
	}
	return nil
}

// Under `-race`, this test fails because unsynchronised access to
// readerAtCloser fields is flagged. Without `-race`, the test may also
// panic or return corrupted data depending on scheduler timing.
func TestReaderAtCloser_ConcurrentDataRace(t *testing.T) {
	open, openCount := newTestBlob()
	rac := toReaderAtCloser(open)
	defer rac.Close()

	const (
		concurrency = 5               // DefaultUploadConcurrency
		partSize    = 5 * 1024 * 1024 // DefaultUploadPartSize
		readBufLen  = 512 * 1024      // 0x80000 from production stack trace
	)

	var wg sync.WaitGroup
	panics := make(chan any, concurrency)

	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(partIdx int) {
			defer wg.Done()
			defer func() {
				if r := recover(); r != nil {
					panics <- fmt.Errorf("goroutine %d: %v", partIdx, r)
				}
			}()
			_ = readPart(rac, partIdx, partSize, readBufLen)
		}(i)
	}
	wg.Wait()
	close(panics)

	for p := range panics {
		t.Errorf("panic observed: %v", p)
	}
	t.Logf("openCount=%d (reader was reopened per concurrent offset change)", openCount.Load())
}

// Runs the race up to 50 times to reliably reproduce the panic
// without needing `-race`.
func TestReaderAtCloser_ConcurrentPanicReproduction(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping slow reproduction in -short mode")
	}
	for attempt := 1; attempt <= 50; attempt++ {
		open, _ := newTestBlob()
		rac := toReaderAtCloser(open)

		var wg sync.WaitGroup
		var gotPanic atomic.Bool
		var panicMsg atomic.Value

		for i := 0; i < 5; i++ {
			wg.Add(1)
			go func(partIdx int) {
				defer wg.Done()
				defer func() {
					if r := recover(); r != nil {
						gotPanic.Store(true)
						panicMsg.Store(fmt.Sprintf("%v", r))
					}
				}()
				_ = readPart(rac, partIdx, 5*1024*1024, 512*1024)
			}(i)
		}
		wg.Wait()
		_ = rac.Close()

		if gotPanic.Load() {
			msg, _ := panicMsg.Load().(string)
			t.Fatalf("reproduced nil-pointer panic on attempt %d: %s", attempt, msg)
		}
	}
}

Running it

Against current master (a243ce438) with no patches:

$ go test -v -race -run TestReaderAtCloser_Concurrent -timeout 240s ./cache/remotecache/s3/
=== RUN   TestReaderAtCloser_ConcurrentDataRace
==================
WARNING: DATA RACE
Write at 0x... by goroutine 24:
  github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt()
      cache/remotecache/s3/readerat.go:61
Previous read at 0x... by goroutine 22:
  github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt()
      cache/remotecache/s3/readerat.go:35
...
==================
==================
WARNING: DATA RACE
Write at 0x... by goroutine 24:
  github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt()
      cache/remotecache/s3/readerat.go:52
...
==================
    readerat_race_test.go:183: openCount=45 (reader was reopened per concurrent offset change)
    testing.go:1617: race detected during execution of test
--- FAIL: TestReaderAtCloser_ConcurrentDataRace (0.12s)

=== RUN   TestReaderAtCloser_ConcurrentPanicReproduction
    readerat_race_test.go:... reproduced panic on attempt 2: runtime error: invalid memory address or nil pointer dereference
--- FAIL: TestReaderAtCloser_ConcurrentPanicReproduction (0.02s)

FAIL
FAIL	github.com/moby/buildkit/cache/remotecache/s3	0.533s

Note openCount=45 with 5 goroutines reading 5 parts — each part triggers ~9 reopens because each concurrent ReadAt at a different offset flips the off != hrs.offset branch and closes + reopens the underlying S3 body. This is both a correctness (race) and a performance (thrashing) problem.

Proposed minimum fix

A minimal sync.Mutex restores the io.ReaderAt thread-safety contract:

--- a/cache/remotecache/s3/readerat.go
+++ b/cache/remotecache/s3/readerat.go
@@ -2,6 +2,7 @@ package s3

 import (
 	"io"
+	"sync"
 )

 type ReaderAtCloser interface {
@@ -10,6 +11,7 @@ type ReaderAtCloser interface {
 }

 type readerAtCloser struct {
+	mu     sync.Mutex
 	offset int64
 	rc     io.ReadCloser
 	ra     io.ReaderAt
@@ -24,6 +26,9 @@ func toReaderAtCloser(open func(offset int64) (io.ReadCloser, error)) ReaderAtCl
 }

 func (hrs *readerAtCloser) ReadAt(p []byte, off int64) (n int, err error) {
+	hrs.mu.Lock()
+	defer hrs.mu.Unlock()
+
 	if hrs.closed {
 		return 0, io.EOF
 	}
@@ -63,6 +68,9 @@ func (hrs *readerAtCloser) ReadAt(p []byte, off int64) (n int, err error) {
 }

 func (hrs *readerAtCloser) Close() error {
+	hrs.mu.Lock()
+	defer hrs.mu.Unlock()
+
 	if hrs.closed {
 		return nil
 	}

With this diff applied, both tests above pass reliably: go test -race -count=5 ./cache/remotecache/s3/ reports ok, all 50 panic-reproduction attempts complete, and no data race is flagged.

Performance consideration and follow-up

The mutex is the minimum correctness fix but leaves a performance cliff: concurrent readers at different offsets still trigger close-then-reopen cycles, which thrashes S3 connections. With 5 worker goroutines from manager.Uploader each targeting a different 5 MiB part, every ReadAt flips the off != hrs.offset condition and reopens the underlying GetObject body. Under the mutex this becomes serialised per readerAtCloser: correct, but potentially slower for a single large blob that is being re-exported from S3 back to S3. This is not a global BuildKit lock: different cache layers can still upload in parallel, but one blob backed by this reader loses multipart read parallelism and still pays the reopen churn.

A proper fix would also eliminate the shared-state optimisation that causes the thrashing, for example by having ReaderAt open an independent reader per call (stateless), or by keeping a small pool of per-offset readers. That optimisation can be follow-up work after the correctness fix, or part of the #3993 refactor (which would also need a mutex added to the replacement contentutil.readerAt).

Version information

buildkit image: moby/buildkit:master, pulled from an upstream registry mirror on 2025-10-15, corresponding to master ~05fdd002b ("hack: use bake to build buildkit binaries"). Bug also reproduces against current master a243ce438b (2026-04-06).

cache/remotecache/s3/readerat.go has been byte-identical on master since commit 09c5a7c0e ("Add s3 remote cache", 2022-05-13). No changes after PR #5597.

Environment where the production crash was observed:

  • buildkitd running as the buildkit container of a Kubernetes pod with privileged: true
  • buildctl client invoking build --import-cache type=s3,... --export-cache type=s3,... with S3-compatible endpoint
  • S3-compatible object storage (Yandex Cloud Object Storage, storage.yandexcloud.net) with HTTP/2 transport
  • Layer sizes well above the 5 MiB DefaultUploadPartSize (Docker image layers — node_modules, Chromium, etc.)
