Bug description

cache/remotecache/s3.readerAtCloser implements io.ReaderAt, whose documented contract explicitly allows concurrent callers:

"Clients of ReadAt can execute parallel ReadAt calls on the same input source."
However, readerAtCloser has no synchronization protecting its mutable state (rc, ra, offset, closed). When used as the body of an S3 upload via manager.Uploader, the AWS SDK v2 S3 upload manager spawns DefaultUploadConcurrency = 5 worker goroutines (upload.go#L601-605), each given its own io.NewSectionReader slice that shares the same underlying body. Each worker's Read eventually becomes a concurrent ReadAt call on the same *readerAtCloser at a different offset, triggering a data race and a nil pointer dereference panic that crashes buildkitd.
This happens when a cache layer is read through this S3-backed readerAtCloser and then uploaded through the AWS multipart uploader, so the common case is cache-from type=s3 together with cache-to type=s3 for layers larger than DefaultUploadPartSize = 5 MiB. The pod/container dies, the build fails, and no retry logic in the client (buildctl) can recover because the daemon itself is gone.
PR #5597 addressed a related offset handling bug in s3Client.getReader, but did not address the underlying thread-safety violation. The bug is still present on master as of 05fdd002b (2025-10-15) and a243ce438 (current master, 2026-04-06) — cache/remotecache/s3/readerat.go has not been modified since it was first added in 09c5a7c0e (2022-05-13).
Related but not duplicate:
[s3 cache] Fix upload of downloaded layer #5597 (merged) — added an offset parameter to getReader. Reduces the frequency of the race by avoiding unnecessary close/reopen on the common "sequential read" path, but does not prevent concurrent ReadAt calls from racing on shared state.
S3 Cache: Remove custom readerAt to use the standard one #3993 (stalled since Aug 2023) — refactor to remove the custom readerAtCloser and use contentutil.FromFetcher instead. The replacement readerAt in util/contentutil/fetcher.go also has no mutex, so this refactor would not fix the race either.
Observed in production
We first hit this on a build that pulls base cache from a branch prefix, reads additional cache from master prefix, and exports to the branch prefix — a standard cache layout. The buildkit image used was moby/buildkit:master pulled in mid-October 2025, which contains the #5597 fix; this rules out the offset bug as the cause.
Sanitized buildctl log fragment and stack trace (truncated):
A few details from the panic that cross-check with the root cause analysis below:
crash address is 0x20 — not 0x0 — indicating a field access on a nil *T embedded inside an interface, not a nil interface dispatch. Consistent with hrs.rc being a non-nil interface whose concrete value was closed (its internal body pointer nilled) by another goroutine before this goroutine's Read finished dispatching.
ReadAt was called with off = 0x1400000 = 20 MiB and len(p) = 0x80000 = 512 KiB. 20 MiB is exactly the offset of part #5 in a multipart upload with the default PartSize = 5 MiB, confirming the AWS SDK upload manager was mid-multipart upload with at least 5 parts in flight.
the panic PC (pc=0x12d560b) lands at readerat.go:52, which is the nn, err = hrs.rc.Read(p) line inside the else branch of the io.ReaderAt type assertion.
Root cause

Annotated cache/remotecache/s3/readerat.go with the race points marked:

func (hrs *readerAtCloser) ReadAt(p []byte, off int64) (n int, err error) {
	if hrs.closed { // (1) unsynchronised read
		return 0, io.EOF
	}
	if hrs.ra != nil { // (2) unsynchronised read
		return hrs.ra.ReadAt(p, off)
	}
	if hrs.rc == nil || off != hrs.offset { // (3) unsynchronised read of rc, offset
		if hrs.rc != nil {
			hrs.rc.Close() // (4) another goroutine may still be Read()ing this rc
			hrs.rc = nil   // (5) transient nil — visible to concurrent readers
		}
		rc, err := hrs.open(off)
		if err != nil {
			return 0, err
		}
		hrs.rc = rc // (6) unsynchronised write; torn read possible
	}
	if ra, ok := hrs.rc.(io.ReaderAt); ok {
		hrs.ra = ra // (7) unsynchronised write
		n, err = ra.ReadAt(p, off)
	} else {
		for {
			var nn int
			nn, err = hrs.rc.Read(p) // (8) line 52 — crash site; hrs.rc may be nil here
			n += nn
			p = p[nn:]
			if nn == len(p) || err != nil {
				break
			}
		}
	}
	hrs.offset += int64(n) // (9) unsynchronised read-modify-write
	return
}
Concrete interleaving that produces the panic:
Goroutine A is deep inside hrs.rc.Read(p) at (8), reading from the S3 GET response body for offset X.
Goroutine B enters ReadAt for offset Y ≠ X, reaches (3) hrs.rc != nil, enters the reopen path, calls hrs.rc.Close() at (4) — closing the very body A is currently reading from — then sets hrs.rc = nil at (5).
Goroutine A's in-flight Read returns with either an error or short data, and control returns to the loop at (8). Each loop iteration evaluates hrs.rc afresh for method dispatch; nothing pins the field in a register across the call, so the re-read observes B's writes: hrs.rc is now either nil (interface header zeroed by B at (5)) or a half-written interface header (a torn read racing the write at (6)).
Either way, calling .Read on a nil or partially-zeroed interface dereferences an invalid pointer → SIGSEGV at readerat.go:52.
The addr=0x20 in the panic signal is consistent with accessing a field at offset 0x20 of a struct pointed to by the interface's value pointer — the second scenario above, where the interface header still has a type pointer but the value pointer has been concurrently invalidated.
Even when the scheduler does not interleave into the crash path, the -race detector flags multiple data races on rc, offset, and ra across every concurrent ReadAt call.
Reproduction
A self-contained reproduction as a Go unit test in cache/remotecache/s3/readerat_race_test.go (stdlib only, no S3 or network needed):
package s3

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"runtime"
	"sync"
	"sync/atomic"
	"testing"
)

// sequentialReadCloser simulates an S3 GetObject response body.
// Its Read yields the scheduler before and after to widen the race window.
type sequentialReadCloser struct {
	r      io.Reader
	closed atomic.Bool
}

func (s *sequentialReadCloser) Read(p []byte) (int, error) {
	runtime.Gosched()
	if s.closed.Load() {
		return 0, io.ErrClosedPipe
	}
	n, err := s.r.Read(p)
	runtime.Gosched()
	return n, err
}

func (s *sequentialReadCloser) Close() error {
	s.closed.Store(true)
	return nil
}

// newTestBlob mimics s3Client.getReader: the opener serves bytes from the
// requested offset exactly as a real S3 GET with `Range: bytes=N-` would.
func newTestBlob() (func(int64) (io.ReadCloser, error), *atomic.Int64) {
	const blobSize = 25 * 1024 * 1024
	data := make([]byte, blobSize)
	for i := range data {
		data[i] = byte(i & 0xff)
	}
	var openCount atomic.Int64
	open := func(offset int64) (io.ReadCloser, error) {
		openCount.Add(1)
		if offset < 0 || offset > int64(len(data)) {
			return nil, io.EOF
		}
		return &sequentialReadCloser{r: bytes.NewReader(data[offset:])}, nil
	}
	return open, &openCount
}

// readPart simulates one AWS SDK upload manager worker: it reads an entire
// 5 MiB part from the shared readerAtCloser using 512 KiB read buffers,
// matching the production stack trace values.
func readPart(rac ReaderAtCloser, partIdx int, partSize int64, readBuf int) error {
	partOff := int64(partIdx) * partSize
	buf := make([]byte, readBuf)
	var total int64
	for total < partSize {
		want := partSize - total
		if want > int64(readBuf) {
			want = int64(readBuf)
		}
		n, err := rac.ReadAt(buf[:want], partOff+total)
		total += int64(n)
		if err != nil {
			if errors.Is(err, io.EOF) {
				return nil
			}
			return err
		}
		if n == 0 {
			return io.ErrUnexpectedEOF
		}
	}
	return nil
}

// Under `-race`, this test fails because unsynchronised access to
// readerAtCloser fields is flagged. Without `-race`, the test may also
// panic or return corrupted data depending on scheduler timing.
func TestReaderAtCloser_ConcurrentDataRace(t *testing.T) {
	open, openCount := newTestBlob()
	rac := toReaderAtCloser(open)
	defer rac.Close()
	const (
		concurrency = 5               // DefaultUploadConcurrency
		partSize    = 5 * 1024 * 1024 // DefaultUploadPartSize
		readBufLen  = 512 * 1024      // 0x80000 from production stack trace
	)
	var wg sync.WaitGroup
	panics := make(chan any, concurrency)
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(partIdx int) {
			defer wg.Done()
			defer func() {
				if r := recover(); r != nil {
					panics <- fmt.Errorf("goroutine %d: %v", partIdx, r)
				}
			}()
			_ = readPart(rac, partIdx, partSize, readBufLen)
		}(i)
	}
	wg.Wait()
	close(panics)
	for p := range panics {
		t.Errorf("panic observed: %v", p)
	}
	t.Logf("openCount=%d (reader was reopened per concurrent offset change)", openCount.Load())
}

// TestReaderAtCloser_ConcurrentPanicReproduction runs the race up to 50 times
// to deterministically reproduce the panic without needing `-race`.
func TestReaderAtCloser_ConcurrentPanicReproduction(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping slow reproduction in -short mode")
	}
	for attempt := 1; attempt <= 50; attempt++ {
		open, _ := newTestBlob()
		rac := toReaderAtCloser(open)
		var wg sync.WaitGroup
		var gotPanic atomic.Bool
		var panicMsg atomic.Value
		for i := 0; i < 5; i++ {
			wg.Add(1)
			go func(partIdx int) {
				defer wg.Done()
				defer func() {
					if r := recover(); r != nil {
						gotPanic.Store(true)
						panicMsg.Store(fmt.Sprintf("%v", r))
					}
				}()
				_ = readPart(rac, partIdx, 5*1024*1024, 512*1024)
			}(i)
		}
		wg.Wait()
		_ = rac.Close()
		if gotPanic.Load() {
			msg, _ := panicMsg.Load().(string)
			t.Fatalf("reproduced nil-pointer panic on attempt %d: %s", attempt, msg)
		}
	}
}
Running it
Against current master (a243ce438) with no patches:
$ go test -v -race -run TestReaderAtCloser_Concurrent -timeout 240s ./cache/remotecache/s3/
=== RUN TestReaderAtCloser_ConcurrentDataRace
==================
WARNING: DATA RACE
Write at 0x... by goroutine 24:
github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt()
cache/remotecache/s3/readerat.go:61
Previous read at 0x... by goroutine 22:
github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt()
cache/remotecache/s3/readerat.go:35
...
==================
==================
WARNING: DATA RACE
Write at 0x... by goroutine 24:
github.com/moby/buildkit/cache/remotecache/s3.(*readerAtCloser).ReadAt()
cache/remotecache/s3/readerat.go:52
...
==================
readerat_race_test.go:183: openCount=45 (reader was reopened per concurrent offset change)
testing.go:1617: race detected during execution of test
--- FAIL: TestReaderAtCloser_ConcurrentDataRace (0.12s)
=== RUN TestReaderAtCloser_ConcurrentPanicReproduction
readerat_race_test.go:... reproduced panic on attempt 2: runtime error: invalid memory address or nil pointer dereference
--- FAIL: TestReaderAtCloser_ConcurrentPanicReproduction (0.02s)
FAIL
FAIL github.com/moby/buildkit/cache/remotecache/s3 0.533s
Note openCount=45 with 5 goroutines reading 5 parts — each part triggers ~9 reopens because each concurrent ReadAt at a different offset flips the off != hrs.offset branch and closes + reopens the underlying S3 body. This is both a correctness (race) and a performance (thrashing) problem.
Proposed minimum fix
A minimal sync.Mutex restores the io.ReaderAt thread-safety contract:
With that locking applied, both tests above pass reliably: go test -race -count=5 ./cache/remotecache/s3/ → ok, 50 panic-reproduction attempts with no panic, no data race flagged.
Performance consideration and follow-up
The mutex is the minimum correctness fix but leaves a performance cliff: concurrent readers at different offsets still trigger close-then-reopen cycles, which thrashes S3 connections. With 5 worker goroutines from manager.Uploader each targeting a different 5 MiB part, every ReadAt flips the off != hrs.offset condition and reopens the underlying GetObject body. Under the mutex this becomes serialised per readerAtCloser: correct, but potentially slower for a single large blob that is being re-exported from S3 back to S3. This is not a global BuildKit lock: different cache layers can still upload in parallel, but one blob backed by this reader loses multipart read parallelism and still pays the reopen churn.
A proper fix would also eliminate the shared-state optimisation that causes the thrashing, for example by having ReaderAt open an independent reader per call (stateless), or by keeping a small pool of per-offset readers. That optimisation can be follow-up work after the correctness fix, or part of the #3993 refactor (which would also need a mutex added to the replacement contentutil.readerAt).
Version information
buildkit image: moby/buildkit:master, pulled from an upstream registry mirror on 2025-10-15, corresponding to master ~05fdd002b (hack: use bake to build buildkit binaries). Bug also reproduces against current master a243ce438b (2026-04-06).
cache/remotecache/s3/readerat.go has been byte-identical on master since commit 09c5a7c0e ("Add s3 remote cache", 2022-05-13). No changes after PR #5597.
Environment where the production crash was observed:
buildkitd running as the buildkit container of a Kubernetes pod with privileged: true
buildctl build --import-cache type=s3,... --export-cache type=s3,... against an S3-compatible endpoint (storage.yandexcloud.net) with HTTP/2 transport
cache layers larger than DefaultUploadPartSize (Docker image layers: node_modules, Chromium, etc.)