Skip to content
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
176 changes: 176 additions & 0 deletions docs/storage-refactoring-proposal.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
# Storage refactoring proposal: reducing global lock contention

## Context

Today, `ImageStore` uses a single global `sync.RWMutex` for storage operations:

- source: https://github.com/project-zot/zot/blob/main/pkg/storage/imagestore/imagestore.go#L47

As repository count and write throughput increase, this lock becomes a bottleneck because unrelated repositories are forced to wait on each other.

## Problem statement

A single global lock introduces unnecessary contention:

- write activity in one repository can block writes in all other repositories
- mixed read/write traffic across many repositories is serialized more than necessary
- lock wait time grows with repository cardinality and concurrent clients

This is especially visible in multi-tenant deployments where repositories are independent in practice but coupled by lock scope.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this about, we don't have a tenant concept? This opens up a whole other discussion: Do we want to support multiple tenants in the same zot instance? Or do we want to support multiple tenants in a zot cluster on different nodes? Or do we say each tenant should have its own independent zot?

I think we should stick to the solving one issue at at time instead of increasing complexity. Handling remote vs local storage in the context of dedupe and GC should be resolved before getting into tenants. We should also reach a conclusion what is a valid zot cluster configuration before getting into tenants (in my opinion multiple zot instances sync-ing from each other using the sync extension is an exploit of the sync extension).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-tenant simply means users operating on separate repositories.


## Goals

- preserve correctness for manifest/index/tag/blob lifecycle operations
- reduce lock contention for unrelated repositories
- keep dedupe behavior for shared blobs
- improve remote backend efficiency and OCI distribution spec alignment

## Non-goals

- changing storage driver APIs beyond what is required for lock scoping and dedupe paths
- changing external registry API semantics

## Proposed design

### 1) Introduce a global blobstore namespace

Use a dedicated hidden namespace (for example `_blobstore`) to store all deduplicated blob payloads.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note, if we are to implement #2752 the links we provide to the client would be to this namespace, so it wouldn't be "hidden".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is _blobstore supposed to comply with an OCI layout it is just a folder including only the content _blobstore/<hash_algo>/<hash_value>?
This should be captured in the proposal.

If it is an oci layout, what would we write in the index? What role would those entries have? In what case would we read it? I am thinking of GC on the global store. We could have an index of all the manifests present in all the repos, for running GC on the global store as a repo. But this also means maintaining it. And we already have the DB for that, correct?


Rationale:

- blob bytes are global content-addressed objects by digest
- storing them once naturally supports dedupe across repositories
- `_blobstore` naming keeps it hidden from normal repository listings

### 2) Use split lock domains

Replace the single global lock domain with two levels:

- global blobstore lock: `RWMutex` protecting `_blobstore` metadata/data mutations
Comment thread
andaaron marked this conversation as resolved.
- per-repository locks: `RWMutex` per repository protecting repo-local metadata (manifests, tags, links, indexes)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note manifests and indexes are also blobs. So in case of DELETE APIC calls, dedupe and gc we would need to evaluate if the blob is used in other repositories before removing it from _blobstore


#### Locking intent

- operations touching only repository metadata should use only that repository lock

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is vague. Manifests could be blobs shared across repos.
Maybe we should list these operations which would run using only the repository lock.

- operations touching shared blob data should use blobstore lock
- operations spanning both should acquire both with a consistent lock order

#### Required lock order (to prevent deadlocks)

Always acquire:

1. blobstore lock (if needed)
2. repository lock (if needed)

Never invert this order in any call path.

### 3) Dedupe behavior by storage type

- local/filesystem-backed: keep content-addressed dedupe in `_blobstore`
- remote/object-backed: dedupe should not require creating empty marker files
Comment on lines +69 to +70

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs elaborating:

Suggested change
- local/filesystem-backed: keep content-addressed dedupe in `_blobstore`
- remote/object-backed: dedupe should not require creating empty marker files
In both local and remote storage cases keep content-addressed dedupe in `_blobstore`.
The difference lies in:
- local/filesystem-backed: keep a hardlink copy of the blob in the `_blobstore` inside the OCI layouts of the repositories using it
- remote/object-backed: do not keep a copy of the blob in the `_blobstore` for each repository using it

As opposed to the current implementation dedupe should not require creating empty marker files in remote storage.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

substores have their own _blobstore

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each imagestore has its own _blobstore


## Operation impact summary

The table below summarizes where improvements are expected.

| Case | Current (single global lock) | Proposed (blobstore + per-repo locks) | Expected impact |
| --- | --- | --- | --- |
| Concurrent writes to different repos, different blobs | serialized globally | parallel (independent repo locks; short blobstore sections only when needed) | major throughput gain |
| Writes to different repos, same blob digest | serialized globally | contention mostly scoped to blobstore lock around dedupe check/insert | better than current; bounded shared contention |
| Read in repo A + write in repo B | often blocked by global write lock | independent locks allow progress unless shared blobstore mutation conflicts | lower read latency under write load |
| Tag/manifest updates in independent repos | serialized globally | parallel under different repo locks | major reduction in lock wait |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true only for tag updates. The manifest is a blob, so it needs to be read/written to the global blob store.

@rchincha rchincha Apr 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about manifest blobs being private to a repo (so can lock only the repo) or are they deduped as well then need the _blobstore lock?

| Heavy read workload with light writes | write lock stalls unrelated reads | writes primarily block only affected lock domains | improved tail latency |
| High churn in one hot repo + quiet other repos | hot repo impacts everyone | isolation: hot repo mostly impacts itself (+ shared blobstore moments) | improved tenant isolation |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use the word tenant.


## Pathological case

The new scheme still has a worst-case contention pattern:

- many concurrent writes across repositories targeting the same small set of blob digests
- each flow needs blobstore coordination for dedupe decision/creation

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an entry to GC. In case of remote storage every repo needs to be locked/read before the blob is removed by GC from the common blob store.
For local storage it should be a bug deal since they are hardlinks. Deleting one leaves the blob available for the other deduped blobs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify GC behavior

In this case, the blobstore lock becomes the hotspot. This is still preferable to a single global lock because:

- contention is limited to shared blob paths instead of all storage operations
- repo-local operations that do not require blobstore mutation can continue concurrently

## Remote dedupe change: stop creating empty entries

For remote storage dedupe, do not create empty file entries as dedupe markers.

Why this change is required:

- empty marker entries can interfere with OCI distribution spec expectations around blob/object layout and semantics
- each marker introduces additional backend operations (HEAD/LIST/PUT-like checks), increasing cloud API calls and cost
Comment on lines +103 to +104

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well said.


Expected result:

- cleaner, spec-aligned remote storage behavior
- fewer backend queries and lower cloud cost under dedupe-heavy workloads

## Correctness and safety notes

- lock acquisition order must be uniformly enforced in all code paths
- blob existence checks and blob writes must remain atomic with respect to dedupe decisions

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- blob existence checks and blob writes must remain atomic with respect to dedupe decisions
- blob existence checks and blob writes must remain atomic with respect to dedupe and GC decisions

- repo metadata operations must remain atomic per repository
- metrics should track lock wait/hold times separately for blobstore and repository locks to validate improvement

## Rollout guidance

- keep changes behind internal implementation boundaries first (no API break)
- run existing storage, dedupe, and conformance suites
- add concurrency-focused tests for lock ordering and deadlock absence
- add benchmark scenarios: multi-repo parallel push/pull, shared-digest fan-in, hot-repo isolation

## Upgrade behavior

The migration story should support both in-place upgrade and side-by-side cutover.

Version scope for this proposal:

- source deployment series: zot 2.1.x
- target deployment series: zot 2.2.x (where this refactor is expected to land)

### In-place upgrade on startup

When a new zot binary starts with data written by an older zot version, startup may run storage migration steps to move data to the new layout/metadata conventions.

Implications:

- startup can be blocked while migration work is in progress

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe track what repositories have been migrated, maybe on a per repo level the "cache" DB? I am thinking we don't want to do this every time zot starts, and a system may be restarted in the middle of the migration, we don't want to assume everything was migrated already, or completelys tart over.

@rchincha rchincha Apr 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently a blocking call at start time.
Also, need to take into account a crash during migration (e.g., "ran out of space")

- large datasets or high repository counts can increase startup latency
- operationally simple (single deployment), but downtime/startup delay risk is higher

Recommended use:

- smaller deployments where startup delay is acceptable
- environments where a one-time maintenance window is available

### Side-by-side sync/mirror migration

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.


Stand up a new zot instance and configure sync/mirror from the old zot server. The new instance writes incoming data directly in the new storage format.

Implications:

- avoids long startup migration blocking on the new instance
- reduces upgrade risk by separating data movement from process startup
- enables progressive cutover after validation

Recommended use:

- large deployments where startup blocking is unacceptable
- environments needing lower-risk migration with rollback options

### Cutover and validation notes

- validate repository listing behavior and hidden `_blobstore` visibility rules before traffic switch
- compare blob/manifest counts between old and new instances during sync
- perform a staged cutover (read-only checks, then write traffic shift) to reduce migration risk

## Related WIP PRs

- https://github.com/project-zot/zot/pull/3906
- https://github.com/project-zot/zot/pull/2968
- https://github.com/project-zot/zot/pull/3922

These PRs can be used to stage and validate the refactor incrementally.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to give this to the AI this is not enough.
I suggest you give it the exact order in which to use the PRs to solve the problem.

And there are still things to clarify before that.
For example in implement the _blobstore. And #3906 can be used as a reference, but this document may require something which does not necessarily allign wit the PR. I am thinking the full layout of global blobstore vs just the content of the blobs folder, decision which at the time I am writing this comment hasn't been made.

Loading