Skip to content

feat(storage): introduce storage refactoring proposal to reduce globa…#3927

Draft
rchincha wants to merge 1 commit into
project-zot:mainfrom
rchincha:storage-refactor
Draft

feat(storage): introduce storage refactoring proposal to reduce globa…#3927
rchincha wants to merge 1 commit into
project-zot:mainfrom
rchincha:storage-refactor

Conversation

@rchincha

@rchincha rchincha commented Apr 3, 2026

Copy link
Copy Markdown
Contributor

…l lock contention

What type of PR is this?

Which issue does this PR fix:

What does this PR do / Why do we need it:

If an issue # is not available please add repro steps and logs showing the issue:

Testing done on this change:

Automation added to e2e:

Will this break upgrades or downgrades?

Does this PR introduce any user-facing change?:


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…l lock contention

Signed-off-by: Ramkumar Chinchani <rchincha.dev@gmail.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a storage refactoring proposal document aimed at reducing ImageStore’s global lock contention by scoping locks per repository and separating shared blob dedupe into a dedicated blobstore domain.

Changes:

  • Introduces a design proposal for split lock domains (blobstore + per-repo locks) with a defined lock acquisition order.
  • Proposes a global deduplicated blob namespace and changes to remote dedupe behavior to avoid empty marker entries.
  • Provides rollout/upgrade guidance and expected performance impact scenarios.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/storage-refactoring-proposal.md

@andaaron andaaron left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is high level, but we should also decide if we should keep the same image store implementation for both local and remote. Having them together couples thing and adds extra requirements

For example if we separate them:

  1. We can completely remove the dedupe logic from the remote implementation. We don't need to think about locking and unlocking for dedupe and we can focus on locks in case of API delete and GC.
  2. We can theoretically remove the locking on the global store for GC at repo level. And do an independent GC on the global store level later. We don't care about global store lock during deletion API calls, because the deduped copies are independent at repo level. We would still need the global store lock on dedupe.

If we keep them the the same implementation we add limitations such as:

  1. In case of GC at global store level we need to lock all repos in case of local storage, because remote store has a single copy, even though it should be needed for local storage,
  2. In case of dedupe we may need to lock also for some lower level ops in the store implementation, even though remote storage shouldn't care about dedupe.

- mixed read/write traffic across many repositories is serialized more than necessary
- lock wait time grows with repository cardinality and concurrent clients

This is especially visible in multi-tenant deployments where repositories are independent in practice but coupled by lock scope.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this about, we don't have a tenant concept? This opens up a whole other discussion: Do we want to support multiple tenants in the same zot instance? Or do we want to support multiple tenants in a zot cluster on different nodes? Or do we say each tenant should have its own independent zot?

I think we should stick to the solving one issue at at time instead of increasing complexity. Handling remote vs local storage in the context of dedupe and GC should be resolved before getting into tenants. We should also reach a conclusion what is a valid zot cluster configuration before getting into tenants (in my opinion multiple zot instances sync-ing from each other using the sync extension is an exploit of the sync extension).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-tenant simply means users operating on separate repositories.


### 1) Introduce a global blobstore namespace

Use a dedicated hidden namespace (for example `_blobstore`) to store all deduplicated blob payloads.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note, if we are to implement #2752 the links we provide to the client would be to this namespace, so it wouldn't be "hidden".

Replace the single global lock domain with two levels:

- global blobstore lock: `RWMutex` protecting `_blobstore` metadata/data mutations
- per-repository locks: `RWMutex` per repository protecting repo-local metadata (manifests, tags, links, indexes)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note manifests and indexes are also blobs. So in case of DELETE APIC calls, dedupe and gc we would need to evaluate if the blob is used in other repositories before removing it from _blobstore


### 1) Introduce a global blobstore namespace

Use a dedicated hidden namespace (for example `_blobstore`) to store all deduplicated blob payloads.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is _blobstore supposed to comply with an OCI layout it is just a folder including only the content _blobstore/<hash_algo>/<hash_value>?
This should be captured in the proposal.

If it is an oci layout, what would we write in the index? What role would those entries have? In what case would we read it? I am thinking of GC on the global store. We could have an index of all the manifests present in all the repos, for running GC on the global store as a repo. But this also means maintaining it. And we already have the DB for that, correct?


#### Locking intent

- operations touching only repository metadata should use only that repository lock

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is vague. Manifests could be blobs shared across repos.
Maybe we should list these operations which would run using only the repository lock.

Comment on lines +103 to +104
- empty marker entries can interfere with OCI distribution spec expectations around blob/object layout and semantics
- each marker introduces additional backend operations (HEAD/LIST/PUT-like checks), increasing cloud API calls and cost

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very well said.

## Correctness and safety notes

- lock acquisition order must be uniformly enforced in all code paths
- blob existence checks and blob writes must remain atomic with respect to dedupe decisions

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- blob existence checks and blob writes must remain atomic with respect to dedupe decisions
- blob existence checks and blob writes must remain atomic with respect to dedupe and GC decisions


Implications:

- startup can be blocked while migration work is in progress

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe track what repositories have been migrated, maybe on a per repo level the "cache" DB? I am thinking we don't want to do this every time zot starts, and a system may be restarted in the middle of the migration, we don't want to assume everything was migrated already, or completelys tart over.

@rchincha rchincha Apr 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently a blocking call at start time.
Also, need to take into account a crash during migration (e.g., "ran out of space")

- smaller deployments where startup delay is acceptable
- environments where a one-time maintenance window is available

### Side-by-side sync/mirror migration

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

- https://github.com/project-zot/zot/pull/2968
- https://github.com/project-zot/zot/pull/3922

These PRs can be used to stage and validate the refactor incrementally.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to give this to the AI this is not enough.
I suggest you give it the exact order in which to use the PRs to solve the problem.

And there are still things to clarify before that.
For example in implement the _blobstore. And #3906 can be used as a reference, but this document may require something which does not necessarily allign wit the PR. I am thinking the full layout of global blobstore vs just the content of the blobs folder, decision which at the time I am writing this comment hasn't been made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants