-
Notifications
You must be signed in to change notification settings - Fork 231
feat(storage): introduce storage refactoring proposal to reduce globa… #3927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,176 @@ | ||||||||||||||
| # Storage refactoring proposal: reducing global lock contention | ||||||||||||||
|
|
||||||||||||||
| ## Context | ||||||||||||||
|
|
||||||||||||||
| Today, `ImageStore` uses a single global `sync.RWMutex` for storage operations: | ||||||||||||||
|
|
||||||||||||||
| - source: https://github.com/project-zot/zot/blob/main/pkg/storage/imagestore/imagestore.go#L47 | ||||||||||||||
|
|
||||||||||||||
| As repository count and write throughput increase, this lock becomes a bottleneck because unrelated repositories are forced to wait on each other. | ||||||||||||||
|
|
||||||||||||||
| ## Problem statement | ||||||||||||||
|
|
||||||||||||||
| A single global lock introduces unnecessary contention: | ||||||||||||||
|
|
||||||||||||||
| - write activity in one repository can block writes in all other repositories | ||||||||||||||
| - mixed read/write traffic across many repositories is serialized more than necessary | ||||||||||||||
| - lock wait time grows with repository cardinality and concurrent clients | ||||||||||||||
|
|
||||||||||||||
| This is especially visible in multi-tenant deployments where repositories are independent in practice but coupled by lock scope. | ||||||||||||||
|
|
||||||||||||||
| ## Goals | ||||||||||||||
|
|
||||||||||||||
| - preserve correctness for manifest/index/tag/blob lifecycle operations | ||||||||||||||
| - reduce lock contention for unrelated repositories | ||||||||||||||
| - keep dedupe behavior for shared blobs | ||||||||||||||
| - improve remote backend efficiency and OCI distribution spec alignment | ||||||||||||||
|
|
||||||||||||||
| ## Non-goals | ||||||||||||||
|
|
||||||||||||||
| - changing storage driver APIs beyond what is required for lock scoping and dedupe paths | ||||||||||||||
| - changing external registry API semantics | ||||||||||||||
|
|
||||||||||||||
| ## Proposed design | ||||||||||||||
|
|
||||||||||||||
| ### 1) Introduce a global blobstore namespace | ||||||||||||||
|
|
||||||||||||||
| Use a dedicated hidden namespace (for example `_blobstore`) to store all deduplicated blob payloads. | ||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just a note, if we are to implement #2752 the links we provide to the client would be to this namespace, so it wouldn't be "hidden".
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also is If it is an oci layout, what would we write in the index? What role would those entries have? In what case would we read it? I am thinking of GC on the global store. We could have an index of all the manifests present in all the repos, for running GC on the global store as a repo. But this also means maintaining it. And we already have the DB for that, correct? |
||||||||||||||
|
|
||||||||||||||
| Rationale: | ||||||||||||||
|
|
||||||||||||||
| - blob bytes are global content-addressed objects by digest | ||||||||||||||
| - storing them once naturally supports dedupe across repositories | ||||||||||||||
| - `_blobstore` naming keeps it hidden from normal repository listings | ||||||||||||||
|
|
||||||||||||||
| ### 2) Use split lock domains | ||||||||||||||
|
|
||||||||||||||
| Replace the single global lock domain with two levels: | ||||||||||||||
|
|
||||||||||||||
| - global blobstore lock: `RWMutex` protecting `_blobstore` metadata/data mutations | ||||||||||||||
|
andaaron marked this conversation as resolved.
|
||||||||||||||
| - per-repository locks: `RWMutex` per repository protecting repo-local metadata (manifests, tags, links, indexes) | ||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note manifests and indexes are also blobs. So in case of DELETE APIC calls, dedupe and gc we would need to evaluate if the blob is used in other repositories before removing it from |
||||||||||||||
|
|
||||||||||||||
| #### Locking intent | ||||||||||||||
|
|
||||||||||||||
| - operations touching only repository metadata should use only that repository lock | ||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is vague. Manifests could be blobs shared across repos. |
||||||||||||||
| - operations touching shared blob data should use blobstore lock | ||||||||||||||
| - operations spanning both should acquire both with a consistent lock order | ||||||||||||||
|
|
||||||||||||||
| #### Required lock order (to prevent deadlocks) | ||||||||||||||
|
|
||||||||||||||
| Always acquire: | ||||||||||||||
|
|
||||||||||||||
| 1. blobstore lock (if needed) | ||||||||||||||
| 2. repository lock (if needed) | ||||||||||||||
|
|
||||||||||||||
| Never invert this order in any call path. | ||||||||||||||
|
|
||||||||||||||
| ### 3) Dedupe behavior by storage type | ||||||||||||||
|
|
||||||||||||||
| - local/filesystem-backed: keep content-addressed dedupe in `_blobstore` | ||||||||||||||
| - remote/object-backed: dedupe should not require creating empty marker files | ||||||||||||||
|
Comment on lines
+69
to
+70
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This needs elaborating:
Suggested change
As opposed to the current implementation dedupe should not require creating empty marker files in remote storage.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. substores have their own _blobstore
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. each imagestore has its own _blobstore |
||||||||||||||
|
|
||||||||||||||
| ## Operation impact summary | ||||||||||||||
|
|
||||||||||||||
| The table below summarizes where improvements are expected. | ||||||||||||||
|
|
||||||||||||||
| | Case | Current (single global lock) | Proposed (blobstore + per-repo locks) | Expected impact | | ||||||||||||||
| | --- | --- | --- | --- | | ||||||||||||||
| | Concurrent writes to different repos, different blobs | serialized globally | parallel (independent repo locks; short blobstore sections only when needed) | major throughput gain | | ||||||||||||||
| | Writes to different repos, same blob digest | serialized globally | contention mostly scoped to blobstore lock around dedupe check/insert | better than current; bounded shared contention | | ||||||||||||||
| | Read in repo A + write in repo B | often blocked by global write lock | independent locks allow progress unless shared blobstore mutation conflicts | lower read latency under write load | | ||||||||||||||
| | Tag/manifest updates in independent repos | serialized globally | parallel under different repo locks | major reduction in lock wait | | ||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is true only for tag updates. The manifest is a blob, so it needs to be read/written to the global blob store.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same comment about manifest blobs being private to a repo (so can lock only the repo) or are they deduped as well then need the _blobstore lock? |
||||||||||||||
| | Heavy read workload with light writes | write lock stalls unrelated reads | writes primarily block only affected lock domains | improved tail latency | | ||||||||||||||
| | High churn in one hot repo + quiet other repos | hot repo impacts everyone | isolation: hot repo mostly impacts itself (+ shared blobstore moments) | improved tenant isolation | | ||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's not use the word tenant. |
||||||||||||||
|
|
||||||||||||||
| ## Pathological case | ||||||||||||||
|
|
||||||||||||||
| The new scheme still has a worst-case contention pattern: | ||||||||||||||
|
|
||||||||||||||
| - many concurrent writes across repositories targeting the same small set of blob digests | ||||||||||||||
| - each flow needs blobstore coordination for dedupe decision/creation | ||||||||||||||
|
|
||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add an entry to GC. In case of remote storage every repo needs to be locked/read before the blob is removed by GC from the common blob store.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Clarify GC behavior |
||||||||||||||
| In this case, the blobstore lock becomes the hotspot. This is still preferable to a single global lock because: | ||||||||||||||
|
|
||||||||||||||
| - contention is limited to shared blob paths instead of all storage operations | ||||||||||||||
| - repo-local operations that do not require blobstore mutation can continue concurrently | ||||||||||||||
|
|
||||||||||||||
| ## Remote dedupe change: stop creating empty entries | ||||||||||||||
|
|
||||||||||||||
| For remote storage dedupe, do not create empty file entries as dedupe markers. | ||||||||||||||
|
|
||||||||||||||
| Why this change is required: | ||||||||||||||
|
|
||||||||||||||
| - empty marker entries can interfere with OCI distribution spec expectations around blob/object layout and semantics | ||||||||||||||
| - each marker introduces additional backend operations (HEAD/LIST/PUT-like checks), increasing cloud API calls and cost | ||||||||||||||
|
Comment on lines
+103
to
+104
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Very well said. |
||||||||||||||
|
|
||||||||||||||
| Expected result: | ||||||||||||||
|
|
||||||||||||||
| - cleaner, spec-aligned remote storage behavior | ||||||||||||||
| - fewer backend queries and lower cloud cost under dedupe-heavy workloads | ||||||||||||||
|
|
||||||||||||||
| ## Correctness and safety notes | ||||||||||||||
|
|
||||||||||||||
| - lock acquisition order must be uniformly enforced in all code paths | ||||||||||||||
| - blob existence checks and blob writes must remain atomic with respect to dedupe decisions | ||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||
| - repo metadata operations must remain atomic per repository | ||||||||||||||
| - metrics should track lock wait/hold times separately for blobstore and repository locks to validate improvement | ||||||||||||||
|
|
||||||||||||||
| ## Rollout guidance | ||||||||||||||
|
|
||||||||||||||
| - keep changes behind internal implementation boundaries first (no API break) | ||||||||||||||
| - run existing storage, dedupe, and conformance suites | ||||||||||||||
| - add concurrency-focused tests for lock ordering and deadlock absence | ||||||||||||||
| - add benchmark scenarios: multi-repo parallel push/pull, shared-digest fan-in, hot-repo isolation | ||||||||||||||
|
|
||||||||||||||
| ## Upgrade behavior | ||||||||||||||
|
|
||||||||||||||
| The migration story should support both in-place upgrade and side-by-side cutover. | ||||||||||||||
|
|
||||||||||||||
| Version scope for this proposal: | ||||||||||||||
|
|
||||||||||||||
| - source deployment series: zot 2.1.x | ||||||||||||||
| - target deployment series: zot 2.2.x (where this refactor is expected to land) | ||||||||||||||
|
|
||||||||||||||
| ### In-place upgrade on startup | ||||||||||||||
|
|
||||||||||||||
| When a new zot binary starts with data written by an older zot version, startup may run storage migration steps to move data to the new layout/metadata conventions. | ||||||||||||||
|
|
||||||||||||||
| Implications: | ||||||||||||||
|
|
||||||||||||||
| - startup can be blocked while migration work is in progress | ||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we maybe track what repositories have been migrated, maybe on a per repo level the "cache" DB? I am thinking we don't want to do this every time zot starts, and a system may be restarted in the middle of the migration, we don't want to assume everything was migrated already, or completelys tart over.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently a blocking call at start time. |
||||||||||||||
| - large datasets or high repository counts can increase startup latency | ||||||||||||||
| - operationally simple (single deployment), but downtime/startup delay risk is higher | ||||||||||||||
|
|
||||||||||||||
| Recommended use: | ||||||||||||||
|
|
||||||||||||||
| - smaller deployments where startup delay is acceptable | ||||||||||||||
| - environments where a one-time maintenance window is available | ||||||||||||||
|
|
||||||||||||||
| ### Side-by-side sync/mirror migration | ||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense. |
||||||||||||||
|
|
||||||||||||||
| Stand up a new zot instance and configure sync/mirror from the old zot server. The new instance writes incoming data directly in the new storage format. | ||||||||||||||
|
|
||||||||||||||
| Implications: | ||||||||||||||
|
|
||||||||||||||
| - avoids long startup migration blocking on the new instance | ||||||||||||||
| - reduces upgrade risk by separating data movement from process startup | ||||||||||||||
| - enables progressive cutover after validation | ||||||||||||||
|
|
||||||||||||||
| Recommended use: | ||||||||||||||
|
|
||||||||||||||
| - large deployments where startup blocking is unacceptable | ||||||||||||||
| - environments needing lower-risk migration with rollback options | ||||||||||||||
|
|
||||||||||||||
| ### Cutover and validation notes | ||||||||||||||
|
|
||||||||||||||
| - validate repository listing behavior and hidden `_blobstore` visibility rules before traffic switch | ||||||||||||||
| - compare blob/manifest counts between old and new instances during sync | ||||||||||||||
| - perform a staged cutover (read-only checks, then write traffic shift) to reduce migration risk | ||||||||||||||
|
|
||||||||||||||
| ## Related WIP PRs | ||||||||||||||
|
|
||||||||||||||
| - https://github.com/project-zot/zot/pull/3906 | ||||||||||||||
| - https://github.com/project-zot/zot/pull/2968 | ||||||||||||||
| - https://github.com/project-zot/zot/pull/3922 | ||||||||||||||
|
|
||||||||||||||
| These PRs can be used to stage and validate the refactor incrementally. | ||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you want to give this to the AI this is not enough. And there are still things to clarify before that. |
||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this about, we don't have a tenant concept? This opens up a whole other discussion: Do we want to support multiple tenants in the same zot instance? Or do we want to support multiple tenants in a zot cluster on different nodes? Or do we say each tenant should have its own independent zot?
I think we should stick to the solving one issue at at time instead of increasing complexity. Handling remote vs local storage in the context of dedupe and GC should be resolved before getting into tenants. We should also reach a conclusion what is a valid zot cluster configuration before getting into tenants (in my opinion multiple zot instances sync-ing from each other using the sync extension is an exploit of the sync extension).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multi-tenant simply means users operating on separate repositories.