[Proposal] Incremental reprocessing #349

rkistner · 2025-09-01T13:27:12Z

Maintainer

Background

Currently, when changes to Sync Rules or Sync Streams are deployed, PowerSync re-replicates all data from the source database from scratch, processing it with the new Sync Rules. Once that is ready, clients are switched over to sync from the new copy.

While there is no direct "downtime", it can take a long time on large databases, and clients have to re-sync all data even if only a small portion changed.

Status

2026-06-15: Updated with the latest status. Sync rules / bucket definitions are not part of the plan anymore, only supporting new sync streams.
2026-03-18: We had an experimental/proof-of-concept build demonstrating incremental reprocessing. We are now working on getting functionality ready for production, and merging pieces of functionality at a time.
2026-01-09: Updated implementation status.
2025-12-09: Updated plan with more specifics, implementation tasks and links to relevant PRs.
2025-09-01: Original version of proposal outlined two implementation options.

Proposal

The basic idea is to reprocess only the data in storage that actually needs it.

With sync streams now being stable, only sync streams with config.edition: 3 (or future versions) will be supported. Using sync rules or alpha sync streams will require full reprocessing as before.

In general:

- Adding a new Sync Stream definition will process that definition only.
- Removing a Sync Stream definition will remove the relevant streams only, and does not require re-reading any data from the source database.
- Renaming a sync stream is treated as adding a new stream and removing the old one, so existing data is not preserved.
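
As a rough illustration of the three cases above, the comparison at the level of stream names can be sketched as a set difference. This is a minimal Python sketch, not the PowerSync implementation: StreamDiff and diff_stream_names are hypothetical names, and streams are identified purely by name here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamDiff:
    added: frozenset[str]    # processed as new definitions
    removed: frozenset[str]  # storage dropped, no re-read of the source database
    kept: frozenset[str]     # same name in both deploys; may still be modified


def diff_stream_names(old: set[str], new: set[str]) -> StreamDiff:
    """Compare two deploys by stream name only (hypothetical helper)."""
    return StreamDiff(
        added=frozenset(new - old),
        removed=frozenset(old - new),
        kept=frozenset(old & new),
    )


# Renaming "todos" to "tasks" shows up as one removal plus one addition,
# which is why a rename does not preserve data.
diff = diff_stream_names({"projects", "todos"}, {"projects", "tasks"})
```

Streams in kept are the ones where the "best-effort" handling of modified streams applies.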

Modifying an existing stream preserves existing data without reprocessing on a "best-effort" basis. To understand this, note that a sync stream query is split into 3 parts:

- The "bucket data source". This is the "results" part of the query, that directly goes into the synced buckets. Multiple queries in the same sync stream are typically merged here, if they share the same output buckets.
- The "parameter indexes", defined by subqueries and join tables. These determine which buckets are synced, using replicated data.
- The "querier", defined by request parameters.

Reprocessing of streams happens at the level of bucket data sources and parameter indexes. If one is added or modified, all data for that one is reprocessed. Note that adding a query to a sync stream can trigger reprocessing of other queries/tables in the same sync stream, if they share the same output buckets.

Changing the querier part of queries does not require any reprocessing. For example, changing a query from SELECT * FROM projects WHERE user_id = auth.user_id() to SELECT * FROM projects WHERE user_id = auth.jwt() ->> 'owner' requires no reprocessing, since it does not require any changes to storage - only the sync-time evaluation is changed. But changing to SELECT * FROM projects WHERE owner_id = auth.user_id() requires a change to storage, so that triggers reprocessing of the related data.
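
The rule in the example above can be sketched by modelling a query as its three parts and comparing only the stored ones. This is a minimal Python sketch under stated assumptions: StreamQuery and needs_reprocessing are hypothetical names, and the three decompositions are written by hand for illustration, since the actual parser output is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamQuery:
    """Hand-written stand-in for one parsed sync stream query (assumed shape)."""
    bucket_data_source: str            # the "results" part stored in buckets
    parameter_indexes: frozenset[str]  # lookups built from replicated data
    querier: str                       # request-time parameter expression


def needs_reprocessing(old: StreamQuery, new: StreamQuery) -> bool:
    # Only the stored parts are compared; the querier is evaluated at sync time.
    return (
        old.bucket_data_source != new.bucket_data_source
        or old.parameter_indexes != new.parameter_indexes
    )


# WHERE user_id = auth.user_id()
original = StreamQuery("projects by user_id", frozenset(), "auth.user_id()")
# WHERE user_id = auth.jwt() ->> 'owner': same storage, different querier
jwt_owner = StreamQuery("projects by user_id", frozenset(), "auth.jwt() ->> 'owner'")
# WHERE owner_id = auth.user_id(): buckets are keyed by a different column
owner_column = StreamQuery("projects by owner_id", frozenset(), "auth.user_id()")

querier_only_change = needs_reprocessing(original, jwt_owner)
storage_change = needs_reprocessing(original, owner_column)
```

Here querier_only_change is False and storage_change is True, matching the two cases described above.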

The specifics here depend a lot on how queries are parsed - more examples will follow in the future.

Implementation

Previously, a "sync rules version" was created on each deploy of sync rules / sync streams. Each version was processed separately from all others, with isolated storage. Each had its own "replication stream": The process that replicates from the source database, owning a replication slot, change stream, or replication progress tracking for the source database. The concept of a replication stream was never explicit, since it overlapped with sync rules versions.

The new design splits this into two separate entities:

- A replication stream has isolated storage, and handles a single conceptual stream of changes from the source database.
- A sync config handles the deployed sync rules / sync streams.

There can now be multiple sync configs per replication stream. In that case, they can share storage, avoiding the need to re-replicate all data from scratch. Separate replication streams are still used if the sync configs cannot share data, for example due to incompatible replication config.

Within a replication stream, storage is further split by bucket source definition and parameter index. This makes it easy to drop all storage for specific definitions if they are removed.
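
To show why this split makes removal cheap, here is a toy in-memory sketch in Python. DefinitionStorage, its methods and the definition ids are all hypothetical; the real storage layout in MongoDB or Postgres is not specified in this proposal.

```python
class DefinitionStorage:
    """Toy store keyed by definition id (bucket data source or parameter index)."""

    def __init__(self) -> None:
        self._rows: dict[str, list[dict]] = {}

    def write(self, definition_id: str, row: dict) -> None:
        self._rows.setdefault(definition_id, []).append(row)

    def definitions(self) -> set[str]:
        return set(self._rows)

    def drop_unreferenced(self, referenced: set[str]) -> set[str]:
        """Drop every definition that no sync config references any more."""
        dropped = set(self._rows) - referenced
        for definition_id in dropped:
            del self._rows[definition_id]
        return dropped


storage = DefinitionStorage()
storage.write("bucket:projects", {"id": "p1"})
storage.write("bucket:todos", {"id": "t1"})
storage.write("param:project_members", {"user_id": "u1", "project_id": "p1"})

# The "todos" stream was removed from the deployed config.
dropped = storage.drop_unreferenced({"bucket:projects", "param:project_members"})
```

Because each definition owns its own slice of storage, removing one never touches rows that belong to another.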

When a new sync config is added to an existing replication stream, we compute all definitions that are not covered by existing sync configs in that replication stream. For any new definitions, we re-snapshot all the relevant source tables. These snapshots create new SourceTable entities, so existing definitions from other sync configs are unaffected.
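
The coverage computation described above can be sketched as follows. This is a minimal Python sketch with assumed names: Definition and plan_new_sync_config are hypothetical, and definitions are matched by a plain id here, whereas the real comparison presumably depends on the content of each definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    """A bucket data source or parameter index (shape is an assumption)."""
    id: str
    source_tables: frozenset[str]


def plan_new_sync_config(
    existing_configs: list[list[Definition]],
    new_config: list[Definition],
) -> tuple[list[Definition], set[str]]:
    """Return the uncovered definitions and the source tables to re-snapshot."""
    covered = {d.id for config in existing_configs for d in config}
    uncovered = [d for d in new_config if d.id not in covered]
    tables: set[str] = set()
    for definition in uncovered:
        tables |= definition.source_tables
    return uncovered, tables


projects = Definition("bucket:projects", frozenset({"projects"}))
members = Definition("param:project_members", frozenset({"project_members"}))
todos = Definition("bucket:todos", frozenset({"todos"}))

# The new sync config adds a "todos" definition to an existing stream.
uncovered, tables = plan_new_sync_config(
    existing_configs=[[projects, members]],
    new_config=[projects, members, todos],
)
```

Only the todos table is snapshotted in this example; the definitions already covered by the existing sync config keep their data.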

The snapshots above, as well as the initial replication snapshots, now happen concurrently with streaming replication. This means data continues to be streamed for the existing sync config while we snapshot. It also means that the streaming replication position does not fall behind during an initial snapshot, avoiding issues such as Postgres WAL slots exceeding max_slot_wal_keep_size.

All the above storage changes are implemented in a new storage version, since it includes many breaking changes. During development, storage version 3 is used for this. Once the structure is fixed, it will become the stable storage version 4.

Implementation progress

Supported source databases:
- Postgres: Pending
- MongoDB: [Incremental Reprocessing] Support for multiple sync configs on MongoDB storage #670 (experimental)
- MySQL: Pending
- SQL Server: Pending
- Convex: Pending
Supported storage databases:
- MongoDB: [Incremental Reprocessing] Support for multiple sync configs on MongoDB storage #670 (experimental)
- Postgres: Pending

Other considerations

Defragmenting

The "Defragment" (reprocess) API creates a new replication stream, re-replicating all data. In the future we can investigate more granular approaches to defragmentation.

G2Jose · 2025-09-04T20:18:48Z

Just wanted to chime in and say this would be super helpful for my personal use case with powersync in 2 ways:

- Currently I have a hard loading screen whenever a full resync happens in my app. I've tried going without, but it seems to require quite a bit of processing that it really tanks UI performance while the sync is in progress. I'm having to deploy sync rules anytime I add a new table.
- I'm self hosting powersync and have provisioned a certain number of IOPS and throughput. I recently ran into an issue where I exceeded some of these thresholds and ended up in a state where my EC2 stopped responding for some time.

2 replies

simolus3 Sep 5, 2025
Maintainer

I've tried going without, but it seems to require quite a bit of processing that it really tanks UI performance while the sync is in progress

Out of interest, are you using the newer Rust client for this? On RN, that can greatly improve sync performance (and also improves UI responsiveness by offloading work to a background thread, but there's still a bit of work happening on the main thread). So that might be worth trying out if you haven't looked at it already.

G2Jose Sep 5, 2025

Thanks for this, I didn't know there was a rust client I can drop in!

lgbestel · 2026-03-18T12:50:47Z

Real-world case: compaction is insufficient, forced to redeploy sync rules for performance

We're running a production app with ~45 synced tables, including a large table with 15,000+ rows that gets frequently updated. Our initial sync time for new clients had grown to 3+ minutes.

We've already:

- Removed unused columns from sync rules
- Removed unnecessary collections
- Ran compaction and defragmentation

None of these brought significant improvement. The reason became clear after reading the compaction docs: MOVE operations don't reduce operation count, only data size. Our buckets still had 100,000+ operation entries that new clients had to traverse, even though most were empty MOVEs. CLEAR operations couldn't help either because PUT operations near the start of the bucket block them.

The only thing that dramatically improved performance was redeploying sync rules without any changes. This rebuilds all buckets from scratch with exactly 1 PUT per existing row. Initial sync dropped from 3+ minutes to under 30 seconds — for all users, even though they had to re-sync everything.

Since compaction has this structural limitation, we're now forced to set up a scheduled cron job to redeploy sync rules daily during low-traffic hours. It works, but it's clearly a workaround with operational overhead (dual replication slots, WAL accumulation, active users getting forced back to sync screen during switchover).

Incremental reprocessing as proposed in this discussion would eliminate our need for this workaround entirely. Even a more targeted solution — like a "rebuild buckets from current state" operation that doesn't require full re-replication from the source database — would be a huge improvement over redeploying sync rules.

Would love to see this prioritized!

1 reply

rkistner Mar 18, 2026
Maintainer Author

Incremental reprocessing would actually do the opposite: It would make the sync rule re-deploy do nothing if there are no changes, so it won't re-build buckets at all.

Is this on a cloud instance or self-hosted? On cloud instances, the "defragment" action currently does something similar to deploying sync rules with whitespace/comment changes, and it will continue doing the same once incremental reprocessing is implemented.

One approach to look into is defragmenting by periodically updating the oldest data in the source database, then running compaction. In some cases, splitting queries into separate bucket definitions or sync streams can also help, since those are compacted individually. Have you tried anything like this yet?
