Continuous Rerandomization Plan
Overview
Replaces the existing one-off rerandomization protocol with a continuous, online process that rerandomizes shares while the system is running. No downtime or restart is required.
Key design decision: in-memory shares are less likely to be exfiltrated, so only the DB (at-rest persistence) is rerandomized. The actor is completely unmodified. The rerand server handles everything, writing to a staging schema and then copying to live once all parties confirm.
Architecture
- Rerand Server (modified iris-mpc-bins/bin/iris-mpc-upgrade/rerandomize_db.rs, separate process, one per party): rerandomizes shares, writes to staging, coordinates with peers via S3 markers, and copies confirmed chunks to the live DB. Replaces the existing one-off RerandomizeDb subcommand with a new RerandomizeContinuous subcommand. Core rerandomization logic in iris-mpc-upgrade/src/rerandomization.rs is reused; the new subcommand adds the continuous loop, S3 coordination, and staging management.
- Main Server (existing, minimal changes): at startup, syncs rerand progress with peers and catches up any missing chunks from staging before loading the DB into memory.
The GPU actor, batch processing, and result processor are completely untouched.
Seed & Randomness
One epoch is active at a time. At the start of each epoch:
- Each rerand server generates a fresh BLS12-381 keypair
- Private key is saved to Secrets Manager at {env}/iris-mpc-db-rerandomization/epoch-{E}/private-key-party-{P}
- Public key is uploaded to S3 at s3://bucket/rerand/epoch-{E}/party-{P}/public-key
- Each rerand server downloads the other two parties' public keys from S3 (polling until all are present)
- Each derives the same 32-byte shared_secret via the BLS12-381 pairing
Only the rerand server needs access to the key. The main server never touches it.
Keygen is idempotent on restart
When starting an epoch, the rerand server:
- Checks if an epoch-scoped private key already exists in Secrets Manager at {env}/iris-mpc-db-rerandomization/epoch-{E}/private-key-party-{P}
- If yes: loads it, derives the public key, and uploads the public key to S3 if not already present (covers crash-after-SM-write-before-S3-upload)
- If no: generates a new keypair, saves the private key to Secrets Manager first, then uploads the public key to S3
Secrets Manager is checked first because the private key is written to SM before the public key is uploaded to S3. If we crash between the two writes, on restart we find the key in SM and re-upload to S3.
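The restart behaviour above can be sketched with in-memory stand-ins. This is a minimal illustration, not the actual implementation: `Stores`, `derive_public`, and `ensure_epoch_key` are hypothetical names, the two hash maps stand in for Secrets Manager and S3, and `derive_public` is a placeholder rather than real BLS12-381 key derivation.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-ins for Secrets Manager and the shared S3 bucket.
#[derive(Default)]
pub struct Stores {
    pub secrets: HashMap<String, Vec<u8>>, // private keys, keyed by secret name
    pub bucket: HashMap<String, Vec<u8>>,  // public keys, keyed by object key
}

/// Placeholder for BLS12-381 public-key derivation (assumption: NOT real cryptography).
fn derive_public(private_key: &[u8]) -> Vec<u8> {
    private_key.iter().map(|b| b ^ 0xFF).collect()
}

/// Ensures the epoch keypair exists in both stores and returns the private key.
/// `fresh_private_key` is only invoked when no key has been persisted yet.
pub fn ensure_epoch_key(
    stores: &mut Stores,
    env: &str,
    epoch: u32,
    party: u8,
    fresh_private_key: impl FnOnce() -> Vec<u8>,
) -> Vec<u8> {
    let secret_name =
        format!("{env}/iris-mpc-db-rerandomization/epoch-{epoch}/private-key-party-{party}");
    let object_key = format!("rerand/epoch-{epoch}/party-{party}/public-key");

    // 1. Check the secret store first: reuse an existing key, otherwise
    //    generate one and persist the private key BEFORE publishing anything.
    let private_key = match stores.secrets.get(&secret_name).cloned() {
        Some(existing) => existing,
        None => {
            let generated = fresh_private_key();
            stores.secrets.insert(secret_name, generated.clone());
            generated
        }
    };

    // 2. Publish the public key only if it is missing. This covers a crash
    //    after the secret write but before the bucket upload.
    stores
        .bucket
        .entry(object_key)
        .or_insert_with(|| derive_public(&private_key));

    private_key
}
```

Calling `ensure_epoch_key` twice for the same epoch and party returns the same private key, and a missing public key is re-published from the persisted private key rather than regenerated.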
Epoch transition
One epoch at a time, no overlap:
- All three rerand servers finish processing all chunks for epoch E
- Each server uploads a completion marker: s3://bucket/rerand/epoch-{E}/party-{P}/complete
- Each server polls until all three completion markers exist
- Keys for epoch E are deleted from Secrets Manager: old secret is destroyed, old shares (overwritten in live DB) are unrecoverable
- Epoch E+1 begins: create/publish manifest.json, keygen, derive new shared_secret, start processing
Old S3 markers under epoch-{E}/ are left in place (no active cleanup). Use S3 lifecycle policies to reap old epoch prefixes after a retention period.
On restart mid-epoch: private key is still in SM, public keys and markers are still in S3, rerand_progress table tells you the current epoch and which chunk to resume from. Re-derive shared_secret, continue.
S3 Coordination Bus
All cross-party coordination uses S3 markers in a shared bucket. Each party writes to its own prefixed paths. Marker layout:
s3://bucket/rerand/epoch-{E}/party-{P}/public-key # public key for DH
s3://bucket/rerand/epoch-{E}/party-{P}/max-id # party P watermark for manifest (MAX(id))
s3://bucket/rerand/epoch-{E}/party-0/manifest.json # epoch chunking manifest (party 0 writes, others read)
s3://bucket/rerand/epoch-{E}/party-{P}/chunk-{K}/staged # chunk K staging committed
s3://bucket/rerand/epoch-{E}/party-{P}/complete # epoch E fully done
Coordination is polling-based: a rerand server checks for peer markers by listing the S3 prefix. A few seconds of polling latency is fine for background work.
Authentication: the shared bucket uses IAM prefix policies to scope write access per party. Each party can only write to s3://bucket/rerand/epoch-*/party-{P}/*. All parties can read/list the full s3://bucket/rerand/epoch-{E}/ prefix to observe peer markers. The manifest is written by the designated writer (party 0) under its own prefix (party-0/manifest.json) and is read-only for others.
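As a rough sketch of the marker layout and the per-party write scope, the helpers below build object keys relative to the bucket root. All function names are illustrative assumptions, and `may_write` only mirrors the intent of the IAM prefix policy; the real enforcement happens in IAM, not in application code.

```rust
/// Prefix under which party `party` writes its markers for `epoch`.
pub fn party_prefix(epoch: u32, party: u8) -> String {
    format!("rerand/epoch-{epoch}/party-{party}")
}

/// Public key marker for one party.
pub fn public_key_key(epoch: u32, party: u8) -> String {
    format!("{}/public-key", party_prefix(epoch, party))
}

/// Watermark marker (MAX(id)) for one party.
pub fn max_id_key(epoch: u32, party: u8) -> String {
    format!("{}/max-id", party_prefix(epoch, party))
}

/// The manifest always lives under party 0's prefix.
pub fn manifest_key(epoch: u32) -> String {
    format!("{}/manifest.json", party_prefix(epoch, 0))
}

/// Marker signalling that `party` committed chunk `chunk` to staging.
pub fn staged_marker_key(epoch: u32, party: u8, chunk: u32) -> String {
    format!("{}/chunk-{chunk}/staged", party_prefix(epoch, party))
}

/// Marker signalling that `party` finished the whole epoch.
pub fn complete_key(epoch: u32, party: u8) -> String {
    format!("{}/complete", party_prefix(epoch, party))
}

/// Mirrors the write scope rerand/epoch-*/party-{P}/* for a given party.
pub fn may_write(party: u8, key: &str) -> bool {
    let parts: Vec<&str> = key.split('/').collect();
    parts.len() >= 4
        && parts[0] == "rerand"
        && parts[1].starts_with("epoch-")
        && parts[2] == format!("party-{party}").as_str()
}
```

With these helpers, party 1 may write its own `staged` marker but not the manifest, which sits under `party-0/`.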
Schema Changes
New column on irises
ALTER TABLE irises ADD COLUMN rerand_epoch INTEGER NOT NULL DEFAULT 0;
Modified increment_version_id trigger
CREATE OR REPLACE FUNCTION increment_version_id()
RETURNS TRIGGER AS $$
BEGIN
    IF (OLD.left_code IS DISTINCT FROM NEW.left_code OR
        OLD.left_mask IS DISTINCT FROM NEW.left_mask OR
        OLD.right_code IS DISTINCT FROM NEW.right_code OR
        OLD.right_mask IS DISTINCT FROM NEW.right_mask)
       AND NEW.rerand_epoch IS NOT DISTINCT FROM OLD.rerand_epoch THEN
        NEW.version_id = COALESCE(OLD.version_id, 0) + 1;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
When rerand_epoch changes (rerandomization), share data changes but version_id stays the same. When rerand_epoch stays the same (user-facing modification), version_id bumps as before.
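The trigger's decision can be mirrored in a few lines of Rust to make the two cases concrete. This is only a model of the rule under simplifying assumptions: `IrisRow` and `version_after_update` are hypothetical names, and columns are treated as non-null, so the COALESCE in the trigger has no counterpart here.

```rust
/// Simplified view of a row in `irises` (hypothetical type for illustration).
#[derive(Clone, Debug, PartialEq)]
pub struct IrisRow {
    pub left_code: Vec<u8>,
    pub left_mask: Vec<u8>,
    pub right_code: Vec<u8>,
    pub right_mask: Vec<u8>,
    pub rerand_epoch: i32,
    pub version_id: i16,
}

/// Returns the version_id the row ends up with after an UPDATE from `old` to `new`,
/// following the same condition as the modified trigger.
pub fn version_after_update(old: &IrisRow, new: &IrisRow) -> i16 {
    let shares_changed = old.left_code != new.left_code
        || old.left_mask != new.left_mask
        || old.right_code != new.right_code
        || old.right_mask != new.right_mask;
    let same_epoch = old.rerand_epoch == new.rerand_epoch;

    if shares_changed && same_epoch {
        // User-facing modification: bump the counter.
        old.version_id + 1
    } else {
        // Rerandomization (epoch changed) or no share change: keep what the UPDATE set.
        new.version_id
    }
}
```

An update that changes both the shares and `rerand_epoch` leaves `version_id` alone, while an update that changes the shares within the same epoch increments it.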
Staging schema
Each party has a staging schema (e.g. SMPC_rerand_staging) with:
CREATE TABLE irises (
    epoch INTEGER NOT NULL,
    id BIGINT NOT NULL,
    chunk_id INTEGER NOT NULL,
    left_code BYTEA,
    left_mask BYTEA,
    right_code BYTEA,
    right_mask BYTEA,
    original_version_id SMALLINT,
    rerand_epoch INTEGER,
    PRIMARY KEY (epoch, id)
);
Coordination table
A rerand_progress table in each party's DB:
CREATE TABLE rerand_progress (
    epoch INTEGER NOT NULL,
    chunk_id INTEGER NOT NULL,
    staging_written BOOLEAN NOT NULL DEFAULT FALSE,
    all_confirmed BOOLEAN NOT NULL DEFAULT FALSE,
    live_applied BOOLEAN NOT NULL DEFAULT FALSE,
    PRIMARY KEY (epoch, chunk_id)
);
Chunk ranges are derived from the manifest (chunk_size, max_id_inclusive) and chunk_id, so they are not stored here.
Lifecycle: staging_written → all_confirmed → live_applied.
Flow
Step 1: Rerand Server (per party, separate process)
Runs continuously:
1. Determine the active epoch E and load its manifest (the highest epoch with a manifest at s3://bucket/rerand/epoch-{E}/party-0/manifest.json but without all three completion markers). If no manifest exists for the next epoch, create it (party 0 only): collect watermarks, compute max_id_inclusive, write manifest.json.
2. Derive shared_secret for epoch E (keygen or resume; see above)
3. Pick next chunk range [start, end) for chunk K from the manifest
4. Read entries from live schema, recording each entry's version_id
5. Rerandomize shares using BLAKE3(shared_secret || iris_id) XOF
6. Write rerandomized shares to staging schema with epoch = E, original_version_id, chunk_id = K, and rerand_epoch = E + 1
7. Set staging_written = TRUE in local rerand_progress for (epoch = E, chunk_id = K)
8. Upload S3 marker after staging commit: s3://bucket/rerand/epoch-{E}/party-{P}/chunk-{K}/staged
9. Poll S3 until all 3 party markers exist for chunk K
10. Set all_confirmed = TRUE in local rerand_progress for (epoch = E, chunk_id = K)
11. Acquire pg_advisory_lock(RERAND_APPLY_LOCK) on a dedicated connection, then copy from staging to live DB, delete staging, and mark applied, all in one transaction (scoped to epoch and chunk):
SELECT pg_advisory_lock(RERAND_APPLY_LOCK); -- on dedicated connection
BEGIN;
UPDATE irises SET
    left_code = staging.left_code,
    left_mask = staging.left_mask,
    right_code = staging.right_code,
    right_mask = staging.right_mask,
    rerand_epoch = staging.rerand_epoch
FROM staging_schema.irises AS staging
WHERE irises.id = staging.id
  AND staging.epoch = E
  AND staging.chunk_id = K
  AND irises.version_id = staging.original_version_id;
DELETE FROM staging_schema.irises WHERE epoch = E AND chunk_id = K;
UPDATE rerand_progress SET live_applied = TRUE WHERE epoch = E AND chunk_id = K;
COMMIT;
SELECT pg_advisory_unlock(RERAND_APPLY_LOCK); -- release after commit
12. Proceed to next chunk (or start epoch transition if all chunks done)
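The "poll S3 until all 3 party markers exist" step can be sketched as follows. This is a simplified, synchronous illustration: `all_parties_staged` and `wait_for_chunk` are hypothetical names, and `list_keys` is a stand-in for listing the epoch prefix in S3, which a real implementation would do with an async S3 client.

```rust
use std::collections::HashSet;
use std::thread;
use std::time::Duration;

pub const NUM_PARTIES: u8 = 3;

/// True when every party's `staged` marker for (epoch, chunk) appears in the listing.
pub fn all_parties_staged(listed_keys: &HashSet<String>, epoch: u32, chunk: u32) -> bool {
    (0..NUM_PARTIES).all(|party| {
        listed_keys.contains(&format!(
            "rerand/epoch-{epoch}/party-{party}/chunk-{chunk}/staged"
        ))
    })
}

/// Polls `list_keys` up to `max_polls` times, sleeping `interval` between attempts.
/// Returns true as soon as all markers are present, false if the budget runs out.
pub fn wait_for_chunk(
    mut list_keys: impl FnMut() -> HashSet<String>,
    epoch: u32,
    chunk: u32,
    interval: Duration,
    max_polls: usize,
) -> bool {
    for attempt in 0..max_polls {
        if all_parties_staged(&list_keys(), epoch, chunk) {
            return true;
        }
        // Do not sleep after the final attempt.
        if attempt + 1 < max_polls {
            thread::sleep(interval);
        }
    }
    false
}
```

Only after `wait_for_chunk` returns true would the server set all_confirmed = TRUE for the chunk and move on to the apply transaction.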
Step 2: Main Server Startup (minimal changes)
At startup, before load_iris_db:
1. Existing: modification sync (sync_modifications). All parties catch up on modifications, producing identical version_id values.
2. New: rerand sync. Parties exchange a compact rerand watermark during the existing startup sync (SyncState exchange):
   - Each party computes (epoch, max_confirmed_chunk) from its local rerand_progress table: the active epoch E and the highest chunk_id where all_confirmed = TRUE. Since chunks are processed in strictly increasing order, all chunks 0 through max_confirmed_chunk (inclusive) are implicitly confirmed.
   - Each party sends this single (epoch, max_confirmed_chunk) pair as part of SyncState.
   - Each party computes safe_up_to = max(max_confirmed_chunk_party_0, max_confirmed_chunk_party_1, max_confirmed_chunk_party_2) for the agreed epoch E, then locally applies all chunks 0 through safe_up_to (inclusive) where live_applied = FALSE.
   - This is safe because all_confirmed = TRUE at any party means that party observed all three S3 staged markers, which means all three parties successfully committed the chunk to their staging schemas. A slower party may not have polled S3 yet, but its staging data is already there. Using max ensures all parties converge to the same applied set, preventing cross-party desync where one party loads rerandomized shares and another loads stale shares.
   - Edge case: if no chunks have been confirmed yet (fresh epoch or very start), max_confirmed_chunk is -1 / None. safe_up_to becomes -1 / None and the catch-up step is skipped entirely.
3. New (DB-only catch-up): acquire pg_advisory_lock(RERAND_APPLY_LOCK) on a dedicated connection. Then for every chunk K from 0 through safe_up_to (inclusive) where locally live_applied = FALSE (in increasing order): run the same apply transaction as Step 1.11. Keep the lock held through step 4.
4. Existing: load_iris_db loads from live DB into GPU memory. The advisory lock is still held, so the rerand server cannot apply new chunks while the DB is being read into memory.
5. Release the advisory lock: SELECT pg_advisory_unlock(RERAND_APPLY_LOCK) on the dedicated connection, then drop the connection.
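The watermark merge and the resulting catch-up set can be written down in a few lines. This is a sketch under assumptions: the function names are hypothetical, "no chunk confirmed yet" is modelled as `None` rather than -1, and the epoch agreement check is left out.

```rust
use std::collections::BTreeSet;

/// Highest confirmed chunk reported by any of the three parties,
/// or None when no party has confirmed a chunk in this epoch.
pub fn safe_up_to(max_confirmed: [Option<u32>; 3]) -> Option<u32> {
    max_confirmed.iter().copied().flatten().max()
}

/// Chunk ids this party still has to apply, in increasing order.
/// `already_applied` holds the chunk ids with live_applied = TRUE locally.
pub fn chunks_to_apply(safe_up_to: Option<u32>, already_applied: &BTreeSet<u32>) -> Vec<u32> {
    match safe_up_to {
        // Fresh epoch: nothing confirmed anywhere, so catch-up is skipped.
        None => Vec::new(),
        // The range is inclusive: chunk `last` itself must be applied too.
        Some(last) => (0..=last)
            .filter(|chunk| !already_applied.contains(chunk))
            .collect(),
    }
}
```

For example, if the parties report chunks 4, 6, and nothing, every party catches up through chunk 6, and a party that has already applied chunks 0 to 3 and 5 applies only 4 and 6.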
Advisory lock: startup vs rerand server concurrency
Both the rerand server (Step 1.11) and the main server startup (Steps 2.3–2.4) acquire pg_advisory_lock(RERAND_APPLY_LOCK) before applying chunks. This ensures:
- Only one process applies chunks at a time (no interleaving).
- The main server holds the lock from catch-up through load_iris_db, so the rerand server cannot sneak in applies between catch-up and memory load.
- If either process crashes, the connection drops and Postgres automatically releases the session-level lock. No stale locks.
Implementation with connection pools (sqlx): session-level advisory locks are tied to a specific Postgres connection. When using a connection pool, acquire a dedicated connection (pool.acquire()) and hold it (do not drop/return it) for the entire lock window. The catch-up queries and load_iris_db can use the pool normally — the dedicated connection just sits idle holding the lock. Release with pg_advisory_unlock(...) on the same connection after load_iris_db completes, then drop the connection.
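// Dedicated connection: the session-level lock lives on this connection only.
let mut lock_conn = pool.acquire().await?;
sqlx::query("SELECT pg_advisory_lock($1)")
    .bind(RERAND_APPLY_LOCK)
    .execute(&mut *lock_conn).await?;
// Note: an early return via `?` below hands lock_conn back to the pool with the
// lock still held. Unlock (or close the connection) on error paths as well.
apply_catchup_chunks(&pool).await?; // uses pool
load_iris_db(&pool).await?; // uses pool
// Release on the same connection that acquired the lock.
sqlx::query("SELECT pg_advisory_unlock($1)")
    .bind(RERAND_APPLY_LOCK)
    .execute(&mut *lock_conn).await?;
drop(lock_conn);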
Why modification sync before rerand sync matters
Modification sync ensures all parties have the same version_id values before the rerand staging copy runs. This guarantees the optimistic lock (WHERE version_id = original_version_id) produces the same skip set on all parties — the same entries are updated, the same entries are skipped.
Conflict Resolution: Rerandomization vs Modifications
Why the optimistic lock is needed
The rerand server reads entry X at time T with version_id = V. A modification (reauth/deletion) may happen later, bumping version_id to V+1. The staging still has original_version_id = V. The optimistic lock prevents overwriting the modification:
UPDATE irises SET ... WHERE version_id = original_version_id;
-- V ≠ V+1 → entry X skipped
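The effect of the optimistic lock can be simulated in memory. The sketch below is an illustration only: `LiveRow`, `StagedRow`, and `apply_staged` are hypothetical, the four share columns are collapsed into one `shares` field, and a staged id that no longer exists in the live table is reported as skipped (the SQL join would simply not match it).

```rust
use std::collections::BTreeMap;

/// Simplified live row: the four share columns are collapsed into `shares`.
#[derive(Clone, Debug, PartialEq)]
pub struct LiveRow {
    pub shares: Vec<u8>,
    pub version_id: i16,
    pub rerand_epoch: i32,
}

/// Simplified staging row for one chunk.
#[derive(Clone, Debug, PartialEq)]
pub struct StagedRow {
    pub id: i64,
    pub shares: Vec<u8>,
    pub original_version_id: i16,
    pub rerand_epoch: i32,
}

/// Applies staged rows to `live` and returns the ids that were skipped
/// because the row was modified (or removed) after it was read.
pub fn apply_staged(live: &mut BTreeMap<i64, LiveRow>, staged: &[StagedRow]) -> Vec<i64> {
    let mut skipped = Vec::new();
    for row in staged {
        match live.get_mut(&row.id) {
            // Same check as `irises.version_id = staging.original_version_id`.
            Some(current) if current.version_id == row.original_version_id => {
                current.shares = row.shares.clone();
                current.rerand_epoch = row.rerand_epoch;
                // version_id is deliberately left unchanged.
            }
            _ => skipped.push(row.id),
        }
    }
    skipped
}
```

Because modification sync gives all parties the same version_id values, running this check on each party yields the same skipped ids everywhere.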
Why rerand_epoch and the trigger are needed
Without the trigger change, the staging copy would bump version_id (because share data changed). The trigger change keeps version_id as a pure "user-facing modification counter," separate from rerandomization.
Chunking
Chunk boundaries must be identical across parties for chunk K to be meaningful. Define them via an epoch manifest object in S3:
s3://bucket/rerand/epoch-{E}/party-0/manifest.json: { epoch: E, chunk_size: N, max_id_inclusive: M }
- Party 0 writes the manifest once at epoch start under its own prefix (IAM-compliant); other parties poll until it exists and treat it as immutable.
- Watermark sync: before the manifest is written, each party P uploads its local watermark max_id_party_P = SELECT MAX(id) FROM irises to s3://bucket/rerand/epoch-{E}/party-{P}/max-id.
- The manifest writer waits until all three max-id markers exist, then sets max_id_inclusive as:
  M = min(max_id_party_0, max_id_party_1, max_id_party_2) - safety_buffer_ids
  safety_buffer_ids is configurable (default 0 or one chunk) to avoid rerandomizing the “tip” where replication/ingest lag could differ across parties.
- New inserts with id > M are left for a future epoch.
- Chunk K corresponds to [start, end) where start = 1 + K * N and end = min(start + N, M + 1).
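The manifest arithmetic above is small enough to write out directly. In this sketch the `Manifest` struct mirrors the JSON fields, while `max_id_inclusive`, `num_chunks`, and `chunk_range` are hypothetical helper names. Clamping M at 0 when the safety buffer exceeds the smallest watermark is an assumption; the plan does not specify that case.

```rust
/// Mirrors the fields of manifest.json.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Manifest {
    pub epoch: u32,
    pub chunk_size: u64,       // N
    pub max_id_inclusive: u64, // M
}

/// M = min(per-party MAX(id)) - safety_buffer_ids, clamped at 0 (assumption).
pub fn max_id_inclusive(party_max_ids: [u64; 3], safety_buffer_ids: u64) -> u64 {
    party_max_ids
        .iter()
        .copied()
        .min()
        .unwrap_or(0)
        .saturating_sub(safety_buffer_ids)
}

impl Manifest {
    /// Number of chunks needed to cover ids 1..=M.
    pub fn num_chunks(&self) -> u64 {
        if self.chunk_size == 0 {
            return 0;
        }
        (self.max_id_inclusive + self.chunk_size - 1) / self.chunk_size
    }

    /// Half-open id range [start, end) for chunk `k`, or None past the last chunk.
    pub fn chunk_range(&self, k: u64) -> Option<(u64, u64)> {
        if self.chunk_size == 0 {
            return None;
        }
        let start = 1 + k * self.chunk_size;
        if start > self.max_id_inclusive {
            return None;
        }
        let end = (start + self.chunk_size).min(self.max_id_inclusive + 1);
        Some((start, end))
    }
}
```

With N = 100 and M = 250 there are three chunks: [1, 101), [101, 201), and a shorter final chunk [201, 251).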
A configurable delay (--chunk-delay, default e.g. 5s) is inserted between chunks to avoid sustained DB load. The rerand server should not stress the live DB with continuous writes — the delay spreads the I/O over time. The delay, chunk size, and number of parallel DB connections should all be configurable via CLI flags or environment variables.
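One possible shape for these settings is sketched below. Only the 5 s delay default comes from the text above; the variable names (`RERAND_CHUNK_DELAY_SECS`, `RERAND_CHUNK_SIZE`, `RERAND_DB_CONNECTIONS`) and the other defaults are assumptions made for illustration. The `lookup` closure abstracts the source of values so the parsing can be exercised without touching the process environment.

```rust
use std::str::FromStr;
use std::time::Duration;

/// Tunables for the rerand loop (hypothetical struct).
#[derive(Debug, PartialEq)]
pub struct RerandConfig {
    pub chunk_delay: Duration,
    pub chunk_size: u64,
    pub db_connections: u32,
}

/// Parses an optional raw value, falling back to `default` when it is absent.
fn parse_or<T: FromStr>(raw: Option<String>, key: &str, default: T) -> Result<T, String> {
    match raw {
        None => Ok(default),
        Some(value) => value
            .trim()
            .parse()
            .map_err(|_| format!("invalid value for {key}: {value:?}")),
    }
}

impl RerandConfig {
    /// `lookup` could be `|k| std::env::var(k).ok()` in a real binary.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let delay_key = "RERAND_CHUNK_DELAY_SECS";
        let size_key = "RERAND_CHUNK_SIZE";
        let conn_key = "RERAND_DB_CONNECTIONS";

        let delay_secs: u64 = parse_or(lookup(delay_key), delay_key, 5)?;
        // Defaults below are placeholders, not values taken from the plan.
        let chunk_size: u64 = parse_or(lookup(size_key), size_key, 1_000)?;
        let db_connections: u32 = parse_or(lookup(conn_key), conn_key, 4)?;

        if chunk_size == 0 || db_connections == 0 {
            return Err("chunk size and connection count must be positive".to_string());
        }
        Ok(Self {
            chunk_delay: Duration::from_secs(delay_secs),
            chunk_size,
            db_connections,
        })
    }
}
```

Rejecting a zero chunk size up front avoids an epoch whose manifest can never make progress.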