Customisable table reorganising using shadow tables and minimal locks.
Psycopack can be used to perform the following operations without downtime:
- Schema changes that would be dangerous (locking the table for too long) if performed with regular SQL DDL statements (e.g. adding exclusion constraints).
- Converting a primary key from int to bigint.
- Clustering (or repacking) the table to address table and index bloat (similar to pg_repack).
- Partitioning an existing table (only by date range at the moment).
Psycopack uses a shadow-table approach.
This means creating a copy of the table (shadow table), synchronising its data with the original table data, and swapping it with the original. It is also possible to reverse the swap, so that you can verify the results before deciding to commit to them.
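The mechanics of the shadow-table approach can be illustrated with a toy in-memory model (illustration only, not Psycopack code or its actual internals): copy the existing rows, keep writing to both sides, swap, and optionally swap back.

```python
# Toy model of the shadow-table approach: backfill, dual writes, swap,
# and reverse swap. All names here are illustrative, not Psycopack's API.

class ToyShadowRepack:
    """Simulates: copy table -> keep in sync -> swap -> optionally revert."""

    def __init__(self, original: dict[int, str]) -> None:
        self.original = original
        self.shadow: dict[int, str] = {}
        self.swapped = False

    def backfill(self) -> None:
        # Copy existing rows into the shadow table.
        self.shadow.update(self.original)

    def write(self, pk: int, value: str) -> None:
        # After setup, writes land on both tables (the role of the trigger).
        self.original[pk] = value
        self.shadow[pk] = value

    def swap(self) -> None:
        # Swap the tables; the old one is kept so the swap can be reversed.
        self.original, self.shadow = self.shadow, self.original
        self.swapped = not self.swapped

    # Reversing the swap is just swapping back.
    reverse_swap = swap


repack = ToyShadowRepack({1: "a"})
repack.backfill()
repack.write(2, "b")   # Arrives during the process; kept in sync on both sides.
repack.swap()
assert repack.original == {1: "a", 2: "b"}
repack.reverse_swap()  # Verify the results, then commit or revert.
```

Because writes are mirrored until clean-up, either table can serve as the live one, which is what makes the swap reversible.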
The Psycopack process achieves the above in 6 major steps:
- Pre-validation: Checks if the table can be processed by Psycopack at all. See Known Limitations below.
- Setup: Creates the copy table, a trigger to keep the shadow table in sync with the original, and a backfill log table to keep track of rows from the original table that need to be copied over.
- Backfill: Copies data from the original table to the copy one.
- Schema synchronisation: Recreates all indexes and constraints in the copy table. Also recreates foreign keys on referring tables. This step is performed after the backfill is complete to speed up writes on the copy table.
- Swap: Effectively replaces the original table with the copy table. The original (now old) table is kept in sync via triggers, in case the swap needs to be reverted. The swapping is done in a way that allows reverting the process if something goes wrong, and without locking the table for more than a few milliseconds. Reverting is achieved via an operation called reverse swap.
- Clean up: Drops the old table. After this point, the process is irreversible, so it's recommended to verify the results before running this step.
There are also two strategies for performing the process:
- Change log (Recommended): The backfill is performed in batches, and changes after the process kicks off are tracked via a change log table. This allows the copy table to synchronise its schema with the original table without using concurrent index creation, which can take a long time and can be blocked by long-running transactions.
- Triggers: The backfill is performed in batches, and changes after the process kicks off are immediately pushed to the copy table via triggers. Because the copy table is up to date as soon as the backfill completes, the schema can be synchronised afterwards using operations that only acquire weak locks.
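The difference between the two strategies can be sketched with a toy in-memory model (not Psycopack code): with triggers, each write reaches the copy table immediately; with a change log, writes are recorded and replayed in batches later.

```python
# Toy contrast of the two sync strategies; names are illustrative only.

copy_table: dict[int, str] = {}

# Triggers strategy: every write is pushed to the copy table immediately.
def trigger_write(pk: int, value: str) -> None:
    copy_table[pk] = value

# Change-log strategy: writes are recorded, then replayed in batches,
# leaving a window where the copy table's schema can be changed cheaply.
change_log: list[tuple[int, str]] = []

def log_write(pk: int, value: str) -> None:
    change_log.append((pk, value))

def replay_change_log() -> None:
    for pk, value in change_log:
        copy_table[pk] = value
    change_log.clear()

trigger_write(1, "a")      # Visible on the copy table at once.
assert copy_table[1] == "a"

log_write(1, "b")          # Deferred until replay; later writes win.
log_write(2, "c")
replay_change_log()
assert copy_table == {1: "b", 2: "c"}
```

The replay step is why the change-log strategy can rebuild indexes with plain (non-concurrent) creation before catching up on writes.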
Psycopack is driven by a Python class that can be used to configure the outcome of the process. Below are some examples of how Psycopack can be used.
For example, to perform a custom schema change step by step:

```python
from psycopack import Psycopack, get_db_connection

with get_db_connection("postgresql://user:password@host:port/db") as conn:
    with conn.cursor() as cur:
        psycopack = Psycopack(
            batch_size=50_000,
            conn=conn,
            cur=cur,
            schema="public",
            table="to_psycopack",
        )
        psycopack.pre_validate()
        psycopack.setup_repacking()
        # Now that the shadow table is set up, change the schema accordingly.
        cur.execute(f"ALTER TABLE {psycopack.copy_table} ... ")
        # Finish the process.
        psycopack.sync_schemas()
        psycopack.post_sync_update()
        psycopack.swap()
        psycopack.clean_up()
```

To convert a primary key from int to bigint:

```python
from psycopack import Psycopack, get_db_connection

with get_db_connection("postgresql://user:password@host:port/db") as conn:
    with conn.cursor() as cur:
        psycopack = Psycopack(
            batch_size=50_000,
            conn=conn,
            cur=cur,
            schema="public",
            table="to_psycopack",
            # Set this argument and Psycopack will do the conversion
            # automatically.
            convert_pk_to_bigint=True,
        )
        psycopack.full()
```

To repack a whole table to address bloat:

```python
from psycopack import Psycopack, get_db_connection

with get_db_connection("postgresql://user:password@host:port/db") as conn:
    with conn.cursor() as cur:
        psycopack = Psycopack(
            batch_size=50_000,
            conn=conn,
            cur=cur,
            schema="public",
            table="to_psycopack",
        )
        # Simply run `.full()` to repack the whole table without further
        # customisation.
        psycopack.full()
```

To partition an existing table by date range:

```python
from psycopack import Psycopack, get_db_connection
# Note: the import location of SyncStrategy and _partition is not shown
# in the original example.

with get_db_connection("postgresql://user:password@host:port/db") as conn:
    with conn.cursor() as cur:
        psycopack = Psycopack(
            table="to_repack",
            batch_size=1,
            conn=conn,
            cur=cur,
            change_log_batch_size=10,
            sync_strategy=SyncStrategy.CHANGE_LOG,
            partition_config=_partition.PartitionConfig(
                column="datetime_field",
                num_of_extra_partitions_ahead=10,
                strategy=_partition.DateRangeStrategy(
                    partition_by=_partition.PartitionInterval.DAY
                ),
            ),
        )
```

The following types of tables aren't currently supported:
- Inherited tables.
- Tables without primary keys.
- Tables with composite primary keys.
- Tables whose primary key field type is not one of: bigint, bigserial, integer, serial, smallint, smallserial.
- Tables with triggers.
- Tables with invalid indexes (the user should drop or re-index them first).
- Tables with deferrable unique constraints.
- Referring foreign keys on a different schema than the original table.
- When partitioning, the CHANGE_LOG strategy must be used.
- When partitioning, unique constraints are skipped, as they are not supported on partitioned tables. The user should add them manually after the process is complete or after the setup step.
- When partitioning, referring foreign keys are not supported yet.
Unless the user is a superuser, they may lack certain privileges to run Psycopack.
If such privileges are not assigned, Psycopack will raise an error informing
the user which tables are lacking a particular privilege or ownership and which
statements are required to fix it.
In summary, the user must have at least:
- CREATE and USAGE privileges on the schema:
  `GRANT CREATE, USAGE ON SCHEMA <schema> TO <user>;`
- Ownership of the original table and of tables with foreign keys pointing to the original table:
  `ALTER TABLE <table> OWNER TO <user>;`
- REFERENCES privilege on tables the original table has foreign keys to:
  `GRANT REFERENCES ON TABLE <table> TO <user>;`
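The statements above can be rendered for a given user with a small helper (not part of Psycopack; the function and the example names below are hypothetical):

```python
# Hypothetical helper that renders the privilege statements listed above
# for a given user, schema, and set of tables.

def required_grants(
    user: str,
    schema: str,
    owned_tables: list[str],
    referenced_tables: list[str],
) -> list[str]:
    stmts = [f"GRANT CREATE, USAGE ON SCHEMA {schema} TO {user};"]
    # Ownership of the original table and of referring tables.
    stmts += [f"ALTER TABLE {t} OWNER TO {user};" for t in owned_tables]
    # REFERENCES on tables the original table points to.
    stmts += [f"GRANT REFERENCES ON TABLE {t} TO {user};" for t in referenced_tables]
    return stmts


# "app_user" and "other_table" are made-up example names.
for stmt in required_grants("app_user", "public", ["to_psycopack"], ["other_table"]):
    print(stmt)
```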