Undo #21 (Draft)

gburd wants to merge 31 commits into master from undo

Conversation


@gburd gburd commented Mar 26, 2026

No description provided.

@github-actions github-actions Bot force-pushed the master branch 30 times, most recently from 9355586 to 9cbf7e6 Compare March 30, 2026 18:18
gburd and others added 30 commits April 16, 2026 18:55
This commit adds the core UNDO logging system for PostgreSQL, implementing
ZHeap-inspired physical UNDO with Compensation Log Records (CLRs) for
crash-safe transaction rollback and standby replication support.

Key features:
- Physical UNDO application using memcpy() for direct page modification
- CLR (Compensation Log Record) generation during transaction rollback
- Shared buffer integration (UNDO pages use standard buffer pool)
- UndoRecordSet architecture with chunk-based organization
- UNDO worker for automatic cleanup of old records
- Per-persistence-level record sets (permanent/unlogged/temp)

Architecture:
- UNDO logs stored in $PGDATA/base/undo/ with 64-bit UndoRecPtr
- 40-bit offset (1TB per log) + 24-bit log number (16M logs)
- Integrated with PostgreSQL's shared_buffers (no separate cache)
- WAL-logged CLRs ensure crash safety and standby replay
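The 64-bit pointer layout described above can be sketched as follows; the macro and function names here are illustrative stand-ins, not the patch's actual symbols:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t UndoRecPtr;

/* 40 low bits of byte offset (1TB per log), 24 high bits of log number
 * (16M logs) -- layout per the commit message; names are hypothetical. */
#define UNDO_OFFSET_BITS 40
#define UNDO_OFFSET_MASK ((UINT64_C(1) << UNDO_OFFSET_BITS) - 1)

static UndoRecPtr
MakeUndoRecPtr(uint32_t logno, uint64_t offset)
{
    assert(logno < (1u << 24));
    assert(offset <= UNDO_OFFSET_MASK);
    return ((UndoRecPtr) logno << UNDO_OFFSET_BITS) | offset;
}

static uint32_t
UndoRecPtrLogNo(UndoRecPtr ptr)
{
    return (uint32_t) (ptr >> UNDO_OFFSET_BITS);
}

static uint64_t
UndoRecPtrOffset(UndoRecPtr ptr)
{
    return ptr & UNDO_OFFSET_MASK;
}
```

Packing both fields into one integer keeps UNDO pointers cheap to store in tuple headers and WAL records.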
Extends UNDO by adding a per-relation model that can record logical
operations for the purpose of recovery or in support of MVCC visibility
tracking. Unlike cluster-wide UNDO (which stores complete tuple data
globally), per-relation UNDO stores logical operation metadata in a
relation-specific UNDO fork.

Architecture:
- Separate UNDO fork per relation (relfilenode.undo)
- Metapage (block 0) tracks head/tail/free chain pointers
- Data pages contain UNDO records with operation metadata
- WAL resource manager (RM_RELUNDO_ID) for crash recovery
- Two-phase protocol: RelUndoReserve() / RelUndoFinish() / RelUndoCancel()

Record types:
- RELUNDO_INSERT: Tracks inserted TID range
- RELUNDO_DELETE: Tracks deleted TID
- RELUNDO_UPDATE: Tracks old/new TID pair
- RELUNDO_TUPLE_LOCK: Tracks tuple lock acquisition
- RELUNDO_DELTA_INSERT: Tracks columnar delta insertion

Table AM integration:
- relation_init_undo: Create UNDO fork during CREATE TABLE
- tuple_satisfies_snapshot_undo: MVCC visibility via UNDO chain
- relation_vacuum_undo: Discard old UNDO records during VACUUM

This complements cluster-wide UNDO by providing table-AM-specific
UNDO management without global coordination overhead.
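A hypothetical in-memory model of the chain structure above: the metapage head points at the newest record and each record links to its predecessor, so rollback and visibility checks walk operations newest-first. The on-disk format uses block/offset pointers within the UNDO fork; this sketch substitutes C pointers.

```c
#include <stdlib.h>

/* Record types from the commit message; struct shapes are assumptions. */
typedef enum
{
    RELUNDO_INSERT, RELUNDO_DELETE, RELUNDO_UPDATE,
    RELUNDO_TUPLE_LOCK, RELUNDO_DELTA_INSERT
} RelUndoRecType;

typedef struct RelUndoRec
{
    RelUndoRecType type;
    struct RelUndoRec *prev;    /* older record in this relation's chain */
} RelUndoRec;

typedef struct
{
    RelUndoRec *head;           /* metapage stand-in: newest record */
} RelUndoMeta;

static void
relundo_append(RelUndoMeta *meta, RelUndoRecType type)
{
    RelUndoRec *rec = malloc(sizeof(RelUndoRec));

    rec->type = type;
    rec->prev = meta->head;     /* link to previous chain head */
    meta->head = rec;
}

static int
relundo_chain_length(const RelUndoMeta *meta)
{
    int n = 0;

    for (const RelUndoRec *rec = meta->head; rec; rec = rec->prev)
        n++;
    return n;
}
```

Walking newest-first matters for rollback: operations must be undone in reverse order of application.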
Implements a minimal table access method that exercises the per-relation
UNDO subsystem. Validates end-to-end functionality: UNDO fork creation,
record insertion, chain walking, and crash recovery.

Implemented operations:
- INSERT: Full implementation with UNDO record creation
- Sequential scan: Forward-only table scan
- CREATE/DROP TABLE: UNDO fork lifecycle management
- VACUUM: UNDO record discard

This test AM stores tuples in simple heap-like pages using custom
TestUndoTamTupleHeader (t_len, t_xmin, t_self) followed by MinimalTuple
data. Pages use standard PageHeaderData and PageAddItem().

Two-phase UNDO protocol demonstration:
1. Insert tuple onto data page (PageAddItem)
2. Reserve UNDO space (RelUndoReserve)
3. Build UNDO record (header + payload)
4. Commit UNDO record (RelUndoFinish)
5. Register for rollback (RegisterPerRelUndo)

Introspection:
- test_undo_tam_dump_chain(regclass): Walk UNDO fork, return all records

Testing:
- sql/undo_tam.sql: Basic INSERT/scan operations
- t/058_undo_tam_crash.pl: Crash recovery validation

This test module is NOT suitable for production use. It serves only to
validate the per-relation UNDO infrastructure and demonstrate table AM
integration patterns.
Extends per-relation UNDO from metadata-only (MVCC visibility) to
supporting transaction rollback. When a transaction aborts, per-relation
UNDO chains are applied asynchronously by background workers.

Architecture:
- Async-only rollback via background worker pool
- Work queue protected by RelUndoWorkQueueLock
- Catalog access safe in worker (proper transaction state)
- Test helper (RelUndoProcessPendingSync) for deterministic testing

Extended data structures:
- RelUndoRecordHeader gains info_flags and tuple_len
- RELUNDO_INFO_HAS_TUPLE flag indicates tuple data present
- RELUNDO_INFO_HAS_CLR / CLR_APPLIED for crash safety

Rollback operations:
- RELUNDO_INSERT: Mark inserted tuples as LP_UNUSED
- RELUNDO_DELETE: Restore deleted tuple via memcpy (stored in UNDO)
- RELUNDO_UPDATE: Restore old tuple version (stored in UNDO)
- RELUNDO_TUPLE_LOCK: Remove lock marker
- RELUNDO_DELTA_INSERT: Restore original column data

Transaction integration:
- RegisterPerRelUndo: Track relation UNDO chains per transaction
- GetPerRelUndoPtr: Chain UNDO records within relation
- ApplyPerRelUndo: Queue work for background workers on abort
- StartRelUndoWorker: Spawn worker if none running

Async rationale:
Per-relation UNDO cannot apply synchronously during ROLLBACK because
catalog access (relation_open) is not allowed during TRANS_ABORT state.
Background workers execute in proper transaction context, avoiding the
constraint. This matches the ZHeap architecture where UNDO application
is deferred to background processes.
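The abort-time handoff can be modeled as a small ring buffer: ROLLBACK only enqueues work items, and a background worker dequeues and applies UNDO chains from a normal transaction context. This single-process toy omits the RelUndoWorkQueueLock protection; queue size and names are illustrative.

```c
#include <stdint.h>

#define RELUNDO_QUEUE_SIZE 32

typedef struct
{
    uint32_t dbid;
    uint32_t relid;
} RelUndoWorkItem;

static RelUndoWorkItem relundo_queue[RELUNDO_QUEUE_SIZE];
static int q_head, q_tail;      /* q_head == q_tail means empty */

/* Called from the abort path: just record the work, never open relations. */
static int
relundo_enqueue(uint32_t dbid, uint32_t relid)
{
    int next = (q_tail + 1) % RELUNDO_QUEUE_SIZE;

    if (next == q_head)
        return 0;               /* queue full: caller must cope */
    relundo_queue[q_tail].dbid = dbid;
    relundo_queue[q_tail].relid = relid;
    q_tail = next;
    return 1;
}

/* Called from the background worker, inside a real transaction. */
static int
relundo_dequeue(RelUndoWorkItem *item)
{
    if (q_head == q_tail)
        return 0;               /* nothing pending */
    *item = relundo_queue[q_head];
    q_head = (q_head + 1) % RELUNDO_QUEUE_SIZE;
    return 1;
}
```

The FIFO ordering means chains are applied roughly in abort order, though per-relation chains are independent.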

WAL:
- XLOG_RELUNDO_APPLY: Compensation log records (CLRs) for applied UNDO
- Prevents double-application after crash recovery

Testing:
- sql/undo_tam_rollback.sql: Validates INSERT rollback
- test_undo_tam_process_pending(): Drain work queue synchronously
Implements production-ready WAL features for the per-relation UNDO
resource manager: async I/O, consistency checking, parallel redo,
and compression validation.

Async I/O optimization:
When INSERT records reference both data page (block 0) and metapage
(block 1), issue prefetch for block 1 before reading block 0. This
allows both I/Os to proceed in parallel, reducing crash recovery stall
time. Uses pgaio batch mode when io_method is worker or io_uring.

Pattern:
  if (has_metapage && io_method != IOMETHOD_SYNC)
      pgaio_enter_batchmode();
  relundo_prefetch_block(record, 1);  // Start async read
  process_block_0();                  // Overlaps with metapage I/O
  process_block_1();                  // Should be in cache
  pgaio_exit_batchmode();

Consistency checking:
All redo functions validate WAL record fields before application:
- Bounds checks: offsets < BLCKSZ, counters within range
- Monotonicity: counters advance, pd_lower increases
- Cross-field validation: record fits within page
- Type validation: record types in valid range
- Post-condition checks: updated values are reasonable

Parallel redo support:
Implements startup/cleanup/mask callbacks required for multi-core
crash recovery:
- relundo_startup: Initialize per-backend state
- relundo_cleanup: Release per-backend resources
- relundo_mask: Mask LSN, checksum, free space for page comparison

Page dependency rules:
- Different pages replay in parallel (no ordering constraints)
- Same page: INIT precedes INSERT (enforced by page LSN)
- Metapage updates are sequential (buffer lock serialization)

Compression validation:
WAL compression (wal_compression GUC) automatically compresses full
page images via XLogCompressBackupBlock(). Test validates 40-46%
reduction for RELUNDO FPIs with lz4, pglz, and zstd algorithms.

Test: t/059_relundo_wal_compression.pl measures WAL volume with/without
compression for identical workloads.
…UNDO completions

Implement the UNDO subsystem changes needed for Constant-Time Recovery
(CTR). At abort time, transactions register in the Aborted Transaction
Map (ATM) for O(1) visibility checks instead of performing synchronous
rollback. A background Logical Revert worker lazily applies UNDO chains
from ATM entries.

Specifically:

- Add ATM shared-memory structure with 16 LWLock partitions, WAL-logged
  add/forget operations, and a redo handler (new resource manager
  RM_ATM_ID).

- Add Logical Revert background worker that scans ATM for unreverted
  entries, applies their per-relation UNDO chains, then removes them.

- Complete tuple data storage in per-relation UNDO records via new
  RelUndoFinishWithTuple() write path and working
  RelUndoReadRecordWithTuple() read path.

- Enable and complete rollback functions for all five record types
  (INSERT, DELETE, UPDATE, TUPLE_LOCK, DELTA_INSERT) in
  RelUndoApplyChain(), removing #ifdef NOT_USED guards.

- Wire in per-relation CLR (Compensation Log Record) support for
  crash-safe Logical Revert: each applied UNDO record gets a CLR
  so recovery skips already-applied operations.

- Modify abort path in ApplyPerRelUndo() to try ATM insertion first,
  falling back to synchronous rollback only when ATM is full.

- Call ATMRecoveryFinalize() after WAL redo to log unreverted entry
  count for the Logical Revert worker to process.
This commit provides examples and architectural documentation for the
UNDO subsystems. It is intended for reviewers and committers to understand
the design decisions and usage patterns.

Contents:
- 01-basic-undo-setup.sql: Cluster-wide UNDO basics
- 02-undo-rollback.sql: Rollback demonstrations
- 03-undo-subtransactions.sql: Subtransaction handling
- 04-transactional-fileops.sql: FILEOPS usage
- 05-undo-monitoring.sql: Monitoring and statistics
- 06-per-relation-undo.sql: Per-relation UNDO with test_undo_tam
- DESIGN_NOTES.md: Comprehensive architecture documentation
- README.md: Examples overview

This commit should NOT be merged. It exists only to provide context
and documentation for the patch series.
Introduce the IndexPrune framework that allows index access methods to
register callbacks for proactively pruning dead index entries when UNDO
records are discarded. This avoids accumulating dead tuples that would
otherwise require VACUUM to clean up.

Key components:
- index_prune.h: IndexPruneCallbacks structure and registration API
- index_prune.c: Registry management and IndexPruneNotifyDiscard() dispatcher
- relundo_discard.c: Hook to call IndexPruneNotifyDiscard on UNDO discard

Individual index AM implementations follow in subsequent commits.
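A sketch of how such a registry and dispatcher might fit together: each index AM registers a callbacks struct, and the discard hook fans the event out to every registered AM. Function and type names follow the commit message, but the exact shapes here are assumptions.

```c
#include <assert.h>
#include <stdint.h>

typedef struct IndexPruneCallbacks
{
    const char *am_name;
    void      (*prune_on_discard) (uint32_t relid, uint64_t discard_horizon);
} IndexPruneCallbacks;

#define MAX_PRUNE_AMS 8
static const IndexPruneCallbacks *prune_registry[MAX_PRUNE_AMS];
static int nregistered;

/* Called once per index AM at handler-registration time. */
static void
IndexPruneRegister(const IndexPruneCallbacks *cb)
{
    assert(nregistered < MAX_PRUNE_AMS);
    prune_registry[nregistered++] = cb;
}

/* Called from the UNDO discard path: notify every registered AM. */
static void
IndexPruneNotifyDiscard(uint32_t relid, uint64_t discard_horizon)
{
    for (int i = 0; i < nregistered; i++)
        prune_registry[i]->prune_on_discard(relid, discard_horizon);
}

/* Demo callback standing in for nbtprune.c's dead-entry removal. */
static int  nprune_calls;
static void
demo_btree_prune(uint32_t relid, uint64_t discard_horizon)
{
    (void) relid;
    (void) discard_horizon;
    nprune_calls++;
}

static const IndexPruneCallbacks demo_btree_cb = {
    "btree", demo_btree_prune
};
```

The indirection keeps relundo_discard.c ignorant of individual AMs; each AM opts in by registering, mirroring how the later per-AM commits slot in.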
Placeholder for index pruning design documentation.
To be populated when design notes are split by subsystem.
Register IndexPrune callbacks in the B-tree access method handler.
nbtprune.c implements dead-entry detection and removal using UNDO
discard notifications, allowing proactive cleanup without full VACUUM.
Register IndexPrune callbacks in the hash access method handler.
hashprune.c implements dead-entry detection and removal using UNDO
discard notifications for hash indexes.
Register IndexPrune callbacks in the GIN access method handler.
ginprune.c implements dead-entry detection and removal using UNDO
discard notifications for GIN indexes.
Register IndexPrune callbacks in the GiST access method handler.
gistprune.c implements dead-entry detection and removal using UNDO
discard notifications for GiST indexes.
Register IndexPrune callbacks in the SP-GiST access method handler.
spgprune.c implements dead-entry detection and removal using UNDO
discard notifications for SP-GiST indexes.
Add VACUUM statistics tracking for UNDO-pruned index entries and verbose
output. Include comprehensive test suite exercising index pruning across
all supported index access methods via test_undo_tam.
Introduce the FILEOPS deferred-operations infrastructure following the
Berkeley DB fileops.src model. Each filesystem operation is a composable
unit with its own WAL record type, redo handler, and descriptor.

This commit provides the core machinery only - no specific operations:
- PendingFileOp linked list for deferred operations
- FileOpsDoPendingOps() executor at transaction commit/abort
- Subtransaction support (AtSubCommit/AtSubAbort/PostPrepare)
- WAL resource manager shell (RM_FILEOPS_ID)
- Platform portability layer (fsync_parent, FileOpsSync)
- GUC: enable_transactional_fileops
- Transaction lifecycle hooks in xact.c

Individual operations (CREATE, DELETE, RENAME, WRITE, TRUNCATE, etc.)
are added in subsequent commits.
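The core machinery can be pictured as a LIFO list of deferred operations drained at transaction end: each entry says whether it fires on commit or on abort, and the executor runs the matching ones and empties the list. This simplified model only counts executed ops where the real code would unlink files; names follow the commit message, shapes are assumptions.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct PendingFileOp
{
    bool    at_commit;          /* run on commit (true) or on abort (false) */
    struct PendingFileOp *next;
} PendingFileOp;

static PendingFileOp *pending_head;

/* Operation handlers (CREATE, DELETE, ...) register deferred work here. */
static void
FileOpsRegister(bool at_commit)
{
    PendingFileOp *op = malloc(sizeof(PendingFileOp));

    op->at_commit = at_commit;
    op->next = pending_head;    /* LIFO, like PostgreSQL's pendingDeletes */
    pending_head = op;
}

/* Executor at transaction end: run matching ops, free everything. */
static int
FileOpsDoPendingOps(bool is_commit)
{
    int nrun = 0;
    PendingFileOp *op = pending_head;

    while (op)
    {
        PendingFileOp *next = op->next;

        if (op->at_commit == is_commit)
            nrun++;             /* real code would unlink/rmdir/etc. here */
        free(op);
        op = next;
    }
    pending_head = NULL;
    return nrun;
}
```

The split between "execute now, clean up on abort" (CREATE) and "defer until commit" (DELETE, RENAME) falls out naturally from the at_commit flag.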
Implement transactional file creation (BDB: __fop_create). Files are
created immediately so they can be used within the transaction. If
register_delete is true, the file is automatically deleted on abort.

API: FileOpsCreate(path, flags, mode, register_delete) -> fd
WAL: XLOG_FILEOPS_CREATE with idempotent redo (creates parent dirs
if missing on standbys).
Implement deferred file deletion (BDB: __fop_remove). Deletion is
scheduled for transaction commit or abort, not executed immediately.

API: FileOpsDelete(path, at_commit) -> void
WAL: XLOG_FILEOPS_DELETE (intentional no-op during redo; deletion
driven by XACT commit/abort records).
On Windows: uses pgunlink() with retry on EACCES.
Implement deferred file rename (BDB: __fop_rename). The rename is
scheduled for commit time using durable_rename() which handles fsync
ordering on Unix and MoveFileEx with retry on Windows.

API: FileOpsRename(oldpath, newpath) -> int
WAL: XLOG_FILEOPS_RENAME (intentional no-op during redo).
Implement WAL-logged file write at offset (BDB: __fop_write). Data is
written immediately using pwrite() and fsynced for durability.

API: FileOpsWrite(path, offset, data, len) -> int
WAL: XLOG_FILEOPS_WRITE with redo that replays the write.
On Windows: uses SetFilePointerEx + WriteFile via pg_pwrite.
Implement WAL-logged file truncation. Executed immediately with
XLogFlush before the irreversible operation (following SMGR_TRUNCATE
pattern). Uses ftruncate() on POSIX, SetEndOfFile() on Windows.

API: FileOpsTruncate(path, length) -> void
WAL: XLOG_FILEOPS_TRUNCATE with redo that replays the truncation.
Implement WAL-logged file metadata operations.

CHMOD: chmod() on POSIX, _chmod() on Windows with limited mode bits
(only _S_IREAD/_S_IWRITE; no group/other support).

CHOWN: chown() on POSIX, no-op with WARNING on Windows (Windows uses
ACLs for ownership, not uid/gid).

Both execute immediately and are WAL-logged for crash recovery.
MKDIR: Immediate execution using MakePGDirectory(). Registers
rmdir-on-abort for automatic cleanup on rollback. On Windows: _mkdir()
(no mode parameter, permissions inherited from parent).

RMDIR: Deferred to commit time (like DELETE). Uses rmdir() on all
platforms, _rmdir() on Windows.
SYMLINK: Immediate execution. Uses symlink() on POSIX, pgsymlink()
(NTFS junction points) on Windows. Registers delete-on-abort.

LINK: Immediate execution. Uses link() on POSIX, CreateHardLinkA()
on Windows (NTFS only). Registers delete-on-abort.

Both create links idempotently during redo (unlink first if exists).
Add extended attribute operations to the transactional file operations
framework, completing the Berkeley DB fileops.src operation set.

FileOpsSetXattr() and FileOpsRemoveXattr() provide immediate execution
with WAL logging for crash recovery replay. A new cross-platform
portability layer (src/port/pg_xattr.c) abstracts platform differences:

  - Linux: <sys/xattr.h> setxattr/removexattr
  - macOS: <sys/xattr.h> with extra options parameter
  - FreeBSD: <sys/extattr.h> extattr_set_file/extattr_delete_file
  - Windows: NTFS Alternate Data Streams via CreateFileA("path:name")
  - Fallback: returns ENOTSUP (operation succeeds in WAL but no-op
    on unsupported platforms for WAL stream portability)

Platform detection uses compiler-defined macros (__linux__, __APPLE__,
__FreeBSD__, WIN32) rather than configure-time checks, avoiding
meson.build/configure.ac complexity.
Add regression tests for all FILEOPS operations (CREATE, DELETE,
RENAME, WRITE, TRUNCATE, CHMOD, CHOWN, MKDIR, RMDIR, SYMLINK, LINK,
SETXATTR, REMOVEXATTR) and a crash recovery test for WAL replay.

Update the transactional fileops example script with the expanded
operation set following the Berkeley DB fileops.src model.
Adds opt-in UNDO support to the standard heap table access method.
When enabled, heap operations write UNDO records to enable physical
rollback without scanning the heap, and support UNDO-based MVCC
visibility determination.

How heap uses UNDO:

INSERT operations:
  - Before inserting tuple, call PrepareXactUndoData() to reserve UNDO space
  - Write UNDO record with: transaction ID, tuple TID, old tuple data (null for INSERT)
  - On abort: UndoReplay() marks tuple as LP_UNUSED without heap scan

UPDATE operations:
  - Write UNDO record with complete old tuple version before update
  - On abort: UndoReplay() restores old tuple version from UNDO

DELETE operations:
  - Write UNDO record with complete deleted tuple data
  - On abort: UndoReplay() resurrects tuple from UNDO record

MVCC visibility:
  - Tuples reference UNDO chain via xmin/xmax
  - HeapTupleSatisfiesSnapshot() can walk UNDO chain for older versions
  - Enables reconstructing tuple state as of any snapshot

Configuration:
  CREATE TABLE t (...) WITH (enable_undo=on);

The enable_undo storage parameter is per-table and defaults to off for
backward compatibility. When disabled, heap behaves exactly as before.

Value proposition:

1. Faster rollback: No heap scan required, UNDO chains are sequential
   - Traditional abort: Full heap scan to mark tuples invalid (O(n) random I/O)
   - UNDO abort: Sequential UNDO log scan (O(n) sequential I/O, better cache locality)

2. Cleaner abort handling: UNDO records are self-contained
   - No need to track which heap pages were modified
   - Works across crashes (UNDO is WAL-logged)

3. Foundation for future features:
   - Multi-version concurrency control without bloat
   - Faster VACUUM (can discard entire UNDO segments)
   - Point-in-time recovery improvements

Trade-offs:

Costs:
  - Additional writes: Every DML writes both heap + UNDO (roughly 2x write amplification)
  - UNDO log space: Requires space for UNDO records until no longer visible
  - Complexity: New GUCs (undo_retention, max_undo_workers), monitoring needed

Benefits:
  - Primarily valuable for workloads with:
    - Frequent aborts (e.g., speculative execution, deadlocks)
    - Long-running transactions needing old snapshots
    - Hot UPDATE workloads benefiting from cleaner rollback

Not recommended for:
  - Bulk load workloads (COPY: 2x write amplification without abort benefit)
  - Append-only tables (rare aborts mean cost without benefit)
  - Space-constrained systems (UNDO retention increases storage)

When beneficial:
  - OLTP with high abort rates (>5%)
  - Systems with aggressive pruning needs (frequent VACUUM)
  - Workloads requiring historical visibility (audit, time-travel queries)

Integration points:
  - heap_insert/update/delete call PrepareXactUndoData/InsertXactUndoData
  - Heap pruning respects undo_retention to avoid discarding needed UNDO
  - pg_upgrade compatibility: UNDO disabled for upgraded tables

Background workers:
  - Cluster-wide UNDO has async workers for cleanup/discard of old UNDO records
  - Rollback itself is synchronous (via UndoReplay() during transaction abort)
  - Workers periodically trim UNDO logs based on undo_retention and snapshot visibility

This demonstrates cluster-wide UNDO in production use. Note that this
differs from per-relation logical UNDO (added in subsequent patches),
which uses per-table UNDO forks and async rollback via background
workers.
Document the cluster-wide UNDO architecture including UNDO log design,
record format, transaction integration, and heap AM integration details.
Formatting-only changes from pgindent: typedef brace alignment,
pointer spacing, comment wrapping, and function argument alignment.