
spec-6.5: cluster-aware backup / restore / PITR substrate #14

Merged: sqlrush merged 9 commits into main from spec-6.5-cluster-backup-restore-pitr on Jul 1, 2026

@sqlrush (Owner) commented Jun 30, 2026

Summary

Implements the spec-6.5 catalog/shmem/wire/manifest/PITR-target substrate, with mutating physical backup and restore-point entry points failing closed until the missing correctness primitives land.

  • Adds the cluster backup manifest and restore-point/PITR helper contracts: per-thread WAL metadata, undo/TT inclusion flags, SCN cut fields, control-file inclusion, storage id, catversion, CRC32C validation, restore compatibility checks, and PITR target snapping helpers.
  • Adds SQL/catalog surface for pg_cluster_backup_start, pg_cluster_backup_stop, pg_cluster_create_restore_point, and read-only status/history/restore-point/PITR views.
  • Keeps IC request/ACK wire validation and shared-memory state as substrate, but does not claim peer backup capture or restore-point success while commit-drain and physical capture are absent.
  • Changes mutating backup/restore-point SQL paths to return feature_not_supported instead of creating an unsound restore point, publishing a partial manifest, or relying on naive cluster_scn_current() snapshots.
  • Keeps --disable-cluster builds isolated: cluster SQL symbols remain linkable, while cluster-only runtime behavior is behind USE_PGRAC_CLUSTER.
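The manifest CRC32C validation and restore compatibility checks listed above could look roughly like the sketch below. This is not the PR's code: `DemoManifest`, its field layout, and every `demo_*` function are hypothetical names chosen for illustration, and treating catversion and storage id as strict-equality checks is an assumption. Only the CRC32C algorithm itself (Castagnoli polynomial, reflected form 0x82F63B78) is standard.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t demo_crc32c(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *) data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Hypothetical manifest header (assumption, not the PR's layout).
 * Fields are ordered so the struct has no padding bytes before crc.
 */
typedef struct DemoManifest
{
    uint64_t cut_scn;          /* SCN cut of the backup */
    uint64_t storage_id;       /* identifies the storage the backup came from */
    uint32_t catversion;       /* catalog version at backup time */
    uint32_t inclusion_flags;  /* WAL / undo / TT / control-file bits */
    uint32_t wal_thread_count; /* number of per-thread WAL entries */
    uint32_t crc;              /* CRC32C of every byte before this field */
} DemoManifest;

/* Stamp the checksum over all bytes that precede the crc field. */
void demo_manifest_seal(DemoManifest *m)
{
    m->crc = demo_crc32c(m, offsetof(DemoManifest, crc));
}

/* Return 1 if the stored checksum matches the header contents. */
int demo_manifest_crc_ok(const DemoManifest *m)
{
    return m->crc == demo_crc32c(m, offsetof(DemoManifest, crc));
}

/* Refuse restore on a bad checksum or a catversion/storage mismatch. */
int demo_manifest_restorable(const DemoManifest *m,
                             uint32_t local_catversion,
                             uint64_t local_storage_id)
{
    if (!demo_manifest_crc_ok(m))
        return 0;
    return m->catversion == local_catversion &&
           m->storage_id == local_storage_id;
}
```

Checking the CRC before any other field means a corrupted manifest is rejected before its catversion or storage id is trusted.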

Merge Readiness / Fail-Closed Contract

This PR is intentionally a fail-closed substrate and is not a usable physical backup, restore, or PITR implementation.

The current merge order allows this fail-closed spec-6.5 substrate to land before spec-6.0a. That does not relax the runtime contract: pg_cluster_backup_start, pg_cluster_backup_stop, pg_cluster_create_restore_point, and any real backup/restore/PITR success path must continue to return FEATURE_NOT_SUPPORTED while physical capture, durable per-thread WAL pinning, restore-point commit-drain/fence, restore replay integration, PITR replay/open semantics, SCN high-water restore, and RESETLOGS/incarnation handling are absent.

Future work may wire the real success paths only after those correctness primitives exist and are covered by multi-node backup -> restore -> recover -> read tests. Until then, the PR must produce no successful backup manifest, restore point, or PITR claim from incomplete state.
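As a rough illustration of the contract above, a mutating entry point can be gated on a bitmask of the required correctness primitives and refuse unless every bit is present. All names below (`DEMO_PRIM_*`, `DemoResult`, `demo_backup_start_gate`) are hypothetical and do not come from the PR. In the real server the refusal would presumably be raised through PostgreSQL's `ereport` with `ERRCODE_FEATURE_NOT_SUPPORTED` rather than returned as an enum.

```c
#include <stdint.h>

/* One bit per correctness primitive named in the contract (hypothetical). */
enum
{
    DEMO_PRIM_PHYSICAL_CAPTURE    = 1u << 0,
    DEMO_PRIM_WAL_PINNING         = 1u << 1, /* durable per-thread WAL pinning */
    DEMO_PRIM_COMMIT_DRAIN_FENCE  = 1u << 2, /* restore-point commit-drain/fence */
    DEMO_PRIM_RESTORE_REPLAY      = 1u << 3, /* restore replay integration */
    DEMO_PRIM_PITR_OPEN_SEMANTICS = 1u << 4, /* PITR replay/open semantics */
    DEMO_PRIM_SCN_HIGH_WATER      = 1u << 5, /* SCN high-water restore */
    DEMO_PRIM_RESETLOGS           = 1u << 6  /* RESETLOGS/incarnation handling */
};

#define DEMO_PRIM_ALL 0x7Fu /* all seven bits above */

typedef enum DemoResult
{
    DEMO_OK,
    DEMO_FEATURE_NOT_SUPPORTED
} DemoResult;

/*
 * Fail closed: succeed only when every required primitive is available.
 * A missing bit never degrades to a partial success.
 */
DemoResult demo_backup_start_gate(uint32_t available_primitives)
{
    if ((available_primitives & DEMO_PRIM_ALL) != DEMO_PRIM_ALL)
        return DEMO_FEATURE_NOT_SUPPORTED;
    return DEMO_OK;
}
```

With this shape, landing six of the seven primitives still yields `DEMO_FEATURE_NOT_SUPPORTED`, which matches the requirement that no success is claimed from incomplete state.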

Review Fix

This revision addresses the critical review finding that the earlier draft returned success while the restore-point commit-drain barrier and physical capture path were not implemented.

  • Removed the pending_commits_empty=true / commit_fence_held=true caller lie.
  • Removed the cluster_scn_current() restore-point cut fallback from the mutating path.
  • Removed runtime manifest success assembly that marked WAL/undo/TT/control as included without capture proof.
  • Updated TAP coverage and docs to assert the honest fail-closed substrate behavior.

The full spec-6.5 physical backup/copy/restore/PITR execution path remains intentionally unclaimed in this PR. It still requires commit-drain/fence, durable WAL pinning, physical capture, restore replay integration, SCN high-water restore, RESETLOGS/incarnation handling, and multi-node restore-then-read e2e before it can be considered complete.

Tests

Rebased onto origin/main 4a6bcc7ab3 (Merge PR #13 / spec-6.3a GRD/GES lifecycle reclaim) and replayed the 6.5 changes on top of the 6.3a lifecycle-reclaim semantics.

Local validation ran on macOS, configured with --without-icu because local ICU headers/libs are unavailable:

  • make -s -j4
  • make -s install
  • make -C src/test/cluster_unit check (136 binaries)
  • src/test/cluster_unit/test_cluster_backup
  • prove -I src/test/perl -I src/test/cluster_tap src/test/cluster_tap/t/332_cluster_backup_pitr.pl using the installed tree and elevated localhost bind permission
  • scripts/ci/check-comment-headers.sh
  • scripts/ci/run-cppcheck.sh (0 findings; baseline-diff clean)
  • git diff --check
  • --disable-cluster in-tree build in /private/tmp/pgrac-worktrees/linkdb-spec-6.5-disable-2: configure without --enable-cluster, then make -s -j4
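The --disable-cluster build checked above depends on cluster SQL symbols staying linkable while runtime behavior sits behind USE_PGRAC_CLUSTER. A minimal sketch of that pattern follows. Only the macro name comes from this PR; the function names and message strings are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* True only when the build defines USE_PGRAC_CLUSTER. */
bool demo_cluster_compiled_in(void)
{
#ifdef USE_PGRAC_CLUSTER
    return true;
#else
    return false;
#endif
}

/*
 * Always compiled, so a catalog entry that references it still links in a
 * --disable-cluster build. Returns the reason the call is refused, or NULL
 * if the call may proceed.
 */
const char *demo_backup_start_refusal(bool capture_ready)
{
    if (!demo_cluster_compiled_in())
        return "cluster support is not compiled in";
    if (!capture_ready)
        return "physical capture is not implemented";
    return NULL;
}
```

Keeping the symbol unconditional and branching inside it avoids link failures in non-cluster builds, at the cost of a small stub in every binary.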

scripts/ci/check-format.sh was also run locally. It still reports pre-existing formatting violations in unrelated files outside this PR's touch set (cluster_voting_disk_io.c, cluster_grd.c, cluster_tt_local.c, cluster_ges.c, cluster_cssd.h, cluster_ic_chunk.h, cluster_itl_slot.h, cluster_gcs_block.h); the 6.5 touched files are clang-format clean.

Scope / Boundaries

This PR does not implement ADG, standby read-only service, RDMA, DRM, production storage backend/fence-driver work, or storage-provider copy plumbing from spec-6.0a. It also does not change 6.0a or 6.3a branches/worktrees.

This PR is not a shippable full backup/restore/PITR implementation. It is an honest substrate PR: read-only views and pure helper contracts exist; mutating paths fail closed when required correctness proofs are absent.

This PR does not merge main, tag, mark shipped, or do release sync work.

@sqlrush force-pushed the spec-6.5-cluster-backup-restore-pitr branch from d8d77a2 to 7c241e0 on June 30, 2026 15:01
@sqlrush force-pushed the spec-6.5-cluster-backup-restore-pitr branch from 46dede6 to c13ec81 on July 1, 2026 02:03
@sqlrush marked this pull request as ready for review on July 1, 2026 03:51
@sqlrush merged commit 955d126 into main on Jul 1, 2026
5 checks passed