Merged
2 changes: 1 addition & 1 deletion .github/workflows/fast.yml
@@ -249,7 +249,7 @@ jobs:
# Full cluster_tap suite + 2-node ClusterPair + heartbeat round-
# trip + Stage 2/3 medium perf matrix tests run in nightly.yml.
make -C src/test/cluster_tap check \
-PROVE_TESTS="t/010_views.pl t/030_acceptance.pl t/050_shared_storage_initdb.pl t/200_stage2_acceptance_capability.pl t/226_stage3_mvcc_acceptance_capability.pl t/273_stage4_recovery_acceptance_capability.pl"
+PROVE_TESTS="t/010_views.pl t/030_acceptance.pl t/050_shared_storage_initdb.pl t/200_stage2_acceptance_capability.pl t/226_stage3_mvcc_acceptance_capability.pl t/273_stage4_recovery_acceptance_capability.pl t/332_block_device_backend.pl t/333_block_device_multinode.pl"

- name: Upload regression diffs on failure
if: failure()
4 changes: 4 additions & 0 deletions .github/workflows/nightly.yml
@@ -157,6 +157,10 @@ jobs:
# heap-ITL WAL measure / t/330 production-bench-subset / t/331 4-node
# reconfig fault matrix.
- { name: stage5-integrated-acceptance, ranges: "327-331", unit: false, regress: false }
# spec-6.0a production shared-storage backend matrix. The first
# shard covers the CI-portable block_device raw-image e2e; hardware
# O_DIRECT / SCSI-3 PR legs remain external/manual.
- { name: stage6-storage, ranges: "332-339", unit: false, regress: false }
steps:
- name: Checkout
uses: actions/checkout@v4
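
How the nightly workflow consumes a shard's `ranges` value is not shown in this diff. As a minimal sketch, assuming a range such as `332-339` selects the `t/NNN_*.pl` files whose numeric prefix falls inside it, a hypothetical helper `expand_range` (not part of this PR) could look like this:

```shell
# Hypothetical helper, not part of the PR: list the test files in a directory
# whose numeric prefix lies inside an inclusive "LO-HI" range.
expand_range() {
    local range="$1"
    local dir="$2"
    local lo="${range%-*}"
    local hi="${range#*-}"
    local f base num

    for f in "$dir"/[0-9]*_*.pl; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$f" ] || continue
        base="${f##*/}"
        num="${base%%_*}"
        # Ignore files whose prefix is not purely numeric.
        case "$num" in
            *[!0-9]*) continue ;;
        esac
        # The test builtin compares in base 10, so "010" is treated as 10.
        if [ "$num" -ge "$lo" ] && [ "$num" -le "$hi" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Under that assumption, `expand_range 332-339 src/test/cluster_tap/t` would pick up the two block_device tests that this PR also adds to `fast.yml`.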
13 changes: 13 additions & 0 deletions .github/workflows/perf.yml
@@ -162,6 +162,19 @@ jobs:
--out "scripts/perf/results/cr-profile-${{ github.run_id }}.csv" \
| tee "scripts/perf/results/cr-profile-${{ github.run_id }}.log"

      # spec-6.0a D7: storage I/O report-only matrix. The default CI leg uses a
      # regular-file raw image with O_DIRECT disabled, so it is a conformance and
      # trend artifact rather than a hardware O_DIRECT claim.
      - name: Storage I/O matrix (warn-only, Linux)
        if: runner.os == 'Linux'
        continue-on-error: true
        run: |
          mkdir -p scripts/perf/results
          PGRAC_ENABLE_INSTALL=$HOME/linkdb-install \
          STORAGE_IO_DURATION="${STORAGE_IO_DURATION:-10}" \
          STORAGE_IO_SCALE="${STORAGE_IO_SCALE:-5}" \
          scripts/perf/run-storage-io-matrix.sh

      - name: Collect perf artifacts
        if: always()
        run: |
13 changes: 13 additions & 0 deletions docs/cluster/shared-storage-backends.md
@@ -0,0 +1,13 @@
# Shared-Storage Backends

## spec-6.0a Implementation Notes

spec-6.0a lands the `block_device` production shared-storage backend on top of the `ClusterSharedFsOps` provider framework. The CI-portable path uses a regular-file raw image with `cluster.block_device_use_odirect=off`; production deployments should use a persistent block-device path with direct I/O enabled.

The implementation intentionally deviates from the frozen spec in the following ways:

- The raw backend opens the device with `BasicOpenFile(..., PG_O_DIRECT)` instead of adding a PostgreSQL `fd.c` VFD substrate. This keeps the PG buffered file path untouched and matches the voting-disk raw-fd precedent. The direct-I/O contract remains fail-closed at backend startup: unsupported `PG_O_DIRECT` or incompatible `BLCKSZ`/`PG_IO_ALIGN_SIZE` raises `cluster_storage_io_alignment`.
- `cluster.block_device_path` accepts either a block device or a regular-file raw image. Regular files are accepted for CI and development conformance tests only and emit a startup warning.
- The frozen spec reserved SQLSTATEs `58R02` and `58R03`, but the current `main` branch already uses them. This implementation uses `58R14` for `cluster_storage_io_alignment` and `58R15` for `cluster_storage_fence_unavailable` instead.
- SCSI-3 PR coverage in CI is limited to fail-closed forced-driver behavior on a non-PR raw image and unit coverage for node-key derivation. Hardware PR probe/register legs require a real SG_IO-capable device and remain external/manual release evidence.
- The raw layout implementation currently lives in `cluster_shared_fs_block_device.c`. A future cleanup should split the on-device layout/allocator/cache code into raw-layout-specific files without changing the storage contract.
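
The CI and production shapes described above differ only in the device path and the direct-I/O switch. A minimal sketch, assuming a hypothetical helper `block_device_conf` (not part of this PR) and using only the three GUC names that appear in this change, could emit the matching `postgresql.conf` fragment:

```shell
# Hypothetical helper, not part of the PR: print postgresql.conf lines for the
# block_device backend. Direct I/O is switched on only when the path is a real
# block device; a regular-file raw image keeps it off (CI/development shape).
block_device_conf() {
    local path="$1"
    local odirect="off"

    if [ -b "$path" ]; then
        odirect="on"
    fi

    printf "cluster.shared_storage_backend = block_device\n"
    printf "cluster.block_device_path = '%s'\n" "$path"
    printf "cluster.block_device_use_odirect = %s\n" "$odirect"
}
```

For example, `block_device_conf /tmp/raw.img >> "$PGDATA/postgresql.conf"` would produce the regular-file shape with direct I/O off. Whether a given block device also satisfies the startup alignment check is decided by the backend itself, not by this sketch.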
11 changes: 11 additions & 0 deletions docs/perf-gates.md
@@ -99,6 +99,17 @@ gh workflow run perf.yml -R sqlrush/linkdb

CI (the GitHub Actions perf workflow) uploads the artifact `perf-2node-baseline-{ubuntu,macos}-<run_id>`, with a retention of 60 days.

### Storage I/O Matrix (spec-6.0a, report-only)

The production shared-storage backend work adds a storage I/O report, generated by:

```bash
PGRAC_ENABLE_INSTALL=$HOME/linkdb-install \
./scripts/perf/run-storage-io-matrix.sh
```

The default CI shape uses a regular-file raw image with `cluster.block_device_use_odirect=off`, so the artifact is a conformance/trend signal, not a hardware O_DIRECT claim. Set `STORAGE_IO_ODIRECT=on` only in a verified block-device environment where the soundness gate has confirmed direct-I/O alignment behavior.
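
Because the report carries its own soundness marker, a consumer can refuse to treat a CI artifact as a hardware measurement. A minimal sketch, assuming the `"block_device_odirect"` field layout written by `run-storage-io-matrix.sh` and a hypothetical helper name `report_used_odirect`:

```shell
# Hypothetical helper, not part of the PR: succeed only when a storage I/O
# report was produced with O_DIRECT enabled. This text match depends on the
# field layout the script writes; a real JSON parser would be more robust.
report_used_odirect() {
    local report="$1"
    local value

    value="$(sed -n 's/.*"block_device_odirect": *"\([^"]*\)".*/\1/p' "$report" | head -n 1)"
    [ "$value" = "on" ]
}
```

A caller might write `if report_used_odirect "$report"; then ...; else echo "report-only"; fi`. An "unavailable" report has no such field, so it also falls into the report-only branch.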

---

## 5. Ship decision tree (simplified)
4 changes: 2 additions & 2 deletions docs/reference/system-views.md
@@ -152,7 +152,7 @@ SELECT role, count(*) FROM pg_cluster_nodes GROUP BY role;
## pg_stat_cluster_wait_events

Lists the cluster-specific wait event registry on the local node.
-Always returns 46 rows in `--enable-cluster` builds (one per
+Always returns 110 rows in `--enable-cluster` builds (one per
registered cluster wait event).

### Columns
@@ -180,7 +180,7 @@ See [Wait events](wait-events.md) for the full event roster.
## pg_stat_gcluster_wait_events

Cross-node placeholder for cluster-wide wait events. In the
-current release returns 46 rows for the local node only;
+current release, it returns 110 rows for the local node only;
`node_id` is always the value of the local `cluster.node_id` GUC.

The column shape `(node_id, type, name)` is the public contract
25 changes: 22 additions & 3 deletions docs/reference/wait-events.md
@@ -1,7 +1,7 @@
# Cluster wait events

-linkdb registers 46 cluster-specific wait events distributed across
-10 classes. Each row in `pg_stat_cluster_wait_events` corresponds
+linkdb registers 110 cluster-specific wait events distributed across
+11 classes. Each row in `pg_stat_cluster_wait_events` corresponds
to one entry in this table.

The values appear in the standard `pg_stat_activity.wait_event_type`
@@ -140,10 +140,29 @@ Active Data Guard / read-only standby coordination.
| `AdgReadSnapshotWait` | Waiting for a read snapshot to be released |
| `AdgScnSyncWait` | Waiting for SCN sync between primary and standby |

## Cluster: SharedFs (12 events)

Shared-storage provider and raw block-device I/O.

| Name | Description |
|---|---|
| `ClusterSharedFsRead` | Waiting for generic shared-storage read |
| `ClusterSharedFsWrite` | Waiting for generic shared-storage write |
| `ClusterSharedFsExtend` | Waiting for generic shared-storage extend |
| `ClusterSharedFsTruncate` | Waiting for generic shared-storage truncate |
| `ClusterSharedFsFsync` | Waiting for generic shared-storage fsync |
| `ClusterBlockDeviceRead` | Waiting for raw block-device read |
| `ClusterBlockDeviceWrite` | Waiting for raw block-device write |
| `ClusterBlockDevicePrefetch` | Waiting for raw block-device prefetch hint |
| `ClusterBlockDeviceWriteback` | Waiting for raw block-device writeback hint |
| `ClusterBlockDeviceSync` | Waiting for raw block-device barrier sync |
| `ClusterBlockDevicePrProbe` | Waiting for SCSI-3 PR capability probe |
| `ClusterBlockDevicePrRegister` | Waiting for SCSI-3 PR own-key registration |

## Querying

```sql
--- Total registered (46):
+-- Total registered (110):
SELECT count(*) FROM pg_stat_cluster_wait_events;

-- Per-class counts:
148 changes: 148 additions & 0 deletions scripts/perf/run-storage-io-matrix.sh
@@ -0,0 +1,148 @@
#!/bin/bash
#-------------------------------------------------------------------------
#
# run-storage-io-matrix.sh
# spec-6.0a storage I/O conformance/perf report-only matrix.
#
# Runs a small single-node pgbench sample through the normal local
# backend and the raw block_device backend over a CI-portable regular
# file image. The block_device leg disables O_DIRECT unless the caller
# opts in, so loopback numbers are report-only and carry a soundness
# marker instead of pretending to be hardware O_DIRECT measurements.
#
# IDENTIFICATION
# scripts/perf/run-storage-io-matrix.sh
#
# Author: SqlRush <sqlrush@gmail.com>
#
# Portions Copyright (c) 2026, pgrac contributors
#
# Spec: spec-6.0a-production-shared-storage-backend-matrix.md (D7)
#
#-------------------------------------------------------------------------
set -euo pipefail

INSTALL="${PGRAC_ENABLE_INSTALL:-$HOME/linkdb-install}"
SCALE="${STORAGE_IO_SCALE:-5}"
DURATION="${STORAGE_IO_DURATION:-10}"
CLIENTS="${STORAGE_IO_CLIENTS:-2}"
JOBS="${STORAGE_IO_JOBS:-2}"
RAW_MB="${STORAGE_IO_RAW_MB:-192}"
ODIRECT="${STORAGE_IO_ODIRECT:-off}"
OUTDIR="$(cd "$(dirname "$0")" && pwd)/results"
STAMP="$(date +%Y%m%d-%H%M%S)"
OUT="$OUTDIR/storage-io-matrix-$STAMP.json"
WORK="$(mktemp -d /tmp/pgrac-storage-io.XXXXXX)"

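cleanup() {
    rm -rf "$WORK"
}
trap cleanup EXIT

mkdir -p "$OUTDIR"

# Escape backslashes and double quotes so a value can be embedded in a JSON
# string. Control characters are not handled.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Write an "unavailable" report and exit successfully (report-only matrix).
write_unavailable() {
    local reason="$1"

    cat > "$OUT" <<EOF
{"status":"unavailable","reason":"$(json_escape "$reason")","install":"$(json_escape "$INSTALL")"}
EOF
    echo "storage I/O matrix unavailable: $reason" >&2
    echo "results: $OUT"
    exit 0
}

# The install path is escaped here too, so the report stays valid JSON even
# if the prefix contains quotes or backslashes.
if [ ! -x "$INSTALL/bin/initdb" ]; then
    cat > "$OUT" <<EOF
{"status":"unavailable","reason":"install prefix not found","install":"$(json_escape "$INSTALL")"}
EOF
    echo "storage I/O matrix unavailable: install prefix not found at $INSTALL" >&2
    echo "results: $OUT"
    exit 0
fi

PATH="$INSTALL/bin:$PATH"
export PGHOST="$WORK"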

# Initialize a throwaway single-node cluster for the given backend, run a
# short pgbench sample against it, and print the measured TPS on stdout.
bench_backend() {
    local backend="$1"
    local port="$2"
    local pgdata="$WORK/pgdata_$backend"
    local raw_image="$WORK/raw_$backend.img"
    local log="$WORK/log_$backend"
    local tps

    initdb -D "$pgdata" -A trust -N > /dev/null || return 1
    {
        echo "port = $port"
        echo "unix_socket_directories = '$WORK'"
        echo "listen_addresses = ''"
        echo "cluster.enabled = on"
        echo "cluster.node_id = 0"
        echo "cluster.allow_single_node = on"
        echo "cluster.smgr_user_relations = on"
        echo "autovacuum = off"
        echo "shared_buffers = '128MB'"
        echo "cluster.shared_storage_backend = $backend"
        if [ "$backend" = "block_device" ]; then
            # Sparse regular-file image standing in for a block device.
            truncate -s "${RAW_MB}M" "$raw_image"
            echo "cluster.block_device_path = '$raw_image'"
            echo "cluster.block_device_use_odirect = $ODIRECT"
        fi
    } >> "$pgdata/postgresql.conf"

    pg_ctl -D "$pgdata" -l "$log" -w start > /dev/null || return 1
    if ! pgbench -p "$port" -i -s "$SCALE" postgres > /dev/null 2>&1; then
        pg_ctl -D "$pgdata" -m fast -w stop > /dev/null || true
        return 1
    fi
    # Take the first "tps =" line from the pgbench summary.
    tps=$(pgbench -p "$port" -c "$CLIENTS" -j "$JOBS" -T "$DURATION" postgres 2>/dev/null \
        | awk '/tps =/ {print $3; exit}') || {
        pg_ctl -D "$pgdata" -m fast -w stop > /dev/null || true
        return 1
    }
    if [ -z "$tps" ]; then
        pg_ctl -D "$pgdata" -m fast -w stop > /dev/null || true
        return 1
    fi
    pg_ctl -D "$pgdata" -m fast -w stop > /dev/null || return 1

    printf '%s' "$tps"
}

if ! bench_backend local 54601 > "$WORK/tps_local"; then
    write_unavailable "local backend benchmark failed"
fi
if ! bench_backend block_device 54602 > "$WORK/tps_block"; then
    write_unavailable "block_device backend benchmark failed"
fi

TPS_LOCAL="$(cat "$WORK/tps_local")"
TPS_BLOCK="$(cat "$WORK/tps_block")"

cat > "$OUT" <<EOF
{
  "status": "ok",
  "soundness": {
    "block_device_odirect": "$(json_escape "$ODIRECT")",
    "ci_shape": "regular-file raw image; report-only unless STORAGE_IO_ODIRECT=on on a verified block device"
  },
  "settings": {
    "scale": $SCALE,
    "duration_seconds": $DURATION,
    "clients": $CLIENTS,
    "jobs": $JOBS,
    "raw_mb": $RAW_MB
  },
  "results": {
    "local_tps": "$TPS_LOCAL",
    "block_device_tps": "$TPS_BLOCK"
  }
}
EOF

echo "storage I/O matrix results: $OUT"
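
The script records both TPS figures but leaves any comparison to the reader. As a minimal sketch of one way to consume the report, assuming the `"local_tps"` and `"block_device_tps"` field layout written above and a hypothetical helper name `tps_ratio`, the relative throughput could be computed like this:

```shell
# Hypothetical helper, not part of the PR: print block_device_tps / local_tps
# from a report written by run-storage-io-matrix.sh. The fields are extracted
# by text matching; a real JSON parser would be more robust.
tps_ratio() {
    local report="$1"
    local local_tps block_tps

    local_tps="$(sed -n 's/.*"local_tps": *"\([^"]*\)".*/\1/p' "$report" | head -n 1)"
    block_tps="$(sed -n 's/.*"block_device_tps": *"\([^"]*\)".*/\1/p' "$report" | head -n 1)"

    # Fail instead of dividing when a value is missing or local_tps is zero.
    awk -v l="$local_tps" -v b="$block_tps" 'BEGIN {
        if (l == "" || b == "" || l + 0 == 0) exit 1
        printf "%.3f\n", b / l
    }'
}
```

With the default regular-file image and O_DIRECT off, such a ratio is a trend signal only, in line with the `soundness` block of the report.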
2 changes: 2 additions & 0 deletions src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,11 @@ subdir = src/backend/access/rmgrdesc
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global

# PGRAC: spec-6.0a adds clusterrawdesc.o for RM_CLUSTER_RAW_LAYOUT.
OBJS = \
brindesc.o \
clogdesc.o \
clusterrawdesc.o \
clusterundodesc.o \
committsdesc.o \
dbasedesc.o \
63 changes: 63 additions & 0 deletions src/backend/access/rmgrdesc/clusterrawdesc.c
@@ -0,0 +1,63 @@
/*-------------------------------------------------------------------------
 *
 * clusterrawdesc.c
 *    rmgr descriptor for RM_CLUSTER_RAW_LAYOUT.
 *
 * Human-readable WAL descriptor/identifier for the spec-6.0a raw
 * block-device layout metadata resource manager. pg_waldump and
 * backend rmgrdesc callers use this file to decode raw layout metadata
 * page-image records without needing the block-device provider itself.
 *
 * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 * Portions Copyright (c) 2026, pgrac contributors
 *
 * Author: SqlRush <sqlrush@gmail.com>
 *
 * IDENTIFICATION
 *    src/backend/access/rmgrdesc/clusterrawdesc.c
 *
 * NOTES
 *    This is a pgrac-original file (no derivation from PostgreSQL).
 *    Spec: spec-6.0a-production-shared-storage-backend-matrix.md
 *    (FROZEN, RM_CLUSTER_RAW_LAYOUT descriptor surface).
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#ifdef USE_PGRAC_CLUSTER
#include "cluster/storage/cluster_raw_xlog.h"

void
cluster_raw_layout_desc(StringInfo buf, XLogReaderState *record)
{
    char *payload = XLogRecGetData(record);
    uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

    switch (info) {
    case XLOG_CLUSTER_RAW_LAYOUT_WRITE: {
        xl_cluster_raw_layout_write *rec = (xl_cluster_raw_layout_write *)payload;

        appendStringInfo(buf, "offset " UINT64_FORMAT " nbytes %u (metadata page image)",
                         rec->offset, rec->nbytes);
        break;
    }
    default:
        appendStringInfo(buf, "unknown op %u", info);
        break;
    }
}

const char *
cluster_raw_layout_identify(uint8 info)
{
    switch (info & ~XLR_INFO_MASK) {
    case XLOG_CLUSTER_RAW_LAYOUT_WRITE:
        return "RAW_LAYOUT_WRITE";
    default:
        return NULL;
    }
}

#endif /* USE_PGRAC_CLUSTER */
3 changes: 3 additions & 0 deletions src/backend/access/rmgrdesc/meson.build
@@ -1,9 +1,12 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group

# used by frontend programs like pg_waldump
# PGRAC: spec-6.0a adds clusterrawdesc.c for RM_CLUSTER_RAW_LAYOUT.
rmgr_desc_sources = files(
'brindesc.c',
'clogdesc.c',
'clusterrawdesc.c',
'clusterundodesc.c',
'committsdesc.c',
'dbasedesc.c',
'genericdesc.c',