15 changes: 15 additions & 0 deletions docs/cluster/ges-grant.md
@@ -16,6 +16,8 @@ cross-node grants will land in spec-2.17 (BAST + deadlock + 4-node).
|---|---|---|---|---|
| `cluster.ges_request_timeout_ms` | `60000` | `[1, 600000]` | `USERSET` | Timeout (ms) for cross-node GES grant request. Backend rolls back via GES_RELEASE on expiry. PG `lock_timeout=0` (disabled) does NOT short-circuit — falls back to this GUC. |
| `cluster.grd_max_entries` | `0` | `[0, 1048576]` | `POSTMASTER` | Size of GRD entry HTAB (per spec-2.15). `0` = skeleton mode. |
| `cluster.grd_entry_reclaim` | `on` | `bool` | `SIGHUP` | Enable safe cold reclaim of holderless GRD entries after the lookup pin drops to zero. |
| `cluster.grd_entry_reclaim_max_per_sweep` | `256` | `[0, 65536]` | `SIGHUP` | Maximum cold entries LMON attempts to reclaim per sweep. |

### Effective timeout

@@ -62,6 +64,19 @@ Four cap counters (per spec-2.16 D1) surface entry-level saturation:
- `converts_full_count`
- `ngranted_promoted_count`

spec-6.3a adds GRD entry-lifecycle counters in the `grd` category:

| Key | Meaning |
|---|---|
| `grd_entries_reclaimed_count` | Cold holderless entries removed from the GRD HTAB |
| `grd_reclaim_skipped_pinned_count` | Reclaim attempts skipped because a lookup pin was still held |
| `grd_pin_high_water` | Highest observed lookup-pin count on a single entry |
| `grd_sweep_runs` | LMON cold-reclaim sweep invocations |

Implementation details for the pin/release discipline, shard-local scan
model, and ERROR cleanup classification are in
`docs/cluster/grd-entry-lifecycle.md`.

## Wire Format

Payload bytes follow the 36-byte `ClusterICEnvelope`:
55 changes: 55 additions & 0 deletions docs/cluster/grd-entry-lifecycle.md
@@ -0,0 +1,55 @@
# GRD Entry Lifecycle And Cold Reclaim

Status: spec-6.3a implementation note.

## Lock Model

GRD entries live in the partitioned GRD HTAB and are protected in this order:

1. shard LWLock
2. entry spinlock

There is no table-wide GRD entry lock in the hot create or reclaim path.
Create, lookup, and `HASH_REMOVE` take only the target shard LWLock.

Scan paths do not use `hash_seq_search` over the live HTAB. Each entry is
linked into a shard-local intrusive list while the shard LWLock is held. A
scanner takes one shard LWLock in shared mode, copies `ClusterResId` keys,
releases the shard lock, and later looks up each key again through the normal
pin/release API. This keeps scan safety local to the shard and avoids
serializing unrelated entry churn.
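
The copy-then-revisit scan described above can be sketched in standalone C. This is a single-threaded illustration under stated assumptions, not the linkdb implementation: `DemoResId`, `DemoEntry`, `DemoShard`, and `demo_shard_copy_keys()` are hypothetical names, and a plain counter stands in for the shared-mode shard LWLock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ClusterResId; the real layout is not shown here. */
typedef struct DemoResId { uint64_t id; } DemoResId;

typedef struct DemoEntry {
    DemoResId         key;
    struct DemoEntry *next;     /* intrusive shard-local list link */
} DemoEntry;

typedef struct DemoShard {
    int        shared_holders;  /* counter standing in for a shared LWLock */
    DemoEntry *head;
} DemoShard;

/* Link an entry at the head of the shard list (the real code holds the shard lock). */
static void demo_shard_link(DemoShard *s, DemoEntry *e)
{
    e->next = s->head;
    s->head = e;
}

/*
 * Copy up to `cap` keys while the shard is "locked" in shared mode.
 * Only key values leave this function; no entry pointers escape, so the
 * caller must look each key up again before touching an entry.
 */
static size_t demo_shard_copy_keys(DemoShard *s, DemoResId *out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL || cap == 0);
    s->shared_holders++;                    /* acquire shard lock (shared) */
    for (DemoEntry *e = s->head; e != NULL && n < cap; e = e->next)
        out[n++] = e->key;                  /* copy by value */
    s->shared_holders--;                    /* release before per-entry work */
    return n;
}
```

Because the scanner keeps only copied keys, an entry that is reclaimed between the copy and the later lookup simply fails to resolve, rather than leaving a dangling pointer.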

## Pin Discipline

`cluster_grd_entry_lookup_or_create()` pins an entry before releasing the
shard LWLock. Callers must pair every successful lookup with
`cluster_grd_entry_release()`.

`cluster_grd_entry_release()` copies the `ClusterResId` before decrementing
the pin. After the decrement publishes `pin == 0`, release does not
dereference the old entry pointer. Last-unpin reclaim re-enters by copied
resource id, takes the shard LWLock in exclusive mode, revalidates
cold state under the entry spinlock, sets `RECLAIMING`, unlinks the entry
from the shard list, and then calls `HASH_REMOVE`.

Cold means `pin == 0` and no holders, waiters, converts, or reservations.
Entries with live state are never reclaimed.
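
A minimal single-threaded model of the release path may help make the ordering concrete. Everything here is an assumption made for illustration: `DemoTable`, `demo_lookup_or_create()`, and `demo_release()` are hypothetical, a small array replaces the HTAB, a single `holders` field stands in for holders, waiters, converts, and reservations, and the locks and `RECLAIMING` flag are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SLOTS 8

typedef struct DemoEntry {
    bool     used;
    uint64_t key;       /* stands in for ClusterResId */
    int      pin;       /* lookup pin count */
    int      holders;   /* stands in for all live state on the entry */
} DemoEntry;

typedef struct DemoTable {
    DemoEntry slots[DEMO_SLOTS];
    int       reclaimed;
} DemoTable;

static DemoEntry *demo_find(DemoTable *t, uint64_t key)
{
    for (size_t i = 0; i < DEMO_SLOTS; i++)
        if (t->slots[i].used && t->slots[i].key == key)
            return &t->slots[i];
    return NULL;
}

/* Return the entry for `key`, pinned; NULL if the table is full. */
static DemoEntry *demo_lookup_or_create(DemoTable *t, uint64_t key)
{
    DemoEntry *e = demo_find(t, key);

    if (e == NULL) {
        for (size_t i = 0; i < DEMO_SLOTS && e == NULL; i++)
            if (!t->slots[i].used)
                e = &t->slots[i];
        if (e == NULL)
            return NULL;
        e->used = true;
        e->key = key;
        e->pin = 0;
        e->holders = 0;
    }
    e->pin++;           /* pin before the (omitted) shard lock is dropped */
    return e;
}

/* Drop one pin; on the last unpin, re-enter by copied key and reclaim if cold. */
static void demo_release(DemoTable *t, DemoEntry *e)
{
    uint64_t key = e->key;      /* copy the key before decrementing */
    bool     last;

    assert(e->pin > 0);
    last = (--e->pin == 0);
    e = NULL;                   /* the old pointer is not used past this point */
    if (!last)
        return;

    DemoEntry *again = demo_find(t, key);   /* fresh lookup by copied key */
    if (again != NULL && again->pin == 0 && again->holders == 0) {
        again->used = false;                /* cold: remove from the table */
        t->reclaimed++;
    }
}
```

The fresh lookup matters because, in the concurrent setting the document describes, another backend could have pinned or repopulated the entry between the decrement and the reclaim attempt; revalidating cold state is what makes the removal safe.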

## ERROR Cleanup Audit

Pinned windows are intentionally short. spec-6.3a classifies lookup sites as:

| Class | Sites | Cleanup rule |
|---|---|---|
| F | Snapshot walkers, cleanup sweeps, normal grant/release/convert mutators | The pinned window performs only fixed-size copies, spinlock-protected array mutation, and atomics, with no allocation or visitor callback. External WFG refresh and SQL row visitors run after release. |
| T | Starvation fairness grant-barrier LMD submit/cancel while a pin is held | Wrapped by `grd_lmd_submit_wait_edge_pinned()` / `grd_lmd_cancel_wait_edge_pinned()`. `PG_CATCH` releases the entry pin and rethrows. |
| R | none in spec-6.3a | No long-lived GRD entry pin is registered in `ResourceOwner`. Future paths that keep a pin across arbitrary backend code must add ResourceOwner tracking or a local `PG_TRY` guard. |

The unit case `test_grd_pin_cleanup_on_lmd_submit_error` injects an ERROR
through the pinned LMD submit path and verifies the pin is not leaked.
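
The class T rule (release the pin in the error handler, then rethrow) can be illustrated without PostgreSQL headers. The sketch below is an analogue only: it uses plain `setjmp`/`longjmp` in place of `PG_TRY`/`PG_CATCH` (which PostgreSQL builds on `sigsetjmp`), and `demo_submit_pinned()`, `demo_throw()`, and `demo_pin` are hypothetical names. The bodies of the real `grd_lmd_*_pinned()` wrappers are not shown in this change.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

static jmp_buf *demo_handler;   /* innermost handler, akin to PG_exception_stack */
static int      demo_pin;       /* stand-in for the entry pin count */

/* Raise an "ERROR": jump to the innermost registered handler. */
static void demo_throw(void)
{
    assert(demo_handler != NULL);
    longjmp(*demo_handler, 1);
}

/* Stand-in for the LMD submit call, which may fail. */
static void demo_submit(bool fail)
{
    if (fail)
        demo_throw();
}

/* Hold a pin across the submit; on error, drop the pin and rethrow. */
static void demo_submit_pinned(bool fail)
{
    jmp_buf  local;
    jmp_buf *outer = demo_handler;  /* set before setjmp, never modified after */

    demo_pin++;
    demo_handler = &local;
    if (setjmp(local) == 0) {
        demo_submit(fail);
        demo_handler = outer;       /* normal exit: restore outer handler */
    } else {
        demo_handler = outer;       /* error path: restore, unpin, rethrow */
        demo_pin--;
        demo_throw();
    }
    demo_pin--;                     /* normal path release */
}

/* Top-level caller; returns true if an error propagated out. */
static bool demo_run(bool fail)
{
    jmp_buf top;

    demo_handler = &top;
    if (setjmp(top) == 0) {
        demo_submit_pinned(fail);
        demo_handler = NULL;
        return false;
    }
    demo_handler = NULL;
    return true;
}
```

In both outcomes the pin count returns to its starting value, which is the property the unit case above checks for the real submit path.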

## Tests

The cluster unit lifecycle suite covers paired pin/release, last-unpin cold
reclaim, periodic sweep reclaim, live-state exclusion, large sweep batches,
over-release fail-safe behavior, and the pinned LMD ERROR cleanup path.
5 changes: 4 additions & 1 deletion docs/reference/ges-lock-modes.md
@@ -77,7 +77,10 @@ Ordinary lock acquisition, waiting, and release across nodes are unaffected.

The `nconverts` column of `pg_cluster_grd_entries` reports the number of
pending conversion requests queued on a resource; it is `0` in normal
operation.
operation. The diagnostic view snapshots entry keys by walking shard-local
entry lists under each shard lock and then re-looks up each entry through the
normal pin/release API before taking the per-entry snapshot, so cold reclaim
can safely remove holderless entries while diagnostics are active.

## Blocking notifications

49 changes: 46 additions & 3 deletions docs/user-guide/configuration.md
@@ -3,14 +3,16 @@
linkdb uses two configuration mechanisms layered on top of standard
PostgreSQL configuration:

1. **`postgresql.conf`** — standard PG config plus three new
`cluster.*` GUCs added by linkdb's cluster subsystem.
1. **`postgresql.conf`** — standard PG config plus the `cluster.*`
GUCs added by linkdb's cluster subsystem.
2. **`pgrac.conf`** — INI-style file describing the cluster
topology (the list of nodes that participate in the cluster).

## cluster.* GUCs

All three GUCs require server restart to change (PGC_POSTMASTER).
Most bootstrap and storage-routing GUCs require server restart to
change (PGC_POSTMASTER). Runtime maintenance knobs are marked with
their own context below.

### `cluster.node_id`

@@ -138,6 +140,47 @@ Reserved for future timeout enforcement on `tier1` peer `recv(2)`. Currently in

Upper bound on the payload size accepted by the chunked-send API. A caller asking to send a larger payload is rejected outright with `ERRCODE_PROGRAM_LIMIT_EXCEEDED` rather than silently truncating. Increase this when the workload expects larger cross-node messages; the hard cap is 256 MB.

### `cluster.grd_max_entries`

| | |
|---|---|
| Type | integer |
| Default | `0` |
| Range | `0` – `1048576` |
| Context | postmaster |

Capacity of the GRD resource-entry hash table. `0` keeps the entry
table disabled for skeleton-mode deployments. Values above zero enable
GES resource tracking; empty entries are eligible for lifecycle reclaim
once they have no holders, waiters, converts, reservations, or lookup
pins.

### `cluster.grd_entry_reclaim`

| | |
|---|---|
| Type | bool |
| Default | `on` |
| Context | sighup |

Enables safe cold reclaim of holderless GRD entries. The lookup/release
pin discipline remains active even when this is off; disabling reclaim
only prevents `HASH_REMOVE` so operators can preserve entries for
diagnostics during investigation.

### `cluster.grd_entry_reclaim_max_per_sweep`

| | |
|---|---|
| Type | integer |
| Default | `256` |
| Range | `0` – `65536` |
| Context | sighup |

Maximum number of cold entries LMON attempts to reclaim during one
sweep. `0` disables periodic sweeps, while last-pin release can still
reclaim a cold entry when `cluster.grd_entry_reclaim = on`.
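
How the two reclaim GUCs might interact during one LMON sweep is sketched below. This is a model under assumptions, not the linkdb sweep: `demo_sweep()` and `DemoEntry` are hypothetical, and the sketch assumes that a holderless but still-pinned entry counts as an attempt and is tallied as skipped, which the documentation above does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoEntry {
    bool used;
    int  pin;       /* lookup pins */
    int  holders;   /* stands in for holders/waiters/converts/reservations */
} DemoEntry;

/*
 * One sweep over `n` entries. `reclaim_on` models cluster.grd_entry_reclaim
 * and `max_per_sweep` models cluster.grd_entry_reclaim_max_per_sweep.
 * Returns the number of entries reclaimed in this sweep.
 */
static size_t demo_sweep(DemoEntry *e, size_t n, bool reclaim_on,
                         size_t max_per_sweep, size_t *skipped_pinned)
{
    size_t attempts = 0;
    size_t reclaimed = 0;

    if (!reclaim_on || max_per_sweep == 0)
        return 0;                       /* periodic sweep disabled */

    for (size_t i = 0; i < n && attempts < max_per_sweep; i++) {
        if (!e[i].used || e[i].holders > 0)
            continue;                   /* live state is never a candidate */
        attempts++;
        if (e[i].pin > 0) {
            (*skipped_pinned)++;        /* assumed: counted as a skipped attempt */
            continue;
        }
        e[i].used = false;              /* cold: reclaim */
        reclaimed++;
    }
    return reclaimed;
}
```

With a cap of `2`, a sweep in this model stops after two attempts even if further cold entries remain, leaving them for a later sweep.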

### `cluster.interconnect_chunk_reassembly_timeout_ms`

| | |
14 changes: 12 additions & 2 deletions src/backend/cluster/cluster_debug.c
@@ -968,8 +968,8 @@ dump_scn(ReturnSetInfo *rsinfo)
/*
* dump_grd -- spec-2.14 D6 GRD routing substrate observability.
*
* Emits 14 rows under category='grd' (8 from spec-2.14 + 6 from
* spec-2.15 entry-table infrastructure):
* Emits core routing rows plus entry lifecycle counters under
* category='grd':
* - grd_shard_count: 4096 (constant)
* - grd_local_master_count: shards mastered by self node
* - grd_remote_master_count: 4096 - local (SQL-friendly though derivable)
@@ -984,6 +984,10 @@ dump_scn(ReturnSetInfo *rsinfo)
* - grd_entry_create_count: lifetime created entries
* - grd_entry_lookup_hit_count: lifetime OK lookups
* - grd_entry_full_count: lifetime FULL returns
* - grd_entries_reclaimed_count: lifetime cold entry removes
* - grd_reclaim_skipped_pinned_count: reclaim skipped because pin>0
* - grd_pin_high_water: max observed per-entry pin count
* - grd_sweep_runs: LMON reclaim sweep invocations
*
* Counter invariant (v0.4 P1.2):
* grd_shard_lookup_count >=
@@ -1025,6 +1029,12 @@ dump_grd(ReturnSetInfo *rsinfo)
fmt_int64((int64)cluster_grd_entry_lookup_hit_count()));
emit_row(rsinfo, "grd", "grd_entry_full_count",
fmt_int64((int64)cluster_grd_entry_full_count()));
emit_row(rsinfo, "grd", "grd_entries_reclaimed_count",
fmt_int64((int64)cluster_grd_entries_reclaimed_count()));
emit_row(rsinfo, "grd", "grd_reclaim_skipped_pinned_count",
fmt_int64((int64)cluster_grd_reclaim_skipped_pinned_count()));
emit_row(rsinfo, "grd", "grd_pin_high_water", fmt_int64((int64)cluster_grd_pin_high_water()));
emit_row(rsinfo, "grd", "grd_sweep_runs", fmt_int64((int64)cluster_grd_sweep_runs()));

emit_row(rsinfo, "grd", "grd_holders_full_count",
fmt_int64((int64)cluster_grd_holders_full_count()));