sqlrush · sqlrush · Jul 1, 2026 · Jun 30, 2026 · Jun 30, 2026 · Jun 30, 2026
@@ -16,6 +16,8 @@ cross-node grants will land in spec-2.17 (BAST + deadlock + 4-node).
 |---|---|---|---|---|
 | `cluster.ges_request_timeout_ms` | `60000` | `[1, 600000]` | `USERSET` | Timeout (ms) for cross-node GES grant request.  Backend rolls back via GES_RELEASE on expiry.  PG `lock_timeout=0` (disabled) does NOT short-circuit — falls back to this GUC. |
 | `cluster.grd_max_entries` | `0` | `[0, 1048576]` | `POSTMASTER` | Size of GRD entry HTAB (per spec-2.15).  `0` = skeleton mode. |
+| `cluster.grd_entry_reclaim` | `on` | `bool` | `SIGHUP` | Enable safe cold reclaim of holderless GRD entries after the lookup pin drops to zero. |
+| `cluster.grd_entry_reclaim_max_per_sweep` | `256` | `[0, 65536]` | `SIGHUP` | Maximum cold entries LMON attempts to reclaim per sweep. |
 
 ### Effective timeout
 
@@ -62,6 +64,19 @@ Four cap counters (per spec-2.16 D1) surface entry-level saturation:
 - `converts_full_count`
 - `ngranted_promoted_count`
 
+spec-6.3a adds GRD entry-lifecycle counters in the `grd` category:
+
+| Key | Meaning |
+|---|---|
+| `grd_entries_reclaimed_count` | Cold holderless entries removed from the GRD HTAB |
+| `grd_reclaim_skipped_pinned_count` | Reclaim attempts skipped because a lookup pin was still held |
+| `grd_pin_high_water` | Highest observed lookup-pin count on a single entry |
+| `grd_sweep_runs` | LMON cold-reclaim sweep invocations |
+
+Implementation details for the pin/release discipline, shard-local scan
+model, and ERROR cleanup classification are in
+`docs/cluster/grd-entry-lifecycle.md`.
+
 ## Wire Format
 
 Payload bytes follow the 36-byte `ClusterICEnvelope`:

@@ -0,0 +1,55 @@
+# GRD Entry Lifecycle And Cold Reclaim
+
+Status: spec-6.3a implementation note.
+
+## Lock Model
+
+GRD entries live in the partitioned GRD HTAB and are protected in this order:
+
+1. shard LWLock
+2. entry spinlock
+
+There is no table-wide GRD entry lock in the hot create or reclaim path.
+Create, lookup, and `HASH_REMOVE` take only the target shard LWLock.
+
+Scan paths do not use `hash_seq_search` over the live HTAB. Each entry is
+linked into a shard-local intrusive list while the shard LWLock is held. A
+scanner takes one shard LWLock in shared mode, copies `ClusterResId` keys,
+releases the shard, and later re-looks up each key through the normal
+pin/release API. This keeps scan safety local to the shard and avoids
+serializing unrelated entry churn.
+
+## Pin Discipline
+
+`cluster_grd_entry_lookup_or_create()` pins an entry before releasing the
+shard LWLock. Callers must pair every successful lookup with
+`cluster_grd_entry_release()`.
+
+`cluster_grd_entry_release()` copies the `ClusterResId` before decrementing
+the pin. After the decrement publishes `pin == 0`, release does not
+dereference the old entry pointer. Last-unpin reclaim re-enters by copied
+resource id, takes the shard LWLock in exclusive mode, revalidates
+cold state under the entry spinlock, sets `RECLAIMING`, unlinks the entry
+from the shard list, and then calls `HASH_REMOVE`.
+
+Cold means `pin == 0` and no holders, waiters, converts, or reservations.
+Entries with live state are never reclaimed.
+
+## ERROR Cleanup Audit
+
+Pinned windows are intentionally short. spec-6.3a classifies lookup sites as:
+
+| Class | Sites | Cleanup rule |
+|---|---|---|
+| F | Snapshot walkers, cleanup sweeps, normal grant/release/convert mutators | The pinned window contains only fixed-size copies, spinlock-protected array mutation, and atomics; it performs no allocation and invokes no visitor callback. External WFG refresh and SQL row visitors run after release. |
+| T | Starvation fairness grant-barrier LMD submit/cancel while a pin is held | Wrapped by `grd_lmd_submit_wait_edge_pinned()` / `grd_lmd_cancel_wait_edge_pinned()`. `PG_CATCH` releases the entry pin and rethrows. |
+| R | none in spec-6.3a | No long-lived GRD entry pin is registered in `ResourceOwner`. Future paths that keep a pin across arbitrary backend code must add ResourceOwner tracking or a local `PG_TRY` guard. |
+
+The unit case `test_grd_pin_cleanup_on_lmd_submit_error` injects an ERROR
+through the pinned LMD submit path and verifies the pin is not leaked.
+
+## Tests
+
+The cluster unit lifecycle suite covers paired pin/release, last-unpin cold
+reclaim, periodic sweep reclaim, live-state exclusion, large sweep batches,
+over-release fail-safe behavior, and the pinned LMD ERROR cleanup path.
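Review note: the pin discipline in the new `grd-entry-lifecycle.md` can be illustrated with a minimal single-threaded sketch. It is not the code from this PR. Every `Toy*` / `toy_*` name is hypothetical, the shard LWLock and entry spinlock are reduced to comments, and the early return on over-release is an assumption about what "fail-safe" means. The point is only the ordering: copy the key, decrement the pin, and after the pin reaches zero re-find the entry by the copied key before removing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real ClusterResId and entry layout are not shown here. */
typedef struct ToyResId { uint32_t hi; uint32_t lo; } ToyResId;

typedef struct ToyEntry {
    bool     in_use;
    ToyResId id;
    int      pin;       /* lookup pins */
    int      nholders;  /* stands in for holders/waiters/converts/reservations */
} ToyEntry;

#define TOY_CAP 8
static ToyEntry toy_shard[TOY_CAP];   /* stands in for one shard of the HTAB */
static int      toy_reclaimed_count;

static ToyEntry *toy_find(ToyResId id)
{
    for (int i = 0; i < TOY_CAP; i++)
        if (toy_shard[i].in_use &&
            toy_shard[i].id.hi == id.hi && toy_shard[i].id.lo == id.lo)
            return &toy_shard[i];
    return NULL;
}

/* Returns a pinned entry, or NULL when the toy table is full. */
ToyEntry *toy_lookup_or_create(ToyResId id)
{
    /* -- shard lock would be taken here -- */
    ToyEntry *e = toy_find(id);
    if (e == NULL) {
        for (int i = 0; i < TOY_CAP && e == NULL; i++)
            if (!toy_shard[i].in_use)
                e = &toy_shard[i];
        if (e == NULL)
            return NULL;
        e->in_use = true;
        e->id = id;
        e->pin = 0;
        e->nholders = 0;
    }
    e->pin++;            /* pin before the shard lock is dropped */
    /* -- shard lock would be released here -- */
    return e;
}

void toy_release(ToyEntry *e)
{
    assert(e != NULL);
    ToyResId id = e->id; /* copy the key while the pin still protects e */
    if (e->pin <= 0)
        return;          /* over-release: assumed fail-safe no-op */
    if (--e->pin > 0)
        return;
    /* pin reached 0: e is not dereferenced past this point */

    /* -- shard lock (exclusive) would be taken here -- */
    ToyEntry *cold = toy_find(id);          /* re-enter by copied key */
    if (cold != NULL && cold->pin == 0 && cold->nholders == 0) {
        cold->in_use = false;               /* stands in for unlink + HASH_REMOVE */
        toy_reclaimed_count++;
    }
    /* -- shard lock would be released here -- */
}
```

In this sketch an entry that still has a holder survives its last unpin, which mirrors the "live state is never reclaimed" rule in the doc.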
@@ -77,7 +77,10 @@ Ordinary lock acquisition, waiting, and release across nodes are unaffected.
 
 The `nconverts` column of `pg_cluster_grd_entries` reports the number of
 pending conversion requests queued on a resource; it is `0` in normal
-operation.
+operation. The diagnostic view snapshots entry keys by walking shard-local
+entry lists under each shard lock and then re-looks up each entry through the
+normal pin/release API before taking the per-entry snapshot, so cold reclaim
+can safely remove holderless entries while diagnostics are active.
 
 ## Blocking notifications
 

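Review note: the snapshot-then-re-lookup scan described in the hunk above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation. All `Demo*` / `demo_*` names are hypothetical, locking is reduced to comments, and `demo_remove` merely simulates a cold reclaim happening between the two phases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoResId { uint32_t hi; uint32_t lo; } DemoResId;

typedef struct DemoEntry {
    struct DemoEntry *next;   /* intrusive shard-local list link */
    DemoResId         id;
    int               pin;
    int               nconverts;
} DemoEntry;

typedef struct DemoShard { DemoEntry *head; } DemoShard;

static bool demo_id_eq(DemoResId a, DemoResId b)
{
    return a.hi == b.hi && a.lo == b.lo;
}

/* Phase 1: with the shard lock held in shared mode, copy keys only. */
size_t demo_snapshot_keys(const DemoShard *s, DemoResId *out, size_t cap)
{
    size_t n = 0;
    for (const DemoEntry *e = s->head; e != NULL && n < cap; e = e->next)
        out[n++] = e->id;
    return n;
}

/* Lookup that pins on success; returns NULL if the entry is gone. */
DemoEntry *demo_lookup_pin(DemoShard *s, DemoResId id)
{
    for (DemoEntry *e = s->head; e != NULL; e = e->next)
        if (demo_id_eq(e->id, id)) {
            e->pin++;
            return e;
        }
    return NULL;
}

void demo_unpin(DemoEntry *e)
{
    assert(e->pin > 0);
    e->pin--;
}

/* Simulated cold reclaim: unlink an unpinned entry by key. */
bool demo_remove(DemoShard *s, DemoResId id)
{
    for (DemoEntry **pp = &s->head; *pp != NULL; pp = &(*pp)->next)
        if (demo_id_eq((*pp)->id, id) && (*pp)->pin == 0) {
            *pp = (*pp)->next;
            return true;
        }
    return false;
}

/* Phase 2: re-look up each copied key, take a fixed-size copy, then unpin. */
size_t demo_visit(DemoShard *s, const DemoResId *keys, size_t n, int *sum_converts)
{
    size_t visited = 0;
    for (size_t i = 0; i < n; i++) {
        DemoEntry *e = demo_lookup_pin(s, keys[i]);
        if (e == NULL)
            continue;              /* reclaimed since the key snapshot */
        int copy = e->nconverts;   /* per-entry snapshot while pinned */
        demo_unpin(e);
        *sum_converts += copy;     /* "visitor" work runs after release */
        visited++;
    }
    return visited;
}
```

A key whose entry disappeared after phase 1 is simply skipped, so the scanner never touches freed memory.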
@@ -3,14 +3,16 @@
 linkdb uses two configuration mechanisms layered on top of standard
 PostgreSQL configuration:
 
-1. **`postgresql.conf`** — standard PG config plus three new
-   `cluster.*` GUCs added by linkdb's cluster subsystem.
+1. **`postgresql.conf`** — standard PG config plus the `cluster.*`
+   GUCs added by linkdb's cluster subsystem.
 2. **`pgrac.conf`** — INI-style file describing the cluster
    topology (the list of nodes that participate in the cluster).
 
 ## cluster.* GUCs
 
-All three GUCs require server restart to change (PGC_POSTMASTER).
+Most bootstrap and storage-routing GUCs require server restart to
+change (PGC_POSTMASTER). Runtime maintenance knobs are marked with
+their own context below.
 
 ### `cluster.node_id`
 
@@ -138,6 +140,47 @@ Reserved for future timeout enforcement on `tier1` peer `recv(2)`.  Currently in
 
 Upper bound on the payload size accepted by the chunked-send API.  A caller asking to send a larger payload is rejected outright with `ERRCODE_PROGRAM_LIMIT_EXCEEDED` rather than silently truncating.  Increase this when the workload expects larger cross-node messages; the hard cap is 256 MB.
 
+### `cluster.grd_max_entries`
+
+| | |
+|---|---|
+| Type | integer |
+| Default | `0` |
+| Range | `0` – `1048576` |
+| Context | postmaster |
+
+Capacity of the GRD resource-entry hash table. `0` keeps the entry
+table disabled for skeleton-mode deployments. Values above zero enable
+GES resource tracking; empty entries are eligible for lifecycle reclaim
+once they have no holders, waiters, converts, reservations, or lookup
+pins.
+
+### `cluster.grd_entry_reclaim`
+
+| | |
+|---|---|
+| Type | bool |
+| Default | `on` |
+| Context | sighup |
+
+Enables safe cold reclaim of holderless GRD entries. The lookup/release
+pin discipline remains active even when this is off; disabling reclaim
+only prevents `HASH_REMOVE` so operators can preserve entries for
+diagnostics during investigation.
+
+### `cluster.grd_entry_reclaim_max_per_sweep`
+
+| | |
+|---|---|
+| Type | integer |
+| Default | `256` |
+| Range | `0` – `65536` |
+| Context | sighup |
+
+Maximum number of cold entries LMON attempts to reclaim during one
+sweep. `0` disables periodic sweeps, while last-pin release can still
+reclaim a cold entry when `cluster.grd_entry_reclaim = on`.
+
 ### `cluster.interconnect_chunk_reassembly_timeout_ms`
 
 | | |

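Review note: one possible reading of how `cluster.grd_entry_reclaim` and `cluster.grd_entry_reclaim_max_per_sweep` interact during a periodic sweep is sketched below. The `Sweep*` / `sweep_once` names are hypothetical. The sketch also makes assumptions the documentation does not confirm: that a pinned candidate counts as an attempt against the budget, and that a sweep disabled by either GUC is not counted in `grd_sweep_runs`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct SweepEntry {
    bool present;
    int  pin;
    int  nholders;   /* stands in for any live state */
} SweepEntry;

typedef struct SweepStats {
    long reclaimed;        /* cf. grd_entries_reclaimed_count */
    long skipped_pinned;   /* cf. grd_reclaim_skipped_pinned_count */
    long sweep_runs;       /* cf. grd_sweep_runs */
} SweepStats;

/* Returns the number of entries removed by this sweep. */
int sweep_once(SweepEntry *entries, size_t n,
               bool reclaim_enabled, int max_per_sweep, SweepStats *st)
{
    assert(st != NULL);
    if (!reclaim_enabled || max_per_sweep <= 0)
        return 0;                      /* periodic sweep disabled */

    st->sweep_runs++;
    int attempts = 0;
    int removed = 0;
    for (size_t i = 0; i < n && attempts < max_per_sweep; i++) {
        SweepEntry *e = &entries[i];
        if (!e->present || e->nholders > 0)
            continue;                  /* live state is never a candidate */
        attempts++;                    /* assumed: pinned candidates use budget */
        if (e->pin > 0) {
            st->skipped_pinned++;
            continue;
        }
        e->present = false;            /* stands in for unlink + HASH_REMOVE */
        st->reclaimed++;
        removed++;
    }
    return removed;
}
```

Capping the work per sweep bounds how long a single sweep runs; leftover cold entries are picked up by a later sweep or by last-pin release.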
@@ -968,8 +968,8 @@ dump_scn(ReturnSetInfo *rsinfo)
 /*
  * dump_grd -- spec-2.14 D6 GRD routing substrate observability.
  *
- *	Emits 14 rows under category='grd' (8 from spec-2.14 + 6 from
- *	spec-2.15 entry-table infrastructure):
+ *	Emits core routing rows plus entry lifecycle counters under
+ *	category='grd':
  *	  - grd_shard_count:             4096 (constant)
  *	  - grd_local_master_count:      shards mastered by self node
  *	  - grd_remote_master_count:     4096 - local (SQL-friendly though derivable)
@@ -984,6 +984,10 @@ dump_scn(ReturnSetInfo *rsinfo)
  *	  - grd_entry_create_count:      lifetime created entries
  *	  - grd_entry_lookup_hit_count:  lifetime OK lookups
  *	  - grd_entry_full_count:        lifetime FULL returns
+ *	  - grd_entries_reclaimed_count: lifetime cold entry removes
+ *	  - grd_reclaim_skipped_pinned_count: reclaim skipped because pin>0
+ *	  - grd_pin_high_water:          max observed per-entry pin count
+ *	  - grd_sweep_runs:              LMON reclaim sweep invocations
  *
  *	Counter invariant (v0.4 P1.2):
  *	  grd_shard_lookup_count >=
@@ -1025,6 +1029,12 @@ dump_grd(ReturnSetInfo *rsinfo)
 			 fmt_int64((int64)cluster_grd_entry_lookup_hit_count()));
 	emit_row(rsinfo, "grd", "grd_entry_full_count",
 			 fmt_int64((int64)cluster_grd_entry_full_count()));
+	emit_row(rsinfo, "grd", "grd_entries_reclaimed_count",
+			 fmt_int64((int64)cluster_grd_entries_reclaimed_count()));
+	emit_row(rsinfo, "grd", "grd_reclaim_skipped_pinned_count",
+			 fmt_int64((int64)cluster_grd_reclaim_skipped_pinned_count()));
+	emit_row(rsinfo, "grd", "grd_pin_high_water", fmt_int64((int64)cluster_grd_pin_high_water()));
+	emit_row(rsinfo, "grd", "grd_sweep_runs", fmt_int64((int64)cluster_grd_sweep_runs()));
 
 	emit_row(rsinfo, "grd", "grd_holders_full_count",
 			 fmt_int64((int64)cluster_grd_holders_full_count()));