kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels by bootjp · Pull Request #895 · bootjp/elastickv

bootjp · 2026-05-31T09:57:39Z

Summary

Third milestone (M3) of the Composed-1 cross-group commit-time guard per docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md §M3.

Wires the safety property at apply time using the M2 versioned-snapshot ring and the M1 ObservedRouteVersion field.

Stacked on PR #894 (M2 plumbing).

What's enforced

(a) Observed-version owner — the spec-level Composed-1 check from tla/composed/Composed.tla. Every write key of a Composed-1-pinned txn must be owned by THIS Raft group at the txn's observed catalog version. Refusal returns ErrComposed1Violation.
(b) Current-version cross-version-read fence — the §4.4 / §3 codex P1 trace. Even when (a) passes, a route shift between BeginTxn and Commit can leave the write landing on the OLD owner while readers at the new version route to the NEW owner and miss the write. The current-version fence refuses that case. Refusal returns ErrComposed1Violation with a different diagnostic prefix (current-version vs observed-version) so the M4 retry path can distinguish.
(c) Retention-miss fail-closed — when the txn's observed version has been evicted from the M2 ring (long-running txn or high catalog churn), return ErrComposed1VersionGCd. Per design doc §4.3, the not-found semantics is a hard retryable error, NOT a soft pass, because soft-pass would let the gate be bypassed exactly in the cases where the cross-version-read hazard is most likely.

Short-circuits

Three legacy / not-applicable cases that the gate explicitly bypasses:

FSM constructed without WithRouteHistory (legacy / test seam): routes == nil, gate returns nil.
Request carries ObservedRouteVersion == 0 (unpinned — pre-M1 caller, or 2PC ABORT request that doesn't carry the version): gate returns nil. This is what keeps M3 from regressing every caller that has not migrated to set OperationGroup.ObservedRouteVersion.
Engine.Current returns (zero, false) — engine has no history (bare-struct test seam): the (b) fence falls through.

Changes

distribution/engine.go — adds Engine.Current() returning the snapshot at the current catalogVersion. Adds SetHistoryDepthForTest as the cross-package test seam that lets kv-side tests trigger eviction without bypassing the package boundary (claude review on PR kv(composed1): M2 — versioned-snapshot ring + kvFSM RouteHistory wiring #894 — fragile-but-test-local lock contract documented inline).
kv/fsm.go — defines ErrComposed1Violation and ErrComposed1VersionGCd sentinels with the wrapped-error diagnostic shape M4 retry will use. Adds verifyComposed1 and verifyOwnerFromSnapshot. Wires verifyComposed1 into handleTxnRequest at the top so PREPARE / COMMIT / ABORT / NONE all pass through the gate (ABORT carries ObservedRouteVersion=0 and falls through naturally).
kv/route_history.go — RouteHistory interface gains Current(). Adapter forwards to distribution.Engine.Current().

Tests (`kv/fsm_composed1_test.go`)

Test	Maps to design doc M3 "Done when"
`TestVerifyComposed1_StaleObservedVersionWithMovedKeyFails`	(i) observed-version snapshot resolves the key to a different group than this FSM
`TestVerifyComposed1_ObservedVersionOlderThanRingFails`	(ii) observed version is outside the ring → `ErrComposed1VersionGCd`
`TestVerifyComposed1_ObservedPassesButCurrentDiffersFails`	(iii) / §3 codex P1 trace: observed-version check passes but current-version differs → `ErrComposed1Violation` (`current-version` diagnostic)
`TestVerifyComposed1_ObservedVersionZeroSkipsGate`	Legacy caller bypasses the gate
`TestVerifyComposed1_NilRouteHistorySkipsGate`	Unwired FSM bypasses the gate

Verification

go build ./... — clean
go vet ./... — clean
go test -race -count=1 ./kv ./distribution — 10.2 s + 1.0 s, pass
make lint — 0 issues

Self-review (5 lenses)

Data loss — the gate refuses commits but cannot lose them. Every rejection emits a sentinel the M4 coordinator path will convert into a successful commit on the correct owner.
Concurrency — the gate runs under the FSM's apply loop (Raft applies are serial), so no new lock ordering. RouteHistory reads take the engine's read lock; SnapshotAt and Current each acquire and release in a single call.
Performance — two map lookups + a per-mutation linear scan of the snapshot's sorted routes per Composed-1-pinned commit. OwnerOf short-circuits via the M2 round-1 break optimisation, so the per-mutation cost is bounded by the first non-covering gap rather than the full routes slice. Legacy callers (ObservedRouteVersion=0) pay one branch per commit and exit.
Data consistency — closes the Composed-1 (observed) and Composed-1a (current) gaps end-to-end. Spec correspondence: tla/composed/Composed.tla's Commit precondition is now enforced at apply time; the §4.4 fence matches Composed1a_CommitToCurrentOwner from PR tla: Composed-1a current-version fence + currentfence_gap (#870 follow-up) #878.
Test coverage — 5 new unit tests cover the three design doc criteria + the two legacy short-circuit paths. Existing handleTxnRequest tests pass unchanged (their requests carry ObservedRouteVersion=0 from M1's behaviour-neutral default).

Test plan

go test -race ./kv ./distribution
make lint
kv(composed1): M2 — versioned-snapshot ring + kvFSM RouteHistory wiring #894 (M2) must merge first — this PR's branch is stacked on it.
(Follow-up PR) M4 — coordinator retry path. When either Composed-1 sentinel returns, the coordinator re-reads the route cache, re-routes the txn against the new owning group, and re-issues it once.

Resolves

The M3 row in the Composed-1 design doc.

Summary by CodeRabbit

Bug Fixes
- Prevents cross-group transaction commits when route ownership shifts between observed and current versions.
- Returns a clear error when an observed version is no longer retained in route history.
- Ensures transactions are skipped (not rejected) when the shard/group identifier is unset, avoiding false rejections.
Tests
- Added comprehensive tests covering validation across routing/version scenarios, including history GC and shard-group unset behavior.

Third milestone of the Composed-1 cross-group commit-time guard per docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md §M3. Wires the safety property at apply time using the M2 versioned-snapshot ring and the M1 ObservedRouteVersion field. Stacked on PR #894 (M2 plumbing). What's enforced: (a) Observed-version owner — the spec-level Composed-1 check from tla/composed/Composed.tla. Every write key of a Composed-1- pinned txn must be owned by THIS Raft group at the txn's observed catalog version (the version it read its read-set at, set in M1 via OperationGroup.ObservedRouteVersion). Refusal returns ErrComposed1Violation. (b) Current-version cross-version-read fence — the §4.4 / §3 codex P1 trace. Even when (a) passes, a route shift between BeginTxn and Commit can leave the write landing on the OLD owner while readers at the new version route to the NEW owner and miss the write. The current-version fence refuses that case so M4 retry can re-route. Refusal returns ErrComposed1Violation with a different diagnostic prefix ("current-version" vs "observed-version") so the retry path can distinguish. (c) Retention-miss fail-closed — when the txn's observed version has been evicted from the M2 ring (long-running txn or high catalog churn), return ErrComposed1VersionGCd. Per design doc §4.3, the not-found semantics is a hard retryable error, NOT a soft pass, because soft-pass would let the gate be bypassed exactly in the cases where the cross-version-read hazard is most likely. Short-circuits cleanly in three legacy / not-applicable cases: * FSM constructed without WithRouteHistory (legacy / test seam): routes == nil, gate returns nil. * Request carries ObservedRouteVersion == 0 (unpinned — pre-M1 caller, or 2PC ABORT request that doesn't carry the version): gate returns nil. This is what keeps M3 from regressing every caller that has not migrated to set OperationGroup. ObservedRouteVersion. * Engine.Current returns (zero, false) — engine has no history (bare-struct test seam): the (b) fence falls through. Changes: * distribution/engine.go — adds Engine.Current() returning the snapshot at the current catalogVersion (used by the M3 fence). Adds SetHistoryDepthForTest as the cross-package test seam that lets kv-side tests trigger eviction without bypassing the package boundary (claude review on PR #894 — fragile-but- test-local lock contract documented inline). * kv/fsm.go — defines ErrComposed1Violation and ErrComposed1VersionGCd sentinels with the wrapped-error diagnostic shape M4 retry will use. Adds verifyComposed1 and verifyOwnerFromSnapshot. Wires verifyComposed1 into handleTxnRequest at the top so PREPARE / COMMIT / ABORT / NONE all pass through the gate (ABORT carries ObservedRouteVersion=0 and falls through naturally). * kv/route_history.go — RouteHistory interface gains Current(). Adapter forwards to distribution.Engine.Current(). Tests (kv/fsm_composed1_test.go): * TestVerifyComposed1_StaleObservedVersionWithMovedKeyFails — design doc M3 "done when" criterion (i): observed-version snapshot resolves the key to a different group than this FSM. * TestVerifyComposed1_ObservedVersionOlderThanRingFails — criterion (ii): observed version is outside the ring → ErrComposed1VersionGCd. Uses depth=2 via the cross-package SetHistoryDepthForTest seam. * TestVerifyComposed1_ObservedPassesButCurrentDiffersFails — criterion (iii) / the §3 codex P1 trace: observed-version check passes (routes[1][k1]=g1) but the current snapshot at v=2 has moved k1 to g2 → ErrComposed1Violation with the "current-version" diagnostic prefix. * TestVerifyComposed1_ObservedVersionZeroSkipsGate — legacy caller (ObservedRouteVersion=0) bypasses the gate. * TestVerifyComposed1_NilRouteHistorySkipsGate — unwired FSM (no WithRouteHistory option) bypasses the gate; matches the pre-feature posture. Verification: * go build ./... — clean * go vet ./... — clean * go test -race -count=1 ./kv ./distribution — 10.2 s + 1.0 s, pass Self-review (5 lenses): 1. Data loss — the gate refuses commits but cannot lose them. Every rejection emits a sentinel the M4 coordinator path will convert into a successful commit on the correct owner. 2. Concurrency — the gate runs under the FSM's apply loop (Raft applies are serial), so no new lock ordering. RouteHistory reads take the engine's read lock; SnapshotAt and Current each acquire and release in a single call. 3. Performance — two map lookups + a per-mutation linear scan of the snapshot's sorted routes per Composed-1-pinned commit. OwnerOf short-circuits via the M2 round-1 break optimisation, so the per-mutation cost is bounded by the first non-covering gap rather than the full routes slice. Legacy callers (ObservedRouteVersion=0) pay one branch per commit and exit. 4. Data consistency — closes the Composed-1 (observed) and Composed-1a (current) gaps end-to-end. Spec correspondence: tla/composed/Composed.tla's Commit precondition is now enforced at apply time; the §4.4 fence matches Composed1a_CommitToCurrentOwner from PR #878. 5. Test coverage — 5 new unit tests cover the three design doc criteria + the two legacy short-circuit paths. Existing handleTxnRequest tests pass unchanged (their requests carry ObservedRouteVersion=0 from M1's behaviour-neutral default). Next milestone (separate PR per design doc §6): M4 — coordinator retry path. When either Composed-1 sentinel returns, the coordinator re-reads the route cache, re-routes the txn against the new owning group, and re-issues it once.

bootjp · 2026-05-31T09:57:47Z

@claude review

coderabbitai · 2026-05-31T09:57:49Z

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 42 minutes and 31 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee427f08-5688-46cf-b45f-a600e7064c18

📥 Commits

Reviewing files that changed from the base of the PR and between 6108626 and fbd0d46.

📒 Files selected for processing (3)

distribution/engine.go
kv/fsm.go
kv/fsm_composed1_test.go

📝 Walkthrough

Walkthrough

Adds Engine.Current() and SetHistoryDepthForTest(), extends RouteHistory with Current(), implements the composed-1 verification (verifyComposed1) in kv FSM with two retryable errors, and adds tests covering observed/current checks and short-circuit cases.

Changes

Composed-1 Gate for Transaction Consistency

Layer / File(s)	Summary
Engine history snapshot APIs `distribution/engine.go`, `kv/route_history.go`	`Engine.Current()` returns the snapshot for the engine's current catalogVersion; `Engine.SetHistoryDepthForTest()` adjusts retained history and evicts oldest entries when shrinking; `distributionEngineAdapter.Current()` exposes this to the `RouteHistory` interface.
RouteHistory interface contract `kv/fsm.go`	`RouteHistory` now includes `Current()` alongside `SnapshotAt()`, documenting ring-not-found semantics used by the composed-1 gate.
Composed-1 gate logic and errors `kv/fsm.go`	Adds `ErrComposed1Violation` and `ErrComposed1VersionGCd`; implements `verifyComposed1()` which validates mutation-key ownership against `SnapshotAt(ObservedRouteVersion)` and, when available, `Current()`; `handleTxnRequest` invokes the gate before phase dispatch.
Composed-1 gate test suite `kv/fsm_composed1_test.go`	Test helpers to populate an in-memory `distribution.Engine` and construct `kvFSM` plus tests for: observed-version owner mismatch, observed-version evicted (GCd), current-version fence rejection, ObservedRouteVersion==0 skip, nil routes skip, and shardGroupID==0 skip.

Sequence Diagram

sequenceDiagram
  participant Client
  participant FSM as kvFSM
  participant Gate as verifyComposed1
  participant Observed as RouteHistory(SnapshotAt)
  participant Current as RouteHistory(Current)

  Client->>FSM: Txn with ObservedRouteVersion
  FSM->>Gate: verifyComposed1(request)
  Gate->>Observed: SnapshotAt(ObservedVersion)
  alt Observed version evicted
    Observed-->>Gate: nil, false
    Gate-->>FSM: ErrComposed1VersionGCd
  else Observed version found
    Observed-->>Gate: RouteSnapshot
    Gate->>Gate: Check key owners vs shardGroupID
    alt Owner mismatch at observed
      Gate-->>FSM: ErrComposed1Violation (observed-version)
    else Owner match at observed
      Gate->>Current: Current()
      alt Current not available
        Current-->>Gate: nil, false
        Gate-->>FSM: nil
      else Current available
        Current-->>Gate: RouteSnapshot
        Gate->>Gate: Check key owners vs shardGroupID
        alt Owner mismatch at current
          Gate-->>FSM: ErrComposed1Violation (current-version)
        else All checks pass
          Gate-->>FSM: nil
          FSM->>FSM: Proceed to phase dispatch
        end
      end
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

bootjp/elastickv#881: Propagates ObservedRouteVersion into pb.Request, which the composed-1 gate consumes for historical ownership checks.
bootjp/elastickv#894: Introduced the versioned route-history ring and SnapshotAt() plumbing that this change extends with Current() and test-depth control.

Poem

🐰 I peek at history, then glance at now,
I guard the keys with twitching brow.
If versions drift or memory's thin,
I nudge the txn to try again.
Hooray — no stale-commit sin!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly summarizes the main objective: implementing an apply-time Composed-1 gate (verifyComposed1) with retry error sentinels, which is the core feature across all modified files.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/composed1-m3-verify-gate

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-31T09:57:51Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (c94e0d1):

distribution/engine.go
kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

chatgpt-codex-connector · 2026-05-31T09:57:58Z

To use Codex here, create a Codex account and connect to github.

claude · 2026-05-31T09:58:00Z

Claude finished @bootjp's task in 3m 24s —— View job

Code Review — kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels

Gather context and understand the request
Read changed files
Review design doc for M3 spec correspondence
Review lens 1: Data loss
Review lens 2: Concurrency / distributed failures
Review lens 3: Performance
Review lens 4: Data consistency
Review lens 5: Test coverage

Overall the implementation is solid. The gate logic matches the TLA+ spec (§4.2(a)) and the cross-version fence (§4.4) is correctly wired. No data-loss paths, no lock-ordering issues, no safety bugs found. A few issues worth addressing are called out below in order of severity.

Issues

1. Misleading comment in `SetHistoryDepthForTest` — contradicts its own implementation

distribution/engine.go:217–228

// SetHistoryDepthForTest overrides the FIFO ring depth from outside
// the package.  Test-only — callers MUST set the depth before the
// Engine is shared with any concurrent reader (no internal
// synchronisation here …
func (e *Engine) SetHistoryDepthForTest(depth int) {
    e.mu.Lock()       // ← lock IS acquired
    defer e.mu.Unlock()
    e.historyDepth = depth
}

The comment says "no internal synchronisation here" but the function acquires e.mu.Lock(). A future maintainer reading the comment will think concurrent calls are unsafe when they are actually protected. The comment was likely copied from the in-package direct field write (TestEngineSnapshotAt_FIFOEviction) and not updated.

Suggested fix: change "no internal synchronisation here" → "caller should set the depth before the Engine receives concurrent Snapshots to avoid test-interleaving surprises, but the write itself is lock-protected." Fix this →

2. Missing "happy path" test — gate can silently block valid txns undetected

kv/fsm_composed1_test.go

All 5 tests assert that the gate fires (violation, GC'd) or short-circuits (nil routes, zero version). There is no test that asserts the gate passes cleanly when the txn's observed version correctly identifies this group as the owner at both the observed and current versions. Without a passing-gate test, an over-aggressive check (e.g., an off-by-one in OwnerOf or an inadvertent negation of owner != f.shardGroupID) could incorrectly reject every valid commit and the test suite would not catch it.

Suggested addition — a test along these lines:

func TestVerifyComposed1_ValidOwnershipPassesGate(t *testing.T) {
    t.Parallel()
    e := distribution.NewEngine()
    // v=1 and v=2 both assign k to g1 — no route shift.
    applyComposed1Snapshot(t, e, 1, []distribution.RouteDescriptor{
        {RouteID: 100, Start: []byte(""), End: nil, GroupID: 1, State: distribution.RouteStateActive},
    })
    applyComposed1Snapshot(t, e, 2, []distribution.RouteDescriptor{
        {RouteID: 100, Start: []byte(""), End: nil, GroupID: 1, State: distribution.RouteStateActive},
    })
    fsm := newComposed1FSM(t, e, 1)

    err := fsm.verifyComposed1(commitTxnRequest(1, "k"))
    require.NoError(t, err, "gate must pass when this group owns the key at both observed and current version")
}

Fix this →

3. Minor: `ErrComposed1Violation` doc comment overstates the scope of the check

kv/fsm.go:215

// ErrComposed1Violation is returned … because the
// txn's read-set or write-set keys are not owned by this group …

verifyOwnerFromSnapshot iterates r.GetMutations() (the write-set only). r.ReadKeys is not checked — consistent with the TLA+ spec where Composed1_CommitToOwningGroup quantifies over txnWriteSet. The doc comment's "read-set or write-set" is misleading.

Fix this →

4. Minor: Design doc Q1 not addressed at M3

Design doc §9 Q1 says: "allow unpinned at M1, log a warning at M3, error at M5."

The M3 implementation short-circuits ObservedRouteVersion==0 with return nil and no log. Keeping the gate silent is probably the right call right now (to avoid log spam from all legacy callers), but it drifts from the design doc's stated M3 plan. Either update the doc to defer the warning to M4/M5, or add a single slog.WarnContext at the observedVer == 0 short-circuit. This is a documentation/tracking issue, not a correctness issue.

5. Nit: Self-correction prose in test comment

kv/fsm_composed1_test.go:73–77

// Wait — re-read the design doc: criterion (i) is …

The "Wait — re-read the design doc" reads like a debugging note. The clarification is useful but could be written as a direct statement. This is purely cosmetic.

Minor optimisation opportunity (non-blocking)

When observedVer == currentSnap.Version() (the catalog hasn't advanced since txn-begin), verifyOwnerFromSnapshot is called twice against the same snapshot. Adding an early exit after the observed check passes avoids the redundant linear scan:

// (b) Current-version fence.
currentSnap, ok := f.routes.Current()
if !ok {
    return nil
}
// Fast path: if the current version is the same as the observed version,
// the observed-version check already validated current ownership.
if currentSnap.Version() == observedVer {
    return nil
}
return f.verifyOwnerFromSnapshot(r.GetMutations(), currentSnap, currentSnap.Version(), "current")

This is O(|write-set|) savings and entirely optional, but it's a clean win for the common steady-state case.

Lens summary

Lens	Result
Data loss	Clean — gate can only reject, never silently drop committed writes
Concurrency	Clean — gate runs under the serial Raft apply loop; `SnapshotAt`/`Current` acquire and release `e.mu.RLock` atomically
Performance	Two map lookups + O(
Data consistency	Spec correspondence correct; write-set only check matches `Composed1_CommitToOwningGroup`; fail-closed on ring miss is correct per §4.3
Test coverage	5 failure/short-circuit tests present; missing happy-path test (see issue 2 above); no multi-key test

gemini-code-assist

Code Review

This pull request implements the apply-time Composed-1 gate to prevent cross-version-read anomalies during route shifts. It introduces the Current() method to retrieve the latest route catalog snapshot, defines ErrComposed1Violation and ErrComposed1VersionGCd errors, and adds verification logic in kvFSM.handleTxnRequest to validate key ownership against both the observed and current catalog versions. Comprehensive unit tests are also added. The reviewer identified two critical issues: first, ABORT requests must bypass the Composed-1 gate to avoid leaking MVCC locks, and a short-circuit is needed when shardGroupID is unset (0); second, reducing history depth via SetHistoryDepthForTest on an already populated engine can cause a runtime panic unless excess history entries are immediately truncated.

gemini-code-assist · 2026-05-31T09:59:22Z

+func (f *kvFSM) verifyComposed1(r *pb.Request) error {
+	if f.routes == nil {
+		return nil
+	}


There are two critical issues here:

ABORT requests must not be blocked by the Composed-1 gate: If an ABORT request is rejected due to a route shift (ErrComposed1Violation) or history eviction (ErrComposed1VersionGCd), the transaction's locks and intents will be permanently leaked in the MVCC store, blocking future writes. Aborts should always bypass the gate to ensure successful cleanup.

Missing short-circuit for unset shardGroupID: The comments state that a shardGroupID of 0 represents an unset/legacy state and should short-circuit the gate. However, the code does not check for f.shardGroupID == 0, which will cause spurious ErrComposed1Violation errors on unset groups.

func (f *kvFSM) verifyComposed1(r *pb.Request) error { if r.Phase == pb.Phase_ABORT { return nil } if f.routes == nil || f.shardGroupID == 0 { return nil }

gemini-code-assist · 2026-05-31T09:59:22Z

+func (e *Engine) SetHistoryDepthForTest(depth int) {
+	e.mu.Lock()
+	defer e.mu.Unlock()
+	e.historyDepth = depth
+}


If SetHistoryDepthForTest is called to reduce the history depth after the history ring has already been populated with more entries than the new depth, a subsequent call to recordHistorySnapshotLocked will trigger a runtime panic. This is because make([]uint64, len(e.historyOrder)-1, e.historyDepth) will be invoked with a length greater than its capacity. To prevent this, truncate the excess history entries immediately when updating the depth.

Suggested change

func (e *Engine) SetHistoryDepthForTest(depth int) {

e.mu.Lock()

defer e.mu.Unlock()

e.historyDepth = depth

}

func (e *Engine) SetHistoryDepthForTest(depth int) {

e.mu.Lock()

defer e.mu.Unlock()

e.historyDepth = depth

if len(e.historyOrder) > depth {

excess := len(e.historyOrder) - depth

for i := 0; i < excess; i++ {

delete(e.history, e.historyOrder[i])

}

retained := make([]uint64, depth)

copy(retained, e.historyOrder[excess:])

e.historyOrder = retained

}

}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

kv/fsm_composed1_test.go (1)
32-35: ⚡ Quick win

Drop the new //nolint:unparam and refactor the helper instead.

This suppression is avoidable in new test code: either bake in the single currently used group ID, or make the group ID vary through actual table cases when you need that coverage.

As per coding guidelines, "Avoid //nolint annotations — refactor instead."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kv/fsm_composed1_test.go` around lines 32 - 35, Remove the //nolint:unparam
annotation and refactor newComposed1FSM to stop accepting the unused
shardGroupID parameter: either bake the single used group ID (1) directly into
newComposed1FSM (replace the shardGroupID parameter with the constant 1 inside
the function) and update all call sites to call newComposed1FSM(t, e)
accordingly, or convert the tests that need varying group IDs into table-driven
cases that construct the FSM with different explicit group IDs (so the parameter
becomes used). Ensure all references to function newComposed1FSM are updated and
delete the //nolint comment.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@distribution/engine.go`:
- Around line 225-229: SetHistoryDepthForTest should validate the incoming depth
and reject non-positive values to fail fast; inside
Engine.SetHistoryDepthForTest (protected by e.mu), add a check like if depth <=
0 { panic("SetHistoryDepthForTest: depth must be > 0") } (or return an
error/report) before assigning e.historyDepth so recordHistorySnapshotLocked and
the eviction path never encounter an empty historyOrder; update any tests that
call SetHistoryDepthForTest to provide positive depths.

In `@kv/fsm.go`:
- Around line 582-589: verifyComposed1 currently only short-circuits when
f.routes == nil but the documented behavior says shardGroupID == 0 should also
bypass; update kvFSM.verifyComposed1 to return nil immediately when
f.shardGroupID == 0 (before any route/version checks) so callers that install
routes before setting shardGroupID preserve the legacy skip-for-group-0
behavior; keep the existing observedVer handling unchanged.

---

Nitpick comments:
In `@kv/fsm_composed1_test.go`:
- Around line 32-35: Remove the //nolint:unparam annotation and refactor
newComposed1FSM to stop accepting the unused shardGroupID parameter: either bake
the single used group ID (1) directly into newComposed1FSM (replace the
shardGroupID parameter with the constant 1 inside the function) and update all
call sites to call newComposed1FSM(t, e) accordingly, or convert the tests that
need varying group IDs into table-driven cases that construct the FSM with
different explicit group IDs (so the parameter becomes used). Ensure all
references to function newComposed1FSM are updated and delete the //nolint
comment.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af79a132-4d95-4f0c-b28e-0e8008f5f03d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b118a4 and c94e0d1.

📒 Files selected for processing (4)

distribution/engine.go
kv/fsm.go
kv/fsm_composed1_test.go
kv/route_history.go

…seam (reviews on PR #895) Three findings from PR #895's automated review pass, all addressed: * coderabbit MAJOR — verifyComposed1 only checked `routes == nil` for the bypass, leaving a partially-wired FSM (routes installed before group ID, or a test fixture that passed engine but shardGroupID=0) to reject every pinned txn with ErrComposed1Violation. The doc comment on kvFSM.shardGroupID says "Zero ... means 'unset' — the M3 gate will short-circuit (matches the pre-feature behaviour)"; the code now honours that contract too. Fix: bypass when EITHER routes==nil OR shardGroupID==0. Caller audit (the loop directive's semantic-change check): only handleTxnRequest calls verifyComposed1, and it treats nil as success — bypass-as-nil preserves the caller's semantics. No further wiring change. * coderabbit minor — SetHistoryDepthForTest accepted depth<=0, which would later surface as an index-out-of-range deep in recordHistorySnapshotLocked's eviction path (historyOrder[0] on an empty slice). Now panics at the seam with a clear message so the misconfigured test fails fast. * gemini medium — Shrinking historyDepth below the current ring size left len(historyOrder) > historyDepth, and the next recordHistorySnapshotLocked would `make([]uint64, len-1, historyDepth)` with len-1 > historyDepth → panic in apply. Now SetHistoryDepthForTest evicts the excess oldest entries and compacts the slice immediately under the write lock. * golangci/reviewdog formatting (gci on kv/fsm.go and kv/fsm_composed1_test.go) — auto-fixed by `make lint`. Regression test (added before the fix per the project convention): * TestVerifyComposed1_ShardGroupIDZeroSkipsGate — builds an engine where the route is owned by group 5 and constructs an FSM with shardGroupID=0; pre-fix the gate rejects the pinned commit (OwnerOf returns 5 ≠ this FSM's 0), post-fix the gate short-circuits via the new condition. Verified failing on the parent commit and passing on this commit. Verification: go test -race -count=1 ./kv ./distribution → pass. make lint → 0 issues.

bootjp · 2026-06-01T11:27:50Z

All 4 findings addressed:

coderabbit MAJOR (kv/fsm.go:589 — shardGroupID==0 bypass) — verifyComposed1 now honors both routes==nil and shardGroupID==0 per the documented unset-sentinel contract. Caller audit (per the loop directive's semantic-change check): only handleTxnRequest calls verifyComposed1 and treats nil as success — bypass-as-nil preserves caller semantics, no further wiring change. Regression test TestVerifyComposed1_ShardGroupIDZeroSkipsGate added BEFORE the fix per project convention, verified failing on parent / passing on fix.

coderabbit minor (distribution/engine.go:229 — reject depth<=0) — SetHistoryDepthForTest now panics at the seam with a clear message.

gemini medium (distribution/engine.go:229 — shrink panic) — SetHistoryDepthForTest now evicts excess entries and compacts the slice when shrinking depth below current ring size.

golangci/reviewdog (gci formatting on kv/fsm.go and kv/fsm_composed1_test.go) — auto-fixed by make lint --fix.

go test -race ./kv ./distribution pass / make lint 0 issues.

@claude review

github-actions · 2026-06-01T11:28:06Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (6108626):

distribution/engine.go
kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

claude · 2026-06-01T11:28:10Z

Claude finished @bootjp's task in 3m 15s —— View job

Code Review — Follow-up (commit `6108626` fixes)

Gather context and understand the request
Read changed files and diff for commit 6108626
Review fix: shardGroupID==0 bypass + regression test
Review fix: depth<=0 panic in SetHistoryDepthForTest
Review fix: shrink eviction in SetHistoryDepthForTest
Review fix: gci formatting
Check for any remaining or new issues
Post review

All 4 listed fixes are correct. Two items from the previous round and one from gemini remain open.

Confirmed correct

shardGroupID == 0 bypass (kv/fsm.go:591) — f.routes == nil || f.shardGroupID == 0 is the right predicate. Both sentinel values are now handled in one branch. The regression test TestVerifyComposed1_ShardGroupIDZeroSkipsGate correctly exercises the fix: engine wired with GroupID=5, FSM constructed with shardGroupID=0, asserts neither Composed-1 error surfaces.

depth <= 0 panic (distribution/engine.go:238) — Fails fast before lock acquisition, which is correct for a test-only seam. No lock-ordering issue (no held lock to release on panic path).

Shrink truncation (distribution/engine.go:244–252) — Logic is correct. excess = len(historyOrder) - depth entries are deleted from the map (oldest first, index 0..excess-1), and copy(retained, historyOrder[excess:]) lifts the depth most-recent tail into a new slice. The resulting invariant len(historyOrder) == depth matches recordHistorySnapshotLocked's eviction assumption.

gci formatting — The //nolint:unparam repositioning (from above the prose comment to immediately before the func line) is the correct placement for a //nolint directive.

Remaining: two open items from the previous claude review

1. Misleading "no internal synchronisation" comment — still present

distribution/engine.go:213–214

// SetHistoryDepthForTest overrides the FIFO ring depth from outside
// the package.  Test-only — callers MUST set the depth before the
// Engine is shared with any concurrent reader (no internal
// synchronisation here …

This comment was already contradicted by the pre-existing e.mu.Lock() when the previous review raised it (issue #1 in the previous claude review). Commit 6108626 adds the shrink logic inside that same lock but does not update the text. The "no internal synchronisation here" clause now causes genuine confusion: a future reader of the comment will infer concurrent calls are unsafe, then see the lock and wonder which is authoritative.

Suggested fix: replace "no internal synchronisation here for the same reason TestEngineSnapshotAt_FIFOEviction does the direct field write in-package" with "callers should set depth before sharing the Engine with concurrent Snapshots to avoid interleaving surprises, but the write itself is lock-protected". Fix this →

2. Missing happy-path test — still absent

kv/fsm_composed1_test.go now has 6 tests (5 original + the new shardGroupID=0 regression). All 6 test failure/bypass paths. There is still no test asserting that verifyComposed1 returns nil when the txn's observed version AND the current version both assign the key to this FSM's group. An off-by-one in OwnerOf, a reversed comparison, or an inadvertent sign-flip in the group-ID match would produce a false-positive rejection on every valid commit, and the test suite would not catch it. Fix this →

Remaining: gemini's ABORT-invariant concern (not in the 4 addressed)

Gemini flagged this as CRITICAL. The current protection for ABORT is that all ABORT construction sites (transaction.go:482, sharded_coordinator.go:859) leave ObservedRouteVersion at its zero-value default, so the observedVer == 0 branch fires before any route check. Verified: both callsites construct the pb.Request struct without setting ObservedRouteVersion.

This is safe today, but it is an undocumented protocol invariant — nothing in the type system or a test enforces that ABORT must carry ObservedRouteVersion == 0. A future developer adding a third ABORT construction path could inadvertently copy-paste a COMMIT request struct and set the version field, causing the gate to block ABORTs and permanently leak MVCC locks on those keys.

Gemini's suggestion — add if r.Phase == pb.Phase_ABORT { return nil } as the first guard — costs one branch and makes the invariant enforcement-in-code rather than enforcement-by-convention. Given that the consequence of a missed ABORT is lock leakage until TTL, the defensive guard is worth adding. Fix this →

Nit: `//nolint:unparam` may now be dead weight

newComposed1FSM is now called with both shardGroupID=1 (five tests) and shardGroupID=0 (the new regression test), so unparam should no longer flag the parameter as "always the same value." The annotation is harmless but can be removed if make lint no longer reports the finding. Fix this →

Summary

Finding	Status
coderabbit MAJOR: `shardGroupID==0` bypass	✅ Fixed + regression test
coderabbit minor: `depth <= 0` panic	✅ Fixed
gemini medium: shrink-panic in `SetHistoryDepthForTest`	✅ Fixed
golangci/gci formatting	✅ Fixed
Previous claude #1: misleading "no sync" comment	⚠️ Still open
Previous claude #2: missing happy-path test	⚠️ Still open
Gemini CRITICAL: no explicit ABORT Phase bypass	⚠️ Still open (safe today, fragile by convention)

The three remaining issues are all non-blocking for M3's correctness (the gate logic itself is correct; the ABORT path is safe today). Happy to approve once the happy-path test is in — that's the one change that gives the gate meaningful coverage in both the pass and fail directions.

chatgpt-codex-connector · 2026-06-01T11:28:13Z

To use Codex here, create a Codex account and connect to github.

…#895) Three remaining items from claude's round-2 follow-up review + the gemini HIGH-priority finding that was missed in round 2: * gemini HIGH (kv/fsm.go) — ABORT requests MUST bypass the Composed-1 gate. If a route shifted between PREPARE and ABORT (or the observed version was evicted), a rejected ABORT would leave the txn's intent locks pinned in MVCC until LockResolver's TTL — minutes of write-blocked keys for what should be a one-RPC cleanup. Production callers (transaction.go, sharded_coordinator.go) happen to leave ObservedRouteVersion=0 on ABORT requests, so today the existing observedVer==0 short-circuit already protects them. But that is enforcement-by-convention; a future ABORT construction site that copies a COMMIT struct by mistake would silently start failing aborts. Fix: add `if r.GetPhase() == pb.Phase_ABORT { return nil }` as the FIRST guard in verifyComposed1, before the routes/shardGroupID check. Enforcement-in-code rather than enforcement-by-convention. Caller audit (the loop directive's semantic-change check): verifyComposed1 is only called from handleTxnRequest, which treats nil as success. The ABORT bypass returns nil, so the caller's semantics are preserved. No further wiring change. * claude follow-up #1 (distribution/engine.go) — The SetHistoryDepthForTest doc comment said "no internal synchronisation here" even though the function acquired e.mu.Lock. Future readers would see the contradiction and not know which was authoritative. Rewrote the comment to describe the actual contract: the write is lock-protected, but callers should still set depth before sharing the Engine with concurrent SnapshotAt/Current readers to avoid interleaving surprises around the eviction watermark. * claude follow-up #2 (kv/fsm_composed1_test.go) — The gate suite had no happy-path test. An off-by-one in OwnerOf, a reversed comparison, or a sign-flip in the group-ID match would silently false-reject every valid commit and the failure-only suite would not catch it. Added TestVerifyComposed1_ValidOwnershipPassesGate: this FSM serves the group that owns the key at BOTH the observed version AND the current catalog version, so the gate must return nil. * claude follow-up nit (kv/fsm_composed1_test.go) — The `//nolint:unparam` annotation on newComposed1FSM is now redundant: the new ABORT and shardGroupID=0 tests pass distinct shardGroupID values, so golangci-lint no longer flags the parameter as constant. Removed (verified with `make lint` → 0 issues). Also added TestVerifyComposed1_AbortPhaseSkipsGate as the regression for the ABORT bypass: deliberately sets ObservedRouteVersion=1 (non-zero) on the ABORT to defeat the existing observedVer==0 short-circuit and isolate the new Phase_ABORT guard. Verification: go test -race -count=1 ./kv ./distribution → pass. make lint → 0 issues.

bootjp · 2026-06-01T11:37:52Z

Round-3: all remaining items addressed.

gemini HIGH (kv/fsm.go — ABORT bypass): Added if r.GetPhase() == pb.Phase_ABORT { return nil } as the FIRST guard in verifyComposed1. Production callers leave ObservedRouteVersion=0 on ABORT so today the existing observedVer==0 branch already protects them, but that's enforcement-by-convention — the new Phase_ABORT guard is enforcement-in-code. A future ABORT construction site that copies a COMMIT struct by mistake still bypasses. Caller audit: only handleTxnRequest calls verifyComposed1 and treats nil as success — semantics preserved.

claude follow-up #1 (SetHistoryDepthForTest doc comment): Rewrote to remove the contradictory "no internal synchronisation" wording; comment now correctly states the write is lock-protected but callers should still set depth before sharing the Engine with concurrent readers to avoid interleaving surprises.

claude follow-up #2 (happy-path test): Added TestVerifyComposed1_ValidOwnershipPassesGate — this FSM serves the group that owns the key at both the observed AND current version, gate returns nil. Locks down the pass direction the failure-only suite was missing.

ABORT regression test: TestVerifyComposed1_AbortPhaseSkipsGate deliberately sets ObservedRouteVersion=1 on the ABORT to defeat the existing observedVer==0 short-circuit and isolate the new Phase_ABORT guard.

Nit: removed the now-redundant //nolint:unparam on newComposed1FSM (new tests pass distinct shardGroupID values).

go test -race ./kv ./distribution pass / make lint 0 issues.

@claude review

claude · 2026-06-01T11:38:07Z

Claude finished @bootjp's task in 1m 50s —— View job

Code Review — Round 3 (commit `48bbcc4`)

Gather context and understand the request
Read changed files for round-3 commits
Verify ABORT Phase bypass fix
Verify SetHistoryDepthForTest doc comment fix
Verify happy-path test (ValidOwnershipPassesGate)
Verify ABORT regression test (AbortPhaseSkipsGate)
Verify nolint:unparam removal
Check for any remaining or new issues

All five round-3 items are correct. The remaining finding-tracking table is fully green.

Confirmed correct

ABORT Phase bypass (kv/fsm.go:594) — if r.GetPhase() == pb.Phase_ABORT { return nil } is correctly placed as the first guard — before the routes == nil || shardGroupID == 0 check and before the observedVer == 0 check. Uses the nil-safe GetPhase() proto accessor. Guard ordering is now:

Phase_ABORT → nil (new, enforcement-in-code)
routes == nil || shardGroupID == 0 → nil
observedVer == 0 → nil
observed-version lookup / current-version fence

This is the correct layering — an ABORT with a fabricated non-zero ObservedRouteVersion still safely bypasses.

SetHistoryDepthForTest doc comment (distribution/engine.go:211–238) — the contradictory "no internal synchronisation here" clause is gone. Replacement accurately states "the write itself is lock-protected (e.mu.Lock below)" while preserving the call-before-sharing guidance. ✅

Happy-path test (kv/fsm_composed1_test.go:263–281) — TestVerifyComposed1_ValidOwnershipPassesGate applies snapshots at v=1 (RouteID=100) and v=2 (RouteID=101), both assigning the full key range to GroupID=1. The FSM serves shardGroupID=1, request pins ObservedRouteVersion=1. Both the observed-version check (a) and the current-version fence (b) execute and pass. require.NoError correctly asserts the happy path. ✅

ABORT regression test (kv/fsm_composed1_test.go:298–321) — TestVerifyComposed1_AbortPhaseSkipsGate correctly isolates the new Phase_ABORT guard: key "k" is owned by group 5 at v=1, FSM serves group 1 (so absent the ABORT guard this would return ErrComposed1Violation), and ObservedRouteVersion=1 (non-zero) to defeat the observedVer==0 short-circuit. Only the Phase_ABORT guard saves it. ✅

//nolint:unparam removed — newComposed1FSM now receives both 0 (ShardGroupIDZero test) and 1 (all others), so golangci-lint's unparam check no longer fires and the suppression is gone. ✅

One cosmetic nit (non-blocking)

distribution/engine.go:217–226 has the "exists so tests in the kv package can drive eviction-trigger scenarios without adding a constructor option just for tests" sentence twice — once in the opening paragraph and once in the //Production code must use… paragraph. Entirely harmless.

Finding-tracking table

Finding	Status
coderabbit MAJOR: `shardGroupID==0` bypass	✅ Fixed (round 2)
coderabbit minor: `depth <= 0` panic	✅ Fixed (round 2)
gemini medium: shrink-panic in `SetHistoryDepthForTest`	✅ Fixed (round 2)
golangci/gci formatting	✅ Fixed (round 2)
Previous claude #1: misleading "no sync" comment	✅ Fixed (round 3)
Previous claude #2: missing happy-path test	✅ Fixed (round 3)
Gemini CRITICAL: no explicit ABORT Phase bypass	✅ Fixed (round 3)

No open findings. Gate logic matches the TLA+ spec (Composed1_CommitToOwningGroup applies to write-set, ABORT does not commit, so the bypass is spec-correct). Ready to merge once PR #894 (M2) lands.

github-actions · 2026-06-01T11:38:08Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (48bbcc4):

distribution/engine.go
kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

chatgpt-codex-connector · 2026-06-01T11:38:15Z

To use Codex here, create a Codex account and connect to github.

…-3 nit on PR #895) The previous text had the 'tests in the kv package can drive eviction-trigger scenarios without adding a constructor option' line twice — once in the opening paragraph and once in the 'Production code must use...' paragraph. Harmless duplication; collapsed into a single sentence in the second paragraph. No code change.

bootjp · 2026-06-01T11:45:21Z

Cosmetic nit (round-3 follow-up): deduped the doc-comment sentence "tests in the kv package can drive eviction-trigger scenarios without adding a constructor option" — was duplicated across two paragraphs of SetHistoryDepthForTest.

Doc-only; no code change. Claude round-3 confirmed all 7 findings ✅ Fixed and the table is fully green.

@claude review

claude · 2026-06-01T11:45:33Z

Claude finished @bootjp's task in 49s —— View job

Code Review — Round 4 (doc-only dedup, commit `fbd0d46`)

Read distribution/engine.go and verify the dedup
Check git diff for scope of changes
Post review

Scope confirmed: single file (distribution/engine.go), 6 insertions / 9 deletions, no code change. The diff removes the duplicated sentence:

"tests in the kv package can drive eviction-trigger scenarios without adding a constructor option just for tests"

…which previously appeared twice (once at the tail of the first paragraph, once at the head of the Production code must use… paragraph), and collapses them into a single occurrence in the second paragraph. The resulting comment is:

Non-duplicated ✅
Lock-protection description still accurate (the write itself is lock-protected (e.mu.Lock below)) ✅
depth <= 0 fail-fast rationale and shrink-panic explanation still present ✅
Production-only usage warning still present ✅

No new findings. Finding table remains fully green from round 3:

Finding	Status
coderabbit MAJOR: `shardGroupID==0` bypass	✅ Fixed (round 2)
coderabbit minor: `depth <= 0` panic	✅ Fixed (round 2)
gemini medium: shrink-panic in `SetHistoryDepthForTest`	✅ Fixed (round 2)
golangci/gci formatting	✅ Fixed (round 2)
Previous claude #1: misleading "no sync" comment	✅ Fixed (round 3)
Previous claude #2: missing happy-path test	✅ Fixed (round 3)
Gemini CRITICAL: no explicit ABORT Phase bypass	✅ Fixed (round 3)

Ready to merge once PR #894 (M2) lands.

github-actions · 2026-06-01T11:45:35Z

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (fbd0d46):

distribution/engine.go
kv/fsm.go

What to check, by subsystem:

kv/hlc*.go — Next() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
kv/coordinator.go, kv/sharded_coordinator.go — RunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.

@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

chatgpt-codex-connector · 2026-06-01T11:45:43Z

To use Codex here, create a Codex account and connect to github.

gemini-code-assist Bot reviewed May 31, 2026

View reviewed changes

coderabbitai Bot reviewed May 31, 2026

View reviewed changes

Comment thread distribution/engine.go

Comment thread kv/fsm.go

bootjp merged commit ed575bf into main Jun 1, 2026
11 checks passed

bootjp deleted the feat/composed1-m3-verify-gate branch June 1, 2026 12:16

coderabbitai Bot mentioned this pull request Jun 1, 2026

kv(composed1): M4 — Dispatch retry on ErrComposed1Violation / ErrComposed1VersionGCd #900

Open

3 tasks

Conversation

bootjp commented May 31, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's enforced

Short-circuits

Changes

Tests (kv/fsm_composed1_test.go)

Verification

Self-review (5 lenses)

Test plan

Resolves

Summary by CodeRabbit

Uh oh!

bootjp commented May 31, 2026

Uh oh!

coderabbitai Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

github-actions Bot commented May 31, 2026

TLA+ spec divergence review (auto-triggered)

Uh oh!

chatgpt-codex-connector Bot commented May 31, 2026

Uh oh!

claude Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels

Issues

1. Misleading comment in SetHistoryDepthForTest — contradicts its own implementation

2. Missing "happy path" test — gate can silently block valid txns undetected

3. Minor: ErrComposed1Violation doc comment overstates the scope of the check

4. Minor: Design doc Q1 not addressed at M3

5. Nit: Self-correction prose in test comment

Minor optimisation opportunity (non-blocking)

Lens summary

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bootjp commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

Uh oh!

claude Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — Follow-up (commit 6108626 fixes)

Confirmed correct

Remaining: two open items from the previous claude review

1. Misleading "no internal synchronisation" comment — still present

2. Missing happy-path test — still absent

Remaining: gemini's ABORT-invariant concern (not in the 4 addressed)

Nit: //nolint:unparam may now be dead weight

Summary

Uh oh!

chatgpt-codex-connector Bot commented Jun 1, 2026

Uh oh!

bootjp commented Jun 1, 2026

Uh oh!

claude Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

bootjp commented May 31, 2026 •

edited by coderabbitai Bot

Loading

Tests (`kv/fsm_composed1_test.go`)

coderabbitai Bot commented May 31, 2026 •

edited

Loading

claude Bot commented May 31, 2026 •

edited

Loading

1. Misleading comment in `SetHistoryDepthForTest` — contradicts its own implementation

3. Minor: `ErrComposed1Violation` doc comment overstates the scope of the check

claude Bot commented Jun 1, 2026 •

edited

Loading

Code Review — Follow-up (commit `6108626` fixes)

Nit: `//nolint:unparam` may now be dead weight

claude Bot commented Jun 1, 2026 •

edited

Loading

Code Review — Round 3 (commit `48bbcc4`)

claude Bot commented Jun 1, 2026 •

edited

Loading

Code Review — Round 4 (doc-only dedup, commit `fbd0d46`)