Skip to content

kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels#895

Merged
bootjp merged 4 commits into
mainfrom
feat/composed1-m3-verify-gate
Jun 1, 2026
Merged

kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels#895
bootjp merged 4 commits into
mainfrom
feat/composed1-m3-verify-gate

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 31, 2026

Summary

Third milestone (M3) of the Composed-1 cross-group commit-time guard per docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md §M3.

Wires the safety property at apply time using the M2 versioned-snapshot ring and the M1 ObservedRouteVersion field.

Stacked on PR #894 (M2 plumbing).

What's enforced

  • (a) Observed-version owner — the spec-level Composed-1 check from tla/composed/Composed.tla. Every write key of a Composed-1-pinned txn must be owned by THIS Raft group at the txn's observed catalog version. Refusal returns ErrComposed1Violation.
  • (b) Current-version cross-version-read fence — the §4.4 / §3 codex P1 trace. Even when (a) passes, a route shift between BeginTxn and Commit can leave the write landing on the OLD owner while readers at the new version route to the NEW owner and miss the write. The current-version fence refuses that case. Refusal returns ErrComposed1Violation with a different diagnostic prefix (current-version vs observed-version) so the M4 retry path can distinguish.
  • (c) Retention-miss fail-closed — when the txn's observed version has been evicted from the M2 ring (long-running txn or high catalog churn), return ErrComposed1VersionGCd. Per design doc §4.3, the not-found semantics is a hard retryable error, NOT a soft pass, because soft-pass would let the gate be bypassed exactly in the cases where the cross-version-read hazard is most likely.

Short-circuits

Three legacy / not-applicable cases that the gate explicitly bypasses:

  • FSM constructed without WithRouteHistory (legacy / test seam): routes == nil, gate returns nil.
  • Request carries ObservedRouteVersion == 0 (unpinned — pre-M1 caller, or 2PC ABORT request that doesn't carry the version): gate returns nil. This is what keeps M3 from regressing every caller that has not migrated to set OperationGroup.ObservedRouteVersion.
  • Engine.Current returns (zero, false) — engine has no history (bare-struct test seam): the (b) fence falls through.

Changes

  • distribution/engine.go — adds Engine.Current() returning the snapshot at the current catalogVersion. Adds SetHistoryDepthForTest as the cross-package test seam that lets kv-side tests trigger eviction without bypassing the package boundary (claude review on PR kv(composed1): M2 — versioned-snapshot ring + kvFSM RouteHistory wiring #894 — fragile-but-test-local lock contract documented inline).
  • kv/fsm.go — defines ErrComposed1Violation and ErrComposed1VersionGCd sentinels with the wrapped-error diagnostic shape M4 retry will use. Adds verifyComposed1 and verifyOwnerFromSnapshot. Wires verifyComposed1 into handleTxnRequest at the top so PREPARE / COMMIT / ABORT / NONE all pass through the gate (ABORT carries ObservedRouteVersion=0 and falls through naturally).
  • kv/route_history.goRouteHistory interface gains Current(). Adapter forwards to distribution.Engine.Current().

Tests (kv/fsm_composed1_test.go)

Test Maps to design doc M3 "Done when"
TestVerifyComposed1_StaleObservedVersionWithMovedKeyFails (i) observed-version snapshot resolves the key to a different group than this FSM
TestVerifyComposed1_ObservedVersionOlderThanRingFails (ii) observed version is outside the ring → ErrComposed1VersionGCd
TestVerifyComposed1_ObservedPassesButCurrentDiffersFails (iii) / §3 codex P1 trace: observed-version check passes but current-version differs → ErrComposed1Violation (current-version diagnostic)
TestVerifyComposed1_ObservedVersionZeroSkipsGate Legacy caller bypasses the gate
TestVerifyComposed1_NilRouteHistorySkipsGate Unwired FSM bypasses the gate

Verification

  • go build ./... — clean
  • go vet ./... — clean
  • go test -race -count=1 ./kv ./distribution — 10.2 s + 1.0 s, pass
  • make lint — 0 issues

Self-review (5 lenses)

  1. Data loss — the gate refuses commits but cannot lose them. Every rejection emits a sentinel the M4 coordinator path will convert into a successful commit on the correct owner.
  2. Concurrency — the gate runs under the FSM's apply loop (Raft applies are serial), so no new lock ordering. RouteHistory reads take the engine's read lock; SnapshotAt and Current each acquire and release in a single call.
  3. Performance — two map lookups + a per-mutation linear scan of the snapshot's sorted routes per Composed-1-pinned commit. OwnerOf short-circuits via the M2 round-1 break optimisation, so the per-mutation cost is bounded by the first non-covering gap rather than the full routes slice. Legacy callers (ObservedRouteVersion=0) pay one branch per commit and exit.
  4. Data consistency — closes the Composed-1 (observed) and Composed-1a (current) gaps end-to-end. Spec correspondence: tla/composed/Composed.tla's Commit precondition is now enforced at apply time; the §4.4 fence matches Composed1a_CommitToCurrentOwner from PR tla: Composed-1a current-version fence + currentfence_gap (#870 follow-up) #878.
  5. Test coverage — 5 new unit tests cover the three design doc criteria + the two legacy short-circuit paths. Existing handleTxnRequest tests pass unchanged (their requests carry ObservedRouteVersion=0 from M1's behaviour-neutral default).

Test plan

Resolves

The M3 row in the Composed-1 design doc.

Summary by CodeRabbit

  • Bug Fixes

    • Prevents cross-group transaction commits when route ownership shifts between observed and current versions.
    • Returns a clear error when an observed version is no longer retained in route history.
    • Ensures transactions are skipped (not rejected) when the shard/group identifier is unset, avoiding false rejections.
  • Tests

    • Added comprehensive tests covering validation across routing/version scenarios, including history GC and shard-group unset behavior.

Third milestone of the Composed-1 cross-group commit-time guard per
docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md
§M3.  Wires the safety property at apply time using the M2
versioned-snapshot ring and the M1 ObservedRouteVersion field.

Stacked on PR #894 (M2 plumbing).

What's enforced:

  (a) Observed-version owner — the spec-level Composed-1 check from
      tla/composed/Composed.tla.  Every write key of a Composed-1-
      pinned txn must be owned by THIS Raft group at the txn's
      observed catalog version (the version it read its read-set
      at, set in M1 via OperationGroup.ObservedRouteVersion).
      Refusal returns ErrComposed1Violation.

  (b) Current-version cross-version-read fence — the §4.4 / §3
      codex P1 trace.  Even when (a) passes, a route shift between
      BeginTxn and Commit can leave the write landing on the OLD
      owner while readers at the new version route to the NEW
      owner and miss the write.  The current-version fence refuses
      that case so M4 retry can re-route.  Refusal returns
      ErrComposed1Violation with a different diagnostic prefix
      ("current-version" vs "observed-version") so the retry path
      can distinguish.

  (c) Retention-miss fail-closed — when the txn's observed version
      has been evicted from the M2 ring (long-running txn or high
      catalog churn), return ErrComposed1VersionGCd.  Per design
      doc §4.3, the not-found semantics is a hard retryable error,
      NOT a soft pass, because soft-pass would let the gate be
      bypassed exactly in the cases where the cross-version-read
      hazard is most likely.

Short-circuits cleanly in three legacy / not-applicable cases:

  * FSM constructed without WithRouteHistory (legacy / test seam):
    routes == nil, gate returns nil.
  * Request carries ObservedRouteVersion == 0 (unpinned — pre-M1
    caller, or 2PC ABORT request that doesn't carry the version):
    gate returns nil.  This is what keeps M3 from regressing every
    caller that has not migrated to set OperationGroup.
    ObservedRouteVersion.
  * Engine.Current returns (zero, false) — engine has no history
    (bare-struct test seam): the (b) fence falls through.

Changes:

  * distribution/engine.go — adds Engine.Current() returning the
    snapshot at the current catalogVersion (used by the M3 fence).
    Adds SetHistoryDepthForTest as the cross-package test seam
    that lets kv-side tests trigger eviction without bypassing
    the package boundary (claude review on PR #894 — fragile-but-
    test-local lock contract documented inline).

  * kv/fsm.go — defines ErrComposed1Violation and
    ErrComposed1VersionGCd sentinels with the wrapped-error
    diagnostic shape M4 retry will use.  Adds verifyComposed1 and
    verifyOwnerFromSnapshot.  Wires verifyComposed1 into
    handleTxnRequest at the top so PREPARE / COMMIT / ABORT /
    NONE all pass through the gate (ABORT carries
    ObservedRouteVersion=0 and falls through naturally).

  * kv/route_history.go — RouteHistory interface gains Current().
    Adapter forwards to distribution.Engine.Current().

Tests (kv/fsm_composed1_test.go):

  * TestVerifyComposed1_StaleObservedVersionWithMovedKeyFails —
    design doc M3 "done when" criterion (i): observed-version
    snapshot resolves the key to a different group than this FSM.
  * TestVerifyComposed1_ObservedVersionOlderThanRingFails —
    criterion (ii): observed version is outside the ring →
    ErrComposed1VersionGCd.  Uses depth=2 via the cross-package
    SetHistoryDepthForTest seam.
  * TestVerifyComposed1_ObservedPassesButCurrentDiffersFails —
    criterion (iii) / the §3 codex P1 trace: observed-version
    check passes (routes[1][k1]=g1) but the current snapshot at
    v=2 has moved k1 to g2 → ErrComposed1Violation with the
    "current-version" diagnostic prefix.
  * TestVerifyComposed1_ObservedVersionZeroSkipsGate — legacy
    caller (ObservedRouteVersion=0) bypasses the gate.
  * TestVerifyComposed1_NilRouteHistorySkipsGate — unwired FSM
    (no WithRouteHistory option) bypasses the gate; matches the
    pre-feature posture.

Verification:
  * go build ./...                            — clean
  * go vet ./...                              — clean
  * go test -race -count=1 ./kv ./distribution — 10.2 s + 1.0 s, pass

Self-review (5 lenses):

  1. Data loss — the gate refuses commits but cannot lose them.
     Every rejection emits a sentinel the M4 coordinator path will
     convert into a successful commit on the correct owner.
  2. Concurrency — the gate runs under the FSM's apply loop (Raft
     applies are serial), so no new lock ordering.  RouteHistory
     reads take the engine's read lock; SnapshotAt and Current
     each acquire and release in a single call.
  3. Performance — two map lookups + a per-mutation linear scan
     of the snapshot's sorted routes per Composed-1-pinned commit.
     OwnerOf short-circuits via the M2 round-1 break optimisation,
     so the per-mutation cost is bounded by the first non-covering
     gap rather than the full routes slice.  Legacy callers
     (ObservedRouteVersion=0) pay one branch per commit and exit.
  4. Data consistency — closes the Composed-1 (observed) and
     Composed-1a (current) gaps end-to-end.  Spec correspondence:
     tla/composed/Composed.tla's Commit precondition is now
     enforced at apply time; the §4.4 fence matches
     Composed1a_CommitToCurrentOwner from PR #878.
  5. Test coverage — 5 new unit tests cover the three design doc
     criteria + the two legacy short-circuit paths.  Existing
     handleTxnRequest tests pass unchanged (their requests carry
     ObservedRouteVersion=0 from M1's behaviour-neutral default).

Next milestone (separate PR per design doc §6): M4 — coordinator
retry path.  When either Composed-1 sentinel returns, the
coordinator re-reads the route cache, re-routes the txn against
the new owning group, and re-issues it once.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 31, 2026

@claude review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Review Change Stack

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 42 minutes and 31 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ee427f08-5688-46cf-b45f-a600e7064c18

📥 Commits

Reviewing files that changed from the base of the PR and between 6108626 and fbd0d46.

📒 Files selected for processing (3)
  • distribution/engine.go
  • kv/fsm.go
  • kv/fsm_composed1_test.go
📝 Walkthrough

Walkthrough

Adds Engine.Current() and SetHistoryDepthForTest(), extends RouteHistory with Current(), implements the composed-1 verification (verifyComposed1) in kv FSM with two retryable errors, and adds tests covering observed/current checks and short-circuit cases.

Changes

Composed-1 Gate for Transaction Consistency

Layer / File(s) Summary
Engine history snapshot APIs
distribution/engine.go, kv/route_history.go
Engine.Current() returns the snapshot for the engine's current catalogVersion; Engine.SetHistoryDepthForTest() adjusts retained history and evicts oldest entries when shrinking; distributionEngineAdapter.Current() exposes this to the RouteHistory interface.
RouteHistory interface contract
kv/fsm.go
RouteHistory now includes Current() alongside SnapshotAt(), documenting ring-not-found semantics used by the composed-1 gate.
Composed-1 gate logic and errors
kv/fsm.go
Adds ErrComposed1Violation and ErrComposed1VersionGCd; implements verifyComposed1() which validates mutation-key ownership against SnapshotAt(ObservedRouteVersion) and, when available, Current(); handleTxnRequest invokes the gate before phase dispatch.
Composed-1 gate test suite
kv/fsm_composed1_test.go
Test helpers to populate an in-memory distribution.Engine and construct kvFSM plus tests for: observed-version owner mismatch, observed-version evicted (GCd), current-version fence rejection, ObservedRouteVersion==0 skip, nil routes skip, and shardGroupID==0 skip.

Sequence Diagram

sequenceDiagram
  participant Client
  participant FSM as kvFSM
  participant Gate as verifyComposed1
  participant Observed as RouteHistory(SnapshotAt)
  participant Current as RouteHistory(Current)

  Client->>FSM: Txn with ObservedRouteVersion
  FSM->>Gate: verifyComposed1(request)
  Gate->>Observed: SnapshotAt(ObservedVersion)
  alt Observed version evicted
    Observed-->>Gate: nil, false
    Gate-->>FSM: ErrComposed1VersionGCd
  else Observed version found
    Observed-->>Gate: RouteSnapshot
    Gate->>Gate: Check key owners vs shardGroupID
    alt Owner mismatch at observed
      Gate-->>FSM: ErrComposed1Violation (observed-version)
    else Owner match at observed
      Gate->>Current: Current()
      alt Current not available
        Current-->>Gate: nil, false
        Gate-->>FSM: nil
      else Current available
        Current-->>Gate: RouteSnapshot
        Gate->>Gate: Check key owners vs shardGroupID
        alt Owner mismatch at current
          Gate-->>FSM: ErrComposed1Violation (current-version)
        else All checks pass
          Gate-->>FSM: nil
          FSM->>FSM: Proceed to phase dispatch
        end
      end
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • bootjp/elastickv#881: Propagates ObservedRouteVersion into pb.Request, which the composed-1 gate consumes for historical ownership checks.
  • bootjp/elastickv#894: Introduced the versioned route-history ring and SnapshotAt() plumbing that this change extends with Current() and test-depth control.

Poem

🐰 I peek at history, then glance at now,
I guard the keys with twitching brow.
If versions drift or memory's thin,
I nudge the txn to try again.
Hooray — no stale-commit sin!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly summarizes the main objective: implementing an apply-time Composed-1 gate (verifyComposed1) with retry error sentinels, which is the core feature across all modified files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/composed1-m3-verify-gate

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (c94e0d1):

  • distribution/engine.go
  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented May 31, 2026

Claude finished @bootjp's task in 3m 24s —— View job


Code Review — kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels

  • Gather context and understand the request
  • Read changed files
  • Review design doc for M3 spec correspondence
  • Review lens 1: Data loss
  • Review lens 2: Concurrency / distributed failures
  • Review lens 3: Performance
  • Review lens 4: Data consistency
  • Review lens 5: Test coverage

Overall the implementation is solid. The gate logic matches the TLA+ spec (§4.2(a)) and the cross-version fence (§4.4) is correctly wired. No data-loss paths, no lock-ordering issues, no safety bugs found. A few issues worth addressing are called out below in order of severity.


Issues

1. Misleading comment in SetHistoryDepthForTest — contradicts its own implementation

distribution/engine.go:217–228

// SetHistoryDepthForTest overrides the FIFO ring depth from outside
// the package.  Test-only — callers MUST set the depth before the
// Engine is shared with any concurrent reader (no internal
// synchronisation here …
func (e *Engine) SetHistoryDepthForTest(depth int) {
    e.mu.Lock()       // ← lock IS acquired
    defer e.mu.Unlock()
    e.historyDepth = depth
}

The comment says "no internal synchronisation here" but the function acquires e.mu.Lock(). A future maintainer reading the comment will think concurrent calls are unsafe when they are actually protected. The comment was likely copied from the in-package direct field write (TestEngineSnapshotAt_FIFOEviction) and not updated.

Suggested fix: change "no internal synchronisation here""caller should set the depth before the Engine receives concurrent Snapshots to avoid test-interleaving surprises, but the write itself is lock-protected." Fix this →


2. Missing "happy path" test — gate can silently block valid txns undetected

kv/fsm_composed1_test.go

All 5 tests assert that the gate fires (violation, GC'd) or short-circuits (nil routes, zero version). There is no test that asserts the gate passes cleanly when the txn's observed version correctly identifies this group as the owner at both the observed and current versions. Without a passing-gate test, an over-aggressive check (e.g., an off-by-one in OwnerOf or an inadvertent negation of owner != f.shardGroupID) could incorrectly reject every valid commit and the test suite would not catch it.

Suggested addition — a test along these lines:

func TestVerifyComposed1_ValidOwnershipPassesGate(t *testing.T) {
    t.Parallel()
    e := distribution.NewEngine()
    // v=1 and v=2 both assign k to g1 — no route shift.
    applyComposed1Snapshot(t, e, 1, []distribution.RouteDescriptor{
        {RouteID: 100, Start: []byte(""), End: nil, GroupID: 1, State: distribution.RouteStateActive},
    })
    applyComposed1Snapshot(t, e, 2, []distribution.RouteDescriptor{
        {RouteID: 100, Start: []byte(""), End: nil, GroupID: 1, State: distribution.RouteStateActive},
    })
    fsm := newComposed1FSM(t, e, 1)

    err := fsm.verifyComposed1(commitTxnRequest(1, "k"))
    require.NoError(t, err, "gate must pass when this group owns the key at both observed and current version")
}

Fix this →


3. Minor: ErrComposed1Violation doc comment overstates the scope of the check

kv/fsm.go:215

// ErrComposed1Violation is returned … because the
// txn's read-set or write-set keys are not owned by this group …

verifyOwnerFromSnapshot iterates r.GetMutations() (the write-set only). r.ReadKeys is not checked — consistent with the TLA+ spec where Composed1_CommitToOwningGroup quantifies over txnWriteSet. The doc comment's "read-set or write-set" is misleading.

Fix this →


4. Minor: Design doc Q1 not addressed at M3

Design doc §9 Q1 says: "allow unpinned at M1, log a warning at M3, error at M5."

The M3 implementation short-circuits ObservedRouteVersion==0 with return nil and no log. Keeping the gate silent is probably the right call right now (to avoid log spam from all legacy callers), but it drifts from the design doc's stated M3 plan. Either update the doc to defer the warning to M4/M5, or add a single slog.WarnContext at the observedVer == 0 short-circuit. This is a documentation/tracking issue, not a correctness issue.


5. Nit: Self-correction prose in test comment

kv/fsm_composed1_test.go:73–77

// Wait — re-read the design doc: criterion (i) is …

The "Wait — re-read the design doc" reads like a debugging note. The clarification is useful but could be written as a direct statement. This is purely cosmetic.


Minor optimisation opportunity (non-blocking)

When observedVer == currentSnap.Version() (the catalog hasn't advanced since txn-begin), verifyOwnerFromSnapshot is called twice against the same snapshot. Adding an early exit after the observed check passes avoids the redundant linear scan:

// (b) Current-version fence.
currentSnap, ok := f.routes.Current()
if !ok {
    return nil
}
// Fast path: if the current version is the same as the observed version,
// the observed-version check already validated current ownership.
if currentSnap.Version() == observedVer {
    return nil
}
return f.verifyOwnerFromSnapshot(r.GetMutations(), currentSnap, currentSnap.Version(), "current")

This is O(|write-set|) savings and entirely optional, but it's a clean win for the common steady-state case.


Lens summary

Lens Result
Data loss Clean — gate can only reject, never silently drop committed writes
Concurrency Clean — gate runs under the serial Raft apply loop; SnapshotAt/Current acquire and release e.mu.RLock atomically
Performance Two map lookups + O(
Data consistency Spec correspondence correct; write-set only check matches Composed1_CommitToOwningGroup; fail-closed on ring miss is correct per §4.3
Test coverage 5 failure/short-circuit tests present; missing happy-path test (see issue 2 above); no multi-key test

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the apply-time Composed-1 gate to prevent cross-version-read anomalies during route shifts. It introduces the Current() method to retrieve the latest route catalog snapshot, defines ErrComposed1Violation and ErrComposed1VersionGCd errors, and adds verification logic in kvFSM.handleTxnRequest to validate key ownership against both the observed and current catalog versions. Comprehensive unit tests are also added. The reviewer identified two critical issues: first, ABORT requests must bypass the Composed-1 gate to avoid leaking MVCC locks, and a short-circuit is needed when shardGroupID is unset (0); second, reducing history depth via SetHistoryDepthForTest on an already populated engine can cause a runtime panic unless excess history entries are immediately truncated.

Comment thread kv/fsm.go
Comment on lines +582 to +585
func (f *kvFSM) verifyComposed1(r *pb.Request) error {
if f.routes == nil {
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two critical issues here:

  1. ABORT requests must not be blocked by the Composed-1 gate: If an ABORT request is rejected due to a route shift (ErrComposed1Violation) or history eviction (ErrComposed1VersionGCd), the transaction's locks and intents will be permanently leaked in the MVCC store, blocking future writes. Aborts should always bypass the gate to ensure successful cleanup.
  2. Missing short-circuit for unset shardGroupID: The comments state that a shardGroupID of 0 represents an unset/legacy state and should short-circuit the gate. However, the code does not check for f.shardGroupID == 0, which will cause spurious ErrComposed1Violation errors on unset groups.
func (f *kvFSM) verifyComposed1(r *pb.Request) error {
	if r.Phase == pb.Phase_ABORT {
		return nil
	}
	if f.routes == nil || f.shardGroupID == 0 {
		return nil
	}

Comment thread distribution/engine.go
Comment on lines +225 to +229
func (e *Engine) SetHistoryDepthForTest(depth int) {
e.mu.Lock()
defer e.mu.Unlock()
e.historyDepth = depth
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If SetHistoryDepthForTest is called to reduce the history depth after the history ring has already been populated with more entries than the new depth, a subsequent call to recordHistorySnapshotLocked will trigger a runtime panic. This is because make([]uint64, len(e.historyOrder)-1, e.historyDepth) will be invoked with a length greater than its capacity. To prevent this, truncate the excess history entries immediately when updating the depth.

Suggested change
func (e *Engine) SetHistoryDepthForTest(depth int) {
e.mu.Lock()
defer e.mu.Unlock()
e.historyDepth = depth
}
func (e *Engine) SetHistoryDepthForTest(depth int) {
e.mu.Lock()
defer e.mu.Unlock()
e.historyDepth = depth
if len(e.historyOrder) > depth {
excess := len(e.historyOrder) - depth
for i := 0; i < excess; i++ {
delete(e.history, e.historyOrder[i])
}
retained := make([]uint64, depth)
copy(retained, e.historyOrder[excess:])
e.historyOrder = retained
}
}

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
kv/fsm_composed1_test.go (1)

32-35: ⚡ Quick win

Drop the new //nolint:unparam and refactor the helper instead.

This suppression is avoidable in new test code: either bake in the single currently used group ID, or make the group ID vary through actual table cases when you need that coverage.

As per coding guidelines, "Avoid //nolint annotations — refactor instead."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kv/fsm_composed1_test.go` around lines 32 - 35, Remove the //nolint:unparam
annotation and refactor newComposed1FSM to stop accepting the unused
shardGroupID parameter: either bake the single used group ID (1) directly into
newComposed1FSM (replace the shardGroupID parameter with the constant 1 inside
the function) and update all call sites to call newComposed1FSM(t, e)
accordingly, or convert the tests that need varying group IDs into table-driven
cases that construct the FSM with different explicit group IDs (so the parameter
becomes used). Ensure all references to function newComposed1FSM are updated and
delete the //nolint comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@distribution/engine.go`:
- Around line 225-229: SetHistoryDepthForTest should validate the incoming depth
and reject non-positive values to fail fast; inside
Engine.SetHistoryDepthForTest (protected by e.mu), add a check like if depth <=
0 { panic("SetHistoryDepthForTest: depth must be > 0") } (or return an
error/report) before assigning e.historyDepth so recordHistorySnapshotLocked and
the eviction path never encounter an empty historyOrder; update any tests that
call SetHistoryDepthForTest to provide positive depths.

In `@kv/fsm.go`:
- Around line 582-589: verifyComposed1 currently only short-circuits when
f.routes == nil but the documented behavior says shardGroupID == 0 should also
bypass; update kvFSM.verifyComposed1 to return nil immediately when
f.shardGroupID == 0 (before any route/version checks) so callers that install
routes before setting shardGroupID preserve the legacy skip-for-group-0
behavior; keep the existing observedVer handling unchanged.

---

Nitpick comments:
In `@kv/fsm_composed1_test.go`:
- Around line 32-35: Remove the //nolint:unparam annotation and refactor
newComposed1FSM to stop accepting the unused shardGroupID parameter: either bake
the single used group ID (1) directly into newComposed1FSM (replace the
shardGroupID parameter with the constant 1 inside the function) and update all
call sites to call newComposed1FSM(t, e) accordingly, or convert the tests that
need varying group IDs into table-driven cases that construct the FSM with
different explicit group IDs (so the parameter becomes used). Ensure all
references to function newComposed1FSM are updated and delete the //nolint
comment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af79a132-4d95-4f0c-b28e-0e8008f5f03d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b118a4 and c94e0d1.

📒 Files selected for processing (4)
  • distribution/engine.go
  • kv/fsm.go
  • kv/fsm_composed1_test.go
  • kv/route_history.go

Comment thread distribution/engine.go
Comment thread kv/fsm.go
…seam (reviews on PR #895)

Three findings from PR #895's automated review pass, all addressed:

  * coderabbit MAJOR — verifyComposed1 only checked `routes == nil`
    for the bypass, leaving a partially-wired FSM (routes installed
    before group ID, or a test fixture that passed engine but
    shardGroupID=0) to reject every pinned txn with
    ErrComposed1Violation.  The doc comment on kvFSM.shardGroupID
    says "Zero ... means 'unset' — the M3 gate will short-circuit
    (matches the pre-feature behaviour)"; the code now honours
    that contract too.

    Fix: bypass when EITHER routes==nil OR shardGroupID==0.
    Caller audit (the loop directive's semantic-change check):
    only handleTxnRequest calls verifyComposed1, and it treats
    nil as success — bypass-as-nil preserves the caller's
    semantics.  No further wiring change.

  * coderabbit minor — SetHistoryDepthForTest accepted depth<=0,
    which would later surface as an index-out-of-range deep in
    recordHistorySnapshotLocked's eviction path (historyOrder[0]
    on an empty slice).  Now panics at the seam with a clear
    message so the misconfigured test fails fast.

  * gemini medium — Shrinking historyDepth below the current
    ring size left len(historyOrder) > historyDepth, and the
    next recordHistorySnapshotLocked would
    `make([]uint64, len-1, historyDepth)` with len-1 >
    historyDepth → panic in apply.  Now SetHistoryDepthForTest
    evicts the excess oldest entries and compacts the slice
    immediately under the write lock.

  * golangci/reviewdog formatting (gci on kv/fsm.go and
    kv/fsm_composed1_test.go) — auto-fixed by `make lint`.

Regression test (added before the fix per the project convention):

  * TestVerifyComposed1_ShardGroupIDZeroSkipsGate — builds an
    engine where the route is owned by group 5 and constructs an
    FSM with shardGroupID=0; pre-fix the gate rejects the pinned
    commit (OwnerOf returns 5 ≠ this FSM's 0), post-fix the gate
    short-circuits via the new condition.  Verified failing on
    the parent commit and passing on this commit.

Verification: go test -race -count=1 ./kv ./distribution → pass.
make lint → 0 issues.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

All 4 findings addressed:

coderabbit MAJOR (kv/fsm.go:589 — shardGroupID==0 bypass) — verifyComposed1 now honors both routes==nil and shardGroupID==0 per the documented unset-sentinel contract. Caller audit (per the loop directive's semantic-change check): only handleTxnRequest calls verifyComposed1 and treats nil as success — bypass-as-nil preserves caller semantics, no further wiring change. Regression test TestVerifyComposed1_ShardGroupIDZeroSkipsGate added BEFORE the fix per project convention, verified failing on parent / passing on fix.

coderabbit minor (distribution/engine.go:229 — reject depth<=0) — SetHistoryDepthForTest now panics at the seam with a clear message.

gemini medium (distribution/engine.go:229 — shrink panic) — SetHistoryDepthForTest now evicts excess entries and compacts the slice when shrinking depth below current ring size.

golangci/reviewdog (gci formatting on kv/fsm.go and kv/fsm_composed1_test.go) — auto-fixed by make lint --fix.

go test -race ./kv ./distribution pass / make lint 0 issues.

@claude review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (6108626):

  • distribution/engine.go
  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 15s —— View job


Code Review — Follow-up (commit 6108626 fixes)

  • Gather context and understand the request
  • Read changed files and diff for commit 6108626
  • Review fix: shardGroupID==0 bypass + regression test
  • Review fix: depth<=0 panic in SetHistoryDepthForTest
  • Review fix: shrink eviction in SetHistoryDepthForTest
  • Review fix: gci formatting
  • Check for any remaining or new issues
  • Post review

All 4 listed fixes are correct. Two items from the previous round and one from gemini remain open.

Confirmed correct

shardGroupID == 0 bypass (kv/fsm.go:591) — f.routes == nil || f.shardGroupID == 0 is the right predicate. Both sentinel values are now handled in one branch. The regression test TestVerifyComposed1_ShardGroupIDZeroSkipsGate correctly exercises the fix: engine wired with GroupID=5, FSM constructed with shardGroupID=0, asserts neither Composed-1 error surfaces.

depth <= 0 panic (distribution/engine.go:238) — Fails fast before lock acquisition, which is correct for a test-only seam. No lock-ordering issue (no held lock to release on panic path).

Shrink truncation (distribution/engine.go:244–252) — Logic is correct. excess = len(historyOrder) - depth entries are deleted from the map (oldest first, index 0..excess-1), and copy(retained, historyOrder[excess:]) lifts the depth most-recent tail into a new slice. The resulting invariant len(historyOrder) == depth matches recordHistorySnapshotLocked's eviction assumption.

gci formatting — The //nolint:unparam repositioning (from above the prose comment to immediately before the func line) is the correct placement for a //nolint directive.


Remaining: two open items from the previous claude review

1. Misleading "no internal synchronisation" comment — still present

distribution/engine.go:213–214

// SetHistoryDepthForTest overrides the FIFO ring depth from outside
// the package.  Test-only — callers MUST set the depth before the
// Engine is shared with any concurrent reader (no internal
// synchronisation here …

This comment was already contradicted by the pre-existing e.mu.Lock() when the previous review raised it (issue #1 in the previous claude review). Commit 6108626 adds the shrink logic inside that same lock but does not update the text. The "no internal synchronisation here" clause now causes genuine confusion: a future reader of the comment will infer concurrent calls are unsafe, then see the lock and wonder which is authoritative.

Suggested fix: replace "no internal synchronisation here for the same reason TestEngineSnapshotAt_FIFOEviction does the direct field write in-package" with "callers should set depth before sharing the Engine with concurrent Snapshots to avoid interleaving surprises, but the write itself is lock-protected". Fix this →

2. Missing happy-path test — still absent

kv/fsm_composed1_test.go now has 6 tests (5 original + the new shardGroupID=0 regression). All 6 test failure/bypass paths. There is still no test asserting that verifyComposed1 returns nil when the txn's observed version AND the current version both assign the key to this FSM's group. An off-by-one in OwnerOf, a reversed comparison, or an inadvertent sign-flip in the group-ID match would produce a false-positive rejection on every valid commit, and the test suite would not catch it. Fix this →


Remaining: gemini's ABORT-invariant concern (not in the 4 addressed)

Gemini flagged this as CRITICAL. The current protection for ABORT is that all ABORT construction sites (transaction.go:482, sharded_coordinator.go:859) leave ObservedRouteVersion at its zero-value default, so the observedVer == 0 branch fires before any route check. Verified: both callsites construct the pb.Request struct without setting ObservedRouteVersion.

This is safe today, but it is an undocumented protocol invariant — nothing in the type system or a test enforces that ABORT must carry ObservedRouteVersion == 0. A future developer adding a third ABORT construction path could inadvertently copy-paste a COMMIT request struct and set the version field, causing the gate to block ABORTs and permanently leak MVCC locks on those keys.

Gemini's suggestion — add if r.Phase == pb.Phase_ABORT { return nil } as the first guard — costs one branch and makes the invariant enforcement-in-code rather than enforcement-by-convention. Given that the consequence of a missed ABORT is lock leakage until TTL, the defensive guard is worth adding. Fix this →


Nit: //nolint:unparam may now be dead weight

newComposed1FSM is now called with both shardGroupID=1 (five tests) and shardGroupID=0 (the new regression test), so unparam should no longer flag the parameter as "always the same value." The annotation is harmless but can be removed if make lint no longer reports the finding. Fix this →


Summary

Finding Status
coderabbit MAJOR: shardGroupID==0 bypass ✅ Fixed + regression test
coderabbit minor: depth <= 0 panic ✅ Fixed
gemini medium: shrink-panic in SetHistoryDepthForTest ✅ Fixed
golangci/gci formatting ✅ Fixed
Previous claude #1: misleading "no sync" comment ⚠️ Still open
Previous claude #2: missing happy-path test ⚠️ Still open
Gemini CRITICAL: no explicit ABORT Phase bypass ⚠️ Still open (safe today, fragile by convention)

The three remaining issues are all non-blocking for M3's correctness (the gate logic itself is correct; the ABORT path is safe today). Happy to approve once the happy-path test is in — that's the one change that gives the gate meaningful coverage in both the pass and fail directions.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

…#895)

Three remaining items from claude's round-2 follow-up review + the
gemini HIGH-priority finding that was missed in round 2:

  * gemini HIGH (kv/fsm.go) — ABORT requests MUST bypass the
    Composed-1 gate.  If a route shifted between PREPARE and
    ABORT (or the observed version was evicted), a rejected
    ABORT would leave the txn's intent locks pinned in MVCC
    until LockResolver's TTL — minutes of write-blocked keys
    for what should be a one-RPC cleanup.

    Production callers (transaction.go, sharded_coordinator.go)
    happen to leave ObservedRouteVersion=0 on ABORT requests, so
    today the existing observedVer==0 short-circuit already
    protects them.  But that is enforcement-by-convention; a
    future ABORT construction site that copies a COMMIT struct
    by mistake would silently start failing aborts.

    Fix: add `if r.GetPhase() == pb.Phase_ABORT { return nil }`
    as the FIRST guard in verifyComposed1, before the
    routes/shardGroupID check.  Enforcement-in-code rather than
    enforcement-by-convention.

    Caller audit (the loop directive's semantic-change check):
    verifyComposed1 is only called from handleTxnRequest, which
    treats nil as success.  The ABORT bypass returns nil, so the
    caller's semantics are preserved.  No further wiring change.

  * claude follow-up #1 (distribution/engine.go) — The
    SetHistoryDepthForTest doc comment said "no internal
    synchronisation here" even though the function acquired
    e.mu.Lock.  Future readers would see the contradiction and
    not know which was authoritative.  Rewrote the comment to
    describe the actual contract: the write is lock-protected,
    but callers should still set depth before sharing the
    Engine with concurrent SnapshotAt/Current readers to avoid
    interleaving surprises around the eviction watermark.

  * claude follow-up #2 (kv/fsm_composed1_test.go) — The gate
    suite had no happy-path test.  An off-by-one in OwnerOf, a
    reversed comparison, or a sign-flip in the group-ID match
    would silently false-reject every valid commit and the
    failure-only suite would not catch it.  Added
    TestVerifyComposed1_ValidOwnershipPassesGate: this FSM
    serves the group that owns the key at BOTH the observed
    version AND the current catalog version, so the gate must
    return nil.

  * claude follow-up nit (kv/fsm_composed1_test.go) — The
    `//nolint:unparam` annotation on newComposed1FSM is now
    redundant: the new ABORT and shardGroupID=0 tests pass
    distinct shardGroupID values, so golangci-lint no longer
    flags the parameter as constant.  Removed (verified with
    `make lint` → 0 issues).

Also added TestVerifyComposed1_AbortPhaseSkipsGate as the
regression for the ABORT bypass: deliberately sets
ObservedRouteVersion=1 (non-zero) on the ABORT to defeat the
existing observedVer==0 short-circuit and isolate the new
Phase_ABORT guard.

Verification: go test -race -count=1 ./kv ./distribution → pass.
make lint → 0 issues.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Round-3: all remaining items addressed.

gemini HIGH (kv/fsm.go — ABORT bypass): Added if r.GetPhase() == pb.Phase_ABORT { return nil } as the FIRST guard in verifyComposed1. Production callers leave ObservedRouteVersion=0 on ABORT so today the existing observedVer==0 branch already protects them, but that's enforcement-by-convention — the new Phase_ABORT guard is enforcement-in-code. A future ABORT construction site that copies a COMMIT struct by mistake still bypasses. Caller audit: only handleTxnRequest calls verifyComposed1 and treats nil as success — semantics preserved.

claude follow-up #1 (SetHistoryDepthForTest doc comment): Rewrote to remove the contradictory "no internal synchronisation" wording; comment now correctly states the write is lock-protected but callers should still set depth before sharing the Engine with concurrent readers to avoid interleaving surprises.

claude follow-up #2 (happy-path test): Added TestVerifyComposed1_ValidOwnershipPassesGate — this FSM serves the group that owns the key at both the observed AND current version, gate returns nil. Locks down the pass direction the failure-only suite was missing.

ABORT regression test: TestVerifyComposed1_AbortPhaseSkipsGate deliberately sets ObservedRouteVersion=1 on the ABORT to defeat the existing observedVer==0 short-circuit and isolate the new Phase_ABORT guard.

Nit: removed the now-redundant //nolint:unparam on newComposed1FSM (new tests pass distinct shardGroupID values).

go test -race ./kv ./distribution pass / make lint 0 issues.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 1m 50s —— View job


Code Review — Round 3 (commit 48bbcc4)

  • Gather context and understand the request
  • Read changed files for round-3 commits
  • Verify ABORT Phase bypass fix
  • Verify SetHistoryDepthForTest doc comment fix
  • Verify happy-path test (ValidOwnershipPassesGate)
  • Verify ABORT regression test (AbortPhaseSkipsGate)
  • Verify nolint:unparam removal
  • Check for any remaining or new issues

All five round-3 items are correct. The remaining finding-tracking table is fully green.

Confirmed correct

ABORT Phase bypass (kv/fsm.go:594) — if r.GetPhase() == pb.Phase_ABORT { return nil } is correctly placed as the first guard — before the routes == nil || shardGroupID == 0 check and before the observedVer == 0 check. Uses the nil-safe GetPhase() proto accessor. Guard ordering is now:

  1. Phase_ABORT → nil (new, enforcement-in-code)
  2. routes == nil || shardGroupID == 0 → nil
  3. observedVer == 0 → nil
  4. observed-version lookup / current-version fence

This is the correct layering — an ABORT with a fabricated non-zero ObservedRouteVersion still safely bypasses.

SetHistoryDepthForTest doc comment (distribution/engine.go:211–238) — the contradictory "no internal synchronisation here" clause is gone. Replacement accurately states "the write itself is lock-protected (e.mu.Lock below)" while preserving the call-before-sharing guidance. ✅

Happy-path test (kv/fsm_composed1_test.go:263–281) — TestVerifyComposed1_ValidOwnershipPassesGate applies snapshots at v=1 (RouteID=100) and v=2 (RouteID=101), both assigning the full key range to GroupID=1. The FSM serves shardGroupID=1, request pins ObservedRouteVersion=1. Both the observed-version check (a) and the current-version fence (b) execute and pass. require.NoError correctly asserts the happy path. ✅

ABORT regression test (kv/fsm_composed1_test.go:298–321) — TestVerifyComposed1_AbortPhaseSkipsGate correctly isolates the new Phase_ABORT guard: key "k" is owned by group 5 at v=1, FSM serves group 1 (so absent the ABORT guard this would return ErrComposed1Violation), and ObservedRouteVersion=1 (non-zero) to defeat the observedVer==0 short-circuit. Only the Phase_ABORT guard saves it. ✅

//nolint:unparam removednewComposed1FSM now receives both 0 (ShardGroupIDZero test) and 1 (all others), so golangci-lint's unparam check no longer fires and the suppression is gone. ✅


One cosmetic nit (non-blocking)

distribution/engine.go:217–226 has the "exists so tests in the kv package can drive eviction-trigger scenarios without adding a constructor option just for tests" sentence twice — once in the opening paragraph and once in the //Production code must use… paragraph. Entirely harmless.


Finding-tracking table

Finding Status
coderabbit MAJOR: shardGroupID==0 bypass ✅ Fixed (round 2)
coderabbit minor: depth <= 0 panic ✅ Fixed (round 2)
gemini medium: shrink-panic in SetHistoryDepthForTest ✅ Fixed (round 2)
golangci/gci formatting ✅ Fixed (round 2)
Previous claude #1: misleading "no sync" comment ✅ Fixed (round 3)
Previous claude #2: missing happy-path test ✅ Fixed (round 3)
Gemini CRITICAL: no explicit ABORT Phase bypass ✅ Fixed (round 3)

No open findings. Gate logic matches the TLA+ spec (Composed1_CommitToOwningGroup applies to write-set, ABORT does not commit, so the bypass is spec-correct). Ready to merge once PR #894 (M2) lands.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (48bbcc4):

  • distribution/engine.go
  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

…-3 nit on PR #895)

The previous text had the 'tests in the kv package can drive
eviction-trigger scenarios without adding a constructor option' line
twice — once in the opening paragraph and once in the
'Production code must use...' paragraph.  Harmless duplication;
collapsed into a single sentence in the second paragraph.

No code change.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Cosmetic nit (round-3 follow-up): deduped the doc-comment sentence "tests in the kv package can drive eviction-trigger scenarios without adding a constructor option" — was duplicated across two paragraphs of SetHistoryDepthForTest.

Doc-only; no code change. Claude round-3 confirmed all 7 findings ✅ Fixed and the table is fully green.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 49s —— View job


Code Review — Round 4 (doc-only dedup, commit fbd0d46)

  • Read distribution/engine.go and verify the dedup
  • Check git diff for scope of changes
  • Post review

Scope confirmed: single file (distribution/engine.go), 6 insertions / 9 deletions, no code change. The diff removes the duplicated sentence:

"tests in the kv package can drive eviction-trigger scenarios without adding a constructor option just for tests"

…which previously appeared twice (once at the tail of the first paragraph, once at the head of the Production code must use… paragraph), and collapses them into a single occurrence in the second paragraph. The resulting comment is:

  1. Non-duplicated ✅
  2. Lock-protection description still accurate (the write itself is lock-protected (e.mu.Lock below)) ✅
  3. depth <= 0 fail-fast rationale and shrink-panic explanation still present ✅
  4. Production-only usage warning still present ✅

No new findings. Finding table remains fully green from round 3:

Finding Status
coderabbit MAJOR: shardGroupID==0 bypass ✅ Fixed (round 2)
coderabbit minor: depth <= 0 panic ✅ Fixed (round 2)
gemini medium: shrink-panic in SetHistoryDepthForTest ✅ Fixed (round 2)
golangci/gci formatting ✅ Fixed (round 2)
Previous claude #1: misleading "no sync" comment ✅ Fixed (round 3)
Previous claude #2: missing happy-path test ✅ Fixed (round 3)
Gemini CRITICAL: no explicit ABORT Phase bypass ✅ Fixed (round 3)

Ready to merge once PR #894 (M2) lands.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (fbd0d46):

  • distribution/engine.go
  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@bootjp bootjp merged commit ed575bf into main Jun 1, 2026
11 checks passed
@bootjp bootjp deleted the feat/composed1-m3-verify-gate branch June 1, 2026 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant