kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels#895
Conversation
Third milestone of the Composed-1 cross-group commit-time guard per docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md §M3. Wires the safety property at apply time using the M2 versioned-snapshot ring and the M1 ObservedRouteVersion field. Stacked on PR #894 (M2 plumbing). What's enforced: (a) Observed-version owner — the spec-level Composed-1 check from tla/composed/Composed.tla. Every write key of a Composed-1- pinned txn must be owned by THIS Raft group at the txn's observed catalog version (the version it read its read-set at, set in M1 via OperationGroup.ObservedRouteVersion). Refusal returns ErrComposed1Violation. (b) Current-version cross-version-read fence — the §4.4 / §3 codex P1 trace. Even when (a) passes, a route shift between BeginTxn and Commit can leave the write landing on the OLD owner while readers at the new version route to the NEW owner and miss the write. The current-version fence refuses that case so M4 retry can re-route. Refusal returns ErrComposed1Violation with a different diagnostic prefix ("current-version" vs "observed-version") so the retry path can distinguish. (c) Retention-miss fail-closed — when the txn's observed version has been evicted from the M2 ring (long-running txn or high catalog churn), return ErrComposed1VersionGCd. Per design doc §4.3, the not-found semantics is a hard retryable error, NOT a soft pass, because soft-pass would let the gate be bypassed exactly in the cases where the cross-version-read hazard is most likely. Short-circuits cleanly in three legacy / not-applicable cases: * FSM constructed without WithRouteHistory (legacy / test seam): routes == nil, gate returns nil. * Request carries ObservedRouteVersion == 0 (unpinned — pre-M1 caller, or 2PC ABORT request that doesn't carry the version): gate returns nil. This is what keeps M3 from regressing every caller that has not migrated to set OperationGroup. ObservedRouteVersion. * Engine.Current returns (zero, false) — engine has no history (bare-struct test seam): the (b) fence falls through. Changes: * distribution/engine.go — adds Engine.Current() returning the snapshot at the current catalogVersion (used by the M3 fence). Adds SetHistoryDepthForTest as the cross-package test seam that lets kv-side tests trigger eviction without bypassing the package boundary (claude review on PR #894 — fragile-but- test-local lock contract documented inline). * kv/fsm.go — defines ErrComposed1Violation and ErrComposed1VersionGCd sentinels with the wrapped-error diagnostic shape M4 retry will use. Adds verifyComposed1 and verifyOwnerFromSnapshot. Wires verifyComposed1 into handleTxnRequest at the top so PREPARE / COMMIT / ABORT / NONE all pass through the gate (ABORT carries ObservedRouteVersion=0 and falls through naturally). * kv/route_history.go — RouteHistory interface gains Current(). Adapter forwards to distribution.Engine.Current(). Tests (kv/fsm_composed1_test.go): * TestVerifyComposed1_StaleObservedVersionWithMovedKeyFails — design doc M3 "done when" criterion (i): observed-version snapshot resolves the key to a different group than this FSM. * TestVerifyComposed1_ObservedVersionOlderThanRingFails — criterion (ii): observed version is outside the ring → ErrComposed1VersionGCd. Uses depth=2 via the cross-package SetHistoryDepthForTest seam. * TestVerifyComposed1_ObservedPassesButCurrentDiffersFails — criterion (iii) / the §3 codex P1 trace: observed-version check passes (routes[1][k1]=g1) but the current snapshot at v=2 has moved k1 to g2 → ErrComposed1Violation with the "current-version" diagnostic prefix. * TestVerifyComposed1_ObservedVersionZeroSkipsGate — legacy caller (ObservedRouteVersion=0) bypasses the gate. * TestVerifyComposed1_NilRouteHistorySkipsGate — unwired FSM (no WithRouteHistory option) bypasses the gate; matches the pre-feature posture. Verification: * go build ./... — clean * go vet ./... — clean * go test -race -count=1 ./kv ./distribution — 10.2 s + 1.0 s, pass Self-review (5 lenses): 1. Data loss — the gate refuses commits but cannot lose them. Every rejection emits a sentinel the M4 coordinator path will convert into a successful commit on the correct owner. 2. Concurrency — the gate runs under the FSM's apply loop (Raft applies are serial), so no new lock ordering. RouteHistory reads take the engine's read lock; SnapshotAt and Current each acquire and release in a single call. 3. Performance — two map lookups + a per-mutation linear scan of the snapshot's sorted routes per Composed-1-pinned commit. OwnerOf short-circuits via the M2 round-1 break optimisation, so the per-mutation cost is bounded by the first non-covering gap rather than the full routes slice. Legacy callers (ObservedRouteVersion=0) pay one branch per commit and exit. 4. Data consistency — closes the Composed-1 (observed) and Composed-1a (current) gaps end-to-end. Spec correspondence: tla/composed/Composed.tla's Commit precondition is now enforced at apply time; the §4.4 fence matches Composed1a_CommitToCurrentOwner from PR #878. 5. Test coverage — 5 new unit tests cover the three design doc criteria + the two legacy short-circuit paths. Existing handleTxnRequest tests pass unchanged (their requests carry ObservedRouteVersion=0 from M1's behaviour-neutral default). Next milestone (separate PR per design doc §6): M4 — coordinator retry path. When either Composed-1 sentinel returns, the coordinator re-reads the route cache, re-routes the txn against the new owning group, and re-issues it once.
|
@claude review |
|
Warning Review limit reached
More reviews will be available in 42 minutes and 31 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughAdds Engine.Current() and SetHistoryDepthForTest(), extends RouteHistory with Current(), implements the composed-1 verification (verifyComposed1) in kv FSM with two retryable errors, and adds tests covering observed/current checks and short-circuit cases. ChangesComposed-1 Gate for Transaction Consistency
Sequence DiagramsequenceDiagram
participant Client
participant FSM as kvFSM
participant Gate as verifyComposed1
participant Observed as RouteHistory(SnapshotAt)
participant Current as RouteHistory(Current)
Client->>FSM: Txn with ObservedRouteVersion
FSM->>Gate: verifyComposed1(request)
Gate->>Observed: SnapshotAt(ObservedVersion)
alt Observed version evicted
Observed-->>Gate: nil, false
Gate-->>FSM: ErrComposed1VersionGCd
else Observed version found
Observed-->>Gate: RouteSnapshot
Gate->>Gate: Check key owners vs shardGroupID
alt Owner mismatch at observed
Gate-->>FSM: ErrComposed1Violation (observed-version)
else Owner match at observed
Gate->>Current: Current()
alt Current not available
Current-->>Gate: nil, false
Gate-->>FSM: nil
else Current available
Current-->>Gate: RouteSnapshot
Gate->>Gate: Check key owners vs shardGroupID
alt Owner mismatch at current
Gate-->>FSM: ErrComposed1Violation (current-version)
else All checks pass
Gate-->>FSM: nil
FSM->>FSM: Proceed to phase dispatch
end
end
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (c94e0d1):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 3m 24s —— View job Code Review — kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels
Overall the implementation is solid. The gate logic matches the TLA+ spec (§4.2(a)) and the cross-version fence (§4.4) is correctly wired. No data-loss paths, no lock-ordering issues, no safety bugs found. A few issues worth addressing are called out below in order of severity. Issues1. Misleading comment in
|
| Lens | Result |
|---|---|
| Data loss | Clean — gate can only reject, never silently drop committed writes |
| Concurrency | Clean — gate runs under the serial Raft apply loop; SnapshotAt/Current acquire and release e.mu.RLock atomically |
| Performance | Two map lookups + O( |
| Data consistency | Spec correspondence correct; write-set only check matches Composed1_CommitToOwningGroup; fail-closed on ring miss is correct per §4.3 |
| Test coverage | 5 failure/short-circuit tests present; missing happy-path test (see issue 2 above); no multi-key test |
There was a problem hiding this comment.
Code Review
This pull request implements the apply-time Composed-1 gate to prevent cross-version-read anomalies during route shifts. It introduces the Current() method to retrieve the latest route catalog snapshot, defines ErrComposed1Violation and ErrComposed1VersionGCd errors, and adds verification logic in kvFSM.handleTxnRequest to validate key ownership against both the observed and current catalog versions. Comprehensive unit tests are also added. The reviewer identified two critical issues: first, ABORT requests must bypass the Composed-1 gate to avoid leaking MVCC locks, and a short-circuit is needed when shardGroupID is unset (0); second, reducing history depth via SetHistoryDepthForTest on an already populated engine can cause a runtime panic unless excess history entries are immediately truncated.
| func (f *kvFSM) verifyComposed1(r *pb.Request) error { | ||
| if f.routes == nil { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
There are two critical issues here:
- ABORT requests must not be blocked by the Composed-1 gate: If an
ABORTrequest is rejected due to a route shift (ErrComposed1Violation) or history eviction (ErrComposed1VersionGCd), the transaction's locks and intents will be permanently leaked in the MVCC store, blocking future writes. Aborts should always bypass the gate to ensure successful cleanup. - Missing short-circuit for unset shardGroupID: The comments state that a
shardGroupIDof0represents an unset/legacy state and should short-circuit the gate. However, the code does not check forf.shardGroupID == 0, which will cause spuriousErrComposed1Violationerrors on unset groups.
func (f *kvFSM) verifyComposed1(r *pb.Request) error {
if r.Phase == pb.Phase_ABORT {
return nil
}
if f.routes == nil || f.shardGroupID == 0 {
return nil
}| func (e *Engine) SetHistoryDepthForTest(depth int) { | ||
| e.mu.Lock() | ||
| defer e.mu.Unlock() | ||
| e.historyDepth = depth | ||
| } |
There was a problem hiding this comment.
If SetHistoryDepthForTest is called to reduce the history depth after the history ring has already been populated with more entries than the new depth, a subsequent call to recordHistorySnapshotLocked will trigger a runtime panic. This is because make([]uint64, len(e.historyOrder)-1, e.historyDepth) will be invoked with a length greater than its capacity. To prevent this, truncate the excess history entries immediately when updating the depth.
| func (e *Engine) SetHistoryDepthForTest(depth int) { | |
| e.mu.Lock() | |
| defer e.mu.Unlock() | |
| e.historyDepth = depth | |
| } | |
| func (e *Engine) SetHistoryDepthForTest(depth int) { | |
| e.mu.Lock() | |
| defer e.mu.Unlock() | |
| e.historyDepth = depth | |
| if len(e.historyOrder) > depth { | |
| excess := len(e.historyOrder) - depth | |
| for i := 0; i < excess; i++ { | |
| delete(e.history, e.historyOrder[i]) | |
| } | |
| retained := make([]uint64, depth) | |
| copy(retained, e.historyOrder[excess:]) | |
| e.historyOrder = retained | |
| } | |
| } |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
kv/fsm_composed1_test.go (1)
32-35: ⚡ Quick winDrop the new
//nolint:unparamand refactor the helper instead.This suppression is avoidable in new test code: either bake in the single currently used group ID, or make the group ID vary through actual table cases when you need that coverage.
As per coding guidelines, "Avoid
//nolintannotations — refactor instead."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kv/fsm_composed1_test.go` around lines 32 - 35, Remove the //nolint:unparam annotation and refactor newComposed1FSM to stop accepting the unused shardGroupID parameter: either bake the single used group ID (1) directly into newComposed1FSM (replace the shardGroupID parameter with the constant 1 inside the function) and update all call sites to call newComposed1FSM(t, e) accordingly, or convert the tests that need varying group IDs into table-driven cases that construct the FSM with different explicit group IDs (so the parameter becomes used). Ensure all references to function newComposed1FSM are updated and delete the //nolint comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@distribution/engine.go`:
- Around line 225-229: SetHistoryDepthForTest should validate the incoming depth
and reject non-positive values to fail fast; inside
Engine.SetHistoryDepthForTest (protected by e.mu), add a check like if depth <=
0 { panic("SetHistoryDepthForTest: depth must be > 0") } (or return an
error/report) before assigning e.historyDepth so recordHistorySnapshotLocked and
the eviction path never encounter an empty historyOrder; update any tests that
call SetHistoryDepthForTest to provide positive depths.
In `@kv/fsm.go`:
- Around line 582-589: verifyComposed1 currently only short-circuits when
f.routes == nil but the documented behavior says shardGroupID == 0 should also
bypass; update kvFSM.verifyComposed1 to return nil immediately when
f.shardGroupID == 0 (before any route/version checks) so callers that install
routes before setting shardGroupID preserve the legacy skip-for-group-0
behavior; keep the existing observedVer handling unchanged.
---
Nitpick comments:
In `@kv/fsm_composed1_test.go`:
- Around line 32-35: Remove the //nolint:unparam annotation and refactor
newComposed1FSM to stop accepting the unused shardGroupID parameter: either bake
the single used group ID (1) directly into newComposed1FSM (replace the
shardGroupID parameter with the constant 1 inside the function) and update all
call sites to call newComposed1FSM(t, e) accordingly, or convert the tests that
need varying group IDs into table-driven cases that construct the FSM with
different explicit group IDs (so the parameter becomes used). Ensure all
references to function newComposed1FSM are updated and delete the //nolint
comment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: af79a132-4d95-4f0c-b28e-0e8008f5f03d
📒 Files selected for processing (4)
distribution/engine.gokv/fsm.gokv/fsm_composed1_test.gokv/route_history.go
…seam (reviews on PR #895) Three findings from PR #895's automated review pass, all addressed: * coderabbit MAJOR — verifyComposed1 only checked `routes == nil` for the bypass, leaving a partially-wired FSM (routes installed before group ID, or a test fixture that passed engine but shardGroupID=0) to reject every pinned txn with ErrComposed1Violation. The doc comment on kvFSM.shardGroupID says "Zero ... means 'unset' — the M3 gate will short-circuit (matches the pre-feature behaviour)"; the code now honours that contract too. Fix: bypass when EITHER routes==nil OR shardGroupID==0. Caller audit (the loop directive's semantic-change check): only handleTxnRequest calls verifyComposed1, and it treats nil as success — bypass-as-nil preserves the caller's semantics. No further wiring change. * coderabbit minor — SetHistoryDepthForTest accepted depth<=0, which would later surface as an index-out-of-range deep in recordHistorySnapshotLocked's eviction path (historyOrder[0] on an empty slice). Now panics at the seam with a clear message so the misconfigured test fails fast. * gemini medium — Shrinking historyDepth below the current ring size left len(historyOrder) > historyDepth, and the next recordHistorySnapshotLocked would `make([]uint64, len-1, historyDepth)` with len-1 > historyDepth → panic in apply. Now SetHistoryDepthForTest evicts the excess oldest entries and compacts the slice immediately under the write lock. * golangci/reviewdog formatting (gci on kv/fsm.go and kv/fsm_composed1_test.go) — auto-fixed by `make lint`. Regression test (added before the fix per the project convention): * TestVerifyComposed1_ShardGroupIDZeroSkipsGate — builds an engine where the route is owned by group 5 and constructs an FSM with shardGroupID=0; pre-fix the gate rejects the pinned commit (OwnerOf returns 5 ≠ this FSM's 0), post-fix the gate short-circuits via the new condition. Verified failing on the parent commit and passing on this commit. Verification: go test -race -count=1 ./kv ./distribution → pass. make lint → 0 issues.
|
All 4 findings addressed: coderabbit MAJOR (kv/fsm.go:589 — shardGroupID==0 bypass) — verifyComposed1 now honors both coderabbit minor (distribution/engine.go:229 — reject depth<=0) — SetHistoryDepthForTest now panics at the seam with a clear message. gemini medium (distribution/engine.go:229 — shrink panic) — SetHistoryDepthForTest now evicts excess entries and compacts the slice when shrinking depth below current ring size. golangci/reviewdog (gci formatting on kv/fsm.go and kv/fsm_composed1_test.go) — auto-fixed by
@claude review |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (6108626):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Claude finished @bootjp's task in 3m 15s —— View job Code Review — Follow-up (commit 6108626 fixes)
All 4 listed fixes are correct. Two items from the previous round and one from gemini remain open. Confirmed correct
Shrink truncation ( gci formatting — The Remaining: two open items from the previous claude review1. Misleading "no internal synchronisation" comment — still present
// SetHistoryDepthForTest overrides the FIFO ring depth from outside
// the package. Test-only — callers MUST set the depth before the
// Engine is shared with any concurrent reader (no internal
// synchronisation here …This comment was already contradicted by the pre-existing Suggested fix: replace "no internal synchronisation here for the same reason TestEngineSnapshotAt_FIFOEviction does the direct field write in-package" with "callers should set depth before sharing the Engine with concurrent Snapshots to avoid interleaving surprises, but the write itself is lock-protected". Fix this → 2. Missing happy-path test — still absent
Remaining: gemini's ABORT-invariant concern (not in the 4 addressed)Gemini flagged this as CRITICAL. The current protection for ABORT is that all ABORT construction sites ( This is safe today, but it is an undocumented protocol invariant — nothing in the type system or a test enforces that ABORT must carry Gemini's suggestion — add Nit:
|
| Finding | Status |
|---|---|
coderabbit MAJOR: shardGroupID==0 bypass |
✅ Fixed + regression test |
coderabbit minor: depth <= 0 panic |
✅ Fixed |
gemini medium: shrink-panic in SetHistoryDepthForTest |
✅ Fixed |
| golangci/gci formatting | ✅ Fixed |
| Previous claude #1: misleading "no sync" comment | |
| Previous claude #2: missing happy-path test | |
| Gemini CRITICAL: no explicit ABORT Phase bypass |
The three remaining issues are all non-blocking for M3's correctness (the gate logic itself is correct; the ABORT path is safe today). Happy to approve once the happy-path test is in — that's the one change that gives the gate meaningful coverage in both the pass and fail directions.
|
To use Codex here, create a Codex account and connect to github. |
…#895) Three remaining items from claude's round-2 follow-up review + the gemini HIGH-priority finding that was missed in round 2: * gemini HIGH (kv/fsm.go) — ABORT requests MUST bypass the Composed-1 gate. If a route shifted between PREPARE and ABORT (or the observed version was evicted), a rejected ABORT would leave the txn's intent locks pinned in MVCC until LockResolver's TTL — minutes of write-blocked keys for what should be a one-RPC cleanup. Production callers (transaction.go, sharded_coordinator.go) happen to leave ObservedRouteVersion=0 on ABORT requests, so today the existing observedVer==0 short-circuit already protects them. But that is enforcement-by-convention; a future ABORT construction site that copies a COMMIT struct by mistake would silently start failing aborts. Fix: add `if r.GetPhase() == pb.Phase_ABORT { return nil }` as the FIRST guard in verifyComposed1, before the routes/shardGroupID check. Enforcement-in-code rather than enforcement-by-convention. Caller audit (the loop directive's semantic-change check): verifyComposed1 is only called from handleTxnRequest, which treats nil as success. The ABORT bypass returns nil, so the caller's semantics are preserved. No further wiring change. * claude follow-up #1 (distribution/engine.go) — The SetHistoryDepthForTest doc comment said "no internal synchronisation here" even though the function acquired e.mu.Lock. Future readers would see the contradiction and not know which was authoritative. Rewrote the comment to describe the actual contract: the write is lock-protected, but callers should still set depth before sharing the Engine with concurrent SnapshotAt/Current readers to avoid interleaving surprises around the eviction watermark. * claude follow-up #2 (kv/fsm_composed1_test.go) — The gate suite had no happy-path test. An off-by-one in OwnerOf, a reversed comparison, or a sign-flip in the group-ID match would silently false-reject every valid commit and the failure-only suite would not catch it. Added TestVerifyComposed1_ValidOwnershipPassesGate: this FSM serves the group that owns the key at BOTH the observed version AND the current catalog version, so the gate must return nil. * claude follow-up nit (kv/fsm_composed1_test.go) — The `//nolint:unparam` annotation on newComposed1FSM is now redundant: the new ABORT and shardGroupID=0 tests pass distinct shardGroupID values, so golangci-lint no longer flags the parameter as constant. Removed (verified with `make lint` → 0 issues). Also added TestVerifyComposed1_AbortPhaseSkipsGate as the regression for the ABORT bypass: deliberately sets ObservedRouteVersion=1 (non-zero) on the ABORT to defeat the existing observedVer==0 short-circuit and isolate the new Phase_ABORT guard. Verification: go test -race -count=1 ./kv ./distribution → pass. make lint → 0 issues.
|
Round-3: all remaining items addressed. gemini HIGH (kv/fsm.go — ABORT bypass): Added claude follow-up #1 (SetHistoryDepthForTest doc comment): Rewrote to remove the contradictory "no internal synchronisation" wording; comment now correctly states the write is lock-protected but callers should still set depth before sharing the Engine with concurrent readers to avoid interleaving surprises. claude follow-up #2 (happy-path test): Added ABORT regression test: Nit: removed the now-redundant
@claude review |
|
Claude finished @bootjp's task in 1m 50s —— View job Code Review — Round 3 (commit 48bbcc4)
All five round-3 items are correct. The remaining finding-tracking table is fully green. Confirmed correctABORT Phase bypass (
This is the correct layering — an ABORT with a fabricated non-zero SetHistoryDepthForTest doc comment ( Happy-path test ( ABORT regression test (
One cosmetic nit (non-blocking)
Finding-tracking table
No open findings. Gate logic matches the TLA+ spec ( |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (48bbcc4):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
…-3 nit on PR #895) The previous text had the 'tests in the kv package can drive eviction-trigger scenarios without adding a constructor option' line twice — once in the opening paragraph and once in the 'Production code must use...' paragraph. Harmless duplication; collapsed into a single sentence in the second paragraph. No code change.
|
Cosmetic nit (round-3 follow-up): deduped the doc-comment sentence "tests in the kv package can drive eviction-trigger scenarios without adding a constructor option" — was duplicated across two paragraphs of Doc-only; no code change. Claude round-3 confirmed all 7 findings ✅ Fixed and the table is fully green. @claude review |
|
Claude finished @bootjp's task in 49s —— View job Code Review — Round 4 (doc-only dedup, commit fbd0d46)
Scope confirmed: single file (
…which previously appeared twice (once at the tail of the first paragraph, once at the head of the
No new findings. Finding table remains fully green from round 3:
Ready to merge once PR #894 (M2) lands. |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (fbd0d46):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
Summary
Third milestone (M3) of the Composed-1 cross-group commit-time guard per
docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md§M3.Wires the safety property at apply time using the M2 versioned-snapshot ring and the M1 ObservedRouteVersion field.
Stacked on PR #894 (M2 plumbing).
What's enforced
tla/composed/Composed.tla. Every write key of a Composed-1-pinned txn must be owned by THIS Raft group at the txn's observed catalog version. Refusal returnsErrComposed1Violation.ErrComposed1Violationwith a different diagnostic prefix (current-versionvsobserved-version) so the M4 retry path can distinguish.ErrComposed1VersionGCd. Per design doc §4.3, the not-found semantics is a hard retryable error, NOT a soft pass, because soft-pass would let the gate be bypassed exactly in the cases where the cross-version-read hazard is most likely.Short-circuits
Three legacy / not-applicable cases that the gate explicitly bypasses:
WithRouteHistory(legacy / test seam):routes == nil, gate returns nil.ObservedRouteVersion == 0(unpinned — pre-M1 caller, or 2PC ABORT request that doesn't carry the version): gate returns nil. This is what keeps M3 from regressing every caller that has not migrated to setOperationGroup.ObservedRouteVersion.Engine.Currentreturns(zero, false)— engine has no history (bare-struct test seam): the (b) fence falls through.Changes
distribution/engine.go— addsEngine.Current()returning the snapshot at the currentcatalogVersion. AddsSetHistoryDepthForTestas the cross-package test seam that lets kv-side tests trigger eviction without bypassing the package boundary (claude review on PR kv(composed1): M2 — versioned-snapshot ring + kvFSM RouteHistory wiring #894 — fragile-but-test-local lock contract documented inline).kv/fsm.go— definesErrComposed1ViolationandErrComposed1VersionGCdsentinels with the wrapped-error diagnostic shape M4 retry will use. AddsverifyComposed1andverifyOwnerFromSnapshot. WiresverifyComposed1intohandleTxnRequestat the top so PREPARE / COMMIT / ABORT / NONE all pass through the gate (ABORT carriesObservedRouteVersion=0and falls through naturally).kv/route_history.go—RouteHistoryinterface gainsCurrent(). Adapter forwards todistribution.Engine.Current().Tests (
kv/fsm_composed1_test.go)TestVerifyComposed1_StaleObservedVersionWithMovedKeyFailsTestVerifyComposed1_ObservedVersionOlderThanRingFailsErrComposed1VersionGCdTestVerifyComposed1_ObservedPassesButCurrentDiffersFailsErrComposed1Violation(current-versiondiagnostic)TestVerifyComposed1_ObservedVersionZeroSkipsGateTestVerifyComposed1_NilRouteHistorySkipsGateVerification
go build ./...— cleango vet ./...— cleango test -race -count=1 ./kv ./distribution— 10.2 s + 1.0 s, passmake lint— 0 issuesSelf-review (5 lenses)
SnapshotAtandCurrenteach acquire and release in a single call.OwnerOfshort-circuits via the M2 round-1 break optimisation, so the per-mutation cost is bounded by the first non-covering gap rather than the full routes slice. Legacy callers (ObservedRouteVersion=0) pay one branch per commit and exit.tla/composed/Composed.tla'sCommitprecondition is now enforced at apply time; the §4.4 fence matchesComposed1a_CommitToCurrentOwnerfrom PR tla: Composed-1a current-version fence + currentfence_gap (#870 follow-up) #878.handleTxnRequesttests pass unchanged (their requests carryObservedRouteVersion=0from M1's behaviour-neutral default).Test plan
go test -race ./kv ./distributionmake lintResolves
The M3 row in the Composed-1 design doc.
Summary by CodeRabbit
Bug Fixes
Tests