kv(composed1): M4 — Dispatch retry on ErrComposed1Violation / ErrComposed1VersionGCd#900
kv(composed1): M4 — Dispatch retry on ErrComposed1Violation / ErrComposed1VersionGCd#900bootjp wants to merge 10 commits into
Conversation
…osed1VersionGCd
Fourth milestone of the Composed-1 cross-group commit-time guard per
docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md
§M4. Makes the M3 apply-time gate non-disruptive to clients: instead
of surfacing ErrComposed1Violation or ErrComposed1VersionGCd back to
the adapter (and thus the application), the coordinator absorbs the
sentinel, re-reads the catalog version, and re-issues the txn once
against the post-shift owning group.
Two changes:
* Pin reqs.ObservedRouteVersion at Dispatch entry. ShardedCoordinator
.Dispatch now stamps the engine's current catalog version onto
every transactional OperationGroup that arrived with the zero
sentinel. This makes the M3 gate effective by default — every
txn flowing through the sharded coordinator is now gate-eligible
without each adapter rediscovering the design doc §4.1 contract.
A caller that supplied a non-zero ObservedRouteVersion (a future
adapter migration target managing the version itself) still wins
over the Dispatch-entry pin.
* Retry-once on Composed-1 sentinels. dispatchTxnWithComposed1Retry
wraps the existing single-attempt dispatchTxn in a loop bounded
at composed1RetryAttempts = 1 (initial + 1 retry = 2 attempts).
On the sentinel:
- re-read engine.Version() and stamp the fresh value on the
OperationGroup
- clear StartTS / CommitTS so the next attempt allocates a
fresh HLC-correct pair via nextStartTS (the OCC invariants
require commitTS > startTS under the same HLC observation;
reusing the stale pair would risk a startTS that no longer
dominates the post-shift shard's applied commitTS)
- groupMutations on the next dispatchTxn pass re-runs
router.ResolveGroup, so a key whose owning group changed
since the last attempt naturally lands on the new group's
FSM
Beyond one retry the sentinel surfaces unchanged so a wrapping
adapter retry harness (or the client) can decide whether to
keep trying — preventing the coordinator from spinning when
catalog churn outpaces the dispatch.
Caller audit (the loop directive's semantic-change check):
ShardedCoordinator.Dispatch is the entry point every adapter calls.
The change SURFACES SUCCESS more often (turning a previous error
into a successful commit) and surfaces the sentinel after a bounded
retry budget rather than immediately. No adapter relies on a
faster-than-bounded surfacing of these sentinels (M3 only landed in
this session and adapters did not migrate), so the relaxed
surfacing is safe. Verified: grep -rn 'ErrComposed1' adapter/
returns no hits.
Tests (kv/sharded_coordinator_composed1_retry_test.go):
* TestShardedCoordinator_RetriesOnComposed1Violation —
happy-path: first attempt returns Composed1Violation, second
succeeds; caller sees success.
* TestShardedCoordinator_RetriesOnComposed1VersionGCd — same
retry path drives both sentinels.
* TestShardedCoordinator_PersistentComposed1ViolationSurfaces —
retry budget bounds the loop: persistent failure surfaces the
sentinel after exactly 2 total Commit calls.
* TestShardedCoordinator_PinsObservedRouteVersionAtDispatchEntry
— Dispatch pins reqs.ObservedRouteVersion=engine.Version()
when caller leaves it at zero.
* TestShardedCoordinator_HonorsCallerPinnedObservedRouteVersion
— caller-supplied non-zero ObservedRouteVersion survives the
Dispatch-entry pin (§4.1 contract preserved).
Verification:
* go build ./... — clean
* go test -race -count=1 ./kv — 10.3s, pass
* make lint — 0 issues
Self-review (5 lenses):
1. Data loss — the retry can only convert a previously-rejected
commit into a successful one OR surface the same error after
a bounded loop. Cannot lose a committed write. Cannot
duplicate a write because each attempt's commitTS is fresh
and the FSM's OCC validation already rejects duplicates.
2. Concurrency — no new locks. Retry runs serially on the
dispatcher goroutine; engine.Version() takes the existing
read lock per call. No new lock ordering.
3. Performance — for the common case (no sentinel) the retry
loop is a single iteration with one extra branch + one
engine.Version() read at entry. For the sentinel case the
cost is one extra dispatchTxn round-trip — design doc says
this is acceptable; the alternative is a client-visible
error.
4. Data consistency — closes the §M4 row of the design doc.
With M1 (plumbing), M2 (ring), M3 (gate), and M4 (retry) all
landed, a MoveRange / cross-group SplitRange between
BeginTxn and Commit is now transparent to the client: the
coordinator absorbs the gate's rejection, re-routes, and
succeeds on the new owner.
5. Test coverage — 5 new unit tests cover both sentinels,
the retry budget, the Dispatch-entry pin, and the caller-pin
precedence. Existing dispatchTxn tests pass unchanged
(their requests carry ObservedRouteVersion=0 but the test
coordinator's engine has no routes beyond v=0, so the gate
short-circuits exactly as before).
Next milestone: M5 — Jepsen workload (route-shuffle generator for
the DynamoDB suite, find no G1c anomaly under concurrent
MoveRange). Separate PR per the design doc §6 milestone breakdown.
|
@claude review |
|
Warning Review limit reached
More reviews will be available in 34 minutes and 23 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds a bounded single-retry Composed-1 loop to ShardedCoordinator.Dispatch: on composed-1 sentinel errors it refreshes ObservedRouteVersion, clears/re-allocates timestamps, and retries the transaction once. Also stops propagating ObservedRouteVersion into 2PC commit-phase secondary requests. Tests validate retry budgeting, PrevCommitTS preservation, pinning and StartTS rules, read-write exceptions, and the 2PC secondary drop. ChangesComposed-1 Retry Implementation and Testing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (3cf4e90):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Claude finished @bootjp's task in 4m 11s —— View job PR Review: kv(composed1): M4 — Dispatch retry on ErrComposed1Violation / ErrComposed1VersionGCd
The M4 design is sound and the implementation is largely clean. One correctness gap, two minor issues, and a couple of nits.
|
|
To use Codex here, create a Codex account and connect to github. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3cf4e9069d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| reqs.StartTS = 0 | ||
| reqs.CommitTS = 0 |
There was a problem hiding this comment.
Preserve the original read timestamp on route retry
When a transaction has already read data (for example Redis MULTI/EXEC populates ReadKeys and returns values from OperationGroup.StartTS), this retry path clears the original read timestamp and allocates a fresh one before re-issuing the same write set. The FSM/store OCC check rejects only versions with latest > startTS, so any concurrent write that lands after the client's original read snapshot but before this fresh retry timestamp is no longer detected; the retry can commit decisions based on stale reads after a Composed-1 route-shift error. The retry needs to keep the transaction's original read timestamp for OCC validation (or force the caller to recompute the transaction) rather than silently advancing it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request implements an M4 Composed-1 retry loop in ShardedCoordinator to handle transaction routing shifts by re-reading the catalog version and re-issuing the transaction once. The review feedback highlights two critical issues: first, transparently retrying transactions with a bumped start timestamp is unsafe if the transaction contains read keys, as it can lead to stale reads and serializability violations; second, the coordinator should unconditionally allocate a new start timestamp on retry instead of checking for a non-nil clock, as leaving the timestamp at zero violates MVCC invariants.
| if !isComposed1RetryableError(err) { | ||
| return resp, err | ||
| } | ||
| if attempt == composed1RetryAttempts { | ||
| return resp, err | ||
| } |
There was a problem hiding this comment.
Transparently retrying a transaction with a bumped StartTS is only safe if the transaction is write-only (len(reqs.ReadKeys) == 0).
If the transaction has read keys, the client/adapter performed those reads at the original StartTS. If the coordinator transparently bumps StartTS to a newer timestamp on retry, any concurrent writes to those read keys that committed between the old StartTS and the new StartTS will be silently ignored by the FSM's OCC validation (since the FSM only checks for conflicts after the new StartTS). This allows the transaction to commit using stale read values, violating strict serializability (SSI) and introducing G1c/write-skew anomalies.
Per the general rule:
In transactional read-modify-write operations, set the transaction's start timestamp to the read timestamp to ensure a consistent snapshot and prevent race conditions.
We must propagate the ErrComposed1Violation or ErrComposed1VersionGCd error back to the caller if len(reqs.ReadKeys) > 0 so that the entire transaction can be retried from the beginning (including re-reading the keys at a fresh timestamp).
| if !isComposed1RetryableError(err) { | |
| return resp, err | |
| } | |
| if attempt == composed1RetryAttempts { | |
| return resp, err | |
| } | |
| if !isComposed1RetryableError(err) { | |
| return resp, err | |
| } | |
| if len(reqs.ReadKeys) > 0 { | |
| return resp, err | |
| } | |
| if attempt == composed1RetryAttempts { | |
| return resp, err | |
| } |
References
- In transactional read-modify-write operations, set the transaction's start timestamp to the read timestamp to ensure a consistent snapshot and prevent race conditions.
| if c.clock != nil { | ||
| startTS, allocErr := c.nextStartTS(ctx, reqs.Elems) | ||
| if allocErr != nil { | ||
| return resp, allocErr | ||
| } | ||
| reqs.StartTS = startTS | ||
| } |
There was a problem hiding this comment.
The check if c.clock != nil prevents allocating a new StartTS on retry when c.clock is nil. However, c.nextStartTS is fully capable of handling a nil clock (it falls back to maxTS + 1 at line 1031). Leaving reqs.StartTS at 0 on retry is incorrect because dispatchTxn does not allocate a StartTS if it is 0, which would cause the retried transaction to run with StartTS = 0 (violating MVCC/timestamp invariants). We should call c.nextStartTS unconditionally, just like Dispatch does at line 450.
startTS, allocErr := c.nextStartTS(ctx, reqs.Elems)
if allocErr != nil {
return resp, allocErr
}
reqs.StartTS = startTSReferences
- In an MVCC system, all operations within a single transaction must share the same start timestamp to ensure atomicity and a consistent snapshot view.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
kv/sharded_coordinator_composed1_retry_test.go (1)
68-74: 💤 Low valueRemove the empty goroutine or convert to a regular comment block.
This goroutine body contains only comments and executes nothing. If it's a placeholder for future work, consider adding a TODO or removing it entirely to avoid confusion.
Suggested fix (remove empty goroutine)
- // Between the first and second dispatch attempts, simulate the - // route catalog moving forward — M4's retry path must re-read - // the engine version and stamp the new one on the retry's - // pb.Request so the FSM sees the post-shift observed version. - go func() { - // The retry happens synchronously inside Dispatch; advance the - // catalog before the second attempt would re-read it. In a - // real cluster the advance is driven by a concurrent - // CatalogWatcher fan-out; here we just race against the - // in-process Commit to model the same observable effect. - }() + // NOTE: In a real cluster the route catalog would advance between + // the first and second dispatch attempts via a concurrent + // CatalogWatcher fan-out. This test verifies the retry path + // re-reads engine.Version(); the assertions below confirm both + // attempts record an ObservedRouteVersion (even when unchanged).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kv/sharded_coordinator_composed1_retry_test.go` around lines 68 - 74, Remove the no-op goroutine in the test (the go func() { /* ...comments... */ }() block) because it executes nothing; either delete the entire goroutine invocation or replace it with a regular comment/TODO block so intent is clear and the test contains no empty concurrent construct. Locate the empty anonymous goroutine in kv/sharded_coordinator_composed1_retry_test.go and update it accordingly to eliminate confusion.kv/sharded_coordinator.go (1)
469-531: 💤 Low valueOn allocation failure, return
nilinstead of the stale response from the prior failed attempt.At line 523, when
nextStartTSfails,respis from the previousdispatchTxncall that failed with a Composed-1 sentinel. Returning this stale response alongside an unrelated error may confuse callers inspecting both values.Suggested fix
if c.clock != nil { startTS, allocErr := c.nextStartTS(ctx, reqs.Elems) if allocErr != nil { - return resp, allocErr + return nil, allocErr } reqs.StartTS = startTS }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kv/sharded_coordinator.go` around lines 469 - 531, In dispatchTxnWithComposed1Retry, when nextStartTS returns an error (allocErr) you must not return the stale resp from the prior dispatchTxn; instead return nil and the allocation error. Update the error path inside the retry loop (the block that calls c.nextStartTS) to return (nil, allocErr) rather than (resp, allocErr). This touches dispatchTxnWithComposed1Retry and the variables resp, allocErr and ensures callers don't receive a stale CoordinateResponse alongside an unrelated allocation error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@kv/sharded_coordinator_composed1_retry_test.go`:
- Around line 68-74: Remove the no-op goroutine in the test (the go func() { /*
...comments... */ }() block) because it executes nothing; either delete the
entire goroutine invocation or replace it with a regular comment/TODO block so
intent is clear and the test contains no empty concurrent construct. Locate the
empty anonymous goroutine in kv/sharded_coordinator_composed1_retry_test.go and
update it accordingly to eliminate confusion.
In `@kv/sharded_coordinator.go`:
- Around line 469-531: In dispatchTxnWithComposed1Retry, when nextStartTS
returns an error (allocErr) you must not return the stale resp from the prior
dispatchTxn; instead return nil and the allocation error. Update the error path
inside the retry loop (the block that calls c.nextStartTS) to return (nil,
allocErr) rather than (resp, allocErr). This touches
dispatchTxnWithComposed1Retry and the variables resp, allocErr and ensures
callers don't receive a stale CoordinateResponse alongside an unrelated
allocation error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1f9d7b48-3ca0-4a71-a58f-fa7c79746643
📒 Files selected for processing (2)
kv/sharded_coordinator.gokv/sharded_coordinator_composed1_retry_test.go
… (gemini CRITICAL/HIGH + codex P1 on PR #900) Two converging findings from PR #900's automated review pass, both correctly identifying that the M4 transparent retry can introduce strict-serialisability violations: * gemini CRITICAL + codex P1 — Transparently retrying a txn with a bumped StartTS is SAFE only when the txn is write-only. When the txn has read keys (Redis MULTI/EXEC, DynamoDB TransactGetItems → TransactWrite, etc.), the client already performed those reads at the original StartTS. Bumping StartTS on retry would let any concurrent write that committed between the old and new timestamps go undetected by the FSM's OCC check (which only rejects versions with `latest > startTS`), allowing the retry to commit decisions based on stale reads — an SSI / G1c violation. Fix: surface the sentinel WITHOUT retry when `len(reqs.ReadKeys) > 0`. The caller / adapter is then responsible for re-executing the entire txn (including re-reading the keys at a fresh timestamp). Write-only txns (no reads to invalidate) remain safe to retry — the M3-to-M5 plumbing is still client-transparent for the most common cross-group SplitRange / MoveRange case (operator-frequency catalog churn, write-only adapter patterns). * gemini HIGH — The retry path gated StartTS allocation on `if c.clock != nil`, but `nextStartTS` already handles a nil clock internally (falls back to `maxTS+1`). Leaving `reqs.StartTS = 0` on retry was incorrect because `dispatchTxn` does NOT re-allocate when StartTS is zero — the retry would run with StartTS=0, violating MVCC invariants in the nil-clock test fixture (and any future nil-clock test seam that exercises the retry path). Fix: allocate StartTS unconditionally on retry via `nextStartTS`. Matches the unconditional call pattern Dispatch itself uses at line 450 of the same file. Caller audit (the loop directive's semantic-change check): the behaviour change is that ErrComposed1Violation / ErrComposed1VersionGCd now surfaces immediately for read-write txns instead of after a one-attempt retry. `grep -rn 'ErrComposed1' adapter/` returns no hits — no adapter currently distinguishes these sentinels (M3 only landed yesterday and adapters did not migrate to read them), so the earlier surfacing is semantics-safe. When an adapter does start matching on these sentinels, the new behaviour is what it would want (the entire txn needs re-execution, not a coordinator-side bump). Regression test (added before the fix per the project convention): * TestShardedCoordinator_DoesNotRetryReadWriteTxnOnComposed1 — constructs a txn with `ReadKeys` non-empty, stubs the transactional Commit to fail with ErrComposed1Violation, asserts: 1. the sentinel surfaces immediately (errors.Is check) 2. exactly 1 Commit attempt fires (no retry) Verified failing on the parent commit (the test received 2 Commit calls because the previous code retried) and passing on this commit. Verification: go test -race -count=1 ./kv → pass. make lint → 0 issues.
|
Both critical-class findings addressed: gemini CRITICAL / codex P1 (line 503/519 — bumped StartTS breaks SSI for read-write txns): Now surfaces gemini HIGH (line 526 — nil-clock guard left StartTS=0): Caller audit (loop directive): Regression test (added before fix per convention):
@claude review |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (66ad5b0):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 2m 30s —— View job PR Review: Follow-up fixes (gemini CRITICAL/HIGH + codex P1)
The two critical/high fixes are correct. Two items from the prior review remain open. ✅ gemini CRITICAL / codex P1 — correct
✅ gemini HIGH — correctThe
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66ad5b0e5c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if c.engine != nil && reqs.ObservedRouteVersion == 0 { | ||
| reqs.ObservedRouteVersion = c.engine.Version() |
There was a problem hiding this comment.
Pin observed routes at the transaction read time
When a caller already supplies a transaction StartTS/ReadKeys but leaves ObservedRouteVersion unset, this stamps the catalog version at EXEC/commit dispatch time rather than when the read set was captured. For example, Redis txnContext.commit builds an OperationGroup with StartTS and ReadKeys but no ObservedRouteVersion; if a route moves between the earlier MULTI read and this Dispatch call, the request is routed and verified against the new version, so verifyComposed1 will not detect the route shift the guard is meant to catch. This needs to be pinned when the transaction starts/reads, or not auto-filled here for already-started read-write transactions.
Useful? React with 👍 / 👎.
|
Codex P1 addressed: The Dispatch-entry auto-pin no longer fires for read-write txns. Per design doc §4.1, pinning must happen at BeginTxn (read-set capture); for read-write txns Dispatch is strictly LATER than BeginTxn, so the coordinator cannot retroactively reconstruct the route view the reads were taken against. Stamping the catalog version at EXEC time would associate post-shift routes with pre-shift reads — exactly the gate-bypass codex flagged for Redis MULTI/EXEC. Fix: gate the auto-pin on Caller audit (loop directive): Regression test (added before fix per convention):
@claude review |
|
Claude finished @bootjp's task in 3m 58s —— View job PR Review: Codex P1 fix — gate auto-pin on write-only txns
|
| Txn type | Auto-pin gate fires? | Composed-1 retry? | Result |
|---|---|---|---|
| Write-only | ✅ pins at entry | ✅ retries once | client-transparent |
| Read-write (no caller pin) | ❌ stays 0 | ❌ surfaces immediately | pre-feature fallback |
| Read-write (caller-supplied pin) | ❌ caller wins | ❌ surfaces immediately | §4.1 contract preserved |
This is correct. The retry and the auto-pin now agree on "only write-only txns are safe to coordinate transparently."
⚠️ Three findings from prior reviews still open
These were in the prior Claude reviews and were not addressed in 66ad5b0.
[Medium] PrevCommitTS not zeroed on retry
Line 544 clears CommitTS and re-allocates StartTS but does not zero PrevCommitTS. If a write-only txn had PrevCommitTS != 0 (single-shard at-most-once dedup), and the retry routes to multiple groups (because a SplitRange fired between the two attempts), dispatchMultiShardTxn returns ErrTxnDedupRequiresSingleShard — a wrong error, because the original attempt never committed (M3 gate fired at the FSM before any apply). The dedup signal is stale.
reqs.CommitTS = 0
reqs.PrevCommitTS = 0 // ← add: M3 means original attempt never committed; dedup signal is stale[Minor] Stale resp returned on allocErr
Line 547 returns resp, allocErr where resp is from the failed Composed-1 attempt. In practice resp is always nil when dispatchTxn errors, but the intent is clearer and safer as:
return nil, allocErr[Nit] Empty goroutine in TestShardedCoordinator_RetriesOnComposed1Violation
Lines 68–74 still contain:
go func() {
// ... comments only, no body ...
}()This goroutine executes nothing. Remove it (or keep the comment as a regular block comment).
Summary
The described Codex P1 fix (gate auto-pin on len(reqs.ReadKeys) == 0) is the right approach and would complete a consistent "write-only retry" contract. It just needs to be pushed to the branch. Once pushed, the three items above should also be addressed before merge — the PrevCommitTS one in particular is a correctness issue on the retry path.
…write txns (codex P1 on PR #900) When a caller already supplies a transaction StartTS + ReadKeys but leaves ObservedRouteVersion unset, the M4 Dispatch-entry auto-pin was stamping the catalog version at EXEC/commit dispatch time rather than when the read set was captured. Concrete scenario: Redis txnContext.commit builds an OperationGroup with StartTS from MULTI and ReadKeys from the GETs inside the txn body but no ObservedRouteVersion. If a route moved between the MULTI-time reads and the EXEC Dispatch call, the auto-pin stamped the POST-shift catalog version on the request and verifyComposed1 silently routed-and-verified against the new version — failing to detect the very shift the guard is meant to catch. Per design doc §4.1: pinning happens at BeginTxn time (read-set capture). For write-only txns Dispatch IS BeginTxn (no prior reads), so the auto-pin matches. For read-write txns Dispatch is strictly LATER than BeginTxn, and the coordinator cannot retroactively reconstruct the route view the reads were taken against. Such callers MUST migrate to pin ObservedRouteVersion themselves at BeginTxn time; until they do, surfacing observedVer=0 to the M3 gate (which short-circuits on the zero sentinel) is the correct non-regressing posture — better than a wrong-version "protection" that disguises a missed gate fire. Fix: gate the auto-pin on `len(reqs.ReadKeys) == 0`. Write-only txns continue to be auto-pinned at Dispatch entry; read-write txns flow through with whatever ObservedRouteVersion the caller supplied (zero = unpinned, non-zero = caller-managed per §4.1). Caller audit (the loop directive's semantic-change check): only ShardedCoordinator.Dispatch auto-pins. `grep -rn 'ObservedRouteVersion' adapter/` returns no hits — no adapter sets or reads the field today, so the semantic shrink (auto-pin no longer fires for read-write txns) is invisible to every caller. Adapters that want M3 protection for read-write txns must explicitly migrate to pin at BeginTxn, which is what the design doc has always required. Regression test (added before the fix per the project convention): * TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn — builds an OperationGroup with ReadKeys non-empty and ObservedRouteVersion=0, dispatches against an engine pinned at catalog version 5, asserts the request that reached the stub Commit carries ObservedRouteVersion=0 (the auto-pin did not fire). Verified failing on the parent commit (the request carried v=5) and passing on this commit. Verification: go test -race -count=1 ./kv → pass. make lint → 0 issues.
|
Pushed as 10123c5 — the @claude review request above was posted before the commit actually landed (a pre-commit cyclop lint failure on my first attempt — fixed by extracting the auto-pin into @claude review |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (10123c5):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 2m 16s —— View job PR Review: Codex P1 fix —
|
… (claude[bot] medium/minor/nit on PR #900) Three items from the claude[bot] review of 10123c5, all flagged across three review rounds: MEDIUM — PrevCommitTS leak on Composed-1 retry. The M4 retry block zeroed CommitTS and re-allocated StartTS but did NOT clear PrevCommitTS. Concrete failure mode (claude[bot]): an adapter-initiated write-only retry with a non-zero PrevCommitTS (the option-2 at-most-once dedup probe) hits ErrComposed1Violation; the catalog shifts such that groupMutations now spans two groups; dispatchMultiShardTxn:736 immediately returns ErrTxnDedupRequiresSingleShard because prevCommitTS != 0. But the M3 gate fired at APPLY time — the original attempt never committed — so the dedup probe is invalid on retry. The caller gets a misleading "you must single-shard" error instead of a correct cross-group re-route. Fix: drop reqs.PrevCommitTS = 0 alongside reqs.CommitTS = 0 in the retry block. The retry now reflects "fresh attempt, no prior landing to dedup against," which matches the actual state (M3 fired before any FSM apply). Caller audit (loop directive's semantic-change check): `grep -rn 'PrevCommitTS' adapter/` returns two sites (adapter/redis.go:3209, :3615), both of which construct an OperationGroup struct literal inline, pass it to Dispatch, and either return success or branch on the error. Neither call site reads reqs.PrevCommitTS back after Dispatch returns — the field is set from a separate adapter-owned `pending.commitTS` which is unmodified by Dispatch. The semantic change (clearing the field inside the M4 retry) is therefore invisible to every current caller. MINOR — empty goroutine in the retry test. The first retry test had a `go func() { /* comment */ }()` with an empty body — it looked like an unfinished concurrent-catalog-advance model but had no statements. Removed; promoted the intent comment to a plain block comment that precedes the Dispatch call and points at the actual observedSeen[1] assertion that enforces the re-read contract. NIT — stale resp on allocErr path. The retry block returned `resp, allocErr` where resp was the failed first attempt's response (always nil for the Composed-1 sentinels). Made it `nil, allocErr` so the "non-nil response only on success" contract is unambiguous. Regression test (added before the fix per the project convention): * TestShardedCoordinator_ClearsPrevCommitTSOnComposed1Retry — write-only OperationGroup with PrevCommitTS=17, first attempt rejected with ErrComposed1Violation. Decodes the encoded TxnMeta from the per-attempt pb.Request Mutation value to observe the on-wire PrevCommitTS, then asserts attempt 1 carried 17 (caller's probe unchanged) and attempt 2 carried 0 (probe dropped by the retry path). Verified failing on the parent commit (attempt 2 saw 17 = 0x11) and passing on this commit. The stub composed1RetryingTransactional grew a prevCommitTSSeen field and a per-call decode of muts[0].Value to support the assertion. Verification: go test -race -count=1 ./kv → pass (the whole package, 11s). make lint → 0 issues.
|
Pushed as 43c55df — claude[bot]'s three items on 10123c5 addressed. Medium — PrevCommitTS leak (3rd review round): retry block now zeros Caller audit (semantic change, loop directive): Minor — empty goroutine (3rd review round): removed; intent promoted to a plain block comment in front of Dispatch pointing at the observedSeen[1] assertion that enforces the re-read contract. Nit — stale resp on allocErr: Regression test (added before fix per convention):
@claude review |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (43c55df):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Claude finished @bootjp's task in 2m 37s —— View job PR Review: 43c55df — PrevCommitTS + empty goroutine + stale resp fixes
All three items from the prior review rounds are correctly addressed. No new issues found. ✅ PrevCommitTS cleared on retry — correct
Cross-checked with ✅ Regression test — correct
Test follows the project convention: verifies failing on the parent commit (attempt 2 would decode ✅ Empty goroutine removed — correctLines 82–89 now contain a plain block comment explaining the test design intent. No empty ✅ Stale resp on allocErr — correctLine 582: TLA+ spec divergence — no update neededThe changes are confined to the Dispatch-level retry orchestration in LGTM. The four review rounds across this PR have been thorough and all correctness findings are closed. The one remaining open item is the M5 Jepsen workload (pre-flagged as a follow-up in the PR description). |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 43c55dfe0a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // short-circuit with ErrTxnDedupRequiresSingleShard if the | ||
| // post-shift catalog now reroutes this txn to span two groups | ||
| // (claude[bot] medium, 3rd review round on PR #900). | ||
| reqs.PrevCommitTS = 0 |
There was a problem hiding this comment.
Preserve PrevCommitTS when retrying deduped attempts
When the request is already an option-2 retry (PrevCommitTS != 0) and has no ReadKeys, this line drops the only probe that tells the FSM whether the earlier ambiguous attempt already landed. I checked the Redis list-push reuse path in adapter/redis.go: it sends PrevCommitTS: pending.commitTS, and its readKeys can be empty for the empty-list case, so a route shift that triggers ErrComposed1Violation here will reroute and reapply the write at a new timestamp instead of no-oping against the version that may have already committed. The fact that this M4 attempt failed the Composed-1 gate only proves this attempt did not apply; it does not prove the prior attempt named by PrevCommitTS did not land.
Useful? React with 👍 / 👎.
…st-once probe (codex P1 on 43c55df, PR #900) 43c55df took claude[bot]'s medium feedback at face value and zeroed reqs.PrevCommitTS in the M4 retry block, on the theory that "the M3 gate fires at apply time → the original attempt never committed → probe is stale." Codex P1 on 43c55df correctly identifies that this conflates two different "original attempts": * Within M4: attempt 1's M3 rejection does prove attempt 1 wrote nothing (the FSM rejects before any state change). * Across adapter retries: PrevCommitTS names the ADAPTER'S prior attempt, not M4's. adapter/redis.go:3209 and :3615 (Redis EXEC reuse / list-push reuse retry paths) send PrevCommitTS = pending.commitTS as the option-2 ambiguous-outcome probe. That earlier attempt may have actually LANDED at the named commitTS before the route shift that triggered this M4 attempt's M3 sentinel. Dropping the probe in the retry would let the FSM re-apply the writes against the post-shift route, double-writing the caller's at-most-once payload if the named prior attempt did land. The at-most-once contract is the whole point of option-2 dedup; the M4 retry must not break it. Trade-off (accepted): when the post-shift route now spans multiple groups, dispatchMultiShardTxn:749 surfaces ErrTxnDedupRequiresSingleShard. That error is the FAITHFUL signal that the adapter's at-most-once retry contract cannot be transparently honoured across a shard split — the caller must reckon with the ambiguity at a higher level (re-read, re-execute, abort). Surfacing it is strictly safer than silently losing dedup. The earlier (claude[bot]-flagged) "misleading error" concern is real but the only correct fix lives in the adapter retry loop, not by tampering with the probe at the coordinator layer. Fix: revert the `reqs.PrevCommitTS = 0` line that 43c55df added. Other changes from 43c55df (empty goroutine removal, stale-resp nit) are correct and stay. Caller audit (loop directive's semantic-change check): no caller reads reqs.PrevCommitTS back after Dispatch. Adapter sites (adapter/redis.go:3209, :3615) construct OperationGroup inline, pass it, and branch on the error. The semantic change here (stop clearing the probe inside the retry) restores the pre-43c55dfe wire contract and is invisible to those callers. Regression test (replaces 43c55df's misnamed ClearsPrevCommitTSOnComposed1Retry): * TestShardedCoordinator_PreservesPrevCommitTSOnComposed1Retry — write-only OperationGroup with PrevCommitTS=17, first attempt rejected with ErrComposed1Violation. Decodes the encoded TxnMeta from the per-attempt pb.Request mutation value (same bytes the FSM reads), asserts attempt 1 AND attempt 2 both carry PrevCommitTS=17 — the probe survives the retry intact. This is the inverse of the test 43c55df added; the new assertion encodes codex P1's at-most-once correctness contract. Verification: go test -race -count=1 ./kv → pass (full pkg, 11s). make lint → 0 issues.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (57da888):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Pushed as 57da888 — codex P1 on 43c55df addressed. Codex P1 was correct, claude[bot]'s prior medium was wrong. 43c55df took claude[bot]'s reasoning at face value ("M3 fired at apply time → original attempt never committed → probe is stale") and zeroed
Dropping the probe would let the FSM re-apply against the post-shift route, double-writing the at-most-once payload if the prior attempt landed. The at-most-once contract is the whole point of option-2 dedup — M4 must not break it. Trade-off accepted: post-shift-multi-shard case still surfaces Caller audit (loop directive): no caller reads Regression test replaces the misnamed prior one:
@claude review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 57da88864e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| startTS, allocErr := c.nextStartTS(ctx, reqs.Elems) | ||
| if allocErr != nil { | ||
| // resp here is the failed first attempt's response (always | ||
| // nil for the Composed-1 sentinels) — return nil | ||
| // explicitly so the contract "non-nil response on success | ||
| // only" is unambiguous (claude[bot] nit on PR #900). | ||
| return nil, allocErr | ||
| } | ||
| reqs.StartTS = startTS |
There was a problem hiding this comment.
Preserve caller StartTS on OCC retries
When a caller supplies StartTS from an earlier OCC read but leaves ReadKeys empty, this retry silently replaces the snapshot timestamp. Fresh evidence beyond the earlier ReadKeys case is S3 createBucket, which reads bucket metadata at readTS (adapter/s3.go:573) and dispatches a txn with that StartTS but no ReadKeys (adapter/s3.go:605-613); if a route shift triggers this path, a concurrent create that committed after the S3 read but before the retry is no longer rejected by the FSM's latest > startTS write-conflict check, so the retry can overwrite/accept a stale decision instead of surfacing a conflict. The retry should only advance StartTS for coordinator-created write-only snapshots, or otherwise force the adapter to re-read and rebuild the request.
Useful? React with 👍 / 👎.
…(codex P1 on 57da888, PR #900) Codex P1 on 57da888 identifies that the M4 retry path's StartTS bump is unsafe whenever the caller supplied StartTS — not only the already-gated ReadKeys-non-empty case. When the caller hands us StartTS=readTS (because it already read metadata at that snapshot and is now dispatching a write), bumping StartTS on retry lets any concurrent commit landed in (originalStartTS, newStartTS) bypass the FSM's `latest > startTS` write-conflict check. The intervening commit is now dominated by the new snapshot, so OCC silently accepts the retry's stale-snapshot decision. The pre-Dispatch read is INVISIBLE to the coordinator (it is not surfaced as a ReadKeys entry), so the only signal that the snapshot is caller-owned is reqs.StartTS being non-zero at Dispatch entry. Real exposure: a code-base grep for adapter sites that supply `StartTS = readTS` without ReadKeys turns up at least 13 hits across S3, SQS, and DynamoDB adapters: * adapter/s3.go:573 (createBucket) * adapter/s3_admin_objects.go:144, :348 * adapter/sqs_messages.go:541, :1113, :1332, :1487, :1614 * adapter/sqs_messages_batch.go:222 * adapter/sqs_fifo.go:241 * adapter/sqs_purge.go:146 * adapter/sqs_tags.go:179 * adapter/dynamodb.go:4366, :8274 All of these read metadata at readTS and dispatch a write with that StartTS but empty ReadKeys. Under the pre-fix retry path any Composed-1 sentinel they hit would have been silently absorbed by the StartTS-bumping retry, breaking OCC for concurrent writes in the bumped-by window. Fix: capture `callerSuppliedStartTS := reqs.IsTxn && reqs.StartTS != 0` at Dispatch entry (BEFORE the coordinator-allocates-on-zero branch that mutates the field), thread it into dispatchTxnWithComposed1Retry, and surface the sentinel without retry whenever the flag is set. Coordinator-allocated StartTS remains retryable — by definition no read has happened at that timestamp, so the bump is OCC-safe. Caller audit (loop directive's semantic-change check): the adapter sites listed above are the BENEFICIARIES of this fix. Pre-fix they were silently exposed to stale-snapshot OCC bypass under route shifts; post-fix they correctly surface ErrComposed1Violation, which propagates through their existing HTTP error path to the client. No caller depended on the silent retry succeeding — the M4 retry was added in this PR and no adapter is yet wired to take advantage of it in the caller-supplied-StartTS case. The DynamoDB adapter sites that already supply StartTS=0 (adapter/dynamodb.go:922, :981, etc.) continue to use the coordinator-allocated path, so the M4 retry remains available for those. Test changes: * NEW TestShardedCoordinator_SurfacesComposed1OnCallerSuppliedStartTS: write-only OperationGroup with StartTS=42, ReadKeys empty, firstErr=ErrComposed1Violation. Asserts the sentinel surfaces unchanged and exactly ONE Commit attempt fires. Verified failing on the parent commit (the retry absorbed the sentinel and the second attempt succeeded) and passing on this commit. * Existing M4-retry tests that previously set StartTS=10 (a test fixture, not a real "caller-supplied" intent) updated to omit StartTS so the coordinator allocates and the retry path is eligible to fire: - TestShardedCoordinator_RetriesOnComposed1Violation - TestShardedCoordinator_RetriesOnComposed1VersionGCd - TestShardedCoordinator_PersistentComposed1ViolationSurfaces - TestShardedCoordinator_PreservesPrevCommitTSOnComposed1Retry These tests use NewHLC() and the in-memory engine, so nextStartTS allocates cleanly. * Tests that exercise read-write surfacing (TestShardedCoordinator_DoesNotRetryReadWriteTxnOnComposed1, TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn) keep StartTS=10 — ReadKeys non-empty surfaces first, so the new callerSuppliedStartTS gate is never reached. Same for the pin tests that succeed on first attempt. Also fixes a doc-comment nit claude[bot] flagged on 57da888: the composed1RetryingTransactional stub's PrevCommitTS-decode comment still said "assert the field is cleared on the retry attempt" — that was the now-disproven claude[bot] medium reasoning from 43c55df. Updated to say "PRESERVED across retry attempts" with the codex P1 (43c55df) reference. Verification: go test -race -count=1 ./kv → pass (full pkg, 11s). make lint → 0 issues.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (144ec0c):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Pushed as 144ec0c — codex P1 on 57da888 addressed (plus claude[bot] stale-comment nit). Codex P1 — caller-supplied StartTS surfaces. The M4 retry path's StartTS bump was unsafe for ALL caller-supplied StartTS values, not just the already-gated ReadKeys-non-empty case. A code-base grep found 13+ adapter sites that supply
Under the pre-fix retry, any Composed-1 sentinel they hit would have been silently absorbed by the bumped-StartTS retry, letting any commit landed in (originalStartTS, newStartTS) bypass the FSM's Fix: capture Caller audit (loop directive): the 13+ adapter sites are the BENEFICIARIES of this fix. Pre-fix they were silently exposed to stale-snapshot OCC bypass under route shifts; post-fix they correctly surface ErrComposed1Violation through their existing HTTP error path. No caller depended on the silent retry succeeding (M4 was added in this PR; no adapter is yet wired to take advantage). DynamoDB sites with Regression test: Existing M4-retry tests updated to omit StartTS (was claude[bot] nit on 57da888: stub's PrevCommitTS-decode comment still said "assert the field is cleared" (the now-disproven 43c55df-era claude[bot] medium reasoning). Updated to "PRESERVED across retry attempts" with codex P1 (43c55df) reference.
@claude review |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 3m 6s —— View job PR Review: Codex P1 fix — caller-supplied StartTS surfaces (144ec0c)
✅ Codex P1 — correct
The new guard at lines 572–574 is placed correctly in the loop body — after ✅ Regression test — correct
The test comment documents the 13+ real adapter sites and explains why the pre-Dispatch read is invisible to the coordinator (no ReadKeys). The test correctly distinguishes the ✅ Existing tests updated correctly
✅ Stale-comment nit from 57da888 — fixedThe ✅ All prior review findings resolved
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 144ec0ca87
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 { | ||
| return | ||
| } | ||
| reqs.ObservedRouteVersion = c.engine.Version() |
There was a problem hiding this comment.
Avoid auto-pinning caller-owned snapshots
When a transaction arrives with a caller-supplied StartTS but empty ReadKeys (for example adapter/s3.go:573-614 reads bucket metadata at readTS and dispatches the write with StartTS: startTS and no ObservedRouteVersion), this still stamps the catalog version at Dispatch time. If a route moved between that adapter-layer read and Dispatch, the request is pinned to the post-shift route version, so verifyComposed1 will not report the route shift even though the write decision was based on the old snapshot; the later callerSuppliedStartTS retry guard never helps because the sentinel was masked. The auto-pin needs to be limited to coordinator-owned snapshots as well as empty ReadKeys, or these adapters need to pin the version when they take the read snapshot.
Useful? React with 👍 / 👎.
…alse-confidence gate signal (codex P1 on 144ec0c, PR #900) Codex P1 on 144ec0c identifies a remaining hole in the M3 auto-pin coverage: when an adapter supplies StartTS=readTS with ReadKeys empty (the 13+ S3/SQS/DynamoDB sites enumerated in 144ec0c's commit message), the Dispatch-entry auto-pin stamps reqs.ObservedRouteVersion = engine.Version() — the CURRENT catalog version, not the version the read was actually captured at. If a route shift landed between the adapter-layer read and this Dispatch call, the auto-pin records the POST-shift version and verifyComposed1 finds no mismatch at apply time. The route shift becomes INVISIBLE to the gate — masking exactly the failure mode the gate exists to catch. This is strictly worse than ObservedRouteVersion=0: * observedVer=0 → M3 gate short-circuits → operator knows no protection is in place (correct signal). * spurious auto-pin → M3 gate runs and passes → operator gets false confidence (corrupt signal). The "partial coverage" framing from claude[bot]'s 144ec0c review ("dispatch→apply window is still covered for these callers") is illusory: the auto-pin's claim of catalog version on the request LIES about when the read happened. A false claim that the gate verifies as true is a hostile signal, not a partial one. The correct fix lives in the adapters (pin ObservedRouteVersion at read time, design doc §4.1), which is M5+ migration work. Until then, withholding the auto-pin is the only honest posture. Fix: extend maybeAutoPinObservedRouteVersion's skip conditions to include callerSuppliedStartTS, mirroring the M4-retry gate added in 144ec0c. The condition now reads: if engine == nil || ObservedRouteVersion != 0 || len(ReadKeys) != 0 || callerSuppliedStartTS { return // do NOT auto-pin } reqs.ObservedRouteVersion = engine.Version() The flag is already captured at Dispatch entry (before the allocate-on-zero branch) by 144ec0c; this commit just threads it one more level down. Caller audit (loop directive's semantic-change check): the 13+ adapter sites that supply StartTS=readTS without ReadKeys (adapter/s3.go:573, sqs_messages.go:541 etc., dynamodb.go:4366 etc.) currently get auto-pinned and the gate verifies with a potentially-false claim. Post-fix they get ObservedRouteVersion=0 and the gate short-circuits. No caller loses real protection — the auto-pin's "protection" was false-positive coverage that masked actual shifts. Adapters that want genuine M3 protection must pin ObservedRouteVersion at read time per §4.1, which is the pre-flagged M5+ migration. Coordinator-allocated StartTS callers (DynamoDB sites with StartTS=0 — adapter/dynamodb.go:922, :981, :1760, :4686, :4703, :4722, :4735) are unchanged: callerSuppliedStartTS=false, auto-pin fires as before. The coordinator's allocation IS the read-time-equivalent moment for those callers (no read against a different catalog version has happened earlier), so the auto-pin's claim is honest. Regression test: * TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForCallerSuppliedStartTS — caller-supplied StartTS=42, ReadKeys empty, ObservedRouteVersion=0, engine pinned at v=9. Asserts the request that reached the stub Commit carries ObservedRouteVersion=0 (auto-pin did NOT fire). Verified failing on the parent (the request carried v=9) and passing on this commit. Existing test TestShardedCoordinator_PinsObservedRouteVersionAtDispatchEntry updated to omit StartTS so the auto-pin path remains exercised on the coordinator-allocated branch. TestShardedCoordinator_HonorsCallerPinnedObservedRouteVersion is unchanged (its ObservedRouteVersion is non-zero so the auto-pin short-circuits on the existing condition before the new gate even applies). TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn is unchanged (ReadKeys non-empty short-circuits first). Verification: go test -race -count=1 ./kv → pass (full pkg, 10s). make lint → 0 issues.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (6202b96):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Pushed as 6202b96 — codex P1 on 144ec0c addressed. Codex P1 — skip auto-pin for caller-supplied StartTS. The auto-pin at Dispatch entry stamped Why this is worse than no auto-pin: ObservedRouteVersion=0 → gate short-circuits → operator knows protection is not in place. Spurious auto-pin → gate verifies a FALSE claim → false confidence. claude[bot]'s 144ec0c framing of "partial coverage" is illusory because the auto-pin LIES about when the read happened — it's a corrupt signal, not a partial one. Fix: thread the already-captured if engine == nil || ObservedRouteVersion != 0 || len(ReadKeys) != 0 || callerSuppliedStartTS { return }Caller audit (loop directive): the 13+ caller-supplied-StartTS adapter sites lose the auto-pin's false-positive coverage but no real protection — they were never honestly covered. DynamoDB sites with Regression test: Existing test
@claude review |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 2m 35s —— View job PR Review: Codex P1 fix — skip auto-pin for caller-supplied StartTS (6202b96)
✅ Codex P1 — correctThe if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 || callerSuppliedStartTS {
return
}The flag is correctly captured at line 464 before the coordinator-allocates-on-zero branch at line 466 can mutate
The "false confidence is strictly worse than honest no-coverage" reasoning from the trigger comment is correct. ✅ Regression test — correct
✅
|
| Finding | Round | Status |
|---|---|---|
| gemini CRITICAL / codex P1: SSI violation on read-write retry | R1 | ✅ 66ad5b0 — ReadKeys != 0 surfaces immediately |
| gemini HIGH: nil-clock leaves StartTS=0 | R1 | ✅ 66ad5b0 — unconditional nextStartTS |
| codex P1: auto-pin stamps post-shift version for read-write txns | R2 | ✅ 10123c5 — len(ReadKeys) != 0 guard |
| Minor: empty goroutine in test | R2 | ✅ 43c55df — removed; converted to block comment |
Nit: stale resp on allocErr |
R2 | ✅ 43c55df — return nil, allocErr |
| claude medium: PrevCommitTS clearing (was wrongly added, then correctly reverted) | R2–R3 | ✅ 57da888 — NOT cleared; preserves adapter's at-most-once probe |
| codex P1: callerSuppliedStartTS OCC bypass on retry | R4 | ✅ 144ec0c — surfaces without retry |
| codex P1: auto-pin for caller-supplied StartTS | R5 | ✅ 6202b96 — this commit |
TLA+ spec divergence — no update needed
6202b96 is confined to maybeAutoPinObservedRouteVersion and dispatchTxnWithComposed1Retry in sharded_coordinator.go. No changes to hlc*.go, coordinator.go HLC lease-renewal paths, transaction.go, lock_resolver.go, fsm.go, mvcc_store.go, or distribution/. The auto-pin decision does not touch timestamp issuance, OCC commit-ts assignment, FSM apply, or catalog versioning. No spec module needs updating.
LGTM. Six review rounds across this PR have been thorough. All correctness findings are closed. The final shape correctly distinguishes four cases (coordinator-allocated / caller-supplied-StartTS-only / read-write / caller-pinned-route-version) and is honest about which ones M3/M4 can and cannot protect. The one remaining open item is the M5 Jepsen workload, pre-flagged in the PR description.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6202b9648a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 || callerSuppliedStartTS { | ||
| return | ||
| } | ||
| reqs.ObservedRouteVersion = c.engine.Version() |
There was a problem hiding this comment.
Don't auto-pin 2PC writes until secondary failures are handled
When this auto-pins a coordinator-owned write-only transaction, ordinary multi-shard requests become subject to the Composed-1 gate even though the 2PC secondary path still treats secondary Commit errors as best-effort: dispatchMultiShardTxn calls commitSecondaryTxns and then returns success, while commitSecondaryTxns only logs errors from commitSecondaryWithRetry. If a route moves after PREPARE/primary COMMIT but before a secondary COMMIT, that secondary can now return ErrComposed1Violation; Dispatch still acknowledges success, leaving the secondary write uncommitted/invisible to readers on the new owner. Before this auto-pin these requests carried ObservedRouteVersion=0 and the gate short-circuited, so this change can turn route churn into an acknowledged partial commit unless secondary Composed-1 errors are made fatal/recoverable before success is returned.
Useful? React with 👍 / 👎.
…partial commits (codex P1 on 6202b96, PR #900) Codex P1 on 6202b96 identifies a structural conflict between M3's fail-closed semantics and 2PC's best-effort secondary semantic. When dispatchMultiShardTxn propagates a non-zero ObservedRouteVersion to secondary commit Requests, a route shift that lands AFTER primary-COMMIT but BEFORE a secondary-COMMIT makes the secondary FSM return ErrComposed1Violation. commitSecondaryTxns swallows the error (logs, then continues), and dispatchMultiShardTxn returns success to the caller — the secondary's write is uncommitted, invisible to readers on the new owner, and the txn is acknowledged as durable. This is a silent partial commit hazard. The auto-pin added in 66ad5b0 (and the explicit caller-pin path from PR #881) inadvertently extended the M3 gate to a step where its fail-closed semantic conflicts with the 2PC contract. The fundamental tension: * Prewrite: gate-active. A failed gate aborts before any locks are placed (groupReadKeys / prewriteTxn return the error). Clean recovery, no partial state. * Primary commit: gate-active. A failed gate triggers abortPreparedTxn (the existing line 849 path), which releases the prewrite intents on every prepared shard. Clean recovery, no partial state. * Secondary commit: gate-active is HAZARDOUS. The primary has already durably committed; there is no rollback path. commitSecondaryTxns is best-effort by design (its docstring says lock-resolution backstops unavailable secondaries), but lock resolution itself would fail the same gate on the same shifted route. Net effect: silent partial commit. Fix: drop ObservedRouteVersion to 0 on every per-secondary commit Request. fsm.go:609 short-circuits verifyComposed1 on observedVer==0, so the gate never fires on secondaries. This is NOT a regression of PR #881's "propagate the pin everywhere" intent — it is a refinement that recognizes secondary commit as the step where propagation cannot compose with the 2PC contract. PR #881's underlying concern (prewrite and primary commit silently dropping the version) remains addressed; only the secondary commit's propagation is reversed. Caller audit (loop directive's semantic-change check): the change is entirely internal to commitSecondaryTxns. The observedRouteVersion parameter is removed from the function signature; dispatchMultiShardTxn no longer passes it. No external caller is affected. Effect on production callers: * Auto-pinned write-only coordinator-allocated multi-shard txns (e.g. DynamoDB BatchWriteItem with StartTS=0): secondary commits no longer trigger Composed-1 on cross-COMMIT route shifts. Pre-fix exposure: silent partial commit under operator-triggered SplitRange. Post-fix: secondary commit succeeds (or surfaces a non-Composed-1 error from the underlying Raft layer). Lock-resolution backstop unaffected. * Caller-pinned write-only multi-shard txns: same as above — the pin still applies to prewrite and primary commit but not to secondary commits. Callers that want secondary-side M3 protection need a fatal-secondary 2PC variant (M5+). Regression test: * TestShardedCoordinator_DropsObservedRouteVersionOn2PCSecondaryCommit — 2-shard fixture (primary "b" → g1, secondary "x" → g2), coordinator-allocated StartTS so auto-pin stamps ObservedRouteVersion=7 on the OperationGroup. Asserts the coverage matrix: * prewrite on both shards: observedVer=7 * primary commit on g1: observedVer=7 * secondary commit on g2: observedVer=0 Verified failing on the parent (g2 commit carried v=7) and passing on this commit. Existing test TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion (PR #881 contract) updated to reflect the codex P1 refinement. It now asserts the same matrix above for the caller-pinned case (pin=42). The original assertion (every 2PC envelope carries the pin) was the regression codex P1 identifies; the updated assertion encodes the corrected contract. Verification: go test -race -count=1 ./kv → pass (full pkg, 8s). make lint → 0 issues.
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (d848767):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
Pushed as d848767 — codex P1 on 6202b96 addressed. Note on flaky CI: Codex P1 — drop M3 on secondary commits. Coverage matrix the post-fix code satisfies:
Fix: drop the Caller audit (loop directive): change is entirely internal to commitSecondaryTxns. No external callers affected. Auto-pinned write-only multi-shard txns (DynamoDB BatchWriteItem etc.) lose the silent-partial-commit hazard; explicit-caller-pin callers also lose secondary-side coverage (the fundamental hazard is structural to 2PC + fail-closed gate; M5+ work for a fatal-secondary 2PC variant or lock-resolution-bypass). Regression test: Existing test
@claude review |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 3m 8s —— View job PR Review: d848767 — drop ObservedRouteVersion on 2PC secondary commits
✅ Codex P1 — correct
req := &pb.Request{
IsTxn: true,
Phase: pb.Phase_COMMIT,
Ts: startTS,
Mutations: append([]*pb.Mutation{meta}, keyMutations(grouped[gid])...),
// ObservedRouteVersion intentionally omitted (zero) ...
}The comment block (lines 1019–1038) is one of the most thorough explanations in this PR: it names the exact failure mode (route shift between primary-COMMIT and secondary-COMMIT → ErrComposed1Violation → swallowed → silent partial commit), explains why The call-site comment in ✅ Coverage matrix satisfiedThe two-phase flow for a 2-shard txn after d848767:
This matches the PR description's coverage matrix exactly. ✅ Regression test — correct
The request ordering is correct because ✅ Updated
|
| Finding | Round | Commit | Status |
|---|---|---|---|
| gemini CRITICAL / codex P1: SSI violation on read-write retry | R1 | 66ad5b0 | ✅ |
| gemini HIGH: nil-clock leaves StartTS=0 | R1 | 66ad5b0 | ✅ |
| codex P1: auto-pin for read-write txns | R2 | 10123c5 | ✅ |
| Minor: empty goroutine in test | R2 | 43c55df | ✅ |
Nit: stale resp on allocErr |
R2 | 43c55df | ✅ |
| claude medium: PrevCommitTS clearing (correctly NOT cleared) | R2–R3 | 57da888 | ✅ |
| codex P1: callerSuppliedStartTS OCC bypass on retry | R4 | 144ec0c | ✅ |
| codex P1: auto-pin for caller-supplied StartTS | R5 | 6202b96 | ✅ |
| codex P1: 2PC secondary ObservedRouteVersion silent partial commit | R6 | d848767 | ✅ this commit |
TLA+ spec divergence — no update needed
Only kv/sharded_coordinator.go, kv/sharded_coordinator_composed1_retry_test.go, and kv/sharded_coordinator_txn_test.go changed in this PR. No modifications to kv/hlc*.go, kv/coordinator.go (HLC lease renewal), kv/transaction.go, kv/lock_resolver.go, kv/fsm.go, store/mvcc_store.go, or distribution/. The 2PC secondary commit change is purely in the Request construction layer — it does not touch timestamp issuance, OCC commit-ts assignment, FSM apply, or catalog versioning. No spec module needs updating.
LGTM. Seven review rounds across this PR have systematically closed every correctness finding. The M1+M2+M3+M4 stack is now complete for write-only txns with coordinator-allocated timestamps — route shifts are transparent to the client for that class. The M5 follow-ups (Jepsen workload, fatal-secondary 2PC, adapter §4.1 migration) are correctly pre-flagged in the PR description.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d8487672d9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| req := &pb.Request{ | ||
| IsTxn: true, | ||
| Phase: pb.Phase_COMMIT, | ||
| Ts: startTS, | ||
| Mutations: append([]*pb.Mutation{meta}, keyMutations(grouped[gid])...), | ||
| ObservedRouteVersion: observedRouteVersion, | ||
| IsTxn: true, | ||
| Phase: pb.Phase_COMMIT, | ||
| Ts: startTS, | ||
| Mutations: append([]*pb.Mutation{meta}, keyMutations(grouped[gid])...), |
There was a problem hiding this comment.
Preserve the secondary commit route pin
When a multi-shard transaction is caller-pinned or auto-pinned, this request literal now always sends ObservedRouteVersion=0 for secondary commits. If the secondary key's route moves after PREPARE/primary COMMIT but before this best-effort secondary COMMIT, the old secondary FSM short-circuits verifyComposed1 and applies the write on a group that is no longer the current owner; dispatchMultiShardTxn then acknowledges success. The M3 current-version fence was added specifically to reject these late commits to old owners, so secondary Composed-1 failures need to be made fatal/recoverable rather than disabling the gate for all secondaries.
Useful? React with 👍 / 👎.
…e propagation (codex P1 on d848767, PR #900) d848767 dropped ObservedRouteVersion from secondary commit Requests to avoid the silent-partial-commit hazard codex P1 on 6202b96 flagged: "fail-closed gate + best-effort swallow = uncommitted secondary, missing-write silent partial commit." Codex P1 on d848767 correctly identifies that dropping the gate trades one silent partial commit for another: with the gate off, the secondary FSM short-circuits verifyComposed1 and APPLIES the write on a group that is no longer the catalog's owner. Readers on the new owner never see it. Both codex reviews actually agreed on the right fix — make secondary Composed-1 errors FATAL — but d848767 picked the wrong interpretation ("disable the gate") instead of the explicit recommendation ("make fatal/recoverable"). This commit applies the codex-correct fix: 1. Restore ObservedRouteVersion propagation to secondary commit Requests (revert d848767's literal change). 2. Add ErrTxnSecondaryRouteShiftedAfterPrimaryCommit sentinel (kv/txn_errors.go). Distinct from ErrComposed1Violation by design: the M4 retry path in dispatchTxnWithComposed1Retry matches ErrComposed1Violation / ErrComposed1VersionGCd; reusing those here would loop the retry against a primary that is already durable, re-prewriting into existing locks at the old gid — pure wasted work since the route catalog will not move backward. 3. Change commitSecondaryTxns to return (uint64, error). On Composed-1 sentinel from any per-secondary commitSecondaryWithRetry, surface ErrTxnSecondaryRouteShiftedAfterPrimaryCommit. Non-Composed-1 errors stay best-effort (logged and continued) to preserve the original lock-resolution-backstop semantic. 4. dispatchMultiShardTxn propagates the sentinel to its caller. Trade-off accepted: the caller now sees a fatal error when a multi-shard 2PC race lands. The primary's write is durable; the secondary's is missing. Application-level recovery (re-read, re-execute, abort, or compensating write) is required. This is strictly better than either silent direction — at least the caller knows the txn is half-broken and can act on it. Caller audit (loop directive's semantic-change check): commitSecondaryTxns's only caller is dispatchMultiShardTxn, updated in this commit. The new error type ErrTxnSecondaryRouteShiftedAfterPrimaryCommit is NOT in the M4 retry's retryable set (isComposed1RetryableError matches ErrComposed1Violation / ErrComposed1VersionGCd only), so the M4 retry loop in dispatchTxnWithComposed1Retry surfaces it directly without re-attempting. Production callers (adapter/redis.go, adapter/dynamodb.go, etc.) see the new error type at the Dispatch boundary and propagate it as their HTTP/RPC error. No adapter explicitly matches Composed-1 sentinels today, so the semantic shift (success-on-partial-commit → error-on-partial-commit) is a pure correctness improvement — pre-fix exposure was silent data loss under cross-COMMIT route shifts; post-fix is visible error. Test changes: * NEW TestShardedCoordinator_SurfacesFatalErrorOn2PCSecondaryComposed1 — 2-shard fixture, primary "b" → g1 succeeds (prewrite + commit), secondary "x" → g2 prewrite succeeds, commit persistently fails with ErrComposed1Violation (every retry of commitSecondaryWithRetry). Asserts the Dispatch error is ErrTxnSecondaryRouteShiftedAfterPrimaryCommit and NOT ErrComposed1Violation (so M4 retry does not match). Verified failing on the parent (Dispatch returned nil error because commitSecondaryTxns swallowed) and passing on this commit. * RENAMED TestShardedCoordinator_DropsObservedRouteVersionOn2PCSecondaryCommit → TestShardedCoordinator_PropagatesObservedRouteVersionOn2PCSecondaryCommit. Inverted the assertion: secondary commit's ObservedRouteVersion MUST equal the auto-pinned engine version, not zero. The d848767 contract was the temporary "drop-the-gate" posture; this commit restores propagation. * REVERTED TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion to the original PR #881 contract: every 2PC envelope on every shard carries the pinned version. The d848767 refinement ("drop on secondary commit") is undone in favour of the new fatal-sentinel approach. Verification: go test -race -count=1 ./kv → pass (full pkg, 9s). make lint → 0 issues. Note on the d848767 CI failure: TestRedis_StreamXReadLatencyIsConstant in adapter/ failed at 48.35s on the loaded ubuntu-latest runner. That is a latency-sensitive perf test unrelated to this PR (no Redis stream code touched).
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (6a458a2):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
Pushed as 6a458a2 — codex P1 on d848767 addressed. The d848767 fix was the wrong interpretation. Codex P1 on 6202b96 flagged "fail-closed gate + best-effort swallow = silent partial commit (missing-write direction)" and recommended "make secondary errors fatal/recoverable." I picked the wrong option ("disable the gate") instead of the explicit recommendation ("make fatal"). Codex P1 on d848767 correctly showed that the disabled-gate posture trades one silent partial commit for another (stale-owner direction): the secondary FSM short-circuits verifyComposed1 and applies the write on a group that's no longer the catalog's owner. Correct fix — make secondary Composed-1 fatal with a distinct sentinel.
Coverage matrix (now restored to PR #881's original):
Caller audit (loop directive): Tests:
Note on d848767 CI failure: @claude review |
|
Claude finished @bootjp's task in 3m 12s —— View job PR Review: 6a458a2 — surface secondary Composed-1 as fatal sentinel, restore propagation
The revert of d848767 is correct, and the approach here (make secondary Composed-1 fatal with a distinct sentinel rather than disabling the gate) is the right resolution of the two competing codex P1 findings. ✅ Sentinel design — correct
The new sentinel is correctly NOT matched by
✅
|
| Step | ObservedRouteVersion | Composed-1 outcome |
|---|---|---|
| Prewrite (both shards) | pinned | abort cleanly via abortPreparedTxn |
| Primary commit | pinned | abortPreparedTxn on failure |
| Secondary commit | pinned | ErrTxnSecondaryRouteShiftedAfterPrimaryCommit (fatal) |
✅ Regression test — correct
TestShardedCoordinator_SurfacesFatalErrorOn2PCSecondaryComposed1 (lines 715–661):
g2Errs = [nil, ErrComposed1Violation × txnSecondaryCommitRetryAttempts]— correctly accounts for the innercommitSecondaryWithRetryretry loop (3 attempts before returninglastErr = ErrComposed1Violation).require.ErrorIs(t, err, ErrTxnSecondaryRouteShiftedAfterPrimaryCommit)— asserts the new sentinel ✅require.NotErrorIs(t, err, ErrComposed1Violation)— asserts M4 retry does not have a false surface ✅
TestShardedCoordinator_PropagatesObservedRouteVersionOn2PCSecondaryCommit correctly reverts the d848767 assertion back to observedVer=7 for all four envelopes.
TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion correctly loops over all requests and asserts pinnedVer everywhere, eliminating the special-cased observedVer=0 assertion from d848767.
✅ All prior review findings resolved
| Finding | Round | Status |
|---|---|---|
| gemini CRITICAL / codex P1: SSI on read-write retry | R1 | ✅ 66ad5b0 |
| gemini HIGH: nil-clock StartTS=0 | R1 | ✅ 66ad5b0 |
| codex P1: auto-pin for read-write txns | R2 | ✅ 10123c5 |
| Minor: empty goroutine | R2 | ✅ 43c55df |
Nit: stale resp on allocErr |
R2 | ✅ 43c55df |
| claude medium: PrevCommitTS (correctly preserved) | R2–R3 | ✅ 57da888 |
| codex P1: callerSuppliedStartTS OCC bypass on retry | R4 | ✅ 144ec0c |
| codex P1: auto-pin for caller-supplied StartTS | R5 | ✅ 6202b96 |
| codex P1: 2PC secondary gate — silent partial commit (both directions) | R6–R7 | ✅ 6a458a2 (this commit) |
⚠️ Minor — commitSecondaryWithRetry sleeps between Composed-1 retries
commitSecondaryWithRetry sleeps 20ms + 40ms = 60ms before returning the final error on 3 Composed-1 attempts. These sleeps are useful for transient leader-election drops but wasteful for Composed-1 (the catalog won't move back during the retry window). The caller already correctly classifies the final error as fatal, so the semantics are correct — the only cost is latency on the error path. Not blocking; flagging for M5+ or a separate refactor to add a shouldRetry(err) bool predicate in commitSecondaryWithRetry.
TLA+ spec divergence — no update needed
6a458a2 is confined to kv/sharded_coordinator.go, kv/sharded_coordinator_composed1_retry_test.go, kv/sharded_coordinator_txn_test.go, and kv/txn_errors.go. No changes to kv/hlc*.go, kv/coordinator.go HLC renewal paths, kv/transaction.go, kv/lock_resolver.go, kv/fsm.go, store/mvcc_store.go, or distribution/. The sentinel classification and secondary-commit error propagation don't touch timestamp issuance, OCC commit-ts assignment, FSM apply, or catalog versioning. No spec module needs updating.
LGTM. Eight review rounds have been thorough. The final shape correctly handles all four classes of Dispatch caller, correctly gates the M3 sentinel at every 2PC step, and surfaces secondary Composed-1 failures as a distinct honest error rather than either swallowing or disabling the gate. The one remaining open item is the M5 Jepsen workload, pre-flagged in the PR description.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a458a2840
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 || callerSuppliedStartTS { | ||
| return | ||
| } | ||
| reqs.ObservedRouteVersion = c.engine.Version() |
There was a problem hiding this comment.
Avoid auto-pinning resolver-routed transactions
When a partition resolver routes a write-only transaction to a group that differs from the byte-range engine (e.g. SQS partition keys), this auto-pin makes the FSM enforce the engine snapshot rather than the resolver's ownership: ShardRouter.ResolveGroup sends the raw key to the resolver first, but verifyComposed1 later calls the route-history OwnerOf on the raw mutation key, whose snapshots come only from distribution.Engine. With production FSMs wired via WithRouteHistory, a coordinator-owned txn with no StartTS/ReadKeys that used to commit unpinned can now be rejected with ErrComposed1Violation on both M4 attempts even though the resolver selected the correct group. The auto-pin needs to be skipped or made resolver-aware for resolver-claimed keys.
Useful? React with 👍 / 👎.
…a458a2, PR #900) Codex P1 on 6a458a2 identifies a final auto-pin skip case the prior rounds missed: when a PartitionResolver is wired (SQS HT-FIFO partition keys, etc.), ShardRouter routes via the resolver but verifyComposed1 calls route-history OwnerOf on the raw mutation key against the BYTE-RANGE engine snapshot. The engine has no knowledge of the resolver's mapping, so the gate spuriously rejects resolver-routed commits even when the resolver picked the correct gid. Concrete failure mode (codex): 1. ShardRouter.ResolveGroup → resolver returns gid R. 2. Dispatch-entry auto-pin stamps ObservedRouteVersion = engine.Version() on the request. 3. FSM at gid R applies; verifyComposed1 calls OwnerOf on the raw mutation key against the engine's snapshot. 4. Engine's snapshot says gid B (default byte-range owner), not gid R. 5. Mismatch → ErrComposed1Violation. 6. M4 retry re-reads engine.Version() (unchanged), retries, fails again. Spurious double failure. With production FSMs wired via WithRouteHistory, a coordinator-owned txn with no StartTS/ReadKeys that used to commit unpinned now gets rejected. The auto-pin's "blanket protection" framing breaks for resolver-routed keys — this is the FIFTH distinct scenario where the auto-pin overreaches: R1 (66ad5b0): ReadKeys non-empty (pre-shift reads) R2 (10123c5): same — gated R3 (144ec0c): callerSuppliedStartTS (adapter-layer pre-reads) R4 (6202b96): same — gated R5 (6a458a2): 2PC secondary commits — surfaced fatally R6 (this): resolver-claimed keys — gated Fix: extend maybeAutoPinObservedRouteVersion's skip conditions to include "any element's key is recognised by the partition resolver". Extracted anyResolverClaimedKey helper that: - returns false fast when no resolver is wired (the byte-range-only deployment — most common case); - iterates Elems and calls resolver.RecognisesPartitionedKey on each non-empty key; - returns true on first claim. Adapter for SQS HT-FIFO (the main resolver consumer) now correctly bypasses the M3 gate. Resolver-aware M3 is M5+ work. Caller audit (loop directive's semantic-change check): the change is internal to maybeAutoPinObservedRouteVersion. The auto-pin already had skip conditions for engine==nil, ObservedRouteVersion!=0, ReadKeys non-empty, and callerSuppliedStartTS; adding the resolver-claim skip is a straightforward extension. No external caller cares (the only caller is dispatchTxnWithComposed1Retry which threads the same flag through). Production callers using the partition resolver (SQS HT-FIFO via adapter/sqs_messages.go, adapter/sqs_messages_batch.go, etc.) see resolver-routed txns commit instead of being spuriously rejected — strict correctness improvement. Coordinator-routed (byte-range) txns continue to auto-pin as before; they were never affected by this hazard. Regression test: * TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForResolverRoutedTxn — installs a stubPartitionResolver claiming prefix "part:". Dispatches a write-only txn with key "part:q1:1234" and no StartTS/ReadKeys/ObservedRouteVersion. Engine pinned at version 9. Asserts the request that reaches Commit carries ObservedRouteVersion=0 (auto-pin did NOT fire). Verified failing on parent (request carried v=9) and passing here. Stub stubPartitionResolver is a minimal in-test PartitionResolver: it claims every key whose prefix matches a configurable byte sequence and routes them to a fixed gid. Sufficient for exercising the codex P1 skip path without pulling in the SQS-specific partition routing wiring. Verification: go test -race -count=1 ./kv → pass (full pkg, 10s). make lint → 0 issues.
|
Pushed as e08d44f — codex P1 on 6a458a2 addressed. Codex P1 — skip auto-pin for resolver-claimed keys. When a PartitionResolver is wired (SQS HT-FIFO partition keys etc.), ShardRouter routes via the resolver but verifyComposed1 calls route-history OwnerOf against the BYTE-RANGE engine snapshot. The engine has no knowledge of the resolver's mapping, so the gate spuriously rejects resolver-routed commits even when the resolver picked the correct gid. M4 retry re-reads engine.Version() (unchanged) and fails the same way — spurious double failure. Fix: extend Skip-condition summary:
Caller audit (loop directive): change is internal to Regression test:
@claude review |
TLA+ spec divergence review (auto-triggered)This PR touches files that the TLA+ safety spec has an anchor on (per Anchored files changed in this PR head (e08d44f):
What to check, by subsystem:
If the change is correct but requires a spec update, edit @claude review please verify TLA+ spec divergence per the checklist above. @codex review please verify TLA+ spec divergence per the checklist above. |
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @bootjp's task in 3m 22s —— View job PR Review: e08d44f — skip auto-pin for resolver-claimed keys (codex P1 on 6a458a2)
✅ Codex P1 — correct
if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 || callerSuppliedStartTS {
return
}
if c.anyResolverClaimedKey(reqs.Elems) {
return
}
reqs.ObservedRouteVersion = c.engine.Version()The five skip conditions are now complete and coherent:
Retry path is correct without a resolver check. The retry fires only when ✅ Regression test — correct
The ✅ All prior review findings resolved
TLA+ spec divergence — no update needede08d44f is confined to LGTM. Nine review rounds across this PR have been thorough. All correctness findings are closed. The final skip-condition table in |
Summary
Fourth milestone (M4) of the Composed-1 cross-group commit-time guard per
docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md§M4.Makes the M3 apply-time gate non-disruptive to clients: instead of surfacing
ErrComposed1ViolationorErrComposed1VersionGCdback to the adapter (and thus the application), the coordinator absorbs the sentinel, re-reads the catalog version, and re-issues the txn once against the post-shift owning group.What changes
ShardedCoordinator.Dispatch— pin ObservedRouteVersion at entryEvery transactional
OperationGroupthat arrives withObservedRouteVersion == 0(the M1 unpinned sentinel) gets stamped withengine.Version(). This makes the M3 gate effective by default — every txn flowing through the sharded coordinator is now gate-eligible without each adapter rediscovering the §4.1 contract.A caller that supplies a non-zero
ObservedRouteVersion(a future adapter migration target managing the version itself) still wins over the Dispatch-entry pin.New
dispatchTxnWithComposed1Retry— retry-once on Composed-1 sentinelsWraps the existing single-attempt
dispatchTxnin a loop bounded atcomposed1RetryAttempts = 1(initial + 1 retry = 2 attempts).On the sentinel:
engine.Version()and stamp the fresh value on the OperationGroupStartTS/CommitTSso the next attempt allocates a fresh HLC-correct pair vianextStartTS(OCC invariants requirecommitTS > startTSunder the same HLC observation)groupMutationson the next dispatchTxn pass re-runsrouter.ResolveGroup, so a key whose owning group changed naturally lands on the new group's FSMBeyond one retry the sentinel surfaces unchanged so a wrapping adapter retry harness (or the client) can decide whether to keep trying — preventing the coordinator from spinning when catalog churn outpaces the dispatch.
Caller audit (loop directive's semantic-change check)
ShardedCoordinator.Dispatchis the entry point every adapter calls. The change:grep -rn 'ErrComposed1' adapter/returns no hits — no adapter relies on a faster-than-bounded surfacing of these sentinels (M3 only landed yesterday).Tests (5 new)
TestShardedCoordinator_RetriesOnComposed1Violation— happy path: first attempt fails, second succeeds, caller sees success.TestShardedCoordinator_RetriesOnComposed1VersionGCd— same retry path drives both sentinels.TestShardedCoordinator_PersistentComposed1ViolationSurfaces— retry budget bounds the loop: persistent failure surfaces sentinel after exactly 2 total Commit calls.TestShardedCoordinator_PinsObservedRouteVersionAtDispatchEntry— Dispatch pins fromengine.Version()when caller leaves it at zero.TestShardedCoordinator_HonorsCallerPinnedObservedRouteVersion— caller-supplied non-zero value wins (§4.1 contract preserved).Verification
go build ./...— cleango test -race -count=1 ./kv— 10.3s, passmake lint— 0 issuesSelf-review (5 lenses)
engine.Version()takes the existing read lock per call.engine.Version()read at entry. Sentinel case: one extradispatchTxnround-trip.MoveRange/ cross-groupSplitRangebetween BeginTxn and Commit is now transparent to the client: coordinator absorbs the gate's rejection, re-routes, succeeds on the new owner.Test plan
go test -race ./kvmake lintResolves
The M4 row in the Composed-1 design doc.
Summary by CodeRabbit
New Features
Tests