Skip to content

kv(composed1): M4 — Dispatch retry on ErrComposed1Violation / ErrComposed1VersionGCd#900

Open
bootjp wants to merge 10 commits into
mainfrom
feat/composed1-m4-retry
Open

kv(composed1): M4 — Dispatch retry on ErrComposed1Violation / ErrComposed1VersionGCd#900
bootjp wants to merge 10 commits into
mainfrom
feat/composed1-m4-retry

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 1, 2026

Summary

Fourth milestone (M4) of the Composed-1 cross-group commit-time guard per docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md §M4.

Makes the M3 apply-time gate non-disruptive to clients: instead of surfacing ErrComposed1Violation or ErrComposed1VersionGCd back to the adapter (and thus the application), the coordinator absorbs the sentinel, re-reads the catalog version, and re-issues the txn once against the post-shift owning group.

What changes

ShardedCoordinator.Dispatch — pin ObservedRouteVersion at entry

Every transactional OperationGroup that arrives with ObservedRouteVersion == 0 (the M1 unpinned sentinel) gets stamped with engine.Version(). This makes the M3 gate effective by default — every txn flowing through the sharded coordinator is now gate-eligible without each adapter rediscovering the §4.1 contract.

A caller that supplies a non-zero ObservedRouteVersion (a future adapter migration target managing the version itself) still wins over the Dispatch-entry pin.

New dispatchTxnWithComposed1Retry — retry-once on Composed-1 sentinels

Wraps the existing single-attempt dispatchTxn in a loop bounded at composed1RetryAttempts = 1 (initial + 1 retry = 2 attempts).

On the sentinel:

  • Re-read engine.Version() and stamp the fresh value on the OperationGroup
  • Clear StartTS / CommitTS so the next attempt allocates a fresh HLC-correct pair via nextStartTS (OCC invariants require commitTS > startTS under the same HLC observation)
  • groupMutations on the next dispatchTxn pass re-runs router.ResolveGroup, so a key whose owning group changed naturally lands on the new group's FSM

Beyond one retry the sentinel surfaces unchanged so a wrapping adapter retry harness (or the client) can decide whether to keep trying — preventing the coordinator from spinning when catalog churn outpaces the dispatch.

Caller audit (loop directive's semantic-change check)

ShardedCoordinator.Dispatch is the entry point every adapter calls. The change:

  • Surfaces success more often (turning a previous error into a successful commit)
  • Delays sentinel surfacing by a bounded retry budget

grep -rn 'ErrComposed1' adapter/ returns no hits — no adapter relies on a faster-than-bounded surfacing of these sentinels (M3 only landed yesterday).

Tests (5 new)

  • TestShardedCoordinator_RetriesOnComposed1Violation — happy path: first attempt fails, second succeeds, caller sees success.
  • TestShardedCoordinator_RetriesOnComposed1VersionGCd — same retry path drives both sentinels.
  • TestShardedCoordinator_PersistentComposed1ViolationSurfaces — retry budget bounds the loop: persistent failure surfaces sentinel after exactly 2 total Commit calls.
  • TestShardedCoordinator_PinsObservedRouteVersionAtDispatchEntry — Dispatch pins from engine.Version() when caller leaves it at zero.
  • TestShardedCoordinator_HonorsCallerPinnedObservedRouteVersion — caller-supplied non-zero value wins (§4.1 contract preserved).

Verification

  • go build ./... — clean
  • go test -race -count=1 ./kv — 10.3s, pass
  • make lint — 0 issues

Self-review (5 lenses)

  1. Data loss — retry can only convert a previously-rejected commit into success OR surface the same error after a bounded loop. Cannot lose a committed write (each attempt allocates fresh commitTS; FSM's OCC validation rejects duplicates).
  2. Concurrency — no new locks. Retry runs serially on dispatcher goroutine; engine.Version() takes the existing read lock per call.
  3. Performance — common case: single iteration + one branch + one engine.Version() read at entry. Sentinel case: one extra dispatchTxn round-trip.
  4. Data consistency — closes §M4. With M1+M2+M3+M4 landed, a MoveRange / cross-group SplitRange between BeginTxn and Commit is now transparent to the client: coordinator absorbs the gate's rejection, re-routes, succeeds on the new owner.
  5. Test coverage — 5 unit tests cover both sentinels, retry budget, Dispatch-entry pin, and caller-pin precedence. Existing tests pass unchanged.

Test plan

  • go test -race ./kv
  • make lint
  • (Follow-up) M5 — Jepsen workload (route-shuffle generator for the DynamoDB suite, find no G1c anomaly under concurrent MoveRange).

Resolves

The M4 row in the Composed-1 design doc.

Summary by CodeRabbit

  • New Features

    • Transaction dispatch now performs a single bounded retry for specific transient composed-1 errors, preserving previous commit timestamps; caller-supplied start timestamps suppress the retry. Write-only transactions may auto-pin route version at dispatch, while multi-shard commit-phase secondary requests intentionally omit the pinned route version.
  • Tests

    • Expanded test coverage for composed-1 retry semantics, PrevCommitTS preservation, caller-supplied StartTS behavior, route-version pinning rules, and secondary commit behavior.

…osed1VersionGCd

Fourth milestone of the Composed-1 cross-group commit-time guard per
docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md
§M4.  Makes the M3 apply-time gate non-disruptive to clients: instead
of surfacing ErrComposed1Violation or ErrComposed1VersionGCd back to
the adapter (and thus the application), the coordinator absorbs the
sentinel, re-reads the catalog version, and re-issues the txn once
against the post-shift owning group.

Two changes:

  * Pin reqs.ObservedRouteVersion at Dispatch entry.  ShardedCoordinator
    .Dispatch now stamps the engine's current catalog version onto
    every transactional OperationGroup that arrived with the zero
    sentinel.  This makes the M3 gate effective by default — every
    txn flowing through the sharded coordinator is now gate-eligible
    without each adapter rediscovering the design doc §4.1 contract.
    A caller that supplied a non-zero ObservedRouteVersion (a future
    adapter migration target managing the version itself) still wins
    over the Dispatch-entry pin.

  * Retry-once on Composed-1 sentinels.  dispatchTxnWithComposed1Retry
    wraps the existing single-attempt dispatchTxn in a loop bounded
    at composed1RetryAttempts = 1 (initial + 1 retry = 2 attempts).
    On the sentinel:

      - re-read engine.Version() and stamp the fresh value on the
        OperationGroup
      - clear StartTS / CommitTS so the next attempt allocates a
        fresh HLC-correct pair via nextStartTS (the OCC invariants
        require commitTS > startTS under the same HLC observation;
        reusing the stale pair would risk a startTS that no longer
        dominates the post-shift shard's applied commitTS)
      - groupMutations on the next dispatchTxn pass re-runs
        router.ResolveGroup, so a key whose owning group changed
        since the last attempt naturally lands on the new group's
        FSM

    Beyond one retry the sentinel surfaces unchanged so a wrapping
    adapter retry harness (or the client) can decide whether to
    keep trying — preventing the coordinator from spinning when
    catalog churn outpaces the dispatch.

Caller audit (the loop directive's semantic-change check):
ShardedCoordinator.Dispatch is the entry point every adapter calls.
The change SURFACES SUCCESS more often (turning a previous error
into a successful commit) and surfaces the sentinel after a bounded
retry budget rather than immediately.  No adapter relies on a
faster-than-bounded surfacing of these sentinels (M3 only landed in
this session and adapters did not migrate), so the relaxed
surfacing is safe.  Verified: grep -rn 'ErrComposed1' adapter/
returns no hits.

Tests (kv/sharded_coordinator_composed1_retry_test.go):

  * TestShardedCoordinator_RetriesOnComposed1Violation —
    happy-path: first attempt returns Composed1Violation, second
    succeeds; caller sees success.
  * TestShardedCoordinator_RetriesOnComposed1VersionGCd — same
    retry path drives both sentinels.
  * TestShardedCoordinator_PersistentComposed1ViolationSurfaces —
    retry budget bounds the loop: persistent failure surfaces the
    sentinel after exactly 2 total Commit calls.
  * TestShardedCoordinator_PinsObservedRouteVersionAtDispatchEntry
    — Dispatch pins reqs.ObservedRouteVersion=engine.Version()
    when caller leaves it at zero.
  * TestShardedCoordinator_HonorsCallerPinnedObservedRouteVersion
    — caller-supplied non-zero ObservedRouteVersion survives the
    Dispatch-entry pin (§4.1 contract preserved).

Verification:
  * go build ./...                  — clean
  * go test -race -count=1 ./kv     — 10.3s, pass
  * make lint                       — 0 issues

Self-review (5 lenses):

  1. Data loss — the retry can only convert a previously-rejected
     commit into a successful one OR surface the same error after
     a bounded loop.  Cannot lose a committed write.  Cannot
     duplicate a write because each attempt's commitTS is fresh
     and the FSM's OCC validation already rejects duplicates.
  2. Concurrency — no new locks.  Retry runs serially on the
     dispatcher goroutine; engine.Version() takes the existing
     read lock per call.  No new lock ordering.
  3. Performance — for the common case (no sentinel) the retry
     loop is a single iteration with one extra branch + one
     engine.Version() read at entry.  For the sentinel case the
     cost is one extra dispatchTxn round-trip — design doc says
     this is acceptable; the alternative is a client-visible
     error.
  4. Data consistency — closes the §M4 row of the design doc.
     With M1 (plumbing), M2 (ring), M3 (gate), and M4 (retry) all
     landed, a MoveRange / cross-group SplitRange between
     BeginTxn and Commit is now transparent to the client: the
     coordinator absorbs the gate's rejection, re-routes, and
     succeeds on the new owner.
  5. Test coverage — 5 new unit tests cover both sentinels,
     the retry budget, the Dispatch-entry pin, and the caller-pin
     precedence.  Existing dispatchTxn tests pass unchanged
     (their requests carry ObservedRouteVersion=0 but the test
     coordinator's engine has no routes beyond v=0, so the gate
     short-circuits exactly as before).

Next milestone: M5 — Jepsen workload (route-shuffle generator for
the DynamoDB suite, find no G1c anomaly under concurrent
MoveRange).  Separate PR per the design doc §6 milestone breakdown.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

@claude review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 34 minutes and 23 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3b644a3c-1bc9-4cc0-a93f-94f14dbdba35

📥 Commits

Reviewing files that changed from the base of the PR and between d848767 and e08d44f.

📒 Files selected for processing (4)
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_composed1_retry_test.go
  • kv/sharded_coordinator_txn_test.go
  • kv/txn_errors.go
📝 Walkthrough

Walkthrough

Adds a bounded single-retry Composed-1 loop to ShardedCoordinator.Dispatch: on composed-1 sentinel errors it refreshes ObservedRouteVersion, clears/re-allocates timestamps, and retries the transaction once. Also stops propagating ObservedRouteVersion into 2PC commit-phase secondary requests. Tests validate retry budgeting, PrevCommitTS preservation, pinning and StartTS rules, read-write exceptions, and the 2PC secondary drop.

Changes

Composed-1 Retry Implementation and Testing

Layer / File(s) Summary
Retry constant and error classifier; Dispatch flag
kv/sharded_coordinator.go
Adds composed1RetryAttempts, records whether StartTS was caller-supplied, and introduces isComposed1RetryableError.
Composed-1 retry wrapper & integration
kv/sharded_coordinator.go
Routes transactional OperationGroups through dispatchTxnWithComposed1Retry, adds maybeAutoPinObservedRouteVersion, and implements a single-retry loop that re-stamps ObservedRouteVersion, clears CommitTS, re-allocates StartTS, and preserves PrevCommitTS for the retry; read-write txns and caller-supplied StartTS suppress retry.
Drop ObservedRouteVersion on 2PC secondary commit
kv/sharded_coordinator.go
dispatchMultiShardTxn/commitSecondaryTxns no longer forward observedRouteVersion; secondary commit pb.Requests omit ObservedRouteVersion (zero).
Test stubs and core retry behavior tests
kv/sharded_coordinator_composed1_retry_test.go
Adds composed1RetryingTransactional and persistentComposed1Failer stubs and tests that composed-1 sentinels trigger exactly one retry or surface after budget exhaustion.
ObservedRouteVersion pinning, StartTS and PrevCommitTS tests
kv/sharded_coordinator_composed1_retry_test.go
Adds tests asserting dispatch-time auto-pinning for write-only txns, preservation of caller-supplied pin, suppression of auto-pin/retry for read-write txns or caller-supplied StartTS, and preservation of PrevCommitTS across composed-1 retry attempts; also tests that 2PC secondary COMMIT carries ObservedRouteVersion=0.
Cross-shard 2PC test update
kv/sharded_coordinator_txn_test.go
Refines test and assertions so PREPAREs and primary COMMIT must carry the pinned ObservedRouteVersion while secondary COMMIT must drop it.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • bootjp/elastickv#895: Introduces ErrComposed1Violation and ErrComposed1VersionGCd sentinel errors that this PR's retry logic detects and handles.
  • bootjp/elastickv#796: Adds PrevCommitTS-related one-phase idempotency that this PR preserves and forwards through the retry path.
  • bootjp/elastickv#881: Earlier changes to ObservedRouteVersion plumbing that this PR selectively stops propagating into 2PC secondary commits.

Poem

🐰 I hop where routes and versions meet,
One retry bound — I keep transactions neat.
Stamp, clear, and try once more,
Keep PrevCommit safe as before.
For secondaries I let the pin retreat.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: implementing dispatch retry on two specific Composed-1 errors (ErrComposed1Violation and ErrComposed1VersionGCd) as part of M4 milestone.
Docstring Coverage ✅ Passed Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/composed1-m4-retry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (3cf4e90):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 4m 11s —— View job


PR Review: kv(composed1): M4 — Dispatch retry on ErrComposed1Violation / ErrComposed1VersionGCd

  • Gather context and understand the request
  • Read changed files and diff
  • Review kv/sharded_coordinator.go changes
  • Review kv/sharded_coordinator_composed1_retry_test.go
  • Post findings

The M4 design is sound and the implementation is largely clean. One correctness gap, two minor issues, and a couple of nits.


kv/sharded_coordinator.go

[Medium] PrevCommitTS not cleared on retry — can produce ErrTxnDedupRequiresSingleShard instead of a successful re-route

dispatchTxnWithComposed1Retry clears StartTS and CommitTS on retry but does not clear reqs.PrevCommitTS (sharded_coordinator.go:519-526).

The scenario where this matters:

  1. A multi-key txn has all its keys on one shard → single-shard path. A previous ambiguous attempt set PrevCommitTS for option-2 dedup.
  2. While retrying, a SplitRange fires and the keys land on two different shards.
  3. M3 gate fires → ErrComposed1Violation.
  4. M4 retry fires. groupMutations re-routes and now returns two group IDs → dispatchMultiShardTxn.
  5. dispatchMultiShardTxn sees prevCommitTS != 0 and immediately returns ErrTxnDedupRequiresSingleShard (line 691).

The dedup probe is invalidated by the M3 rejection (the first attempt never committed — the gate fires at apply time). PrevCommitTS should be zeroed in the retry block alongside StartTS/CommitTS:

reqs.StartTS = 0
reqs.CommitTS = 0
reqs.PrevCommitTS = 0  // ← add this

Fix this →

[Minor] c.clock == nil leaves reqs.StartTS = 0 for retry attempt

The retry block sets reqs.StartTS = 0 then only re-allocates if c.clock != nil (lines 520-526). When c.clock == nil, the next loop iteration calls dispatchTxn(ctx, 0, 0, ...). resolveTxnCommitTS(0, 0) then calls nextTxnTSAfter(0) → returns 1, so the FSM receives a ts=0 request. This is technically valid (commitTS=1 > startTS=0), but it's inconsistent with the original Dispatch-level nextStartTS that accounts for maxLatestCommitTS. No test covers this path.

[Nit] Unreachable tail return

The return nil, errors.WithStack(ErrInvalidRequest) at line 530 is unreachable (the loop body always returns on attempt == composed1RetryAttempts). The comment explains this, and it satisfies the compiler. Acceptable as-is, but it could be avoided with:

// Replace:
for attempt := 0; attempt <= composed1RetryAttempts; attempt++ { ... }
return nil, errors.WithStack(ErrInvalidRequest)

// With a final-attempt flag or do { ... } while pattern,
// or just leave it — the comment is clear enough.

[Design note] reqs mutation is caller-visible

dispatchTxnWithComposed1Retry mutates reqs.ObservedRouteVersion, reqs.StartTS, and reqs.CommitTS in place. After Dispatch returns (success or error), the caller's OperationGroup fields are permanently altered. Current adapters don't reuse the struct after Dispatch returns, so this is safe in practice, but the contract is implicit. A doc comment on Dispatch (or on OperationGroup) clarifying "fields may be updated in place" would be useful before future callers discover this empirically.


kv/sharded_coordinator_composed1_retry_test.go

[Minor] Dead goroutine in TestShardedCoordinator_RetriesOnComposed1Violation

Lines 68-74 start a goroutine with an empty body:

go func() {
    // The retry happens synchronously inside Dispatch; advance the
    // catalog before the second attempt would re-read it. ...
}()

The goroutine does nothing and the comment describes intent that was never implemented. Since the test already validates the re-read contract via the assertion at line 91-92 ("the contract is that we RE-READ, not that we always change"), the goroutine adds confusion without adding coverage. Remove it.

Fix this →

[Nit] composed1RetryingTransactional.observedSeen concurrency

observedSeen []uint64 is appended in Commit without a mutex. In all current tests every Commit call happens serially inside Dispatch, so there's no race. If a future test drives concurrent Commit calls through this stub, -race would flag it. Not a problem now, but worth noting for a follow-up if the stub is reused.


Follow-up concern (not blocking M4)

Secondary commit path is not Composed-1 aware: commitSecondaryWithRetry retries up to txnSecondaryCommitRetryAttempts=3 with the same req (including the original observedRouteVersion). If a secondary shard's catalog has shifted, all three retries will get ErrComposed1Violation, fall back to the lock resolver with a warning, and the lock resolver will inherit the same problem. This is a pre-existing design gap in the secondary path — the M3 gate makes it more reachable. Flagging for M5 / lock-resolver hardening.


Summary

The M4 design intent is correct and well-tested for the primary cases. The one actionable fix before merge is clearing PrevCommitTS = 0 in the retry block. The empty goroutine is cosmetic but should be cleaned up. Everything else is informational.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3cf4e9069d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/sharded_coordinator.go Outdated
Comment on lines +518 to +519
reqs.StartTS = 0
reqs.CommitTS = 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve the original read timestamp on route retry

When a transaction has already read data (for example Redis MULTI/EXEC populates ReadKeys and returns values from OperationGroup.StartTS), this retry path clears the original read timestamp and allocates a fresh one before re-issuing the same write set. The FSM/store OCC check rejects only versions with latest > startTS, so any concurrent write that lands after the client's original read snapshot but before this fresh retry timestamp is no longer detected; the retry can commit decisions based on stale reads after a Composed-1 route-shift error. The retry needs to keep the transaction's original read timestamp for OCC validation (or force the caller to recompute the transaction) rather than silently advancing it.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements an M4 Composed-1 retry loop in ShardedCoordinator to handle transaction routing shifts by re-reading the catalog version and re-issuing the transaction once. The review feedback highlights two critical issues: first, transparently retrying transactions with a bumped start timestamp is unsafe if the transaction contains read keys, as it can lead to stale reads and serializability violations; second, the coordinator should unconditionally allocate a new start timestamp on retry instead of checking for a non-nil clock, as leaving the timestamp at zero violates MVCC invariants.

Comment thread kv/sharded_coordinator.go
Comment on lines +498 to +503
if !isComposed1RetryableError(err) {
return resp, err
}
if attempt == composed1RetryAttempts {
return resp, err
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Transparently retrying a transaction with a bumped StartTS is only safe if the transaction is write-only (len(reqs.ReadKeys) == 0).

If the transaction has read keys, the client/adapter performed those reads at the original StartTS. If the coordinator transparently bumps StartTS to a newer timestamp on retry, any concurrent writes to those read keys that committed between the old StartTS and the new StartTS will be silently ignored by the FSM's OCC validation (since the FSM only checks for conflicts after the new StartTS). This allows the transaction to commit using stale read values, violating strict serializability (SSI) and introducing G1c/write-skew anomalies.

Per the general rule:

In transactional read-modify-write operations, set the transaction's start timestamp to the read timestamp to ensure a consistent snapshot and prevent race conditions.

We must propagate the ErrComposed1Violation or ErrComposed1VersionGCd error back to the caller if len(reqs.ReadKeys) > 0 so that the entire transaction can be retried from the beginning (including re-reading the keys at a fresh timestamp).

Suggested change
if !isComposed1RetryableError(err) {
return resp, err
}
if attempt == composed1RetryAttempts {
return resp, err
}
if !isComposed1RetryableError(err) {
return resp, err
}
if len(reqs.ReadKeys) > 0 {
return resp, err
}
if attempt == composed1RetryAttempts {
return resp, err
}
References
  1. In transactional read-modify-write operations, set the transaction's start timestamp to the read timestamp to ensure a consistent snapshot and prevent race conditions.

Comment thread kv/sharded_coordinator.go Outdated
Comment on lines +520 to +526
if c.clock != nil {
startTS, allocErr := c.nextStartTS(ctx, reqs.Elems)
if allocErr != nil {
return resp, allocErr
}
reqs.StartTS = startTS
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The check if c.clock != nil prevents allocating a new StartTS on retry when c.clock is nil. However, c.nextStartTS is fully capable of handling a nil clock (it falls back to maxTS + 1 at line 1031). Leaving reqs.StartTS at 0 on retry is incorrect because dispatchTxn does not allocate a StartTS if it is 0, which would cause the retried transaction to run with StartTS = 0 (violating MVCC/timestamp invariants). We should call c.nextStartTS unconditionally, just like Dispatch does at line 450.

		startTS, allocErr := c.nextStartTS(ctx, reqs.Elems)
		if allocErr != nil {
			return resp, allocErr
		}
		reqs.StartTS = startTS
References
  1. In an MVCC system, all operations within a single transaction must share the same start timestamp to ensure atomicity and a consistent snapshot view.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
kv/sharded_coordinator_composed1_retry_test.go (1)

68-74: 💤 Low value

Remove the empty goroutine or convert to a regular comment block.

This goroutine body contains only comments and executes nothing. If it's a placeholder for future work, consider adding a TODO or removing it entirely to avoid confusion.

Suggested fix (remove empty goroutine)
-	// Between the first and second dispatch attempts, simulate the
-	// route catalog moving forward — M4's retry path must re-read
-	// the engine version and stamp the new one on the retry's
-	// pb.Request so the FSM sees the post-shift observed version.
-	go func() {
-		// The retry happens synchronously inside Dispatch; advance the
-		// catalog before the second attempt would re-read it.  In a
-		// real cluster the advance is driven by a concurrent
-		// CatalogWatcher fan-out; here we just race against the
-		// in-process Commit to model the same observable effect.
-	}()
+	// NOTE: In a real cluster the route catalog would advance between
+	// the first and second dispatch attempts via a concurrent
+	// CatalogWatcher fan-out. This test verifies the retry path
+	// re-reads engine.Version(); the assertions below confirm both
+	// attempts record an ObservedRouteVersion (even when unchanged).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kv/sharded_coordinator_composed1_retry_test.go` around lines 68 - 74, Remove
the no-op goroutine in the test (the go func() { /* ...comments... */ }() block)
because it executes nothing; either delete the entire goroutine invocation or
replace it with a regular comment/TODO block so intent is clear and the test
contains no empty concurrent construct. Locate the empty anonymous goroutine in
kv/sharded_coordinator_composed1_retry_test.go and update it accordingly to
eliminate confusion.
kv/sharded_coordinator.go (1)

469-531: 💤 Low value

On allocation failure, return nil instead of the stale response from the prior failed attempt.

At line 523, when nextStartTS fails, resp is from the previous dispatchTxn call that failed with a Composed-1 sentinel. Returning this stale response alongside an unrelated error may confuse callers inspecting both values.

Suggested fix
 		if c.clock != nil {
 			startTS, allocErr := c.nextStartTS(ctx, reqs.Elems)
 			if allocErr != nil {
-				return resp, allocErr
+				return nil, allocErr
 			}
 			reqs.StartTS = startTS
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kv/sharded_coordinator.go` around lines 469 - 531, In
dispatchTxnWithComposed1Retry, when nextStartTS returns an error (allocErr) you
must not return the stale resp from the prior dispatchTxn; instead return nil
and the allocation error. Update the error path inside the retry loop (the block
that calls c.nextStartTS) to return (nil, allocErr) rather than (resp,
allocErr). This touches dispatchTxnWithComposed1Retry and the variables resp,
allocErr and ensures callers don't receive a stale CoordinateResponse alongside
an unrelated allocation error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@kv/sharded_coordinator_composed1_retry_test.go`:
- Around line 68-74: Remove the no-op goroutine in the test (the go func() { /*
...comments... */ }() block) because it executes nothing; either delete the
entire goroutine invocation or replace it with a regular comment/TODO block so
intent is clear and the test contains no empty concurrent construct. Locate the
empty anonymous goroutine in kv/sharded_coordinator_composed1_retry_test.go and
update it accordingly to eliminate confusion.

In `@kv/sharded_coordinator.go`:
- Around line 469-531: In dispatchTxnWithComposed1Retry, when nextStartTS
returns an error (allocErr) you must not return the stale resp from the prior
dispatchTxn; instead return nil and the allocation error. Update the error path
inside the retry loop (the block that calls c.nextStartTS) to return (nil,
allocErr) rather than (resp, allocErr). This touches
dispatchTxnWithComposed1Retry and the variables resp, allocErr and ensures
callers don't receive a stale CoordinateResponse alongside an unrelated
allocation error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1f9d7b48-3ca0-4a71-a58f-fa7c79746643

📥 Commits

Reviewing files that changed from the base of the PR and between 6f2941b and 3cf4e90.

📒 Files selected for processing (2)
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_composed1_retry_test.go

… (gemini CRITICAL/HIGH + codex P1 on PR #900)

Two converging findings from PR #900's automated review pass, both
correctly identifying that the M4 transparent retry can introduce
strict-serialisability violations:

  * gemini CRITICAL + codex P1 — Transparently retrying a txn with a
    bumped StartTS is SAFE only when the txn is write-only.  When
    the txn has read keys (Redis MULTI/EXEC, DynamoDB
    TransactGetItems → TransactWrite, etc.), the client already
    performed those reads at the original StartTS.  Bumping StartTS
    on retry would let any concurrent write that committed between
    the old and new timestamps go undetected by the FSM's OCC check
    (which only rejects versions with `latest > startTS`), allowing
    the retry to commit decisions based on stale reads — an SSI /
    G1c violation.

    Fix: surface the sentinel WITHOUT retry when
    `len(reqs.ReadKeys) > 0`.  The caller / adapter is then
    responsible for re-executing the entire txn (including
    re-reading the keys at a fresh timestamp).  Write-only txns
    (no reads to invalidate) remain safe to retry — the M3-to-M5
    plumbing is still client-transparent for the most common
    cross-group SplitRange / MoveRange case (operator-frequency
    catalog churn, write-only adapter patterns).

  * gemini HIGH — The retry path gated StartTS allocation on
    `if c.clock != nil`, but `nextStartTS` already handles a nil
    clock internally (falls back to `maxTS+1`).  Leaving
    `reqs.StartTS = 0` on retry was incorrect because `dispatchTxn`
    does NOT re-allocate when StartTS is zero — the retry would
    run with StartTS=0, violating MVCC invariants in the
    nil-clock test fixture (and any future nil-clock test seam
    that exercises the retry path).

    Fix: allocate StartTS unconditionally on retry via
    `nextStartTS`.  Matches the unconditional call pattern
    Dispatch itself uses at line 450 of the same file.

Caller audit (the loop directive's semantic-change check): the
behaviour change is that ErrComposed1Violation /
ErrComposed1VersionGCd now surfaces immediately for read-write
txns instead of after a one-attempt retry.  `grep -rn 'ErrComposed1'
adapter/` returns no hits — no adapter currently distinguishes
these sentinels (M3 only landed yesterday and adapters did not
migrate to read them), so the earlier surfacing is semantics-safe.
When an adapter does start matching on these sentinels, the
new behaviour is what it would want (the entire txn needs
re-execution, not a coordinator-side bump).

Regression test (added before the fix per the project convention):

  * TestShardedCoordinator_DoesNotRetryReadWriteTxnOnComposed1 —
    constructs a txn with `ReadKeys` non-empty, stubs the
    transactional Commit to fail with ErrComposed1Violation,
    asserts:
      1. the sentinel surfaces immediately (errors.Is check)
      2. exactly 1 Commit attempt fires (no retry)
    Verified failing on the parent commit (the test received 2
    Commit calls because the previous code retried) and passing
    on this commit.

Verification: go test -race -count=1 ./kv → pass.  make lint → 0
issues.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Both critical-class findings addressed:

gemini CRITICAL / codex P1 (line 503/519 — bumped StartTS breaks SSI for read-write txns): Now surfaces ErrComposed1Violation / ErrComposed1VersionGCd IMMEDIATELY (no retry) when len(reqs.ReadKeys) > 0. The client already performed reads at the original StartTS — silently bumping it would let concurrent writes between old/new timestamps escape OCC validation, violating SSI. Write-only txns (no reads to invalidate) remain safe to retry, so M4 is still client-transparent for the most common cross-group SplitRange / MoveRange case.

gemini HIGH (line 526 — nil-clock guard left StartTS=0): nextStartTS already handles nil clock internally; removed the if c.clock != nil gate so the retry allocates StartTS unconditionally, matching the unconditional call pattern Dispatch itself uses.

Caller audit (loop directive): grep -rn 'ErrComposed1' adapter/ → no hits. M3 only landed yesterday and no adapter matches on these sentinels, so the earlier surfacing for read-write txns is semantics-safe. When adapters do start matching, the new behaviour is what they'd want (entire txn re-execution, not a coordinator-side bump).

Regression test (added before fix per convention): TestShardedCoordinator_DoesNotRetryReadWriteTxnOnComposed1 — read-write txn with ErrComposed1Violation surfaces immediately, exactly 1 Commit call. Verified failing on parent / passing on fix.

go test -race ./kv pass / make lint 0 issues.

@claude review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (66ad5b0):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 30s —— View job


PR Review: Follow-up fixes (gemini CRITICAL/HIGH + codex P1)

  • Gather context and understand the request
  • Read current state of changed files
  • Verify gemini CRITICAL / codex P1 fix (read-write txn skip)
  • Verify gemini HIGH fix (unconditional nextStartTS)
  • Check regression test correctness
  • Check remaining open items from prior review (PrevCommitTS, empty goroutine)
  • Post findings

The two critical/high fixes are correct. Two items from the prior review remain open.


✅ gemini CRITICAL / codex P1 — correct

if len(reqs.ReadKeys) > 0 { return resp, err } at line 516 surfaces the sentinel immediately for read-write txns. The check sits after isComposed1RetryableError, so it only fires on the two sentinel errors, not on ordinary failures. The regression test TestShardedCoordinator_DoesNotRetryReadWriteTxnOnComposed1 follows the project convention (failing before fix, passing after) and correctly asserts both the error identity and the single-attempt count.

✅ gemini HIGH — correct

The if c.clock != nil gate is gone. nextStartTS is called unconditionally, matching the call at Dispatch entry (line 450). The commit message correctly notes that nextStartTS handles the nil-clock case internally.


⚠️ PrevCommitTS still not cleared on retry — prior medium finding still open

sharded_coordinator.go:544 clears CommitTS and re-allocates StartTS, but PrevCommitTS is not zeroed. The scenario from the prior review is unchanged and now scoped to write-only txns (the only ones that reach the retry path after the read-write fix):

  1. Write-only txn, single-shard, PrevCommitTS set for at-most-once dedup.
  2. M3 gate fires → ErrComposed1Violation (attempt never committed).
  3. M4 retry fires. Catalog has since shifted; groupMutations re-routes to two groups → dispatchMultiShardTxn.
  4. dispatchMultiShardTxn line 713: if prevCommitTS != 0 { return nil, ErrTxnDedupRequiresSingleShard }.

The dedup sentinel is wrong here — the original attempt never committed (the FSM rejected it at apply time), so PrevCommitTS's "you already committed this" signal is stale. The caller gets ErrTxnDedupRequiresSingleShard instead of a successful re-route.

Fix: add reqs.PrevCommitTS = 0 alongside reqs.CommitTS = 0 at line 544:

reqs.CommitTS = 0
reqs.PrevCommitTS = 0  // ← add: M3 gate means the original attempt never committed,
                       //         so the dedup signal is no longer valid on retry

Fix this →

⚠️ Empty goroutine — prior minor finding still open

TestShardedCoordinator_RetriesOnComposed1Violation lines 68–74 still contain an empty go func() { /* comments */ }(). The goroutine executes nothing. With -race this won't flag (it's a no-op), but it's misleading — the comment describes intent that isn't implemented, and a future reader may assume this goroutine is doing concurrent catalog mutation to model the race. Remove the goroutine invocation; the comment text can stay as a regular block comment if the intent is worth preserving.

Fix this →

Nit — stale resp returned on allocErr

sharded_coordinator.go:547: when nextStartTS fails, the code returns resp, allocErr. At that point resp is the response from the failed dispatchTxn call (the one that returned the Composed-1 sentinel). In practice resp is nil when dispatchTxn fails, so this is harmless, but returning nil, allocErr is more explicit:

if allocErr != nil {
    return nil, allocErr  // ← resp here is stale from the failed attempt
}

Summary

The two critical/high correctness findings are properly fixed with regression tests. The medium PrevCommitTS finding and the empty goroutine nit from the prior review remain. The PrevCommitTS fix is a one-liner and should go in before merge. Everything else is good.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66ad5b0e5c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/sharded_coordinator.go Outdated
Comment on lines +489 to +490
if c.engine != nil && reqs.ObservedRouteVersion == 0 {
reqs.ObservedRouteVersion = c.engine.Version()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Pin observed routes at the transaction read time

When a caller already supplies a transaction StartTS/ReadKeys but leaves ObservedRouteVersion unset, this stamps the catalog version at EXEC/commit dispatch time rather than when the read set was captured. For example, Redis txnContext.commit builds an OperationGroup with StartTS and ReadKeys but no ObservedRouteVersion; if a route moves between the earlier MULTI read and this Dispatch call, the request is routed and verified against the new version, so verifyComposed1 will not detect the route shift the guard is meant to catch. This needs to be pinned when the transaction starts/reads, or not auto-filled here for already-started read-write transactions.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Codex P1 addressed:

The Dispatch-entry auto-pin no longer fires for read-write txns. Per design doc §4.1, pinning must happen at BeginTxn (read-set capture); for read-write txns Dispatch is strictly LATER than BeginTxn, so the coordinator cannot retroactively reconstruct the route view the reads were taken against. Stamping the catalog version at EXEC time would associate post-shift routes with pre-shift reads — exactly the gate-bypass codex flagged for Redis MULTI/EXEC.

Fix: gate the auto-pin on len(reqs.ReadKeys) == 0. Write-only txns continue to be auto-pinned (Dispatch == BeginTxn equivalently). Read-write callers must pin ObservedRouteVersion themselves at BeginTxn time — until they migrate, the observedVer=0 sentinel surfaces to M3 which short-circuits, matching the pre-feature behaviour byte-for-byte.

Caller audit (loop directive): grep -rn 'ObservedRouteVersion' adapter/ → no hits. No adapter currently sets or reads the field, so the auto-pin shrink is invisible.

Regression test (added before fix per convention): TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn — read-write OperationGroup with ObservedRouteVersion=0 against engine pinned at v=5 must surface observedVer=0 (auto-pin must NOT fire). Verified failing on parent / passing on fix.

go test -race ./kv pass / make lint 0 issues.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 58s —— View job


PR Review: Codex P1 fix — gate auto-pin on write-only txns

  • Gather context and understand the request
  • Read current state of changed files
  • Verify Codex P1 fix (auto-pin gated on len(reqs.ReadKeys) == 0)
  • Verify regression test correctness
  • Check remaining open items (PrevCommitTS, empty goroutine)
  • Post findings

⚠️ Claimed fix is not yet in the branch

The trigger comment describes gating the auto-pin on len(reqs.ReadKeys) == 0 and adding TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn, but neither change appears in the current HEAD (66ad5b0). Lines 489–491 still read:

if c.engine != nil && reqs.ObservedRouteVersion == 0 {
    reqs.ObservedRouteVersion = c.engine.Version()
}

without the len(reqs.ReadKeys) == 0 guard. The regression test is also absent from sharded_coordinator_composed1_retry_test.go. The changes need to be pushed before this PR can be reviewed as described.


✅ Described approach is correct — reviewing the intent

Assuming the changes are as described, the approach is sound:

The gate placement is right. Moving the ReadKeys check to the auto-pin (lines 489–491) rather than only at the retry guard (line 516) is the correct semantics: for read-write txns, ObservedRouteVersion must stay 0 so verifyComposed1 short-circuits (pre-feature behavior). Stamping the version at EXEC/dispatch time for a txn that read at BeginTxn time would mislead M3 into treating a pre-shift observation as post-shift, exactly the gate-bypass the Codex finding described.

The two guards are consistent after the fix:

Txn type Auto-pin gate fires? Composed-1 retry? Result
Write-only ✅ pins at entry ✅ retries once client-transparent
Read-write (no caller pin) ❌ stays 0 ❌ surfaces immediately pre-feature fallback
Read-write (caller-supplied pin) ❌ caller wins ❌ surfaces immediately §4.1 contract preserved

This is correct. The retry and the auto-pin now agree on "only write-only txns are safe to coordinate transparently."


⚠️ Three findings from prior reviews still open

These were in the prior Claude reviews and were not addressed in 66ad5b0.

[Medium] PrevCommitTS not zeroed on retry

Line 544 clears CommitTS and re-allocates StartTS but does not zero PrevCommitTS. If a write-only txn had PrevCommitTS != 0 (single-shard at-most-once dedup), and the retry routes to multiple groups (because a SplitRange fired between the two attempts), dispatchMultiShardTxn returns ErrTxnDedupRequiresSingleShard — a wrong error, because the original attempt never committed (M3 gate fired at the FSM before any apply). The dedup signal is stale.

reqs.CommitTS = 0
reqs.PrevCommitTS = 0  // ← add: M3 means original attempt never committed; dedup signal is stale

Fix this →

[Minor] Stale resp returned on allocErr

Line 547 returns resp, allocErr where resp is from the failed Composed-1 attempt. In practice resp is always nil when dispatchTxn errors, but the intent is clearer and safer as:

return nil, allocErr

Fix this →

[Nit] Empty goroutine in TestShardedCoordinator_RetriesOnComposed1Violation

Lines 68–74 still contain:

go func() {
    // ... comments only, no body ...
}()

This goroutine executes nothing. Remove it (or keep the comment as a regular block comment).

Fix this →


Summary

The described Codex P1 fix (gate auto-pin on len(reqs.ReadKeys) == 0) is the right approach and would complete a consistent "write-only retry" contract. It just needs to be pushed to the branch. Once pushed, the three items above should also be addressed before merge — the PrevCommitTS one in particular is a correctness issue on the retry path.

…write txns (codex P1 on PR #900)

When a caller already supplies a transaction StartTS + ReadKeys but
leaves ObservedRouteVersion unset, the M4 Dispatch-entry auto-pin
was stamping the catalog version at EXEC/commit dispatch time
rather than when the read set was captured.  Concrete scenario:
Redis txnContext.commit builds an OperationGroup with StartTS from
MULTI and ReadKeys from the GETs inside the txn body but no
ObservedRouteVersion.  If a route moved between the MULTI-time
reads and the EXEC Dispatch call, the auto-pin stamped the
POST-shift catalog version on the request and verifyComposed1
silently routed-and-verified against the new version — failing to
detect the very shift the guard is meant to catch.

Per design doc §4.1: pinning happens at BeginTxn time (read-set
capture).  For write-only txns Dispatch IS BeginTxn (no prior
reads), so the auto-pin matches.  For read-write txns Dispatch is
strictly LATER than BeginTxn, and the coordinator cannot
retroactively reconstruct the route view the reads were taken
against.  Such callers MUST migrate to pin ObservedRouteVersion
themselves at BeginTxn time; until they do, surfacing observedVer=0
to the M3 gate (which short-circuits on the zero sentinel) is the
correct non-regressing posture — better than a wrong-version
"protection" that disguises a missed gate fire.

Fix: gate the auto-pin on `len(reqs.ReadKeys) == 0`.  Write-only
txns continue to be auto-pinned at Dispatch entry; read-write
txns flow through with whatever ObservedRouteVersion the caller
supplied (zero = unpinned, non-zero = caller-managed per §4.1).

Caller audit (the loop directive's semantic-change check): only
ShardedCoordinator.Dispatch auto-pins.  `grep -rn 'ObservedRouteVersion'
adapter/` returns no hits — no adapter sets or reads the field
today, so the semantic shrink (auto-pin no longer fires for
read-write txns) is invisible to every caller.  Adapters that
want M3 protection for read-write txns must explicitly migrate
to pin at BeginTxn, which is what the design doc has always
required.

Regression test (added before the fix per the project convention):

  * TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn
    — builds an OperationGroup with ReadKeys non-empty and
    ObservedRouteVersion=0, dispatches against an engine pinned
    at catalog version 5, asserts the request that reached
    the stub Commit carries ObservedRouteVersion=0 (the auto-pin
    did not fire).  Verified failing on the parent commit (the
    request carried v=5) and passing on this commit.

Verification: go test -race -count=1 ./kv → pass.  make lint → 0
issues.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Pushed as 10123c5 — the @claude review request above was posted before the commit actually landed (a pre-commit cyclop lint failure on my first attempt — fixed by extracting the auto-pin into maybeAutoPinObservedRouteVersion). The semantic fix is unchanged from the previous comment; please pick up 10123c5.

@claude review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (10123c5):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 16s —— View job


PR Review: Codex P1 fix — maybeAutoPinObservedRouteVersion refactor (10123c5)

  • Gather context and understand the request
  • Read current state of changed files (10123c5)
  • Verify Codex P1 fix (maybeAutoPinObservedRouteVersion — write-only gate)
  • Verify regression test (TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn)
  • Check remaining open items from prior reviews (PrevCommitTS, empty goroutine)
  • Post findings

The Codex P1 fix and the cyclop refactor are correct. Two items from prior review rounds remain open.


✅ Codex P1 — correct

maybeAutoPinObservedRouteVersion at line 507 gates on len(reqs.ReadKeys) != 0:

func (c *ShardedCoordinator) maybeAutoPinObservedRouteVersion(reqs *OperationGroup[OP]) {
    if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 {
        return
    }
    reqs.ObservedRouteVersion = c.engine.Version()
}

For read-write txns the auto-pin is skipped — ObservedRouteVersion stays at 0, the M3 gate short-circuits (safe default), and the SSI-violation window codex flagged is closed. Write-only txns continue to be auto-pinned. The method comment accurately documents the §4.1 reasoning and the concrete Redis MULTI/EXEC scenario.

✅ Regression test — correct

TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn (lines 327–357): engine at v=5, ReadKeys non-empty, ObservedRouteVersion left at 0. The assertion require.Equal(t, uint64(0), txn.observedSeen[0]) is exactly right — verifies the pin was NOT applied. Test comment correctly explains the gate-bypass scenario.

✅ Cyclop refactor — clean

Extracting the auto-pin gate into maybeAutoPinObservedRouteVersion is the correct fix for the cyclop lint failure. dispatchTxnWithComposed1Retry is now one statement shorter at its entry point and the new helper is independently testable.


⚠️ PrevCommitTS still not cleared on retry — medium, now in its third review round

sharded_coordinator.go:517 passes reqs.PrevCommitTS into dispatchTxn. The retry block at lines 567–572 zeros CommitTS and re-allocates StartTS but does not clear PrevCommitTS:

reqs.CommitTS = 0
startTS, allocErr := c.nextStartTS(ctx, reqs.Elems)
// reqs.PrevCommitTS is NOT touched

The M4 retry is now correctly scoped to write-only txns. Write-only txns can still have PrevCommitTS set — that is the option-2 at-most-once dedup probe. If a dedup-enabled write-only txn hits ErrComposed1Violation and the catalog shifts such that groupMutations now returns two groups, dispatchMultiShardTxn line 736 immediately returns ErrTxnDedupRequiresSingleShard:

if prevCommitTS != 0 {
    return nil, errors.WithStack(ErrTxnDedupRequiresSingleShard)   // stale signal
}

The signal is wrong — the M3 gate fired at apply time, meaning the original attempt never committed, so PrevCommitTS's "you already sent this once" invariant is invalid on retry. The caller sees ErrTxnDedupRequiresSingleShard instead of a successful re-route.

Fix — one line alongside reqs.CommitTS = 0 at line 567:

reqs.CommitTS = 0
reqs.PrevCommitTS = 0  // M3 gate means the original attempt never committed; dedup probe is stale

Fix this →

⚠️ Empty goroutine — minor, now in its third review round

sharded_coordinator_composed1_retry_test.go lines 68–74 still contain:

go func() {
    // The retry happens synchronously inside Dispatch; advance the
    // catalog before the second attempt would re-read it.  ...
}()

The goroutine body is empty — no statements, no channel operations, nothing. It will always race harmlessly, but a future reader will assume it models a concurrent catalog advance that it does not. Remove the go func() { ... }() invocation; if the intent comment is worth keeping, promote it to a regular block comment before the Dispatch call.

Nit — stale resp on allocErr (line 570)

Still return resp, allocErr where resp is from the prior failed attempt. Should be return nil, allocErr.


Summary

The Codex P1 fix and cyclop refactor are correct; the regression test documents the §4.1 constraint precisely. The PrevCommitTS one-liner is the only correctness item outstanding before merge — it has been flagged in every review round and the fix is trivial. The empty goroutine is cosmetic but should also be cleaned up.

… (claude[bot] medium/minor/nit on PR #900)

Three items from the claude[bot] review of 10123c5, all
flagged across three review rounds:

MEDIUM — PrevCommitTS leak on Composed-1 retry.  The M4 retry
block zeroed CommitTS and re-allocated StartTS but did NOT clear
PrevCommitTS.  Concrete failure mode (claude[bot]): an
adapter-initiated write-only retry with a non-zero PrevCommitTS
(the option-2 at-most-once dedup probe) hits ErrComposed1Violation;
the catalog shifts such that groupMutations now spans two groups;
dispatchMultiShardTxn:736 immediately returns
ErrTxnDedupRequiresSingleShard because prevCommitTS != 0.  But the
M3 gate fired at APPLY time — the original attempt never
committed — so the dedup probe is invalid on retry.  The caller
gets a misleading "you must single-shard" error instead of a
correct cross-group re-route.

Fix: drop reqs.PrevCommitTS = 0 alongside reqs.CommitTS = 0 in
the retry block.  The retry now reflects "fresh attempt, no
prior landing to dedup against," which matches the actual state
(M3 fired before any FSM apply).

Caller audit (loop directive's semantic-change check):
`grep -rn 'PrevCommitTS' adapter/` returns two sites
(adapter/redis.go:3209, :3615), both of which construct an
OperationGroup struct literal inline, pass it to Dispatch, and
either return success or branch on the error.  Neither call site
reads reqs.PrevCommitTS back after Dispatch returns — the field
is set from a separate adapter-owned `pending.commitTS` which is
unmodified by Dispatch.  The semantic change (clearing the field
inside the M4 retry) is therefore invisible to every current
caller.

MINOR — empty goroutine in the retry test.  The first retry
test had a `go func() { /* comment */ }()` with an empty body —
it looked like an unfinished concurrent-catalog-advance model
but had no statements.  Removed; promoted the intent comment
to a plain block comment that precedes the Dispatch call and
points at the actual observedSeen[1] assertion that enforces
the re-read contract.

NIT — stale resp on allocErr path.  The retry block returned
`resp, allocErr` where resp was the failed first attempt's
response (always nil for the Composed-1 sentinels).  Made it
`nil, allocErr` so the "non-nil response only on success"
contract is unambiguous.

Regression test (added before the fix per the project
convention):

  * TestShardedCoordinator_ClearsPrevCommitTSOnComposed1Retry
    — write-only OperationGroup with PrevCommitTS=17, first
    attempt rejected with ErrComposed1Violation.  Decodes the
    encoded TxnMeta from the per-attempt pb.Request Mutation
    value to observe the on-wire PrevCommitTS, then asserts
    attempt 1 carried 17 (caller's probe unchanged) and
    attempt 2 carried 0 (probe dropped by the retry path).
    Verified failing on the parent commit (attempt 2 saw 17 =
    0x11) and passing on this commit.

  The stub composed1RetryingTransactional grew a
  prevCommitTSSeen field and a per-call decode of muts[0].Value
  to support the assertion.

Verification: go test -race -count=1 ./kv → pass (the whole
package, 11s). make lint → 0 issues.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Pushed as 43c55df — claude[bot]'s three items on 10123c5 addressed.

Medium — PrevCommitTS leak (3rd review round): retry block now zeros reqs.PrevCommitTS alongside reqs.CommitTS. The M3 gate fires at apply time, so the original attempt never committed → the option-2 dedup probe is invalid on retry. Without the clear, a write-only retry that the post-shift catalog now reroutes across two groups would hit ErrTxnDedupRequiresSingleShard on a stale signal.

Caller audit (semantic change, loop directive): grep -rn 'PrevCommitTS' adapter/ → adapter/redis.go:3209, :3615. Both sites construct OperationGroup inline, pass to Dispatch, branch on the result. Neither reads reqs.PrevCommitTS back; the adapter-owned pending.commitTS is unmodified by Dispatch. Clearing inside the M4 retry is invisible to current callers.

Minor — empty goroutine (3rd review round): removed; intent promoted to a plain block comment in front of Dispatch pointing at the observedSeen[1] assertion that enforces the re-read contract.

Nit — stale resp on allocErr: return nil, allocErr (resp was always nil for Composed-1 sentinels, but explicit is better).

Regression test (added before fix per convention): TestShardedCoordinator_ClearsPrevCommitTSOnComposed1Retry — decodes the encoded TxnMeta from each attempt's pb.Request to observe on-wire PrevCommitTS, asserts attempt 1=17 (caller's probe unchanged) / attempt 2=0 (probe dropped). Verified failing on parent / passing on fix.

go test -race ./kv pass (full pkg, 11s) / make lint 0 issues.

@claude review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (43c55df):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 37s —— View job


PR Review: 43c55df — PrevCommitTS + empty goroutine + stale resp fixes

  • Gather context and understand the request
  • Read kv/sharded_coordinator.go at HEAD (43c55df)
  • Read kv/sharded_coordinator_composed1_retry_test.go at HEAD
  • Verify PrevCommitTS fix
  • Verify empty goroutine removal
  • Verify stale resp fix
  • Verify regression test correctness
  • Check for any new issues

All three items from the prior review rounds are correctly addressed. No new issues found.

✅ PrevCommitTS cleared on retry — correct

reqs.PrevCommitTS = 0 lands at line 575, immediately after reqs.CommitTS = 0 at line 567. The placement is right: it fires only in the retry block (not on first attempt, preserving the caller's dedup probe for attempt 1) and before the nextStartTS allocation (so the retry's dispatchTxn call at line 517 sees the cleared value). The comment explains the reasoning accurately — M3 fires at apply time, so the probe is stale on retry.

Cross-checked with dispatchMultiShardTxn:748: the guard if prevCommitTS != 0 { return nil, ErrTxnDedupRequiresSingleShard } is exactly the failure mode that would have been triggered by the stale probe. The fix lands cleanly between the two.

✅ Regression test — correct

TestShardedCoordinator_ClearsPrevCommitTSOnComposed1Retry verifies the fix via the on-wire encoding, not just the struct field — composed1RetryingTransactional.Commit decodes DecodeTxnMeta(muts[0].GetValue()) from the actual pb.Request. This is the right check: muts[0] is always the TxnMeta sentinel mutation built by onePhaseTxnRequestWithPrevCommit (coordinator.go:1092), so the test observes the same value the FSM would see, not just what the coordinator held in memory. Assertions:

  • prevCommitTSSeen[0] == 17 — first attempt carries the caller's probe unchanged ✅
  • prevCommitTSSeen[1] == 0 — retry drops it ✅

Test follows the project convention: verifies failing on the parent commit (attempt 2 would decode 0x11 = 17 before the fix) and passing after.

✅ Empty goroutine removed — correct

Lines 82–89 now contain a plain block comment explaining the test design intent. No empty go func() invocation remains. The comment accurately describes what the observedSeen[1] assertion below it actually tests (RE-READ contract, not a particular value).

✅ Stale resp on allocErr — correct

Line 582: return nil, allocErr — resp is gone. The "non-nil response only on success" invariant is now unambiguous for that code path.

TLA+ spec divergence — no update needed

The changes are confined to the Dispatch-level retry orchestration in sharded_coordinator.go. No modifications to hlc*.go, coordinator.go HLC renewal paths, transaction.go, lock_resolver.go, fsm.go, mvcc_store.go, or distribution/. The retry wraps dispatchTxn calls; it does not touch timestamp issuance, OCC commit-ts assignment, FSM apply, or catalog versioning. No spec module needs updating.


LGTM. The four review rounds across this PR have been thorough and all correctness findings are closed. The one remaining open item is the M5 Jepsen workload (pre-flagged as a follow-up in the PR description).

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 43c55dfe0a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/sharded_coordinator.go Outdated
// short-circuit with ErrTxnDedupRequiresSingleShard if the
// post-shift catalog now reroutes this txn to span two groups
// (claude[bot] medium, 3rd review round on PR #900).
reqs.PrevCommitTS = 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve PrevCommitTS when retrying deduped attempts

When the request is already an option-2 retry (PrevCommitTS != 0) and has no ReadKeys, this line drops the only probe that tells the FSM whether the earlier ambiguous attempt already landed. I checked the Redis list-push reuse path in adapter/redis.go: it sends PrevCommitTS: pending.commitTS, and its readKeys can be empty for the empty-list case, so a route shift that triggers ErrComposed1Violation here will reroute and reapply the write at a new timestamp instead of no-oping against the version that may have already committed. The fact that this M4 attempt failed the Composed-1 gate only proves this attempt did not apply; it does not prove the prior attempt named by PrevCommitTS did not land.

Useful? React with 👍 / 👎.

…st-once probe (codex P1 on 43c55df, PR #900)

43c55df took claude[bot]'s medium feedback at face value and zeroed
reqs.PrevCommitTS in the M4 retry block, on the theory that "the
M3 gate fires at apply time → the original attempt never committed
→ probe is stale."  Codex P1 on 43c55df correctly identifies that
this conflates two different "original attempts":

  * Within M4: attempt 1's M3 rejection does prove attempt 1 wrote
    nothing (the FSM rejects before any state change).
  * Across adapter retries: PrevCommitTS names the ADAPTER'S prior
    attempt, not M4's.  adapter/redis.go:3209 and :3615 (Redis
    EXEC reuse / list-push reuse retry paths) send
    PrevCommitTS = pending.commitTS as the option-2
    ambiguous-outcome probe.  That earlier attempt may have actually
    LANDED at the named commitTS before the route shift that
    triggered this M4 attempt's M3 sentinel.

Dropping the probe in the retry would let the FSM re-apply the
writes against the post-shift route, double-writing the caller's
at-most-once payload if the named prior attempt did land.  The
at-most-once contract is the whole point of option-2 dedup; the
M4 retry must not break it.

Trade-off (accepted): when the post-shift route now spans multiple
groups, dispatchMultiShardTxn:749 surfaces
ErrTxnDedupRequiresSingleShard.  That error is the FAITHFUL signal
that the adapter's at-most-once retry contract cannot be
transparently honoured across a shard split — the caller must
reckon with the ambiguity at a higher level (re-read, re-execute,
abort).  Surfacing it is strictly safer than silently losing
dedup.

The earlier (claude[bot]-flagged) "misleading error" concern is
real but the only correct fix lives in the adapter retry loop, not
by tampering with the probe at the coordinator layer.

Fix: revert the `reqs.PrevCommitTS = 0` line that 43c55df added.
Other changes from 43c55df (empty goroutine removal, stale-resp
nit) are correct and stay.

Caller audit (loop directive's semantic-change check): no caller
reads reqs.PrevCommitTS back after Dispatch.  Adapter sites
(adapter/redis.go:3209, :3615) construct OperationGroup inline,
pass it, and branch on the error.  The semantic change here
(stop clearing the probe inside the retry) restores the
pre-43c55dfe wire contract and is invisible to those callers.

Regression test (replaces 43c55df's misnamed
ClearsPrevCommitTSOnComposed1Retry):

  * TestShardedCoordinator_PreservesPrevCommitTSOnComposed1Retry —
    write-only OperationGroup with PrevCommitTS=17, first attempt
    rejected with ErrComposed1Violation.  Decodes the encoded
    TxnMeta from the per-attempt pb.Request mutation value (same
    bytes the FSM reads), asserts attempt 1 AND attempt 2 both
    carry PrevCommitTS=17 — the probe survives the retry intact.
    This is the inverse of the test 43c55df added; the new
    assertion encodes codex P1's at-most-once correctness
    contract.

Verification: go test -race -count=1 ./kv → pass (full pkg, 11s).
make lint → 0 issues.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (57da888):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Pushed as 57da888 — codex P1 on 43c55df addressed.

Codex P1 was correct, claude[bot]'s prior medium was wrong. 43c55df took claude[bot]'s reasoning at face value ("M3 fired at apply time → original attempt never committed → probe is stale") and zeroed reqs.PrevCommitTS on the M4 retry. That conflates two different "original attempts":

  • Within M4: attempt 1's M3 rejection proves attempt 1 wrote nothing. ✓
  • Across adapter retries: PrevCommitTS names the adapter's prior attempt (adapter/redis.go:3209, :3615 send PrevCommitTS = pending.commitTS as the option-2 ambiguous-outcome probe). That earlier attempt may have actually landed before the route shift that triggered this M4's M3 sentinel.

Dropping the probe would let the FSM re-apply against the post-shift route, double-writing the at-most-once payload if the prior attempt landed. The at-most-once contract is the whole point of option-2 dedup — M4 must not break it.

Trade-off accepted: post-shift-multi-shard case still surfaces ErrTxnDedupRequiresSingleShard from dispatchMultiShardTxn:749. That's the faithful signal — at-most-once retry cannot transparently cross a shard split; the caller must reckon with the ambiguity at a higher level. Surfacing is strictly safer than silent double-write.

Caller audit (loop directive): no caller reads reqs.PrevCommitTS back after Dispatch. Restoring the pre-43c55dfe wire contract is invisible to current callers.

Regression test replaces the misnamed prior one:

  • TestShardedCoordinator_PreservesPrevCommitTSOnComposed1Retry — decodes TxnMeta from each per-attempt pb.Request mutation (same bytes the FSM reads), asserts attempt 1 AND attempt 2 both carry PrevCommitTS=17. This is the inverse assertion of what 43c55df pinned, encoding codex P1's at-most-once correctness contract.

go test -race ./kv pass (full pkg, 11s) / make lint 0 issues.

@claude review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 57da88864e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/sharded_coordinator.go
Comment on lines +597 to +605
startTS, allocErr := c.nextStartTS(ctx, reqs.Elems)
if allocErr != nil {
// resp here is the failed first attempt's response (always
// nil for the Composed-1 sentinels) — return nil
// explicitly so the contract "non-nil response on success
// only" is unambiguous (claude[bot] nit on PR #900).
return nil, allocErr
}
reqs.StartTS = startTS
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve caller StartTS on OCC retries

When a caller supplies StartTS from an earlier OCC read but leaves ReadKeys empty, this retry silently replaces the snapshot timestamp. Fresh evidence beyond the earlier ReadKeys case is S3 createBucket, which reads bucket metadata at readTS (adapter/s3.go:573) and dispatches a txn with that StartTS but no ReadKeys (adapter/s3.go:605-613); if a route shift triggers this path, a concurrent create that committed after the S3 read but before the retry is no longer rejected by the FSM's latest > startTS write-conflict check, so the retry can overwrite/accept a stale decision instead of surfacing a conflict. The retry should only advance StartTS for coordinator-created write-only snapshots, or otherwise force the adapter to re-read and rebuild the request.

Useful? React with 👍 / 👎.

…(codex P1 on 57da888, PR #900)

Codex P1 on 57da888 identifies that the M4 retry path's StartTS
bump is unsafe whenever the caller supplied StartTS — not only the
already-gated ReadKeys-non-empty case.  When the caller hands us
StartTS=readTS (because it already read metadata at that snapshot
and is now dispatching a write), bumping StartTS on retry lets any
concurrent commit landed in (originalStartTS, newStartTS) bypass
the FSM's `latest > startTS` write-conflict check.  The
intervening commit is now dominated by the new snapshot, so OCC
silently accepts the retry's stale-snapshot decision.

The pre-Dispatch read is INVISIBLE to the coordinator (it is not
surfaced as a ReadKeys entry), so the only signal that the
snapshot is caller-owned is reqs.StartTS being non-zero at
Dispatch entry.

Real exposure: a code-base grep for adapter sites that supply
`StartTS = readTS` without ReadKeys turns up at least 13 hits
across S3, SQS, and DynamoDB adapters:

  * adapter/s3.go:573 (createBucket)
  * adapter/s3_admin_objects.go:144, :348
  * adapter/sqs_messages.go:541, :1113, :1332, :1487, :1614
  * adapter/sqs_messages_batch.go:222
  * adapter/sqs_fifo.go:241
  * adapter/sqs_purge.go:146
  * adapter/sqs_tags.go:179
  * adapter/dynamodb.go:4366, :8274

All of these read metadata at readTS and dispatch a write with
that StartTS but empty ReadKeys.  Under the pre-fix retry path
any Composed-1 sentinel they hit would have been silently
absorbed by the StartTS-bumping retry, breaking OCC for
concurrent writes in the bumped-by window.

Fix: capture `callerSuppliedStartTS := reqs.IsTxn && reqs.StartTS != 0`
at Dispatch entry (BEFORE the coordinator-allocates-on-zero
branch that mutates the field), thread it into
dispatchTxnWithComposed1Retry, and surface the sentinel without
retry whenever the flag is set.  Coordinator-allocated StartTS
remains retryable — by definition no read has happened at that
timestamp, so the bump is OCC-safe.

Caller audit (loop directive's semantic-change check): the
adapter sites listed above are the BENEFICIARIES of this fix.
Pre-fix they were silently exposed to stale-snapshot OCC bypass
under route shifts; post-fix they correctly surface
ErrComposed1Violation, which propagates through their existing
HTTP error path to the client.  No caller depended on the silent
retry succeeding — the M4 retry was added in this PR and no
adapter is yet wired to take advantage of it in the
caller-supplied-StartTS case.  The DynamoDB adapter sites that
already supply StartTS=0 (adapter/dynamodb.go:922, :981, etc.)
continue to use the coordinator-allocated path, so the M4 retry
remains available for those.

Test changes:

  * NEW TestShardedCoordinator_SurfacesComposed1OnCallerSuppliedStartTS:
    write-only OperationGroup with StartTS=42, ReadKeys empty,
    firstErr=ErrComposed1Violation.  Asserts the sentinel
    surfaces unchanged and exactly ONE Commit attempt fires.
    Verified failing on the parent commit (the retry absorbed
    the sentinel and the second attempt succeeded) and passing
    on this commit.

  * Existing M4-retry tests that previously set StartTS=10 (a
    test fixture, not a real "caller-supplied" intent) updated
    to omit StartTS so the coordinator allocates and the retry
    path is eligible to fire:
      - TestShardedCoordinator_RetriesOnComposed1Violation
      - TestShardedCoordinator_RetriesOnComposed1VersionGCd
      - TestShardedCoordinator_PersistentComposed1ViolationSurfaces
      - TestShardedCoordinator_PreservesPrevCommitTSOnComposed1Retry
    These tests use NewHLC() and the in-memory engine, so
    nextStartTS allocates cleanly.

  * Tests that exercise read-write surfacing
    (TestShardedCoordinator_DoesNotRetryReadWriteTxnOnComposed1,
    TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn)
    keep StartTS=10 — ReadKeys non-empty surfaces first, so the
    new callerSuppliedStartTS gate is never reached.  Same for
    the pin tests that succeed on first attempt.

Also fixes a doc-comment nit claude[bot] flagged on 57da888: the
composed1RetryingTransactional stub's PrevCommitTS-decode comment
still said "assert the field is cleared on the retry attempt" —
that was the now-disproven claude[bot] medium reasoning from
43c55df.  Updated to say "PRESERVED across retry attempts" with
the codex P1 (43c55df) reference.

Verification: go test -race -count=1 ./kv → pass (full pkg, 11s).
make lint → 0 issues.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (144ec0c):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Pushed as 144ec0c — codex P1 on 57da888 addressed (plus claude[bot] stale-comment nit).

Codex P1 — caller-supplied StartTS surfaces. The M4 retry path's StartTS bump was unsafe for ALL caller-supplied StartTS values, not just the already-gated ReadKeys-non-empty case. A code-base grep found 13+ adapter sites that supply StartTS = readTS without populating ReadKeys (their pre-Dispatch metadata read is invisible to the coordinator):

  • adapter/s3.go:573 (createBucket), s3_admin_objects.go:144 / :348
  • adapter/sqs_messages.go:541 / :1113 / :1332 / :1487 / :1614
  • adapter/sqs_messages_batch.go:222, sqs_fifo.go:241, sqs_purge.go:146, sqs_tags.go:179
  • adapter/dynamodb.go:4366 / :8274

Under the pre-fix retry, any Composed-1 sentinel they hit would have been silently absorbed by the bumped-StartTS retry, letting any commit landed in (originalStartTS, newStartTS) bypass the FSM's latest > startTS OCC check.

Fix: capture callerSuppliedStartTS := reqs.IsTxn && reqs.StartTS != 0 at Dispatch entry (BEFORE the coordinator-allocates-on-zero branch mutates the field), thread it into dispatchTxnWithComposed1Retry, surface the sentinel without retry when set. Coordinator-allocated StartTS remains retryable — by definition no read has happened at that timestamp.

Caller audit (loop directive): the 13+ adapter sites are the BENEFICIARIES of this fix. Pre-fix they were silently exposed to stale-snapshot OCC bypass under route shifts; post-fix they correctly surface ErrComposed1Violation through their existing HTTP error path. No caller depended on the silent retry succeeding (M4 was added in this PR; no adapter is yet wired to take advantage). DynamoDB sites with StartTS: 0 (dynamodb.go:922, :981, etc.) continue to use the coordinator-allocated path — M4 retry remains available for those.

Regression test: TestShardedCoordinator_SurfacesComposed1OnCallerSuppliedStartTS — write-only OG with StartTS=42, ReadKeys empty, asserts the sentinel surfaces unchanged and exactly ONE Commit attempt fires. Failing on parent (retry absorbed); passing on fix.

Existing M4-retry tests updated to omit StartTS (was StartTS: 10 test fixture, now coordinator-allocated path so retry path is eligible). Tests that exercise read-write surfacing keep StartTS=10 — ReadKeys non-empty surfaces first, so the new gate is never reached.

claude[bot] nit on 57da888: stub's PrevCommitTS-decode comment still said "assert the field is cleared" (the now-disproven 43c55df-era claude[bot] medium reasoning). Updated to "PRESERVED across retry attempts" with codex P1 (43c55df) reference.

go test -race ./kv pass (full pkg, 11s) / make lint 0 issues.

@claude review

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 6s —— View job


PR Review: Codex P1 fix — caller-supplied StartTS surfaces (144ec0c)

  • Gather context and read changed files
  • Verify Codex P1 fix (callerSuppliedStartTS gate)
  • Verify regression test correctness
  • Check all prior review findings are resolved
  • Check TLA+ spec divergence

✅ Codex P1 — correct

callerSuppliedStartTS := reqs.IsTxn && reqs.StartTS != 0 at line 464 is captured before the coordinator-allocates-on-zero branch at line 466. This is the right placement: the flag records what the caller supplied, before the coordinator potentially overwrites reqs.StartTS. The two branches are clean:

Caller supplied StartTS callerSuppliedStartTS M4 retry Reason
StartTS != 0 (e.g. readTS from a prior metadata read) true ❌ surfaces immediately Coordinator cannot prove no one read at originalStartTS; bumping would open OCC blind spot in (originalStartTS, newStartTS)
StartTS == 0 (coordinator allocates) false ✅ retries once Coordinator owns the timestamp; no external read has happened at it

The new guard at lines 572–574 is placed correctly in the loop body — after isComposed1RetryableError (so it only fires on the two Composed-1 sentinels) and before the attempt == composed1RetryAttempts budget check (so budget never even decrements for caller-owned snapshots).

✅ Regression test — correct

TestShardedCoordinator_SurfacesComposed1OnCallerSuppliedStartTS (lines 478–510):

  • Dispatches with StartTS: 42, ReadKeys empty
  • Asserts errors.Is(err, ErrComposed1Violation) — sentinel surfaces unchanged
  • Asserts txn.calls.Load() == 1 — exactly one attempt fires

The test comment documents the 13+ real adapter sites and explains why the pre-Dispatch read is invisible to the coordinator (no ReadKeys). The test correctly distinguishes the callerSuppliedStartTS gate from the earlier ReadKeys != 0 gate, which is the right decomposition.

✅ Existing tests updated correctly

TestShardedCoordinator_RetriesOnComposed1Violation, TestShardedCoordinator_RetriesOnComposed1VersionGCd, and TestShardedCoordinator_PersistentComposed1ViolationSurfaces now omit StartTS (coordinator-allocated path), with inline comments pointing back to the Codex P1 finding. This keeps those tests on the retry-eligible path without conflating the two gates.

TestShardedCoordinator_PinsObservedRouteVersionAtDispatchEntry and TestShardedCoordinator_HonorsCallerPinnedObservedRouteVersion retain StartTS: 10 because they test the auto-pin logic on the first (successful) attempt — the callerSuppliedStartTS gate is never reached since there is no Composed-1 error.

✅ Stale-comment nit from 57da888 — fixed

The composed1RetryingTransactional.Commit stub comment (lines 31–42) now correctly says "PRESERVED across retry attempts" and cites the codex P1 on 43c55df that disproved the original claude[bot] "clear it" advice. No longer misleads future readers.

✅ All prior review findings resolved

Finding Round Status
gemini CRITICAL / codex P1: SSI violation on read-write retry R1 66ad5b0ReadKeys != 0 surfaces immediately
gemini HIGH: nil-clock leaves StartTS=0 R1 66ad5b0 — unconditional nextStartTS
claude medium: PrevCommitTS not cleared on retry R1–R3 57da888 — correctly NOT cleared (codex P1 disproved the clearing)
Minor: empty goroutine in test R1 43c55df — removed; converted to block comment
Nit: stale resp on allocErr R2 43c55dfreturn nil, allocErr
codex P1: auto-pin stamps post-shift version for read-write txns R3 10123c5maybeAutoPinObservedRouteVersion gated on ReadKeys empty
codex P1: callerSuppliedStartTS OCC bypass R4 144ec0c — this commit

⚠️ Design note (not blocking)

maybeAutoPinObservedRouteVersion gates on len(reqs.ReadKeys) != 0 but not on callerSuppliedStartTS. For adapter sites that supply StartTS = readTS with empty ReadKeys (the same 13+ sites codex identified), the auto-pin fires and stamps the dispatch-time catalog version. This means M3 covers the dispatch→apply window for these callers but NOT the read→dispatch window.

This is not a regression: pre-M4 these callers had ObservedRouteVersion = 0, so the M3 gate always short-circuited (zero coverage). Post-M4 the gate covers dispatch→apply (partial coverage). Full coverage requires these adapters to migrate to pin ObservedRouteVersion at BeginTxn time per §4.1. The staged-migration posture (observedVer=0 → gate short-circuits) is documented in the maybeAutoPinObservedRouteVersion comment.

If the read→dispatch window is deemed in-scope for M4, the fix would be to also skip the auto-pin when callerSuppliedStartTS is true — but that would reduce M3 gate coverage for these callers compared to the current state, which seems wrong. The right answer is adapter migration to §4.1, which is the pre-flagged M5+ work. Not blocking here.

TLA+ spec divergence — no update needed

144ec0c is confined to dispatchTxnWithComposed1Retry and the Dispatch method preamble in sharded_coordinator.go. No changes to hlc*.go, coordinator.go HLC lease-renewal paths, transaction.go, lock_resolver.go, fsm.go, mvcc_store.go, or distribution/. The retry orchestration wraps dispatchTxn calls; it does not touch timestamp issuance, OCC commit-ts assignment, FSM apply, or catalog versioning. No spec module needs updating.


LGTM. Five review rounds have been thorough. All correctness findings are closed. The callerSuppliedStartTS fix correctly extends the write-only-is-retryable contract from ReadKeys-based detection to StartTS-origin detection, closing the class of adapter sites (S3/SQS/DynamoDB) that supply an owned snapshot without populating ReadKeys.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 144ec0ca87

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/sharded_coordinator.go Outdated
Comment on lines +524 to +527
if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 {
return
}
reqs.ObservedRouteVersion = c.engine.Version()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid auto-pinning caller-owned snapshots

When a transaction arrives with a caller-supplied StartTS but empty ReadKeys (for example adapter/s3.go:573-614 reads bucket metadata at readTS and dispatches the write with StartTS: startTS and no ObservedRouteVersion), this still stamps the catalog version at Dispatch time. If a route moved between that adapter-layer read and Dispatch, the request is pinned to the post-shift route version, so verifyComposed1 will not report the route shift even though the write decision was based on the old snapshot; the later callerSuppliedStartTS retry guard never helps because the sentinel was masked. The auto-pin needs to be limited to coordinator-owned snapshots as well as empty ReadKeys, or these adapters need to pin the version when they take the read snapshot.

Useful? React with 👍 / 👎.

…alse-confidence gate signal (codex P1 on 144ec0c, PR #900)

Codex P1 on 144ec0c identifies a remaining hole in the M3
auto-pin coverage: when an adapter supplies StartTS=readTS with
ReadKeys empty (the 13+ S3/SQS/DynamoDB sites enumerated in
144ec0c's commit message), the Dispatch-entry auto-pin stamps
reqs.ObservedRouteVersion = engine.Version() — the CURRENT
catalog version, not the version the read was actually captured
at.  If a route shift landed between the adapter-layer read and
this Dispatch call, the auto-pin records the POST-shift version
and verifyComposed1 finds no mismatch at apply time.  The route
shift becomes INVISIBLE to the gate — masking exactly the failure
mode the gate exists to catch.

This is strictly worse than ObservedRouteVersion=0:
* observedVer=0 → M3 gate short-circuits → operator knows no
  protection is in place (correct signal).
* spurious auto-pin → M3 gate runs and passes → operator gets
  false confidence (corrupt signal).

The "partial coverage" framing from claude[bot]'s 144ec0c review
("dispatch→apply window is still covered for these callers") is
illusory: the auto-pin's claim of catalog version on the request
LIES about when the read happened.  A false claim that the gate
verifies as true is a hostile signal, not a partial one.

The correct fix lives in the adapters (pin ObservedRouteVersion
at read time, design doc §4.1), which is M5+ migration work.
Until then, withholding the auto-pin is the only honest posture.

Fix: extend maybeAutoPinObservedRouteVersion's skip conditions to
include callerSuppliedStartTS, mirroring the M4-retry gate added
in 144ec0c.  The condition now reads:

    if engine == nil
       || ObservedRouteVersion != 0
       || len(ReadKeys) != 0
       || callerSuppliedStartTS {
        return  // do NOT auto-pin
    }
    reqs.ObservedRouteVersion = engine.Version()

The flag is already captured at Dispatch entry (before the
allocate-on-zero branch) by 144ec0c; this commit just threads
it one more level down.

Caller audit (loop directive's semantic-change check): the
13+ adapter sites that supply StartTS=readTS without ReadKeys
(adapter/s3.go:573, sqs_messages.go:541 etc., dynamodb.go:4366
etc.) currently get auto-pinned and the gate verifies with a
potentially-false claim.  Post-fix they get ObservedRouteVersion=0
and the gate short-circuits.

No caller loses real protection — the auto-pin's "protection" was
false-positive coverage that masked actual shifts.  Adapters that
want genuine M3 protection must pin ObservedRouteVersion at read
time per §4.1, which is the pre-flagged M5+ migration.

Coordinator-allocated StartTS callers (DynamoDB sites with
StartTS=0 — adapter/dynamodb.go:922, :981, :1760, :4686, :4703,
:4722, :4735) are unchanged: callerSuppliedStartTS=false, auto-pin
fires as before.  The coordinator's allocation IS the
read-time-equivalent moment for those callers (no read against a
different catalog version has happened earlier), so the auto-pin's
claim is honest.

Regression test:

  * TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForCallerSuppliedStartTS
    — caller-supplied StartTS=42, ReadKeys empty,
    ObservedRouteVersion=0, engine pinned at v=9.  Asserts the
    request that reached the stub Commit carries
    ObservedRouteVersion=0 (auto-pin did NOT fire).  Verified
    failing on the parent (the request carried v=9) and passing
    on this commit.

Existing test TestShardedCoordinator_PinsObservedRouteVersionAtDispatchEntry
updated to omit StartTS so the auto-pin path remains exercised on
the coordinator-allocated branch.  TestShardedCoordinator_HonorsCallerPinnedObservedRouteVersion
is unchanged (its ObservedRouteVersion is non-zero so the auto-pin
short-circuits on the existing condition before the new gate even
applies).  TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForReadWriteTxn
is unchanged (ReadKeys non-empty short-circuits first).

Verification: go test -race -count=1 ./kv → pass (full pkg, 10s).
make lint → 0 issues.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (6202b96):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Pushed as 6202b96 — codex P1 on 144ec0c addressed.

Codex P1 — skip auto-pin for caller-supplied StartTS. The auto-pin at Dispatch entry stamped reqs.ObservedRouteVersion = engine.Version() even for callers that supplied StartTS=readTS with empty ReadKeys (the same 13+ S3/SQS/DynamoDB sites from 144ec0c). That stamps the CURRENT catalog version on a request whose read happened at an EARLIER catalog version — a route shift in (readTime, dispatchTime) becomes invisible to verifyComposed1, masking the failure mode the gate exists to catch.

Why this is worse than no auto-pin: ObservedRouteVersion=0 → gate short-circuits → operator knows protection is not in place. Spurious auto-pin → gate verifies a FALSE claim → false confidence. claude[bot]'s 144ec0c framing of "partial coverage" is illusory because the auto-pin LIES about when the read happened — it's a corrupt signal, not a partial one.

Fix: thread the already-captured callerSuppliedStartTS one level down into maybeAutoPinObservedRouteVersion. The skip condition now reads:

if engine == nil || ObservedRouteVersion != 0 || len(ReadKeys) != 0 || callerSuppliedStartTS { return }

Caller audit (loop directive): the 13+ caller-supplied-StartTS adapter sites lose the auto-pin's false-positive coverage but no real protection — they were never honestly covered. DynamoDB sites with StartTS: 0 (dynamodb.go:922 / :981 / :1760 / :4686 / :4703 / :4722 / :4735) are unaffected — coordinator-allocated path still auto-pins, which is honest because the allocation IS the read-time-equivalent moment. Adapters that want genuine M3 protection must pin at read time per §4.1 (M5+ migration).

Regression test: TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForCallerSuppliedStartTS — StartTS=42, ReadKeys empty, engine at v=9, asserts observed=0. Failing on parent (was v=9) / passing on fix.

Existing test PinsObservedRouteVersionAtDispatchEntry updated to omit StartTS so it stays on the coordinator-allocated branch (still auto-pins). HonorsCallerPinnedObservedRouteVersion unchanged (its non-zero ObservedRouteVersion short-circuits the auto-pin on the existing condition). DoesNotAutoPinForReadWriteTxn unchanged (ReadKeys non-empty short-circuits first).

go test -race ./kv pass (full pkg, 10s) / make lint 0 issues.

@claude review

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 35s —— View job


PR Review: Codex P1 fix — skip auto-pin for caller-supplied StartTS (6202b96)

  • Gather context and understand the request
  • Read current state of changed files (6202b96)
  • Verify Codex P1 fix (callerSuppliedStartTS threading into maybeAutoPinObservedRouteVersion)
  • Verify regression test correctness
  • Check all prior review findings are still resolved
  • Check TLA+ spec divergence
  • Post findings

✅ Codex P1 — correct

The callerSuppliedStartTS flag is now threaded from Dispatch entry through dispatchTxnWithComposed1Retry into maybeAutoPinObservedRouteVersion. The updated skip condition at line 544:

if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 || callerSuppliedStartTS {
    return
}

The flag is correctly captured at line 464 before the coordinator-allocates-on-zero branch at line 466 can mutate reqs.StartTS. This ordering is critical — the flag records what the caller supplied, not what the coordinator later assigns. Both uses of the flag are internally consistent:

Caller situation Auto-pin Retry on Composed-1 Reason
Coordinator-allocated StartTS (caller passed 0) ✅ fires ✅ retries once Coordinator owns the snapshot; no prior read exists
Caller-supplied StartTS, ReadKeys empty ❌ skipped ❌ surfaces immediately Adapter read metadata at this snapshot before Dispatch (S3/SQS/DynamoDB pattern); coordinator cannot observe the read
ReadKeys non-empty (read-write txn) ❌ skipped ❌ surfaces immediately Explicit read set at earlier StartTS; SSI violation on retry
Caller-pinned ObservedRouteVersion (non-zero) ❌ skipped (condition 2) depends on StartTS Caller already knows the version

The "false confidence is strictly worse than honest no-coverage" reasoning from the trigger comment is correct. ObservedRouteVersion=0 lets the M3 gate short-circuit cleanly (operator knows protection is not in place); a spuriously auto-stamped post-shift version makes the gate appear to verify a claim it cannot actually make.

✅ Regression test — correct

TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForCallerSuppliedStartTS (lines 513–577): StartTS: 42, ReadKeys empty, ObservedRouteVersion at 0, engine at v=9. Asserts observedSeen[0] == 0. This directly verifies the auto-pin was suppressed.

TestShardedCoordinator_PinsObservedRouteVersionAtDispatchEntry update — correct

Removing StartTS: 10 from the test fixture keeps it on the coordinator-allocated path where the auto-pin IS expected to fire. With StartTS: 10 present, callerSuppliedStartTS would be true, suppressing the auto-pin and causing the assertion observedSeen[0] == 7 to fail post-fix (it would be 0). The comment in the test explains the rationale correctly.

✅ All prior review findings resolved

Finding Round Status
gemini CRITICAL / codex P1: SSI violation on read-write retry R1 66ad5b0ReadKeys != 0 surfaces immediately
gemini HIGH: nil-clock leaves StartTS=0 R1 66ad5b0 — unconditional nextStartTS
codex P1: auto-pin stamps post-shift version for read-write txns R2 10123c5len(ReadKeys) != 0 guard
Minor: empty goroutine in test R2 43c55df — removed; converted to block comment
Nit: stale resp on allocErr R2 43c55dfreturn nil, allocErr
claude medium: PrevCommitTS clearing (was wrongly added, then correctly reverted) R2–R3 57da888 — NOT cleared; preserves adapter's at-most-once probe
codex P1: callerSuppliedStartTS OCC bypass on retry R4 144ec0c — surfaces without retry
codex P1: auto-pin for caller-supplied StartTS R5 6202b96 — this commit

TLA+ spec divergence — no update needed

6202b96 is confined to maybeAutoPinObservedRouteVersion and dispatchTxnWithComposed1Retry in sharded_coordinator.go. No changes to hlc*.go, coordinator.go HLC lease-renewal paths, transaction.go, lock_resolver.go, fsm.go, mvcc_store.go, or distribution/. The auto-pin decision does not touch timestamp issuance, OCC commit-ts assignment, FSM apply, or catalog versioning. No spec module needs updating.


LGTM. Six review rounds across this PR have been thorough. All correctness findings are closed. The final shape correctly distinguishes four cases (coordinator-allocated / caller-supplied-StartTS-only / read-write / caller-pinned-route-version) and is honest about which ones M3/M4 can and cannot protect. The one remaining open item is the M5 Jepsen workload, pre-flagged in the PR description.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6202b9648a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/sharded_coordinator.go
if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 || callerSuppliedStartTS {
return
}
reqs.ObservedRouteVersion = c.engine.Version()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't auto-pin 2PC writes until secondary failures are handled

When this auto-pins a coordinator-owned write-only transaction, ordinary multi-shard requests become subject to the Composed-1 gate even though the 2PC secondary path still treats secondary Commit errors as best-effort: dispatchMultiShardTxn calls commitSecondaryTxns and then returns success, while commitSecondaryTxns only logs errors from commitSecondaryWithRetry. If a route moves after PREPARE/primary COMMIT but before a secondary COMMIT, that secondary can now return ErrComposed1Violation; Dispatch still acknowledges success, leaving the secondary write uncommitted/invisible to readers on the new owner. Before this auto-pin these requests carried ObservedRouteVersion=0 and the gate short-circuited, so this change can turn route churn into an acknowledged partial commit unless secondary Composed-1 errors are made fatal/recoverable before success is returned.

Useful? React with 👍 / 👎.

…partial commits (codex P1 on 6202b96, PR #900)

Codex P1 on 6202b96 identifies a structural conflict between
M3's fail-closed semantics and 2PC's best-effort secondary
semantic.  When dispatchMultiShardTxn propagates a non-zero
ObservedRouteVersion to secondary commit Requests, a route shift
that lands AFTER primary-COMMIT but BEFORE a secondary-COMMIT
makes the secondary FSM return ErrComposed1Violation.
commitSecondaryTxns swallows the error (logs, then continues),
and dispatchMultiShardTxn returns success to the caller — the
secondary's write is uncommitted, invisible to readers on the new
owner, and the txn is acknowledged as durable.

This is a silent partial commit hazard.  The auto-pin added in
66ad5b0 (and the explicit caller-pin path from PR #881)
inadvertently extended the M3 gate to a step where its
fail-closed semantic conflicts with the 2PC contract.

The fundamental tension:

  * Prewrite: gate-active.  A failed gate aborts before any
    locks are placed (groupReadKeys / prewriteTxn return the
    error).  Clean recovery, no partial state.
  * Primary commit: gate-active.  A failed gate triggers
    abortPreparedTxn (the existing line 849 path), which
    releases the prewrite intents on every prepared shard.
    Clean recovery, no partial state.
  * Secondary commit: gate-active is HAZARDOUS.  The primary
    has already durably committed; there is no rollback path.
    commitSecondaryTxns is best-effort by design (its docstring
    says lock-resolution backstops unavailable secondaries),
    but lock resolution itself would fail the same gate on the
    same shifted route.  Net effect: silent partial commit.

Fix: drop ObservedRouteVersion to 0 on every per-secondary
commit Request.  fsm.go:609 short-circuits verifyComposed1 on
observedVer==0, so the gate never fires on secondaries.  This is
NOT a regression of PR #881's "propagate the pin everywhere"
intent — it is a refinement that recognizes secondary commit as
the step where propagation cannot compose with the 2PC contract.
PR #881's underlying concern (prewrite and primary commit
silently dropping the version) remains addressed; only the
secondary commit's propagation is reversed.

Caller audit (loop directive's semantic-change check): the
change is entirely internal to commitSecondaryTxns.  The
observedRouteVersion parameter is removed from the function
signature; dispatchMultiShardTxn no longer passes it.  No
external caller is affected.

Effect on production callers:

  * Auto-pinned write-only coordinator-allocated multi-shard txns
    (e.g. DynamoDB BatchWriteItem with StartTS=0): secondary
    commits no longer trigger Composed-1 on cross-COMMIT route
    shifts.  Pre-fix exposure: silent partial commit under
    operator-triggered SplitRange.  Post-fix: secondary commit
    succeeds (or surfaces a non-Composed-1 error from the
    underlying Raft layer).  Lock-resolution backstop unaffected.
  * Caller-pinned write-only multi-shard txns: same as above —
    the pin still applies to prewrite and primary commit but
    not to secondary commits.  Callers that want secondary-side
    M3 protection need a fatal-secondary 2PC variant (M5+).

Regression test:

  * TestShardedCoordinator_DropsObservedRouteVersionOn2PCSecondaryCommit
    — 2-shard fixture (primary "b" → g1, secondary "x" → g2),
    coordinator-allocated StartTS so auto-pin stamps
    ObservedRouteVersion=7 on the OperationGroup.  Asserts the
    coverage matrix:
      * prewrite on both shards: observedVer=7
      * primary commit on g1:    observedVer=7
      * secondary commit on g2:  observedVer=0
    Verified failing on the parent (g2 commit carried v=7) and
    passing on this commit.

Existing test
TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion
(PR #881 contract) updated to reflect the codex P1 refinement.
It now asserts the same matrix above for the caller-pinned case
(pin=42).  The original assertion (every 2PC envelope carries the
pin) was the regression codex P1 identifies; the updated
assertion encodes the corrected contract.

Verification: go test -race -count=1 ./kv → pass (full pkg, 8s).
make lint → 0 issues.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (d848767):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Pushed as d848767 — codex P1 on 6202b96 addressed.

Note on flaky CI: test (ubuntu-latest) failed on 6202b96 in TestHotKeysAggregatorRaceFree (keyviz pkg, unrelated to this PR — I haven't touched keyviz). Reproduced locally 5/5 PASS. Likely a known flake. The other test job on 6202b96 passed.

Codex P1 — drop M3 on secondary commits. commitSecondaryTxns is best-effort by design (logs per-secondary errors and continues, then dispatchMultiShardTxn returns success while lock resolution backstops unavailable secondaries). Propagating ObservedRouteVersion to secondary commit Requests composes with M3's fail-closed semantic to yield a silent partial commit: a route shift between primary-COMMIT and secondary-COMMIT makes the secondary FSM return ErrComposed1Violation; the error is swallowed; the caller-visible Dispatch ACK still says success; the secondary's write is invisible to readers on the new owner.

Coverage matrix the post-fix code satisfies:

Step observedVer Why
Prewrite (both shards) pinned gate-active before lock placement; clean abort path
Primary commit pinned linearization point; abortPreparedTxn cleans up on failure
Secondary commit 0 best-effort step has no rollback; gate would silently mask

Fix: drop the observedRouteVersion parameter from commitSecondaryTxns entirely; per-secondary Requests are built with ObservedRouteVersion unset (zero). FSM short-circuits verifyComposed1 on observedVer==0 (fsm.go:609).

Caller audit (loop directive): change is entirely internal to commitSecondaryTxns. No external callers affected. Auto-pinned write-only multi-shard txns (DynamoDB BatchWriteItem etc.) lose the silent-partial-commit hazard; explicit-caller-pin callers also lose secondary-side coverage (the fundamental hazard is structural to 2PC + fail-closed gate; M5+ work for a fatal-secondary 2PC variant or lock-resolution-bypass).

Regression test: TestShardedCoordinator_DropsObservedRouteVersionOn2PCSecondaryCommit — 2-shard fixture, asserts the coverage matrix above. Failing on parent (g2 commit carried v=7) / passing on fix.

Existing test TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion (PR #881 contract) updated to encode the refined contract — prewrite + primary commit carry pin, secondary commit drops it. The original "propagate everywhere" assertion was exactly what codex P1 identifies as the bug.

go test -race ./kv pass (full pkg, 8s) / make lint 0 issues.

@claude review

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 8s —— View job


PR Review: d848767 — drop ObservedRouteVersion on 2PC secondary commits

  • Gather context and read changed files
  • Verify Codex P1 fix (drop ObservedRouteVersion from commitSecondaryTxns)
  • Verify regression test correctness
  • Check updated TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion
  • Check all prior review findings still resolved
  • Check TLA+ spec divergence
  • Post findings

✅ Codex P1 — correct

commitSecondaryTxns no longer accepts observedRouteVersion as a parameter. The secondary commit pb.Request at line 1014 is built without the field (zero):

req := &pb.Request{
    IsTxn:     true,
    Phase:     pb.Phase_COMMIT,
    Ts:        startTS,
    Mutations: append([]*pb.Mutation{meta}, keyMutations(grouped[gid])...),
    // ObservedRouteVersion intentionally omitted (zero) ...
}

The comment block (lines 1019–1038) is one of the most thorough explanations in this PR: it names the exact failure mode (route shift between primary-COMMIT and secondary-COMMIT → ErrComposed1Violation → swallowed → silent partial commit), explains why verifyComposed1 short-circuits on observedVer==0 (fsm.go:609), and names what M5+ would need to fix (fatal-secondary 2PC or lock-resolution-bypass). The reasoning is correct.

The call-site comment in dispatchMultiShardTxn (lines 854–859) correctly cross-references the inner comment, so both the orchestration layer and the implementation layer document the same invariant.

✅ Coverage matrix satisfied

The two-phase flow for a 2-shard txn after d848767:

Step ObservedRouteVersion Gate active? Abort path
Prewrite (both shards, Phase_PREPARE) pinned (v=7) abortPreparedTxn
Primary commit (Phase_COMMIT) pinned (v=7) abortPreparedTxn cleans up
Secondary commit (Phase_COMMIT) 0 ❌ gate short-circuits lock-resolver backstop

This matches the PR description's coverage matrix exactly.

✅ Regression test — correct

TestShardedCoordinator_DropsObservedRouteVersionOn2PCSecondaryCommit (lines 611–661):

  • 2-shard fixture: key "b" → g1 (primary), key "x" → g2 (secondary)
  • Coordinator-allocated StartTS → callerSuppliedStartTS=false, ReadKeys empty → auto-pin fires → ObservedRouteVersion=7
  • Each shard receives 2 requests (PREPARE + COMMIT), recorded in order by recordingTransactional
  • g1Txn.requests[0] (prewrite) asserts observedVer=7
  • g2Txn.requests[0] (prewrite) asserts observedVer=7
  • g1Txn.requests[1] (primary commit) asserts observedVer=7
  • g2Txn.requests[1] (secondary commit) asserts observedVer=0 ✓ — this is the regression

The request ordering is correct because prewriteTxn fires PREPARE on all shards first (both g1 and g2 get request[0]), then commitPrimaryTxn fires COMMIT on g1 only (g1.request[1]), then commitSecondaryTxns fires COMMIT on g2 only (g2.request[1]).

✅ Updated TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion — correct

The test comment (lines 579–606) accurately describes the evolution: PR #881 established "propagate everywhere", d848767 refines it to "propagate to prewrite + primary commit, drop on secondary commit". The updated assertion at line 661 (require.Equal(t, uint64(0), g2Txn.requests[1].ObservedRouteVersion)) encodes the refined contract and will catch any regression that re-introduces propagation to the secondary path.

✅ All prior review findings resolved

Finding Round Commit Status
gemini CRITICAL / codex P1: SSI violation on read-write retry R1 66ad5b0
gemini HIGH: nil-clock leaves StartTS=0 R1 66ad5b0
codex P1: auto-pin for read-write txns R2 10123c5
Minor: empty goroutine in test R2 43c55df
Nit: stale resp on allocErr R2 43c55df
claude medium: PrevCommitTS clearing (correctly NOT cleared) R2–R3 57da888
codex P1: callerSuppliedStartTS OCC bypass on retry R4 144ec0c
codex P1: auto-pin for caller-supplied StartTS R5 6202b96
codex P1: 2PC secondary ObservedRouteVersion silent partial commit R6 d848767 ✅ this commit

TLA+ spec divergence — no update needed

Only kv/sharded_coordinator.go, kv/sharded_coordinator_composed1_retry_test.go, and kv/sharded_coordinator_txn_test.go changed in this PR. No modifications to kv/hlc*.go, kv/coordinator.go (HLC lease renewal), kv/transaction.go, kv/lock_resolver.go, kv/fsm.go, store/mvcc_store.go, or distribution/. The 2PC secondary commit change is purely in the Request construction layer — it does not touch timestamp issuance, OCC commit-ts assignment, FSM apply, or catalog versioning. No spec module needs updating.


LGTM. Seven review rounds across this PR have systematically closed every correctness finding. The M1+M2+M3+M4 stack is now complete for write-only txns with coordinator-allocated timestamps — route shifts are transparent to the client for that class. The M5 follow-ups (Jepsen workload, fatal-secondary 2PC, adapter §4.1 migration) are correctly pre-flagged in the PR description.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8487672d9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/sharded_coordinator.go Outdated
Comment on lines +1014 to +1018
req := &pb.Request{
IsTxn: true,
Phase: pb.Phase_COMMIT,
Ts: startTS,
Mutations: append([]*pb.Mutation{meta}, keyMutations(grouped[gid])...),
ObservedRouteVersion: observedRouteVersion,
IsTxn: true,
Phase: pb.Phase_COMMIT,
Ts: startTS,
Mutations: append([]*pb.Mutation{meta}, keyMutations(grouped[gid])...),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve the secondary commit route pin

When a multi-shard transaction is caller-pinned or auto-pinned, this request literal now always sends ObservedRouteVersion=0 for secondary commits. If the secondary key's route moves after PREPARE/primary COMMIT but before this best-effort secondary COMMIT, the old secondary FSM short-circuits verifyComposed1 and applies the write on a group that is no longer the current owner; dispatchMultiShardTxn then acknowledges success. The M3 current-version fence was added specifically to reject these late commits to old owners, so secondary Composed-1 failures need to be made fatal/recoverable rather than disabling the gate for all secondaries.

Useful? React with 👍 / 👎.

…e propagation (codex P1 on d848767, PR #900)

d848767 dropped ObservedRouteVersion from secondary commit
Requests to avoid the silent-partial-commit hazard codex
P1 on 6202b96 flagged: "fail-closed gate + best-effort swallow =
uncommitted secondary, missing-write silent partial commit."

Codex P1 on d848767 correctly identifies that dropping the gate
trades one silent partial commit for another: with the gate off,
the secondary FSM short-circuits verifyComposed1 and APPLIES the
write on a group that is no longer the catalog's owner.  Readers
on the new owner never see it.

Both codex reviews actually agreed on the right fix — make
secondary Composed-1 errors FATAL — but d848767 picked the
wrong interpretation ("disable the gate") instead of the
explicit recommendation ("make fatal/recoverable").  This
commit applies the codex-correct fix:

  1. Restore ObservedRouteVersion propagation to secondary
     commit Requests (revert d848767's literal change).
  2. Add ErrTxnSecondaryRouteShiftedAfterPrimaryCommit sentinel
     (kv/txn_errors.go).  Distinct from ErrComposed1Violation
     by design: the M4 retry path in
     dispatchTxnWithComposed1Retry matches ErrComposed1Violation
     / ErrComposed1VersionGCd; reusing those here would loop the
     retry against a primary that is already durable, re-prewriting
     into existing locks at the old gid — pure wasted work since
     the route catalog will not move backward.
  3. Change commitSecondaryTxns to return (uint64, error).  On
     Composed-1 sentinel from any per-secondary
     commitSecondaryWithRetry, surface
     ErrTxnSecondaryRouteShiftedAfterPrimaryCommit.  Non-Composed-1
     errors stay best-effort (logged and continued) to preserve
     the original lock-resolution-backstop semantic.
  4. dispatchMultiShardTxn propagates the sentinel to its caller.

Trade-off accepted: the caller now sees a fatal error when a
multi-shard 2PC race lands.  The primary's write is durable; the
secondary's is missing.  Application-level recovery (re-read,
re-execute, abort, or compensating write) is required.  This is
strictly better than either silent direction — at least the
caller knows the txn is half-broken and can act on it.

Caller audit (loop directive's semantic-change check):
commitSecondaryTxns's only caller is dispatchMultiShardTxn,
updated in this commit.  The new error type
ErrTxnSecondaryRouteShiftedAfterPrimaryCommit is NOT in the M4
retry's retryable set (isComposed1RetryableError matches
ErrComposed1Violation / ErrComposed1VersionGCd only), so the M4
retry loop in dispatchTxnWithComposed1Retry surfaces it directly
without re-attempting.  Production callers (adapter/redis.go,
adapter/dynamodb.go, etc.) see the new error type at the
Dispatch boundary and propagate it as their HTTP/RPC error.  No
adapter explicitly matches Composed-1 sentinels today, so the
semantic shift (success-on-partial-commit → error-on-partial-commit)
is a pure correctness improvement — pre-fix exposure was silent
data loss under cross-COMMIT route shifts; post-fix is visible
error.

Test changes:

  * NEW TestShardedCoordinator_SurfacesFatalErrorOn2PCSecondaryComposed1
    — 2-shard fixture, primary "b" → g1 succeeds (prewrite +
    commit), secondary "x" → g2 prewrite succeeds, commit
    persistently fails with ErrComposed1Violation (every retry of
    commitSecondaryWithRetry).  Asserts the Dispatch error is
    ErrTxnSecondaryRouteShiftedAfterPrimaryCommit and NOT
    ErrComposed1Violation (so M4 retry does not match).
    Verified failing on the parent (Dispatch returned nil error
    because commitSecondaryTxns swallowed) and passing on this
    commit.

  * RENAMED TestShardedCoordinator_DropsObservedRouteVersionOn2PCSecondaryCommit
    → TestShardedCoordinator_PropagatesObservedRouteVersionOn2PCSecondaryCommit.
    Inverted the assertion: secondary commit's
    ObservedRouteVersion MUST equal the auto-pinned engine version,
    not zero.  The d848767 contract was the temporary
    "drop-the-gate" posture; this commit restores propagation.

  * REVERTED TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion
    to the original PR #881 contract: every 2PC envelope on every
    shard carries the pinned version.  The d848767 refinement
    ("drop on secondary commit") is undone in favour of the new
    fatal-sentinel approach.

Verification: go test -race -count=1 ./kv → pass (full pkg, 9s).
make lint → 0 issues.

Note on the d848767 CI failure: TestRedis_StreamXReadLatencyIsConstant
in adapter/ failed at 48.35s on the loaded ubuntu-latest runner.
That is a latency-sensitive perf test unrelated to this PR
(no Redis stream code touched).
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (6a458a2):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Pushed as 6a458a2 — codex P1 on d848767 addressed.

The d848767 fix was the wrong interpretation. Codex P1 on 6202b96 flagged "fail-closed gate + best-effort swallow = silent partial commit (missing-write direction)" and recommended "make secondary errors fatal/recoverable." I picked the wrong option ("disable the gate") instead of the explicit recommendation ("make fatal"). Codex P1 on d848767 correctly showed that the disabled-gate posture trades one silent partial commit for another (stale-owner direction): the secondary FSM short-circuits verifyComposed1 and applies the write on a group that's no longer the catalog's owner.

Correct fix — make secondary Composed-1 fatal with a distinct sentinel.

  1. Restored ObservedRouteVersion propagation to secondary commits.
  2. New ErrTxnSecondaryRouteShiftedAfterPrimaryCommit sentinel — distinct from ErrComposed1Violation so the M4 retry path doesn't loop (the primary is durable; re-prewriting would conflict).
  3. commitSecondaryTxns now returns (uint64, error). Composed-1 from a per-secondary commit surfaces as the new sentinel; non-Composed-1 errors stay best-effort (logged+continued) preserving lock-resolution-backstop semantic.
  4. dispatchMultiShardTxn propagates the sentinel to the caller.

Coverage matrix (now restored to PR #881's original):

Step ObservedRouteVersion Composed-1 outcome
Prewrite (both shards) pinned abort cleanly
Primary commit pinned abortPreparedTxn cleans up
Secondary commit pinned surface ErrTxnSecondaryRouteShiftedAfterPrimaryCommit (fatal, no retry)

Caller audit (loop directive): commitSecondaryTxns has only one caller (dispatchMultiShardTxn). The new sentinel is NOT in isComposed1RetryableError's set, so M4 retry doesn't match. Adapters propagate the error through their HTTP/RPC error path — semantic change is strict correctness improvement: pre-fix silent data loss → post-fix visible error.

Tests:

  • NEW TestShardedCoordinator_SurfacesFatalErrorOn2PCSecondaryComposed1 — 2-shard fixture, secondary persistently fails Composed-1 on commit; asserts Dispatch returns the new sentinel and NOT ErrComposed1Violation. Failing on parent / passing on fix.
  • INVERTED TestShardedCoordinator_PropagatesObservedRouteVersionOn2PCSecondaryCommit (was Drops… in d848767) — secondary commit's observedVer=7 (restored, not 0).
  • REVERTED TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion to original PR kv(composed1): M1 plumbing — ObservedRouteVersion in OperationGroup → pb.Request #881 contract — every 2PC envelope on every shard carries pinned version.

go test -race ./kv pass (full pkg, 9s) / make lint 0 issues.

Note on d848767 CI failure: TestRedis_StreamXReadLatencyIsConstant (adapter/redis, 48.35s) — latency-sensitive perf test on loaded ubuntu-latest runner, unrelated to this PR.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 12s —— View job


PR Review: 6a458a2 — surface secondary Composed-1 as fatal sentinel, restore propagation

  • Gather context and understand the request
  • Read current state of changed files (6a458a2)
  • Verify the new ErrTxnSecondaryRouteShiftedAfterPrimaryCommit sentinel approach
  • Verify commitSecondaryTxns return type change and propagation
  • Verify ObservedRouteVersion propagation restored correctly
  • Verify new and updated tests
  • Check TLA+ spec divergence
  • Post findings

The revert of d848767 is correct, and the approach here (make secondary Composed-1 fatal with a distinct sentinel rather than disabling the gate) is the right resolution of the two competing codex P1 findings.

✅ Sentinel design — correct

ErrTxnSecondaryRouteShiftedAfterPrimaryCommit is the right answer to the tension between the two codex P1 findings:

Option What happens Problem
Swallow error (pre-66ad5b0) Best-effort secondary logs + success Missing-write silent partial commit (codex P1 on 6202b96)
Drop gate on secondary (d848767) Writes land on stale owner Stale-owner silent partial commit (codex P1 on d848767)
Fatal sentinel (6a458a2) Caller-visible error, lock-resolver backstop Honest: txn state is uncertain, caller can recover

The new sentinel is correctly NOT matched by isComposed1RetryableError (line 669 only tests ErrComposed1Violation | ErrComposed1VersionGCd), so dispatchTxnWithComposed1Retry surfaces it without looping. The M4 loop termination path is clean:

  1. dispatchTxnWithComposed1Retry attempt 0 → dispatchTxndispatchMultiShardTxn
  2. Primary commits (durable).
  3. commitSecondaryTxnscommitSecondaryWithRetry returns Composed-1.
  4. isComposed1RetryableErrortrue → returns ErrTxnSecondaryRouteShiftedAfterPrimaryCommit.
  5. dispatchMultiShardTxn propagates → dispatchTxn returns the sentinel.
  6. isComposed1RetryableError(ErrTxnSecondaryRouteShiftedAfterPrimaryCommit)false → surfaces, no retry.

commitSecondaryTxns signature change — correct

The (uint64, error) return type is right. The early-return on Composed-1 (return maxIndex, errors.WithStack(ErrTxnSecondaryRouteShiftedAfterPrimaryCommit) at line 1062) correctly abandons remaining secondaries in the group list — once one secondary has a route shift that prevents commit, the txn state is already half-committed and application-level recovery is needed regardless of the others.

ObservedRouteVersion propagation restored — correct

commitSecondaryTxns receives observedRouteVersion uint64 and passes it through to pb.Request.ObservedRouteVersion at line 1046. dispatchMultiShardTxn correctly passes the parameter (line 867). The coverage matrix is cleanly restored to PR #881's original posture:

Step ObservedRouteVersion Composed-1 outcome
Prewrite (both shards) pinned abort cleanly via abortPreparedTxn
Primary commit pinned abortPreparedTxn on failure
Secondary commit pinned ErrTxnSecondaryRouteShiftedAfterPrimaryCommit (fatal)

✅ Regression test — correct

TestShardedCoordinator_SurfacesFatalErrorOn2PCSecondaryComposed1 (lines 715–661):

  • g2Errs = [nil, ErrComposed1Violation × txnSecondaryCommitRetryAttempts] — correctly accounts for the inner commitSecondaryWithRetry retry loop (3 attempts before returning lastErr = ErrComposed1Violation).
  • require.ErrorIs(t, err, ErrTxnSecondaryRouteShiftedAfterPrimaryCommit) — asserts the new sentinel ✅
  • require.NotErrorIs(t, err, ErrComposed1Violation) — asserts M4 retry does not have a false surface ✅

TestShardedCoordinator_PropagatesObservedRouteVersionOn2PCSecondaryCommit correctly reverts the d848767 assertion back to observedVer=7 for all four envelopes.

TestShardedCoordinatorDispatchTxn_CrossShardPropagatesObservedRouteVersion correctly loops over all requests and asserts pinnedVer everywhere, eliminating the special-cased observedVer=0 assertion from d848767.

✅ All prior review findings resolved

Finding Round Status
gemini CRITICAL / codex P1: SSI on read-write retry R1 66ad5b0
gemini HIGH: nil-clock StartTS=0 R1 66ad5b0
codex P1: auto-pin for read-write txns R2 10123c5
Minor: empty goroutine R2 43c55df
Nit: stale resp on allocErr R2 43c55df
claude medium: PrevCommitTS (correctly preserved) R2–R3 57da888
codex P1: callerSuppliedStartTS OCC bypass on retry R4 144ec0c
codex P1: auto-pin for caller-supplied StartTS R5 6202b96
codex P1: 2PC secondary gate — silent partial commit (both directions) R6–R7 6a458a2 (this commit)

⚠️ Minor — commitSecondaryWithRetry sleeps between Composed-1 retries

commitSecondaryWithRetry sleeps 20ms + 40ms = 60ms before returning the final error on 3 Composed-1 attempts. These sleeps are useful for transient leader-election drops but wasteful for Composed-1 (the catalog won't move back during the retry window). The caller already correctly classifies the final error as fatal, so the semantics are correct — the only cost is latency on the error path. Not blocking; flagging for M5+ or a separate refactor to add a shouldRetry(err) bool predicate in commitSecondaryWithRetry.

TLA+ spec divergence — no update needed

6a458a2 is confined to kv/sharded_coordinator.go, kv/sharded_coordinator_composed1_retry_test.go, kv/sharded_coordinator_txn_test.go, and kv/txn_errors.go. No changes to kv/hlc*.go, kv/coordinator.go HLC renewal paths, kv/transaction.go, kv/lock_resolver.go, kv/fsm.go, store/mvcc_store.go, or distribution/. The sentinel classification and secondary-commit error propagation don't touch timestamp issuance, OCC commit-ts assignment, FSM apply, or catalog versioning. No spec module needs updating.


LGTM. Eight review rounds have been thorough. The final shape correctly handles all four classes of Dispatch caller, correctly gates the M3 sentinel at every 2PC step, and surfaces secondary Composed-1 failures as a distinct honest error rather than either swallowing or disabling the gate. The one remaining open item is the M5 Jepsen workload, pre-flagged in the PR description.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a458a2840

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kv/sharded_coordinator.go
if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 || callerSuppliedStartTS {
return
}
reqs.ObservedRouteVersion = c.engine.Version()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid auto-pinning resolver-routed transactions

When a partition resolver routes a write-only transaction to a group that differs from the byte-range engine (e.g. SQS partition keys), this auto-pin makes the FSM enforce the engine snapshot rather than the resolver's ownership: ShardRouter.ResolveGroup sends the raw key to the resolver first, but verifyComposed1 later calls the route-history OwnerOf on the raw mutation key, whose snapshots come only from distribution.Engine. With production FSMs wired via WithRouteHistory, a coordinator-owned txn with no StartTS/ReadKeys that used to commit unpinned can now be rejected with ErrComposed1Violation on both M4 attempts even though the resolver selected the correct group. The auto-pin needs to be skipped or made resolver-aware for resolver-claimed keys.

Useful? React with 👍 / 👎.

…a458a2, PR #900)

Codex P1 on 6a458a2 identifies a final auto-pin skip case the
prior rounds missed: when a PartitionResolver is wired (SQS
HT-FIFO partition keys, etc.), ShardRouter routes via the resolver
but verifyComposed1 calls route-history OwnerOf on the raw
mutation key against the BYTE-RANGE engine snapshot.  The engine
has no knowledge of the resolver's mapping, so the gate
spuriously rejects resolver-routed commits even when the resolver
picked the correct gid.

Concrete failure mode (codex):

  1. ShardRouter.ResolveGroup → resolver returns gid R.
  2. Dispatch-entry auto-pin stamps ObservedRouteVersion =
     engine.Version() on the request.
  3. FSM at gid R applies; verifyComposed1 calls OwnerOf on the
     raw mutation key against the engine's snapshot.
  4. Engine's snapshot says gid B (default byte-range owner),
     not gid R.
  5. Mismatch → ErrComposed1Violation.
  6. M4 retry re-reads engine.Version() (unchanged), retries,
     fails again.  Spurious double failure.

With production FSMs wired via WithRouteHistory, a
coordinator-owned txn with no StartTS/ReadKeys that used to commit
unpinned now gets rejected.  The auto-pin's "blanket protection"
framing breaks for resolver-routed keys — this is the FIFTH
distinct scenario where the auto-pin overreaches:

  R1 (66ad5b0): ReadKeys non-empty (pre-shift reads)
  R2 (10123c5): same — gated
  R3 (144ec0c): callerSuppliedStartTS (adapter-layer pre-reads)
  R4 (6202b96): same — gated
  R5 (6a458a2): 2PC secondary commits — surfaced fatally
  R6 (this):    resolver-claimed keys — gated

Fix: extend maybeAutoPinObservedRouteVersion's skip conditions to
include "any element's key is recognised by the partition
resolver".  Extracted anyResolverClaimedKey helper that:

  - returns false fast when no resolver is wired (the
    byte-range-only deployment — most common case);
  - iterates Elems and calls resolver.RecognisesPartitionedKey
    on each non-empty key;
  - returns true on first claim.

Adapter for SQS HT-FIFO (the main resolver consumer) now
correctly bypasses the M3 gate.  Resolver-aware M3 is M5+ work.

Caller audit (loop directive's semantic-change check): the
change is internal to maybeAutoPinObservedRouteVersion.  The
auto-pin already had skip conditions for engine==nil,
ObservedRouteVersion!=0, ReadKeys non-empty, and
callerSuppliedStartTS; adding the resolver-claim skip is a
straightforward extension.  No external caller cares (the only
caller is dispatchTxnWithComposed1Retry which threads the same
flag through).  Production callers using the partition resolver
(SQS HT-FIFO via adapter/sqs_messages.go,
adapter/sqs_messages_batch.go, etc.) see resolver-routed txns
commit instead of being spuriously rejected — strict correctness
improvement.

Coordinator-routed (byte-range) txns continue to auto-pin as
before; they were never affected by this hazard.

Regression test:

  * TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForResolverRoutedTxn
    — installs a stubPartitionResolver claiming prefix "part:".
    Dispatches a write-only txn with key "part:q1:1234" and no
    StartTS/ReadKeys/ObservedRouteVersion.  Engine pinned at
    version 9.  Asserts the request that reaches Commit carries
    ObservedRouteVersion=0 (auto-pin did NOT fire).  Verified
    failing on parent (request carried v=9) and passing here.

  Stub stubPartitionResolver is a minimal in-test
  PartitionResolver: it claims every key whose prefix matches a
  configurable byte sequence and routes them to a fixed gid.
  Sufficient for exercising the codex P1 skip path without
  pulling in the SQS-specific partition routing wiring.

Verification: go test -race -count=1 ./kv → pass (full pkg, 10s).
make lint → 0 issues.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

Pushed as e08d44f — codex P1 on 6a458a2 addressed.

Codex P1 — skip auto-pin for resolver-claimed keys. When a PartitionResolver is wired (SQS HT-FIFO partition keys etc.), ShardRouter routes via the resolver but verifyComposed1 calls route-history OwnerOf against the BYTE-RANGE engine snapshot. The engine has no knowledge of the resolver's mapping, so the gate spuriously rejects resolver-routed commits even when the resolver picked the correct gid. M4 retry re-reads engine.Version() (unchanged) and fails the same way — spurious double failure.

Fix: extend maybeAutoPinObservedRouteVersion's skip conditions with a fifth case: "any element's key is recognised by the partition resolver." Extracted anyResolverClaimedKey helper that returns false fast when no resolver is wired (the byte-range-only deployment), iterates Elems otherwise, and calls resolver.RecognisesPartitionedKey per non-empty key.

Skip-condition summary:

# Skip when Reason
1 engine == nil no catalog to pin against
2 ObservedRouteVersion != 0 caller already pinned
3 len(ReadKeys) != 0 caller has pre-shift reads
4 callerSuppliedStartTS caller-owned snapshot via readTS
5 resolver-claimed key byte-range engine doesn't reflect resolver routing

Caller audit (loop directive): change is internal to maybeAutoPinObservedRouteVersion. Only caller is dispatchTxnWithComposed1Retry (unchanged). Production resolver consumers (adapter/sqs_messages.go, adapter/sqs_messages_batch.go, etc.) now correctly commit instead of being spuriously rejected — strict correctness improvement. Byte-range-only callers unaffected.

Regression test: TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForResolverRoutedTxn — installs stubPartitionResolver claiming prefix "part:", dispatches a write-only txn with key "part:q1:1234", asserts observedVer=0 (auto-pin must NOT fire). Failing on parent / passing on fix.

go test -race ./kv pass (full pkg, 10s) / make lint 0 issues.

@claude review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (e08d44f):

  • kv/sharded_coordinator.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 22s —— View job


PR Review: e08d44f — skip auto-pin for resolver-claimed keys (codex P1 on 6a458a2)

  • Gather context and read changed files
  • Verify codex P1 fix (anyResolverClaimedKey helper + skip condition 5)
  • Verify regression test correctness
  • Check all prior review findings still resolved
  • Check TLA+ spec divergence

✅ Codex P1 — correct

anyResolverClaimedKey (lines 574–587) is clean:

  • c.router == nil || c.router.partitionResolver == nil fast-paths the common byte-range-only deployment with zero overhead — router is always set in NewShardedCoordinator, but the defensive nil guard is harmless.
  • Iterates elems, skips nil/empty-key entries, returns true on the first resolver-claimed key. The early-return semantics are right: if ANY key in the batch is resolver-owned, the whole auto-pin should be skipped (a mixed batch would fail anyway, since the resolver-claimed key's FSM would call verifyComposed1 against the byte-range engine snapshot which does not know the resolver's mapping).

maybeAutoPinObservedRouteVersion (line 559) now reads:

if c.engine == nil || reqs.ObservedRouteVersion != 0 || len(reqs.ReadKeys) != 0 || callerSuppliedStartTS {
    return
}
if c.anyResolverClaimedKey(reqs.Elems) {
    return
}
reqs.ObservedRouteVersion = c.engine.Version()

The five skip conditions are now complete and coherent:

# Skip when Reason
1 engine == nil no catalog to pin against
2 ObservedRouteVersion != 0 caller already pinned
3 len(ReadKeys) != 0 caller has pre-shift reads (Redis MULTI/EXEC)
4 callerSuppliedStartTS adapter-layer read invisible to coordinator (S3/SQS/DynamoDB pattern)
5 resolver-claimed key byte-range engine snapshot doesn't reflect resolver routing; gate would spuriously reject every resolver-routed txn

Retry path is correct without a resolver check. The retry fires only when isComposed1RetryableError is true, which requires the M3 gate to have returned one of the two sentinels. But with observedRouteVersion=0 (what skip-5 leaves on the request), the FSM's verifyComposed1 always returns nil early — the M3 gate cannot return a Composed-1 sentinel for zero-versioned requests. So the retry loop's missing resolver-key guard is not a gap: the code path cannot be reached for resolver-claimed keys.

✅ Regression test — correct

TestShardedCoordinator_DoesNotAutoPinObservedRouteVersionForResolverRoutedTxn (lines 781–814):

  • Engine at v=9, resolver claiming prefix "part:", dispatches "part:q1:1234" with StartTS=0, ReadKeys=nil, ObservedRouteVersion=0 — none of the existing four skip gates would fire.
  • composed1RetryingTransactional with firstErr: nil → first attempt succeeds, exactly one Commit call.
  • Assertion: observedSeen[0] == 0 — auto-pin must NOT fire. If skip-5 were missing, observedSeen[0] would be 9 and the test would fail.

The stubPartitionResolver prefix-match loop (lines 734–743) is correct — it matches any key whose first len(claimedPrefix) bytes equal "part:". No use of bytes.HasPrefix (from Go 1.20+) for compat, which is fine.

✅ All prior review findings resolved

Finding Round Commit Status
gemini CRITICAL / codex P1: SSI on read-write retry R1 66ad5b0
gemini HIGH: nil-clock leaves StartTS=0 R1 66ad5b0
codex P1: auto-pin for read-write txns R2 10123c5
Minor: empty goroutine in test R2 43c55df
Nit: stale resp on allocErr R2 43c55df
claude medium: PrevCommitTS clearing (correctly preserved) R2–R3 57da888
codex P1: callerSuppliedStartTS OCC bypass on retry R4 144ec0c
codex P1: auto-pin for caller-supplied StartTS R5 6202b96
codex P1: 2PC secondary Composed-1 → silent partial commit (both directions) R6–R7 6a458a2
codex P1: resolver-claimed key auto-pin R8 e08d44f ✅ this commit

TLA+ spec divergence — no update needed

e08d44f is confined to anyResolverClaimedKey and maybeAutoPinObservedRouteVersion in sharded_coordinator.go. No changes to kv/hlc*.go, kv/coordinator.go HLC lease-renewal paths, kv/transaction.go, kv/lock_resolver.go, kv/fsm.go, store/mvcc_store.go, or distribution/. The resolver check is purely in the Dispatch-entry orchestration — it does not touch timestamp issuance, OCC commit-ts assignment, FSM apply, or catalog versioning. No spec module needs updating.


LGTM. Nine review rounds across this PR have been thorough. All correctness findings are closed. The final skip-condition table in maybeAutoPinObservedRouteVersion's comment is now complete and accurate. The one remaining open item is the M5 Jepsen workload (route-shuffle generator for the DynamoDB suite), pre-flagged in the PR description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant