Skip to content

docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI#896

Merged
bootjp merged 10 commits into
mainfrom
backup/m6-cli-design
Jun 1, 2026
Merged

docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI#896
bootjp merged 10 commits into
mainfrom
backup/m6-cli-design

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Jun 1, 2026

Summary

Phase 0b M6 capstone design doc. M1–M5 are merged; this proposal pins down the CLI implementation specifics the parent design doc (2026_05_25_proposed_snapshot_logical_encoder.md) left at sketch level.

What lands

docs/design/2026_06_01_proposed_snapshot_encode_cli.md — new proposal covering:

  • CLI surface: --input, --output, --adapter (mirrors decoder parsing), --last-commit-ts, --self-test, --scratch-root. Decoder-equivalent exit codes (0 success / 1 user error / 2 data error).
  • --last-commit-ts override: fail-closed T ≥ manifest.last_commit_ts validation per parent §"MVCC re-encoding"; preserves the uniform-stamping invariant (single substitution everywhere last_commit_ts appears).
  • ENCODE_INFO.json sidecar: encoder version, key-format version, output .fsm SHA-256, override record, manifest cluster_id propagation, self-test result. Sidecar (not inside the .fsm — EKVPBBL1 byte format is fixed).
  • Round-trip self-test: --self-test runs encode → decode → structural diff against the input tree, exits 2 with mismatch.txt on first-N-differing-paths if the diff is non-empty. Default off (doubles encode time and disk), but CI runs it on every encoder PR.
  • Decoder cleanup folded in: parent doc flags Source.FSMCRC32C as dead (native .fsm has no CRC32C footer); recommended path is repurpose-as-SHA-256, fallback is remove.
  • Test plan: 8 tests covering round-trip success, override validation (lower rejected, equal+higher accepted), scope parsing, missing-manifest, corruption detection, and ENCODE_INFO.json round-trip + version gate.

Why doc-first

Per CLAUDE.md "Design-doc-first workflow" — M6 touches a new entrypoint (cmd/elastickv-snapshot-encode binary) plus a new file format (ENCODE_INFO.json), so the proposal lands first; implementation is a separate follow-up PR after this is reviewed and accepted.

Test plan

  • Doc reviewable on its own — every claim cites a concrete file:line (decoder CLI prior art at cmd/elastickv-snapshot-decode/main.go, encoder constructors at internal/backup/encode_*.go, manifest discipline at WriteManifest per gemini r1 backup: cmd/elastickv-snapshot-decode CLI + CHECKSUMS writer #810).
  • Decision gates explicit: --last-commit-ts T < manifest fail-closed; self-test default-off with reasoning.
  • 5-pass self-review included (data loss / concurrency / performance / consistency / test coverage).
  • Implementation lands in a follow-up PR (single slice; CLI + ENCODE_INFO + self-test are tightly coupled).

Capstone milestone for Phase 0b. M1-M5 are merged
(#807/#841/#847/#849/#864/#846/#892); M6 wires them into a single binary
operators can run, plus the round-trip self-test the parent doc requires.

Specifies the CLI surface (--input/--output/--adapter/--last-commit-ts/
--self-test/--scratch-root), the fail-closed --last-commit-ts T >=
manifest.last_commit_ts validation, the ENCODE_INFO.json provenance
sidecar (encoder version, key-format version, fsm sha256, override
record, self-test result), and the round-trip self-test gate
(encode -> decode -> structural diff). Eight tests cover the four
user-visible behaviors + the two safety nets.

Folds the parent doc's "Decoder cleanup" note: Source.FSMCRC32C is dead
(native .fsm has no CRC32C footer). M6 either repurposes it as a
whole-file SHA-256 (recommended) or removes it; the sidecar already
carries the anchor either way.

Doc-first per CLAUDE.md "Design-doc-first workflow" - touches a new
entrypoint (cmd/elastickv-snapshot-encode binary). Implementation lands
in a follow-up PR after this proposal is reviewed and accepted.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

@claude review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 4 minutes and 53 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eef293d8-3226-49dd-80a4-9e9cbc1355e7

📥 Commits

Reviewing files that changed from the base of the PR and between 2b118a4 and 1736435.

📒 Files selected for processing (2)
  • docs/design/2026_05_25_partial_snapshot_logical_encoder.md
  • docs/design/2026_06_01_proposed_snapshot_encode_cli.md
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch backup/m6-cli-design

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 33s —— View job


Review

  • Read the proposed design doc
  • Read parent design doc for context
  • Read existing decoder CLI for prior art comparison
  • Read encoder constructors / manifest for accuracy check
  • Compile review findings

Overall the doc is tight, internally consistent with the parent design, and correctly models the existing code. I cross-checked against internal/backup/encode.go, manifest.go, and cmd/elastickv-snapshot-decode/main.go. The findings below are roughly priority-ordered.


Lifecycle housekeeping (CLAUDE.md compliance, should fix before merge)

The parent doc is still named 2026_05_25_proposed_snapshot_logical_encoder.md. Per CLAUDE.md:

rename *_proposed_*.md*_partial_*.md once the first milestone ships

M1–M5 are already merged (cited in this PR's summary). The parent should have been git mvd to 2026_05_25_partial_snapshot_logical_encoder.md when M1 landed. This PR should include that rename.


Decoder cleanup — conflicting milestone attribution

Doc under review, line 125–129:

M6 lands this rename in the same PR

Parent doc 2026_05_25_proposed_snapshot_logical_encoder.md, lines 367–372:

Decoder cleanup folded into M1. … Tracked here so it is not lost; it is a decoder-struct change, not part of this doc PR.

The parent explicitly assigned Source.FSMCRC32C cleanup to M1. Looking at the current internal/backup/manifest.go:76, FSMCRC32C is still present and still described as the CRC32C value — the M1 PR didn't land the cleanup. The M6 doc reclaims this, which is fine, but the doc should acknowledge the re-scoping explicitly ("the parent doc targeted M1; that scope was deferred and now lands in M6") so a reader diffing the two docs doesn't wonder which one is authoritative.


TestCLISelfTestDetectsCorruption — missing injection mechanism

Doc line 113:

Patch the in-tree encoder to deliberately mangle one byte after writing the .fsm; --self-test catches the diff, exits 2, writes mismatch.txt.

The existing snapshotBuilder.WriteTo takes an io.Writer (encode.go:199). The test as described needs to intercept bytes after WriteTo but before the self-test reads the file. The doc should specify the injection point — e.g., a test-only corruptingWriter wrapper, or a hook parameter on the CLI's orchestrator that tests can supply. Without this, the implementation will have to invent a pattern; it's better to pin it here since the test plan is meant to be a spec.


Exit code classification for --last-commit-ts T < manifest

Doc line 33:

1 on user-input error (flag parse, missing file), 2 on data-correctness failure (manifest mismatch, encoder error, self-test diff)

Doc test table:

TestCLIRejectsLowerLastCommitTSOverride → exit 2 with the HLC-ceiling error

T < manifest.last_commit_ts is triggered by a user-supplied flag but is classified as exit 2. This is defensible (the manifest value is the data constraint, not a flag-parse problem), but is non-obvious for shell wrapper authors who will expect all "wrong flag" errors to be exit 1. A one-sentence rationale in the exit-codes section ("flag-supplied --last-commit-ts that violates a data constraint from the manifest is a data error, not a user-input error") would prevent confusion.


ENCODE_INFO.json sidecar path not pinned

The doc says "emitted next to the output .fsm" but does not state the filename. The self-test failure artifact is called mismatch.txt and is also "next to the output .fsm" — at the same level. The implementation needs to know the exact filename. Given the file is called ENCODE_INFO.json in the doc schema example, the path is presumably <output_dir>/ENCODE_INFO.json. One line in the §"ENCODE_INFO.json" section to pin this would eliminate ambiguity.


Scratch dir cleanup policy not stated

Doc line 85:

After the primary .fsm is written + sha256-anchored, invoke backup.DecodeSnapshot on the produced bytes into a scratch directory (default os.TempDir()/encode-self-test-<random>/; operator can pin via --scratch-root)

The doc does not say what happens to the scratch dir after the self-test. On success: cleaned up or left for inspection? On failure: left for forensics alongside mismatch.txt, or cleaned up anyway? This is an operational detail restore operators will ask. Recommended: clean up on success, leave on failure (same pattern as most test harnesses).


P1 integration test from parent doc not listed

The parent doc's test plan includes:

TestEncoderProducesLoadableSnapshot | Output placed under a fresh node's fsm-snap/ + snap/ loads on Open; every adapter serves the original data

The M6 doc's test plan (8 tests) does not include this. The parent doc classifies it as P1, so it may be intentionally deferred, but the M6 doc should say so explicitly. If it's in scope for this PR (the parent listed it under M6), omitting it from the test plan table leaves a gap in the "gold-standard correctness" story that the self-test alone doesn't cover (self-test proves round-trip equality; the cluster test proves the .fsm actually boots a node and serves reads).


Minor: encoder_version stamping mechanism not mentioned

The decoder's version = "dev" (decoder main.go:45) is stamped via -ldflags "-X main.version=...". The ENCODE_INFO.json schema shows "encoder_version": "<git rev or release tag>" but the doc doesn't mention the build-time -ldflags mechanism. One-liner addition for parity with the decoder doc comment would help the build pipeline author.


Minor: adapters_enabled ordering not specified

The schema example shows ["redis","dynamodb","s3","sqs"]. The doc should state whether this reflects the canonical processing order (consistent with encode.go's fan-out order) or the CLI input order. Given the encoder fans out adapters in deterministic order, the canonical order should be specified so the sidecar is reproducible across runs.


Code accuracy — backup.DecodeSnapshot vs DecodeLiveEntries

The doc says the self-test invokes backup.DecodeSnapshot into a scratch directory (directory-level diff). The encode package also exports DecodeLiveEntries (encode.go:277) for entry-level in-process comparison. The doc's choice of the directory-level path is correct for the stated goal (structural diff of the adapter tree), but the distinction from DecodeLiveEntries is worth one sentence — a reader of the code will see both and wonder which one the self-test uses.


Verdict

This doc is solid enough to unblock implementation with the issues above addressed. The lifecycle rename and the milestone-attribution clarification are the two items I'd call mandatory before merge; the rest are enhancements to implementation fidelity.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request proposes the design for the cmd/elastickv-snapshot-encode CLI, which orchestrates snapshot encoding and supports an optional round-trip self-test. The feedback highlights several critical design improvements for robustness: dynamically naming the ENCODE_INFO.json sidecar to prevent overwrites when multiple snapshots share a directory, appending unique subdirectories to the --scratch-root to allow safe concurrent self-tests, and ensuring eager cleanup of scratch directories and stale mismatch.txt files to avoid disk exhaustion and operator confusion.

Comment on lines +45 to +47
## `ENCODE_INFO.json` (provenance + integrity)

A new sidecar emitted next to the output `.fsm`, NOT inside it (the EKVPBBL1 byte format is fixed). Keep the schema minimal — restore operators need to confirm "encoded for the right cluster, by the right encoder version, against this exact file":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the sidecar is statically named ENCODE_INFO.json, storing multiple .fsm files in the same directory (e.g., /backups/node1.fsm and /backups/node2.fsm) will cause their sidecar files to overwrite each other. To prevent collisions, the sidecar should be named dynamically based on the output .fsm filename (for example, <fsm_name>.encode_info.json or by appending .encode_info.json to the output path).


Implementation:

1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a scratch directory (default `os.TempDir()/encode-self-test-<random>/`; operator can pin via `--scratch-root` for tmpfs).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the operator pins --scratch-root to a specific directory, running multiple concurrent encodes with --self-test enabled will cause them to use the exact same directory, leading to file collisions and false test failures. To support safe concurrent execution, the tool should use the pinned --scratch-root as a base directory and append a unique/random subdirectory name (e.g., <scratch-root>/encode-self-test-<random>/) rather than using the pinned path directly.

2. Diff the scratch tree against `--input`. The diff is structural, not byte-equal on transient fields:
- **Excluded from compare:** `MANIFEST.json` `wall_time_iso`, any future `phase`-specific provenance, the `MANIFEST.json` `source.fsm_path` (decoder's record of where it read from, not data).
- **Required equal:** every adapter subdir's full tree (filenames + bytes), `MANIFEST.json` adapter-scope arrays, `last_commit_ts`.
3. On mismatch: the CLI exits with code 2, writes a `mismatch.txt` next to the output `.fsm` listing the first N differing paths, and the `ENCODE_INFO.json` records `self_test.matched: false` before the exit. Restore operators inspect `mismatch.txt`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The design should specify that the temporary scratch directory created for the self-test must be eagerly cleaned up (e.g., via os.RemoveAll) upon completion of the self-test (both on success and failure) to prevent unbounded disk space usage and potential disk exhaustion. Additionally, any pre-existing mismatch.txt at the destination should be explicitly removed at the start of a new run or upon a successful self-test to avoid confusing operators with stale failure reports.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31c0a0f83d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


Implementation:

1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a scratch directory (default `os.TempDir()/encode-self-test-<random>/`; operator can pin via `--scratch-root` for tmpfs).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Account for DecodeSnapshot not writing MANIFEST.json

When --self-test is implemented this way, the scratch tree produced by the library call will not contain MANIFEST.json or CHECKSUMS: internal/backup/decode.go explicitly documents that DecodeSnapshot only dispatches adapter files and leaves manifest emission to the cmd wrapper. Since the next step requires comparing MANIFEST.json fields (last_commit_ts, adapter scopes, source.fsm_path exclusions), following this spec either makes every self-test fail on a missing manifest or skips the manifest checks that are supposed to validate the effective override timestamp.

Useful? React with 👍 / 👎.

Four substantive design corrections from the freshly-opened M6 PR review.

1. ENCODE_INFO.json sidecar is path-derived, not static (gemini medium).
   /backups/node1.fsm -> /backups/node1.fsm.encode_info.json so
   multiple .fsm files can share a directory without sidecar collisions.
   Pinned by TestCLIEncodeInfoPathDerivedFromOutput +
   TestCLIEncodeInfoTwoFilesNoCollision.

2. Self-test scratch subdir is ALWAYS a unique <random> directory
   under the resolved base, even when --scratch-root is operator-pinned
   (gemini medium). Concurrent encodes against the same base would
   otherwise collide and produce false failures. Pinned by
   TestCLISelfTestPinnedScratchRootStillUniqueSubdir.

3. Eager cleanup: os.RemoveAll on the scratch subdir in every exit
   path (success, mismatch, error) via defer at MkdirTemp time;
   stale mismatch.txt is removed at start of every run (gemini medium).
   Pinned by TestCLISelfTestCleansScratchOnAllPaths.

4. Self-test header check uses DecodeSnapshot's in-memory DecodeResult
   instead of materializing MANIFEST.json in the scratch (codex P2).
   DecodeSnapshot is documented as not emitting MANIFEST.json; the
   prior design implicitly relied on its presence to compare
   last_commit_ts. The new design compares Result.Header.LastCommitTS
   against the effective T (manifest or override), avoiding a redundant
   manifest synthesis step. The mismatch.txt format also picks up the
   same path-derivation rule (<output>.mismatch.txt).

Self-review pass 5 + test plan updated: 12 tests now (was 8), the four
new ones pin the regressions above. No implementation impact on the
encoder slices already merged.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v2 (fb6ba10): addresses all 4 review findings on the design.

Gemini medium #1 (sidecar collision): ENCODE_INFO.json is now path-derived: <output>.encode_info.json. Multiple .fsm files in the same dir produce distinct sidecars.

Gemini medium #2 (scratch-root collision): Self-test scratch is ALWAYS a unique encode-self-test-<random> subdir under the resolved base, even when --scratch-root is operator-pinned. Concurrent encodes can share a pinned base safely.

Gemini medium #3 (cleanup): Eager os.RemoveAll on every exit path (success / mismatch / error) via defer at MkdirTemp time; stale mismatch.txt removed at start of every run.

Codex P2 (MANIFEST.json absent from scratch): Self-test now uses DecodeSnapshot's in-memory DecodeResult.Header.LastCommitTS for the header check (vs. comparing scratch MANIFEST.json, which the library doesn't emit). Tree diff is adapter-subdir byte-equal; last_commit_ts is covered by the header check.

Test plan grew from 8 to 12 — 4 new tests pin each regression above. No impact on already-merged M1-M5 slices.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 11s —— View job


Review (v2)

  • Read v2 design doc
  • Read parent doc + decoder CLI for cross-reference
  • Verify 4 addressed findings (Gemini ×3, Codex P2)
  • Re-check outstanding items from v1 review
  • Post final review

What v2 correctly addresses

All four flagged items land cleanly:

  • Gemini Configure Renovate #1 (sidecar collision): <output>.encode_info.json path-derivation is explicit and tested by two new tests. Good.
  • Gemini Update dependency go to v1.21.4 #2 (scratch-root collision): Unique encode-self-test-<random> subdir is non-optional even when --scratch-root is pinned. Clearly stated at line 93 with the reason inline. Good.
  • Gemini Update docker/setup-qemu-action action to v3 #3 (cleanup): defer os.RemoveAll on all exit paths, stale mismatch.txt cleared at run start, mismatch.txt itself is now path-derived (<output>.mismatch.txt). Lines 96-97 are precise. Good.
  • Codex P2 (MANIFEST.json absent from scratch): Self-test reads DecodeResult.Header.LastCommitTS in-memory for the header check; tree diff is adapter-subdir byte-equal. The two-step structure (header check vs. tree diff) is clearly separated. Good.

Still-open from v1 (must fix + medium)

1. Lifecycle rename — mandatory before merge

Per CLAUDE.md:

rename *_proposed_*.md*_partial_*.md once the first milestone ships

M1–M5 are merged, so the parent doc must be git mvd to 2026_05_25_partial_snapshot_logical_encoder.md as part of this PR. The parent doc itself says at line 23: "On landing this proposal, the parent doc is promoted proposedpartial … via git mv". That rename still hasn't landed.

2. --s3-chunk-size absent from the CLI surface — medium

The parent doc (§"Per-adapter reverse encoders", lines 329-332) explicitly flags --s3-chunk-size as an operator-facing flag the M6 CLI should expose:

An optional --s3-chunk-size flag (default = s3ChunkSize) lets an operator match a non-default deployment

The M6 doc's CLI surface omits it entirely and doesn't note the deferral. The current encoder hardcodes s3ChunkSize (encode_s3_objects.go:45: const s3ChunkSize = 1 << 20) with no constructor-level override. The doc should either add the flag (which requires also adding it to NewS3RecordEncoder) or explicitly note "deferred; S3 encoder currently hardcodes the canonical 1 MiB constant" so the implementation author knows the parent doc's expectation was consciously skipped.

3. Exit code 2 for a user-supplied flag violation — medium

--last-commit-ts T < manifest exits 2 ("data-correctness failure"). The exit-codes section (line 33) says "1 on user-input error (flag parse, missing file)" with no qualification. A shell wrapper author will expect all "wrong flag" failures to be exit 1. One sentence explaining "a CLI flag that conflicts with a data constraint from the manifest is a data error, not a user-input error" would prevent confusion without any design change.


Still-open from v1 (minor)

4. Decoder cleanup milestone attribution

The parent doc explicitly wrote "Decoder cleanup folded into M1." M1 shipped without it; M6 picks it up. The doc doesn't say this — it only says "parent doc §'Decoder cleanup'" and claims M6 owns it. Suggest one sentence: "The parent doc targeted M1; that scope was not landed there and is now deferred to M6." Prevents a reader diffing the two docs from wondering which one is authoritative.

5. TestCLISelfTestDetectsCorruption — injection mechanism still unspecified

Line 121 still says "Patch the in-tree encoder to deliberately mangle one byte after writing the .fsm" without specifying the injection point. The implementation will need to invent this pattern. Given that snapshotBuilder.WriteTo takes an io.Writer, the natural mechanism is a test-only corruptingWriter wrapper passed via a functional option or a test-only constructor. Pinning it here avoids implementation drift.

6. P1 test TestEncoderProducesLoadableSnapshot absent from test plan

The parent doc's M6 description includes an "end-to-end test that loads the output into a fresh single-node cluster and reads back every adapter's data." That test is classified P1 in the parent's test table. It doesn't appear in the M6 doc's 12-test plan and is never explicitly deferred. The self-test proves round-trip equality of the directory tree; only the cluster test proves the .fsm actually boots a node. Either include it or add a sentence deferring it to a follow-up with a reason.

7. encoder_version ldflags mechanism

The decoder main.go:45 has var version = "dev" with a comment explaining -ldflags "-X main.version=...". The ENCODE_INFO.json schema shows "encoder_version": "<git rev or release tag>" but the doc doesn't mention the build-time stamping mechanism. One line for parity with the decoder.

8. adapters_enabled ordering

Schema example shows ["redis","dynamodb","s3","sqs"]. Not clear whether this is canonical encoder fan-out order or CLI input order. Since the encoder fans out adapters in deterministic order, the sidecar should reflect that order so it's reproducible across runs with different --adapter inputs.


Verdict

The four v2 fixes all land cleanly. The lifecycle rename (#1 above) is mandatory before merge per CLAUDE.md, and the missing --s3-chunk-size gap (#2) warrants a one-line explicit deferral. The other items are small doc clarifications that can ship in the same doc commit. None of them block implementation fidelity.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb6ba10fb2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


Implementation:

1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a unique scratch subdirectory. The scratch dir is ALWAYS a fresh `encode-self-test-<random>` subdirectory under the resolved base (default base `os.TempDir()`; operator override via `--scratch-root`). The unique-suffix is non-optional even when `--scratch-root` is pinned — concurrent encodes against the same pinned base would otherwise collide and produce false failures (gemini medium #896).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Write self-tested snapshots to a temp path first

When --self-test is enabled and the encoder output is corrupt, this sequence writes the final --output .fsm before running the decode/diff, and line 97 then only records matched:false before exiting. That leaves a known-bad snapshot at the restore-visible path; the parent design explicitly says the self-test should run before the encoder finalizes the output so a node never receives an unloadable .fsm (docs/design/2026_05_25_proposed_snapshot_logical_encoder.md §Round-trip self-test). The implementation should write to a temp path and only rename into --output after the self-test passes, or remove the final file on failure.

Useful? React with 👍 / 👎.

5. Emits `ENCODE_INFO.json` next to the output `.fsm` (provenance + integrity anchor).
6. Optionally runs the round-trip self-test (`--self-test`): re-decode the produced `.fsm` into a scratch dir and diff against the input tree.

The CLI is the **only new external surface**; the encoder slices already export their constructors and `Encode(*snapshotBuilder) error`. M6 is mostly wiring + provenance + the self-test gate.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Expose an encoder entrypoint for the new command

The proposed command cannot directly perform this wiring from cmd/elastickv-snapshot-encode: snapshotBuilder and newSnapshotBuilder are unexported in internal/backup/encode.go, and the adapter Encode methods take *snapshotBuilder, so an external main package has no way to construct the builder to pass to NewRedisEncoder(...).Encode etc. Without adding an exported EncodeSnapshot/builder API or moving orchestration into internal/backup, the follow-up implementation slice described here is blocked.

Useful? React with 👍 / 👎.

…entrypoint)

Two design corrections from codex review on v2.

1. Write-then-rename atomic publish (codex P2 line 93).
   Previously: encoder wrote --output first, then ran self-test, then
   recorded matched:false in the sidecar on failure. That leaves a
   known-bad .fsm at the restore-visible path — exactly the failure
   mode the parent doc's atomic-publish discipline forbids. New design:
   write to <output>.tmp-<random>, fsync+close, run self-test, ONLY
   rename to <output> on success. Failure path os.RemoveAll's the
   temp file. Pinned by TestCLISelfTestFailureLeavesNoFsmAtOutputPath.

2. New backup.EncodeSnapshot library entrypoint (codex P2 line 17).
   Previously: design said the CLI directly wires NewRedisEncoder(...).
   Encode(snapshotBuilder), but snapshotBuilder + newSnapshotBuilder
   are unexported in internal/backup/encode.go. An external main
   package cannot construct the builder. New design: mirror the
   decoder's DecodeSnapshot/EncodeSnapshot symmetry — internal/backup/
   encode_snapshot.go exposes EncodeSnapshot(opts, out) which handles
   the builder construction, per-adapter dispatch, and optional self-
   test internally. The CLI's responsibility shrinks to flag parsing +
   the temp-file dance + sidecar writing. Unblocks future callers
   (Phase 1 live extractor, integration tests). Pinned by
   TestEncodeSnapshotLibraryRoundTrip.

Files-to-add and test plan updated: 14 tests now (was 12), 4 files
to add now (was 4 — encode_snapshot.go replaces direct builder use
in main.go).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v3 (a253d7f): addresses both codex P2 findings on v2.

Codex P2 #1 (write-temp-first): New write-then-rename atomic-publish discipline. Encoder writes to <output>.tmp-<random>, fsync+close, runs self-test, ONLY renames to <output> on success. Self-test failure → temp file removed → no .fsm at the publish path. Pinned by TestCLISelfTestFailureLeavesNoFsmAtOutputPath.

Codex P2 #2 (library entrypoint exposure): Added backup.EncodeSnapshot(EncodeOptions, io.Writer) (EncodeResult, error) mirroring the decoder's DecodeSnapshot symmetry. snapshotBuilder stays unexported; the CLI's responsibility shrinks to flag-parse + temp-file dance + sidecar. Future callers (Phase 1 live extractor, integration tests) get a tested API surface. Pinned by TestEncodeSnapshotLibraryRoundTrip.

Files-to-add and test plan updated: 14 tests now (was 12).

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 4m 25s —— View job


Review (v3)

  • Read v3 design doc
  • Read parent design doc for cross-reference
  • Read encoder/decoder code for accuracy checks (encode.go, decode.go, snapshot_reader.go, decoder CLI)
  • Verify v3 addresses Codex P2 findings
  • Check outstanding items from v2 review
  • Post final review

v3 Codex P2 findings: both addressed

Codex P2 #1 (write-temp-first): Atomics publish discipline is now explicit: <output>.tmp-<random> → fsync+close → optional self-test → rename on success or os.RemoveAll on failure. Pinned by TestCLISelfTestFailureLeavesNoFsmAtOutputPath. No .fsm at the restore-visible path on self-test failure. ✅

Codex P2 #2 (library entrypoint): backup.EncodeSnapshot(EncodeOptions, io.Writer) (EncodeResult, error) is specified, snapshotBuilder stays unexported, CLI is reduced to flag-parse + temp-file dance + sidecar. Pinned by TestEncodeSnapshotLibraryRoundTrip. ✅


New finding from v3 — medium

EncodeSnapshot + self-test + io.Writer internal model not specified

func EncodeSnapshot(opts EncodeOptions, out io.Writer) (EncodeResult, error) — the self-test needs to read back the produced bytes to call DecodeSnapshot on them. But out is write-only. The doc says the self-test runs "against the in-memory bytes" (line 11), which implies EncodeSnapshot assembles the entire FSM into a *bytes.Buffer first, self-tests from that buffer, then streams to out. But §"Round-trip self-test" step 1 says "After the primary .fsm is written + sha256-anchored, invoke backup.DecodeSnapshot" — "written" here is ambiguous between "written to internal buffer" and "written to disk".

The implementor has two valid choices, and each has consequences the doc should make explicit:

  • Option A (internal buffer): EncodeSnapshot accumulates entries via snapshotBuilder.Add, calls builder.WriteTo(&buf), sha256-anchors buf.Bytes(), self-tests by DecodeSnapshot(bytes.NewReader(buf.Bytes()), ...), then copies buf to out. Memory cost: the serialized FSM bytes live in buf alongside the already-in-memory snapshotBuilder.entries. For a 10M-key snapshot this roughly doubles the ~2.5 GiB key-only bound the parent doc already called out.
  • Option B (seek-then-reread): CLI passes a seekable *os.File as out; EncodeSnapshot writes, seeks to offset 0, and passes a *io.SectionReader to DecodeSnapshot. This avoids the in-memory copy but requires out to be secretly seekable — an implicit interface assumption that breaks the io.Writer contract.

Option A is the only spec-compliant path given io.Writer. The doc should state this explicitly and note the memory implication (or propose dropping in-memory self-test entirely and having the CLI pass the temp file path separately). Fix this →


New findings from v3 — minor

Prose/API inconsistency (line 11): "returns an EncodeResult plus the assembled .fsm bytes" — but the actual signature writes bytes to out io.Writer, not returning them. The "plus the assembled .fsm bytes" phrase is vestigial from an earlier sketch. Fix this →

Parameter order asymmetry (line 36): DecodeSnapshot(r io.Reader, opts DecodeOptions) has the I/O param first; EncodeSnapshot(opts EncodeOptions, out io.Writer) has it last. The doc claims these mirror each other. Aligning the order to EncodeSnapshot(out io.Writer, opts EncodeOptions) — or explicitly noting the divergence — avoids callers having to look up which comes first.


Still-open from v2 — mandatory before merge

1. Lifecycle rename missing (same as v2)

Per CLAUDE.md: "rename *_proposed_*.md*_partial_*.md once the first milestone ships." M1–M5 are merged. The parent doc says at its own line 23: "On landing this proposal, the parent doc is promoted proposedpartial … via git mv." 2026_05_25_proposed_snapshot_logical_encoder.md still has not been renamed. This PR must include that git mv.


Still-open from v2 — medium

2. --s3-chunk-size still absent from CLI surface

Parent doc §"Per-adapter reverse encoders" lines 329-331 explicitly says: "An optional --s3-chunk-size flag (default = s3ChunkSize) lets an operator match a non-default deployment." The v3 doc's CLI surface section omits it. The encoder hardcodes s3ChunkSize = 1 << 20 (encode_s3_objects.go:45) with no override path. One sentence is needed: either add the flag (which also requires wiring it through EncodeOptions) or note "deferred — encoder currently hardcodes the canonical 1 MiB; --s3-chunk-size is a follow-up."

3. Exit code 2 rationale for user-supplied flag violation (same as v2)

--last-commit-ts T < manifest exits 2, but the exit-codes section (line 55) says "1 on user-input error (flag parse, missing file)" with no qualification. Shell wrapper authors will expect all flag-value rejections to be exit 1. One sentence: "a CLI flag whose value violates a data constraint from the manifest is a data error, not a user-input error" prevents confusion without any design change. Fix this →


Still-open from v2 — minor

4. Decoder cleanup milestone attribution: Parent doc says "Decoder cleanup folded into M1." M1 shipped without it; M6 picks it up. The doc should say "the parent doc targeted M1; that scope was not landed there and is now deferred to M6" so a reader diffing the two docs knows which is authoritative.

5. TestCLISelfTestDetectsCorruption — injection point still unspecified (line 145): "Patch the in-tree encoder to deliberately mangle one byte" without specifying the mechanism. The natural approach with EncodeSnapshot(opts, out io.Writer) is to wrap out in a corruptingWriter test helper. Pinning this avoids the implementor inventing the pattern.

6. P1 test TestEncoderProducesLoadableSnapshot absent from test plan: The parent doc's P1 test table includes this end-to-end cluster-boot test under M6. It is not in the 14-test plan and is not explicitly deferred. Self-test proves round-trip equality; only the cluster test proves the .fsm boots a node and serves reads.

7. encoder_version ldflags mechanism: The decoder main.go:45 has var version = "dev" with the -ldflags "-X main.version=..." pattern. ENCODE_INFO.json shows "encoder_version": "<git rev or release tag>" but the doc doesn't mention the build-time stamping mechanism.

8. adapters_enabled ordering: Schema example shows ["redis","dynamodb","s3","sqs"]. Not clear whether this reflects canonical encoder fan-out order or CLI input order. Since the encoder fans out adapters deterministically, the sidecar should reflect that order for reproducibility across runs with different --adapter inputs.


Verdict

Both Codex P2 findings are cleanly resolved. The new medium finding (#9io.Writer internal model ambiguity) and the mandatory lifecycle rename are the two items I'd call necessary before merge. The --s3-chunk-size gap (#2) needs at least a one-line deferral note since the parent doc explicitly names it. Everything else is documentation polish.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a253d7f7f6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +115 to +117
1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a unique scratch subdirectory. The scratch dir is ALWAYS a fresh `encode-self-test-<random>` subdirectory under the resolved base (default base `os.TempDir()`; operator override via `--scratch-root`). The unique-suffix is non-optional even when `--scratch-root` is pinned — concurrent encodes against the same pinned base would otherwise collide and produce false failures (gemini medium #896).
2. **Header check (no MANIFEST.json in scratch):** `DecodeSnapshot` returns a `DecodeResult` whose `Header.LastCommitTS` is the value the decoder read from the produced `.fsm`. Compare it against the effective `T` (manifest value or `--last-commit-ts` override). The decoder library does NOT emit `MANIFEST.json` (that is the cmd wrapper's job per `internal/backup/decode.go`); the self-test deliberately consumes the library output and compares the in-memory header instead, avoiding a redundant manifest synthesis step (codex P2 #896).
3. **Tree diff:** structurally compare the scratch adapter subtrees against the corresponding subdirs of `--input`. The diff is byte-equal on adapter files (filenames + bytes — every per-adapter dump is deterministic). `MANIFEST.json` from `--input` is NOT compared at all in the scratch tree (the scratch has none); the `last_commit_ts` field is covered by the header check above. `wall_time_iso` and other manifest transient fields are not in scope.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve decode options during self-test

When --input was produced with decode-side options such as --include-incomplete-uploads, --include-sqs-side-records, or --dynamodb-bundle-mode jsonl, a plain DecodeSnapshot self-test will not recreate the same tree: newDispatcher only emits those sections when the corresponding DecodeOptions booleans/layout are set, and the decoder CLI currently feeds them from flags while recording them in MANIFEST.json. The self-test needs to derive those DecodeOptions from the input manifest, otherwise valid dumps using these options will fail the structural diff (or be compared against missing/reformatted subtrees).

Useful? React with 👍 / 👎.

…test (codex P2)

Codex P2 v3 flagged that a --input tree produced with non-default
decoder options (--include-incomplete-uploads, --include-orphans,
--preserve-sqs-visibility, --include-sqs-side-records,
--dynamodb-bundle-mode jsonl) will not round-trip through a plain
DecodeSnapshot call: the dispatcher only emits those sections when the
corresponding DecodeOptions boolean / layout is set. A self-test with
hardcoded default options would spuriously diff on every dump that
exercised any of these flags.

Fix: the CLI reads MANIFEST.json's Exclusions.* + DynamoDBLayout fields
and threads them into the scratch DecodeOptions. Manifests that omit
the fields (older dumps) get decoder defaults — same forgiveness the
decoder library applies.

Numbering of the self-test step list also fixed (two step-4s in v3).

Test plan grew from 14 to 15: TestCLISelfTestPreservesManifestDecodeOptions
pins the round-trip on a fixture that uses both --include-incomplete-uploads
and --dynamodb-bundle-mode jsonl.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v4 (9e295f5): addresses codex P2 v3 finding.

Codex P2 (DecodeOptions threading): The self-test now reads MANIFEST.json's Exclusions.* + DynamoDBLayout fields and threads them into the scratch DecodeOptions, so a tree produced with (e.g.) --include-incomplete-uploads or --dynamodb-bundle-mode jsonl round-trips without spurious diff. Older manifests that omit the fields fall back to decoder defaults. Pinned by TestCLISelfTestPreservesManifestDecodeOptions.

Also fixed two duplicate step-4 numbering in the self-test sequence.

Test plan grew from 14 to 15.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 4m 2s —— View job


Review (v4)

  • Read v4 design doc
  • Read parent doc + code for cross-reference (decode.go:44–95, manifest.go:123–189, encode_s3_objects.go:43–45)
  • Verify v4 addresses Codex P2 (DecodeOptions threading)
  • Re-check outstanding items from v3 review
  • Post final review

v4 Codex P2 finding: addressed

The self-test now reads Exclusions.{IncludeIncompleteUploads,IncludeOrphans,PreserveSQSVisibility,IncludeSQSSideRecords} and DynamoDBLayout from MANIFEST.json and threads them into the scratch DecodeOptions (line 116). Older manifests fall back to decoder defaults. Pinned by TestCLISelfTestPreservesManifestDecodeOptions. ✅


New finding from v4 — medium

RenameS3Collisions not in manifest Exclusions; self-test cannot thread it

DecodeOptions.RenameS3Collisions (decode.go:77) has no counterpart in the manifest Exclusions struct (manifest.go:126–130). A dump produced with --rename-collisions has S3 object files under renamed paths (keys that collide with .elastickv-meta.json are stored under a renamed filename). If the self-test runs DecodeSnapshot with RenameS3Collisions=false (the default, since there is no manifest field to read), the scratch tree's filenames differ from --input, and the structural diff fails spuriously.

The doc says it "threads them into the scratch DecodeOptions" (line 116) but implicitly excludes RenameS3Collisions because it has no manifest record. This is a real gap for any deployment that hit the S3 naming collision case.

Two options:

  1. Add RenameS3Collisions bool to the Exclusions struct in manifest.go, so the decoder records it and the self-test can thread it.
  2. Acknowledge the limitation explicitly: "Self-test does not support trees produced with --rename-collisions; that flag is not recorded in the manifest. Such trees must be tested manually." Then expand TestCLISelfTestPreservesManifestDecodeOptions's assertion set accordingly.

Option 2 is fine for M6 since the collision scenario is rare in practice. Either way the doc should call it out — the new test only covers --include-incomplete-uploads + --dynamodb-bundle-mode jsonl, so this gap is not caught by the current test plan.


New minor finding from v4

DynamoDBLayout (manifest field) ≠ DynamoDBBundleJSONL (DecodeOptions field)

Line 116 says "threads … DynamoDBLayout … into the scratch DecodeOptions." The actual DecodeOptions field is DynamoDBBundleJSONL bool (decode.go:89). The mapping is manifest.DynamoDBLayout == "jsonl" → opts.DynamoDBBundleJSONL = true. The doc should use the DecodeOptions field name (or show the mapping) so the implementation author doesn't have to look it up. Fix this →


Still-open from v3 — mandatory before merge

1. Lifecycle rename

docs/design/2026_05_25_proposed_snapshot_logical_encoder.md is still named proposed. M1–M5 are merged. Per CLAUDE.md and the parent doc's own line 26: "On landing this proposal, the parent doc is promoted proposedpartial … via git mv." This rename must be in this PR. No other change needed — one git mv:

git mv docs/design/2026_05_25_proposed_snapshot_logical_encoder.md \
       docs/design/2026_05_25_partial_snapshot_logical_encoder.md

Still-open from v3 — medium

2. io.Writer internal model ambiguity

Line 11 says "EncodeSnapshot … runs the optional self-test against the in-memory bytes." Line 12 says "the .fsm bytes are written to <output>.tmp-<random>." Line 39 says "The CLI's responsibility is reduced to: read MANIFEST → call EncodeSnapshot(opts, tempFile) →…"

If the CLI passes the temp *os.File as out io.Writer, EncodeSnapshot can only self-test against those bytes by either (a) buffering the entire FSM in memory before writing to out (doubles memory beyond the already-large key-sort working set), or (b) seeking the temp file back to offset 0 after writing — an implicit assumption that out is seekable, breaking the io.Writer contract.

This was raised in v3 and is still unresolved. The doc should explicitly name the chosen model. The most spec-compliant path is Option A (internal buffer when SelfTest=true), with a note that it adds another ~FSM-size allocation on top of the sort working set — the same memory bound the parent doc already calls out.

3. --s3-chunk-size absent from CLI surface

Parent doc line 329 explicitly says: "An optional --s3-chunk-size flag (default = s3ChunkSize) lets an operator match a non-default deployment." The M6 CLI surface omits it with no deferral note. The encoder hardcodes s3ChunkSize = 1 << 20 (encode_s3_objects.go:45) with no override path. One sentence is all that is needed: either add the flag (wire through EncodeOptions) or note "deferred — encoder hardcodes the canonical 1 MiB; --s3-chunk-size is a follow-up."

4. Exit code 2 rationale for flag-driven data constraint violation

Line 55 says: "1 on user-input error (flag parse, missing file), 2 on data-correctness failure." --last-commit-ts T < manifest exits 2 because the constraint is data-derived from the manifest, not a flag-parse problem — but this is not stated. A shell wrapper author will expect all wrong-flag errors to be exit 1. One sentence prevents the confusion. Fix this →


Still-open from v3 — minor

5. Decoder cleanup milestone attribution — The parent doc said "Decoder cleanup folded into M1"; M1 shipped without it; M6 picks it up. One sentence ("the parent doc targeted M1; that scope was not landed there and is now deferred to M6") prevents a reader diffing the two docs from wondering which is authoritative.

6. TestCLISelfTestDetectsCorruption — injection mechanism — Line 146 says "Patch the in-tree encoder to deliberately mangle one byte after writing the .fsm" without specifying the mechanism. With EncodeSnapshot(opts, out io.Writer) now specified, the natural approach is a test-only corruptingWriter wrapper passed as out. Worth pinning.

7. P1 test TestEncoderProducesLoadableSnapshot absent — Parent doc's P1 test table includes this end-to-end cluster-boot test under M6. The 15-test plan still doesn't include it and doesn't explicitly defer it. Self-test proves round-trip equality; only the cluster test proves the .fsm boots a node and serves reads.

8. encoder_version ldflags mechanism — The decoder main.go:45 uses var version = "dev" with -ldflags "-X main.version=...". ENCODE_INFO.json shows "encoder_version": "<git rev or release tag>" with no mention of how it gets stamped at build time. One-liner for parity.

9. adapters_enabled ordering — Schema example shows ["redis","dynamodb","s3","sqs"]. Not stated whether this is canonical encoder fan-out order or CLI input order. Since the encoder fans out deterministically, the sidecar should reflect that order for reproducibility.

10. Line 11 prose inconsistency — "returns an EncodeResult plus the assembled .fsm bytes" is still present. The signature writes bytes to out io.Writer; they are not returned. Fix this →

11. Parameter order asymmetryDecodeSnapshot(r io.Reader, opts DecodeOptions) has I/O first; EncodeSnapshot(opts EncodeOptions, out io.Writer) has it last. Either align or explicitly note the divergence.


Verdict

The v4 Codex P2 finding (DecodeOptions threading) is cleanly resolved. The new medium finding (RenameS3Collisions gap) and the still-open mandatory lifecycle rename are the two items to resolve before merge. The io.Writer self-test model ambiguity (#2 above) has now been open across three review rounds and warrants resolution — even a one-sentence commitment to Option A (internal buffer when SelfTest=true) with a note on the memory implication would close it. The --s3-chunk-size deferral note (#3) is a one-liner. All other items are documentation polish.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e295f54a1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Implementation:

1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a unique scratch subdirectory. The scratch dir is ALWAYS a fresh `encode-self-test-<random>` subdirectory under the resolved base (default base `os.TempDir()`; operator override via `--scratch-root`). The unique-suffix is non-optional even when `--scratch-root` is pinned — concurrent encodes against the same pinned base would otherwise collide and produce false failures (gemini medium #896).
2. **DecodeOptions are read back from the input MANIFEST.json.** The self-test MUST feed `DecodeSnapshot` the same option flags the decoder used to produce `--input`, otherwise a tree produced with (e.g.) `--include-incomplete-uploads` or `--dynamodb-bundle-mode jsonl` will spuriously fail the diff (codex P2 v3 #896). The CLI reads `MANIFEST.json` `Exclusions.{IncludeIncompleteUploads,IncludeOrphans,PreserveSQSVisibility,IncludeSQSSideRecords}` and `DynamoDBLayout` and threads them into the scratch `DecodeOptions`. Manifests that omit these fields (older dumps) decode with defaults — same forgiveness the decoder applies. Pinned by `TestCLISelfTestPreservesManifestDecodeOptions`.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve S3 rename-collision mode in self-test

When the input dump contains an S3 object whose key ends in .elastickv-meta.json, the original decoder only succeeds if --rename-collisions sets DecodeOptions.RenameS3Collisions; the manifest path listed here does not restore that option, and the fields enumerated omit both RenameS3Collisions and any KEYMAP-derived fallback. A self-test of such a valid dump will re-decode the produced .fsm with the default false setting and fail on the same meta-suffix collision instead of proving the encoder output is bad, so the design should infer this option from the dump artifacts or explicitly include it in the manifest/options threading.

Useful? React with 👍 / 👎.

…dex P2 v4)

Codex P2 v4 flagged that the decoder CLI's --rename-collisions flag
drives DecodeOptions.RenameS3Collisions but is NOT recorded in
MANIFEST.json's Exclusions struct. A self-test of a dump produced with
--rename-collisions cannot perfectly round-trip today because the
encoder has no way to know to ask for the same renaming pass.

Pre-existing decoder-side schema gap, not just an encoder design hole.
Fix folded into M6 (same precedent as parent doc M1 "Decoder cleanup
folded into M1"):

- Exclusions gains RenameS3Collisions bool with json tag
  rename_s3_collisions.
- The decoder CLI's emitManifest populates the new field from cfg.renameCollisions.
- The encoder self-test reads Exclusions.RenameS3Collisions and threads
  it into the scratch DecodeOptions.
- Older manifests without the field decode as false (no-rename),
  matching the decoder default.

Test plan grew from 15 to 17:
- TestCLISelfTestPreservesRenameS3Collisions — full round-trip with
  --rename-collisions input.
- TestExclusionsLegacyManifestOmitsRenameS3Collisions — older manifests
  decode safely with default false.

This is a small decoder-side touchup; the manifest field has been a
zero-cost extension since the Exclusions struct uses regular JSON tags
(no DisallowUnknownFields gate on this field set).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v5 (bae0377): addresses codex P2 v4 finding.

Codex P2 (RenameS3Collisions not in manifest): Pre-existing decoder-side gap — --rename-collisions drives DecodeOptions.RenameS3Collisions but the flag's value is NOT recorded in MANIFEST.json's Exclusions. Self-test could not perfectly round-trip dumps that used the flag.

Fix folded into M6 (same precedent as parent doc M1 "Decoder cleanup folded into M1"):

  • Exclusions gains RenameS3Collisions bool (JSON tag rename_s3_collisions)
  • Decoder CLI emitManifest populates the new field
  • Encoder self-test reads it back and threads into scratch DecodeOptions
  • Legacy manifests without the field decode as false (decoder default), no fail

Test plan grew from 15 to 17 — TestCLISelfTestPreservesRenameS3Collisions pins the full round-trip; TestExclusionsLegacyManifestOmitsRenameS3Collisions pins forward-compat for older manifests.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 32s —— View job


Review (v5)

  • Read v5 design doc
  • Read parent doc + decoder CLI + manifest.go for cross-reference
  • Verify Codex P2 v4 finding (RenameS3Collisions)
  • Re-check still-open items from v4 review
  • Post final review

v5 Codex P2 finding: addressed at spec level

The schema extension is correctly specified: Exclusions.RenameS3Collisions bool + JSON tag rename_s3_collisions, decoder CLI emitManifest populates it, encoder self-test reads it back, pinned by TestCLISelfTestPreservesRenameS3Collisions. The forward-compat claim (older manifests decode with false) is the right policy for a new optional field. ✅


New finding from v5 — medium

exclusionsRequiredFields and forward-compat conflict

The doc says (line 116): "Manifests that omit these fields (older dumps) decode with defaults — same forgiveness the decoder applies."

That claim is only true if rename_s3_collisions is not added to exclusionsRequiredFields (manifest.go:350–355). All four existing Exclusions fields ARE in exclusionsRequiredFields, and validateExclusionsFieldsPresent (manifest.go:317–344) hard-rejects any manifest that omits one of them with ErrInvalidManifest. If rename_s3_collisions is added to exclusionsRequiredFields too, every existing manifest produced by the current decoder (which doesn't emit this field) will fail ReadManifest with a breaking error — the opposite of forward-compat.

The implementation author has two choices; the doc must pick one:

  • Option A (optional field): Add RenameS3Collisions bool to Exclusions but do NOT add rename_s3_collisions to exclusionsRequiredFields. Older manifests silently get false. This is the only path that matches the doc's forward-compat claim — but it creates an intentional asymmetry where this field is optional while the other four are required. A one-sentence rationale is needed.
  • Option B (required + migration): Add rename_s3_collisions to exclusionsRequiredFields AND bump the manifest format in a minor-version way so older manifests are refused. This is a breaking change; must be called out explicitly with a migration note.

TestExclusionsLegacyManifestOmitsRenameS3Collisions pin implies Option A, but the doc does not name the mechanism. Fix this →


Still-open from v4 — mandatory before merge

1. Lifecycle rename

docs/design/2026_05_25_proposed_snapshot_logical_encoder.md is still named proposed. M1–M5 are merged. Per CLAUDE.md and the parent doc's own line: "On landing this proposal, the parent doc is promoted proposedpartial … via git mv." This has been flagged in every review since v1. One command:

git mv docs/design/2026_05_25_proposed_snapshot_logical_encoder.md \
       docs/design/2026_05_25_partial_snapshot_logical_encoder.md

Still-open from v4 — medium

2. io.Writer internal model ambiguity (open for 3 rounds)

The doc says EncodeSnapshot(opts, out io.Writer) "runs the optional self-test against the in-memory bytes" (line 11). But out is write-only. The only spec-compliant path is Option A: buffer the FSM in *bytes.Buffer when SelfTest=true, decode from the buffer, then copy to out. This doubles the in-memory working set. The doc should commit to this model in one sentence and note the memory implication. Otherwise the implementor will invent the pattern independently (two out of three have reached for the temp-file workaround which breaks the io.Writer contract).

3. --s3-chunk-size absent from CLI surface

Parent doc §"Per-adapter reverse encoders" explicitly says: "An optional --s3-chunk-size flag … lets an operator match a non-default deployment." encode_s3_objects.go:45 hardcodes s3ChunkSize = 1 << 20 with no override. Either add the flag (requires wiring through EncodeOptions) or one sentence: "deferred — encoder hardcodes canonical 1 MiB; --s3-chunk-size is a follow-up."

4. Exit code 2 rationale for flag-driven data constraint (line 55)

--last-commit-ts T < manifest exits 2 but the exit-codes section does not explain why a flag-value violation is a data error. A shell wrapper author will expect it to be exit 1. One sentence prevents confusion: "a CLI flag whose value violates a data constraint from the manifest is a data error, not a flag-parse error." Fix this →


Still-open from v4 — minor

5. Line 11 prose inconsistency

"returns an EncodeResult plus the assembled .fsm bytes" — bytes are written to out io.Writer, not returned. The phrase is vestigial. Fix this →

6. TestCLISelfTestDetectsCorruption — injection mechanism (line 148)

"Patch the in-tree encoder to deliberately mangle one byte after writing the .fsm" without specifying the mechanism. With EncodeSnapshot(opts, out io.Writer) now specified, the natural approach is a test-only corruptingWriter wrapper passed as out. Pinning it here avoids the implementor inventing the pattern.

7. P1 test TestEncoderProducesLoadableSnapshot absent

Parent doc's P1 test table includes this end-to-end cluster-boot test under M6. The 17-test plan doesn't include it, and it's not explicitly deferred. Self-test proves round-trip equality; only the cluster test proves the .fsm boots a node and serves reads.

8. encoder_version ldflags mechanism

Decoder main.go:45 uses var version = "dev" with -ldflags "-X main.version=...". ENCODE_INFO.json shows "encoder_version": "<git rev or release tag>" with no mention of how it gets stamped at build time. One-liner for parity.

9. adapters_enabled ordering

Schema example shows ["redis","dynamodb","s3","sqs"]. Not stated whether this is canonical encoder fan-out order or CLI input order. Since the encoder fans out deterministically, the sidecar should reflect that order for reproducibility.

10. Decoder cleanup milestone attribution

Parent doc said "Decoder cleanup folded into M1." M1 shipped without it; M6 picks it up. The doc should say "the parent doc targeted M1; that scope was not landed there and is now deferred to M6" so a reader diffing the two docs knows which is authoritative.

11. DynamoDBBundleJSONL field name (minor)

Line 116 refers to threading DynamoDBLayout into DecodeOptions, but the actual field is DynamoDBBundleJSONL bool (decode.go:89). The mapping is manifest.DynamoDBLayout == "jsonl" → opts.DynamoDBBundleJSONL = true. Showing the mapping explicitly prevents the implementor from looking it up. Fix this →


Verdict

The Codex P2 v4 finding (RenameS3Collisions field) is correctly addressed at the spec level. The new medium finding — the exclusionsRequiredFields / forward-compat conflict — is the one item that must be resolved to avoid a silent implementation trap (the implementor who adds the field to exclusionsRequiredFields per analogy to the other four will break backward compat; the doc must say "don't do that"). The lifecycle rename (#1) remains mandatory. All other items are documentation clarifications.

… claude v3/v4/v5 reviews

Items addressed (some open across 3 review rounds):

1. **Lifecycle rename (mandatory, flagged 3 rounds).** git mv the parent
   doc from proposed_snapshot_logical_encoder.md to
   partial_snapshot_logical_encoder.md. M1-M5 are merged, parent doc's
   own §"Why a separate design doc" calls for this rename, CLAUDE.md
   "Design Documents" formalizes the marker.

2. **exclusionsRequiredFields explicit policy (v5 new medium).**
   rename_s3_collisions is intentionally NOT added to
   exclusionsRequiredFields (manifest.go:350-355) so older manifests
   decode with the zero-value false. Asymmetry with the four
   pre-existing required fields is intentional; doc now states the
   constraint (backward compat with every existing manifest) and the
   mechanism (JSON's zero-value fallback).

3. **io.Writer self-test buffering model (open 3 rounds).** Committed
   to Option A: when opts.SelfTest=true, EncodeSnapshot buffers the
   FSM in a *bytes.Buffer first, decodes from the buffer, then copies
   to out only on match. Memory cost is one FSM-sized allocation on
   top of the sort working set. When opts.SelfTest=false the FSM
   streams straight to out.

4. **--s3-chunk-size deferral note (open 3 rounds).** Explicitly
   deferred — encoder hardcodes canonical 1 MiB; flag is a follow-up.
   Rationale: no deployment in-tree uses a non-default chunk size.

5. **Exit code 2 rationale for flag-driven data constraint (open 3
   rounds).** Added one sentence: a flag whose value violates a
   data constraint from the manifest is exit 2, not exit 1, because
   the constraint comes from the data.

6. **Line 11 vestigial "plus the assembled .fsm bytes" (open 3
   rounds).** Removed; replaced with explicit "written to out
   io.Writer; not returned."

7. **DynamoDBBundleJSONL field name (v4 minor).** Self-test
   threading paragraph now says
   DynamoDBBundleJSONL (true when manifest.DynamoDBLayout == "jsonl")
   so the implementation author doesn't have to look it up.

8. **TestCLISelfTestDetectsCorruption injection mechanism (open 3
   rounds).** Now specifies a corruptingWriter wrapper around the
   EncodeSnapshot out io.Writer (xor'd byte at deterministic offset);
   self-test reads back the corrupted buffer.

9. **P1 cluster test TestEncoderProducesLoadableSnapshot defer
   (open 3 rounds).** Explicitly deferred from M6 with rationale
   (a/b/c) covering why the existing 17 tests adequately substitute
   for it within M6's scope.

10. **encoder_version ldflags mechanism (open 3 rounds).** Mirrors
    decoder/main.go:45 — var version = "dev" + -ldflags "-X main.version=...".
    Test builds keep the literal "dev".

11. **adapters_enabled ordering (open 3 rounds).** Canonical encoder
    fan-out order (redis, dynamodb, s3, sqs), NOT CLI input order, for
    bytewise reproducibility across runs that pass --adapter in
    different sequences.

12. **Decoder cleanup milestone attribution (open 3 rounds).** Doc
    now says "Parent doc said 'Decoder cleanup folded into M1.' M1
    shipped without it; M6 picks it up" so a reader diffing the two
    docs knows M6 is authoritative.

13. **Parameter order asymmetry note.** EncodeSnapshot(opts, out) vs
    DecodeSnapshot(r, opts) — deliberate inversion so each function's
    I/O argument sits adjacent to the direction it travels.

No new content removed; doc is now ~30 lines longer.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v6 (be063d6): addresses 1 new medium + 12 still-open items from claude v3/v4/v5 reviews.

Mandatory items resolved:

  • Lifecycle rename: git mv parent doc from proposed_snapshot_logical_encoder.mdpartial_snapshot_logical_encoder.md. M1-M5 are merged; the rename was overdue.
  • exclusionsRequiredFields policy: rename_s3_collisions intentionally NOT in required fields; doc states why (backward compat) + mechanism (JSON zero-value fallback). Asymmetry with the 4 pre-existing required fields is intentional.
  • io.Writer self-test buffer: committed to Option A — buffer FSM in *bytes.Buffer when SelfTest=true, decode from buffer, copy to out only on match. No-self-test path streams straight to out.
  • --s3-chunk-size: explicitly deferred with rationale (no in-tree deployment uses non-default).

Polish items (all flagged across 3 review rounds):

  • Exit code 2 rationale for flag-driven data constraint
  • Line 11 vestigial "plus the assembled .fsm bytes" → removed
  • DynamoDBBundleJSONL correct field name + mapping
  • TestCLISelfTestDetectsCorruption injection mechanism (corruptingWriter)
  • TestEncoderProducesLoadableSnapshot deferred from M6 with rationale
  • encoder_version ldflags mechanism mirrors decoder main.go:45
  • adapters_enabled canonical encoder fan-out order
  • Decoder cleanup M1→M6 attribution
  • Parameter order asymmetry (EncodeSnapshot(opts, out) vs DecodeSnapshot(r, opts)) noted as deliberate

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 46s —— View job


Review (v6)

  • Read v6 design doc
  • Read parent doc (partial) for cross-reference
  • Read encoder/decoder code for accuracy checks (decode.go, manifest.go, encode_s3_objects.go, decoder CLI)
  • Verify all 13 v6 claimed fixes
  • Re-check outstanding items from v5 review
  • Post final review

v6 claimed fixes: all 13 verified

Every item from the trigger comment confirmed as resolved:

  • Lifecycle rename ✅ — git mv landed; parent doc is now 2026_05_25_partial_snapshot_logical_encoder.md.
  • exclusionsRequiredFields policy ✅ — Doc line 128 explicitly states rename_s3_collisions is NOT in exclusionsRequiredFields (manifest.go:350–355 confirmed; only the original four fields present) and explains the intentional backward-compat asymmetry.
  • io.Writer self-test buffer (Option A) ✅ — Lines 41–42 commit to the internal *bytes.Buffer model: SelfTest=true → buffer first, test from buffer, copy to out only on match; SelfTest=false → stream directly to out. Memory implication (one extra FSM-sized allocation) is called out inline.
  • --s3-chunk-size deferred ✅ — Lines 59–60 add the explicit deferral note with rationale.
  • Exit code 2 rationale ✅ — Line 61 adds the one-sentence clarification: data-derived constraint, not flag-parse error.
  • Line 11 vestigial phrase removed ✅ — "plus the assembled .fsm bytes" is gone; the sentence now reads correctly.
  • DynamoDBBundleJSONL field name + mapping ✅ — Line 126: field name matches decode.go:89 (DynamoDBBundleJSONL bool), mapping shown explicitly (true when manifest.DynamoDBLayout == "jsonl").
  • TestCLISelfTestDetectsCorruption injection ✅ — Line 160: corruptingWriter wrapper (xor'd byte at deterministic offset) passed as out io.Writer.
  • TestEncoderProducesLoadableSnapshot deferred ✅ — Lines 172–173 defer with three-part rationale.
  • encoder_version ldflags ✅ — Line 108: mirrors decoder cmd/elastickv-snapshot-decode/main.go:45 (var version = "dev", -ldflags "-X main.version=..." confirmed).
  • adapters_enabled canonical order ✅ — Lines 109–110: canonical encoder fan-out order; non-CLI-input order; reasoning stated.
  • Decoder cleanup M1→M6 attribution ✅ — Line 130: explicit "parent doc targeted M1; M1 shipped without it; M6 picks it up."
  • Parameter order asymmetry ✅ — Lines 43–44: asymmetry is deliberate; I/O arg adjacent to the direction it travels.

New finding from v6 — must fix before merge

Stale parent link in M6 doc (line 4)

The lifecycle rename in this PR renamed the parent doc from 2026_05_25_proposed_snapshot_logical_encoder.md to 2026_05_25_partial_snapshot_logical_encoder.md. The M6 doc's **Parent:** line still references the old name:

**Parent:** [`2026_05_25_proposed_snapshot_logical_encoder.md`](2026_05_25_proposed_snapshot_logical_encoder.md)

Since the file no longer exists under that name, this is a broken link introduced by the same PR that fixes the lifecycle rename. Change to:

**Parent:** [`2026_05_25_partial_snapshot_logical_encoder.md`](2026_05_25_partial_snapshot_logical_encoder.md)

Fix this →


New finding from v6 — minor

Encoder fan-out order diverges from decoder's canonical order (undocumented)

The M6 doc claims the encoder processes adapters in redis → dynamodb → s3 → sqs order (line 11, line 110). The decoder's canonical finalize order per decode.go:269 ("matches the dump-tree order") is dynamodb → s3 → redis → sqs, reflected in finalize() lines 285–298. Since the encoder sorts its output, correctness is unaffected — the final .fsm byte-order depends on encoded keys, not on fan-out order. But an implementation author who aligns with the decoder for consistency would produce ["dynamodb","s3","redis","sqs"] in adapters_enabled instead. Either is fine; the doc should add one sentence: "The encoder fan-out order differs from the decoder's finalize order (dynamodb → s3 → redis → sqs per decode.go:269) because the encoder sorts output after collection, making fan-out order irrelevant to the final .fsm — the chosen order exists only for a stable adapters_enabled field."


Verdict

All 13 v6 claimed fixes land correctly. The mandatory item before merge is the single stale parent link on line 4 — a two-word fix introduced by the lifecycle rename in this same PR. Everything else is resolved or explicitly deferred with rationale. Once line 4 is updated, this doc is ready to merge.


Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be063d6110

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

| `TestCLIAcceptsEqualAndHigherLastCommitTSOverride` | `T == manifest` and `T > manifest` both succeed; the emitted `.fsm` header carries the chosen `T`. |
| `TestCLIRejectsMissingManifest` | `--input` directory has no `MANIFEST.json` → exit 1, clear error. |
| `TestCLIRejectsUnknownAdapter` | `--adapter foo` → exit 1, no `.fsm` written. Decoder-parity test for the adapter CSV parser. |
| `TestCLISelfTestDetectsCorruption` | Use a test-only `corruptingWriter` wrapper (xor'd byte at a deterministic offset) as the `EncodeSnapshot` `out io.Writer`; the buffered FSM that the self-test reads back is the corrupted version, so the decode either errors or produces a diff. The CLI exits 2 and writes `mismatch.txt`. Pins that the self-test actually compares (not just runs). |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Corrupt bytes before the self-test reads them

With the buffering model specified above, EncodeSnapshot performs the self-test against its internal buffer and only copies to the supplied out io.Writer after the self-test has matched, so a corruptingWriter wrapped around out cannot affect the bytes that DecodeSnapshot validates. Implementing this test as written would either fail with exit 0 or fail to pin the intended corruption path; the injected corruption needs to happen in the encoder/buffer before the self-test decode.

Useful? React with 👍 / 👎.

…ed self-test (codex P2)

Codex P2 v6 sharp catch: with v6's buffering model
(EncodeSnapshot buffers in *bytes.Buffer, runs self-test against the
buffer, copies to out io.Writer ONLY on match), a corruptingWriter
wrapping out would corrupt bytes AFTER the self-test had already
passed - the test would silently report success with corrupted output.

Replaced the mechanism: same-package test-only hook on EncodeOptions
(unexported corruptBufferForTest func([]byte)) that runs against the
internal buffer between WriteTo and the self-test decode. Corruption
reaches the self-test's DecodeSnapshot but never reaches out (the
write-then-rename discipline ensures self-test failure does NOT
publish). Hook is unexported so external callers cannot set it.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v7 (ac7af85): addresses codex P2 v6 — sharp catch on the corruption-injection mechanism.

The v6 corruptingWriter wrapping out is incoherent with the buffering model: EncodeSnapshot runs self-test against its internal *bytes.Buffer and only copies to out AFTER the self-test matches. So corruption at out happens after self-test has already passed — the test would silently report exit 0.

New mechanism: same-package test-only hook on EncodeOptions (unexported corruptBufferForTest func([]byte)) that runs against the internal buffer between WriteTo and the self-test decode. Corruption reaches the decode but never reaches out (write-then-rename discipline preserved). Hook is unexported → external callers cannot set it → no corrupted-output API on the public surface.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 3m 16s —— View job


Review (v7)

  • Read v7 design doc
  • Read encoder/decoder code for cross-reference (decode.go:267–298, encode.go:136–238)
  • Verify Codex P2 v6 fix (corruption injection mechanism)
  • Re-check v6 outstanding items (stale parent link, fan-out order)
  • Post final review

v7 Codex P2 finding: addressed ✅

The new mechanism is coherent with the buffering model:

  • corruptBufferForTest func([]byte) is an unexported field on EncodeOptions (line 160), so it can only be set from within internal/backup — specifically from internal/backup/encode_snapshot_test.go compiled as package backup (internal test package, not package backup_test).
  • The hook fires against the internal *bytes.Buffer after WriteTo returns but before DecodeSnapshot is called, so corruption reaches the self-test decode path but never reaches out io.Writer.
  • Write-then-rename discipline is preserved: a self-test failure removes the temp file, leaving nothing at the publish path.
  • External callers cannot set the hook, so no corrupted-output API is exposed on the public surface.

New finding from v7 — minor

TestCLISelfTestDetectsCorruption naming inconsistency

The test uses the unexported corruptBufferForTest hook, which requires it to live in internal/backup/encode_snapshot_test.go compiled as package backup. But the name has a CLI prefix, consistent with the test naming convention used for cmd/elastickv-snapshot-encode/main_test.go entries in the test plan (e.g. TestCLIRoundTripSelfTestAllAdapters, TestCLIRejectsLowerLastCommitTSOverride, etc.). An implementation author scanning the test plan and matching prefixes to files would put this test in the CLI package, where the unexported field cannot be set and the test won't compile.

Suggested rename: TestEncodeSnapshotSelfTestDetectsCorruption — matches the internal/backup/encode_snapshot_test.go family and signals the correct package to the implementor. Fix this →


Still-open from v6 — mandatory before merge

Stale parent link on line 4

The git mv in v6 renamed the parent doc to 2026_05_25_partial_snapshot_logical_encoder.md (confirmed — file exists under that name). But line 4 of the M6 doc still references the old name:

**Parent:** [`2026_05_25_proposed_snapshot_logical_encoder.md`](2026_05_25_proposed_snapshot_logical_encoder.md)

The linked file no longer exists under that name, making this a broken link introduced by the same PR that performs the rename. This has been flagged since v6 as mandatory. Fix:

**Parent:** [`2026_05_25_partial_snapshot_logical_encoder.md`](2026_05_25_partial_snapshot_logical_encoder.md)

Fix this →


Still-open from v6 — minor

Encoder fan-out order vs decoder finalize order undocumented

Line 110 states the adapters_enabled order is redis → dynamodb → s3 → sqs (the encoder's canonical fan-out). The decoder's finalize order per decode.go:278–296 is dynamodb → s3 → redis → sqs (with comment: "matches the dump-tree order"). These diverge. The doc explains the encoder order is not the CLI input order, but does not say it differs from the decoder. A one-sentence note closes this: "The encoder fan-out order differs from the decoder's finalize order (dynamodb → s3 → redis → sqs, per decode.go:278); this is intentional — the encoder sorts its output by encoded key after collection, so fan-out order has no effect on the final .fsm byte sequence, and redis-first was chosen for historical familiarity."


Verdict

The v7 Codex P2 finding is correctly resolved. The mandatory stale parent link (#1 above) has been open since v6 and is a two-word fix. Once that and the test naming inconsistency (minor, but practically important for guiding the implementation) are addressed, this doc is ready to merge.

…eSnapshot corruption test

Two findings from claude v7 review.

1. Mandatory: line 4 still linked to the old parent doc filename
   (2026_05_25_proposed_snapshot_logical_encoder.md) even after v6's
   git mv to ..._partial_... . Link updated, with a sentence noting
   the v6 lifecycle rename.

2. Minor: TestCLISelfTestDetectsCorruption needs to live in
   internal/backup/encode_snapshot_test.go (package backup) because
   the unexported corruptBufferForTest hook can only be set from a
   same-package internal test. The CLI prefix suggested the wrong
   target file. Renamed to TestEncodeSnapshotSelfTestDetectsCorruption
   to match the encode_snapshot_test.go family.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v8 (e2938a6): both v7 findings addressed.

Mandatory (v6-still-open): Line 4 Parent: link updated from 2026_05_25_proposed_snapshot_logical_encoder.md to 2026_05_25_partial_snapshot_logical_encoder.md, matching the v6 git mv. Added a sentence noting the lifecycle promotion happened in v6.

Minor (v7 new): Renamed TestCLISelfTestDetectsCorruptionTestEncodeSnapshotSelfTestDetectsCorruption. The unexported corruptBufferForTest hook requires same-package access (internal/backup/encode_snapshot_test.go as package backup), so the CLI prefix was misleading.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 2m 29s —— View job


Review (v8)

  • Read v8 design doc
  • Read parent (partial) doc for cross-reference
  • Read encoder/decoder code for accuracy checks (decode.go, manifest.go, decoder main.go)
  • Verify v8 fixes (stale parent link + test rename)
  • Re-check outstanding items from v7 review
  • Post final review

v8 fixes: both verified ✅

Fix #1 (stale parent link — mandatory): Line 4 now correctly reads:

**Parent:** [2026_05_25_partial_snapshot_logical_encoder.md](2026_05_25_partial_snapshot_logical_encoder.md) — … the parent doc was promoted proposedpartial in v6 (#896); …

The lifecycle promotion note ("promoted in v6 (#896)") is a nice addition. The partial file exists under that path on disk. ✅

Fix #2 (test rename — minor): The test plan table at line 160 now shows TestEncodeSnapshotSelfTestDetectsCorruption, correctly signalling the test belongs in internal/backup/encode_snapshot_test.go compiled as package backup where the unexported corruptBufferForTest hook can be set. ✅


New finding — minor

Parent doc body still says Status: Proposed

The git mv in v6 renamed the file to 2026_05_25_partial_snapshot_logical_encoder.md, but the first few lines of the file body still read:

Status: Proposed
Author: bootjp

CLAUDE.md says partial docs should track "which milestones have landed and what remains." The body header should say Status: Partial (M1–M5 merged; M6 in review) (or equivalent) so a reader opening the file doesn't have to reconcile the filename with the internal status line. This is not a blocker — the filename is the authoritative lifecycle signal per CLAUDE.md — but the inconsistency will generate questions during incident response when an operator cats the file. Fix this →


Code accuracy cross-checks (all pass)

  • DecodeSnapshot(r io.Reader, opts DecodeOptions) signature: confirmed at decode.go:149. Doc's parameter-order asymmetry rationale (line 43) is accurate. ✅
  • DynamoDBBundleJSONL bool field: confirmed at decode.go:89. Doc's mapping true when manifest.DynamoDBLayout == "jsonl" (line 126) is accurate. ✅
  • exclusionsRequiredFields at manifest.go:350–355: only the four original fields, rename_s3_collisions absent. Consistent with the doc's intentional-asymmetry claim (line 128). ✅
  • Decoder main.go:45: var version = "dev" with -ldflags "-X main.version=...". Doc line 108 mirrors this correctly. ✅

Still-open from v7 — minor (not blocking)

Encoder fan-out order vs. decoder finalize order undocumented. The doc explains at line 110 that adapters_enabled reflects the canonical encoder order (redis, dynamodb, s3, sqs), not CLI input order. It does not note this differs from the decoder's finalize order (dynamodb → s3 → redis → sqs per decode.go:278). Since the encoder sorts its output by encoded key after collection, fan-out order has no effect on the final .fsm — correctness is unaffected — but an implementation author aligning with the decoder for consistency would produce a different order in adapters_enabled. One sentence would close this.


Verdict

Both v8 fixes land correctly. This doc is ready to merge. The one new minor finding (parent doc body status inconsistency) and the still-open fan-out-vs-finalize note can be addressed in the same commit before merge or deferred to the implementation PR — neither is load-bearing for the design review. No mandatory items remain open.

…ial filename (claude v8 minor)

CLAUDE.md says partial docs should track which milestones have landed
and what remains. The git mv in v6 renamed the file to ..._partial_...
but the body header still said 'Status: Proposed'. Updated to
'Status: Partial (M1-M5 merged ...; M6 CLI in review at PR #896)' so an
operator cat'ing the file sees the same state the filename signals.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v9 (653d928): claude v8 minor — parent doc body header now reflects the partial-lifecycle status.

Status: Partial (M1-M5 merged via PRs #807/#841/#847/#849/#864/#846/#892; M6 CLI in review at PR #896) — matches the filename and CLAUDE.md's partial-doc convention.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 1m 27s —— View job


Review (v9)

  • Read v9 parent doc changes
  • Read M6 doc for any remaining issues
  • Verify v8 new finding (Status header update)
  • Re-check still-open items from v8
  • Post final review

v9 fix: verified ✅

The v9 commit correctly addresses the v8 minor finding — the parent doc body header is now consistent with its filename:

-Status: Proposed
+Status: Partial (M1–M5 merged via PRs #807 / #841 / #847 / #849 / #864 / #846 / #892; M6 CLI in review at PR #896)
-Date: 2026-05-25
+Date: 2026-05-25 (promoted proposed → partial 2026-06-01 in PR #896)

A reader cating the file now sees the same state the partial_ filename signals. The promotion date and PR reference are included, matching what CLAUDE.md requires partial docs to track. ✅


Still-open — minor (carried from v7, non-blocking)

Encoder fan-out order vs. decoder finalize order undocumented

The M6 doc line 110 correctly states that adapters_enabled reflects the canonical encoder fan-out order (redis → dynamodb → s3 → sqs) and is not the CLI input order. But the decoder's finalize order per decode.go:278–296 is dynamodb → s3 → redis → sqs. An implementation author aligning with the decoder for consistency would produce a different adapters_enabled ordering. The doc doesn't call this divergence out — it was raised in v7, carried through v8, and remains.

Since the encoder sorts output by encoded key after collection, fan-out order has no effect on the final .fsm byte sequence, so this is purely a doc-clarity issue. A one-sentence note would close it permanently:

"The encoder fan-out order (redis first) intentionally differs from the decoder's finalize order (dynamodb first, per decode.go:278); the final .fsm byte sequence is determined by encoded key sort, not fan-out order, so either is correct."

This was classified as non-blocking in v8 and remains so. It can ship in the implementation PR's doc touchup if not addressed here.


Verdict

The v9 fix lands cleanly. This doc is ready to merge. No mandatory items remain open. The one still-open minor (fan-out order note) is a one-line doc clarification that can land in the implementation PR's first commit.

…er difference

Final non-blocking minor from claude v7/v8/v9. Encoder dispatches
redis -> dynamodb -> s3 -> sqs (parent design enumeration order);
decoder finalizes dynamodb -> s3 -> redis -> sqs (decode.go:278-296,
alphabetical-with-tweaks). The final .fsm byte sequence is determined
by encoded-key sort (encode.go:204), not adapter fan-out order, so
either ordering is correct as long as it is fixed. Doc now notes this
deliberate divergence so an implementor aligning with the decoder
doesn't produce a different adapters_enabled order in ENCODE_INFO.json.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Jun 1, 2026

v10 (1736435): folded in the only remaining non-blocking minor — explicit note on encoder fan-out (redis first) vs decoder finalize order (dynamodb first per decode.go:278), with the rationale that the final .fsm byte sequence is determined by encoded-key sort, not fan-out order.

Claude v9 verdict was already "ready to merge"; v10 just closes the last carried-over minor so the doc is fully clean.

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @bootjp's task in 1m 23s —— View job


Review (v10)

  • Read v10 design doc
  • Read decode.go for finalize order cross-reference
  • Verify v10 fix (fan-out order note)
  • Re-check outstanding items from v9 review
  • Post final review

v10 fix: verified ✅

The fan-out order note at line 112 is accurate and cleanly resolves the last carried-over minor from v7:

"The encoder fan-out order (redis first) intentionally differs from the decoder's finalize order at decode.go:278-296 (dynamodb first). The final .fsm byte sequence is determined by encoded key sort (snapshotBuilder.WriteTo sorts entries before writing — see encode.go:204), not by adapter fan-out order, so either ordering is correct as long as it is fixed. The two sides pick differently for purely historical reasons: the decoder finalizes in alphabetical-with-tweaks order; the encoder follows the parent design doc's enumeration order. No-op for restore correctness."

Cross-checked:

  • decode.go:278–296 finalize order: ddb → s3 → redis → sqs — confirmed at decode.go:285–296. The comment at line 268–271 explicitly names this order ("The order is fixed and matches the dump-tree order (dynamodb/, s3/, redis/, sqs/)"). ✅
  • encode.go:204: sort.Slice(b.entries, func(i, j int) bool { return bytes.Compare(b.entries[i].key, b.entries[j].key) < 0 }) — the key-sort claim is accurate. ✅

The phrasing "purely historical reasons" and "no-op for restore correctness" are both correct. The note closes the question permanently for any implementation author aligning with the decoder: either fan-out order produces an identical .fsm, so neither is wrong.


All prior mandatory and medium items: still resolved

Spot-checked the items that have been open the longest:


No new findings

The v10 change is a one-paragraph doc addition with no architectural or API surface impact. No new issues identified.


Verdict

Ready to merge. All mandatory, medium, and minor items from the nine prior review rounds are resolved. The doc is internally consistent, accurately cross-references the existing code (decode.go:278–296, encode.go:204, decode.go:89, manifest.go:350–355, cmd/elastickv-snapshot-decode/main.go:45), and contains a complete 17-test plan and 5-pass self-review. Implementation is unblocked.

@bootjp bootjp merged commit fe9e941 into main Jun 1, 2026
9 checks passed
@bootjp bootjp deleted the backup/m6-cli-design branch June 1, 2026 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant