docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI#896
Conversation
Capstone milestone for Phase 0b. M1-M5 are merged (#807/#841/#847/#849/#864/#846/#892); M6 wires them into a single binary operators can run, plus the round-trip self-test the parent doc requires. Specifies the CLI surface (--input/--output/--adapter/--last-commit-ts/ --self-test/--scratch-root), the fail-closed --last-commit-ts T >= manifest.last_commit_ts validation, the ENCODE_INFO.json provenance sidecar (encoder version, key-format version, fsm sha256, override record, self-test result), and the round-trip self-test gate (encode -> decode -> structural diff). Eight tests cover the four user-visible behaviors + the two safety nets. Folds the parent doc's "Decoder cleanup" note: Source.FSMCRC32C is dead (native .fsm has no CRC32C footer). M6 either repurposes it as a whole-file SHA-256 (recommended) or removes it; the sidecar already carries the anchor either way. Doc-first per CLAUDE.md "Design-doc-first workflow" - touches a new entrypoint (cmd/elastickv-snapshot-encode binary). Implementation lands in a follow-up PR after this proposal is reviewed and accepted.
|
@claude review |
|
Warning Review limit reached
More reviews will be available in 4 minutes and 53 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Claude finished @bootjp's task in 2m 33s —— View job Review
Overall the doc is tight, internally consistent with the parent design, and correctly models the existing code. I cross-checked against Lifecycle housekeeping (CLAUDE.md compliance, should fix before merge)The parent doc is still named
M1–M5 are already merged (cited in this PR's summary). The parent should have been Decoder cleanup — conflicting milestone attributionDoc under review, line 125–129:
Parent doc
The parent explicitly assigned
|
There was a problem hiding this comment.
Code Review
This pull request proposes the design for the cmd/elastickv-snapshot-encode CLI, which orchestrates snapshot encoding and supports an optional round-trip self-test. The feedback highlights several critical design improvements for robustness: dynamically naming the ENCODE_INFO.json sidecar to prevent overwrites when multiple snapshots share a directory, appending unique subdirectories to the --scratch-root to allow safe concurrent self-tests, and ensuring eager cleanup of scratch directories and stale mismatch.txt files to avoid disk exhaustion and operator confusion.
| ## `ENCODE_INFO.json` (provenance + integrity) | ||
|
|
||
| A new sidecar emitted next to the output `.fsm`, NOT inside it (the EKVPBBL1 byte format is fixed). Keep the schema minimal — restore operators need to confirm "encoded for the right cluster, by the right encoder version, against this exact file": |
There was a problem hiding this comment.
If the sidecar is statically named ENCODE_INFO.json, storing multiple .fsm files in the same directory (e.g., /backups/node1.fsm and /backups/node2.fsm) will cause their sidecar files to overwrite each other. To prevent collisions, the sidecar should be named dynamically based on the output .fsm filename (for example, <fsm_name>.encode_info.json or by appending .encode_info.json to the output path).
|
|
||
| Implementation: | ||
|
|
||
| 1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a scratch directory (default `os.TempDir()/encode-self-test-<random>/`; operator can pin via `--scratch-root` for tmpfs). |
There was a problem hiding this comment.
If the operator pins --scratch-root to a specific directory, running multiple concurrent encodes with --self-test enabled will cause them to use the exact same directory, leading to file collisions and false test failures. To support safe concurrent execution, the tool should use the pinned --scratch-root as a base directory and append a unique/random subdirectory name (e.g., <scratch-root>/encode-self-test-<random>/) rather than using the pinned path directly.
| 2. Diff the scratch tree against `--input`. The diff is structural, not byte-equal on transient fields: | ||
| - **Excluded from compare:** `MANIFEST.json` `wall_time_iso`, any future `phase`-specific provenance, the `MANIFEST.json` `source.fsm_path` (decoder's record of where it read from, not data). | ||
| - **Required equal:** every adapter subdir's full tree (filenames + bytes), `MANIFEST.json` adapter-scope arrays, `last_commit_ts`. | ||
| 3. On mismatch: the CLI exits with code 2, writes a `mismatch.txt` next to the output `.fsm` listing the first N differing paths, and the `ENCODE_INFO.json` records `self_test.matched: false` before the exit. Restore operators inspect `mismatch.txt`. |
There was a problem hiding this comment.
The design should specify that the temporary scratch directory created for the self-test must be eagerly cleaned up (e.g., via os.RemoveAll) upon completion of the self-test (both on success and failure) to prevent unbounded disk space usage and potential disk exhaustion. Additionally, any pre-existing mismatch.txt at the destination should be explicitly removed at the start of a new run or upon a successful self-test to avoid confusing operators with stale failure reports.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 31c0a0f83d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| Implementation: | ||
|
|
||
| 1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a scratch directory (default `os.TempDir()/encode-self-test-<random>/`; operator can pin via `--scratch-root` for tmpfs). |
There was a problem hiding this comment.
Account for DecodeSnapshot not writing MANIFEST.json
When --self-test is implemented this way, the scratch tree produced by the library call will not contain MANIFEST.json or CHECKSUMS: internal/backup/decode.go explicitly documents that DecodeSnapshot only dispatches adapter files and leaves manifest emission to the cmd wrapper. Since the next step requires comparing MANIFEST.json fields (last_commit_ts, adapter scopes, source.fsm_path exclusions), following this spec either makes every self-test fail on a missing manifest or skips the manifest checks that are supposed to validate the effective override timestamp.
Useful? React with 👍 / 👎.
Four substantive design corrections from the freshly-opened M6 PR review. 1. ENCODE_INFO.json sidecar is path-derived, not static (gemini medium). /backups/node1.fsm -> /backups/node1.fsm.encode_info.json so multiple .fsm files can share a directory without sidecar collisions. Pinned by TestCLIEncodeInfoPathDerivedFromOutput + TestCLIEncodeInfoTwoFilesNoCollision. 2. Self-test scratch subdir is ALWAYS a unique <random> directory under the resolved base, even when --scratch-root is operator-pinned (gemini medium). Concurrent encodes against the same base would otherwise collide and produce false failures. Pinned by TestCLISelfTestPinnedScratchRootStillUniqueSubdir. 3. Eager cleanup: os.RemoveAll on the scratch subdir in every exit path (success, mismatch, error) via defer at MkdirTemp time; stale mismatch.txt is removed at start of every run (gemini medium). Pinned by TestCLISelfTestCleansScratchOnAllPaths. 4. Self-test header check uses DecodeSnapshot's in-memory DecodeResult instead of materializing MANIFEST.json in the scratch (codex P2). DecodeSnapshot is documented as not emitting MANIFEST.json; the prior design implicitly relied on its presence to compare last_commit_ts. The new design compares Result.Header.LastCommitTS against the effective T (manifest or override), avoiding a redundant manifest synthesis step. The mismatch.txt format also picks up the same path-derivation rule (<output>.mismatch.txt). Self-review pass 5 + test plan updated: 12 tests now (was 8), the four new ones pin the regressions above. No implementation impact on the encoder slices already merged.
|
v2 (fb6ba10): addresses all 4 review findings on the design. Gemini medium #1 (sidecar collision): ENCODE_INFO.json is now path-derived: Gemini medium #2 (scratch-root collision): Self-test scratch is ALWAYS a unique Gemini medium #3 (cleanup): Eager Codex P2 (MANIFEST.json absent from scratch): Self-test now uses Test plan grew from 8 to 12 — 4 new tests pin each regression above. No impact on already-merged M1-M5 slices. @claude review |
|
Claude finished @bootjp's task in 3m 11s —— View job Review (v2)
What v2 correctly addressesAll four flagged items land cleanly:
Still-open from v1 (must fix + medium)1. Lifecycle rename — mandatory before merge Per CLAUDE.md:
M1–M5 are merged, so the parent doc must be 2. The parent doc (§"Per-adapter reverse encoders", lines 329-332) explicitly flags
The M6 doc's CLI surface omits it entirely and doesn't note the deferral. The current encoder hardcodes 3. Exit code 2 for a user-supplied flag violation — medium
Still-open from v1 (minor)4. Decoder cleanup milestone attribution The parent doc explicitly wrote "Decoder cleanup folded into M1." M1 shipped without it; M6 picks it up. The doc doesn't say this — it only says "parent doc §'Decoder cleanup'" and claims M6 owns it. Suggest one sentence: "The parent doc targeted M1; that scope was not landed there and is now deferred to M6." Prevents a reader diffing the two docs from wondering which one is authoritative. 5. Line 121 still says "Patch the in-tree encoder to deliberately mangle one byte after writing the 6. P1 test The parent doc's M6 description includes an "end-to-end test that loads the output into a fresh single-node cluster and reads back every adapter's data." That test is classified P1 in the parent's test table. It doesn't appear in the M6 doc's 12-test plan and is never explicitly deferred. The self-test proves round-trip equality of the directory tree; only the cluster test proves the 7. The decoder 8. Schema example shows VerdictThe four v2 fixes all land cleanly. The lifecycle rename (#1 above) is mandatory before merge per CLAUDE.md, and the missing |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb6ba10fb2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| Implementation: | ||
|
|
||
| 1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a unique scratch subdirectory. The scratch dir is ALWAYS a fresh `encode-self-test-<random>` subdirectory under the resolved base (default base `os.TempDir()`; operator override via `--scratch-root`). The unique-suffix is non-optional even when `--scratch-root` is pinned — concurrent encodes against the same pinned base would otherwise collide and produce false failures (gemini medium #896). |
There was a problem hiding this comment.
Write self-tested snapshots to a temp path first
When --self-test is enabled and the encoder output is corrupt, this sequence writes the final --output .fsm before running the decode/diff, and line 97 then only records matched:false before exiting. That leaves a known-bad snapshot at the restore-visible path; the parent design explicitly says the self-test should run before the encoder finalizes the output so a node never receives an unloadable .fsm (docs/design/2026_05_25_proposed_snapshot_logical_encoder.md §Round-trip self-test). The implementation should write to a temp path and only rename into --output after the self-test passes, or remove the final file on failure.
Useful? React with 👍 / 👎.
| 5. Emits `ENCODE_INFO.json` next to the output `.fsm` (provenance + integrity anchor). | ||
| 6. Optionally runs the round-trip self-test (`--self-test`): re-decode the produced `.fsm` into a scratch dir and diff against the input tree. | ||
|
|
||
| The CLI is the **only new external surface**; the encoder slices already export their constructors and `Encode(*snapshotBuilder) error`. M6 is mostly wiring + provenance + the self-test gate. |
There was a problem hiding this comment.
Expose an encoder entrypoint for the new command
The proposed command cannot directly perform this wiring from cmd/elastickv-snapshot-encode: snapshotBuilder and newSnapshotBuilder are unexported in internal/backup/encode.go, and the adapter Encode methods take *snapshotBuilder, so an external main package has no way to construct the builder to pass to NewRedisEncoder(...).Encode etc. Without adding an exported EncodeSnapshot/builder API or moving orchestration into internal/backup, the follow-up implementation slice described here is blocked.
Useful? React with 👍 / 👎.
…entrypoint) Two design corrections from codex review on v2. 1. Write-then-rename atomic publish (codex P2 line 93). Previously: encoder wrote --output first, then ran self-test, then recorded matched:false in the sidecar on failure. That leaves a known-bad .fsm at the restore-visible path — exactly the failure mode the parent doc's atomic-publish discipline forbids. New design: write to <output>.tmp-<random>, fsync+close, run self-test, ONLY rename to <output> on success. Failure path os.RemoveAll's the temp file. Pinned by TestCLISelfTestFailureLeavesNoFsmAtOutputPath. 2. New backup.EncodeSnapshot library entrypoint (codex P2 line 17). Previously: design said the CLI directly wires NewRedisEncoder(...). Encode(snapshotBuilder), but snapshotBuilder + newSnapshotBuilder are unexported in internal/backup/encode.go. An external main package cannot construct the builder. New design: mirror the decoder's DecodeSnapshot/EncodeSnapshot symmetry — internal/backup/ encode_snapshot.go exposes EncodeSnapshot(opts, out) which handles the builder construction, per-adapter dispatch, and optional self- test internally. The CLI's responsibility shrinks to flag parsing + the temp-file dance + sidecar writing. Unblocks future callers (Phase 1 live extractor, integration tests). Pinned by TestEncodeSnapshotLibraryRoundTrip. Files-to-add and test plan updated: 14 tests now (was 12), 4 files to add now (was 4 — encode_snapshot.go replaces direct builder use in main.go).
|
v3 (a253d7f): addresses both codex P2 findings on v2. Codex P2 #1 (write-temp-first): New write-then-rename atomic-publish discipline. Encoder writes to Codex P2 #2 (library entrypoint exposure): Added Files-to-add and test plan updated: 14 tests now (was 12). @claude review |
|
Claude finished @bootjp's task in 4m 25s —— View job Review (v3)
v3 Codex P2 findings: both addressedCodex P2 #1 (write-temp-first): Atomics publish discipline is now explicit: Codex P2 #2 (library entrypoint): New finding from v3 — medium
The implementor has two valid choices, and each has consequences the doc should make explicit:
Option A is the only spec-compliant path given New findings from v3 — minorProse/API inconsistency (line 11): "returns an Parameter order asymmetry (line 36): Still-open from v2 — mandatory before merge1. Lifecycle rename missing (same as v2) Per CLAUDE.md: "rename Still-open from v2 — medium2. Parent doc §"Per-adapter reverse encoders" lines 329-331 explicitly says: "An optional 3. Exit code 2 rationale for user-supplied flag violation (same as v2)
Still-open from v2 — minor4. Decoder cleanup milestone attribution: Parent doc says "Decoder cleanup folded into M1." M1 shipped without it; M6 picks it up. The doc should say "the parent doc targeted M1; that scope was not landed there and is now deferred to M6" so a reader diffing the two docs knows which is authoritative. 5. 6. P1 test 7. 8. VerdictBoth Codex P2 findings are cleanly resolved. The new medium finding (#9 — |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a253d7f7f6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| 1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a unique scratch subdirectory. The scratch dir is ALWAYS a fresh `encode-self-test-<random>` subdirectory under the resolved base (default base `os.TempDir()`; operator override via `--scratch-root`). The unique-suffix is non-optional even when `--scratch-root` is pinned — concurrent encodes against the same pinned base would otherwise collide and produce false failures (gemini medium #896). | ||
| 2. **Header check (no MANIFEST.json in scratch):** `DecodeSnapshot` returns a `DecodeResult` whose `Header.LastCommitTS` is the value the decoder read from the produced `.fsm`. Compare it against the effective `T` (manifest value or `--last-commit-ts` override). The decoder library does NOT emit `MANIFEST.json` (that is the cmd wrapper's job per `internal/backup/decode.go`); the self-test deliberately consumes the library output and compares the in-memory header instead, avoiding a redundant manifest synthesis step (codex P2 #896). | ||
| 3. **Tree diff:** structurally compare the scratch adapter subtrees against the corresponding subdirs of `--input`. The diff is byte-equal on adapter files (filenames + bytes — every per-adapter dump is deterministic). `MANIFEST.json` from `--input` is NOT compared at all in the scratch tree (the scratch has none); the `last_commit_ts` field is covered by the header check above. `wall_time_iso` and other manifest transient fields are not in scope. |
There was a problem hiding this comment.
Preserve decode options during self-test
When --input was produced with decode-side options such as --include-incomplete-uploads, --include-sqs-side-records, or --dynamodb-bundle-mode jsonl, a plain DecodeSnapshot self-test will not recreate the same tree: newDispatcher only emits those sections when the corresponding DecodeOptions booleans/layout are set, and the decoder CLI currently feeds them from flags while recording them in MANIFEST.json. The self-test needs to derive those DecodeOptions from the input manifest, otherwise valid dumps using these options will fail the structural diff (or be compared against missing/reformatted subtrees).
Useful? React with 👍 / 👎.
…test (codex P2) Codex P2 v3 flagged that a --input tree produced with non-default decoder options (--include-incomplete-uploads, --include-orphans, --preserve-sqs-visibility, --include-sqs-side-records, --dynamodb-bundle-mode jsonl) will not round-trip through a plain DecodeSnapshot call: the dispatcher only emits those sections when the corresponding DecodeOptions boolean / layout is set. A self-test with hardcoded default options would spuriously diff on every dump that exercised any of these flags. Fix: the CLI reads MANIFEST.json's Exclusions.* + DynamoDBLayout fields and threads them into the scratch DecodeOptions. Manifests that omit the fields (older dumps) get decoder defaults — same forgiveness the decoder library applies. Numbering of the self-test step list also fixed (two step-4s in v3). Test plan grew from 14 to 15: TestCLISelfTestPreservesManifestDecodeOptions pins the round-trip on a fixture that uses both --include-incomplete-uploads and --dynamodb-bundle-mode jsonl.
|
v4 (9e295f5): addresses codex P2 v3 finding. Codex P2 (DecodeOptions threading): The self-test now reads MANIFEST.json's Also fixed two duplicate step-4 numbering in the self-test sequence. Test plan grew from 14 to 15. @claude review |
|
Claude finished @bootjp's task in 4m 2s —— View job Review (v4)
v4 Codex P2 finding: addressedThe self-test now reads New finding from v4 — medium
The doc says it "threads them into the scratch Two options:
Option 2 is fine for M6 since the collision scenario is rare in practice. Either way the doc should call it out — the new test only covers New minor finding from v4
Line 116 says "threads … Still-open from v3 — mandatory before merge1. Lifecycle rename
Still-open from v3 — medium2. Line 11 says " If the CLI passes the temp This was raised in v3 and is still unresolved. The doc should explicitly name the chosen model. The most spec-compliant path is Option A (internal buffer when 3. Parent doc line 329 explicitly says: "An optional 4. Exit code 2 rationale for flag-driven data constraint violation Line 55 says: "1 on user-input error (flag parse, missing file), 2 on data-correctness failure." Still-open from v3 — minor5. Decoder cleanup milestone attribution — The parent doc said "Decoder cleanup folded into M1"; M1 shipped without it; M6 picks it up. One sentence ("the parent doc targeted M1; that scope was not landed there and is now deferred to M6") prevents a reader diffing the two docs from wondering which is authoritative. 6. 7. P1 test 8. 9. 10. Line 11 prose inconsistency — "returns an 11. Parameter order asymmetry — VerdictThe v4 Codex P2 finding (DecodeOptions threading) is cleanly resolved. The new medium finding ( |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e295f54a1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Implementation: | ||
|
|
||
| 1. After the primary `.fsm` is written + sha256-anchored, invoke `backup.DecodeSnapshot` on the produced bytes into a unique scratch subdirectory. The scratch dir is ALWAYS a fresh `encode-self-test-<random>` subdirectory under the resolved base (default base `os.TempDir()`; operator override via `--scratch-root`). The unique-suffix is non-optional even when `--scratch-root` is pinned — concurrent encodes against the same pinned base would otherwise collide and produce false failures (gemini medium #896). | ||
| 2. **DecodeOptions are read back from the input MANIFEST.json.** The self-test MUST feed `DecodeSnapshot` the same option flags the decoder used to produce `--input`, otherwise a tree produced with (e.g.) `--include-incomplete-uploads` or `--dynamodb-bundle-mode jsonl` will spuriously fail the diff (codex P2 v3 #896). The CLI reads `MANIFEST.json` `Exclusions.{IncludeIncompleteUploads,IncludeOrphans,PreserveSQSVisibility,IncludeSQSSideRecords}` and `DynamoDBLayout` and threads them into the scratch `DecodeOptions`. Manifests that omit these fields (older dumps) decode with defaults — same forgiveness the decoder applies. Pinned by `TestCLISelfTestPreservesManifestDecodeOptions`. |
There was a problem hiding this comment.
Preserve S3 rename-collision mode in self-test
When the input dump contains an S3 object whose key ends in .elastickv-meta.json, the original decoder only succeeds if --rename-collisions sets DecodeOptions.RenameS3Collisions; the manifest path listed here does not restore that option, and the fields enumerated omit both RenameS3Collisions and any KEYMAP-derived fallback. A self-test of such a valid dump will re-decode the produced .fsm with the default false setting and fail on the same meta-suffix collision instead of proving the encoder output is bad, so the design should infer this option from the dump artifacts or explicitly include it in the manifest/options threading.
Useful? React with 👍 / 👎.
…dex P2 v4) Codex P2 v4 flagged that the decoder CLI's --rename-collisions flag drives DecodeOptions.RenameS3Collisions but is NOT recorded in MANIFEST.json's Exclusions struct. A self-test of a dump produced with --rename-collisions cannot perfectly round-trip today because the encoder has no way to know to ask for the same renaming pass. Pre-existing decoder-side schema gap, not just an encoder design hole. Fix folded into M6 (same precedent as parent doc M1 "Decoder cleanup folded into M1"): - Exclusions gains RenameS3Collisions bool with json tag rename_s3_collisions. - The decoder CLI's emitManifest populates the new field from cfg.renameCollisions. - The encoder self-test reads Exclusions.RenameS3Collisions and threads it into the scratch DecodeOptions. - Older manifests without the field decode as false (no-rename), matching the decoder default. Test plan grew from 15 to 17: - TestCLISelfTestPreservesRenameS3Collisions — full round-trip with --rename-collisions input. - TestExclusionsLegacyManifestOmitsRenameS3Collisions — older manifests decode safely with default false. This is a small decoder-side touchup; the manifest field has been a zero-cost extension since the Exclusions struct uses regular JSON tags (no DisallowUnknownFields gate on this field set).
|
v5 (bae0377): addresses codex P2 v4 finding. Codex P2 (RenameS3Collisions not in manifest): Pre-existing decoder-side gap — Fix folded into M6 (same precedent as parent doc M1 "Decoder cleanup folded into M1"):
Test plan grew from 15 to 17 — @claude review |
|
Claude finished @bootjp's task in 2m 32s —— View job Review (v5)
v5 Codex P2 finding: addressed at spec levelThe schema extension is correctly specified: New finding from v5 — medium
The doc says (line 116): "Manifests that omit these fields (older dumps) decode with defaults — same forgiveness the decoder applies." That claim is only true if The implementation author has two choices; the doc must pick one:
Still-open from v4 — mandatory before merge1. Lifecycle rename
Still-open from v4 — medium2. The doc says 3. Parent doc §"Per-adapter reverse encoders" explicitly says: "An optional 4. Exit code 2 rationale for flag-driven data constraint (line 55)
Still-open from v4 — minor5. Line 11 prose inconsistency "returns an 6. "Patch the in-tree encoder to deliberately mangle one byte after writing the 7. P1 test Parent doc's P1 test table includes this end-to-end cluster-boot test under M6. The 17-test plan doesn't include it, and it's not explicitly deferred. Self-test proves round-trip equality; only the cluster test proves the 8. Decoder 9. Schema example shows 10. Decoder cleanup milestone attribution Parent doc said "Decoder cleanup folded into M1." M1 shipped without it; M6 picks it up. The doc should say "the parent doc targeted M1; that scope was not landed there and is now deferred to M6" so a reader diffing the two docs knows which is authoritative. 11. Line 116 refers to threading VerdictThe Codex P2 v4 finding ( |
… claude v3/v4/v5 reviews
Items addressed (some open across 3 review rounds):
1. **Lifecycle rename (mandatory, flagged 3 rounds).** git mv the parent
doc from proposed_snapshot_logical_encoder.md to
partial_snapshot_logical_encoder.md. M1-M5 are merged, parent doc's
own §"Why a separate design doc" calls for this rename, CLAUDE.md
"Design Documents" formalizes the marker.
2. **exclusionsRequiredFields explicit policy (v5 new medium).**
rename_s3_collisions is intentionally NOT added to
exclusionsRequiredFields (manifest.go:350-355) so older manifests
decode with the zero-value false. Asymmetry with the four
pre-existing required fields is intentional; doc now states the
constraint (backward compat with every existing manifest) and the
mechanism (JSON's zero-value fallback).
3. **io.Writer self-test buffering model (open 3 rounds).** Committed
to Option A: when opts.SelfTest=true, EncodeSnapshot buffers the
FSM in a *bytes.Buffer first, decodes from the buffer, then copies
to out only on match. Memory cost is one FSM-sized allocation on
top of the sort working set. When opts.SelfTest=false the FSM
streams straight to out.
4. **--s3-chunk-size deferral note (open 3 rounds).** Explicitly
deferred — encoder hardcodes canonical 1 MiB; flag is a follow-up.
Rationale: no deployment in-tree uses a non-default chunk size.
5. **Exit code 2 rationale for flag-driven data constraint (open 3
rounds).** Added one sentence: a flag whose value violates a
data constraint from the manifest is exit 2, not exit 1, because
the constraint comes from the data.
6. **Line 11 vestigial "plus the assembled .fsm bytes" (open 3
rounds).** Removed; replaced with explicit "written to out
io.Writer; not returned."
7. **DynamoDBBundleJSONL field name (v4 minor).** Self-test
threading paragraph now says
DynamoDBBundleJSONL (true when manifest.DynamoDBLayout == "jsonl")
so the implementation author doesn't have to look it up.
8. **TestCLISelfTestDetectsCorruption injection mechanism (open 3
rounds).** Now specifies a corruptingWriter wrapper around the
EncodeSnapshot out io.Writer (xor'd byte at deterministic offset);
self-test reads back the corrupted buffer.
9. **P1 cluster test TestEncoderProducesLoadableSnapshot defer
(open 3 rounds).** Explicitly deferred from M6 with rationale
(a/b/c) covering why the existing 17 tests adequately substitute
for it within M6's scope.
10. **encoder_version ldflags mechanism (open 3 rounds).** Mirrors
decoder/main.go:45 — var version = "dev" + -ldflags "-X main.version=...".
Test builds keep the literal "dev".
11. **adapters_enabled ordering (open 3 rounds).** Canonical encoder
fan-out order (redis, dynamodb, s3, sqs), NOT CLI input order, for
bytewise reproducibility across runs that pass --adapter in
different sequences.
12. **Decoder cleanup milestone attribution (open 3 rounds).** Doc
now says "Parent doc said 'Decoder cleanup folded into M1.' M1
shipped without it; M6 picks it up" so a reader diffing the two
docs knows M6 is authoritative.
13. **Parameter order asymmetry note.** EncodeSnapshot(opts, out) vs
DecodeSnapshot(r, opts) — deliberate inversion so each function's
I/O argument sits adjacent to the direction it travels.
No new content removed; doc is now ~30 lines longer.
|
v6 (be063d6): addresses 1 new medium + 12 still-open items from claude v3/v4/v5 reviews. Mandatory items resolved:
Polish items (all flagged across 3 review rounds):
@claude review |
|
Claude finished @bootjp's task in 2m 46s —— View job Review (v6)
v6 claimed fixes: all 13 verifiedEvery item from the trigger comment confirmed as resolved:
New finding from v6 — must fix before mergeStale parent link in M6 doc (line 4) The lifecycle rename in this PR renamed the parent doc from **Parent:** [`2026_05_25_proposed_snapshot_logical_encoder.md`](2026_05_25_proposed_snapshot_logical_encoder.md)Since the file no longer exists under that name, this is a broken link introduced by the same PR that fixes the lifecycle rename. Change to: **Parent:** [`2026_05_25_partial_snapshot_logical_encoder.md`](2026_05_25_partial_snapshot_logical_encoder.md)New finding from v6 — minorEncoder fan-out order diverges from decoder's canonical order (undocumented) The M6 doc claims the encoder processes adapters in VerdictAll 13 v6 claimed fixes land correctly. The mandatory item before merge is the single stale parent link on line 4 — a two-word fix introduced by the lifecycle rename in this same PR. Everything else is resolved or explicitly deferred with rationale. Once line 4 is updated, this doc is ready to merge. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be063d6110
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| | `TestCLIAcceptsEqualAndHigherLastCommitTSOverride` | `T == manifest` and `T > manifest` both succeed; the emitted `.fsm` header carries the chosen `T`. | | ||
| | `TestCLIRejectsMissingManifest` | `--input` directory has no `MANIFEST.json` → exit 1, clear error. | | ||
| | `TestCLIRejectsUnknownAdapter` | `--adapter foo` → exit 1, no `.fsm` written. Decoder-parity test for the adapter CSV parser. | | ||
| | `TestCLISelfTestDetectsCorruption` | Use a test-only `corruptingWriter` wrapper (xor'd byte at a deterministic offset) as the `EncodeSnapshot` `out io.Writer`; the buffered FSM that the self-test reads back is the corrupted version, so the decode either errors or produces a diff. The CLI exits 2 and writes `mismatch.txt`. Pins that the self-test actually compares (not just runs). | |
There was a problem hiding this comment.
Corrupt bytes before the self-test reads them
With the buffering model specified above, EncodeSnapshot performs the self-test against its internal buffer and only copies to the supplied out io.Writer after the self-test has matched, so a corruptingWriter wrapped around out cannot affect the bytes that DecodeSnapshot validates. Implementing this test as written would either fail with exit 0 or fail to pin the intended corruption path; the injected corruption needs to happen in the encoder/buffer before the self-test decode.
Useful? React with 👍 / 👎.
…ed self-test (codex P2) Codex P2 v6 sharp catch: with v6's buffering model (EncodeSnapshot buffers in *bytes.Buffer, runs self-test against the buffer, copies to out io.Writer ONLY on match), a corruptingWriter wrapping out would corrupt bytes AFTER the self-test had already passed - the test would silently report success with corrupted output. Replaced the mechanism: same-package test-only hook on EncodeOptions (unexported corruptBufferForTest func([]byte)) that runs against the internal buffer between WriteTo and the self-test decode. Corruption reaches the self-test's DecodeSnapshot but never reaches out (the write-then-rename discipline ensures self-test failure does NOT publish). Hook is unexported so external callers cannot set it.
|
v7 (ac7af85): addresses codex P2 v6 — sharp catch on the corruption-injection mechanism. The v6 New mechanism: same-package test-only hook on @claude review |
|
Claude finished @bootjp's task in 3m 16s —— View job Review (v7)
v7 Codex P2 finding: addressed ✅The new mechanism is coherent with the buffering model:
New finding from v7 — minor
The test uses the unexported Suggested rename: Still-open from v6 — mandatory before mergeStale parent link on line 4 The **Parent:** [`2026_05_25_proposed_snapshot_logical_encoder.md`](2026_05_25_proposed_snapshot_logical_encoder.md)The linked file no longer exists under that name, making this a broken link introduced by the same PR that performs the rename. This has been flagged since v6 as mandatory. Fix: **Parent:** [`2026_05_25_partial_snapshot_logical_encoder.md`](2026_05_25_partial_snapshot_logical_encoder.md)Still-open from v6 — minorEncoder fan-out order vs decoder finalize order undocumented Line 110 states the VerdictThe v7 Codex P2 finding is correctly resolved. The mandatory stale parent link (#1 above) has been open since v6 and is a two-word fix. Once that and the test naming inconsistency (minor, but practically important for guiding the implementation) are addressed, this doc is ready to merge. |
…eSnapshot corruption test Two findings from claude v7 review. 1. Mandatory: line 4 still linked to the old parent doc filename (2026_05_25_proposed_snapshot_logical_encoder.md) even after v6's git mv to ..._partial_... . Link updated, with a sentence noting the v6 lifecycle rename. 2. Minor: TestCLISelfTestDetectsCorruption needs to live in internal/backup/encode_snapshot_test.go (package backup) because the unexported corruptBufferForTest hook can only be set from a same-package internal test. The CLI prefix suggested the wrong target file. Renamed to TestEncodeSnapshotSelfTestDetectsCorruption to match the encode_snapshot_test.go family.
|
v8 (e2938a6): both v7 findings addressed. Mandatory (v6-still-open): Line 4 Minor (v7 new): Renamed @claude review |
|
Claude finished @bootjp's task in 2m 29s —— View job Review (v8)
v8 fixes: both verified ✅Fix #1 (stale parent link — mandatory): Line 4 now correctly reads:
The lifecycle promotion note ("promoted in v6 (#896)") is a nice addition. The Fix #2 (test rename — minor): The test plan table at line 160 now shows New finding — minorParent doc body still says The CLAUDE.md says partial docs should track "which milestones have landed and what remains." The body header should say Code accuracy cross-checks (all pass)
Still-open from v7 — minor (not blocking)Encoder fan-out order vs. decoder finalize order undocumented. The doc explains at line 110 that VerdictBoth v8 fixes land correctly. This doc is ready to merge. The one new minor finding (parent doc body status inconsistency) and the still-open fan-out-vs-finalize note can be addressed in the same commit before merge or deferred to the implementation PR — neither is load-bearing for the design review. No mandatory items remain open. |
…ial filename (claude v8 minor) CLAUDE.md says partial docs should track which milestones have landed and what remains. The git mv in v6 renamed the file to ..._partial_... but the body header still said 'Status: Proposed'. Updated to 'Status: Partial (M1-M5 merged ...; M6 CLI in review at PR #896)' so an operator cat'ing the file sees the same state the filename signals.
|
Claude finished @bootjp's task in 1m 27s —— View job Review (v9)
v9 fix: verified ✅The v9 commit correctly addresses the v8 minor finding — the parent doc body header is now consistent with its filename: -Status: Proposed
+Status: Partial (M1–M5 merged via PRs #807 / #841 / #847 / #849 / #864 / #846 / #892; M6 CLI in review at PR #896)
-Date: 2026-05-25
+Date: 2026-05-25 (promoted proposed → partial 2026-06-01 in PR #896)A reader Still-open — minor (carried from v7, non-blocking)Encoder fan-out order vs. decoder finalize order undocumented The M6 doc line 110 correctly states that Since the encoder sorts output by encoded key after collection, fan-out order has no effect on the final
This was classified as non-blocking in v8 and remains so. It can ship in the implementation PR's doc touchup if not addressed here. VerdictThe v9 fix lands cleanly. This doc is ready to merge. No mandatory items remain open. The one still-open minor (fan-out order note) is a one-line doc clarification that can land in the implementation PR's first commit. |
…er difference Final non-blocking minor from claude v7/v8/v9. Encoder dispatches redis -> dynamodb -> s3 -> sqs (parent design enumeration order); decoder finalizes dynamodb -> s3 -> redis -> sqs (decode.go:278-296, alphabetical-with-tweaks). The final .fsm byte sequence is determined by encoded-key sort (encode.go:204), not adapter fan-out order, so either ordering is correct as long as it is fixed. Doc now notes this deliberate divergence so an implementor aligning with the decoder doesn't produce a different adapters_enabled order in ENCODE_INFO.json.
|
v10 (1736435): folded in the only remaining non-blocking minor — explicit note on encoder fan-out ( Claude v9 verdict was already "ready to merge"; v10 just closes the last carried-over minor so the doc is fully clean. @claude review |
|
Claude finished @bootjp's task in 1m 23s —— View job Review (v10)
v10 fix: verified ✅The fan-out order note at line 112 is accurate and cleanly resolves the last carried-over minor from v7:
Cross-checked:
The phrasing "purely historical reasons" and "no-op for restore correctness" are both correct. The note closes the question permanently for any implementation author aligning with the decoder: either fan-out order produces an identical All prior mandatory and medium items: still resolvedSpot-checked the items that have been open the longest:
No new findingsThe v10 change is a one-paragraph doc addition with no architectural or API surface impact. No new issues identified. VerdictReady to merge. All mandatory, medium, and minor items from the nine prior review rounds are resolved. The doc is internally consistent, accurately cross-references the existing code ( |
Summary
Phase 0b M6 capstone design doc. M1–M5 are merged; this proposal pins down the CLI implementation specifics the parent design doc (
2026_05_25_proposed_snapshot_logical_encoder.md) left at sketch level.What lands
docs/design/2026_06_01_proposed_snapshot_encode_cli.md— new proposal covering:--input,--output,--adapter(mirrors decoder parsing),--last-commit-ts,--self-test,--scratch-root. Decoder-equivalent exit codes (0 success / 1 user error / 2 data error).--last-commit-tsoverride: fail-closedT ≥ manifest.last_commit_tsvalidation per parent §"MVCC re-encoding"; preserves the uniform-stamping invariant (single substitution everywherelast_commit_tsappears).ENCODE_INFO.jsonsidecar: encoder version, key-format version, output.fsmSHA-256, override record, manifest cluster_id propagation, self-test result. Sidecar (not inside the.fsm— EKVPBBL1 byte format is fixed).--self-testruns encode → decode → structural diff against the input tree, exits 2 withmismatch.txton first-N-differing-paths if the diff is non-empty. Default off (doubles encode time and disk), but CI runs it on every encoder PR.Source.FSMCRC32Cas dead (native.fsmhas no CRC32C footer); recommended path is repurpose-as-SHA-256, fallback is remove.ENCODE_INFO.jsonround-trip + version gate.Why doc-first
Per CLAUDE.md "Design-doc-first workflow" — M6 touches a new entrypoint (
cmd/elastickv-snapshot-encodebinary) plus a new file format (ENCODE_INFO.json), so the proposal lands first; implementation is a separate follow-up PR after this is reviewed and accepted.Test plan
cmd/elastickv-snapshot-decode/main.go, encoder constructors atinternal/backup/encode_*.go, manifest discipline atWriteManifestper gemini r1 backup: cmd/elastickv-snapshot-decode CLI + CHECKSUMS writer #810).--last-commit-ts T < manifestfail-closed; self-test default-off with reasoning.