Skip to content

inline-checksum: optional per-cluster CRC validation (TD.100226.1)#1016

Open
schmidt-scaled wants to merge 9 commits into
mainfrom
inline-checksum-validation
Open

inline-checksum: optional per-cluster CRC validation (TD.100226.1)#1016
schmidt-scaled wants to merge 9 commits into
mainfrom
inline-checksum-validation

Conversation

@schmidt-scaled
Copy link
Copy Markdown
Contributor

Summary

  • Control-plane support for the inline silent-data-error protection feature whose data-plane work landed on ultra branch checksum-validation. Optional, frozen at cluster-create time, no upgrade path for existing clusters.
  • Adds Cluster.inline_checksum, NVMeDevice.md_size / md_supported, bdev_alceml_create checksum-method kwargs, and per-cluster CLI flags on cluster create, cluster add, and sn configure.
  • Auto-detects per-device NVMe metadata support at add-node and on restart from SPDK's bdev_get_bdevs md_size field (spdk_bdev_get_md_size), so a cluster can mix md-on-device and fallback drives. Capacity overhead from the fallback layout (6 / 510 blocks per 2 MiB extent ≈ 1.17%) is charged as initial provisioned utilization rather than reducing reported raw capacity.

Why this branch and not main

  • A few inline-checksum lines previously leaked into main via the unrelated commit aa81fafe (iteration-77 hang fix) on 2026-04-30. Those four hunks were removed in 5ff3e214 so main is clean again. This branch carries the full feature in one self-contained commit on top of that clean main.

Design ref

  • TD.100226.1 (Inline Protection from Silent Data Errors).

Compatibility

  • Cluster flag defaults to False; existing clusters and code paths unchanged.
  • Data plane reads md_size directly from the bound bdev, so the control plane only flips checksum_validation_method 1/2/0 — matching the checksum-validation branch RPC schema (checksum_validation_method, cache_size, cache_eviction_threshold).

Test plan

  • 27 new unit tests in tests/test_inline_checksum.py (model defaults, find_md_lbaf_id, alceml_checksum_params, fallback-overhead math, RPC param wire-up for each method, addNvmeDevices md detection from a mocked bdev payload). All pass.
  • Full unit-test suite (870 tests) stayed green throughout development on a separate worktree.
  • Lab dual-fault test on an FT=2 cluster with --enable-inline-checksum, mixing one drive formatted with an md-capable LBAF (cv_md_method) and one without (cv_fallback_method); verify capacity reporting reflects the ~1.17% pre-charge only on the fallback drive.
  • Negative test: simulate single-bit corruption on a passtest bdev under an inline-checksum cluster and confirm read returns IO error rather than silent corruption.
  • Confirm existing clusters (no inline_checksum field in DB) read back as False and behave exactly as today.

Open follow-ups (not blocking this PR)

  • The add_cluster / cluster add controller has a pre-existing positional-arg shift around fabric (unrelated to this work). New kwarg added keyword-only so it isn't affected; bug should be fixed separately.
  • Operator-facing capacity reporting (sb cluster get-capacity) is fed from Prometheus and reflects what alceml itself reports; the per-device pre-charge added here is in the lvol-create provisioning gate. If we want it visible in cluster get-capacity too, that's a small extension to the stats collector.

🤖 Generated with Claude Code

Comment thread simplyblock_cli/cli.py Fixed
Comment thread simplyblock_cli/cli.py Fixed
Comment thread simplyblock_cli/cli.py Fixed
Comment thread simplyblock_core/utils/__init__.py Fixed
@schmidt-scaled schmidt-scaled force-pushed the inline-checksum-validation branch from f7823d3 to 9273148 Compare May 5, 2026 09:16
Comment thread scripts/setup_lab_perf_test1.py Fixed
Comment thread simplyblock_core/utils/__init__.py Fixed
Comment thread scripts/setup_lab_perf_test1.py Fixed
Comment thread scripts/setup_lab_perf_test1.py Fixed
for attempt in range(1, 16):
try:
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
Comment thread scripts/setup_lab_perf_test1.py Fixed
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
@schmidt-scaled schmidt-scaled force-pushed the inline-checksum-validation branch from 4e3d8a4 to ca9fa2f Compare May 5, 2026 12:06
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
Comment thread simplyblock_core/storage_node_ops.py Fixed
Comment thread simplyblock_core/storage_node_ops.py Fixed
for host in self.node_hosts.values():
try:
host.close()
except Exception:
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
@schmidt-scaled schmidt-scaled force-pushed the inline-checksum-validation branch from 9ba3cbc to 9bdcc9f Compare May 10, 2026 10:19
Comment thread simplyblock_core/utils/__init__.py Fixed
@schmidt-scaled schmidt-scaled force-pushed the inline-checksum-validation branch from 48245dc to 4748981 Compare May 17, 2026 16:46
Comment thread simplyblock_core/utils/__init__.py Fixed
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
Comment thread simplyblock_cli/cli.py Fixed
Comment thread simplyblock_core/utils/__init__.py Fixed
@schmidt-scaled schmidt-scaled force-pushed the inline-checksum-validation branch from 2a899b4 to ef60683 Compare May 26, 2026 21:30
Comment thread simplyblock_cli/cli.py Fixed
if cached is not None:
try:
cached.close()
except Exception:
schmidt-scaled and others added 2 commits May 27, 2026 20:33
Add --verify=md5 --do_verify=1 --verify_fatal=1 to the fio job so data
corruption is caught and fatal. Soften the "unrelated pair" category from
a hard TestRunError to a warning plus fallback to primary_tertiary then
primary_secondary, since a dense FT=2 ring with N <= 4 has no unrelated
pair and should still run the scenario.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pin BRANCH to inline-checksum-validation. Make ssh_exec accept a timeout
(default 600s) and use 1800s for the parallel docker pull so large image
pulls don't time out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@schmidt-scaled schmidt-scaled force-pushed the inline-checksum-validation branch from ef60683 to e2b9c7e Compare May 27, 2026 17:39
schmidt-scaled and others added 2 commits May 28, 2026 22:17
shutdown_storage_node used to refuse graceful shutdown when
`_check_ftt_allows_node_removal` reported insufficient FTT headroom
during rebalancing — emitting a stdout containing "rebalanc" and
forcing callers to wait until the rebalance drained. That conflation
of removal with shutdown was introduced in commit fbdffea
(2026-03-28) and is the wrong policy: shutdown is a transient state
the cluster is supposed to absorb under its FTT contract, while
removal permanently changes the storage budget. They are different
operations.

Remove the guard call from `shutdown_storage_node`. The function
`_check_ftt_allows_node_removal` itself stays — the web API layer
(v1 and v2 /storagenode/shutdown endpoints) still calls it directly
for their own non-force pre-check, which is where that policy
decision belongs.

Replaced `test_ftt_block_aborts` with `test_shutdown_does_not_consult
_ftt_removal_guard`, a regression guard that fails if any future
revision re-introduces the FTT-removal call inside the shutdown path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… gap

Two changes that close the remaining "soak waits for rebalancing"
gaps, plus add a guaranteed minimum overlap between the two outages
of each iteration.

Never wait for rebalance:
* `shutdown_with_migration_retry` -> `_graceful_shutdown` — single-shot
  sbctl call. The retry-on-{migration,rebalanc,active task} loop was
  needed to absorb the FTT-removal refusal added in fbdffea; with
  that refusal now removed from shutdown_storage_node (prior commit),
  the retry is dead and goes.
* `_forced_shutdown` — same cleanup. sbctl --force bypassed every
  CP-side shutdown guard, so the loop never actually fired; remove
  for clarity.
* `_network_outage` made non-blocking. Previously the function slept
  inline for the full `duration` before bringing NICs back up, which
  prevented any meaningful overlap with the paired outage — by the
  time `_apply_outage` returned for node 1, the NICs were already up
  and the CP was healing. NIC bring-up now runs on a daemon thread so
  the second outage can be applied while node 1 is still partitioned.

Random gap with guaranteed overlap:
* New args: --outage-gap-min (default 15), --outage-gap-max (default
  180), --min-outage-overlap (default 10). Legacy --shutdown-gap still
  overrides the random range when > 0 (with a warning if it violates
  the overlap invariant).
* `_expected_min_unavail_seconds(method)` floors the not-online window
  per method: network_outage_N -> N, container_kill -> 30,
  host_reboot -> 90, graceful/forced -> effectively unbounded (we
  control the manual restart). Conservative; real recoveries are
  longer.
* `_pick_outage_gap(method1)` draws uniformly from [min, max] then
  clamps to `expected_min_unavail - min_overlap` so the gap can never
  eat the entire recovery window of outage 1. For short methods this
  collapses the range (network_outage_20 with min_overlap=10 -> a
  deterministic 10s gap) — overlap wins over the configured range.

Tests: tests/test_soak_outage_gap.py — 17 cases covering the
per-method floor table, the cap math at edge values (zero overlap,
overlap-larger-than-unavail, legacy --shutdown-gap precedence), and a
distribution check that runs 200 draws per method and asserts the
invariant `gap + min_overlap <= expected_min_unavail` holds for all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
import os
import random
import sys
import types
schmidt-scaled and others added 3 commits May 28, 2026 23:21
Adds the sbcli control-plane support for the inline silent-data-error
protection feature whose data-plane work landed on ultra branch
checksum-validation. Frozen at cluster-create time; no upgrade path for
existing clusters. Per-device alceml mode (md-on-device vs fallback) is
auto-detected from the bound SPDK bdev's md_size at add-node and on
restart, so a cluster can have a heterogeneous mix of drives.

Cluster + per-device flags
- Cluster.inline_checksum (bool, default False): persisted only via
  create_cluster / add_cluster; no mutator.
- NVMeDevice.md_size (int) and NVMeDevice.md_supported (bool): set in
  addNvmeDevices from bdev_get_bdevs' top-level md_size field
  (spdk_bdev_get_md_size; emitted by lib/bdev/bdev_rpc.c). Refreshed on
  every restart-device path so a between-restart `nvme format` is
  reflected.

alceml RPC plumbing (matches checksum-validation branch)
- bdev_alceml_create gains optional checksum_method (1=md-on-device,
  2=fallback, default 0=off), cache_size, cache_eviction_threshold;
  zero-defaults are not sent so the data plane keeps its built-ins
  (cv_default_cache_size=2000, threshold 90%).
- utils.alceml_checksum_params(cluster, dev) picks 0/1/2 from the
  cluster flag plus md_supported. Data plane refuses method=1 on
  md_size==0, so the per-device decision is mandatory.
- Both alceml call sites (storage_node_ops._create_storage_device_stack
  and device_controller restart-device) thread the params through and
  warn when a device falls back.

CLI
- cluster create / cluster add / sn configure each gain
  --enable-inline-checksum.
- sn configure runs `find_md_lbaf_id` against `nvme id-ns` JSON to pick
  the smallest qualifying LBAF (ds=12, ms>=8) and forces a reformat
  past the existing 4K-already-formatted early-out (SectorSize stays
  4096 across an md/no-md transition).

Capacity accounting
- alceml_fallback_overhead_bytes(cluster, size): 6 lost data blocks
  per 2 MiB extent (510 -> 504, ~1.171875%) when the device runs in
  cv_fallback_method.
- lvol_controller charges that as initial provisioned utilization
  rather than reducing reported raw cluster_size_total, so the
  overhead surfaces through the existing prov_cap_warn / prov_cap_crit
  thresholds and md-on-device drives contribute zero.

Tests
- 27 new unit tests in tests/test_inline_checksum.py: model defaults,
  find_md_lbaf_id corner cases, alceml_checksum_params combos,
  fallback-overhead math (including 6/512 ratio sanity), RPC param
  wire-up for each method, and addNvmeDevices md detection from a
  mocked bdev_get_bdevs payload.

Behaviour notes
- Default-off cluster flag means no behaviour change for existing
  clusters; full unit-test suite (870 passing) stayed green during
  development on a separate worktree.
- Data plane reads md_size itself via spdk_bdev_get_md_size, so the
  control plane never has to pass it. Cv_md_method does not auto
  fall back: control plane must pick correctly per-device, hence the
  add-node-time detection plumbing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- cli-reference.yaml: add --enable-inline-checksum under storage-node configure,
  cluster create and cluster deploy. The flag was hand-edited into cli.py but
  missing from the YAML source-of-truth, so the generator kept dropping it.
- storage_node_ops.py: annotate attached as set[str] for mypy.
- test_subsystem_add_ns_idempotent.py: drop unused MagicMock import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gure

The lab perf bring-up now opts the cluster into inline CRC checksum
validation for silent-data-error protection. Two flags, paired:

* `cluster create --enable-inline-checksum`: sets the cluster-level
  flag at create time (frozen — no mutator, no upgrade path for
  existing clusters).
* `sn configure --enable-inline-checksum`: picks an LBAF with ds=12,
  ms>=8 (8B+ NVMe metadata per 4K block) and force-reformats through
  the existing 4K-already-set early-out, so alceml can run in
  md-on-device mode (checksum_method=1). Devices with no md-capable
  LBAF fall through to plain 4K + cv_fallback (checksum_method=2,
  ~1.17% capacity overhead) — auto-detected per-device at sn add-node
  and refreshed on every restart-device path.

Also preserves the existing branch pin (BRANCH=inline-checksum-
validation) and the corresponding pip install URL so the lab pulls
the matching control-plane and data-plane builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread simplyblock_cli/cli.py
argument = subcommand.add_argument('--size-range', help='NVMe SSD device size range separated by -, can be X(m,g,t) or bytes as integer, example: --size-range 50G-1T or --size-range 1232345-67823987, --device-model and --size-range must be set together.', type=str, default='', dest='size_range', required=False)
argument = subcommand.add_argument('--nvme-names', help='Comma separated list of nvme namespace names like nvme0n1,nvme1n1.', type=str, default='', dest='nvme_names', required=False)
argument = subcommand.add_argument('--force', help='Force format detected or passed nvme pci address to 4K and clean partitions.', dest='force', action='store_true')
argument = subcommand.add_argument('--enable-inline-checksum', help='When formatting (with --force), prefer an LBAF that supports >=8 bytes of NVMe metadata per block, so alceml can run inline checksum validation in md-on-device mode. Drives with no md-capable LBAF still format to plain 4K and will use the fallback layout.', dest='inline_checksum', action='store_true')
Comment thread simplyblock_cli/cli.py
if self.developer_mode:
argument = subcommand.add_argument('--disable-monitoring', help='Disable monitoring stack, false by default. Default: `false`.', dest='disable_monitoring', action='store_true')
argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster.', dest='strict_node_anti_affinity', action='store_true')
argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation. Per-device alceml mode (md-on-device vs fallback) is auto-detected at add-node.', dest='inline_checksum', action='store_true')
Comment thread simplyblock_cli/cli.py
if self.developer_mode:
argument = subcommand.add_argument('--inflight-io-threshold', help='The number of inflight IOs allowed before the IO queuing starts. Default: `4`.', type=int, default=4, dest='inflight_io_threshold')
argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster."', dest='strict_node_anti_affinity', action='store_true')
argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation.', dest='inline_checksum', action='store_true')
Comment thread simplyblock_core/utils/__init__.py Fixed
…soak

A sidecar monitor for the mixed-churn outage soak (or any of the other
aws_*_soak scripts) that:

1. Waits for the soak's PID to exit. Polls signal-0 across the
   poll-interval; works for non-descendant PIDs too so the monitor can
   be launched after the soak.
2. On exit, regardless of outcome (clean completion of the configured
   iteration budget OR a permanent failure that took the soak down),
   issues `StopInstances` for every id under
   `storage_nodes[*].instance_id` and `clients[*].instance_id` in the
   cluster metadata file.
3. Leaves `mgmt.instance_id` running by default so the user keeps SSH
   access for log retrieval. `--stop-mgmt` overrides if requested.

Best-effort policy: each instance is stopped independently, failures
are logged per-instance, and the monitor exits non-zero overall if
any stop failed — without short-circuiting the remaining ids.

Typical pairing:
    nohup python3 scripts/aws_dual_node_outage_soak_mixed_churn.py \
        --metadata cluster_metadata_mp.json … > soak.out 2>&1 &
    SOAK_PID=$!
    nohup python3 scripts/stop_lab_after_soak.py \
        --metadata cluster_metadata_mp.json \
        --soak-pid "$SOAK_PID" > stop_lab.out 2>&1 &

Smoke-tested against scripts/cluster_metadata_mp.json: inventory
parsing finds 6 storage nodes + 1 client + 1 mgmt; --dry-run output
confirms storage + client are selected and mgmt is kept running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
import errno
import json
import os
import signal
Restore the checksum-validation docker/ultra images that a prior rebase
onto origin/main clobbered back to the :main defaults. A local
merge=keepdev driver now keeps this branch's env_var across future
rebases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants