
fix(defaults): async state-commit for full-mode followers (PLT-537)#34

Merged
bdchatham merged 1 commit into main from fix/plt-537-full-async-commit on Jun 15, 2026

Conversation

@bdchatham

Collaborator

Problem

Full-mode followers (node/rpc/syncer) default to synchronous memIAVL commit (sc-async-commit-buffer=0), which holds the consensus lock cs.mtx across the entire block write. That starves the single-goroutine consensus StateChannel drain → dropped NewRoundStep/HasVote gossip → degraded peer-state view → bistable fall-behind (recurring NodeFellBehind on pacific-1). Root cause: PLT-537.

Change

Default full mode to AsyncCommitBuffer = 100 in applyFullOverrides (matches the already-async archive mode). Flows to full + archive (archive inherits applyFullOverrides); validators and seeds stay synchronous by construction (separate override paths) — async commit's in-memory crash window is unacceptable for a signing node.

Adds TestDefaultForMode_AsyncCommitBufferByMode pinning full=100, archive=100, validator=0, seed=0 so the fix and the validator-safety constraint can't silently regress.
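
The override layering and the per-mode values pinned by the test can be sketched roughly as below. This is a minimal illustration, not the repository's code: the `Mode` type, the struct shapes, `applyArchiveOverrides`, and `defaultForMode` are hypothetical stand-ins; only `applyFullOverrides`, `AsyncCommitBuffer`, and the values full=100, archive=100, validator=0, seed=0 come from this PR.

```go
package main

import "fmt"

// Mode is a hypothetical stand-in for the library's node-mode type.
type Mode string

const (
	ModeValidator Mode = "validator"
	ModeSeed      Mode = "seed"
	ModeFull      Mode = "full"
	ModeArchive   Mode = "archive"
)

// StateCommitConfig models only the field this PR touches.
type StateCommitConfig struct {
	AsyncCommitBuffer int // 0 = synchronous commit; >0 = async commit buffer
}

// StorageConfig and Config are assumed shapes for illustration.
type StorageConfig struct{ StateCommit StateCommitConfig }
type Config struct{ Storage StorageConfig }

// applyFullOverrides sets follower defaults. Archive reuses it, so the
// async default reaches both modes.
func applyFullOverrides(c *Config) {
	c.Storage.StateCommit.AsyncCommitBuffer = 100
}

// applyArchiveOverrides (hypothetical name) layers archive settings on
// top of the full-mode overrides.
func applyArchiveOverrides(c *Config) {
	applyFullOverrides(c)
	c.Storage.StateCommit.AsyncCommitBuffer = 100 // archive's existing override
}

// defaultForMode is an assumed entry point that picks an override path.
func defaultForMode(m Mode) Config {
	var c Config // zero value: synchronous commit
	switch m {
	case ModeFull:
		applyFullOverrides(&c)
	case ModeArchive:
		applyArchiveOverrides(&c)
	}
	// Validator and seed fall through: their paths never touch the buffer.
	return c
}

func main() {
	want := map[Mode]int{ModeFull: 100, ModeArchive: 100, ModeValidator: 0, ModeSeed: 0}
	for mode, buf := range want {
		got := defaultForMode(mode).Storage.StateCommit.AsyncCommitBuffer
		fmt.Printf("%s: got=%d want=%d ok=%t\n", mode, got, buf, got == buf)
	}
}
```

Pinning all four modes in one table means a change to a shared override path shows up as a failure in whichever mode it leaks into, not only in the mode being edited.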

Validation

  • Canary on pacific-1 node-0: 23.7k blocks behind → tip, 0 StateChannel drops, block_processing 45→29 ms, held 18 min.
  • Full pacific-1 K8s follower fleet (5 nodes) recovered to tip; unchanged sync-commit nodes stayed trapped (syncer-0-0 relapsed → restart-alone is not durable, the config is).
  • EC2 state-sync-node-0: 32.5k → tip, block_processing 194→11 ms, drops → 0 (cross-topology confirmation).

Rollout

This is a library default — it reaches prod once consumed downstream: sei-config tag → sei-node-controller go.mod bump → seictl sidecar rebuild → release → rollout (pod recreation re-renders app.toml). Manual sc-async-commit-buffer=100 patches are holding the live fleet in the meantime.

Risk

Async commit buffers un-flushed commits in memory; on an ungraceful crash the tail is re-fetched via WAL/blocksync (no corruption). Appropriate for followers/RPC/archive; explicitly not validators (preserved here).
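
The in-memory window described above can be pictured with a toy model. This sketch is not memIAVL's implementation: `asyncCommitter` and its methods are hypothetical, and a buffered channel stands in for the commit buffer. Versions sitting in the queue are accepted but not yet durable, which is the tail a follower would re-fetch after an ungraceful crash.

```go
package main

import (
	"fmt"
	"sync"
)

// asyncCommitter is a toy model of an async commit buffer.
type asyncCommitter struct {
	queue   chan int64 // versions accepted but not yet durable
	wg      sync.WaitGroup
	mu      sync.Mutex
	flushed []int64 // versions that reached "disk"
}

// newAsyncCommitter starts one background flusher with the given buffer size.
func newAsyncCommitter(buffer int) *asyncCommitter {
	c := &asyncCommitter{queue: make(chan int64, buffer)}
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		for v := range c.queue {
			c.mu.Lock()
			c.flushed = append(c.flushed, v) // stands in for the disk write
			c.mu.Unlock()
		}
	}()
	return c
}

// Commit returns as soon as the version is queued, so the caller does not
// wait for the write. It blocks only when the buffer is full.
func (c *asyncCommitter) Commit(version int64) { c.queue <- version }

// Close models a graceful shutdown: drain the queue, then stop the flusher.
func (c *asyncCommitter) Close() []int64 {
	close(c.queue)
	c.wg.Wait()
	return c.flushed
}

func main() {
	c := newAsyncCommitter(100)
	for v := int64(1); v <= 5; v++ {
		c.Commit(v)
	}
	fmt.Println(c.Close()) // prints [1 2 3 4 5]
}
```

A graceful `Close` flushes everything; a process killed before the flusher catches up loses only the queued versions. That is tolerable for a follower that can replay them and is the reason the PR keeps signing nodes synchronous.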

Refs: PLT-537

🤖 Generated with Claude Code

@cursor

cursor Bot commented Jun 15, 2026


PR Summary

Medium Risk
Changes default state-commit behavior for all new full/archive deployments; async commit trades a small in-memory crash window for lower consensus lock hold time, which is intentional for non-signing nodes only.

Overview
Full-mode defaults now set Storage.StateCommit.AsyncCommitBuffer = 100 in applyFullOverrides, so follower/RPC-style nodes use async memIAVL commit instead of synchronous (0). Archive still ends at 100 via applyFullOverrides plus its existing override; validator and seed remain 0 (no change on their override paths).

Adds TestDefaultForMode_AsyncCommitBufferByMode to lock full=100, archive=100, validator=0, seed=0 so the follower default and signing-node constraint do not regress silently.

Reviewed by Cursor Bugbot for commit 7e5d431.

Comment thread defaults.go Outdated
Comment thread config_test.go Outdated
…consensus-gossip starvation (PLT-537)

Full-mode followers (node/rpc/syncer) defaulted to synchronous memIAVL
commit (sc-async-commit-buffer=0), which holds cs.mtx across the whole
block write and starves the single-goroutine consensus StateChannel
drain -> dropped NewRoundStep/HasVote gossip -> bistable fall-behind
(PLT-537).

Default full mode to AsyncCommitBuffer=100 (matches the already-async
archive). Validators and seeds stay synchronous by construction (separate
override paths) -- async commit's in-memory crash window is unacceptable
for a signing node.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham bdchatham force-pushed the fix/plt-537-full-async-commit branch from 65df4f2 to 7e5d431 Compare June 15, 2026 19:48
@bdchatham bdchatham merged commit e239fd6 into main Jun 15, 2026
3 checks passed
@bdchatham bdchatham mentioned this pull request Jun 15, 2026
bdchatham added a commit that referenced this pull request Jun 15, 2026
Cuts v0.0.20 at current main. Notable since v0.0.19:
- fix(defaults): async state-commit for full-mode followers (PLT-537, #34)
- docs burn-down: docs/ evacuated to bdchatham-designs (PLT-497, #33)

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>