fix(defaults): async state-commit for full-mode followers (PLT-537) #34
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit 7e5d431.
…consensus-gossip starvation (PLT-537)

Full-mode followers (node/rpc/syncer) defaulted to synchronous memIAVL commit (sc-async-commit-buffer=0), which holds cs.mtx across the whole block write and starves the single-goroutine consensus StateChannel drain -> dropped NewRoundStep/HasVote gossip -> bistable fall-behind (PLT-537).

Default full mode to AsyncCommitBuffer=100 (matches the already-async archive). Validators and seeds stay synchronous by construction (separate override paths) -- async commit's in-memory crash window is unacceptable for a signing node.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
Full-mode followers (`node`/`rpc`/`syncer`) default to synchronous memIAVL commit (`sc-async-commit-buffer=0`), which holds the consensus lock `cs.mtx` across the entire block write. That starves the single-goroutine consensus StateChannel drain → dropped `NewRoundStep`/`HasVote` gossip → degraded peer-state view → bistable fall-behind (recurring `NodeFellBehind` on pacific-1). Root cause: PLT-537.
Change
Default `full` mode to `AsyncCommitBuffer = 100` in `applyFullOverrides` (matches the already-async `archive` mode). This flows to full + archive (archive inherits `applyFullOverrides`); validators and seeds stay synchronous by construction (separate override paths), because async commit's in-memory crash window is unacceptable for a signing node.
Adds `TestDefaultForMode_AsyncCommitBufferByMode`, pinning `full=100, archive=100, validator=0, seed=0` so the fix and the validator-safety constraint can't silently regress.
Validation
- `node-0`: 23.7k blocks behind → tip, 0 StateChannel drops, `block_processing` 45→29 ms, held 18 min (`syncer-0-0` relapsed → restart alone is not durable, the config is).
- `state-sync-node-0`: 32.5k → tip, `block_processing` 194→11 ms, drops → 0 (cross-topology confirmation).
Rollout
This is a library default; it reaches prod only once consumed downstream: sei-config tag → `sei-node-controller` go.mod bump → seictl sidecar rebuild → release → rollout (pod recreation re-renders `app.toml`). Manual `sc-async-commit-buffer=100` patches are holding the live fleet in the meantime.
Risk
Async commit buffers un-flushed commits in memory; on an ungraceful crash the tail is re-fetched via WAL/blocksync (no corruption). Appropriate for followers/RPC/archive; explicitly not validators (preserved here).
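The per-mode invariant this PR pins (async for full and archive, synchronous for validator and seed) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the sei-config code: `applyFullOverrides` and the `DefaultForMode` name come from the PR text, but the `Mode` and `Config` types, `applyArchiveOverrides`, the signatures, and the way validator and seed fall through to the base default are all assumptions made for the example (the PR only says they use separate override paths).

```go
package main

import "fmt"

// Mode is a hypothetical stand-in for the node modes named in the PR.
type Mode string

const (
	ModeValidator Mode = "validator"
	ModeSeed      Mode = "seed"
	ModeFull      Mode = "full"
	ModeArchive   Mode = "archive"
)

// Config carries only the single field discussed here (assumed shape).
type Config struct {
	AsyncCommitBuffer int // 0 means synchronous commit
}

// applyFullOverrides sets the follower default described in the PR.
func applyFullOverrides(c *Config) {
	c.AsyncCommitBuffer = 100
}

// applyArchiveOverrides is hypothetical: archive inherits the full-mode
// overrides first; archive-specific settings would follow.
func applyArchiveOverrides(c *Config) {
	applyFullOverrides(c)
}

// defaultForMode builds the default config for a mode. In this sketch,
// validator and seed never pass through applyFullOverrides, so they keep
// the synchronous base value.
func defaultForMode(m Mode) Config {
	c := Config{AsyncCommitBuffer: 0}
	switch m {
	case ModeFull:
		applyFullOverrides(&c)
	case ModeArchive:
		applyArchiveOverrides(&c)
	}
	return c
}

func main() {
	// Expected values are the ones the PR says the new test pins.
	want := map[Mode]int{
		ModeFull:      100,
		ModeArchive:   100,
		ModeValidator: 0,
		ModeSeed:      0,
	}
	for _, m := range []Mode{ModeFull, ModeArchive, ModeValidator, ModeSeed} {
		got := defaultForMode(m).AsyncCommitBuffer
		fmt.Printf("%-9s got=%d want=%d ok=%t\n", m, got, want[m], got == want[m])
	}
}
```

Keeping the async default inside the full-mode override, rather than in the shared base config, is what makes the validator and seed values safe by construction in this sketch: a later change to the follower default cannot reach a mode that never calls that function.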
Refs: PLT-537
🤖 Generated with Claude Code