fix startup deadlock when raftCommitIndex exceeds journal after crash#197
Open
gmelikov wants to merge 1 commit into
Open
fix startup deadlock when raftCommitIndex exceeds journal after crash#197gmelikov wants to merge 1 commit into
gmelikov wants to merge 1 commit into
Conversation
On an unclean shutdown (power loss, OOM kill) the node can start with raftCommitIndex from .journal.meta pointing past the last entry in the journal file. This happens because journal entries are written to an mmap (page cache, no msync per write) while the meta is saved via f.flush() + atomic rename once per second — two paths the OS can flush to disk in any order. Concretely: after log compaction leaves the journal at [prev, N], new entries N+1..N+k are appended to the mmap. The meta timer fires and persists raftCommitIndex=N+k. The node crashes. The post-compaction pages had already been written back by the OS; the newer dirty pages had not. On recovery: meta=N+k, journal ends at N. Without the fix the node deadlocks: raftCommitIndex > getCurrentLogIndex() so applyLogEntries finds nothing to apply, and the leader commit loop condition `while commitIdx < getCurrentLogIndex()` is immediately false, so commitIndex can never advance either. Fix: in __loadDumpFile, clamp raftCommitIndex down to getCurrentLogIndex() if it overshoots. The lost entries are unrecoverable; rolling the pointer back is the only safe option. The corrected value is logged as a warning and re-persisted to .journal.meta on the next onOneSecondTimer tick.
Contributor
Author
|
Sidenote: AFAIK mmap has some issues with straightforward data flush on writes, looks like now we may easily miss some writes. If I don't miss anything and my hypothesis is right, it's worth to:
What do you think? I can try to implement one of them in a different PR. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
tl;dr: I use PySyncObj+Patroni+Soft kernel watchdog, and test it on local VMs on my laptop. Nearly each laptop sleep finishes with stuck Raft due to watchdog fire->force reset->inconsistent raft info on disk. I have all raft files with reproduced problem, if you want to look at them: raft.tar.gz. This PR fixes it for me, now instead of infinite loop I see new warning, rollback of __raftCommitIndex and successfull start of raft.
On an unclean shutdown (power loss, OOM kill) the node can start with raftCommitIndex from .journal.meta pointing past the last entry in the journal file. This happens because journal entries are written to an mmap (page cache, no msync per write) while the meta is saved via f.flush() + atomic rename once per second — two paths the OS can flush to disk in any order.
Concretely: after log compaction leaves the journal at [prev, N], new entries N+1..N+k are appended to the mmap. The meta timer fires and persists raftCommitIndex=N+k. The node crashes. The post-compaction pages had already been written back by the OS; the newer dirty pages had not. On recovery: meta=N+k, journal ends at N.
Without the fix the node deadlocks: raftCommitIndex > getCurrentLogIndex() so applyLogEntries finds nothing to apply, and the leader commit loop condition
while commitIdx < getCurrentLogIndex()is immediately false, so commitIndex can never advance either.Fix: in __loadDumpFile, clamp raftCommitIndex down to getCurrentLogIndex() if it overshoots. The lost entries are unrecoverable; rolling the pointer back is the only safe option. The corrected value is logged as a warning and re-persisted to .journal.meta on the next onOneSecondTimer tick.