Skip to content

fix startup deadlock when raftCommitIndex exceeds journal after crash#197

Open
gmelikov wants to merge 1 commit into
bakwc:masterfrom
gmelikov:reset_fix
Open

fix startup deadlock when raftCommitIndex exceeds journal after crash#197
gmelikov wants to merge 1 commit into
bakwc:masterfrom
gmelikov:reset_fix

Conversation

@gmelikov

@gmelikov gmelikov commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

tl;dr: I use PySyncObj+Patroni+Soft kernel watchdog, and test it on local VMs on my laptop. Nearly each laptop sleep finishes with stuck Raft due to watchdog fire->force reset->inconsistent raft info on disk. I have all raft files with reproduced problem, if you want to look at them: raft.tar.gz. This PR fixes it for me, now instead of infinite loop I see new warning, rollback of __raftCommitIndex and successfull start of raft.


On an unclean shutdown (power loss, OOM kill) the node can start with raftCommitIndex from .journal.meta pointing past the last entry in the journal file. This happens because journal entries are written to an mmap (page cache, no msync per write) while the meta is saved via f.flush() + atomic rename once per second — two paths the OS can flush to disk in any order.

Concretely: after log compaction leaves the journal at [prev, N], new entries N+1..N+k are appended to the mmap. The meta timer fires and persists raftCommitIndex=N+k. The node crashes. The post-compaction pages had already been written back by the OS; the newer dirty pages had not. On recovery: meta=N+k, journal ends at N.

Without the fix the node deadlocks: raftCommitIndex > getCurrentLogIndex() so applyLogEntries finds nothing to apply, and the leader commit loop condition while commitIdx < getCurrentLogIndex() is immediately false, so commitIndex can never advance either.

Fix: in __loadDumpFile, clamp raftCommitIndex down to getCurrentLogIndex() if it overshoots. The lost entries are unrecoverable; rolling the pointer back is the only safe option. The corrected value is logged as a warning and re-persisted to .journal.meta on the next onOneSecondTimer tick.

On an unclean shutdown (power loss, OOM kill) the node can start with
raftCommitIndex from .journal.meta pointing past the last entry in the
journal file. This happens because journal entries are written to an
mmap (page cache, no msync per write) while the meta is saved via
f.flush() + atomic rename once per second — two paths the OS can flush
to disk in any order.

Concretely: after log compaction leaves the journal at [prev, N], new
entries N+1..N+k are appended to the mmap. The meta timer fires and
persists raftCommitIndex=N+k. The node crashes. The post-compaction
pages had already been written back by the OS; the newer dirty pages
had not. On recovery: meta=N+k, journal ends at N.

Without the fix the node deadlocks: raftCommitIndex > getCurrentLogIndex()
so applyLogEntries finds nothing to apply, and the leader commit loop
condition `while commitIdx < getCurrentLogIndex()` is immediately false,
so commitIndex can never advance either.

Fix: in __loadDumpFile, clamp raftCommitIndex down to getCurrentLogIndex()
if it overshoots. The lost entries are unrecoverable; rolling the pointer
back is the only safe option. The corrected value is logged as a warning
and re-persisted to .journal.meta on the next onOneSecondTimer tick.
@gmelikov

Copy link
Copy Markdown
Contributor Author

Sidenote: AFAIK mmap has some issues with straightforward data flush on writes, looks like now we may easily miss some writes. If I don't miss anything and my hypothesis is right, it's worth to:

  • flush() ResizableFile on every write (expensive) or at least with self.__metaStorer.storeMeta(self.__meta) in FileJournal
  • maybe even rework ResizableFile from mmap to usual write()/pwrite() (do we support Windows?) with flush, I think whole point of journal is to store data as fast as OS/disk can. Even if we don't fsync data, at least without mmap both files' data will be in same pagecache/ditrycache.

What do you think? I can try to implement one of them in a different PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant