fix startup deadlock when raftCommitIndex exceeds journal after crash by gmelikov · Pull Request #197 · bakwc/PySyncObj

gmelikov · 2026-06-13T17:15:26Z

tl;dr: I use PySyncObj+Patroni+Soft kernel watchdog, and test it on local VMs on my laptop. Nearly each laptop sleep finishes with stuck Raft due to watchdog fire->force reset->inconsistent raft info on disk. I have all raft files with reproduced problem, if you want to look at them: raft.tar.gz. This PR fixes it for me, now instead of infinite loop I see new warning, rollback of __raftCommitIndex and successfull start of raft.

On an unclean shutdown (power loss, OOM kill) the node can start with raftCommitIndex from .journal.meta pointing past the last entry in the journal file. This happens because journal entries are written to an mmap (page cache, no msync per write) while the meta is saved via f.flush() + atomic rename once per second — two paths the OS can flush to disk in any order.

Concretely: after log compaction leaves the journal at [prev, N], new entries N+1..N+k are appended to the mmap. The meta timer fires and persists raftCommitIndex=N+k. The node crashes. The post-compaction pages had already been written back by the OS; the newer dirty pages had not. On recovery: meta=N+k, journal ends at N.

Without the fix the node deadlocks: raftCommitIndex > getCurrentLogIndex() so applyLogEntries finds nothing to apply, and the leader commit loop condition while commitIdx < getCurrentLogIndex() is immediately false, so commitIndex can never advance either.

Fix: in __loadDumpFile, clamp raftCommitIndex down to getCurrentLogIndex() if it overshoots. The lost entries are unrecoverable; rolling the pointer back is the only safe option. The corrected value is logged as a warning and re-persisted to .journal.meta on the next onOneSecondTimer tick.

On an unclean shutdown (power loss, OOM kill) the node can start with raftCommitIndex from .journal.meta pointing past the last entry in the journal file. This happens because journal entries are written to an mmap (page cache, no msync per write) while the meta is saved via f.flush() + atomic rename once per second — two paths the OS can flush to disk in any order. Concretely: after log compaction leaves the journal at [prev, N], new entries N+1..N+k are appended to the mmap. The meta timer fires and persists raftCommitIndex=N+k. The node crashes. The post-compaction pages had already been written back by the OS; the newer dirty pages had not. On recovery: meta=N+k, journal ends at N. Without the fix the node deadlocks: raftCommitIndex > getCurrentLogIndex() so applyLogEntries finds nothing to apply, and the leader commit loop condition `while commitIdx < getCurrentLogIndex()` is immediately false, so commitIndex can never advance either. Fix: in __loadDumpFile, clamp raftCommitIndex down to getCurrentLogIndex() if it overshoots. The lost entries are unrecoverable; rolling the pointer back is the only safe option. The corrected value is logged as a warning and re-persisted to .journal.meta on the next onOneSecondTimer tick.

gmelikov · 2026-06-13T19:19:02Z

Sidenote: AFAIK mmap has some issues with straightforward data flush on writes, looks like now we may easily miss some writes. If I don't miss anything and my hypothesis is right, it's worth to:

flush() ResizableFile on every write (expensive) or at least with self.__metaStorer.storeMeta(self.__meta) in FileJournal
maybe even rework ResizableFile from mmap to usual write()/pwrite() (do we support Windows?) with flush, I think whole point of journal is to store data as fast as OS/disk can. Even if we don't fsync data, at least without mmap both files' data will be in same pagecache/ditrycache.

What do you think? I can try to implement one of them in a different PR.

gmelikov mentioned this pull request Jun 13, 2026

Benchmarks: support py3 #198

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix startup deadlock when raftCommitIndex exceeds journal after crash#197

fix startup deadlock when raftCommitIndex exceeds journal after crash#197
gmelikov wants to merge 1 commit into
bakwc:masterfrom
gmelikov:reset_fix

gmelikov commented Jun 13, 2026 •

edited

Loading

Uh oh!

gmelikov commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gmelikov commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gmelikov commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gmelikov commented Jun 13, 2026 •

edited

Loading