Fix deadlock in beginJob when compiled debug#269
Conversation
The deadlock was caused by two issues: 1. Lock ordering: beginJob() held m_ while creating XThreadTimer (which acquires gSystemMutex internally), while the app thread could hold gSystemMutex (inside ProcessEvents) and try to acquire m_ via a timer callback - classic ABBA deadlock. 2. Bare cv_.wait() without predicate: susceptible to missed notifications if the signal was sent before the wait began. Fix: - Add appStarted_ and eveSetupDone_ boolean state flags - Use predicate-based cv_.wait() to handle missed notifications - Don't hold m_ during XThreadTimer creation to avoid lock ordering issue
brownd1978
left a comment
There was a problem hiding this comment.
This PR does not change the behavior: Mu2eEventDisplay complied debug still hangs in beginJob at line 273.
|
thanks Dave, I will try to investigate. I've never seen an issue, but I've also never compiled in debug mode. The multi-threading aspect of the display does cause some difficulties, and I've tried debugging with AI in recent weeks, with little insight appart from confirmation of what I already knew. This specific issue may be fixable though, I will take a look |
|
Hi Sophie, thanks for taking a look. There are other problems associated with the locks: currently when I run the display from LBL it segvs when I hit 'next event', also traced back to a 'wait' statement. I was also surprised to see 14 threads active at the point of this crash. |
Debug-compiled EventDisplay deadlocks in
beginJob()atcv_.wait(lock). GDB shows only 2 threads with the app thread apparently never reaching its event loop.Root cause: ABBA lock ordering between
m_(held bybeginJob) andgSystemMutex(acquired insideXThreadTimerconstructor viaR__LOCKGUARD2). The app thread holdsgSystemMutexingSystem->ProcessEvents()while timer callbacks attempt to acquirem_. Debug timing makes this race deterministic. Additionally, barecv_.wait()without a predicate loses notifications if the signal arrives before the wait begins.Changes:
beginJob()to only holdm_duringcv_.wait(), not duringXThreadTimerconstruction — eliminates lock ordering cycleappStarted_andeveSetupDone_predicate flags for the two synchronization pointscv_.wait(lock, predicate)— handles both missed notifications and spurious wakeupssignalAppStart()andsetup_eve()set their respective flags under lock before notifying