…ites
InnoDB deadlocks and lock-wait timeouts were returned as a generic
'db_error' and treated as non-retryable, so the optimistic-concurrency
loop bailed and silently dropped the engine_data write. Concurrent
events ingestion flows (parallel scraper/Ticketmaster jobs CAS-ing the
same rows) hit this daily.
Classify deadlock (1213) / lock-wait timeout (1205) as a transient,
retryable condition and re-read the latest snapshot with a small
randomized backoff, mirroring the existing logical-conflict retry path.
Genuinely fatal DB errors still fail fast.
Closes #2785
Summary
Fixes #2785. InnoDB deadlocks (and lock-wait timeouts) during the engine_data compare-and-swap were returned as a generic `db_error` and treated as non-retryable, so the optimistic-concurrency loop bailed and silently dropped the write. This is exactly the case MySQL recommends retrying ("try restarting transaction").

Observed in production on events.extrachill.com: parallel scraper/Ticketmaster ingestion flows CAS the same engine_data rows at the same scheduled tick (~15:02–15:03 UTC) and step on each other's locks, dropping tool-run-state / step-progress writes.
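A minimal sketch of how transient lock contention might be told apart from fatal errors, assuming the classifier matches on the message text MySQL emits for errors 1213 and 1205. The function name `is_retryable_db_error_sketch()` is hypothetical and only illustrates the idea; it is not the helper added in this PR.

```php
<?php
/**
 * Hypothetical sketch: report whether a MySQL error message describes
 * transient lock contention (deadlock or lock-wait timeout).
 *
 * @param string $last_error Error text, e.g. the value of $wpdb->last_error.
 * @return bool True when the error is worth retrying.
 */
function is_retryable_db_error_sketch( string $last_error ): bool {
    if ( '' === $last_error ) {
        return false;
    }

    // Message fragments MySQL emits for error 1213 (deadlock)
    // and error 1205 (lock-wait timeout).
    $transient_markers = array(
        'Deadlock found when trying to get lock',
        'Lock wait timeout exceeded',
    );

    foreach ( $transient_markers as $marker ) {
        // Case-insensitive substring match; anything else is treated as fatal.
        if ( false !== stripos( $last_error, $marker ) ) {
            return true;
        }
    }

    return false;
}
```

Matching on message text rather than a numeric code is an assumption made here because the description classifies `$wpdb->last_error`, which holds the error string.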
Changes
- `Jobs::compare_and_swap_engine_data()`: on `$wpdb->update` failure, classify `$wpdb->last_error` via a new `is_retryable_db_error()` helper (matches deadlock 1213 / lock-wait timeout 1205). Return a `retryable` flag and `error: 'deadlock'`, and log at `warning` (not `error`) when transient.
- `EngineData::mutate()`: treat `retryable` like a logical `conflict` by re-reading the latest snapshot and retrying within the existing `max_attempts` budget, with a small randomized backoff (5–25ms) to let the winning transaction commit. Genuinely fatal DB errors still fail fast.
- `RuntimeToolRunStateStore::mutate_engine_data()`: mirror the same classification in the fallback CAS loop.

Behavior
- … `max_attempts`).
- … `warning`, with `reason: deadlock|conflict`) from fatal failure (`error`).

Verification
- `php -l` clean on all three files.
- `phpcs` clean (no warnings/errors) on all three files.
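The retry behavior listed under Changes can be sketched as a standalone loop. This is a simplified illustration under stated assumptions, not the code in `EngineData::mutate()`: the function `mutate_with_retry_sketch()`, its `$read` / `$apply` / `$cas` callables, and the `success` and `attempts` result keys are hypothetical. Only the `retryable` / `conflict` distinction, the `max_attempts` budget, and the 5–25ms randomized backoff come from the description above.

```php
<?php
/**
 * Hypothetical retry loop around a compare-and-swap write.
 *
 * @param callable $read         Returns the latest snapshot as an array.
 * @param callable $apply        Receives the snapshot, returns the mutated array.
 * @param callable $cas          Receives (expected, new); returns an array with
 *                               'success' and optionally 'conflict' / 'retryable'.
 * @param int      $max_attempts Total attempt budget.
 * @return array Result of the last CAS attempt, plus an 'attempts' count.
 */
function mutate_with_retry_sketch( callable $read, callable $apply, callable $cas, int $max_attempts = 3 ): array {
    $result = array( 'success' => false, 'error' => 'no_attempts' );

    for ( $attempt = 1; $attempt <= $max_attempts; $attempt++ ) {
        // Always re-read so each attempt works from the latest snapshot.
        $snapshot           = $read();
        $result             = $cas( $snapshot, $apply( $snapshot ) );
        $result['attempts'] = $attempt;

        if ( ! empty( $result['success'] ) ) {
            return $result;
        }

        // Logical conflicts and transient DB contention share one retry path.
        $transient = ! empty( $result['conflict'] ) || ! empty( $result['retryable'] );
        if ( ! $transient ) {
            return $result; // Genuinely fatal error: fail fast.
        }

        if ( $attempt < $max_attempts ) {
            // Randomized 5-25ms pause so the winning transaction can commit.
            usleep( random_int( 5000, 25000 ) );
        }
    }

    return $result;
}
```

The jitter keeps two contending workers from retrying in lockstep, and skipping the sleep after the final attempt avoids a pointless delay before giving up.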