fix: retry engine_data CAS on InnoDB deadlocks instead of dropping writes#2786

Open
chubes4 wants to merge 1 commit into main from cas-deadlock-retry

@chubes4 chubes4 commented Jun 25, 2026

Member

Summary

Fixes #2785. InnoDB deadlocks (and lock-wait timeouts) during the engine_data compare-and-swap were returned as a generic db_error and treated as non-retryable, so the optimistic-concurrency loop bailed and silently dropped the write. This is exactly the case MySQL recommends retrying ("try restarting transaction").

Observed in production on events.extrachill.com — parallel scraper/Ticketmaster ingestion flows CAS the same engine_data rows at the same scheduled tick (~15:02–15:03 UTC) and step on each other's locks, dropping tool-run-state / step-progress writes.

Changes

  • Jobs::compare_and_swap_engine_data() — on $wpdb->update failure, classify $wpdb->last_error via a new is_retryable_db_error() helper (matches deadlock 1213 / lock-wait timeout 1205). Return a retryable flag and error: 'deadlock', and log at warning (not error) when transient.
  • EngineData::mutate() — treat retryable like a logical conflict: re-read the latest snapshot and retry within the existing max_attempts budget, with a small randomized backoff (5–25ms) to let the winning transaction commit. Genuinely fatal DB errors still fail fast.
  • RuntimeToolRunStateStore::mutate_engine_data() — mirror the same classification in the fallback CAS loop.
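
The classification step above might look something like the following. This is a minimal sketch, not the code in this PR: the description names is_retryable_db_error() but does not show its body, so the message-substring matching here is an assumption. It relies on the standard MySQL message text for errors 1213 and 1205, which is what a driver error string such as $wpdb->last_error would normally contain.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical classifier for transient lock errors.
 *
 * Assumption: the error string carries MySQL's standard message text
 * rather than the numeric code, so we match on message fragments.
 */
function is_retryable_db_error(string $last_error): bool
{
    if ($last_error === '') {
        return false;
    }

    // Message fragments for ER_LOCK_DEADLOCK (1213) and ER_LOCK_WAIT_TIMEOUT (1205).
    $transient_markers = array(
        'deadlock found when trying to get lock',
        'lock wait timeout exceeded',
    );

    $normalized = strtolower($last_error);
    foreach ($transient_markers as $marker) {
        if (strpos($normalized, $marker) !== false) {
            return true;
        }
    }

    return false;
}
```

Matching on message text is fragile if the server's error messages are localized; a stricter variant would read the numeric error code from the driver where the database layer exposes it.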

Behavior

  • A deadlock on the CAS write is now retried instead of dropping the write.
  • Non-retryable DB errors still fail fast, and retries stay bounded by max_attempts, so the loop cannot spin indefinitely.
  • Logging distinguishes transient contention (warning, with reason: deadlock|conflict) from fatal failure (error).
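
The retry behavior listed above can be sketched as a bounded loop. This is an illustration under stated assumptions, not the EngineData::mutate() implementation: mutate_with_retry(), the callable parameters, and the result keys (success, conflict, retryable, error, attempts) are hypothetical names chosen for the example. Only the max_attempts budget and the 5–25ms randomized backoff come from the description.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical bounded CAS retry loop.
 *
 * $read    returns array{version:int, data:array} (the latest snapshot).
 * $cas     takes (int $expected_version, array $new_data) and returns an
 *          array with 'success' plus optional 'conflict', 'retryable', 'error'.
 * $mutator maps the current data array to the new data array.
 * $sleep   receives microseconds; injectable so tests do not actually sleep.
 */
function mutate_with_retry(
    callable $read,
    callable $cas,
    callable $mutator,
    int $max_attempts = 5,
    ?callable $sleep = null
): array {
    $sleep = $sleep ?? static function (int $microseconds): void {
        usleep($microseconds);
    };
    $last = array('success' => false, 'error' => 'not_attempted');

    for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
        // Re-read on every attempt so the write is based on the latest snapshot.
        $snapshot = $read();
        $last     = $cas($snapshot['version'], $mutator($snapshot['data']));

        if (!empty($last['success'])) {
            return $last + array('attempts' => $attempt);
        }

        // Logical conflicts and transient lock errors share the retry path.
        $transient = !empty($last['conflict']) || !empty($last['retryable']);
        if (!$transient) {
            // Fatal DB error: fail fast without consuming the remaining budget.
            return $last + array('attempts' => $attempt);
        }

        if ($attempt < $max_attempts) {
            // Randomized 5-25 ms pause so the winning transaction can commit.
            $sleep(random_int(5000, 25000));
        }
    }

    return $last + array('attempts' => $max_attempts);
}
```

Passing the sleep function in as a parameter is a design choice made for this sketch only: it lets a test simulate a deadlock on the first CAS call and assert that the second attempt succeeds without waiting on a real timer.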

Verification

  • php -l clean on all three files.
  • phpcs clean (no warnings/errors) on all three files.

…ites

InnoDB deadlocks and lock-wait timeouts were returned as a generic
'db_error' and treated as non-retryable, so the optimistic-concurrency
loop bailed and silently dropped the engine_data write. Concurrent
events ingestion flows (parallel scraper/Ticketmaster jobs CAS-ing the
same rows) hit this daily.

Classify deadlock (1213) / lock-wait timeout (1205) as a transient,
retryable condition and re-read the latest snapshot with a small
randomized backoff, mirroring the existing logical-conflict retry path.
Genuinely fatal DB errors still fail fast.

Closes #2785

homeboy-ci Bot commented Jun 25, 2026

Contributor

Homeboy Results — data-machine

Lint

lint — passed

ℹ️ Full options: homeboy docs commands/lint
Deep dive: homeboy lint data-machine --changed-since 8a413e6

Artifacts and drill-down
  • CI results artifact: homeboy-ci-results-data-machine-lint-quality-Linux-node24 contains immediate command JSON for this action invocation.
  • Observation artifact: homeboy-observations-data-machine-lint-quality-Linux-node24 contains exported Homeboy run history for deeper queries.
  • Drill-down: download the observation artifact, then run homeboy runs import <dir>, homeboy runs list, and homeboy runs findings <run-id>.
  • Artifacts are attached to the workflow run: https://github.com/Extra-Chill/data-machine/actions/runs/28140624728

Test

test — failed

ℹ️ No tests ran — the runner failed before producing results. See raw_output.stderr_tail / raw_output.stdout_tail for the underlying error (bootstrap failure, missing deps, DB connection, etc.).
ℹ️ To run specific tests: homeboy test data-machine -- --filter=TestName
ℹ️ Auto-fix lint issues: homeboy refactor data-machine --from lint --write
ℹ️ Collect coverage: homeboy test data-machine --coverage
ℹ️ Analyze failures: homeboy test data-machine --analyze
ℹ️ Pass args to test runner: homeboy test -- [args]
ℹ️ Full options: homeboy docs commands/test
Deep dive: homeboy test data-machine --changed-since 8a413e6

Artifacts and drill-down
  • CI results artifact: homeboy-ci-results-data-machine-test-quality-Linux-node24 contains immediate command JSON for this action invocation.
  • Observation artifact: homeboy-observations-data-machine-test-quality-Linux-node24 contains exported Homeboy run history for deeper queries.
  • Drill-down: download the observation artifact, then run homeboy runs import <dir>, homeboy runs list, and homeboy runs findings <run-id>.
  • Artifacts are attached to the workflow run: https://github.com/Extra-Chill/data-machine/actions/runs/28140624728

Audit

audit — passed

  • audit — 131 finding(s)
  • Total: 131 finding(s)

Deep dive: homeboy audit data-machine --changed-since 8a413e6

Artifacts and drill-down
  • CI results artifact: homeboy-ci-results-data-machine-audit-quality-Linux-node24 contains immediate command JSON for this action invocation.
  • Observation artifact: homeboy-observations-data-machine-audit-quality-Linux-node24 contains exported Homeboy run history for deeper queries.
  • Drill-down: download the observation artifact, then run homeboy runs import <dir>, homeboy runs list, and homeboy runs findings <run-id>.
  • Artifacts are attached to the workflow run: https://github.com/Extra-Chill/data-machine/actions/runs/28140624728
Tooling versions
  • Homeboy CLI: homeboy 0.260.0+ba82bac50654+18263261
  • Extension: wordpress from https://github.com/Extra-Chill/homeboy-extensions
  • Extension revision: 40d1495f
  • Action: unknown@unknown

Development

Successfully merging this pull request may close these issues.

EngineData CAS treats InnoDB deadlocks as fatal, dropping concurrent engine_data writes