Skip to content

Run Nx integration tests sequentially to prevent shared-sandbox race condition#380

Merged
rhoerr merged 1 commit into
mage-os:mainfrom
ddevallan:fix/nx-integration-tests-sequential-execution
May 27, 2026
Merged

Run Nx integration tests sequentially to prevent shared-sandbox race condition#380
rhoerr merged 1 commit into
mage-os:mainfrom
ddevallan:fix/nx-integration-tests-sequential-execution

Conversation

@ddevallan

Copy link
Copy Markdown
Contributor

Problem

When Nx runs multiple affected projects in parallel (default --parallel=3), each project's PHPUnit bootstrap runs concurrently inside the same Warden php-fpm container, sharing the same bind-mounted filesystem. This produces two classes of race condition that crash the integration test bootstrap before any tests execute:

1. deployTestModules.php conflicts

Both bootstraps call deployTestModules.php simultaneously. Each process removes and re-copies to app/code/Magento/TestModule*/ at the same time — one process deletes directories the other just created, or two mkdir() calls collide on the same path, triggering a "File exists" PHP warning that the integration test error handler converts into a fatal PHPUnit\Framework\Exception.

2. Shared sandbox directory conflict

Both bootstraps call setup:install via Magento\TestFramework\Application::install(), which creates a sandbox directory under:

dev/tests/integration/tmp/sandbox-0-<sha256(sha1(install-config))>/

Because both targets share the same phpunit.xml.dist and install-config-mysql.php, they compute an identical sandbox path. The second process to call mkdir() on that path fails with "File exists", again converted to a fatal bootstrap exception.

Fix

Add --parallel=1 to the nx affected command so test targets execute sequentially:

-nx affected --target=test:integration --head=HEAD --base=remotes/origin/${{ inputs.pr_base }}
+nx affected --target=test:integration --head=HEAD --base=remotes/origin/${{ inputs.pr_base }} --parallel=1

Trade-off

Sequential execution increases wall-clock time when multiple packages are affected simultaneously. However:

  • Most PRs touch 1–2 packages, so the practical impact is small.
  • The Warden stack (DB, Redis, RabbitMQ, OpenSearch) is a shared resource anyway, so true parallelism offers limited speed benefit.
  • The tests were failing entirely without this change — reliability over speed.

A longer-term fix would give each test target a distinct install config (different DB name), producing unique sandbox paths and eliminating the shared-state problem while restoring parallelism.

Related

Follows #379 (cache key fix, already merged).

🤖 Generated with Claude Code

When Nx runs multiple affected projects in parallel (default --parallel=3),
each project's PHPUnit bootstrap runs concurrently inside the same Warden
php-fpm container, sharing the same filesystem. This causes two classes of
race condition:

1. Both bootstraps call deployTestModules.php simultaneously, each removing
   and re-copying to app/code/Magento/TestModule*/ at the same time.

2. Both bootstraps call setup:install, which creates a sandbox directory
   under dev/tests/integration/tmp/ keyed by a SHA-256 hash of the install
   config. Since both targets share the same phpunit.xml.dist and
   install-config-mysql.php, they produce an identical sandbox path. The
   second process to arrive fails with "mkdir(): File exists", which the
   PHPUnit error handler converts to a fatal exception that crashes the
   bootstrap before any tests run.

Setting --parallel=1 makes Nx execute test targets sequentially, eliminating
the shared-state conflicts. The tradeoff is longer wall-clock time when
multiple packages are affected, but the tests were failing entirely without
this change so reliability wins over speed.

@rhoerr rhoerr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Seems plausible and safe. Merging to try it on mage-os/mageos-magento2#213 .

@rhoerr rhoerr merged commit 84d1462 into mage-os:main May 27, 2026
1 check passed
ddevallan added a commit to ddevallan/mageos-magento2 that referenced this pull request May 27, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rhoerr pushed a commit to mage-os/mageos-magento2 that referenced this pull request May 31, 2026
…(cron path) (#213)

* Replace basic_get polling with basic_consume push for AMQP consumers

Magento already uses basic_consume (push-based) when running a consumer
in daemon mode via Queue::subscribe(). However, when --max-messages is
set — the path taken by every cron-spawned consumer via ConsumersRunner
— CallbackInvoker falls into a dequeue() loop that calls basic_get()
once per message, requiring one full network round-trip per message.
RabbitMQ explicitly discourages basic_get for continuous consumption.

This change adds Queue::subscribeWithLimit(), structurally identical to
the existing Queue::subscribe() but cancelling the consumer via
basic_cancel() once $maxMessages have been processed. CallbackInvoker
now routes AMQP queues with a message limit through this method, making
the cron path consistent with the CLI daemon path.

The dequeue() polling loop is preserved unchanged for MySQL and STOMP
backends. The consumers_wait_for_messages config is respected: value 1
blocks indefinitely (matching original behaviour), value 0 uses a
1-second idle timeout to exit promptly on an empty queue.

Benchmark (5,000 messages, Docker/local RabbitMQ ~0.1ms RTT):
  basic_get before:  ~0.72s  (~6,900 msg/s)
  basic_consume after: ~0.30s  (~16,500 msg/s)  — 2.4x faster

At a typical production RTT of 2ms the improvement is ~100x:
  5,000 x 2ms = 10s before vs ~50 round-trips x 2ms = 0.1s after.

* Fix coding-standard warnings and pre-existing test fixture mismatch

- Expand empty closures `function () {}` to multi-line form to satisfy
  the Magento2 coding standard (closing brace must be on its own line)
  in QueueTest.php and CallbackInvokerTest.php
- Update TRemoteService.txt fixture to use nullable type `?int $typeId`
  matching the TRepositoryInterface source; the old fixture used `int`
  which caused RemoteServiceGeneratorTest::testGenerate #1 to fail

* Clean up stale test modules before deploying to prevent mkdir collision

When integration tests run against a warm CI cache, test module directories
from a previous run may already exist in app/code/Magento. The deployTestModules
script's mkdir() call then fails with "File exists", which the PHPUnit error
handler converts to a fatal exception, crashing the bootstrap before any tests
run.

Mirroring the existing shutdown-time deleteTestModules() cleanup upfront ensures
a clean slate on every run regardless of cache state.

* Trigger CI to pick up --parallel=1 fix from mage-os/github-actions#380

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Trigger CI

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants