CI: diagnose AOTI hang on macOS — isolated test with native stack sampling #19886
Open
SS-JIA wants to merge 1 commit into
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19886
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 59171bf with merge base 88faab2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
SS-JIA added a commit that referenced this pull request on May 29, 2026
Summary: AOTI tests (llama3_2_vision and select extension/llm tests) hang indefinitely on macOS CI runners after the PyTorch 2.12 pin update. The hang is in native C/C++ code (inductor compilation / dlopen), which prevents faulthandler from producing a traceback. Diagnosis is ongoing in #19886. Skip the affected tests and bump the macOS job timeout from the default 90 to 120 minutes to add margin (observed completion at ~79 min with skips applied). Co-Authored-By: Claude <noreply@anthropic.com>
Summary: The text_decoder AOTI test passed on CI (~99s), so the hang is in one of the other AOTI tests. This update runs all 6 skipped AOTI tests individually with:
- Per-test 10-minute timeout (generous but bounded)
- Background `sample` watchdog for native C/C++ stack traces every 60s
- PYTHONUNBUFFERED=1 for real-time output
- Timestamps marking start/finish of each test

Tests run sequentially:
1. llama3_2_vision/preprocess/test_preprocess.py
2. llama3_2_vision/vision_encoder/test/test_vision_encoder.py
3. llama3_2_vision/text_decoder/test/test_text_decoder.py
4. test_position_embeddings::test_tile_positional_embedding_aoti
5. test_position_embeddings::test_tiled_token_positional_embedding_aoti
6. test_attention::test_attention_aoti

Co-Authored-By: Claude <noreply@anthropic.com>
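The per-test timeout and background watchdog described in this commit message could be sketched roughly as follows. This is an illustrative sketch, not the script in the PR: `run_with_watchdog`, `stamp`, and `macos_sample` are hypothetical names, and the `sample <pid> <seconds>` invocation assumes the macOS `sample` tool is on `PATH` (the helper does nothing when it is not).

```python
import os
import shutil
import subprocess
import sys
import threading
from datetime import datetime, timezone


def stamp(msg):
    # Timestamped marker so start/finish of each test is visible in the CI log.
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    print(f"[{now}] {msg}", flush=True)


def macos_sample(pid, seconds=5):
    # Assumption: macOS `sample <pid> <seconds>` prints native call stacks.
    if shutil.which("sample") is None:
        return  # not on macOS, or tool unavailable
    subprocess.run(["sample", str(pid), str(seconds)], check=False)


def run_with_watchdog(cmd, timeout_s=600, sample_every_s=60, sampler=None):
    """Run cmd with a hard timeout; call sampler(pid) periodically while it runs."""
    env = dict(os.environ, PYTHONUNBUFFERED="1")  # avoid pipe buffering in the child
    stamp(f"START {' '.join(cmd)}")
    proc = subprocess.Popen(cmd, env=env)
    done = threading.Event()

    def watchdog():
        # Event.wait returns False on timeout, meaning the test is still running.
        while not done.wait(sample_every_s):
            if sampler is not None:
                sampler(proc.pid)

    thread = threading.Thread(target=watchdog, daemon=True)
    thread.start()
    try:
        rc = proc.wait(timeout=timeout_s)
        status = "finished"
    except subprocess.TimeoutExpired:
        proc.kill()  # bounded: never wait longer than timeout_s for one test
        rc = proc.wait()
        status = "timeout"
    finally:
        done.set()
        thread.join()
    stamp(f"END {status} rc={rc}")
    return status, rc
```

A call such as `run_with_watchdog([sys.executable, "-m", "pytest", test_path], sampler=macos_sample)` would run one test at a time, where `test_path` stands for one of the six entries listed above. Running the tests in separate processes means a hang in one of them is attributed to that test rather than to the whole job.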
Summary
The macOS unittest job has been timing out since the PyTorch pin was updated to 2.12. Three CI runs showed 38-42 minutes of complete silence after ~55% test completion, with faulthandler unable to fire (confirming the hang is in native C/C++ code, not Python).
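For contrast, here is a minimal sketch of a `faulthandler` watchdog in a case where the stall is in Python code and a traceback is produced. It only illustrates the kind of dump that was absent from the CI runs described above; `dump_if_stuck` is a hypothetical helper, not the repository's configuration.

```python
import faulthandler
import tempfile
import time


def dump_if_stuck(work, timeout_s):
    """Run work(); if it takes longer than timeout_s, capture the Python tracebacks."""
    with tempfile.TemporaryFile(mode="w+") as log:
        # Arm the watchdog: after timeout_s, tracebacks of all threads go to `log`.
        faulthandler.dump_traceback_later(timeout_s, file=log)
        try:
            work()
        finally:
            # Disarm so a fast run produces no dump.
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read()


if __name__ == "__main__":
    # A Python-level stall: the dump points at the sleeping frame.
    print(dump_if_stuck(lambda: time.sleep(1.0), timeout_s=0.2))
```

The dump lists Python frames only, so even when it does fire it cannot name the native function a thread is blocked in.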
This PR isolates the diagnosis:
- … `test_llama3_2_text_decoder_aoti`) run as an instrumented script
- `sample` to capture native C/C++ call stacks every 60s, since faulthandler cannot see into native code that holds the GIL
- `PYTHONUNBUFFERED=1` to prevent pipe buffering

Test plan
- `sample` output will show the exact native function blocking

Co-Authored-By: Claude noreply@anthropic.com