[DRAFT][CI-VERIFY] Bloaty size reports on arm jobs by rascani · Pull Request #1 · rascani/executorch

rascani · 2026-05-23T00:19:37Z

Draft PR to verify the new bloaty PR-comment workflow added in
.github/workflows/bloaty-size-comment.yml. The synthetic +190-byte change
to runtime/executor/program.cpp ([bloaty-ci-verify-string]) should
produce a sticky comment showing the regression on both arm-bare_metal
and arm-zephyr-preset.

Do not merge. Will be force-pushed for additional test cases.

Differential Revision: D102880053 Pull Request resolved: pytorch#19211

Differential Revision: D106123930 Pull Request resolved: pytorch#19742

pytorch#19746) pytorch#18476 clone version due to bot crash

…ackend (pytorch#19747) clone pytorch#18477 due to bot crash

clone pytorch#18728 due to bot crash

Differential Revision: D106162684 Pull Request resolved: pytorch#19749

@robert-kalmar

### Summary Add tests verifying correct support for add.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar

…#19752) Differential Revision: D106254596 Pull Request resolved: pytorch#19752

Treat BUCK and TARGETS files as build metadata in the Arm pre-push license check so they do not need copyright headers. Signed-off-by: Per Held <per.held@arm.com> Change-Id: I4b3bbd1e03ba4b9c38fd06225156344985f0cc70

@robert-kalmar

### Summary Add tests verifying correct support for sub.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

…opy (pytorch#19751)

Follow-up to pytorch#17097, which added BF16 support to the TOSA GATHER op. `aten.index_select` and `aten.unfold_copy` both lower via TOSA GATHER, but their support checks were not updated at the time.

In both decompositions (`DecomposeIndexSelectToGatherPass()` and `DecomposeUnfoldToGatherPass()`), the bf16 values tensor flows through dtype-agnostic reshape ops and `tosa.GATHER`, which accepts `BF16`. The support check was the only blocker.

| Op                  | bf16 before | bf16 after |
|---------------------|:-----------:|:----------:|
| `aten.gather`       | ✅ | ✅ |
| `aten.index.Tensor` | ✅ | ✅ |
| `aten.slice_copy`   | ✅ | ✅ |
| `aten.index_select` | ❌ | ✅ |
| `aten.unfold_copy`  | ❌ | ✅ |

Changes:
- `index_select_support.py`, `unfold_copy_support.py`: extend float branch to include `bfloat16`; add bf16 extension guard; update rejection message.
- `test_index_select.py`, `test_unfold_copy.py`: add isolated `_tosa_FP_bf16` test functions using `TosaPipelineFP(..., tosa_extensions=["bf16"])`.

### Test plan
`test_index_select_tosa_FP_bf16` and `test_unfold_copy_tosa_FP_bf16` exercise the bf16 path end-to-end through `TosaPipelineFP` with the bf16 extension enabled, following the same pattern as the existing `test_slice_tensor_tosa_FP_bf16` from pytorch#17492

@psiddh

This is done for conv, depthwise conv, transpose conv, and bmm. Add scratch tensors to the operator signatures, which are then assigned exir.memory.alloc. These allocs are automatically memory planned by ExecuTorch. Introduce `required_cmsis_buffer_size`, which computes the buffer size from node properties and the Cortex-M configuration. The function uses functions registered by target in backends/cortex_m/passes/scratch_buffer_sizes.py. This is used to set the size of the allocs in ConvertToCortexMPass. Finally, modify the kernels to use the new scratch tensor instead of allocating temporary memory. Add a new macro, CORTEX_M_ENABLE_RUNTIME_CHECKS, to check that the AOT-computed buffer size is equal to the buffer size computed at runtime. Use this when testing. cc @psiddh @AdrianLundell @digantdesai @rascani @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell --------- Signed-off-by: Erik Lundell <erik.lundell@arm.com> Co-authored-by: Måns Nilsson <mans.nilsson@arm.com>

@cccclai

…es (pytorch#19146)

### Summary
To enable GPU backend support in the Llama runner, refactoring is required because the dtypes of kv_cache, attention_mask, and logits are currently hardcoded, preventing floating‑point models from running. This PR focuses on removing those hardcoded dtypes.

#### Key changes
- Remove template parameter <typename T> from KVManager, LhdTokenGenerator, MultimodalPromptProcessor, and related runner classes
- Detect kv_cache and attention_mask dtypes dynamically from MethodMeta at construction time instead of compile-time bitwidth detection
- Switch to std::byte* pointer arithmetic with getDtypeSize() for all buffer offsets; add fill_mask() helper for multi-dtype attention mask filling
- Update spec_prop pass for custom llama op for sharding case greater than 1

### Test plan
```
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_llama_stories_110m --model SM8650 --build_folder /local/mnt/workspace/chenweng/executorch/executorch/build-android --device acfa9311 --executorch_root . --artifact_dir ./stories_110m_pte_size --llama_artifacts . --use_fp16
```

<img width="1977" height="468" alt="image" src="https://github.com/user-attachments/assets/8bf3bffa-9b9f-4655-9cbc-b20127c2468a" />

cc @cccclai @cbilgin @abhinaykukkadapu

Summary: Pull Request resolved: pytorch#19764 Reviewed By: kirklandsign Differential Revision: D106332819

@digantdesai

As documented at https://vkdoc.net/man/VkDataGraphPipelineSessionBindPointRequirementARM, the `sType` member of VkDataGraphPipelineSessionBindPointRequirementARM should always be set to VK_STRUCTURE_TYPE_DATA_GRAPH_PIPELINE_SESSION_BIND_POINT_REQUIREMENT_ARM. cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Erik Lundell <erik.lundell@arm.com>

Enable CPPCHECK for Cortex-M sources and headers. The Cortex-M kernels are registered through generated wrappers, so cppcheck cannot see direct call sites for the exported *_out entry points and reports them as unused. Keep narrow unusedFunction suppressions for those registration-visible functions. The scratch buffer context header is linted as a standalone header but currently exposes helper API without in-tree call sites, so suppress unusedFunction at file scope there instead of dropping Cortex-M header coverage. Keep the quantize and dequantize context parameters non-const to match the generated kernel ABI; changing them to const changes the mangled symbols used by registration. Signed-off-by: Per Held <per.held@arm.com> Change-Id: I3bcb6e5d3f125ae400005d1b033b24a07eb7924f

### Summary It relates to pytorch#18833. It doesn't add Yolo on baremetal, but it at least makes sure that it works using Portable Kernels and XNNPACK backends. ### Test plan It's only adding a model to CI, so the CI is the test plan.

Convert BenchmarkActivity, BenchmarkMetric, LlmBenchmark, LlmModelRunner, and ModelRunner from Java to Kotlin. Differential Revision: D106195816

@digantdesai

…rch#19731) ### Summary Extend the Cortex-M cross-CPU build pipeline to Armv6-M by patching two upstream issues that block the Corstone-300 target source and the CMSIS Cortex DFP from building for `cortex-m0plus`: * `core_platform/0003-*.patch` guards the `HardFault_Handler` in `targets/corstone-300/target.cpp`. The handler uses an `ite eq` IT-block in inline asm and dereferences the SCB CFSR/BFAR/MMFAR fault-status registers; both are Armv7-M / Armv8-M Mainline only. The patch wraps the rich handler in `__ARM_ARCH_7M__ / 7EM / 8M_MAIN / 8_1M_MAIN` and falls back to a minimal stub on Armv6-M / Armv8-M Baseline (M0/M0+/M23). * `core_software/0002-*.patch` fixes `cmsis.cmake`'s handling of the M0+ device. The Cortex DFP names the device directory and headers `ARMCM0plus` (lowercase suffix), while the device sources (`startup_ARMCM0plus.c`, `system_ARMCM0plus.c`) gate their implementations on the `ARMCM0P` preprocessor macro — three different spellings. The previous `string(TOUPPER ...)` produced `ARMCM0PLUS`: the include path lookup failed and the source files hit their `#error device not specified!` guard. Override `ARM_CPU` to `ARMCM0plus` for the directory + filename and introduce a separate `CMSIS_DEVICE_CPU_DEFINE` set to `ARMCM0P` for the cmsis_startup and cmsis_system compile-definitions; all other cores still drive both paths from the uppercased default. Both patches are layered via the existing `patch_repo` mechanism; the `corstone_utils.cmake` TODO is updated so the deletion plan for 0002 and 0003 is documented together. ### Test Plan Locally validated end-to-end on the Corstone-300 FVP with the `qadd` model: `cortex-m0plus` build links a runner that includes `startup_ARMCM0plus.c` / `system_ARMCM0plus.c` and the patched `target.cpp`, and the FVP run prints `TEST: BundleIO index[0] Test_result: PASS` with all error stats zero. 
The bundled `libcmsis-nn.a` reports `Tag_CPU_arch: v6S-M` and `Tag_THUMB_ISA_use: Thumb-1` with zero DSP / MVE / saturating instructions, confirming the scalar code path was exercised. Authored with Claude. cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

Differential Revision: D106026285 Pull Request resolved: pytorch#19734

Differential Revision: D106394605 Pull Request resolved: pytorch#19775

@robert-kalmar

pytorch#19772) … Registration ### Summary Docs improvement. ### Test plan Docs only. cc @robert-kalmar @JakeStevens @digantdesai @rascani

@digantdesai

Re-upload with BUCK changes. Share TOSA RESIZE parameter validation between upsample support checks and fake RESIZE lowering so invalid nearest and bilinear resize parameters are rejected before delegation. Change-Id: I57c267aca96d733879ae90329267e44adce399c6 cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Per Held <per.held@arm.com>

Differential Revision: D106408368 Pull Request resolved: pytorch#19783

### Summary In pytorch#19651, I added a global seed for pytest runs. This was intended to reduce random tolerance flakes, but didn't actually do so in practice. This is because the parallel test runners don't guarantee any ordering, so random state is unstable between runs. I've updated it to set the seed per-test. This should hopefully make the random state invariant of test execution order.
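The ordering problem described above can be reproduced without pytest. The following is a minimal sketch, not the change made in the PR: `SEED`, `case_a`, `case_b`, and both runner functions are hypothetical stand-ins, and Python's `random` module stands in for whatever RNGs the real tests seed.

```python
import random

SEED = 1234  # hypothetical seed value, chosen only for illustration


def case_a() -> float:
    """Stand-in for a test that draws random input data."""
    return random.random()


def case_b() -> float:
    """A second stand-in test that also consumes random state."""
    return random.random()


def run_with_global_seed(order):
    """Seed once per run: each case sees state left behind by earlier cases."""
    random.seed(SEED)
    return {case.__name__: case() for case in order}


def run_with_per_test_seed(order):
    """Reseed before every case: each case starts from the same state."""
    results = {}
    for case in order:
        random.seed(SEED)
        results[case.__name__] = case()
    return results


if __name__ == "__main__":
    print(run_with_global_seed([case_a, case_b]))
    print(run_with_global_seed([case_b, case_a]))
    print(run_with_per_test_seed([case_a, case_b]))
    print(run_with_per_test_seed([case_b, case_a]))
```

With a single global seed, the value `case_a` draws changes when it runs second instead of first. With per-test seeding, both orders give identical values. In a pytest suite the reseeding step would typically live in an autouse fixture, though that wiring is an assumption here.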

Differential Revision: D106430647 Pull Request resolved: pytorch#19790

…19743) Differential Revision: D105630451 Pull Request resolved: pytorch#19743

@kimishpatel

… GPU / CPU (pytorch#19252) ### Summary CoreML decides at compile/load time which device each MIL operation will execute on, and coremltools 9.0+ exposes that through `MLComputePlan`. The recurring question on the issue tracker is *"why isn't my model running fully on the ANE?"* — for example: - pytorch#4091 — `llama model is not fully lowered to ANE` - pytorch#11541 — `CoreML model is crashing on iPhone GPU, but not on iPhone CPU or macOS GPU` - pytorch#8439 — `ANE compile OOMs on certain input shapes` - pytorch#8445 — `CPU Overhead After ANE Execution` Today the only way for an ExecuTorch user to answer it is to break out Swift / Xcode. This PR adds a Python wrapper around `MLComputePlan` so the answer is one shell command: ``` $ python coreml_compute_plan.py --model_path my_model.mlpackage \ --compute_units cpu_and_ne --show_non_ane === my_model.mlpackage === ANE: 412 / 480 ( 85.8%) CPU: 68 / 480 ( 14.2%) Non-ANE op types: 32 ios17.cast 18 ios17.gather 12 ios17.reshape 6 ios17.constexpr_blockwise_shift_scale ``` Inputs supported: | Input | Behavior | |---|---| | `.pte` | Extract every Core ML partition into a tempdir, then analyze each. | | `.mlpackage` | Compile to `.mlmodelc` in a tempdir, then analyze. | | `.mlmodelc` | Analyze directly. | The PTE path reuses the same JSON/named-data extraction logic that `extract_coreml_models.py` uses, and is inlined into the script so it can be run against a plain CoreML model without depending on the executorch package. ### Test plan Added `test_coreml_compute_plan.py` covering: - `_device_name(...)` for `None` and a stub `MLNeuralEngineComputeDevice`. - `_COMPUTE_UNIT_CHOICES` mapping (`cpu_and_ne` / `all`). - `analyze_one(...)` end-to-end on a tiny `relu(x @ x.T) + x.sum()` mlpackage built with `coremltools.convert(...)`: returns rows for every dispatched op, with a `main` function and the expected MIL op types (`matmul`, `relu`, `add`, `reduce_sum`). 
``` $ python -m pytest examples/apple/coreml/scripts/test_coreml_compute_plan.py -v ============================== 7 passed in 3.68s =============================== ``` I also ran the script against a few hand-built `.mlpackage` and `.mlmodelc` files on macOS 26 with coremltools 9.0 and verified the output matches what `MLComputePlan` returns directly. Authored with Claude. cc @kimishpatel @YifanShenSZ @cymbalrush @metascroy

Differential Revision: D106412035 Pull Request resolved: pytorch#19777

…h#16986) Differential Revision: D91725222 Pull Request resolved: pytorch#16986

Adds a simple pass for replacing single Aten ops with corresponding dialect ops to be reused across multiple backends. Signed-off-by: Adrian Lundell <adrian.lundell@arm.com>

Differential Revision: D106575515 Pull Request resolved: pytorch#19831

…rch#19858) clone pytorch#18729 due to bot crash

Enable loading GGUF files (e.g. Q4_K_M) and exporting to the MLX backend. Three areas of change: GGUF loader (gguf_loader.py): - Add MLX backend support alongside CUDA - Keep embedding quantized for MLX (QuantizedEmbeddingHandler supports quantized gather natively, unlike CUDA's Int4Tensor) - Fix stale docstring references to Int4TilePackedTo4dTensor/tinygemm MLX backend (op_helpers.py, patterns.py): - Accept group_size=16 in parse_dequant_node for GGUF Q6_K tensors - For group_size < 32, emit DequantizeNode + TransposeNode + AddmmNode instead of QuantizedMatmulNode, since MLX Metal kernels are only instantiated for group_size >= 32. Weights stay packed as int8 in the .pte file and are dequantized on-device at runtime — same strategy CUDA/Inductor uses (separate Triton dequant + cuBLAS mm). Packer (pack_mlx.py): - Add 16 to supported group sizes so Q6_K IntxUnpackedToInt8Tensor passes through to export unchanged Tests (test_ops.py): - Add group_size=16 configs for int8, int4, and no-bias variants Test Plan: Export and run this model https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/blob/main/gemma-4-31B-it-Q4_K_M.gguf On M1 32GB machine (exported on Linux A100) ``` (executorch_dev) mnachin@mnachin-mbp executorch % ./cmake-out/examples/models/gemma4_31b/gemma4_31b_runner \ --model_path /Users/mnachin/repos/models/gemma-4-31B-it-GGUF/model.pte \ --tokenizer_path /Users/mnachin/repos/models/gemma-4-31B-it-HQQ-INT4/tokenizer.json \ --prompt "Tell me a joke about RAM usage" \ --max_new_tokens 128 \ --temperature 0.8 I tokenizers:regex.cpp:27] Registering override fallback regex WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1779926968.603672 54889180 re2.cc:237] Error parsing '((\<pad\>|ool\|\>1\x00\x00\ �\<t|respo|\<tool_call\|\>|\<bos\>|\<\|tool_response\>|\<\|think\|\>|\x0...': invalid UTF-8 I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((\<pad\>|ool\|\>1\x00\x00\ 
�\<t|respo|\<tool_call\|\>|\<bos\>|\<\|tool_response\>|\<\|think\|\>|\x00\x00\\\<|\<tool_response\|\>|\<mask\>|\<\|\"\|\>|all\|\>j\x00\x00\\|\<channel\|\>|\<\|turn\>|\<turn\|\>|\<\|image\>|\<\|$ I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex I tokenizers:pcre2_regex.cpp:48] PCRE2 UTF-8 validation failed at offset 27: UTF-8 error: byte 2 top bits not 0x80. Retrying without UTF flags. Loading model... Prompt tokens: 23 Why did the computer go to therapy? Because it had too many **unresolved dependencies** and it just couldn't stop **dwelling on the past**... but it forgot everything the moment it took a nap.<turn|> PyTorchObserver {"prefill_token_per_sec":2.49539,"decode_token_per_sec":0.0880671,"prompt_tokens":23,"generated_tokens":44,"model_load_start_ms":1779926968052,"model_load_end_ms":1779926982494,"inference_start_ms":1779926982497,"inference_end_ms":1779927491333,"prompt_eval_end_ms":1779926991714,"first_token_ms":1779926991714,"aggregate_sampling_time_ms":0,"SCALING_FACTOR_UNITS_PER_SECOND":1000} ``` For reference, here's the this model: https://huggingface.co/SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4 ``` (executorch_dev) mnachin@mnachin-mbp executorch % ./cmake-out/examples/models/gemma4_31b/gemma4_31b_runner \ --model_path /Users/mnachin/repos/models/gemma-4-31B-it-HQQ-INT4/model.pte \ --tokenizer_path /Users/mnachin/repos/models/gemma-4-31B-it-HQQ-INT4/tokenizer.json \ --prompt "Tell me a joke about RAM usage" \ --max_new_tokens 128 \ --temperature 0.8 I tokenizers:regex.cpp:27] Registering override fallback regex WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1779927592.109382 54914733 re2.cc:237] Error parsing '((\<pad\>|ool\|\>1\x00\x00\ �\<t|respo|\<tool_call\|\>|\<bos\>|\<\|tool_response\>|\<\|think\|\>|\x0...': invalid UTF-8 I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((\<pad\>|ool\|\>1\x00\x00\ 
�\<t|respo|\<tool_call\|\>|\<bos\>|\<\|tool_response\>|\<\|think\|\>|\x00\x00\\\<|\<tool_response\|\>|\<mask\>|\<\|\"\|\>|all\|\>j\x00\x00\\|\<channel\|\>|\<\|turn\>|\<turn\|\>|\<\|image\>|\<\|$ I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex I tokenizers:pcre2_regex.cpp:48] PCRE2 UTF-8 validation failed at offset 27: UTF-8 error: byte 2 top bits not 0x80. Retrying without UTF flags. Loading model... Prompt tokens: 23 Why did the computer go to therapy? Because it had too many **unresolved dependencies** and couldn't stop **dwelling on the past**, but it still couldn't remember why it was there. *** Alternatively, a shorter one: **Why was the RAM so stressed?** Because it had too much on its mind, but it knew that as soon as it slept, it would forget everything.<turn|> PyTorchObserver {"prefill_token_per_sec":9.11975,"decode_token_per_sec":5.24998,"prompt_tokens":23,"generated_tokens":86,"model_load_start_ms":1779927591719,"model_load_end_ms":1779927603575,"inference_start_ms":1779927603579,"inference_end_ms":1779927622482,"prompt_eval_end_ms":1779927606101,"first_token_ms":1779927606101,"aggregate_sampling_time_ms":0,"SCALING_FACTOR_UNITS_PER_SECOND":1000} ``` There's definitely performance degradation when running GGUF

Adds two new Android instrumentation test suites covering previously untested API surfaces, completing feature testing coverage for OKR 3.2.

AsrModuleInstrumentationTest (18 tests): constructor validation, lifecycle (close idempotency, use-after-close), transcribe validation, and AsrTranscribeConfig builder/validation.

LlmLoraInstrumentationTest (13 tests): dataFiles constructor variants, LlmModuleConfig with dataPath, invalid data file error handling, baseline equivalence, and config builder validation.

## Test plan
- [x] `./gradlew :executorch_android:connectedAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=org.pytorch.executorch.AsrModuleInstrumentationTest`
- [x] `./gradlew :executorch_android:connectedAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=org.pytorch.executorch.LlmLoraInstrumentationTest`
- [x] Verify all 31 new tests pass on emulator (API 34 x86_64)
- [x] Verify existing tests are unaffected

Differential Revision: D105728137 Pull Request resolved: pytorch#19724

@robert-kalmar

…pytorch#19793) ### Summary Enable `aten.upsample_bilinear2d` with new Neutron flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

@robert-kalmar

…ytorch#19796) ### Summary NXP backend: Enable `aten.upsample_nearest2d` with new Neutron flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

@digantdesai

logger.level was used to determine whether to add the partition_report.txt FileHandler to the logger. This value is not set by logging.basicConfig, and defaults to 0. This caused empty reports to be output when the intermediate path was set and the logging level was > INFO. Instead, use .getEffectiveLevel(). cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Erik Lundell <erik.lundell@arm.com>
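A small standard-library sketch of the difference between `Logger.level` and `Logger.getEffectiveLevel()`. The logger name and the `should_write_report` gate are hypothetical; the actual condition in the Arm backend is not shown in this thread. The root level is set directly here, which has the same effect on levels as a `logging.basicConfig(level=...)` call.

```python
import logging

# Configure only the root logger's level.
logging.getLogger().setLevel(logging.WARNING)

# Hypothetical module-level logger; the name is illustrative only.
logger = logging.getLogger("partition_report_demo")

# The logger's own level was never set, so it stays at NOTSET (0).
own_level = logger.level

# The effective level walks up the hierarchy and finds the root's WARNING (30).
effective_level = logger.getEffectiveLevel()


def should_write_report(level: int) -> bool:
    """Hypothetical gate: attach the report handler at INFO verbosity or lower."""
    return level <= logging.INFO


if __name__ == "__main__":
    print(own_level, should_write_report(own_level))
    print(effective_level, should_write_report(effective_level))
```

Gating on `own_level` attaches the handler even though INFO records will never be emitted, which is consistent with the empty report files described above. Gating on `effective_level` does not.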

@digantdesai

* Add layers that run in BF16 in the HF model Change-Id: If75434db138059f3a433a70abda3f3e26f6dd3b6 cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani --------- Signed-off-by: Tom Allsop <tom.allsop@arm.com>

@digantdesai

This is stacked on top of pytorch#19029.
- make non-KV-cache example inputs match the static export window
- fix PT2E calibration flow for padded prefixes and optional LM-Eval tasks
- update SmolLM2 export settings used by the VGF PT2E workflow
- fix rope_theta in 135M_config.json to align with the Hugging Face model config

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Xingguo Li <xingguo.li@arm.com> Co-authored-by: Zingo Andersen <zingo.andersen@arm.com>

@digantdesai

Change-Id: Id97fcb787369b62aecd4a0be27132ff4a0785fcf cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Michiel Olieslagers <michiel.olieslagers@arm.com>

@digantdesai

…h#19839)

Add ArmPass.should_run_pass() as a reusable early-exit hook before call() starts the normal ExportPass retracing path. The default hook returns true, preserving existing behavior for ArmPass subclasses.

Introduce ArmOpTargetedPass for passes that only transform a known set of operator targets. It implements should_run_pass() by scanning the current graph and nested GraphModules for matching target operators. If no matching target operator is found, the pass returns an unmodified PassResult.

For passes that already gate transformations with allowed_to_transform(), allow the target pre-scan to apply the same check before deciding whether the pass needs to run. This avoids running TFA passes when all matching target nodes are marked as disallowed.

The should_run_pass() hook and ArmOpTargetedPass pre-scan avoid rebuilding graphs for decomposition and rewrite passes that cannot affect the current graph. The speedup is most visible on large models.

Single-run paired benchmarks on Arm backend model tests across FP32, INT, VGF no-quant, and VGF quant variants:

| Model       | E2E avg | Pass-manager avg |
|-------------|--------:|-----------------:|
| T5-small    | +30.5%  | +47.5% |
| DeepLabV3   | +12.9%  | +49.8% |
| Wav2Letter  | +16.9%  | +51.2% |
| InceptionV3 | +22.2%  | +46.5% |
| MobileNetV2 | +22.2%  | +52.5% |
| MobileNetV3 | +29.9%  | +54.6% |

Model rows are unweighted averages over successful variants. Unweighted average across 23 successful model/target variants:
- E2E speedup: +22.4%
- Pass-manager speedup: +50.5%

Change-Id: Iaa09638473a1d6d1e2ce98f5a0e3fc3a14378143 cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Yufeng Shi <yufeng.shi@arm.com> Co-authored-by: Erik Lundell <erik.lundell@arm.com>

- Export & lower the smollm2 via extensions/llm/export_llm - Build the arm_executor_runner application - Fix the propagation of select_ops_list in the CMakeLists.txt - Test the application runs on FVP in fast mode Signed-off-by: George Gekov <george.gekov@arm.com> Change-Id: I8acd87c2f5c3e6b5b189bb987ceccfe4877e2254

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Summary: Currently, __builtin_FUNCTION is used opportunistically if it exists. However, for heavily templated code, this results in extremely long strings, which add to .rodata and can be wasteful on embedded targets. This commit adds an override which uses the shorter __FUNCTION__ even if __builtin_FUNCTION exists, and exposes it as a BUCK constraint. Integration into CMake is intentionally left out for now. Differential Revision: D106668077

…ytorch#19834) Summary: The current approach uses __FILE__ and opportunistically trims it if the utility is available. However, the long name is still stored in .rodata, which can cost memory on embedded platforms. Instead, first try __FILE_NAME__. Differential Revision: D106587633

Summary: ghstack 0.15.0 changed the header URL in PR bodies from `Stack from [ghstack](https://github.com/ezyang/ghstack)` to `Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)`. The exact string match in `propose_ghstack_orig_pr.py` no longer matched, causing every ghstack_land workflow run to fail since May 14. Use `startswith("Stack from [ghstack]")` instead to be resilient to URL changes. Test Plan: Verified the new pattern matches both the old format (`https://github.com/ezyang/ghstack`) and the new format (`https://github.com/ezyang/ghstack/tree/0.15.0`). This PR was authored with the help of Claude. Reviewers:
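The matching change can be sketched as follows. The two header strings are taken from the description above. The function names, and the assumption that the old check was a whole-line equality test, are illustrative and not copied from `propose_ghstack_orig_pr.py`.

```python
OLD_HEADER = "Stack from [ghstack](https://github.com/ezyang/ghstack)"
NEW_HEADER = "Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)"


def is_ghstack_header_exact(line: str) -> bool:
    """Assumed shape of the old check: the whole line must equal the old header."""
    return line.strip() == OLD_HEADER


def is_ghstack_header(line: str) -> bool:
    """Prefix check that ignores whatever URL follows the link text."""
    return line.strip().startswith("Stack from [ghstack]")


if __name__ == "__main__":
    for header in (OLD_HEADER, NEW_HEADER):
        print(is_ghstack_header_exact(header), is_ghstack_header(header))
```

The prefix check accepts both header formats, while the exact comparison rejects the 0.15.0 form.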

Pull Request resolved: pytorch#19867 Some environments preserve stale failure state when tests are reported through unittest skip results. This switches currently disabled Vulkan delegate coverage to a local decorator so those tests stay discoverable, log their disabled reason, and produce an executed result. ghstack-source-id: 387629544 @exported-using-ghexport Differential Revision: [D106732141](https://our.internmc.facebook.com/intern/diff/D106732141/)

Applies the same disabled-test treatment as the prior diffs in this stack to the devtools inspector tests. Some test runners preserve stale failure state when tests report through unittest skip results, so this replaces the conditionally disabled coverage with a local decorator that keeps the tests discoverable, logs their disabled reason, and produces an executed result. Adds a disable_if decorator that mirrors unittest.skipIf (evaluating the condition at decoration time) and converts the three Windows-gated test cases to use it. Differential Revision: [D106736354](https://our.internmc.facebook.com/intern/diff/D106736354/) ghstack-source-id: 387629542 Pull-Request: pytorch#19874
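One possible shape for such a decorator is sketched below. Only the name `disable_if` and its decoration-time evaluation come from the description above. The body, the log message, and `DemoTest` are assumptions for illustration, not the code from the diff.

```python
import functools
import logging
import unittest


def disable_if(condition: bool, reason: str):
    """Like unittest.skipIf, but a disabled test still runs and passes as a no-op."""

    def decorator(test_func):
        # The condition is evaluated once, at decoration time.
        if not condition:
            return test_func

        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            # Log why the test is disabled, then return without running its body.
            logging.getLogger(__name__).warning(
                "Test %s disabled: %s", test_func.__name__, reason
            )

        return wrapper

    return decorator


class DemoTest(unittest.TestCase):
    @disable_if(True, "hypothetical platform gate")
    def test_disabled(self):
        self.fail("body must not run while disabled")

    @disable_if(False, "never disabled")
    def test_enabled(self):
        self.assertEqual(1 + 1, 2)


def run_demo() -> unittest.TestResult:
    """Run DemoTest and return the raw result for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    outcome = run_demo()
    print(outcome.testsRun, len(outcome.skipped), outcome.wasSuccessful())
```

Both tests are discovered and counted as run, and the result contains no skip entries, which is the property the description relies on.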

The test-arm-cortex-m-size-test job now builds the PR's merge base in addition to head, runs bloaty against both, and uploads a per-leg artifact. A new workflow_run-triggered workflow downloads the artifacts and posts a sticky PR comment with per-segment, per-section, and per-bucket deltas plus the top-5 symbols by Δ. Reporting is best-effort and never fails the size job. Existing threshold gates are unchanged. The custom bloaty data source in test/bloaty/executorch.bloaty groups demangled symbols into runtime/extension/backends/kernels/etc buckets so the diff is readable. Drafted with Claude.

Synthetic +~190 byte change to verify the bloaty PR-comment workflow reports the regression correctly. Look for the [bloaty-ci-verify-string] marker — revert this commit before any real PR.

Three fixes from the first CI run on PR pytorch#19888: 1. Base build was failing because the worktree was at /tmp/base-worktree, but CMakeLists.txt:420 requires the repo dir to be named exactly `executorch`. Worktree is now /tmp/bloaty-base/executorch. 2. bloaty isn't in apt on the executorch-ubuntu-22.04-arm-sdk image. Install via conda-forge instead (conda is already in PATH). 3. `set -e` inside a subshell doesn't fire when the subshell itself is on the left of `||` (per bash spec). Replaced with explicit `|| exit 1` after each critical command in the bloaty subshell, so a python crash actually aborts before the artifact-upload mv runs. Also added workflow_dispatch to bloaty-size-comment.yml so the comment poster can be invoked manually for verification before the workflow lands on the default branch.
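The `set -e` behavior in item 3 can be checked with two one-line scripts. The sketch below drives them from Python and assumes a `bash` binary is on PATH. The script strings and echoed messages are illustrative, not taken from the workflow.

```python
import subprocess


def run_bash(script: str) -> str:
    """Run a snippet with bash -c and return its stripped stdout."""
    done = subprocess.run(
        ["bash", "-c", script], capture_output=True, text=True, check=False
    )
    return done.stdout.strip()


# The subshell is the left operand of ||, so set -e is ignored inside it:
# the failing command does not abort and the later echo still runs.
IGNORED = "( set -e; false; echo 'still running' ) || echo 'handler ran'"

# An explicit '|| exit 1' after the critical command aborts the subshell,
# so the right-hand side of the outer || runs instead.
EXPLICIT = "( false || exit 1; echo 'still running' ) || echo 'handler ran'"

if __name__ == "__main__":
    print(run_bash(IGNORED))
    print(run_bash(EXPLICIT))
```

The first script prints `still running` and the second prints `handler ran`.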

…oaty Two more fixes from the second CI run: 1. `git worktree add` fails on the base SHA because actions/checkout's shallow fetch only has the head commit. Explicitly `git fetch --depth=1 origin <base_sha>` before the worktree add. 2. Conda-forge's bloaty links against a newer libstdc++ than the docker image ships, so `bloaty --version` fails with CXXABI_1.3.15 not found. Prepend the conda env's lib dir to LD_LIBRARY_PATH so bloaty finds its own libstdc++ before the system one.

LD_LIBRARY_PATH override didn't help because the main conda env doesn't ship a libstdc++.so.6, so the linker still picked the system one (too old for conda-forge bloaty: CXXABI_1.3.15 missing). Install bloaty into a dedicated conda env alongside libstdcxx-ng, then invoke via `conda run --no-capture-output -p <env> bloaty`, which sets LD_LIBRARY_PATH correctly. Pass the resolved command to bloaty_diff.py via a new BLOATY env var.

First successful run on PR pytorch#19888 revealed two real issues: 1. bloaty defaults to -n 20 even without a flag, so metadata.json only contained the top 20 symbols + an aggregated [N Others] row. Diffing the resulting capped lists would produce phantom regressions when any symbol crossed the cutoff. -n 0 means unlimited. 2. The arm size_test binary's actual contents on bare-metal: - ~40% newlib stdio internals (_vfprintf_r, _svfprintf_r, _dtoa_r) - ~10% libsupc++ C++ demangler (d_print_*, cplus_demangle_*) - ~10% C++ unwind personality + section-level debug entries None of these matched our regexes. Added patterns so they bucket into libc/stdlib/metadata correctly.
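The phantom-regression effect in item 1 can be reproduced with a toy symbol table. This is a sketch under stated assumptions: the helper names, symbol names, and sizes are invented, and the truncation only imitates a top-N report with an aggregated "Others" row. It does not call bloaty.

```python
def top_n_with_others(sizes: dict, n: int) -> dict:
    """Keep the n largest symbols and fold the rest into one row; n == 0 is unlimited."""
    if n == 0 or len(sizes) <= n:
        return dict(sizes)
    ranked = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ranked[:n])
    rest = ranked[n:]
    kept[f"[{len(rest)} Others]"] = sum(size for _, size in rest)
    return kept


def diff(base: dict, head: dict) -> dict:
    """Per-symbol size delta (head minus base), omitting unchanged rows."""
    names = base.keys() | head.keys()
    deltas = {name: head.get(name, 0) - base.get(name, 0) for name in names}
    return {name: delta for name, delta in deltas.items() if delta != 0}


# Invented sizes: only symbol "c" grows, by 12 bytes, which moves it past "b".
BASE = {"a": 100, "b": 50, "c": 40}
HEAD = {"a": 100, "b": 50, "c": 52}

capped_diff = diff(top_n_with_others(BASE, 2), top_n_with_others(HEAD, 2))
full_diff = diff(top_n_with_others(BASE, 0), top_n_with_others(HEAD, 0))

if __name__ == "__main__":
    print(capped_diff)
    print(full_diff)
```

With the cap, the diff reports `b` disappearing (-50), `c` appearing (+52), and the aggregated row growing (+10). The uncapped diff shows the real +12 on `c` alone.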

The previous test bloater (ET_LOG format string) didn't survive Release mode because ET_LOG_ENABLED=0 compiles log strings out entirely. This version adds a ~265-byte string in .rodata, referenced through a volatile static pointer to defeat the optimizer, inside a function (get_execution_plan) that size_test actually links. Verified locally that the string survives -Os + strip. Look for [bloaty-ci-verify-string] in the bloaty PR comment.

kirklandsign and others added 30 commits May 23, 2026 00:39

Convert Android LLM extension from Java to Kotlin (pytorch#19211)

158c5d8

Differential Revision: D102880053 Pull Request resolved: pytorch#19211

Globally serialize XNNPACK execution, add logging (pytorch#19742)

6bda6c4

Differential Revision: D106123930 Pull Request resolved: pytorch#19742

[ET Device Support] Module: allocate device memory for planned buffers (

12f62f2

pytorch#19746) pytorch#18476 clone version due to bot crash

[ET Device Support] CudaAllocator: device memory allocator for CUDA b…

c27cc5d

…ackend (pytorch#19747) clone pytorch#18477 due to bot crash

[ET Device Support] Define AOT device copy ops registry (pytorch#19748)

7d8063f

clone pytorch#18728 due to bot crash

Add extension_llm_runner to CMake deps (pytorch#19749)

d757776

Differential Revision: D106162684 Pull Request resolved: pytorch#19749

NXP backend: Enable Add Tensor with new Neutron flow (pytorch#19550)

b69cbcd

### Summary Add tests verifying correct support for add.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar

Back out "Globally serialize XNNPACK execution, add logging" (pytorch…

ba6074c

…#19752) Differential Revision: D106254596 Pull Request resolved: pytorch#19752

Arm backend: Exclude build metadata from license checks

ee4c90a

Treat BUCK and TARGETS files as build metadata in the Arm pre-push license check so they do not need copyright headers. Signed-off-by: Per Held <per.held@arm.com> Change-Id: I4b3bbd1e03ba4b9c38fd06225156344985f0cc70

NXP backend: Enable Sub Tensor with new Neutron flow (pytorch#19588)

b73df0b

### Summary Add tests verifying correct support for sub.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

add cuda allocator to cmake target (pytorch#19764) (pytorch#19764)

75fb249

Summary: Pull Request resolved: pytorch#19764 Reviewed By: kirklandsign Differential Revision: D106332819

Convert minibench Java files to Kotlin (pytorch#19760)

6128a45

Convert BenchmarkActivity, BenchmarkMetric, LlmBenchmark, LlmModelRunner, and ModelRunner from Java to Kotlin. Differential Revision: D106195816

Harden against concurrency violations (pytorch#19734) (pytorch#19734)

fb3f6eb

Differential Revision: D106026285 Pull Request resolved: pytorch#19734

Convert Experimental, DType, MethodMetadata from Java to Kotlin

50ee05e

Differential Revision: D106394605 Pull Request resolved: pytorch#19775

NXP backend: Improve docs for NXP eIQ Neutron Kernel Selective Kernel… (

5d36c7c

pytorch#19772) … Registration ### Summary Docs improvement. ### Test plan Docs only. cc @robert-kalmar @JakeStevens @digantdesai @rascani

Fix cortex_m test failures from D106339880

29c3a23

Differential Revision: D106408368 Pull Request resolved: pytorch#19783

Collapse Experimental.kt annotation onto a single line to satisfy linter

b4d62ed

Differential Revision: D106430647 Pull Request resolved: pytorch#19790

Handle out_dtype in ReplacePT2DequantWithCadenceDequantPass (pytorch#…

034b044

…19743) Differential Revision: D105630451 Pull Request resolved: pytorch#19743

Fix bug with mixed weight cache + workspace sharing

fb420f3

Differential Revision: D106412035 Pull Request resolved: pytorch#19777

New exported program pass manager and exported program passes (pytorc…

77df9b7

…h#16986) Differential Revision: D91725222 Pull Request resolved: pytorch#16986

AdrianLundell and others added 22 commits May 28, 2026 18:41

Add general Aten lowering pass (pytorch#19837)

463fbe4

Adds a simple pass for replacing single Aten ops with corresponding dialect ops to be reused across multiple backends. Signed-off-by: Adrian Lundell <adrian.lundell@arm.com>

Remove google-java-format from CI lint infrastructure

c8c04e4

Differential Revision: D106575515 Pull Request resolved: pytorch#19831

[ET Device Support] Define et_copy runtime h2d and d2h copy ops (pyto…

000d810

…rch#19858) clone pytorch#18729 due to bot crash

Add shared fusion infrastructure and QuantFusionPass (pytorch#19724)

4de16d0

Differential Revision: D105728137 Pull Request resolved: pytorch#19724

NXP backend: Enable aten.upsample_bilinear2d with new Neutron flow. (…

007570a

…pytorch#19793) ### Summary Enable `aten.upsample_bilinear2d` with new Neutron flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

NXP backend: Enable aten.upsample_nearest2d with new Neutron flow. (p…

c72bc87

…ytorch#19796) ### Summary NXP backend: Enable `aten.upsample_nearest2d` with new Neutron flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

Arm backend: Fix VKML install bug for macOS. (pytorch#19612)

1494535

Change-Id: Id97fcb787369b62aecd4a0be27132ff4a0785fcf cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Michiel Olieslagers <michiel.olieslagers@arm.com>

Change python to python3 in shell script

b0441b5

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

TEST DO NOT MERGE: bloat program.cpp error string

04229ee

Synthetic +~190 byte change to verify the bloaty PR-comment workflow reports the regression correctly. Look for the [bloaty-ci-verify-string] marker — revert this commit before any real PR.

rascani force-pushed the test/bloaty-ci-verify branch from 1498925 to 04229ee Compare May 29, 2026 20:53

github-actions Bot added module: arm ciflow/trunk labels May 29, 2026

rascani added 5 commits May 29, 2026 15:55

rascani wants to merge 88 commits into
mainfrom
test/bloaty-ci-verify

20 participants
