
[DRAFT][CI-VERIFY] Bloaty size reports on arm jobs#1

Draft
rascani wants to merge 88 commits into main from test/bloaty-ci-verify

Conversation

@rascani
Owner

@rascani rascani commented May 23, 2026

Draft PR to verify the new bloaty PR-comment workflow added in
.github/workflows/bloaty-size-comment.yml. The synthetic +190-byte change
to runtime/executor/program.cpp ([bloaty-ci-verify-string]) should
produce a sticky comment showing the regression on both arm-bare_metal
and arm-zephyr-preset.

Do not merge. Will be force-pushed for additional test cases.

kirklandsign and others added 30 commits May 23, 2026 00:39
Differential Revision: D102880053

Pull Request resolved: pytorch#19211
Differential Revision: D106123930

Pull Request resolved: pytorch#19742
Differential Revision: D106162684

Pull Request resolved: pytorch#19749
### Summary
Add tests verifying correct support for add.tensor by the Neutron
backend using the new Neutron MLIR flow.

### Test plan
Unit tests provided.

cc @robert-kalmar
Treat BUCK and TARGETS files as build metadata in the Arm
pre-push license check so they do not need copyright headers.

Signed-off-by: Per Held <per.held@arm.com>
Change-Id: I4b3bbd1e03ba4b9c38fd06225156344985f0cc70
### Summary
Add tests verifying correct support for sub.tensor by the Neutron
backend using the new Neutron MLIR flow.

### Test plan
Unit tests provided.


cc @robert-kalmar @JakeStevens @digantdesai @rascani
…opy (pytorch#19751)

Follow-up to pytorch#17097, which added BF16 support to the TOSA GATHER op.
`aten.index_select` and `aten.unfold_copy` both lower via TOSA GATHER
but their support checks were not updated at the time.

In both decompositions (`DecomposeIndexSelectToGatherPass()` and
`DecomposeUnfoldToGatherPass()`),
the bf16 values tensor flows through dtype-agnostic reshape ops and
`tosa.GATHER`, which accepts `BF16`.
The support check was the only blocker.

| Op                  | bf16 before | bf16 after |
|---------------------|:-----------:|:----------:|
| `aten.gather`       | ✅          | ✅         |
| `aten.index.Tensor` | ✅          | ✅         |
| `aten.slice_copy`   | ✅          | ✅         |
| `aten.index_select` | ❌          | ✅         |
| `aten.unfold_copy`  | ❌          | ✅         |

Changes:
- `index_select_support.py`, `unfold_copy_support.py`: extend float branch
  to include `bfloat16`; add bf16 extension guard; update rejection message.
- `test_index_select.py`, `test_unfold_copy.py`: add isolated
  `_tosa_FP_bf16` test functions using
  `TosaPipelineFP(..., tosa_extensions=["bf16"])`.

### Test plan

`test_index_select_tosa_FP_bf16` and `test_unfold_copy_tosa_FP_bf16`
exercise the bf16 path end-to-end through `TosaPipelineFP` with the bf16
extension enabled, following the same pattern as the existing
`test_slice_tensor_tosa_FP_bf16` from pytorch#17492.
This is done for conv, depthwise conv, transpose conv, and bmm.

Add scratch tensors to the operator signatures, which are then
assigned exir.memory.alloc. These allocs are automatically memory
planned by ExecuTorch.
    
Introduce `required_cmsis_buffer_size`, which computes the buffer
size from node properties + the Cortex-M configuration.
The function uses functions registered by target in
backends/cortex_m/passes/scratch_buffer_sizes.py.
This is used to set the size of the allocs in ConvertToCortexMPass.
    
Finally, modify the kernels to use the new scratch tensor instead
of allocating temporary memory. Add a new macro
CORTEX_M_ENABLE_RUNTIME_CHECKS
to do a safety check that the AOT-computed buffer size is equal to the
buffer size computed at runtime. Use this when testing.


cc @psiddh @AdrianLundell @digantdesai @rascani @freddan80 @per @zingo
@oscarandersson8218 @mansnils @Sebastian-Larsson @robell

---------

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Co-authored-by: Måns Nilsson <mans.nilsson@arm.com>
…es (pytorch#19146)

### Summary
To enable GPU backend support in the Llama runner, refactoring is
required because the dtypes of kv_cache, attention_mask, and logits are
currently hardcoded, preventing floating‑point models from running.
This PR focuses on removing the hardcoded dtypes for them.

#### Key changes
- Remove template parameter <typename T> from KVManager, LhdTokenGenerator,
  MultimodalPromptProcessor, and related runner classes
- Detect kv_cache and attention_mask dtypes dynamically from MethodMeta at
  construction time instead of compile-time bitwidth detection
- Switch to std::byte* pointer arithmetic with getDtypeSize() for all buffer
  offsets; add fill_mask() helper for multi-dtype attention mask filling
- Update spec_prop pass for custom llama op for sharding case greater than 1


### Test plan
```
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_llama_stories_110m --model SM8650 --build_folder /local/mnt/workspace/chenweng/executorch/executorch/build-android  --device acfa9311 --executorch_root . --artifact_dir ./stories_110m_pte_size --llama_artifacts . --use_fp16
```
<img width="1977" height="468" alt="image"
src="https://github.com/user-attachments/assets/8bf3bffa-9b9f-4655-9cbc-b20127c2468a"
/>


cc @cccclai @cbilgin @abhinaykukkadapu
Summary: Pull Request resolved:
pytorch#19764

Reviewed By: kirklandsign

Differential Revision: D106332819
As documented at
https://vkdoc.net/man/VkDataGraphPipelineSessionBindPointRequirementARM
the sType member of VkDataGraphPipelineSessionBindPointRequirementARM should
always be set to
VK_STRUCTURE_TYPE_DATA_GRAPH_PIPELINE_SESSION_BIND_POINT_REQUIREMENT_ARM.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Enable CPPCHECK for Cortex-M sources and headers. The Cortex-M kernels
are registered through generated wrappers, so cppcheck cannot see
direct call sites for the exported *_out entry points and reports them
as unused. Keep narrow unusedFunction suppressions for those
registration-visible functions.

The scratch buffer context header is linted as a standalone header but
currently exposes helper API without in-tree call sites, so suppress
unusedFunction at file scope there instead of dropping Cortex-M header
coverage.

Keep the quantize and dequantize context parameters non-const to match
the generated kernel ABI; changing them to const changes the mangled
symbols used by registration.

Signed-off-by: Per Held <per.held@arm.com>

Change-Id: I3bcb6e5d3f125ae400005d1b033b24a07eb7924f
### Summary

It relates to pytorch#18833. It
doesn't add Yolo on baremetal, but it at least makes sure that it works
using Portable Kernels and XNNPACK backends.

### Test plan

It's only adding a model to CI, so the CI is the test plan.
Convert BenchmarkActivity, BenchmarkMetric, LlmBenchmark,
LlmModelRunner, and ModelRunner from Java to Kotlin.

Differential Revision: D106195816
…rch#19731)

### Summary 
Extend the Cortex-M cross-CPU build pipeline to Armv6-M by patching two
upstream issues that block the Corstone-300 target source and the CMSIS
Cortex DFP from building for `cortex-m0plus`:

* `core_platform/0003-*.patch` guards the `HardFault_Handler` in
`targets/corstone-300/target.cpp`. The handler uses an `ite eq` IT-block
in inline asm and dereferences the SCB CFSR/BFAR/MMFAR fault-status
registers; both are Armv7-M / Armv8-M Mainline only. The patch wraps the
rich handler in `__ARM_ARCH_7M__ / 7EM / 8M_MAIN / 8_1M_MAIN` and falls
back to a minimal stub on Armv6-M / Armv8-M Baseline (M0/M0+/M23).

* `core_software/0002-*.patch` fixes `cmsis.cmake`'s handling of the M0+
device. The Cortex DFP names the device directory and headers
`ARMCM0plus` (lowercase suffix), while the device sources
(`startup_ARMCM0plus.c`, `system_ARMCM0plus.c`) gate their
implementations on the `ARMCM0P` preprocessor macro — three different
spellings. The previous `string(TOUPPER ...)` produced `ARMCM0PLUS`: the
include path lookup failed and the source files hit their `#error device
not specified!` guard. Override `ARM_CPU` to `ARMCM0plus` for the
directory + filename and introduce a separate `CMSIS_DEVICE_CPU_DEFINE`
set to `ARMCM0P` for the cmsis_startup and cmsis_system
compile-definitions; all other cores still drive both paths from the
uppercased default.

Both patches are layered via the existing `patch_repo` mechanism; the
`corstone_utils.cmake` TODO is updated so the deletion plan for 0002 and
0003 is documented together.

### Test Plan
Locally validated end-to-end on the Corstone-300 FVP with the `qadd`
model: `cortex-m0plus` build links a runner that includes
`startup_ARMCM0plus.c` / `system_ARMCM0plus.c` and the patched
`target.cpp`, and the FVP run prints
`TEST: BundleIO index[0] Test_result: PASS` with all error stats zero.
The bundled `libcmsis-nn.a` reports `Tag_CPU_arch: v6S-M` and
`Tag_THUMB_ISA_use: Thumb-1` with zero DSP / MVE / saturating
instructions, confirming the scalar code path was exercised.

Authored with Claude.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Differential Revision: D106026285

Pull Request resolved: pytorch#19734
Differential Revision: D106394605

Pull Request resolved: pytorch#19775
Re-upload with BUCK changes.

Share TOSA RESIZE parameter validation between upsample support checks
and fake RESIZE lowering so invalid nearest and bilinear resize
parameters are rejected before delegation.


Change-Id: I57c267aca96d733879ae90329267e44adce399c6


cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Per Held <per.held@arm.com>
Differential Revision: D106408368

Pull Request resolved: pytorch#19783
### Summary
In pytorch#19651, I added a global
seed for pytest runs. This was intended to reduce random tolerance
flakes, but didn't actually do so in practice. This is because the
parallel test runners don't guarantee any ordering, so random state is
unstable between runs.

I've updated it to set the seed per-test. This should hopefully make the
random state invariant of test execution order.
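
A minimal sketch of the per-test seeding idea, not the code from this change. It assumes the seed is derived from a base seed plus the test's identifier; `BASE_SEED`, `seed_for_test`, and `run_test` are hypothetical names. In a real suite the reseed would typically live in an autouse pytest fixture keyed on `request.node.nodeid`.

```python
import hashlib
import random

BASE_SEED = 0  # hypothetical base seed; the value used by the real suite is not shown here


def seed_for_test(test_id: str, base_seed: int = BASE_SEED) -> int:
    """Derive a stable per-test seed from the base seed and the test identifier."""
    digest = hashlib.sha256(f"{base_seed}:{test_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def run_test(test_id: str) -> float:
    """Stand-in for a test body: reseed first, then draw a random value."""
    random.seed(seed_for_test(test_id))
    return random.random()


# The same "tests" executed in two different orders see identical random values,
# because each one reseeds from its own identifier rather than shared state.
order_a = {t: run_test(t) for t in ["test_add", "test_sub", "test_mul"]}
order_b = {t: run_test(t) for t in ["test_mul", "test_add", "test_sub"]}
```

Because every test reseeds from its own identifier, the values it draws no longer depend on which tests a parallel worker happened to run before it.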
Differential Revision: D106430647

Pull Request resolved: pytorch#19790
… GPU / CPU (pytorch#19252)

### Summary

CoreML decides at compile/load time which device each MIL operation will
execute on, and coremltools 9.0+ exposes that through `MLComputePlan`.
The recurring question on the issue tracker is *"why isn't my model
running fully on the ANE?"* — for example:

- pytorch#4091 — `llama model is not fully lowered to ANE`
- pytorch#11541 — `CoreML model is crashing on iPhone GPU, but not on iPhone
CPU or macOS GPU`
- pytorch#8439 — `ANE compile OOMs on certain input shapes`
- pytorch#8445 — `CPU Overhead After ANE Execution`

Today the only way for an ExecuTorch user to answer it is to break out
Swift / Xcode.  This PR adds a Python wrapper around `MLComputePlan` so
the answer is one shell command:

```
$ python coreml_compute_plan.py --model_path my_model.mlpackage \
      --compute_units cpu_and_ne --show_non_ane

=== my_model.mlpackage ===
  ANE:   412 / 480 ( 85.8%)
  CPU:    68 / 480 ( 14.2%)

  Non-ANE op types:
       32  ios17.cast
       18  ios17.gather
       12  ios17.reshape
        6  ios17.constexpr_blockwise_shift_scale
```

Inputs supported:

| Input | Behavior |
|---|---|
| `.pte` | Extract every Core ML partition into a tempdir, then analyze each. |
| `.mlpackage` | Compile to `.mlmodelc` in a tempdir, then analyze. |
| `.mlmodelc` | Analyze directly. |

The PTE path reuses the same JSON/named-data extraction logic that
`extract_coreml_models.py` uses, and is inlined into the script so it
can
be run against a plain CoreML model without depending on the executorch
package.

### Test plan

Added `test_coreml_compute_plan.py` covering:

- `_device_name(...)` for `None` and a stub
`MLNeuralEngineComputeDevice`.
- `_COMPUTE_UNIT_CHOICES` mapping (`cpu_and_ne` / `all`).
- `analyze_one(...)` end-to-end on a tiny `relu(x @ x.T) + x.sum()`
  mlpackage built with `coremltools.convert(...)`: returns rows for
  every dispatched op, with a `main` function and the expected MIL op
  types (`matmul`, `relu`, `add`, `reduce_sum`).

```
$ python -m pytest examples/apple/coreml/scripts/test_coreml_compute_plan.py -v
============================== 7 passed in 3.68s ===============================
```

I also ran the script against a few hand-built `.mlpackage` and
`.mlmodelc` files on macOS 26 with coremltools 9.0 and verified the
output matches what `MLComputePlan` returns directly.

Authored with Claude.

cc @kimishpatel @YifanShenSZ @cymbalrush @metascroy
Differential Revision: D106412035

Pull Request resolved: pytorch#19777
AdrianLundell and others added 22 commits May 28, 2026 18:41
Adds a simple pass for replacing single Aten ops with corresponding
dialect ops to be reused across multiple backends.

Signed-off-by: Adrian Lundell <adrian.lundell@arm.com>
Differential Revision: D106575515

Pull Request resolved: pytorch#19831
Enable loading GGUF files (e.g. Q4_K_M) and exporting to the MLX
backend. Three areas of change:

GGUF loader (gguf_loader.py):
- Add MLX backend support alongside CUDA
- Keep embedding quantized for MLX (QuantizedEmbeddingHandler supports
  quantized gather natively, unlike CUDA's Int4Tensor)
- Fix stale docstring references to Int4TilePackedTo4dTensor/tinygemm

MLX backend (op_helpers.py, patterns.py):
- Accept group_size=16 in parse_dequant_node for GGUF Q6_K tensors
- For group_size < 32, emit DequantizeNode + TransposeNode + AddmmNode
  instead of QuantizedMatmulNode, since MLX Metal kernels are only
  instantiated for group_size >= 32. Weights stay packed as int8 in the
  .pte file and are dequantized on-device at runtime — same strategy
  CUDA/Inductor uses (separate Triton dequant + cuBLAS mm).

Packer (pack_mlx.py):
- Add 16 to supported group sizes so Q6_K IntxUnpackedToInt8Tensor
  passes through to export unchanged

Tests (test_ops.py):
- Add group_size=16 configs for int8, int4, and no-bias variants


Test Plan: 

Export and run this model


https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/blob/main/gemma-4-31B-it-Q4_K_M.gguf

On M1 32GB machine (exported on Linux A100)

```
(executorch_dev) mnachin@mnachin-mbp executorch % ./cmake-out/examples/models/gemma4_31b/gemma4_31b_runner \
    --model_path  /Users/mnachin/repos/models/gemma-4-31B-it-GGUF/model.pte \
    --tokenizer_path /Users/mnachin/repos/models/gemma-4-31B-it-HQQ-INT4/tokenizer.json \
    --prompt "Tell me a joke about RAM usage" \
    --max_new_tokens 128 \
    --temperature 0.8
I tokenizers:regex.cpp:27] Registering override fallback regex
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1779926968.603672 54889180 re2.cc:237] Error parsing '((\<pad\>|ool\|\>1\x00\x00\
                                                                                             �\<t|respo|\<tool_call\|\>|\<bos\>|\<\|tool_response\>|\<\|think\|\>|\x0...': invalid UTF-8
I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((\<pad\>|ool\|\>1\x00\x00\
                                                                                       �\<t|respo|\<tool_call\|\>|\<bos\>|\<\|tool_response\>|\<\|think\|\>|\x00\x00\\\<|\<tool_response\|\>|\<mask\>|\<\|\"\|\>|all\|\>j\x00\x00\\|\<channel\|\>|\<\|turn\>|\<turn\|\>|\<\|image\>|\<\|$
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:pcre2_regex.cpp:48] PCRE2 UTF-8 validation failed at offset 27: UTF-8 error: byte 2 top bits not 0x80. Retrying without UTF flags.
Loading model...
Prompt tokens: 23
Why did the computer go to therapy?

Because it had too many **unresolved dependencies** and it just couldn't stop **dwelling on the past**... but it forgot everything the moment it took a nap.<turn|>
PyTorchObserver {"prefill_token_per_sec":2.49539,"decode_token_per_sec":0.0880671,"prompt_tokens":23,"generated_tokens":44,"model_load_start_ms":1779926968052,"model_load_end_ms":1779926982494,"inference_start_ms":1779926982497,"inference_end_ms":1779927491333,"prompt_eval_end_ms":1779926991714,"first_token_ms":1779926991714,"aggregate_sampling_time_ms":0,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```

For reference, here's this model:
https://huggingface.co/SocialLocalMobile/Qwen3.5-35B-A3B-HQQ-INT4

```
(executorch_dev) mnachin@mnachin-mbp executorch % ./cmake-out/examples/models/gemma4_31b/gemma4_31b_runner \
    --model_path  /Users/mnachin/repos/models/gemma-4-31B-it-HQQ-INT4/model.pte \
    --tokenizer_path /Users/mnachin/repos/models/gemma-4-31B-it-HQQ-INT4/tokenizer.json \
    --prompt "Tell me a joke about RAM usage" \
    --max_new_tokens 128 \
    --temperature 0.8
I tokenizers:regex.cpp:27] Registering override fallback regex
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1779927592.109382 54914733 re2.cc:237] Error parsing '((\<pad\>|ool\|\>1\x00\x00\
                                                                                             �\<t|respo|\<tool_call\|\>|\<bos\>|\<\|tool_response\>|\<\|think\|\>|\x0...': invalid UTF-8
I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((\<pad\>|ool\|\>1\x00\x00\
                                                                                       �\<t|respo|\<tool_call\|\>|\<bos\>|\<\|tool_response\>|\<\|think\|\>|\x00\x00\\\<|\<tool_response\|\>|\<mask\>|\<\|\"\|\>|all\|\>j\x00\x00\\|\<channel\|\>|\<\|turn\>|\<turn\|\>|\<\|image\>|\<\|$
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:pcre2_regex.cpp:48] PCRE2 UTF-8 validation failed at offset 27: UTF-8 error: byte 2 top bits not 0x80. Retrying without UTF flags.
Loading model...
Prompt tokens: 23
Why did the computer go to therapy?

Because it had too many **unresolved dependencies** and couldn't stop **dwelling on the past**, but it still couldn't remember why it was there.

***

Alternatively, a shorter one:

**Why was the RAM so stressed?**
Because it had too much on its mind, but it knew that as soon as it slept, it would forget everything.<turn|>
PyTorchObserver {"prefill_token_per_sec":9.11975,"decode_token_per_sec":5.24998,"prompt_tokens":23,"generated_tokens":86,"model_load_start_ms":1779927591719,"model_load_end_ms":1779927603575,"inference_start_ms":1779927603579,"inference_end_ms":1779927622482,"prompt_eval_end_ms":1779927606101,"first_token_ms":1779927606101,"aggregate_sampling_time_ms":0,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```

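There's definitely performance degradation when running GGUF: in the runs
above, decode drops from about 5.25 to about 0.09 tokens/s and prefill from
about 9.1 to about 2.5 tokens/s.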
Adds two new Android instrumentation test suites covering previously
untested API surfaces, completing feature testing coverage for OKR 3.2.

AsrModuleInstrumentationTest (18 tests): constructor validation,
lifecycle (close idempotency, use-after-close), transcribe validation,
and AsrTranscribeConfig builder/validation.

LlmLoraInstrumentationTest (13 tests): dataFiles constructor variants,
LlmModuleConfig with dataPath, invalid data file error handling,
baseline equivalence, and config builder validation.

  ## Test plan
  - [x] `./gradlew :executorch_android:connectedAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=org.pytorch.executorch.AsrModuleInstrumentationTest`
  - [x] `./gradlew :executorch_android:connectedAndroidTest -Pandroid.testInstrumentationRunnerArguments.class=org.pytorch.executorch.LlmLoraInstrumentationTest`
  - [x] Verify all 31 new tests pass on emulator (API 34 x86_64)
  - [x] Verify existing tests are unaffected
Differential Revision: D105728137

Pull Request resolved: pytorch#19724
…pytorch#19793)

### Summary
Enable `aten.upsample_bilinear2d` with new Neutron flow.

### Test plan
Unit tests provided.


cc @robert-kalmar @JakeStevens @digantdesai @rascani
…ytorch#19796)

### Summary
NXP backend: Enable `aten.upsample_nearest2d` with new Neutron flow.

### Test plan
Unit tests provided.


cc @robert-kalmar @JakeStevens @digantdesai @rascani
`logger.level` was used to determine whether to
add the partition_report.txt FileHandler to the logger. This value is
not set by `logging.basicConfig`
and defaults to 0 (NOTSET). This caused empty reports to be output when an
intermediate path was set and the logging level was above INFO.

Instead, use `.getEffectiveLevel()`.
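
The difference between the two attributes can be sketched with the standard library alone. The logger name and the `should_attach_report_handler` helper are hypothetical, and the exact condition used in the Arm backend is an assumption.

```python
import logging

# Set the level on the root logger only, which is also where
# logging.basicConfig(level=...) puts it.
logging.getLogger().setLevel(logging.INFO)

# A named logger that never had setLevel() called on it (hypothetical name).
logger = logging.getLogger("partition_report_demo")

own_level = logger.level                      # logging.NOTSET (0): nothing set on this logger
effective_level = logger.getEffectiveLevel()  # inherited from the root logger


def should_attach_report_handler(log: logging.Logger) -> bool:
    """Attach the report handler only if INFO records would actually be emitted."""
    return log.getEffectiveLevel() <= logging.INFO
```

A check written against `logger.level` always sees 0 here, so it would attach the handler even when the effective level filters out every INFO record, which matches the empty-report symptom described above.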

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
* Add layers that run in BF16 in the HF model

Change-Id: If75434db138059f3a433a70abda3f3e26f6dd3b6

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

---------

Signed-off-by: Tom Allsop <tom.allsop@arm.com>
This is stacked on top of
pytorch#19029
- make non-KV-cache example inputs match the static export window
- fix PT2E calibration flow for padded prefixes
  and optional LM-Eval tasks
- update SmolLM2 export settings used by the VGF PT2E workflow
- Fix rope_theta in 135M_config.json to align with the Hugging Face
  model config

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Xingguo Li <xingguo.li@arm.com>
Co-authored-by: Zingo Andersen <zingo.andersen@arm.com>
Change-Id: Id97fcb787369b62aecd4a0be27132ff4a0785fcf

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Michiel Olieslagers <michiel.olieslagers@arm.com>
…h#19839)

Add ArmPass.should_run_pass() as a reusable early-exit hook before
  call() starts the normal ExportPass retracing path. The default hook
  returns true, preserving existing behavior for ArmPass subclasses.

  Introduce ArmOpTargetedPass for passes that only transform a known
  set of operator targets. It implements should_run_pass() by scanning
  the current graph and nested GraphModules for matching target
  operators. If no matching target operator is found, the pass returns
  an unmodified PassResult.

  For passes that already gate transformations with
  allowed_to_transform(), allow the target pre-scan to apply the same
  check before deciding whether the pass needs to run. This avoids
  running TFA passes when all matching target nodes are marked as
  disallowed.

  The should_run_pass() hook and ArmOpTargetedPass pre-scan avoid
  rebuilding graphs for decomposition and rewrite passes that cannot
  affect the current graph. The speedup is most visible on large models.
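
  The early-exit idea can be illustrated with a toy model. This is a sketch
  under simplifying assumptions, not the ArmPass code: `Node`, `Graph`, and
  `OpTargetedPass` are hypothetical stand-ins for FX nodes, GraphModules, and
  ArmOpTargetedPass, and the `allowed` flag stands in for the
  allowed_to_transform() verdict.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    target: str
    allowed: bool = True                # stand-in for an allowed_to_transform() verdict
    subgraph: Optional["Graph"] = None  # stand-in for a nested GraphModule


@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)


class OpTargetedPass:
    """Toy pass that only rewrites a known set of operator targets."""

    target_ops = frozenset({"aten.index_select"})  # hypothetical target set

    def __init__(self) -> None:
        self.expensive_runs = 0  # counts how often the costly rewrite path ran

    def should_run_pass(self, graph: Graph) -> bool:
        """Cheap pre-scan: is there any allowed target node, including nested graphs?"""
        for node in graph.nodes:
            if node.target in self.target_ops and node.allowed:
                return True
            if node.subgraph is not None and self.should_run_pass(node.subgraph):
                return True
        return False

    def call(self, graph: Graph) -> bool:
        """Return True if the (simulated) rewrite ran, False on early exit."""
        if not self.should_run_pass(graph):
            return False  # graph is returned unmodified, no retracing
        self.expensive_runs += 1  # the real pass would retrace the graph here
        return True
```

  The pre-scan is linear in the number of nodes, so it is cheap compared with
  a retrace whenever the pass has nothing to do.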

  Single-run paired benchmarks on Arm backend model tests
  across FP32, INT, VGF no-quant, and VGF quant variants:

  | Model       | E2E avg | Pass-manager avg |
  |-------------|--------:|-----------------:|
  | T5-small    | +30.5%  | +47.5%           |
  | DeepLabV3   | +12.9%  | +49.8%           |
  | Wav2Letter  | +16.9%  | +51.2%           |
  | InceptionV3 | +22.2%  | +46.5%           |
  | MobileNetV2 | +22.2%  | +52.5%           |
  | MobileNetV3 | +29.9%  | +54.6%           |

  Model rows are unweighted averages over successful variants.
  Unweighted average across 23 successful model/target variants:
  E2E speedup: +22.4%
  Pass-manager speedup: +50.5%

Change-Id: Iaa09638473a1d6d1e2ce98f5a0e3fc3a14378143


cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

Signed-off-by: Yufeng Shi <yufeng.shi@arm.com>
Co-authored-by: Erik Lundell <erik.lundell@arm.com>
- Export & lower the smollm2 via extensions/llm/export_llm
- Build the arm_executor_runner application
- Fix the propagation of select_ops_list in the CMakeLists.txt
- Test the application runs on FVP in fast mode

Signed-off-by: George Gekov <george.gekov@arm.com>
Change-Id: I8acd87c2f5c3e6b5b189bb987ceccfe4877e2254
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Summary:
Currently, __builtin_FUNCTION is used opportunistically if it exists.


However, for heavily templated code, this results in extremely long
strings, which add to .rodata and can be wasteful on embedded targets.


This commit adds an override which uses the shorter __FUNCTION__ even if
__builtin_FUNCTION exists and exposes it as a BUCK constraint.

Integration into CMake is intentionally left out for now.

Differential Revision: D106668077
…ytorch#19834)

Summary:

The current approach uses __FILE__ and opportunistically trims it if the
utility is available.

However, the long name is still stored in .rodata.

This can contribute some memory on embedded platforms.


Instead, first try __FILE_NAME__.

Differential Revision: D106587633
Summary:

ghstack 0.15.0 changed the header URL in PR bodies from
`Stack from [ghstack](https://github.com/ezyang/ghstack)` to
`Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)`.

The exact string match in `propose_ghstack_orig_pr.py` no longer matched,
causing every ghstack_land workflow run to fail since May 14. Use
`startswith("Stack from [ghstack]")` instead to be resilient to URL changes.

Test Plan:

Verified the new pattern matches both the old format
(`https://github.com/ezyang/ghstack`) and the new format
(`https://github.com/ezyang/ghstack/tree/0.15.0`).
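
A small sketch of the two matching strategies, assuming the check is applied line by line to the PR body. `matches_exact` and `matches_prefix` are hypothetical helpers and are not taken from `propose_ghstack_orig_pr.py`; the trailing text in the sample bodies is also illustrative.

```python
OLD_HEADER = "Stack from [ghstack](https://github.com/ezyang/ghstack)"
NEW_HEADER = "Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)"
GHSTACK_PREFIX = "Stack from [ghstack]"


def matches_exact(body: str) -> bool:
    """Brittle check: requires the pre-0.15.0 header text verbatim."""
    return OLD_HEADER in body


def matches_prefix(body: str) -> bool:
    """Resilient check: any line starting with the stable prefix counts."""
    return any(line.startswith(GHSTACK_PREFIX) for line in body.splitlines())


# Illustrative PR bodies; the text after the header is made up for the example.
old_body = OLD_HEADER + " (oldest at bottom):\n* #2\n* #1"
new_body = NEW_HEADER + " (oldest at bottom):\n* #2\n* #1"
```

The prefix check ignores the URL entirely, so a future change to the link target would not break detection again.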

This PR was authored with the help of Claude.

Reviewers:
Pull Request resolved: pytorch#19867

Some environments preserve stale failure state when tests are reported through unittest skip results. This switches currently disabled Vulkan delegate coverage to a local decorator so those tests stay discoverable, log their disabled reason, and produce an executed result.

ghstack-source-id: 387629544
@exported-using-ghexport

Differential Revision: [D106732141](https://our.internmc.facebook.com/intern/diff/D106732141/)
Applies the same disabled-test treatment as the prior diffs in this stack to the devtools inspector tests. Some test runners preserve stale failure state when tests report through unittest skip results, so this replaces the conditionally disabled coverage with a local decorator that keeps the tests discoverable, logs their disabled reason, and produces an executed result.

Adds a disable_if decorator that mirrors unittest.skipIf (evaluating the condition at decoration time) and converts the three Windows-gated test cases to use it.
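
One way such a decorator could look, as a sketch rather than the code in this diff: the name `disable_if` comes from the description above, while the signature, the logging call, and `ExampleTest` are assumptions.

```python
import functools
import logging
import unittest

logger = logging.getLogger(__name__)


def disable_if(condition: bool, reason: str):
    """Like unittest.skipIf, but a disabled test still executes and passes.

    `condition` is evaluated once, when the decorator is applied.
    """

    def decorator(test_fn):
        if not condition:
            return test_fn  # leave the test untouched

        @functools.wraps(test_fn)
        def disabled(*args, **kwargs):
            # Log why the body is not run, then return normally so the runner
            # records an executed result instead of a skip.
            logger.warning("Test %s disabled: %s", test_fn.__name__, reason)

        return disabled

    return decorator


class ExampleTest(unittest.TestCase):
    @disable_if(True, "hypothetical: not supported on this platform")
    def test_disabled(self):
        self.fail("body must not run while disabled")

    @disable_if(False, "never disabled")
    def test_enabled(self):
        self.assertEqual(1 + 1, 2)


def run_example() -> unittest.TestResult:
    """Run ExampleTest and return the collected result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Both tests are discovered and counted as run, and none is reported as skipped, which is the property the stale-failure-state workaround relies on.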

Differential Revision: [D106736354](https://our.internmc.facebook.com/intern/diff/D106736354/)


ghstack-source-id: 387629542
Pull-Request: pytorch#19874
The test-arm-cortex-m-size-test job now builds the PR's merge base in
addition to head, runs bloaty against both, and uploads a per-leg
artifact. A new workflow_run-triggered workflow downloads the artifacts
and posts a sticky PR comment with per-segment, per-section, and
per-bucket deltas plus the top-5 symbols by Δ.

Reporting is best-effort and never fails the size job. Existing
threshold gates are unchanged. The custom bloaty data source in
test/bloaty/executorch.bloaty groups demangled symbols into
runtime/extension/backends/kernels/etc buckets so the diff is readable.
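
The bucketing idea can be sketched in Python as first-match-wins regex rules followed by a per-bucket delta. The patterns and symbol names below are hypothetical; the real rules live in test/bloaty/executorch.bloaty and are not reproduced here.

```python
import re
from collections import defaultdict

# Hypothetical (pattern, bucket) rules, checked in order; first match wins.
BUCKET_RULES = [
    (re.compile(r"^executorch::runtime::"), "runtime"),
    (re.compile(r"^executorch::extension::"), "extension"),
    (re.compile(r"^executorch::backends::"), "backends"),
    (re.compile(r"^torch::executor::native::"), "kernels"),
]


def bucket_for(symbol: str) -> str:
    """Map a demangled symbol to its bucket, or 'other' if no rule matches."""
    for pattern, bucket in BUCKET_RULES:
        if pattern.search(symbol):
            return bucket
    return "other"


def bucket_sizes(symbol_sizes: dict) -> dict:
    """Sum symbol sizes (bytes) per bucket."""
    totals = defaultdict(int)
    for symbol, size in symbol_sizes.items():
        totals[bucket_for(symbol)] += size
    return dict(totals)


def bucket_deltas(base: dict, head: dict) -> dict:
    """Per-bucket size change in bytes (head minus base)."""
    base_totals, head_totals = bucket_sizes(base), bucket_sizes(head)
    buckets = sorted(set(base_totals) | set(head_totals))
    return {b: head_totals.get(b, 0) - base_totals.get(b, 0) for b in buckets}
```

With a handful of buckets, a 190-byte change shows up as one row rather than being spread across many long demangled names.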

Drafted with Claude.
Synthetic +~190 byte change to verify the bloaty PR-comment workflow
reports the regression correctly. Look for the [bloaty-ci-verify-string]
marker — revert this commit before any real PR.
rascani added 5 commits May 29, 2026 15:55
Three fixes from the first CI run on PR pytorch#19888:

1. Base build was failing because the worktree was at /tmp/base-worktree,
   but CMakeLists.txt:420 requires the repo dir to be named exactly
   `executorch`. Worktree is now /tmp/bloaty-base/executorch.

2. bloaty isn't in apt on the executorch-ubuntu-22.04-arm-sdk image.
   Install via conda-forge instead (conda is already in PATH).

3. `set -e` inside a subshell doesn't fire when the subshell itself is
   on the left of `||` (per bash spec). Replaced with explicit
   `|| exit 1` after each critical command in the bloaty subshell, so
   a python crash actually aborts before the artifact-upload mv runs.
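
The `set -e` behavior in fix 3 can be reproduced with two one-line scripts, driven here from Python. This is an illustrative sketch, not the CI script: `false` stands in for a crashing command, the echoed strings are made up, and `run_bash` returns `None` when no `bash` is on PATH.

```python
import shutil
import subprocess

# `false` stands in for a failing critical command inside the subshell.
WITH_SET_E = '( set -e; false; echo "reached upload step" ) || echo "fallback ran"'
WITH_EXPLICIT_EXIT = '( false || exit 1; echo "reached upload step" ) || echo "fallback ran"'


def run_bash(script: str):
    """Run a snippet with bash and return its stdout, or None if bash is unavailable."""
    bash = shutil.which("bash")
    if bash is None:
        return None
    done = subprocess.run([bash, "-c", script], capture_output=True, text=True, check=False)
    return done.stdout.strip()


if __name__ == "__main__":
    # First script: -e is ignored because the subshell is the left operand of ||,
    # so execution continues past the failure.
    print(run_bash(WITH_SET_E))
    # Second script: the explicit `|| exit 1` aborts the subshell at the failure.
    print(run_bash(WITH_EXPLICIT_EXIT))
```

In the first script the later step still runs after the failure; in the second the subshell stops immediately and the fallback on the right of `||` takes over.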

Also added workflow_dispatch to bloaty-size-comment.yml so the comment
poster can be invoked manually for verification before the workflow
lands on the default branch.
…oaty

Two more fixes from the second CI run:

1. `git worktree add` fails on the base SHA because actions/checkout's
   shallow fetch only has the head commit. Explicitly `git fetch --depth=1
   origin <base_sha>` before the worktree add.

2. Conda-forge's bloaty links against a newer libstdc++ than the docker
   image ships, so `bloaty --version` fails with CXXABI_1.3.15 not found.
   Prepend the conda env's lib dir to LD_LIBRARY_PATH so bloaty finds
   its own libstdc++ before the system one.
LD_LIBRARY_PATH override didn't help because the main conda env doesn't
ship a libstdc++.so.6, so the linker still picked the system one (too
old for conda-forge bloaty: CXXABI_1.3.15 missing).

Install bloaty into a dedicated conda env alongside libstdcxx-ng, then
invoke via `conda run --no-capture-output -p <env> bloaty`, which sets
LD_LIBRARY_PATH correctly. Pass the resolved command to bloaty_diff.py
via a new BLOATY env var.
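
A sketch of how a script might consume such a variable, assuming `BLOATY` holds a full shell-style command line. `resolve_bloaty_command` and `build_bloaty_argv` are hypothetical helpers, not functions from bloaty_diff.py, and the env path in the usage note is made up.

```python
import os
import shlex


def resolve_bloaty_command(environ=None) -> list:
    """Split the BLOATY env var into argv form, defaulting to plain `bloaty`."""
    env = os.environ if environ is None else environ
    return shlex.split(env.get("BLOATY", "bloaty"))


def build_bloaty_argv(binary_path: str, environ=None, extra_args=("-n", "0")) -> list:
    """Compose the full command: resolved launcher, extra flags, then the binary."""
    return [*resolve_bloaty_command(environ), *extra_args, binary_path]
```

For example, with `BLOATY="conda run --no-capture-output -p /opt/bloaty-env bloaty"` the resulting argv starts with the `conda run` wrapper, so the caller does not need to know how bloaty was installed.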
First successful run on PR pytorch#19888 revealed two real issues:

1. bloaty defaults to -n 20 even without a flag, so metadata.json only
   contained the top 20 symbols + an aggregated [N Others] row. Diffing
   the resulting capped lists would produce phantom regressions when
   any symbol crossed the cutoff. -n 0 means unlimited.

2. The arm size_test binary's actual contents on bare-metal:
   - ~40% newlib stdio internals (_vfprintf_r, _svfprintf_r, _dtoa_r)
   - ~10% libsupc++ C++ demangler (d_print_*, cplus_demangle_*)
   - ~10% C++ unwind personality + section-level debug entries
   None of these matched our regexes. Added patterns so they bucket
   into libc/stdlib/metadata correctly.
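
The phantom-regression effect from issue 1 can be shown with a toy example. This is a sketch with made-up symbol names and sizes; `top_n_with_others` only imitates the row cap, and the `[Others]` label is a placeholder for bloaty's aggregated row.

```python
def top_n_with_others(sizes: dict, n: int) -> dict:
    """Keep the n largest symbols and fold the rest into one '[Others]' row."""
    ranked = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    if n <= 0 or len(ranked) <= n:  # n == 0 imitates "-n 0": no cap
        return dict(ranked)
    kept = dict(ranked[:n])
    kept["[Others]"] = sum(size for _, size in ranked[n:])
    return kept


def diff(base: dict, head: dict) -> dict:
    """Per-row size delta (head minus base), omitting unchanged rows."""
    rows = set(base) | set(head)
    deltas = {row: head.get(row, 0) - base.get(row, 0) for row in rows}
    return {row: delta for row, delta in deltas.items() if delta != 0}


base = {"a": 100, "b": 90, "c": 80}
head = {"a": 100, "b": 90, "c": 95}  # only "c" grew, by 15 bytes

capped = diff(top_n_with_others(base, 2), top_n_with_others(head, 2))
uncapped = diff(top_n_with_others(base, 0), top_n_with_others(head, 0))
# capped   -> b: -90, c: +95, [Others]: +10  (phantom rows from crossing the cutoff)
# uncapped -> c: +15
```

The capped deltas still sum to +15, but the individual rows are misleading because `b` and `c` swapped sides of the cutoff.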
The previous test bloater (ET_LOG format string) didn't survive Release
mode because ET_LOG_ENABLED=0 compiles log strings out entirely.

This version adds a ~265-byte string in .rodata, referenced through a
volatile static pointer to defeat the optimizer, inside a function
(get_execution_plan) that size_test actually links. Verified locally
that the string survives -Os + strip.

Look for [bloaty-ci-verify-string] in the bloaty PR comment.