[IP+NF] modular SSA optimizer + split Thumb backend + RP2350 fixes #7

Merged: matgla merged 62 commits into mob from fixesForYasos2, Jun 25, 2026
@matgla matgla commented Jun 25, 2026


Rewrite ir/opt into modular SSA passes, split arm-thumb gen into per-insn thop_* units, fix -O1/-O2 on-target crashes, add unit/asm tests:

  • Optimizer: split monolithic ir/opt.c into ~50 SSA passes — constprop, copyprop, DCE, GVN, SCCP, LICM, loop opts/reroll, known-bits/VRP, reassoc, strength reduction, load-CSE/memory, branch fold, const-aggregate, bitfield, fusion, pack64, switch-to-data — driven by new opt_engine/opt_pipeline.
  • IR infra: add cfg.c, ssa.c, def-use (opt_du), opt_hash, and a standalone regalloc.c.
  • ARM backend: modularize arch/arm/thumb/thop_* per instruction class; add arm_regalloc and ssa_opt_arm.
  • On-target fixes: VRP 74KB stack-frame blowup, packed-IROperand STRD unaligned fault, reassoc use-record corruption, MLA/UMULL/SMULL dest-clobbering scratch, tcc_mallocz memset restore (self-host -O2 crash).
  • Tests: new tests/unit/arm/armv8m C unit suites, thumb asm-encode tests, tests2 additions.

matgla and others added 29 commits May 16, 2026 19:53
… scratch

Two independent self-host miscompile root causes:

1. ssa_opt_reassoc.c: reassoc_binary removed the still-live inner
   instruction's operand use record when folding (x OP c1) OP c2.
   GVN could then CSE the outer back onto the live inner and drop the
   operand's use count to zero, letting SSA-DCE delete the operand's
   def (e.g. a SELECT) out from under live code. Broke
   tcc_yaff_write_data_relocations (section=CODE instead of DATA) →
   bench_strlen_scan, bug_const_ptr_got_deref, mibench_qsort,
   mibench_stringsearch.

2. arm-thumb-gen.c: mla/umull/smull _mop handlers did not pre-exclude
   the pre-allocated destination register from scratch allocation, so a
   source load could pick it as a saved (push/pop) scratch and the
   restoring pop clobbered the just-computed result. Broke
   find_nested_func_by_sym (returned sym instead of &nested_funcs[i])
   → trampoline never emitted → nested_funcptr* HardFault.
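The second root cause can be modelled in a few lines. This is only a sketch under stated assumptions: `pick_scratch`, `pick_mla_scratch`, and the bitmask representation are hypothetical and do not mirror the actual handlers in arm-thumb-gen.c. It shows why the destination has to be in the exclusion set before a scratch register is chosen.

```c
#include <stdint.h>

/* Hypothetical model: bit i of a mask stands for register r<i>. */
#define REG_BIT(r) (1u << (r))

/* Return the lowest-numbered register in r0..r12 that is not in
 * exclude_mask, or -1 if none is available. */
static int pick_scratch(uint32_t exclude_mask)
{
    for (int r = 0; r <= 12; r++) {
        if (!(exclude_mask & REG_BIT(r)))
            return r;
    }
    return -1;
}

/* Choose a scratch for a multiply-accumulate style op.
 * The destination is added to the exclusion set up front, so a
 * push/pop-saved scratch can never alias the result register. */
static int pick_mla_scratch(int dest, uint32_t live_sources)
{
    return pick_scratch(live_sources | REG_BIT(dest));
}
```

With `dest = r0` and sources in r1 and r2, `pick_scratch` called with only the source mask returns r0, which is the aliasing described above; `pick_mla_scratch` returns r3 instead.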

Full QEMU smoke suite: 424 passed, 0 failed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fix four gcc-torture failure classes (892 -> ~14 failures)

1. ir/opt_dce.c: dead_var_store_elim's read scan missed the MLA
   accumulator operand, so after mla-fusion it deleted the def of a VAR
   still read as an accumulator. In the self-hosted compiler this killed
   `ar_index = data + entrysize` in create_archive_sym_cache, making the
   archive symbol table load from the saved-r4 stack slot — every link
   touching armv8m-libtcc1.a failed with "invalid archive" (850 tests).

2. tccgen.c: two late frame-shrink passes ran after the variadic
   `loc = -28` guards and re-shrank the frame (the prologue-managed
   va_area never appears as an IR STACKOFF). The va descriptor then sat
   below SP and the va_start helper's own push{r3,r4} clobbered it with
   the caller's 4th argument register. Re-clamp after both shrinks.

3. arm-thumb-gen.c + ir/codegen.c + tcc.h: the nested-call R9/arg-reg
   save area was addressed SP-relative; once a VLA/alloca moved SP the
   slots landed inside the user's buffer and the callee's writes
   corrupted the saved GOT base. New per-function func_dynamic_sp flag
   (set on VLA_ALLOC) switches the save area to FP-relative addressing.

4. tccgen.c: __builtin_setjmp/__builtin_longjmp now emit the NL
   (non-local-goto) IR ops. The 3-word variant restored only FP/SP/PC,
   so code resumed after longjmp with the longjmp caller's r4-r11
   (including register-allocated locals and the r9 GOT-base protocol).
   The NL buffer (40 bytes) restores the full callee-saved file.

QEMU smoke incl. gcc torture: 4153 passed / 14 failed (from 892 failed).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fix alloca routing, __builtin_setjmp 5-word ABI, and soft dadd rounding

Three gcc-torture failure classes:

- tccgen.c: '#ifdef TOK_alloca' was always false (TOK_* are enum
  constants, not macros), so plain alloca() calls bypassed the
  VLA_ALLOC builtin and called lib/alloca.S, which moves SP behind the
  backend's back; the SP-relative per-call R9 save area then reloaded
  garbage from inside the alloca'd buffer. Use the real target guard so
  alloca routes through unary_builtin_alloca (fixes 20020314-1,
  941202-1, pr22061-1).
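The `#ifdef` pitfall in the first item is easy to reproduce in isolation. The sketch below uses a made-up enumerator, `TOK_demo_alloca`, rather than the real token table; it only illustrates that the preprocessor cannot see enum constants.

```c
/* Hypothetical stand-in for a token enum; not the real tcc token table. */
enum { TOK_demo_alloca = 42 };

/* The preprocessor knows nothing about enumerators, so the #ifdef
 * branch is skipped and the #else branch is taken. */
#ifdef TOK_demo_alloca
#define SEEN_BY_IFDEF 1
#else
#define SEEN_BY_IFDEF 0
#endif

/* The compiler proper does see the enumerator. */
static int seen_by_compiler(void)
{
    return TOK_demo_alloca == 42;
}
```

Here `SEEN_BY_IFDEF` expands to 0 even though `TOK_demo_alloca` is a valid identifier at compile time, which is why a guard of this form silently disables the code it wraps.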

- __builtin_setjmp/__builtin_longjmp: GCC's ABI gives the builtin a
  5-WORD buffer and pr84521 passes exactly void *buf[5]; the previous
  NL_SETJMP routing wrote 40 bytes and smashed the caller's stack.
  TCCIR_OP_SETJMP now saves r4-r11 into a hidden 32-byte frame area
  (src2, FRAME_ADDR operand) and stores only FP/resume/SP/&area in the
  buffer; LONGJMP restores the register file via buf[3]. NL_* keeps its
  layout for the nested-function non-local-goto path (fixes pr84521 at
  -O0 and -O1; built-in-setjmp, pr86528, 20021113-1, 20020412-1 still
  pass).

- lib/fp/soft/dadd.c: __aeabi_dadd had no guard/round/sticky bits, so
  aligning the smaller operand just truncated it (1 + -2^53 returned
  -2^53 instead of the exact -(2^53-1)). Add 3-bit GRS alignment and
  round-to-nearest-even; verified bit-exact against the host FPU on 2M
  randomized normal cases (fixes the dadd half of ieee/pr28634).
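For the dadd item, the two building blocks of guard/round/sticky handling can be sketched on plain integers. This is an illustrative model, not the code in lib/fp/soft/dadd.c: the helper names are hypothetical, and the significand is assumed to be already extended by three low GRS bits.

```c
#include <stdint.h>

/* Shift a significand (already extended by 3 low GRS bits) right by
 * `shift`, folding every bit shifted out into the sticky bit (bit 0). */
static uint64_t align_sticky(uint64_t sig_grs, unsigned shift)
{
    if (shift == 0)
        return sig_grs;
    if (shift >= 64)
        return sig_grs != 0;
    uint64_t lost = sig_grs & ((UINT64_C(1) << shift) - 1);
    return (sig_grs >> shift) | (lost != 0);
}

/* Drop the 3 GRS bits with round-to-nearest, ties-to-even.
 * The caller must still handle a carry out of the top bit. */
static uint64_t round_rne(uint64_t sig_grs)
{
    uint64_t q = sig_grs >> 3;
    unsigned grs = (unsigned)(sig_grs & 7);
    if (grs > 4 || (grs == 4 && (q & 1)))
        q++;
    return q;
}
```

In the `1 + -2^53` case from the commit message, the smaller operand's significand (2^52, extended to 2^55) is shifted right by 53 places; in this model it survives as the guard bit (`align_sticky` returns 4) instead of being truncated to zero.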

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

more fixes, tests are passing on hardware now
…d STRD

Two fixes for ARMv8-M self-hosted compiles that crashed the entire
gcc-torture suite at -O1/-O2 (all passing at -O0):

1. tcc_ir_opt_vrp had two VRPRange[VRP_MAX_POS*3] (~18 KB each) stack
   arrays -> a 74,364-byte prologue frame, which alone overflows the
   32 KB process stack (CFSR=0x00100000 STKOF on the prologue `sub sp`;
   fault dump r12=0x1227C confirms the frame size). VRP only runs at
   -O1+, hence -O0 was unaffected. Move both arrays to the heap
   (tcc_mallocz) and free them at the single return; replace sizeof(ranges)
   in the memset/memcpy with the explicit byte count. Frame 74364 -> 732 B.
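The shape of the VRP fix can be sketched as follows. `DemoRange`, `DEMO_MAX_POS`, and `demo_pass` are hypothetical stand-ins (the real `VRPRange` layout and `VRP_MAX_POS` value are not shown in this PR), and `calloc` stands in for `tcc_mallocz`.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for VRPRange and VRP_MAX_POS. */
typedef struct { long lo, hi; int known; } DemoRange;
#define DEMO_MAX_POS 512
#define DEMO_COUNT   (DEMO_MAX_POS * 3)

/* Returns 0 on success, -1 on allocation failure. */
static int demo_pass(void)
{
    size_t bytes = DEMO_COUNT * sizeof(DemoRange);
    /* Zeroed heap memory instead of two large stack arrays. */
    DemoRange *ranges = calloc(DEMO_COUNT, sizeof *ranges);
    DemoRange *saved  = calloc(DEMO_COUNT, sizeof *saved);
    int rc = -1;

    if (ranges && saved) {
        /* sizeof(ranges) is now the size of a pointer, so the byte
         * count has to be spelled out explicitly. */
        ranges[0].known = 1;
        memcpy(saved, ranges, bytes);
        rc = (saved[0].known == 1) ? 0 : -1;
    }
    /* Single exit point: both buffers are released here. */
    free(saved);
    free(ranges);
    return rc;
}
```

The explicit `bytes` variable is the part that is easy to miss: after the arrays become pointers, any leftover `sizeof(ranges)` would copy or clear only a pointer's worth of data.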

2. try_rotate_loop zero-initialised its packed IROperand (sizeof==9)
   scratch slots with `= (IROperand){0}`, which the codegen STRD-pairing
   peephole lowered to an 8-byte STRD. With a stride of 9, &arr[b] is not
   4-aligned for odd b, and STRD requires >=4-byte alignment on ARMv8-M ->
   UNALIGNED fault (CFSR=0x01000000). The backing buffer is tcc_mallocz'd
   (already zeroed), so the inits were redundant; drop them.
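The alignment arithmetic behind the STRD fault can be checked with a small model. `DemoOperand` is a hypothetical 9-byte packed record, not the real `IROperand` layout, and `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 9-byte packed record standing in for IROperand. */
typedef struct __attribute__((packed)) {
    uint64_t payload;
    uint8_t  tag;
} DemoOperand;

/* Byte offset of element i from the start of an array of DemoOperand. */
static size_t elem_offset(size_t i)
{
    return i * sizeof(DemoOperand);
}

/* Nonzero if element i starts on a 4-byte boundary, assuming the array
 * base itself is 4-byte aligned. */
static int elem_is_word_aligned(size_t i)
{
    return elem_offset(i) % 4 == 0;
}
```

Because 9 is congruent to 1 modulo 4, element `i` sits at offset `i mod 4` within a word, so in this model only every fourth element is word-aligned and every odd index is misaligned.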

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The yasos-native double-zero removal assumed malloc() always returns
zeroed memory, but that is false for bump allocations served from a
RECYCLED pool: mk_pool() resets the bump pointer without re-zeroing the
pool body, so those bytes hold stale data. The device tcc then read
non-zero where it expected zero and crashed self-host -O2 compiles
(e.g. builtin-bitops-1: free() of a -1 sentinel). Host builds memset
unconditionally, so the bug only manifested on device. Also includes
in-progress IR optimization work.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
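The recycled-pool behaviour described in the last commit message can be reproduced with a toy bump allocator. Everything here is hypothetical (`DemoPool`, `pool_reset`, `pool_alloc`, `pool_allocz`) and does not mirror the yasos allocator or `mk_pool()`; it only shows why a zeroing wrapper needs its own memset.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bump allocator; names and sizes are illustrative. */
#define POOL_SIZE 64
typedef struct {
    unsigned char body[POOL_SIZE];
    size_t top;
} DemoPool;

/* Recycle the pool by resetting the bump pointer only. */
static void pool_reset(DemoPool *p)
{
    p->top = 0;
}

/* Bump-allocate n bytes without zeroing; NULL if the pool is full. */
static void *pool_alloc(DemoPool *p, size_t n)
{
    if (n > POOL_SIZE - p->top)
        return NULL;
    void *out = p->body + p->top;
    p->top += n;
    return out;
}

/* Zeroing wrapper: the explicit memset is what makes the result safe
 * regardless of the pool's history. */
static void *pool_allocz(DemoPool *p, size_t n)
{
    void *out = pool_alloc(p, n);
    if (out)
        memset(out, 0, n);
    return out;
}
```

Allocating 8 bytes, filling them with 0xFF, resetting the pool, and allocating again returns the same bytes still holding 0xFF; only `pool_allocz` hands back zeroed memory after a reset.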
@matgla matgla merged commit e53b23e into mob Jun 25, 2026
1 check passed