Cut TLBI broadcasts with FEAT_TLBIRANGE #65

Merged: jserv merged 1 commit into main from tlb on Jun 1, 2026

@jserv commented Jun 1, 2026

mprotect-heavy workloads (dynamic linker RELRO, glibc bring-up, multi-vCPU contention) burned cycles in two ways. First, every page table mutation site (guest_{update_perms,invalidate_ptes,extend_page_tables, install_va_pages,map_va_range}) requested a TLBI VAE1IS / VMALLE1IS for the full touched range, even when the request was a no-op (descriptor already held the new value). Second, any range above TLBI_SELECTIVE_MAX_PAGES=16 promoted straight to TLBI VMALLE1IS, so 17..64 page mprotects (common when several shared libraries' RELRO ranges union-coalesce) cost a full inner-shareable broadcast plus an unconditional IC IALLU even when the change was data-only.

Skip-no-op tracking. Each helper now tracks the smallest [changed_lo, changed_hi) sub-range whose descriptor actually transitioned and only requests TLBI for that interval (no request when nothing changed). guest_invalidate_ptes treats write-0-over-0 as a no-op; guest_extend treats slots already marked PT_VALID as no-ops; install_va_pages and map_va_range compare each descriptor word against the new value before writing. Once the per-vCPU accumulator has already promoted to TLBI_BROADCAST, a hoisted tlbi_request_is_broadcast() local skips the per-page bookkeeping (the broadcast invalidates everything anyway).
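The changed-range idea can be sketched in a few lines of C. This is a minimal illustration, not the PR's code: `changed_range_t`, `update_perms_tracked`, `SKETCH_PAGE_SIZE`, and the flat descriptor array are all hypothetical stand-ins for the real page-table walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL

/* Hypothetical tracker: half-open [lo, hi) covering pages that changed. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} changed_range_t;

/* True when no descriptor transitioned, so no TLBI needs to be requested. */
static bool changed_range_empty(const changed_range_t *r)
{
    return r->lo >= r->hi;
}

/*
 * Rewrite the permission bits of n consecutive page descriptors and return
 * the smallest VA interval whose descriptor word actually changed.
 */
static changed_range_t update_perms_tracked(uint64_t *ptes, size_t n,
                                            uint64_t base_va,
                                            uint64_t perm_mask,
                                            uint64_t perm_bits)
{
    changed_range_t r = { .lo = UINT64_MAX, .hi = 0 };

    assert((perm_bits & ~perm_mask) == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t va = base_va + i * SKETCH_PAGE_SIZE;
        uint64_t desc = (ptes[i] & ~perm_mask) | perm_bits;

        if (desc == ptes[i])
            continue; /* no-op: descriptor already holds the new value */
        ptes[i] = desc;
        if (va < r.lo)
            r.lo = va;
        if (va + SKETCH_PAGE_SIZE > r.hi)
            r.hi = va + SKETCH_PAGE_SIZE;
    }
    return r;
}
```

A caller would request a TLBI only for `[r.lo, r.hi)` and skip the request entirely when `changed_range_empty()` returns true.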

Conditional IC IALLU. tlbi_request_t gains an icache_flush byte set by the helpers when perms include MEM_PERM_X (i.e. the change introduces executable content visible to EL0). The syscall epilogue passes the hint via X11 alongside X8; the shim's tlbi_full and tlbi_selective paths cbz x11 to skip the I-cache invalidation when the change was data-only (RW<->R mprotect, munmap of data, etc.). The X8=2 exec_drop_frame path keeps its unconditional flush because execve loads new code. shim.S grows tiny .Ltlbi_*_skip_ic labels but preserves the existing register restore tail.
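A rough C model of the hint, under stated assumptions: only the names `tlbi_request_t`, `icache_flush`, and `MEM_PERM_X` come from the description above. The field layout, the flag values, the sticky accumulation, and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values; only the name MEM_PERM_X appears in the PR text. */
#define MEM_PERM_R 0x1u
#define MEM_PERM_W 0x2u
#define MEM_PERM_X 0x4u

/* Hypothetical layout, padded so the struct stays at 16 bytes. */
typedef struct {
    uint64_t base_va;     /* start of the range to invalidate */
    uint32_t pages;       /* number of pages in the range */
    uint8_t kind;         /* which TLBI flavour the shim should run */
    uint8_t icache_flush; /* non-zero: shim must also issue IC IALLU */
    uint8_t pad[2];
} tlbi_request_t;

_Static_assert(sizeof(tlbi_request_t) == 16, "tlbi_request_t layout drifted");

/* Record the permissions of one mutation; the hint is sticky once set. */
static void tlbi_request_note_perms(tlbi_request_t *req, unsigned perms)
{
    assert(req != NULL);
    if (perms & MEM_PERM_X)
        req->icache_flush = 1;
}

/* Value the epilogue would place in X11 (0 lets the shim skip IC IALLU). */
static uint64_t tlbi_request_x11(const tlbi_request_t *req)
{
    return req->icache_flush ? 1 : 0;
}
```

With this shape, a data-only RW<->R change leaves the X11 value at 0, while any mutation that includes execute permission raises it for the rest of the accumulated request.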

FEAT_TLBIRANGE single-shot range invalidation. A new TLBI_RANGE_LARGE kind drives TLBI RVAE1IS for ranges 17..TLBI_RVAE_MAX_PAGES=64 pages when g_tlbi_range_supported is set. One instruction covers (NUM+1)*2 pages with SCALE=0; the operand carries BaseADDR, NUM, TG=01 (4 KiB granule) and an implicit ASID=0 from the single-ASID guest (TCR_EL1.A1=0, TTBR0 ASID=0). The shim adds a tlbi_range_large handler that issues "tlbi rvae1is, x9" plus the same conditional IC IALLU tail. The encoder lives in a pure helper tlbi_rvae1is_operand() exercised by a host-side unit test (tests/test-tlbi-encoder-host.c) that decomposes every operand bit field (TG, SCALE, NUM, TTL, ASID, BaseADDR) and asserts against the ARM ARM D8.7.6 layout. The unit test is wired into make check as a build prerequisite -- it would fail on a regression that drops TG=01 (integration tests would silently pass because Apple Silicon's PE tolerantly falls back to TCR_EL1.TGn for TG=00).
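For reference, a sketch of how such an operand might be packed for the 4 KiB granule with SCALE=0. The function name and signature here are hypothetical (the PR's real helper is tlbi_rvae1is_operand(), whose exact signature is not shown); the field positions follow the architectural TLBI RVAE1IS operand layout.

```c
#include <assert.h>
#include <stdint.h>

/* TLBI RVAE1IS operand fields (4 KiB granule). */
#define RVAE_ASID_SHIFT  48                 /* [63:48] ASID */
#define RVAE_TG_SHIFT    46                 /* [47:46] 0b01 = 4 KiB */
#define RVAE_SCALE_SHIFT 44                 /* [45:44] */
#define RVAE_NUM_SHIFT   39                 /* [43:39] */
#define RVAE_TTL_SHIFT   37                 /* [38:37] 0 = no level hint */
#define RVAE_BASE_MASK   ((1ULL << 37) - 1) /* [36:0] = VA[48:12] */

/*
 * Hypothetical encoder: SCALE=0, TTL=0, ASID=0, TG=01.
 * With SCALE=0 one instruction covers (NUM+1)*2 pages, so pick the
 * smallest NUM whose coverage is at least the requested page count.
 */
static uint64_t rvae1is_operand_sketch(uint64_t va, unsigned pages)
{
    assert(pages >= 1 && pages <= 64);

    uint64_t num = (pages + 1) / 2 - 1;
    uint64_t op = (va >> 12) & RVAE_BASE_MASK;

    op |= num << RVAE_NUM_SHIFT;
    op |= 1ULL << RVAE_TG_SHIFT;
    return op;
}
```

For 17, 32 and 64 pages this yields NUM = 8, 15 and 31. An odd page count rounds up, so a 17-page request invalidates 18 pages.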

Capability probe. g_tlbi_range_supported uses sysctlbyname("hw.optional.arm.FEAT_LSE2") as an ARMv8.4 proxy (both LSE2 and TLBIRANGE became mandatory in v8.4; Apple ships them together). The probe reads into a width-tolerant uint64_t buffer so a future widening of the sysctl value to 64 bits still reads cleanly. pthread_once gates the probe so a re-bootstrap path (sys_execve, fork IPC restore) cannot race a live vCPU's reader. ELFUSE_DISABLE_TLBI_RANGE=1 forces the pre-TLBIRANGE broadcast fallback so that path stays exercisable in CI on Apple Silicon hosts.
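The probe could look roughly like the sketch below. The sysctl name and environment variable are taken from the description; everything else (function names, the static flag, treating only the literal value "1" as the disable switch, returning false off macOS) is an assumption made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

static pthread_once_t sketch_probe_once = PTHREAD_ONCE_INIT;
static bool sketch_range_supported;

/* Pure decision: the env override wins over whatever the hardware reports. */
static bool probe_decide(const char *disable_env, bool hw_has_feature)
{
    if (disable_env && strcmp(disable_env, "1") == 0)
        return false;
    return hw_has_feature;
}

/* Read the LSE2 proxy flag; the zeroed 64-bit buffer tolerates a 4-byte
 * or 8-byte result on a little-endian host. */
static bool probe_hw(void)
{
#if defined(__APPLE__)
    uint64_t val = 0;
    size_t len = sizeof(val);

    if (sysctlbyname("hw.optional.arm.FEAT_LSE2", &val, &len, NULL, 0) != 0)
        return false;
    assert(len <= sizeof(val));
    return val != 0;
#else
    return false; /* assumption: no range TLBI outside macOS hosts */
#endif
}

static void probe_body(void)
{
    sketch_range_supported =
        probe_decide(getenv("ELFUSE_DISABLE_TLBI_RANGE"), probe_hw());
}

/* Safe to call from any thread; the probe itself runs exactly once. */
static bool tlbi_range_supported(void)
{
    pthread_once(&sketch_probe_once, probe_body);
    return sketch_range_supported;
}
```

Splitting the decision into a pure function keeps the override logic testable without touching the process environment.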

Hardening collateral.

  • guest_invalidate_ptes and guest_update_perms reject inputs where end falls within PAGE_SIZE-1 of UINT64_MAX. The ALIGN_UP step could silently wrap end to 0 and turn the call into a no-op.
  • guest_extend_page_tables gains a sister wrap guard for ALIGN_2MIB_UP to keep the three sites consistent if a future caller lifts the guest_size cap.
  • L2 kind check in extend rewritten as an explicit PT_VALID test (was relying on PT_BLOCK == PT_VALID == bit 0 to coincidentally cover both block and table descriptors).
  • guest_split_block documents the FEAT_BBM Level 2 dependency that lets block-to-table conversion skip break-before-make on Apple Silicon (M1+ implements it; the split-heavy stress paths in tests/test-stress, tests/test-shim-urandom-toctou and the new tests/test-mprotect-mt run cleanly).
  • proc.c lazy-materialize stops leaving stale TLBI requests in the accumulator. The HVC #11 handler in shim.S erets without dispatching on X8, so the previous hardcoded X8=1/X11=1 was a no-op; the helpers inside guest_materialize_lazy did populate cpu_tlbi_req, which then leaked into the next syscall. tlbi_request_clear() at the success break drops the leak. shim.S documents inline why a dedicated post-HVC-11 TLBI dispatch would need a frameless eret tail (the standard svc_restore_eret reloads X1/X2/X30 which signal_deliver has already overwritten via hv_vcpu_set_reg).
  • tlbi_request_t pins layout via _Static_assert(sizeof == 16).
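The wrap guard from the first bullet can be illustrated as follows. This is a sketch only: `align_end_checked`, `SKETCH_PAGE_SIZE`, and this particular `ALIGN_UP` definition are assumptions, not the project's actual macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/*
 * Page-align end upward. Returns false when end lies within
 * PAGE_SIZE-1 of UINT64_MAX, where the addition inside ALIGN_UP would
 * wrap and produce 0, silently turning the caller's loop into a no-op.
 */
static bool align_end_checked(uint64_t end, uint64_t *aligned_end)
{
    assert(aligned_end != NULL);
    if (end > UINT64_MAX - (SKETCH_PAGE_SIZE - 1))
        return false;
    *aligned_end = ALIGN_UP(end, SKETCH_PAGE_SIZE);
    return true;
}
```

An already-aligned end at the very top of the address space (UINT64_MAX - 4095) still passes, because aligning it does not overflow.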

Tests. tests/test-mprotect-mt covers no-op mprotect false-positive correctness, R<->RW alternation via syscall reader, an 8-size boundary sweep (2 / 16 / 17 / 32 / 63 / 64 / 65 / 128 pages straddling the selective / RVAE1IS / broadcast branches), a 2 MiB block-straddle cycle that forces guest_split_block on both adjacent blocks, an R<->RX I-cache hint cycle that calls each page after PROT_EXEC mprotect to catch a dropped IC IALLU, and a parameterized multi-vCPU stress at 17 / 32 / 64 pages (NUM=8 / 15 / 31). Tier F multi-vCPU mprotect stress entry in TODO.md is now resolved.
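To see why those sweep sizes were chosen, here is a small sketch of the branch selection they straddle. The two thresholds are the values quoted in this PR; the enum, the function name, and the exact comparison structure are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TLBI_SELECTIVE_MAX_PAGES 16u
#define TLBI_RVAE_MAX_PAGES      64u

typedef enum {
    SKETCH_SELECTIVE,   /* per-page TLBI VAE1IS */
    SKETCH_RANGE_LARGE, /* single TLBI RVAE1IS */
    SKETCH_BROADCAST    /* TLBI VMALLE1IS */
} sketch_kind_t;

/* Pick the invalidation flavour for a range of the given page count. */
static sketch_kind_t classify_pages(unsigned pages, bool range_supported)
{
    assert(pages > 0);
    if (pages <= TLBI_SELECTIVE_MAX_PAGES)
        return SKETCH_SELECTIVE;
    if (range_supported && pages <= TLBI_RVAE_MAX_PAGES)
        return SKETCH_RANGE_LARGE;
    return SKETCH_BROADCAST;
}
```

Under this model 16/17 and 64/65 sit on either side of a branch boundary, and with range support disabled every size above 16 falls back to the broadcast.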


Summary by cubic

Cut TLB shootdowns by using FEAT_TLBIRANGE and skipping no‑op page‑table edits. Speeds up mprotect‑heavy workloads and reduces cross‑vCPU contention while avoiding unnecessary I‑cache flushes.

  • New Features

    • Single‑shot TLBI RVAE1IS for 17–64 pages when supported; shim adds X8=4 (tlbi rvae1is, x9) with X11 as the I‑cache hint. Probed once via sysctlbyname("hw.optional.arm.FEAT_LSE2"); set ELFUSE_DISABLE_TLBI_RANGE=1 to force the broadcast fallback. Includes encoder tlbi_rvae1is_operand() and a host unit test tests/test-tlbi-encoder-host wired into make check.
    • No‑op elimination + conditional I‑cache: helpers track only changed sub‑ranges, coalesce, and skip work once a broadcast is pending; set icache_flush only when perms add exec. Centralized syscall/EL0‑fault emission via tlbi_request_emit_to_vcpu. Tests: add tests/test-mprotect-mt (boundary sweep, 2 MiB straddle, R↔RX I‑cache, multi‑vCPU).
  • Bug Fixes

    • Proper TLBI on EL0 faults: post‑HVC‑11 dispatch mirrors the syscall epilogue, lazy MAP_NORESERVE materialize now drains the accumulator before retry, and signal delivery explicitly sets X8=0. Use ubfx for VAE1IS operands to pin VA bits.
    • Hardening: wrap guards for end alignment, explicit PT_VALID at L2, FEAT_BBM Level 2 notes for block→table splits.

Written for commit 9fc96f7. Summary will update on new commits.


cubic-dev-ai[bot]

This comment was marked as resolved.

mprotect-heavy workloads (dynamic linker RELRO, glibc bring-up,
multi-vCPU contention) burned cycles in two ways. First, every page
table mutation site (guest_{update_perms,invalidate_ptes,
extend_page_tables,install_va_pages,map_va_range}) requested a TLBI
VAE1IS / VMALLE1IS for the full touched range, even when the request was
a no-op (descriptor already held the new value). Second, any range above
TLBI_SELECTIVE_MAX_PAGES=16 promoted straight to TLBI VMALLE1IS, so
17..64 page mprotects (common when several shared libraries' RELRO
ranges union-coalesce) cost a full inner-shareable broadcast plus an
unconditional IC IALLU even when the change was data-only.

Skip-no-op tracking. Each helper now tracks the smallest [changed_lo,
changed_hi) sub-range whose descriptor actually transitioned and only
requests TLBI for that interval (no request when nothing changed).
guest_invalidate_ptes treats write-0-over-0 as a no-op; guest_extend
treats slots already marked PT_VALID as no-ops; install_va_pages and
map_va_range compare each descriptor word against the new value
before writing. Once the per-vCPU accumulator has already promoted to
TLBI_BROADCAST, a hoisted tlbi_request_is_broadcast() local skips the
per-page bookkeeping (the broadcast invalidates everything anyway).

Conditional IC IALLU. tlbi_request_t gains an icache_flush byte set by
the helpers when perms include MEM_PERM_X (i.e. the change introduces
executable content visible to EL0). The syscall epilogue passes the
hint via X11 alongside X8; the shim's tlbi_full and tlbi_selective
paths cbz x11 to skip the I-cache invalidation when the change was
data-only (RW<->R mprotect, munmap of data, etc.). The X8=2
exec_drop_frame path keeps its unconditional flush because execve
loads new code.

FEAT_TLBIRANGE single-shot range invalidation. A new TLBI_RANGE_LARGE
kind drives TLBI RVAE1IS for ranges 17..TLBI_RVAE_MAX_PAGES=64 pages
when g_tlbi_range_supported is set. One instruction covers
(NUM+1)*2 pages with SCALE=0; the operand carries BaseADDR, NUM, TG=01
(4 KiB granule) and an implicit ASID=0 from the single-ASID guest
(TCR_EL1.A1=0, TTBR0 ASID=0). The shim adds a tlbi_range_large
handler that issues "tlbi rvae1is, x9" plus the same conditional
IC IALLU tail. The encoder lives in a pure helper
tlbi_rvae1is_operand() exercised by a host-side unit test
(tests/test-tlbi-encoder-host.c) that decomposes every operand bit
field (TG, SCALE, NUM, TTL, ASID, BaseADDR) and asserts against the
ARM ARM D8.7.6 layout. The unit test is wired into make check as a
build prerequisite -- it would fail on a regression that drops TG=01
(integration tests would silently pass because Apple Silicon's PE
tolerantly falls back to TCR_EL1.TGn for TG=00).

Post-HVC-11 TLBI dispatch. cubic-dev-ai PR #65 review flagged P1 on
the prior deferral of the lazy MAP_NORESERVE materialize TLBI:
clearing the pending TLBI relied on Apple Silicon not aggressively
caching translation-fault entries, but the architecture permits the
caching and a future PE could surface a refault loop. Landed the
proper fix in two parts. (1) shim.S gains
RESTORE_GPRS_KEEP_SIGFRAME -- a sister to RESTORE_GPRS_KEEP_X0 that
reloads X3-X29 only, preserving signal_deliver's writes to X0
(signum), X1 (siginfo*), X2 (ucontext*), and X30 (sa_restorer).
handle_el0_fault switched from "RESTORE_GPRS; hvc #11; eret" to
"LOAD_GPRS; hvc #11; <X8 dispatch>; <frameless eret>". The dispatch
mirrors handle_svc_0's epilogue: X8 == 0 takes a no-restore add-sp
+ eret fast path (the common signal-delivery branch); X8 in
{1, 3, 4} routes to inline .Lel0_fault_tlbi_* handlers ending with
RESTORE_GPRS_KEEP_SIGFRAME + eret. X8 == 2 (exec_drop_frame) is
rejected here and falls through to the conservative full flush.
(2) src/syscall/proc.c lazy-materialize success replaced
tlbi_request_clear() with tlbi_request_emit_to_vcpu(vcpu) so the
accumulator drives a real TLBI before the EL0 retry; the I-cache
hint propagates automatically when the materialized region's prot
includes PROT_EXEC.

proc.c case 11 refactor. Inverted the EC test to early-break the
SIGILL branch, so the abort / SIGSEGV / lazy-mat path lives at the
case-body indent rather than nested inside an else block. Drops two
levels of indent on the SIGSEGV-decision block. signal_deliver's
return paths explicitly write X8 = 0 via hv_vcpu_set_reg so the
shim's post-HVC-11 dispatch takes the eret-only fast path when no
TLBI is requested. An EC-classification comment documents that only
EC 0x20 (instruction abort) and 0x24 (data abort) are SIGSEGV
candidates today; new lower-EL abort classes must be added
explicitly rather than relaxing the check casually.

Defensive VA encoding. The three TLBI VAE1IS operand sites
(handle_el0_fault selective branch, the pre-existing tlbi_selective
handler, tlbi_restore_eret) now use "ubfx X_, X_, #12, #44" instead
of "lsr X_, X_, #12". The 44-bit ubfx mask pins VA[55:12] to operand
bits [43:0] so future LPA2 / TTL / tagged-address support cannot
leak high VA bits into the operand's TTL [47:44] or ASID [63:48]
fields. lsr currently works because Apple Silicon's setup leaves
the high operand bits zero, but ubfx is the same uOp cost and
documents the architectural intent.
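
A C model of the two encodings shows the difference on a tagged
address. This is a sketch for illustration; the function names are
hypothetical, and the shim itself does this in assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "lsr x, x, #12": VA bits above 55 land in operand bits
 * [63:44], which overlap the TTL and ASID fields. */
static uint64_t vae1is_operand_lsr(uint64_t va)
{
    return va >> 12;
}

/* Model of "ubfx x, x, #12, #44": only VA[55:12] survives, placed in
 * operand bits [43:0]; everything above is zero. */
static uint64_t vae1is_operand_ubfx(uint64_t va)
{
    uint64_t op = (va >> 12) & ((1ULL << 44) - 1);

    assert((op >> 44) == 0);
    return op;
}
```

For a VA carrying a top-byte tag such as 0xAB00000012345000, the lsr
form leaves 0xAB in the operand's upper bits, while the ubfx form
yields 0x12345.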

Capability probe. g_tlbi_range_supported uses
sysctlbyname("hw.optional.arm.FEAT_LSE2") as an ARMv8.4 proxy (both
LSE2 and TLBIRANGE became mandatory in v8.4; Apple ships them
together). The probe reads into a width-tolerant uint64_t buffer so a
future widening of the sysctl value to 64 bits still reads cleanly.
pthread_once gates the probe so a re-bootstrap path (sys_execve, fork
IPC restore) cannot race a live vCPU's reader.
ELFUSE_DISABLE_TLBI_RANGE=1 forces the pre-TLBIRANGE broadcast
fallback so that path stays exercisable in CI on Apple Silicon hosts.

Hardening collateral.
- guest_invalidate_ptes and guest_update_perms reject inputs where
  end falls within PAGE_SIZE-1 of UINT64_MAX. The ALIGN_UP step could
  silently wrap end to 0 and turn the call into a no-op.
- guest_extend_page_tables gains a sister wrap guard for ALIGN_2MIB_UP
  to keep the three sites consistent if a future caller lifts the
  guest_size cap.
- L2 kind check in extend rewritten as an explicit PT_VALID test (was
  relying on PT_BLOCK == PT_VALID == bit 0 to coincidentally cover
  both block and table descriptors).
- guest_split_block documents the FEAT_BBM Level 2 dependency that
  lets block-to-table conversion skip break-before-make on Apple
  Silicon (M1+ implements it; the split-heavy stress paths in
  tests/test-stress, tests/test-shim-urandom-toctou and the new
  tests/test-mprotect-mt run cleanly).
- tlbi_request_t pins layout via _Static_assert(sizeof == 16).

Tests. tests/test-mprotect-mt covers no-op mprotect false-positive
correctness, R<->RW alternation via syscall reader, an 8-size boundary
sweep (2 / 16 / 17 / 32 / 63 / 64 / 65 / 128 pages straddling the
selective / RVAE1IS / broadcast branches), a 2 MiB block-straddle
cycle that forces guest_split_block on both adjacent blocks, an
R<->RX I-cache hint cycle that calls each page after PROT_EXEC
mprotect to catch a dropped IC IALLU, and a parameterized multi-vCPU
stress at 17 / 32 / 64 pages (NUM=8 / 15 / 31).
@jserv merged commit 0a46e4f into main on Jun 1, 2026
4 checks passed
@jserv deleted the tlb branch on June 1, 2026 03:05