Cut TLBI broadcasts with FEAT_TLBIRANGE by jserv · Pull Request #65 · sysprog21/elfuse

jserv · 2026-06-01T01:43:21Z

mprotect-heavy workloads (dynamic linker RELRO, glibc bring-up, multi-vCPU contention) burned cycles in two ways. First, every page table mutation site (guest_{update_perms,invalidate_ptes,extend_page_tables, install_va_pages,map_va_range}) requested a TLBI VAE1IS / VMALLE1IS for the full touched range, even when the request was a no-op (descriptor already held the new value). Second, any range above TLBI_SELECTIVE_MAX_PAGES=16 promoted straight to TLBI VMALLE1IS, so 17..64 page mprotects (common when several shared libraries' RELRO ranges union-coalesce) cost a full inner-shareable broadcast plus an unconditional IC IALLU even when the change was data-only.

Skip-no-op tracking. Each helper now tracks the smallest [changed_lo, changed_hi) sub-range whose descriptor actually transitioned and only requests TLBI for that interval (no request when nothing changed). guest_invalidate_ptes treats write-0-over-0 as a no-op; guest_extend treats slots already marked PT_VALID as no-ops; install_va_pages and map_va_range compare each descriptor word against the new value before writing. Once the per-vCPU accumulator has already promoted to TLBI_BROADCAST, a hoisted tlbi_request_is_broadcast() local skips the per-page bookkeeping (the broadcast invalidates everything anyway).
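The compare-before-write idea can be sketched in a few lines of C. This is a minimal illustration, not the code from the PR: `changed_range_t`, `changed_range_init`, `changed_range_empty` and `pte_set_tracked` are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL /* assumption: 4 KiB granule */

/* Hypothetical tracker: half-open [lo, hi) covering every page whose
 * descriptor actually transitioned. Empty while lo >= hi. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} changed_range_t;

static inline void changed_range_init(changed_range_t *r)
{
    r->lo = UINT64_MAX;
    r->hi = 0;
}

static inline bool changed_range_empty(const changed_range_t *r)
{
    return r->lo >= r->hi;
}

/* Store desc into *slot only when it differs from the current value,
 * and widen the changed range only for real transitions. */
static inline void pte_set_tracked(uint64_t *slot, uint64_t desc,
                                   uint64_t va, changed_range_t *r)
{
    if (*slot == desc)
        return; /* no-op write: nothing to invalidate */
    *slot = desc;
    if (va < r->lo)
        r->lo = va;
    if (va + PAGE_SIZE > r->hi)
        r->hi = va + PAGE_SIZE;
}
```

A caller would request a TLBI for [lo, hi) only when `changed_range_empty()` returns false, which is how a repeated mprotect with identical permissions ends up requesting nothing.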

Conditional IC IALLU. tlbi_request_t gains an icache_flush byte set by the helpers when perms include MEM_PERM_X (i.e. the change introduces executable content visible to EL0). The syscall epilogue passes the hint via X11 alongside X8; the shim's tlbi_full and tlbi_selective paths cbz x11 to skip the I-cache invalidation when the change was data-only (RW<->R mprotect, munmap of data, etc.). The X8=2 exec_drop_frame path keeps its unconditional flush because execve loads new code. shim.S grows tiny .Ltlbi_*_skip_ic labels but preserves the existing register restore tail.
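As a rough sketch of how a sticky I-cache hint might be accumulated and handed to the shim: the struct below is a hypothetical subset (the real tlbi_request_t layout is not shown in this description), and the MEM_PERM_* bit values are assumed for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values; the real MEM_PERM_* encoding may differ. */
#define MEM_PERM_R (1u << 0)
#define MEM_PERM_W (1u << 1)
#define MEM_PERM_X (1u << 2)

/* Hypothetical subset of the per-vCPU accumulator. */
typedef struct {
    bool pending;         /* at least one descriptor changed */
    uint8_t icache_flush; /* sticky: set once any change adds exec */
} tlbi_hint_t;

/* Record a descriptor change and remember whether it exposes
 * executable content to EL0. */
static inline void tlbi_hint_note_change(tlbi_hint_t *h, uint32_t new_perms)
{
    h->pending = true;
    if (new_perms & MEM_PERM_X)
        h->icache_flush = 1;
}

/* Value the epilogue would place in X11: zero lets the shim's
 * "cbz x11" branch skip IC IALLU. */
static inline uint64_t tlbi_hint_x11(const tlbi_hint_t *h)
{
    return h->icache_flush ? 1 : 0;
}
```

The hint is deliberately sticky: once one change in a syscall adds exec permission, later data-only changes in the same accumulator must not clear it.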

FEAT_TLBIRANGE single-shot range invalidation. A new TLBI_RANGE_LARGE kind drives TLBI RVAE1IS for ranges 17..TLBI_RVAE_MAX_PAGES=64 pages when g_tlbi_range_supported is set. One instruction covers (NUM+1)*2 pages with SCALE=0; the operand carries BaseADDR, NUM, TG=01 (4 KiB granule) and an implicit ASID=0 from the single-ASID guest (TCR_EL1.A1=0, TTBR0 ASID=0). The shim adds a tlbi_range_large handler that issues "tlbi rvae1is, x9" plus the same conditional IC IALLU tail. The encoder lives in a pure helper tlbi_rvae1is_operand() exercised by a host-side unit test (tests/test-tlbi-encoder-host.c) that decomposes every operand bit field (TG, SCALE, NUM, TTL, ASID, BaseADDR) and asserts against the ARM ARM D8.7.6 layout. The unit test is wired into make check as a build prerequisite -- it would fail on a regression that drops TG=01 (integration tests would silently pass because Apple Silicon's PE tolerantly falls back to TCR_EL1.TGn for TG=00).
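For orientation, an encoder along these lines might look like the sketch below. It assumes the architectural range-operand layout (ASID[63:48], TG[47:46], SCALE[45:44], NUM[43:39], TTL[38:37], BaseADDR[36:0] holding VA[48:12] for a 4 KiB granule) and fixes SCALE=0, TTL=0, ASID=0. The function name and signature are hypothetical and need not match tlbi_rvae1is_operand().

```c
#include <stdint.h>

/* Sketch of an RVAE1IS operand encoder for a 4 KiB granule.
 * Assumes 1 <= pages <= 64 and a page-aligned va; the caller is
 * expected to have checked both. */
static inline uint64_t rvae1is_operand_sketch(uint64_t va, uint32_t pages)
{
    /* With SCALE=0 one instruction covers (NUM+1)*2 pages, so pick
     * the smallest NUM whose coverage is >= pages. */
    uint64_t num = (uint64_t)(pages + 1) / 2 - 1;
    /* BaseADDR[36:0] = VA[48:12] for TG=01. */
    uint64_t base = (va >> 12) & ((1ULL << 37) - 1);
    uint64_t tg = 1; /* 0b01 = 4 KiB granule */

    return (tg << 46) | (num << 39) | base; /* SCALE, TTL, ASID stay 0 */
}
```

An odd page count rounds up, so a 17-page request encodes NUM=8 and invalidates 18 pages. Over-invalidating one extra page is harmless for correctness.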

Capability probe. g_tlbi_range_supported uses sysctlbyname("hw.optional.arm.FEAT_LSE2") as an ARMv8.4 proxy (both LSE2 and TLBIRANGE became mandatory in v8.4, and Apple ships them together). The probe reads into a width-tolerant uint64_t buffer, so a future widening of the sysctl value to uint64_t still reads cleanly. pthread_once gates the probe so a re-bootstrap path (sys_execve, fork IPC restore) cannot race a live vCPU's reader. ELFUSE_DISABLE_TLBI_RANGE=1 forces the pre-TLBIRANGE broadcast fallback so that path stays exercisable in CI on Apple Silicon hosts.
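A one-shot probe of this shape could be sketched as follows. The helper names are hypothetical, the rule that only the exact value "1" disables the feature is an assumption, and on non-Apple hosts the sketch simply reports the feature as unsupported.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

static pthread_once_t probe_once = PTHREAD_ONCE_INIT;
static bool range_supported; /* stands in for g_tlbi_range_supported */

/* Assumption: only the exact string "1" forces the fallback. */
static bool env_disables(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

static void probe_tlbi_range(void)
{
    if (env_disables(getenv("ELFUSE_DISABLE_TLBI_RANGE")))
        return; /* leave range_supported false */
#ifdef __APPLE__
    /* Wide buffer: a 4-byte result fills the low bytes on a
     * little-endian host, an 8-byte result fills all of it. */
    uint64_t val = 0;
    size_t len = sizeof(val);
    if (sysctlbyname("hw.optional.arm.FEAT_LSE2", &val, &len, NULL, 0) == 0)
        range_supported = (val != 0);
#endif
}

/* Safe to call from any thread; the probe body runs exactly once. */
static bool tlbi_range_supported(void)
{
    pthread_once(&probe_once, probe_tlbi_range);
    return range_supported;
}
```

Because pthread_once serializes the first call, later readers only ever observe the final value of the flag.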

Hardening collateral.

- guest_invalidate_ptes and guest_update_perms reject inputs where end falls within PAGE_SIZE-1 of UINT64_MAX. The ALIGN_UP step could silently wrap end to 0 and turn the call into a no-op.
- guest_extend_page_tables gains a sister wrap guard for ALIGN_2MIB_UP to keep the three sites consistent if a future caller lifts the guest_size cap.
- L2 kind check in extend rewritten as an explicit PT_VALID test (was relying on PT_BLOCK == PT_VALID == bit 0 to coincidentally cover both block and table descriptors).
- guest_split_block documents the FEAT_BBM Level 2 dependency that lets block-to-table conversion skip break-before-make on Apple Silicon (M1+ implements it; the split-heavy stress paths in tests/test-stress, tests/test-shim-urandom-toctou and the new tests/test-mprotect-mt run cleanly).
- proc.c lazy-materialize stops leaving stale TLBI requests in the accumulator. The HVC #11 handler in shim.S erets without dispatching on X8, so the previous hardcoded X8=1/X11=1 was a no-op; the helpers inside guest_materialize_lazy did populate cpu_tlbi_req which then leaked into the next syscall. tlbi_request_clear() at the success break drops the leak. shim.S documents inline why a dedicated post-HVC-11 TLBI dispatch would need a frameless eret tail (the standard svc_restore_eret reloads X1/X2/X30 which signal_deliver has already overwritten via hv_vcpu_set_reg).
- tlbi_request_t pins layout via _Static_assert(sizeof == 16).
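The wrap guard from the first bullet can be illustrated with a small sketch. The helper name is hypothetical and the ALIGN_UP macro below is an assumed power-of-two round-up, not necessarily the project's definition.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL
/* Assumed definition: round x up to a power-of-two alignment a. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* Returns false when rounding end up to a page boundary would wrap
 * past UINT64_MAX; otherwise stores the aligned value. */
static inline bool page_align_end_checked(uint64_t end, uint64_t *aligned_end)
{
    if (end > UINT64_MAX - (PAGE_SIZE - 1))
        return false; /* ALIGN_UP would wrap to 0 */
    *aligned_end = ALIGN_UP(end, PAGE_SIZE);
    return true;
}
```

Without the check, an end of UINT64_MAX - 100 rounds to 0, the loop bound falls below the start address, and the call silently does nothing.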

Tests. tests/test-mprotect-mt covers no-op mprotect false-positive correctness, R<->RW alternation via syscall reader, an 8-size boundary sweep (2 / 16 / 17 / 32 / 63 / 64 / 65 / 128 pages straddling the selective / RVAE1IS / broadcast branches), a 2 MiB block-straddle cycle that forces guest_split_block on both adjacent blocks, an R<->RX I-cache hint cycle that calls each page after PROT_EXEC mprotect to catch a dropped IC IALLU, and a parameterized multi-vCPU stress at 17 / 32 / 64 pages (NUM=8 / 15 / 31). Tier F multi-vCPU mprotect stress entry in TODO.md is now resolved.
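To see why the sweep picks those sizes, here is a hedged sketch of the branch selection and the NUM arithmetic. The enum and function names are hypothetical; only the two thresholds and the (NUM+1)*2 coverage rule come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLBI_SELECTIVE_MAX_PAGES 16u
#define TLBI_RVAE_MAX_PAGES 64u

/* Hypothetical kind names for this sketch only. */
typedef enum {
    KIND_NONE,
    KIND_SELECTIVE,   /* per-page TLBI VAE1IS */
    KIND_RANGE_LARGE, /* single TLBI RVAE1IS */
    KIND_BROADCAST    /* TLBI VMALLE1IS */
} sketch_kind_t;

static inline sketch_kind_t classify_pages(uint32_t pages, bool range_supported)
{
    if (pages == 0)
        return KIND_NONE;
    if (pages <= TLBI_SELECTIVE_MAX_PAGES)
        return KIND_SELECTIVE;
    if (pages <= TLBI_RVAE_MAX_PAGES && range_supported)
        return KIND_RANGE_LARGE;
    return KIND_BROADCAST;
}

/* Smallest NUM with (NUM+1)*2 >= pages at SCALE=0; pages >= 1. */
static inline uint32_t rvae_num_for_pages(uint32_t pages)
{
    return (pages + 1) / 2 - 1;
}
```

Sizes 16 and 17 sit on either side of the selective threshold, 64 and 65 on either side of the range threshold, and 17 / 32 / 64 pages map to NUM = 8 / 15 / 31.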

Summary by cubic

Cut TLB shootdowns by using FEAT_TLBIRANGE and skipping no‑op page‑table edits. Speeds up mprotect‑heavy workloads and reduces cross‑vCPU contention while avoiding unnecessary I‑cache flushes.

New Features
- Single‑shot TLBI RVAE1IS for 17–64 pages when supported; shim adds X8=4 (tlbi rvae1is, x9) with X11 as the I‑cache hint. Probed once via sysctlbyname("hw.optional.arm.FEAT_LSE2"); set ELFUSE_DISABLE_TLBI_RANGE=1 to force the broadcast fallback. Includes encoder tlbi_rvae1is_operand() and a host unit test tests/test-tlbi-encoder-host wired into make check.
- No‑op elimination + conditional I‑cache: helpers track only changed sub‑ranges, coalesce, and skip work once a broadcast is pending; set icache_flush only when perms add exec. Centralized syscall/EL0‑fault emission via tlbi_request_emit_to_vcpu. Tests: add tests/test-mprotect-mt (boundary sweep, 2 MiB straddle, R↔RX I‑cache, multi‑vCPU).
Bug Fixes
- Proper TLBI on EL0 faults: post‑HVC‑11 dispatch mirrors the syscall epilogue, lazy MAP_NORESERVE materialize now drains the accumulator before retry, and signal delivery explicitly sets X8=0. Use ubfx for VAE1IS operands to pin VA bits.
- Hardening: wrap guards for end alignment, explicit PT_VALID at L2, FEAT_BBM Level 2 notes for block→table splits.

Written for commit 9fc96f7. Summary will update on new commits.

mprotect-heavy workloads (dynamic linker RELRO, glibc bring-up, multi-vCPU contention) burned cycles in two ways. First, every page table mutation site (guest_{update_perms,invalidate_ptes,extend_page_tables,install_va_pages,map_va_range}) requested a TLBI VAE1IS / VMALLE1IS for the full touched range, even when the request was a no-op (descriptor already held the new value). Second, any range above TLBI_SELECTIVE_MAX_PAGES=16 promoted straight to TLBI VMALLE1IS, so 17..64 page mprotects (common when several shared libraries' RELRO ranges union-coalesce) cost a full inner-shareable broadcast plus an unconditional IC IALLU even when the change was data-only.

Skip-no-op tracking. Each helper now tracks the smallest [changed_lo, changed_hi) sub-range whose descriptor actually transitioned and only requests TLBI for that interval (no request when nothing changed). guest_invalidate_ptes treats write-0-over-0 as a no-op; guest_extend treats slots already marked PT_VALID as no-ops; install_va_pages and map_va_range compare each descriptor word against the new value before writing. Once the per-vCPU accumulator has already promoted to TLBI_BROADCAST, a hoisted tlbi_request_is_broadcast() local skips the per-page bookkeeping (the broadcast invalidates everything anyway).

Conditional IC IALLU. tlbi_request_t gains an icache_flush byte set by the helpers when perms include MEM_PERM_X (i.e. the change introduces executable content visible to EL0). The syscall epilogue passes the hint via X11 alongside X8; the shim's tlbi_full and tlbi_selective paths cbz x11 to skip the I-cache invalidation when the change was data-only (RW<->R mprotect, munmap of data, etc.). The X8=2 exec_drop_frame path keeps its unconditional flush because execve loads new code.

FEAT_TLBIRANGE single-shot range invalidation. A new TLBI_RANGE_LARGE kind drives TLBI RVAE1IS for ranges 17..TLBI_RVAE_MAX_PAGES=64 pages when g_tlbi_range_supported is set. One instruction covers (NUM+1)*2 pages with SCALE=0; the operand carries BaseADDR, NUM, TG=01 (4 KiB granule) and an implicit ASID=0 from the single-ASID guest (TCR_EL1.A1=0, TTBR0 ASID=0). The shim adds a tlbi_range_large handler that issues "tlbi rvae1is, x9" plus the same conditional IC IALLU tail. The encoder lives in a pure helper tlbi_rvae1is_operand() exercised by a host-side unit test (tests/test-tlbi-encoder-host.c) that decomposes every operand bit field (TG, SCALE, NUM, TTL, ASID, BaseADDR) and asserts against the ARM ARM D8.7.6 layout. The unit test is wired into make check as a build prerequisite -- it would fail on a regression that drops TG=01 (integration tests would silently pass because Apple Silicon's PE tolerantly falls back to TCR_EL1.TGn for TG=00).

Post-HVC-11 TLBI dispatch. cubic-dev-ai PR #65 review flagged P1 on the prior deferral of the lazy MAP_NORESERVE materialize TLBI: clearing the pending TLBI relied on Apple Silicon not aggressively caching translation-fault entries, but the architecture permits the caching and a future PE could surface a refault loop. Landed the proper fix in two parts.

(1) shim.S gains RESTORE_GPRS_KEEP_SIGFRAME -- a sister to RESTORE_GPRS_KEEP_X0 that reloads X3-X29 only, preserving signal_deliver's writes to X0 (signum), X1 (siginfo*), X2 (ucontext*), and X30 (sa_restorer). handle_el0_fault switched from "RESTORE_GPRS; hvc #11; eret" to "LOAD_GPRS; hvc #11; <X8 dispatch>; <frameless eret>". The dispatch mirrors handle_svc_0's epilogue: X8 == 0 takes a no-restore add-sp + eret fast path (the common signal-delivery branch); X8 in {1, 3, 4} routes to inline .Lel0_fault_tlbi_* handlers ending with RESTORE_GPRS_KEEP_SIGFRAME + eret. X8 == 2 (exec_drop_frame) is rejected here and falls through to the conservative full flush.

(2) src/syscall/proc.c lazy-materialize success replaced tlbi_request_clear() with tlbi_request_emit_to_vcpu(vcpu) so the accumulator drives a real TLBI before the EL0 retry; the I-cache hint propagates automatically when the materialized region's prot includes PROT_EXEC.

proc.c case 11 refactor. Inverted the EC test to early-break the SIGILL branch, so the abort / SIGSEGV / lazy-mat path lives at the case-body indent rather than nested inside an else block. Drops two levels of indent on the SIGSEGV-decision block. signal_deliver's return paths explicitly write X8 = 0 via hv_vcpu_set_reg so the shim's post-HVC-11 dispatch takes the eret-only fast path when no TLBI is requested. An EC-classification comment documents that only EC 0x20 (instruction abort) and 0x24 (data abort) are SIGSEGV candidates today; new lower-EL abort classes must be added explicitly rather than relaxing the check casually.

Defensive VA encoding. The three TLBI VAE1IS operand sites (handle_el0_fault selective branch, the pre-existing tlbi_selective handler, tlbi_restore_eret) now use "ubfx X_, X_, #12, #44" instead of "lsr X_, X_, #12". The 44-bit ubfx mask pins VA[55:12] to operand bits [43:0] so future LPA2 / TTL / tagged-address support cannot leak high VA bits into the operand's TTL [47:44] or ASID [63:48] fields. lsr currently works because Apple Silicon's setup leaves the high operand bits zero, but ubfx is the same uOp cost and documents the architectural intent.

Capability probe. g_tlbi_range_supported uses sysctlbyname("hw.optional.arm.FEAT_LSE2") as an ARMv8.4 proxy (both LSE2 and TLBIRANGE became mandatory in v8.4, and Apple ships them together). The probe reads into a width-tolerant uint64_t buffer, so a future widening of the sysctl value to uint64_t still reads cleanly. pthread_once gates the probe so a re-bootstrap path (sys_execve, fork IPC restore) cannot race a live vCPU's reader. ELFUSE_DISABLE_TLBI_RANGE=1 forces the pre-TLBIRANGE broadcast fallback so that path stays exercisable in CI on Apple Silicon hosts.

Hardening collateral.

- guest_invalidate_ptes and guest_update_perms reject inputs where end falls within PAGE_SIZE-1 of UINT64_MAX. The ALIGN_UP step could silently wrap end to 0 and turn the call into a no-op.
- guest_extend_page_tables gains a sister wrap guard for ALIGN_2MIB_UP to keep the three sites consistent if a future caller lifts the guest_size cap.
- L2 kind check in extend rewritten as an explicit PT_VALID test (was relying on PT_BLOCK == PT_VALID == bit 0 to coincidentally cover both block and table descriptors).
- guest_split_block documents the FEAT_BBM Level 2 dependency that lets block-to-table conversion skip break-before-make on Apple Silicon (M1+ implements it; the split-heavy stress paths in tests/test-stress, tests/test-shim-urandom-toctou and the new tests/test-mprotect-mt run cleanly).
- tlbi_request_t pins layout via _Static_assert(sizeof == 16).

Tests. tests/test-mprotect-mt covers no-op mprotect false-positive correctness, R<->RW alternation via syscall reader, an 8-size boundary sweep (2 / 16 / 17 / 32 / 63 / 64 / 65 / 128 pages straddling the selective / RVAE1IS / broadcast branches), a 2 MiB block-straddle cycle that forces guest_split_block on both adjacent blocks, an R<->RX I-cache hint cycle that calls each page after PROT_EXEC mprotect to catch a dropped IC IALLU, and a parameterized multi-vCPU stress at 17 / 32 / 64 pages (NUM=8 / 15 / 31).
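The difference between the lsr and ubfx forms of the VAE1IS operand, as described under "Defensive VA encoding", can be modeled in C. This is an illustrative sketch with hypothetical function names; the shim itself does this in assembly.

```c
#include <stdint.h>

/* Models "lsr x, x, #12": any VA bits above bit 55 end up in
 * operand bits [63:44], i.e. in the TTL and ASID fields. */
static inline uint64_t vae1is_operand_lsr(uint64_t va)
{
    return va >> 12;
}

/* Models "ubfx x, x, #12, #44": keeps only VA[55:12] in operand
 * bits [43:0] and zeroes everything above. */
static inline uint64_t vae1is_operand_ubfx(uint64_t va)
{
    return (va >> 12) & ((1ULL << 44) - 1);
}
```

For an address with a zero top byte the two forms agree. With a tagged address such as 0xAB007f0000012000, the lsr form places the tag byte 0xAB in operand bits [51:44], while the ubfx form leaves those bits clear.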

jserv force-pushed the tlb branch from 91b956a to dddf2b8 Compare June 1, 2026 01:47

jserv force-pushed the tlb branch from dddf2b8 to 9fc96f7 Compare June 1, 2026 03:00

jserv merged commit 0a46e4f into main Jun 1, 2026
4 checks passed

jserv deleted the tlb branch June 1, 2026 03:05
