Speedup vDSO CNTVCT and amortized urandom by jserv · Pull Request #48 · sysprog21/elfuse

jserv · 2026-05-27T05:55:30Z

vDSO clock_gettime drops from a 1256 ns SVC trap to 2.5 ns via a CNTVCT-based fast path (493x speedup, 20x under the sub-50 ns design target). The trampoline emits a 28-instruction A64 sequence that reads CNTVCT_EL0, LDAR-acquires the vvar initialized flag, and interpolates wall clock from the anchor as delta * 125 / 3 (Apple Silicon CNTFRQ = 24 MHz, so one tick is 125/3 ns), falling back to SVC on first call or CNTVCT regression. The first SVC seeds the vvar via a three-state CAS (0 -> 2 -> 1) so concurrent first calls cannot tear the anchor fields. The seed is gated on ELR_EL1 matching the trampoline's svc_fallback PC so an unrelated raw clock_gettime syscall cannot poison the anchor from arbitrary X9.
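The seeding and interpolation logic can be illustrated in plain C11. This is a minimal sketch, not the trampoline itself (which is hand-emitted A64): the struct layout, the state names, and the functions `vvar_seed`, `vvar_read`, and `ticks_to_ns` are hypothetical, and the counter value is passed in as a parameter instead of being read from CNTVCT_EL0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical vvar layout; field and state names are illustrative only. */
enum { VVAR_EMPTY = 0, VVAR_READY = 1, VVAR_SEEDING = 2 };

struct vvar_anchor {
    _Atomic uint32_t state; /* transitions 0 -> 2 -> 1 */
    uint64_t anchor_ticks;  /* counter value captured at seed time */
    uint64_t anchor_ns;     /* clock value (ns) captured at seed time */
};

/* At 24 MHz, one tick is 1e9 / 24e6 = 125/3 ns. */
static_assert(1000000000ull * 3 == 24000000ull * 125,
              "125/3 ns per tick only holds for a 24 MHz counter");

/* delta * 125 stays below 2^64 for well over a century of ticks. */
static uint64_t ticks_to_ns(uint64_t delta)
{
    return delta * 125 / 3;
}

/* Slow path: only the caller that wins the 0 -> 2 transition writes the
 * anchor fields, so concurrent first calls cannot interleave their stores. */
static bool vvar_seed(struct vvar_anchor *v, uint64_t ticks, uint64_t ns)
{
    uint32_t expected = VVAR_EMPTY;
    if (!atomic_compare_exchange_strong_explicit(&v->state, &expected,
                                                 VVAR_SEEDING,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
        return false; /* someone else is seeding or has already seeded */
    v->anchor_ticks = ticks;
    v->anchor_ns = ns;
    /* Release store publishes the fields before the state becomes READY. */
    atomic_store_explicit(&v->state, VVAR_READY, memory_order_release);
    return true;
}

/* Fast path: returns false when the caller must fall back to the syscall. */
static bool vvar_read(struct vvar_anchor *v, uint64_t now_ticks,
                      uint64_t *out_ns)
{
    if (atomic_load_explicit(&v->state, memory_order_acquire) != VVAR_READY)
        return false; /* not seeded yet (first call) */
    if (now_ticks < v->anchor_ticks)
        return false; /* counter regression */
    *out_ns = v->anchor_ns + ticks_to_ns(now_ticks - v->anchor_ticks);
    return true;
}
```

With an anchor of (1000 ticks, 5000 ns), a read at 1024 ticks yields 5000 + 24 * 125 / 3 = 6000 ns. The ELR_EL1/X9 gate described above is not modeled here.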

/dev/urandom 1-byte reads drop from 5688 ns uncached to 2054 ns (2.77x) via a new per-fd entropy cache: an arc4random_buf-refilled 4 KiB buffer per FD_URANDOM slot. The cache is zeroed on close via a type-to-cleanup registry that also closes pre-existing dup and fork-state race windows for every synthetic fd type.
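The amortization idea is roughly the following. This is a sketch under assumptions: `entropy_cache_t`, `entropy_cache_read`, `entropy_cache_close`, and the `demo_fill` stand-in are hypothetical names, and the refill source is injected as a callback so the behavior can be observed deterministically. The PR refills from arc4random_buf and adds locking, neither of which is modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define URANDOM_CACHE_SIZE 4096 /* 4 KiB, as in the PR description */

/* Refill callback; has the same shape as arc4random_buf(void *, size_t). */
typedef void (*entropy_fill_fn)(void *buf, size_t n);

typedef struct {
    uint8_t buf[URANDOM_CACHE_SIZE];
    size_t pos;   /* next unread byte */
    size_t avail; /* valid bytes in buf; 0 means the cache is empty */
} entropy_cache_t;

/* Serve n bytes from the cache, refilling only when it runs dry. */
static void entropy_cache_read(entropy_cache_t *c, void *dst, size_t n,
                               entropy_fill_fn fill)
{
    assert(c != NULL && dst != NULL && fill != NULL);
    uint8_t *out = dst;
    while (n > 0) {
        if (c->pos == c->avail) { /* drained: one refill covers 4096 bytes */
            fill(c->buf, sizeof c->buf);
            c->pos = 0;
            c->avail = sizeof c->buf;
        }
        size_t take = c->avail - c->pos;
        if (take > n)
            take = n;
        memcpy(out, c->buf + c->pos, take);
        c->pos += take;
        out += take;
        n -= take;
    }
}

/* Drop unread entropy when the fd is closed. Hardened code would use a
 * wipe the compiler cannot elide; plain memset keeps the sketch portable. */
static void entropy_cache_close(entropy_cache_t *c)
{
    memset(c, 0, sizeof *c);
}

/* Deterministic stand-in for the entropy source, for demonstration only. */
static size_t demo_fill_calls;
static void demo_fill(void *buf, size_t n)
{
    memset(buf, 0xA5, n);
    demo_fill_calls++;
}
```

Starting from a zero-initialized cache, 4096 consecutive 1-byte reads trigger a single refill, and the 4097th read triggers the second.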

eventfd dup shares state across aliases per the Linux contract (refcounted slot plus eventfd_owner[FD_TABLE_SIZE] table). The dup path holds fd_lock and sfd_lock together for the bind commit so racing close cannot leak the refcount; the source identity is pinned via snapshotted host fd so a racing close-and-rebind of the source cannot bind to the wrong slot. tests/test-eventfd-dup pins the shared-state contract.
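A single-threaded sketch of the refcounted slot plus owner table follows. Only the `eventfd_owner[FD_TABLE_SIZE]` name comes from the description above; the table sizes, the slot struct, and the `efd_*` helpers are assumptions for illustration, and the fd_lock/sfd_lock discipline and host-fd snapshot are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FD_TABLE_SIZE 64 /* illustrative size, not taken from the PR */
#define EVENTFD_SLOTS 16 /* illustrative size, not taken from the PR */

typedef struct {
    uint64_t counter;
    int refcount; /* guest fds aliasing this slot; 0 means free */
} eventfd_slot_t;

static eventfd_slot_t eventfd_slots[EVENTFD_SLOTS];
static int eventfd_owner[FD_TABLE_SIZE]; /* guest fd -> slot index + 1 */

static eventfd_slot_t *efd_lookup(int fd)
{
    if (fd < 0 || fd >= FD_TABLE_SIZE || eventfd_owner[fd] == 0)
        return NULL;
    return &eventfd_slots[eventfd_owner[fd] - 1];
}

static int efd_create(int fd)
{
    if (fd < 0 || fd >= FD_TABLE_SIZE || eventfd_owner[fd] != 0)
        return -1;
    for (int i = 0; i < EVENTFD_SLOTS; i++) {
        if (eventfd_slots[i].refcount == 0) {
            eventfd_slots[i].counter = 0;
            eventfd_slots[i].refcount = 1;
            eventfd_owner[fd] = i + 1;
            return 0;
        }
    }
    return -1;
}

/* dup: the new fd aliases the same slot, so counter state is shared. */
static int efd_dup(int oldfd, int newfd)
{
    eventfd_slot_t *s = efd_lookup(oldfd);
    if (!s || newfd < 0 || newfd >= FD_TABLE_SIZE || eventfd_owner[newfd] != 0)
        return -1;
    s->refcount++;
    eventfd_owner[newfd] = eventfd_owner[oldfd];
    return 0;
}

/* close: unbind this alias; the slot is reusable once refcount hits 0. */
static int efd_close(int fd)
{
    eventfd_slot_t *s = efd_lookup(fd);
    if (!s)
        return -1;
    assert(s->refcount > 0);
    eventfd_owner[fd] = 0;
    s->refcount--;
    return 0;
}

static int efd_write(int fd, uint64_t v)
{
    eventfd_slot_t *s = efd_lookup(fd);
    if (!s)
        return -1;
    s->counter += v;
    return 0;
}

/* Non-semaphore read: return the counter and reset it; -1 if it is zero. */
static int efd_read(int fd, uint64_t *out)
{
    eventfd_slot_t *s = efd_lookup(fd);
    if (!s || s->counter == 0)
        return -1;
    *out = s->counter;
    s->counter = 0;
    return 0;
}
```

A write through one alias is visible to a read through the other, and closing the source leaves the duplicate fully functional until the last alias is closed.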

fork_ipc_send_fd_table filters eventfd, signalfd, timerfd, inotify, netlink, pidfd, and epoll out of the SCM_RIGHTS payload. macOS rejects kqueue fds across SCM_RIGHTS and per-class side-table state is not transferable, so a clean drop is the only honest contract. tests/test-fork-synthetic-fd pins it.

Startup decomposition: ELFUSE_STARTUP_TRACE=1 emits per-step wall time for VM bring-up (17 steps on test-hello, dominated by hv_vcpu_create and guest_init at roughly 0.9 ms each). Zero overhead when unset.

Summary by cubic

Adds a CNTVCT-based vDSO fast path for clock_gettime and EL1 shim fast paths for identity syscalls and 1‑byte /dev/urandom reads. Delivers major speedups (clock_gettime ~500x; urandom 1‑byte ~2.7x; identity ~47 ns) and hardens isolation with an EL1‑only shim_data block and stricter infra guards.

New Features
- vDSO: versioned ELF with GNU symbol versions (LINUX_2.6.39), CNTVCT fast path for __kernel_clock_gettime, single SVC seeds REALTIME+MONOTONIC, publishes VDSO_OFF_SIGRET, resolves with glibc/musl.
- EL1 shim: EL1‑only shim_data (identity cache, attention mask, urandom ring + readable‑FD bitmap). Serves getpid..getegid in EL1 (gettid via CONTEXTIDR); 1‑byte urandom in EL1 with data‑abort recovery; TPIDR_EL1 set on exec/fork; attention recomputed in the vCPU loop and tied to signals/itimers and cred publishes.
- /dev/urandom: keep 4 KiB per‑fd host cache with per‑fd locks; bitmap mirrors readable urandom FDs; readv also refills the ring; cache/bitmap reset on close/dup/fork.
- Infra/boot: shim_data mapped MEM_PERM_RW_EL1_ONLY and shown as PROT_NONE in /proc/self/maps; ELF loader rejects PT_LOAD/PHDR writes into infra; mremap/madvise reject infra ranges; startup tracing via ELFUSE_STARTUP_TRACE=1.

Bug Fixes
- eventfd dup shares counter/readiness across aliases.
- Fork: drop eventfd, signalfd, timerfd, inotify, netlink, pidfd, and epoll from SCM_RIGHTS; child recreates them; no host‑fd leaks.
- Central FD cleanup registry applied atomically for synthetic types; urandom bitmap publication serialized.
- getrandom validates flags and returns EINVAL for unknown bits.

Written for commit 7642bee. Summary will update on new commits.

Max042004 · 2026-05-28T04:51:08Z

I re-ran the exact same Python syscall-density script from #34 against the same docker.io/library/python:3.12 image (glibc 2.41) on the same M4 / macOS 15.4.1 host.

| Band | #34 elfuse (μs) | This PR (μs) | #34 orbstack (μs) |
| --- | --- | --- | --- |
| baseline (pass) | 0.007 | 0.013–0.018 | 0.007 |
| getppid | 0.960 | 0.974–0.987 | 0.091 |
| clock_gettime (monotonic_ns) | 1.006 | 0.025–0.037 | 0.018 |
| /dev/urandom 1B read | 1.704 | 1.045–1.064 | 0.210 |

clock_gettime drops from 1.006 μs to 0.025 μs (≈40× faster end-to-end in Python).
For the /dev/urandom 1-byte read, the 4 KiB per-fd cache removed the per-call entropy generation; what remains (~1 μs) is the HVF round-trip floor.

This introduces an EL1-only shim_data block holding a host-published cache: identity slots (pid/ppid/uid/euid/gid/egid/tid), urandom-eligible fd bitmap, a 4 KiB urandom ring with head/tail/lock, and a 32-bit attention bitmask. The EL1 shim assembly serves identity and urandom 1-byte reads inline without trapping to the host; the existing HVC #5 forwarder is taken only when attention is raised, when a non-urandom fd is consulted, or when the ring needs a host-side refill.

Measured at 1 M iterations under the new tests/bench-hot-syscalls.c:

- getpid/getppid/getuid/geteuid/getgid/getegid/gettid: 47 ns/op
- clock_gettime via __kernel_clock_gettime vDSO: 3.7 ns/op
- read(/dev/urandom, 1 byte): 134 ns/op
- clock_gettime via SVC fallback: 2056 ns/op

The vDSO clock_gettime trampoline now seeds CLOCK_{MONOTONIC,REALTIME} anchors back-to-back from a single SVC fallback, so the fast path serves either clockid after one warm-up call. The X9/ELR_EL1 gate runs before the host wall-clock samples so the anchor inherits no positive bias from the seeding round trip.

Integrity boundary around the new cache:

- The shim_data block is mapped MEM_PERM_RW_EL1_ONLY (AP[2:1]=00) by both bootstrap and execve so EL0 cannot read or store the bytes directly. /proc/self/maps reports PROT_NONE for [shim-data] to match what guest dereferences would observe.
- gva_translate_perm refuses MEM_PERM_EL1_ONLY descriptors on guest-behalf access in both the L2 block and L3 page walk paths. read(fd, shim_data_gva, n) now returns EFAULT instead of letting the host spoof the cache.
- elf_map_segments takes an explicit infra reserve range and rejects PT_PHDR copies or PT_LOAD segments whose page-aligned write extent intersects it, closing a host-side overwrite path through the ELF loader that bypassed page-table permissions.
- A new EL1 data-abort recover handler in shim.S catches strb faults inside named urandom write ranges (caused by a racing EL0 munmap or mprotect), drops the inner exception frame, releases the ring lock, and returns EFAULT to EL0.

Cred publish is bracketed so concurrent fast-path readers see a consistent snapshot. The attention word splits into ATTN_BIT_SIGTIMER (0x1), ATTN_BIT_CRED (0x2), and ATTN_BIT_TRACE (0x4). CRED_BRACKETED ORs the CRED bit, runs the setuid/setgid mutator, publishes the four cred slots, then ANDs the CRED bit off. shim_globals_attn_or uses __ATOMIC_SEQ_CST so the mutator's publish stores cannot become globally visible before the attention bit on weakly-ordered ARM64; the AND clear stays __ATOMIC_RELEASE because release pairs with the shim LDAR for the publish-then-clear order. vdso_attention_or mirrors the same ordering.

Signal and itimer paths support the lane discipline:

- attention_guest is now _Atomic so signal_init's NULL clear during the execve reset window pairs with attention_raise's acquire load on any sibling thread.
- signal_set_itimer writes expiry and interval before the release store of .active, matching the field order already used by the virt and prof setters. Consumers that ACQUIRE-load .active without holding sig_lock now never observe armed=true with stale fields.
- New signal_attention_needed() OR-reads the three guest itimer .active fields plus an unblocked-deliverable signal hint so the HVC epilogue's recompute decides accurately whether the next call may stay on the fast path.

The fd-table publication paths that feed the urandom bitmap are serialized so a pathological sibling close+reopen on the same guest fd cannot make the EL1 fast path consult a stale bit:

- fd_refresh_urandom_bitmap snapshots (type, linux_flags) AND publishes the bitmap bit inside the same fd_lock critical section.
- fd_alloc_opened_host and duplicate_guest_fd install linux_flags, dir, seals, and the urandom bit only after re-acquiring fd_lock and confirming the slot's (type, host_fd) tuple still matches the just-allocated values. On mismatch (the slot was reallocated by a sibling) the install is skipped and any cloned DIR* is closed to avoid a leak.
- The host-side urandom cache replaces its single global mutex with a per-fd lock embedded in urandom_cache_t, initialized by io_init() from syscall_init. Concurrent urandom reads on different fds no longer serialize on one mutex.
- sys_readv on /dev/urandom now triggers shim_globals_refill_urandom_ring on the slow path, matching sys_read so readv consumers do not leave the shim ring drained.
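The bracketed cred publish described above can be approximated with C11 atomics. This is a hedged sketch: the attention bit values and the memory orders are taken from the comment, while `struct shim_cache`, `attn_or`, `attn_clear`, `cred_publish_bracketed`, and `fast_geteuid` are hypothetical names. The real reader is EL1 assembly using LDAR, and the sketch does not model any revalidation the shim performs after reading a slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define ATTN_BIT_SIGTIMER 0x1u
#define ATTN_BIT_CRED     0x2u
#define ATTN_BIT_TRACE    0x4u

static_assert((ATTN_BIT_SIGTIMER | ATTN_BIT_CRED | ATTN_BIT_TRACE) == 0x7u,
              "attention bits must be distinct");

/* Hypothetical stand-in for the cred part of the shim_data block. */
struct shim_cache {
    _Atomic uint32_t attention;
    _Atomic uint32_t uid, euid, gid, egid;
};

/* Raise bits with seq_cst so later publish stores cannot become visible
 * before the attention bit. */
static void attn_or(struct shim_cache *c, uint32_t bits)
{
    atomic_fetch_or_explicit(&c->attention, bits, memory_order_seq_cst);
}

/* Clear bits with release so the publish stores are ordered before it. */
static void attn_clear(struct shim_cache *c, uint32_t bits)
{
    atomic_fetch_and_explicit(&c->attention, ~bits, memory_order_release);
}

/* Bracket: raise CRED, publish the four slots, then drop CRED. */
static void cred_publish_bracketed(struct shim_cache *c, uint32_t uid,
                                   uint32_t euid, uint32_t gid, uint32_t egid)
{
    attn_or(c, ATTN_BIT_CRED);
    atomic_store_explicit(&c->uid, uid, memory_order_relaxed);
    atomic_store_explicit(&c->euid, euid, memory_order_relaxed);
    atomic_store_explicit(&c->gid, gid, memory_order_relaxed);
    atomic_store_explicit(&c->egid, egid, memory_order_relaxed);
    attn_clear(c, ATTN_BIT_CRED);
}

/* Reader: any raised attention bit sends the call to the slow path
 * (return 0); otherwise the cached slot is served (return 1). */
static int fast_geteuid(struct shim_cache *c, uint32_t *out)
{
    if (atomic_load_explicit(&c->attention, memory_order_acquire) != 0)
        return 0;
    *out = atomic_load_explicit(&c->euid, memory_order_relaxed);
    return 1;
}
```

While the CRED bit is set, `fast_geteuid` declines to answer and the caller falls through to the host, which is the behavior the bracket relies on.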

Max042004 · 2026-05-29T02:47:51Z

Re-ran #34's syscall-density script on M4 / macOS 15.4.1 with docker.io/library/python:3.12. All numbers are ns/call.

| Band | #34 elfuse | `a24fc53` | `7642bee` | OrbStack (native) |
| --- | --- | --- | --- | --- |
| getppid | ~948 | 963 | 25 | 100 |
| clock_gettime (monotonic_ns) | ~994 | 13 | 13 | 14 |
| /dev/urandom 1B read | ~1692 | 1033 | 100 | 218 |

jserv requested a review from Max042004 May 27, 2026 05:55

jserv force-pushed the perf branch from c19881f to a24fc53 Compare May 27, 2026 07:57

jserv force-pushed the perf branch from f7dbac7 to 7642bee Compare May 29, 2026 02:29

jserv merged commit 75fb59b into main May 29, 2026
4 checks passed

jserv deleted the perf branch May 29, 2026 02:55

jserv merged 2 commits into main from perf
