Add vDSO seqlock refresh and fast paths by jserv · Pull Request #52 · sysprog21/elfuse

jserv · 2026-05-29T08:20:01Z

This extends the synthetic vDSO at AT_SYSINFO_EHDR with four new fast paths and rebuilds the anchor publication protocol so the anchor can be refreshed safely while concurrent guest readers are interpolating.

New trampolines, each ending in its own SVC fallback so the dynamic linker sees a complete __kernel_* symbol:

  __kernel_clock_getres   76 B / 19 instructions. Returns {0, 1}
                          inline for REALTIME, MONOTONIC, MONOTONIC_RAW,
                          REALTIME_COARSE, MONOTONIC_COARSE, BOOTTIME.
                          CPU and dynamic per-pid clockids SVC out.

  __kernel_gettimeofday  160 B / 40 instructions. Mirrors clock_gettime
                          using the REALTIME anchor and divides by 1000
                          for tv_usec. tz, if non-NULL, gets one
                          str xzr to clear the obsolete struct timezone.

  __kernel_getcpu         52 B / 13 instructions. Stores zero to both
                          out-pointers (elfuse models one CPU / one
                          node) and returns 0.

  __kernel_clock_gettime grows to 168 B / 42 instructions to hold the
                          anchor-age cap and the seqlock recheck added
                          to the existing CNTVCT trampoline.
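As a rough C model of the __kernel_clock_getres decision described above (the real trampoline is hand-written AArch64), the sketch below returns {0, 1} for the six listed clockids and reports "take the SVC fallback" for everything else. The names getres_fast_path, struct ts64, and the CLK_ constants are hypothetical, and treating a NULL result pointer as a successful no-op is an assumption, not something the description states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Linux UAPI clockid values; the CLK_ prefix is local to this sketch. */
enum {
    CLK_REALTIME         = 0,
    CLK_MONOTONIC        = 1,
    CLK_MONOTONIC_RAW    = 4,
    CLK_REALTIME_COARSE  = 5,
    CLK_MONOTONIC_COARSE = 6,
    CLK_BOOTTIME         = 7,
};

/* 64-bit timespec layout as seen by an AArch64 guest. */
struct ts64 {
    int64_t tv_sec;
    int64_t tv_nsec;
};
static_assert(sizeof(struct ts64) == 16, "timespec must be two 64-bit words");

/* Returns true when the clockid is answered inline with {0, 1};
 * false means the caller should fall back to the SVC path. */
bool getres_fast_path(int32_t clockid, struct ts64 *res)
{
    switch (clockid) {
    case CLK_REALTIME:
    case CLK_MONOTONIC:
    case CLK_MONOTONIC_RAW:
    case CLK_REALTIME_COARSE:
    case CLK_MONOTONIC_COARSE:
    case CLK_BOOTTIME:
        if (res != NULL) {          /* assumption: NULL res is allowed */
            res->tv_sec = 0;
            res->tv_nsec = 1;
        }
        return true;
    default:
        /* CPU-time and dynamic per-pid clockids are not handled inline. */
        return false;
    }
}
```

A caller would try getres_fast_path first and issue the real syscall only when it returns false.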

VDSO_NUM_SYMS goes from 4 to 5; dynstr_data widens to 119 bytes; all post-text section offsets shift but the page still ends inside the 4 KiB at 0x6A0.

The vvar's first uint32 is now a Linux-style seqlock counter:

  0           unseeded, no anchor data
  odd  N >= 1 writer has reserved generation (N+1)/2
  even N >= 2 stable generation N/2, anchor fields readable
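The three counter states can be decoded with a couple of integer operations. The following is only an illustrative sketch; seq_classify, seq_generation, and the enum names are hypothetical and do not come from the PR.

```c
#include <assert.h>
#include <stdint.h>

enum seq_state {
    SEQ_UNSEEDED,       /* 0: no anchor data yet               */
    SEQ_WRITER_ACTIVE,  /* odd: a writer has reserved a slot   */
    SEQ_STABLE,         /* even >= 2: anchor fields readable   */
};

enum seq_state seq_classify(uint32_t seq)
{
    if (seq == 0)
        return SEQ_UNSEEDED;
    return (seq & 1u) ? SEQ_WRITER_ACTIVE : SEQ_STABLE;
}

/* Generation number: (N+1)/2 for odd N, N/2 for even N.
 * Written as N/2 + (N&1) so it cannot overflow at UINT32_MAX. */
uint32_t seq_generation(uint32_t seq)
{
    return seq / 2u + (seq & 1u);
}
```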

vdso_seed_anchor publishes through one CAS-then-release-store sequence that handles both initial seeding and refresh:

  load(seq, ACQUIRE), bail if odd
  CAS(seq, cur, cur+1, ACQUIRE, RELAXED), bail on contention
  thread_fence(RELEASE)           // CAS odd-publish before field stores
  store(field_i, RELAXED) * 5
  store(seq, cur+2, RELEASE)      // publish next even

The thread_fence(RELEASE) lowers to DMB ISH on AArch64 and closes the window where another CPU could observe the relaxed field stores before the odd-publish, since ARMv8 is not multi-copy atomic for unsynchronized stores to different locations. Without it a reader whose snapshot LDAR still saw the old even seq could read fields from the new generation and recheck the same old even seq, accepting torn data.
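The publish sequence above maps fairly directly onto C11 atomics. The sketch below is an approximation under stated assumptions: struct vvar_sketch, its generic five-element field array, and seed_anchor_sketch are hypothetical stand-ins, since the description does not give the real vvar layout or the vdso_seed_anchor signature.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: a seqlock counter followed by five anchor fields. */
struct vvar_sketch {
    _Atomic uint32_t seq;
    _Atomic uint64_t field[5];
};

/* Returns true if this call published a new generation,
 * false if another writer was active or won the race. */
bool seed_anchor_sketch(struct vvar_sketch *v, const uint64_t vals[5])
{
    uint32_t cur = atomic_load_explicit(&v->seq, memory_order_acquire);
    if (cur & 1u)
        return false;                       /* writer in progress: bail */

    /* Reserve the next generation by moving the counter to odd. */
    if (!atomic_compare_exchange_strong_explicit(
            &v->seq, &cur, cur + 1u,
            memory_order_acquire, memory_order_relaxed))
        return false;                       /* contention: bail */

    /* Order the odd-publish before the relaxed field stores. */
    atomic_thread_fence(memory_order_release);

    for (int i = 0; i < 5; i++)
        atomic_store_explicit(&v->field[i], vals[i], memory_order_relaxed);

    /* Publish the next even value; readers may now accept the fields. */
    atomic_store_explicit(&v->seq, cur + 2u, memory_order_release);
    return true;
}
```

Starting from an unseeded counter, one successful call leaves seq at 2 and a second leaves it at 4, so the same routine covers both the initial seed and later refreshes.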

Trampoline readers snapshot the seq with LDAR, fall back on 0 or odd, plain-load anchor fields, DMB ISHLD, LDAR the seq again, and SVC on mismatch. The DMB ISHLD is load-bearing: LDAR provides forward acquire only, so without it the recheck load can be observed before the plain field LDR/LDPs complete and the race goes undetected. The host helpers (vdso_anchor_age_exceeded, vdso_realtime_drift_exceeded) read through a new vvar_snapshot_anchor() that mirrors this protocol in C: relaxed atomic field loads with __atomic_thread_fence(__ATOMIC_ACQUIRE) before the recheck.
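A host-side reader in the spirit of vvar_snapshot_anchor() might look like the sketch below. It assumes a hypothetical struct anchor_vvar with five generic fields and a hypothetical function name; the real helper's signature is not shown in this description. The recheck uses an acquire load to mirror the trampoline's second LDAR, which is an assumption about the C version.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: a seqlock counter followed by five anchor fields. */
struct anchor_vvar {
    _Atomic uint32_t seq;
    _Atomic uint64_t field[5];
};

/* Copies a consistent snapshot into out[] and returns true, or returns
 * false if the anchor is unseeded, mid-update, or changed underneath us. */
bool snapshot_anchor_sketch(struct anchor_vvar *v, uint64_t out[5])
{
    uint32_t s0 = atomic_load_explicit(&v->seq, memory_order_acquire);
    if (s0 == 0 || (s0 & 1u))
        return false;                   /* unseeded or writer active */

    for (int i = 0; i < 5; i++)
        out[i] = atomic_load_explicit(&v->field[i], memory_order_relaxed);

    /* Keep the field loads ordered before the recheck load; this plays
     * the role the DMB ISHLD plays in the trampoline. */
    atomic_thread_fence(memory_order_acquire);

    uint32_t s1 = atomic_load_explicit(&v->seq, memory_order_acquire);
    return s0 == s1;                    /* mismatch: caller must retry or fall back */
}
```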

Two staleness gates drive the refresh:

- The trampolines LSR + CBNZ the CNTVCT delta against a 2**31-cycle cap (~89 s at 24 MHz). A stale anchor falls back to SVC so the host can publish a fresh one.
- sys_clock_gettime and sys_gettimeofday sample both host clocks back-to-back on the vDSO SVC fallback and call vdso_seed_anchor whenever the anchor is unseeded, has aged out, or has drifted past VDSO_ANCHOR_MAX_DRIFT_NS (100 ms) relative to a fresh REALTIME. This catches macOS NTP wall-clock steps without a host timer thread. The drift detector short-circuits on age, guards anchor_sec + delta_sec with __builtin_add_overflow, and saturates the cross-second diff before the * 1e9 multiply so adversarial vvar values cannot trip signed overflow.
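The two gates can be approximated in C as follows. This is a sketch under assumptions, not the host code: the function names and parameter lists are hypothetical, the choice to treat overflow or a malformed nanosecond field as "drift exceeded" is a guess at a reasonable policy, and saturation is modelled by rejecting any difference of more than one second before multiplying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AGE_SHIFT     31                /* 2**31-cycle cap from the text   */
#define MAX_DRIFT_NS  100000000LL       /* 100 ms                          */
#define NSEC_PER_SEC  1000000000LL

/* Gate 1: any bit at or above AGE_SHIFT in the tick delta means "stale".
 * Mirrors the trampoline's LSR + CBNZ test. */
bool anchor_age_exceeded(uint64_t now_ticks, uint64_t anchor_ticks)
{
    return ((now_ticks - anchor_ticks) >> AGE_SHIFT) != 0;
}

/* Gate 2: compare the time predicted from the anchor against a fresh
 * REALTIME sample without ever overflowing a signed 64-bit value. */
bool realtime_drift_exceeded(int64_t anchor_sec, int64_t delta_sec,
                             int64_t pred_nsec,
                             int64_t real_sec, int64_t real_nsec)
{
    int64_t pred_sec, dsec;

    if (pred_nsec < 0 || pred_nsec >= NSEC_PER_SEC ||
        real_nsec < 0 || real_nsec >= NSEC_PER_SEC)
        return true;                    /* malformed input: force refresh */
    if (__builtin_add_overflow(anchor_sec, delta_sec, &pred_sec))
        return true;                    /* adversarial anchor_sec         */
    if (__builtin_sub_overflow(real_sec, pred_sec, &dsec))
        return true;
    if (dsec > 1 || dsec < -1)
        return true;                    /* saturate: already far past 100 ms */

    /* |dsec| <= 1 and both nsec values are in range, so this fits easily. */
    int64_t diff = dsec * NSEC_PER_SEC + (real_nsec - pred_nsec);
    if (diff < 0)
        diff = -diff;
    return diff > MAX_DRIFT_NS;
}
```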

The host gates the HVF reads (ELR_EL1, X9) on clockid 0 or 1 every time, not just before the anchor is published. The previous short-circuit on vdso_anchor_is_seeded left stale anchors stranded.

sys_gettimeofday now writes 8 bytes of zero to tz_gva when non-null so SVC and fast-path callers see the same tz semantics; previously the SVC path ignored tz while the fast path zeroed it, so the first unseeded fallback with tz != NULL silently diverged.
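To make the shared tz semantics concrete, here is a small sketch of what both paths are described as producing: tv_usec derived by dividing nanoseconds by 1000, and eight zero bytes written over the obsolete struct timezone when tz is non-NULL. The function fill_timeval and struct tv64 are hypothetical, and skipping a NULL tv is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit timeval layout as seen by an AArch64 guest. */
struct tv64 {
    int64_t tv_sec;
    int64_t tv_usec;
};

void fill_timeval(int64_t sec, int64_t nsec, struct tv64 *tv, void *tz)
{
    if (tv != NULL) {                   /* assumption: NULL tv is skipped */
        tv->tv_sec = sec;
        tv->tv_usec = nsec / 1000;      /* nanoseconds to microseconds    */
    }
    if (tz != NULL)
        memset(tz, 0, 8);               /* struct timezone is two 32-bit ints */
}
```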

Measured under tests/bench-vdso.c at 200000 iterations, with the seqlock and DMB ISHLD overhead included:

  clock_gettime(MONOTONIC) :    4.3 ns/op   (SVC: 1324 ns;  311x)
  clock_gettime(REALTIME)  :    4.1 ns/op   (SVC: 1315 ns;  321x)
  clock_getres(MONOTONIC)  :    2.3 ns/op   (SVC: 1349 ns;  591x)
  gettimeofday             :    4.9 ns/op   (SVC: 1331 ns;  271x)
  getcpu                   :    2.0 ns/op   (SVC: 1519 ns;  755x)

The ~1.5 ns increase on clock_gettime over the prior one-shot anchor (~2.5 ns baseline) is the cost of the seqlock recheck plus DMB ISHLD plus the age-cap LSR+CBNZ. clock_getres and getcpu have no anchor dependence so they stay at ~2 ns.

Summary by cubic

Adds a seqlock-protected vDSO time anchor with safe refresh and fast paths for __kernel_clock_getres, __kernel_clock_gettime, __kernel_gettimeofday, and __kernel_getcpu, plus an NT_GNU_ABI_TAG note so dynamic glibc binds to the vDSO. Also collapses the clock_gettime carry to remove the final divide, cutting dyn‑glibc down to ~4 ns.

New Features
- vvar seqlock time anchor with LDAR + DMB ISHLD + recheck and a 2^22‑cycle age cap; fast trampolines with SVC fallbacks for the four __kernel_* symbols. Host refresh on SVC (seed/refresh on unseeded, aged, or >100 ms REALTIME drift); sys_clock_getres returns fixed resolutions for common clockids; sys_gettimeofday zeros tz to match fast‑path behavior.
- PT_NOTE with NT_GNU_ABI_TAG (Linux 2.6.39); PHDR table relocated to keep vvar/text offsets stable. New tests and benches, plus a hot‑syscall guardrail in make check enforcing ceilings: getpid ≤ 200 ns, libc clock_gettime ≤ 50 ns, 1‑byte /dev/urandom read ≤ 200 ns. Optional dynamic‑glibc bench verifies glibc routes through the vDSO.

Refactors
- Replaced the hot‑path UDIV-by-1e9 in clock_gettime with SUBS+CSEL+CINC by tightening the anchor age shift from 31 to 22; keeps X9 live for the SVC fallback. Dyn‑glibc clock_gettime ~4.0 ns (static ~3.4 ns); costs one extra re‑seed per ~0.175 s idle. Host drift detector updated to the same mult+shift so rounding matches the trampoline.

Written for commit f5b3e21. Summary will update on new commits.


The synthetic vDSO at AT_SYSINFO_EHDR already carries DT_HASH, LINUX_2.6.39 symbol versioning, and five __kernel_* trampolines, but glibc 2.41's dynamic-linker vDSO probe rejected the page for lack of an NT_GNU_ABI_TAG note: every dynamically-linked guest fell back to SVC for clock_gettime, gettimeofday, and clock_getres. PR #34 measured 1006 ns/op against an 18 ns/op OrbStack reference, a 56x gap the TODO Tier D P1 entry tracked as the highest-leverage single fix.

This adds the note. To avoid moving VVAR (0x0B0), TEXT_OFF_SIGRET (0x0E0, exported in vdso.h for signal.c), or any trampoline / section offset, the program-header table relocates from 0x040 to 0x6B0 (after the section-header area). The reclaimed 0x040 window now holds the 32-byte NT_GNU_ABI_TAG:

  namesz : 4 ("GNU\0")
  descsz : 16
  type   : NT_GNU_ABI_TAG (1)
  name   : "GNU\0"
  desc   : { ELF_NOTE_OS_LINUX (0), 2, 6, 39 }

The descriptor's minimum kernel ABI (2.6.39) matches the LINUX_2.6.39 symbol version already exposed through DT_VERDEF, so a glibc that honors the version also honors the note. PT_LOAD continues to cover the whole page so the relocated PHDR table and the note both stay mapped at runtime.

Validation, dynamically-linked glibc 2.41 binary built from the cross-toolchain sysroot at /opt/toolchain/aarch64-linux-gnu (same toolchain PR #34 used for the baseline):

  libc clock_gettime : 6.97 ns/op (was 1006 ns/op pre-fix)
  direct vDSO call   : 6.24 ns/op (dlsym function-pointer)
  raw SVC syscall    : 2047.01 ns/op
  libc/vDSO ratio = 1.12x -- libc IS using the vDSO

The 0.7 ns libc-vs-direct gap is glibc's dl_sysinfo_dso dispatch, not an SVC fallback. libc clock_gettime now beats the OrbStack reference (18 ns/op) by ~2.6x.

gettimeofday and clock_getres land on the trampolines through the same probe path:

  libc gettimeofday : 7.5 ns/op (vDSO REALTIME anchor reuse)
  libc clock_getres : 4.9 ns/op (constant-resolution path)

readelf parses the page cleanly: e_phnum=3, e_phoff=0x6B0, three PHDRs (PT_LOAD covering the whole page, PT_DYNAMIC at 0x420 size 0x90, PT_NOTE at 0x40 size 0x20), and `readelf -n` decodes the note as "GNU NT_GNU_ABI_TAG OS: Linux, ABI: 2.6.39". No region overlaps; total page usage 0x758 / 0x1000.

Static vDSO bench unchanged at 6 ns/op for the time fast paths; the PHDR relocation only shifts where the dynamic linker looks for the table and does not touch any code the trampolines execute. test-signal explicit run passes, confirming the unchanged TEXT_OFF_SIGRET=0xE0 trampoline still drives the libc __restore_rt path.
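The 32-byte note layout listed above can be emitted byte by byte. The sketch below writes each 32-bit word in little-endian order, which is what an AArch64 Linux guest expects; build_abi_tag_note and put_le32 are hypothetical helper names, not functions from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Fills out[0..31] with an NT_GNU_ABI_TAG note for Linux 2.6.39
 * and returns the number of bytes written. */
size_t build_abi_tag_note(uint8_t out[32])
{
    put_le32(out + 0, 4);        /* namesz: "GNU\0"            */
    put_le32(out + 4, 16);       /* descsz: four 32-bit words  */
    put_le32(out + 8, 1);        /* type: NT_GNU_ABI_TAG       */
    memcpy(out + 12, "GNU", 4);  /* name, including the NUL    */
    put_le32(out + 16, 0);       /* desc[0]: ELF_NOTE_OS_LINUX */
    put_le32(out + 20, 2);       /* desc[1..3]: ABI 2.6.39     */
    put_le32(out + 24, 6);
    put_le32(out + 28, 39);
    return 32;
}
```

The 12-byte header, 4-byte name, and 16-byte descriptor add up to the 0x20 size that readelf reports for PT_NOTE.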

Three hot paths the PR #34 OrbStack baseline tracked -- getpid (~47 ns), clock_gettime through the vDSO (~2.5 ns), and 1-byte /dev/urandom read (~134 ns) -- had no automated regression check. A silent slip-back to the SVC fallback turned each into a ~1-2 us trap without anything in CI to notice. This adds an explicit guardrail.

tests/bench-hot-guard.c resolves __kernel_clock_gettime via AT_SYSINFO_EHDR + PT_DYNAMIC + DT_HASH (SysV ELF hash walk) and measures three labels in fixed-width "%-20s %10.1f ns/op last=%ld" output: getpid (raw SVC), clock_gettime (vDSO trampoline), and read-urandom1 (raw 1-byte read of /dev/urandom).

The same source builds two binaries via a compile-time switch:

  build/bench-hot-guard
      Static glibc. Built without the macro. clock_gettime invokes the trampoline directly through the resolved function pointer. Static glibc never initializes dl_sysinfo_dso, so its libc wrapper falls back to raw SVC for reasons unrelated to the vDSO; measuring the wrapper would fail the 50 ns ceiling for the wrong reason. Direct call isolates the trampoline.

  build/bench-hot-guard-glibc
      Dynamic glibc. Built with -DGUARD_USE_LIBC_CG=1. clock_gettime invokes glibc's clock_gettime() wrapper -- which on glibc 2.41 + a correctly-stamped vDSO (NT_GNU_ABI_TAG PT_NOTE, LINUX_2.6.39 versioning) routes through the trampoline. A regression in the note or versioning would push this measurement from ~7 ns to SVC range and trip the ceiling. Built only when the cross-toolchain sysroot at $(LINUX_TOOLCHAIN)/aarch64-unknown-linux-gnu/sysroot exists; run with elfuse --sysroot at that path.

Disassembly verifies the split: the dynamic binary lowers bench_clock_gettime to "bl <clock_gettime@plt>" while the static binary lowers it to "ldr x2, [x1], #8" + indirect dispatch.

Validation:

  static    getpid 50.4 ns, clock_gettime  6.7 ns, urandom 141.9 ns
  dyn-glibc getpid 71.9 ns, clock_gettime 17.8 ns, urandom 147.9 ns
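For readers unfamiliar with the DT_HASH walk mentioned above, the sketch below shows the standard System V ELF hash function and how it selects the first candidate symbol index from a hash table laid out as nbucket, nchain, bucket[], chain[]. The function names are hypothetical and this is not the code in tests/bench-hot-guard.c; a full lookup would also follow chain[] and compare symbol names.

```c
#include <assert.h>
#include <stdint.h>

/* Standard System V ELF symbol hash. */
uint32_t sysv_elf_hash(const char *name)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        h = (h << 4) + *p;
        uint32_t g = h & 0xf0000000u;
        if (g)
            h ^= g >> 24;       /* fold the top nibble back in */
        h &= ~g;                /* and clear it                */
    }
    return h;
}

/* hashtab points at the DT_HASH section:
 *   [0] nbucket, [1] nchain, then bucket[nbucket], then chain[nchain].
 * Returns the first symbol index to test, or 0 (STN_UNDEF) if none. */
uint32_t hash_first_candidate(const uint32_t *hashtab, const char *name)
{
    uint32_t nbucket = hashtab[0];
    if (nbucket == 0)
        return 0;
    const uint32_t *bucket = hashtab + 2;
    return bucket[sysv_elf_hash(name) % nbucket];
}
```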

Max042004 · 2026-05-29T13:35:59Z

The same python3.12 ELF binary was run on Apple M-series HVF under both elfuse and orbstack (Ubuntu 24.04 noble arm64, Linux 7.0.5-orbstack) to compare time-related syscall / vDSO behavior.

Startup time: `python3.12 -c 'pass'` (10 runs)

| | samples (ms) | min | median |
|---|---|---|---|
| elfuse (PR #52) | 48 41 35 34 34 34 34 34 34 34 | 34 | 34 |
| orbstack | 55 35 35 35 35 34 35 35 35 36 | 34 | 35 |

vDSO-heavy Python loops (N=500k, 5 rounds, median ns/op)

| Operation | elfuse (PR #52) | orbstack | Gap |
|---|---|---|---|
| `time.time()` (glibc → `__kernel_gettimeofday`) | 36.2 | 37.0 | elfuse 2% faster |
| `time.monotonic_ns()` (→ `clock_gettime(MONO)`) | 35.2 | 30.1 | orbstack 17% faster |
| `clock_gettime(CLOCK_REALTIME)` | 49.3 | 40.2 | orbstack 23% faster |
| `clock_gettime(CLOCK_MONOTONIC)` | 49.8 | 40.4 | orbstack 23% faster |
| `clock_getres(CLOCK_MONOTONIC)` | 49.3 | 38.9 | orbstack 27% faster |

CPU-bound control: `fibonacci(50000)` (10 runs)

| | samples (ms) | min | median |
|---|---|---|---|
| elfuse (PR #52) | 51 52 51 51 53 52 51 51 51 51 | 51 | 51 |
| orbstack | 52 52 52 51 52 51 53 51 52 52 | 51 | 52 |

The trampoline's last divide -- UDIV by 1e9 to split anchor_nsec + delta_ns into sec_carry and nsec_out -- runs ~10-22 cycles on Apple Silicon and is not pipelined. Tightening VDSO_ANCHOR_AGE_SHIFT from 31 to 22 caps delta_ns at ~175e6 ns and bounds the sum below 2e9, so the quotient is always 0 or 1. That collapses the carry to a single SUBS + CSEL + CINC (~3 cycles, fully pipelined), eliminating the only remaining hot-path divide in __kernel_clock_gettime. The shift change costs one extra SVC re-seed per ~0.175 s of idle, which is negligible compared to the per-call gain.

Measured on M1: dyn-glibc clock_gettime drops from ~6.5 ns/op baseline to ~4.0 ns/op (~38% faster, guardrail static path 3.4 ns), closing most of the remaining gap to the PR #52 OrbStack baseline.

The host-side vdso_realtime_drift_exceeded was also updated to the matching (delta * 699050666) >> 24 mult+shift so the drift detector cannot mis-classify the trampoline's own rounding. x9 stays live across the new math and reaches the SVC fallback intact, preserving the trustworthy CNTVCT contract with sys_clock_gettime. The overflow invariant is documented on vdso_seed_anchor in vdso.h.
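The arithmetic behind the divide removal can be checked in plain C. The sketch below assumes the 24 MHz counter frequency quoted earlier in the description, under which 699050666 / 2**24 approximates 1e9 / 24e6 ns per tick; ticks_to_ns and add_delta are hypothetical names, and the conditional subtract stands in for the SUBS + CSEL + CINC sequence.

```c
#include <assert.h>
#include <stdint.h>

#define AGE_SHIFT     22
#define NSEC_PER_SEC  1000000000ull

/* Convert a tick delta to nanoseconds with multiply + shift.
 * Only valid below the age cap, where the product fits in 64 bits
 * and the result stays under ~175e6 ns. */
uint64_t ticks_to_ns(uint64_t delta_ticks)
{
    assert((delta_ticks >> AGE_SHIFT) == 0);
    return (delta_ticks * 699050666ull) >> 24;
}

/* Add delta_ns to the anchor. Because anchor_nsec < 1e9 and
 * delta_ns < ~175e6, the sum is below 2e9, so the carry is 0 or 1
 * and one compare-and-subtract replaces the divide by 1e9. */
void add_delta(uint64_t anchor_sec, uint64_t anchor_nsec, uint64_t delta_ns,
               uint64_t *sec, uint64_t *nsec)
{
    uint64_t sum = anchor_nsec + delta_ns;
    uint64_t carry = (sum >= NSEC_PER_SEC) ? 1u : 0u;

    *nsec = carry ? sum - NSEC_PER_SEC : sum;
    *sec = anchor_sec + carry;
}
```

Note that the multiplier rounds down slightly, so 24 ticks convert to 999 ns rather than 1000; that is the rounding the host-side drift detector has to reproduce.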

Max042004 · 2026-05-30T02:15:18Z

The same statically-linked bench-hot-guard binary was executed inside orbstack via /mnt/mac/.... Its symbol-resolution code (AT_SYSINFO_EHDR + DT_HASH) walks whatever vDSO the running kernel publishes — elfuse's synthetic one in the elfuse case, the real Linux 7.0.5 vDSO in the orbstack case — so this is an apples-to-apples comparison of two vDSO __kernel_clock_gettime implementations on the same CPU.

5 runs each, ns/op:

| | run 1 | run 2 | run 3 | run 4 | run 5 | median |
|---|---|---|---|---|---|---|
| elfuse `f5b3e21` | 2.6 | 2.2 | 2.0 | 1.8 | 1.9 | 2.0 |
| orbstack (Linux 7.0.5 vDSO) | 8.7 | 9.1 | 9.3 | 9.5 | 8.8 | 9.1 |

jserv requested a review from Max042004 May 29, 2026 08:20

jserv force-pushed the vdso branch from 278fbc2 to 8ee57eb Compare May 29, 2026 08:23

sysprog21 deleted a comment from cubic-dev-ai Bot May 29, 2026

jserv added 2 commits May 29, 2026 17:29

Max042004 approved these changes May 29, 2026

jserv merged commit ed1811b into main May 30, 2026
5 checks passed

jserv deleted the vdso branch May 30, 2026 07:28

jserv merged 4 commits into main from vdso
