Add vDSO seqlock refresh and fast paths #52

Merged: jserv merged 4 commits into main from vdso on May 30, 2026
@jserv (Contributor) commented May 29, 2026

This extends the synthetic vDSO at AT_SYSINFO_EHDR with four new fast paths and rebuilds the anchor publication protocol so the anchor can be refreshed safely while concurrent guest readers are interpolating.

New trampolines, each ending in its own SVC fallback so the dynamic linker sees a complete __kernel_* symbol:

  __kernel_clock_getres   76 B / 19 instructions. Returns {0, 1}
                          inline for REALTIME, MONOTONIC, MONOTONIC_RAW,
                          REALTIME_COARSE, MONOTONIC_COARSE, BOOTTIME.
                          CPU and dynamic per-pid clockids SVC out.

  __kernel_gettimeofday  160 B / 40 instructions. Mirrors clock_gettime
                          using the REALTIME anchor and divides by 1000
                          for tv_usec. tz, if non-NULL, gets one
                          str xzr to clear the obsolete struct timezone.

  __kernel_getcpu         52 B / 13 instructions. Stores zero to both
                          out-pointers (elfuse models one CPU / one
                          node) and returns 0.

  __kernel_clock_gettime grows to 168 B / 42 instructions to hold the
                          anchor-age cap and the seqlock recheck added
                          to the existing CNTVCT trampoline.
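
The clock_getres classification above can be sketched in C. This is only an illustration of the inline-versus-SVC split, not the trampoline itself: `fast_clock_getres`, `struct lx_timespec`, the `LX_` constants, and the NULL-pointer handling are hypothetical names and choices, while the numeric clock IDs are the Linux UAPI values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Linux UAPI clock IDs, spelled out so the sketch does not rely on host headers. */
enum {
    LX_CLOCK_REALTIME         = 0,
    LX_CLOCK_MONOTONIC        = 1,
    LX_CLOCK_MONOTONIC_RAW    = 4,
    LX_CLOCK_REALTIME_COARSE  = 5,
    LX_CLOCK_MONOTONIC_COARSE = 6,
    LX_CLOCK_BOOTTIME         = 7,
};

/* Hypothetical guest timespec layout (two 64-bit fields on AArch64 Linux). */
struct lx_timespec {
    int64_t tv_sec;
    int64_t tv_nsec;
};
static_assert(sizeof(struct lx_timespec) == 16, "two 64-bit fields");

/* Returns true when the request was answered inline with {0, 1};
 * false means the caller has to take the SVC fallback. */
static bool fast_clock_getres(int32_t clockid, struct lx_timespec *res)
{
    switch (clockid) {
    case LX_CLOCK_REALTIME:
    case LX_CLOCK_MONOTONIC:
    case LX_CLOCK_MONOTONIC_RAW:
    case LX_CLOCK_REALTIME_COARSE:
    case LX_CLOCK_MONOTONIC_COARSE:
    case LX_CLOCK_BOOTTIME:
        if (res != NULL) {          /* assumption: NULL res is tolerated */
            res->tv_sec = 0;
            res->tv_nsec = 1;
        }
        return true;
    default:
        /* CPU-time clocks (2, 3) and negative dynamic per-pid IDs. */
        return false;
    }
}
```

CPU-time and dynamic clock IDs land in the `default` arm, which corresponds to the "SVC out" case in the list.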

VDSO_NUM_SYMS goes from 4 to 5; dynstr_data widens to 119 bytes; all post-text section offsets shift, but the populated region still ends at 0x6A0, inside the 4 KiB page.

The vvar's first uint32 is now a Linux-style seqlock counter:

  0           unseeded, no anchor data
  odd  N >= 1 writer has reserved generation (N+1)/2
  even N >= 2 stable generation N/2, anchor fields readable

vdso_seed_anchor publishes through one CAS-then-release-store sequence that handles both initial seeding and refresh:

  load(seq, ACQUIRE), bail if odd
  CAS(seq, cur, cur+1, ACQUIRE, RELAXED), bail on contention
  thread_fence(RELEASE)           // CAS odd-publish before field stores
  store(field_i, RELAXED) * 5
  store(seq, cur+2, RELEASE)      // publish next even

The thread_fence(RELEASE) lowers to DMB ISH on AArch64 and closes the window where another CPU could observe the relaxed field stores before the odd-publish, since ARMv8 is not multi-copy atomic for unsynchronized stores to different locations. Without it a reader whose snapshot LDAR still saw the old even seq could read fields from the new generation and recheck the same old even seq, accepting torn data.
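
The writer sequence above maps fairly directly onto C11 atomics. The following is a minimal sketch under assumptions: `struct vvar_writer_sketch`, its five 64-bit fields, and `seed_anchor_sketch` are hypothetical stand-ins, since the PR text only states "one seq word plus five fields". Counter wraparound is ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vvar layout: seqlock counter followed by five anchor fields. */
struct vvar_writer_sketch {
    _Atomic uint32_t seq;
    _Atomic uint64_t field[5];
};

/* Returns true if this caller published a new generation, false if it
 * backed off because another writer was active or won the CAS. */
static bool seed_anchor_sketch(struct vvar_writer_sketch *v,
                               const uint64_t next[5])
{
    assert(v != NULL && next != NULL);

    uint32_t cur = atomic_load_explicit(&v->seq, memory_order_acquire);
    if (cur & 1u)
        return false;                       /* odd: a writer holds the slot */

    /* Reserve the next generation by moving the counter to odd. */
    if (!atomic_compare_exchange_strong_explicit(&v->seq, &cur, cur + 1,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
        return false;                       /* contention: someone else won */

    /* Order the odd-publish before the relaxed field stores. */
    atomic_thread_fence(memory_order_release);

    for (size_t i = 0; i < 5; i++)
        atomic_store_explicit(&v->field[i], next[i], memory_order_relaxed);

    /* Publish the next even value; readers may now trust the fields. */
    atomic_store_explicit(&v->seq, cur + 2, memory_order_release);
    return true;
}
```

The same function covers initial seeding (0 to 1 to 2) and refresh (N to N+1 to N+2), which is why a single sequence suffices for both.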

Trampoline readers snapshot the seq with LDAR, fall back on 0 or odd, plain-load anchor fields, DMB ISHLD, LDAR the seq again, and SVC on mismatch. The DMB ISHLD is load-bearing: LDAR provides forward acquire only, so without it the recheck load can be observed before the plain field LDR/LDPs complete and the race goes undetected. The host helpers (vdso_anchor_age_exceeded, vdso_realtime_drift_exceeded) read through a new vvar_snapshot_anchor() that mirrors this protocol in C: relaxed atomic field loads with __atomic_thread_fence(__ATOMIC_ACQUIRE) before the recheck.
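
The host-side read protocol can be sketched the same way. This is an illustration, not the actual vvar_snapshot_anchor(): `struct vvar_reader_sketch` and `snapshot_anchor_sketch` are hypothetical, the field layout is assumed, and the acquire ordering on the recheck load is chosen here to mirror the trampoline's second LDAR.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vvar layout: seqlock counter followed by five anchor fields. */
struct vvar_reader_sketch {
    _Atomic uint32_t seq;
    _Atomic uint64_t field[5];
};

/* Copies a consistent anchor into out[] and returns true, or returns false
 * when the anchor is unseeded, mid-update, or changed during the copy. */
static bool snapshot_anchor_sketch(struct vvar_reader_sketch *v,
                                   uint64_t out[5])
{
    assert(v != NULL && out != NULL);

    uint32_t before = atomic_load_explicit(&v->seq, memory_order_acquire);
    if (before == 0 || (before & 1u))
        return false;                 /* 0 = unseeded, odd = writer active */

    for (size_t i = 0; i < 5; i++)
        out[i] = atomic_load_explicit(&v->field[i], memory_order_relaxed);

    /* Keep the field loads from sinking below the recheck load; this is
     * the C counterpart of the DMB ISHLD in the trampolines. */
    atomic_thread_fence(memory_order_acquire);

    uint32_t after = atomic_load_explicit(&v->seq, memory_order_acquire);
    return before == after;
}
```

A caller that gets `false` would treat the anchor as unusable for this call, which is the C analogue of the trampoline taking its SVC fallback.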

Two staleness gates drive the refresh:

  • The trampolines LSR + CBNZ the CNTVCT delta against a 2**31-cycle cap (~89 s at 24 MHz). A stale anchor falls back to SVC so the host can publish a fresh one.
  • sys_clock_gettime and sys_gettimeofday sample both host clocks back-to-back on the vDSO SVC fallback and call vdso_seed_anchor whenever the anchor is unseeded, has aged out, or has drifted past VDSO_ANCHOR_MAX_DRIFT_NS (100 ms) relative to a fresh REALTIME. This catches macOS NTP wall-clock steps without a host timer thread. The drift detector short-circuits on age, guards anchor_sec + delta_sec with __builtin_add_overflow, and saturates the cross-second diff before the * 1e9 multiply so adversarial vvar values cannot trip signed overflow.
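
Both gates reduce to small integer checks. The sketch below is written under assumptions: the function names, the parameter split, and the early-return form of the saturation are hypothetical and only illustrate the overflow-guarding idea; the shift of 31 and the 100 ms threshold come from the text above. `__builtin_add_overflow` and `__builtin_sub_overflow` are GCC/Clang builtins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AGE_SHIFT     31                /* 2**31 cycles, ~89 s at 24 MHz */
#define MAX_DRIFT_NS  100000000LL       /* 100 ms */
#define NSEC_PER_SEC  1000000000LL

static_assert(AGE_SHIFT > 0 && AGE_SHIFT < 64, "shift must fit in 64 bits");

/* Gate 1: the anchor is stale when any bit at or above AGE_SHIFT is set in
 * the cycle delta. An anchor "from the future" wraps to a huge unsigned
 * delta and is therefore also reported as stale. */
static bool age_exceeded(uint64_t now_cycles, uint64_t anchor_cycles)
{
    return ((now_cycles - anchor_cycles) >> AGE_SHIFT) != 0;
}

/* Gate 2: compare the interpolated time (anchor_sec + delta_sec, pred_nsec)
 * against a fresh REALTIME sample without ever overflowing int64_t, even
 * if the anchor values are adversarial. */
static bool drift_exceeded(int64_t anchor_sec, int64_t delta_sec,
                           int64_t pred_nsec,
                           int64_t fresh_sec, int64_t fresh_nsec)
{
    int64_t pred_sec, dsec;

    if (pred_nsec < 0 || pred_nsec >= 2 * NSEC_PER_SEC)
        return true;                    /* implausible anchor: force refresh */
    if (fresh_nsec < 0 || fresh_nsec >= NSEC_PER_SEC)
        return true;
    if (__builtin_add_overflow(anchor_sec, delta_sec, &pred_sec))
        return true;
    if (__builtin_sub_overflow(fresh_sec, pred_sec, &dsec))
        return true;
    if (dsec > 2 || dsec < -2)
        return true;                    /* far past 100 ms; skip the multiply */

    int64_t diff = dsec * NSEC_PER_SEC + (fresh_nsec - pred_nsec);
    if (diff < 0)
        diff = -diff;
    return diff > MAX_DRIFT_NS;
}
```

Bounding the whole-second difference before the multiply keeps `dsec * NSEC_PER_SEC` within a few billion, so the final sum cannot overflow regardless of what the vvar contains.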

The host gates the HVF reads (ELR_EL1, X9) on clockid 0 or 1 every time, not just before the anchor is published. The previous short-circuit on vdso_anchor_is_seeded left stale anchors stranded.

sys_gettimeofday now writes 8 bytes of zero to tz_gva when non-null so SVC and fast-path callers see the same tz semantics; previously the SVC path ignored tz while the fast path zeroed it, so the first unseeded fallback with tz != NULL silently diverged.

Measured under tests/bench-vdso.c at 200000 iterations, with the seqlock and DMB ISHLD overhead included:

  clock_gettime(MONOTONIC) :    4.3 ns/op   (SVC: 1324 ns;  311x)
  clock_gettime(REALTIME)  :    4.1 ns/op   (SVC: 1315 ns;  321x)
  clock_getres(MONOTONIC)  :    2.3 ns/op   (SVC: 1349 ns;  591x)
  gettimeofday             :    4.9 ns/op   (SVC: 1331 ns;  271x)
  getcpu                   :    2.0 ns/op   (SVC: 1519 ns;  755x)

The ~1.5 ns increase on clock_gettime over the prior one-shot anchor (~2.5 ns baseline) is the cost of the seqlock recheck plus DMB ISHLD plus the age-cap LSR+CBNZ. clock_getres and getcpu have no anchor dependence so they stay at ~2 ns.


Summary by cubic

Adds a seqlock-protected vDSO time anchor with safe refresh and fast paths for __kernel_clock_getres, __kernel_clock_gettime, __kernel_gettimeofday, and __kernel_getcpu, plus an NT_GNU_ABI_TAG note so dynamic glibc binds to the vDSO. Also collapses the clock_gettime carry to remove the final divide, cutting dyn‑glibc down to ~4 ns.

  • New Features

    • vvar seqlock time anchor with LDAR + DMB ISHLD + recheck and a 2^22‑cycle age cap; fast trampolines with SVC fallbacks for the four __kernel_* symbols. Host refresh on SVC (seed/refresh on unseeded, aged, or >100 ms REALTIME drift); sys_clock_getres returns fixed resolutions for common clockids; sys_gettimeofday zeros tz to match fast‑path behavior.
    • PT_NOTE with NT_GNU_ABI_TAG (Linux 2.6.39); PHDR table relocated to keep vvar/text offsets stable. New tests and benches, plus a hot‑syscall guardrail in make check enforcing ceilings: getpid ≤ 200 ns, libc clock_gettime ≤ 50 ns, 1‑byte /dev/urandom read ≤ 200 ns. Optional dynamic‑glibc bench verifies glibc routes through the vDSO.
  • Refactors

    • Replaced the hot‑path UDIV-by-1e9 in clock_gettime with SUBS+CSEL+CINC by tightening the anchor age shift from 31 to 22; keeps X9 live for the SVC fallback. Dyn‑glibc clock_gettime ~4.0 ns (static ~3.4 ns); costs one extra re‑seed per ~0.175 s idle. Host drift detector updated to the same mult+shift so rounding matches the trampoline.

Written for commit f5b3e21. Summary will update on new commits.


@jserv jserv requested a review from Max042004 May 29, 2026 08:20

jserv added 2 commits May 29, 2026 17:29
The synthetic vDSO at AT_SYSINFO_EHDR already carries DT_HASH,
LINUX_2.6.39 symbol versioning, and five __kernel_* trampolines, but
glibc 2.41's dynamic-linker vDSO probe rejected the page for lack of an
NT_GNU_ABI_TAG note: every dynamically-linked guest fell back to SVC
for clock_gettime, gettimeofday, and clock_getres. PR #34 measured
1006 ns/op against an 18 ns/op OrbStack reference, a 56x gap the
TODO Tier D P1 entry tracked as the highest-leverage single fix.

This adds the note. To avoid moving VVAR (0x0B0), TEXT_OFF_SIGRET
(0x0E0, exported in vdso.h for signal.c), or any trampoline / section
offset, the program-header table relocates from 0x040 to 0x6B0 (after
the section-header area). The reclaimed 0x040 window now holds the
32-byte NT_GNU_ABI_TAG:

  namesz : 4   ("GNU\0")
  descsz : 16
  type   : NT_GNU_ABI_TAG (1)
  name   : "GNU\0"
  desc   : { ELF_NOTE_OS_LINUX (0), 2, 6, 39 }

The descriptor's minimum kernel ABI (2.6.39) matches the LINUX_2.6.39
symbol version already exposed through DT_VERDEF, so a glibc that
honors the version also honors the note. PT_LOAD continues to cover
the whole page so the relocated PHDR table and the note both stay
mapped at runtime.
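
The 32-byte note layout can be written down as a C struct. This is a sketch: `struct abi_tag_note` and `write_abi_tag_note` are hypothetical names, and the code assumes a little-endian host, matching the AArch64 guest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ELF note header followed by the "GNU" name and the four-word ABI tag. */
struct abi_tag_note {
    uint32_t namesz;      /* 4: "GNU" plus its NUL terminator */
    uint32_t descsz;      /* 16: four 32-bit words */
    uint32_t type;        /* 1: NT_GNU_ABI_TAG */
    char     name[4];     /* "GNU\0" */
    uint32_t desc[4];     /* { OS = Linux (0), major, minor, patch } */
};
static_assert(sizeof(struct abi_tag_note) == 32, "note must be 32 bytes");

/* Writes the note at byte offset off inside page. The caller guarantees
 * that off + 32 stays within the page. */
static void write_abi_tag_note(uint8_t *page, size_t off)
{
    const struct abi_tag_note note = {
        .namesz = 4,
        .descsz = 16,
        .type   = 1,
        .name   = "GNU",
        .desc   = { 0, 2, 6, 39 },
    };
    memcpy(page + off, &note, sizeof note);
}
```

Calling `write_abi_tag_note(page, 0x40)` would fill the window that the relocated program-header table vacated.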

Validation, dynamically-linked glibc 2.41 binary built from the
cross-toolchain sysroot at /opt/toolchain/aarch64-linux-gnu (same
toolchain PR #34 used for the baseline):

  libc  clock_gettime  :   6.97 ns/op   (was 1006 ns/op pre-fix)
  direct vDSO call     :   6.24 ns/op   (dlsym function-pointer)
  raw   SVC syscall    : 2047.01 ns/op
  libc/vDSO ratio = 1.12x -- libc IS using the vDSO

The 0.7 ns libc-vs-direct gap is glibc's dl_sysinfo_dso dispatch, not
an SVC fallback. libc clock_gettime now beats the OrbStack reference
(18 ns/op) by ~2.6x. gettimeofday and clock_getres land on the
trampolines through the same probe path:

  libc gettimeofday    :   7.5 ns/op    (vDSO REALTIME anchor reuse)
  libc clock_getres    :   4.9 ns/op    (constant-resolution path)

readelf parses the page cleanly: e_phnum=3, e_phoff=0x6B0, three
PHDRs (PT_LOAD covering the whole page, PT_DYNAMIC at 0x420 size
0x90, PT_NOTE at 0x40 size 0x20), and `readelf -n` decodes the note as
"GNU NT_GNU_ABI_TAG OS: Linux, ABI: 2.6.39". No region overlaps;
total page usage 0x758 / 0x1000.
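
The layout figures above can be cross-checked with a few lines of C. The offsets are the ones quoted in this commit message; the enum names and helper functions are hypothetical, and 56 bytes is the size of an Elf64_Phdr entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    VDSO_PAGE_BYTES = 0x1000,
    PHDR_OFF        = 0x6B0,
    PHDR_COUNT      = 3,
    PHDR_ENTSIZE    = 56,       /* sizeof(Elf64_Phdr) */
    NOTE_OFF        = 0x040,
    NOTE_SIZE       = 0x020,
    DYN_OFF         = 0x420,
    DYN_SIZE        = 0x090,
};

/* End of the relocated program-header table, which is also the end of the
 * populated part of the page. */
static uint32_t phdr_table_end(void)
{
    return PHDR_OFF + PHDR_COUNT * PHDR_ENTSIZE;
}
static_assert(PHDR_OFF + PHDR_COUNT * PHDR_ENTSIZE == 0x758,
              "three PHDRs at 0x6B0 end at 0x758");

/* True when the half-open byte ranges [a, a+alen) and [b, b+blen) intersect. */
static bool ranges_overlap(uint32_t a, uint32_t alen, uint32_t b, uint32_t blen)
{
    return a < b + blen && b < a + alen;
}
```

Three 56-byte entries starting at 0x6B0 end at 0x758, which matches the reported page usage.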

Static vDSO bench unchanged at 6 ns/op for the time fast paths; the
PHDR relocation only shifts where the dynamic linker looks for the
table and does not touch any code the trampolines execute. test-signal
explicit run passes, confirming the unchanged TEXT_OFF_SIGRET=0xE0
trampoline still drives the libc __restore_rt path.

Three hot paths the PR #34 OrbStack baseline tracked -- getpid (~47
ns), clock_gettime through the vDSO (~2.5 ns), and 1-byte
/dev/urandom read (~134 ns) -- had no automated regression check. A
silent slip-back to the SVC fallback turned each into a ~1-2 us trap
without anything in CI to notice.

This adds an explicit guardrail. tests/bench-hot-guard.c resolves
__kernel_clock_gettime via AT_SYSINFO_EHDR + PT_DYNAMIC + DT_HASH (SysV
ELF hash walk) and measures three labels in fixed-width
"%-20s %10.1f ns/op  last=%ld" output: getpid (raw SVC), clock_gettime
(vDSO trampoline), and read-urandom1 (raw 1-byte read of /dev/urandom).
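
For reference, a SysV DT_HASH walk looks roughly like the sketch below. It is not the code from tests/bench-hot-guard.c: `struct sym64` is a hand-written copy of the Elf64_Sym layout (so the example does not need elf.h), and `sysv_elf_hash` and `hash_lookup` are hypothetical names. Symbol-version checks are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as Elf64_Sym. */
struct sym64 {
    uint32_t st_name;
    uint8_t  st_info;
    uint8_t  st_other;
    uint16_t st_shndx;
    uint64_t st_value;
    uint64_t st_size;
};
static_assert(sizeof(struct sym64) == 24, "Elf64_Sym is 24 bytes");

/* The classic System V ABI ELF hash function. */
static uint32_t sysv_elf_hash(const char *name)
{
    uint32_t h = 0;
    while (*name != '\0') {
        h = (h << 4) + (unsigned char)*name++;
        uint32_t g = h & 0xf0000000u;
        if (g != 0)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* DT_HASH layout: nbucket, nchain, bucket[nbucket], chain[nchain].
 * Index 0 (STN_UNDEF) terminates every chain. */
static const struct sym64 *hash_lookup(const uint32_t *hash,
                                       const struct sym64 *symtab,
                                       const char *strtab,
                                       const char *name)
{
    uint32_t nbucket = hash[0];
    const uint32_t *bucket = hash + 2;
    const uint32_t *chain = bucket + nbucket;

    if (nbucket == 0)
        return NULL;
    for (uint32_t i = bucket[sysv_elf_hash(name) % nbucket]; i != 0; i = chain[i]) {
        if (strcmp(strtab + symtab[i].st_name, name) == 0)
            return &symtab[i];
    }
    return NULL;
}
```

In the bench, the `hash`, `symtab`, and `strtab` pointers would come from the DT_HASH, DT_SYMTAB, and DT_STRTAB entries found through PT_DYNAMIC, offset by the load address from AT_SYSINFO_EHDR.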

The same source builds two binaries via a compile-time switch:
  build/bench-hot-guard
        Static glibc. Built without the macro. clock_gettime invokes
        the trampoline directly through the resolved function pointer.
        Static glibc never initializes dl_sysinfo_dso, so its libc
        wrapper falls back to raw SVC for reasons unrelated to the
        vDSO; measuring the wrapper would fail the 50 ns ceiling for
        the wrong reason. Direct call isolates the trampoline.

  build/bench-hot-guard-glibc
        Dynamic glibc. Built with -DGUARD_USE_LIBC_CG=1.
        clock_gettime invokes glibc's clock_gettime() wrapper -- which
        on glibc 2.41 + a correctly-stamped vDSO (NT_GNU_ABI_TAG
        PT_NOTE, LINUX_2.6.39 versioning) routes through the
        trampoline. A regression in the note or versioning would push
        this measurement from ~7 ns to SVC range and trip the ceiling.
        Built only when the cross-toolchain sysroot at
        $(LINUX_TOOLCHAIN)/aarch64-unknown-linux-gnu/sysroot exists;
        run with elfuse --sysroot at that path.

Disassembly verifies the split: the dynamic binary lowers
bench_clock_gettime to "bl <clock_gettime@plt>" while the static
binary lowers it to "ldr x2, [x1], #8" + indirect dispatch.
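
The compile-time switch can be pictured as follows. Only the macro GUARD_USE_LIBC_CG and the function name bench_clock_gettime appear in this commit message; the typedef `cg_fn`, the global `resolved_cg`, and the overall shape are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature shared by glibc's clock_gettime and __kernel_clock_gettime. */
typedef int (*cg_fn)(clockid_t, struct timespec *);

/* Hypothetical: filled in once by the vDSO symbol lookup in the static build. */
static cg_fn resolved_cg;

static int bench_clock_gettime(clockid_t id, struct timespec *ts)
{
#ifdef GUARD_USE_LIBC_CG
    /* Dynamic build: go through the libc wrapper (a PLT call). */
    return clock_gettime(id, ts);
#else
    /* Static build: call the resolved trampoline pointer directly. */
    assert(resolved_cg != NULL);
    return resolved_cg(id, ts);
#endif
}
```

Building with `-DGUARD_USE_LIBC_CG=1` selects the first arm; without the macro, the indirect call is used, which is consistent with the disassembly difference described above.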

Validation:

  static     getpid 50.4 ns, clock_gettime  6.7 ns, urandom 141.9 ns
  dyn-glibc  getpid 71.9 ns, clock_gettime 17.8 ns, urandom 147.9 ns

@Max042004 (Collaborator) commented:

The same python3.12 ELF binary was run on Apple M-series HVF under both elfuse and orbstack (Ubuntu 24.04 noble arm64, Linux 7.0.5-orbstack) to compare time-related syscall / vDSO behavior.

Startup time: python3.12 -c 'pass' (10 runs)

                   samples (ms)                   min  median
  elfuse (PR #52)  48 41 35 34 34 34 34 34 34 34  34   34
  orbstack         55 35 35 35 35 34 35 35 35 36  34   35

vDSO-heavy Python loops (N=500k, 5 rounds, median ns/op)

  Operation                                    elfuse (PR #52)  orbstack  Gap
  time.time() (glibc → __kernel_gettimeofday)  36.2             37.0      elfuse 2% faster
  time.monotonic_ns() (→ clock_gettime(MONO))  35.2             30.1      orbstack 17% faster
  clock_gettime(CLOCK_REALTIME)                49.3             40.2      orbstack 23% faster
  clock_gettime(CLOCK_MONOTONIC)               49.8             40.4      orbstack 23% faster
  clock_getres(CLOCK_MONOTONIC)                49.3             38.9      orbstack 27% faster

CPU-bound control: fibonacci(50000) (10 runs)

                   samples (ms)                   min  median
  elfuse (PR #52)  51 52 51 51 53 52 51 51 51 51  51   51
  orbstack         52 52 52 51 52 51 53 51 52 52  51   52

The trampoline's last divide -- UDIV by 1e9 to split anchor_nsec +
delta_ns into sec_carry and nsec_out -- runs ~10-22 cycles on Apple
Silicon and is not pipelined. Tightening VDSO_ANCHOR_AGE_SHIFT from 31
to 22 caps delta_ns at ~175e6 ns and bounds the sum below 2e9, so the
quotient is always 0 or 1. That collapses the carry to a single SUBS +
CSEL + CINC (~3 cycles, fully pipelined), eliminating the only remaining
hot-path divide in __kernel_clock_gettime.
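
In C terms, the carry collapse relies on the sum staying below 2e9, so a single compare replaces the divide. The sketch below uses a hypothetical helper `split_carry`; the mapping of each line to SUBS, CSEL, and CINC is approximate and given only for orientation.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Splits anchor_nsec + delta_ns into a 0/1 second carry and a nanosecond
 * remainder. Valid only while the sum is below 2 * NSEC_PER_SEC, which the
 * 2**22-cycle age cap is meant to guarantee. */
static void split_carry(uint64_t anchor_nsec, uint64_t delta_ns,
                        uint64_t *sec_carry, uint64_t *nsec_out)
{
    uint64_t sum = anchor_nsec + delta_ns;
    assert(sum < 2 * NSEC_PER_SEC);              /* the overflow invariant */

    uint64_t over = (sum >= NSEC_PER_SEC);       /* roughly the SUBS flags */
    *nsec_out  = over ? sum - NSEC_PER_SEC : sum;    /* roughly the CSEL */
    *sec_carry = over;                           /* CINC adds this to seconds */
}
```

Under the invariant, the result equals `sum / 1000000000` and `sum % 1000000000`, just without the division.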

The shift change costs one extra SVC re-seed per ~0.175 s of idle, which
is negligible compared to the per-call gain. Measured on M1: dyn-glibc
clock_gettime drops from ~6.5 ns/op baseline to ~4.0 ns/op (~38% faster,
guardrail static path 3.4 ns), closing most of the remaining gap to the
PR #52 OrbStack baseline. The host-side vdso_realtime_drift_exceeded was
also updated to the matching (delta * 699050666) >> 24 mult+shift so the
drift detector cannot mis-classify the trampoline's own rounding.
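
The mult+shift is a fixed-point approximation of 1e9 / 24e6 ≈ 41.667 ns per cycle, since 699050666 / 2^24 ≈ 41.6667. A small sketch, assuming the 24 MHz counter frequency quoted earlier in this PR; `cycles_to_ns` and the macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define AGE_SHIFT 22             /* anchor age cap: delta below 2**22 cycles */
#define NS_MULT   699050666ull   /* floor((1e9 / 24e6) * 2**24) */
#define NS_SHIFT  24

/* Converts a capped cycle delta to nanoseconds. With delta below 2**22 and
 * the multiplier below 2**30, the product stays under 2**52, so the 64-bit
 * multiply cannot overflow. */
static uint64_t cycles_to_ns(uint64_t delta_cycles)
{
    assert((delta_cycles >> AGE_SHIFT) == 0);
    return (delta_cycles * NS_MULT) >> NS_SHIFT;
}
```

Because the multiplier is rounded down, the result can sit one nanosecond below the exact value (24 cycles yields 999 ns rather than 1000 ns), which is why the host has to use the identical formula to avoid misreading that rounding as drift.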

x9 stays live across the new math and reaches the SVC fallback intact,
preserving the trustworthy CNTVCT contract with sys_clock_gettime. The
overflow invariant is documented on vdso_seed_anchor in vdso.h.

@Max042004 (Collaborator) commented May 30, 2026

The same statically-linked bench-hot-guard binary was executed inside orbstack via /mnt/mac/.... Its symbol-resolution code (AT_SYSINFO_EHDR + DT_HASH) walks whatever vDSO the running kernel publishes — elfuse's synthetic one in the elfuse case, the real Linux 7.0.5 vDSO in the orbstack case — so this is an apples-to-apples comparison of two vDSO __kernel_clock_gettime implementations on the same CPU.

5 runs each, ns/op:

                               run 1  run 2  run 3  run 4  run 5  median
  elfuse f5b3e21               2.6    2.2    2.0    1.8    1.9    2.0
  orbstack (Linux 7.0.5 vDSO)  8.7    9.1    9.3    9.5    8.8    9.1

@jserv jserv merged commit ed1811b into main May 30, 2026
5 checks passed
@jserv jserv deleted the vdso branch May 30, 2026 07:28