Add OCI image support: pull, unpack, run, prune, status, policy #34
Open
Max042004 wants to merge 1 commit into
jserv
requested changes
May 23, 2026
Contributor
jserv
left a comment
Rebase onto the latest main branch and squash/rework the commits into fewer, cleaner ones.
e988d6e to
5d6dbc7
Compare
This was referenced May 26, 2026
Closed
Implement the full elfuse OCI image lifecycle as a self-contained `elfuse oci` subcommand. Image distribution never touches Hypervisor.framework, so the subcommand dispatches in main() before any guest setup; only `oci run` enters the VM bring-up path.

- pull / inspect: content-addressable blob store over HTTPS with bearer-token + Basic auth, OCI index walk to the linux/arm64 leaf, parallel blob fetch with HTTP Range resume, offline inspect renderer.
- unpack: tar reader (ustar + PAX x/g records), gzip + decode-only vendored zstd, whiteout-aware layer apply, per-image case-sensitive APFS sysroot; cross-volume unpack via copyfile(2) with clone fallback.
- run: clonefile(2) per-run rootfs; Entrypoint / Cmd / Env / WorkingDir and symbolic/numeric User honoured; reuses the shared elfuse_launch bring-up so a dynamic guest runs through the same shim + syscall path.
- lifecycle: prune (--older-than / --keep-bytes), per-layer + ChainID stack snapshot caches, oci status (text + --json), rebuild-cache.
- policy: podman/skopeo-style policy.json + registries.d overlay; loopback-gated --insecure; CLI flags override.

Extract the VM bring-up from main() into core/launch.c (elfuse_launch) so oci run and the positional-ELF main share one path; the host-path resolution now lives in the caller per the guest_bootstrap_prepare split.

Vendors decode-only zstd and cJSON; uses system zlib and libcurl. Adds 25 native test-oci-* unit suites plus an opt-in heavy compat mode.
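As a rough illustration of the "HTTP Range resume" item, the sketch below formats the request header that asks a registry for the remainder of a blob once some bytes are already on disk. The helper name `format_resume_range` is hypothetical and not taken from the PR; the PR uses libcurl, and how it actually issues the ranged request is not shown in the commit message.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the PR): format the HTTP Range request
 * header that resumes a download after `have` bytes already on disk.
 * Returns the header length, or -1 if the buffer is too small. */
int format_resume_range(char *buf, size_t cap, unsigned long long have)
{
    /* "bytes=N-" asks for everything from offset N to the end. */
    int n = snprintf(buf, cap, "Range: bytes=%llu-", have);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```

A server that honors the range answers 206 Partial Content; a 200 response means the full blob is being resent and the partial file has to be truncated first.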
jserv
added a commit
that referenced
this pull request
May 29, 2026
The synthetic vDSO at AT_SYSINFO_EHDR already carries DT_HASH, LINUX_2.6.39 symbol versioning, and five __kernel_* trampolines, but glibc 2.41's dynamic-linker vDSO probe rejected the page for lack of an NT_GNU_ABI_TAG note: every dynamically-linked guest fell back to SVC for clock_gettime, gettimeofday, and clock_getres. PR #34 measured 1006 ns/op against an 18 ns/op OrbStack reference, a 56x gap the TODO Tier D P1 entry tracked as the highest-leverage single fix.

This adds the note. To avoid moving VVAR (0x0B0), TEXT_OFF_SIGRET (0x0E0, exported in vdso.h for signal.c), or any trampoline / section offset, the program-header table relocates from 0x040 to 0x6B0 (after the section-header area). The reclaimed 0x040 window now holds the 32-byte NT_GNU_ABI_TAG:

    namesz : 4 ("GNU\0")
    descsz : 16
    type   : NT_GNU_ABI_TAG (1)
    name   : "GNU\0"
    desc   : { ELF_NOTE_OS_LINUX (0), 2, 6, 39 }

The descriptor's minimum kernel ABI (2.6.39) matches the LINUX_2.6.39 symbol version already exposed through DT_VERDEF, so a glibc that honors the version also honors the note. PT_LOAD continues to cover the whole page so the relocated PHDR table and the note both stay mapped at runtime.

Validation, dynamically-linked glibc 2.41 binary built from the cross-toolchain sysroot at /opt/toolchain/aarch64-linux-gnu (same toolchain PR #34 used for the baseline):

    libc clock_gettime : 6.97 ns/op (was 1006 ns/op pre-fix)
    direct vDSO call   : 6.24 ns/op (dlsym function-pointer)
    raw SVC syscall    : 2047.01 ns/op
    libc/vDSO ratio = 1.12x -- libc IS using the vDSO

The 0.7 ns libc-vs-direct gap is glibc's dl_sysinfo_dso dispatch, not an SVC fallback. libc clock_gettime now beats the OrbStack reference (18 ns/op) by ~2.6x.

gettimeofday and clock_getres land on the trampolines through the same probe path:

    libc gettimeofday : 7.5 ns/op (vDSO REALTIME anchor reuse)
    libc clock_getres : 4.9 ns/op (constant-resolution path)

readelf parses the page cleanly: e_phnum=3, e_phoff=0x6B0, three PHDRs (PT_LOAD covering the whole page, PT_DYNAMIC at 0x420 size 0x90, PT_NOTE at 0x40 size 0x20), and `readelf -n` decodes the note as "GNU NT_GNU_ABI_TAG OS: Linux, ABI: 2.6.39". No region overlaps; total page usage 0x758 / 0x1000.

Static vDSO bench unchanged at 6 ns/op for the time fast paths; the PHDR relocation only shifts where the dynamic linker looks for the table and does not touch any code the trampolines execute. test-signal explicit run passes, confirming the unchanged TEXT_OFF_SIGRET=0xE0 trampoline still drives the libc __restore_rt path.
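The 32-byte note described above can be pictured as a plain C struct. This is only a sketch of the layout: the struct and function names are hypothetical, the field values are the ones listed in the commit message, and the way the actual vDSO page is assembled in `vdso.c` is not shown in the source.

```c
#include <stdint.h>
#include <string.h>

/* ELF note carrying NT_GNU_ABI_TAG:
 * 12-byte header + 4-byte name + 16-byte descriptor = 32 bytes. */
struct abi_tag_note {
    uint32_t namesz;   /* 4: "GNU" plus the terminating NUL */
    uint32_t descsz;   /* 16: four 32-bit words */
    uint32_t type;     /* NT_GNU_ABI_TAG == 1 */
    char     name[4];  /* "GNU\0" */
    uint32_t desc[4];  /* { OS, major, minor, patch } */
};

/* Hypothetical builder; values are taken from the commit message. */
struct abi_tag_note make_abi_tag_note(void)
{
    struct abi_tag_note n;
    memset(&n, 0, sizeof n);
    n.namesz = 4;
    n.descsz = 16;
    n.type = 1;                /* NT_GNU_ABI_TAG */
    memcpy(n.name, "GNU", 4);  /* copies the trailing NUL as well */
    n.desc[0] = 0;             /* ELF_NOTE_OS_LINUX */
    n.desc[1] = 2;             /* minimum kernel ABI 2.6.39 */
    n.desc[2] = 6;
    n.desc[3] = 39;
    return n;
}
```

Every member is 4-byte aligned, so the struct has no padding and its size matches the 0x20 bytes reported for the PT_NOTE segment.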
jserv
added a commit
that referenced
this pull request
May 29, 2026
Three hot paths the PR #34 OrbStack baseline tracked -- getpid (~47 ns), clock_gettime through the vDSO (~2.5 ns), and 1-byte /dev/urandom read (~134 ns) -- had no automated regression check. A silent slip-back to the SVC fallback turned each into a ~1-2 us trap without anything in CI to notice. This adds an explicit guardrail.

tests/bench-hot-guard.c resolves __kernel_clock_gettime via AT_SYSINFO_EHDR + PT_DYNAMIC + DT_HASH (SysV ELF hash walk) and measures three labels in fixed-width "%-20s %10.1f ns/op last=%ld" output: getpid (raw SVC), clock_gettime (vDSO trampoline), and read-urandom1 (raw 1-byte read of /dev/urandom). The same source builds two binaries via a compile-time switch:

  build/bench-hot-guard
    Static glibc. Built without the macro. clock_gettime invokes the trampoline directly through the resolved function pointer. Static glibc never initializes dl_sysinfo_dso, so its libc wrapper falls back to raw SVC for reasons unrelated to the vDSO; measuring the wrapper would fail the 50 ns ceiling for the wrong reason. Direct call isolates the trampoline.

  build/bench-hot-guard-glibc
    Dynamic glibc. Built with -DGUARD_USE_LIBC_CG=1. clock_gettime invokes glibc's clock_gettime() wrapper -- which on glibc 2.41 + a correctly-stamped vDSO (NT_GNU_ABI_TAG PT_NOTE, LINUX_2.6.39 versioning) routes through the trampoline. A regression in the note or versioning would push this measurement from ~7 ns to SVC range and trip the ceiling. Built only when the cross-toolchain sysroot at $(LINUX_TOOLCHAIN)/aarch64-unknown-linux-gnu/sysroot exists; run with elfuse --sysroot at that path.

Disassembly verifies the split: the dynamic binary lowers bench_clock_gettime to "bl <clock_gettime@plt>" while the static binary lowers it to "ldr x2, [x1], #8" + indirect dispatch.

Validation:

  static    getpid 50.4 ns, clock_gettime 6.7 ns, urandom 141.9 ns
  dyn-glibc getpid 71.9 ns, clock_gettime 17.8 ns, urandom 147.9 ns
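The DT_HASH lookup mentioned above relies on the standard System V ELF hash function. The sketch below shows that function only; the name `sysv_elf_hash` is illustrative, and the bucket/chain walk in tests/bench-hot-guard.c is not reproduced here because the source does not show it.

```c
#include <stdint.h>

/* System V gABI symbol hash, as used by DT_HASH tables. */
uint32_t sysv_elf_hash(const char *name)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        h = (h << 4) + *p;
        uint32_t g = h & 0xf0000000u;  /* top nibble, if any */
        if (g)
            h ^= g >> 24;              /* fold it back into the low bits */
        h &= ~g;                       /* and clear it */
    }
    return h;
}
```

A resolver would typically start at `bucket[hash % nbucket]` and follow the chain array, comparing symbol names until it reaches the wanted entry or the terminating index 0.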
jserv
added a commit
that referenced
this pull request
May 30, 2026
The dynamic-linker bring-up storm was the largest remaining startup band after pull request #34. Adding a per-syscall histogram pointed at the sidecar walker as the dominant openat cost (61% of getent startup), the per-call path_translation_t memset as the second source, and the opened_fd_type fstat as a small but real per-open round-trip.

src/debug/syscall-hist.[ch]: opt-in histogram via ELFUSE_STARTUP_TRACE=syscalls (or =all alongside the existing step trace). Lock-free atomic counters per Linux syscall number, sorted total-ns descending in the dump. Records freeze on the first successful execve so steady-state traffic does not pollute the startup picture. Fork children disable the histogram explicitly because they resume from a parent snapshot, not a fresh bring-up.

src/syscall/sidecar.c: First, a per-directory absence cache keyed by (st_dev, st_ino, mtime, ctime) so the walker can skip the openat for .elfuse-sidecar-index when a recent fstat on the same dirfd already saw ENOENT. The mtime/ctime in the key closes ABA naturally and makes a cross-process index publish observable without explicit invalidation. Second, a cached sysroot dirfd handed out as fcntl(F_DUPFD_CLOEXEC, 0) so each translated absolute path saves the ~30 us open(sysroot) round-trip and the dup carries CLOEXEC across any racing posix_spawn.

src/syscall/path.c: drop the per-call zero-init of path_translation_t. The struct is ~12 KiB (24 metadata bytes plus three LINUX_PATH_MAX buffers) and the buffers are read-after-written by their respective resolvers. memset of all three was the dominant remaining cost after the sidecar caches.

src/core/elf.c: skip the redundant memset of the file-data range in elf_map_segments. The loader previously zeroed the full page-aligned segment extent before issuing fread; now only the BSS portion plus page padding (filesz to zero_len) is zeroed.

src/syscall/fs.c: skip opened_fd_type fstat when neither O_PATH nor O_DIRECTORY is set. Dynamic-linker opens are overwhelmingly regular files where the type is already implied. The corner where a guest opens a directory without O_DIRECTORY and then issues getdents now returns ENOTDIR; glibc fdopendir has required O_DIRECTORY since 2009 and the test corpus does not exercise the corner.

src/core/startup-trace.h: env parsing extended to comma-separated tokens (steps, syscalls, all); legacy =1 keeps enabling steps only so existing scripts keep working.

Measurement: 30-run distributions under ELFUSE_STARTUP_TRACE=syscalls, warm cache:

  bench-hot-guard-glibc startup syscalls: 5.225 ms baseline (single sample) -> 1.33 ms p50 (p25 1.21, p75 1.55, stdev 0.45, n=30) 3.9x
  bench openat per-call: 135 us baseline -> 33.4 us p50 (p25 32.4, p75 35.8, stdev 7.1, n=30) 4.0x
  getent passwd root startup syscalls: 7.478 ms baseline -> 2.22 ms p50 (p25 2.10, p75 2.28, stdev 0.27, n=30) 3.4x
  getent openat per-call: 230 us baseline -> 52.9 us p50 (p25 51.5, p75 55.1, stdev 2.2, n=30) 4.3x

End-to-end wall-clock for getent: 14.6 ms p50 (p25 14.3, p75 15.1, stdev 1.18, n=30).

Bench guardrail steady-state: static getpid 74 ns, clock_gettime 6.7 ns, urandom1 153 ns; dynamic-glibc getpid 53 ns, clock_gettime 6.4 ns, urandom1 142 ns. All under ceilings.

The original baselines were single first-run samples; their variance band was not measured, so the speedup ratios are best-effort relative to the cited starting point.

Lazy FD_REGULAR to FD_DIR promotion in sys_getdents64 was attempted but dropped after both reviewers flagged a HIGH-severity ABA hole: a sibling close+reopen between the probe and the install could land the original directory's DIR* onto a fresh regular file's slot. The fix path (fd-slot generation counter or stat+inode comparison under fd_lock) was invasive enough that the lazy promotion did not pay for its complexity.
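The comma-separated ELFUSE_STARTUP_TRACE parsing described above could look roughly like the sketch below. The flag names and `parse_startup_trace` are hypothetical, and two behaviors are assumptions rather than facts from the commit: unknown tokens are silently ignored, and the legacy `1` is also accepted inside a comma-separated list.

```c
#include <string.h>

/* Hypothetical flag bits; the real names in startup-trace.h are not shown. */
enum { TRACE_STEPS = 1, TRACE_SYSCALLS = 2 };

/* Parse a value such as "steps,syscalls", "all", or the legacy "1". */
unsigned parse_startup_trace(const char *value)
{
    unsigned mask = 0;
    if (!value)
        return 0;
    for (const char *p = value; *p; ) {
        size_t len = strcspn(p, ",");  /* length of the current token */
        if ((len == 5 && memcmp(p, "steps", 5) == 0) ||
            (len == 1 && *p == '1'))   /* legacy =1 enables steps only */
            mask |= TRACE_STEPS;
        else if (len == 8 && memcmp(p, "syscalls", 8) == 0)
            mask |= TRACE_SYSCALLS;
        else if (len == 3 && memcmp(p, "all", 3) == 0)
            mask |= TRACE_STEPS | TRACE_SYSCALLS;
        /* Assumption: unknown tokens are skipped without error. */
        p += len;
        if (*p == ',')
            p++;
    }
    return mask;
}
```

A caller would pass `getenv("ELFUSE_STARTUP_TRACE")` and test the returned bits before enabling the step trace or the syscall histogram.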
This PR lands the full elfuse OCI image support. It supersedes the
original Phase 1 scope of this PR (CLI scaffold + pull/inspect) and
now covers Phases 1-4 plus the post-Phase-3 improvements plan: image
layout alignment, GC/prune, layer + stack snapshot caches, store
status, parallel pull, registry policy.json, and a heavy-mode compat
matrix.
Scope
- token, OCI index walk to the linux/arm64 leaf manifest, partial-store-aware inspect renderer.
- x/g records), gzip + decode-only vendored zstd, whiteout-aware layer apply (typeflag '1'/'2'/'5', `.wh.*` markers), per-image sysroot on a case-sensitive APFS sparsebundle.
- `elfuse oci run` clones the unpacked tree via clonefile(2), honors Entrypoint / Cmd / Env / WorkingDir / User, and reuses the existing elfuse launch path so a dynamically-linked guest binary runs through the same shim + syscall surface as the non-OCI mode.
- `oci prune` with `--older-than` / `--keep-bytes`; layer + stack prune sweep; `oci status` (text + `--json`); `oci rebuild-cache` for pre-snapshot stores.
- per-layer raw snapshot cache; ChainID stack snapshot cache; APFS COW clone-rootfs reuse between runs.
- `policy.json` + `registries.d` overlay (per-registry insecure / ca_bundle / auth_file). CLI flags override; loopback-only `--insecure`.
- `test-oci-*`), compat-shell smoke (`tests/test-oci-compat.sh`), and an opt-in heavy mode (`OCI_COMPAT_TEST=1`) that drives three layered fixtures (alpine-shaped, busybox-shaped hardlink dispatch, two-layer whiteout) end-to-end through a freshly-provisioned scratch sparsebundle.
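To make the `.wh.*` marker handling in the scope list more concrete, here is a minimal sketch of how a layer entry's basename can be classified under the OCI image layer conventions. The enum and function names are hypothetical and this is not the PR's implementation; other reserved `.wh..wh.` names are not special-cased in this simplified version.

```c
#include <string.h>

enum wh_kind { WH_NONE, WH_REMOVE, WH_OPAQUE };

/* Classify a tar entry basename. For WH_REMOVE, *target points at the
 * name of the sibling that the lower layers must no longer expose. */
enum wh_kind classify_whiteout(const char *basename, const char **target)
{
    if (target)
        *target = NULL;
    /* Opaque marker: hide everything the lower layers put in this dir. */
    if (strcmp(basename, ".wh..wh..opq") == 0)
        return WH_OPAQUE;
    /* Plain whiteout: ".wh.<name>" deletes <name> from lower layers. */
    if (strncmp(basename, ".wh.", 4) == 0 && basename[4] != '\0') {
        if (target)
            *target = basename + 4;
        return WH_REMOVE;
    }
    return WH_NONE;
}
```

An unpacker applying layers in order would remove the target (or clear the directory for the opaque case) instead of creating the marker file itself.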
Manual smoke test (docker.io/library/python:3.12)
A real end-to-end pull-and-run against a mainstream multi-layer glibc
image. The image's default Entrypoint is `docker-entrypoint.sh` (a shell script, which elfuse does not execute), so the commands below override `--entrypoint` to the python3 binary directly.

Performance characterization (vs OrbStack)
Measured on Apple M4 / macOS 15.4.1 (Darwin 24.4.0). OrbStack 2.1.3
acts as the ground-truth aarch64-linux runtime: it executes the same
`docker.io/library/python:3.12` image inside a Virtualization.framework-backed Linux VM with a real Linux kernel, so the comparison isolates
the cost of elfuse's user-mode ABI emulation against a native syscall
surface.
Pure CPU (factorial big-int multiply, no syscall)
Each engine ran twice; the second is warm.
`compute` is the `time.perf_counter()` delta inside Python (pure interpreter + big-int multiply work); `real` is the outer wall (includes engine startup); `startup ≈ real - compute`.

Both engines emit `digit_sum=4154076 digits=973351`, so correctness parity is confirmed. Pure compute ratio: 1.01× (within measurement noise). HVF runs guest aarch64 instructions directly, so big-int multiply + Python bytecode dispatch pay zero translation overhead. Startup ratio: 15.0× (constant ~2.5 s for elfuse vs ~0.17 s for orbstack), independent of N; this was verified separately at N=50000, where compute drops to ~0.14 s on both engines but elfuse startup stays at 2.53 s.
Syscall density (Python loop hammering syscalls)
`syscall_overhead` strips the Python loop interpreter cost (measured from the `baseline` band) so the residual is the pure trap+return cost of a single syscall.

`getppid` is the cleanest measurement: no kernel work, just trap + return. elfuse pays roughly 1 μs per syscall versus ~0.1 μs native. Rough HVF round-trip breakdown: vCPU state sync ~200 ns, Linux→macOS semantics ~100 ns, the macOS syscall itself ~100 ns, errno + sync back ~100 ns, HVF re-entry + ERET ~500 ns. Together these add up to ~1 μs, a structural floor on the latency of any elfuse syscall path.
vDSO observation: `time.monotonic_ns` should hit the synthetic vDSO under `src/core/vdso.{c,h}` and skip the trap (orbstack does, at 0.018 μs), but the measured 1.006 μs matches the trapping baseline.
elfuse's vDSO entry is not being picked up by glibc 2.41 in this
image. This is an existing optimization opportunity unrelated to the
scope of this PR; left untouched here so the patch series stays
focused on image-distribution and runtime correctness.
Wall-clock model
For a pure-CPU workload of compute time W:
elfuse is competitive for long-running workloads (where the constant
startup amortizes out) and a known tradeoff for short CLI one-shots
where startup dominates total wall.
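The tradeoff can be sketched numerically from the figures quoted earlier, assuming wall time is simply a constant startup cost plus W and treating compute as equal on both engines (the measured 1.01× ratio is within noise). The constants are the approximate startup values from the text and the function names are illustrative; the fitted model itself is not reproduced here.

```c
/* Approximate startup costs quoted in the text, in seconds. */
static const double ELFUSE_STARTUP_S = 2.5;
static const double ORBSTACK_STARTUP_S = 0.17;

/* Assumed model: wall = startup + W, with W identical on both engines. */
double wall_elfuse(double w)   { return ELFUSE_STARTUP_S + w; }
double wall_orbstack(double w) { return ORBSTACK_STARTUP_S + w; }

/* Ratio of elfuse wall time to OrbStack wall time for workload W. */
double slowdown(double w)
{
    return wall_elfuse(w) / wall_orbstack(w);
}
```

Under these assumptions the ratio is dominated by startup for sub-second jobs and approaches 1 as W grows, which matches the qualitative claim about long-running workloads versus short CLI one-shots.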
Known limitations
- `fork()` followed by `execve()` of a dynamically-linked ELF crashes in the child during dynamic-linker bring-up. This blocks Python's `subprocess.run([...other_dynamic_binary...])`, shell pipelines that spawn external binaries, and `timeout(1)`. Single-process Python workloads, stdlib computation, and file I/O are unaffected.
- linux/arm64. There is no `--platform` flag; cross-arch image support is out of scope for this PR.
- `pull` progress uses CSI cursor-up + clear-line for in-place redraw. Terminal panes that ignore those escapes show stacking rows; set `ELFUSE_OCI_PROGRESS=plain` to disable the redraw and emit one summary line per blob instead.
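The two progress modes in the last limitation can be illustrated with a small sketch. Both function names are hypothetical and the PR's actual rendering code is not shown; the escape sequences are the standard ANSI CSI "cursor up" (`ESC [ n A`) and "erase entire line" (`ESC [ 2 K`).

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 when the environment value selects plain progress output. */
int progress_is_plain(const char *env_value)
{
    return env_value != NULL && strcmp(env_value, "plain") == 0;
}

/* Writes the prefix that moves the cursor up `rows` lines and clears the
 * line it lands on, so a progress row can be redrawn in place.
 * Returns the number of bytes written, or -1 if `buf` is too small. */
int format_redraw_prefix(char *buf, size_t cap, int rows)
{
    int n = snprintf(buf, cap, "\x1b[%dA\x1b[2K", rows);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

A caller would check `progress_is_plain(getenv("ELFUSE_OCI_PROGRESS"))` once and, when it returns 1, skip the prefix entirely and print one summary line per blob.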
Summary by cubic
Adds full OCI image lifecycle to elfuse: pull, inspect, unpack, clone, run, prune, rebuild-cache, and status, with parallel/resumable downloads, a content‑addressable store + caches, and a runtime path to execute images. Vendors `cJSON` and decode‑only `zstd` to keep builds self-contained, and extracts a shared VM launcher for `oci run`.

New Features
- `oci pull|inspect|unpack|clone|run|prune|rebuild-cache|status`; pull adds progress and `--refresh`; `status --json`.
- `libcurl` with bearer/basic auth, custom CA, loopback‑only `--insecure`; content‑addressable blob store; `oci-layout` marker; pins in OCI `index.json`.
- `zstd`; whiteouts; case‑sensitive APFS sysroot; per‑run rootfs via `clonefile(2)`; raw layer and ChainID stack caches; parallel fetch + HTTP Range resume.
- Username/group lookup; inject `/etc/{resolv.conf,hosts,hostname}`; emulate `/dev/{full,console}` and basic `/proc`; multi‑arch index resolves to linux/arm64; `ELFUSE_OCI_PROGRESS=plain` fallback.
- `policy.json` with `registries.d` overlay; CLI flags override; `inspect` shows runtime fields and cross‑image dedup stats.

Migration
- `index.json`; store auto‑migrates from `refs/` on open.
- layers/cache schema v2; first open wipes legacy v1 cache entries (blobs/images untouched).
- `zstd` and `cJSON`; uses system zlib and `libcurl`.

Written for commit 426d7f6. Summary will update on new commits.