Add OCI image support: pull, unpack, run, prune, status, policy by Max042004 · Pull Request #34 · sysprog21/elfuse

Max042004 · 2026-05-15T15:18:20Z

This PR lands full OCI image support for elfuse. It supersedes the
original Phase 1 scope of this PR (CLI scaffold + pull/inspect) and
now covers Phases 1-4 plus the post-Phase-3 improvements plan: image
layout alignment, GC/prune, layer + stack snapshot caches, store
status, parallel pull, registry policy.json, and a heavy-mode compat
matrix.

Scope

- Pull / inspect: content-addressable blob store, HTTPS + bearer
  token, OCI index walk to the linux/arm64 leaf manifest,
  partial-store-aware inspect renderer.
- Unpack: tar reader (ustar + PAX x/g records), gzip + decode-only
  vendored zstd, whiteout-aware layer apply (typeflag '1'/'2'/'5'
  + .wh.* markers), per-image sysroot on a case-sensitive APFS
  sparsebundle.
- Run: elfuse oci run clones the unpacked tree via clonefile(2),
  honors Entrypoint / Cmd / Env / WorkingDir / User, and reuses the
  existing elfuse launch path so a dynamically-linked guest binary
  runs through the same shim + syscall surface as the non-OCI mode.
- Lifecycle: oci prune with --older-than / --keep-bytes;
  layer + stack prune sweep; oci status (text + --json);
  oci rebuild-cache for pre-snapshot stores.
- Performance: parallel blob fetch with HTTP Range resume;
  per-layer raw snapshot cache; ChainID stack snapshot cache; APFS
  COW clone-rootfs reuse between runs.
- Policy: podman / skopeo-style policy.json + registries.d
  overlay (per-registry insecure / ca_bundle / auth_file). CLI flags
  override; loopback-only --insecure.
- Test coverage: 25 OCI unit suites (test-oci-*), compat-shell
  smoke (tests/test-oci-compat.sh), and an opt-in heavy mode
  (OCI_COMPAT_TEST=1) that drives three layered fixtures
  (alpine-shaped, busybox-shaped hardlink dispatch, two-layer
  whiteout) end-to-end through a freshly-provisioned scratch
  sparsebundle.
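
For readers unfamiliar with the ChainID that keys the stack snapshot
cache, here is a minimal sketch of the derivation as defined in the OCI
image spec: the first layer's ChainID is its DiffID, and each later
ChainID is the SHA-256 of the previous ChainID, a space, and the next
DiffID. The function name `chain_ids` and the sample digests are
illustrative assumptions; how elfuse lays out its cache on disk is not
shown in this PR description.

```python
import hashlib


def chain_ids(diff_ids):
    """Return the ChainID of every layer prefix, bottom layer first.

    diff_ids: ordered list of "sha256:<hex>" digests of the
    *uncompressed* layer tars (rootfs.diff_ids in the image config).
    """
    chain = []
    for diff_id in diff_ids:
        if not chain:
            # ChainID(L0) = DiffID(L0)
            chain.append(diff_id)
        else:
            # ChainID(L0..Ln) = SHA256(ChainID(L0..Ln-1) + " " + DiffID(Ln))
            payload = f"{chain[-1]} {diff_id}".encode("ascii")
            chain.append("sha256:" + hashlib.sha256(payload).hexdigest())
    return chain


# Hypothetical DiffIDs, for illustration only.
base = "sha256:" + hashlib.sha256(b"layer-0").hexdigest()
top = "sha256:" + hashlib.sha256(b"layer-1").hexdigest()
stack = chain_ids([base, top])
```

Because each ChainID folds in everything beneath it, two images that
share a prefix of layers also share the ChainIDs for that prefix, which
is what makes a per-stack snapshot reusable across images.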

Manual smoke test (docker.io/library/python:3.12)

A real end-to-end pull-and-run against a mainstream multi-layer glibc
image. The image's default Entrypoint is docker-entrypoint.sh (a
shell script, which elfuse does not execute), so the commands below
override --entrypoint to the python3 binary directly.

make elfuse
SCRATCH=$(mktemp -d)
echo "store: $SCRATCH"

# 1. Pull (~400 MB across 7 layers, ~3 minutes on a fast link).
#    If your terminal mishandles CSI cursor-up and the progress
#    output stacks duplicate rows, prepend ELFUSE_OCI_PROGRESS=plain
#    to fall back to one summary line per blob.
./build/elfuse oci pull --store "$SCRATCH" python:3.12

# 2. Offline inspect: image index -> linux/arm64 manifest -> config
#    runtime block (Entrypoint / Cmd / Env / WorkingDir / User).
./build/elfuse oci inspect --store "$SCRATCH" python:3.12

# 3. Cold run. First invocation triggers layer unpack onto the
#    sysroot APFS sparsebundle, then clone-rootfs, then launch. The
#    unpack step dominates the ~50 s wall on a fresh store.
./build/elfuse oci run --store "$SCRATCH" \
    --entrypoint /usr/local/bin/python3 python:3.12 \
    -c 'print("hello from elfuse", 1+2)'
# expected stdout:  hello from elfuse 3

# 4. Warm run. clone-rootfs reuses the unpacked image tree, so wall
#    drops to ~2 s and is dominated by VM bring-up + dynamic-linker
#    bring-up + Python interp init.
./build/elfuse oci run --store "$SCRATCH" \
    --entrypoint /usr/local/bin/python3 python:3.12 \
    -c 'import sys, platform; print(sys.version); print(platform.platform()); print(platform.machine())'
# expected stdout:  Python 3.12.x ... / Linux-<kernel>-aarch64-with-glibc2.41 / aarch64

# 5. stdlib smoke. Confirms json + math + f-string formatting all
#    flow through the emulated syscall surface.
./build/elfuse oci run --store "$SCRATCH" \
    --entrypoint /usr/local/bin/python3 python:3.12 \
    -c 'import json, math; print(json.dumps({"pi": round(math.pi, 5), "ok": True}))'
# expected stdout:  {"pi": 3.14159, "ok": true}

Performance characterization (vs OrbStack)

Measured on Apple M4 / macOS 15.4.1 (Darwin 24.4.0). OrbStack 2.1.3
acts as the ground-truth aarch64-linux runtime: it executes the same
docker.io/library/python:3.12 image inside a Virtualization.framework-
backed Linux VM with a real Linux kernel, so the comparison isolates
the cost of elfuse's user-mode ABI emulation against a native syscall
surface.

Pure CPU (factorial big-int multiply, no syscall)

import sys, math, time
sys.set_int_max_str_digits(0)   # Python 3.12 default cap is 4300 digits
N = 200000
t = time.perf_counter()
f = math.factorial(N)
s = sum(int(d) for d in str(f))
print("fact(%d) digit_sum=%d digits=%d compute=%.3fs" %
      (N, s, len(str(f)), time.perf_counter() - t))

Each engine ran twice; the second is warm. compute is the
time.perf_counter() delta inside Python (pure interpreter +
big-int multiply work); real is the outer wall (includes engine
startup); startup ≈ real - compute.

| Engine | run | compute (s) | real (s) | startup (s) |
|---|---|---|---|---|
| elfuse | 1 | 0.791 | 3.72 | 2.93 |
| elfuse | 2 warm | 0.804 | 3.35 | 2.55 |
| orbstack | 1 | 0.792 | 1.10 | 0.31 |
| orbstack | 2 warm | 0.796 | 0.97 | 0.17 |

Both engines emit digit_sum=4154076 digits=973351, confirming
correctness parity. Pure compute ratio: 1.01× (within measurement
noise). HVF runs guest aarch64 instructions directly, so big-int
multiply and Python bytecode dispatch pay zero translation overhead.
Startup ratio: ~15× (a constant ~2.5 s for elfuse vs ~0.17 s for
orbstack), independent of N: verified separately at N=50000, where
compute drops to ~0.14 s on both engines but elfuse startup stays at
2.53 s.
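
The ratios quoted above can be re-derived from the warm-run rows of the
table. The sketch below only does that arithmetic; the dictionary layout
is an assumption for illustration. Note that the quoted 15× figure
appears to come from the rounded startup column (2.55 / 0.17), while the
unrounded real and compute columns give roughly 14.6×.

```python
# Warm-run rows copied from the table above (seconds).
runs = {
    "elfuse": {"compute": 0.804, "real": 3.35},
    "orbstack": {"compute": 0.796, "real": 0.97},
}


def startup(run):
    """startup is defined in the text as real minus compute."""
    return run["real"] - run["compute"]


compute_ratio = runs["elfuse"]["compute"] / runs["orbstack"]["compute"]
startup_ratio = startup(runs["elfuse"]) / startup(runs["orbstack"])
rounded_startup_ratio = 2.55 / 0.17  # using the table's rounded startup column

print(f"compute ratio: {compute_ratio:.2f}x")
print(f"startup ratio: {startup_ratio:.1f}x (rounded column: {rounded_startup_ratio:.1f}x)")
```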

Syscall density (Python loop hammering syscalls)

import os, time
N_BASE = 1_000_000
N_READ = 100_000

def time_loop(label, fn, n):
    fn(min(n // 100, 10_000))   # warm-up
    t = time.perf_counter()
    fn(n)
    return label, time.perf_counter() - t, n

def baseline(n):
    for _ in range(n): pass

def getppid(n):
    g = os.getppid
    for _ in range(n): g()

def clock_ns(n):
    g = time.monotonic_ns
    for _ in range(n): g()

def urandom_read(n):
    fd = os.open("/dev/urandom", os.O_RDONLY)
    try:
        rd = os.read
        for _ in range(n): rd(fd, 1)
    finally:
        os.close(fd)

results = [
    time_loop("baseline (pass)",              baseline,     N_BASE),
    time_loop("getppid",                      getppid,      N_BASE),
    time_loop("clock_gettime (monotonic_ns)", clock_ns,     N_BASE),
    time_loop("/dev/urandom 1B read",         urandom_read, N_READ),
]
base_per = results[0][1] / results[0][2]
for label, secs, n in results:
    per = secs / n
    overhead = (per - base_per) * 1e6 if label != "baseline (pass)" else 0.0
    print("%-38s total=%.3fs n=%d per=%.3fus  syscall_overhead=%.3fus" %
          (label, secs, n, per * 1e6, overhead))

syscall_overhead strips the Python loop interpreter cost (measured
from the baseline band) so the residual is the pure trap+return
cost of a single syscall.

| Band | elfuse (μs/call) | orbstack (μs/call) | ratio |
|---|---|---|---|
| baseline (pass) | 0.007 | 0.007 | 1.0× |
| getppid | 0.960 | 0.091 | 10.5× |
| clock_gettime (monotonic_ns) | 1.006 | 0.018 | 55.9× |
| /dev/urandom 1B read | 1.704 | 0.210 | 8.1× |

getppid is the cleanest measurement: no kernel work, just trap +
return. elfuse pays roughly 1 μs per syscall versus ~0.1 μs native.
Rough HVF round-trip breakdown: vCPU state sync ~200 ns, Linux→macOS
semantics ~100 ns, the macOS syscall itself ~100 ns, errno + sync
back ~100 ns, HVF re-entry + ERET ~500 ns. This ~1 μs cost is the
structural floor for any elfuse syscall that traps.
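
As a quick consistency check, the rough breakdown can be summed and the
table ratios recomputed. This is arithmetic on the numbers quoted above,
not a new measurement; the dictionary keys are just labels.

```python
# Rough per-trap cost components quoted above, in nanoseconds.
round_trip_ns = {
    "vCPU state sync": 200,
    "Linux-to-macOS semantics": 100,
    "macOS syscall": 100,
    "errno + sync back": 100,
    "HVF re-entry + ERET": 500,
}
total_ns = sum(round_trip_ns.values())
reentry_share = round_trip_ns["HVF re-entry + ERET"] / total_ns

# (elfuse, orbstack) microseconds per call, copied from the table.
bands = {
    "getppid": (0.960, 0.091),
    "clock_gettime (monotonic_ns)": (1.006, 0.018),
    "/dev/urandom 1B read": (1.704, 0.210),
}
ratios = {name: round(e / o, 1) for name, (e, o) in bands.items()}

print(f"estimated trap cost: {total_ns} ns, re-entry share {reentry_share:.0%}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The components add up to the ~1 μs observed for getppid, with HVF
re-entry accounting for about half of the estimate.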

vDSO observation — time.monotonic_ns should hit the synthetic
vDSO under src/core/vdso.{c,h} and skip the trap (orbstack does, at
0.018 μs), but the measured 1.006 μs matches the trapping baseline.
elfuse's vDSO entry is not being picked up by glibc 2.41 in this
image. This is an existing optimization opportunity unrelated to the
scope of this PR; left untouched here so the patch series stays
focused on image-distribution and runtime correctness.
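
One possible first diagnostic for this kind of issue is to confirm that
the auxiliary vector advertises a vDSO at all. The sketch below parses
an auxv blob for AT_SYSINFO_EHDR, assuming a 64-bit little-endian
layout (pairs of 8-byte type/value words). The helper names are
hypothetical, and whether elfuse's emulated /proc exposes
/proc/self/auxv is not stated in this PR. A present entry only shows
the page is advertised; it does not prove glibc accepted it, so a
timing probe is still needed.

```python
import struct

AT_NULL = 0           # end-of-vector marker (<elf.h>)
AT_SYSINFO_EHDR = 33  # address of the vDSO ELF header (<elf.h>)


def parse_auxv(raw: bytes) -> dict:
    """Decode (a_type, a_val) pairs until AT_NULL; 64-bit LE assumed."""
    entries = {}
    usable = len(raw) - len(raw) % 16  # ignore any trailing partial pair
    for a_type, a_val in struct.iter_unpack("<QQ", raw[:usable]):
        if a_type == AT_NULL:
            break
        entries[a_type] = a_val
    return entries


def vdso_base(raw: bytes):
    """Return the advertised vDSO base address, or None if absent."""
    return parse_auxv(raw).get(AT_SYSINFO_EHDR)


def read_own_auxv(path="/proc/self/auxv"):
    """Return this process's auxv bytes, or None if the file is unavailable."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:
        return None
```

Inside a guest one would call `vdso_base(read_own_auxv())` after
checking that the read did not return None.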

Wall-clock model

For a pure-CPU workload of compute time W:

elfuse_total   ≈ 2.5 s + W
orbstack_total ≈ 0.17 s + W

| W | elfuse | orbstack | ratio | scenario |
|---|---|---|---|---|
| 0.1 s | 2.6 s | 0.27 s | 9.6× | CLI one-shot |
| 1 s | 3.5 s | 1.17 s | 3.0× | short script |
| 10 s | 12.5 s | 10.17 s | 1.23× | medium task |
| 60 s | 62.5 s | 60.17 s | 1.04× | batch job |

elfuse is competitive for long-running workloads (where the constant
startup amortizes out) and a known tradeoff for short CLI one-shots
where startup dominates total wall.
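
The model is simple enough to evaluate directly. The sketch below
reproduces the table rows and also solves for the compute time at which
the slowdown falls to a chosen ratio; `compute_for_ratio` is a
hypothetical helper derived from the two formulas above, not something
in the elfuse tree.

```python
ELFUSE_STARTUP_S = 2.5     # constant startup from the model above
ORBSTACK_STARTUP_S = 0.17


def totals(w):
    """Return (elfuse_total, orbstack_total) for compute time w in seconds."""
    return ELFUSE_STARTUP_S + w, ORBSTACK_STARTUP_S + w


def ratio(w):
    elfuse_total, orbstack_total = totals(w)
    return elfuse_total / orbstack_total


def compute_for_ratio(r):
    """Solve (2.5 + W) / (0.17 + W) = r for W; requires r > 1."""
    if r <= 1:
        raise ValueError("ratio must be greater than 1")
    return (ELFUSE_STARTUP_S - r * ORBSTACK_STARTUP_S) / (r - 1)


for w in (0.1, 1, 10, 60):
    e, o = totals(w)
    print(f"W={w:>5} s  elfuse={e:.2f} s  orbstack={o:.2f} s  ratio={ratio(w):.2f}x")
```

Under this model the slowdown drops to 1.1× once the workload computes
for about 23 s, which gives a rough threshold for "long-running".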

Known limitations

- fork() followed by execve() of a dynamically-linked ELF crashes
  in the child during dynamic-linker bring-up. This blocks Python's
  subprocess.run([...other_dynamic_binary...]), shell pipelines that
  spawn external binaries, and timeout(1). Single-process Python
  workloads, stdlib computation, and file I/O are unaffected.
- Multi-arch image selection is hardcoded to linux/arm64. There is
  no --platform flag; cross-arch image support is out of scope for
  this PR.
- pull progress uses CSI cursor-up + clear-line for in-place
  redraw. Terminal panes that ignore those escapes show stacking
  rows; set ELFUSE_OCI_PROGRESS=plain to disable the redraw and
  emit one summary line per blob instead.
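
To make the last limitation concrete, here is a minimal sketch of what
an in-place redraw with CSI cursor-up and clear-line looks like, and why
a terminal that ignores those escapes ends up with stacked rows. It is
an illustration only: the function names and the plain summary format
are assumptions, and elfuse's actual renderer is written in C and not
shown here. Only the ELFUSE_OCI_PROGRESS=plain switch comes from the PR.

```python
CSI = "\x1b["  # Control Sequence Introducer


def use_plain_progress(env) -> bool:
    """True when the documented fallback is requested."""
    return env.get("ELFUSE_OCI_PROGRESS") == "plain"


def redraw_frame(rows, previous_rows: int) -> str:
    """Build one frame: move up over the old rows, then rewrite each row.

    A terminal that drops the cursor-up escape prints the new rows
    below the old ones, which is the stacking described above.
    """
    parts = []
    if previous_rows > 0:
        parts.append(f"{CSI}{previous_rows}A")  # CUU: cursor up N lines
    for row in rows:
        parts.append(f"\r{CSI}2K{row}\n")       # EL 2: clear the whole line
    return "".join(parts)


def plain_summary(blob: str, size_bytes: int) -> str:
    """Hypothetical one-line-per-blob format for the plain fallback."""
    return f"{blob} done ({size_bytes} bytes)\n"
```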

Summary by cubic

Adds full OCI image lifecycle to elfuse: pull, inspect, unpack, clone, run, prune, rebuild-cache, and status, with parallel/resumable downloads, a content‑addressable store + caches, and a runtime path to execute images. Vendors cJSON and decode‑only zstd to keep builds self-contained, and extracts a shared VM launcher for oci run.

New Features
- CLI: oci pull|inspect|unpack|clone|run|prune|rebuild-cache|status; pull adds progress and --refresh; status --json.
- Registry/store: HTTPS via libcurl with bearer/basic auth, custom CA, loopback‑only --insecure; content‑addressable blob store; oci-layout marker; pins in OCI index.json.
- Unpack/caches: ustar/PAX tar with gzip/zstd; whiteouts; case‑sensitive APFS sysroot; per‑run rootfs via clonefile(2); raw layer and ChainID stack caches; parallel fetch + HTTP Range resume.
- Runtime: PATH resolver; image‑config User name/group lookup; inject /etc/{resolv.conf,hosts,hostname}; emulate /dev/{full,console} and basic /proc; multi‑arch index resolves to linux/arm64; ELFUSE_OCI_PROGRESS=plain fallback.
- Policy/inspect: podman/skopeo‑style policy.json with registries.d overlay; CLI flags override; inspect shows runtime fields and cross‑image dedup stats.
Migration
- Pins moved to OCI index.json; store auto‑migrates from refs/ on open.
- layers/ cache schema v2; first open wipes legacy v1 cache entries (blobs/images untouched).
- Vendors decode‑only zstd and cJSON; uses system zlib and libcurl.

Written for commit 426d7f6. Summary will update on new commits.

jserv

Rebase onto the latest main branch and squash/rework the commits into fewer, cleaner ones.

Implement the full elfuse OCI image lifecycle as a self-contained `elfuse oci` subcommand. Image distribution never touches Hypervisor.framework, so the subcommand dispatches in main() before any guest setup; only `oci run` enters the VM bring-up path.

- pull / inspect: content-addressable blob store over HTTPS with bearer-token + Basic auth, OCI index walk to the linux/arm64 leaf, parallel blob fetch with HTTP Range resume, offline inspect renderer.
- unpack: tar reader (ustar + PAX x/g records), gzip + decode-only vendored zstd, whiteout-aware layer apply, per-image case-sensitive APFS sysroot; cross-volume unpack via copyfile(2) with clone fallback.
- run: clonefile(2) per-run rootfs; Entrypoint / Cmd / Env / WorkingDir and symbolic/numeric User honoured; reuses the shared elfuse_launch bring-up so a dynamic guest runs through the same shim + syscall path.
- lifecycle: prune (--older-than / --keep-bytes), per-layer + ChainID stack snapshot caches, oci status (text + --json), rebuild-cache.
- policy: podman/skopeo-style policy.json + registries.d overlay; loopback-gated --insecure; CLI flags override.

Extract the VM bring-up from main() into core/launch.c (elfuse_launch) so oci run and the positional-ELF main share one path; the host-path resolution now lives in the caller per the guest_bootstrap_prepare split.

Vendors decode-only zstd and cJSON; uses system zlib and libcurl. Adds 25 native test-oci-* unit suites plus an opt-in heavy compat mode.
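
For context on "whiteout-aware layer apply", the following is a
simplified sketch of the OCI whiteout rules: an entry named `.wh.<name>`
hides `<name>` from lower layers, and `.wh..wh..opq` hides everything
the lower layers placed in that directory. It models a filesystem as a
plain set of paths and ignores tar type flags, hardlinks, and metadata,
so it is an illustration of the rule rather than elfuse's unpacker;
`apply_layer` is a hypothetical name.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def apply_layer(lower, layer_paths):
    """Return the merged path set after applying one layer on top of `lower`.

    lower: set of file paths from the layers below (no leading slash).
    layer_paths: paths found in this layer's tar, whiteouts included.
    """
    layer_paths = list(layer_paths)
    result = set(lower)

    # Pass 1: whiteouts only ever hide content from lower layers.
    for path in layer_paths:
        parent, name = posixpath.split(path)
        if name == OPAQUE_MARKER:
            prefix = parent + "/" if parent else ""
            result = {p for p in result if not p.startswith(prefix)}
        elif name.startswith(WHITEOUT_PREFIX):
            target = posixpath.join(parent, name[len(WHITEOUT_PREFIX):])
            result = {
                p for p in result
                if p != target and not p.startswith(target + "/")
            }

    # Pass 2: add this layer's real entries; marker files are never kept.
    for path in layer_paths:
        if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
            result.add(path)
    return result
```

Handling whiteouts in a first pass keeps a file added by the same layer
from being hidden by that layer's own opaque marker, which matches the
rule that whiteouts apply only to lower layers.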

The synthetic vDSO at AT_SYSINFO_EHDR already carries DT_HASH, LINUX_2.6.39 symbol versioning, and five __kernel_* trampolines, but glibc 2.41's dynamic-linker vDSO probe rejected the page for lack of an NT_GNU_ABI_TAG note: every dynamically-linked guest fell back to SVC for clock_gettime, gettimeofday, and clock_getres. PR #34 measured 1006 ns/op against an 18 ns/op OrbStack reference, a 56x gap the TODO Tier D P1 entry tracked as the highest-leverage single fix.

This adds the note. To avoid moving VVAR (0x0B0), TEXT_OFF_SIGRET (0x0E0, exported in vdso.h for signal.c), or any trampoline / section offset, the program-header table relocates from 0x040 to 0x6B0 (after the section-header area). The reclaimed 0x040 window now holds the 32-byte NT_GNU_ABI_TAG:

    namesz : 4 ("GNU\0")
    descsz : 16
    type   : NT_GNU_ABI_TAG (1)
    name   : "GNU\0"
    desc   : { ELF_NOTE_OS_LINUX (0), 2, 6, 39 }

The descriptor's minimum kernel ABI (2.6.39) matches the LINUX_2.6.39 symbol version already exposed through DT_VERDEF, so a glibc that honors the version also honors the note. PT_LOAD continues to cover the whole page so the relocated PHDR table and the note both stay mapped at runtime.

Validation, dynamically-linked glibc 2.41 binary built from the cross-toolchain sysroot at /opt/toolchain/aarch64-linux-gnu (same toolchain PR #34 used for the baseline):

    libc clock_gettime : 6.97 ns/op (was 1006 ns/op pre-fix)
    direct vDSO call   : 6.24 ns/op (dlsym function-pointer)
    raw SVC syscall    : 2047.01 ns/op
    libc/vDSO ratio = 1.12x -- libc IS using the vDSO

The 0.7 ns libc-vs-direct gap is glibc's dl_sysinfo_dso dispatch, not an SVC fallback. libc clock_gettime now beats the OrbStack reference (18 ns/op) by ~2.6x.

gettimeofday and clock_getres land on the trampolines through the same probe path:

    libc gettimeofday : 7.5 ns/op (vDSO REALTIME anchor reuse)
    libc clock_getres : 4.9 ns/op (constant-resolution path)

readelf parses the page cleanly: e_phnum=3, e_phoff=0x6B0, three PHDRs (PT_LOAD covering the whole page, PT_DYNAMIC at 0x420 size 0x90, PT_NOTE at 0x40 size 0x20), and `readelf -n` decodes the note as "GNU NT_GNU_ABI_TAG OS: Linux, ABI: 2.6.39". No region overlaps; total page usage 0x758 / 0x1000.

Static vDSO bench unchanged at 6 ns/op for the time fast paths; the PHDR relocation only shifts where the dynamic linker looks for the table and does not touch any code the trampolines execute. test-signal explicit run passes, confirming the unchanged TEXT_OFF_SIGRET=0xE0 trampoline still drives the libc __restore_rt path.
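
The 32-byte note layout described in this commit message can be
reproduced with a few lines of struct packing, which is a handy way to
sanity-check the offsets. This is a sketch for illustration, assuming
little-endian 32-bit note words as used on aarch64; the function name is
hypothetical and the real vDSO image is built in C under src/core/vdso.

```python
import struct

NT_GNU_ABI_TAG = 1     # note type, from <elf.h>
ELF_NOTE_OS_LINUX = 0  # first descriptor word, from <elf.h>
ELF64_PHDR_SIZE = 56   # sizeof(Elf64_Phdr)


def gnu_abi_tag_note(major: int, minor: int, patch: int) -> bytes:
    """Pack an NT_GNU_ABI_TAG note: header, name, then 4-word descriptor."""
    name = b"GNU\0"
    desc = struct.pack("<4I", ELF_NOTE_OS_LINUX, major, minor, patch)
    header = struct.pack("<3I", len(name), len(desc), NT_GNU_ABI_TAG)
    return header + name + desc


note = gnu_abi_tag_note(2, 6, 39)
note_end = 0x40 + len(note)               # note occupies 0x40..0x60
phdr_end = 0x6B0 + 3 * ELF64_PHDR_SIZE    # three PHDRs starting at 0x6B0
print(len(note), hex(note_end), hex(phdr_end))
```

The end of the relocated program-header table works out to 0x758, which
agrees with the total page usage quoted from readelf.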

Three hot paths the PR #34 OrbStack baseline tracked -- getpid (~47 ns), clock_gettime through the vDSO (~2.5 ns), and 1-byte /dev/urandom read (~134 ns) -- had no automated regression check. A silent slip-back to the SVC fallback turned each into a ~1-2 us trap without anything in CI to notice. This adds an explicit guardrail.

tests/bench-hot-guard.c resolves __kernel_clock_gettime via AT_SYSINFO_EHDR + PT_DYNAMIC + DT_HASH (SysV ELF hash walk) and measures three labels in fixed-width "%-20s %10.1f ns/op last=%ld" output: getpid (raw SVC), clock_gettime (vDSO trampoline), and read-urandom1 (raw 1-byte read of /dev/urandom). The same source builds two binaries via a compile-time switch:

    build/bench-hot-guard
        Static glibc. Built without the macro. clock_gettime invokes the trampoline directly through the resolved function pointer. Static glibc never initializes dl_sysinfo_dso, so its libc wrapper falls back to raw SVC for reasons unrelated to the vDSO; measuring the wrapper would fail the 50 ns ceiling for the wrong reason. Direct call isolates the trampoline.

    build/bench-hot-guard-glibc
        Dynamic glibc. Built with -DGUARD_USE_LIBC_CG=1. clock_gettime invokes glibc's clock_gettime() wrapper -- which on glibc 2.41 + a correctly-stamped vDSO (NT_GNU_ABI_TAG PT_NOTE, LINUX_2.6.39 versioning) routes through the trampoline. A regression in the note or versioning would push this measurement from ~7 ns to SVC range and trip the ceiling. Built only when the cross-toolchain sysroot at $(LINUX_TOOLCHAIN)/aarch64-unknown-linux-gnu/sysroot exists; run with elfuse --sysroot at that path.

Disassembly verifies the split: the dynamic binary lowers bench_clock_gettime to "bl <clock_gettime@plt>" while the static binary lowers it to "ldr x2, [x1], #8" + indirect dispatch.

Validation:

    static     getpid 50.4 ns, clock_gettime 6.7 ns, urandom 141.9 ns
    dyn-glibc  getpid 71.9 ns, clock_gettime 17.8 ns, urandom 147.9 ns
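
The "SysV ELF hash walk" mentioned in this commit message relies on the
standard hash function from the System V gABI. A sketch of that function
in Python follows, for readers who want to check bucket indices by hand;
the real lookup in tests/bench-hot-guard.c is C and is not reproduced
here, and `bucket_index` is a hypothetical helper.

```python
def sysv_elf_hash(name: bytes) -> int:
    """System V gABI symbol hash used by DT_HASH tables."""
    h = 0
    for byte in name:
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF  # clear the top nibble, as the C version does
    return h


def bucket_index(name: bytes, nbucket: int) -> int:
    """Index of the hash chain to walk for `name` in a DT_HASH table."""
    return sysv_elf_hash(name) % nbucket


symbol = b"__kernel_clock_gettime"
print(hex(sysv_elf_hash(symbol)))
```

A lookup then reads `bucket[bucket_index(...)]` and follows the `chain`
array, comparing symbol names until it reaches index 0 (STN_UNDEF).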

The dynamic-linker bring-up storm was the largest remaining startup band after pull request #34. Adding a per-syscall histogram pointed at the sidecar walker as the openat dominant cost (61% of getent startup), the per-call path_translation_t memset as the second source, and the opened_fd_type fstat as a small but real per-open round-trip.

src/debug/syscall-hist.[ch]: opt-in histogram via ELFUSE_STARTUP_TRACE=syscalls (or =all alongside the existing step trace). Lock-free atomic counters per Linux syscall number, sorted total-ns descending in the dump. Records freeze on the first successful execve so steady-state traffic does not pollute the startup picture. Fork children disable the histogram explicitly because they resume from a parent snapshot, not a fresh bring-up.

src/syscall/sidecar.c: First a per-directory absence cache keyed by (st_dev, st_ino, mtime, ctime) so the walker can skip the openat for .elfuse-sidecar-index when a recent fstat on the same dirfd already saw ENOENT. The mtime/ctime in the key closes ABA naturally and makes a cross-process index publish observable without explicit invalidation. Second a cached sysroot dirfd handed out as fcntl(F_DUPFD_CLOEXEC, 0) so each translated absolute path saves the ~30 us open(sysroot) round-trip and the dup carries CLOEXEC across any racing posix_spawn.

src/syscall/path.c: drop the per-call zero-init of path_translation_t. The struct is ~12 KiB (24 metadata bytes plus three LINUX_PATH_MAX buffers) and the buffers are read-after-written by their respective resolvers. memset of all three was the dominant remaining cost after the sidecar caches.

src/core/elf.c: skip the redundant memset of the file-data range in elf_map_segments. The loader previously zeroed the full page-aligned segment extent before issuing fread; now only the BSS portion plus page padding (filesz to zero_len) is zeroed.

src/syscall/fs.c: skip opened_fd_type fstat when neither O_PATH nor O_DIRECTORY is set. Dynamic-linker opens are overwhelmingly regular files where the type is already implied. The corner where a guest opens a directory without O_DIRECTORY and then issues getdents now returns ENOTDIR; glibc fdopendir has required O_DIRECTORY since 2009 and the test corpus does not exercise the corner.

src/core/startup-trace.h: env parsing extended to comma-separated tokens (steps, syscalls, all); legacy =1 keeps enabling steps only so existing scripts keep working.

Measurement: 30-run distributions under ELFUSE_STARTUP_TRACE=syscalls, warm cache:

    bench-hot-guard-glibc startup syscalls:
        5.225 ms baseline (single sample) -> 1.33 ms p50 (p25 1.21, p75 1.55, stdev 0.45, n=30) 3.9x
    bench openat per-call:
        135 us baseline -> 33.4 us p50 (p25 32.4, p75 35.8, stdev 7.1, n=30) 4.0x
    getent passwd root startup syscalls:
        7.478 ms baseline -> 2.22 ms p50 (p25 2.10, p75 2.28, stdev 0.27, n=30) 3.4x
    getent openat per-call:
        230 us baseline -> 52.9 us p50 (p25 51.5, p75 55.1, stdev 2.2, n=30) 4.3x

End-to-end wall-clock for getent: 14.6 ms p50 (p25 14.3, p75 15.1, stdev 1.18, n=30).

Bench guardrail steady-state: static getpid 74 ns, clock_gettime 6.7 ns, urandom1 153 ns; dynamic-glibc getpid 53 ns, clock_gettime 6.4 ns, urandom1 142 ns. All under ceilings.

The original baselines were single first-run samples; their variance band was not measured, so the speedup ratios are best-effort relative to the cited starting point.

Lazy FD_REGULAR to FD_DIR promotion in sys_getdents64 was attempted but dropped after both reviewers flagged a HIGH-severity ABA hole: a sibling close+reopen between the probe and the install could land the original directory's DIR* onto a fresh regular file's slot. The fix path (fd-slot generation counter or stat+inode comparison under fd_lock) was invasive enough that the lazy promotion did not pay for its complexity.
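
The absence-cache idea from src/syscall/sidecar.c can be sketched in a
few lines. The version below is a Python illustration under stated
assumptions, not the C implementation: the class and function names are
hypothetical, the cache is an unbounded set, and nanosecond timestamps
stand in for whatever resolution the real key uses. The point it shows
is that including mtime/ctime in the key makes a later publish of the
index file change the key, so the stale "absent" entry simply stops
matching.

```python
import os


class AbsenceCache:
    """Remembers directories in which the index file was seen to be missing."""

    def __init__(self):
        self._absent = set()  # unbounded here; a real cache would cap this

    @staticmethod
    def _key(st):
        # Directory identity plus timestamps: creating or removing an
        # entry updates the directory's mtime/ctime and changes the key.
        return (st.st_dev, st.st_ino, st.st_mtime_ns, st.st_ctime_ns)

    def mark_absent(self, st):
        self._absent.add(self._key(st))

    def known_absent(self, st):
        return self._key(st) in self._absent


def index_exists(dirfd, cache, name=".elfuse-sidecar-index"):
    """Return True if `name` exists under dirfd, skipping the open on a cache hit."""
    st = os.fstat(dirfd)
    if cache.known_absent(st):
        return False  # one fstat instead of fstat + openat
    try:
        fd = os.open(name, os.O_RDONLY, dir_fd=dirfd)
    except FileNotFoundError:
        cache.mark_absent(st)
        return False
    os.close(fd)
    return True
```

One caveat worth keeping in mind with any such scheme is timestamp
granularity: two changes inside the same filesystem tick can leave the
key unchanged, which is presumably part of why the commit message
speaks of a "recent" fstat.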

Max042004 changed the title ~~Add elfuse oci subcommand for pulling and inspecting images~~ Add OCI image support: pull, unpack, run, prune, status, policy May 23, 2026

jserv requested changes May 23, 2026

Max042004 force-pushed the oci-image branch 2 times, most recently from e988d6e to 5d6dbc7 Compare May 23, 2026 14:38

sysprog21 deleted a comment from cubic-dev-ai Bot May 23, 2026

Max042004 force-pushed the oci-image branch from 5d6dbc7 to 2154d99 Compare May 26, 2026 15:49

This was referenced May 26, 2026

clone(2) silently ignores CLONE_NEW* namespace flags while clone3(2) rejects them with EINVAL #44

Closed

fork/clone falls back to full guest-memory copy for x86 (Rosetta) guests, losing CoW #45

Closed

Max042004 force-pushed the oci-image branch from 2154d99 to 426d7f6 Compare May 26, 2026 16:31

Max042004 mentioned this pull request May 28, 2026

Speedup vDSO CNTVCT and amortized urandom #48

Merged

jserv mentioned this pull request May 30, 2026

Cut dynamic-linker startup syscalls #62

Open

Max042004 wants to merge 1 commit into sysprog21:main from Max042004:oci-image
