Add OCI image support: pull, unpack, run, prune, status, policy#34

Open
Max042004 wants to merge 1 commit into sysprog21:main from Max042004:oci-image

@Max042004
Collaborator

@Max042004 Max042004 commented May 15, 2026

This PR lands the full elfuse OCI image support. It supersedes the
original Phase 1 scope of this PR (CLI scaffold + pull/inspect) and
now covers Phases 1-4 plus the post-Phase-3 improvements plan: image
layout alignment, GC/prune, layer + stack snapshot caches, store
status, parallel pull, registry policy.json, and a heavy-mode compat
matrix.

Scope

  • Pull / inspect: content-addressable blob store, HTTPS + bearer
    token, OCI index walk to the linux/arm64 leaf manifest,
    partial-store-aware inspect renderer.
  • Unpack: tar reader (ustar + PAX x/g records), gzip + decode-only
    vendored zstd, whiteout-aware layer apply (typeflag '1'/'2'/'5'
    + .wh.* markers), per-image sysroot on a case-sensitive APFS
    sparsebundle.
  • Run: elfuse oci run clones the unpacked tree via clonefile(2),
    honors Entrypoint / Cmd / Env / WorkingDir / User, and reuses the
    existing elfuse launch path so a dynamically-linked guest binary
    runs through the same shim + syscall surface as the non-OCI mode.
  • Lifecycle: oci prune with --older-than / --keep-bytes;
    layer + stack prune sweep; oci status (text + --json);
    oci rebuild-cache for pre-snapshot stores.
  • Performance: parallel blob fetch with HTTP Range resume;
    per-layer raw snapshot cache; ChainID stack snapshot cache; APFS
    COW clone-rootfs reuse between runs.
  • Policy: podman / skopeo-style policy.json + registries.d
    overlay (per-registry insecure / ca_bundle / auth_file). CLI flags
    override; loopback-only --insecure.
  • Test coverage: 25 OCI unit suites (test-oci-*), compat-shell
    smoke (tests/test-oci-compat.sh), and an opt-in heavy mode
    (OCI_COMPAT_TEST=1) that drives three layered fixtures
    (alpine-shaped, busybox-shaped hardlink dispatch, two-layer
    whiteout) end-to-end through a freshly-provisioned scratch
    sparsebundle.
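
The whiteout-aware layer apply mentioned under Unpack can be illustrated with a minimal sketch. It models a filesystem as a plain dict and follows the OCI layer convention that a `.wh.<name>` entry hides `<name>` from lower layers and `.wh..wh..opq` hides everything below its directory. The function name `apply_layer` and the dict model are assumptions for illustration, not elfuse's implementation; hardlink, symlink, and directory typeflags are ignored here.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def apply_layer(tree, entries):
    """Return a new path -> content dict with one layer applied on top of `tree`.

    `entries` is a list of (path, content) pairs in tar order. Whiteouts only
    affect lower layers, so regular entries are collected and added last.
    """
    result = dict(tree)
    additions = {}
    for path, content in entries:
        parent, name = posixpath.split(path)
        if name == OPAQUE_MARKER:
            # Opaque whiteout: hide everything lower layers placed under `parent`.
            prefix = parent + "/" if parent else ""
            for old in [p for p in result if p.startswith(prefix)]:
                del result[old]
        elif name.startswith(WHITEOUT_PREFIX):
            # Plain whiteout: hide one sibling (and its subtree, if a directory).
            target = posixpath.join(parent, name[len(WHITEOUT_PREFIX):])
            doomed = [p for p in result if p == target or p.startswith(target + "/")]
            for old in doomed:
                del result[old]
        else:
            additions[path] = content
    result.update(additions)
    return result


lower = {
    "etc/passwd": "root",
    "etc/hosts": "old",
    "var/cache/a": "stale",
    "var/cache/b": "stale",
}
layer = [
    ("etc/.wh.hosts", None),
    ("var/cache/.wh..wh..opq", None),
    ("var/cache/new", "fresh"),
]
merged = apply_layer(lower, layer)
```

Collecting regular entries separately keeps an opaque marker from deleting files that the same layer contributes, which matches the rule that whiteouts apply to lower layers only.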

Manual smoke test (docker.io/library/python:3.12)

A real end-to-end pull-and-run against a mainstream multi-layer glibc
image. The image's default Entrypoint is docker-entrypoint.sh (a
shell script, which elfuse does not execute), so the commands below
override --entrypoint to the python3 binary directly.

make elfuse
SCRATCH=$(mktemp -d)
echo "store: $SCRATCH"

# 1. Pull (~400 MB across 7 layers, ~3 minutes on a fast link).
#    If your terminal mishandles CSI cursor-up and the progress
#    output stacks duplicate rows, prepend ELFUSE_OCI_PROGRESS=plain
#    to fall back to one summary line per blob.
./build/elfuse oci pull --store "$SCRATCH" python:3.12

# 2. Offline inspect: image index -> linux/arm64 manifest -> config
#    runtime block (Entrypoint / Cmd / Env / WorkingDir / User).
./build/elfuse oci inspect --store "$SCRATCH" python:3.12

# 3. Cold run. First invocation triggers layer unpack onto the
#    sysroot APFS sparsebundle, then clone-rootfs, then launch. The
#    unpack step dominates the ~50 s wall on a fresh store.
./build/elfuse oci run --store "$SCRATCH" \
    --entrypoint /usr/local/bin/python3 python:3.12 \
    -c 'print("hello from elfuse", 1+2)'
# expected stdout:  hello from elfuse 3

# 4. Warm run. clone-rootfs reuses the unpacked image tree, so wall
#    drops to ~2 s and is dominated by VM bring-up + dynamic-linker
#    bring-up + Python interp init.
./build/elfuse oci run --store "$SCRATCH" \
    --entrypoint /usr/local/bin/python3 python:3.12 \
    -c 'import sys, platform; print(sys.version); print(platform.platform()); print(platform.machine())'
# expected stdout:  Python 3.12.x ... / Linux-<kernel>-aarch64-with-glibc2.41 / aarch64

# 5. stdlib smoke. Confirms json + math + f-string formatting all
#    flow through the emulated syscall surface.
./build/elfuse oci run --store "$SCRATCH" \
    --entrypoint /usr/local/bin/python3 python:3.12 \
    -c 'import json, math; print(json.dumps({"pi": round(math.pi, 5), "ok": True}))'
# expected stdout:  {"pi": 3.14159, "ok": true}

Performance characterization (vs OrbStack)

Measured on Apple M4 / macOS 15.4.1 (Darwin 24.4.0). OrbStack 2.1.3
acts as the ground-truth aarch64-linux runtime: it executes the same
docker.io/library/python:3.12 image inside a Virtualization.framework-
backed Linux VM with a real Linux kernel, so the comparison isolates
the cost of elfuse's user-mode ABI emulation against a native syscall
surface.

Pure CPU (factorial big-int multiply, no syscall)

import sys, math, time
sys.set_int_max_str_digits(0)   # Python 3.12 default cap is 4300 digits
N = 200000
t = time.perf_counter()
f = math.factorial(N)
s = sum(int(d) for d in str(f))
print("fact(%d) digit_sum=%d digits=%d compute=%.3fs" %
      (N, s, len(str(f)), time.perf_counter() - t))

Each engine ran twice; the second is warm. compute is the
time.perf_counter() delta inside Python (pure interpreter +
big-int multiply work); real is the outer wall (includes engine
startup); startup ≈ real - compute.

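| Engine   | Run      | compute (s) | real (s) | startup (s) |
|----------|----------|-------------|----------|-------------|
| elfuse   | 1        | 0.791       | 3.72     | 2.93        |
| elfuse   | 2 (warm) | 0.804       | 3.35     | 2.55        |
| orbstack | 1        | 0.792       | 1.10     | 0.31        |
| orbstack | 2 (warm) | 0.796       | 0.97     | 0.17        |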

Both engines emit digit_sum=4154076 digits=973351 — correctness
parity confirmed. Pure compute ratio: 1.01× (within measurement
noise). HVF runs guest aarch64 instructions directly so big-int
multiply + Python bytecode dispatch pay zero translation overhead.
Startup ratio: 15.0× (constant ~2.5 s for elfuse vs ~0.17 s for
orbstack), independent of N: a separate run at N=50000 dropped compute
to ~0.14 s on both engines while elfuse startup stayed at 2.53 s.
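
The ratios quoted above can be re-derived from the reported table values. The snippet below is only a worked check of that arithmetic; the variable names are illustrative and the numbers are copied from the table, not re-measured.

```python
# (engine, run, compute_s, real_s) as reported in the table above.
RUNS = [
    ("elfuse", 1, 0.791, 3.72),
    ("elfuse", 2, 0.804, 3.35),
    ("orbstack", 1, 0.792, 1.10),
    ("orbstack", 2, 0.796, 0.97),
]


def startup_s(compute_s, real_s):
    """startup is approximated as outer wall time minus in-Python compute time."""
    return real_s - compute_s


# Keep only the warm (second) run of each engine.
warm = {engine: (c, r) for engine, run, c, r in RUNS if run == 2}

compute_ratio = warm["elfuse"][0] / warm["orbstack"][0]

# Ratio from the rounded startup column (2.55 s vs 0.17 s).
startup_ratio_rounded = 2.55 / 0.17

# Ratio from the unrounded real - compute differences.
startup_ratio_raw = startup_s(*warm["elfuse"]) / startup_s(*warm["orbstack"])
```

With the rounded startup column the ratio comes out at 15.0×; dividing the unrounded differences gives roughly 14.6×, so the quoted figure appears to be based on the rounded column.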

Syscall density (Python loop hammering syscalls)

import os, time
N_BASE = 1_000_000
N_READ = 100_000

def time_loop(label, fn, n):
    fn(min(n // 100, 10_000))   # warm-up
    t = time.perf_counter()
    fn(n)
    return label, time.perf_counter() - t, n

def baseline(n):
    for _ in range(n): pass

def getppid(n):
    g = os.getppid
    for _ in range(n): g()

def clock_ns(n):
    g = time.monotonic_ns
    for _ in range(n): g()

def urandom_read(n):
    fd = os.open("/dev/urandom", os.O_RDONLY)
    try:
        rd = os.read
        for _ in range(n): rd(fd, 1)
    finally:
        os.close(fd)

results = [
    time_loop("baseline (pass)",              baseline,     N_BASE),
    time_loop("getppid",                      getppid,      N_BASE),
    time_loop("clock_gettime (monotonic_ns)", clock_ns,     N_BASE),
    time_loop("/dev/urandom 1B read",         urandom_read, N_READ),
]
base_per = results[0][1] / results[0][2]
for label, secs, n in results:
    per = secs / n
    overhead = (per - base_per) * 1e6 if label != "baseline (pass)" else 0.0
    print("%-38s total=%.3fs n=%d per=%.3fus  syscall_overhead=%.3fus" %
          (label, secs, n, per * 1e6, overhead))

syscall_overhead strips the Python loop interpreter cost (measured
from the baseline band) so the residual is the pure trap+return
cost of a single syscall.

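| Band                         | elfuse (μs/call) | orbstack (μs/call) | ratio |
|------------------------------|------------------|--------------------|-------|
| baseline (pass)              | 0.007            | 0.007              | 1.0×  |
| getppid                      | 0.960            | 0.091              | 10.5× |
| clock_gettime (monotonic_ns) | 1.006            | 0.018              | 55.9× |
| /dev/urandom 1B read         | 1.704            | 0.210              | 8.1×  |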

getppid is the cleanest measurement: no kernel work, just trap +
return. elfuse pays roughly 1 μs per syscall versus ~0.1 μs native.
Rough HVF round-trip breakdown: vCPU state sync ~200 ns, Linux→macOS
semantics ~100 ns, the macOS syscall itself ~100 ns, errno + sync
back ~100 ns, HVF re-entry + ERET ~500 ns. This ~1 μs round trip is the
structural floor for any elfuse syscall path that traps.
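
As a quick consistency check, the rough breakdown above sums to the observed per-trap cost, and the per-band ratios follow from the two measured columns. The dictionary names below are illustrative; the values are the ones reported in this description.

```python
# Rough per-trap cost components in nanoseconds, as estimated above.
BREAKDOWN_NS = {
    "vCPU state sync": 200,
    "Linux->macOS semantics": 100,
    "macOS syscall": 100,
    "errno + sync back": 100,
    "HVF re-entry + ERET": 500,
}

# (elfuse_us, orbstack_us) syscall overhead per call from the table above.
MEASURED_US = {
    "getppid": (0.960, 0.091),
    "clock_gettime (monotonic_ns)": (1.006, 0.018),
    "/dev/urandom 1B read": (1.704, 0.210),
}

floor_ns = sum(BREAKDOWN_NS.values())

ratios = {band: elfuse / orbstack for band, (elfuse, orbstack) in MEASURED_US.items()}

# How far the cleanest measurement (getppid) sits from the estimated floor.
getppid_gap_ns = MEASURED_US["getppid"][0] * 1000 - floor_ns
```

The estimated components add up to 1000 ns, within about 40 ns of the measured getppid overhead.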

vDSO observation: time.monotonic_ns should hit the synthetic
vDSO under src/core/vdso.{c,h} and skip the trap (orbstack does, at
0.018 μs), but the measured 1.006 μs matches the trapping baseline.
elfuse's vDSO entry is not being picked up by glibc 2.41 in this
image. This is an existing optimization opportunity unrelated to the
scope of this PR; left untouched here so the patch series stays
focused on image-distribution and runtime correctness.

Wall-clock model

For a pure-CPU workload of compute time W:

elfuse_total   ≈ 2.5 s + W
orbstack_total ≈ 0.17 s + W

| W     | elfuse | orbstack | ratio | scenario     |
|-------|--------|----------|-------|--------------|
| 0.1 s | 2.6 s  | 0.27 s   | 9.6×  | CLI one-shot |
| 1 s   | 3.5 s  | 1.17 s   | 3.0×  | short script |
| 10 s  | 12.5 s | 10.17 s  | 1.23× | medium task  |
| 60 s  | 62.5 s | 60.17 s  | 1.04× | batch job    |
elfuse is competitive for long-running workloads (where the constant
startup amortizes out) and a known tradeoff for short CLI one-shots
where startup dominates total wall.
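
The model above is simple enough to evaluate directly. The sketch below reproduces the table rows and also solves for the compute time at which the slowdown falls to a chosen ratio; the function names are illustrative, and the two startup constants are the measured values quoted in this description.

```python
ELFUSE_STARTUP_S = 2.5
ORBSTACK_STARTUP_S = 0.17


def wall_clock(w):
    """Return (elfuse_total, orbstack_total, ratio) for compute time w seconds."""
    elfuse = ELFUSE_STARTUP_S + w
    orbstack = ORBSTACK_STARTUP_S + w
    return elfuse, orbstack, elfuse / orbstack


def compute_time_for_ratio(r):
    """Solve (2.5 + W) / (0.17 + W) = r for W; valid for r > 1."""
    return (ELFUSE_STARTUP_S - ORBSTACK_STARTUP_S * r) / (r - 1)


rows = {w: wall_clock(w) for w in (0.1, 1, 10, 60)}
w_at_10_percent = compute_time_for_ratio(1.10)
```

Under this model the slowdown drops to 10% once the workload computes for roughly 23 s, which is consistent with the table's progression from the medium task to the batch job.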

Known limitations

  • fork() followed by execve() of a dynamically-linked ELF crashes
    in the child during dynamic-linker bring-up. This blocks Python's
    subprocess.run([...other_dynamic_binary...]), shell pipelines that
    spawn external binaries, and timeout(1). Single-process Python
    workloads, stdlib computation, and file I/O are unaffected.
  • Multi-arch image selection is hardcoded to linux/arm64. There is
    no --platform flag; cross-arch image support is out of scope for
    this PR.
  • pull progress uses CSI cursor-up + clear-line for in-place
    redraw. Terminal panes that ignore those escapes show stacking
    rows; set ELFUSE_OCI_PROGRESS=plain to disable the redraw and
    emit one summary line per blob instead.

Summary by cubic

Adds full OCI image lifecycle to elfuse: pull, inspect, unpack, clone, run, prune, rebuild-cache, and status, with parallel/resumable downloads, a content‑addressable store + caches, and a runtime path to execute images. Vendors cJSON and decode‑only zstd to keep builds self-contained, and extracts a shared VM launcher for oci run.

  • New Features

    • CLI: oci pull|inspect|unpack|clone|run|prune|rebuild-cache|status; pull adds progress and --refresh; status --json.
    • Registry/store: HTTPS via libcurl with bearer/basic auth, custom CA, loopback‑only --insecure; content‑addressable blob store; oci-layout marker; pins in OCI index.json.
    • Unpack/caches: ustar/PAX tar with gzip/zstd; whiteouts; case‑sensitive APFS sysroot; per‑run rootfs via clonefile(2); raw layer and ChainID stack caches; parallel fetch + HTTP Range resume.
    • Runtime: PATH resolver; image‑config User name/group lookup; inject /etc/{resolv.conf,hosts,hostname}; emulate /dev/{full,console} and basic /proc; multi‑arch index resolves to linux/arm64; ELFUSE_OCI_PROGRESS=plain fallback.
    • Policy/inspect: podman/skopeo‑style policy.json with registries.d overlay; CLI flags override; inspect shows runtime fields and cross‑image dedup stats.
  • Migration

    • Pins moved to OCI index.json; store auto‑migrates from refs/ on open.
    • layers/ cache schema v2; first open wipes legacy v1 cache entries (blobs/images untouched).
    • Vendors decode‑only zstd and cJSON; uses system zlib and libcurl.

Written for commit 426d7f6.

@Max042004 Max042004 changed the title from "Add elfuse oci subcommand for pulling and inspecting images" to "Add OCI image support: pull, unpack, run, prune, status, policy" on May 23, 2026

Contributor

@jserv jserv left a comment

Rebase onto the latest main branch and squash/rework the commits into fewer, cleaner ones.

@Max042004 Max042004 force-pushed the oci-image branch 2 times, most recently from e988d6e to 5d6dbc7 Compare May 23, 2026 14:38
@sysprog21 sysprog21 deleted a comment from cubic-dev-ai Bot May 23, 2026

Implement the full elfuse OCI image lifecycle as a self-contained
`elfuse oci` subcommand. Image distribution never touches
Hypervisor.framework, so the subcommand dispatches in main() before any
guest setup; only `oci run` enters the VM bring-up path.

- pull / inspect: content-addressable blob store over HTTPS with
  bearer-token + Basic auth, OCI index walk to the linux/arm64 leaf,
  parallel blob fetch with HTTP Range resume, offline inspect renderer.
- unpack: tar reader (ustar + PAX x/g records), gzip + decode-only
  vendored zstd, whiteout-aware layer apply, per-image case-sensitive
  APFS sysroot; cross-volume unpack via copyfile(2) with clone fallback.
- run: clonefile(2) per-run rootfs; Entrypoint / Cmd / Env / WorkingDir
  and symbolic/numeric User honoured; reuses the shared elfuse_launch
  bring-up so a dynamic guest runs through the same shim + syscall path.
- lifecycle: prune (--older-than / --keep-bytes), per-layer + ChainID
  stack snapshot caches, oci status (text + --json), rebuild-cache.
- policy: podman/skopeo-style policy.json + registries.d overlay;
  loopback-gated --insecure; CLI flags override.
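
For readers unfamiliar with the ChainID that keys the stack snapshot cache, the OCI image specification defines it recursively over the layer DiffIDs: the first ChainID equals the first DiffID, and each later one is the digest of the previous ChainID, a space, and the next DiffID. A minimal sketch of that definition follows; the function name is illustrative, and whether elfuse keys its cache on exactly this string is an assumption.

```python
import hashlib


def chain_ids(diff_ids):
    """Return the ChainID for each prefix of `diff_ids` (OCI image-spec rule).

    Each DiffID is a digest string such as "sha256:<hex>" of an uncompressed layer.
    """
    chain = []
    for diff_id in diff_ids:
        if not chain:
            # ChainID(L0) = DiffID(L0)
            chain.append(diff_id)
        else:
            # ChainID(L0..Ln) = Digest(ChainID(L0..Ln-1) + " " + DiffID(Ln))
            data = (chain[-1] + " " + diff_id).encode("ascii")
            chain.append("sha256:" + hashlib.sha256(data).hexdigest())
    return chain


# Hypothetical DiffIDs, derived from placeholder bytes purely for illustration.
example_diff_ids = [
    "sha256:" + hashlib.sha256(b"layer-0").hexdigest(),
    "sha256:" + hashlib.sha256(b"layer-1").hexdigest(),
    "sha256:" + hashlib.sha256(b"layer-2").hexdigest(),
]
example_chain = chain_ids(example_diff_ids)
```

Because each ChainID depends on every layer beneath it, two images share a cached stack only when they share the same ordered prefix of layers.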

Extract the VM bring-up from main() into core/launch.c (elfuse_launch)
so oci run and the positional-ELF main share one path; the host-path
resolution now lives in the caller per the guest_bootstrap_prepare split.

Vendors decode-only zstd and cJSON; uses system zlib and libcurl.
Adds 25 native test-oci-* unit suites plus an opt-in heavy compat mode.

jserv added a commit that referenced this pull request May 29, 2026

The synthetic vDSO at AT_SYSINFO_EHDR already carries DT_HASH,
LINUX_2.6.39 symbol versioning, and five __kernel_* trampolines, but
glibc 2.41's dynamic-linker vDSO probe rejected the page for lack of an
NT_GNU_ABI_TAG note: every dynamically-linked guest fell back to SVC
for clock_gettime, gettimeofday, and clock_getres. PR #34 measured
1006 ns/op against an 18 ns/op OrbStack reference, a 56x gap the
TODO Tier D P1 entry tracked as the highest-leverage single fix.

This adds the note. To avoid moving VVAR (0x0B0), TEXT_OFF_SIGRET
(0x0E0, exported in vdso.h for signal.c), or any trampoline / section
offset, the program-header table relocates from 0x040 to 0x6B0 (after
the section-header area). The reclaimed 0x040 window now holds the
32-byte NT_GNU_ABI_TAG:

  namesz : 4   ("GNU\0")
  descsz : 16
  type   : NT_GNU_ABI_TAG (1)
  name   : "GNU\0"
  desc   : { ELF_NOTE_OS_LINUX (0), 2, 6, 39 }

The descriptor's minimum kernel ABI (2.6.39) matches the LINUX_2.6.39
symbol version already exposed through DT_VERDEF, so a glibc that
honors the version also honors the note. PT_LOAD continues to cover
the whole page so the relocated PHDR table and the note both stay
mapped at runtime.
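
The 32-byte note layout listed above can be reproduced with a few lines of packing code. This is a sketch of the byte layout only, assuming little-endian aarch64 and the field values quoted in this message; the helper name is illustrative and it is not the code that generates the vDSO page.

```python
import struct

NT_GNU_ABI_TAG = 1
ELF_NOTE_OS_LINUX = 0


def build_abi_tag_note(major, minor, patch):
    """Pack an NT_GNU_ABI_TAG ELF note: header, 4-byte name, 16-byte descriptor."""
    name = b"GNU\0"
    # Descriptor: OS identifier followed by the minimum kernel ABI version.
    desc = struct.pack("<4I", ELF_NOTE_OS_LINUX, major, minor, patch)
    # Note header: namesz, descsz, type (all 32-bit little-endian words).
    header = struct.pack("<3I", len(name), len(desc), NT_GNU_ABI_TAG)
    return header + name + desc


note = build_abi_tag_note(2, 6, 39)
```

The result is 0x20 bytes long, which fits the window at 0x040 that the program-header table vacated.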

Validation, dynamically-linked glibc 2.41 binary built from the
cross-toolchain sysroot at /opt/toolchain/aarch64-linux-gnu (same
toolchain PR #34 used for the baseline):

  libc  clock_gettime  :   6.97 ns/op   (was 1006 ns/op pre-fix)
  direct vDSO call     :   6.24 ns/op   (dlsym function-pointer)
  raw   SVC syscall    : 2047.01 ns/op
  libc/vDSO ratio = 1.12x -- libc IS using the vDSO

The 0.7 ns libc-vs-direct gap is glibc's dl_sysinfo_dso dispatch, not
an SVC fallback. libc clock_gettime now beats the OrbStack reference
(18 ns/op) by ~2.6x. gettimeofday and clock_getres land on the
trampolines through the same probe path:

  libc gettimeofday    :   7.5 ns/op    (vDSO REALTIME anchor reuse)
  libc clock_getres    :   4.9 ns/op    (constant-resolution path)

readelf parses the page cleanly: e_phnum=3, e_phoff=0x6B0, three
PHDRs (PT_LOAD covering the whole page, PT_DYNAMIC at 0x420 size
0x90, PT_NOTE at 0x40 size 0x20), and `readelf -n` decodes the note as
"GNU NT_GNU_ABI_TAG OS: Linux, ABI: 2.6.39". No region overlaps;
total page usage 0x758 / 0x1000.

Static vDSO bench unchanged at 6 ns/op for the time fast paths; the
PHDR relocation only shifts where the dynamic linker looks for the
table and does not touch any code the trampolines execute. test-signal
explicit run passes, confirming the unchanged TEXT_OFF_SIGRET=0xE0
trampoline still drives the libc __restore_rt path.

jserv added a commit that referenced this pull request May 29, 2026

Three hot paths the PR #34 OrbStack baseline tracked -- getpid (~47
ns), clock_gettime through the vDSO (~2.5 ns), and 1-byte
/dev/urandom read (~134 ns) -- had no automated regression check. A
silent slip-back to the SVC fallback turned each into a ~1-2 us trap
without anything in CI to notice.

This adds an explicit guardrail. tests/bench-hot-guard.c resolves
__kernel_clock_gettime via AT_SYSINFO_EHDR + PT_DYNAMIC + DT_HASH (SysV
ELF hash walk) and measures three labels in fixed-width
"%-20s %10.1f ns/op  last=%ld" output: getpid (raw SVC), clock_gettime
(vDSO trampoline), and read-urandom1 (raw 1-byte read of /dev/urandom).
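
The SysV ELF hash walk used for the DT_HASH lookup follows the standard gABI algorithm: hash the symbol name, pick a bucket, then follow the chain array until the name matches or index 0 is reached. The sketch below models the tables as Python lists; the symbol list is illustrative (only `__kernel_clock_gettime` is named in this message) and the helper names are hypothetical, not taken from tests/bench-hot-guard.c.

```python
def sysv_elf_hash(name):
    """Standard System V gABI ELF hash over a bytes symbol name (32-bit result)."""
    h = 0
    for byte in name:
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF
    return h


def build_hash_table(names, nbucket):
    """Build DT_HASH-style bucket/chain arrays; names[0] is the undefined symbol."""
    bucket = [0] * nbucket
    chain = [0] * len(names)
    for idx in range(1, len(names)):
        b = sysv_elf_hash(names[idx]) % nbucket
        chain[idx] = bucket[b]  # link to the previous head of this bucket
        bucket[b] = idx
    return bucket, chain


def lookup(name, names, bucket, chain):
    """Return the symbol index for `name`, or None when the chain ends at 0."""
    idx = bucket[sysv_elf_hash(name) % len(bucket)]
    while idx != 0:
        if names[idx] == name:
            return idx
        idx = chain[idx]
    return None


# Illustrative symbol table; index 0 is reserved (STN_UNDEF).
SYMBOLS = [b"", b"__kernel_clock_gettime", b"__kernel_gettimeofday",
           b"__kernel_clock_getres", b"__kernel_rt_sigreturn"]
BUCKET, CHAIN = build_hash_table(SYMBOLS, nbucket=3)
found = lookup(b"__kernel_clock_gettime", SYMBOLS, BUCKET, CHAIN)
```

A real resolver would read nbucket, nchain, and both arrays from the DT_HASH section and compare names through DT_STRTAB; the list model here only shows the walk itself.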

The same source builds two binaries via a compile-time switch:
  build/bench-hot-guard
        Static glibc. Built without the macro. clock_gettime invokes
        the trampoline directly through the resolved function pointer.
        Static glibc never initializes dl_sysinfo_dso, so its libc
        wrapper falls back to raw SVC for reasons unrelated to the
        vDSO; measuring the wrapper would fail the 50 ns ceiling for
        the wrong reason. Direct call isolates the trampoline.

  build/bench-hot-guard-glibc
        Dynamic glibc. Built with -DGUARD_USE_LIBC_CG=1.
        clock_gettime invokes glibc's clock_gettime() wrapper -- which
        on glibc 2.41 + a correctly-stamped vDSO (NT_GNU_ABI_TAG
        PT_NOTE, LINUX_2.6.39 versioning) routes through the
        trampoline. A regression in the note or versioning would push
        this measurement from ~7 ns to SVC range and trip the ceiling.
        Built only when the cross-toolchain sysroot at
        $(LINUX_TOOLCHAIN)/aarch64-unknown-linux-gnu/sysroot exists;
        run with elfuse --sysroot at that path.

Disassembly verifies the split: the dynamic binary lowers
bench_clock_gettime to "bl <clock_gettime@plt>" while the static
binary lowers it to "ldr x2, [x1], #8" + indirect dispatch.

Validation:

  static     getpid 50.4 ns, clock_gettime  6.7 ns, urandom 141.9 ns
  dyn-glibc  getpid 71.9 ns, clock_gettime 17.8 ns, urandom 147.9 ns

jserv added a commit that referenced this pull request May 30, 2026

The dynamic-linker bring-up storm was the largest remaining startup band
after pull request #34. Adding a per-syscall histogram pointed at the
sidecar walker as the dominant openat cost (61% of getent startup), the
per-call path_translation_t memset as the second source, and the
opened_fd_type fstat as a small but real per-open round-trip.

src/debug/syscall-hist.[ch]: opt-in histogram via
ELFUSE_STARTUP_TRACE=syscalls (or =all alongside the existing step
trace). Lock-free atomic counters per Linux syscall number, sorted
total-ns descending in the dump. Records freeze on the first successful
execve so steady-state traffic does not pollute the startup picture.
Fork children disable the histogram explicitly because they resume from
a parent snapshot, not a fresh bring-up.

src/syscall/sidecar.c: First, a per-directory absence cache keyed by
(st_dev, st_ino, mtime, ctime) so the walker can skip the openat for
.elfuse-sidecar-index when a recent fstat on the same dirfd already saw
ENOENT. The mtime/ctime in the key closes ABA naturally and makes a
cross-process index publish observable without explicit invalidation.
Second, a cached sysroot dirfd handed out as fcntl(F_DUPFD_CLOEXEC, 0) so
each translated absolute path saves the ~30 us open(sysroot) round-trip
and the dup carries CLOEXEC across any racing posix_spawn.
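
The absence-cache idea can be sketched in a few lines: remember the directory identity plus its timestamps when the index file was missing, and skip the open while that key is unchanged. This is a simplified model using Python's os module; the names `open_index`, `dir_key`, and `stats` are hypothetical and the real cache in sidecar.c may differ in structure and eviction.

```python
import os

INDEX_NAME = ".elfuse-sidecar-index"

_absent = set()          # keys of directories known not to contain the index
stats = {"openat": 0}    # counts how many opens were actually attempted


def dir_key(dirfd):
    """Identify a directory state by device, inode, mtime, and ctime."""
    st = os.fstat(dirfd)
    return (st.st_dev, st.st_ino, st.st_mtime_ns, st.st_ctime_ns)


def open_index(dirfd):
    """Open the index relative to dirfd, or return None if it is absent."""
    key = dir_key(dirfd)
    if key in _absent:
        # Same directory state already returned ENOENT: skip the open.
        return None
    stats["openat"] += 1
    try:
        return os.open(INDEX_NAME, os.O_RDONLY, dir_fd=dirfd)
    except FileNotFoundError:
        _absent.add(key)
        return None
```

Creating the index file updates the directory's mtime and ctime, so the next fstat produces a different key and the open is attempted again without any explicit invalidation step.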

src/syscall/path.c: drop the per-call zero-init of path_translation_t.
The struct is ~12 KiB (24 metadata bytes plus three LINUX_PATH_MAX
buffers) and the buffers are read-after-written by their respective
resolvers. memset of all three was the dominant remaining cost after the
sidecar caches.

src/core/elf.c: skip the redundant memset of the file-data range in
elf_map_segments. The loader previously zeroed the full page-aligned
segment extent before issuing fread; now only the BSS portion plus page
padding (filesz to zero_len) is zeroed.

src/syscall/fs.c: skip opened_fd_type fstat when neither O_PATH nor
O_DIRECTORY is set. Dynamic-linker opens are overwhelmingly regular files
where the type is already implied. The corner where a guest opens a
directory without O_DIRECTORY and then issues getdents now returns
ENOTDIR; glibc fdopendir has required O_DIRECTORY since 2009 and the test
corpus does not exercise the corner.

src/core/startup-trace.h: env parsing extended to comma-separated tokens
(steps, syscalls, all); legacy =1 keeps enabling steps only so existing
scripts keep working.
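
The token rules described above (comma-separated steps, syscalls, all, with legacy =1 mapping to steps only) can be modelled as follows. This is a sketch of the parsing rule as stated in this message; the function name is hypothetical and ignoring unknown tokens is an assumption, since the message does not say how they are handled.

```python
def parse_startup_trace(value):
    """Return the set of enabled trace kinds for an ELFUSE_STARTUP_TRACE value."""
    enabled = set()
    if not value:
        return enabled
    for token in value.split(","):
        token = token.strip()
        if token in ("1", "steps"):
            # Legacy "=1" keeps meaning "step trace only".
            enabled.add("steps")
        elif token == "syscalls":
            enabled.add("syscalls")
        elif token == "all":
            enabled.update({"steps", "syscalls"})
        # Assumption: unrecognised tokens are silently ignored.
    return enabled
```

For example, `parse_startup_trace("steps,syscalls")` and `parse_startup_trace("all")` both enable the step trace and the syscall histogram in this model.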

Measurement: 30-run distributions under ELFUSE_STARTUP_TRACE=syscalls,
warm cache:
  bench-hot-guard-glibc startup syscalls:
    5.225 ms baseline (single sample) -> 1.33 ms p50
    (p25 1.21, p75 1.55, stdev 0.45, n=30)         3.9x
  bench openat per-call:
    135 us baseline -> 33.4 us p50
    (p25 32.4, p75 35.8, stdev 7.1, n=30)          4.0x
  getent passwd root startup syscalls:
    7.478 ms baseline -> 2.22 ms p50
    (p25 2.10, p75 2.28, stdev 0.27, n=30)         3.4x
  getent openat per-call:
    230 us baseline -> 52.9 us p50
    (p25 51.5, p75 55.1, stdev 2.2, n=30)          4.3x

End-to-end wall-clock for getent: 14.6 ms p50 (p25 14.3, p75 15.1, stdev
1.18, n=30). Bench guardrail steady-state: static getpid 74 ns,
clock_gettime 6.7 ns, urandom1 153 ns; dynamic-glibc getpid 53 ns,
clock_gettime 6.4 ns, urandom1 142 ns. All under ceilings.

The original baselines were single first-run samples; their variance
band was not measured, so the speedup ratios are best-effort relative
to the cited starting point.

Lazy FD_REGULAR to FD_DIR promotion in sys_getdents64 was attempted
but dropped after both reviewers flagged a HIGH-severity ABA hole:
a sibling close+reopen between the probe and the install could land
the original directory's DIR* onto a fresh regular file's slot. The
fix path (fd-slot generation counter or stat+inode comparison under
fd_lock) was invasive enough that the lazy promotion did not pay for
its complexity.