Fix: Fork+execve of dynamically-linked ELFs crashed the child in the dynamic-linker bring-up at a small absolute address #63
elf_map_segments computed the BSS clear extent as PAGE_ALIGN_UP(memsz) bytes from gpa. When gpa is not page-aligned, that left the tail of the last page covered by the segment untouched. A fresh bootstrap saw zero bytes there because the primary slab was MAP_ANON; after execve into the same interpreter at the same address, the host slab still held the previous incarnation's bytes, and glibc ld.so allocated the new main link_map into that tail through dl_minimal_malloc, picking up a stale l_ld value and crashing in dl_main at a small absolute address.

ld.so's RW LOAD sits at vaddr 0x2f650 in the cross-toolchain build, which made fork+execve and any dyn-to-dyn execve under --sysroot a reliable reproducer; the matrix in tests/test-fork-exec exercises the fork case end to end.

Compute the extent as PAGE_ALIGN_UP(gpa + memsz) - gpa so the trailing page is fully zeroed regardless of gpa alignment.

Skip PT_LOAD entries with memsz == 0 entirely; the unaligned-gpa rounding above would otherwise let a crafted ELF splat zeros across the tail of an earlier segment in the same page, or trip the infra-overlap check with no live mapping behind it. Linux's loader ignores zero-memsz PT_LOADs and elfuse mirrors that.
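The difference between the two extent formulas can be sketched in isolation. This is a minimal illustration, not the code from elf_map_segments: the function names clear_extent_old and clear_extent_new, the 4 KiB page size, and the PAGE_ALIGN_UP definition are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size and rounding macro; the real definitions may differ. */
#define PAGE_SIZE 4096ULL
#define PAGE_ALIGN_UP(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Old extent (hypothetical name): rounds memsz alone, so when gpa is not
 * page-aligned the range [gpa, gpa + extent) stops short of the page end. */
static uint64_t clear_extent_old(uint64_t gpa, uint64_t memsz)
{
    (void)gpa;
    return PAGE_ALIGN_UP(memsz);
}

/* Fixed extent (hypothetical name): rounds the segment end up to a page
 * boundary and measures back from gpa. Zero-memsz segments are skipped,
 * because rounding an unaligned gpa would otherwise yield a nonzero
 * extent for an empty segment. */
static uint64_t clear_extent_new(uint64_t gpa, uint64_t memsz)
{
    if (memsz == 0)
        return 0;
    assert(gpa + memsz >= gpa); /* no wraparound */
    return PAGE_ALIGN_UP(gpa + memsz) - gpa;
}
```

With gpa = 0x2f650 and memsz = 0x1000 (values chosen only to mirror the unaligned offset mentioned above), the old formula ends the cleared range at 0x30650, while the new one ends it at the page boundary 0x31000.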
sc_openat2 previously accepted RESOLVE_NO_XDEV and let the open through
without enforcement, leaving it the only RESOLVE_* flag in
include/uapi/linux/openat2.h still unimplemented. The replacement is a
left-to-right component walker in path_openat2_crosses_mount that
classifies each running prefix against a mount-class taxonomy and
returns -EXDEV the first time the class changes.
The taxonomy distinguishes the root filesystem, /proc, /dev, /sys,
/tmp, /dev/shm, and each live or tombstoned FUSE mount keyed by its
mount_id. /tmp and /dev/shm are split out because Linux mounts them as
separate tmpfs filesystems, and treating them as DEV or ROOT would
under-reject. FUSE classes live above PATH_MOUNT_FUSE_BASE = 0x10000000
so mount_id growth never collides with the named classes.
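One way to picture the taxonomy is a small classifier over path prefixes. The sketch below is an assumption-laden illustration: only PATH_MOUNT_FUSE_BASE and its value come from the description above, while the MC_* names, the under() helper, and the classify() signature (which takes an already-resolved FUSE mount_id, or -1 for none) are hypothetical.

```c
#include <string.h>

/* Hypothetical class names; the named classes stay far below the FUSE base. */
enum {
    MC_ROOT = 1,
    MC_PROC,
    MC_DEV,
    MC_SYS,
    MC_TMP,
    MC_DEV_SHM,
};
#define PATH_MOUNT_FUSE_BASE 0x10000000

/* True when path equals prefix or continues it at a '/' boundary,
 * so "/devices" is not mistaken for something under "/dev". */
static int under(const char *path, const char *prefix)
{
    size_t n = strlen(prefix);
    return strncmp(path, prefix, n) == 0 &&
           (path[n] == '\0' || path[n] == '/');
}

/* fuse_mount_id < 0 means the path is outside every FUSE mount. */
static int classify(const char *path, int fuse_mount_id)
{
    if (fuse_mount_id >= 0)
        return PATH_MOUNT_FUSE_BASE + fuse_mount_id;
    if (under(path, "/dev/shm")) return MC_DEV_SHM; /* checked before /dev */
    if (under(path, "/dev"))     return MC_DEV;
    if (under(path, "/proc"))    return MC_PROC;
    if (under(path, "/sys"))     return MC_SYS;
    if (under(path, "/tmp"))     return MC_TMP;
    return MC_ROOT;
}
```

Checking /dev/shm ahead of /dev is what keeps the two tmpfs-backed directories from collapsing into the DEV class.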
The walk anchor matches kernel semantics: absolute paths under !in_root
begin at /, anything else begins at the dirfd's tracked guest path.
dirfd_guest_base_path pulls that from proc_path for /proc dirfds, from
fuse_resolve_at_path(".") for FUSE dirfds, and from F_GETPATH stripped
of the configured sysroot for regular dirfds. Components advance the
running path; . is skipped; .. pops the trailing component but clamps
at a floor (1 for non-IN_ROOT walks so a /proc/1 -> /proc -> / cross
still surfaces, dirfd-base length for IN_ROOT walks so the precheck
never out-rejects what path_openat2_normalize_in_root applies later in
the open).
Component-by-component classification is required because lexical
collapse hides transient mount visits: /proc/self/../../tmp/foo
normalizes to /tmp/foo even though the walk passes through /proc, and
Linux NO_XDEV catches that. The walker classifies after every step so
the transient PROC excursion surfaces as EXDEV before the upward
components apply.
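The per-step classification and the .. floor can be sketched as follows. This is a simplified model, not path_openat2_crosses_mount itself: walk_no_xdev, the three-class taxonomy, the fixed 256-byte buffer, and passing the floor as a length are all assumptions for the example, and symlinks and FUSE mounts are ignored.

```c
#include <errno.h>
#include <string.h>

enum { CLS_ROOT, CLS_PROC, CLS_TMP }; /* reduced, hypothetical taxonomy */

static int under(const char *p, const char *prefix)
{
    size_t n = strlen(prefix);
    return strncmp(p, prefix, n) == 0 && (p[n] == '\0' || p[n] == '/');
}

static int classify(const char *p)
{
    if (under(p, "/proc")) return CLS_PROC;
    if (under(p, "/tmp"))  return CLS_TMP;
    return CLS_ROOT;
}

/* Walk path component by component starting from base. floor is the
 * shortest length ".." may shrink the running path to: 1 for a plain
 * walk, strlen(base) to model an IN_ROOT walk. Returns 0, -EXDEV on the
 * first class change, or -ENAMETOOLONG if the buffer would overflow. */
static int walk_no_xdev(const char *base, const char *path, size_t floor)
{
    char run[256];
    size_t len = strlen(base);
    if (len == 0 || len >= sizeof run)
        return -ENAMETOOLONG;
    memcpy(run, base, len + 1);
    int start = classify(run);

    for (const char *p = path; *p; ) {
        while (*p == '/') p++;
        size_t n = strcspn(p, "/");
        if (n == 0)
            break;
        if (n == 1 && p[0] == '.') {
            /* "." leaves the running path unchanged */
        } else if (n == 2 && p[0] == '.' && p[1] == '.') {
            /* pop the trailing component, clamped at the floor */
            while (len > floor && run[len - 1] != '/') len--;
            if (len > floor && len > 1) len--; /* drop the separator */
            run[len] = '\0';
        } else {
            if (len + 1 + n >= sizeof run)
                return -ENAMETOOLONG;
            if (run[len - 1] != '/') run[len++] = '/';
            memcpy(run + len, p, n);
            len += n;
            run[len] = '\0';
        }
        if (classify(run) != start)
            return -EXDEV; /* class changed at this step */
        p += n;
    }
    return 0;
}
```

In this model, walking /proc/self/../../etc/hosts from / reports -EXDEV at the /proc step even though lexical normalization would yield /etc/hosts, which never leaves the root class.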
fuse_path_mount_id is a new helper in src/syscall/fuse.c that looks up
the mount_id for a path under fuse_lock, returning -1 outside any FUSE
mount. The walker calls it for FUSE classification, sized so distinct
mounts compare unequal.
path_openat2_crosses_mount gains an out_start_class parameter; the
walker populates it whenever it returns non-error so the caller can
pass it straight into the post-open check. The signature change is
contained: sc_openat2 is the only caller.
Summary by cubic
Fixes a crash in fork+execve of dynamically linked ELFs by fully zeroing the PT_LOAD page tail and ignoring zero-sized PT_LOADs. Adds full openat2 RESOLVE_NO_XDEV enforcement with a pre-walk and post-open check to block cross-mount paths, including symlink and transient crossings.
Bug Fixes
New Features
Enforce RESOLVE_NO_XDEV in sc_openat2 using a component walker and a post-open verifier; classify mounts as root, /proc, /dev, /sys, /tmp, /dev/shm, and per-FUSE mount; respect absolute vs dirfd anchors, clamp .. under RESOLVE_IN_ROOT, and extend tests for transient crosses, symlink targets, bare /proc, and /dev → /dev/shm.
Written for commit 9b9a2e8. Summary will update on new commits.