Skip to content

feat(orch): decouple warm resume from memfile dedup#3166

Open
kalyazin wants to merge 9 commits into
mainfrom
en1369-inflight-memfd-serving
Open

feat(orch): decouple warm resume from memfile dedup#3166
kalyazin wants to merge 9 commits into
mainfrom
en1369-inflight-memfd-serving

Conversation

@kalyazin

@kalyazin kalyazin commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

feat(orch): decouple warm resume from memfile dedup

What & why

A warm resume that overlaps an in-flight pause of the same sandbox is delayed by
roughly the pause's memfile-dedup duration. The deduped diff is the only readable diff
source and isn't readable until dedup finishes, so a resume that lands on the origin
node while the pause is still deduping blocks — either in storage-template-memfile
(waiting for the diff header, which is produced after the dedup compare) or in
load-snapshot (UFFD memory serving waiting on the dedup drain). The data a resume
actually needs (base ⊕ dirty pages) is available the instant the pause captures the
snapshot; only dedup's compaction for upload is slow.

This PR lets a same-node resume serve dirty pages straight from the still-mapped
memfd while dedup runs in the background
, so the resume no longer waits for dedup.
Dedup still runs to completion and the uploaded artifact is unchanged.

How it works

The pause's dedup has two phases whose products land at different times: the header
(after compare) and the compacted data (after the drain). The fix covers both
windows with one mechanism, gated by one flag:

  • Provisional header (compare window). At pause we build a local-only header from
    the dirty bitmap + parent header (header.ToProvisionalDiffHeader), attributing dirty
    pages to a fresh provisional build id at identity (device) offsets, and register a
    memfd-backed diff source (block.MemfdIdentitySource) under that id. The local
    template is built from this header synchronously, so a concurrent resume gets a
    serviceable memfile immediately and reads dirty pages from the memfd — no wait on the
    header.
  • In-flight drain serving (drain window). When the deduped header resolves (after
    compare), SwapHeader points reads at the real build id, whose DedupedMemfdCache
    serves the drain window from the memfd until the compacted diff is ready, then from
    the drained cache.
  • Race-free by construction. build.File resolves an offset to (buildId, storageOff)
    from a single header snapshot, so the build id is the offset-regime tag: a read
    planned on the provisional header resolves to the memfd source (identity offsets); one
    planned on the deduped header resolves to the drained cache (packed offsets). No read
    can mix regimes.
  • The memfd is held until the swap. The dedup goroutine keeps the memfd mapped until
    the local template has swapped off the provisional header (MarkSwapped, signalled by
    the swap goroutine after SwapHeader), bounded by a grace timeout so a failed/absent
    swap can't leak the mapping. This is necessary because the swap can land after the
    drain (small dirty set + fragmented parent header → fast drain, slow header build), and
    releasing at drain-time would let a provisional read hit a released memfd. An RWMutex
    covers any straggler read against the release. The provisional store entry is not
    deleted at swap time (that would race a reader that planned on the provisional header
    but hasn't hit the store yet) — it ages out via the store TTL, harmless because no reads
    route to it post-swap and it reports FileSize 0, so it never skews disk eviction. The
    swap goroutine signals the release on every exit (success or error), so the grace is
    only a last-resort backstop.
  • The main memfile diff is pinned for the window. It shares its DedupedMemfdCache
    (and memfd) with the provisional source, but resume reads refresh only the provisional
    store entry — leaving the main entry LRU-cold. Disk-pressure eviction of the main entry
    would Close it, cancelling dedup and tearing the memfd down mid-serve. So the main
    memfile diff is pinned in the build store for the provisional window (disk-pressure
    eviction skips it; TTL still applies) and unpinned after the swap.
  • The swap never clobbers a newer header. The header on one build.File moves
    provisional → deduped (this swap) → finalized (Upload.publish). The provisional→deduped
    swap is a compare-and-swap from the still-provisional header (SwapHeaderIfCurrent),
    so if it runs late it becomes a no-op rather than reverting an already-finalized header.
  • The provisional build id never escapes the origin node. The provisional header maps
    dirty pages to a synthetic build id with no storage object, so it must never reach an
    uploaded snapshot, a peer, or placement metadata. All three consumers resolve the durable
    (deduped) header instead of the live one: the pause parent (processMemorySnapshot,
    so a resume-then-re-pause in the window can't persist an unreadable snapshot), the
    P2P header source (peerserver.headerSource.Stream, so a cross-node peer gets the
    real mappings), and scheduling metadata. build.File carries the deduped header as a
    durable-header future; the pause and P2P paths wait for it (DurableHeader), while
    scheduling metadata — on the latency-sensitive create/resume response — reads it
    non-blocking (DurableHeaderNow), reporting rootfs-only metadata during the window
    rather than blocking. Templates with no pending swap resolve all of these immediately.

Safety

  • The upload is unchanged. The upload path keeps using the deduped header; the
    provisional header + provisional build id live only in memory on the origin node and
    are never persisted. So the stored artifact is byte-for-byte unaffected.
  • Dedup is unaffected. Compare + drain still run; the deduped, compacted artifact is
    still uploaded.
  • One documented assumption (see the comment at the buildStore.Add call site): the
    provisional source must stay in the build store for the duration of the provisional
    window (the dedup compare, seconds); this holds because the store TTL (hours) dwarfs
    the window and eviction is LRU-by-TTL. We deliberately don't guard the (unrealistic)
    eviction-mid-window case.

Feature flag

Everything is gated behind memfd-dedup-inflight-serve (default off) and only takes
effect on the memfd-dedup path. With the flag off, behavior is identical to today
(resume waits for dedup). Roll out gradually.

Observability

The fix collapses distinct existing spans (rather than adding a rollout metric), so
rollout is observed via:

  • storage-template-memfile (resume path) → ≈ 0 when the header wait is removed;
  • load-snapshot (resume path) → tail drops when the drain wait is removed;
  • orchestrator_sandbox_create_duration{start_type="resume"} p99/p999 tail comes down.

Regression controls that must not move: orchestrator_sandbox_snapshot_diff_ratio{kind="dedup"}
(dedup effectiveness), dedup-pages / pause durations (dedup still runs, pause stays
fast). Correlate all with the flag-enable time.

One new counter is added: orchestrator.memfd.swap_grace_elapsed — increments when the
dedup goroutine releases the memfd via the grace timeout without a swap signal. Expected
≈ 0 (the swap normally signals before the drain finishes); a nonzero rate means swap
signals are being lost (e.g. pauses aborting before the swap goroutine spawns) and memfds
are held the full grace on those pauses — actionable, not fatal.

Testing

  • Unit tests (incl. -race): the identity-offset header builder; the memfd identity
    source (serve while mapped, BytesNotAvailableError after release, Slice); the
    in-flight DedupedMemfdCache serving + a concurrent drain/handover stress test; the
    memfd-held-until-swap ordering; the provisional-header fallback (declines for a
    non-memfd-dedup source / when disabled); DurableHeader / DurableHeaderNow and
    SwapHeaderIfCurrent on build.File; the P2P header source serving the durable header;
    and scheduling metadata using the durable header (and reporting rootfs-only while a swap
    is pending).
  • End-to-end on a dev node (dedup forced on, 15 s forced compare stall via temporary
    test-only code, resume overlapping the in-flight pause):
    • Bug reproduced (flag off): race resume blocked ~14.9 s on the in-flight dedup.
    • Fixed (flag on): the same race resume dropped to ~0.5 s (control resume ~0.66 s).
    • Regression caught & fixed by the rerun: the first flag-on run still blocked ~15 s.
      Diagnostics showed the provisional fast-path fully engaged, so the block had moved:
      SchedulingMetadata, on the resume response, was calling the blocking DurableHeader.
      Switching it to non-blocking DurableHeaderNow brought the race resume to ~0.5 s.
    • Cold-resume correctness: a 1 GiB in-RAM blob's md5 was identical across a resume
      forced to load the snapshot from object storage (orchestrators restarted to drop the
      in-memory template cache) — the uploaded artifact restores guest memory bit-for-bit.
    • Dedup preserved: dedup still ran and the uploaded artifact was byte-identical.

Not in scope / follow-ups

  • Cross-node resumes. A resume on a non-origin node can't reach the origin's memfd, so
    it doesn't get the fast serve — it waits for the deduped header. The P2P header source
    now serves the durable header, so such a resume in the window gets the real mappings
    (waiting for them) rather than the unresolvable provisional build id; a true fast cross-node serve
    (peer-fetch/prefetch of the in-flight diff) is out of scope. The cross-node path is
    covered by unit tests; a controlled cross-node e2e wasn't run — resume has origin
    affinity and there's no clean lever to force a peer resume while keeping the origin up to
    serve P2P, so that needs a placement-control harness (follow-up).
  • Handover test harness. The AddSnapshot provisional→deduped SwapHeader handover
    is covered by the e2e above but has no dedicated unit test yet (needs a build-store /
    storage-provider fixture).
  • Plan→serve straddling the release (self-healing). A read planned on the provisional
    header just before the swap can serve just after the memfd release, getting
    BytesNotAvailableError. This self-heals: the UFFD handler retries ReadAt and re-plans
    on the now-deduped header (post-swap) → drained cache → success. Worst case is one retry
    of added latency on a very narrow window, not a failed read. If it shows up in the
    load-snapshot tail after enabling the flag, the fix is a memfd→drained-cache fallback
    in MemfdIdentitySource.
  • Dedup failure with an overlapping resume (accepted degradation). If dedup itself
    errors (rare — best-effort dedup normally degrades rather than errors), the deduped
    header never resolves, so the swap can't happen and the memfd is released; a resume that
    was already in flight then fails its dirty-page faults. This is the same outcome as
    before this PR (the build is broken either way — pre-PR it failed at Memfile(), now it
    fails at fault time), and the durable-header guard still prevents any bad snapshot from
    being persisted (a re-pause's DurableHeader surfaces the dedup error and fails cleanly).
    Not worth extra machinery (keeping the memfd mapped to let the resume finish would leak
    it, since no swap can free it).

🤖 Generated with Claude Code

kalyazin and others added 5 commits July 1, 2026 15:48
A resume that overlaps an in-flight pause of the same sandbox blocks for
the full dedup duration: the deduped diff is the only readable diff
source and it is not readable until the background dedup drain finishes,
so UFFD memory serving in load-snapshot waits on DedupedMemfdCache.Wait.

Give DedupedMemfdCache an in-flight read path. After the dedup compare
publishes the metadata, expose the still-mapped memfd together with a
packed-to-absolute offset index built from the deduped dirty set. While
the drain runs, ReadAt/Slice translate the packed diff offset and copy
straight from the memfd instead of blocking on the drain; once the drain
completes, reads fall through to the drained cache. An RWMutex guards the
memfd so the drain's close cannot unmap it under an in-flight reader, and
the done barrier makes the memfd-to-cache handover race-free.

Gate the behaviour behind memfd-dedup-inflight-serve (default off); when
off the prior wait-for-drain behaviour is preserved. Dedup still runs to
completion and the uploaded artifact is unchanged.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add unit tests for the packed-to-absolute offset index (single-run
translation plus boundary and out-of-range rejection) and for
DedupedMemfdCache serving dirty pages from the memfd while the drain is
in progress, then bypassing the in-flight path once the drain resolves.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drive a real dedup goroutine with in-flight serving enabled and hammer
tryInflightRead from many goroutines across the drain window and the
memfd-to-cache handover: assert every served read returns the correct
bytes and, under -race, that the drain's memfd close never unmaps beneath
an in-flight reader.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add createIdentityMapping and DiffMetadata.ToProvisionalDiffHeader, which build
a local-only memfile header that attributes every dirty page to a given build at
an identity storage offset (its device offset) composed over the parent header.
Unlike ToDiffHeader it needs no dedup metadata, so a pause can produce it
immediately and a concurrent resume can serve from the still-mapped memfd without
waiting for the dedup compare. The uploaded artifact continues to use the deduped
ToDiffHeader output; this header is never persisted.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Assert ToProvisionalDiffHeader maps dirty pages to the new build at identity
storage offsets and leaves clean pages on the parent.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cla-bot cla-bot Bot added the cla-signed label Jul 1, 2026
@cursor

cursor Bot commented Jul 1, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Touches snapshot parenting, concurrent header swaps, memfd lifetime, and resume memory serving; incorrect durable/provisional handling or early memfd release could corrupt reads or persist unreadable headers, though default-off rollout and extensive tests mitigate this.

Overview
Warm resumes on the origin node no longer have to wait for memfile dedup when a pause is still in flight. With memfd-dedup-inflight-serve (default off, memfd-dedup only), pause builds a local-only provisional header and memfd-backed diff so the template is usable immediately; after compare, reads can still hit the mapped memfd during drain via a packed offset index, then CAS swap to the deduped header when ready. The dedup goroutine keeps the memfd until MarkSwapped, with a grace timeout and mutex-safe release. Durable header APIs ensure pauses, P2P header streaming, and scheduling metadata never persist or advertise the synthetic provisional build id (scheduling uses a non-blocking path so resume responses do not block on dedup). The build cache pins the main memfile diff during the provisional window so disk eviction cannot tear down shared dedup state. Upload and the deduped artifact path are unchanged when the flag is off.

Reviewed by Cursor Bugbot for commit 31ee7fc. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/block/memfd.go

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces inflight serving of dirty pages directly from a still-mapped memfd during the dedup drain window, using a provisional local header and a provisional diff source to prevent blocking concurrent resumes. The feedback highlights two critical nil pointer dereference risks: one in buildProvisionalMemfile where originalHeader and diffMetadata are dereferenced without prior nil checks, and another in the background goroutine of AddSnapshot where the returned memfile is not verified to be non-nil before calling SwapHeader.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1624 to +1627
dc, ok := cache.(*block.DedupedMemfdCache)
if !ok || !enabled || originalMemfile == nil {
return nil, nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The buildProvisionalMemfile function does not check if originalHeader or diffMetadata are nil before dereferencing them. If either is nil, a panic will occur during the provisional header construction. Adding nil checks to the guard clause prevents potential nil pointer dereferences.

Suggested change
dc, ok := cache.(*block.DedupedMemfdCache)
if !ok || !enabled || originalMemfile == nil {
return nil, nil
}
dc, ok := cache.(*block.DedupedMemfdCache)
if !ok || !enabled || originalMemfile == nil || originalHeader == nil || diffMetadata == nil {
return nil, nil
}
References
  1. Focus on nil-deref and error handling as specified in the repository style guide. (link)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 53ef5e1b5: added originalHeader == nil || diffMetadata == nil to the guard. On the memfd-dedup path both are effectively non-nil (the existing header goroutine already dereferences originalHeader via ToDiffHeader), so this is defensive rather than a live bug, but it makes the helper safe for nil inputs and matches the nil-deref style guide.

Comment on lines +297 to +303
mem, err := storageTemplate.Memfile(swapCtx)
if err != nil {
logger.L().Warn(swapCtx, "provisional memfile header swap: get memfile", zap.Error(err))

return
}
mem.SwapHeader(deduped)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The background goroutine for swapping the provisional header does not check if the returned memfile is nil before calling SwapHeader. If the memfile is nil, a panic will occur and crash the entire orchestrator process. Adding a nil check for the memfile ensures the background swap is safe and resilient.

			mem, err := storageTemplate.Memfile(swapCtx)
			if err != nil {
				logger.L().Warn(swapCtx, "provisional memfile header swap: get memfile", zap.Error(err))

				return
			}
			if mem == nil {
				logger.L().Warn(swapCtx, "provisional memfile header swap: memfile is nil")

				return
			}
			mem.SwapHeader(deduped)
References
  1. Focus on nil-deref and error handling as specified in the repository style guide. (link)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 53ef5e1b5: added a nil check on the returned memfile before SwapHeader. Memfile() returns (nil, err) or (non-nil, nil), so a nil without an error should not happen, but since this runs in an unrecovered background goroutine a nil deref would crash the orchestrator, so the guard is warranted.

@kalyazin kalyazin force-pushed the en1369-inflight-memfd-serving branch from 152cb65 to 678b863 Compare July 1, 2026 18:02
@kalyazin kalyazin marked this pull request as ready for review July 1, 2026 20:08
Comment on lines +285 to +310
// Swap the provisional header for the deduped one once dedup finishes, so
// subsequent reads route dirty pages to the (compacted) deduped diff and the
// provisional memfd source is no longer referenced.
if provisionalMemfileHeader != nil {
swapCtx := context.WithoutCancel(ctx)
go func() {
deduped, err := memfileHeader.Wait()
if err != nil {
logger.L().Warn(swapCtx, "provisional memfile header swap: deduped header failed", zap.Error(err))

return
}
mem, err := storageTemplate.Memfile(swapCtx)
if err != nil {
logger.L().Warn(swapCtx, "provisional memfile header swap: get memfile", zap.Error(err))

return
}
if mem == nil {
logger.L().Warn(swapCtx, "provisional memfile header swap: memfile is nil")

return
}
mem.SwapHeader(deduped)
}()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Provisional-header reads can hit a released memfd before SwapHeader swaps in the deduped build. In runDedup, releaseMemfd races the swap goroutine (both fan out from metaOut.SetValue with no happens-before ordering) — when the dedup drain is fast (small dirty set) and ToDiffHeader is slow (fragmented parent header), releaseMemfd wins, so a resume in the window gets BytesNotAvailableError from MemfdIdentitySource.ReadAt. Fix by ordering SwapHeader before releaseMemfd (e.g. gate release on a swap-done channel) or by having MemfdIdentitySource fall back through the packedIndex + drained cache when the memfd is nil.

Extended reasoning...

The race

Three goroutines fan out from runDedup (packages/orchestrator/pkg/sandbox/block/memfd.go) resolving metaOut:

Goroutine A (runDedup, memfd.go:381-405) — the same goroutine that publishes metaOut:

  1. metaOut.SetValue(meta) (line 394) — unblocks B
  2. dedupDrain(...) (line 397) — I/O bound; sub-ms for a small dirty set
  3. d.releaseMemfd() (line 399) — sets d.memfd = nil under mu.Lock
  4. d.done.SetResult(...)

Goroutine B (pauseProcessMemory header goroutine, sandbox.go:1589-1606):

  1. metaOut.Wait() (line 1595) — unblocked at A.1
  2. meta.ToDiffHeader(ctx, originalHeader, buildID) (line 1605) — CPU-bound: MergeMappings + NormalizeMappings over originalHeader.Mapping.Slice(). For a fragmented parent header (a sandbox paused many times) this is O(N) and can take tens of ms.
  3. headerOut.SetResult(...) — unblocks C

Goroutine C (AddSnapshot swap goroutine, cache.go:288-309):

  1. memfileHeader.Wait() — unblocked at B.3
  2. storageTemplate.Memfile(swapCtx)
  3. mem.SwapHeader(deduped) (line 308) — routing finally switches to the deduped build id

Nothing serializes A.3 releaseMemfd with C.3 SwapHeader. They race on the outputs of the same metaOut.SetValue.

Step-by-step proof

Consider an idle sandbox whose parent template has been paused many times, accumulating a large fragmented memfile header (originalHeader.Mapping has millions of entries), but whose current dirty set is small (a handful of pages). At t=0, dedupCompare completes and metaOut resolves:

  • t=0: metaOut.SetValue returns; A continues on the same goroutine, B/C wake on separate goroutines.
  • t≈0.2ms: A.2 dedupDrain completes (small dirty set → few pwrites). A.3 releaseMemfd sets d.memfd = nil under mu.Lock.
  • t≈15ms: B.2 ToDiffHeader finishes MergeMappings/NormalizeMappings over the fragmented parent header. headerOut resolves.
  • t≈15ms + ε: C.3 SwapHeader runs — the local template finally routes reads to the deduped build id.

During the ~15ms window [0.2ms, 15ms], the local template still uses provisionalMemfileHeader. A resume that lands in this window (exactly the target of this PR) does:

build.File.readAth.GetShiftedMapping → provisional buildID + identity offset →
buildStore.Get(provisional key)MemfdIdentitySource.ReadAt
d.ServeMemfd(b, off)mu.RLockd.memfd == nil → returns BytesNotAvailableError (memfd.go:504)

build.File.readAt only retries *block.CacheClosedError, not BytesNotAvailableError, so the error propagates to the UFFD handler.

What existing safeguards don't cover

  • The RWMutex in DedupedMemfdCache only guards the memfd pointer against use-after-close. It does NOT gate the header-routing decision that build.File.planRead has already made against the still-provisional header.
  • The tryInflightRead path only helps reads routed to the deduped build id (i.e. post-SwapHeader). Provisional-header reads never touch DedupedMemfdCache.ReadAt.
  • The build-store TTL argument is about eviction, not this ordering.
  • The a4cee7e fix (publishing d.index before metaOut.SetValue) addressed a different race — a post-swap deduped-build-id read seeing a nil packed index. It does nothing for the provisional/deduped handover on the header side.

Mitigating factor and impact

The UFFD handler at userfaultfd.go:606-641 does retry source.ReadAt up to sliceMaxRetries+1 = 4 times with exponential backoff (base 50ms, cap 500ms). Each retry re-invokes source.ReadAt which re-runs planRead, re-loading the atomic header. So if SwapHeader completes within ~525ms, the retry succeeds against the deduped build id.

  • Common case: 50–350ms of added resume latency per faulted page in the window, partially defeating the PR's stated storage-template-memfile speedup (14.8 s → 7 µs).
  • Pathological case (very fragmented parent header, GC pause, scheduler load): ToDiffHeader + SwapHeader can exceed 525ms, exhausting the retry budget and failing the resume with BytesNotAvailableError.

The documented invariant is false

The PR's own docstring on MemfdIdentitySource (memfd.go:526-528) asserts:

The memfd outlives the swap (freed only after the drain, under a write lock) and an RWMutex covers any straggler read against the free.

The RWMutex only prevents use-after-close of the fd itself. Nothing in the code enforces swap-happens-before-release. TestDedupedMemfdCache_InflightConcurrentDrainRace exercises tryInflightRead on the deduped-header path — it never routes reads through the provisional header, and the PR description explicitly acknowledges the handover has no dedicated unit test.

Suggested fixes

  1. Order swap-before-release: have the swap goroutine (AddSnapshot, cache.go:288-309) signal a channel that runDedup waits on before releaseMemfd. Straightforward.
  2. Fall back through the packed index: teach MemfdIdentitySource.ReadAt (or the deduped cache) to reverse-translate an identity offset back through packedIndex and serve from the drained cache once the memfd is released. Symmetric to tryInflightRead but going the other direction.
  3. Keep the memfd alive until swap has been observed to complete, then release.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — this was a real ordering bug, and it contradicted an invariant I had written ("the memfd outlives the swap"). Fixed in 659f9a1e0 + ef0a559d1: the dedup goroutine now keeps the memfd mapped until the local template has swapped off the provisional header. The swap goroutine calls MarkSwapped right after SwapHeader; runDedup waits on that (bounded by a 30s grace, and by ctx, so a failed/absent swap cannot leak the mapping) before releasing. The drained cache still resolves immediately, so only the release waits — and since the swap only depends on the compare it has usually already fired by the time the drain finishes, so the common path adds no latency. New regression test TestDedupedMemfdCache_MemfdHeldUntilSwap reproduces the drain-before-swap ordering (fails without the fix). I also corrected the design note.

Comment on lines +561 to +564
func (s *MemfdIdentitySource) Size() (int64, error) { return s.size, nil }
func (s *MemfdIdentitySource) BlockSize() int64 { return header.PageSize }
func (s *MemfdIdentitySource) Close() error { return nil }
func (s *MemfdIdentitySource) FileSize(context.Context) (int64, error) { return s.size, nil }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MemfdIdentitySource.FileSize returns s.size (the full guest RAM, tens of GiB), but the source wraps a still-mapped memfd with zero on-disk footprint — so when the provisional entry is scheduled for deletion its size is added to pdSizes and startDiskSpaceEviction computes used = dUsed - pUsed under-reported by ~guest-RAM bytes. With this flag on under sustained pause load, an eviction cycle that picks a provisional entry drops effective usage below the threshold for pdDelay (60s) while zero real disk bytes are freed, delaying disk-pressure reclamation. Fix: return 0 (or the actual dirty-page footprint) from MemfdIdentitySource.FileSize, and/or c.buildStore.Delete(provisional key) in the swap goroutine after SwapHeader so provisional entries do not linger for buildCacheTTL (25h).

Extended reasoning...

What the bug is. MemfdIdentitySource.FileSize at memfd.go:564 returns s.size, which buildProvisionalMemfile (sandbox.go:1637) initializes to int64(originalHeader.Metadata.Size) — the full guest RAM (multi-GiB). But the source wraps a still-mapped memfd, not a file on the cache disk, so its actual on-disk footprint is zero. Diff.FileSize is supposed to report bytes resident in the local cache file (compare (*block.Cache).FileSize at cache.go:557 which stats stat.Blocks*512, and DedupedMemfdCache.FileSize at memfd.go:581-588 which waits for the drain and delegates to the on-disk cache). MemfdIdentitySource is the outlier.

How the accounting breaks. The provisional source is wrapped in a localDiff via NewLocalDiffFromCache, and localDiff.FileSize (local_diff.go:132-134) delegates verbatim to the wrapped cache. DiffStore.deleteOldestFromCache (build/cache.go:287) calls item.Value().FileSize(ctx) and passes the value to scheduleDelete, which stores it in pdSizes[key].size (build/cache.go:327-335). startDiskSpaceEviction (build/cache.go:220-223) then computes pUsed := s.getPendingDeletesSize(), used := int64(dUsed) - pUsed, percentage := float64(used) / float64(dTotal) * 100, and skips further eviction if percentage <= threshold. A phantom entry inflates pUsed by ~guest-RAM bytes while zero real bytes will be freed by the corresponding cache.Delete (MemfdIdentitySource.Close is a no-op).

Why nothing else prevents it. The provisional entry is added to buildStore in AddSnapshot (template/cache.go) and the swap goroutine at template/cache.go:288-310 only calls SwapHeader — it never c.buildStore.Delete(provisional key). So the entry lingers up to buildCacheTTL (25h) after the SwapHeader completes; over sustained pause load with the flag on, multiple provisional entries accumulate and can be selected by RangeBackwards once older items have aged out.

Impact. Under sustained disk pressure with memfd-dedup-inflight-serve on, each provisional entry picked for eviction inflates pUsed by ~guest-RAM bytes for the pdDelay = 60s window (buildCacheDelayEviction), delaying real reclamation. It won't crash or corrupt data — it delays rather than defeats eviction — but under the exact workload the PR targets, it can meaningfully throttle disk-pressure reclamation of real cache bytes. Feature flag is off by default, so no immediate production impact today.

Step-by-step proof.

  1. Pause runs on a sandbox with Metadata.Size = 32 GiB and the flag on. buildProvisionalMemfile returns NewMemfdIdentitySource(dc, 32 GiB), wrapped in localDiff, added to buildStore via AddSnapshot.
  2. Some time later SwapHeader completes; the provisional entry is not deleted from buildStore and lingers.
  3. Disk pressure builds. deleteOldestFromCache picks the provisional entry (once older entries have been evicted) and calls FileSize, which returns 32 * 2^30.
  4. scheduleDelete stores 32 GiB in pdSizes[key].size.
  5. Next tick: pUsed = 32 GiB, used = dUsed - 32 GiB, percentage = (dUsed - 32 GiB)/dTotal * 100. On a 100 GiB disk running at 90 GiB used (90% > 85% threshold), the phantom drops used to 58 GiB and percentage to 58% — below threshold, eviction is skipped.
  6. cache.Delete fires after 60s; MemfdIdentitySource.Close is a no-op, so zero real bytes are freed. The phantom clears from pdSizes, eviction resumes, but 60s of pressure were wasted per phantom picked. Multiple concurrent provisional deletes compound.

Fix. Two-line change: (a) MemfdIdentitySource.FileSize returns 0 — it has no on-disk footprint, and eviction-selecting it also frees zero disk bytes on Close, so 0 is the accurate answer; (b) add c.buildStore.Delete(build.GetDiffStoreKey(provisionalBuildID, build.Memfile)) inside the swap goroutine after SwapHeader so the provisional entry doesn't linger for 25h.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 659f9a1e0 + ef0a559d1: MemfdIdentitySource.FileSize now returns 0 (it wraps a still-mapped memfd, not a cache file, so evicting it frees no disk bytes — returning the logical size inflated the disk-eviction accounting by ~guest-RAM bytes), and the swap goroutine now drops the provisional entry from the store via a new DiffStore.Delete(key) right after SwapHeader, so it no longer lingers for the TTL.

Comment on lines +553 to +559
// so dedup best-effort peeks and build.File.IsCached treat it as local.
func (s *MemfdIdentitySource) IsCached(_ context.Context, off, length int64) bool {
s.d.mu.RLock()
defer s.d.mu.RUnlock()

return s.d.memfd != nil && off >= 0 && off+length <= s.size
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The comment on MemfdIdentitySource.IsCached says "dedup best-effort peeks and build.File.IsCached treat it as local", but neither path actually calls this method: dedup.go:109 does base.(CachePeeker) on the parent memfile (not the diff source), and build.go:324 does diff.(block.CachePeeker) on a *localDiff, which has no IsCached method. The method is effectively dead via the described callers — consider updating the comment (or implementing CachePeeker on localDiff if the intent was for File.IsCached to leverage it). No functional impact, since MemfdIdentitySource is only referenced through the provisional-header path and reads work correctly regardless.

Extended reasoning...

The comment on MemfdIdentitySource.IsCached at packages/orchestrator/pkg/sandbox/block/memfd.go:552-559 claims the method exists "so dedup best-effort peeks and build.File.IsCached treat it as local". Verified against the current tree, both claims are wrong.

Dedup best-effort peek does not call it. In packages/orchestrator/pkg/sandbox/block/dedup.go:109, the type assertion is peeker, _ := base.(CachePeeker)base is the parent ReadonlyDevice passed into NewCacheFromMemfdDeduped (the original memfile / chunker), never the diff/provisional source. The subsequent peeker.IsCached(...) call therefore lands on the parent's IsCached implementation, not on MemfdIdentitySource.

build.File.IsCached does not call it either. In packages/orchestrator/pkg/sandbox/build/build.go:324 the assertion is peeker, ok := diff.(block.CachePeeker), where diff is what store.Lookup(...) returns — the provisional entry is registered via build.NewLocalDiffFromCache, whose concrete type is *localDiff (see packages/orchestrator/pkg/sandbox/build/local_diff.go). *localDiff has only CachePath / Close / ReadAt / Slice / Size / FileSize / CacheKey / BlockSize (verified with grep -nE "func \(.*localDiff\)" local_diff.go) — no IsCached. The cache field on localDiff is a named field, not embedded, so no method promotion occurs. The type assertion fails and File.IsCached returns false for any range routed through the provisional buildID, never invoking MemfdIdentitySource.IsCached.

Step-by-step proof (provisional buildID range):

  1. Pause runs on the memfd-dedup path, so buildProvisionalMemfile (sandbox.go:1634) creates provisionalSource := block.NewMemfdIdentitySource(dc, ...) and calls build.NewLocalDiffFromCache(provisionalKey, provisionalSource).
  2. NewLocalDiffFromCache returns a *localDiff whose cache field holds the *MemfdIdentitySource.
  3. cache.AddSnapshot (template/cache.go:250) adds this *localDiff under the provisional buildID.
  4. A subsequent File.IsCached call resolves an offset to the provisional buildID, does store.Lookup(...)*localDiff, and executes peeker, ok := diff.(block.CachePeeker).
  5. *localDiff has no IsCached method, so ok == false; File.IsCached short-circuits to false at build.go:326. MemfdIdentitySource.IsCached is never reached.

Reachability today. Beyond these two paths, the method is only invoked by memfd_test.go (the block-level unit test). No production path hits it via the callers named in the comment.

Impact and fix. Functionally none — reads through MemfdIdentitySource.ReadAt / Slice / ServeMemfd work correctly, and File.IsCached returning false is safe (its callers treat "not cached" as "go fetch it", and the fetch resolves through the same source). The only real cost is that the comment misleads future maintainers about how the provisional source integrates with IsCached callers. Either drop the "so dedup ... and build.File.IsCached treat it as local" clause from the comment, or, if the original intent was for File.IsCached to leverage this, add a small IsCached delegation on *localDiff (forwarding to its cache when the concrete cache implements CachePeeker) — that would make the comment accurate and give File.IsCached an accurate resident view during the provisional window.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the comment in 659f9a1e0: it now states the method satisfies the CachePeeker contract for callers holding a *MemfdIdentitySource directly (the block-level tests), and explicitly notes that the wrapping *build.localDiff does not promote it, so build.File.IsCached does not reach it today. No behavior change.

kalyazin and others added 2 commits July 1, 2026 22:32
…erving

Add DedupedMemfdCache.ServeMemfd (identity device-offset reads guarded by the
existing memfd RWMutex) and MemfdIdentitySource, a local diff source that serves
dirty pages from the still-mapped memfd while dedup runs. This is what a
provisional (pre-dedup) memfile header reads from, so a resume overlapping the
pause can serve dirty pages immediately instead of waiting for the deduped
header. Publish the memfd at construction and free it exactly once under the
write lock (releaseMemfd) so serving can't race the drain's memfd close. Composes
with the in-flight drain serving rather than replacing it. Not yet wired into the
pause/resume path.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Assert MemfdIdentitySource serves dirty pages from the memfd at identity offsets
while mapped, and reports BytesNotAvailableError once the memfd is released.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kalyazin kalyazin force-pushed the en1369-inflight-memfd-serving branch from 678b863 to 12b5446 Compare July 1, 2026 21:35

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 12b544669f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// Memfile() doesn't block on dedup; fall back to the deduped header future.
localMemfileHeader := memfileHeader
if provisionalMemfileHeader != nil {
localMemfileHeader = resolvedHeader(provisionalMemfileHeader)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid persisting provisional headers as parents

When this provisional header is installed as the cached template header, a sandbox resumed during the compare window can be paused again before the swap goroutine replaces it. That second pause uses the provisional memfile header as its parent, so its uploaded header can inherit mappings to the synthetic provisional build ID; there is no upload future or storage object for that ID, leaving a persisted snapshot that cold or cross-node resumes cannot read.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real durability bug — fixed in babcac37e. The provisional header must never become a durable parent: if a resume lands in the window and is paused again before the swap, a naive pause would parent off the live (provisional) header and upload a header inheriting the synthetic B_prov mappings (no storage object) → unreadable on a cold/cross-node resume. Fix: build.File now carries the deduped header as a durable-header future, and processMemorySnapshot resolves the pause parent via DurableHeader (waiting for the deduped header if a swap is still pending) instead of the live header. Serving still uses the fast provisional header; only the pause parent is forced durable. Normal pauses (no pending swap) resolve immediately. Covered by TestFile_DurableHeaderPrefersDurableFuture (f3a3e9711).

Comment on lines +319 to +320
if provisionalSwapDone != nil {
provisionalSwapDone()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep the memfd until pre-swap readers finish

For a resume overlapping the header swap, build.File can plan a segment from the provisional header and keep provisionalMemfileDiff, then this callback can fire before that segment calls ReadAt. MarkSwapped lets runDedup release the memfd, so the later MemfdIdentitySource.ReadAt returns BytesNotAvailableError even though the read was routed by a valid pre-swap header; the RWMutex only protects readers already inside ServeMemfd, not readers between planning and serving.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct that a read planned on the provisional header can still serve after the release — the memfd-hold closes the [drain, swap] gap but not the [plan, serve] gap that straddles the release instant. In practice this is self-healing rather than a failure: build.File returns the BytesNotAvailableError, the UFFD handler retries ReadAt (up to 4×), and the retry re-plans on the now-deduped header (post-swap) → routes to the real build id → served from the drained cache. So the worst case is one retry of added latency on a very narrow window, not a failed read. I opted not to add plan-tracking for it (that would need reader-generation accounting); if we see it in the load-snapshot tail after enabling the flag I will add a memfd→drained-cache fallback in MemfdIdentitySource. Flagged this explicitly in the PR notes.

@kalyazin kalyazin force-pushed the en1369-inflight-memfd-serving branch from 12b5446 to f3a3e97 Compare July 2, 2026 08:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3a3e97118

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


return
}
mem.SwapHeader(deduped)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard against overwriting finalized headers

If this swap goroutine is delayed until after Upload.publish has already swapped in the finalized header, this line can replace it with the older deduped header from memfileHeader.Wait() (still IncompletePendingUpload and missing the self Builds entry). In that ordering, later same-node child uploads or reads after the local diff is evicted observe an incomplete cached header even though the upload finished, which can make ancestor Wait return stale BuildData or make compressed self-diff resolution try the wrong uncompressed object path.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed and fixed in 4f2387167. Upload.publish does swap a finalized header into the same cached device unconditionally, so a late run of this goroutine could revert it to the older, still-IncompletePendingUpload deduped header. The swap is now a compare-and-swap from the still-provisional header (build.File.SwapHeaderIfCurrent(provisionalMemfileHeader, deduped)): once any newer header is installed (finalized by publish, or the P2P poll path) it becomes a no-op, so it can never clobber it. publish stays unconditional and authoritative. Covered by TestFile_SwapHeaderIfCurrent.

Comment thread packages/orchestrator/pkg/sandbox/template/cache.go
@kalyazin kalyazin force-pushed the en1369-inflight-memfd-serving branch from f3a3e97 to 0ab57e9 Compare July 2, 2026 09:01
Comment thread packages/orchestrator/pkg/sandbox/template/cache.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0ab57e93f5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

provisionalSwapDone()
}
if provisionalMemfileDiff != nil {
c.buildStore.Delete(provisionalMemfileDiff.CacheKey())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep the provisional diff until old-header readers drain

When a same-node resume has already loaded the provisional header but is between resolving a dirty-page mapping and calling cachedBuild, this deletion can remove the only local source for the synthetic provisional build ID. That reader then misses the DiffStore, falls through to createDiff, and tries to open object storage for a build ID that is intentionally never uploaded, so an otherwise valid in-flight resume can fail during the header-swap race instead of reading from the memfd.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — fixed in d2861e4e3 by removing the Delete. It was added last round together with FileSize → 0, but FileSize → 0 already handles the real concern (eviction accounting). Deleting the entry at swap time races in-flight readers exactly as you describe: a reader that planned on the provisional header but has not hit the store yet would miss → createDiff → a storage fetch for a build id that was never uploaded. Leaving the entry is harmless — no reads route to it after the swap, it reports FileSize 0 so it never skews disk eviction, and the store TTL reclaims it (its Close is a no-op). So the accounting fix stands and the delete race is gone.

@kalyazin kalyazin force-pushed the en1369-inflight-memfd-serving branch from 0ab57e9 to 0fddce2 Compare July 2, 2026 09:26

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0fddce2d39

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +276 to +278
if provisionalMemfileHeader != nil {
localMemfileHeader = resolvedHeader(provisionalMemfileHeader)
durableMemfileHeader = memfileHeader

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not expose provisional headers through P2P

When peer-to-peer chunk transfer is enabled, a cross-node resume during the provisional window fetches memfile.header from the origin via peerserver.headerSource.Stream, which serializes device.Header(); because this publishes the local template with the provisional header, the remote node can receive mappings to the synthetic provisional build ID, but the peer registry and object storage only know the real build ID, so subsequent memfile reads for the synthetic ID fail instead of waiting for the durable deduped header.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real, and fixed in 37aa06d59. Confirmed: peerserver.headerSource.Stream serializes device.Header() (header.go:35), wired to the memfile via resolve.go, so during the provisional window it would ship B_prov mappings to a peer that can't resolve them. The P2P header source now resolves DurableHeader(ctx) too (same guard as the pause parent): a cross-node resume in the window receives the deduped mappings — waiting for them if a swap is still pending — rather than the synthetic provisional build id. The served header stays marked IncompletePendingUpload, so the peer still polls for the finalized version. Covered by TestHeaderSource_Stream_ServesDurableHeader. This was the third B_prov exposure beyond reads and the pause parent; double-gated in practice (needs the flag and P2P both on).

@kalyazin kalyazin force-pushed the en1369-inflight-memfd-serving branch from 0fddce2 to 4b5b520 Compare July 2, 2026 09:52

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b5b520bc5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

localMemfileHeader := memfileHeader
var durableMemfileHeader *utils.SetOnce[*header.Header]
if provisionalMemfileHeader != nil {
localMemfileHeader = resolvedHeader(provisionalMemfileHeader)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep provisional IDs out of scheduling metadata

When memfd-dedup-inflight-serve is enabled and a resume hits this cached template before the swap, this publishes a live memfile header whose mappings contain the synthetic provisional build ID. storageTemplate.SchedulingMetadata still reads memfile.Header(), so SandboxCreateResponse.scheduling_metadata.memfile_build_ids can include that local-only ID; it is never uploaded or registered, so downstream placement can reason about a nonexistent layer. Please have scheduling metadata use the durable header as the pause and P2P paths do.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in eab2f2bc0. storageTemplate.SchedulingMetadata now resolves the memfile header via DurableHeader(ctx) (same guard as the pause and P2P paths), so memfile_build_ids reports the deduped build ids, never the synthetic provisional one; on a DurableHeader error it falls back to a nil memfile header (rootfs-only metadata) rather than B_prov. Covered by TestStorageTemplate_SchedulingMetadataUsesDurableHeader.

While here I audited the remaining header readers to be sure the B_prov blast radius is fully closed: the four build-id-bearing consumers are now all durable (read routing — intentionally provisional then swapped — plus pause parent, P2P header, and scheduling metadata). Size/BlockSize/UFFD read only header metadata dimensions (identical for both headers), and the upload-side Wait header lookup routes to the remote-poll branch because the provisional header is marked IncompletePendingUpload (set by newDiffHeader), so it never returns B_prov. So this should be the last of these.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction after an e2e rerun on a dev node: resolving the durable header here had to be made non-blocking. SchedulingMetadata runs synchronously on the create/resume response path, and the blocking DurableHeader(ctx) I first used waited for the deduped header — during an in-flight dedup that is the whole window, so a same-node resume still blocked ~15s even though the provisional fast-path was fully engaged (the block had just moved here). Fixed in 50c47c2df by adding a non-blocking DurableHeaderNow() and switching SchedulingMetadata to it: while the deduped header is still pending it reports rootfs-only metadata (never B_prov, never blocking); with no swap pending it returns the live/deduped header as before. The pause-parent and P2P paths keep the blocking DurableHeader (those must wait). After the fix the race resume dropped from ~14.9s to ~0.5s. Covered by TestFile_DurableHeaderNow and TestStorageTemplate_SchedulingMetadataSkipsPendingMemfile.

@kalyazin kalyazin force-pushed the en1369-inflight-memfd-serving branch 2 times, most recently from 4d2ec29 to ff968f2 Compare July 2, 2026 11:39
Comment thread packages/orchestrator/pkg/sandbox/template/cache.go
kalyazin and others added 2 commits July 2, 2026 13:05
A resume that overlaps an in-flight pause of the same sandbox can block in the
storage-template-memfile span: the local memfile template can't be built until
the deduped header resolves, which only happens after dedup's compare+defrag.
The in-flight memfd serving added earlier removes the drain wait but not this
header wait, because it is upstream of the diff source.

At pause, build a provisional local header (identity storage offsets, dirty
pages attributed to a fresh provisional build id) from the dirty bitmap and the
parent header, and register a memfd-backed diff source for that build id. The
local template is built from the provisional header immediately, so a concurrent
resume serves dirty pages straight from the still-mapped memfd instead of
blocking. Once the deduped header resolves it is swapped in (SwapHeader), routing
dirty pages to the compacted deduped diff; the distinct build id makes the swap
race-free since build.File resolves the source from the same header snapshot as
the offset. The swap is conditional (compare-and-swap from the still-provisional
header) so a late swap can't clobber a newer header the upload may have
finalized first. The provisional header is local-only: the upload continues to use
the deduped header, so the persisted artifact is unchanged.

A pause must never parent a snapshot off the provisional header: it maps dirty
pages to a synthetic build id with no storage object, so an uploaded header
that inherited those mappings (from a resume paused again before the swap) would
be unreadable on a cold or cross-node resume. build.File carries the deduped
header as a durable-header future, and processMemorySnapshot resolves the parent
via DurableHeader (waiting for the deduped header if a swap is still pending)
rather than the live, possibly-provisional header. The peer-to-peer header
source serves the durable header for the same reason, so a cross-node resume in
the window receives the deduped mappings (waiting if a swap is pending) instead
of the synthetic provisional build id it cannot resolve. Scheduling metadata
(storageTemplate.SchedulingMetadata) reads the durable header too, but
NON-BLOCKING (DurableHeaderNow): it runs on the create/resume response path, so
while the deduped header is still pending it reports rootfs-only metadata rather
than block the response for the whole dedup — and never carries the provisional
build id.

The dedup goroutine holds the memfd until the swap, bounded by a grace timeout; the
swap goroutine now signals on every exit (success or error) so the grace is only a
last-resort backstop, and a counter (orchestrator.memfd.swap_grace_elapsed) surfaces
if it ever fires. The main memfile diff is pinned in the build store for the
provisional window: it shares its DedupedMemfdCache (and memfd) with the provisional
source, so disk-pressure eviction of the LRU-cold main entry would otherwise Close it
and tear the memfd down mid-serve; it is unpinned after the swap.

Gated by memfd-dedup-inflight-serve (default off); composes with the in-flight
drain serving (provisional covers the compare window, in-flight the drain).

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the MemfdIdentitySource test to exercise Slice (identity read while
mapped, and BytesNotAvailableError after release), and add a test that
buildProvisionalMemfile declines (returns nil, nil, nil) for a non-memfd-dedup source
or when disabled, so the caller falls back to the deduped header. Also cover
build.File.DurableHeader: it returns the live header when no durable future is
set and otherwise waits for the deduped future, which is what keeps a pause from
parenting off a provisional header; that a pinned build-store entry is skipped by
disk-pressure eviction; and SwapHeaderIfCurrent, which swaps only
when the current header still matches so a late provisional swap can't clobber a
header the upload already finalized. Cover the peer-to-peer header source too:
it serves the durable (deduped) header, so a peer never receives provisional
synthetic-build-id mappings, and that storageTemplate.SchedulingMetadata reads
the durable header non-blockingly (DurableHeaderNow) so memfile_build_ids
exclude the provisional build id and never block the create/resume response —
reporting rootfs-only metadata while the deduped header is still pending.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kalyazin kalyazin force-pushed the en1369-inflight-memfd-serving branch from ff968f2 to 31ee7fc Compare July 2, 2026 12:07

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 31ee7fc. Configure here.

swapGraceElapsedCounter.Add(ctx, 1)
logger.L().Warn(ctx, "memfd swap grace elapsed; releasing memfd without a swap signal")
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swap grace without signal

Medium Severity

With memfd-dedup-inflight-serve on, runDedup always waits on swapped (or the 30s grace) when inflight is true, but MarkSwapped is only invoked from the AddSnapshot swap goroutine when a provisional header was built. If buildProvisionalMemfile falls back (nil provisional), nothing signals swapped, so every pause can hold a guest-sized memfd mapping for the full grace after drain—slower than the flag-off path and with no resume benefit.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 31ee7fc. Configure here.

// the deduped header has been swapped in, so it doesn't linger for the TTL.
func (s *DiffStore) Delete(key DiffStoreKey) {
s.cache.Delete(key)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused DiffStore Delete

Low Severity

DiffStore.Delete is added with a comment that it drops the provisional memfile entry after header swap, but nothing in this change calls it—the swap path intentionally leaves the entry to TTL to avoid store races. The method is dead and the comment describes behavior that was removed.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 31ee7fc. Configure here.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31ee7fcaae

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// window (seconds) and eviction is LRU-by-TTL, so this freshly-added
// entry is evicted last. We rely on that rather than guarding the miss:
// the eviction can't realistically occur, so a fallback isn't warranted.
c.buildStore.Add(provisionalMemfileDiff)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pin provisional memfile diffs until swap

When disk-pressure eviction runs during a long memfd compare, this synthetic provisional diff is the oldest evictable entry after the real memfile diff gets pinned below (the rootfs diff is added later). DiffStore.deleteOldestFromCache can therefore schedule and delete it after the 60s delay while the live provisional header still maps dirty pages to its synthetic build ID; a resume that reaches the store after that miss falls through to object storage, where that build ID is never uploaded. Pin this provisional key until the header swap, or otherwise make the miss retry against the deduped header.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant