feat(orch): decouple warm resume from memfile dedup by kalyazin · Pull Request #3166 · e2b-dev/infra

kalyazin · 2026-07-01T17:49:24Z

feat(orch): decouple warm resume from memfile dedup

What & why

A warm resume that overlaps an in-flight pause of the same sandbox is delayed by
roughly the pause's memfile-dedup duration. The deduped diff is the only readable diff
source and isn't readable until dedup finishes, so a resume that lands on the origin
node while the pause is still deduping blocks — either in storage-template-memfile
(waiting for the diff header, which is produced after the dedup compare) or in
load-snapshot (UFFD memory serving waiting on the dedup drain). The data a resume
actually needs (base ⊕ dirty pages) is available the instant the pause captures the
snapshot; only dedup's compaction for upload is slow.

This PR lets a same-node resume serve dirty pages straight from the still-mapped
memfd while dedup runs in the background, so the resume no longer waits for dedup.
Dedup still runs to completion and the uploaded artifact is unchanged.

How it works

The pause's dedup has two phases whose products land at different times: the header
(after compare) and the compacted data (after the drain). The fix covers both
windows with one mechanism, gated by one flag:

Provisional header (compare window). At pause we build a local-only header from
the dirty bitmap + parent header (header.ToProvisionalDiffHeader), attributing dirty
pages to a fresh provisional build id at identity (device) offsets, and register a
memfd-backed diff source (block.MemfdIdentitySource) under that id. The local
template is built from this header synchronously, so a concurrent resume gets a
serviceable memfile immediately and reads dirty pages from the memfd — no wait on the
header.
In-flight drain serving (drain window). When the deduped header resolves (after
compare), SwapHeader points reads at the real build id, whose DedupedMemfdCache
serves the drain window from the memfd until the compacted diff is ready, then from
the drained cache.
Race-free by construction. build.File resolves an offset to (buildId, storageOff)
from a single header snapshot, so the build id is the offset-regime tag: a read
planned on the provisional header resolves to the memfd source (identity offsets); one
planned on the deduped header resolves to the drained cache (packed offsets). No read
can mix regimes.
The memfd is held until the swap. The dedup goroutine keeps the memfd mapped until
the local template has swapped off the provisional header (MarkSwapped, signalled by
the swap goroutine after SwapHeader), bounded by a grace timeout so a failed/absent
swap can't leak the mapping. This is necessary because the swap can land after the
drain (small dirty set + fragmented parent header → fast drain, slow header build), and
releasing at drain-time would let a provisional read hit a released memfd. An RWMutex
covers any straggler read against the release. The provisional store entry is not
deleted at swap time (that would race a reader that planned on the provisional header
but hasn't hit the store yet) — it ages out via the store TTL, harmless because no reads
route to it post-swap and it reports FileSize 0, so it never skews disk eviction. The
swap goroutine signals the release on every exit (success or error), so the grace is
only a last-resort backstop.
The main memfile diff is pinned for the window. It shares its DedupedMemfdCache
(and memfd) with the provisional source, but resume reads refresh only the provisional
store entry — leaving the main entry LRU-cold. Disk-pressure eviction of the main entry
would Close it, cancelling dedup and tearing the memfd down mid-serve. So the main
memfile diff is pinned in the build store for the provisional window (disk-pressure
eviction skips it; TTL still applies) and unpinned after the swap.
The swap never clobbers a newer header. The header on one build.File moves
provisional → deduped (this swap) → finalized (Upload.publish). The provisional→deduped
swap is a compare-and-swap from the still-provisional header (SwapHeaderIfCurrent),
so if it runs late it becomes a no-op rather than reverting an already-finalized header.
The provisional build id never escapes the origin node. The provisional header maps
dirty pages to a synthetic build id with no storage object, so it must never reach an
uploaded snapshot, a peer, or placement metadata. All three consumers resolve the durable
(deduped) header instead of the live one: the pause parent (processMemorySnapshot,
so a resume-then-re-pause in the window can't persist an unreadable snapshot), the
P2P header source (peerserver.headerSource.Stream, so a cross-node peer gets the
real mappings), and scheduling metadata. build.File carries the deduped header as a
durable-header future; the pause and P2P paths wait for it (DurableHeader), while
scheduling metadata — on the latency-sensitive create/resume response — reads it
non-blocking (DurableHeaderNow), reporting rootfs-only metadata during the window
rather than blocking. Templates with no pending swap resolve all of these immediately.

Safety

The upload is unchanged. The upload path keeps using the deduped header; the
provisional header + provisional build id live only in memory on the origin node and
are never persisted. So the stored artifact is byte-for-byte unaffected.
Dedup is unaffected. Compare + drain still run; the deduped, compacted artifact is
still uploaded.
One documented assumption (see the comment at the buildStore.Add call site): the
provisional source must stay in the build store for the duration of the provisional
window (the dedup compare, seconds); this holds because the store TTL (hours) dwarfs
the window and eviction is LRU-by-TTL. We deliberately don't guard the (unrealistic)
eviction-mid-window case.

Feature flag

Everything is gated behind memfd-dedup-inflight-serve (default off) and only takes
effect on the memfd-dedup path. With the flag off, behavior is identical to today
(resume waits for dedup). Roll out gradually.

Observability

The fix collapses distinct existing spans (rather than adding a rollout metric), so
rollout is observed via:

storage-template-memfile (resume path) → ≈ 0 when the header wait is removed;
load-snapshot (resume path) → tail drops when the drain wait is removed;
orchestrator_sandbox_create_duration{start_type="resume"} p99/p999 tail comes down.

Regression controls that must not move: orchestrator_sandbox_snapshot_diff_ratio{kind="dedup"}
(dedup effectiveness), dedup-pages / pause durations (dedup still runs, pause stays
fast). Correlate all with the flag-enable time.

One new counter is added: orchestrator.memfd.swap_grace_elapsed — increments when the
dedup goroutine releases the memfd via the grace timeout without a swap signal. Expected
≈ 0 (the swap normally signals before the drain finishes); a nonzero rate means swap
signals are being lost (e.g. pauses aborting before the swap goroutine spawns) and memfds
are held the full grace on those pauses — actionable, not fatal.

Testing

Unit tests (incl. -race): the identity-offset header builder; the memfd identity
source (serve while mapped, BytesNotAvailableError after release, Slice); the
in-flight DedupedMemfdCache serving + a concurrent drain/handover stress test; the
memfd-held-until-swap ordering; the provisional-header fallback (declines for a
non-memfd-dedup source / when disabled); DurableHeader / DurableHeaderNow and
SwapHeaderIfCurrent on build.File; the P2P header source serving the durable header;
and scheduling metadata using the durable header (and reporting rootfs-only while a swap
is pending).
End-to-end on a dev node (dedup forced on, 15 s forced compare stall via temporary
test-only code, resume overlapping the in-flight pause):
- Bug reproduced (flag off): race resume blocked ~14.9 s on the in-flight dedup.
- Fixed (flag on): the same race resume dropped to ~0.5 s (control resume ~0.66 s).
- Regression caught & fixed by the rerun: the first flag-on run still blocked ~15 s.
  Diagnostics showed the provisional fast-path fully engaged, so the block had moved:
  SchedulingMetadata, on the resume response, was calling the blocking DurableHeader.
  Switching it to non-blocking DurableHeaderNow brought the race resume to ~0.5 s.
- Cold-resume correctness: a 1 GiB in-RAM blob's md5 was identical across a resume
  forced to load the snapshot from object storage (orchestrators restarted to drop the
  in-memory template cache) — the uploaded artifact restores guest memory bit-for-bit.
- Dedup preserved: dedup still ran and the uploaded artifact was byte-identical.

Not in scope / follow-ups

Cross-node resumes. A resume on a non-origin node can't reach the origin's memfd, so
it doesn't get the fast serve — it waits for the deduped header. The P2P header source
now serves the durable header, so such a resume in the window gets the real mappings
(waiting for them) rather than the unresolvable provisional build id; a true fast cross-node serve
(peer-fetch/prefetch of the in-flight diff) is out of scope. The cross-node path is
covered by unit tests; a controlled cross-node e2e wasn't run — resume has origin
affinity and there's no clean lever to force a peer resume while keeping the origin up to
serve P2P, so that needs a placement-control harness (follow-up).
Handover test harness. The AddSnapshot provisional→deduped SwapHeader handover
is covered by the e2e above but has no dedicated unit test yet (needs a build-store /
storage-provider fixture).
Plan→serve straddling the release (self-healing). A read planned on the provisional
header just before the swap can serve just after the memfd release, getting
BytesNotAvailableError. This self-heals: the UFFD handler retries ReadAt and re-plans
on the now-deduped header (post-swap) → drained cache → success. Worst case is one retry
of added latency on a very narrow window, not a failed read. If it shows up in the
load-snapshot tail after enabling the flag, the fix is a memfd→drained-cache fallback
in MemfdIdentitySource.
Dedup failure with an overlapping resume (accepted degradation). If dedup itself
errors (rare — best-effort dedup normally degrades rather than errors), the deduped
header never resolves, so the swap can't happen and the memfd is released; a resume that
was already in flight then fails its dirty-page faults. This is the same outcome as
before this PR (the build is broken either way — pre-PR it failed at Memfile(), now it
fails at fault time), and the durable-header guard still prevents any bad snapshot from
being persisted (a re-pause's DurableHeader surfaces the dedup error and fails cleanly).
Not worth extra machinery (keeping the memfd mapped to let the resume finish would leak
it, since no swap can free it).

🤖 Generated with Claude Code

A resume that overlaps an in-flight pause of the same sandbox blocks for the full dedup duration: the deduped diff is the only readable diff source and it is not readable until the background dedup drain finishes, so UFFD memory serving in load-snapshot waits on DedupedMemfdCache.Wait. Give DedupedMemfdCache an in-flight read path. After the dedup compare publishes the metadata, expose the still-mapped memfd together with a packed-to-absolute offset index built from the deduped dirty set. While the drain runs, ReadAt/Slice translate the packed diff offset and copy straight from the memfd instead of blocking on the drain; once the drain completes, reads fall through to the drained cache. An RWMutex guards the memfd so the drain's close cannot unmap it under an in-flight reader, and the done barrier makes the memfd-to-cache handover race-free. Gate the behaviour behind memfd-dedup-inflight-serve (default off); when off the prior wait-for-drain behaviour is preserved. Dedup still runs to completion and the uploaded artifact is unchanged. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add unit tests for the packed-to-absolute offset index (single-run translation plus boundary and out-of-range rejection) and for DedupedMemfdCache serving dirty pages from the memfd while the drain is in progress, then bypassing the in-flight path once the drain resolves. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drive a real dedup goroutine with in-flight serving enabled and hammer tryInflightRead from many goroutines across the drain window and the memfd-to-cache handover: assert every served read returns the correct bytes and, under -race, that the drain's memfd close never unmaps beneath an in-flight reader. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add createIdentityMapping and DiffMetadata.ToProvisionalDiffHeader, which build a local-only memfile header that attributes every dirty page to a given build at an identity storage offset (its device offset) composed over the parent header. Unlike ToDiffHeader it needs no dedup metadata, so a pause can produce it immediately and a concurrent resume can serve from the still-mapped memfd without waiting for the dedup compare. The uploaded artifact continues to use the deduped ToDiffHeader output; this header is never persisted. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Assert ToProvisionalDiffHeader maps dirty pages to the new build at identity storage offsets and leaves clean pages on the parent. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor · 2026-07-01T17:49:34Z

PR Summary

High Risk
Touches snapshot parenting, concurrent header swaps, memfd lifetime, and resume memory serving; incorrect durable/provisional handling or early memfd release could corrupt reads or persist unreadable headers, though default-off rollout and extensive tests mitigate this.

Overview
Warm resumes on the origin node no longer have to wait for memfile dedup when a pause is still in flight. With memfd-dedup-inflight-serve (default off, memfd-dedup only), pause builds a local-only provisional header and memfd-backed diff so the template is usable immediately; after compare, reads can still hit the mapped memfd during drain via a packed offset index, then CAS swap to the deduped header when ready. The dedup goroutine keeps the memfd until MarkSwapped, with a grace timeout and mutex-safe release. Durable header APIs ensure pauses, P2P header streaming, and scheduling metadata never persist or advertise the synthetic provisional build id (scheduling uses a non-blocking path so resume responses do not block on dedup). The build cache pins the main memfile diff during the provisional window so disk eviction cannot tear down shared dedup state. Upload and the deduped artifact path are unchanged when the flag is off.

^{Reviewed by Cursor Bugbot for commit 31ee7fc. Bugbot is set up for automated code reviews on this repo. Configure here.}

codecov · 2026-07-01T17:51:03Z

Codecov Report

❌ Patch coverage is 50.76923% with 192 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...ackages/orchestrator/pkg/sandbox/template/cache.go	0.00%	78 Missing ⚠️
packages/orchestrator/pkg/sandbox/sandbox.go	11.90%	37 Missing ⚠️
packages/orchestrator/pkg/sandbox/block/memfd.go	82.39%	20 Missing and 5 partials ⚠️
...hestrator/pkg/sandbox/template/storage_template.go	41.93%	18 Missing ⚠️
packages/orchestrator/pkg/sandbox/build/cache.go	57.89%	7 Missing and 1 partial ⚠️
...kages/orchestrator/pkg/sandbox/template/storage.go	0.00%	8 Missing ⚠️
packages/shared/pkg/storage/header/metadata.go	68.42%	4 Missing and 2 partials ⚠️
...estrator/pkg/sandbox/template/peerserver/header.go	44.44%	4 Missing and 1 partial ⚠️
packages/orchestrator/pkg/server/sandboxes.go	0.00%	3 Missing ⚠️
...strator/pkg/template/build/layer/layer_executor.go	0.00%	3 Missing ⚠️
... and 1 more

📢 Thoughts on this report? Let us know!

gemini-code-assist

Code Review

This pull request introduces inflight serving of dirty pages directly from a still-mapped memfd during the dedup drain window, using a provisional local header and a provisional diff source to prevent blocking concurrent resumes. The feedback highlights two critical nil pointer dereference risks: one in buildProvisionalMemfile where originalHeader and diffMetadata are dereferenced without prior nil checks, and another in the background goroutine of AddSnapshot where the returned memfile is not verified to be non-nil before calling SwapHeader.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-07-01T17:51:31Z

+	dc, ok := cache.(*block.DedupedMemfdCache)
+	if !ok || !enabled || originalMemfile == nil {
+		return nil, nil
+	}


The buildProvisionalMemfile function does not check if originalHeader or diffMetadata are nil before dereferencing them. If either is nil, a panic will occur during the provisional header construction. Adding nil checks to the guard clause prevents potential nil pointer dereferences.

Suggested change

dc, ok := cache.(*block.DedupedMemfdCache)

if !ok || !enabled || originalMemfile == nil {

return nil, nil

}

dc, ok := cache.(*block.DedupedMemfdCache)

if !ok || !enabled || originalMemfile == nil || originalHeader == nil || diffMetadata == nil {

return nil, nil

}

References

Focus on nil-deref and error handling as specified in the repository style guide. ^(link)

Done in 53ef5e1b5: added originalHeader == nil || diffMetadata == nil to the guard. On the memfd-dedup path both are effectively non-nil (the existing header goroutine already dereferences originalHeader via ToDiffHeader), so this is defensive rather than a live bug, but it makes the helper safe for nil inputs and matches the nil-deref style guide.

gemini-code-assist · 2026-07-01T17:51:31Z

+			mem, err := storageTemplate.Memfile(swapCtx)
+			if err != nil {
+				logger.L().Warn(swapCtx, "provisional memfile header swap: get memfile", zap.Error(err))
+
+				return
+			}
+			mem.SwapHeader(deduped)


The background goroutine for swapping the provisional header does not check if the returned memfile is nil before calling SwapHeader. If the memfile is nil, a panic will occur and crash the entire orchestrator process. Adding a nil check for the memfile ensures the background swap is safe and resilient.

mem, err := storageTemplate.Memfile(swapCtx) if err != nil { logger.L().Warn(swapCtx, "provisional memfile header swap: get memfile", zap.Error(err)) return } if mem == nil { logger.L().Warn(swapCtx, "provisional memfile header swap: memfile is nil") return } mem.SwapHeader(deduped)

References

Focus on nil-deref and error handling as specified in the repository style guide. ^(link)

Done in 53ef5e1b5: added a nil check on the returned memfile before SwapHeader. Memfile() returns (nil, err) or (non-nil, nil), so a nil without an error should not happen, but since this runs in an unrecovered background goroutine a nil deref would crash the orchestrator, so the guard is warranted.

claude · 2026-07-01T20:41:52Z

+	// Swap the provisional header for the deduped one once dedup finishes, so
+	// subsequent reads route dirty pages to the (compacted) deduped diff and the
+	// provisional memfd source is no longer referenced.
+	if provisionalMemfileHeader != nil {
+		swapCtx := context.WithoutCancel(ctx)
+		go func() {
+			deduped, err := memfileHeader.Wait()
+			if err != nil {
+				logger.L().Warn(swapCtx, "provisional memfile header swap: deduped header failed", zap.Error(err))
+
+				return
+			}
+			mem, err := storageTemplate.Memfile(swapCtx)
+			if err != nil {
+				logger.L().Warn(swapCtx, "provisional memfile header swap: get memfile", zap.Error(err))
+
+				return
+			}
+			if mem == nil {
+				logger.L().Warn(swapCtx, "provisional memfile header swap: memfile is nil")
+
+				return
+			}
+			mem.SwapHeader(deduped)
+		}()
+	}


🔴 Provisional-header reads can hit a released memfd before SwapHeader swaps in the deduped build. In runDedup, releaseMemfd races the swap goroutine (both fan out from metaOut.SetValue with no happens-before ordering) — when the dedup drain is fast (small dirty set) and ToDiffHeader is slow (fragmented parent header), releaseMemfd wins, so a resume in the window gets BytesNotAvailableError from MemfdIdentitySource.ReadAt. Fix by ordering SwapHeader before releaseMemfd (e.g. gate release on a swap-done channel) or by having MemfdIdentitySource fall back through the packedIndex + drained cache when the memfd is nil.

Extended reasoning...

The race

Three goroutines fan out from runDedup (packages/orchestrator/pkg/sandbox/block/memfd.go) resolving metaOut:

Goroutine A (runDedup, memfd.go:381-405) — the same goroutine that publishes metaOut:

metaOut.SetValue(meta) (line 394) — unblocks B

dedupDrain(...) (line 397) — I/O bound; sub-ms for a small dirty set

d.releaseMemfd() (line 399) — sets d.memfd = nil under mu.Lock

d.done.SetResult(...)

Goroutine B (pauseProcessMemory header goroutine, sandbox.go:1589-1606):

metaOut.Wait() (line 1595) — unblocked at A.1

meta.ToDiffHeader(ctx, originalHeader, buildID) (line 1605) — CPU-bound: MergeMappings + NormalizeMappings over originalHeader.Mapping.Slice(). For a fragmented parent header (a sandbox paused many times) this is O(N) and can take tens of ms.

headerOut.SetResult(...) — unblocks C

Goroutine C (AddSnapshot swap goroutine, cache.go:288-309):

memfileHeader.Wait() — unblocked at B.3

storageTemplate.Memfile(swapCtx)

mem.SwapHeader(deduped) (line 308) — routing finally switches to the deduped build id

Nothing serializes A.3 releaseMemfd with C.3 SwapHeader. They race on the outputs of the same metaOut.SetValue.

Step-by-step proof

Consider an idle sandbox whose parent template has been paused many times, accumulating a large fragmented memfile header (originalHeader.Mapping has millions of entries), but whose current dirty set is small (a handful of pages). At t=0, dedupCompare completes and metaOut resolves:

t=0: metaOut.SetValue returns; A continues on the same goroutine, B/C wake on separate goroutines.

t≈0.2ms: A.2 dedupDrain completes (small dirty set → few pwrites). A.3 releaseMemfd sets d.memfd = nil under mu.Lock.

t≈15ms: B.2 ToDiffHeader finishes MergeMappings/NormalizeMappings over the fragmented parent header. headerOut resolves.

t≈15ms + ε: C.3 SwapHeader runs — the local template finally routes reads to the deduped build id.

During the ~15ms window [0.2ms, 15ms], the local template still uses provisionalMemfileHeader. A resume that lands in this window (exactly the target of this PR) does:

build.File.readAt → h.GetShiftedMapping → provisional buildID + identity offset →
buildStore.Get(provisional key) → MemfdIdentitySource.ReadAt →
d.ServeMemfd(b, off) → mu.RLock → d.memfd == nil → returns BytesNotAvailableError (memfd.go:504)

build.File.readAt only retries *block.CacheClosedError, not BytesNotAvailableError, so the error propagates to the UFFD handler.

What existing safeguards don't cover

The RWMutex in DedupedMemfdCache only guards the memfd pointer against use-after-close. It does NOT gate the header-routing decision that build.File.planRead has already made against the still-provisional header.

The tryInflightRead path only helps reads routed to the deduped build id (i.e. post-SwapHeader). Provisional-header reads never touch DedupedMemfdCache.ReadAt.

The build-store TTL argument is about eviction, not this ordering.

The a4cee7e fix (publishing d.index before metaOut.SetValue) addressed a different race — a post-swap deduped-build-id read seeing a nil packed index. It does nothing for the provisional/deduped handover on the header side.

Mitigating factor and impact

The UFFD handler at userfaultfd.go:606-641 does retry source.ReadAt up to sliceMaxRetries+1 = 4 times with exponential backoff (base 50ms, cap 500ms). Each retry re-invokes source.ReadAt which re-runs planRead, re-loading the atomic header. So if SwapHeader completes within ~525ms, the retry succeeds against the deduped build id.

Common case: 50–350ms of added resume latency per faulted page in the window, partially defeating the PR's stated storage-template-memfile speedup (14.8 s → 7 µs).

Pathological case (very fragmented parent header, GC pause, scheduler load): ToDiffHeader + SwapHeader can exceed 525ms, exhausting the retry budget and failing the resume with BytesNotAvailableError.

The documented invariant is false

The PR's own docstring on MemfdIdentitySource (memfd.go:526-528) asserts:

The memfd outlives the swap (freed only after the drain, under a write lock) and an RWMutex covers any straggler read against the free.

The RWMutex only prevents use-after-close of the fd itself. Nothing in the code enforces swap-happens-before-release. TestDedupedMemfdCache_InflightConcurrentDrainRace exercises tryInflightRead on the deduped-header path — it never routes reads through the provisional header, and the PR description explicitly acknowledges the handover has no dedicated unit test.

Suggested fixes

Order swap-before-release: have the swap goroutine (AddSnapshot, cache.go:288-309) signal a channel that runDedup waits on before releaseMemfd. Straightforward.

Fall back through the packed index: teach MemfdIdentitySource.ReadAt (or the deduped cache) to reverse-translate an identity offset back through packedIndex and serve from the drained cache once the memfd is released. Symmetric to tryInflightRead but going the other direction.

Keep the memfd alive until swap has been observed to complete, then release.

Good catch — this was a real ordering bug, and it contradicted an invariant I had written ("the memfd outlives the swap"). Fixed in 659f9a1e0 + ef0a559d1: the dedup goroutine now keeps the memfd mapped until the local template has swapped off the provisional header. The swap goroutine calls MarkSwapped right after SwapHeader; runDedup waits on that (bounded by a 30s grace, and by ctx, so a failed/absent swap cannot leak the mapping) before releasing. The drained cache still resolves immediately, so only the release waits — and since the swap only depends on the compare it has usually already fired by the time the drain finishes, so the common path adds no latency. New regression test TestDedupedMemfdCache_MemfdHeldUntilSwap reproduces the drain-before-swap ordering (fails without the fix). I also corrected the design note.

claude · 2026-07-01T20:41:53Z

+func (s *MemfdIdentitySource) Size() (int64, error)                    { return s.size, nil }
+func (s *MemfdIdentitySource) BlockSize() int64                        { return header.PageSize }
+func (s *MemfdIdentitySource) Close() error                            { return nil }
+func (s *MemfdIdentitySource) FileSize(context.Context) (int64, error) { return s.size, nil }


🟡 MemfdIdentitySource.FileSize returns s.size (the full guest RAM, tens of GiB), but the source wraps a still-mapped memfd with zero on-disk footprint — so when the provisional entry is scheduled for deletion its size is added to pdSizes and startDiskSpaceEviction computes used = dUsed - pUsed under-reported by ~guest-RAM bytes. With this flag on under sustained pause load, an eviction cycle that picks a provisional entry drops effective usage below the threshold for pdDelay (60s) while zero real disk bytes are freed, delaying disk-pressure reclamation. Fix: return 0 (or the actual dirty-page footprint) from MemfdIdentitySource.FileSize, and/or c.buildStore.Delete(provisional key) in the swap goroutine after SwapHeader so provisional entries do not linger for buildCacheTTL (25h).

Extended reasoning...

What the bug is. MemfdIdentitySource.FileSize at memfd.go:564 returns s.size, which buildProvisionalMemfile (sandbox.go:1637) initializes to int64(originalHeader.Metadata.Size) — the full guest RAM (multi-GiB). But the source wraps a still-mapped memfd, not a file on the cache disk, so its actual on-disk footprint is zero. Diff.FileSize is supposed to report bytes resident in the local cache file (compare (*block.Cache).FileSize at cache.go:557 which stats stat.Blocks*512, and DedupedMemfdCache.FileSize at memfd.go:581-588 which waits for the drain and delegates to the on-disk cache). MemfdIdentitySource is the outlier.

How the accounting breaks. The provisional source is wrapped in a localDiff via NewLocalDiffFromCache, and localDiff.FileSize (local_diff.go:132-134) delegates verbatim to the wrapped cache. DiffStore.deleteOldestFromCache (build/cache.go:287) calls item.Value().FileSize(ctx) and passes the value to scheduleDelete, which stores it in pdSizes[key].size (build/cache.go:327-335). startDiskSpaceEviction (build/cache.go:220-223) then computes pUsed := s.getPendingDeletesSize(), used := int64(dUsed) - pUsed, percentage := float64(used) / float64(dTotal) * 100, and skips further eviction if percentage <= threshold. A phantom entry inflates pUsed by ~guest-RAM bytes while zero real bytes will be freed by the corresponding cache.Delete (MemfdIdentitySource.Close is a no-op).

Why nothing else prevents it. The provisional entry is added to buildStore in AddSnapshot (template/cache.go) and the swap goroutine at template/cache.go:288-310 only calls SwapHeader — it never c.buildStore.Delete(provisional key). So the entry lingers up to buildCacheTTL (25h) after the SwapHeader completes; over sustained pause load with the flag on, multiple provisional entries accumulate and can be selected by RangeBackwards once older items have aged out.

Impact. Under sustained disk pressure with memfd-dedup-inflight-serve on, each provisional entry picked for eviction inflates pUsed by ~guest-RAM bytes for the pdDelay = 60s window (buildCacheDelayEviction), delaying real reclamation. It won't crash or corrupt data — it delays rather than defeats eviction — but under the exact workload the PR targets, it can meaningfully throttle disk-pressure reclamation of real cache bytes. Feature flag is off by default, so no immediate production impact today.

Step-by-step proof.

Pause runs on a sandbox with Metadata.Size = 32 GiB and the flag on. buildProvisionalMemfile returns NewMemfdIdentitySource(dc, 32 GiB), wrapped in localDiff, added to buildStore via AddSnapshot.

Some time later SwapHeader completes; the provisional entry is not deleted from buildStore and lingers.

Disk pressure builds. deleteOldestFromCache picks the provisional entry (once older entries have been evicted) and calls FileSize, which returns 32 * 2^30.

scheduleDelete stores 32 GiB in pdSizes[key].size.

Next tick: pUsed = 32 GiB, used = dUsed - 32 GiB, percentage = (dUsed - 32 GiB)/dTotal * 100. On a 100 GiB disk running at 90 GiB used (90% > 85% threshold), the phantom drops used to 58 GiB and percentage to 58% — below threshold, eviction is skipped.

cache.Delete fires after 60s; MemfdIdentitySource.Close is a no-op, so zero real bytes are freed. The phantom clears from pdSizes, eviction resumes, but 60s of pressure were wasted per phantom picked. Multiple concurrent provisional deletes compound.

Fix. Two-line change: (a) MemfdIdentitySource.FileSize returns 0 — it has no on-disk footprint, and eviction-selecting it also frees zero disk bytes on Close, so 0 is the accurate answer; (b) add c.buildStore.Delete(build.GetDiffStoreKey(provisionalBuildID, build.Memfile)) inside the swap goroutine after SwapHeader so the provisional entry doesn't linger for 25h.

Fixed in 659f9a1e0 + ef0a559d1: MemfdIdentitySource.FileSize now returns 0 (it wraps a still-mapped memfd, not a cache file, so evicting it frees no disk bytes — returning the logical size inflated the disk-eviction accounting by ~guest-RAM bytes), and the swap goroutine now drops the provisional entry from the store via a new DiffStore.Delete(key) right after SwapHeader, so it no longer lingers for the TTL.

claude · 2026-07-01T20:41:55Z

+// so dedup best-effort peeks and build.File.IsCached treat it as local.
+func (s *MemfdIdentitySource) IsCached(_ context.Context, off, length int64) bool {
+	s.d.mu.RLock()
+	defer s.d.mu.RUnlock()
+
+	return s.d.memfd != nil && off >= 0 && off+length <= s.size
+}


🟡 The comment on MemfdIdentitySource.IsCached says "dedup best-effort peeks and build.File.IsCached treat it as local", but neither path actually calls this method: dedup.go:109 does base.(CachePeeker) on the parent memfile (not the diff source), and build.go:324 does diff.(block.CachePeeker) on a *localDiff, which has no IsCached method. The method is effectively dead via the described callers — consider updating the comment (or implementing CachePeeker on localDiff if the intent was for File.IsCached to leverage it). No functional impact, since MemfdIdentitySource is only referenced through the provisional-header path and reads work correctly regardless.

Extended reasoning...

The comment on MemfdIdentitySource.IsCached at packages/orchestrator/pkg/sandbox/block/memfd.go:552-559 claims the method exists "so dedup best-effort peeks and build.File.IsCached treat it as local". Verified against the current tree, both claims are wrong.

Dedup best-effort peek does not call it. In packages/orchestrator/pkg/sandbox/block/dedup.go:109, the type assertion is peeker, _ := base.(CachePeeker) — base is the parent ReadonlyDevice passed into NewCacheFromMemfdDeduped (the original memfile / chunker), never the diff/provisional source. The subsequent peeker.IsCached(...) call therefore lands on the parent's IsCached implementation, not on MemfdIdentitySource.

build.File.IsCached does not call it either. In packages/orchestrator/pkg/sandbox/build/build.go:324 the assertion is peeker, ok := diff.(block.CachePeeker), where diff is what store.Lookup(...) returns — the provisional entry is registered via build.NewLocalDiffFromCache, whose concrete type is *localDiff (see packages/orchestrator/pkg/sandbox/build/local_diff.go). *localDiff has only CachePath / Close / ReadAt / Slice / Size / FileSize / CacheKey / BlockSize (verified with grep -nE "func \(.*localDiff\)" local_diff.go) — no IsCached. The cache field on localDiff is a named field, not embedded, so no method promotion occurs. The type assertion fails and File.IsCached returns false for any range routed through the provisional buildID, never invoking MemfdIdentitySource.IsCached.

Step-by-step proof (provisional buildID range):

Pause runs on the memfd-dedup path, so buildProvisionalMemfile (sandbox.go:1634) creates provisionalSource := block.NewMemfdIdentitySource(dc, ...) and calls build.NewLocalDiffFromCache(provisionalKey, provisionalSource).

NewLocalDiffFromCache returns a *localDiff whose cache field holds the *MemfdIdentitySource.

cache.AddSnapshot (template/cache.go:250) adds this *localDiff under the provisional buildID.

A subsequent File.IsCached call resolves an offset to the provisional buildID, does store.Lookup(...) → *localDiff, and executes peeker, ok := diff.(block.CachePeeker).

*localDiff has no IsCached method, so ok == false; File.IsCached short-circuits to false at build.go:326. MemfdIdentitySource.IsCached is never reached.

Reachability today. Beyond these two paths, the method is only invoked by memfd_test.go (the block-level unit test). No production path hits it via the callers named in the comment.

Impact and fix. Functionally none — reads through MemfdIdentitySource.ReadAt / Slice / ServeMemfd work correctly, and File.IsCached returning false is safe (its callers treat "not cached" as "go fetch it", and the fetch resolves through the same source). The only real cost is that the comment misleads future maintainers about how the provisional source integrates with IsCached callers. Either drop the "so dedup ... and build.File.IsCached treat it as local" clause from the comment, or, if the original intent was for File.IsCached to leverage this, add a small IsCached delegation on *localDiff (forwarding to its cache when the concrete cache implements CachePeeker) — that would make the comment accurate and give File.IsCached an accurate resident view during the provisional window.

Fixed the comment in 659f9a1e0: it now states the method satisfies the CachePeeker contract for callers holding a *MemfdIdentitySource directly (the block-level tests), and explicitly notes that the wrapping *build.localDiff does not promote it, so build.File.IsCached does not reach it today. No behavior change.

…erving Add DedupedMemfdCache.ServeMemfd (identity device-offset reads guarded by the existing memfd RWMutex) and MemfdIdentitySource, a local diff source that serves dirty pages from the still-mapped memfd while dedup runs. This is what a provisional (pre-dedup) memfile header reads from, so a resume overlapping the pause can serve dirty pages immediately instead of waiting for the deduped header. Publish the memfd at construction and free it exactly once under the write lock (releaseMemfd) so serving can't race the drain's memfd close. Composes with the in-flight drain serving rather than replacing it. Not yet wired into the pause/resume path. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Assert MemfdIdentitySource serves dirty pages from the memfd at identity offsets while mapped, and reports BytesNotAvailableError once the memfd is released. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 12b544669f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-07-01T21:39:55Z

+	// Memfile() doesn't block on dedup; fall back to the deduped header future.
+	localMemfileHeader := memfileHeader
+	if provisionalMemfileHeader != nil {
+		localMemfileHeader = resolvedHeader(provisionalMemfileHeader)


Avoid persisting provisional headers as parents

When this provisional header is installed as the cached template header, a sandbox resumed during the compare window can be paused again before the swap goroutine replaces it. That second pause uses the provisional memfile header as its parent, so its uploaded header can inherit mappings to the synthetic provisional build ID; there is no upload future or storage object for that ID, leaving a persisted snapshot that cold or cross-node resumes cannot read.

Useful? React with 👍 / 👎.

Real durability bug — fixed in babcac37e. The provisional header must never become a durable parent: if a resume lands in the window and is paused again before the swap, a naive pause would parent off the live (provisional) header and upload a header inheriting the synthetic B_prov mappings (no storage object) → unreadable on a cold/cross-node resume. Fix: build.File now carries the deduped header as a durable-header future, and processMemorySnapshot resolves the pause parent via DurableHeader (waiting for the deduped header if a swap is still pending) instead of the live header. Serving still uses the fast provisional header; only the pause parent is forced durable. Normal pauses (no pending swap) resolve immediately. Covered by TestFile_DurableHeaderPrefersDurableFuture (f3a3e9711).

chatgpt-codex-connector · 2026-07-01T21:39:55Z

+			if provisionalSwapDone != nil {
+				provisionalSwapDone()


Keep the memfd until pre-swap readers finish

For a resume overlapping the header swap, build.File can plan a segment from the provisional header and keep provisionalMemfileDiff, then this callback can fire before that segment calls ReadAt. MarkSwapped lets runDedup release the memfd, so the later MemfdIdentitySource.ReadAt returns BytesNotAvailableError even though the read was routed by a valid pre-swap header; the RWMutex only protects readers already inside ServeMemfd, not readers between planning and serving.

Useful? React with 👍 / 👎.

Correct that a read planned on the provisional header can still serve after the release — the memfd-hold closes the [drain, swap] gap but not the [plan, serve] gap that straddles the release instant. In practice this is self-healing rather than a failure: build.File returns the BytesNotAvailableError, the UFFD handler retries ReadAt (up to 4×), and the retry re-plans on the now-deduped header (post-swap) → routes to the real build id → served from the drained cache. So the worst case is one retry of added latency on a very narrow window, not a failed read. I opted not to add plan-tracking for it (that would need reader-generation accounting); if we see it in the load-snapshot tail after enabling the flag I will add a memfd→drained-cache fallback in MemfdIdentitySource. Flagged this explicitly in the PR notes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3a3e97118

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-07-02T08:26:14Z

+
+				return
+			}
+			mem.SwapHeader(deduped)


Guard against overwriting finalized headers

If this swap goroutine is delayed until after Upload.publish has already swapped in the finalized header, this line can replace it with the older deduped header from memfileHeader.Wait() (still IncompletePendingUpload and missing the self Builds entry). In that ordering, later same-node child uploads or reads after the local diff is evicted observe an incomplete cached header even though the upload finished, which can make ancestor Wait return stale BuildData or make compressed self-diff resolution try the wrong uncompressed object path.

Useful? React with 👍 / 👎.

Confirmed and fixed in 4f2387167. Upload.publish does swap a finalized header into the same cached device unconditionally, so a late run of this goroutine could revert it to the older, still-IncompletePendingUpload deduped header. The swap is now a compare-and-swap from the still-provisional header (build.File.SwapHeaderIfCurrent(provisionalMemfileHeader, deduped)): once any newer header is installed (finalized by publish, or the P2P poll path) it becomes a no-op, so it can never clobber it. publish stays unconditional and authoritative. Covered by TestFile_SwapHeaderIfCurrent.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0ab57e93f5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-07-02T09:06:12Z

+				provisionalSwapDone()
+			}
+			if provisionalMemfileDiff != nil {
+				c.buildStore.Delete(provisionalMemfileDiff.CacheKey())


Keep the provisional diff until old-header readers drain

When a same-node resume has already loaded the provisional header but is between resolving a dirty-page mapping and calling cachedBuild, this deletion can remove the only local source for the synthetic provisional build ID. That reader then misses the DiffStore, falls through to createDiff, and tries to open object storage for a build ID that is intentionally never uploaded, so an otherwise valid in-flight resume can fail during the header-swap race instead of reading from the memfd.

Useful? React with 👍 / 👎.

Right — fixed in d2861e4e3 by removing the Delete. It was added last round together with FileSize → 0, but FileSize → 0 already handles the real concern (eviction accounting). Deleting the entry at swap time races in-flight readers exactly as you describe: a reader that planned on the provisional header but has not hit the store yet would miss → createDiff → a storage fetch for a build id that was never uploaded. Leaving the entry is harmless — no reads route to it after the swap, it reports FileSize 0 so it never skews disk eviction, and the store TTL reclaims it (its Close is a no-op). So the accounting fix stands and the delete race is gone.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0fddce2d39

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-07-02T09:32:24Z

+	if provisionalMemfileHeader != nil {
+		localMemfileHeader = resolvedHeader(provisionalMemfileHeader)
+		durableMemfileHeader = memfileHeader


Do not expose provisional headers through P2P

When peer-to-peer chunk transfer is enabled, a cross-node resume during the provisional window fetches memfile.header from the origin via peerserver.headerSource.Stream, which serializes device.Header(); because this publishes the local template with the provisional header, the remote node can receive mappings to the synthetic provisional build ID, but the peer registry and object storage only know the real build ID, so subsequent memfile reads for the synthetic ID fail instead of waiting for the durable deduped header.

Useful? React with 👍 / 👎.

Real, and fixed in 37aa06d59. Confirmed: peerserver.headerSource.Stream serializes device.Header() (header.go:35), wired to the memfile via resolve.go, so during the provisional window it would ship B_prov mappings to a peer that can't resolve them. The P2P header source now resolves DurableHeader(ctx) too (same guard as the pause parent): a cross-node resume in the window receives the deduped mappings — waiting for them if a swap is still pending — rather than the synthetic provisional build id. The served header stays marked IncompletePendingUpload, so the peer still polls for the finalized version. Covered by TestHeaderSource_Stream_ServesDurableHeader. This was the third B_prov exposure beyond reads and the pause parent; double-gated in practice (needs the flag and P2P both on).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b5b520bc5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-07-02T09:58:42Z

+	localMemfileHeader := memfileHeader
+	var durableMemfileHeader *utils.SetOnce[*header.Header]
+	if provisionalMemfileHeader != nil {
+		localMemfileHeader = resolvedHeader(provisionalMemfileHeader)


Keep provisional IDs out of scheduling metadata

When memfd-dedup-inflight-serve is enabled and a resume hits this cached template before the swap, this publishes a live memfile header whose mappings contain the synthetic provisional build ID. storageTemplate.SchedulingMetadata still reads memfile.Header(), so SandboxCreateResponse.scheduling_metadata.memfile_build_ids can include that local-only ID; it is never uploaded or registered, so downstream placement can reason about a nonexistent layer. Please have scheduling metadata use the durable header as the pause and P2P paths do.

Useful? React with 👍 / 👎.

Fixed in eab2f2bc0. storageTemplate.SchedulingMetadata now resolves the memfile header via DurableHeader(ctx) (same guard as the pause and P2P paths), so memfile_build_ids reports the deduped build ids, never the synthetic provisional one; on a DurableHeader error it falls back to a nil memfile header (rootfs-only metadata) rather than B_prov. Covered by TestStorageTemplate_SchedulingMetadataUsesDurableHeader.

While here I audited the remaining header readers to be sure the B_prov blast radius is fully closed: the four build-id-bearing consumers are now all durable (read routing — intentionally provisional then swapped — plus pause parent, P2P header, and scheduling metadata). Size/BlockSize/UFFD read only header metadata dimensions (identical for both headers), and the upload-side Wait header lookup routes to the remote-poll branch because the provisional header is marked IncompletePendingUpload (set by newDiffHeader), so it never returns B_prov. So this should be the last of these.

Correction after an e2e rerun on a dev node: resolving the durable header here had to be made non-blocking. SchedulingMetadata runs synchronously on the create/resume response path, and the blocking DurableHeader(ctx) I first used waited for the deduped header — during an in-flight dedup that is the whole window, so a same-node resume still blocked ~15s even though the provisional fast-path was fully engaged (the block had just moved here). Fixed in 50c47c2df by adding a non-blocking DurableHeaderNow() and switching SchedulingMetadata to it: while the deduped header is still pending it reports rootfs-only metadata (never B_prov, never blocking); with no swap pending it returns the live/deduped header as before. The pause-parent and P2P paths keep the blocking DurableHeader (those must wait). After the fix the race resume dropped from ~14.9s to ~0.5s. Covered by TestFile_DurableHeaderNow and TestStorageTemplate_SchedulingMetadataSkipsPendingMemfile.

A resume that overlaps an in-flight pause of the same sandbox can block in the storage-template-memfile span: the local memfile template can't be built until the deduped header resolves, which only happens after dedup's compare+defrag. The in-flight memfd serving added earlier removes the drain wait but not this header wait, because it is upstream of the diff source. At pause, build a provisional local header (identity storage offsets, dirty pages attributed to a fresh provisional build id) from the dirty bitmap and the parent header, and register a memfd-backed diff source for that build id. The local template is built from the provisional header immediately, so a concurrent resume serves dirty pages straight from the still-mapped memfd instead of blocking. Once the deduped header resolves it is swapped in (SwapHeader), routing dirty pages to the compacted deduped diff; the distinct build id makes the swap race-free since build.File resolves the source from the same header snapshot as the offset. The swap is conditional (compare-and-swap from the still-provisional header) so a late swap can't clobber a newer header the upload may have finalized first. The provisional header is local-only: the upload continues to use the deduped header, so the persisted artifact is unchanged. A pause must never parent a snapshot off the provisional header: it maps dirty pages to a synthetic build id with no storage object, so an uploaded header that inherited those mappings (from a resume paused again before the swap) would be unreadable on a cold or cross-node resume. build.File carries the deduped header as a durable-header future, and processMemorySnapshot resolves the parent via DurableHeader (waiting for the deduped header if a swap is still pending) rather than the live, possibly-provisional header. The peer-to-peer header source serves the durable header for the same reason, so a cross-node resume in the window receives the deduped mappings (waiting if a swap is pending) instead of the synthetic provisional build id it cannot resolve. Scheduling metadata (storageTemplate.SchedulingMetadata) reads the durable header too, but NON-BLOCKING (DurableHeaderNow): it runs on the create/resume response path, so while the deduped header is still pending it reports rootfs-only metadata rather than block the response for the whole dedup — and never carries the provisional build id. The dedup goroutine holds the memfd until the swap, bounded by a grace timeout; the swap goroutine now signals on every exit (success or error) so the grace is only a last-resort backstop, and a counter (orchestrator.memfd.swap_grace_elapsed) surfaces if it ever fires. The main memfile diff is pinned in the build store for the provisional window: it shares its DedupedMemfdCache (and memfd) with the provisional source, so disk-pressure eviction of the LRU-cold main entry would otherwise Close it and tear the memfd down mid-serve; it is unpinned after the swap. Gated by memfd-dedup-inflight-serve (default off); composes with the in-flight drain serving (provisional covers the compare window, in-flight the drain). Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Extend the MemfdIdentitySource test to exercise Slice (identity read while mapped, and BytesNotAvailableError after release), and add a test that buildProvisionalMemfile declines (returns nil, nil, nil) for a non-memfd-dedup source or when disabled, so the caller falls back to the deduped header. Also cover build.File.DurableHeader: it returns the live header when no durable future is set and otherwise waits for the deduped future, which is what keeps a pause from parenting off a provisional header; that a pinned build-store entry is skipped by disk-pressure eviction; and SwapHeaderIfCurrent, which swaps only when the current header still matches so a late provisional swap can't clobber a header the upload already finalized. Cover the peer-to-peer header source too: it serves the durable (deduped) header, so a peer never receives provisional synthetic-build-id mappings, and that storageTemplate.SchedulingMetadata reads the durable header non-blockingly (DurableHeaderNow) so memfile_build_ids exclude the provisional build id and never block the create/resume response — reporting rootfs-only metadata while the deduped header is still pending. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 31ee7fc. Configure here.}

cursor · 2026-07-02T12:09:40Z

+			swapGraceElapsedCounter.Add(ctx, 1)
+			logger.L().Warn(ctx, "memfd swap grace elapsed; releasing memfd without a swap signal")
+		}
+	}


Swap grace without signal

Medium Severity

With memfd-dedup-inflight-serve on, runDedup always waits on swapped (or the 30s grace) when inflight is true, but MarkSwapped is only invoked from the AddSnapshot swap goroutine when a provisional header was built. If buildProvisionalMemfile falls back (nil provisional), nothing signals swapped, so every pause can hold a guest-sized memfd mapping for the full grace after drain—slower than the flag-off path and with no resume benefit.

Additional Locations (1)

packages/orchestrator/pkg/sandbox/sandbox.go#L1521-L1522

^{Reviewed by Cursor Bugbot for commit 31ee7fc. Configure here.}

cursor · 2026-07-02T12:09:40Z

+// the deduped header has been swapped in, so it doesn't linger for the TTL.
+func (s *DiffStore) Delete(key DiffStoreKey) {
+	s.cache.Delete(key)
+}


Unused DiffStore Delete

Low Severity

DiffStore.Delete is added with a comment that it drops the provisional memfile entry after header swap, but nothing in this change calls it—the swap path intentionally leaves the entry to TTL to avoid store races. The method is dead and the comment describes behavior that was removed.

^{Reviewed by Cursor Bugbot for commit 31ee7fc. Configure here.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31ee7fcaae

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-07-02T12:12:10Z

+			// window (seconds) and eviction is LRU-by-TTL, so this freshly-added
+			// entry is evicted last. We rely on that rather than guarding the miss:
+			// the eviction can't realistically occur, so a fallback isn't warranted.
+			c.buildStore.Add(provisionalMemfileDiff)


Pin provisional memfile diffs until swap

When disk-pressure eviction runs during a long memfd compare, this synthetic provisional diff is the oldest evictable entry after the real memfile diff gets pinned below (the rootfs diff is added later). DiffStore.deleteOldestFromCache can therefore schedule and delete it after the 60s delay while the live provisional header still maps dirty pages to its synthetic build ID; a resume that reaches the store after that miss falls through to object storage, where that build ID is never uploaded. Pin this provisional key until the header swap, or otherwise make the miss retry against the deduped header.

Useful? React with 👍 / 👎.

kalyazin and others added 5 commits July 1, 2026 15:48

cla-bot Bot added the cla-signed label Jul 1, 2026

cursor Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/sandbox/block/memfd.go

gemini-code-assist Bot reviewed Jul 1, 2026

View reviewed changes

kalyazin force-pushed the en1369-inflight-memfd-serving branch from 152cb65 to 678b863 Compare July 1, 2026 18:02

kalyazin marked this pull request as ready for review July 1, 2026 20:08

kalyazin requested review from ValentaTomas, dobrac and jakubno as code owners July 1, 2026 20:08

claude Bot reviewed Jul 1, 2026

View reviewed changes

kalyazin and others added 2 commits July 1, 2026 22:32

kalyazin force-pushed the en1369-inflight-memfd-serving branch from 678b863 to 12b5446 Compare July 1, 2026 21:35

chatgpt-codex-connector Bot reviewed Jul 1, 2026

View reviewed changes

kalyazin force-pushed the en1369-inflight-memfd-serving branch from 12b5446 to f3a3e97 Compare July 2, 2026 08:20

chatgpt-codex-connector Bot reviewed Jul 2, 2026

View reviewed changes

cursor Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/sandbox/template/cache.go

kalyazin force-pushed the en1369-inflight-memfd-serving branch from f3a3e97 to 0ab57e9 Compare July 2, 2026 09:01

cursor Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/sandbox/template/cache.go Outdated

chatgpt-codex-connector Bot reviewed Jul 2, 2026

View reviewed changes

kalyazin force-pushed the en1369-inflight-memfd-serving branch from 0ab57e9 to 0fddce2 Compare July 2, 2026 09:26

chatgpt-codex-connector Bot reviewed Jul 2, 2026

View reviewed changes

kalyazin force-pushed the en1369-inflight-memfd-serving branch from 0fddce2 to 4b5b520 Compare July 2, 2026 09:52

chatgpt-codex-connector Bot reviewed Jul 2, 2026

View reviewed changes

kalyazin force-pushed the en1369-inflight-memfd-serving branch 2 times, most recently from 4d2ec29 to ff968f2 Compare July 2, 2026 11:39

cursor Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/sandbox/template/cache.go

kalyazin and others added 2 commits July 2, 2026 13:05

kalyazin force-pushed the en1369-inflight-memfd-serving branch from ff968f2 to 31ee7fc Compare July 2, 2026 12:07

cursor Bot reviewed Jul 2, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jul 2, 2026

View reviewed changes

Uh oh!

Conversation

kalyazin commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

feat(orch): decouple warm resume from memfile dedup

What & why

How it works

Safety

Feature flag

Observability

Testing

Not in scope / follow-ups

Uh oh!

cursor Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

codecov Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 1, 2026

Choose a reason for hiding this comment

The race

Step-by-step proof

What existing safeguards don't cover

Mitigating factor and impact

The documented invariant is false

Suggested fixes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kalyazin commented Jul 1, 2026 •

edited

Loading

cursor Bot commented Jul 1, 2026 •

edited

Loading

codecov Bot commented Jul 1, 2026 •

edited

Loading