feat(orch): decouple warm resume from memfile dedup#3166
Conversation
A resume that overlaps an in-flight pause of the same sandbox blocks for the full dedup duration: the deduped diff is the only readable diff source and it is not readable until the background dedup drain finishes, so UFFD memory serving in load-snapshot waits on DedupedMemfdCache.Wait. Give DedupedMemfdCache an in-flight read path. After the dedup compare publishes the metadata, expose the still-mapped memfd together with a packed-to-absolute offset index built from the deduped dirty set. While the drain runs, ReadAt/Slice translate the packed diff offset and copy straight from the memfd instead of blocking on the drain; once the drain completes, reads fall through to the drained cache. An RWMutex guards the memfd so the drain's close cannot unmap it under an in-flight reader, and the done barrier makes the memfd-to-cache handover race-free. Gate the behaviour behind memfd-dedup-inflight-serve (default off); when off the prior wait-for-drain behaviour is preserved. Dedup still runs to completion and the uploaded artifact is unchanged. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add unit tests for the packed-to-absolute offset index (single-run translation plus boundary and out-of-range rejection) and for DedupedMemfdCache serving dirty pages from the memfd while the drain is in progress, then bypassing the in-flight path once the drain resolves. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drive a real dedup goroutine with in-flight serving enabled and hammer tryInflightRead from many goroutines across the drain window and the memfd-to-cache handover: assert every served read returns the correct bytes and, under -race, that the drain's memfd close never unmaps beneath an in-flight reader. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add createIdentityMapping and DiffMetadata.ToProvisionalDiffHeader, which build a local-only memfile header that attributes every dirty page to a given build at an identity storage offset (its device offset) composed over the parent header. Unlike ToDiffHeader it needs no dedup metadata, so a pause can produce it immediately and a concurrent resume can serve from the still-mapped memfd without waiting for the dedup compare. The uploaded artifact continues to use the deduped ToDiffHeader output; this header is never persisted. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Assert ToProvisionalDiffHeader maps dirty pages to the new build at identity storage offsets and leaves clean pages on the parent. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR SummaryHigh Risk Overview Reviewed by Cursor Bugbot for commit 31ee7fc. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Code Review
This pull request introduces inflight serving of dirty pages directly from a still-mapped memfd during the dedup drain window, using a provisional local header and a provisional diff source to prevent blocking concurrent resumes. The feedback highlights two critical nil pointer dereference risks: one in buildProvisionalMemfile where originalHeader and diffMetadata are dereferenced without prior nil checks, and another in the background goroutine of AddSnapshot where the returned memfile is not verified to be non-nil before calling SwapHeader.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| dc, ok := cache.(*block.DedupedMemfdCache) | ||
| if !ok || !enabled || originalMemfile == nil { | ||
| return nil, nil | ||
| } |
There was a problem hiding this comment.
The buildProvisionalMemfile function does not check if originalHeader or diffMetadata are nil before dereferencing them. If either is nil, a panic will occur during the provisional header construction. Adding nil checks to the guard clause prevents potential nil pointer dereferences.
| dc, ok := cache.(*block.DedupedMemfdCache) | |
| if !ok || !enabled || originalMemfile == nil { | |
| return nil, nil | |
| } | |
| dc, ok := cache.(*block.DedupedMemfdCache) | |
| if !ok || !enabled || originalMemfile == nil || originalHeader == nil || diffMetadata == nil { | |
| return nil, nil | |
| } |
References
- Focus on nil-deref and error handling as specified in the repository style guide. (link)
There was a problem hiding this comment.
Done in 53ef5e1b5: added originalHeader == nil || diffMetadata == nil to the guard. On the memfd-dedup path both are effectively non-nil (the existing header goroutine already dereferences originalHeader via ToDiffHeader), so this is defensive rather than a live bug, but it makes the helper safe for nil inputs and matches the nil-deref style guide.
| mem, err := storageTemplate.Memfile(swapCtx) | ||
| if err != nil { | ||
| logger.L().Warn(swapCtx, "provisional memfile header swap: get memfile", zap.Error(err)) | ||
|
|
||
| return | ||
| } | ||
| mem.SwapHeader(deduped) |
There was a problem hiding this comment.
The background goroutine for swapping the provisional header does not check if the returned memfile is nil before calling SwapHeader. If the memfile is nil, a panic will occur and crash the entire orchestrator process. Adding a nil check for the memfile ensures the background swap is safe and resilient.
mem, err := storageTemplate.Memfile(swapCtx)
if err != nil {
logger.L().Warn(swapCtx, "provisional memfile header swap: get memfile", zap.Error(err))
return
}
if mem == nil {
logger.L().Warn(swapCtx, "provisional memfile header swap: memfile is nil")
return
}
mem.SwapHeader(deduped)References
- Focus on nil-deref and error handling as specified in the repository style guide. (link)
There was a problem hiding this comment.
Done in 53ef5e1b5: added a nil check on the returned memfile before SwapHeader. Memfile() returns (nil, err) or (non-nil, nil), so a nil without an error should not happen, but since this runs in an unrecovered background goroutine a nil deref would crash the orchestrator, so the guard is warranted.
152cb65 to
678b863
Compare
| // Swap the provisional header for the deduped one once dedup finishes, so | ||
| // subsequent reads route dirty pages to the (compacted) deduped diff and the | ||
| // provisional memfd source is no longer referenced. | ||
| if provisionalMemfileHeader != nil { | ||
| swapCtx := context.WithoutCancel(ctx) | ||
| go func() { | ||
| deduped, err := memfileHeader.Wait() | ||
| if err != nil { | ||
| logger.L().Warn(swapCtx, "provisional memfile header swap: deduped header failed", zap.Error(err)) | ||
|
|
||
| return | ||
| } | ||
| mem, err := storageTemplate.Memfile(swapCtx) | ||
| if err != nil { | ||
| logger.L().Warn(swapCtx, "provisional memfile header swap: get memfile", zap.Error(err)) | ||
|
|
||
| return | ||
| } | ||
| if mem == nil { | ||
| logger.L().Warn(swapCtx, "provisional memfile header swap: memfile is nil") | ||
|
|
||
| return | ||
| } | ||
| mem.SwapHeader(deduped) | ||
| }() | ||
| } |
There was a problem hiding this comment.
🔴 Provisional-header reads can hit a released memfd before SwapHeader swaps in the deduped build. In runDedup, releaseMemfd races the swap goroutine (both fan out from metaOut.SetValue with no happens-before ordering) — when the dedup drain is fast (small dirty set) and ToDiffHeader is slow (fragmented parent header), releaseMemfd wins, so a resume in the window gets BytesNotAvailableError from MemfdIdentitySource.ReadAt. Fix by ordering SwapHeader before releaseMemfd (e.g. gate release on a swap-done channel) or by having MemfdIdentitySource fall back through the packedIndex + drained cache when the memfd is nil.
Extended reasoning...
The race
Three goroutines fan out from runDedup (packages/orchestrator/pkg/sandbox/block/memfd.go) resolving metaOut:
Goroutine A (runDedup, memfd.go:381-405) — the same goroutine that publishes metaOut:
metaOut.SetValue(meta)(line 394) — unblocks BdedupDrain(...)(line 397) — I/O bound; sub-ms for a small dirty setd.releaseMemfd()(line 399) — setsd.memfd = nilundermu.Lockd.done.SetResult(...)
Goroutine B (pauseProcessMemory header goroutine, sandbox.go:1589-1606):
metaOut.Wait()(line 1595) — unblocked at A.1meta.ToDiffHeader(ctx, originalHeader, buildID)(line 1605) — CPU-bound:MergeMappings+NormalizeMappingsoveroriginalHeader.Mapping.Slice(). For a fragmented parent header (a sandbox paused many times) this is O(N) and can take tens of ms.headerOut.SetResult(...)— unblocks C
Goroutine C (AddSnapshot swap goroutine, cache.go:288-309):
memfileHeader.Wait()— unblocked at B.3storageTemplate.Memfile(swapCtx)mem.SwapHeader(deduped)(line 308) — routing finally switches to the deduped build id
Nothing serializes A.3 releaseMemfd with C.3 SwapHeader. They race on the outputs of the same metaOut.SetValue.
Step-by-step proof
Consider an idle sandbox whose parent template has been paused many times, accumulating a large fragmented memfile header (originalHeader.Mapping has millions of entries), but whose current dirty set is small (a handful of pages). At t=0, dedupCompare completes and metaOut resolves:
- t=0:
metaOut.SetValuereturns; A continues on the same goroutine, B/C wake on separate goroutines. - t≈0.2ms: A.2
dedupDraincompletes (small dirty set → fewpwrites). A.3releaseMemfdsetsd.memfd = nilundermu.Lock. - t≈15ms: B.2
ToDiffHeaderfinishesMergeMappings/NormalizeMappingsover the fragmented parent header.headerOutresolves. - t≈15ms + ε: C.3
SwapHeaderruns — the local template finally routes reads to the deduped build id.
During the ~15ms window [0.2ms, 15ms], the local template still uses provisionalMemfileHeader. A resume that lands in this window (exactly the target of this PR) does:
build.File.readAt → h.GetShiftedMapping → provisional buildID + identity offset →
buildStore.Get(provisional key) → MemfdIdentitySource.ReadAt →
d.ServeMemfd(b, off) → mu.RLock → d.memfd == nil → returns BytesNotAvailableError (memfd.go:504)
build.File.readAt only retries *block.CacheClosedError, not BytesNotAvailableError, so the error propagates to the UFFD handler.
What existing safeguards don't cover
- The RWMutex in
DedupedMemfdCacheonly guards the memfd pointer against use-after-close. It does NOT gate the header-routing decision thatbuild.File.planReadhas already made against the still-provisional header. - The
tryInflightReadpath only helps reads routed to the deduped build id (i.e. post-SwapHeader). Provisional-header reads never touchDedupedMemfdCache.ReadAt. - The build-store TTL argument is about eviction, not this ordering.
- The
a4cee7efix (publishingd.indexbeforemetaOut.SetValue) addressed a different race — a post-swap deduped-build-id read seeing a nil packed index. It does nothing for the provisional/deduped handover on the header side.
Mitigating factor and impact
The UFFD handler at userfaultfd.go:606-641 does retry source.ReadAt up to sliceMaxRetries+1 = 4 times with exponential backoff (base 50ms, cap 500ms). Each retry re-invokes source.ReadAt which re-runs planRead, re-loading the atomic header. So if SwapHeader completes within ~525ms, the retry succeeds against the deduped build id.
- Common case: 50–350ms of added resume latency per faulted page in the window, partially defeating the PR's stated
storage-template-memfilespeedup (14.8 s → 7 µs). - Pathological case (very fragmented parent header, GC pause, scheduler load):
ToDiffHeader+SwapHeadercan exceed 525ms, exhausting the retry budget and failing the resume withBytesNotAvailableError.
The documented invariant is false
The PR's own docstring on MemfdIdentitySource (memfd.go:526-528) asserts:
The memfd outlives the swap (freed only after the drain, under a write lock) and an RWMutex covers any straggler read against the free.
The RWMutex only prevents use-after-close of the fd itself. Nothing in the code enforces swap-happens-before-release. TestDedupedMemfdCache_InflightConcurrentDrainRace exercises tryInflightRead on the deduped-header path — it never routes reads through the provisional header, and the PR description explicitly acknowledges the handover has no dedicated unit test.
Suggested fixes
- Order swap-before-release: have the swap goroutine (
AddSnapshot, cache.go:288-309) signal a channel thatrunDedupwaits on beforereleaseMemfd. Straightforward. - Fall back through the packed index: teach
MemfdIdentitySource.ReadAt(or the deduped cache) to reverse-translate an identity offset back throughpackedIndexand serve from the drained cache once the memfd is released. Symmetric totryInflightReadbut going the other direction. - Keep the memfd alive until swap has been observed to complete, then release.
There was a problem hiding this comment.
Good catch — this was a real ordering bug, and it contradicted an invariant I had written ("the memfd outlives the swap"). Fixed in 659f9a1e0 + ef0a559d1: the dedup goroutine now keeps the memfd mapped until the local template has swapped off the provisional header. The swap goroutine calls MarkSwapped right after SwapHeader; runDedup waits on that (bounded by a 30s grace, and by ctx, so a failed/absent swap cannot leak the mapping) before releasing. The drained cache still resolves immediately, so only the release waits — and since the swap only depends on the compare it has usually already fired by the time the drain finishes, so the common path adds no latency. New regression test TestDedupedMemfdCache_MemfdHeldUntilSwap reproduces the drain-before-swap ordering (fails without the fix). I also corrected the design note.
| func (s *MemfdIdentitySource) Size() (int64, error) { return s.size, nil } | ||
| func (s *MemfdIdentitySource) BlockSize() int64 { return header.PageSize } | ||
| func (s *MemfdIdentitySource) Close() error { return nil } | ||
| func (s *MemfdIdentitySource) FileSize(context.Context) (int64, error) { return s.size, nil } |
There was a problem hiding this comment.
🟡 MemfdIdentitySource.FileSize returns s.size (the full guest RAM, tens of GiB), but the source wraps a still-mapped memfd with zero on-disk footprint — so when the provisional entry is scheduled for deletion its size is added to pdSizes and startDiskSpaceEviction computes used = dUsed - pUsed under-reported by ~guest-RAM bytes. With this flag on under sustained pause load, an eviction cycle that picks a provisional entry drops effective usage below the threshold for pdDelay (60s) while zero real disk bytes are freed, delaying disk-pressure reclamation. Fix: return 0 (or the actual dirty-page footprint) from MemfdIdentitySource.FileSize, and/or c.buildStore.Delete(provisional key) in the swap goroutine after SwapHeader so provisional entries do not linger for buildCacheTTL (25h).
Extended reasoning...
What the bug is. MemfdIdentitySource.FileSize at memfd.go:564 returns s.size, which buildProvisionalMemfile (sandbox.go:1637) initializes to int64(originalHeader.Metadata.Size) — the full guest RAM (multi-GiB). But the source wraps a still-mapped memfd, not a file on the cache disk, so its actual on-disk footprint is zero. Diff.FileSize is supposed to report bytes resident in the local cache file (compare (*block.Cache).FileSize at cache.go:557 which stats stat.Blocks*512, and DedupedMemfdCache.FileSize at memfd.go:581-588 which waits for the drain and delegates to the on-disk cache). MemfdIdentitySource is the outlier.
How the accounting breaks. The provisional source is wrapped in a localDiff via NewLocalDiffFromCache, and localDiff.FileSize (local_diff.go:132-134) delegates verbatim to the wrapped cache. DiffStore.deleteOldestFromCache (build/cache.go:287) calls item.Value().FileSize(ctx) and passes the value to scheduleDelete, which stores it in pdSizes[key].size (build/cache.go:327-335). startDiskSpaceEviction (build/cache.go:220-223) then computes pUsed := s.getPendingDeletesSize(), used := int64(dUsed) - pUsed, percentage := float64(used) / float64(dTotal) * 100, and skips further eviction if percentage <= threshold. A phantom entry inflates pUsed by ~guest-RAM bytes while zero real bytes will be freed by the corresponding cache.Delete (MemfdIdentitySource.Close is a no-op).
Why nothing else prevents it. The provisional entry is added to buildStore in AddSnapshot (template/cache.go) and the swap goroutine at template/cache.go:288-310 only calls SwapHeader — it never c.buildStore.Delete(provisional key). So the entry lingers up to buildCacheTTL (25h) after the SwapHeader completes; over sustained pause load with the flag on, multiple provisional entries accumulate and can be selected by RangeBackwards once older items have aged out.
Impact. Under sustained disk pressure with memfd-dedup-inflight-serve on, each provisional entry picked for eviction inflates pUsed by ~guest-RAM bytes for the pdDelay = 60s window (buildCacheDelayEviction), delaying real reclamation. It won't crash or corrupt data — it delays rather than defeats eviction — but under the exact workload the PR targets, it can meaningfully throttle disk-pressure reclamation of real cache bytes. Feature flag is off by default, so no immediate production impact today.
Step-by-step proof.
- Pause runs on a sandbox with
Metadata.Size = 32 GiBand the flag on.buildProvisionalMemfilereturnsNewMemfdIdentitySource(dc, 32 GiB), wrapped inlocalDiff, added tobuildStoreviaAddSnapshot. - Some time later
SwapHeadercompletes; the provisional entry is not deleted frombuildStoreand lingers. - Disk pressure builds.
deleteOldestFromCachepicks the provisional entry (once older entries have been evicted) and callsFileSize, which returns32 * 2^30. scheduleDeletestores32 GiBinpdSizes[key].size.- Next tick:
pUsed = 32 GiB,used = dUsed - 32 GiB,percentage = (dUsed - 32 GiB)/dTotal * 100. On a 100 GiB disk running at 90 GiB used (90% > 85% threshold), the phantom dropsusedto 58 GiB andpercentageto 58% — below threshold, eviction is skipped. cache.Deletefires after 60s;MemfdIdentitySource.Closeis a no-op, so zero real bytes are freed. The phantom clears frompdSizes, eviction resumes, but 60s of pressure were wasted per phantom picked. Multiple concurrent provisional deletes compound.
Fix. Two-line change: (a) MemfdIdentitySource.FileSize returns 0 — it has no on-disk footprint, and eviction-selecting it also frees zero disk bytes on Close, so 0 is the accurate answer; (b) add c.buildStore.Delete(build.GetDiffStoreKey(provisionalBuildID, build.Memfile)) inside the swap goroutine after SwapHeader so the provisional entry doesn't linger for 25h.
There was a problem hiding this comment.
Fixed in 659f9a1e0 + ef0a559d1: MemfdIdentitySource.FileSize now returns 0 (it wraps a still-mapped memfd, not a cache file, so evicting it frees no disk bytes — returning the logical size inflated the disk-eviction accounting by ~guest-RAM bytes), and the swap goroutine now drops the provisional entry from the store via a new DiffStore.Delete(key) right after SwapHeader, so it no longer lingers for the TTL.
| // so dedup best-effort peeks and build.File.IsCached treat it as local. | ||
| func (s *MemfdIdentitySource) IsCached(_ context.Context, off, length int64) bool { | ||
| s.d.mu.RLock() | ||
| defer s.d.mu.RUnlock() | ||
|
|
||
| return s.d.memfd != nil && off >= 0 && off+length <= s.size | ||
| } |
There was a problem hiding this comment.
🟡 The comment on MemfdIdentitySource.IsCached says "dedup best-effort peeks and build.File.IsCached treat it as local", but neither path actually calls this method: dedup.go:109 does base.(CachePeeker) on the parent memfile (not the diff source), and build.go:324 does diff.(block.CachePeeker) on a *localDiff, which has no IsCached method. The method is effectively dead via the described callers — consider updating the comment (or implementing CachePeeker on localDiff if the intent was for File.IsCached to leverage it). No functional impact, since MemfdIdentitySource is only referenced through the provisional-header path and reads work correctly regardless.
Extended reasoning...
The comment on MemfdIdentitySource.IsCached at packages/orchestrator/pkg/sandbox/block/memfd.go:552-559 claims the method exists "so dedup best-effort peeks and build.File.IsCached treat it as local". Verified against the current tree, both claims are wrong.
Dedup best-effort peek does not call it. In packages/orchestrator/pkg/sandbox/block/dedup.go:109, the type assertion is peeker, _ := base.(CachePeeker) — base is the parent ReadonlyDevice passed into NewCacheFromMemfdDeduped (the original memfile / chunker), never the diff/provisional source. The subsequent peeker.IsCached(...) call therefore lands on the parent's IsCached implementation, not on MemfdIdentitySource.
build.File.IsCached does not call it either. In packages/orchestrator/pkg/sandbox/build/build.go:324 the assertion is peeker, ok := diff.(block.CachePeeker), where diff is what store.Lookup(...) returns — the provisional entry is registered via build.NewLocalDiffFromCache, whose concrete type is *localDiff (see packages/orchestrator/pkg/sandbox/build/local_diff.go). *localDiff has only CachePath / Close / ReadAt / Slice / Size / FileSize / CacheKey / BlockSize (verified with grep -nE "func \(.*localDiff\)" local_diff.go) — no IsCached. The cache field on localDiff is a named field, not embedded, so no method promotion occurs. The type assertion fails and File.IsCached returns false for any range routed through the provisional buildID, never invoking MemfdIdentitySource.IsCached.
Step-by-step proof (provisional buildID range):
- Pause runs on the memfd-dedup path, so
buildProvisionalMemfile(sandbox.go:1634) createsprovisionalSource := block.NewMemfdIdentitySource(dc, ...)and callsbuild.NewLocalDiffFromCache(provisionalKey, provisionalSource). NewLocalDiffFromCachereturns a*localDiffwhosecachefield holds the*MemfdIdentitySource.cache.AddSnapshot(template/cache.go:250) adds this*localDiffunder the provisional buildID.- A subsequent
File.IsCachedcall resolves an offset to the provisional buildID, doesstore.Lookup(...)→*localDiff, and executespeeker, ok := diff.(block.CachePeeker). *localDiffhas noIsCachedmethod, sook == false;File.IsCachedshort-circuits tofalseatbuild.go:326.MemfdIdentitySource.IsCachedis never reached.
Reachability today. Beyond these two paths, the method is only invoked by memfd_test.go (the block-level unit test). No production path hits it via the callers named in the comment.
Impact and fix. Functionally none — reads through MemfdIdentitySource.ReadAt / Slice / ServeMemfd work correctly, and File.IsCached returning false is safe (its callers treat "not cached" as "go fetch it", and the fetch resolves through the same source). The only real cost is that the comment misleads future maintainers about how the provisional source integrates with IsCached callers. Either drop the "so dedup ... and build.File.IsCached treat it as local" clause from the comment, or, if the original intent was for File.IsCached to leverage this, add a small IsCached delegation on *localDiff (forwarding to its cache when the concrete cache implements CachePeeker) — that would make the comment accurate and give File.IsCached an accurate resident view during the provisional window.
There was a problem hiding this comment.
Fixed the comment in 659f9a1e0: it now states the method satisfies the CachePeeker contract for callers holding a *MemfdIdentitySource directly (the block-level tests), and explicitly notes that the wrapping *build.localDiff does not promote it, so build.File.IsCached does not reach it today. No behavior change.
…erving Add DedupedMemfdCache.ServeMemfd (identity device-offset reads guarded by the existing memfd RWMutex) and MemfdIdentitySource, a local diff source that serves dirty pages from the still-mapped memfd while dedup runs. This is what a provisional (pre-dedup) memfile header reads from, so a resume overlapping the pause can serve dirty pages immediately instead of waiting for the deduped header. Publish the memfd at construction and free it exactly once under the write lock (releaseMemfd) so serving can't race the drain's memfd close. Composes with the in-flight drain serving rather than replacing it. Not yet wired into the pause/resume path. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Assert MemfdIdentitySource serves dirty pages from the memfd at identity offsets while mapped, and reports BytesNotAvailableError once the memfd is released. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
678b863 to
12b5446
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 12b544669f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // Memfile() doesn't block on dedup; fall back to the deduped header future. | ||
| localMemfileHeader := memfileHeader | ||
| if provisionalMemfileHeader != nil { | ||
| localMemfileHeader = resolvedHeader(provisionalMemfileHeader) |
There was a problem hiding this comment.
Avoid persisting provisional headers as parents
When this provisional header is installed as the cached template header, a sandbox resumed during the compare window can be paused again before the swap goroutine replaces it. That second pause uses the provisional memfile header as its parent, so its uploaded header can inherit mappings to the synthetic provisional build ID; there is no upload future or storage object for that ID, leaving a persisted snapshot that cold or cross-node resumes cannot read.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Real durability bug — fixed in babcac37e. The provisional header must never become a durable parent: if a resume lands in the window and is paused again before the swap, a naive pause would parent off the live (provisional) header and upload a header inheriting the synthetic B_prov mappings (no storage object) → unreadable on a cold/cross-node resume. Fix: build.File now carries the deduped header as a durable-header future, and processMemorySnapshot resolves the pause parent via DurableHeader (waiting for the deduped header if a swap is still pending) instead of the live header. Serving still uses the fast provisional header; only the pause parent is forced durable. Normal pauses (no pending swap) resolve immediately. Covered by TestFile_DurableHeaderPrefersDurableFuture (f3a3e9711).
| if provisionalSwapDone != nil { | ||
| provisionalSwapDone() |
There was a problem hiding this comment.
Keep the memfd until pre-swap readers finish
For a resume overlapping the header swap, build.File can plan a segment from the provisional header and keep provisionalMemfileDiff, then this callback can fire before that segment calls ReadAt. MarkSwapped lets runDedup release the memfd, so the later MemfdIdentitySource.ReadAt returns BytesNotAvailableError even though the read was routed by a valid pre-swap header; the RWMutex only protects readers already inside ServeMemfd, not readers between planning and serving.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Correct that a read planned on the provisional header can still serve after the release — the memfd-hold closes the [drain, swap] gap but not the [plan, serve] gap that straddles the release instant. In practice this is self-healing rather than a failure: build.File returns the BytesNotAvailableError, the UFFD handler retries ReadAt (up to 4×), and the retry re-plans on the now-deduped header (post-swap) → routes to the real build id → served from the drained cache. So the worst case is one retry of added latency on a very narrow window, not a failed read. I opted not to add plan-tracking for it (that would need reader-generation accounting); if we see it in the load-snapshot tail after enabling the flag I will add a memfd→drained-cache fallback in MemfdIdentitySource. Flagged this explicitly in the PR notes.
12b5446 to
f3a3e97
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f3a3e97118
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| return | ||
| } | ||
| mem.SwapHeader(deduped) |
There was a problem hiding this comment.
Guard against overwriting finalized headers
If this swap goroutine is delayed until after Upload.publish has already swapped in the finalized header, this line can replace it with the older deduped header from memfileHeader.Wait() (still IncompletePendingUpload and missing the self Builds entry). In that ordering, later same-node child uploads or reads after the local diff is evicted observe an incomplete cached header even though the upload finished, which can make ancestor Wait return stale BuildData or make compressed self-diff resolution try the wrong uncompressed object path.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Confirmed and fixed in 4f2387167. Upload.publish does swap a finalized header into the same cached device unconditionally, so a late run of this goroutine could revert it to the older, still-IncompletePendingUpload deduped header. The swap is now a compare-and-swap from the still-provisional header (build.File.SwapHeaderIfCurrent(provisionalMemfileHeader, deduped)): once any newer header is installed (finalized by publish, or the P2P poll path) it becomes a no-op, so it can never clobber it. publish stays unconditional and authoritative. Covered by TestFile_SwapHeaderIfCurrent.
f3a3e97 to
0ab57e9
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0ab57e93f5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| provisionalSwapDone() | ||
| } | ||
| if provisionalMemfileDiff != nil { | ||
| c.buildStore.Delete(provisionalMemfileDiff.CacheKey()) |
There was a problem hiding this comment.
Keep the provisional diff until old-header readers drain
When a same-node resume has already loaded the provisional header but is between resolving a dirty-page mapping and calling cachedBuild, this deletion can remove the only local source for the synthetic provisional build ID. That reader then misses the DiffStore, falls through to createDiff, and tries to open object storage for a build ID that is intentionally never uploaded, so an otherwise valid in-flight resume can fail during the header-swap race instead of reading from the memfd.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Right — fixed in d2861e4e3 by removing the Delete. It was added last round together with FileSize → 0, but FileSize → 0 already handles the real concern (eviction accounting). Deleting the entry at swap time races in-flight readers exactly as you describe: a reader that planned on the provisional header but has not hit the store yet would miss → createDiff → a storage fetch for a build id that was never uploaded. Leaving the entry is harmless — no reads route to it after the swap, it reports FileSize 0 so it never skews disk eviction, and the store TTL reclaims it (its Close is a no-op). So the accounting fix stands and the delete race is gone.
0ab57e9 to
0fddce2
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0fddce2d39
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if provisionalMemfileHeader != nil { | ||
| localMemfileHeader = resolvedHeader(provisionalMemfileHeader) | ||
| durableMemfileHeader = memfileHeader |
There was a problem hiding this comment.
Do not expose provisional headers through P2P
When peer-to-peer chunk transfer is enabled, a cross-node resume during the provisional window fetches memfile.header from the origin via peerserver.headerSource.Stream, which serializes device.Header(); because this publishes the local template with the provisional header, the remote node can receive mappings to the synthetic provisional build ID, but the peer registry and object storage only know the real build ID, so subsequent memfile reads for the synthetic ID fail instead of waiting for the durable deduped header.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Real, and fixed in 37aa06d59. Confirmed: peerserver.headerSource.Stream serializes device.Header() (header.go:35), wired to the memfile via resolve.go, so during the provisional window it would ship B_prov mappings to a peer that can't resolve them. The P2P header source now resolves DurableHeader(ctx) too (same guard as the pause parent): a cross-node resume in the window receives the deduped mappings — waiting for them if a swap is still pending — rather than the synthetic provisional build id. The served header stays marked IncompletePendingUpload, so the peer still polls for the finalized version. Covered by TestHeaderSource_Stream_ServesDurableHeader. This was the third B_prov exposure beyond reads and the pause parent; double-gated in practice (needs the flag and P2P both on).
0fddce2 to
4b5b520
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4b5b520bc5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| localMemfileHeader := memfileHeader | ||
| var durableMemfileHeader *utils.SetOnce[*header.Header] | ||
| if provisionalMemfileHeader != nil { | ||
| localMemfileHeader = resolvedHeader(provisionalMemfileHeader) |
There was a problem hiding this comment.
Keep provisional IDs out of scheduling metadata
When memfd-dedup-inflight-serve is enabled and a resume hits this cached template before the swap, this publishes a live memfile header whose mappings contain the synthetic provisional build ID. storageTemplate.SchedulingMetadata still reads memfile.Header(), so SandboxCreateResponse.scheduling_metadata.memfile_build_ids can include that local-only ID; it is never uploaded or registered, so downstream placement can reason about a nonexistent layer. Please have scheduling metadata use the durable header as the pause and P2P paths do.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in eab2f2bc0. storageTemplate.SchedulingMetadata now resolves the memfile header via DurableHeader(ctx) (same guard as the pause and P2P paths), so memfile_build_ids reports the deduped build ids, never the synthetic provisional one; on a DurableHeader error it falls back to a nil memfile header (rootfs-only metadata) rather than B_prov. Covered by TestStorageTemplate_SchedulingMetadataUsesDurableHeader.
While here I audited the remaining header readers to be sure the B_prov blast radius is fully closed: the four build-id-bearing consumers are now all durable (read routing — intentionally provisional then swapped — plus pause parent, P2P header, and scheduling metadata). Size/BlockSize/UFFD read only header metadata dimensions (identical for both headers), and the upload-side Wait header lookup routes to the remote-poll branch because the provisional header is marked IncompletePendingUpload (set by newDiffHeader), so it never returns B_prov. So this should be the last of these.
There was a problem hiding this comment.
Correction after an e2e rerun on a dev node: resolving the durable header here had to be made non-blocking. SchedulingMetadata runs synchronously on the create/resume response path, and the blocking DurableHeader(ctx) I first used waited for the deduped header — during an in-flight dedup that is the whole window, so a same-node resume still blocked ~15s even though the provisional fast-path was fully engaged (the block had just moved here). Fixed in 50c47c2df by adding a non-blocking DurableHeaderNow() and switching SchedulingMetadata to it: while the deduped header is still pending it reports rootfs-only metadata (never B_prov, never blocking); with no swap pending it returns the live/deduped header as before. The pause-parent and P2P paths keep the blocking DurableHeader (those must wait). After the fix the race resume dropped from ~14.9s to ~0.5s. Covered by TestFile_DurableHeaderNow and TestStorageTemplate_SchedulingMetadataSkipsPendingMemfile.
4d2ec29 to
ff968f2
Compare
A resume that overlaps an in-flight pause of the same sandbox can block in the storage-template-memfile span: the local memfile template can't be built until the deduped header resolves, which only happens after dedup's compare+defrag. The in-flight memfd serving added earlier removes the drain wait but not this header wait, because it is upstream of the diff source. At pause, build a provisional local header (identity storage offsets, dirty pages attributed to a fresh provisional build id) from the dirty bitmap and the parent header, and register a memfd-backed diff source for that build id. The local template is built from the provisional header immediately, so a concurrent resume serves dirty pages straight from the still-mapped memfd instead of blocking. Once the deduped header resolves it is swapped in (SwapHeader), routing dirty pages to the compacted deduped diff; the distinct build id makes the swap race-free since build.File resolves the source from the same header snapshot as the offset. The swap is conditional (compare-and-swap from the still-provisional header) so a late swap can't clobber a newer header the upload may have finalized first. The provisional header is local-only: the upload continues to use the deduped header, so the persisted artifact is unchanged. A pause must never parent a snapshot off the provisional header: it maps dirty pages to a synthetic build id with no storage object, so an uploaded header that inherited those mappings (from a resume paused again before the swap) would be unreadable on a cold or cross-node resume. build.File carries the deduped header as a durable-header future, and processMemorySnapshot resolves the parent via DurableHeader (waiting for the deduped header if a swap is still pending) rather than the live, possibly-provisional header. The peer-to-peer header source serves the durable header for the same reason, so a cross-node resume in the window receives the deduped mappings (waiting if a swap is pending) instead of the synthetic provisional build id it cannot resolve. Scheduling metadata (storageTemplate.SchedulingMetadata) reads the durable header too, but NON-BLOCKING (DurableHeaderNow): it runs on the create/resume response path, so while the deduped header is still pending it reports rootfs-only metadata rather than block the response for the whole dedup — and never carries the provisional build id. The dedup goroutine holds the memfd until the swap, bounded by a grace timeout; the swap goroutine now signals on every exit (success or error) so the grace is only a last-resort backstop, and a counter (orchestrator.memfd.swap_grace_elapsed) surfaces if it ever fires. The main memfile diff is pinned in the build store for the provisional window: it shares its DedupedMemfdCache (and memfd) with the provisional source, so disk-pressure eviction of the LRU-cold main entry would otherwise Close it and tear the memfd down mid-serve; it is unpinned after the swap. Gated by memfd-dedup-inflight-serve (default off); composes with the in-flight drain serving (provisional covers the compare window, in-flight the drain). Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the MemfdIdentitySource test to exercise Slice (identity read while mapped, and BytesNotAvailableError after release), and add a test that buildProvisionalMemfile declines (returns nil, nil, nil) for a non-memfd-dedup source or when disabled, so the caller falls back to the deduped header. Also cover build.File.DurableHeader: it returns the live header when no durable future is set and otherwise waits for the deduped future, which is what keeps a pause from parenting off a provisional header; that a pinned build-store entry is skipped by disk-pressure eviction; and SwapHeaderIfCurrent, which swaps only when the current header still matches so a late provisional swap can't clobber a header the upload already finalized. Cover the peer-to-peer header source too: it serves the durable (deduped) header, so a peer never receives provisional synthetic-build-id mappings, and that storageTemplate.SchedulingMetadata reads the durable header non-blockingly (DurableHeaderNow) so memfile_build_ids exclude the provisional build id and never block the create/resume response — reporting rootfs-only metadata while the deduped header is still pending. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ff968f2 to
31ee7fc
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 31ee7fc. Configure here.
| swapGraceElapsedCounter.Add(ctx, 1) | ||
| logger.L().Warn(ctx, "memfd swap grace elapsed; releasing memfd without a swap signal") | ||
| } | ||
| } |
There was a problem hiding this comment.
Swap grace without signal
Medium Severity
With memfd-dedup-inflight-serve on, runDedup always waits on swapped (or the 30s grace) when inflight is true, but MarkSwapped is only invoked from the AddSnapshot swap goroutine when a provisional header was built. If buildProvisionalMemfile falls back (nil provisional), nothing signals swapped, so every pause can hold a guest-sized memfd mapping for the full grace after drain—slower than the flag-off path and with no resume benefit.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 31ee7fc. Configure here.
| // the deduped header has been swapped in, so it doesn't linger for the TTL. | ||
| func (s *DiffStore) Delete(key DiffStoreKey) { | ||
| s.cache.Delete(key) | ||
| } |
There was a problem hiding this comment.
Unused DiffStore Delete
Low Severity
DiffStore.Delete is added with a comment that it drops the provisional memfile entry after header swap, but nothing in this change calls it—the swap path intentionally leaves the entry to TTL to avoid store races. The method is dead and the comment describes behavior that was removed.
Reviewed by Cursor Bugbot for commit 31ee7fc. Configure here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 31ee7fcaae
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // window (seconds) and eviction is LRU-by-TTL, so this freshly-added | ||
| // entry is evicted last. We rely on that rather than guarding the miss: | ||
| // the eviction can't realistically occur, so a fallback isn't warranted. | ||
| c.buildStore.Add(provisionalMemfileDiff) |
There was a problem hiding this comment.
Pin provisional memfile diffs until swap
When disk-pressure eviction runs during a long memfd compare, this synthetic provisional diff is the oldest evictable entry after the real memfile diff gets pinned below (the rootfs diff is added later). DiffStore.deleteOldestFromCache can therefore schedule and delete it after the 60s delay while the live provisional header still maps dirty pages to its synthetic build ID; a resume that reaches the store after that miss falls through to object storage, where that build ID is never uploaded. Pin this provisional key until the header swap, or otherwise make the miss retry against the deduped header.
Useful? React with 👍 / 👎.


feat(orch): decouple warm resume from memfile dedup
What & why
A warm resume that overlaps an in-flight pause of the same sandbox is delayed by
roughly the pause's memfile-dedup duration. The deduped diff is the only readable diff
source and isn't readable until dedup finishes, so a resume that lands on the origin
node while the pause is still deduping blocks — either in
storage-template-memfile(waiting for the diff header, which is produced after the dedup compare) or in
load-snapshot(UFFD memory serving waiting on the dedup drain). The data a resumeactually needs (
base ⊕ dirty pages) is available the instant the pause captures thesnapshot; only dedup's compaction for upload is slow.
This PR lets a same-node resume serve dirty pages straight from the still-mapped
memfd while dedup runs in the background, so the resume no longer waits for dedup.
Dedup still runs to completion and the uploaded artifact is unchanged.
How it works
The pause's dedup has two phases whose products land at different times: the header
(after compare) and the compacted data (after the drain). The fix covers both
windows with one mechanism, gated by one flag:
the dirty bitmap + parent header (
header.ToProvisionalDiffHeader), attributing dirtypages to a fresh provisional build id at identity (device) offsets, and register a
memfd-backed diff source (
block.MemfdIdentitySource) under that id. The localtemplate is built from this header synchronously, so a concurrent resume gets a
serviceable memfile immediately and reads dirty pages from the memfd — no wait on the
header.
compare),
SwapHeaderpoints reads at the real build id, whoseDedupedMemfdCacheserves the drain window from the memfd until the compacted diff is ready, then from
the drained cache.
build.Fileresolves an offset to(buildId, storageOff)from a single header snapshot, so the build id is the offset-regime tag: a read
planned on the provisional header resolves to the memfd source (identity offsets); one
planned on the deduped header resolves to the drained cache (packed offsets). No read
can mix regimes.
the local template has swapped off the provisional header (
MarkSwapped, signalled bythe swap goroutine after
SwapHeader), bounded by a grace timeout so a failed/absentswap can't leak the mapping. This is necessary because the swap can land after the
drain (small dirty set + fragmented parent header → fast drain, slow header build), and
releasing at drain-time would let a provisional read hit a released memfd. An RWMutex
covers any straggler read against the release. The provisional store entry is not
deleted at swap time (that would race a reader that planned on the provisional header
but hasn't hit the store yet) — it ages out via the store TTL, harmless because no reads
route to it post-swap and it reports
FileSize 0, so it never skews disk eviction. Theswap goroutine signals the release on every exit (success or error), so the grace is
only a last-resort backstop.
DedupedMemfdCache(and memfd) with the provisional source, but resume reads refresh only the provisional
store entry — leaving the main entry LRU-cold. Disk-pressure eviction of the main entry
would
Closeit, cancelling dedup and tearing the memfd down mid-serve. So the mainmemfile diff is pinned in the build store for the provisional window (disk-pressure
eviction skips it; TTL still applies) and unpinned after the swap.
build.Filemovesprovisional → deduped (this swap) → finalized (
Upload.publish). The provisional→dedupedswap is a compare-and-swap from the still-provisional header (
SwapHeaderIfCurrent),so if it runs late it becomes a no-op rather than reverting an already-finalized header.
dirty pages to a synthetic build id with no storage object, so it must never reach an
uploaded snapshot, a peer, or placement metadata. All three consumers resolve the durable
(deduped) header instead of the live one: the pause parent (
processMemorySnapshot,so a resume-then-re-pause in the window can't persist an unreadable snapshot), the
P2P header source (
peerserver.headerSource.Stream, so a cross-node peer gets thereal mappings), and scheduling metadata.
build.Filecarries the deduped header as adurable-header future; the pause and P2P paths wait for it (
DurableHeader), whilescheduling metadata — on the latency-sensitive create/resume response — reads it
non-blocking (
DurableHeaderNow), reporting rootfs-only metadata during the windowrather than blocking. Templates with no pending swap resolve all of these immediately.
Safety
provisional header + provisional build id live only in memory on the origin node and
are never persisted. So the stored artifact is byte-for-byte unaffected.
still uploaded.
buildStore.Addcall site): theprovisional source must stay in the build store for the duration of the provisional
window (the dedup compare, seconds); this holds because the store TTL (hours) dwarfs
the window and eviction is LRU-by-TTL. We deliberately don't guard the (unrealistic)
eviction-mid-window case.
Feature flag
Everything is gated behind
memfd-dedup-inflight-serve(default off) and only takeseffect on the memfd-dedup path. With the flag off, behavior is identical to today
(resume waits for dedup). Roll out gradually.
Observability
The fix collapses distinct existing spans (rather than adding a rollout metric), so
rollout is observed via:
storage-template-memfile(resume path) → ≈ 0 when the header wait is removed;load-snapshot(resume path) → tail drops when the drain wait is removed;orchestrator_sandbox_create_duration{start_type="resume"}p99/p999 tail comes down.Regression controls that must not move:
orchestrator_sandbox_snapshot_diff_ratio{kind="dedup"}(dedup effectiveness),
dedup-pages/ pause durations (dedup still runs, pause staysfast). Correlate all with the flag-enable time.
One new counter is added:
orchestrator.memfd.swap_grace_elapsed— increments when thededup goroutine releases the memfd via the grace timeout without a swap signal. Expected
≈ 0 (the swap normally signals before the drain finishes); a nonzero rate means swap
signals are being lost (e.g. pauses aborting before the swap goroutine spawns) and memfds
are held the full grace on those pauses — actionable, not fatal.
Testing
-race): the identity-offset header builder; the memfd identitysource (serve while mapped,
BytesNotAvailableErrorafter release,Slice); thein-flight
DedupedMemfdCacheserving + a concurrent drain/handover stress test; thememfd-held-until-swap ordering; the provisional-header fallback (declines for a
non-memfd-dedup source / when disabled);
DurableHeader/DurableHeaderNowandSwapHeaderIfCurrentonbuild.File; the P2P header source serving the durable header;and scheduling metadata using the durable header (and reporting rootfs-only while a swap
is pending).
test-only code, resume overlapping the in-flight pause):
Diagnostics showed the provisional fast-path fully engaged, so the block had moved:
SchedulingMetadata, on the resume response, was calling the blockingDurableHeader.Switching it to non-blocking
DurableHeaderNowbrought the race resume to ~0.5 s.forced to load the snapshot from object storage (orchestrators restarted to drop the
in-memory template cache) — the uploaded artifact restores guest memory bit-for-bit.
Not in scope / follow-ups
it doesn't get the fast serve — it waits for the deduped header. The P2P header source
now serves the durable header, so such a resume in the window gets the real mappings
(waiting for them) rather than the unresolvable provisional build id; a true fast cross-node serve
(peer-fetch/prefetch of the in-flight diff) is out of scope. The cross-node path is
covered by unit tests; a controlled cross-node e2e wasn't run — resume has origin
affinity and there's no clean lever to force a peer resume while keeping the origin up to
serve P2P, so that needs a placement-control harness (follow-up).
AddSnapshotprovisional→dedupedSwapHeaderhandoveris covered by the e2e above but has no dedicated unit test yet (needs a build-store /
storage-provider fixture).
header just before the swap can serve just after the memfd release, getting
BytesNotAvailableError. This self-heals: the UFFD handler retriesReadAtand re-planson the now-deduped header (post-swap) → drained cache → success. Worst case is one retry
of added latency on a very narrow window, not a failed read. If it shows up in the
load-snapshottail after enabling the flag, the fix is a memfd→drained-cache fallbackin
MemfdIdentitySource.errors (rare — best-effort dedup normally degrades rather than errors), the deduped
header never resolves, so the swap can't happen and the memfd is released; a resume that
was already in flight then fails its dirty-page faults. This is the same outcome as
before this PR (the build is broken either way — pre-PR it failed at
Memfile(), now itfails at fault time), and the durable-header guard still prevents any bad snapshot from
being persisted (a re-pause's
DurableHeadersurfaces the dedup error and fails cleanly).Not worth extra machinery (keeping the memfd mapped to let the resume finish would leak
it, since no swap can free it).
🤖 Generated with Claude Code