fix(collab): orphan-workspace sweep must not block the boot event loop (crash-loop hotfix) #47
Merged
The S7 sweep shipped in #42 used synchronous rmSync(recursive) to delete orphan workspace dirs on boot. Each orphan is a ~1.5 GB tree on EFS (a network FS); deleting several synchronously blocked the event loop for minutes, so /healthz couldn't respond, the ALB health check timed out ("Request timed out"), ECS killed the task mid-sweep, and it crash-looped (observed 2026-06-14: server reached "listening on :4096" three times, each killed ~4 min later by failed ELB health checks; exit code null, not OOM).

Fixes:
- Use fs/promises rm + stat with `await` per deletion: libuv does the slow EFS work off-thread, so the loop stays free to answer /healthz between deletions.
- Cap removals at 10 per boot (ORPHAN_WORKSPACE_MAX_PER_SWEEP); a larger backlog drains over subsequent boots instead of one marathon.
- Defer the sweep ~90 s after boot (unref'd timer) in serve.ts, so it can't run during the ALB startup health-check window at all. This is belt-and-suspenders on top of the non-blocking rewrite.

cleanupSessionWorkspace (explicit DELETE path) keeps sync rmSync: it's a single dir on a user action, not the boot path.

This is a hotfix for the crash-loop that took the site down after the #37-#46 batch deploy; merge + Deploy collab to recover (or roll the service back to the pre-#42 task-def revision in the meantime).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
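The three fixes in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names `sweepOrphanWorkspaces` and `scheduleDeferredSweep`, the `isOrphan` callback, and the error handling are all assumptions; only `ORPHAN_WORKSPACE_MAX_PER_SWEEP`, the cap of 10, and the ~90 s delay come from the description above.

```typescript
import { readdir, rm, stat } from "node:fs/promises";
import { join } from "node:path";

// Constant name and value (10) are taken from the commit message.
const ORPHAN_WORKSPACE_MAX_PER_SWEEP = 10;

// Hypothetical signature: `isOrphan` stands in for however the real code
// decides that a workspace directory no longer belongs to a live session.
export async function sweepOrphanWorkspaces(
  root: string,
  isOrphan: (name: string) => boolean | Promise<boolean>,
  maxPerSweep: number = ORPHAN_WORKSPACE_MAX_PER_SWEEP,
): Promise<string[]> {
  const removed: string[] = [];
  let names: string[];
  try {
    names = await readdir(root);
  } catch {
    return removed; // root missing or unreadable: nothing to sweep
  }
  for (const name of names) {
    // Cap reached: leave the rest of the backlog for later boots.
    if (removed.length >= maxPerSweep) break;
    const dir = join(root, name);
    try {
      const info = await stat(dir);
      if (!info.isDirectory() || !(await isOrphan(name))) continue;
      // Awaiting each rm hands the slow I/O to libuv's thread pool and
      // returns control to the event loop until the deletion finishes.
      await rm(dir, { recursive: true, force: true });
      removed.push(dir);
    } catch (err) {
      // Assumed policy: one bad directory should not abort the sweep.
      console.warn(`orphan sweep: skipping ${dir}:`, err);
    }
  }
  return removed;
}

// Hypothetical helper for serve.ts: start the sweep after a delay and unref
// the timer so that a pending sweep never keeps the process alive.
export function scheduleDeferredSweep(
  run: () => Promise<unknown>,
  delayMs: number = 90_000,
): NodeJS.Timeout {
  const timer = setTimeout(() => {
    run().catch((err) => console.warn("orphan sweep failed:", err));
  }, delayMs);
  timer.unref();
  return timer;
}
```

With helpers like these, the boot path would call something like `scheduleDeferredSweep(() => sweepOrphanWorkspaces(root, isOrphan))` once the server is listening, where `root` and `isOrphan` are whatever the real service uses.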
🔴 Hotfix: recovers the site from a crash loop

The S7 orphan-workspace sweep from #42 used synchronous `rmSync(recursive)` on the boot path. Each orphan is a ~1.5 GB tree on EFS (a network filesystem); deleting several synchronously blocked the event loop for minutes, so `/healthz` couldn't respond → ALB health check "Request timed out" → ECS killed the task mid-sweep → crash loop.

Evidence (2026-06-14 deploy): the server reached `opencode server listening on :4096` three times, each killed ~4 min later by failed ELB health checks; `exit: null` (not OOM).

Fix
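- Non-blocking deletion: `fs/promises` `rm`/`stat` with `await` per dir; libuv does the slow EFS work off-thread, so the loop answers `/healthz` between deletions.
- Cap of 10 removals per boot (`ORPHAN_WORKSPACE_MAX_PER_SWEEP`); a larger backlog drains over subsequent boots.
- Sweep deferred ~90 s after boot (unref'd timer in `serve.ts`): keeps the sweep out of the ALB startup health-check window entirely, on top of the non-blocking rewrite.
- `cleanupSessionWorkspace` (explicit DELETE) keeps sync `rmSync`: single dir, user action, not the boot path.

Recovery sequence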
`/healthz` green, service stable.

Why CI didn't catch it
parse-smoke = syntax; the new typecheck-ratchet (#46) = types. Neither catches "synchronous I/O blocks the loop under real EFS load"; that's a runtime/perf property. Mitigation idea (follow-up): a smoke test that boots the container with a seeded orphan dir and asserts `/healthz` stays <1 s.

🤖 Generated with Claude Code
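The latency assertion in the proposed smoke test could look something like the probe below. It is only a sketch under stated assumptions: `probeHealthz` and its options are hypothetical names, it relies on the global `fetch` and `AbortSignal.timeout` available in Node 18+, and booting the container with a seeded orphan dir is left to the surrounding CI job.

```typescript
import { performance } from "node:perf_hooks";

export interface ProbeResult {
  samples: number; // number of requests issued
  worstMs: number; // slowest observed round trip
  ok: boolean; // true only if every request succeeded within the budget
}

// Hypothetical helper: poll a health endpoint for `durationMs` and fail if
// any single request errors, returns non-2xx, or exceeds `maxLatencyMs`.
export async function probeHealthz(
  url: string,
  opts: { durationMs: number; intervalMs: number; maxLatencyMs: number },
): Promise<ProbeResult> {
  const deadline = performance.now() + opts.durationMs;
  let samples = 0;
  let worstMs = 0;
  let ok = true;
  while (performance.now() < deadline) {
    const start = performance.now();
    try {
      // Abort the request once it has overrun the latency budget.
      const res = await fetch(url, { signal: AbortSignal.timeout(opts.maxLatencyMs) });
      await res.arrayBuffer(); // drain the body so the connection can be reused
      if (!res.ok) ok = false;
    } catch {
      ok = false; // timeout or connection failure
    }
    const elapsed = performance.now() - start;
    worstMs = Math.max(worstMs, elapsed);
    if (elapsed > opts.maxLatencyMs) ok = false;
    samples++;
    await new Promise((resolve) => setTimeout(resolve, opts.intervalMs));
  }
  return { samples, worstMs, ok };
}
```

A CI step could run the probe against the booted container for the whole window in which the sweep is expected to execute, with `maxLatencyMs: 1000` to match the <1 s target, and fail the job when `ok` is false.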