fix(collab): orphan-workspace sweep must not block the boot event loop (crash-loop hotfix) by hblanken · Pull Request #47 · unleashlive/opencode

hblanken · 2026-06-14T11:42:07Z

🔴 Hotfix — recovers the site from a crash loop

The S7 orphan-workspace sweep from #42 used synchronous rmSync(recursive) on the boot path. Each orphan is a ~1.5 GB tree on EFS (a network filesystem); deleting several synchronously blocked the event loop for minutes, so /healthz couldn't respond → ALB health check "Request timed out" → ECS killed the task mid-sweep → crash loop.

Evidence (2026-06-14 deploy): the server logged "opencode server listening on :4096" three times, and each time the task was killed ~4 min later by failed ELB health checks; exit: null (not OOM).
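To make the failure mode above concrete, here is a minimal sketch of measuring how long the event loop goes without servicing a timer while some piece of work runs. `maxTickGapMs` is a hypothetical helper, not code from this PR; synchronous work shows up as one large gap, while awaited async work lets the ticks keep firing.

```typescript
// Hypothetical helper (not from the PR): run `work` and report the longest
// gap, in ms, between consecutive ticks of a short interval timer.
export async function maxTickGapMs(
  work: () => void | Promise<void>,
  tickMs = 10,
): Promise<number> {
  let last = performance.now();
  let worst = 0;
  const ticker = setInterval(() => {
    const now = performance.now();
    worst = Math.max(worst, now - last);
    last = now;
  }, tickMs);
  try {
    await work();
    // Give the ticker a chance to record the gap left by synchronous work.
    await new Promise<void>((resolve) => setTimeout(resolve, tickMs * 2));
  } finally {
    clearInterval(ticker);
  }
  return worst;
}
```

A gap that approaches the health-check timeout means /healthz cannot be answered for that long, which matches the symptom described above.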

Fix

Async deletion — fs/promises rm/stat with await per dir; libuv does the slow EFS work off-thread, so the loop answers /healthz between deletions.
Per-boot cap (10) — a larger backlog drains over subsequent boots, not one marathon.
Deferred ~90 s (unref'd timer in serve.ts) — keeps the sweep out of the ALB startup health-check window entirely, on top of the non-blocking rewrite.
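The three points above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `sweepOrphanWorkspaces`, `isOrphan`, and `scheduleDeferredSweep` are hypothetical names, and only `ORPHAN_WORKSPACE_MAX_PER_SWEEP`, the cap of 10, and the ~90 s delay come from the PR text.

```typescript
import { readdir, rm, stat } from "node:fs/promises";
import { join } from "node:path";

// Constant name is from the commit message; the value 10 is from the PR description.
const ORPHAN_WORKSPACE_MAX_PER_SWEEP = 10;

// Hypothetical sketch: remove at most `max` orphan dirs under `root`.
// `isOrphan` stands in for whatever orphan check the real code performs.
export async function sweepOrphanWorkspaces(
  root: string,
  isOrphan: (name: string) => boolean | Promise<boolean>,
  max: number = ORPHAN_WORKSPACE_MAX_PER_SWEEP,
): Promise<string[]> {
  const removed: string[] = [];
  for (const name of await readdir(root)) {
    if (removed.length >= max) break; // leave the backlog for a later boot
    const dir = join(root, name);
    const info = await stat(dir).catch(() => null); // entry may vanish mid-sweep
    if (!info?.isDirectory() || !(await isOrphan(name))) continue;
    // Awaiting each rm hands the slow I/O to libuv's thread pool, so the
    // event loop can serve other requests while the deletion runs.
    await rm(dir, { recursive: true, force: true });
    removed.push(dir);
  }
  return removed;
}

// Hypothetical sketch: start the sweep after a delay without letting the
// pending timer keep the process alive.
export function scheduleDeferredSweep(
  run: () => Promise<unknown>,
  delayMs = 90_000,
): void {
  const timer = setTimeout(() => {
    run().catch((err) => console.error("orphan sweep failed", err));
  }, delayMs);
  timer.unref(); // an unref'd timer does not block process exit
}
```

Deleting one directory at a time, rather than with `Promise.all`, keeps the load on EFS bounded and makes the per-boot cap straightforward to enforce.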

cleanupSessionWorkspace (explicit DELETE) keeps sync rmSync — single dir, user action, not the boot path.

Recovery sequence

Now: roll the service back to the pre-#42 task-def revision (site up).
Merge this PR, run Deploy collab — new image boots clean (sweep deferred + async), /healthz green, service stable.
Resume the preview-subdomain verification.

Why CI didn't catch it

parse-smoke = syntax; the new typecheck-ratchet (#46) = types. Neither catches "synchronous I/O blocks the loop under real EFS load" — that's a runtime/perf property. Mitigation idea (follow-up): a smoke test that boots the container with a seeded orphan dir and asserts /healthz stays <1 s.
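One possible shape for the latency half of that follow-up smoke test, as a sketch only: `assertHealthStaysFast` and `probeHealthz` are hypothetical names, the default base URL assumes the container's port 4096 is reachable on localhost, and seeding the orphan dir and booting the container are left to the CI job.

```typescript
interface ProbeReport {
  samples: number;
  worstMs: number;
}

// Hypothetical helper: call `probe` repeatedly and fail as soon as one call
// takes longer than `maxLatencyMs`.
export async function assertHealthStaysFast(
  probe: () => Promise<void>,
  samples = 30,
  maxLatencyMs = 1_000,
  intervalMs = 1_000,
): Promise<ProbeReport> {
  let worstMs = 0;
  for (let i = 0; i < samples; i++) {
    const started = performance.now();
    await probe();
    const elapsed = performance.now() - started;
    worstMs = Math.max(worstMs, elapsed);
    if (elapsed > maxLatencyMs) {
      throw new Error(`probe ${i + 1} took ${elapsed.toFixed(0)} ms (limit ${maxLatencyMs} ms)`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return { samples, worstMs };
}

// Probe for the real endpoint; the port and the /healthz path are taken from
// the PR text, the 5 s abort timeout is an assumption.
export async function probeHealthz(base = "http://localhost:4096"): Promise<void> {
  const res = await fetch(`${base}/healthz`, { signal: AbortSignal.timeout(5_000) });
  if (!res.ok) throw new Error(`healthz returned ${res.status}`);
}
```

In CI this might be called as `await assertHealthStaysFast(() => probeHealthz(), 120)` so that the sampling window extends past the ~90 s deferral and covers the sweep itself.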

🤖 Generated with Claude Code

The S7 sweep shipped in #42 used synchronous rmSync(recursive) to delete orphan workspace dirs on boot. Each orphan is a ~1.5 GB tree on EFS (a network FS); deleting several synchronously blocked the event loop for minutes, so /healthz couldn't respond, the ALB health check timed out ("Request timed out"), ECS killed the task mid-sweep, and it crash-looped (observed 2026-06-14: server reached "listening on :4096" three times, each killed ~4 min later by failed ELB health checks; exit code null, not OOM).

Fixes:
- Use fs/promises rm + stat with `await` per deletion: libuv does the slow EFS work off-thread, so the loop stays free to answer /healthz between deletions.
- Cap removals at 10 per boot (ORPHAN_WORKSPACE_MAX_PER_SWEEP); a larger backlog drains over subsequent boots instead of one marathon.
- Defer the sweep ~90 s after boot (unref'd timer) in serve.ts, so it can't run during the ALB startup health-check window at all; belt-and-suspenders on top of the non-blocking rewrite.

cleanupSessionWorkspace (explicit DELETE path) keeps sync rmSync: it's a single dir on a user action, not the boot path.

This is a hotfix for the crash-loop that took the site down after the #37-#46 batch deploy; merge + Deploy collab to recover (or roll the service back to the pre-#42 task-def revision in the meantime).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hblanken merged commit d6fbda3 into collab Jun 14, 2026
2 of 3 checks passed

hblanken merged 1 commit into collab from fix/collab-orphan-sweep-nonblocking
