fix(collab): orphan-workspace sweep must not block the boot event loop (crash-loop hotfix) #47

Merged: hblanken merged 1 commit into collab from fix/collab-orphan-sweep-nonblocking on Jun 14, 2026

@hblanken

🔴 Hotfix — recovers the site from a crash loop

The S7 orphan-workspace sweep from #42 used synchronous rmSync(recursive) on the boot path. Each orphan is a ~1.5 GB tree on EFS (a network filesystem); deleting several synchronously blocked the event loop for minutes, so /healthz couldn't respond → ALB health check "Request timed out" → ECS killed the task mid-sweep → crash loop.

Evidence (2026-06-14 deploy): the server logged "opencode server listening on :4096" three times, and each time it was killed ~4 min later by failed ELB health checks; exit: null (not OOM).

Fix

  • Async deletion: fs/promises rm/stat with await per dir; libuv does the slow EFS work off-thread, so the loop answers /healthz between deletions.
  • Per-boot cap (10): a larger backlog drains over subsequent boots, not one marathon.
  • Deferred ~90 s (unref'd timer in serve.ts): keeps the sweep out of the ALB startup health-check window entirely, on top of the non-blocking rewrite.

cleanupSessionWorkspace (explicit DELETE) keeps sync rmSync — single dir, user action, not the boot path.
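
A minimal sketch of how the three changes in this section could fit together. It is not the code from this PR. `ORPHAN_WORKSPACE_MAX_PER_SWEEP` is the constant named in the commit message; `sweepOrphanWorkspaces`, `scheduleOrphanSweep`, the `isOrphan` callback, and the flat "one directory per workspace" layout are hypothetical and only there for illustration.

```typescript
import { readdir, rm, stat } from "node:fs/promises";
import { join } from "node:path";

// Constant name from the commit message; 10 is the per-boot cap described above.
const ORPHAN_WORKSPACE_MAX_PER_SWEEP = 10;

// Hypothetical: how the real code decides that a workspace is orphaned is not shown in the PR.
type IsOrphan = (dirName: string) => boolean | Promise<boolean>;

// Deletes at most `maxPerSweep` orphan directories under `root` and returns the removed paths.
async function sweepOrphanWorkspaces(
  root: string,
  isOrphan: IsOrphan,
  maxPerSweep: number = ORPHAN_WORKSPACE_MAX_PER_SWEEP,
): Promise<string[]> {
  const removed: string[] = [];
  for (const name of await readdir(root)) {
    if (removed.length >= maxPerSweep) break; // leave the rest for a later boot
    const full = join(root, name);
    const info = await stat(full).catch(() => null); // the entry may vanish mid-sweep
    if (!info?.isDirectory() || !(await isOrphan(name))) continue;
    // Each await hands control back to the event loop, so other requests
    // (such as /healthz) can be served while the filesystem work is in flight.
    await rm(full, { recursive: true, force: true });
    removed.push(full);
  }
  return removed;
}

// Starts the sweep after `delayMs` (about 90 s by default, per the description above).
// unref() means the pending timer alone does not keep the process alive.
function scheduleOrphanSweep(
  root: string,
  isOrphan: IsOrphan,
  delayMs: number = 90_000,
): ReturnType<typeof setTimeout> {
  const timer = setTimeout(() => {
    sweepOrphanWorkspaces(root, isOrphan).catch((err) =>
      console.error("orphan sweep failed", err),
    );
  }, delayMs);
  timer.unref();
  return timer;
}
```

The cap is checked before each deletion, so a backlog of 12 orphans would leave 2 for the next boot. Errors are logged rather than rethrown here on the assumption that a failed sweep should not take the server down; the real code may choose differently.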

Recovery sequence

  1. Now: roll the service back to the pre-#42 task-def revision, which brings the site back up (#42 is "fix(collab): liveness heartbeat + RSS monitor + orphan-workspace sweep").
  2. Merge this PR, run Deploy collab — new image boots clean (sweep deferred + async), /healthz green, service stable.
  3. Resume the preview-subdomain verification.

Why CI didn't catch it

parse-smoke = syntax; the new typecheck-ratchet (#46) = types. Neither catches "synchronous I/O blocks the loop under real EFS load" — that's a runtime/perf property. Mitigation idea (follow-up): a smoke test that boots the container with a seeded orphan dir and asserts /healthz stays <1 s.
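
The follow-up smoke test could be built around a small latency probe like the one below. This is a sketch under assumptions, not an existing test in the repo: `worstHealthzLatency` is a hypothetical helper, it relies on the global `fetch` available in Node 18+, and booting the container with a seeded orphan dir is left out.

```typescript
// Polls `url` roughly every `intervalMs` for `durationMs` and returns the
// slowest observed response time in milliseconds. Requests are sequential,
// so a stalled event loop on the server shows up as one very slow sample.
async function worstHealthzLatency(
  url: string,
  durationMs: number,
  intervalMs: number = 250,
): Promise<number> {
  let worst = 0;
  const deadline = Date.now() + durationMs;
  while (Date.now() < deadline) {
    const started = Date.now();
    const res = await fetch(url);
    await res.arrayBuffer(); // drain the body so the connection can be released
    if (!res.ok) throw new Error(`health check returned HTTP ${res.status}`);
    worst = Math.max(worst, Date.now() - started);
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return worst;
}
```

Against a container seeded with an orphan directory, the assertion would be that `await worstHealthzLatency("http://localhost:4096/healthz", 120_000)` stays below 1000. The port is taken from the log line quoted earlier and assumes /healthz is served on it; the 120 s window is an arbitrary choice meant to outlast the ~90 s deferral.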

🤖 Generated with Claude Code

The S7 sweep shipped in #42 used synchronous rmSync(recursive) to delete
orphan workspace dirs on boot.  Each orphan is a ~1.5 GB tree on EFS (a
network FS); deleting several synchronously blocked the event loop for
minutes, so /healthz couldn't respond, the ALB health check timed out
("Request timed out"), ECS killed the task mid-sweep, and it crash-looped
(observed 2026-06-14: server reached "listening on :4096" three times,
each killed ~4 min later by failed ELB health checks; exit code null, not
OOM).

Fixes:
- Use fs/promises rm + stat with `await` per deletion — libuv does the
  slow EFS work off-thread, so the loop stays free to answer /healthz
  between deletions.
- Cap removals at 10 per boot (ORPHAN_WORKSPACE_MAX_PER_SWEEP); a larger
  backlog drains over subsequent boots instead of one marathon.
- Defer the sweep ~90 s after boot (unref'd timer) in serve.ts, so it
  can't run during the ALB startup health-check window at all —
  belt-and-suspenders on top of the non-blocking rewrite.

cleanupSessionWorkspace (explicit DELETE path) keeps sync rmSync — it's a
single dir on a user action, not the boot path.

This is a hotfix for the crash-loop that took the site down after the
#37-#46 batch deploy; merge + Deploy collab to recover (or roll the
service back to the pre-#42 task-def revision in the meantime).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hblanken merged commit d6fbda3 into collab on Jun 14, 2026
2 of 3 checks passed