fix(collab): liveness heartbeat + RSS monitor + orphan-workspace sweep #42
Merged
Three observability/hygiene guardrails for the single-replica task.

S5: /healthz event-loop liveness

/healthz returned 200 as long as the HTTP listener answered, even if the event loop was wedged by a long synchronous op (giant JSON.parse on a runaway preview log, a tight loop in a plugin). Add a 5-s heartbeat stamping lastEventLoopTick; /healthz now reports event_loop + event_loop_lag_ms and returns 503 when the tick is >30 s stale, so the ALB pulls a wedged task ~1 min sooner.

S6: container RSS monitor (new cgroup-memory.ts)

opencode itself had no view of its own memory pressure (PR #34's cgroup read was preview-only). New shared util reads /sys/fs/cgroup/memory.current; startMemoryMonitor() logs a WARNING when total task RSS crosses 13 GB (leading indicator for the 16 GB ceiling) and an INFO when it recovers. Pure telemetry: it never kills anything; the preview memory cap remains the actor. Self-disables on platforms without the cgroup file (macOS dev).

(preview-launcher.ts keeps its private copy of the read for now to avoid conflicting with the in-flight preview-hardening PR; consolidation is a deferred cleanup, noted in the util.)

S7: orphan-workspace sweep

Explicit DELETE already wipes the workspace synchronously, so the original "soft-deleted dirs pile up" premise was wrong. The real gap is DRIFT: an rmSync that threw on an EFS hiccup, a task killed between soft-delete and cleanup, or a manual DB edit each leaves a ~1.5 GB frontend workspace orphaned on EFS forever. cleanupOrphanWorkspaces() (boot, fire-and-forget) removes workspace-root dirs with no live session row whose mtime is older than a 24 h safety floor (a session inserts its row before cloning, so a live dir always has a live row; the floor just rules out any boot-time race with an in-progress init).

All three are wired into serve.ts boot as fire-and-forget blocks beside the existing hook-sweep / preview-resume hooks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
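For reviewers, the S5 check described above can be sketched roughly as follows. This is an illustration, not the code in server/server.ts: only `lastEventLoopTick`, `event_loop` and `event_loop_lag_ms` come from the description. The constant names, the helper functions, the `"stalled"` value and the exact definition of lag are assumptions.

```typescript
// Sketch only. HEARTBEAT_INTERVAL_MS, STALE_THRESHOLD_MS, startHeartbeat,
// evaluateHealth, healthz and the "stalled" value are hypothetical names.
const HEARTBEAT_INTERVAL_MS = 5_000; // 5-s heartbeat
const STALE_THRESHOLD_MS = 30_000; // report 503 once the tick is >30 s old

let lastEventLoopTick = Date.now();

// Stamp the tick from a timer: while the loop is blocked the timer cannot
// fire, so the stamp goes stale.
function startHeartbeat() {
  const timer = setInterval(() => {
    lastEventLoopTick = Date.now();
  }, HEARTBEAT_INTERVAL_MS);
  timer.unref(); // do not keep the process alive just for the heartbeat
  return timer;
}

interface HealthReport {
  status: 200 | 503;
  body: { event_loop: "ok" | "stalled"; event_loop_lag_ms: number };
}

// Pure, so it can be checked without a server. Lag is assumed to mean
// "time since the last tick beyond one expected heartbeat interval".
function evaluateHealth(lastTick: number, now: number): HealthReport {
  const sinceTick = now - lastTick;
  const stalled = sinceTick > STALE_THRESHOLD_MS;
  return {
    status: stalled ? 503 : 200,
    body: {
      event_loop: stalled ? "stalled" : "ok",
      event_loop_lag_ms: Math.max(0, sinceTick - HEARTBEAT_INTERVAL_MS),
    },
  };
}

// What a /healthz handler would call on each request.
function healthz(): HealthReport {
  return evaluateHealth(lastEventLoopTick, Date.now());
}
```

Keeping the decision in a pure function separates the threshold logic from the HTTP layer, so the 30 s boundary can be tested with fixed timestamps.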
hblanken added a commit that referenced this pull request on Jun 14, 2026
#47)

The S7 sweep shipped in #42 used synchronous rmSync(recursive) to delete orphan workspace dirs on boot. Each orphan is a ~1.5 GB tree on EFS (a network FS); deleting several synchronously blocked the event loop for minutes, so /healthz couldn't respond, the ALB health check timed out ("Request timed out"), ECS killed the task mid-sweep, and it crash-looped (observed 2026-06-14: server reached "listening on :4096" three times, each killed ~4 min later by failed ELB health checks; exit code null, not OOM).

Fixes:
- Use fs/promises rm + stat with `await` per deletion: libuv does the slow EFS work off-thread, so the loop stays free to answer /healthz between deletions.
- Cap removals at 10 per boot (ORPHAN_WORKSPACE_MAX_PER_SWEEP); a larger backlog drains over subsequent boots instead of one marathon.
- Defer the sweep ~90 s after boot (unref'd timer) in serve.ts, so it can't run during the ALB startup health-check window at all. This is belt-and-suspenders on top of the non-blocking rewrite.

cleanupSessionWorkspace (explicit DELETE path) keeps sync rmSync: it's a single dir on a user action, not the boot path.

This is a hotfix for the crash-loop that took the site down after the #37-#46 batch deploy; merge + Deploy collab to recover (or roll the service back to the pre-#42 task-def revision in the meantime).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
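The three fixes listed above combine roughly as in the sketch below. It is not the code in collab/workspace.ts or serve.ts. `ORPHAN_WORKSPACE_MAX_PER_SWEEP` and its value of 10 come from the commit message; `isOrphan`, `sweepOrphans`, `scheduleSweep` and the other constants are hypothetical, and the sketch assumes each workspace directory is named after its session id.

```typescript
import { readdir, rm, stat } from "node:fs/promises";
import { join } from "node:path";

// Sketch only: everything except ORPHAN_WORKSPACE_MAX_PER_SWEEP is a
// hypothetical name, and "directory name == session id" is an assumption.
const ORPHAN_WORKSPACE_MAX_PER_SWEEP = 10;
const SAFETY_FLOOR_MS = 24 * 60 * 60 * 1000; // 24 h safety floor
const SWEEP_DELAY_MS = 90_000; // ~90 s after boot

// A directory is an orphan when no live session owns it and its mtime is
// older than the safety floor.
function isOrphan(name: string, mtimeMs: number, liveIds: ReadonlySet<string>, now: number): boolean {
  return !liveIds.has(name) && now - mtimeMs > SAFETY_FLOOR_MS;
}

// Each stat/rm is awaited, so control returns to the event loop between
// filesystem calls and other handlers can run while deletions proceed.
async function sweepOrphans(root: string, liveIds: ReadonlySet<string>, now = Date.now()): Promise<string[]> {
  const removed: string[] = [];
  for (const entry of await readdir(root, { withFileTypes: true })) {
    if (removed.length >= ORPHAN_WORKSPACE_MAX_PER_SWEEP) break; // rest waits for a later boot
    if (!entry.isDirectory()) continue;
    const dir = join(root, entry.name);
    try {
      const { mtimeMs } = await stat(dir);
      if (!isOrphan(entry.name, mtimeMs, liveIds, now)) continue;
      await rm(dir, { recursive: true, force: true });
      removed.push(entry.name);
    } catch (err) {
      // One failed directory should not abort the whole sweep.
      console.warn(`orphan sweep: skipping ${dir}`, err);
    }
  }
  return removed;
}

// Defer the sweep past the startup health-check window. The timer is
// unref'd so it never keeps the process alive on its own.
function scheduleSweep(run: () => Promise<unknown>, delayMs = SWEEP_DELAY_MS) {
  const timer = setTimeout(() => {
    run().catch((err) => console.error("orphan sweep failed", err));
  }, delayMs);
  timer.unref();
  return timer;
}
```

A boot hook might call `scheduleSweep(() => sweepOrphans(workspaceRoot, liveSessionIds))`, where `workspaceRoot` and `liveSessionIds` stand in for whatever the real boot code supplies.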
Stability hardening pass (PR 3 of 5). Observability + hygiene for the single-replica task.

S5: /healthz event-loop liveness

`/healthz` returned 200 as long as the HTTP listener answered, even if the event loop was wedged by a long synchronous op (giant `JSON.parse` on a runaway preview log, a tight loop in a plugin). Add a 5-s heartbeat stamping `lastEventLoopTick`; `/healthz` now reports `event_loop` + `event_loop_lag_ms` and returns 503 when the tick is >30 s stale, so the ALB pulls a wedged task ~1 min sooner than waiting for request timeout.

S6: container RSS monitor (new `cgroup-memory.ts`)

opencode itself had no view of its own memory pressure (PR #34's cgroup read was preview-only). New shared util reads `/sys/fs/cgroup/memory.current`; `startMemoryMonitor()` logs a WARNING when total task RSS crosses 13 GB (leading indicator for the 16 GB ceiling) and INFO when it recovers. Pure telemetry: it never kills anything; the preview memory cap stays the actor. Self-disables on non-Linux dev.

S7: orphan-workspace sweep

The original premise was wrong: explicit `DELETE` already wipes the workspace synchronously (`router.ts:1293`), so soft-deleted dirs don't pile up in the normal path. The real gap is drift: an `rmSync` that threw on an EFS hiccup, a task killed between soft-delete and cleanup, or a manual DB edit each leaves a ~1.5 GB frontend workspace orphaned on EFS forever. `cleanupOrphanWorkspaces()` (boot, fire-and-forget) removes workspace-root dirs with no live session row whose mtime is older than a 24 h safety floor. A session inserts its DB row before cloning, so a live dir always has a live row; the floor just rules out any boot-time race with an in-progress init.

Files

- `server/server.ts`: event-loop heartbeat + `/healthz` stall check
- `collab/cgroup-memory.ts` (new): shared RSS reader + `startMemoryMonitor()`
- `collab/workspace.ts`: `cleanupOrphanWorkspaces()`
- `cli/cmd/serve.ts`: wire S6 + S7 into boot

Test plan

- `curl /healthz` → now includes `event_loop: ok` + `event_loop_lag_ms` (small number)
- `[collab.memory]` line (monitor started, or disabled-on-non-Linux locally)
- `reclaimed N`, or silent when none

🤖 Generated with Claude Code
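As a rough illustration of the S6 monitor, the sketch below logs only on threshold crossings and disables itself when the cgroup file cannot be read. It is not the contents of collab/cgroup-memory.ts. The path and the 13 GB figure come from the description; the function names, the polling interval, the log wording and reading "13 GB" as GiB are assumptions. Note that cgroup v2 `memory.current` counts all memory charged to the cgroup, including page cache, so it is an upper bound on RSS.

```typescript
import { readFile } from "node:fs/promises";

// Sketch only: function names, interval and log text are hypothetical.
const CGROUP_MEMORY_CURRENT = "/sys/fs/cgroup/memory.current";
const WARN_BYTES = 13 * 1024 ** 3; // "13 GB" read as GiB (assumption)

// memory.current holds a single decimal byte count; anything else is rejected.
function parseMemoryCurrent(raw: string): number | null {
  const text = raw.trim();
  return /^\d+$/.test(text) ? Number(text) : null;
}

type Transition = "warn" | "recovered" | null;

// Report only threshold crossings, so steady state produces no log lines.
function nextTransition(wasAbove: boolean, bytes: number, threshold = WARN_BYTES): Transition {
  const isAbove = bytes >= threshold;
  if (isAbove === wasAbove) return null;
  return isAbove ? "warn" : "recovered";
}

// Returns null when the file is missing or unreadable (e.g. non-Linux dev).
async function readCgroupMemory(path = CGROUP_MEMORY_CURRENT): Promise<number | null> {
  try {
    return parseMemoryCurrent(await readFile(path, "utf8"));
  } catch {
    return null;
  }
}

// Telemetry only: logs crossings and never kills anything.
async function startMonitorSketch(intervalMs = 30_000) {
  if ((await readCgroupMemory()) === null) {
    console.info("memory monitor disabled: cgroup file not readable");
    return null;
  }
  let above = false;
  const timer = setInterval(async () => {
    const bytes = await readCgroupMemory();
    if (bytes === null) return;
    const transition = nextTransition(above, bytes);
    if (transition === "warn") console.warn(`memory above threshold: ${bytes} bytes`);
    if (transition === "recovered") console.info(`memory back under threshold: ${bytes} bytes`);
    if (transition !== null) above = transition === "warn";
  }, intervalMs);
  timer.unref(); // never keep the process alive for telemetry
  return timer;
}
```

Tracking the previous state rather than logging on every sample keeps the WARNING and INFO lines paired, which matches the "crosses" and "recovers" wording in the description.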