fix(collab): liveness heartbeat + RSS monitor + orphan-workspace sweep by hblanken · Pull Request #42 · unleashlive/opencode

hblanken · 2026-06-13T01:06:49Z

Stability hardening pass (PR 3 of 5). Observability + hygiene for the single-replica task.

S5 — /healthz event-loop liveness

/healthz returned 200 as long as the HTTP listener answered, even if the event loop was wedged by a long synchronous op (giant JSON.parse on a runaway preview log, a tight loop in a plugin). Add a 5-s heartbeat stamping lastEventLoopTick; /healthz now reports event_loop + event_loop_lag_ms and returns 503 when the tick is >30 s stale, so the ALB pulls a wedged task ~1 min sooner than waiting for request timeout.
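A minimal sketch of the stall check described above, not the code in server/server.ts. Only `lastEventLoopTick`, `event_loop`, and `event_loop_lag_ms` come from this description; the function names, the constants, and the `"stalled"` value are assumptions made for illustration.

```typescript
// Sketch only. lastEventLoopTick, event_loop and event_loop_lag_ms are named in
// the PR description; every other identifier here is hypothetical.
const HEARTBEAT_INTERVAL_MS = 5_000; // "5-s heartbeat"
const STALL_THRESHOLD_MS = 30_000; // 503 once the tick is more than 30 s stale

let lastEventLoopTick = Date.now();

// Stamp the tick from a timer. A wedged loop cannot run timers, so the stamp
// goes stale while a long synchronous operation holds the thread.
function startEventLoopHeartbeat(): ReturnType<typeof setInterval> {
  const timer = setInterval(() => {
    lastEventLoopTick = Date.now();
  }, HEARTBEAT_INTERVAL_MS);
  timer.unref(); // do not keep the process alive just for the heartbeat
  return timer;
}

interface HealthReport {
  status: 200 | 503;
  body: { event_loop: "ok" | "stalled"; event_loop_lag_ms: number };
}

// Pure decision function so the threshold logic can be tested without a server.
function evaluateEventLoop(lastTick: number, now: number): HealthReport {
  const lagMs = Math.max(0, now - lastTick);
  const stalled = lagMs > STALL_THRESHOLD_MS;
  return {
    status: stalled ? 503 : 200,
    body: { event_loop: stalled ? "stalled" : "ok", event_loop_lag_ms: lagMs },
  };
}

// What a /healthz handler would call on each request.
function healthz(): HealthReport {
  return evaluateEventLoop(lastEventLoopTick, Date.now());
}
```

Keeping the threshold decision in a pure function separates it from the HTTP layer, so the 30 s boundary can be unit-tested with fixed timestamps.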

S6 — container RSS monitor (new `cgroup-memory.ts`)

opencode itself had no view of its own memory pressure (PR #34's cgroup read was preview-only). New shared util reads /sys/fs/cgroup/memory.current; startMemoryMonitor() logs a WARNING when total task RSS crosses 13 GB (leading indicator for the 16 GB ceiling) and INFO when it recovers. Pure telemetry — never kills anything; the preview memory cap stays the actor. Self-disables on non-Linux dev.
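The monitor's behaviour can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of cgroup-memory.ts: the cgroup path, the 13 GB threshold, `startMemoryMonitor()`, and the `[collab.memory]` log tag come from this PR, while the polling interval, the helper names, and reading "13 GB" as 13 GiB are guesses.

```typescript
import { readFileSync } from "node:fs";

const CGROUP_MEMORY_CURRENT = "/sys/fs/cgroup/memory.current"; // path from the PR (cgroup v2)
const WARN_THRESHOLD_BYTES = 13 * 1024 ** 3; // "13 GB"; treating it as GiB is an assumption

// Returns the current usage in bytes, or null when the file is missing or
// unparsable (for example on a macOS dev machine).
function readCgroupMemoryBytes(path: string = CGROUP_MEMORY_CURRENT): number | null {
  try {
    const raw = readFileSync(path, "utf8").trim();
    if (raw === "") return null;
    const value = Number(raw);
    return Number.isFinite(value) ? value : null;
  } catch {
    return null;
  }
}

type MemoryState = "normal" | "high";
type Transition = { state: MemoryState; log: "WARNING" | "INFO" | null };

// Log only on edges: WARNING when crossing the threshold, INFO on recovery,
// nothing while the state is unchanged. Hypothetical helper.
function nextMemoryState(
  prev: MemoryState,
  bytes: number,
  threshold: number = WARN_THRESHOLD_BYTES,
): Transition {
  if (prev === "normal" && bytes >= threshold) return { state: "high", log: "WARNING" };
  if (prev === "high" && bytes < threshold) return { state: "normal", log: "INFO" };
  return { state: prev, log: null };
}

// Pure telemetry: this only logs and never kills anything.
function startMemoryMonitor(intervalMs = 30_000): ReturnType<typeof setInterval> | null {
  if (readCgroupMemoryBytes() === null) {
    console.info("[collab.memory] cgroup file not readable; monitor disabled");
    return null;
  }
  let state: MemoryState = "normal";
  const timer = setInterval(() => {
    const bytes = readCgroupMemoryBytes();
    if (bytes === null) return;
    const next = nextMemoryState(state, bytes);
    state = next.state;
    if (next.log === "WARNING") console.warn(`[collab.memory] usage high: ${bytes} bytes`);
    if (next.log === "INFO") console.info(`[collab.memory] usage recovered: ${bytes} bytes`);
  }, intervalMs);
  timer.unref(); // the monitor must not keep the process alive
  return timer;
}
```

Logging only on state transitions keeps a task that hovers above the threshold from emitting a WARNING on every poll.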

preview-launcher.ts keeps its private copy of the cgroup read for now to avoid conflicting with the in-flight preview-hardening PR (#40). Consolidating onto this shared util is a deferred cleanup, noted in the util's docstring.

S7 — orphan-workspace sweep

The original premise was wrong — explicit DELETE already wipes the workspace synchronously (router.ts:1293), so soft-deleted dirs don't pile up in the normal path. The real gap is drift: an rmSync that threw on an EFS hiccup, a task killed between soft-delete and cleanup, a manual DB edit — each leaves a ~1.5 GB frontend workspace orphaned on EFS forever.

cleanupOrphanWorkspaces() (boot, fire-and-forget) removes workspace-root dirs with no live session row whose mtime is older than a 24 h safety floor. A session inserts its DB row before cloning, so a live dir always has a live row; the floor just rules out any boot-time race with an in-progress init.
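The selection rule in the paragraph above (no live session row and mtime past the 24 h floor) can be sketched as a pure function. This is not the code in collab/workspace.ts: the types, the function name, and the assumption that a workspace directory is named after its session ID are all hypothetical.

```typescript
const ORPHAN_MIN_AGE_MS = 24 * 60 * 60 * 1000; // the 24 h safety floor

// Hypothetical shape: one entry per directory under the workspace root.
interface WorkspaceDir {
  name: string; // assumed to equal the session ID
  mtimeMs: number; // directory mtime in epoch milliseconds
}

// A directory is an orphan only if BOTH conditions hold:
//   1. no live session row references it, and
//   2. its mtime is older than the safety floor.
function selectOrphans(
  dirs: readonly WorkspaceDir[],
  liveSessionIds: ReadonlySet<string>,
  now: number,
  minAgeMs: number = ORPHAN_MIN_AGE_MS,
): string[] {
  return dirs
    .filter((d) => !liveSessionIds.has(d.name) && now - d.mtimeMs > minAgeMs)
    .map((d) => d.name);
}
```

For example, with one live session `a`, a directory `b` last modified 25 hours ago is selected, while a directory `c` modified an hour ago is left alone even though it has no row.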

Files

server/server.ts — event-loop heartbeat + /healthz stall check
collab/cgroup-memory.ts (new) — shared RSS reader + startMemoryMonitor()
collab/workspace.ts — cleanupOrphanWorkspaces()
cli/cmd/serve.ts — wire S6 + S7 into boot

Test plan

Deploy off this branch
curl /healthz → now includes event_loop: ok + event_loop_lag_ms (small number)
Boot log shows [collab.memory] line (monitor started, or disabled-on-non-Linux locally)
Boot log shows orphan-sweep result (reclaimed N or silent when none)
Create + delete a session → workspace gone immediately (unchanged); no orphan left for the sweep
Manually leave an orphan dir (mtime >24 h, no session row) → next boot reclaims it
A fresh dir (<24 h) with no row is NOT removed (init-race protection)
No false 503 on /healthz under normal load (event_loop_lag_ms stays well under 30 s)

🤖 Generated with Claude Code

Three observability/hygiene guardrails for the single-replica task.

S5 — /healthz event-loop liveness

/healthz returned 200 as long as the HTTP listener answered, even if the event loop was wedged by a long synchronous op (giant JSON.parse on a runaway preview log, a tight loop in a plugin). Add a 5-s heartbeat stamping lastEventLoopTick; /healthz now reports event_loop + event_loop_lag_ms and returns 503 when the tick is >30 s stale, so the ALB pulls a wedged task ~1 min sooner.

S6 — container RSS monitor (new cgroup-memory.ts)

opencode itself had no view of its own memory pressure (PR #34's cgroup read was preview-only). New shared util reads /sys/fs/cgroup/memory.current; startMemoryMonitor() logs a WARNING when total task RSS crosses 13 GB (leading indicator for the 16 GB ceiling) and an INFO when it recovers. Pure telemetry — never kills anything; the preview memory cap remains the actor. Self-disables on platforms without the cgroup file (macOS dev). (preview-launcher.ts keeps its private copy of the read for now to avoid conflicting with the in-flight preview-hardening PR; consolidation is a deferred cleanup, noted in the util.)

S7 — orphan-workspace sweep

Explicit DELETE already wipes the workspace synchronously, so the original "soft-deleted dirs pile up" premise was wrong. The real gap is DRIFT: an rmSync that threw on an EFS hiccup, a task killed between soft-delete and cleanup, a manual DB edit — each leaves a ~1.5 GB frontend workspace orphaned on EFS forever. cleanupOrphanWorkspaces() (boot, fire-and-forget) removes workspace-root dirs with no live session row whose mtime is older than a 24 h safety floor (a session inserts its row before cloning, so a live dir always has a live row; the floor just rules out any boot-time race with an in-progress init).

All three wired into serve.ts boot as fire-and-forget blocks beside the existing hook-sweep / preview-resume hooks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

(#47)

The S7 sweep shipped in #42 used synchronous rmSync(recursive) to delete orphan workspace dirs on boot. Each orphan is a ~1.5 GB tree on EFS (a network FS); deleting several synchronously blocked the event loop for minutes, so /healthz couldn't respond, the ALB health check timed out ("Request timed out"), ECS killed the task mid-sweep, and it crash-looped (observed 2026-06-14: server reached "listening on :4096" three times, each killed ~4 min later by failed ELB health checks; exit code null, not OOM).

Fixes:

- Use fs/promises rm + stat with `await` per deletion: libuv does the slow EFS work off-thread, so the loop stays free to answer /healthz between deletions.
- Cap removals at 10 per boot (ORPHAN_WORKSPACE_MAX_PER_SWEEP); a larger backlog drains over subsequent boots instead of one marathon.
- Defer the sweep ~90 s after boot (unref'd timer) in serve.ts, so it can't run during the ALB startup health-check window at all. This is belt-and-suspenders on top of the non-blocking rewrite.

cleanupSessionWorkspace (explicit DELETE path) keeps sync rmSync: it's a single dir on a user action, not the boot path.

This is a hotfix for the crash-loop that took the site down after the #37-#46 batch deploy; merge + Deploy collab to recover (or roll the service back to the pre-#42 task-def revision in the meantime).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
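The three fixes in the hotfix message can be sketched together as below. This is an illustration rather than the merged code: `ORPHAN_WORKSPACE_MAX_PER_SWEEP`, the cap of 10, and the ~90 s deferral come from the commit message, while `removeOrphans`, `scheduleOrphanSweep`, the injected `remove` and `findOrphans` callbacks, and the log wording are hypothetical.

```typescript
import { rm } from "node:fs/promises";

const ORPHAN_WORKSPACE_MAX_PER_SWEEP = 10; // cap named in the hotfix
const SWEEP_DELAY_MS = 90_000; // "~90 s after boot"

interface SweepResult {
  reclaimed: number;
  failed: number;
  deferred: number; // left for a later boot because of the cap
}

// Delete at most maxPerSweep paths, awaiting each removal so the event loop
// can serve other work (such as /healthz) between deletions.
async function removeOrphans(
  paths: readonly string[],
  remove: (path: string) => Promise<void> = (p) => rm(p, { recursive: true, force: true }),
  maxPerSweep: number = ORPHAN_WORKSPACE_MAX_PER_SWEEP,
): Promise<SweepResult> {
  const batch = paths.slice(0, maxPerSweep);
  let reclaimed = 0;
  let failed = 0;
  for (const path of batch) {
    try {
      await remove(path);
      reclaimed += 1;
    } catch {
      failed += 1; // one failed removal must not abort the rest of the batch
    }
  }
  return { reclaimed, failed, deferred: paths.length - batch.length };
}

// Fire-and-forget: run the sweep once, well after the startup health-check window.
function scheduleOrphanSweep(
  findOrphans: () => Promise<string[]>,
  delayMs: number = SWEEP_DELAY_MS,
): ReturnType<typeof setTimeout> {
  const timer = setTimeout(() => {
    findOrphans()
      .then((paths) => removeOrphans(paths))
      .then((r) => console.info(`orphan sweep: reclaimed ${r.reclaimed}, failed ${r.failed}, deferred ${r.deferred}`))
      .catch((err) => console.warn("orphan sweep failed", err));
  }, delayMs);
  timer.unref(); // a pending sweep must not hold the process open
  return timer;
}
```

Injecting the `remove` callback is a choice made for this sketch so the cap and error handling can be exercised without touching a real filesystem.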

hblanken mentioned this pull request Jun 14, 2026

feat(collab): serve preview at a dedicated subdomain root (retire /preview/ base-href) #45

Merged

6 tasks

hblanken merged commit 27961b8 into collab Jun 14, 2026
1 check passed

hblanken mentioned this pull request Jun 14, 2026

fix(collab): orphan-workspace sweep must not block the boot event loop (crash-loop hotfix) #47

Merged

hblanken merged 1 commit into collab from fix/collab-liveness-memory-efs
