fix(collab): liveness heartbeat + RSS monitor + orphan-workspace sweep #42

Merged
hblanken merged 1 commit into collab from fix/collab-liveness-memory-efs on Jun 14, 2026

@hblanken

Stability hardening pass (PR 3 of 5). Observability + hygiene for the single-replica task.

S5 — /healthz event-loop liveness

/healthz returned 200 as long as the HTTP listener answered, even if the event loop was wedged by a long synchronous op (giant JSON.parse on a runaway preview log, a tight loop in a plugin). Add a 5-s heartbeat stamping lastEventLoopTick; /healthz now reports event_loop + event_loop_lag_ms and returns 503 when the tick is >30 s stale, so the ALB pulls a wedged task ~1 min sooner than waiting for request timeout.
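
The heartbeat and staleness check described above could look roughly like the sketch below. This is an illustration under assumptions, not the code in server/server.ts: the function names, the `"stalled"` value, and the definition of `event_loop_lag_ms` as tick age minus the heartbeat period are all hypothetical. Only the 5 s period, the 30 s threshold, `lastEventLoopTick`, and the two field names come from the description.

```typescript
const HEARTBEAT_INTERVAL_MS = 5_000; // heartbeat period from the description
const STALE_AFTER_MS = 30_000;       // staleness threshold from the description

let lastEventLoopTick = Date.now();

// Stamp the tick on a timer. While the loop is blocked by synchronous work,
// this callback cannot run, so the stamp ages.
function startEventLoopHeartbeat(): void {
  const timer = setInterval(() => {
    lastEventLoopTick = Date.now();
  }, HEARTBEAT_INTERVAL_MS);
  timer.unref(); // do not keep the process alive just for the heartbeat
}

interface HealthReport {
  status: 200 | 503;
  body: { event_loop: "ok" | "stalled"; event_loop_lag_ms: number };
}

// Pure decision function (hypothetical name) so the threshold logic is testable.
// Lag is assumed to mean "how far past the expected 5 s period the tick is".
function evaluateEventLoop(lastTick: number, now: number): HealthReport {
  const age = now - lastTick;
  const stalled = age > STALE_AFTER_MS;
  return {
    status: stalled ? 503 : 200,
    body: {
      event_loop: stalled ? "stalled" : "ok",
      event_loop_lag_ms: Math.max(0, age - HEARTBEAT_INTERVAL_MS),
    },
  };
}

// What a /healthz handler would call on each request.
function healthz(): HealthReport {
  return evaluateEventLoop(lastEventLoopTick, Date.now());
}
```

Keeping the decision in a pure function that takes both timestamps makes the 30 s boundary easy to unit-test without waiting on real timers.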

S6 — container RSS monitor (new cgroup-memory.ts)

opencode itself had no view of its own memory pressure (PR #34's cgroup read was preview-only). New shared util reads /sys/fs/cgroup/memory.current; startMemoryMonitor() logs a WARNING when total task RSS crosses 13 GB (leading indicator for the 16 GB ceiling) and INFO when it recovers. Pure telemetry — never kills anything; the preview memory cap stays the actor. Self-disables on non-Linux dev.

preview-launcher.ts keeps its private copy of the cgroup read for now to avoid conflicting with the in-flight preview-hardening PR (#40). Consolidating onto this shared util is a deferred cleanup, noted in the util's docstring.
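
A minimal sketch of the monitor's shape, assuming edge-triggered logging (one WARNING on crossing the threshold, one INFO on recovery). The cgroup path, the 13 GB threshold, and the name `startMemoryMonitor()` come from the description. The helper names, the logger signature, the 30 s polling interval, and reading 13 GB as 13 GiB are assumptions made for illustration.

```typescript
import { readFileSync } from "node:fs";

const CGROUP_MEMORY_CURRENT = "/sys/fs/cgroup/memory.current"; // path from the description
const WARN_BYTES = 13 * 1024 ** 3; // assumption: "13 GB" read as 13 GiB

type Level = "WARNING" | "INFO";
type Log = (level: Level, message: string) => void;

// Returns the current usage in bytes, or null when the file is missing or
// unparsable (for example on a macOS dev machine).
function readCgroupMemoryBytes(path: string = CGROUP_MEMORY_CURRENT): number | null {
  try {
    const bytes = Number.parseInt(readFileSync(path, "utf8").trim(), 10);
    return Number.isFinite(bytes) ? bytes : null;
  } catch {
    return null;
  }
}

// Hypothetical helper: logs only on threshold crossings, not on every sample.
function createThresholdTracker(log: Log): (bytes: number) => void {
  let above = false;
  return (bytes: number): void => {
    const nowAbove = bytes >= WARN_BYTES;
    if (nowAbove && !above) log("WARNING", `[collab.memory] usage ${bytes} B crossed threshold`);
    if (!nowAbove && above) log("INFO", `[collab.memory] usage ${bytes} B back under threshold`);
    above = nowAbove;
  };
}

// Telemetry only: samples and logs, never kills anything.
function startMemoryMonitor(log: Log, intervalMs: number = 30_000): boolean {
  if (readCgroupMemoryBytes() === null) {
    log("INFO", "[collab.memory] disabled: cgroup file not readable");
    return false;
  }
  const observe = createThresholdTracker(log);
  const timer = setInterval(() => {
    const bytes = readCgroupMemoryBytes();
    if (bytes !== null) observe(bytes);
  }, intervalMs);
  timer.unref();
  return true;
}
```

Tracking the previous state keeps the log quiet while usage stays above the threshold, which matters for a monitor that samples every few seconds.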

S7 — orphan-workspace sweep

The original premise was wrong — explicit DELETE already wipes the workspace synchronously (router.ts:1293), so soft-deleted dirs don't pile up in the normal path. The real gap is drift: an rmSync that threw on an EFS hiccup, a task killed between soft-delete and cleanup, a manual DB edit — each leaves a ~1.5 GB frontend workspace orphaned on EFS forever.

cleanupOrphanWorkspaces() (boot, fire-and-forget) removes workspace-root dirs with no live session row whose mtime is older than a 24 h safety floor. A session inserts its DB row before cloning, so a live dir always has a live row; the floor just rules out any boot-time race with an in-progress init.
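
The selection rule (no live session row, and mtime older than the 24 h floor) can be sketched as a pure function. The assumption that a workspace directory is named after its session ID is hypothetical, as are the type and function names; the real cleanupOrphanWorkspaces() may map directories to rows differently.

```typescript
const SAFETY_FLOOR_MS = 24 * 60 * 60 * 1000; // 24 h safety floor from the description

interface WorkspaceDir {
  name: string;    // assumption: directory name equals the session ID
  mtimeMs: number; // last-modified time in epoch milliseconds
}

// A directory is an orphan only if BOTH conditions hold:
//   1. no live session row references it, and
//   2. it has been untouched for longer than the safety floor.
function selectOrphans(
  dirs: WorkspaceDir[],
  liveSessionIds: ReadonlySet<string>,
  now: number,
): string[] {
  return dirs
    .filter((dir) => !liveSessionIds.has(dir.name))
    .filter((dir) => now - dir.mtimeMs > SAFETY_FLOOR_MS)
    .map((dir) => dir.name);
}
```

Separating selection from deletion means the rule can be tested with plain data, and the deletion step can change without touching the safety logic.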

Files

  • server/server.ts — event-loop heartbeat + /healthz stall check
  • collab/cgroup-memory.ts (new) — shared RSS reader + startMemoryMonitor()
  • collab/workspace.ts — cleanupOrphanWorkspaces()
  • cli/cmd/serve.ts — wire S6 + S7 into boot

Test plan

  • Deploy off this branch
  • curl /healthz → now includes event_loop: ok + event_loop_lag_ms (small number)
  • Boot log shows [collab.memory] line (monitor started, or disabled-on-non-Linux locally)
  • Boot log shows orphan-sweep result (reclaimed N or silent when none)
  • Create + delete a session → workspace gone immediately (unchanged); no orphan left for the sweep
  • Manually leave an orphan dir (mtime >24 h, no session row) → next boot reclaims it
  • A fresh dir (<24 h) with no row is NOT removed (init-race protection)
  • No false 503 on /healthz under normal load (event_loop_lag_ms stays well under 30 s)

🤖 Generated with Claude Code

Three observability/hygiene guardrails for the single-replica task.

S5 — /healthz event-loop liveness
  /healthz returned 200 as long as the HTTP listener answered, even if
  the event loop was wedged by a long synchronous op (giant JSON.parse
  on a runaway preview log, a tight loop in a plugin).  Add a 5-s
  heartbeat stamping lastEventLoopTick; /healthz now reports
  event_loop + event_loop_lag_ms and returns 503 when the tick is
  >30 s stale, so the ALB pulls a wedged task ~1 min sooner.

S6 — container RSS monitor (new cgroup-memory.ts)
  opencode itself had no view of its own memory pressure (PR #34's
  cgroup read was preview-only).  New shared util reads
  /sys/fs/cgroup/memory.current; startMemoryMonitor() logs a WARNING
  when total task RSS crosses 13 GB (leading indicator for the 16 GB
  ceiling) and an INFO when it recovers.  Pure telemetry — never kills
  anything; the preview memory cap remains the actor.  Self-disables
  on platforms without the cgroup file (macOS dev).
  (preview-launcher.ts keeps its private copy of the read for now to
  avoid conflicting with the in-flight preview-hardening PR;
  consolidation is a deferred cleanup, noted in the util.)

S7 — orphan-workspace sweep
  Explicit DELETE already wipes the workspace synchronously, so the
  original "soft-deleted dirs pile up" premise was wrong.  The real
  gap is DRIFT: an rmSync that threw on an EFS hiccup, a task killed
  between soft-delete and cleanup, a manual DB edit — each leaves a
  ~1.5 GB frontend workspace orphaned on EFS forever.
  cleanupOrphanWorkspaces() (boot, fire-and-forget) removes
  workspace-root dirs
  with no live session row whose mtime is older than a 24 h safety
  floor (a session inserts its row before cloning, so a live dir
  always has a live row; the floor just rules out any boot-time race
  with an in-progress init).

All three wired into serve.ts boot as fire-and-forget blocks beside
the existing hook-sweep / preview-resume hooks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hblanken hblanken merged commit 27961b8 into collab Jun 14, 2026
1 check passed
hblanken added a commit that referenced this pull request Jun 14, 2026
#47)

The S7 sweep shipped in #42 used synchronous rmSync(recursive) to delete
orphan workspace dirs on boot.  Each orphan is a ~1.5 GB tree on EFS (a
network FS); deleting several synchronously blocked the event loop for
minutes, so /healthz couldn't respond, the ALB health check timed out
("Request timed out"), ECS killed the task mid-sweep, and it crash-looped
(observed 2026-06-14: server reached "listening on :4096" three times,
each killed ~4 min later by failed ELB health checks; exit code null, not
OOM).

Fixes:
- Use fs/promises rm + stat with `await` per deletion — libuv does the
  slow EFS work off-thread, so the loop stays free to answer /healthz
  between deletions.
- Cap removals at 10 per boot (ORPHAN_WORKSPACE_MAX_PER_SWEEP); a larger
  backlog drains over subsequent boots instead of one marathon.
- Defer the sweep ~90 s after boot (unref'd timer) in serve.ts, so it
  can't run during the ALB startup health-check window at all —
  belt-and-suspenders on top of the non-blocking rewrite.
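
The three fixes above can be sketched together as follows. This is an illustration under assumptions rather than the shipped code: `ORPHAN_WORKSPACE_MAX_PER_SWEEP`, the use of fs/promises `rm` and `stat`, the 24 h floor, and the ~90 s unref'd delay come from the commit messages, while the function names, the injectable `remove` and `mtimeOf` parameters, and the candidate-list input are hypothetical.

```typescript
import { rm, stat } from "node:fs/promises";

const ORPHAN_WORKSPACE_MAX_PER_SWEEP = 10;   // cap named in the commit message
const SAFETY_FLOOR_MS = 24 * 60 * 60 * 1000; // 24 h floor from #42
const SWEEP_DELAY_MS = 90_000;               // ~90 s deferral from the commit message

// Hypothetical signature: `candidates` are directories with no live session row.
// `remove` and `mtimeOf` default to fs/promises calls and are injectable for tests.
async function sweepOrphans(
  candidates: string[],
  now: number = Date.now(),
  remove: (dir: string) => Promise<void> = (dir) => rm(dir, { recursive: true, force: true }),
  mtimeOf: (dir: string) => Promise<number> = async (dir) => (await stat(dir)).mtimeMs,
): Promise<string[]> {
  const removed: string[] = [];
  for (const dir of candidates) {
    if (removed.length >= ORPHAN_WORKSPACE_MAX_PER_SWEEP) break; // rest drains on later boots
    try {
      if (now - (await mtimeOf(dir)) < SAFETY_FLOOR_MS) continue; // too fresh, may be an init in progress
      await remove(dir); // awaiting yields to the event loop between deletions
      removed.push(dir);
    } catch {
      // A failed stat or rm leaves the dir in place; a later sweep can retry.
    }
  }
  return removed;
}

// Defer the sweep so it cannot overlap the startup health-check window.
function scheduleSweep(listCandidates: () => Promise<string[]>): void {
  const timer = setTimeout(() => {
    void listCandidates()
      .then((dirs) => sweepOrphans(dirs))
      .catch(() => {}); // fire-and-forget: never crash the server over cleanup
  }, SWEEP_DELAY_MS);
  timer.unref();
}
```

Deleting one directory at a time with `await` keeps at most one slow EFS operation in flight, so each completed deletion is a point where pending /healthz requests can be served.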

cleanupSessionWorkspace (explicit DELETE path) keeps sync rmSync — it's a
single dir on a user action, not the boot path.

This is a hotfix for the crash-loop that took the site down after the
#37-#46 batch deploy; merge + Deploy collab to recover (or roll the
service back to the pre-#42 task-def revision in the meantime).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>