fix(collab): preview install-hang watchdog + crash-loop breaker + cache telemetry #40

Merged: hblanken merged 1 commit into collab from fix/collab-preview-watchdog-crashloop on Jun 14, 2026

@hblanken

Part of the stability hardening pass (PR 1 of 5). Targets the frontend's heavy install/compile path on the 4 vCPU / 16 GB task.

S1 — install-hang watchdog

The 30-min idle cap only counts request traffic, so a pnpm install wedged at minute 5 (dead registry, stuck native build, OOMing dep) held memory for 25 more minutes before anything noticed. Now: track _lastOutput per active state (reset on every stdout/stderr line); the existing 60-s sweep stops any preview still in the installing phase that has emitted nothing for 5 min. A healthy install/compile emits progress constantly, so prolonged silence is a strong wedge signal.
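
A minimal sketch of the silence check, under stated assumptions. The `PreviewState` shape and the helper names `noteOutput` and `findHungInstalls` are hypothetical; only `_lastOutput`, the `installing` phase, and the 5-minute threshold (`INSTALL_SILENCE_TIMEOUT_MS` in the commit message) come from this PR. The real sweep in preview-launcher.ts may be structured differently.

```typescript
// Hypothetical state shape for illustration only.
type Phase = "installing" | "running" | "stopped";

interface PreviewState {
  sessionId: string;
  phase: Phase;
  _lastOutput: number; // epoch ms of the most recent stdout/stderr line
}

const INSTALL_SILENCE_TIMEOUT_MS = 5 * 60 * 1000;

// Call from the child's stdout/stderr line handlers to reset the silence clock.
function noteOutput(state: PreviewState, now: number = Date.now()): void {
  state._lastOutput = now;
}

// Called from the periodic sweep: returns previews that are still installing
// and have emitted nothing for at least the silence timeout.
function findHungInstalls(
  states: Iterable<PreviewState>,
  now: number = Date.now(),
): PreviewState[] {
  const hung: PreviewState[] = [];
  for (const s of states) {
    const silentFor = now - s._lastOutput;
    if (s.phase === "installing" && silentFor >= INSTALL_SILENCE_TIMEOUT_MS) {
      hung.push(s);
    }
  }
  return hung;
}
```

Passing `now` explicitly keeps the check deterministic in tests; the sweep would stop each returned preview and record an install crash for it.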

S2 — crash-loop breaker on auto-resume

The 24-h freshness cap (PR #32) stops stale intents from looping, but a fresh intent whose workspace is deterministically broken re-crashed on every boot. New collab_session columns preview_crash_count + preview_crash_at:

  • Incremented when a preview's child exits non-zero while still installing (and when the watchdog kills a hung install)
  • Reset to 0 on a successful install→running transition
  • Boot-resume skips any session with ≥ 3 install crashes in the last hour
  • A Driver pressing Launch clears the counter (explicit human retry overrides the breaker)

Only install-phase crashes feed the counter — a crash after reaching "running" is a dev-server runtime error, a different class that shouldn't suppress resume (the workspace installed fine).
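
The counter rules above can be sketched as pure functions. This is an illustration under assumptions, not the code in session.ts: the names `recordPreviewCrash` and `clearPreviewCrashCount` appear in this PR, but their signatures here are hypothetical, as are `onChildExit`, `shouldSkipResume`, and the two constants. It also assumes `preview_crash_at` stores the time of the most recent install crash, so "3 crashes in the last hour" is approximated as "count ≥ 3 and the latest crash is under an hour old".

```typescript
interface CrashRecord {
  preview_crash_count: number;
  preview_crash_at: number | null; // epoch ms of the latest install crash (assumed)
}

const CRASH_LOOP_THRESHOLD = 3;
const CRASH_LOOP_WINDOW_MS = 60 * 60 * 1000;

// Increment on an install-phase crash (also used when the watchdog kills a hung install).
function recordPreviewCrash(rec: CrashRecord, now: number): CrashRecord {
  return { preview_crash_count: rec.preview_crash_count + 1, preview_crash_at: now };
}

// Reset on install→running, or when a Driver presses Launch.
function clearPreviewCrashCount(): CrashRecord {
  return { preview_crash_count: 0, preview_crash_at: null };
}

// Only non-zero exits during the installing phase feed the counter.
function onChildExit(
  rec: CrashRecord,
  phase: "installing" | "running",
  exitCode: number,
  now: number,
): CrashRecord {
  if (phase === "installing" && exitCode !== 0) {
    return recordPreviewCrash(rec, now);
  }
  return rec; // runtime crashes after "running" do not suppress resume
}

// Boot-resume filter: skip sessions that look like a crash loop.
function shouldSkipResume(rec: CrashRecord, now: number): boolean {
  return (
    rec.preview_crash_count >= CRASH_LOOP_THRESHOLD &&
    rec.preview_crash_at !== null &&
    now - rec.preview_crash_at <= CRASH_LOOP_WINDOW_MS
  );
}
```

Under this sketch the breaker also releases on its own once the latest crash is more than an hour old; whether the real implementation behaves that way is not stated in the PR.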

V2 — framework-cache telemetry

Logs whether <repo>/.angular/cache survived the previous container at each launch. That cache lives on EFS and should persist across deploys; when it does, second-launch compile drops from ~2 min to ~20 s. If it logs "absent" on a previously-launched session, the cache is being wiped — a regression to chase. Shallow, bounded, never throws.
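
One way to get a "shallow, bounded, never throws" probe, sketched with Node's `fs.opendirSync`. The function name `probeFrameworkCache`, the `CacheStatus` type, and the `maxEntries` cap are hypothetical; only the `<repo>/.angular/cache` path and the present/absent outcome come from this PR.

```typescript
import { opendirSync } from "node:fs";
import { join } from "node:path";

type CacheStatus =
  | { state: "present"; entries: number } // entries is capped at maxEntries
  | { state: "absent" };

// Shallow: looks only at the top level of the cache directory.
// Bounded: reads at most maxEntries directory entries.
// Never throws: any filesystem error is reported as "absent".
function probeFrameworkCache(repoDir: string, maxEntries = 50): CacheStatus {
  try {
    const dir = opendirSync(join(repoDir, ".angular", "cache"));
    try {
      let entries = 0;
      while (entries < maxEntries && dir.readSync() !== null) {
        entries++;
      }
      return { state: "present", entries };
    } finally {
      dir.closeSync();
    }
  } catch {
    return { state: "absent" };
  }
}
```

Reading through a directory handle rather than listing the whole directory keeps the cost fixed even if the cache holds many entries. Note that this sketch reports permission errors as "absent" too, which may or may not match the real telemetry.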

Schema

Two columns added via the existing PRAGMA-probe ALTER pattern in migrate.ts, plus the Drizzle table in schema.sql.ts. Legacy rows backfill to preview_crash_count = 0 (eligible for resume), so they need no special handling.
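
For readers unfamiliar with the pattern, here is a rough sketch of a PRAGMA-probe ALTER migration. The `Db` interface, the function name `addCrashColumns`, and the column types are assumptions for illustration; migrate.ts may use a different driver API and different DDL. The table and column names are the ones from this PR.

```typescript
// Hypothetical minimal driver interface so the sketch runs without a SQLite binding.
interface Db {
  all(sql: string): Array<{ name: string }>;
  run(sql: string): void;
}

// Column types are assumed; the PR only says the columns are nullable/defaulted.
const NEW_COLUMNS: ReadonlyArray<{ name: string; ddl: string }> = [
  { name: "preview_crash_count", ddl: "INTEGER NOT NULL DEFAULT 0" },
  { name: "preview_crash_at", ddl: "INTEGER" },
];

// Probe the live schema, then add only the columns that are missing.
// Safe to run on every boot: a second run finds both columns and does nothing.
function addCrashColumns(db: Db): string[] {
  const rows = db.all("PRAGMA table_info(collab_session)");
  const existing = new Set(rows.map((r) => r.name));
  const added: string[] = [];
  for (const col of NEW_COLUMNS) {
    if (existing.has(col.name)) continue;
    db.run(`ALTER TABLE collab_session ADD COLUMN ${col.name} ${col.ddl}`);
    added.push(col.name);
  }
  return added;
}
```

In SQLite, a column added with a `DEFAULT` reads as that default on existing rows, which is why legacy sessions come back with a count of 0 without a separate backfill step.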

Files

  • preview-launcher.ts: _lastOutput tracking, watchdog in sweep, crash record/clear on exit/ready, breaker filter in resume, cache telemetry
  • session.ts: recordPreviewCrash / clearPreviewCrashCount helpers, listPreviewIntents carries crash fields
  • migrate.ts + schema.sql.ts: new columns
  • router.ts: clear crash count on manual launch

Test plan

  • Deploy off this branch
  • Launch frontend preview → boot log shows framework-cache present/absent line
  • Simulate a hung install (point a session at a repo with an unreachable git dep) → after 5 min the log shows "install hung — no output for 5m", the preview stops, and the task stays healthy
  • Force 3 install crashes on one session → 4th boot logs "crash-loop breaker (…)" and skips resume; Driver Launch still works (clears counter)
  • Healthy preview reaching "running" → crash count resets (verify a later transient install failure starts from 1, not 4)
  • No regression: existing launch / stop / idle-cap / lifetime-cap / memory-cap behavior unchanged

🤖 Generated with Claude Code

Three stability guardrails for the preview lifecycle, all targeting the
frontend's heavy install/compile path on the 4 vCPU / 16 GB task.

S1 — install-hang watchdog
  The 30-min idle cap only counts request traffic, so an install wedged
  at minute 5 (dead registry, stuck native build, OOMing dep) sat holding
  memory for 25 more minutes before anything noticed.  Track _lastOutput
  per active state (reset on every stdout/stderr line); the 60-s sweep
  now stops any preview still in the `installing` phase that has emitted
  nothing for INSTALL_SILENCE_TIMEOUT_MS (5 min).  A healthy pnpm install
  / ng compile emits progress constantly, so prolonged silence is a
  strong wedge signal.

S2 — crash-loop breaker on auto-resume
  resumePreviewsOnBoot's 24-h freshness cap (PR #32) stops STALE intents
  from looping, but a FRESH intent whose workspace is deterministically
  broken (bad lockfile, missing dep, OOM during native build) re-crashed
  on every boot.  New collab_session columns preview_crash_count +
  preview_crash_at: incremented when a preview's child exits non-zero
  while still installing (and when the watchdog kills a hung install),
  reset to 0 on a successful install→running transition.  Boot-resume
  now skips any session with >= 3 install crashes in the last hour.  A
  Driver pressing Launch clears the counter (explicit human retry
  overrides the breaker).

  Only INSTALL-phase crashes feed the counter — a crash after reaching
  "running" is a dev-server runtime error, a different class that
  shouldn't suppress resume (the workspace installed fine).

V2 — framework-cache telemetry
  Log whether <repo>/.angular/cache survived the previous container at
  each launch.  That cache lives on EFS and should persist across
  deploys; when it does, second-launch compile drops from ~2 min to
  ~20 s.  If it logs "absent" on a previously-launched session, the
  cache is being wiped and that's a regression to chase.  Shallow,
  bounded, never throws.

Schema: two nullable/defaulted columns added via the existing
PRAGMA-probe ALTER pattern in migrate.ts; legacy rows backfill to
crash_count=0 (eligible for resume), so no special handling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hblanken hblanken merged commit 5775446 into collab Jun 14, 2026
1 check passed