fix(collab): preview install-hang watchdog + crash-loop breaker + cache telemetry#40
Merged
Merged
Conversation
…he telemetry Three stability guardrails for the preview lifecycle, all targeting the frontend's heavy install/compile path on the 4 vCPU / 16 GB task. S1 — install-hang watchdog The 30-min idle cap only counts request traffic, so an install wedged at minute 5 (dead registry, stuck native build, OOMing dep) sat holding memory for 25 more minutes before anything noticed. Track _lastOutput per active state (reset on every stdout/stderr line); the 60-s sweep now stops any preview still in the `installing` phase that has emitted nothing for INSTALL_SILENCE_TIMEOUT_MS (5 min). A healthy pnpm install / ng compile emits progress constantly, so prolonged silence is a strong wedge signal. S2 — crash-loop breaker on auto-resume resumePreviewsOnBoot's 24-h freshness cap (PR #32) stops STALE intents from looping, but a FRESH intent whose workspace is deterministically broken (bad lockfile, missing dep, OOM during native build) re-crashed on every boot. New collab_session columns preview_crash_count + preview_crash_at: incremented when a preview's child exits non-zero while still installing (and when the watchdog kills a hung install), reset to 0 on a successful install→running transition. Boot-resume now skips any session with >= 3 install crashes in the last hour. A Driver pressing Launch clears the counter (explicit human retry overrides the breaker). Only INSTALL-phase crashes feed the counter — a crash after reaching "running" is a dev-server runtime error, a different class that shouldn't suppress resume (the workspace installed fine). V2 — framework-cache telemetry Log whether <repo>/.angular/cache survived the previous container at each launch. That cache lives on EFS and should persist across deploys; when it does, second-launch compile drops from ~2 min to ~20 s. If it logs "absent" on a previously-launched session, the cache is being wiped and that's a regression to chase. Shallow, bounded, never throws. Schema: two nullable/defaulted columns added via the existing PRAGMA-probe ALTER pattern in migrate.ts; legacy rows backfill to crash_count=0 (eligible for resume), so no special handling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 13, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Part of the stability hardening pass (PR 1 of 5). Targets the frontend's heavy install/compile path on the 4 vCPU / 16 GB task.
S1 — install-hang watchdog
The 30-min idle cap only counts request traffic, so a
pnpm installwedged at minute 5 (dead registry, stuck native build, OOMing dep) held memory for 25 more minutes before anything noticed. Now: track_lastOutputper active state (reset on every stdout/stderr line); the existing 60-s sweep stops any preview still in theinstallingphase that has emitted nothing for 5 min. A healthy install/compile emits progress constantly, so prolonged silence is a strong wedge signal.S2 — crash-loop breaker on auto-resume
The 24-h freshness cap (PR #32) stops stale intents from looping, but a fresh intent whose workspace is deterministically broken re-crashed on every boot. New
collab_sessioncolumnspreview_crash_count+preview_crash_at:Only install-phase crashes feed the counter — a crash after reaching "running" is a dev-server runtime error, a different class that shouldn't suppress resume (the workspace installed fine).
V2 — framework-cache telemetry
Logs whether
<repo>/.angular/cachesurvived the previous container at each launch. That cache lives on EFS and should persist across deploys; when it does, second-launch compile drops from ~2 min to ~20 s. If it logs "absent" on a previously-launched session, the cache is being wiped — a regression to chase. Shallow, bounded, never throws.Schema
Two columns added via the existing PRAGMA-probe
ALTERpattern inmigrate.ts+ the Drizzle table inschema.sql.ts. Legacy rows backfill tocrash_count=0(eligible for resume), so no special handling.Files
preview-launcher.ts—_lastOutputtracking, watchdog in sweep, crash record/clear on exit/ready, breaker filter in resume, cache telemetrysession.ts—recordPreviewCrash/clearPreviewCrashCounthelpers,listPreviewIntentscarries crash fieldsmigrate.ts+schema.sql.ts— new columnsrouter.ts— clear crash count on manual launchTest plan
install hung — no output for 5m, preview stops, task stays healthycrash-loop breaker (…)and skips resume; Driver Launch still works (clears counter)🤖 Generated with Claude Code