feat(benchmarks): Add Claude UI benchmark harness by cameroncooke · Pull Request #427 · getsentry/XcodeBuildMCP

cameroncooke · 2026-05-23T11:30:28Z

Add a local Claude UI benchmark harness for measuring simulator UI automation behavior against the development MCP server.

The harness runs deterministic app tasks from Markdown prompts, creates fresh temporary simulators, writes isolated MCP configuration, parses Claude Code transcripts, and reports tool counts, wall-clock timing, failures, and sequence drift. This gives us a repeatable way to catch regressions in agent efficiency and UI automation behavior across Weather, Contacts, and Reminders.

The benchmark setup also keeps simulator boot/open and first-run prompt cleanup outside the measured Claude task, so baselines reflect the actual app work rather than transient Apple setup screens. Mutating UI actions now wait for settled post-action runtime snapshots so the next agent step receives stable refs.

cameroncooke · 2026-05-23T11:30:46Z

feat(benchmarks): Add Claude UI benchmark harness #427 (this pull request)
feat(ui-automation): Add rs/1 runtime automation parity #416
main

github-actions

Parser path hardcoded to author's local filesystem

The harness will fail for any developer or CI environment that doesn't have /Volumes/Developer/parse_claude_conversation.py; this path needs to be configurable or relative to the repo.

Evidence

harness.ts line 34 declares const parserPath = '/Volumes/Developer/parse_claude_conversation.py' as a module-level constant.
runParser at line 214 passes parserPath directly to runCommand as the script argument for python3.
No environment variable, config option, or fallback exists; the path is hardcoded unconditionally.
The path begins with /Volumes/Developer/, a macOS external-volume prefix unique to the author's machine.

Identified by Warden find-bugs

github-actions

Snapshot-settle timeout returns unsettled snapshot without warning, silently breaking the stable-refs guarantee

When captureRuntimeSnapshotAfterAction in post-action-snapshot.ts exceeds its 1,500 ms deadline without the UI settling, it records and returns latestSnapshot.payload as a normal success. captureRuntimeSnapshotAfterActionSafely wraps this in { capture } with no warning or uiError, so every UI-action tool (tap, swipe, gesture, etc.) passes a potentially mid-animation snapshot to the agent as if it were fully stable. This silently breaks the PR's stated guarantee that the next agent step receives stable refs.

Evidence

captureRuntimeSnapshotAfterAction (post-action-snapshot.ts lines 90–93): if (remainingMs <= 0) { recordRuntimeSnapshot(latestSnapshot); return latestSnapshot.payload; } gives no indication of the unsettled state.
captureRuntimeSnapshotAfterActionSafely only branches on exceptions; the timeout return path is indistinguishable from a fully settled return, so { capture: await captureRuntimeSnapshotAfterAction(params) } is emitted.
Callers such as tap.ts line 161, swipe.ts line 226, gesture.ts line 171 (and 7 more tools) check captureResult.warning / captureResult.uiError to surface problems; neither is set on timeout.
The existing test (post-action-snapshot.test.ts) only covers the settled path and does not test timeout behaviour, leaving the gap undetected.

Identified by Warden find-bugs
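The timeout path described in this finding could be closed along the following lines. This is a minimal sketch, not the code in post-action-snapshot.ts: the `Snapshot` shape, its `fingerprint` field, the 100 ms poll interval, and the name `captureSettledSnapshot` are assumptions made for illustration. Only the 1,500 ms deadline and the `capture` / `warning` result fields come from the review comment.

```typescript
// Hypothetical snapshot shape; the real type may differ.
interface Snapshot {
  fingerprint: string; // any stable digest of the UI tree (assumed)
  payload: unknown;
}

type CaptureResult = { capture: unknown } | { warning: string };

interface SettleOptions {
  takeSnapshot: () => Promise<Snapshot>;
  now: () => number; // injected clock, so tests need no real time
  sleep: (ms: number) => Promise<void>;
  timeoutMs?: number;
  pollIntervalMs?: number;
}

// Poll until two consecutive snapshots match. On timeout, return a warning
// instead of the last (possibly mid-animation) snapshot.
async function captureSettledSnapshot(opts: SettleOptions): Promise<CaptureResult> {
  const timeoutMs = opts.timeoutMs ?? 1_500;
  const pollIntervalMs = opts.pollIntervalMs ?? 100;
  const deadline = opts.now() + timeoutMs;

  let previous = await opts.takeSnapshot();
  while (opts.now() < deadline) {
    await opts.sleep(pollIntervalMs);
    const current = await opts.takeSnapshot();
    if (current.fingerprint === previous.fingerprint) {
      return { capture: current.payload };
    }
    previous = current;
  }
  return { warning: `UI did not settle within ${timeoutMs} ms; refs may be stale` };
}
```

With this shape, callers that already branch on `warning` would see the timeout without any extra plumbing, and the injected `now` / `sleep` pair lets a test drive the timeout path deterministically.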

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Temp simulator leak on setup
- Added try-catch cleanup in prepareTemporarySimulator to delete the created simulator if post-creation setup steps fail.
✅ Fixed: Unused exported env helper
- Removed the unused sessionDefaultsEnv export from config.ts as it was dead code with no references.

Or push these changes by commenting:

@cursor push aed23b8a5b

Preview (aed23b8a5b)

diff --git a/CHANGELOG.md b/CHANGELOG.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -22,6 +22,7 @@
 
 ### Fixed
 
+- Fixed Claude UI benchmark temporary simulator cleanup so simulators created by the harness are deleted even when post-creation setup steps (boot, bootstatus, or Simulator.app open) fail.
 - Fixed Claude UI benchmark suite runs so temporary simulators are applied through an isolated per-run MCP config instead of being overridden by repo or example-project config defaults.
 - Fixed simulator launch failures before simulator-name resolution so they are not reported as macOS launch failures.
 - Fixed CLI JSON output so simulator-name resolution failures return the structured error envelope instead of plain stderr.

diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -186,16 +186,3 @@
   const raw = parseYaml(await readFile(suitePath, 'utf8')) as unknown;
   return readConfig(raw, suitePath);
 }
-
-export function sessionDefaultsEnv(
-  sessionDefaults: Record<string, unknown> | undefined,
-): Record<string, string> {
-  const validated = validateSessionDefaults(sessionDefaults);
-  if (!validated) return {};
-
-  const env: Record<string, string> = {};
-  for (const [key, value] of Object.entries(validated)) {
-    env[sessionDefaultEnvNames[key]] = String(value);
-  }
-  return env;
-}

diff --git a/src/benchmarks/claude-ui/simulator-lifecycle.ts b/src/benchmarks/claude-ui/simulator-lifecycle.ts
--- a/src/benchmarks/claude-ui/simulator-lifecycle.ts
+++ b/src/benchmarks/claude-ui/simulator-lifecycle.ts
@@ -278,65 +278,89 @@
     logPath: opts.logPath,
   } satisfies CreatedTemporarySimulator;
 
-  opts.onEvent?.(`booting simulator ${simulatorId}`);
-  const bootArgs = ['simctl', 'boot', simulatorId];
-  const bootResult = await executor({
-    command: 'xcrun',
-    args: bootArgs,
-    cwd: opts.cwd,
-    logPath: opts.logPath,
-  });
-  if (!isAlreadyBooted(bootResult)) {
-    throw new Error(
-      `${opts.config.name}: failed to boot temporary simulator with ${commandText('xcrun', bootArgs)} (exit ${bootResult.exitCode}); see ${opts.logPath}`,
-    );
-  }
-  if (bootResult.exitCode !== 0) {
+  try {
+    opts.onEvent?.(`booting simulator ${simulatorId}`);
+    const bootArgs = ['simctl', 'boot', simulatorId];
+    const bootResult = await executor({
+      command: 'xcrun',
+      args: bootArgs,
+      cwd: opts.cwd,
+      logPath: opts.logPath,
+    });
+    if (!isAlreadyBooted(bootResult)) {
+      throw new Error(
+        `${opts.config.name}: failed to boot temporary simulator with ${commandText('xcrun', bootArgs)} (exit ${bootResult.exitCode}); see ${opts.logPath}`,
+      );
+    }
+    if (bootResult.exitCode !== 0) {
+      await appendLifecycleLog(
+        opts.logPath,
+        'Boot command reported simulator was already booted; continuing',
+        logWriter,
+      );
+    }
+
+    opts.onEvent?.(`waiting for simulator ${simulatorId} bootstatus`);
+    const bootstatusArgs = ['simctl', 'bootstatus', simulatorId, '-b'];
+    const bootstatusResult = await executor({
+      command: 'xcrun',
+      args: bootstatusArgs,
+      cwd: opts.cwd,
+      logPath: opts.logPath,
+    });
+    if (bootstatusResult.exitCode !== 0) {
+      throw new Error(
+        `${opts.config.name}: temporary simulator did not reach bootstatus with ${commandText('xcrun', bootstatusArgs)} (exit ${bootstatusResult.exitCode}); see ${opts.logPath}`,
+      );
+    }
+
+    opts.onEvent?.(`opening Simulator.app for ${simulatorId}`);
+    const openArgs = ['-a', 'Simulator', '--args', '-CurrentDeviceUDID', simulatorId];
+    const openResult = await executor({
+      command: 'open',
+      args: openArgs,
+      cwd: opts.cwd,
+      logPath: opts.logPath,
+    });
+    if (openResult.exitCode !== 0) {
+      throw new Error(
+        `${opts.config.name}: failed to open Simulator.app with ${commandText('open', openArgs)} (exit ${openResult.exitCode}); see ${opts.logPath}`,
+      );
+    }
+
+    await waitForReadinessDelay({
+      logPath: opts.logPath,
+      milliseconds: opts.readinessDelayMs ?? 2_000,
+      onEvent: opts.onEvent,
+      logWriter,
+    });
+    await appendLifecycleLog(opts.logPath, `Temporary simulator ready: ${simulatorId}`, logWriter);
+    opts.onEvent?.(`simulator ready ${simulatorId}`);
+
+    return simulator;
+  } catch (error) {
     await appendLifecycleLog(
       opts.logPath,
-      'Boot command reported simulator was already booted; continuing',
+      `Setup failed, cleaning up simulator ${simulatorId}`,
       logWriter,
     );
+    try {
+      await executor({
+        command: 'xcrun',
+        args: ['simctl', 'delete', simulatorId],
+        cwd: opts.cwd,
+        logPath: opts.logPath,
+      });
+      await appendLifecycleLog(opts.logPath, `Deleted simulator ${simulatorId}`, logWriter);
+    } catch (deleteError) {
+      await appendLifecycleLog(
+        opts.logPath,
+        `Failed to delete simulator ${simulatorId}: ${deleteError instanceof Error ? deleteError.message : String(deleteError)}`,
+        logWriter,
+      );
+    }
+    throw error;
   }
-
-  opts.onEvent?.(`waiting for simulator ${simulatorId} bootstatus`);
-  const bootstatusArgs = ['simctl', 'bootstatus', simulatorId, '-b'];
-  const bootstatusResult = await executor({
-    command: 'xcrun',
-    args: bootstatusArgs,
-    cwd: opts.cwd,
-    logPath: opts.logPath,
-  });
-  if (bootstatusResult.exitCode !== 0) {
-    throw new Error(
-      `${opts.config.name}: temporary simulator did not reach bootstatus with ${commandText('xcrun', bootstatusArgs)} (exit ${bootstatusResult.exitCode}); see ${opts.logPath}`,
-    );
-  }
-
-  opts.onEvent?.(`opening Simulator.app for ${simulatorId}`);
-  const openArgs = ['-a', 'Simulator', '--args', '-CurrentDeviceUDID', simulatorId];
-  const openResult = await executor({
-    command: 'open',
-    args: openArgs,
-    cwd: opts.cwd,
-    logPath: opts.logPath,
-  });
-  if (openResult.exitCode !== 0) {
-    throw new Error(
-      `${opts.config.name}: failed to open Simulator.app with ${commandText('open', openArgs)} (exit ${openResult.exitCode}); see ${opts.logPath}`,
-    );
-  }
-
-  await waitForReadinessDelay({
-    logPath: opts.logPath,
-    milliseconds: opts.readinessDelayMs ?? 2_000,
-    onEvent: opts.onEvent,
-    logWriter,
-  });
-  await appendLifecycleLog(opts.logPath, `Temporary simulator ready: ${simulatorId}`, logWriter);
-  opts.onEvent?.(`simulator ready ${simulatorId}`);
-
-  return simulator;
 }
 
 export async function deleteTemporarySimulator(

pkg-pr-new · 2026-05-23T18:50:45Z

npm i https://pkg.pr.new/xcodebuildmcp@427

commit: 989ab76

github-actions

Invalid failurePatterns regex crashes benchmark analysis after Claude has already run

A malformed regex string in failurePatterns (e.g. [unclosed) causes new RegExp(pattern, 'i') in createPatternMatchers to throw an uncaught SyntaxError, aborting analyzeClaudeJsonl mid-run and discarding all Claude output already collected. Wrap the RegExp constructor in a try-catch and surface the error as a parseError instead.

Evidence

createPatternMatchers in transcript.ts:114 calls new RegExp(pattern, 'i') with no try-catch.
patterns comes from config.failurePatterns loaded from YAML — any user-authored regex string is passed verbatim.
analyzeClaudeJsonl calls createPatternMatchers at line 126 before processing any transcript lines; a throw here propagates up through runSuite in harness.ts, skipping result.json write.
The benchmark has already spent wall-clock time running Claude when this occurs, so all timing and transcript data is lost.

Identified by Warden find-bugs
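One way to surface malformed patterns without throwing is to compile them all up front and collect the failures. The sketch below is illustrative only: `compileFailurePatterns` and the `PatternValidation` shape are hypothetical names, not the functions in transcript.ts. The `'i'` flag mirrors the `new RegExp(pattern, 'i')` call quoted in the evidence.

```typescript
interface PatternValidation {
  matchers: RegExp[];
  errors: string[];
}

// Compile every pattern and collect failures instead of throwing, so the
// caller can reject the suite config before any expensive work starts.
function compileFailurePatterns(patterns: readonly string[]): PatternValidation {
  const matchers: RegExp[] = [];
  const errors: string[] = [];
  patterns.forEach((pattern, index) => {
    try {
      matchers.push(new RegExp(pattern, 'i'));
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      errors.push(`failurePatterns[${index}]: invalid regex ${JSON.stringify(pattern)}: ${reason}`);
    }
  });
  return { matchers, errors };
}

// Usage: one valid pattern and the malformed example from the finding.
const validation = compileFailurePatterns(['timeout', '[unclosed']);
if (validation.errors.length > 0) {
  console.error(validation.errors.join('\n'));
}
```

Running the same check while the suite config is loaded, rather than during transcript analysis, means a bad pattern costs no Claude run at all.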

cameroncooke · 2026-05-23T19:50:58Z

Addressed the non-inline review summaries as well: snapshot settle timeout now returns a recoverable warning instead of unstable refs, invalid failurePatterns are validated before Claude runs, and the parser path is already repo-relative/configurable in the current branch. Fixed in afdf792.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Preflight skipped without simulator ID
- Added explicit error when firstRunPromptDismissals is configured but no simulator ID is available, failing loudly instead of silently skipping preflight.
✅ Fixed: Command log omits suite simulator
- Changed command log to use effectiveSimulatorId which includes both temporary and session default simulator IDs for accurate debugging.

Or push these changes by commenting:

@cursor push 4df7567169

Preview (4df7567169)

diff --git a/src/benchmarks/claude-ui/harness.ts b/src/benchmarks/claude-ui/harness.ts
--- a/src/benchmarks/claude-ui/harness.ts
+++ b/src/benchmarks/claude-ui/harness.ts
@@ -379,6 +379,11 @@
       (typeof config.sessionDefaults?.simulatorId === 'string'
         ? config.sessionDefaults.simulatorId
         : undefined);
+    if (config.firstRunPromptDismissals && !effectiveSimulatorId) {
+      throw new Error(
+        'firstRunPromptDismissals configured but no simulator ID available: set temporarySimulator: true or provide sessionDefaults.simulatorId',
+      );
+    }
     if (effectiveSimulatorId) {
       await dismissFirstRunPrompts({
         config,
@@ -422,7 +427,7 @@
     ];
     await writeFile(
       artifacts.claudeCommandLogPath,
-      `Run dir: ${runDirectory}\nCommand: claude ${claudeArgs.join(' ')} < ${artifacts.promptPath} > ${artifacts.claudeJsonlPath} 2> ${artifacts.claudeStderrPath}\nWorking directory: ${workingDirectory}\nMCP workspace: ${artifacts.mcpWorkspaceDirectory}\nMCP workspace config: ${artifacts.mcpWorkspaceConfigPath}\nSimulator lifecycle log: ${artifacts.simulatorLifecycleLogPath}\nSimulator ID: ${temporarySimulator?.simulatorId ?? 'suite/default'}\nStarted: ${new Date().toISOString()}\n`,
+      `Run dir: ${runDirectory}\nCommand: claude ${claudeArgs.join(' ')} < ${artifacts.promptPath} > ${artifacts.claudeJsonlPath} 2> ${artifacts.claudeStderrPath}\nWorking directory: ${workingDirectory}\nMCP workspace: ${artifacts.mcpWorkspaceDirectory}\nMCP workspace config: ${artifacts.mcpWorkspaceConfigPath}\nSimulator lifecycle log: ${artifacts.simulatorLifecycleLogPath}\nSimulator ID: ${effectiveSimulatorId ?? 'suite/default'}\nStarted: ${new Date().toISOString()}\n`,
       'utf8',
     );

Add a local Claude UI benchmark harness for running deterministic app tasks against the development MCP server. The harness creates temporary simulators, uses isolated MCP config, records tool-call and timing metrics, and reports sequence drift with readable terminal output. Stabilize post-action UI snapshots so mutating UI actions return settled refs before the next agent step. Add benchmark and UI automation tests covering the new harness behavior and snapshot polling. Co-Authored-By: Codex <noreply@openai.com>

Make the Claude UI benchmark parser path explicit instead of relying on a local absolute path. Tighten first-run preflight completion handling, render null process exit codes, and inject benchmark lifecycle logging and post-action timing in tests. Co-Authored-By: OpenAI Codex <noreply@openai.com>

Use the repo-local Claude UI transcript parser as the default so the benchmark command works without machine-specific parser arguments. Keep --parser and CLAUDE_UI_BENCHMARK_PARSER as explicit overrides for local parser testing. Co-Authored-By: OpenAI Codex <noreply@openai.com>
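The override order described in this commit message might look roughly like the sketch below. The `--parser` flag, the `CLAUDE_UI_BENCHMARK_PARSER` variable, and the repo-local `benchmarks/claude-ui/parse_claude_conversation.py` location all appear in this PR. The function name `resolveParserPath`, the precedence of the flag over the environment variable, and resolving relative overrides against the repo root are assumptions.

```typescript
import * as path from 'node:path';

interface ParserPathInputs {
  cliParser?: string; // value of --parser, if given
  env: Record<string, string | undefined>;
  repoRoot: string;
}

// Assumed precedence: --parser, then CLAUDE_UI_BENCHMARK_PARSER, then the
// repo-local default. Blank values are treated as "not set".
function resolveParserPath(inputs: ParserPathInputs): string {
  const override = [inputs.cliParser, inputs.env.CLAUDE_UI_BENCHMARK_PARSER]
    .map((value) => value?.trim())
    .find((value) => value !== undefined && value !== '');
  if (override !== undefined) {
    // Absolute overrides are kept as-is; relative ones resolve from the repo root.
    return path.resolve(inputs.repoRoot, override);
  }
  return path.join(inputs.repoRoot, 'benchmarks', 'claude-ui', 'parse_claude_conversation.py');
}
```

Passing `env` and `repoRoot` in explicitly, instead of reading `process.env` and the working directory inside the function, keeps the resolution logic testable without touching the real environment.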

Start first-run prompt timeout after the target app finishes launching so slow first launches do not consume the inspection window before AXe can check for prompts. Co-Authored-By: OpenAI Codex <noreply@openai.com>

Harden the Claude UI benchmark harness so preflight retries transient UI inspection failures, temporary simulators are cleaned up after partial setup failures, and suite config validation fails before expensive runs. Also report unsettled post-action runtime snapshots as recoverable UI automation warnings instead of returning unstable refs. Co-Authored-By: OpenAI Codex <codex@openai.com>

Require benchmark preflight dismissals to have a concrete simulator ID and record suite-provided simulator IDs in command logs. Remove an unused post-action snapshot assignment reported by review tooling. Co-Authored-By: OpenAI Codex <codex@openai.com>

Validate Claude UI benchmark sessionDefaults while reading the suite config so unknown keys and invalid value types fail before simulator setup starts. Co-Authored-By: OpenAI Codex <codex@openai.com>
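A fail-fast check of this kind can be sketched as follows. It is not the project's `validateSessionDefaults`: the function name, the field table, and the error wording are hypothetical, and of the keys shown only `simulatorId` appears in this PR. The point is that both unknown keys and wrong value types are reported at load time, with the config source in the message.

```typescript
type FieldType = 'string' | 'boolean';

// Only simulatorId is taken from this PR; the other entries are placeholders.
const sessionDefaultFields = new Map<string, FieldType>([
  ['simulatorId', 'string'],
  ['simulatorName', 'string'], // hypothetical key
  ['useLatestOS', 'boolean'], // hypothetical key
]);

function validateSessionDefaultsSketch(
  raw: Record<string, unknown>,
  source: string,
): Record<string, string | boolean> {
  const errors: string[] = [];
  const validated: Record<string, string | boolean> = {};
  for (const [key, value] of Object.entries(raw)) {
    const expected = sessionDefaultFields.get(key);
    if (expected === undefined) {
      errors.push(`${source}.${key}: unknown key`);
    } else if (typeof value !== expected) {
      errors.push(`${source}.${key}: expected ${expected}, got ${typeof value}`);
    } else {
      validated[key] = value as string | boolean;
    }
  }
  // Report every problem at once so a suite author can fix them in one pass.
  if (errors.length > 0) {
    throw new Error(errors.join('; '));
  }
  return validated;
}
```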

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Preflight leaves app running on failure
- Wrapped dismissal logic in try-finally block to ensure simctl terminate is always called after launch, even when errors are thrown.

Or push these changes by commenting:

@cursor push 203c16e988

Preview (203c16e988)

diff --git a/src/benchmarks/claude-ui/first-run-preflight.ts b/src/benchmarks/claude-ui/first-run-preflight.ts
--- a/src/benchmarks/claude-ui/first-run-preflight.ts
+++ b/src/benchmarks/claude-ui/first-run-preflight.ts
@@ -160,76 +160,79 @@
     );
   }
 
-  const deadline = timing.now() + timeoutMs;
-  let promptsDismissed = false;
-  let uiSeen = false;
-  while (timing.now() < deadline) {
-    const search = await findFirstRunPromptLabel({
-      simulatorId: opts.simulatorId,
-      labels: dismissals.labels,
-      cwd: opts.cwd,
-      logPath: opts.logPath,
-      executor,
-      axePath,
-      axeEnv,
-    });
+  try {
+    const deadline = timing.now() + timeoutMs;
+    let promptsDismissed = false;
+    let uiSeen = false;
+    while (timing.now() < deadline) {
+      const search = await findFirstRunPromptLabel({
+        simulatorId: opts.simulatorId,
+        labels: dismissals.labels,
+        cwd: opts.cwd,
+        logPath: opts.logPath,
+        executor,
+        axePath,
+        axeEnv,
+      });
 
-    if (search.status === 'unavailable') {
-      await appendLifecycleLog(
-        opts.logPath,
-        `First-run prompt preflight: UI unavailable; retrying (exit ${search.exitCode})`,
-      );
-      await timing.sleep(500);
-      continue;
-    }
+      if (search.status === 'unavailable') {
+        await appendLifecycleLog(
+          opts.logPath,
+          `First-run prompt preflight: UI unavailable; retrying (exit ${search.exitCode})`,
+        );
+        await timing.sleep(500);
+        continue;
+      }
 
-    if (search.status === 'not-found') {
-      if (search.hasElements) {
-        uiSeen = true;
+      if (search.status === 'not-found') {
+        if (search.hasElements) {
+          uiSeen = true;
+        }
+        if (uiSeen) {
+          promptsDismissed = true;
+          break;
+        }
+        await timing.sleep(500);
+        continue;
       }
-      if (uiSeen) {
-        promptsDismissed = true;
-        break;
+
+      uiSeen = true;
+      const { label } = search;
+      opts.onEvent?.(`dismissing first-run prompt '${label}'`);
+      await appendLifecycleLog(opts.logPath, `Dismissing first-run prompt label: ${label}`);
+      const tap = await executor({
+        command: axePath,
+        args: ['tap', '--label', label, '--element-type', 'Button', '--udid', opts.simulatorId],
+        cwd: opts.cwd,
+        logPath: opts.logPath,
+        env: axeEnv,
+      });
+      if (tap.exitCode !== 0) {
+        throw new Error(
+          `${opts.config.name}: failed to dismiss first-run prompt '${label}' (exit ${tap.exitCode}); see ${opts.logPath}`,
+        );
       }
       await timing.sleep(500);
-      continue;
     }
 
-    uiSeen = true;
-    const { label } = search;
-    opts.onEvent?.(`dismissing first-run prompt '${label}'`);
-    await appendLifecycleLog(opts.logPath, `Dismissing first-run prompt label: ${label}`);
-    const tap = await executor({
-      command: axePath,
-      args: ['tap', '--label', label, '--element-type', 'Button', '--udid', opts.simulatorId],
+    if (!promptsDismissed) {
+      throw new Error(
+        `${opts.config.name}: timed out during first-run prompt preflight; see ${opts.logPath}`,
+      );
+    }
+
+    await appendLifecycleLog(opts.logPath, 'First-run prompt preflight: complete');
+  } finally {
+    const terminate = await executor({
+      command: 'xcrun',
+      args: ['simctl', 'terminate', opts.simulatorId, bundleId],
       cwd: opts.cwd,
       logPath: opts.logPath,
-      env: axeEnv,
     });
-    if (tap.exitCode !== 0) {
+    if (terminate.exitCode !== 0) {
       throw new Error(
-        `${opts.config.name}: failed to dismiss first-run prompt '${label}' (exit ${tap.exitCode}); see ${opts.logPath}`,
+        `${opts.config.name}: failed to terminate app after first-run prompt preflight (exit ${terminate.exitCode}); see ${opts.logPath}`,
       );
     }
-    await timing.sleep(500);
   }
-
-  if (!promptsDismissed) {
-    throw new Error(
-      `${opts.config.name}: timed out during first-run prompt preflight; see ${opts.logPath}`,
-    );
-  }
-
-  const terminate = await executor({
-    command: 'xcrun',
-    args: ['simctl', 'terminate', opts.simulatorId, bundleId],
-    cwd: opts.cwd,
-    logPath: opts.logPath,
-  });
-  if (terminate.exitCode !== 0) {
-    throw new Error(
-      `${opts.config.name}: failed to terminate app after first-run prompt preflight (exit ${terminate.exitCode}); see ${opts.logPath}`,
-    );
-  }
-  await appendLifecycleLog(opts.logPath, 'First-run prompt preflight: complete');
 }

cursor · 2026-05-23T20:07:21Z

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Session defaults validate too late
- Added validateSessionDefaults call in readConfig to fail fast on invalid keys or value types before simulator setup.

Or push these changes by commenting:

@cursor push f981ed6361

Preview (f981ed6361)

diff --git a/CHANGELOG.md b/CHANGELOG.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -22,6 +22,7 @@
 
 ### Fixed
 
+- Fixed Claude UI benchmark suite loading so `sessionDefaults` unknown keys and invalid value types fail fast at config load time instead of after expensive simulator setup.
 - Fixed Claude UI benchmark preflight so transient malformed or still-loading UI snapshots no longer crash the harness or finish before app UI is observable.
 - Fixed Claude UI benchmark preflight so configured first-run dismissals require a concrete simulator ID and suite-provided simulator IDs are recorded in command logs.
 - Fixed Claude UI benchmark config handling so invalid `failurePatterns` regexes fail before a suite starts and partial `allowedVariance` overrides preserve defaults for omitted metrics.

diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -185,7 +185,9 @@
     ),
   };
 
-  if (isRecord(raw.sessionDefaults)) config.sessionDefaults = raw.sessionDefaults;
+  if (isRecord(raw.sessionDefaults)) {
+    config.sessionDefaults = validateSessionDefaults(raw.sessionDefaults);
+  }
   config.allowedVariance = readAllowedVariance(raw.allowedVariance, `${source}.allowedVariance`);
 
   if (raw.baseline !== undefined) {

Validate benchmark session defaults with the same schema used by runtime session defaults and terminate preflight-launched apps after post-launch failures. Co-Authored-By: Codex <codex@openai.com>

Retry transient runtime snapshot parse failures during post-action refresh and avoid ending first-run preflight during empty transition snapshots. Co-Authored-By: Codex <codex@openai.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Empty --all reports success
- Added check to throw error when listSuitePaths returns empty array, preventing vacuous success from empty results.
✅ Fixed: Parser always exits zero
- Modified parse function to track JSON decode errors and return non-zero exit code when malformed lines are encountered.

Or push these changes by commenting:

@cursor push 3e2d05ccb6

Preview (3e2d05ccb6)

diff --git a/benchmarks/claude-ui/parse_claude_conversation.py b/benchmarks/claude-ui/parse_claude_conversation.py
--- a/benchmarks/claude-ui/parse_claude_conversation.py
+++ b/benchmarks/claude-ui/parse_claude_conversation.py
@@ -101,13 +101,14 @@
     return "\n\n".join(parts)
 
 
-def parse(path: Path, out_dir: Path, tool_prefix: str) -> None:
+def parse(path: Path, out_dir: Path, tool_prefix: str) -> bool:
     out_dir.mkdir(parents=True, exist_ok=True)
 
     # Track tool_use_ids that target our prefix so we keep matching results.
     tracked_ids: set[str] = set()
     tool_name_by_id: dict[str, str] = {}
     counter = 0
+    had_errors = False
 
     def next_path(kind: str, label: str | None = None) -> Path:
         nonlocal counter
@@ -124,6 +125,7 @@
                 entry = json.loads(raw)
             except json.JSONDecodeError as exc:
                 print(f"warn: skipping line {line_no}: {exc}", file=sys.stderr)
+                had_errors = True
                 continue
 
             etype = entry.get("type")
@@ -204,6 +206,7 @@
             # last-prompt, file-history-snapshot, thinking blocks) is dropped.
 
     print(f"Wrote {counter} files to {out_dir}")
+    return not had_errors
 
 
 def main() -> int:
@@ -227,8 +230,8 @@
         return 1
 
     out = args.output or args.jsonl.with_name(f"{args.jsonl.stem}_conversation")
-    parse(args.jsonl, out, args.tool_prefix)
-    return 0
+    success = parse(args.jsonl, out, args.tool_prefix)
+    return 0 if success else 1
 
 
 if __name__ == "__main__":

diff --git a/src/benchmarks/claude-ui/harness.ts b/src/benchmarks/claude-ui/harness.ts
--- a/src/benchmarks/claude-ui/harness.ts
+++ b/src/benchmarks/claude-ui/harness.ts
@@ -565,6 +565,9 @@
   }
 
   const suitePaths = args.all ? await listSuitePaths() : [resolveSuitePath(args.suite as string)];
+  if (suitePaths.length === 0) {
+    throw new Error('no suite files found in benchmarks/claude-ui/suites');
+  }
   const progress = createProgressReporter({ enabled: !args.json });
   const results: BenchmarkResult[] = [];
   for (let index = 0; index < suitePaths.length; index += 1) {

Fail --all when suite discovery returns no suite files so a missing or misconfigured benchmark suite directory cannot look like a successful full run. Return a non-zero parser exit when malformed JSONL lines are skipped so broken transcript parses fail the benchmark instead of producing partial-success artifacts. Co-Authored-By: Codex <codex@openai.com>

Separate first-run preflight terminate executor errors from non-zero exit handling so each failure path is handled once. Track preflight completion with a success flag instead of catching and rethrowing only to drive cleanup suppression. Co-Authored-By: Codex <codex@openai.com>
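The success-flag approach mentioned in this commit message can be illustrated with a small sketch. Everything here is hypothetical (`preflightWithCleanup`, `describeTerminateFailure`, the `ExecResult` shape, and the injected callbacks); it only shows the control flow: cleanup always runs, a cleanup failure after a successful preflight is raised, and a cleanup failure after a failed preflight is logged so the original error is not masked.

```typescript
interface ExecResult {
  exitCode: number;
}

// Fold both terminate failure modes (thrown error, non-zero exit) into one
// optional description so each is handled exactly once.
async function describeTerminateFailure(
  terminateApp: () => Promise<ExecResult>,
): Promise<string | undefined> {
  try {
    const result = await terminateApp();
    return result.exitCode === 0 ? undefined : `terminate exited with code ${result.exitCode}`;
  } catch (error) {
    return error instanceof Error ? error.message : String(error);
  }
}

async function preflightWithCleanup(opts: {
  dismissPrompts: () => Promise<void>;
  terminateApp: () => Promise<ExecResult>;
  log: (message: string) => void;
}): Promise<void> {
  let succeeded = false;
  let terminateFailure: string | undefined;
  try {
    await opts.dismissPrompts();
    succeeded = true;
  } finally {
    terminateFailure = await describeTerminateFailure(opts.terminateApp);
    if (!succeeded && terminateFailure !== undefined) {
      // The preflight error is already propagating; record the cleanup problem only.
      opts.log(`cleanup after failed preflight also failed: ${terminateFailure}`);
    }
  }
  // Reached only when the preflight itself succeeded.
  if (terminateFailure !== undefined) {
    throw new Error(`failed to terminate app after first-run prompt preflight: ${terminateFailure}`);
  }
}
```

Keeping the `throw` outside the `finally` block avoids replacing an in-flight preflight error with the terminate error, which is the risk when a `finally` block throws unconditionally.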

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Duplicate stumble count parse errors
- Modified failureCount calculation to skip adding parserExitCode when audit.parseErrors already captured the same parser failures, preventing double-counting.

Or push these changes by commenting:

@cursor push 827aa9507c

Preview (827aa9507c)

diff --git a/src/benchmarks/claude-ui/compare.ts b/src/benchmarks/claude-ui/compare.ts
--- a/src/benchmarks/claude-ui/compare.ts
+++ b/src/benchmarks/claude-ui/compare.ts
@@ -183,7 +183,7 @@
     audit.failures.length +
     audit.patternFailures.length +
     (run.claudeExitCode === 0 ? 0 : 1) +
-    (run.parserExitCode === 0 ? 0 : 1);
+    (run.parserExitCode === 0 || audit.parseErrors.length > 0 ? 0 : 1);
   const sequenceMode = config.sequence?.mode ?? 'warn';
   const sequenceMatched =
     expected.length === 0 || (missing.length === 0 && additional.length === 0);

Reviewed by Cursor Bugbot for commit 34a7e53.

Applied via @cursor push command

github-actions

Benchmark test spawns real python3 subprocess, bypassing executor safety overrides

The runParserScript helper calls spawn('python3', args) directly from node:child_process, making npm test dependent on Python 3 being installed and bypassing the vitest-executor-safety.setup.ts framework-executor overrides; wrap the parser invocation in an injectable function so tests can stub it.

Evidence

vitest.config.ts includes src/**/__tests__/**/*.test.ts, so src/benchmarks/claude-ui/__tests__/claude-ui-benchmark.test.ts runs under npm test.
vitest-executor-safety.setup.ts overrides only __setTestCommandExecutorOverride / __setTestInteractiveSpawnerOverride (framework interfaces); raw node:child_process spawn is unguarded.
runParserScript resolves repoRoot and passes the real parse_claude_conversation.py path, hitting the real filesystem and a real Python 3 process.
The 'returns a non-zero parser exit when JSONL lines are malformed' test calls runParserScript with a temp-dir JSONL file, meaning CI must have python3 and the script on disk to pass.

Identified by Warden xcodebuildmcp-test-boundary-review
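The injectable wrapper suggested in this finding could take a shape like the one below. This is a sketch under assumptions, not the harness code: the `ParserRunner` type, `spawnPythonParser`, and this `runParserScript` signature are hypothetical, and the parser arguments are passed through untouched because the script's exact CLI is not shown here.

```typescript
import { spawn } from 'node:child_process';

interface ParserRun {
  exitCode: number | null;
  stderr: string;
}

type ParserRunner = (args: readonly string[]) => Promise<ParserRun>;

// Default runner: starts a real python3 process. Production callers use this;
// tests substitute a stub and never spawn anything.
const spawnPythonParser: ParserRunner = (args) =>
  new Promise((resolve, reject) => {
    const child = spawn('python3', [...args]);
    let stderr = '';
    child.stdout.resume(); // drain stdout so the child never blocks on a full pipe
    child.stderr.on('data', (chunk: Buffer) => {
      stderr += chunk.toString('utf8');
    });
    child.on('error', reject);
    child.on('close', (exitCode) => resolve({ exitCode, stderr }));
  });

async function runParserScript(
  parserPath: string,
  parserArgs: readonly string[],
  runner: ParserRunner = spawnPythonParser,
): Promise<ParserRun> {
  return runner([parserPath, ...parserArgs]);
}
```

A unit test can then pass a stub runner that returns a canned `exitCode` and `stderr`, which keeps `npm test` independent of a local Python 3 install while still exercising how the harness reacts to a non-zero parser exit.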

cameroncooke mentioned this pull request May 23, 2026

feat(ui-automation): Add rs/1 runtime automation parity #416

Merged

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/harness.ts Outdated

Comment thread src/benchmarks/claude-ui/simulator-lifecycle.ts Outdated

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated

Comment thread src/benchmarks/claude-ui/simulator-lifecycle.ts

cameroncooke force-pushed the cam/feat/claude-ui-benchmark-harness branch from 93b21e7 to 08346c6 Compare May 23, 2026 11:57

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts

cameroncooke force-pushed the cam/feat/claude-ui-benchmark-harness branch 2 times, most recently from 3cde457 to fe03f96 Compare May 23, 2026 18:09

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/render.ts

Base automatically changed from cam/feat/ui-automation-runtime-parity to main May 23, 2026 18:24

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/__tests__/simulator-lifecycle.test.ts

Comment thread src/mcp/tools/ui-automation/__tests__/button.test.ts

github-code-quality Bot found potential problems May 23, 2026

Comment thread src/benchmarks/claude-ui/__tests__/simulator-lifecycle.test.ts Fixed

cameroncooke marked this pull request as ready for review May 23, 2026 18:44

cursor Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated

cursor Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/simulator-lifecycle.ts Outdated

Comment thread src/benchmarks/claude-ui/config.ts Outdated

sentry Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/transcript.ts

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

sentry-warden Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/compare.ts

github-code-quality Bot found potential problems May 23, 2026

Comment thread src/mcp/tools/ui-automation/shared/post-action-snapshot.ts Fixed

cursor Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

Comment thread src/benchmarks/claude-ui/harness.ts

sentry Bot reviewed May 23, 2026

Comment thread src/mcp/tools/ui-automation/shared/post-action-snapshot.ts Outdated

cameroncooke and others added 4 commits May 23, 2026 20:58

fix(benchmarks): Start preflight timeout after app launch

8d05e26

Start first-run prompt timeout after the target app finishes launching so slow first launches do not consume the inspection window before AXe can check for prompts.

Co-Authored-By: OpenAI Codex <noreply@openai.com>
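For readers skimming the timeline, the ordering described in 8d05e26 can be sketched roughly as follows. This is not the code from that commit: `runFirstRunPreflight`, `PreflightDeps`, and every field name are hypothetical, and the real preflight presumably polls asynchronously. The only point illustrated is that the deadline is computed after the launch step returns, so launch time never eats into the inspection window.

```typescript
// Hypothetical dependencies, injected so the ordering is easy to test.
interface PreflightDeps {
  launchApp: () => void; // assumed to return once the app has finished launching
  promptVisible: () => boolean; // one inspection of the UI for a first-run prompt
  now: () => number; // millisecond clock
}

interface PreflightResult {
  inspections: number;
  promptFound: boolean;
}

function runFirstRunPreflight(deps: PreflightDeps, timeoutMs: number): PreflightResult {
  deps.launchApp();

  // The deadline is taken after launch, so a slow first launch
  // does not shorten the window available for inspections.
  const deadline = deps.now() + timeoutMs;

  let inspections = 0;
  do {
    inspections += 1;
    if (deps.promptVisible()) {
      return { inspections, promptFound: true };
    }
  } while (deps.now() < deadline);

  return { inspections, promptFound: false };
}
```

If the deadline were taken before `launchApp()` instead, a launch slower than `timeoutMs` would leave room for only a single inspection.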

cameroncooke force-pushed the cam/feat/claude-ui-benchmark-harness branch from afdf792 to 7c5e741 Compare May 23, 2026 19:59

cursor Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

cursor Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/config.ts

fix(benchmarks): Validate session defaults during load

131cc28

Validate Claude UI benchmark sessionDefaults while reading the suite config so unknown keys and invalid value types fail before simulator setup starts.

Co-Authored-By: OpenAI Codex <codex@openai.com>
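A minimal sketch of the load-time check that 131cc28 describes, under stated assumptions: the key names in `allowedSessionDefaults` and the `validateSessionDefaults` helper are illustrative only and are not taken from the repository. It rejects unknown keys and wrong value types and returns the problems as a list, so a loader could fail before any simulator work begins.

```typescript
type ValueKind = 'string' | 'boolean' | 'number';

// Hypothetical allow-list; the real set of session default keys may differ.
const allowedSessionDefaults = new Map<string, ValueKind>([
  ['simulatorName', 'string'],
  ['bundleId', 'string'],
  ['useLatestOS', 'boolean'],
]);

// Returns a list of problems; an empty list means the value is acceptable.
function validateSessionDefaults(input: unknown): string[] {
  if (typeof input !== 'object' || input === null || Array.isArray(input)) {
    return ['sessionDefaults must be an object'];
  }

  const errors: string[] = [];
  for (const [key, value] of Object.entries(input)) {
    const expected = allowedSessionDefaults.get(key);
    if (expected === undefined) {
      errors.push(`unknown sessionDefaults key: ${key}`);
    } else if (typeof value !== expected) {
      errors.push(`sessionDefaults.${key} must be a ${expected}, got ${typeof value}`);
    }
  }
  return errors;
}
```

A `Map` is used rather than a plain object lookup so that keys such as `constructor` are reported as unknown instead of matching inherited properties.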

cursor Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated

sentry Bot reviewed May 23, 2026

Comment thread src/mcp/tools/ui-automation/shared/post-action-snapshot.ts

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/__tests__/first-run-preflight.test.ts

github-actions Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/compare.ts

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated

Comment thread src/benchmarks/claude-ui/render.ts

sentry-warden Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/__tests__/simulator-lifecycle.test.ts

cameroncooke and others added 2 commits May 23, 2026 21:44

fix(benchmarks): Harden Claude UI config preflight

38d8a71

Validate benchmark session defaults with the same schema used by runtime session defaults and terminate preflight-launched apps after post-launch failures.

Co-Authored-By: Codex <codex@openai.com>

fix(ui-automation): Retry transient snapshot refreshes

ebdec42

Retry transient runtime snapshot parse failures during post-action refresh and avoid ending first-run preflight during empty transition snapshots.

Co-Authored-By: Codex <codex@openai.com>
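The retry behaviour named in ebdec42 could look roughly like the sketch below. It rests on assumptions that the commit message does not confirm: that the snapshot arrives as JSON text, that a transient failure surfaces as a `SyntaxError`, and that a bounded attempt count is used. `parseSnapshotWithRetry` is a hypothetical name, and delays between attempts are omitted to keep the sketch synchronous.

```typescript
// Re-reads and re-parses the snapshot when parsing fails, up to maxAttempts times.
// Only SyntaxError (malformed or truncated JSON) is treated as transient;
// any other error is rethrown immediately.
function parseSnapshotWithRetry<T>(readRaw: () => string, maxAttempts: number): T {
  if (maxAttempts < 1) {
    throw new RangeError('maxAttempts must be at least 1');
  }

  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return JSON.parse(readRaw()) as T;
    } catch (error) {
      if (!(error instanceof SyntaxError)) {
        throw error;
      }
      lastError = error; // remember the failure and try a fresh read
    }
  }
  throw lastError;
}
```

Rethrowing the last parse error once the attempts run out keeps a persistent failure visible to the caller rather than masking it.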

cursor Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/harness.ts

Comment thread benchmarks/claude-ui/parse_claude_conversation.py Outdated

sentry Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts

sentry-warden Bot reviewed May 23, 2026

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts

cursor Bot reviewed May 24, 2026

Comment thread src/benchmarks/claude-ui/compare.ts Outdated

Fix duplicate stumble count for parse errors

989ab76

Applied via @cursor push command

github-actions Bot reviewed May 24, 2026

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts

github-actions Bot reviewed May 24, 2026

Comment thread src/benchmarks/claude-ui/__tests__/claude-ui-benchmark.test.ts
