feat(benchmarks): Add Claude UI benchmark harness #427

Open
cameroncooke wants to merge 12 commits into main from cam/feat/claude-ui-benchmark-harness

@cameroncooke
Collaborator

@cameroncooke cameroncooke commented May 23, 2026

Add a local Claude UI benchmark harness for measuring simulator UI automation behavior against the development MCP server.

The harness runs deterministic app tasks from Markdown prompts, creates fresh temporary simulators, writes isolated MCP configuration, parses Claude Code transcripts, and reports tool counts, wall-clock timing, failures, and sequence drift. This gives us a repeatable way to catch regressions in agent efficiency and UI automation behavior across Weather, Contacts, and Reminders.

The benchmark setup also keeps simulator boot/open and first-run prompt cleanup outside the measured Claude task, so baselines reflect the actual app work rather than transient Apple setup screens. Mutating UI actions now wait for settled post-action runtime snapshots so the next agent step receives stable refs.
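
As a rough illustration of what "sequence drift" could mean here, the sketch below compares an expected tool-call list with the calls actually observed. It assumes an order-insensitive, count-aware comparison; `computeSequenceDrift` and `SequenceDrift` are hypothetical names, and the harness's real comparison in compare.ts may differ.

```typescript
// Hypothetical shape; the harness's real types are not shown on this page.
interface SequenceDrift {
  missing: string[]; // expected tool calls that never happened
  additional: string[]; // observed tool calls that were not expected
  matched: boolean;
}

// Assumption: drift is a count-aware difference that ignores call order.
function computeSequenceDrift(expected: string[], actual: string[]): SequenceDrift {
  // Track how many calls of each tool the expected sequence still "owes".
  const remaining = new Map<string, number>();
  for (const tool of expected) {
    remaining.set(tool, (remaining.get(tool) ?? 0) + 1);
  }

  const additional: string[] = [];
  for (const tool of actual) {
    const owed = remaining.get(tool) ?? 0;
    if (owed > 0) {
      remaining.set(tool, owed - 1);
    } else {
      additional.push(tool);
    }
  }

  const missing: string[] = [];
  remaining.forEach((owed, tool) => {
    for (let i = 0; i < owed; i += 1) missing.push(tool);
  });

  return {
    missing,
    additional,
    // An empty expectation always matches; otherwise nothing may be missing or extra.
    matched: expected.length === 0 || (missing.length === 0 && additional.length === 0),
  };
}
```

With expected calls `tap, tap, swipe` and observed calls `tap, swipe, gesture`, this reports one missing `tap` and one additional `gesture`.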

@github-actions github-actions Bot left a comment

Parser path hardcoded to author's local filesystem

The harness will fail for any developer or CI environment that doesn't have /Volumes/Developer/parse_claude_conversation.py; this path needs to be configurable or relative to the repo.

Evidence
  • harness.ts line 34 declares const parserPath = '/Volumes/Developer/parse_claude_conversation.py' as a module-level constant.
  • runParser at line 214 passes parserPath directly to runCommand as the script argument for python3.
  • No environment variable, config option, or fallback exists; the path is hardcoded unconditionally.
  • The path begins with /Volumes/Developer/, a macOS external-volume prefix unique to the author's machine.

Identified by Warden find-bugs
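
One way to make the path configurable is sketched below. It assumes a precedence of CLI flag, then environment variable, then a repo-relative default. The `--parser` flag, the `CLAUDE_UI_BENCHMARK_PARSER` variable, and the `benchmarks/claude-ui/parse_claude_conversation.py` location all appear later in this PR; the helper name, its input shape, and the precedence order are assumptions.

```typescript
// Hypothetical helper; the name, input shape, and precedence order are assumptions.
interface ParserPathInputs {
  cliParser?: string; // value passed via --parser, if any
  env: Record<string, string | undefined>; // e.g. process.env
  repoRoot: string; // absolute path to the repository checkout
}

const DEFAULT_PARSER_RELATIVE_PATH = 'benchmarks/claude-ui/parse_claude_conversation.py';

function resolveParserPath(inputs: ParserPathInputs): string {
  // 1. An explicit CLI override wins.
  const fromCli = inputs.cliParser?.trim();
  if (fromCli) return fromCli;

  // 2. Then the environment variable.
  const fromEnv = inputs.env['CLAUDE_UI_BENCHMARK_PARSER']?.trim();
  if (fromEnv) return fromEnv;

  // 3. Otherwise fall back to the parser that ships with the repository.
  const root = inputs.repoRoot.replace(/\/+$/, '');
  return `${root}/${DEFAULT_PARSER_RELATIVE_PATH}`;
}
```

Passing the environment in as a plain record keeps the function free of machine-specific state, so it can be unit-tested without touching the real filesystem.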

Comment thread src/benchmarks/claude-ui/harness.ts Outdated
Comment thread src/benchmarks/claude-ui/simulator-lifecycle.ts Outdated
Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated
Comment thread src/benchmarks/claude-ui/simulator-lifecycle.ts
@cameroncooke cameroncooke force-pushed the cam/feat/claude-ui-benchmark-harness branch from 93b21e7 to 08346c6 May 23, 2026 11:57
Comment thread src/benchmarks/claude-ui/harness.ts
Comment thread src/benchmarks/claude-ui/first-run-preflight.ts
@cameroncooke cameroncooke force-pushed the cam/feat/claude-ui-benchmark-harness branch 2 times, most recently from 3cde457 to fe03f96 May 23, 2026 18:09

@github-actions github-actions Bot left a comment

Snapshot-settle timeout returns unsettled snapshot without warning, silently breaking the stable-refs guarantee

When captureRuntimeSnapshotAfterAction in post-action-snapshot.ts exceeds its 1,500 ms deadline without the UI settling, it records latestSnapshot and returns latestSnapshot.payload as a normal success. captureRuntimeSnapshotAfterActionSafely wraps this in { capture } with no warning or uiError, so every UI-action tool (tap, swipe, gesture, etc.) passes a potentially mid-animation snapshot to the agent as if it were fully stable. This silently breaks the PR's stated guarantee that the next agent step receives stable refs.

Evidence
  • captureRuntimeSnapshotAfterAction (post-action-snapshot.ts lines 90–93): if (remainingMs <= 0) { recordRuntimeSnapshot(latestSnapshot); return latestSnapshot.payload; } returns with no indication of unsettled state.
  • captureRuntimeSnapshotAfterActionSafely only branches on exceptions; the timeout return path is indistinguishable from a fully settled return, so { capture: await captureRuntimeSnapshotAfterAction(params) } is emitted.
  • Callers such as tap.ts line 161, swipe.ts line 226, gesture.ts line 171 (and 7 more tools) check captureResult.warning / captureResult.uiError to surface problems; neither is set on timeout.
  • The existing test (post-action-snapshot.test.ts) only covers the settled path and does not test timeout behaviour, leaving the gap undetected.

Identified by Warden find-bugs
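
A minimal sketch of a settle loop that makes the timeout path distinguishable is shown below. It assumes "settled" means two consecutive equal snapshots, and it withholds the capture on timeout rather than returning it. All names (`captureSettledSnapshot`, `SettleOptions`, `SettleResult`) are hypothetical and do not reflect the real post-action-snapshot.ts API.

```typescript
// Hypothetical types; the real post-action-snapshot.ts API is not shown on this page.
type SettleResult<T> =
  | { status: 'settled'; capture: T }
  | { status: 'unsettled'; warning: string };

interface SettleOptions<T> {
  capture: () => Promise<T>; // takes one runtime snapshot
  isSame: (a: T, b: T) => boolean; // assumption: settled = two consecutive equal snapshots
  now: () => number; // injectable clock in milliseconds
  sleep: (ms: number) => Promise<void>; // injectable delay
  timeoutMs: number;
  pollIntervalMs: number;
}

async function captureSettledSnapshot<T>(opts: SettleOptions<T>): Promise<SettleResult<T>> {
  const deadline = opts.now() + opts.timeoutMs;
  let latest = await opts.capture();

  while (opts.now() < deadline) {
    await opts.sleep(opts.pollIntervalMs);
    const next = await opts.capture();
    if (opts.isSame(latest, next)) {
      // Two matching snapshots in a row: treat the refs as stable.
      return { status: 'settled', capture: next };
    }
    latest = next;
  }

  // Deadline passed without a stable pair: report it instead of returning refs.
  return {
    status: 'unsettled',
    warning: `UI did not settle within ${opts.timeoutMs} ms; take a fresh snapshot before acting`,
  };
}
```

Because the result is a discriminated union, a caller cannot read `capture` without first checking `status`, which keeps the timeout path from looking like a success. Injecting `now` and `sleep` lets a test drive the timeout path without real delays.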

Comment thread src/benchmarks/claude-ui/render.ts
Base automatically changed from cam/feat/ui-automation-runtime-parity to main May 23, 2026 18:24
Comment thread src/benchmarks/claude-ui/__tests__/simulator-lifecycle.test.ts
Comment thread src/mcp/tools/ui-automation/__tests__/button.test.ts
Comment thread src/benchmarks/claude-ui/__tests__/simulator-lifecycle.test.ts Fixed
@cameroncooke cameroncooke marked this pull request as ready for review May 23, 2026 18:44
Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated
Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated
Contributor

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Temp simulator leak on setup
    • Added try-catch cleanup in prepareTemporarySimulator to delete the created simulator if post-creation setup steps fail.
  • ✅ Fixed: Unused exported env helper
    • Removed the unused sessionDefaultsEnv export from config.ts as it was dead code with no references.

Or push these changes by commenting:

@cursor push aed23b8a5b
Preview (aed23b8a5b)
diff --git a/CHANGELOG.md b/CHANGELOG.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -22,6 +22,7 @@
 
 ### Fixed
 
+- Fixed Claude UI benchmark temporary simulator cleanup so simulators created by the harness are deleted even when post-creation setup steps (boot, bootstatus, or Simulator.app open) fail.
 - Fixed Claude UI benchmark suite runs so temporary simulators are applied through an isolated per-run MCP config instead of being overridden by repo or example-project config defaults.
 - Fixed simulator launch failures before simulator-name resolution so they are not reported as macOS launch failures.
 - Fixed CLI JSON output so simulator-name resolution failures return the structured error envelope instead of plain stderr.

diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -186,16 +186,3 @@
   const raw = parseYaml(await readFile(suitePath, 'utf8')) as unknown;
   return readConfig(raw, suitePath);
 }
-
-export function sessionDefaultsEnv(
-  sessionDefaults: Record<string, unknown> | undefined,
-): Record<string, string> {
-  const validated = validateSessionDefaults(sessionDefaults);
-  if (!validated) return {};
-
-  const env: Record<string, string> = {};
-  for (const [key, value] of Object.entries(validated)) {
-    env[sessionDefaultEnvNames[key]] = String(value);
-  }
-  return env;
-}

diff --git a/src/benchmarks/claude-ui/simulator-lifecycle.ts b/src/benchmarks/claude-ui/simulator-lifecycle.ts
--- a/src/benchmarks/claude-ui/simulator-lifecycle.ts
+++ b/src/benchmarks/claude-ui/simulator-lifecycle.ts
@@ -278,65 +278,89 @@
     logPath: opts.logPath,
   } satisfies CreatedTemporarySimulator;
 
-  opts.onEvent?.(`booting simulator ${simulatorId}`);
-  const bootArgs = ['simctl', 'boot', simulatorId];
-  const bootResult = await executor({
-    command: 'xcrun',
-    args: bootArgs,
-    cwd: opts.cwd,
-    logPath: opts.logPath,
-  });
-  if (!isAlreadyBooted(bootResult)) {
-    throw new Error(
-      `${opts.config.name}: failed to boot temporary simulator with ${commandText('xcrun', bootArgs)} (exit ${bootResult.exitCode}); see ${opts.logPath}`,
-    );
-  }
-  if (bootResult.exitCode !== 0) {
+  try {
+    opts.onEvent?.(`booting simulator ${simulatorId}`);
+    const bootArgs = ['simctl', 'boot', simulatorId];
+    const bootResult = await executor({
+      command: 'xcrun',
+      args: bootArgs,
+      cwd: opts.cwd,
+      logPath: opts.logPath,
+    });
+    if (!isAlreadyBooted(bootResult)) {
+      throw new Error(
+        `${opts.config.name}: failed to boot temporary simulator with ${commandText('xcrun', bootArgs)} (exit ${bootResult.exitCode}); see ${opts.logPath}`,
+      );
+    }
+    if (bootResult.exitCode !== 0) {
+      await appendLifecycleLog(
+        opts.logPath,
+        'Boot command reported simulator was already booted; continuing',
+        logWriter,
+      );
+    }
+
+    opts.onEvent?.(`waiting for simulator ${simulatorId} bootstatus`);
+    const bootstatusArgs = ['simctl', 'bootstatus', simulatorId, '-b'];
+    const bootstatusResult = await executor({
+      command: 'xcrun',
+      args: bootstatusArgs,
+      cwd: opts.cwd,
+      logPath: opts.logPath,
+    });
+    if (bootstatusResult.exitCode !== 0) {
+      throw new Error(
+        `${opts.config.name}: temporary simulator did not reach bootstatus with ${commandText('xcrun', bootstatusArgs)} (exit ${bootstatusResult.exitCode}); see ${opts.logPath}`,
+      );
+    }
+
+    opts.onEvent?.(`opening Simulator.app for ${simulatorId}`);
+    const openArgs = ['-a', 'Simulator', '--args', '-CurrentDeviceUDID', simulatorId];
+    const openResult = await executor({
+      command: 'open',
+      args: openArgs,
+      cwd: opts.cwd,
+      logPath: opts.logPath,
+    });
+    if (openResult.exitCode !== 0) {
+      throw new Error(
+        `${opts.config.name}: failed to open Simulator.app with ${commandText('open', openArgs)} (exit ${openResult.exitCode}); see ${opts.logPath}`,
+      );
+    }
+
+    await waitForReadinessDelay({
+      logPath: opts.logPath,
+      milliseconds: opts.readinessDelayMs ?? 2_000,
+      onEvent: opts.onEvent,
+      logWriter,
+    });
+    await appendLifecycleLog(opts.logPath, `Temporary simulator ready: ${simulatorId}`, logWriter);
+    opts.onEvent?.(`simulator ready ${simulatorId}`);
+
+    return simulator;
+  } catch (error) {
     await appendLifecycleLog(
       opts.logPath,
-      'Boot command reported simulator was already booted; continuing',
+      `Setup failed, cleaning up simulator ${simulatorId}`,
       logWriter,
     );
+    try {
+      await executor({
+        command: 'xcrun',
+        args: ['simctl', 'delete', simulatorId],
+        cwd: opts.cwd,
+        logPath: opts.logPath,
+      });
+      await appendLifecycleLog(opts.logPath, `Deleted simulator ${simulatorId}`, logWriter);
+    } catch (deleteError) {
+      await appendLifecycleLog(
+        opts.logPath,
+        `Failed to delete simulator ${simulatorId}: ${deleteError instanceof Error ? deleteError.message : String(deleteError)}`,
+        logWriter,
+      );
+    }
+    throw error;
   }
-
-  opts.onEvent?.(`waiting for simulator ${simulatorId} bootstatus`);
-  const bootstatusArgs = ['simctl', 'bootstatus', simulatorId, '-b'];
-  const bootstatusResult = await executor({
-    command: 'xcrun',
-    args: bootstatusArgs,
-    cwd: opts.cwd,
-    logPath: opts.logPath,
-  });
-  if (bootstatusResult.exitCode !== 0) {
-    throw new Error(
-      `${opts.config.name}: temporary simulator did not reach bootstatus with ${commandText('xcrun', bootstatusArgs)} (exit ${bootstatusResult.exitCode}); see ${opts.logPath}`,
-    );
-  }
-
-  opts.onEvent?.(`opening Simulator.app for ${simulatorId}`);
-  const openArgs = ['-a', 'Simulator', '--args', '-CurrentDeviceUDID', simulatorId];
-  const openResult = await executor({
-    command: 'open',
-    args: openArgs,
-    cwd: opts.cwd,
-    logPath: opts.logPath,
-  });
-  if (openResult.exitCode !== 0) {
-    throw new Error(
-      `${opts.config.name}: failed to open Simulator.app with ${commandText('open', openArgs)} (exit ${openResult.exitCode}); see ${opts.logPath}`,
-    );
-  }
-
-  await waitForReadinessDelay({
-    logPath: opts.logPath,
-    milliseconds: opts.readinessDelayMs ?? 2_000,
-    onEvent: opts.onEvent,
-    logWriter,
-  });
-  await appendLifecycleLog(opts.logPath, `Temporary simulator ready: ${simulatorId}`, logWriter);
-  opts.onEvent?.(`simulator ready ${simulatorId}`);
-
-  return simulator;
 }
 
 export async function deleteTemporarySimulator(

Comment thread src/benchmarks/claude-ui/simulator-lifecycle.ts Outdated
Comment thread src/benchmarks/claude-ui/config.ts Outdated
@pkg-pr-new

pkg-pr-new Bot commented May 23, 2026

npm i https://pkg.pr.new/xcodebuildmcp@427

commit: 989ab76

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts

@github-actions github-actions Bot left a comment

Invalid failurePatterns regex crashes benchmark analysis after Claude has already run

A malformed regex string in failurePatterns (e.g. [unclosed) causes new RegExp(pattern, 'i') in createPatternMatchers to throw an uncaught SyntaxError, aborting analyzeClaudeJsonl mid-run and discarding all Claude output already collected. Wrap the RegExp constructor in a try-catch and surface the error as a parseError instead.

Evidence
  • createPatternMatchers in transcript.ts:114 calls new RegExp(pattern, 'i') with no try-catch.
  • patterns comes from config.failurePatterns loaded from YAML — any user-authored regex string is passed verbatim.
  • analyzeClaudeJsonl calls createPatternMatchers at line 126 before processing any transcript lines; a throw here propagates up through runSuite in harness.ts, skipping result.json write.
  • The benchmark has already spent wall-clock time running Claude when this occurs, so all timing and transcript data is lost.

Identified by Warden find-bugs
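
The suggested try-catch could look roughly like the sketch below, which compiles every pattern up front and collects readable errors instead of throwing. `compileFailurePatterns` and `PatternValidation` are hypothetical names; only the `new RegExp(pattern, 'i')` call is taken from the evidence above.

```typescript
// Hypothetical helper; only the new RegExp(pattern, 'i') call comes from the review.
interface PatternValidation {
  matchers: RegExp[];
  errors: string[];
}

function compileFailurePatterns(patterns: string[]): PatternValidation {
  const matchers: RegExp[] = [];
  const errors: string[] = [];

  patterns.forEach((pattern, index) => {
    try {
      matchers.push(new RegExp(pattern, 'i'));
    } catch (error) {
      // An invalid pattern throws SyntaxError; keep going so every bad entry is reported.
      const reason = error instanceof Error ? error.message : String(error);
      errors.push(`failurePatterns[${index}] is not a valid regex: ${reason}`);
    }
  });

  return { matchers, errors };
}
```

Running this while loading the suite config, and failing when `errors` is non-empty, would reject a bad pattern before any Claude run starts.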

Comment thread src/benchmarks/claude-ui/transcript.ts
Comment thread src/benchmarks/claude-ui/harness.ts
Comment thread src/benchmarks/claude-ui/compare.ts
@cameroncooke
Collaborator Author

Addressed the non-inline review summaries as well: snapshot settle timeout now returns a recoverable warning instead of unstable refs, invalid failurePatterns are validated before Claude runs, and the parser path is already repo-relative/configurable in the current branch. Fixed in afdf792.

Comment thread src/mcp/tools/ui-automation/shared/post-action-snapshot.ts Fixed
Contributor

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Preflight skipped without simulator ID
    • Added explicit error when firstRunPromptDismissals is configured but no simulator ID is available, failing loudly instead of silently skipping preflight.
  • ✅ Fixed: Command log omits suite simulator
    • Changed command log to use effectiveSimulatorId which includes both temporary and session default simulator IDs for accurate debugging.

Or push these changes by commenting:

@cursor push 4df7567169
Preview (4df7567169)
diff --git a/src/benchmarks/claude-ui/harness.ts b/src/benchmarks/claude-ui/harness.ts
--- a/src/benchmarks/claude-ui/harness.ts
+++ b/src/benchmarks/claude-ui/harness.ts
@@ -379,6 +379,11 @@
       (typeof config.sessionDefaults?.simulatorId === 'string'
         ? config.sessionDefaults.simulatorId
         : undefined);
+    if (config.firstRunPromptDismissals && !effectiveSimulatorId) {
+      throw new Error(
+        'firstRunPromptDismissals configured but no simulator ID available: set temporarySimulator: true or provide sessionDefaults.simulatorId',
+      );
+    }
     if (effectiveSimulatorId) {
       await dismissFirstRunPrompts({
         config,
@@ -422,7 +427,7 @@
     ];
     await writeFile(
       artifacts.claudeCommandLogPath,
-      `Run dir: ${runDirectory}\nCommand: claude ${claudeArgs.join(' ')} < ${artifacts.promptPath} > ${artifacts.claudeJsonlPath} 2> ${artifacts.claudeStderrPath}\nWorking directory: ${workingDirectory}\nMCP workspace: ${artifacts.mcpWorkspaceDirectory}\nMCP workspace config: ${artifacts.mcpWorkspaceConfigPath}\nSimulator lifecycle log: ${artifacts.simulatorLifecycleLogPath}\nSimulator ID: ${temporarySimulator?.simulatorId ?? 'suite/default'}\nStarted: ${new Date().toISOString()}\n`,
+      `Run dir: ${runDirectory}\nCommand: claude ${claudeArgs.join(' ')} < ${artifacts.promptPath} > ${artifacts.claudeJsonlPath} 2> ${artifacts.claudeStderrPath}\nWorking directory: ${workingDirectory}\nMCP workspace: ${artifacts.mcpWorkspaceDirectory}\nMCP workspace config: ${artifacts.mcpWorkspaceConfigPath}\nSimulator lifecycle log: ${artifacts.simulatorLifecycleLogPath}\nSimulator ID: ${effectiveSimulatorId ?? 'suite/default'}\nStarted: ${new Date().toISOString()}\n`,
       'utf8',
     );

Comment thread src/benchmarks/claude-ui/harness.ts
Comment thread src/benchmarks/claude-ui/harness.ts
Comment thread src/mcp/tools/ui-automation/shared/post-action-snapshot.ts Outdated
cameroncooke and others added 4 commits May 23, 2026 20:58

Add a local Claude UI benchmark harness for running deterministic app tasks
against the development MCP server. The harness creates temporary simulators,
uses isolated MCP config, records tool-call and timing metrics, and reports
sequence drift with readable terminal output.

Stabilize post-action UI snapshots so mutating UI actions return settled refs
before the next agent step. Add benchmark and UI automation tests covering the
new harness behavior and snapshot polling.

Co-Authored-By: Codex <noreply@openai.com>

Make the Claude UI benchmark parser path explicit instead of relying on
a local absolute path. Tighten first-run preflight completion handling,
render null process exit codes, and inject benchmark lifecycle logging
and post-action timing in tests.

Co-Authored-By: OpenAI Codex <noreply@openai.com>

Use the repo-local Claude UI transcript parser as the default so the
benchmark command works without machine-specific parser arguments.
Keep --parser and CLAUDE_UI_BENCHMARK_PARSER as explicit overrides for
local parser testing.

Co-Authored-By: OpenAI Codex <noreply@openai.com>

Start first-run prompt timeout after the target app finishes launching so
slow first launches do not consume the inspection window before AXe can
check for prompts.

Co-Authored-By: OpenAI Codex <noreply@openai.com>

Harden the Claude UI benchmark harness so preflight retries transient UI inspection failures, temporary simulators are cleaned up after partial setup failures, and suite config validation fails before expensive runs.

Also report unsettled post-action runtime snapshots as recoverable UI automation warnings instead of returning unstable refs.

Co-Authored-By: OpenAI Codex <codex@openai.com>
@cameroncooke cameroncooke force-pushed the cam/feat/claude-ui-benchmark-harness branch from afdf792 to 7c5e741 May 23, 2026 19:59
Comment thread src/benchmarks/claude-ui/harness.ts
Require benchmark preflight dismissals to have a concrete simulator ID and record suite-provided simulator IDs in command logs. Remove an unused post-action snapshot assignment reported by review tooling.

Co-Authored-By: OpenAI Codex <codex@openai.com>
Comment thread src/benchmarks/claude-ui/config.ts
Validate Claude UI benchmark sessionDefaults while reading the suite config so unknown keys and invalid value types fail before simulator setup starts.

Co-Authored-By: OpenAI Codex <codex@openai.com>
Contributor

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Preflight leaves app running on failure
    • Wrapped dismissal logic in try-finally block to ensure simctl terminate is always called after launch, even when errors are thrown.

Or push these changes by commenting:

@cursor push 203c16e988
Preview (203c16e988)
diff --git a/src/benchmarks/claude-ui/first-run-preflight.ts b/src/benchmarks/claude-ui/first-run-preflight.ts
--- a/src/benchmarks/claude-ui/first-run-preflight.ts
+++ b/src/benchmarks/claude-ui/first-run-preflight.ts
@@ -160,76 +160,79 @@
     );
   }
 
-  const deadline = timing.now() + timeoutMs;
-  let promptsDismissed = false;
-  let uiSeen = false;
-  while (timing.now() < deadline) {
-    const search = await findFirstRunPromptLabel({
-      simulatorId: opts.simulatorId,
-      labels: dismissals.labels,
-      cwd: opts.cwd,
-      logPath: opts.logPath,
-      executor,
-      axePath,
-      axeEnv,
-    });
+  try {
+    const deadline = timing.now() + timeoutMs;
+    let promptsDismissed = false;
+    let uiSeen = false;
+    while (timing.now() < deadline) {
+      const search = await findFirstRunPromptLabel({
+        simulatorId: opts.simulatorId,
+        labels: dismissals.labels,
+        cwd: opts.cwd,
+        logPath: opts.logPath,
+        executor,
+        axePath,
+        axeEnv,
+      });
 
-    if (search.status === 'unavailable') {
-      await appendLifecycleLog(
-        opts.logPath,
-        `First-run prompt preflight: UI unavailable; retrying (exit ${search.exitCode})`,
-      );
-      await timing.sleep(500);
-      continue;
-    }
+      if (search.status === 'unavailable') {
+        await appendLifecycleLog(
+          opts.logPath,
+          `First-run prompt preflight: UI unavailable; retrying (exit ${search.exitCode})`,
+        );
+        await timing.sleep(500);
+        continue;
+      }
 
-    if (search.status === 'not-found') {
-      if (search.hasElements) {
-        uiSeen = true;
+      if (search.status === 'not-found') {
+        if (search.hasElements) {
+          uiSeen = true;
+        }
+        if (uiSeen) {
+          promptsDismissed = true;
+          break;
+        }
+        await timing.sleep(500);
+        continue;
       }
-      if (uiSeen) {
-        promptsDismissed = true;
-        break;
+
+      uiSeen = true;
+      const { label } = search;
+      opts.onEvent?.(`dismissing first-run prompt '${label}'`);
+      await appendLifecycleLog(opts.logPath, `Dismissing first-run prompt label: ${label}`);
+      const tap = await executor({
+        command: axePath,
+        args: ['tap', '--label', label, '--element-type', 'Button', '--udid', opts.simulatorId],
+        cwd: opts.cwd,
+        logPath: opts.logPath,
+        env: axeEnv,
+      });
+      if (tap.exitCode !== 0) {
+        throw new Error(
+          `${opts.config.name}: failed to dismiss first-run prompt '${label}' (exit ${tap.exitCode}); see ${opts.logPath}`,
+        );
       }
       await timing.sleep(500);
-      continue;
     }
 
-    uiSeen = true;
-    const { label } = search;
-    opts.onEvent?.(`dismissing first-run prompt '${label}'`);
-    await appendLifecycleLog(opts.logPath, `Dismissing first-run prompt label: ${label}`);
-    const tap = await executor({
-      command: axePath,
-      args: ['tap', '--label', label, '--element-type', 'Button', '--udid', opts.simulatorId],
+    if (!promptsDismissed) {
+      throw new Error(
+        `${opts.config.name}: timed out during first-run prompt preflight; see ${opts.logPath}`,
+      );
+    }
+
+    await appendLifecycleLog(opts.logPath, 'First-run prompt preflight: complete');
+  } finally {
+    const terminate = await executor({
+      command: 'xcrun',
+      args: ['simctl', 'terminate', opts.simulatorId, bundleId],
       cwd: opts.cwd,
       logPath: opts.logPath,
-      env: axeEnv,
     });
-    if (tap.exitCode !== 0) {
+    if (terminate.exitCode !== 0) {
       throw new Error(
-        `${opts.config.name}: failed to dismiss first-run prompt '${label}' (exit ${tap.exitCode}); see ${opts.logPath}`,
+        `${opts.config.name}: failed to terminate app after first-run prompt preflight (exit ${terminate.exitCode}); see ${opts.logPath}`,
       );
     }
-    await timing.sleep(500);
   }
-
-  if (!promptsDismissed) {
-    throw new Error(
-      `${opts.config.name}: timed out during first-run prompt preflight; see ${opts.logPath}`,
-    );
-  }
-
-  const terminate = await executor({
-    command: 'xcrun',
-    args: ['simctl', 'terminate', opts.simulatorId, bundleId],
-    cwd: opts.cwd,
-    logPath: opts.logPath,
-  });
-  if (terminate.exitCode !== 0) {
-    throw new Error(
-      `${opts.config.name}: failed to terminate app after first-run prompt preflight (exit ${terminate.exitCode}); see ${opts.logPath}`,
-    );
-  }
-  await appendLifecycleLog(opts.logPath, 'First-run prompt preflight: complete');
 }

Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated
@cursor
Contributor

cursor Bot commented May 23, 2026

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Session defaults validate too late
    • Added validateSessionDefaults call in readConfig to fail fast on invalid keys or value types before simulator setup.

Or push these changes by commenting:

@cursor push f981ed6361
Preview (f981ed6361)
diff --git a/CHANGELOG.md b/CHANGELOG.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -22,6 +22,7 @@
 
 ### Fixed
 
+- Fixed Claude UI benchmark suite loading so `sessionDefaults` unknown keys and invalid value types fail fast at config load time instead of after expensive simulator setup.
 - Fixed Claude UI benchmark preflight so transient malformed or still-loading UI snapshots no longer crash the harness or finish before app UI is observable.
 - Fixed Claude UI benchmark preflight so configured first-run dismissals require a concrete simulator ID and suite-provided simulator IDs are recorded in command logs.
 - Fixed Claude UI benchmark config handling so invalid `failurePatterns` regexes fail before a suite starts and partial `allowedVariance` overrides preserve defaults for omitted metrics.

diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -185,7 +185,9 @@
     ),
   };
 
-  if (isRecord(raw.sessionDefaults)) config.sessionDefaults = raw.sessionDefaults;
+  if (isRecord(raw.sessionDefaults)) {
+    config.sessionDefaults = validateSessionDefaults(raw.sessionDefaults);
+  }
   config.allowedVariance = readAllowedVariance(raw.allowedVariance, `${source}.allowedVariance`);
 
   if (raw.baseline !== undefined) {

Comment thread src/mcp/tools/ui-automation/shared/post-action-snapshot.ts
Comment thread src/benchmarks/claude-ui/__tests__/first-run-preflight.test.ts
Comment thread src/benchmarks/claude-ui/compare.ts
Comment thread src/benchmarks/claude-ui/first-run-preflight.ts Outdated
Comment thread src/benchmarks/claude-ui/render.ts
Comment thread src/benchmarks/claude-ui/__tests__/simulator-lifecycle.test.ts
cameroncooke and others added 2 commits May 23, 2026 21:44

Validate benchmark session defaults with the same schema used by runtime
session defaults and terminate preflight-launched apps after post-launch failures.

Co-Authored-By: Codex <codex@openai.com>

Retry transient runtime snapshot parse failures during post-action refresh
and avoid ending first-run preflight during empty transition snapshots.

Co-Authored-By: Codex <codex@openai.com>
Contributor

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Empty --all reports success
    • Added check to throw error when listSuitePaths returns empty array, preventing vacuous success from empty results.
  • ✅ Fixed: Parser always exits zero
    • Modified parse function to track JSON decode errors and return non-zero exit code when malformed lines are encountered.

Or push these changes by commenting:

@cursor push 3e2d05ccb6
Preview (3e2d05ccb6)
diff --git a/benchmarks/claude-ui/parse_claude_conversation.py b/benchmarks/claude-ui/parse_claude_conversation.py
--- a/benchmarks/claude-ui/parse_claude_conversation.py
+++ b/benchmarks/claude-ui/parse_claude_conversation.py
@@ -101,13 +101,14 @@
     return "\n\n".join(parts)
 
 
-def parse(path: Path, out_dir: Path, tool_prefix: str) -> None:
+def parse(path: Path, out_dir: Path, tool_prefix: str) -> bool:
     out_dir.mkdir(parents=True, exist_ok=True)
 
     # Track tool_use_ids that target our prefix so we keep matching results.
     tracked_ids: set[str] = set()
     tool_name_by_id: dict[str, str] = {}
     counter = 0
+    had_errors = False
 
     def next_path(kind: str, label: str | None = None) -> Path:
         nonlocal counter
@@ -124,6 +125,7 @@
                 entry = json.loads(raw)
             except json.JSONDecodeError as exc:
                 print(f"warn: skipping line {line_no}: {exc}", file=sys.stderr)
+                had_errors = True
                 continue
 
             etype = entry.get("type")
@@ -204,6 +206,7 @@
             # last-prompt, file-history-snapshot, thinking blocks) is dropped.
 
     print(f"Wrote {counter} files to {out_dir}")
+    return not had_errors
 
 
 def main() -> int:
@@ -227,8 +230,8 @@
         return 1
 
     out = args.output or args.jsonl.with_name(f"{args.jsonl.stem}_conversation")
-    parse(args.jsonl, out, args.tool_prefix)
-    return 0
+    success = parse(args.jsonl, out, args.tool_prefix)
+    return 0 if success else 1
 
 
 if __name__ == "__main__":

diff --git a/src/benchmarks/claude-ui/harness.ts b/src/benchmarks/claude-ui/harness.ts
--- a/src/benchmarks/claude-ui/harness.ts
+++ b/src/benchmarks/claude-ui/harness.ts
@@ -565,6 +565,9 @@
   }
 
   const suitePaths = args.all ? await listSuitePaths() : [resolveSuitePath(args.suite as string)];
+  if (suitePaths.length === 0) {
+    throw new Error('no suite files found in benchmarks/claude-ui/suites');
+  }
   const progress = createProgressReporter({ enabled: !args.json });
   const results: BenchmarkResult[] = [];
   for (let index = 0; index < suitePaths.length; index += 1) {

Comment thread src/benchmarks/claude-ui/harness.ts
Comment thread benchmarks/claude-ui/parse_claude_conversation.py Outdated
Comment thread src/benchmarks/claude-ui/first-run-preflight.ts
Fail --all when suite discovery returns no suite files so a missing or misconfigured benchmark suite directory cannot look like a successful full run.

Return a non-zero parser exit when malformed JSONL lines are skipped so broken transcript parses fail the benchmark instead of producing partial-success artifacts.

Co-Authored-By: Codex <codex@openai.com>
Comment thread src/benchmarks/claude-ui/first-run-preflight.ts
Separate first-run preflight terminate executor errors from non-zero exit handling so each failure path is handled once.

Track preflight completion with a success flag instead of catching and rethrowing only to drive cleanup suppression.

Co-Authored-By: Codex <codex@openai.com>
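
The success-flag approach described in this commit message might be sketched as follows. This is an illustration only: `runWithCleanup` is a hypothetical helper, and the assumed policy (a cleanup failure is surfaced only when the main work succeeded, otherwise it is logged so the original error is preserved) is a guess at the intent, not the code in first-run-preflight.ts.

```typescript
// Hypothetical helper illustrating a success flag with try/finally.
async function runWithCleanup<T>(
  work: () => Promise<T>,
  cleanup: () => Promise<void>,
  log: (message: string) => void,
): Promise<T> {
  let succeeded = false;
  try {
    const result = await work();
    succeeded = true;
    return result;
  } finally {
    try {
      await cleanup();
    } catch (cleanupError) {
      // Assumed policy: a cleanup failure only fails the run if the work itself succeeded.
      if (succeeded) throw cleanupError;
      const reason = cleanupError instanceof Error ? cleanupError.message : String(cleanupError);
      log(`cleanup failed after an earlier error: ${reason}`);
    }
  }
}
```

The flag replaces a catch-and-rethrow block: the original error propagates untouched through `finally`, and each failure path is reported once.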
Contributor

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Duplicate stumble count parse errors
    • Modified failureCount calculation to skip adding parserExitCode when audit.parseErrors already captured the same parser failures, preventing double-counting.

Or push these changes by commenting:

@cursor push 827aa9507c
Preview (827aa9507c)
diff --git a/src/benchmarks/claude-ui/compare.ts b/src/benchmarks/claude-ui/compare.ts
--- a/src/benchmarks/claude-ui/compare.ts
+++ b/src/benchmarks/claude-ui/compare.ts
@@ -183,7 +183,7 @@
     audit.failures.length +
     audit.patternFailures.length +
     (run.claudeExitCode === 0 ? 0 : 1) +
-    (run.parserExitCode === 0 ? 0 : 1);
+    (run.parserExitCode === 0 || audit.parseErrors.length > 0 ? 0 : 1);
   const sequenceMode = config.sequence?.mode ?? 'warn';
   const sequenceMatched =
     expected.length === 0 || (missing.length === 0 && additional.length === 0);

Reviewed by Cursor Bugbot for commit 34a7e53.

Comment thread src/benchmarks/claude-ui/compare.ts Outdated
Comment thread src/benchmarks/claude-ui/first-run-preflight.ts
Comment thread src/benchmarks/claude-ui/first-run-preflight.ts

@github-actions github-actions Bot left a comment

Benchmark test spawns real python3 subprocess, bypassing executor safety overrides

The runParserScript helper calls spawn('python3', args) directly from node:child_process, making npm test dependent on Python 3 being installed and bypassing the vitest-executor-safety.setup.ts framework-executor overrides; wrap the parser invocation in an injectable function so tests can stub it.

Evidence
  • vitest.config.ts includes src/**/__tests__/**/*.test.ts, so src/benchmarks/claude-ui/__tests__/claude-ui-benchmark.test.ts runs under npm test.
  • vitest-executor-safety.setup.ts overrides only __setTestCommandExecutorOverride / __setTestInteractiveSpawnerOverride (framework interfaces); raw node:child_process spawn is unguarded.
  • runParserScript resolves repoRoot and passes the real parse_claude_conversation.py path, hitting the real filesystem and a real Python 3 process.
  • The 'returns a non-zero parser exit when JSONL lines are malformed' test calls runParserScript with a temp-dir JSONL file, meaning CI must have python3 and the script on disk to pass.

Identified by Warden xcodebuildmcp-test-boundary-review
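
A sketch of the injectable wrapper suggested above follows. It assumes the harness only needs the parser's exit code and stderr; `ParserRunner`, `ParserRequest`, and `runParserStep` are hypothetical names, and the production runner (which would spawn python3) is deliberately left out.

```typescript
// Hypothetical seam: production code would supply a python3-backed runner,
// while tests supply a stub, so no real subprocess is spawned.
interface ParserRequest {
  jsonlPath: string;
  outputDir: string;
}

interface ParserResult {
  exitCode: number | null; // null if the process ended without an exit code
  stderr: string;
}

type ParserRunner = (request: ParserRequest) => Promise<ParserResult>;

async function runParserStep(
  request: ParserRequest,
  runParser: ParserRunner,
): Promise<{ ok: boolean; detail: string }> {
  const result = await runParser(request);
  if (result.exitCode === 0) {
    return { ok: true, detail: 'parser succeeded' };
  }
  const code = result.exitCode === null ? 'no exit code' : `exit ${result.exitCode}`;
  return { ok: false, detail: `parser failed (${code}): ${result.stderr.trim()}` };
}
```

A test for the malformed-JSONL case could then pass a stub that resolves to `{ exitCode: 1, stderr: 'warn: skipping line 3' }` and assert on the returned `detail`, with no dependency on python3 being installed.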

Comment thread src/benchmarks/claude-ui/__tests__/claude-ui-benchmark.test.ts
2 participants