feat(benchmarks): Add Claude UI benchmark harness #427
Parser path hardcoded to author's local filesystem
The harness will fail for any developer or CI environment that doesn't have /Volumes/Developer/parse_claude_conversation.py; this path needs to be configurable or relative to the repo.
Evidence
- `harness.ts` line 34 declares `const parserPath = '/Volumes/Developer/parse_claude_conversation.py'` as a module-level constant.
- `runParser` at line 214 passes `parserPath` directly to `runCommand` as the script argument for `python3`.
- No environment variable, config option, or fallback exists; the path is hardcoded unconditionally.
- The path begins with `/Volumes/Developer/`, a macOS external-volume prefix unique to the author's machine.
Identified by Warden find-bugs
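One possible shape for the fix, as a minimal sketch rather than the code that landed: resolve the parser path from an explicit override first and fall back to a repo-relative default. The `--parser` flag and `CLAUDE_UI_BENCHMARK_PARSER` variable are the overrides named in later commits on this PR; `resolveParserPath`, its input shape, and the precedence order (flag before environment variable) are assumptions made for illustration.

```typescript
// Hypothetical input shape; the harness's real argument parsing is not shown in this thread.
interface ParserPathInputs {
  cliParser?: string; // value passed to --parser, if any
  env: Record<string, string | undefined>; // e.g. process.env
  repoRoot: string; // absolute path of the repository checkout
}

interface ResolvedParserPath {
  path: string;
  source: "cli" | "env" | "repo-default";
}

// Repo-relative location of the parser as it appears in the diffs on this PR.
const DEFAULT_PARSER_RELATIVE_PATH = "benchmarks/claude-ui/parse_claude_conversation.py";

function resolveParserPath(inputs: ParserPathInputs): ResolvedParserPath {
  // 1. An explicit CLI override wins (assumed precedence).
  const fromCli = inputs.cliParser?.trim();
  if (fromCli) return { path: fromCli, source: "cli" };

  // 2. Then the environment variable.
  const fromEnv = inputs.env["CLAUDE_UI_BENCHMARK_PARSER"]?.trim();
  if (fromEnv) return { path: fromEnv, source: "env" };

  // 3. Otherwise fall back to the parser checked into the repo.
  const root = inputs.repoRoot.replace(/\/+$/, "");
  return { path: `${root}/${DEFAULT_PARSER_RELATIVE_PATH}`, source: "repo-default" };
}
```

Returning the `source` alongside the path makes it cheap to log which override, if any, was in effect for a given run.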
93b21e7 to 08346c6
3cde457 to fe03f96
Snapshot-settle timeout returns unsettled snapshot without warning, silently breaking the stable-refs guarantee
When captureRuntimeSnapshotAfterAction in post-action-snapshot.ts exceeds its 1500 ms deadline without the UI settling, it records latestSnapshot and returns latestSnapshot.payload as a normal success. captureRuntimeSnapshotAfterActionSafely wraps that value in { capture } with no warning or uiError, so every UI-action tool (tap, swipe, gesture, etc.) hands the agent a potentially mid-animation snapshot as if it were fully stable. This silently breaks the PR's stated guarantee that the next agent step receives stable refs.
Evidence
- `captureRuntimeSnapshotAfterAction` (`post-action-snapshot.ts` lines 90–93): `if (remainingMs <= 0) { recordRuntimeSnapshot(latestSnapshot); return latestSnapshot.payload; }` gives no indication of unsettled state.
- `captureRuntimeSnapshotAfterActionSafely` only branches on exceptions; the timeout return path is indistinguishable from a fully settled return, so `{ capture: await captureRuntimeSnapshotAfterAction(params) }` is emitted.
- Callers such as `tap.ts` line 161, `swipe.ts` line 226, `gesture.ts` line 171 (and 7 more tools) check `captureResult.warning` / `captureResult.uiError` to surface problems; neither is set on timeout.
- The existing test (`post-action-snapshot.test.ts`) only covers the settled path and does not test timeout behaviour, leaving the gap undetected.
Identified by Warden find-bugs
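To make the timeout path distinguishable from the settled path, the capture function could return a tagged result instead of a bare payload. The sketch below is illustrative only: `captureSettledSnapshot`, the `fingerprint` field, and the "two identical consecutive snapshots means settled" rule are assumptions, not the signatures or logic in `post-action-snapshot.ts`. The clock and sleep are injected so the timeout branch can be tested without real delays.

```typescript
// Illustrative types; not the real snapshot shape.
interface Snapshot {
  fingerprint: string; // assumed stable hash of the UI tree
  payload: unknown;
}

type SettleResult =
  | { status: "settled"; payload: unknown }
  | { status: "unsettled"; warning: string };

interface SettleOptions {
  capture: () => Promise<Snapshot>;
  now: () => number;
  sleep: (ms: number) => Promise<void>;
  timeoutMs: number;
  pollIntervalMs: number;
}

async function captureSettledSnapshot(opts: SettleOptions): Promise<SettleResult> {
  const deadline = opts.now() + opts.timeoutMs;
  let previous = await opts.capture();
  while (opts.now() < deadline) {
    await opts.sleep(opts.pollIntervalMs);
    const current = await opts.capture();
    // Assumed settle rule: two consecutive identical snapshots.
    if (current.fingerprint === previous.fingerprint) {
      return { status: "settled", payload: current.payload };
    }
    previous = current;
  }
  // Deadline passed: report it instead of returning refs that may be stale.
  return {
    status: "unsettled",
    warning: `UI did not settle within ${opts.timeoutMs} ms; take a fresh snapshot before using element refs`,
  };
}
```

With a result like this, a caller can map `unsettled` onto the existing `warning` field that the tools already check, so no tool needs a new code path.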
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
Autofix Details
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Temp simulator leak on setup
- Added try-catch cleanup in prepareTemporarySimulator to delete the created simulator if post-creation setup steps fail.
- ✅ Fixed: Unused exported env helper
- Removed the unused sessionDefaultsEnv export from config.ts as it was dead code with no references.
Or push these changes by commenting:
@cursor push aed23b8a5b
Preview (aed23b8a5b)
diff --git a/CHANGELOG.md b/CHANGELOG.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -22,6 +22,7 @@
### Fixed
+- Fixed Claude UI benchmark temporary simulator cleanup so simulators created by the harness are deleted even when post-creation setup steps (boot, bootstatus, or Simulator.app open) fail.
- Fixed Claude UI benchmark suite runs so temporary simulators are applied through an isolated per-run MCP config instead of being overridden by repo or example-project config defaults.
- Fixed simulator launch failures before simulator-name resolution so they are not reported as macOS launch failures.
- Fixed CLI JSON output so simulator-name resolution failures return the structured error envelope instead of plain stderr.
diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -186,16 +186,3 @@
const raw = parseYaml(await readFile(suitePath, 'utf8')) as unknown;
return readConfig(raw, suitePath);
}
-
-export function sessionDefaultsEnv(
- sessionDefaults: Record<string, unknown> | undefined,
-): Record<string, string> {
- const validated = validateSessionDefaults(sessionDefaults);
- if (!validated) return {};
-
- const env: Record<string, string> = {};
- for (const [key, value] of Object.entries(validated)) {
- env[sessionDefaultEnvNames[key]] = String(value);
- }
- return env;
-}
diff --git a/src/benchmarks/claude-ui/simulator-lifecycle.ts b/src/benchmarks/claude-ui/simulator-lifecycle.ts
--- a/src/benchmarks/claude-ui/simulator-lifecycle.ts
+++ b/src/benchmarks/claude-ui/simulator-lifecycle.ts
@@ -278,65 +278,89 @@
logPath: opts.logPath,
} satisfies CreatedTemporarySimulator;
- opts.onEvent?.(`booting simulator ${simulatorId}`);
- const bootArgs = ['simctl', 'boot', simulatorId];
- const bootResult = await executor({
- command: 'xcrun',
- args: bootArgs,
- cwd: opts.cwd,
- logPath: opts.logPath,
- });
- if (!isAlreadyBooted(bootResult)) {
- throw new Error(
- `${opts.config.name}: failed to boot temporary simulator with ${commandText('xcrun', bootArgs)} (exit ${bootResult.exitCode}); see ${opts.logPath}`,
- );
- }
- if (bootResult.exitCode !== 0) {
+ try {
+ opts.onEvent?.(`booting simulator ${simulatorId}`);
+ const bootArgs = ['simctl', 'boot', simulatorId];
+ const bootResult = await executor({
+ command: 'xcrun',
+ args: bootArgs,
+ cwd: opts.cwd,
+ logPath: opts.logPath,
+ });
+ if (!isAlreadyBooted(bootResult)) {
+ throw new Error(
+ `${opts.config.name}: failed to boot temporary simulator with ${commandText('xcrun', bootArgs)} (exit ${bootResult.exitCode}); see ${opts.logPath}`,
+ );
+ }
+ if (bootResult.exitCode !== 0) {
+ await appendLifecycleLog(
+ opts.logPath,
+ 'Boot command reported simulator was already booted; continuing',
+ logWriter,
+ );
+ }
+
+ opts.onEvent?.(`waiting for simulator ${simulatorId} bootstatus`);
+ const bootstatusArgs = ['simctl', 'bootstatus', simulatorId, '-b'];
+ const bootstatusResult = await executor({
+ command: 'xcrun',
+ args: bootstatusArgs,
+ cwd: opts.cwd,
+ logPath: opts.logPath,
+ });
+ if (bootstatusResult.exitCode !== 0) {
+ throw new Error(
+ `${opts.config.name}: temporary simulator did not reach bootstatus with ${commandText('xcrun', bootstatusArgs)} (exit ${bootstatusResult.exitCode}); see ${opts.logPath}`,
+ );
+ }
+
+ opts.onEvent?.(`opening Simulator.app for ${simulatorId}`);
+ const openArgs = ['-a', 'Simulator', '--args', '-CurrentDeviceUDID', simulatorId];
+ const openResult = await executor({
+ command: 'open',
+ args: openArgs,
+ cwd: opts.cwd,
+ logPath: opts.logPath,
+ });
+ if (openResult.exitCode !== 0) {
+ throw new Error(
+ `${opts.config.name}: failed to open Simulator.app with ${commandText('open', openArgs)} (exit ${openResult.exitCode}); see ${opts.logPath}`,
+ );
+ }
+
+ await waitForReadinessDelay({
+ logPath: opts.logPath,
+ milliseconds: opts.readinessDelayMs ?? 2_000,
+ onEvent: opts.onEvent,
+ logWriter,
+ });
+ await appendLifecycleLog(opts.logPath, `Temporary simulator ready: ${simulatorId}`, logWriter);
+ opts.onEvent?.(`simulator ready ${simulatorId}`);
+
+ return simulator;
+ } catch (error) {
await appendLifecycleLog(
opts.logPath,
- 'Boot command reported simulator was already booted; continuing',
+ `Setup failed, cleaning up simulator ${simulatorId}`,
logWriter,
);
+ try {
+ await executor({
+ command: 'xcrun',
+ args: ['simctl', 'delete', simulatorId],
+ cwd: opts.cwd,
+ logPath: opts.logPath,
+ });
+ await appendLifecycleLog(opts.logPath, `Deleted simulator ${simulatorId}`, logWriter);
+ } catch (deleteError) {
+ await appendLifecycleLog(
+ opts.logPath,
+ `Failed to delete simulator ${simulatorId}: ${deleteError instanceof Error ? deleteError.message : String(deleteError)}`,
+ logWriter,
+ );
+ }
+ throw error;
}
-
- opts.onEvent?.(`waiting for simulator ${simulatorId} bootstatus`);
- const bootstatusArgs = ['simctl', 'bootstatus', simulatorId, '-b'];
- const bootstatusResult = await executor({
- command: 'xcrun',
- args: bootstatusArgs,
- cwd: opts.cwd,
- logPath: opts.logPath,
- });
- if (bootstatusResult.exitCode !== 0) {
- throw new Error(
- `${opts.config.name}: temporary simulator did not reach bootstatus with ${commandText('xcrun', bootstatusArgs)} (exit ${bootstatusResult.exitCode}); see ${opts.logPath}`,
- );
- }
-
- opts.onEvent?.(`opening Simulator.app for ${simulatorId}`);
- const openArgs = ['-a', 'Simulator', '--args', '-CurrentDeviceUDID', simulatorId];
- const openResult = await executor({
- command: 'open',
- args: openArgs,
- cwd: opts.cwd,
- logPath: opts.logPath,
- });
- if (openResult.exitCode !== 0) {
- throw new Error(
- `${opts.config.name}: failed to open Simulator.app with ${commandText('open', openArgs)} (exit ${openResult.exitCode}); see ${opts.logPath}`,
- );
- }
-
- await waitForReadinessDelay({
- logPath: opts.logPath,
- milliseconds: opts.readinessDelayMs ?? 2_000,
- onEvent: opts.onEvent,
- logWriter,
- });
- await appendLifecycleLog(opts.logPath, `Temporary simulator ready: ${simulatorId}`, logWriter);
- opts.onEvent?.(`simulator ready ${simulatorId}`);
-
- return simulator;
}
export async function deleteTemporarySimulator(
Invalid failurePatterns regex crashes benchmark analysis after Claude has already run
A malformed regex string in failurePatterns (e.g. [unclosed) causes new RegExp(pattern, 'i') in createPatternMatchers to throw an uncaught SyntaxError, aborting analyzeClaudeJsonl mid-run and discarding all Claude output already collected. Wrap the RegExp constructor in a try-catch and surface the error as a parseError instead.
Evidence
- `createPatternMatchers` in `transcript.ts:114` calls `new RegExp(pattern, 'i')` with no try-catch.
- `patterns` comes from `config.failurePatterns` loaded from YAML; any user-authored regex string is passed verbatim.
- `analyzeClaudeJsonl` calls `createPatternMatchers` at line 126 before processing any transcript lines; a throw here propagates up through `runSuite` in `harness.ts`, skipping the `result.json` write.
- The benchmark has already spent wall-clock time running Claude when this occurs, so all timing and transcript data is lost.
Identified by Warden find-bugs
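A minimal sketch of the suggested guard, assuming the patterns can be compiled once at config-load time: each `new RegExp(pattern, 'i')` call is wrapped individually so one bad pattern is reported with its index rather than aborting the run. `compileFailurePatterns` and the error-message format are hypothetical names for illustration, not the harness's actual API.

```typescript
interface PatternValidation {
  matchers: RegExp[]; // successfully compiled patterns
  errors: string[]; // one human-readable message per invalid pattern
}

function compileFailurePatterns(patterns: readonly string[], source: string): PatternValidation {
  const matchers: RegExp[] = [];
  const errors: string[] = [];
  patterns.forEach((pattern, index) => {
    try {
      // Same flags as the call quoted in the evidence above.
      matchers.push(new RegExp(pattern, "i"));
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      errors.push(
        `${source}.failurePatterns[${index}]: invalid regex ${JSON.stringify(pattern)}: ${reason}`,
      );
    }
  });
  return { matchers, errors };
}
```

A caller could throw when `errors` is non-empty while loading the suite, which fails before Claude is launched, or attach the messages to the analysis result as parse errors when the transcript has already been collected.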
Addressed the non-inline review summaries as well: snapshot settle timeout now returns a recoverable warning instead of unstable refs, invalid failurePatterns are validated before Claude runs, and the parser path is already repo-relative/configurable in the current branch. Fixed in afdf792.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Autofix Details
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Preflight skipped without simulator ID
- Added explicit error when firstRunPromptDismissals is configured but no simulator ID is available, failing loudly instead of silently skipping preflight.
- ✅ Fixed: Command log omits suite simulator
- Changed command log to use effectiveSimulatorId which includes both temporary and session default simulator IDs for accurate debugging.
Or push these changes by commenting:
@cursor push 4df7567169
Preview (4df7567169)
diff --git a/src/benchmarks/claude-ui/harness.ts b/src/benchmarks/claude-ui/harness.ts
--- a/src/benchmarks/claude-ui/harness.ts
+++ b/src/benchmarks/claude-ui/harness.ts
@@ -379,6 +379,11 @@
(typeof config.sessionDefaults?.simulatorId === 'string'
? config.sessionDefaults.simulatorId
: undefined);
+ if (config.firstRunPromptDismissals && !effectiveSimulatorId) {
+ throw new Error(
+ 'firstRunPromptDismissals configured but no simulator ID available: set temporarySimulator: true or provide sessionDefaults.simulatorId',
+ );
+ }
if (effectiveSimulatorId) {
await dismissFirstRunPrompts({
config,
@@ -422,7 +427,7 @@
];
await writeFile(
artifacts.claudeCommandLogPath,
- `Run dir: ${runDirectory}\nCommand: claude ${claudeArgs.join(' ')} < ${artifacts.promptPath} > ${artifacts.claudeJsonlPath} 2> ${artifacts.claudeStderrPath}\nWorking directory: ${workingDirectory}\nMCP workspace: ${artifacts.mcpWorkspaceDirectory}\nMCP workspace config: ${artifacts.mcpWorkspaceConfigPath}\nSimulator lifecycle log: ${artifacts.simulatorLifecycleLogPath}\nSimulator ID: ${temporarySimulator?.simulatorId ?? 'suite/default'}\nStarted: ${new Date().toISOString()}\n`,
+ `Run dir: ${runDirectory}\nCommand: claude ${claudeArgs.join(' ')} < ${artifacts.promptPath} > ${artifacts.claudeJsonlPath} 2> ${artifacts.claudeStderrPath}\nWorking directory: ${workingDirectory}\nMCP workspace: ${artifacts.mcpWorkspaceDirectory}\nMCP workspace config: ${artifacts.mcpWorkspaceConfigPath}\nSimulator lifecycle log: ${artifacts.simulatorLifecycleLogPath}\nSimulator ID: ${effectiveSimulatorId ?? 'suite/default'}\nStarted: ${new Date().toISOString()}\n`,
'utf8',
);
Add a local Claude UI benchmark harness for running deterministic app tasks against the development MCP server. The harness creates temporary simulators, uses isolated MCP config, records tool-call and timing metrics, and reports sequence drift with readable terminal output. Stabilize post-action UI snapshots so mutating UI actions return settled refs before the next agent step. Add benchmark and UI automation tests covering the new harness behavior and snapshot polling.
Co-Authored-By: Codex <noreply@openai.com>
Make the Claude UI benchmark parser path explicit instead of relying on a local absolute path. Tighten first-run preflight completion handling, render null process exit codes, and inject benchmark lifecycle logging and post-action timing in tests.
Co-Authored-By: OpenAI Codex <noreply@openai.com>
Use the repo-local Claude UI transcript parser as the default so the benchmark command works without machine-specific parser arguments. Keep --parser and CLAUDE_UI_BENCHMARK_PARSER as explicit overrides for local parser testing.
Co-Authored-By: OpenAI Codex <noreply@openai.com>
Start first-run prompt timeout after the target app finishes launching so slow first launches do not consume the inspection window before AXe can check for prompts.
Co-Authored-By: OpenAI Codex <noreply@openai.com>
Harden the Claude UI benchmark harness so preflight retries transient UI inspection failures, temporary simulators are cleaned up after partial setup failures, and suite config validation fails before expensive runs. Also report unsettled post-action runtime snapshots as recoverable UI automation warnings instead of returning unstable refs.
Co-Authored-By: OpenAI Codex <codex@openai.com>
afdf792 to 7c5e741
Require benchmark preflight dismissals to have a concrete simulator ID and record suite-provided simulator IDs in command logs. Remove an unused post-action snapshot assignment reported by review tooling.
Co-Authored-By: OpenAI Codex <codex@openai.com>
Validate Claude UI benchmark sessionDefaults while reading the suite config so unknown keys and invalid value types fail before simulator setup starts.
Co-Authored-By: OpenAI Codex <codex@openai.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Preflight leaves app running on failure
- Wrapped dismissal logic in try-finally block to ensure simctl terminate is always called after launch, even when errors are thrown.
Or push these changes by commenting:
@cursor push 203c16e988
Preview (203c16e988)
diff --git a/src/benchmarks/claude-ui/first-run-preflight.ts b/src/benchmarks/claude-ui/first-run-preflight.ts
--- a/src/benchmarks/claude-ui/first-run-preflight.ts
+++ b/src/benchmarks/claude-ui/first-run-preflight.ts
@@ -160,76 +160,79 @@
);
}
- const deadline = timing.now() + timeoutMs;
- let promptsDismissed = false;
- let uiSeen = false;
- while (timing.now() < deadline) {
- const search = await findFirstRunPromptLabel({
- simulatorId: opts.simulatorId,
- labels: dismissals.labels,
- cwd: opts.cwd,
- logPath: opts.logPath,
- executor,
- axePath,
- axeEnv,
- });
+ try {
+ const deadline = timing.now() + timeoutMs;
+ let promptsDismissed = false;
+ let uiSeen = false;
+ while (timing.now() < deadline) {
+ const search = await findFirstRunPromptLabel({
+ simulatorId: opts.simulatorId,
+ labels: dismissals.labels,
+ cwd: opts.cwd,
+ logPath: opts.logPath,
+ executor,
+ axePath,
+ axeEnv,
+ });
- if (search.status === 'unavailable') {
- await appendLifecycleLog(
- opts.logPath,
- `First-run prompt preflight: UI unavailable; retrying (exit ${search.exitCode})`,
- );
- await timing.sleep(500);
- continue;
- }
+ if (search.status === 'unavailable') {
+ await appendLifecycleLog(
+ opts.logPath,
+ `First-run prompt preflight: UI unavailable; retrying (exit ${search.exitCode})`,
+ );
+ await timing.sleep(500);
+ continue;
+ }
- if (search.status === 'not-found') {
- if (search.hasElements) {
- uiSeen = true;
+ if (search.status === 'not-found') {
+ if (search.hasElements) {
+ uiSeen = true;
+ }
+ if (uiSeen) {
+ promptsDismissed = true;
+ break;
+ }
+ await timing.sleep(500);
+ continue;
}
- if (uiSeen) {
- promptsDismissed = true;
- break;
+
+ uiSeen = true;
+ const { label } = search;
+ opts.onEvent?.(`dismissing first-run prompt '${label}'`);
+ await appendLifecycleLog(opts.logPath, `Dismissing first-run prompt label: ${label}`);
+ const tap = await executor({
+ command: axePath,
+ args: ['tap', '--label', label, '--element-type', 'Button', '--udid', opts.simulatorId],
+ cwd: opts.cwd,
+ logPath: opts.logPath,
+ env: axeEnv,
+ });
+ if (tap.exitCode !== 0) {
+ throw new Error(
+ `${opts.config.name}: failed to dismiss first-run prompt '${label}' (exit ${tap.exitCode}); see ${opts.logPath}`,
+ );
}
await timing.sleep(500);
- continue;
}
- uiSeen = true;
- const { label } = search;
- opts.onEvent?.(`dismissing first-run prompt '${label}'`);
- await appendLifecycleLog(opts.logPath, `Dismissing first-run prompt label: ${label}`);
- const tap = await executor({
- command: axePath,
- args: ['tap', '--label', label, '--element-type', 'Button', '--udid', opts.simulatorId],
+ if (!promptsDismissed) {
+ throw new Error(
+ `${opts.config.name}: timed out during first-run prompt preflight; see ${opts.logPath}`,
+ );
+ }
+
+ await appendLifecycleLog(opts.logPath, 'First-run prompt preflight: complete');
+ } finally {
+ const terminate = await executor({
+ command: 'xcrun',
+ args: ['simctl', 'terminate', opts.simulatorId, bundleId],
cwd: opts.cwd,
logPath: opts.logPath,
- env: axeEnv,
});
- if (tap.exitCode !== 0) {
+ if (terminate.exitCode !== 0) {
throw new Error(
- `${opts.config.name}: failed to dismiss first-run prompt '${label}' (exit ${tap.exitCode}); see ${opts.logPath}`,
+ `${opts.config.name}: failed to terminate app after first-run prompt preflight (exit ${terminate.exitCode}); see ${opts.logPath}`,
);
}
- await timing.sleep(500);
}
-
- if (!promptsDismissed) {
- throw new Error(
- `${opts.config.name}: timed out during first-run prompt preflight; see ${opts.logPath}`,
- );
- }
-
- const terminate = await executor({
- command: 'xcrun',
- args: ['simctl', 'terminate', opts.simulatorId, bundleId],
- cwd: opts.cwd,
- logPath: opts.logPath,
- });
- if (terminate.exitCode !== 0) {
- throw new Error(
- `${opts.config.name}: failed to terminate app after first-run prompt preflight (exit ${terminate.exitCode}); see ${opts.logPath}`,
- );
- }
- await appendLifecycleLog(opts.logPath, 'First-run prompt preflight: complete');
}
Bugbot Autofix prepared a fix for the issue found in the latest run.
Or push these changes by commenting:
Preview (f981ed6361)
diff --git a/CHANGELOG.md b/CHANGELOG.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -22,6 +22,7 @@
### Fixed
+- Fixed Claude UI benchmark suite loading so `sessionDefaults` unknown keys and invalid value types fail fast at config load time instead of after expensive simulator setup.
- Fixed Claude UI benchmark preflight so transient malformed or still-loading UI snapshots no longer crash the harness or finish before app UI is observable.
- Fixed Claude UI benchmark preflight so configured first-run dismissals require a concrete simulator ID and suite-provided simulator IDs are recorded in command logs.
- Fixed Claude UI benchmark config handling so invalid `failurePatterns` regexes fail before a suite starts and partial `allowedVariance` overrides preserve defaults for omitted metrics.
diff --git a/src/benchmarks/claude-ui/config.ts b/src/benchmarks/claude-ui/config.ts
--- a/src/benchmarks/claude-ui/config.ts
+++ b/src/benchmarks/claude-ui/config.ts
@@ -185,7 +185,9 @@
),
};
- if (isRecord(raw.sessionDefaults)) config.sessionDefaults = raw.sessionDefaults;
+ if (isRecord(raw.sessionDefaults)) {
+ config.sessionDefaults = validateSessionDefaults(raw.sessionDefaults);
+ }
config.allowedVariance = readAllowedVariance(raw.allowedVariance, `${source}.allowedVariance`);
if (raw.baseline !== undefined) {
Validate benchmark session defaults with the same schema used by runtime session defaults and terminate preflight-launched apps after post-launch failures.
Co-Authored-By: Codex <codex@openai.com>
Retry transient runtime snapshot parse failures during post-action refresh and avoid ending first-run preflight during empty transition snapshots.
Co-Authored-By: Codex <codex@openai.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Autofix Details
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Empty --all reports success
- Added check to throw error when listSuitePaths returns empty array, preventing vacuous success from empty results.
- ✅ Fixed: Parser always exits zero
- Modified parse function to track JSON decode errors and return non-zero exit code when malformed lines are encountered.
Or push these changes by commenting:
@cursor push 3e2d05ccb6
Preview (3e2d05ccb6)
diff --git a/benchmarks/claude-ui/parse_claude_conversation.py b/benchmarks/claude-ui/parse_claude_conversation.py
--- a/benchmarks/claude-ui/parse_claude_conversation.py
+++ b/benchmarks/claude-ui/parse_claude_conversation.py
@@ -101,13 +101,14 @@
return "\n\n".join(parts)
-def parse(path: Path, out_dir: Path, tool_prefix: str) -> None:
+def parse(path: Path, out_dir: Path, tool_prefix: str) -> bool:
out_dir.mkdir(parents=True, exist_ok=True)
# Track tool_use_ids that target our prefix so we keep matching results.
tracked_ids: set[str] = set()
tool_name_by_id: dict[str, str] = {}
counter = 0
+ had_errors = False
def next_path(kind: str, label: str | None = None) -> Path:
nonlocal counter
@@ -124,6 +125,7 @@
entry = json.loads(raw)
except json.JSONDecodeError as exc:
print(f"warn: skipping line {line_no}: {exc}", file=sys.stderr)
+ had_errors = True
continue
etype = entry.get("type")
@@ -204,6 +206,7 @@
# last-prompt, file-history-snapshot, thinking blocks) is dropped.
print(f"Wrote {counter} files to {out_dir}")
+ return not had_errors
def main() -> int:
@@ -227,8 +230,8 @@
return 1
out = args.output or args.jsonl.with_name(f"{args.jsonl.stem}_conversation")
- parse(args.jsonl, out, args.tool_prefix)
- return 0
+ success = parse(args.jsonl, out, args.tool_prefix)
+ return 0 if success else 1
if __name__ == "__main__":
diff --git a/src/benchmarks/claude-ui/harness.ts b/src/benchmarks/claude-ui/harness.ts
--- a/src/benchmarks/claude-ui/harness.ts
+++ b/src/benchmarks/claude-ui/harness.ts
@@ -565,6 +565,9 @@
}
const suitePaths = args.all ? await listSuitePaths() : [resolveSuitePath(args.suite as string)];
+ if (suitePaths.length === 0) {
+ throw new Error('no suite files found in benchmarks/claude-ui/suites');
+ }
const progress = createProgressReporter({ enabled: !args.json });
const results: BenchmarkResult[] = [];
for (let index = 0; index < suitePaths.length; index += 1) {
Fail --all when suite discovery returns no suite files so a missing or misconfigured benchmark suite directory cannot look like a successful full run. Return a non-zero parser exit when malformed JSONL lines are skipped so broken transcript parses fail the benchmark instead of producing partial-success artifacts.
Co-Authored-By: Codex <codex@openai.com>
Separate first-run preflight terminate executor errors from non-zero exit handling so each failure path is handled once. Track preflight completion with a success flag instead of catching and rethrowing only to drive cleanup suppression.
Co-Authored-By: Codex <codex@openai.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Duplicate stumble count parse errors
- Modified failureCount calculation to skip adding parserExitCode when audit.parseErrors already captured the same parser failures, preventing double-counting.
Or push these changes by commenting:
@cursor push 827aa9507c
Preview (827aa9507c)
diff --git a/src/benchmarks/claude-ui/compare.ts b/src/benchmarks/claude-ui/compare.ts
--- a/src/benchmarks/claude-ui/compare.ts
+++ b/src/benchmarks/claude-ui/compare.ts
@@ -183,7 +183,7 @@
audit.failures.length +
audit.patternFailures.length +
(run.claudeExitCode === 0 ? 0 : 1) +
- (run.parserExitCode === 0 ? 0 : 1);
+ (run.parserExitCode === 0 || audit.parseErrors.length > 0 ? 0 : 1);
const sequenceMode = config.sequence?.mode ?? 'warn';
const sequenceMatched =
expected.length === 0 || (missing.length === 0 && additional.length === 0);
Reviewed by Cursor Bugbot for commit 34a7e53.
Applied via @cursor push command
Benchmark test spawns real python3 subprocess, bypassing executor safety overrides
The runParserScript helper calls spawn('python3', args) directly from node:child_process, making npm test dependent on Python 3 being installed and bypassing the vitest-executor-safety.setup.ts framework-executor overrides; wrap the parser invocation in an injectable function so tests can stub it.
Evidence
- `vitest.config.ts` includes `src/**/__tests__/**/*.test.ts`, so `src/benchmarks/claude-ui/__tests__/claude-ui-benchmark.test.ts` runs under `npm test`.
- `vitest-executor-safety.setup.ts` overrides only `__setTestCommandExecutorOverride` / `__setTestInteractiveSpawnerOverride` (framework interfaces); raw `node:child_process` `spawn` is unguarded.
- `runParserScript` resolves `repoRoot` and passes the real `parse_claude_conversation.py` path, hitting the real filesystem and a real Python 3 process.
- The `'returns a non-zero parser exit when JSONL lines are malformed'` test calls `runParserScript` with a temp-dir JSONL file, meaning CI must have `python3` and the script on disk to pass.
Identified by Warden xcodebuildmcp-test-boundary-review
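One way to make the parser invocation injectable, sketched under assumptions: the code that needs the parser takes a runner function as a parameter, so tests pass a stub and only the production wiring touches `spawn`. `ParserRunner`, `runParser`, and the result shape are hypothetical names for illustration. The parser's own CLI arguments are passed through untouched because its flags are not shown in this thread.

```typescript
interface ParserRunResult {
  exitCode: number | null; // null when the process ended without an exit code
  stderr: string;
}

// Injectable seam: production code would supply an implementation built on the
// framework executor or child_process; tests supply a stub.
type ParserRunner = (command: string, args: readonly string[]) => Promise<ParserRunResult>;

interface ParserOutcome {
  ok: boolean;
  summary: string;
}

async function runParser(
  runner: ParserRunner,
  parserPath: string,
  parserArgs: readonly string[],
): Promise<ParserOutcome> {
  const result = await runner("python3", [parserPath, ...parserArgs]);
  if (result.exitCode === 0) return { ok: true, summary: "parser succeeded" };
  // Render a missing exit code explicitly rather than printing "null".
  const exitText = result.exitCode === null ? "no exit code" : `exit ${result.exitCode}`;
  return { ok: false, summary: `parser failed (${exitText}): ${result.stderr.trim()}` };
}
```

A test for the malformed-JSONL case can then stub the runner to return `{ exitCode: 1, stderr: 'warn: skipping line 2' }` and assert on the outcome, with no Python process involved.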



Add a local Claude UI benchmark harness for measuring simulator UI automation behavior against the development MCP server.
The harness runs deterministic app tasks from Markdown prompts, creates fresh temporary simulators, writes isolated MCP configuration, parses Claude Code transcripts, and reports tool counts, wall-clock timing, failures, and sequence drift. This gives us a repeatable way to catch regressions in agent efficiency and UI automation behavior across Weather, Contacts, and Reminders.
The benchmark setup also keeps simulator boot/open and first-run prompt cleanup outside the measured Claude task, so baselines reflect the actual app work rather than transient Apple setup screens. Mutating UI actions now wait for settled post-action runtime snapshots so the next agent step receives stable refs.
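For readers unfamiliar with the sequence-drift metric in the description, here is a rough sketch of how missing and additional tool calls could be derived from an expected and an actual tool list. It is an assumption-laden illustration: the function name and result shape are hypothetical, and it compares the lists as multisets, ignoring order, which may differ from what `compare.ts` does. Only the final `matched` rule mirrors the expression visible in the `compare.ts` diff in this thread.

```typescript
interface SequenceDrift {
  missing: string[]; // expected calls that never happened
  additional: string[]; // calls that happened but were not expected
  matched: boolean;
}

function compareToolSequence(expected: readonly string[], actual: readonly string[]): SequenceDrift {
  // Count how many times each expected tool still needs to be seen.
  const remaining = new Map<string, number>();
  for (const name of expected) {
    remaining.set(name, (remaining.get(name) ?? 0) + 1);
  }

  const additional: string[] = [];
  for (const name of actual) {
    const count = remaining.get(name) ?? 0;
    if (count > 0) remaining.set(name, count - 1);
    else additional.push(name);
  }

  const missing: string[] = [];
  remaining.forEach((count, name) => {
    for (let i = 0; i < count; i += 1) missing.push(name);
  });

  // Same rule as the compare.ts excerpt: an empty expectation always matches.
  const matched = expected.length === 0 || (missing.length === 0 && additional.length === 0);
  return { missing, additional, matched };
}
```

For example, expecting `tap, swipe, tap` and observing `tap, swipe, gesture, gesture` yields one missing `tap` and two additional `gesture` calls.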