From da5af7917d2b22f7e0e08fda807a05e90372480d Mon Sep 17 00:00:00 2001 From: marcelo Date: Fri, 5 Jun 2026 16:42:32 -0700 Subject: [PATCH] docs: fix hostconfig plan audit notes --- hostconfig-consolidation-plan.md | 361 +++++++++++++++++++++++++++++++ 1 file changed, 361 insertions(+) create mode 100644 hostconfig-consolidation-plan.md diff --git a/hostconfig-consolidation-plan.md b/hostconfig-consolidation-plan.md new file mode 100644 index 0000000..6b2952d --- /dev/null +++ b/hostconfig-consolidation-plan.md @@ -0,0 +1,361 @@ +# Plan: Consolidate `hostConfig` into `@mcpjam/sdk` (Playground + Evals on one pillar) + +## Implementation status (live) + +**All code/docs PRs for stages 0–6 are merged, but the external SDK reporter behavior is not published yet.** Stage 5 Step 2 (backend) merged via mcpjam-backend PR #427; Stage 5 Step 3 (SDK reporter wire-send) merged via inspector PR #2423 but still has a pending changeset (`.changeset/stage5-step3-sdk-reporter.md`); Stage 6 (docs refresh) merged via inspector PR #2439 + mcpjam-backend PR #433. The only remaining work is operational: cut the SDK release that publishes Stage 5 Step 3, then run the external-runtime smoke (plan §End-to-end verification item 5) once the new SDK is on npm. Both are tracked under [Remaining work](#remaining-work) at the bottom. 
+ +**Stage 1-4 recap.** Inspector PR #2392 (Stages 0 / 0b / 0c + Host hardening) merged the SDK host-config core and public `Host` facade; backend PR #409 switched the backend to import the SDK canonicalizer directly; inspector PR #2396 moved portable client leaf primitives; inspector PR #2400 tightened the canonicalizer; inspector PR #2407 moved host-execution policy, OpenAI compat resolution, and app-only filtering into the SDK; inspector PR #2409 shipped the breaking SDK rename (`TestAgent` → `HostRunner`, `EvalAgent` → `HostExecutor`, `.prompt()` → `.run()`), `HostRuntime`, `host.withManager(manager, { apiKey })`, single-gated app-only filtering, SDK-owned OpenAI compat resolution, and per-iteration host snapshot capture. + +The follow-up PR #2414 (`needsApproval` forwarding) was closed without merging. That decision is authoritative: the AI SDK's `needsApproval` is execution-gating, not metadata, and the eval path has no approval-response channel. Future stages should keep `requireToolApproval` as reported metadata (`approvals_would_require`) unless/until evals gain an explicit approval round-trip. + +**Stage 5 was implemented as a 3-step sequence (user refinement of the original plan).** Originally drafted as "backend first, then SDK reporter," it landed as: (1) SDK helper-only release publishing `normalizeSdkEvalHostConfigForWire` to `@mcpjam/sdk/host-config/internal`; (2) backend ingestion + `/sdk/v1/info` capability importing that helper from npm; (3) SDK reporter wire-send capability-gated against `/sdk/v1/info`. The three-step shape preserves Stage 1's "one canonicalizer" rule — no temporary backend mirror of the normalizer, no inline reimplementation. SDK 1.12.0 includes Stage B's canonicalize-tightening changeset and Stage 5 Step 1's normalizer changeset; the Step 3 reporter changeset is merged but not yet published. 
Source order: per-iteration `iteration.hostSnapshot` from Stage 4 is the primary source; `executor.getHostSnapshot?.()` is the fallback; `MCPJamReportingConfig.host` is the last-resort override. Pass-1 homogeneity gate: the run-level `hostConfig`/`hostConfigHash` pair is sent only when all available iteration snapshots canonicalize to the same hash; a run with heterogeneous snapshots omits it. The server-id mismatch is resolved by stripping `serverIds`/`optionalServerIds`/`serverConnectionOverrides` in the normalizer on both sides, so `Host.requireServer("everything")` runtime ids never reach Convex `validateServerScope` as `Id<'servers'>`. + +An earlier abandoned backend attempt (`9f52006c`, the hand-mirror + parity-fixture approach) sits on a stale branch and was **not** the basis for Stage 1. It was superseded by the one-import approach actually shipped and should remain discarded. + +- ✅ **Stage 0 — SDK host-config core** (inspector commit `86fccc0`, pushed). New `sdk/src/host-config/` (`types.ts`, `canonicalize.ts`, `hash.ts`, `index.ts`), wired into `index.ts`/`browser.ts`/`tsup`/`package.json` (`./host-config` subpath). Generator `scripts/gen-host-config-parity-fixture.mjs` → 14-vector golden fixture. Tests: `host-config-parity.test.ts` + `host-config-canonicalize.test.ts` (46 passing). Verified: typecheck ✓, build ✓, test:packaging ✓. +- ✅ **Stage 0b — public `Host` class facade** (inspector commit `19adf12`, pushed; post-#2392 refactor `8bb56a84a`). Per review: the public surface was the internal Convex schema lifted into the barrel. Reworked to a developer-facing, MCP-vocabulary API: `import { Host } from "@mcpjam/sdk"`, **direct-property mutation** (`host.mcp.protocolVersion = "2025-11-25"`, `host.mcp.apps = { sandbox: { csp: { mode: "declared" } } }`, `host.addServer("srv_abc")`), `host.toJSON()`. 
Renames (public only): `mcpProfile`→`mcp` (`HostMcp`), `HostConfigInputV2`→`HostInit` (ctor arg), `CanonicalHostConfigV2`→`HostJson`; `canonicalize*`/`compute*Hash`/`sha256Hex` no longer exported (moved behind a non-public `internal.ts` entry for tooling/tests). `Host.hash()` was deliberately removed (PR #2392, commit `3ed68ca3a`) — content addressing is an internal SDK↔backend concern. The wholesale-replace setters (`setMcp`/`setClientCapabilities`/`setHostContext`/`setHostCapabilitiesOverride`/`setChatUiOverride`/`setConnectionDefaults`) were dropped in `8bb56a84a` in favor of direct property mutation; scalar setters (`setStyle`/`setModel`/`setSystemPrompt`/`setTemperature`/`setRequireToolApproval`/`setProgressiveToolDiscovery`/`setRespectToolVisibility`) stay as fluent-chain conveniences. Server-list helpers now dedupe (`addServer`/`addOptionalServer`) and the inverse operations exist (`removeServer`/`removeOptionalServer`/`removeServerOverride`/`clearMcp`); `addServerOverride` → `setServerOverride` (always had replace semantics). **Decided wire question = (b)**: the internal canonical form keeps on-disk `mcpProfile`/`schemaVersion` and is hashed for backend parity, but it is NEVER surfaced — `toJSON()` projects to a clean public `HostJson` (`mcp`, `servers`, `style`, no impl names) and round-trips. Zero hash churn for untouched hosts (an empty `host.mcp = {}` collapses to no `mcpProfile` in canonical), backend untouched. 64 host-config tests (golden-vector hash equivalence + a no-impl-names guard + round-trip + direct-mutation coverage); typecheck/build/test:packaging ✓. +- ✅ **Stage 0c — `@mcpjam/sdk/host-config/internal` subpath** (inspector commit `8c77b04`, pushed). 
A spike disproved the "Convex can't import the SDK" assumption for this module: `dist/host-config/{index,internal}.js` are self-contained (zero external/`node:` imports), `convex dev --once --debug-node-apis` bundles the `/internal` subpath cleanly, and delegating the backend canonicalizer to it gives **136/136 byte-identical** results. So the backend will **import** the SDK canonicalizer (one source of truth), not hand-mirror it. Published the existing `internal.ts` entry as `@mcpjam/sdk/host-config/internal` (first-party, not-semver-stable; external consumers still use `Host`). `test:packaging` imports it; `/internal` hash === `Host.hash()`. +- ✅ **Stage 1 — backend imports the SDK canonicalizer** (mcpjam-backend PR #409, merge `a0a11541`; squashed commit `0d0c9961`). `convex/lib/hostConfigV2.ts` now imports `canonicalizeHostConfigV2` + `computeHostConfigHashV2` from `@mcpjam/sdk/host-config/internal` (delegating wrappers with `Id<'servers'> ↔ ServerId` boundary cast on the canonical return); ~880 lines of newly-unreachable private helpers (`sortStringKeys`/`deepSortStringKeys`/`isPlainObject`/`canonicalizeCsp*`/`canonicalizeAllowFeatures`/`canonicalizeMcpProfile`/`canonicalizeServerConnectionOverrides`) deleted in the same PR. Net file shrink: 2267 → 1293 lines (−974). **Pre-merge audit of all 15,334 prod `hostConfigs` rows: 0 affected by the hash-affecting tightening from `3ded440a8`** (0 with dup `serverIds`/`optionalServerIds`, 0 with non-finite `requestTimeoutOverride`, 0 with `serverConnectionOverrides` defined at all) — the swap was hash-neutral in practice; no row migration needed. Verified: `tsc -p convex/tsconfig.json --noEmit` ✓, `npm run test:once -- hostConfigV2Canonicalize` 105/105 against the SDK delegation ✓, broader hostConfig suites 170/171 (1 pre-existing skip), `npx convex dev --once --debug-node-apis` bundles cleanly ✓. The pre-existing 105-case `hostConfigV2Canonicalize.test.ts` is now the regression for both sides of the import boundary. 
+- ✅ **Stage 2 — inspector client consumes SDK leaf primitives** (inspector PR #2396, merge `be9751a80`). Stage 2 shipped **smaller than originally drafted** — the macro plan listed `HostConfigInputV2`/`HostConfigDtoV2`/`HostStyleId` for moving, but those types are deliberately stricter than the SDK's (client requires `serverIds`/`optionalServerIds`/`respectToolVisibility` for the editor; SDK leaves them optional for the iterative-host rollout; client uses the structured `ChatUiOverride` from `@/lib/client-styles`; client uses the closed `ChatboxHostStyle` union). Moving them would have cascaded `undefined`-checks across ~30 importers and fought a real design intent, so they stay client-owned. What actually moved: leaf types `CspDomainSet`/`HostConfigConnectionDefaults`/`HostConfigMcpProfileV1`/`McpProtocolVersion`, constants `SEP_1865_PERMISSION_FEATURES`/`DEFAULT_TEMPERATURE_V2`, and the one-line `resolveEffectiveMcpProtocolVersion` function — all added to `@mcpjam/sdk/host-config/internal` (new file `sdk/src/host-config/defaults.ts`) and re-exported from `client/src/lib/client-config-v2.ts`. The 65 importers stayed on `@/lib/client-config-v2` — zero churn. `emptyHostConfigInputV2` / `hostConfigDtoToInput` / `resolveClientInfo` / `resolveSupportedProtocolVersions` / `resolveHostInfo` / all entangled `resolveEffective*` resolvers + mergers / all dirty-detection helpers stay client-side (return strict client types or coupled to `@/lib/client-styles` + ext-apps). File shrink: 1080 → 875 lines (−205). Verified: SDK 1145/1145 + 6 new `host-config-defaults` tests ✓, `typecheck:client` ✓, 24/24 client test files touching `client-config-v2` → 418/418 tests passing ✓. 
Bot-review follow-ups landed in the same PR: vitest source alias (`d9a42182a`) + vite source alias (`6bf81115b`) for `@mcpjam/sdk/host-config/internal` (matching the existing `browser`/`matchers`/`skill-reference` pattern), so `npm test -w @mcpjam/inspector` and `npm run dev:client`/`build:client` in isolation no longer require a pre-`build:sdk` step. +- 🗑️ **Abandoned backend branch** (`mcpjam-backend` commit `9f52006c` on `claude/eloquent-heisenberg-DLDzM`). An earlier hand-mirror + duplicated fixture + `__inputHash` guard attempt at Stage 1, superseded by the one-import approach that actually shipped as PR #409. **Discarded; not merged.** +- ✅ **Host hardening (user commit `22c7df8`, merged via #2392 as `878963410`).** `Host` now requires `style`/`model` (no silent `"mcpjam"` default, no empty-`model` sentinel) via lazy `requireConfigured()` in `toJSON()`. Reconciled locally; typecheck + 64 host-config tests green. +- ✅ **Partial — canonicalizer tightening shipped post-#2392** (inspector commit `3ded440a8`, on `main`). Two of the six bot-review items landed at the **canonicalizer** layer (not just the Host facade), so the hash-migration risk listed below applied to any stored row in scope. In principle the Stage 1 backend swap was not hash-neutral for these two cases, although the Stage 1 pre-merge audit found 0 affected prod rows: + - ✅ #2 dedupe `serverIds` / `optionalServerIds` — `sortUniqueServerIds` in `canonicalize.ts`. Backend rows with duplicate ids hash to a different value under the SDK canonicalizer that Stage 1 swapped in. + - ✅ #4 reject non-finite `requestTimeoutOverride` — explicit `Number.isFinite` guard in `canonicalizeServerConnectionOverrides`. +- ✅ **Stage B — canonicalizer tightening follow-ups** (inspector PR #2400, merge `40bf746e7`). All four deferred items from #2392 shipped as one SDK PR since Stage 1's one-import delegation meant a single SDK change covers both sides: + 1. ✅ deep-sort nested `clientCapabilities` / `hostContext` (recursive sort, matches `*Override` + `mcpProfile`); + 3. 
✅ collapse empty `allowFeatures` to absent (matches sibling `openaiAppsOverrides`); + 5. ✅ fail-fast on missing required `clientCapabilities` / `hostContext` via new `requireRecord()` helper (replaces `?? {}` coalescing); + 6. ✅ drop `openaiAppsOverrides` when `compatRuntime.openaiApps === false` (resolver ignores them anyway). + CodeRabbit-caught follow-up (`fe85d98ef`): tightened the shared `isPlainObject` predicate with a prototype guard (`Object.prototype` or `null`) so `Date` / `Map` / `Set` / class instances no longer canonicalize to `{}` and merge into the empty-record dedupe pool — closes the loophole across 5+ canonicalize call sites in one place. Pre-merge audit of all 15,389 prod hostConfigs rows: **0 affected by any of the four items** — shipped hash-neutral, no mint+repoint migration. SDK 1154/1154 tests (was 1145 baseline; +7 tightening tests, +2 prototype-guard regression tests, +1 updated deep-sort assertion); parity fixture regenerated + `EXPECTED_INPUT_HASH` bumped. Bumped to SDK 1.12.0 (minor) via changeset. Other review items that landed on #2392 — the builder-clone determinism fix (`34f2d1c`, parity-neutral), test hardening, the `style`/`model` required check (`878963410`), and `Host.hash()` removal from the public surface (`3ed68ca3a`) — also remain merged. +- ✅ **Stage 3 — host-policy + compat + `filterAppOnlyTools` in SDK; both runtimes rewired** (inspector PR #2407, merge `9c618e1ae`). 
New SDK files under `sdk/src/host-config/`: `app-only-tool.ts` (pure leaf — `isAppOnlyTool` moved out of `mcp-client-manager/tool-converters.ts`, which now re-exports it for back-compat), `tool-visibility.ts` (structural `ToolMetadataSource` duck-type + `filterAppOnlyTools` + `applyVisibilityPolicyAndCountSignals` — **no `MCPClientManager`/`ai` runtime import** so `/internal` stays browser-safe), `host-policy.ts` (`extractHostExecutionPolicy` + `buildHostIterationMetadata` + types), `compat-runtime.ts` (`readOpenAiCompatOverride` + `compatPresetForHostStyle` + `resolveOpenAiCompatForHostConfig`). Re-exported from `internal.ts`. Inspector rewires: `server/utils/chat-v2-orchestration.ts` drops its local `filterAppOnlyTools` body + the `isToolVisibilityAppOnly` import (the `@modelcontextprotocol/ext-apps` dep stays — still used by 3 other server files); `server/services/evals/host-execution-policy.ts` + `compat-runtime.ts` become thin re-export shims over the SDK (`loadSuiteHostConfig` stays inspector-side, Convex-bound); `server/{tsup,vitest}.config.ts` register the `@mcpjam/sdk/host-config/internal` subpath alongside the existing SDK entries. **Eval-path → live-path entanglement deleted** — `grep -rn "utils/chat-v2-orchestration" mcpjam-inspector/server/services/evals/` returns only a doc-comment match (no import). Verified: SDK 1195 passed + 1 skipped (the deliberate Stage 4 placeholder — see below); `test:packaging` asserts all 10 expected exports are functions; `esbuild --platform=browser` metafile on `host-config/internal.ts` confirms the bundle resolves to host-config files only (no `ai`, no `MCPClientManager`, no `node:*`); inspector 4910 passed + 6 skipped; client typecheck + `build:client` clean. Net file delta: 712 insertions / 248 deletions across 15 files. 
Stage 4 carryover: the private raw-`Tool[]` conversion at `sdk/src/TestAgent.ts:90` still drops app-only tools unconditionally; `applyVisibilityPolicyAndCountSignals` runs over the already-converted set so it cannot recover that drop. Marked as `it.skip("...fix at TestAgent.ts:90 (Stage 4)")` in `host-config-tool-visibility.test.ts` so the gap is visible without lying about coverage. +- ✅ **Stage 4 — SDK rename + `Host` as primary spec + `HostRuntime` live binding + single-gated app-only filter** (inspector PR #2409, merge `8a5c9426b`; broken follow-up PR #2414 closed without merge). The single PR that consolidated four months of host-config plumbing into a coherent breaking SDK release. What landed (full breaking-change index in `sdk/CHANGELOG.md`): + - **Renames (no deprecation aliases — first SDK major):** `TestAgent` → `HostRunner` (class); `EvalAgent` → `HostExecutor` (interface); `.prompt(message, options)` → `.run(message, options)` on the interface + both impls; `Host.addServer` → `requireServer`; `Host.removeServer` → `removeRequiredServer`; `EvalTest.run` / `EvalSuite.run` parameter `agent` → `executor`. + - **`Host` becomes the primary `HostRunnerConfig` spec.** `HostRunnerConfig.host: Host | HostInit | HostJson` (discriminated union: callers supply either `host` with optional `model`, or `model` with no host — a config missing both is a compile-time error). `HostRunner` snapshots once at construction via `snapshotHostSource(...)`; a pre-snapshotted `HostJson` (the path `HostRuntime.run()` takes) passes through unchanged — no double-snapshot. New accessors: `getHostSnapshot()`, `getHostPolicy()`. + - **`HostRuntime` + `host.withManager(manager, { apiKey, ...defaults })`.** Live binding to a `Host` + a structural `HostRuntimeManager` (`hasServer` + `getToolsForAiSdk`, plus optional `getServerReplayConfigs`). 
Each `.run()` snapshots the host afresh and dynamically imports `HostRunner` (the bundle-safety boundary — `HostRunner.ts` pulls in `ai`/Node deps; `host-config/host-runtime.ts` only references it as `dynamic-import`, so browser bundlers tree-shake it). **Stateless across turns**: history accumulates for inspection but is NOT auto-replayed; multi-turn continuity stays explicit via `PromptOptions.context`. One-shot sugar `host.run(input, { apiKey, mcpClientManager })` delegates through a throwaway runtime. + - **`HostExecutor` interface unifies both impls.** `HostRunner` and `HostRuntime` both implement it. Optional `getHostSnapshot?()` and `getServerReplayConfigs?()` so `resolveServerReplayConfigs` finds the latter through any executor uniformly. `wrapAgentWithAbortSignal` forwards both so the iteration wrapper preserves the introspection surface. + - **Single-gated app-only filter.** Removed from `convertToToolSet`; gated once at `HostRunner` tool-prep by `hostPolicy.respectToolVisibility !== false` (default = filter, preserving pre-Stage-4 semantics; `respectToolVisibility: false` is the explicit opt-out). Closes the Stage 3 `it.skip` placeholder. **`withOptions` preserves `rawTools`** (the raw input `Tool[]` / `AiSdkTool` reference at construction time) so host-replacement clones re-run the prep step under the new policy — otherwise a stricter clone would inherit the parent's already-converted ToolSet and silently expose app-only tools. + - **SDK-owned OpenAI compat decision.** Default derives from `resolveOpenAiCompatForHostConfig(hostSnapshot)`; explicit `injectOpenAiCompat` overrides. Existing SDK `injectOpenAICompat` widget primitive unchanged. + - **Per-iteration host snapshot stamping.** `IterationResult.hostSnapshot?: HostJson` captured at iteration end via `iterationAgent.getHostSnapshot?.()`. 
For `HostRunner` this is the immutable construction snapshot; for `HostRuntime` it's the live `Host` state at iteration end — so a user mutating the bound `Host` between iterations (e.g. `host.requireServer("x")` mid-eval) gets per-iteration metadata that reflects what that iteration actually ran with, not the post-mutation state. `eval-result-mapping.ts`'s `resolveIterationHostExtras(iteration, fallback)` makes per-iteration snapshot win; the upload-time `executor.getHostSnapshot()` stamp stays as fallback for executors that don't expose the method. **Mid-iteration mutation between turns is not separately captured** — would require threading the snapshot into `PromptResult`; deferred. + - **`MCPJamReportingConfig.host?: Host` field added.** Wire-level `hostConfigHash` propagation deferred to Stage 5 — the field is accepted but not yet sent to `/sdk/v1/evals/*`. + - **Inspector cleanup (import-only, no behavior change):** `evals-runner.ts`, `replay-suite-run.ts`, `routes/shared/evals.ts` switched from `./evals/host-execution-policy.js` and `./evals/compat-runtime.js` re-export shims to direct `@mcpjam/sdk/host-config/internal` imports. The pure shim `host-execution-policy.ts` is deleted; `compat-runtime.ts` is slimmed to keep only `loadSuiteHostConfig` (Convex-bound, can't move). Bundle isolation re-verified: `HostRunner.ts` appears in the `host-config/index.ts` graph **only as `kind: dynamic-import`**, so browser consumers of `Host` / `HostRuntime` can tree-shake `ai` out. + - **Verified:** SDK 1248 tests pass (was 1195 baseline; +53 across `HostRunner.host.test.ts`, `HostRuntime.test.ts`, `eval-result-mapping.test.ts` per-iteration regression, `host-config-policy.test.ts` and `host-config-compat-runtime.test.ts` HostJson-shape regression coverage); `test:packaging` extended to assert renamed exports; built-bundle functional smoke verified `host.withManager` round-trip + `assertHostServersKnown` validation. 
The five-round review cycle (Cursor Bugbot + chatgpt-codex-connector) is summarized in the trap log under "Stage 4 review-cycle findings" below. +- 🛑 **Closed without merge: PR #2414 (`needsApproval` forwarding in `HostRuntime.run`).** chatgpt-codex-connector's first review on the merged Stage 4 PR flagged that `HostRuntime.run()` passed `{ includeAppOnly }` to `manager.getToolsForAiSdk` but not `{ needsApproval: policy.requireToolApproval }`, citing the server chat path as the model. PR #2414 forwarded it. The same bot's next review on #2414 (correctly) flagged that this **breaks** evals: the AI SDK's `needsApproval: true` is **execution-gating**, not metadata — it causes `generateText` to emit an approval-request and skip the tool's `execute` callback until a separate approval response is provided. The chat path has that approval-response channel; `HostRuntime.run` / `PromptOptions` deliberately doesn't (host-policy.ts: *"Evals do not block on approval prompts — this is a 'would prompt N times' signal only"*). So forwarding `needsApproval` makes `HostRuntime`-backed evals against approval-required hosts silently skip tool execution: `result.hasToolCall("X")` returns `true` (intent recorded) but `execute` never ran, no widget snapshot, no real output, predicates can't fire. **Lesson for future stages and future bot reviews:** the server chat path is NOT a safe template for the eval path when the API in question is execution-gating without an explicit completion channel. `requireToolApproval` is reported as the `approvals_would_require` iteration metadata signal (see `buildHostIterationMetadata`) and that's the correct semantic — report the policy, don't enforce it. +- ✅ **Stage 5 — SDK↔backend eval wire integration** (all code PRs merged across both repos; Step 1 helper published as `@mcpjam/sdk@1.12.0`; Step 2 backend merged as mcpjam-backend PR #427; Step 3 reporter merged but not yet published). The original Stage 5 plan said "backend first, then SDK reporter." 
During scoping the user refined this to a 3-step sequence — SDK helper-only release first, then backend imports it from npm, then SDK reporter activates capability-gated — to preserve Stage 1's "one canonicalizer" rule (no temporary backend mirror, no inline reimplementation). Caveat: as of Stage 6's merge the Step 3 reporter changeset (`.changeset/stage5-step3-sdk-reporter.md`) is still unpublished — the released SDK on npm does not yet send the wire pair. See [Remaining work](#remaining-work). + - ✅ **Step 1 — SDK helper-only release** (inspector PR #2422, merge `9e3535884`; published as `@mcpjam/sdk@1.12.0`). New `sdk/src/host-config/sdk-evals-normalizer.ts` exporting `normalizeSdkEvalHostConfigForWire(source: HostConfigInputV2 | HostJson): HostConfigInputV2` from `@mcpjam/sdk/host-config/internal` only (not the public barrel). Strips runtime-manager identifiers (`serverIds`, `optionalServerIds`, `serverConnectionOverrides`) so SDK reporter and backend ingestion hash byte-identical wire shapes. Accepts both canonical `HostConfigInputV2` (storage-row vocabulary: `hostStyle`/`mcpProfile`/`serverIds`) AND public `HostJson` from `Host.toJSON()` (public vocabulary: `style`/`mcp`/`servers`) — projects public → canonical before stripping (Stage 4 trap-log lesson). Pure, browser-safe, runtime-free, idempotent. 11 tests; `test:packaging` extended to assert the new export; esbuild metafile confirms no `ai`/`MCPClientManager`/`node:*` in the `/internal` bundle slice. Reporter not touched in this PR (helper-only). + - ✅ **Step 2 — backend ingestion + `/sdk/v1/info` capability** (mcpjam-backend PR #427, merged; bumped `@mcpjam/sdk` to `^1.12.0`). New unauthenticated `GET /sdk/v1/info` route returning `{ "capabilities": { "evalsHostConfig": 1 } }` — capability discovery deliberately bypasses API-key auth so reporters can probe without credentials. 
Extended `POST /sdk/v1/evals/runs/start` and `POST /sdk/v1/evals/report` to accept optional `{ hostConfig, hostConfigHash }` at the top level (flat, not nested). The two fields form an integrity pair: sending one without the other returns 400. When present: server-side normalize via `normalizeSdkEvalHostConfigForWire` (defense in depth, since the SDK reporter is expected to have already stripped client-side) → `canonicalizeHostConfigV2` → recompute via `computeHostConfigHashV2` → reject 400 on mismatch. All three imports come from the narrow `@mcpjam/sdk/host-config/internal` subpath (the same Stage 1 Convex-safe entry point), never the SDK barrel. The verified base is persisted on `testSuiteRun.configSnapshot.sdkEvalHostConfigBase` (matches existing snapshot-at-run-start pattern alongside `judgeConfig`/`defaultMatchOptions`/`defaultPredicates`); the schema field is typed `v.any()` because the transport snapshot must round-trip whatever the wire normalizer emitted and materialization re-validates via `canonicalizeHostConfigV2`/`ensureHostConfigV2`. In `internalAppendSdkIterations`, when a snapshot base exists it replaces `resolveEvalBaseHostConfigV2(suite, …)` as the materializer input; suite-resolved `serverIds` are still layered on top so the final stored v2 `hostConfigs` row's hash **differs** from the wire `hostConfigHash` — the wire hash is a transport-integrity check, **not** the storage id. Discovery during implementation: `materializeEvalIterationHostConfig` requires a `HostConfigDtoV2` (with `id`/`schemaVersion`/DTO shape) but the wire payload is a `HostConfigInputV2` — inspection confirmed the materializer body never reads `.id`/`.schemaVersion`, only policy fields, so a small `sdkBaseHostConfigDtoFromWireInput()` adapter fabricates a placeholder id and projects every field the materializer touches; `ensureHostConfigV2` at the end mints the real row id. 
9 new `convex-test` cases covering: backward compat (no hostConfig); matching hash success; tampered hash reject; runtime server-ids like `"everything"` normalized away (never reach `validateServerScope`); one-shot vs chunked path parity; `/sdk/v1/info` shape. + - ✅ **Step 3 — SDK reporter wire-send (capability-gated)** (inspector PR #2423, merged). Three new modules: `sdk/src/sdk-evals-capability.ts` (lazy `GET /sdk/v1/info` probe, cached per `baseUrl`, **fail-safe to "no capability"** on any error — network, 404, timeout, parse — so the reporter omits hostConfig rather than failing the report; 2s `AbortController` timeout; tolerates both nested `{capabilities:{evalsHostConfig:1}}` and flat `{evalsHostConfig:1}` response shapes); `sdk/src/sdk-evals-wire-host-config.ts` (`buildSdkEvalsWireHostConfig` runs the SAME `normalizeSdkEvalHostConfigForWire → canonicalizeHostConfigV2 → computeHostConfigHashV2` pipeline as the backend); `sdk/src/sdk-evals-host-config-source.ts` (`resolveRunLevelHostSnapshot` implements the documented source order `iteration.hostSnapshot → executor.getHostSnapshot?.() → MCPJamReportingConfig.host` plus the pass-1 homogeneity gate — canonicalize+hash every available iteration snapshot, return null on disagreement). Wire pair injected at one-shot `POST /sdk/v1/evals/report` body AND chunked `POST /sdk/v1/evals/runs/start` body only — **never** in `appendEvalRunIterations` or `finalizeEvalRun` (per-run, not per-batch). `hasAnyHostSnapshotSource` short-circuit skips the capability probe when no snapshot source exists, preserving the legacy single-request flow for callers that don't supply host info (and keeping existing fetch-mock-count assertions green). 32 new tests across 4 files. 
Cursor Bugbot follow-up (`787e21a72`, rebased to `ef34c42f9`): wrapped `resolveRunLevelHostSnapshot` + `buildSdkEvalsWireHostConfig` in try/catch matching the capability-probe fail-safe pattern — a malformed `iteration.hostSnapshot`, throwing `executor.getHostSnapshot()`, or non-canonicalizable explicit `Host.toJSON()` must NOT crash the entire eval upload. +1 regression test asserting a throwing executor produces a successful upload, no wire pair in body, and a `console.warn` containing `"omitting hostConfig wire pair"`. Release-process trap learned + carried forward (see Stage 5 trap log below). +- ✅ **Stage 6 — docs refresh** (inspector PR #2439, merged + mcpjam-backend PR #433, merged). Public SDK docs (`docs/sdk/`) bulk-renamed `TestAgent` → `HostRunner`, `EvalAgent` → `HostExecutor`, `.prompt()` → `.run()`, `Host.addServer()` → `Host.requireServer()`, `TestAgentOptions` → `HostRunnerOptions` across 18 mdx files; `test-agent.mdx` → `host-runner.mdx` with redirect in `docs.json`. New "Spec-first: `Host` + `HostRuntime`" section in `index.mdx`, "Bring your own host" section in `concepts/running-evals.mdx`, and "Run-level host snapshot" section in `reference/eval-reporting.mdx` (Stage 5 wire-send semantics — capability probe, fail-safe-to-omit, source order, pass-1 homogeneity gate, server-id normalization). `MCPJamReportingConfig.host` field documented. Contributing arch docs (`docs/contributing/playground-architecture.mdx` + `evals-architecture.mdx`) got top-of-file "HostConfig consolidation" sections explaining the single-source-of-truth module layout, Stage 3 import paths, Stage 4 rename surface, per-iteration host snapshot capture, Stage 5 wire-send, and the two-sandbox-layer distinction (allowlist-only persisted shape vs deny-capable runtime resolver). Older sections preserved with a stale-doc warning. 
SDK README + CHANGELOG updated; backend `convex/lib/hostConfigV2.ts` got a file-header comment naming `@mcpjam/sdk/host-config/internal` as the canonicalizer source of truth so a future reader doesn't hand-patch a parallel implementation. Verified: SDK typecheck ✓, `test:packaging` ✓ (all 11 expected `/host-config/internal` exports load including `normalizeSdkEvalHostConfigForWire`), `docs.json` valid JSON ✓. + - Confirmed-faithful porting gotcha from the original port, superseded by Stage B: `clientCapabilities`/`hostContext` originally used a **shallow** key sort and only `*Override` + `mcpProfile` deep-sorted (a test that assumed deep order-independence for `clientCapabilities` was wrong; the canonicalizer was correct at that point). Stage B (inspector PR #2400) later switched `clientCapabilities`/`hostContext` to a recursive sort as well. + - Stage 2 discovery worth carrying forward: when "moving" a type from client to SDK, first diff the strictness. The SDK type is the storage-write contract (permissive during rollouts); the client type is often the editor's stricter draft contract. If they diverge, lift the **leaf subtypes** to SDK and let the client keep the stricter aggregate that composes them — don't try to share the aggregate. + - Stage 3 discovery worth carrying forward: when extending a documented browser-safe SDK subpath barrel, the new modules must be *structurally* pure — not just "logically pure." A type-only `MCPClientManager` import doesn't drag runtime code, but a `typeof MCPClientManager` runtime reference does. Use structural duck-types (`interface ToolMetadataSource { getAllToolsMetadata(serverId): ... }`) for cross-module signatures; verify with `esbuild --bundle --metafile --platform=browser` before commit. + - **Stage 5 trap log worth carrying forward:** + 1. **Cross-repo helper publish gates the whole stack.** Backend importing a new SDK helper at the narrow `/internal` subpath needs the SDK published to npm before the backend PR can pass CI (`npm ci` resolves from the lockfile-pinned version). 
For development parallelism, `npm pack` the SDK from the helper PR's branch and install the tarball into the backend worktree — proves the contract before the real publish. Bumping `package.json`+lockfile must land in the same change as the new import; chatgpt-codex-connector flagged this on PR #427 (correctly). + 2. **`release.yml` runs `changeset version` + `publish` itself — never pre-run them in a PR.** First attempt at the release rollup (inspector PR #2427) pre-ran `changeset version` locally, which deleted the changeset files and bumped `sdk/package.json` in the PR. The release workflow then ran on the merged commit, found zero unpublished changesets, and aborted at "Select release scope" with `Scope "packages-only" does not include any unpublished changesets`. No publish happened. Recovered by reverting via #2431 (restored changesets + reverted version bump), then triggering the workflow normally. Pattern: PRs accumulate changeset files; the workflow does the version+publish. + 3. **Version-bump prediction is fragile under changeset collapsing.** Multiple minor changesets from the same baseline collapse to ONE bump. Stage B and Stage 5 Step 1 both shipped `minor` changesets from `1.11.0` — published as `1.12.0`, NOT `1.13.0`. Don't write version numbers into commit messages, PR descriptions, or code comments until `npm view version` confirms what actually published. + 4. **Fail-safe symmetry for cross-cutting wire features.** When one layer of a cross-cutting feature (e.g. the `/sdk/v1/info` capability probe) is fail-safe-to-omit on errors, every adjacent layer in the same pipeline (e.g. `resolveRunLevelHostSnapshot` + `buildSdkEvalsWireHostConfig`) must use the same fail-safe pattern. Cursor Bugbot caught a Stage 5 Step 3 regression where the probe was fail-safe but the snapshot-resolve+hash step let errors propagate, so a malformed `iteration.hostSnapshot` would fail the entire eval upload. 
Rule: if the feature must not block the main workflow under any circumstance, audit *every* call in its setup pipeline for try/catch, not just the obvious one. + - **Stage 4 review-cycle findings worth carrying forward:** + 1. **Bot-symmetry framing is dangerous.** "The server chat path does X, you should too" is not load-bearing when X is an execution-gating API (`needsApproval`) without an eval-side completion channel. Verify the API's *runtime semantics*, not just its surface signature, before mirroring. + 2. **Helpers under `/internal` must accept both canonical (`hostStyle`/`mcpProfile`) AND public (`style`/`mcp`) shapes.** Stage 3's helpers were written when only the canonical shape called them (inspector-side, via Convex). After Stage 4, `HostRunner` feeds them `Host.toJSON()` snapshots whose top-level fields use the public shape. A `hostConfig.hostStyle ?? hostConfig.style` / `mcpProfile ?? mcp` fallback is the cheap fix; add a parity test using `new Host(...).toJSON()` directly, not just canonical fixtures. + 3. **`isHostJson` must require the full normalized shape**, not just style/model/servers. A naked `{ style, model, servers }` `HostInit` will otherwise bypass `new Host(init).toJSON()` normalization in `snapshotHostSource()`. Check at least `optionalServers`/`connectionDefaults`/`clientCapabilities`/`hostContext`/`systemPrompt`/`temperature`/`requireToolApproval` — fields a HostJson always carries with concrete types but HostInit typically omits. + 4. **`withOptions` semantics split on whether host is being replaced.** Without `options.host`, carry the parent's resolved fields (`model`, `systemPrompt`, `temperature`, `injectOpenAiCompat`, `rawTools`) so plain `withOptions({})` doesn't silently revert an explicit ctor override. With `options.host`, do NOT carry them — let the new host's snapshot drive defaults; explicit `options.*` still wins. `rawTools` is always passed (re-prep runs under whatever policy applies after the clone). + 5. 
**Per-iteration capture for live-binding runtimes.** Anything that reads `executor.getHostSnapshot?.()` once at upload time will be wrong for `HostRuntime` if the bound `Host` mutated during the run. Capture per-iteration on the iteration clone; let the upload-time stamp be the fallback for executors that don't expose the method. + 6. **Eval reporting wire format.** When Stage 5 adds `hostConfig`/`hostConfigHash` to the eval ingestion body, it must derive from `iteration.hostSnapshot` (the per-iteration capture from Stage 4) — NOT from `executor.getHostSnapshot()` at report time. The `executor` may have a different `Host` state by then. The same mid-iteration-mutation limitation applies; document it on the Stage 5 PR. + +## Context + +**Why this exists.** Playground and Evaluate do nearly the same thing — drive an LLM + MCP tool-loop under a *host configuration* — but their shared foundation, `hostConfig`, has splintered into 3–4 hand-synced representations interpreted by two forked runtimes: + +- **Backend canonical**: `convex/lib/hostConfigV2.ts` (`HostConfigInputV2` + `canonicalizeHostConfigV2` + content-hash → the `hostConfigs` table). +- **Client mirror**: `mcpjam-inspector/client/src/lib/client-config-v2.ts` — header literally says *"Kept in sync by hand."* +- **Wire contract**: `shared/chat-v2.ts:ChatV2Request` (the same fields flattened). +- **SDK**: nothing — `@mcpjam/sdk` has zero hostConfig logic today. + +The two runtimes also fork: the live path (`server/utils/chat-v2-orchestration.ts:prepareChatV2`) vs. the eval path (`server/services/evals/host-execution-policy.ts` + `compat-runtime.ts`), whose only shared logic is `filterAppOnlyTools` — which the eval path reaches *into* the live path to borrow. + +**Goal.** Make `@mcpjam/sdk` own a portable `hostConfig` core so external users can **run evals from their own agent with the same host behavior the Inspector uses**, and so both Inspector runtimes read from one source of truth. 
+ +**Decided constraints (locked with the user):** +1. **Full end-to-end roadmap** (not just a first slice). +2. **SDK is the single source of truth, and the backend imports it directly.** (Corrected from an earlier draft that claimed the backend couldn't import the SDK.) That "no SDK import" rule — from `convex/lib/mcpProtocolVersion.ts` — applies to the full SDK **barrel** (Node-only deps), NOT the purpose-built, zero-dependency `@mcpjam/sdk/host-config/internal` subpath. A spike confirmed Convex's bundler accepts the subpath (`convex dev --once --debug-node-apis`) and delegating to it is 136/136 byte-identical, so the backend **imports** `canonicalizeHostConfigV2`/`computeHostConfigHashV2` — one canonicalizer, no hand-mirror, no parity fixture. The publish gate (`@mcpjam/sdk@1.11.0`) has been cleared. +3. **Wire BOTH runtimes** (live playground + eval) onto the shared SDK core. + +**Intended outcome.** One canonical `hostConfig` definition + canonicalizer + hash + host-execution policy living in `@mcpjam/sdk`; the Inspector client and both server runtimes consume it; **the backend imports the canonicalizer from `@mcpjam/sdk/host-config/internal` (one source of truth, no hand-mirror)**; and `HostRunner`/`EvalTest`/`EvalSuite` accept a `Host` and report it through the existing `/sdk/v1/evals/*` ingestion routes. (Stage 4 also renames the existing `TestAgent` class to `HostRunner` — see Stage 4 for the rationale.) + +--- + +## Principles & non-goals + +- **Chat and eval storage stay intentionally separate.** Playground (`chatSessions` — interactive, editable, continuable) and Evals (`testSuiteRun`/`testIteration` — reproducible, frozen, audited) have legitimately different lifecycles. This plan does **not** unify their storage or session models; `importChatSessionToTestCase` remains the single, explicit crossing point between them. 
+- **Only `hostConfig` + host-execution policy become portable and shared.** The consolidation is strictly at the *configuration + policy* layer (how a host talks to MCP servers + the LLM, and how tool-visibility/approval/compat are applied) — never at the storage/transport layer. Every stage below touches config/policy, not session persistence. +- **Two distinct sandbox layers, kept distinct.** The persisted host-config sandbox shape is **allowlist-only (no `deny`)**; it is NOT the same type as the SDK's runtime CSP *resolver* (`sandbox-policy.ts`, which carries `deny` + a hosted clamp for render-time enforcement). They share only leaf subtypes — see the reuse caveat below. + +--- + +## Key architectural finding (shapes the whole plan) + +**There is exactly one canonicalizer — the backend imports the SDK's.** `dist/host-config/{index,internal}.js` are fully self-contained (zero external/`node:` imports; tsup inlines the types, canonicalizer, hash, and reused protocol-version consts). A spike (`npm pack` the SDK → install into the backend → delegate `canonicalizeHostConfigV2`/`computeHostConfigHashV2` → `convex dev --once --debug-node-apis`) bundled cleanly and produced **136/136 byte-identical** results (31 golden vectors + 105 canonicalize cases). + +So the SDK and backend share **one import**, not two implementations pinned by a test. This **deletes the entire parity ritual** an earlier draft of this plan called for (duplicated fixture, `__inputHash` self-guard, lockstep regeneration). The backend's existing 105-case `hostConfigV2Canonicalize.test.ts`, once it imports from the SDK, *is* the regression for both; the SDK keeps its own `host-config-*.test.ts` as the SDK-side regression. The publish gate that previously bracketed this work (`@mcpjam/sdk@1.11.0` exposing the `/internal` subpath) has been cleared. + +--- + +## New SDK module layout (`sdk/src/host-config/`) + +Pure (no `convex/values`, no `ctx.db`, no Node-only APIs); most of it browser-safe. 
Sits alongside, and is re-exported like, the existing `sandbox-policy.ts` and `mcp-client-manager/mcp-protocol-version.ts`. + +**Hashing decision (decided up front, NOT discovered mid-implementation): async everywhere, single API, no sync variant.** Verified the backend already hashes via async Web Crypto — `sha256Hex` is `async` using `crypto.subtle.digest` (`convex/lib/keys.ts:13`) and `computeHostConfigHashV2` is already `async` (`convex/lib/hostConfigV2.ts:1461`); the only sync `node:crypto` SHA usage is unrelated billing/ip-salt code. So there is **no async/sync mismatch to reconcile**: the SDK keeps `canonicalizeHostConfigV2(input)` **sync** and `computeHostConfigHashV2(canonical): Promise<string>` **async**, byte-identical to the backend. **No `computeHostConfigHashV2Sync` is added.** Every hashing call site (backend ctx-bound mutations; SDK `HostRunner`/reporter) is already async and `await`s — zero call-site migration churn. If a future sync need ever appears, add an explicit `*Sync` variant backed by a vendored sync SHA-256 rather than letting the default become ambiguous. 
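
A minimal sketch of the sync-canonicalize / async-hash split described above, assuming the hash is SHA-256 over the canonical JSON string. `canonicalizeSketch`, `computeHashSketch`, and the two reduced interfaces are hypothetical stand-ins, not the SDK's `canonicalizeHostConfigV2`/`computeHostConfigHashV2`; the real canonical shape carries many more fields.

```typescript
// Hypothetical, heavily reduced input shape (the real HostConfigInputV2 is much larger).
interface SketchInput {
  hostStyle: string;
  serverIds?: string[];
}

interface SketchCanonical {
  hostStyle: string;
  serverIds: string[];
}

// Sync step: dedupe, then sort, so equivalent inputs canonicalize identically.
function canonicalizeSketch(input: SketchInput): SketchCanonical {
  const serverIds = Array.from(new Set(input.serverIds ?? [])).sort();
  // Keys are written in a fixed order because the JSON bytes feed the hash.
  return { hostStyle: input.hostStyle, serverIds };
}

function toHex(bytes: Uint8Array): string {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

// Async step: Web Crypto only exposes an async digest, so the hash API is async.
async function sha256Hex(text: string): Promise<string> {
  const data = new TextEncoder().encode(text);
  const digest = await globalThis.crypto.subtle.digest("SHA-256", data);
  return toHex(new Uint8Array(digest));
}

// Assumption: the hash input is the JSON serialization of the canonical object.
async function computeHashSketch(canonical: SketchCanonical): Promise<string> {
  return sha256Hex(JSON.stringify(canonical));
}
```

Because every caller already awaits, the only place the async boundary shows up is the hash call itself; canonicalization can still be used synchronously, for example in dirty-detection code that never needs the hash.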
+ +``` +sdk/src/host-config/ + index.ts # barrel + types.ts # HostConfigInputV2, CanonicalHostConfigV2, HostConfigDtoV2, + # HostConfigMcpProfileV1, CspDomainSet, OpenAi/McpAppsCapabilities, + # HostStyleId (= string), connection defaults, schema-version + capability consts + canonicalize.ts # canonicalizeHostConfigV2 + all private helpers + hash.ts # sha256Hex, toHex (Web Crypto — ASYNC), computeHostConfigHashV2 (async; NO sync variant — see decision) + hydrate.ts # hydrateHostConfigDto + pure core of hydrateHostConfigDtoWithOverrides (takes pre-fetched refs) + defaults.ts # emptyHostConfigInputV2({ defaultHostStyle = "mcpjam" } = {}), hostConfigDtoToInput, resolve* helpers, DEFAULT_* consts + host-policy.ts # [BROWSER-SAFE / pure] extractHostExecutionPolicy, buildHostIterationMetadata, HostExecutionPolicy, ToolExposureSignals + compat-runtime.ts # [BROWSER-SAFE / pure] readOpenAiCompatOverride, compatPresetForHostStyle, resolveOpenAiCompatForHostConfig + tool-visibility.ts# [NODE-ONLY / manager-aware] filterAppOnlyTools (SDK-native), applyVisibilityPolicyAndCountSignals +``` + +**Reuse — do NOT redeclare:** +- `sandbox-policy.ts` — reuse ONLY the compatible **leaf** subtypes: `SandboxCspMode` (`"host-default"|"declared"|"relaxed"`), `SandboxPermissionsMode` (`"resource-declared"|"deny-all"|"custom"`), and the four-directive domain-set shape (`SandboxCspDomainSet` ≅ backend `CspDomainSet`). **Do NOT import `SandboxCspPolicy`/`SandboxPermissionsPolicy` wholesale** — those carry a `deny` field (and the resolver implements a deny step + hosted clamp), but the persisted host-config shape is **allowlist-only — there is no `deny`** (`hostConfigV2.ts:60-63`: "SEP-1865 is allowlist-only; there's no deny concept"). 
Define `mcpProfile.apps.sandbox.csp`/`.permissions` natively in `host-config/types.ts` (mode + `restrictTo`/`allow` only) so the type byte-matches the backend canonical shape; reusing the resolver's policy types would silently reintroduce `deny` and break golden parity. +- `mcp-client-manager/mcp-protocol-version.ts` — `MCP_PROTOCOL_VERSIONS`, `isKnownProtocolVersion`, `isStatelessProtocolVersion` (the canonicalizer's protocol-pin validation + the "stateful pin ⇒ advertise version" rule depend on these). +- `mcp-client-manager/capabilities.ts` — `getDefaultClientCapabilities`, MCP-UI extension consts (replaces the backend's hand-copied default-capabilities block and the client seed). +- `mcp-client-manager/tool-converters.ts:isAppOnlyTool` — the SDK already owns the app-only predicate (already used by `getToolsForAiSdk`); the new `filterAppOnlyTools` builds on it (no new SDK dependency). + +**Export surface:** `index.ts` (Node) exports the full surface. `browser.ts` exports the **browser-safe subset** — all types, `canonicalizeHostConfigV2`, `computeHostConfigHashV2`, defaults, and the **pure** policy/compat helpers (`extractHostExecutionPolicy`, `buildHostIterationMetadata`, `resolveOpenAiCompatForHostConfig`, `readOpenAiCompatOverride`, `compatPresetForHostStyle`) — but **excludes** the manager-aware `filterAppOnlyTools`/`applyVisibilityPolicyAndCountSignals` (they reference `MCPClientManager`/`ToolSet`). This is why `host-policy.ts` (pure) is split from `tool-visibility.ts` (manager-aware): the split keeps the browser barrel clean. The `@mcpjam/sdk/host-config` subpath (public, browser-safe) and `@mcpjam/sdk/host-config/internal` (first-party, not-semver-stable) are both wired in `tsup.config.ts`, `package.json#exports`, and `test:packaging` — landed in Stages 0 and 0c. 
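
The allowlist-only caveat in the reuse list above can be sketched as follows. The two mode unions are quoted from that list; everything suffixed `Sketch` (including the `string[]` type for `allow` and the open-ended domain-set record) is a hypothetical stand-in, since the exact field types are not spelled out here. `containsDenyKey` illustrates the kind of guard the Stage 0 risk note asks a fixture vector to enforce.

```typescript
// Leaf unions quoted from the reuse list.
type SandboxCspMode = "host-default" | "declared" | "relaxed";
type SandboxPermissionsMode = "resource-declared" | "deny-all" | "custom";

// Hypothetical stand-in for the four-directive domain set; directive names are assumptions.
type DomainSetSketch = Record<string, string[]>;

// Persisted shape: mode plus allowlist fields only. There is deliberately no `deny` member.
interface PersistedCspSketch {
  mode: SandboxCspMode;
  restrictTo?: DomainSetSketch;
}

interface PersistedPermissionsSketch {
  mode: SandboxPermissionsMode;
  allow?: string[];
}

interface PersistedSandboxSketch {
  csp?: PersistedCspSketch;
  permissions?: PersistedPermissionsSketch;
}

// Recursively report whether any object inside `value` carries a `deny` key.
// String values such as the "deny-all" mode do not count; only keys do.
function containsDenyKey(value: unknown): boolean {
  if (Array.isArray(value)) return value.some((item) => containsDenyKey(item));
  if (value !== null && typeof value === "object") {
    return Object.entries(value).some(
      ([key, child]) => key === "deny" || containsDenyKey(child)
    );
  }
  return false;
}
```

Defining the persisted types natively, rather than importing the resolver's policy types, means a stray `deny` field is a type error at authoring time and a parity failure at test time.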
+ +--- + +## Stages + +### Stage 0 — SDK host-config module + golden parity fixture *(SDK only)* +- **Port** the type block, `canonicalizeHostConfigV2` + every helper, and `sha256Hex`/`toHex`/`computeHostConfigHashV2` from `convex/lib/hostConfigV2.ts` + `convex/lib/keys.ts`. Replace the backend's `Id<'servers'>` branding with an opaque `type ServerId = string` (the canonicalizer only sorts/dedupes serverIds — semantically safe). +- **Port** the backend's full canonicalize test suite (`tests/convex/hostConfigV2Canonicalize.test.ts`, incl. the byte-pinned canonical JSON) into `sdk/tests/`. +- **Add** `sdk/tests/fixtures/host-config-parity-fixtures.json` (golden vectors covering every `mcpProfile` branch, sandbox csp/permissions/allowFeatures incl. injection guards, `serverConnectionOverrides` + protocol-version override, the stateful-pin cross-field derivation, and undefined-vs-`{}` distinctions) + `sdk/tests/host-config-parity.test.ts` asserting per-row canonical-JSON + sha256 equality and the inlined `__inputHash`. +- **Verify:** `cd sdk && npm run build && npm test && npm run typecheck && npm run test:packaging`. +- **Risks:** (1) field insertion-order in the returned canonical object is load-bearing for the byte-pinned hash — preserve it exactly; a missed `Id<'servers'>` replacement fails SDK typecheck (good). (2) Don't let the `sandbox-policy.ts` resolver's `deny`-bearing policy types leak into the canonical shape (see reuse caveat). Add a fixture vector with `csp.restrictTo` + `permissions.allow` set and assert the canonical output contains **no `deny` key**, so any accidental reintroduction fails parity in both repos. + +### Stage 1 (revised) — Backend imports the SDK canonicalizer *(backend only — ✅ shipped as PR #409, merge `a0a11541`, commit `0d0c9961`)* + +Shipped as a small, single-commit PR; sub-step record kept here as the historical reference for what was done. 
+- **Branched fresh off backend `main`** (not from `9f52006c` on `claude/eloquent-heisenberg-DLDzM`; that branch holds the abandoned hand-mirror approach). +- Added `@mcpjam/sdk@^1.11.0` to backend `dependencies` (the published version that exposes `./host-config/internal`). +- In `convex/lib/hostConfigV2.ts`, **replaced** `canonicalizeHostConfigV2` + `computeHostConfigHashV2` bodies with delegating wrappers over `@mcpjam/sdk/host-config/internal`. Boundary cast: `Id<'servers'> ↔ ServerId = string` (`as unknown as` on the canonical return; the canonicalizer only sorts/dedupes serverIds, semantically safe). Also deleted ~880 lines of now-unreachable private helpers (`sortStringKeys`/`deepSortStringKeys`/`isPlainObject`/`canonicalizeCsp*`/`canonicalizeAllowFeatures`/`canonicalizeMcpProfile`/`canonicalizeServerConnectionOverrides`) so the file is now entirely runtime/ctx-bound code. +- **No parity scaffolding to delete on a fresh branch off main** (the `9f52006c` artifacts — `tests/convex/fixtures/host-config-parity-fixtures.json`, `tests/convex/hostConfigV2Parity.test.ts`, the parity-discipline header note — never landed on main). Kept `hostConfigV2Canonicalize.test.ts` (105 cases) — it now exercises the SDK directly as the single regression for both sides of the import boundary. +- **Hash migration risk landed at zero.** Pre-merge audit of all 15,334 prod `hostConfigs` rows turned up 0 hits on either hash-affecting condition from `3ded440a8` (dedupe-then-sort `serverIds`/`optionalServerIds`; non-finite `requestTimeoutOverride` rejection). `serverConnectionOverrides` is defined on 0 prod rows today, so the timeout-override risk was theoretical. No mint+repoint migration needed. +- **Verified:** `tsc -p convex/tsconfig.json --noEmit` ✓; `npm run test:once -- hostConfigV2Canonicalize` 105/105 ✓; broader hostConfig suites 170/171 (1 pre-existing skip) ✓; `npx convex dev --once --debug-node-apis` bundles cleanly ✓. 
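
A rough sketch of the delegating-wrapper shape described in the Stage 1 notes above, with the boundary cast isolated in one place. `sdkCanonicalizeSketch` is a hypothetical stand-in for the function imported from `@mcpjam/sdk/host-config/internal`, and `BackendServerId` is a stand-in for Convex's branded `Id<'servers'>`; neither is the real type.

```typescript
// Stand-in for Convex's branded Id<'servers'> (the real type comes from the Convex data model).
type BackendServerId = string & { readonly __table: "servers" };
// The SDK side treats server ids as opaque strings.
type ServerId = string;

interface SdkCanonicalSketch {
  serverIds: ServerId[];
}

interface BackendCanonicalSketch {
  serverIds: BackendServerId[];
}

// Hypothetical stand-in for the SDK canonicalizer: it only dedupes and sorts ids,
// which is why the cast below is semantically safe.
function sdkCanonicalizeSketch(input: { serverIds?: ServerId[] }): SdkCanonicalSketch {
  return { serverIds: Array.from(new Set(input.serverIds ?? [])).sort() };
}

// Backend wrapper: no canonicalization logic of its own, just delegation plus the
// `as unknown as` cast on the return value at the import boundary.
function canonicalizeForBackend(input: {
  serverIds?: BackendServerId[];
}): BackendCanonicalSketch {
  return sdkCanonicalizeSketch(input) as unknown as BackendCanonicalSketch;
}
```

Keeping the cast inside the wrapper means the rest of the backend keeps its branded id types, while the canonicalizer itself stays free of `convex/values`.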
+ +### Stage 2 — Client consumes the SDK core *(inspector only — ✅ shipped as PR #2396, merge `be9751a80`)* + +Shipped smaller than originally drafted; this section records what actually landed and why some moves on the original list stayed put. + +**Moved to `@mcpjam/sdk/host-config/internal`** (new file `sdk/src/host-config/defaults.ts`; re-exported by the client so the 65 importers didn't churn): +- Types: `CspDomainSet`, `HostConfigConnectionDefaults`, `HostConfigMcpProfileV1`, `McpProtocolVersion`. +- Constants: `SEP_1865_PERMISSION_FEATURES`, `DEFAULT_TEMPERATURE_V2`. +- Function: `resolveEffectiveMcpProtocolVersion(serverOverride, hostDefault)` — pure one-line `??`. + +**Stayed client-side** (the macro plan's original list overcounted what was portable — see the "stricter aggregate" gotcha in the Implementation-status bullet for the underlying rule): +- `HostConfigInputV2` / `HostConfigDtoV2` — stricter than the SDK on `serverIds`/`optionalServerIds`/`respectToolVisibility` (editor invariant) and on `chatUiOverride` (structured `ChatUiOverride` from `@/lib/client-styles`); moving them would have cascaded `undefined`-checks across ~30 importers. +- `HostStyleId` — closed `ChatboxHostStyle` union, not the SDK's open `string`. +- `emptyHostConfigInputV2` / `hostConfigDtoToInput` — return the client's strict types; tied to editor invariants. +- `resolveClientInfo` / `resolveSupportedProtocolVersions` / `resolveHostInfo` — logically portable but no external demand; deferred. +- All entangled resolvers (`resolveEffectiveHostCapabilities`/`resolveEffectiveCompatRuntime`/`resolveEffectiveMcpAppsCapabilities` + the `mergeOpenAiAppsCapabilities`/`mergeMcpAppsCapabilities`/`hostCapabilitiesOverrideToMatrix` helpers) — coupled to `@/lib/client-styles` + ext-apps; Stage 3+ territory. +- All dirty-detection equality + clone helpers — client-side change-tracking, not part of the canonical model. 
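
The split above (shared leaf types, client-owned stricter aggregate) might look roughly like this. The function name and the "one-line `??`" behavior come from the list above; the field names inside the interfaces and the `string` alias for `McpProtocolVersion` are assumptions made for illustration only.

```typescript
// Assumption: modeled as a plain string here; the SDK's actual type may be narrower.
type McpProtocolVersion = string;

// Shared leaf helper: the per-server override wins, else the host default.
function resolveEffectiveMcpProtocolVersion(
  serverOverride: McpProtocolVersion | undefined,
  hostDefault: McpProtocolVersion | undefined
): McpProtocolVersion | undefined {
  return serverOverride ?? hostDefault;
}

// Shared leaf subtype (field name hypothetical).
interface HostConfigConnectionDefaults {
  protocolVersion?: McpProtocolVersion;
}

// SDK aggregate: the permissive storage-write contract.
interface SdkHostConfigInputSketch {
  serverIds?: string[];
  connectionDefaults?: HostConfigConnectionDefaults;
}

// Client aggregate: the editor's stricter draft contract, composed from the same leaves.
interface EditorHostConfigInputSketch {
  serverIds: string[];
  connectionDefaults: HostConfigConnectionDefaults;
}

// A strict draft is always acceptable where the permissive shape is expected,
// so no conversion logic is needed in this direction.
function toSdkInput(draft: EditorHostConfigInputSketch): SdkHostConfigInputSketch {
  return draft;
}
```

Only the leaves cross the package boundary; the editor keeps its invariants (for example, `serverIds` always present) without forcing `undefined` checks onto every importer.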
+ +**Verified:** SDK 1145/1145 tests including 6 new `host-config-defaults` tests ✓; inspector `typecheck:client` ✓; 24/24 client test files referencing `client-config-v2` → 418/418 tests passing ✓. File shrink: 1080 → 875 lines (−205). Importer count unchanged at 65. + +**Bot-review follow-ups merged on the same PR:** Codex P2 + Cursor Bugbot Low independently flagged that the new `@mcpjam/sdk/host-config/internal` import only resolved against built `sdk/dist/` — the `pretest` doesn't run `build:sdk` (root-level `npm test` does, which is why CI was always green). Fixed by adding source aliases for the subpath to both `client/vitest.config.ts` (commit `d9a42182a`) and `client/vite.config.ts` (commit `6bf81115b`), matching the existing `@mcpjam/sdk/browser` / `/matchers` / `/skill-reference` pattern, so `npm test -w @mcpjam/inspector`, `npm run dev:client`, and `npm run build:client` no longer require a pre-`build:sdk` step in isolated dev workflows. + +### Stage 3 — Move host-policy + compat + `filterAppOnlyTools` into the SDK; rewire BOTH runtimes *(SDK + inspector — ✅ shipped as PR #2407, merge `9c618e1ae`)* + +Shipped largely as drafted, with three corrections that the macro plan should carry forward (logged here as historical reference for what was done): + +- **SDK additions** under `sdk/src/host-config/`: + - `app-only-tool.ts` (pure leaf) — `isAppOnlyTool` moved here so `host-config/internal.ts` can import the predicate without pulling `tool-converters.ts` (which imports `ai` + `@modelcontextprotocol/client`). `tool-converters.ts` now `import`s + re-exports it, so external callers' import paths are unchanged. + - `tool-visibility.ts` — `filterAppOnlyTools` + `applyVisibilityPolicyAndCountSignals` + a structural `ToolMetadataSource` duck-type (`{ getAllToolsMetadata(serverId): Record<...> }`). `MCPClientManager` satisfies this shape, so inspector callers pass their existing manager. 
**Structurally pure — no `MCPClientManager`/`ai` runtime imports** (this was the macro-plan correction that turned "NODE-only" into "browser-safe via duck-type"). Critical because `host-config/internal` is the same subpath the inspector client imports via Vite source alias. + - `host-policy.ts` — `extractHostExecutionPolicy` + `buildHostIterationMetadata` + the `HostExecutionPolicy` / `ToolExposureSignals` types. + - `compat-runtime.ts` — `readOpenAiCompatOverride` + `compatPresetForHostStyle` + `resolveOpenAiCompatForHostConfig` (the first two are now exported, not just internal helpers — Stage 4's `HostRunnerConfig` will want them). +- **SDK barrel:** `internal.ts` re-exports the new symbols; **no new package.json exports / tsup entries / subpath**. `test:packaging` extended to assert all 10 expected exports are functions, so a future ergonomic re-export break would fail the smoke. +- **Live path rewire:** `server/utils/chat-v2-orchestration.ts` drops its local `filterAppOnlyTools` body + the `isToolVisibilityAppOnly` import, then `import { filterAppOnlyTools } from "@mcpjam/sdk/host-config/internal"` and re-exports for callers. The `@modelcontextprotocol/ext-apps` package entry in `mcpjam-inspector/package.json` stays — the dep is still used by 3 other server files (correction over an earlier draft that wanted to remove it). +- **Eval path rewire:** `server/services/evals/host-execution-policy.ts` and `compat-runtime.ts` become thin re-export shims over `@mcpjam/sdk/host-config/internal`. `loadSuiteHostConfig` stays inspector-side (Convex-bound). +- **Inspector server build/test configs:** `server/tsup.config.ts:42` adds `@mcpjam/sdk/host-config/internal` to `noExternal` + esbuild alias; `server/vitest.config.ts:52` adds the source alias + `server.deps.inline` entry (matching the existing SDK-subpath pattern). Without this the server bundles try to resolve the subpath from `node_modules` and the server-side vitest forces a pre-`build:sdk` step. 
+- **Test re-homing:** the inspector's `server/services/evals/__tests__/host-execution-policy.test.ts` was moved to `sdk/tests/host-config-policy.test.ts` (git detected the rename). Two new SDK test files: `host-config-tool-visibility.test.ts` (mock `ToolMetadataSource`) and `host-config-compat-runtime.test.ts` (resolution-order matrix). Net coverage strictly increased. +- **Double-filter regression — DEFERRED to Stage 4.** `convertMCPToolsToVercelTools` already respects `includeAppOnly`; the remaining unconditional drop is in the private raw-`Tool[]` conversion at `sdk/src/TestAgent.ts:90`, and `applyVisibilityPolicyAndCountSignals` runs over the already-converted set so it cannot recover that drop. Wrote `it.skip("...fix at TestAgent.ts:90 (Stage 4)")` in `host-config-tool-visibility.test.ts` so the gap is visible without lying about coverage. The macro plan originally drafted this as a Stage 3 test — that was the second correction. +- **Verified:** SDK 1195 passed + 1 skipped; `npm run typecheck` ✓; `npm run test:packaging` exit 0; `npx esbuild src/host-config/internal.ts --bundle --platform=browser --metafile` → bundle inputs are `host-config/{app-only-tool,canonicalize,compat-runtime,defaults,hash,host-policy,internal,tool-visibility,types}.ts` + `mcp-client-manager/mcp-protocol-version.ts` (already in the graph from `canonicalize.ts`); **no `ai`, no `MCPClientManager`, no `node:*`**. Inspector 4910 passed + 6 skipped; server tsc baseline unchanged vs origin/main; `npm run typecheck:client` + `npm run build:client` clean. +- **Risk realized + closed:** the parity-of-predicates risk (SDK `isAppOnlyTool` vs ext-apps `isToolVisibilityAppOnly`) was confirmed byte-equivalent in the Stage 3 exploration before any code moved — both implementations check that `_meta.ui.visibility` is exactly the single-element array `["app"]`. No focused parity test needed; the existing `chat-v2-orchestration.test.ts` SEP-1865 visibility tests cover the integration end. 
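
A minimal sketch of the structural duck-type idea from the Stage 3 notes above: the filter depends on a one-method interface rather than on `MCPClientManager`, so it stays structurally pure. The interface name and method name come from the notes; the return type, the assumption that each record is the tool's `_meta` object, and the helper names `isAppOnlyMeta`/`filterAppOnlyToolNames` are hypothetical.

```typescript
// Structural duck-type: anything with this method qualifies, including a test mock.
// Assumption: the method returns tool name -> that tool's `_meta` record.
interface ToolMetadataSource {
  getAllToolsMetadata(serverId: string): Record<string, Record<string, unknown>>;
}

// App-only means visibility is exactly the single-element array ["app"].
function isAppOnlyMeta(meta: Record<string, unknown> | undefined): boolean {
  const ui = meta?.ui as { visibility?: unknown } | undefined;
  const visibility = ui?.visibility;
  return Array.isArray(visibility) && visibility.length === 1 && visibility[0] === "app";
}

// Drop tools the model must not see; tools without metadata are kept.
function filterAppOnlyToolNames(
  source: ToolMetadataSource,
  serverId: string,
  toolNames: string[]
): string[] {
  const metadata = source.getAllToolsMetadata(serverId);
  return toolNames.filter((name) => !isAppOnlyMeta(metadata[name]));
}
```

Nothing here imports a manager class or the `ai` package, which is the property the `esbuild --metafile` check in the Verified bullet is guarding.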
+ +### Stage 4 — `TestAgent` → `HostRunner` rename + `Host` as primary spec + `HostRuntime` *(SDK only; breaking — ✅ shipped as PR #2409, merge `8a5c9426b`)* + +Shipped as one breaking SDK major. This section records what actually landed and the trap log from the review cycle; the original draft of Stage 4 was substantially smaller than what merged. + +**Rename rationale (unchanged from the original draft).** In MCP spec vocabulary "host" already names *the thing that drives an LLM with MCP tools*. Shipping `TestAgent` alongside `Host` was a synonym collision; users had to learn that "agent" and "host" mean different things here when in every other context they're synonymous. The split that actually shipped: +- `Host` = the spec / config (immutable snapshot via `Host.toJSON()`). +- `HostRunner` = a synchronous executor over a `Host` with tools pre-resolved. +- `HostRuntime` = a live binding of a `Host` to an `MCPClientManager` (or any structural `HostRuntimeManager`). +- `HostExecutor` = the interface both runners implement; what `EvalTest.run` / `EvalSuite.run` take. + +`EvalTest`/`EvalSuite` keep their names — they wrap a runner, they aren't one. No deprecation aliases; the SDK had no public adopters at the time of the major. + +**What actually shipped (one PR, breaking; codemod in `sdk/CHANGELOG.md`):** + +- `TestAgent` → `HostRunner` (class), `TestAgentConfig` → `HostRunnerConfig`. +- `EvalAgent` → `HostExecutor` (interface). +- `.prompt(message, options)` → `.run(message, options)` on the interface and both impls. +- `Host.addServer` → `requireServer`, `Host.removeServer` → `removeRequiredServer`. +- `EvalTest.run` / `EvalSuite.run` parameter `agent` → `executor`. +- `HostRunnerConfig.host: Host | HostInit | HostJson` (discriminated union: caller supplies either `host` with optional `model`, or `model` with no host — a config missing both is a compile-time error). 
Snapshotted once via `snapshotHostSource(...)`; pre-snapshotted `HostJson` (the `HostRuntime.run()` path) passes through unchanged. Accessors: `getHostSnapshot()`, `getHostPolicy()`. +- `HostRuntime` + `host.withManager(manager, { apiKey, ...defaults })`. Structural `HostRuntimeManager` (`hasServer` + `getToolsForAiSdk` + optional `getServerReplayConfigs`). `.run()` snapshots the live host afresh, validates required server ids (`assertHostServersKnown`), resolves tools, dynamic-imports `HostRunner`. **Stateless across turns**: history accumulates for inspection but does NOT auto-replay; multi-turn continuity stays explicit via `PromptOptions.context`. +- One-shot sugar `host.run(input, { apiKey, mcpClientManager, ... })` delegating through a throwaway runtime. +- Single-gated app-only filter at `HostRunner` tool-prep — `convertToToolSet` is now a pure converter. `withOptions` preserves `rawTools` so host-replacement clones re-run the prep step under the new host's policy. +- SDK-owned OpenAI compat decision via `resolveOpenAiCompatForHostConfig(hostSnapshot)`; existing `injectOpenAICompat` widget primitive unchanged. +- Per-iteration host snapshot stamping (`IterationResult.hostSnapshot?: HostJson`). For `HostRuntime`-backed runs, mutation between iterations is reflected per-iteration in metadata rather than collapsed to the upload-time state. `wrapAgentWithAbortSignal` was updated to forward `getHostSnapshot` and `getServerReplayConfigs` so the iteration wrapper preserves the introspection surface. +- `MCPJamReportingConfig.host?: Host` field added — wire send deferred to Stage 5. +- Inspector imports of Stage 3 helpers switched from local re-export shims to direct `@mcpjam/sdk/host-config/internal`. Pure shim `host-execution-policy.ts` deleted; `compat-runtime.ts` slimmed to keep only `loadSuiteHostConfig` (Convex-bound, can't move). 
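
The "either `host` with optional `model`, or `model` with no host" rule in the list above can be sketched as a discriminated union. `HostLikeSketch`, `HostRunnerConfigSketch`, and `resolveModel` are hypothetical names, and the precedence shown (an explicit `model` wins over the host's model) is an assumption rather than confirmed SDK behavior.

```typescript
// Hypothetical, reduced stand-in for Host | HostInit | HostJson.
interface HostLikeSketch {
  style: string;
  model?: string;
}

// Two legal shapes: a host (model optional), or a bare model (no host).
type HostRunnerConfigSketch =
  | { host: HostLikeSketch; model?: string; apiKey: string }
  | { host?: undefined; model: string; apiKey: string };

// Assumed precedence: explicit config.model first, then the host's model.
function resolveModel(config: HostRunnerConfigSketch): string {
  const model = config.model ?? config.host?.model;
  if (model === undefined) {
    throw new Error("No model: set config.model or provide a host that carries one");
  }
  return model;
}

// A config with neither `host` nor `model` matches no union member, so this
// line would not compile if uncommented:
// const invalid: HostRunnerConfigSketch = { apiKey: "key" };
```

The union catches the "missing both" case at compile time; the runtime check remains for a host that is supplied but carries no model of its own.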
+ +**Bundle isolation verified.** `HostRunner.ts` appears in the `host-config/index.ts` esbuild graph only as `kind: dynamic-import`, so browser bundlers (Vite/webpack/Rollup) can tree-shake `ai` out of a `Host`/`HostRuntime` import. `host-config/internal.ts` graph is host-config files + `mcp-protocol-version.ts` only; no `ai`, no `MCPClientManager`, no `node:*`. + +**Trap log (review-cycle findings worth carrying forward).** Five review rounds across Cursor Bugbot and chatgpt-codex-connector. Six findings landed real fixes on PR #2409 before merge; a seventh prompted PR #2414 (`needsApproval` forwarding) which was **closed without merging** because the proposed fix actually breaks evals — see the dedicated trap log under the Implementation-status bullet for the full rationale. Lessons summarized at the top of the Stages 5+6 bullet. + +**Did NOT ship in Stage 4 (deferred to Stage 5 or later):** +- Wire-level `hostConfig` / `hostConfigHash` propagation through `/sdk/v1/evals/*` — Stage 5. +- Per-turn host snapshot capture inside `PromptResult` — mid-iteration mutation between turns in a multi-turn testFn is still rolled up to the iteration-end snapshot. Future stage. +- `EvalTestConfig` / `EvalSuiteConfig` host fields — dropped in favor of executor-only routing (`getHostSnapshot?.()` on the executor is the source of truth; no separate config field). + +### Stage 5 — Connect hostConfig to `/sdk/v1/evals/*` ingestion *(shipped as a 3-step sequence — see ✅ Stage 5 bullet above for what actually landed; the design notes below are preserved as decision provenance)* + +> **Shipped delta from this design.** The original plan had two PRs (backend + SDK reporter). User refined it during scoping to three PRs (SDK helper-only → backend → SDK reporter) so the backend imports the normalizer from npm rather than vendoring a temporary copy. 
Wire shape chose flat `{ hostConfig, hostConfigHash }` at the top level of the request body; capability advertised at `/sdk/v1/info` as nested `{ "capabilities": { "evalsHostConfig": 1 } }` (unauthenticated). SDK published as 1.12.0 (Stage B + Step 1 changesets collapsed). The pass-1 server-id bridge is "strip on both sides via shared normalizer" (option chosen in plan); the larger external-host/server-ref shape remains future work. + +Today the backend resolves hostConfig **server-side from the suite**; the external report payload carries no hostConfig, so an external agent's host behavior is invisible. Make it additive — this changes only *how the per-iteration hostConfig row is sourced*, **not** the eval storage/session model (which stays separate from chat per the principles above): + +**Pass-1 scope: run-level hostConfig only, with a homogeneity gate.** +- Accept a single `hostConfig` at the run boundary (`runs/start`/`report`) and materialize per-iteration `hostConfigId` rows from it using the existing `materializeEvalIterationHostConfig` + `advancedConfig` overlay path. +- Do **not** add per-result/per-iteration hostConfig in the first pass. +- Do **not** choose "iteration 1" as a representative snapshot. That silently corrupts reporting when a `HostRuntime`-backed eval mutates the bound `Host` between iterations. +- Instead, canonicalize every available `iteration.hostSnapshot`, compute the hash for each, and send run-level `{ hostConfig, hostConfigHash }` only if all hashes match. If snapshots differ, omit run-level host config for pass 1. If the reporter has a strict/debug mode, fail with a clear message: "heterogeneous host snapshots require per-iteration hostConfig wire support." + +**SDK reporter source-of-truth.** +- Primary source: `iteration.hostSnapshot` (the per-iteration capture added in Stage 4 via `eval-result-mapping.ts:resolveIterationHostExtras`). 
+- Fallback source: `executor.getHostSnapshot?.()` only for legacy/custom executors whose iterations have no `hostSnapshot`. +- Last-resort explicit source: `MCPJamReportingConfig.host`, if the caller supplied it and neither iteration nor executor snapshots exist. Treat this as compatibility/fallback, not the recommended path. +- Never call `executor.getHostSnapshot()` at report time and stamp every iteration with that value when per-iteration snapshots exist. For `HostRuntime`, the bound `Host` may have changed by then; Stage 4 specifically captured iteration snapshots to avoid that bug. +- Mid-iteration mutation between turns is still rolled up to the iteration-end snapshot because `PromptResult` does not yet carry per-turn host snapshots. Document that limitation in the Stage 5 PR. + +**Wire shape and canonicalization.** +- Reporter sends canonical wire shape, not public `HostJson`: call `canonicalizeHostConfigV2(snapshot)` + `computeHostConfigHashV2(canonical)` from `@mcpjam/sdk/host-config/internal` and send `{ hostConfig, hostConfigHash }`. +- `hostConfigHash` is a transport-integrity check only. It detects tampering/corruption between SDK and backend; it is not cross-version drift detection because both sides use the same first-party canonicalizer. +- No public `Host.hash()` returns. The reporter is first-party SDK code and may import `/internal`. +- Helpers in `/internal` already tolerate both canonical (`hostStyle`/`mcpProfile`) and public (`style`/`mcp`) shapes for in-SDK policy/compat callers. The eval wire payload should still be one canonical shape. + +**Backend ingestion contract.** +- `convex/http.ts` `/sdk/v1/evals/runs/start` and `/sdk/v1/evals/report` parse optional `hostConfig` + `hostConfigHash`. +- `convex/sdkEvals.ts:internalStartSdkRun` accepts the normalized/sanitized hostConfig payload and wires it into the existing run/iteration materialization path. 
+- When client hostConfig is present, the backend recomputes the hash server-side after applying the exact same normalization/sanitization that the client used for the transmitted shape. Reject on mismatch. +- Existing payloads without hostConfig must behave exactly as they do today and continue to fall back to suite/default server-side host resolution. + +**Important server-id mismatch.** +- Backend `hostConfigInputV2Validator` currently models `serverIds` / `optionalServerIds` as Convex `Id<'servers'>` and `ensureHostConfigV2(...)` validates them against project server scope. +- SDK `Host.requireServer("everything")` stores runtime-manager ids. Those ids are meaningful to `HostRuntimeManager.hasServer(...)` and `getToolsForAiSdk(...)`; they are **not** Convex `servers` table ids. +- Stage 5 must not pass SDK runtime ids directly into `hostConfigInputV2Validator`, `validateServerScope`, or `ensureHostConfigV2` as `Id<'servers'>`. +- Minimal pass-1 bridge: define an SDK-evals hostConfig normalizer that strips or empties `serverIds`, `optionalServerIds`, and server-specific connection overrides before backend storage/hash validation, while preserving exact external runtime/server replay identity in `serverReplayConfigs`. +- If preserving required/optional external server ids in `hostConfigs` is product-critical, do not overload the Convex-id fields. Add an explicit external-host/server-ref shape instead. That is larger than pass 1 and should be a separate design/PR. +- Whatever bridge is chosen, client and server must hash the same normalized shape. Do not let the SDK hash `"everything"` while the backend strips it before recomputing. + +**Capability and deploy compatibility.** +- Backend lands and deploys first. +- Add a concrete SDK-facing capability, preferably a small endpoint or response field such as `{ evalsHostConfig: 1 }`; avoid semver parsing as the primary gate. 
+- SDK reporters send `hostConfig` only when that capability is present, unless an explicit opt-in/test override is used. +- Do not rely on old backends ignoring unknown request fields. Even if today's routes manually pluck fields, future or alternate deployments may validate strictly and 400. + +**Do NOT forward execution-gating policy fields.** +- The eval-side execution path does not enforce `requireToolApproval` via AI SDK `needsApproval`; forwarding it skips tool execution until an approval response arrives, and evals do not have that response channel. +- `requireToolApproval` remains reported metadata (`approvals_would_require`), not an execution gate. +- Any future review that argues "the chat path forwards X, evals should too" must first verify whether X is reporting-only or execution-gating. + +**Recommended PR split.** +1. Backend PR: capability endpoint/flag; optional hostConfig parsing; SDK-evals hostConfig normalizer for the server-id bridge; server-side hash recompute/reject; storage/materialization wiring; tests for old payload compatibility, matching hash, tampered hash, external runtime server ids, and capability response. +2. SDK PR: reporter derives from `iteration.hostSnapshot`; homogeneity gate; fallback ordering (`iteration.hostSnapshot` → `executor.getHostSnapshot` → `MCPJamReportingConfig.host`); capability-gated send; one-shot and chunked reporter parity; tests for homogeneous send, heterogeneous omit/strict-fail, old-backend capability absence, and hash body shape. +3. E2E/docs PR: external-runtime smoke and docs refresh once both sides are deployed. + +**Verify.** +- Backend `convex-test`: POST `/report` and `/runs/start` without hostConfig still behaves unchanged. +- Backend `convex-test`: matching normalized hostConfig/hash stores/materializes expected `hostConfigId`. +- Backend `convex-test`: tampered hash rejects. 
+- Backend `convex-test`: SDK runtime server ids like `"everything"` do not get cast to Convex ids; behavior follows the chosen bridge. +- SDK reporter tests: body includes hostConfig only when capability is present and snapshots are homogeneous. +- SDK reporter tests: heterogeneous snapshots do not silently choose the first snapshot. + +### Stage 6 — Docs refresh *(both repos — ✅ shipped as inspector PR #2439 + mcpjam-backend PR #433; the bullet list below is preserved as the design checklist for what landed)* + +> **Shipped delta.** Everything listed below shipped substantially as drafted. The bulk rename (`TestAgent` → `HostRunner`, `EvalAgent` → `HostExecutor`, `.prompt()` → `.run()`, `Host.addServer()` → `Host.requireServer()`, `TestAgentOptions` → `HostRunnerOptions`) hit 18 mdx files plus the SDK README; the `docs/sdk/reference/test-agent.mdx` page was renamed to `host-runner.mdx` with a redirect added to `docs.json`. New "Spec-first: `Host` + `HostRuntime`" section in `docs/sdk/index.mdx`; new "Bring your own host" section in `docs/sdk/concepts/running-evals.mdx`; new "Run-level host snapshot" section in `docs/sdk/reference/eval-reporting.mdx` documenting Stage 5 wire semantics end-to-end (capability probe, fail-safe-to-omit, source order, pass-1 homogeneity gate, server-id normalization, wire-hash-vs-storage-hash distinction). `MCPJamReportingConfig.host` field is documented as fallback/override (not the recommended primary path). SDK `CHANGELOG.md` restructured into Unreleased (Stage 5 Step 3 reporter wire-send), 1.12.0 (Stage 5 Step 1 normalizer + Stage B canonicalizer tightening), 1.11.0 (Stage 4 rename). Backend `convex/lib/hostConfigV2.ts` got a file-header comment naming `@mcpjam/sdk/host-config/internal` as the canonicalizer source of truth so a future reader doesn't hand-patch a parallel implementation. 
The "stale" `docs/contributing/playground-architecture.mdx` + `evals-architecture.mdx` got top-of-file "HostConfig consolidation" sections describing the current architecture (one-source-of-truth module, Stage 3 import paths, Stage 4 rename surface, per-iteration host snapshot capture, Stage 5 wire-send, two-sandbox-layer distinction); the existing pre-chat-v2 sections below were left as-is with a stale-doc warning at the top rather than rewriting 2000+ lines of historical reference. Verified: SDK typecheck ✓, `test:packaging` ✓ (all 11 expected `/host-config/internal` exports load), `docs.json` valid JSON ✓. + +- `sdk/README.md` + inspector `docs/`: "Run evals from your own runtime with a `Host`" (construct via `new Host({...}).requireServer("everything")`, bind with `host.withManager(mcpClientManager, { apiKey })` when servers are live/dynamic, or pass a static `Host`/`HostJson` into `HostRunner` when tools are already resolved). Document that `HostRuntime.run()` calls are stateless by default; multi-turn continuity stays explicit via `PromptOptions.context`. +- Document Stage 5 reporting semantics: reporter derives host config from `iteration.hostSnapshot`; pass 1 sends run-level hostConfig only for homogeneous snapshots; heterogeneous per-iteration host configs require later wire support. `MCPJamReportingConfig.host` is fallback/override, not the recommended primary path. +- Note the `TestAgent` → `HostRunner`, `EvalAgent` → `HostExecutor`, `.prompt()` → `.run()`, and `Host.addServer()` → `Host.requireServer()` rename in the SDK changelog with a one-line codemod. +- Note in `convex/lib/hostConfigV2.ts` that the canonicalizer is imported from `@mcpjam/sdk/host-config/internal` (one source of truth; no hand-mirror), and document the `/internal` subpath as first-party-only in `sdk/README.md`. +- Refresh the **stale** `docs/contributing/playground-architecture.mdx` + `evals-architecture.mdx` (they describe the pre-`chat-v2` world and never mention hostConfig). 
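As a rough illustration of the Stage 5 server-id normalization and wire-hash check that these docs describe, the sketch below shows why client and server must hash the same normalized shape. It is not the SDK's implementation: `stableStringify`, `normalizeForWire`, `hashForWire`, and `verifyWirePair` are hypothetical stand-ins for `canonicalizeHostConfigV2`, `normalizeSdkEvalHostConfigForWire`, and `computeHostConfigHashV2`, the SHA-256 choice is an assumption, and the sample field values are placeholders. Server-specific connection overrides are left out for brevity.

```typescript
import { createHash } from "node:crypto";

type WireHostConfig = Record<string, unknown>;

// Hypothetical stand-in for the SDK canonicalizer: serializes with sorted
// object keys so key order does not change the resulting string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .filter(([, v]) => v !== undefined)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    const body = entries
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`)
      .join(",");
    return `{${body}}`;
  }
  // JSON.stringify(undefined) yields undefined at runtime, so fall back to "null".
  return JSON.stringify(value) ?? "null";
}

// Empties the server-id fields so SDK runtime-manager ids such as "everything"
// never reach the hash or the backend's Convex-id validators (assumed bridge).
function normalizeForWire(config: WireHostConfig): WireHostConfig {
  return { ...config, serverIds: [], optionalServerIds: [] };
}

// Both sides call this, so they always hash the same normalized shape.
function hashForWire(config: WireHostConfig): string {
  const canonical = stableStringify(normalizeForWire(config));
  return createHash("sha256").update(canonical).digest("hex");
}

// Server side: re-normalize, recompute, and reject on mismatch.
function verifyWirePair(body: {
  hostConfig: WireHostConfig;
  hostConfigHash: string;
}): boolean {
  return hashForWire(body.hostConfig) === body.hostConfigHash;
}

// Placeholder snapshot; the field values are illustrative only.
const snapshot: WireHostConfig = {
  hostStyle: "example-style",
  model: "example-model",
  serverIds: ["everything"],
};
const requestBody = {
  hostConfig: normalizeForWire(snapshot),
  hostConfigHash: hashForWire(snapshot),
};
console.log(verifyWirePair(requestBody));
console.log(verifyWirePair({ ...requestBody, hostConfigHash: "0".repeat(64) }));
```

Because normalization runs inside `hashForWire`, the client cannot hash `"everything"` while the server strips it; the untouched pair verifies and the tampered pair does not.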
+ +--- + +## Cross-repo ordering + +1. **SDK (inspector)** lands the host-config module + `Host` facade + `/internal` subpath (Stages 0 / 0b / 0c — ✅ done) and is **published as `@mcpjam/sdk@1.11.0`** — ✅ publish gate cleared. +2. **Backend Stage 1** ✅ shipped (mcpjam-backend PR #409, merge `a0a11541`). Bumped `@mcpjam/sdk` to `^1.11.0` and swapped to the SDK-imported canonicalizer; ~880 lines of dead helpers deleted; prod audit confirmed hash-neutral in practice. +3. **Inspector Stage 2** ✅ shipped (inspector PR #2396, merge `be9751a80`). Client re-exports the SDK leaf primitives; file shrank 1080 → 875 lines; 65 importers unchanged. Bot follow-up commits added vitest + vite source aliases for `/internal`. +4. **Stage B — canonicalizer tightening** ✅ shipped (inspector PR #2400, merge `40bf746e7`). All four deferred #2392 items + a prototype-guard fix in one SDK PR. Hash-neutral against 15,389 prod rows. SDK bumped to 1.12.0 via changeset. The backend bump to `^1.12.0` is the only remaining cross-repo coordination; its existing canonicalize suite picks up the tightened behavior automatically. +5. **SDK + Inspector Stage 3** ✅ shipped (inspector PR #2407, merge `9c618e1ae`). Host-policy + compat + `filterAppOnlyTools` extracted to SDK; both runtimes rewired; eval-path → live-path entanglement deleted; one stale-pre-Stage-4 raw-`Tool[]` drop at `TestAgent.ts:90` marked `it.skip`. +6. **SDK Stage 4** ✅ shipped (inspector PR #2409, merge `8a5c9426b`). `TestAgent` → `HostRunner` + `EvalAgent` → `HostExecutor` rename; `Host` as primary `HostRunnerConfig` spec; `HostRuntime` + `host.withManager(manager, { apiKey })`; single-gated app-only filter (Stage 3 `it.skip` is now a real assertion); per-iteration host snapshot capture for `HostRuntime`-backed evals; inspector imports of Stage 3 helpers redirected to `@mcpjam/sdk/host-config/internal` and the local re-export shims deleted. 
Follow-up PR #2414 (`needsApproval` forwarding) **closed without merging** — see Stage 4 trap log; the eval path correctly does not forward `needsApproval` because it has no approval-response channel and `requireToolApproval` is a "would prompt" metadata signal, not an execution gate. +7. **Stage 5** ✅ shipped in 3 PRs across both repos. (a) **Step 1** — inspector PR #2422 (merge `9e3535884`): SDK helper `normalizeSdkEvalHostConfigForWire` published as part of `@mcpjam/sdk@1.12.0`. (b) **Step 2** — mcpjam-backend PR #427 (merged): `/sdk/v1/info` capability + `/sdk/v1/evals/*` ingestion accepts `{ hostConfig, hostConfigHash }`; bumped `@mcpjam/sdk` to `^1.12.0`. (c) **Step 3** — inspector PR #2423 (merged): SDK reporter wire-send capability-gated against `/sdk/v1/info`; source order `iteration.hostSnapshot → executor.getHostSnapshot?.() → MCPJamReportingConfig.host`; pass-1 homogeneity gate; fail-safe wrap on the wire-resolve pipeline (Cursor Bugbot follow-up). One incidental detour: release rollup PR #2427 pre-ran `changeset version` locally and broke the release workflow's "Select release scope" step (reverted via #2431); see Stage 5 trap log for the lesson. The Step 3 reporter changeset (`.changeset/stage5-step3-sdk-reporter.md`) is still unpublished at the time of Stage 6's merge — see [Remaining work](#remaining-work). +8. **Stage 6** ✅ shipped (inspector PR #2439, merged + mcpjam-backend PR #433, merged). Public-facing docs caught up to the Stage 4 rename + Stage 5 wire-send + canonicalizer single-source-of-truth split; backend `convex/lib/hostConfigV2.ts` file-header annotated. + +No remaining cross-repo coupling at the code level. The runtime-server-id bridge ships in Step 1 as a single normalizer used by both sides (defense-in-depth client AND server); the publish gate is gone (Step 1 SDK helper is on npm); the golden-vector/`__inputHash` parity ritual is also gone. 
The only remaining gate is the SDK publish that includes Stage 5 Step 3's reporter changeset — see [Remaining work](#remaining-work). + +--- + +## Entanglement flags / what stays put + +- **`filterAppOnlyTools` / `applyVisibilityPolicyAndCountSignals`** — live version depends on `@modelcontextprotocol/ext-apps/app-bridge` (inspector dep). Reimplement on the SDK's own `isAppOnlyTool` + `MCPClientManager.getAllToolsMetadata`. **Move, don't inject.** +- **Client `resolveEffective*` helpers** — coupled to the client host-style registry + ext-apps; **stay in client**. Only the empty/dto/protocol helpers are portable. +- **No SDK default for `style`/`model`** — RESOLVED in user commit `22c7df8`: `Host` requires both (kept type-optional on `HostInit` so the `setStyle()`/`setModel()` setter pattern works, but `toJSON()`/`hash()` throw a clear error if either is missing). The SDK deliberately ships no default `style` (an external author isn't silently opted into MCPJam chrome) and no default `model`. Client/backend surfaces that want a default pick it at their own call site. +- **`loadSuiteHostConfig`** + all ctx/db-bound functions (`ensureHostConfigV2`, `getProjectDefaultHostConfig`, `getSuiteHostConfig`, `resolveEvalBaseHostConfigV2`, `snapshotHostConfigForRun`, `materializeEvalIterationHostConfig`, `hostConfigInputV2Validator`) — **stay in backend**. +- **`Id<'servers'>`** branding — already replaced with opaque `string` inside the SDK canonicalizer boundary, but backend storage still uses real Convex server ids. Stage 5 must not confuse SDK runtime-manager ids with backend `Id<'servers'>`; normalize or model them separately. + +## Remaining work + +All stages 0–6 are merged. What's left is operational, not architectural: + +1. **Cut the SDK release that publishes Stage 5 Step 3 (reporter wire-send).** The changeset `.changeset/stage5-step3-sdk-reporter.md` is still sitting in the inspector repo at the time of Stage 6's merge. 
The published `@mcpjam/sdk@1.12.0` includes the helper (`normalizeSdkEvalHostConfigForWire`, Stage 5 Step 1) but NOT the reporter code that uses it — so even though the backend at `mcpjam-backend` PR #427 advertises `evalsHostConfig` at `GET /sdk/v1/info` and accepts `{ hostConfig, hostConfigHash }` at the ingestion routes, no SDK-side reporter actually sends the pair until this release ships. Per the Stage 5 trap log: do NOT pre-run `changeset version` locally; let `release.yml` run `changeset version` + `publish` itself. The release will most likely bump to `@mcpjam/sdk@1.13.0` (single minor changeset from baseline 1.12.0); confirm the actual version with `npm view @mcpjam/sdk version` rather than hard-coding it in commit messages or docs. + +2. **Post-deploy verification of the merged Stage 5 + Stage 6 PRs.** Quick sanity checks on production once (1) ships: + - `GET https://api.mcpjam.com/sdk/v1/info` (or however the route is mounted in prod) returns `{ "capabilities": { "evalsHostConfig": 1 } }`. Confirms backend PR #427 deployed cleanly. + - Drive a one-shot eval upload from the published SDK against the production ingest endpoint; verify the run's `configSnapshot.sdkEvalHostConfigBase` matches the SDK-normalized wire input, then verify the persisted iteration's `hostConfigId` points at a v2 `hostConfigs` row whose `configHash` is the storage hash after suite-resolved Convex `serverIds` are layered in. That stored hash is expected to differ from the wire `hostConfigHash` by design — wire hash strips runtime server ids, storage hash includes the resolved Convex ids. A 200 on `POST /sdk/v1/evals/runs/start` with a valid pair and a 400 on a tampered hash both prove the integrity check. + +3. **External-runtime smoke (plan §End-to-end verification item 5).** Still the *actual* proof external users get the same host behavior as the inspector. 
Build a tiny `@mcpjam/sdk` script (likely under `examples/`) that: + - Constructs `new Host({ style, model }).requireServer("everything")`. + - Binds via `host.withManager(mcpClientManager, { apiKey })`. + - Drives an `EvalSuite` against a local MCP server. + - Reports to `/sdk/v1/evals/*` and asserts the run snapshot (`configSnapshot.sdkEvalHostConfigBase`) equals the normalized host config the SDK reporter sent at the wire boundary. + - Asserts the persisted iteration's `hostConfigId` points at a v2 `hostConfigs` row whose storage `configHash` reflects the normalized host policy plus suite-resolved Convex `serverIds`; this hash should not be compared to the wire `hostConfigHash`. + - Runs twice: once with `respectToolVisibility` unset (app-only tools dropped) and once with `respectToolVisibility: false` (preserved). + - Runs a third heterogeneous-host smoke (mutate the bound `Host` between iterations) and verifies pass-1 omits run-level `hostConfig` instead of stamping every iteration with the first snapshot. + + Needs a real LLM API key (OpenAI / Anthropic / etc.) so it can't be automated in CI without secrets. Worth landing the script first so the smoke is reproducible by anyone with credentials; then optionally wire it as a manually-triggered workflow. + +4. **(Optional) Schedule the deferred per-turn host snapshot capture.** Stage 4 left mid-iteration host mutation between turns (inside a multi-turn `testFn`) rolled up to the iteration-end snapshot. The fix is to thread `hostSnapshot` into `PromptResult` and have `HostRuntime.run()` re-snapshot per-call. Pure SDK work, no backend changes; only worth doing if a user actually hits the limitation in practice. + +## Guardrail tests that must stay green (and be extended) +`importChatSessionToTestCase.parity.test.ts`, `evalIterationHostConfig.test.ts`, `chatboxHostConfigPin.test.ts`, `hostConfigCanonicalize.test.ts`, `hostConfigV2Canonicalize.test.ts` (backend). 
**New invariants to add:** (1) a cross-surface parity test asserting a playground run and an eval run resolve to the *same* canonical hostConfig where **"same inputs" is defined precisely** as an identical full backend `HostConfigInputV2` **plus the same resolved Convex server IDs**; scope it to that full-input path only because direct chat's reduced ingestion payload does NOT preserve every hostConfig field today. (2) SDK reporter tests asserting Stage 5 does not silently choose a representative snapshot when iteration host snapshots differ. (3) Backend ingestion tests asserting SDK runtime server ids are normalized/modeled explicitly and never accepted as Convex `Id<'servers'>`. + +--- + +## End-to-end verification + +1. **SDK:** `cd sdk && npm run build && npm test && npm run typecheck && npm run test:packaging` — host-config module, canonicalize suite, golden fixture, `Host` facade tests pass; `/host-config` + `/internal` subpaths import. +2. **Backend (post-publish):** `npx convex dev --once --debug-node-apis` bundles cleanly (proves the `/internal` import is isolate-safe) and `npm run test:once -- hostConfigV2Canonicalize` is 105/105 against the SDK import; plus the ingestion tamper-check test + existing eval guardrails. +3. **Inspector:** typecheck + build + eval/chat-v2 test suites; `grep` confirms (a) client has no local hostConfig type declarations, (b) eval path no longer imports `chat-v2-orchestration`. +4. **One-canonicalizer check:** `grep` confirms the backend's `canonicalizeHostConfigV2`/`computeHostConfigHashV2` are re-exports from `@mcpjam/sdk/host-config/internal` (no local body) and the duplicated fixture + `hostConfigV2Parity.test.ts` are gone. +5. 
**External-runtime smoke (the actual goal):** a tiny `@mcpjam/sdk` script builds `new Host({ style, model }).requireServer("everything")`, binds it with `host.withManager(mcpClientManager, { apiKey })`, drives an `EvalSuite` against a local MCP server, reports to `/sdk/v1/evals/*`, and confirms the run snapshot (`configSnapshot.sdkEvalHostConfigBase`) equals the SDK-normalized wire input while the persisted iteration's `hostConfigId` points at a v2 `hostConfigs` row whose storage hash includes suite-resolved Convex `serverIds`. Confirm app-only tools are filtered per policy. **Run the smoke twice**: once with `respectToolVisibility` unset (app-only tools dropped) and once with `respectToolVisibility: false` (app-only tools preserved). Also run a heterogeneous-host smoke where the `Host` mutates between iterations and verify Stage 5 pass 1 omits/strict-fails run-level hostConfig instead of stamping every iteration with the first snapshot.
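The heterogeneous-host check in item 5 exercises the pass-1 homogeneity gate together with the capability gate and the reporter's source order. A minimal sketch of that decision follows, under stated assumptions: `resolveRunLevelHostConfig`, `ReporterInputs`, `GateResult`, and `stableKey` are hypothetical names for illustration, not the SDK's reporter API, and `stableKey` is a simplified stand-in for canonicalizing and hashing each snapshot.

```typescript
type HostSnapshot = Record<string, unknown>;

interface ReporterInputs {
  iterationSnapshots: Array<HostSnapshot | undefined>; // iteration.hostSnapshot
  executorSnapshot?: HostSnapshot; // executor.getHostSnapshot?.()
  reportingHost?: HostSnapshot; // MCPJamReportingConfig.host
  capabilities?: { evalsHostConfig?: number }; // parsed from /sdk/v1/info
  strict?: boolean; // assumed strict/debug mode
}

type GateResult =
  | { send: true; hostConfig: HostSnapshot }
  | { send: false; reason: string };

// Simplified canonical key: JSON with recursively sorted object keys.
function stableKey(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableKey(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .filter(([, v]) => v !== undefined)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    return `{${entries
      .map(([k, v]) => `${JSON.stringify(k)}:${stableKey(v)}`)
      .join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

function resolveRunLevelHostConfig(inputs: ReporterInputs): GateResult {
  // Capability gate: never send to a backend that has not advertised support.
  if (!inputs.capabilities?.evalsHostConfig) {
    return { send: false, reason: "backend capability absent" };
  }

  // Primary source: per-iteration snapshots, compared by canonical key.
  const captured = inputs.iterationSnapshots.filter(
    (snapshot): snapshot is HostSnapshot => snapshot !== undefined,
  );
  if (captured.length > 0) {
    const distinct = new Set(captured.map((snapshot) => stableKey(snapshot)));
    if (distinct.size > 1) {
      const message =
        "heterogeneous host snapshots require per-iteration hostConfig wire support.";
      if (inputs.strict) throw new Error(message);
      return { send: false, reason: message };
    }
    return { send: true, hostConfig: captured[0] };
  }

  // Fallbacks apply only when no iteration carried a snapshot.
  const fallback = inputs.executorSnapshot ?? inputs.reportingHost;
  return fallback
    ? { send: true, hostConfig: fallback }
    : { send: false, reason: "no host snapshot available" };
}
```

The gate never picks "iteration 1" as a representative: differing snapshots either omit the run-level pair or, in the assumed strict mode, fail with the message quoted in the Stage 5 notes.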