Fix production loop readiness gaps#22
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9489723bd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return [*argv[:2], *dev_args, *argv[2:]] | ||
| return argv | ||
|
|
||
| if _is_python_gate_payload(argv): | ||
| return ["uv", "run", *dev_args, *argv] |
There was a problem hiding this comment.
Prevent uv gates from mutating the worktree
When this wrapper runs in a target repo with a missing or stale uv.lock, uv run can resolve/sync and write uv.lock before policy_node inspects the worktree; uv run --help documents --locked as asserting the lockfile remains unchanged and --frozen as running without updating uv.lock. That means an otherwise allowed Python change can fail policy because the gate, not the backend, added or modified uv.lock outside allowed_paths. Add --locked/--frozen or otherwise prevent/clean gate side effects when synthesizing these uv commands.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: synthesized and rewritten uv run gates now add --locked unless --locked or --frozen is already present. Regression coverage is in tests/test_gate_runner.py.
|
Addressed the Codex review: synthesized/rewritten |
ae08c02 to
986b750
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 986b750712
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| update["gate_commands"] = select_gate_commands_for_mutations( | ||
| changed_paths or list(mutations) | ||
| changed_paths or list(mutations), | ||
| cwd=gate_worktree, |
There was a problem hiding this comment.
Select Python gates per worktree
When a run has multiple worktrees, this passes only the first worktree into gate selection, but gate_execution_node later runs the resulting global gate_commands in every worktree. If the Python repo with a dev dependency group is not first, the run falls back to compileall and skips the intended pytest/ruff/mypy gates for that repo; if only the first repo has the dev group, those uv gates get run in unrelated worktrees. Derive gates per worktree/repo instead of using a single cwd for the whole multi-repo run.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: auto-selected gates are now stored as gate_commands_by_repo and selected/executed per worktree, using each repo own pyproject.toml. Added multi-worktree regression coverage in test_auto_gate_selection_is_per_worktree.
| "max_wall_clock_minutes": config.max_wall_clock_minutes_per_run, | ||
| "max_cost_usd": config.max_cost_usd_per_run, | ||
| }, | ||
| backend_budget=backend_budget_for_issue(item, config), |
There was a problem hiding this comment.
Clamp issue budgets to remaining daily cost
When the ledger already has spend earlier in the UTC day, an issue labeled loop:budget-xl is still handed a $10 run budget here; the daily guard above only checks whether the ledger is already at the cap before starting. For example, with $5 already spent under the default $10 daily cap, the next XL run can be budgeted for another $10 and update the ledger to $15, so the new label bypasses the promised daily cap. Clamp max_cost_usd to the remaining daily budget or skip the run when none remains.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: issue-label run budgets are clamped to the remaining daily cost budget before the feature runner starts. Added unit and daemon-path regression coverage.
|
Addressed the two follow-up Codex findings:
Re-ran locally:
|
4524339 to
aa56212
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa56212438
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| try: | ||
| decoded = json.loads(stdout) | ||
| except (json.JSONDecodeError, ValueError): | ||
| return {} | ||
| return _parse_pi_json_events(stdout) | ||
| return decoded if isinstance(decoded, dict) else {} |
There was a problem hiding this comment.
Normalize single-event Pi JSON before returning
When pi --mode json emits only one JSON event, such as an immediate agent_error before any turns, json.loads(stdout) succeeds and the new Pi event mapper is skipped. That single event is returned without the normalized is_error, usage, or total_cost_usd fields, so a zero-exit error event can be treated as a completed backend run and the daemon ledger records no reported cost. Detect Pi event-shaped dicts here and pass them through the same mapper before returning generic harness JSON.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit a0ab44f: single Pi JSON events whose type is a Pi event are now routed through the same Pi event normalizer instead of being returned raw. Added test_pi_backend_parses_single_json_error_event.
| dev_args = _uv_dev_args(cwd_path) | ||
| if not dev_args: | ||
| return argv |
There was a problem hiding this comment.
Guard explicit uv gates without dev deps
Fresh evidence beyond the earlier lockfile thread: this early return still bypasses _with_uv_lock_guard for an explicit uv run ... gate whenever the target pyproject.toml has no dev group/extra. I checked uv run --help: --locked asserts uv.lock remains unchanged and --frozen runs without updating it; without either flag, uv run in a repo with a missing or stale lockfile can create/modify uv.lock, and the subsequent policy diff check will blame the backend for a gate-side effect outside allowed_paths.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit a0ab44f: explicit uv run ... gates now receive --locked even when the target repo has no dev group/extra, while explicit --locked/--frozen remains preserved. Added test_explicit_uv_gate_without_dev_env_still_gets_lock_guard.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0ab44f7ac
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| argv = list(command) | ||
| if not argv: | ||
| raise ValueError("gate command cannot be empty") | ||
| prepared = prepare_gate_command(argv, cwd=cwd) |
There was a problem hiding this comment.
Validate rewritten gate commands
When a target repo has a dev dependency set, this rewrites an already policy-approved gate such as ruff check . into uv run --locked --group dev ruff check .; however _validate_gate_commands only checks the original command[0]. In environments where allowed_gate_commands permits ruff/pytest/mypy but intentionally omits uv, the loop will still execute uv, bypassing the allowlist. Validate the prepared argv or persist prepared commands before the policy node runs.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: gate policy now validates the prepared argv for the cwd(s) where gates execute, in addition to the declared command. That catches rewrites such as ruff check . -> uv run --locked --group dev ruff check . when uv is not allowlisted. Added test_policy_validates_prepared_gate_command.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 16b6e1bbc7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| def _uv_run_has_lock_guard(argv: Sequence[str]) -> bool: | ||
| return any(arg in UV_LOCK_GUARDS for arg in argv) |
There was a problem hiding this comment.
Distinguish uv lock flags from payload args
When an explicit gate passes --locked or --frozen to the command being run, for example uv run python tools/check.py --locked or anything after --, this scan treats that payload flag as a uv lock guard and _with_uv_lock_guard leaves uv run unguarded. I checked uv run --help: its usage is uv run [OPTIONS] [COMMAND], with --locked/--frozen listed as uv options, so only flags before the command protect the lockfile; in the payload case a stale or missing uv.lock can still be updated by the gate and then be blamed by the policy diff check.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: _uv_run_has_lock_guard now only considers --locked/--frozen when they appear in the uv run option segment before the payload command or --. Payload args like uv run python tools/check.py --locked and args after -- no longer suppress insertion of the real uv lock guard. Added regression coverage for both cases.
07d452e to
c84e9ca
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c84e9ca5a3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "--index-url", | ||
| "--isolated", | ||
| "--keyring-provider", | ||
| "--managed-python", |
There was a problem hiding this comment.
Classify uv run flags correctly
When an explicit gate uses a no-argument uv flag such as uv run --managed-python pytest -q in a repo that declares a dev group, this entry makes _uv_run_payload consume pytest as if it were the value for --managed-python, so prepare_gate_command no longer recognizes the payload as a Python gate and never inserts --group dev/--extra dev. I checked uv run --help: --managed-python is listed as a flag, while --group <GROUP> is the dependency selector this code is trying to add, so these gates can still run outside the target repo's dev dependencies.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: corrected the uv option classifier so no-argument flags like --managed-python and --isolated are not treated as consuming the payload command. uv run --managed-python pytest -q now still gets the target repo dev selector plus --locked. Added test_uv_no_argument_flag_does_not_hide_python_gate_payload.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc08975a88
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| PYTHON_GATE_TOOLS = frozenset({"pytest", "ruff", "mypy"}) | ||
| UV_DEV_SELECTORS = frozenset({"--group", "--extra", "--all-groups", "--all-extras"}) | ||
| UV_LOCK_GUARDS = frozenset({"--locked", "--frozen"}) | ||
| UV_OPTIONS_WITH_VALUE = frozenset( |
There was a problem hiding this comment.
Parse uv options that take values before the payload
For explicit workspace gates such as uv run --package api pytest -q, this option-value table omits --package, so _uv_run_payload treats api as the command and never recognizes the pytest payload that should get the repo's dev selector. I checked uv run --help, which lists --package <PACKAGE> before [COMMAND]; add the missing value-taking uv options or use a more complete parser so flags before the payload do not hide Python gates.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: expanded the uv value-taking option table to include workspace/package and other uv run options such as --package, so option values before the payload no longer hide Python gates. Added test_uv_value_option_does_not_hide_python_gate_payload.
| if _uv_run_has_dependency_selector(locked): | ||
| return locked |
There was a problem hiding this comment.
Add dev selectors alongside non-dev uv extras
When an explicit Python gate already selects another uv dependency set, for example uv run --extra docs pytest -q in a repo whose pyproject.toml exposes [project.optional-dependencies].dev, this early return leaves the command without --extra dev. I checked uv run --help: --extra <EXTRA> includes dependencies from the specified extra name, so a non-dev extra does not make the dev tools available; these gates can still run outside the target repo's dev dependencies or fail even though the dev extra exists.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: existing non-dev selectors no longer suppress the required dev selector. We now check specifically for the required dev group/extra (or all-groups/all-extras) before deciding not to add it. Added test_uv_non_dev_extra_does_not_suppress_dev_selector.
| budget["max_wall_clock_minutes"] = max( | ||
| int(budget["max_wall_clock_minutes"]), int(override["max_wall_clock_minutes"]) | ||
| ) | ||
| budget["max_cost_usd"] = max(float(budget["max_cost_usd"]), float(override["max_cost_usd"])) |
There was a problem hiding this comment.
Enforce the raised cost budget before trusting labels
Fresh evidence beyond the earlier clamp thread: when loop:budget-large/xl reaches this assignment, the only downstream use is copying max_cost_usd into BackendConstraints; assemble_backend_prompt does not mention the dollar cap and SubprocessBackend.execute never compares reported Pi cost to it. The new label therefore neither enforces nor meaningfully raises a run cost cap, and the daemon can still add an over-cap cost to the daily ledger after the run.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: reported harness cost is now enforced against BackendConstraints.max_cost_usd; an over-budget completed harness run becomes budget_exhausted. The daemon also blocks draft PR publication if the final reported run cost exceeds the effective per-run budget. Added backend and daemon regression coverage.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ccae6f7d31
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if isinstance(message, dict) and message.get("role") == "assistant": | ||
| usage = _usage_from_pi_message(message) | ||
| if usage is not None: | ||
| parsed.update(usage) | ||
| text = _content_text(message) |
There was a problem hiding this comment.
Honor Pi message stopReason errors
When Pi encodes a provider/runtime failure on an assistant message, e.g. stopReason: "error" or "aborted" with errorMessage (as its message types document), this branch still treats the message like a normal assistant response and never sets is_error. If that JSON-mode run exits 0, execute() reports a completed backend run instead of a failure, so the loop can proceed past a failed Pi invocation; check assistant stopReason/errorMessage while normalizing these events.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: Pi assistant messages with error-like stopReason values (error, aborted, canceled variants) or errorMessage/error now normalize to is_error=True. Added test_pi_backend_treats_assistant_error_stop_reason_as_error, preserving usage/cost extraction while making zero-exit provider/runtime failures fail the backend run.
c72e1d0 to
8ad7a8a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8ad7a8ad12
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if event.get("willRetry") is True: | ||
| parsed["is_error"] = True |
There was a problem hiding this comment.
Don't fail successful Pi retry streams
When Pi recovers from context overflow, its JSON stream can contain retry bookkeeping before the later successful agent_end. The Pi RPC docs describe compaction_end with willRetry: true as meaning Pi will automatically retry the prompt, but this line makes that transient marker set is_error permanently. In that scenario a zero-exit, ultimately successful pi --mode json run is returned as failed, so the loop goes to triage despite valid backend output.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: transient retry bookkeeping (willRetry: true, e.g. compaction retry events) no longer marks the whole Pi stream as failed. Error events and assistant error stop reasons still set is_error. Added test_pi_backend_retry_bookkeeping_does_not_fail_later_success.
| if selector == "--group": | ||
| return "--all-groups" in options or value in _uv_option_value(options, "--group") |
There was a problem hiding this comment.
Treat uv only-dev selectors as satisfying dev
When an explicit gate already uses uv run --only-group dev pytest -q or uv run --only-dev pytest -q in a repo with [dependency-groups].dev, this check returns false because it only accepts --group dev/--all-groups. I checked uv run --help: --only-group <ONLY_GROUP> and --only-dev are uv dependency selectors, and uv rejects combining them with an added --group, so prepare_gate_command rewrites a valid explicit gate into one that fails before running tests.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: --only-group dev and --only-dev now satisfy the dev dependency requirement for dependency-group repos, so we do not add incompatible --group dev. Added regressions for both selectors.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5a7157d3fb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| selector, value = dev_args | ||
| if selector == "--group": | ||
| return ( | ||
| "--all-groups" in options |
There was a problem hiding this comment.
Treat dev exclusions as unsatisfied
When an explicit gate in a dependency-group repo uses uv run --all-groups --no-group dev pytest -q, this branch returns true as soon as --all-groups is present. I checked uv run --help: --no-group <NO_GROUP> disables the specified dependency group, so the command still excludes the dev group that normally provides pytest/ruff/mypy; prepare_gate_command then leaves the gate outside the target dev deps or failing because the tools are missing. Treat dev exclusions as unsatisfied or reject them.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: uv Python gates that explicitly exclude the required dev dependency set now fail closed instead of being treated as satisfied. This catches cases like uv run --all-groups --no-group dev pytest -q for dependency-group repos before execution and in gate policy validation. Added runner and policy regressions.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7c416f26e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ["uv", "run", "ruff", "check", "."], | ||
| ["uv", "run", "mypy", *_mypy_targets(normalized)], |
There was a problem hiding this comment.
Keep auto-selected Python gates from dirtying worktrees
When a Python repo has a dev dependency set and does not already ignore these cache directories, the newly auto-selected ruff check . and mypy ... gates can create untracked .ruff_cache/.mypy_cache in the target worktree; Ruff documents its default cache dir as .ruff_cache and mypy documents its default cache dir as .mypy_cache. Because _validate_worktree_diffs later reads git status for the same worktree, an otherwise valid backend diff can fail policy on gate-created files outside the allowed paths. Run these gates with cache dirs outside the worktree or clean/disable their caches before the diff policy step.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in the PR branch: auto-selected Python gates are now cache-neutral inside target worktrees. Pytest disables the cache provider, Ruff runs with --no-cache, and mypy runs with --no-incremental, preventing .pytest_cache, .ruff_cache, and .mypy_cache side effects from polluting later policy diff checks. Added regression coverage for the selected gate argv.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f4cec956fa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cwd_path = Path(cwd).expanduser().resolve() if cwd is not None else None | ||
| if _uv_dev_args(cwd_path): | ||
| return [ | ||
| ["uv", "run", "python", "-m", "pytest", "-q", "-p", "no:cacheprovider"], |
There was a problem hiding this comment.
Avoid forcing pytest on testless dev-env repos
When any Python file changes in a repo that merely declares a dev group/extra, this auto-selects pytest even if the target repo has no tests or does not use pytest. Pytest documents exit code 5 as “No tests were collected” (https://docs.pytest.org/en/stable/reference/exit-codes.html), so a valid Python-only change in a new/small package can fail the loop solely because the synthesized gate was added; consider only selecting pytest when tests/config are present or falling back to compileall in that case.
Useful? React with 👍 / 👎.
Summary
Closes #13.
Closes #11.
Closes #12.
Closes #6.
This hardens the production loop path by fixing the current open readiness gaps:
uv run --group dev/--extra dev) when present## Acceptance/## Acceptance criteriasections with deterministic fallbackpi --print --mode jsonloop:budget-largeandloop:budget-xlValidation
uv run --group dev python -m pytest -q→ 200 passeduvx ruff check src tests→ passeduv run --group dev mypy --strict src→ passeduv run hyrule-engineering-loop models validate --model-policy configs/loop/model-policy.production.yml --json→ okNotes
The daily daemon caps remain enforced. Budget labels only raise the per-run cap for explicitly triaged feature-sized issues.