[codex] drop v4 task compatibility
Decouple agent native tools from environment primitives
# Conflicts:
#   docs/reference/agents.mdx
#   hud/environment/environment.py
#   hud/environment/tests/test_environment.py
#   hud/tools/computer/base.py
#   hud/tools/computer/gemini.py
#   hud/tools/executors/xdo.py
#   hud/tools/tests/test_computer.py
Refactor Agents
Robot capability: environment.robots, episode recorder, telemetry, ensembler
Rename Robot Capability + Add MainThreadSimRunner
…arching global ast
Resolve registry name from the served Environment, not a source scan
from hud.eval import Taskset, group_relative

agent = create_agent("claude-sonnet-4-5")
job = await Taskset(count_letter(word=w) for w in words).run(agent, group=16)
Taskset constructor misused
Medium Severity
The training example calls Taskset(count_letter(word=w) for w in words), which passes the generator as the taskset name, leaving tasks empty. .run(agent, group=16) then schedules no tasks, so the GRPO snippet does nothing useful.
Reviewed by Cursor Bugbot for commit 88ba14d.
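The failure mode described in this comment can be reproduced without the SDK. The sketch below uses a hypothetical stand-in class, `TasksetSketch`, whose first positional parameter is assumed to be the name, as the review states; it is not the real `hud.eval.Taskset` signature, and `from_tasks` is an illustrative helper, not an SDK method.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, List


@dataclass
class TasksetSketch:
    """Hypothetical stand-in; NOT the real hud.eval.Taskset."""

    name: Any = None  # assumed: first positional slot is the name
    tasks: List[Any] = field(default_factory=list)

    @classmethod
    def from_tasks(cls, tasks: Iterable[Any], name: str = "unnamed") -> "TasksetSketch":
        # Materialize the iterable so a generator is consumed exactly once.
        return cls(name=name, tasks=list(tasks))


words = ["strawberry", "banana", "cherry"]

# Failure mode: the generator is bound to `name`, so `tasks` stays empty.
broken = TasksetSketch(f"count:{w}" for w in words)

# One possible fix: route the tasks through an explicit parameter.
fixed = TasksetSketch.from_tasks(f"count:{w}" for w in words)

print(len(broken.tasks), len(fixed.tasks))
```

Whatever the real constructor looks like, passing tasks by keyword or through a dedicated constructor makes this class of mistake harder to write.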
@env.template()
async def count_letter(word: str = "strawberry", letter: str = "r"):
    answer = yield f"How many '{letter}'s are in '{word}'? Reply with just the number."
    yield 1.0 if answer and str(word.count(letter)) in answer else 0.0
Letter count case mismatch
Low Severity
The sample grader uses case-sensitive word.count(letter) while checking whether that count appears in the agent answer. Mixed-case inputs (e.g. "Strawberry" / "s", where the case-sensitive count is 0 but the intended count is 1) can score 0.0 even when the answer is correct, unlike the prior lowercased logic.
Reviewed by Cursor Bugbot for commit 88ba14d.
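A minimal sketch of the difference, using plain functions rather than the `@env.template()` generator. The function names are illustrative, and lowercasing both inputs is only one way to restore the behaviour the comment refers to. The substring check from the snippet is kept unchanged so that only the case handling differs.

```python
from typing import Optional


def grade_case_sensitive(word: str, letter: str, answer: Optional[str]) -> float:
    # Mirrors the check in the snippet under review.
    return 1.0 if answer and str(word.count(letter)) in answer else 0.0


def grade_case_insensitive(word: str, letter: str, answer: Optional[str]) -> float:
    # Normalize both inputs before counting so "S" and "s" match.
    expected = word.lower().count(letter.lower())
    return 1.0 if answer and str(expected) in answer else 0.0


# "Strawberry" contains one "s" when case is ignored, so "1" is the right answer.
print(grade_case_sensitive("Strawberry", "s", "1"))
print(grade_case_insensitive("Strawberry", "s", "1"))
```

The case-sensitive version counts zero matches for a lowercase "s" and rejects the correct answer; the normalized version accepts it.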
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 704bca4.


Note
High Risk
This is a major SDK and protocol shift (v5 agents cannot drive v6-served environments) plus CI test setup changes that drop browser/Playwright provisioning, which can hide regressions in computer-use paths if those tests still exist.
Overview
This PR ships HUD Python SDK v6 as the primary surface: environments expose a thin control channel with capabilities (`ssh`, `mcp`, `cdp`, `rfb`, `robot`) and tasks (`@env.template()` generators), while agent harnesses own the tools. User-facing narrative moves from v5 scenarios/MCP tools to protocol-first `manifest → tasks.start → tasks.grade`, with `Task.run(agent)` returning a `Job`/`Run` instead of `hud.eval()` / `env("scenario", ...)`.

Documentation is restructured on Mintlify: default v6 nav (`docs/v6/`), v5 tagged Legacy under `docs/v5/`, redirects from old paths, new Migrate to v6 guide, agent skill doc, and refreshed site styling (`docs.json`, `custom.css`). Several long-form cookbooks are removed from the old tree and replaced or relocated (e.g. v6 coding-agent, ops-diagnostics, a2a-chat, robot-benchmark).

Runnable examples land under `cookbooks/` (A2A chat server moved out of the SDK as reference code; codex-style agent; v6 `chat_env` using `EvaluationResult` and templates). README and CONTRIBUTING are rewritten for v6 workflows (`hud init`, `hud deploy`, `hud eval` without `--rootdir=hud`).

CI/dev ergonomics: GitHub Actions drops Xvfb/Playwright install from the test matrix; `.githooks/pre-push` is removed. `.gitignore` expands for local/experimental dirs. Adds `AGENTS.md` (and `CLAUDE.md` pointer) for contributor/agent guidance.

Reviewed by Cursor Bugbot for commit c673f40.
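The two-yield shape of the template generators mentioned in the overview can be driven with plain `asyncio`, which may help when reasoning about the start and grade steps. The sketch below is an assumption about how such a generator could be stepped, not the SDK's implementation: `run_once` is a hypothetical helper, the `@env.template()` decorator is omitted, and the mapping of the first yield to `tasks.start` and the second to `tasks.grade` is inferred from the description only.

```python
import asyncio
from typing import AsyncGenerator, Callable, Tuple


async def count_letter(word: str = "strawberry", letter: str = "r"):
    # First yield hands out the prompt and waits for the agent's answer.
    answer = yield f"How many '{letter}'s are in '{word}'? Reply with just the number."
    # Second yield reports the score for that answer.
    yield 1.0 if answer and str(word.count(letter)) in answer else 0.0


async def run_once(
    gen: AsyncGenerator, answer_fn: Callable[[str], str]
) -> Tuple[str, float]:
    # Hypothetical driver; the real harness is not shown in the source.
    prompt = await gen.__anext__()   # assumed "start": run to the first yield
    answer = answer_fn(prompt)       # the agent would produce this
    score = await gen.asend(answer)  # assumed "grade": resume with the answer
    await gen.aclose()               # release the suspended generator
    return prompt, score


prompt, score = asyncio.run(run_once(count_letter(), lambda p: "3"))
print(prompt)
print(score)
```

Here a fixed lambda stands in for the agent; in the flow described above, the answer would come from whatever harness owns the tools.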