feat(agentic): add Hidden Intent proactivity tracking framework (pi-Bench) by harryfan1985 · Pull Request #846 · GCWing/BitFun

harryfan1985 · 2026-05-23T15:14:34Z

Overview

This PR adds the platform-neutral groundwork for Hidden Intent / proactivity tracking in BitFun agent sessions, inspired by the pi-Bench paper (arXiv 2605.14678).

The current implementation does not claim to fully reproduce pi-Bench's hidden-intent evaluator. Instead, it introduces the session/config/data contracts, prompt guidance, runtime evidence capture, and report fields needed to evaluate whether an agent proactively handles latent requirements once real hidden-intent assignments are available.

Paper Source

pi-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
Zhang et al., arXiv 2605.14678, May 2026
https://arxiv.org/abs/2605.14678

The paper evaluates proactive personal assistants through hidden intents with three terminal states:

Completed: the agent directly satisfies the hidden intent without the user explicitly providing it.
Inferred: the agent asks a targeted clarification and the user reveals the hidden intent.
Provided: the user must proactively supply the hidden intent.

Proactivity Score = (Completed + Inferred) / Total Hidden Intents.

Task completeness is a separate final-output/task-requirement judgment in the paper, not something that should be inferred from the hidden-intent terminal states.
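The score formula above can be sketched over aggregated counts. This is a minimal illustration, not the PR's code: `TerminalCounts` and its field names are hypothetical, and returning `None` for an empty set is an assumption made here to avoid dividing by zero.

```rust
/// Hypothetical aggregate of hidden-intent terminal states for one session.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct TerminalCounts {
    pub completed: u32,
    pub inferred: u32,
    pub provided: u32,
}

impl TerminalCounts {
    /// Total number of hidden intents that reached a terminal state.
    pub fn total(&self) -> u32 {
        self.completed + self.inferred + self.provided
    }

    /// Proactivity Score = (Completed + Inferred) / Total Hidden Intents.
    /// Returns `None` when there are no hidden intents (assumption).
    pub fn proactivity_score(&self) -> Option<f64> {
        let total = self.total();
        if total == 0 {
            return None;
        }
        Some(f64::from(self.completed + self.inferred) / f64::from(total))
    }
}
```

For example, three Completed, one Inferred, and one Provided intent give (3 + 1) / 5 = 0.8. Completeness is deliberately absent from this type, since the paper treats it as a separate judgment.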

What This PR Changes

1. Prompt Guidance

agentic_mode.md: adds guidance for coding agents to infer likely requirements from workspace context and ask targeted questions when information is missing.
claw_mode.md: strengthens personal-assistant proactivity guidance, including preference/context recovery across longer workflows.
facet_extraction.md: extends the session-insights extraction prompt with proactivity and completeness fields. This is prompt/schema guidance only; it is not yet wired as the authoritative hidden-intent assignment grader.

2. Platform-Neutral Data Contracts

hidden_intent_types.rs: adds DTOs/enums for hidden intents, persistent intents, terminal statuses, session-level tracking, raw turn evidence, and score/report value types.
session/types.rs: adds persisted intent_assignments and intent_evidence on dialog turns, plus session metadata fields for intent tracking and optional score snapshots.
core/session.rs: adds enable_intent_tracking to SessionConfig, defaulting to false.
services-core owns the shared contracts so the logic stays platform-agnostic and can be exposed through desktop/web/server adapters.
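As a rough picture of how these contracts fit together, the sketch below models the opt-in flag and the two per-turn fields with plain structs. All field names on `IntentTurnEvidence` and `IntentAssignment` other than `turn_index` and `intent_id` are assumptions for illustration, and the real types carry serde attributes that are omitted here so the example needs no external crates.

```rust
/// Terminal states named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntentTerminalStatus {
    Completed,
    Inferred,
    Provided,
}

/// An evaluated hidden intent; `status` is `None` until an evaluator decides.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IntentAssignment {
    pub intent_id: String,
    pub status: Option<IntentTerminalStatus>, // assumed shape
}

/// Raw per-turn signals, stored separately from assignments.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct IntentTurnEvidence {
    pub turn_index: usize,
    pub asked_user_question: bool,    // assumed field name
    pub question_topics: Vec<String>, // assumed field name
    pub proactive_tool_calls: u32,    // assumed field name
    pub produced_output: bool,        // assumed field name
    pub round_count: u32,             // assumed field name
}

/// Only the new fields of a dialog turn are modeled.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct DialogTurn {
    pub intent_assignments: Vec<IntentAssignment>,
    pub intent_evidence: Option<IntentTurnEvidence>,
}

/// Only the new flag is modeled; `bool::default()` is `false`.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct SessionConfig {
    pub enable_intent_tracking: bool,
}
```

Deriving `Default` here mirrors the intent of the real defaults: a session that never mentions intent tracking gets a disabled flag, an empty assignment list, and no evidence.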

3. Runtime Evidence Collection

intent_evidence.rs: collects lightweight per-turn trajectory signals such as targeted user-question usage, question topics, proactive tool calls, output production, and round count.
round_executor.rs: detects AskUserQuestion tool usage and extracts question-topic hints from tool-call arguments.
execution_engine.rs: accumulates evidence during the turn and persists a snapshot after the dialog loop completes.
coordinator.rs: creates an IntentEvidenceCollector only when enable_intent_tracking=true.
session_manager.rs: persists raw evidence to both session metadata and the dialog-turn file without converting it into hidden-intent terminal assignments.
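The collection flow described above might look roughly like the following. This is a simplified sketch under stated assumptions: the method names (`record_round`, `record_question`, `snapshot`), the `make_collector` helper, and the evidence field names other than `turn_index` are hypothetical, and the real collector is shared across the engine rather than owned by one caller.

```rust
/// Raw per-turn signals (field names other than `turn_index` are assumed).
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct IntentTurnEvidence {
    pub turn_index: usize,
    pub asked_user_question: bool,
    pub question_topics: Vec<String>,
    pub round_count: u32,
}

/// Accumulates evidence while a turn runs.
#[derive(Debug, Default)]
pub struct IntentEvidenceCollector {
    evidence: IntentTurnEvidence,
}

impl IntentEvidenceCollector {
    pub fn new(turn_index: usize) -> Self {
        Self {
            evidence: IntentTurnEvidence { turn_index, ..Default::default() },
        }
    }

    /// Called once per model round inside the dialog loop.
    pub fn record_round(&mut self) {
        self.evidence.round_count += 1;
    }

    /// Called when an AskUserQuestion tool call is detected.
    pub fn record_question(&mut self, topics: &[&str]) {
        self.evidence.asked_user_question = true;
        self.evidence
            .question_topics
            .extend(topics.iter().map(|t| t.to_string()));
    }

    /// Copy of the evidence to persist after the dialog loop completes.
    pub fn snapshot(&self) -> IntentTurnEvidence {
        self.evidence.clone()
    }
}

/// Build a collector only when tracking is enabled, as the coordinator does.
pub fn make_collector(
    enable_intent_tracking: bool,
    turn_index: usize,
) -> Option<IntentEvidenceCollector> {
    enable_intent_tracking.then(|| IntentEvidenceCollector::new(turn_index))
}
```

Returning `Option` keeps the disabled path free of bookkeeping: callers that receive `None` skip evidence recording and persistence entirely.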

4. Session Usage Report Surface

session_usage/types.rs: adds optional proactivity and completeness report fields.
session_usage/service.rs: aggregates real hidden-intent assignments into a proactivity report when such assignments exist.
Completeness remains unset until a dedicated checklist/rubric/rule-based grader is implemented.
Legacy proxy-style turn-* assignments generated from raw evidence are ignored so old heuristic data is not reported as real hidden-intent evaluation.

5. Frontend / Adapter Plumbing

Desktop API request DTOs accept enable_intent_tracking and pass it into SessionConfig.
Flow chat config types and session creation paths propagate enableIntentTracking.
Session API/report typings include optional proactivity/completeness report fields.
Session history typings include optional intentEvidence, so future UI/report features can inspect raw evidence separately from hidden-intent assignments.

Incremental Refactor / pi-Bench Alignment

A follow-up refactor tightened the implementation against the paper's functional model:

Separated evidence from assignment: runtime signals are stored as IntentTurnEvidence, not synthetic IntentAssignment rows.
Corrected proactivity semantics: score helpers use the full hidden-intent count as denominator and return unavailable while any hidden intent lacks a terminal status.
Removed false completeness derivation: completeness is no longer computed from Completed/Inferred/Provided; it is reserved for an independent final-task grader.
Preserved compatibility: serde defaults/aliases keep older session files readable, and usage reports filter old proxy assignments.
Kept boundaries clean: shared DTOs live in services-core, execution evidence collection lives in bitfun-core, and UI code consumes typed API/session-history data.
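The corrected proactivity semantics can be illustrated per intent. In this sketch, which is an assumption about shape rather than the PR's helper, each hidden intent carries an `Option<IntentTerminalStatus>`, and the hypothetical `score_if_fully_evaluated` function reports "unavailable" as `None` whenever any intent is still pending.

```rust
/// Terminal states named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntentTerminalStatus {
    Completed,
    Inferred,
    Provided,
}

/// Each slot is one hidden intent; `None` means no terminal status yet.
/// The denominator is always the full hidden-intent count.
pub fn score_if_fully_evaluated(
    statuses: &[Option<IntentTerminalStatus>],
) -> Option<f64> {
    if statuses.is_empty() {
        return None; // nothing to evaluate (assumption)
    }
    let mut proactive = 0usize;
    for status in statuses {
        match status {
            Some(IntentTerminalStatus::Completed)
            | Some(IntentTerminalStatus::Inferred) => proactive += 1,
            Some(IntentTerminalStatus::Provided) => {}
            // A pending intent makes the whole score unavailable.
            None => return None,
        }
    }
    Some(proactive as f64 / statuses.len() as f64)
}
```

Refusing to score a partially evaluated session avoids inflating the ratio by silently shrinking the denominator to the intents that happen to be resolved.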

Current Limitations / Follow-ups

This PR does not yet implement the full hidden-intent discovery/assignment evaluator described by pi-Bench.
Persistent-intent DTOs are present, but cross-session memory recovery and application are not fully wired.
Completeness needs a separate grader over the final trajectory/artifacts and task requirements.
AskUserQuestion topic extraction depends on the current tool-call argument shape and should be revisited if the tool schema changes.
Proactivity reporting is meaningful only when real hidden-intent assignments are produced by a future evaluator or imported evaluation data.

Follow-up TODO: Validation and Optimization

Validation TODO

Add fixture-based tests with hand-authored hidden intents and expected terminal statuses to verify Completed, Inferred, and Provided classification semantics independently from runtime evidence collection.
Add replay tests for saved sessions with enable_intent_tracking=true, covering metadata persistence, turn-file intentEvidence, and usage report aggregation after reload.
Add backward-compatibility tests for older session files that contain no intent fields, plus sessions that contain legacy proxy-style turn-* assignments.
Add remote-workspace coverage to confirm evidence persistence and usage-report generation behave the same for local and remote sessions.
Add a small evaluator-golden set once the assignment evaluator exists, with cases for direct satisfaction, targeted clarification, generic clarification that should not count as Inferred, and user-provided hidden intent.
Add completeness-grader tests separately from proactivity, using task checklists/rubrics over final artifacts so completeness does not regress into hidden-intent status counting.

Optimization TODO

Introduce a post-hoc hidden-intent assignment evaluator that compares concrete hidden intents against the full trajectory in two stages: direct satisfaction first, then targeted elicitation.
Wire persistent intents into the workspace/session memory layer so preferences can be recovered across sessions without coupling core logic to desktop-specific storage.
Replace fragile question-topic extraction with a typed AskUserQuestion tool contract or versioned parser to avoid silent drift when the tool schema changes.
Consider batching or coalescing post-turn evidence persistence if high-volume tracking sessions show measurable metadata I/O overhead.
Add UI affordances for raw evidence vs evaluated assignments, making it clear when a report is evidence-only, fully evaluated, or unavailable.
Add report coverage metadata for proactivity/completeness so consumers can distinguish not evaluated, partially evaluated, and fully evaluated states.
Revisit proactivity thresholds after collecting real coding-assistant sessions; pi-Bench-style labels may need calibration for BitFun's coding and desktop workflows.

Risk Assessment

Low Risk

enable_intent_tracking defaults to false, so evidence collection is opt-in.
New persisted fields use Option, Vec, serde defaults, and aliases for backward-compatible deserialization.
Raw evidence and terminal assignments are stored separately, reducing the risk of misleading reports.

Medium Risk

Enabling tracking adds post-turn metadata/turn-file persistence work.
The current report surface includes reserved completeness fields before the completeness grader exists.
Future evaluator work must be careful to distinguish targeted clarification (Inferred) from generic questions or passive waiting.

Verification

cargo test -p bitfun-services-core hidden_intent -- --nocapture
cargo test -p bitfun-core intent_evidence -- --nocapture
cargo test -p bitfun-core report_ -- --nocapture
cargo check --tests -p bitfun-services-core
cargo check --tests -p bitfun-core
cargo check --tests -p bitfun-desktop
pnpm run type-check:web
pnpm run lint:web
pnpm --dir src/web-ui run test:run (139 files / 744 tests passed)

Generated with BitFun

Based on the pi-Bench Hidden Intent framework (arXiv 2605.14678), this introduces infrastructure for tracking proactive assistance quality in long-horizon agent workflows.

Paper reference: pi-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows, Zhang et al., arXiv 2605.14678, May 2026

What this adds:
- Hidden Intent types: IntentTerminalStatus (Completed/Inferred/Provided), HiddenIntent, PersistentIntent, SessionIntentTracking, ProactivityScore, CompletenessScore in services-core
- IntentEvidenceCollector and IntentTurnEvidence in the ExecutionEngine for lightweight per-turn signal collection
- Proactivity behavior guidance in agentic_mode.md and claw_mode.md system prompts
- Extended facet_extraction.md with proactivity/completeness analysis dimensions
- SessionUsageReport extensions with ProactivityReport and a completeness report

GCWing · 2026-05-24T01:07:04Z

This PR involves significant changes and affects the Agentic agent; it will be considered for merging after verification.

harryfan1985 · 2026-05-24T03:30:19Z

> This PR involves significant changes and affects the Agentic agent; it will be considered for merging after verification.

sure!

- round_executor: detect AskUserQuestion even when no topic headers are extractable, so the call is no longer silently dropped
- execution_engine/session_manager: drop unused turn_id param; warn on poisoned intent evidence mutex instead of silent skip
- hidden_intent_types: centralize proactivity level thresholds in ProactivityLevel::{from_score,as_str}; add explicit IntentAssignment is_proxy flag so proxy detection no longer relies solely on a fragile intent_id string heuristic (heuristic kept as legacy fallback)
- session_usage: use is_proxy flag first; document the single-provided suppression rationale
- add regression tests for AskUserQuestion detection and proxy filtering

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

harryfan1985 · 2026-05-24T04:47:19Z

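Follow-up: Code Review Fixes (commit `4ffbb24`)

The issues found in the earlier review are fixed in the latest commit, with regression tests added.

Fixes

| Issue | File | Fix |
|---|---|---|
| Missed AskUserQuestion detection | `round_executor.rs` | `detect_ask_user_question` previously returned `(false, [])` when `questions[].header` was missing, so the tool call was silently dropped. It now records the call itself with a separate `called` flag, and topic extraction stays best-effort. |
| Unused parameter | `session_manager.rs` / `execution_engine.rs` | Removed the never-used `_turn_id` parameter from `record_intent_evidence` (the turn is located through `evidence.turn_index`) and updated the only caller. |
| Poisoned mutex silently dropped evidence | `execution_engine.rs` | When the evidence collector lock is poisoned, the code now emits a `warn!` log instead of silently skipping, which makes troubleshooting easier. |
| Thresholds duplicated in three places | `hidden_intent_types.rs` / `intent_evidence.rs` / `service.rs` | Consolidated the 0.8/0.5/0.2 level thresholds into `ProactivityLevel::from_score()` and `as_str()`; the other two sites now delegate to them. |
| Fragile proxy-assignment detection | `hidden_intent_types.rs` / `service.rs` | `IntentAssignment` gains an explicit `is_proxy: bool` field (serde default `false`, backward compatible). `is_legacy_proxy_intent_assignment` reads this field first; the original `intent_id.starts_with("turn-")` string heuristic is kept as a fallback for old data, to avoid misclassifying real intents. |
| Single-Provided filter undocumented | `service.rs` | Added a comment explaining why a single `Provided` (total=1) does not make a meaningful report and is therefore suppressed. |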
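The threshold consolidation can be pictured as a single enum that owns both the cut-offs and the display strings. The 0.8/0.5/0.2 values come from the fix list above; the variant names, the returned strings, and the choice of inclusive lower bounds are assumptions made for this sketch and may differ from the actual `ProactivityLevel`.

```rust
/// Hypothetical level names; only the thresholds come from the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProactivityLevel {
    High,
    Moderate,
    Low,
    Minimal,
}

impl ProactivityLevel {
    /// Single source of truth for the 0.8 / 0.5 / 0.2 cut-offs.
    /// Lower bounds are treated as inclusive (assumption).
    pub fn from_score(score: f64) -> Self {
        if score >= 0.8 {
            Self::High
        } else if score >= 0.5 {
            Self::Moderate
        } else if score >= 0.2 {
            Self::Low
        } else {
            Self::Minimal
        }
    }

    /// Stable label for reports; other call sites delegate here
    /// instead of repeating the thresholds.
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::High => "high",
            Self::Moderate => "moderate",
            Self::Low => "low",
            Self::Minimal => "minimal",
        }
    }
}
```

With this shape, a report builder would call `ProactivityLevel::from_score(score).as_str()` and never compare against a literal threshold itself.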

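New Tests

round_executor.rs: 6 detect_ask_user_question cases covering header present / header absent / empty array / missing key / tool not present / mixed tool calls.
session_usage/service.rs: 2 proxy-filtering cases: is_proxy=true must be excluded (regardless of intent_id), and a real intent with a turn- prefix is not wrongly filtered when is_proxy=false.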
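The AskUserQuestion cases listed above can be approximated with a small stand-in. The real `detect_ask_user_question` reads JSON tool-call arguments; to stay free of external crates, this sketch replaces them with a hypothetical `ToolCall` struct whose `question_headers` field models a missing `questions` key (`None`) and questions without a `header` (inner `None`). The `ReadFile` tool name used in the note below is also made up.

```rust
/// Simplified stand-in for a tool call (the real code parses JSON arguments).
pub struct ToolCall<'a> {
    pub name: &'a str,
    /// `None`: no `questions` key. Inner `None`: a question with no `header`.
    pub question_headers: Option<Vec<Option<&'a str>>>,
}

/// Returns `(called, topics)`.
/// `called` is set as soon as the tool is seen, independent of whether
/// any topic could be extracted; topic extraction is best-effort.
pub fn detect_ask_user_question(calls: &[ToolCall<'_>]) -> (bool, Vec<String>) {
    let mut called = false;
    let mut topics = Vec::new();
    for call in calls.iter().filter(|c| c.name == "AskUserQuestion") {
        called = true;
        if let Some(headers) = &call.question_headers {
            // `flatten` skips questions that have no header.
            topics.extend(headers.iter().flatten().map(|h| h.to_string()));
        }
    }
    (called, topics)
}
```

A call with no extractable header now yields `(true, [])` rather than `(false, [])`, while a turn that only used other tools such as `ReadFile` still yields `(false, [])`.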

Verification

cargo test -p bitfun-services-core hidden_intent   # 10 passed
cargo test -p bitfun-core intent_evidence          # 12 passed
cargo test -p bitfun-core report_                   # 22 passed (including 2 new)
cargo test -p bitfun-core detect_ask_user_question  # 6 passed (new)
cargo check --tests -p bitfun-services-core / bitfun-core / bitfun-desktop  # all passed

Note: issue #2 (parsing proactive_tools= out of the free-text trigger_description) is a design-level problem. It should be replaced with a dedicated field when the follow-up structured evaluator lands; it is left unchanged here to limit scope.

harryfan1985 added 4 commits May 23, 2026 23:11

fix(agentic): sync turn-level intent assignments to dialog turn file

388d9f6

fix(agentic): wire hidden intent tracking fixes

c177427

fix(agentic): align hidden intent reporting with pi-bench

74b7f89

harryfan1985 wants to merge 5 commits into GCWing:main from harryfan1985:feature/hidden-intent-proactivity-tracking
