
feat(agentic): add Hidden Intent proactivity tracking framework (pi-Bench) #846

Open
harryfan1985 wants to merge 5 commits into GCWing:main from harryfan1985:feature/hidden-intent-proactivity-tracking

Conversation

Contributor

harryfan1985 commented May 23, 2026

Overview

This PR adds the platform-neutral groundwork for Hidden Intent / proactivity tracking in BitFun agent sessions, inspired by the pi-Bench paper (arXiv 2605.14678).

The current implementation does not claim to fully reproduce pi-Bench's hidden-intent evaluator. Instead, it introduces the session/config/data contracts, prompt guidance, runtime evidence capture, and report fields needed to evaluate whether an agent proactively handles latent requirements once real hidden-intent assignments are available.

Paper Source

pi-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
Zhang et al., arXiv 2605.14678, May 2026
https://arxiv.org/abs/2605.14678

The paper evaluates proactive personal assistants through hidden intents with three terminal states:

  • Completed: the agent directly satisfies the hidden intent without the user explicitly providing it.
  • Inferred: the agent asks a targeted clarification and the user reveals the hidden intent.
  • Provided: the user must proactively supply the hidden intent.

Proactivity Score = (Completed + Inferred) / Total Hidden Intents.

Task completeness is a separate final-output/task-requirement judgment in the paper, not something that should be inferred from the hidden-intent terminal states.
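As a rough illustration of the scoring rule above, the following sketch computes the Proactivity Score from a list of per-intent terminal statuses. `IntentTerminalStatus` is named in this PR; the `proactivity_score` function, its signature, and the use of `Option` to model an intent without a terminal status are assumptions made for this example, not the PR's actual helper. It returns `None` (unavailable) while any hidden intent still lacks a terminal status, which matches the semantics described under "Incremental Refactor" below.

```rust
/// Terminal state of a hidden intent, as summarized from the paper above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntentTerminalStatus {
    Completed,
    Inferred,
    Provided,
}

/// Hypothetical helper: (Completed + Inferred) / Total Hidden Intents.
/// Returns `None` when there are no hidden intents, or when any intent
/// has not reached a terminal status yet (the score is unavailable).
pub fn proactivity_score(statuses: &[Option<IntentTerminalStatus>]) -> Option<f64> {
    if statuses.is_empty() {
        return None;
    }
    let mut proactive = 0usize;
    for status in statuses {
        match status {
            Some(IntentTerminalStatus::Completed) | Some(IntentTerminalStatus::Inferred) => {
                proactive += 1;
            }
            // Provided counts toward the denominator only.
            Some(IntentTerminalStatus::Provided) => {}
            // An unresolved intent makes the whole score unavailable.
            None => return None,
        }
    }
    Some(proactive as f64 / statuses.len() as f64)
}

fn main() {
    use IntentTerminalStatus::*;
    // Three of four hidden intents were handled proactively.
    let resolved = [Some(Completed), Some(Inferred), Some(Provided), Some(Completed)];
    println!("{:?}", proactivity_score(&resolved));

    // One intent is still unresolved, so no score is reported.
    let pending = [Some(Completed), None];
    println!("{:?}", proactivity_score(&pending));
}
```

Note that nothing here derives completeness; that stays a separate judgment over the final output.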

What This PR Changes

1. Prompt Guidance

  • agentic_mode.md: adds guidance for coding agents to infer likely requirements from workspace context and ask targeted questions when information is missing.
  • claw_mode.md: strengthens personal-assistant proactivity guidance, including preference/context recovery across longer workflows.
  • facet_extraction.md: extends the session-insights extraction prompt with proactivity and completeness fields. This is prompt/schema guidance only; it is not yet wired as the authoritative hidden-intent assignment grader.

2. Platform-Neutral Data Contracts

  • hidden_intent_types.rs: adds DTOs/enums for hidden intents, persistent intents, terminal statuses, session-level tracking, raw turn evidence, and score/report value types.
  • session/types.rs: adds persisted intent_assignments and intent_evidence on dialog turns, plus session metadata fields for intent tracking and optional score snapshots.
  • core/session.rs: adds enable_intent_tracking to SessionConfig, defaulting to false.
  • services-core owns the shared contracts so the logic stays platform-agnostic and can be exposed through desktop/web/server adapters.

3. Runtime Evidence Collection

  • intent_evidence.rs: collects lightweight per-turn trajectory signals such as targeted user-question usage, question topics, proactive tool calls, output production, and round count.
  • round_executor.rs: detects AskUserQuestion tool usage and extracts question-topic hints from tool-call arguments.
  • execution_engine.rs: accumulates evidence during the turn and persists a snapshot after the dialog loop completes.
  • coordinator.rs: creates an IntentEvidenceCollector only when enable_intent_tracking=true.
  • session_manager.rs: persists raw evidence to both session metadata and the dialog-turn file without converting it into hidden-intent terminal assignments.
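The opt-in flow described in this list can be sketched roughly as follows. `IntentEvidenceCollector`, `IntentTurnEvidence`, and `enable_intent_tracking` are names from this PR; the field names, the `make_collector` and `take_snapshot` helpers, and the `Arc<Mutex<...>>` sharing are assumptions for illustration and may not match the real code. `eprintln!` stands in for the logging macro.

```rust
use std::sync::{Arc, Mutex};

/// Raw per-turn signals. Field names are illustrative, not the PR's exact schema.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct IntentTurnEvidence {
    pub turn_index: usize,
    pub asked_user_question: bool,
    pub question_topics: Vec<String>,
    pub round_count: usize,
}

#[derive(Debug, Default)]
pub struct IntentEvidenceCollector {
    evidence: IntentTurnEvidence,
}

impl IntentEvidenceCollector {
    pub fn new(turn_index: usize) -> Self {
        Self { evidence: IntentTurnEvidence { turn_index, ..Default::default() } }
    }
    pub fn record_round(&mut self) {
        self.evidence.round_count += 1;
    }
    pub fn record_question(&mut self, topics: &[&str]) {
        self.evidence.asked_user_question = true;
        self.evidence.question_topics.extend(topics.iter().map(|t| t.to_string()));
    }
    pub fn snapshot(&self) -> IntentTurnEvidence {
        self.evidence.clone()
    }
}

pub type SharedCollector = Arc<Mutex<IntentEvidenceCollector>>;

/// Hypothetical helper: a collector exists only when tracking is enabled.
pub fn make_collector(enable_intent_tracking: bool, turn_index: usize) -> Option<SharedCollector> {
    enable_intent_tracking.then(|| Arc::new(Mutex::new(IntentEvidenceCollector::new(turn_index))))
}

/// Hypothetical helper: snapshot after the dialog loop; raw evidence only,
/// never converted into terminal assignments.
pub fn take_snapshot(collector: &Option<SharedCollector>) -> Option<IntentTurnEvidence> {
    let shared = collector.as_ref()?;
    match shared.lock() {
        Ok(guard) => Some(guard.snapshot()),
        Err(_) => {
            eprintln!("warn: intent evidence mutex poisoned; skipping snapshot");
            None
        }
    }
}

fn main() {
    let collector = make_collector(true, 0);
    if let Some(shared) = &collector {
        let mut guard = shared.lock().unwrap();
        guard.record_round();
        guard.record_question(&["deadline"]);
    }
    println!("{:?}", take_snapshot(&collector));
    // With tracking disabled there is no collector and nothing to persist.
    println!("{:?}", take_snapshot(&make_collector(false, 0)));
}
```

Keeping the collector behind an `Option` means the disabled path does no extra work, which is consistent with tracking being opt-in.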

4. Session Usage Report Surface

  • session_usage/types.rs: adds optional proactivity and completeness report fields.
  • session_usage/service.rs: aggregates real hidden-intent assignments into a proactivity report when such assignments exist.
  • Completeness remains unset until a dedicated checklist/rubric/rule-based grader is implemented.
  • Legacy proxy-style turn-* assignments generated from raw evidence are ignored so old heuristic data is not reported as real hidden-intent evaluation.

5. Frontend / Adapter Plumbing

  • Desktop API request DTOs accept enable_intent_tracking and pass it into SessionConfig.
  • Flow chat config types and session creation paths propagate enableIntentTracking.
  • Session API/report typings include optional proactivity/completeness report fields.
  • Session history typings include optional intentEvidence, so future UI/report features can inspect raw evidence separately from hidden-intent assignments.

Incremental Refactor / pi-Bench Alignment

A follow-up refactor tightened the implementation against the paper's functional model:

  • Separated evidence from assignment: runtime signals are stored as IntentTurnEvidence, not synthetic IntentAssignment rows.
  • Corrected proactivity semantics: score helpers use the full hidden-intent count as denominator and return unavailable while any hidden intent lacks a terminal status.
  • Removed false completeness derivation: completeness is no longer computed from Completed/Inferred/Provided; it is reserved for an independent final-task grader.
  • Preserved compatibility: serde defaults/aliases keep older session files readable, and usage reports filter old proxy assignments.
  • Kept boundaries clean: shared DTOs live in services-core, execution evidence collection lives in bitfun-core, and UI code consumes typed API/session-history data.

Current Limitations / Follow-ups

  • This PR does not yet implement the full hidden-intent discovery/assignment evaluator described by pi-Bench.
  • Persistent-intent DTOs are present, but cross-session memory recovery and application are not fully wired.
  • Completeness needs a separate grader over the final trajectory/artifacts and task requirements.
  • AskUserQuestion topic extraction depends on the current tool-call argument shape and should be revisited if the tool schema changes.
  • Proactivity reporting is meaningful only when real hidden-intent assignments are produced by a future evaluator or imported evaluation data.

Follow-up TODO: Validation and Optimization

Validation TODO

  • Add fixture-based tests with hand-authored hidden intents and expected terminal statuses to verify Completed, Inferred, and Provided classification semantics independently from runtime evidence collection.
  • Add replay tests for saved sessions with enable_intent_tracking=true, covering metadata persistence, turn-file intentEvidence, and usage report aggregation after reload.
  • Add backward-compatibility tests for older session files that contain no intent fields, plus sessions that contain legacy proxy-style turn-* assignments.
  • Add remote-workspace coverage to confirm evidence persistence and usage-report generation behave the same for local and remote sessions.
  • Add a small evaluator-golden set once the assignment evaluator exists, with cases for direct satisfaction, targeted clarification, generic clarification that should not count as Inferred, and user-provided hidden intent.
  • Add completeness-grader tests separately from proactivity, using task checklists/rubrics over final artifacts so completeness does not regress into hidden-intent status counting.

Optimization TODO

  • Introduce a post-hoc hidden-intent assignment evaluator that compares concrete hidden intents against the full trajectory in two stages: direct satisfaction first, then targeted elicitation.
  • Wire persistent intents into the workspace/session memory layer so preferences can be recovered across sessions without coupling core logic to desktop-specific storage.
  • Replace fragile question-topic extraction with a typed AskUserQuestion tool contract or versioned parser to avoid silent drift when the tool schema changes.
  • Consider batching or coalescing post-turn evidence persistence if high-volume tracking sessions show measurable metadata I/O overhead.
  • Add UI affordances for raw evidence vs evaluated assignments, making it clear when a report is evidence-only, fully evaluated, or unavailable.
  • Add report coverage metadata for proactivity/completeness so consumers can distinguish not evaluated, partially evaluated, and fully evaluated states.
  • Revisit proactivity thresholds after collecting real coding-assistant sessions; pi-Bench-style labels may need calibration for BitFun's coding and desktop workflows.

Risk Assessment

Low Risk

  • enable_intent_tracking defaults to false, so evidence collection is opt-in.
  • New persisted fields use Option, Vec, serde defaults, and aliases for backward-compatible deserialization.
  • Raw evidence and terminal assignments are stored separately, reducing the risk of misleading reports.

Medium Risk

  • Enabling tracking adds post-turn metadata/turn-file persistence work.
  • The current report surface includes reserved completeness fields before the completeness grader exists.
  • Future evaluator work must be careful to distinguish targeted clarification (Inferred) from generic questions or passive waiting.

Verification

  • cargo test -p bitfun-services-core hidden_intent -- --nocapture
  • cargo test -p bitfun-core intent_evidence -- --nocapture
  • cargo test -p bitfun-core report_ -- --nocapture
  • cargo check --tests -p bitfun-services-core
  • cargo check --tests -p bitfun-core
  • cargo check --tests -p bitfun-desktop
  • pnpm run type-check:web
  • pnpm run lint:web
  • pnpm --dir src/web-ui run test:run (139 files / 744 tests passed)

Generated with BitFun

Based on the pi-Bench Hidden Intent framework (arXiv 2605.14678), this
introduces infrastructure for tracking proactive assistance quality in
long-horizon agent workflows.

Paper reference:
  pi-Bench: Evaluating Proactive Personal Assistant Agents in
  Long-Horizon Workflows
  Zhang et al., arXiv 2605.14678, May 2026

What this adds:
  - Hidden Intent types: IntentTerminalStatus (Completed/Inferred/Provided),
    HiddenIntent, PersistentIntent, SessionIntentTracking,
    ProactivityScore, CompletenessScore in services-core
  - IntentEvidenceCollector and IntentTurnEvidence in the ExecutionEngine
    for lightweight per-turn signal collection
  - Proactivity behavior guidance in agentic_mode.md and claw_mode.md
    system prompts
  - Extended facet_extraction.md with proactivity/completeness
    analysis dimensions
  - SessionUsageReport extensions with ProactivityReport and
    CompletenessRepor
Owner

GCWing commented May 24, 2026

This PR involves significant changes and affects the Agentic agent; it will be considered for merging after verification.

@harryfan1985
Contributor Author

This PR involves significant changes and affects the Agentic agent; it will be considered for merging after verification.

sure!

- round_executor: detect AskUserQuestion even when no topic headers are
  extractable, so the call is no longer silently dropped
- execution_engine/session_manager: drop unused turn_id param; warn on
  poisoned intent evidence mutex instead of silent skip
- hidden_intent_types: centralize proactivity level thresholds in
  ProactivityLevel::{from_score,as_str}; add explicit IntentAssignment
  is_proxy flag so proxy detection no longer relies solely on a fragile
  intent_id string heuristic (heuristic kept as legacy fallback)
- session_usage: use is_proxy flag first; document the single-provided
  suppression rationale
- add regression tests for AskUserQuestion detection and proxy filtering

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@harryfan1985
Contributor Author

Follow-up: Code Review Fixes (commit 4ffbb24)

The issues found in the earlier review are fixed in the latest commit, and regression tests have been added.

Fixes

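| Issue | Files | Fix |
| --- | --- | --- |
| AskUserQuestion under-reported | round_executor.rs | detect_ask_user_question previously returned (false, []) when questions[].header was missing, so the tool call was silently dropped. The call itself is now recorded through a separate called flag, and topic extraction stays best-effort. |
| Unused parameter | session_manager.rs / execution_engine.rs | Removed the never-used _turn_id parameter from record_intent_evidence (the turn is located through evidence.turn_index) and updated the only caller. |
| Evidence silently lost on mutex poisoning | execution_engine.rs | When the evidence collector's lock is poisoned, a warn! log is now emitted instead of silently skipping, which makes the failure easier to diagnose. |
| Thresholds duplicated in three places | hidden_intent_types.rs / intent_evidence.rs / service.rs | The 0.8/0.5/0.2 level thresholds are consolidated in ProactivityLevel::from_score() and as_str(); the other two sites now delegate to them. |
| Fragile proxy-assignment detection | hidden_intent_types.rs / service.rs | IntentAssignment gains an explicit is_proxy: bool field (serde default false, backward compatible). is_legacy_proxy_intent_assignment reads this field first; the original intent_id.starts_with("turn-") string heuristic is kept as a fallback for old data, which avoids misclassifying real intents. |
| Single-Provided filter undocumented | service.rs | Added a comment explaining why a single Provided assignment (total=1) does not make a meaningful report and is therefore suppressed. |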
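The AskUserQuestion fix in the first row can be sketched as below. The real detect_ask_user_question reads JSON tool-call arguments; to stay dependency-free, this sketch assumes simplified `ToolCall` and `Question` structs, which are hypothetical stand-ins and not the PR's types. The point it shows is that the called flag is set from the tool name alone, while topics are collected only when a usable header exists.

```rust
/// Simplified stand-in for one entry of the tool's `questions` array.
pub struct Question {
    pub header: Option<String>,
}

/// Simplified stand-in for a tool call; the real code parses JSON arguments.
pub struct ToolCall {
    pub name: String,
    /// `None` models a missing `questions` key.
    pub questions: Option<Vec<Question>>,
}

/// Returns (called, topics). `called` no longer depends on topic extraction.
pub fn detect_ask_user_question(calls: &[ToolCall]) -> (bool, Vec<String>) {
    let mut called = false;
    let mut topics = Vec::new();
    for call in calls.iter().filter(|c| c.name == "AskUserQuestion") {
        // Record the call itself, even if no topic can be extracted.
        called = true;
        // Best-effort topic extraction from non-empty headers.
        for question in call.questions.iter().flatten() {
            let header = question.header.as_deref().map(str::trim);
            if let Some(topic) = header.filter(|h| !h.is_empty()) {
                topics.push(topic.to_string());
            }
        }
    }
    (called, topics)
}

fn main() {
    let calls = vec![
        ToolCall { name: "ReadFile".to_string(), questions: None },
        ToolCall {
            name: "AskUserQuestion".to_string(),
            questions: Some(vec![
                Question { header: Some("Deadline".to_string()) },
                Question { header: None },
            ]),
        },
    ];
    let (called, topics) = detect_ask_user_question(&calls);
    println!("called={called}, topics={topics:?}");
}
```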

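New Tests

  • round_executor.rs: 6 detect_ask_user_question cases, covering header present / header missing / empty array / missing key / tool not called / mixed tool calls.
  • session_usage/service.rs: 2 proxy-filtering cases: is_proxy=true must be excluded regardless of intent_id, and a real intent with a turn- prefix must not be filtered out when is_proxy=false.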
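The two proxy-filtering cases can be illustrated with the sketch below. `IntentAssignment`, `is_proxy`, `intent_id`, and is_legacy_proxy_intent_assignment are named in this PR. How the legacy fallback distinguishes old proxy rows from real intents whose id happens to start with turn- is not spelled out here, so this sketch assumes, purely for illustration, that legacy proxy rows also carry a `proactive_tools=` marker in a `trigger_description` field. The `real_assignments` helper is likewise hypothetical.

```rust
#[derive(Debug, Clone)]
pub struct IntentAssignment {
    pub intent_id: String,
    /// Explicit flag; old session files deserialize it as `false`.
    pub is_proxy: bool,
    /// ASSUMPTION: free-text field used by the legacy fallback in this sketch.
    pub trigger_description: String,
}

/// Explicit flag first, string heuristic only as a fallback for old data.
pub fn is_legacy_proxy_intent_assignment(assignment: &IntentAssignment) -> bool {
    if assignment.is_proxy {
        return true;
    }
    // ASSUMPTION: the extra marker check is illustrative; the PR only states
    // that the `turn-` prefix heuristic is kept as a legacy fallback.
    assignment.intent_id.starts_with("turn-")
        && assignment.trigger_description.contains("proactive_tools=")
}

/// Hypothetical helper: keep only assignments that should reach the report.
pub fn real_assignments(all: &[IntentAssignment]) -> Vec<&IntentAssignment> {
    all.iter()
        .filter(|a| !is_legacy_proxy_intent_assignment(a))
        .collect()
}

fn main() {
    let all = vec![
        IntentAssignment {
            intent_id: "intent-7".to_string(),
            is_proxy: true,
            trigger_description: String::new(),
        },
        IntentAssignment {
            intent_id: "turn-3".to_string(),
            is_proxy: false,
            trigger_description: "proactive_tools=2".to_string(),
        },
        IntentAssignment {
            intent_id: "turn-based-reminder".to_string(),
            is_proxy: false,
            trigger_description: "user prefers weekly summaries".to_string(),
        },
    ];
    for kept in real_assignments(&all) {
        println!("kept: {}", kept.intent_id);
    }
}
```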

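Verification

cargo test -p bitfun-services-core hidden_intent    # 10 passed
cargo test -p bitfun-core intent_evidence           # 12 passed
cargo test -p bitfun-core report_                   # 22 passed (including 2 new)
cargo test -p bitfun-core detect_ask_user_question  # 6 passed (new)
cargo check --tests -p bitfun-services-core         # passed
cargo check --tests -p bitfun-core                  # passed
cargo check --tests -p bitfun-desktop               # passed

Note: issue #2 (parsing proactive_tools= out of the free-text trigger_description) is a design-level problem. It is best replaced with a dedicated field when the structured evaluator lands, so it is left unchanged here to keep the scope contained.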
