[XPU][Speculative Decoding] Enable CudaGraph capture for MTP draft model by Clarity256 · Pull Request #8061 · PaddlePaddle/FastDeploy

Clarity256 · 2026-06-17T05:19:17Z

Motivation

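Enable the step_use_cudagraph gating logic for draft model forward inference, and capture only the first step during multi-step execution.
Pass forward_meta and use_cudagraph to xpu_pre_process in the draft model inference path, so that cu_seqlens_q_output / batch_id_per_token_output are updated in place with copy_ in cudagraph mode, keeping tensor addresses stable.
Add a padding_cudagraph_inputs() method to handle buffer padding for the draft model, and slice the model output by real_token_num during graph replay.
Adapt the target model's speculative decoding warmup flow (capture size calculation, passing the accept_all_drafts parameter, and the expected_decode_len fix for TP>1).
Replace padding_sampling_params (a Python-side CPU implementation) with the build_sampling_params XPU custom operator ([XPU][OP] Add build_sampling_params kernel for MTP speculative decoding #8032), which updates infer_seed in place inside the operator and avoids extra operations outside the cudagraph.
Tie increment_value to the number of speculative decoding tokens: (num_speculative_tokens + 1) * 4.
In the draft model, update last_seq_lens_this_time with copy_() instead of clone(), so that CUDAGraph replay does not create new tensors and cause memory to keep growing.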
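The address-stability point can be illustrated with a small pure-Python analogy. This is only a sketch and not FastDeploy code: `ToyGraph`, `update_by_rebinding`, and `update_in_place` are hypothetical names, and plain lists stand in for device tensors. Rebinding a name to a new object plays the role of `clone()`, while slice assignment plays the role of `copy_()`.

```python
class ToyGraph:
    """Toy stand-in for a captured graph: it remembers buffer objects, not values."""

    def __init__(self, input_buf, output_buf):
        # The buffer identities are fixed at "capture" time.
        self.input_buf = input_buf
        self.output_buf = output_buf

    def replay(self):
        # Replay reads whatever currently lives in the captured input buffer.
        for i, value in enumerate(self.input_buf):
            self.output_buf[i] = value * 2


def update_by_rebinding(state, new_values):
    # Analogous to clone(): a fresh object replaces the old one.
    state["buf"] = list(new_values)


def update_in_place(state, new_values):
    # Analogous to copy_(): same object, new contents.
    state["buf"][:] = new_values


state = {"buf": [1, 2, 3]}
out = [0, 0, 0]
graph = ToyGraph(state["buf"], out)

# Rebinding: the graph still reads the old buffer, so the new values are missed.
update_by_rebinding(state, [10, 20, 30])
graph.replay()
stale = list(out)

# In-place update on the captured buffer: replay sees the new values.
state["buf"] = graph.input_buf
update_in_place(state, [10, 20, 30])
graph.replay()
fresh = list(out)

print(stale, fresh)
```

In this toy, the rebinding path also leaves an extra list allocated on every call, which loosely mirrors the memory growth described above.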

Modifications

fastdeploy/spec_decode/mtp_xpu.py: enable step_use_cudagraph gating for the draft model; add cudagraph padding logic and output slicing to _propose; pass cudagraph parameters in _initialize_forward_meta; update last_seq_lens_this_time in place with copy_().
fastdeploy/worker/xpu_model_runner.py: tie increment_value to the number of speculative decoding tokens; adapt the warmup capture flow to speculative decoding; move the infer_seed update into the build_sampling_params operator; pass step_use_cudagraph to the draft model propose call; fix expected_decode_len of dummy_prefill_inputs for TP>1.
fastdeploy/model_executor/layers/sample/sampler.py: forward_xpu now uses the build_sampling_params XPU operator instead of padding_sampling_params; add an increment_value parameter.
fastdeploy/model_executor/xpu_pre_and_post_process.py: in cudagraph mode, update cu_seqlens_q_output and batch_id_per_token_output in place with copy_, so that the tensor addresses captured by the graph stay stable.
tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py → test_mtp_cudagraph.py: rename the test script to follow the CI naming convention.
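The padding and output slicing described for `_propose` can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `pad_to_capture_size`, `capture_sizes`, and `toy_model` are hypothetical, and the real `padding_cudagraph_inputs()` logic is not shown in this thread. The idea is that a graph is replayed at a fixed captured size, so inputs are padded up to that size and the output is cut back to `real_token_num` afterwards.

```python
def pad_to_capture_size(tokens, capture_sizes, pad_value=0):
    """Pad tokens up to the smallest capture size that fits them.

    Returns the padded list and the number of real (unpadded) tokens.
    """
    real_token_num = len(tokens)
    for size in sorted(capture_sizes):
        if size >= real_token_num:
            padding = [pad_value] * (size - real_token_num)
            return tokens + padding, real_token_num
    raise ValueError("no capture size is large enough for this input")


def toy_model(padded_tokens):
    # Stand-in for a graph replay: produces one output per input slot,
    # including the padded slots.
    return [token + 100 for token in padded_tokens]


padded, real_token_num = pad_to_capture_size([5, 6, 7], capture_sizes=[1, 2, 4, 8])
full_output = toy_model(padded)

# Keep only the outputs that correspond to real tokens.
output = full_output[:real_token_num]

print(padded, output)
```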

Usage or Command

Accuracy Tests

MTP with CUDAGraph: output matches the reference results (see the PR screenshots).

MTP without CUDAGraph: output matches the reference results (see the PR screenshots).

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

CLAassistant · 2026-06-17T05:19:24Z

All committers have signed the CLA.

codecov-commenter · 2026-06-17T06:05:11Z

Codecov Report

❌ Patch coverage is 0% with 38 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@74a363e). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/worker/xpu_model_runner.py	0.00%	17 Missing ⚠️
fastdeploy/spec_decode/mtp_xpu.py	0.00%	13 Missing ⚠️
...tdeploy/model_executor/xpu_pre_and_post_process.py	0.00%	5 Missing ⚠️
fastdeploy/model_executor/layers/sample/sampler.py	0.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8061   +/-   ##
==========================================
  Coverage           ?   67.39%           
==========================================
  Files              ?      475           
  Lines              ?    66766           
  Branches           ?    10305           
==========================================
  Hits               ?    45000           
  Misses             ?    18894           
  Partials           ?     2872

Flag	Coverage Δ
GPU	`77.47% <0.00%> (?)`
XPU	`6.97% <0.00%> (?)`


- Enable step_use_cudagraph for draft model with proper gating logic
- Pass forward_meta and use_cudagraph to xpu_pre_process in draft path
- Add padding_cudagraph_inputs() for draft model buffer management
- Slice model output by real_token_num when graph is active
- Adapt target model warmup and execute_model for MTP+CudaGraph
- Use build_sampling_params kernel in verify path (replaces padding_sampling_params)
- Fix memory issue by using copy_ instead of clone for seq_lens_this_time
- Fix expected_decode_len for TP>1 in dummy_prefill

Co-Authored-By: Clarity256 <1140021759@qq.com>

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-17 21:55:43

📋 Review Summary

PR overview: adapts the XPU MTP draft/target inference paths to CUDAGraph capture and moves the construction of speculative sampling parameters into an XPU op.
Scope of changes: sampler.py, xpu_pre_and_post_process.py, mtp_xpu.py, xpu_model_runner.py, XPU CI test script
Affected tags: [XPU] [Speculative Decoding] [Graph Optimization] [OP]

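Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/model_executor/layers/sample/sampler.py:1243`	XPU NAIVE speculative sampling uses batch-dimension sampling parameters directly, and this branch no longer updates `infer_seed`

Status of previous findings

Finding	Issue	Status
F1	`SamplingMetadata` has no `topp_seed` field, so the XPU NAIVE speculative path raises an error at runtime.	✅ Fixed
F2	Prefill capture with `cudagraph_only_prefill=True` has been disabled.	⚠️ Still present
F3	Only the target model's `moe_phase` is synchronized here; the MTP draft model's `fd_config` is not.	⚠️ Still present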

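📝 PR convention check

The title contains two official tags, whereas the current FastDeploy convention requires the title to contain exactly one official tag. In addition, the Usage or Command section is empty. The following is suggested instead.

Suggested title (can be copied directly):

[XPU] Enable CUDAGraph capture for MTP draft model

Suggested PR description (can be copied directly):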

## Motivation
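1. Enable the `step_use_cudagraph` gating logic for draft model forward inference, and capture only the first step during multi-step execution.
2. Pass `forward_meta` and `use_cudagraph` to `xpu_pre_process` in the draft model inference path, so that `cu_seqlens_q_output` / `batch_id_per_token_output` are updated in place with `copy_` in cudagraph mode, keeping tensor addresses stable.
3. Add a `padding_cudagraph_inputs()` method to handle buffer padding for the draft model, and slice the model output by `real_token_num` during graph replay.
4. Adapt the target model's speculative decoding warmup flow: capture size calculation, passing the `accept_all_drafts` parameter, and the `expected_decode_len` fix for TP>1.
5. Replace `padding_sampling_params` with the `build_sampling_params` XPU custom operator, which updates `infer_seed` in place inside the operator and avoids extra operations outside the cudagraph.
6. Tie `increment_value` to the number of speculative decoding tokens: `(num_speculative_tokens + 1) * 4`.
7. In the draft model, update `last_seq_lens_this_time` with `copy_()` instead of `clone()`, so that CUDAGraph replay does not create new tensors and cause memory to keep growing.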

## Modifications
- `fastdeploy/spec_decode/mtp_xpu.py`: enable `step_use_cudagraph` gating for the draft model; add cudagraph padding logic and output slicing to `_propose`; pass cudagraph parameters in `_initialize_forward_meta`; update `last_seq_lens_this_time` in place with `copy_()`.
- `fastdeploy/worker/xpu_model_runner.py`: tie `increment_value` to the number of speculative decoding tokens; adapt the warmup capture flow to speculative decoding; move the `infer_seed` update into the `build_sampling_params` operator; pass `step_use_cudagraph` to the draft model propose call; fix `expected_decode_len` of `dummy_prefill_inputs` for TP>1.
- `fastdeploy/model_executor/layers/sample/sampler.py`: `forward_xpu` now uses the `build_sampling_params` XPU operator instead of `padding_sampling_params`; add an `increment_value` parameter.
- `fastdeploy/model_executor/xpu_pre_and_post_process.py`: in cudagraph mode, update `cu_seqlens_q_output` and `batch_id_per_token_output` in place with `copy_`, so that the tensor addresses captured by the graph stay stable.
- `tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py` → `tests/xpu_ci/4cards_cases/test_mtp_cudagraph.py`: rename the test script to follow the CI naming convention.

## Usage or Command
N/A

## Accuracy Tests
- MTP with CUDAGraph: output matches the reference results (see the PR screenshots).
- MTP without CUDAGraph: output matches the reference results (see the PR screenshots).

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

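Overall assessment

The main MTP CUDAGraph path is implemented in a broadly clear direction, but after the seed migration in this PR, XPU NAIVE speculative sampling is left with a gap in token-dimension parameter construction. This affects runtime correctness and needs to be fixed first. The earlier prefill capture and mixed EP draft phase synchronization issues remain unresolved; no inline comments are repeated for them in this round.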

PaddlePaddle-bot · 2026-06-17T13:56:38Z

            probs,
-            top_p=top_p,
-            top_k=top_k,
+            top_p=sampling_metadata.top_p,


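🔴 Bug: XPU NAIVE speculative sampling here samples token-dimension probs directly with batch-dimension top_p/top_k/seed.

The top_p/top_k/topp_seed arguments of top_k_top_p_sampling are matched per query row. Here probs has already gone through the speculative penalty processing, so its row count equals the number of output tokens, and the original padding_sampling_params expanded the parameters to the token dimension according to seq_lens_this_time/seq_lens_encoder. Passing batch-dimension parameters directly now causes a shape mismatch or incorrect reuse of sampling parameters in prefill or multi-token scenarios. In addition, the runner no longer uniformly increments infer_seed for speculative scenarios, and this NAIVE branch does not call build_sampling_params, which would update the seed in place, so subsequent steps reuse the same seed.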
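The batch-to-token expansion that the comment refers to can be sketched in plain Python. This is a simplified illustration, not the actual `padding_sampling_params` or `build_sampling_params` code: `expand_to_token_dim` is a hypothetical helper, lists stand in for tensors, and the `seq_lens_encoder` handling mentioned above is deliberately left out.

```python
def expand_to_token_dim(per_request_values, seq_lens_this_time):
    """Repeat each request's value once per output token of that request."""
    if len(per_request_values) != len(seq_lens_this_time):
        raise ValueError("one value and one length are needed per request")
    expanded = []
    for value, n_tokens in zip(per_request_values, seq_lens_this_time):
        expanded.extend([value] * n_tokens)
    return expanded


# Two requests: the first emits 3 tokens this step, the second emits 1.
top_p = [0.9, 0.5]
top_k = [40, 10]
seq_lens_this_time = [3, 1]

token_top_p = expand_to_token_dim(top_p, seq_lens_this_time)
token_top_k = expand_to_token_dim(top_k, seq_lens_this_time)

# probs has one row per output token, so the sampling parameters need
# 4 rows as well, not the 2 rows of the batch-dimension inputs.
num_prob_rows = sum(seq_lens_this_time)

print(token_top_p, token_top_k, num_prob_rows)
```

With the batch-dimension lists passed as they are, the 4 rows of probs would be paired with only 2 parameter rows, which is the mismatch described above.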

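Suggested fix:
Pass increment_value into _normal_sample_xpu and, as in the TARGET_MATCH verify branch, call build_sampling_params(...) to produce token-dimension top_p/top_k/topp_seed before sampling; alternatively, keep the original padding logic and explicitly update infer_seed in place in this branch. Do not pass batch-dimension sampling_metadata.* directly to token-dimension sampling.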
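The seed concern can be made concrete with a small sketch. It assumes a plain additive in-place update, which is an assumption: the actual update rule inside `build_sampling_params` (for example any wraparound) is not shown in this thread. `advance_seeds_in_place` is a hypothetical helper, and only the `increment_value` formula comes from the PR description.

```python
def advance_seeds_in_place(infer_seed, increment_value):
    """Add increment_value to every seed without creating a new list."""
    for i in range(len(infer_seed)):
        infer_seed[i] += increment_value


num_speculative_tokens = 1
# Formula from the PR description.
increment_value = (num_speculative_tokens + 1) * 4

infer_seed = [0, 100]
seeds_seen = []
for _ in range(3):
    seeds_seen.append(list(infer_seed))  # seeds used by this step
    advance_seeds_in_place(infer_seed, increment_value)

# Without the in-place update, every step would see the same seeds.
seeds_without_update = [[0, 100] for _ in range(3)]

print(increment_value, seeds_seen)
```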

PaddlePaddle-bot · 2026-06-18T02:18:33Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-18 10:17:29 UTC+08:00

The CI report is generated from the following code (updated every 30 minutes):
PR commit: 2355d0e | Merge base: 74a363e (branch: develop)

1 Required jobs: 8/10 passed

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
42(0)	42	36	6	0	0	0

Job	Error type	Confidence	Log
`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	PR issue	High	Job
`Approval`	Approval required	High	Job

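2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Error type: PR issue | Confidence: High
Analyzer: generic analysis (fallback)
Failed case: diff coverage threshold check

Case	Error summary
`fastdeploy/model_executor/layers/sample/sampler.py`	Added/changed lines 64, 65, and 1276 are not covered by unit tests; diff coverage is 0%, below the 80% threshold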

Key logs:

TEST_EXIT_CODE: 0
COVERAGE_EXIT_CODE: 9
Coverage generation failed (exit code 9)
"src_stats": {"fastdeploy/model_executor/layers/sample/sampler.py": {"percent_covered": 0.0,
"violation_lines": [64, 65, 1276], "covered_lines": []}}
"total_num_violations": 3, "total_percent_covered": 0, "num_changed_lines": 109

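Root cause summary: the new XPU sampling path in sampler.py is not covered
The unit tests ran successfully (TEST_EXIT_CODE=0); the failure occurred only in Verify Code Coverage Threshold (80%). In sampler.py, the PR adds an XPU branch that imports build_sampling_params/top_p_candidates/verify_draft_tokens and calls build_sampling_params(...) in the TARGET_MATCH verify path; diff coverage reports these added lines as uncovered.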
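The 0% figure follows from simple arithmetic on the logged line lists. The sketch below assumes that diff coverage is computed as covered lines divided by measured lines (covered plus violations); that formula is an assumption about the coverage tool, and `diff_coverage_percent` is a hypothetical helper. The line numbers and the 80% threshold are taken from the log and report above.

```python
# Values copied from the "src_stats" entry in the CI log.
src_stats = {
    "violation_lines": [64, 65, 1276],
    "covered_lines": [],
}
THRESHOLD_PERCENT = 80


def diff_coverage_percent(stats):
    """Share of measured changed lines that were executed by tests."""
    covered = len(stats["covered_lines"])
    violations = len(stats["violation_lines"])
    measured = covered + violations
    if measured == 0:
        return 100.0  # nothing measurable changed, so nothing is uncovered
    return 100.0 * covered / measured


percent = diff_coverage_percent(src_stats)
passes = percent >= THRESHOLD_PERCENT

print(percent, passes)
```

Under this assumption, covering all three lines would bring the figure to 100%, while covering two of them would give about 67%, still below the threshold.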

Suggested fix:

Add coverage in tests/layers/test_sampler.py or a neighboring XPU sampler unit test: mock current_platform.is_xpu(), build_sampling_params, and top_k_top_p_sampling, construct share_inputs for VerifyStrategy.TARGET_MATCH, and reach fastdeploy/model_executor/layers/sample/sampler.py:1276.
Also make the test's import/initialization go through the XPU branch, covering fastdeploy/model_executor/layers/sample/sampler.py:64 and fastdeploy/model_executor/layers/sample/sampler.py:65; if these import lines cannot be executed in the current unit test environment, handle them with a project-approved coverage exemption.

Related changes: fastdeploy/model_executor/layers/sample/sampler.py:64, fastdeploy/model_executor/layers/sample/sampler.py:65, fastdeploy/model_executor/layers/sample/sampler.py:1276
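The mocking approach suggested above can be sketched with `unittest.mock`. Everything in this block is a self-contained stand-in: `FakePlatform`, `ops`, and `sample` are hypothetical and only mimic the shape of a platform check guarding a device-only op; they are not the FastDeploy objects, and a real test would patch the names inside the sampler module instead.

```python
import types
import unittest
from unittest import mock


class FakePlatform:
    """Stand-in for a platform object with an is_xpu() check."""

    def is_xpu(self):
        return False


def _device_only_op(top_p, seq_lens):
    # Stand-in for a custom op that only exists on the device.
    raise RuntimeError("XPU op is not available on this machine")


current_platform = FakePlatform()
ops = types.SimpleNamespace(build_sampling_params=_device_only_op)


def sample(top_p, seq_lens):
    """Toy function under test with a platform-gated branch."""
    if current_platform.is_xpu():
        return ops.build_sampling_params(top_p, seq_lens)
    return list(top_p)


class XpuBranchTest(unittest.TestCase):
    def test_xpu_branch_calls_op(self):
        # Force the XPU branch and replace the device op with a mock.
        with mock.patch.object(current_platform, "is_xpu", return_value=True):
            with mock.patch.object(
                ops, "build_sampling_params", return_value=[0.9, 0.9]
            ) as op:
                result = sample([0.9], [2])
        op.assert_called_once_with([0.9], [2])
        self.assertEqual(result, [0.9, 0.9])

    def test_default_branch_skips_op(self):
        # Outside the patch, the platform check is back to its real value.
        self.assertEqual(sample([0.9], [2]), [0.9])
```

The test case can be run with the standard `unittest` runner. Because the patches are scoped by `with`, the gated branch is exercised without the device op ever being executed.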

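🔴 Approval: approval required (confidence: high)

Error type: Approval required | Confidence: High
Analyzer: builtin
Failed case: none

Key logs:

Process completed with exit code 6.

Root cause summary: the workflow is waiting for manual approval
This job requires manual approval; CI continues only after the approval is granted.

Suggested fix:

Please complete the manual approval.

Related changes: no code changes involved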

Clarity256 had a problem deploying to Metax_ci June 17, 2026 05:19 — with GitHub Actions Failure

Clarity256 force-pushed the feature/xpu-mtp-cudagraph-capture branch from 9e45d1d to 72c0f92 Compare June 17, 2026 05:28

Clarity256 had a problem deploying to Metax_ci June 17, 2026 05:28 — with GitHub Actions Failure

Clarity256 force-pushed the feature/xpu-mtp-cudagraph-capture branch from 72c0f92 to 2355d0e Compare June 17, 2026 07:27

Clarity256 had a problem deploying to Metax_ci June 17, 2026 07:27 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Jun 17, 2026

Clarity256 wants to merge 1 commit into PaddlePaddle:develop from Clarity256:feature/xpu-mtp-cudagraph-capture
