Fix moe acc #7988
…thout a slot(PaddlePaddle#7141) (PaddlePaddle#7181) * [BugFix] Set MC_MAX_MR_SIZE to avoid register hang (PaddlePaddle#7163) * Set MC_MAX_MR_SIZE to avoid register hang * up * [fix] prevent requests from entering running state without a slot * [fix] count abort set * [fix] count preempted task in waiting list --------- Co-authored-by: jc <52520497+juncaipeng@users.noreply.github.com>
… (PaddlePaddle#7192) * fix MTP bugs in TP and overlap * fix
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com> Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
* [Feature]whl version * [Feature]whl version,set root_is_pure = false * [Feature]code style Co-authored-by: ChowMingSing <610208940@qq.com>
…7218 (PaddlePaddle#7256) * support moe-topk use topk_reduce_func * fix ep error * fix ut * fix ut
…s in SM90 flash_mask_attn (PaddlePaddle#7216)
…addle#7266) * Remove duplicate NICs from environment variables * Update version for xvllm in download_dependencies.sh Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…addlePaddle#7191) * merge matmul and add * modify format * using paddle.nn.functional.linear * using _C_ops.linear * using paddle.nn.functional.linear * add FLAGS_use_legacy_linear env var in test case * fix format * add assert and remove env * modify format * using matmul for no bias * modify accurate baseline
…7277) * Update docs for release/2.5 * Update English docs for release/2.5 - Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link - Update docs/get_started/installation/nvidia_gpu.md: - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option - Update docs/zh/get_started/installation/nvidia_gpu.md: - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1 Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f * Clarify --extra-index-url usage in installation docs Add note explaining that --extra-index-url is only for downloading fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed from the Paddle source specified by -i. Applied to both Chinese and English nvidia_gpu.md installation guides. Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c * Update nvidia_gpu.md --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…nd bug (PaddlePaddle#7221) (PaddlePaddle#7296) Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
…#7276) * fix * refine code * refine code * refine code * refine code * refine code
…ion Params + CUDAGraph Validation (PaddlePaddle#7215,PaddlePaddle#7281) (PaddlePaddle#7301) * refactor cudagraph args * refactor quant cli param * fix * fix * tmp skip xpu * fix
…e#7320) (PaddlePaddle#7322) Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…addlePaddle#7318) * change glm rope_emb calculation * glm without EnforceFmulRN * fix ci
) (PaddlePaddle#7339) * moe bf16 ep support paddle batch_gemm
…#7308) (PaddlePaddle#7310) * support quant use pow2scale * fix * fix
…ePaddle#7159) (PaddlePaddle#7351) * [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 * [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 * fix
…_stop_value kernels (PaddlePaddle#7370) - speculate_limit_thinking_content_length: update current_base_step to step_idx+1 (step_idx now records history count before current round); remove incorrect step_idx decrement on accept_num truncation; mark step_idx param as const. - speculate_set_stop_value_multi_seqs: fix can_stop gate to use step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx formula (remove stale -accept_num offset); use <= condition so accept_idx maps directly to the accepted token that ends the stop sequence; fix accept_tokens index (remove -1). - Update unit tests for speculate_set_stop_value_multi_seqs kernel.
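…it scenario (PaddlePaddle#7364) (PaddlePaddle#7387) ## Motivation In the PD disaggregation scenario, after receiving a request forwarded by the prefill node, the decode node did not update the cache block hit information in time, which lowered the prefix cache hit rate and hurt inference performance. ## Modifications 1. In the `_free_blocks_when_stop` method, additionally exclude prefill nodes (`splitwise_role == "prefill"`) from the cache block update, so that prefill nodes do not update the cache repeatedly and leave its state inconsistent. 2. After a decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call `update_cache_blocks` with `need_prefill_tokens` to update the cache block information, so that the decode node correctly sees the prefix cache that was already hit. Co-authored-by: kevin <chengyf112@gmail.com>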
…addlePaddle#7843 (PaddlePaddle#7845) * [Feature]console metrics log for pd disaggregation * [Feature]console metrics log for pd disaggregation fix test
…ePaddle#7881) (PaddlePaddle#7831) * Add inner benchmark metrics component * Add window_mode * remove temp scripts * fix ut * increase coverage lines
* Update _xpu_4cards_case_test.yml * Update _xpu_8cards_case_test.yml
Co-authored-by: kevin <chengyf112@gmail.com>
…e threashold for prefill instance (PaddlePaddle#7871)
…ePaddle#7688) (PaddlePaddle#7729) * support c8 decode attention * support c16 attention && backend * opt kernel * fix * opt larger batch * inplace out * fix input_batch && remove fast_math * fix xpu * fix bug * fix ci * opt and fix mtp * fix merge * clean code * fix merge * update * update test * fix test * fix test * opt buffer * fix conflict --------- Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
…dlePaddle#7883) (PaddlePaddle#7884) * opt mtp logprob * fix * fix test and log * fix bits * Adapt logprobs baseline update in test_ernie_21b_mtp_multistep.py --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
…ng CUDAGraph recapture(PaddlePaddle#7934) (PaddlePaddle#7933) * fix clear bug in rl * fix: use self.max_chunk_tokens instead of fd_config.get_max_chunk_tokens() for buffer recreation fd_config.get_max_chunk_tokens() without mm_max_tokens_per_item arg may return a smaller value than the actual initial buffer size when enable_mm and mm_max_tokens_per_item is None. Use self.max_chunk_tokens which is already computed during __init__ and consistent with first CUDAGraph capture.
…addle#7839) * PD send cache via storage & Refine swap_cache_layout op * skip messager * up * consider write cache error * fix ci * up
…ddlePaddle#7936) (PaddlePaddle#7917) * support fused noauxtc kernel on ep mode * fix unit test
…dle#7892) and Triton SamplerBackend (PaddlePaddle#7639) (PaddlePaddle#7910) * [CP][Feature] support new sampler backend with triton (PaddlePaddle#7639) * [Optimization] TopP=1.0 using _random_sample (PaddlePaddle#7892) * code check * add env FD_ENABLE_TOP_P_ONE_OPT control top_p=1 opt * defalut FD_ENABLE_TOP_P_ONE_OPT=0 * change FD_ENABLE_TOP_P_ONE_OPT=1 * fix mtp triton seed * change triton seed int64 * fix triton sampler * add seed for mtp triton sampler --------- Co-authored-by: Zero Rains <linjunlu@zerorains.top> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
…ddle#7923) (PaddlePaddle#7922) * fix accurate issue * fix acc issue in ep + tp mode --------- Co-authored-by: root <root@tjzj-inf-sci-k8s-bzz2-0271.tjzj.baidu.com>
…in accuracy (PaddlePaddle#7960) * Reset buffer size of R3 * refine code * R3 fix Eos bug * pre-commit * fix r3 ci and support dsa * refine code * refine code * reset ci dir * refine code * fix dsv3
* Reset buffer size of R3 * refine code * R3 fix Eos bug * pre-commit * fix r3 ci and support dsa * refine code * refine code * reset ci dir * refine code * fix dsv3 * fix ernie5 mm bug
…lePaddle#7951) (PaddlePaddle#7971) * Add GDR streaming weight update path * [RL] Unify GDR and IPC weight update
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-03 17:32:34
📋 Review summary
PR overview: Fixes an inference accuracy issue in MoE models and updates the CI build configuration alongside it (pins the PaddlePaddle wheel version, improves the container cleanup logic, migrates runners to the APPROVAL group).
Suggested split:
- PR 1: [CI] CI infrastructure updates: `.github/workflows/**`, `scripts/**`
- PR 2: [BugFix] MoE accuracy fix: `custom_ops/gpu_ops/moe/`, `fastdeploy/model_executor/layers/moe/`, `custom_ops/gpu_ops/grouped_topk_kernels.cu`
- PR 3: [Models] Model forward changes: `fastdeploy/model_executor/models/**`
- PR 4: [OP] Attention / Quantization kernel changes: `custom_ops/gpu_ops/append_attn/`, `custom_ops/gpu_ops/decode_unified_attention/`, `fastdeploy/model_executor/layers/quantization/`
Change scope: CI workflows, MoE kernels, Models, Attention backends, Quantization
Impact tags: [CI] [Models] [OP] [BugFix]
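Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | `custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu` | `topk_ids_numel` is declared as `int` but receives `topk_ids.numel()` (`int64_t`), so large batches risk int32 truncation |
| 🟡 Suggestion | `fastdeploy/model_executor/layers/moe/triton_moe_kernels.py` | The new kernel `fused_moe_kernel_bf16` adds `offs_token.to(tl.int64)` to fix the stride overflow, but the old kernel `fused_moe_kernel_paddle` has not received the same fix |
| ❓ Question | `.github/workflows/_accuracy_test.yml` | Removing `--ipc=host --pid=host` may affect IPC between distributed processes inside the container |
No blocking issues found. PR convention issues are reported in the section below.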
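As an illustration of the first suggestion in the table above, storing a 64-bit element count in a 32-bit signed `int` typically wraps around once the count exceeds 2^31 - 1. The sketch below mimics that narrowing in plain Python. The helper names `narrow_to_int32` and `checked_numel` and the sample sizes are hypothetical; none of them come from `tritonmoe_preprocess.cu`.

```python
INT32_MAX = 2**31 - 1


def narrow_to_int32(value: int) -> int:
    """Mimic assigning an int64_t to a 32-bit signed int (two's-complement wraparound)."""
    return (value + 2**31) % 2**32 - 2**31


def checked_numel(numel: int) -> int:
    """Hypothetical guard: refuse element counts that a 32-bit int cannot hold."""
    if not 0 <= numel <= INT32_MAX:
        raise OverflowError(f"numel={numel} does not fit in int32")
    return numel


# Hypothetical element counts, chosen only to show both sides of the limit.
small_numel = 4096 * 8
large_numel = 2**31 + 5

print(narrow_to_int32(small_numel))  # unchanged: 32768
print(narrow_to_int32(large_numel))  # wraps to a negative value: -2147483643
```

Two remedies are possible under these assumptions: keep the count in a 64-bit type end to end, or fail loudly with a guard such as `checked_numel` before narrowing.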
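Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | `check-bypass.yml` uses `per_page=100` without pagination (this PR touches 319 files, but only the first 100 were actually checked) | |
| F2 | Hard-coded internal bcebos wheel URL, a long-term maintenance risk | |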
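Finding F1 comes down to reading only the first page of a paginated listing. A minimal sketch of the usual remedy follows: keep requesting pages until one comes back shorter than `per_page`. The `fetch_page` callable and the in-memory `fake_fetch` are hypothetical stand-ins for whatever API call `check-bypass.yml` makes; this is not the workflow's actual code.

```python
from typing import Callable, List


def list_all_files(fetch_page: Callable[[int, int], List[str]],
                   per_page: int = 100) -> List[str]:
    """Collect every item by walking pages until a short (or empty) page appears."""
    files: List[str] = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        files.extend(batch)
        if len(batch) < per_page:
            break  # last page reached
        page += 1
    return files


# Hypothetical data source standing in for the real API: 319 changed files.
ALL_FILES = [f"file_{i}.py" for i in range(319)]


def fake_fetch(page: int, per_page: int) -> List[str]:
    start = (page - 1) * per_page
    return ALL_FILES[start:start + per_page]


first_page_only = fake_fetch(1, 100)
everything = list_all_files(fake_fetch)
print(len(first_page_only), len(everything))  # 100 319
```

With 319 files, a single request with `per_page=100` sees fewer than a third of the change set, which matches the gap the finding describes.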
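📝 PR convention check
The title "Fix moe acc" lacks an official tag, and every description section is empty (template placeholders only). This is unchanged since the previous review.
Suggested title (can be copied directly):
[BugFix] Fix MoE accuracy regression
Suggested PR description (can be copied directly)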
## Motivation
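Fix an accuracy issue in MoE models (an int32 overflow of `stride_cm * offs_token` in the Triton kernel caused abnormal accuracy), and update the CI build configuration alongside it to improve stability.
## Modifications
- Add the `fused_moe_kernel_bf16` Triton kernel, which promotes `offs_token`, `off_experts`, and `offs_bn` to `tl.int64` before index computation, fixing the stride multiplication overflow for large batches
- Pin the PaddlePaddle GPU wheel in CI to version 3.3.1.post20260420 (covering cu126/cu129/cu130/RL/XPU), replacing the previous nightly pre-release version
- Add a "Terminate and delete the container" step (`if: always()`) to all build/test workflows, so that containers are cleaned up even on abnormal exit
- Improve the workspace cleanup logic by adding a `find` force cleanup fallback, so that leftover directories no longer stall CI
- Add the `--no-same-owner` option to all `tar` commands to avoid permission problems during extraction
- Migrate the runners of several workflows from `ubuntu-latest` to the `APPROVAL` group for a more consistent runner environment
- Remove the `--privileged` flag from the docker build containers to improve CI security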
## Usage or Command
N/A
## Accuracy Tests
N/A (please add MoE accuracy comparison data from before and after the fix)
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
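- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The CI infrastructure improvements are reasonable. The MoE accuracy fix resolves the int32 stride overflow with a new Triton kernel, which is the right direction. Main concerns: the old kernel `fused_moe_kernel_paddle` has not received the int64 fix, `tritonmoe_preprocess.cu` carries an int truncation risk, and removing `--ipc=host` may affect distributed tests.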
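The stride overflow behind the accuracy fix can be reproduced with ordinary integer arithmetic. The sketch below is plain Python, not Triton: it compares an offset computed entirely in 32 bits with one computed after promoting the index to 64 bits, which is the effect the review attributes to `offs_token.to(tl.int64)`. The helper names and the stride and index values are hypothetical.

```python
def wrap_int32(value: int) -> int:
    """Reduce an exact integer to what a 32-bit signed register would hold."""
    return (value + 2**31) % 2**32 - 2**31


def row_offset_int32(stride_cm: int, offs_token: int) -> int:
    """Offset computed entirely in int32: the product wraps on overflow."""
    return wrap_int32(stride_cm * offs_token)


def row_offset_int64(stride_cm: int, offs_token: int) -> int:
    """Offset computed after promoting the index to int64: exact at these sizes."""
    return stride_cm * offs_token


# Hypothetical values: a row stride of 8192 elements and token row index 300000.
stride_cm = 8192
offs_token = 300_000

print(row_offset_int32(stride_cm, offs_token))  # -1837367296
print(row_offset_int64(stride_cm, offs_token))  # 2457600000
```

A wrapped offset points at the wrong memory location instead of raising an error, which is consistent with the symptom being degraded accuracy and not a crash.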
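The CI report is generated from the following code (refreshed every 30 minutes):
1. Required jobs: 9/10 passed
2. Failure details
🔴 Approval: approval required (confidence: high)
Root cause summary: This job requires manual approval; CI continues only after the approval is completed.
Suggested fix summary: Obtain manual approval. Follow the prompt in the Approval job log to invite the corresponding reviewer to approve, then re-trigger or wait for CI.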
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
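- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.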