feat(moe): add MoE inference and expert parallel support by qinyiqun · Pull Request #444 · InfiniTensor/InfiniLM

qinyiqun · 2026-06-18T02:17:45Z

Summary

Add a generic MoE layer stack under csrc/layers/moe.
Route Qwen3-MoE through the generic SparseMoeBlock, TopKRouter, FusedMoeExperts, and FusedMoE runner.
Add MoE EP dispatchers for local_allreduce and allgather_reducescatter.
Add a reserved deepep backend interface for future integration.
Move the old per-expert MoeMLP into csrc/layers/moe/legacy and keep DeepSeek-V2 on the legacy path.
Pass MoE EP config through Python args and model config instead of bench-owned environment variables.
Optimize rank-local safetensors loading for EP expert weights.
Support Qwen3/Qwen3Next GQA cases where num_key_value_heads < tp_size.

Motivation

Closes #

InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.

The current implementation focuses on correctness and data-flow alignment first:

TP-only MoE works through the standard dispatcher.
DP=1 EP uses local_allreduce as the preferred current path.
allgather_reducescatter is available as a correctness-oriented backend.
DeepEP is explicitly reserved but not implemented in this PR.

Type of Change

feat — new feature / new model
refactor — code restructuring without behavior change
perf — performance improvement (no behavioral change)
fix — bug fix
test — adding or fixing tests only
docs — documentation only
build / ci — build system or CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Please attach screenshots for the final tested commands.

Suggested coverage:

Qwen3-30B-A3B, TP=1, EP disabled
Qwen3-30B-A3B, TP=2, EP=2, local_allreduce
Qwen3-30B-A3B, TP=2, EP=2, allgather_reducescatter
Qwen3-235B-A22B, TP=8, EP=8, local_allreduce
Qwen3-8B-base non-MoE regression, TP=2, graph enabled
DeepSeek-V2-Lite loading/regression for legacy MoE path if applicable

Benchmark / Performance Impact

Initial measured examples on A100:

Qwen3-30B-A3B, TP=2/EP=2, local_allreduce, graph enabled:
- Prefill and decode are functional.
- Decode performance is currently limited by MoE communication and temporary fused MoE kernel quality.
Qwen3-235B-A22B, TP=8/EP=8, local_allreduce, graph enabled:
- Model loading and decode are functional.
- Nsys shows decode is dominated by communication, especially allreduce-heavy paths.

This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.

Notes for Reviewers

local_allreduce is the recommended current EP backend for DP=1.
allgather_reducescatter is correctness-oriented and expected to be slower.
deepep is intentionally a placeholder interface.
prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.
DeepSeek-V2 remains on layers/moe/legacy and is not migrated to the new fused Qwen3-MoE path.
Non-MoE models should show MoE EP backend: disabled.

Checklist

Title, Branch, and Commits

PR title follows Conventional Commits.
Branch name follows <type>/xxx-yyyy-zzzz.
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable.
No stray merge commits from main.
No fixup! / squash! / wip commits remain.
Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

Changes are scoped to MoE inference, EP config/loading, and required model compatibility.
No debug prints or temporary MoE logs are left behind.
Public API changes are intentional and reflected in Python/C++ callers.

C++ Specific

Changed files are formatted by scripts/format.py.
Project builds cleanly on NVIDIA.

Python Specific

Changed files are formatted by scripts/format.py.

Testing

Passed single request test, or reason for skipping is documented.
Passed offline performance test, or reason for skipping is documented.
Passed sanity test, or reason for skipping is documented.
Passed service test, or reason for skipping is documented.

pengcheng888 · 2026-06-26T06:16:24Z

    struct CompiledResult {
        InfinilmModel::Input input;
        Compiled compiled;
+        std::shared_ptr<InfinilmModel::Output> replay_output;


这个新增的replay_output变量，以及graph编译时新增和修改的代码。可以注释或解释一下么，不知道啥意思

已补充注释。这里的 replay_output 是 graph capture 时为输出保留的普通 Output handle；compiled 里保存的是 GraphTensor/graph 对象，replay 后需要通过这个 handle 拿回模型输出。这样 get_compiled 时可以直接返回可复用的 graph replay 结果。

这个不影响static 推理，已测试

pengcheng888 · 2026-06-26T06:24:54Z

        throw std::runtime_error(" Model object not found. ");
    }
-    return workers_.front()->state_dict_keys();
+    std::vector<std::string> keys;


这个写法，我看了好一会才看懂。
其是等价于下面的写法。先求set, 最后再赋值给key_vec.
`
std::unordered_setstd::string keys;
for (const auto& worker : workers_) {
const auto& worker_keys = worker->state_dict_keys();
keys.insert(worker_keys.begin(), worker_keys.end());
}

std::vectorstd::string keys_vec(keys.begin(), keys.end());
return keys_vec;

`

这里改的核心目的：合并多 rank/worker 的 state_dict_keys() 时去重，但保留稳定顺序。
背景是 InferEngine 里有多个 RankWorker，比如 TP=2 时，每个 worker 都有一份模型结构，因此 worker->state_dict_keys() 里很多 key 是重复的。如果 InferEngine::state_dict_keys() 只是把所有 worker 的 key 拼起来，Python 侧 check_parameters() 虽然最后会转 set，但返回列表会非常冗余，也不利于定位 missing/unexpected key。

pengcheng888 · 2026-06-26T06:27:17Z

            } else if (local_cmd == Command::LOAD_BATCH) {
                try {
-                    model_->load_parameters_no_sync(local_params);
+                    model_->load_parameters_no_sync(local_params, local_params_strict);


等价于这个写法么 model_->load_parameters_no_sync(local_params, strict);

是的，现在等价于直接调用 model_->load_parameters_no_sync(local_params, local_params_strict)。这里需要把 strict 继续传下去，否则 Python 侧传入的 non-strict load 对 MoE packed weight 不生效。

pengcheng888 · 2026-06-26T07:18:17Z

        self.parser.add_argument("--model", type=str, required=True)
        self.parser.add_argument("--device", type=str, default="cpu")
        self.parser.add_argument("--tp", "--tensor-parallel-size", type=int, default=1)
+        self.parser.add_argument("--dp", "--data-parallel-size", type=int, default=1)


提供测试命令

这个dp只有python中被使用，不会传递给c++么

测试命令可以用：

CUDA_VISIBLE_DEVICES=2,3 python examples/bench.py
--device=nvidia
--model=xxxxx
--enable-paged-attn
--attn=flash-attn
--tp=2
--ep=2
--moe-ep-backend=local_allreduce
--input-len=16,1024
--output-len=1024
--batch-size=1
--enable-graph

dp现在是预留口，我们没有支持dp功能，但是dp会跟moe有执行层的绑定，所以先留出来了

pengcheng888 · 2026-06-26T09:39:14Z

 namespace infinilm::models::qwen3_moe {

-class Qwen3MoeSparseMoeBlock : public infinicore::nn::Module {
+class Qwen3MoeSparseMoeBlock final : public infinilm::layers::moe::SparseMoeBlock {


继承后貌似啥也没干。这里直接 using Qwen3MoeSparseMoeBlock = public infinilm::layers::moe::SparseMoeBlock 可以么。

这里暂时保留 adapter class。原因是 TextDecoderLayer 构造 MLP block 使用的是 (config, layer_idx, device)，而通用 SparseMoeBlock 的构造顺序是 (config, device, layer_idx)。using alias 不能适配构造函数签名，所以这里保留一个很薄的 Qwen3Moe adapter，后续统一构造函数签名后可以删掉。

wooway777 · 2026-07-01T09:57:29Z

/retest

wooway777

参考：
[P2] csrc/engine/compiler/static_batching_compiler.cpp (line 32)
StaticBatchingCompiler::compile() 新增了 capture 前的 eager warmup，但 warmup 后直接 startGraphRecording()，没有像 PagedCompiler 一样调用 model_->reset_runtime_state()。如果模型里有 Marlin/AWQ/GPTQ 这类会留下 lock/workspace 状态的模块，static graph capture 会从 dirty runtime state 开始，可能导致 static graph 下量化模型输出不稳定或复用旧状态。建议在 line 33-35 之间补 model_->reset_runtime_state(); syncStream();，和 paged compiler 保持一致。
[P2] python/infinilm/moe_config.py (line 4)
deepep 被公开列为合法 --moe-ep-backend，但 csrc/layers/moe/ep/deepep_dispatcher.cpp (line 30) 的所有路径都会 throw “reserved but not implemented yet”。这会让用户通过配置校验、模型初始化也可能成功，然后第一轮 forward 才炸。既然 PR 明确说 DeepEP 是 reserved，建议 Python 配置层直接拒绝 deepep，或把 help 文案改成“not supported yet”并在配置阶段报错。
Open Question
csrc/layers/moe/runner/cuda_fused_moe_runner.cpp (line 190)
local_allreduce 路径依赖每个 rank 对非本地 expert 的 token 输出为 0，但 workspace.fused_moe_output 复用后没有显式清零。这里是否由 moe_fused_dense_ 保证完整覆盖/清零？如果不是，需要在 kernel 前清零，否则 allreduce 会把旧 buffer 内容加进去。

wooway777 · 2026-07-01T07:00:59Z

为什么改了这些？static不是本身也没支持图？

pengcheng888 · 2026-07-02T01:57:01Z

+                ordered_keys.push_back(key);
+            }
+        }
+    }


这部分代码，感觉可以简洁一点代码。

这个主要是为了规避掉多进程有n份权重做的稳定去重，应该不太能更简单了。

qinyiqun · 2026-07-02T02:21:13Z

参考： [P2] csrc/engine/compiler/static_batching_compiler.cpp (line 32) StaticBatchingCompiler::compile() 新增了 capture 前的 eager warmup，但 warmup 后直接 startGraphRecording()，没有像 PagedCompiler 一样调用 model_->reset_runtime_state()。如果模型里有 Marlin/AWQ/GPTQ 这类会留下 lock/workspace 状态的模块，static graph capture 会从 dirty runtime state 开始，可能导致 static graph 下量化模型输出不稳定或复用旧状态。建议在 line 33-35 之间补 model_->reset_runtime_state(); syncStream();，和 paged compiler 保持一致。 [P2] python/infinilm/moe_config.py (line 4) deepep 被公开列为合法 --moe-ep-backend，但 csrc/layers/moe/ep/deepep_dispatcher.cpp (line 30) 的所有路径都会 throw “reserved but not implemented yet”。这会让用户通过配置校验、模型初始化也可能成功，然后第一轮 forward 才炸。既然 PR 明确说 DeepEP 是 reserved，建议 Python 配置层直接拒绝 deepep，或把 help 文案改成“not supported yet”并在配置阶段报错。 Open Question csrc/layers/moe/runner/cuda_fused_moe_runner.cpp (line 190) local_allreduce 路径依赖每个 rank 对非本地 expert 的 token 输出为 0，但 workspace.fused_moe_output 复用后没有显式清零。这里是否由 moe_fused_dense_ 保证完整覆盖/清零？如果不是，需要在 kernel 前清零，否则 allreduce 会把旧 buffer 内容加进去。

现在只有marlin 需要lock space，marlin现在只有nv支持，nv默认有图，所以这个不需要支持static cache
当前 NVIDIA 路径下 moe_fused_dense_ 的输出覆盖没有暴露这个问题；但在部分国产平台上，确实可能需要显式清零来保证 local_allreduce 前非本地 expert 输出为 0。问题是现在推理层还没有统一 memory pool / low-cost zero buffer 机制，直接在这里加 set_zeros 会把国产平台 decode 性能打下来。这个更适合等国产平台相关 PR 和推理层 memory pool 接入后，再统一处理 workspace 清零/复用策略。
deepep 预留接口，如果不要可以删掉

PanZezhong1725 · 2026-07-02T02:23:59Z

+    bool contains_non_null(const std::string &key) const;
+
+    template <typename T>
+    T get_or_alias(const std::string &key, const std::string &alias, const T &default_value) const {


直接做一个接受vector<const std::string &> 接口吧，拿到第一个存在的值否则default。然后contains_non_null可以删了

PanZezhong1725 · 2026-07-02T02:31:00Z

-        const std::string &weight_load_mode = "async");
+        const std::string &weight_load_mode = "async",
+        const std::string &moe_ep_backend = "disabled",
+        size_t moe_ep_size = 1);


分布式相关的东西是不是应该在DistConfig里

PanZezhong1725 · 2026-07-02T02:37:09Z

        Compiled compiled;
+        // Graph capture stores a GraphTensor in compiled. Replay returns a
+        // normal output handle restored from the same graph output blob.
+        std::shared_ptr<InfinilmModel::Output> replay_output;


这么做会导致每个result独占一份output tensor

这块我建议先保留。CUDA graph replay 的 output 地址是在 capture 时固定下来的，所以每个 captured graph/shape 需要持有对应的 output handle，避免 allocator 把这块地址复用后 graph replay 写到错误位置。vLLM 和 SGLang 也是类似做法：按 batch descriptor / shape 保存 graph entry 和 output，replay 后直接返回对应 output。这里不是额外拷贝一份 logits，而是维护 graph output 的生命周期。后续如果要优化显存，应从 graph pool / weak-ref / allocator 生命周期管理入手。

pengcheng888 · 2026-07-02T02:40:59Z

+
+CudaFusedMoeRunner::CudaFusedMoeRunner(size_t num_local_experts,
+                                       size_t hidden_size,


函数名和类名为什么要以Cuda开头，这个命名可以避免么

感觉出现在infinilm中不好

这个倒无所谓，本来就是一个临时的，未来会被销毁的，只有cuda 能用的算子。

Add reusable MoE routing, dispatch, expert, and runner layers for sparse MoE models. Wire Qwen3/DeepSeek MoE implementations to the shared MoE path, expose EP backend configuration through DistConfig-backed bench/server/LLM entrypoints, and keep graph replay output handling compatible with MoE execution.

qinyiqun requested a review from a team June 18, 2026 02:17

qinyiqun force-pushed the moe branch 2 times, most recently from 4b3058a to f2d4861 Compare June 23, 2026 02:20

qinyiqun requested a review from pengcheng888 June 25, 2026 08:17