Skip to content

feat(moe): add MoE inference and expert parallel support#444

Open
qinyiqun wants to merge 1 commit into
InfiniTensor:mainfrom
qinyiqun:moe
Open

feat(moe): add MoE inference and expert parallel support#444
qinyiqun wants to merge 1 commit into
InfiniTensor:mainfrom
qinyiqun:moe

Conversation

@qinyiqun

Copy link
Copy Markdown
Contributor

Summary

  • Add a generic MoE layer stack under csrc/layers/moe.
  • Route Qwen3-MoE through the generic SparseMoeBlock, TopKRouter, FusedMoeExperts, and FusedMoE runner.
  • Add MoE EP dispatchers for local_allreduce and allgather_reducescatter.
  • Add a reserved deepep backend interface for future integration.
  • Move the old per-expert MoeMLP into csrc/layers/moe/legacy and keep DeepSeek-V2 on the legacy path.
  • Pass MoE EP config through Python args and model config instead of bench-owned environment variables.
  • Optimize rank-local safetensors loading for EP expert weights.
  • Support Qwen3/Qwen3Next GQA cases where num_key_value_heads < tp_size.

Motivation

Closes #

InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.

The current implementation focuses on correctness and data-flow alignment first:

  • TP-only MoE works through the standard dispatcher.
  • DP=1 EP uses local_allreduce as the preferred current path.
  • allgather_reducescatter is available as a correctness-oriented backend.
  • DeepEP is explicitly reserved but not implemented in this PR.

Type of Change

  • feat — new feature / new model
  • refactor — code restructuring without behavior change
  • perf — performance improvement (no behavioral change)
  • fix — bug fix
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Please attach screenshots for the final tested commands.

Suggested coverage:

  • Qwen3-30B-A3B, TP=1, EP disabled
  • Qwen3-30B-A3B, TP=2, EP=2, local_allreduce
  • Qwen3-30B-A3B, TP=2, EP=2, allgather_reducescatter
  • Qwen3-235B-A22B, TP=8, EP=8, local_allreduce
  • Qwen3-8B-base non-MoE regression, TP=2, graph enabled
  • DeepSeek-V2-Lite loading/regression for legacy MoE path if applicable

Benchmark / Performance Impact

Initial measured examples on A100:

  • Qwen3-30B-A3B, TP=2/EP=2, local_allreduce, graph enabled:
    • Prefill and decode are functional.
    • Decode performance is currently limited by MoE communication and temporary fused MoE kernel quality.
  • Qwen3-235B-A22B, TP=8/EP=8, local_allreduce, graph enabled:
    • Model loading and decode are functional.
    • Nsys shows decode is dominated by communication, especially allreduce-heavy paths.

This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.

Notes for Reviewers

  • local_allreduce is the recommended current EP backend for DP=1.
  • allgather_reducescatter is correctness-oriented and expected to be slower.
  • deepep is intentionally a placeholder interface.
  • prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.
  • DeepSeek-V2 remains on layers/moe/legacy and is not migrated to the new fused Qwen3-MoE path.
  • Non-MoE models should show MoE EP backend: disabled.

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits.
  • Branch name follows <type>/xxx-yyyy-zzzz.
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable.
  • No stray merge commits from main.
  • No fixup! / squash! / wip commits remain.
  • Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

  • Changes are scoped to MoE inference, EP config/loading, and required model compatibility.
  • No debug prints or temporary MoE logs are left behind.
  • Public API changes are intentional and reflected in Python/C++ callers.

C++ Specific

  • Changed files are formatted by scripts/format.py.
  • Project builds cleanly on NVIDIA.

Python Specific

  • Changed files are formatted by scripts/format.py.

Testing

  • Passed single request test, or reason for skipping is documented.
  • Passed offline performance test, or reason for skipping is documented.
  • Passed sanity test, or reason for skipping is documented.
  • Passed service test, or reason for skipping is documented.

@qinyiqun qinyiqun requested a review from a team June 18, 2026 02:17
@qinyiqun qinyiqun force-pushed the moe branch 2 times, most recently from 4b3058a to f2d4861 Compare June 23, 2026 02:20
@qinyiqun qinyiqun requested a review from pengcheng888 June 25, 2026 08:17
struct CompiledResult {
InfinilmModel::Input input;
Compiled compiled;
std::shared_ptr<InfinilmModel::Output> replay_output;

@pengcheng888 pengcheng888 Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个新增的replay_output变量,以及graph编译时新增和修改的代码。可以注释或解释一下么,不知道啥意思

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已补充注释。这里的 replay_output 是 graph capture 时为输出保留的普通 Output handle;compiled 里保存的是 GraphTensor/graph 对象,replay 后需要通过这个 handle 拿回模型输出。这样 get_compiled 时可以直接返回可复用的 graph replay 结果。

这个不影响static 推理,已测试

Comment thread csrc/engine/infer_engine.cpp Outdated
throw std::runtime_error(" Model object not found. ");
}
return workers_.front()->state_dict_keys();
std::vector<std::string> keys;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个写法,我看了好一会才看懂。
其是等价于下面的写法。先求set, 最后再赋值给key_vec.
`
std::unordered_setstd::string keys;
for (const auto& worker : workers_) {
const auto& worker_keys = worker->state_dict_keys();
keys.insert(worker_keys.begin(), worker_keys.end());
}

std::vectorstd::string keys_vec(keys.begin(), keys.end());
return keys_vec;

`

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里改的核心目的:合并多 rank/worker 的 state_dict_keys() 时去重,但保留稳定顺序。
背景是 InferEngine 里有多个 RankWorker,比如 TP=2 时,每个 worker 都有一份模型结构,因此 worker->state_dict_keys() 里很多 key 是重复的。如果 InferEngine::state_dict_keys() 只是把所有 worker 的 key 拼起来,Python 侧 check_parameters() 虽然最后会转 set,但返回列表会非常冗余,也不利于定位 missing/unexpected key。

} else if (local_cmd == Command::LOAD_BATCH) {
try {
model_->load_parameters_no_sync(local_params);
model_->load_parameters_no_sync(local_params, local_params_strict);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

等价于这个写法么 model_->load_parameters_no_sync(local_params, strict);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是的,现在等价于直接调用 model_->load_parameters_no_sync(local_params, local_params_strict)。这里需要把 strict 继续传下去,否则 Python 侧传入的 non-strict load 对 MoE packed weight 不生效。

self.parser.add_argument("--model", type=str, required=True)
self.parser.add_argument("--device", type=str, default="cpu")
self.parser.add_argument("--tp", "--tensor-parallel-size", type=int, default=1)
self.parser.add_argument("--dp", "--data-parallel-size", type=int, default=1)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

提供测试命令

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个dp只有python中被使用,不会传递给c++么

@qinyiqun qinyiqun Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

测试命令可以用:

CUDA_VISIBLE_DEVICES=2,3 python examples/bench.py
--device=nvidia
--model=xxxxx
--enable-paged-attn
--attn=flash-attn
--tp=2
--ep=2
--moe-ep-backend=local_allreduce
--input-len=16,1024
--output-len=1024
--batch-size=1
--enable-graph

dp现在是预留口,我们没有支持dp功能,但是dp会跟moe有执行层的绑定,所以先留出来了

Comment thread python/infinilm/infer_engine.py Outdated
Comment thread python/infinilm/modeling_utils.py
Comment thread examples/bench.py Outdated
Comment thread python/infinilm/server/inference_server.py Outdated
Comment thread python/infinilm/infer_engine.py
namespace infinilm::models::qwen3_moe {

class Qwen3MoeSparseMoeBlock : public infinicore::nn::Module {
class Qwen3MoeSparseMoeBlock final : public infinilm::layers::moe::SparseMoeBlock {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

继承后貌似啥也没干。 这里直接 using Qwen3MoeSparseMoeBlock = public infinilm::layers::moe::SparseMoeBlock 可以么。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里暂时保留 adapter class。原因是 TextDecoderLayer 构造 MLP block 使用的是 (config, layer_idx, device),而通用 SparseMoeBlock 的构造顺序是 (config, device, layer_idx)。using alias 不能适配构造函数签名,所以这里保留一个很薄的 Qwen3Moe adapter,后续统一构造函数签名后可以删掉。

@wooway777

Copy link
Copy Markdown
Collaborator

/retest

@wooway777 wooway777 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

参考:
[P2] csrc/engine/compiler/static_batching_compiler.cpp (line 32)
StaticBatchingCompiler::compile() 新增了 capture 前的 eager warmup,但 warmup 后直接 startGraphRecording(),没有像 PagedCompiler 一样调用 model_->reset_runtime_state()。如果模型里有 Marlin/AWQ/GPTQ 这类会留下 lock/workspace 状态的模块,static graph capture 会从 dirty runtime state 开始,可能导致 static graph 下量化模型输出不稳定或复用旧状态。建议在 line 33-35 之间补 model_->reset_runtime_state(); syncStream();,和 paged compiler 保持一致。
[P2] python/infinilm/moe_config.py (line 4)
deepep 被公开列为合法 --moe-ep-backend,但 csrc/layers/moe/ep/deepep_dispatcher.cpp (line 30) 的所有路径都会 throw “reserved but not implemented yet”。这会让用户通过配置校验、模型初始化也可能成功,然后第一轮 forward 才炸。既然 PR 明确说 DeepEP 是 reserved,建议 Python 配置层直接拒绝 deepep,或把 help 文案改成“not supported yet”并在配置阶段报错。
Open Question
csrc/layers/moe/runner/cuda_fused_moe_runner.cpp (line 190)
local_allreduce 路径依赖每个 rank 对非本地 expert 的 token 输出为 0,但 workspace.fused_moe_output 复用后没有显式清零。这里是否由 moe_fused_dense_ 保证完整覆盖/清零?如果不是,需要在 kernel 前清零,否则 allreduce 会把旧 buffer 内容加进去。

Comment thread csrc/engine/compiler/paged_compiler.cpp Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么改了这些?static不是本身也没支持图?

ordered_keys.push_back(key);
}
}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这部分代码,感觉可以简洁一点代码。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个主要是为了规避掉多进程有n份权重做的稳定去重,应该不太能更简单了。

@qinyiqun

qinyiqun commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

参考: [P2] csrc/engine/compiler/static_batching_compiler.cpp (line 32) StaticBatchingCompiler::compile() 新增了 capture 前的 eager warmup,但 warmup 后直接 startGraphRecording(),没有像 PagedCompiler 一样调用 model_->reset_runtime_state()。如果模型里有 Marlin/AWQ/GPTQ 这类会留下 lock/workspace 状态的模块,static graph capture 会从 dirty runtime state 开始,可能导致 static graph 下量化模型输出不稳定或复用旧状态。建议在 line 33-35 之间补 model_->reset_runtime_state(); syncStream();,和 paged compiler 保持一致。 [P2] python/infinilm/moe_config.py (line 4) deepep 被公开列为合法 --moe-ep-backend,但 csrc/layers/moe/ep/deepep_dispatcher.cpp (line 30) 的所有路径都会 throw “reserved but not implemented yet”。这会让用户通过配置校验、模型初始化也可能成功,然后第一轮 forward 才炸。既然 PR 明确说 DeepEP 是 reserved,建议 Python 配置层直接拒绝 deepep,或把 help 文案改成“not supported yet”并在配置阶段报错。 Open Question csrc/layers/moe/runner/cuda_fused_moe_runner.cpp (line 190) local_allreduce 路径依赖每个 rank 对非本地 expert 的 token 输出为 0,但 workspace.fused_moe_output 复用后没有显式清零。这里是否由 moe_fused_dense_ 保证完整覆盖/清零?如果不是,需要在 kernel 前清零,否则 allreduce 会把旧 buffer 内容加进去。

  1. 现在只有marlin 需要lock space,marlin现在只有nv支持,nv默认有图,所以这个不需要支持static cache
  2. 当前 NVIDIA 路径下 moe_fused_dense_ 的输出覆盖没有暴露这个问题;但在部分国产平台上,确实可能需要显式清零来保证 local_allreduce 前非本地 expert 输出为 0。问题是现在推理层还没有统一 memory pool / low-cost zero buffer 机制,直接在这里加 set_zeros 会把国产平台 decode 性能打下来。这个更适合等国产平台相关 PR 和推理层 memory pool 接入后,再统一处理 workspace 清零/复用策略。
  3. deepep 预留接口,如果不要可以删掉

Comment thread csrc/config/model_config.hpp Outdated
bool contains_non_null(const std::string &key) const;

template <typename T>
T get_or_alias(const std::string &key, const std::string &alias, const T &default_value) const {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

直接做一个接受vector<const std::string &> 接口吧,拿到第一个存在的值否则default。然后contains_non_null可以删了

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment thread csrc/engine/infer_engine.hpp Outdated
const std::string &weight_load_mode = "async");
const std::string &weight_load_mode = "async",
const std::string &moe_ep_backend = "disabled",
size_t moe_ep_size = 1);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

分布式相关的东西是不是应该在DistConfig里

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment thread csrc/engine/compiler/paged_compiler.hpp Outdated
Compiled compiled;
// Graph capture stores a GraphTensor in compiled. Replay returns a
// normal output handle restored from the same graph output blob.
std::shared_ptr<InfinilmModel::Output> replay_output;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这么做会导致每个result独占一份output tensor

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这块我建议先保留。CUDA graph replay 的 output 地址是在 capture 时固定下来的,所以每个 captured graph/shape 需要持有对应的 output handle,避免 allocator 把这块地址复用后 graph replay 写到错误位置。vLLM 和 SGLang 也是类似做法:按 batch descriptor / shape 保存 graph entry 和 output,replay 后直接返回对应 output。这里不是额外拷贝一份 logits,而是维护 graph output 的生命周期。后续如果要优化显存,应从 graph pool / weak-ref / allocator 生命周期管理入手。

Comment on lines +13 to +15

CudaFusedMoeRunner::CudaFusedMoeRunner(size_t num_local_experts,
size_t hidden_size,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

函数名和类名为什么要以Cuda开头,这个命名可以避免么

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感觉出现在infinilm中不好

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个倒无所谓,本来就是一个临时的,未来会被销毁的,只有cuda 能用的算子。

Add reusable MoE routing, dispatch, expert, and runner layers for sparse MoE models.

Wire Qwen3/DeepSeek MoE implementations to the shared MoE path, expose EP backend configuration through DistConfig-backed bench/server/LLM entrypoints, and keep graph replay output handling compatible with MoE execution.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants