feat(moe): add MoE inference and expert parallel support#444
Conversation
4b3058a to
f2d4861
Compare
| struct CompiledResult { | ||
| InfinilmModel::Input input; | ||
| Compiled compiled; | ||
| std::shared_ptr<InfinilmModel::Output> replay_output; |
There was a problem hiding this comment.
这个新增的replay_output变量,以及graph编译时新增和修改的代码。可以注释或解释一下么,不知道啥意思
There was a problem hiding this comment.
已补充注释。这里的 replay_output 是 graph capture 时为输出保留的普通 Output handle;compiled 里保存的是 GraphTensor/graph 对象,replay 后需要通过这个 handle 拿回模型输出。这样 get_compiled 时可以直接返回可复用的 graph replay 结果。
这个不影响static 推理,已测试
| throw std::runtime_error(" Model object not found. "); | ||
| } | ||
| return workers_.front()->state_dict_keys(); | ||
| std::vector<std::string> keys; |
There was a problem hiding this comment.
这个写法,我看了好一会才看懂。
其是等价于下面的写法。先求set, 最后再赋值给key_vec.
`
std::unordered_setstd::string keys;
for (const auto& worker : workers_) {
const auto& worker_keys = worker->state_dict_keys();
keys.insert(worker_keys.begin(), worker_keys.end());
}
std::vectorstd::string keys_vec(keys.begin(), keys.end());
return keys_vec;
`
There was a problem hiding this comment.
这里改的核心目的:合并多 rank/worker 的 state_dict_keys() 时去重,但保留稳定顺序。
背景是 InferEngine 里有多个 RankWorker,比如 TP=2 时,每个 worker 都有一份模型结构,因此 worker->state_dict_keys() 里很多 key 是重复的。如果 InferEngine::state_dict_keys() 只是把所有 worker 的 key 拼起来,Python 侧 check_parameters() 虽然最后会转 set,但返回列表会非常冗余,也不利于定位 missing/unexpected key。
| } else if (local_cmd == Command::LOAD_BATCH) { | ||
| try { | ||
| model_->load_parameters_no_sync(local_params); | ||
| model_->load_parameters_no_sync(local_params, local_params_strict); |
There was a problem hiding this comment.
等价于这个写法么 model_->load_parameters_no_sync(local_params, strict);
There was a problem hiding this comment.
是的,现在等价于直接调用 model_->load_parameters_no_sync(local_params, local_params_strict)。这里需要把 strict 继续传下去,否则 Python 侧传入的 non-strict load 对 MoE packed weight 不生效。
| self.parser.add_argument("--model", type=str, required=True) | ||
| self.parser.add_argument("--device", type=str, default="cpu") | ||
| self.parser.add_argument("--tp", "--tensor-parallel-size", type=int, default=1) | ||
| self.parser.add_argument("--dp", "--data-parallel-size", type=int, default=1) |
There was a problem hiding this comment.
这个dp只有python中被使用,不会传递给c++么
There was a problem hiding this comment.
测试命令可以用:
CUDA_VISIBLE_DEVICES=2,3 python examples/bench.py
--device=nvidia
--model=xxxxx
--enable-paged-attn
--attn=flash-attn
--tp=2
--ep=2
--moe-ep-backend=local_allreduce
--input-len=16,1024
--output-len=1024
--batch-size=1
--enable-graph
dp现在是预留口,我们没有支持dp功能,但是dp会跟moe有执行层的绑定,所以先留出来了
| namespace infinilm::models::qwen3_moe { | ||
|
|
||
| class Qwen3MoeSparseMoeBlock : public infinicore::nn::Module { | ||
| class Qwen3MoeSparseMoeBlock final : public infinilm::layers::moe::SparseMoeBlock { |
There was a problem hiding this comment.
继承后貌似啥也没干。 这里直接 using Qwen3MoeSparseMoeBlock = public infinilm::layers::moe::SparseMoeBlock 可以么。
There was a problem hiding this comment.
这里暂时保留 adapter class。原因是 TextDecoderLayer 构造 MLP block 使用的是 (config, layer_idx, device),而通用 SparseMoeBlock 的构造顺序是 (config, device, layer_idx)。using alias 不能适配构造函数签名,所以这里保留一个很薄的 Qwen3Moe adapter,后续统一构造函数签名后可以删掉。
|
/retest |
wooway777
left a comment
There was a problem hiding this comment.
参考:
[P2] csrc/engine/compiler/static_batching_compiler.cpp (line 32)
StaticBatchingCompiler::compile() 新增了 capture 前的 eager warmup,但 warmup 后直接 startGraphRecording(),没有像 PagedCompiler 一样调用 model_->reset_runtime_state()。如果模型里有 Marlin/AWQ/GPTQ 这类会留下 lock/workspace 状态的模块,static graph capture 会从 dirty runtime state 开始,可能导致 static graph 下量化模型输出不稳定或复用旧状态。建议在 line 33-35 之间补 model_->reset_runtime_state(); syncStream();,和 paged compiler 保持一致。
[P2] python/infinilm/moe_config.py (line 4)
deepep 被公开列为合法 --moe-ep-backend,但 csrc/layers/moe/ep/deepep_dispatcher.cpp (line 30) 的所有路径都会 throw “reserved but not implemented yet”。这会让用户通过配置校验、模型初始化也可能成功,然后第一轮 forward 才炸。既然 PR 明确说 DeepEP 是 reserved,建议 Python 配置层直接拒绝 deepep,或把 help 文案改成“not supported yet”并在配置阶段报错。
Open Question
csrc/layers/moe/runner/cuda_fused_moe_runner.cpp (line 190)
local_allreduce 路径依赖每个 rank 对非本地 expert 的 token 输出为 0,但 workspace.fused_moe_output 复用后没有显式清零。这里是否由 moe_fused_dense_ 保证完整覆盖/清零?如果不是,需要在 kernel 前清零,否则 allreduce 会把旧 buffer 内容加进去。
There was a problem hiding this comment.
为什么改了这些?static不是本身也没支持图?
| ordered_keys.push_back(key); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
这个主要是为了规避掉多进程有n份权重做的稳定去重,应该不太能更简单了。
|
| bool contains_non_null(const std::string &key) const; | ||
|
|
||
| template <typename T> | ||
| T get_or_alias(const std::string &key, const std::string &alias, const T &default_value) const { |
There was a problem hiding this comment.
直接做一个接受vector<const std::string &> 接口吧,拿到第一个存在的值否则default。然后contains_non_null可以删了
| const std::string &weight_load_mode = "async"); | ||
| const std::string &weight_load_mode = "async", | ||
| const std::string &moe_ep_backend = "disabled", | ||
| size_t moe_ep_size = 1); |
There was a problem hiding this comment.
分布式相关的东西是不是应该在DistConfig里
| Compiled compiled; | ||
| // Graph capture stores a GraphTensor in compiled. Replay returns a | ||
| // normal output handle restored from the same graph output blob. | ||
| std::shared_ptr<InfinilmModel::Output> replay_output; |
There was a problem hiding this comment.
这么做会导致每个result独占一份output tensor
There was a problem hiding this comment.
这块我建议先保留。CUDA graph replay 的 output 地址是在 capture 时固定下来的,所以每个 captured graph/shape 需要持有对应的 output handle,避免 allocator 把这块地址复用后 graph replay 写到错误位置。vLLM 和 SGLang 也是类似做法:按 batch descriptor / shape 保存 graph entry 和 output,replay 后直接返回对应 output。这里不是额外拷贝一份 logits,而是维护 graph output 的生命周期。后续如果要优化显存,应从 graph pool / weak-ref / allocator 生命周期管理入手。
|
|
||
| CudaFusedMoeRunner::CudaFusedMoeRunner(size_t num_local_experts, | ||
| size_t hidden_size, |
There was a problem hiding this comment.
函数名和类名为什么要以Cuda开头,这个命名可以避免么
There was a problem hiding this comment.
这个倒无所谓,本来就是一个临时的,未来会被销毁的,只有cuda 能用的算子。
2c5a844 to
938f5d4
Compare
Add reusable MoE routing, dispatch, expert, and runner layers for sparse MoE models. Wire Qwen3/DeepSeek MoE implementations to the shared MoE path, expose EP backend configuration through DistConfig-backed bench/server/LLM entrypoints, and keep graph replay output handling compatible with MoE execution.
Summary
csrc/layers/moe.SparseMoeBlock,TopKRouter,FusedMoeExperts, andFusedMoErunner.local_allreduceandallgather_reducescatter.deepepbackend interface for future integration.MoeMLPintocsrc/layers/moe/legacyand keep DeepSeek-V2 on the legacy path.num_key_value_heads < tp_size.Motivation
Closes #
InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.
The current implementation focuses on correctness and data-flow alignment first:
local_allreduceas the preferred current path.allgather_reducescatteris available as a correctness-oriented backend.Type of Change
feat— new feature / new modelrefactor— code restructuring without behavior changeperf— performance improvement (no behavioral change)fix— bug fixtest— adding or fixing tests onlydocs— documentation onlybuild/ci— build system or CI configurationchore— tooling, formatting, or other non-code changesTest Results of Involved Models on Supported Platforms (Please attach screenshots)
Please attach screenshots for the final tested commands.
Suggested coverage:
local_allreduceallgather_reducescatterlocal_allreduceBenchmark / Performance Impact
Initial measured examples on A100:
local_allreduce, graph enabled:local_allreduce, graph enabled:This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.
Notes for Reviewers
local_allreduceis the recommended current EP backend for DP=1.allgather_reducescatteris correctness-oriented and expected to be slower.deepepis intentionally a placeholder interface.prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.layers/moe/legacyand is not migrated to the new fused Qwen3-MoE path.MoE EP backend: disabled.Checklist
Title, Branch, and Commits
<type>/xxx-yyyy-zzzz.main.fixup!/squash!/wipcommits remain.Scope and Design
C++ Specific
scripts/format.py.Python Specific
scripts/format.py.Testing