[New Model] Add MiniMind-Omni model support by xRay2016 · Pull Request #3796 · vllm-project/vllm-omni

xRay2016 · 2026-05-21T13:43:25Z


Purpose

Add initial MiniMind-Omni support, including the three-stage pipeline:

Thinker: text/multimodal understanding and text generation
Talker: thinker hidden states to Mimi codec tokens
Code2Wav: Mimi codec tokens to waveform audio

This is a draft PR. The focused model and weight-loading paths are working locally, while full end-to-end Omni(...) inference is still being debugged.

Current Progress

Implemented so far:

Added MiniMind-Omni config classes and AutoConfig registration.
Added MiniMindOmniForConditionalGeneration stage wrapper.
Added thinker, talker, and code2wav model implementations.
Added MiniMind pipeline topology: thinker -> talker -> code2wav.
Added stage input processors:
- thinker2talker
- talker2code2wav
Added model registry entries for MiniMind-Omni architectures.
Added pipeline registry entry for model_type="minimind-o".
Added local tests for:
- real thinker checkpoint weight loading
- real talker checkpoint weight loading, dense and MoE
- real Mimi Code2Wav decode
- Code2Wav startup through MiniMindOmniForConditionalGeneration

Test Plan

Following docs/contributing/ci/tests_style.md, tests are organized by scope and placed next to the related source modules when possible.

E2E Correctness

Run the same fixed prompt set on:
- Native MiniMind-Omni inference
- vLLM-Omni MiniMind-Omni pipeline: thinker -> talker -> code2wav
Cover text-only and text-to-audio cases.
Compare outputs:
- Text: manual/keyword check first, then optional semantic similarity
- Audio: verify waveform is non-empty, finite, correct sample rate, and playable
- Optional: ASR the generated audio and compare transcript with native output
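
The audio checks in the list above could be automated along these lines. This is a minimal sketch, assuming the decoded waveform is available as a flat sequence of float samples and that the caller knows the expected sample rate; `check_waveform` is a hypothetical helper, not part of vLLM-Omni, and "playable" still needs a manual listen or a separate decoder check.

```python
import math
from typing import Sequence


def check_waveform(samples: Sequence[float], sample_rate: int, expected_rate: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the basic checks passed."""
    problems = []
    # Non-empty: a zero-length waveform usually means the codec stage produced nothing.
    if len(samples) == 0:
        problems.append("waveform is empty")
    # Finite: NaN or Inf samples indicate a numerical failure upstream.
    if any(not math.isfinite(s) for s in samples):
        problems.append("waveform contains NaN or Inf")
    # Sample rate: compare against the value the caller expects from the codec.
    if sample_rate != expected_rate:
        problems.append(f"sample rate {sample_rate} != expected {expected_rate}")
    return problems
```

Running the same helper on the native output and the vLLM-Omni output for each prompt would give a like-for-like pass/fail table before any listening test.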

E2E Performance

Measure the same prompt set on native MiniMind-Omni and vLLM-Omni.

Metrics:

End-to-end latency
Time to first token / first audio when available
Peak GPU memory
Throughput for single request and small batch
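
One way to collect the latency and throughput numbers above is a small timing harness. The sketch below assumes each backend can be wrapped in a callable that takes one prompt and blocks until the full response is ready; `measure_latency` and `run_once` are hypothetical names. It does not cover time to first token or peak GPU memory.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_latency(run_once: Callable[[str], object], prompts: Iterable[str]) -> dict:
    """Time each prompt end to end and summarize the results."""
    latencies = []
    start_all = time.perf_counter()
    for prompt in prompts:
        start = time.perf_counter()
        run_once(prompt)  # blocks until the backend returns its full output
        latencies.append(time.perf_counter() - start)
    wall = time.perf_counter() - start_all
    return {
        "requests": len(latencies),
        "mean_latency_s": statistics.mean(latencies) if latencies else 0.0,
        "max_latency_s": max(latencies, default=0.0),
        # Sequential requests per second over the whole run.
        "throughput_rps": len(latencies) / wall if wall > 0 else 0.0,
    }
```

Passing the same prompt list to a native-inference wrapper and a vLLM-Omni wrapper keeps the two measurements comparable. When PyTorch is in use, peak GPU memory could be read separately via `torch.cuda.max_memory_allocated()`.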

Test Result

TODO



linyueqian

Thanks @xRay2016. I ran a hands-on end-to-end evaluation with jingyaogong/minimind-3o (3-stage thinker, talker, code2wav). The architecture is faithful to the official MiniMind-O: QK-norm transformer, neox-style RoPE, tied embeddings, and a real transformers.MimiModel Code2Wav matching eval_omni.py. However, it does not yet complete an end-to-end run. 7 blocking issues are inline below.

One more that is not tied to a single line: the core thinker forward fails CUDA-graph capture with RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture. This recurs even with multimodal disabled, so a CPU tensor on the forward path needs to be pinned or kept on device. enforce_eager=True is only a temporary workaround.

Environment note for reproducibility: the test box needed VLLM_USE_FLASHINFER_SAMPLER=0 (no nvcc for the flashinfer sampler JIT). That is environmental, not a PR issue. Full trace logs available on request.

linyueqian · 2026-05-21T21:04:23Z

        from vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts import (
            Qwen3TTSConfig,
        )
+        from vllm_omni.transformers_utils.configs.voxcpm import VoxCPMConfig


🔴 [blocking] vllm_omni/transformers_utils/configs/voxcpm.py does not exist, so this import raises ImportError, which silently aborts _register_omni_hf_configs(), and minimind-o is never registered. Lines 57-58 also reference CosyVoice3Config / OmniVoiceConfig, which are never imported. This looks like contamination from an unrelated branch; the block should add only minimind-o.

This was accidentally brought in during the rebase from main and is unrelated to the MiniMind work in this PR.

I’ll trim the block back so it only registers minimind-o and does not include those unrelated imports/references.
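
The failure mode described in this thread, where one bad import inside a shared registration block prevents every later entry from registering, can be illustrated with a per-entry guard. This is only a sketch of one possible hardening, not what the PR or vLLM-Omni does; `register_configs`, `specs`, and `registry` are hypothetical names.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def register_configs(specs: dict[str, tuple[str, str]], registry: dict[str, type]) -> list[str]:
    """Import each config class on its own so one bad entry cannot block the rest.

    specs maps model_type -> (module path, class name).
    Returns the model types that could not be registered.
    """
    failed = []
    for model_type, (module_path, class_name) in specs.items():
        try:
            module = importlib.import_module(module_path)
            registry[model_type] = getattr(module, class_name)
        except (ImportError, AttributeError) as exc:
            # Log loudly instead of aborting the whole registration pass.
            logger.warning("skipping %s: %s", model_type, exc)
            failed.append(model_type)
    return failed
```

Trimming the block to the single minimind-o entry, as agreed above, removes the immediate problem; the guard only matters if several unrelated configs share one registration function.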

linyueqian · 2026-05-21T21:04:23Z

+    return bridge[-expected_len:].detach().to(torch.float32)
+
+
+def thinker2talker(


🔴 [blocking] thinker2talker / talker2code2wav use a stale signature (stage_list, engine_input_source, ...) and read stage_list[id].engine_outputs. Current vLLM-Omni calls stage processors as (source_outputs, prompt, ...) for the non-chunk path, or (transfer_manager, pooling_output, request, is_finished) for the chunk-transfer path, which is the default. On the default path the engine hangs, looping on the error thinker2talker() got an unexpected keyword argument 'transfer_manager'. Both processors need to be rewritten against the current API; stage_input_processors/qwen3_omni.py and its *_async_chunk variants are a good reference.
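
A signature mismatch like this can be caught in a unit test before the engine ever starts. The sketch below uses only `inspect` from the standard library; the parameter names are copied from the review comment and may not match the real API exactly, and both stand-in processors are hypothetical.

```python
import inspect
from typing import Callable

# Parameter names taken from the review comment above; the real API may differ.
CHUNK_PATH_PARAMS = ("transfer_manager", "pooling_output", "request", "is_finished")


def accepts_keywords(fn: Callable, names: tuple[str, ...]) -> bool:
    """Return True if fn can be called with every name in `names` as a keyword argument."""
    params = inspect.signature(fn).parameters
    # A **kwargs parameter accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    keyword_ok = (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    return all(n in params and params[n].kind in keyword_ok for n in names)


def stale_processor(stage_list, engine_input_source, *rest):
    """Stand-in with the old-style signature described in the review."""


def chunk_processor(transfer_manager, pooling_output, request, is_finished):
    """Stand-in with the chunk-transfer signature described in the review."""
```

Here `accepts_keywords(stale_processor, CHUNK_PATH_PARAMS)` is false and the same call on `chunk_processor` is true, so a test asserting this for every registered processor would fail fast instead of hanging the engine.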

linyueqian · 2026-05-21T21:04:23Z

+    path = resolve_model_dir(path, "SenseVoice encoder")
+
+    try:
+        from funasr import AutoModel


🔴 [blocking] funasr (and librosa, soundfile) are used but declared in no requirements file. Because the thinker advertises unbounded audio support, even a text-only request invokes SenseVoice during dummy profiling, so the stage cannot start without funasr (ModuleNotFoundError). Please declare them (the official pins funasr==1.3.1, librosa==0.11.0, soundfile==0.13.1) and consider making this import lazy so text-only serving does not hard-require it.
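
The lazy-import suggestion could look roughly like this. It is a minimal sketch using `importlib` from the standard library; `require_optional` is a hypothetical helper name.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use, with an actionable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{feature} requires the optional package '{module_name}'; "
            "install it to enable this feature."
        ) from exc
```

Calling `require_optional("funasr", "SenseVoice audio encoder")` inside the encoder builder, rather than importing at module load, defers the failure to the point where audio is actually needed. On its own this does not stop dummy profiling from exercising the audio path, so the profiling behavior would still need to be addressed.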

linyueqian · 2026-05-21T21:04:23Z

+        return self.language_model.compute_logits(hidden_states)
+
+
+    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:


🔴 [blocking] The frozen SenseVoice/SigLIP2 encoders are loaded via from_pretrained/funasr, not from the main checkpoint, so their parameters are never added to the set this method returns. vLLM's track_weights_loading then raises ValueError: weights not initialized from checkpoint. Add the audio_encoder.* / vision_encoder.* param names to loaded_weights before returning.
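
The suggested fix amounts to a set union over parameter names. A minimal sketch, assuming the frozen encoders live under the `audio_encoder.` and `vision_encoder.` prefixes named in the comment; `mark_prefixed_as_loaded` is a hypothetical helper.

```python
from typing import Iterable


def mark_prefixed_as_loaded(
    loaded: set[str],
    param_names: Iterable[str],
    prefixes: tuple[str, ...] = ("audio_encoder.", "vision_encoder."),
) -> set[str]:
    """Return `loaded` plus every parameter name that lives under one of `prefixes`."""
    # str.startswith accepts a tuple, so one call covers all prefixes.
    externally_loaded = {name for name in param_names if name.startswith(prefixes)}
    return loaded | externally_loaded
```

Inside `load_weights`, `param_names` could be supplied as `(name for name, _ in self.named_parameters())`, and the result returned in place of the bare checkpoint set.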

linyueqian · 2026-05-21T21:04:23Z

+            return None
+        return model.model.encoder.to(device=self.device, dtype=torch.float32)
+
+    def encode_audio_inputs(


🔴 [blocking] With the audio tower built, thinker init fails in this path with RuntimeError: expected scalar type Float but found BFloat16: the frozen SenseVoice encoder runs float32 while the model runs bfloat16. The encoder output and model dtype need to be reconciled.

linyueqian · 2026-05-21T21:04:23Z

+            return (embeddings,)
+        return tuple(embeddings.unbind(0))
+
+    def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:


🔴 [blocking] During vLLM dummy-input profiling, this returns 0 embeddings for 5 dummy mm items: AssertionError: Expected number of multimodal embeddings to match number of input items: 5, but got len(mm_embeddings)=0. The dummy-data path through embed_multimodal must produce embeddings consistent with MiniMindOmniDummyInputsBuilder.

linyueqian · 2026-05-21T21:04:23Z

+                    quant_config=quant_config,
+                    prefix=prefix,
+                )
+                for l in range(self.num_hidden_layers)


🟢 [nit] E741 ambiguous variable name l. pre-commit is currently red (this, plus trailing whitespace, ruff-format, and a missing end-of-file newline in pipeline.py). pre-commit run --all-files clears it. There are also a few unused imports (contextlib, io, logging) in this file.

linyueqian · 2026-05-21T21:04:23Z

+        self.spk_emb_size = spk_emb_size
+
+
+class MiniMindOmniConfig(PretrainedConfig):


🟡 [important] MiniMindOmniConfig exposes no top-level hidden_size / num_hidden_layers; only text_config carries them. The official OmniConfig(MiniMindConfig) inherits these, so any caller reading hf_config.hidden_size directly will break. Consider mirroring the key fields at the top level.
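
Mirroring the key fields could be done with read-only properties that delegate to the nested config, so the two views cannot drift apart. The classes below are plain-Python stand-ins, not `transformers.PretrainedConfig` subclasses, and all names are hypothetical.

```python
class TextConfigSketch:
    """Hypothetical stand-in for the nested text config."""

    def __init__(self, hidden_size: int, num_hidden_layers: int):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers


class OmniConfigSketch:
    """Hypothetical composite config that exposes key text fields at the top level."""

    def __init__(self, text_config: TextConfigSketch):
        self.text_config = text_config

    @property
    def hidden_size(self) -> int:
        # Delegate so the top-level value always tracks text_config.
        return self.text_config.hidden_size

    @property
    def num_hidden_layers(self) -> int:
        return self.text_config.num_hidden_layers
```

With a real `PretrainedConfig` subclass, plain attributes assigned in `__init__` may be preferable so the fields also appear in the serialized config; which approach fits is a judgment call for the PR.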

linyueqian · 2026-05-21T21:04:24Z

+from transformers import AutoConfig, PretrainedConfig
+
+
+@dataclass


🟢 [nit] @dataclass here is effectively a no-op since MiniMindConfig defines __init__ manually, and it adds a misleading generated __eq__/__repr__ onto a PretrainedConfig subclass. Recommend removing the decorator.
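
The effect is easy to reproduce with the standard library alone. The sketch assumes the decorated class declares no annotated class-level fields and only a hand-written `__init__`; `BaseConfig` is a stand-in, not `transformers.PretrainedConfig`.

```python
from dataclasses import dataclass


class BaseConfig:
    """Stand-in for a config base class with its own informative __repr__."""

    def __repr__(self) -> str:
        return f"{type(self).__name__}({vars(self)})"


@dataclass
class DecoratedConfig(BaseConfig):
    # No annotated class-level fields, only a hand-written __init__,
    # so @dataclass sees zero fields and keeps this __init__ as-is.
    def __init__(self, hidden_size: int = 8):
        self.hidden_size = hidden_size


class PlainConfig(BaseConfig):
    def __init__(self, hidden_size: int = 8):
        self.hidden_size = hidden_size
```

`DecoratedConfig(8) == DecoratedConfig(16)` evaluates to true because the generated `__eq__` compares zero fields, and its generated `__repr__` hides `hidden_size`, overriding the base class's version. `PlainConfig` keeps the inherited behavior.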

hsliuustc0106 · 2026-05-21T21:09:19Z

Quick review noted. CI checks look good.


xRay2016 changed the title ~~[New Model] [Draft] Add MiniMind-Omni model support~~ [New Model] Add MiniMind-Omni model support May 21, 2026

xRay2016 marked this pull request as ready for review May 21, 2026 13:45

xRay2016 requested review from ZeldaHuang, gcanlin, linyueqian, lishunyang12, princepride, tzhouam and yuanheng-zhao as code owners May 21, 2026 13:45

xRay2016 added 5 commits May 21, 2026 13:56

feat: support minimind omni thinker

dc08b5d

Signed-off-by: xRay2016 <1150722393@qq.com>

fix: processor and info

98dbbdd

Signed-off-by: xRay2016 <1150722393@qq.com>

fix: fix thinker

0944080

Signed-off-by: xRay2016 <1150722393@qq.com>

feat: support minimind-omni

83a391d

Signed-off-by: xRay2016 <1150722393@qq.com>

feat: resolve encoder path/repo

7e1c06e

Signed-off-by: xRay2016 <1150722393@qq.com>

xRay2016 force-pushed the feature/support-minimind-omni branch from 23e957d to 7e1c06e Compare May 21, 2026 14:02

fix: e2e

c1367ec

Signed-off-by: xRay2016 <1150722393@qq.com>

linyueqian mentioned this pull request May 21, 2026

[New Model]: MiniMind-O #3784

Closed

linyueqian reviewed May 21, 2026


Merge branch 'main' into feature/support-minimind-omni

0f13708

xRay2016 added 7 commits May 22, 2026 17:04

fix: unnecessary import

c4d72e8

Signed-off-by: xRay2016 <1150722393@qq.com>

fix: pre-commit

88b011e

Signed-off-by: xRay2016 <1150722393@qq.com>

fix: pre-commit

6ae7cfe

Signed-off-by: xRay2016 <1150722393@qq.com>

fix: pre-commit

c5d548d

Signed-off-by: xRay2016 <1150722393@qq.com>

fix: audio encoder dtype

aec19c6

Signed-off-by: xRay2016 <1150722393@qq.com>

fix: audio encoder dtype

8726cd0

Signed-off-by: xRay2016 <1150722393@qq.com>

fix: pre-commit

0c8331f

Signed-off-by: xRay2016 <1150722393@qq.com>
