[New Model] Add MiniMind-Omni model support #3796

Open
xRay2016 wants to merge 14 commits into vllm-project:main from xRay2016:feature/support-minimind-omni

Conversation

@xRay2016

Purpose

ref #3399

Add initial MiniMind-Omni support, including the three-stage pipeline:

  • Thinker: text/multimodal understanding and text generation
  • Talker: thinker hidden states to Mimi codec tokens
  • Code2Wav: Mimi codec tokens to waveform audio

This is a draft PR. The focused model and weight-loading paths are working locally, while full end-to-end Omni(...) inference is still being debugged.

Current Progress

Implemented so far:

  • Added MiniMind-Omni config classes and AutoConfig registration.
  • Added MiniMindOmniForConditionalGeneration stage wrapper.
  • Added thinker, talker, and code2wav model implementations.
  • Added MiniMind pipeline topology: thinker -> talker -> code2wav.
  • Added stage input processors:
    • thinker2talker
    • talker2code2wav
  • Added model registry entries for MiniMind-Omni architectures.
  • Added pipeline registry entry for model_type="minimind-o".
  • Added local tests for:
    • real thinker checkpoint weight loading
    • real talker checkpoint weight loading, dense and MoE
    • real Mimi Code2Wav decode
    • Code2Wav startup through MiniMindOmniForConditionalGeneration

Test Plan

Following docs/contributing/ci/tests_style.md, tests are organized by scope and placed next to the related source modules when possible.

E2E Correctness

  • Run the same fixed prompt set on:
    • Native MiniMind-Omni inference
    • vLLM-Omni MiniMind-Omni pipeline: thinker -> talker -> code2wav
  • Cover text-only and text-to-audio cases.
  • Compare outputs:
    • Text: manual/keyword check first, then optional semantic similarity
    • Audio: verify waveform is non-empty, finite, correct sample rate, and playable
    • Optional: ASR the generated audio and compare transcript with native output
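The mechanical audio checks above could be automated with a small helper. This is a minimal sketch using only the standard library; `check_waveform` and its parameters are illustrative assumptions, not code from this PR, and "playable" still needs a manual listen or an ASR pass.

```python
import math
from typing import Sequence


def check_waveform(
    samples: Sequence[float],
    sample_rate: int,
    expected_sample_rate: int,
) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(samples) == 0:
        problems.append("waveform is empty")
    if any(not math.isfinite(s) for s in samples):
        problems.append("waveform contains NaN or Inf")
    if sample_rate != expected_sample_rate:
        problems.append(
            f"sample rate {sample_rate} != expected {expected_sample_rate}"
        )
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every failing property for a given prompt in a single test run.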

E2E Performance

Measure the same prompt set on native MiniMind-Omni and vLLM-Omni.

Metrics:

  • End-to-end latency
  • Time to first token / first audio when available
  • Peak GPU memory
  • Throughput for single request and small batch
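A rough harness for the latency and throughput metrics might look like the sketch below. `measure_latency` and the `run` callable are hypothetical names; `run` stands in for either the native or the vLLM-Omni inference call. Time to first token and peak GPU memory need backend-specific hooks and are not covered here.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency(
    run: Callable[[str], object], prompts: Sequence[str]
) -> dict[str, float]:
    """Time run(prompt) for each prompt and summarize wall-clock latency."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        run(prompt)
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    return {
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
        # Sequential single-request throughput; batching needs a separate run.
        "requests_per_s": len(latencies) / total if total > 0 else float("inf"),
    }
```

Running the same prompt list through both backends with this function keeps the comparison like for like.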

Test Result

TODO


@xRay2016 xRay2016 changed the title from "[New Model] [Draft] Add MiniMind-Omni model support" to "[New Model] Add MiniMind-Omni model support" May 21, 2026
@xRay2016 xRay2016 marked this pull request as ready for review May 21, 2026 13:45
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

xRay2016 added 5 commits May 21, 2026 13:56
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>
@xRay2016 xRay2016 force-pushed the feature/support-minimind-omni branch from 23e957d to 7e1c06e Compare May 21, 2026 14:02
Signed-off-by: xRay2016 <1150722393@qq.com>
Collaborator

@linyueqian linyueqian left a comment

Thanks @xRay2016. I ran a hands-on end-to-end evaluation with jingyaogong/minimind-3o (3-stage thinker, talker, code2wav). The architecture is faithful to the official MiniMind-O: QK-norm transformer, neox-style RoPE, tied embeddings, and a real transformers.MimiModel Code2Wav matching eval_omni.py. However, it does not yet complete end to end. 7 blocking issues are inline below.

One more that is not tied to a single line: the core thinker forward fails CUDA-graph capture with RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture. This recurs even with multimodal disabled, so a CPU tensor on the forward path needs to be pinned or kept on device. enforce_eager=True is only a temporary workaround.

Environment note for reproducibility: the test box needed VLLM_USE_FLASHINFER_SAMPLER=0 (no nvcc for the flashinfer sampler JIT). That is environmental, not a PR issue. Full trace logs available on request.

Comment thread vllm_omni/engine/arg_utils.py Outdated
from vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts import (
    Qwen3TTSConfig,
)
from vllm_omni.transformers_utils.configs.voxcpm import VoxCPMConfig
Collaborator

🔴 [blocking] vllm_omni/transformers_utils/configs/voxcpm.py does not exist, so this import raises ImportError, which silently aborts _register_omni_hf_configs(), and minimind-o is never registered. Lines 57-58 also reference CosyVoice3Config / OmniVoiceConfig, which are never imported. This looks like contamination from an unrelated branch; the block should add only minimind-o.
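Independent of trimming the block, one defensive pattern is to import each config class separately so a single broken import cannot abort the rest. The sketch below is an assumption about how that could look, not the current body of _register_omni_hf_configs(); `load_config_classes` is a hypothetical helper and the actual registration call is left out.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_config_classes(specs: dict[str, tuple[str, str]]) -> dict[str, type]:
    """Import each (module path, class name) pair independently.

    specs maps model_type -> (module path, class name). A failed import is
    logged and skipped instead of aborting the remaining entries.
    """
    loaded = {}
    for model_type, (module_path, class_name) in specs.items():
        try:
            module = importlib.import_module(module_path)
            loaded[model_type] = getattr(module, class_name)
        except (ImportError, AttributeError) as exc:
            logger.warning("Skipping config %r: %s", model_type, exc)
    return loaded
```

Logging the failure also makes it visible, which a silently swallowed ImportError is not.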

Author

This was accidentally brought in during the rebase from main and is unrelated to the MiniMind work in this PR.

I’ll trim the block back so it only registers minimind-o and does not include those unrelated imports/references.

Comment thread vllm_omni/engine/arg_utils.py Outdated
return bridge[-expected_len:].detach().to(torch.float32)


def thinker2talker(
Collaborator

🔴 [blocking] thinker2talker / talker2code2wav use a stale signature (stage_list, engine_input_source, ...) and read stage_list[id].engine_outputs. Current vLLM-Omni calls stage processors as (source_outputs, prompt, ...) for the non-chunk path, or (transfer_manager, pooling_output, request, is_finished) for the chunk-transfer path, which is the default. On the default path the engine loops indefinitely on the error thinker2talker() got an unexpected keyword argument 'transfer_manager'. These need to be rewritten against the current API; stage_input_processors/qwen3_omni.py and its *_async_chunk variants are a good reference.
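A cheap unit-test guard for this class of mismatch is to inspect the processor signature before the engine ever calls it. The sketch below assumes the chunk-transfer path passes the four names above as keywords (the error message suggests at least transfer_manager is); `missing_params` and the two stand-in processors are hypothetical.

```python
import inspect
from typing import Callable

# Parameter names taken from the chunk-transfer call described above.
CHUNK_PATH_PARAMS = ("transfer_manager", "pooling_output", "request", "is_finished")


def missing_params(
    func: Callable, required: tuple[str, ...] = CHUNK_PATH_PARAMS
) -> list[str]:
    """Return the required keyword names that func cannot accept."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []  # **kwargs accepts any keyword
    return [
        name
        for name in required
        if name not in params
        or params[name].kind is inspect.Parameter.POSITIONAL_ONLY
    ]


# Hypothetical stand-ins for the stale and the updated processor signatures.
def stale_processor(stage_list, engine_input_source, prompt=None):
    return None


def chunk_processor(transfer_manager, pooling_output, request, is_finished=False):
    return None
```

Asserting `missing_params(thinker2talker) == []` in a test would have surfaced the stale signature without starting the engine.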

path = resolve_model_dir(path, "SenseVoice encoder")

try:
    from funasr import AutoModel
Collaborator

🔴 [blocking] funasr (and librosa, soundfile) are used but not declared in any requirements file. Because the thinker advertises unbounded audio support, even a text-only request invokes SenseVoice during dummy profiling, so the stage cannot start without funasr (ModuleNotFoundError). Please declare them (the official repo pins funasr==1.3.1, librosa==0.11.0, soundfile==0.13.1) and consider making this import lazy so text-only serving does not hard-require it.
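A lazy import could be sketched as below. `require_optional` and `build_audio_encoder_sketch` are hypothetical names, and the AutoModel constructor arguments are an assumption. Note that this alone does not help if dummy profiling still exercises the audio path on text-only requests.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import module_name on first use, with an actionable error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from exc


def build_audio_encoder_sketch(model_dir: str):
    # Deferred import: only runs when an audio encoder is actually requested.
    funasr = require_optional("funasr", "SenseVoice audio encoding")
    # Constructor arguments are an assumption for illustration.
    return funasr.AutoModel(model=model_dir)
```

Chaining with `from exc` keeps the original traceback while giving the user a concrete install hint.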

return self.language_model.compute_logits(hidden_states)


def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
Collaborator

🔴 [blocking] The frozen SenseVoice/SigLIP2 encoders are loaded via from_pretrained/funasr, not from the main checkpoint, so their parameters are never added to the set this method returns. vLLM's track_weights_loading then raises ValueError: weights not initialized from checkpoint. Add the audio_encoder.* / vision_encoder.* param names to loaded_weights before returning.
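The bookkeeping part of that fix can be sketched without touching the loader itself. The prefixes come from the comment above; `mark_external_params_loaded` is a hypothetical helper, and the exact prefixes in the final module tree are an assumption.

```python
from typing import Iterable

# Prefixes named in the comment above; the helper itself is hypothetical.
EXTERNAL_PREFIXES = ("audio_encoder.", "vision_encoder.")


def mark_external_params_loaded(
    loaded_weights: set[str],
    all_param_names: Iterable[str],
    prefixes: tuple[str, ...] = EXTERNAL_PREFIXES,
) -> set[str]:
    """Return loaded_weights plus every parameter name under an external prefix."""
    # str.startswith accepts a tuple and matches any of its entries.
    external = {name for name in all_param_names if name.startswith(prefixes)}
    return loaded_weights | external
```

Inside load_weights, `all_param_names` would presumably come from `name for name, _ in self.named_parameters()`, called just before the return.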

return None
return model.model.encoder.to(device=self.device, dtype=torch.float32)

def encode_audio_inputs(
Collaborator

🔴 [blocking] With the audio tower built, thinker init fails in this path with RuntimeError: expected scalar type Float but found BFloat16: the frozen SenseVoice encoder runs float32 while the model runs bfloat16. The encoder output and model dtype need to be reconciled.

return (embeddings,)
return tuple(embeddings.unbind(0))

def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
Collaborator

🔴 [blocking] During vLLM dummy-input profiling, this returns 0 embeddings for 5 dummy mm items: AssertionError: Expected number of multimodal embeddings to match number of input items: 5, but got len(mm_embeddings)=0. The dummy-data path through embed_multimodal must produce embeddings consistent with MiniMindOmniDummyInputsBuilder.

quant_config=quant_config,
prefix=prefix,
)
for l in range(self.num_hidden_layers)
Collaborator

🟢 [nit] E741 ambiguous variable name l. pre-commit is currently red (this, plus trailing whitespace, ruff-format, and a missing end-of-file newline in pipeline.py). pre-commit run --all-files clears it. There are also a few unused imports (contextlib, io, logging) in this file.

self.spk_emb_size = spk_emb_size


class MiniMindOmniConfig(PretrainedConfig):
Collaborator

🟡 [important] MiniMindOmniConfig exposes no top-level hidden_size / num_hidden_layers; only text_config carries them. The official OmniConfig(MiniMindConfig) inherits these, so any caller reading hf_config.hidden_size directly will break. Consider mirroring the key fields at the top level.
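One way to mirror the fields without duplicating state is read-only properties that delegate to text_config. The classes below are plain-Python stand-ins with made-up default values, not PretrainedConfig subclasses; how properties interact with PretrainedConfig serialization is untested here.

```python
from typing import Optional


class TextConfigSketch:
    # Default values are placeholders for illustration only.
    def __init__(self, hidden_size: int = 768, num_hidden_layers: int = 8):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers


class OmniConfigSketch:
    """Composite config mirroring selected text_config fields at the top level."""

    def __init__(self, text_config: Optional[TextConfigSketch] = None):
        self.text_config = text_config or TextConfigSketch()

    # Properties read through to text_config, so the values cannot drift apart.
    @property
    def hidden_size(self) -> int:
        return self.text_config.hidden_size

    @property
    def num_hidden_layers(self) -> int:
        return self.text_config.num_hidden_layers
```

Copying the values into plain attributes in __init__ would also work, at the cost of the two copies diverging if text_config is mutated later.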

from transformers import AutoConfig, PretrainedConfig


@dataclass
Collaborator

🟢 [nit] @dataclass here is effectively a no-op since MiniMindConfig defines __init__ manually, and it adds a misleading generated __eq__/__repr__ onto a PretrainedConfig subclass. Recommend removing the decorator.
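The effect can be reproduced with the standard library alone. In this sketch `BaseConfigSketch` is a hypothetical stand-in for PretrainedConfig; @dataclass keeps the hand-written __init__ but, because the subclass body defines no __eq__ or __repr__ and declares no annotated fields, it generates versions that ignore every attribute.

```python
from dataclasses import dataclass


class BaseConfigSketch:
    """Stand-in for a config base class with attribute-aware __eq__/__repr__."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __eq__(self, other):
        return isinstance(other, BaseConfigSketch) and self.__dict__ == other.__dict__

    def __repr__(self):
        return f"{type(self).__name__}({self.__dict__})"


@dataclass
class DecoratedConfigSketch(BaseConfigSketch):
    # No annotated fields: the dataclass sees zero fields on this class.
    def __init__(self, hidden_size=768, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size


a = DecoratedConfigSketch(hidden_size=1)
b = DecoratedConfigSketch(hidden_size=2)
# The manual __init__ survives, but the generated __eq__ compares empty
# field tuples, so two differently configured objects compare equal.
print(a.hidden_size, b.hidden_size, a == b)
# The generated __repr__ lists no fields, hiding hidden_size entirely.
print(repr(a))
```

Dropping the decorator restores the base class comparison and representation.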

@hsliuustc0106
Collaborator

Quick review noted. CI checks look good.

xRay2016 added 7 commits May 22, 2026 17:04
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>
Signed-off-by: xRay2016 <1150722393@qq.com>

3 participants