[Bugfix][CI] Run Whisper validation on CPU for single-GPU runners by linyueqian · Pull Request #3822 · vllm-project/vllm-omni

linyueqian · 2026-05-22T13:01:25Z

Purpose

Fixes the CUDA OOM in the L4 nightly TTS function test test_qwen3_tts_base_expansion.py::test_voice_clone_streaming_001[async_chunk], reported in #3788 and #3809.

The failure is test-side, not a server bug. The e2e audio helper validates server output by transcribing it with Whisper small. In tests/helpers/media.py, _whisper_transcribe_in_current_process loads Whisper onto a GPU; on a single-GPU runner the n == 1 branch selects device 0, which is the GPU the model server already occupies. The streaming speech test sends request_num=5 concurrent requests, and each spawns its own Whisper validator subprocess, so up to 5 Whisper models pile onto the server's card.

On an L4 (22 GiB) the two server stages hold ~18 GiB, leaving ~4 GiB. Each Whisper small validator needs ~2.4 GiB, so the concurrent validators exhaust VRAM and the request fails with CUDA out of memory, surfacing as AssertionError: The request failed. The server itself returns HTTP 200 for every request; only the client-side validation OOMs.
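The budget above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the repository: the constants are the approximate figures quoted in this description, the function names are hypothetical, and the model assumes the worst case where every validator holds its full footprint at the same time.

```python
# Back-of-the-envelope VRAM budget using the approximate figures quoted above.
# All names here are illustrative; none of them exist in tests/helpers/media.py.
L4_TOTAL_GIB = 22.0    # VRAM on the L4 runner, as quoted in the description
SERVER_GIB = 18.0      # held by the two server stages (approximate)
VALIDATOR_GIB = 2.4    # one Whisper small validator (approximate)


def headroom_gib(total: float = L4_TOTAL_GIB, server: float = SERVER_GIB) -> float:
    """VRAM left on the card once the server stages are loaded."""
    return total - server


def validator_demand_gib(num_validators: int, per_validator: float = VALIDATOR_GIB) -> float:
    """Peak VRAM needed if every validator is resident at the same time."""
    return num_validators * per_validator


def fits_on_server_gpu(num_validators: int) -> bool:
    """True if the validators fit in the headroom left by the server."""
    return validator_demand_gib(num_validators) <= headroom_gib()


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, round(validator_demand_gib(n), 1), fits_on_server_gpu(n))
```

Under this worst-case model a single validator fits in the remaining headroom, while two or more resident at once do not.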

The device heuristic already avoids the server GPU on multi-GPU hosts by selecting device n - 1. The single-GPU case had nowhere else to go, so this PR drops the n == 1 branch and single-GPU runners validate on CPU. Multi-GPU behaviour is unchanged.
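The resulting device choice can be sketched as follows. This is an illustration of the behaviour described above, not the actual code in tests/helpers/media.py: the function name, the "cuda:N" string format, and the handling of a host with no GPU are assumptions.

```python
def pick_whisper_device(gpu_count: int) -> str:
    """Hypothetical stand-in for the helper's device choice after this change."""
    if gpu_count >= 2:
        # Multi-GPU host: use the last device (n - 1), away from the server GPU.
        return f"cuda:{gpu_count - 1}"
    # Single visible GPU: that card is the one the server occupies, so use CPU.
    # Treating a host with no GPU the same way is an assumption of this sketch.
    return "cpu"


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, pick_whisper_device(n))
```

Before the change, the single-GPU case would have returned device 0 instead of falling through to CPU.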

Closes #3788
Closes #3809

Test Plan

Run the real helper tests.helpers.media.convert_audio_bytes_to_text under a single visible GPU (CUDA_VISIBLE_DEVICES set to one device, so device_count == 1 like the L4 runner), with a filler tensor capping free VRAM at the L4's post-server budget (~4 GiB). Launch 5 concurrent validator calls (mirroring request_num=5), before and after this change.
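The shape of such a concurrency harness might look like the sketch below. It is not the reproduction script used for this PR: it uses threads rather than subprocesses, the validator is a hard-coded stub that never touches a GPU, and run_concurrent and fake_validator are hypothetical names. In a real run the stub would be replaced by a call to tests.helpers.media.convert_audio_bytes_to_text.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrent(validate: Callable[[int], str], num_requests: int = 5) -> str:
    """Run `validate` num_requests times in parallel and summarise the outcomes."""

    def attempt(index: int) -> str:
        try:
            validate(index)
            return "ok"
        except Exception as exc:
            # Classify by message text, since CUDA OOM surfaces as a runtime error.
            return "oom" if "out of memory" in str(exc).lower() else "other"

    with ThreadPoolExecutor(max_workers=num_requests) as pool:
        tally = Counter(pool.map(attempt, range(num_requests)))
    return (
        f"RESULT: {tally['ok']}/{num_requests} ok, "
        f"{tally['oom']} OOM, {tally['other']} other-err"
    )


def fake_validator(index: int) -> str:
    """Stub validator (assumption): pretends requests 2 and above hit a CUDA OOM."""
    if index >= 2:
        raise RuntimeError("CUDA out of memory.")
    return "transcript"


if __name__ == "__main__":
    print(run_concurrent(fake_validator, num_requests=5))
```

The stub's failures are scripted purely to exercise the tallying logic; they say nothing about real memory behaviour.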

Test Result

Single Whisper small validator footprint: 2.41 GiB of GPU memory.

Before (buggy n == 1 -> device 0), ~4 GiB free, 5 concurrent validators:

RESULT: 2/5 ok, 3 OOM, 0 other-err; min free during run = 0.01 GiB
CUDA out of memory. Tried to allocate 20.00 MiB ... 12 MiB free

Same error signature as the CI log in #3788.

After (this PR, single-GPU -> CPU), same ~4 GiB free, 5 concurrent validators:

RESULT: 5/5 ok, 0 OOM, 0 other-err; min free during run = 4.00 GiB

GPU free memory stays flat (validators run on CPU); transcripts are byte-identical to the GPU path.

Note: reducing the test's "few" batch size from 5 to 2 (suggested in #3809) does not fix it. With the server's ~18 GiB already resident, two validators bring the total to 18 + 2 x 2.4 = 22.8 GiB, already over the L4's 22 GiB.

Thanks to @congw729 for the original nightly report in #3788 and to @tzhouam for #3809. @yenuo26, flagging this for you as the author of the test-helper restructure (#2556 and #2620) in case the multi-GPU device heuristic deserves a follow-up.

Checklist

Purpose of the PR, linking the issues it resolves ([Bug]: Nightly / CI failed - tests/e2e/online_serving/test_qwen3_tts_base_expansion.py::test_voice_clone_streaming_001[async_chunk] #3788, [Rebase][Bug] TTS Function Test: CUDA OOM on L4 GPU during streaming voice-clone #3809).
Test plan: reproduction script and command described above.
Test results: before/after comparison pasted above.
Documentation update: not applicable (test-helper only).
Release notes update: not applicable (CI-only change).

The e2e audio helper validates server output by transcribing it with Whisper "small". On a single-GPU runner, _whisper_transcribe_in_current_process loaded Whisper onto device 0, the GPU the model server already occupies, once per concurrent request. On an L4 (22 GiB) the server stages hold ~18 GiB and each validator adds ~2.4 GiB, so a few concurrent requests exhaust VRAM and the request fails with CUDA OOM (test_voice_clone_streaming_001).

The device heuristic already avoids the server GPU on multi-GPU hosts by selecting device n-1. The single-GPU case had nowhere else to go, so drop it: single-GPU runners now validate on CPU. Multi-GPU behaviour is unchanged.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

hsliuustc0106

The single-GPU fallback to CPU for Whisper validation is a reasonable workaround for the L4 runner constraint. The test evidence shows the fix resolves the OOM. Multi-GPU behavior is preserved.

linyueqian requested a review from yenuo26 as a code owner May 22, 2026 13:01

linyueqian added the ready (label to trigger buildkite CI) and tts-test (label to trigger buildkite tts models test in nightly CI) labels May 22, 2026

hsliuustc0106 reviewed May 22, 2026

hsliuustc0106 merged commit 4e4458e into vllm-project:main May 22, 2026
8 checks passed

hsliuustc0106 merged 1 commit into vllm-project:main from linyueqian:bugfix/tts-l4-whisper-oom
