[Bugfix][CI] Run Whisper validation on CPU for single-GPU runners #3822

Merged: hsliuustc0106 merged 1 commit into vllm-project:main from linyueqian:bugfix/tts-l4-whisper-oom on May 22, 2026

@linyueqian
Collaborator

Purpose

Fixes the CUDA OOM in the L4 nightly TTS function test test_qwen3_tts_base_expansion.py::test_voice_clone_streaming_001[async_chunk], reported in #3788 and #3809.

The failure is test-side, not a server bug. The e2e audio helper validates server output by transcribing it with Whisper small. In tests/helpers/media.py, _whisper_transcribe_in_current_process loads Whisper onto a GPU; on a single-GPU runner the n == 1 branch selects device 0, which is the GPU the model server already occupies. The streaming speech test sends request_num=5 concurrent requests, and each spawns its own Whisper validator subprocess, so up to 5 Whisper models pile onto the server's card.

On an L4 (22 GiB) the two server stages hold ~18 GiB, leaving ~4 GiB. Each Whisper small validator needs ~2.4 GiB, so the concurrent validators exhaust VRAM and the request fails with CUDA out of memory, surfacing as AssertionError: The request failed. The server itself returns HTTP 200 for every request; only the client-side validation OOMs.
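
The budget above can be checked with a few lines of arithmetic. This is only a sketch using the approximate figures quoted in this description (22 GiB card, ~18 GiB for the server stages, ~2.4 GiB per Whisper small validator); the constant and function names are illustrative and are not part of the repository.

```python
import math

# Approximate figures quoted in the PR description (assumptions, not measurements made here).
L4_TOTAL_GIB = 22.0    # VRAM on the L4 runner
SERVER_GIB = 18.0      # footprint of the two server stages
VALIDATOR_GIB = 2.4    # footprint of one Whisper small validator


def free_after_server(total=L4_TOTAL_GIB, server=SERVER_GIB):
    """VRAM left on the card once the server stages are resident."""
    return total - server


def validators_that_fit(total=L4_TOTAL_GIB, server=SERVER_GIB, per_validator=VALIDATOR_GIB):
    """How many validators fit in the leftover VRAM at the same time."""
    return math.floor(free_after_server(total, server) / per_validator)


def demand(n_validators, server=SERVER_GIB, per_validator=VALIDATOR_GIB):
    """Total VRAM needed by the server plus n concurrent validators."""
    return server + n_validators * per_validator


print(validators_that_fit())        # 1
print(demand(5) > L4_TOTAL_GIB)     # True: five validators cannot share the card
```

With these figures only one validator fits next to the server at a time, which is why request_num=5 overruns the card.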

The device heuristic already avoids the server GPU on multi-GPU hosts by selecting device n - 1. The single-GPU case had nowhere else to go. This PR drops the n == 1 branch so single-GPU runners validate on CPU. Multi-GPU behaviour is unchanged.
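
The resulting device choice can be sketched as a small pure function. This is a hedged illustration, not the code in tests/helpers/media.py: the name pick_whisper_device is hypothetical, and it assumes that hosts with no visible GPU also fall back to CPU. In the real helper the count would presumably come from something like torch.cuda.device_count().

```python
def pick_whisper_device(gpu_count: int) -> str:
    """Hypothetical stand-in for the device heuristic described above.

    gpu_count is the number of visible CUDA devices (n in the description).
    """
    if gpu_count >= 2:
        # Multi-GPU host: use the last device, away from the server on device 0.
        return f"cuda:{gpu_count - 1}"
    # Zero or one visible GPU: validate on CPU so the server's card is untouched.
    return "cpu"


for n in (0, 1, 2, 4):
    print(n, pick_whisper_device(n))
```

Before this change, the n == 1 case would have returned device 0, the same card the server uses.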

Closes #3788
Closes #3809

Test Plan

Run the real helper tests.helpers.media.convert_audio_bytes_to_text under a single visible GPU (CUDA_VISIBLE_DEVICES set to one device, so device_count == 1 like the L4 runner), with a filler tensor capping free VRAM at the L4's post-server budget (~4 GiB). Launch 5 concurrent validator calls (mirroring request_num=5), before and after this change.
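
The shape of that harness can be sketched as follows. The actual reproduction script is not included in this PR, so everything here is an assumption: validate is a caller-supplied stand-in for tests.helpers.media.convert_audio_bytes_to_text (whose signature is not shown here), threads are used for brevity even though the real validators run in subprocesses, and the filler tensor that caps free VRAM is omitted.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_concurrent(validate, payloads, max_workers=5):
    """Call validate(payload) concurrently and tally ok / oom / other outcomes."""

    def attempt(payload):
        try:
            validate(payload)
            return "ok"
        except Exception as exc:
            # CUDA OOM errors carry "out of memory" in their message.
            return "oom" if "out of memory" in str(exc).lower() else "other"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return Counter(pool.map(attempt, payloads))


def summarize(tally, total):
    """Format the tally as a one-line result summary."""
    return f"RESULT: {tally['ok']}/{total} ok, {tally['oom']} OOM, {tally['other']} other-err"


if __name__ == "__main__":
    # Fake validator for demonstration only: it always succeeds.
    print(summarize(run_concurrent(lambda payload: None, range(5)), 5))
```

Passing the real helper as validate, under a single visible GPU with free VRAM capped, would be the before/after comparison described above.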

Test Result

Single Whisper small validator footprint: 2.41 GiB of GPU memory.

Before (buggy n == 1 -> device 0), ~4 GiB free, 5 concurrent validators:

RESULT: 2/5 ok, 3 OOM, 0 other-err; min free during run = 0.01 GiB
CUDA out of memory. Tried to allocate 20.00 MiB ... 12 MiB free

Same error signature as the CI log in #3788.

After (this PR, single-GPU -> CPU), same ~4 GiB free, 5 concurrent validators:

RESULT: 5/5 ok, 0 OOM, 0 other-err; min free during run = 4.00 GiB

GPU free memory stays flat (validators run on CPU); transcripts are byte-identical to the GPU path.

Note: reducing the test's "few" batch size from 5 to 2 (suggested in #3809) does not fix it. With the server stages resident, two validators alone need 18 + 2 x 2.4 = 22.8 GiB, already over the L4's 22 GiB.

Thanks to @congw729 for the original nightly report in #3788 and to @tzhouam for #3809. @yenuo26, flagging this for you as the author of the test-helper restructure (#2556 and #2620) in case the multi-GPU device heuristic deserves a follow-up.

Checklist

The e2e audio helper validates server output by transcribing it with Whisper "small". On a single-GPU runner, _whisper_transcribe_in_current_process loaded Whisper onto device 0, the GPU the model server already occupies, once per concurrent request. On an L4 (22 GiB) the server stages hold ~18 GiB and each validator adds ~2.4 GiB, so a few concurrent requests exhaust VRAM and the request fails with CUDA OOM (test_voice_clone_streaming_001).

The device heuristic already avoids the server GPU on multi-GPU hosts by selecting device n-1. The single-GPU case had nowhere else to go, so drop it: single-GPU runners now validate on CPU. Multi-GPU behaviour is unchanged.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian requested a review from yenuo26 as a code owner on May 22, 2026 13:01

@linyueqian added the ready and tts-test labels on May 22, 2026

@hsliuustc0106 (Collaborator) left a comment:

The single-GPU fallback to CPU for Whisper validation is a reasonable workaround for the L4 runner constraint. The test evidence shows the fix resolves the OOM. Multi-GPU behavior is preserved.

@hsliuustc0106 merged commit 4e4458e into vllm-project:main on May 22, 2026
8 checks passed
