[Bugfix][CI] Run Whisper validation on CPU for single-GPU runners #3822
Merged
hsliuustc0106 merged 1 commit on May 22, 2026
The e2e audio helper validates server output by transcribing it with Whisper "small". On a single-GPU runner, _whisper_transcribe_in_current_process loaded Whisper onto device 0, the GPU the model server already occupies, once per concurrent request. On an L4 (22 GiB) the server stages hold ~18 GiB and each validator adds ~2.4 GiB, so a few concurrent requests exhaust VRAM and the request fails with CUDA OOM (test_voice_clone_streaming_001).

The device heuristic already avoids the server GPU on multi-GPU hosts by selecting device n-1. The single-GPU case had nowhere else to go, so drop it: single-GPU runners now validate on CPU. Multi-GPU behaviour is unchanged.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Collaborator
hsliuustc0106
left a comment
The single-GPU fallback to CPU for Whisper validation is a reasonable workaround for the L4 runner constraint. The test evidence shows the fix resolves the OOM. Multi-GPU behavior is preserved.
Purpose
Fixes the CUDA OOM in the L4 nightly TTS function test `test_qwen3_tts_base_expansion.py::test_voice_clone_streaming_001[async_chunk]`, reported in #3788 and #3809.

The failure is test-side, not a server bug. The e2e audio helper validates server output by transcribing it with Whisper `small`. In `tests/helpers/media.py`, `_whisper_transcribe_in_current_process` loads Whisper onto a GPU; on a single-GPU runner the `n == 1` branch selects device 0, which is the GPU the model server already occupies. The streaming speech test sends `request_num=5` concurrent requests, and each spawns its own Whisper validator subprocess, so up to 5 Whisper models pile onto the server's card.

On an L4 (22 GiB) the two server stages hold ~18 GiB, leaving ~4 GiB. Each Whisper `small` validator needs ~2.4 GiB, so the concurrent validators exhaust VRAM and the request fails with `CUDA out of memory`, surfacing as `AssertionError: The request failed`. The server itself returns HTTP 200 for every request; only the client-side validation OOMs.

The device heuristic already avoids the server GPU on multi-GPU hosts by selecting device `n - 1`. The single-GPU case had nowhere else to go. This PR drops it so single-GPU runners validate on CPU. Multi-GPU behaviour is unchanged.

Closes #3788
Closes #3809
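For readers who have not opened the diff, the device choice described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in `tests/helpers/media.py`: the function name `pick_whisper_device` is hypothetical, the GPU count is passed in as a plain integer, and treating a host with zero GPUs as CPU-only is an assumption.

```python
def pick_whisper_device(gpu_count: int) -> str:
    """Pick a device string for the Whisper validator.

    Hypothetical helper mirroring the heuristic described in this PR;
    the real helper's name and signature may differ.
    """
    if gpu_count >= 2:
        # Multi-GPU host: use the last device (n - 1), which the
        # heuristic relies on to stay clear of the server's GPU.
        return f"cuda:{gpu_count - 1}"
    # Single-GPU host: device 0 is the server's card, so validate on CPU.
    # Assumption: a host with no visible GPU also falls back to CPU.
    return "cpu"
```

In a real helper the count would presumably come from something like `torch.cuda.device_count()`; keeping it as a parameter here makes the branch logic easy to check without a GPU.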
Test Plan
Run the real helper `tests.helpers.media.convert_audio_bytes_to_text` under a single visible GPU (`CUDA_VISIBLE_DEVICES` set to one device, so `device_count == 1` like the L4 runner), with a filler tensor capping free VRAM at the L4's post-server budget (~4 GiB). Launch 5 concurrent validator calls (mirroring `request_num=5`), before and after this change.

Test Result

Single Whisper `small` validator footprint: 2.41 GiB of GPU memory.

Before (buggy `n == 1` -> device 0), ~4 GiB free, 5 concurrent validators:

Same error signature as the CI log in #3788.
After (this PR, single-GPU -> CPU), same ~4 GiB free, 5 concurrent validators:
GPU free memory stays flat (validators run on CPU); transcripts are byte-identical to the GPU path.
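The memory budget behind these results can be checked with a few lines of arithmetic. The sketch below only uses the approximate figures quoted in this description (22 GiB card, ~18 GiB for the server stages, ~2.4 GiB per validator); the constant and function names are made up for illustration, and real usage will fluctuate with allocator overhead.

```python
import math

# Approximate figures quoted in this PR description (GiB).
L4_TOTAL_GIB = 22.0
SERVER_GIB = 18.0
VALIDATOR_GIB = 2.4


def validators_that_fit(total_gib: float, server_gib: float, validator_gib: float) -> int:
    """Number of GPU validators that fit in the VRAM left by the server."""
    free_gib = total_gib - server_gib
    return max(0, math.floor(free_gib / validator_gib))


def peak_demand_gib(server_gib: float, validator_gib: float, n_validators: int) -> float:
    """Total VRAM needed by the server plus n concurrent GPU validators."""
    return server_gib + n_validators * validator_gib


if __name__ == "__main__":
    fit = validators_that_fit(L4_TOTAL_GIB, SERVER_GIB, VALIDATOR_GIB)
    print(f"GPU validators that fit next to the server: {fit}")
    for n in (2, 5):
        need = peak_demand_gib(SERVER_GIB, VALIDATOR_GIB, n)
        verdict = "over" if need > L4_TOTAL_GIB else "within"
        print(f"{n} validators need {need:.1f} GiB ({verdict} the {L4_TOTAL_GIB:.0f} GiB card)")
```

With these inputs only one validator fits alongside the server, so any concurrency of two or more on the GPU path exceeds the card, which is why moving validation to CPU is used here instead of shrinking the batch.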
Note: reducing the test's `"few"` batch size from 5 to 2 (suggested in #3809) does not fix it. Two validators alone need 18 + 2 x 2.4 = 22.8 GiB, already over the L4's 22 GiB.

Thanks to @congw729 for the original nightly report in #3788 and to @tzhouam for #3809. @yenuo26, flagging this for you as the author of the test-helper restructure (#2556 and #2620) in case the multi-GPU device heuristic deserves a follow-up.
Checklist