[bugfix, rl] Fix sleep do not release full memory in custom pipeline#3818
[bugfix, rl] Fix sleep do not release full memory in custom pipeline#3818knlnguyen1802 wants to merge 5 commits into
Conversation
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
The diagnosis matches what I would expect for RL colocated rollouts with custom_pipeline: constructing under with target_device: makes safetensors take the direct-to-GPU fast path during from_pretrained() inside pipeline __init__ (e.g. QwenImage text encoder / VAE), which bypasses CuMemAllocator and leaves memory pinned after sleep().
Moving custom_pipeline init out of the CUDA default-device context is consistent with the existing HSDP path in _load_model_with_hsdp(), which already avoids with target_device: for the same reason. The follow-up model.to(target_device) keeps non-CPU offload paths unchanged.
Blocking gaps
-
No test plan or before/after evidence. The PR template checklist is entirely unchecked. For a sleep-mode memory bug, please include peak VRAM /
CuMemAllocator.get_current_usage()before and aftersleep(level=1)on a custom-pipeline RL path (e.g.QwenImagePipelineWithLogProb+enable_sleep_mode=True). -
No regression test.
tests/e2e/offline_inference/custom_pipeline/test_async_omni_collective_rpc.py::test_sleep_wake_up_inline_modeexercises custom pipeline sleep/wake but does not setenable_sleep_mode=Trueand does not assert physical memory is reclaimed. Please extend that test (or add a unit test aroundDiffusersPipelineLoader.load_model(..., load_format="custom_pipeline")) to assert allocator-tracked usage drops after sleep.
Non-blocking notes
- The long NOTE is justified; consider trimming to 4–5 lines if you add a test that encodes the invariant.
- This fix addresses loader-level default device context only. Custom pipelines that pass
device=intofrom_pretrained()during__init__may still bypass the allocator; worth a one-line caveat in the comment. model.to(target_device)is redundant for pipelines that already.to(self.device)in__init__, but harmless.
Please add the regression test and test results
…yen1802/vllm-omni into fix_sleep_custom_pipeline
hsliuustc0106
left a comment
There was a problem hiding this comment.
BLOCKING:
- Correctness — pre-commit check is failing. Please run
pre-commit run --all-fileslocally and fix any issues before proceeding with the review.
hsliuustc0106
left a comment
There was a problem hiding this comment.
BLOCKING:
- Correctness — pre-commit check is failing. Please run
pre-commit run --all-fileslocally and fix any issues before proceeding with the review.
Purpose
Following the release of vLLM version
0.20.0,safetensorsnow utilizes a direct-to-GPU fast path that invokescudaMallocthrough the driver API, thereby bypassing PyTorch's caching allocator. As a consequence, the allocated memory regions become invisible toCuMemAllocator, which preventssleep()fromoffloadingorunmappingthem. This results in GPU memory remaining pinned and unable to be reclaimed as expected.cc: @SamitHuang
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)