Skip to content

[bugfix, rl] Fix sleep do not release full memory in custom pipeline#3818

Open
knlnguyen1802 wants to merge 5 commits into
vllm-project:mainfrom
knlnguyen1802:fix_sleep_custom_pipeline
Open

[bugfix, rl] Fix sleep do not release full memory in custom pipeline#3818
knlnguyen1802 wants to merge 5 commits into
vllm-project:mainfrom
knlnguyen1802:fix_sleep_custom_pipeline

Conversation

@knlnguyen1802
Copy link
Copy Markdown
Contributor

@knlnguyen1802 knlnguyen1802 commented May 22, 2026

Purpose

Following the release of vLLM version 0.20.0, safetensors now utilizes a direct-to-GPU fast path that invokes cudaMalloc through the driver API, thereby bypassing PyTorch's caching allocator. As a consequence, the allocated memory regions become invisible to CuMemAllocator, which prevents sleep() from offloading or unmapping them. This results in GPU memory remaining pinned and unable to be reclaimed as expected.

cc: @SamitHuang


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Copy link
Copy Markdown
Collaborator

@SamitHuang SamitHuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagnosis matches what I would expect for RL colocated rollouts with custom_pipeline: constructing under with target_device: makes safetensors take the direct-to-GPU fast path during from_pretrained() inside pipeline __init__ (e.g. QwenImage text encoder / VAE), which bypasses CuMemAllocator and leaves memory pinned after sleep().

Moving custom_pipeline init out of the CUDA default-device context is consistent with the existing HSDP path in _load_model_with_hsdp(), which already avoids with target_device: for the same reason. The follow-up model.to(target_device) keeps non-CPU offload paths unchanged.

Blocking gaps

  1. No test plan or before/after evidence. The PR template checklist is entirely unchecked. For a sleep-mode memory bug, please include peak VRAM / CuMemAllocator.get_current_usage() before and after sleep(level=1) on a custom-pipeline RL path (e.g. QwenImagePipelineWithLogProb + enable_sleep_mode=True).

  2. No regression test. tests/e2e/offline_inference/custom_pipeline/test_async_omni_collective_rpc.py::test_sleep_wake_up_inline_mode exercises custom pipeline sleep/wake but does not set enable_sleep_mode=True and does not assert physical memory is reclaimed. Please extend that test (or add a unit test around DiffusersPipelineLoader.load_model(..., load_format="custom_pipeline")) to assert allocator-tracked usage drops after sleep.

Non-blocking notes

  • The long NOTE is justified; consider trimming to 4–5 lines if you add a test that encodes the invariant.
  • This fix addresses loader-level default device context only. Custom pipelines that pass device= into from_pretrained() during __init__ may still bypass the allocator; worth a one-line caveat in the comment.
  • model.to(target_device) is redundant for pipelines that already .to(self.device) in __init__, but harmless.

Please add the regression test and test results

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
@knlnguyen1802 knlnguyen1802 requested a review from yenuo26 as a code owner May 22, 2026 10:10
Copy link
Copy Markdown
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLOCKING:

  • Correctness — pre-commit check is failing. Please run pre-commit run --all-files locally and fix any issues before proceeding with the review.

Copy link
Copy Markdown
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLOCKING:

  • Correctness — pre-commit check is failing. Please run pre-commit run --all-files locally and fix any issues before proceeding with the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants