[Bugfix] purge chunk-transfer zombies on every schedule tick to keep engine-core alive on aborts (fixes #3736)#3774
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Confirming this PR fixes the engine-core crash from #3736 — verified locally against an However, there's a residual hang that this PR does not close under the same abort-pressure profile, and I want to flag it before this lands so reviewers can decide whether to expand the scope or track separately. SymptomAfter exactly Reproduction shape (model-agnostic)
That last row is the load-bearing observation: it's not concurrency-driven, it's per-abort accumulation with a capacity equal to Where it isn't
Where I suspect it isPer-stage The status-realignment fix lives in the scheduler caller (e.g. Repro test idea (sketch)# concurrency=1, max_num_seqs=8, abort each session after the 2nd chunk
for i in range(16):
open ws → send prompt/done → recv 2 chunks → ws.close()
# Expect: idx 0-7 abort cleanly (chunks=2), idx 8-15 hang at chunks=0This matches the test plan you already have for the adapter purge but would extend coverage past Happy to test any follow-up patch, and happy to file a separate issue for the residual hang if you'd prefer to keep this PR scoped to the crash fix only. |
|
Got the root cause and a working patch on top of this PR for the residual hang. Sharing in case you want to fold it in here — happy to split into a separate PR if you'd rather keep this one scoped to the deque-zombie fix. Root cause (confirmed via debug logging)Added a one-liner in So:
That mismatch traps Why the status is wrong
If an abort lands in the gap between admit and the next deque round-trip, This is orthogonal to the deque-zombie path this PR plugs: the request here is not in the adapter's Proposed patch (on top of this PR)Realign # In OmniARScheduler.finish_requests, right after chunk_transfer_adapter.finish_requests:
# vllm-omni#3774 fixes the deque-zombie path, but chunk_transfer_adapter's
# requests_origin_status table goes stale on the waiting→running admit
# transition. If an abort lands before the next deque round-trip overwrites
# it, chunk_transfer_adapter.finish_requests restores `status=WAITING` on
# a request that actually lives in self.running. Upstream
# Scheduler.finish_requests (scheduler.py:1797) only removes from
# self.running when status==RUNNING, so the request stays alive in
# self.running and the worker's input_batch slot leaks. After max_num_seqs
# such aborts every new request hangs at chunks=0.
if isinstance(request_ids, str):
ids_to_align = (request_ids,)
elif request_ids is None:
ids_to_align = list(self.requests.keys())
else:
ids_to_align = list(request_ids)
if ids_to_align:
running_ids = {r.request_id for r in self.running}
waiting_ids = {r.request_id for r in self.waiting}
for rid in ids_to_align:
req = self.requests.get(rid)
if req is None or req.is_finished():
continue
if rid in running_ids and req.status != RequestStatus.RUNNING:
req.status = RequestStatus.RUNNING
elif rid in waiting_ids and req.status == RequestStatus.RUNNING:
req.status = RequestStatus.WAITINGThe cleaner alternative would be to teach Verification matrixSame 3-stage
Happy to either:
Also worth noting: the |
11195cf to
169f8d4
Compare
|
Thanks for the deep dive @npuichigo — that's an exceptional bug report with a complete root-cause trace, a 15-line fix, and a verification matrix. Folding your status-realignment patch in here on top of What changed in the fold-in:
You correctly pointed out the deque purge and the realignment are complementary, not redundant: the purge handles entries already parked in If you'd like to validate the fold-in against your repro setup before reviewers stamp, happy to wait — your matrix is the canonical signal here. Otherwise this is ready for a final pass. |
|
I applied this patch and confirmed it fixes one important class of zombie request leaks, but I think there is still a residual race around With debug logging before That is problematic because upstream The result is a scheduler invariant violation: a finished or untracked request can remain in The remaining fix I needed was to realign status to actual queue membership immediately before calling if req_id is in self.running:
request.status = RUNNING
elif req_id is in self.waiting and request.status == RUNNING:
request.status = WAITINGI also added a defensive post-finish purge for any So I think this PR is directionally correct, but it may not fully close abort-storm hangs unless |
…engine-core alive on aborts `OmniARScheduler` + `OmniChunkTransferAdapter` (and the sibling `OmniGenerationScheduler` under `async_chunk`) could re-inject an already-freed `Request` onto `self.running` after a mid-flight abort. The next `schedule()` tick then emitted the zombie in `scheduled_cached_reqs`, and the worker's `_update_states` crashed with `KeyError` reading `self.requests[req_id]`. The crash propagated out of `run_busy_loop` and killed the engine-core process -- every client attached to that stage dropped. Root cause (issue vllm-project#3736): a request parked in `waiting_for_chunk_running_requests` becomes a zombie once `Scheduler._free_request` deletes its tracking entry. The adapter still holds a reference and `restore_queues` blindly `extend`s it back onto `running_queue` because there was no membership check against the live scheduler set. Fix: - Add `_purge_untracked_chunk_requests` on the adapter: walks a deque, drops entries whose `request_id` is no longer in `scheduler_requests`, runs `cleanup_receiver` on each zombie (drops origin-status mapping + registers the id as cancelled so a late load/poll is dropped too). Order of survivors is preserved. - Make `scheduler_requests` a required keyword-only parameter on both `process_pending_chunks` and `restore_queues`. The enforcement is deliberate: `OmniGenerationScheduler` was already silently calling the unguarded form, which would have left the bug live for any `async_chunk` generation deployment. - Call the purge inside `restore_queues` as well, not just `process_pending_chunks`. `restore_queues` runs in the `finally` of the scheduler's `schedule()`, so an abort that fires *between* the two calls (in the body of `super().schedule()`) would otherwise reopen the same window. The second purge closes it. - Update all 5 production call sites (1 `OmniARScheduler` + 2 `OmniGenerationScheduler` `restore_queues` + the matching `process_pending_chunks` calls) to pass `scheduler_requests=self.requests`. Reporter @npuichigo verified this PR fixes the engine-core crash under a 3-stage `LLM_AR + LLM_AR + LLM_GENERATION` pipeline with `async_scheduling: true` and `async_chunk: true`, but also flagged a residual hang on the same abort-pressure profile: after exactly `max_num_seqs` (=8) successful aborts, every subsequent request stalls at `chunks=0`. Co-debugged with @npuichigo: `chunk_transfer_adapter._process_chunk_queue` stamps `requests_origin_status[req.id] = WAITING` (or `RUNNING`) when first parking a request. On the next tick, when the chunk arrives, `_process_chunk_queue` flips `request.status = target`, but `requests_origin_status` is left at its first-park value -- no hook updates it on the `waiting -> running` admit transition that `super().schedule()` later performs. If an abort lands in the gap before the next deque round-trip, `chunk_transfer_adapter.finish_requests` reads stale `WAITING` from `requests_origin_status`, stomps it onto `request.status`, and upstream `Scheduler.finish_requests` else branch silently fails to remove from `self.running` -- the request stays alive in `self.running` and the worker's `input_batch` slot leaks. After `max_num_seqs` such aborts every new request hangs at `chunks=0`. Fold @npuichigo's status-realignment fix into this PR via a shared `OmniSchedulerMixin._realign_request_status_to_queues` helper, called by both schedulers right after the adapter's `finish_requests` and before `super().finish_requests`. The helper realigns `request.status` to actual queue membership (`self.running` <-> `RUNNING`, `self.waiting` <-> `WAITING`) so the upstream branch picks the right removal path. This is a localized 65-line change that does not touch the adapter's invariants and is complementary to the deque purge. Per a follow-up from @npuichigo on the same PR thread, also add a defensive belt-and-suspenders sweep `_purge_finished_from_running` on the same `OmniSchedulerMixin`, called by both schedulers *after* `super().finish_requests`. It reclaims any `self.running` entry that is `is_finished()` or no longer tracked by `self.requests`. Realignment above is preventive (fix `status` so the upstream removal branch fires correctly); this sweep is defensive (catch residues if a future regression or status mid-transition slips past). Both ends of the abort path are now guarded -- realign before, sweep after -- so the worker's `input_batch` slot is never pinned by a freed request even under unforeseen corner cases. Verification matrix from @npuichigo (3-stage pipeline, `max_num_seqs=8` per stage, this PR + realignment + sweep on top): | Scenario | Pre-PR | This PR alone | This PR + realignment + sweep | |------------------------------------------------|--------|---------------|-------------------------------| | 16 sequential aborts (conc=1, abort=0.5) | 8/16 hang | 8/16 hang | **16/16 ok** | | 120 mixed (conc=8, abort=0.3, seed=17) | ~30-40% hang | ~3% hang | **120/120 ok** | | 160 abort-pressure (conc=8, abort=0.5, seed=42)| ~20% hang | improved but not zero | **160/160 ok, 33.5 s** | | 32 pure no-abort (conc=8) | ok | ok | **32/32 ok** | Tests: - 8 CPU regression tests in `tests/core/sched/test_omni_scheduler_mixin.py` for the realignment helper: stale-WAITING-in-running, stale-RUNNING-in-waiting, already-aligned no-op, unknown id silent skip, parked-in-deque left untouched, str / None / iterable input forms, and the `is_finished()` early skip. - 6 CPU regression tests in the same file for `_purge_finished_from_running`: finished-entry-purged, untracked-entry-purged, healthy-running-unchanged, empty-running-noop, mixed-keeps-only-alive-in-order, in-place mutation behavior. - 5 CPU regression tests in `tests/distributed/omni_connectors/test_chunk_transfer_adapter.py` for the deque-purge path: running deque, waiting deque, interleaved live/zombie ordering, restore-time race window, empty-deque short-circuit. - 3 integration tests in `tests/core/sched/test_omni_scheduler_finish_requests_purge.py` parametrized over both schedulers (= 6 runs) exercising the full `finish_requests` path end-to-end and pinning the realign-before / sweep-after placement. Local verification: pre-commit clean (12 hooks). Tests collect on a machine with `vllm` + `aenum` installed; on this dev box without `vllm` the tests rely on CI -- pure-Python helpers, no GPU dependency. Fixes vllm-project#3736 Co-authored-by: npuichigo <11533479+npuichigo@users.noreply.github.com> Signed-off-by: abinggo <107740309+abinggo@users.noreply.github.com>
169f8d4 to
5ee5a67
Compare
|
Thanks @npuichigo — folded the defensive post-finish purge into the PR as Predicate: drop entries that are Tests added:
Pre-commit clean (12 hooks). Amended into the same commit to keep history single-shot — new sha is 5ee5a67. The verification matrix you ran semantically already covers this (realign was the load-bearing fix; sweep is belt-and-suspenders). If you want to re-verify on the amended commit before maintainers chime in, happy to wait. |
|
It works well for my case now |
Summary
Fixes the
engine-corecrash described in #3736: under sustained client-abort pressure on a chunk-transfer pipeline (OmniARSchedulerorOmniGenerationSchedulerwithasync_chunk+OmniChunkTransferAdapter), an aborted request that was parked inwaiting_for_chunk_running_requestsbecomes a zombie onceScheduler._free_requestdeletes its tracking entry. The adapter still holds a reference, andrestore_queuesblindlyextends it back ontorunning_queue. The nextschedule()tick packs the zombie intoscheduled_cached_reqs, and the worker's_update_statescrashes withKeyErrorreadingself.requests[req_id]— taking down every client on that stage.Fix
_purge_untracked_chunk_requestson the adapter: in-place filter (popleft+ append-live) over a deque, dropping entries whoserequest_idis not inscheduler_requestsand reclaiming their receiver-side state viacleanup_receiver. Survivor order is preserved.scheduler_requestsa required, keyword-only parameter on bothprocess_pending_chunksandrestore_queues. This is intentionally a hard break:OmniGenerationScheduler:97-98was already silently calling the unguarded form onasync_chunkgeneration deployments, which would have left the bug live there too — required-kwarg means any future caller that forgets gets aTypeErrorat the call site, not a worker crash three ticks later.restore_queuesas well, not only inprocess_pending_chunks.restore_queuesruns in the scheduler'sfinally-clause, so an abort that fires between the two calls (duringsuper().schedule()) would otherwise reopen the same race window. The second purge closes it.OmniARScheduler.schedule(); inOmniGenerationScheduler.schedule(): 1process_pending_chunks+ 2restore_queues) to passscheduler_requests=self.requests.The fix is conceptually the one-liner the issue suggests, but the issue body assumed the dormant zombie-purge guard already lived inside the adapter. It didn't on
main—process_pending_chunkshad a(waiting_queue, running_queue)signature with no purge logic. So the actual delta is the guard + the call-site rewires + the API hardening.Fixes
Why required keyword-only
The optional-default pattern (
scheduler_requests: dict[...] | None = None) was tempting for backwards compatibility but the fact thatOmniGenerationScheduleralready exists and calls the unguarded form proves the optional API gets misused on day one. We'd ship the same crash forasync_chunkgeneration that the AR pipeline ships today. Required keyword-only fails loudly at every call site instead of silently leaking.*, scheduler_requests: dict[str, Request]makes the constraint enforceable: any in-tree caller that compiles is necessarily safe; any out-of-tree downstream is forced to make a deliberate choice (in which case{}opts out and skips the purge — the explicit form replaces the implicitNonefoot-gun).Test plan
5 new CPU regression tests in
tests/distributed/omni_connectors/test_chunk_transfer_adapter.py, re-using the existingbuild_adapter/_req/DummyWaitingQueuefixture pattern. Coverage matrix:test_process_pending_chunks_purges_zombies_in_running_dequetest_process_pending_chunks_purges_zombies_in_waiting_dequerestore_queuesusesadd_request(notextend) for the waiting deque, so the purge needs to apply symmetricallytest_purge_preserves_live_order_with_interleaved_zombiespopleft+append-live ordering invariant on[live1, zombie1, live2, zombie2]test_restore_queues_purges_late_aborts_after_process_pending_chunksprocess_pending_chunkssucceeds but beforerestore_queuesruns infinallytest_purge_is_noop_on_empty_dequespoplefton empty deque is correctly short-circuitedPlus the existing
test_process_and_restore_queuesis updated to passscheduler_requests(otherwise the new required kwarg wouldTypeErrorit — the intended fail-loud behavior).I cannot run pytest locally on this dev box (mac, no
vllmwheel —tests/conftest.pycallsbootstrap_vllm_layer_custom_op_modulesat import time, which needsvllm.compilation). Relying on CI for the test pass;pre-commit run(ruff-check, ruff-format, typos, the suggestion / no-pickle / debug-statement / EOL hooks) is green locally.Out of scope
OmniChunkTransferAdapter.finish_requests(line 491-497) restoresrequests_origin_statusbut doesn't pop the corresponding deque entries. The fixed-upprocess_pending_chunks+restore_queuesalready close the resulting race; makingfinish_requestspop synchronously would tighten the contract one more step but it's not load-bearing for [Bug]: OmniARScheduler skips chunk-transfer zombie purge — engine-core crashes on aborts #3736 and would expand the diff for no functional gain. Happy to follow up if reviewers want it folded in._update_after_schedule(defensive guard missing) vllm#42157 is the upstream sibling onScheduler._update_after_schedule/_update_from_kv_xfer_finished. That's a different file and a different membership check; [Bug]: OmniARScheduler skips chunk-transfer zombie purge — engine-core crashes on aborts #3736 here is the chunk-transfer half.Notes on review
_purge_untracked_chunk_requestsrather than_purge_zombiesfor the helper name — describes the predicate (the entry is no longer in the live scheduler set), not the implementation pun.process_pending_chunksandrestore_queues) explain why the parameter exists. If reviewers prefer the worker-crash detail to live in the issue tracker only, happy to trim.