
wslc: idle-terminate per-user session VMs when inactive#40781

Draft
benhillis wants to merge 27 commits into master from user/benhill/wslc-idle-terminate-vm


Conversation

@benhillis


Summary

Idle-terminates a per-user WSLC session's backing VM when it has been inactive, freeing memory while the session object (and its persistent storage) lives on. The VM is transparently recreated on the next operation.

Builds on #40770 (IWSLCVirtualMachineFactory).

Behavior

  • Only sessions with persistent storage (StoragePath set) idle-terminate.
  • An idle worker thread tears the VM down after a grace period (currently 30s) once there is no in-flight activity and no active container lock.
  • In-flight work holds an activity reference so the VM cannot be torn down mid-operation:
    • VmLease wraps CLI/container operations.
    • BeginContainerOperation hands clients an activity token (marked IFastRundown so that a client crash reclaims it promptly).
    • Long-lived root-namespace processes (e.g. plugin hosts) created via CreateRootNamespaceProcess hold a keep-alive token for their lifetime.
  • Activity bookkeeping (count + wake event) lives in a shared IdleState held via shared_ptr, decoupled from the session's lifetime, so a held token suppresses idle teardown without extending the session object's lifetime. This preserves the explicit-reset-invalidates-held-processes invariant from #40366 (Add WSLC (WSL Containers) feature).
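
The lifetime decoupling in the last bullet can be illustrated with a minimal, portable sketch. Everything here is hypothetical: `IdleStateSketch`, `ActivityTokenSketch`, and `SessionSketch` are stand-ins written with standard C++ primitives, not the PR's COM types, and a condition variable plays the role of the wake event.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for the shared bookkeeping (count + wake event).
struct IdleStateSketch {
    std::mutex lock;
    std::condition_variable wake;  // stands in for the wake event
    int activityCount = 0;
};

// RAII token: keeps the activity count above zero while it is alive.
class ActivityTokenSketch {
public:
    explicit ActivityTokenSketch(std::shared_ptr<IdleStateSketch> state) : m_state(std::move(state)) {
        std::lock_guard<std::mutex> guard(m_state->lock);
        ++m_state->activityCount;
    }
    ~ActivityTokenSketch() {
        {
            std::lock_guard<std::mutex> guard(m_state->lock);
            assert(m_state->activityCount > 0);
            --m_state->activityCount;
        }
        m_state->wake.notify_all();  // ask the idle worker to re-check
    }
    ActivityTokenSketch(const ActivityTokenSketch&) = delete;
    ActivityTokenSketch& operator=(const ActivityTokenSketch&) = delete;

private:
    // Owns the bookkeeping only; it holds no reference to the session.
    std::shared_ptr<IdleStateSketch> m_state;
};

// Hypothetical session: shares the bookkeeping but is not kept alive by tokens.
class SessionSketch {
public:
    SessionSketch() : m_idle(std::make_shared<IdleStateSketch>()) {}
    std::unique_ptr<ActivityTokenSketch> CreateActivityToken() const {
        return std::make_unique<ActivityTokenSketch>(m_idle);
    }
    std::shared_ptr<IdleStateSketch> Idle() const { return m_idle; }

private:
    std::shared_ptr<IdleStateSketch> m_idle;
};

// Reads the count under the lock.
int ActivityCount(const std::shared_ptr<IdleStateSketch>& state) {
    std::lock_guard<std::mutex> guard(state->lock);
    return state->activityCount;
}
```

In this sketch a token can safely outlive the session: destroying the session does not invalidate the token, and the token's destructor touches only the shared state.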

Testing

  • New WSLCE2EVmIdleTests E2E suite (5 tests) including WSLCE2E_VmIdle_RootProcessKeepsVmAlive.
  • WSLCTests::CreateRootNamespaceProcess still passes.
  • Full x64 Debug build clean.

Notes / follow-ups (deferred)

  • Grace period is a hardcoded constexpr; making it injectable would enable deterministic race tests.
  • No automated coverage yet for the crash path (client dies while holding a token).

Note

Draft for early review.

Copilot AI review requested due to automatic review settings June 11, 2026 19:50

Copilot AI left a comment


Pull request overview

This PR adds on-demand creation and idle-termination of per-user WSLC session VMs (for sessions with persistent storage), so memory can be reclaimed while keeping the session object and storage intact. It also introduces VM-liveness/activity bookkeeping to prevent teardown during in-flight operations and adds new E2E coverage around VM lifecycle behavior.

Changes:

  • Implement lazy VM bring-up and idle shutdown in wslcsession via an idle worker, activity counting/tokens, and a VmLease used by VM-requiring operations.
  • Add client-side “operation keep-alive” usage in wslc.exe container operations to prevent VM teardown between OpenContainer and subsequent calls/streaming.
  • Add a new E2E test suite validating lazy start, idle stop, persistence across restarts, keep-alive for root-namespace processes, and teardown/recreate races.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/windows/wslc/e2e/WSLCE2EVmIdleTests.cpp | New E2E tests covering lazy VM start, idle stop, persistence, keep-alive, and race scenarios. |
| test/windows/wslc/e2e/WSLCE2EHelpers.h | Exposes the underlying IWSLCSession* for diagnostics/test-only calls. |
| src/windows/wslcsession/WSLCSession.h | Adds VM lifecycle state, idle worker/tokens/lease declarations, and new session methods. |
| src/windows/wslcsession/WSLCSession.cpp | Implements lazy VM creation, idle teardown, activity tokens, and VM diagnostics reporting. |
| src/windows/wslcsession/WSLCProcessControl.cpp | Preserves a real exit code when signaling container release, only synthesizing SIGKILL when needed. |
| src/windows/wslcsession/WSLCProcess.h | Stores a keep-alive token on root-namespace processes to keep the VM alive for their lifetime. |
| src/windows/wslcsession/WSLCContainer.cpp | Signals idle re-checks on terminal container transitions; holds a VM lease during delete. |
| src/windows/wslcsession/IORelay.h | Adds IsRelayThread() to safely avoid destroying the relay on its own thread. |
| src/windows/wslcsession/IORelay.cpp | Co-initializes the relay thread into the MTA; implements IsRelayThread(). |
| src/windows/wslc/services/SessionModel.h | Adds a helper to acquire/hold a keep-alive token for client-side container operations. |
| src/windows/wslc/services/ContainerService.cpp | Uses the keep-alive token across container operations (attach/start/stop/kill/delete/exec/etc.). |
| src/windows/service/inc/wslc.idl | Adds VM diagnostics type + new session methods for diagnostics and operation keep-alive. |
| src/windows/service/exe/WSLCSessionManager.cpp | Updates comments to reflect on-demand VM creation and recreation after idle termination. |

Comment thread src/windows/wslcsession/WSLCSession.h
Comment thread src/windows/service/inc/wslc.idl
Copilot AI review requested due to automatic review settings June 11, 2026 20:10

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread src/windows/service/inc/wslc.idl
Comment thread src/windows/wslcsession/WSLCSession.cpp
Comment thread src/windows/wslcsession/WSLCSession.cpp
@benhillis benhillis force-pushed the user/benhill/wslc-idle-terminate-vm branch from c12d7e1 to fa2eb47 Compare June 12, 2026 01:28
Ben Hillis and others added 13 commits June 12, 2026 09:31
Summary
-------
Decouples a wslc session's VM lifetime from the wslcsession.exe process so the
VM can be torn down when nothing needs it and recreated on demand later, while
the per-user session and its bookkeeping survive across VM restarts.

Previously a VM was created 1:1 with a session: the SYSTEM service eagerly built
an HcsVirtualMachine and handed the COM pointer to wslcsession.exe, and any VM
exit permanently terminated the (single-shot) session.

Detailed description
--------------------
* New service-side IWSLCVirtualMachineFactory lets the per-user process mint a
  fresh VM at any time. IWSLCSessionFactory::CreateSession / IWSLCSession::Initialize
  now take the factory instead of an eager IWSLCVirtualMachine. WSLCVirtualMachineFactory
  deep-copies the settings and duplicates the dmesg handle per creation.
* WSLCSession Initialize is now lightweight (persists settings, starts the idle
  worker). VM bring-up/teardown is split into re-runnable StartVmLockHeld /
  StopVmLockHeld / TearDownVmLockHeld driven by a VmState machine
  (None/Starting/Running/Stopping).
* On-demand bring-up + idle teardown: every VM-requiring operation takes a VmLease
  RAII (EnsureVmRunning + activity count); when activity drops to zero and no
  container is in the Created or Running state, the idle worker tears the VM down
  immediately. Idle termination is enabled only for persistent-storage sessions.
* VM-exit disambiguation: intentional stops (m_vmStopRequested) keep the session
  alive; unexpected exits still Terminate() permanently.
* New IWSLCSession::GetVmDiagnostics (Running + StartCount) exposes VM lifecycle
  for tests/diagnostics without bringing the VM up or counting as activity.
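
As a rough illustration of the lease and state machine described in the list above, the following sketch replaces real VM bring-up and teardown with state transitions. `SessionVmSketch`, `VmLeaseSketch`, and `TryIdleStop` are hypothetical names, and registering activity before ensuring the VM is running is an assumption of this sketch, not something the commit message specifies.

```cpp
#include <cassert>
#include <mutex>

enum class VmState { None, Starting, Running, Stopping };

// Hypothetical, heavily simplified session: VM creation and teardown are
// reduced to state changes so the lease logic can be read in isolation.
class SessionVmSketch {
public:
    void AddActivity() {
        std::lock_guard<std::mutex> guard(m_lock);
        ++m_activity;
    }
    void ReleaseActivity() {
        std::lock_guard<std::mutex> guard(m_lock);
        assert(m_activity > 0);
        --m_activity;
    }
    // Re-runnable bring-up: only does work when no VM exists.
    void EnsureVmRunning() {
        std::lock_guard<std::mutex> guard(m_lock);
        if (m_state == VmState::None) {
            m_state = VmState::Starting;  // real code would create the VM here
            ++m_startCount;
            m_state = VmState::Running;
        }
    }
    // What an idle worker pass would do; returns true if the VM was torn down.
    bool TryIdleStop() {
        std::lock_guard<std::mutex> guard(m_lock);
        if (m_state != VmState::Running || m_activity > 0) {
            return false;
        }
        m_state = VmState::Stopping;  // real code would stop the VM here
        m_state = VmState::None;
        return true;
    }
    bool Running() const {
        std::lock_guard<std::mutex> guard(m_lock);
        return m_state == VmState::Running;
    }
    unsigned StartCount() const {
        std::lock_guard<std::mutex> guard(m_lock);
        return m_startCount;
    }

private:
    mutable std::mutex m_lock;
    VmState m_state = VmState::None;
    int m_activity = 0;
    unsigned m_startCount = 0;
};

// RAII lease: counts as activity and guarantees a running VM for its scope.
class VmLeaseSketch {
public:
    explicit VmLeaseSketch(SessionVmSketch& session) : m_session(session) {
        m_session.AddActivity();      // assumption: register first so an idle pass cannot slip in
        m_session.EnsureVmRunning();
    }
    ~VmLeaseSketch() { m_session.ReleaseActivity(); }
    VmLeaseSketch(const VmLeaseSketch&) = delete;
    VmLeaseSketch& operator=(const VmLeaseSketch&) = delete;

private:
    SessionVmSketch& m_session;
};
```

The `Running()` and `StartCount()` accessors loosely mirror the two fields the commit message attributes to GetVmDiagnostics: they report state without bringing the VM up or counting as activity.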

Concurrency fixes folded in (compile-validated; flagged for runtime stress):
* IORelay self-join: TearDownVmLockHeld no longer destroys the IO relay from its
  own thread (added IORelay::IsRelayThread); the stopped relay is left for
  ~WSLCSession on a non-relay thread.
* Lease-vs-idle-stop race: VmLease retries instead of throwing ERROR_INVALID_STATE
  when the idle worker tears down in the bring-up window.
* Idle-worker-vs-crash deadlock: IdleWorker bails when the VM exit event is already
  signaled, letting the relay-thread Terminate path own teardown.
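
The self-join hazard behind the first fix can be sketched with standard threads. `RelaySketch` and `TryDestroyFromTeardown` are hypothetical; only the name `IsRelayThread` comes from the PR, and its real implementation may differ.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical relay owning a worker thread. Teardown may be triggered from
// that very thread, in which case joining it would be a self-join.
class RelaySketch {
public:
    explicit RelaySketch(std::function<void(RelaySketch&)> body) {
        m_thread = std::thread([this, body = std::move(body)] {
            m_relayThreadId.store(std::this_thread::get_id());
            body(*this);
        });
    }
    ~RelaySketch() {
        // Must run on a non-relay thread, where joining is safe.
        assert(!IsRelayThread());
        if (m_thread.joinable()) {
            m_thread.join();
        }
    }
    bool IsRelayThread() const {
        return m_relayThreadId.load() == std::this_thread::get_id();
    }
    // Returns true if the relay thread was joined here, false if the caller
    // is the relay thread and destruction must be deferred to someone else.
    bool TryDestroyFromTeardown() {
        if (IsRelayThread()) {
            return false;
        }
        if (m_thread.joinable()) {
            m_thread.join();
        }
        return true;
    }

private:
    std::atomic<std::thread::id> m_relayThreadId{std::thread::id{}};
    std::thread m_thread;
};
```

When teardown runs on the relay thread, this sketch leaves the stopped relay in place for the owner's destructor, which is the same deferral the commit message describes for ~WSLCSession.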

Validation steps
-----------------
* Full solution build (x64 Debug) green, including wsltests.dll.
* Copyright-header validation: no new violations.
* Added E2E tests (test/windows/wslc/e2e/WSLCE2EVmIdleTests.cpp): lazy start +
  idle stop, recreate-on-demand + state persistence, Created container keeps VM
  alive, and concurrent recreate stress (lease/idle race).
* NOT runtime-validated here (requires deploy + Administrator + container runtime);
  run bin\x64\Debug\test.bat /name:*VmIdle* and stress the two race fixes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…deadlock)

Runtime fixes for the idle-terminate feature so all four WSLCE2EVmIdleTests pass:

- Register the IWSLCVirtualMachineFactory proxy in the process Global Interface
  Table and re-fetch it per VM creation, and MTA-init the IdleWorker / IORelay
  threads, so on-demand VM creation no longer fails with RPC_E_WRONG_THREAD.
- Register the factory IID in the MSIX so cross-proc marshalling resolves.
- Preserve a container's real exit code in DockerContainerProcessControl::
  OnContainerReleased: only synthesize 128+SIGKILL when no exit code was ever
  recorded, so --rm 'container run' returns 0 instead of 137 when the VM
  idle-terminates immediately after the container exits.
- Fix an AB-BA deadlock between 'container rm' and idle teardown: hold a VmLease
  across WSLCContainer::Delete (keeps the VM up and blocks teardown) and drop the
  now-redundant shared-lock re-acquire in OnContainerDeleted (which would deadlock
  behind the idle worker's pending exclusive lock).
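
The exit-code rule from the third item above reduces to a small decision, sketched here under assumptions: `ResolveExitCodeOnRelease` is a hypothetical helper, not the body of OnContainerReleased, and the signal number is spelled out as a constant because SIGKILL is not defined by the Windows CRT headers.

```cpp
#include <cassert>
#include <optional>

// SIGKILL's conventional value on Linux, hard-coded for this sketch.
constexpr int c_sigkillSketch = 9;

// Hypothetical helper: keep the recorded exit code if there is one, and
// synthesize 128 + SIGKILL (137) only when no exit code was ever recorded.
int ResolveExitCodeOnRelease(const std::optional<int>& recordedExitCode) {
    // Sanity check, assuming container exit codes fall in the 0..255 range.
    assert(!recordedExitCode || (*recordedExitCode >= 0 && *recordedExitCode <= 255));
    return recordedExitCode.value_or(128 + c_sigkillSketch);
}
```

With this rule a container that already exited with 0 keeps reporting 0 even if the VM is torn down right afterwards, while a container released with no recorded exit still reports 137.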

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Operation

The wslc CLI performs each container mutation as two COM round-trips
(OpenContainer to resolve a wrapper, then the operation), and may stream
output afterwards. With on-demand VM idle-termination enabled (any
persistent-storage session), the VM could idle-stop in the gap between the
calls when the target container is not Created/Running: TearDownVmLockHeld
clears m_containers, disconnecting the client-held wrapper, so the second
call failed with RPC_E_DISCONNECTED. This regressed the container CRUD E2E
suite (rm of stopped containers, and cleanup helpers).

Add IWSLCSession::BeginContainerOperation, returning an activity token that
holds m_activityCount > 0 for as long as the client holds it. The CLI now
holds the token across the whole operation (resolve + operate + streamed
relay), so the idle worker cannot tear the VM down mid-operation. Releasing
the token (or the client exiting, via fast rundown) lets the VM idle again.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tearing the VM down the instant it went idle could thrash it (repeated
teardown/recreate) when containers are created and destroyed, or operations
issued, in quick succession. Keep an otherwise-idle VM running for a short
grace period and only tear it down once it has stayed idle for the whole
window.

The idle worker now waits with a timeout derived from a grace deadline. The
deadline is armed when the VM is first observed idle and reset on any non-idle
observation or explicit idle-check signal (raised on every lease/token release
and terminal container state change), so teardown occurs a full grace period
after the last activity. A WAIT_TIMEOUT wake means the VM has been continuously
idle for the grace period and is torn down.
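
The deadline handling described above can be sketched as a small state holder that the idle worker consults on every wake. `GraceTimerSketch`, `OnWake`, and `WaitTimeout` are hypothetical names; the `signaled` flag distinguishes an explicit idle-check signal from a timeout wake (the WAIT_TIMEOUT case).

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using SketchClock = std::chrono::steady_clock;

// Hypothetical grace-period tracker for an idle worker loop.
class GraceTimerSketch {
public:
    explicit GraceTimerSketch(SketchClock::duration grace) : m_grace(grace) {
        assert(grace > SketchClock::duration::zero());
    }

    // Called on each wake. Returns true when the VM should be torn down.
    bool OnWake(SketchClock::time_point now, bool idle, bool signaled) {
        if (!idle) {
            m_deadline.reset();  // non-idle observation: disarm
            return false;
        }
        if (signaled || !m_deadline) {
            m_deadline = now + m_grace;  // first idle observation or fresh activity signal: (re)arm
            return false;
        }
        if (now < *m_deadline) {
            return false;  // woke early; keep waiting
        }
        m_deadline.reset();  // continuously idle for the whole window
        return true;
    }

    // Timeout for the next wait; std::nullopt means wait without a timeout.
    std::optional<SketchClock::duration> WaitTimeout(SketchClock::time_point now) const {
        if (!m_deadline) {
            return std::nullopt;
        }
        return *m_deadline > now ? *m_deadline - now : SketchClock::duration::zero();
    }

private:
    SketchClock::duration m_grace;
    std::optional<SketchClock::time_point> m_deadline;
};
```

Because every signal re-arms the deadline, a burst of short operations keeps pushing teardown out, which is the anti-thrash behavior the commit message aims for.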

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A root-namespace process created via CreateRootNamespaceProcess is not
tracked as a container, so it did not contribute to the session activity
count or HasActiveContainerLockHeld(). The VmLease taken during creation
was released when the call returned, leaving the process eligible for
idle teardown: once the grace period elapsed the idle worker could stop
the VM and kill a long-lived root process (e.g. a plugin host) out from
under the client.

Bind an activity token to the returned WSLCProcess so the VM stays alive
for as long as the client holds the process proxy. Factor the existing
BeginContainerOperation token logic into CreateActivityToken() and reuse
it here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…VM lifecycle

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #40767 (terminateEvent) removed ITerminationCallback / WSLCSessionSettings.TerminationCallback from the IDL, but WSLCVirtualMachineFactory (introduced later by factory PR #40770, whose branch had merged from master) still referenced them, so the branch did not compile. Drop the dead member and assignments; the VM now caches the termination reason for WSLCSession to pull.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Kevin's PR #40767 (event-model termination) moved m_vmExitEvent.SetEvent()
in OnExit() to after a new std::lock_guard(m_lock) that caches the
termination reason. ~HcsVirtualMachine already holds m_lock across the 5s
exit-event wait and HcsCloseComputeSystem (which drains in-flight HCS
callbacks). An in-flight OnExit() therefore blocks acquiring m_lock, so it
never signals the exit event nor drains, and the close never completes:
a hard deadlock that StuckVmTermination reliably reproduces.

Drop the broad lock from the dtor. By the time the compute system is
closed no further callbacks can run, so the remaining teardown is safe
unguarded. Flag to Kevin to fold into #40767.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In the lazy-VM model, container/volume/network recovery runs at first VM
start rather than during CreateSession, so the create-time WarningCallback
was out of scope by the time recovery emitted warnings.

Park the session's WarningCallback in the GIT and have WSLCExecutionContext
fall back to it when an operation has no explicit callback, so lazy-recovery
warnings still reach the user. CLI-side, keep the create/enter callback alive
for the whole command by storing it in the Session model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ConfigureStorage validates the storage path lazily (via AttachDisk at first
VM start), so with a lazily-created VM a WSLCSessionStorageFlagsNoCreate
session pointing at a missing path no longer failed at CreateSession.
Validate the storage VHD existence eagerly in Initialize so misconfiguration
is reported up front.

The idle worker acquired m_lock exclusively (blocking) on every wake. Because
SRW locks favor a waiting writer, that pending acquire stalled all new
shared-lock operations behind it, so a long-running operation holding its
shared VmLease (e.g. a blocking SaveImage/Export) serialized every concurrent
operation until it completed. Use try_lock_exclusive and treat contention as
activity, re-evaluating on the next idle-check signal.
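
The non-blocking acquire can be sketched with `std::shared_mutex` standing in for the SRW lock. All names here are hypothetical, and `WhileSharedLockHeld` exists only to simulate a long-running operation holding the shared lock on another thread.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <shared_mutex>
#include <thread>

enum class IdleCheckSketch { Contended, Busy, TornDown };

// Hypothetical session state guarded by a reader/writer lock.
struct LockedSessionSketch {
    std::shared_mutex lock;
    int activity = 0;
    bool vmRunning = true;
};

// One idle-worker pass: never blocks on the lock, so a pending writer cannot
// stall new shared-lock operations behind it.
IdleCheckSketch IdleWorkerPass(LockedSessionSketch& session) {
    std::unique_lock<std::shared_mutex> exclusive(session.lock, std::try_to_lock);
    if (!exclusive.owns_lock()) {
        return IdleCheckSketch::Contended;  // treat contention as activity
    }
    assert(session.activity >= 0);
    if (session.activity > 0 || !session.vmRunning) {
        return IdleCheckSketch::Busy;
    }
    session.vmRunning = false;  // real code would tear the VM down here
    return IdleCheckSketch::TornDown;
}

// Simulates a long-running operation (for example a blocking export) that
// holds the shared lock on another thread while `whileHeld` runs here.
template <typename F>
void WhileSharedLockHeld(LockedSessionSketch& session, F&& whileHeld) {
    std::promise<void> held;
    std::promise<void> release;
    std::thread operation([&] {
        std::shared_lock<std::shared_mutex> shared(session.lock);
        held.set_value();
        release.get_future().wait();
    });
    held.get_future().wait();
    whileHeld();
    release.set_value();
    operation.join();
}
```

Note that `std::shared_mutex::try_lock` is permitted to fail spuriously, so a `Contended` result only means "try again on the next idle-check signal", never "an operation is definitely in flight".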

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ben Hillis and others added 2 commits June 12, 2026 09:31
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Idle teardown destroyed container impls and disconnected their COM wrappers, leaving any outstanding client proxy with RPC_E_DISCONNECTED. Keep the VM alive while a client still holds a proxy:

- Containers: HasActiveContainerLockHeld now also keeps the VM up while a container wrapper is externally referenced (refcount > the single internal m_comWrapper reference). WSLCContainer::Release() wakes the idle worker when the last client proxy is released so the VM is reclaimed promptly.
- Exec processes: the returned WSLCProcess wrapper (not retained internally) now carries a keep-alive activity token for its client-held lifetime, mirroring root-namespace processes.

Adds WSLCE2E_VmIdle_HeldContainerProxyKeepsVmAlive covering a held exited-container proxy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 12, 2026 17:46
@benhillis benhillis force-pushed the user/benhill/wslc-idle-terminate-vm branch from fa2eb47 to ea2254c Compare June 12, 2026 17:46

Copilot AI left a comment


Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.

Comment thread src/windows/service/inc/wslc.idl
Comment thread src/windows/service/inc/wslc.idl
Comment thread src/windows/service/inc/wslc.idl Outdated
Ben Hillis and others added 2 commits June 12, 2026 13:11
WSLCContainer::Release() woke the idle worker by calling
m_session.RequestIdleCheck(), but the wrapper can outlive the session
(a client keeps a container proxy past releasing the session; the impl's
baseline m_comWrapper ref is dropped during teardown while the client
proxy survives) and can also be destroyed concurrently the instant our
reference drops. Both make touching m_session after Release() a
use-after-free.

Bind an idle-check signaler at construction that captures the session's
shared IdleState (shared_ptr), mirroring CreateActivityToken, and
snapshot it on the stack before RuntimeClassBase::Release() so no member
is touched on any post-Release path. Make WSLCSession::IdleState public
for the comment reference.
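
The snapshot-before-Release pattern can be sketched with a hand-rolled reference count. `ContainerWrapperSketch` and `IdleSignalSketch` are hypothetical and much simpler than the PR's RuntimeClassBase-based wrapper; the sketch assumes a baseline internal reference of 1, with client proxies adding further references.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical shared state; the counter stands in for waking the idle worker.
struct IdleSignalSketch {
    std::atomic<int> idleChecks{0};
};

class ContainerWrapperSketch {
public:
    // The signaler captures the shared state by shared_ptr, so it never
    // needs to reach back into the session.
    explicit ContainerWrapperSketch(std::shared_ptr<IdleSignalSketch> idle)
        : m_signalIdleCheck([idle = std::move(idle)] { ++idle->idleChecks; }) {}

    void AddRef() { ++m_refs; }

    unsigned long Release() {
        // Snapshot on the stack first: once the count is decremented, `this`
        // may be deleted here or concurrently by another thread.
        auto signal = m_signalIdleCheck;
        const unsigned long remaining = --m_refs;
        if (remaining == 0) {
            delete this;
        }
        if (remaining <= 1) {
            signal();  // no member access on this path
        }
        return remaining;
    }

private:
    ~ContainerWrapperSketch() = default;  // destroyed only through Release()

    std::atomic<unsigned long> m_refs{1};  // baseline internal reference
    std::function<void()> m_signalIdleCheck;
};
```

After the decrement, the function touches only the local copy, so the idle check is still signaled even when the final Release() destroys the wrapper.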

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The comment implied callers could query the reason before termination and
observe Unknown/empty, but the implementation returns ERROR_INVALID_STATE
until the termination event is signaled. Document the actual contract.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 12, 2026 20:55

Copilot AI left a comment


Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.

Comment on lines 527 to +533
// Returns an event that is signaled when the VM exits (graceful or forced).
HRESULT GetTerminationEvent([out, system_handle(sh_event)] HANDLE* Event);

// Returns the cached termination reason and details. The values are only meaningful
// after the termination event has been signaled; before that the reason is
// WSLCVirtualMachineTerminationReasonUnknown and Details is an empty string.
HRESULT GetTerminationReason([out] WSLCVirtualMachineTerminationReason* Reason, [out] LPWSTR* Details);
Comment on lines 792 to 834
interface IWSLCSession : IUnknown
{
HRESULT GetId([out] ULONG* Id);
HRESULT GetState([out] WSLCSessionState* State);

// Returns a one-off event that is signaled when the session terminates, whether due to an
// explicit Terminate() call or an unexpected VM exit. The returned handle is owned by the
// caller and remains valid (and observes the signaled state) even after the session is released.
HRESULT GetTerminationEvent([out, system_handle(sh_event)] HANDLE* Event);

// Returns the cached termination reason and details. Only valid once the session has terminated,
// i.e. after the event returned by GetTerminationEvent is signaled; before that it returns
// HRESULT_FROM_WIN32(ERROR_INVALID_STATE). On success the caller owns Details and must free it.
HRESULT GetTerminationReason([out] WSLCVirtualMachineTerminationReason* Reason, [out] LPWSTR* Details);

// Reports on-demand VM lifecycle diagnostics. Does not bring the VM up or count as activity.
HRESULT GetVmDiagnostics([out] WSLCVmDiagnostics* Diagnostics);

// Image management.
HRESULT PullImage([in] LPCSTR Image, [in, unique] LPCSTR RegistryAuthenticationInformation, [in, unique] IProgressCallback* ProgressCallback, [in, unique] IWarningCallback* WarningCallback);
HRESULT BuildImage([in] const WSLCBuildImageOptions* Options, [in, unique] IProgressCallback* ProgressCallback, [in, unique, system_handle(sh_event)] HANDLE CancelEvent);
HRESULT LoadImage([in] WSLCHandle ImageHandle, [in, unique] IProgressCallback* ProgressCallback, [in] ULONGLONG ContentLength, [in, unique] IWarningCallback* WarningCallback);
HRESULT ImportImage([in] WSLCHandle ImageHandle, [in] LPCSTR ImageName, [in, unique] IProgressCallback* ProgressCallback, [in] ULONGLONG ContentLength, [in, unique] IWarningCallback* WarningCallback);
HRESULT SaveImage([in] WSLCHandle OutputHandle, [in] LPCSTR ImageNameOrID, [in, unique] IProgressCallback * ProgressCallback, [in, unique, system_handle(sh_event)] HANDLE CancelEvent);
HRESULT SaveImages([in] WSLCHandle OutputHandle, [in] const WSLCStringArray* ImageNames, [in, unique] IProgressCallback * ProgressCallback, [in, unique, system_handle(sh_event)] HANDLE CancelEvent);
HRESULT ListImages([in, unique] const WSLCListImagesOptions* Options, [out, size_is(, *Count)] WSLCImageInformation** Images, [out] ULONG* Count);
HRESULT DeleteImage([in] const WSLCDeleteImageOptions* Options, [out, size_is(, *Count)] WSLCDeletedImageInformation** DeletedImages, [out] ULONG* Count);
HRESULT TagImage([in] const WSLCTagImageOptions* Options);
HRESULT InspectImage([in] LPCSTR ImageNameOrId, [out] LPSTR* Output);
HRESULT PruneImages([in, unique, size_is(FiltersCount)] const WSLCFilter* Filters, [in] ULONG FiltersCount, [out, size_is(, *DeletedImagesCount)] WSLCDeletedImageInformation** DeletedImages, [out] ULONG* DeletedImagesCount, [out] ULONGLONG* SpaceReclaimed);

// Container management.
HRESULT CreateContainer([in] const WSLCContainerOptions* Options, [in, unique] IWarningCallback* WarningCallback, [out] IWSLCContainer** Container);
HRESULT OpenContainer([in, ref] LPCSTR Id, [out] IWSLCContainer** Container);

// Keeps the VM alive for the duration of a client-side container operation. The CLI performs
// each mutation as two round-trips (OpenContainer followed by the operation) and may stream
// output afterwards. With on-demand VM idle-termination the VM could otherwise tear down
// between those calls, disconnecting the container wrapper and failing the second call with
// RPC_E_DISCONNECTED. The client holds the returned token for the whole operation; releasing
// it (or the client exiting) lets the VM idle-terminate again.
HRESULT BeginContainerOperation([out] IUnknown** Operation);
HRESULT ListContainers([in, unique] const WSLCListContainersOptions* Options, [out, size_is(, *Count)] WSLCContainerEntry** Containers, [out] ULONG* Count, [out, size_is(, *PortsCount)] WSLCContainerPortMapping** Ports, [out] ULONG* PortsCount);
Comment on lines 18 to +20
WslcSetSessionSettingsFeatureFlags
WslcSetSessionSettingsTerminationCallback
WslcGetSessionTerminationEvent
WslcGetSessionTerminationReason
Comment on lines 58 to 65
m_settings = nullptr;

winrt::check_hresult(WslcGetSessionTerminationEvent(m_session.get(), m_terminationEvent.put()));

m_terminationWait.reset(CreateThreadpoolWait(&Session::OnTerminated, this, nullptr));
THROW_LAST_ERROR_IF_NULL(m_terminationWait);
SetThreadpoolWait(m_terminationWait.get(), m_terminationEvent.get(), nullptr);
}

3 participants