Race condition in scheduler causing stuck preemptions and quota oversubscription #11480

@kshalot

What happened

When the scheduler admits a workload, it optimistically sets QuotaReserved=True for that workload in the cache, while the API server update is handled asynchronously in a goroutine. This introduces nondeterminism: the API server update might land after the next scheduling cycle has already started, or much later, after several cycles have already treated the workload as admitted.
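To make the pattern concrete, here is a minimal sketch of "update the cache now, persist later". It is not Kueue code: `apiServer`, `scheduler`, `admit`, and `newScheduler` are hypothetical stand-ins, and the `done` channel exists only so the example can wait for the write, which the scheduler cycle, per the description above, does not do.

```go
package main

import (
	"fmt"
	"sync"
)

// apiServer is a hypothetical stand-in for the Kubernetes API server.
type apiServer struct {
	mu            sync.Mutex
	quotaReserved map[string]bool
}

func (a *apiServer) setQuotaReserved(name string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.quotaReserved[name] = true
}

func (a *apiServer) isQuotaReserved(name string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.quotaReserved[name]
}

// scheduler is a hypothetical stand-in for the scheduler and its cache.
type scheduler struct {
	cache map[string]bool // workload name -> QuotaReserved, as the scheduler sees it
	api   *apiServer
}

func newScheduler() *scheduler {
	return &scheduler{
		cache: map[string]bool{},
		api:   &apiServer{quotaReserved: map[string]bool{}},
	}
}

// admit marks the workload as QuotaReserved in the cache right away and
// persists it from a goroutine. The returned channel is closed once the
// write has landed; it is only here so the example can observe that moment.
func (s *scheduler) admit(name string) <-chan struct{} {
	s.cache[name] = true
	done := make(chan struct{})
	go func() {
		defer close(done)
		s.api.setQuotaReserved(name)
	}()
	return done
}

func main() {
	s := newScheduler()
	done := s.admit("workload1")
	// The cache already reports the reservation; the API server may or may not.
	fmt.Println("cache:", s.cache["workload1"])
	<-done
	fmt.Println("api after the write completes:", s.api.isQuotaReserved("workload1"))
}
```

Between the return of `admit` and the close of `done`, the two views can disagree for an arbitrary amount of time, which is the window both scenarios below rely on.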

Scenario 1 - Infinitely Stuck Preemption

Consider the following sequence of events:

  1. Scheduling Cycle 1: Workload1 is admitted - it is optimistically added to the cache. The goroutine to update it in the API server is spawned.
  2. Scheduling Cycle 2: Workload2 preempts Workload1. It sends a synchronous update to the API server to add the Evicted condition to Workload1 and adds Workload1 to preemptionExpectation.
  3. Workload1 Admission Goroutine: The goroutine sends a request to the API server to admit the workload, but the update is based on the state of Workload1 as of Scheduling Cycle 1. This overwrites the Evicted condition.
  4. Scheduling Cycle 3: Workload2 is evaluated again. It picks Workload1 as its preemption target. Since Workload1 does not have the Evicted condition anymore, it won't hit the branch that would observe the eviction in preemptionExpectations. Instead, because Workload1 was never observed in preemptionExpectations, it will fall into a fast exit. Workload2 is not admitted.
  5. Scheduling Cycle 4: Exactly the same as the third cycle.
  6. Scheduling Cycle N: This continues until Workload1 is deleted or finishes.
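The sequence above can be replayed with a small sequential simulation. This is a sketch under assumptions, not the scheduler's actual logic: `workload`, `preemptorCycle`, `runScenario1`, and the `pending` map (playing the role of the expectations store) are all hypothetical, and the late admission write is modeled as a full overwrite of the stored object with a stale copy.

```go
package main

import "fmt"

// workload is a hypothetical, heavily simplified stand-in for a Workload.
type workload struct {
	quotaReserved bool
	evicted       bool
}

// cycleResult describes what one simulated scheduling cycle did.
type cycleResult string

const (
	issuedPreemption cycleResult = "issued preemption"
	observedEviction cycleResult = "observed eviction"
	fastExit         cycleResult = "fast exit, waiting for observation"
)

// preemptorCycle models one cycle in which Workload2 picks target as its
// preemption target. pending plays the role of the expectations store.
func preemptorCycle(api map[string]*workload, pending map[string]bool, target string) cycleResult {
	switch {
	case api[target].evicted:
		delete(pending, target) // the eviction is observed
		return observedEviction
	case pending[target]:
		return fastExit // preemption already issued, nothing to observe yet
	default:
		api[target].evicted = true // synchronous write of the Evicted condition
		pending[target] = true
		return issuedPreemption
	}
}

// runScenario1 replays the interleaving and returns the result of cycle 2
// followed by laterCycles further cycles.
func runScenario1(laterCycles int) []cycleResult {
	api := map[string]*workload{"workload1": {}}
	pending := map[string]bool{}

	// Cycle 1: the admission goroutine captures the object as it is now.
	snapshot := *api["workload1"]
	snapshot.quotaReserved = true

	// Cycle 2: Workload2 preempts Workload1.
	results := []cycleResult{preemptorCycle(api, pending, "workload1")}

	// Late admission write: the stale snapshot replaces the object and
	// drops the Evicted condition.
	api["workload1"] = &snapshot

	// Cycles 3..N: the expectation is never observed.
	for i := 0; i < laterCycles; i++ {
		results = append(results, preemptorCycle(api, pending, "workload1"))
	}
	return results
}

func main() {
	for i, r := range runScenario1(3) {
		fmt.Printf("cycle %d: %s\n", i+2, r)
	}
}
```

In this model every cycle after the stale write takes the fast-exit branch, because the only branch that clears the expectation requires the Evicted condition that was just overwritten.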

Scenario 2 - Quota Oversubscription

If step 3 in the previous scenario happens after Scheduling Cycle 3, the Evicted condition will not be overwritten. Instead:

  1. Scheduling Cycle 1: Workload1 is admitted - it is optimistically added to the cache. The goroutine to update it in the API server is spawned.
  2. Scheduling Cycle 2: Workload2 preempts Workload1. It sends a synchronous update to the API server to add the Evicted condition to Workload1 and adds Workload1 to preemptionExpectation.
  3. Scheduling Cycle 3: Sees that Workload1 is Evicted and observes it.
  4. Job Controller: The controller for Workload1's job evicts it.
  5. Scheduling Cycle 4: Workload2 is admitted.
  6. Workload1 Admission Goroutine: The goroutine admits the workload without any quota checks, causing oversubscription.
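A rough way to see the accounting problem in this sequence is the sketch below. It assumes a single ClusterQueue with one unit of quota and collapses the cache and API server views into one map; `clusterQueue`, `admitChecked`, `admitUnchecked`, and `runScenario2` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// clusterQueue is a hypothetical stand-in that only tracks quota usage.
type clusterQueue struct {
	quota    int
	admitted map[string]int // workload name -> units of quota held
}

func (q *clusterQueue) used() int {
	total := 0
	for _, units := range q.admitted {
		total += units
	}
	return total
}

// admitChecked models the scheduler path: admit only if the workload fits.
func (q *clusterQueue) admitChecked(name string, units int) bool {
	if q.used()+units > q.quota {
		return false
	}
	q.admitted[name] = units
	return true
}

// admitUnchecked models the late admission write: no quota check at all.
func (q *clusterQueue) admitUnchecked(name string, units int) {
	q.admitted[name] = units
}

func (q *clusterQueue) evict(name string) {
	delete(q.admitted, name)
}

// runScenario2 replays the sequence and returns the final usage and quota.
func runScenario2() (used, quota int) {
	q := &clusterQueue{quota: 1, admitted: map[string]int{}}

	q.admitChecked("workload1", 1)   // Cycle 1: Workload1 is admitted optimistically
	q.evict("workload1")             // Cycles 2-3 and the job controller: Workload1 is evicted
	q.admitChecked("workload2", 1)   // Cycle 4: Workload2 fits and is admitted
	q.admitUnchecked("workload1", 1) // late goroutine: Workload1 is admitted anyway

	return q.used(), q.quota
}

func main() {
	used, quota := runScenario2()
	fmt.Printf("used %d of quota %d\n", used, quota)
}
```

Both workloads end up holding quota at the same time, even though each checked admission on its own respected the limit.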

I will admit that I did not trace the second scenario very closely; I only reproduced it. Of the two, it seems to be the less severe one.

Scenario 1 was observed in production, and the impact is a complete lack of admission for some ClusterQueues in the cluster.

What you expected to happen

How to reproduce it (as minimally and precisely as possible)

I wrote a very hacky reproduction by editing the Kueue code to force the race condition from Scenario 1. It's structured as an E2E test on my local fork (link).

I achieved the race condition by forcing a sequence via the use of channels.
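For readers unfamiliar with the technique, this is roughly what forcing an interleaving with channels looks like. It is an illustrative sketch, not the reproduction from the fork: `store`, `conditions`, `forceScenario1Order`, and the channel names are hypothetical, and the store does whole-object overwrites with no conflict detection.

```go
package main

import (
	"fmt"
	"sync"
)

// conditions is a hypothetical, simplified view of a workload's status.
type conditions struct {
	quotaReserved bool
	evicted       bool
}

// store is a hypothetical object store with whole-object overwrites.
type store struct {
	mu  sync.Mutex
	obj conditions
}

func (s *store) get() conditions {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

func (s *store) put(c conditions) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.obj = c
}

// forceScenario1Order uses channels to pin down the order:
// snapshot taken -> eviction written -> stale admission written.
func forceScenario1Order() conditions {
	s := &store{}
	snapshotTaken := make(chan struct{})
	evictionWritten := make(chan struct{})
	admissionWritten := make(chan struct{})

	// Admission goroutine: reads early, writes late.
	go func() {
		defer close(admissionWritten)
		stale := s.get()
		stale.quotaReserved = true
		close(snapshotTaken)
		<-evictionWritten // test hook: hold the write until the eviction has landed
		s.put(stale)
	}()

	// Preempting cycle: runs only after the snapshot was taken.
	<-snapshotTaken
	current := s.get()
	current.evicted = true
	s.put(current)
	close(evictionWritten)

	<-admissionWritten
	return s.get()
}

func main() {
	fmt.Printf("%+v\n", forceScenario1Order())
}
```

Because each step blocks on the channel closed by the previous one, the outcome is the same on every run: the final object has the reservation set and the Evicted condition missing.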

It also reproduces the flood of:

Preemption already issued, waiting for observation

logs, which is consistent with the production investigation.

Anything else we need to know?

This was found during an investigation of a production issue.

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
