What happened
When the scheduler admits a workload, it optimistically sets QuotaReserved=True for that workload in the cache, while the API server update is handled asynchronously in a goroutine. This introduces nondeterminism: the admission might reach the API server after the next scheduling cycle has already started, or much later, after several cycles have already treated the workload as admitted.
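To make the window concrete, here is a minimal sketch of the pattern described above. It is not Kueue's code: `store`, `admit`, and the `release` channel are hypothetical names, and the channel only stands in for the arbitrary delay before the asynchronous API write runs.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical, mutex-guarded record of which workloads have
// QuotaReserved=True. One instance models the scheduler cache, another
// models the state persisted in the API server.
type store struct {
	mu            sync.Mutex
	quotaReserved map[string]bool
}

func newStore() *store { return &store{quotaReserved: map[string]bool{}} }

func (s *store) reserve(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.quotaReserved[name] = true
}

func (s *store) reserved(name string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.quotaReserved[name]
}

// admit updates the cache synchronously and defers the API write to a
// goroutine. The write only happens once release is closed, which models
// an arbitrary scheduling delay. The returned channel is closed when the
// API write has completed.
func admit(cache, api *store, name string, release <-chan struct{}) <-chan struct{} {
	cache.reserve(name) // optimistic: visible to the next cycle immediately
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-release
		api.reserve(name)
	}()
	return done
}

func main() {
	cache, api := newStore(), newStore()
	release := make(chan struct{})
	done := admit(cache, api, "workload1", release)

	// Any scheduling cycle that runs here sees the workload as admitted,
	// although the API server has not been told yet.
	fmt.Println("cache:", cache.reserved("workload1"), "api:", api.reserved("workload1"))

	close(release)
	<-done
	fmt.Println("cache:", cache.reserved("workload1"), "api:", api.reserved("workload1"))
}
```

Between the two prints, the cache and the API server disagree. Both scenarios below happen inside that window.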
Scenario 1 - Infinitely Stuck Preemption
Consider the following sequence of events:
- Scheduling Cycle 1: Workload1 is admitted - it is optimistically added to the cache. The goroutine to update it in the API server is spawned.
- Scheduling Cycle 2: Workload2 preempts Workload1. It sends a synchronous update to the API server to add the Evicted condition to Workload1 and adds Workload1 to preemptionExpectation.
- Workload1 Admission Goroutine: The goroutine sends a request to the API server to admit the workload. However, the update is based on the state of Workload1 as of Scheduling Cycle 1, so it overwrites the Evicted condition.
- Scheduling Cycle 3: Workload2 is evaluated again. It picks Workload1 as its preemption target. Since Workload1 no longer has the Evicted condition, it won't hit the branch that would observe the eviction in preemptionExpectations. Instead, because Workload1 was never observed in preemptionExpectations, it falls into a fast exit. Workload2 is not admitted.
- Scheduling Cycle 4: Exactly the same as the third cycle.
- Scheduling Cycle N: This continues until Workload1 is deleted or finishes.
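The lost Evicted condition can be replayed sequentially in a small sketch. It assumes the admission write behaves like a whole-object, last-writer-wins update. That assumption is made purely for illustration; the real write path in Kueue may differ. `Workload`, `fakeAPI`, and `replayScenario1` are hypothetical names.

```go
package main

import "fmt"

// Workload is a simplified stand-in: Conditions maps a condition type to
// whether it is set to True.
type Workload struct {
	Name       string
	Conditions map[string]bool
}

// deepCopy returns a copy that shares no state with the original.
func (w Workload) deepCopy() Workload {
	c := make(map[string]bool, len(w.Conditions))
	for k, v := range w.Conditions {
		c[k] = v
	}
	return Workload{Name: w.Name, Conditions: c}
}

// fakeAPI models the API server as a last-writer-wins object store.
// ASSUMPTION: no conflict detection is modelled here.
type fakeAPI struct{ stored map[string]Workload }

func (a *fakeAPI) update(w Workload)        { a.stored[w.Name] = w.deepCopy() }
func (a *fakeAPI) get(name string) Workload { return a.stored[name].deepCopy() }

// replayScenario1 replays the first three steps of Scenario 1 in order and
// reports whether the Evicted condition is still present afterwards.
func replayScenario1() bool {
	api := &fakeAPI{stored: map[string]Workload{}}
	api.update(Workload{Name: "workload1", Conditions: map[string]bool{}})

	// Cycle 1: the admission goroutine captures the workload as it is now.
	snapshot := api.get("workload1")
	snapshot.Conditions["QuotaReserved"] = true

	// Cycle 2: the preemption writes the Evicted condition synchronously.
	current := api.get("workload1")
	current.Conditions["Evicted"] = true
	api.update(current)

	// Admission goroutine: finally writes its stale snapshot.
	api.update(snapshot)

	return api.get("workload1").Conditions["Evicted"]
}

func main() {
	fmt.Println("Evicted condition survived:", replayScenario1())
	// prints "Evicted condition survived: false"
}
```

From Cycle 3 onward, the scheduler sees a workload that looks admitted and not evicted, which matches the stuck state described in the list.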
Scenario 2 - Quota Oversubscription
If step 3 in the previous scenario (the Workload1 admission goroutine) happens after Scheduling Cycle 3, the Evicted condition will not be overwritten. Instead:
- Scheduling Cycle 1: Workload1 is admitted - it is optimistically added to the cache. The goroutine to update it in the API server is spawned.
- Scheduling Cycle 2: Workload2 preempts Workload1. It sends a synchronous update to the API server to add the Evicted condition to Workload1 and adds Workload1 to preemptionExpectation.
- Scheduling Cycle 3: Sees that Workload1 is Evicted and observes it.
- Job Controller: The controller for Workload1's job evicts it.
- Scheduling Cycle 4: Workload2 is admitted.
- Workload1 Admission Goroutine: The goroutine admits the workload without any quota checks, causing oversubscription.
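The arithmetic behind the oversubscription can be sketched as follows. The quota numbers (a nominal quota of 10 and two workloads requesting 10 each) are invented for illustration, and `clusterQueue`, `admitChecked`, and `admitUnchecked` are hypothetical names, not Kueue APIs.

```go
package main

import "fmt"

// clusterQueue is a hypothetical, heavily simplified quota holder.
type clusterQueue struct {
	nominal  int
	admitted map[string]int // workload name -> quota it holds
}

// used sums the quota held by all admitted workloads.
func (q *clusterQueue) used() int {
	total := 0
	for _, v := range q.admitted {
		total += v
	}
	return total
}

// admitChecked models a scheduling cycle: it admits only if quota allows.
func (q *clusterQueue) admitChecked(name string, request int) bool {
	if q.used()+request > q.nominal {
		return false
	}
	q.admitted[name] = request
	return true
}

// admitUnchecked models the late admission goroutine: it applies a decision
// made in an earlier cycle without re-validating quota.
func (q *clusterQueue) admitUnchecked(name string, request int) {
	q.admitted[name] = request
}

// replayScenario2 replays the end of Scenario 2 and returns used and
// nominal quota.
func replayScenario2() (used, nominal int) {
	q := &clusterQueue{nominal: 10, admitted: map[string]int{}}

	// Cycles 1-3 and the job controller: Workload1 passed its quota check in
	// Cycle 1, but by now it has been evicted and holds no quota.

	// Cycle 4: Workload2 is admitted against the freed quota.
	q.admitChecked("workload2", 10)

	// Workload1 admission goroutine: runs late and skips the quota check.
	q.admitUnchecked("workload1", 10)

	return q.used(), q.nominal
}

func main() {
	used, nominal := replayScenario2()
	fmt.Printf("used=%d nominal=%d oversubscribed=%t\n", used, nominal, used > nominal)
	// prints "used=20 nominal=10 oversubscribed=true"
}
```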
I will admit that I did not trace the second scenario very closely; I only reproduced it. Of the two, it seems to be the less severe one.
Scenario 1 was observed in production, where the impact was a complete lack of admission for some ClusterQueues in the cluster.
What you expected to happen
How to reproduce it (as minimally and precisely as possible)
I wrote a very hacky reproduction by editing the Kueue code to force the race condition from Scenario 1. It's structured as an E2E test on my local fork (link).
I achieved the race condition by forcing a sequence via the use of channels.
It also reproduces the flood of:
Preemption already issued, waiting for observation
logs, which is consistent with the production investigation.
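For readers unfamiliar with the technique, forcing a sequence with channels looks roughly like the sketch below. It is not the code from the fork: the channel names, the event strings, and `forceOrder` are hypothetical, and the real reproduction hooks into the Kueue scheduler instead of recording strings.

```go
package main

import "fmt"

// forceOrder runs a stand-in "admission goroutine" concurrently with a
// stand-in "scheduler" and uses two channels to pin the interleaving from
// Scenario 1. It returns the events in the order they happened.
func forceOrder() []string {
	var events []string
	evicted := make(chan struct{})  // closed once the Evicted write is done
	admitted := make(chan struct{}) // closed once the stale admit is done

	go func() {
		// Hold the admission write back until the eviction has landed.
		<-evicted
		events = append(events, "admission goroutine: stale admit written")
		close(admitted)
	}()

	events = append(events, "cycle 1: workload1 assumed admitted")
	events = append(events, "cycle 2: Evicted condition written")
	close(evicted)

	// Do not start the next cycle until the stale write has landed.
	<-admitted
	events = append(events, "cycle 3: workload2 evaluated again")
	return events
}

func main() {
	for _, e := range forceOrder() {
		fmt.Println(e)
	}
}
```

Closing a channel happens before a receive on it returns, so the appends to `events` are ordered without an extra lock and the interleaving is the same on every run.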
Anything else we need to know?
This was found during an investigation of a production issue.