Race condition in scheduler causing stuck preemptions and quota oversubscription #11480

### What happened
When the scheduler admits a workload, it optimistically sets `QuotaReserved=True` for that workload in the cache, while the API server update is handled asynchronously via a [goroutine](https://github.com/kubernetes-sigs/kueue/blob/6a7157c339312e8271711e8a35ea861a25274724/pkg/scheduler/scheduler.go#L803). This introduces nondeterminism: the API server update might land after the next scheduling cycle has already started, or much later, after several cycles have already treated the workload as admitted.
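
To illustrate the pattern described above, here is a minimal, self-contained sketch. It is not Kueue code: the `cache`, `apiServer`, and `admit` names are hypothetical stand-ins, and the `release` channel exists only so the example can control when the asynchronous write lands.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical stand-in for the scheduler's in-memory view.
type cache struct {
	mu            sync.Mutex
	quotaReserved map[string]bool
}

func (c *cache) reserve(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.quotaReserved[name] = true
}

func (c *cache) isReserved(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.quotaReserved[name]
}

// apiServer is a hypothetical stand-in for the persisted state.
type apiServer struct {
	mu       sync.Mutex
	admitted map[string]bool
}

func (a *apiServer) setAdmitted(name string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.admitted[name] = true
}

func (a *apiServer) isAdmitted(name string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.admitted[name]
}

// admit updates the cache synchronously and defers the API write to a
// goroutine. The write waits on release so the caller decides when it lands.
func admit(c *cache, api *apiServer, name string, release <-chan struct{}) <-chan struct{} {
	c.reserve(name)
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-release
		api.setAdmitted(name)
	}()
	return done
}

// cyclesBeforeWrite counts how many of n simulated scheduling cycles see the
// workload as reserved in the cache while the API server has not recorded it.
func cyclesBeforeWrite(n int) int {
	c := &cache{quotaReserved: map[string]bool{}}
	api := &apiServer{admitted: map[string]bool{}}
	release := make(chan struct{})
	done := admit(c, api, "workload1", release)

	diverged := 0
	for i := 0; i < n; i++ {
		if c.isReserved("workload1") && !api.isAdmitted("workload1") {
			diverged++
		}
	}
	close(release) // only now is the asynchronous write allowed to proceed
	<-done
	return diverged
}

func main() {
	fmt.Println(cyclesBeforeWrite(3))
}
```

In this sketch every simulated cycle runs against a cache that is ahead of the persisted state, which is the window both scenarios below depend on.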

#### Scenario 1 - Infinitely Stuck Preemption
Consider the following sequence of events:
1. **Scheduling Cycle 1**: Workload1 is admitted - it is optimistically added to the cache. The goroutine to update it in the API server is spawned.
2. **Scheduling Cycle 2**: Workload2 preempts Workload1. The scheduler sends a synchronous update to the API server to [add the `Evicted` condition](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/workload/workload.go#L1765) to Workload1 and [adds Workload1 to `preemptionExpectations`](https://github.com/kubernetes-sigs/kueue/blob/6a7157c339312e8271711e8a35ea861a25274724/pkg/scheduler/preemption/preemption.go#L221).
3. **Workload1 Admission Goroutine**: The goroutine [sends a request](https://github.com/kubernetes-sigs/kueue/blob/6a7157c339312e8271711e8a35ea861a25274724/pkg/scheduler/scheduler.go#L805) to the API server to admit the workload. But the update is based on the [state of Workload1 in **Scheduling Cycle 1**](https://github.com/kubernetes-sigs/kueue/blob/6a7157c339312e8271711e8a35ea861a25274724/pkg/scheduler/scheduler.go#L802). **This overwrites the `Evicted` condition**.
4. **Scheduling Cycle 3**: Workload2 is evaluated again. It picks Workload1 as its preemption target. Since Workload1 no longer has the `Evicted` condition, [it won't hit the branch](https://github.com/kubernetes-sigs/kueue/blob/6a7157c339312e8271711e8a35ea861a25274724/pkg/scheduler/preemption/preemption.go#L204-L209) that would observe the eviction in `preemptionExpectations`. Instead, because Workload1 was never observed in `preemptionExpectations`, it will [fall into a fast exit](https://github.com/kubernetes-sigs/kueue/blob/6a7157c339312e8271711e8a35ea861a25274724/pkg/scheduler/preemption/preemption.go#L210-L216). Workload2 is not admitted.
5. **Scheduling Cycle 4**: Identical to the third cycle.
6. **Scheduling Cycle N**: This continues until Workload1 is deleted or finishes.
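
The sequence above can be modeled with a small simulation. This is a simplified sketch, not the Kueue implementation: `workload`, `simulation`, `tryPreempt`, and `applyStaleAdmission` are hypothetical names, and conditions are reduced to a `map[string]bool`. The key assumption is that the delayed admission write replaces the stored conditions with ones derived from the Cycle 1 snapshot.

```go
package main

import "fmt"

// workload holds a simplified set of status conditions.
type workload struct {
	conditions map[string]bool
}

// snapshot copies the conditions as they were when admission was decided.
func snapshot(w *workload) map[string]bool {
	out := make(map[string]bool, len(w.conditions))
	for k, v := range w.conditions {
		out[k] = v
	}
	return out
}

// simulation pairs the stored (API server) state with the set of issued but
// not yet observed preemptions.
type simulation struct {
	stored  *workload
	pending map[string]bool
}

// issuePreemption marks the target Evicted and records the expectation.
func (s *simulation) issuePreemption(target string) {
	s.stored.conditions["Evicted"] = true
	s.pending[target] = true
}

// applyStaleAdmission writes conditions built from an old snapshot, replacing
// whatever is stored, including a newer Evicted condition.
func (s *simulation) applyStaleAdmission(snap map[string]bool) {
	snap["QuotaReserved"] = true
	s.stored.conditions = snap
}

// tryPreempt reports what one scheduling cycle does with the target.
func (s *simulation) tryPreempt(target string) string {
	if s.stored.conditions["Evicted"] {
		delete(s.pending, target) // eviction observed, expectation cleared
		return "observed"
	}
	if s.pending[target] {
		return "waiting" // fast exit: preemption already issued
	}
	s.issuePreemption(target)
	return "issued"
}

// runScenario1 replays Cycle 2, the stale write, and then the given number
// of follow-up cycles.
func runScenario1(cycles int) []string {
	s := &simulation{
		stored:  &workload{conditions: map[string]bool{}},
		pending: map[string]bool{},
	}
	snap := snapshot(s.stored)                     // Cycle 1 state
	results := []string{s.tryPreempt("workload1")} // Cycle 2
	s.applyStaleAdmission(snap)                    // delayed goroutine
	for i := 0; i < cycles; i++ {
		results = append(results, s.tryPreempt("workload1"))
	}
	return results
}

func main() {
	fmt.Println(runScenario1(3))
}
```

In this model every cycle after the stale write returns `waiting`, because the expectation can only be cleared by seeing `Evicted`, and that condition is gone.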

#### Scenario 2 - Quota Oversubscription
If step 3 in the previous scenario happens **after Scheduling Cycle 3**, the `Evicted` condition will not be overwritten. Instead:
1. **Scheduling Cycle 1**: Workload1 is admitted - it is optimistically added to the cache. The goroutine to update it in the API server is spawned.
2. **Scheduling Cycle 2**: Workload2 preempts Workload1. The scheduler sends a synchronous update to the API server to [add the `Evicted` condition](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/workload/workload.go#L1765) to Workload1 and [adds Workload1 to `preemptionExpectations`](https://github.com/kubernetes-sigs/kueue/blob/6a7157c339312e8271711e8a35ea861a25274724/pkg/scheduler/preemption/preemption.go#L221).
3. **Scheduling Cycle 3**: Sees that Workload1 is `Evicted` and [observes it](https://github.com/kubernetes-sigs/kueue/blob/6a7157c339312e8271711e8a35ea861a25274724/pkg/scheduler/preemption/preemption.go#L204-L209).
4. **Job Controller**: The controller for Workload1's job evicts it.
5. **Scheduling Cycle 4**: Workload2 is admitted.
6. **Workload1 Admission Goroutine**: The goroutine admits the workload without any quota checks, causing oversubscription.
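
A rough model of why the late write oversubscribes quota is shown below. It is a sketch under simplifying assumptions, not Kueue code: `clusterQueue`, `admitChecked`, and `admitUnchecked` are hypothetical names, each workload is assumed to consume exactly one unit of quota, and the queue capacity is one unit.

```go
package main

import "fmt"

// clusterQueue tracks admitted workloads against a fixed capacity.
// Each workload is assumed to use one unit of quota.
type clusterQueue struct {
	capacity int
	admitted map[string]bool
}

func (q *clusterQueue) usage() int { return len(q.admitted) }

// admitChecked models a scheduling cycle: admit only if quota allows.
func (q *clusterQueue) admitChecked(name string) bool {
	if q.usage() >= q.capacity {
		return false
	}
	q.admitted[name] = true
	return true
}

// admitUnchecked models the delayed admission write, which does not
// re-evaluate quota before marking the workload admitted.
func (q *clusterQueue) admitUnchecked(name string) {
	q.admitted[name] = true
}

func (q *clusterQueue) evict(name string) { delete(q.admitted, name) }

// runScenario2 replays the six steps and returns final usage and capacity.
func runScenario2() (int, int) {
	q := &clusterQueue{capacity: 1, admitted: map[string]bool{}}
	q.admitChecked("workload1")   // Cycle 1: optimistic admission
	q.evict("workload1")          // Cycles 2-3 and job controller: eviction
	q.admitChecked("workload2")   // Cycle 4: quota is free, Workload2 admitted
	q.admitUnchecked("workload1") // delayed goroutine lands
	return q.usage(), q.capacity
}

func main() {
	usage, capacity := runScenario2()
	fmt.Printf("usage=%d capacity=%d\n", usage, capacity)
}
```

Under these assumptions the queue ends with two units in use against a capacity of one.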

---

I will admit that I did not trace the second scenario very closely; I only reproduced it. Of the two, it seems to be the less severe one.

Scenario 1 was observed in production, and the impact is a **complete lack of admission** for some ClusterQueues in the cluster.

### What you expected to happen

### How to reproduce it (as minimally and precisely as possible)
I wrote a very hacky reproduction by editing the Kueue code to force the race condition from Scenario 1. It's structured as an E2E test on my local fork ([link](https://github.com/kubernetes-sigs/kueue/commit/6db3082613caa25111f5d72a625855bd85b22f05)).

I achieved the race condition by forcing a sequence via the use of channels.

It also reproduces the flood of:
```
Preemption already issued, waiting for observation
```
logs, which is consistent with the production investigation.
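
For readers unfamiliar with forcing an interleaving through channels, the general technique looks roughly like the sketch below. It is a generic illustration and does not reproduce the linked commit; the channel names and event strings are made up for the example.

```go
package main

import "fmt"

// forcedOrder uses two channels to pin down the interleaving between a
// background "admission" goroutine and the main "scheduler" flow.
func forcedOrder() []string {
	var events []string
	evicted := make(chan struct{})  // closed once the eviction is written
	admitted := make(chan struct{}) // closed once the stale admission lands

	go func() {
		<-evicted // hold the admission write until after the eviction
		events = append(events, "stale admission written")
		close(admitted)
	}()

	events = append(events, "eviction written")
	close(evicted)
	<-admitted // do not start the next cycle until the stale write landed
	events = append(events, "next scheduling cycle")
	return events
}

func main() {
	for _, e := range forcedOrder() {
		fmt.Println(e)
	}
}
```

The channel operations provide the happens-before ordering, so the shared slice is never accessed concurrently and the order of events is deterministic.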

### Anything else we need to know?
This was found during an investigation of a production issue.
