RedisReplication reconciler: needUpdate() returns true on no-op (server-side defaults), causing perpetual Update() loop

**What version of redis operator are you using?**

```
redis-operator version: 0.24.0
chart: redis-operator (latest at 2026-05)
```

Affects `main` per code inspection — the relevant code in `internal/controller/common/statefulset/statefulset_reconcile.go` and `internal/controller/common/service/service_reconcile.go` is unchanged since the controllers were refactored to use `common/reconciler.Reconcile`.

**Does this issue reproduce with the latest release?**

Yes — observed on `v0.24.0`, code path identical on `main`.

**What operating system and processor architecture are you using (`kubectl version`)?**

```
Server Version: v1.34.x (OKE managed)
Client Version: v1.34.x
arm64 cluster nodes
```

**What did you do?**

Deployed a single `RedisReplication` CR (`clusterSize: 3`, `redisExporter.enabled: true`, `resolveHostnames: "no"`). Left it idle and observed the operator's behavior with `MAX_CONCURRENT_RECONCILES=1` (default).

**What did you expect to see?**

After initial reconciliation:
- The controller's `Reconcile()` should occasionally re-check state but **return `(ctrl.Result{}, nil)` (success)** when the world is in the desired state.
- No `Updating resource` / `Updated resource successfully` log lines on the steady-state Service + StatefulSet because nothing should be changing.
- `controller_runtime_reconcile_total{controller="redisreplication",result="success"}` should advance proportional to event-triggered reconciles.

**What did you see instead?**

The controller is in a perpetual reconcile loop with **two distinct bugs** working together:

### Bug 1: `result="success"` never reached

`/metrics` from a steady-state pod (450 reconciles since startup, ~3 hours uptime):

```
controller_runtime_reconcile_total{controller="redisreplication",result="error"} 0
controller_runtime_reconcile_total{controller="redisreplication",result="requeue"} 0
controller_runtime_reconcile_total{controller="redisreplication",result="requeue_after"} 450
controller_runtime_reconcile_total{controller="redisreplication",result="success"} 0
controller_runtime_reconcile_total{controller="redissentinel",result="success"} 0
```

Every single reconcile returns `RequeueAfter` (~36s cadence in the log timestamps). Zero successes ever. This is unusual for an idle, well-formed CR.

### Bug 2: `needUpdate` returns true on a no-op reconcile, causing unconditional `Update()` calls

`internal/controller/common/statefulset/statefulset_reconcile.go:56-66`:

```go
func needUpdate(expected, existed *appsv1.StatefulSet) bool {
    return !maps.IsSubset(expected.Labels, existed.Labels) ||
        !maps.IsSubset(expected.Annotations, existed.Annotations) ||
        !equality.Semantic.DeepEqual(expected.Spec, existed.Spec)
}

func update(expected, existed *appsv1.StatefulSet) {
    existed.Labels = maps.Merge(existed.Labels, expected.Labels)
    existed.Annotations = maps.Merge(existed.Annotations, expected.Annotations)
    existed.Spec = expected.Spec   // full replace; no merge of server-side defaults
}
```

(Identical shape in `service_reconcile.go:33-43`.)

The comparison uses `equality.Semantic.DeepEqual(expected.Spec, existed.Spec)` — but `expected.Spec` is the operator's freshly-generated spec, which **omits server-side defaults** that the apiserver injected into `existed.Spec`. Examples on `appsv1.StatefulSet`:

- `Spec.PodManagementPolicy` → apiserver defaults to `OrderedReady`
- `Spec.RevisionHistoryLimit` → defaults to `10`
- `Spec.UpdateStrategy.Type` → defaults to `RollingUpdate`
- `Spec.Template.Spec.RestartPolicy` → defaults to `Always`
- `Spec.Template.Spec.DNSPolicy` → defaults to `ClusterFirst`
- `Spec.Template.Spec.SchedulerName` → defaults to `default-scheduler`
- `Spec.Template.Spec.TerminationGracePeriodSeconds` → defaults to `30`
- `Spec.Template.Spec.Containers[].TerminationMessagePath/Policy` → defaults

On `corev1.Service`:

- `Spec.IPFamilies` / `Spec.IPFamilyPolicy`
- `Spec.SessionAffinity` → `None`
- `Spec.Ports[].Protocol` → `TCP`
- `Spec.Ports[].TargetPort` → defaults to `.Port`

Whichever of these the operator's `generateStatefulSetDef()` / `generateServiceDef()` doesn't set, the apiserver injects on Create. On every subsequent reconcile, the operator constructs `expected` without those fields → `Semantic.DeepEqual` returns false → `needUpdate` returns true → `update()` overwrites `existed.Spec = expected.Spec` (stripping the defaults) → `Update()` is sent to apiserver → apiserver re-injects defaults → the next reconcile sees the mismatch again.

The apiserver recognizes the spec is semantically unchanged (when its own defaults are merged in) and returns the same `resourceVersion`, so this loop is **invisible at the K8s audit-log level beyond Update verb counts**. But it's not free:

- Operator log spam (`Updating resource` / `Updated resource successfully` every 36s per managed object).
- Apiserver round-trip per Update (auth, admission, diff, response) — wasted budget against priority-and-fairness.
- The reconcile time histogram for `redisreplication` shows **p99 ≈ 6.0 seconds** in our production cluster, which feels excessive for an idle CR and is plausibly amplified by the wasted I/O.

### Repro evidence

`kubectl logs deploy/redis-operator` snapshots 36 seconds apart, **same resourceVersion** for both Service and STS, with `Updating resource` between them:

```
2026-05-14T05:41:53Z Updating resource (Service plane/plane-redis-s-hl)
2026-05-14T05:41:53Z Updated resource successfully resourceVersion=37383744
2026-05-14T05:41:53Z Updating resource (StatefulSet plane/plane-redis-s)
2026-05-14T05:41:53Z Updated resource successfully resourceVersion=37384117
2026-05-14T05:42:29Z Updating resource (Service plane/plane-redis-s-hl)  ← 36s later
2026-05-14T05:42:29Z Updated resource successfully resourceVersion=37383744  ← same RV
2026-05-14T05:42:29Z Updating resource (StatefulSet plane/plane-redis-s)
2026-05-14T05:42:29Z Updated resource successfully resourceVersion=37384117  ← same RV
```

`diff` of the Service + STS captured 45 seconds apart shows **zero byte differences** — confirming neither `metadata.managedFields[].time` nor `metadata.resourceVersion` is advancing despite the Update verb calls.

### Suggested fix

Two options:

1. **Compare merged state**: in `needUpdate`, merge server-side defaults onto `expected.Spec` before `DeepEqual`. This is roughly what `controllerutil.CreateOrUpdate` and `sigs.k8s.io/controller-runtime/pkg/patch` helpers do.

2. **Compare only operator-managed fields**: instead of `DeepEqual(expected.Spec, existed.Spec)`, compare a curated subset (image, replicas, resource requests, env, …) that the operator actually owns. Don't compare the parts that the apiserver fills in.

`internal/k8sutils/services.go:144-178` (the older `patchService`) uses `banzaicloud-style` `patch.DefaultPatchMaker.Calculate(...)` with explicit `patch.IgnoreStatusFields()` etc. and is round-trip-stable. The newer `common/service/service_reconcile.go` route is what's buggy — they appear to be two competing implementations.

(For context: I'm investigating this from the operator-consumer side, in [cmpx575/devops#1207](https://github.com/cmpx575/devops/issues/1207). The leader-election deadline misses we saw aren't caused by this loop directly, but the loop is a separate productivity bug worth fixing.)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RedisReplication reconciler: needUpdate() returns true on no-op (server-side defaults), causing perpetual Update() loop #1773

Bug 1: `result="success"` never reached

Bug 2: `needUpdate` returns true on a no-op reconcile, causing unconditional `Update()` calls

Repro evidence

Suggested fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

RedisReplication reconciler: needUpdate() returns true on no-op (server-side defaults), causing perpetual Update() loop #1773

Description

Bug 1: result="success" never reached

Bug 2: needUpdate returns true on a no-op reconcile, causing unconditional Update() calls

Repro evidence

Suggested fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Bug 1: `result="success"` never reached

Bug 2: `needUpdate` returns true on a no-op reconcile, causing unconditional `Update()` calls