What version of redis operator are you using?
redis-operator version: 0.24.0
chart: redis-operator (latest at 2026-05)
Affects main per code inspection — the relevant code in internal/controller/common/statefulset/statefulset_reconcile.go and internal/controller/common/service/service_reconcile.go is unchanged since the controllers were refactored to use common/reconciler.Reconcile.
Does this issue reproduce with the latest release?
Yes — observed on v0.24.0, code path identical on main.
What operating system and processor architecture are you using (kubectl version)?
Server Version: v1.34.x (OKE managed)
Client Version: v1.34.x
arm64 cluster nodes
What did you do?
Deployed a single RedisReplication CR (clusterSize: 3, redisExporter.enabled: true, resolveHostnames: "no"). Left it idle and observed the operator's behavior with MAX_CONCURRENT_RECONCILES=1 (default).
What did you expect to see?
After initial reconciliation:
- The controller's
Reconcile() should occasionally re-check state but return (ctrl.Result{}, nil) (success) when the world is in the desired state.
- No
Updating resource / Updated resource successfully log lines on the steady-state Service + StatefulSet because nothing should be changing.
controller_runtime_reconcile_total{controller="redisreplication",result="success"} should advance proportional to event-triggered reconciles.
What did you see instead?
The controller is in a perpetual reconcile loop with two distinct bugs working together:
Bug 1: result="success" never reached
/metrics from a steady-state pod (450 reconciles since startup, ~3 hours uptime):
controller_runtime_reconcile_total{controller="redisreplication",result="error"} 0
controller_runtime_reconcile_total{controller="redisreplication",result="requeue"} 0
controller_runtime_reconcile_total{controller="redisreplication",result="requeue_after"} 450
controller_runtime_reconcile_total{controller="redisreplication",result="success"} 0
controller_runtime_reconcile_total{controller="redissentinel",result="success"} 0
Every single reconcile returns RequeueAfter (~36s cadence in the log timestamps). Zero successes ever. This is unusual for an idle, well-formed CR.
Bug 2: needUpdate returns true on a no-op reconcile, causing unconditional Update() calls
internal/controller/common/statefulset/statefulset_reconcile.go:56-66:
func needUpdate(expected, existed *appsv1.StatefulSet) bool {
return !maps.IsSubset(expected.Labels, existed.Labels) ||
!maps.IsSubset(expected.Annotations, existed.Annotations) ||
!equality.Semantic.DeepEqual(expected.Spec, existed.Spec)
}
func update(expected, existed *appsv1.StatefulSet) {
existed.Labels = maps.Merge(existed.Labels, expected.Labels)
existed.Annotations = maps.Merge(existed.Annotations, expected.Annotations)
existed.Spec = expected.Spec // full replace; no merge of server-side defaults
}
(Identical shape in service_reconcile.go:33-43.)
The comparison uses equality.Semantic.DeepEqual(expected.Spec, existed.Spec) — but expected.Spec is the operator's freshly-generated spec, which omits server-side defaults that the apiserver injected into existed.Spec. Examples on appsv1.StatefulSet:
Spec.PodManagementPolicy → apiserver defaults to OrderedReady
Spec.RevisionHistoryLimit → defaults to 10
Spec.UpdateStrategy.Type → defaults to RollingUpdate
Spec.Template.Spec.RestartPolicy → defaults to Always
Spec.Template.Spec.DNSPolicy → defaults to ClusterFirst
Spec.Template.Spec.SchedulerName → defaults to default-scheduler
Spec.Template.Spec.TerminationGracePeriodSeconds → defaults to 30
Spec.Template.Spec.Containers[].TerminationMessagePath/Policy → defaults
On corev1.Service:
Spec.IPFamilies / Spec.IPFamilyPolicy
Spec.SessionAffinity → None
Spec.Ports[].Protocol → TCP
Spec.Ports[].TargetPort → defaults to .Port
Whichever of these the operator's generateStatefulSetDef() / generateServiceDef() doesn't set, the apiserver injects on Create. On every subsequent reconcile, the operator constructs expected without those fields → Semantic.DeepEqual returns false → needUpdate returns true → update() overwrites existed.Spec = expected.Spec (stripping the defaults) → Update() is sent to apiserver → apiserver re-injects defaults → the next reconcile sees the mismatch again.
The apiserver recognizes the spec is semantically unchanged (when its own defaults are merged in) and returns the same resourceVersion, so this loop is invisible at the K8s audit-log level beyond Update verb counts. But it's not free:
- Operator log spam (
Updating resource / Updated resource successfully every 36s per managed object).
- Apiserver round-trip per Update (auth, admission, diff, response) — wasted budget against priority-and-fairness.
- The reconcile time histogram for
redisreplication shows p99 ≈ 6.0 seconds in our production cluster, which feels excessive for an idle CR and is plausibly amplified by the wasted I/O.
Repro evidence
kubectl logs deploy/redis-operator snapshots 36 seconds apart, same resourceVersion for both Service and STS, with Updating resource between them:
2026-05-14T05:41:53Z Updating resource (Service plane/plane-redis-s-hl)
2026-05-14T05:41:53Z Updated resource successfully resourceVersion=37383744
2026-05-14T05:41:53Z Updating resource (StatefulSet plane/plane-redis-s)
2026-05-14T05:41:53Z Updated resource successfully resourceVersion=37384117
2026-05-14T05:42:29Z Updating resource (Service plane/plane-redis-s-hl) ← 36s later
2026-05-14T05:42:29Z Updated resource successfully resourceVersion=37383744 ← same RV
2026-05-14T05:42:29Z Updating resource (StatefulSet plane/plane-redis-s)
2026-05-14T05:42:29Z Updated resource successfully resourceVersion=37384117 ← same RV
diff of the Service + STS captured 45 seconds apart shows zero byte differences — confirming neither metadata.managedFields[].time nor metadata.resourceVersion is advancing despite the Update verb calls.
Suggested fix
Two options:
-
Compare merged state: in needUpdate, merge server-side defaults onto expected.Spec before DeepEqual. This is roughly what controllerutil.CreateOrUpdate and sigs.k8s.io/controller-runtime/pkg/patch helpers do.
-
Compare only operator-managed fields: instead of DeepEqual(expected.Spec, existed.Spec), compare a curated subset (image, replicas, resource requests, env, …) that the operator actually owns. Don't compare the parts that the apiserver fills in.
internal/k8sutils/services.go:144-178 (the older patchService) uses banzaicloud-style patch.DefaultPatchMaker.Calculate(...) with explicit patch.IgnoreStatusFields() etc. and is round-trip-stable. The newer common/service/service_reconcile.go route is what's buggy — they appear to be two competing implementations.
(For context: I'm investigating this from the operator-consumer side, in cmpx575/devops#1207. The leader-election deadline misses we saw aren't caused by this loop directly, but the loop is a separate productivity bug worth fixing.)
What version of redis operator are you using?
Affects
mainper code inspection — the relevant code ininternal/controller/common/statefulset/statefulset_reconcile.goandinternal/controller/common/service/service_reconcile.gois unchanged since the controllers were refactored to usecommon/reconciler.Reconcile.Does this issue reproduce with the latest release?
Yes — observed on
v0.24.0, code path identical onmain.What operating system and processor architecture are you using (
kubectl version)?What did you do?
Deployed a single
RedisReplicationCR (clusterSize: 3,redisExporter.enabled: true,resolveHostnames: "no"). Left it idle and observed the operator's behavior withMAX_CONCURRENT_RECONCILES=1(default).What did you expect to see?
After initial reconciliation:
Reconcile()should occasionally re-check state but return(ctrl.Result{}, nil)(success) when the world is in the desired state.Updating resource/Updated resource successfullylog lines on the steady-state Service + StatefulSet because nothing should be changing.controller_runtime_reconcile_total{controller="redisreplication",result="success"}should advance proportional to event-triggered reconciles.What did you see instead?
The controller is in a perpetual reconcile loop with two distinct bugs working together:
Bug 1:
result="success"never reached/metricsfrom a steady-state pod (450 reconciles since startup, ~3 hours uptime):Every single reconcile returns
RequeueAfter(~36s cadence in the log timestamps). Zero successes ever. This is unusual for an idle, well-formed CR.Bug 2:
needUpdatereturns true on a no-op reconcile, causing unconditionalUpdate()callsinternal/controller/common/statefulset/statefulset_reconcile.go:56-66:(Identical shape in
service_reconcile.go:33-43.)The comparison uses
equality.Semantic.DeepEqual(expected.Spec, existed.Spec)— butexpected.Specis the operator's freshly-generated spec, which omits server-side defaults that the apiserver injected intoexisted.Spec. Examples onappsv1.StatefulSet:Spec.PodManagementPolicy→ apiserver defaults toOrderedReadySpec.RevisionHistoryLimit→ defaults to10Spec.UpdateStrategy.Type→ defaults toRollingUpdateSpec.Template.Spec.RestartPolicy→ defaults toAlwaysSpec.Template.Spec.DNSPolicy→ defaults toClusterFirstSpec.Template.Spec.SchedulerName→ defaults todefault-schedulerSpec.Template.Spec.TerminationGracePeriodSeconds→ defaults to30Spec.Template.Spec.Containers[].TerminationMessagePath/Policy→ defaultsOn
corev1.Service:Spec.IPFamilies/Spec.IPFamilyPolicySpec.SessionAffinity→NoneSpec.Ports[].Protocol→TCPSpec.Ports[].TargetPort→ defaults to.PortWhichever of these the operator's
generateStatefulSetDef()/generateServiceDef()doesn't set, the apiserver injects on Create. On every subsequent reconcile, the operator constructsexpectedwithout those fields →Semantic.DeepEqualreturns false →needUpdatereturns true →update()overwritesexisted.Spec = expected.Spec(stripping the defaults) →Update()is sent to apiserver → apiserver re-injects defaults → the next reconcile sees the mismatch again.The apiserver recognizes the spec is semantically unchanged (when its own defaults are merged in) and returns the same
resourceVersion, so this loop is invisible at the K8s audit-log level beyond Update verb counts. But it's not free:Updating resource/Updated resource successfullyevery 36s per managed object).redisreplicationshows p99 ≈ 6.0 seconds in our production cluster, which feels excessive for an idle CR and is plausibly amplified by the wasted I/O.Repro evidence
kubectl logs deploy/redis-operatorsnapshots 36 seconds apart, same resourceVersion for both Service and STS, withUpdating resourcebetween them:diffof the Service + STS captured 45 seconds apart shows zero byte differences — confirming neithermetadata.managedFields[].timenormetadata.resourceVersionis advancing despite the Update verb calls.Suggested fix
Two options:
Compare merged state: in
needUpdate, merge server-side defaults ontoexpected.SpecbeforeDeepEqual. This is roughly whatcontrollerutil.CreateOrUpdateandsigs.k8s.io/controller-runtime/pkg/patchhelpers do.Compare only operator-managed fields: instead of
DeepEqual(expected.Spec, existed.Spec), compare a curated subset (image, replicas, resource requests, env, …) that the operator actually owns. Don't compare the parts that the apiserver fills in.internal/k8sutils/services.go:144-178(the olderpatchService) usesbanzaicloud-stylepatch.DefaultPatchMaker.Calculate(...)with explicitpatch.IgnoreStatusFields()etc. and is round-trip-stable. The newercommon/service/service_reconcile.goroute is what's buggy — they appear to be two competing implementations.(For context: I'm investigating this from the operator-consumer side, in cmpx575/devops#1207. The leader-election deadline misses we saw aren't caused by this loop directly, but the loop is a separate productivity bug worth fixing.)