etcd-operator

A Kubernetes operator for running etcd clusters. Status: early alpha. The API is etcd-operator.cozystack.io/v1alpha2 and will likely change.

What it does

The operator manages etcd clusters via two custom resources:

EtcdCluster — what the user creates. Captures cluster-wide intent: replica count, etcd version, per-member storage size, a progress deadline.
EtcdMember — what the operator creates. One per etcd member. Owns its Pod and PVC. Operator-managed; users should not edit these directly.

There is no StatefulSet. Each member's Pod and PVC are reconciled independently so the operator can model protocol-aware lifecycle (learner-mode joins, member-id assignment, graceful removal, scale-to-zero pause/resume) without fighting StatefulSet's "all replicas are one workload" assumption.

The full design rationale is in docs/concepts.md.

What's supported today

Bootstrap of new clusters. Single seed first, learner-mode adds afterwards.
Scale up / down: cluster controller adds members one at a time as learners and promotes them; scale-down picks the most-recently-created member, runs MemberRemove via a finalizer, then GCs the Pod and PVC.
Scale to zero (pause/resume): spec.replicas: 0 parks the surviving member via spec.dormant=true; the Pod is deleted, the PVC stays owned by the EtcdMember. Scaling back up to ≥ 1 flips spec.dormant=false on the same member; etcd resumes from the existing data dir with the same cluster ID and member ID.
Pod restart / node failure: data PVC is preserved, the new Pod reads the existing WAL and rejoins with the same member ID.
Memory-backed storage (opt-in): spec.storage.medium: Memory switches each member's data dir to a tmpfs emptyDir whose lifetime is bound to the Pod. Members that lose their Pod (eviction, node failure) lose their data; the operator detects this, removes the member from etcd, and replaces it via the existing scale-up path. Suits scenarios where the etcd state is reconstructable and replication absorbs single-member losses. For production, set spec.affinity and spec.resources.limits.memory explicitly — neither is defaulted (#16); see docs/concepts.md.
Apiserver-enforced validation: CEL rules on the CRD (k8s 1.29+) reject replicas: 0 with storage.medium: Memory, storage.size: 0 with storage.medium: Memory, storage.medium changes after creation, and storage.size shrinks. No webhook / cert-manager dependency.
PodDisruptionBudget: per-cluster PDB selects voting members only (role=voter); maxUnavailable = (voters-1)/2 so kubectl drain cannot voluntarily push the cluster below quorum.
TLS (BYO Secrets or cert-manager): spec.tls.client / spec.tls.peer enable TLS on each surface independently. Material comes from either user-provided Secrets (serverSecretRef / operatorClientSecretRef / secretRef) or operator-emitted cert-manager.io/v1 Certificates (certManager.{serverIssuerRef,operatorClientIssuerRef,issuerRef}) — mutually exclusive per subtree, enforced by CEL. mTLS is the implicit mode when an operator-client source is supplied; server-TLS-only when it isn't. The whole tls subtree is CEL-locked immutable post-create. cert-manager-emitted certs auto-renew via cert-manager; Pod-side rotation is a manual one-at-a-time kubectl delete pod either way. See docs/concepts.md.
Resource sizing: spec.resources (a corev1.ResourceRequirements) sets the etcd container's CPU/memory requests and limits. When unset, the container gets a conservative default request of 100m CPU / 128Mi memory. Updates take effect on newly-created members only; pair with a VerticalPodAutoscaler targeting the cluster for live recommendation/rollout.
Scheduling & extra metadata: spec.affinity and spec.topologySpreadConstraints pass through to every member Pod (anti-affinity is not defaulted — set it for production); spec.additionalMetadata merges user labels/annotations onto every object the operator creates (member Pods, data PVCs, Services, PDB, EtcdMember CRs), with operator-owned keys winning on collision. All three apply on object creation and are latched like the rest of the spec. See docs/concepts.md.
Monitoring / autoscaling hooks: every member Pod always exposes a plaintext metrics container port at 2381 (etcd's /health + Prometheus /metrics) for VMPodScrape / PodMonitor. The EtcdCluster CRD exposes the /scale subresource with a populated status.selector, making it a valid target for kubectl scale and VerticalPodAutoscaler.targetRef.
Locking pattern: status.observed snapshots the in-flight target so mid-flight spec edits don't corrupt consensus; progressDeadline bounds how long the operator will spend trying to reach a target.
Cluster deletion: cascading owner refs clean up everything; finalizers detect "the whole cluster is going away" and skip etcd-side removal to avoid deadlock.
Snapshots & restore: EtcdSnapshot captures a one-shot snapshot of a cluster to S3 (or a PVC) via a Job running the operator image as a snapshot agent; status.artifact records the stored object's URI, size, and checksum. A new cluster restores from a snapshot at first bootstrap via spec.bootstrap.restore.source (the seed Pod runs a restore initContainer before etcd starts). TLS and spec.auth settings are honored automatically. There are no scheduled snapshots (EtcdSnapshotSchedule is intentionally out of scope); drive recurring snapshots with a CronJob/kubectl apply from outside. See docs/concepts.md and the restore runbook.
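
As a rough illustration of the PodDisruptionBudget arithmetic above, the sketch below evaluates maxUnavailable = (voters-1)/2 with integer division and compares it with the etcd quorum size. The function names maxUnavailable and quorum are hypothetical and are not taken from the operator's source; the guard for fewer than one voter is likewise an assumption.

```go
package main

import "fmt"

// maxUnavailable evaluates the formula quoted in the README,
// (voters-1)/2, using integer division. Hypothetical helper name.
func maxUnavailable(voters int) int {
	if voters < 1 {
		return 0 // assumption: nothing may be disrupted when there are no voters
	}
	return (voters - 1) / 2
}

// quorum is the number of voting members etcd needs to commit writes.
func quorum(voters int) int {
	return voters/2 + 1
}

func main() {
	for _, v := range []int{1, 2, 3, 4, 5} {
		remaining := v - maxUnavailable(v)
		fmt.Printf("voters=%d quorum=%d maxUnavailable=%d remaining=%d\n",
			v, quorum(v), maxUnavailable(v), remaining)
	}
}
```

With 3 voters the formula allows 1 voluntary disruption and with 5 voters it allows 2; in every case the remaining voters still meet quorum, which is what keeps kubectl drain from pushing the cluster below it.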

What's not supported (yet)

No multi-user / per-tenant RBAC inside etcd: single-user root auth is available via spec.auth.enabled (BYO credentials Secret; see docs/concepts.md), but every authenticated client is root.
No in-place version upgrades (changing spec.version only affects newly-created members).
No PVC resizing; see #2.
No automatic broken-member replacement for PVC-backed clusters (memory-backed members do auto-replace on Pod loss; status.brokenMembers reads 0 in practice; see docs/concepts.md).
No scheduled snapshot CRD; one-shot snapshots and restore-on-bootstrap are supported (see above).
No defragmentation scheduling.
PodAntiAffinity is supported via spec.affinity but not applied by default (defaulting tracked in #16).

See the issue tracker for the running follow-up list.

Quick start

# 1. Install CRDs and the operator. Builds an image and pushes it to your
#    registry; substitute IMG= for a prebuilt tag if you have one. The cluster
#    must be able to pull from <your-registry> — for local clusters (kind /
#    minikube / k3d) sideload the image or use an ephemeral registry such as
#    ttl.sh, otherwise the operator Deployment will sit in ImagePullBackOff.
make install
make docker-build docker-push deploy IMG=<your-registry>/etcd-operator:<tag>

# 2. Create a cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: etcd-operator.cozystack.io/v1alpha2
kind: EtcdCluster
metadata:
  name: my-etcd
  namespace: default
spec:
  replicas: 3
  version: 3.6.11
  storage:
    size: 1Gi
EOF

# 3. Wait for ready and inspect.
kubectl get etcdcluster.etcd-operator.cozystack.io my-etcd -w
POD=$(kubectl get pod -l etcd-operator.cozystack.io/cluster=my-etcd \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$POD" -- etcdctl --endpoints=http://localhost:2379 \
  member list -w table

Member names are apiserver-assigned (GenerateName="<cluster>-") — don't hard-code them; use the cluster label selector.

For step-by-step setup, RBAC, image versions, and teardown see docs/installation.md.

Documentation

Installation — deploy the operator, create your first cluster, networking pitfalls, upgrades.
Concepts — design rationale: locking pattern, single-seed bootstrap, GenerateName naming, scale-to-zero mechanics, conditions reference.
Operations — runbook for day-2: scaling, pausing/resuming, decoding conditions, escalating stuck reconciles, broken-member recovery.
Migration — moving onto this operator from the legacy aenix operator; tracks behavioural changes that need an explicit migration step — currently the BYO root-credentials requirement when enabling auth.

Testing

go test ./controllers/...

The suite uses controller-runtime's fake client and a fake etcd client; no envtest assets needed at the unit level. Pinned behaviours:

Bootstrap — single-seed creation, idempotent recovery, GenerateName-assigned names.
Locking pattern — status.observed / progressDeadline lock the in-flight target; bootstrap-deadline is terminal.
Scale up — learner-mode add, readiness gate before the next step, crash-recovery branches between Create / MemberAddAsLearner / Patch(initialCluster).
Scale down — CreationTimestamp DESC (name DESC tiebreak) victim selection, finalizer-driven MemberRemove.
Scale to zero — 1→0 Patches spec.dormant=true; 0→1 flips it back; dormant member's Pod is gone but its PVC is preserved.
Discovery — seed found via spec.bootstrap=true; etcd client endpoints filtered to voters (MemberReady=True) so MemberList doesn't route to a learner.
Status no-churn — steady-state reconciles don't repeatedly mutate status.

License

Apache 2.0. See LICENSE.
