diff --git a/design-proposals/agl-tofu-backends/README.md b/design-proposals/agl-tofu-backends/README.md new file mode 100644 index 0000000..6d2db32 --- /dev/null +++ b/design-proposals/agl-tofu-backends/README.md @@ -0,0 +1,168 @@ +# Terraform/OpenTofu backend for Cozystack AGL + +- **Title:** Terraform/OpenTofu backend for Cozystack Application Generation Layer +- **Author(s):** @kitsunoff +- **Date:** 2026-05-20 +- **Status:** Draft + +## Overview + +Cozystack's Application Generation Layer (AGL) today maps a user-facing Kubernetes kind (e.g. `Postgres`, `Kafka`, `Bucket`) to a Flux `HelmRelease` under the hood. Helm is the only supported backend. + +This proposal adds a second backend: `Terraform` CRs of [flux-iac/tofu-controller](https://github.com/flux-iac/tofu-controller), so platform engineers can describe cloud primitives (VPCs, DNS zones, managed services, IAM bindings, external buckets) under the same AGL abstraction. + +Two architectural alternatives are documented here for review: + +- **[Draft 1 — Parallel Tofu Stack](./draft-1-parallel-tofu-stack.md)** — a new `TofuApplicationDefinition` CRD with its own apiserver and reconciler, mirroring the existing Helm AGL one-to-one. Minimal risk to existing packages, but ~80% code duplication. +- **[Draft 2 — Pluggable Backend](./draft-2-pluggable-backend.md)** — refactor AGL so Helm and Terraform are two implementations of a single `Backend` interface, behind one `ApplicationDefinition` CRD. Cleaner long-term, but a larger refactor of the hot path. + +A short slide deck summarising both drafts side-by-side is included as [`presentation.md`](./presentation.md) (Marp-flavoured). + +## Scope and related proposals + +This proposal targets [cozystack/cozystack](https://github.com/cozystack/cozystack). No related proposals at this time. + +The two drafts in this directory are **alternatives**, not sibling proposals. 
The intent of the PR is to choose one of them (possibly via the hybrid path in the recommendation below) before implementation starts. + +## Context + +The AGL today consists of: + +- A cluster-scoped CRD `ApplicationDefinition` (`apps.cozystack.io/v1alpha1`) describing the mapping from a user-facing kind to a `HelmRelease`. +- A custom aggregation API server (`cozystack-api`) that reads `ApplicationDefinition`s and registers dynamic kinds. +- A REST layer that translates `Application.Spec` (opaque `RawExtension`) into `HelmRelease.Spec.Values` on create/update, and projects `HelmRelease.Status` back into `Application.Status`. +- A reconciler (`ApplicationDefinitionHelmReconciler`) that keeps existing `HelmRelease`s aligned with the definition when the definition changes, and a config-hash mechanism that restarts the aggregation apiserver when the set of registered kinds changes. + +Relevant files in `cozystack/cozystack`: + +- `api/v1alpha1/applicationdefinitions_types.go` +- `pkg/registry/apps/application/rest.go` +- `pkg/apiserver/apiserver.go` +- `internal/controller/applicationdefinition_helmreconciler.go` +- `packages/system/cozystack-api/` + +### The problem + +Some platform primitives don't fit Helm naturally: + +- Cloud VPCs, subnets, route tables, IAM bindings. +- Managed DNS zones. +- Managed databases on hyperscalers. +- External object storage buckets, queues, secrets in cloud KMS. + +These are naturally Terraform/OpenTofu territory. Today, a tenant who wants a `VPC` resource alongside their `Postgres` resource has no way to express it through AGL — Cozystack would need a separate, non-AGL surface for cloud-side primitives, which defeats the whole point of having a generation layer. + +`flux-iac/tofu-controller` already provides a Flux-native `Terraform` CRD with the same lifecycle model as `HelmRelease` (source refs, drift detection, reconcile interval, status conditions). 
The shape of the integration is therefore well-defined: AGL should be able to emit `Terraform` CRs the same way it emits `HelmRelease`s. + +## Goals + +- Allow package authors to declare a user-facing kind (e.g. `VPC`, `S3Bucket`, `DNSZone`) that AGL translates into a `Terraform` CR. +- Reuse the existing dashboard/category/openAPISchema/secret-include machinery without modification. +- Stay backwards-compatible: existing Helm-backed packages must keep working with no manifest changes. + +## Non-goals + +- Mixing Helm and Terraform under a single application definition (one backend per definition; composite resources are out of scope). +- Building a "Terraform module marketplace" — package authors still ship their own modules. +- Replacing Flux. Both backends still reconcile through Flux primitives (helm-controller, tofu-controller). +- A UI for the Terraform plan/apply approval flow — tofu-controller already exposes it. + +## Design + +Two alternative designs are proposed. Detailed designs (CRD shapes, Go types, API server changes, REST translation, status projection, reconciler, packaging, implementation plan, risks, open questions) live in the per-draft documents: + +- **[Draft 1 — Parallel Tofu Stack](./draft-1-parallel-tofu-stack.md)** +- **[Draft 2 — Pluggable Backend](./draft-2-pluggable-backend.md)** + +### Comparison + +| Dimension | Draft 1 (parallel) | Draft 2 (decoupled) | +| -------------------------- | ---------------------- | ----------------------- | +| Time to first PoC | days | weeks | +| Regression risk | minimal | medium | +| Code duplication | high | none | +| Cost of 3rd backend | another full copy | one interface impl | +| User-facing UX | two AppDef kinds | one AppDef, switch type | +| Review burden | low | high, needs slicing | +| Long-term maintenance cost | high | low | + +### Recommendation + +A **hybrid path**: + +1. **First:** ship Draft 1 as a feature-branch PoC. Prove the tofu-controller mapping. 
Surface real requirements for vars marshalling, cloud-creds runner pods, output secret handling, plan-approval UX. +2. **Then:** with two working backends in hand, do the Draft 2 refactor. The `Backend` interface is then designed against concrete code, not speculation. + +This trades a small amount of throwaway code for much lower risk of a bad abstraction. + +## User-facing changes + +- A new backend type for `ApplicationDefinition` (shape depends on chosen draft): package authors can declare Terraform-backed kinds. +- Tenants can `kubectl apply` a `VPC`/`DNSZone`/`S3Bucket` (or any other Terraform-backed kind a package defines) the same way they apply `Postgres` today. +- Dashboard renders Terraform-backed kinds via the existing category/icon/openAPISchema machinery. +- Outputs of a Terraform run are surfaced through the existing `spec.secrets.include` mechanism. + +No change to existing Helm-backed packages or their CRs. + +## Upgrade and rollback compatibility + +- **Draft 1:** purely additive. Existing `ApplicationDefinition` and `HelmRelease` objects are untouched. Rollback = remove the new chart and CRD. +- **Draft 2:** the `release` field stays in `v1alpha1` as a deprecated alias for `backend.helm`. An apiserver-side normalization pass projects `spec.release` onto `spec.backend.helm` in memory; persisted objects are not mutated. Rollback of the apiserver is safe as long as the CRD still accepts both fields. The `v1alpha2` removal of `release` is deferred to a later release with a conversion webhook. + +Both drafts include a feature flag for the rollout phase. + +## Security + +- **New trust boundary:** runner pods executing `terraform plan`/`apply` need cloud credentials. Definitions can pin a `runnerPodTemplate` with a `ServiceAccount` (e.g. IRSA/Workload Identity). +- **New tenant-supplied inputs:** `Application.Spec` fields become Terraform input variables. 
Inputs are validated against `application.openAPISchema` before any backend call, and additionally must match HCL identifier regex `^[a-z_][a-z0-9_]*$` for variable names. +- **New secrets stored or transmitted:** `writeOutputsToSecret` can contain provider-returned values (e.g. access keys). Mitigated by an admission policy that requires explicit output allow-listing, plus the existing `spec.secrets.include` allow-list. +- **New RBAC surface:** the aggregation apiserver gains read/write on `Terraform` CRs in tenant namespaces. Scope follows the existing pattern for `HelmRelease`. + +## Failure and edge cases + +- Invalid input rejected by `openAPISchema` → apiserver returns 422 before any Flux object is created (existing behaviour). +- Tofu-controller not installed → Draft 1: apiserver fails fast on startup; Draft 2: registry skips registering the Terraform backend and logs a clear line. Packages that declare a Terraform backend get a "backend not available" status condition. +- Manual approval pending (`approvePlan: ""`) → surfaced as `Application.Status.PendingApproval=true`. Dashboard renders the plan diff. +- `destroyResourcesOnDeletion: true` on a Terraform CR with broken provider creds → destroy fails, `Terraform` CR remains, `Application` deletion is blocked by finalizer. User intervention required. +- Tenant deletes a Terraform-backed `Application` mid-apply → tofu-controller handles cancellation; finalizer waits for terminal state. + +## Testing + +- Unit tests for the REST translation (Draft 1 and Draft 2): `Application.Spec` ↔ `Terraform.Spec.Vars` mapping, status projection round-trips. +- Unit tests for the backend interface (Draft 2): each backend tested in isolation with a fake client. +- Integration tests with a real tofu-controller against a local provider (e.g. `null_resource`, `random_id`) — no cloud creds needed. 
+- e2e: one Terraform-backed example package (`VPC` against localstack, or `DNSZone` against a mock) end-to-end through the aggregation apiserver. +- Backwards-compat (Draft 2): the existing Helm-backed e2e suite must pass unchanged after the refactor. + +## Rollout + +- **Release N:** ships behind feature flag (`--enable-tofu-backend` for Draft 1, `--enable-pluggable-backends` for Draft 2). Off by default in stable channel. +- **Release N+1:** flag on by default. First Terraform-backed example package shipped. +- **Release N+2:** deprecation of `spec.release` announced (Draft 2 only). +- **Release N+3:** `v1alpha2` of `ApplicationDefinition` removes `spec.release` (Draft 2 only), conversion webhook fills it for v1alpha1 clients. + +## Open questions + +- Which alternative should land? Draft 1, Draft 2, or the hybrid path (Draft 1 first, then refactor to Draft 2)? +- Per-instance backend override (e.g. `Application.Spec.runnerPodOverride`) for multi-tenant cloud identity isolation — needed in v1 or deferred? +- Should the runner pod template be defined per-AppDef, per-instance, or both? +- How should plan diffs be exposed when `approvePlan` is manual — status field + dashboard rendering, or a separate `approve` subresource on `Application`? +- Is `apps.cozystack.io` the right group for Terraform-backed kinds (Draft 1), or should they live under a distinct `tofu.apps.cozystack.io` group? + +## Alternatives considered + +The two drafts in this directory are themselves the alternatives evaluated. Within each draft, alternatives for narrower decisions (group naming, status schema shape, conversion strategy) are discussed in the draft's own "Risks" and "Open questions" sections. + +Out-of-this-proposal alternatives that were dismissed: + +- **Treat Terraform as out of scope for AGL and build a parallel "infra" CRD set.** Rejected: defeats the purpose of having a generation layer, fragments the dashboard and tenant experience. 
+- **Use ArgoCD `Application` with a Terraform plugin.** Rejected: introduces a second reconciliation engine alongside Flux, doesn't reuse the existing AGL machinery. +- **Generate raw `Job`s that shell out to `tofu`.** Rejected: re-implements drift detection, state management, and lifecycle that tofu-controller already provides. + +--- + +## Reading these documents + +Rendered, browsable version (with navigation and the Marp slide deck as HTML): . + +Source repository: . diff --git a/design-proposals/agl-tofu-backends/draft-1-parallel-tofu-stack.md b/design-proposals/agl-tofu-backends/draft-1-parallel-tofu-stack.md new file mode 100644 index 0000000..b9ed533 --- /dev/null +++ b/design-proposals/agl-tofu-backends/draft-1-parallel-tofu-stack.md @@ -0,0 +1,273 @@ +# Draft 1: Parallel Tofu Stack for Cozystack AGL + +- **Status:** Draft +- **Author:** @kitsunoff +- **Date:** 2026-05-20 +- **Target project:** [cozystack/cozystack](https://github.com/cozystack/cozystack) + +> Companion document to [`README.md`](./README.md) (overview and comparison) and [`draft-2-pluggable-backend.md`](./draft-2-pluggable-backend.md) (alternative design). + +## Summary + +Add a second, parallel application generation stack to Cozystack that maps user-facing Kubernetes resources to `Terraform` CRs of [flux-iac/tofu-controller](https://github.com/flux-iac/tofu-controller), mirroring the existing Helm-based AGL one-to-one. + +No changes to the existing Helm path: the new stack lives next to it as an isolated set of types, API server bindings and reconcilers. + +## Motivation + +Cozystack's AGL is the right abstraction for "user creates a high-level resource, platform creates the actual workload underneath", but today the only supported backend is `HelmRelease`. Cloud-side primitives (VPCs, managed databases, DNS zones, IAM bindings) do not fit Helm naturally and are typically managed with Terraform/OpenTofu. 
+ +`tofu-controller` already provides a Flux-native `Terraform` CRD with the same lifecycle model as `HelmRelease` (source refs, drift detection, reconcile interval, status). The smallest viable change is to repeat the AGL pattern for it. + +## Goals + +- Allow package authors to declare a user-facing kind (e.g. `VPC`, `S3Bucket`, `DNSZone`) that the API server transparently translates into a `Terraform` CR. +- Reuse the existing dashboard/category/openAPISchema machinery without modification. +- Ship as an additive feature: zero risk of regression for existing Helm-backed packages. + +## Non-goals + +- Refactoring the existing Helm AGL or introducing a backend abstraction (see Draft 2). +- Supporting a third backend (Argo, Kustomization, plain manifests). +- Mixing Helm and Terraform under a single application definition. +- Building a UI for the Terraform plan/apply approval flow (tofu-controller already exposes it). + +## Design + +### New CRD: `TofuApplicationDefinition` + +Cluster-scoped, lives in `apps.cozystack.io/v1alpha1` alongside `ApplicationDefinition`. 
+ +```yaml +apiVersion: apps.cozystack.io/v1alpha1 +kind: TofuApplicationDefinition +metadata: + name: vpc +spec: + application: + kind: VPC + singular: vpc + plural: vpcs + openAPISchema: | + { + "type": "object", + "properties": { + "cidr": { "type": "string" }, + "region": { "type": "string" } + } + } + terraform: + sourceRef: + kind: OCIRepository # GitRepository | OCIRepository | Bucket + name: cozystack-vpc-module + namespace: cozy-system + path: ./ + prefix: vpc- + labels: + sharding.fluxcd.io/key: tenants + interval: 5m + approvePlan: auto # "auto" | "" (manual) + destroyResourcesOnDeletion: true + writeOutputsToSecret: + name: "{{ .name }}-outputs" + runnerPodTemplate: # cloud creds, IRSA, custom image + spec: + serviceAccountName: tofu-runner + secrets: + include: + - resourceNames: ["{{ .name }}-outputs"] + dashboard: + singular: VPC + category: Infrastructure + icon: +``` + +### Go types + +New file `api/v1alpha1/tofuapplicationdefinitions_types.go`: + +```go +type TofuApplicationDefinition struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + Spec TofuApplicationDefinitionSpec `json:"spec,omitempty"` +} + +type TofuApplicationDefinitionSpec struct { + Application ApplicationDefinitionApplication `json:"application"` + Terraform TofuApplicationDefinitionTerraform `json:"terraform"` + Secrets *ApplicationDefinitionResources `json:"secrets,omitempty"` + Services *ApplicationDefinitionResources `json:"services,omitempty"` + Ingresses *ApplicationDefinitionResources `json:"ingresses,omitempty"` + Dashboard *ApplicationDefinitionDashboard `json:"dashboard,omitempty"` +} + +type TofuApplicationDefinitionTerraform struct { + SourceRef tfv1alpha2.CrossNamespaceSourceReference `json:"sourceRef"` + Path string `json:"path,omitempty"` + Prefix string `json:"prefix,omitempty"` + Labels map[string]string `json:"labels,omitempty"` + Interval metav1.Duration `json:"interval,omitempty"` + ApprovePlan string 
`json:"approvePlan,omitempty"` + DestroyResourcesOnDeletion bool `json:"destroyResourcesOnDeletion,omitempty"` + WriteOutputsToSecret *tfv1alpha2.WriteOutputsToSecretSpec `json:"writeOutputsToSecret,omitempty"` + RunnerPodTemplate *tfv1alpha2.RunnerPodTemplate `json:"runnerPodTemplate,omitempty"` +} +``` + +The `ApplicationDefinitionApplication`, `ApplicationDefinitionResources` and `ApplicationDefinitionDashboard` types are reused as-is from the existing `applicationdefinitions_types.go` — these parts are backend-agnostic. + +### API server changes + +- New binary `cmd/cozystack-tofu-api/main.go` (or a feature-flagged subcommand) — boots an aggregation API server identical in shape to `cozystack-api`. +- Group: `tofu.apps.cozystack.io/v1alpha1` (separate to avoid kind collisions with Helm-side AGL). +- Reads `TofuApplicationDefinition` list on startup, registers dynamic kinds against the same internal `Application` type used by the Helm path (the type is generic enough — only `Spec` is `runtime.RawExtension`). +- Registers the tofu-controller scheme (`infra.contrib.fluxcd.io/v1alpha2`) instead of the helm-controller scheme. 
+ +### REST translation + +A near-copy of `pkg/registry/apps/application/rest.go` lives at `pkg/registry/tofuapps/application/rest.go`: + +| Method | Existing (Helm) | New (Tofu) | +| ------- | ---------------------------------- | --------------------------------------- | +| Create | build `HelmRelease`, `c.Create` | build `Terraform`, `c.Create` | +| Get | fetch `HelmRelease`, project | fetch `Terraform`, project | +| List | list `HelmRelease` by labels | list `Terraform` by labels | +| Update | patch `HelmRelease.Spec.Values` | patch `Terraform.Spec.Vars` | +| Delete | delete `HelmRelease` | delete `Terraform` (respects `destroyResourcesOnDeletion`) | + +Field mapping `Application` → `Terraform`: + +```text +Application.Name → Terraform.Name = prefix + Application.Name +Application.Namespace → Terraform.Namespace +Application.Labels → Terraform.Labels (with LabelPrefix) +Application.Spec (RawExtension) → Terraform.Spec.Vars (flattened, top-level keys = vars) +TofuAppDef.Terraform.SourceRef → Terraform.Spec.SourceRef +TofuAppDef.Terraform.Path → Terraform.Spec.Path +TofuAppDef.Terraform.ApprovePlan → Terraform.Spec.ApprovePlan +TofuAppDef.Terraform.WriteOutputs… → Terraform.Spec.WriteOutputsToSecret +TofuAppDef.Terraform.RunnerPod… → Terraform.Spec.RunnerPodTemplate +``` + +`Application.Spec` keys become Terraform input variables. A small validator on the way in: keys must match `^[a-z_][a-z0-9_]*$` (HCL identifier). 
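The variable-name check and the top-level flattening described above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions, not the proposed implementation: `tfVar` and `flattenSpec` are hypothetical names, and real code would presumably populate tofu-controller's own variable type rather than this stand-in.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"sort"
)

// varNameRe is the identifier pattern quoted in the proposal.
var varNameRe = regexp.MustCompile(`^[a-z_][a-z0-9_]*$`)

// tfVar is a hypothetical stand-in for one entry of Terraform.Spec.Vars.
type tfVar struct {
	Name  string
	Value json.RawMessage
}

// flattenSpec decodes the raw Application spec and turns every top-level
// key into one variable. It rejects specs that are not JSON objects and
// keys that do not match varNameRe.
func flattenSpec(raw []byte) ([]tfVar, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(raw, &top); err != nil {
		return nil, fmt.Errorf("spec is not a JSON object: %w", err)
	}

	names := make([]string, 0, len(top))
	for key := range top {
		if !varNameRe.MatchString(key) {
			return nil, fmt.Errorf("invalid variable name %q", key)
		}
		names = append(names, key)
	}
	// Sort so the generated object is stable across reconciles.
	sort.Strings(names)

	vars := make([]tfVar, 0, len(names))
	for _, name := range names {
		vars = append(vars, tfVar{Name: name, Value: top[name]})
	}
	return vars, nil
}

func main() {
	vars, err := flattenSpec([]byte(`{"region":"eu-central-1","cidr":"10.10.0.0/16"}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, v := range vars {
		fmt.Printf("%s = %s\n", v.Name, v.Value)
	}

	// A key with uppercase letters or dashes is rejected before any
	// backend object is built.
	_, err = flattenSpec([]byte(`{"Bad-Key": 1}`))
	fmt.Println("rejected:", err != nil)
}
```

Nested values are kept as raw JSON here; whether they should be passed through as structured values or rejected is a design decision this sketch does not settle.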
+ +### Status projection + +`Terraform.Status` → `Application.Status` (visible to the user): + +```text +Terraform.Status.Conditions[Ready] → Application.Status.Conditions[Ready] +Terraform.Status.LastAppliedRevision → Application.Status.LastAppliedRevision +Terraform.Status.LastPlannedRevision → Application.Status.LastPlannedRevision +Terraform.Status.Plan.Pending → Application.Status.PendingApproval (bool) +Terraform.Status.AvailableOutputs → Application.Status.Outputs ([]string) +``` + +Actual output values land in the `writeOutputsToSecret` secret and are surfaced via the existing `spec.secrets.include` mechanism — no special handling needed. + +### Reconciler + +New controller `internal/controller/tofuapplicationdefinition_controller.go`: + +- Watches `TofuApplicationDefinition`. +- Finds existing `Terraform` CRs by label selector `apps.cozystack.io/application.kind=`. +- Patches `sourceRef`, `path`, `approvePlan`, `runnerPodTemplate` when the definition changes (mirrors `ApplicationDefinitionHelmReconciler`). +- A second controller (or a shared one) updates the config-hash annotation on the `cozystack-tofu-api` deployment to trigger a pod restart, identical to the Helm path. + +### Packaging + +New chart `packages/system/cozystack-tofu-api/` mirroring `packages/system/cozystack-api/`: + +- Deployment of the new apiserver binary. +- `APIService` registration for `tofu.apps.cozystack.io/v1alpha1`. +- RBAC: read `TofuApplicationDefinition`, full access to `Terraform` and the underlying secrets. + +Dependency: requires tofu-controller to be installed. Either: + +- Add it as a dependency of the `cozystack` umbrella chart (preferred), or +- Document it as a prerequisite and fail-fast at apiserver startup if the CRD is missing. + +## Implementation plan + +1. **CRD + types** — `TofuApplicationDefinition` Go types, deepcopy, manifests, validation webhook. No behaviour yet. +2. 
**API server skeleton** — new binary, dynamic type registration, openapi from `application.openAPISchema`. No translation yet. +3. **REST translation (Create/Get/List/Delete)** — copy from Helm path, swap target type. +4. **Status projection** — map Terraform conditions/outputs into Application status. +5. **Reconciler** — keep `Terraform` CRs in sync with definition changes. +6. **Packaging** — Helm chart for the apiserver, RBAC, APIService. +7. **Example package** — one real Terraform-backed package (e.g. `VPC` or `DNSZone`) for an end-to-end test. +8. **Docs** — author-facing guide on writing a `TofuApplicationDefinition`. + +Stages 1–4 are mergeable independently behind a feature flag. + +## Risks + +- **Two apiservers in the aggregation layer** add operational surface (two deployments, two health checks, two restarts on config change). Mitigated by sharing a single binary with two subcommands if useful. +- **Output secrets** can leak provider credentials if `writeOutputsToSecret` is misconfigured. Mitigated by an admission policy that requires explicit output allow-listing. +- **Tofu-controller upstream churn** — `flux-iac/tofu-controller` is a community fork of the archived `weaveworks/tf-controller`. API may move; pin `v1alpha2` and track. +- **No clean migration path to Draft 2** — once two top-level CRDs exist (`ApplicationDefinition`, `TofuApplicationDefinition`), unifying them later requires a deprecation cycle. + +## Open questions + +- Should we share `apps.cozystack.io` group and rely on `kind` uniqueness, or use a distinct `tofu.apps.cozystack.io` group? Distinct group is safer for now. +- Should the runner pod template be defined per-AppDef (current design) or per-instance (`Application.Spec.runnerPodOverride`)? Per-instance gives tenants control over cloud identities — probably needed for multi-tenant clusters. +- How to expose plan diffs when `approvePlan: ""` (manual)? 
Likely via a status field that the dashboard renders, plus a separate `approve` subresource on `Application`. + +## Example: `VPC` package + +```yaml +apiVersion: apps.cozystack.io/v1alpha1 +kind: TofuApplicationDefinition +metadata: + name: vpc +spec: + application: + kind: VPC + singular: vpc + plural: vpcs + openAPISchema: | + { + "type": "object", + "required": ["cidr", "region"], + "properties": { + "cidr": { "type": "string", "pattern": "^[0-9./]+$" }, + "region": { "type": "string" } + } + } + terraform: + sourceRef: + kind: OCIRepository + name: aws-vpc-module + namespace: cozy-system + path: ./modules/vpc + prefix: vpc- + approvePlan: auto + destroyResourcesOnDeletion: true + writeOutputsToSecret: + name: "{{ .name }}-outputs" + runnerPodTemplate: + spec: + serviceAccountName: aws-tofu-runner + secrets: + include: + - resourceNames: ["{{ .name }}-outputs"] + dashboard: + singular: VPC + category: Infrastructure +``` + +End-user request: + +```yaml +apiVersion: tofu.apps.cozystack.io/v1alpha1 +kind: VPC +metadata: + name: prod + namespace: tenant-acme +spec: + cidr: 10.10.0.0/16 + region: eu-central-1 +``` + +Result: tofu-controller reconciles `Terraform/vpc-prod` against the AWS VPC module, outputs (`vpc_id`, `subnet_ids`) land in `Secret/prod-outputs`, surfaced through cozystack's existing secret-include machinery. 
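The naming in this example can be sketched as below. It assumes that `prefix` is simply prepended to the user-facing name and that `{{ .name }}` is expanded with Go's `text/template` against the `Application` name; both are assumptions made for illustration, and `targetName` and `renderSecretName` are hypothetical helpers rather than existing Cozystack functions.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// targetName applies the definition's prefix to the user-facing name.
func targetName(prefix, appName string) string {
	return prefix + appName
}

// renderSecretName expands a name template such as "{{ .name }}-outputs".
// Assumption: ".name" resolves to the user-facing Application name.
func renderSecretName(tmpl, appName string) (string, error) {
	// missingkey=error makes unknown placeholders fail instead of
	// silently rendering "<no value>".
	t, err := template.New("secret").Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, map[string]string{"name": appName}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	fmt.Println(targetName("vpc-", "prod")) // prints "vpc-prod"

	secret, err := renderSecretName("{{ .name }}-outputs", "prod")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(secret) // prints "prod-outputs"
}
```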
diff --git a/design-proposals/agl-tofu-backends/draft-2-pluggable-backend.md b/design-proposals/agl-tofu-backends/draft-2-pluggable-backend.md new file mode 100644 index 0000000..4affaa0 --- /dev/null +++ b/design-proposals/agl-tofu-backends/draft-2-pluggable-backend.md @@ -0,0 +1,405 @@ +# Draft 2: Pluggable Backend for Cozystack AGL + +- **Status:** Draft +- **Author:** @kitsunoff +- **Date:** 2026-05-20 +- **Target project:** [cozystack/cozystack](https://github.com/cozystack/cozystack) + +> Companion document to [`README.md`](./README.md) (overview and comparison) and [`draft-1-parallel-tofu-stack.md`](./draft-1-parallel-tofu-stack.md) (alternative design). + +## Summary + +Refactor Cozystack's Application Generation Layer to support multiple release backends through a single, generic `ApplicationDefinition`. The first two backends are Helm (existing behaviour) and Terraform/OpenTofu via tofu-controller; the design leaves room for ArgoCD `Application`, Flux `Kustomization`, or plain manifests later without further schema changes. + +Existing packages continue to work with no manifest changes, thanks to a defaulted backend type. + +## Motivation + +Today the AGL is structurally a Helm runtime dressed up as a generic abstraction: the CRD field is named `release`, types embed `helmv2.CrossNamespaceSourceReference`, the REST layer constructs `HelmRelease` directly, and a dedicated reconciler patches `HelmRelease` fields. Adding Terraform support by copying the stack (Draft 1) gets us there fast, but every additional backend pays the same duplication cost. + +A pluggable backend turns AGL from "Helm generator" into a real abstraction layer: the user-facing kind, OpenAPI schema, dashboard wiring, secret/service inclusion, and config-hash restart logic are written once; the per-backend code is a small interface implementation. + +## Goals + +- One `ApplicationDefinition` CRD, multiple backend implementations. 
+- Backwards-compatible with the current schema (no breaking change to existing packages). +- Add tofu-controller as the second backend. +- A clear Go interface that a third backend can implement without touching the rest of the codebase. + +## Non-goals + +- Allowing multiple backends inside a single `ApplicationDefinition` (one backend per definition; composite resources are out of scope). +- Building a generic "Terraform module marketplace" — package authors still ship their own modules. +- Replacing Flux. Both backends still reconcile through Flux primitives (helm-controller, tofu-controller). + +## Design + +### Backend abstraction + +A new package `pkg/agl/backend/` defines: + +```go +package backend + +type Type string + +const ( + TypeHelm Type = "Helm" + TypeTerraform Type = "Terraform" +) + +// Backend translates between the user-facing Application and a concrete +// Flux-managed target object (HelmRelease, Terraform, ...). +type Backend interface { + // Type returns the discriminator value, e.g. "Helm" or "Terraform". + Type() Type + + // TargetGVK is the GroupVersionKind of the backing object (HelmRelease, + // Terraform CR, ArgoCD Application, ...). Used by the REST layer to + // list/get/watch. + TargetGVK() schema.GroupVersionKind + + // TargetName computes the name of the backing object from the + // user-facing name and the definition (typically applies a prefix). + TargetName(appName string, def *v1alpha1.ApplicationDefinition) string + + // Build produces the backing object from an Application. + Build( + ctx context.Context, + app *appsv1alpha1.Application, + def *v1alpha1.ApplicationDefinition, + ) (client.Object, error) + + // ProjectStatus translates the backing object's status into the + // generic Application.Status the user sees. 
+ ProjectStatus(target client.Object) (appsv1alpha1.ApplicationStatus, error) + + // Reconcile keeps an existing backing object aligned with the + // definition when the definition changes (mirrors today's + // ApplicationDefinitionHelmReconciler logic). + Reconcile( + ctx context.Context, + c client.Client, + target client.Object, + def *v1alpha1.ApplicationDefinition, + ) (updated bool, err error) +} + +// Registry resolves a definition to its backend implementation. +type Registry interface { + Get(def *v1alpha1.ApplicationDefinition) (Backend, error) + All() []Backend +} +``` + +Two implementations land in the same PR: + +- `pkg/agl/backend/helm/` — extracted from the current `rest.go` and `applicationdefinition_helmreconciler.go`. +- `pkg/agl/backend/terraform/` — new, targets tofu-controller `Terraform` CRD. + +### Updated `ApplicationDefinition` schema + +`apps.cozystack.io/v1alpha1` gains a `backend` field; the existing `release` field becomes an alias. + +```yaml +apiVersion: apps.cozystack.io/v1alpha1 +kind: ApplicationDefinition +metadata: + name: postgres +spec: + application: + kind: Postgres + singular: postgres + plural: postgreses + openAPISchema: | + { ... } + backend: + type: Helm # discriminator: Helm | Terraform + helm: # required when type=Helm + chartRef: + kind: ExternalArtifact + name: cozystack-postgres-chart + namespace: cozy-system + prefix: postgres- + labels: + sharding.fluxcd.io/key: tenants + valuesFrom: + - kind: Secret + name: cozystack-values + secrets: { ... } + services: { ... } + dashboard: { ... } +``` + +```yaml +apiVersion: apps.cozystack.io/v1alpha1 +kind: ApplicationDefinition +metadata: + name: vpc +spec: + application: + kind: VPC + singular: vpc + plural: vpcs + openAPISchema: | + { ... 
} + backend: + type: Terraform + terraform: # required when type=Terraform + sourceRef: + kind: OCIRepository + name: aws-vpc-module + namespace: cozy-system + path: ./ + prefix: vpc- + approvePlan: auto + destroyResourcesOnDeletion: true + writeOutputsToSecret: + name: "{{ .name }}-outputs" + runnerPodTemplate: + spec: + serviceAccountName: aws-tofu-runner + secrets: { ... } + dashboard: { ... } +``` + +### Go types + +```go +type ApplicationDefinitionSpec struct { + Application ApplicationDefinitionApplication `json:"application"` + + // Backend is the new discriminated union. + Backend *Backend `json:"backend,omitempty"` + + // Release is the legacy field. Kept for backwards compatibility. + // If Backend is unset and Release is set, treated as Backend{Type: Helm, Helm: from(Release)}. + // +deprecated + Release *ApplicationDefinitionRelease `json:"release,omitempty"` + + Secrets *ApplicationDefinitionResources `json:"secrets,omitempty"` + Services *ApplicationDefinitionResources `json:"services,omitempty"` + Ingresses *ApplicationDefinitionResources `json:"ingresses,omitempty"` + Dashboard *ApplicationDefinitionDashboard `json:"dashboard,omitempty"` +} + +type Backend struct { + // +kubebuilder:validation:Enum=Helm;Terraform + Type BackendType `json:"type"` + + Helm *HelmBackend `json:"helm,omitempty"` + Terraform *TerraformBackend `json:"terraform,omitempty"` +} + +type HelmBackend struct { + ChartRef *helmv2.CrossNamespaceSourceReference `json:"chartRef"` + Prefix string `json:"prefix,omitempty"` + Labels map[string]string `json:"labels,omitempty"` + ValuesFrom []helmv2.ValuesReference `json:"valuesFrom,omitempty"` +} + +type TerraformBackend struct { + SourceRef tfv1alpha2.CrossNamespaceSourceReference `json:"sourceRef"` + Path string `json:"path,omitempty"` + Prefix string `json:"prefix,omitempty"` + Labels map[string]string `json:"labels,omitempty"` + Interval metav1.Duration `json:"interval,omitempty"` + ApprovePlan string `json:"approvePlan,omitempty"` + 
DestroyResourcesOnDeletion bool `json:"destroyResourcesOnDeletion,omitempty"` + WriteOutputsToSecret *tfv1alpha2.WriteOutputsToSecretSpec `json:"writeOutputsToSecret,omitempty"` + RunnerPodTemplate *tfv1alpha2.RunnerPodTemplate `json:"runnerPodTemplate,omitempty"` +} +``` + +A defaulting webhook (or in-process normalization at apiserver startup) projects `spec.release` onto `spec.backend.helm` when only the legacy field is present, so existing packages keep working. + +### API server changes + +- The aggregation API server scheme registers **both** `helmv2` and `tfv1alpha2` (and any future backend's GVKs). Cheap — schemes are just type bindings. +- Dynamic kind registration is unchanged: every package still produces one user-facing kind backed by the internal `Application` type. +- `pkg/registry/apps/application/rest.go` is the main refactor target. Today it calls helm-controller types directly; after refactoring it calls into a `Backend` interface obtained from the registry: + +```go +func (r *REST) Create(ctx context.Context, obj runtime.Object, ...) (..., error) { + app := obj.(*appsv1alpha1.Application) + def := r.definitions.Get(r.kindName) + b, err := r.backends.Get(def) + if err != nil { return nil, err } + + target, err := b.Build(ctx, app, def) + if err != nil { return nil, err } + + // Common label/annotation injection (extracted once). + injectAGLLabels(target, app, r.kindName) + + if err := r.c.Create(ctx, target); err != nil { return nil, err } + return app, nil +} +``` + +`Get`/`List`/`Update`/`Delete` follow the same shape: the REST layer is backend-agnostic, the backend knows the target object type. + +### Generic status + +`Application.Status` becomes a small generic envelope plus an opaque per-backend extension: + +```go +type ApplicationStatus struct { + // Common, projected by every backend. 
+ Conditions []metav1.Condition `json:"conditions,omitempty"` + Ready bool `json:"ready,omitempty"` + Message string `json:"message,omitempty"` + + // Backend-specific, raw JSON. Schema documented per backend. + Backend *runtime.RawExtension `json:"backend,omitempty"` +} +``` + +- Helm projects `lastAppliedRevision`, `lastAttemptedRevision` into `backend`. +- Terraform projects `lastAppliedRevision`, `lastPlannedRevision`, `availableOutputs`, `pendingApproval` into `backend`. + +This avoids forcing every field of every backend into the top-level schema while keeping the common ready/conditions contract uniform. + +### Generic reconciler + +Today's `ApplicationDefinitionHelmReconciler` is replaced by a single `ApplicationDefinitionReconciler` that: + +1. Watches `ApplicationDefinition`. +2. Resolves backend via the registry. +3. Lists existing target objects by label selector `apps.cozystack.io/application.kind=`. +4. For each target, calls `backend.Reconcile(ctx, c, target, def)`. + +The existing config-hash restart logic for the aggregation apiserver is untouched. + +### Backwards compatibility + +Required behaviour: + +- A `v1alpha1` `ApplicationDefinition` shipped today, with `spec.release` and no `spec.backend`, must continue to work after the upgrade. +- Existing `HelmRelease`s owned by the AGL must not be recreated; they should be reconciled by the new generic reconciler exactly as before. + +Mechanism: + +- The CRD keeps both `release` (deprecated) and `backend` fields in `v1alpha1` for one minor version. +- The API server applies a normalization pass on read: if `backend == nil && release != nil`, synthesize `backend = {type: Helm, helm: from(release)}` in memory. Persisted objects are not mutated. +- A `v1alpha2` conversion webhook follows in a later release: `release` is removed from the schema; conversion fills it from `backend.helm` for clients still on v1alpha1. + +### Packaging + +- No new chart. 
`packages/system/cozystack-api/` gains a dependency on tofu-controller CRDs (soft dependency: the registry skips registering the Terraform backend if `infra.contrib.fluxcd.io/v1alpha2` is not installed, with a clear log line). +- RBAC of `cozystack-apiserver` extended to read/write `Terraform` CRs. + +## Implementation plan + +1. **Extract Helm logic** — move all helm-specific code from `rest.go` and `applicationdefinition_helmreconciler.go` behind the new `Backend` interface. No behaviour change, no schema change. PR is large but mechanical; covered by existing e2e tests. +2. **Schema: add `backend`** — introduce `spec.backend` as optional, defaulted from `spec.release`. CRD validation, conversion logic. +3. **Generic reconciler** — replace the helm-specific reconciler with the registry-driven one. Helm backend wired through the registry. +4. **Terraform backend** — implement `pkg/agl/backend/terraform/`. Includes vars marshalling (`Application.Spec` → `Terraform.Spec.Vars`), status projection, reconcile diffing. +5. **Example package** — one Terraform-backed package end-to-end. +6. **Deprecation notice** — emit warning when `spec.release` is used; plan `v1alpha2` removal. +7. **Docs** — update author guide, add backend authoring guide. + +Stages 1 and 3 are the risky ones (touch the hot path of every existing package). Stages 4–7 are additive. + +## Risks + +- **Refactor blast radius.** `rest.go` is large and touched by every Helm-backed package. Mitigation: stage 1 is mechanical extraction with no behaviour change, validated by the full e2e suite before any new backend lands. +- **Interface fitness.** Designing the `Backend` interface against only two backends risks a leaky abstraction the third backend (Argo, Kustomization) can't honour. Mitigation: sketch a third backend on paper before merging stage 1; treat it as a design constraint. +- **Status schema churn.** Opaque `Backend *runtime.RawExtension` defers the schema problem rather than solving it. 
Consumers (dashboard, kubectl printers) need a documented contract per backend type. Mitigation: ship printer columns and dashboard schemas per backend alongside each implementation. +- **Conversion webhook complexity.** `v1alpha1 → v1alpha2` conversion has to be exact for old objects. Mitigation: keep `v1alpha1` indefinitely if needed; conversion isn't on the critical path for the feature. + +## Migration plan + +- **Release N (this PR series):** ships behind a feature flag `--enable-pluggable-backends`. Defaults to off in stable channel, on in next. Helm path is rewritten to go through the backend interface either way (so the flag only gates the new types and the Terraform backend). +- **Release N+1:** flag flipped to on by default. Terraform backend GA. +- **Release N+2:** `spec.release` marked deprecated in CRD docs; deprecation warning logged at admission time. +- **Release N+3 (`v1alpha2`):** `spec.release` removed; conversion webhook fills it for legacy v1alpha1 clients. + +## Open questions + +- **Per-instance backend override?** Should an `Application` (the user-facing instance) be able to override `runnerPodTemplate` or `chartRef` for tenant isolation? Probably yes for cloud creds in multi-tenant clusters, but adds complexity to the `Build` contract. +- **Shared identity injection.** Today every backend gets the same set of AGL labels/annotations. Should that be the backend's responsibility, or extracted into a wrapper? Currently designed as wrapper (`injectAGLLabels` outside the backend), which keeps backends small. +- **Cross-backend ownership.** Can a Helm-backed package's chart reference a `Terraform` CR managed by another AppDef? Out of scope for this draft, but the design should not preclude it. 
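The wrapper approach to shared identity injection raised in the open questions can be sketched as follows. This is a minimal, self-contained illustration rather than the proposed implementation: `Object` is a stand-in for `client.Object`, `Builder` is a simplified form of `Backend.Build`, and `WithAGLLabels` and the `apps.cozystack.io/application.name` label key are hypothetical names introduced for the example. Only `injectAGLLabels` and the `apps.cozystack.io/application.kind` key come from the proposal text.

```go
package main

import "fmt"

// Object is a stand-in for client.Object, reduced to what the sketch needs (assumption).
type Object struct {
	Name   string
	Labels map[string]string
}

// Builder is a hypothetical, simplified form of Backend.Build.
type Builder func(appName string) (*Object, error)

// injectAGLLabels sets the common AGL labels on a target object.
func injectAGLLabels(target *Object, appName, kind string) {
	if target.Labels == nil {
		target.Labels = map[string]string{}
	}
	target.Labels["apps.cozystack.io/application.kind"] = kind
	// Hypothetical label key, used here only for illustration.
	target.Labels["apps.cozystack.io/application.name"] = appName
}

// WithAGLLabels wraps any backend's Builder so that labels are injected
// outside the backend, keeping each backend implementation small.
func WithAGLLabels(kind string, build Builder) Builder {
	return func(appName string) (*Object, error) {
		target, err := build(appName)
		if err != nil {
			return nil, err
		}
		injectAGLLabels(target, appName, kind)
		return target, nil
	}
}

func main() {
	// A toy Terraform-style builder that only applies the name prefix.
	terraformBuild := func(appName string) (*Object, error) {
		return &Object{Name: "dns-" + appName}, nil
	}

	build := WithAGLLabels("DNSZone", terraformBuild)
	target, err := build("acme-prod")
	if err != nil {
		panic(err)
	}
	fmt.Println(target.Name, target.Labels["apps.cozystack.io/application.kind"])
}
```

With this shape, a backend never sees the AGL label keys, so a change to them touches one function. The cost is that a backend cannot opt out of or extend the label set without a second hook.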
+ +## Example: side-by-side definitions + +```yaml +--- +apiVersion: apps.cozystack.io/v1alpha1 +kind: ApplicationDefinition +metadata: + name: postgres +spec: + application: + kind: Postgres + singular: postgres + plural: postgreses + openAPISchema: | + { ... } + backend: + type: Helm + helm: + chartRef: + kind: ExternalArtifact + name: cozystack-postgres-chart + namespace: cozy-system + prefix: postgres- + valuesFrom: + - kind: Secret + name: cozystack-values + secrets: + include: + - resourceNames: ["postgres-{{ .name }}-credentials"] +--- +apiVersion: apps.cozystack.io/v1alpha1 +kind: ApplicationDefinition +metadata: + name: dns-zone +spec: + application: + kind: DNSZone + singular: dnszone + plural: dnszones + openAPISchema: | + { ... } + backend: + type: Terraform + terraform: + sourceRef: + kind: OCIRepository + name: cloudflare-dns-module + namespace: cozy-system + path: ./ + prefix: dns- + approvePlan: auto + destroyResourcesOnDeletion: true + writeOutputsToSecret: + name: "{{ .name }}-outputs" + runnerPodTemplate: + spec: + serviceAccountName: cloudflare-tofu-runner + secrets: + include: + - resourceNames: ["{{ .name }}-outputs"] +``` + +End users in `tenant-acme`: + +```yaml +--- +apiVersion: apps.cozystack.io/v1alpha1 +kind: Postgres +metadata: { name: app-db, namespace: tenant-acme } +spec: + size: 20Gi + replicas: 3 +--- +apiVersion: apps.cozystack.io/v1alpha1 +kind: DNSZone +metadata: { name: acme-prod, namespace: tenant-acme } +spec: + zone: acme.example.com + ttl: 300 +``` + +One CRD, one apiserver, one reconciler. The backend boundary is the only place that knows whether the result is a `HelmRelease` or a `Terraform` CR. 
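How definitions like the two above resolve to a target kind, including a legacy definition that sets only `spec.release`, can be sketched as below. This is a sketch under stated assumptions: the structs are trimmed stand-ins for the Go types in this draft (the `ChartName` and `SourceName` fields replace the real source references), and `Normalize` and `TargetKind` are hypothetical helper names, not existing Cozystack functions.

```go
package main

import (
	"errors"
	"fmt"
)

type BackendType string

const (
	BackendHelm      BackendType = "Helm"
	BackendTerraform BackendType = "Terraform"
)

// Trimmed stand-ins for the draft's types (assumption: only the fields used here).
type Release struct{ ChartName, Prefix string }
type HelmBackend struct{ ChartName, Prefix string }
type TerraformBackend struct{ SourceName, Prefix string }

type Backend struct {
	Type      BackendType
	Helm      *HelmBackend
	Terraform *TerraformBackend
}

type Spec struct {
	Backend *Backend
	Release *Release // legacy field
}

// Normalize synthesizes spec.backend from spec.release in memory.
// Spec is passed by value, so the caller's object is not mutated.
func Normalize(spec Spec) (Spec, error) {
	switch {
	case spec.Backend != nil:
		return spec, nil
	case spec.Release != nil:
		spec.Backend = &Backend{
			Type: BackendHelm,
			Helm: &HelmBackend{ChartName: spec.Release.ChartName, Prefix: spec.Release.Prefix},
		}
		return spec, nil
	default:
		return spec, errors.New("neither spec.backend nor spec.release is set")
	}
}

// TargetKind reports which Flux object a definition produces.
func TargetKind(spec Spec) (string, error) {
	n, err := Normalize(spec)
	if err != nil {
		return "", err
	}
	switch n.Backend.Type {
	case BackendHelm:
		if n.Backend.Helm == nil {
			return "", errors.New("backend.helm is required when type=Helm")
		}
		return "HelmRelease", nil
	case BackendTerraform:
		if n.Backend.Terraform == nil {
			return "", errors.New("backend.terraform is required when type=Terraform")
		}
		return "Terraform", nil
	default:
		return "", fmt.Errorf("unknown backend type %q", n.Backend.Type)
	}
}

func main() {
	legacy := Spec{Release: &Release{ChartName: "cozystack-postgres-chart", Prefix: "postgres-"}}
	dns := Spec{Backend: &Backend{
		Type:      BackendTerraform,
		Terraform: &TerraformBackend{SourceName: "cloudflare-dns-module", Prefix: "dns-"},
	}}
	for _, s := range []Spec{legacy, dns} {
		kind, err := TargetKind(s)
		fmt.Println(kind, err)
	}
}
```

The sketch also validates that the sub-struct matching `type` is present, which mirrors the "required when type=Terraform" comment in the schema example. Whether that check belongs in CRD validation or in the registry is left open here.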
diff --git a/design-proposals/agl-tofu-backends/presentation.md b/design-proposals/agl-tofu-backends/presentation.md new file mode 100644 index 0000000..d0c0425 --- /dev/null +++ b/design-proposals/agl-tofu-backends/presentation.md @@ -0,0 +1,289 @@ +--- +marp: true +theme: default +paginate: true +size: 16:9 +header: "Cozystack AGL — Terraform Backend" +footer: "Draft proposal · 2026-05-20" +style: | + section { font-size: 26px; } + h1 { color: #1a73e8; } + h2 { color: #1a73e8; } + code { background: #f6f8fa; padding: 2px 5px; border-radius: 3px; } + table { font-size: 22px; } + .small { font-size: 20px; } +--- + +# Terraform-backed resources
in Cozystack AGL + +Two design drafts for mapping API resources
to `Terraform` CRs of tofu-controller + +Maxim Belyy · 2026-05-20 + +--- + +## Context: what is Cozystack AGL? + +**Application Generation Layer** — maps a user-facing Kubernetes kind
(e.g. `Postgres`, `Kafka`, `Bucket`) to a Flux `HelmRelease` under the hood. + +- Cluster-scoped CRD `ApplicationDefinition` describes the mapping. +- Custom aggregation API server registers user-facing kinds dynamically. +- Per-instance: `Application.Spec` → `HelmRelease.Spec.Values`. +- Dashboard, OpenAPI validation, RBAC come for free. + +The user creates a high-level resource. The platform turns it into a real workload. + +--- + +## How it works today + +```text + ┌──────────────────────────────┐ + kubectl ──► │ cozystack-api (aggregation) │ + apply Postgres │ - reads ApplicationDefinition│ + │ - registers dynamic kinds │ + │ - REST: Postgres ↔ HelmRelease│ + └──────────────┬───────────────┘ + │ creates + ▼ + ┌──────────────────┐ + │ HelmRelease │ ◄── Flux helm-controller + │ postgres-mydb │ reconciles to actual + └──────────────────┘ Kubernetes workloads +``` + +One CRD, one apiserver, one reconciler — and **one backend: Helm**. + +--- + +## The problem + +Some primitives don't fit Helm: + +- Cloud VPCs, subnets, IAM +- Managed DNS zones +- Managed databases on hyperscalers +- External buckets, queues, secrets + +These are naturally **Terraform/OpenTofu** territory. + +`flux-iac/tofu-controller` already provides a Flux-native `Terraform` CRD
with the same lifecycle model as `HelmRelease`. + +**Goal:** let AGL emit `Terraform` CRs the same way it emits `HelmRelease`s. + +--- + +## Where the Helm assumption lives + +| File | What's hard-coded | +| --- | --- | +| `api/v1alpha1/applicationdefinitions_types.go` | imports `helmv2`, field `Release.ChartRef` is `helmv2.CrossNamespaceSourceReference` | +| `pkg/registry/apps/application/rest.go` | builds and reads `helmv2.HelmRelease` directly | +| `pkg/apiserver/apiserver.go` | registers only the helm-controller scheme | +| `internal/controller/applicationdefinition_helmreconciler.go` | patches `HelmRelease.Spec.ChartRef` / `ValuesFrom` | + +The abstraction is structurally there. The code is not. + +--- + +## Two ways forward + +**Draft 1** — Parallel stack: copy the AGL for Terraform, leave the Helm path untouched. + +**Draft 2** — Pluggable backend: refactor AGL so Helm and Terraform are
two implementations of one interface. + +Let's look at both. + +--- + +## Draft 1 — Parallel Tofu Stack + +New CRD `TofuApplicationDefinition`, new apiserver, new reconciler — sitting next to the existing Helm AGL. + +```yaml +apiVersion: apps.cozystack.io/v1alpha1 +kind: TofuApplicationDefinition +spec: + application: + kind: VPC + openAPISchema: | + { "type": "object", "properties": { "cidr": ... } } + terraform: + sourceRef: # tofu-controller source + kind: OCIRepository + name: aws-vpc-module + approvePlan: auto + destroyResourcesOnDeletion: true + writeOutputsToSecret: + name: "{{ .name }}-outputs" +``` + +`Application.Spec` keys become Terraform input variables. + +--- + +## Draft 1 — Architecture + +```text + ┌───────────────────────────┐ ┌────────────────────────────┐ + Postgres ───► │ cozystack-api │ │ cozystack-tofu-api │ ◄─── VPC + │ ↳ ApplicationDefinition │ │ ↳ TofuApplicationDefinition│ + │ ↳ HelmRelease translator │ │ ↳ Terraform translator │ + └─────────────┬─────────────┘ └─────────────┬──────────────┘ + ▼ ▼ + ┌────────────┐ ┌──────────────┐ + │ HelmRelease│ │ Terraform │ + └────────────┘ └──────────────┘ + helm-controller tofu-controller +``` + +Two parallel stacks. Zero shared code (initially). + +--- + +## Draft 1 — Trade-offs + +**Pros** + +- Days, not weeks. Mechanical copy of the existing stack. +- Zero risk to Helm-backed packages. +- Easy review; can land incrementally behind a feature flag. + +**Cons** + +- ~80% code duplication (REST, reconciler, dynamic-type registration). +- Two apiservers in the aggregation layer — more operational surface. +- A third backend (Argo, Kustomization) means another full copy. +- Future unification needs a deprecation cycle. + +--- + +## Draft 2 — Pluggable backend + +One `ApplicationDefinition`, discriminated by `backend.type`. + +```yaml +apiVersion: apps.cozystack.io/v1alpha1 +kind: ApplicationDefinition +spec: + application: + kind: VPC + openAPISchema: | + { ... 
} + backend: + type: Terraform # Helm | Terraform | ... + terraform: + sourceRef: { kind: OCIRepository, name: aws-vpc-module, ... } + approvePlan: auto + destroyResourcesOnDeletion: true +``` + +Existing packages keep working: `spec.backend.helm` is defaulted from `spec.release`. + +--- + +## Draft 2 — The interface + +```go +type Backend interface { + Type() Type // "Helm" | "Terraform" + TargetGVK() schema.GroupVersionKind // HelmRelease | Terraform | ... + TargetName(appName string, def *ApplicationDefinition) string + + Build(ctx context.Context, app *Application, def *ApplicationDefinition) (client.Object, error) + ProjectStatus(target client.Object) (ApplicationStatus, error) + Reconcile(ctx context.Context, c client.Client, target client.Object, + def *ApplicationDefinition) (updated bool, err error) +} +``` + +Two implementations in the same PR: + +- `pkg/agl/backend/helm/` — extracted from current code. +- `pkg/agl/backend/terraform/` — new. + +--- + +## Draft 2 — Architecture + +```text + ┌────────────────────────────────────────────────┐ + Postgres ───► │ cozystack-api │ ◄─── VPC + │ ApplicationDefinition (single CRD) │ + │ │ + │ ┌─────────────────┐ ┌───────────────────┐ │ + │ │ HelmBackend │ │ TerraformBackend │ │ + │ └────────┬────────┘ └─────────┬─────────┘ │ + └────────────┼───────────────────────┼───────────┘ + ▼ ▼ + ┌────────────┐ ┌──────────────┐ + │ HelmRelease│ │ Terraform │ + └────────────┘ └──────────────┘ +``` + +One apiserver. One reconciler. N backends. + +--- + +## Draft 2 — Trade-offs + +**Pros** + +- Operationally simpler in the long run: one CRD, one apiserver, one reconciler. +- New backends (Argo, Kustomization, plain manifests) cost only an interface impl. +- AGL finally becomes what its name says: a *Generation Layer*, not a Helm wrapper. + +**Cons** + +- Large refactor of `rest.go` and the reconciler — touches every Helm package. +- Interface designed against two backends risks being leaky for a third.
+- Generic `Application.Status` envelope + per-backend opaque extension —
defers some schema problems rather than solving them. + +--- + +## Comparison + +| | Draft 1 (parallel) | Draft 2 (pluggable) | +| -------------------------- | ------------------------ | ----------------------- | +| Time to first PoC | days | weeks | +| Regression risk | minimal | medium | +| Code duplication | high | none | +| Cost of 3rd backend | another full copy | one interface impl | +| User-facing UX | two AppDef kinds | one AppDef, switch type | +| Review burden | low | high, needs slicing | +| Long-term maintenance cost | high | low | + +--- + +## Recommendation + +**Hybrid path:** + +1. **Now:** Draft 1 as a feature-branch PoC.
Prove the tofu-controller mapping. Surface real requirements for vars marshalling,
cloud-creds runner pods, and output secret handling. + +2. **Next:** With two working backends in hand, do the Draft 2 refactor.
The interface is designed against concrete code, not speculation. + +This trades a small amount of throwaway code for much lower
risk of a bad abstraction. + +--- + +## Next steps + +- Review and approve the two drafts. +- Decide: ship Draft 1 standalone, or commit to the hybrid path? +- Pick the first Terraform-backed example package
(candidates: `VPC`, `DNSZone`, `S3Bucket`). +- Discuss in the Cozystack maintainers' sync; if the response is positive, open an RFC issue. + +**Documents in this draft:** + +- `draft-1-parallel-tofu-stack.md` +- `draft-2-pluggable-backend.md` +- `presentation.md` (this deck) + +--- + +## Questions + +Thank you. + +Sources studied: `cozystack/api/v1alpha1/applicationdefinitions_types.go`,
`cozystack/pkg/registry/apps/application/rest.go`,
`cozystack/pkg/apiserver/`, `cozystack/internal/controller/`,
`cozystack/packages/system/postgres-rd/`.