Skip to content

USHIFT-6797: C2CC Probe remote clusters#6729

Draft
pmtk wants to merge 14 commits into
openshift:mainfrom
pmtk:c2cc/probe
Draft

USHIFT-6797: C2CC Probe remote clusters#6729
pmtk wants to merge 14 commits into
openshift:mainfrom
pmtk:c2cc/probe

Conversation

@pmtk
Copy link
Copy Markdown
Member

@pmtk pmtk commented May 25, 2026

Summary by CodeRabbit

  • New Features
    • Cluster-to-cluster health-check system with a probe deployment and per-remote health CRs reporting Healthy/Unhealthy, last probe time, and errors.
  • Configuration
    • New probeInterval setting (1s–5m, default 10s) to control probe frequency; probe service ClusterIP deterministically computed.
  • CLI
    • Added hidden probe subcommand to run the probe server.
  • Installation / Runtime
    • Manifests for namespace, RBAC, service account, service, and probe deployment included; CRD for RemoteCluster added.
  • Tooling
    • Make target and script to generate CRD/artifacts.
  • Tests & QA
    • End-to-end and unit tests validating probe behavior, CR lifecycle, and cluster IP computation.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 25, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 25, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 25, 2026

@pmtk: This pull request references USHIFT-6797 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 25, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Implements C2CC probe support: RemoteCluster API/CRD, probe config and deterministic probe IPs, probe runtime and CLI, controller reconciliation and probe deployment, asset/codegen additions, and unit/acceptance tests.

Changes

C2CC Probe Feature

Layer / File(s) Summary
RemoteCluster CRD and API Definition
pkg/apis/microshift/v1alpha1/*, assets/crd/microshift.io_remoteclusters.yaml, cmd/generate-config/config/config-openapi-spec.json
Adds v1alpha1 API types, scheme wiring, and CRD manifest for RemoteCluster with spec.probeTarget, spec.probeInterval (default 10s), and status fields including state, lastProbeTime, lastSuccessfulProbe, and errors.
Configuration, defaults & validation
pkg/config/c2cc.go, pkg/config/c2cc_test.go, pkg/config/config.go, packaging/microshift/config.yaml, docs/user/howto_config.md
Adds probeInterval config (1s–5m, default 10s), resolves to ResolvedProbeInterval, computes deterministic ProbeIP from service CIDR's 11th host, and adds tests and docs.
Kubernetes assets & codegen
assets/components/c2cc/*, scripts/generate-crds.sh, Makefile, scripts/auto-rebase/assets.yaml
Adds probe namespace, ServiceAccount, ClusterRole/Binding, Deployment, Service (templated clusterIP), codegen script, Make target generate-crds, and updates asset listing.
Controller: deploy probe assets & CRD lifecycle
pkg/controllers/c2cc/controller.go, pkg/controllers/c2cc/deploy_probe.go, pkg/assets/crd.go
Controller applies healthcheck CRD at startup, waits for it to be established, computes probe ClusterIP, renders and applies probe assets, implements CRD cleanup, and adds DeleteCRDs helper with aggregated error handling.
Healthcheck reconciliation manager
pkg/controllers/c2cc/healthcheck.go, pkg/controllers/c2cc/healthcheck_test.go
Builds desired RemoteCluster objects from config, reconciles by creating/updating/deleting managed CRs, deterministic CR naming from next-hop IP, and unit tests for reconcile behaviors.
Probe runtime & CLI
pkg/controllers/c2cc/probe.go, pkg/cmd/c2cc_probe.go, cmd/microshift/main.go
RunProbe starts HTTP probe server on :8080, manages per-CR probe goroutines that probe targets and update CR status; adds hidden c2cc-probe command and registers it in main.
Acceptance tests (Robot Framework)
test/suites/c2cc/*, test/resources/c2cc.resource
Robot tests verify namespace/deployment/service existence, service ClusterIP correctness (11th CIDR host), RemoteCluster health/status fields, and controller self-healing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • eslutsky
  • kasturinarra
  • pacevedom
🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Tests lack descriptive assertion messages. Multiple plain assertions without failure messages found in c2cc_test.go, healthcheck_test.go, and Robot Framework tests. Add assertion messages to all tests for diagnostic clarity. Go: use second parameter with context. Robot: include msg arguments in Should keywords.
Topology-Aware Scheduling Compatibility ⚠️ Warning Deployment uses nodeSelector for node-role.kubernetes.io/master without topology-awareness. On HyperShift, this causes the probe pod to remain Pending indefinitely. Detect topology via infrastructure.Status.ControlPlaneTopology and conditionally apply nodeSelector, or use preferredDuringSchedulingIgnoredDuringExecution affinity instead.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main objective of the PR: implementing C2CC probe functionality for remote clusters, as evidenced by the comprehensive changes to add RemoteCluster CRD, probe interval configuration, health status tracking, and probe deployment logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Go tests use static names ("IPv4", "valid single remote IPv4"). Robot Framework tests use static titles. No dynamic identifiers, timestamps, UUIDs, IPs, or node names appear in test titles.
Microshift Test Compatibility ✅ Passed Check applies to Ginkgo e2e tests only. PR adds no Ginkgo e2e tests—only Robot Framework suites and Go unit tests, which are outside this check's scope.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added in this PR. Only Robot Framework tests (probe.robot, healthcheck.robot, c2cc.resource) were added, which fall outside the scope of this Ginkgo-specific check.
Ote Binary Stdout Contract ✅ Passed No process-level stdout writes found. New code uses klog in RunProbe subcommand, which writes to stderr by default. No main/init/suite-level logging that would corrupt OTE JSON stdout.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests were added. The PR adds Robot Framework test suites (probe.robot, healthcheck.robot) which use different syntax; the custom check applies only to Ginkgo tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
test/suites/c2cc/probe.robot (1)

123-130: ⚡ Quick win

Consider extracting Compute 11th IP to a shared resource.

This keyword is duplicated identically in 07-healthcheck.robot (lines 97-104). Since both test files already import c2cc.resource, extracting this to the shared resource file would eliminate duplication and provide a single point of maintenance.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/suites/c2cc/probe.robot` around lines 123 - 130, The keyword Compute
11th IP is duplicated; extract it into the shared resource by moving the entire
keyword block (the Robot Framework keyword named "Compute 11th IP" that uses
Process.Run Process to call python3 with the ipaddress snippet, checks rc and
returns the stripped stdout) into the c2cc.resource file, remove the duplicate
definitions from test/suites/c2cc/probe.robot and 07-healthcheck.robot, and
ensure both tests continue to import c2cc.resource so they call the single
shared Compute 11th IP keyword.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@assets/components/c2cc/deployment.yaml`:
- Around line 54-60: The deployment currently hard-pins c2cc-probe to
master-labeled nodes via nodeSelector node-role.kubernetes.io/master and a
NoSchedule toleration; change this to a topology-aware preference instead:
remove the strict nodeSelector and NoSchedule toleration and replace with a
preferred node affinity (preferredDuringSchedulingIgnoredDuringExecution) that
targets control-plane labels (e.g., node-role.kubernetes.io/master or
node-role.kubernetes.io/control-plane) so pods prefer control-plane nodes but
can run elsewhere for single-node/arbiter variants, and if a toleration is still
required use a non-blocking preference (e.g., avoid NoSchedule) or match the
exact taint/key semantics; update the c2cc deployment
(nodeSelector/tolerations/priorityClassName usage) and add topology-specific CI
to validate placement across single-node and multi-node topologies.

In `@cmd/generate-config/config/config-openapi-spec.json`:
- Around line 206-209: The OpenAPI schema for probeInterval lacks a default and
should match runtime default used in fillDefaults(); update the "probeInterval"
property in the OpenAPI schema (clusterToCluster.probeInterval) to include a
default value of "10s" and keep the existing description/type, so generated docs
and clients reflect the actual default; locate the "probeInterval" entry in the
config-openapi-spec JSON and add the "default": "10s" field.

In `@pkg/config/c2cc_test.go`:
- Around line 864-865: fillDefaults() is an error-returning initializer and the
tests currently ignore its error; update each call (e.g., the cfg.fillDefaults()
invocations around the assertions at the noted locations) to capture the
returned error and assert success (for example assign err := cfg.fillDefaults()
and call require.NoError(t, err) or assert.NoError(t, err)) before continuing to
the assert.Equal checks so tests fail fast on initialization errors.

In `@pkg/controllers/c2cc/probe.go`:
- Around line 85-87: The call to factory.WaitForCacheSync currently ignores its
result; update the startup in pkg/controllers/c2cc/probe.go to check the
returned map[reflect.Type]bool (the result of factory.WaitForCacheSync) and fail
fast if any informer failed to sync: after factory.Start(ctx.Done()) capture the
sync map from factory.WaitForCacheSync(ctx.Done()), inspect its values, and if
any value is false log an error via klog.Errorf (including what failed) and
return/exit the function/manager startup instead of continuing to
klog.Infof("Probe manager running, watching RemoteCluster CRs").
- Around line 67-70: The UpdateFunc is restarting probes on any RemoteCluster
update causing status-update-driven restart loops; change UpdateFunc to accept
both oldObj and newObj, cast them to *microshiftv1alpha1.RemoteCluster, compare
the relevant spec fields (e.g., Spec.ProbeTarget and Spec.ProbeInterval) between
old and new, and only call pm.restartProbe(ctx, rc) when those spec fields
differ (ignore updates that only change Status); this preserves runProbeLoop
behavior and prevents resetting consecutiveFailures to 0 on status-only updates.

---

Nitpick comments:
In `@test/suites/c2cc/probe.robot`:
- Around line 123-130: The keyword Compute 11th IP is duplicated; extract it
into the shared resource by moving the entire keyword block (the Robot Framework
keyword named "Compute 11th IP" that uses Process.Run Process to call python3
with the ipaddress snippet, checks rc and returns the stripped stdout) into the
c2cc.resource file, remove the duplicate definitions from
test/suites/c2cc/probe.robot and 07-healthcheck.robot, and ensure both tests
continue to import c2cc.resource so they call the single shared Compute 11th IP
keyword.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3633745b-f762-4fbe-9134-34031e2ec1d5

📥 Commits

Reviewing files that changed from the base of the PR and between ca5e605 and 2bb9c14.

⛔ Files ignored due to path filters (22)
  • pkg/apis/microshift/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • pkg/generated/clientset/versioned/clientset.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/fake/clientset_generated.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/fake/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/fake/register.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/scheme/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/scheme/register.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/fake/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/fake/fake_microshift_client.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/fake/fake_remotecluster.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/generated_expansion.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/microshift_client.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/remotecluster.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/factory.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/generic.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/internalinterfaces/factory_interfaces.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/microshift/interface.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/microshift/v1alpha1/interface.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/microshift/v1alpha1/remotecluster.go is excluded by !**/generated/**
  • pkg/generated/listers/microshift/v1alpha1/expansion_generated.go is excluded by !**/generated/**
  • pkg/generated/listers/microshift/v1alpha1/remotecluster.go is excluded by !**/generated/**
📒 Files selected for processing (30)
  • Makefile
  • assets/components/c2cc/clusterrole.yaml
  • assets/components/c2cc/clusterrolebinding.yaml
  • assets/components/c2cc/deployment.yaml
  • assets/components/c2cc/namespace.yaml
  • assets/components/c2cc/service.yaml
  • assets/components/c2cc/serviceaccount.yaml
  • assets/crd/microshift.io_remoteclusters.yaml
  • cmd/generate-config/config/config-openapi-spec.json
  • cmd/microshift/main.go
  • docs/user/howto_config.md
  • packaging/microshift/config.yaml
  • pkg/apis/microshift/v1alpha1/doc.go
  • pkg/apis/microshift/v1alpha1/groupversion_info.go
  • pkg/apis/microshift/v1alpha1/types.go
  • pkg/assets/crd.go
  • pkg/cmd/c2cc_probe.go
  • pkg/config/c2cc.go
  • pkg/config/c2cc_test.go
  • pkg/config/config.go
  • pkg/controllers/c2cc/controller.go
  • pkg/controllers/c2cc/deploy_probe.go
  • pkg/controllers/c2cc/healthcheck.go
  • pkg/controllers/c2cc/healthcheck_test.go
  • pkg/controllers/c2cc/probe.go
  • scripts/auto-rebase/assets.yaml
  • scripts/boilerplate.go.txt
  • scripts/generate-crds.sh
  • test/suites/c2cc/07-healthcheck.robot
  • test/suites/c2cc/probe.robot

Comment on lines +54 to +60
nodeSelector:
node-role.kubernetes.io/master: ""
priorityClassName: system-cluster-critical
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="assets/components/c2cc/deployment.yaml"

echo "=== File exists? ==="
ls -l "$FILE" || exit 1

echo
echo "=== Lines 1-120 (context) ==="
nl -ba "$FILE" | sed -n '1,120p'

echo
echo "=== Lines 45-75 (scheduling snippet focus) ==="
nl -ba "$FILE" | sed -n '45,75p'

echo
echo "=== Search for scheduling/topology fields in this file ==="
rg -n --hidden --no-ignore-vcs "(nodeSelector|affinity|podAntiAffinity|topologySpreadConstraints|topologyKey|replicas|podDisruptionBudget|PDB|node-role.kubernetes.io/master|control-plane|master|kubernetes.io/hostname)" "$FILE" || true

echo
echo "=== Repo-wide usage of node-role.kubernetes.io/master ==="
rg -n --hidden --no-ignore-vcs "node-role\.kubernetes\.io/master" . || true

Repository: openshift/microshift

Length of output: 266


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="assets/components/c2cc/deployment.yaml"

echo "=== File exists? ==="
ls -l "$FILE" || exit 1

echo
echo "=== Lines 1-120 (context) ==="
awk 'NR>=1 && NR<=120 {printf "%6d  %s\n", NR, $0}' "$FILE"

echo
echo "=== Lines 45-75 (scheduling snippet focus) ==="
awk 'NR>=45 && NR<=75 {printf "%6d  %s\n", NR, $0}' "$FILE"

echo
echo "=== Search for scheduling/topology fields in this file ==="
rg -n --hidden --no-ignore-vcs "(nodeSelector|affinity|podAntiAffinity|topologySpreadConstraints|topologyKey|replicas|podDisruptionBudget|PDB|node-role.kubernetes.io/master|control-plane|master|kubernetes.io/hostname)" "$FILE" || true

echo
echo "=== Repo-wide usage of node-role.kubernetes.io/master ==="
rg -n --hidden --no-ignore-vcs "node-role\.kubernetes\.io/master" . || true

Repository: openshift/microshift

Length of output: 15624


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="assets/components/c2cc/deployment.yaml"

echo "=== File exists? ==="
ls -l "$FILE" || exit 1

echo
echo "=== Lines 1-120 (context) ==="
awk 'NR>=1 && NR<=120 {printf "%6d  %s\n", NR, $0}' "$FILE"

echo
echo "=== Lines 45-75 (scheduling snippet focus) ==="
awk 'NR>=45 && NR<=75 {printf "%6d  %s\n", NR, $0}' "$FILE"

echo
echo "=== Search for scheduling/topology fields in this file ==="
rg -n --hidden --no-ignore-vcs "(nodeSelector|affinity|podAntiAffinity|topologySpreadConstraints|topologyKey|replicas|podDisruptionBudget|PDB|node-role.kubernetes.io/master|control-plane|master|kubernetes.io/hostname)" "$FILE" || true

echo
echo "=== Repo-wide usage of node-role.kubernetes.io/master ==="
rg -n --hidden --no-ignore-vcs "node-role\.kubernetes\.io/master" . || true

Repository: openshift/microshift

Length of output: 15624


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Repo-wide: control-plane label usage ==="
rg -n --hidden --no-ignore-vcs "node-role\.kubernetes\.io/control-plane" . || true

echo
echo "=== Repo-wide: arbiter mentions ==="
rg -n --hidden --no-ignore-vcs "arbiter" . || true

echo
echo "=== Repo-wide: SNO mention ==="
rg -n --hidden --no-ignore-vcs "\bSNO\b|single[- ]node|single[- ]node" . || true

echo
echo "=== CI/topology job hints ==="
rg -n --hidden --no-ignore-vcs "topology|topological|spread|DoNotSchedule|hostname|podAntiAffinity" .github . || true

Repository: openshift/microshift

Length of output: 50378


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Repo-wide: control-plane label usage ==="
rg -n --hidden --no-ignore-vcs "node-role\.kubernetes\.io/control-plane" . || true

echo
echo "=== Repo-wide: arbiter mentions ==="
rg -n --hidden --no-ignore-vcs "arbiter" . || true

echo
echo "=== Repo-wide: SNO mention ==="
rg -n --hidden --no-ignore-vcs "\bSNO\b|single[- ]node|single[- ]node" . || true

echo
echo "=== CI/topology job hints ==="
rg -n --hidden --no-ignore-vcs "topology|topological|spread|DoNotSchedule|hostname|podAntiAffinity" .github . || true

Repository: openshift/microshift

Length of output: 50378


Avoid hard pinning c2cc-probe to master-labeled control-plane nodes
In assets/components/c2cc/deployment.yaml (lines 54-60), the workload uses nodeSelector: node-role.kubernetes.io/master: "" plus a NoSchedule toleration for the same taint, forcing placement onto master-labeled nodes only. This bakes in a specific control-plane topology and isn’t topology-aware for single-node/arbiter variants—make placement topology-aware and cover it with topology-specific CI.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/components/c2cc/deployment.yaml` around lines 54 - 60, The deployment
currently hard-pins c2cc-probe to master-labeled nodes via nodeSelector
node-role.kubernetes.io/master and a NoSchedule toleration; change this to a
topology-aware preference instead: remove the strict nodeSelector and NoSchedule
toleration and replace with a preferred node affinity
(preferredDuringSchedulingIgnoredDuringExecution) that targets control-plane
labels (e.g., node-role.kubernetes.io/master or
node-role.kubernetes.io/control-plane) so pods prefer control-plane nodes but
can run elsewhere for single-node/arbiter variants, and if a toleration is still
required use a non-blocking preference (e.g., avoid NoSchedule) or match the
exact taint/key semantics; update the c2cc deployment
(nodeSelector/tolerations/priorityClassName usage) and add topology-specific CI
to validate placement across single-node and multi-node topologies.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — MicroShift is single-node. Every other deployment (service-ca, OVN, CSI, DNS) uses the identical nodeSelector: node-role.kubernetes.io/master + NoSchedule toleration pattern. This is the standard convention, not a topology concern.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmtk, understood — that's a valid convention for MicroShift's single-node topology. I'll defer to the established pattern.


✏️ Learnings added
Learnt from: pmtk
Repo: openshift/microshift PR: 6729
File: assets/components/c2cc/deployment.yaml:54-60
Timestamp: 2026-05-25T11:24:44.503Z
Learning: In MicroShift (openshift/microshift), all system-level deployments (service-ca, OVN, CSI, DNS, c2cc-probe, etc.) use the standard pattern of `nodeSelector: node-role.kubernetes.io/master: ""` plus a `NoSchedule` toleration for `node-role.kubernetes.io/master`. This is intentional and correct — MicroShift is a single-node distribution, so topology-aware scheduling is not applicable. Do not flag this pattern as a topology concern.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread cmd/generate-config/config/config-openapi-spec.json
Comment thread pkg/config/c2cc_test.go Outdated
Comment thread pkg/controllers/c2cc/probe.go Outdated
Comment on lines +85 to +87
factory.Start(ctx.Done())
factory.WaitForCacheSync(ctx.Done())
klog.Infof("Probe manager running, watching RemoteCluster CRs")
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify cache sync result is currently ignored.
rg -n -C2 'factory.Start|WaitForCacheSync|HasSynced' pkg/controllers/c2cc/probe.go

Repository: openshift/microshift

Length of output: 218


Fail fast if informer cache does not sync. In pkg/controllers/c2cc/probe.go (lines 85-87), the factory.WaitForCacheSync(ctx.Done()) return value is ignored; if the RemoteCluster informer cache never syncs, the manager can start without a valid initial state—handle the sync result (log and exit/return an error) when it fails.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/probe.go` around lines 85 - 87, The call to
factory.WaitForCacheSync currently ignores its result; update the startup in
pkg/controllers/c2cc/probe.go to check the returned map[reflect.Type]bool (the
result of factory.WaitForCacheSync) and fail fast if any informer failed to
sync: after factory.Start(ctx.Done()) capture the sync map from
factory.WaitForCacheSync(ctx.Done()), inspect its values, and if any value is
false log an error via klog.Errorf (including what failed) and return/exit the
function/manager startup instead of continuing to klog.Infof("Probe manager
running, watching RemoteCluster CRs").

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — WaitForCacheSync(ctx.Done()) blocks until sync or context cancellation. If ctx is cancelled, the next line <-ctx.Done() exits immediately anyway — both paths converge on the same shutdown behavior.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmtk, fair point — if the context is already cancelled when WaitForCacheSync returns false, the <-ctx.Done() path shuts things down cleanly regardless. Acknowledged.

@pmtk
Copy link
Copy Markdown
Member Author

pmtk commented May 25, 2026

Depends on #6727

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
assets/components/c2cc/deployment.yaml (1)

35-40: ⚡ Quick win

Consider adding a readinessProbe.

The Deployment has a livenessProbe but no readinessProbe. Without it, the Service will route traffic immediately on pod start, potentially before the HTTP server is ready.

♻️ Proposed addition
         livenessProbe:
           httpGet:
             path: /
             port: 8080
           initialDelaySeconds: 10
           periodSeconds: 10
+        readinessProbe:
+          httpGet:
+            path: /
+            port: 8080
+          initialDelaySeconds: 5
+          periodSeconds: 5
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/components/c2cc/deployment.yaml` around lines 35 - 40, Add a
readinessProbe alongside the existing livenessProbe so the Service only routes
traffic once the pod's HTTP server is ready; mirror the livenessProbe's httpGet
(path: /, port: 8080) and set sensible readiness-specific timing (for example
readinessProbe with initialDelaySeconds, periodSeconds and optionally
successThreshold/failureThreshold) in the same Pod spec where livenessProbe is
defined so Kubernetes checks readiness separately from liveness.
pkg/controllers/c2cc/deploy_probe.go (1)

26-27: ⚡ Quick win

Add a defensive guard before indexing ServiceNetwork.

Line 26 assumes at least one entry and can panic on empty config; return a clear error instead.

Suggested patch
 func (c *C2CCRouteManager) deployProbe(ctx context.Context) error {
+	if len(c.cfg.Network.ServiceNetwork) == 0 {
+		return fmt.Errorf("service network is empty")
+	}
 	_, svcNet, err := net.ParseCIDR(c.cfg.Network.ServiceNetwork[0])
 	if err != nil {
 		return fmt.Errorf("failed to parse local service network: %w", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/deploy_probe.go` around lines 26 - 27, Add a defensive
guard before indexing c.cfg.Network.ServiceNetwork: check that
len(c.cfg.Network.ServiceNetwork) > 0 and if not, return a clear error (e.g.,
fmt.Errorf or wrapped error) indicating the ServiceNetwork is empty; only then
call net.ParseCIDR(c.cfg.Network.ServiceNetwork[0]) and handle its error as
before. This protects the code around net.ParseCIDR and the surrounding function
(where c.cfg.Network.ServiceNetwork is used) from panics due to an empty slice.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controllers/c2cc/healthcheck.go`:
- Around line 65-67: The reconcile path in healthcheckCRManager.reconcile()
writes got.Spec = want.Spec then calls h.client.RemoteClusters().Update(...)
which can fail with 409s if the probe pod updates status between List and
Update; wrap the Update call in a retry-on-conflict loop (use
retry.RetryOnConflict) that re-gets the latest RemoteCluster
(h.client.RemoteClusters().Get), applies the desired Spec (assign want.Spec to
the fetched object's Spec), and retries the Update until success or a
non-conflict error, preserving the existing metav1.UpdateOptions usage.

In `@pkg/controllers/c2cc/probe.go`:
- Around line 177-181: The current probe logic unconditionally sets status.State
to "Healthy" whenever consecutiveFailures < unhealthyThreshold, which hides
errors; change the logic in the probe result handler that updates status.State
(the block using consecutiveFailures, unhealthyThreshold and status.State) to
avoid overwriting the prior state when there are Errors—either preserve the
existing status.State if Errors is non-empty or use an intermediate state such
as "Degraded"/"UnhealthyPending" instead of "Healthy"; ensure you handle the
initial/zero-value state (set to "Healthy" only when there are no Errors and
consecutiveFailures == 0) so the code that writes status.Errors and status.State
remains consistent.
- Line 44: The http.Server instantiation (server := &http.Server{Addr: ":8080",
Handler: mux}) lacks a ReadHeaderTimeout, making it vulnerable to Slowloris;
update the server struct to include a sensible ReadHeaderTimeout (e.g., a few
seconds) alongside Addr and Handler so the server closes connections that stall
while sending headers (modify the &http.Server literal where server is created).

---

Nitpick comments:
In `@assets/components/c2cc/deployment.yaml`:
- Around line 35-40: Add a readinessProbe alongside the existing livenessProbe
so the Service only routes traffic once the pod's HTTP server is ready; mirror
the livenessProbe's httpGet (path: /, port: 8080) and set sensible
readiness-specific timing (for example readinessProbe with initialDelaySeconds,
periodSeconds and optionally successThreshold/failureThreshold) in the same Pod
spec where livenessProbe is defined so Kubernetes checks readiness separately
from liveness.

In `@pkg/controllers/c2cc/deploy_probe.go`:
- Around line 26-27: Add a defensive guard before indexing
c.cfg.Network.ServiceNetwork: check that len(c.cfg.Network.ServiceNetwork) > 0
and if not, return a clear error (e.g., fmt.Errorf or wrapped error) indicating
the ServiceNetwork is empty; only then call
net.ParseCIDR(c.cfg.Network.ServiceNetwork[0]) and handle its error as before.
This protects the code around net.ParseCIDR and the surrounding function (where
c.cfg.Network.ServiceNetwork is used) from panics due to an empty slice.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ba6f532c-76d3-473b-8672-a7e040dd7d51

📥 Commits

Reviewing files that changed from the base of the PR and between 7ec41e8 and 06c24c1.

⛔ Files ignored due to path filters (22)
  • pkg/apis/microshift/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • pkg/generated/clientset/versioned/clientset.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/fake/clientset_generated.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/fake/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/fake/register.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/scheme/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/scheme/register.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/fake/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/fake/fake_microshift_client.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/fake/fake_remotecluster.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/generated_expansion.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/microshift_client.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/remotecluster.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/factory.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/generic.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/internalinterfaces/factory_interfaces.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/microshift/interface.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/microshift/v1alpha1/interface.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/microshift/v1alpha1/remotecluster.go is excluded by !**/generated/**
  • pkg/generated/listers/microshift/v1alpha1/expansion_generated.go is excluded by !**/generated/**
  • pkg/generated/listers/microshift/v1alpha1/remotecluster.go is excluded by !**/generated/**
📒 Files selected for processing (31)
  • Makefile
  • assets/components/c2cc/clusterrole.yaml
  • assets/components/c2cc/clusterrolebinding.yaml
  • assets/components/c2cc/deployment.yaml
  • assets/components/c2cc/namespace.yaml
  • assets/components/c2cc/service.yaml
  • assets/components/c2cc/serviceaccount.yaml
  • assets/crd/microshift.io_remoteclusters.yaml
  • cmd/generate-config/config/config-openapi-spec.json
  • cmd/microshift/main.go
  • docs/user/howto_config.md
  • packaging/microshift/config.yaml
  • pkg/apis/microshift/v1alpha1/doc.go
  • pkg/apis/microshift/v1alpha1/groupversion_info.go
  • pkg/apis/microshift/v1alpha1/types.go
  • pkg/assets/crd.go
  • pkg/cmd/c2cc_probe.go
  • pkg/config/c2cc.go
  • pkg/config/c2cc_test.go
  • pkg/config/config.go
  • pkg/controllers/c2cc/controller.go
  • pkg/controllers/c2cc/deploy_probe.go
  • pkg/controllers/c2cc/healthcheck.go
  • pkg/controllers/c2cc/healthcheck_test.go
  • pkg/controllers/c2cc/probe.go
  • scripts/auto-rebase/assets.yaml
  • scripts/boilerplate.go.txt
  • scripts/generate-crds.sh
  • test/resources/c2cc.resource
  • test/suites/c2cc/healthcheck.robot
  • test/suites/c2cc/probe.robot
✅ Files skipped from review due to trivial changes (4)
  • pkg/apis/microshift/v1alpha1/doc.go
  • assets/components/c2cc/namespace.yaml
  • cmd/generate-config/config/config-openapi-spec.json
  • docs/user/howto_config.md

Comment on lines +65 to +67
got.Spec = want.Spec
if _, err := h.client.RemoteClusters().Update(ctx, got, metav1.UpdateOptions{}); err != nil {
return fmt.Errorf("updating RemoteCluster %q: %w", name, err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify concurrent status writers and current lack of retry-on-conflict in update path.
rg -nP --type=go 'RemoteClusters\(\)\.UpdateStatus\('
rg -nP --type=go 'healthcheckCRManager\)\s+reconcile|RemoteClusters\(\)\.Update\(|RetryOnConflict'

Repository: openshift/microshift

Length of output: 28029


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Inspect the exact code around the update call
sed -n '1,140p' pkg/controllers/c2cc/healthcheck.go

# 2) Find all remotecluster update call sites in this repo (spec or status)
rg -n --type=go "RemoteClusters\(\)\.(Update|UpdateStatus|Patch)" pkg | head -n 200

# 3) Find any other controller that updates RemoteCluster status/fields
rg -n --type=go "RemoteCluster" pkg/controllers/c2cc | head -n 200
rg -n --type=go "RemoteCluster" pkg | head -n 200

Repository: openshift/microshift

Length of output: 29154


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '130,280p' pkg/controllers/c2cc/probe.go

# Also inspect imports to see if retry is already used in probe.go
sed -n '1,120p' pkg/controllers/c2cc/probe.go

Repository: openshift/microshift

Length of output: 6156


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n --type=go "RetryOnConflict|util/retry|Update\(ctx,.*RemoteClusters" pkg/controllers/c2cc/healthcheck.go
rg -n --type=go "RemoteClusters\(\)\.Update\(" pkg/controllers/c2cc
rg -n --type=go "UpdateStatus\(" pkg/controllers/c2cc

Repository: openshift/microshift

Length of output: 46


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show how healthcheck manager + probe are wired
sed -n '1,220p' pkg/controllers/c2cc/controller.go
ls -1 pkg/controllers/c2cc

# Find where healthcheckCRManager is instantiated / reconcile loop is started
rg -n --type=go "newHealthcheckCRManager|healthcheckCRManager|\\.reconcile\\(" pkg/controllers/c2cc
rg -n --type=go "RunProbe\\(" pkg/controllers/c2cc
rg -n --type=go "RunProbe" pkg/controllers | head -n 50

Repository: openshift/microshift

Length of output: 8925


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '140,260p' pkg/controllers/c2cc/controller.go
sed -n '1,220p' pkg/controllers/c2cc/deploy_probe.go

rg -n --type=go "RunProbe|deploy_probe|Probe" pkg/controllers/c2cc | head -n 200

# Show full healthcheck.go to confirm Update uses entire object (status included)
sed -n '1,120p' pkg/controllers/c2cc/healthcheck.go

Repository: openshift/microshift

Length of output: 13925


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "deployment.yaml|c2cc-probe|probe" components/c2cc -S || true
ls -R components/c2cc || true

# Find the deployment template file referenced by deployProbe
fd -a "deployment.yaml" . | head -n 20

# Print the deployment.yaml relevant bits (replicas, container args/env) without dumping entire file
FILE="$(fd -a "deployment.yaml" . | head -n 1)"
echo "Using $FILE"
rg -n "replicas:|c2cc.*probe|RunProbe|healthcheck-probe|container|args:" "$FILE" -n || true
sed -n '1,220p' "$FILE"

Repository: openshift/microshift

Length of output: 6913


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Probe deployment template
ls -la assets/components/c2cc || true
sed -n '1,240p' assets/components/c2cc/deployment.yaml

# 2) RemoteCluster CRD to see if status subresource is enabled
CRD="$(fd -a "microshift.io_remoteclusters.yaml" . | head -n 1)"
echo "CRD path: $CRD"
sed -n '1,220p' "$CRD"
rg -n "subresources:|status:" "$CRD" -n

# 3) Retry constants used in suggested patch
sed -n '1,140p' vendor/k8s.io/client-go/util/retry/util.go
rg -n "DefaultRetry|DefaultBackoff" vendor/k8s.io/client-go/util/retry/util.go

Repository: openshift/microshift

Length of output: 9233


Add retry-on-conflict when updating RemoteCluster Spec (avoid 409s from concurrent probe status writes).

healthcheckCRManager.reconcile() does a List then full-object Update() after modifying got.Spec. The probe pod writes RemoteCluster health via UpdateStatus, and the CRD enables the status subresource—so probe status updates can bump resourceVersion between List and Update, causing reconcile to fail early on conflicts.

Suggested patch
 import (
 	"context"
 	"fmt"
 	"net"
 	"strings"

 	microshiftv1alpha1 "github.com/openshift/microshift/pkg/apis/microshift/v1alpha1"
 	"github.com/openshift/microshift/pkg/config"
 	microshiftclient "github.com/openshift/microshift/pkg/generated/clientset/versioned/typed/microshift/v1alpha1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/util/retry"
 	"k8s.io/klog/v2"
 )
@@
-		got.Spec = want.Spec
-		if _, err := h.client.RemoteClusters().Update(ctx, got, metav1.UpdateOptions{}); err != nil {
+		if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
+			latest, err := h.client.RemoteClusters().Get(ctx, name, metav1.GetOptions{})
+			if err != nil {
+				return err
+			}
+			latest.Spec = want.Spec
+			_, err = h.client.RemoteClusters().Update(ctx, latest, metav1.UpdateOptions{})
+			return err
+		}); err != nil {
 			return fmt.Errorf("updating RemoteCluster %q: %w", name, err)
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
got.Spec = want.Spec
if _, err := h.client.RemoteClusters().Update(ctx, got, metav1.UpdateOptions{}); err != nil {
return fmt.Errorf("updating RemoteCluster %q: %w", name, err)
if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
latest, err := h.client.RemoteClusters().Get(ctx, name, metav1.GetOptions{})
if err != nil {
return err
}
latest.Spec = want.Spec
_, err = h.client.RemoteClusters().Update(ctx, latest, metav1.UpdateOptions{})
return err
}); err != nil {
return fmt.Errorf("updating RemoteCluster %q: %w", name, err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/healthcheck.go` around lines 65 - 67, The reconcile path
in healthcheckCRManager.reconcile() writes got.Spec = want.Spec then calls
h.client.RemoteClusters().Update(...) which can fail with 409s if the probe pod
updates status between List and Update; wrap the Update call in a
retry-on-conflict loop (use retry.RetryOnConflict) that re-gets the latest
RemoteCluster (h.client.RemoteClusters().Get), applies the desired Spec (assign
want.Spec to the fetched object's Spec), and retries the Update until success or
a non-conflict error, preserving the existing metav1.UpdateOptions usage.

w.WriteHeader(http.StatusOK)
fmt.Fprint(w, "ok")
})
server := &http.Server{Addr: ":8080", Handler: mux}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add ReadHeaderTimeout to prevent Slowloris attacks.

The HTTP server lacks ReadHeaderTimeout, leaving it vulnerable to slow header attacks that can exhaust connections.

🔒 Proposed fix
-	server := &http.Server{Addr: ":8080", Handler: mux}
+	server := &http.Server{
+		Addr:              ":8080",
+		Handler:           mux,
+		ReadHeaderTimeout: 10 * time.Second,
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
server := &http.Server{Addr: ":8080", Handler: mux}
server := &http.Server{
Addr: ":8080",
Handler: mux,
ReadHeaderTimeout: 10 * time.Second,
}
🧰 Tools
🪛 golangci-lint (2.12.2)

[medium] 44-44: G112: Potential Slowloris Attack because ReadHeaderTimeout is not configured in the http.Server

(gosec)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/probe.go` at line 44, The http.Server instantiation
(server := &http.Server{Addr: ":8080", Handler: mux}) lacks a ReadHeaderTimeout,
making it vulnerable to Slowloris; update the server struct to include a
sensible ReadHeaderTimeout (e.g., a few seconds) alongside Addr and Handler so
the server closes connections that stall while sending headers (modify the
&http.Server literal where server is created).

Comment on lines +177 to +181
if consecutiveFailures >= unhealthyThreshold {
status.State = "Unhealthy"
} else {
status.State = "Healthy"
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

State shows "Healthy" with errors until threshold is reached.

When a probe fails but consecutiveFailures < unhealthyThreshold, the status is set to "Healthy" with Errors populated. This creates a confusing state where the CR appears healthy but contains error messages. Consider preserving the previous state or introducing an intermediate state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/probe.go` around lines 177 - 181, The current probe
logic unconditionally sets status.State to "Healthy" whenever
consecutiveFailures < unhealthyThreshold, which hides errors; change the logic
in the probe result handler that updates status.State (the block using
consecutiveFailures, unhealthyThreshold and status.State) to avoid overwriting
the prior state when there are Errors—either preserve the existing status.State
if Errors is non-empty or use an intermediate state such as
"Degraded"/"UnhealthyPending" instead of "Healthy"; ensure you handle the
initial/zero-value state (set to "Healthy" only when there are no Errors and
consecutiveFailures == 0) so the code that writes status.Errors and status.State
remains consistent.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
pkg/controllers/c2cc/probe.go (1)

44-44: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set ReadHeaderTimeout on the probe HTTP server.

This keeps the endpoint from being vulnerable to slow-header connection exhaustion.

Suggested patch
-	server := &http.Server{Addr: ":8080", Handler: mux}
+	server := &http.Server{
+		Addr:              ":8080",
+		Handler:           mux,
+		ReadHeaderTimeout: 10 * time.Second,
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/probe.go` at line 44, The probe HTTP server
instantiation (the http.Server assigned to variable server in probe.go:
http.Server{Addr: ":8080", Handler: mux}) lacks a ReadHeaderTimeout which makes
it vulnerable to slow-header attacks; set the ReadHeaderTimeout field (e.g.,
time.Second * 5 or a configurable value) on that http.Server struct so the
server closes connections that take too long to send request headers.
pkg/controllers/c2cc/healthcheck.go (1)

65-67: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add retry-on-conflict around RemoteCluster updates.

Update() can hit 409 conflicts when probe status writers update the same object between list and update.

Suggested patch
 import (
 	"context"
 	"fmt"
 	"net"
 	"strings"

 	microshiftv1alpha1 "github.com/openshift/microshift/pkg/apis/microshift/v1alpha1"
 	"github.com/openshift/microshift/pkg/config"
 	microshiftclient "github.com/openshift/microshift/pkg/generated/clientset/versioned/typed/microshift/v1alpha1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/util/retry"
 	"k8s.io/klog/v2"
 )
@@
-		got.Spec = want.Spec
-		if _, err := h.client.RemoteClusters().Update(ctx, got, metav1.UpdateOptions{}); err != nil {
+		if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
+			latest, err := h.client.RemoteClusters().Get(ctx, name, metav1.GetOptions{})
+			if err != nil {
+				return err
+			}
+			latest.Spec = want.Spec
+			_, err = h.client.RemoteClusters().Update(ctx, latest, metav1.UpdateOptions{})
+			return err
+		}); err != nil {
 			return fmt.Errorf("updating RemoteCluster %q: %w", name, err)
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/healthcheck.go` around lines 65 - 67, The RemoteCluster
update can fail with 409 conflicts; wrap the update in a retry-on-conflict loop
(use k8s.io/client-go/util/retry.RetryOnConflict) around
h.client.RemoteClusters().Update, re-get the latest object inside the retry, set
got.Spec = want.Spec each attempt, then call Update(ctx, got,
metav1.UpdateOptions{}), and return the final error if retries exhaust;
reference the variables/functions got, want, and
h.client.RemoteClusters().Update to locate where to apply the loop.
🧹 Nitpick comments (1)
pkg/controllers/c2cc/deploy_probe.go (1)

59-59: ⚡ Quick win

Lower repetitive success logging to verbose level.

This path runs every reconcile, so Infof will generate noisy logs over time.

Suggested change
-	klog.Infof("C2CC probe assets deployed (probe ClusterIP=%s)", probeIP)
+	klog.V(4).Infof("C2CC probe assets deployed (probe ClusterIP=%s)", probeIP)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/deploy_probe.go` at line 59, The success log in
deploy_probe.go currently uses klog.Infof("C2CC probe assets deployed (probe
ClusterIP=%s)", probeIP) which runs every reconcile and is noisy; change this to
a verbose-level log (e.g., klog.V(2).Infof or appropriate verbosity) so normal
Info logs are not flooded—locate the klog.Infof call in the deploy probe
reconciliation (symbol: the logging call in deploy_probe.go that prints probe
ClusterIP) and replace it with a klog.V(x).Infof call using an appropriate
verbosity level.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@assets/components/c2cc/deployment.yaml`:
- Around line 35-40: Add a readinessProbe alongside the existing livenessProbe
so the Service only routes traffic after the app is ready: add a readinessProbe
entry (mirroring the livenessProbe's httpGet path: / and port: 8080) and
configure sensible timings (e.g., initialDelaySeconds, periodSeconds,
successThreshold/failureThreshold) to gate traffic until the probe server is
listening; update the same Kubernetes manifest section containing livenessProbe
to include readinessProbe.

In `@pkg/controllers/c2cc/probe.go`:
- Around line 61-86: The call to informer.AddEventHandler currently ignores its
returned error, so update the code where informer.AddEventHandler(...) is
invoked to capture the returned error and fail fast on error (e.g., return the
error from the surrounding function or call log.Fatalf/process exit), rather
than silently proceeding; keep the existing handlers (AddFunc calling
pm.startProbe, UpdateFunc calling pm.restartProbe, DeleteFunc calling
pm.stopProbe) intact and reference this change around the
informer.AddEventHandler invocation to ensure event-handler registration errors
are handled and propagated.

In `@test/resources/c2cc.resource`:
- Around line 286-294: The current inline Python command uses n[11] which raises
IndexError for networks with fewer than 12 addresses; modify the command built
in the ${cmd} variable to first compute n = ipaddress.ip_network('${cidr}',
strict=False) and check n.num_addresses > 11 (or >=12) before accessing n[11];
if the check fails, print a clear error message and exit with a non-zero status
so the Process.Run call (python3 -c ${cmd}) returns a failing rc and the Robot
keyword surfaces a readable validation error instead of an IndexError traceback.

---

Duplicate comments:
In `@pkg/controllers/c2cc/healthcheck.go`:
- Around line 65-67: The RemoteCluster update can fail with 409 conflicts; wrap
the update in a retry-on-conflict loop (use
k8s.io/client-go/util/retry.RetryOnConflict) around
h.client.RemoteClusters().Update, re-get the latest object inside the retry, set
got.Spec = want.Spec each attempt, then call Update(ctx, got,
metav1.UpdateOptions{}), and return the final error if retries exhaust;
reference the variables/functions got, want, and
h.client.RemoteClusters().Update to locate where to apply the loop.

In `@pkg/controllers/c2cc/probe.go`:
- Line 44: The probe HTTP server instantiation (the http.Server assigned to
variable server in probe.go: http.Server{Addr: ":8080", Handler: mux}) lacks a
ReadHeaderTimeout which makes it vulnerable to slow-header attacks; set the
ReadHeaderTimeout field (e.g., time.Second * 5 or a configurable value) on that
http.Server struct so the server closes connections that take too long to send
request headers.

---

Nitpick comments:
In `@pkg/controllers/c2cc/deploy_probe.go`:
- Line 59: The success log in deploy_probe.go currently uses klog.Infof("C2CC
probe assets deployed (probe ClusterIP=%s)", probeIP) which runs every reconcile
and is noisy; change this to a verbose-level log (e.g., klog.V(2).Infof or
appropriate verbosity) so normal Info logs are not flooded—locate the klog.Infof
call in the deploy probe reconciliation (symbol: the logging call in
deploy_probe.go that prints probe ClusterIP) and replace it with a
klog.V(x).Infof call using an appropriate verbosity level.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 03ed3a56-44f9-4fcb-accc-ec1e9217620d

📥 Commits

Reviewing files that changed from the base of the PR and between 06c24c1 and 89f74e6.

⛔ Files ignored due to path filters (1)
  • pkg/apis/microshift/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
📒 Files selected for processing (26)
  • assets/components/c2cc/clusterrole.yaml
  • assets/components/c2cc/clusterrolebinding.yaml
  • assets/components/c2cc/deployment.yaml
  • assets/components/c2cc/namespace.yaml
  • assets/components/c2cc/service.yaml
  • assets/components/c2cc/serviceaccount.yaml
  • assets/crd/microshift.io_remoteclusters.yaml
  • cmd/generate-config/config/config-openapi-spec.json
  • cmd/microshift/main.go
  • docs/user/howto_config.md
  • packaging/microshift/config.yaml
  • pkg/apis/microshift/v1alpha1/types.go
  • pkg/assets/crd.go
  • pkg/cmd/c2cc_probe.go
  • pkg/config/c2cc.go
  • pkg/config/c2cc_test.go
  • pkg/config/config.go
  • pkg/controllers/c2cc/controller.go
  • pkg/controllers/c2cc/deploy_probe.go
  • pkg/controllers/c2cc/healthcheck.go
  • pkg/controllers/c2cc/healthcheck_test.go
  • pkg/controllers/c2cc/probe.go
  • scripts/auto-rebase/assets.yaml
  • test/resources/c2cc.resource
  • test/suites/c2cc/healthcheck.robot
  • test/suites/c2cc/probe.robot
✅ Files skipped from review due to trivial changes (2)
  • docs/user/howto_config.md
  • cmd/generate-config/config/config-openapi-spec.json

Comment on lines +35 to +40
livenessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a readiness probe for endpoint gating.

Without readiness, the Service can route traffic before the probe server is actually listening, which can cause avoidable startup failures.

Suggested change
         livenessProbe:
           httpGet:
             path: /
             port: 8080
           initialDelaySeconds: 10
           periodSeconds: 10
+        readinessProbe:
+          httpGet:
+            path: /
+            port: 8080
+          initialDelaySeconds: 2
+          periodSeconds: 5
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/components/c2cc/deployment.yaml` around lines 35 - 40, Add a
readinessProbe alongside the existing livenessProbe so the Service only routes
traffic after the app is ready: add a readinessProbe entry (mirroring the
livenessProbe's httpGet path: / and port: 8080) and configure sensible timings
(e.g., initialDelaySeconds, periodSeconds, successThreshold/failureThreshold) to
gate traffic until the probe server is listening; update the same Kubernetes
manifest section containing livenessProbe to include readinessProbe.

Comment on lines +61 to +86
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
if rc, ok := obj.(*microshiftv1alpha1.RemoteCluster); ok {
pm.startProbe(ctx, rc)
}
},
UpdateFunc: func(oldObj, newObj interface{}) {
oldRC, ok1 := oldObj.(*microshiftv1alpha1.RemoteCluster)
newRC, ok2 := newObj.(*microshiftv1alpha1.RemoteCluster)
if ok1 && ok2 && (oldRC.Spec.ProbeTarget != newRC.Spec.ProbeTarget ||
oldRC.Spec.ProbeInterval != newRC.Spec.ProbeInterval) {
pm.restartProbe(ctx, newRC)
}
},
DeleteFunc: func(obj interface{}) {
rc, ok := obj.(*microshiftv1alpha1.RemoteCluster)
if !ok {
if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
rc, _ = tombstone.Obj.(*microshiftv1alpha1.RemoteCluster)
}
}
if rc != nil {
pm.stopProbe(rc.Name)
}
},
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show relevant section with line numbers
sed -n '1,140p' pkg/controllers/c2cc/probe.go | nl -ba | sed -n '40,120p'

# Confirm signature/usage elsewhere in repo
rg -n "AddEventHandler\(" pkg/controllers/c2cc/probe.go pkg/controllers/c2cc -S || true
rg -n "failed to add RemoteCluster informer handlers" -S . || true

# Show imports to see if fmt is already available
sed -n '1,80p' pkg/controllers/c2cc/probe.go | nl -ba

Repository: openshift/microshift

Length of output: 107


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Print relevant portion with line numbers (using awk since nl isn't available)
awk 'NR>=40 && NR<=120 {printf "%d\t%s\n", NR, $0}' pkg/controllers/c2cc/probe.go

echo "----"
# Search for AddEventHandler usage in this package
rg -n "AddEventHandler\\(" pkg/controllers/c2cc -S || true

echo "----"
# Search for fmt usage in this file to see if fmt is already imported/used
rg -n "\\bfmt\\." pkg/controllers/c2cc/probe.go -S || true

echo "----"
# Print import block to see if fmt is present
awk 'NR>=1 && NR<=80 {print}' pkg/controllers/c2cc/probe.go

Repository: openshift/microshift

Length of output: 5881


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the generated informer file for RemoteClusters
fd -a "remoteclusters" pkg/generated/informers/externalversions -S || true
fd -a "RemoteClusters" pkg/generated/informers/externalversions -S || true

# Search for the Informer() method definition returning the informer type
rg -n "RemoteClusters\\(\\)\\.?Informer\\(" pkg/generated/informers/externalversions -S

# Show the specific Informer() implementation + returned type
rg -n "func \\(.*\\) Informer\\(\\) cache\\.SharedIndexInformer" pkg/generated/informers/externalversions -S || true
rg -n "func \\(.*\\) Informer\\(\\)" pkg/generated/informers/externalversions -S | rg -n "RemoteClusters" || true

# Find the AddEventHandler method signature in the client-go source if present in module cache
# (best-effort; ignore failures if go env isn't available)
if command -v go >/dev/null 2>&1; then
  GOPATH="$(go env GOPATH 2>/dev/null || true)"
  if [ -n "$GOPATH" ] && [ -d "$GOPATH/pkg/mod" ]; then
    rg -n "func \\(.*\\) AddEventHandler\\(" "$GOPATH/pkg/mod" -S --hidden --no-ignore-vcs 2>/dev/null \
      | rg -n "ResourceEventHandler" | head -n 20
  fi
fi

Repository: openshift/microshift

Length of output: 761


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== In-repo: find SharedIndexInformer interface definition =="
rg -n "type +SharedIndexInformer +interface" -S . || true

echo "== In-repo: find AddEventHandler signature on SharedIndexInformer =="
rg -n "SharedIndexInformer.*AddEventHandler|AddEventHandler\\(handler" -S . || true

echo "== In GOPATH pkg/mod: find SharedIndexInformer interface =="
if command -v go >/dev/null 2>&1; then
  GOPATH="$(go env GOPATH 2>/dev/null || true)"
  if [ -n "${GOPATH:-}" ] && [ -d "$GOPATH/pkg/mod" ]; then
    rg -n "type +SharedIndexInformer +interface" "$GOPATH/pkg/mod" -S --no-mmap 2>/dev/null | head -n 20 || true
  else
    echo "GOPATH pkg/mod not found"
  fi
else
  echo "go not available"
fi

echo "== In GOPATH pkg/mod: find AddEventHandler signature =="
if command -v go >/dev/null 2>&1; then
  GOPATH="$(go env GOPATH 2>/dev/null || true)"
  if [ -n "${GOPATH:-}" ] && [ -d "$GOPATH/pkg/mod" ]; then
    rg -n "AddEventHandler\\(.*ResourceEventHandler" "$GOPATH/pkg/mod" -S --no-mmap 2>/dev/null \
      | rg -n "SharedIndexInformer|cache\\.SharedIndexInformer|type .*SharedIndexInformer" \
      | head -n 50 || true
  fi
fi

echo "== Show generated Informer() return type for RemoteClusters =="
sed -n '60,110p' pkg/generated/informers/externalversions/microshift/v1alpha1/remotecluster.go

Repository: openshift/microshift

Length of output: 9804


Handle RemoteCluster informer event-handler registration errors

cache.SharedIndexInformer.AddEventHandler(...) returns an (…, error) but pkg/controllers/c2cc/probe.go ignores the error, so probe event handling could be silently disabled. Fail fast if registration fails.

Suggested patch
-	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
+	if _, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
 		AddFunc: func(obj interface{}) {
 			if rc, ok := obj.(*microshiftv1alpha1.RemoteCluster); ok {
 				pm.startProbe(ctx, rc)
 			}
 		},
 		UpdateFunc: func(oldObj, newObj interface{}) {
 			oldRC, ok1 := oldObj.(*microshiftv1alpha1.RemoteCluster)
 			newRC, ok2 := newObj.(*microshiftv1alpha1.RemoteCluster)
 			if ok1 && ok2 && (oldRC.Spec.ProbeTarget != newRC.Spec.ProbeTarget ||
 				oldRC.Spec.ProbeInterval != newRC.Spec.ProbeInterval) {
 				pm.restartProbe(ctx, newRC)
 			}
 		},
 		DeleteFunc: func(obj interface{}) {
 			rc, ok := obj.(*microshiftv1alpha1.RemoteCluster)
 			if !ok {
 				if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
 					rc, _ = tombstone.Obj.(*microshiftv1alpha1.RemoteCluster)
 				}
 			}
 			if rc != nil {
 				pm.stopProbe(rc.Name)
 			}
 		},
-	})
+	}); err != nil {
+		return fmt.Errorf("failed to add RemoteCluster informer handlers: %w", err)
+	}
🧰 Tools
🪛 golangci-lint (2.12.2)

[error] 61-61: Error return value of informer.AddEventHandler is not checked

(errcheck)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/probe.go` around lines 61 - 86, The call to
informer.AddEventHandler currently ignores its returned error, so update the
code where informer.AddEventHandler(...) is invoked to capture the returned
error and fail fast on error (e.g., return the error from the surrounding
function or call log.Fatalf/process exit), rather than silently proceeding; keep
the existing handlers (AddFunc calling pm.startProbe, UpdateFunc calling
pm.restartProbe, DeleteFunc calling pm.stopProbe) intact and reference this
change around the informer.AddEventHandler invocation to ensure event-handler
registration errors are handled and propagated.

Comment on lines +286 to +294
Compute 11th IP
[Documentation] Return the 11th host address in a CIDR (e.g. 10.43.0.0/16 -> 10.43.0.11).
[Arguments] ${cidr}
VAR ${cmd}= import ipaddress; n=ipaddress.ip_network('${cidr}', strict=False); print(n[11])
${result}= Process.Run Process python3 -c ${cmd}
Should Be Equal As Integers ${result.rc} 0
${ip}= Strip String ${result.stdout}
RETURN ${ip}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add bounds check for small CIDRs.

Accessing n[11] will raise IndexError if the network has fewer than 12 addresses (e.g., a /29 CIDR has only 8). While service CIDRs are typically large enough, adding a validation ensures clear error messages if a small CIDR is passed.

🛡️ Proposed fix to validate network size
 Compute 11th IP
     [Documentation]    Return the 11th host address in a CIDR (e.g. 10.43.0.0/16 -> 10.43.0.11).
     [Arguments]    ${cidr}
-    VAR    ${cmd}=    import ipaddress; n=ipaddress.ip_network('${cidr}', strict=False); print(n[11])
+    VAR    ${cmd}=    import ipaddress; n=ipaddress.ip_network('${cidr}', strict=False); assert n.num_addresses > 11, f"CIDR {n} too small"; print(n[11])
     ${result}=    Process.Run Process    python3    -c    ${cmd}
     Should Be Equal As Integers    ${result.rc}    0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/resources/c2cc.resource` around lines 286 - 294, The current inline
Python command uses n[11] which raises IndexError for networks with fewer than
12 addresses; modify the command built in the ${cmd} variable to first compute n
= ipaddress.ip_network('${cidr}', strict=False) and check n.num_addresses > 11
(or >=12) before accessing n[11]; if the check fails, print a clear error
message and exit with a non-zero status so the Process.Run call (python3 -c
${cmd}) returns a failing rc and the Robot keyword surfaces a readable
validation error instead of an IndexError traceback.

pmtk and others added 14 commits May 25, 2026 15:56
The probe pod needs a place to report health state per remote cluster.
Add State (enum: NeverProbed/Healthy/Unhealthy with NeverProbed as
default), LastSuccessfulProbe, LastProbeTime, and Errors fields to
RemoteClusterStatus. Latency statistics are deferred to a later ticket.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The probe pod has a dual role: it listens on :8080 as a probe target
for remote clusters, and watches RemoteCluster CRs to actively probe
remote probe services. Per-CR goroutines perform HTTP GET at the
configured interval and update CR status with state transitions
(NeverProbed -> Healthy -> Unhealthy after 3 consecutive failures).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hidden subcommand invoked by the probe pod deployment as
'microshift c2cc-probe'. Delegates to c2cc.RunProbe() which
serves as a probe target and actively probes remote clusters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Namespace, ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment,
and Service for the C2CC probe pod. The Deployment mounts the host
microshift binary and runs 'microshift c2cc-probe'. The Service gets
a deterministic ClusterIP (11th IP of local service CIDR) so remote
clusters can target it as a probe endpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After the RemoteCluster CRD is established, the controller applies the
probe pod manifests (namespace, SA, RBAC, deployment, service). The
probe service gets the 11th IP of the local service CIDR as its
deterministic ClusterIP so remote clusters can target it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When C2CC config is removed, cleanupAll() now deletes the probe pod
namespace (cascading to SA, Deployment, Service), ClusterRole, and
ClusterRoleBinding before removing the RemoteCluster CRD.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move deployProbe() into fullReconcile() so the probe deployment is
re-applied every 10s. If the deployment is deleted or scaled down
at runtime, the next reconcile cycle recreates it. ApplyDeployments
is idempotent so this is a no-op when the deployment already matches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests verify probe namespace, deployment readiness, service ClusterIP
assignment, RemoteCluster status transitions, and self-healing after
deployment deletion or scale-down.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
UpdateFunc unconditionally restarted probes on every RemoteCluster
update, including status updates from the probe loop itself. This
created a feedback loop that reset consecutiveFailures to 0 on
every tick, preventing clusters from ever reaching the unhealthy
threshold. Only restart probes when spec fields actually change.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fillDefaults() returns an error but the tests were ignoring it.
Use require.NoError so tests fail fast with a clear message if
default initialization breaks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The keyword was duplicated in both probe.robot and
07-healthcheck.robot. Move it to c2cc.resource which both
files already import.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (3)
pkg/controllers/c2cc/healthcheck.go (1)

65-67: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Retry on conflict for spec updates.

The update path can fail on transient resourceVersion conflicts when status is updated concurrently; retrying avoids reconcile aborts.

Suggested patch
 import (
 	"context"
 	"fmt"
 	"net"
 	"strings"

 	microshiftv1alpha1 "github.com/openshift/microshift/pkg/apis/microshift/v1alpha1"
 	"github.com/openshift/microshift/pkg/config"
 	microshiftclient "github.com/openshift/microshift/pkg/generated/clientset/versioned/typed/microshift/v1alpha1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/util/retry"
 	"k8s.io/klog/v2"
 )
@@
-		got.Spec = want.Spec
-		if _, err := h.client.RemoteClusters().Update(ctx, got, metav1.UpdateOptions{}); err != nil {
+		if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
+			latest, err := h.client.RemoteClusters().Get(ctx, name, metav1.GetOptions{})
+			if err != nil {
+				return err
+			}
+			latest.Spec = want.Spec
+			_, err = h.client.RemoteClusters().Update(ctx, latest, metav1.UpdateOptions{})
+			return err
+		}); err != nil {
 			return fmt.Errorf("updating RemoteCluster %q: %w", name, err)
 		}
#!/bin/bash
set -euo pipefail

echo "Verify current reconcile update path (no retry loop):"
awk 'NR>=60 && NR<=72 {printf "%d:%s\n", NR, $0}' pkg/controllers/c2cc/healthcheck.go

echo "Verify concurrent status writer exists:"
rg -n --type=go 'RemoteClusters\(\)\.UpdateStatus\(' pkg/controllers/c2cc/probe.go

echo "Verify RetryOnConflict is not currently used in healthcheck manager:"
rg -n --type=go 'RetryOnConflict' pkg/controllers/c2cc/healthcheck.go || true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/healthcheck.go` around lines 65 - 67, The update
currently sets got.Spec = want.Spec and calls
h.client.RemoteClusters().Update(...) which can fail with resourceVersion
conflicts; wrap the update in a retry loop using
k8s.io/apimachinery/pkg/util/wait.RetryOnConflict: inside the loop re-get the
latest RemoteCluster (using h.client.RemoteClusters().Get), set its Spec to
want.Spec, then call Update and return on success, propagating non-conflict
errors; reference the existing symbols got.Spec = want.Spec,
h.client.RemoteClusters().Update, and use RetryOnConflict to implement the retry
semantics.
pkg/controllers/c2cc/probe.go (2)

61-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle AddEventHandler registration errors.

If handler registration fails, probe management silently never starts correctly.

Suggested patch
-	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
+	if _, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
 		AddFunc: func(obj interface{}) {
 			if rc, ok := obj.(*microshiftv1alpha1.RemoteCluster); ok {
 				pm.startProbe(ctx, rc)
 			}
 		},
@@
 		DeleteFunc: func(obj interface{}) {
@@
 		},
-	})
+	}); err != nil {
+		return fmt.Errorf("failed to add RemoteCluster informer handlers: %w", err)
+	}
#!/bin/bash
set -euo pipefail

echo "Inspect AddEventHandler call site:"
awk 'NR>=58 && NR<=90 {printf "%d:%s\n", NR, $0}' pkg/controllers/c2cc/probe.go

echo "Confirm return value is not captured:"
rg -n --type=go '^\s*informer\.AddEventHandler\(' pkg/controllers/c2cc/probe.go
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/probe.go` around lines 61 - 86, The AddEventHandler call
in probe.go is currently un-checked so failures to register handlers are
ignored; modify the code that registers the informer handlers (the
informer.AddEventHandler call in the same function that calls
pm.startProbe/pm.restartProbe/pm.stopProbe) to capture and handle any
registration error—i.e., call the registration variant that returns an error (or
wrap the registration in a helper that validates success), and on error log the
failure via the controller logger and return the error (or propagate it) so
probe management fails fast instead of silently never starting.

44-44: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set ReadHeaderTimeout on the probe HTTP server.

This leaves the listener exposed to slow-header connection abuse.

Suggested patch
-	server := &http.Server{Addr: ":8080", Handler: mux}
+	server := &http.Server{
+		Addr:              ":8080",
+		Handler:           mux,
+		ReadHeaderTimeout: 10 * time.Second,
+	}
#!/bin/bash
set -euo pipefail

echo "Inspect server construction:"
awk 'NR>=39 && NR<=48 {printf "%d:%s\n", NR, $0}' pkg/controllers/c2cc/probe.go

echo "Check whether ReadHeaderTimeout is present:"
rg -n --type=go 'ReadHeaderTimeout' pkg/controllers/c2cc/probe.go || true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/probe.go` at line 44, The probe HTTP server is created
with server := &http.Server{Addr: ":8080", Handler: mux} and lacks a
ReadHeaderTimeout, leaving it vulnerable to slow-header attacks; update the
http.Server construction for the probe (the variable named server) to include a
ReadHeaderTimeout (e.g., 5s or another suitable duration defined in a
constant/config) so the server rejects connections that take too long to send
headers, and ensure the chosen timeout is applied where the probe server is
started.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@assets/crd/microshift.io_remoteclusters.yaml`:
- Around line 43-47: Add schema-level admission validation for
spec.probeInterval: update the CRD's openAPIV3Schema for the spec.probeInterval
property to include an x-kubernetes-validations rule that enforces a duration
between 1s and 5m (e.g. a CEL expression that parses the duration and asserts
seconds >= 1 && seconds <= 300) and a pattern fallback to only accept valid
duration strings (digits + unit like s, m). Target the spec.probeInterval
property in the CRD schema and include a clear message for rejection.

In `@Makefile`:
- Around line 349-352: The generate-crds Makefile target is currently leaf-only
and not invoked by the broader generate/verify flows; update the top-level
targets (e.g., the generate target and verify/verify-config targets used by make
and make verify) to depend on generate-crds so CRD and client generation runs
automatically—modify the prerequisites for the primary generate target and the
verify/verify-config targets to include generate-crds (references: Makefile
targets "generate-crds", "generate", "verify", "verify-config") so stale
CRDs/clients are caught by the normal make/verify pipelines.

In `@pkg/controllers/c2cc/deploy_probe.go`:
- Around line 26-29: The code accesses c.cfg.Network.ServiceNetwork[0] without
checking length which will panic if the slice is empty; before calling
net.ParseCIDR, add a guard that validates len(c.cfg.Network.ServiceNetwork) > 0
and return a clear error (e.g. "service network not configured") if empty, then
proceed to call net.ParseCIDR using the first element and preserve the existing
error wrapping for ParseCIDR failures (variables involved:
c.cfg.Network.ServiceNetwork, net.ParseCIDR, svcNet, err).

---

Duplicate comments:
In `@pkg/controllers/c2cc/healthcheck.go`:
- Around line 65-67: The update currently sets got.Spec = want.Spec and calls
h.client.RemoteClusters().Update(...) which can fail with resourceVersion
conflicts; wrap the update in a retry loop using
k8s.io/apimachinery/pkg/util/wait.RetryOnConflict: inside the loop re-get the
latest RemoteCluster (using h.client.RemoteClusters().Get), set its Spec to
want.Spec, then call Update and return on success, propagating non-conflict
errors; reference the existing symbols got.Spec = want.Spec,
h.client.RemoteClusters().Update, and use RetryOnConflict to implement the retry
semantics.

In `@pkg/controllers/c2cc/probe.go`:
- Around line 61-86: The AddEventHandler call in probe.go is currently
un-checked so failures to register handlers are ignored; modify the code that
registers the informer handlers (the informer.AddEventHandler call in the same
function that calls pm.startProbe/pm.restartProbe/pm.stopProbe) to capture and
handle any registration error—i.e., call the registration variant that returns
an error (or wrap the registration in a helper that validates success), and on
error log the failure via the controller logger and return the error (or
propagate it) so probe management fails fast instead of silently never starting.
- Line 44: The probe HTTP server is created with server := &http.Server{Addr:
":8080", Handler: mux} and lacks a ReadHeaderTimeout, leaving it vulnerable to
slow-header attacks; update the http.Server construction for the probe (the
variable named server) to include a ReadHeaderTimeout (e.g., 5s or another
suitable duration defined in a constant/config) so the server rejects
connections that take too long to send headers, and ensure the chosen timeout is
applied where the probe server is started.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 928d58d0-49a8-4e1a-b7d7-a07340e7adee

📥 Commits

Reviewing files that changed from the base of the PR and between 89f74e6 and 00bf8b1.

⛔ Files ignored due to path filters (22)
  • pkg/apis/microshift/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
  • pkg/generated/clientset/versioned/clientset.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/fake/clientset_generated.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/fake/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/fake/register.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/scheme/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/scheme/register.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/fake/doc.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/fake/fake_microshift_client.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/fake/fake_remotecluster.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/generated_expansion.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/microshift_client.go is excluded by !**/generated/**
  • pkg/generated/clientset/versioned/typed/microshift/v1alpha1/remotecluster.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/factory.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/generic.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/internalinterfaces/factory_interfaces.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/microshift/interface.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/microshift/v1alpha1/interface.go is excluded by !**/generated/**
  • pkg/generated/informers/externalversions/microshift/v1alpha1/remotecluster.go is excluded by !**/generated/**
  • pkg/generated/listers/microshift/v1alpha1/expansion_generated.go is excluded by !**/generated/**
  • pkg/generated/listers/microshift/v1alpha1/remotecluster.go is excluded by !**/generated/**
📒 Files selected for processing (31)
  • Makefile
  • assets/components/c2cc/clusterrole.yaml
  • assets/components/c2cc/clusterrolebinding.yaml
  • assets/components/c2cc/deployment.yaml
  • assets/components/c2cc/namespace.yaml
  • assets/components/c2cc/service.yaml
  • assets/components/c2cc/serviceaccount.yaml
  • assets/crd/microshift.io_remoteclusters.yaml
  • cmd/generate-config/config/config-openapi-spec.json
  • cmd/microshift/main.go
  • docs/user/howto_config.md
  • packaging/microshift/config.yaml
  • pkg/apis/microshift/v1alpha1/doc.go
  • pkg/apis/microshift/v1alpha1/groupversion_info.go
  • pkg/apis/microshift/v1alpha1/types.go
  • pkg/assets/crd.go
  • pkg/cmd/c2cc_probe.go
  • pkg/config/c2cc.go
  • pkg/config/c2cc_test.go
  • pkg/config/config.go
  • pkg/controllers/c2cc/controller.go
  • pkg/controllers/c2cc/deploy_probe.go
  • pkg/controllers/c2cc/healthcheck.go
  • pkg/controllers/c2cc/healthcheck_test.go
  • pkg/controllers/c2cc/probe.go
  • scripts/auto-rebase/assets.yaml
  • scripts/boilerplate.go.txt
  • scripts/generate-crds.sh
  • test/resources/c2cc.resource
  • test/suites/c2cc/healthcheck.robot
  • test/suites/c2cc/probe.robot
✅ Files skipped from review due to trivial changes (4)
  • pkg/apis/microshift/v1alpha1/doc.go
  • scripts/auto-rebase/assets.yaml
  • docs/user/howto_config.md
  • assets/components/c2cc/namespace.yaml

Comment on lines +43 to +47
probeInterval:
default: 10s
description: Interval between probe attempts (e.g. "10s", "1m").
type: string
probeTarget:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Enforce spec.probeInterval constraints at admission.

probeInterval is documented as 1s5m, but the CRD schema currently accepts any string. Add schema-level validation so invalid values are rejected early.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@assets/crd/microshift.io_remoteclusters.yaml` around lines 43 - 47, Add
schema-level admission validation for spec.probeInterval: update the CRD's
openAPIV3Schema for the spec.probeInterval property to include an
x-kubernetes-validations rule that enforces a duration between 1s and 5m (e.g. a
CEL expression that parses the duration and asserts seconds >= 1 && seconds <=
300) and a pattern fallback to only accept valid duration strings (digits + unit
like s, m). Target the spec.probeInterval property in the CRD schema and include
a clear message for rejection.

Comment thread Makefile
Comment on lines +349 to +352
.PHONY: generate-crds
generate-crds:
./scripts/fetch_tools.sh controller-gen && \
./scripts/generate-crds.sh
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Wire generate-crds into the default generate/verify flow.

Right now this target is leaf-only, so the visible make, make verify, and make verify-config paths won't catch stale CRDs or generated clients after API changes.

Suggested change
-all: generate-config microshift etcd
+all: generate-config generate-crds microshift etcd
...
-verify-config: generate-config
+verify-config: generate-config generate-crds
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 349 - 352, The generate-crds Makefile target is
currently leaf-only and not invoked by the broader generate/verify flows; update
the top-level targets (e.g., the generate target and verify/verify-config
targets used by make and make verify) to depend on generate-crds so CRD and
client generation runs automatically—modify the prerequisites for the primary
generate target and the verify/verify-config targets to include generate-crds
(references: Makefile targets "generate-crds", "generate", "verify",
"verify-config") so stale CRDs/clients are caught by the normal make/verify
pipelines.

Comment on lines +26 to +29
_, svcNet, err := net.ParseCIDR(c.cfg.Network.ServiceNetwork[0])
if err != nil {
return fmt.Errorf("failed to parse local service network: %w", err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard empty ServiceNetwork before indexing.

Line 26 accesses c.cfg.Network.ServiceNetwork[0] without a length check; an empty list will panic and crash reconciliation.

Suggested fix
 func (c *C2CCRouteManager) deployProbe(ctx context.Context) error {
+	if len(c.cfg.Network.ServiceNetwork) == 0 {
+		return fmt.Errorf("service network is empty")
+	}
 	_, svcNet, err := net.ParseCIDR(c.cfg.Network.ServiceNetwork[0])
 	if err != nil {
 		return fmt.Errorf("failed to parse local service network: %w", err)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_, svcNet, err := net.ParseCIDR(c.cfg.Network.ServiceNetwork[0])
if err != nil {
return fmt.Errorf("failed to parse local service network: %w", err)
}
if len(c.cfg.Network.ServiceNetwork) == 0 {
return fmt.Errorf("service network is empty")
}
_, svcNet, err := net.ParseCIDR(c.cfg.Network.ServiceNetwork[0])
if err != nil {
return fmt.Errorf("failed to parse local service network: %w", err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/c2cc/deploy_probe.go` around lines 26 - 29, The code accesses
c.cfg.Network.ServiceNetwork[0] without checking length which will panic if the
slice is empty; before calling net.ParseCIDR, add a guard that validates
len(c.cfg.Network.ServiceNetwork) > 0 and return a clear error (e.g. "service
network not configured") if empty, then proceed to call net.ParseCIDR using the
first element and preserve the existing error wrapping for ParseCIDR failures
(variables involved: c.cfg.Network.ServiceNetwork, net.ParseCIDR, svcNet, err).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants