diff --git a/.markdownlint-cli2.jsonc b/.markdownlint-cli2.jsonc
index 6aad0f7a9..2048dada6 100644
--- a/.markdownlint-cli2.jsonc
+++ b/.markdownlint-cli2.jsonc
@@ -12,6 +12,7 @@
"**/tests/pytests/**",
"**/knowledge/**",
"**/tests/**",
+ "**/example/gpu-validation-cluster/**",
"**/docs-internal/**"
]
}
diff --git a/example/gpu-validation-cluster/.gitignore b/example/gpu-validation-cluster/.gitignore
new file mode 100644
index 000000000..f4bcf4e4b
--- /dev/null
+++ b/example/gpu-validation-cluster/.gitignore
@@ -0,0 +1,3 @@
+
+# Lab-specific test scaffolding (per-engineer; contains hostnames + cred refs)
+ansible/inventory.test.yml
diff --git a/example/gpu-validation-cluster/README.md b/example/gpu-validation-cluster/README.md
index 04f903746..01a9f09e5 100644
--- a/example/gpu-validation-cluster/README.md
+++ b/example/gpu-validation-cluster/README.md
@@ -4,213 +4,1101 @@ A containerized, one-click deployment solution for validating AMD GPU and AINIC
## Overview
-This project provides an automated, reproducible testing environment for GPU operator functionality. It deploys a complete Kubernetes cluster with AMD GPU and Network operators pre-configured, enabling rapid validation of operator features and performance.
+This project provides an automated, reproducible testing environment for AMD GPU + AMD AINIC validation at fleet scale. A single Ansible run brings up a containerized k3s cluster across N nodes, installs the AMD GPU + Network operators, and deploys the Cluster Validation Framework (CVF) — a CronJob-driven, 5-phase gated pipeline that exercises GPUs, intra-node xGMI, per-NIC RDMA health, per-rail bandwidth, and multi-node RCCL collectives.
+
+```mermaid
+flowchart TB
+ subgraph CN["Control Node (your laptop / jumphost)"]
+ AP["ansible-playbook
(cvf.yml + lifecycle playbooks)"]
+ end
+
+ subgraph SN["Server Node"]
+ SC["docker container 'server'
k3s server (control-plane)"]
+ SC --> GO["AMD GPU Operator
(kube-amd-gpu)"]
+ SC --> NO["AMD Network Operator
(kube-amd-network)"]
+ SC --> MO["MPI Operator
(mpi-operator)"]
+ SC --> CVF["Cluster Validation Framework
CronJob (default */30)"]
+ end
+
+ subgraph AN["Agent Node(s) 1..N"]
+ AC["docker container 'agent'
k3s agent + GPU/NIC device plugins"]
+ end
+
+ AP -->|setup-cluster| SC
+ AP -->|setup-cluster / add-agent-nodes| AC
+ AP -->|cvf.yml -e action=…| SC
+ AC -.k3s join.-> SC
+
+ CVF --> P0["Phase 0: candidate selection"]
+ P0 --> P1["Phase 1: GPU HW acceptance
RVS gst_single + AGFHC xgmi/pcie/hbm"]
+ P1 --> P2["Phase 2: intra-node GPU mesh
RCCL all_reduce (single node)"]
+ P2 --> P3["Phase 3: per-node NIC health
rdma link + ibv_devinfo"]
+ P3 --> P4["Phase 4: pairwise rail bandwidth
ib_write_bw per rail per pair"]
+ P4 --> P5["Phase 5: multi-node RCCL
MPIJob all_reduce_perf"]
+ P5 --> RES["per-node labels:
amd.com/{phase}=passed|failed"]
+```
## Features
-- **Automated Deployment**: Single-command cluster initialization with all operators ready
-- **GPU Operator**: Full AMD GPU device plugin with resource management and scheduling
-- **Network Operator**: AMD network operator for advanced networking and performance optimization
-- **Cluster Validation Framework**: Comprehensive automated tests for both GPU validation and RCCL tests.
-- **Containerized**: Entire stack runs in containers for portability and consistency
+- **One-command multi-node deployment** via Ansible (`playbooks/setup-cluster.yml`)
+- **AMD GPU + Network operators** pre-configured (inbox driver by default; operator-installed driver opt-in)
+- **5-phase validation pipeline** with per-phase skip flags, per-stage timeouts, and per-framework test-runner images
+- **Day-2 operations playbook** (`playbooks/cvf.yml`) — single dispatcher for reapply / reset / inject-secret / trigger / status / fresh-run
+- **Persistent per-phase logs** on each node under `/var/log/cluster-validation/` (Phase 1 test-runner, Phase 2 RCCL, Phase 3 NIC probes, Phase 4 ib_write_bw server+client, Phase 5 RCCL launcher)
+- **Containerized** — entire stack runs in `docker` so the host stays clean
## Quick Start
-There are two ways to deploy the GPU Validation Cluster:
+Multi-node deployment is fully Ansible-driven. The flow is **(1) edit `configs/config.json`** for your environment, then **(2) run two playbooks**: `setup-cluster.yml` for one-time bringup, `cvf.yml` for all day-2 operations.
+
+### Prerequisites
+
+**Host OS (required, every cluster node):**
+
+- **Ubuntu 22.04 LTS** or **Ubuntu 24.04 LTS** — these are the only validated host OS versions. Other distros and other Ubuntu versions are not supported by the bringup playbook (Docker install, kernel-module DKMS, and the operator chart all assume Ubuntu LTS).
+
+**Per cluster node:**
+
+- Docker daemon running (the playbook will install it if missing)
+- `jq` CLI (installed by the playbook if missing)
+- `amdgpu` + `ionic` (and `ionic_rdma`) kernel modules present — either inbox or installed by the AMD operators (see step 0.1)
+- SSH access from the control node (configured by `setup-ssh-keys.yml`)
-1. **Automated (Recommended)**: Using Ansible playbooks for multi-node deployment
-2. **Manual**: Using shell scripts on each node individually
+**Control node (your laptop or jumphost):**
-### Option 1: Automated Deployment with Ansible (Recommended)
+- Ansible 2.9+
+- SSH key to every cluster node (set up via `setup-ssh-keys.yml` if not already in place)
+- Working copy of this repo + an `ansible/inventory.yml` that lists the server + agent nodes
-For deploying a multi-node cluster, use the Ansible automation:
+### Step 0 — Pre-flight configuration
+
+Edit `configs/config.json` **before** running any playbook. The settings drive what the framework installs, runs, and how long it waits.
+
+#### 0.1 Driver source: inbox vs operator-installed
+
+Each AMD operator can either (a) use the host's existing kernel module ("inbox") or (b) install/manage its own driver. Default is **inbox** for both.
+
+```jsonc
+"amd-gpu-operator": {
+ "version": "v1.4.1", // operator chart version
+ "install-amdgpu-driver": false, // false = use host's amdgpu; true = operator installs driver-version
+ "driver-version": "30.30"
+},
+"network-operator": {
+ "version": "v1.0.0",
+ "install-network-operator": true, // optional (default true). false = skip the network operator entirely for a GPU-only cluster
+ "install-ionic-driver": false, // false = use host's ionic_rdma; true = operator installs driver-version
+ "driver-version": "1.117.5-a-56"
+}
+```
+
+> **GPU-only clusters:** set `install-network-operator: false` to skip the
+> network operator, its `NetworkConfig`, and Multus altogether. Only do
+> this when no NIC-dependent phases run — make sure Phases 3, 4, and 5 are
+> skipped in `skip-tests` and that `node-selector-labels` does not require
+> an `amd-nic`/`amd-vnic` label. cert-manager, the GPU operator,
+> mpi-operator, and the CVF CronJob still install normally.
+
+**How to decide:** check what's already loaded on each node before bringup.
```bash
-cd ansible
-./quickstart.sh
+# On every node — confirm GPU + RDMA drivers are present and report a version
+modinfo amdgpu | grep -E '^version|^srcversion' | head -1
+modinfo ionic_rdma | grep -E '^version|^srcversion' | head -1
+lsmod | grep -E '^amdgpu|^ionic'
```
-**Benefits:**
+If both are present and a known-good version, leave `install-*-driver: false`. Otherwise set to `true` and pin `driver-version` to a version supported by your kernel.
-- Automated deployment across all nodes from a single command
-- Automatic installation of prerequisites (Docker, jq)
-- Passwordless SSH setup included
-- One-command teardown and status checking
-- Handles image distribution automatically
+> **Container ↔ kernel ABI matching.** The `roce-workload` image used by Phases 2/4/5 ships its own `libionic` userspace which must support the host kernel's `ionic_rdma` ABI. If you see `libibverbs: Warning: Driver ionic does not support the kernel ABI of N (supports M to M)` in Phase 4 logs, swap to a `roce-workload` image whose ainic version matches the host driver (see step 0.4).
-**See the [Ansible README](ansible/README.md) for detailed instructions.**
+#### 0.2 Node selector labels (physical vs VF-passthrough VM)
-### Option 2: Manual Deployment
+`node-selector-labels` is an AND-list — only nodes carrying **every** label in the array are eligible for cluster validation. The label you choose depends on whether the node exposes physical AMD devices or virtual-function (VF) passthrough devices to a guest VM:
-For manual deployment or single-node testing, follow these steps:
+| Label | Means | Set by |
+|---|---|---|
+| `feature.node.kubernetes.io/amd-gpu=true` | Physical AMD GPU (bare metal / VM with PF-passthrough) | NFD + AMD GPU operator |
+| `feature.node.kubernetes.io/amd-vgpu=true` | **VF-passthrough** AMD GPU inside a guest VM (SR-IOV VF assigned to the VM) | NFD + AMD GPU operator (VM-aware) |
+| `feature.node.kubernetes.io/amd-nic=true` | Physical AMD Pensando NIC (bare metal / VM with PF-passthrough) | NFD + AMD Network operator |
+| `feature.node.kubernetes.io/amd-vnic=true` | **VF-passthrough** AMD Pensando NIC inside a guest VM (SR-IOV VF assigned to the VM) | NFD + AMD Network operator (VM-aware) |
-#### Prerequisites
+Default in `config.json` (physical GPU + physical NIC fleet):
+```jsonc
+"node-selector-labels": [
+ "feature.node.kubernetes.io/amd-gpu=true",
+ "feature.node.kubernetes.io/amd-nic=true"
+]
+```
-- Docker engine installed and daemon is running (validated on Docker 29.1.5 or newer)
-- `jq` CLI for JSON parsing
-- Ubuntu 22.04 or 24.04 host
+**Tune for your fleet:**
-#### Deployment Steps
+| Posture | `node-selector-labels` |
+|---|---|
+| Bare-metal / PF-passthrough VM (default) | `["amd-gpu=true", "amd-nic=true"]` |
+| Fully virtualized (VF GPU + VF NIC inside guest VM) | `["amd-vgpu=true", "amd-vnic=true"]` |
+| Mixed (VF GPU + PF NIC, etc.) | mix and match the four labels |
+| Early bringup, no NIC yet | drop the NIC label: `["amd-gpu=true"]` only |
-1. **Build the container image**
+**Verify what's actually labeled** before changing this:
+```bash
+ansible -i ansible/inventory.yml server-node -m shell -a \
+ "docker exec server kubectl get nodes -L feature.node.kubernetes.io/amd-gpu \
+ -L feature.node.kubernetes.io/amd-vgpu \
+ -L feature.node.kubernetes.io/amd-nic \
+ -L feature.node.kubernetes.io/amd-vnic"
+```
- ```bash
- ./gpu-cluster.sh build
- ```
+NFD applies these labels asynchronously after the corresponding operator's device plugin advertises a resource. If a node shows the device under `kubectl describe node` capacity but the label is missing, the operator hasn't completed its NFD rule rollout yet — wait \~1 min and re-check.
+
+> **Custom (non-NFD) labels must be applied by you.** Only the four
+> `feature.node.kubernetes.io/amd-{gpu,vgpu,nic,vnic}` labels above are
+> set automatically (by NFD). If you set `node-selector-labels` to a
+> custom label such as `["cvf-candidate=true"]`, **nothing applies it for
+> you** — you must label the target nodes yourself, or Phase 0 will match
+> zero nodes and every CronJob tick will skip with
+> `[Node Selection: WARNING] Selector '...' matched 0 nodes`:
+> ```bash
+> docker exec server kubectl label node cvf-candidate=true
+> ```
+> This is a useful pattern when you want to *explicitly* opt nodes into
+> validation rather than auto-selecting every GPU/NIC node.
+
+> **Note:** the orchestrator uses these labels for Phase 0 candidate selection. NIC-requiring phases (Phase 3 / Phase 4) further narrow inside their own per-phase script via intersection with `amd-nic=true` (or `amd-vnic=true` if you set it), so non-NIC nodes pass through GPU-only phases cleanly without entering the NIC phases.
+
+#### 0.3 Which phases / stages to run + estimated runtime
+
+Two layers of skip flags: **per-phase** (5 flags) and **per-Phase-1-stage** (4 flags).
+
+```jsonc
+"skip-tests": {
+ "skip-phase1-gpu-hw-acceptance": true, // Phase 1: RVS gst_single + AGFHC xgmi/pcie/hbm (DEFAULT: SKIPPED, please configure your AGFHC image and pull secrets then enable this phase)
+ "skip-phase2-gpu-mesh-validation": false, // Phase 2: intra-node 8-rank RCCL all_reduce
+ "skip-phase3-nic-validation": false, // Phase 3: per-node NIC health (needs amd-nic=true)
+ "skip-phase4-rail-bandwidth-test": false, // Phase 4: pairwise ib_write_bw per rail
+ "skip-phase5-rccl-test": false, // Phase 5: multi-node RCCL via MPIJob
+ "skip-phase1-stages": { // Only consulted when skip-phase1-gpu-hw-acceptance=false
+ "skip-phase1-gpu-stress": true, // gst_single (RVS) ~15-30 min/node
+ "skip-phase1-xgmi-lvl1": true, // xgmi_lvl1 (AGFHC) ~3-5 min/node (private image)
+ "skip-phase1-pcie-lvl1": true, // pcie_lvl1 (AGFHC) ~3-5 min/node (private image)
+ "skip-phase1-hbm-lvl1": true // hbm_lvl1 (AGFHC) ~3-5 min/node (private image)
+ }
+}
+```
- After building, you have two options to make the image available on all nodes:
-
- - **Option A**: Save and port the image to other nodes:
-
- ```bash
- # On server node: save the image
- docker save gpu-validation-cluster:latest -o gpu-validation-cluster.tar
-
- # Transfer to worker nodes and load:
- scp gpu-validation-cluster.tar user@worker-node:/tmp/
- ssh user@worker-node "docker load -i /tmp/gpu-validation-cluster.tar"
- ```
-
- - **Option B**: Rebuild the image on each worker node:
-
- ```bash
- # Run on each worker node
- ./gpu-cluster.sh build
- ```
-
-2. **Configure cluster validation framework**
-
- Before starting the cluster, edit `configs/config.json` to match your environment. Common configuration options:
-
- **Device Type Selection:**
- - For physical GPUs: `"gpu-type": "amd-gpu"`
- - For SR-IOV VF GPUs (in VMs): `"gpu-type": "amd-vgpu"`
- - For physical NICs: `"nic-type": "amd-nic"`
- - For virtual NICs (in VMs): `"nic-type": "amd-vnic"`
-
- **Resource Configuration:**
-
- ```json
- "cluster-validation-framework": {
- "node-selector-labels": [ // Node selector labels for candidate selection
- "feature.node.kubernetes.io/amd-gpu=true", // GPU label selector
- "feature.node.kubernetes.io/amd-nic=true" // NIC label selector
- ],
- "resources": {
- "worker-replicas": 2, // Number of nodes to validate in parallel
- "gpu-per-worker": 8, // Number of GPUs per node
- "pf-nic-per-worker": 0, // Number of physical function NICs per node
- "vf-nic-per-worker": 8, // Number of virtual function NICs per node
- "slots-per-worker": 8, // MPI ranks per worker
- "node-validation-interval-mins": 10 // Minimum interval between validation runs on same node
- },
- "skip-tests": {
- "skip-gpu-validation": false, // Set to true to skip GPU validation tests (RVS/AGFHC)
- "skip-rccl-test": false // Set to true to skip MPI Job RCCL tests
- }
+The default skips all of Phase 1 (no AGFHC pull secret required out of the box) and runs Phase 2 → 5. Set `skip-phase1-gpu-hw-acceptance: false` then enable individual stages in `skip-phase1-stages` when you want HW acceptance coverage — and remember to inject the AGFHC pull secret (step 0.5) if any of the `xgmi/pcie/hbm-lvl1` stages are enabled.
+
+**Estimated per-phase wallclock (per node, in parallel across nodes):**
+
+| Phase | Stages / contents | Typical | Worst-case (timeout) |
+|---|---|---|---|
+| 1.gpu-stress | RVS gst_single (FP8/FP16/dgemm sweep) | 15-30 min | 60 min |
+| 1.xgmi-lvl1 | AGFHC xGMI link sweep | 3-5 min | 20 min |
+| 1.pcie-lvl1 | AGFHC PCIe lane sweep | 3-5 min | 20 min |
+| 1.hbm-lvl1 | AGFHC HBM ECC sweep | 3-5 min | 20 min |
+| 2 | intra-node 8-rank RCCL all_reduce | 3-8 min | 10 min |
+| 3 | per-node NIC health (count, ip link, rdma link, ibv_devinfo) | 30s-2 min | 5 min |
+| 4 | pairwise `ib_write_bw` × 8 rails per pair (rails serialized) | 1-2 min × 8 = 8-16 min per pair (pairs run in parallel) | 40 min per pair |
+| 5 | multi-node RCCL MPIJob (`all_reduce_perf`) | 1-5 min | 5 min |
+
+**Suggested bringup postures:**
+
+| Posture | Skip flags | Use when |
+|---|---|---|
+| **Network-focused** (default) | P1 skipped, P2+P3+P4+P5 enabled | network fabric + RCCL stack validation; no AGFHC pull secret needed |
+| **GPU HW acceptance only** | P1 enabled (set `skip-phase1-stages.*` to choose stages), P2+P3+P4+P5 skipped | new GPU node bringup; HW burn-in; requires AGFHC pull secret if xgmi/pcie/hbm-lvl1 enabled |
+| **GPU-only smoke** | P1+P2 enabled, P3+P4+P5 skipped | datacenter early, no NIC infrastructure yet |
+| **Full pipeline** | all 5 enabled | full HW + fabric + workload validation; requires AGFHC pull secret |
+
+#### 0.4 Per-step container images
+
+```jsonc
+"images": {
+ "roce-workload": "docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_rccl-7.0.2_anp-v1.2.0_ainic-1.117.5-a-56",
+ "test-runner": {
+ "rvs": "docker.io/rocm/test-runner:v1.4.0", // Phase 1 gpu-stress
+ "agfhc": "docker.io/amdpsdo/test-runner:agfhc-v1.5.0-4" // Phase 1 xgmi/pcie/hbm — PRIVATE, see 0.4
+ },
+ "orchestrator": "docker.io/bitnamilegacy/kubectl:1.33.4", // submit-mpijob, NAD apply
+ "preflight-init": "docker.io/bitnamilegacy/kubectl:1.33.4", // Phase 5 launcher init container
+ "nic-health": "docker.io/rocm/roce-workload:" // Phase 3 NIC probes (now uses ${ROCE_WORKLOAD_IMAGE})
+}
+```
+
+| Image | Used by | Registry |
+|---|---|---|
+| `rocm/test-runner` (RVS) | Phase 1 `gpu-stress` (Recipe: `gst_single`) | public (Docker Hub) |
+| **`amdpsdo/test-runner:agfhc-*`** | Phase 1 `xgmi-lvl1` / `pcie-lvl1` / `hbm-lvl1` | **PRIVATE** — see 0.4 |
+| `rocm/roce-workload` | Phase 2, Phase 3 NIC health Job, Phase 4 server+client, Phase 5 worker/launcher (all four pinned to one `ROCE_WORKLOAD_IMAGE` ConfigMap key) | public |
+| `bitnamilegacy/kubectl` | orchestrator + Phase 5 preflight init | public |
+
+> **roce-workload ↔ kernel ABI:** if Phase 4 fails on every rail with `libibverbs: Warning: Driver ionic does not support the kernel ABI`, your `ainic-*` tag's libionic doesn't match the host's `ionic_rdma` ABI. Try a different tag from the [`rocm/roce-workload` Docker Hub repo](https://hub.docker.com/r/rocm/roce-workload/tags).
+
+#### 0.5 ⚠️ AGFHC requires a pull secret
+
+If **any Phase 1 stage other than `gpu-stress` is enabled** (i.e. `xgmi-lvl1`, `pcie-lvl1`, or `hbm-lvl1`), the orchestrator will pull `docker.io/amdpsdo/test-runner:agfhc-*` — a **private Docker Hub repo**. The pod will sit in `ErrImagePull` until you inject pull credentials.
+
+**Steps:**
+
+1. Request an AGFHC pull token from your AMD representative.
+2. Add it to `configs/config.json`:
+ ```jsonc
+ "global": {
+ "image-pull-secrets": [
+ { "registry-url": "docker.io",
+ "username": "amdpsdo",
+ "token": "dckr_oat_…" // your token, keep out of git
+ }
+ ]
}
```
+3. Inject it into the running cluster:
+ ```bash
+ ansible-playbook -i ansible/inventory.yml ansible/playbooks/cvf.yml -e action=inject-secret
+ ```
+4. (Alternative — keep token out of `config.json`) inject ad-hoc via CLI override:
+ ```bash
+ ansible-playbook -i ansible/inventory.yml ansible/playbooks/cvf.yml -e action=inject-secret \
+ -e username=amdpsdo -e token=dckr_oat_…
+ ```
- **Node Selector Labels:**
- The `node-selector-labels` array defines which nodes are eligible for cluster validation. Each label is combined with AND logic to select nodes.
+The playbook is idempotent (additive strategic-merge patch on `cluster-validation-sa.imagePullSecrets`). See [`ansible/README.md`](ansible/README.md) for full details.
- Common label combinations:
- - Physical GPUs + Physical NICs: `["feature.node.kubernetes.io/amd-gpu=true", "feature.node.kubernetes.io/amd-nic=true"]`
- - Virtual GPUs + Virtual NICs (in VMs): `["feature.node.kubernetes.io/amd-vgpu=true", "feature.node.kubernetes.io/amd-vnic=true"]`
- - Mixed configurations: Customize the array to match your environment
+If `xgmi-lvl1` / `pcie-lvl1` / `hbm-lvl1` are all set to `skip: true`, no secret is needed.
-3. **Start the validation cluster**
+#### 0.6 Timeouts
- ```bash
- # Bring up control plane (run in background)
- ./gpu-cluster.sh run server &
+Make sure each per-phase budget is generous enough for your hardware — a too-tight budget kills the test Job mid-run and the orchestrator records it as a failure even when the underlying test was about to pass.
- # Fetch control plane token to join the cluster
- ./gpu-cluster.sh get-token
+```jsonc
+"timeouts": {
+ "phase1-stages-secs": {
+ "phase1-gpu-stress": 3600, // gst_single can take 30+ min on slower nodes; 1h headroom
+ "phase1-xgmi-lvl1": 1200,
+ "phase1-pcie-lvl1": 1200,
+ "phase1-hbm-lvl1": 1200
+ },
+ "phase2-job-wait-secs": 600,
+ "phase3-job-wait-secs": 300,
+ "phase4-pair-wait-secs": 300,
+ "phase5-mpijob-wait-secs": 300
+}
+```
- # On other nodes, bring up workers to join the cluster (run in background)
- ./gpu-cluster.sh run agent &
- ```
+Tuning advice:
-4. **Verify cluster status**
+- If you see Phase 1 stages marked TIMEOUT but the persistent log under `/var/log/cluster-validation/__stdout.gz` shows the test still progressing → **raise** the corresponding `phase1-stages-secs` entry.
+- If you see Phase 4 `ib-write-bw-crashed` on every rail → **not** a timeout issue; check the persistent log `/var/log/cluster-validation/_phase4-{server,client}_railN.log` (libionic ABI mismatch, missing per-rail NAD, etc.).
+- Per-node-interval `resources.node-validation-interval-mins: 30` controls how often a single node re-runs; lower it (e.g. 10) only if you want tighter feedback during bringup.
- After bringing up the cluster, login to the server container to check cluster status:
+#### 0.7 Advanced — everything else lives in YAML
- ```bash
- # Login to server container
- docker exec -it server bash
+`config.json` is intentionally narrow (resources, skip flags, timeouts, images, schedule). For deeper tuning open `configs/cluster-validation-config.yaml` — every field that's overridable via `config.json` carries a `# patchable: ` marker; everything else is a tunable that requires editing the YAML directly. Examples of YAML-only knobs:
- # Check all nodes are ready
- kubectl get nodes
+| YAML key (search in `cluster-validation-config.yaml`) | What it controls |
+|---|---|
+| `PHASE2_BW_THRESHOLD` | min intra-node RCCL all_reduce GB/s for Phase 2 PASS |
+| `PHASE4_BW_THRESHOLD`, `PHASE4_RAIL_COUNT`, `PHASE4_MAX_CONCURRENT_PAIRS`, `PHASE4_IB_DEV_PREFIX`, `PHASE4_NAD_NAME_PREFIX` | Phase 4 bandwidth threshold, rails per pair, parallelism cap, RDMA device prefix, NAD prefix |
+| `PHASE3_EXPECTED_NIC_COUNT`, `PHASE3_AMD_NIC_PCI_IDS`, `PHASE3_MIN_GID_COUNT` | Phase 3 NIC count + PCI ID allowlist + minimum GID count |
+| `RCCL_ENV_VARS`, `PHASE2_RCCL_ENV_VARS` | NCCL/RCCL env-var blocks for Phase 5 vs single-node Phase 2 |
+| `GPU_VALIDATION_STAGES_JSON` | full Phase 1 stage spec (Name, Framework, Recipe, Iterations, TimeoutSeconds, Arguments) — patchable per-field via `config.json`, or replace wholesale here |
+| `PHASE45_PREFLIGHT_SCRIPT`, `PHASE5_LAUNCHER_SCRIPT`, `PHASE5_WORKER_SCRIPT` | the actual bash bodies run inside each phase's pods |
- # Check all pods are running
- kubectl get pods -A
+After editing the YAML, push to the live cluster with `ansible-playbook playbooks/cvf.yml -e action=reapply`.
- # Exit container
- exit
+### Step 1 — One-time bringup
- # Check cluster validation framework status
- ./gpu-cluster.sh status
+```bash
+cd ansible
- # Check per-node validation results
- ./gpu-cluster.sh node-status
- ```
+# 1a. (only first time) configure passwordless SSH to all nodes
+ansible-playbook -i inventory.yml playbooks/setup-ssh-keys.yml --ask-pass --ask-become-pass
-5. **Tear down the cluster**
+# 1b. bring up the cluster (k3s + operators + CVF CronJob)
+ansible-playbook -i inventory.yml playbooks/setup-cluster.yml
+```
- ```bash
- ./gpu-cluster.sh teardown
- ```
+`setup-cluster.yml` builds the Docker image once on the control node, ships the tar to every node, starts the `server` container on the server node, then joins the `agent` containers. Estimated 10-20 min depending on network speed and node count.
+
+### Step 2 — Day-2 operations via `cvf.yml`
+
+All CVF runtime operations live in one playbook dispatched by `-e action=`:
+
+```bash
+# After editing configs/*.yaml or config.json, push changes to the live cluster
+ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=reapply
+
+# Clear per-phase node labels + annotations (forces a fresh full run on next trigger)
+ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=reset
+
+# Inject AGFHC pull secret(s) into the live SA (idempotent)
+ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=inject-secret
+
+# Trigger one validation run manually (creates Job 'cvf-test'; default schedule is */30)
+ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=trigger
+
+# Full status dump: CronJob, recent orchestrators, in-flight Jobs, per-node phase labels + failure annotations
+ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=status
+
+# Composite: reapply -> reset (delete_all_jobs=true) -> inject-secret -> trigger -> status
+ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=fresh-run
+```
+
+> **Cron-vs-manual race:** the CronJob's `concurrencyPolicy: Forbid` blocks cron-vs-cron overlap but **not** manual-vs-cron. If a manual `trigger` exceeds the `*/30` schedule, the next cron tick spawns a second orchestrator that fights for the same nodes. For manual runs expected to exceed 30 min, suspend the cron first:
+> ```bash
+> ansible-playbook playbooks/cvf.yml -e action=reset -e suspend_cronjob=true
+> ansible-playbook playbooks/cvf.yml -e action=trigger
+> # ...after completion, re-enable:
+> docker exec server kubectl patch cronjob cluster-validation-cron-job \
+> -n default --type=merge -p '{"spec":{"suspend":false}}'
+> ```
+
+See [`ansible/README.md`](ansible/README.md) for the full `cvf.yml` action reference (all flags, idempotency notes, race-condition documentation).
+
+### Step 3 — Tear down
+
+```bash
+ansible-playbook -i inventory.yml playbooks/teardown-cluster.yml
+```
+
+## Multi-Phase Validation Pipeline
+
+The cluster validation framework runs a single CronJob (`cluster-validation-cron-job`) that drives a **5-phase, gated state machine**. Each phase narrows the candidate node pool: a node MUST pass phase N to enter phase N+1. Phases are independently enable-able via the `skip-tests` block in `config.json`, so a datacenter early in bringup can run only the GPU-only phases (1+2) and light up downstream phases as NIC and rail infrastructure arrive. The orchestrator skeleton, the per-phase `run_phaseN` stubs, the `DRY_RUN=1` planning mode, and the shared node-label helper library all live in `configs/cluster-validation-config.yaml`; the CronJob shell lives in `configs/cluster-validation-job.yaml`.
+
+### Phases, Labels, and Skip Flags
+
+| Phase | Purpose | `config.json` skip flag | ConfigMap env var | Per-phase node label key | Default |
+|-------|---------|-------------------------|-------------------|--------------------------|---------|
+| 1 | GPU HW acceptance (RVS / AGFHC via Test Runner) | `skip-gpu-hw-acceptance` | `SKIP_GPU_HW_ACCEPTANCE` | `amd.com/gpu-hw-acceptance` | enabled |
+| 2 | Intra-node GPU collective / mesh validation | `skip-gpu-mesh-validation` | `SKIP_GPU_MESH_VALIDATION` | `amd.com/gpu-mesh-validation` | enabled |
+| 3 | Per-node NIC health (requires `amd-nic=true`) | `skip-nic-validation` | `SKIP_NIC_VALIDATION` | `amd.com/nic-health` | skipped |
+| 4 | Pairwise rail bandwidth | `skip-rail-bandwidth-test` | `SKIP_RAIL_BANDWIDTH_TEST` | `amd.com/rail-bandwidth` | skipped |
+| 4.5 | Cross-node connectivity matrix (pre-flight gate before Phase 5) | (shares `skip-rccl-test`) | (shares `SKIP_RCCL_TEST`) | (no new label key; annotation only) | skipped (with Phase 5) |
+| 5 | Multi-node RCCL via MPIJob | `skip-rccl-test` | `SKIP_RCCL_TEST` | `amd.com/cluster-validation-status` | skipped |
+
+On each phase, the per-phase script labels every input node with either `=passed` or `=failed`. On failure it also writes a failure-reason annotation:
+
+```text
+-failure-reason=
+```
+
+(The suffix `-failure-reason` is published as the ConfigMap constant `PHASE_FAILURE_REASON_ANNOTATION_SUFFIX`.) The phase label is the **only contract** between phases — phase scripts MUST use the `label_phase_passed` / `label_phase_failed` / `annotate_phase_value` helpers from `PHASE_NODE_LABEL_SCRIPT` rather than calling `kubectl label` directly, and they MUST read `PHASE{N}_LABEL_KEY` from the ConfigMap rather than hard-coding key names.
+
+### Phase 1: Per-Node GPU Hardware Acceptance
+
+**Scope:** First gate of the pipeline. Per-node, no-network GPU hardware acceptance using a Test Runner image (`rocm/test-runner` by default). The phase catches dead GPUs, thermal issues, HBM errors, PCIe degradation, and xGMI link issues at boot.
+
+**Result label:** `amd.com/gpu-hw-acceptance=passed|failed`. Only nodes labeled `passed` advance to Phase 2.
+
+**Multi-stage execution model.** The ROCm test-runner CLI executes only the first `TestCases[]` entry per invocation, so Phase 1 runs one Test Runner Job *per recipe per candidate node*. The orchestrator iterates the configured stages **sequentially per node** with **stop-on-first-failure** semantics: stage S submits one Job per still-alive node *in parallel*, waits for the whole batch, drops any node that failed, then moves to stage S+1. Cross-node parallelism is preserved within each stage; cross-stage execution is gated by the previous stage's per-node result.
+
+Each stage is fully self-describing in the `GPU_VALIDATION_STAGES_JSON` ConfigMap key — including its own `Image`, allowing RVS and AGFHC test-runner images to be pinned to different versions:
+
+```yaml
+GPU_VALIDATION_STAGES_JSON: |
+ [
+ { "Name": "gpu-stress", "Image": "docker.io/rocm/test-runner:v1.4.0",
+ "Framework": "RVS", "Recipe": "gst_single",
+ "Iterations": 1, "TimeoutSeconds": 1800, "Arguments": "--parallel" },
+ { "Name": "xgmi-lvl1", "Image": "docker.io/rocm/test-runner:v1.4.0",
+ "Framework": "AGFHC", "Recipe": "xgmi_lvl1",
+ "Iterations": 1, "TimeoutSeconds": 300, "Arguments": "" },
+ { "Name": "pcie-lvl1", "Image": "docker.io/rocm/test-runner:v1.4.0",
+ "Framework": "AGFHC", "Recipe": "pcie_lvl1",
+ "Iterations": 1, "TimeoutSeconds": 300, "Arguments": "" },
+ { "Name": "hbm-lvl1", "Image": "docker.io/rocm/test-runner:v1.4.0",
+ "Framework": "AGFHC", "Recipe": "hbm_lvl1",
+ "Iterations": 1, "TimeoutSeconds": 300, "Arguments": "" }
+ ]
+```
+
+| Field | Purpose |
+|-------|---------|
+| `Name` | Orchestrator identifier for the stage. Used in per-stage ConfigMap names (`cvf-phase1---`), Job names (`cvf-tr---`), and per-stage annotations. Must be DNS-1123-safe (lowercase alphanumerics + `-`). |
+| `Image` | Test Runner image used **for this stage only**. Lets RVS and AGFHC ship independently. |
+| `Framework` / `Recipe` / `Iterations` / `TimeoutSeconds` / `Arguments` | Passed verbatim into the runner as a single-element `TestCases[]` payload via a per-stage per-node ConfigMap. |
+
+Per-stage Job wait budget = `TimeoutSeconds + 120s` (pod-startup slack). There is **no** global `TEST_RUNNER_JOB_WAIT_TIME` — each stage owns its own budget.
+
+**Default sub-tests.** The shipped `GPU_VALIDATION_STAGES_JSON` defines these four stages, executed in array order:
+
+| # | Stage `Name` | Framework | Recipe | Purpose |
+|---|--------------|-----------|--------|---------|
+| 1 | `gst-single` | RVS | `gst_single` | Single-GPU stress (compute + memory) — baseline GPU health |
+| 2 | `xgmi-lvl1` | AGFHC | `xgmi_lvl1` | xGMI inter-GPU link integrity (level-1) |
+| 3 | `pcie-lvl1` | AGFHC | `pcie_lvl1` | PCIe link integrity, lane width / speed (level-1) |
+| 4 | `hbm-lvl1` | AGFHC | `hbm_lvl1` | HBM memory integrity (level-1) |
+
+**Per-stage annotation scheme.** Independent of the aggregate `amd.com/gpu-hw-acceptance` label, each stage writes its own annotation per node:
+
+```text
+amd.com/gpu-hw-acceptance-stage-=passed|failed
+```
+
+If a stage fails, subsequent stages are *not* submitted for that node, so no annotation is written for stages after the failing one. The aggregate label is `passed` only when **all** stages pass for the node; on first failure, the aggregate label becomes `failed` and the standard `-failure-reason` annotation is set to `stage-:` (e.g., `stage-hbm-lvl1:subtest-failed:hbm_lvl1`), with `-failed-subtest=` for sub-test failures.
-## Usage
+**Sample failure annotation output.** When the `hbm-lvl1` stage fails on a node, the prior stages' per-stage annotations are still recorded:
+
+```text
+$ kubectl describe node smc300x-ccs-aus-gpuf268
+Name: smc300x-ccs-aus-gpuf268
+Labels: amd.com/gpu-hw-acceptance=failed
+ feature.node.kubernetes.io/amd-gpu=true
+ ...
+Annotations: amd.com/gpu-hw-acceptance-stage-gst-single=passed
+ amd.com/gpu-hw-acceptance-stage-xgmi-lvl1=passed
+ amd.com/gpu-hw-acceptance-stage-pcie-lvl1=passed
+ amd.com/gpu-hw-acceptance-stage-hbm-lvl1=failed
+ amd.com/gpu-hw-acceptance-failure-reason=stage-hbm-lvl1:subtest-failed:hbm_lvl1
+ amd.com/gpu-hw-acceptance-failed-subtest=hbm_lvl1
+ amd.com/cluster-validation-last-run-timestamp=2026-05-20T14:32:11Z
+ ...
+```
+
+Other failure reasons the phase can emit (in the `-failure-reason` annotation, always prefixed with `stage-:`):
+
+| Reason value | Meaning |
+|--------------|---------|
+| `stage-:subtest-failed:` | The named sub-test in the stage failed; `` also appears in `-failed-subtest`. |
+| `stage-:recipe-not-found` | The AGFHC/RVS recipe is missing from this stage's `Image`. `-failed-subtest=`. Action: upgrade the `Image` field for the affected stage in `GPU_VALIDATION_STAGES_JSON`. |
+| `stage-:test-runner-did-not-emit-results` | Job completed but `result.json` is absent. `-failed-subtest=unknown`. |
+| `stage-:timeout` | Job exceeded the stage's `TimeoutSeconds + 120s` budget. |
+| `stage-:configmap-creation-failed` | `kubectl apply` on the per-stage per-node `GPU_VALIDATION_TESTS_JSON` ConfigMap returned non-zero. |
+| `stage-:job-creation-failed` | `kubectl apply` on the Test Runner Job returned non-zero. |
+| `phase1-missing-env:` / `phase1-stages-empty-or-invalid` / `phase1-stages-missing-fields` | Preamble validation failed before any stage was submitted; the same reason is written for every input node and no Jobs/ConfigMaps are created. |
+
+**`SKIP_GPU_HW_ACCEPTANCE` behavior.** Setting `skip-gpu-hw-acceptance: true` in `config.json` (which renders to `SKIP_GPU_HW_ACCEPTANCE=true` in the ConfigMap) makes Phase 1 short-circuit:
+
+- **No Test Runner Job is created** for any input node.
+- Every input node is immediately labeled `amd.com/gpu-hw-acceptance=passed` via the standard `label_phase_passed` helper, so downstream `filter_passed_nodes` calls treat the node as eligible for Phase 2.
+- No failure annotation is written, and no `result.json` parsing occurs.
+
+This short-circuit exists for incremental bringup: when GPU hardware has already been independently validated (e.g., during burn-in) or when the operator wants to exercise Phases 2-5 without re-running the full HW acceptance suite each tick, `SKIP_GPU_HW_ACCEPTANCE=true` keeps the pipeline moving without consuming the ~30-minute Test Runner budget per node.
+
+### Phase 2: Intra-Node GPU Mesh Validation
+
+**Scope:** Second gate of the pipeline. Per-node, no-network RCCL `all_reduce_perf` across all 8 local GPUs of a node to stress the xGMI mesh under a real collective workload. One Kubernetes `batch/v1.Job` is created per candidate node that passed Phase 1; the Job is **not** an MPIJob — launcher and worker are co-located in a single pod, and `mpirun --np 8 --host localhost --allow-run-as-root` drives all 8 ranks locally. The phase catches xGMI link failures that only manifest under collective load, GPU mesh issues, and single-GPU instability under stress that Phase 1 single-GPU recipes can miss.
+
+**Result label:** `amd.com/gpu-mesh-validation=passed|failed`. Only nodes labeled `passed` advance to Phase 3 (or to the per-node label being available for downstream consumers when Phases 3-5 are skipped).
+
+**Workload.** The container runs the RCCL `all_reduce_perf` benchmark from the `rocm/roce-workload` image (already used by Phase 5), driven by `mpirun` against the 8 GPUs requested via `amd.com/gpu: 8`:
+
+```bash
+mpirun --np 8 --host localhost --allow-run-as-root \
+ --mca btl ^vader,openib \
+ $PERF_TEST_DIR/all_reduce_perf \
+ -b $PHASE2_START_MSG_SIZE -e $PHASE2_END_MSG_SIZE \
+ -f $PHASE2_STEP_FACTOR -g 1 \
+ -n $PHASE2_ITER_COUNT -w $PHASE2_WARMUP_ITER_COUNT
+```
+
+After `mpirun` exits, the shared `validate-single-test.sh` parses the standard `Avg bus bandwidth` line from the RCCL output, compares it against `PHASE2_BW_THRESHOLD`, and the per-phase label/annotation helpers write the result. The per-Job template lives in the new `cluster-validation-phase2-job-config` ConfigMap (mounted into the orchestrator's `submit-mpijob` container at `/phase2-configs`); per-node rendering uses the same `sed`-substitution pattern as Phase 1's Test Runner Job.
+
+**ConfigMap env vars (Phase 2 subset).** Tunable values are sourced from `cluster-validation-config`:
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `PHASE2_START_MSG_SIZE` | `1K` | `-b` lower bound for `all_reduce_perf` message-size sweep |
+| `PHASE2_END_MSG_SIZE` | `2G` | `-e` upper bound for the sweep |
+| `PHASE2_STEP_FACTOR` | `2` | `-f` geometric step factor between message sizes |
+| `PHASE2_ITER_COUNT` | `6` | `-n` measured iterations per message size |
+| `PHASE2_WARMUP_ITER_COUNT` | `20` | `-w` warmup iterations (excluded from BW measurement) |
+| `PHASE2_BW_THRESHOLD` | `200` | Minimum acceptable `Avg bus bandwidth` in GB/s |
+| `PHASE2_JOB_WAIT_TIME` | `600` | Per-node Job wait budget in seconds |
+| `PHASE2_RCCL_ENV_VARS` | (block) | RCCL/NCCL env vars sourced before `mpirun`; intra-node subset of Phase 5's `RCCL_ENV_VARS` — drops `NCCL_NET_PLUGIN` and all `NCCL_IB_*` tunables (single node, no fabric) and keeps `NCCL_DEBUG=INFO` for diagnostics |
+
+The `PHASE2_RCCL_ENV_VARS` block deliberately contains **no IB-specific tunables** — Phase 2 exercises only the local xGMI mesh, so any IB/RoCE env var would be misleading at best and a source of false failures at worst. Phase 5 (multi-node MPIJob) is where IB/RoCE settings belong.
+
+**Sample failure annotation output.** When the bandwidth measurement falls below the configured threshold, the node receives the `failed` label plus the uniform `-failure-reason` annotation:
+
+```text
+$ kubectl describe node smc300x-ccs-aus-gpuf268
+Name: smc300x-ccs-aus-gpuf268
+Labels: amd.com/gpu-hw-acceptance=passed
+ amd.com/gpu-mesh-validation=failed
+ feature.node.kubernetes.io/amd-gpu=true
+ ...
+Annotations: amd.com/gpu-mesh-validation-failure-reason=bus-bw-below-threshold
+ amd.com/gpu-mesh-validation-measured-bw=148.3
+ amd.com/cluster-validation-last-run-timestamp=2026-05-20T15:08:42Z
+ ...
+```
+
+Failure reasons the phase can emit (in the `-failure-reason` annotation):
+
+| Reason value | Meaning |
+|--------------|---------|
+| `bus-bw-below-threshold` | RCCL completed cleanly but measured `Avg bus bandwidth` < `PHASE2_BW_THRESHOLD`. Measured value is also written to the `-measured-bw` annotation. |
+| `rccl-crash` | `mpirun` exited non-zero (RCCL aborted, runtime error). Last 50 lines of `/shared/phase2.log` captured in the annotation (truncated to fit the 256 KB annotation budget). |
+| `xgmi-init-failure` | RCCL failed during initialization with an xGMI-specific error class — the link came up at boot but cannot carry collective traffic. Inspect `NCCL_DEBUG=INFO` output for the offending peer pair. |
+| `timeout` | Job did not complete within `PHASE2_JOB_WAIT_TIME` (typically: scheduler did not grant all 8 GPUs, or `all_reduce_perf` hung). Conservative — the node is retried on the next CronJob tick. |
+| `job-creation-failed` | `kubectl apply` of the rendered Job template returned non-zero. Usually a `cluster-validation-phase2-job-config` ConfigMap deploy issue. |
+
+**`SKIP_GPU_MESH_VALIDATION` behavior.** Setting `skip-gpu-mesh-validation: true` in `config.json` (which renders to `SKIP_GPU_MESH_VALIDATION=true` in the ConfigMap) makes Phase 2 short-circuit:
+
+- **No Phase 2 Job is created** for any input node.
+- Every input node is immediately labeled `amd.com/gpu-mesh-validation=passed` via the standard `label_phase_passed` helper, so downstream `filter_passed_nodes` calls treat the node as eligible for Phase 3 (or for being surfaced as Phase-2-clean when Phases 3-5 are also skipped).
+- No failure annotation is written, and no `all_reduce_perf` log parsing occurs.
+
+This short-circuit exists for the same incremental-bringup reasons as Phase 1: when collective behavior has already been independently validated, or when the operator wants to exercise downstream phases without consuming the per-node Phase 2 budget on every CronJob tick.
+
+#### Threshold Tuning
+
+`PHASE2_BW_THRESHOLD` is the single number that decides pass vs. fail when RCCL completes cleanly, so it MUST be tuned for the GPU SKU and topology in use. The default (`200` GB/s) is calibrated for MI300-series xGMI on an 8-GPU node — high enough to catch a single degraded xGMI link, low enough to absorb normal run-to-run variance.
+
+**How to override.** Change the value in `configs/cluster-validation-config.yaml` and push it to the live cluster with `ansible-playbook playbooks/cvf.yml -e action=reapply` (re-applies + patches all ConfigMaps; no orchestrator code changes are required). Example override snippet:
+
+```yaml
+# configs/cluster-validation-config.yaml
+data:
+ PHASE2_BW_THRESHOLD: "180" # relax to 180 GB/s for a slightly older silicon revision
+```
+
+The next CronJob tick will pick up the new value via the `envFrom: configMapRef` projection on the Phase 2 Job pod. Already-labeled nodes keep their existing label until the next phase run overwrites it.
+
+**Expected ranges by GPU SKU.** Use these as starting points; always validate against a measured baseline on healthy hardware before locking in a production threshold.
+
+| GPU SKU | xGMI generation | Typical `all_reduce_perf` `Avg bus bandwidth` (8-GPU, message sizes 1K-2G) | Suggested `PHASE2_BW_THRESHOLD` |
+|---------|-----------------|----------------------------------------------------------------------------|---------------------------------|
+| MI300X / MI300A | xGMI 3 (4-link, 128 GB/s/link bi-dir) | 220-260 GB/s | `200` (default) |
+| MI250 / MI250X | xGMI 2 (8-link, 50-100 GB/s/link bi-dir, ring topology) | 140-180 GB/s | `120` |
+| MI210 | xGMI 2 (3-link per pair) | 60-90 GB/s | `50` |
+| MI100 | xGMI 1 (3-link per pair) | 40-60 GB/s | `35` |
+
+**Tuning workflow:**
+
+1. **Measure baseline.** On known-good hardware, set `PHASE2_BW_THRESHOLD: "1"` (effectively pass-through) and let one CronJob tick run. Read the actual `Avg bus bandwidth` from the Job pod logs or from `/var/log/cluster-validation/`.
+2. **Set the threshold ~10-15% below the baseline.** Tight enough to flag a degraded link (typically one bad xGMI link drops the figure by 20-30%), loose enough to absorb normal variance.
+3. **Re-test on the same node with the new threshold.** Confirm `amd.com/gpu-mesh-validation=passed`.
+4. **Inject a known-bad threshold** (`PHASE2_BW_THRESHOLD: "9999"`) on one tick to confirm the failure path labels and annotates correctly (`failed-reason=bus-bw-below-threshold`, measured value recorded), then revert.
+
+If `Avg bus bandwidth` is consistently below the suggested-range floor on healthy hardware, suspect a topology or firmware issue (xGMI link not training at full width, BIOS/SBIOS misconfiguration, ROCm/RCCL version mismatch with the silicon) rather than relaxing the threshold further.
+
+### Phase 3: Per-Node NIC Health Check
+
+**Scope:** Third gate of the pipeline. Per-node, structural-only RDMA NIC readiness check (no traffic generation) running on every node that passed Phase 2 **and** carries the `feature.node.kubernetes.io/amd-nic=true` label. One Kubernetes `batch/v1.Job` is created per candidate node; the Job runs the `${ROCE_WORKLOAD_IMAGE}` image (the same image pinned in `cluster-validation-config` and reused by Phases 2, 4, and 5 — see [Image switch note](#image-switch-phase-3-now-uses-roce_workload_image) below), requests all `amd.com/nic` device-plugin resources on the node (so the pod sees every NIC), and self-labels its own node via an in-pod `kubectl` against the `cluster-validation-sa` ServiceAccount. The phase catches missing NICs, drivers not loaded, bad cables, switch ports down, RoCE misconfiguration, and driver/firmware version skew before downstream phases attempt RDMA traffic.
+
+**Result label:** `amd.com/nic-health=passed|failed`. Only nodes labeled `passed` advance to Phase 4 (or to the per-node label being available for downstream consumers when Phases 4-5 are skipped). The phase is a point-in-time structural check — link flaps that develop later are caught by Phase 4 (real RDMA traffic).
+
+**Pre-flight + 5 structural checks.** Each Phase 3 Job runs `PHASE3_CHECK_SCRIPT` from the `cluster-validation-config` ConfigMap. The script first runs a **pre-flight self-check** (invokes `ibv_devinfo` once; non-zero exit short-circuits with `preflight-failed:ibv_devinfo=` so an `ROCE_WORKLOAD_IMAGE` ABI mismatch surfaces as a single clean failure rather than cascading into confused check-3/check-4 errors). Pre-flight is followed by five independent signals; the script aggregates failures across all five before labeling:
+
+| # | Check | Tool | Pass criterion |
+|---|-------|------|----------------|
+| 1 | NIC enumeration | `lspci -Dnn` filtered by `$PHASE3_AMD_NIC_PCI_IDS` (comma-separated `vendor:device` allowlist) | Count equals `PHASE3_EXPECTED_NIC_COUNT` (default `8`) |
+| 2 | Link state | `/sys/class/net//operstate` over every netdev whose `device/driver` symlink points at `ionic` | Every ionic interface reports `up` |
+| 3 | RDMA link state | `rdma link show` | Every device reports state `ACTIVE` |
+| 4 | GID table | `ibv_devinfo -d -v` | Each device responds and reports `>= PHASE3_MIN_GID_COUNT` GID entries (default `1`) |
+| 5 | Firmware ↔ workload-image alignment | `cat /sys/class/infiniband/ionic_/fw_ver` per ionic IB device, substring-matched against `$ROCE_WORKLOAD_IMAGE` (sed-substituted into the pod env at Job render time) | The observed firmware-version string appears as a substring of the workload image reference. Gated by `PHASE3_DRIVER_FW_CHECK_ENABLED` and `PHASE3_DRIVER_FW_STRICT`; "no data" (env unset OR no readable `fw_ver`) → SKIP, not FAIL |
+
+The script does **not** short-circuit on the first failing check (Checks 1–5) — it runs all of them, then aggregates the failure reasons and the offending NIC names into annotations. Pre-flight is the one exception: pre-flight failure short-circuits because every downstream check would be tainted by the same ABI/tool fault. This keeps the failure record actionable: an operator inspecting a `failed` node sees every distinct fault on the node, not just the first one tripped.
+
+**ConfigMap env vars (Phase 3 subset).** Tunable values are sourced from `cluster-validation-config`:
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `PHASE3_EXPECTED_NIC_COUNT` | `8` | Required PCI device count (Check 1). Adjust for nodes with non-default NIC populations. |
+| `PHASE3_AMD_NIC_PCI_IDS` | `1dd8:1002` | Comma-separated PCI `vendor:device` allowlist matched against the `[vid:did]` tag from `lspci -nn` for Check 1. **`1dd8:1002` is the Pensando DSC Ethernet Controller PF** — the canonical NIC identifier for the AMD Pensando DSC fleet (DSC-25, DSC-200, Salina). Append additional pairs (e.g. `"1dd8:1002,1dd8:1003"`) to cover a future PF revision or a second SKU. Override only when validating non-Pensando NICs. |
+| `PHASE3_MIN_GID_COUNT` | `1` | Minimum GID table entries per RDMA device (Check 4). A device with zero GIDs cannot carry RoCE traffic. |
+| `PHASE3_JOB_WAIT_TIME` | `120` | Per-node Job wait budget in seconds. A Job still pending past this budget is treated as `nic-not-allocated`. |
+| `PHASE3_ANNOTATION_MAX_BYTES` | `250` | Maximum bytes of `failure-reason` / `failed-nics` annotation values (truncated to fit Kubernetes' 256 KB total-annotation budget on a node). |
+| `PHASE3_DRIVER_FW_CHECK_ENABLED` | `"true"` | Gates Check 5 (firmware ↔ workload-image alignment). Set to `"false"` to skip the check — Phase 3 then runs only Checks 1–4 and emits the marker without the `observed_fw=` field. The pre-flight `ibv_devinfo` self-check still runs regardless. |
+| `PHASE3_DRIVER_FW_STRICT` | `"true"` | When `"true"`, any per-NIC firmware-version that does NOT appear as a substring of `$ROCE_WORKLOAD_IMAGE` **fails** the node with `fw-image-mismatch:=/image=`. When `"false"`, mismatches surface only via the `-observed-fw` annotation and the check is informational. "No data" (missing env or unreadable sysfs) is always treated as SKIP, never a fail. |
+
+**Pensando PCI device ID configuration.** Check 1 (NIC enumeration) counts the lines from `lspci -Dnn` whose trailing `[vendor:device]` tag matches one of the entries listed in `PHASE3_AMD_NIC_PCI_IDS` (comma-separated). The default `1dd8:1002` is the Pensando "DSC Ethernet Controller" PF — the device ID the kernel `ionic` driver binds to and the canonical identifier for a NIC PF on every AMD Pensando NIC variant currently shipping in the fleet (DSC-25, DSC-200, Salina). Because each Pensando card exposes six PCI functions per vendor (two PCI bridges, three Processing-accelerator sub-functions, and one Ethernet controller PF), a vendor-only filter would over-count by ~6x and SR-IOV would inflate it further; the device-ID allowlist is the precise, drift-immune signal.
+
+To confirm on a target node:
+
+```bash
+$ lspci -Dnn | grep -E '\[1dd8:1002\]' | head
+0000:05:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002]
+0000:06:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002]
+...
+```
+
+If the count is zero on a node that visibly has Pensando NICs, suspect (a) `lspci` not installed in the container image (regression), (b) NICs not yet enumerated by the kernel (boot-order race — Phase 3 reschedules on the next CronJob tick), or (c) a new SKU exposes a different device ID — extend the allowlist to e.g. `PHASE3_AMD_NIC_PCI_IDS: "1dd8:1002,1dd8:1003"`. For cross-vendor evaluation (e.g. Mellanox/NVIDIA ConnectX) set the full list, such as `PHASE3_AMD_NIC_PCI_IDS: "15b3:1019"`.
+
+#### Image switch: Phase 3 uses `${ROCE_WORKLOAD_IMAGE}`
+
+Phase 3's Job container image is `${ROCE_WORKLOAD_IMAGE}` — the same image already pinned in `cluster-validation-config` for Phases 2, 4, and 5. One image pin, one bump for all four phases, and the image ships `ibv_devinfo` + `rdma` + `ethtool` + the `/sys/class/infiniband/ionic_*/fw_ver` sysfs view which is all the checks need.
+
+The centralization is also the main risk surface. The `rocm/roce-workload` image is rebuilt frequently, and at least one historical tag (`...ainic-1.117.1-a-63`) shipped `libionic 54.0-149` ABI v2 while the validation cluster's host `ionic_rdma` module spoke ABI v1 — every Phase 4 `ib_write_bw` rail failed to initialize verbs until the host driver was upgraded. A future ABI-incompatible tag would cause Checks 3 and 4 (`rdma link show`, `ibv_devinfo`) to fail with confusing per-NIC errors on every node simultaneously. The pre-flight `ibv_devinfo` self-check at the top of `PHASE3_CHECK_SCRIPT` exists for this case: non-zero exit short-circuits the script with `preflight-failed:ibv_devinfo=` as the single reason. Remediation: revert `ROCE_WORKLOAD_IMAGE` to a known-good tag in `cluster-validation-config` and wait for the next CronJob tick.
+
+#### Check 5: firmware ↔ workload-image alignment
+
+Check 5 verifies each NIC's running firmware is the one the workload image was qualified against. It reads `/sys/class/infiniband/ionic_/fw_ver` (populated by the driver from the NIC's admin command at probe time) and substring-matches the value against `$ROCE_WORKLOAD_IMAGE`. The image tag itself — e.g. `...ainic-1.117.5-a-56` — is the contract: no operator-curated allowlist, no compat-map maintenance. Bumping the firmware on a node without also bumping the image (or vice versa) trips the check.
+
+"No data" cases are always SKIP, never FAIL:
+- `$ROCE_WORKLOAD_IMAGE` unset (orchestrator render failed) → `CHECK 5 SKIP: ROCE_WORKLOAD_IMAGE env not set`
+- No `fw_ver` readable for any ionic device → `CHECK 5 SKIP: no fw_ver readable under /sys/class/infiniband/*/`
+
+Only an actual mismatch (we read fw + we read image + strings differ) fails in strict mode. Set `PHASE3_DRIVER_FW_CHECK_ENABLED=false` to skip the check unconditionally (the marker stays byte-for-byte compatible with the pre-Check-5 contract — no `observed_fw=` field).
+
+**Sample failure annotation output.** When one or more checks fail, the node receives the `failed` label plus two annotations: the uniform `-failure-reason` (per-phase contract, comma-separated list of reason tokens) and the Phase-3-specific `-failed-nics` (comma-separated list of offending NIC names, used by operators to localize the fault):
+
+```text
+$ kubectl describe node smc300x-ccs-aus-gpuf268
+Name: smc300x-ccs-aus-gpuf268
+Labels: amd.com/gpu-hw-acceptance=passed
+ amd.com/gpu-mesh-validation=passed
+ amd.com/nic-health=failed
+ feature.node.kubernetes.io/amd-gpu=true
+ feature.node.kubernetes.io/amd-nic=true
+ ...
+Annotations: amd.com/nic-health-failure-reason=rdma-state:rocep5s0=DOWN,link-state:rocep5s0=DOWN
+ amd.com/nic-health-failed-nics=rocep5s0
+ amd.com/cluster-validation-last-run-timestamp=2026-05-20T15:42:08Z
+ ...
+```
+
+The example above shows the canonical "one cable unplugged" failure: NIC `rocep5s0` fails both Check 2 (`link-state:rocep5s0=DOWN`) and Check 3 (`rdma-state:rocep5s0=DOWN`), and both reasons are aggregated in `-failure-reason`. The `-failed-nics` annotation lists `rocep5s0` once even though it tripped two checks — operators can `ssh` to the node and inspect that single device. A multi-NIC failure would render as, e.g.:
+
+```text
+amd.com/nic-health-failure-reason=link-state:rocep5s0=DOWN,gid-table:rocep5s0=0,rdma-state:rocep7s0=INIT
+amd.com/nic-health-failed-nics=rocep5s0,rocep7s0
+```
+
+**Worked Check 5 example — firmware/image mismatch.** On a node where `ionic0` reports firmware `1.117.4` but the workload image pinned in `cluster-validation-config` is `...ainic-1.117.5-a-56`, the node is labeled `failed` and receives the three annotations below. The `-observed-fw` annotation is written **whenever Check 5 ran**, regardless of pass or fail outcome, so operators always see the cluster's full per-NIC firmware inventory:
+
+```text
+amd.com/nic-health=failed
+amd.com/nic-health-failure-reason=fw-image-mismatch:ionic0=1.117.4/image=...ainic-1.117.5-a-56
+amd.com/nic-health-failed-nics=ionic0
+amd.com/nic-health-observed-fw=ionic0=1.117.4,ionic1=1.117.5-a-56,ionic2=1.117.5-a-56,ionic3=1.117.5-a-56,ionic4=1.117.5-a-56,ionic5=1.117.5-a-56,ionic6=1.117.5-a-56,ionic7=1.117.5-a-56
+```
+
+Only `ionic0` mismatches (every other NIC's firmware appears as a substring of the image tag), so only `ionic0` is in `-failed-nics`. When `PHASE3_DRIVER_FW_CHECK_ENABLED=false`, `-observed-fw` is omitted from the marker and the orchestrator preserves any previously-recorded value on the node (last-known-good).
+
+Failure reasons the phase can emit (in the `-failure-reason` annotation):
+
+| Reason token | Source | Meaning |
+|--------------|--------|---------|
+| `nic-count:expected=,actual=` | Check 1 | `lspci -Dnn` filtered by `$PHASE3_AMD_NIC_PCI_IDS` returned `` devices, not the required ``. Indicates missing NIC, driver not loaded, or a new PF device ID that needs to be appended to `PHASE3_AMD_NIC_PCI_IDS`. |
+| `link-state:=` | Check 2 | Interface `` reports `` (e.g., `DOWN`, `UNKNOWN`) instead of `UP`. Indicates bad cable, switch-port down, or admin-disabled interface. `` also appears in `-failed-nics`. |
+| `rdma-state:=` | Check 3 | RDMA device `` reports `` (e.g., `DOWN`, `INIT`, `ARMED`) instead of `ACTIVE`. Indicates RoCE not configured, peer not up, or fabric-level fault. `` also appears in `-failed-nics`. |
+| `ibv-devinfo:=unresponsive` | Check 4 | `ibv_devinfo -d ` failed to return. Indicates a kernel-level RDMA verbs failure on that device. `` also appears in `-failed-nics`. |
+| `gid-table:=` | Check 4 | RDMA device `` reports `` GID entries, below `PHASE3_MIN_GID_COUNT`. A device with zero GIDs cannot carry RoCE traffic. Typically caused by IP not yet assigned to the RDMA interface. |
+| `preflight-failed:ibv_devinfo=` | Pre-flight | The first invocation of `ibv_devinfo` inside the Phase 3 pod returned non-zero. Short-circuits the script (Checks 1–5 skipped). Almost always indicates `${ROCE_WORKLOAD_IMAGE}` was bumped to a tag whose userspace libraries are ABI-incompatible with the host kernel modules (canonical example: `libionic 54.0-149` ABI v2 against host `ionic_rdma` ABI v1). Remediation: revert `ROCE_WORKLOAD_IMAGE` to a known-good tag and wait for the next CronJob tick. No `-failed-nics` is written. |
+| `fw-image-mismatch:=/image=` | Check 5 | RDMA device `` reports firmware ``, but that string does NOT appear as a substring of the workload image reference ``. Under `PHASE3_DRIVER_FW_STRICT=true` (default) the node fails; under `PHASE3_DRIVER_FW_STRICT=false` the `-observed-fw` annotation is still written but the node is not failed. Remediation: bump `ROCE_WORKLOAD_IMAGE` to a tag built against the running firmware, OR re-flash the NIC to the firmware version the current image was qualified against. `` also appears in `-failed-nics`. |
+| `job-creation-failed` | Orchestrator | `kubectl apply` of the rendered Phase 3 Job template returned non-zero. Usually a `cluster-validation-phase3-job-config` ConfigMap deploy issue. No node-level `-failed-nics` is written in this case. |
+| `nic-not-allocated` | Orchestrator | Job pod stayed `Pending` past `PHASE3_JOB_WAIT_TIME` — the scheduler could not grant the requested `amd.com/nic: ` devices, typically because the `amd.com/nic` device plugin has not advertised resources, the node has fewer NICs than `PHASE3_EXPECTED_NIC_COUNT`, or admission was rejected. No node-level `-failed-nics` is written; inspect `kubectl describe pod` for the rejection reason. |
+
+The annotation keys follow the cross-phase convention defined: `` for `-failure-reason`, and the Phase-3-specific `-failed-nics` for the offending-device list. Both annotation values are truncated to `PHASE3_ANNOTATION_MAX_BYTES` (default `250`) bytes to stay within Kubernetes' 256 KB total-annotation budget per node.
+
+**`SKIP_NIC_VALIDATION` behavior.** Setting `skip-nic-validation: true` in `config.json` (which renders to `SKIP_NIC_VALIDATION=true` in the ConfigMap) makes Phase 3 short-circuit:
+
+- **No Phase 3 Job is created** for any input node — neither for nodes carrying `amd-nic=true` nor for nodes without the NIC label.
+- Every input node is immediately labeled `amd.com/nic-health=passed` via the standard `label_phase_passed` helper, so downstream `filter_passed_nodes` calls treat the node as eligible for Phase 4 (or for being surfaced as Phase-3-clean when Phases 4-5 are also skipped).
+- No failure annotation is written, no `lspci` / `rdma` / `ibv_devinfo` invocations occur, and the orchestrator does **not** intersect the input pool with `amd-nic=true` (the intersection only runs when Phase 3 is actually executed).
+- The `cluster-validation-phase3-job-config` ConfigMap is still mounted into the orchestrator container, but the Job template is never rendered.
+
+`skip-nic-validation: true` is the default in the shipped `config.json` (see the [Incremental Bringup Workflow](#incremental-bringup-workflow) below) because Phase 3 requires both NIC hardware and the `feature.node.kubernetes.io/amd-nic=true` label, neither of which is guaranteed during the GPU-only Day-1 bringup posture. Flip the flag to `false` once the network-operator has rolled out and `kubectl get nodes -l feature.node.kubernetes.io/amd-nic=true` returns the expected NIC-capable node set.
+
+### Phase 4: Pairwise Rail Bandwidth Test
+
+**Scope:** Fourth gate of the pipeline. Pairwise per-rail `ib_write_bw` between every pair of nodes that passed Phase 3, one rail (0.7) at a time per pair, with a default threshold of 380 Gbps. The phase isolates failures down to the `{pair, rail}` granularity so an operator inspecting a `failed` node sees exactly which rails fell below threshold and against which peer. Two Kubernetes `batch/v1.Job`s (server + client) are created per `(pair, rail)` instance from the `cluster-validation-phase4-job-config` ConfigMap; both Jobs run the `rocm/roce-workload` image (already used by Phase 5) and request a single NIC at a specific rail index via the per-rail `NetworkAttachmentDefinition` (`amd-host-device-nad-rail-${RAIL_IDX}`). The phase catches bad TOR ports, MTU mismatches, cable issues, and per-rail driver glitches that only manifest under real RDMA traffic — failure modes Phase 3's structural check cannot see.
+
+**Result label:** `amd.com/rail-bandwidth=passed|failed`. A node is labeled `passed` iff **every** rail it participated in measured at or above `PHASE4_BW_THRESHOLD` against **every** paired neighbor it was tested with. Only nodes labeled `passed` advance to Phase 4.5 (cross-node connectivity matrix) and Phase 5 (multi-node RCCL).
+
+**Pairing model.** Phase-3-passed nodes are sorted lexicographically and paired via round-robin: `node[0]<->node[1]`, `node[2]<->node[3]`, and so on. With an odd-count input, the last node has no peer to test against and is short-circuited to `passed` with an `unpaired=true` annotation (see [Unpaired-Node Behavior](#unpaired-node-behavior) below); the design intentionally trades partial coverage for forward progress, because downstream Phase 4.5 enforces a 2-node minimum anyway. Pairs run in parallel (capped by `PHASE4_MAX_CONCURRENT_PAIRS`); rails are serialized **within** a pair so the same NIC is never double-claimed by two concurrent `ib_write_bw` flows.
+
+**Workload.** For each `(pair, rail)` instance, `PHASE4_DRIVER_SCRIPT` (a) `sed`-renders the server Job template (pinned to node A, requests `amd-host-device-nad-rail-${RAIL_IDX}`), (b) waits for the server pod to publish its pod IP, (c) `sed`-renders the client Job template (pinned to node B, gets `$PEER_POD_IP` substituted in), (d) waits for both Jobs up to `PHASE4_PAIR_WAIT_TIME`, (e) parses `BW average` Gbps from the client pod's log, and (f) records the measurement under `$PHASE4_STATE_DIR/results//rail-` for both endpoints. Inside the container, `ib_write_bw` is invoked as:
+
+```bash
+# server side (node A) — listens for one client connection on the rail's RDMA device
+DEV="${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}" # e.g., rdma_dev_3 for rail 3
+ib_write_bw -d "$DEV" -i 1 -F
+
+# client side (node B) — connects to the server pod IP on the same rail
+DEV="${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}"
+ib_write_bw -d "$DEV" -i 1 -F "$PEER_POD_IP"
+```
+
+After all pair-runners complete, the driver aggregates per-node from the on-disk state and writes the labels and annotations via the standard `label_phase_passed` / `label_phase_failed` / `annotate_phase_value` helpers (per the design contract — never `kubectl label` directly).
+
+**ConfigMap env vars (Phase 4 subset).** Tunable values are sourced from `cluster-validation-config`:
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `PHASE4_RAIL_COUNT` | `8` | Number of rails tested per pair. Iterated `0.(PHASE4_RAIL_COUNT - 1)` serially within each pair. Lowering this (e.g., `4`) is the supported way to skip the upper rails on hardware with fewer than 8 NICs per node. |
+| `PHASE4_BW_THRESHOLD` | `380` | Minimum acceptable `ib_write_bw` `BW average` in **Gbps**. Per-rail comparison is a float compare via `awk` (bash arithmetic does not handle `388.42 >= 380`). See [Threshold Tuning](#threshold-tuning-phase-4) for guidance. |
+| `PHASE4_PAIR_WAIT_TIME` | `180` | Per-(pair, rail) wallclock wait budget in seconds. Half the budget is spent waiting for the server pod IP; the other half covers client startup, handshake, and the actual transfer. Total per-pair runtime is bounded by `PHASE4_RAIL_COUNT * PHASE4_PAIR_WAIT_TIME` (rails serialized within a pair). |
+| `PHASE4_MAX_CONCURRENT_PAIRS` | `8` | Cap on parallel pair-runners. Bounded to keep the kube API server happy on large clusters — see [Concurrency-Cap Tuning](#concurrency-cap-tuning) below. |
+| `PHASE4_NAD_NAME_PREFIX` | `amd-host-device-nad-rail-` | Prefix used to construct the per-rail `NetworkAttachmentDefinition` name as `${PHASE4_NAD_NAME_PREFIX}${RAIL_IDX}`. The 8 NADs (`amd-host-device-nad-rail-0` through `-7`) are deployed automatically by `setup-cluster.yml` (and re-applied by `cvf.yml -e action=reapply`) from `configs/nad-per-rail.yaml` — no manual `kubectl apply` step required. Override only if the fleet ships NADs under a different naming scheme out-of-band. |
+| `PHASE4_IB_DEV_PREFIX` | `rdma_dev_` | Prefix used to select the RDMA device inside the pod as `${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}` for `ib_write_bw -d`. Override only if a different image surfaces RDMA devices under a different naming scheme. |
+
+**Sample annotation output (passing node).** When every rail on every paired neighbor measured at or above `PHASE4_BW_THRESHOLD`, the node is labeled `passed`, every `rail-N` annotation carries its measured Gbps, and the diagnostic `peer` annotation records the last peer the node was paired with:
+
+```text
+$ kubectl describe node smc300x-ccs-aus-gpuf268
+Name: smc300x-ccs-aus-gpuf268
+Labels: amd.com/gpu-hw-acceptance=passed
+ amd.com/gpu-mesh-validation=passed
+ amd.com/nic-health=passed
+ amd.com/rail-bandwidth=passed
+ feature.node.kubernetes.io/amd-gpu=true
+ feature.node.kubernetes.io/amd-nic=true
+ ...
+Annotations: amd.com/rail-bandwidth-rail-0=388
+ amd.com/rail-bandwidth-rail-1=391
+ amd.com/rail-bandwidth-rail-2=387
+ amd.com/rail-bandwidth-rail-3=385
+ amd.com/rail-bandwidth-rail-4=390
+ amd.com/rail-bandwidth-rail-5=389
+ amd.com/rail-bandwidth-rail-6=386
+ amd.com/rail-bandwidth-rail-7=388
+ amd.com/rail-bandwidth-peer=smc300x-ccs-aus-gpuf269
+ amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:14:55Z
+ ...
+```
+
+**Sample failure annotation output.** When one or more rails fail (here: rail 5 measured 180 Gbps and rail 7 lost its peer pod before the client could attach), the node receives the `failed` label, the uniform `-failure-reason` annotation (a CSV-encoded summary), the Phase-4-specific `-failed-rails` annotation (compact CSV of offending rail indices), and the per-rail `-rail-N` annotations preserve whatever measurement was recorded — including the bad value, so an operator can see the magnitude of the regression:
+
+```text
+$ kubectl describe node smc300x-ccs-aus-gpuf268
+Name: smc300x-ccs-aus-gpuf268
+Labels: amd.com/gpu-hw-acceptance=passed
+ amd.com/gpu-mesh-validation=passed
+ amd.com/nic-health=passed
+ amd.com/rail-bandwidth=failed
+ feature.node.kubernetes.io/amd-gpu=true
+ feature.node.kubernetes.io/amd-nic=true
+ ...
+Annotations: amd.com/rail-bandwidth-failure-reason=failed-rails:5,7
+ amd.com/rail-bandwidth-failed-rails=5,7
+ amd.com/rail-bandwidth-rail-0=388
+ amd.com/rail-bandwidth-rail-1=391
+ amd.com/rail-bandwidth-rail-2=387
+ amd.com/rail-bandwidth-rail-3=385
+ amd.com/rail-bandwidth-rail-4=390
+ amd.com/rail-bandwidth-rail-5=180
+ amd.com/rail-bandwidth-rail-6=386
+ amd.com/rail-bandwidth-peer=smc300x-ccs-aus-gpuf269
+ amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:18:02Z
+ ...
+```
+
+The `-failure-reason` annotation always uses the `failed-rails:` form for per-rail failures — keeping the failure summary compact for the dashboard view while delegating the offending-rail list to the dedicated `-failed-rails` annotation. The `-rail-N` annotations for failed rails carry the **measured** Gbps (`180` in the example) when a parse succeeded; when no measurement could be recorded (crash, parse failure, peer unready), the `-rail-N` annotation is simply omitted and the rail index appears in `-failed-rails`.
+
+**Per-(pair, rail) failure reasons.** Each rail's outcome is classified by `PHASE4_DRIVER_SCRIPT` from the client Job's terminal state and log content. The reason is persisted to `$PHASE4_STATE_DIR/results//rail-.reason` and ultimately rolls up into the `failed-rails:` summary on the node:
+
+| Reason token (per rail) | Meaning |
+|-------------------------|---------|
+| `below-threshold:` | Client Job completed cleanly and a `BW average` line was parsed, but the value `` is below `PHASE4_BW_THRESHOLD`. The measured value is also written to `rail-` so an operator sees the magnitude. |
+| `parse-failed` | Client Job completed cleanly but the log contained no parseable `BW average` line — usually means the per-rail RDMA device did not respond or the client exited too early. |
+| `ib-write-bw-crashed` | Client Job ended `Failed` and no `BW average` line was emitted — `ib_write_bw` aborted (bad CLI args, missing device, RDMA verbs error). |
+| `peer-pod-unready` | Either the server pod never published a pod IP within half of `PHASE4_PAIR_WAIT_TIME`, or the client Job timed out before reaching a terminal state. Typically caused by NAD attach hangs or scheduler back-pressure. |
+| `nad-missing` | The requested `amd-host-device-nad-rail-${RAIL_IDX}` does not exist — pod admission rejected by the NAD admission webhook. Action: verify per-rail NADs are deployed (`kubectl get net-attach-def -A \| grep amd-host-device-nad-rail-`). The NADs are normally installed automatically by `setup-cluster.yml` from `configs/nad-per-rail.yaml`; re-apply with `ansible-playbook playbooks/cvf.yml -e action=reapply` if they were deleted out-of-band, and confirm the Multus CRD (`networkattachmentdefinitions.k8s.cni.cncf.io`) exists. |
+| `job-creation-failed` | `kubectl apply` of the rendered server or client Job template returned non-zero for a non-throttling reason (RBAC denial, malformed manifest, control-plane error). |
+| `api-throttled` | `kubectl apply` returned HTTP 429 and the driver's exponential back-off retry budget (3 attempts) was exhausted. Remaining rails for that pair are marked with this reason. |
+| `render-failed` | `sed` substitution of the server or client Job template failed to produce a valid manifest — usually a stale `cluster-validation-phase4-job-config` ConfigMap missing a substitution placeholder. |
+| `driver-state-dir-failed` | The driver could not create `/tmp/phase4--$$` as its on-disk state directory. Every input node fails Phase 4 in this scenario — surface as a CronJob filesystem regression. |
+
+#### Unpaired-Node Behavior
+
+When the Phase-3-passed input pool has an odd count, the lexicographically-last node has no peer to test against. Rather than failing the node (which would be punitive for a configuration issue outside the node's control) or holding it back from downstream consumers (Phase 4.5 would simply drop a single node by its 2-node minimum), `PHASE4_DRIVER_SCRIPT` short-circuits the unpaired node to `passed`:
+
+```text
+$ kubectl describe node smc300x-ccs-aus-gpuf270 # the odd one out
+Name: smc300x-ccs-aus-gpuf270
+Labels: amd.com/rail-bandwidth=passed
+ ...
+Annotations: amd.com/rail-bandwidth-unpaired=true
+ amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:14:55Z
+ ...
+```
+
+No Jobs are created for the unpaired node, no per-rail measurements are recorded, and the diagnostic `unpaired=true` annotation lets dashboards and downstream phases distinguish a "passed because not tested" node from a "passed because every rail met threshold" node. Phase 4.5 (cross-node connectivity matrix) enforces an `N >= 2` minimum and will drop the unpaired node from the matrix on the same tick.
+
+The single-node testbed (`smc300x-ccs-aus-gpuf268`) always exercises this path — every Phase 4 run on a one-node cluster ends in an immediate unpaired pass-label with no `ib_write_bw` Jobs created. Real pairwise validation is deferred to a 2-node testbed in production.
+
+**`SKIP_RAIL_BANDWIDTH_TEST` behavior.** Setting `skip-rail-bandwidth-test: true` in `config.json` (which renders to `SKIP_RAIL_BANDWIDTH_TEST=true` in the ConfigMap) makes Phase 4 short-circuit:
+
+- **No `ib_write_bw` Jobs are created** for any pair — neither server nor client.
+- Every input node is immediately labeled `amd.com/rail-bandwidth=passed` via the standard `label_phase_passed` helper, so downstream `filter_passed_nodes` calls treat the node as eligible for Phase 4.5 and Phase 5.
+- No pairing is computed, no per-rail annotations are written, no `unpaired=true` annotation is written, and no failure annotation is written.
+- The `cluster-validation-phase4-job-config` ConfigMap is still mounted at `/phase4-configs` inside the orchestrator container, but neither the server nor the client Job template is ever rendered.
+
+`skip-rail-bandwidth-test: true` is the default in the shipped `config.json` (see the [Incremental Bringup Workflow](#incremental-bringup-workflow) below) because Phase 4 requires (a) a 2-node minimum to exercise any real pairing, (b) per-rail NADs (`amd-host-device-nad-rail-0` through `-7`) deployed in the cluster, and (c) Phase 3 actually running to produce a non-empty input pool. Flip the flag to `false` once those prerequisites are in place and the network fabric is stable enough that the pairwise threshold check is meaningful.
+
+#### Threshold Tuning (Phase 4)
+
+`PHASE4_BW_THRESHOLD` is the single Gbps number that decides pass vs. fail per `(pair, rail)` when `ib_write_bw` completes cleanly, so it MUST be tuned for the NIC SKU, link speed, and fabric topology in use. The default (`380` Gbps) is calibrated for AMD Pensando DSC-200 (200 GbE per port, single-port-per-rail attachment, 400 Gbps line rate) — high enough to catch a single degraded link that drops a rail to ~half rate, low enough to absorb the small framing and orchestration overhead vs. wire speed.
+
+**How to override.** Change the value in `configs/cluster-validation-config.yaml` and push it to the live cluster with `ansible-playbook playbooks/cvf.yml -e action=reapply` (re-applies + patches all ConfigMaps; no orchestrator code changes are required). Example override snippet:
+
+```yaml
+# configs/cluster-validation-config.yaml
+data:
+ PHASE4_BW_THRESHOLD: "180" # relax to 180 Gbps for a 200 GbE / single-port fabric
+```
+
+The next CronJob tick will pick up the new value via the `envFrom: configMapRef` projection on the orchestrator container. Already-labeled nodes keep their existing label until the next phase run overwrites it.
+
+**Expected ranges by NIC / fabric class.** Use these as starting points; always validate against a measured baseline on healthy hardware before locking in a production threshold.
+
+| NIC class | Per-rail line rate | Typical `ib_write_bw` `BW average` | Suggested `PHASE4_BW_THRESHOLD` |
+|-----------|--------------------|------------------------------------|---------------------------------|
+| Pensando DSC-200 (2 × 200 GbE) | 400 Gbps | 390-395 Gbps | `380` (default) |
+| Pensando DSC-25 (2 × 25 GbE) | 50 Gbps | 46-49 Gbps | `45` |
+| 200 GbE single-port-per-rail | 200 Gbps | 190-198 Gbps | `180` |
+| 100 GbE single-port-per-rail | 100 Gbps | 92-98 Gbps | `90` |
+
+**Tuning workflow:**
+
+1. **Measure baseline.** On known-good hardware, set `PHASE4_BW_THRESHOLD: "1"` (effectively pass-through) and let one CronJob tick run on a 2-node pool. Read the per-rail `rail-N` annotations on either node — those are the actual measured Gbps values.
+2. **Set the threshold ~3-5% below the baseline.** Tight enough to flag a degraded link (a single bad lane on a 4-lane link typically drops the figure by 25%), loose enough to absorb normal run-to-run variance and small orchestration overhead.
+3. **Re-test the same pair with the new threshold.** Confirm `amd.com/rail-bandwidth=passed` on both nodes and that every `rail-N` annotation is recorded.
+4. **Inject a known-bad threshold** (`PHASE4_BW_THRESHOLD: "9999"`) on one tick to confirm the failure path labels and annotates correctly (`failure-reason=failed-rails:0,1,.,7`, `failed-rails=0,1,.,7`, every `rail-N` carrying the measured value), then revert.
+
+If `ib_write_bw` `BW average` is consistently below the suggested-range floor on healthy hardware, suspect a fabric or driver issue (per-rail NAD pointing at the wrong PF/VF, switch egress queue mis-tuned, MTU mismatch, ROCm/OFED version drift) rather than relaxing the threshold further. The per-rail granularity of the annotations is the diagnostic — a single rail consistently 30% slow points at one cable or one switch port; all rails uniformly slow points at host- or image-level configuration.
+
+#### Concurrency-Cap Tuning
+
+Pair-runners are forked in parallel under a hard cap of `PHASE4_MAX_CONCURRENT_PAIRS` (default `8`). The cap exists for one reason: each pair-runner submits up to 2 Jobs per rail (server + client) for up to `PHASE4_RAIL_COUNT` rails, so an uncapped run on a 64-node pool (32 pairs × 8 rails × 2 Jobs = 512 `kubectl apply` calls) would trivially exceed the kube API server's default QPS budget and trigger 429-throttled storms.
+
+**How the cap behaves.**
+
+- The driver dispatches pair-runners in lexicographic pair order, sleeping in a `wait -n` loop until fewer than `PHASE4_MAX_CONCURRENT_PAIRS` are in flight.
+- Within a single pair, rails are **always serialized** — only one `(pair, rail)` `ib_write_bw` flow exists at any moment per pair. This invariant prevents a pair from double-claiming the same NIC for two concurrent flows, which would corrupt both measurements.
+- When `kubectl apply` returns HTTP 429 despite the cap, the driver's `_phase4_apply_with_backoff` retries with exponential back-off up to 3 attempts before marking the remaining rails of that pair as `api-throttled` and moving on.
+- `PHASE4_MAX_CONCURRENT_PAIRS` is bounded to `>= 1` by the driver — a misconfigured `0` or negative value is promoted to `1` (fully serial) with a WARN log line, ensuring the phase never stalls outright.
+
+**Sizing guidance.**
+
+| Cluster size (Phase-3-passed nodes) | Recommended `PHASE4_MAX_CONCURRENT_PAIRS` |
+|-------------------------------------|-------------------------------------------|
+| 2-4 nodes (1-2 pairs) | `2` — fewer in-flight Jobs, easier to triage |
+| 5-16 nodes (3-8 pairs) | `8` (default) |
+| 17-64 nodes (9-32 pairs) | `16` — paired with a kube API QPS raise (`--kube-api-qps`, `--kube-api-burst` on `kube-apiserver`) |
+| 65+ nodes (33+ pairs) | Keep at `16` — additional parallelism past 16 typically saturates the etcd write path before it speeds up the phase wall-clock; let the pair queue drain instead. |
+
+**Total Phase 4 wall-clock budget.** With the cap honored, the upper bound is `ceil(num_pairs / PHASE4_MAX_CONCURRENT_PAIRS) * PHASE4_RAIL_COUNT * PHASE4_PAIR_WAIT_TIME`. With defaults (`8` cap, `8` rails, `180`s per rail) and 16 pairs that is `2 * 8 * 180 = 2880` s ≈ 48 min worst-case; healthy hardware completes well under half that because each rail's wait budget covers the full timeout, not the measured runtime.
+
+### Phase 4.5: Cross-Node Connectivity Matrix Test
+
+**Scope:** Pre-flight gate immediately before Phase 5 (multi-node RCCL). Phase 4.5 validates that every node that passed Phase 4 can actually reach every other node across four orthogonal dimensions — SSH, DNS, MPI process spawning, and RCCL topology detection — so multi-hop routing or DNS faults that pairwise Phase 4 cannot observe are caught before the long Phase 5 MPIJob run starts. Phase 4.5 does **not** introduce a new per-node phase label: it is a gate, not a tested capability of an individual node. Failures are recorded as a node annotation and abort the Phase 5 MPIJob via init-container non-zero exit.
+
+**Execution model: piggy-back on the Phase 5 launcher init-container.** Phase 4.5 runs **inside the existing `wait-for-worker-pods` init-container of the Phase 5 MPIJob launcher** — there is **no separate Pod or Job** for Phase 4.5. The orchestrator sequence is:
+
+1. The CronJob orchestrator submits the Phase 5 MPIJob (the standard launcher + worker set).
+2. The launcher's `wait-for-worker-pods` init-container runs `PHASE45_PREFLIGHT_SCRIPT` (sourced from `cluster-validation-config.yaml` via env var). The init-container blocks launcher startup until the pre-flight passes.
+3. If the pre-flight fails, the init-container exits non-zero → launcher pod fails → MPIJob fails → existing CronJob Phase 5 failure path applies (no extra wiring).
+
+The rationale for sharing the init-container instead of standing up a dedicated Phase 4.5 Pod: the init-container already has the assembled worker pod set, the `cluster-validation-sa` ServiceAccount with the credentials/RBAC to `kubectl exec` into every worker, and the network access to reach every worker. Reusing it avoids a second pod-startup latency on every CronJob tick.
+
+**The 4 checks.** `PHASE45_PREFLIGHT_SCRIPT` runs four independent probes against the worker set. The script does **not** short-circuit on the first failing check — it runs all four, then aggregates the failure reasons into a single annotation. This keeps the failure record actionable: an operator inspecting a `failed` run sees every distinct fault class on the cluster, not just the first one tripped.
+
+| # | Check | What it does | Failure reason token |
+|---|-------|--------------|----------------------|
+| 1 | N×N SSH mesh | For every (src, dst) ordered worker-pod pair, `kubectl exec` the src and `ssh -o ConnectTimeout=5 dst-ip 'echo ok'`. Records every failing pair in `failed_pairs`. Catches multi-hop routing faults, asymmetric ACLs, and partial fabric outages that pairwise Phase 4 cannot observe. | `ssh-mesh` |
+| 2 | DNS forward + reverse | From the first worker pod, `getent hosts ` (forward) for every node hostname, then `getent hosts ` (reverse) on the returned IP. Catches cluster-DNS / coredns regressions before Phase 5 hits them mid-collective. | `dns` |
+| 3 | `mpirun --hostfile` no-op spawn | Write `WORKER_IPS` to `/tmp/hf` and `mpirun --hostfile /tmp/hf --np $(wc -l < /tmp/hf) --allow-run-as-root true`. A no-op program isolates the OMPI launcher / orted bootstrap path from the workload — a hang or non-zero exit here means MPI cannot start, period. | `mpi-spawn` |
+| 4 | RCCL topology probe | Minimal `mpirun . -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT $PERF_TEST_DIR/all_reduce_perf -b 1K -e 1K -n 1` and grep the first 50 `NCCL INFO comm|topology` lines. Wrapped in a 60 s timeout. On timeout, **soft-fail** — annotate `rccl-topology` but proceed, because first-run warming of kernel/firmware caches can legitimately exceed 60 s. The real RCCL test is Phase 5. | `rccl-topology` |
+
+**Sample failure annotation output.** When one or more checks fail, every participating worker node receives the `amd.com/phase4_5-failure-reason` annotation listing the failed check classes (comma-separated). NO node label is written or changed — per-phase labels from Phases 1-4 stay frozen, preserving the audit trail.
+
+```text
+$ kubectl describe node smc300x-ccs-aus-gpuf268
+Name: smc300x-ccs-aus-gpuf268
+Labels: amd.com/gpu-hw-acceptance=passed
+ amd.com/gpu-mesh-validation=passed
+ amd.com/nic-health=passed
+ amd.com/rail-bandwidth=passed
+ feature.node.kubernetes.io/amd-gpu=true
+ feature.node.kubernetes.io/amd-nic=true
+ ...
+Annotations: amd.com/phase4_5-failure-reason=ssh-mesh,mpi-spawn
+ amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:47:33Z
+ ...
+```
+
+The example shows the canonical "fabric or routing fault" pattern: N×N SSH probes find at least one unreachable pair and `mpirun` cannot complete its TCP bootstrap, so both `ssh-mesh` and `mpi-spawn` are aggregated into the single annotation. An operator triaging the run can `kubectl get nodes -o jsonpath` over `amd.com/phase4_5-failure-reason` to see which classes failed cluster-wide and start investigation there.
+
+Best-effort attribution caveat: the script annotates **every** participating node with the same failure-reason list rather than trying to attribute "node A is at fault for the ssh-mesh failure to node B". Failure modes at this layer are almost always shared infrastructure (a TOR port, the coredns Service, a fabric routing policy), so per-node attribution would be misleading more often than it would be helpful. If one specific node is repeatedly the only one annotated across CronJob ticks, that signal will surface organically and the operator can isolate it.
+
+**No per-node label change is intentional.** Unlike Phases 1-4, Phase 4.5 does **not** write `=passed|failed` for any node. The rationale:
+
+- Phase 4.5 tests a **cluster property** (the N×N reachability surface), not a per-node capability. Labeling individual nodes `phase4_5=failed` would falsely imply that the node has a localized fault, when in practice the failure usually means "this set of nodes cannot collectively form an MPI world."
+- Keeping the per-node labels frozen at the Phase 4 outcome preserves the audit trail: an operator looking at a node a week later sees the result of every per-node phase that was actually exercised on that node, without Phase 4.5 (a gate) overwriting that history.
+- The annotation alone is sufficient for the downstream contract: the next CronJob tick re-evaluates the candidate pool from scratch, sees `amd.com/phase4_5-failure-reason=.` on the previously-failing nodes, and the operator decides whether to investigate before the next Phase 4.5 attempt.
+
+**`SKIP_RCCL_TEST` behavior (shared with Phase 5).** Phase 4.5 does **not** carry its own skip flag — it shares `SKIP_RCCL_TEST` with Phase 5. Setting `skip-rccl-test: true` in `config.json` (which renders to `SKIP_RCCL_TEST=true` in the ConfigMap) makes both phases short-circuit together:
+
+- **No Phase 5 MPIJob is submitted** when `SKIP_RCCL_TEST=true`. Because Phase 4.5 lives **inside the launcher init-container of that very MPIJob**, no init-container is created either — Phase 4.5 simply does not run.
+- No `amd.com/phase4_5-failure-reason` annotation is written, no `PHASE45_PREFLIGHT_SCRIPT` is invoked, and no `kubectl exec` traffic flows between worker pods.
+- The orchestrator's `maybe_run_phase45` helper logs that Phase 4.5 is skipped together with Phase 5 and continues to post-phase cleanup.
+
+This shared-flag model is intentional. Phase 4.5 exists **only** to gate Phase 5 — a Phase 5 that is skipped needs no pre-flight, and a Phase 5 that is enabled should always have its pre-flight run. Splitting the two flags would let a misconfigured cluster run Phase 5 directly without the pre-flight, which would simply turn a fast, well-classified Phase 4.5 failure into a slow, opaque RCCL crash 30 minutes into the MPIJob. The shared flag also keeps the dry-run shape consistent: `DRY_RUN=1` with `SKIP_RCCL_TEST=false` prints "DRY_RUN -- skipping run_phase5" and "DRY_RUN -- skipping maybe_run_phase45" in lock-step.
+
+#### Per-(pair, check) Granularity Limits
+
+Phase 4.5's diagnostics are intentionally coarser than Phase 4's `(pair, rail)` granularity for two reasons:
+
+1. **Failure modes at this layer are usually shared infrastructure.** SSH mesh failures, DNS regressions, and MPI bootstrap hangs almost never localize to a single node — they localize to a switch port, a coredns Pod, or a control-plane RBAC change. Recording every failing `(src, dst)` SSH pair in the annotation would push the annotation past the 256 KB per-node total-annotation budget on large clusters without giving the operator new information.
+2. **The fast diagnostic loop is `kubectl logs` of the launcher init-container, not annotations.** Every failing SSH pair, every DNS miss, and the full `mpirun` stderr is printed to the init-container's stdout. Operators triaging a Phase 4.5 failure read those logs first; the annotation exists only to make the failure visible on `kubectl describe node` and to allow dashboards to surface the affected node set without scraping pod logs.
+
+The annotation value is bounded to ≤ 128 bytes (the four reason tokens plus comma separators fit comfortably) so the per-node 256 KB total-annotation budget is never threatened, even when Phase 4.5 runs on every CronJob tick for weeks.
+
+### Phase 5: Multi-Node RCCL Collective Test
+
+**Scope:** Final gate of the pipeline. Multi-node RCCL collective workload (`all_reduce_perf`, `broadcast_perf`, `reduce_scatter_perf`) executed as a single MPIJob across the worker set that survived Phase 4.5. The phase produces the terminal cluster verdict on each participating node and is the only phase whose `passed` outcome means "this node was actually exercised end-to-end under multi-node RCCL traffic above the configured bandwidth threshold." The MPIJob template (`cluster-validation-mpijob-config`) and the RCCL workload image (`rocm/roce-workload`) are reused from prior releases — Phase 5 itself is the same workload as before; the work refactors how the orchestrator drives it (script extraction, helper-based labeling, dynamic worker count, per-worker triage) without changing the workload semantics.
+
+**Result label:** `amd.com/cluster-validation-status=passed|failed`. This is the **same label key the framework has always written for Phase 5** (`PHASE5_LABEL_KEY` in the ConfigMap) — no change for any operator dashboard, alerting rule, or downstream consumer that already reads `amd.com/cluster-validation-status`. What changed is the **write path** (now via the `label_phase_passed` / `label_phase_failed` helpers, not raw `kubectl label`) and the **failure-reason annotation** (now per-worker, see [Sample failure annotation output](#sample-failure-annotation-output-phase-5) below).
+
+**Pairing model.** Phase 5 takes the node set that survived Phase 4.5 as its input pool. The MPIJob is rendered with `worker-replicas = len(input pool)` — dynamic per CronJob tick — so a tick that finds 3 surviving nodes runs a 3-worker MPIJob, a tick that finds 7 runs a 7-worker MPIJob, and so on. The `WORKER_REPLICAS` value from `config.json` (rendered into the ConfigMap as `WORKER_REPLICAS`) is **the upper bound on candidate selection in Phase 0** (the orchestrator picks at most `WORKER_REPLICAS` candidates from the labeled node pool); treat it as a maximum tick budget, not a target — Phase 5 always uses the actual surviving count.
+
+**ConfigMap env vars (Phase 5 subset).** Tunable values are sourced from `cluster-validation-config`:
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `PHASE5_MIN_WORKERS` | `2` | Minimum surviving-node count required before Phase 5 submits the MPIJob. If `input_count < PHASE5_MIN_WORKERS`, the script logs the reason and `return 0` with no MPIJob submitted and no label changes (the per-phase verdict is undefined for a degenerate single-node MPIJob). Default `2` reflects "multi-node RCCL is multi-node by definition." Override to `1` for opt-in single-node plumbing validation, or to a higher value to demand a larger minimum cluster. |
+| `WORKER_REPLICAS` | (from `config.json`) | Phase 0 candidate-selection budget (see above) — upper bound, not the MPIJob worker count. |
+| `MPIJOB_WAIT_TIME` | `240` | `kubectl wait mpijob --for=condition=Succeeded --timeout` budget in seconds. Bounds total Phase 5 wallclock per tick. |
+| `SLOTS_PER_WORKER` | (from `config.json`) | MPI ranks per worker (= GPUs per worker by convention); passed into the MPIJob template via `sed` substitution. |
+| `ROCE_WORKLOAD_IMAGE` | (from `config.json` `images.roce-workload`) | RCCL workload container image. |
+| `PHASE5_LABEL_KEY` | `amd.com/cluster-validation-status` | The label key written on every participating node. |
+
+
+#### Sample failure annotation output (Phase 5)
+
+When the MPIJob exits non-zero (any cause — RCCL `validate_rccl_test.sh` reports bandwidth below threshold, a worker container crashes, the launcher aborts, `kubectl wait` times out, etc.), **every** participating node receives the `failed` label plus the uniform `-failure-reason` annotation. The reason value is constructed **per-node** from the worker pod that ran on that node and that pod's container exit code, so an operator inspecting `kubectl describe node` sees which worker pod failed on each node and what its exit was — no need to cross-reference the launcher log to attribute the failure:
+
+```text
+$ kubectl describe node smc300x-ccs-aus-gpuf268
+Name: smc300x-ccs-aus-gpuf268
+Labels: amd.com/gpu-hw-acceptance=passed
+ amd.com/gpu-mesh-validation=passed
+ amd.com/nic-health=passed
+ amd.com/rail-bandwidth=passed
+ amd.com/cluster-validation-status=failed
+ feature.node.kubernetes.io/amd-gpu=true
+ feature.node.kubernetes.io/amd-nic=true
+ ...
+Annotations: amd.com/cluster-validation-status-failure-reason=worker-pod=cluster-validation-mpi-job-20260520-1632-worker-0,exit=137
+ amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:38:11Z
+ ...
+```
+
+The example shows the canonical "OOM-killed worker" pattern: node `smc300x-ccs-aus-gpuf268` ran worker pod `cluster-validation-mpi-job-20260520-1632-worker-0`, which exited 137 (SIGKILL — typically OOM). A neighbour node in the same MPIJob would carry the same `failed` label but a different `-failure-reason` annotation pointing at its own worker pod and exit code, letting the operator localize the cluster-level failure down to a specific node + container.
+
+**Per-node failure-reason field semantics.** The `-failure-reason` annotation always uses the form `worker-pod=,exit=`. Both fields fall back to the literal string `unknown` when the corresponding lookup fails — never empty, never omitted — so the annotation shape stays deterministic for log scrapers and dashboards:
+
+| Sub-field | Source | `unknown` fallback condition |
+|-----------|--------|------------------------------|
+| `worker-pod=` | `kubectl get pods -l training.kubeflow.org/job-name=,training.kubeflow.org/replica-type=worker --field-selector spec.nodeName= -o jsonpath='{.items[0].metadata.name}'` | No worker pod matched the node (race against MPIJob `cleanPodPolicy: All`, scheduler eviction, or a node that the MPIJob never actually placed a worker on). |
+| `exit=` | `kubectl get pod -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'` on the resolved worker pod | The worker pod's primary container never reached `terminated` state (still `running`, `waiting`, or evicted before terminating), or the worker pod itself was unknown. |
+
+When `worker-pod` is `unknown`, `exit` is forced to `unknown` as well — the exit code lookup requires the pod name. The annotation value is bounded by the helper's `PHASE_ANNOTATION_VALUE_MAX_BYTES` cap, so pod names that approach the 253-char DNS-1123 limit cannot push the annotation past the 256 KB per-node total-annotation budget.
+
+**Per-worker log dump.** On **both** the pass and fail paths, after the launcher log is collected, the orchestrator walks the input node set and saves each worker pod's logs to a per-node file under `LOG_DIR` (= `/var/log/cluster-validation` on the host, mounted from a `hostPath` volume):
+
+```text
+/var/log/cluster-validation/launcher-.log # existing — launcher pod, --all-containers
+/var/log/cluster-validation/worker--.log # new — worker pod on , --all-containers
+```
+
+For example, a 3-node MPIJob `cluster-validation-mpi-job-20260520-1632` produces:
+
+```text
+/var/log/cluster-validation/launcher-cluster-validation-mpi-job-20260520-1632.log
+/var/log/cluster-validation/worker-smc300x-ccs-aus-gpuf268-cluster-validation-mpi-job-20260520-1632.log
+/var/log/cluster-validation/worker-smc300x-ccs-aus-gpuf269-cluster-validation-mpi-job-20260520-1632.log
+/var/log/cluster-validation/worker-smc300x-ccs-aus-gpuf270-cluster-validation-mpi-job-20260520-1632.log
+```
+
+The per-worker log files exist on the **same host** as the orchestrator container (the `submit-mpijob` pod) — typically the LOG_STORE_NODE_NAME chosen at install time. Operators triaging a Phase 5 failure read the per-node log file matching the failed node from the `-failure-reason` annotation, without scraping the launcher log for that worker's interleaved output. The per-worker dump runs on the pass path too so a passing run's per-worker timing/topology data is still archived for later baseline comparison. If a worker pod is missing at log-collection time (race against MPIJob cleanup, scheduler eviction), the orchestrator emits a single `Warning: Could not find worker pod for node ` line and continues with the remaining nodes — one missing log file does not break the loop. Existing log-retention policy (controlled by `CLEANUP_TEST_LOGS` at teardown) covers both the launcher and the new per-worker files.
+
+#### Per-node failure-reason mapping (Phase 5)
+
+Phase 5's `-failure-reason` is a single composite token rather than a fixed enum, but the cross-field combinations map to recognizable triage classes:
+
+| `worker-pod` | `exit` | Meaning |
+|--------------|--------|---------|
+| concrete pod name | numeric (e.g., `1`, `137`, `139`) | Worker pod ran to terminal state and exited non-zero. Use the per-worker log (`worker--.log`) to read the actual error message. Common: `137` = SIGKILL (OOM, eviction); `139` = SIGSEGV (RCCL or driver crash); `1` = RCCL `validate_rccl_test.sh` bandwidth-below-threshold. |
+| concrete pod name | `0` | Worker pod exited zero but the MPIJob still failed — almost always means the **launcher** detected `validate_rccl_test.sh` failure across the collective output (bandwidth below threshold, missing test output). Inspect the launcher log; the worker exit was clean but the run's verdict was not. |
+| concrete pod name | `unknown` | Worker pod exists but its primary container never reached `terminated` state at label-write time. Common causes: container still running when `kubectl wait` timed out, pod evicted before terminating, or `--for=condition=Succeeded` fired non-zero while the pod was mid-shutdown. |
+| `unknown` | `unknown` | No worker pod could be matched to this node. Indicates the MPIJob's `cleanPodPolicy: All` already deleted worker pods by the time the failure-labeling loop ran, the worker never got scheduled on this node, or `kubectl get pods` with the label/field selector returned empty. |
+
+The per-worker log file is the source of truth for the actual failure root cause — the annotation surface is intentionally compact (single field, deterministic shape) so dashboards can pivot on it without parsing a multi-token string. Operators chasing a recurring failure pattern join the annotation across nodes (`kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.amd\\.com/cluster-validation-status-failure-reason}{"\n"}{end}'`) to spot whether the same worker-pod-suffix or the same exit code is recurring cluster-wide.
+
+**`SKIP_RCCL_TEST` behavior.** Setting `skip-rccl-test: true` in `config.json` (which renders to `SKIP_RCCL_TEST=true` in the ConfigMap) makes Phase 5 short-circuit:
+
+- **No MPIJob is submitted** for the input set. Because Phase 4.5 lives inside the Phase 5 MPIJob launcher init-container, Phase 4.5 also does not run when `SKIP_RCCL_TEST=true` (the two phases share the flag — see [Phase 4.5: Cross-Node Connectivity Matrix Test](#phase-45-cross-node-connectivity-matrix-test)).
+- Every input node is immediately labeled `amd.com/cluster-validation-status=passed` via the standard `label_phase_passed` helper. This preserves the long-standing early-bringup short-circuit semantics — the same behavior Phase 5 has always had on `SKIP_RCCL_TEST=true`.
+- No failure annotation is written, no per-worker log files are created, and `PHASE5_MIN_WORKERS` is not evaluated (skip wins over the min-workers guard).
+- The candidate-label removal loop (`amd.com/cluster-validation-candidate-`) still runs so the input nodes return to the pool for the next CronJob tick under the new label outcome.
+
+**`PHASE5_MIN_WORKERS` short-circuit.** When `SKIP_RCCL_TEST=false` and the surviving input set has fewer nodes than `PHASE5_MIN_WORKERS` (default `2`), the script logs `[Phase 5] only node(s) survived prior phases; minimum is -- skipping MPIJob (no label changes)` and `return 0` without submitting the MPIJob and **without** writing any Phase 5 label or annotation. This intentionally leaves `amd.com/cluster-validation-status` **absent** on those nodes — there is no defensible `passed` (MPIJob never ran) or `failed` (no measurement, no failure) verdict to write. The empty-input case (`input_count=0`) takes the same branch and is treated identically. Operators wanting Phase 5 to label single-node clusters as `passed` for plumbing-validation purposes should set `PHASE5_MIN_WORKERS=1` (and accept that the resulting one-worker MPIJob exercises no inter-node collective at all). The single-node testbed `smc300x-ccs-aus-gpuf268` always exercises this short-circuit path with the default `PHASE5_MIN_WORKERS=2`.
+
+### Dry-Run Mode
+
+For orchestrator-flow validation without touching cluster state, set `DRY_RUN=1` on the CronJob (or on a manual `kubectl exec` of the orchestrator container):
+
+```bash
+# Inside the CronJob's submit-mpijob container
+DRY_RUN=1 bash /scripts/run-phases.sh
+```
+
+In `DRY_RUN=1` mode:
+
+- Phase 0 reads existing `amd.com/cluster-validation-candidate=true` labels instead of writing new ones — no candidate labeling happens.
+- Every `run_phaseN` function is overridden to `:`-style no-op that prints a `DRY_RUN -- skipping run_phaseN ()` line.
+- `maybe_run_phase45` and the post-phase cleanup / log-collection steps log "DRY_RUN -- skipping ." instead of running.
+- The orchestrator still walks the entire phase order, honors all `SKIP_*` flags, and prints the planned filtered candidate pool per phase — useful for verifying that a given `config.json` produces the expected pipeline shape before committing to a real run.
+
+Combinations the dry-run is designed to exercise:
+
+- All 5 skip flags `true` → exits 0 cleanly, no `kubectl` writes.
+- Phases 1+2 enabled, 3-5 skipped → plan output ends after the Phase 2 filter.
+- Phase 3 enabled but zero `amd-nic=true` nodes present → empty pool, exits 0.
+
+## `gpu-cluster.sh` — internal contract
+
+> `gpu-cluster.sh` is the underlying per-node script the Ansible playbooks invoke. Operators normally **do not run it directly** — `setup-cluster.yml` calls `run server`/`run agent` on each node, and `cvf.yml -e action=reapply` calls `reapply-cvf` inside the server container. This section documents the script's interface for debugging and customization.
```text
Usage: ./gpu-cluster.sh [args...]
Commands:
- build Build the Docker image
- run [args...] Run the node as server or agent
- teardown Tear down the cluster and clean up
- get-token Run on server node to print agent join token
- status Show cluster validation framework status and recent runs
- node-status Show validation status per node
- help Show this help message
+ build Build the Docker image (called by setup-cluster.yml on the control node)
+ run [args...] Start k3s server or agent container (called by setup-cluster.yml on each node)
+ teardown Tear down the cluster + clean up (called by teardown-cluster.yml)
+ get-token Print the agent join token (called by add-agent-nodes.yml)
+ reapply-cvf Re-run install_cluster_validation_framework (called by cvf.yml -e action=reapply)
+ status Show CronJob + recent orchestrator pods (called by cvf.yml -e action=status)
+ node-status Show per-node validation result table
+ help Show this help
Run arguments:
run server
run agent
```
-## Environment Variables
+**Environment variables** (used by the script; the playbooks pass them as Ansible vars):
| Variable | Default | Description |
| -------- | ------- | ----------- |
| `IMAGE_NAME` | `gpu-validation-cluster` | Docker image name |
| `IMAGE_TAG` | `latest` | Docker image tag |
| `BUILD_DIR` | `$SCRIPT_DIR/build` | Path to directory containing Dockerfile and entrypoint.sh |
-| `CONFIG_DIR` | `$SCRIPT_DIR/configs` | Path to directory containing config.json and other config files |
-| `CLEANUP_TEST_LOGS` | `false` | Clean up cluster validation test logs in `/var/log/cluster-validation` during teardown |
+| `CONFIG_DIR` | `$SCRIPT_DIR/configs` | Path to directory containing `config.json` and other config files |
+| `CLEANUP_TEST_LOGS` | `false` | Clean up validation logs in `/var/log/cluster-validation` during teardown |
-### Examples
+**Direct invocation examples** (for on-node debugging only):
```bash
-# Build using a custom build directory
+# Build with a custom build directory (normally set via setup-cluster.yml var)
BUILD_DIR=/path/to/custom/build ./gpu-cluster.sh build
-# Run server node with custom config directory
-CONFIG_DIR=/path/to/custom/configs ./gpu-cluster.sh run server
-
-# Run agent node with custom config directory to join a cluster
-CONFIG_DIR=/path/to/custom/configs ./gpu-cluster.sh run agent
-
-# Teardown with cluster validation logs cleanup enabled
-CLEANUP_TEST_LOGS=true ./gpu-cluster.sh teardown
+# Re-apply CVF configs (Ansible equivalent: cvf.yml -e action=reapply)
+CONFIG_DIR=/path/to/custom/configs ./gpu-cluster.sh reapply-cvf
-# Show cluster validation framework CronJob status and recent pod runs
+# Local status view (Ansible equivalent: cvf.yml -e action=status, which also shows per-node Ansible-side state)
./gpu-cluster.sh status
-
-# Show per-node validation test status (last run time and result)
./gpu-cluster.sh node-status
+
+# Teardown with log cleanup (Ansible equivalent below under Cleanup Behavior)
+CLEANUP_TEST_LOGS=true ./gpu-cluster.sh teardown
```
## Directory Structure
@@ -218,7 +1106,7 @@ CLEANUP_TEST_LOGS=true ./gpu-cluster.sh teardown
```text
gpu-validation-cluster/
├── README.md # Project documentation
-├── gpu-cluster.sh # Unified script for build, run, teardown, and get-token
+├── gpu-cluster.sh # Internal per-node script invoked by the Ansible playbooks (build, run, reapply-cvf, teardown, get-token)
├── build/ # Build context
│ ├── Dockerfile # Dockerfile to build the image
│ └── entrypoint.sh # Container entrypoint script
@@ -228,7 +1116,6 @@ gpu-validation-cluster/
│ └── cluster-validation-job.yaml # Cluster validation framework cronjob
└── ansible/ # Ansible automation (recommended for multi-node)
├── README.md # Ansible documentation
- ├── quickstart.sh # Quick start script
├── inventory.yml # Node inventory configuration (edit this!)
├── ansible.cfg # Ansible configuration
└── playbooks/ # Ansible playbooks
@@ -237,7 +1124,7 @@ gpu-validation-cluster/
├── add-agent-nodes.yml # Add new nodes to existing cluster
├── remove-agent-nodes.yml # Remove nodes from existing cluster
├── teardown-cluster.yml # Teardown entire cluster
- └── check-status.yml # Status checking
+ └── cvf.yml # CVF day-2 ops (reapply / reset / inject-secret / trigger / status / fresh-run)
```
## Configuration
@@ -250,102 +1137,52 @@ Customize operator behavior by editing files in the `configs/` directory:
## Cleanup Behavior
-By default, the teardown command preserves cluster validation logs in `/var/log/cluster-validation` for troubleshooting and analysis. To remove these logs during teardown, set the `CLEANUP_TEST_LOGS` environment variable to `true`:
-
-```bash
-CLEANUP_TEST_LOGS=true ./gpu-cluster.sh teardown
-```
-
-## Monitoring Validation Tests
-
-### Cluster-Wide Status
-
-To view the overall cluster validation framework status including CronJob configuration and recent pod runs:
+By default, `teardown-cluster.yml` preserves the persistent validation logs in `/var/log/cluster-validation` on each node for post-mortem analysis. To also wipe the logs during teardown:
```bash
-./gpu-cluster.sh status
+ansible-playbook playbooks/teardown-cluster.yml -e cleanup_test_logs=true
```
-This command displays:
+(Under the hood the playbook passes `CLEANUP_TEST_LOGS=true` to `gpu-cluster.sh teardown` on each node.)
-- **CronJob Status**: Configuration and schedule of validation CronJobs
-- **Recent Pod Runs**: Last 5 pod executions with timestamps, phases, and assigned nodes
-- **Pod Details**: Detailed information about recent validation test pods
-
-### Per-Node Validation Status
+## Monitoring Validation Tests
-To view validation test results broken down by individual node:
+For the unified status view (CronJob, recent orchestrator pods, in-flight phase Jobs, MPIJob, per-node phase labels + Phase 1 stage annotations + failure-reason annotations + per-node container ping):
```bash
-./gpu-cluster.sh node-status
+ansible-playbook playbooks/cvf.yml -e action=status
```
-This command displays:
-
-- **Node Summary Table**: Shows each node with its last run timestamp and validation result (Passed/Failed/Pending)
-- **Detailed Node Information**: Per-node breakdown including:
- - Last run timestamp (from node annotation)
- - Validation result status
- - Most recent pod name that executed on the node
-
-**Result Status Legend:**
-
-- `Passed`: All validation tests on the node passed
- - Label: `amd.com/cluster-validation-status=passed`
-- `Failed`: One or more validation tests on the node failed
- - Label: `amd.com/cluster-validation-status=failed`
-- `Pending`: Validation tests are running or have not yet executed
- - Label: not set (no label present)
-
-### Understanding the Output
-
-The per-node view uses Kubernetes node labels and annotations to track validation test execution:
+This is the single entry point for cluster-wide AND per-node validation state. See the [`cvf.yml` action reference in `ansible/README.md`](ansible/README.md#cvfyml) for the full output schema.
-- **Annotation `amd.com/cluster-validation-last-run-timestamp`**: Timestamp of the last validation test execution on this node
-- **Label `amd.com/cluster-validation-status`**: Current validation result status:
- - Set to `passed` if all tests passed
- - Set to `failed` if any tests failed
- - Not set (empty) if tests are pending or have not yet run
+### Result schema (what the status output means)
-## Ansible Automation
+The status output is computed from Kubernetes node labels and annotations the orchestrator writes after each phase:
-For production deployments with multiple nodes, the Ansible automation provides a streamlined deployment experience:
+| Per-node label | Phase | Values |
+|---|---|---|
+| `amd.com/gpu-hw-acceptance` | 1 (GPU HW acceptance) | `passed` / `failed` / `skipped` / unset |
+| `amd.com/gpu-mesh-validation` | 2 (intra-node RCCL) | `passed` / `failed` / `skipped` / unset |
+| `amd.com/nic-health` | 3 (NIC health) | `passed` / `failed` / `skipped` / unset (NIC-capable nodes only) |
+| `amd.com/rail-bandwidth` | 4 (rail bandwidth) | `passed` / `failed` / `skipped` / unset |
+| `amd.com/cluster-validation-status` | 5 (multi-node RCCL) — also the aggregate label | `passed` / `failed` / unset |
-**Key Features:**
+| Per-node annotation | Purpose |
+|---|---|
+| `amd.com/cluster-validation-last-run-timestamp` | Wall-clock of the last orchestrator run that touched this node. Drives the `NODE_VALIDATION_INTERVAL_MINS` skip gate. |
+| `amd.com/gpu-hw-acceptance-stage-` | Phase 1 per-stage result (`passed`/`failed`/`skipped`). Example: `amd.com/gpu-hw-acceptance-stage-gpu-stress=passed`. |
+| `amd.com/-failure-reason` | Free-form reason when the corresponding label is `failed`. Example: `amd.com/rail-bandwidth-failure-reason=failed-rails:0,1,2`. |
+| `amd.com/-failed-subtest` | Specific sub-test or recipe that failed (Phase 1). |
+| `amd.com/-failed-nics` | Specific NIC IDs that failed (Phase 3). |
-- **One-command deployment**: Deploy entire cluster across all nodes with a single command
-- **Automatic prerequisites**: Installs Docker and jq on all nodes
-- **SSH key management**: Automated passwordless SSH setup
-- **Image distribution**: Builds image locally and distributes to all nodes
-- **Token management**: Automatically retrieves and distributes k3s join token
-- **Status monitoring**: Unified status checking across all nodes
-- **Clean teardown**: One-command cluster removal
-
-**Quick Start:**
+Phase 5's failure reason is per-worker and takes the form `amd.com/cluster-validation-status-failure-reason=worker-pod=,exit=` — see [Phase 5: Multi-Node RCCL Collective Test](#phase-5-multi-node-rccl-collective-test) for sub-field semantics.
+Inspect raw labels/annotations directly via:
```bash
-# 1. Configure node inventory
-cd ansible
-vi inventory.yml # Update with your node IPs and settings
-
-# 2. Run automated deployment
-./quickstart.sh
-
-# 3. Check cluster status
-ansible-playbook playbooks/check-status.yml
-
-# 4. Teardown cluster
-ansible-playbook playbooks/teardown-cluster.yml
+ansible -i ansible/inventory.yml server-node -m shell -a \
+ "docker exec server kubectl get node -o yaml | yq .metadata.labels,.metadata.annotations"
```
-**Supported Scenarios:**
-
-- Server node local (where you run Ansible) + remote agent nodes
-- All remote nodes (server + agents on different machines)
-- Mixed environments with different sudo requirements
-
-**See the [ansible/README.md](ansible/README.md) for complete documentation.**
-
## FAQ and Troubleshooting
### 1. What if I hit the DockerHub rate limit to pull images from public repository?
@@ -369,27 +1206,13 @@ Users could configure a DockerHub account secrets in `configs/config.json` so th
If the agent container starts but doesn't appear when running `kubectl get nodes` on the server, check the logs:
-**For manual deployment:**
-
```bash
-# On the agent node host, check k3s logs
-cat /var/log/k3s.log
+# On the agent node host
+cat /tmp/gpu-cluster-agent.log # ansible-side deployment log
+docker logs agent # k3s agent container log
-# Or check the GPU cluster script logs
-docker logs agent
-```
-
-**For Ansible deployment:**
-
-```bash
-# On the agent node host, check the deployment logs
-cat /tmp/gpu-cluster-agent.log
-
-# Check k3s logs inside the container
-docker logs agent
-
-# Or SSH to the agent node and check
-ssh vm@ "cat /tmp/gpu-cluster-agent.log"
+# Or via Ansible from the control node
+ansible -i ansible/inventory.yml -m shell -a "cat /tmp/gpu-cluster-agent.log; docker logs agent --tail 50"
```
**Common issues:**
@@ -397,14 +1220,12 @@ ssh vm@ "cat /tmp/gpu-cluster-agent.log"
- **Wrong server IP**: Verify the server IP is correct and reachable from the agent node
- **Network connectivity**: Ensure port 6443 is accessible between nodes
- **Token mismatch**: Verify the token matches the server's token
-- **Config missing**: Check if `/home/vm/gpu-validation-cluster/configs/config.json` exists on the agent node
+- **Config missing**: Check that `{{ config_dir }}/config.json` exists on the agent node (see inventory's `config_dir`)
- **Firewall blocking**: Ensure firewall rules allow k3s traffic (TCP 6443, 10250)
-- **Port binding failed**: If you see "bind: address already in use", check for existing k3s processes (`sudo lsof -i :6443` or `ps aux | grep k3s`) and stop them, or run `./gpu-cluster.sh teardown` to clean up
+- **Port binding failed**: If you see "bind: address already in use", check for existing k3s processes (`sudo lsof -i :6443` or `ps aux | grep k3s`) and stop them, or re-run `ansible-playbook playbooks/teardown-cluster.yml` to clean up
### 3. How do I add new agent nodes to an existing cluster?
-**For Ansible deployments:**
-
```bash
# 1. Add new nodes to inventory.yml
vi ansible/inventory.yml
@@ -418,35 +1239,14 @@ cd ansible
ansible-playbook playbooks/add-agent-nodes.yml --limit agent-node-4
# 3. Verify new nodes joined
-ansible-playbook playbooks/check-status.yml
+ansible-playbook playbooks/cvf.yml -e action=status
docker exec server kubectl get nodes
```
-**For manual deployments:**
-
-```bash
-# 1. On new agent node: Copy and load the Docker image
-scp gpu-validation-cluster.tar user@new-node:/tmp/
-ssh user@new-node "docker load -i /tmp/gpu-validation-cluster.tar"
-
-# 2. Copy project files to new node
-scp -r . user@new-node:~/gpu-validation-cluster/
-
-# 3. Get token from server node
-./gpu-cluster.sh get-token
-
-# 4. On new agent node: Join the cluster
-ssh user@new-node
-cd ~/gpu-validation-cluster
-./gpu-cluster.sh run agent
-```
-
-**Important**: The add-agent-nodes.yml playbook is designed to safely add nodes without disrupting the existing cluster. It automatically detects and skips nodes already in the cluster.
+The `add-agent-nodes.yml` playbook safely adds nodes without disrupting the existing cluster; it auto-detects and skips nodes already in the cluster.
### 4. How do I remove agent nodes from an existing cluster?
-**For Ansible deployments:**
-
```bash
# Remove a specific node
cd ansible
@@ -456,22 +1256,8 @@ ansible-playbook playbooks/remove-agent-nodes.yml --limit agent-node-2
ansible-playbook playbooks/remove-agent-nodes.yml --limit "agent-node-2,agent-node-3"
# Verify nodes were removed
-ansible-playbook playbooks/check-status.yml
+ansible-playbook playbooks/cvf.yml -e action=status
docker exec server kubectl get nodes
```
-**For manual deployments:**
-
-```bash
-# 1. On server node: Drain the node (move workloads gracefully)
-docker exec server kubectl drain --ignore-daemonsets --delete-emptydir-data --force
-
-# 2. Delete node from cluster
-docker exec server kubectl delete node
-
-# 3. On agent node: Stop and clean up
-ssh user@agent-node
-./gpu-cluster.sh teardown
-```
-
-**Important**: The remove-agent-nodes.yml playbook gracefully drains workloads before removing nodes. Always use `--limit` to target specific nodes.
+The `remove-agent-nodes.yml` playbook gracefully drains workloads before removing nodes. Always use `--limit` to target specific nodes.
diff --git a/example/gpu-validation-cluster/ansible/README.md b/example/gpu-validation-cluster/ansible/README.md
index d28c9410c..fab7b4ec1 100644
--- a/example/gpu-validation-cluster/ansible/README.md
+++ b/example/gpu-validation-cluster/ansible/README.md
@@ -31,14 +31,14 @@ This directory contains Ansible playbooks for automating the deployment and mana
ansible/
├── ansible.cfg # Ansible configuration
├── inventory.yml # Cluster inventory (edit this!)
-├── quickstart.sh # Interactive quick start script
├── playbooks/
│ ├── setup-ssh-keys.yml # Configure passwordless SSH
│ ├── setup-cluster.yml # Main cluster setup
│ ├── add-agent-nodes.yml # Add new nodes to existing cluster
│ ├── remove-agent-nodes.yml # Remove nodes from existing cluster
│ ├── teardown-cluster.yml # Teardown entire cluster
-│ └── check-status.yml # Check cluster status
+│ ├── set-gpu-compute-mode.yml # Switch GPU compute-partition mode (SPX/CPX/...) + reboot to apply
+│ └── cvf.yml # CVF day-2 ops: reapply / reset / inject-secret / trigger / status / fresh-run
├── group_vars/ # Group variables (optional)
└── README.md
```
@@ -114,7 +114,7 @@ This will:
Check the cluster status:
```bash
-ansible-playbook playbooks/check-status.yml
+ansible-playbook playbooks/cvf.yml -e action=status
```
### Step 5: Adding New Agent Nodes (Optional)
@@ -133,7 +133,7 @@ vi inventory.yml
ansible-playbook playbooks/add-agent-nodes.yml --limit agent-node-4
# 3. Verify new nodes joined
-ansible-playbook playbooks/check-status.yml
+ansible-playbook playbooks/cvf.yml -e action=status
```
**Important**: Use `--limit` to target only the new nodes you want to add. The playbook automatically detects and skips nodes already in the cluster.
@@ -150,7 +150,7 @@ ansible-playbook playbooks/remove-agent-nodes.yml --limit agent-node-2
ansible-playbook playbooks/remove-agent-nodes.yml --limit "agent-node-2,agent-node-3"
# Verify nodes were removed
-ansible-playbook playbooks/check-status.yml
+ansible-playbook playbooks/cvf.yml -e action=status
```
**Important**: This will drain workloads, delete the node from Kubernetes, stop the agent container, and clean up cluster state.
@@ -190,11 +190,11 @@ ansible-playbook playbooks/setup-cluster.yml
- Exports image to tar file
2. **Prerequisites Phase (all nodes):**
- - Checks for Docker installation
- - Installs Docker if not present
- - Checks for jq installation
- - Installs jq if not present
- - Adds user to docker group
+ - Checks for Docker installation; installs Docker if not present
+ - Stops + disables any pre-existing **native** `k3s.service` / `k3s-agent.service` so the containerized k3s doesn't collide on host ports 6443/6444
+ - Ensures a local `/etc/group` `docker` entry (fixes `SocketGroup=docker` boot failure on hosts that only have the group via LDAP/SSSD)
+ - Always (re)starts the docker daemon (not just on a brand-new install)
+ - Installs jq if not present; adds user to docker group
3. **Distribution Phase (all nodes):**
- Copies Docker image tar to all nodes
@@ -202,15 +202,19 @@ ansible-playbook playbooks/setup-cluster.yml
- Copies scripts and configs
4. **Server Phase (server node):**
+ - **Idempotency probe**: skips restart when the existing `server` container is Up AND its kubectl can reach the apiserver. Otherwise tears down + recreates.
- Starts k3s server container
- Retrieves cluster join token
- Waits for server to be ready
5. **Agent Phase (agent nodes):**
+ - **Idempotency probe**: skips restart when the existing `agent` container is Up AND has an established TCP connection to the apiserver port 6443. Otherwise tears down + recreates.
- Starts k3s agent containers one by one
- Joins agents to server
- Verifies connection
+ Both the server and agent containers run a **supervisor loop** as their entrypoint (`build/entrypoint.sh`) — if `k3s server` / `k3s agent` crashes inside the container (post-reboot iptables-nft / cni0 transients), the supervisor auto-restarts it with exponential backoff. A rolling restart cap (`K3S_MAX_RESTARTS` / `K3S_RESTART_WINDOW`) escalates to `exit 1` so `docker --restart=unless-stopped` recreates the container fresh as the next layer of recovery.
+
6. **Verification Phase:**
- Lists all cluster nodes
- Displays cluster info
@@ -227,6 +231,11 @@ ansible-playbook playbooks/setup-cluster.yml --limit server_nodes
# Dry run (check mode)
ansible-playbook playbooks/setup-cluster.yml --check
+
+# Force-recreate the server + agent containers even when the idempotency
+# probe says they look healthy (e.g. after editing gpu-cluster.sh or
+# config.json in a way that requires a fresh container start).
+ansible-playbook playbooks/setup-cluster.yml -e force_restart=true
```
### add-agent-nodes.yml
@@ -346,22 +355,101 @@ ansible-playbook playbooks/teardown-cluster.yml
- Cleans up cluster state
- Verifies cleanup
-### check-status.yml
+### set-gpu-compute-mode.yml
-**Purpose:** Check current cluster status.
+**Purpose:** Switch the AMD GPU compute-partition mode (SPX / CPX / TPX / QPX / DPX) on one or more nodes and reboot to apply.
**Usage:**
```bash
-ansible-playbook playbooks/check-status.yml
+# Flip every agent node to SPX (single-partition; the canonical setting
+# for multi-GPU RCCL workloads — CPX disables P2P/XGMI between
+# partitions, which would force RCCL to fall back to SHM through host
+# DRAM and tank intra-node bandwidth).
+ansible-playbook -i inventory.yml playbooks/set-gpu-compute-mode.yml -e partition=SPX
+
+# Target a single host
+ansible-playbook -i inventory.yml playbooks/set-gpu-compute-mode.yml \
+ -e partition=SPX -e target=agent-node-1
+
+# Skip the reboot (only useful if you know the firmware applies live)
+ansible-playbook -i inventory.yml playbooks/set-gpu-compute-mode.yml \
+ -e partition=SPX -e skip_reboot=true
```
**What it does:**
-- Shows Docker container status on all nodes
-- Displays Kubernetes nodes
-- Lists all pods
-- Shows cluster validation framework status
+- Stops the CVF `agent` container on the target host (drains workloads)
+- Runs `rocm-smi --setcomputepartition ` on each target
+- Reboots the node (BAR/aperture reclamation generally requires a cold restart on MI300/MI308)
+- Waits for SSH + Docker to come back, then re-checks the partition
+- After completion, re-run `setup-cluster.yml` to rejoin the agent container to k3s (the existing tasks are idempotent and will recreate `/var/lib/gpu-validation-cluster` state cleanly)
+
+### cvf.yml
+
+**Purpose:** Single dispatcher for all Cluster Validation Framework day-2 operations.
+Five lifecycle playbooks (setup-ssh-keys / setup-cluster / add-agent-nodes /
+remove-agent-nodes / teardown-cluster) stay separate; everything else lives here
+behind `-e action=`.
+
+**Actions:**
+
+| `-e action=` | Purpose |
+|---|---|
+| `reapply` | Copy configs/ + gpu-cluster.sh to server-node and run `gpu-cluster.sh reapply-cvf` (kubectl apply YAMLs + kubectl patch CM/CronJob from config.json overrides). `config.json` is NOT applied to k8s — it's the source of patches. **Does NOT propagate `global.image-pull-secrets`** — use `inject-secret`. |
+| `reset` | Clear per-phase labels + annotations on every node. Optional: `-e suspend_cronjob=true`, `-e delete_completed_jobs=true`, `-e delete_all_jobs=true` (full slate including MPIJob CRDs and orphan pods). |
+| `inject-secret` | Create/update docker-registry Secret(s) + strategic-merge-patch `cluster-validation-sa.imagePullSecrets`. Reads `global.image-pull-secrets` from controller-side `config.json`. Affects only pods created after the patch. CLI override: `-e username=… -e token=… [-e registry_url=…] [-e secret_name=…]`. |
+| `trigger` | Idempotent `kubectl create job --from=cronjob/...` using a fixed name (default `cvf-test`). Deletes prior Job + orphan pods first. Override with `-e job_name=`. **Cron-vs-manual race**: a manual run that exceeds the `*/30` cron schedule will race the next cron tick for the same nodes and per-rail Phase 4 Job names. Suspend cron before long manual runs (see below). |
+| `status` | Dump CronJob, recent orchestrator pods, latest orchestrator log tail, in-flight phase Jobs, MPIJob, per-node phase labels + Phase 1 stage annotations + failure-reason annotations + per-node container ping + **per-node latest log file per phase** (paths under `/var/log/cluster-validation/`, so operators can `tail -f` straight to the right file). |
+| `fresh-run` | Composite: `reapply` → `reset` (with `delete_all_jobs=true`) → `inject-secret` → `trigger` → `status`. Replaces the four-command sequence operators kept typing. |
+
+**Examples:**
+
+```bash
+# Single action
+ansible-playbook playbooks/cvf.yml -e action=status
+ansible-playbook playbooks/cvf.yml -e action=reapply
+ansible-playbook playbooks/cvf.yml -e action=reset -e delete_all_jobs=true
+ansible-playbook playbooks/cvf.yml -e action=inject-secret
+ansible-playbook playbooks/cvf.yml -e action=inject-secret -e username=u -e token=t
+ansible-playbook playbooks/cvf.yml -e action=trigger -e job_name=my-run
+
+# Composite (replaces the 4-command chain)
+ansible-playbook playbooks/cvf.yml -e action=fresh-run
+```
+
+**Idempotency:** all actions are safe to re-run. Secret creation uses
+`kubectl create --dry-run=client -o yaml | kubectl apply -f -`; SA patches
+use strategic-merge with mergeKey=name; trigger deletes any prior `cvf-test`
+Job before creating.
+
+**Validation:** unknown `action` values fail fast with the valid list.
+
+**Suspending the CronJob during long manual runs:**
+
+The CronJob's `concurrencyPolicy: Forbid` only blocks cron-vs-cron overlap.
+A manual `trigger` that exceeds the `*/30` schedule will be lapped by the
+next cron tick — both orchestrators then race for the same nodes and reuse
+the same per-rail Phase 4 Job names (`cvf-p4-{server,client}--r`,
+no timestamp suffix), which causes `kubectl apply` rc=2 on the loser.
+
+Recommended pattern for runs expected to take more than 30 minutes:
+
+```bash
+# Suspend the cron + reset state in one call
+ansible-playbook playbooks/cvf.yml -e action=reset -e suspend_cronjob=true
+
+# Run the manual orchestrator
+ansible-playbook playbooks/cvf.yml -e action=trigger
+
+# ...wait for completion (check with -e action=status), then re-enable cron
+docker exec server kubectl patch cronjob cluster-validation-cron-job \
+ -n default --type=merge -p '{"spec":{"suspend":false}}'
+```
+
+`fresh-run` does NOT auto-suspend the cron. Combine with
+`-e suspend_cronjob=true` if a long pipeline is expected (or run when
+no cron tick is due within the next ~hour).
## Advanced Usage
@@ -516,7 +604,7 @@ ansible-playbook playbooks/setup-cluster.yml
## Additional Resources
- [Ansible Documentation](https://docs.ansible.com/)
-- [GPU Validation Cluster Main README](../README.md)
+- [GPU Validation Cluster Main README](./README.md)
- [k3s Documentation](https://docs.k3s.io/)
## Support
diff --git a/example/gpu-validation-cluster/ansible/playbooks/add-agent-nodes.yml b/example/gpu-validation-cluster/ansible/playbooks/add-agent-nodes.yml
index 65cb39e90..ef94bfd58 100644
--- a/example/gpu-validation-cluster/ansible/playbooks/add-agent-nodes.yml
+++ b/example/gpu-validation-cluster/ansible/playbooks/add-agent-nodes.yml
@@ -406,7 +406,7 @@
Total cluster nodes: {{ node_count.stdout }}
To verify:
- ansible-playbook playbooks/check-status.yml
+ ansible-playbook playbooks/cvf.yml -e action=status
To check validation status:
./gpu-cluster.sh status
diff --git a/example/gpu-validation-cluster/ansible/playbooks/check-status.yml b/example/gpu-validation-cluster/ansible/playbooks/check-status.yml
deleted file mode 100644
index 9025144c2..000000000
--- a/example/gpu-validation-cluster/ansible/playbooks/check-status.yml
+++ /dev/null
@@ -1,59 +0,0 @@
----
-# Check GPU Validation Cluster status
-# Usage: ansible-playbook playbooks/check-status.yml
-
-- name: Check container status on all nodes
- hosts: cluster_nodes
- gather_facts: no
-
- tasks:
- - name: Check Docker containers
- shell: 'docker ps --filter "name=server" --filter "name=agent" --format "table {{ ''{{'' }}.Names{{ ''}}'' }}\t{{ ''{{'' }}.Status{{ ''}}'' }}\t{{ ''{{'' }}.Ports{{ ''}}'' }}"'
- register: container_status
- changed_when: false
-
- - name: Display container status
- debug:
- msg: |
- {{ inventory_hostname }}:
- {{ container_status.stdout }}
-
-- name: Check Kubernetes cluster status
- hosts: server_nodes
- gather_facts: no
-
- tasks:
- - name: Get cluster nodes
- command: docker exec server kubectl get nodes -o wide
- register: k8s_nodes
- changed_when: false
- ignore_errors: yes
-
- - name: Display Kubernetes nodes
- debug:
- msg: "{{ k8s_nodes.stdout_lines }}"
- when: k8s_nodes.rc == 0
-
- - name: Get cluster pods
- command: docker exec server kubectl get pods --all-namespaces
- register: k8s_pods
- changed_when: false
- ignore_errors: yes
-
- - name: Display Kubernetes pods
- debug:
- msg: "{{ k8s_pods.stdout_lines }}"
- when: k8s_pods.rc == 0
-
- - name: Check cluster validation framework status
- shell: |
- cd {{ script_dir }}
- CONFIG_DIR={{ config_dir }} ./gpu-cluster.sh status
- register: cvf_status
- changed_when: false
- ignore_errors: yes
-
- - name: Display cluster validation framework status
- debug:
- msg: "{{ cvf_status.stdout_lines }}"
- when: cvf_status.rc == 0
diff --git a/example/gpu-validation-cluster/ansible/playbooks/cvf.yml b/example/gpu-validation-cluster/ansible/playbooks/cvf.yml
new file mode 100644
index 000000000..82a35f3ea
--- /dev/null
+++ b/example/gpu-validation-cluster/ansible/playbooks/cvf.yml
@@ -0,0 +1,796 @@
+---
+# Single dispatcher for Cluster Validation Framework day-2 operations.
+# All CVF actions live here behind `-e action=`; lifecycle playbooks
+# (setup-cluster / teardown-cluster / setup-ssh-keys / add-agent-nodes /
+# remove-agent-nodes) remain separate.
+#
+# Actions:
+# reapply -- copy configs/ + gpu-cluster.sh to server-node and
+# run `gpu-cluster.sh reapply-cvf` (kubectl apply
+# YAMLs + kubectl patch CM/CronJob from config.json
+# overrides). config.json itself is NOT applied to
+# k8s; it's the source of patches. Does not propagate
+# global.image-pull-secrets -- use inject-secret.
+#
+# reset -- clear per-phase labels + annotations on every node.
+# Optional:
+# -e suspend_cronjob=true suspend the CronJob
+# -e delete_completed_jobs=true GC completed Jobs
+# -e delete_all_jobs=true nuke ALL CVF Jobs +
+# MPIJobs + orphan pods
+# (used for tangled runs)
+#
+# inject-secret -- create/update docker-registry Secret(s) + strategic-
+# merge-patch cluster-validation-sa.imagePullSecrets.
+# Reads global.image-pull-secrets from CONTROLLER-side
+# config.json; affects only pods created after patch.
+# CLI override (single ad-hoc):
+# -e username=foo -e token=bar [-e registry_url=...]
+# [-e secret_name=...]
+#
+# trigger -- idempotent `kubectl create job --from=cronjob/...`
+# using a fixed name (default cvf-test). Deletes prior
+# Job + orphan pods first.
+# -e job_name= override default cvf-test
+#
+# Cron-vs-manual race: CronJob.concurrencyPolicy=Forbid
+# only blocks cron-vs-cron overlap, NOT manual-vs-cron.
+# If a manual `trigger` run exceeds the */30 schedule,
+# the next cron tick spawns a parallel orchestrator
+# that re-selects the same nodes and hits the
+# Phase 4 immutable-selector race on per-rail Jobs
+# (cvf-p4-{server,client}--r have no
+# timestamp suffix). Recommended pattern when
+# triggering manually for runs >30min:
+# ansible-playbook playbooks/cvf.yml -e action=reset \
+# -e suspend_cronjob=true
+# ansible-playbook playbooks/cvf.yml -e action=trigger
+# # ...wait for completion, then re-enable:
+# docker exec server kubectl patch cronjob \
+# cluster-validation-cron-job -n default \
+# --type=merge -p '{"spec":{"suspend":false}}'
+# fresh-run does NOT auto-suspend the cron -- combine
+# with `-e suspend_cronjob=true` if a long run is
+# expected (or run fresh-run when no cron tick is due).
+#
+# status -- dump CronJob state, recent orchestrator pods, latest
+# orchestrator log tail, in-flight phase Jobs, MPIJob
+# state, per-node phase labels + Phase 1 stage
+# annotations + failure-reason annotations.
+#
+# fresh-run -- composite: reapply -> reset (delete_all_jobs=true) ->
+# inject-secret -> trigger -> status. Replaces the
+# 4-command sequence operators kept typing.
+#
+# Examples:
+# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=reapply
+# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=reset -e delete_all_jobs=true
+# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=inject-secret
+# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=inject-secret -e username=u -e token=t
+# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=trigger -e job_name=my-run
+# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=status
+# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=fresh-run
+
+- name: CVF day-2 operations (server-side)
+ hosts: server_nodes
+ gather_facts: false
+ vars:
+ action: ""
+ container_name: server
+ cvf_namespace: default
+ service_account_name: cluster-validation-sa
+ cronjob_name: cluster-validation-cron-job
+ job_name: cvf-test
+ delete_all_jobs: false
+ delete_completed_jobs: false
+ suspend_cronjob: false
+ # inject-secret CLI override (when both username+token are set,
+ # config.json is bypassed for a single ad-hoc secret).
+ registry_url: ""
+ username: ""
+ token: ""
+ secret_name: ""
+ _valid_actions:
+ - reapply
+ - reset
+ - inject-secret
+ - trigger
+ - status
+ - fresh-run
+ _is_fresh_run: "{{ action == 'fresh-run' }}"
+ tasks:
+ - name: Validate action
+ fail:
+ msg: |
+ Unknown or missing action: '{{ action }}'.
+ Valid: {{ _valid_actions | join(', ') }}.
+ Use -e action=.
+ when: action not in _valid_actions
+
+ # ======================================================================
+ # Action: reapply
+ # ======================================================================
+ - name: reapply > copy configs/ to server node
+ copy:
+ src: "{{ playbook_dir }}/../../configs/"
+ dest: "{{ config_dir }}/"
+ mode: '0644'
+ when: action == 'reapply' or _is_fresh_run
+
+ - name: reapply > copy gpu-cluster.sh to server node
+ copy:
+ src: "{{ playbook_dir }}/../../gpu-cluster.sh"
+ dest: "{{ script_dir }}/gpu-cluster.sh"
+ mode: '0755'
+ when: action == 'reapply' or _is_fresh_run
+
+ - name: reapply > verify server container is running
+ shell: docker ps --format '{% raw %}{{.Names}}{% endraw %}' | grep -qx '{{ container_name }}'
+ changed_when: false
+ when: action == 'reapply' or _is_fresh_run
+
+ - name: reapply > run gpu-cluster.sh reapply-cvf
+ shell: |
+ cd {{ script_dir }}
+ CONFIG_DIR={{ config_dir }} ./gpu-cluster.sh reapply-cvf
+ register: reapply_out
+ changed_when: true
+ when: action == 'reapply' or _is_fresh_run
+
+ - name: reapply > show last 20 lines of reapply output
+ debug:
+ msg: "{{ reapply_out.stdout_lines[-20:] }}"
+ when: (action == 'reapply' or _is_fresh_run) and reapply_out is defined
+
+ - name: reapply > verify post-patch CM keys
+ shell: |
+ docker exec {{ container_name }} kubectl get cm cluster-validation-config -n {{ cvf_namespace }} \
+ -o jsonpath='{.data.SKIP_GPU_HW_ACCEPTANCE}{"|"}{.data.SKIP_GPU_MESH_VALIDATION}{"|"}{.data.SKIP_NIC_VALIDATION}{"|"}{.data.SKIP_RAIL_BANDWIDTH_TEST}{"|"}{.data.SKIP_RCCL_TEST}{"|"}{.data.NODE_VALIDATION_INTERVAL_MINS}{"|"}{.data.PHASE2_JOB_WAIT_TIME}{"|"}{.data.PHASE3_JOB_WAIT_TIME}{"|"}{.data.PHASE4_PAIR_WAIT_TIME}{"|"}{.data.MPIJOB_WAIT_TIME}'
+ register: reapply_cm
+ changed_when: false
+ when: action == 'reapply' or _is_fresh_run
+
+ - debug:
+ msg: "CM keys (SKIP1|SKIP2|SKIP3|SKIP4|SKIP5|INTERVAL|P2|P3|P4|P5): {{ reapply_cm.stdout }}"
+ when: (action == 'reapply' or _is_fresh_run) and reapply_cm is defined
+
+ - name: reapply > verify CronJob schedule
+ shell: |
+ docker exec {{ container_name }} kubectl get cronjob {{ cronjob_name }} -n {{ cvf_namespace }} \
+ -o jsonpath='{.spec.schedule}'
+ register: reapply_cron
+ changed_when: false
+ when: action == 'reapply' or _is_fresh_run
+
+ - debug:
+ msg: "CronJob schedule: {{ reapply_cron.stdout }}"
+ when: (action == 'reapply' or _is_fresh_run) and reapply_cron is defined
+
+ - name: reapply > verify Phase 1 stages JSON (Skip + TimeoutSeconds + Image)
+ shell: |
+ docker exec {{ container_name }} kubectl get cm cluster-validation-config -n {{ cvf_namespace }} \
+ -o jsonpath='{.data.GPU_VALIDATION_STAGES_JSON}' \
+ | jq -c '[.[] | {Name, Skip, TimeoutSeconds, Image}]'
+ register: reapply_stages
+ changed_when: false
+ when: action == 'reapply' or _is_fresh_run
+
+ - debug:
+ msg: "{{ reapply_stages.stdout }}"
+ when: (action == 'reapply' or _is_fresh_run) and reapply_stages is defined
+
+ # ======================================================================
+ # Action: reset
+ # fresh-run forces delete_all_jobs=true so nothing lingers.
+ # ======================================================================
+ - name: reset > set delete_all_jobs=true under fresh-run
+ set_fact:
+ delete_all_jobs: true
+ when: _is_fresh_run
+
+ - name: reset > show node labels (before)
+ shell: |
+ docker exec {{ container_name }} kubectl get nodes \
+ -L amd.com/gpu-hw-acceptance \
+ -L amd.com/gpu-mesh-validation \
+ -L amd.com/nic-health \
+ -L amd.com/rail-bandwidth \
+ -L amd.com/cluster-validation-status
+ register: reset_before
+ changed_when: false
+ when: action == 'reset' or _is_fresh_run
+
+ - debug:
+ msg: "{{ reset_before.stdout_lines }}"
+ when: (action == 'reset' or _is_fresh_run) and reset_before is defined
+
+ - name: reset > remove per-phase labels + candidate label on every node
+ shell: |
+ set -eu
+ for N in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
+ docker exec {{ container_name }} kubectl label node "$N" \
+ amd.com/gpu-hw-acceptance- \
+ amd.com/gpu-mesh-validation- \
+ amd.com/nic-health- \
+ amd.com/rail-bandwidth- \
+ amd.com/cluster-validation-status- \
+ amd.com/cluster-validation-candidate- \
+ --overwrite 2>&1 | sed 's/^/ /' || true
+ done
+ args:
+ executable: /bin/bash
+ register: reset_label_clear
+ changed_when: "'labeled' in (reset_label_clear.stdout | default(''))"
+ when: action == 'reset' or _is_fresh_run
+
+ - name: reset > remove last-run timestamp + per-stage + failure-reason annotations
+ shell: |
+ set -eu
+ for N in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
+ docker exec {{ container_name }} kubectl annotate node "$N" \
+ amd.com/cluster-validation-last-run-timestamp- \
+ --overwrite 2>&1 | sed 's/^/ /' || true
+ KEYS=$(docker exec {{ container_name }} kubectl get node "$N" -o json \
+ | jq -r '(.metadata.annotations // {})
+ | keys[]
+ | select(startswith("amd.com/gpu-hw-acceptance-stage-")
+ or test("amd.com/.+-(failure-reason|failed-subtest|failed-nics)$"))')
+ for K in $KEYS; do
+ docker exec {{ container_name }} kubectl annotate node "$N" "${K}-" \
+ --overwrite 2>&1 | sed 's/^/ /' || true
+ done
+ done
+ args:
+ executable: /bin/bash
+ register: reset_ann_clear
+ changed_when: "'annotated' in (reset_ann_clear.stdout | default(''))"
+ when: action == 'reset' or _is_fresh_run
+
+ - name: reset > suspend CronJob (optional)
+ shell: |
+ docker exec {{ container_name }} kubectl patch cronjob {{ cronjob_name }} \
+ -n {{ cvf_namespace }} --type=merge -p '{"spec":{"suspend":true}}'
+ args:
+ executable: /bin/bash
+ register: reset_cron_suspend
+ changed_when: "'patched' in (reset_cron_suspend.stdout | default(''))"
+ when: (action == 'reset' or _is_fresh_run) and (suspend_cronjob | bool)
+
+ - name: reset > delete completed Jobs (optional, mutually exclusive with delete_all_jobs)
+ shell: |
+ set -eu
+ docker exec {{ container_name }} kubectl get jobs -n {{ cvf_namespace }} -o json \
+ | jq -r '.items
+ | map(select(.status.active // 0 == 0))
+ | map(select(.status.succeeded // 0 > 0 or .status.failed // 0 > 0))
+ | .[].metadata.name' \
+ | while read -r J; do
+ docker exec {{ container_name }} kubectl delete job "$J" -n {{ cvf_namespace }} 2>&1 | sed 's/^/ /' || true
+ done
+ args:
+ executable: /bin/bash
+ register: reset_job_completed
+ changed_when: "'deleted' in (reset_job_completed.stdout | default(''))"
+ when:
+ - action == 'reset' or _is_fresh_run
+ - delete_completed_jobs | bool
+ - not (delete_all_jobs | bool)
+
+ - name: reset > delete ALL CVF Jobs + MPIJob CRDs + orphan pods (optional)
+ shell: |
+ set -eu
+ # 1. Phase 5 MPIJob CRDs (cascade-deletes launcher Job + worker pods).
+ docker exec {{ container_name }} kubectl get mpijobs.kubeflow.org -n {{ cvf_namespace }} -o json 2>/dev/null \
+ | jq -r '.items
+ | map(select(.metadata.name | startswith("cluster-validation-mpi-job-")
+ or startswith("cvf-")))
+ | .[].metadata.name' \
+ | while read -r M; do
+ docker exec {{ container_name }} kubectl delete mpijob.kubeflow.org "$M" -n {{ cvf_namespace }} --wait=true 2>&1 | sed 's/^/ /' || true
+ done
+
+ # 2. batch/v1 Jobs. wait=true is intentional - next trigger
+ # reuses these names and kubectl apply rejects rc=2 if the
+ # same-name Job is still in-flight selector immutable.
+ docker exec {{ container_name }} kubectl get jobs -n {{ cvf_namespace }} -o json \
+ | jq -r '.items
+ | map(select((.metadata.name | startswith("cluster-validation-cron-job-"))
+ or (.metadata.name | startswith("cluster-validation-mpi-job-"))
+ or (.metadata.name | startswith("cvf-e2e"))
+ or (.metadata.name | startswith("cvf-test"))
+ or (.metadata.name | startswith("cvf-tr"))
+ or (.metadata.name | startswith("cvf-phase"))
+ or (.metadata.name | startswith("cvf-p4"))
+ or (.metadata.name | startswith("ib-write-bw"))))
+ | .[].metadata.name' \
+ | while read -r J; do
+ docker exec {{ container_name }} kubectl delete job "$J" -n {{ cvf_namespace }} --wait=true 2>&1 | sed 's/^/ /' || true
+ done
+
+ # 3. Orphan pods that survived Job/MPIJob deletion.
+ docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} -o json \
+ | jq -r '.items
+ | map(select((.metadata.name | startswith("cluster-validation-cron-job-"))
+ or (.metadata.name | startswith("cluster-validation-mpi-job-"))
+ or (.metadata.name | startswith("cvf-e2e"))
+ or (.metadata.name | startswith("cvf-test"))
+ or (.metadata.name | startswith("cvf-tr"))
+ or (.metadata.name | startswith("cvf-phase"))
+ or (.metadata.name | startswith("cvf-p4"))
+ or (.metadata.name | startswith("ib-write-bw"))))
+ | .[].metadata.name' \
+ | while read -r P; do
+ docker exec {{ container_name }} kubectl delete pod "$P" -n {{ cvf_namespace }} --force --grace-period=0 2>&1 | sed 's/^/ /' || true
+ done
+ args:
+ executable: /bin/bash
+ register: reset_all_jobs
+ changed_when: "'deleted' in (reset_all_jobs.stdout | default(''))"
+ when: (action == 'reset' or _is_fresh_run) and (delete_all_jobs | bool)
+
+ - name: reset > show node labels (after)
+ shell: |
+ docker exec {{ container_name }} kubectl get nodes \
+ -L amd.com/gpu-hw-acceptance \
+ -L amd.com/gpu-mesh-validation \
+ -L amd.com/nic-health \
+ -L amd.com/rail-bandwidth \
+ -L amd.com/cluster-validation-status
+ register: reset_after
+ changed_when: false
+ when: action == 'reset' or _is_fresh_run
+
+ - debug:
+ msg: "{{ reset_after.stdout_lines }}"
+ when: (action == 'reset' or _is_fresh_run) and reset_after is defined
+
+ # ======================================================================
+ # Action: inject-secret
+ # ======================================================================
+ - name: inject > verify ServiceAccount exists
+ shell: |
+ docker exec {{ container_name }} kubectl get sa {{ service_account_name }} \
+ -n {{ cvf_namespace }} >/dev/null
+ changed_when: false
+ when: action == 'inject-secret' or _is_fresh_run
+
+ - name: inject > build credentials list (CLI override OR controller-side config.json)
+ shell: |
+ set -eu
+ if [ -n "{{ username }}" ] && [ -n "{{ token }}" ]; then
+ REG='{{ registry_url | default("docker.io", true) }}'
+ USR='{{ username }}'
+ TOK='{{ token }}'
+ NAME='{{ secret_name }}'
+ [ -z "$NAME" ] && NAME="$(echo "$USR" | tr -c 'a-z0-9-' '-' | sed 's/-*$//')-pull-secret"
+ jq -nc --arg r "$REG" --arg u "$USR" --arg t "$TOK" --arg n "$NAME" \
+ '[{registry: $r, username: $u, token: $t, name: $n}]'
+ else
+ jq -c '[(.global["image-pull-secrets"] // [])[]
+ | { registry: (.["registry-url"] // "docker.io"),
+ username: .username,
+ token: .token,
+ name: ((.username | gsub("[^a-z0-9-]"; "-") | sub("-+$"; "")) + "-pull-secret")
+ }]' '{{ playbook_dir }}/../../configs/config.json'
+ fi
+ args:
+ executable: /bin/bash
+ delegate_to: localhost
+ run_once: true
+ register: inject_creds_json
+ changed_when: false
+ when: action == 'inject-secret' or _is_fresh_run
+
+ - name: inject > parse credentials list
+ set_fact:
+ inject_creds_list: "{{ inject_creds_json.stdout | from_json }}"
+ when: (action == 'inject-secret' or _is_fresh_run) and inject_creds_json is defined
+
+ - name: inject > show credentials (token redacted)
+ debug:
+ msg: "{{ inject_creds_list | map('combine', {'token': ''}) | list }}"
+ when: (action == 'inject-secret' or _is_fresh_run) and inject_creds_list is defined
+
+ - name: inject > skip when no credentials configured (fresh-run treats as no-op)
+ debug:
+ msg: "No image-pull credentials configured -- skipping (no-op)."
+ when:
+ - action == 'inject-secret' or _is_fresh_run
+ - inject_creds_list is defined
+ - (inject_creds_list | length) == 0
+
+ - name: inject > fail when explicitly asked for inject-secret but no creds
+ fail:
+ msg: |
+ No image-pull credentials found.
+ Either add entries to configs/config.json -> global.image-pull-secrets,
+ or pass -e username=... -e token=... [-e registry_url=...]
+ when:
+ - action == 'inject-secret' # explicit, not fresh-run
+ - inject_creds_list is defined
+ - (inject_creds_list | length) == 0
+
+ - name: inject > create/update each docker-registry Secret (idempotent)
+ shell: |
+ set -eu
+ SRV_FLAG=""
+ if [ "{{ item.registry }}" != "docker.io" ] && [ "{{ item.registry }}" != "index.docker.io" ]; then
+ SRV_FLAG="--docker-server={{ item.registry }}"
+ fi
+ docker exec {{ container_name }} kubectl create secret docker-registry {{ item.name }} \
+ $SRV_FLAG \
+ --docker-username='{{ item.username }}' --docker-password='{{ item.token }}' \
+ -n {{ cvf_namespace }} \
+ --dry-run=client -o yaml \
+ | docker exec -i {{ container_name }} kubectl apply -f -
+ args:
+ executable: /bin/bash
+ loop: "{{ inject_creds_list | default([]) }}"
+ loop_control:
+ label: "{{ item.name }} ({{ item.registry }}/{{ item.username }})"
+ register: inject_secret_apply
+ changed_when: "'created' in (inject_secret_apply.stdout | default('')) or 'configured' in (inject_secret_apply.stdout | default(''))"
+ no_log: true
+ when:
+ - action == 'inject-secret' or _is_fresh_run
+ - inject_creds_list is defined
+ - (inject_creds_list | length) > 0
+
+ - name: inject > strategic-merge-patch SA (additive on imagePullSecrets, idempotent)
+ shell: |
+ docker exec {{ container_name }} kubectl patch sa {{ service_account_name }} \
+ -n {{ cvf_namespace }} --type=strategic \
+ -p '{"imagePullSecrets":[{"name":"{{ item.name }}"}]}'
+ args:
+ executable: /bin/bash
+ loop: "{{ inject_creds_list | default([]) }}"
+ loop_control:
+ label: "{{ item.name }}"
+ register: inject_sa_patch
+ changed_when: "'patched' in (inject_sa_patch.stdout | default('')) and 'no change' not in (inject_sa_patch.stdout | default(''))"
+ when:
+ - action == 'inject-secret' or _is_fresh_run
+ - inject_creds_list is defined
+ - (inject_creds_list | length) > 0
+
+ - name: inject > show SA imagePullSecrets after patch
+ shell: |
+ docker exec {{ container_name }} kubectl get sa {{ service_account_name }} \
+ -n {{ cvf_namespace }} -o jsonpath='{.imagePullSecrets}'
+ register: inject_sa_state
+ changed_when: false
+ when:
+ - action == 'inject-secret' or _is_fresh_run
+ - inject_creds_list is defined
+ - (inject_creds_list | length) > 0
+
+ - debug:
+ msg: "{{ service_account_name }} imagePullSecrets: {{ inject_sa_state.stdout }}"
+ when:
+ - action == 'inject-secret' or _is_fresh_run
+ - inject_sa_state is defined and inject_sa_state.stdout is defined
+
+ # ======================================================================
+ # Action: trigger
+ #
+ # Reminder: this creates a one-shot Job ('cvf-test' by default) that
+ # runs INDEPENDENTLY of the */30 CronJob. If the manual run takes
+ # >30min, the next cron tick will spawn a SECOND orchestrator and
+ # race for the same nodes + Phase 4 per-rail Job names (which lack
+ # a timestamp suffix -> kubectl apply rc=2 on the loser). To avoid:
+ # ansible-playbook playbooks/cvf.yml -e action=reset -e suspend_cronjob=true
+ # before triggering, then re-enable cron after the manual run:
+ # docker exec server kubectl patch cronjob cluster-validation-cron-job \
+ # -n default --type=merge -p '{"spec":{"suspend":false}}'
+ # ======================================================================
+ - name: trigger > verify CronJob template exists
+ shell: |
+ docker exec {{ container_name }} kubectl get cronjob {{ cronjob_name }} \
+ -n {{ cvf_namespace }} >/dev/null
+ changed_when: false
+ when: action == 'trigger' or _is_fresh_run
+
+ - name: trigger > delete prior Job with the same name (if any)
+ shell: |
+ docker exec {{ container_name }} kubectl delete job {{ job_name }} -n {{ cvf_namespace }} \
+ --ignore-not-found=true --wait=true 2>&1
+ args:
+ executable: /bin/bash
+ register: trigger_prior
+ changed_when: "'deleted' in (trigger_prior.stdout | default(''))"
+ when: action == 'trigger' or _is_fresh_run
+
+ - name: trigger > force-clean orphan pods from the prior Job
+ shell: |
+ docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} \
+ -l job-name={{ job_name }} -o name 2>/dev/null \
+ | xargs -r -n1 docker exec {{ container_name }} kubectl delete -n {{ cvf_namespace }} \
+ --force --grace-period=0 2>&1 || true
+ args:
+ executable: /bin/bash
+ changed_when: false
+ when: action == 'trigger' or _is_fresh_run
+
+ - name: trigger > create Job from CronJob template
+ shell: |
+ docker exec {{ container_name }} kubectl create job --from=cronjob/{{ cronjob_name }} \
+ {{ job_name }} -n {{ cvf_namespace }}
+ args:
+ executable: /bin/bash
+ register: trigger_create
+ changed_when: "'created' in (trigger_create.stdout | default(''))"
+ when: action == 'trigger' or _is_fresh_run
+
+ - name: trigger > wait for orchestrator pod to be scheduled (up to 30s)
+ shell: |
+ for i in $(seq 1 15); do
+ POD=$(docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} \
+ -l job-name={{ job_name }} -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
+ if [ -n "$POD" ]; then
+ echo "$POD"
+ exit 0
+ fi
+ sleep 2
+ done
+ echo "ERROR: orchestrator pod did not appear within 30s" >&2
+ exit 1
+ args:
+ executable: /bin/bash
+ register: trigger_pod
+ changed_when: false
+ when: action == 'trigger' or _is_fresh_run
+
+ - name: trigger > show orchestrator pod state
+ shell: |
+ docker exec {{ container_name }} kubectl get pod {{ trigger_pod.stdout }} -n {{ cvf_namespace }} -o wide
+ register: trigger_state
+ changed_when: false
+ when: (action == 'trigger' or _is_fresh_run) and trigger_pod is defined
+
+ - debug:
+ msg:
+ - "Triggered: job/{{ job_name }} (orchestrator pod: {{ trigger_pod.stdout }})"
+ - "{{ trigger_state.stdout_lines }}"
+ when: (action == 'trigger' or _is_fresh_run) and trigger_state is defined
+
+ # ======================================================================
+ # Action: status
+ # ======================================================================
+ - name: status > nodes (wide)
+ shell: docker exec {{ container_name }} kubectl get nodes -o wide
+ register: status_nodes
+ changed_when: false
+ ignore_errors: yes
+ when: action == 'status' or _is_fresh_run
+
+ - debug:
+ msg: "{{ status_nodes.stdout_lines }}"
+ when: (action == 'status' or _is_fresh_run) and status_nodes is defined and status_nodes.rc == 0
+
+ - name: status > CronJob
+ shell: docker exec {{ container_name }} kubectl get cronjob -n {{ cvf_namespace }} -o wide
+ register: status_cron
+ changed_when: false
+ ignore_errors: yes
+ when: action == 'status' or _is_fresh_run
+
+ - debug:
+ msg: "{{ status_cron.stdout_lines }}"
+ when: (action == 'status' or _is_fresh_run) and status_cron is defined and status_cron.rc == 0
+
+ - name: status > recent orchestrator pods (last 5)
+ shell: |
+ docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} -o json 2>/dev/null \
+ | jq -r '.items
+ | map(select(.metadata.name | startswith("cluster-validation-cron-job-") or startswith("cvf-e2e-") or startswith("cvf-test")))
+ | sort_by(.metadata.creationTimestamp)
+ | reverse | .[0:5]
+ | .[] | "\(.metadata.name)\t\(.status.phase)\t\(.spec.nodeName // "?")\t\(.metadata.creationTimestamp)"'
+ args:
+ executable: /bin/bash
+ register: status_orch
+ changed_when: false
+ ignore_errors: yes
+ when: action == 'status' or _is_fresh_run
+
+ - debug:
+ msg: "Recent orchestrator pods (NAME | PHASE | NODE | CREATED):\n{{ status_orch.stdout }}"
+ when: (action == 'status' or _is_fresh_run) and status_orch is defined
+
+ - name: status > tail logs of the most recent orchestrator pod
+ shell: |
+ POD=$(docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} -o json 2>/dev/null \
+ | jq -r '.items
+ | map(select(.metadata.name | startswith("cluster-validation-cron-job-") or startswith("cvf-e2e-") or startswith("cvf-test")))
+ | sort_by(.metadata.creationTimestamp)
+ | last.metadata.name // ""')
+ [ -z "$POD" ] && { echo "no orchestrator pods yet"; exit 0; }
+ echo "===== Orchestrator pod: $POD ====="
+ docker exec {{ container_name }} kubectl logs "$POD" -n {{ cvf_namespace }} 2>&1 | tail -n 80
+ args:
+ executable: /bin/bash
+ register: status_log
+ changed_when: false
+ ignore_errors: yes
+ when: action == 'status' or _is_fresh_run
+
+ - debug:
+ msg: "{{ status_log.stdout_lines }}"
+ when: (action == 'status' or _is_fresh_run) and status_log is defined
+
+ - name: status > in-flight phase Jobs
+ shell: |
+ docker exec {{ container_name }} kubectl get jobs -n {{ cvf_namespace }} -o json 2>/dev/null \
+ | jq -r '.items
+ | map(select((.metadata.name | startswith("cluster-validation-cron-job-")) | not))
+ | map("\(.metadata.name)\tactive=\(.status.active // 0)\tsucc=\(.status.succeeded // 0)\tfail=\(.status.failed // 0)")
+ | .[]?' || true
+ args:
+ executable: /bin/bash
+ register: status_jobs
+ changed_when: false
+ ignore_errors: yes
+ when: action == 'status' or _is_fresh_run
+
+ - debug:
+ msg: "Per-phase Jobs:\n{{ status_jobs.stdout if status_jobs.stdout else ' (none)' }}"
+ when: (action == 'status' or _is_fresh_run) and status_jobs is defined
+
+ - name: status > MPIJob (Phase 5)
+ shell: |
+ docker exec {{ container_name }} kubectl get mpijob -n {{ cvf_namespace }} \
+ -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[-1].type}{"\t"}{.status.conditions[-1].reason}{"\n"}{end}' 2>/dev/null \
+ || echo "(no mpijob)"
+ args:
+ executable: /bin/bash
+ register: status_mpi
+ changed_when: false
+ ignore_errors: yes
+ when: action == 'status' or _is_fresh_run
+
+ - debug:
+ msg: "MPIJob:\n{{ status_mpi.stdout }}"
+ when: (action == 'status' or _is_fresh_run) and status_mpi is defined
+
+ - name: status > per-node phase labels + Phase 1 stage annotations + failure reasons
+ shell: |
+ set -eu
+ echo "================================================================"
+ echo " Per-node phase status"
+ echo " P1=amd.com/gpu-hw-acceptance P2=amd.com/gpu-mesh-validation"
+ echo " P3=amd.com/nic-health P4=amd.com/rail-bandwidth"
+ echo " P5/AGG=amd.com/cluster-validation-status (Phase 5 = aggregate label)"
+ echo "================================================================"
+ printf " %-72s | %-9s | %-9s | %-9s | %-9s | %-9s | %s\n" \
+ NODE P1 P2 P3 P4 P5/AGG LAST_RUN
+ printf " %s\n" "$(printf -- '-%.0s' {1..165})"
+ for NODE in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
+ NJ=$(docker exec {{ container_name }} kubectl get node "$NODE" -o json)
+ P1=$(jq -r '.metadata.labels."amd.com/gpu-hw-acceptance" // "—"' <<<"$NJ")
+ P2=$(jq -r '.metadata.labels."amd.com/gpu-mesh-validation" // "—"' <<<"$NJ")
+ P3=$(jq -r '.metadata.labels."amd.com/nic-health" // "—"' <<<"$NJ")
+ P4=$(jq -r '.metadata.labels."amd.com/rail-bandwidth" // "—"' <<<"$NJ")
+ P5=$(jq -r '.metadata.labels."amd.com/cluster-validation-status" // "—"' <<<"$NJ")
+ TS=$(jq -r '.metadata.annotations."amd.com/cluster-validation-last-run-timestamp" // "never"' <<<"$NJ")
+ printf " %-72s | %-9s | %-9s | %-9s | %-9s | %-9s | %s\n" \
+ "$NODE" "$P1" "$P2" "$P3" "$P4" "$P5" "$TS"
+ done
+ echo ""
+ echo "================================================================"
+ echo " Per-node Phase 1 stage annotations"
+ echo "================================================================"
+ for NODE in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
+ echo " $NODE:"
+ docker exec {{ container_name }} kubectl get node "$NODE" -o json \
+ | jq -r '(.metadata.annotations // {})
+ | to_entries
+ | map(select(.key | startswith("amd.com/gpu-hw-acceptance-stage-")))
+ | sort_by(.key)
+ | if length == 0 then " (no stage annotations yet)"
+ else (map(" " + .key + " = " + .value) | join("\n")) end'
+ done
+ echo ""
+ echo "================================================================"
+ echo " Per-node failure-reason / failed-subtest / failed-nics annotations"
+ echo "================================================================"
+ any_fail=0
+ for NODE in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
+ OUT=$(docker exec {{ container_name }} kubectl get node "$NODE" -o json \
+ | jq -r '(.metadata.annotations // {})
+ | to_entries
+ | map(select(.key | test("-(failure-reason|failed-subtest|failed-nics)$")))
+ | map(" " + .key + " = " + .value)
+ | .[]?')
+ if [ -n "$OUT" ]; then
+ echo " $NODE:"; echo "$OUT"; any_fail=1
+ fi
+ done
+ if [ $any_fail -eq 0 ]; then
+ echo " (none — all phases passing or pending)"
+ fi
+ args:
+ executable: /bin/bash
+ register: status_per_node
+ changed_when: false
+ failed_when: false
+ when: action == 'status' or _is_fresh_run
+
+ - debug:
+ msg: "{{ status_per_node.stdout_lines }}"
+ when: (action == 'status' or _is_fresh_run) and status_per_node is defined
+
+# ----------------------------------------------------------------------------
+# Second play: container ping across all nodes (status / fresh-run only).
+# Lightweight, lets operators see whether server + agent docker containers
+# are alive without having to run a separate command.
+# ----------------------------------------------------------------------------
+- name: CVF day-2 operations (per-node container check)
+ hosts: cluster_nodes
+ gather_facts: false
+ vars:
+ action: ""
+ _is_fresh_run: "{{ action == 'fresh-run' }}"
+ tasks:
+ - name: status > docker containers on each node
+ shell: docker ps --filter 'name=server' --filter 'name=agent' --format 'table {% raw %}{{.Names}}\t{{.Status}}{% endraw %}'
+ register: status_containers
+ changed_when: false
+ when: action == 'status' or _is_fresh_run
+
+ - debug:
+ msg: "{{ inventory_hostname }}: {{ status_containers.stdout_lines }}"
+ when: (action == 'status' or _is_fresh_run) and status_containers is defined
+
+ # ---- Per-node log file inventory ----------------------------------
+ # Every phase Job/MPIJob writes its full stdout/stderr to a host
+ # hostPath at /var/log/cluster-validation/__.log
+ # (see cluster-validation-job.yaml). Pods get GC'd after
+ # ttlSecondsAfterFinished, but the files persist. Show the LATEST
+ # file per phase per node so operators can `tail`/`grep` directly
+ # without hunting through `kubectl get pods`.
+ - name: status > latest log file per phase on each node
+ shell: |
+ set -eu
+ LOG_DIR=/var/log/cluster-validation
+ if [ ! -d "$LOG_DIR" ]; then
+ echo " no-log-dir : node has not run any phase Job yet"
+ exit 0
+ fi
+ # Parallel arrays: label[i] paired with glob[i]. Order matches
+ # the pipeline so output reads top-to-bottom.
+ labels="phase1 phase2 phase3 phase4_server phase4_client phase4_5 phase5 orchestrator"
+ i=0
+ for label in $labels; do
+ case "$label" in
+ phase1) glob='*_phase1_*.log' ;;
+ phase2) glob='*_phase2_*.log' ;;
+ phase3) glob='*_phase3_*.log' ;;
+ phase4_server) glob='*_phase4-server_*.log' ;;
+ phase4_client) glob='*_phase4-client_*.log' ;;
+ phase4_5) glob='*_phase4_5_*.log' ;;
+ phase5) glob='*_phase5_*.log' ;;
+ orchestrator) glob='cronjob-*.log' ;;
+ esac
+ # ls -t over a glob; suppress "no match" stderr; head -1 = newest
+ latest=$(cd "$LOG_DIR" && ls -t $glob 2>/dev/null | head -1 || true)
+ if [ -z "$latest" ]; then
+ printf " %-18s : none\n" "$label"
+ else
+ sz=$(stat -c%s "$LOG_DIR/$latest" 2>/dev/null || echo 0)
+ printf " %-18s : %s/%s bytes=%d\n" "$label" "$LOG_DIR" "$latest" "$sz"
+ fi
+ done
+ args:
+ executable: /bin/bash
+ register: status_logs
+ changed_when: false
+ failed_when: false
+ when: action == 'status' or _is_fresh_run
+
+ - debug:
+ msg:
+ - "================================================================"
+ - " Log files on {{ inventory_hostname }} (latest per phase)"
+ - "================================================================"
+ - "{{ status_logs.stdout_lines }}"
+ when: (action == 'status' or _is_fresh_run) and status_logs is defined
diff --git a/example/gpu-validation-cluster/ansible/playbooks/remove-agent-nodes.yml b/example/gpu-validation-cluster/ansible/playbooks/remove-agent-nodes.yml
index 6d9e71e26..0ecd209e3 100644
--- a/example/gpu-validation-cluster/ansible/playbooks/remove-agent-nodes.yml
+++ b/example/gpu-validation-cluster/ansible/playbooks/remove-agent-nodes.yml
@@ -221,7 +221,7 @@
- Cluster state cleaned up
To verify:
- ansible-playbook playbooks/check-status.yml
+ ansible-playbook playbooks/cvf.yml -e action=status
To re-add nodes:
ansible-playbook playbooks/add-agent-nodes.yml --limit node-name
diff --git a/example/gpu-validation-cluster/ansible/playbooks/set-gpu-compute-mode.yml b/example/gpu-validation-cluster/ansible/playbooks/set-gpu-compute-mode.yml
new file mode 100644
index 000000000..f52ff2eae
--- /dev/null
+++ b/example/gpu-validation-cluster/ansible/playbooks/set-gpu-compute-mode.yml
@@ -0,0 +1,118 @@
+---
+# Switch the GPU compute-partition mode on target hosts (default: agent_nodes)
+# and reboot to make the change stick.
+#
+# Usage:
+# ansible-playbook -i inventory.test.yml playbooks/set-gpu-compute-mode.yml \
+# -e partition=SPX
+# ansible-playbook -i inventory.test.yml playbooks/set-gpu-compute-mode.yml \
+# -e partition=SPX -e target=agent-node-1
+#
+# Vars (all optional):
+# partition -- SPX (default) | CPX | TPX | QPX | DPX
+# target -- inventory pattern (default: agent_nodes)
+# skip_reboot -- true to set the partition but skip the reboot
+# (only useful if the firmware applies it live; on
+# MI300/MI308 a reboot is normally required to reclaim BARs)
+#
+# After this completes, re-run setup-cluster.yml to rejoin the agent
+# container to k3s -- the existing tasks in that playbook are idempotent
+# and will recreate /var/lib/gpu-validation-cluster state cleanly.
+
+- name: Drain CVF agent container before partition change
+ hosts: "{{ target | default('agent_nodes') }}"
+ gather_facts: no
+ become: yes
+ tasks:
+ - name: Stop agent container if present
+ command: docker rm -f agent
+ changed_when: false
+ failed_when: false
+
+- name: Set GPU compute partition mode
+ hosts: "{{ target | default('agent_nodes') }}"
+ gather_facts: yes
+ become: yes
+ vars:
+ partition: SPX
+ skip_reboot: false
+ tasks:
+ - name: Verify rocm-smi is installed
+ command: which rocm-smi
+ register: rocm_smi_check
+ changed_when: false
+
+ - name: Show current compute partition (before)
+ command: rocm-smi --showcomputepartition
+ register: partition_before
+ changed_when: false
+
+ - debug:
+ msg: "{{ partition_before.stdout_lines }}"
+
+ - name: Set compute partition to {{ partition }}
+ command: rocm-smi --setcomputepartition {{ partition }}
+ register: partition_set
+ # rocm-smi exits 0 even when the change is queued for next boot.
+ # Inspect stdout if you need to distinguish live vs deferred.
+ changed_when: true
+
+ - debug:
+ msg: "{{ partition_set.stdout_lines }}"
+
+ - name: Show current compute partition (after, pre-reboot)
+ command: rocm-smi --showcomputepartition
+ register: partition_after_pre_reboot
+ changed_when: false
+
+ - debug:
+ msg: "{{ partition_after_pre_reboot.stdout_lines }}"
+
+ # MI300/MI308 generally need a reboot to free PCIe BAR space allocated
+ # for the now-removed CPX shadow PFs. Without the reboot the amdgpu
+ # driver may keep the previous partition layout in /sys.
+ - name: Reboot node to apply partition change
+ reboot:
+ msg: "Reboot for GPU compute-partition change to {{ partition }}"
+ connect_timeout: 30
+ reboot_timeout: 900 # up to 15 min for hardware POST + boot
+ pre_reboot_delay: 5
+ post_reboot_delay: 30
+ test_command: whoami
+ when: not (skip_reboot | bool)
+
+ - name: Wait for docker daemon to come back after reboot
+ shell: |
+ for i in $(seq 1 30); do
+ if docker info >/dev/null 2>&1; then exit 0; fi
+ sleep 5
+ done
+ exit 1
+ args:
+ executable: /bin/bash
+ changed_when: false
+ when: not (skip_reboot | bool)
+
+ - name: Show current compute partition (after reboot)
+ command: rocm-smi --showcomputepartition
+ register: partition_after_reboot
+ changed_when: false
+ when: not (skip_reboot | bool)
+
+ - debug:
+ msg: "{{ partition_after_reboot.stdout_lines }}"
+ when: not (skip_reboot | bool)
+
+- name: Summary
+ hosts: localhost
+ connection: local
+ gather_facts: no
+ tasks:
+ - debug:
+ msg: |
+ ============================================================
+ GPU partition change complete.
+ Next:
+ ansible-playbook -i inventory.test.yml playbooks/setup-cluster.yml
+ ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=fresh-run -e suspend_cronjob=true
+ ============================================================
diff --git a/example/gpu-validation-cluster/ansible/playbooks/setup-cluster.yml b/example/gpu-validation-cluster/ansible/playbooks/setup-cluster.yml
index 2aab19907..7075cc149 100644
--- a/example/gpu-validation-cluster/ansible/playbooks/setup-cluster.yml
+++ b/example/gpu-validation-cluster/ansible/playbooks/setup-cluster.yml
@@ -56,18 +56,51 @@
become: yes
tasks:
- - name: Update apt cache
- apt:
- update_cache: yes
- cache_valid_time: 3600
- when: ansible_os_family == "Debian"
-
+ # Probe docker BEFORE the first apt update. If docker is already
+ # installed and working we leave the host's apt sources untouched;
+ # only when we are going to (re)install docker do we normalize its apt
+ # source below. This keeps the playbook from disturbing a healthy host.
- name: Check if Docker is installed
command: docker --version
register: docker_check
ignore_errors: yes
changed_when: false
+ # A docker apt source left by prior provisioning -- often BOTH a
+ # hand-written docker.list AND an Ansible-generated
+ # download_docker_com_linux_ubuntu.list -- can declare the same repo
+ # with a `signed-by` that conflicts (or is empty), so `apt update`
+ # aborts with "Conflicting values set for option Signed-By ...
+ # download.docker.com" and the whole play dies before docker installs.
+ # When we are about to install docker, unconditionally remove ALL
+ # pre-existing docker source files first; we recreate one canonical
+ # entry (correct signed-by) a few tasks below. Done BEFORE the first
+ # apt cache update so that update can't trip over the conflict.
+ - name: Remove pre-existing docker apt source files (before cache update)
+ shell: |
+ set -o pipefail
+ matches=$(ls -1 /etc/apt/sources.list.d/*docker*.list \
+ /etc/apt/sources.list.d/*docker*.sources \
+ /etc/apt/sources.list.d/download_docker_com*.list \
+ 2>/dev/null || true)
+ if [ -n "$matches" ]; then
+ echo "$matches" | xargs rm -f
+ echo "REMOVED"
+ fi
+ args:
+ executable: /bin/bash
+ register: docker_apt_source_cleanup
+ changed_when: "'REMOVED' in (docker_apt_source_cleanup.stdout | default(''))"
+ when:
+ - ansible_os_family == "Debian"
+ - docker_check.rc != 0
+
+ - name: Update apt cache
+ apt:
+ update_cache: yes
+ cache_valid_time: 3600
+ when: ansible_os_family == "Debian"
+
- name: Install Docker prerequisites
apt:
name:
@@ -80,18 +113,34 @@
- ansible_os_family == "Debian"
- docker_check.rc != 0
- - name: Add Docker GPG key
- apt_key:
+ - name: Ensure apt keyrings directory exists
+ file:
+ path: /etc/apt/keyrings
+ state: directory
+ mode: '0755'
+ when:
+ - ansible_os_family == "Debian"
+ - docker_check.rc != 0
+
+ # Use the modern keyring path (NOT the deprecated apt_key, which writes
+ # to the global trusted.gpg). force:yes refreshes the key so a stale
+ # one left by a prior provisioning is replaced with a known-good copy.
+ - name: Add Docker GPG key to /etc/apt/keyrings/docker.asc
+ get_url:
url: https://download.docker.com/linux/ubuntu/gpg
- state: present
+ dest: /etc/apt/keyrings/docker.asc
+ mode: '0644'
+ force: yes
when:
- ansible_os_family == "Debian"
- docker_check.rc != 0
- - name: Add Docker repository
+ - name: Add Docker repository (canonical entry with explicit signed-by)
apt_repository:
- repo: "deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
+ repo: "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
+ filename: docker
state: present
+ update_cache: yes
when:
- ansible_os_family == "Debian"
- docker_check.rc != 0
@@ -108,12 +157,67 @@
- ansible_os_family == "Debian"
- docker_check.rc != 0
- - name: Start and enable Docker service
+ # Some GPU hosts ship a NATIVE k3s install (k3s.service / k3s-agent.service
+ # under /etc/systemd/system) from a prior bringup. That k3s binds host
+ # ports 6443/6444 (we run with --net=host inside the container), so our
+ # containerized k3s server fails with "bind: address already in use".
+ # Stop + disable any such native units before starting our stack.
+ - name: Find native k3s systemd units (if any)
+ # `systemctl list-unit-files 'k3s*.service'` exits rc=1 when no units
+ # match (the common case on a fresh host). Combined with set -o
+ # pipefail that fails the pipe, so we explicitly mask it with
+ # failed_when: false -- empty stdout is the legitimate "no native
+ # k3s installed" signal, which the next task handles via `when:`.
+ shell: |
+ set -o pipefail
+ systemctl list-unit-files --no-legend 'k3s*.service' 2>/dev/null \
+ | awk '{print $1}' | tr '\n' ' '
+ args:
+ executable: /bin/bash
+ register: native_k3s_units
+ changed_when: false
+ failed_when: false
+
+ - name: Stop + disable native k3s units (if any are present)
+ shell: |
+ set -eu
+ for U in {{ native_k3s_units.stdout }}; do
+ systemctl stop "$U" 2>/dev/null || true
+ systemctl disable "$U" 2>/dev/null || true
+ done
+ # Belt-and-suspenders: in case k3s was started manually
+ if command -v /usr/local/bin/k3s-killall.sh >/dev/null 2>&1; then
+ /usr/local/bin/k3s-killall.sh 2>/dev/null || true
+ fi
+ args:
+ executable: /bin/bash
+ register: native_k3s_stop
+ changed_when: native_k3s_units.stdout | trim != ''
+ when: native_k3s_units.stdout | trim != ''
+
+ # Some GPU hosts get the `docker` group from LDAP/SSSD only and not
+ # /etc/group. systemd's docker.socket unit has `SocketGroup=docker`
+ # and resolves it before SSSD is fully up at boot, so the socket fails
+ # with exit code 216/GROUP and the service never starts. Ensuring a
+ # local /etc/group entry makes resolution work via files-NSS at boot.
+ - name: Ensure local /etc/group has a 'docker' entry (idempotent)
+ group:
+ name: docker
+ state: present
+ when: ansible_os_family == "Debian"
+
+ # Always (re)start docker — previously this only ran when docker was
+ # newly installed (`when: docker_check.rc != 0`), so hosts where the
+ # daemon had failed at boot stayed broken across re-runs.
+ - name: Reset any failed docker units (idempotent, no-op when clean)
+ shell: systemctl reset-failed docker.socket docker.service 2>/dev/null || true
+ changed_when: false
+
+ - name: Start and enable Docker service (always)
systemd:
name: docker
state: started
enabled: yes
- when: docker_check.rc != 0
- name: Add user to docker group
user:
@@ -234,11 +338,42 @@
hosts: server_nodes
gather_facts: yes
become: yes
+ vars:
+ # Pass `-e force_restart=true` to force-recreate the container even
+ # when it already looks healthy (e.g. after editing gpu-cluster.sh
+ # or config.json in a way that requires a fresh start).
+ force_restart: false
tasks:
+ # Idempotency probe: container is "healthy" iff it is currently Up
+ # AND its kubectl can reach the apiserver. If both true, skip the
+ # destructive rm + run so re-runs don't churn k3s state.
+ - name: Probe existing server container health
+ shell: |
+ set -o pipefail
+ UP=$(docker ps --filter name=^server$ --format '{% raw %}{{.Status}}{% endraw %}' | head -1)
+ if [ -z "$UP" ]; then
+ echo "missing"
+ exit 0
+ fi
+ if docker exec server kubectl get --raw=/readyz >/dev/null 2>&1; then
+ echo "healthy"
+ else
+ echo "unhealthy"
+ fi
+ args:
+ executable: /bin/bash
+ register: server_health
+ changed_when: false
+
+ - name: Show server container health
+ debug:
+ msg: "server health: {{ server_health.stdout }} (force_restart={{ force_restart }})"
+
- name: Stop existing server container if running
command: docker rm -f server
ignore_errors: yes
+ when: server_health.stdout != 'healthy' or (force_restart | bool)
- name: Start server container
shell: |
@@ -247,11 +382,13 @@
sleep 5
args:
executable: /bin/bash
+ when: server_health.stdout != 'healthy' or (force_restart | bool)
- name: Wait for server to be ready
wait_for:
timeout: 50
delegate_to: localhost
+ when: server_health.stdout != 'healthy' or (force_restart | bool)
- name: Check server container status
shell: 'docker ps --filter name=server --format "{{ ''{{'' }}.Names{{ ''}}'' }}: {{ ''{{'' }}.Status{{ ''}}'' }}"'
@@ -287,6 +424,8 @@
gather_facts: no
become: yes
serial: 1 # Start agents one at a time to avoid overwhelming the server
+ vars:
+ force_restart: false
tasks:
- name: Get server IP and token
@@ -298,9 +437,43 @@
debug:
msg: "Connecting to server at {{ server_ip }}"
+ # Agent "healthy" iff: container Up AND it has an established TCP
+ # connection from inside the container to the k3s apiserver port
+ # (6443) on the server. This is local to the agent host (no
+ # cross-node SSH needed) and catches both "container died" and
+ # "container Up but k3s agent crashed / lost supervisor link".
+ - name: Probe existing agent container health
+ shell: |
+ set -o pipefail
+ UP=$(docker ps --filter name=^agent$ --format '{% raw %}{{.Status}}{% endraw %}' | head -1)
+ if [ -z "$UP" ]; then
+ echo "missing"
+ exit 0
+ fi
+ # ss inside the agent container; k3s-agent holds an ESTAB conn
+ # to :6443. Match either peer in case ss column
+ # ordering changes between distros.
+ if docker exec agent ss -tn state established 2>/dev/null \
+ | awk '{print $3, $4}' \
+ | grep -qE "(:6443\s|\s{{ server_ip }}:6443)"; then
+ echo "healthy"
+ else
+ echo "unhealthy"
+ fi
+ args:
+ executable: /bin/bash
+ register: agent_health
+ changed_when: false
+ failed_when: false
+
+ - name: Show agent container health
+ debug:
+ msg: "agent health: {{ agent_health.stdout }} (force_restart={{ force_restart }})"
+
- name: Stop existing agent container if running
command: docker rm -f agent
ignore_errors: yes
+ when: agent_health.stdout != 'healthy' or (force_restart | bool)
- name: Start agent container
shell: |
@@ -309,11 +482,13 @@
sleep 5
args:
executable: /bin/bash
+ when: agent_health.stdout != 'healthy' or (force_restart | bool)
- name: Wait for agent to be ready
wait_for:
timeout: 15
delegate_to: localhost
+ when: agent_health.stdout != 'healthy' or (force_restart | bool)
- name: Check agent container status
shell: 'docker ps --filter name=agent --format "{{ ''{{'' }}.Names{{ ''}}'' }}: {{ ''{{'' }}.Status{{ ''}}'' }}"'
@@ -352,6 +527,65 @@
debug:
msg: "{{ cluster_info.stdout_lines }}"
+ # The bringup install runs inside the backgrounded gpu-cluster.sh on
+ # the server. Earlier this play only checked that nodes joined, so a
+ # half-installed cluster (operators missing because the install aborted
+ # during post-start k3s flapping) still reported success. Explicitly
+ # verify the helm releases reached STATUS=deployed and the CVF CronJob
+ # exists, retrying while operators reconcile, and fail loudly otherwise.
+ - name: Check whether network operator is enabled in config.json
+ command: docker exec server jq -r 'if .["network-operator"]["install-network-operator"] == null then true else .["network-operator"]["install-network-operator"] end' /configs/config.json
+ register: net_op_enabled
+ changed_when: false
+
+ - name: Build expected Helm release list
+ set_fact:
+ expected_helm_releases: >-
+ {{
+ [
+ { 'release': 'cert-manager', 'ns': 'cert-manager' },
+ { 'release': 'amd-gpu-operator', 'ns': 'kube-amd-gpu' }
+ ]
+ + (
+ [ { 'release': 'amd-network-operator', 'ns': 'kube-amd-network' } ]
+ if (net_op_enabled.stdout | trim | bool) else []
+ )
+ }}
+
+ - name: Verify required Helm releases are deployed
+ shell: |
+ set -o pipefail
+ docker exec server helm status "{{ item.release }}" -n "{{ item.ns }}" -o json 2>/dev/null \
+ | jq -r '.info.status' 2>/dev/null
+ args:
+ executable: /bin/bash
+ register: helm_release_status
+ changed_when: false
+ retries: 30
+ delay: 10
+ until: helm_release_status.stdout == "deployed"
+ loop: "{{ expected_helm_releases }}"
+ loop_control:
+ label: "{{ item.release }}"
+
+ - name: Check whether CVF is enabled in config.json
+ command: docker exec server jq -r '.["cluster-validation-framework"]["install-cvf"] // false' /configs/config.json
+ register: cvf_enabled
+ changed_when: false
+
+ - name: Verify CVF CronJob exists (when CVF enabled)
+ command: docker exec server kubectl get cronjob cluster-validation-cron-job -n default
+ register: cvf_cronjob
+ changed_when: false
+ retries: 30
+ delay: 10
+ until: cvf_cronjob.rc == 0
+ when: cvf_enabled.stdout | trim | bool
+
+ - name: Installation verification passed
+ debug:
+ msg: "All operators deployed and CVF state verified."
+
- name: Summary
hosts: localhost
connection: local
@@ -369,7 +603,7 @@
Agent Nodes: {{ groups['agent_nodes'] | length }}
To check cluster status:
- ansible-playbook playbooks/check-status.yml
+ ansible-playbook playbooks/cvf.yml -e action=status
To teardown cluster:
ansible-playbook playbooks/teardown-cluster.yml
diff --git a/example/gpu-validation-cluster/ansible/playbooks/teardown-cluster.yml b/example/gpu-validation-cluster/ansible/playbooks/teardown-cluster.yml
index 4448d4e45..d5fbd2937 100644
--- a/example/gpu-validation-cluster/ansible/playbooks/teardown-cluster.yml
+++ b/example/gpu-validation-cluster/ansible/playbooks/teardown-cluster.yml
@@ -1,17 +1,22 @@
---
# Teardown GPU Validation Cluster
-# Usage: ansible-playbook playbooks/teardown-cluster.yml
+# Usage:
+# ansible-playbook playbooks/teardown-cluster.yml
+# ansible-playbook playbooks/teardown-cluster.yml -e cleanup_test_logs=true
- name: Teardown cluster on all nodes
hosts: cluster_nodes
gather_facts: no
become: yes
+ vars:
+ # Set to true to also remove /var/log/cluster-validation on each node.
+ cleanup_test_logs: false
tasks:
- name: Run teardown script
shell: |
cd {{ script_dir }}
- CONFIG_DIR={{ config_dir }} ./gpu-cluster.sh teardown
+ CONFIG_DIR={{ config_dir }} CLEANUP_TEST_LOGS={{ cleanup_test_logs | bool | string | lower }} ./gpu-cluster.sh teardown
args:
executable: /bin/bash
ignore_errors: yes
diff --git a/example/gpu-validation-cluster/ansible/quickstart.sh b/example/gpu-validation-cluster/ansible/quickstart.sh
deleted file mode 100755
index 9ed0dc803..000000000
--- a/example/gpu-validation-cluster/ansible/quickstart.sh
+++ /dev/null
@@ -1,93 +0,0 @@
-#!/bin/bash
-# Quick start script for GPU Validation Cluster Ansible deployment
-
-set -e
-
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-cd "$SCRIPT_DIR"
-
-echo "=========================================="
-echo "GPU Validation Cluster - Quick Start"
-echo "=========================================="
-echo ""
-
-# Check if Ansible is installed
-if ! command -v ansible &> /dev/null; then
- echo "[ERROR] Ansible is not installed."
- echo "[INFO] Install Ansible:"
- echo " sudo apt update && sudo apt install ansible"
- echo " or"
- echo " pip install ansible"
- exit 1
-fi
-
-echo "[INFO] Ansible version: $(ansible --version | head -n1)"
-echo ""
-
-# Check if inventory exists
-if [ ! -f "inventory.yml" ]; then
- echo "[ERROR] inventory.yml not found!"
- echo "[INFO] Please create and edit inventory.yml with your node details:"
- echo " vi inventory.yml"
- echo ""
- echo "[INFO] Required configuration in inventory.yml:"
- echo " - ansible_host: IP addresses of your nodes"
- echo " - ansible_user: SSH username"
- echo " - ansible_connection: local (for server node if running locally)"
- exit 1
-fi
-
-echo "[INFO] Inventory file found: inventory.yml"
-echo ""
-
-# Test connectivity
-echo "[STEP 1] Testing SSH connectivity to nodes..."
-if ansible all -m ping -o; then
- echo "[SUCCESS] All nodes are reachable"
-else
- echo "[WARN] Some nodes are not reachable via SSH"
- echo ""
- read -p "Do you want to setup SSH keys? (y/n) " -n 1 -r
- echo ""
- if [[ $REPLY =~ ^[Yy]$ ]]; then
- echo "[INFO] Running SSH key setup..."
- ansible-playbook playbooks/setup-ssh-keys.yml --ask-pass --ask-become-pass
- else
- echo "[ERROR] Cannot proceed without SSH access to nodes"
- exit 1
- fi
-fi
-
-echo ""
-echo "[STEP 2] Ready to deploy cluster"
-echo ""
-echo "The following will be performed:"
-echo " 1. Build Docker image locally"
-echo " 2. Install Docker and jq on all nodes (if needed)"
-echo " 3. Copy image to all nodes"
-echo " 4. Start server node"
-echo " 5. Start agent nodes and join cluster"
-echo ""
-read -p "Proceed with cluster deployment? (y/n) " -n 1 -r
-echo ""
-
-if [[ ! $REPLY =~ ^[Yy]$ ]]; then
- echo "Deployment cancelled."
- exit 0
-fi
-
-echo ""
-echo "[STEP 3] Deploying cluster..."
-ansible-playbook playbooks/setup-cluster.yml
-
-echo ""
-echo "=========================================="
-echo "Deployment Complete!"
-echo "=========================================="
-echo ""
-echo "Check cluster status:"
-echo " ansible-playbook playbooks/check-status.yml"
-echo ""
-echo "Teardown cluster:"
-echo " ansible-playbook playbooks/teardown-cluster.yml"
-echo ""
diff --git a/example/gpu-validation-cluster/build/Dockerfile b/example/gpu-validation-cluster/build/Dockerfile
index a4c99739b..cfc631327 100644
--- a/example/gpu-validation-cluster/build/Dockerfile
+++ b/example/gpu-validation-cluster/build/Dockerfile
@@ -34,12 +34,14 @@ RUN curl -sfL https://github.com/k3s-io/k3s/releases/download/${K3S_VERSION}/k3s
mkdir -p /var/lib/rancher/k3s/agent
RUN mkdir -p /images && \
- skopeo copy docker://ghcr.io/k8snetworkplumbingwg/multus-cni:${MULTUS_CNI_VERSION} oci-archive:/images/multus-cni-${MULTUS_CNI_VERSION}.tar && \
- zstd -19 /images/multus-cni-${MULTUS_CNI_VERSION}.tar -o /images/multus-cni-${MULTUS_CNI_VERSION}.tar.zst && \
- rm /images/multus-cni-${MULTUS_CNI_VERSION}.tar && \
- skopeo copy docker://registry.k8s.io/nfd/node-feature-discovery:${NFD_VERSION} oci-archive:/images/nfd-${NFD_VERSION}.tar && \
- zstd -19 /images/nfd-${NFD_VERSION}.tar -o /images/nfd-${NFD_VERSION}.tar.zst && \
- rm /images/nfd-${NFD_VERSION}.tar
+ skopeo copy \
+ docker://ghcr.io/k8snetworkplumbingwg/multus-cni:${MULTUS_CNI_VERSION} \
+ docker-archive:/images/multus-cni-${MULTUS_CNI_VERSION}.tar:ghcr.io/k8snetworkplumbingwg/multus-cni:${MULTUS_CNI_VERSION} && \
+ zstd -19 --rm /images/multus-cni-${MULTUS_CNI_VERSION}.tar && \
+ skopeo copy \
+ docker://registry.k8s.io/nfd/node-feature-discovery:${NFD_VERSION} \
+ docker-archive:/images/nfd-${NFD_VERSION}.tar:registry.k8s.io/nfd/node-feature-discovery:${NFD_VERSION} && \
+ zstd -19 --rm /images/nfd-${NFD_VERSION}.tar
# Set kubeconfig environment variable
ENV KUBECONFIG=/etc/rancher/k3s/k3s.yaml
diff --git a/example/gpu-validation-cluster/build/entrypoint.sh b/example/gpu-validation-cluster/build/entrypoint.sh
index 7b7eeb7da..487a4c95b 100755
--- a/example/gpu-validation-cluster/build/entrypoint.sh
+++ b/example/gpu-validation-cluster/build/entrypoint.sh
@@ -26,7 +26,8 @@ start_k3s_server() {
--disable=traefik \
--disable=servicelb \
--kubelet-arg=serialize-image-pulls=true \
- ${K3S_EXTRA_ARGS} > /var/log/k3s.log 2>&1 &
+ ${K3S_EXTRA_ARGS} >> /var/log/k3s.log 2>&1 &
+ K3S_PID=$!
}
start_k3s_agent() {
@@ -35,7 +36,7 @@ start_k3s_agent() {
exit 1
fi
echo "[INFO] Starting k3s agent, joining server at $K3S_URL"
-
+
# Create registries.yaml for in-cluster registry
# do this for agent before starting the k3s agent process
# at this point we already know the URL to in cluster registry service
@@ -44,30 +45,100 @@ start_k3s_agent() {
${AGENT_REGISTRIES_CONFIG}
EOF
echo "[INFO] Created registries.yaml for registry at ${K3S_IP}:${IN_CLUSTER_REGISTRY_PORT}"
-
+
k3s agent \
--server="$K3S_URL" \
--token="$K3S_TOKEN" \
--with-node-id \
--kubelet-arg=serialize-image-pulls=true \
- ${K3S_EXTRA_ARGS} > /var/log/k3s.log 2>&1 &
+ ${K3S_EXTRA_ARGS} >> /var/log/k3s.log 2>&1 &
+ K3S_PID=$!
+}
+
+# Supervisor loop. Re-launches k3s ($MODE) whenever it exits non-zero,
+# replacing the previous "fire-and-forget + sleep infinity" pattern.
+#
+# Why this exists: k3s frequently crashes on the FIRST start after an OS
+# reboot due to transient host state -- iptables-nft chains wiped (kube-
+# router netpol panics on missing KUBE-ROUTER-OUTPUT), missing cni0
+# device (flannel can't add IP), stale containerd socket file from prior
+# boot, etc. A second/third attempt typically succeeds because the first
+# attempt re-creates the chains/ifaces before crashing. Before this loop,
+# such crashes left the container looking "Up" (entrypoint + sleep) while
+# k3s was dead, requiring an operator to docker rm/run the container.
+#
+# Safety: a rolling cap (MAX_RESTARTS in WINDOW_SECS) prevents a hot loop
+# when k3s is fundamentally misconfigured. When the cap is hit the script
+# exits non-zero so `docker --restart=unless-stopped` (already set by
+# gpu-cluster.sh) recreates the container fresh, clearing any per-PID
+# state inside the container. The CVF setup-cluster.yml health probe
+# will then detect the new container on its next idempotent re-run.
+supervise_k3s() {
+ local mode="$1"
+ local restarts=0
+ local window_start
+ window_start=$(date +%s)
+ local MAX_RESTARTS="${K3S_MAX_RESTARTS:-10}"
+ local WINDOW_SECS="${K3S_RESTART_WINDOW:-600}" # 10 min
+ local BACKOFF_MIN=2
+ local BACKOFF_MAX=30
+ local backoff="$BACKOFF_MIN"
+
+ # Forward SIGTERM/SIGINT to k3s so `docker stop` is graceful (node
+ # drain) instead of getting SIGKILLed after the 10s grace period.
+ # shellcheck disable=SC2064
+ trap 'echo "[INFO] supervisor: forwarding signal to k3s pid=$K3S_PID"; \
+ [ -n "$K3S_PID" ] && kill -TERM "$K3S_PID" 2>/dev/null; \
+ wait "$K3S_PID" 2>/dev/null; exit 0' TERM INT
+
+ while true; do
+ case "$mode" in
+ server) start_k3s_server ;;
+ agent) start_k3s_agent ;;
+ esac
+
+ # `wait $pid` blocks until that child exits and returns its exit code.
+ # `set -e` would abort the script if k3s exits non-zero -- we want to
+ # observe the rc and decide, so disable -e around the wait.
+ set +e
+ wait "$K3S_PID"
+ local rc=$?
+ set -e
+ local now
+ now=$(date +%s)
+
+ # Reset the rolling restart window if k3s ran longer than WINDOW_SECS
+ # before crashing -- it was effectively stable, so this crash is
+ # unrelated to the post-boot flapping we are protecting against.
+ if [ $(( now - window_start )) -gt "$WINDOW_SECS" ]; then
+ restarts=0
+ window_start=$now
+ backoff="$BACKOFF_MIN"
+ fi
+
+ restarts=$(( restarts + 1 ))
+ echo "[WARN] supervisor: k3s $mode exited rc=$rc (restart $restarts/$MAX_RESTARTS in ${WINDOW_SECS}s window)"
+
+ if [ "$restarts" -ge "$MAX_RESTARTS" ]; then
+ echo "[ERROR] supervisor: $MAX_RESTARTS restarts within ${WINDOW_SECS}s -- giving up so docker recreates the container"
+ exit 1
+ fi
+
+ echo "[INFO] supervisor: sleeping ${backoff}s before restart"
+ sleep "$backoff"
+ # Exponential backoff capped at BACKOFF_MAX so we don't grow unbounded.
+ backoff=$(( backoff * 2 ))
+ [ "$backoff" -gt "$BACKOFF_MAX" ] && backoff="$BACKOFF_MAX"
+ done
}
move_preloaded_images
-echo "[INFO] Starting k3s in $MODE mode"
+echo "[INFO] Starting k3s in $MODE mode (supervised)"
case "$MODE" in
- server)
- start_k3s_server
- ;;
- agent)
- start_k3s_agent
- ;;
+ server|agent) supervise_k3s "$MODE" ;;
*)
echo "[ERROR] Unknown mode: $MODE. Use 'server' or 'agent'"
exit 1
;;
esac
-
-# Keep container running
-sleep infinity
diff --git a/example/gpu-validation-cluster/configs/cluster-validation-config.yaml b/example/gpu-validation-cluster/configs/cluster-validation-config.yaml
index 22dce2826..f48c90bba 100644
--- a/example/gpu-validation-cluster/configs/cluster-validation-config.yaml
+++ b/example/gpu-validation-cluster/configs/cluster-validation-config.yaml
@@ -3,31 +3,276 @@ kind: ConfigMap
metadata:
name: cluster-validation-config
data:
- LOG_STORE_NODE_NAME: "" # Specify a candidate node name to store logs for easier access, or leave empty to let the system pick one of the candidate nodes
- JOB_NAME: cluster-validation-mpi-job # Must match MPIJob metadata.name
- WORKER_REPLICAS: "__WORKER_REPLICAS__" # Number of Worker Pods in each MPIJob doing actual computation
- LAUNCHER_REPLICAS: "__LAUNCHER_REPLICAS__" # Number of Launcher Pods for the MPIJob, which coordinates workers
- SLOTS_PER_WORKER: "__SLOTS_PER_WORKER__" # MPI ranks per Worker pod
- GPU_PER_WORKER: "__GPU_PER_WORKER__" # Number of GPUs to request per Worker pod
- PF_NIC_PER_WORKER: "__PF_NIC_PER_WORKER__" # Number of PF-NICs to request per Worker pod
- VF_NIC_PER_WORKER: "__VF_NIC_PER_WORKER__" # Number of VF-NICs to request per Worker pod
- NODE_VALIDATION_INTERVAL_MINS: "__NODE_VALIDATION_INTERVAL_MINS__" # minimum interval in minutes between cluster validation runs on a given worker node
+ LOG_STORE_NODE_NAME: "" # Specify a candidate node name to store logs for easier access, or leave empty to let the system pick one of the candidate nodes
+ JOB_NAME: cluster-validation-mpi-job # Must match MPIJob metadata.name
+ # Resource defaults below. Override via config.json
+ # ["cluster-validation-framework"].resources.* (gpu-cluster.sh patches
+ # the deployed ConfigMap after this YAML is applied).
+ # Fields below carry a `patchable:` marker showing the JSON path that
+ # overrides them when gpu-cluster.sh runs.
+ WORKER_REPLICAS: "2" # Worker Pods per MPIJob. patchable: cluster-validation-framework.resources.worker-replicas
+ LAUNCHER_REPLICAS: "1" # Launcher Pods per MPIJob. patchable: cluster-validation-framework.resources.launcher-replicas
+ SLOTS_PER_WORKER: "8" # MPI ranks per Worker pod. patchable: cluster-validation-framework.resources.slots-per-worker
+ GPU_PER_WORKER: "8" # GPUs per Worker pod. patchable: cluster-validation-framework.resources.gpu-per-worker
+ PF_NIC_PER_WORKER: "0" # PF-NICs per Worker pod. patchable: cluster-validation-framework.resources.pf-nic-per-worker
+ VF_NIC_PER_WORKER: "8" # VF-NICs per Worker pod. patchable: cluster-validation-framework.resources.vf-nic-per-worker
+ NODE_VALIDATION_INTERVAL_MINS: "30" # Minimum minutes between validation runs per node. patchable: cluster-validation-framework.resources.node-validation-interval-mins
PF_NIC_NAD_NAME: "amd-host-device-nad"
VF_NIC_NAD_NAME: "vf-amd-host-device-nad"
# === Node Selection Labels for candidates ===
# NOTE:
# The entire list below can be customized via config.json["cluster-validation-framework"]["node-selector-labels"]
- # Default: Select nodes with feature.node.kubernetes.io/amd-gpu=true and feature.node.kubernetes.io/amd-nic=true
- # For virtual function (VF) based GPU in a VM, use amd-vgpu=true instead of amd-gpu=true
+ # Default: feature.node.kubernetes.io/amd-gpu=true ONLY.
+ # The previous default also required feature.node.kubernetes.io/amd-nic=true; that
+ # joined requirement was dropped so a datacenter early in bringup can run Phase 1
+ # (GPU HW acceptance) and Phase 2 (intra-node GPU mesh) before any NIC
+ # infrastructure exists. This is the core enabling change for incremental bringup
+ # described design section 4 -> "Code Path: Candidate selection update".
+ # NIC-requiring phases (Phase 3 / NIC health, Phase 4 / rail bandwidth) narrow the
+ # candidate set inside their own run_phaseN script via
+ # intersect
+ # (see cluster-validation-job.yaml and design section 5). The orchestrator
+ # does NOT enforce amd-nic=true at candidate-selection time.
+ # For virtual function (VF) based GPU in a VM, use amd-vgpu=true instead of amd-gpu=true.
# For virtual function (VF) based NIC in a VM, use amd-vnic=true instead of amd-nic=true
+ # (override node-selector-labels in config.json when running NIC-requiring phases on VFs).
+ # patchable: cluster-validation-framework.node-selector-labels (newline-joined)
NODE_SELECTOR_LABELS: |
- __NODE_SELECTOR_LABELS__
+ feature.node.kubernetes.io/amd-gpu=true
CANDIDATE_LABEL: "amd.com/cluster-validation-candidate=true"
SUCCESS_LABEL: "amd.com/cluster-validation-status=passed"
FAILURE_LABEL: "amd.com/cluster-validation-status=failed"
TIMESTAMP_ANNOTATION: "amd.com/cluster-validation-last-run-timestamp"
+ # === Per-Phase Label Keys (consumed by every phase) ===
+ # Downstream phases MUST read these constants
+ # rather than hard-coding their own label key. Phases write
+ # =passed|failed to the node and, on failure,
+ # an annotation =.
+ PHASE1_LABEL_KEY: "amd.com/gpu-hw-acceptance"
+ PHASE2_LABEL_KEY: "amd.com/gpu-mesh-validation"
+ PHASE3_LABEL_KEY: "amd.com/nic-health"
+ PHASE4_LABEL_KEY: "amd.com/rail-bandwidth"
+ PHASE5_LABEL_KEY: "amd.com/cluster-validation-status"
+ PHASE_FAILURE_REASON_ANNOTATION_SUFFIX: "-failure-reason"
+
+ # === Phase 5 Tunables ===
+ # PHASE5_MIN_WORKERS: minimum surviving-node
+ # count required for Phase 5 to actually submit an MPIJob. Default
+ # "2" because the RCCL collectives Phase 5 runs (all_reduce_perf,
+ # broadcast_perf, reduce_scatter_perf) are multi-node by definition
+ # a single-worker MPIJob is degenerate and produces no meaningful
+ # bandwidth signal. Operators can override to "1" to opt into the
+ # degenerate single-node MPIJob (e.g., for plumbing-only validation
+ # on a single-node testbed). See design doc sections 4 and 6
+ # ("Error Handling" -> "Below PHASE5_MIN_WORKERS").
+ PHASE5_MIN_WORKERS: "2"
+
+ # === Node Metadata Helper Library ===
+ # Pure shell functions sourced by the CronJob orchestrator and by every
+ # per-phase script. Provides the uniform label/annotate/filter primitives
+ # described design §4 -> "Code Path: Node metadata helpers".
+ #
+ # Contract (see design doc design §5):
+ # * Downstream phases MUST NOT call `kubectl label` / `kubectl annotate`
+ # on phase labels directly -- they MUST use these helpers.
+ # * Label values are exactly "passed" or "failed".
+ # * On failure, the failure-reason annotation is written using
+ # PHASE_FAILURE_REASON_ANNOTATION_SUFFIX.
+ # * `filter_passed_nodes` is the gate primitive between phases:
+ # a node that is missing the label, or whose label value is not
+ # "passed", is dropped (conservative -- design §6).
+ #
+ # Functions:
+ # label_phase_passed
+ # label_phase_failed
+ # annotate_phase_value
+ # filter_passed_nodes # stdout
+ #
+ # Diagnostics go to stderr so that `filter_passed_nodes` stdout stays
+ # clean for piping (`for n in $(filter_passed_nodes .)`).
+ #
+ # Idempotency: all kubectl writes use --overwrite so re-runs of the
+ # same phase on the same node converge to the latest label/annotation.
+ PHASE_NODE_LABEL_SCRIPT: |
+ #!/bin/bash
+ # Node metadata helper library. Source this file -- do not exec.
+ # All helpers return 0 on success and non-zero on failure. A failed
+ # helper call is treated by `filter_passed_nodes` as "not passed",
+ # i.e. the node is conservatively dropped from subsequent phases.
+
+ # Maximum length of a single annotation value. Kubernetes caps the
+ # total annotation set on an object at 256 KiB; cap any individual
+ # reason/value at 200_000 bytes so a handful of large annotations
+ # cannot push the object over the limit. Truncation is logged.
+ : "${PHASE_ANNOTATION_VALUE_MAX_BYTES:=200000}"
+
+ # _phase_log
+ # Internal logger. Sends a tagged line to stderr.
+ _phase_log() {
+ echo "[phase-helpers] $*" >&2
+ }
+
+ # _phase_require_args
+ # Returns 0 if expected == actual, else logs and returns 2.
+ _phase_require_args() {
+ local fn_name="$1"
+ local expected="$2"
+ local actual="$3"
+ if [[ "$actual" -ne "$expected" ]]; then
+ _phase_log "ERROR: ${fn_name} expects ${expected} arg(s), got ${actual}"
+ return 2
+ fi
+ return 0
+ }
+
+ # _phase_require_nonempty
+ # Returns 0 if value is non-empty, else logs and returns 2.
+ _phase_require_nonempty() {
+ local fn_name="$1"
+ local arg_name="$2"
+ local value="$3"
+ if [[ -z "$value" ]]; then
+ _phase_log "ERROR: ${fn_name}: argument '${arg_name}' is empty"
+ return 2
+ fi
+ return 0
+ }
+
+ # _phase_truncate_value
+ # Emits value on stdout, truncated to PHASE_ANNOTATION_VALUE_MAX_BYTES.
+ # Logs to stderr when truncation occurs.
+ _phase_truncate_value() {
+ local value="$1"
+ local max="${PHASE_ANNOTATION_VALUE_MAX_BYTES}"
+ local len="${#value}"
+ if [[ "$len" -gt "$max" ]]; then
+ _phase_log "WARN: annotation value of ${len} bytes truncated to ${max} bytes"
+ printf '%s' "${value:0:$max}"
+ else
+ printf '%s' "$value"
+ fi
+ }
+
+ # label_phase_passed
+ # Sets =passed on .
+ label_phase_passed() {
+ _phase_require_args "label_phase_passed" 2 "$#" || return 2
+ local node="$1"
+ local key="$2"
+ _phase_require_nonempty "label_phase_passed" "node" "$node" || return 2
+ _phase_require_nonempty "label_phase_passed" "phase_label_key" "$key" || return 2
+ if ! kubectl label node "$node" "${key}=passed" --overwrite >/dev/null; then
+ _phase_log "ERROR: kubectl label ${node} ${key}=passed failed"
+ return 1
+ fi
+ _phase_log "node=${node} ${key}=passed"
+ return 0
+ }
+
+ # label_phase_failed
+ # Sets =failed on AND writes
+ # =.
+ # If labeling fails, the annotation write is still attempted so that
+ # the failure reason is captured wherever possible.
+ label_phase_failed() {
+ _phase_require_args "label_phase_failed" 3 "$#" || return 2
+ local node="$1"
+ local key="$2"
+ local reason="$3"
+ _phase_require_nonempty "label_phase_failed" "node" "$node" || return 2
+ _phase_require_nonempty "label_phase_failed" "phase_label_key" "$key" || return 2
+ # reason is allowed to be empty -- a missing reason is still a
+ # legitimate failure signal; we just don't write the annotation.
+
+ local rc=0
+ if ! kubectl label node "$node" "${key}=failed" --overwrite >/dev/null; then
+ _phase_log "ERROR: kubectl label ${node} ${key}=failed failed"
+ rc=1
+ fi
+
+ if [[ -n "$reason" ]]; then
+ local annotation_key="${key}${PHASE_FAILURE_REASON_ANNOTATION_SUFFIX}"
+ local safe_reason
+ safe_reason=$(_phase_truncate_value "$reason")
+ if ! kubectl annotate node "$node" "${annotation_key}=${safe_reason}" --overwrite >/dev/null; then
+ _phase_log "ERROR: kubectl annotate ${node} ${annotation_key}=... failed"
+ rc=1
+ fi
+ fi
+
+ _phase_log "node=${node} ${key}=failed reason=${reason:-}"
+ return "$rc"
+ }
+
+ # annotate_phase_value
+ # Writes a phase-scoped annotation of the form
+ # -=
+ # to . Intended for phase scripts that want to publish
+ # structured per-phase data (e.g. measured bandwidth, test version).
+ annotate_phase_value() {
+ _phase_require_args "annotate_phase_value" 4 "$#" || return 2
+ local node="$1"
+ local phase_key="$2"
+ local sub_key="$3"
+ local value="$4"
+ _phase_require_nonempty "annotate_phase_value" "node" "$node" || return 2
+ _phase_require_nonempty "annotate_phase_value" "phase_label_key" "$phase_key" || return 2
+ _phase_require_nonempty "annotate_phase_value" "key" "$sub_key" || return 2
+ # value may be empty -- caller controls semantics.
+
+ local annotation_key="${phase_key}-${sub_key}"
+ local safe_value
+ safe_value=$(_phase_truncate_value "$value")
+ if ! kubectl annotate node "$node" "${annotation_key}=${safe_value}" --overwrite >/dev/null; then
+ _phase_log "ERROR: kubectl annotate ${node} ${annotation_key}=... failed"
+ return 1
+ fi
+ return 0
+ }
+
+ # filter_passed_nodes
+ # Emits, on stdout, the subset of whose current value of
+ # is exactly "passed". A node that lacks the
+ # label, has a different value, or whose label cannot be read is
+ # silently dropped (conservative -- see design §6).
+ # Output is space-separated, single line, terminated by newline if
+ # any nodes match; empty input or no matches produces no output.
+ filter_passed_nodes() {
+ _phase_require_args "filter_passed_nodes" 2 "$#" || return 2
+ local nodes="$1"
+ local key="$2"
+ _phase_require_nonempty "filter_passed_nodes" "phase_label_key" "$key" || return 2
+ # nodes may be empty -- caller may legitimately pass an empty list.
+
+ # Escape dots in the label key for jsonpath traversal of
+ # `.metadata.labels.`. kubectl jsonpath uses `.` as the
+ # field separator, so dots inside the label key must be
+ # backslash-escaped (matches the existing TIMESTAMP_ANNOTATION
+ # handling in CRONJOB_CANDIDATE_NODES_SELECTION_SCRIPT).
+ local escaped_key
+ escaped_key=$(echo "$key" | sed 's/\./\\./g')
+
+ local out=""
+ local node value
+ for node in $nodes; do
+ value=$(kubectl get node "$node" \
+ -o jsonpath="{.metadata.labels.${escaped_key}}" 2>/dev/null || echo "")
+ if [[ "$value" == "passed" ]]; then
+ if [[ -z "$out" ]]; then
+ out="$node"
+ else
+ out="$out $node"
+ fi
+ fi
+ done
+ if [[ -n "$out" ]]; then
+ echo "$out"
+ fi
+ return 0
+ }
+
# === RCCL Tests Definitions ===
TESTS_JSON: |
{
@@ -38,8 +283,11 @@ data:
]
}
- RCCL_WORKLOAD_IMAGE: "docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_rccl-7.0.2_anp-v1.2.0_ainic-1.117.1-a-63"
- MPIJOB_WAIT_TIME: "240"
+ # Single workload image for Phase 2 (intra-node RCCL),
+ # Phase 4 (ib_write_bw server/client), and Phase 5 (multi-node RCCL).
+ # patchable: cluster-validation-framework.images.roce-workload
+ ROCE_WORKLOAD_IMAGE: "docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_rccl-7.0.2_anp-v1.2.0_ainic-1.117.1-a-63"
+ MPIJOB_WAIT_TIME: "240" # Phase 5: wallclock seconds for MPIJob completion. patchable: cluster-validation-framework.timeouts.phase5-mpijob-wait-secs
DEBUG_DELAY: "20"
WAIT_FOR_WORKERS: "true"
ENABLE_SSH_CHECK: "true"
@@ -47,20 +295,100 @@ data:
SSH_CHECK_INTERVAL: "4"
SSH_CHECK_TIMEOUT: "60"
+ # Phase 5 multi-node RCCL env vars. Profile follows the AMD Pollara
+ # 400 RCCL ops-guide (PCIe relaxed ordering, TC=96, GID index 1,
+ # ignore CPU affinity, GDR flush). Non-default deviations:
+ #
+ # NCCL_NET_PLUGIN=none - bypass ANP, use stock NCCL_NET=IB
+ # (same ibverbs path as ib_write_bw)
+ # NCCL_IB_QPS_PER_CONNECTION=1 - one QP per peer per channel
+ # NCCL_IB_SPLIT_DATA_ON_QPS=0
+ # NCCL_IB_USE_INLINE=0 - SGE list path (more battle-tested)
+ # NCCL_MIN/MAX_NCHANNELS=4 - lower QP fan-out per rank
+ # NCCL_IB_HCA=ionic_0.ionic_7 - explicit device list
+ # NCCL_IB_TIMEOUT=22, RETRY_CNT=12 - wider IB layer retry budget
+ #
+ # Diagnostics: NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS=INIT,NET,
+ # NCCL_TOPO_DUMP_FILE=/tmp/topo_all.txt.
RCCL_ENV_VARS: |
#!/bin/bash
+ # --- transport selection (the one knob that mattered most) -----
+ RCCL_ENV="$RCCL_ENV -x NCCL_NET_PLUGIN=none"
+ RCCL_ENV="$RCCL_ENV -x NCCL_IB_HCA=ionic_0,ionic_1,ionic_2,ionic_3,ionic_4,ionic_5,ionic_6,ionic_7"
+ # --- QP / send-path shape (avoid multi-QP doorbell races) ------
+ RCCL_ENV="$RCCL_ENV -x NCCL_IB_QPS_PER_CONNECTION=1"
+ RCCL_ENV="$RCCL_ENV -x NCCL_IB_SPLIT_DATA_ON_QPS=0"
+ RCCL_ENV="$RCCL_ENV -x NCCL_IB_USE_INLINE=0"
+ RCCL_ENV="$RCCL_ENV -x NCCL_MIN_NCHANNELS=4"
+ RCCL_ENV="$RCCL_ENV -x NCCL_MAX_NCHANNELS=4"
+ # --- RoCEv2 GID + PCIe + traffic class (AMD Pollara 400 UG) ----
RCCL_ENV="$RCCL_ENV -x NCCL_IB_GID_INDEX=1"
+ RCCL_ENV="$RCCL_ENV -x NCCL_IB_PCI_RELAXED_ORDERING=1"
+ RCCL_ENV="$RCCL_ENV -x NCCL_IB_TC=96"
+ RCCL_ENV="$RCCL_ENV -x NCCL_IB_TIMEOUT=22"
+ RCCL_ENV="$RCCL_ENV -x NCCL_IB_RETRY_CNT=12"
+ # --- recv / GDR / lockfree (FW + ROCm interplay) ---------------
RCCL_ENV="$RCCL_ENV -x NCCL_NET_OPTIONAL_RECV_COMPLETION=0"
RCCL_ENV="$RCCL_ENV -x NCCL_GDR_FLUSH_DISABLE=1"
RCCL_ENV="$RCCL_ENV -x RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0"
- RCCL_ENV="$RCCL_ENV -x NCCL_IB_USE_INLINE=1"
RCCL_ENV="$RCCL_ENV -x IONIC_LOCKFREE=all"
- RCCL_ENV="$RCCL_ENV -x NCCL_NET_PLUGIN=librccl-anp.so"
- RCCL_ENV="$RCCL_ENV -x NCCL_TOPO_DUMP_FILE=/tmp/topo_all.txt"
- RCCL_ENV="$RCCL_ENV -x NCCL_DEBUG=INFO"
RCCL_ENV="$RCCL_ENV -x NCCL_DMABUF_ENABLE=0"
+ # --- scheduling + diagnostics ----------------------------------
+ RCCL_ENV="$RCCL_ENV -x NCCL_IGNORE_CPU_AFFINITY=1"
+ RCCL_ENV="$RCCL_ENV -x NCCL_DEBUG=INFO"
+ RCCL_ENV="$RCCL_ENV -x NCCL_DEBUG_SUBSYS=INIT,NET"
+ RCCL_ENV="$RCCL_ENV -x NCCL_TOPO_DUMP_FILE=/tmp/topo_all.txt"
echo "$RCCL_ENV" > /shared/rccl_env.txt
+ # === Phase 2 (Intra-Node GPU Mesh Validation) env vars ===
+ # Consumed by the per-node Phase 2 RCCL Job (cluster-validation-job.yaml,
+ # `cluster-validation-phase2-job-config` ConfigMap) and by
+ # `PHASE2_SCRIPT`. See design doc design doc §4 ->
+ # "Code Path: Phase 2 ConfigMap variables".
+ #
+ # The PHASE2_START_MSG_SIZE / PHASE2_END_MSG_SIZE / PHASE2_STEP_FACTOR /
+ # PHASE2_ITER_COUNT / PHASE2_WARMUP_ITER_COUNT block are mpirun args for
+ # `all_reduce_perf`; they mirror the Phase 5 MPI_LAUNCHER_ENV_VARS
+ # defaults (1K.2G, step 2, 6 iters, 20 warmup) so a single tuning
+ # change can be applied symmetrically when needed.
+ PHASE2_START_MSG_SIZE: "1K"
+ PHASE2_END_MSG_SIZE: "2G"
+ PHASE2_STEP_FACTOR: "2"
+ PHASE2_ITER_COUNT: "6"
+ PHASE2_WARMUP_ITER_COUNT: "20"
+ # Default 100 GB/s -- intra-node Avg bus bandwidth baseline. MI300X
+ # with full xGMI fabric measures ~250-300 GB/s; MI308X (PCIe-only,
+ # no xGMI) measures ~105-110 GB/s. 100 is the lower-common-denominator
+ # threshold; raise per-deployment for fabric-rich SKUs. Validate
+ # against measured baseline before rollout to a new SKU (see design doc
+ # design §8). Tunable via ConfigMap edit -- no code change needed.
+ PHASE2_BW_THRESHOLD: "100"
+ # Per-node Job wallclock budget. all_reduce_perf with the defaults above
+ # completes in well under a minute on healthy MI300 xGMI; 600s leaves
+ # headroom for pod scheduling, image pull, and a slow-but-passing run.
+ # PHASE2_SCRIPT counts a node as `timeout` failure past this budget.
+ # patchable: cluster-validation-framework.timeouts.phase2-job-wait-secs
+ PHASE2_JOB_WAIT_TIME: "600"
+ # Stripped variant of RCCL_ENV_VARS for the single-node Phase 2 path:
+ # drops NCCL_NET_PLUGIN (no librccl-anp -- collectives stay on xGMI)
+ # and every NCCL_IB_* / IONIC_* / GDR-fabric tunable (no NIC / no
+ # InfiniBand). Keeps NCCL_DEBUG=INFO for diagnostics and
+ # NCCL_DMABUF_ENABLE=0 to match the Phase 5 path. Consumed by the
+ # Phase 2 Job container via `source <(echo "$PHASE2_RCCL_ENV_VARS")`
+ # (see cluster-validation-job.yaml `cluster-validation-phase2-job-config`).
+ PHASE2_RCCL_ENV_VARS: |
+ #!/bin/bash
+ export NCCL_DEBUG=INFO
+ export NCCL_DMABUF_ENABLE=0
+ # rccl-tests build dir (all_reduce_perf binary). Same path as
+ # MPI_LAUNCHER_ENV_VARS uses for Phase 5; duplicated here because
+ # the Phase 2 Job container only sources PHASE2_RCCL_ENV_VARS
+ # (not MPI_LAUNCHER_ENV_VARS), and the Job script runs under
+ # `set -u` so an unset PERF_TEST_DIR aborts with exit 127.
+ export PERF_TEST_DIR=/root/rccl-tests/build
+ # No NCCL_NET_PLUGIN -- single node, intra-node xGMI only.
+ # No NCCL_IB_* / IONIC_* -- no fabric / no NICs involved.
+
# MPIJob documentation: https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/mpi/
MPI_LAUNCHER_ENV_VARS: |
export MPIRUN_SSH_PORT=22
@@ -75,41 +403,113 @@ data:
export WARMUP_ITER_COUNT="20"
export CHECK_ITER_COUNT="0"
+ # === Per-Phase Skip Flags ===
+ # One flag per phase. Patched in from config.json["cluster-validation-framework"]["skip-tests"]
+ # by gpu-cluster.sh after this YAML is applied.
+ SKIP_GPU_HW_ACCEPTANCE: "false" # Phase 1: GPU hardware acceptance (test-runner / RVS / AGFHC). patchable: cluster-validation-framework.skip-tests.skip-phase1-gpu-hw-acceptance
+ SKIP_GPU_MESH_VALIDATION: "false" # Phase 2: Intra-node GPU collective / mesh validation. patchable: cluster-validation-framework.skip-tests.skip-phase2-gpu-mesh-validation
+ SKIP_NIC_VALIDATION: "false" # Phase 3: Per-node NIC health (requires amd-nic=true). patchable: cluster-validation-framework.skip-tests.skip-phase3-nic-validation
+ SKIP_RAIL_BANDWIDTH_TEST: "false" # Phase 4: Pairwise rail bandwidth. patchable: cluster-validation-framework.skip-tests.skip-phase4-rail-bandwidth-test
+ SKIP_RCCL_TEST: "false" # Phase 5: Multi-node RCCL via MPIJob. patchable: cluster-validation-framework.skip-tests.skip-phase5-rccl-test
+
# === GPU Validation Tests Definitions ===
- # RVS: ROCm Validation Suite. For a full list of supported recipes and arguments, refer to https://instinct.docs.amd.com/projects/gpu-operator/en/latest/test/appendix-test-recipe.html
- # AGFHC: AMD GPU Field Health Check. For a full list of supported recipes and arguments, refer to https://instinct.docs.amd.com/projects/gpu-operator/en/latest/test/agfhc.html
- # Refer to the above links for other available test frameworks and recipes, and configure the wait time accordingly.
- SKIP_GPU_VALIDATION: "__SKIP_GPU_VALIDATION__" # Set to "true" to skip GPU validation tests, directly start the RCCL tests
- SKIP_RCCL_TEST: "__SKIP_RCCL_TEST__" # Set to "true" to skip MPI Job RCCL tests
- TEST_RUNNER_JOB_WAIT_TIME: "1200"
- TEST_RUNNER_SUCCESS_LABEL: "amd.com/gpu-validation-test=passed"
- TEST_RUNNER_FAILURE_LABEL: "amd.com/gpu-validation-test=failed"
- TEST_RUNNER_IMAGE: "docker.io/rocm/test-runner:v1.4.0"
- GPU_VALIDATION_TESTS_JSON: |
- {
- "TestConfig": {
- "GPU_HEALTH_CHECK": {
- "TestLocationTrigger": {
- "global": {
- "TestParameters": {
- "MANUAL": {
- "TestCases": [
- {
- "Framework": "RVS",
- "Recipe": "gst_single",
- "Iterations": 1,
- "StopOnFailure": true,
- "TimeoutSeconds": 1200,
- "Arguments": "--parallel"
- }
- ]
- }
- }
- }
- }
- }
+ # RVS: ROCm Validation Suite. Recipe + argument reference
+ # https://instinct.docs.amd.com/projects/gpu-operator/en/latest/test/appendix-test-recipe.html
+ # AGFHC: AMD GPU Field Health Check. Recipe reference
+ # https://instinct.docs.amd.com/projects/gpu-operator/en/latest/test/agfhc.html
+ #
+ # PHASE 1 MULTI-STAGE MODEL:
+ # The ROCm test-runner CLI executes only the FIRST entry of a
+ # TestCases[] array even when N entries are provided -- so the
+ # previous single-Job-per-node design (one GPU_VALIDATION_TESTS_JSON
+ # with multiple recipes) silently dropped every recipe past the first.
+ # Phase 1 now runs each recipe as its OWN sub-stage: one test-runner
+ # Job per (stage, node), sequenced sequentially per-node with
+ # stop-on-first-failure semantics. Cross-node parallelism is preserved
+ # within each stage (submit-all-then-wait).
+ #
+ # Each stage carries its own `Image` because the RVS and AGFHC release
+ # cadences differ and may ship as separate test-runner images
+ # (e.g. rocm/test-runner-rvs:vX vs rocm/test-runner-agfhc:vY).
+ #
+ # Per-stage outcome is recorded as a node annotation:
+ # amd.com/gpu-hw-acceptance-stage-=passed|failed
+ # The aggregate label `amd.com/gpu-hw-acceptance=passed|failed`
+ # (= AND across all stages) remains the dependency contract Phase 2
+ # reads -- unchanged. On a failed node, the failing stage's recipe
+ # is additionally written to the `failed-subtest` annotation, and
+ # the failure-reason annotation captures stage-name:reason.
+ #
+ # Required per-stage fields:
+ # Name -- orchestrator identifier (used in CM/Job names
+ # and annotation suffixes); must be a valid
+ # k8s name fragment (lowercase alphanum + '-').
+ # Image -- test-runner container image for this stage.
+ # Framework -- RVS | AGFHC | . (passed verbatim to runner).
+ # Recipe -- recipe name within Framework.
+ # TimeoutSeconds -- per-stage per-node wallclock budget. The Job
+ # wait loop uses TimeoutSeconds + 120s of slack
+ # to absorb pod startup / image pull / result
+ # upload, so this number should match the actual
+ # in-container test budget.
+ # Optional per-stage fields:
+ # Iterations -- runner iterations (defaults to 1 if omitted).
+ # Arguments -- extra runner CLI args (defaults to "").
+ #
+ # Phase 1 writes the uniform per-phase label PHASE1_LABEL_KEY
+ # (= "amd.com/gpu-hw-acceptance") via the helper library (see
+ # PHASE_NODE_LABEL_SCRIPT above and PHASE1_SCRIPT below).
+ #
+ # Optional per-stage "Skip": true bypasses that stage without
+ # submitting a test-runner Job. Skipped stages annotate every
+ # still-alive node with -stage-=skipped and
+ # do NOT count toward stop-on-first-failure. If every stage is
+ # skipped, the aggregate label is set to
+ # "skipped" (not "passed"); Phase 2 currently gates on "passed" so
+ # an all-skipped Phase 1 will block downstream phases by design.
+ #
+ # patchable: per-stage Skip set from cluster-validation-framework.skip-tests.skip-phase1-stages.skip-phase1-
+ # patchable: per-stage TimeoutSeconds set from cluster-validation-framework.timeouts.phase1-stages-secs.phase1-
+ # patchable: per-stage Image set from cluster-validation-framework.images.test-runner. (rvs/agfhc)
+ GPU_VALIDATION_STAGES_JSON: |
+ [
+ {
+ "Name": "gpu-stress",
+ "Image": "docker.io/rocm/test-runner:v1.4.0",
+ "Framework": "RVS",
+ "Recipe": "gst_single",
+ "Iterations": 1,
+ "TimeoutSeconds": 1800,
+ "Arguments": "--parallel"
+ },
+ {
+ "Name": "xgmi-lvl1",
+ "Image": "docker.io/rocm/test-runner:v1.4.0",
+ "Framework": "AGFHC",
+ "Recipe": "xgmi_lvl1",
+ "Iterations": 1,
+ "TimeoutSeconds": 300,
+ "Arguments": ""
+ },
+ {
+ "Name": "pcie-lvl1",
+ "Image": "docker.io/rocm/test-runner:v1.4.0",
+ "Framework": "AGFHC",
+ "Recipe": "pcie_lvl1",
+ "Iterations": 1,
+ "TimeoutSeconds": 300,
+ "Arguments": ""
+ },
+ {
+ "Name": "hbm-lvl1",
+ "Image": "docker.io/rocm/test-runner:v1.4.0",
+ "Framework": "AGFHC",
+ "Recipe": "hbm_lvl1",
+ "Iterations": 1,
+ "TimeoutSeconds": 300,
+ "Arguments": ""
}
- }
+ ]
# === select-and-label-candidates.sh ===
@@ -124,6 +524,24 @@ data:
echo "[Node Selection] Using node selector: ${NODE_SELECTOR_LABEL}"
nodes=$(kubectl get nodes -l "${NODE_SELECTOR_LABEL}" -o name | sed 's|node/||')
+ # Distinguish "selector matches no nodes at all" from "matched nodes
+ # but none are eligible (busy / recently tested)". A custom selector
+ # (e.g. cvf-candidate=true) that nothing applies automatically -- only
+ # the NFD labels feature.node.kubernetes.io/amd-{gpu,nic} are set for
+ # you -- silently matches zero nodes, which otherwise surfaces only as
+ # a generic "Insufficient candidates" later. Warn explicitly and show
+ # which nodes DO carry each requested label so the operator can see the
+ # gap and apply the label (kubectl label node =).
+ if [ -z "$nodes" ]; then
+ echo "[Node Selection: WARNING] Selector '${NODE_SELECTOR_LABEL}' matched 0 nodes in the cluster."
+ echo "[Node Selection] Nothing applies custom labels automatically; only NFD sets"
+ echo "[Node Selection] feature.node.kubernetes.io/amd-gpu and .../amd-nic."
+ echo "[Node Selection] If you configured a custom node-selector-labels value, label the"
+ echo "[Node Selection] target nodes, e.g.: kubectl label node ${NODE_SELECTOR_LABEL%%=*}=${NODE_SELECTOR_LABEL#*=}"
+ echo "[Node Selection] Current node labels for reference:"
+ kubectl get nodes --show-labels 2>/dev/null | sed 's/^/[Node Selection] /' || true
+ fi
+
candidates=()
candidate_count=0
current_time_secs=$(date +%s)
@@ -181,6 +599,11 @@ data:
# --- Action Decision ---
if [ $candidate_count -lt ${WORKER_REPLICAS} ]; then
echo "[Node Selection: Skipped] Insufficient candidates (required $WORKER_REPLICAS, available $candidate_count)"
+ if [ -z "$nodes" ]; then
+ echo "[Node Selection: Skipped] Cause: selector '${NODE_SELECTOR_LABEL}' matched 0 nodes (see WARNING above) -- label your target nodes."
+ else
+ echo "[Node Selection: Skipped] Cause: selector matched $(echo "$nodes" | wc -w) node(s) but none were idle/eligible this tick (busy=${#busy_nodes[@]}, recently-tested=${#skipped_recent_nodes[@]})."
+ fi
echo "=================================================================="
exit 1
else
@@ -198,7 +621,7 @@ data:
fi
# === wait-for-worker-pods.sh ===
- WAIT_FOR_WORKERS_SCRIPT: |
+ PHASE45_PREFLIGHT_SCRIPT: |
#!/bin/bash
set -euo pipefail
@@ -223,6 +646,13 @@ data:
FIRST_POD=$(echo $WORKER_PODS | awk '{print $1}')
WORKER_IPS=$(kubectl get pods -n "$NAMESPACE" -l "$JOB_LABELS" -o jsonpath='{.items[*].status.podIP}')
+ # ----------------------------------------------------------------
+ # Phase 1: launcher->worker per-IP readiness wait (existing
+ # behavior). Each worker IP must accept an SSH connection from
+ # FIRST_POD within $SSH_CHECK_TIMEOUT, retried every
+ # $SSH_CHECK_INTERVAL seconds. This catches the
+ # workers-not-yet-up case before we begin the N*N matrix probe.
+ # ----------------------------------------------------------------
kubectl exec -n "$NAMESPACE" "$FIRST_POD" -- bash -c "
for ip in $WORKER_IPS; do
echo '__ Testing SSH to' \$ip '...'
@@ -241,7 +671,348 @@ data:
echo '__ SSH OK for' \$ip
done
"
- echo "__ SSH connectivity verified across all worker pods."
+ echo "__ launcher->worker SSH readiness verified."
+
+ # ----------------------------------------------------------------
+ # Phase 2 (, design §4): N*N SSH mesh probe.
+ # For every (src_pod, dst_ip) pair, exec into src_pod and SSH
+ # to dst_ip with a fast ConnectTimeout=5 single-shot probe.
+ # Record each failure in failed_pairs and CONTINUE -- we want
+ # a complete picture of the fabric, not a fail-fast trip on
+ # the first broken pair (design §4 calls this out explicitly).
+ #
+ # Note: the surrounding script runs under `set -euo pipefail`,
+ # so each probe is guarded with `|| failed_pairs+=(.)` to
+ # prevent a single SSH failure from aborting the script.
+ # ----------------------------------------------------------------
+ echo "--- Phase 4.5: N*N SSH mesh check ---"
+ echo " src pods: $WORKER_PODS"
+ echo " dst IPs: $WORKER_IPS"
+ # ssh_mesh_failed is initialised here (alongside failed_pairs)
+ # so the verdict block can read it under `set -u`
+ # even if the mesh loop records zero failures.
+ ssh_mesh_failed=false
+ failed_pairs=()
+ for src in $WORKER_PODS; do
+ for dst_ip in $WORKER_IPS; do
+ if kubectl exec -n "$NAMESPACE" "$src" -- \
+ ssh -o StrictHostKeyChecking=no \
+ -o UserKnownHostsFile=/dev/null \
+ -o ConnectTimeout=5 \
+ "$dst_ip" 'echo ok' >/dev/null 2>&1; then
+ echo "__ mesh OK: $src -> $dst_ip"
+ else
+ echo "__ mesh FAIL: $src -> $dst_ip"
+ failed_pairs+=("$src->$dst_ip")
+ fi
+ done
+ done
+
+ # Per (design §4): record the failure flag
+ # and CONTINUE to subsequent checks so the verdict block can
+ # aggregate all failure classes into a single annotation.
+ # Previously this block exit'd 1 immediately, which prevented
+ # DNS / MPI / RCCL probes from running on SSH-mesh-failed
+ # fabrics and produced thinner annotations.
+ if [ "${#failed_pairs[@]}" -gt 0 ]; then
+ ssh_mesh_failed=true
+ echo "WARN: N*N SSH mesh check failed for ${#failed_pairs[@]} pair(s) (ssh_mesh_failed=true):"
+ for pair in "${failed_pairs[@]}"; do
+ echo " $pair"
+ done
+ echo "WARN: continuing to subsequent checks (verdict gating below will aggregate failures)."
+ else
+ echo "__ N*N SSH mesh verified across all worker pods."
+ fi
+
+ # ----------------------------------------------------------------
+ # Phase 3 (, design §4): DNS forward+reverse
+ # check. From the FIRST worker pod, resolve every worker node
+ # hostname forward (getent hosts ) then reverse-resolve
+ # the returned address. A miss on either direction is recorded
+ # in dns_misses. this check MUST NOT
+ # short-circuit -- it iterates every hostname and continues to
+ # subsequent checks regardless of misses. Overall verdict
+ # gating (annotate-and-evict) is handled elsewhere.
+ #
+ # WORKER_NODES is derived from .spec.nodeName -- one entry per
+ # worker pod, possibly with duplicates if multiple workers
+ # land on the same node. We pass the deduplicated, whitespace-
+ # separated list into the in-pod loop. The surrounding script
+ # runs under `set -euo pipefail`, so each getent call is
+ # guarded with `|| echo MISS` to keep the loop alive on a
+ # failed resolution.
+ # ----------------------------------------------------------------
+ echo "--- Phase 4.5: DNS forward+reverse check ---"
+ WORKER_NODES=$(kubectl get pods -n "$NAMESPACE" -l "$JOB_LABELS" \
+ -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
+ | awk 'NF' | sort -u | tr '\n' ' ')
+ echo " nodes: $WORKER_NODES"
+ echo " via pod: $FIRST_POD"
+
+ dns_misses=()
+ while IFS= read -r line; do
+ [ -z "$line" ] && continue
+ dns_misses+=("$line")
+ done < <(
+ kubectl exec -n "$NAMESPACE" "$FIRST_POD" -- bash -c '
+ for h in '"$WORKER_NODES"'; do
+ fwd=$(getent hosts "$h" 2>/dev/null || echo MISS)
+ if [ "$fwd" = MISS ]; then
+ echo "DNS:$h:fwd=MISS rev=SKIP"
+ continue
+ fi
+ addr=$(echo "$fwd" | awk "{print \$1}")
+ rev=$(getent hosts "$addr" 2>/dev/null || echo MISS)
+ if [ "$rev" = MISS ]; then
+ echo "DNS:$h:fwd=$addr rev=MISS"
+ fi
+ done
+ ' || true
+ )
+
+ # dns_failed is the boolean view of dns_misses, exported for
+ # the verdict block so it can read a single flag
+ # under `set -u` regardless of which sub-check fired.
+ dns_failed=false
+ if [ "${#dns_misses[@]}" -gt 0 ]; then
+ dns_failed=true
+ echo "WARN: DNS check recorded ${#dns_misses[@]} miss(es) (dns_failed=true):"
+ for miss in "${dns_misses[@]}"; do
+ echo " $miss"
+ done
+ echo "WARN: continuing to subsequent checks (verdict gating below will aggregate failures)."
+ else
+ echo "__ DNS forward+reverse OK for all worker nodes."
+ fi
+
+ # ----------------------------------------------------------------
+ # Phase 4 (, design §4): mpirun --hostfile
+ # no-op spawn check. From the FIRST worker pod, write the worker
+ # pod IPs (one per line) to /tmp/hf and invoke
+ # `mpirun --hostfile /tmp/hf --np --allow-run-as-root true`.
+ # A non-zero exit from kubectl exec sets mpi_spawn_failed=true.
+ # This catches OMPI launcher misconfiguration (missing orted,
+ # broken PATH, btl/oob misconfig) before the heavier RCCL probe.
+ #
+ # this check MUST NOT short-circuit -- it
+ # records the failure flag, logs WARN, and continues. Overall
+ # verdict gating (annotate-and-evict) is owned by a separate
+ # change. The surrounding script runs under `set -euo pipefail`,
+ # so the kubectl exec is guarded with `|| mpi_spawn_failed=true`
+ # to keep the script alive on spawn failure. mpi_spawn_failed is
+ # initialised to false so later annotate logic can read it under
+ # `set -u`.
+ # ----------------------------------------------------------------
+ echo "--- Phase 4.5: mpirun --hostfile no-op spawn check ---"
+ echo " via pod: $FIRST_POD"
+ echo " np: $(echo $WORKER_IPS | wc -w) (one rank per worker IP)"
+
+ mpi_spawn_failed=false
+ # --prefix is REQUIRED for cross-node spawn: SSH non-interactive
+ # shells don't source .bashrc, so the OMPI bin/lib dirs are not
+ # on the remote PATH. Without --prefix, orted is "command not
+ # found" on the remote and mpirun aborts. OMPI_DIR comes from
+ # MPI_LAUNCHER_ENV_VARS; the fallback matches the image's MPI
+ # install recipe.
+ kubectl exec -n "$NAMESPACE" "$FIRST_POD" -- bash -c '
+ : "${OMPI_DIR:=/root/ompi/install}"
+ echo "'"$WORKER_IPS"'" | tr " " "\n" | awk "NF" > /tmp/hf
+ echo "hostfile contents:"
+ cat /tmp/hf
+ mpirun --prefix "$OMPI_DIR" \
+ --hostfile /tmp/hf \
+ --np $(wc -l < /tmp/hf) \
+ --allow-run-as-root \
+ true
+ ' || mpi_spawn_failed=true
+
+ if [ "$mpi_spawn_failed" = "true" ]; then
+ echo "WARN: mpirun --hostfile no-op spawn failed (mpi_spawn_failed=true)."
+ echo "WARN: continuing to subsequent checks (verdict gating handled elsewhere)."
+ else
+ echo "__ mpirun --hostfile no-op spawn OK across all worker IPs."
+ fi
+
+ # ----------------------------------------------------------------
+ # Phase 5 (, design §4 and §6): minimal RCCL
+ # topology probe. From the FIRST worker pod, run mpirun with one
+ # rank per GPU (np = num_worker_ips * 8) against
+ # $PERF_TEST_DIR/all_reduce_perf with NCCL_DEBUG=INFO and
+ # NCCL_DEBUG_SUBSYS=INIT. Filter the combined stdout/stderr for
+ # the first 50 "NCCL INFO comm|topology" lines so the launcher
+ # log captures RCCL's topology view without flooding the
+ # operator log with full debug spew.
+ #
+ # The whole kubectl exec is wrapped in `timeout 60` (design §6:
+ # first-run topology discovery can be slow while NCCL warms
+ # caches). Two failure flags are tracked:
+ # * rccl_topo_timeout : timeout 60 fired (exit code 124)
+ # -- soft-fail per design §6 (annotate
+ # and proceed; the real test is Phase 5)
+ # * rccl_topo_failed : any non-timeout non-zero exit
+ # -- still WARN-and-continue here;
+ # verdict gating (annotate-and-evict)
+ # is handled elsewhere.
+ #
+ # this check MUST NOT short-circuit. It
+ # records the failure flags, logs WARN, and continues. The
+ # surrounding script runs under `set -euo pipefail`, so the
+ # kubectl exec is guarded with an `if !` block to keep the
+ # script alive on probe failure. Both flags are initialised
+ # to false so later annotate logic can read them under
+ # `set -u`.
+ # ----------------------------------------------------------------
+ echo "--- Phase 4.5: RCCL topology probe ---"
+ echo " via pod: $FIRST_POD"
+ echo " np: $(($(echo $WORKER_IPS | wc -w) * 8)) (8 ranks per worker IP)"
+ echo " timeout: 60s (soft-fail on timeout per design §6)"
+
+ rccl_topo_failed=false
+ rccl_topo_timeout=false
+ rccl_exit=0
+ timeout 60 kubectl exec -n "$NAMESPACE" "$FIRST_POD" -- bash -c '
+ : "${OMPI_DIR:=/root/ompi/install}"
+ echo "'"$WORKER_IPS"'" | tr " " "\n" | awk "NF" > /tmp/hf
+ echo "hostfile contents:"
+ cat /tmp/hf
+ # --prefix: same SSH/PATH reason as the no-op spawn above.
+ mpirun --prefix "$OMPI_DIR" \
+ --hostfile /tmp/hf \
+ --np $(($(wc -l < /tmp/hf) * 8)) \
+ --allow-run-as-root \
+ -x NCCL_DEBUG=INFO \
+ -x NCCL_DEBUG_SUBSYS=INIT \
+ $PERF_TEST_DIR/all_reduce_perf -b 1K -e 1K -n 1 2>&1 \
+ | grep -E "NCCL INFO comm|topology" \
+ | head -50
+ ' || rccl_exit=$?
+
+ if [ "$rccl_exit" -eq 124 ]; then
+ rccl_topo_timeout=true
+ echo "WARN: RCCL topology probe timed out after 60s (rccl_topo_timeout=true)."
+ echo "WARN: treating as soft-fail per design §6 (first-run warming can be slow)."
+ echo "WARN: continuing to verdict gating below (timeout still annotates rccl-topology)."
+ elif [ "$rccl_exit" -ne 0 ]; then
+ rccl_topo_failed=true
+ echo "WARN: RCCL topology probe failed with exit $rccl_exit (rccl_topo_failed=true)."
+ echo "WARN: continuing to verdict gating below."
+ else
+ echo "__ RCCL topology probe OK across all worker IPs."
+ fi
+
+ # ----------------------------------------------------------------
+ # Phase 6 (, design §4 verdict block + §5,§6):
+ # Aggregate verdict + annotate-on-failure + abort MPIJob.
+ #
+ # Two classes of failure are distinguished:
+ #
+ # 1. HARD failures -- abort the MPIJob via non-zero exit.
+ # Annotated AND fail-closed.
+ # * ssh_mesh_failed -> class "ssh-mesh"
+ # * dns_failed -> class "dns"
+ # * mpi_spawn_failed -> class "mpi-spawn"
+ # * rccl_topo_failed -> class "rccl-topology" (non-timeout
+ # non-zero exit from the RCCL probe)
+ #
+ # 2. SOFT failures -- annotate ONLY, do not abort. Per design §6,
+ # first-run NCCL topology discovery can be slow while caches
+ # warm; a 60s timeout is treated as "annotate and proceed
+ # the real test is Phase 5". Without this carve-out, every
+ # cold-cache run would fail-loop Phase 4.5 and never reach
+ # Phase 5.
+ # * rccl_topo_timeout -> class "rccl-topology" (annotated
+ # so the operator can still triage
+ # warm-cache vs. fabric/NCCL issues,
+ # but does NOT contribute to the
+ # abort decision)
+ #
+ # The annotation set is the union of HARD + SOFT classes so the
+ # operator sees the full picture. The abort decision is based
+ # only on the HARD classes.
+ #
+ # Annotation key: `amd.com/phase4_5-failure-reason=`
+ # written with --overwrite to replace stale values from a
+ # previous failed run on the same node.
+ #
+ # Annotated nodes: deduplicated WORKER_NODES (derived earlier
+ # from .spec.nodeName).
+ #
+ # Per design §5: NO new label key is introduced. Phase 4.5 is a
+ # gate, not a tested capability of an individual node -- per-node
+ # labels stay frozen from Phase 4 so the audit trail is preserved.
+ # The annotation is the only durable artefact of a Phase 4.5
+ # failure.
+ #
+ # Per design §2: a non-zero exit here fails the init-container,
+ # which fails the launcher pod, which fails the MPIJob, which
+ # the CronJob orchestrator's Phase 5 wait already treats as a
+ # failure (existing failure path -- no new orchestrator code
+ # needed). On the next CronJob tick the annotated nodes are
+ # re-evaluated.
+ #
+ # The annotate call is guarded with a fallback echo so a single
+ # failing kubectl annotate (e.g. node renamed out from under us)
+ # does not abort the whole loop -- the script must still reach
+ # the final FATAL exit so the MPIJob aborts on HARD failure.
+ # ----------------------------------------------------------------
+ echo "--- Phase 4.5: verdict ---"
+ # Build the failed_classes list. Using `if; then; fi` rather
+ # than `[ . ] && .` because the surrounding script runs
+ # under `set -e` and a short-circuit `&&` whose left side
+ # evaluates false would propagate a non-zero status and abort
+ # the script before the verdict block could annotate.
+ #
+ # hard_failed -> drives the abort exit.
+ # annotation_classes -> drives the annotation reason string
+ # (union of HARD + SOFT classes).
+ hard_failed=false
+ annotation_classes=()
+ if [ "$ssh_mesh_failed" = "true" ]; then
+ annotation_classes+=("ssh-mesh"); hard_failed=true
+ fi
+ if [ "$dns_failed" = "true" ]; then
+ annotation_classes+=("dns"); hard_failed=true
+ fi
+ if [ "$mpi_spawn_failed" = "true" ]; then
+ annotation_classes+=("mpi-spawn"); hard_failed=true
+ fi
+ if [ "$rccl_topo_failed" = "true" ]; then
+ annotation_classes+=("rccl-topology"); hard_failed=true
+ elif [ "$rccl_topo_timeout" = "true" ]; then
+ # Soft-fail per design §6: annotate but DO NOT abort.
+ annotation_classes+=("rccl-topology")
+ fi
+
+ if [ "${#annotation_classes[@]}" -gt 0 ]; then
+ # Join classes with comma -- IFS-trick keeps the join self
+ # contained without spawning a subshell.
+ old_ifs="$IFS"
+ IFS=,
+ reason="${annotation_classes[*]}"
+ IFS="$old_ifs"
+
+ if [ "$hard_failed" = "true" ]; then
+ echo "FATAL: Phase 4.5 pre-flight failed: $reason"
+ else
+ echo "WARN: Phase 4.5 pre-flight soft-failed: $reason (annotating but not aborting per design §6)"
+ fi
+ echo " annotating participating nodes: $WORKER_NODES"
+ for node in $WORKER_NODES; do
+ kubectl annotate node "$node" \
+ "amd.com/phase4_5-failure-reason=$reason" \
+ --overwrite || \
+ echo "WARN: failed to annotate node $node with $reason (continuing)."
+ done
+
+ if [ "$hard_failed" = "true" ]; then
+ echo "FATAL: aborting MPIJob via non-zero init-container exit (design §2, §5)."
+ exit 1
+ fi
+ echo "__ Phase 4.5 proceeding past soft-fail (rccl-topology timeout) -- Phase 5 will exercise the real path."
+ else
+ echo "__ Phase 4.5 pre-flight passed: ssh-mesh, dns, mpi-spawn, rccl-topology all OK."
+ fi
fi
# === validate-single-test.sh ===
@@ -271,142 +1042,2991 @@ data:
exit 1
fi
- # === GPU Validation Test Script ===
- GPU_VALIDATION_TEST_SCRIPT: |
+ # === Phase 1 Script: per-stage per-node Test Runner orchestration ===
+ # Sourced by the orchestrator's run_phase1 via _run_phase_generic.
+ # On entry:
+ # PHASE_NODES = space-separated input node list (also $@)
+ # PHASE1_LABEL_KEY = "amd.com/gpu-hw-acceptance" (from this ConfigMap)
+ # SKIP_GPU_HW_ACCEPTANCE = "true" short-circuits to pass-label-all
+ # GPU_VALIDATION_STAGES_JSON, GPU_PER_WORKER from ConfigMap
+ # Helpers (already sourced by orchestrator):
+ # label_phase_passed, label_phase_failed, annotate_phase_value
+ # Job template available at /test-runner-configs/cluster-validation-test-runner-job-config.yaml
+ #
+ # Multi-stage contract ( refactor, supersedes/762/765 single-Job
+ # design):
+ # For each stage S in GPU_VALIDATION_STAGES_JSON (sequential, in array order):
+ # 0. If S.Skip == true: annotate every still-alive node with
+ # ${PHASE1_LABEL_KEY}-stage-${S.Name}=skipped, do NOT submit a
+ # Job, do NOT drop the node from the alive set. Skipped stages
+ # never count as the "first failure" for stop-on-failure.
+ # 1. For each node still alive (no prior stage failed on it):
+ # a. Build single-recipe payload from S.{Framework,Recipe,Iterations,
+ # TimeoutSeconds,Arguments} wrapped in TestParameters.MANUAL.TestCases[0].
+ # b. Create per-stage per-node ConfigMap
+ # cvf-phase1-${S.Name}-${node-hash}-${ts}
+ # with key GPU_VALIDATION_TESTS_JSON = single-recipe payload.
+ # c. sed-render Job template, substituting $$NODE, $$GPU_PER_WORKER,
+ # $$TEST_RUNNER_IMAGE = S.Image, $$PHASE1_CONFIG_MAP = above CM name,
+ # and the Job's metadata.name.
+ # d. kubectl apply.
+ # 2. Poll-wait every Job in this stage (per-stage timeout = S.TimeoutSeconds
+ # + 120s slack for pod startup / result upload).
+ # 3. For each Job: locate the test-runner output file and parse it.
+ # rocm/test-runner:v1.4.0 writes one artifact per iteration to
+ # /var/log/amd-test-runner/___result(.gz)
+ # (volume-mounted at /var/log/cluster-validation on the host).
+ # We glob host-path first, then fall back to `kubectl exec ls` +
+ # `kubectl cp`; .gz variants are gunzipped to a temp file before
+ # parsing.
+ # 4. Per node: write per-stage annotation
+ # ${PHASE1_LABEL_KEY}-stage-${S.Name}=passed|failed
+ # via annotate_phase_value. On failure, drop node from the alive set
+ # and remember (S.Name, failure_reason, failed_subtest).
+ # 5. Cleanup: kubectl delete the stage's Job + its per-stage ConfigMap
+ # (best-effort; ttlSecondsAfterFinished + the delete here both apply).
+ # After all stages:
+ # 6. For nodes still alive AND that had >=1 non-skip stage submit
+ # -> label_phase_passed.
+ # For nodes still alive AND zero non-skip stages submitted
+ # (every stage carried Skip=true)
+ # -> aggregate ${PHASE1_LABEL_KEY}=skipped (Phase 2's gating
+ # check on "passed" then keeps the node out of downstream
+ # phases by design).
+ # For nodes dropped earlier
+ # -> label_phase_failed with reason
+ # "stage-${first_failed_name}:${reason}" and annotate
+ # failed-subtest = recorded sub-test from the first failed
+ # stage.
+ #
+ # SKIP_GPU_HW_ACCEPTANCE=true early-exits before stage iteration: every
+ # input node is pass-labeled and no Jobs / CMs are created.
+ #
+ # Missing result.json for a stage -> that stage is treated as failed for the
+ # affected node with reason "test-runner-did-not-emit-results" and
+ # failed-subtest = unknown.
+ #
+ # All progress output (banners, per-stage progress) goes to stdout.
+ # Diagnostic-only lines go to stderr so they don't pollute callers.
+ PHASE1_SCRIPT: |
#!/bin/bash
- # Step 2: GPU Validation Test Script
- # Expected input variables (from parent scope):
- # - nodes
- # - SKIP_GPU_VALIDATION
- # - GPU_PER_WORKER
- # - TEST_RUNNER_IMAGE
- # - TEST_RUNNER_JOB_WAIT_TIME
- # - TEST_RUNNER_SUCCESS_LABEL
- # - TEST_RUNNER_FAILURE_LABEL
- # - CANDIDATE_LABEL
- # - FAILURE_LABEL
- # - WORKER_REPLICAS
- # - DEBUG_DELAY
- #
- # Expected variables to be defined in parent scope:
- # - job_names (will be populated)
- # - job_to_node (will be populated)
- # - passed_nodes (will be populated)
- # - failed_nodes (will be populated)
-
- if [[ "${SKIP_GPU_VALIDATION,,}" == "true" ]]; then
- echo "SKIP_GPU_VALIDATION is set to true. Skipping test runner jobs."
- passed_nodes="$nodes"
- failed_nodes=""
- else
- for node in $nodes; do
- ts=$(date +%Y%m%d-%H%M%S)
- prefix="cvf-test-runner"
- max_len=63
- job_name="${prefix}-${node}-${ts}"
- if [ ${#job_name} -gt $max_len ]; then
- # If hostname is too long, replace node with node-hash
- echo "Truncating long job name.. $job_name"
+ # Phase 1 -- GPU HW Acceptance per-node Test Runner orchestration.
+ # Sourced (not exec'd) by the orchestrator. Do NOT `set -e`; we want
+ # to label every input node even if individual kubectl calls fail.
+
+ # Timestamp prefix for stage-progress log lines. Long-running stages
+ # (AGFHC xgmi_lvl1 / pcie_lvl1 / hbm_lvl1 can each take minutes) make
+ # operator triage easier when every banner carries a wallclock time.
+ _p1_ts() { date '+%Y-%m-%d %H:%M:%S'; }
+
+ # ---- Input handling ---------------------------------------------------
+ # The orchestrator passes the node list as positional args AND exports
+ # PHASE_NODES. Prefer positional ($@) so a re-source with different
+ # args works as expected; fall back to PHASE_NODES if $# is zero.
+ local_phase1_nodes=""
+ if [[ "$#" -gt 0 ]]; then
+ local_phase1_nodes="$*"
+ elif [[ -n "${PHASE_NODES:-}" ]]; then
+ local_phase1_nodes="$PHASE_NODES"
+ fi
+
+ if [[ -z "$local_phase1_nodes" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] no input nodes -- nothing to do"
+ return 0
+ fi
+
+ echo "[$(_p1_ts)] [Phase 1] PHASE1_SCRIPT start: nodes=[$local_phase1_nodes]"
+
+ # ---- Early-exit: SKIP_GPU_HW_ACCEPTANCE -------------------------------
+ # Incremental-bringup short-circuit. Every input node receives the
+ # pass label, no Test Runner Jobs are created.
+ if [[ "${SKIP_GPU_HW_ACCEPTANCE,,}" == "true" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] SKIP_GPU_HW_ACCEPTANCE=true -- pass-labeling all input nodes"
+ for n in $local_phase1_nodes; do
+ label_phase_passed "$n" "$PHASE1_LABEL_KEY" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_passed failed for $n" >&2
+ done
+ echo "[$(_p1_ts)] [Phase 1] PHASE1_SCRIPT done (skip mode)"
+ return 0
+ fi
+
+ # ---- Validate required env vars ---------------------------------------
+ local missing_env=""
+ for v in PHASE1_LABEL_KEY GPU_PER_WORKER GPU_VALIDATION_STAGES_JSON; do
+ if [[ -z "${!v:-}" ]]; then
+ missing_env="$missing_env $v"
+ fi
+ done
+ if [[ -n "$missing_env" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] ERROR: required env var(s) unset:$missing_env" >&2
+ echo "[$(_p1_ts)] [Phase 1] failing every input node with reason=missing-env" >&2
+ for n in $local_phase1_nodes; do
+ label_phase_failed "$n" "${PHASE1_LABEL_KEY:-amd.com/gpu-hw-acceptance}" \
+ "phase1-missing-env:${missing_env# }" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+
+ if ! command -v jq >/dev/null 2>&1; then
+ echo "[$(_p1_ts)] [Phase 1] ERROR: jq not available; cannot parse GPU_VALIDATION_STAGES_JSON or test-runner results" >&2
+ for n in $local_phase1_nodes; do
+ label_phase_failed "$n" "$PHASE1_LABEL_KEY" "phase1-jq-missing" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+
+ # Parse + validate stages array. Each element must have at minimum
+ # {Name, Image, Framework, Recipe, TimeoutSeconds}. Empty array or
+ # missing required field -> fail-fast every input node.
+ local stages_count=0
+ stages_count=$(jq 'length' <<<"$GPU_VALIDATION_STAGES_JSON" 2>/dev/null || echo "0")
+ if [[ -z "$stages_count" || "$stages_count" -le 0 ]]; then
+ echo "[$(_p1_ts)] [Phase 1] ERROR: GPU_VALIDATION_STAGES_JSON is not a non-empty JSON array" >&2
+ for n in $local_phase1_nodes; do
+ label_phase_failed "$n" "$PHASE1_LABEL_KEY" "phase1-stages-empty-or-invalid" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+ local missing_field
+ missing_field=$(jq -r '
+ [ to_entries[]
+ | . as $e
+ | [ "Name","Image","Framework","Recipe","TimeoutSeconds" ]
+ | map(select(($e.value[.] // null) == null) | "stage[\($e.key)]." + .)
+ ] | add // [] | join(",")
+ ' <<<"$GPU_VALIDATION_STAGES_JSON" 2>/dev/null || echo "parse-error")
+ if [[ -n "$missing_field" && "$missing_field" != "parse-error" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] ERROR: GPU_VALIDATION_STAGES_JSON missing required fields: $missing_field" >&2
+ for n in $local_phase1_nodes; do
+ label_phase_failed "$n" "$PHASE1_LABEL_KEY" \
+ "phase1-stages-missing-fields:${missing_field}" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+ # Optional .Skip must be a bool when present. Bad type fails fast so
+ # an accidentally-quoted "true" doesn't silently run the stage.
+ local bad_skip
+ bad_skip=$(jq -r '
+ [ to_entries[]
+ | select(.value | has("Skip"))
+ | select((.value.Skip | type) != "boolean")
+ | "stage[\(.key)].Skip=\(.value.Skip | tostring)"
+ ] | join(",")
+ ' <<<"$GPU_VALIDATION_STAGES_JSON" 2>/dev/null || echo "parse-error")
+ if [[ -n "$bad_skip" && "$bad_skip" != "parse-error" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] ERROR: GPU_VALIDATION_STAGES_JSON Skip must be bool: $bad_skip" >&2
+ for n in $local_phase1_nodes; do
+ label_phase_failed "$n" "$PHASE1_LABEL_KEY" \
+ "phase1-stages-bad-skip-type:${bad_skip}" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+
+ local job_template=/test-runner-configs/cluster-validation-test-runner-job-config.yaml
+ if [[ ! -f "$job_template" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] ERROR: job template not found at $job_template" >&2
+ for n in $local_phase1_nodes; do
+ label_phase_failed "$n" "$PHASE1_LABEL_KEY" "job-template-missing" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+
+ # ---- Per-node aggregate state (across all stages) ---------------------
+ # alive[node] = "1" while no stage has failed on this node;
+ # unset/empty after a stage failure dropped it.
+ # node_ran[node] = "1" once at least one non-skipped stage has been
+ # submitted for this node. Used to distinguish the
+ # "all stages were Skip=true" case (aggregate
+ # label=skipped) from the normal pass path.
+ # node_fail_stage = first failing stage Name (empty for alive nodes).
+ # node_fail_reason = failure reason annotation value for first failure.
+ # node_fail_subtest= recorded subtest name for first failure.
+ declare -A alive=()
+ declare -A node_ran=()
+ declare -A node_fail_stage=()
+ declare -A node_fail_reason=()
+ declare -A node_fail_subtest=()
+ local n
+ for n in $local_phase1_nodes; do
+ alive[$n]="1"
+ done
+
+ local ts
+ ts=$(date +%Y%m%d-%H%M%S)
+
+ # ---- Helper: short, deterministic node hash for k8s names (max 63 chars).
+ _phase1_node_hash() {
+ echo -n "$1" | sha1sum | cut -c1-6
+ }
+
+ # Where test-runner pod logs are persisted. Defaults to the shared
+ # /var/log/cluster-validation hostPath; a user may reconfigure this
+ # path later. The orchestrator must be able to write here (see
+ # runAsUser:0 on the CronJob container in cluster-validation-job.yaml).
+ local results_root="/var/log/cluster-validation"
+ local stage_idx
+
+ # ---- Outer loop: one iteration per stage ------------------------------
+ for (( stage_idx=0; stage_idx=skipped; do NOT submit a Job, do NOT touch
+ # alive[] (skipped stages must not drop the node), do NOT touch
+ # node_ran[] (the all-skipped path must produce aggregate
+ # label=skipped, not passed).
+ if [[ "$stage_skip" == "true" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] ============================================="
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} idx=${stage_idx} SKIPPED (Skip=true)"
+ for n in $local_phase1_nodes; do
+ [[ "${alive[$n]:-}" == "1" ]] || continue
+ annotate_phase_value "$n" "$PHASE1_LABEL_KEY" \
+ "stage-${stage_name}" "skipped" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: annotate stage=$stage_name skipped failed for $n" >&2
+ done
+ continue
+ fi
+
+ # Per-stage Job-wait wallclock = per-stage TimeoutSeconds + 120s
+ # slack for pod scheduling / image pull / result upload.
+ local stage_wait=$(( stage_timeout + 120 ))
+
+ # Build the single-recipe runner-config payload (compact JSON).
+ # Shape matches what the runner expects at /etc/test-runner/config.json:
+ # TestConfig.GPU_HEALTH_CHECK.TestLocationTrigger.global
+ # .TestParameters.MANUAL.TestCases = [ ]
+ local recipe_payload
+ recipe_payload=$(jq -nc \
+ --arg fw "$stage_fw" \
+ --arg recipe "$stage_recipe" \
+ --argjson iters "$stage_iters" \
+ --argjson timeout "$stage_timeout" \
+ --arg args "$stage_args" \
+ '{TestConfig:{GPU_HEALTH_CHECK:{TestLocationTrigger:{global:{TestParameters:{MANUAL:{TestCases:[{Framework:$fw,Recipe:$recipe,Iterations:$iters,StopOnFailure:true,TimeoutSeconds:$timeout,Arguments:$args}]}}}}}}}')
+
+ # Snapshot alive nodes for this stage. Dropped nodes (failed earlier)
+ # do not get a Job here -- stop-on-first-failure short-circuits.
+ local alive_in_stage=""
+ for n in $local_phase1_nodes; do
+ [[ "${alive[$n]:-}" == "1" ]] && alive_in_stage="$alive_in_stage $n"
+ done
+ alive_in_stage="${alive_in_stage# }"
+
+ if [[ -z "$alive_in_stage" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} skipped (no alive nodes left)"
+ continue
+ fi
+
+ echo "[$(_p1_ts)] [Phase 1] ============================================="
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} idx=${stage_idx} fw=${stage_fw} recipe=${stage_recipe}"
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} image=${stage_image}"
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} alive_nodes=[${alive_in_stage}]"
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} per-job timeout=${stage_timeout}s (wait budget=${stage_wait}s)"
+
+ # Per-stage state (Job/CM names indexed by node).
+ declare -A stage_job_to_node=()
+ declare -A stage_node_to_job=()
+ declare -A stage_node_to_cm=()
+ declare -A stage_job_status=() # passed | failed | timeout | submit-failed
+ declare -A stage_job_subtest=()
+ declare -A stage_job_reason=()
+ local stage_job_names=""
+
+ # ---- Per-stage Step 1: create per-node ConfigMap + Job ------------
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} submitting Jobs (parallel)..."
+ for node in $alive_in_stage; do
+ local node_hash cm_name job_name
+ node_hash=$(_phase1_node_hash "$node")
+ cm_name="cvf-phase1-${stage_name}-${node_hash}-${ts}"
+ job_name="cvf-tr-${stage_name}-${node_hash}-${ts}"
+ stage_node_to_cm[$node]="$cm_name"
+ stage_node_to_job[$node]="$job_name"
+ stage_job_to_node[$job_name]="$node"
+ stage_job_names="$stage_job_names $job_name"
+ # Mark node as having had a real (non-skipped) stage attempt.
+ # Used by the final aggregate-labeling step to distinguish
+ # "all stages skipped" (label=skipped) from "all ran+passed"
+ # (label=passed).
+ node_ran[$node]="1"
+
+ # Idempotent CM create: dry-run | apply lets us re-run after a
+ # mid-stage crash without "already exists" errors.
+ if ! kubectl create configmap "$cm_name" \
+ --from-literal="GPU_VALIDATION_TESTS_JSON=${recipe_payload}" \
+ --dry-run=client -o yaml 2>/dev/null \
+ | kubectl apply -f - >/dev/null 2>&1; then
+ echo "[$(_p1_ts)] [Phase 1] ERROR: failed to create configmap=$cm_name node=$node" >&2
+ stage_job_status[$job_name]="submit-failed"
+ stage_job_reason[$job_name]="configmap-creation-failed"
+ stage_job_subtest[$job_name]="unknown"
+ continue
+ fi
+
+ # Render Job template. The four placeholders supplied by Phase 1:
+ # $$NODE -- target hostname (nodeSelector)
+ # $$GPU_PER_WORKER -- amd.com/gpu request/limit
+ # $$TEST_RUNNER_IMAGE -- per-stage image
+ # $$PHASE1_CONFIG_MAP -- per-stage per-node CM (config-volume)
+ # Plus the metadata.name rewrite so each Job gets a unique name.
+ if sed "s|\$\$NODE|${node}|g; \
+ s/^ name: cluster-validation-test-runner-job/ name: ${job_name}/; \
+ s|\$\$GPU_PER_WORKER|${GPU_PER_WORKER}|g; \
+ s|\$\$TEST_RUNNER_IMAGE|${stage_image}|g; \
+ s|\$\$PHASE1_CONFIG_MAP|${cm_name}|g" \
+ "$job_template" | kubectl apply -f - >/dev/null; then
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} submitted job=$job_name node=$node cm=$cm_name"
+ else
+ echo "[$(_p1_ts)] [Phase 1] ERROR: kubectl apply failed for job=$job_name node=$node" >&2
+ stage_job_status[$job_name]="submit-failed"
+ stage_job_reason[$job_name]="job-creation-failed"
+ stage_job_subtest[$job_name]="unknown"
+ fi
+ done
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} all Jobs submitted at $(date)"
+
+ # ---- Per-stage Step 2: Poll-wait every Job ------------------------
+ # 5s sleep interval preserves the proven Phase 1 polling cadence.
+ local job_name
+ for job_name in $stage_job_names; do
+ local node="${stage_job_to_node[$job_name]}"
+ if [[ "${stage_job_status[$job_name]:-}" == "submit-failed" ]]; then
+ continue
+ fi
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} waiting job=$job_name node=$node (timeout=${stage_wait}s)"
+ local start_time elapsed terminal="false"
+ start_time=$(date +%s)
+ while true; do
+ elapsed=$(( $(date +%s) - start_time ))
+ if [[ "$elapsed" -ge "$stage_wait" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name TIMEOUT after ${stage_wait}s"
+ stage_job_status[$job_name]="timeout"
+ stage_job_reason[$job_name]="timeout"
+ stage_job_subtest[$job_name]="unknown"
+ terminal="true"
+ break
+ fi
+ local complete_st failed_st
+ complete_st=$(kubectl get job "$job_name" \
+ -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' \
+ 2>/dev/null || echo "")
+ if [[ "$complete_st" == "True" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name Complete=True (node=$node)"
+ stage_job_status[$job_name]="passed"
+ terminal="true"
+ break
+ fi
+ failed_st=$(kubectl get job "$job_name" \
+ -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' \
+ 2>/dev/null || echo "")
+ if [[ "$failed_st" == "True" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name Failed=True (node=$node)"
+ stage_job_status[$job_name]="failed"
+ terminal="true"
+ break
+ fi
+ sleep 5
+ done
+ if [[ "$terminal" != "true" ]]; then
+ # Defensive belt-and-suspenders. Should be unreachable.
+ stage_job_status[$job_name]="timeout"
+ stage_job_reason[$job_name]="timeout"
+ stage_job_subtest[$job_name]="unknown"
+ fi
+ done
+
+ # ---- Per-stage Step 3: Finalize verdict from the Job condition ---
+ # The Kubernetes Job condition is the SOLE source of truth for
+ # pass/fail. It is cluster-scoped API state (set by the test-runner
+ # exit code via backoffLimit=0/restartPolicy=Never) and is therefore
+ # node-agnostic: it is read identically whether the orchestrator is
+ # co-located with the worker or not. This matches the original
+ # pre-refactor CVF design.
+ #
+ # We deliberately do NOT read the per-stage result.json here. That
+ # file is written ONLY to the worker node's local hostPath, so in a
+ # multi-node cluster the orchestrator (scheduled on an arbitrary
+ # node) either cannot see it OR -- worse -- matches a different
+ # node's same-named file at the shared-by-name volume root and
+ # mis-attributes one node's result to another. Relying on the Job
+ # condition avoids both failure modes. The verdict was already set
+ # in Step 2's poll-wait loop (passed | failed | timeout |
+ # submit-failed); Step 3 only captures pod logs for diagnostics.
+ for job_name in $stage_job_names; do
+ local status="${stage_job_status[$job_name]:-failed}"
+ if [[ "$status" == "submit-failed" || "$status" == "timeout" ]]; then
+ continue
+ fi
+
+ local pod_name=""
+ pod_name=$(kubectl get pods \
+ -l "job-name=${job_name}" \
+ -o jsonpath='{.items[-1:].metadata.name}' \
+ 2>/dev/null || echo "")
+
+ # Always persist the test-runner pod's stdout to results_root,
+ # pass or fail, for post-run triage. `kubectl logs` works on a
+ # Completed pod and is node-agnostic. This is diagnostics only
+ # -- it is never parsed for the verdict. Writing here requires
+ # the orchestrator to be able to write results_root (see
+ # runAsUser:0 on the CronJob container).
+ local pod_log_file=""
+ if [[ -n "$pod_name" ]]; then
+ local pod_log_dest="${results_root}/${job_name}_pod.log"
+ if kubectl logs "$pod_name" > "$pod_log_dest" 2>/dev/null \
+ && [[ -s "$pod_log_dest" ]]; then
+ pod_log_file="$pod_log_dest"
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name pod-log saved=${pod_log_file}"
+ else
+ [[ -f "$pod_log_dest" ]] && rm -f "$pod_log_dest"
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name WARN: could not save pod log to ${pod_log_dest}" >&2
+ fi
+ fi
+
+ if [[ "$status" == "passed" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name PASS (Job Complete=True)"
+ else
+ # Job Failed=True. The result file is not consulted, so the
+ # specific sub-test is unknown; the Job condition is enough
+ # to fail the node. Surface the saved pod-log tail for triage.
+ stage_job_status[$job_name]="failed"
+ stage_job_subtest[$job_name]="unknown"
+ stage_job_reason[$job_name]="job-failed"
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name FAILED (Job Failed=True)" >&2
+ if [[ -n "$pod_log_file" && -s "$pod_log_file" ]]; then
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name pod-log tail:" >&2
+ tail -n 20 "$pod_log_file" >&2
+ fi
+ fi
+ done
+
+ # ---- Per-stage Step 4: Annotate per-stage result + update alive ---
+ # Per-stage annotation key (sub-key passed to annotate_phase_value):
+ # ${PHASE1_LABEL_KEY}-stage-${stage_name} = passed | failed
+ # Aggregate label/annotation comes after the outer loop.
+ local stage_passed=0 stage_failed=0
+ for job_name in $stage_job_names; do
+ local node="${stage_job_to_node[$job_name]}"
+ local status="${stage_job_status[$job_name]:-failed}"
+ if [[ "$status" == "passed" ]]; then
+ annotate_phase_value "$node" "$PHASE1_LABEL_KEY" \
+ "stage-${stage_name}" "passed" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: annotate stage=$stage_name passed failed for $node" >&2
+ stage_passed=$((stage_passed + 1))
+ else
+ annotate_phase_value "$node" "$PHASE1_LABEL_KEY" \
+ "stage-${stage_name}" "failed" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: annotate stage=$stage_name failed failed for $node" >&2
+ # Record first-failure info on the node and drop from alive.
+ if [[ "${alive[$node]:-}" == "1" ]]; then
+ node_fail_stage[$node]="$stage_name"
+ node_fail_reason[$node]="${stage_job_reason[$job_name]:-failed}"
+ node_fail_subtest[$node]="${stage_job_subtest[$job_name]:-unknown}"
+ alive[$node]=""
+ fi
+ stage_failed=$((stage_failed + 1))
+ fi
+ done
+ echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} done: passed=${stage_passed} failed=${stage_failed}"
+
+ # ---- Per-stage Step 5: Cleanup ------------------------------------
+ # Delete Jobs explicitly (so timed-out ones don't linger past
+ # ttlSecondsAfterFinished); delete per-stage ConfigMaps (avoid
+ # leaking N-stage * M-node CMs per cron tick).
+ for job_name in $stage_job_names; do
+ kubectl delete job "$job_name" --ignore-not-found=true \
+ --wait=false >/dev/null 2>&1 \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: delete job=$job_name failed" >&2
+ done
+ for node in $alive_in_stage; do
+ local cm_name="${stage_node_to_cm[$node]:-}"
+ [[ -z "$cm_name" ]] && continue
+ kubectl delete configmap "$cm_name" --ignore-not-found=true \
+ --wait=false >/dev/null 2>&1 \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: delete configmap=$cm_name failed" >&2
+ done
+
+ # Reset per-stage maps before the next iteration.
+ unset stage_job_to_node stage_node_to_job stage_node_to_cm
+ unset stage_job_status stage_job_subtest stage_job_reason
+ done
+
+ # ---- Final aggregate labeling ----------------------------------------
+ # alive=1 + ran=1 -> every stage that was not Skip=true passed.
+ # alive=1 + ran=0 -> every stage was Skip=true (or no stages ran for
+ # any other reason) -> aggregate label=skipped so
+ # downstream phases see this node as not validated.
+ # alive=0 -> a non-skip stage failed -> label=failed +
+ # failed-subtest annotation.
+ local phase1_passed_count=0 phase1_failed_count=0 phase1_skipped_count=0
+ for n in $local_phase1_nodes; do
+ if [[ "${alive[$n]:-}" == "1" ]]; then
+ if [[ "${node_ran[$n]:-}" == "1" ]]; then
+ if label_phase_passed "$n" "$PHASE1_LABEL_KEY"; then
+ phase1_passed_count=$((phase1_passed_count + 1))
+ else
+ echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_passed failed for $n" >&2
+ fi
+ else
+ # All stages skipped -- write aggregate label=skipped
+ # directly (no helper -- this is the only call site).
+ if kubectl label node "$n" "${PHASE1_LABEL_KEY}=skipped" \
+ --overwrite >/dev/null; then
+ phase1_skipped_count=$((phase1_skipped_count + 1))
+ echo "[$(_p1_ts)] [Phase 1] node=${n} ${PHASE1_LABEL_KEY}=skipped (all stages Skip=true)"
+ else
+ echo "[$(_p1_ts)] [Phase 1] WARN: kubectl label ${n} ${PHASE1_LABEL_KEY}=skipped failed" >&2
+ fi
+ fi
+ else
+ local fstage="${node_fail_stage[$n]:-unknown}"
+ local freason="${node_fail_reason[$n]:-failed}"
+ local fsubtest="${node_fail_subtest[$n]:-unknown}"
+ if label_phase_failed "$n" "$PHASE1_LABEL_KEY" "stage-${fstage}:${freason}"; then
+ phase1_failed_count=$((phase1_failed_count + 1))
+ else
+ echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2
+ phase1_failed_count=$((phase1_failed_count + 1))
+ fi
+ annotate_phase_value "$n" "$PHASE1_LABEL_KEY" \
+ "failed-subtest" "$fsubtest" \
+ || echo "[$(_p1_ts)] [Phase 1] WARN: annotate_phase_value failed for $n" >&2
+ fi
+ done
+
+ echo "[$(_p1_ts)] [Phase 1] PHASE1_SCRIPT done: passed=${phase1_passed_count} failed=${phase1_failed_count} skipped=${phase1_skipped_count}"
+ return 0
+
+ # === Phase 2 Script: per-node single-node 8-GPU RCCL all_reduce ===
+ # Sourced by the orchestrator's run_phase2 via _run_phase_generic
+ # (current) or by the dedicated run_phase2 wiring delivered.
+ # Mirrors PHASE1_SCRIPT's submit/wait/parse/label shape.
+ #
+ # On entry:
+ # PHASE_NODES = space-separated input node list (also $@)
+ # PHASE2_LABEL_KEY = "amd.com/gpu-mesh-validation" (from this ConfigMap)
+ # SKIP_GPU_MESH_VALIDATION = "true" short-circuits to pass-label-all
+ # ROCE_WORKLOAD_IMAGE, PHASE2_JOB_WAIT_TIME, PHASE2_BW_THRESHOLD,
+ # PHASE2_START_MSG_SIZE, PHASE2_END_MSG_SIZE, PHASE2_STEP_FACTOR,
+ # PHASE2_ITER_COUNT, PHASE2_WARMUP_ITER_COUNT, PHASE2_RCCL_ENV_VARS
+ # from this ConfigMap
+ # Helpers (already sourced by orchestrator):
+ # label_phase_passed, label_phase_failed, annotate_phase_value
+ # Job template available at
+ # /phase2-configs/cluster-validation-phase2-job-config.yaml
+ # (mounted from cluster-validation-phase2-job-config
+ # ConfigMap,).
+ #
+ # Contract (design §4 -> "Code Path: PHASE2_SCRIPT"):
+ # 1. Per input node: sed-render Job template ($$NODE substitution,
+ # unique per-node Job name) and kubectl apply. Record job name.
+ # 2. Poll-wait all Jobs bounded by PHASE2_JOB_WAIT_TIME.
+ # 3. For each Job: classify result by inspecting Job condition AND
+ # container log:
+ # * Complete=True -> passed
+ # * Failed=True with mpirun crash -> failed (rccl-crash)
+ # * Failed=True with BW message -> failed (bus-bw-below-threshold)
+ # * Pending/Active past timeout -> failed (timeout)
+ # * Failed=True other -> failed (rccl-crash, default)
+ # Threshold checking happens INSIDE the Job container (see
+ # cluster-validation-phase2-job-config -- validate-single-test.sh
+ # runs there and returns non-zero on BW miss). PHASE2_SCRIPT
+ # only differentiates by log markers; it does NOT re-run the
+ # validator.
+ # 4. label_phase_passed / label_phase_failed via helpers.
+ # On failure: annotation
+ # ${PHASE2_LABEL_KEY}${PHASE_FAILURE_REASON_ANNOTATION_SUFFIX}=
+ # via label_phase_failed. Additionally, for measured-BW failures
+ # and successes, annotate the measured Avg bus bandwidth via
+ # annotate_phase_value "$node" "$PHASE2_LABEL_KEY" "measured-bw" "".
+ # 5. SKIP_GPU_MESH_VALIDATION=true early-exit: pass-label every input
+ # node, no Jobs created.
+ # 6. Cleanup hung (still-Active) Jobs at the end.
+ #
+ # Missing job template -> every input node labeled failed with
+ # reason "job-template-missing".
+ #
+ # All output (banners, per-step progress) goes to stdout. Diagnostic-only
+ # lines go to stderr so they don't accidentally pollute callers that may
+ # capture stdout in the future.
+ PHASE2_SCRIPT: |
+ #!/bin/bash
+ # Phase 2 -- Intra-node GPU collective (RCCL all_reduce_perf) per-node
+ # orchestration. Sourced (not exec'd) by the orchestrator. Do NOT
+ # `set -e`; we want to label every input node even if individual
+ # kubectl calls fail (mirror of PHASE1_SCRIPT, see design doc).
+
+ # ---- Input handling ---------------------------------------------------
+ # Mirror of PHASE1_SCRIPT: orchestrator passes the node list as
+ # positional args AND exports PHASE_NODES. Prefer positional ($@) so a
+ # re-source with different args works as expected; fall back to
+ # PHASE_NODES if $# is zero.
+ local_phase2_nodes=""
+ if [[ "$#" -gt 0 ]]; then
+ local_phase2_nodes="$*"
+ elif [[ -n "${PHASE_NODES:-}" ]]; then
+ local_phase2_nodes="$PHASE_NODES"
+ fi
+
+ if [[ -z "$local_phase2_nodes" ]]; then
+ echo "[Phase 2] no input nodes -- nothing to do"
+ return 0
+ fi
+
+ echo "[Phase 2] PHASE2_SCRIPT start: nodes=[$local_phase2_nodes]"
+
+ # ---- Early-exit: SKIP_GPU_MESH_VALIDATION -----------------------------
+ # Incremental-bringup short-circuit (design §4 contract step 5).
+ # Every input node receives the pass label, no Phase 2 Jobs are
+ # created. Matches the SKIP_GPU_HW_ACCEPTANCE pattern in PHASE1_SCRIPT.
+ if [[ "${SKIP_GPU_MESH_VALIDATION,,}" == "true" ]]; then
+ echo "[Phase 2] SKIP_GPU_MESH_VALIDATION=true -- pass-labeling all input nodes"
+ for n in $local_phase2_nodes; do
+ label_phase_passed "$n" "$PHASE2_LABEL_KEY" \
+ || echo "[Phase 2] WARN: label_phase_passed failed for $n" >&2
+ done
+ echo "[Phase 2] PHASE2_SCRIPT done (skip mode)"
+ return 0
+ fi
+
+ # ---- Validate required env vars ---------------------------------------
+ # If any are unset/empty, fail every input node fast with a clear
+ # reason. PHASE2_RCCL_ENV_VARS and the PHASE2_*_MSG_SIZE / iter knobs
+ # are consumed inside the Job container -- the driver only needs the
+ # ones used by submit + classify.
+ local missing_env=""
+ for v in PHASE2_LABEL_KEY ROCE_WORKLOAD_IMAGE PHASE2_JOB_WAIT_TIME PHASE2_BW_THRESHOLD; do
+ if [[ -z "${!v:-}" ]]; then
+ missing_env="$missing_env $v"
+ fi
+ done
+ if [[ -n "$missing_env" ]]; then
+ echo "[Phase 2] ERROR: required env var(s) unset:$missing_env" >&2
+ echo "[Phase 2] failing every input node with reason=missing-env" >&2
+ for n in $local_phase2_nodes; do
+ label_phase_failed "$n" "${PHASE2_LABEL_KEY:-amd.com/gpu-mesh-validation}" \
+ "phase2-missing-env:${missing_env# }" \
+ || echo "[Phase 2] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+
+ local job_template=/phase2-configs/cluster-validation-phase2-job-config.yaml
+ if [[ ! -f "$job_template" ]]; then
+ echo "[Phase 2] ERROR: job template not found at $job_template" >&2
+ for n in $local_phase2_nodes; do
+ label_phase_failed "$n" "$PHASE2_LABEL_KEY" "job-template-missing" \
+ || echo "[Phase 2] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+
+ # ---- Phase-local state -------------------------------------------------
+ # Indexed by job name. Declared local so the orchestrator shell stays
+ # clean when re-sourced.
+ declare -A phase2_job_to_node=()
+ declare -A phase2_job_status=() # "passed" | "failed" | "timeout" | "submit-failed"
+ declare -A phase2_job_reason=() # failure reason annotation value
+ declare -A phase2_job_bw=() # measured Avg bus bandwidth, when parseable
+ local phase2_job_names=""
+
+ # ---- Step 1: Parallel-submit Phase 2 Job per node ---------------------
+ # Mirror of PHASE1_SCRIPT Step 1. kubectl apply is non-blocking from
+ # the Job's perspective, so submitting in a tight loop is effectively
+ # parallel.
+ echo "[Phase 2] submitting Phase 2 Jobs (parallel)..."
+ local ts max_job_name_len=63 prefix="cvf-phase2"
+ ts=$(date +%Y%m%d-%H%M%S)
+ for node in $local_phase2_nodes; do
+ local job_name="${prefix}-${node}-${ts}"
+ if [[ "${#job_name}" -gt "$max_job_name_len" ]]; then
+ local node_hash
node_hash=$(echo -n "$node" | sha1sum | cut -c1-6)
job_name="${prefix}-${node_hash}-${ts}"
+ echo "[Phase 2] node=$node hostname too long, using hash -> job=$job_name"
fi
- job_names="$job_names $job_name"
- job_to_node[$job_name]=$node
- echo "Submitting test runner job for node: $node (job: $job_name)"
- sed "s|\$\$NODE|${node}|g; \
- s/^ name: cluster-validation-test-runner-job/ name: ${job_name}/; \
- s|\$\$GPU_PER_WORKER|${GPU_PER_WORKER}|g; \
- s|\$\$TEST_RUNNER_IMAGE|${TEST_RUNNER_IMAGE}|g" \
- /test-runner-configs/cluster-validation-test-runner-job-config.yaml | kubectl apply -f -
- sleep 1
- done
- echo "[Test Runner Jobs: Submitted for all candidate nodes]"
- echo -e "\n$(date): Waiting for test runner jobs to complete..."
-
- for job_name in $job_names; do
- node=${job_to_node[$job_name]}
- echo "Waiting for job $job_name (node: $node)..."
-
+ phase2_job_to_node[$job_name]="$node"
+ phase2_job_names="$phase2_job_names $job_name"
+
+ # Render template -- $$NODE is the only placeholder defined by the
+ # Phase 2 template (see design doc). Also rewrite the template's
+ # fixed metadata.name to our unique per-node job_name so multiple
+ # nodes can run concurrently without name collision.
+ if sed "s|\$\$NODE|${node}|g; \
+ s/^ name: cluster-validation-phase2-job/ name: ${job_name}/; \
+ s|\$\$ROCE_WORKLOAD_IMAGE|${ROCE_WORKLOAD_IMAGE}|g" \
+ "$job_template" | kubectl apply -f - >/dev/null; then
+ echo "[Phase 2] submitted job=$job_name node=$node"
+ else
+ echo "[Phase 2] ERROR: kubectl apply failed for job=$job_name node=$node" >&2
+ phase2_job_status[$job_name]="submit-failed"
+ phase2_job_reason[$job_name]="job-creation-failed"
+ fi
+ done
+ echo "[Phase 2] all Jobs submitted at $(date)"
+
+ # ---- Step 2: Poll-wait every Job --------------------------------------
+ # Per-job wallclock budget = PHASE2_JOB_WAIT_TIME. Sequential wait
+ # loop, but the Jobs themselves run in parallel on different nodes.
+ # Matches the PHASE1_SCRIPT cadence (5s poll interval).
+ local timeout="${PHASE2_JOB_WAIT_TIME}"
+ for job_name in $phase2_job_names; do
+ local node="${phase2_job_to_node[$job_name]}"
+ if [[ "${phase2_job_status[$job_name]:-}" == "submit-failed" ]]; then
+ continue
+ fi
+ echo "[Phase 2] waiting for job=$job_name node=$node (timeout=${timeout}s)"
+ local start_time elapsed
start_time=$(date +%s)
- timeout=${TEST_RUNNER_JOB_WAIT_TIME}
- job_succeeded=false
-
+ local terminal="false"
while true; do
- elapsed=$(($(date +%s) - start_time))
- if [ $elapsed -ge $timeout ]; then
- echo "Job $job_name timed out after ${timeout}s ❌"
- break
- fi
-
- status=$(kubectl get job "$job_name" -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null || echo "")
- if [[ "$status" == "True" ]]; then
- echo "Job $job_name completed successfully ✅ (node: $node)"
- job_succeeded=true
- break
- fi
-
- failed_status=$(kubectl get job "$job_name" -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || echo "")
- if [[ "$failed_status" == "True" ]]; then
- echo "Job $job_name failed ❌ (node: $node)"
- break
- fi
- sleep 5
+ elapsed=$(( $(date +%s) - start_time ))
+ if [[ "$elapsed" -ge "$timeout" ]]; then
+ echo "[Phase 2] job=$job_name TIMEOUT after ${timeout}s"
+ phase2_job_status[$job_name]="timeout"
+ phase2_job_reason[$job_name]="timeout"
+ terminal="true"
+ break
+ fi
+ local complete_st failed_st
+ complete_st=$(kubectl get job "$job_name" \
+ -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' \
+ 2>/dev/null || echo "")
+ if [[ "$complete_st" == "True" ]]; then
+ echo "[Phase 2] job=$job_name Complete=True (node=$node)"
+ phase2_job_status[$job_name]="passed"
+ terminal="true"
+ break
+ fi
+ failed_st=$(kubectl get job "$job_name" \
+ -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' \
+ 2>/dev/null || echo "")
+ if [[ "$failed_st" == "True" ]]; then
+ echo "[Phase 2] job=$job_name Failed=True (node=$node)"
+ phase2_job_status[$job_name]="failed"
+ terminal="true"
+ break
+ fi
+ sleep 5
done
-
- if [ "$job_succeeded" = true ]; then
- passed_nodes="$passed_nodes $node"
+ if [[ "$terminal" != "true" ]]; then
+ # Defensive: should never fall through, but keep state sane.
+ phase2_job_status[$job_name]="timeout"
+ phase2_job_reason[$job_name]="timeout"
+ fi
+ done
+
+ # ---- Step 3: Classify Job result via container log --------------------
+ # The Phase 2 Job container (cluster-validation-phase2-job-config,
+ #) runs mpirun then validate-single-test.sh internally.
+ # We retrieve the container log via `kubectl logs` and look for
+ # signature lines to differentiate the failure reasons enumerated by
+ # the design doc:
+ # * "phase2 mpirun exited " -> rccl-crash
+ # * "phase2 bandwidth below threshold" -> bus-bw-below-threshold
+ # * "phase2 all_reduce_perf PASSED" -> passed (sanity check)
+ # The measured "Avg bus bandwidth: " line is also captured for
+ # the annotation when present (both pass and bw-fail paths emit it
+ # via the all_reduce_perf output).
+ for job_name in $phase2_job_names; do
+ local status="${phase2_job_status[$job_name]:-failed}"
+ # submit-failed / timeout already have status+reason set.
+ if [[ "$status" == "submit-failed" || "$status" == "timeout" ]]; then
+ continue
+ fi
+
+ # Pod naming: kubectl creates a pod per Job named "${job_name}-".
+ # backoffLimit: 0 -> at most one pod.
+ local pod_name=""
+ pod_name=$(kubectl get pods \
+ -l "job-name=${job_name}" \
+ -o jsonpath='{.items[-1:].metadata.name}' \
+ 2>/dev/null || echo "")
+
+ local log_text=""
+ if [[ -n "$pod_name" ]]; then
+ # --tail=200 -- we only need to look at the trailing section
+ # for the classification markers and the Avg-bw line. Keep
+ # the captured volume small to stay well under the
+ # PHASE_ANNOTATION_VALUE_MAX_BYTES cap if we ever embed a tail.
+ log_text=$(kubectl logs "$pod_name" --tail=200 2>/dev/null || echo "")
+ fi
+
+ # Always try to capture the measured bandwidth, regardless of
+ # pass/fail -- the validator script prints it on both paths.
+ local measured_bw=""
+ measured_bw=$(echo "$log_text" \
+ | awk -F': ' '/Avg bus bandwidth/ {print $2}' \
+ | tail -n 1 | tr -d '[:space:]')
+ if [[ -n "$measured_bw" ]]; then
+ phase2_job_bw[$job_name]="$measured_bw"
+ fi
+
+ if [[ "$status" == "passed" ]]; then
+ echo "[Phase 2] job=$job_name PASS (measured-bw=${measured_bw:-})"
+ continue
+ fi
+
+ # status == "failed": classify by log marker. Order matters
+ # the BW-below-threshold path is reached after a successful
+ # mpirun, so the crash marker is more specific.
+ if echo "$log_text" | grep -q "phase2 mpirun exited"; then
+ phase2_job_reason[$job_name]="rccl-crash"
+ echo "[Phase 2] job=$job_name FAIL reason=rccl-crash (measured-bw=${measured_bw:-})"
+ elif echo "$log_text" | grep -q "phase2 bandwidth below threshold"; then
+ phase2_job_reason[$job_name]="bus-bw-below-threshold"
+ echo "[Phase 2] job=$job_name FAIL reason=bus-bw-below-threshold (measured-bw=${measured_bw:-}, threshold=${PHASE2_BW_THRESHOLD})"
+ else
+ # Job=Failed but no recognized marker -- default to rccl-crash
+ # (any non-zero exit from the container is treated as a crash
+ # signal per design §6 unless the validator explicitly flagged
+ # bw-below-threshold).
+ phase2_job_reason[$job_name]="rccl-crash"
+ echo "[Phase 2] job=$job_name FAIL reason=rccl-crash (default, no marker found)"
+ fi
+ done
+
+ # ---- Step 4: Label every input node via helpers -------------
+ # Pass nodes get measured-bw annotation when we parsed one. Failed
+ # nodes get failed-reason via label_phase_failed AND, when available,
+ # the measured-bw annotation (useful for bus-bw-below-threshold
+ # forensics -- design §6 "Bandwidth below threshold . annotation
+ # records measured BW").
+ local phase2_passed_count=0 phase2_failed_count=0
+ for job_name in $phase2_job_names; do
+ local node="${phase2_job_to_node[$job_name]}"
+ local status="${phase2_job_status[$job_name]:-failed}"
+ local bw="${phase2_job_bw[$job_name]:-}"
+ if [[ "$status" == "passed" ]]; then
+ if label_phase_passed "$node" "$PHASE2_LABEL_KEY"; then
+ phase2_passed_count=$((phase2_passed_count + 1))
+ else
+ echo "[Phase 2] WARN: label_phase_passed failed for $node" >&2
+ fi
+ if [[ -n "$bw" ]]; then
+ annotate_phase_value "$node" "$PHASE2_LABEL_KEY" \
+ "measured-bw" "$bw" \
+ || echo "[Phase 2] WARN: annotate_phase_value (measured-bw) failed for $node" >&2
+ fi
+ else
+ local reason="${phase2_job_reason[$job_name]:-failed}"
+ if label_phase_failed "$node" "$PHASE2_LABEL_KEY" "$reason"; then
+ phase2_failed_count=$((phase2_failed_count + 1))
+ else
+ echo "[Phase 2] WARN: label_phase_failed failed for $node" >&2
+ phase2_failed_count=$((phase2_failed_count + 1))
+ fi
+ if [[ -n "$bw" ]]; then
+ annotate_phase_value "$node" "$PHASE2_LABEL_KEY" \
+ "measured-bw" "$bw" \
+ || echo "[Phase 2] WARN: annotate_phase_value (measured-bw) failed for $node" >&2
+ fi
+ fi
+ done
+
+ # ---- Step 5: Cleanup hung Jobs ----------------------------------------
+ # Anything that timed out (still Active, neither Complete nor Failed)
+ # gets explicitly deleted so it does not linger past the next CronJob
+ # tick. ttlSecondsAfterFinished handles the completed ones.
+ for job_name in $phase2_job_names; do
+ local status="${phase2_job_status[$job_name]:-failed}"
+ if [[ "$status" == "timeout" ]]; then
+ echo "[Phase 2] deleting hung job=$job_name"
+ kubectl delete job "$job_name" --ignore-not-found=true \
+ --wait=false >/dev/null 2>&1 \
+ || echo "[Phase 2] WARN: delete job=$job_name failed" >&2
+ fi
+ done
+
+ echo "[Phase 2] PHASE2_SCRIPT done: passed=${phase2_passed_count} failed=${phase2_failed_count}"
+ return 0
+
+ # === Phase 3 (Per-Node NIC Health Check) env vars ===
+ # Consumed by the per-node Phase 3 Job (cluster-validation-job.yaml,
+ # `cluster-validation-phase3-job-config` ConfigMap) and by
+ # `PHASE3_CHECK_SCRIPT` above. See design doc design doc §4 ->
+ # "Code Path: Phase 3 ConfigMap variables".
+ #
+ # PHASE3_CHECK_SCRIPT also defines `${VAR:-default}` fallbacks for these
+ # keys so the script remains independently runnable for debug; the values
+ # below are the canonical fleet defaults projected via `envFrom` on the
+ # Phase 3 Job container.
+ #
+ # Tune these via ConfigMap edit -- no code change needed.
+
+ # Expected number of AMD/Pensando NIC physical functions per node,
+ # matched against the count of PCI entries with a vendor:device pair
+ # listed in $PHASE3_AMD_NIC_PCI_IDS. Default 8 matches the Pensando
+ # MI300 fleet baseline (validated on smc300x-ccs-aus-gpuf268). Override
+ # for SKUs with a different NIC population.
+ PHASE3_EXPECTED_NIC_COUNT: "8"
+ # PCI vendor:device allowlist for AMD/Pensando NIC Physical Functions,
+ # as a comma-separated list. `1dd8:1002` is the "DSC Ethernet Controller"
+ # PF on the AMD Pensando DSC fleet (DSC-25 / DSC-200 / Salina) -- it is
+ # the device ID the kernel `ionic` driver binds to and is the canonical
+ # identifier for a NIC PF (verified on smc300x-ccs-aus-gpuf268,).
+ # To include a future PF revision or a second SKU, append additional
+ # vendor:device pairs (e.g. "1dd8:1002,1dd8:1003"). The matching uses
+ # `lspci -nn` output's trailing `[vendor:device]` tag, which is unique
+ # per device ID and immune to description-string variation -- this
+ # naturally excludes SR-IOV VFs (which carry a different device ID),
+ # PCI bridges, and Processing-accelerator sub-functions that share the
+ # vendor ID but are not NIC PFs.
+ PHASE3_AMD_NIC_PCI_IDS: "1dd8:1002"
+ # Minimum number of GID table entries required per RDMA device. Each
+ # configured RoCEv2 GID slot prints a `GID[` line in
+ # `ibv_devinfo -d $dev -v`. Default 1 catches "GID table empty"
+ # (driver misconfig, RoCE not initialized); raise to require multiple
+ # RoCE versions.
+ PHASE3_MIN_GID_COUNT: "1"
+ # Per-node Phase 3 Job wallclock budget (seconds). The four structural
+ # checks (lspci / ip link / rdma link / ibv_devinfo) complete in well
+ # under a second on a healthy node; 120s leaves headroom for image
+ # pull, pod scheduling, and NIC-resource allocation (design §6 ->
+ # "Pod cannot mount NIC devices . orchestrator times it out at
+ # PHASE3_JOB_WAIT_TIME"). PHASE3_SCRIPT treats a Job that
+ # exceeds this budget as `nic-not-allocated` failure.
+ # patchable: cluster-validation-framework.timeouts.phase3-job-wait-secs
+ PHASE3_JOB_WAIT_TIME: "120"
+
+ # ---- Phase 3 Check 5 (firmware <-> workload-image alignment) inputs ---
+ # The in-pod check compares each NIC's running firmware version (as
+ # reported by `ethtool -i ` -- which queries the running ionic
+ # driver, which in turn cached the fw version from the NIC via dev cmd
+ # at probe time) against the workload image reference passed in via
+ # the ROCE_WORKLOAD_IMAGE env var. Match policy is a simple substring
+ # test: the observed firmware-version string must appear somewhere in
+ # the image reference. This works because the roce-workload image tag
+ # encodes the ainic firmware version it was built against (e.g.
+ # `...ainic-1.117.5-a-56`), so the tag IS the contract for which
+ # firmware is qualified with this RoCE workload.
+ #
+ # Replaces the prior nicctl + curated compat-map approach: the
+ # roce-workload image does NOT bundle nicctl, while it DOES ship
+ # ethtool; and the image tag itself declares its expected firmware,
+ # removing the need for a separately maintained allowlist.
+
+ # Master switch for Check 5. Default "true". When "false", Check 5 is
+ # skipped entirely and the PHASE3_RESULT marker is byte-identical to
+ # the pre-Check-5 contract.
+ PHASE3_DRIVER_FW_CHECK_ENABLED: "true"
+ # Strict-mode toggle. "true" (default): any per-NIC fw <-> image-tag
+ # mismatch fails the node with reason
+ # `fw-image-mismatch:=/image=`. "false":
+ # mismatches surface only via the `observed-fw` node annotation and
+ # Phase 3 still passes.
+ PHASE3_DRIVER_FW_STRICT: "true"
+
+ # === Phase 3 Check Script: in-Job 4-check NIC health + self-labelling ===
+ # Embedded into the per-node Phase 3 Job (cluster-validation-phase3-job-config
+ # ConfigMap,). The Job container writes this string to /tmp/check.sh
+ # and execs it. Runs inside a privileged pod on the rocm/network-operator-utils
+ # image so it has lspci / ip / rdma / ibv_* / kubectl available.
+ #
+ # On entry (env from cluster-validation-config envFrom + downward-API):
+ # NODE_NAME = spec.nodeName (downward API on the Job pod)
+ # PHASE3_LABEL_KEY = "amd.com/nic-health" (, this ConfigMap)
+ # PHASE3_AMD_NIC_PCI_IDS = "1dd8:1002" Pensando DSC PF default; comma-
+ # separated list of vendor:device pairs
+ # PHASE3_EXPECTED_NIC_COUNT = "8" default
+ # PHASE3_MIN_GID_COUNT = "1" default
+ # PHASE_FAILURE_REASON_ANNOTATION_SUFFIX = "-failure-reason" (this ConfigMap)
+ # ServiceAccount cluster-validation-sa has node label/annotate perms.
+ #
+ # Contract (design §4 -> "Code Path: PHASE3_CHECK_SCRIPT"):
+ # 1. Check 1 -- NIC count via `lspci -nn` filtered to the device IDs
+ # listed in $PHASE3_AMD_NIC_PCI_IDS must equal $PHASE3_EXPECTED_NIC_COUNT.
+ # 2. Check 2 -- `ip link` state == UP for every amd-nic interface
+ # (interface names start with enP* or ens*).
+ # 3. Check 3 -- `rdma link show` reports state ACTIVE per device.
+ # 4. Check 4 -- `ibv_devinfo` responds AND each device's GID table
+ # has >= $PHASE3_MIN_GID_COUNT entries.
+ # 5. On any failure: self-label node with ${PHASE3_LABEL_KEY}=failed,
+ # write annotations ${PHASE3_LABEL_KEY}${SUFFIX}= and
+ # ${PHASE3_LABEL_KEY}-failed-nics=, exit non-zero.
+ # 6. On all pass: self-label ${PHASE3_LABEL_KEY}=passed, exit 0.
+ #
+ # Annotation values are truncated to PHASE3_ANNOTATION_MAX_BYTES (=250 by
+ # default) before write so a worst-case all-NIC failure cannot push the
+ # node object over the K8s annotation size cap (design §6, test case #10).
+ #
+ # NOTE: this script intentionally does NOT use `set -e`. Every check must
+ # run to completion so the failure annotation can enumerate ALL failed NICs
+ # in a single run -- a `set -e` early-exit would mask later checks. The
+ # only non-zero exit happens at the end, after labelling.
+ PHASE3_CHECK_SCRIPT: |
+ #!/bin/bash
+ # Phase 3 -- per-node NIC health check + self-labelling. Runs IN the
+ # per-node Phase 3 Job pod. Do NOT `set -e`; we
+ # want every check to run so the failure annotation captures the
+ # complete list of failed NICs in one pass (design §4).
+ set +e
+
+ # ---- Defensive defaults ------------------------------------------------
+ # PHASE3_LABEL_KEY is and projected via envFrom on
+ # cluster-validation-config, but the Job pod could in principle be
+ # launched against an older ConfigMap; fall back to the canonical key.
+ PHASE3_LABEL_KEY="${PHASE3_LABEL_KEY:-amd.com/nic-health}"
+ PHASE_FAILURE_REASON_ANNOTATION_SUFFIX="${PHASE_FAILURE_REASON_ANNOTATION_SUFFIX:--failure-reason}"
+
+ # Per-check inputs default to the Pensando MI300-fleet baseline. These
+ # are normally overridden ConfigMap keys; defaults exist
+ # so the script is independently runnable for debug.
+ PHASE3_AMD_NIC_PCI_IDS="${PHASE3_AMD_NIC_PCI_IDS:-1dd8:1002}"
+ PHASE3_EXPECTED_NIC_COUNT="${PHASE3_EXPECTED_NIC_COUNT:-8}"
+ PHASE3_MIN_GID_COUNT="${PHASE3_MIN_GID_COUNT:-1}"
+
+ # Annotation value truncation cap. K8s caps the per-object annotation
+ # set at 256 KiB total; 250 bytes per value leaves abundant headroom
+ # and matches the design doc + test plan case #10 expectation.
+ PHASE3_ANNOTATION_MAX_BYTES="${PHASE3_ANNOTATION_MAX_BYTES:-250}"
+
+ # Sysfs roots. Production hard-defaults to the in-pod paths; tests
+ # override these to point at fixture directories so the script can be
+ # exercised on a host without real ionic devices. Do NOT change the
+ # defaults -- the script reads from the container's auto-mounted
+ # /sys (kernel filesystem is netns-agnostic for sysfs class paths).
+ PHASE3_IB_SYSFS_DIR="${PHASE3_IB_SYSFS_DIR:-/sys/class/infiniband}"
+ PHASE3_NET_SYSFS_DIR="${PHASE3_NET_SYSFS_DIR:-/sys/class/net}"
+
+ # NODE_NAME is informational only -- the orchestrator owns the label.
+ # The script no longer self-labels via in-pod kubectl; it emits a
+ # PHASE3_RESULT marker on stdout that PHASE3_SCRIPT parses.
+ echo "[Phase 3] check start: node=${NODE_NAME:-} expected_nics=${PHASE3_EXPECTED_NIC_COUNT} pci_ids=${PHASE3_AMD_NIC_PCI_IDS}"
+
+ # ---- Pre-flight: roce-workload userspace ABI sanity ------------------
+ # Catches the case where ROCE_WORKLOAD_IMAGE was bumped to a tag
+ # whose libionic ABI does not match the host ionic_rdma module
+ # (canonical example: rocm/roce-workload:...a-63 shipped libionic
+ # 54.0-149 ABI v2 while the host driver speaks ABI v1 -- see memory
+ # gpuop-689-phase4-ionic-abi.md). Without this gate, a bad image
+ # would cascade into four confusing check-1..4 failures on every
+ # NIC; with it, we emit one unambiguous preflight reason and skip
+ # the rest of the checks entirely.
+ #
+ # Previously also checked `nicctl --help`; dropped now that Check 5
+ # uses ethtool (always present in the roce-workload image) and the
+ # nicctl-based driver/fw compat path is gone.
+ preflight_fail_reasons=()
+ if ! ibv_msg=$(ibv_devinfo 2>&1 >/dev/null); then
+ preflight_fail_reasons+=("preflight-failed:ibv_devinfo=${ibv_msg:0:80}")
+ fi
+ if [[ "${#preflight_fail_reasons[@]}" -gt 0 ]]; then
+ full_reason=$(IFS=,; echo "${preflight_fail_reasons[*]}")
+ reason="${full_reason:0:${PHASE3_ANNOTATION_MAX_BYTES}}"
+ echo "[Phase 3] PRE-FLIGHT FAIL: ${full_reason}"
+ echo "PHASE3_RESULT status=failed reason=${reason} failed_nics="
+ exit 1
+ fi
+
+ fail_reasons=()
+ fail_nics=()
+
+ # ---- Check 1: NIC enumeration via PCI vendor:device allowlist ---------
+ # Count only PFs whose PCI vendor:device tag matches one of the entries
+ # in $PHASE3_AMD_NIC_PCI_IDS (comma-separated list). `lspci -Dnn`
+ # appends the literal `[vvvv:dddd]` tag to every line, which is
+ # immune to description-string drift and uniquely identifies the PF
+ # device ID (e.g. `1dd8:1002` for the Pensando DSC Ethernet
+ # Controller PF). SR-IOV VFs carry a *different* device ID and are
+ # naturally excluded, as are PCI bridges (`1dd8:0008`, `1dd8:1001`)
+ # and Processing-accelerator sub-functions (`1dd8:100c/100f/1012`)
+ # that share the vendor but are not NIC PFs. Counting raw lines or
+ # filtering by `lspci -Dd :` alone inflates the result by
+ # ~6x on Pensando cards (a single 8-port node reports 48 raw lines).
+ # To support multiple SKUs, the comma list is rewritten to a regex
+ # alternation (`1dd8:1002,1dd8:1003` -> `1dd8:1002|1dd8:1003`) and
+ # matched against the trailing `[vid:did]` tag.
+ pci_alt=$(echo "$PHASE3_AMD_NIC_PCI_IDS" | sed 's/,/|/g')
+ nic_count=$(lspci -Dnn 2>/dev/null \
+ | grep -cE "\[(${pci_alt})\]" || true)
+ nic_count="${nic_count:-0}"
+ if [[ "$nic_count" -ne "$PHASE3_EXPECTED_NIC_COUNT" ]]; then
+ fail_reasons+=("nic-count:expected=${PHASE3_EXPECTED_NIC_COUNT},actual=${nic_count}")
+ echo "[Phase 3] CHECK 1 FAIL: nic_count=${nic_count} expected=${PHASE3_EXPECTED_NIC_COUNT}"
+ else
+ echo "[Phase 3] CHECK 1 PASS: nic_count=${nic_count}"
+ fi
+
+ # ---- Check 2: link state UP for every ionic-driven interface ----------
+ # TWO fixes vs the original implementation:
+ # 1. Enumerate by DRIVER, not interface-name prefix. systemd-predictable
+ # naming on Pensando NICs varies across hosts -- some report
+ # `enP5p1s0f0` (slot scheme), others `enp104s0f3` (bus scheme),
+ # others `ens` (older enumeration). The previous prefix filter
+ # `enP|ens` silently missed `enp*` and false-passed with zero
+ # iterations.
+ # 2. Read link state from sysfs (`operstate`), NOT from `ip link`.
+ # The Phase 3 pod runs on the cluster pod network -- ionic netdevs
+ # are in the HOST netns, not the pod's, so `ip link` cannot see
+ # them and netlink-based tooling (ethtool, ip) returns "no such
+ # device". sysfs is a single global kernel filesystem mounted
+ # inside the pod regardless of netns, so `/sys/class/net//`
+ # is always accessible.
+ for d in "$PHASE3_NET_SYSFS_DIR"/*/; do
+ iface=$(basename "$d")
+ drv=$(basename "$(readlink "$d/device/driver" 2>/dev/null)" 2>/dev/null)
+ [[ "$drv" == "ionic" ]] || continue
+ # operstate values per kernel Documentation/ABI/testing/sysfs-class-net:
+ # "up", "down", "unknown", "lowerlayerdown", "testing", "dormant",
+ # "notpresent". Only "up" is healthy.
+ operstate=$(cat "$d/operstate" 2>/dev/null)
+ if [[ "$operstate" != "up" ]]; then
+ fail_nics+=("$iface")
+ fail_reasons+=("link-state:${iface}=${operstate:-UNKNOWN}")
+ echo "[Phase 3] CHECK 2 FAIL: iface=${iface} operstate=${operstate:-UNKNOWN}"
+ fi
+ done
+
+ # ---- Check 3: `rdma link show` reports state ACTIVE per device --------
+ # `rdma link show` emits one line per RDMA link, e.g.
+ # link rocep5s0/1 state ACTIVE physical_state LINK_UP netdev enP5p1s0f0
+ # We split on whitespace -- field 2 is `/` (we keep the
+ # whole token; the failure annotation should be specific enough to
+ # locate the offending link/port pair). For the state token we anchor
+ # to a leading space so we only match the standalone `state` field
+ # and never the trailing chunk of `physical_state` (which would
+ # otherwise be picked up by a naive `grep -oE 'state .'`).
+ while read -r line; do
+ # Skip blank lines (e.g. trailing newline) defensively.
+ [[ -z "$line" ]] && continue
+ dev=$(echo "$line" | awk '{print $2}')
+ state=$(echo "$line" | grep -oE ' state [A-Z_]+' | head -n 1 | awk '{print $2}')
+ if [[ -n "$dev" && "$state" != "ACTIVE" ]]; then
+ fail_nics+=("$dev")
+ fail_reasons+=("rdma-state:${dev}=${state:-UNKNOWN}")
+ echo "[Phase 3] CHECK 3 FAIL: dev=${dev} rdma-state=${state:-UNKNOWN}"
+ fi
+ done < <(rdma link show 2>/dev/null)
+
+ # ---- Check 4: ibv_devinfo responsive + non-empty GID table ------------
+ # `ibv_devices` output:
+ # device node GUID
+ # ------ ----------------
+ # rocep5s0 .
+ # rocep6s0 .
+ # We skip the 2-line header (`tail -n +3`) and pull the first column.
+ # For each device:
+ # * `ibv_devinfo -d $dev` must exit 0 (driver responsive).
+ # * `ibv_devinfo -d $dev -v` must report >= PHASE3_MIN_GID_COUNT
+ # `GID[` entries (verbose mode lists each GID slot).
+ for dev in $(ibv_devices 2>/dev/null | tail -n +3 | awk 'NF{print $1}'); do
+ if ! ibv_devinfo -d "$dev" >/dev/null 2>&1; then
+ fail_nics+=("$dev")
+ fail_reasons+=("ibv-devinfo:${dev}=unresponsive")
+ echo "[Phase 3] CHECK 4 FAIL: dev=${dev} ibv_devinfo unresponsive"
+ continue
+ fi
+ gid_count=$(ibv_devinfo -d "$dev" -v 2>/dev/null | grep -c "GID\[")
+ if [[ "$gid_count" -lt "$PHASE3_MIN_GID_COUNT" ]]; then
+ fail_nics+=("$dev")
+ fail_reasons+=("gid-table:${dev}=${gid_count}")
+ echo "[Phase 3] CHECK 4 FAIL: dev=${dev} gid_count=${gid_count} min=${PHASE3_MIN_GID_COUNT}"
+ fi
+ done
+
+ # ---- Check 5: firmware <-> workload-image alignment ------------------
+ # Read each ionic NIC's running firmware version from sysfs
+ # `/sys/class/infiniband/ionic_/fw_ver` and require it to appear
+ # as a substring of the workload image reference passed in as
+ # $ROCE_WORKLOAD_IMAGE. This catches "host is on fw X but the
+ # workload image was built against fw Y" without any curated
+ # allowlist -- the image tag IS the contract.
+ #
+ # sysfs path notes:
+ # * `/sys/class/infiniband//fw_ver` is populated by every
+ # RDMA driver's `ib_device.ops.query_device()` and reflects the
+ # running firmware on the card (driver cached from the NIC's
+ # admin command at probe time).
+ # * Why sysfs instead of `ethtool -i `: the Phase 3 pod
+ # runs on the cluster pod network. ionic netdevs are in the HOST
+ # netns, not the pod's, so ethtool (netlink) cannot reach them.
+ # /sys/class/infiniband IS visible inside the pod -- sysfs is a
+ # single global kernel filesystem mounted regardless of netns,
+ # and the existing Check 4 already reads
+ # /sys/class/infiniband/*/ports/*/gids/ from the same place.
+ #
+ # Strict-mode-aware (PHASE3_DRIVER_FW_STRICT): true => mismatch
+ # fails the node; false => mismatch only surfaces via the
+ # observed_fw= annotation and Phase 3 still passes.
+ #
+ # "No data" policy: if we cannot read fw_ver from any ionic device
+ # OR the orchestrator failed to pass ROCE_WORKLOAD_IMAGE, the check
+ # is SKIPPED rather than failed (loud WARN on stdout). A missing
+ # comparison value is an infrastructure gap, not a NIC-health
+ # signal -- Checks 1-4 still gate. Only an actual mismatch
+ # (we read fw + we read image + strings differ) fails the node in
+ # strict mode.
+ PHASE3_OBSERVED_FW_CSV=""
+ PHASE3_DRIVER_FW_CHECK_ENABLED="${PHASE3_DRIVER_FW_CHECK_ENABLED:-true}"
+ if [[ "${PHASE3_DRIVER_FW_CHECK_ENABLED,,}" == "true" ]]; then
+ strict="${PHASE3_DRIVER_FW_STRICT:-true}"
+ workload_image="${ROCE_WORKLOAD_IMAGE:-}"
+
+ # One fw_ver per /sys/class/infiniband/ionic_. Filter by driver
+ # symlink (not by `ionic_*` glob) so this picks up any future
+ # rename of the IB device naming scheme.
+ declare -A dev_fw
+ for d in "$PHASE3_IB_SYSFS_DIR"/*/; do
+ dev=$(basename "$d")
+ drv=$(basename "$(readlink "$d/device/driver" 2>/dev/null)" 2>/dev/null)
+ [[ "$drv" == "ionic" ]] || continue
+ fw=$(cat "$d/fw_ver" 2>/dev/null | tr -d '[:space:]')
+ [[ -z "$fw" ]] && continue
+ dev_fw[$dev]="$fw"
+ done
+
+ # "No data" cases are treated as SKIP rather than FAIL: if we
+ # can't read fw at all, OR can't get the workload image to compare
+ # against, we cannot evaluate alignment and a Phase 3 fail here
+ # would be a false positive (the NICs themselves may be perfectly
+ # healthy -- Checks 1-4 still gate). Loud WARN on stdout so the
+ # infrastructure gap (missing env / unreadable sysfs) is visible.
+ # An actual mismatch (we read fw + we read image, strings differ)
+ # below still fails in strict mode -- that's the intended check.
+ _check5_skip=false
+ if [[ -z "$workload_image" ]]; then
+ echo "[Phase 3] CHECK 5 SKIP: ROCE_WORKLOAD_IMAGE env not set -- cannot evaluate fw/image alignment"
+ _check5_skip=true
+ fi
+ if [[ "${#dev_fw[@]}" -eq 0 ]]; then
+ echo "[Phase 3] CHECK 5 SKIP: no fw_ver readable under ${PHASE3_IB_SYSFS_DIR}/*/ -- treating as no-data"
+ _check5_skip=true
+ fi
+
+ # Per-NIC substring match: workload_image must contain fw.
+ # `per_dev_observed` is built even on skip so the operator still
+ # gets the partial observed_fw annotation when SOME devices read.
+ per_dev_observed=()
+ for dev in "${!dev_fw[@]}"; do
+ fw="${dev_fw[$dev]}"
+ per_dev_observed+=("${dev}=${fw}")
+ if [[ "$_check5_skip" == "true" ]]; then
+ continue
+ fi
+ if [[ -n "$workload_image" && "$workload_image" != *"$fw"* ]]; then
+ echo "[Phase 3] CHECK 5 MISMATCH: dev=${dev} fw=${fw} not in image=${workload_image}"
+ if [[ "${strict,,}" == "true" ]]; then
+ fail_nics+=("$dev")
+ # Only emit the image tag portion in the failure reason --
+ # full registry/repo prefix would blow the annotation cap.
+ fail_reasons+=("fw-image-mismatch:${dev}=${fw}/image=${workload_image##*:}")
+ fi
else
- failed_nodes="$failed_nodes $node"
+ [[ -n "$workload_image" ]] && \
+ echo "[Phase 3] CHECK 5 OK: dev=${dev} fw=${fw} matches image substring"
fi
done
+ PHASE3_OBSERVED_FW_CSV=$(IFS=,; echo "${per_dev_observed[*]}")
fi
-
- # Count and report results
- passed_count=$(echo $passed_nodes | wc -w)
- failed_count=$(echo $failed_nodes | wc -w)
- echo "=================================================================="
- echo "Test Runner Jobs Summary:"
- echo " Passed: $passed_count node(s)"
- if [ $passed_count -gt 0 ]; then
- echo " Nodes: $passed_nodes"
- fi
- echo " Failed: $failed_count node(s)"
- if [ $failed_count -gt 0 ]; then
- echo " Nodes: $failed_nodes"
- fi
- echo "=================================================================="
-
- if [[ "${SKIP_GPU_VALIDATION,,}" == "true" ]]; then
- echo "SKIP_GPU_VALIDATION is set to true. Skip labelling nodes."
+
+ # ---- Emit PHASE3_RESULT marker for orchestrator to parse --------------
+ # The orchestrator (PHASE3_SCRIPT) parses this marker line from
+ # `kubectl logs job/ --tail=20` and performs label/annotate
+ # writes itself. No in-pod kubectl dependency.
+ #
+ # Marker contract (single line on stdout, last PHASE3_RESULT wins):
+ # PHASE3_RESULT status=passed [observed_fw=]
+ # PHASE3_RESULT status=failed reason= failed_nics= [observed_fw=]
+ #
+ # Reason and offending-NIC lists are comma-joined then truncated to
+ # the K8s-safe per-value cap. Truncation is intentional: a worst-case
+ # fleet-wide failure must not push the node object over the 256 KiB
+ # annotation budget. We log the full untruncated values separately to
+ # stdout so the Job log retains complete detail.
+ #
+ # observed_fw= is added by Check 5 whenever
+ # PHASE3_DRIVER_FW_CHECK_ENABLED=true. It is omitted entirely when
+ # Check 5 is disabled so the marker stays byte-for-byte identical to
+ # the pre-Check-5 contract on operators who flip the check off.
+ # Values are truncated to PHASE3_ANNOTATION_MAX_BYTES for the same
+ # node-object size-cap reason as `reason=` / `failed_nics=`.
+ obs_fw_field=""
+ if [[ -n "${PHASE3_OBSERVED_FW_CSV:-}" ]]; then
+ obs_fw_trunc="${PHASE3_OBSERVED_FW_CSV:0:${PHASE3_ANNOTATION_MAX_BYTES}}"
+ obs_fw_field=" observed_fw=${obs_fw_trunc}"
+ fi
+
+ if [[ "${#fail_reasons[@]}" -eq 0 ]]; then
+ echo "[Phase 3] all checks PASSED"
+ echo "PHASE3_RESULT status=passed${obs_fw_field}"
+ echo "[Phase 3] PHASE3_CHECK_SCRIPT done: PASS"
+ exit 0
+ fi
+
+ full_reason=$(IFS=,; echo "${fail_reasons[*]}")
+ full_nics=$(IFS=,; echo "${fail_nics[*]}")
+ reason="${full_reason:0:${PHASE3_ANNOTATION_MAX_BYTES}}"
+ nics="${full_nics:0:${PHASE3_ANNOTATION_MAX_BYTES}}"
+
+ echo "[Phase 3] FAIL: full_reason=${full_reason}"
+ echo "[Phase 3] FAIL: full_failed_nics=${full_nics}"
+ echo "PHASE3_RESULT status=failed reason=${reason} failed_nics=${nics}${obs_fw_field}"
+
+ echo "[Phase 3] PHASE3_CHECK_SCRIPT done: FAIL (${#fail_reasons[@]} reason(s), ${#fail_nics[@]} NIC(s))"
+ exit 1
+
+ # === Phase 3 Script: per-node NIC health Job submit + wait + orchestrator-side label ===
+ # Sourced by the orchestrator's dedicated run_phase3 wiring
+ # (cluster-validation-job.yaml,). Mirrors PHASE1_SCRIPT /
+ # PHASE2_SCRIPT submit/wait/label shape. PHASE3_CHECK_SCRIPT runs inside
+ # the Job pod; the pod's base image does not ship kubectl, so the script
+ # emits a PHASE3_RESULT marker line on stdout that PHASE3_SCRIPT parses
+ # via `kubectl logs job/` after the Job reaches a terminal state.
+ # The orchestrator owns all label/annotate writes -- this removes the
+ # in-pod kubectl dependency entirely.
+ #
+ # PHASE3_SCRIPT responsibilities:
+ # * intersect the input pool with `amd-nic=true` -- ALREADY DONE
+ # by the cronjob orchestrator before run_phase3 is invoked
+ # (see cluster-validation-job.yaml around line ~958). PHASE3_SCRIPT
+ # therefore treats its $@ / PHASE_NODES input as already-narrowed.
+ # * submit one Phase 3 Job per node
+ # * wait for Job completion bounded by PHASE3_JOB_WAIT_TIME
+ # * for every terminal Job (Complete or Failed), parse the
+ # PHASE3_RESULT marker from `kubectl logs job/ --tail=20` and
+ # issue the corresponding label_phase_passed / label_phase_failed
+ # (+ failed-nics annotation on failure).
+ # * for Jobs that never reached terminal state (submit-failed,
+ # timeout) write label_phase_failed directly -- the pod never got
+ # to emit a marker.
+ # * cleanup hung Jobs at the end
+ #
+ # On entry:
+ # PHASE_NODES = space-separated input node list (also $@)
+ # PHASE3_LABEL_KEY = "amd.com/nic-health" (from this ConfigMap)
+ # SKIP_NIC_VALIDATION = "true" short-circuits to pass-label-all
+ # PHASE3_EXPECTED_NIC_COUNT, PHASE3_JOB_WAIT_TIME from this ConfigMap
+ #. PHASE3_AMD_NIC_PCI_IDS / PHASE3_MIN_GID_COUNT are
+ # consumed inside the Job container -- the driver only needs the
+ # ones used by submit + classify.
+ # Helpers (already sourced by orchestrator):
+ # label_phase_passed, label_phase_failed, annotate_phase_value
+ # Job template available at
+ # /phase3-configs/cluster-validation-phase3-job-config.yaml
+ # (mounted by the cronjob from cluster-validation-phase3-job-config
+ # ConfigMap,).
+ #
+ # Contract:
+ # 1. Intersection with amd-nic=true: orchestrator-side (cronjob).
+ # PHASE3_SCRIPT trusts its input list is already narrowed.
+ # 2. Per input node: sed-render Job template ($$NODE,
+ # $$EXPECTED_NIC_COUNT substitution, unique per-node Job name)
+ # and kubectl apply.
+ # 3. Poll-wait all Jobs bounded by PHASE3_JOB_WAIT_TIME. Record
+ # terminal status as complete|failed|timeout per job.
+ # 4. Orchestrator-side label writes -- single source of truth:
+ # * submit-failed -> label_phase_failed reason=job-creation-failed
+ # * timeout -> label_phase_failed reason=nic-not-allocated
+ # * complete -> parse PHASE3_RESULT from job logs
+ # -> label_phase_passed (status=passed)
+ # -> label_phase_failed + failed-nics
+ # annotation (status=failed)
+ # -> label_phase_failed reason=no-result-line
+ # (marker missing -- defensive)
+ # * failed -> same parse path as complete (the Job
+ # container exited non-zero but the marker
+ # still appears on stdout before exit 1).
+ # 5. SKIP_NIC_VALIDATION=true early-exit: pass-label every input node,
+ # no Jobs created.
+ # 6. Cleanup hung (timed-out) Jobs at the end.
+ #
+ # Missing job template -> every input node labeled failed with reason
+ # "job-template-missing".
+ #
+ # All output (banners, per-step progress) goes to stdout. Diagnostic-only
+ # lines go to stderr so they don't accidentally pollute callers that may
+ # capture stdout in the future.
+ PHASE3_SCRIPT: |
+ #!/bin/bash
+ # Phase 3 -- per-node NIC health Job submit + wait orchestration.
+ # Sourced (not exec'd) by the orchestrator. Do NOT `set -e`; we want
+ # to label every "did-not-run" input node even if individual kubectl
+ # calls fail (mirror of PHASE1_SCRIPT / PHASE2_SCRIPT).
+
+ # ---- Input handling ---------------------------------------------------
+ # Orchestrator passes the node list as positional args AND exports
+ # PHASE_NODES. Prefer positional ($@) so a re-source with different
+ # args works as expected; fall back to PHASE_NODES if $# is zero.
+ # The input is already intersected with amd-nic=true upstream
+ # (cluster-validation-job.yaml run_phase3 caller), so PHASE3_SCRIPT
+ # does NOT re-narrow.
+ local_phase3_nodes=""
+ if [[ "$#" -gt 0 ]]; then
+ local_phase3_nodes="$*"
+ elif [[ -n "${PHASE_NODES:-}" ]]; then
+ local_phase3_nodes="$PHASE_NODES"
+ fi
+
+ if [[ -z "$local_phase3_nodes" ]]; then
+ echo "[Phase 3] no input nodes -- nothing to do"
+ return 0
+ fi
+
+ echo "[Phase 3] PHASE3_SCRIPT start: nodes=[$local_phase3_nodes]"
+
+ # ---- Early-exit: SKIP_NIC_VALIDATION ----------------------------------
+ # Incremental-bringup short-circuit (design §4 contract step 5).
+ # Every input node receives the pass label, no Phase 3 Jobs are
+ # created. Matches the SKIP_GPU_HW_ACCEPTANCE / SKIP_GPU_MESH_VALIDATION
+ # pattern in PHASE1_SCRIPT / PHASE2_SCRIPT.
+ if [[ "${SKIP_NIC_VALIDATION,,}" == "true" ]]; then
+ echo "[Phase 3] SKIP_NIC_VALIDATION=true -- pass-labeling all input nodes"
+ for n in $local_phase3_nodes; do
+ label_phase_passed "$n" "$PHASE3_LABEL_KEY" \
+ || echo "[Phase 3] WARN: label_phase_passed failed for $n" >&2
+ done
+ echo "[Phase 3] PHASE3_SCRIPT done (skip mode)"
+ return 0
+ fi
+
+ # ---- Validate required env vars ---------------------------------------
+ # If any are unset/empty, fail every input node fast with a clear
+ # reason. PHASE3_CHECK_SCRIPT runs INSIDE the Job container -- the
+ # driver only needs the ones used by submit + classify.
+ local missing_env=""
+ for v in PHASE3_LABEL_KEY PHASE3_EXPECTED_NIC_COUNT PHASE3_JOB_WAIT_TIME; do
+ if [[ -z "${!v:-}" ]]; then
+ missing_env="$missing_env $v"
+ fi
+ done
+ if [[ -n "$missing_env" ]]; then
+ echo "[Phase 3] ERROR: required env var(s) unset:$missing_env" >&2
+ echo "[Phase 3] failing every input node with reason=missing-env" >&2
+ for n in $local_phase3_nodes; do
+ label_phase_failed "$n" "${PHASE3_LABEL_KEY:-amd.com/nic-health}" \
+ "phase3-missing-env:${missing_env# }" \
+ || echo "[Phase 3] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+
+ local job_template=/phase3-configs/cluster-validation-phase3-job-config.yaml
+ if [[ ! -f "$job_template" ]]; then
+ echo "[Phase 3] ERROR: job template not found at $job_template" >&2
+ for n in $local_phase3_nodes; do
+ label_phase_failed "$n" "$PHASE3_LABEL_KEY" "job-template-missing" \
+ || echo "[Phase 3] WARN: label_phase_failed failed for $n" >&2
+ done
+ return 1
+ fi
+
+ # ---- Phase-local state -------------------------------------------------
+ # Indexed by job name. Declared local so the orchestrator shell stays
+ # clean when re-sourced. Status transitions:
+ # submitted -> complete | failed (terminal, parse marker)
+ # submitted -> timeout (terminal, no marker)
+ # submit-failed (terminal, no marker)
+ declare -A phase3_job_to_node=()
+ declare -A phase3_job_status=() # "submitted" | "submit-failed" | "complete" | "failed" | "timeout"
+ declare -A phase3_job_reason=() # orchestrator-side failure reason (timeout/submit-failed only)
+ local phase3_job_names=""
+
+ # ---- Step 1: Parallel-submit Phase 3 Job per node ---------------------
+ # Mirror of PHASE1_SCRIPT / PHASE2_SCRIPT Step 1. kubectl apply is
+ # non-blocking from the Job's perspective, so submitting in a tight
+ # loop is effectively parallel.
+ echo "[Phase 3] submitting Phase 3 Jobs (parallel)..."
+ local ts max_job_name_len=63 prefix="cvf-phase3"
+ ts=$(date +%Y%m%d-%H%M%S)
+ for node in $local_phase3_nodes; do
+ local job_name="${prefix}-${node}-${ts}"
+ if [[ "${#job_name}" -gt "$max_job_name_len" ]]; then
+ local node_hash
+ node_hash=$(echo -n "$node" | sha1sum | cut -c1-6)
+ job_name="${prefix}-${node_hash}-${ts}"
+ echo "[Phase 3] node=$node hostname too long, using hash -> job=$job_name"
+ fi
+ phase3_job_to_node[$job_name]="$node"
+ phase3_job_names="$phase3_job_names $job_name"
+
+ # Render template. The Phase 3 template defines three
+ # $$-prefixed placeholders: $$NODE (target hostname),
+ # $$EXPECTED_NIC_COUNT (NIC resource request count), and
+ # $$ROCE_WORKLOAD_IMAGE (container image; -- mirrors
+ # the PHASE4_DRIVER_SCRIPT substitution pattern so Phases 3/4/5
+ # all pin to the same ROCE_WORKLOAD_IMAGE ConfigMap key). Also
+ # rewrite the template's fixed metadata.name to our unique
+ # per-node job_name so multiple nodes can run concurrently
+ # without name collision.
+ local image="${ROCE_WORKLOAD_IMAGE:-rocm/roce-workload:latest}"
+ if sed "s|\$\$NODE|${node}|g; \
+ s/^ name: cluster-validation-phase3-job/ name: ${job_name}/; \
+ s|\$\$EXPECTED_NIC_COUNT|${PHASE3_EXPECTED_NIC_COUNT}|g; \
+ s|\$\$ROCE_WORKLOAD_IMAGE|${image}|g" \
+ "$job_template" | kubectl apply -f - >/dev/null; then
+ echo "[Phase 3] submitted job=$job_name node=$node"
+ phase3_job_status[$job_name]="submitted"
+ else
+ echo "[Phase 3] ERROR: kubectl apply failed for job=$job_name node=$node" >&2
+ phase3_job_status[$job_name]="submit-failed"
+ phase3_job_reason[$job_name]="job-creation-failed"
+ fi
+ done
+ echo "[Phase 3] all Jobs submitted at $(date)"
+
+ # ---- Step 2: Poll-wait every Job --------------------------------------
+ # Per-job wallclock budget = PHASE3_JOB_WAIT_TIME. Sequential wait
+ # loop, but the Jobs themselves run in parallel on different nodes.
+ # 5s poll interval matches PHASE1_SCRIPT / PHASE2_SCRIPT cadence.
+ local timeout="${PHASE3_JOB_WAIT_TIME}"
+ for job_name in $phase3_job_names; do
+ local node="${phase3_job_to_node[$job_name]}"
+ if [[ "${phase3_job_status[$job_name]:-}" == "submit-failed" ]]; then
+ continue
+ fi
+ echo "[Phase 3] waiting for job=$job_name node=$node (timeout=${timeout}s)"
+ local start_time elapsed
+ start_time=$(date +%s)
+ local terminal="false"
+ while true; do
+ elapsed=$(( $(date +%s) - start_time ))
+ if [[ "$elapsed" -ge "$timeout" ]]; then
+ echo "[Phase 3] job=$job_name TIMEOUT after ${timeout}s"
+ phase3_job_status[$job_name]="timeout"
+ # design §6: pod-cannot-mount-NICs is the dominant cause
+ # of Job-pending timeouts -> reason=nic-not-allocated.
+ phase3_job_reason[$job_name]="nic-not-allocated"
+ terminal="true"
+ break
+ fi
+ local complete_st failed_st
+ complete_st=$(kubectl get job "$job_name" \
+ -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' \
+ 2>/dev/null || echo "")
+ if [[ "$complete_st" == "True" ]]; then
+ # Container exited 0 -> all checks passed. Parse the
+ # PHASE3_RESULT marker in Step 3 to write the label.
+ echo "[Phase 3] job=$job_name Complete=True (node=$node)"
+ phase3_job_status[$job_name]="complete"
+ terminal="true"
+ break
+ fi
+ failed_st=$(kubectl get job "$job_name" \
+ -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' \
+ 2>/dev/null || echo "")
+ if [[ "$failed_st" == "True" ]]; then
+ # Container exited non-zero -> one or more checks failed.
+ # The PHASE3_RESULT marker is still emitted before exit 1;
+ # parse it in Step 3 to record the reason + failed-nics.
+ echo "[Phase 3] job=$job_name Failed=True (node=$node)"
+ phase3_job_status[$job_name]="failed"
+ terminal="true"
+ break
+ fi
+ sleep 5
+ done
+ if [[ "$terminal" != "true" ]]; then
+ # Defensive: should never fall through, but keep state sane.
+ phase3_job_status[$job_name]="timeout"
+ phase3_job_reason[$job_name]="nic-not-allocated"
+ fi
+ done
+
+ # ---- Step 3: Orchestrator-side label writes ---------------------------
+ # Single source of truth for all per-node Phase 3 labels. For Jobs
+ # that reached terminal state (complete|failed), parse the
+ # PHASE3_RESULT marker from container logs and write the label
+ # accordingly. For Jobs that never reached terminal state
+ # (submit-failed, timeout) write label_phase_failed directly.
+ _phase3_parse_and_label() {
+ local job_name="$1" node="$2"
+ local log_line status reason failed_nics observed_fw
+ # `--tail=20` is enough: PHASE3_CHECK_SCRIPT emits the marker
+ # as one of its last few lines. We take the last PHASE3_RESULT
+ # line in case a previous Job pod attempt also emitted one.
+ log_line=$(kubectl logs "job/$job_name" --tail=20 2>/dev/null \
+ | grep -E '^PHASE3_RESULT ' | tail -n 1)
+ if [[ -z "$log_line" ]]; then
+ echo "[Phase 3] WARN: no PHASE3_RESULT line in logs for job=$job_name node=$node -- treating as failed" >&2
+ label_phase_failed "$node" "$PHASE3_LABEL_KEY" "no-result-line" \
+ || echo "[Phase 3] WARN: label_phase_failed failed for $node" >&2
+ return 1
+ fi
+ status=$(echo "$log_line" | grep -oE 'status=[a-z]+' | head -1 | cut -d= -f2)
+ # Extract optional observed_fw field from the marker. Present on
+ # both pass and fail lines whenever Check 5 ran; absent when
+ # PHASE3_DRIVER_FW_CHECK_ENABLED=false (in which case we do NOT
+ # clear any prior annotation on the node -- last-known-good is
+ # more useful than empty for operators flipping the check off).
+ observed_fw=$(echo "$log_line" | sed -nE 's/.*observed_fw=([^ ]+).*/\1/p')
+ case "$status" in
+ passed)
+ echo "[Phase 3] orchestrator-side label: node=$node status=passed"
+ label_phase_passed "$node" "$PHASE3_LABEL_KEY" \
+ || echo "[Phase 3] WARN: label_phase_passed failed for $node" >&2
+ ;;
+ failed)
+ reason=$(echo "$log_line" | sed -nE 's/.*reason=([^ ]+).*/\1/p')
+ failed_nics=$(echo "$log_line" | sed -nE 's/.*failed_nics=([^ ]+).*/\1/p')
+ echo "[Phase 3] orchestrator-side label: node=$node status=failed reason=${reason:-no-reason} failed_nics=${failed_nics:-}"
+ label_phase_failed "$node" "$PHASE3_LABEL_KEY" "${reason:-no-reason}" \
+ || echo "[Phase 3] WARN: label_phase_failed failed for $node" >&2
+ if [[ -n "$failed_nics" ]]; then
+ annotate_phase_value "$node" "$PHASE3_LABEL_KEY" "failed-nics" "$failed_nics" \
+ || echo "[Phase 3] WARN: annotate failed-nics failed for $node" >&2
+ fi
+ ;;
+ *)
+ echo "[Phase 3] WARN: unrecognized PHASE3_RESULT status=${status:-} for node=$node" >&2
+ label_phase_failed "$node" "$PHASE3_LABEL_KEY" "bad-result-line" \
+ || echo "[Phase 3] WARN: label_phase_failed failed for $node" >&2
+ ;;
+ esac
+ # Write observed_fw annotation regardless of pass/fail, whenever
+ # the marker carried it. The absence branch is intentionally a
+ # no-op so we preserve any last-known-good annotation from a
+ # prior run when an operator disables Check 5.
+ if [[ -n "$observed_fw" ]]; then
+ annotate_phase_value "$node" "$PHASE3_LABEL_KEY" "observed-fw" "$observed_fw" \
+ || echo "[Phase 3] WARN: annotate observed-fw failed for $node" >&2
+ fi
+ }
+
+ local phase3_orch_failed_count=0
+ for job_name in $phase3_job_names; do
+ local node="${phase3_job_to_node[$job_name]}"
+ local status="${phase3_job_status[$job_name]:-}"
+ case "$status" in
+ submit-failed|timeout)
+ local reason="${phase3_job_reason[$job_name]:-failed}"
+ echo "[Phase 3] orchestrator-side label: node=$node status=$status reason=$reason"
+ label_phase_failed "$node" "$PHASE3_LABEL_KEY" "$reason" \
+ || echo "[Phase 3] WARN: label_phase_failed failed for $node" >&2
+ phase3_orch_failed_count=$((phase3_orch_failed_count + 1))
+ ;;
+ complete|failed)
+ _phase3_parse_and_label "$job_name" "$node"
+ ;;
+ *)
+ echo "[Phase 3] WARN: unexpected status=$status for job=$job_name" >&2
+ ;;
+ esac
+ done
+
+ # ---- Step 4: Cleanup hung Jobs ----------------------------------------
+ # Anything that timed out (still Active, neither Complete nor Failed)
+ # gets explicitly deleted so it does not linger past the next CronJob
+ # tick. ttlSecondsAfterFinished handles the completed ones.
+ for job_name in $phase3_job_names; do
+ local status="${phase3_job_status[$job_name]:-}"
+ if [[ "$status" == "timeout" ]]; then
+ echo "[Phase 3] deleting hung job=$job_name"
+ kubectl delete job "$job_name" --ignore-not-found=true \
+ --wait=false >/dev/null 2>&1 \
+ || echo "[Phase 3] WARN: delete job=$job_name failed" >&2
+ fi
+ done
+
+ echo "[Phase 3] PHASE3_SCRIPT done: orchestrator-labeled-failed=${phase3_orch_failed_count} (submit-failed/timeout); pass/fail labels for completed Jobs derived from PHASE3_RESULT marker"
+ return 0
+
+ # === Phase 4 (Pairwise Rail Bandwidth Test) env vars ===
+ # Consumed by `PHASE4_DRIVER_SCRIPT` and by
+ # the per-pair server/client Jobs rendered from
+ # `cluster-validation-phase4-job-config`.
+ # See design doc design doc §4 -> "Code Path: Phase 4 ConfigMap variables".
+ #
+ # The driver projects these via `envFrom` on the orchestrator container
+ # and substitutes the NAD/IB-DEV prefixes into per-rail Job manifests
+ # at `$$RAIL_IDX` render time. Tune via ConfigMap edit -- no code change.
+
+ # Number of rails (per-NIC RDMA devices) tested per node. Default 8
+ # matches the MI300 fleet baseline (8 NICs, one rail per NIC,
+ # validated on smc300x-ccs-aus-gpuf268). The driver iterates rails
+ # 0.(PHASE4_RAIL_COUNT - 1) serially within each pair. Lowering
+ # this value (e.g. PHASE4_RAIL_COUNT=4) is the supported way to skip
+ # rails 4-7 on partially-cabled testbeds without code changes
+ # (design §7 test case "rail-count-override").
+ PHASE4_RAIL_COUNT: "8"
+ # Minimum acceptable measured bandwidth per rail in Gbps. A rail is
+ # marked failed if `ib_write_bw` "BW average" parses to a value below
+ # this threshold; a node is marked failed iff ANY of its tested rails
+ # fell below threshold. Default 380 Gbps is ~95% of the 400 Gbps line
+ # rate target for a single Pensando 400G NIC (design §1) and leaves
+ # headroom for protocol overhead and cable-marginality jitter.
+ PHASE4_BW_THRESHOLD: "380"
+ # Per-rail wallclock budget in seconds for a single (pair, rail)
+ # measurement -- covers server pod startup, client pod startup,
+ # `ib_write_bw` handshake + transfer, and log scrape. Default 180s
+ # leaves headroom for image pull (rocm/roce-workload), pod scheduling,
+ # and NAD attachment. A pair's total budget is roughly
+ # PHASE4_RAIL_COUNT * PHASE4_PAIR_WAIT_TIME (rails serialized within
+ # a pair); on default values that is 8 * 180 = 1440s per pair.
+ # patchable: cluster-validation-framework.timeouts.phase4-pair-wait-secs
+ PHASE4_PAIR_WAIT_TIME: "180"
+ # Cap on concurrent pair runners. The driver forks up to this many
+ # `pair_runner` background functions in parallel; further pairs queue
+ # until a slot frees. Default 8 keeps the cluster API server happy on
+ # large pools (design §2: "pairs run in parallel . bounded to keep
+ # the kubernetes API server happy on large clusters"). Raise on
+ # beefier control planes; lower if `kubectl apply` 429s appear during
+ # parallel submits (design §6 -> "api-throttled" recovery).
+ PHASE4_MAX_CONCURRENT_PAIRS: "8"
+ # Prefix for per-rail NetworkAttachmentDefinition names. The driver
+ # constructs the full NAD name as `${PHASE4_NAD_NAME_PREFIX}${RAIL_IDX}`
+ # (e.g. `amd-host-device-nad-rail-3` for rail 3) and stamps it into
+ # the Job pod's `k8s.v1.cni.cncf.io/networks` annotation. The default
+ # matches the per-rail NAD naming convention shipped in
+ # `nad-per-rail.yaml`; override only if the fleet uses a different
+ # NAD naming scheme.
+ PHASE4_NAD_NAME_PREFIX: "amd-host-device-nad-rail-"
+ # Prefix for the RDMA device name passed to `ib_write_bw -d`. The
+ # in-pod command is rendered as
+ # `ib_write_bw -d ${PHASE4_IB_DEV_PREFIX}${RAIL_IDX} -i 1 -F .`,
+ # binding the test to a specific per-rail RDMA device. Default
+ # `ionic_` matches the device-naming convention exposed by the AMD
+ # Pensando ionic_rdma kernel driver (`/sys/class/infiniband/ionic_N`).
+ # Override for non-Pensando NICs (e.g. `mlx5_` on Mellanox).
+ PHASE4_IB_DEV_PREFIX: "ionic_"
+
+ # === Phase 4 (Pairwise Rail Bandwidth Test) driver script ===
+ # Orchestrator-side driver. Sourced (not exec'd) by `run_phase4` in
+ # cluster-validation-job.yaml after replaces the generic
+ # `_run_phase_generic` stub with a source-and-invoke of this script.
+ # The input list is already gated to nodes that passed Phase 3
+ # (amd.com/nic-health=passed) by the orchestrator's filter_passed_nodes
+ # call -- PHASE4_DRIVER_SCRIPT does NOT re-narrow.
+ #
+ # Pre-existing companions (consumed at runtime):
+ # * Per-pair Job templates -- /phase4-configs/{server,client}.yaml
+ # mounted; templates
+ # * helpers -- label_phase_passed, label_phase_failed,
+ # annotate_phase_value (sourced upstream
+ # from PHASE_NODE_LABEL_SCRIPT).
+ # * PHASE4_* env vars, projected via envFrom on
+ # the orchestrator container.
+ #
+ # Contract (design §4 -> "Code Path: PHASE4_DRIVER_SCRIPT"):
+ # 1. Sort the input nodes and build a full-mesh round-robin
+ # schedule using the circle algorithm:
+ # - For N input nodes there are N-1 rounds. Each round contains
+ # floor(N/2) disjoint pairs (every node appears in at most one
+ # pair per round); for odd N a different node "sits out" each
+ # round via the bye slot. Union over rounds covers every pair
+ # in C(N,2), enabling A-B / B-C / A-C transitive triangulation
+ # of rail faults (a rail that fails on A in pairs against
+ # multiple peers is much more likely to be A's fault than the
+ # peers').
+ # - Single-node input (N=1) records `unpaired=true` annotation
+ # and pass-labels with no peer measurement, same as before.
+ # 2. For each round, fork pair_runners (bounded by
+ # PHASE4_MAX_CONCURRENT_PAIRS); wait for the round before
+ # starting the next. Within a round pairs are guaranteed
+ # node-disjoint so no two pair_runners contend for the same
+ # node/NIC.
+ # 3. pair_runner iterates rails 0.(PHASE4_RAIL_COUNT-1) serially:
+ # - render+apply server Job, wait for pod IP
+ # - render+apply client Job with that IP
+ # - wait for both, parse "BW average" Gbps from client log
+ # - record rail_bw[node][rail][round] and reason on failure
+ # - cleanup both Jobs
+ # 4. After all rounds finish, aggregate per-node:
+ # node passes iff ALL (rail, round) measurements were
+ # >= PHASE4_BW_THRESHOLD with no recorded reason.
+ # 5. Label via helpers; write per-(rail, round) annotations and
+ # `failed-rails` + `triangulation` summaries via
+ # annotate_phase_value.
+ # 6. SKIP_RAIL_BANDWIDTH_TEST=true: pass-label every input node, no
+ # Jobs created (mirror of SKIP_NIC_VALIDATION fast-path).
+ #
+ # Inter-process state: bash subshells cannot share arrays back to the
+ # parent, so pair_runners write per-(node, rail, round) results to a
+ # shared tmp directory (PHASE4_STATE_DIR). The driver aggregates from
+ # disk after each round's `wait`. Layout:
+ # $PHASE4_STATE_DIR/results//rail--round- = ""
+ # $PHASE4_STATE_DIR/results//rail--round-.reason = ""
+ # $PHASE4_STATE_DIR/results//rail-