diff --git a/.markdownlint-cli2.jsonc b/.markdownlint-cli2.jsonc index 6aad0f7a9..2048dada6 100644 --- a/.markdownlint-cli2.jsonc +++ b/.markdownlint-cli2.jsonc @@ -12,6 +12,7 @@ "**/tests/pytests/**", "**/knowledge/**", "**/tests/**", + "**/example/gpu-validation-cluster/**", "**/docs-internal/**" ] } diff --git a/example/gpu-validation-cluster/.gitignore b/example/gpu-validation-cluster/.gitignore new file mode 100644 index 000000000..f4bcf4e4b --- /dev/null +++ b/example/gpu-validation-cluster/.gitignore @@ -0,0 +1,3 @@ + +# Lab-specific test scaffolding (per-engineer; contains hostnames + cred refs) +ansible/inventory.test.yml diff --git a/example/gpu-validation-cluster/README.md b/example/gpu-validation-cluster/README.md index 04f903746..01a9f09e5 100644 --- a/example/gpu-validation-cluster/README.md +++ b/example/gpu-validation-cluster/README.md @@ -4,213 +4,1101 @@ A containerized, one-click deployment solution for validating AMD GPU and AINIC ## Overview -This project provides an automated, reproducible testing environment for GPU operator functionality. It deploys a complete Kubernetes cluster with AMD GPU and Network operators pre-configured, enabling rapid validation of operator features and performance. +This project provides an automated, reproducible testing environment for AMD GPU + AMD AINIC validation at fleet scale. A single Ansible run brings up a containerized k3s cluster across N nodes, installs the AMD GPU + Network operators, and deploys the Cluster Validation Framework (CVF) — a CronJob-driven, 5-phase gated pipeline that exercises GPUs, intra-node xGMI, per-NIC RDMA health, per-rail bandwidth, and multi-node RCCL collectives. + +```mermaid +flowchart TB + subgraph CN["Control Node (your laptop / jumphost)"] + AP["ansible-playbook
(cvf.yml + lifecycle playbooks)"] + end + + subgraph SN["Server Node"] + SC["docker container 'server'
k3s server (control-plane)"] + SC --> GO["AMD GPU Operator
(kube-amd-gpu)"] + SC --> NO["AMD Network Operator
(kube-amd-network)"] + SC --> MO["MPI Operator
(mpi-operator)"] + SC --> CVF["Cluster Validation Framework
CronJob (default */30)"] + end + + subgraph AN["Agent Node(s) 1..N"] + AC["docker container 'agent'
k3s agent + GPU/NIC device plugins"] + end + + AP -->|setup-cluster| SC + AP -->|setup-cluster / add-agent-nodes| AC + AP -->|cvf.yml -e action=…| SC + AC -.k3s join.-> SC + + CVF --> P0["Phase 0: candidate selection"] + P0 --> P1["Phase 1: GPU HW acceptance
RVS gst_single + AGFHC xgmi/pcie/hbm"] + P1 --> P2["Phase 2: intra-node GPU mesh
RCCL all_reduce (single node)"] + P2 --> P3["Phase 3: per-node NIC health
rdma link + ibv_devinfo"] + P3 --> P4["Phase 4: pairwise rail bandwidth
ib_write_bw per rail per pair"] + P4 --> P5["Phase 5: multi-node RCCL
MPIJob all_reduce_perf"] + P5 --> RES["per-node labels:
amd.com/{phase}=passed|failed"] +``` ## Features -- **Automated Deployment**: Single-command cluster initialization with all operators ready -- **GPU Operator**: Full AMD GPU device plugin with resource management and scheduling -- **Network Operator**: AMD network operator for advanced networking and performance optimization -- **Cluster Validation Framework**: Comprehensive automated tests for both GPU validation and RCCL tests. -- **Containerized**: Entire stack runs in containers for portability and consistency +- **One-command multi-node deployment** via Ansible (`playbooks/setup-cluster.yml`) +- **AMD GPU + Network operators** pre-configured (inbox driver by default; operator-installed driver opt-in) +- **5-phase validation pipeline** with per-phase skip flags, per-stage timeouts, and per-framework test-runner images +- **Day-2 operations playbook** (`playbooks/cvf.yml`) — single dispatcher for reapply / reset / inject-secret / trigger / status / fresh-run +- **Persistent per-phase logs** on each node under `/var/log/cluster-validation/` (Phase 1 test-runner, Phase 2 RCCL, Phase 3 NIC probes, Phase 4 ib_write_bw server+client, Phase 5 RCCL launcher) +- **Containerized** — entire stack runs in `docker` so the host stays clean ## Quick Start -There are two ways to deploy the GPU Validation Cluster: +Multi-node deployment is fully Ansible-driven. The flow is **(1) edit `configs/config.json`** for your environment, then **(2) run two playbooks**: `setup-cluster.yml` for one-time bringup, `cvf.yml` for all day-2 operations. + +### Prerequisites + +**Host OS (required, every cluster node):** + +- **Ubuntu 22.04 LTS** or **Ubuntu 24.04 LTS** — these are the only validated host OS versions. Other distros and other Ubuntu versions are not supported by the bringup playbook (Docker install, kernel-module DKMS, and the operator chart all assume Ubuntu LTS). + +**Per cluster node:** + +- Docker daemon running (the playbook will install it if missing) +- `jq` CLI (installed by the playbook if missing) +- `amdgpu` + `ionic` (and `ionic_rdma`) kernel modules present — either inbox or installed by the AMD operators (see step 0.1) +- SSH access from the control node (configured by `setup-ssh-keys.yml`) -1. **Automated (Recommended)**: Using Ansible playbooks for multi-node deployment -2. **Manual**: Using shell scripts on each node individually +**Control node (your laptop or jumphost):** -### Option 1: Automated Deployment with Ansible (Recommended) +- Ansible 2.9+ +- SSH key to every cluster node (set up via `setup-ssh-keys.yml` if not already in place) +- Working copy of this repo + an `ansible/inventory.yml` that lists the server + agent nodes -For deploying a multi-node cluster, use the Ansible automation: +### Step 0 — Pre-flight configuration + +Edit `configs/config.json` **before** running any playbook. The settings drive what the framework installs, runs, and how long it waits. + +#### 0.1 Driver source: inbox vs operator-installed + +Each AMD operator can either (a) use the host's existing kernel module ("inbox") or (b) install/manage its own driver. Default is **inbox** for both. + +```jsonc +"amd-gpu-operator": { + "version": "v1.4.1", // operator chart version + "install-amdgpu-driver": false, // false = use host's amdgpu; true = operator installs driver-version + "driver-version": "30.30" +}, +"network-operator": { + "version": "v1.0.0", + "install-network-operator": true, // optional (default true). false = skip the network operator entirely for a GPU-only cluster + "install-ionic-driver": false, // false = use host's ionic_rdma; true = operator installs driver-version + "driver-version": "1.117.5-a-56" +} +``` + +> **GPU-only clusters:** set `install-network-operator: false` to skip the +> network operator, its `NetworkConfig`, and Multus altogether. Only do +> this when no NIC-dependent phases run — make sure Phases 3, 4, and 5 are +> skipped in `skip-tests` and that `node-selector-labels` does not require +> an `amd-nic`/`amd-vnic` label. cert-manager, the GPU operator, +> mpi-operator, and the CVF CronJob still install normally. + +**How to decide:** check what's already loaded on each node before bringup. ```bash -cd ansible -./quickstart.sh +# On every node — confirm GPU + RDMA drivers are present and report a version +modinfo amdgpu | grep -E '^version|^srcversion' | head -1 +modinfo ionic_rdma | grep -E '^version|^srcversion' | head -1 +lsmod | grep -E '^amdgpu|^ionic' ``` -**Benefits:** +If both are present and a known-good version, leave `install-*-driver: false`. Otherwise set to `true` and pin `driver-version` to a version supported by your kernel. -- Automated deployment across all nodes from a single command -- Automatic installation of prerequisites (Docker, jq) -- Passwordless SSH setup included -- One-command teardown and status checking -- Handles image distribution automatically +> **Container ↔ kernel ABI matching.** The `roce-workload` image used by Phases 2/4/5 ships its own `libionic` userspace which must support the host kernel's `ionic_rdma` ABI. If you see `libibverbs: Warning: Driver ionic does not support the kernel ABI of N (supports M to M)` in Phase 4 logs, swap to a `roce-workload` image whose ainic version matches the host driver (see step 0.4). -**See the [Ansible README](ansible/README.md) for detailed instructions.** +#### 0.2 Node selector labels (physical vs VF-passthrough VM) -### Option 2: Manual Deployment +`node-selector-labels` is an AND-list — only nodes carrying **every** label in the array are eligible for cluster validation. The label you choose depends on whether the node exposes physical AMD devices or virtual-function (VF) passthrough devices to a guest VM: -For manual deployment or single-node testing, follow these steps: +| Label | Means | Set by | +|---|---|---| +| `feature.node.kubernetes.io/amd-gpu=true` | Physical AMD GPU (bare metal / VM with PF-passthrough) | NFD + AMD GPU operator | +| `feature.node.kubernetes.io/amd-vgpu=true` | **VF-passthrough** AMD GPU inside a guest VM (SR-IOV VF assigned to the VM) | NFD + AMD GPU operator (VM-aware) | +| `feature.node.kubernetes.io/amd-nic=true` | Physical AMD Pensando NIC (bare metal / VM with PF-passthrough) | NFD + AMD Network operator | +| `feature.node.kubernetes.io/amd-vnic=true` | **VF-passthrough** AMD Pensando NIC inside a guest VM (SR-IOV VF assigned to the VM) | NFD + AMD Network operator (VM-aware) | -#### Prerequisites +Default in `config.json` (physical GPU + physical NIC fleet): +```jsonc +"node-selector-labels": [ + "feature.node.kubernetes.io/amd-gpu=true", + "feature.node.kubernetes.io/amd-nic=true" +] +``` -- Docker engine installed and daemon is running (validated on Docker 29.1.5 or newer) -- `jq` CLI for JSON parsing -- Ubuntu 22.04 or 24.04 host +**Tune for your fleet:** -#### Deployment Steps +| Posture | `node-selector-labels` | +|---|---| +| Bare-metal / PF-passthrough VM (default) | `["amd-gpu=true", "amd-nic=true"]` | +| Fully virtualized (VF GPU + VF NIC inside guest VM) | `["amd-vgpu=true", "amd-vnic=true"]` | +| Mixed (VF GPU + PF NIC, etc.) | mix and match the four labels | +| Early bringup, no NIC yet | drop the NIC label: `["amd-gpu=true"]` only | -1. **Build the container image** +**Verify what's actually labeled** before changing this: +```bash +ansible -i ansible/inventory.yml server-node -m shell -a \ + "docker exec server kubectl get nodes -L feature.node.kubernetes.io/amd-gpu \ + -L feature.node.kubernetes.io/amd-vgpu \ + -L feature.node.kubernetes.io/amd-nic \ + -L feature.node.kubernetes.io/amd-vnic" +``` - ```bash - ./gpu-cluster.sh build - ``` +NFD applies these labels asynchronously after the corresponding operator's device plugin advertises a resource. If a node shows the device under `kubectl describe node` capacity but the label is missing, the operator hasn't completed its NFD rule rollout yet — wait \~1 min and re-check. + +> **Custom (non-NFD) labels must be applied by you.** Only the four +> `feature.node.kubernetes.io/amd-{gpu,vgpu,nic,vnic}` labels above are +> set automatically (by NFD). If you set `node-selector-labels` to a +> custom label such as `["cvf-candidate=true"]`, **nothing applies it for +> you** — you must label the target nodes yourself, or Phase 0 will match +> zero nodes and every CronJob tick will skip with +> `[Node Selection: WARNING] Selector '...' matched 0 nodes`: +> ```bash +> docker exec server kubectl label node cvf-candidate=true +> ``` +> This is a useful pattern when you want to *explicitly* opt nodes into +> validation rather than auto-selecting every GPU/NIC node. + +> **Note:** the orchestrator uses these labels for Phase 0 candidate selection. NIC-requiring phases (Phase 3 / Phase 4) further narrow inside their own per-phase script via intersection with `amd-nic=true` (or `amd-vnic=true` if you set it), so non-NIC nodes pass through GPU-only phases cleanly without entering the NIC phases. + +#### 0.3 Which phases / stages to run + estimated runtime + +Two layers of skip flags: **per-phase** (5 flags) and **per-Phase-1-stage** (4 flags). + +```jsonc +"skip-tests": { + "skip-phase1-gpu-hw-acceptance": true, // Phase 1: RVS gst_single + AGFHC xgmi/pcie/hbm (DEFAULT: SKIPPED, please configure your AGFHC image and pull secrets then enable this phase) + "skip-phase2-gpu-mesh-validation": false, // Phase 2: intra-node 8-rank RCCL all_reduce + "skip-phase3-nic-validation": false, // Phase 3: per-node NIC health (needs amd-nic=true) + "skip-phase4-rail-bandwidth-test": false, // Phase 4: pairwise ib_write_bw per rail + "skip-phase5-rccl-test": false, // Phase 5: multi-node RCCL via MPIJob + "skip-phase1-stages": { // Only consulted when skip-phase1-gpu-hw-acceptance=false + "skip-phase1-gpu-stress": true, // gst_single (RVS) ~15-30 min/node + "skip-phase1-xgmi-lvl1": true, // xgmi_lvl1 (AGFHC) ~3-5 min/node (private image) + "skip-phase1-pcie-lvl1": true, // pcie_lvl1 (AGFHC) ~3-5 min/node (private image) + "skip-phase1-hbm-lvl1": true // hbm_lvl1 (AGFHC) ~3-5 min/node (private image) + } +} +``` - After building, you have two options to make the image available on all nodes: - - - **Option A**: Save and port the image to other nodes: - - ```bash - # On server node: save the image - docker save gpu-validation-cluster:latest -o gpu-validation-cluster.tar - - # Transfer to worker nodes and load: - scp gpu-validation-cluster.tar user@worker-node:/tmp/ - ssh user@worker-node "docker load -i /tmp/gpu-validation-cluster.tar" - ``` - - - **Option B**: Rebuild the image on each worker node: - - ```bash - # Run on each worker node - ./gpu-cluster.sh build - ``` - -2. **Configure cluster validation framework** - - Before starting the cluster, edit `configs/config.json` to match your environment. Common configuration options: - - **Device Type Selection:** - - For physical GPUs: `"gpu-type": "amd-gpu"` - - For SR-IOV VF GPUs (in VMs): `"gpu-type": "amd-vgpu"` - - For physical NICs: `"nic-type": "amd-nic"` - - For virtual NICs (in VMs): `"nic-type": "amd-vnic"` - - **Resource Configuration:** - - ```json - "cluster-validation-framework": { - "node-selector-labels": [ // Node selector labels for candidate selection - "feature.node.kubernetes.io/amd-gpu=true", // GPU label selector - "feature.node.kubernetes.io/amd-nic=true" // NIC label selector - ], - "resources": { - "worker-replicas": 2, // Number of nodes to validate in parallel - "gpu-per-worker": 8, // Number of GPUs per node - "pf-nic-per-worker": 0, // Number of physical function NICs per node - "vf-nic-per-worker": 8, // Number of virtual function NICs per node - "slots-per-worker": 8, // MPI ranks per worker - "node-validation-interval-mins": 10 // Minimum interval between validation runs on same node - }, - "skip-tests": { - "skip-gpu-validation": false, // Set to true to skip GPU validation tests (RVS/AGFHC) - "skip-rccl-test": false // Set to true to skip MPI Job RCCL tests - } +The default skips all of Phase 1 (no AGFHC pull secret required out of the box) and runs Phase 2 → 5. Set `skip-phase1-gpu-hw-acceptance: false` then enable individual stages in `skip-phase1-stages` when you want HW acceptance coverage — and remember to inject the AGFHC pull secret (step 0.5) if any of the `xgmi/pcie/hbm-lvl1` stages are enabled. + +**Estimated per-phase wallclock (per node, in parallel across nodes):** + +| Phase | Stages / contents | Typical | Worst-case (timeout) | +|---|---|---|---| +| 1.gpu-stress | RVS gst_single (FP8/FP16/dgemm sweep) | 15-30 min | 60 min | +| 1.xgmi-lvl1 | AGFHC xGMI link sweep | 3-5 min | 20 min | +| 1.pcie-lvl1 | AGFHC PCIe lane sweep | 3-5 min | 20 min | +| 1.hbm-lvl1 | AGFHC HBM ECC sweep | 3-5 min | 20 min | +| 2 | intra-node 8-rank RCCL all_reduce | 3-8 min | 10 min | +| 3 | per-node NIC health (count, ip link, rdma link, ibv_devinfo) | 30s-2 min | 5 min | +| 4 | pairwise `ib_write_bw` × 8 rails per pair (rails serialized) | 1-2 min × 8 = 8-16 min per pair (pairs run in parallel) | 40 min per pair | +| 5 | multi-node RCCL MPIJob (`all_reduce_perf`) | 1-5 min | 5 min | + +**Suggested bringup postures:** + +| Posture | Skip flags | Use when | +|---|---|---| +| **Network-focused** (default) | P1 skipped, P2+P3+P4+P5 enabled | network fabric + RCCL stack validation; no AGFHC pull secret needed | +| **GPU HW acceptance only** | P1 enabled (set `skip-phase1-stages.*` to choose stages), P2+P3+P4+P5 skipped | new GPU node bringup; HW burn-in; requires AGFHC pull secret if xgmi/pcie/hbm-lvl1 enabled | +| **GPU-only smoke** | P1+P2 enabled, P3+P4+P5 skipped | datacenter early, no NIC infrastructure yet | +| **Full pipeline** | all 5 enabled | full HW + fabric + workload validation; requires AGFHC pull secret | + +#### 0.4 Per-step container images + +```jsonc +"images": { + "roce-workload": "docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_rccl-7.0.2_anp-v1.2.0_ainic-1.117.5-a-56", + "test-runner": { + "rvs": "docker.io/rocm/test-runner:v1.4.0", // Phase 1 gpu-stress + "agfhc": "docker.io/amdpsdo/test-runner:agfhc-v1.5.0-4" // Phase 1 xgmi/pcie/hbm — PRIVATE, see 0.4 + }, + "orchestrator": "docker.io/bitnamilegacy/kubectl:1.33.4", // submit-mpijob, NAD apply + "preflight-init": "docker.io/bitnamilegacy/kubectl:1.33.4", // Phase 5 launcher init container + "nic-health": "docker.io/rocm/roce-workload:" // Phase 3 NIC probes (now uses ${ROCE_WORKLOAD_IMAGE}) +} +``` + +| Image | Used by | Registry | +|---|---|---| +| `rocm/test-runner` (RVS) | Phase 1 `gpu-stress` (Recipe: `gst_single`) | public (Docker Hub) | +| **`amdpsdo/test-runner:agfhc-*`** | Phase 1 `xgmi-lvl1` / `pcie-lvl1` / `hbm-lvl1` | **PRIVATE** — see 0.4 | +| `rocm/roce-workload` | Phase 2, Phase 3 NIC health Job, Phase 4 server+client, Phase 5 worker/launcher (all four pinned to one `ROCE_WORKLOAD_IMAGE` ConfigMap key) | public | +| `bitnamilegacy/kubectl` | orchestrator + Phase 5 preflight init | public | + +> **roce-workload ↔ kernel ABI:** if Phase 4 fails on every rail with `libibverbs: Warning: Driver ionic does not support the kernel ABI`, your `ainic-*` tag's libionic doesn't match the host's `ionic_rdma` ABI. Try a different tag from the [`rocm/roce-workload` Docker Hub repo](https://hub.docker.com/r/rocm/roce-workload/tags). + +#### 0.5 ⚠️ AGFHC requires a pull secret + +If **any Phase 1 stage other than `gpu-stress` is enabled** (i.e. `xgmi-lvl1`, `pcie-lvl1`, or `hbm-lvl1`), the orchestrator will pull `docker.io/amdpsdo/test-runner:agfhc-*` — a **private Docker Hub repo**. The pod will sit in `ErrImagePull` until you inject pull credentials. + +**Steps:** + +1. Request an AGFHC pull token from your AMD representative. +2. Add it to `configs/config.json`: + ```jsonc + "global": { + "image-pull-secrets": [ + { "registry-url": "docker.io", + "username": "amdpsdo", + "token": "dckr_oat_…" // your token, keep out of git + } + ] } ``` +3. Inject it into the running cluster: + ```bash + ansible-playbook -i ansible/inventory.yml ansible/playbooks/cvf.yml -e action=inject-secret + ``` +4. (Alternative — keep token out of `config.json`) inject ad-hoc via CLI override: + ```bash + ansible-playbook -i ansible/inventory.yml ansible/playbooks/cvf.yml -e action=inject-secret \ + -e username=amdpsdo -e token=dckr_oat_… + ``` - **Node Selector Labels:** - The `node-selector-labels` array defines which nodes are eligible for cluster validation. Each label is combined with AND logic to select nodes. +The playbook is idempotent (additive strategic-merge patch on `cluster-validation-sa.imagePullSecrets`). See [`ansible/README.md`](ansible/README.md) for full details. - Common label combinations: - - Physical GPUs + Physical NICs: `["feature.node.kubernetes.io/amd-gpu=true", "feature.node.kubernetes.io/amd-nic=true"]` - - Virtual GPUs + Virtual NICs (in VMs): `["feature.node.kubernetes.io/amd-vgpu=true", "feature.node.kubernetes.io/amd-vnic=true"]` - - Mixed configurations: Customize the array to match your environment +If `xgmi-lvl1` / `pcie-lvl1` / `hbm-lvl1` are all set to `skip: true`, no secret is needed. -3. **Start the validation cluster** +#### 0.6 Timeouts - ```bash - # Bring up control plane (run in background) - ./gpu-cluster.sh run server & +Make sure each per-phase budget is generous enough for your hardware — a too-tight budget kills the test Job mid-run and the orchestrator records it as a failure even when the underlying test was about to pass. - # Fetch control plane token to join the cluster - ./gpu-cluster.sh get-token +```jsonc +"timeouts": { + "phase1-stages-secs": { + "phase1-gpu-stress": 3600, // gst_single can take 30+ min on slower nodes; 1h headroom + "phase1-xgmi-lvl1": 1200, + "phase1-pcie-lvl1": 1200, + "phase1-hbm-lvl1": 1200 + }, + "phase2-job-wait-secs": 600, + "phase3-job-wait-secs": 300, + "phase4-pair-wait-secs": 300, + "phase5-mpijob-wait-secs": 300 +} +``` - # On other nodes, bring up workers to join the cluster (run in background) - ./gpu-cluster.sh run agent & - ``` +Tuning advice: -4. **Verify cluster status** +- If you see Phase 1 stages marked TIMEOUT but the persistent log under `/var/log/cluster-validation/__stdout.gz` shows the test still progressing → **raise** the corresponding `phase1-stages-secs` entry. +- If you see Phase 4 `ib-write-bw-crashed` on every rail → **not** a timeout issue; check the persistent log `/var/log/cluster-validation/_phase4-{server,client}_railN.log` (libionic ABI mismatch, missing per-rail NAD, etc.). +- Per-node-interval `resources.node-validation-interval-mins: 30` controls how often a single node re-runs; lower it (e.g. 10) only if you want tighter feedback during bringup. - After bringing up the cluster, login to the server container to check cluster status: +#### 0.7 Advanced — everything else lives in YAML - ```bash - # Login to server container - docker exec -it server bash +`config.json` is intentionally narrow (resources, skip flags, timeouts, images, schedule). For deeper tuning open `configs/cluster-validation-config.yaml` — every field that's overridable via `config.json` carries a `# patchable: ` marker; everything else is a tunable that requires editing the YAML directly. Examples of YAML-only knobs: - # Check all nodes are ready - kubectl get nodes +| YAML key (search in `cluster-validation-config.yaml`) | What it controls | +|---|---| +| `PHASE2_BW_THRESHOLD` | min intra-node RCCL all_reduce GB/s for Phase 2 PASS | +| `PHASE4_BW_THRESHOLD`, `PHASE4_RAIL_COUNT`, `PHASE4_MAX_CONCURRENT_PAIRS`, `PHASE4_IB_DEV_PREFIX`, `PHASE4_NAD_NAME_PREFIX` | Phase 4 bandwidth threshold, rails per pair, parallelism cap, RDMA device prefix, NAD prefix | +| `PHASE3_EXPECTED_NIC_COUNT`, `PHASE3_AMD_NIC_PCI_IDS`, `PHASE3_MIN_GID_COUNT` | Phase 3 NIC count + PCI ID allowlist + minimum GID count | +| `RCCL_ENV_VARS`, `PHASE2_RCCL_ENV_VARS` | NCCL/RCCL env-var blocks for Phase 5 vs single-node Phase 2 | +| `GPU_VALIDATION_STAGES_JSON` | full Phase 1 stage spec (Name, Framework, Recipe, Iterations, TimeoutSeconds, Arguments) — patchable per-field via `config.json`, or replace wholesale here | +| `PHASE45_PREFLIGHT_SCRIPT`, `PHASE5_LAUNCHER_SCRIPT`, `PHASE5_WORKER_SCRIPT` | the actual bash bodies run inside each phase's pods | - # Check all pods are running - kubectl get pods -A +After editing the YAML, push to the live cluster with `ansible-playbook playbooks/cvf.yml -e action=reapply`. - # Exit container - exit +### Step 1 — One-time bringup - # Check cluster validation framework status - ./gpu-cluster.sh status +```bash +cd ansible - # Check per-node validation results - ./gpu-cluster.sh node-status - ``` +# 1a. (only first time) configure passwordless SSH to all nodes +ansible-playbook -i inventory.yml playbooks/setup-ssh-keys.yml --ask-pass --ask-become-pass -5. **Tear down the cluster** +# 1b. bring up the cluster (k3s + operators + CVF CronJob) +ansible-playbook -i inventory.yml playbooks/setup-cluster.yml +``` - ```bash - ./gpu-cluster.sh teardown - ``` +`setup-cluster.yml` builds the Docker image once on the control node, ships the tar to every node, starts the `server` container on the server node, then joins the `agent` containers. Estimated 10-20 min depending on network speed and node count. + +### Step 2 — Day-2 operations via `cvf.yml` + +All CVF runtime operations live in one playbook dispatched by `-e action=`: + +```bash +# After editing configs/*.yaml or config.json, push changes to the live cluster +ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=reapply + +# Clear per-phase node labels + annotations (forces a fresh full run on next trigger) +ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=reset + +# Inject AGFHC pull secret(s) into the live SA (idempotent) +ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=inject-secret + +# Trigger one validation run manually (creates Job 'cvf-test'; default schedule is */30) +ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=trigger + +# Full status dump: CronJob, recent orchestrators, in-flight Jobs, per-node phase labels + failure annotations +ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=status + +# Composite: reapply -> reset (delete_all_jobs=true) -> inject-secret -> trigger -> status +ansible-playbook -i inventory.yml playbooks/cvf.yml -e action=fresh-run +``` + +> **Cron-vs-manual race:** the CronJob's `concurrencyPolicy: Forbid` blocks cron-vs-cron overlap but **not** manual-vs-cron. If a manual `trigger` exceeds the `*/30` schedule, the next cron tick spawns a second orchestrator that fights for the same nodes. For manual runs expected to exceed 30 min, suspend the cron first: +> ```bash +> ansible-playbook playbooks/cvf.yml -e action=reset -e suspend_cronjob=true +> ansible-playbook playbooks/cvf.yml -e action=trigger +> # ...after completion, re-enable: +> docker exec server kubectl patch cronjob cluster-validation-cron-job \ +> -n default --type=merge -p '{"spec":{"suspend":false}}' +> ``` + +See [`ansible/README.md`](ansible/README.md) for the full `cvf.yml` action reference (all flags, idempotency notes, race-condition documentation). + +### Step 3 — Tear down + +```bash +ansible-playbook -i inventory.yml playbooks/teardown-cluster.yml +``` + +## Multi-Phase Validation Pipeline + +The cluster validation framework runs a single CronJob (`cluster-validation-cron-job`) that drives a **5-phase, gated state machine**. Each phase narrows the candidate node pool: a node MUST pass phase N to enter phase N+1. Phases are independently enable-able via the `skip-tests` block in `config.json`, so a datacenter early in bringup can run only the GPU-only phases (1+2) and light up downstream phases as NIC and rail infrastructure arrive. The orchestrator skeleton, the per-phase `run_phaseN` stubs, the `DRY_RUN=1` planning mode, and the shared node-label helper library all live in `configs/cluster-validation-config.yaml`; the CronJob shell lives in `configs/cluster-validation-job.yaml`. + +### Phases, Labels, and Skip Flags + +| Phase | Purpose | `config.json` skip flag | ConfigMap env var | Per-phase node label key | Default | +|-------|---------|-------------------------|-------------------|--------------------------|---------| +| 1 | GPU HW acceptance (RVS / AGFHC via Test Runner) | `skip-gpu-hw-acceptance` | `SKIP_GPU_HW_ACCEPTANCE` | `amd.com/gpu-hw-acceptance` | enabled | +| 2 | Intra-node GPU collective / mesh validation | `skip-gpu-mesh-validation` | `SKIP_GPU_MESH_VALIDATION` | `amd.com/gpu-mesh-validation` | enabled | +| 3 | Per-node NIC health (requires `amd-nic=true`) | `skip-nic-validation` | `SKIP_NIC_VALIDATION` | `amd.com/nic-health` | skipped | +| 4 | Pairwise rail bandwidth | `skip-rail-bandwidth-test` | `SKIP_RAIL_BANDWIDTH_TEST` | `amd.com/rail-bandwidth` | skipped | +| 4.5 | Cross-node connectivity matrix (pre-flight gate before Phase 5) | (shares `skip-rccl-test`) | (shares `SKIP_RCCL_TEST`) | (no new label key; annotation only) | skipped (with Phase 5) | +| 5 | Multi-node RCCL via MPIJob | `skip-rccl-test` | `SKIP_RCCL_TEST` | `amd.com/cluster-validation-status` | skipped | + +On each phase, the per-phase script labels every input node with either `=passed` or `=failed`. On failure it also writes a failure-reason annotation: + +```text +-failure-reason= +``` + +(The suffix `-failure-reason` is published as the ConfigMap constant `PHASE_FAILURE_REASON_ANNOTATION_SUFFIX`.) The phase label is the **only contract** between phases — phase scripts MUST use the `label_phase_passed` / `label_phase_failed` / `annotate_phase_value` helpers from `PHASE_NODE_LABEL_SCRIPT` rather than calling `kubectl label` directly, and they MUST read `PHASE{N}_LABEL_KEY` from the ConfigMap rather than hard-coding key names. + +### Phase 1: Per-Node GPU Hardware Acceptance + +**Scope:** First gate of the pipeline. Per-node, no-network GPU hardware acceptance using a Test Runner image (`rocm/test-runner` by default). The phase catches dead GPUs, thermal issues, HBM errors, PCIe degradation, and xGMI link issues at boot. + +**Result label:** `amd.com/gpu-hw-acceptance=passed|failed`. Only nodes labeled `passed` advance to Phase 2. + +**Multi-stage execution model.** The ROCm test-runner CLI executes only the first `TestCases[]` entry per invocation, so Phase 1 runs one Test Runner Job *per recipe per candidate node*. The orchestrator iterates the configured stages **sequentially per node** with **stop-on-first-failure** semantics: stage S submits one Job per still-alive node *in parallel*, waits for the whole batch, drops any node that failed, then moves to stage S+1. Cross-node parallelism is preserved within each stage; cross-stage execution is gated by the previous stage's per-node result. + +Each stage is fully self-describing in the `GPU_VALIDATION_STAGES_JSON` ConfigMap key — including its own `Image`, allowing RVS and AGFHC test-runner images to be pinned to different versions: + +```yaml +GPU_VALIDATION_STAGES_JSON: | + [ + { "Name": "gpu-stress", "Image": "docker.io/rocm/test-runner:v1.4.0", + "Framework": "RVS", "Recipe": "gst_single", + "Iterations": 1, "TimeoutSeconds": 1800, "Arguments": "--parallel" }, + { "Name": "xgmi-lvl1", "Image": "docker.io/rocm/test-runner:v1.4.0", + "Framework": "AGFHC", "Recipe": "xgmi_lvl1", + "Iterations": 1, "TimeoutSeconds": 300, "Arguments": "" }, + { "Name": "pcie-lvl1", "Image": "docker.io/rocm/test-runner:v1.4.0", + "Framework": "AGFHC", "Recipe": "pcie_lvl1", + "Iterations": 1, "TimeoutSeconds": 300, "Arguments": "" }, + { "Name": "hbm-lvl1", "Image": "docker.io/rocm/test-runner:v1.4.0", + "Framework": "AGFHC", "Recipe": "hbm_lvl1", + "Iterations": 1, "TimeoutSeconds": 300, "Arguments": "" } + ] +``` + +| Field | Purpose | +|-------|---------| +| `Name` | Orchestrator identifier for the stage. Used in per-stage ConfigMap names (`cvf-phase1---`), Job names (`cvf-tr---`), and per-stage annotations. Must be DNS-1123-safe (lowercase alphanumerics + `-`). | +| `Image` | Test Runner image used **for this stage only**. Lets RVS and AGFHC ship independently. | +| `Framework` / `Recipe` / `Iterations` / `TimeoutSeconds` / `Arguments` | Passed verbatim into the runner as a single-element `TestCases[]` payload via a per-stage per-node ConfigMap. | + +Per-stage Job wait budget = `TimeoutSeconds + 120s` (pod-startup slack). There is **no** global `TEST_RUNNER_JOB_WAIT_TIME` — each stage owns its own budget. + +**Default sub-tests.** The shipped `GPU_VALIDATION_STAGES_JSON` defines these four stages, executed in array order: + +| # | Stage `Name` | Framework | Recipe | Purpose | +|---|--------------|-----------|--------|---------| +| 1 | `gst-single` | RVS | `gst_single` | Single-GPU stress (compute + memory) — baseline GPU health | +| 2 | `xgmi-lvl1` | AGFHC | `xgmi_lvl1` | xGMI inter-GPU link integrity (level-1) | +| 3 | `pcie-lvl1` | AGFHC | `pcie_lvl1` | PCIe link integrity, lane width / speed (level-1) | +| 4 | `hbm-lvl1` | AGFHC | `hbm_lvl1` | HBM memory integrity (level-1) | + +**Per-stage annotation scheme.** Independent of the aggregate `amd.com/gpu-hw-acceptance` label, each stage writes its own annotation per node: + +```text +amd.com/gpu-hw-acceptance-stage-=passed|failed +``` + +If a stage fails, subsequent stages are *not* submitted for that node, so no annotation is written for stages after the failing one. The aggregate label is `passed` only when **all** stages pass for the node; on first failure, the aggregate label becomes `failed` and the standard `-failure-reason` annotation is set to `stage-:` (e.g., `stage-hbm-lvl1:subtest-failed:hbm_lvl1`), with `-failed-subtest=` for sub-test failures. -## Usage +**Sample failure annotation output.** When the `hbm-lvl1` stage fails on a node, the prior stages' per-stage annotations are still recorded: + +```text +$ kubectl describe node smc300x-ccs-aus-gpuf268 +Name: smc300x-ccs-aus-gpuf268 +Labels: amd.com/gpu-hw-acceptance=failed + feature.node.kubernetes.io/amd-gpu=true + ... +Annotations: amd.com/gpu-hw-acceptance-stage-gst-single=passed + amd.com/gpu-hw-acceptance-stage-xgmi-lvl1=passed + amd.com/gpu-hw-acceptance-stage-pcie-lvl1=passed + amd.com/gpu-hw-acceptance-stage-hbm-lvl1=failed + amd.com/gpu-hw-acceptance-failure-reason=stage-hbm-lvl1:subtest-failed:hbm_lvl1 + amd.com/gpu-hw-acceptance-failed-subtest=hbm_lvl1 + amd.com/cluster-validation-last-run-timestamp=2026-05-20T14:32:11Z + ... +``` + +Other failure reasons the phase can emit (in the `-failure-reason` annotation, always prefixed with `stage-:`): + +| Reason value | Meaning | +|--------------|---------| +| `stage-:subtest-failed:` | The named sub-test in the stage failed; `` also appears in `-failed-subtest`. | +| `stage-:recipe-not-found` | The AGFHC/RVS recipe is missing from this stage's `Image`. `-failed-subtest=`. Action: upgrade the `Image` field for the affected stage in `GPU_VALIDATION_STAGES_JSON`. | +| `stage-:test-runner-did-not-emit-results` | Job completed but `result.json` is absent. `-failed-subtest=unknown`. | +| `stage-:timeout` | Job exceeded the stage's `TimeoutSeconds + 120s` budget. | +| `stage-:configmap-creation-failed` | `kubectl apply` on the per-stage per-node `GPU_VALIDATION_TESTS_JSON` ConfigMap returned non-zero. | +| `stage-:job-creation-failed` | `kubectl apply` on the Test Runner Job returned non-zero. | +| `phase1-missing-env:` / `phase1-stages-empty-or-invalid` / `phase1-stages-missing-fields` | Preamble validation failed before any stage was submitted; the same reason is written for every input node and no Jobs/ConfigMaps are created. | + +**`SKIP_GPU_HW_ACCEPTANCE` behavior.** Setting `skip-gpu-hw-acceptance: true` in `config.json` (which renders to `SKIP_GPU_HW_ACCEPTANCE=true` in the ConfigMap) makes Phase 1 short-circuit: + +- **No Test Runner Job is created** for any input node. +- Every input node is immediately labeled `amd.com/gpu-hw-acceptance=passed` via the standard `label_phase_passed` helper, so downstream `filter_passed_nodes` calls treat the node as eligible for Phase 2. +- No failure annotation is written, and no `result.json` parsing occurs. + +This short-circuit exists for incremental bringup: when GPU hardware has already been independently validated (e.g., during burn-in) or when the operator wants to exercise Phases 2-5 without re-running the full HW acceptance suite each tick, `SKIP_GPU_HW_ACCEPTANCE=true` keeps the pipeline moving without consuming the ~30-minute Test Runner budget per node. + +### Phase 2: Intra-Node GPU Mesh Validation + +**Scope:** Second gate of the pipeline. Per-node, no-network RCCL `all_reduce_perf` across all 8 local GPUs of a node to stress the xGMI mesh under a real collective workload. One Kubernetes `batch/v1.Job` is created per candidate node that passed Phase 1; the Job is **not** an MPIJob — launcher and worker are co-located in a single pod, and `mpirun --np 8 --host localhost --allow-run-as-root` drives all 8 ranks locally. The phase catches xGMI link failures that only manifest under collective load, GPU mesh issues, and single-GPU instability under stress that Phase 1 single-GPU recipes can miss. + +**Result label:** `amd.com/gpu-mesh-validation=passed|failed`. Only nodes labeled `passed` advance to Phase 3 (or to the per-node label being available for downstream consumers when Phases 3-5 are skipped). + +**Workload.** The container runs the RCCL `all_reduce_perf` benchmark from the `rocm/roce-workload` image (already used by Phase 5), driven by `mpirun` against the 8 GPUs requested via `amd.com/gpu: 8`: + +```bash +mpirun --np 8 --host localhost --allow-run-as-root \ + --mca btl ^vader,openib \ + $PERF_TEST_DIR/all_reduce_perf \ + -b $PHASE2_START_MSG_SIZE -e $PHASE2_END_MSG_SIZE \ + -f $PHASE2_STEP_FACTOR -g 1 \ + -n $PHASE2_ITER_COUNT -w $PHASE2_WARMUP_ITER_COUNT +``` + +After `mpirun` exits, the shared `validate-single-test.sh` parses the standard `Avg bus bandwidth` line from the RCCL output, compares it against `PHASE2_BW_THRESHOLD`, and the per-phase label/annotation helpers write the result. The per-Job template lives in the new `cluster-validation-phase2-job-config` ConfigMap (mounted into the orchestrator's `submit-mpijob` container at `/phase2-configs`); per-node rendering uses the same `sed`-substitution pattern as Phase 1's Test Runner Job. + +**ConfigMap env vars (Phase 2 subset).** Tunable values are sourced from `cluster-validation-config`: + +| Variable | Default | Purpose | +|----------|---------|---------| +| `PHASE2_START_MSG_SIZE` | `1K` | `-b` lower bound for `all_reduce_perf` message-size sweep | +| `PHASE2_END_MSG_SIZE` | `2G` | `-e` upper bound for the sweep | +| `PHASE2_STEP_FACTOR` | `2` | `-f` geometric step factor between message sizes | +| `PHASE2_ITER_COUNT` | `6` | `-n` measured iterations per message size | +| `PHASE2_WARMUP_ITER_COUNT` | `20` | `-w` warmup iterations (excluded from BW measurement) | +| `PHASE2_BW_THRESHOLD` | `200` | Minimum acceptable `Avg bus bandwidth` in GB/s | +| `PHASE2_JOB_WAIT_TIME` | `600` | Per-node Job wait budget in seconds | +| `PHASE2_RCCL_ENV_VARS` | (block) | RCCL/NCCL env vars sourced before `mpirun`; intra-node subset of Phase 5's `RCCL_ENV_VARS` — drops `NCCL_NET_PLUGIN` and all `NCCL_IB_*` tunables (single node, no fabric) and keeps `NCCL_DEBUG=INFO` for diagnostics | + +The `PHASE2_RCCL_ENV_VARS` block deliberately contains **no IB-specific tunables** — Phase 2 exercises only the local xGMI mesh, so any IB/RoCE env var would be misleading at best and a source of false failures at worst. Phase 5 (multi-node MPIJob) is where IB/RoCE settings belong. + +**Sample failure annotation output.** When the bandwidth measurement falls below the configured threshold, the node receives the `failed` label plus the uniform `-failure-reason` annotation: + +```text +$ kubectl describe node smc300x-ccs-aus-gpuf268 +Name: smc300x-ccs-aus-gpuf268 +Labels: amd.com/gpu-hw-acceptance=passed + amd.com/gpu-mesh-validation=failed + feature.node.kubernetes.io/amd-gpu=true + ... +Annotations: amd.com/gpu-mesh-validation-failure-reason=bus-bw-below-threshold + amd.com/gpu-mesh-validation-measured-bw=148.3 + amd.com/cluster-validation-last-run-timestamp=2026-05-20T15:08:42Z + ... +``` + +Failure reasons the phase can emit (in the `-failure-reason` annotation): + +| Reason value | Meaning | +|--------------|---------| +| `bus-bw-below-threshold` | RCCL completed cleanly but measured `Avg bus bandwidth` < `PHASE2_BW_THRESHOLD`. Measured value is also written to the `-measured-bw` annotation. | +| `rccl-crash` | `mpirun` exited non-zero (RCCL aborted, runtime error). Last 50 lines of `/shared/phase2.log` captured in the annotation (truncated to fit the 256 KB annotation budget). | +| `xgmi-init-failure` | RCCL failed during initialization with an xGMI-specific error class — the link came up at boot but cannot carry collective traffic. Inspect `NCCL_DEBUG=INFO` output for the offending peer pair. | +| `timeout` | Job did not complete within `PHASE2_JOB_WAIT_TIME` (typically: scheduler did not grant all 8 GPUs, or `all_reduce_perf` hung). Conservative — the node is retried on the next CronJob tick. | +| `job-creation-failed` | `kubectl apply` of the rendered Job template returned non-zero. Usually a `cluster-validation-phase2-job-config` ConfigMap deploy issue. | + +**`SKIP_GPU_MESH_VALIDATION` behavior.** Setting `skip-gpu-mesh-validation: true` in `config.json` (which renders to `SKIP_GPU_MESH_VALIDATION=true` in the ConfigMap) makes Phase 2 short-circuit: + +- **No Phase 2 Job is created** for any input node. +- Every input node is immediately labeled `amd.com/gpu-mesh-validation=passed` via the standard `label_phase_passed` helper, so downstream `filter_passed_nodes` calls treat the node as eligible for Phase 3 (or for being surfaced as Phase-2-clean when Phases 3-5 are also skipped). +- No failure annotation is written, and no `all_reduce_perf` log parsing occurs. + +This short-circuit exists for the same incremental-bringup reasons as Phase 1: when collective behavior has already been independently validated, or when the operator wants to exercise downstream phases without consuming the per-node Phase 2 budget on every CronJob tick. + +#### Threshold Tuning + +`PHASE2_BW_THRESHOLD` is the single number that decides pass vs. fail when RCCL completes cleanly, so it MUST be tuned for the GPU SKU and topology in use. The default (`200` GB/s) is calibrated for MI300-series xGMI on an 8-GPU node — high enough to catch a single degraded xGMI link, low enough to absorb normal run-to-run variance. + +**How to override.** Change the value in `configs/cluster-validation-config.yaml` and push it to the live cluster with `ansible-playbook playbooks/cvf.yml -e action=reapply` (re-applies + patches all ConfigMaps; no orchestrator code changes are required). Example override snippet: + +```yaml +# configs/cluster-validation-config.yaml +data: + PHASE2_BW_THRESHOLD: "180" # relax to 180 GB/s for a slightly older silicon revision +``` + +The next CronJob tick will pick up the new value via the `envFrom: configMapRef` projection on the Phase 2 Job pod. Already-labeled nodes keep their existing label until the next phase run overwrites it. + +**Expected ranges by GPU SKU.** Use these as starting points; always validate against a measured baseline on healthy hardware before locking in a production threshold. + +| GPU SKU | xGMI generation | Typical `all_reduce_perf` `Avg bus bandwidth` (8-GPU, message sizes 1K-2G) | Suggested `PHASE2_BW_THRESHOLD` | +|---------|-----------------|----------------------------------------------------------------------------|---------------------------------| +| MI300X / MI300A | xGMI 3 (4-link, 128 GB/s/link bi-dir) | 220-260 GB/s | `200` (default) | +| MI250 / MI250X | xGMI 2 (8-link, 50-100 GB/s/link bi-dir, ring topology) | 140-180 GB/s | `120` | +| MI210 | xGMI 2 (3-link per pair) | 60-90 GB/s | `50` | +| MI100 | xGMI 1 (3-link per pair) | 40-60 GB/s | `35` | + +**Tuning workflow:** + +1. **Measure baseline.** On known-good hardware, set `PHASE2_BW_THRESHOLD: "1"` (effectively pass-through) and let one CronJob tick run. Read the actual `Avg bus bandwidth` from the Job pod logs or from `/var/log/cluster-validation/`. +2. **Set the threshold ~10-15% below the baseline.** Tight enough to flag a degraded link (typically one bad xGMI link drops the figure by 20-30%), loose enough to absorb normal variance. +3. **Re-test on the same node with the new threshold.** Confirm `amd.com/gpu-mesh-validation=passed`. +4. **Inject a known-bad threshold** (`PHASE2_BW_THRESHOLD: "9999"`) on one tick to confirm the failure path labels and annotates correctly (`failed-reason=bus-bw-below-threshold`, measured value recorded), then revert. + +If `Avg bus bandwidth` is consistently below the suggested-range floor on healthy hardware, suspect a topology or firmware issue (xGMI link not training at full width, BIOS/SBIOS misconfiguration, ROCm/RCCL version mismatch with the silicon) rather than relaxing the threshold further. + +### Phase 3: Per-Node NIC Health Check + +**Scope:** Third gate of the pipeline. Per-node, structural-only RDMA NIC readiness check (no traffic generation) running on every node that passed Phase 2 **and** carries the `feature.node.kubernetes.io/amd-nic=true` label. One Kubernetes `batch/v1.Job` is created per candidate node; the Job runs the `${ROCE_WORKLOAD_IMAGE}` image (the same image pinned in `cluster-validation-config` and reused by Phases 2, 4, and 5 — see [Image switch note](#image-switch-phase-3-now-uses-roce_workload_image) below), requests all `amd.com/nic` device-plugin resources on the node (so the pod sees every NIC), and self-labels its own node via an in-pod `kubectl` against the `cluster-validation-sa` ServiceAccount. The phase catches missing NICs, drivers not loaded, bad cables, switch ports down, RoCE misconfiguration, and driver/firmware version skew before downstream phases attempt RDMA traffic. + +**Result label:** `amd.com/nic-health=passed|failed`. Only nodes labeled `passed` advance to Phase 4 (or to the per-node label being available for downstream consumers when Phases 4-5 are skipped). The phase is a point-in-time structural check — link flaps that develop later are caught by Phase 4 (real RDMA traffic). + +**Pre-flight + 5 structural checks.** Each Phase 3 Job runs `PHASE3_CHECK_SCRIPT` from the `cluster-validation-config` ConfigMap. The script first runs a **pre-flight self-check** (invokes `ibv_devinfo` once; non-zero exit short-circuits with `preflight-failed:ibv_devinfo=` so an `ROCE_WORKLOAD_IMAGE` ABI mismatch surfaces as a single clean failure rather than cascading into confused check-3/check-4 errors). Pre-flight is followed by five independent signals; the script aggregates failures across all five before labeling: + +| # | Check | Tool | Pass criterion | +|---|-------|------|----------------| +| 1 | NIC enumeration | `lspci -Dnn` filtered by `$PHASE3_AMD_NIC_PCI_IDS` (comma-separated `vendor:device` allowlist) | Count equals `PHASE3_EXPECTED_NIC_COUNT` (default `8`) | +| 2 | Link state | `/sys/class/net//operstate` over every netdev whose `device/driver` symlink points at `ionic` | Every ionic interface reports `up` | +| 3 | RDMA link state | `rdma link show` | Every device reports state `ACTIVE` | +| 4 | GID table | `ibv_devinfo -d -v` | Each device responds and reports `>= PHASE3_MIN_GID_COUNT` GID entries (default `1`) | +| 5 | Firmware ↔ workload-image alignment | `cat /sys/class/infiniband/ionic_/fw_ver` per ionic IB device, substring-matched against `$ROCE_WORKLOAD_IMAGE` (sed-substituted into the pod env at Job render time) | The observed firmware-version string appears as a substring of the workload image reference. Gated by `PHASE3_DRIVER_FW_CHECK_ENABLED` and `PHASE3_DRIVER_FW_STRICT`; "no data" (env unset OR no readable `fw_ver`) → SKIP, not FAIL | + +The script does **not** short-circuit on the first failing check (Checks 1–5) — it runs all of them, then aggregates the failure reasons and the offending NIC names into annotations. Pre-flight is the one exception: pre-flight failure short-circuits because every downstream check would be tainted by the same ABI/tool fault. This keeps the failure record actionable: an operator inspecting a `failed` node sees every distinct fault on the node, not just the first one tripped. + +**ConfigMap env vars (Phase 3 subset).** Tunable values are sourced from `cluster-validation-config`: + +| Variable | Default | Purpose | +|----------|---------|---------| +| `PHASE3_EXPECTED_NIC_COUNT` | `8` | Required PCI device count (Check 1). Adjust for nodes with non-default NIC populations. | +| `PHASE3_AMD_NIC_PCI_IDS` | `1dd8:1002` | Comma-separated PCI `vendor:device` allowlist matched against the `[vid:did]` tag from `lspci -nn` for Check 1. **`1dd8:1002` is the Pensando DSC Ethernet Controller PF** — the canonical NIC identifier for the AMD Pensando DSC fleet (DSC-25, DSC-200, Salina). Append additional pairs (e.g. `"1dd8:1002,1dd8:1003"`) to cover a future PF revision or a second SKU. Override only when validating non-Pensando NICs. | +| `PHASE3_MIN_GID_COUNT` | `1` | Minimum GID table entries per RDMA device (Check 4). A device with zero GIDs cannot carry RoCE traffic. | +| `PHASE3_JOB_WAIT_TIME` | `120` | Per-node Job wait budget in seconds. A Job still pending past this budget is treated as `nic-not-allocated`. | +| `PHASE3_ANNOTATION_MAX_BYTES` | `250` | Maximum bytes of `failure-reason` / `failed-nics` annotation values (truncated to fit Kubernetes' 256 KB total-annotation budget on a node). | +| `PHASE3_DRIVER_FW_CHECK_ENABLED` | `"true"` | Gates Check 5 (firmware ↔ workload-image alignment). Set to `"false"` to skip the check — Phase 3 then runs only Checks 1–4 and emits the marker without the `observed_fw=` field. The pre-flight `ibv_devinfo` self-check still runs regardless. | +| `PHASE3_DRIVER_FW_STRICT` | `"true"` | When `"true"`, any per-NIC firmware-version that does NOT appear as a substring of `$ROCE_WORKLOAD_IMAGE` **fails** the node with `fw-image-mismatch:=/image=`. When `"false"`, mismatches surface only via the `-observed-fw` annotation and the check is informational. "No data" (missing env or unreadable sysfs) is always treated as SKIP, never a fail. | + +**Pensando PCI device ID configuration.** Check 1 (NIC enumeration) counts the lines from `lspci -Dnn` whose trailing `[vendor:device]` tag matches one of the entries listed in `PHASE3_AMD_NIC_PCI_IDS` (comma-separated). The default `1dd8:1002` is the Pensando "DSC Ethernet Controller" PF — the device ID the kernel `ionic` driver binds to and the canonical identifier for a NIC PF on every AMD Pensando NIC variant currently shipping in the fleet (DSC-25, DSC-200, Salina). Because each Pensando card exposes six PCI functions per vendor (two PCI bridges, three Processing-accelerator sub-functions, and one Ethernet controller PF), a vendor-only filter would over-count by ~6x and SR-IOV would inflate it further; the device-ID allowlist is the precise, drift-immune signal. + +To confirm on a target node: + +```bash +$ lspci -Dnn | grep -E '\[1dd8:1002\]' | head +0000:05:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] +0000:06:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] +... +``` + +If the count is zero on a node that visibly has Pensando NICs, suspect (a) `lspci` not installed in the container image (regression), (b) NICs not yet enumerated by the kernel (boot-order race — Phase 3 reschedules on the next CronJob tick), or (c) a new SKU exposes a different device ID — extend the allowlist to e.g. `PHASE3_AMD_NIC_PCI_IDS: "1dd8:1002,1dd8:1003"`. For cross-vendor evaluation (e.g. Mellanox/NVIDIA ConnectX) set the full list, such as `PHASE3_AMD_NIC_PCI_IDS: "15b3:1019"`. + +#### Image switch: Phase 3 uses `${ROCE_WORKLOAD_IMAGE}` + +Phase 3's Job container image is `${ROCE_WORKLOAD_IMAGE}` — the same image already pinned in `cluster-validation-config` for Phases 2, 4, and 5. One image pin, one bump for all four phases, and the image ships `ibv_devinfo` + `rdma` + `ethtool` + the `/sys/class/infiniband/ionic_*/fw_ver` sysfs view which is all the checks need. + +The centralization is also the main risk surface. The `rocm/roce-workload` image is rebuilt frequently, and at least one historical tag (`...ainic-1.117.1-a-63`) shipped `libionic 54.0-149` ABI v2 while the validation cluster's host `ionic_rdma` module spoke ABI v1 — every Phase 4 `ib_write_bw` rail failed to initialize verbs until the host driver was upgraded. A future ABI-incompatible tag would cause Checks 3 and 4 (`rdma link show`, `ibv_devinfo`) to fail with confusing per-NIC errors on every node simultaneously. The pre-flight `ibv_devinfo` self-check at the top of `PHASE3_CHECK_SCRIPT` exists for this case: non-zero exit short-circuits the script with `preflight-failed:ibv_devinfo=` as the single reason. Remediation: revert `ROCE_WORKLOAD_IMAGE` to a known-good tag in `cluster-validation-config` and wait for the next CronJob tick. + +#### Check 5: firmware ↔ workload-image alignment + +Check 5 verifies each NIC's running firmware is the one the workload image was qualified against. It reads `/sys/class/infiniband/ionic_/fw_ver` (populated by the driver from the NIC's admin command at probe time) and substring-matches the value against `$ROCE_WORKLOAD_IMAGE`. The image tag itself — e.g. `...ainic-1.117.5-a-56` — is the contract: no operator-curated allowlist, no compat-map maintenance. Bumping the firmware on a node without also bumping the image (or vice versa) trips the check. + +"No data" cases are always SKIP, never FAIL: +- `$ROCE_WORKLOAD_IMAGE` unset (orchestrator render failed) → `CHECK 5 SKIP: ROCE_WORKLOAD_IMAGE env not set` +- No `fw_ver` readable for any ionic device → `CHECK 5 SKIP: no fw_ver readable under /sys/class/infiniband/*/` + +Only an actual mismatch (we read fw + we read image + strings differ) fails in strict mode. Set `PHASE3_DRIVER_FW_CHECK_ENABLED=false` to skip the check unconditionally (the marker stays byte-for-byte compatible with the pre-Check-5 contract — no `observed_fw=` field). + +**Sample failure annotation output.** When one or more checks fail, the node receives the `failed` label plus two annotations: the uniform `-failure-reason` (per-phase contract, comma-separated list of reason tokens) and the Phase-3-specific `-failed-nics` (comma-separated list of offending NIC names, used by operators to localize the fault): + +```text +$ kubectl describe node smc300x-ccs-aus-gpuf268 +Name: smc300x-ccs-aus-gpuf268 +Labels: amd.com/gpu-hw-acceptance=passed + amd.com/gpu-mesh-validation=passed + amd.com/nic-health=failed + feature.node.kubernetes.io/amd-gpu=true + feature.node.kubernetes.io/amd-nic=true + ... +Annotations: amd.com/nic-health-failure-reason=rdma-state:rocep5s0=DOWN,link-state:rocep5s0=DOWN + amd.com/nic-health-failed-nics=rocep5s0 + amd.com/cluster-validation-last-run-timestamp=2026-05-20T15:42:08Z + ... +``` + +The example above shows the canonical "one cable unplugged" failure: NIC `rocep5s0` fails both Check 2 (`link-state:rocep5s0=DOWN`) and Check 3 (`rdma-state:rocep5s0=DOWN`), and both reasons are aggregated in `-failure-reason`. The `-failed-nics` annotation lists `rocep5s0` once even though it tripped two checks — operators can `ssh` to the node and inspect that single device. A multi-NIC failure would render as, e.g.: + +```text +amd.com/nic-health-failure-reason=link-state:rocep5s0=DOWN,gid-table:rocep5s0=0,rdma-state:rocep7s0=INIT +amd.com/nic-health-failed-nics=rocep5s0,rocep7s0 +``` + +**Worked Check 5 example — firmware/image mismatch.** On a node where `ionic0` reports firmware `1.117.4` but the workload image pinned in `cluster-validation-config` is `...ainic-1.117.5-a-56`, the node is labeled `failed` and receives the three annotations below. The `-observed-fw` annotation is written **whenever Check 5 ran**, regardless of pass or fail outcome, so operators always see the cluster's full per-NIC firmware inventory: + +```text +amd.com/nic-health=failed +amd.com/nic-health-failure-reason=fw-image-mismatch:ionic0=1.117.4/image=...ainic-1.117.5-a-56 +amd.com/nic-health-failed-nics=ionic0 +amd.com/nic-health-observed-fw=ionic0=1.117.4,ionic1=1.117.5-a-56,ionic2=1.117.5-a-56,ionic3=1.117.5-a-56,ionic4=1.117.5-a-56,ionic5=1.117.5-a-56,ionic6=1.117.5-a-56,ionic7=1.117.5-a-56 +``` + +Only `ionic0` mismatches (every other NIC's firmware appears as a substring of the image tag), so only `ionic0` is in `-failed-nics`. When `PHASE3_DRIVER_FW_CHECK_ENABLED=false`, `-observed-fw` is omitted from the marker and the orchestrator preserves any previously-recorded value on the node (last-known-good). + +Failure reasons the phase can emit (in the `-failure-reason` annotation): + +| Reason token | Source | Meaning | +|--------------|--------|---------| +| `nic-count:expected=,actual=` | Check 1 | `lspci -Dnn` filtered by `$PHASE3_AMD_NIC_PCI_IDS` returned `` devices, not the required ``. Indicates missing NIC, driver not loaded, or a new PF device ID that needs to be appended to `PHASE3_AMD_NIC_PCI_IDS`. | +| `link-state:=` | Check 2 | Interface `` reports `` (e.g., `DOWN`, `UNKNOWN`) instead of `UP`. Indicates bad cable, switch-port down, or admin-disabled interface. `` also appears in `-failed-nics`. | +| `rdma-state:=` | Check 3 | RDMA device `` reports `` (e.g., `DOWN`, `INIT`, `ARMED`) instead of `ACTIVE`. Indicates RoCE not configured, peer not up, or fabric-level fault. `` also appears in `-failed-nics`. | +| `ibv-devinfo:=unresponsive` | Check 4 | `ibv_devinfo -d ` failed to return. Indicates a kernel-level RDMA verbs failure on that device. `` also appears in `-failed-nics`. | +| `gid-table:=` | Check 4 | RDMA device `` reports `` GID entries, below `PHASE3_MIN_GID_COUNT`. A device with zero GIDs cannot carry RoCE traffic. Typically caused by IP not yet assigned to the RDMA interface. | +| `preflight-failed:ibv_devinfo=` | Pre-flight | The first invocation of `ibv_devinfo` inside the Phase 3 pod returned non-zero. Short-circuits the script (Checks 1–5 skipped). Almost always indicates `${ROCE_WORKLOAD_IMAGE}` was bumped to a tag whose userspace libraries are ABI-incompatible with the host kernel modules (canonical example: `libionic 54.0-149` ABI v2 against host `ionic_rdma` ABI v1). Remediation: revert `ROCE_WORKLOAD_IMAGE` to a known-good tag and wait for the next CronJob tick. No `-failed-nics` is written. | +| `fw-image-mismatch:=/image=` | Check 5 | RDMA device `` reports firmware ``, but that string does NOT appear as a substring of the workload image reference ``. Under `PHASE3_DRIVER_FW_STRICT=true` (default) the node fails; under `PHASE3_DRIVER_FW_STRICT=false` the `-observed-fw` annotation is still written but the node is not failed. Remediation: bump `ROCE_WORKLOAD_IMAGE` to a tag built against the running firmware, OR re-flash the NIC to the firmware version the current image was qualified against. `` also appears in `-failed-nics`. | +| `job-creation-failed` | Orchestrator | `kubectl apply` of the rendered Phase 3 Job template returned non-zero. Usually a `cluster-validation-phase3-job-config` ConfigMap deploy issue. No node-level `-failed-nics` is written in this case. | +| `nic-not-allocated` | Orchestrator | Job pod stayed `Pending` past `PHASE3_JOB_WAIT_TIME` — the scheduler could not grant the requested `amd.com/nic: ` devices, typically because the `amd.com/nic` device plugin has not advertised resources, the node has fewer NICs than `PHASE3_EXPECTED_NIC_COUNT`, or admission was rejected. No node-level `-failed-nics` is written; inspect `kubectl describe pod` for the rejection reason. | + +The annotation keys follow the cross-phase convention defined: `` for `-failure-reason`, and the Phase-3-specific `-failed-nics` for the offending-device list. Both annotation values are truncated to `PHASE3_ANNOTATION_MAX_BYTES` (default `250`) bytes to stay within Kubernetes' 256 KB total-annotation budget per node. + +**`SKIP_NIC_VALIDATION` behavior.** Setting `skip-nic-validation: true` in `config.json` (which renders to `SKIP_NIC_VALIDATION=true` in the ConfigMap) makes Phase 3 short-circuit: + +- **No Phase 3 Job is created** for any input node — neither for nodes carrying `amd-nic=true` nor for nodes without the NIC label. +- Every input node is immediately labeled `amd.com/nic-health=passed` via the standard `label_phase_passed` helper, so downstream `filter_passed_nodes` calls treat the node as eligible for Phase 4 (or for being surfaced as Phase-3-clean when Phases 4-5 are also skipped). +- No failure annotation is written, no `lspci` / `rdma` / `ibv_devinfo` invocations occur, and the orchestrator does **not** intersect the input pool with `amd-nic=true` (the intersection only runs when Phase 3 is actually executed). +- The `cluster-validation-phase3-job-config` ConfigMap is still mounted into the orchestrator container, but the Job template is never rendered. + +`skip-nic-validation: true` is the default in the shipped `config.json` (see the [Incremental Bringup Workflow](#incremental-bringup-workflow) below) because Phase 3 requires both NIC hardware and the `feature.node.kubernetes.io/amd-nic=true` label, neither of which is guaranteed during the GPU-only Day-1 bringup posture. Flip the flag to `false` once the network-operator has rolled out and `kubectl get nodes -l feature.node.kubernetes.io/amd-nic=true` returns the expected NIC-capable node set. + +### Phase 4: Pairwise Rail Bandwidth Test + +**Scope:** Fourth gate of the pipeline. Pairwise per-rail `ib_write_bw` between every pair of nodes that passed Phase 3, one rail (0.7) at a time per pair, with a default threshold of 380 Gbps. The phase isolates failures down to the `{pair, rail}` granularity so an operator inspecting a `failed` node sees exactly which rails fell below threshold and against which peer. Two Kubernetes `batch/v1.Job`s (server + client) are created per `(pair, rail)` instance from the `cluster-validation-phase4-job-config` ConfigMap; both Jobs run the `rocm/roce-workload` image (already used by Phase 5) and request a single NIC at a specific rail index via the per-rail `NetworkAttachmentDefinition` (`amd-host-device-nad-rail-${RAIL_IDX}`). The phase catches bad TOR ports, MTU mismatches, cable issues, and per-rail driver glitches that only manifest under real RDMA traffic — failure modes Phase 3's structural check cannot see. + +**Result label:** `amd.com/rail-bandwidth=passed|failed`. A node is labeled `passed` iff **every** rail it participated in measured at or above `PHASE4_BW_THRESHOLD` against **every** paired neighbor it was tested with. Only nodes labeled `passed` advance to Phase 4.5 (cross-node connectivity matrix) and Phase 5 (multi-node RCCL). + +**Pairing model.** Phase-3-passed nodes are sorted lexicographically and paired via round-robin: `node[0]<->node[1]`, `node[2]<->node[3]`, and so on. With an odd-count input, the last node has no peer to test against and is short-circuited to `passed` with an `unpaired=true` annotation (see [Unpaired-Node Behavior](#unpaired-node-behavior) below); the design intentionally trades partial coverage for forward progress, because downstream Phase 4.5 enforces a 2-node minimum anyway. Pairs run in parallel (capped by `PHASE4_MAX_CONCURRENT_PAIRS`); rails are serialized **within** a pair so the same NIC is never double-claimed by two concurrent `ib_write_bw` flows. + +**Workload.** For each `(pair, rail)` instance, `PHASE4_DRIVER_SCRIPT` (a) `sed`-renders the server Job template (pinned to node A, requests `amd-host-device-nad-rail-${RAIL_IDX}`), (b) waits for the server pod to publish its pod IP, (c) `sed`-renders the client Job template (pinned to node B, gets `$PEER_POD_IP` substituted in), (d) waits for both Jobs up to `PHASE4_PAIR_WAIT_TIME`, (e) parses `BW average` Gbps from the client pod's log, and (f) records the measurement under `$PHASE4_STATE_DIR/results//rail-` for both endpoints. Inside the container, `ib_write_bw` is invoked as: + +```bash +# server side (node A) — listens for one client connection on the rail's RDMA device +DEV="${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}" # e.g., rdma_dev_3 for rail 3 +ib_write_bw -d "$DEV" -i 1 -F + +# client side (node B) — connects to the server pod IP on the same rail +DEV="${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}" +ib_write_bw -d "$DEV" -i 1 -F "$PEER_POD_IP" +``` + +After all pair-runners complete, the driver aggregates per-node from the on-disk state and writes the labels and annotations via the standard `label_phase_passed` / `label_phase_failed` / `annotate_phase_value` helpers (per the design contract — never `kubectl label` directly). + +**ConfigMap env vars (Phase 4 subset).** Tunable values are sourced from `cluster-validation-config`: + +| Variable | Default | Purpose | +|----------|---------|---------| +| `PHASE4_RAIL_COUNT` | `8` | Number of rails tested per pair. Iterated `0.(PHASE4_RAIL_COUNT - 1)` serially within each pair. Lowering this (e.g., `4`) is the supported way to skip the upper rails on hardware with fewer than 8 NICs per node. | +| `PHASE4_BW_THRESHOLD` | `380` | Minimum acceptable `ib_write_bw` `BW average` in **Gbps**. Per-rail comparison is a float compare via `awk` (bash arithmetic does not handle `388.42 >= 380`). See [Threshold Tuning](#threshold-tuning-phase-4) for guidance. | +| `PHASE4_PAIR_WAIT_TIME` | `180` | Per-(pair, rail) wallclock wait budget in seconds. Half the budget is spent waiting for the server pod IP; the other half covers client startup, handshake, and the actual transfer. Total per-pair runtime is bounded by `PHASE4_RAIL_COUNT * PHASE4_PAIR_WAIT_TIME` (rails serialized within a pair). | +| `PHASE4_MAX_CONCURRENT_PAIRS` | `8` | Cap on parallel pair-runners. Bounded to keep the kube API server happy on large clusters — see [Concurrency-Cap Tuning](#concurrency-cap-tuning) below. | +| `PHASE4_NAD_NAME_PREFIX` | `amd-host-device-nad-rail-` | Prefix used to construct the per-rail `NetworkAttachmentDefinition` name as `${PHASE4_NAD_NAME_PREFIX}${RAIL_IDX}`. The 8 NADs (`amd-host-device-nad-rail-0` through `-7`) are deployed automatically by `setup-cluster.yml` (and re-applied by `cvf.yml -e action=reapply`) from `configs/nad-per-rail.yaml` — no manual `kubectl apply` step required. Override only if the fleet ships NADs under a different naming scheme out-of-band. | +| `PHASE4_IB_DEV_PREFIX` | `rdma_dev_` | Prefix used to select the RDMA device inside the pod as `${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}` for `ib_write_bw -d`. Override only if a different image surfaces RDMA devices under a different naming scheme. | + +**Sample annotation output (passing node).** When every rail on every paired neighbor measured at or above `PHASE4_BW_THRESHOLD`, the node is labeled `passed`, every `rail-N` annotation carries its measured Gbps, and the diagnostic `peer` annotation records the last peer the node was paired with: + +```text +$ kubectl describe node smc300x-ccs-aus-gpuf268 +Name: smc300x-ccs-aus-gpuf268 +Labels: amd.com/gpu-hw-acceptance=passed + amd.com/gpu-mesh-validation=passed + amd.com/nic-health=passed + amd.com/rail-bandwidth=passed + feature.node.kubernetes.io/amd-gpu=true + feature.node.kubernetes.io/amd-nic=true + ... +Annotations: amd.com/rail-bandwidth-rail-0=388 + amd.com/rail-bandwidth-rail-1=391 + amd.com/rail-bandwidth-rail-2=387 + amd.com/rail-bandwidth-rail-3=385 + amd.com/rail-bandwidth-rail-4=390 + amd.com/rail-bandwidth-rail-5=389 + amd.com/rail-bandwidth-rail-6=386 + amd.com/rail-bandwidth-rail-7=388 + amd.com/rail-bandwidth-peer=smc300x-ccs-aus-gpuf269 + amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:14:55Z + ... +``` + +**Sample failure annotation output.** When one or more rails fail (here: rail 5 measured 180 Gbps and rail 7 lost its peer pod before the client could attach), the node receives the `failed` label, the uniform `-failure-reason` annotation (a CSV-encoded summary), the Phase-4-specific `-failed-rails` annotation (compact CSV of offending rail indices), and the per-rail `-rail-N` annotations preserve whatever measurement was recorded — including the bad value, so an operator can see the magnitude of the regression: + +```text +$ kubectl describe node smc300x-ccs-aus-gpuf268 +Name: smc300x-ccs-aus-gpuf268 +Labels: amd.com/gpu-hw-acceptance=passed + amd.com/gpu-mesh-validation=passed + amd.com/nic-health=passed + amd.com/rail-bandwidth=failed + feature.node.kubernetes.io/amd-gpu=true + feature.node.kubernetes.io/amd-nic=true + ... +Annotations: amd.com/rail-bandwidth-failure-reason=failed-rails:5,7 + amd.com/rail-bandwidth-failed-rails=5,7 + amd.com/rail-bandwidth-rail-0=388 + amd.com/rail-bandwidth-rail-1=391 + amd.com/rail-bandwidth-rail-2=387 + amd.com/rail-bandwidth-rail-3=385 + amd.com/rail-bandwidth-rail-4=390 + amd.com/rail-bandwidth-rail-5=180 + amd.com/rail-bandwidth-rail-6=386 + amd.com/rail-bandwidth-peer=smc300x-ccs-aus-gpuf269 + amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:18:02Z + ... +``` + +The `-failure-reason` annotation always uses the `failed-rails:` form for per-rail failures — keeping the failure summary compact for the dashboard view while delegating the offending-rail list to the dedicated `-failed-rails` annotation. The `-rail-N` annotations for failed rails carry the **measured** Gbps (`180` in the example) when a parse succeeded; when no measurement could be recorded (crash, parse failure, peer unready), the `-rail-N` annotation is simply omitted and the rail index appears in `-failed-rails`. + +**Per-(pair, rail) failure reasons.** Each rail's outcome is classified by `PHASE4_DRIVER_SCRIPT` from the client Job's terminal state and log content. The reason is persisted to `$PHASE4_STATE_DIR/results//rail-.reason` and ultimately rolls up into the `failed-rails:` summary on the node: + +| Reason token (per rail) | Meaning | +|-------------------------|---------| +| `below-threshold:` | Client Job completed cleanly and a `BW average` line was parsed, but the value `` is below `PHASE4_BW_THRESHOLD`. The measured value is also written to `rail-` so an operator sees the magnitude. | +| `parse-failed` | Client Job completed cleanly but the log contained no parseable `BW average` line — usually means the per-rail RDMA device did not respond or the client exited too early. | +| `ib-write-bw-crashed` | Client Job ended `Failed` and no `BW average` line was emitted — `ib_write_bw` aborted (bad CLI args, missing device, RDMA verbs error). | +| `peer-pod-unready` | Either the server pod never published a pod IP within half of `PHASE4_PAIR_WAIT_TIME`, or the client Job timed out before reaching a terminal state. Typically caused by NAD attach hangs or scheduler back-pressure. | +| `nad-missing` | The requested `amd-host-device-nad-rail-${RAIL_IDX}` does not exist — pod admission rejected by the NAD admission webhook. Action: verify per-rail NADs are deployed (`kubectl get net-attach-def -A \| grep amd-host-device-nad-rail-`). The NADs are normally installed automatically by `setup-cluster.yml` from `configs/nad-per-rail.yaml`; re-apply with `ansible-playbook playbooks/cvf.yml -e action=reapply` if they were deleted out-of-band, and confirm the Multus CRD (`networkattachmentdefinitions.k8s.cni.cncf.io`) exists. | +| `job-creation-failed` | `kubectl apply` of the rendered server or client Job template returned non-zero for a non-throttling reason (RBAC denial, malformed manifest, control-plane error). | +| `api-throttled` | `kubectl apply` returned HTTP 429 and the driver's exponential back-off retry budget (3 attempts) was exhausted. Remaining rails for that pair are marked with this reason. | +| `render-failed` | `sed` substitution of the server or client Job template failed to produce a valid manifest — usually a stale `cluster-validation-phase4-job-config` ConfigMap missing a substitution placeholder. | +| `driver-state-dir-failed` | The driver could not create `/tmp/phase4--$$` as its on-disk state directory. Every input node fails Phase 4 in this scenario — surface as a CronJob filesystem regression. | + +#### Unpaired-Node Behavior + +When the Phase-3-passed input pool has an odd count, the lexicographically-last node has no peer to test against. Rather than failing the node (which would be punitive for a configuration issue outside the node's control) or holding it back from downstream consumers (Phase 4.5 would simply drop a single node by its 2-node minimum), `PHASE4_DRIVER_SCRIPT` short-circuits the unpaired node to `passed`: + +```text +$ kubectl describe node smc300x-ccs-aus-gpuf270 # the odd one out +Name: smc300x-ccs-aus-gpuf270 +Labels: amd.com/rail-bandwidth=passed + ... +Annotations: amd.com/rail-bandwidth-unpaired=true + amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:14:55Z + ... +``` + +No Jobs are created for the unpaired node, no per-rail measurements are recorded, and the diagnostic `unpaired=true` annotation lets dashboards and downstream phases distinguish a "passed because not tested" node from a "passed because every rail met threshold" node. Phase 4.5 (cross-node connectivity matrix) enforces an `N >= 2` minimum and will drop the unpaired node from the matrix on the same tick. + +The single-node testbed (`smc300x-ccs-aus-gpuf268`) always exercises this path — every Phase 4 run on a one-node cluster ends in an immediate unpaired pass-label with no `ib_write_bw` Jobs created. Real pairwise validation is deferred to a 2-node testbed in production. + +**`SKIP_RAIL_BANDWIDTH_TEST` behavior.** Setting `skip-rail-bandwidth-test: true` in `config.json` (which renders to `SKIP_RAIL_BANDWIDTH_TEST=true` in the ConfigMap) makes Phase 4 short-circuit: + +- **No `ib_write_bw` Jobs are created** for any pair — neither server nor client. +- Every input node is immediately labeled `amd.com/rail-bandwidth=passed` via the standard `label_phase_passed` helper, so downstream `filter_passed_nodes` calls treat the node as eligible for Phase 4.5 and Phase 5. +- No pairing is computed, no per-rail annotations are written, no `unpaired=true` annotation is written, and no failure annotation is written. +- The `cluster-validation-phase4-job-config` ConfigMap is still mounted at `/phase4-configs` inside the orchestrator container, but neither the server nor the client Job template is ever rendered. + +`skip-rail-bandwidth-test: true` is the default in the shipped `config.json` (see the [Incremental Bringup Workflow](#incremental-bringup-workflow) below) because Phase 4 requires (a) a 2-node minimum to exercise any real pairing, (b) per-rail NADs (`amd-host-device-nad-rail-0` through `-7`) deployed in the cluster, and (c) Phase 3 actually running to produce a non-empty input pool. Flip the flag to `false` once those prerequisites are in place and the network fabric is stable enough that the pairwise threshold check is meaningful. + +#### Threshold Tuning (Phase 4) + +`PHASE4_BW_THRESHOLD` is the single Gbps number that decides pass vs. fail per `(pair, rail)` when `ib_write_bw` completes cleanly, so it MUST be tuned for the NIC SKU, link speed, and fabric topology in use. The default (`380` Gbps) is calibrated for AMD Pensando DSC-200 (200 GbE per port, single-port-per-rail attachment, 400 Gbps line rate) — high enough to catch a single degraded link that drops a rail to ~half rate, low enough to absorb the small framing and orchestration overhead vs. wire speed. + +**How to override.** Change the value in `configs/cluster-validation-config.yaml` and push it to the live cluster with `ansible-playbook playbooks/cvf.yml -e action=reapply` (re-applies + patches all ConfigMaps; no orchestrator code changes are required). Example override snippet: + +```yaml +# configs/cluster-validation-config.yaml +data: + PHASE4_BW_THRESHOLD: "180" # relax to 180 Gbps for a 200 GbE / single-port fabric +``` + +The next CronJob tick will pick up the new value via the `envFrom: configMapRef` projection on the orchestrator container. Already-labeled nodes keep their existing label until the next phase run overwrites it. + +**Expected ranges by NIC / fabric class.** Use these as starting points; always validate against a measured baseline on healthy hardware before locking in a production threshold. + +| NIC class | Per-rail line rate | Typical `ib_write_bw` `BW average` | Suggested `PHASE4_BW_THRESHOLD` | +|-----------|--------------------|------------------------------------|---------------------------------| +| Pensando DSC-200 (2 × 200 GbE) | 400 Gbps | 390-395 Gbps | `380` (default) | +| Pensando DSC-25 (2 × 25 GbE) | 50 Gbps | 46-49 Gbps | `45` | +| 200 GbE single-port-per-rail | 200 Gbps | 190-198 Gbps | `180` | +| 100 GbE single-port-per-rail | 100 Gbps | 92-98 Gbps | `90` | + +**Tuning workflow:** + +1. **Measure baseline.** On known-good hardware, set `PHASE4_BW_THRESHOLD: "1"` (effectively pass-through) and let one CronJob tick run on a 2-node pool. Read the per-rail `rail-N` annotations on either node — those are the actual measured Gbps values. +2. **Set the threshold ~3-5% below the baseline.** Tight enough to flag a degraded link (a single bad lane on a 4-lane link typically drops the figure by 25%), loose enough to absorb normal run-to-run variance and small orchestration overhead. +3. **Re-test the same pair with the new threshold.** Confirm `amd.com/rail-bandwidth=passed` on both nodes and that every `rail-N` annotation is recorded. +4. **Inject a known-bad threshold** (`PHASE4_BW_THRESHOLD: "9999"`) on one tick to confirm the failure path labels and annotates correctly (`failure-reason=failed-rails:0,1,.,7`, `failed-rails=0,1,.,7`, every `rail-N` carrying the measured value), then revert. + +If `ib_write_bw` `BW average` is consistently below the suggested-range floor on healthy hardware, suspect a fabric or driver issue (per-rail NAD pointing at the wrong PF/VF, switch egress queue mis-tuned, MTU mismatch, ROCm/OFED version drift) rather than relaxing the threshold further. The per-rail granularity of the annotations is the diagnostic — a single rail consistently 30% slow points at one cable or one switch port; all rails uniformly slow points at host- or image-level configuration. + +#### Concurrency-Cap Tuning + +Pair-runners are forked in parallel under a hard cap of `PHASE4_MAX_CONCURRENT_PAIRS` (default `8`). The cap exists for one reason: each pair-runner submits up to 2 Jobs per rail (server + client) for up to `PHASE4_RAIL_COUNT` rails, so an uncapped run on a 64-node pool (32 pairs × 8 rails × 2 Jobs = 512 `kubectl apply` calls) would trivially exceed the kube API server's default QPS budget and trigger 429-throttled storms. + +**How the cap behaves.** + +- The driver dispatches pair-runners in lexicographic pair order, sleeping in a `wait -n` loop until fewer than `PHASE4_MAX_CONCURRENT_PAIRS` are in flight. +- Within a single pair, rails are **always serialized** — only one `(pair, rail)` `ib_write_bw` flow exists at any moment per pair. This invariant prevents a pair from double-claiming the same NIC for two concurrent flows, which would corrupt both measurements. +- When `kubectl apply` returns HTTP 429 despite the cap, the driver's `_phase4_apply_with_backoff` retries with exponential back-off up to 3 attempts before marking the remaining rails of that pair as `api-throttled` and moving on. +- `PHASE4_MAX_CONCURRENT_PAIRS` is bounded to `>= 1` by the driver — a misconfigured `0` or negative value is promoted to `1` (fully serial) with a WARN log line, ensuring the phase never stalls outright. + +**Sizing guidance.** + +| Cluster size (Phase-3-passed nodes) | Recommended `PHASE4_MAX_CONCURRENT_PAIRS` | +|-------------------------------------|-------------------------------------------| +| 2-4 nodes (1-2 pairs) | `2` — fewer in-flight Jobs, easier to triage | +| 5-16 nodes (3-8 pairs) | `8` (default) | +| 17-64 nodes (9-32 pairs) | `16` — paired with a kube API QPS raise (`--kube-api-qps`, `--kube-api-burst` on `kube-apiserver`) | +| 65+ nodes (33+ pairs) | Keep at `16` — additional parallelism past 16 typically saturates the etcd write path before it speeds up the phase wall-clock; let the pair queue drain instead. | + +**Total Phase 4 wall-clock budget.** With the cap honored, the upper bound is `ceil(num_pairs / PHASE4_MAX_CONCURRENT_PAIRS) * PHASE4_RAIL_COUNT * PHASE4_PAIR_WAIT_TIME`. With defaults (`8` cap, `8` rails, `180`s per rail) and 16 pairs that is `2 * 8 * 180 = 2880` s ≈ 48 min worst-case; healthy hardware completes well under half that because each rail's wait budget covers the full timeout, not the measured runtime. + +### Phase 4.5: Cross-Node Connectivity Matrix Test + +**Scope:** Pre-flight gate immediately before Phase 5 (multi-node RCCL). Phase 4.5 validates that every node that passed Phase 4 can actually reach every other node across four orthogonal dimensions — SSH, DNS, MPI process spawning, and RCCL topology detection — so multi-hop routing or DNS faults that pairwise Phase 4 cannot observe are caught before the long Phase 5 MPIJob run starts. Phase 4.5 does **not** introduce a new per-node phase label: it is a gate, not a tested capability of an individual node. Failures are recorded as a node annotation and abort the Phase 5 MPIJob via init-container non-zero exit. + +**Execution model: piggy-back on the Phase 5 launcher init-container.** Phase 4.5 runs **inside the existing `wait-for-worker-pods` init-container of the Phase 5 MPIJob launcher** — there is **no separate Pod or Job** for Phase 4.5. The orchestrator sequence is: + +1. The CronJob orchestrator submits the Phase 5 MPIJob (the standard launcher + worker set). +2. The launcher's `wait-for-worker-pods` init-container runs `PHASE45_PREFLIGHT_SCRIPT` (sourced from `cluster-validation-config.yaml` via env var). The init-container blocks launcher startup until the pre-flight passes. +3. If the pre-flight fails, the init-container exits non-zero → launcher pod fails → MPIJob fails → existing CronJob Phase 5 failure path applies (no extra wiring). + +The rationale for sharing the init-container instead of standing up a dedicated Phase 4.5 Pod: the init-container already has the assembled worker pod set, the `cluster-validation-sa` ServiceAccount with the credentials/RBAC to `kubectl exec` into every worker, and the network access to reach every worker. Reusing it avoids a second pod-startup latency on every CronJob tick. + +**The 4 checks.** `PHASE45_PREFLIGHT_SCRIPT` runs four independent probes against the worker set. The script does **not** short-circuit on the first failing check — it runs all four, then aggregates the failure reasons into a single annotation. This keeps the failure record actionable: an operator inspecting a `failed` run sees every distinct fault class on the cluster, not just the first one tripped. + +| # | Check | What it does | Failure reason token | +|---|-------|--------------|----------------------| +| 1 | N×N SSH mesh | For every (src, dst) ordered worker-pod pair, `kubectl exec` the src and `ssh -o ConnectTimeout=5 dst-ip 'echo ok'`. Records every failing pair in `failed_pairs`. Catches multi-hop routing faults, asymmetric ACLs, and partial fabric outages that pairwise Phase 4 cannot observe. | `ssh-mesh` | +| 2 | DNS forward + reverse | From the first worker pod, `getent hosts ` (forward) for every node hostname, then `getent hosts ` (reverse) on the returned IP. Catches cluster-DNS / coredns regressions before Phase 5 hits them mid-collective. | `dns` | +| 3 | `mpirun --hostfile` no-op spawn | Write `WORKER_IPS` to `/tmp/hf` and `mpirun --hostfile /tmp/hf --np $(wc -l < /tmp/hf) --allow-run-as-root true`. A no-op program isolates the OMPI launcher / orted bootstrap path from the workload — a hang or non-zero exit here means MPI cannot start, period. | `mpi-spawn` | +| 4 | RCCL topology probe | Minimal `mpirun . -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT $PERF_TEST_DIR/all_reduce_perf -b 1K -e 1K -n 1` and grep the first 50 `NCCL INFO comm|topology` lines. Wrapped in a 60 s timeout. On timeout, **soft-fail** — annotate `rccl-topology` but proceed, because first-run warming of kernel/firmware caches can legitimately exceed 60 s. The real RCCL test is Phase 5. | `rccl-topology` | + +**Sample failure annotation output.** When one or more checks fail, every participating worker node receives the `amd.com/phase4_5-failure-reason` annotation listing the failed check classes (comma-separated). NO node label is written or changed — per-phase labels from Phases 1-4 stay frozen, preserving the audit trail. + +```text +$ kubectl describe node smc300x-ccs-aus-gpuf268 +Name: smc300x-ccs-aus-gpuf268 +Labels: amd.com/gpu-hw-acceptance=passed + amd.com/gpu-mesh-validation=passed + amd.com/nic-health=passed + amd.com/rail-bandwidth=passed + feature.node.kubernetes.io/amd-gpu=true + feature.node.kubernetes.io/amd-nic=true + ... +Annotations: amd.com/phase4_5-failure-reason=ssh-mesh,mpi-spawn + amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:47:33Z + ... +``` + +The example shows the canonical "fabric or routing fault" pattern: N×N SSH probes find at least one unreachable pair and `mpirun` cannot complete its TCP bootstrap, so both `ssh-mesh` and `mpi-spawn` are aggregated into the single annotation. An operator triaging the run can `kubectl get nodes -o jsonpath` over `amd.com/phase4_5-failure-reason` to see which classes failed cluster-wide and start investigation there. + +Best-effort attribution caveat: the script annotates **every** participating node with the same failure-reason list rather than trying to attribute "node A is at fault for the ssh-mesh failure to node B". Failure modes at this layer are almost always shared infrastructure (a TOR port, the coredns Service, a fabric routing policy), so per-node attribution would be misleading more often than it would be helpful. If one specific node is repeatedly the only one annotated across CronJob ticks, that signal will surface organically and the operator can isolate it. + +**No per-node label change is intentional.** Unlike Phases 1-4, Phase 4.5 does **not** write `=passed|failed` for any node. The rationale: + +- Phase 4.5 tests a **cluster property** (the N×N reachability surface), not a per-node capability. Labeling individual nodes `phase4_5=failed` would falsely imply that the node has a localized fault, when in practice the failure usually means "this set of nodes cannot collectively form an MPI world." +- Keeping the per-node labels frozen at the Phase 4 outcome preserves the audit trail: an operator looking at a node a week later sees the result of every per-node phase that was actually exercised on that node, without Phase 4.5 (a gate) overwriting that history. +- The annotation alone is sufficient for the downstream contract: the next CronJob tick re-evaluates the candidate pool from scratch, sees `amd.com/phase4_5-failure-reason=.` on the previously-failing nodes, and the operator decides whether to investigate before the next Phase 4.5 attempt. + +**`SKIP_RCCL_TEST` behavior (shared with Phase 5).** Phase 4.5 does **not** carry its own skip flag — it shares `SKIP_RCCL_TEST` with Phase 5. Setting `skip-rccl-test: true` in `config.json` (which renders to `SKIP_RCCL_TEST=true` in the ConfigMap) makes both phases short-circuit together: + +- **No Phase 5 MPIJob is submitted** when `SKIP_RCCL_TEST=true`. Because Phase 4.5 lives **inside the launcher init-container of that very MPIJob**, no init-container is created either — Phase 4.5 simply does not run. +- No `amd.com/phase4_5-failure-reason` annotation is written, no `PHASE45_PREFLIGHT_SCRIPT` is invoked, and no `kubectl exec` traffic flows between worker pods. +- The orchestrator's `maybe_run_phase45` helper logs that Phase 4.5 is skipped together with Phase 5 and continues to post-phase cleanup. + +This shared-flag model is intentional. Phase 4.5 exists **only** to gate Phase 5 — a Phase 5 that is skipped needs no pre-flight, and a Phase 5 that is enabled should always have its pre-flight run. Splitting the two flags would let a misconfigured cluster run Phase 5 directly without the pre-flight, which would simply turn a fast, well-classified Phase 4.5 failure into a slow, opaque RCCL crash 30 minutes into the MPIJob. The shared flag also keeps the dry-run shape consistent: `DRY_RUN=1` with `SKIP_RCCL_TEST=false` prints "DRY_RUN -- skipping run_phase5" and "DRY_RUN -- skipping maybe_run_phase45" in lock-step. + +#### Per-(pair, check) Granularity Limits + +Phase 4.5's diagnostics are intentionally coarser than Phase 4's `(pair, rail)` granularity for two reasons: + +1. **Failure modes at this layer are usually shared infrastructure.** SSH mesh failures, DNS regressions, and MPI bootstrap hangs almost never localize to a single node — they localize to a switch port, a coredns Pod, or a control-plane RBAC change. Recording every failing `(src, dst)` SSH pair in the annotation would push the annotation past the 256 KB per-node total-annotation budget on large clusters without giving the operator new information. +2. **The fast diagnostic loop is `kubectl logs` of the launcher init-container, not annotations.** Every failing SSH pair, every DNS miss, and the full `mpirun` stderr is printed to the init-container's stdout. Operators triaging a Phase 4.5 failure read those logs first; the annotation exists only to make the failure visible on `kubectl describe node` and to allow dashboards to surface the affected node set without scraping pod logs. + +The annotation value is bounded to ≤ 128 bytes (the four reason tokens plus comma separators fit comfortably) so the per-node 256 KB total-annotation budget is never threatened, even when Phase 4.5 runs on every CronJob tick for weeks. + +### Phase 5: Multi-Node RCCL Collective Test + +**Scope:** Final gate of the pipeline. Multi-node RCCL collective workload (`all_reduce_perf`, `broadcast_perf`, `reduce_scatter_perf`) executed as a single MPIJob across the worker set that survived Phase 4.5. The phase produces the terminal cluster verdict on each participating node and is the only phase whose `passed` outcome means "this node was actually exercised end-to-end under multi-node RCCL traffic above the configured bandwidth threshold." The MPIJob template (`cluster-validation-mpijob-config`) and the RCCL workload image (`rocm/roce-workload`) are reused from prior releases — Phase 5 itself is the same workload as before; the work refactors how the orchestrator drives it (script extraction, helper-based labeling, dynamic worker count, per-worker triage) without changing the workload semantics. + +**Result label:** `amd.com/cluster-validation-status=passed|failed`. This is the **same label key the framework has always written for Phase 5** (`PHASE5_LABEL_KEY` in the ConfigMap) — no change for any operator dashboard, alerting rule, or downstream consumer that already reads `amd.com/cluster-validation-status`. What changed is the **write path** (now via the `label_phase_passed` / `label_phase_failed` helpers, not raw `kubectl label`) and the **failure-reason annotation** (now per-worker, see [Sample failure annotation output](#sample-failure-annotation-output-phase-5) below). + +**Pairing model.** Phase 5 takes the node set that survived Phase 4.5 as its input pool. The MPIJob is rendered with `worker-replicas = len(input pool)` — dynamic per CronJob tick — so a tick that finds 3 surviving nodes runs a 3-worker MPIJob, a tick that finds 7 runs a 7-worker MPIJob, and so on. The `WORKER_REPLICAS` value from `config.json` (rendered into the ConfigMap as `WORKER_REPLICAS`) is **the upper bound on candidate selection in Phase 0** (the orchestrator picks at most `WORKER_REPLICAS` candidates from the labeled node pool); treat it as a maximum tick budget, not a target — Phase 5 always uses the actual surviving count. + +**ConfigMap env vars (Phase 5 subset).** Tunable values are sourced from `cluster-validation-config`: + +| Variable | Default | Purpose | +|----------|---------|---------| +| `PHASE5_MIN_WORKERS` | `2` | Minimum surviving-node count required before Phase 5 submits the MPIJob. If `input_count < PHASE5_MIN_WORKERS`, the script logs the reason and `return 0` with no MPIJob submitted and no label changes (the per-phase verdict is undefined for a degenerate single-node MPIJob). Default `2` reflects "multi-node RCCL is multi-node by definition." Override to `1` for opt-in single-node plumbing validation, or to a higher value to demand a larger minimum cluster. | +| `WORKER_REPLICAS` | (from `config.json`) | Phase 0 candidate-selection budget (see above) — upper bound, not the MPIJob worker count. | +| `MPIJOB_WAIT_TIME` | `240` | `kubectl wait mpijob --for=condition=Succeeded --timeout` budget in seconds. Bounds total Phase 5 wallclock per tick. | +| `SLOTS_PER_WORKER` | (from `config.json`) | MPI ranks per worker (= GPUs per worker by convention); passed into the MPIJob template via `sed` substitution. | +| `ROCE_WORKLOAD_IMAGE` | (from `config.json` `images.roce-workload`) | RCCL workload container image. | +| `PHASE5_LABEL_KEY` | `amd.com/cluster-validation-status` | The label key written on every participating node. | + + +#### Sample failure annotation output (Phase 5) + +When the MPIJob exits non-zero (any cause — RCCL `validate_rccl_test.sh` reports bandwidth below threshold, a worker container crashes, the launcher aborts, `kubectl wait` times out, etc.), **every** participating node receives the `failed` label plus the uniform `-failure-reason` annotation. The reason value is constructed **per-node** from the worker pod that ran on that node and that pod's container exit code, so an operator inspecting `kubectl describe node` sees which worker pod failed on each node and what its exit was — no need to cross-reference the launcher log to attribute the failure: + +```text +$ kubectl describe node smc300x-ccs-aus-gpuf268 +Name: smc300x-ccs-aus-gpuf268 +Labels: amd.com/gpu-hw-acceptance=passed + amd.com/gpu-mesh-validation=passed + amd.com/nic-health=passed + amd.com/rail-bandwidth=passed + amd.com/cluster-validation-status=failed + feature.node.kubernetes.io/amd-gpu=true + feature.node.kubernetes.io/amd-nic=true + ... +Annotations: amd.com/cluster-validation-status-failure-reason=worker-pod=cluster-validation-mpi-job-20260520-1632-worker-0,exit=137 + amd.com/cluster-validation-last-run-timestamp=2026-05-20T16:38:11Z + ... +``` + +The example shows the canonical "OOM-killed worker" pattern: node `smc300x-ccs-aus-gpuf268` ran worker pod `cluster-validation-mpi-job-20260520-1632-worker-0`, which exited 137 (SIGKILL — typically OOM). A neighbour node in the same MPIJob would carry the same `failed` label but a different `-failure-reason` annotation pointing at its own worker pod and exit code, letting the operator localize the cluster-level failure down to a specific node + container. + +**Per-node failure-reason field semantics.** The `-failure-reason` annotation always uses the form `worker-pod=,exit=`. Both fields fall back to the literal string `unknown` when the corresponding lookup fails — never empty, never omitted — so the annotation shape stays deterministic for log scrapers and dashboards: + +| Sub-field | Source | `unknown` fallback condition | +|-----------|--------|------------------------------| +| `worker-pod=` | `kubectl get pods -l training.kubeflow.org/job-name=,training.kubeflow.org/replica-type=worker --field-selector spec.nodeName= -o jsonpath='{.items[0].metadata.name}'` | No worker pod matched the node (race against MPIJob `cleanPodPolicy: All`, scheduler eviction, or a node that the MPIJob never actually placed a worker on). | +| `exit=` | `kubectl get pod -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'` on the resolved worker pod | The worker pod's primary container never reached `terminated` state (still `running`, `waiting`, or evicted before terminating), or the worker pod itself was unknown. | + +When `worker-pod` is `unknown`, `exit` is forced to `unknown` as well — the exit code lookup requires the pod name. The annotation value is bounded by the helper's `PHASE_ANNOTATION_VALUE_MAX_BYTES` cap, so pod names that approach the 253-char DNS-1123 limit cannot push the annotation past the 256 KB per-node total-annotation budget. + +**Per-worker log dump.** On **both** the pass and fail paths, after the launcher log is collected, the orchestrator walks the input node set and saves each worker pod's logs to a per-node file under `LOG_DIR` (= `/var/log/cluster-validation` on the host, mounted from a `hostPath` volume): + +```text +/var/log/cluster-validation/launcher-.log # existing — launcher pod, --all-containers +/var/log/cluster-validation/worker--.log # new — worker pod on , --all-containers +``` + +For example, a 3-node MPIJob `cluster-validation-mpi-job-20260520-1632` produces: + +```text +/var/log/cluster-validation/launcher-cluster-validation-mpi-job-20260520-1632.log +/var/log/cluster-validation/worker-smc300x-ccs-aus-gpuf268-cluster-validation-mpi-job-20260520-1632.log +/var/log/cluster-validation/worker-smc300x-ccs-aus-gpuf269-cluster-validation-mpi-job-20260520-1632.log +/var/log/cluster-validation/worker-smc300x-ccs-aus-gpuf270-cluster-validation-mpi-job-20260520-1632.log +``` + +The per-worker log files exist on the **same host** as the orchestrator container (the `submit-mpijob` pod) — typically the LOG_STORE_NODE_NAME chosen at install time. Operators triaging a Phase 5 failure read the per-node log file matching the failed node from the `-failure-reason` annotation, without scraping the launcher log for that worker's interleaved output. The per-worker dump runs on the pass path too so a passing run's per-worker timing/topology data is still archived for later baseline comparison. If a worker pod is missing at log-collection time (race against MPIJob cleanup, scheduler eviction), the orchestrator emits a single `Warning: Could not find worker pod for node ` line and continues with the remaining nodes — one missing log file does not break the loop. Existing log-retention policy (controlled by `CLEANUP_TEST_LOGS` at teardown) covers both the launcher and the new per-worker files. + +#### Per-node failure-reason mapping (Phase 5) + +Phase 5's `-failure-reason` is a single composite token rather than a fixed enum, but the cross-field combinations map to recognizable triage classes: + +| `worker-pod` | `exit` | Meaning | +|--------------|--------|---------| +| concrete pod name | numeric (e.g., `1`, `137`, `139`) | Worker pod ran to terminal state and exited non-zero. Use the per-worker log (`worker--.log`) to read the actual error message. Common: `137` = SIGKILL (OOM, eviction); `139` = SIGSEGV (RCCL or driver crash); `1` = RCCL `validate_rccl_test.sh` bandwidth-below-threshold. | +| concrete pod name | `0` | Worker pod exited zero but the MPIJob still failed — almost always means the **launcher** detected `validate_rccl_test.sh` failure across the collective output (bandwidth below threshold, missing test output). Inspect the launcher log; the worker exit was clean but the run's verdict was not. | +| concrete pod name | `unknown` | Worker pod exists but its primary container never reached `terminated` state at label-write time. Common causes: container still running when `kubectl wait` timed out, pod evicted before terminating, or `--for=condition=Succeeded` fired non-zero while the pod was mid-shutdown. | +| `unknown` | `unknown` | No worker pod could be matched to this node. Indicates the MPIJob's `cleanPodPolicy: All` already deleted worker pods by the time the failure-labeling loop ran, the worker never got scheduled on this node, or `kubectl get pods` with the label/field selector returned empty. | + +The per-worker log file is the source of truth for the actual failure root cause — the annotation surface is intentionally compact (single field, deterministic shape) so dashboards can pivot on it without parsing a multi-token string. Operators chasing a recurring failure pattern join the annotation across nodes (`kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.amd\\.com/cluster-validation-status-failure-reason}{"\n"}{end}'`) to spot whether the same worker-pod-suffix or the same exit code is recurring cluster-wide. + +**`SKIP_RCCL_TEST` behavior.** Setting `skip-rccl-test: true` in `config.json` (which renders to `SKIP_RCCL_TEST=true` in the ConfigMap) makes Phase 5 short-circuit: + +- **No MPIJob is submitted** for the input set. Because Phase 4.5 lives inside the Phase 5 MPIJob launcher init-container, Phase 4.5 also does not run when `SKIP_RCCL_TEST=true` (the two phases share the flag — see [Phase 4.5: Cross-Node Connectivity Matrix Test](#phase-45-cross-node-connectivity-matrix-test)). +- Every input node is immediately labeled `amd.com/cluster-validation-status=passed` via the standard `label_phase_passed` helper. This preserves the long-standing early-bringup short-circuit semantics — the same behavior Phase 5 has always had on `SKIP_RCCL_TEST=true`. +- No failure annotation is written, no per-worker log files are created, and `PHASE5_MIN_WORKERS` is not evaluated (skip wins over the min-workers guard). +- The candidate-label removal loop (`amd.com/cluster-validation-candidate-`) still runs so the input nodes return to the pool for the next CronJob tick under the new label outcome. + +**`PHASE5_MIN_WORKERS` short-circuit.** When `SKIP_RCCL_TEST=false` and the surviving input set has fewer nodes than `PHASE5_MIN_WORKERS` (default `2`), the script logs `[Phase 5] only node(s) survived prior phases; minimum is -- skipping MPIJob (no label changes)` and `return 0` without submitting the MPIJob and **without** writing any Phase 5 label or annotation. This intentionally leaves `amd.com/cluster-validation-status` **absent** on those nodes — there is no defensible `passed` (MPIJob never ran) or `failed` (no measurement, no failure) verdict to write. The empty-input case (`input_count=0`) takes the same branch and is treated identically. Operators wanting Phase 5 to label single-node clusters as `passed` for plumbing-validation purposes should set `PHASE5_MIN_WORKERS=1` (and accept that the resulting one-worker MPIJob exercises no inter-node collective at all). The single-node testbed `smc300x-ccs-aus-gpuf268` always exercises this short-circuit path with the default `PHASE5_MIN_WORKERS=2`. + +### Dry-Run Mode + +For orchestrator-flow validation without touching cluster state, set `DRY_RUN=1` on the CronJob (or on a manual `kubectl exec` of the orchestrator container): + +```bash +# Inside the CronJob's submit-mpijob container +DRY_RUN=1 bash /scripts/run-phases.sh +``` + +In `DRY_RUN=1` mode: + +- Phase 0 reads existing `amd.com/cluster-validation-candidate=true` labels instead of writing new ones — no candidate labeling happens. +- Every `run_phaseN` function is overridden to `:`-style no-op that prints a `DRY_RUN -- skipping run_phaseN ()` line. +- `maybe_run_phase45` and the post-phase cleanup / log-collection steps log "DRY_RUN -- skipping ." instead of running. +- The orchestrator still walks the entire phase order, honors all `SKIP_*` flags, and prints the planned filtered candidate pool per phase — useful for verifying that a given `config.json` produces the expected pipeline shape before committing to a real run. + +Combinations the dry-run is designed to exercise: + +- All 5 skip flags `true` → exits 0 cleanly, no `kubectl` writes. +- Phases 1+2 enabled, 3-5 skipped → plan output ends after the Phase 2 filter. +- Phase 3 enabled but zero `amd-nic=true` nodes present → empty pool, exits 0. + +## `gpu-cluster.sh` — internal contract + +> `gpu-cluster.sh` is the underlying per-node script the Ansible playbooks invoke. Operators normally **do not run it directly** — `setup-cluster.yml` calls `run server`/`run agent` on each node, and `cvf.yml -e action=reapply` calls `reapply-cvf` inside the server container. This section documents the script's interface for debugging and customization. ```text Usage: ./gpu-cluster.sh [args...] Commands: - build Build the Docker image - run [args...] Run the node as server or agent - teardown Tear down the cluster and clean up - get-token Run on server node to print agent join token - status Show cluster validation framework status and recent runs - node-status Show validation status per node - help Show this help message + build Build the Docker image (called by setup-cluster.yml on the control node) + run [args...] Start k3s server or agent container (called by setup-cluster.yml on each node) + teardown Tear down the cluster + clean up (called by teardown-cluster.yml) + get-token Print the agent join token (called by add-agent-nodes.yml) + reapply-cvf Re-run install_cluster_validation_framework (called by cvf.yml -e action=reapply) + status Show CronJob + recent orchestrator pods (called by cvf.yml -e action=status) + node-status Show per-node validation result table + help Show this help Run arguments: run server run agent ``` -## Environment Variables +**Environment variables** (used by the script; the playbooks pass them as Ansible vars): | Variable | Default | Description | | -------- | ------- | ----------- | | `IMAGE_NAME` | `gpu-validation-cluster` | Docker image name | | `IMAGE_TAG` | `latest` | Docker image tag | | `BUILD_DIR` | `$SCRIPT_DIR/build` | Path to directory containing Dockerfile and entrypoint.sh | -| `CONFIG_DIR` | `$SCRIPT_DIR/configs` | Path to directory containing config.json and other config files | -| `CLEANUP_TEST_LOGS` | `false` | Clean up cluster validation test logs in `/var/log/cluster-validation` during teardown | +| `CONFIG_DIR` | `$SCRIPT_DIR/configs` | Path to directory containing `config.json` and other config files | +| `CLEANUP_TEST_LOGS` | `false` | Clean up validation logs in `/var/log/cluster-validation` during teardown | -### Examples +**Direct invocation examples** (for on-node debugging only): ```bash -# Build using a custom build directory +# Build with a custom build directory (normally set via setup-cluster.yml var) BUILD_DIR=/path/to/custom/build ./gpu-cluster.sh build -# Run server node with custom config directory -CONFIG_DIR=/path/to/custom/configs ./gpu-cluster.sh run server - -# Run agent node with custom config directory to join a cluster -CONFIG_DIR=/path/to/custom/configs ./gpu-cluster.sh run agent - -# Teardown with cluster validation logs cleanup enabled -CLEANUP_TEST_LOGS=true ./gpu-cluster.sh teardown +# Re-apply CVF configs (Ansible equivalent: cvf.yml -e action=reapply) +CONFIG_DIR=/path/to/custom/configs ./gpu-cluster.sh reapply-cvf -# Show cluster validation framework CronJob status and recent pod runs +# Local status view (Ansible equivalent: cvf.yml -e action=status, which also shows per-node Ansible-side state) ./gpu-cluster.sh status - -# Show per-node validation test status (last run time and result) ./gpu-cluster.sh node-status + +# Teardown with log cleanup (Ansible equivalent below under Cleanup Behavior) +CLEANUP_TEST_LOGS=true ./gpu-cluster.sh teardown ``` ## Directory Structure @@ -218,7 +1106,7 @@ CLEANUP_TEST_LOGS=true ./gpu-cluster.sh teardown ```text gpu-validation-cluster/ ├── README.md # Project documentation -├── gpu-cluster.sh # Unified script for build, run, teardown, and get-token +├── gpu-cluster.sh # Internal per-node script invoked by the Ansible playbooks (build, run, reapply-cvf, teardown, get-token) ├── build/ # Build context │ ├── Dockerfile # Dockerfile to build the image │ └── entrypoint.sh # Container entrypoint script @@ -228,7 +1116,6 @@ gpu-validation-cluster/ │ └── cluster-validation-job.yaml # Cluster validation framework cronjob └── ansible/ # Ansible automation (recommended for multi-node) ├── README.md # Ansible documentation - ├── quickstart.sh # Quick start script ├── inventory.yml # Node inventory configuration (edit this!) ├── ansible.cfg # Ansible configuration └── playbooks/ # Ansible playbooks @@ -237,7 +1124,7 @@ gpu-validation-cluster/ ├── add-agent-nodes.yml # Add new nodes to existing cluster ├── remove-agent-nodes.yml # Remove nodes from existing cluster ├── teardown-cluster.yml # Teardown entire cluster - └── check-status.yml # Status checking + └── cvf.yml # CVF day-2 ops (reapply / reset / inject-secret / trigger / status / fresh-run) ``` ## Configuration @@ -250,102 +1137,52 @@ Customize operator behavior by editing files in the `configs/` directory: ## Cleanup Behavior -By default, the teardown command preserves cluster validation logs in `/var/log/cluster-validation` for troubleshooting and analysis. To remove these logs during teardown, set the `CLEANUP_TEST_LOGS` environment variable to `true`: - -```bash -CLEANUP_TEST_LOGS=true ./gpu-cluster.sh teardown -``` - -## Monitoring Validation Tests - -### Cluster-Wide Status - -To view the overall cluster validation framework status including CronJob configuration and recent pod runs: +By default, `teardown-cluster.yml` preserves the persistent validation logs in `/var/log/cluster-validation` on each node for post-mortem analysis. To also wipe the logs during teardown: ```bash -./gpu-cluster.sh status +ansible-playbook playbooks/teardown-cluster.yml -e cleanup_test_logs=true ``` -This command displays: +(Under the hood the playbook passes `CLEANUP_TEST_LOGS=true` to `gpu-cluster.sh teardown` on each node.) -- **CronJob Status**: Configuration and schedule of validation CronJobs -- **Recent Pod Runs**: Last 5 pod executions with timestamps, phases, and assigned nodes -- **Pod Details**: Detailed information about recent validation test pods - -### Per-Node Validation Status +## Monitoring Validation Tests -To view validation test results broken down by individual node: +For the unified status view (CronJob, recent orchestrator pods, in-flight phase Jobs, MPIJob, per-node phase labels + Phase 1 stage annotations + failure-reason annotations + per-node container ping): ```bash -./gpu-cluster.sh node-status +ansible-playbook playbooks/cvf.yml -e action=status ``` -This command displays: - -- **Node Summary Table**: Shows each node with its last run timestamp and validation result (Passed/Failed/Pending) -- **Detailed Node Information**: Per-node breakdown including: - - Last run timestamp (from node annotation) - - Validation result status - - Most recent pod name that executed on the node - -**Result Status Legend:** - -- `Passed`: All validation tests on the node passed - - Label: `amd.com/cluster-validation-status=passed` -- `Failed`: One or more validation tests on the node failed - - Label: `amd.com/cluster-validation-status=failed` -- `Pending`: Validation tests are running or have not yet executed - - Label: not set (no label present) - -### Understanding the Output - -The per-node view uses Kubernetes node labels and annotations to track validation test execution: +This is the single entry point for cluster-wide AND per-node validation state. See the [`cvf.yml` action reference in `ansible/README.md`](ansible/README.md#cvfyml) for the full output schema. -- **Annotation `amd.com/cluster-validation-last-run-timestamp`**: Timestamp of the last validation test execution on this node -- **Label `amd.com/cluster-validation-status`**: Current validation result status: - - Set to `passed` if all tests passed - - Set to `failed` if any tests failed - - Not set (empty) if tests are pending or have not yet run +### Result schema (what the status output means) -## Ansible Automation +The status output is computed from Kubernetes node labels and annotations the orchestrator writes after each phase: -For production deployments with multiple nodes, the Ansible automation provides a streamlined deployment experience: +| Per-node label | Phase | Values | +|---|---|---| +| `amd.com/gpu-hw-acceptance` | 1 (GPU HW acceptance) | `passed` / `failed` / `skipped` / unset | +| `amd.com/gpu-mesh-validation` | 2 (intra-node RCCL) | `passed` / `failed` / `skipped` / unset | +| `amd.com/nic-health` | 3 (NIC health) | `passed` / `failed` / `skipped` / unset (NIC-capable nodes only) | +| `amd.com/rail-bandwidth` | 4 (rail bandwidth) | `passed` / `failed` / `skipped` / unset | +| `amd.com/cluster-validation-status` | 5 (multi-node RCCL) — also the aggregate label | `passed` / `failed` / unset | -**Key Features:** +| Per-node annotation | Purpose | +|---|---| +| `amd.com/cluster-validation-last-run-timestamp` | Wall-clock of the last orchestrator run that touched this node. Drives the `NODE_VALIDATION_INTERVAL_MINS` skip gate. | +| `amd.com/gpu-hw-acceptance-stage-` | Phase 1 per-stage result (`passed`/`failed`/`skipped`). Example: `amd.com/gpu-hw-acceptance-stage-gpu-stress=passed`. | +| `amd.com/-failure-reason` | Free-form reason when the corresponding label is `failed`. Example: `amd.com/rail-bandwidth-failure-reason=failed-rails:0,1,2`. | +| `amd.com/-failed-subtest` | Specific sub-test or recipe that failed (Phase 1). | +| `amd.com/-failed-nics` | Specific NIC IDs that failed (Phase 3). | -- **One-command deployment**: Deploy entire cluster across all nodes with a single command -- **Automatic prerequisites**: Installs Docker and jq on all nodes -- **SSH key management**: Automated passwordless SSH setup -- **Image distribution**: Builds image locally and distributes to all nodes -- **Token management**: Automatically retrieves and distributes k3s join token -- **Status monitoring**: Unified status checking across all nodes -- **Clean teardown**: One-command cluster removal - -**Quick Start:** +Phase 5's failure reason is per-worker and takes the form `amd.com/cluster-validation-status-failure-reason=worker-pod=,exit=` — see [Phase 5: Multi-Node RCCL Collective Test](#phase-5-multi-node-rccl-collective-test) for sub-field semantics. +Inspect raw labels/annotations directly via: ```bash -# 1. Configure node inventory -cd ansible -vi inventory.yml # Update with your node IPs and settings - -# 2. Run automated deployment -./quickstart.sh - -# 3. Check cluster status -ansible-playbook playbooks/check-status.yml - -# 4. Teardown cluster -ansible-playbook playbooks/teardown-cluster.yml +ansible -i ansible/inventory.yml server-node -m shell -a \ + "docker exec server kubectl get node -o yaml | yq .metadata.labels,.metadata.annotations" ``` -**Supported Scenarios:** - -- Server node local (where you run Ansible) + remote agent nodes -- All remote nodes (server + agents on different machines) -- Mixed environments with different sudo requirements - -**See the [ansible/README.md](ansible/README.md) for complete documentation.** - ## FAQ and Troubleshooting ### 1. What if I hit the DockerHub rate limit to pull images from public repository? @@ -369,27 +1206,13 @@ Users could configure a DockerHub account secrets in `configs/config.json` so th If the agent container starts but doesn't appear when running `kubectl get nodes` on the server, check the logs: -**For manual deployment:** - ```bash -# On the agent node host, check k3s logs -cat /var/log/k3s.log +# On the agent node host +cat /tmp/gpu-cluster-agent.log # ansible-side deployment log +docker logs agent # k3s agent container log -# Or check the GPU cluster script logs -docker logs agent -``` - -**For Ansible deployment:** - -```bash -# On the agent node host, check the deployment logs -cat /tmp/gpu-cluster-agent.log - -# Check k3s logs inside the container -docker logs agent - -# Or SSH to the agent node and check -ssh vm@ "cat /tmp/gpu-cluster-agent.log" +# Or via Ansible from the control node +ansible -i ansible/inventory.yml -m shell -a "cat /tmp/gpu-cluster-agent.log; docker logs agent --tail 50" ``` **Common issues:** @@ -397,14 +1220,12 @@ ssh vm@ "cat /tmp/gpu-cluster-agent.log" - **Wrong server IP**: Verify the server IP is correct and reachable from the agent node - **Network connectivity**: Ensure port 6443 is accessible between nodes - **Token mismatch**: Verify the token matches the server's token -- **Config missing**: Check if `/home/vm/gpu-validation-cluster/configs/config.json` exists on the agent node +- **Config missing**: Check that `{{ config_dir }}/config.json` exists on the agent node (see inventory's `config_dir`) - **Firewall blocking**: Ensure firewall rules allow k3s traffic (TCP 6443, 10250) -- **Port binding failed**: If you see "bind: address already in use", check for existing k3s processes (`sudo lsof -i :6443` or `ps aux | grep k3s`) and stop them, or run `./gpu-cluster.sh teardown` to clean up +- **Port binding failed**: If you see "bind: address already in use", check for existing k3s processes (`sudo lsof -i :6443` or `ps aux | grep k3s`) and stop them, or re-run `ansible-playbook playbooks/teardown-cluster.yml` to clean up ### 3. How do I add new agent nodes to an existing cluster? -**For Ansible deployments:** - ```bash # 1. Add new nodes to inventory.yml vi ansible/inventory.yml @@ -418,35 +1239,14 @@ cd ansible ansible-playbook playbooks/add-agent-nodes.yml --limit agent-node-4 # 3. Verify new nodes joined -ansible-playbook playbooks/check-status.yml +ansible-playbook playbooks/cvf.yml -e action=status docker exec server kubectl get nodes ``` -**For manual deployments:** - -```bash -# 1. On new agent node: Copy and load the Docker image -scp gpu-validation-cluster.tar user@new-node:/tmp/ -ssh user@new-node "docker load -i /tmp/gpu-validation-cluster.tar" - -# 2. Copy project files to new node -scp -r . user@new-node:~/gpu-validation-cluster/ - -# 3. Get token from server node -./gpu-cluster.sh get-token - -# 4. On new agent node: Join the cluster -ssh user@new-node -cd ~/gpu-validation-cluster -./gpu-cluster.sh run agent -``` - -**Important**: The add-agent-nodes.yml playbook is designed to safely add nodes without disrupting the existing cluster. It automatically detects and skips nodes already in the cluster. +The `add-agent-nodes.yml` playbook safely adds nodes without disrupting the existing cluster; it auto-detects and skips nodes already in the cluster. ### 4. How do I remove agent nodes from an existing cluster? -**For Ansible deployments:** - ```bash # Remove a specific node cd ansible @@ -456,22 +1256,8 @@ ansible-playbook playbooks/remove-agent-nodes.yml --limit agent-node-2 ansible-playbook playbooks/remove-agent-nodes.yml --limit "agent-node-2,agent-node-3" # Verify nodes were removed -ansible-playbook playbooks/check-status.yml +ansible-playbook playbooks/cvf.yml -e action=status docker exec server kubectl get nodes ``` -**For manual deployments:** - -```bash -# 1. On server node: Drain the node (move workloads gracefully) -docker exec server kubectl drain --ignore-daemonsets --delete-emptydir-data --force - -# 2. Delete node from cluster -docker exec server kubectl delete node - -# 3. On agent node: Stop and clean up -ssh user@agent-node -./gpu-cluster.sh teardown -``` - -**Important**: The remove-agent-nodes.yml playbook gracefully drains workloads before removing nodes. Always use `--limit` to target specific nodes. +The `remove-agent-nodes.yml` playbook gracefully drains workloads before removing nodes. Always use `--limit` to target specific nodes. diff --git a/example/gpu-validation-cluster/ansible/README.md b/example/gpu-validation-cluster/ansible/README.md index d28c9410c..fab7b4ec1 100644 --- a/example/gpu-validation-cluster/ansible/README.md +++ b/example/gpu-validation-cluster/ansible/README.md @@ -31,14 +31,14 @@ This directory contains Ansible playbooks for automating the deployment and mana ansible/ ├── ansible.cfg # Ansible configuration ├── inventory.yml # Cluster inventory (edit this!) -├── quickstart.sh # Interactive quick start script ├── playbooks/ │ ├── setup-ssh-keys.yml # Configure passwordless SSH │ ├── setup-cluster.yml # Main cluster setup │ ├── add-agent-nodes.yml # Add new nodes to existing cluster │ ├── remove-agent-nodes.yml # Remove nodes from existing cluster │ ├── teardown-cluster.yml # Teardown entire cluster -│ └── check-status.yml # Check cluster status +│ ├── set-gpu-compute-mode.yml # Switch GPU compute-partition mode (SPX/CPX/...) + reboot to apply +│ └── cvf.yml # CVF day-2 ops: reapply / reset / inject-secret / trigger / status / fresh-run ├── group_vars/ # Group variables (optional) └── README.md ``` @@ -114,7 +114,7 @@ This will: Check the cluster status: ```bash -ansible-playbook playbooks/check-status.yml +ansible-playbook playbooks/cvf.yml -e action=status ``` ### Step 5: Adding New Agent Nodes (Optional) @@ -133,7 +133,7 @@ vi inventory.yml ansible-playbook playbooks/add-agent-nodes.yml --limit agent-node-4 # 3. Verify new nodes joined -ansible-playbook playbooks/check-status.yml +ansible-playbook playbooks/cvf.yml -e action=status ``` **Important**: Use `--limit` to target only the new nodes you want to add. The playbook automatically detects and skips nodes already in the cluster. @@ -150,7 +150,7 @@ ansible-playbook playbooks/remove-agent-nodes.yml --limit agent-node-2 ansible-playbook playbooks/remove-agent-nodes.yml --limit "agent-node-2,agent-node-3" # Verify nodes were removed -ansible-playbook playbooks/check-status.yml +ansible-playbook playbooks/cvf.yml -e action=status ``` **Important**: This will drain workloads, delete the node from Kubernetes, stop the agent container, and clean up cluster state. @@ -190,11 +190,11 @@ ansible-playbook playbooks/setup-cluster.yml - Exports image to tar file 2. **Prerequisites Phase (all nodes):** - - Checks for Docker installation - - Installs Docker if not present - - Checks for jq installation - - Installs jq if not present - - Adds user to docker group + - Checks for Docker installation; installs Docker if not present + - Stops + disables any pre-existing **native** `k3s.service` / `k3s-agent.service` so the containerized k3s doesn't collide on host ports 6443/6444 + - Ensures a local `/etc/group` `docker` entry (fixes `SocketGroup=docker` boot failure on hosts that only have the group via LDAP/SSSD) + - Always (re)starts the docker daemon (not just on a brand-new install) + - Installs jq if not present; adds user to docker group 3. **Distribution Phase (all nodes):** - Copies Docker image tar to all nodes @@ -202,15 +202,19 @@ ansible-playbook playbooks/setup-cluster.yml - Copies scripts and configs 4. **Server Phase (server node):** + - **Idempotency probe**: skips restart when the existing `server` container is Up AND its kubectl can reach the apiserver. Otherwise tears down + recreates. - Starts k3s server container - Retrieves cluster join token - Waits for server to be ready 5. **Agent Phase (agent nodes):** + - **Idempotency probe**: skips restart when the existing `agent` container is Up AND has an established TCP connection to the apiserver port 6443. Otherwise tears down + recreates. - Starts k3s agent containers one by one - Joins agents to server - Verifies connection + Both the server and agent containers run a **supervisor loop** as their entrypoint (`build/entrypoint.sh`) — if `k3s server` / `k3s agent` crashes inside the container (post-reboot iptables-nft / cni0 transients), the supervisor auto-restarts it with exponential backoff. A rolling restart cap (`K3S_MAX_RESTARTS` / `K3S_RESTART_WINDOW`) escalates to `exit 1` so `docker --restart=unless-stopped` recreates the container fresh as the next layer of recovery. + 6. **Verification Phase:** - Lists all cluster nodes - Displays cluster info @@ -227,6 +231,11 @@ ansible-playbook playbooks/setup-cluster.yml --limit server_nodes # Dry run (check mode) ansible-playbook playbooks/setup-cluster.yml --check + +# Force-recreate the server + agent containers even when the idempotency +# probe says they look healthy (e.g. after editing gpu-cluster.sh or +# config.json in a way that requires a fresh container start). +ansible-playbook playbooks/setup-cluster.yml -e force_restart=true ``` ### add-agent-nodes.yml @@ -346,22 +355,101 @@ ansible-playbook playbooks/teardown-cluster.yml - Cleans up cluster state - Verifies cleanup -### check-status.yml +### set-gpu-compute-mode.yml -**Purpose:** Check current cluster status. +**Purpose:** Switch the AMD GPU compute-partition mode (SPX / CPX / TPX / QPX / DPX) on one or more nodes and reboot to apply. **Usage:** ```bash -ansible-playbook playbooks/check-status.yml +# Flip every agent node to SPX (single-partition; the canonical setting +# for multi-GPU RCCL workloads — CPX disables P2P/XGMI between +# partitions, which would force RCCL to fall back to SHM through host +# DRAM and tank intra-node bandwidth). +ansible-playbook -i inventory.yml playbooks/set-gpu-compute-mode.yml -e partition=SPX + +# Target a single host +ansible-playbook -i inventory.yml playbooks/set-gpu-compute-mode.yml \ + -e partition=SPX -e target=agent-node-1 + +# Skip the reboot (only useful if you know the firmware applies live) +ansible-playbook -i inventory.yml playbooks/set-gpu-compute-mode.yml \ + -e partition=SPX -e skip_reboot=true ``` **What it does:** -- Shows Docker container status on all nodes -- Displays Kubernetes nodes -- Lists all pods -- Shows cluster validation framework status +- Stops the CVF `agent` container on the target host (drains workloads) +- Runs `rocm-smi --setcomputepartition ` on each target +- Reboots the node (BAR/aperture reclamation generally requires a cold restart on MI300/MI308) +- Waits for SSH + Docker to come back, then re-checks the partition +- After completion, re-run `setup-cluster.yml` to rejoin the agent container to k3s (the existing tasks are idempotent and will recreate `/var/lib/gpu-validation-cluster` state cleanly) + +### cvf.yml + +**Purpose:** Single dispatcher for all Cluster Validation Framework day-2 operations. +Five lifecycle playbooks (setup-ssh-keys / setup-cluster / add-agent-nodes / +remove-agent-nodes / teardown-cluster) stay separate; everything else lives here +behind `-e action=`. + +**Actions:** + +| `-e action=` | Purpose | +|---|---| +| `reapply` | Copy configs/ + gpu-cluster.sh to server-node and run `gpu-cluster.sh reapply-cvf` (kubectl apply YAMLs + kubectl patch CM/CronJob from config.json overrides). `config.json` is NOT applied to k8s — it's the source of patches. **Does NOT propagate `global.image-pull-secrets`** — use `inject-secret`. | +| `reset` | Clear per-phase labels + annotations on every node. Optional: `-e suspend_cronjob=true`, `-e delete_completed_jobs=true`, `-e delete_all_jobs=true` (full slate including MPIJob CRDs and orphan pods). | +| `inject-secret` | Create/update docker-registry Secret(s) + strategic-merge-patch `cluster-validation-sa.imagePullSecrets`. Reads `global.image-pull-secrets` from controller-side `config.json`. Affects only pods created after the patch. CLI override: `-e username=… -e token=… [-e registry_url=…] [-e secret_name=…]`. | +| `trigger` | Idempotent `kubectl create job --from=cronjob/...` using a fixed name (default `cvf-test`). Deletes prior Job + orphan pods first. Override with `-e job_name=`. **Cron-vs-manual race**: a manual run that exceeds the `*/30` cron schedule will race the next cron tick for the same nodes and per-rail Phase 4 Job names. Suspend cron before long manual runs (see below). | +| `status` | Dump CronJob, recent orchestrator pods, latest orchestrator log tail, in-flight phase Jobs, MPIJob, per-node phase labels + Phase 1 stage annotations + failure-reason annotations + per-node container ping + **per-node latest log file per phase** (paths under `/var/log/cluster-validation/`, so operators can `tail -f` straight to the right file). | +| `fresh-run` | Composite: `reapply` → `reset` (with `delete_all_jobs=true`) → `inject-secret` → `trigger` → `status`. Replaces the four-command sequence operators kept typing. | + +**Examples:** + +```bash +# Single action +ansible-playbook playbooks/cvf.yml -e action=status +ansible-playbook playbooks/cvf.yml -e action=reapply +ansible-playbook playbooks/cvf.yml -e action=reset -e delete_all_jobs=true +ansible-playbook playbooks/cvf.yml -e action=inject-secret +ansible-playbook playbooks/cvf.yml -e action=inject-secret -e username=u -e token=t +ansible-playbook playbooks/cvf.yml -e action=trigger -e job_name=my-run + +# Composite (replaces the 4-command chain) +ansible-playbook playbooks/cvf.yml -e action=fresh-run +``` + +**Idempotency:** all actions are safe to re-run. Secret creation uses +`kubectl create --dry-run=client -o yaml | kubectl apply -f -`; SA patches +use strategic-merge with mergeKey=name; trigger deletes any prior `cvf-test` +Job before creating. + +**Validation:** unknown `action` values fail fast with the valid list. + +**Suspending the CronJob during long manual runs:** + +The CronJob's `concurrencyPolicy: Forbid` only blocks cron-vs-cron overlap. +A manual `trigger` that exceeds the `*/30` schedule will be lapped by the +next cron tick — both orchestrators then race for the same nodes and reuse +the same per-rail Phase 4 Job names (`cvf-p4-{server,client}--r`, +no timestamp suffix), which causes `kubectl apply` rc=2 on the loser. + +Recommended pattern for runs expected to take more than 30 minutes: + +```bash +# Suspend the cron + reset state in one call +ansible-playbook playbooks/cvf.yml -e action=reset -e suspend_cronjob=true + +# Run the manual orchestrator +ansible-playbook playbooks/cvf.yml -e action=trigger + +# ...wait for completion (check with -e action=status), then re-enable cron +docker exec server kubectl patch cronjob cluster-validation-cron-job \ + -n default --type=merge -p '{"spec":{"suspend":false}}' +``` + +`fresh-run` does NOT auto-suspend the cron. Combine with +`-e suspend_cronjob=true` if a long pipeline is expected (or run when +no cron tick is due within the next ~hour). ## Advanced Usage @@ -516,7 +604,7 @@ ansible-playbook playbooks/setup-cluster.yml ## Additional Resources - [Ansible Documentation](https://docs.ansible.com/) -- [GPU Validation Cluster Main README](../README.md) +- [GPU Validation Cluster Main README](./README.md) - [k3s Documentation](https://docs.k3s.io/) ## Support diff --git a/example/gpu-validation-cluster/ansible/playbooks/add-agent-nodes.yml b/example/gpu-validation-cluster/ansible/playbooks/add-agent-nodes.yml index 65cb39e90..ef94bfd58 100644 --- a/example/gpu-validation-cluster/ansible/playbooks/add-agent-nodes.yml +++ b/example/gpu-validation-cluster/ansible/playbooks/add-agent-nodes.yml @@ -406,7 +406,7 @@ Total cluster nodes: {{ node_count.stdout }} To verify: - ansible-playbook playbooks/check-status.yml + ansible-playbook playbooks/cvf.yml -e action=status To check validation status: ./gpu-cluster.sh status diff --git a/example/gpu-validation-cluster/ansible/playbooks/check-status.yml b/example/gpu-validation-cluster/ansible/playbooks/check-status.yml deleted file mode 100644 index 9025144c2..000000000 --- a/example/gpu-validation-cluster/ansible/playbooks/check-status.yml +++ /dev/null @@ -1,59 +0,0 @@ ---- -# Check GPU Validation Cluster status -# Usage: ansible-playbook playbooks/check-status.yml - -- name: Check container status on all nodes - hosts: cluster_nodes - gather_facts: no - - tasks: - - name: Check Docker containers - shell: 'docker ps --filter "name=server" --filter "name=agent" --format "table {{ ''{{'' }}.Names{{ ''}}'' }}\t{{ ''{{'' }}.Status{{ ''}}'' }}\t{{ ''{{'' }}.Ports{{ ''}}'' }}"' - register: container_status - changed_when: false - - - name: Display container status - debug: - msg: | - {{ inventory_hostname }}: - {{ container_status.stdout }} - -- name: Check Kubernetes cluster status - hosts: server_nodes - gather_facts: no - - tasks: - - name: Get cluster nodes - command: docker exec server kubectl get nodes -o wide - register: k8s_nodes - changed_when: false - ignore_errors: yes - - - name: Display Kubernetes nodes - debug: - msg: "{{ k8s_nodes.stdout_lines }}" - when: k8s_nodes.rc == 0 - - - name: Get cluster pods - command: docker exec server kubectl get pods --all-namespaces - register: k8s_pods - changed_when: false - ignore_errors: yes - - - name: Display Kubernetes pods - debug: - msg: "{{ k8s_pods.stdout_lines }}" - when: k8s_pods.rc == 0 - - - name: Check cluster validation framework status - shell: | - cd {{ script_dir }} - CONFIG_DIR={{ config_dir }} ./gpu-cluster.sh status - register: cvf_status - changed_when: false - ignore_errors: yes - - - name: Display cluster validation framework status - debug: - msg: "{{ cvf_status.stdout_lines }}" - when: cvf_status.rc == 0 diff --git a/example/gpu-validation-cluster/ansible/playbooks/cvf.yml b/example/gpu-validation-cluster/ansible/playbooks/cvf.yml new file mode 100644 index 000000000..82a35f3ea --- /dev/null +++ b/example/gpu-validation-cluster/ansible/playbooks/cvf.yml @@ -0,0 +1,796 @@ +--- +# Single dispatcher for Cluster Validation Framework day-2 operations. +# All CVF actions live here behind `-e action=`; lifecycle playbooks +# (setup-cluster / teardown-cluster / setup-ssh-keys / add-agent-nodes / +# remove-agent-nodes) remain separate. +# +# Actions: +# reapply -- copy configs/ + gpu-cluster.sh to server-node and +# run `gpu-cluster.sh reapply-cvf` (kubectl apply +# YAMLs + kubectl patch CM/CronJob from config.json +# overrides). config.json itself is NOT applied to +# k8s; it's the source of patches. Does not propagate +# global.image-pull-secrets -- use inject-secret. +# +# reset -- clear per-phase labels + annotations on every node. +# Optional: +# -e suspend_cronjob=true suspend the CronJob +# -e delete_completed_jobs=true GC completed Jobs +# -e delete_all_jobs=true nuke ALL CVF Jobs + +# MPIJobs + orphan pods +# (used for tangled runs) +# +# inject-secret -- create/update docker-registry Secret(s) + strategic- +# merge-patch cluster-validation-sa.imagePullSecrets. +# Reads global.image-pull-secrets from CONTROLLER-side +# config.json; affects only pods created after patch. +# CLI override (single ad-hoc): +# -e username=foo -e token=bar [-e registry_url=...] +# [-e secret_name=...] +# +# trigger -- idempotent `kubectl create job --from=cronjob/...` +# using a fixed name (default cvf-test). Deletes prior +# Job + orphan pods first. +# -e job_name= override default cvf-test +# +# Cron-vs-manual race: CronJob.concurrencyPolicy=Forbid +# only blocks cron-vs-cron overlap, NOT manual-vs-cron. +# If a manual `trigger` run exceeds the */30 schedule, +# the next cron tick spawns a parallel orchestrator +# that re-selects the same nodes and hits the +# Phase 4 immutable-selector race on per-rail Jobs +# (cvf-p4-{server,client}--r have no +# timestamp suffix). Recommended pattern when +# triggering manually for runs >30min: +# ansible-playbook playbooks/cvf.yml -e action=reset \ +# -e suspend_cronjob=true +# ansible-playbook playbooks/cvf.yml -e action=trigger +# # ...wait for completion, then re-enable: +# docker exec server kubectl patch cronjob \ +# cluster-validation-cron-job -n default \ +# --type=merge -p '{"spec":{"suspend":false}}' +# fresh-run does NOT auto-suspend the cron -- combine +# with `-e suspend_cronjob=true` if a long run is +# expected (or run fresh-run when no cron tick is due). +# +# status -- dump CronJob state, recent orchestrator pods, latest +# orchestrator log tail, in-flight phase Jobs, MPIJob +# state, per-node phase labels + Phase 1 stage +# annotations + failure-reason annotations. +# +# fresh-run -- composite: reapply -> reset (delete_all_jobs=true) -> +# inject-secret -> trigger -> status. Replaces the +# 4-command sequence operators kept typing. +# +# Examples: +# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=reapply +# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=reset -e delete_all_jobs=true +# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=inject-secret +# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=inject-secret -e username=u -e token=t +# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=trigger -e job_name=my-run +# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=status +# ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=fresh-run + +- name: CVF day-2 operations (server-side) + hosts: server_nodes + gather_facts: false + vars: + action: "" + container_name: server + cvf_namespace: default + service_account_name: cluster-validation-sa + cronjob_name: cluster-validation-cron-job + job_name: cvf-test + delete_all_jobs: false + delete_completed_jobs: false + suspend_cronjob: false + # inject-secret CLI override (when both username+token are set, + # config.json is bypassed for a single ad-hoc secret). + registry_url: "" + username: "" + token: "" + secret_name: "" + _valid_actions: + - reapply + - reset + - inject-secret + - trigger + - status + - fresh-run + _is_fresh_run: "{{ action == 'fresh-run' }}" + tasks: + - name: Validate action + fail: + msg: | + Unknown or missing action: '{{ action }}'. + Valid: {{ _valid_actions | join(', ') }}. + Use -e action=. + when: action not in _valid_actions + + # ====================================================================== + # Action: reapply + # ====================================================================== + - name: reapply > copy configs/ to server node + copy: + src: "{{ playbook_dir }}/../../configs/" + dest: "{{ config_dir }}/" + mode: '0644' + when: action == 'reapply' or _is_fresh_run + + - name: reapply > copy gpu-cluster.sh to server node + copy: + src: "{{ playbook_dir }}/../../gpu-cluster.sh" + dest: "{{ script_dir }}/gpu-cluster.sh" + mode: '0755' + when: action == 'reapply' or _is_fresh_run + + - name: reapply > verify server container is running + shell: docker ps --format '{% raw %}{{.Names}}{% endraw %}' | grep -qx '{{ container_name }}' + changed_when: false + when: action == 'reapply' or _is_fresh_run + + - name: reapply > run gpu-cluster.sh reapply-cvf + shell: | + cd {{ script_dir }} + CONFIG_DIR={{ config_dir }} ./gpu-cluster.sh reapply-cvf + register: reapply_out + changed_when: true + when: action == 'reapply' or _is_fresh_run + + - name: reapply > show last 20 lines of reapply output + debug: + msg: "{{ reapply_out.stdout_lines[-20:] }}" + when: (action == 'reapply' or _is_fresh_run) and reapply_out is defined + + - name: reapply > verify post-patch CM keys + shell: | + docker exec {{ container_name }} kubectl get cm cluster-validation-config -n {{ cvf_namespace }} \ + -o jsonpath='{.data.SKIP_GPU_HW_ACCEPTANCE}{"|"}{.data.SKIP_GPU_MESH_VALIDATION}{"|"}{.data.SKIP_NIC_VALIDATION}{"|"}{.data.SKIP_RAIL_BANDWIDTH_TEST}{"|"}{.data.SKIP_RCCL_TEST}{"|"}{.data.NODE_VALIDATION_INTERVAL_MINS}{"|"}{.data.PHASE2_JOB_WAIT_TIME}{"|"}{.data.PHASE3_JOB_WAIT_TIME}{"|"}{.data.PHASE4_PAIR_WAIT_TIME}{"|"}{.data.MPIJOB_WAIT_TIME}' + register: reapply_cm + changed_when: false + when: action == 'reapply' or _is_fresh_run + + - debug: + msg: "CM keys (SKIP1|SKIP2|SKIP3|SKIP4|SKIP5|INTERVAL|P2|P3|P4|P5): {{ reapply_cm.stdout }}" + when: (action == 'reapply' or _is_fresh_run) and reapply_cm is defined + + - name: reapply > verify CronJob schedule + shell: | + docker exec {{ container_name }} kubectl get cronjob {{ cronjob_name }} -n {{ cvf_namespace }} \ + -o jsonpath='{.spec.schedule}' + register: reapply_cron + changed_when: false + when: action == 'reapply' or _is_fresh_run + + - debug: + msg: "CronJob schedule: {{ reapply_cron.stdout }}" + when: (action == 'reapply' or _is_fresh_run) and reapply_cron is defined + + - name: reapply > verify Phase 1 stages JSON (Skip + TimeoutSeconds + Image) + shell: | + docker exec {{ container_name }} kubectl get cm cluster-validation-config -n {{ cvf_namespace }} \ + -o jsonpath='{.data.GPU_VALIDATION_STAGES_JSON}' \ + | jq -c '[.[] | {Name, Skip, TimeoutSeconds, Image}]' + register: reapply_stages + changed_when: false + when: action == 'reapply' or _is_fresh_run + + - debug: + msg: "{{ reapply_stages.stdout }}" + when: (action == 'reapply' or _is_fresh_run) and reapply_stages is defined + + # ====================================================================== + # Action: reset + # fresh-run forces delete_all_jobs=true so nothing lingers. + # ====================================================================== + - name: reset > set delete_all_jobs=true under fresh-run + set_fact: + delete_all_jobs: true + when: _is_fresh_run + + - name: reset > show node labels (before) + shell: | + docker exec {{ container_name }} kubectl get nodes \ + -L amd.com/gpu-hw-acceptance \ + -L amd.com/gpu-mesh-validation \ + -L amd.com/nic-health \ + -L amd.com/rail-bandwidth \ + -L amd.com/cluster-validation-status + register: reset_before + changed_when: false + when: action == 'reset' or _is_fresh_run + + - debug: + msg: "{{ reset_before.stdout_lines }}" + when: (action == 'reset' or _is_fresh_run) and reset_before is defined + + - name: reset > remove per-phase labels + candidate label on every node + shell: | + set -eu + for N in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do + docker exec {{ container_name }} kubectl label node "$N" \ + amd.com/gpu-hw-acceptance- \ + amd.com/gpu-mesh-validation- \ + amd.com/nic-health- \ + amd.com/rail-bandwidth- \ + amd.com/cluster-validation-status- \ + amd.com/cluster-validation-candidate- \ + --overwrite 2>&1 | sed 's/^/ /' || true + done + args: + executable: /bin/bash + register: reset_label_clear + changed_when: "'labeled' in (reset_label_clear.stdout | default(''))" + when: action == 'reset' or _is_fresh_run + + - name: reset > remove last-run timestamp + per-stage + failure-reason annotations + shell: | + set -eu + for N in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do + docker exec {{ container_name }} kubectl annotate node "$N" \ + amd.com/cluster-validation-last-run-timestamp- \ + --overwrite 2>&1 | sed 's/^/ /' || true + KEYS=$(docker exec {{ container_name }} kubectl get node "$N" -o json \ + | jq -r '(.metadata.annotations // {}) + | keys[] + | select(startswith("amd.com/gpu-hw-acceptance-stage-") + or test("amd.com/.+-(failure-reason|failed-subtest|failed-nics)$"))') + for K in $KEYS; do + docker exec {{ container_name }} kubectl annotate node "$N" "${K}-" \ + --overwrite 2>&1 | sed 's/^/ /' || true + done + done + args: + executable: /bin/bash + register: reset_ann_clear + changed_when: "'annotated' in (reset_ann_clear.stdout | default(''))" + when: action == 'reset' or _is_fresh_run + + - name: reset > suspend CronJob (optional) + shell: | + docker exec {{ container_name }} kubectl patch cronjob {{ cronjob_name }} \ + -n {{ cvf_namespace }} --type=merge -p '{"spec":{"suspend":true}}' + args: + executable: /bin/bash + register: reset_cron_suspend + changed_when: "'patched' in (reset_cron_suspend.stdout | default(''))" + when: (action == 'reset' or _is_fresh_run) and (suspend_cronjob | bool) + + - name: reset > delete completed Jobs (optional, mutually exclusive with delete_all_jobs) + shell: | + set -eu + docker exec {{ container_name }} kubectl get jobs -n {{ cvf_namespace }} -o json \ + | jq -r '.items + | map(select(.status.active // 0 == 0)) + | map(select(.status.succeeded // 0 > 0 or .status.failed // 0 > 0)) + | .[].metadata.name' \ + | while read -r J; do + docker exec {{ container_name }} kubectl delete job "$J" -n {{ cvf_namespace }} 2>&1 | sed 's/^/ /' || true + done + args: + executable: /bin/bash + register: reset_job_completed + changed_when: "'deleted' in (reset_job_completed.stdout | default(''))" + when: + - action == 'reset' or _is_fresh_run + - delete_completed_jobs | bool + - not (delete_all_jobs | bool) + + - name: reset > delete ALL CVF Jobs + MPIJob CRDs + orphan pods (optional) + shell: | + set -eu + # 1. Phase 5 MPIJob CRDs (cascade-deletes launcher Job + worker pods). + docker exec {{ container_name }} kubectl get mpijobs.kubeflow.org -n {{ cvf_namespace }} -o json 2>/dev/null \ + | jq -r '.items + | map(select(.metadata.name | startswith("cluster-validation-mpi-job-") + or startswith("cvf-"))) + | .[].metadata.name' \ + | while read -r M; do + docker exec {{ container_name }} kubectl delete mpijob.kubeflow.org "$M" -n {{ cvf_namespace }} --wait=true 2>&1 | sed 's/^/ /' || true + done + + # 2. batch/v1 Jobs. wait=true is intentional - next trigger + # reuses these names and kubectl apply rejects rc=2 if the + # same-name Job is still in-flight selector immutable. + docker exec {{ container_name }} kubectl get jobs -n {{ cvf_namespace }} -o json \ + | jq -r '.items + | map(select((.metadata.name | startswith("cluster-validation-cron-job-")) + or (.metadata.name | startswith("cluster-validation-mpi-job-")) + or (.metadata.name | startswith("cvf-e2e")) + or (.metadata.name | startswith("cvf-test")) + or (.metadata.name | startswith("cvf-tr")) + or (.metadata.name | startswith("cvf-phase")) + or (.metadata.name | startswith("cvf-p4")) + or (.metadata.name | startswith("ib-write-bw")))) + | .[].metadata.name' \ + | while read -r J; do + docker exec {{ container_name }} kubectl delete job "$J" -n {{ cvf_namespace }} --wait=true 2>&1 | sed 's/^/ /' || true + done + + # 3. Orphan pods that survived Job/MPIJob deletion. + docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} -o json \ + | jq -r '.items + | map(select((.metadata.name | startswith("cluster-validation-cron-job-")) + or (.metadata.name | startswith("cluster-validation-mpi-job-")) + or (.metadata.name | startswith("cvf-e2e")) + or (.metadata.name | startswith("cvf-test")) + or (.metadata.name | startswith("cvf-tr")) + or (.metadata.name | startswith("cvf-phase")) + or (.metadata.name | startswith("cvf-p4")) + or (.metadata.name | startswith("ib-write-bw")))) + | .[].metadata.name' \ + | while read -r P; do + docker exec {{ container_name }} kubectl delete pod "$P" -n {{ cvf_namespace }} --force --grace-period=0 2>&1 | sed 's/^/ /' || true + done + args: + executable: /bin/bash + register: reset_all_jobs + changed_when: "'deleted' in (reset_all_jobs.stdout | default(''))" + when: (action == 'reset' or _is_fresh_run) and (delete_all_jobs | bool) + + - name: reset > show node labels (after) + shell: | + docker exec {{ container_name }} kubectl get nodes \ + -L amd.com/gpu-hw-acceptance \ + -L amd.com/gpu-mesh-validation \ + -L amd.com/nic-health \ + -L amd.com/rail-bandwidth \ + -L amd.com/cluster-validation-status + register: reset_after + changed_when: false + when: action == 'reset' or _is_fresh_run + + - debug: + msg: "{{ reset_after.stdout_lines }}" + when: (action == 'reset' or _is_fresh_run) and reset_after is defined + + # ====================================================================== + # Action: inject-secret + # ====================================================================== + - name: inject > verify ServiceAccount exists + shell: | + docker exec {{ container_name }} kubectl get sa {{ service_account_name }} \ + -n {{ cvf_namespace }} >/dev/null + changed_when: false + when: action == 'inject-secret' or _is_fresh_run + + - name: inject > build credentials list (CLI override OR controller-side config.json) + shell: | + set -eu + if [ -n "{{ username }}" ] && [ -n "{{ token }}" ]; then + REG='{{ registry_url | default("docker.io", true) }}' + USR='{{ username }}' + TOK='{{ token }}' + NAME='{{ secret_name }}' + [ -z "$NAME" ] && NAME="$(echo "$USR" | tr -c 'a-z0-9-' '-' | sed 's/-*$//')-pull-secret" + jq -nc --arg r "$REG" --arg u "$USR" --arg t "$TOK" --arg n "$NAME" \ + '[{registry: $r, username: $u, token: $t, name: $n}]' + else + jq -c '[(.global["image-pull-secrets"] // [])[] + | { registry: (.["registry-url"] // "docker.io"), + username: .username, + token: .token, + name: ((.username | gsub("[^a-z0-9-]"; "-") | sub("-+$"; "")) + "-pull-secret") + }]' '{{ playbook_dir }}/../../configs/config.json' + fi + args: + executable: /bin/bash + delegate_to: localhost + run_once: true + register: inject_creds_json + changed_when: false + when: action == 'inject-secret' or _is_fresh_run + + - name: inject > parse credentials list + set_fact: + inject_creds_list: "{{ inject_creds_json.stdout | from_json }}" + when: (action == 'inject-secret' or _is_fresh_run) and inject_creds_json is defined + + - name: inject > show credentials (token redacted) + debug: + msg: "{{ inject_creds_list | map('combine', {'token': ''}) | list }}" + when: (action == 'inject-secret' or _is_fresh_run) and inject_creds_list is defined + + - name: inject > skip when no credentials configured (fresh-run treats as no-op) + debug: + msg: "No image-pull credentials configured -- skipping (no-op)." + when: + - action == 'inject-secret' or _is_fresh_run + - inject_creds_list is defined + - (inject_creds_list | length) == 0 + + - name: inject > fail when explicitly asked for inject-secret but no creds + fail: + msg: | + No image-pull credentials found. + Either add entries to configs/config.json -> global.image-pull-secrets, + or pass -e username=... -e token=... [-e registry_url=...] + when: + - action == 'inject-secret' # explicit, not fresh-run + - inject_creds_list is defined + - (inject_creds_list | length) == 0 + + - name: inject > create/update each docker-registry Secret (idempotent) + shell: | + set -eu + SRV_FLAG="" + if [ "{{ item.registry }}" != "docker.io" ] && [ "{{ item.registry }}" != "index.docker.io" ]; then + SRV_FLAG="--docker-server={{ item.registry }}" + fi + docker exec {{ container_name }} kubectl create secret docker-registry {{ item.name }} \ + $SRV_FLAG \ + --docker-username='{{ item.username }}' --docker-password='{{ item.token }}' \ + -n {{ cvf_namespace }} \ + --dry-run=client -o yaml \ + | docker exec -i {{ container_name }} kubectl apply -f - + args: + executable: /bin/bash + loop: "{{ inject_creds_list | default([]) }}" + loop_control: + label: "{{ item.name }} ({{ item.registry }}/{{ item.username }})" + register: inject_secret_apply + changed_when: "'created' in (inject_secret_apply.stdout | default('')) or 'configured' in (inject_secret_apply.stdout | default(''))" + no_log: true + when: + - action == 'inject-secret' or _is_fresh_run + - inject_creds_list is defined + - (inject_creds_list | length) > 0 + + - name: inject > strategic-merge-patch SA (additive on imagePullSecrets, idempotent) + shell: | + docker exec {{ container_name }} kubectl patch sa {{ service_account_name }} \ + -n {{ cvf_namespace }} --type=strategic \ + -p '{"imagePullSecrets":[{"name":"{{ item.name }}"}]}' + args: + executable: /bin/bash + loop: "{{ inject_creds_list | default([]) }}" + loop_control: + label: "{{ item.name }}" + register: inject_sa_patch + changed_when: "'patched' in (inject_sa_patch.stdout | default('')) and 'no change' not in (inject_sa_patch.stdout | default(''))" + when: + - action == 'inject-secret' or _is_fresh_run + - inject_creds_list is defined + - (inject_creds_list | length) > 0 + + - name: inject > show SA imagePullSecrets after patch + shell: | + docker exec {{ container_name }} kubectl get sa {{ service_account_name }} \ + -n {{ cvf_namespace }} -o jsonpath='{.imagePullSecrets}' + register: inject_sa_state + changed_when: false + when: + - action == 'inject-secret' or _is_fresh_run + - inject_creds_list is defined + - (inject_creds_list | length) > 0 + + - debug: + msg: "{{ service_account_name }} imagePullSecrets: {{ inject_sa_state.stdout }}" + when: + - action == 'inject-secret' or _is_fresh_run + - inject_sa_state is defined and inject_sa_state.stdout is defined + + # ====================================================================== + # Action: trigger + # + # Reminder: this creates a one-shot Job ('cvf-test' by default) that + # runs INDEPENDENTLY of the */30 CronJob. If the manual run takes + # >30min, the next cron tick will spawn a SECOND orchestrator and + # race for the same nodes + Phase 4 per-rail Job names (which lack + # a timestamp suffix -> kubectl apply rc=2 on the loser). To avoid: + # ansible-playbook playbooks/cvf.yml -e action=reset -e suspend_cronjob=true + # before triggering, then re-enable cron after the manual run: + # docker exec server kubectl patch cronjob cluster-validation-cron-job \ + # -n default --type=merge -p '{"spec":{"suspend":false}}' + # ====================================================================== + - name: trigger > verify CronJob template exists + shell: | + docker exec {{ container_name }} kubectl get cronjob {{ cronjob_name }} \ + -n {{ cvf_namespace }} >/dev/null + changed_when: false + when: action == 'trigger' or _is_fresh_run + + - name: trigger > delete prior Job with the same name (if any) + shell: | + docker exec {{ container_name }} kubectl delete job {{ job_name }} -n {{ cvf_namespace }} \ + --ignore-not-found=true --wait=true 2>&1 + args: + executable: /bin/bash + register: trigger_prior + changed_when: "'deleted' in (trigger_prior.stdout | default(''))" + when: action == 'trigger' or _is_fresh_run + + - name: trigger > force-clean orphan pods from the prior Job + shell: | + docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} \ + -l job-name={{ job_name }} -o name 2>/dev/null \ + | xargs -r -n1 docker exec {{ container_name }} kubectl delete -n {{ cvf_namespace }} \ + --force --grace-period=0 2>&1 || true + args: + executable: /bin/bash + changed_when: false + when: action == 'trigger' or _is_fresh_run + + - name: trigger > create Job from CronJob template + shell: | + docker exec {{ container_name }} kubectl create job --from=cronjob/{{ cronjob_name }} \ + {{ job_name }} -n {{ cvf_namespace }} + args: + executable: /bin/bash + register: trigger_create + changed_when: "'created' in (trigger_create.stdout | default(''))" + when: action == 'trigger' or _is_fresh_run + + - name: trigger > wait for orchestrator pod to be scheduled (up to 30s) + shell: | + for i in $(seq 1 15); do + POD=$(docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} \ + -l job-name={{ job_name }} -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) + if [ -n "$POD" ]; then + echo "$POD" + exit 0 + fi + sleep 2 + done + echo "ERROR: orchestrator pod did not appear within 30s" >&2 + exit 1 + args: + executable: /bin/bash + register: trigger_pod + changed_when: false + when: action == 'trigger' or _is_fresh_run + + - name: trigger > show orchestrator pod state + shell: | + docker exec {{ container_name }} kubectl get pod {{ trigger_pod.stdout }} -n {{ cvf_namespace }} -o wide + register: trigger_state + changed_when: false + when: (action == 'trigger' or _is_fresh_run) and trigger_pod is defined + + - debug: + msg: + - "Triggered: job/{{ job_name }} (orchestrator pod: {{ trigger_pod.stdout }})" + - "{{ trigger_state.stdout_lines }}" + when: (action == 'trigger' or _is_fresh_run) and trigger_state is defined + + # ====================================================================== + # Action: status + # ====================================================================== + - name: status > nodes (wide) + shell: docker exec {{ container_name }} kubectl get nodes -o wide + register: status_nodes + changed_when: false + ignore_errors: yes + when: action == 'status' or _is_fresh_run + + - debug: + msg: "{{ status_nodes.stdout_lines }}" + when: (action == 'status' or _is_fresh_run) and status_nodes is defined and status_nodes.rc == 0 + + - name: status > CronJob + shell: docker exec {{ container_name }} kubectl get cronjob -n {{ cvf_namespace }} -o wide + register: status_cron + changed_when: false + ignore_errors: yes + when: action == 'status' or _is_fresh_run + + - debug: + msg: "{{ status_cron.stdout_lines }}" + when: (action == 'status' or _is_fresh_run) and status_cron is defined and status_cron.rc == 0 + + - name: status > recent orchestrator pods (last 5) + shell: | + docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} -o json 2>/dev/null \ + | jq -r '.items + | map(select(.metadata.name | startswith("cluster-validation-cron-job-") or startswith("cvf-e2e-") or startswith("cvf-test"))) + | sort_by(.metadata.creationTimestamp) + | reverse | .[0:5] + | .[] | "\(.metadata.name)\t\(.status.phase)\t\(.spec.nodeName // "?")\t\(.metadata.creationTimestamp)"' + args: + executable: /bin/bash + register: status_orch + changed_when: false + ignore_errors: yes + when: action == 'status' or _is_fresh_run + + - debug: + msg: "Recent orchestrator pods (NAME | PHASE | NODE | CREATED):\n{{ status_orch.stdout }}" + when: (action == 'status' or _is_fresh_run) and status_orch is defined + + - name: status > tail logs of the most recent orchestrator pod + shell: | + POD=$(docker exec {{ container_name }} kubectl get pods -n {{ cvf_namespace }} -o json 2>/dev/null \ + | jq -r '.items + | map(select(.metadata.name | startswith("cluster-validation-cron-job-") or startswith("cvf-e2e-") or startswith("cvf-test"))) + | sort_by(.metadata.creationTimestamp) + | last.metadata.name // ""') + [ -z "$POD" ] && { echo "no orchestrator pods yet"; exit 0; } + echo "===== Orchestrator pod: $POD =====" + docker exec {{ container_name }} kubectl logs "$POD" -n {{ cvf_namespace }} 2>&1 | tail -n 80 + args: + executable: /bin/bash + register: status_log + changed_when: false + ignore_errors: yes + when: action == 'status' or _is_fresh_run + + - debug: + msg: "{{ status_log.stdout_lines }}" + when: (action == 'status' or _is_fresh_run) and status_log is defined + + - name: status > in-flight phase Jobs + shell: | + docker exec {{ container_name }} kubectl get jobs -n {{ cvf_namespace }} -o json 2>/dev/null \ + | jq -r '.items + | map(select((.metadata.name | startswith("cluster-validation-cron-job-")) | not)) + | map("\(.metadata.name)\tactive=\(.status.active // 0)\tsucc=\(.status.succeeded // 0)\tfail=\(.status.failed // 0)") + | .[]?' || true + args: + executable: /bin/bash + register: status_jobs + changed_when: false + ignore_errors: yes + when: action == 'status' or _is_fresh_run + + - debug: + msg: "Per-phase Jobs:\n{{ status_jobs.stdout if status_jobs.stdout else ' (none)' }}" + when: (action == 'status' or _is_fresh_run) and status_jobs is defined + + - name: status > MPIJob (Phase 5) + shell: | + docker exec {{ container_name }} kubectl get mpijob -n {{ cvf_namespace }} \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[-1].type}{"\t"}{.status.conditions[-1].reason}{"\n"}{end}' 2>/dev/null \ + || echo "(no mpijob)" + args: + executable: /bin/bash + register: status_mpi + changed_when: false + ignore_errors: yes + when: action == 'status' or _is_fresh_run + + - debug: + msg: "MPIJob:\n{{ status_mpi.stdout }}" + when: (action == 'status' or _is_fresh_run) and status_mpi is defined + + - name: status > per-node phase labels + Phase 1 stage annotations + failure reasons + shell: | + set -eu + echo "================================================================" + echo " Per-node phase status" + echo " P1=amd.com/gpu-hw-acceptance P2=amd.com/gpu-mesh-validation" + echo " P3=amd.com/nic-health P4=amd.com/rail-bandwidth" + echo " P5/AGG=amd.com/cluster-validation-status (Phase 5 = aggregate label)" + echo "================================================================" + printf " %-72s | %-9s | %-9s | %-9s | %-9s | %-9s | %s\n" \ + NODE P1 P2 P3 P4 P5/AGG LAST_RUN + printf " %s\n" "$(printf -- '-%.0s' {1..165})" + for NODE in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do + NJ=$(docker exec {{ container_name }} kubectl get node "$NODE" -o json) + P1=$(jq -r '.metadata.labels."amd.com/gpu-hw-acceptance" // "—"' <<<"$NJ") + P2=$(jq -r '.metadata.labels."amd.com/gpu-mesh-validation" // "—"' <<<"$NJ") + P3=$(jq -r '.metadata.labels."amd.com/nic-health" // "—"' <<<"$NJ") + P4=$(jq -r '.metadata.labels."amd.com/rail-bandwidth" // "—"' <<<"$NJ") + P5=$(jq -r '.metadata.labels."amd.com/cluster-validation-status" // "—"' <<<"$NJ") + TS=$(jq -r '.metadata.annotations."amd.com/cluster-validation-last-run-timestamp" // "never"' <<<"$NJ") + printf " %-72s | %-9s | %-9s | %-9s | %-9s | %-9s | %s\n" \ + "$NODE" "$P1" "$P2" "$P3" "$P4" "$P5" "$TS" + done + echo "" + echo "================================================================" + echo " Per-node Phase 1 stage annotations" + echo "================================================================" + for NODE in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do + echo " $NODE:" + docker exec {{ container_name }} kubectl get node "$NODE" -o json \ + | jq -r '(.metadata.annotations // {}) + | to_entries + | map(select(.key | startswith("amd.com/gpu-hw-acceptance-stage-"))) + | sort_by(.key) + | if length == 0 then " (no stage annotations yet)" + else (map(" " + .key + " = " + .value) | join("\n")) end' + done + echo "" + echo "================================================================" + echo " Per-node failure-reason / failed-subtest / failed-nics annotations" + echo "================================================================" + any_fail=0 + for NODE in $(docker exec {{ container_name }} kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do + OUT=$(docker exec {{ container_name }} kubectl get node "$NODE" -o json \ + | jq -r '(.metadata.annotations // {}) + | to_entries + | map(select(.key | test("-(failure-reason|failed-subtest|failed-nics)$"))) + | map(" " + .key + " = " + .value) + | .[]?') + if [ -n "$OUT" ]; then + echo " $NODE:"; echo "$OUT"; any_fail=1 + fi + done + if [ $any_fail -eq 0 ]; then + echo " (none — all phases passing or pending)" + fi + args: + executable: /bin/bash + register: status_per_node + changed_when: false + failed_when: false + when: action == 'status' or _is_fresh_run + + - debug: + msg: "{{ status_per_node.stdout_lines }}" + when: (action == 'status' or _is_fresh_run) and status_per_node is defined + +# ---------------------------------------------------------------------------- +# Second play: container ping across all nodes (status / fresh-run only). +# Lightweight, lets operators see whether server + agent docker containers +# are alive without having to run a separate command. +# ---------------------------------------------------------------------------- +- name: CVF day-2 operations (per-node container check) + hosts: cluster_nodes + gather_facts: false + vars: + action: "" + _is_fresh_run: "{{ action == 'fresh-run' }}" + tasks: + - name: status > docker containers on each node + shell: docker ps --filter 'name=server' --filter 'name=agent' --format 'table {% raw %}{{.Names}}\t{{.Status}}{% endraw %}' + register: status_containers + changed_when: false + when: action == 'status' or _is_fresh_run + + - debug: + msg: "{{ inventory_hostname }}: {{ status_containers.stdout_lines }}" + when: (action == 'status' or _is_fresh_run) and status_containers is defined + + # ---- Per-node log file inventory ---------------------------------- + # Every phase Job/MPIJob writes its full stdout/stderr to a host + # hostPath at /var/log/cluster-validation/__.log + # (see cluster-validation-job.yaml). Pods get GC'd after + # ttlSecondsAfterFinished, but the files persist. Show the LATEST + # file per phase per node so operators can `tail`/`grep` directly + # without hunting through `kubectl get pods`. + - name: status > latest log file per phase on each node + shell: | + set -eu + LOG_DIR=/var/log/cluster-validation + if [ ! -d "$LOG_DIR" ]; then + echo " no-log-dir : node has not run any phase Job yet" + exit 0 + fi + # Parallel arrays: label[i] paired with glob[i]. Order matches + # the pipeline so output reads top-to-bottom. + labels="phase1 phase2 phase3 phase4_server phase4_client phase4_5 phase5 orchestrator" + i=0 + for label in $labels; do + case "$label" in + phase1) glob='*_phase1_*.log' ;; + phase2) glob='*_phase2_*.log' ;; + phase3) glob='*_phase3_*.log' ;; + phase4_server) glob='*_phase4-server_*.log' ;; + phase4_client) glob='*_phase4-client_*.log' ;; + phase4_5) glob='*_phase4_5_*.log' ;; + phase5) glob='*_phase5_*.log' ;; + orchestrator) glob='cronjob-*.log' ;; + esac + # ls -t over a glob; suppress "no match" stderr; head -1 = newest + latest=$(cd "$LOG_DIR" && ls -t $glob 2>/dev/null | head -1 || true) + if [ -z "$latest" ]; then + printf " %-18s : none\n" "$label" + else + sz=$(stat -c%s "$LOG_DIR/$latest" 2>/dev/null || echo 0) + printf " %-18s : %s/%s bytes=%d\n" "$label" "$LOG_DIR" "$latest" "$sz" + fi + done + args: + executable: /bin/bash + register: status_logs + changed_when: false + failed_when: false + when: action == 'status' or _is_fresh_run + + - debug: + msg: + - "================================================================" + - " Log files on {{ inventory_hostname }} (latest per phase)" + - "================================================================" + - "{{ status_logs.stdout_lines }}" + when: (action == 'status' or _is_fresh_run) and status_logs is defined diff --git a/example/gpu-validation-cluster/ansible/playbooks/remove-agent-nodes.yml b/example/gpu-validation-cluster/ansible/playbooks/remove-agent-nodes.yml index 6d9e71e26..0ecd209e3 100644 --- a/example/gpu-validation-cluster/ansible/playbooks/remove-agent-nodes.yml +++ b/example/gpu-validation-cluster/ansible/playbooks/remove-agent-nodes.yml @@ -221,7 +221,7 @@ - Cluster state cleaned up To verify: - ansible-playbook playbooks/check-status.yml + ansible-playbook playbooks/cvf.yml -e action=status To re-add nodes: ansible-playbook playbooks/add-agent-nodes.yml --limit node-name diff --git a/example/gpu-validation-cluster/ansible/playbooks/set-gpu-compute-mode.yml b/example/gpu-validation-cluster/ansible/playbooks/set-gpu-compute-mode.yml new file mode 100644 index 000000000..f52ff2eae --- /dev/null +++ b/example/gpu-validation-cluster/ansible/playbooks/set-gpu-compute-mode.yml @@ -0,0 +1,118 @@ +--- +# Switch the GPU compute-partition mode on target hosts (default: agent_nodes) +# and reboot to make the change stick. +# +# Usage: +# ansible-playbook -i inventory.test.yml playbooks/set-gpu-compute-mode.yml \ +# -e partition=SPX +# ansible-playbook -i inventory.test.yml playbooks/set-gpu-compute-mode.yml \ +# -e partition=SPX -e target=agent-node-1 +# +# Vars (all optional): +# partition -- SPX (default) | CPX | TPX | QPX | DPX +# target -- inventory pattern (default: agent_nodes) +# skip_reboot -- true to set the partition but skip the reboot +# (only useful if the firmware applies it live; on +# MI300/MI308 a reboot is normally required to reclaim BARs) +# +# After this completes, re-run setup-cluster.yml to rejoin the agent +# container to k3s -- the existing tasks in that playbook are idempotent +# and will recreate /var/lib/gpu-validation-cluster state cleanly. + +- name: Drain CVF agent container before partition change + hosts: "{{ target | default('agent_nodes') }}" + gather_facts: no + become: yes + tasks: + - name: Stop agent container if present + command: docker rm -f agent + changed_when: false + failed_when: false + +- name: Set GPU compute partition mode + hosts: "{{ target | default('agent_nodes') }}" + gather_facts: yes + become: yes + vars: + partition: SPX + skip_reboot: false + tasks: + - name: Verify rocm-smi is installed + command: which rocm-smi + register: rocm_smi_check + changed_when: false + + - name: Show current compute partition (before) + command: rocm-smi --showcomputepartition + register: partition_before + changed_when: false + + - debug: + msg: "{{ partition_before.stdout_lines }}" + + - name: Set compute partition to {{ partition }} + command: rocm-smi --setcomputepartition {{ partition }} + register: partition_set + # rocm-smi exits 0 even when the change is queued for next boot. + # Inspect stdout if you need to distinguish live vs deferred. + changed_when: true + + - debug: + msg: "{{ partition_set.stdout_lines }}" + + - name: Show current compute partition (after, pre-reboot) + command: rocm-smi --showcomputepartition + register: partition_after_pre_reboot + changed_when: false + + - debug: + msg: "{{ partition_after_pre_reboot.stdout_lines }}" + + # MI300/MI308 generally need a reboot to free PCIe BAR space allocated + # for the now-removed CPX shadow PFs. Without the reboot the amdgpu + # driver may keep the previous partition layout in /sys. + - name: Reboot node to apply partition change + reboot: + msg: "Reboot for GPU compute-partition change to {{ partition }}" + connect_timeout: 30 + reboot_timeout: 900 # up to 15 min for hardware POST + boot + pre_reboot_delay: 5 + post_reboot_delay: 30 + test_command: whoami + when: not (skip_reboot | bool) + + - name: Wait for docker daemon to come back after reboot + shell: | + for i in $(seq 1 30); do + if docker info >/dev/null 2>&1; then exit 0; fi + sleep 5 + done + exit 1 + args: + executable: /bin/bash + changed_when: false + when: not (skip_reboot | bool) + + - name: Show current compute partition (after reboot) + command: rocm-smi --showcomputepartition + register: partition_after_reboot + changed_when: false + when: not (skip_reboot | bool) + + - debug: + msg: "{{ partition_after_reboot.stdout_lines }}" + when: not (skip_reboot | bool) + +- name: Summary + hosts: localhost + connection: local + gather_facts: no + tasks: + - debug: + msg: | + ============================================================ + GPU partition change complete. + Next: + ansible-playbook -i inventory.test.yml playbooks/setup-cluster.yml + ansible-playbook -i inventory.test.yml playbooks/cvf.yml -e action=fresh-run -e suspend_cronjob=true + ============================================================ diff --git a/example/gpu-validation-cluster/ansible/playbooks/setup-cluster.yml b/example/gpu-validation-cluster/ansible/playbooks/setup-cluster.yml index 2aab19907..7075cc149 100644 --- a/example/gpu-validation-cluster/ansible/playbooks/setup-cluster.yml +++ b/example/gpu-validation-cluster/ansible/playbooks/setup-cluster.yml @@ -56,18 +56,51 @@ become: yes tasks: - - name: Update apt cache - apt: - update_cache: yes - cache_valid_time: 3600 - when: ansible_os_family == "Debian" - + # Probe docker BEFORE the first apt update. If docker is already + # installed and working we leave the host's apt sources untouched; + # only when we are going to (re)install docker do we normalize its apt + # source below. This keeps the playbook from disturbing a healthy host. - name: Check if Docker is installed command: docker --version register: docker_check ignore_errors: yes changed_when: false + # A docker apt source left by prior provisioning -- often BOTH a + # hand-written docker.list AND an Ansible-generated + # download_docker_com_linux_ubuntu.list -- can declare the same repo + # with a `signed-by` that conflicts (or is empty), so `apt update` + # aborts with "Conflicting values set for option Signed-By ... + # download.docker.com" and the whole play dies before docker installs. + # When we are about to install docker, unconditionally remove ALL + # pre-existing docker source files first; we recreate one canonical + # entry (correct signed-by) a few tasks below. Done BEFORE the first + # apt cache update so that update can't trip over the conflict. + - name: Remove pre-existing docker apt source files (before cache update) + shell: | + set -o pipefail + matches=$(ls -1 /etc/apt/sources.list.d/*docker*.list \ + /etc/apt/sources.list.d/*docker*.sources \ + /etc/apt/sources.list.d/download_docker_com*.list \ + 2>/dev/null || true) + if [ -n "$matches" ]; then + echo "$matches" | xargs rm -f + echo "REMOVED" + fi + args: + executable: /bin/bash + register: docker_apt_source_cleanup + changed_when: "'REMOVED' in (docker_apt_source_cleanup.stdout | default(''))" + when: + - ansible_os_family == "Debian" + - docker_check.rc != 0 + + - name: Update apt cache + apt: + update_cache: yes + cache_valid_time: 3600 + when: ansible_os_family == "Debian" + - name: Install Docker prerequisites apt: name: @@ -80,18 +113,34 @@ - ansible_os_family == "Debian" - docker_check.rc != 0 - - name: Add Docker GPG key - apt_key: + - name: Ensure apt keyrings directory exists + file: + path: /etc/apt/keyrings + state: directory + mode: '0755' + when: + - ansible_os_family == "Debian" + - docker_check.rc != 0 + + # Use the modern keyring path (NOT the deprecated apt_key, which writes + # to the global trusted.gpg). force:yes refreshes the key so a stale + # one left by a prior provisioning is replaced with a known-good copy. + - name: Add Docker GPG key to /etc/apt/keyrings/docker.asc + get_url: url: https://download.docker.com/linux/ubuntu/gpg - state: present + dest: /etc/apt/keyrings/docker.asc + mode: '0644' + force: yes when: - ansible_os_family == "Debian" - docker_check.rc != 0 - - name: Add Docker repository + - name: Add Docker repository (canonical entry with explicit signed-by) apt_repository: - repo: "deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable" + repo: "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable" + filename: docker state: present + update_cache: yes when: - ansible_os_family == "Debian" - docker_check.rc != 0 @@ -108,12 +157,67 @@ - ansible_os_family == "Debian" - docker_check.rc != 0 - - name: Start and enable Docker service + # Some GPU hosts ship a NATIVE k3s install (k3s.service / k3s-agent.service + # under /etc/systemd/system) from a prior bringup. That k3s binds host + # ports 6443/6444 (we run with --net=host inside the container), so our + # containerized k3s server fails with "bind: address already in use". + # Stop + disable any such native units before starting our stack. + - name: Find native k3s systemd units (if any) + # `systemctl list-unit-files 'k3s*.service'` exits rc=1 when no units + # match (the common case on a fresh host). Combined with set -o + # pipefail that fails the pipe, so we explicitly mask it with + # failed_when: false -- empty stdout is the legitimate "no native + # k3s installed" signal, which the next task handles via `when:`. + shell: | + set -o pipefail + systemctl list-unit-files --no-legend 'k3s*.service' 2>/dev/null \ + | awk '{print $1}' | tr '\n' ' ' + args: + executable: /bin/bash + register: native_k3s_units + changed_when: false + failed_when: false + + - name: Stop + disable native k3s units (if any are present) + shell: | + set -eu + for U in {{ native_k3s_units.stdout }}; do + systemctl stop "$U" 2>/dev/null || true + systemctl disable "$U" 2>/dev/null || true + done + # Belt-and-suspenders: in case k3s was started manually + if command -v /usr/local/bin/k3s-killall.sh >/dev/null 2>&1; then + /usr/local/bin/k3s-killall.sh 2>/dev/null || true + fi + args: + executable: /bin/bash + register: native_k3s_stop + changed_when: native_k3s_units.stdout | trim != '' + when: native_k3s_units.stdout | trim != '' + + # Some GPU hosts get the `docker` group from LDAP/SSSD only and not + # /etc/group. systemd's docker.socket unit has `SocketGroup=docker` + # and resolves it before SSSD is fully up at boot, so the socket fails + # with exit code 216/GROUP and the service never starts. Ensuring a + # local /etc/group entry makes resolution work via files-NSS at boot. + - name: Ensure local /etc/group has a 'docker' entry (idempotent) + group: + name: docker + state: present + when: ansible_os_family == "Debian" + + # Always (re)start docker — previously this only ran when docker was + # newly installed (`when: docker_check.rc != 0`), so hosts where the + # daemon had failed at boot stayed broken across re-runs. + - name: Reset any failed docker units (idempotent, no-op when clean) + shell: systemctl reset-failed docker.socket docker.service 2>/dev/null || true + changed_when: false + + - name: Start and enable Docker service (always) systemd: name: docker state: started enabled: yes - when: docker_check.rc != 0 - name: Add user to docker group user: @@ -234,11 +338,42 @@ hosts: server_nodes gather_facts: yes become: yes + vars: + # Pass `-e force_restart=true` to force-recreate the container even + # when it already looks healthy (e.g. after editing gpu-cluster.sh + # or config.json in a way that requires a fresh start). + force_restart: false tasks: + # Idempotency probe: container is "healthy" iff it is currently Up + # AND its kubectl can reach the apiserver. If both true, skip the + # destructive rm + run so re-runs don't churn k3s state. + - name: Probe existing server container health + shell: | + set -o pipefail + UP=$(docker ps --filter name=^server$ --format '{% raw %}{{.Status}}{% endraw %}' | head -1) + if [ -z "$UP" ]; then + echo "missing" + exit 0 + fi + if docker exec server kubectl get --raw=/readyz >/dev/null 2>&1; then + echo "healthy" + else + echo "unhealthy" + fi + args: + executable: /bin/bash + register: server_health + changed_when: false + + - name: Show server container health + debug: + msg: "server health: {{ server_health.stdout }} (force_restart={{ force_restart }})" + - name: Stop existing server container if running command: docker rm -f server ignore_errors: yes + when: server_health.stdout != 'healthy' or (force_restart | bool) - name: Start server container shell: | @@ -247,11 +382,13 @@ sleep 5 args: executable: /bin/bash + when: server_health.stdout != 'healthy' or (force_restart | bool) - name: Wait for server to be ready wait_for: timeout: 50 delegate_to: localhost + when: server_health.stdout != 'healthy' or (force_restart | bool) - name: Check server container status shell: 'docker ps --filter name=server --format "{{ ''{{'' }}.Names{{ ''}}'' }}: {{ ''{{'' }}.Status{{ ''}}'' }}"' @@ -287,6 +424,8 @@ gather_facts: no become: yes serial: 1 # Start agents one at a time to avoid overwhelming the server + vars: + force_restart: false tasks: - name: Get server IP and token @@ -298,9 +437,43 @@ debug: msg: "Connecting to server at {{ server_ip }}" + # Agent "healthy" iff: container Up AND it has an established TCP + # connection from inside the container to the k3s apiserver port + # (6443) on the server. This is local to the agent host (no + # cross-node SSH needed) and catches both "container died" and + # "container Up but k3s agent crashed / lost supervisor link". + - name: Probe existing agent container health + shell: | + set -o pipefail + UP=$(docker ps --filter name=^agent$ --format '{% raw %}{{.Status}}{% endraw %}' | head -1) + if [ -z "$UP" ]; then + echo "missing" + exit 0 + fi + # ss inside the agent container; k3s-agent holds an ESTAB conn + # to :6443. Match either peer in case ss column + # ordering changes between distros. + if docker exec agent ss -tn state established 2>/dev/null \ + | awk '{print $3, $4}' \ + | grep -qE "(:6443\s|\s{{ server_ip }}:6443)"; then + echo "healthy" + else + echo "unhealthy" + fi + args: + executable: /bin/bash + register: agent_health + changed_when: false + failed_when: false + + - name: Show agent container health + debug: + msg: "agent health: {{ agent_health.stdout }} (force_restart={{ force_restart }})" + - name: Stop existing agent container if running command: docker rm -f agent ignore_errors: yes + when: agent_health.stdout != 'healthy' or (force_restart | bool) - name: Start agent container shell: | @@ -309,11 +482,13 @@ sleep 5 args: executable: /bin/bash + when: agent_health.stdout != 'healthy' or (force_restart | bool) - name: Wait for agent to be ready wait_for: timeout: 15 delegate_to: localhost + when: agent_health.stdout != 'healthy' or (force_restart | bool) - name: Check agent container status shell: 'docker ps --filter name=agent --format "{{ ''{{'' }}.Names{{ ''}}'' }}: {{ ''{{'' }}.Status{{ ''}}'' }}"' @@ -352,6 +527,65 @@ debug: msg: "{{ cluster_info.stdout_lines }}" + # The bringup install runs inside the backgrounded gpu-cluster.sh on + # the server. Earlier this play only checked that nodes joined, so a + # half-installed cluster (operators missing because the install aborted + # during post-start k3s flapping) still reported success. Explicitly + # verify the helm releases reached STATUS=deployed and the CVF CronJob + # exists, retrying while operators reconcile, and fail loudly otherwise. + - name: Check whether network operator is enabled in config.json + command: docker exec server jq -r 'if .["network-operator"]["install-network-operator"] == null then true else .["network-operator"]["install-network-operator"] end' /configs/config.json + register: net_op_enabled + changed_when: false + + - name: Build expected Helm release list + set_fact: + expected_helm_releases: >- + {{ + [ + { 'release': 'cert-manager', 'ns': 'cert-manager' }, + { 'release': 'amd-gpu-operator', 'ns': 'kube-amd-gpu' } + ] + + ( + [ { 'release': 'amd-network-operator', 'ns': 'kube-amd-network' } ] + if (net_op_enabled.stdout | trim | bool) else [] + ) + }} + + - name: Verify required Helm releases are deployed + shell: | + set -o pipefail + docker exec server helm status "{{ item.release }}" -n "{{ item.ns }}" -o json 2>/dev/null \ + | jq -r '.info.status' 2>/dev/null + args: + executable: /bin/bash + register: helm_release_status + changed_when: false + retries: 30 + delay: 10 + until: helm_release_status.stdout == "deployed" + loop: "{{ expected_helm_releases }}" + loop_control: + label: "{{ item.release }}" + + - name: Check whether CVF is enabled in config.json + command: docker exec server jq -r '.["cluster-validation-framework"]["install-cvf"] // false' /configs/config.json + register: cvf_enabled + changed_when: false + + - name: Verify CVF CronJob exists (when CVF enabled) + command: docker exec server kubectl get cronjob cluster-validation-cron-job -n default + register: cvf_cronjob + changed_when: false + retries: 30 + delay: 10 + until: cvf_cronjob.rc == 0 + when: cvf_enabled.stdout | trim | bool + + - name: Installation verification passed + debug: + msg: "All operators deployed and CVF state verified." + - name: Summary hosts: localhost connection: local @@ -369,7 +603,7 @@ Agent Nodes: {{ groups['agent_nodes'] | length }} To check cluster status: - ansible-playbook playbooks/check-status.yml + ansible-playbook playbooks/cvf.yml -e action=status To teardown cluster: ansible-playbook playbooks/teardown-cluster.yml diff --git a/example/gpu-validation-cluster/ansible/playbooks/teardown-cluster.yml b/example/gpu-validation-cluster/ansible/playbooks/teardown-cluster.yml index 4448d4e45..d5fbd2937 100644 --- a/example/gpu-validation-cluster/ansible/playbooks/teardown-cluster.yml +++ b/example/gpu-validation-cluster/ansible/playbooks/teardown-cluster.yml @@ -1,17 +1,22 @@ --- # Teardown GPU Validation Cluster -# Usage: ansible-playbook playbooks/teardown-cluster.yml +# Usage: +# ansible-playbook playbooks/teardown-cluster.yml +# ansible-playbook playbooks/teardown-cluster.yml -e cleanup_test_logs=true - name: Teardown cluster on all nodes hosts: cluster_nodes gather_facts: no become: yes + vars: + # Set to true to also remove /var/log/cluster-validation on each node. + cleanup_test_logs: false tasks: - name: Run teardown script shell: | cd {{ script_dir }} - CONFIG_DIR={{ config_dir }} ./gpu-cluster.sh teardown + CONFIG_DIR={{ config_dir }} CLEANUP_TEST_LOGS={{ cleanup_test_logs | bool | string | lower }} ./gpu-cluster.sh teardown args: executable: /bin/bash ignore_errors: yes diff --git a/example/gpu-validation-cluster/ansible/quickstart.sh b/example/gpu-validation-cluster/ansible/quickstart.sh deleted file mode 100755 index 9ed0dc803..000000000 --- a/example/gpu-validation-cluster/ansible/quickstart.sh +++ /dev/null @@ -1,93 +0,0 @@ -#!/bin/bash -# Quick start script for GPU Validation Cluster Ansible deployment - -set -e - -SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -cd "$SCRIPT_DIR" - -echo "==========================================" -echo "GPU Validation Cluster - Quick Start" -echo "==========================================" -echo "" - -# Check if Ansible is installed -if ! command -v ansible &> /dev/null; then - echo "[ERROR] Ansible is not installed." - echo "[INFO] Install Ansible:" - echo " sudo apt update && sudo apt install ansible" - echo " or" - echo " pip install ansible" - exit 1 -fi - -echo "[INFO] Ansible version: $(ansible --version | head -n1)" -echo "" - -# Check if inventory exists -if [ ! -f "inventory.yml" ]; then - echo "[ERROR] inventory.yml not found!" - echo "[INFO] Please create and edit inventory.yml with your node details:" - echo " vi inventory.yml" - echo "" - echo "[INFO] Required configuration in inventory.yml:" - echo " - ansible_host: IP addresses of your nodes" - echo " - ansible_user: SSH username" - echo " - ansible_connection: local (for server node if running locally)" - exit 1 -fi - -echo "[INFO] Inventory file found: inventory.yml" -echo "" - -# Test connectivity -echo "[STEP 1] Testing SSH connectivity to nodes..." -if ansible all -m ping -o; then - echo "[SUCCESS] All nodes are reachable" -else - echo "[WARN] Some nodes are not reachable via SSH" - echo "" - read -p "Do you want to setup SSH keys? (y/n) " -n 1 -r - echo "" - if [[ $REPLY =~ ^[Yy]$ ]]; then - echo "[INFO] Running SSH key setup..." - ansible-playbook playbooks/setup-ssh-keys.yml --ask-pass --ask-become-pass - else - echo "[ERROR] Cannot proceed without SSH access to nodes" - exit 1 - fi -fi - -echo "" -echo "[STEP 2] Ready to deploy cluster" -echo "" -echo "The following will be performed:" -echo " 1. Build Docker image locally" -echo " 2. Install Docker and jq on all nodes (if needed)" -echo " 3. Copy image to all nodes" -echo " 4. Start server node" -echo " 5. Start agent nodes and join cluster" -echo "" -read -p "Proceed with cluster deployment? (y/n) " -n 1 -r -echo "" - -if [[ ! $REPLY =~ ^[Yy]$ ]]; then - echo "Deployment cancelled." - exit 0 -fi - -echo "" -echo "[STEP 3] Deploying cluster..." -ansible-playbook playbooks/setup-cluster.yml - -echo "" -echo "==========================================" -echo "Deployment Complete!" -echo "==========================================" -echo "" -echo "Check cluster status:" -echo " ansible-playbook playbooks/check-status.yml" -echo "" -echo "Teardown cluster:" -echo " ansible-playbook playbooks/teardown-cluster.yml" -echo "" diff --git a/example/gpu-validation-cluster/build/Dockerfile b/example/gpu-validation-cluster/build/Dockerfile index a4c99739b..cfc631327 100644 --- a/example/gpu-validation-cluster/build/Dockerfile +++ b/example/gpu-validation-cluster/build/Dockerfile @@ -34,12 +34,14 @@ RUN curl -sfL https://github.com/k3s-io/k3s/releases/download/${K3S_VERSION}/k3s mkdir -p /var/lib/rancher/k3s/agent RUN mkdir -p /images && \ - skopeo copy docker://ghcr.io/k8snetworkplumbingwg/multus-cni:${MULTUS_CNI_VERSION} oci-archive:/images/multus-cni-${MULTUS_CNI_VERSION}.tar && \ - zstd -19 /images/multus-cni-${MULTUS_CNI_VERSION}.tar -o /images/multus-cni-${MULTUS_CNI_VERSION}.tar.zst && \ - rm /images/multus-cni-${MULTUS_CNI_VERSION}.tar && \ - skopeo copy docker://registry.k8s.io/nfd/node-feature-discovery:${NFD_VERSION} oci-archive:/images/nfd-${NFD_VERSION}.tar && \ - zstd -19 /images/nfd-${NFD_VERSION}.tar -o /images/nfd-${NFD_VERSION}.tar.zst && \ - rm /images/nfd-${NFD_VERSION}.tar + skopeo copy \ + docker://ghcr.io/k8snetworkplumbingwg/multus-cni:${MULTUS_CNI_VERSION} \ + docker-archive:/images/multus-cni-${MULTUS_CNI_VERSION}.tar:ghcr.io/k8snetworkplumbingwg/multus-cni:${MULTUS_CNI_VERSION} && \ + zstd -19 --rm /images/multus-cni-${MULTUS_CNI_VERSION}.tar && \ + skopeo copy \ + docker://registry.k8s.io/nfd/node-feature-discovery:${NFD_VERSION} \ + docker-archive:/images/nfd-${NFD_VERSION}.tar:registry.k8s.io/nfd/node-feature-discovery:${NFD_VERSION} && \ + zstd -19 --rm /images/nfd-${NFD_VERSION}.tar # Set kubeconfig environment variable ENV KUBECONFIG=/etc/rancher/k3s/k3s.yaml diff --git a/example/gpu-validation-cluster/build/entrypoint.sh b/example/gpu-validation-cluster/build/entrypoint.sh index 7b7eeb7da..487a4c95b 100755 --- a/example/gpu-validation-cluster/build/entrypoint.sh +++ b/example/gpu-validation-cluster/build/entrypoint.sh @@ -26,7 +26,8 @@ start_k3s_server() { --disable=traefik \ --disable=servicelb \ --kubelet-arg=serialize-image-pulls=true \ - ${K3S_EXTRA_ARGS} > /var/log/k3s.log 2>&1 & + ${K3S_EXTRA_ARGS} >> /var/log/k3s.log 2>&1 & + K3S_PID=$! } start_k3s_agent() { @@ -35,7 +36,7 @@ start_k3s_agent() { exit 1 fi echo "[INFO] Starting k3s agent, joining server at $K3S_URL" - + # Create registries.yaml for in-cluster registry # do this for agent before starting the k3s agent process # at this point we already know the URL to in cluster registry service @@ -44,30 +45,100 @@ start_k3s_agent() { ${AGENT_REGISTRIES_CONFIG} EOF echo "[INFO] Created registries.yaml for registry at ${K3S_IP}:${IN_CLUSTER_REGISTRY_PORT}" - + k3s agent \ --server="$K3S_URL" \ --token="$K3S_TOKEN" \ --with-node-id \ --kubelet-arg=serialize-image-pulls=true \ - ${K3S_EXTRA_ARGS} > /var/log/k3s.log 2>&1 & + ${K3S_EXTRA_ARGS} >> /var/log/k3s.log 2>&1 & + K3S_PID=$! +} + +# Supervisor loop. Re-launches k3s ($MODE) whenever it exits non-zero, +# replacing the previous "fire-and-forget + sleep infinity" pattern. +# +# Why this exists: k3s frequently crashes on the FIRST start after an OS +# reboot due to transient host state -- iptables-nft chains wiped (kube- +# router netpol panics on missing KUBE-ROUTER-OUTPUT), missing cni0 +# device (flannel can't add IP), stale containerd socket file from prior +# boot, etc. A second/third attempt typically succeeds because the first +# attempt re-creates the chains/ifaces before crashing. Before this loop, +# such crashes left the container looking "Up" (entrypoint + sleep) while +# k3s was dead, requiring an operator to docker rm/run the container. +# +# Safety: a rolling cap (MAX_RESTARTS in WINDOW_SECS) prevents a hot loop +# when k3s is fundamentally misconfigured. When the cap is hit the script +# exits non-zero so `docker --restart=unless-stopped` (already set by +# gpu-cluster.sh) recreates the container fresh, clearing any per-PID +# state inside the container. The CVF setup-cluster.yml health probe +# will then detect the new container on its next idempotent re-run. +supervise_k3s() { + local mode="$1" + local restarts=0 + local window_start + window_start=$(date +%s) + local MAX_RESTARTS="${K3S_MAX_RESTARTS:-10}" + local WINDOW_SECS="${K3S_RESTART_WINDOW:-600}" # 10 min + local BACKOFF_MIN=2 + local BACKOFF_MAX=30 + local backoff="$BACKOFF_MIN" + + # Forward SIGTERM/SIGINT to k3s so `docker stop` is graceful (node + # drain) instead of getting SIGKILLed after the 10s grace period. + # shellcheck disable=SC2064 + trap 'echo "[INFO] supervisor: forwarding signal to k3s pid=$K3S_PID"; \ + [ -n "$K3S_PID" ] && kill -TERM "$K3S_PID" 2>/dev/null; \ + wait "$K3S_PID" 2>/dev/null; exit 0' TERM INT + + while true; do + case "$mode" in + server) start_k3s_server ;; + agent) start_k3s_agent ;; + esac + + # `wait $pid` blocks until that child exits and returns its exit code. + # `set -e` would abort the script if k3s exits non-zero -- we want to + # observe the rc and decide, so disable -e around the wait. + set +e + wait "$K3S_PID" + local rc=$? + set -e + local now + now=$(date +%s) + + # Reset the rolling restart window if k3s ran longer than WINDOW_SECS + # before crashing -- it was effectively stable, so this crash is + # unrelated to the post-boot flapping we are protecting against. + if [ $(( now - window_start )) -gt "$WINDOW_SECS" ]; then + restarts=0 + window_start=$now + backoff="$BACKOFF_MIN" + fi + + restarts=$(( restarts + 1 )) + echo "[WARN] supervisor: k3s $mode exited rc=$rc (restart $restarts/$MAX_RESTARTS in ${WINDOW_SECS}s window)" + + if [ "$restarts" -ge "$MAX_RESTARTS" ]; then + echo "[ERROR] supervisor: $MAX_RESTARTS restarts within ${WINDOW_SECS}s -- giving up so docker recreates the container" + exit 1 + fi + + echo "[INFO] supervisor: sleeping ${backoff}s before restart" + sleep "$backoff" + # Exponential backoff capped at BACKOFF_MAX so we don't grow unbounded. + backoff=$(( backoff * 2 )) + [ "$backoff" -gt "$BACKOFF_MAX" ] && backoff="$BACKOFF_MAX" + done } move_preloaded_images -echo "[INFO] Starting k3s in $MODE mode" +echo "[INFO] Starting k3s in $MODE mode (supervised)" case "$MODE" in - server) - start_k3s_server - ;; - agent) - start_k3s_agent - ;; + server|agent) supervise_k3s "$MODE" ;; *) echo "[ERROR] Unknown mode: $MODE. Use 'server' or 'agent'" exit 1 ;; esac - -# Keep container running -sleep infinity diff --git a/example/gpu-validation-cluster/configs/cluster-validation-config.yaml b/example/gpu-validation-cluster/configs/cluster-validation-config.yaml index 22dce2826..f48c90bba 100644 --- a/example/gpu-validation-cluster/configs/cluster-validation-config.yaml +++ b/example/gpu-validation-cluster/configs/cluster-validation-config.yaml @@ -3,31 +3,276 @@ kind: ConfigMap metadata: name: cluster-validation-config data: - LOG_STORE_NODE_NAME: "" # Specify a candidate node name to store logs for easier access, or leave empty to let the system pick one of the candidate nodes - JOB_NAME: cluster-validation-mpi-job # Must match MPIJob metadata.name - WORKER_REPLICAS: "__WORKER_REPLICAS__" # Number of Worker Pods in each MPIJob doing actual computation - LAUNCHER_REPLICAS: "__LAUNCHER_REPLICAS__" # Number of Launcher Pods for the MPIJob, which coordinates workers - SLOTS_PER_WORKER: "__SLOTS_PER_WORKER__" # MPI ranks per Worker pod - GPU_PER_WORKER: "__GPU_PER_WORKER__" # Number of GPUs to request per Worker pod - PF_NIC_PER_WORKER: "__PF_NIC_PER_WORKER__" # Number of PF-NICs to request per Worker pod - VF_NIC_PER_WORKER: "__VF_NIC_PER_WORKER__" # Number of VF-NICs to request per Worker pod - NODE_VALIDATION_INTERVAL_MINS: "__NODE_VALIDATION_INTERVAL_MINS__" # minimum interval in minutes between cluster validation runs on a given worker node + LOG_STORE_NODE_NAME: "" # Specify a candidate node name to store logs for easier access, or leave empty to let the system pick one of the candidate nodes + JOB_NAME: cluster-validation-mpi-job # Must match MPIJob metadata.name + # Resource defaults below. Override via config.json + # ["cluster-validation-framework"].resources.* (gpu-cluster.sh patches + # the deployed ConfigMap after this YAML is applied). + # Fields below carry a `patchable:` marker showing the JSON path that + # overrides them when gpu-cluster.sh runs. + WORKER_REPLICAS: "2" # Worker Pods per MPIJob. patchable: cluster-validation-framework.resources.worker-replicas + LAUNCHER_REPLICAS: "1" # Launcher Pods per MPIJob. patchable: cluster-validation-framework.resources.launcher-replicas + SLOTS_PER_WORKER: "8" # MPI ranks per Worker pod. patchable: cluster-validation-framework.resources.slots-per-worker + GPU_PER_WORKER: "8" # GPUs per Worker pod. patchable: cluster-validation-framework.resources.gpu-per-worker + PF_NIC_PER_WORKER: "0" # PF-NICs per Worker pod. patchable: cluster-validation-framework.resources.pf-nic-per-worker + VF_NIC_PER_WORKER: "8" # VF-NICs per Worker pod. patchable: cluster-validation-framework.resources.vf-nic-per-worker + NODE_VALIDATION_INTERVAL_MINS: "30" # Minimum minutes between validation runs per node. patchable: cluster-validation-framework.resources.node-validation-interval-mins PF_NIC_NAD_NAME: "amd-host-device-nad" VF_NIC_NAD_NAME: "vf-amd-host-device-nad" # === Node Selection Labels for candidates === # NOTE: # The entire list below can be customized via config.json["cluster-validation-framework"]["node-selector-labels"] - # Default: Select nodes with feature.node.kubernetes.io/amd-gpu=true and feature.node.kubernetes.io/amd-nic=true - # For virtual function (VF) based GPU in a VM, use amd-vgpu=true instead of amd-gpu=true + # Default: feature.node.kubernetes.io/amd-gpu=true ONLY. + # The previous default also required feature.node.kubernetes.io/amd-nic=true; that + # joined requirement was dropped so a datacenter early in bringup can run Phase 1 + # (GPU HW acceptance) and Phase 2 (intra-node GPU mesh) before any NIC + # infrastructure exists. This is the core enabling change for incremental bringup + # described design section 4 -> "Code Path: Candidate selection update". + # NIC-requiring phases (Phase 3 / NIC health, Phase 4 / rail bandwidth) narrow the + # candidate set inside their own run_phaseN script via + # intersect + # (see cluster-validation-job.yaml and design section 5). The orchestrator + # does NOT enforce amd-nic=true at candidate-selection time. + # For virtual function (VF) based GPU in a VM, use amd-vgpu=true instead of amd-gpu=true. # For virtual function (VF) based NIC in a VM, use amd-vnic=true instead of amd-nic=true + # (override node-selector-labels in config.json when running NIC-requiring phases on VFs). + # patchable: cluster-validation-framework.node-selector-labels (newline-joined) NODE_SELECTOR_LABELS: | - __NODE_SELECTOR_LABELS__ + feature.node.kubernetes.io/amd-gpu=true CANDIDATE_LABEL: "amd.com/cluster-validation-candidate=true" SUCCESS_LABEL: "amd.com/cluster-validation-status=passed" FAILURE_LABEL: "amd.com/cluster-validation-status=failed" TIMESTAMP_ANNOTATION: "amd.com/cluster-validation-last-run-timestamp" + # === Per-Phase Label Keys (consumed by every phase) === + # Downstream phases MUST read these constants + # rather than hard-coding their own label key. Phases write + # =passed|failed to the node and, on failure, + # an annotation =. + PHASE1_LABEL_KEY: "amd.com/gpu-hw-acceptance" + PHASE2_LABEL_KEY: "amd.com/gpu-mesh-validation" + PHASE3_LABEL_KEY: "amd.com/nic-health" + PHASE4_LABEL_KEY: "amd.com/rail-bandwidth" + PHASE5_LABEL_KEY: "amd.com/cluster-validation-status" + PHASE_FAILURE_REASON_ANNOTATION_SUFFIX: "-failure-reason" + + # === Phase 5 Tunables === + # PHASE5_MIN_WORKERS: minimum surviving-node + # count required for Phase 5 to actually submit an MPIJob. Default + # "2" because the RCCL collectives Phase 5 runs (all_reduce_perf, + # broadcast_perf, reduce_scatter_perf) are multi-node by definition + # a single-worker MPIJob is degenerate and produces no meaningful + # bandwidth signal. Operators can override to "1" to opt into the + # degenerate single-node MPIJob (e.g., for plumbing-only validation + # on a single-node testbed). See design doc sections 4 and 6 + # ("Error Handling" -> "Below PHASE5_MIN_WORKERS"). + PHASE5_MIN_WORKERS: "2" + + # === Node Metadata Helper Library === + # Pure shell functions sourced by the CronJob orchestrator and by every + # per-phase script. Provides the uniform label/annotate/filter primitives + # described design §4 -> "Code Path: Node metadata helpers". + # + # Contract (see design doc design §5): + # * Downstream phases MUST NOT call `kubectl label` / `kubectl annotate` + # on phase labels directly -- they MUST use these helpers. + # * Label values are exactly "passed" or "failed". + # * On failure, the failure-reason annotation is written using + # PHASE_FAILURE_REASON_ANNOTATION_SUFFIX. + # * `filter_passed_nodes` is the gate primitive between phases: + # a node that is missing the label, or whose label value is not + # "passed", is dropped (conservative -- design §6). + # + # Functions: + # label_phase_passed + # label_phase_failed + # annotate_phase_value + # filter_passed_nodes # stdout + # + # Diagnostics go to stderr so that `filter_passed_nodes` stdout stays + # clean for piping (`for n in $(filter_passed_nodes .)`). + # + # Idempotency: all kubectl writes use --overwrite so re-runs of the + # same phase on the same node converge to the latest label/annotation. + PHASE_NODE_LABEL_SCRIPT: | + #!/bin/bash + # Node metadata helper library. Source this file -- do not exec. + # All helpers return 0 on success and non-zero on failure. A failed + # helper call is treated by `filter_passed_nodes` as "not passed", + # i.e. the node is conservatively dropped from subsequent phases. + + # Maximum length of a single annotation value. Kubernetes caps the + # total annotation set on an object at 256 KiB; cap any individual + # reason/value at 200_000 bytes so a handful of large annotations + # cannot push the object over the limit. Truncation is logged. + : "${PHASE_ANNOTATION_VALUE_MAX_BYTES:=200000}" + + # _phase_log + # Internal logger. Sends a tagged line to stderr. + _phase_log() { + echo "[phase-helpers] $*" >&2 + } + + # _phase_require_args + # Returns 0 if expected == actual, else logs and returns 2. + _phase_require_args() { + local fn_name="$1" + local expected="$2" + local actual="$3" + if [[ "$actual" -ne "$expected" ]]; then + _phase_log "ERROR: ${fn_name} expects ${expected} arg(s), got ${actual}" + return 2 + fi + return 0 + } + + # _phase_require_nonempty + # Returns 0 if value is non-empty, else logs and returns 2. + _phase_require_nonempty() { + local fn_name="$1" + local arg_name="$2" + local value="$3" + if [[ -z "$value" ]]; then + _phase_log "ERROR: ${fn_name}: argument '${arg_name}' is empty" + return 2 + fi + return 0 + } + + # _phase_truncate_value + # Emits value on stdout, truncated to PHASE_ANNOTATION_VALUE_MAX_BYTES. + # Logs to stderr when truncation occurs. + _phase_truncate_value() { + local value="$1" + local max="${PHASE_ANNOTATION_VALUE_MAX_BYTES}" + local len="${#value}" + if [[ "$len" -gt "$max" ]]; then + _phase_log "WARN: annotation value of ${len} bytes truncated to ${max} bytes" + printf '%s' "${value:0:$max}" + else + printf '%s' "$value" + fi + } + + # label_phase_passed + # Sets =passed on . + label_phase_passed() { + _phase_require_args "label_phase_passed" 2 "$#" || return 2 + local node="$1" + local key="$2" + _phase_require_nonempty "label_phase_passed" "node" "$node" || return 2 + _phase_require_nonempty "label_phase_passed" "phase_label_key" "$key" || return 2 + if ! kubectl label node "$node" "${key}=passed" --overwrite >/dev/null; then + _phase_log "ERROR: kubectl label ${node} ${key}=passed failed" + return 1 + fi + _phase_log "node=${node} ${key}=passed" + return 0 + } + + # label_phase_failed + # Sets =failed on AND writes + # =. + # If labeling fails, the annotation write is still attempted so that + # the failure reason is captured wherever possible. + label_phase_failed() { + _phase_require_args "label_phase_failed" 3 "$#" || return 2 + local node="$1" + local key="$2" + local reason="$3" + _phase_require_nonempty "label_phase_failed" "node" "$node" || return 2 + _phase_require_nonempty "label_phase_failed" "phase_label_key" "$key" || return 2 + # reason is allowed to be empty -- a missing reason is still a + # legitimate failure signal; we just don't write the annotation. + + local rc=0 + if ! kubectl label node "$node" "${key}=failed" --overwrite >/dev/null; then + _phase_log "ERROR: kubectl label ${node} ${key}=failed failed" + rc=1 + fi + + if [[ -n "$reason" ]]; then + local annotation_key="${key}${PHASE_FAILURE_REASON_ANNOTATION_SUFFIX}" + local safe_reason + safe_reason=$(_phase_truncate_value "$reason") + if ! kubectl annotate node "$node" "${annotation_key}=${safe_reason}" --overwrite >/dev/null; then + _phase_log "ERROR: kubectl annotate ${node} ${annotation_key}=... failed" + rc=1 + fi + fi + + _phase_log "node=${node} ${key}=failed reason=${reason:-}" + return "$rc" + } + + # annotate_phase_value + # Writes a phase-scoped annotation of the form + # -= + # to . Intended for phase scripts that want to publish + # structured per-phase data (e.g. measured bandwidth, test version). + annotate_phase_value() { + _phase_require_args "annotate_phase_value" 4 "$#" || return 2 + local node="$1" + local phase_key="$2" + local sub_key="$3" + local value="$4" + _phase_require_nonempty "annotate_phase_value" "node" "$node" || return 2 + _phase_require_nonempty "annotate_phase_value" "phase_label_key" "$phase_key" || return 2 + _phase_require_nonempty "annotate_phase_value" "key" "$sub_key" || return 2 + # value may be empty -- caller controls semantics. + + local annotation_key="${phase_key}-${sub_key}" + local safe_value + safe_value=$(_phase_truncate_value "$value") + if ! kubectl annotate node "$node" "${annotation_key}=${safe_value}" --overwrite >/dev/null; then + _phase_log "ERROR: kubectl annotate ${node} ${annotation_key}=... failed" + return 1 + fi + return 0 + } + + # filter_passed_nodes + # Emits, on stdout, the subset of whose current value of + # is exactly "passed". A node that lacks the + # label, has a different value, or whose label cannot be read is + # silently dropped (conservative -- see design §6). + # Output is space-separated, single line, terminated by newline if + # any nodes match; empty input or no matches produces no output. + filter_passed_nodes() { + _phase_require_args "filter_passed_nodes" 2 "$#" || return 2 + local nodes="$1" + local key="$2" + _phase_require_nonempty "filter_passed_nodes" "phase_label_key" "$key" || return 2 + # nodes may be empty -- caller may legitimately pass an empty list. + + # Escape dots in the label key for jsonpath traversal of + # `.metadata.labels.`. kubectl jsonpath uses `.` as the + # field separator, so dots inside the label key must be + # backslash-escaped (matches the existing TIMESTAMP_ANNOTATION + # handling in CRONJOB_CANDIDATE_NODES_SELECTION_SCRIPT). + local escaped_key + escaped_key=$(echo "$key" | sed 's/\./\\./g') + + local out="" + local node value + for node in $nodes; do + value=$(kubectl get node "$node" \ + -o jsonpath="{.metadata.labels.${escaped_key}}" 2>/dev/null || echo "") + if [[ "$value" == "passed" ]]; then + if [[ -z "$out" ]]; then + out="$node" + else + out="$out $node" + fi + fi + done + if [[ -n "$out" ]]; then + echo "$out" + fi + return 0 + } + # === RCCL Tests Definitions === TESTS_JSON: | { @@ -38,8 +283,11 @@ data: ] } - RCCL_WORKLOAD_IMAGE: "docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_rccl-7.0.2_anp-v1.2.0_ainic-1.117.1-a-63" - MPIJOB_WAIT_TIME: "240" + # Single workload image for Phase 2 (intra-node RCCL), + # Phase 4 (ib_write_bw server/client), and Phase 5 (multi-node RCCL). + # patchable: cluster-validation-framework.images.roce-workload + ROCE_WORKLOAD_IMAGE: "docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_rccl-7.0.2_anp-v1.2.0_ainic-1.117.1-a-63" + MPIJOB_WAIT_TIME: "240" # Phase 5: wallclock seconds for MPIJob completion. patchable: cluster-validation-framework.timeouts.phase5-mpijob-wait-secs DEBUG_DELAY: "20" WAIT_FOR_WORKERS: "true" ENABLE_SSH_CHECK: "true" @@ -47,20 +295,100 @@ data: SSH_CHECK_INTERVAL: "4" SSH_CHECK_TIMEOUT: "60" + # Phase 5 multi-node RCCL env vars. Profile follows the AMD Pollara + # 400 RCCL ops-guide (PCIe relaxed ordering, TC=96, GID index 1, + # ignore CPU affinity, GDR flush). Non-default deviations: + # + # NCCL_NET_PLUGIN=none - bypass ANP, use stock NCCL_NET=IB + # (same ibverbs path as ib_write_bw) + # NCCL_IB_QPS_PER_CONNECTION=1 - one QP per peer per channel + # NCCL_IB_SPLIT_DATA_ON_QPS=0 + # NCCL_IB_USE_INLINE=0 - SGE list path (more battle-tested) + # NCCL_MIN/MAX_NCHANNELS=4 - lower QP fan-out per rank + # NCCL_IB_HCA=ionic_0.ionic_7 - explicit device list + # NCCL_IB_TIMEOUT=22, RETRY_CNT=12 - wider IB layer retry budget + # + # Diagnostics: NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS=INIT,NET, + # NCCL_TOPO_DUMP_FILE=/tmp/topo_all.txt. RCCL_ENV_VARS: | #!/bin/bash + # --- transport selection (the one knob that mattered most) ----- + RCCL_ENV="$RCCL_ENV -x NCCL_NET_PLUGIN=none" + RCCL_ENV="$RCCL_ENV -x NCCL_IB_HCA=ionic_0,ionic_1,ionic_2,ionic_3,ionic_4,ionic_5,ionic_6,ionic_7" + # --- QP / send-path shape (avoid multi-QP doorbell races) ------ + RCCL_ENV="$RCCL_ENV -x NCCL_IB_QPS_PER_CONNECTION=1" + RCCL_ENV="$RCCL_ENV -x NCCL_IB_SPLIT_DATA_ON_QPS=0" + RCCL_ENV="$RCCL_ENV -x NCCL_IB_USE_INLINE=0" + RCCL_ENV="$RCCL_ENV -x NCCL_MIN_NCHANNELS=4" + RCCL_ENV="$RCCL_ENV -x NCCL_MAX_NCHANNELS=4" + # --- RoCEv2 GID + PCIe + traffic class (AMD Pollara 400 UG) ---- RCCL_ENV="$RCCL_ENV -x NCCL_IB_GID_INDEX=1" + RCCL_ENV="$RCCL_ENV -x NCCL_IB_PCI_RELAXED_ORDERING=1" + RCCL_ENV="$RCCL_ENV -x NCCL_IB_TC=96" + RCCL_ENV="$RCCL_ENV -x NCCL_IB_TIMEOUT=22" + RCCL_ENV="$RCCL_ENV -x NCCL_IB_RETRY_CNT=12" + # --- recv / GDR / lockfree (FW + ROCm interplay) --------------- RCCL_ENV="$RCCL_ENV -x NCCL_NET_OPTIONAL_RECV_COMPLETION=0" RCCL_ENV="$RCCL_ENV -x NCCL_GDR_FLUSH_DISABLE=1" RCCL_ENV="$RCCL_ENV -x RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0" - RCCL_ENV="$RCCL_ENV -x NCCL_IB_USE_INLINE=1" RCCL_ENV="$RCCL_ENV -x IONIC_LOCKFREE=all" - RCCL_ENV="$RCCL_ENV -x NCCL_NET_PLUGIN=librccl-anp.so" - RCCL_ENV="$RCCL_ENV -x NCCL_TOPO_DUMP_FILE=/tmp/topo_all.txt" - RCCL_ENV="$RCCL_ENV -x NCCL_DEBUG=INFO" RCCL_ENV="$RCCL_ENV -x NCCL_DMABUF_ENABLE=0" + # --- scheduling + diagnostics ---------------------------------- + RCCL_ENV="$RCCL_ENV -x NCCL_IGNORE_CPU_AFFINITY=1" + RCCL_ENV="$RCCL_ENV -x NCCL_DEBUG=INFO" + RCCL_ENV="$RCCL_ENV -x NCCL_DEBUG_SUBSYS=INIT,NET" + RCCL_ENV="$RCCL_ENV -x NCCL_TOPO_DUMP_FILE=/tmp/topo_all.txt" echo "$RCCL_ENV" > /shared/rccl_env.txt + # === Phase 2 (Intra-Node GPU Mesh Validation) env vars === + # Consumed by the per-node Phase 2 RCCL Job (cluster-validation-job.yaml, + # `cluster-validation-phase2-job-config` ConfigMap) and by + # `PHASE2_SCRIPT`. See design doc design doc §4 -> + # "Code Path: Phase 2 ConfigMap variables". + # + # The PHASE2_START_MSG_SIZE / PHASE2_END_MSG_SIZE / PHASE2_STEP_FACTOR / + # PHASE2_ITER_COUNT / PHASE2_WARMUP_ITER_COUNT block are mpirun args for + # `all_reduce_perf`; they mirror the Phase 5 MPI_LAUNCHER_ENV_VARS + # defaults (1K.2G, step 2, 6 iters, 20 warmup) so a single tuning + # change can be applied symmetrically when needed. + PHASE2_START_MSG_SIZE: "1K" + PHASE2_END_MSG_SIZE: "2G" + PHASE2_STEP_FACTOR: "2" + PHASE2_ITER_COUNT: "6" + PHASE2_WARMUP_ITER_COUNT: "20" + # Default 100 GB/s -- intra-node Avg bus bandwidth baseline. MI300X + # with full xGMI fabric measures ~250-300 GB/s; MI308X (PCIe-only, + # no xGMI) measures ~105-110 GB/s. 100 is the lower-common-denominator + # threshold; raise per-deployment for fabric-rich SKUs. Validate + # against measured baseline before rollout to a new SKU (see design doc + # design §8). Tunable via ConfigMap edit -- no code change needed. + PHASE2_BW_THRESHOLD: "100" + # Per-node Job wallclock budget. all_reduce_perf with the defaults above + # completes in well under a minute on healthy MI300 xGMI; 600s leaves + # headroom for pod scheduling, image pull, and a slow-but-passing run. + # PHASE2_SCRIPT counts a node as `timeout` failure past this budget. + # patchable: cluster-validation-framework.timeouts.phase2-job-wait-secs + PHASE2_JOB_WAIT_TIME: "600" + # Stripped variant of RCCL_ENV_VARS for the single-node Phase 2 path: + # drops NCCL_NET_PLUGIN (no librccl-anp -- collectives stay on xGMI) + # and every NCCL_IB_* / IONIC_* / GDR-fabric tunable (no NIC / no + # InfiniBand). Keeps NCCL_DEBUG=INFO for diagnostics and + # NCCL_DMABUF_ENABLE=0 to match the Phase 5 path. Consumed by the + # Phase 2 Job container via `source <(echo "$PHASE2_RCCL_ENV_VARS")` + # (see cluster-validation-job.yaml `cluster-validation-phase2-job-config`). + PHASE2_RCCL_ENV_VARS: | + #!/bin/bash + export NCCL_DEBUG=INFO + export NCCL_DMABUF_ENABLE=0 + # rccl-tests build dir (all_reduce_perf binary). Same path as + # MPI_LAUNCHER_ENV_VARS uses for Phase 5; duplicated here because + # the Phase 2 Job container only sources PHASE2_RCCL_ENV_VARS + # (not MPI_LAUNCHER_ENV_VARS), and the Job script runs under + # `set -u` so an unset PERF_TEST_DIR aborts with exit 127. + export PERF_TEST_DIR=/root/rccl-tests/build + # No NCCL_NET_PLUGIN -- single node, intra-node xGMI only. + # No NCCL_IB_* / IONIC_* -- no fabric / no NICs involved. + # MPIJob documentation: https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/mpi/ MPI_LAUNCHER_ENV_VARS: | export MPIRUN_SSH_PORT=22 @@ -75,41 +403,113 @@ data: export WARMUP_ITER_COUNT="20" export CHECK_ITER_COUNT="0" + # === Per-Phase Skip Flags === + # One flag per phase. Patched in from config.json["cluster-validation-framework"]["skip-tests"] + # by gpu-cluster.sh after this YAML is applied. + SKIP_GPU_HW_ACCEPTANCE: "false" # Phase 1: GPU hardware acceptance (test-runner / RVS / AGFHC). patchable: cluster-validation-framework.skip-tests.skip-phase1-gpu-hw-acceptance + SKIP_GPU_MESH_VALIDATION: "false" # Phase 2: Intra-node GPU collective / mesh validation. patchable: cluster-validation-framework.skip-tests.skip-phase2-gpu-mesh-validation + SKIP_NIC_VALIDATION: "false" # Phase 3: Per-node NIC health (requires amd-nic=true). patchable: cluster-validation-framework.skip-tests.skip-phase3-nic-validation + SKIP_RAIL_BANDWIDTH_TEST: "false" # Phase 4: Pairwise rail bandwidth. patchable: cluster-validation-framework.skip-tests.skip-phase4-rail-bandwidth-test + SKIP_RCCL_TEST: "false" # Phase 5: Multi-node RCCL via MPIJob. patchable: cluster-validation-framework.skip-tests.skip-phase5-rccl-test + # === GPU Validation Tests Definitions === - # RVS: ROCm Validation Suite. For a full list of supported recipes and arguments, refer to https://instinct.docs.amd.com/projects/gpu-operator/en/latest/test/appendix-test-recipe.html - # AGFHC: AMD GPU Field Health Check. For a full list of supported recipes and arguments, refer to https://instinct.docs.amd.com/projects/gpu-operator/en/latest/test/agfhc.html - # Refer to the above links for other available test frameworks and recipes, and configure the wait time accordingly. - SKIP_GPU_VALIDATION: "__SKIP_GPU_VALIDATION__" # Set to "true" to skip GPU validation tests, directly start the RCCL tests - SKIP_RCCL_TEST: "__SKIP_RCCL_TEST__" # Set to "true" to skip MPI Job RCCL tests - TEST_RUNNER_JOB_WAIT_TIME: "1200" - TEST_RUNNER_SUCCESS_LABEL: "amd.com/gpu-validation-test=passed" - TEST_RUNNER_FAILURE_LABEL: "amd.com/gpu-validation-test=failed" - TEST_RUNNER_IMAGE: "docker.io/rocm/test-runner:v1.4.0" - GPU_VALIDATION_TESTS_JSON: | - { - "TestConfig": { - "GPU_HEALTH_CHECK": { - "TestLocationTrigger": { - "global": { - "TestParameters": { - "MANUAL": { - "TestCases": [ - { - "Framework": "RVS", - "Recipe": "gst_single", - "Iterations": 1, - "StopOnFailure": true, - "TimeoutSeconds": 1200, - "Arguments": "--parallel" - } - ] - } - } - } - } - } + # RVS: ROCm Validation Suite. Recipe + argument reference + # https://instinct.docs.amd.com/projects/gpu-operator/en/latest/test/appendix-test-recipe.html + # AGFHC: AMD GPU Field Health Check. Recipe reference + # https://instinct.docs.amd.com/projects/gpu-operator/en/latest/test/agfhc.html + # + # PHASE 1 MULTI-STAGE MODEL: + # The ROCm test-runner CLI executes only the FIRST entry of a + # TestCases[] array even when N entries are provided -- so the + # previous single-Job-per-node design (one GPU_VALIDATION_TESTS_JSON + # with multiple recipes) silently dropped every recipe past the first. + # Phase 1 now runs each recipe as its OWN sub-stage: one test-runner + # Job per (stage, node), sequenced sequentially per-node with + # stop-on-first-failure semantics. Cross-node parallelism is preserved + # within each stage (submit-all-then-wait). + # + # Each stage carries its own `Image` because the RVS and AGFHC release + # cadences differ and may ship as separate test-runner images + # (e.g. rocm/test-runner-rvs:vX vs rocm/test-runner-agfhc:vY). + # + # Per-stage outcome is recorded as a node annotation: + # amd.com/gpu-hw-acceptance-stage-=passed|failed + # The aggregate label `amd.com/gpu-hw-acceptance=passed|failed` + # (= AND across all stages) remains the dependency contract Phase 2 + # reads -- unchanged. On a failed node, the failing stage's recipe + # is additionally written to the `failed-subtest` annotation, and + # the failure-reason annotation captures stage-name:reason. + # + # Required per-stage fields: + # Name -- orchestrator identifier (used in CM/Job names + # and annotation suffixes); must be a valid + # k8s name fragment (lowercase alphanum + '-'). + # Image -- test-runner container image for this stage. + # Framework -- RVS | AGFHC | . (passed verbatim to runner). + # Recipe -- recipe name within Framework. + # TimeoutSeconds -- per-stage per-node wallclock budget. The Job + # wait loop uses TimeoutSeconds + 120s of slack + # to absorb pod startup / image pull / result + # upload, so this number should match the actual + # in-container test budget. + # Optional per-stage fields: + # Iterations -- runner iterations (defaults to 1 if omitted). + # Arguments -- extra runner CLI args (defaults to ""). + # + # Phase 1 writes the uniform per-phase label PHASE1_LABEL_KEY + # (= "amd.com/gpu-hw-acceptance") via the helper library (see + # PHASE_NODE_LABEL_SCRIPT above and PHASE1_SCRIPT below). + # + # Optional per-stage "Skip": true bypasses that stage without + # submitting a test-runner Job. Skipped stages annotate every + # still-alive node with -stage-=skipped and + # do NOT count toward stop-on-first-failure. If every stage is + # skipped, the aggregate label is set to + # "skipped" (not "passed"); Phase 2 currently gates on "passed" so + # an all-skipped Phase 1 will block downstream phases by design. + # + # patchable: per-stage Skip set from cluster-validation-framework.skip-tests.skip-phase1-stages.skip-phase1- + # patchable: per-stage TimeoutSeconds set from cluster-validation-framework.timeouts.phase1-stages-secs.phase1- + # patchable: per-stage Image set from cluster-validation-framework.images.test-runner. (rvs/agfhc) + GPU_VALIDATION_STAGES_JSON: | + [ + { + "Name": "gpu-stress", + "Image": "docker.io/rocm/test-runner:v1.4.0", + "Framework": "RVS", + "Recipe": "gst_single", + "Iterations": 1, + "TimeoutSeconds": 1800, + "Arguments": "--parallel" + }, + { + "Name": "xgmi-lvl1", + "Image": "docker.io/rocm/test-runner:v1.4.0", + "Framework": "AGFHC", + "Recipe": "xgmi_lvl1", + "Iterations": 1, + "TimeoutSeconds": 300, + "Arguments": "" + }, + { + "Name": "pcie-lvl1", + "Image": "docker.io/rocm/test-runner:v1.4.0", + "Framework": "AGFHC", + "Recipe": "pcie_lvl1", + "Iterations": 1, + "TimeoutSeconds": 300, + "Arguments": "" + }, + { + "Name": "hbm-lvl1", + "Image": "docker.io/rocm/test-runner:v1.4.0", + "Framework": "AGFHC", + "Recipe": "hbm_lvl1", + "Iterations": 1, + "TimeoutSeconds": 300, + "Arguments": "" } - } + ] # === select-and-label-candidates.sh === @@ -124,6 +524,24 @@ data: echo "[Node Selection] Using node selector: ${NODE_SELECTOR_LABEL}" nodes=$(kubectl get nodes -l "${NODE_SELECTOR_LABEL}" -o name | sed 's|node/||') + # Distinguish "selector matches no nodes at all" from "matched nodes + # but none are eligible (busy / recently tested)". A custom selector + # (e.g. cvf-candidate=true) that nothing applies automatically -- only + # the NFD labels feature.node.kubernetes.io/amd-{gpu,nic} are set for + # you -- silently matches zero nodes, which otherwise surfaces only as + # a generic "Insufficient candidates" later. Warn explicitly and show + # which nodes DO carry each requested label so the operator can see the + # gap and apply the label (kubectl label node =). + if [ -z "$nodes" ]; then + echo "[Node Selection: WARNING] Selector '${NODE_SELECTOR_LABEL}' matched 0 nodes in the cluster." + echo "[Node Selection] Nothing applies custom labels automatically; only NFD sets" + echo "[Node Selection] feature.node.kubernetes.io/amd-gpu and .../amd-nic." + echo "[Node Selection] If you configured a custom node-selector-labels value, label the" + echo "[Node Selection] target nodes, e.g.: kubectl label node ${NODE_SELECTOR_LABEL%%=*}=${NODE_SELECTOR_LABEL#*=}" + echo "[Node Selection] Current node labels for reference:" + kubectl get nodes --show-labels 2>/dev/null | sed 's/^/[Node Selection] /' || true + fi + candidates=() candidate_count=0 current_time_secs=$(date +%s) @@ -181,6 +599,11 @@ data: # --- Action Decision --- if [ $candidate_count -lt ${WORKER_REPLICAS} ]; then echo "[Node Selection: Skipped] Insufficient candidates (required $WORKER_REPLICAS, available $candidate_count)" + if [ -z "$nodes" ]; then + echo "[Node Selection: Skipped] Cause: selector '${NODE_SELECTOR_LABEL}' matched 0 nodes (see WARNING above) -- label your target nodes." + else + echo "[Node Selection: Skipped] Cause: selector matched $(echo "$nodes" | wc -w) node(s) but none were idle/eligible this tick (busy=${#busy_nodes[@]}, recently-tested=${#skipped_recent_nodes[@]})." + fi echo "==================================================================" exit 1 else @@ -198,7 +621,7 @@ data: fi # === wait-for-worker-pods.sh === - WAIT_FOR_WORKERS_SCRIPT: | + PHASE45_PREFLIGHT_SCRIPT: | #!/bin/bash set -euo pipefail @@ -223,6 +646,13 @@ data: FIRST_POD=$(echo $WORKER_PODS | awk '{print $1}') WORKER_IPS=$(kubectl get pods -n "$NAMESPACE" -l "$JOB_LABELS" -o jsonpath='{.items[*].status.podIP}') + # ---------------------------------------------------------------- + # Phase 1: launcher->worker per-IP readiness wait (existing + # behavior). Each worker IP must accept an SSH connection from + # FIRST_POD within $SSH_CHECK_TIMEOUT, retried every + # $SSH_CHECK_INTERVAL seconds. This catches the + # workers-not-yet-up case before we begin the N*N matrix probe. + # ---------------------------------------------------------------- kubectl exec -n "$NAMESPACE" "$FIRST_POD" -- bash -c " for ip in $WORKER_IPS; do echo '__ Testing SSH to' \$ip '...' @@ -241,7 +671,348 @@ data: echo '__ SSH OK for' \$ip done " - echo "__ SSH connectivity verified across all worker pods." + echo "__ launcher->worker SSH readiness verified." + + # ---------------------------------------------------------------- + # Phase 2 (, design §4): N*N SSH mesh probe. + # For every (src_pod, dst_ip) pair, exec into src_pod and SSH + # to dst_ip with a fast ConnectTimeout=5 single-shot probe. + # Record each failure in failed_pairs and CONTINUE -- we want + # a complete picture of the fabric, not a fail-fast trip on + # the first broken pair (design §4 calls this out explicitly). + # + # Note: the surrounding script runs under `set -euo pipefail`, + # so each probe is guarded with `|| failed_pairs+=(.)` to + # prevent a single SSH failure from aborting the script. + # ---------------------------------------------------------------- + echo "--- Phase 4.5: N*N SSH mesh check ---" + echo " src pods: $WORKER_PODS" + echo " dst IPs: $WORKER_IPS" + # ssh_mesh_failed is initialised here (alongside failed_pairs) + # so the verdict block can read it under `set -u` + # even if the mesh loop records zero failures. + ssh_mesh_failed=false + failed_pairs=() + for src in $WORKER_PODS; do + for dst_ip in $WORKER_IPS; do + if kubectl exec -n "$NAMESPACE" "$src" -- \ + ssh -o StrictHostKeyChecking=no \ + -o UserKnownHostsFile=/dev/null \ + -o ConnectTimeout=5 \ + "$dst_ip" 'echo ok' >/dev/null 2>&1; then + echo "__ mesh OK: $src -> $dst_ip" + else + echo "__ mesh FAIL: $src -> $dst_ip" + failed_pairs+=("$src->$dst_ip") + fi + done + done + + # Per (design §4): record the failure flag + # and CONTINUE to subsequent checks so the verdict block can + # aggregate all failure classes into a single annotation. + # Previously this block exit'd 1 immediately, which prevented + # DNS / MPI / RCCL probes from running on SSH-mesh-failed + # fabrics and produced thinner annotations. + if [ "${#failed_pairs[@]}" -gt 0 ]; then + ssh_mesh_failed=true + echo "WARN: N*N SSH mesh check failed for ${#failed_pairs[@]} pair(s) (ssh_mesh_failed=true):" + for pair in "${failed_pairs[@]}"; do + echo " $pair" + done + echo "WARN: continuing to subsequent checks (verdict gating below will aggregate failures)." + else + echo "__ N*N SSH mesh verified across all worker pods." + fi + + # ---------------------------------------------------------------- + # Phase 3 (, design §4): DNS forward+reverse + # check. From the FIRST worker pod, resolve every worker node + # hostname forward (getent hosts ) then reverse-resolve + # the returned address. A miss on either direction is recorded + # in dns_misses. this check MUST NOT + # short-circuit -- it iterates every hostname and continues to + # subsequent checks regardless of misses. Overall verdict + # gating (annotate-and-evict) is handled elsewhere. + # + # WORKER_NODES is derived from .spec.nodeName -- one entry per + # worker pod, possibly with duplicates if multiple workers + # land on the same node. We pass the deduplicated, whitespace- + # separated list into the in-pod loop. The surrounding script + # runs under `set -euo pipefail`, so each getent call is + # guarded with `|| echo MISS` to keep the loop alive on a + # failed resolution. + # ---------------------------------------------------------------- + echo "--- Phase 4.5: DNS forward+reverse check ---" + WORKER_NODES=$(kubectl get pods -n "$NAMESPACE" -l "$JOB_LABELS" \ + -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \ + | awk 'NF' | sort -u | tr '\n' ' ') + echo " nodes: $WORKER_NODES" + echo " via pod: $FIRST_POD" + + dns_misses=() + while IFS= read -r line; do + [ -z "$line" ] && continue + dns_misses+=("$line") + done < <( + kubectl exec -n "$NAMESPACE" "$FIRST_POD" -- bash -c ' + for h in '"$WORKER_NODES"'; do + fwd=$(getent hosts "$h" 2>/dev/null || echo MISS) + if [ "$fwd" = MISS ]; then + echo "DNS:$h:fwd=MISS rev=SKIP" + continue + fi + addr=$(echo "$fwd" | awk "{print \$1}") + rev=$(getent hosts "$addr" 2>/dev/null || echo MISS) + if [ "$rev" = MISS ]; then + echo "DNS:$h:fwd=$addr rev=MISS" + fi + done + ' || true + ) + + # dns_failed is the boolean view of dns_misses, exported for + # the verdict block so it can read a single flag + # under `set -u` regardless of which sub-check fired. + dns_failed=false + if [ "${#dns_misses[@]}" -gt 0 ]; then + dns_failed=true + echo "WARN: DNS check recorded ${#dns_misses[@]} miss(es) (dns_failed=true):" + for miss in "${dns_misses[@]}"; do + echo " $miss" + done + echo "WARN: continuing to subsequent checks (verdict gating below will aggregate failures)." + else + echo "__ DNS forward+reverse OK for all worker nodes." + fi + + # ---------------------------------------------------------------- + # Phase 4 (, design §4): mpirun --hostfile + # no-op spawn check. From the FIRST worker pod, write the worker + # pod IPs (one per line) to /tmp/hf and invoke + # `mpirun --hostfile /tmp/hf --np --allow-run-as-root true`. + # A non-zero exit from kubectl exec sets mpi_spawn_failed=true. + # This catches OMPI launcher misconfiguration (missing orted, + # broken PATH, btl/oob misconfig) before the heavier RCCL probe. + # + # this check MUST NOT short-circuit -- it + # records the failure flag, logs WARN, and continues. Overall + # verdict gating (annotate-and-evict) is owned by a separate + # change. The surrounding script runs under `set -euo pipefail`, + # so the kubectl exec is guarded with `|| mpi_spawn_failed=true` + # to keep the script alive on spawn failure. mpi_spawn_failed is + # initialised to false so later annotate logic can read it under + # `set -u`. + # ---------------------------------------------------------------- + echo "--- Phase 4.5: mpirun --hostfile no-op spawn check ---" + echo " via pod: $FIRST_POD" + echo " np: $(echo $WORKER_IPS | wc -w) (one rank per worker IP)" + + mpi_spawn_failed=false + # --prefix is REQUIRED for cross-node spawn: SSH non-interactive + # shells don't source .bashrc, so the OMPI bin/lib dirs are not + # on the remote PATH. Without --prefix, orted is "command not + # found" on the remote and mpirun aborts. OMPI_DIR comes from + # MPI_LAUNCHER_ENV_VARS; the fallback matches the image's MPI + # install recipe. + kubectl exec -n "$NAMESPACE" "$FIRST_POD" -- bash -c ' + : "${OMPI_DIR:=/root/ompi/install}" + echo "'"$WORKER_IPS"'" | tr " " "\n" | awk "NF" > /tmp/hf + echo "hostfile contents:" + cat /tmp/hf + mpirun --prefix "$OMPI_DIR" \ + --hostfile /tmp/hf \ + --np $(wc -l < /tmp/hf) \ + --allow-run-as-root \ + true + ' || mpi_spawn_failed=true + + if [ "$mpi_spawn_failed" = "true" ]; then + echo "WARN: mpirun --hostfile no-op spawn failed (mpi_spawn_failed=true)." + echo "WARN: continuing to subsequent checks (verdict gating handled elsewhere)." + else + echo "__ mpirun --hostfile no-op spawn OK across all worker IPs." + fi + + # ---------------------------------------------------------------- + # Phase 5 (, design §4 and §6): minimal RCCL + # topology probe. From the FIRST worker pod, run mpirun with one + # rank per GPU (np = num_worker_ips * 8) against + # $PERF_TEST_DIR/all_reduce_perf with NCCL_DEBUG=INFO and + # NCCL_DEBUG_SUBSYS=INIT. Filter the combined stdout/stderr for + # the first 50 "NCCL INFO comm|topology" lines so the launcher + # log captures RCCL's topology view without flooding the + # operator log with full debug spew. + # + # The whole kubectl exec is wrapped in `timeout 60` (design §6: + # first-run topology discovery can be slow while NCCL warms + # caches). Two failure flags are tracked: + # * rccl_topo_timeout : timeout 60 fired (exit code 124) + # -- soft-fail per design §6 (annotate + # and proceed; the real test is Phase 5) + # * rccl_topo_failed : any non-timeout non-zero exit + # -- still WARN-and-continue here; + # verdict gating (annotate-and-evict) + # is handled elsewhere. + # + # this check MUST NOT short-circuit. It + # records the failure flags, logs WARN, and continues. The + # surrounding script runs under `set -euo pipefail`, so the + # kubectl exec is guarded with an `if !` block to keep the + # script alive on probe failure. Both flags are initialised + # to false so later annotate logic can read them under + # `set -u`. + # ---------------------------------------------------------------- + echo "--- Phase 4.5: RCCL topology probe ---" + echo " via pod: $FIRST_POD" + echo " np: $(($(echo $WORKER_IPS | wc -w) * 8)) (8 ranks per worker IP)" + echo " timeout: 60s (soft-fail on timeout per design §6)" + + rccl_topo_failed=false + rccl_topo_timeout=false + rccl_exit=0 + timeout 60 kubectl exec -n "$NAMESPACE" "$FIRST_POD" -- bash -c ' + : "${OMPI_DIR:=/root/ompi/install}" + echo "'"$WORKER_IPS"'" | tr " " "\n" | awk "NF" > /tmp/hf + echo "hostfile contents:" + cat /tmp/hf + # --prefix: same SSH/PATH reason as the no-op spawn above. + mpirun --prefix "$OMPI_DIR" \ + --hostfile /tmp/hf \ + --np $(($(wc -l < /tmp/hf) * 8)) \ + --allow-run-as-root \ + -x NCCL_DEBUG=INFO \ + -x NCCL_DEBUG_SUBSYS=INIT \ + $PERF_TEST_DIR/all_reduce_perf -b 1K -e 1K -n 1 2>&1 \ + | grep -E "NCCL INFO comm|topology" \ + | head -50 + ' || rccl_exit=$? + + if [ "$rccl_exit" -eq 124 ]; then + rccl_topo_timeout=true + echo "WARN: RCCL topology probe timed out after 60s (rccl_topo_timeout=true)." + echo "WARN: treating as soft-fail per design §6 (first-run warming can be slow)." + echo "WARN: continuing to verdict gating below (timeout still annotates rccl-topology)." + elif [ "$rccl_exit" -ne 0 ]; then + rccl_topo_failed=true + echo "WARN: RCCL topology probe failed with exit $rccl_exit (rccl_topo_failed=true)." + echo "WARN: continuing to verdict gating below." + else + echo "__ RCCL topology probe OK across all worker IPs." + fi + + # ---------------------------------------------------------------- + # Phase 6 (, design §4 verdict block + §5,§6): + # Aggregate verdict + annotate-on-failure + abort MPIJob. + # + # Two classes of failure are distinguished: + # + # 1. HARD failures -- abort the MPIJob via non-zero exit. + # Annotated AND fail-closed. + # * ssh_mesh_failed -> class "ssh-mesh" + # * dns_failed -> class "dns" + # * mpi_spawn_failed -> class "mpi-spawn" + # * rccl_topo_failed -> class "rccl-topology" (non-timeout + # non-zero exit from the RCCL probe) + # + # 2. SOFT failures -- annotate ONLY, do not abort. Per design §6, + # first-run NCCL topology discovery can be slow while caches + # warm; a 60s timeout is treated as "annotate and proceed + # the real test is Phase 5". Without this carve-out, every + # cold-cache run would fail-loop Phase 4.5 and never reach + # Phase 5. + # * rccl_topo_timeout -> class "rccl-topology" (annotated + # so the operator can still triage + # warm-cache vs. fabric/NCCL issues, + # but does NOT contribute to the + # abort decision) + # + # The annotation set is the union of HARD + SOFT classes so the + # operator sees the full picture. The abort decision is based + # only on the HARD classes. + # + # Annotation key: `amd.com/phase4_5-failure-reason=` + # written with --overwrite to replace stale values from a + # previous failed run on the same node. + # + # Annotated nodes: deduplicated WORKER_NODES (derived earlier + # from .spec.nodeName). + # + # Per design §5: NO new label key is introduced. Phase 4.5 is a + # gate, not a tested capability of an individual node -- per-node + # labels stay frozen from Phase 4 so the audit trail is preserved. + # The annotation is the only durable artefact of a Phase 4.5 + # failure. + # + # Per design §2: a non-zero exit here fails the init-container, + # which fails the launcher pod, which fails the MPIJob, which + # the CronJob orchestrator's Phase 5 wait already treats as a + # failure (existing failure path -- no new orchestrator code + # needed). On the next CronJob tick the annotated nodes are + # re-evaluated. + # + # The annotate call is guarded with a fallback echo so a single + # failing kubectl annotate (e.g. node renamed out from under us) + # does not abort the whole loop -- the script must still reach + # the final FATAL exit so the MPIJob aborts on HARD failure. + # ---------------------------------------------------------------- + echo "--- Phase 4.5: verdict ---" + # Build the failed_classes list. Using `if; then; fi` rather + # than `[ . ] && .` because the surrounding script runs + # under `set -e` and a short-circuit `&&` whose left side + # evaluates false would propagate a non-zero status and abort + # the script before the verdict block could annotate. + # + # hard_failed -> drives the abort exit. + # annotation_classes -> drives the annotation reason string + # (union of HARD + SOFT classes). + hard_failed=false + annotation_classes=() + if [ "$ssh_mesh_failed" = "true" ]; then + annotation_classes+=("ssh-mesh"); hard_failed=true + fi + if [ "$dns_failed" = "true" ]; then + annotation_classes+=("dns"); hard_failed=true + fi + if [ "$mpi_spawn_failed" = "true" ]; then + annotation_classes+=("mpi-spawn"); hard_failed=true + fi + if [ "$rccl_topo_failed" = "true" ]; then + annotation_classes+=("rccl-topology"); hard_failed=true + elif [ "$rccl_topo_timeout" = "true" ]; then + # Soft-fail per design §6: annotate but DO NOT abort. + annotation_classes+=("rccl-topology") + fi + + if [ "${#annotation_classes[@]}" -gt 0 ]; then + # Join classes with comma -- IFS-trick keeps the join self + # contained without spawning a subshell. + old_ifs="$IFS" + IFS=, + reason="${annotation_classes[*]}" + IFS="$old_ifs" + + if [ "$hard_failed" = "true" ]; then + echo "FATAL: Phase 4.5 pre-flight failed: $reason" + else + echo "WARN: Phase 4.5 pre-flight soft-failed: $reason (annotating but not aborting per design §6)" + fi + echo " annotating participating nodes: $WORKER_NODES" + for node in $WORKER_NODES; do + kubectl annotate node "$node" \ + "amd.com/phase4_5-failure-reason=$reason" \ + --overwrite || \ + echo "WARN: failed to annotate node $node with $reason (continuing)." + done + + if [ "$hard_failed" = "true" ]; then + echo "FATAL: aborting MPIJob via non-zero init-container exit (design §2, §5)." + exit 1 + fi + echo "__ Phase 4.5 proceeding past soft-fail (rccl-topology timeout) -- Phase 5 will exercise the real path." + else + echo "__ Phase 4.5 pre-flight passed: ssh-mesh, dns, mpi-spawn, rccl-topology all OK." + fi fi # === validate-single-test.sh === @@ -271,142 +1042,2991 @@ data: exit 1 fi - # === GPU Validation Test Script === - GPU_VALIDATION_TEST_SCRIPT: | + # === Phase 1 Script: per-stage per-node Test Runner orchestration === + # Sourced by the orchestrator's run_phase1 via _run_phase_generic. + # On entry: + # PHASE_NODES = space-separated input node list (also $@) + # PHASE1_LABEL_KEY = "amd.com/gpu-hw-acceptance" (from this ConfigMap) + # SKIP_GPU_HW_ACCEPTANCE = "true" short-circuits to pass-label-all + # GPU_VALIDATION_STAGES_JSON, GPU_PER_WORKER from ConfigMap + # Helpers (already sourced by orchestrator): + # label_phase_passed, label_phase_failed, annotate_phase_value + # Job template available at /test-runner-configs/cluster-validation-test-runner-job-config.yaml + # + # Multi-stage contract ( refactor, supersedes/762/765 single-Job + # design): + # For each stage S in GPU_VALIDATION_STAGES_JSON (sequential, in array order): + # 0. If S.Skip == true: annotate every still-alive node with + # ${PHASE1_LABEL_KEY}-stage-${S.Name}=skipped, do NOT submit a + # Job, do NOT drop the node from the alive set. Skipped stages + # never count as the "first failure" for stop-on-failure. + # 1. For each node still alive (no prior stage failed on it): + # a. Build single-recipe payload from S.{Framework,Recipe,Iterations, + # TimeoutSeconds,Arguments} wrapped in TestParameters.MANUAL.TestCases[0]. + # b. Create per-stage per-node ConfigMap + # cvf-phase1-${S.Name}-${node-hash}-${ts} + # with key GPU_VALIDATION_TESTS_JSON = single-recipe payload. + # c. sed-render Job template, substituting $$NODE, $$GPU_PER_WORKER, + # $$TEST_RUNNER_IMAGE = S.Image, $$PHASE1_CONFIG_MAP = above CM name, + # and the Job's metadata.name. + # d. kubectl apply. + # 2. Poll-wait every Job in this stage (per-stage timeout = S.TimeoutSeconds + # + 120s slack for pod startup / result upload). + # 3. For each Job: locate the test-runner output file and parse it. + # rocm/test-runner:v1.4.0 writes one artifact per iteration to + # /var/log/amd-test-runner/___result(.gz) + # (volume-mounted at /var/log/cluster-validation on the host). + # We glob host-path first, then fall back to `kubectl exec ls` + + # `kubectl cp`; .gz variants are gunzipped to a temp file before + # parsing. + # 4. Per node: write per-stage annotation + # ${PHASE1_LABEL_KEY}-stage-${S.Name}=passed|failed + # via annotate_phase_value. On failure, drop node from the alive set + # and remember (S.Name, failure_reason, failed_subtest). + # 5. Cleanup: kubectl delete the stage's Job + its per-stage ConfigMap + # (best-effort; ttlSecondsAfterFinished + the delete here both apply). + # After all stages: + # 6. For nodes still alive AND that had >=1 non-skip stage submit + # -> label_phase_passed. + # For nodes still alive AND zero non-skip stages submitted + # (every stage carried Skip=true) + # -> aggregate ${PHASE1_LABEL_KEY}=skipped (Phase 2's gating + # check on "passed" then keeps the node out of downstream + # phases by design). + # For nodes dropped earlier + # -> label_phase_failed with reason + # "stage-${first_failed_name}:${reason}" and annotate + # failed-subtest = recorded sub-test from the first failed + # stage. + # + # SKIP_GPU_HW_ACCEPTANCE=true early-exits before stage iteration: every + # input node is pass-labeled and no Jobs / CMs are created. + # + # Missing result.json for a stage -> that stage is treated as failed for the + # affected node with reason "test-runner-did-not-emit-results" and + # failed-subtest = unknown. + # + # All progress output (banners, per-stage progress) goes to stdout. + # Diagnostic-only lines go to stderr so they don't pollute callers. + PHASE1_SCRIPT: | #!/bin/bash - # Step 2: GPU Validation Test Script - # Expected input variables (from parent scope): - # - nodes - # - SKIP_GPU_VALIDATION - # - GPU_PER_WORKER - # - TEST_RUNNER_IMAGE - # - TEST_RUNNER_JOB_WAIT_TIME - # - TEST_RUNNER_SUCCESS_LABEL - # - TEST_RUNNER_FAILURE_LABEL - # - CANDIDATE_LABEL - # - FAILURE_LABEL - # - WORKER_REPLICAS - # - DEBUG_DELAY - # - # Expected variables to be defined in parent scope: - # - job_names (will be populated) - # - job_to_node (will be populated) - # - passed_nodes (will be populated) - # - failed_nodes (will be populated) - - if [[ "${SKIP_GPU_VALIDATION,,}" == "true" ]]; then - echo "SKIP_GPU_VALIDATION is set to true. Skipping test runner jobs." - passed_nodes="$nodes" - failed_nodes="" - else - for node in $nodes; do - ts=$(date +%Y%m%d-%H%M%S) - prefix="cvf-test-runner" - max_len=63 - job_name="${prefix}-${node}-${ts}" - if [ ${#job_name} -gt $max_len ]; then - # If hostname is too long, replace node with node-hash - echo "Truncating long job name.. $job_name" + # Phase 1 -- GPU HW Acceptance per-node Test Runner orchestration. + # Sourced (not exec'd) by the orchestrator. Do NOT `set -e`; we want + # to label every input node even if individual kubectl calls fail. + + # Timestamp prefix for stage-progress log lines. Long-running stages + # (AGFHC xgmi_lvl1 / pcie_lvl1 / hbm_lvl1 can each take minutes) make + # operator triage easier when every banner carries a wallclock time. + _p1_ts() { date '+%Y-%m-%d %H:%M:%S'; } + + # ---- Input handling --------------------------------------------------- + # The orchestrator passes the node list as positional args AND exports + # PHASE_NODES. Prefer positional ($@) so a re-source with different + # args works as expected; fall back to PHASE_NODES if $# is zero. + local_phase1_nodes="" + if [[ "$#" -gt 0 ]]; then + local_phase1_nodes="$*" + elif [[ -n "${PHASE_NODES:-}" ]]; then + local_phase1_nodes="$PHASE_NODES" + fi + + if [[ -z "$local_phase1_nodes" ]]; then + echo "[$(_p1_ts)] [Phase 1] no input nodes -- nothing to do" + return 0 + fi + + echo "[$(_p1_ts)] [Phase 1] PHASE1_SCRIPT start: nodes=[$local_phase1_nodes]" + + # ---- Early-exit: SKIP_GPU_HW_ACCEPTANCE ------------------------------- + # Incremental-bringup short-circuit. Every input node receives the + # pass label, no Test Runner Jobs are created. + if [[ "${SKIP_GPU_HW_ACCEPTANCE,,}" == "true" ]]; then + echo "[$(_p1_ts)] [Phase 1] SKIP_GPU_HW_ACCEPTANCE=true -- pass-labeling all input nodes" + for n in $local_phase1_nodes; do + label_phase_passed "$n" "$PHASE1_LABEL_KEY" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_passed failed for $n" >&2 + done + echo "[$(_p1_ts)] [Phase 1] PHASE1_SCRIPT done (skip mode)" + return 0 + fi + + # ---- Validate required env vars --------------------------------------- + local missing_env="" + for v in PHASE1_LABEL_KEY GPU_PER_WORKER GPU_VALIDATION_STAGES_JSON; do + if [[ -z "${!v:-}" ]]; then + missing_env="$missing_env $v" + fi + done + if [[ -n "$missing_env" ]]; then + echo "[$(_p1_ts)] [Phase 1] ERROR: required env var(s) unset:$missing_env" >&2 + echo "[$(_p1_ts)] [Phase 1] failing every input node with reason=missing-env" >&2 + for n in $local_phase1_nodes; do + label_phase_failed "$n" "${PHASE1_LABEL_KEY:-amd.com/gpu-hw-acceptance}" \ + "phase1-missing-env:${missing_env# }" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + if ! command -v jq >/dev/null 2>&1; then + echo "[$(_p1_ts)] [Phase 1] ERROR: jq not available; cannot parse GPU_VALIDATION_STAGES_JSON or test-runner results" >&2 + for n in $local_phase1_nodes; do + label_phase_failed "$n" "$PHASE1_LABEL_KEY" "phase1-jq-missing" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + # Parse + validate stages array. Each element must have at minimum + # {Name, Image, Framework, Recipe, TimeoutSeconds}. Empty array or + # missing required field -> fail-fast every input node. + local stages_count=0 + stages_count=$(jq 'length' <<<"$GPU_VALIDATION_STAGES_JSON" 2>/dev/null || echo "0") + if [[ -z "$stages_count" || "$stages_count" -le 0 ]]; then + echo "[$(_p1_ts)] [Phase 1] ERROR: GPU_VALIDATION_STAGES_JSON is not a non-empty JSON array" >&2 + for n in $local_phase1_nodes; do + label_phase_failed "$n" "$PHASE1_LABEL_KEY" "phase1-stages-empty-or-invalid" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + local missing_field + missing_field=$(jq -r ' + [ to_entries[] + | . as $e + | [ "Name","Image","Framework","Recipe","TimeoutSeconds" ] + | map(select(($e.value[.] // null) == null) | "stage[\($e.key)]." + .) + ] | add // [] | join(",") + ' <<<"$GPU_VALIDATION_STAGES_JSON" 2>/dev/null || echo "parse-error") + if [[ -n "$missing_field" && "$missing_field" != "parse-error" ]]; then + echo "[$(_p1_ts)] [Phase 1] ERROR: GPU_VALIDATION_STAGES_JSON missing required fields: $missing_field" >&2 + for n in $local_phase1_nodes; do + label_phase_failed "$n" "$PHASE1_LABEL_KEY" \ + "phase1-stages-missing-fields:${missing_field}" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + # Optional .Skip must be a bool when present. Bad type fails fast so + # an accidentally-quoted "true" doesn't silently run the stage. + local bad_skip + bad_skip=$(jq -r ' + [ to_entries[] + | select(.value | has("Skip")) + | select((.value.Skip | type) != "boolean") + | "stage[\(.key)].Skip=\(.value.Skip | tostring)" + ] | join(",") + ' <<<"$GPU_VALIDATION_STAGES_JSON" 2>/dev/null || echo "parse-error") + if [[ -n "$bad_skip" && "$bad_skip" != "parse-error" ]]; then + echo "[$(_p1_ts)] [Phase 1] ERROR: GPU_VALIDATION_STAGES_JSON Skip must be bool: $bad_skip" >&2 + for n in $local_phase1_nodes; do + label_phase_failed "$n" "$PHASE1_LABEL_KEY" \ + "phase1-stages-bad-skip-type:${bad_skip}" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + local job_template=/test-runner-configs/cluster-validation-test-runner-job-config.yaml + if [[ ! -f "$job_template" ]]; then + echo "[$(_p1_ts)] [Phase 1] ERROR: job template not found at $job_template" >&2 + for n in $local_phase1_nodes; do + label_phase_failed "$n" "$PHASE1_LABEL_KEY" "job-template-missing" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + # ---- Per-node aggregate state (across all stages) --------------------- + # alive[node] = "1" while no stage has failed on this node; + # unset/empty after a stage failure dropped it. + # node_ran[node] = "1" once at least one non-skipped stage has been + # submitted for this node. Used to distinguish the + # "all stages were Skip=true" case (aggregate + # label=skipped) from the normal pass path. + # node_fail_stage = first failing stage Name (empty for alive nodes). + # node_fail_reason = failure reason annotation value for first failure. + # node_fail_subtest= recorded subtest name for first failure. + declare -A alive=() + declare -A node_ran=() + declare -A node_fail_stage=() + declare -A node_fail_reason=() + declare -A node_fail_subtest=() + local n + for n in $local_phase1_nodes; do + alive[$n]="1" + done + + local ts + ts=$(date +%Y%m%d-%H%M%S) + + # ---- Helper: short, deterministic node hash for k8s names (max 63 chars). + _phase1_node_hash() { + echo -n "$1" | sha1sum | cut -c1-6 + } + + # Where test-runner pod logs are persisted. Defaults to the shared + # /var/log/cluster-validation hostPath; a user may reconfigure this + # path later. The orchestrator must be able to write here (see + # runAsUser:0 on the CronJob container in cluster-validation-job.yaml). + local results_root="/var/log/cluster-validation" + local stage_idx + + # ---- Outer loop: one iteration per stage ------------------------------ + for (( stage_idx=0; stage_idx=skipped; do NOT submit a Job, do NOT touch + # alive[] (skipped stages must not drop the node), do NOT touch + # node_ran[] (the all-skipped path must produce aggregate + # label=skipped, not passed). + if [[ "$stage_skip" == "true" ]]; then + echo "[$(_p1_ts)] [Phase 1] =============================================" + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} idx=${stage_idx} SKIPPED (Skip=true)" + for n in $local_phase1_nodes; do + [[ "${alive[$n]:-}" == "1" ]] || continue + annotate_phase_value "$n" "$PHASE1_LABEL_KEY" \ + "stage-${stage_name}" "skipped" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: annotate stage=$stage_name skipped failed for $n" >&2 + done + continue + fi + + # Per-stage Job-wait wallclock = per-stage TimeoutSeconds + 120s + # slack for pod scheduling / image pull / result upload. + local stage_wait=$(( stage_timeout + 120 )) + + # Build the single-recipe runner-config payload (compact JSON). + # Shape matches what the runner expects at /etc/test-runner/config.json: + # TestConfig.GPU_HEALTH_CHECK.TestLocationTrigger.global + # .TestParameters.MANUAL.TestCases = [ ] + local recipe_payload + recipe_payload=$(jq -nc \ + --arg fw "$stage_fw" \ + --arg recipe "$stage_recipe" \ + --argjson iters "$stage_iters" \ + --argjson timeout "$stage_timeout" \ + --arg args "$stage_args" \ + '{TestConfig:{GPU_HEALTH_CHECK:{TestLocationTrigger:{global:{TestParameters:{MANUAL:{TestCases:[{Framework:$fw,Recipe:$recipe,Iterations:$iters,StopOnFailure:true,TimeoutSeconds:$timeout,Arguments:$args}]}}}}}}}') + + # Snapshot alive nodes for this stage. Dropped nodes (failed earlier) + # do not get a Job here -- stop-on-first-failure short-circuits. + local alive_in_stage="" + for n in $local_phase1_nodes; do + [[ "${alive[$n]:-}" == "1" ]] && alive_in_stage="$alive_in_stage $n" + done + alive_in_stage="${alive_in_stage# }" + + if [[ -z "$alive_in_stage" ]]; then + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} skipped (no alive nodes left)" + continue + fi + + echo "[$(_p1_ts)] [Phase 1] =============================================" + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} idx=${stage_idx} fw=${stage_fw} recipe=${stage_recipe}" + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} image=${stage_image}" + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} alive_nodes=[${alive_in_stage}]" + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} per-job timeout=${stage_timeout}s (wait budget=${stage_wait}s)" + + # Per-stage state (Job/CM names indexed by node). + declare -A stage_job_to_node=() + declare -A stage_node_to_job=() + declare -A stage_node_to_cm=() + declare -A stage_job_status=() # passed | failed | timeout | submit-failed + declare -A stage_job_subtest=() + declare -A stage_job_reason=() + local stage_job_names="" + + # ---- Per-stage Step 1: create per-node ConfigMap + Job ------------ + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} submitting Jobs (parallel)..." + for node in $alive_in_stage; do + local node_hash cm_name job_name + node_hash=$(_phase1_node_hash "$node") + cm_name="cvf-phase1-${stage_name}-${node_hash}-${ts}" + job_name="cvf-tr-${stage_name}-${node_hash}-${ts}" + stage_node_to_cm[$node]="$cm_name" + stage_node_to_job[$node]="$job_name" + stage_job_to_node[$job_name]="$node" + stage_job_names="$stage_job_names $job_name" + # Mark node as having had a real (non-skipped) stage attempt. + # Used by the final aggregate-labeling step to distinguish + # "all stages skipped" (label=skipped) from "all ran+passed" + # (label=passed). + node_ran[$node]="1" + + # Idempotent CM create: dry-run | apply lets us re-run after a + # mid-stage crash without "already exists" errors. + if ! kubectl create configmap "$cm_name" \ + --from-literal="GPU_VALIDATION_TESTS_JSON=${recipe_payload}" \ + --dry-run=client -o yaml 2>/dev/null \ + | kubectl apply -f - >/dev/null 2>&1; then + echo "[$(_p1_ts)] [Phase 1] ERROR: failed to create configmap=$cm_name node=$node" >&2 + stage_job_status[$job_name]="submit-failed" + stage_job_reason[$job_name]="configmap-creation-failed" + stage_job_subtest[$job_name]="unknown" + continue + fi + + # Render Job template. The four placeholders supplied by Phase 1: + # $$NODE -- target hostname (nodeSelector) + # $$GPU_PER_WORKER -- amd.com/gpu request/limit + # $$TEST_RUNNER_IMAGE -- per-stage image + # $$PHASE1_CONFIG_MAP -- per-stage per-node CM (config-volume) + # Plus the metadata.name rewrite so each Job gets a unique name. + if sed "s|\$\$NODE|${node}|g; \ + s/^ name: cluster-validation-test-runner-job/ name: ${job_name}/; \ + s|\$\$GPU_PER_WORKER|${GPU_PER_WORKER}|g; \ + s|\$\$TEST_RUNNER_IMAGE|${stage_image}|g; \ + s|\$\$PHASE1_CONFIG_MAP|${cm_name}|g" \ + "$job_template" | kubectl apply -f - >/dev/null; then + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} submitted job=$job_name node=$node cm=$cm_name" + else + echo "[$(_p1_ts)] [Phase 1] ERROR: kubectl apply failed for job=$job_name node=$node" >&2 + stage_job_status[$job_name]="submit-failed" + stage_job_reason[$job_name]="job-creation-failed" + stage_job_subtest[$job_name]="unknown" + fi + done + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} all Jobs submitted at $(date)" + + # ---- Per-stage Step 2: Poll-wait every Job ------------------------ + # 5s sleep interval preserves the proven Phase 1 polling cadence. + local job_name + for job_name in $stage_job_names; do + local node="${stage_job_to_node[$job_name]}" + if [[ "${stage_job_status[$job_name]:-}" == "submit-failed" ]]; then + continue + fi + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} waiting job=$job_name node=$node (timeout=${stage_wait}s)" + local start_time elapsed terminal="false" + start_time=$(date +%s) + while true; do + elapsed=$(( $(date +%s) - start_time )) + if [[ "$elapsed" -ge "$stage_wait" ]]; then + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name TIMEOUT after ${stage_wait}s" + stage_job_status[$job_name]="timeout" + stage_job_reason[$job_name]="timeout" + stage_job_subtest[$job_name]="unknown" + terminal="true" + break + fi + local complete_st failed_st + complete_st=$(kubectl get job "$job_name" \ + -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' \ + 2>/dev/null || echo "") + if [[ "$complete_st" == "True" ]]; then + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name Complete=True (node=$node)" + stage_job_status[$job_name]="passed" + terminal="true" + break + fi + failed_st=$(kubectl get job "$job_name" \ + -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' \ + 2>/dev/null || echo "") + if [[ "$failed_st" == "True" ]]; then + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name Failed=True (node=$node)" + stage_job_status[$job_name]="failed" + terminal="true" + break + fi + sleep 5 + done + if [[ "$terminal" != "true" ]]; then + # Defensive belt-and-suspenders. Should be unreachable. + stage_job_status[$job_name]="timeout" + stage_job_reason[$job_name]="timeout" + stage_job_subtest[$job_name]="unknown" + fi + done + + # ---- Per-stage Step 3: Finalize verdict from the Job condition --- + # The Kubernetes Job condition is the SOLE source of truth for + # pass/fail. It is cluster-scoped API state (set by the test-runner + # exit code via backoffLimit=0/restartPolicy=Never) and is therefore + # node-agnostic: it is read identically whether the orchestrator is + # co-located with the worker or not. This matches the original + # pre-refactor CVF design. + # + # We deliberately do NOT read the per-stage result.json here. That + # file is written ONLY to the worker node's local hostPath, so in a + # multi-node cluster the orchestrator (scheduled on an arbitrary + # node) either cannot see it OR -- worse -- matches a different + # node's same-named file at the shared-by-name volume root and + # mis-attributes one node's result to another. Relying on the Job + # condition avoids both failure modes. The verdict was already set + # in Step 2's poll-wait loop (passed | failed | timeout | + # submit-failed); Step 3 only captures pod logs for diagnostics. + for job_name in $stage_job_names; do + local status="${stage_job_status[$job_name]:-failed}" + if [[ "$status" == "submit-failed" || "$status" == "timeout" ]]; then + continue + fi + + local pod_name="" + pod_name=$(kubectl get pods \ + -l "job-name=${job_name}" \ + -o jsonpath='{.items[-1:].metadata.name}' \ + 2>/dev/null || echo "") + + # Always persist the test-runner pod's stdout to results_root, + # pass or fail, for post-run triage. `kubectl logs` works on a + # Completed pod and is node-agnostic. This is diagnostics only + # -- it is never parsed for the verdict. Writing here requires + # the orchestrator to be able to write results_root (see + # runAsUser:0 on the CronJob container). + local pod_log_file="" + if [[ -n "$pod_name" ]]; then + local pod_log_dest="${results_root}/${job_name}_pod.log" + if kubectl logs "$pod_name" > "$pod_log_dest" 2>/dev/null \ + && [[ -s "$pod_log_dest" ]]; then + pod_log_file="$pod_log_dest" + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name pod-log saved=${pod_log_file}" + else + [[ -f "$pod_log_dest" ]] && rm -f "$pod_log_dest" + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name WARN: could not save pod log to ${pod_log_dest}" >&2 + fi + fi + + if [[ "$status" == "passed" ]]; then + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name PASS (Job Complete=True)" + else + # Job Failed=True. The result file is not consulted, so the + # specific sub-test is unknown; the Job condition is enough + # to fail the node. Surface the saved pod-log tail for triage. + stage_job_status[$job_name]="failed" + stage_job_subtest[$job_name]="unknown" + stage_job_reason[$job_name]="job-failed" + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name FAILED (Job Failed=True)" >&2 + if [[ -n "$pod_log_file" && -s "$pod_log_file" ]]; then + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} job=$job_name pod-log tail:" >&2 + tail -n 20 "$pod_log_file" >&2 + fi + fi + done + + # ---- Per-stage Step 4: Annotate per-stage result + update alive --- + # Per-stage annotation key (sub-key passed to annotate_phase_value): + # ${PHASE1_LABEL_KEY}-stage-${stage_name} = passed | failed + # Aggregate label/annotation comes after the outer loop. + local stage_passed=0 stage_failed=0 + for job_name in $stage_job_names; do + local node="${stage_job_to_node[$job_name]}" + local status="${stage_job_status[$job_name]:-failed}" + if [[ "$status" == "passed" ]]; then + annotate_phase_value "$node" "$PHASE1_LABEL_KEY" \ + "stage-${stage_name}" "passed" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: annotate stage=$stage_name passed failed for $node" >&2 + stage_passed=$((stage_passed + 1)) + else + annotate_phase_value "$node" "$PHASE1_LABEL_KEY" \ + "stage-${stage_name}" "failed" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: annotate stage=$stage_name failed failed for $node" >&2 + # Record first-failure info on the node and drop from alive. + if [[ "${alive[$node]:-}" == "1" ]]; then + node_fail_stage[$node]="$stage_name" + node_fail_reason[$node]="${stage_job_reason[$job_name]:-failed}" + node_fail_subtest[$node]="${stage_job_subtest[$job_name]:-unknown}" + alive[$node]="" + fi + stage_failed=$((stage_failed + 1)) + fi + done + echo "[$(_p1_ts)] [Phase 1] stage=${stage_name} done: passed=${stage_passed} failed=${stage_failed}" + + # ---- Per-stage Step 5: Cleanup ------------------------------------ + # Delete Jobs explicitly (so timed-out ones don't linger past + # ttlSecondsAfterFinished); delete per-stage ConfigMaps (avoid + # leaking N-stage * M-node CMs per cron tick). + for job_name in $stage_job_names; do + kubectl delete job "$job_name" --ignore-not-found=true \ + --wait=false >/dev/null 2>&1 \ + || echo "[$(_p1_ts)] [Phase 1] WARN: delete job=$job_name failed" >&2 + done + for node in $alive_in_stage; do + local cm_name="${stage_node_to_cm[$node]:-}" + [[ -z "$cm_name" ]] && continue + kubectl delete configmap "$cm_name" --ignore-not-found=true \ + --wait=false >/dev/null 2>&1 \ + || echo "[$(_p1_ts)] [Phase 1] WARN: delete configmap=$cm_name failed" >&2 + done + + # Reset per-stage maps before the next iteration. + unset stage_job_to_node stage_node_to_job stage_node_to_cm + unset stage_job_status stage_job_subtest stage_job_reason + done + + # ---- Final aggregate labeling ---------------------------------------- + # alive=1 + ran=1 -> every stage that was not Skip=true passed. + # alive=1 + ran=0 -> every stage was Skip=true (or no stages ran for + # any other reason) -> aggregate label=skipped so + # downstream phases see this node as not validated. + # alive=0 -> a non-skip stage failed -> label=failed + + # failed-subtest annotation. + local phase1_passed_count=0 phase1_failed_count=0 phase1_skipped_count=0 + for n in $local_phase1_nodes; do + if [[ "${alive[$n]:-}" == "1" ]]; then + if [[ "${node_ran[$n]:-}" == "1" ]]; then + if label_phase_passed "$n" "$PHASE1_LABEL_KEY"; then + phase1_passed_count=$((phase1_passed_count + 1)) + else + echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_passed failed for $n" >&2 + fi + else + # All stages skipped -- write aggregate label=skipped + # directly (no helper -- this is the only call site). + if kubectl label node "$n" "${PHASE1_LABEL_KEY}=skipped" \ + --overwrite >/dev/null; then + phase1_skipped_count=$((phase1_skipped_count + 1)) + echo "[$(_p1_ts)] [Phase 1] node=${n} ${PHASE1_LABEL_KEY}=skipped (all stages Skip=true)" + else + echo "[$(_p1_ts)] [Phase 1] WARN: kubectl label ${n} ${PHASE1_LABEL_KEY}=skipped failed" >&2 + fi + fi + else + local fstage="${node_fail_stage[$n]:-unknown}" + local freason="${node_fail_reason[$n]:-failed}" + local fsubtest="${node_fail_subtest[$n]:-unknown}" + if label_phase_failed "$n" "$PHASE1_LABEL_KEY" "stage-${fstage}:${freason}"; then + phase1_failed_count=$((phase1_failed_count + 1)) + else + echo "[$(_p1_ts)] [Phase 1] WARN: label_phase_failed failed for $n" >&2 + phase1_failed_count=$((phase1_failed_count + 1)) + fi + annotate_phase_value "$n" "$PHASE1_LABEL_KEY" \ + "failed-subtest" "$fsubtest" \ + || echo "[$(_p1_ts)] [Phase 1] WARN: annotate_phase_value failed for $n" >&2 + fi + done + + echo "[$(_p1_ts)] [Phase 1] PHASE1_SCRIPT done: passed=${phase1_passed_count} failed=${phase1_failed_count} skipped=${phase1_skipped_count}" + return 0 + + # === Phase 2 Script: per-node single-node 8-GPU RCCL all_reduce === + # Sourced by the orchestrator's run_phase2 via _run_phase_generic + # (current) or by the dedicated run_phase2 wiring delivered. + # Mirrors PHASE1_SCRIPT's submit/wait/parse/label shape. + # + # On entry: + # PHASE_NODES = space-separated input node list (also $@) + # PHASE2_LABEL_KEY = "amd.com/gpu-mesh-validation" (from this ConfigMap) + # SKIP_GPU_MESH_VALIDATION = "true" short-circuits to pass-label-all + # ROCE_WORKLOAD_IMAGE, PHASE2_JOB_WAIT_TIME, PHASE2_BW_THRESHOLD, + # PHASE2_START_MSG_SIZE, PHASE2_END_MSG_SIZE, PHASE2_STEP_FACTOR, + # PHASE2_ITER_COUNT, PHASE2_WARMUP_ITER_COUNT, PHASE2_RCCL_ENV_VARS + # from this ConfigMap + # Helpers (already sourced by orchestrator): + # label_phase_passed, label_phase_failed, annotate_phase_value + # Job template available at + # /phase2-configs/cluster-validation-phase2-job-config.yaml + # (mounted from cluster-validation-phase2-job-config + # ConfigMap,). + # + # Contract (design §4 -> "Code Path: PHASE2_SCRIPT"): + # 1. Per input node: sed-render Job template ($$NODE substitution, + # unique per-node Job name) and kubectl apply. Record job name. + # 2. Poll-wait all Jobs bounded by PHASE2_JOB_WAIT_TIME. + # 3. For each Job: classify result by inspecting Job condition AND + # container log: + # * Complete=True -> passed + # * Failed=True with mpirun crash -> failed (rccl-crash) + # * Failed=True with BW message -> failed (bus-bw-below-threshold) + # * Pending/Active past timeout -> failed (timeout) + # * Failed=True other -> failed (rccl-crash, default) + # Threshold checking happens INSIDE the Job container (see + # cluster-validation-phase2-job-config -- validate-single-test.sh + # runs there and returns non-zero on BW miss). PHASE2_SCRIPT + # only differentiates by log markers; it does NOT re-run the + # validator. + # 4. label_phase_passed / label_phase_failed via helpers. + # On failure: annotation + # ${PHASE2_LABEL_KEY}${PHASE_FAILURE_REASON_ANNOTATION_SUFFIX}= + # via label_phase_failed. Additionally, for measured-BW failures + # and successes, annotate the measured Avg bus bandwidth via + # annotate_phase_value "$node" "$PHASE2_LABEL_KEY" "measured-bw" "". + # 5. SKIP_GPU_MESH_VALIDATION=true early-exit: pass-label every input + # node, no Jobs created. + # 6. Cleanup hung (still-Active) Jobs at the end. + # + # Missing job template -> every input node labeled failed with + # reason "job-template-missing". + # + # All output (banners, per-step progress) goes to stdout. Diagnostic-only + # lines go to stderr so they don't accidentally pollute callers that may + # capture stdout in the future. + PHASE2_SCRIPT: | + #!/bin/bash + # Phase 2 -- Intra-node GPU collective (RCCL all_reduce_perf) per-node + # orchestration. Sourced (not exec'd) by the orchestrator. Do NOT + # `set -e`; we want to label every input node even if individual + # kubectl calls fail (mirror of PHASE1_SCRIPT, see design doc). + + # ---- Input handling --------------------------------------------------- + # Mirror of PHASE1_SCRIPT: orchestrator passes the node list as + # positional args AND exports PHASE_NODES. Prefer positional ($@) so a + # re-source with different args works as expected; fall back to + # PHASE_NODES if $# is zero. + local_phase2_nodes="" + if [[ "$#" -gt 0 ]]; then + local_phase2_nodes="$*" + elif [[ -n "${PHASE_NODES:-}" ]]; then + local_phase2_nodes="$PHASE_NODES" + fi + + if [[ -z "$local_phase2_nodes" ]]; then + echo "[Phase 2] no input nodes -- nothing to do" + return 0 + fi + + echo "[Phase 2] PHASE2_SCRIPT start: nodes=[$local_phase2_nodes]" + + # ---- Early-exit: SKIP_GPU_MESH_VALIDATION ----------------------------- + # Incremental-bringup short-circuit (design §4 contract step 5). + # Every input node receives the pass label, no Phase 2 Jobs are + # created. Matches the SKIP_GPU_HW_ACCEPTANCE pattern in PHASE1_SCRIPT. + if [[ "${SKIP_GPU_MESH_VALIDATION,,}" == "true" ]]; then + echo "[Phase 2] SKIP_GPU_MESH_VALIDATION=true -- pass-labeling all input nodes" + for n in $local_phase2_nodes; do + label_phase_passed "$n" "$PHASE2_LABEL_KEY" \ + || echo "[Phase 2] WARN: label_phase_passed failed for $n" >&2 + done + echo "[Phase 2] PHASE2_SCRIPT done (skip mode)" + return 0 + fi + + # ---- Validate required env vars --------------------------------------- + # If any are unset/empty, fail every input node fast with a clear + # reason. PHASE2_RCCL_ENV_VARS and the PHASE2_*_MSG_SIZE / iter knobs + # are consumed inside the Job container -- the driver only needs the + # ones used by submit + classify. + local missing_env="" + for v in PHASE2_LABEL_KEY ROCE_WORKLOAD_IMAGE PHASE2_JOB_WAIT_TIME PHASE2_BW_THRESHOLD; do + if [[ -z "${!v:-}" ]]; then + missing_env="$missing_env $v" + fi + done + if [[ -n "$missing_env" ]]; then + echo "[Phase 2] ERROR: required env var(s) unset:$missing_env" >&2 + echo "[Phase 2] failing every input node with reason=missing-env" >&2 + for n in $local_phase2_nodes; do + label_phase_failed "$n" "${PHASE2_LABEL_KEY:-amd.com/gpu-mesh-validation}" \ + "phase2-missing-env:${missing_env# }" \ + || echo "[Phase 2] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + local job_template=/phase2-configs/cluster-validation-phase2-job-config.yaml + if [[ ! -f "$job_template" ]]; then + echo "[Phase 2] ERROR: job template not found at $job_template" >&2 + for n in $local_phase2_nodes; do + label_phase_failed "$n" "$PHASE2_LABEL_KEY" "job-template-missing" \ + || echo "[Phase 2] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + # ---- Phase-local state ------------------------------------------------- + # Indexed by job name. Declared local so the orchestrator shell stays + # clean when re-sourced. + declare -A phase2_job_to_node=() + declare -A phase2_job_status=() # "passed" | "failed" | "timeout" | "submit-failed" + declare -A phase2_job_reason=() # failure reason annotation value + declare -A phase2_job_bw=() # measured Avg bus bandwidth, when parseable + local phase2_job_names="" + + # ---- Step 1: Parallel-submit Phase 2 Job per node --------------------- + # Mirror of PHASE1_SCRIPT Step 1. kubectl apply is non-blocking from + # the Job's perspective, so submitting in a tight loop is effectively + # parallel. + echo "[Phase 2] submitting Phase 2 Jobs (parallel)..." + local ts max_job_name_len=63 prefix="cvf-phase2" + ts=$(date +%Y%m%d-%H%M%S) + for node in $local_phase2_nodes; do + local job_name="${prefix}-${node}-${ts}" + if [[ "${#job_name}" -gt "$max_job_name_len" ]]; then + local node_hash node_hash=$(echo -n "$node" | sha1sum | cut -c1-6) job_name="${prefix}-${node_hash}-${ts}" + echo "[Phase 2] node=$node hostname too long, using hash -> job=$job_name" fi - job_names="$job_names $job_name" - job_to_node[$job_name]=$node - echo "Submitting test runner job for node: $node (job: $job_name)" - sed "s|\$\$NODE|${node}|g; \ - s/^ name: cluster-validation-test-runner-job/ name: ${job_name}/; \ - s|\$\$GPU_PER_WORKER|${GPU_PER_WORKER}|g; \ - s|\$\$TEST_RUNNER_IMAGE|${TEST_RUNNER_IMAGE}|g" \ - /test-runner-configs/cluster-validation-test-runner-job-config.yaml | kubectl apply -f - - sleep 1 - done - echo "[Test Runner Jobs: Submitted for all candidate nodes]" - echo -e "\n$(date): Waiting for test runner jobs to complete..." - - for job_name in $job_names; do - node=${job_to_node[$job_name]} - echo "Waiting for job $job_name (node: $node)..." - + phase2_job_to_node[$job_name]="$node" + phase2_job_names="$phase2_job_names $job_name" + + # Render template -- $$NODE is the only placeholder defined by the + # Phase 2 template (see design doc). Also rewrite the template's + # fixed metadata.name to our unique per-node job_name so multiple + # nodes can run concurrently without name collision. + if sed "s|\$\$NODE|${node}|g; \ + s/^ name: cluster-validation-phase2-job/ name: ${job_name}/; \ + s|\$\$ROCE_WORKLOAD_IMAGE|${ROCE_WORKLOAD_IMAGE}|g" \ + "$job_template" | kubectl apply -f - >/dev/null; then + echo "[Phase 2] submitted job=$job_name node=$node" + else + echo "[Phase 2] ERROR: kubectl apply failed for job=$job_name node=$node" >&2 + phase2_job_status[$job_name]="submit-failed" + phase2_job_reason[$job_name]="job-creation-failed" + fi + done + echo "[Phase 2] all Jobs submitted at $(date)" + + # ---- Step 2: Poll-wait every Job -------------------------------------- + # Per-job wallclock budget = PHASE2_JOB_WAIT_TIME. Sequential wait + # loop, but the Jobs themselves run in parallel on different nodes. + # Matches the PHASE1_SCRIPT cadence (5s poll interval). + local timeout="${PHASE2_JOB_WAIT_TIME}" + for job_name in $phase2_job_names; do + local node="${phase2_job_to_node[$job_name]}" + if [[ "${phase2_job_status[$job_name]:-}" == "submit-failed" ]]; then + continue + fi + echo "[Phase 2] waiting for job=$job_name node=$node (timeout=${timeout}s)" + local start_time elapsed start_time=$(date +%s) - timeout=${TEST_RUNNER_JOB_WAIT_TIME} - job_succeeded=false - + local terminal="false" while true; do - elapsed=$(($(date +%s) - start_time)) - if [ $elapsed -ge $timeout ]; then - echo "Job $job_name timed out after ${timeout}s ❌" - break - fi - - status=$(kubectl get job "$job_name" -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null || echo "") - if [[ "$status" == "True" ]]; then - echo "Job $job_name completed successfully ✅ (node: $node)" - job_succeeded=true - break - fi - - failed_status=$(kubectl get job "$job_name" -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || echo "") - if [[ "$failed_status" == "True" ]]; then - echo "Job $job_name failed ❌ (node: $node)" - break - fi - sleep 5 + elapsed=$(( $(date +%s) - start_time )) + if [[ "$elapsed" -ge "$timeout" ]]; then + echo "[Phase 2] job=$job_name TIMEOUT after ${timeout}s" + phase2_job_status[$job_name]="timeout" + phase2_job_reason[$job_name]="timeout" + terminal="true" + break + fi + local complete_st failed_st + complete_st=$(kubectl get job "$job_name" \ + -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' \ + 2>/dev/null || echo "") + if [[ "$complete_st" == "True" ]]; then + echo "[Phase 2] job=$job_name Complete=True (node=$node)" + phase2_job_status[$job_name]="passed" + terminal="true" + break + fi + failed_st=$(kubectl get job "$job_name" \ + -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' \ + 2>/dev/null || echo "") + if [[ "$failed_st" == "True" ]]; then + echo "[Phase 2] job=$job_name Failed=True (node=$node)" + phase2_job_status[$job_name]="failed" + terminal="true" + break + fi + sleep 5 done - - if [ "$job_succeeded" = true ]; then - passed_nodes="$passed_nodes $node" + if [[ "$terminal" != "true" ]]; then + # Defensive: should never fall through, but keep state sane. + phase2_job_status[$job_name]="timeout" + phase2_job_reason[$job_name]="timeout" + fi + done + + # ---- Step 3: Classify Job result via container log -------------------- + # The Phase 2 Job container (cluster-validation-phase2-job-config, + #) runs mpirun then validate-single-test.sh internally. + # We retrieve the container log via `kubectl logs` and look for + # signature lines to differentiate the failure reasons enumerated by + # the design doc: + # * "phase2 mpirun exited " -> rccl-crash + # * "phase2 bandwidth below threshold" -> bus-bw-below-threshold + # * "phase2 all_reduce_perf PASSED" -> passed (sanity check) + # The measured "Avg bus bandwidth: " line is also captured for + # the annotation when present (both pass and bw-fail paths emit it + # via the all_reduce_perf output). + for job_name in $phase2_job_names; do + local status="${phase2_job_status[$job_name]:-failed}" + # submit-failed / timeout already have status+reason set. + if [[ "$status" == "submit-failed" || "$status" == "timeout" ]]; then + continue + fi + + # Pod naming: kubectl creates a pod per Job named "${job_name}-". + # backoffLimit: 0 -> at most one pod. + local pod_name="" + pod_name=$(kubectl get pods \ + -l "job-name=${job_name}" \ + -o jsonpath='{.items[-1:].metadata.name}' \ + 2>/dev/null || echo "") + + local log_text="" + if [[ -n "$pod_name" ]]; then + # --tail=200 -- we only need to look at the trailing section + # for the classification markers and the Avg-bw line. Keep + # the captured volume small to stay well under the + # PHASE_ANNOTATION_VALUE_MAX_BYTES cap if we ever embed a tail. + log_text=$(kubectl logs "$pod_name" --tail=200 2>/dev/null || echo "") + fi + + # Always try to capture the measured bandwidth, regardless of + # pass/fail -- the validator script prints it on both paths. + local measured_bw="" + measured_bw=$(echo "$log_text" \ + | awk -F': ' '/Avg bus bandwidth/ {print $2}' \ + | tail -n 1 | tr -d '[:space:]') + if [[ -n "$measured_bw" ]]; then + phase2_job_bw[$job_name]="$measured_bw" + fi + + if [[ "$status" == "passed" ]]; then + echo "[Phase 2] job=$job_name PASS (measured-bw=${measured_bw:-})" + continue + fi + + # status == "failed": classify by log marker. Order matters + # the BW-below-threshold path is reached after a successful + # mpirun, so the crash marker is more specific. + if echo "$log_text" | grep -q "phase2 mpirun exited"; then + phase2_job_reason[$job_name]="rccl-crash" + echo "[Phase 2] job=$job_name FAIL reason=rccl-crash (measured-bw=${measured_bw:-})" + elif echo "$log_text" | grep -q "phase2 bandwidth below threshold"; then + phase2_job_reason[$job_name]="bus-bw-below-threshold" + echo "[Phase 2] job=$job_name FAIL reason=bus-bw-below-threshold (measured-bw=${measured_bw:-}, threshold=${PHASE2_BW_THRESHOLD})" + else + # Job=Failed but no recognized marker -- default to rccl-crash + # (any non-zero exit from the container is treated as a crash + # signal per design §6 unless the validator explicitly flagged + # bw-below-threshold). + phase2_job_reason[$job_name]="rccl-crash" + echo "[Phase 2] job=$job_name FAIL reason=rccl-crash (default, no marker found)" + fi + done + + # ---- Step 4: Label every input node via helpers ------------- + # Pass nodes get measured-bw annotation when we parsed one. Failed + # nodes get failed-reason via label_phase_failed AND, when available, + # the measured-bw annotation (useful for bus-bw-below-threshold + # forensics -- design §6 "Bandwidth below threshold . annotation + # records measured BW"). + local phase2_passed_count=0 phase2_failed_count=0 + for job_name in $phase2_job_names; do + local node="${phase2_job_to_node[$job_name]}" + local status="${phase2_job_status[$job_name]:-failed}" + local bw="${phase2_job_bw[$job_name]:-}" + if [[ "$status" == "passed" ]]; then + if label_phase_passed "$node" "$PHASE2_LABEL_KEY"; then + phase2_passed_count=$((phase2_passed_count + 1)) + else + echo "[Phase 2] WARN: label_phase_passed failed for $node" >&2 + fi + if [[ -n "$bw" ]]; then + annotate_phase_value "$node" "$PHASE2_LABEL_KEY" \ + "measured-bw" "$bw" \ + || echo "[Phase 2] WARN: annotate_phase_value (measured-bw) failed for $node" >&2 + fi + else + local reason="${phase2_job_reason[$job_name]:-failed}" + if label_phase_failed "$node" "$PHASE2_LABEL_KEY" "$reason"; then + phase2_failed_count=$((phase2_failed_count + 1)) + else + echo "[Phase 2] WARN: label_phase_failed failed for $node" >&2 + phase2_failed_count=$((phase2_failed_count + 1)) + fi + if [[ -n "$bw" ]]; then + annotate_phase_value "$node" "$PHASE2_LABEL_KEY" \ + "measured-bw" "$bw" \ + || echo "[Phase 2] WARN: annotate_phase_value (measured-bw) failed for $node" >&2 + fi + fi + done + + # ---- Step 5: Cleanup hung Jobs ---------------------------------------- + # Anything that timed out (still Active, neither Complete nor Failed) + # gets explicitly deleted so it does not linger past the next CronJob + # tick. ttlSecondsAfterFinished handles the completed ones. + for job_name in $phase2_job_names; do + local status="${phase2_job_status[$job_name]:-failed}" + if [[ "$status" == "timeout" ]]; then + echo "[Phase 2] deleting hung job=$job_name" + kubectl delete job "$job_name" --ignore-not-found=true \ + --wait=false >/dev/null 2>&1 \ + || echo "[Phase 2] WARN: delete job=$job_name failed" >&2 + fi + done + + echo "[Phase 2] PHASE2_SCRIPT done: passed=${phase2_passed_count} failed=${phase2_failed_count}" + return 0 + + # === Phase 3 (Per-Node NIC Health Check) env vars === + # Consumed by the per-node Phase 3 Job (cluster-validation-job.yaml, + # `cluster-validation-phase3-job-config` ConfigMap) and by + # `PHASE3_CHECK_SCRIPT` above. See design doc design doc §4 -> + # "Code Path: Phase 3 ConfigMap variables". + # + # PHASE3_CHECK_SCRIPT also defines `${VAR:-default}` fallbacks for these + # keys so the script remains independently runnable for debug; the values + # below are the canonical fleet defaults projected via `envFrom` on the + # Phase 3 Job container. + # + # Tune these via ConfigMap edit -- no code change needed. + + # Expected number of AMD/Pensando NIC physical functions per node, + # matched against the count of PCI entries with a vendor:device pair + # listed in $PHASE3_AMD_NIC_PCI_IDS. Default 8 matches the Pensando + # MI300 fleet baseline (validated on smc300x-ccs-aus-gpuf268). Override + # for SKUs with a different NIC population. + PHASE3_EXPECTED_NIC_COUNT: "8" + # PCI vendor:device allowlist for AMD/Pensando NIC Physical Functions, + # as a comma-separated list. `1dd8:1002` is the "DSC Ethernet Controller" + # PF on the AMD Pensando DSC fleet (DSC-25 / DSC-200 / Salina) -- it is + # the device ID the kernel `ionic` driver binds to and is the canonical + # identifier for a NIC PF (verified on smc300x-ccs-aus-gpuf268,). + # To include a future PF revision or a second SKU, append additional + # vendor:device pairs (e.g. "1dd8:1002,1dd8:1003"). The matching uses + # `lspci -nn` output's trailing `[vendor:device]` tag, which is unique + # per device ID and immune to description-string variation -- this + # naturally excludes SR-IOV VFs (which carry a different device ID), + # PCI bridges, and Processing-accelerator sub-functions that share the + # vendor ID but are not NIC PFs. + PHASE3_AMD_NIC_PCI_IDS: "1dd8:1002" + # Minimum number of GID table entries required per RDMA device. Each + # configured RoCEv2 GID slot prints a `GID[` line in + # `ibv_devinfo -d $dev -v`. Default 1 catches "GID table empty" + # (driver misconfig, RoCE not initialized); raise to require multiple + # RoCE versions. + PHASE3_MIN_GID_COUNT: "1" + # Per-node Phase 3 Job wallclock budget (seconds). The four structural + # checks (lspci / ip link / rdma link / ibv_devinfo) complete in well + # under a second on a healthy node; 120s leaves headroom for image + # pull, pod scheduling, and NIC-resource allocation (design §6 -> + # "Pod cannot mount NIC devices . orchestrator times it out at + # PHASE3_JOB_WAIT_TIME"). PHASE3_SCRIPT treats a Job that + # exceeds this budget as `nic-not-allocated` failure. + # patchable: cluster-validation-framework.timeouts.phase3-job-wait-secs + PHASE3_JOB_WAIT_TIME: "120" + + # ---- Phase 3 Check 5 (firmware <-> workload-image alignment) inputs --- + # The in-pod check compares each NIC's running firmware version (as + # reported by `ethtool -i ` -- which queries the running ionic + # driver, which in turn cached the fw version from the NIC via dev cmd + # at probe time) against the workload image reference passed in via + # the ROCE_WORKLOAD_IMAGE env var. Match policy is a simple substring + # test: the observed firmware-version string must appear somewhere in + # the image reference. This works because the roce-workload image tag + # encodes the ainic firmware version it was built against (e.g. + # `...ainic-1.117.5-a-56`), so the tag IS the contract for which + # firmware is qualified with this RoCE workload. + # + # Replaces the prior nicctl + curated compat-map approach: the + # roce-workload image does NOT bundle nicctl, while it DOES ship + # ethtool; and the image tag itself declares its expected firmware, + # removing the need for a separately maintained allowlist. + + # Master switch for Check 5. Default "true". When "false", Check 5 is + # skipped entirely and the PHASE3_RESULT marker is byte-identical to + # the pre-Check-5 contract. + PHASE3_DRIVER_FW_CHECK_ENABLED: "true" + # Strict-mode toggle. "true" (default): any per-NIC fw <-> image-tag + # mismatch fails the node with reason + # `fw-image-mismatch:=/image=`. "false": + # mismatches surface only via the `observed-fw` node annotation and + # Phase 3 still passes. + PHASE3_DRIVER_FW_STRICT: "true" + + # === Phase 3 Check Script: in-Job 4-check NIC health + self-labelling === + # Embedded into the per-node Phase 3 Job (cluster-validation-phase3-job-config + # ConfigMap,). The Job container writes this string to /tmp/check.sh + # and execs it. Runs inside a privileged pod on the rocm/network-operator-utils + # image so it has lspci / ip / rdma / ibv_* / kubectl available. + # + # On entry (env from cluster-validation-config envFrom + downward-API): + # NODE_NAME = spec.nodeName (downward API on the Job pod) + # PHASE3_LABEL_KEY = "amd.com/nic-health" (, this ConfigMap) + # PHASE3_AMD_NIC_PCI_IDS = "1dd8:1002" Pensando DSC PF default; comma- + # separated list of vendor:device pairs + # PHASE3_EXPECTED_NIC_COUNT = "8" default + # PHASE3_MIN_GID_COUNT = "1" default + # PHASE_FAILURE_REASON_ANNOTATION_SUFFIX = "-failure-reason" (this ConfigMap) + # ServiceAccount cluster-validation-sa has node label/annotate perms. + # + # Contract (design §4 -> "Code Path: PHASE3_CHECK_SCRIPT"): + # 1. Check 1 -- NIC count via `lspci -nn` filtered to the device IDs + # listed in $PHASE3_AMD_NIC_PCI_IDS must equal $PHASE3_EXPECTED_NIC_COUNT. + # 2. Check 2 -- `ip link` state == UP for every amd-nic interface + # (interface names start with enP* or ens*). + # 3. Check 3 -- `rdma link show` reports state ACTIVE per device. + # 4. Check 4 -- `ibv_devinfo` responds AND each device's GID table + # has >= $PHASE3_MIN_GID_COUNT entries. + # 5. On any failure: self-label node with ${PHASE3_LABEL_KEY}=failed, + # write annotations ${PHASE3_LABEL_KEY}${SUFFIX}= and + # ${PHASE3_LABEL_KEY}-failed-nics=, exit non-zero. + # 6. On all pass: self-label ${PHASE3_LABEL_KEY}=passed, exit 0. + # + # Annotation values are truncated to PHASE3_ANNOTATION_MAX_BYTES (=250 by + # default) before write so a worst-case all-NIC failure cannot push the + # node object over the K8s annotation size cap (design §6, test case #10). + # + # NOTE: this script intentionally does NOT use `set -e`. Every check must + # run to completion so the failure annotation can enumerate ALL failed NICs + # in a single run -- a `set -e` early-exit would mask later checks. The + # only non-zero exit happens at the end, after labelling. + PHASE3_CHECK_SCRIPT: | + #!/bin/bash + # Phase 3 -- per-node NIC health check + self-labelling. Runs IN the + # per-node Phase 3 Job pod. Do NOT `set -e`; we + # want every check to run so the failure annotation captures the + # complete list of failed NICs in one pass (design §4). + set +e + + # ---- Defensive defaults ------------------------------------------------ + # PHASE3_LABEL_KEY is and projected via envFrom on + # cluster-validation-config, but the Job pod could in principle be + # launched against an older ConfigMap; fall back to the canonical key. + PHASE3_LABEL_KEY="${PHASE3_LABEL_KEY:-amd.com/nic-health}" + PHASE_FAILURE_REASON_ANNOTATION_SUFFIX="${PHASE_FAILURE_REASON_ANNOTATION_SUFFIX:--failure-reason}" + + # Per-check inputs default to the Pensando MI300-fleet baseline. These + # are normally overridden ConfigMap keys; defaults exist + # so the script is independently runnable for debug. + PHASE3_AMD_NIC_PCI_IDS="${PHASE3_AMD_NIC_PCI_IDS:-1dd8:1002}" + PHASE3_EXPECTED_NIC_COUNT="${PHASE3_EXPECTED_NIC_COUNT:-8}" + PHASE3_MIN_GID_COUNT="${PHASE3_MIN_GID_COUNT:-1}" + + # Annotation value truncation cap. K8s caps the per-object annotation + # set at 256 KiB total; 250 bytes per value leaves abundant headroom + # and matches the design doc + test plan case #10 expectation. + PHASE3_ANNOTATION_MAX_BYTES="${PHASE3_ANNOTATION_MAX_BYTES:-250}" + + # Sysfs roots. Production hard-defaults to the in-pod paths; tests + # override these to point at fixture directories so the script can be + # exercised on a host without real ionic devices. Do NOT change the + # defaults -- the script reads from the container's auto-mounted + # /sys (kernel filesystem is netns-agnostic for sysfs class paths). + PHASE3_IB_SYSFS_DIR="${PHASE3_IB_SYSFS_DIR:-/sys/class/infiniband}" + PHASE3_NET_SYSFS_DIR="${PHASE3_NET_SYSFS_DIR:-/sys/class/net}" + + # NODE_NAME is informational only -- the orchestrator owns the label. + # The script no longer self-labels via in-pod kubectl; it emits a + # PHASE3_RESULT marker on stdout that PHASE3_SCRIPT parses. + echo "[Phase 3] check start: node=${NODE_NAME:-} expected_nics=${PHASE3_EXPECTED_NIC_COUNT} pci_ids=${PHASE3_AMD_NIC_PCI_IDS}" + + # ---- Pre-flight: roce-workload userspace ABI sanity ------------------ + # Catches the case where ROCE_WORKLOAD_IMAGE was bumped to a tag + # whose libionic ABI does not match the host ionic_rdma module + # (canonical example: rocm/roce-workload:...a-63 shipped libionic + # 54.0-149 ABI v2 while the host driver speaks ABI v1 -- see memory + # gpuop-689-phase4-ionic-abi.md). Without this gate, a bad image + # would cascade into four confusing check-1..4 failures on every + # NIC; with it, we emit one unambiguous preflight reason and skip + # the rest of the checks entirely. + # + # Previously also checked `nicctl --help`; dropped now that Check 5 + # uses ethtool (always present in the roce-workload image) and the + # nicctl-based driver/fw compat path is gone. + preflight_fail_reasons=() + if ! ibv_msg=$(ibv_devinfo 2>&1 >/dev/null); then + preflight_fail_reasons+=("preflight-failed:ibv_devinfo=${ibv_msg:0:80}") + fi + if [[ "${#preflight_fail_reasons[@]}" -gt 0 ]]; then + full_reason=$(IFS=,; echo "${preflight_fail_reasons[*]}") + reason="${full_reason:0:${PHASE3_ANNOTATION_MAX_BYTES}}" + echo "[Phase 3] PRE-FLIGHT FAIL: ${full_reason}" + echo "PHASE3_RESULT status=failed reason=${reason} failed_nics=" + exit 1 + fi + + fail_reasons=() + fail_nics=() + + # ---- Check 1: NIC enumeration via PCI vendor:device allowlist --------- + # Count only PFs whose PCI vendor:device tag matches one of the entries + # in $PHASE3_AMD_NIC_PCI_IDS (comma-separated list). `lspci -Dnn` + # appends the literal `[vvvv:dddd]` tag to every line, which is + # immune to description-string drift and uniquely identifies the PF + # device ID (e.g. `1dd8:1002` for the Pensando DSC Ethernet + # Controller PF). SR-IOV VFs carry a *different* device ID and are + # naturally excluded, as are PCI bridges (`1dd8:0008`, `1dd8:1001`) + # and Processing-accelerator sub-functions (`1dd8:100c/100f/1012`) + # that share the vendor but are not NIC PFs. Counting raw lines or + # filtering by `lspci -Dd :` alone inflates the result by + # ~6x on Pensando cards (a single 8-port node reports 48 raw lines). + # To support multiple SKUs, the comma list is rewritten to a regex + # alternation (`1dd8:1002,1dd8:1003` -> `1dd8:1002|1dd8:1003`) and + # matched against the trailing `[vid:did]` tag. + pci_alt=$(echo "$PHASE3_AMD_NIC_PCI_IDS" | sed 's/,/|/g') + nic_count=$(lspci -Dnn 2>/dev/null \ + | grep -cE "\[(${pci_alt})\]" || true) + nic_count="${nic_count:-0}" + if [[ "$nic_count" -ne "$PHASE3_EXPECTED_NIC_COUNT" ]]; then + fail_reasons+=("nic-count:expected=${PHASE3_EXPECTED_NIC_COUNT},actual=${nic_count}") + echo "[Phase 3] CHECK 1 FAIL: nic_count=${nic_count} expected=${PHASE3_EXPECTED_NIC_COUNT}" + else + echo "[Phase 3] CHECK 1 PASS: nic_count=${nic_count}" + fi + + # ---- Check 2: link state UP for every ionic-driven interface ---------- + # TWO fixes vs the original implementation: + # 1. Enumerate by DRIVER, not interface-name prefix. systemd-predictable + # naming on Pensando NICs varies across hosts -- some report + # `enP5p1s0f0` (slot scheme), others `enp104s0f3` (bus scheme), + # others `ens` (older enumeration). The previous prefix filter + # `enP|ens` silently missed `enp*` and false-passed with zero + # iterations. + # 2. Read link state from sysfs (`operstate`), NOT from `ip link`. + # The Phase 3 pod runs on the cluster pod network -- ionic netdevs + # are in the HOST netns, not the pod's, so `ip link` cannot see + # them and netlink-based tooling (ethtool, ip) returns "no such + # device". sysfs is a single global kernel filesystem mounted + # inside the pod regardless of netns, so `/sys/class/net//` + # is always accessible. + for d in "$PHASE3_NET_SYSFS_DIR"/*/; do + iface=$(basename "$d") + drv=$(basename "$(readlink "$d/device/driver" 2>/dev/null)" 2>/dev/null) + [[ "$drv" == "ionic" ]] || continue + # operstate values per kernel Documentation/ABI/testing/sysfs-class-net: + # "up", "down", "unknown", "lowerlayerdown", "testing", "dormant", + # "notpresent". Only "up" is healthy. + operstate=$(cat "$d/operstate" 2>/dev/null) + if [[ "$operstate" != "up" ]]; then + fail_nics+=("$iface") + fail_reasons+=("link-state:${iface}=${operstate:-UNKNOWN}") + echo "[Phase 3] CHECK 2 FAIL: iface=${iface} operstate=${operstate:-UNKNOWN}" + fi + done + + # ---- Check 3: `rdma link show` reports state ACTIVE per device -------- + # `rdma link show` emits one line per RDMA link, e.g. + # link rocep5s0/1 state ACTIVE physical_state LINK_UP netdev enP5p1s0f0 + # We split on whitespace -- field 2 is `/` (we keep the + # whole token; the failure annotation should be specific enough to + # locate the offending link/port pair). For the state token we anchor + # to a leading space so we only match the standalone `state` field + # and never the trailing chunk of `physical_state` (which would + # otherwise be picked up by a naive `grep -oE 'state .'`). + while read -r line; do + # Skip blank lines (e.g. trailing newline) defensively. + [[ -z "$line" ]] && continue + dev=$(echo "$line" | awk '{print $2}') + state=$(echo "$line" | grep -oE ' state [A-Z_]+' | head -n 1 | awk '{print $2}') + if [[ -n "$dev" && "$state" != "ACTIVE" ]]; then + fail_nics+=("$dev") + fail_reasons+=("rdma-state:${dev}=${state:-UNKNOWN}") + echo "[Phase 3] CHECK 3 FAIL: dev=${dev} rdma-state=${state:-UNKNOWN}" + fi + done < <(rdma link show 2>/dev/null) + + # ---- Check 4: ibv_devinfo responsive + non-empty GID table ------------ + # `ibv_devices` output: + # device node GUID + # ------ ---------------- + # rocep5s0 . + # rocep6s0 . + # We skip the 2-line header (`tail -n +3`) and pull the first column. + # For each device: + # * `ibv_devinfo -d $dev` must exit 0 (driver responsive). + # * `ibv_devinfo -d $dev -v` must report >= PHASE3_MIN_GID_COUNT + # `GID[` entries (verbose mode lists each GID slot). + for dev in $(ibv_devices 2>/dev/null | tail -n +3 | awk 'NF{print $1}'); do + if ! ibv_devinfo -d "$dev" >/dev/null 2>&1; then + fail_nics+=("$dev") + fail_reasons+=("ibv-devinfo:${dev}=unresponsive") + echo "[Phase 3] CHECK 4 FAIL: dev=${dev} ibv_devinfo unresponsive" + continue + fi + gid_count=$(ibv_devinfo -d "$dev" -v 2>/dev/null | grep -c "GID\[") + if [[ "$gid_count" -lt "$PHASE3_MIN_GID_COUNT" ]]; then + fail_nics+=("$dev") + fail_reasons+=("gid-table:${dev}=${gid_count}") + echo "[Phase 3] CHECK 4 FAIL: dev=${dev} gid_count=${gid_count} min=${PHASE3_MIN_GID_COUNT}" + fi + done + + # ---- Check 5: firmware <-> workload-image alignment ------------------ + # Read each ionic NIC's running firmware version from sysfs + # `/sys/class/infiniband/ionic_/fw_ver` and require it to appear + # as a substring of the workload image reference passed in as + # $ROCE_WORKLOAD_IMAGE. This catches "host is on fw X but the + # workload image was built against fw Y" without any curated + # allowlist -- the image tag IS the contract. + # + # sysfs path notes: + # * `/sys/class/infiniband//fw_ver` is populated by every + # RDMA driver's `ib_device.ops.query_device()` and reflects the + # running firmware on the card (driver cached from the NIC's + # admin command at probe time). + # * Why sysfs instead of `ethtool -i `: the Phase 3 pod + # runs on the cluster pod network. ionic netdevs are in the HOST + # netns, not the pod's, so ethtool (netlink) cannot reach them. + # /sys/class/infiniband IS visible inside the pod -- sysfs is a + # single global kernel filesystem mounted regardless of netns, + # and the existing Check 4 already reads + # /sys/class/infiniband/*/ports/*/gids/ from the same place. + # + # Strict-mode-aware (PHASE3_DRIVER_FW_STRICT): true => mismatch + # fails the node; false => mismatch only surfaces via the + # observed_fw= annotation and Phase 3 still passes. + # + # "No data" policy: if we cannot read fw_ver from any ionic device + # OR the orchestrator failed to pass ROCE_WORKLOAD_IMAGE, the check + # is SKIPPED rather than failed (loud WARN on stdout). A missing + # comparison value is an infrastructure gap, not a NIC-health + # signal -- Checks 1-4 still gate. Only an actual mismatch + # (we read fw + we read image + strings differ) fails the node in + # strict mode. + PHASE3_OBSERVED_FW_CSV="" + PHASE3_DRIVER_FW_CHECK_ENABLED="${PHASE3_DRIVER_FW_CHECK_ENABLED:-true}" + if [[ "${PHASE3_DRIVER_FW_CHECK_ENABLED,,}" == "true" ]]; then + strict="${PHASE3_DRIVER_FW_STRICT:-true}" + workload_image="${ROCE_WORKLOAD_IMAGE:-}" + + # One fw_ver per /sys/class/infiniband/ionic_. Filter by driver + # symlink (not by `ionic_*` glob) so this picks up any future + # rename of the IB device naming scheme. + declare -A dev_fw + for d in "$PHASE3_IB_SYSFS_DIR"/*/; do + dev=$(basename "$d") + drv=$(basename "$(readlink "$d/device/driver" 2>/dev/null)" 2>/dev/null) + [[ "$drv" == "ionic" ]] || continue + fw=$(cat "$d/fw_ver" 2>/dev/null | tr -d '[:space:]') + [[ -z "$fw" ]] && continue + dev_fw[$dev]="$fw" + done + + # "No data" cases are treated as SKIP rather than FAIL: if we + # can't read fw at all, OR can't get the workload image to compare + # against, we cannot evaluate alignment and a Phase 3 fail here + # would be a false positive (the NICs themselves may be perfectly + # healthy -- Checks 1-4 still gate). Loud WARN on stdout so the + # infrastructure gap (missing env / unreadable sysfs) is visible. + # An actual mismatch (we read fw + we read image, strings differ) + # below still fails in strict mode -- that's the intended check. + _check5_skip=false + if [[ -z "$workload_image" ]]; then + echo "[Phase 3] CHECK 5 SKIP: ROCE_WORKLOAD_IMAGE env not set -- cannot evaluate fw/image alignment" + _check5_skip=true + fi + if [[ "${#dev_fw[@]}" -eq 0 ]]; then + echo "[Phase 3] CHECK 5 SKIP: no fw_ver readable under ${PHASE3_IB_SYSFS_DIR}/*/ -- treating as no-data" + _check5_skip=true + fi + + # Per-NIC substring match: workload_image must contain fw. + # `per_dev_observed` is built even on skip so the operator still + # gets the partial observed_fw annotation when SOME devices read. + per_dev_observed=() + for dev in "${!dev_fw[@]}"; do + fw="${dev_fw[$dev]}" + per_dev_observed+=("${dev}=${fw}") + if [[ "$_check5_skip" == "true" ]]; then + continue + fi + if [[ -n "$workload_image" && "$workload_image" != *"$fw"* ]]; then + echo "[Phase 3] CHECK 5 MISMATCH: dev=${dev} fw=${fw} not in image=${workload_image}" + if [[ "${strict,,}" == "true" ]]; then + fail_nics+=("$dev") + # Only emit the image tag portion in the failure reason -- + # full registry/repo prefix would blow the annotation cap. + fail_reasons+=("fw-image-mismatch:${dev}=${fw}/image=${workload_image##*:}") + fi else - failed_nodes="$failed_nodes $node" + [[ -n "$workload_image" ]] && \ + echo "[Phase 3] CHECK 5 OK: dev=${dev} fw=${fw} matches image substring" fi done + PHASE3_OBSERVED_FW_CSV=$(IFS=,; echo "${per_dev_observed[*]}") fi - - # Count and report results - passed_count=$(echo $passed_nodes | wc -w) - failed_count=$(echo $failed_nodes | wc -w) - echo "==================================================================" - echo "Test Runner Jobs Summary:" - echo " Passed: $passed_count node(s)" - if [ $passed_count -gt 0 ]; then - echo " Nodes: $passed_nodes" - fi - echo " Failed: $failed_count node(s)" - if [ $failed_count -gt 0 ]; then - echo " Nodes: $failed_nodes" - fi - echo "==================================================================" - - if [[ "${SKIP_GPU_VALIDATION,,}" == "true" ]]; then - echo "SKIP_GPU_VALIDATION is set to true. Skip labelling nodes." + + # ---- Emit PHASE3_RESULT marker for orchestrator to parse -------------- + # The orchestrator (PHASE3_SCRIPT) parses this marker line from + # `kubectl logs job/ --tail=20` and performs label/annotate + # writes itself. No in-pod kubectl dependency. + # + # Marker contract (single line on stdout, last PHASE3_RESULT wins): + # PHASE3_RESULT status=passed [observed_fw=] + # PHASE3_RESULT status=failed reason= failed_nics= [observed_fw=] + # + # Reason and offending-NIC lists are comma-joined then truncated to + # the K8s-safe per-value cap. Truncation is intentional: a worst-case + # fleet-wide failure must not push the node object over the 256 KiB + # annotation budget. We log the full untruncated values separately to + # stdout so the Job log retains complete detail. + # + # observed_fw= is added by Check 5 whenever + # PHASE3_DRIVER_FW_CHECK_ENABLED=true. It is omitted entirely when + # Check 5 is disabled so the marker stays byte-for-byte identical to + # the pre-Check-5 contract on operators who flip the check off. + # Values are truncated to PHASE3_ANNOTATION_MAX_BYTES for the same + # node-object size-cap reason as `reason=` / `failed_nics=`. + obs_fw_field="" + if [[ -n "${PHASE3_OBSERVED_FW_CSV:-}" ]]; then + obs_fw_trunc="${PHASE3_OBSERVED_FW_CSV:0:${PHASE3_ANNOTATION_MAX_BYTES}}" + obs_fw_field=" observed_fw=${obs_fw_trunc}" + fi + + if [[ "${#fail_reasons[@]}" -eq 0 ]]; then + echo "[Phase 3] all checks PASSED" + echo "PHASE3_RESULT status=passed${obs_fw_field}" + echo "[Phase 3] PHASE3_CHECK_SCRIPT done: PASS" + exit 0 + fi + + full_reason=$(IFS=,; echo "${fail_reasons[*]}") + full_nics=$(IFS=,; echo "${fail_nics[*]}") + reason="${full_reason:0:${PHASE3_ANNOTATION_MAX_BYTES}}" + nics="${full_nics:0:${PHASE3_ANNOTATION_MAX_BYTES}}" + + echo "[Phase 3] FAIL: full_reason=${full_reason}" + echo "[Phase 3] FAIL: full_failed_nics=${full_nics}" + echo "PHASE3_RESULT status=failed reason=${reason} failed_nics=${nics}${obs_fw_field}" + + echo "[Phase 3] PHASE3_CHECK_SCRIPT done: FAIL (${#fail_reasons[@]} reason(s), ${#fail_nics[@]} NIC(s))" + exit 1 + + # === Phase 3 Script: per-node NIC health Job submit + wait + orchestrator-side label === + # Sourced by the orchestrator's dedicated run_phase3 wiring + # (cluster-validation-job.yaml,). Mirrors PHASE1_SCRIPT / + # PHASE2_SCRIPT submit/wait/label shape. PHASE3_CHECK_SCRIPT runs inside + # the Job pod; the pod's base image does not ship kubectl, so the script + # emits a PHASE3_RESULT marker line on stdout that PHASE3_SCRIPT parses + # via `kubectl logs job/` after the Job reaches a terminal state. + # The orchestrator owns all label/annotate writes -- this removes the + # in-pod kubectl dependency entirely. + # + # PHASE3_SCRIPT responsibilities: + # * intersect the input pool with `amd-nic=true` -- ALREADY DONE + # by the cronjob orchestrator before run_phase3 is invoked + # (see cluster-validation-job.yaml around line ~958). PHASE3_SCRIPT + # therefore treats its $@ / PHASE_NODES input as already-narrowed. + # * submit one Phase 3 Job per node + # * wait for Job completion bounded by PHASE3_JOB_WAIT_TIME + # * for every terminal Job (Complete or Failed), parse the + # PHASE3_RESULT marker from `kubectl logs job/ --tail=20` and + # issue the corresponding label_phase_passed / label_phase_failed + # (+ failed-nics annotation on failure). + # * for Jobs that never reached terminal state (submit-failed, + # timeout) write label_phase_failed directly -- the pod never got + # to emit a marker. + # * cleanup hung Jobs at the end + # + # On entry: + # PHASE_NODES = space-separated input node list (also $@) + # PHASE3_LABEL_KEY = "amd.com/nic-health" (from this ConfigMap) + # SKIP_NIC_VALIDATION = "true" short-circuits to pass-label-all + # PHASE3_EXPECTED_NIC_COUNT, PHASE3_JOB_WAIT_TIME from this ConfigMap + #. PHASE3_AMD_NIC_PCI_IDS / PHASE3_MIN_GID_COUNT are + # consumed inside the Job container -- the driver only needs the + # ones used by submit + classify. + # Helpers (already sourced by orchestrator): + # label_phase_passed, label_phase_failed, annotate_phase_value + # Job template available at + # /phase3-configs/cluster-validation-phase3-job-config.yaml + # (mounted by the cronjob from cluster-validation-phase3-job-config + # ConfigMap,). + # + # Contract: + # 1. Intersection with amd-nic=true: orchestrator-side (cronjob). + # PHASE3_SCRIPT trusts its input list is already narrowed. + # 2. Per input node: sed-render Job template ($$NODE, + # $$EXPECTED_NIC_COUNT substitution, unique per-node Job name) + # and kubectl apply. + # 3. Poll-wait all Jobs bounded by PHASE3_JOB_WAIT_TIME. Record + # terminal status as complete|failed|timeout per job. + # 4. Orchestrator-side label writes -- single source of truth: + # * submit-failed -> label_phase_failed reason=job-creation-failed + # * timeout -> label_phase_failed reason=nic-not-allocated + # * complete -> parse PHASE3_RESULT from job logs + # -> label_phase_passed (status=passed) + # -> label_phase_failed + failed-nics + # annotation (status=failed) + # -> label_phase_failed reason=no-result-line + # (marker missing -- defensive) + # * failed -> same parse path as complete (the Job + # container exited non-zero but the marker + # still appears on stdout before exit 1). + # 5. SKIP_NIC_VALIDATION=true early-exit: pass-label every input node, + # no Jobs created. + # 6. Cleanup hung (timed-out) Jobs at the end. + # + # Missing job template -> every input node labeled failed with reason + # "job-template-missing". + # + # All output (banners, per-step progress) goes to stdout. Diagnostic-only + # lines go to stderr so they don't accidentally pollute callers that may + # capture stdout in the future. + PHASE3_SCRIPT: | + #!/bin/bash + # Phase 3 -- per-node NIC health Job submit + wait orchestration. + # Sourced (not exec'd) by the orchestrator. Do NOT `set -e`; we want + # to label every "did-not-run" input node even if individual kubectl + # calls fail (mirror of PHASE1_SCRIPT / PHASE2_SCRIPT). + + # ---- Input handling --------------------------------------------------- + # Orchestrator passes the node list as positional args AND exports + # PHASE_NODES. Prefer positional ($@) so a re-source with different + # args works as expected; fall back to PHASE_NODES if $# is zero. + # The input is already intersected with amd-nic=true upstream + # (cluster-validation-job.yaml run_phase3 caller), so PHASE3_SCRIPT + # does NOT re-narrow. + local_phase3_nodes="" + if [[ "$#" -gt 0 ]]; then + local_phase3_nodes="$*" + elif [[ -n "${PHASE_NODES:-}" ]]; then + local_phase3_nodes="$PHASE_NODES" + fi + + if [[ -z "$local_phase3_nodes" ]]; then + echo "[Phase 3] no input nodes -- nothing to do" + return 0 + fi + + echo "[Phase 3] PHASE3_SCRIPT start: nodes=[$local_phase3_nodes]" + + # ---- Early-exit: SKIP_NIC_VALIDATION ---------------------------------- + # Incremental-bringup short-circuit (design §4 contract step 5). + # Every input node receives the pass label, no Phase 3 Jobs are + # created. Matches the SKIP_GPU_HW_ACCEPTANCE / SKIP_GPU_MESH_VALIDATION + # pattern in PHASE1_SCRIPT / PHASE2_SCRIPT. + if [[ "${SKIP_NIC_VALIDATION,,}" == "true" ]]; then + echo "[Phase 3] SKIP_NIC_VALIDATION=true -- pass-labeling all input nodes" + for n in $local_phase3_nodes; do + label_phase_passed "$n" "$PHASE3_LABEL_KEY" \ + || echo "[Phase 3] WARN: label_phase_passed failed for $n" >&2 + done + echo "[Phase 3] PHASE3_SCRIPT done (skip mode)" + return 0 + fi + + # ---- Validate required env vars --------------------------------------- + # If any are unset/empty, fail every input node fast with a clear + # reason. PHASE3_CHECK_SCRIPT runs INSIDE the Job container -- the + # driver only needs the ones used by submit + classify. + local missing_env="" + for v in PHASE3_LABEL_KEY PHASE3_EXPECTED_NIC_COUNT PHASE3_JOB_WAIT_TIME; do + if [[ -z "${!v:-}" ]]; then + missing_env="$missing_env $v" + fi + done + if [[ -n "$missing_env" ]]; then + echo "[Phase 3] ERROR: required env var(s) unset:$missing_env" >&2 + echo "[Phase 3] failing every input node with reason=missing-env" >&2 + for n in $local_phase3_nodes; do + label_phase_failed "$n" "${PHASE3_LABEL_KEY:-amd.com/nic-health}" \ + "phase3-missing-env:${missing_env# }" \ + || echo "[Phase 3] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + local job_template=/phase3-configs/cluster-validation-phase3-job-config.yaml + if [[ ! -f "$job_template" ]]; then + echo "[Phase 3] ERROR: job template not found at $job_template" >&2 + for n in $local_phase3_nodes; do + label_phase_failed "$n" "$PHASE3_LABEL_KEY" "job-template-missing" \ + || echo "[Phase 3] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + # ---- Phase-local state ------------------------------------------------- + # Indexed by job name. Declared local so the orchestrator shell stays + # clean when re-sourced. Status transitions: + # submitted -> complete | failed (terminal, parse marker) + # submitted -> timeout (terminal, no marker) + # submit-failed (terminal, no marker) + declare -A phase3_job_to_node=() + declare -A phase3_job_status=() # "submitted" | "submit-failed" | "complete" | "failed" | "timeout" + declare -A phase3_job_reason=() # orchestrator-side failure reason (timeout/submit-failed only) + local phase3_job_names="" + + # ---- Step 1: Parallel-submit Phase 3 Job per node --------------------- + # Mirror of PHASE1_SCRIPT / PHASE2_SCRIPT Step 1. kubectl apply is + # non-blocking from the Job's perspective, so submitting in a tight + # loop is effectively parallel. + echo "[Phase 3] submitting Phase 3 Jobs (parallel)..." + local ts max_job_name_len=63 prefix="cvf-phase3" + ts=$(date +%Y%m%d-%H%M%S) + for node in $local_phase3_nodes; do + local job_name="${prefix}-${node}-${ts}" + if [[ "${#job_name}" -gt "$max_job_name_len" ]]; then + local node_hash + node_hash=$(echo -n "$node" | sha1sum | cut -c1-6) + job_name="${prefix}-${node_hash}-${ts}" + echo "[Phase 3] node=$node hostname too long, using hash -> job=$job_name" + fi + phase3_job_to_node[$job_name]="$node" + phase3_job_names="$phase3_job_names $job_name" + + # Render template. The Phase 3 template defines three + # $$-prefixed placeholders: $$NODE (target hostname), + # $$EXPECTED_NIC_COUNT (NIC resource request count), and + # $$ROCE_WORKLOAD_IMAGE (container image; -- mirrors + # the PHASE4_DRIVER_SCRIPT substitution pattern so Phases 3/4/5 + # all pin to the same ROCE_WORKLOAD_IMAGE ConfigMap key). Also + # rewrite the template's fixed metadata.name to our unique + # per-node job_name so multiple nodes can run concurrently + # without name collision. + local image="${ROCE_WORKLOAD_IMAGE:-rocm/roce-workload:latest}" + if sed "s|\$\$NODE|${node}|g; \ + s/^ name: cluster-validation-phase3-job/ name: ${job_name}/; \ + s|\$\$EXPECTED_NIC_COUNT|${PHASE3_EXPECTED_NIC_COUNT}|g; \ + s|\$\$ROCE_WORKLOAD_IMAGE|${image}|g" \ + "$job_template" | kubectl apply -f - >/dev/null; then + echo "[Phase 3] submitted job=$job_name node=$node" + phase3_job_status[$job_name]="submitted" + else + echo "[Phase 3] ERROR: kubectl apply failed for job=$job_name node=$node" >&2 + phase3_job_status[$job_name]="submit-failed" + phase3_job_reason[$job_name]="job-creation-failed" + fi + done + echo "[Phase 3] all Jobs submitted at $(date)" + + # ---- Step 2: Poll-wait every Job -------------------------------------- + # Per-job wallclock budget = PHASE3_JOB_WAIT_TIME. Sequential wait + # loop, but the Jobs themselves run in parallel on different nodes. + # 5s poll interval matches PHASE1_SCRIPT / PHASE2_SCRIPT cadence. + local timeout="${PHASE3_JOB_WAIT_TIME}" + for job_name in $phase3_job_names; do + local node="${phase3_job_to_node[$job_name]}" + if [[ "${phase3_job_status[$job_name]:-}" == "submit-failed" ]]; then + continue + fi + echo "[Phase 3] waiting for job=$job_name node=$node (timeout=${timeout}s)" + local start_time elapsed + start_time=$(date +%s) + local terminal="false" + while true; do + elapsed=$(( $(date +%s) - start_time )) + if [[ "$elapsed" -ge "$timeout" ]]; then + echo "[Phase 3] job=$job_name TIMEOUT after ${timeout}s" + phase3_job_status[$job_name]="timeout" + # design §6: pod-cannot-mount-NICs is the dominant cause + # of Job-pending timeouts -> reason=nic-not-allocated. + phase3_job_reason[$job_name]="nic-not-allocated" + terminal="true" + break + fi + local complete_st failed_st + complete_st=$(kubectl get job "$job_name" \ + -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' \ + 2>/dev/null || echo "") + if [[ "$complete_st" == "True" ]]; then + # Container exited 0 -> all checks passed. Parse the + # PHASE3_RESULT marker in Step 3 to write the label. + echo "[Phase 3] job=$job_name Complete=True (node=$node)" + phase3_job_status[$job_name]="complete" + terminal="true" + break + fi + failed_st=$(kubectl get job "$job_name" \ + -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' \ + 2>/dev/null || echo "") + if [[ "$failed_st" == "True" ]]; then + # Container exited non-zero -> one or more checks failed. + # The PHASE3_RESULT marker is still emitted before exit 1; + # parse it in Step 3 to record the reason + failed-nics. + echo "[Phase 3] job=$job_name Failed=True (node=$node)" + phase3_job_status[$job_name]="failed" + terminal="true" + break + fi + sleep 5 + done + if [[ "$terminal" != "true" ]]; then + # Defensive: should never fall through, but keep state sane. + phase3_job_status[$job_name]="timeout" + phase3_job_reason[$job_name]="nic-not-allocated" + fi + done + + # ---- Step 3: Orchestrator-side label writes --------------------------- + # Single source of truth for all per-node Phase 3 labels. For Jobs + # that reached terminal state (complete|failed), parse the + # PHASE3_RESULT marker from container logs and write the label + # accordingly. For Jobs that never reached terminal state + # (submit-failed, timeout) write label_phase_failed directly. + _phase3_parse_and_label() { + local job_name="$1" node="$2" + local log_line status reason failed_nics observed_fw + # `--tail=20` is enough: PHASE3_CHECK_SCRIPT emits the marker + # as one of its last few lines. We take the last PHASE3_RESULT + # line in case a previous Job pod attempt also emitted one. + log_line=$(kubectl logs "job/$job_name" --tail=20 2>/dev/null \ + | grep -E '^PHASE3_RESULT ' | tail -n 1) + if [[ -z "$log_line" ]]; then + echo "[Phase 3] WARN: no PHASE3_RESULT line in logs for job=$job_name node=$node -- treating as failed" >&2 + label_phase_failed "$node" "$PHASE3_LABEL_KEY" "no-result-line" \ + || echo "[Phase 3] WARN: label_phase_failed failed for $node" >&2 + return 1 + fi + status=$(echo "$log_line" | grep -oE 'status=[a-z]+' | head -1 | cut -d= -f2) + # Extract optional observed_fw field from the marker. Present on + # both pass and fail lines whenever Check 5 ran; absent when + # PHASE3_DRIVER_FW_CHECK_ENABLED=false (in which case we do NOT + # clear any prior annotation on the node -- last-known-good is + # more useful than empty for operators flipping the check off). + observed_fw=$(echo "$log_line" | sed -nE 's/.*observed_fw=([^ ]+).*/\1/p') + case "$status" in + passed) + echo "[Phase 3] orchestrator-side label: node=$node status=passed" + label_phase_passed "$node" "$PHASE3_LABEL_KEY" \ + || echo "[Phase 3] WARN: label_phase_passed failed for $node" >&2 + ;; + failed) + reason=$(echo "$log_line" | sed -nE 's/.*reason=([^ ]+).*/\1/p') + failed_nics=$(echo "$log_line" | sed -nE 's/.*failed_nics=([^ ]+).*/\1/p') + echo "[Phase 3] orchestrator-side label: node=$node status=failed reason=${reason:-no-reason} failed_nics=${failed_nics:-}" + label_phase_failed "$node" "$PHASE3_LABEL_KEY" "${reason:-no-reason}" \ + || echo "[Phase 3] WARN: label_phase_failed failed for $node" >&2 + if [[ -n "$failed_nics" ]]; then + annotate_phase_value "$node" "$PHASE3_LABEL_KEY" "failed-nics" "$failed_nics" \ + || echo "[Phase 3] WARN: annotate failed-nics failed for $node" >&2 + fi + ;; + *) + echo "[Phase 3] WARN: unrecognized PHASE3_RESULT status=${status:-} for node=$node" >&2 + label_phase_failed "$node" "$PHASE3_LABEL_KEY" "bad-result-line" \ + || echo "[Phase 3] WARN: label_phase_failed failed for $node" >&2 + ;; + esac + # Write observed_fw annotation regardless of pass/fail, whenever + # the marker carried it. The absence branch is intentionally a + # no-op so we preserve any last-known-good annotation from a + # prior run when an operator disables Check 5. + if [[ -n "$observed_fw" ]]; then + annotate_phase_value "$node" "$PHASE3_LABEL_KEY" "observed-fw" "$observed_fw" \ + || echo "[Phase 3] WARN: annotate observed-fw failed for $node" >&2 + fi + } + + local phase3_orch_failed_count=0 + for job_name in $phase3_job_names; do + local node="${phase3_job_to_node[$job_name]}" + local status="${phase3_job_status[$job_name]:-}" + case "$status" in + submit-failed|timeout) + local reason="${phase3_job_reason[$job_name]:-failed}" + echo "[Phase 3] orchestrator-side label: node=$node status=$status reason=$reason" + label_phase_failed "$node" "$PHASE3_LABEL_KEY" "$reason" \ + || echo "[Phase 3] WARN: label_phase_failed failed for $node" >&2 + phase3_orch_failed_count=$((phase3_orch_failed_count + 1)) + ;; + complete|failed) + _phase3_parse_and_label "$job_name" "$node" + ;; + *) + echo "[Phase 3] WARN: unexpected status=$status for job=$job_name" >&2 + ;; + esac + done + + # ---- Step 4: Cleanup hung Jobs ---------------------------------------- + # Anything that timed out (still Active, neither Complete nor Failed) + # gets explicitly deleted so it does not linger past the next CronJob + # tick. ttlSecondsAfterFinished handles the completed ones. + for job_name in $phase3_job_names; do + local status="${phase3_job_status[$job_name]:-}" + if [[ "$status" == "timeout" ]]; then + echo "[Phase 3] deleting hung job=$job_name" + kubectl delete job "$job_name" --ignore-not-found=true \ + --wait=false >/dev/null 2>&1 \ + || echo "[Phase 3] WARN: delete job=$job_name failed" >&2 + fi + done + + echo "[Phase 3] PHASE3_SCRIPT done: orchestrator-labeled-failed=${phase3_orch_failed_count} (submit-failed/timeout); pass/fail labels for completed Jobs derived from PHASE3_RESULT marker" + return 0 + + # === Phase 4 (Pairwise Rail Bandwidth Test) env vars === + # Consumed by `PHASE4_DRIVER_SCRIPT` and by + # the per-pair server/client Jobs rendered from + # `cluster-validation-phase4-job-config`. + # See design doc design doc §4 -> "Code Path: Phase 4 ConfigMap variables". + # + # The driver projects these via `envFrom` on the orchestrator container + # and substitutes the NAD/IB-DEV prefixes into per-rail Job manifests + # at `$$RAIL_IDX` render time. Tune via ConfigMap edit -- no code change. + + # Number of rails (per-NIC RDMA devices) tested per node. Default 8 + # matches the MI300 fleet baseline (8 NICs, one rail per NIC, + # validated on smc300x-ccs-aus-gpuf268). The driver iterates rails + # 0.(PHASE4_RAIL_COUNT - 1) serially within each pair. Lowering + # this value (e.g. PHASE4_RAIL_COUNT=4) is the supported way to skip + # rails 4-7 on partially-cabled testbeds without code changes + # (design §7 test case "rail-count-override"). + PHASE4_RAIL_COUNT: "8" + # Minimum acceptable measured bandwidth per rail in Gbps. A rail is + # marked failed if `ib_write_bw` "BW average" parses to a value below + # this threshold; a node is marked failed iff ANY of its tested rails + # fell below threshold. Default 380 Gbps is ~95% of the 400 Gbps line + # rate target for a single Pensando 400G NIC (design §1) and leaves + # headroom for protocol overhead and cable-marginality jitter. + PHASE4_BW_THRESHOLD: "380" + # Per-rail wallclock budget in seconds for a single (pair, rail) + # measurement -- covers server pod startup, client pod startup, + # `ib_write_bw` handshake + transfer, and log scrape. Default 180s + # leaves headroom for image pull (rocm/roce-workload), pod scheduling, + # and NAD attachment. A pair's total budget is roughly + # PHASE4_RAIL_COUNT * PHASE4_PAIR_WAIT_TIME (rails serialized within + # a pair); on default values that is 8 * 180 = 1440s per pair. + # patchable: cluster-validation-framework.timeouts.phase4-pair-wait-secs + PHASE4_PAIR_WAIT_TIME: "180" + # Cap on concurrent pair runners. The driver forks up to this many + # `pair_runner` background functions in parallel; further pairs queue + # until a slot frees. Default 8 keeps the cluster API server happy on + # large pools (design §2: "pairs run in parallel . bounded to keep + # the kubernetes API server happy on large clusters"). Raise on + # beefier control planes; lower if `kubectl apply` 429s appear during + # parallel submits (design §6 -> "api-throttled" recovery). + PHASE4_MAX_CONCURRENT_PAIRS: "8" + # Prefix for per-rail NetworkAttachmentDefinition names. The driver + # constructs the full NAD name as `${PHASE4_NAD_NAME_PREFIX}${RAIL_IDX}` + # (e.g. `amd-host-device-nad-rail-3` for rail 3) and stamps it into + # the Job pod's `k8s.v1.cni.cncf.io/networks` annotation. The default + # matches the per-rail NAD naming convention shipped in + # `nad-per-rail.yaml`; override only if the fleet uses a different + # NAD naming scheme. + PHASE4_NAD_NAME_PREFIX: "amd-host-device-nad-rail-" + # Prefix for the RDMA device name passed to `ib_write_bw -d`. The + # in-pod command is rendered as + # `ib_write_bw -d ${PHASE4_IB_DEV_PREFIX}${RAIL_IDX} -i 1 -F .`, + # binding the test to a specific per-rail RDMA device. Default + # `ionic_` matches the device-naming convention exposed by the AMD + # Pensando ionic_rdma kernel driver (`/sys/class/infiniband/ionic_N`). + # Override for non-Pensando NICs (e.g. `mlx5_` on Mellanox). + PHASE4_IB_DEV_PREFIX: "ionic_" + + # === Phase 4 (Pairwise Rail Bandwidth Test) driver script === + # Orchestrator-side driver. Sourced (not exec'd) by `run_phase4` in + # cluster-validation-job.yaml after replaces the generic + # `_run_phase_generic` stub with a source-and-invoke of this script. + # The input list is already gated to nodes that passed Phase 3 + # (amd.com/nic-health=passed) by the orchestrator's filter_passed_nodes + # call -- PHASE4_DRIVER_SCRIPT does NOT re-narrow. + # + # Pre-existing companions (consumed at runtime): + # * Per-pair Job templates -- /phase4-configs/{server,client}.yaml + # mounted; templates + # * helpers -- label_phase_passed, label_phase_failed, + # annotate_phase_value (sourced upstream + # from PHASE_NODE_LABEL_SCRIPT). + # * PHASE4_* env vars, projected via envFrom on + # the orchestrator container. + # + # Contract (design §4 -> "Code Path: PHASE4_DRIVER_SCRIPT"): + # 1. Sort the input nodes and build a full-mesh round-robin + # schedule using the circle algorithm: + # - For N input nodes there are N-1 rounds. Each round contains + # floor(N/2) disjoint pairs (every node appears in at most one + # pair per round); for odd N a different node "sits out" each + # round via the bye slot. Union over rounds covers every pair + # in C(N,2), enabling A-B / B-C / A-C transitive triangulation + # of rail faults (a rail that fails on A in pairs against + # multiple peers is much more likely to be A's fault than the + # peers'). + # - Single-node input (N=1) records `unpaired=true` annotation + # and pass-labels with no peer measurement, same as before. + # 2. For each round, fork pair_runners (bounded by + # PHASE4_MAX_CONCURRENT_PAIRS); wait for the round before + # starting the next. Within a round pairs are guaranteed + # node-disjoint so no two pair_runners contend for the same + # node/NIC. + # 3. pair_runner iterates rails 0.(PHASE4_RAIL_COUNT-1) serially: + # - render+apply server Job, wait for pod IP + # - render+apply client Job with that IP + # - wait for both, parse "BW average" Gbps from client log + # - record rail_bw[node][rail][round] and reason on failure + # - cleanup both Jobs + # 4. After all rounds finish, aggregate per-node: + # node passes iff ALL (rail, round) measurements were + # >= PHASE4_BW_THRESHOLD with no recorded reason. + # 5. Label via helpers; write per-(rail, round) annotations and + # `failed-rails` + `triangulation` summaries via + # annotate_phase_value. + # 6. SKIP_RAIL_BANDWIDTH_TEST=true: pass-label every input node, no + # Jobs created (mirror of SKIP_NIC_VALIDATION fast-path). + # + # Inter-process state: bash subshells cannot share arrays back to the + # parent, so pair_runners write per-(node, rail, round) results to a + # shared tmp directory (PHASE4_STATE_DIR). The driver aggregates from + # disk after each round's `wait`. Layout: + # $PHASE4_STATE_DIR/results//rail--round- = "" + # $PHASE4_STATE_DIR/results//rail--round-.reason = "" + # $PHASE4_STATE_DIR/results//rail--round-.peer = "" + # + # Error reasons recorded on a per-rail basis (surfaced via the + # per-rail annotation suffix on failure; mirrors design §6): + # peer-pod-unready -- server Job pod never reached Running with IP + # ib-write-bw-crashed -- client Job exited non-zero + # parse-failed -- client log had no "BW average" line + # nad-missing -- pod admission rejected (server or client) + # api-throttled -- repeated 429 from kubectl apply + # below-threshold: -- measured BW < PHASE4_BW_THRESHOLD + PHASE4_DRIVER_SCRIPT: | + #!/bin/bash + # Phase 4 -- pairwise per-rail ib_write_bw driver. + # Sourced (not exec'd) by the orchestrator. Do NOT `set -e`; we want + # to label every "did-not-run" input node even if individual kubectl + # calls fail (mirror of PHASE3_SCRIPT convention). + + # ---- Input handling --------------------------------------------------- + # Same convention as PHASE3_SCRIPT: prefer positional ($@), fall back + # to PHASE_NODES env. Input is already filter_passed_nodes-narrowed + # to nic-health=passed upstream; do NOT re-narrow here. + local_phase4_nodes="" + if [[ "$#" -gt 0 ]]; then + local_phase4_nodes="$*" + elif [[ -n "${PHASE_NODES:-}" ]]; then + local_phase4_nodes="$PHASE_NODES" + fi + + if [[ -z "$local_phase4_nodes" ]]; then + echo "[Phase 4] no input nodes -- nothing to do" + return 0 + fi + + echo "[Phase 4] PHASE4_DRIVER_SCRIPT start: nodes=[$local_phase4_nodes]" + + # ---- Early-exit: SKIP_RAIL_BANDWIDTH_TEST ----------------------------- + # Incremental-bringup short-circuit (design §1 contract step 6). + # Pass-label every input node, no Phase 4 Jobs. Matches the + # SKIP_NIC_VALIDATION pattern in PHASE3_SCRIPT. + if [[ "${SKIP_RAIL_BANDWIDTH_TEST,,}" == "true" ]]; then + echo "[Phase 4] SKIP_RAIL_BANDWIDTH_TEST=true -- pass-labeling all input nodes" + for n in $local_phase4_nodes; do + label_phase_passed "$n" "$PHASE4_LABEL_KEY" \ + || echo "[Phase 4] WARN: label_phase_passed failed for $n" >&2 + done + echo "[Phase 4] PHASE4_DRIVER_SCRIPT done (skip mode)" + return 0 + fi + + # ---- Validate required env vars --------------------------------------- + # Mirror of PHASE3_SCRIPT: if any are unset/empty, fail every input + # node fast with a clear reason. RAIL_COUNT / BW_THRESHOLD / + # PAIR_WAIT_TIME / MAX_CONCURRENT_PAIRS / NAD_NAME_PREFIX / + # IB_DEV_PREFIX are the driver-side consumers. + local missing_env="" + for v in PHASE4_LABEL_KEY PHASE4_RAIL_COUNT PHASE4_BW_THRESHOLD \ + PHASE4_PAIR_WAIT_TIME PHASE4_MAX_CONCURRENT_PAIRS \ + PHASE4_NAD_NAME_PREFIX PHASE4_IB_DEV_PREFIX; do + if [[ -z "${!v:-}" ]]; then + missing_env="$missing_env $v" + fi + done + if [[ -n "$missing_env" ]]; then + echo "[Phase 4] ERROR: required env var(s) unset:$missing_env" >&2 + echo "[Phase 4] failing every input node with reason=missing-env" >&2 + for n in $local_phase4_nodes; do + label_phase_failed "$n" "${PHASE4_LABEL_KEY:-amd.com/rail-bandwidth}" \ + "phase4-missing-env:${missing_env# }" \ + || echo "[Phase 4] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + local server_tmpl=/phase4-configs/cluster-validation-phase4-server-job-config.yaml + local client_tmpl=/phase4-configs/cluster-validation-phase4-client-job-config.yaml + if [[ ! -f "$server_tmpl" || ! -f "$client_tmpl" ]]; then + echo "[Phase 4] ERROR: Job templates missing (server=$server_tmpl client=$client_tmpl)" >&2 + for n in $local_phase4_nodes; do + label_phase_failed "$n" "$PHASE4_LABEL_KEY" "job-template-missing" \ + || echo "[Phase 4] WARN: label_phase_failed failed for $n" >&2 + done + return 1 + fi + + # ---- Sort input ------------------------------------------------------- + # Stable, deduped ordering so the schedule is reproducible across + # re-sources (design §7 test cases + # "mesh-roundrobin-N{2,4,5,6,7,8}"). + local sorted_nodes + sorted_nodes=$(printf '%s\n' $local_phase4_nodes | LC_ALL=C sort -u | tr '\n' ' ') + sorted_nodes="${sorted_nodes% }" + + # Re-emit the sorted-node list into a positional-args-friendly array. + # shellcheck disable=SC2206 + local -a phase4_sorted=($sorted_nodes) + local phase4_count="${#phase4_sorted[@]}" + + echo "[Phase 4] sorted input ($phase4_count nodes): ${phase4_sorted[*]}" + + # ---- Full-mesh round-robin schedule ---------------------- + # Circle algorithm. For an even N nodes (indices 0..N-1) we fix + # node 0 and rotate the remaining N-1 indices around it: in each + # round, position k (k=0..N/2-1) plays position N-1-k after the + # rotation. After N-1 rounds every unordered pair has played + # exactly once, giving C(N,2) total pairs scheduled in N-1 + # node-disjoint rounds of N/2 pairs each. + # + # For odd N we add a virtual "bye" slot to make N+1 even and run + # the same algorithm; in each round one real node is paired with + # bye and sits out (no measurement, no Job). The node that sits + # out differs every round, so each real node sits out exactly + # once across the N rounds. + # + # TODO: clusters larger than ~16 nodes amplify the + # number of scheduled pairs quadratically (e.g. N=32 -> 31 rounds + # * 16 pairs = 496 measurements). For those topologies prefer to + # partition the input by rack and run the mesh schedule per + # partition, then a sparse cross-rack sample, rather than a true + # full mesh. Partitioning is intentionally NOT implemented here; + # the operator can pre-split by passing per-rack node lists to + # separate Phase 4 invocations. + # + # `phase4_rounds` is the schedule: one entry per round, formatted + # as "nodeA1,nodeB1 nodeA2,nodeB2 ..." (comma joins a pair, space + # joins pairs within a round). The bye node, if any, is implicit + # (no pair containing it is emitted). `phase4_total_pairs` is the + # sum of pair counts across rounds, used only for logging. + local -a phase4_rounds=() + local phase4_total_pairs=0 + local phase4_unpaired_node="" + + if [[ "$phase4_count" -eq 0 ]]; then + : # nothing to schedule -- already handled by early-exit above + elif [[ "$phase4_count" -eq 1 ]]; then + # Lone node: no peer exists in the whole cluster. Pass-label + # via the unpaired fast-path below; emit no rounds. + phase4_unpaired_node="${phase4_sorted[0]}" else - if [ $passed_count -gt 0 ]; then - echo "Labeling passed test runner nodes..." - for n in $passed_nodes; do - echo " - Node $n: Adding test runner success label" - kubectl label node "$n" "${TEST_RUNNER_SUCCESS_LABEL}" --overwrite + # Build a virtual node-index array. For odd N append a + # sentinel -1 representing the "bye" slot; pairs containing + # -1 are dropped (one real node sits out that round). + local -a phase4_circle=() + local _ci=0 + while [[ "$_ci" -lt "$phase4_count" ]]; do + phase4_circle+=("$_ci") + _ci=$((_ci + 1)) done - fi - if [ $failed_count -gt 0 ]; then - echo "Processing failed nodes..." - CANDIDATE_LABEL_KEY=${CANDIDATE_LABEL%%=*} - for n in $failed_nodes; do - echo " - Node $n: Adding test runner failure label" - kubectl label node "$n" "${TEST_RUNNER_FAILURE_LABEL}" --overwrite - echo " - Node $n: Removing candidate label and marking as failed" - kubectl label node "$n" "${CANDIDATE_LABEL_KEY}-" --overwrite - kubectl label node "$n" "${FAILURE_LABEL}" --overwrite + local phase4_has_bye="false" + if (( phase4_count % 2 == 1 )); then + phase4_circle+=("-1") + phase4_has_bye="true" + fi + local phase4_circle_size="${#phase4_circle[@]}" # always even + local phase4_round_count=$(( phase4_circle_size - 1 )) + local phase4_half=$(( phase4_circle_size / 2 )) + + local phase4_pairs_per_round="$phase4_half" + if [[ "$phase4_has_bye" == "true" ]]; then + phase4_pairs_per_round=$(( phase4_half - 1 )) + fi + echo "[Phase 4] mesh schedule: nodes=${phase4_count} rounds=${phase4_round_count} pairs/round=${phase4_pairs_per_round}" + + # Rotate positions 1..circle_size-1 left by one slot every + # round, while keeping position 0 fixed. Within a round, the + # k-th pair is (circle[k], circle[circle_size-1-k]) for k in + # [0, half-1]. We drop pairs that contain the bye index (-1). + local round_idx=0 + while [[ "$round_idx" -lt "$phase4_round_count" ]]; do + local round_pairs="" + local k=0 + while [[ "$k" -lt "$phase4_half" ]]; do + local idx_a="${phase4_circle[$k]}" + local idx_b="${phase4_circle[$(( phase4_circle_size - 1 - k ))]}" + if [[ "$idx_a" != "-1" && "$idx_b" != "-1" ]]; then + local na="${phase4_sorted[$idx_a]}" + local nb="${phase4_sorted[$idx_b]}" + # Canonical order within a pair (sorted) so log + # output is stable regardless of which slot a node + # ended up in after rotation. + local pair_str + if [[ "$na" < "$nb" ]]; then + pair_str="${na},${nb}" + else + pair_str="${nb},${na}" + fi + if [[ -z "$round_pairs" ]]; then + round_pairs="$pair_str" + else + round_pairs="${round_pairs} ${pair_str}" + fi + phase4_total_pairs=$(( phase4_total_pairs + 1 )) + fi + k=$(( k + 1 )) + done + phase4_rounds+=("$round_pairs") + echo "[Phase 4] round ${round_idx}: ${round_pairs}" + + # Rotate: fix index 0, shift positions 1..end left by one. + # The element that was at position 1 wraps around to the + # last position. We do this in-place with a temp swap. + local rotate_tmp="${phase4_circle[1]}" + local rp=1 + while [[ "$rp" -lt $(( phase4_circle_size - 1 )) ]]; do + phase4_circle[$rp]="${phase4_circle[$(( rp + 1 ))]}" + rp=$(( rp + 1 )) + done + phase4_circle[$(( phase4_circle_size - 1 ))]="$rotate_tmp" + + round_idx=$(( round_idx + 1 )) done - fi fi - - # Check if minimum nodes passed - min_nodes=${WORKER_REPLICAS} - if [ $passed_count -lt $min_nodes ]; then - echo "❌ Insufficient nodes passed test runner jobs. Required: $min_nodes, Passed: $passed_count" - echo "Skipping MPI job submission." - sleep ${DEBUG_DELAY} - exit 1 + + echo "[Phase 4] schedule: rounds=${#phase4_rounds[@]} total_pairs=${phase4_total_pairs} unpaired=${phase4_unpaired_node:-}" + + # ---- Unpaired (lone) node fast-path ---------------------------------- + # Only triggers for N=1 (no peer in the whole cluster). With the + # full-mesh schedule, odd N no longer leaves a node + # permanently unpaired -- each real node sits out exactly one round + # but is paired in all the others. Design §6 + test case + # "single-node-input". + if [[ -n "$phase4_unpaired_node" ]]; then + local upn="$phase4_unpaired_node" + echo "[Phase 4] unpaired node=$upn -- pass-labeling (no peer to test)" + label_phase_passed "$upn" "$PHASE4_LABEL_KEY" \ + || echo "[Phase 4] WARN: label_phase_passed failed for $upn" >&2 + annotate_phase_value "$upn" "$PHASE4_LABEL_KEY" "unpaired" "true" \ + || echo "[Phase 4] WARN: annotate unpaired=true failed for $upn" >&2 + fi + + # ---- Shared inter-process state --------------------------------------- + # pair_runner functions run in background subshells; bash subshells + # cannot mutate parent-scope associative arrays. Use a tmp dir as + # the shared write surface, then aggregate after `wait`. + local phase4_ts + phase4_ts=$(date +%Y%m%d-%H%M%S) + local PHASE4_STATE_DIR="/tmp/phase4-${phase4_ts}-$$" + if ! mkdir -p "${PHASE4_STATE_DIR}/results"; then + echo "[Phase 4] ERROR: cannot create state dir ${PHASE4_STATE_DIR}" >&2 + for n in $local_phase4_nodes; do + label_phase_failed "$n" "$PHASE4_LABEL_KEY" "driver-state-dir-failed" \ + || true + done + return 1 fi - - echo "[Test Runner Jobs: $passed_count node(s) passed, proceeding with RCCL tests]" - echo "==================================================================" + echo "[Phase 4] state dir: $PHASE4_STATE_DIR" + + # ---- Helper: render_phase4_template ----------------------------------- + # Stdin-less sed render of a server/client Job template. The two + # templates define $$NODE, $$RAIL_IDX, $$NAD_NAME, + # $$ROCE_WORKLOAD_IMAGE (server+client) and $$PEER_POD_IP (client + # only). We also rewrite the template's fixed metadata.name to a + # unique per-(role, node, rail) name so concurrent pair_runners do + # not collide. Stdout: rendered YAML; stderr: errors. + _phase4_render() { + local tmpl="$1" + local role="$2" # "server" | "client" + local node="$3" + local rail_idx="$4" + local round_idx="$5" + local nad_name="$6" + local peer_ip="$7" # empty string for server + local image="${ROCE_WORKLOAD_IMAGE:-rocm/roce-workload:latest}" + # include the round index so the same (role, node, + # rail) repeated across mesh rounds gets a unique k8s Job name + # (we explicit-delete after each rail, but the unique name is + # belt-and-suspenders against ttl-GC racing the next round). + local job_name="cvf-p4-${role}-${node}-r${rail_idx}-rd${round_idx}" + # Job names are bound at 63 chars by k8s; long hostnames + role + # + rail + round blow that budget. Hash-shorten if needed + # (mirror of PHASE3_SCRIPT job_name handling). + local max_job_name_len=63 + if [[ "${#job_name}" -gt "$max_job_name_len" ]]; then + local node_hash + node_hash=$(echo -n "$node" | sha1sum | cut -c1-6) + job_name="cvf-p4-${role}-${node_hash}-r${rail_idx}-rd${round_idx}" + echo "[Phase 4] node=$node role=$role rail=$rail_idx round=$round_idx hostname too long, using hash -> job=$job_name" >&2 + fi + # Rewrite the template's fixed metadata.name to the unique name + # in addition to substituting the $$-placeholders. + local name_re_server='^ name: cluster-validation-phase4-server-job' + local name_re_client='^ name: cluster-validation-phase4-client-job' + local rename_expr + if [[ "$role" == "server" ]]; then + rename_expr="s|${name_re_server}| name: ${job_name}|" + else + rename_expr="s|${name_re_client}| name: ${job_name}|" + fi + sed "${rename_expr}; \ + s|\$\$NODE|${node}|g; \ + s|\$\$RAIL_IDX|${rail_idx}|g; \ + s|\$\$NAD_NAME|${nad_name}|g; \ + s|\$\$PEER_POD_IP|${peer_ip}|g; \ + s|\$\$ROCE_WORKLOAD_IMAGE|${image}|g" \ + "$tmpl" + # Echo the chosen job name on stderr so the caller can capture + # it without parsing the rendered YAML. + echo "RENDERED_JOB_NAME=${job_name}" >&2 + } + + # ---- Helper: apply with 429 backoff --------------------------------- + # Wraps `kubectl apply -f -` with up to 3 retries on apparent + # throttling (design §6 "api-throttled-retry"). Returns 0 on success, + # 1 on apply failure, 2 on 429/throttle exhaustion. Reads YAML from + # stdin. + _phase4_apply_with_backoff() { + local attempt=1 + local max_attempts=3 + local out rc + local input + input=$(cat) + while [[ "$attempt" -le "$max_attempts" ]]; do + out=$(printf '%s' "$input" | kubectl apply -f - 2>&1) + rc=$? + if [[ "$rc" -eq 0 ]]; then + return 0 + fi + if echo "$out" | grep -qE '429|TooManyRequests|throttl'; then + echo "[Phase 4] kubectl apply throttled (attempt ${attempt}/${max_attempts}): $out" >&2 + sleep $((attempt * 2)) + attempt=$((attempt + 1)) + continue + fi + # Non-throttle failure -- surface the error and bail. + echo "[Phase 4] kubectl apply failed: $out" >&2 + return 1 + done + echo "[Phase 4] kubectl apply throttled after ${max_attempts} attempts" >&2 + return 2 + } + + # ---- Helper: wait_for_pod_ip ------------------------------------------ + # Polls the Job's first pod for status.podIP. The pod selector key + # for batch/v1.Job pods is job-name=. Returns 0 + emits IP + # on stdout, or returns non-zero if the timeout fires. + # + # Reasons surfaced by callers when this returns non-zero: + # - admission rejected (no pod ever created) -> reason=nad-missing + # - admission accepted but pod stuck Pending -> reason=peer-pod-unready + # We can distinguish by also checking whether ANY pod exists for the + # job -- if not, almost always a NAD/admission issue. + _phase4_wait_for_pod_ip() { + local job_name="$1" + local timeout="$2" # seconds + local start_time elapsed pod_ip pod_count + start_time=$(date +%s) + while true; do + elapsed=$(( $(date +%s) - start_time )) + if [[ "$elapsed" -ge "$timeout" ]]; then + pod_count=$(kubectl get pods -l "job-name=${job_name}" \ + --no-headers 2>/dev/null | wc -l) + if [[ "$pod_count" -eq 0 ]]; then + echo "[Phase 4] wait_for_pod_ip job=$job_name timeout -- no pod (admission rejected?)" >&2 + return 2 # nad-missing-ish + fi + echo "[Phase 4] wait_for_pod_ip job=$job_name timeout -- pod never got IP" >&2 + return 1 # peer-pod-unready + fi + pod_ip=$(kubectl get pods -l "job-name=${job_name}" \ + -o jsonpath='{.items[0].status.podIP}' 2>/dev/null || echo "") + if [[ -n "$pod_ip" ]]; then + echo "$pod_ip" + return 0 + fi + sleep 2 + done + } + + # ---- Helper: wait_for_job_terminal ------------------------------------ + # Waits for the Job to reach Complete=True or Failed=True. Returns 0 + # on Complete, 1 on Failed, 2 on timeout. Mirrors PHASE3_SCRIPT cadence. + _phase4_wait_for_job_terminal() { + local job_name="$1" + local timeout="$2" + local start_time elapsed complete_st failed_st + start_time=$(date +%s) + while true; do + elapsed=$(( $(date +%s) - start_time )) + if [[ "$elapsed" -ge "$timeout" ]]; then + return 2 + fi + complete_st=$(kubectl get job "$job_name" \ + -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' \ + 2>/dev/null || echo "") + if [[ "$complete_st" == "True" ]]; then + return 0 + fi + failed_st=$(kubectl get job "$job_name" \ + -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' \ + 2>/dev/null || echo "") + if [[ "$failed_st" == "True" ]]; then + return 1 + fi + sleep 5 + done + } + + # ---- Helper: parse_bw_average ----------------------------------------- + # Greps the "BW average" line from `kubectl logs ` of the + # client Job's pod and emits the Gbps value (a float, possibly with + # decimals) on stdout. ib_write_bw default output line looks like: + # 65536 5000 0.00 388.42 0.740434 + # with column headers "#bytes #iterations BW peak[Gb/sec] BW + # average[Gb/sec] MsgRate[Mpps]". We pull the 4th whitespace- + # delimited field of the line that matches "BW average". Returns 0 + # on success + value on stdout, 1 on parse failure. + _phase4_parse_bw_average() { + local job_name="$1" + local pod_name log bw_value + pod_name=$(kubectl get pods -l "job-name=${job_name}" \ + -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "") + if [[ -z "$pod_name" ]]; then + return 1 + fi + log=$(kubectl logs "$pod_name" 2>/dev/null || echo "") + if [[ -z "$log" ]]; then + return 1 + fi + # ib_write_bw emits the result line as the first line where + # column 4 is the BW-average value. The in-pod client script + # additionally echoes "[phase4-client] rail "; pick + # the data row (digits in column 1), not the echo. + bw_value=$(echo "$log" | awk ' + /BW average/ { in_table = 1; next } + in_table && $1 ~ /^[0-9]+$/ && NF >= 4 { + print $4; exit + } + ') + if [[ -z "$bw_value" ]]; then + return 1 + fi + printf '%s' "$bw_value" + return 0 + } + + # ---- Helper: record_rail_result --------------------------------------- + # Writes a per-(node, rail, round) result file inside the state dir + # (full-mesh schedule means a node has up to N-1 rounds + # of data per rail). Concurrency note: within a single round, no + # two pair_runners ever touch the SAME (node, rail, round) tuple + # because the round's pairs are node-disjoint by construction. + # Rounds are executed sequentially, so cross-round writes are + # serialized at the round boundary. No locking required. + _phase4_record_rail_result() { + local node="$1" + local rail_idx="$2" + local round_idx="$3" + local bw_value="$4" # empty if failure + local reason="$5" # empty if pass + local peer="$6" + local node_dir="${PHASE4_STATE_DIR}/results/${node}" + local base="${node_dir}/rail-${rail_idx}-round-${round_idx}" + mkdir -p "$node_dir" + if [[ -n "$bw_value" ]]; then + printf '%s' "$bw_value" > "${base}" + fi + if [[ -n "$reason" ]]; then + printf '%s' "$reason" > "${base}.reason" + fi + if [[ -n "$peer" ]]; then + printf '%s' "$peer" > "${base}.peer" + fi + } + + # ---- Helper: pair_runner ---------------------------------------------- + # Run inside a background subshell -- mutate state only via + # _phase4_record_rail_result (file writes). Iterates rails serially + # so the per-rail NAD is only claimed by one pod-pair at a time. + # The round index is passed in so each measurement is + # recorded under its own (node, rail, round) tuple in state. + pair_runner() { + local node_a="$1" + local node_b="$2" + local round_idx="$3" + local pair_tag="${node_a}<->${node_b}/round=${round_idx}" + echo "[Phase 4][${pair_tag}] pair_runner start" + + local rail + for (( rail=0; rail < PHASE4_RAIL_COUNT; rail++ )); do + local nad_name="${PHASE4_NAD_NAME_PREFIX}${rail}" + echo "[Phase 4][${pair_tag}] rail=${rail} nad=${nad_name} START" + + # --- Step 1: render+apply server Job (pinned to node_a) ---- + local server_rendered server_apply_rc server_job_name="" + local render_err render_tmp + # Per-subshell unique tmp path. $$ is identical across bash + # subshells (parent PID), so PIDs of concurrent pair_runners + # collide. BASHPID is the subshell's own PID; pair with the + # rail index, round index, and role for full uniqueness + # even if BASHPID is reused after a pair finishes. + render_tmp=$(mktemp "/tmp/p4-render-${BASHPID:-$$}-rd${round_idx}-r${rail}-s.XXXXXX") + server_rendered=$(_phase4_render "$server_tmpl" "server" \ + "$node_a" "$rail" "$round_idx" "$nad_name" "" 2> >(tee "$render_tmp" >&2)) + render_err=$(cat "$render_tmp" 2>/dev/null || true) + rm -f "$render_tmp" + server_job_name=$(echo "$render_err" \ + | sed -n 's/^RENDERED_JOB_NAME=//p' | tail -1) + if [[ -z "$server_job_name" ]]; then + echo "[Phase 4][${pair_tag}] rail=${rail} render-failed (server)" >&2 + _phase4_record_rail_result "$node_a" "$rail" "$round_idx" "" \ + "render-failed" "$node_b" + _phase4_record_rail_result "$node_b" "$rail" "$round_idx" "" \ + "render-failed" "$node_a" + continue + fi + printf '%s' "$server_rendered" | _phase4_apply_with_backoff + server_apply_rc=$? + if [[ "$server_apply_rc" -ne 0 ]]; then + local rsn="job-creation-failed" + [[ "$server_apply_rc" -eq 2 ]] && rsn="api-throttled" + echo "[Phase 4][${pair_tag}] rail=${rail} server apply failed rc=${server_apply_rc}" >&2 + _phase4_record_rail_result "$node_a" "$rail" "$round_idx" "" "$rsn" "$node_b" + _phase4_record_rail_result "$node_b" "$rail" "$round_idx" "" "$rsn" "$node_a" + continue + fi + + # --- Step 2: wait for server pod IP (admission + scheduled) --- + # Budget: half the per-rail wallclock for pod-IP wait (server + # needs to be Running before the client starts), the other + # half covers client startup + handshake + transfer. + local server_pod_ip_timeout=$(( PHASE4_PAIR_WAIT_TIME / 2 )) + [[ "$server_pod_ip_timeout" -lt 30 ]] && server_pod_ip_timeout=30 + local server_pod_ip + server_pod_ip=$(_phase4_wait_for_pod_ip "$server_job_name" \ + "$server_pod_ip_timeout") + local pod_ip_rc=$? + if [[ "$pod_ip_rc" -ne 0 ]]; then + local rsn="peer-pod-unready" + [[ "$pod_ip_rc" -eq 2 ]] && rsn="nad-missing" + echo "[Phase 4][${pair_tag}] rail=${rail} server pod IP wait failed rc=${pod_ip_rc} reason=${rsn}" >&2 + _phase4_record_rail_result "$node_a" "$rail" "$round_idx" "" "$rsn" "$node_b" + _phase4_record_rail_result "$node_b" "$rail" "$round_idx" "" "$rsn" "$node_a" + kubectl delete job "$server_job_name" --ignore-not-found=true \ + --wait=false >/dev/null 2>&1 || true + continue + fi + echo "[Phase 4][${pair_tag}] rail=${rail} server_pod_ip=${server_pod_ip}" + + # --- Step 3: render+apply client Job (pinned to node_b) ---- + local client_rendered client_apply_rc client_job_name="" + render_tmp=$(mktemp "/tmp/p4-render-${BASHPID:-$$}-rd${round_idx}-r${rail}-c.XXXXXX") + client_rendered=$(_phase4_render "$client_tmpl" "client" \ + "$node_b" "$rail" "$round_idx" "$nad_name" "$server_pod_ip" \ + 2> >(tee "$render_tmp" >&2)) + render_err=$(cat "$render_tmp" 2>/dev/null || true) + rm -f "$render_tmp" + client_job_name=$(echo "$render_err" \ + | sed -n 's/^RENDERED_JOB_NAME=//p' | tail -1) + if [[ -z "$client_job_name" ]]; then + echo "[Phase 4][${pair_tag}] rail=${rail} render-failed (client)" >&2 + _phase4_record_rail_result "$node_a" "$rail" "$round_idx" "" \ + "render-failed" "$node_b" + _phase4_record_rail_result "$node_b" "$rail" "$round_idx" "" \ + "render-failed" "$node_a" + kubectl delete job "$server_job_name" --ignore-not-found=true \ + --wait=false >/dev/null 2>&1 || true + continue + fi + printf '%s' "$client_rendered" | _phase4_apply_with_backoff + client_apply_rc=$? + if [[ "$client_apply_rc" -ne 0 ]]; then + local rsn="job-creation-failed" + [[ "$client_apply_rc" -eq 2 ]] && rsn="api-throttled" + echo "[Phase 4][${pair_tag}] rail=${rail} client apply failed rc=${client_apply_rc}" >&2 + _phase4_record_rail_result "$node_a" "$rail" "$round_idx" "" "$rsn" "$node_b" + _phase4_record_rail_result "$node_b" "$rail" "$round_idx" "" "$rsn" "$node_a" + kubectl delete job "$server_job_name" --ignore-not-found=true \ + --wait=false >/dev/null 2>&1 || true + continue + fi + + # --- Step 4: wait for client Job terminal -------------------- + local client_terminal_rc + _phase4_wait_for_job_terminal "$client_job_name" \ + "$PHASE4_PAIR_WAIT_TIME" + client_terminal_rc=$? + + # --- Step 5: classify + parse -------------------------------- + local bw_value="" reason="" + case "$client_terminal_rc" in + 0) + # Complete=True -- parse BW from client log. + bw_value=$(_phase4_parse_bw_average "$client_job_name") + if [[ -z "$bw_value" ]]; then + reason="parse-failed" + else + # Float compare via awk -- bash arithmetic does + # not handle "388.42" >= "380". awk emits 1 if + # the rail meets threshold, 0 otherwise. + local meets + meets=$(awk -v v="$bw_value" -v t="$PHASE4_BW_THRESHOLD" \ + 'BEGIN { print (v + 0 >= t + 0) ? 1 : 0 }') + if [[ "$meets" != "1" ]]; then + reason="below-threshold:${bw_value}" + fi + fi + ;; + 1) + # Failed=True -- client crashed (ib_write_bw non-zero exit + # or the in-pod "no BW line" exit). Still try to parse; + # if a BW line exists, prefer parse over crash reason. + bw_value=$(_phase4_parse_bw_average "$client_job_name") + if [[ -z "$bw_value" ]]; then + reason="ib-write-bw-crashed" + else + local meets + meets=$(awk -v v="$bw_value" -v t="$PHASE4_BW_THRESHOLD" \ + 'BEGIN { print (v + 0 >= t + 0) ? 1 : 0 }') + if [[ "$meets" != "1" ]]; then + reason="below-threshold:${bw_value}" + fi + fi + ;; + 2) + # Timeout -- typically peer-pod-unready on the client + # side (NAD attach stuck, pod pending). Use the same + # reason as the server pod-IP-wait timeout. + reason="peer-pod-unready" + ;; + esac + + # Persist both endpoints. A failing measurement fails BOTH + # nodes for that (rail, round) tuple. Same BW value for + # both endpoints -- they measured the same point-to-point + # flow. With full mesh, the per-rail pass/fail + # at the NODE level becomes "ALL rounds for that rail + # passed" (see aggregation block). + _phase4_record_rail_result "$node_a" "$rail" "$round_idx" \ + "$bw_value" "$reason" "$node_b" + _phase4_record_rail_result "$node_b" "$rail" "$round_idx" \ + "$bw_value" "$reason" "$node_a" + + echo "[Phase 4][${pair_tag}] rail=${rail} DONE bw=${bw_value:-} reason=${reason:-}" + + # --- Step 6: cleanup ----------------------------------------- + # Explicit delete -- ttlSecondsAfterFinished (300) GCs Jobs + # left over from cleanly-finished cases, but we delete now + # so the next rail's NAD/NIC is freed immediately. + kubectl delete job "$server_job_name" --ignore-not-found=true \ + --wait=false >/dev/null 2>&1 || true + kubectl delete job "$client_job_name" --ignore-not-found=true \ + --wait=false >/dev/null 2>&1 || true + done + + echo "[Phase 4][${pair_tag}] pair_runner done" + } + + # ---- Bounded-parallel, round-by-round pair_runner dispatch ----------- + # We iterate rounds sequentially. Within a round, pairs + # are node-disjoint by construction (circle algorithm guarantee), + # so they can all run in parallel without contention; we still cap + # parallelism at PHASE4_MAX_CONCURRENT_PAIRS to keep the k8s API + # server happy on large pools (design §2 + test case + # "concurrency-cap-honored"). We poll the live background-job count + # (`jobs -rp`) before forking the next pair within a round; this is + # cheaper and more portable than `wait -n` (some bash builds). + # Defensive: a misconfigured "0" cap would deadlock the slot-wait + # loop; promote to 1 (serial) with a logged warning. + if [[ "$PHASE4_MAX_CONCURRENT_PAIRS" -lt 1 ]]; then + echo "[Phase 4] WARN: PHASE4_MAX_CONCURRENT_PAIRS=${PHASE4_MAX_CONCURRENT_PAIRS} invalid -- promoting to 1 (serial)" >&2 + PHASE4_MAX_CONCURRENT_PAIRS=1 + fi + echo "[Phase 4] dispatching pair_runners (cap=${PHASE4_MAX_CONCURRENT_PAIRS}, rounds=${#phase4_rounds[@]})" + local pair_idx=0 # global, monotonic across rounds (for log scrape) + local round_idx=0 + for round_entry in "${phase4_rounds[@]}"; do + # Skip empty rounds (defensive -- the scheduler never emits + # one with the current circle algorithm but the cost is nil). + if [[ -z "$round_entry" ]]; then + round_idx=$((round_idx + 1)) + continue + fi + local round_pair_pids="" + local round_pair_count=0 + echo "[Phase 4] round ${round_idx} START: ${round_entry}" + # shellcheck disable=SC2206 + local -a round_pair_array=($round_entry) + local rp_entry + for rp_entry in "${round_pair_array[@]}"; do + # Wait until we have a free slot inside this round. + while true; do + local running + running=$(jobs -rp 2>/dev/null | wc -l) + if [[ "$running" -lt "$PHASE4_MAX_CONCURRENT_PAIRS" ]]; then + break + fi + sleep 2 + done + local na="${rp_entry%,*}" + local nb="${rp_entry#*,}" + echo "[Phase 4] forking pair #${pair_idx} (round=${round_idx}): ${na} <-> ${nb}" + pair_runner "$na" "$nb" "$round_idx" & + round_pair_pids="$round_pair_pids $!" + round_pair_count=$((round_pair_count + 1)) + pair_idx=$((pair_idx + 1)) + done + # Wait for THIS round to finish before starting the next so + # the disjointness guarantee holds at the dispatch boundary + # too (a node from round R+1 cannot start while round R is + # still using its NIC). + echo "[Phase 4] round ${round_idx} waiting for ${round_pair_count} pair_runner(s)" + local rpid + for rpid in $round_pair_pids; do + wait "$rpid" 2>/dev/null || true + done + echo "[Phase 4] round ${round_idx} DONE" + round_idx=$((round_idx + 1)) + done + echo "[Phase 4] all rounds complete (total pairs forked=${pair_idx})" + + # ---- Aggregation: per-node pass/fail + annotations ------------------- + # A node passes iff EVERY (rail, round) measurement it + # participated in produced a BW value >= PHASE4_BW_THRESHOLD with + # no recorded reason. A rail with at least one failing round goes + # into failed-rails; the full list of failing (peer, rail, round) + # measurements goes into the triangulation annotation so the + # operator can identify which rail / which peer / which round + # produced the fault (A-B / A-C / B-C reasoning). + # + # Per-(rail, round) BW annotation is always written when a value + # exists; the value carries both the BW and the peer hostname so a + # single annotation key encodes the full measurement record: + # amd.com/rail-bandwidth-rail--round-=/peer= + # Failed measurements without a parseable BW are still emitted as + # an annotation, with the reason in place of the BW: + # amd.com/rail-bandwidth-rail--round-=/peer= + local phase4_pass_count=0 + local phase4_fail_count=0 + local phase4_total_rounds=${#phase4_rounds[@]} + for node in "${phase4_sorted[@]}"; do + # Skip the unpaired (N=1) node -- already labeled above. + if [[ "$node" == "$phase4_unpaired_node" ]]; then + continue + fi + local node_dir="${PHASE4_STATE_DIR}/results/${node}" + local failed_rails="" + local triangulation="" + local node_failed="false" + + local rail + for (( rail=0; rail < PHASE4_RAIL_COUNT; rail++ )); do + local rail_has_failure="false" + local rail_had_any_measurement="false" + local round_i + for (( round_i=0; round_i < phase4_total_rounds; round_i++ )); do + local base="${node_dir}/rail-${rail}-round-${round_i}" + local bw_file="${base}" + local reason_file="${base}.reason" + local peer_file="${base}.peer" + # If this node sat out this round (odd-N bye, or the + # round simply did not include it), no files will + # exist -- skip silently. Bye rounds are expected and + # are NOT counted as a failure. + if [[ ! -f "$bw_file" && ! -f "$reason_file" && ! -f "$peer_file" ]]; then + continue + fi + rail_had_any_measurement="true" + local rail_bw="" rail_reason="" rail_peer="" + [[ -f "$bw_file" ]] && rail_bw=$(cat "$bw_file" 2>/dev/null || echo "") + [[ -f "$reason_file" ]] && rail_reason=$(cat "$reason_file" 2>/dev/null || echo "") + [[ -f "$peer_file" ]] && rail_peer=$(cat "$peer_file" 2>/dev/null || echo "") + + # Per-(rail, round) annotation: always carries the peer + # so a reader can reconstruct the mesh without consulting + # external state. Value form is "/peer=". + local ann_value="" + if [[ -n "$rail_bw" ]]; then + ann_value="$rail_bw" + elif [[ -n "$rail_reason" ]]; then + ann_value="$rail_reason" + else + ann_value="no-measurement" + fi + if [[ -n "$rail_peer" ]]; then + ann_value="${ann_value}/peer=${rail_peer}" + fi + annotate_phase_value "$node" "$PHASE4_LABEL_KEY" \ + "rail-${rail}-round-${round_i}" "$ann_value" \ + || echo "[Phase 4] WARN: annotate rail-${rail}-round-${round_i} failed for $node" >&2 + + # Classify this round's measurement. + local round_failed="false" + if [[ -n "$rail_reason" ]]; then + round_failed="true" + elif [[ -z "$rail_bw" ]]; then + # Files exist but BW empty and no reason: treat as + # missing measurement (driver bug / pair_runner + # aborted mid-record). Conservative: count as + # failure so the operator notices. + round_failed="true" + fi + if [[ "$round_failed" == "true" ]]; then + rail_has_failure="true" + local tri_entry="peer=${rail_peer:-}/rail=${rail}/round=${round_i}" + if [[ -z "$triangulation" ]]; then + triangulation="$tri_entry" + else + triangulation="${triangulation},${tri_entry}" + fi + fi + done + + # Per-rail roll-up. A rail with no measurement at all (no + # round ever touched it -- happens with PHASE4_RAIL_COUNT + # higher than what the schedule covered, or a catastrophic + # state-dir failure) is treated as a failure conservatively. + if [[ "$rail_had_any_measurement" == "false" ]]; then + rail_has_failure="true" + fi + if [[ "$rail_has_failure" == "true" ]]; then + node_failed="true" + if [[ -z "$failed_rails" ]]; then + failed_rails="$rail" + else + failed_rails="${failed_rails},${rail}" + fi + fi + done + + if [[ "$node_failed" == "true" ]]; then + local reason="failed-rails:${failed_rails}" + label_phase_failed "$node" "$PHASE4_LABEL_KEY" "$reason" \ + || echo "[Phase 4] WARN: label_phase_failed failed for $node" >&2 + annotate_phase_value "$node" "$PHASE4_LABEL_KEY" \ + "failed-rails" "$failed_rails" \ + || echo "[Phase 4] WARN: annotate failed-rails failed for $node" >&2 + # Triangulation summary annotation: every failing + # (peer, rail, round) tuple this node was party to. + # Operators cross-reference this across nodes to localize + # whether a rail fault is the node's, the peer's, or the + # link between them (A-B vs A-C vs B-C disambiguation). + if [[ -n "$triangulation" ]]; then + annotate_phase_value "$node" "$PHASE4_LABEL_KEY" \ + "triangulation" "$triangulation" \ + || echo "[Phase 4] WARN: annotate triangulation failed for $node" >&2 + fi + phase4_fail_count=$((phase4_fail_count + 1)) + else + label_phase_passed "$node" "$PHASE4_LABEL_KEY" \ + || echo "[Phase 4] WARN: label_phase_passed failed for $node" >&2 + phase4_pass_count=$((phase4_pass_count + 1)) + fi + done + + # ---- State dir cleanup ------------------------------------------------ + # Best-effort -- on failure the path is logged so an operator can + # inspect manually. The CronJob filesystem is ephemeral anyway. + rm -rf "$PHASE4_STATE_DIR" 2>/dev/null \ + || echo "[Phase 4] WARN: cleanup of $PHASE4_STATE_DIR failed" >&2 + + local unpaired_count=0 + [[ -n "$phase4_unpaired_node" ]] && unpaired_count=1 + echo "[Phase 4] PHASE4_DRIVER_SCRIPT done: pass=${phase4_pass_count} fail=${phase4_fail_count} unpaired=${unpaired_count}" + return 0 + + # === Phase 5 Script: multi-node RCCL via MPIJob === + # + # Sourced (not exec'd) by the multi-phase CronJob orchestrator's + # run_phase5 stub (cluster-validation-job.yaml -> _run_phase_generic + # 5 "Multi-node RCCL via MPIJob" PHASE5_SCRIPT "$1"). The orchestrator + # invokes: + # + # source /tmp/run-phase5.sh $phase45_passed + # + # i.e. the surviving Phase 4.5 node set is passed in as positional + # args ($@). The sourced script body forwards those args into + # `run_phase5_main "$@"` (see invocation at the foot of this + # ConfigMap key), so the per-phase work runs inside a function whose + # input pool is its own positional args -- no outer-scope reads. + # + # Contract: + # * Input : space-separated surviving Phase 4.5 nodes, taken + # from `run_phase5_main`'s own positional args ($@). + # The script body forwards the sourced positional args + # into the function so the existing orchestrator + # source-with-positional-args contract still works. + # * Output : amd.com/cluster-validation-status=passed|failed on + # every input_nodes member; on the failed path, each + # node also gets the reason annotation + # `amd.com/cluster-validation-status-` set to + # `worker-pod=,exit=`. After + # launcher-log collection, every input node has its + # worker pod logs saved to + # `${LOG_DIR}/worker-${node}-${new_job}.log`. Old + # MPIJobs (not the new one) untouched (orchestrator + # handles MPIJob cleanup and launcher log collection). + # * Side effect: a new MPIJob `cluster-validation-mpi-job-` + # is created (unless SKIP_RCCL_TEST=true). MPIJob + # worker-replicas = input_count (dynamic). + # + # ConfigMap variables consumed: + # SKIP_RCCL_TEST, CANDIDATE_LABEL, LAUNCHER_REPLICAS, + # LOG_STORE_NODE_NAME, SLOTS_PER_WORKER, GPU_PER_WORKER, + # PF_NIC_PER_WORKER, VF_NIC_PER_WORKER, PF_NIC_NAD_NAME, + # VF_NIC_NAD_NAME, ROCE_WORKLOAD_IMAGE, MPIJOB_WAIT_TIME, + # DEBUG_DELAY, LOG_DIR, PHASE5_LABEL_KEY. Note: WORKER_REPLICAS + # from `config.json` is NOT consumed by this script's body + # for the MPIJob render -- the actual worker count comes from + # `input_count`. `WORKER_REPLICAS` retains its Phase 0 meaning + # as the maximum candidate-pool budget. SUCCESS_LABEL / + # FAILURE_LABEL (defined at the head of this ConfigMap) are NO + # LONGER consumed here either: routed phase-label + # writes through the helpers (`label_phase_passed` / + # `label_phase_failed`), which derive the key from + # `PHASE5_LABEL_KEY` and the value (`passed`/`failed`) internally. + # SUCCESS_LABEL / FAILURE_LABEL remain defined for the test harness + # (`tests/test_orchestrator_dry_run.sh`) and for any out-of-tree + # consumers; they are not deleted here. See Node Selection block + # for `WORKER_REPLICAS`'s budget role. + # The MPIJob manifest source is /mpi-configs/cluster-validation-mpijob-config.yaml + # (mounted from the cluster-validation-mpijob-config ConfigMap). + PHASE5_SCRIPT: | + #!/bin/bash + # Phase 5 -- multi-node RCCL via MPIJob. + # Sourced (not exec'd) by the orchestrator. Do NOT `set -e`; the + # surrounding orchestrator runs under `set -uo pipefail` and a + # failed phase must still let downstream cleanup + fail-loud exit + # run. + + # ---- run_phase5_main (parameterised input) -------------- + # The per-phase work is encapsulated in `run_phase5_main`. It takes + # the surviving Phase 4.5 node set as its own positional args + # (`run_phase5_main `) -- the body no longer + # reads any outer-scope `passed_nodes` variable. `input_count` is + # derived from the input set and drives `actual_worker_replicas`, + # which in turn renders the MPIJob's `worker-replicas` field. The + # `WORKER_REPLICAS` ConfigMap value is consequently a MAXIMUM + # (Phase 0 node-selection budget) rather than a target for this + # change's body. + # + # Subsequent Sub-tasks build on this contract: + # reads `input_count` to enforce PHASE5_MIN_WORKERS + # (see the guard further down); will + # have the orchestrator invoke `run_phase5_main "$phase45_passed"` + # directly after sourcing. Until lands, the script + # body's trailing `run_phase5_main "$@"` line forwards the sourced + # positional args so the existing + # `source "$script_path" $nodes` + # contract in cluster-validation-job.yaml (_run_phase_generic) + # continues to work unchanged. + run_phase5_main() { + local input_nodes="$*" + local input_count + input_count=$(echo "$input_nodes" | wc -w) + local actual_worker_replicas=$input_count + + echo -e "===Step 3: Submitting MPIJob===" + echo "[Phase 5] input_count=${input_count} (drives MPIJob worker-replicas;" \ + "WORKER_REPLICAS=${WORKER_REPLICAS} is the Phase 0 max)" + + # Check if RCCL test should be skipped + if [[ "${SKIP_RCCL_TEST,,}" == "true" ]]; then + echo "SKIP_RCCL_TEST is set to true. Skipping MPI Job RCCL test." + echo "==================================================================" + + # Pass-label every input node via the helper. + # Candidate label is preserved (eligibility gate) -- Phase 0's + # timestamp/interval filter handles re-run spacing. + echo -e "===Labeling nodes (RCCL test skipped)===" + for n in $input_nodes; do + label_phase_passed "$n" "$PHASE5_LABEL_KEY" \ + || echo "[Phase 5] WARN: label_phase_passed failed for $n" >&2 + done + echo "[CronJob Result: ${PHASE5_LABEL_KEY}=passed] Cluster Validation Status updated on Candidate Nodes (RCCL test skipped)" + echo "==================================================================" + echo "[CronJob Completed]" + # Pre-refactor code had `exit 0` here (it was inlined into the + # CronJob container args, so `exit 0` ended the whole pod). + # The script is now `source`d by the orchestrator; using `exit` + # would terminate the orchestrator before cleanup_old_mpijobs + + # collect_launcher_logs + the fail-loud accounting can run. + # `return 0` is the minimal source-vs-exec adjustment required + # to preserve the orchestrator's overall behavior; no Phase 5 + # semantic change. + return 0 + fi + + # ---- PHASE5_MIN_WORKERS guard ------- + # RCCL collectives (all_reduce_perf, broadcast_perf, + # reduce_scatter_perf) are multi-node by definition; a one-worker + # MPIJob is degenerate. If fewer than PHASE5_MIN_WORKERS nodes + # survived earlier phases, skip submitting the MPIJob and skip + # writing PHASE5 labels (no `passed`/`failed` verdict can be + # justified). The orchestrator's downstream cleanup_old_mpijobs + + # fail-loud accounting still runs because we `return` (not + # `exit`). Default minimum is 2; operators can override to 1 via + # the PHASE5_MIN_WORKERS ConfigMap key for opt-in single-node + # plumbing validation. Empty input set (input_count=0) also takes + # this branch -- matches design §6 "Empty input set" handling. + # See design §4 ("Phase 5 ConfigMap variables") and §6 + # ("Below PHASE5_MIN_WORKERS"). + if [ "$input_count" -lt "$PHASE5_MIN_WORKERS" ]; then + echo "[Phase 5] only ${input_count} node(s) survived prior phases;" \ + "minimum is ${PHASE5_MIN_WORKERS} -- skipping MPIJob (no label changes)" + echo "==================================================================" + echo "[CronJob Completed]" + return 0 + fi + + ts=$(date +%Y%m%d-%H%M) + new_job="cluster-validation-mpi-job-${ts}" + + # Generate PF NAD list (for PF) + PF_NAD_LIST=$(printf "${PF_NIC_NAD_NAME},%.0s" $(seq 1 $PF_NIC_PER_WORKER)) + PF_NAD_LIST=${PF_NAD_LIST%,} # remove trailing comma + # Generate VF NAD list (for VF) + VF_NAD_LIST=$(printf "${VF_NIC_NAD_NAME},%.0s" $(seq 1 $VF_NIC_PER_WORKER)) + VF_NAD_LIST=${VF_NAD_LIST%,} # remove trailing comma + + if [ "$PF_NIC_PER_WORKER" -gt 0 ]; then + NAD_ANNOTATION="$PF_NAD_LIST" + elif [ "$VF_NIC_PER_WORKER" -gt 0 ]; then + NAD_ANNOTATION="$VF_NAD_LIST" + else + NAD_ANNOTATION="" + fi + + sed "s/^ name: cluster-validation-mpi-job/ name: ${new_job}/; \ + s|\$\$WORKER_REPLICAS|${actual_worker_replicas}|g; \ + s|\$\$LAUNCHER_REPLICAS|${LAUNCHER_REPLICAS}|g; \ + s|\$\$LOG_STORE_NODE_NAME|${LOG_STORE_NODE_NAME}|g; \ + s|\$\$SLOTS_PER_WORKER|${SLOTS_PER_WORKER}|g; \ + s|\$\$GPU_PER_WORKER|${GPU_PER_WORKER}|g; \ + s|\$\$PF_NIC_PER_WORKER|${PF_NIC_PER_WORKER}|g; \ + s|\$\$VF_NIC_PER_WORKER|${VF_NIC_PER_WORKER}|g; \ + s|\$\$NAD_ANNOTATION|${NAD_ANNOTATION}|g; \ + s|\$\$ROCE_WORKLOAD_IMAGE|${ROCE_WORKLOAD_IMAGE}|g" \ + /mpi-configs/cluster-validation-mpijob-config.yaml | kubectl apply -f - + echo "[MPIJob: Submitted for $actual_worker_replicas worker node(s)]" + echo "==================================================================" + + # per-node worker-pod lookup. Used twice below: + # (1) in the failed-MPIJob labelling loop to read the per-worker + # terminated.exitCode for the `worker-pod=.,exit=.` + # reason annotation, and + # (2) in the per-worker log-dump loop after launcher-log + # collection. + # Emits the worker pod name on stdout when found, empty string + # otherwise. Never fails -- the `|| true` swallows the + # `kubectl get pods` non-zero on missing pods so callers can + # branch on empty output without `set -e` surprises. + _phase5_worker_pod_for_node() { + local _node="$1" + kubectl get pods \ + -l "training.kubeflow.org/job-name=${new_job},training.kubeflow.org/replica-type=worker" \ + --field-selector "spec.nodeName=${_node}" \ + -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true + } + + echo -e "===Step 4: Waiting for MPIJob completion===" + # `job_status` (passed|failed) drives which helper is called + # per node below. The `label_phase_passed` / `label_phase_failed` + # helpers own the label key (`PHASE5_LABEL_KEY` = + # `amd.com/cluster-validation-status`) and value (`passed` / + # `failed`) directly. The per-node `` arg passed to + # `label_phase_failed` is derived per-node from the failing + # worker pod's name + exit code (see Step 5 below). + # + # Poll for either Succeeded or Failed condition. Watching only + # the success edge would force the orchestrator to sit through + # the full MPIJOB_WAIT_TIME on every Failed MPIJob, blocking + # subsequent cron ticks via concurrencyPolicy: Forbid. + job_status=timeout + end=$(( $(date +%s) + MPIJOB_WAIT_TIME )) + while (( $(date +%s) < end )); do + succ=$(kubectl get mpijob "$new_job" \ + -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}' 2>/dev/null) + fail=$(kubectl get mpijob "$new_job" \ + -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null) + if [[ "$succ" == "True" ]]; then job_status=passed; break; fi + if [[ "$fail" == "True" ]]; then job_status=failed; break; fi + sleep 5 + done + case "$job_status" in + passed) + echo "MPIJob $new_job succeeded ✅" + echo "[MPIJob Result: Passed]" + ;; + failed) + echo "MPIJob $new_job failed ❌" + echo "[MPIJob Result: Failed]" + sleep ${DEBUG_DELAY} + ;; + timeout) + echo "MPIJob $new_job did not reach terminal state within ${MPIJOB_WAIT_TIME}s -- treating as failed" + echo "[MPIJob Result: Failed]" + job_status=failed + sleep ${DEBUG_DELAY} + ;; + esac + echo "==================================================================" + + echo -e "===Step 5: Labeling nodes based on MPIJob result===" + # phase-label writes go through the + # helpers (not raw `kubectl label`) so the per-phase scripts + # all converge on the same write path. Branch on `job_status` + # to pick `label_phase_passed` vs `label_phase_failed`. + # + # on the failed path, derive the per-node + # `` arg from the worker pod that ran on that node + # (`worker-pod=,exit=`). If pod lookup or + # exit-code lookup fails, fall back to `unknown` for that + # field so operators still see a deterministic annotation + # rather than an empty one. The helper truncates + # the reason to PHASE_ANNOTATION_VALUE_MAX_BYTES. + for n in $input_nodes; do + if [ "$job_status" = passed ]; then + label_phase_passed "$n" "$PHASE5_LABEL_KEY" \ + || echo "[Phase 5] WARN: label_phase_passed failed for $n" >&2 + else + worker_pod=$(_phase5_worker_pod_for_node "$n") + if [ -n "$worker_pod" ]; then + exit_code=$(kubectl get pod "$worker_pod" \ + -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}' \ + 2>/dev/null) + [ -z "$exit_code" ] && exit_code="unknown" + else + worker_pod="unknown" + exit_code="unknown" + fi + phase5_fail_reason="worker-pod=${worker_pod},exit=${exit_code}" + label_phase_failed "$n" "$PHASE5_LABEL_KEY" "$phase5_fail_reason" \ + || echo "[Phase 5] WARN: label_phase_failed failed for $n" >&2 + fi + done + # Candidate label intentionally preserved -- Phase 0's + # timestamp/interval gate (NODE_VALIDATION_INTERVAL_MINS) + # handles re-run spacing. + echo "[CronJob Result: ${PHASE5_LABEL_KEY}=${job_status}] Cluster Validation Status updated on Candidate Nodes" + echo "==================================================================" + + echo -e "===Retrieving launcher pod logs===" + LAUNCHER_POD=$(kubectl get pods -l "training.kubeflow.org/job-name=${new_job},training.kubeflow.org/replica-type=launcher" \ + -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true) + if [ -n "$LAUNCHER_POD" ]; then + LAUNCHER_LOG_FILE="${LOG_DIR}/launcher-${new_job}.log" + echo "Saving launcher pod logs ($LAUNCHER_POD) to $LAUNCHER_LOG_FILE" + kubectl logs "$LAUNCHER_POD" --all-containers=true > "$LAUNCHER_LOG_FILE" 2>&1 || \ + echo "Warning: Failed to retrieve launcher pod logs" + else + echo "Warning: Could not find launcher pod for MPIJob $new_job" + fi + echo "==================================================================" + + # per-worker log dump. For each input node, find + # its worker pod and save its logs to + # `${LOG_DIR}/worker-${node}-${new_job}.log`. Runs on both + # pass and fail paths: operators correlate per-node + # annotations with per-node logs regardless of outcome. If a + # worker pod is missing (race against cleanup, scheduler + # eviction, etc.), warn but continue -- one missing log file + # must not break the loop for the other nodes. + echo -e "===Retrieving per-worker pod logs===" + for n in $input_nodes; do + worker_pod=$(_phase5_worker_pod_for_node "$n") + if [ -n "$worker_pod" ]; then + WORKER_LOG_FILE="${LOG_DIR}/worker-${n}-${new_job}.log" + echo "Saving worker pod logs ($worker_pod on $n) to $WORKER_LOG_FILE" + kubectl logs "$worker_pod" --all-containers=true > "$WORKER_LOG_FILE" 2>&1 || \ + echo "Warning: Failed to retrieve worker pod logs for $n ($worker_pod)" + else + echo "Warning: Could not find worker pod for node $n in MPIJob $new_job" + fi + done + echo "==================================================================" + + # Explicit MPIJob delete so leftover worker pods do not hold + # amd.com/gpu / amd.com/nic reservations across cron ticks + # (which would block Phase 0 selection via the busy-pool check). + # Per-worker + launcher logs were captured above before this. + echo "===Deleting MPIJob $new_job===" + kubectl delete mpijob "$new_job" --ignore-not-found --wait=false \ + || echo "[Phase 5] WARN: failed to delete mpijob $new_job" >&2 + echo "==================================================================" + return 0 + } + + # Source-time: define run_phase5_main only. The orchestrator + # (cluster-validation-job.yaml `run_phase5`) invokes it + # explicitly with the input node list after sourcing. diff --git a/example/gpu-validation-cluster/configs/cluster-validation-job.yaml b/example/gpu-validation-cluster/configs/cluster-validation-job.yaml index 6d86c4198..19e138876 100644 --- a/example/gpu-validation-cluster/configs/cluster-validation-job.yaml +++ b/example/gpu-validation-cluster/configs/cluster-validation-job.yaml @@ -16,25 +16,37 @@ data: runPolicy: cleanPodPolicy: All backoffLimit: 0 - ttlSecondsAfterFinished: 20 # <-update before deploy + ttlSecondsAfterFinished: 20 # <-update before deploy mpiReplicaSpecs: Launcher: - replicas: $$LAUNCHER_REPLICAS # Value substituted at runtime from LAUNCHER_REPLICAS + replicas: $$LAUNCHER_REPLICAS # Value substituted at runtime from LAUNCHER_REPLICAS template: spec: # If nodeName is set, it will direct the launcher pod to run on the specified node for easier log access. Otherwise, it will be scheduled on any of the candidate nodes. nodeName: $$LOG_STORE_NODE_NAME # Only schedule on nodes labeled as candidate nodeSelector: - amd.com/cluster-validation-candidate: "true" # Must match CANDIDATE_LABEL - serviceAccountName: cluster-validation-sa # Must have permission to create MPIJobs + amd.com/cluster-validation-candidate: "true" # Must match CANDIDATE_LABEL + serviceAccountName: cluster-validation-sa # Must have permission to create MPIJobs restartPolicy: Never volumes: - name: shared emptyDir: {} + # Node-local log archive so launcher output persists past + # MPIJob cleanPodPolicy=All + ttlSecondsAfterFinished=20. + # Same hostPath pattern as Phase 2/3/4 -- writes show up + # as /var/log/cluster-validation/_phase5_launcher.log + # on the node pinned via nodeName=$$LOG_STORE_NODE_NAME. + - name: log-storage + hostPath: + path: /var/log/cluster-validation + type: DirectoryOrCreate initContainers: - name: wait-for-worker-pods - image: docker.io/rocm/network-operator-utils:v1.1.0 + # Needs kubectl on PATH (the preflight script runs + # kubectl get/wait/exec against the worker pods). + # patchable: cluster-validation-framework.images.preflight-init + image: docker.io/bitnamilegacy/kubectl:1.33.4 imagePullPolicy: IfNotPresent envFrom: - configMapRef: @@ -42,17 +54,27 @@ data: volumeMounts: - name: shared mountPath: /shared + - name: log-storage + mountPath: /var/log/cluster-validation command: ["/bin/bash", "-c"] args: - | - # Load wait-for-worker script from ConfigMap - echo "$WAIT_FOR_WORKERS_SCRIPT" > /shared/wait-for-worker.sh + # Persist preflight stdout+stderr to a host-mounted + # log file so failures here (SSH mesh / DNS / mpirun + # spawn / RCCL topology probe) survive pod GC. Same + # pattern as the launcher main container. + TS=$(date -u +%Y-%m-%dT%H-%M-%S.%6NZ) + HOST_LOG="/var/log/cluster-validation/${TS}_phase4_5_preflight.log" + exec &> >(tee -a "${HOST_LOG}") + + # Load Phase 4.5 pre-flight script from ConfigMap + echo "$PHASE45_PREFLIGHT_SCRIPT" > /shared/wait-for-worker.sh chmod +x /shared/wait-for-worker.sh /shared/wait-for-worker.sh containers: - name: rccl-launcher - image: $$RCCL_WORKLOAD_IMAGE + image: $$ROCE_WORKLOAD_IMAGE imagePullPolicy: Always envFrom: - configMapRef: @@ -60,10 +82,21 @@ data: volumeMounts: - name: shared mountPath: /shared - + - name: log-storage + mountPath: /var/log/cluster-validation + command: ["/bin/bash", "-c"] args: - | + # Persist full launcher stdout+stderr to a host-mounted + # log file so post-mortem is possible even after the + # MPIJob is GC'd (cleanPodPolicy=All + + # ttlSecondsAfterFinished=20). Pattern matches Phase + # 2/3/4 jobs that already use /var/log/cluster-validation. + TS=$(date -u +%Y-%m-%dT%H-%M-%S.%6NZ) + HOST_LOG="/var/log/cluster-validation/${TS}_phase5_launcher.log" + exec &> >(tee -a "${HOST_LOG}") + exec 2>&1 set -euxo pipefail @@ -87,7 +120,7 @@ data: echo "$VALIDATE_RCCL_TEST_SCRIPT" > /shared/validate-single-test.sh chmod +x /shared/validate-single-test.sh - # -- mpi run start --- + # -- mpi run start --- NP=$(( WORKER_REPLICAS * SLOTS_PER_WORKER )) echo "MPI NP = $NP" failed=0 @@ -98,14 +131,17 @@ data: echo "-------------------------------------------" echo "Running $test (threshold: $threshold)..." - ${OMPI_DIR}/bin/mpirun --np $NP \ + # --prefix tells mpirun to set PATH + + # LD_LIBRARY_PATH on remote ranks so SSH + # non-interactive shells can find orted. + ${OMPI_DIR}/bin/mpirun --prefix "${OMPI_DIR}" --np $NP \ -x PATH -x LD_LIBRARY_PATH -x LD_PRELOAD \ --allow-run-as-root --mca plm_rsh_agent "$RSH_AGENT" \ --mca btl ^vader,openib -mca btl_tcp_if_include $MCA_IF \ ${RCCL_ENV} \ $PERF_TEST_DIR/${test} -b ${START_MSG_SIZE} -e ${END_MSG_SIZE} \ -n ${ITER_COUNT} -w ${WARMUP_ITER_COUNT} -c ${CHECK_ITER_COUNT} \ - -f ${STEP_FACTOR} -g ${THREADS_PER_GPU} | tee /shared/${test}.log + -f ${STEP_FACTOR} -g ${THREADS_PER_GPU} | tee /shared/${test}.log echo "Validating $test result..." if ! /shared/validate-single-test.sh "$test" "$threshold"; then @@ -117,7 +153,7 @@ data: echo "All RCCL test runs done" if [ "$failed" -ne 0 ]; then - echo "Validation FAILED for one or more tests ❌" + echo "Validation FAILED for one or more tests ❌" echo "Sleeping ${DEBUG_DELAY} secs before exiting to debug failure" sleep ${DEBUG_DELAY} echo "Launcher exiting with failure." @@ -131,7 +167,7 @@ data: echo "Launcher exiting with success" Worker: - replicas: $$WORKER_REPLICAS # Dynamically set based on number of passed nodes + replicas: $$WORKER_REPLICAS # Dynamically set based on number of passed nodes template: metadata: annotations: @@ -141,11 +177,11 @@ data: spec: # Only schedule on nodes labeled as candidate nodeSelector: - amd.com/cluster-validation-candidate: "true" # Must match CANDIDATE_LABEL + amd.com/cluster-validation-candidate: "true" # Must match CANDIDATE_LABEL restartPolicy: Never containers: - name: rccl-test-worker - image: $$RCCL_WORKLOAD_IMAGE + image: $$ROCE_WORKLOAD_IMAGE imagePullPolicy: Always securityContext: capabilities: @@ -197,9 +233,17 @@ data: nodeSelector: kubernetes.io/hostname: $$NODE volumes: - - name: config-volume # Config map volume + # Phase-1 multi-stage refactor: the config-volume now + # points at a per-stage, per-node ConfigMap created on the fly + # by PHASE1_SCRIPT (named cvf-phase1---). + # Each per-stage CM holds a single-recipe GPU_VALIDATION_TESTS_JSON + # so the test-runner CLI -- which only ever executes the first + # TestCases[] entry -- runs exactly the one recipe for this stage. + # The orchestrator deletes the per-stage CM after the stage's + # Job is parsed. + - name: config-volume # Per-stage single-recipe config map configMap: - name: cluster-validation-config + name: $$PHASE1_CONFIG_MAP - hostPath: # Specify to use this directory on the host as volume path: /var/log/cluster-validation type: DirectoryOrCreate @@ -244,6 +288,658 @@ data: backoffLimit: 0 ttlSecondsAfterFinished: 300 # TTL for the job to be auto cleaned up after finishing --- +# ========================================================================= +# Phase 2 per-node Job template ConfigMap. +# See the phase design notes +# section 4 -> "Code Path: Phase 2 Job template ConfigMap". +# +# Holds a batch/v1.Job template for the Phase 2 single-node 8-GPU RCCL +# all_reduce_perf test. One Job per node (no MPIJob -- mpirun runs with +# --host localhost). The template is sed-rendered by $PHASE2_SCRIPT +# the only substitution placeholder is $$NODE (target node hostname). +# +# Image, env vars (PHASE2_*, PHASE2_RCCL_ENV_VARS), perf-test path, BW +# threshold, and the VALIDATE_RCCL_TEST_SCRIPT body all come from the +# cluster-validation-config ConfigMap via envFrom -- so threshold tuning +# does not require touching this template. +# ========================================================================= +apiVersion: v1 +kind: ConfigMap +metadata: + name: cluster-validation-phase2-job-config +data: + cluster-validation-phase2-job-config.yaml: | + apiVersion: batch/v1 + kind: Job + metadata: + # Job name is per-node; PHASE2_SCRIPT renders the final unique + # name (e.g. cluster-validation-phase2-job-) before apply. + name: cluster-validation-phase2-job + labels: + amd.com/cluster-validation-created: "true" + spec: + template: + spec: + serviceAccountName: cluster-validation-sa + # Pin the Job to the target node. PHASE2_SCRIPT substitutes + # $$NODE with the node hostname (one Job per Phase-1-passed + # node). amd.com/gpu: 8 below makes this a sole-tenant Job + # for the node's GPUs. + nodeSelector: + kubernetes.io/hostname: $$NODE + restartPolicy: Never + volumes: + # Scratch space for rendered env scripts and the + # validate-single-test.sh body (sourced from + # VALIDATE_RCCL_TEST_SCRIPT in cluster-validation-config). + - name: shared + emptyDir: {} + # Memory-backed /dev/shm for RCCL inter-rank shmem. + # Default container /dev/shm is 64Mi tmpfs which is too + # small for 8-rank RCCL all_reduce — NCCL aborts with + # "No space left on device" creating shmem segments + # (~10MiB per rank pair). Mount a sized Memory emptyDir + # at /dev/shm to give RCCL room. 16Gi is a comfortable + # upper bound (consumes node RAM, not GPU HBM). + - name: dshm + emptyDir: + medium: Memory + sizeLimit: 16Gi + # Node-local log archive. Persists across pod GC so + # operators can debug failures after the fact. Matches + # the pattern used by Phase 1 test-runner + Phase 5 + # launcher. Filename includes a UTC timestamp prefix. + - name: log-storage + hostPath: + path: /var/log/cluster-validation + type: DirectoryOrCreate + containers: + - name: phase2-rccl + # Same image used by Phase 5 launcher/worker (see + # cluster-validation-mpijob-config above). Substituted at + # deploy time from ROCE_WORKLOAD_IMAGE. + image: $$ROCE_WORKLOAD_IMAGE + imagePullPolicy: Always + envFrom: + - configMapRef: + name: cluster-validation-config + # Single-GPU-per-rank, 8 ranks, all on this one node -> + # request all 8 GPUs. Hardcoded (not $$GPU_PER_WORKER) per + # design doc section 4: Phase 2 is intentionally an 8-GPU + # sole-tenant single-node stress of the xGMI mesh. + resources: + requests: + amd.com/gpu: 8 + limits: + amd.com/gpu: 8 + volumeMounts: + - name: shared + mountPath: /shared + - name: dshm + mountPath: /dev/shm + - name: log-storage + mountPath: /var/log/cluster-validation + env: + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + command: ["/bin/bash", "-c"] + args: + - | + exec 2>&1 + set -uo pipefail + + # Node-local persistent log target. PHASE2_SCRIPT cleans + # up the pod but this file survives on the host for + # post-mortem (matches Phase 1 / 5 pattern). + TS=$(date -u +%Y-%m-%dT%H-%M-%S.%6NZ) + HOST_LOG="/var/log/cluster-validation/${TS}_phase2_intranode-rccl.log" + + # --- prepare phase2 rccl env --- + # PHASE2_RCCL_ENV_VARS is a stripped variant of the + # Phase 5 RCCL_ENV_VARS: drops NCCL_NET_PLUGIN and + # IB-related tunables (no fabric -- single node). + echo "$PHASE2_RCCL_ENV_VARS" > /shared/phase2_rccl_env.sh + chmod +x /shared/phase2_rccl_env.sh + # shellcheck disable=SC1091 + source /shared/phase2_rccl_env.sh + + # --- prepare validation script --- + # Reuse the same Avg-bus-bandwidth parser as Phase 5 + # (design doc section 2). Sourced from + # VALIDATE_RCCL_TEST_SCRIPT in cluster-validation-config. + echo "$VALIDATE_RCCL_TEST_SCRIPT" > /shared/validate-single-test.sh + chmod +x /shared/validate-single-test.sh + + # --- run all_reduce_perf across all 8 local GPUs --- + # --host localhost: no SSH, no MPIJob, no fabric. + # --mca btl ^vader,openib: same exclude list as Phase 5 + # launcher (avoid the vader shared-mem BTL and the + # OpenIB BTL; we only want TCP/self for OMPI control). + # PHASE2_* sizing knobs all come from + # cluster-validation-config envFrom above. + # --host localhost:$GPU_PER_WORKER declares N slots on + # this node so mpirun can launch N ranks (one per GPU). + # Without the ":N" Open MPI defaults to 1 slot and + # aborts with "not enough slots available". + # GPU_PER_WORKER comes from cluster-validation-config + # envFrom above (templated from config.json at deploy + # time). + mpirun --np $GPU_PER_WORKER --host localhost:$GPU_PER_WORKER --allow-run-as-root \ + --mca btl ^vader,openib \ + $PERF_TEST_DIR/all_reduce_perf \ + -b $PHASE2_START_MSG_SIZE -e $PHASE2_END_MSG_SIZE \ + -f $PHASE2_STEP_FACTOR -g 1 \ + -n $PHASE2_ITER_COUNT -w $PHASE2_WARMUP_ITER_COUNT \ + 2>&1 | tee /shared/phase2_intranode_all_reduce.log "${HOST_LOG}" + mpirun_rc=${PIPESTATUS[0]} + + if [ "$mpirun_rc" -ne 0 ]; then + # mpirun crash / RCCL init failure / xGMI init + # failure. PHASE2_SCRIPT (driver) inspects the log + # tail and writes failed-reason=rccl-crash via node + # annotation. Exit non-zero so the Job is marked + # Failed and PHASE2_SCRIPT can detect it. + echo "phase2 mpirun exited $mpirun_rc -- RCCL test crashed" + exit "$mpirun_rc" + fi + + # --- threshold check --- + # validate-single-test.sh reads /shared/.log, + # extracts the Avg bus bandwidth line, and exits + # non-zero if below the supplied threshold. The Job + # exit code is what PHASE2_SCRIPT uses to classify + # pass vs bus-bw-below-threshold (see design section + # 4 -> Code Path: PHASE2_SCRIPT). + if ! /shared/validate-single-test.sh \ + "phase2_intranode_all_reduce" "$PHASE2_BW_THRESHOLD"; then + echo "phase2 bandwidth below threshold ($PHASE2_BW_THRESHOLD GB/s)" + exit 1 + fi + + echo "phase2 all_reduce_perf PASSED (>= $PHASE2_BW_THRESHOLD GB/s)" + # No retries: a Phase 2 failure is a hard hardware/topology + # signal -- PHASE2_SCRIPT handles label/annotation and the next + # CronJob tick re-tests if appropriate. + backoffLimit: 0 + # TTL keeps completed Jobs around briefly for log capture by + # PHASE2_SCRIPT, then GC. + ttlSecondsAfterFinished: 300 +--- +# ========================================================================= +# Phase 3 per-node NIC health Job template +# ConfigMap. See the phase design notes +# section 4 -> "Code Path: Phase 3 Job template". +# +# Holds a batch/v1.Job template for the per-node NIC health check +# (NIC count, ip link UP, rdma link ACTIVE, ibv_devinfo + GID table, +# plus driver/firmware compatibility). +# One Job per node that passed Phase 2 and carries the amd-nic=true +# label. The template is sed-rendered by $PHASE3_SCRIPT -- the +# substitution placeholders are $$NODE (target node hostname), +# $$EXPECTED_NIC_COUNT (expected NIC count, sourced from +# PHASE3_EXPECTED_NIC_COUNT in cluster-validation-config), and +# $$ROCE_WORKLOAD_IMAGE (rocm/roce-workload image tag, sourced from +# ROCE_WORKLOAD_IMAGE in cluster-validation-config -- same image used +# by Phases 2/4/5; ships nicctl for Check 5). +# +# Image (rocm/roce-workload, same as Phases 2/4/5), env vars +# (PHASE3_*), and the PHASE3_CHECK_SCRIPT body all come from the +# cluster-validation-config ConfigMap via envFrom -- so +# threshold/expected-count tuning does not require touching this +# template. +# +# The container runs privileged because the in-Job checks need +# rdma/ibv tooling access to NIC device files; the cluster-validation-sa +# already carries node label/annotate permissions (per +# cluster-validation-role) so PHASE3_CHECK_SCRIPT can self-label the +# node via in-pod kubectl. +# ========================================================================= +apiVersion: v1 +kind: ConfigMap +metadata: + name: cluster-validation-phase3-job-config +data: + cluster-validation-phase3-job-config.yaml: | + apiVersion: batch/v1 + kind: Job + metadata: + # Job name is per-node; PHASE3_SCRIPT renders the final unique + # name (e.g. cluster-validation-phase3-job-) before apply. + name: cluster-validation-phase3-job + labels: + amd.com/cluster-validation-created: "true" + spec: + template: + spec: + # cluster-validation-sa already has node label/annotate + # permissions per cluster-validation-role -- in-pod kubectl + # writes the amd.com/nic-health label and failure annotations + # directly from PHASE3_CHECK_SCRIPT. + serviceAccountName: cluster-validation-sa + # Pin the Job to the target node. PHASE3_SCRIPT substitutes + # $$NODE with the node hostname (one Job per Phase-2-passed + # node with amd-nic=true). amd.com/nic: $$EXPECTED_NIC_COUNT + # below makes this a sole-tenant Job for the node's NICs so + # the pod sees them all via PCIe device passthrough. + nodeSelector: + kubernetes.io/hostname: $$NODE + restartPolicy: Never + volumes: + # Node-local log archive; persists across pod GC. + - name: log-storage + hostPath: + path: /var/log/cluster-validation + type: DirectoryOrCreate + containers: + - name: nic-health + # roce-workload image (substituted at render time from + # ROCE_WORKLOAD_IMAGE in cluster-validation-config) -- + # same image Phases 2/4/5 use. Ships ibv_devinfo + rdma + + # ethtool. Check 5 reads firmware-version from + # `ethtool -i ` and substring-matches it against + # the image reference itself, so the workload image's tag + # is the authoritative declaration of which firmware was + # qualified with this workload. + image: $$ROCE_WORKLOAD_IMAGE + imagePullPolicy: IfNotPresent + # Privileged is required for rdma link show / ibv_devinfo + # to access the NIC device files mounted via the + # amd.com/nic device plugin (see design doc section 4 -> + # Code Path: Phase 3 Job template). + securityContext: + privileged: true + envFrom: + # Sources PHASE3_CHECK_SCRIPT, PHASE3_LABEL_KEY, + # PHASE3_EXPECTED_NIC_COUNT, PHASE3_AMD_NIC_PCI_IDS, + # PHASE3_MIN_GID_COUNT, and the label/annotate + # helper conventions used by PHASE3_CHECK_SCRIPT. + - configMapRef: + name: cluster-validation-config + env: + # NODE_NAME is consumed by PHASE3_CHECK_SCRIPT's + # self-label step (kubectl label node "$NODE_NAME" .). + # Downward-API resolution -- kept independent of the + # nodeSelector substitution above so the script can run + # under a static nodeSelector if needed for tests. + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + # Workload image reference (same value substituted into + # `image:` above). PHASE3_CHECK_SCRIPT's Check 5 + # substring-matches each NIC's ethtool firmware-version + # against this string. PHASE3_SCRIPT sed-renders the + # value at Job-template instantiation time, so the env + # carries the FULL reference (registry/repo:tag). + - name: ROCE_WORKLOAD_IMAGE + value: $$ROCE_WORKLOAD_IMAGE + # Sole-tenant on the node's NICs. PHASE3_SCRIPT substitutes + # $$EXPECTED_NIC_COUNT from PHASE3_EXPECTED_NIC_COUNT in + # cluster-validation-config (default "8"). This both + # schedules the pod onto all amd.com/nic devices on the + # node (via the device plugin) and gates admission if any + # NICs are unhealthy/unallocated -- the Job stays pending + # and PHASE3_SCRIPT times it out as nic-not-allocated. + resources: + requests: + amd.com/nic: $$EXPECTED_NIC_COUNT + limits: + amd.com/nic: $$EXPECTED_NIC_COUNT + volumeMounts: + - name: log-storage + mountPath: /var/log/cluster-validation + command: ["/bin/bash", "-c"] + args: + - | + # PHASE3_CHECK_SCRIPT is sourced from + # cluster-validation-config via envFrom above. It + # runs the four health checks (NIC count, ip link, + # rdma link, ibv_devinfo+GID) and self-labels the + # node via in-pod kubectl per the design helper + # conventions (PHASE3_LABEL_KEY + + # PHASE_FAILURE_REASON_ANNOTATION_SUFFIX). + echo "$PHASE3_CHECK_SCRIPT" > /tmp/check.sh + chmod +x /tmp/check.sh + # Persist full check.sh output to node-local + # /var/log/cluster-validation/ for post-mortem. + # PIPESTATUS preserves check.sh's exit code (tee + # always exits 0, which would mask failures). + TS=$(date -u +%Y-%m-%dT%H-%M-%S.%6NZ) + HOST_LOG="/var/log/cluster-validation/${TS}_phase3_nic-health.log" + /tmp/check.sh 2>&1 | tee "${HOST_LOG}" + exit ${PIPESTATUS[0]} + # No retries: a Phase 3 failure is a hard NIC/topology signal + # PHASE3_SCRIPT (orchestrator) inspects the resulting node label + # and the next CronJob tick re-tests if appropriate. + backoffLimit: 0 + # TTL keeps completed Jobs around briefly for log capture by + # PHASE3_SCRIPT, then GC. + ttlSecondsAfterFinished: 300 +--- +# ========================================================================= +# Phase 4 per-(pair,rail) Job template ConfigMap. +# See the phase design notes +# section 4 -> "Code Path: Phase 4 Job templates (server + client)". +# +# Holds TWO batch/v1.Job templates under separate ConfigMap data keys: +# * cluster-validation-phase4-server-job-config.yaml -- ib_write_bw +# server, one per (pair, rail) instance. Pinned to node A of the +# pair, requests one rail's NAD, runs `ib_write_bw -d $DEV -F` +# (server mode, waits for client connect, exits on completion). +# * cluster-validation-phase4-client-job-config.yaml -- ib_write_bw +# client, one per (pair, rail) instance. Pinned to node B of the +# pair, requests the same rail's NAD on the peer side, runs +# `ib_write_bw -d $DEV -F $PEER_POD_IP`, parses "BW average" +# from the test output, tees to /shared/phase4_rail${RAIL_IDX}.log +# so PHASE4_DRIVER_SCRIPT can read the BW value via +# `kubectl logs`. +# +# The driver script (PHASE4_DRIVER_SCRIPT, sibling script) is +# responsible for: (a) sed-rendering each template once per (pair, +# rail) with unique Job names; (b) discovering the server pod IP +# after admission and substituting $$PEER_POD_IP into the client +# template; (c) waiting for both Jobs up to PHASE4_PAIR_WAIT_TIME +# seconds; (d) parsing the client log; (e) cleanup. +# +# Substitution placeholders (sed-rendered by PHASE4_DRIVER_SCRIPT): +# $$NODE -- target node hostname (kubernetes.io/hostname). +# Server pins to node A; client pins to node B. +# $$PEER_POD_IP -- server pod IP discovered post-admission. +# Client-template only (server has no peer arg). +# $$RAIL_IDX -- rail index 0.PHASE4_RAIL_COUNT-1. Selects the +# RDMA device via ${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}. +# $$NAD_NAME -- per-rail NetworkAttachmentDefinition name +# (typically ${PHASE4_NAD_NAME_PREFIX}${RAIL_IDX}, +# e.g. amd-host-device-nad-rail-3). Requested via +# the standard Multus annotation +# k8s.v1.cni.cncf.io/networks. Pod admission is +# rejected if the NAD does not exist -- driver +# records that as failed-reason=nad-missing. +# +# Image: rocm/roce-workload (substituted from ROCE_WORKLOAD_IMAGE in +# cluster-validation-config). The image ships OFED userspace and +# ib_write_bw -- no init-container fetch needed. envFrom pulls all +# PHASE4_* variables plus the helper +# conventions so the container scripts can self-resolve device names +# and thresholds without further substitution. +# +# Resources: amd.com/nic: 1 -- the device-plugin gate ensures the +# pod only schedules when the requested rail's NIC is allocatable. +# IPC_LOCK + privileged are required for RDMA verbs (ib_write_bw +# uses pinned memory and direct verbs access). +# ========================================================================= +apiVersion: v1 +kind: ConfigMap +metadata: + name: cluster-validation-phase4-job-config +data: + cluster-validation-phase4-server-job-config.yaml: | + apiVersion: batch/v1 + kind: Job + metadata: + # Job name is per-(pair, rail); PHASE4_DRIVER_SCRIPT renders the + # final unique name (e.g. + # cluster-validation-phase4-server-job--rail-) before + # apply. + name: cluster-validation-phase4-server-job + labels: + amd.com/cluster-validation-created: "true" + amd.com/cluster-validation-phase4-role: "server" + spec: + template: + metadata: + # Per-rail NetworkAttachmentDefinition request. The Multus + # admission controller attaches the rail-specific host + # device (amd-host-device-nad-rail-${RAIL_IDX}) to the pod's + # net namespace; ib_write_bw uses it via the RDMA verbs + # device name resolved from ${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}. + annotations: + k8s.v1.cni.cncf.io/networks: $$NAD_NAME + labels: + amd.com/cluster-validation-created: "true" + amd.com/cluster-validation-phase4-role: "server" + spec: + serviceAccountName: cluster-validation-sa + # Pin the Job to node A of the pair. PHASE4_DRIVER_SCRIPT + # substitutes $$NODE per (pair, rail) instance. + nodeSelector: + kubernetes.io/hostname: $$NODE + restartPolicy: Never + volumes: + - name: shared + emptyDir: {} + - name: log-storage + hostPath: + path: /var/log/cluster-validation + type: DirectoryOrCreate + containers: + - name: phase4-ib-server + # rocm/roce-workload -- ships OFED + ib_write_bw. Same + # image used by the client template below. Substituted + # at deploy time from ROCE_WORKLOAD_IMAGE in + # cluster-validation-config. + image: $$ROCE_WORKLOAD_IMAGE + imagePullPolicy: IfNotPresent + envFrom: + # Sources PHASE4_RAIL_COUNT, PHASE4_BW_THRESHOLD, + # PHASE4_PAIR_WAIT_TIME, PHASE4_NAD_NAME_PREFIX, + # PHASE4_IB_DEV_PREFIX plus the + # label/annotate helper conventions. + - configMapRef: + name: cluster-validation-config + env: + # RAIL_IDX is sed-substituted at render time so the + # container script can resolve the RDMA device name + # without a separate templating layer. + - name: RAIL_IDX + value: "$$RAIL_IDX" + # POD_IP is discovered by PHASE4_DRIVER_SCRIPT + # post-admission via downward API on the live pod + # (status.podIP). Exposed here so the in-container + # logs include the IP for diagnostics. + - name: POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + # IPC_LOCK + privileged: ib_write_bw uses pinned memory + # and direct RDMA verbs (rdma_dev_${RAIL_IDX}); both + # require elevated privilege. Mirrors the Phase 5 worker + # pod's NIC-access posture. + securityContext: + privileged: true + capabilities: + add: ["IPC_LOCK"] + # Sole-tenant on the requested rail's NIC. The device + # plugin gates admission until the rail's NIC is + # allocatable; if not, the pod stays pending and + # PHASE4_DRIVER_SCRIPT times it out as nic-not-allocated. + resources: + requests: + amd.com/nic: 1 + limits: + amd.com/nic: 1 + volumeMounts: + - name: shared + mountPath: /shared + - name: log-storage + mountPath: /var/log/cluster-validation + command: ["/bin/bash", "-c"] + args: + - | + exec 2>&1 + set -uo pipefail + + # Resolve the RDMA device for this rail. The naming + # convention (rdma_dev_${RAIL_IDX}) is set by the + # cluster's NAD/device-plugin wiring; the prefix is + # configurable via PHASE4_IB_DEV_PREFIX so different + # fleets can adapt without touching the template. + DEV="${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}" + echo "[phase4-server] node=${NODE_NAME} rail=${RAIL_IDX} dev=${DEV} pod_ip=${POD_IP}" + + # Node-local persistent log target (survives pod GC). + TS=$(date -u +%Y-%m-%dT%H-%M-%S.%6NZ) + HOST_LOG="/var/log/cluster-validation/${TS}_phase4-server_rail${RAIL_IDX}.log" + + # Server mode: ib_write_bw with no peer IP. -F skips + # the CPU frequency check (required in containers + # where /proc/cpuinfo cpu_MHz is unreliable). -i 1 + # pins to the device's port 1. The server blocks + # until a client connects, runs one BW test, then + # exits 0 on success. + ib_write_bw -d "${DEV}" -i 1 -F \ + 2>&1 | tee "/shared/phase4_rail${RAIL_IDX}_server.log" "${HOST_LOG}" + ib_rc=${PIPESTATUS[0]} + + if [ "$ib_rc" -ne 0 ]; then + # ib_write_bw server-side crash. PHASE4_DRIVER_SCRIPT + # inspects the log tail and writes failed-reason= + # ib-write-bw-crashed via the per-rail annotation + # on node A. Exit non-zero so the Job is marked + # Failed and the driver can detect it. + echo "[phase4-server] ib_write_bw exited ${ib_rc}" + exit "$ib_rc" + fi + + echo "[phase4-server] rail ${RAIL_IDX} server completed cleanly" + # No retries: a Phase 4 failure is a hard per-(pair, rail) + # signal -- PHASE4_DRIVER_SCRIPT handles label/annotation and + # the next CronJob tick re-tests if appropriate. + backoffLimit: 0 + # TTL keeps completed Jobs around briefly for log capture by + # PHASE4_DRIVER_SCRIPT, then GC. + ttlSecondsAfterFinished: 300 + cluster-validation-phase4-client-job-config.yaml: | + apiVersion: batch/v1 + kind: Job + metadata: + # Job name is per-(pair, rail); PHASE4_DRIVER_SCRIPT renders the + # final unique name (e.g. + # cluster-validation-phase4-client-job--rail-) before + # apply. The driver renders the client template AFTER the server + # pod's IP is known (substituted into $$PEER_POD_IP below). + name: cluster-validation-phase4-client-job + labels: + amd.com/cluster-validation-created: "true" + amd.com/cluster-validation-phase4-role: "client" + spec: + template: + metadata: + # Same per-rail NAD as the server -- the client needs the + # same rail's NIC on its own node to drive ib_write_bw + # against the matching device on the server. + annotations: + k8s.v1.cni.cncf.io/networks: $$NAD_NAME + labels: + amd.com/cluster-validation-created: "true" + amd.com/cluster-validation-phase4-role: "client" + spec: + serviceAccountName: cluster-validation-sa + # Pin the Job to node B of the pair. PHASE4_DRIVER_SCRIPT + # substitutes $$NODE per (pair, rail) instance. + nodeSelector: + kubernetes.io/hostname: $$NODE + restartPolicy: Never + volumes: + - name: shared + emptyDir: {} + - name: log-storage + hostPath: + path: /var/log/cluster-validation + type: DirectoryOrCreate + containers: + - name: phase4-ib-client + image: $$ROCE_WORKLOAD_IMAGE + imagePullPolicy: IfNotPresent + envFrom: + - configMapRef: + name: cluster-validation-config + env: + - name: RAIL_IDX + value: "$$RAIL_IDX" + # PEER_POD_IP is the server pod's status.podIP, + # discovered by PHASE4_DRIVER_SCRIPT after the server + # Job's pod transitions to Running and substituted + # into this template just before apply. + - name: PEER_POD_IP + value: "$$PEER_POD_IP" + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + securityContext: + privileged: true + capabilities: + add: ["IPC_LOCK"] + resources: + requests: + amd.com/nic: 1 + limits: + amd.com/nic: 1 + volumeMounts: + - name: shared + mountPath: /shared + - name: log-storage + mountPath: /var/log/cluster-validation + command: ["/bin/bash", "-c"] + args: + - | + exec 2>&1 + set -uo pipefail + + DEV="${PHASE4_IB_DEV_PREFIX}${RAIL_IDX}" + echo "[phase4-client] node=${NODE_NAME} rail=${RAIL_IDX} dev=${DEV} peer=${PEER_POD_IP}" + + # Node-local persistent log target (survives pod GC). + TS=$(date -u +%Y-%m-%dT%H-%M-%S.%6NZ) + HOST_LOG="/var/log/cluster-validation/${TS}_phase4-client_rail${RAIL_IDX}.log" + + if [ -z "${PEER_POD_IP:-}" ]; then + # Defensive: if the driver failed to substitute + # $$PEER_POD_IP (e.g. server pod never reached + # Running), the client cannot run. Exit non-zero + # so the driver records reason=peer-pod-unready. + echo "[phase4-client] PEER_POD_IP empty -- aborting" + exit 1 + fi + + # Client mode: ib_write_bw against the server pod IP. + # -F skips CPU freq check; -i 1 pins to port 1. The + # client connects, runs one BW test, prints + # "BW average[Gb/sec] = " on stdout, exits 0. + # PHASE4_DRIVER_SCRIPT greps "BW average" from this + # log to populate the per-rail annotation. + ib_write_bw -d "${DEV}" -i 1 -F "${PEER_POD_IP}" \ + 2>&1 | tee "/shared/phase4_rail${RAIL_IDX}_client.log" "${HOST_LOG}" + ib_rc=${PIPESTATUS[0]} + + if [ "$ib_rc" -ne 0 ]; then + echo "[phase4-client] ib_write_bw exited ${ib_rc}" + exit "$ib_rc" + fi + + # Surface the parsed BW value to stdout for the + # driver's `kubectl logs` consumption. The driver + # also re-parses the full log for robustness; this + # echo is a convenience marker. + bw_line=$(grep -E '^[[:space:]]*[0-9]+[[:space:]]+[0-9]+.*BW average' \ + "/shared/phase4_rail${RAIL_IDX}_client.log" || true) + if [ -z "$bw_line" ]; then + # No BW line: PHASE4_DRIVER_SCRIPT records + # failed-reason=parse-failed. + echo "[phase4-client] no BW average line in client log" + exit 1 + fi + echo "[phase4-client] rail ${RAIL_IDX} ${bw_line}" + backoffLimit: 0 + ttlSecondsAfterFinished: 300 +--- apiVersion: v1 kind: ServiceAccount metadata: @@ -265,6 +961,14 @@ rules: resources: ["jobs"] verbs: ["get", "list", "watch", "create", "delete", "patch"] + # Allow ConfigMap operations -- Phase 1 + # creates one per-stage per-node ConfigMap on the fly (cvf-phase1-- + # -) holding the single-recipe GPU_VALIDATION_TESTS_JSON + # payload, then deletes it after the stage's Job is parsed. + - apiGroups: [""] + resources: ["configmaps"] + verbs: ["get", "list", "watch", "create", "delete", "patch", "update"] + # Allow node operations - apiGroups: [""] resources: ["nodes"] @@ -280,6 +984,12 @@ rules: resources: ["pods", "pods/exec"] verbs: ["get", "list", "watch", "create", "update"] + # Allow orchestrator to read Phase 3 Job pod logs to parse the + # PHASE3_RESULT marker emitted by PHASE3_CHECK_SCRIPT. + - apiGroups: [""] + resources: ["pods/log"] + verbs: ["get", "list"] + --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding @@ -300,176 +1010,736 @@ kind: CronJob metadata: name: cluster-validation-cron-job spec: - # schedule: "0 0 * * *" - # schedule every 10 minutes (Change as needed) - schedule: "__CRONJOB_SCHEDULE__" - concurrencyPolicy: Forbid # Dont overlap runs + # Default: every 30 minutes. + # patchable: cluster-validation-framework.cronjob.schedule + schedule: "*/30 * * * *" + concurrencyPolicy: Forbid # Dont overlap runs jobTemplate: spec: template: spec: - serviceAccountName: cluster-validation-sa # must have permission to create MPIJobs + serviceAccountName: cluster-validation-sa # must have permission to create MPIJobs restartPolicy: Never containers: - name: submit-mpijob - image: docker.io/rocm/network-operator-utils:v1.1.0 + # Orchestrator runs bash + kubectl throughout (selection, + # label/annotate, MPIJob poll, cleanup). Needs kubectl on PATH. + # patchable: cluster-validation-framework.images.orchestrator + image: docker.io/bitnamilegacy/kubectl:1.33.4 imagePullPolicy: IfNotPresent + # Run as root: this image defaults to UID 1001, which cannot + # write the root-owned /var/log/cluster-validation hostPath. + # The orchestrator writes its own cronjob-*.log there and + # persists each test-runner pod log (_pod.log) for + # triage, so it must be able to write that shared volume. + securityContext: + runAsUser: 0 command: ["/bin/bash", "-c"] envFrom: - configMapRef: name: cluster-validation-config args: - | + # ========================================================= + # Multi-Phase CronJob Orchestrator (, owned by + #). See design doc §4 -> "Code Path: CronJob + # bash state machine (run-phases.sh)". + # + # Pipeline: + # Phase 0 -> Phase 1 -> Phase 2 -> Phase 3 -> Phase 4 + # -> Phase 4.5 -> Phase 5 + # Each phase: + # * exits early if its SKIP_* flag is "true" + # * runs run_phaseN (stub by default; sources + # $PHASE{N}_SCRIPT from the ConfigMap if defined) + # * narrows the live-node pool via filter_passed_nodes, + # keyed on $PHASE{N}_LABEL_KEY + # + # Phase-script contract: each run_phaseN script writes + # $PHASE{N}_LABEL_KEY=passed|failed on every input node + # before returning (see design §5). Labels are the ONLY + # contract; stdout/stderr are not parsed. + # + # DRY_RUN=1 mode: run_phaseN stubs become `:`; Phase 0 + # candidate selection is skipped (no kubectl writes); + # the planned phase order + filtered pool per phase is + # printed to stdout. + # + # Final exit: non-zero if any phase produced failed + # nodes (preserves existing fail-loud behavior). + # ========================================================= + # --- Redirect all output to a timestamped log file --- LOG_DIR=/var/log/cluster-validation mkdir -p "$LOG_DIR" LOG_TS=$(date +%Y%m%d-%H%M%S) LOG_FILE="${LOG_DIR}/cronjob-${LOG_TS}.log" - exec > >(tee -a "$LOG_FILE") 2>&1 - set -euo pipefail + # Prefix every stdout/stderr line with a wall-clock + # timestamp. Sourced phase scripts inherit this + # redirect so all output is timestamped uniformly. + # The EXIT trap closes the pipe + reaps the formatter + # so the parent does not exit before the file is + # fully drained (bounded so a stuck formatter cannot + # hang the CronJob pod). + _ts_prefix() { + while IFS= read -r line; do + printf '[%s] %s\n' \ + "$(date '+%Y-%m-%d %H:%M:%S.%3N')" "$line" + done + } + exec > >(_ts_prefix | tee -a "$LOG_FILE") 2>&1 + _ts_pid=$! + _flush_ts_pipe() { + exec 1>&- 2>&- + if [ -n "${_ts_pid:-}" ]; then + for _i in 1 2 3 4 5; do + if ! kill -0 "$_ts_pid" 2>/dev/null; then + break + fi + sleep 0.2 + done + fi + } + trap _flush_ts_pipe EXIT - echo -e "\n$(date): ===Step 1: Determining candidate nodes===" - echo "$CRONJOB_CANDIDATE_NODES_SELECTION_SCRIPT" > /tmp/select-and-label-candidate.sh - chmod +x /tmp/select-and-label-candidate.sh - if ! /tmp/select-and-label-candidate.sh; then - exit 0 + # Allow inner helpers to fail without taking the + # orchestrator down -- a failed phase must be logged + # and the pipeline must continue with the remaining + # nodes. `pipefail` + `-u` stay on for early detection + # of typos and pipe wiring bugs. + set -uo pipefail + + DRY_RUN="${DRY_RUN:-0}" + if [[ "${DRY_RUN}" == "1" ]]; then + echo "[Orchestrator] DRY_RUN=1 -- no kubectl writes will be performed" fi - nodes=$(kubectl get nodes -l "${CANDIDATE_LABEL}" -o name | sed 's|node/||') - - echo -e "\n$(date): ===Step 2: Submitting test runner jobs for each candidate node===" - # Define variables BEFORE calling the modularized script - job_names="" - declare -A job_to_node - passed_nodes="" - failed_nodes="" - - # Call the modularized GPU validation script - echo "$GPU_VALIDATION_TEST_SCRIPT" > /tmp/gpu-validation-test.sh - chmod +x /tmp/gpu-validation-test.sh - source /tmp/gpu-validation-test.sh - - echo -e "\n$(date): ===Step 3: Submitting MPIJob===" - - # Check if RCCL test should be skipped - if [[ "${SKIP_RCCL_TEST,,}" == "true" ]]; then - echo "SKIP_RCCL_TEST is set to true. Skipping MPI Job RCCL test." - echo "==================================================================" - - # Label passed nodes with success status and remove candidate label - CLUSTER_VALIDATION_STATUS_LABEL=${SUCCESS_LABEL} - echo -e "\n$(date): ===Labeling nodes (RCCL test skipped)===" - for n in $passed_nodes; do - echo "Labeling node $n with $CLUSTER_VALIDATION_STATUS_LABEL" - kubectl label node "$n" "$CLUSTER_VALIDATION_STATUS_LABEL" --overwrite - done - CANDIDATE_LABEL_KEY=${CANDIDATE_LABEL%%=*} - for n in $passed_nodes; do - echo "Removing candidate label on node: $n" - kubectl label node "$n" "${CANDIDATE_LABEL_KEY}-" --overwrite + + # --- Source the node-metadata helper library --- + # Provides label_phase_passed, label_phase_failed, + # annotate_phase_value, filter_passed_nodes. + echo "$PHASE_NODE_LABEL_SCRIPT" > /tmp/phase-node-label.sh + chmod +x /tmp/phase-node-label.sh + # shellcheck disable=SC1091 + source /tmp/phase-node-label.sh + + # --------------------------------------------------------- + # Phase failure accounting. Each run_phaseN writes labels + # in the cluster, but we also track whether *this run* + # produced any non-passed nodes so the CronJob can exit + # non-zero (fail-loud, design §6). + # --------------------------------------------------------- + ANY_PHASE_FAILED=0 + + # account_phase_failures + # Logs the per-phase pass/fail split and bumps the global + # failure counter when input != passed. Empty inputs are + # not a failure (no work to do). + account_phase_failures() { + local phase_name="$1" + local input_pool="$2" + local passed_pool="$3" + local in_count out_count + in_count=$(echo "$input_pool" | wc -w) + out_count=$(echo "$passed_pool" | wc -w) + local failed_count=$(( in_count - out_count )) + echo "[Orchestrator] ${phase_name}: input=${in_count} passed=${out_count} failed=${failed_count}" + if [[ "$failed_count" -gt 0 ]]; then + ANY_PHASE_FAILED=1 + fi + } + + # nodes_with_label + # Emits, on stdout, the space-separated list of node names + # currently matching . Returns empty on + # error or when no nodes match. Honors DRY_RUN by still + # reading state (read is non-mutating). + nodes_with_label() { + local selector="$1" + kubectl get nodes -l "$selector" -o name 2>/dev/null \ + | sed 's|node/||' | tr '\n' ' ' | sed 's/ *$//' + } + + # intersect + # Emits, on stdout, the space-separated intersection of + # two space-separated node-name pools. Order follows + # pool_a. Empty input on either side yields empty output. + intersect() { + local a="$1" + local b="$2" + [[ -z "$a" || -z "$b" ]] && return 0 + local out="" + local x y + for x in $a; do + for y in $b; do + if [[ "$x" == "$y" ]]; then + if [[ -z "$out" ]]; then + out="$x" + else + out="$out $x" + fi + break + fi + done done - echo "[CronJob Result: $CLUSTER_VALIDATION_STATUS_LABEL] Cluster Validation Status updated on Candidate Nodes (RCCL test skipped)" - echo "==================================================================" - echo "[CronJob Completed] $(date)" + echo "$out" + } + + # --------------------------------------------------------- + # run_phaseN + # + # Default behavior: + # * print a banner identifying the phase + # * if $PHASE{N}_SCRIPT is non-empty, drop it to a tmp + # file and `source` it so the per-phase code runs in + # the orchestrator's shell (helpers + env stay live) + # * otherwise, no-op (downstream work hasn't landed yet) + # + # Phase-script contract: the sourced script MUST label + # every node in its input list with $PHASE{N}_LABEL_KEY + # before returning. The orchestrator does not parse + # phase-script stdout -- labels are the gate. + # + # DRY_RUN=1 overrides ALL run_phaseN bodies with `:` so + # the same phase plan can be exercised without any + # cluster mutation. + # --------------------------------------------------------- + _run_phase_generic() { + local phase_num="$1" + local phase_label="$2" + local script_var_name="$3" + local nodes="$4" + echo "------------------------------------------------------------------" + echo "[Phase ${phase_num}] ${phase_label}" + echo "[Phase ${phase_num}] input nodes: ${nodes:-}" + echo "------------------------------------------------------------------" + if [[ -z "$nodes" ]]; then + echo "[Phase ${phase_num}] no input nodes -- nothing to do" + return 0 + fi + # Look up the per-phase script body. Indirect expansion + # is required because env-var names are computed from + # the phase number. + local script_body="${!script_var_name:-}" + if [[ -z "$script_body" ]]; then + echo "[Phase ${phase_num}] ${script_var_name} not set -- stub no-op" + return 0 + fi + local script_path="/tmp/run-phase${phase_num}.sh" + echo "$script_body" > "$script_path" + chmod +x "$script_path" + echo "[Phase ${phase_num}] sourcing ${script_var_name} from $script_path" + # PHASE_NODES is the documented input handle for per-phase + # scripts. Stays in sync with the positional arg. + PHASE_NODES="$nodes" + export PHASE_NODES + # shellcheck disable=SC1090 + source "$script_path" $nodes + local rc=$? + if [[ "$rc" -ne 0 ]]; then + echo "[Phase ${phase_num}] WARNING: ${script_var_name} exited ${rc}; continuing with whatever labels were written" + fi + return 0 + } + + # --------------------------------------------------------- + # run_phase1 -- GPU HW acceptance (, dedicated wiring + #). Replaces the generic-dispatch stub + # delivered. See design doc §4 -> "Code Path: + # orchestrator wiring". + # + # Contract: write $PHASE1_SCRIPT to a tmp path, chmod +x, + # source it with the input node list as positional args. + # The sourced PHASE1_SCRIPT writes + # $PHASE1_LABEL_KEY=passed|failed on every input node + # before returning. The orchestrator does not parse + # PHASE1_SCRIPT stdout -- labels are the gate. + # --------------------------------------------------------- + run_phase1() { + local nodes="$1" + echo "------------------------------------------------------------------" + echo "[Phase 1] GPU HW acceptance" + echo "[Phase 1] input nodes: ${nodes:-}" + echo "------------------------------------------------------------------" + if [[ -z "$nodes" ]]; then + echo "[Phase 1] no input nodes -- nothing to do" + return 0 + fi + if [[ -z "${PHASE1_SCRIPT:-}" ]]; then + echo "[Phase 1] PHASE1_SCRIPT not set -- nothing to source" + return 0 + fi + local script_path="/tmp/run-phase1.sh" + echo "$PHASE1_SCRIPT" > "$script_path" + chmod +x "$script_path" + echo "[Phase 1] sourcing PHASE1_SCRIPT from $script_path" + # PHASE_NODES is the documented input handle for per-phase + # scripts. Stays in sync with the positional args. + PHASE_NODES="$nodes" + export PHASE_NODES + # shellcheck disable=SC1090 + source "$script_path" $nodes + local rc=$? + if [[ "$rc" -ne 0 ]]; then + echo "[Phase 1] WARNING: PHASE1_SCRIPT exited ${rc}; continuing with whatever labels were written" + fi + return 0 + } + # --------------------------------------------------------- + # run_phase2 -- Intra-node GPU collective (, dedicated + # wiring). Replaces the generic-dispatch + # stub delivered. See design doc §4 -> "Code + # Path: orchestrator wiring". + # + # Contract: write $PHASE2_SCRIPT to a tmp path, chmod +x, + # source it with the input node list as positional args. + # The sourced PHASE2_SCRIPT writes + # $PHASE2_LABEL_KEY=passed|failed on every input node + # before returning. The orchestrator does not parse + # PHASE2_SCRIPT stdout -- labels are the gate. + # --------------------------------------------------------- + run_phase2() { + local nodes="$1" + echo "------------------------------------------------------------------" + echo "[Phase 2] Intra-node GPU collective" + echo "[Phase 2] input nodes: ${nodes:-}" + echo "------------------------------------------------------------------" + if [[ -z "$nodes" ]]; then + echo "[Phase 2] no input nodes -- nothing to do" + return 0 + fi + if [[ -z "${PHASE2_SCRIPT:-}" ]]; then + echo "[Phase 2] PHASE2_SCRIPT not set -- nothing to source" + return 0 + fi + local script_path="/tmp/run-phase2.sh" + echo "$PHASE2_SCRIPT" > "$script_path" + chmod +x "$script_path" + echo "[Phase 2] sourcing PHASE2_SCRIPT from $script_path" + # PHASE_NODES is the documented input handle for per-phase + # scripts. Stays in sync with the positional args. + PHASE_NODES="$nodes" + export PHASE_NODES + # shellcheck disable=SC1090 + source "$script_path" $nodes + local rc=$? + if [[ "$rc" -ne 0 ]]; then + echo "[Phase 2] WARNING: PHASE2_SCRIPT exited ${rc}; continuing with whatever labels were written" + fi + return 0 + } + # --------------------------------------------------------- + # run_phase3 -- Per-node NIC health (, dedicated + # wiring). Replaces the generic-dispatch + # stub delivered. See design doc §4 -> "Code + # Path: PHASE3_SCRIPT + orchestrator wiring". + # + # Contract: write $PHASE3_SCRIPT to a tmp path, chmod +x, + # source it with the input node list as positional args. + # The sourced PHASE3_SCRIPT submits one + # Phase 3 Job per node (template mounted at + # /phase3-configs/cluster-validation-phase3-job-config.yaml) + # and waits up to PHASE3_JOB_WAIT_TIME for completion. + # The in-pod PHASE3_CHECK_SCRIPT self-labels + # the node via in-pod kubectl; PHASE3_SCRIPT only fills + # in label_phase_failed for nodes whose Job could not + # run (submit-failed -> job-creation-failed; timeout -> + # nic-not-allocated). The orchestrator does not parse + # PHASE3_SCRIPT stdout -- labels are the gate. + # + # The caller (cronjob orchestrator) is responsible for + # narrowing the input pool to amd-nic=true BEFORE + # invoking run_phase3; PHASE3_SCRIPT trusts its $@ + # list is already nic-capable. + # --------------------------------------------------------- + run_phase3() { + local nodes="$1" + echo "------------------------------------------------------------------" + echo "[Phase 3] Per-node NIC health" + echo "[Phase 3] input nodes: ${nodes:-}" + echo "------------------------------------------------------------------" + if [[ -z "$nodes" ]]; then + echo "[Phase 3] no input nodes -- nothing to do" + return 0 + fi + if [[ -z "${PHASE3_SCRIPT:-}" ]]; then + echo "[Phase 3] PHASE3_SCRIPT not set -- nothing to source" + return 0 + fi + local script_path="/tmp/run-phase3.sh" + echo "$PHASE3_SCRIPT" > "$script_path" + chmod +x "$script_path" + echo "[Phase 3] sourcing PHASE3_SCRIPT from $script_path" + # PHASE_NODES is the documented input handle for per-phase + # scripts. Stays in sync with the positional args. + PHASE_NODES="$nodes" + export PHASE_NODES + # shellcheck disable=SC1090 + source "$script_path" $nodes + local rc=$? + if [[ "$rc" -ne 0 ]]; then + echo "[Phase 3] WARNING: PHASE3_SCRIPT exited ${rc}; continuing with whatever labels were written" + fi + return 0 + } + # --------------------------------------------------------- + # run_phase4 -- Pairwise rail bandwidth (, dedicated + # wiring). Replaces the generic-dispatch + # stub delivered. See design doc §4 -> "Code + # Path: orchestrator wiring". + # + # Contract: write $PHASE4_DRIVER_SCRIPT to a tmp path, + # chmod +x, source it with the input node list as + # positional args. The sourced PHASE4_DRIVER_SCRIPT + # submits per-(pair, rail) server + client + # Jobs (templates mounted at /phase4-configs/.), parses + # "BW average" from client logs, and writes + # $PHASE4_LABEL_KEY=passed|failed on every input node + # plus per-rail annotations + # (amd.com/rail-bandwidth-rail-{N}=, + # amd.com/rail-bandwidth-failed-rails=.) before + # returning. The orchestrator does not parse + # PHASE4_DRIVER_SCRIPT stdout -- labels are the gate. + # + # The caller (cronjob orchestrator) is responsible for + # narrowing the input pool to Phase-3-passed + # (amd.com/nic-health=passed) BEFORE invoking + # run_phase4; PHASE4_DRIVER_SCRIPT trusts its $@ list + # is already nic-healthy. + # --------------------------------------------------------- + run_phase4() { + local nodes="$1" + echo "------------------------------------------------------------------" + echo "[Phase 4] Pairwise rail bandwidth" + echo "[Phase 4] input nodes: ${nodes:-}" + echo "------------------------------------------------------------------" + if [[ -z "$nodes" ]]; then + echo "[Phase 4] no input nodes -- nothing to do" + return 0 + fi + if [[ -z "${PHASE4_DRIVER_SCRIPT:-}" ]]; then + echo "[Phase 4] PHASE4_DRIVER_SCRIPT not set -- nothing to source" + return 0 + fi + local script_path="/tmp/run-phase4.sh" + echo "$PHASE4_DRIVER_SCRIPT" > "$script_path" + chmod +x "$script_path" + echo "[Phase 4] sourcing PHASE4_DRIVER_SCRIPT from $script_path" + # PHASE_NODES is the documented input handle for per-phase + # scripts. Stays in sync with the positional args. + PHASE_NODES="$nodes" + export PHASE_NODES + # shellcheck disable=SC1090 + source "$script_path" $nodes + local rc=$? + if [[ "$rc" -ne 0 ]]; then + echo "[Phase 4] WARNING: PHASE4_DRIVER_SCRIPT exited ${rc}; continuing with whatever labels were written" + fi + return 0 + } + # --------------------------------------------------------- + # run_phase5 -- Multi-node RCCL via MPIJob (, + # dedicated wiring). Replaces the + # generic-dispatch stub delivered. See + # design doc §4 -> "Code Path: orchestrator wiring + + # cleanup of old in-CronJob block". + # + # Contract: write $PHASE5_SCRIPT to a tmp path, source + # it once to register the run_phase5_main function, + # then invoke run_phase5_main "$nodes" with the input + # node list as a single space-separated argument. + # + # The sourced PHASE5_SCRIPT (/.2) defines + # run_phase5_main which: applies the SKIP_RCCL_TEST + # short-circuit, enforces PHASE5_MIN_WORKERS, renders + # and submits the MPIJob with actual_worker_replicas + # equal to the input count, waits for completion, + # writes $PHASE5_LABEL_KEY=passed|failed on every + # input node via label_phase_passed/label_phase_failed + #, and dumps per-worker logs. + # The orchestrator does not parse PHASE5_SCRIPT + # stdout -- labels are the gate. + # + # The caller (cronjob orchestrator) is responsible for + # narrowing the input pool to nodes surviving Phase + # 4.5 BEFORE invoking run_phase5; run_phase5_main + # trusts its argument list is already 4.5-passed. + # + # Cleanup-of-old-MPIJobs and CronJob fail-loud exit + # logic intentionally remain in the orchestrator + # (cleanup_old_mpijobs, collect_launcher_logs, and + # the ANY_PHASE_FAILED exit gate below) -- they are + # cross-phase concerns, not per-phase script logic. + # --------------------------------------------------------- + run_phase5() { + local nodes="$1" + echo "------------------------------------------------------------------" + echo "[Phase 5] Multi-node RCCL via MPIJob" + echo "[Phase 5] input nodes: ${nodes:-}" + echo "------------------------------------------------------------------" + if [[ -z "$nodes" ]]; then + echo "[Phase 5] no input nodes -- nothing to do" + return 0 + fi + if [[ -z "${PHASE5_SCRIPT:-}" ]]; then + echo "[Phase 5] PHASE5_SCRIPT not set -- nothing to source" + return 0 + fi + local script_path="/tmp/run-phase5.sh" + echo "$PHASE5_SCRIPT" > "$script_path" + chmod +x "$script_path" + echo "[Phase 5] sourcing PHASE5_SCRIPT from $script_path" + # PHASE_NODES is the documented input handle for per-phase + # scripts. Stays in sync with the run_phase5_main arg. + PHASE_NODES="$nodes" + export PHASE_NODES + # shellcheck disable=SC1090 + source "$script_path" + if ! declare -f run_phase5_main >/dev/null; then + echo "[Phase 5] WARNING: PHASE5_SCRIPT did not define run_phase5_main; skipping" + return 0 + fi + run_phase5_main "$nodes" + local rc=$? + if [[ "$rc" -ne 0 ]]; then + echo "[Phase 5] WARNING: run_phase5_main exited ${rc}; continuing with whatever labels were written" + fi + return 0 + } + + # Phase 4.5 -- cross-node connectivity matrix. + # No SKIP_* flag of its own; downstream work may add one + # and may re-define this function via the same + # source-from-ConfigMap mechanism by overriding the + # generic stub. + maybe_run_phase45() { + local nodes="$1" + # Diagnostics go to stderr so the captured stdout + # (consumed via `$(maybe_run_phase45 .)`) carries only + # the node list and nothing else. + echo "------------------------------------------------------------------" >&2 + echo "[Phase 4.5] Cross-node connectivity matrix" >&2 + echo "[Phase 4.5] input nodes: ${nodes:-}" >&2 + echo "------------------------------------------------------------------" >&2 + local script_body="${PHASE45_SCRIPT:-}" + if [[ -z "$script_body" || -z "$nodes" ]]; then + echo "[Phase 4.5] no script defined or empty pool -- pass-through" >&2 + echo "$nodes" + return 0 + fi + local script_path="/tmp/run-phase45.sh" + echo "$script_body" > "$script_path" + chmod +x "$script_path" + PHASE_NODES="$nodes" + export PHASE_NODES + # shellcheck disable=SC1090 + source "$script_path" $nodes + # Phase 4.5 pass-through: re-emit input on stdout so the + # caller can pipe to Phase 5. The phase script is + # expected to filter via its own logic (e.g. by writing + # PHASE4_LABEL_KEY on a node it deems broken). For now, + # we trust the input pool. + echo "$nodes" + } + + # DRY_RUN override: replace every run_phaseN body with `:` + # so phase scripts don't execute but the pipeline still + # walks every phase and prints the plan. + if [[ "${DRY_RUN}" == "1" ]]; then + run_phase1() { echo "[Phase 1] DRY_RUN -- skipping run_phase1 ($1)"; } + run_phase2() { echo "[Phase 2] DRY_RUN -- skipping run_phase2 ($1)"; } + run_phase3() { echo "[Phase 3] DRY_RUN -- skipping run_phase3 ($1)"; } + run_phase4() { echo "[Phase 4] DRY_RUN -- skipping run_phase4 ($1)"; } + run_phase5() { echo "[Phase 5] DRY_RUN -- skipping run_phase5 ($1)"; } + maybe_run_phase45() { + echo "[Phase 4.5] DRY_RUN -- pass-through ($1)" >&2 + echo "$1" + } + fi + + # --------------------------------------------------------- + # Phase 0 -- Candidate selection + # --------------------------------------------------------- + echo -e "\n$(date): ===Phase 0: Determining candidate nodes===" + if [[ "${DRY_RUN}" == "1" ]]; then + # In DRY_RUN, do NOT label nodes. Just read whatever + # nodes currently match the candidate selector so the + # downstream phase plan has something to show. + candidates=$(nodes_with_label "${CANDIDATE_LABEL}") + if [[ -z "$candidates" ]]; then + echo "[Phase 0] DRY_RUN: no nodes currently match ${CANDIDATE_LABEL} -- using empty pool" + fi + else + echo "$CRONJOB_CANDIDATE_NODES_SELECTION_SCRIPT" > /tmp/select-and-label-candidate.sh + chmod +x /tmp/select-and-label-candidate.sh + if ! /tmp/select-and-label-candidate.sh; then + # Insufficient candidates is not a CronJob failure + # the next tick will retry. Preserves existing + # behavior from the linear orchestrator. + echo "[Phase 0] insufficient candidates -- exiting 0" + exit 0 + fi + candidates=$(nodes_with_label "${CANDIDATE_LABEL}") + fi + echo "[Phase 0] candidate nodes: ${candidates:-}" + + # Empty candidate pool is a clean no-op exit (design §6). + if [[ -z "$candidates" ]]; then + echo "[Orchestrator] empty candidate pool after Phase 0 -- exiting 0" exit 0 fi - ts=$(date +%Y%m%d-%H%M) - new_job="cluster-validation-mpi-job-${ts}" - # Calculate worker replicas based on passed nodes count - actual_worker_replicas=$passed_count - - # Generate PF NAD list (for PF) - PF_NAD_LIST=$(printf "${PF_NIC_NAD_NAME},%.0s" $(seq 1 $PF_NIC_PER_WORKER)) - PF_NAD_LIST=${PF_NAD_LIST%,} # remove trailing comma - # Generate VF NAD list (for VF) - VF_NAD_LIST=$(printf "${VF_NIC_NAD_NAME},%.0s" $(seq 1 $VF_NIC_PER_WORKER)) - VF_NAD_LIST=${VF_NAD_LIST%,} # remove trailing comma - - if [ "$PF_NIC_PER_WORKER" -gt 0 ]; then - NAD_ANNOTATION="$PF_NAD_LIST" - elif [ "$VF_NIC_PER_WORKER" -gt 0 ]; then - NAD_ANNOTATION="$VF_NAD_LIST" + # --------------------------------------------------------- + # Phase 1 -- GPU HW acceptance + # --------------------------------------------------------- + if [[ "${SKIP_GPU_HW_ACCEPTANCE,,}" != "true" ]]; then + run_phase1 "$candidates" + phase1_passed=$(filter_passed_nodes "$candidates" "$PHASE1_LABEL_KEY") + account_phase_failures "Phase 1" "$candidates" "$phase1_passed" else - NAD_ANNOTATION="" + echo "[Phase 1] SKIP_GPU_HW_ACCEPTANCE=true -- pass-through" + phase1_passed="$candidates" fi - sed "s/^ name: cluster-validation-mpi-job/ name: ${new_job}/; \ - s|\$\$WORKER_REPLICAS|${actual_worker_replicas}|g; \ - s|\$\$LAUNCHER_REPLICAS|${LAUNCHER_REPLICAS}|g; \ - s|\$\$LOG_STORE_NODE_NAME|${LOG_STORE_NODE_NAME}|g; \ - s|\$\$SLOTS_PER_WORKER|${SLOTS_PER_WORKER}|g; \ - s|\$\$GPU_PER_WORKER|${GPU_PER_WORKER}|g; \ - s|\$\$PF_NIC_PER_WORKER|${PF_NIC_PER_WORKER}|g; \ - s|\$\$VF_NIC_PER_WORKER|${VF_NIC_PER_WORKER}|g; \ - s|\$\$NAD_ANNOTATION|${NAD_ANNOTATION}|g; \ - s|\$\$RCCL_WORKLOAD_IMAGE|${RCCL_WORKLOAD_IMAGE}|g" \ - /mpi-configs/cluster-validation-mpijob-config.yaml | kubectl apply -f - - echo "[MPIJob: Submitted for $actual_worker_replicas worker node(s)]" - echo "==================================================================" + # --------------------------------------------------------- + # Phase 2 -- Intra-node GPU collective + # --------------------------------------------------------- + if [[ "${SKIP_GPU_MESH_VALIDATION,,}" != "true" ]]; then + run_phase2 "$phase1_passed" + phase2_passed=$(filter_passed_nodes "$phase1_passed" "$PHASE2_LABEL_KEY") + account_phase_failures "Phase 2" "$phase1_passed" "$phase2_passed" + else + echo "[Phase 2] SKIP_GPU_MESH_VALIDATION=true -- pass-through" + phase2_passed="$phase1_passed" + fi + + # --------------------------------------------------------- + # Phase 3 -- Per-node NIC health + # Narrows by amd-nic=true before invoking; nodes without + # NIC labels are silently dropped at this gate (they + # cannot participate in NIC-requiring phases). + # --------------------------------------------------------- + if [[ "${SKIP_NIC_VALIDATION,,}" != "true" ]]; then + nic_capable=$(intersect "$phase2_passed" "$(nodes_with_label 'feature.node.kubernetes.io/amd-nic=true')") + echo "[Phase 3] nic-capable subset: ${nic_capable:-}" + if [[ -n "$nic_capable" ]]; then + run_phase3 "$nic_capable" + phase3_passed=$(filter_passed_nodes "$nic_capable" "$PHASE3_LABEL_KEY") + account_phase_failures "Phase 3" "$nic_capable" "$phase3_passed" + else + echo "[Phase 3] no NIC-capable nodes -- skipping" + phase3_passed="" + fi + else + echo "[Phase 3] SKIP_NIC_VALIDATION=true -- pass-through" + phase3_passed="$phase2_passed" + fi - echo -e "\n$(date): ===Step 4: Waiting for MPIJob completion===" - if kubectl wait mpijob "$new_job" --for=condition=Succeeded --timeout=${MPIJOB_WAIT_TIME}s; then - CLUSTER_VALIDATION_STATUS_LABEL=${SUCCESS_LABEL} - job_status=passed - echo "$(date): MPIJob $new_job succeeded ✅" - echo "[MPIJob Result: Passed]" + # --------------------------------------------------------- + # Phase 4 -- Pairwise rail bandwidth + # Dedicated run_phase4 wiring landed via; + # sources PHASE4_DRIVER_SCRIPT and the + # /phase4-configs mount carries the server + client + # Job templates from cluster-validation-phase4-job-config. + # --------------------------------------------------------- + if [[ "${SKIP_RAIL_BANDWIDTH_TEST,,}" != "true" ]]; then + run_phase4 "$phase3_passed" + phase4_passed=$(filter_passed_nodes "$phase3_passed" "$PHASE4_LABEL_KEY") + account_phase_failures "Phase 4" "$phase3_passed" "$phase4_passed" else - CLUSTER_VALIDATION_STATUS_LABEL=${FAILURE_LABEL} - job_status=failed - echo "$(date): MPIJob $new_job failed ❌" - echo "[MPIJob Result: Failed]" - sleep ${DEBUG_DELAY} + echo "[Phase 4] SKIP_RAIL_BANDWIDTH_TEST=true -- pass-through" + phase4_passed="$phase3_passed" fi - echo "==================================================================" - echo -e "\n$(date): ===Step 5: Labeling nodes based on MPIJob result===" - for n in $passed_nodes; do - echo "Labeling node $n with $CLUSTER_VALIDATION_STATUS_LABEL" - kubectl label node "$n" "$CLUSTER_VALIDATION_STATUS_LABEL" --overwrite - done - CANDIDATE_LABEL_KEY=${CANDIDATE_LABEL%%=*} - for n in $passed_nodes; do - echo "Removing candidate label on node: $n" - kubectl label node "$n" "${CANDIDATE_LABEL_KEY}-" --overwrite - done - echo "[CronJob Result: $CLUSTER_VALIDATION_STATUS_LABEL] Cluster Validation Status updated on Candidate Nodes" - echo "==================================================================" + # --------------------------------------------------------- + # Phase 4.5 -- Cross-node connectivity matrix + # No SKIP_* flag; maybe_run_phase45 is itself a pass-through + # unless PHASE45_SCRIPT is defined. + # --------------------------------------------------------- + phase45_passed=$(maybe_run_phase45 "$phase4_passed") - echo -e "\n$(date): ===Retrieving launcher pod logs===" - LAUNCHER_POD=$(kubectl get pods -l "training.kubeflow.org/job-name=${new_job},training.kubeflow.org/replica-type=launcher" \ - -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true) - if [ -n "$LAUNCHER_POD" ]; then - LAUNCHER_LOG_FILE="${LOG_DIR}/launcher-${new_job}.log" - echo "Saving launcher pod logs ($LAUNCHER_POD) to $LAUNCHER_LOG_FILE" - kubectl logs "$LAUNCHER_POD" --all-containers=true > "$LAUNCHER_LOG_FILE" 2>&1 || \ - echo "Warning: Failed to retrieve launcher pod logs" + # --------------------------------------------------------- + # Phase 5 -- Multi-node RCCL via MPIJob + # --------------------------------------------------------- + if [[ "${SKIP_RCCL_TEST,,}" != "true" ]]; then + run_phase5 "$phase45_passed" + phase5_passed=$(filter_passed_nodes "$phase45_passed" "$PHASE5_LABEL_KEY") + account_phase_failures "Phase 5" "$phase45_passed" "$phase5_passed" else - echo "Warning: Could not find launcher pod for MPIJob $new_job" + echo "[Phase 5] SKIP_RCCL_TEST=true -- pass-through" + phase5_passed="$phase45_passed" fi - echo "==================================================================" - echo -e "\n$(date): ===Step 6: Cleaning up old MPIJobs===" - mpijobs=$(kubectl get mpijobs -o jsonpath='{.items[*].metadata.name}' \ - | tr ' ' '\n' | grep '^cluster-validation-mpi-job-' | sort) - count=$(echo "$mpijobs" | wc -l) - keep=3 # <-update before deploy - if [ "$count" -gt "$keep" ]; then - del=$(echo "$mpijobs" | head -n -"$keep") - for job in $del; do - echo "Deleting old MPIJob: $job" - kubectl delete mpijob "$job" --ignore-not-found + # --------------------------------------------------------- + # Cleanup -- old MPIJobs (only meaningful when Phase 5 ran) + # --------------------------------------------------------- + cleanup_old_mpijobs() { + if [[ "${DRY_RUN}" == "1" ]]; then + echo "[Cleanup] DRY_RUN -- skipping old-MPIJob cleanup" + return 0 + fi + echo -e "\n$(date): ===Cleanup: pruning old MPIJobs===" + local mpijobs count keep del + mpijobs=$(kubectl get mpijobs -o jsonpath='{.items[*].metadata.name}' 2>/dev/null \ + | tr ' ' '\n' | grep '^cluster-validation-mpi-job-' | sort || true) + if [[ -z "$mpijobs" ]]; then + echo "[Cleanup] no cluster-validation MPIJobs present" + return 0 + fi + count=$(echo "$mpijobs" | wc -l) + keep=3 # <-update before deploy + if [[ "$count" -gt "$keep" ]]; then + del=$(echo "$mpijobs" | head -n -"$keep") + for job in $del; do + echo "[Cleanup] Deleting old MPIJob: $job" + kubectl delete mpijob "$job" --ignore-not-found + done + fi + } + + # --------------------------------------------------------- + # Launcher log collection -- only when Phase 5 produced an + # MPIJob (i.e. SKIP_RCCL_TEST != true and the phase script + # actually submitted one). The orchestrator does not own + # the job name; we discover any active launcher pods. + # --------------------------------------------------------- + collect_launcher_logs() { + if [[ "${DRY_RUN}" == "1" ]]; then + echo "[Logs] DRY_RUN -- skipping launcher log collection" + return 0 + fi + if [[ "${SKIP_RCCL_TEST,,}" == "true" ]]; then + return 0 + fi + echo -e "\n$(date): ===Collecting launcher pod logs===" + local launcher_pods pod + launcher_pods=$(kubectl get pods \ + -l "training.kubeflow.org/replica-type=launcher" \ + -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || true) + if [[ -z "$launcher_pods" ]]; then + echo "[Logs] no launcher pods found" + return 0 + fi + for pod in $launcher_pods; do + local log_file="${LOG_DIR}/launcher-${pod}.log" + echo "[Logs] saving ${pod} -> ${log_file}" + kubectl logs "$pod" --all-containers=true > "$log_file" 2>&1 \ + || echo "[Logs] WARNING: failed to retrieve logs for $pod" done - fi + } + + cleanup_old_mpijobs + collect_launcher_logs - # Fail the overall cronjob if any test runner jobs failed - if [ $failed_count -gt 0 ]; then - echo "Test runner jobs failed on $failed_count node(s). Failing cronjob." - echo "[CronJob Result: FAILED] ❌" - sleep ${DEBUG_DELAY} + # --------------------------------------------------------- + # Fail-loud exit (design §6) + # --------------------------------------------------------- + echo "==================================================================" + if [[ "$ANY_PHASE_FAILED" -ne 0 ]]; then + echo "[CronJob Result: FAILED] one or more phases produced failed nodes" + sleep "${DEBUG_DELAY}" exit 1 fi - echo "[CronJob Completed] $(date)" echo "==================================================================" volumeMounts: @@ -477,6 +1747,40 @@ spec: mountPath: /mpi-configs - name: test-runner-configs mountPath: /test-runner-configs + # Phase 2 per-node Job template + # source for PHASE2_SCRIPT. PHASE2_SCRIPT sed-renders the + # template at /phase2-configs/cluster-validation-phase2-job-config.yaml + # (substituting $$NODE per Phase-1-passed node) and applies + # the resulting batch/v1.Job. Mirrors the mpi-configs and + # test-runner-configs mounts above; backed by the + # cluster-validation-phase2-job-config ConfigMap. + - name: phase2-configs + mountPath: /phase2-configs + # Phase 3 per-node Job template + # source for PHASE3_SCRIPT. PHASE3_SCRIPT sed-renders the + # template at /phase3-configs/cluster-validation-phase3-job-config.yaml + # (substituting $$NODE and $$EXPECTED_NIC_COUNT per + # nic-capable Phase-2-passed node) and applies the resulting + # batch/v1.Job. Mirrors the phase2-configs mount above; + # backed by the cluster-validation-phase3-job-config + # ConfigMap. + - name: phase3-configs + mountPath: /phase3-configs + # Phase 4 per-(pair, rail) Job + # template source for PHASE4_DRIVER_SCRIPT. + # PHASE4_DRIVER_SCRIPT sed-renders the server and client + # templates at + # /phase4-configs/cluster-validation-phase4-server-job-config.yaml + # and + # /phase4-configs/cluster-validation-phase4-client-job-config.yaml + # (substituting $$NODE, $$RAIL_IDX, $$NAD_NAME, and + # $$PEER_POD_IP per (pair, rail) instance) and applies + # the resulting batch/v1.Jobs. Mirrors the phase2-configs + # and phase3-configs mounts above; backed by the + # cluster-validation-phase4-job-config ConfigMap + #. + - name: phase4-configs + mountPath: /phase4-configs - name: cluster-validation-logs mountPath: /var/log/cluster-validation volumes: @@ -486,6 +1790,28 @@ spec: - name: test-runner-configs configMap: name: cluster-validation-test-runner-job-config + # backing ConfigMap for the + # /phase2-configs mount above. cluster-validation-phase2-job-config + # is defined earlier in this manifest and contains + # the per-node Phase 2 RCCL all_reduce_perf Job template. + - name: phase2-configs + configMap: + name: cluster-validation-phase2-job-config + # backing ConfigMap for the + # /phase3-configs mount above. cluster-validation-phase3-job-config + # is defined earlier in this manifest and contains + # the per-node Phase 3 NIC-health Job template. + - name: phase3-configs + configMap: + name: cluster-validation-phase3-job-config + # backing ConfigMap for the + # /phase4-configs mount above. cluster-validation-phase4-job-config + # is defined earlier in this manifest + # and contains the per-(pair, rail) Phase 4 ib_write_bw + # server and client Job templates under separate data keys. + - name: phase4-configs + configMap: + name: cluster-validation-phase4-job-config - name: cluster-validation-logs hostPath: path: /var/log/cluster-validation diff --git a/example/gpu-validation-cluster/configs/config.json b/example/gpu-validation-cluster/configs/config.json index 40e96595d..b5dffe294 100644 --- a/example/gpu-validation-cluster/configs/config.json +++ b/example/gpu-validation-cluster/configs/config.json @@ -46,7 +46,7 @@ "version": "v0.4.0" }, "cronjob": { - "schedule": "*/10 * * * *" + "schedule": "*/30 * * * *" }, "node-selector-labels": [ "feature.node.kubernetes.io/amd-gpu=true", @@ -57,13 +57,44 @@ "launcher-replicas": 1, "slots-per-worker": 8, "gpu-per-worker": 8, - "pf-nic-per-worker": 0, - "vf-nic-per-worker": 8, - "node-validation-interval-mins": 10 + "pf-nic-per-worker": 8, + "vf-nic-per-worker": 0, + "node-validation-interval-mins": 30 }, "skip-tests": { - "skip-gpu-validation": false, - "skip-rccl-test": false + "skip-phase1-gpu-hw-acceptance": true, + "skip-phase1-stages": { + "skip-phase1-gpu-stress": true, + "skip-phase1-xgmi-lvl1": true, + "skip-phase1-pcie-lvl1": true, + "skip-phase1-hbm-lvl1": true + }, + "skip-phase2-gpu-mesh-validation": false, + "skip-phase3-nic-validation": false, + "skip-phase4-rail-bandwidth-test": false, + "skip-phase5-rccl-test": false + }, + "timeouts": { + "phase1-stages-secs": { + "phase1-gpu-stress": 3600, + "phase1-xgmi-lvl1": 1200, + "phase1-pcie-lvl1": 1200, + "phase1-hbm-lvl1": 1200 + }, + "phase2-job-wait-secs": 600, + "phase3-job-wait-secs": 300, + "phase4-pair-wait-secs": 300, + "phase5-mpijob-wait-secs": 300 + }, + "images": { + "roce-workload": "docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_rccl-7.0.2_anp-v1.2.0_ainic-1.117.5-a-56", + "test-runner": { + "rvs": "docker.io/rocm/test-runner:v1.4.0", + "agfhc": "docker.io/amdpsdo/test-runner:agfhc-v1.5.0-4" + }, + "orchestrator": "docker.io/bitnamilegacy/kubectl:1.33.4", + "preflight-init": "docker.io/bitnamilegacy/kubectl:1.33.4", + "nic-health": "docker.io/rocm/network-operator-utils:v1.1.0" } } } diff --git a/example/gpu-validation-cluster/configs/nad-per-rail.yaml b/example/gpu-validation-cluster/configs/nad-per-rail.yaml new file mode 100644 index 000000000..08d08d02b --- /dev/null +++ b/example/gpu-validation-cluster/configs/nad-per-rail.yaml @@ -0,0 +1,194 @@ +# Per-rail NetworkAttachmentDefinition manifests +# See the phase design notes +# +# Purpose +# ------- +# Phase 4 (pairwise rail bandwidth test) requires one +# NetworkAttachmentDefinition per rail (0.7). Each NAD pins a pod to a +# single rail's NIC so that `ib_write_bw -d $DEV` runs against exactly +# the rail under test, with no cross-rail traffic leaking through +# Multus' generic PF/VF NAD. +# +# Naming convention (locked by the Phase 4 design): +# amd-host-device-nad-rail-${RAIL_IDX} RAIL_IDX in 0.7 +# +# This file deploys all 8 NADs into the namespace where Phase 4 Jobs +# run (default: `kube-system` -- matches the existing PF/VF NAD +# placement). Override the namespace with `kustomize` or +# `kubectl apply -n ` if the cluster uses a non-default namespace +# for Multus NADs. +# +# Conditional deployment +# ---------------------- +# If the cluster already ships per-rail NADs (verify with +# `kubectl get net-attach-def -A | grep amd-host-device-nad-rail-`), +# this file is a no-op -- skip applying it. Phase 4 (.785) +# only reads the NADs by name; how they got there does not matter. +# +# Per-fleet PCI customization +# --------------------------- +# `host-device` CNI accepts EITHER an interface name (`device`) OR a +# PCI address (`pciBusID`). The default below uses the netdev naming +# convention `rdma_dev_${RAIL_IDX}` which mirrors `PHASE4_IB_DEV_PREFIX` +# from the Phase 4 ConfigMap and matches the +# RDMA verbs device name `ib_write_bw -d rdma_dev_` expects. +# +# If your fleet uses different netdev names (e.g. `ensf0`), either: +# (a) replace each NAD's `"device"` field with the fleet-specific +# interface name, or +# (b) replace `"device": "rdma_dev_N"` with +# `"pciBusID": "0000::."` for the rail's NIC. +# `lspci -d 1dd8:` on a target node lists the 8 Pensando NICs in PCI +# order; rail 0 is conventionally the lowest BDF, rail 7 the highest. +# +# IPAM +# ---- +# `host-device` does not require IPAM because the host device is moved +# into the pod's netns with its existing kernel addressing. Phase 4's +# `ib_write_bw` connects over RDMA verbs, not the netdev IP stack, so +# the netdev's L3 config (or absence thereof) does not affect the test. +# An `ipam: { "type": "static" }` block is intentionally omitted. +--- +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: amd-host-device-nad-rail-0 + namespace: kube-system + labels: + app.kubernetes.io/part-of: gpu-validation-cluster + app.kubernetes.io/component: phase4-rail-nad + amd.com/rail-index: "0" +spec: + config: | + { + "cniVersion": "0.3.1", + "type": "host-device", + "name": "amd-host-device-nad-rail-0", + "device": "rdma_dev_0" + } +--- +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: amd-host-device-nad-rail-1 + namespace: kube-system + labels: + app.kubernetes.io/part-of: gpu-validation-cluster + app.kubernetes.io/component: phase4-rail-nad + amd.com/rail-index: "1" +spec: + config: | + { + "cniVersion": "0.3.1", + "type": "host-device", + "name": "amd-host-device-nad-rail-1", + "device": "rdma_dev_1" + } +--- +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: amd-host-device-nad-rail-2 + namespace: kube-system + labels: + app.kubernetes.io/part-of: gpu-validation-cluster + app.kubernetes.io/component: phase4-rail-nad + amd.com/rail-index: "2" +spec: + config: | + { + "cniVersion": "0.3.1", + "type": "host-device", + "name": "amd-host-device-nad-rail-2", + "device": "rdma_dev_2" + } +--- +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: amd-host-device-nad-rail-3 + namespace: kube-system + labels: + app.kubernetes.io/part-of: gpu-validation-cluster + app.kubernetes.io/component: phase4-rail-nad + amd.com/rail-index: "3" +spec: + config: | + { + "cniVersion": "0.3.1", + "type": "host-device", + "name": "amd-host-device-nad-rail-3", + "device": "rdma_dev_3" + } +--- +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: amd-host-device-nad-rail-4 + namespace: kube-system + labels: + app.kubernetes.io/part-of: gpu-validation-cluster + app.kubernetes.io/component: phase4-rail-nad + amd.com/rail-index: "4" +spec: + config: | + { + "cniVersion": "0.3.1", + "type": "host-device", + "name": "amd-host-device-nad-rail-4", + "device": "rdma_dev_4" + } +--- +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: amd-host-device-nad-rail-5 + namespace: kube-system + labels: + app.kubernetes.io/part-of: gpu-validation-cluster + app.kubernetes.io/component: phase4-rail-nad + amd.com/rail-index: "5" +spec: + config: | + { + "cniVersion": "0.3.1", + "type": "host-device", + "name": "amd-host-device-nad-rail-5", + "device": "rdma_dev_5" + } +--- +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: amd-host-device-nad-rail-6 + namespace: kube-system + labels: + app.kubernetes.io/part-of: gpu-validation-cluster + app.kubernetes.io/component: phase4-rail-nad + amd.com/rail-index: "6" +spec: + config: | + { + "cniVersion": "0.3.1", + "type": "host-device", + "name": "amd-host-device-nad-rail-6", + "device": "rdma_dev_6" + } +--- +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: amd-host-device-nad-rail-7 + namespace: kube-system + labels: + app.kubernetes.io/part-of: gpu-validation-cluster + app.kubernetes.io/component: phase4-rail-nad + amd.com/rail-index: "7" +spec: + config: | + { + "cniVersion": "0.3.1", + "type": "host-device", + "name": "amd-host-device-nad-rail-7", + "device": "rdma_dev_7" + } diff --git a/example/gpu-validation-cluster/gpu-cluster.sh b/example/gpu-validation-cluster/gpu-cluster.sh index f3ebd10bf..723ee2bb5 100755 --- a/example/gpu-validation-cluster/gpu-cluster.sh +++ b/example/gpu-validation-cluster/gpu-cluster.sh @@ -10,6 +10,7 @@ usage() { echo " build Build the Docker image" echo " run [args...] Run the node as server or agent" echo " teardown Tear down the cluster and clean up" + echo " reapply-cvf Re-apply CVF configs from config.json against the running server container" echo " get-token Run on server node to print agent join token" echo " status Show cluster validation framework status and recent runs" echo " node-status Show validation status per node" @@ -112,7 +113,14 @@ cmd_run() { mkdir -p "$SCRIPT_STATE_DIR/cni-bin" # Common docker run options + # --runtime=runc is pinned because some GPU hosts default dockerd to the + # AMD container runtime (Default Runtime: amd in `docker info`). The k3s + # control-plane container is privileged and doesn't need GPU passthrough, + # so we must not inherit a host default that points at a missing or + # GPU-specific runtime binary (would fail at container start with + # "amd-container-runtime: executable file not found in $PATH"). DOCKER_OPTS=( + "--runtime=runc" "--privileged" "--net=host" "--cgroupns=host" @@ -382,9 +390,52 @@ YAMEOF --create-namespace \ --set crds.enabled=true + wait_for_cert_manager_webhook "$CERT_MANAGER_NS" + echo "[INFO] cert-manager installation completed" } + # Block until the cert-manager webhook is actually able to admit + # requests. Without this, the very next step (install_amd_gpu_operator) + # applies Certificate/Issuer objects that are validated by + # cert-manager's admission webhook, and the helm install fails with + # "failed calling webhook ... no endpoints available for service + # cert-manager-webhook". This is especially likely right after a fresh + # container start, when k3s flaps (see entrypoint.sh supervise_k3s) and + # restarts the cert-manager pods, briefly emptying the webhook's + # Endpoints exactly when the GPU-operator install races ahead. + wait_for_cert_manager_webhook() { + local CERT_MANAGER_NS="$1" + + echo "[INFO] Waiting for cert-manager webhook to be ready..." + + # 1. Deployment rollout complete (all replicas Available). + docker exec "$CONTAINER_NAME" kubectl rollout status \ + deployment/cert-manager-webhook -n "$CERT_MANAGER_NS" --timeout=300s + + # 2. Service has at least one ready endpoint. rollout status can + # report complete a moment before the EndpointSlice is populated, + # so poll the Endpoints object until an address appears. + local MAX_RETRIES=60 + local RETRY_COUNT=0 + while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do + local EP + EP=$(docker exec "$CONTAINER_NAME" kubectl get endpoints cert-manager-webhook \ + -n "$CERT_MANAGER_NS" -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null) + if [ -n "$EP" ]; then + echo "[INFO] cert-manager webhook has endpoints: $EP" + break + fi + echo "[INFO] Waiting for cert-manager webhook endpoints... ($((RETRY_COUNT + 1))/$MAX_RETRIES)" + sleep 2 + RETRY_COUNT=$((RETRY_COUNT + 1)) + done + if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then + echo "[ERROR] cert-manager webhook has no endpoints after $MAX_RETRIES retries" + exit 1 + fi + } + # Function to install AMD GPU operator install_amd_gpu_operator() { if [ "$MODE" != "server" ]; then @@ -396,21 +447,58 @@ YAMEOF local AMD_GPU_CHART=$(read_config '.["amd-gpu-operator"].chart') local AMD_GPU_NS=$(read_config '.["amd-gpu-operator"].namespace') + local CERT_MANAGER_NS=$(read_config '.["cert-manager"].namespace') + echo "[INFO] Installing AMD GPU operator via Helm..." echo "[INFO] Version: $AMD_GPU_VERSION" - if docker exec "$CONTAINER_NAME" helm list -n "$AMD_GPU_NS" | grep -q amd-gpu-operator; then + # Only skip when an existing release is actually deployed. A prior + # attempt that failed mid-apply (e.g. the cert-manager webhook race) + # leaves a release in STATUS=failed which `helm list` still shows -- + # treating that as "installed" would silently ship a broken operator. + local RELEASE_STATUS + RELEASE_STATUS=$(docker exec "$CONTAINER_NAME" helm status amd-gpu-operator \ + -n "$AMD_GPU_NS" -o json 2>/dev/null | jq -r '.info.status // ""') + if [ "$RELEASE_STATUS" = "deployed" ]; then echo "[INFO] AMD GPU operator is already installed, skipping..." return fi + if [ -n "$RELEASE_STATUS" ]; then + echo "[WARN] AMD GPU operator release in status='$RELEASE_STATUS' -- uninstalling before retry" + docker exec "$CONTAINER_NAME" helm uninstall amd-gpu-operator -n "$AMD_GPU_NS" 2>/dev/null || true + fi docker exec "$CONTAINER_NAME" helm repo add rocm "$AMD_GPU_REPO" docker exec "$CONTAINER_NAME" helm repo update - docker exec "$CONTAINER_NAME" helm install amd-gpu-operator "$AMD_GPU_CHART" \ - --namespace "$AMD_GPU_NS" \ - --create-namespace \ - --version="$AMD_GPU_VERSION" + # Retry the install: its chart applies Certificate/Issuer objects + # gated by the cert-manager webhook, which can transiently lose its + # endpoints during post-start k3s flapping. Re-assert webhook + # readiness and clean up any partial release before each attempt. + local MAX_RETRIES=5 + local RETRY_COUNT=0 + local INSTALL_OK=false + while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do + wait_for_cert_manager_webhook "$CERT_MANAGER_NS" + + if docker exec "$CONTAINER_NAME" helm install amd-gpu-operator "$AMD_GPU_CHART" \ + --namespace "$AMD_GPU_NS" \ + --create-namespace \ + --version="$AMD_GPU_VERSION"; then + INSTALL_OK=true + break + fi + + RETRY_COUNT=$((RETRY_COUNT + 1)) + echo "[WARN] AMD GPU operator install failed (attempt $RETRY_COUNT/$MAX_RETRIES) -- cleaning up and retrying" + docker exec "$CONTAINER_NAME" helm uninstall amd-gpu-operator -n "$AMD_GPU_NS" 2>/dev/null || true + sleep 10 + done + + if [ "$INSTALL_OK" != true ]; then + echo "[ERROR] AMD GPU operator install failed after $MAX_RETRIES attempts" + exit 1 + fi echo "[INFO] AMD GPU operator installation completed" } @@ -421,6 +509,22 @@ YAMEOF return fi + # Optional: skip the network operator entirely for GPU-only + # clusters. Defaults to true (install) so existing configs are + # unaffected. When false, the network operator, its NetworkConfig + # CR, and Multus (which the operator ships) are all skipped -- only + # valid when no NIC-dependent phases (3/4/5) will run. + # NOTE: use an explicit null check, NOT jq's `// true`. The `//` + # alternative operator fires on `false` too (jq treats false as + # empty), so `false // true` => true and the flag would never take + # effect. `if . == null then true else . end` defaults only when + # the key is absent. + local INSTALL_NETWORK_OPERATOR=$(read_config 'if .["network-operator"]["install-network-operator"] == null then true else .["network-operator"]["install-network-operator"] end') + if [ "$INSTALL_NETWORK_OPERATOR" != "true" ]; then + echo "[INFO] network-operator.install-network-operator=false -- skipping network operator install" + return + fi + local NETWORK_VERSION=$(read_config '.["network-operator"].version') local NETWORK_REPO=$(read_config '.["network-operator"].repo') local NETWORK_CHART=$(read_config '.["network-operator"].chart') @@ -525,17 +629,23 @@ REGEOF " echo "[INFO] registries.yaml configured successfully, need to restart k3s server to apply changes" echo "[INFO] Killing k3s process to apply registry changes..." - docker exec "$CONTAINER_NAME" sh -c "pkill -9 k3s || true" - - echo "[INFO] Restarting k3s service..." - sleep 3 - docker exec -d "$CONTAINER_NAME" /usr/local/bin/k3s server --embedded-registry --disable=traefik --disable=servicelb - echo "[INFO] Waiting for k3s to be ready..." - sleep 10 - docker exec "$CONTAINER_NAME" sh -c "until kubectl get nodes &>/dev/null; do sleep 1; done" - echo "[INFO] k3s restarted successfully" + # Let the entrypoint supervisor own the restart. Previously this + # function did `pkill -9 k3s` AND then launched its own + # `k3s server ...` via `docker exec -d` -- but entrypoint.sh's + # supervise_k3s loop ALSO relaunches k3s the moment it sees the + # process exit. The result was two competing k3s servers fighting + # for port 6443, one of which kept crashing ("stabilized after N + # restarts"). Any kubectl/helm call that blipped during that race + # tripped `set -e` and aborted the whole (backgrounded) install. + # + # Fix: only kill k3s; the supervisor relaunches it (with the + # correct, single source-of-truth arg set, including + # --kubelet-arg=serialize-image-pulls=true). We then wait for the + # freshly-supervised apiserver to come back. + docker exec "$CONTAINER_NAME" sh -c "pkill -9 k3s || true" + echo "[INFO] Waiting for supervisor to relaunch k3s and the API to be ready..." RETRY_COUNT=0 while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do if docker exec "$CONTAINER_NAME" kubectl api-resources &>/dev/null; then @@ -619,6 +729,15 @@ DEVICEEOF " echo "[INFO] AMD GPU driver configuration applied successfully" + # The NetworkConfig CR below is owned by the network operator's + # CRD; skip it when the network operator was not installed, or the + # apply fails ("no matches for kind NetworkConfig"). + local INSTALL_NETWORK_OPERATOR=$(read_config 'if .["network-operator"]["install-network-operator"] == null then true else .["network-operator"]["install-network-operator"] end') + if [ "$INSTALL_NETWORK_OPERATOR" != "true" ]; then + echo "[INFO] network operator skipped -- not applying NetworkConfig" + return + fi + local INSTALL_IONIC_DRIVER=$(read_config '.["network-operator"]["install-ionic-driver"]') local NETWORK_NS=$(read_config '.["network-operator"].namespace') local REGISTRY_PORT=$(read_config '.["in-cluster-registry"].nodePort') @@ -659,6 +778,85 @@ EOF " } + # Wait for a workload (Deployment/DaemonSet) to finish rolling out, + # tolerating slow image pulls while failing fast on unrecoverable + # pull errors. `kubectl rollout status` already blocks until the + # rollout completes (it does NOT time out on a slow-but-progressing + # pull the way a fixed sleep-loop does); we run it under an overall + # deadline and, on each lapse, inspect pod events so an + # ErrImagePull/ImagePullBackOff with auth/not-found surfaces + # immediately instead of burning the whole budget. + # + # wait_for_rollout [hard_timeout_secs] [pod_selector] + wait_for_rollout() { + local target="$1" + local ns="$2" + local hard_timeout="${3:-900}" + local selector="${4:-}" + + echo "[INFO] Waiting for $target in $ns to roll out (hard timeout ${hard_timeout}s)..." + local deadline=$(( $(date +%s) + hard_timeout )) + + while :; do + local remaining=$(( deadline - $(date +%s) )) + if [ "$remaining" -le 0 ]; then + echo "[ERROR] $target did not become ready within ${hard_timeout}s" + _dump_pull_failures "$ns" "$selector" + return 1 + fi + + # rollout status returns 0 as soon as the workload is ready. + # Cap each attempt at 60s so we periodically re-check for fatal + # pull errors; the final attempt shrinks to whatever time is + # left before the hard deadline. + local step=$(( remaining < 60 ? remaining : 60 )) + if docker exec "$CONTAINER_NAME" kubectl rollout status "$target" -n "$ns" \ + --timeout="${step}s" >/dev/null 2>&1; then + echo "[INFO] $target is ready" + return 0 + fi + + # Not ready yet. Fast-fail on definitively broken pulls. + if _has_fatal_pull_error "$ns" "$selector"; then + echo "[ERROR] $target has an unrecoverable image-pull error -- not waiting out the timeout" + _dump_pull_failures "$ns" "$selector" + return 1 + fi + echo "[INFO] $target still progressing (image pull / scheduling); ${remaining}s left..." + done + } + + # Returns 0 if any pod (optionally matching ) in has a + # container blocked on an image pull for a reason that will not + # self-resolve (authn/authz failure or missing image/tag). + _has_fatal_pull_error() { + local ns="$1" + local selector="$2" + local sel_args=() + [ -n "$selector" ] && sel_args=(-l "$selector") + + local reasons + reasons=$(docker exec "$CONTAINER_NAME" kubectl get pods -n "$ns" "${sel_args[@]}" \ + -o jsonpath='{range .items[*].status.containerStatuses[*]}{.state.waiting.reason}={.state.waiting.message}{"\n"}{end}' \ + 2>/dev/null) + # ImagePullBackOff/ErrImagePull alone can be transient (registry + # blip); only treat as fatal when the message names an auth or + # not-found condition. + echo "$reasons" | grep -qiE '(unauthorized|authentication required|denied|forbidden|not found|manifest unknown|no such host|repository does not exist)' + } + + _dump_pull_failures() { + local ns="$1" + local selector="$2" + local sel_args=() + [ -n "$selector" ] && sel_args=(-l "$selector") + echo "[INFO] --- pod image-pull diagnostics ($ns) ---" + docker exec "$CONTAINER_NAME" kubectl get pods -n "$ns" "${sel_args[@]}" \ + -o wide 2>/dev/null | sed 's/^/[INFO] /' || true + docker exec "$CONTAINER_NAME" kubectl get events -n "$ns" \ + --field-selector type=Warning 2>/dev/null | grep -iE 'pull|image' | tail -10 | sed 's/^/[INFO] /' || true + } + # Function to configure CNI folder for all nodes configure_cni_folder() { echo "[INFO] Configuring CNI for node..." @@ -681,25 +879,65 @@ EOF echo "[INFO] CNI configuration completed" } - prepare_multus_artifacts() { - echo "[INFO] Preparing Multus CNI artifacts..." - # wait for CNI binaries to be available - # it could take time to pull the multus image from GitHub container registry - local MAX_RETRIES=70 - local RETRY_COUNT=0 - while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do - if docker exec "$CONTAINER_NAME" sh -c '[ "$(ls -A /opt/cni/bin)" ]'; then - break + # Block until a path inside the container is non-empty, tolerating a + # slow image pull. Unlike a fixed iteration count, this uses a wall- + # clock deadline and -- in server mode, where kubectl works -- fails + # fast on an unrecoverable multus image-pull error instead of waiting + # out the whole budget. + # + # wait_for_path [hard_timeout_secs] + wait_for_path() { + local path="$1" + local desc="$2" + local hard_timeout="${3:-300}" + local deadline=$(( $(date +%s) + hard_timeout )) + + while ! docker exec "$CONTAINER_NAME" sh -c "[ -e \"$path\" ] && [ -n \"\$(ls -A \"$path\" 2>/dev/null)\" ]"; do + if [ "$(date +%s)" -ge "$deadline" ]; then + echo "[ERROR] $desc not available within ${hard_timeout}s ($path)" + [ "$MODE" = "server" ] && _dump_pull_failures "kube-amd-network" "app.kubernetes.io/name=multus" + return 1 fi - echo "[INFO] Waiting for CNI binaries to be available... ($((RETRY_COUNT + 1))/$MAX_RETRIES)" + if [ "$MODE" = "server" ] && _has_fatal_pull_error "kube-amd-network" "app.kubernetes.io/name=multus"; then + echo "[ERROR] multus image pull failed unrecoverably -- not waiting for $desc" + _dump_pull_failures "kube-amd-network" "app.kubernetes.io/name=multus" + return 1 + fi + echo "[INFO] Waiting for $desc (image pull may be in progress)..." sleep 3 - RETRY_COUNT=$((RETRY_COUNT + 1)) done - if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then - echo "[ERROR] CNI binaries not found after $MAX_RETRIES retries" - exit 1 + } + + prepare_multus_artifacts() { + # Multus is shipped by the network operator. If it is skipped there + # is no multus DaemonSet, no /opt/cni/bin population, and no + # 00-multus.conflist -- so this whole step is a no-op. (Multus is + # only needed by NIC phases 3/4/5, which a GPU-only cluster skips.) + local INSTALL_NETWORK_OPERATOR=$(read_config 'if .["network-operator"]["install-network-operator"] == null then true else .["network-operator"]["install-network-operator"] end') + if [ "$INSTALL_NETWORK_OPERATOR" != "true" ]; then + echo "[INFO] network operator skipped -- skipping Multus CNI artifact preparation" + return + fi + + echo "[INFO] Preparing Multus CNI artifacts..." + + # The multus CNI binaries land in /opt/cni/bin only after the + # multus DaemonSet pod is up, which in turn waits on its image + # pull. On the server we can watch the DaemonSet rollout directly + # (progress-aware: tolerates a slow pull, fast-fails on a broken + # one) for a clean signal before falling through to the file check. + if [ "$MODE" = "server" ]; then + local MULTUS_DS + MULTUS_DS=$(docker exec "$CONTAINER_NAME" kubectl get ds -n kube-amd-network \ + -l app.kubernetes.io/name=multus -o name 2>/dev/null | head -1) + if [ -n "$MULTUS_DS" ]; then + wait_for_rollout "$MULTUS_DS" "kube-amd-network" 600 "app.kubernetes.io/name=multus" || exit 1 + fi fi + # wait for CNI binaries to be available (image pull can be slow) + wait_for_path /opt/cni/bin "CNI binaries" 300 || exit 1 + local CP_MAX_RETRIES=30 local CP_RETRY_COUNT=0 while [ $CP_RETRY_COUNT -lt $CP_MAX_RETRIES ]; do @@ -717,22 +955,56 @@ EOF fi # wait for 00-multus.conflist to be available - RETRY_COUNT=0 - while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do - if docker exec "$CONTAINER_NAME" sh -c '[ "$(ls -A /etc/cni/net.d/00-multus.conflist)" ]'; then - break - fi - echo "[INFO] Waiting for Multus CNI configuration to be available... ($((RETRY_COUNT + 1))/$MAX_RETRIES)" - sleep 2 - RETRY_COUNT=$((RETRY_COUNT + 1)) - done - if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then - echo "[ERROR] Multus CNI configuration not found after $MAX_RETRIES retries" - exit 1 - fi + wait_for_path /etc/cni/net.d/00-multus.conflist "Multus CNI configuration" 300 || exit 1 docker exec "$CONTAINER_NAME" cp /etc/cni/net.d/00-multus.conflist /etc/cni/net.d/00-multus.conf } + # _patch_embedded_image + # Some ConfigMaps embed a full YAML template as a string in their + # data field (e.g. cluster-validation-mpijob-config.yaml inside + # cluster-validation-mpijob-config). Container images inside those + # templates can't be patched via standard `kubectl patch cm` because + # the image line is buried in a string value, not a structured key. + # This helper reads the data key, runs a literal-text replacement on + # the image line, and writes the data key back via dry-run + apply. + # No-op when is empty (operator left the override blank) + # or matches (already at target). + _patch_embedded_image() { + local cm_name="$1" + local data_key="$2" + local old_image="$3" + local new_image="$4" + [ -z "$new_image" ] && return 0 + [ "$new_image" = "$old_image" ] && return 0 + local current=$(docker exec "$CONTAINER_NAME" kubectl get cm "$cm_name" -n default \ + -o jsonpath="{.data.${data_key//./\\.}}" 2>/dev/null) + if [ -z "$current" ]; then + echo "[WARN] _patch_embedded_image: ConfigMap '$cm_name' key '$data_key' not found; skipping image override" + return 0 + fi + if ! echo "$current" | grep -qF "image: $old_image"; then + echo "[INFO] _patch_embedded_image: '$cm_name'.'$data_key' no longer references '$old_image'; skipping" + return 0 + fi + echo "[INFO] Overriding image in $cm_name/$data_key: $old_image -> $new_image" + # Use bash literal substring replacement instead of sed. Image + # refs commonly contain `.` (e.g. docker.io, version tags like + # v1.4.0) which sed would treat as a regex "any char" wildcard, + # producing over-permissive matches. Bash `${var//search/replace}` + # is a literal-string replace -- no metachar escaping needed for + # either side, no risk of `.`/`&`/`|` surprises. + local search="image: $old_image" + local replace="image: $new_image" + local patched="${current//$search/$replace}" + # Use kubectl patch --type=merge with a jq-built single-key + # payload (jq handles multi-line + escape-sensitive characters + # correctly). Other keys in the CM are untouched. + local payload=$(jq -nc --arg k "$data_key" --arg v "$patched" \ + '{data: {($k): $v}}') + docker exec "$CONTAINER_NAME" kubectl patch cm "$cm_name" -n default \ + --type=merge -p "$payload" + } + install_cluster_validation_framework() { local INSTALL_CVF=$(read_config '.["cluster-validation-framework"]["install-cvf"]') if [ "$MODE" != "server" ] || [ "$INSTALL_CVF" != "true" ]; then @@ -741,27 +1013,63 @@ EOF echo "[INFO] Installing Cluster Validation Framework..." - # Read configuration values with defaults - local CRONJOB_SCHEDULE=$(read_config '.["cluster-validation-framework"].cronjob.schedule // "*/10 * * * *"') + # Read configuration values with defaults. These are used to + # construct a kubectl patch payload AFTER the YAML is applied, + # so the YAML stays valid standalone (operator can do + # `kubectl apply -f configs/*.yaml` without this script). + local CRONJOB_SCHEDULE=$(read_config '.["cluster-validation-framework"].cronjob.schedule // "*/30 * * * *"') local WORKER_REPLICAS=$(read_config '.["cluster-validation-framework"].resources["worker-replicas"] // 2') local LAUNCHER_REPLICAS=$(read_config '.["cluster-validation-framework"].resources["launcher-replicas"] // 1') local SLOTS_PER_WORKER=$(read_config '.["cluster-validation-framework"].resources["slots-per-worker"] // 8') local GPU_PER_WORKER=$(read_config '.["cluster-validation-framework"].resources["gpu-per-worker"] // 8') local PF_NIC_PER_WORKER=$(read_config '.["cluster-validation-framework"].resources["pf-nic-per-worker"] // 0') local VF_NIC_PER_WORKER=$(read_config '.["cluster-validation-framework"].resources["vf-nic-per-worker"] // 8') - local NODE_VALIDATION_INTERVAL=$(read_config '.["cluster-validation-framework"].resources["node-validation-interval-mins"] // 10') - local SKIP_GPU_VALIDATION=$(read_config '.["cluster-validation-framework"]["skip-tests"]["skip-gpu-validation"] // false') - local SKIP_RCCL_TEST=$(read_config '.["cluster-validation-framework"]["skip-tests"]["skip-rccl-test"] // false') - - # Read node selector labels with default values - local NODE_SELECTOR_LABELS=$(read_config '.["cluster-validation-framework"]["node-selector-labels"] // ["feature.node.kubernetes.io/amd-gpu=true", "feature.node.kubernetes.io/amd-nic=true"]') - # Convert JSON array to YAML list format with 4 spaces indentation (to match the placeholder indentation) - local NODE_SELECTOR_LABELS_YAML=$(echo "$NODE_SELECTOR_LABELS" | jq -r '.[] | " - " + .') + local NODE_VALIDATION_INTERVAL=$(read_config '.["cluster-validation-framework"].resources["node-validation-interval-mins"] // 30') + + # Per-phase skip flags (all 5 wired). JSON keys carry an + # explicit skip-phase{N}- prefix so the config is self-describing. + local SKIP_GPU_HW_ACCEPTANCE=$(read_config '.["cluster-validation-framework"]["skip-tests"]["skip-phase1-gpu-hw-acceptance"] // false') + local SKIP_GPU_MESH_VALIDATION=$(read_config '.["cluster-validation-framework"]["skip-tests"]["skip-phase2-gpu-mesh-validation"] // false') + local SKIP_NIC_VALIDATION=$(read_config '.["cluster-validation-framework"]["skip-tests"]["skip-phase3-nic-validation"] // false') + local SKIP_RAIL_BANDWIDTH_TEST=$(read_config '.["cluster-validation-framework"]["skip-tests"]["skip-phase4-rail-bandwidth-test"] // false') + local SKIP_RCCL_TEST=$(read_config '.["cluster-validation-framework"]["skip-tests"]["skip-phase5-rccl-test"] // false') + + # Per-Phase-1 stage skip map. Keys carry the same skip-phase1- + # prefix as the top-level Phase keys (so the JSON shape stays + # uniform); the renderer strips the prefix when looking up each + # stage's short Name (e.g. JSON key skip-phase1-gpu-stress -> + # stage Name "gpu-stress"). PHASE1_SCRIPT honours the resulting + # per-stage "Skip" field in GPU_VALIDATION_STAGES_JSON. + local PHASE1_STAGES_SKIP_MAP=$(read_config '.["cluster-validation-framework"]["skip-tests"]["skip-phase1-stages"] // {}') + + # Per-phase timeouts. Phase 1 carries a per-stage map; phases 2-5 + # carry single scalars. Empty string = keep YAML default. + local PHASE1_STAGES_TIMEOUT_MAP=$(read_config '.["cluster-validation-framework"].timeouts["phase1-stages-secs"] // {}') + local PHASE2_JOB_WAIT_SECS=$(read_config '.["cluster-validation-framework"].timeouts["phase2-job-wait-secs"] // ""') + local PHASE3_JOB_WAIT_SECS=$(read_config '.["cluster-validation-framework"].timeouts["phase3-job-wait-secs"] // ""') + local PHASE4_PAIR_WAIT_SECS=$(read_config '.["cluster-validation-framework"].timeouts["phase4-pair-wait-secs"] // ""') + local PHASE5_MPIJOB_WAIT_SECS=$(read_config '.["cluster-validation-framework"].timeouts["phase5-mpijob-wait-secs"] // ""') + + # Per-component image overrides. Empty string = keep YAML default. + local IMG_ROCE_WORKLOAD=$(read_config '.["cluster-validation-framework"].images["roce-workload"] // ""') + # test-runner is a per-framework map: { rvs: "...", agfhc: "..." }. + # The renderer looks up each Phase 1 stage by its lowercased + # Framework. Empty map = keep YAML defaults on every stage. + local IMG_TEST_RUNNER_MAP=$(read_config '.["cluster-validation-framework"].images["test-runner"] // {}') + local IMG_ORCHESTRATOR=$(read_config '.["cluster-validation-framework"].images["orchestrator"] // ""') + local IMG_PREFLIGHT_INIT=$(read_config '.["cluster-validation-framework"].images["preflight-init"] // ""') + local IMG_NIC_HEALTH=$(read_config '.["cluster-validation-framework"].images["nic-health"] // ""') + + # Node-selector labels: surface as a single newline-separated string + # (matches the NODE_SELECTOR_LABELS ConfigMap key shape). + local NODE_SELECTOR_LABELS=$(read_config '.["cluster-validation-framework"]["node-selector-labels"] // ["feature.node.kubernetes.io/amd-gpu=true"]') + local NODE_SELECTOR_LABELS_FLAT=$(echo "$NODE_SELECTOR_LABELS" | jq -r 'join("\n")') echo "[INFO] CronJob Schedule: $CRONJOB_SCHEDULE" echo "[INFO] Node Selector Labels: $(echo "$NODE_SELECTOR_LABELS" | jq -r 'join(", ")')" echo "[INFO] Resources - Workers: $WORKER_REPLICAS, GPUs/Worker: $GPU_PER_WORKER" - echo "[INFO] Skip GPU Validation: $SKIP_GPU_VALIDATION, Skip RCCL Test: $SKIP_RCCL_TEST" + echo "[INFO] Skip flags: HW=$SKIP_GPU_HW_ACCEPTANCE Mesh=$SKIP_GPU_MESH_VALIDATION NIC=$SKIP_NIC_VALIDATION Rail=$SKIP_RAIL_BANDWIDTH_TEST RCCL=$SKIP_RCCL_TEST" + echo "[INFO] Timeouts (s): P2=${PHASE2_JOB_WAIT_SECS:-default} P3=${PHASE3_JOB_WAIT_SECS:-default} P4=${PHASE4_PAIR_WAIT_SECS:-default} P5=${PHASE5_MPIJOB_WAIT_SECS:-default}; Node-interval(min)=$NODE_VALIDATION_INTERVAL" # Install MPI Operator echo "[INFO] Installing MPI Operator..." @@ -769,37 +1077,193 @@ EOF docker exec "$CONTAINER_NAME" kubectl apply --server-side --force-conflicts -f https://raw.githubusercontent.com/kubeflow/mpi-operator/$MPI_OPERATOR_VERSION/deploy/v2beta1/mpi-operator.yaml echo "[INFO] MPI Operator installation completed" - # Apply cluster-validation-config.yaml with substitutions + # ============================================================ + # Step A: apply YAMLs as-is (defaults take effect) + # ============================================================ echo "[INFO] Applying Cluster Validation ConfigMap..." - # Do substitutions using a bash while-loop to properly handle multiline content - docker exec "$CONTAINER_NAME" cat /configs/cluster-validation-config.yaml | \ - while IFS= read -r line; do - line="${line//__WORKER_REPLICAS__/$WORKER_REPLICAS}" - line="${line//__LAUNCHER_REPLICAS__/$LAUNCHER_REPLICAS}" - line="${line//__SLOTS_PER_WORKER__/$SLOTS_PER_WORKER}" - line="${line//__GPU_PER_WORKER__/$GPU_PER_WORKER}" - line="${line//__PF_NIC_PER_WORKER__/$PF_NIC_PER_WORKER}" - line="${line//__VF_NIC_PER_WORKER__/$VF_NIC_PER_WORKER}" - line="${line//__NODE_VALIDATION_INTERVAL_MINS__/$NODE_VALIDATION_INTERVAL}" - line="${line//__SKIP_GPU_VALIDATION__/$SKIP_GPU_VALIDATION}" - line="${line//__SKIP_RCCL_TEST__/$SKIP_RCCL_TEST}" - # Handle multiline NODE_SELECTOR_LABELS replacement - if [[ "$line" == *"__NODE_SELECTOR_LABELS__"* ]]; then - echo "$NODE_SELECTOR_LABELS_YAML" - else - echo "$line" - fi - done | docker exec -i "$CONTAINER_NAME" kubectl apply -f - + docker exec "$CONTAINER_NAME" kubectl apply -f /configs/cluster-validation-config.yaml - # Apply cluster-validation-job.yaml with substitutions echo "[INFO] Applying Cluster Validation CronJob..." - docker exec "$CONTAINER_NAME" sh -c "cat /configs/cluster-validation-job.yaml | \ - sed 's|__CRONJOB_SCHEDULE__|'\"${CRONJOB_SCHEDULE}\"'|g' | \ - kubectl apply -f -" + docker exec "$CONTAINER_NAME" kubectl apply -f /configs/cluster-validation-job.yaml + + # ============================================================ + # Step B: patch CM scalar keys from config.json (no-op when + # the JSON values already match the YAML defaults). + # ============================================================ + echo "[INFO] Applying ConfigMap overrides from config.json..." + local CM_PATCH=$(jq -nc \ + --arg wr "$WORKER_REPLICAS" \ + --arg lr "$LAUNCHER_REPLICAS" \ + --arg sw "$SLOTS_PER_WORKER" \ + --arg gw "$GPU_PER_WORKER" \ + --arg pf "$PF_NIC_PER_WORKER" \ + --arg vf "$VF_NIC_PER_WORKER" \ + --arg iv "$NODE_VALIDATION_INTERVAL" \ + --arg sa "$SKIP_GPU_HW_ACCEPTANCE" \ + --arg sm "$SKIP_GPU_MESH_VALIDATION" \ + --arg sn "$SKIP_NIC_VALIDATION" \ + --arg sb "$SKIP_RAIL_BANDWIDTH_TEST" \ + --arg sr "$SKIP_RCCL_TEST" \ + --arg nsl "$NODE_SELECTOR_LABELS_FLAT" \ + --arg roceimg "$IMG_ROCE_WORKLOAD" \ + --arg t2 "$PHASE2_JOB_WAIT_SECS" \ + --arg t3 "$PHASE3_JOB_WAIT_SECS" \ + --arg t4 "$PHASE4_PAIR_WAIT_SECS" \ + --arg t5 "$PHASE5_MPIJOB_WAIT_SECS" \ + '{data: ({ + WORKER_REPLICAS: $wr, LAUNCHER_REPLICAS: $lr, + SLOTS_PER_WORKER: $sw, GPU_PER_WORKER: $gw, + PF_NIC_PER_WORKER: $pf, VF_NIC_PER_WORKER: $vf, + NODE_VALIDATION_INTERVAL_MINS: $iv, + SKIP_GPU_HW_ACCEPTANCE: $sa, SKIP_GPU_MESH_VALIDATION: $sm, + SKIP_NIC_VALIDATION: $sn, SKIP_RAIL_BANDWIDTH_TEST: $sb, + SKIP_RCCL_TEST: $sr, + NODE_SELECTOR_LABELS: ($nsl + "\n") + } + + (if $roceimg != "" then {ROCE_WORKLOAD_IMAGE: $roceimg} else {} end) + + (if $t2 != "" then {PHASE2_JOB_WAIT_TIME: $t2} else {} end) + + (if $t3 != "" then {PHASE3_JOB_WAIT_TIME: $t3} else {} end) + + (if $t4 != "" then {PHASE4_PAIR_WAIT_TIME: $t4} else {} end) + + (if $t5 != "" then {MPIJOB_WAIT_TIME: $t5} else {} end) + )}') + docker exec "$CONTAINER_NAME" kubectl patch cm cluster-validation-config -n default \ + --type=merge -p "$CM_PATCH" + + # ============================================================ + # Step C: patch the Phase 1 stages JSON inside the CM (per-stage + # Skip flag + optional global test-runner image override). + # Read the freshly-applied YAML's stages, mutate via jq, write + # back as a single-key patch. + # ============================================================ + local CURRENT_STAGES=$(docker exec "$CONTAINER_NAME" kubectl get cm cluster-validation-config -n default \ + -o jsonpath='{.data.GPU_VALIDATION_STAGES_JSON}') + if [ -n "$CURRENT_STAGES" ]; then + # Skip-map and timeout-map keys for a stage with Name="gpu-stress" + # are "skip-phase1-gpu-stress" / "phase1-gpu-stress". jq prepends + # the appropriate prefix during lookup so stage Names in YAML + # stay short. TimeoutSeconds override is applied only when the + # timeout map carries an entry; otherwise the YAML default + # survives. Image override is looked up by lowercased Framework + # (rvs/agfhc) so RVS and AGFHC test-runner releases can be pinned + # independently. + local NEW_STAGES=$(jq -c --argjson skipmap "$PHASE1_STAGES_SKIP_MAP" \ + --argjson timeoutmap "$PHASE1_STAGES_TIMEOUT_MAP" \ + --argjson trimgmap "$IMG_TEST_RUNNER_MAP" ' + map( + . + {Skip: ($skipmap[("skip-phase1-" + .Name)] // false)} + | (if ($timeoutmap[("phase1-" + .Name)] // null) != null + then .TimeoutSeconds = $timeoutmap[("phase1-" + .Name)] + else . end) + | (.Framework as $fw + | ($trimgmap[$fw | ascii_downcase] // "") as $img + | if $img != "" then .Image = $img else . end) + )' <<<"$CURRENT_STAGES") + local STAGES_PATCH=$(jq -nc --arg s "$NEW_STAGES" '{data: {GPU_VALIDATION_STAGES_JSON: $s}}') + docker exec "$CONTAINER_NAME" kubectl patch cm cluster-validation-config -n default \ + --type=merge -p "$STAGES_PATCH" + fi + + # ============================================================ + # Step D: patch the CronJob (schedule + orchestrator image). + # ============================================================ + docker exec "$CONTAINER_NAME" kubectl patch cronjob cluster-validation-cron-job -n default \ + --type=merge -p "{\"spec\":{\"schedule\":\"$CRONJOB_SCHEDULE\"}}" + if [ -n "$IMG_ORCHESTRATOR" ]; then + docker exec "$CONTAINER_NAME" kubectl set image \ + cronjob/cluster-validation-cron-job submit-mpijob="$IMG_ORCHESTRATOR" -n default + fi + + # ============================================================ + # Step E: patch images embedded inside other ConfigMap data + # (the preflight init container and the Phase 3 nic-health Job + # are nested in YAML strings inside their CMs). Only runs when + # the config.json override is non-empty AND differs from the + # YAML default. + # ============================================================ + _patch_embedded_image cluster-validation-mpijob-config \ + "cluster-validation-mpijob-config.yaml" \ + "docker.io/bitnamilegacy/kubectl:1.33.4" \ + "$IMG_PREFLIGHT_INIT" + _patch_embedded_image cluster-validation-phase3-job-config \ + "cluster-validation-phase3-job-config.yaml" \ + "docker.io/rocm/network-operator-utils:v1.1.0" \ + "$IMG_NIC_HEALTH" + + # Apply per-rail NetworkAttachmentDefinitions (Phase 4 prerequisite). + # Phase 4's pairwise rail bandwidth test pins each pod to a specific + # rail's NIC via the annotation + # k8s.v1.cni.cncf.io/networks: amd-host-device-nad-rail-${RAIL_IDX} + # (cluster-validation-config.yaml: PHASE4_NAD_NAME_PREFIX). Without + # these NADs, every rail-test exits as `ib-write-bw-crashed`. + # + # The Multus CRD is installed by the AMD network-operator helm + # chart, which runs ASYNCHRONOUSLY relative to this script: the + # chart is staged by k3s on container start and may take up to + # ~2 min to complete after the API server is ready. We poll for + # the CRD (correct name: `network-attachment-definitions.k8s.cni.cncf.io` + # -- plural form with dashes per the upstream Multus deployment) + # then apply. `kubectl apply -f` is idempotent so re-applies on + # subsequent `run server` invocations are no-ops. + # Multus (which provides the NAD CRD) ships with the network + # operator. If it was skipped, the CRD will never appear -- don't + # waste the poll budget; Phase 4 is expected to be skipped too. + local INSTALL_NETWORK_OPERATOR=$(read_config 'if .["network-operator"]["install-network-operator"] == null then true else .["network-operator"]["install-network-operator"] end') + if [ "$INSTALL_NETWORK_OPERATOR" != "true" ]; then + echo "[INFO] network operator skipped -- not applying per-rail NADs (Phase 4 unavailable)" + echo "[INFO] Cluster Validation Framework installation completed" + return + fi + + echo "[INFO] Applying Phase 4 per-rail NetworkAttachmentDefinitions..." + local NAD_CRD_NAME="network-attachment-definitions.k8s.cni.cncf.io" + local NAD_WAIT_TIMEOUT=180 # seconds + local nad_deadline=$(( $(date +%s) + NAD_WAIT_TIMEOUT )) + local nad_applied=false + while [ "$(date +%s)" -lt "$nad_deadline" ]; do + if docker exec "$CONTAINER_NAME" kubectl get crd "$NAD_CRD_NAME" >/dev/null 2>&1; then + docker exec "$CONTAINER_NAME" kubectl wait --for=condition=Established \ + --timeout=30s "crd/$NAD_CRD_NAME" >/dev/null 2>&1 || true + if docker exec "$CONTAINER_NAME" kubectl apply -f /configs/nad-per-rail.yaml; then + nad_applied=true + fi + break + fi + echo "[INFO] Waiting for Multus CRD ($NAD_CRD_NAME) -- network-operator install in progress..." + sleep 5 + done + if [ "$nad_applied" = "false" ]; then + echo "[WARN] Multus CRD did not appear within ${NAD_WAIT_TIMEOUT}s -- skipping per-rail NADs." + echo "[WARN] Phase 4 (rail bandwidth) will fail until Multus + per-rail NADs are present." + echo "[WARN] After Multus is up, re-apply manually: kubectl apply -f /configs/nad-per-rail.yaml" + fi echo "[INFO] Cluster Validation Framework installation completed" } + # ============================================================ + # REAPPLY_CVF_ONLY fast-path: re-run install_cluster_validation_framework + # against the already-running 'server' container, then exit. + # Driven by `cmd_reapply_cvf`. The full bringup (docker run, k3s + # start, driver install, multus artifacts) is skipped — only the + # apply-then-patch CVF block runs. Caller must guarantee the + # server container is already up; we re-check here. + # ============================================================ + if [ "${REAPPLY_CVF_ONLY:-}" = "true" ]; then + if [ "$MODE" != "server" ]; then + echo "[ERROR] REAPPLY_CVF_ONLY requires MODE=server (got: $MODE)" + exit 1 + fi + if ! docker ps --format '{{.Names}}' | grep -qx "$CONTAINER_NAME"; then + echo "[ERROR] reapply-cvf: container '$CONTAINER_NAME' is not running" + echo "[INFO] Bring up the cluster first: $0 run server" + exit 1 + fi + echo "[INFO] Reapply mode: re-running CVF apply+patch against existing '$CONTAINER_NAME' container" + install_cluster_validation_framework + echo "[INFO] CVF reapply completed" + return 0 + fi + # Print sanitized command without exposing sensitive information if [ "$MODE" = "agent" ]; then echo "[INFO] Starting k3s agent container with masked credentials..." @@ -823,27 +1287,69 @@ EOF echo "[INFO] Waiting for k3s to be ready..." sleep 10 - configure_cni_folder + # All post-start bringup steps, in order. Every step is individually + # idempotent (each install checks for existing state and skips, or + # re-applies harmlessly), so the whole sequence can be safely re-run. + run_bringup_steps() { + configure_cni_folder - if [ "$MODE" = "server" ]; then - configure_server_registries + if [ "$MODE" = "server" ]; then + configure_server_registries - setup_in_cluster_registry + setup_in_cluster_registry - install_cert_manager + install_cert_manager - install_amd_gpu_operator + install_amd_gpu_operator - install_network_operator - fi + install_network_operator + fi - prepare_multus_artifacts + prepare_multus_artifacts - if [ "$MODE" = "server" ]; then - install_driver + if [ "$MODE" = "server" ]; then + install_driver - install_cluster_validation_framework - fi + install_cluster_validation_framework + fi + } + + # Run the bringup sequence with bounded retries. A single transient + # failure -- e.g. a kubectl/helm call that blips while k3s is still + # flapping right after container start (see entrypoint.sh + # supervise_k3s) -- used to abort the whole sequence under `set -e`, + # leaving operators half-installed (or not installed at all) while the + # backgrounded script exited and the playbook still reported success. + # Because every step is idempotent, retrying the sequence lets the + # bringup self-heal once k3s settles. + # + # Each attempt runs in a subshell with its own `set -e` so a step that + # fails (these helpers abort via `exit 1` on timeout) ends only that + # attempt; we capture the status with `set +e` so the outer errexit + # doesn't kill the script before we can retry. + BRINGUP_MAX_ATTEMPTS="${BRINGUP_MAX_ATTEMPTS:-5}" + bringup_attempt=1 + while true; do + echo "[INFO] Bringup steps: attempt ${bringup_attempt}/${BRINGUP_MAX_ATTEMPTS}" + set +e + ( set -e; run_bringup_steps ) + bringup_rc=$? + set -e + + if [ "$bringup_rc" -eq 0 ]; then + echo "[INFO] Bringup steps completed on attempt ${bringup_attempt}" + break + fi + + if [ "$bringup_attempt" -ge "$BRINGUP_MAX_ATTEMPTS" ]; then + echo "[ERROR] Bringup steps failed after ${BRINGUP_MAX_ATTEMPTS} attempts (last rc=${bringup_rc})" + exit 1 + fi + + echo "[WARN] Bringup steps failed on attempt ${bringup_attempt} (rc=${bringup_rc}) -- retrying after backoff" + sleep 15 + bringup_attempt=$(( bringup_attempt + 1 )) + done echo "[INFO] Node Bringup completed successfully" echo "[INFO] Container is running with restart policy 'unless-stopped'" @@ -858,6 +1364,13 @@ EOF wait $CONTAINER_PID } +cmd_reapply_cvf() { + # Reapply CVF configs from config.json against the already-running + # server container. Use after editing configs/*.yaml or config.json + # without tearing down the cluster. + REAPPLY_CVF_ONLY=true cmd_run server +} + cmd_teardown() { echo "Starting GPU Validation Cluster teardown..." @@ -1141,6 +1654,9 @@ case "$COMMAND" in teardown) cmd_teardown "$@" ;; + reapply-cvf) + cmd_reapply_cvf "$@" + ;; get-token) cmd_get_token "$@" ;; diff --git a/example/gpu-validation-cluster/tests/README.md b/example/gpu-validation-cluster/tests/README.md new file mode 100644 index 000000000..da61b54e6 --- /dev/null +++ b/example/gpu-validation-cluster/tests/README.md @@ -0,0 +1,374 @@ +# gpu-validation-cluster bash unit tests + +These tests exercise the bash artifacts shipped by the +`cluster-validation-config` ConfigMap and the +`cluster-validation-cron-job` orchestrator without requiring a real +Kubernetes cluster. They run anywhere `bash`, `awk`, `sed`, and +`grep` are available. + +## What is covered + +- **`PHASE3_CHECK_SCRIPT` + `PHASE3_SCRIPT`** — + per-node NIC health gate (NIC count via `lspci`, link state via + `ip link`, RDMA link state via `rdma link show`, GID table via + `ibv_devinfo`) plus the outer Job submit/wait driver: + - `shellcheck PHASE3_CHECK_SCRIPT` (skipped automatically in CI + images that don't ship shellcheck). + - In-Job CHECK pass case: all 4 checks pass -> `amd.com/nic-health=passed` + self-label, exit 0, no failure annotation. + - NIC count mismatch -> `=failed`, reason + `nic-count:expected=.,actual=.`. + - One NIC link DOWN -> `=failed`, failed-nics carries the iface + name, reason includes `link-state:=DOWN`. + - One RDMA link in `INIT` state -> `=failed`, reason `rdma-state:`. + - One device with an empty GID table -> `=failed`, reason + `gid-table:=0`. + - `ibv_devinfo` unresponsive on one device -> `=failed`, reason + `ibv-devinfo:=unresponsive`. + - Partial failure (only Check 3 fails) -> failure annotation contains + only the rdma-state reason, never a spurious entry from a passing + check class. + - Annotation size truncation -- worst-case all-NIC failure values + are clamped to `PHASE3_ANNOTATION_MAX_BYTES` (default 250) so the + node object cannot blow the 256 KiB annotation budget. + - `NODE_NAME` unset -> exit 2, zero kubectl side effects. + - Outer driver `PHASE3_SCRIPT`: + - empty input list -> no-op, return 0. + - `SKIP_NIC_VALIDATION=true` short-circuit (case-insensitive) -> + every input node pass-labeled, no Jobs created, no kubectl + get/apply work. + - Missing required env -> all input nodes labeled failed with + `failure-reason=phase3-missing-env:.`; no Jobs submitted. + - Missing job template -> all input nodes labeled failed with + `failure-reason=job-template-missing`. + - `kubectl apply` failure -> per-node `failure-reason=job-creation-failed`, + no poll work for the affected job. + - `PHASE3_JOB_WAIT_TIME=0` + no conditions -> `=failed`, + `failure-reason=nic-not-allocated`, hung Job explicitly deleted + at cleanup. + - `Complete=True` / `Failed=True` -> the orchestrator must NOT + write the node label (in-pod kubectl owns the passed/failed + labels per design §4 step 4); only submit-failed / timeout + cases get orchestrator-side labels. + - Parallel-submit ordering: every `kubectl apply` precedes any + `kubectl get job` poll (no per-node serialization). + - `PHASE_NODES` env-var fallback when no positional args. + - Sample shim fixtures live under `tests/fixtures/phase3/` + (lspci pass / count-mismatch / empty; ip link pass / one-down; + rdma link pass / one-init; ibv_devices listing; ibv_devinfo + pass / empty-gid). + +- **`PHASE2_SCRIPT` driver** — orchestrates the + per-node intra-node RCCL `all_reduce_perf` Job: + - pass case: `Complete=True` + log with `Avg bus bandwidth` line -> + `amd.com/gpu-mesh-validation=passed` + `measured-bw` annotation. + - `bus-bw-below-threshold` fail: `Failed=True` + log marker + `phase2 bandwidth below threshold` -> failed label, + `failure-reason=bus-bw-below-threshold`, measured-bw annotation. + - `rccl-crash` fail: `Failed=True` + log marker + `phase2 mpirun exited` -> failed label, + `failure-reason=rccl-crash`. + - Failed without marker -> default `failure-reason=rccl-crash` + (design §6 default). + - `timeout`: no Job conditions seeded + `PHASE2_JOB_WAIT_TIME=0` + -> failed label, `failure-reason=timeout`, hung Job explicitly + deleted at cleanup. + - `SKIP_GPU_MESH_VALIDATION=true` short-circuit: every input node + pass-labeled, no Jobs created, no kubectl get/logs/apply work. + - Threshold-too-high inject (`PHASE2_BW_THRESHOLD=9999`): contract + pin — `PHASE2_SCRIPT` never re-runs the validator, so the inject + is observed only inside the Job container; classification by log + marker is unchanged. + - Missing-env fast-fail (e.g. `ROCE_WORKLOAD_IMAGE` unset). + - Missing job template -> all input nodes labeled failed with + `failure-reason=job-template-missing`. + - Parallel-submit ordering: every `kubectl apply` precedes any + `kubectl get job` poll (no per-node serialization). + - `kubectl apply` failure -> per-node `failure-reason=job-creation-failed`, + no poll/log work for the affected job. + - `PHASE_NODES` env-var fallback when no positional args. + - `PHASE2_RCCL_ENV_VARS` ConfigMap value contains no IB / fabric + tunables (intra-node only — TC4 in `-test-plan.md`). + - Sample `phase2.log` fixtures live under `tests/fixtures/phase2/` + (pass / bw-below-threshold / rccl-crash / failed-no-marker). + +- **`PHASE_NODE_LABEL_SCRIPT` helper library** + against a recording `kubectl` mock: + - `label_phase_passed` writes the `=passed` label with + `--overwrite`. + - `label_phase_failed` writes the `=failed` label AND a + `-failure-reason` annotation. + - `annotate_phase_value` writes + `-=` annotations. + - `filter_passed_nodes` returns the subset of input nodes whose + label is exactly `passed`, preserving input order. + - Argument validation (empty args / wrong arity must return + non-zero with no kubectl side effects). + - `kubectl` failure propagation (return non-zero when the + underlying call fails). + - Contract invariants: `--overwrite` on every write, diagnostics + to stderr only. + +- **`PHASE4_DRIVER_SCRIPT`** — + pairwise per-rail RDMA bandwidth test driver. Covers: + - Round-robin pairing (even input -> N/2 pairs; odd input -> N/2 + pairs + the trailing node tagged `unpaired=true`). + - Empty / single-node input fast paths. + - `SKIP_RAIL_BANDWIDTH_TEST=true` (case-insensitive) short-circuit: + every input node pass-labeled, no Jobs created. + - Missing required env -> all input nodes labeled failed with + `failure-reason=phase4-missing-env:.`; no Jobs submitted. + - Missing job templates -> all input nodes labeled failed with + `failure-reason=job-template-missing`. + - All-rails-pass single pair -> both nodes labeled passed with + per-rail BW annotations (`amd.com/rail-bandwidth-rail-N=`) + and a `peer=` diagnostic annotation. + - Single rail below `PHASE4_BW_THRESHOLD` -> `failed-rails=` + annotation; per-rail BW preserved for diagnostics. + - All rails below threshold -> `failed-rails=0,1,.,7`. + - `ib_write_bw` crashed (client Job `Failed=True` + no BW line in + log) -> rail recorded `reason=ib-write-bw-crashed`. + - `ib_write_bw` Complete but empty log -> `reason=parse-failed`. + - Server pod IP never set + `PHASE4_PAIR_WAIT_TIME=0` -> + `reason=peer-pod-unready` (pod created but no IP) or + `reason=nad-missing` (no pod created at all). + - `PHASE4_RAIL_COUNT=4` override -> only rails 0-3 annotated; + rails 4-7 never appear in any annotation. + - 16 nodes (8 pairs) with `PHASE4_MAX_CONCURRENT_PAIRS=8` -> + every pair forked, all 16 nodes labeled (concurrency cap + sanity check). + - `PHASE4_MAX_CONCURRENT_PAIRS=0` -> promoted to 1 with a + logged warning rather than deadlocking. + - `PHASE_NODES` env-var fallback when no positional args. + - Sample `ib_write_bw` client log fixtures live under + `tests/fixtures/phase4/` (pass, below-threshold, crashed, empty). + +- **`PHASE45_PREFLIGHT_SCRIPT`** + the Phase 4.5 pre-flight gate inside the Phase 5 launcher + init-container. Four checks (SSH mesh, DNS, MPI spawn, RCCL + topology) plus the verdict block. Covers: + - All-checks-pass on a healthy 2-pod cluster -> exit 0, zero + annotate calls, verdict banner present. + - SSH mesh single-pair fail -> ssh_mesh_failed=true, annotation + class `ssh-mesh`, exit 1. + - SSH mesh all-pairs fail -> failed_pairs count = N*N, still + single `ssh-mesh` class. + - `WORKER_REPLICAS=1` degenerate self-pair -> no divide-by-zero + or hang, passes. + - DNS forward-miss fixture -> dns_failed=true, class `dns`, exit 1. + - mpirun --hostfile no-op spawn fail -> mpi_spawn_failed=true, + class `mpi-spawn`, exit 1. + - RCCL probe non-timeout non-zero exit -> rccl_topo_failed=true, + class `rccl-topology`, exit 1 (hard-fail). + - RCCL probe exit 124 (timeout) -> rccl_topo_timeout=true, class + `rccl-topology`, exit 0 (soft-fail per design §6). + - All four checks fail -> annotation reason + `ssh-mesh,dns,mpi-spawn,rccl-topology` (union, fixed order), + exit 1, every participating node annotated. + - Hard fail (ssh-mesh) + RCCL soft-fail -> classes include both, + hard wins the exit code (1). + - `ENABLE_SSH_CHECK=false` short-circuit -> whole pre-flight body + is gated off, exit 0, zero kubectl exec calls. + - `WAIT_FOR_WORKERS=false` -> kubectl wait not invoked but the + four checks still run. + - Sample DNS / RCCL stdout fixtures live under + `tests/fixtures/phase4_5/`. + +- **Orchestrator `DRY_RUN=1` mode** from the + three scenarios documented in `-config-framework-orchestration-design.md` + §7 "Orchestrator dry-run": + - All five skip flags `true` -> exit 0, zero `kubectl label`/ + `kubectl annotate` calls. + - Phases 1+2 enabled, 3-5 skipped -> Phase 1 and Phase 2 banners + appear in order; Phases 3-5 take the `SKIP_*=true -- pass-through` + branch. + - Phase 3 enabled but no `feature.node.kubernetes.io/amd-nic=true` + nodes -> Phase 3 takes the "no NIC-capable nodes" branch and + exits 0. + - Plus: empty Phase-0 pool exits 0 cleanly; cleanup and log + collection honor `DRY_RUN`. + +## Layout + +``` +tests/ + README.md # this file + run_all.sh # entry point + test_phase_node_label_script.sh # helper library + test_orchestrator_dry_run.sh # orchestrator DRY_RUN mode + test_phase1.sh # PHASE1_SCRIPT + test_phase2.sh # PHASE2_SCRIPT + test_phase3.sh # PHASE3_CHECK_SCRIPT + PHASE3_SCRIPT + test_phase4.sh # PHASE4_DRIVER_SCRIPT + test_phase4_5.sh # PHASE45_PREFLIGHT_SCRIPT + lib/ + assert.sh # hand-rolled bash assertions + kubectl_mock.sh # recording kubectl shim + extract_script.sh # YAML block-scalar extractor + fixtures/ + phase1/ # AGFHC result.json fixtures + phase2/ # phase2.log fixtures + phase3/ # lspci / ip / rdma / ibv_* shim fixtures + phase4/ # ib_write_bw client-log fixtures + phase4_5/ # DNS / RCCL exec-stdout fixtures +``` + +## How to run + +```sh +# Run the full suite +./example/gpu-validation-cluster/tests/run_all.sh + +# Run a single file +./example/gpu-validation-cluster/tests/run_all.sh test_orchestrator_dry_run.sh + +# Run one test file directly (bypasses the suite summary) +bash example/gpu-validation-cluster/tests/test_phase_node_label_script.sh +``` + +`run_all.sh` exits 0 only when every test file reports 0 failures. + +## How the kubectl mock works + +`lib/kubectl_mock.sh` installs a small bash script named `kubectl` +in a temp directory and prepends that directory to `$PATH`. This is +the only reliable way to intercept calls in CI environments where a +real `kubectl` is installed -- a bash `function kubectl { . }` +override is only visible to the current shell, and the orchestrator +writes per-phase scripts to `/tmp` and `source`s them, so any +function override would be invisible to those sub-shells. + +The mock: +- records every invocation as a single line in + `$KUBECTL_CALLS_FILE` (one arg per token, joined with single + spaces); +- serves canned label values for `kubectl get node . -o jsonpath=.` + from `$KUBECTL_STATE_FILE` (seeded via `kubectl_mock_set_label`); +- supports one-shot or sticky failure injection for `label`, + `annotate`, and `get` via `kubectl_mock_fail` / + `kubectl_mock_fail_sticky`; +- returns exit 99 for any kubectl verb the helpers and the + orchestrator are not supposed to invoke -- catches accidental + real-world calls in tests. + +## How the YAML extraction works + +The artifacts under test are embedded as multi-line `|` block +scalars inside Kubernetes manifests: + +- `PHASE_NODE_LABEL_SCRIPT` lives under + `configs/cluster-validation-config.yaml` -> `data:`. +- The orchestrator body lives under + `configs/cluster-validation-job.yaml` -> the `submit-mpijob` + container's `args:` list. + +`lib/extract_script.sh` provides two pure-bash/awk extractors so the +tests do not depend on PyYAML or `yq`: + +- `extract_configmap_data ` -- emits the body of + `data.: |` on stdout. +- `extract_cronjob_orchestrator ` -- emits the body of the + `submit-mpijob` container's `args: - |` block on stdout. + +Both extractors strip the YAML block's leading indent and stop on +the first less-indented non-blank line, matching kubectl's own +normalization. + +## How to add a new test + +1. Create `tests/test_.sh`. +2. Source the three libs: + ```bash + TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) + source "${TEST_DIR}/lib/assert.sh" + source "${TEST_DIR}/lib/kubectl_mock.sh" + source "${TEST_DIR}/lib/extract_script.sh" + ``` +3. `kubectl_mock_init` once, then `kubectl_mock_reset` between + tests. +4. Use `it "" && { run ; assert_* .; }` per case. +5. Call `assert_summary` at the end -- it sets the file's exit + status to 0 only when every `it` passed. + +Tests are discovered by `run_all.sh` via the `test_*.sh` glob; +no registration step is needed. + +## Test-plan coverage mapping + +These tests realize the following cases from +the test plan: + +| Test case | Description | Realized by | +|---|---|---| +| TC2 | helper-label-pass-fail-annotate | `test_phase_node_label_script.sh` (`label_phase_passed` / `label_phase_failed` / `annotate_phase_value` blocks) | +| TC3 | filter-passed-nodes-returns-subset | `test_phase_node_label_script.sh` (`filter_passed_nodes` block) | +| TC4 | dryrun-all-skipped-exits-zero | `test_orchestrator_dry_run.sh` scenario 1 | +| TC7 | helper-rejects-missing-args | `test_phase_node_label_script.sh` (negative-validation cases per helper) | +| TC8 | empty-candidate-pool | `test_orchestrator_dry_run.sh` "DRY_RUN exits 0 when no candidate nodes are present" | +| TC10 | helper-kubectl-failure | `test_phase_node_label_script.sh` (`kubectl_mock_fail` cases per helper) | +| TC14 | dryrun-phase1-2-enabled-only | `test_orchestrator_dry_run.sh` scenario 2 | + +These tests realize the following cases from +the test plan (PHASE2_SCRIPT +behavior): + +| Test case | Description | Realized by | +|---|---|---| +| TC2 | phase2-pass-labels-node | `test_phase2.sh` ("single node pass" + "mixed pass/fail") | +| TC3 | skip-phase2-passlabels-all | `test_phase2.sh` ("SKIP_GPU_MESH_VALIDATION=true .") | +| TC4 | rccl-env-no-ib-vars | `test_phase2.sh` ("PHASE2_RCCL_ENV_VARS contains no IB/fabric tunables") | +| TC5 | bw-below-threshold-fails | `test_phase2.sh` ("Failed + bw-below-threshold marker ." + "PHASE2_BW_THRESHOLD=9999 inject .") | +| TC6 | rccl-crash-fails | `test_phase2.sh` ("Failed + mpirun-exited marker ." + default-marker case) | +| TC7 | empty-input-list | `test_phase2.sh` ("PHASE2_SCRIPT with empty input list .") | +| TC9 | gpu-not-allocated-timeout | `test_phase2.sh` ("no conditions + PHASE2_JOB_WAIT_TIME=0 -> reason=timeout .") | +| TC10 | hung-job-cleanup | `test_phase2.sh` (same case — verifies `delete job . --ignore-not-found=true --wait=false`) | + +These tests realize the following cases from +the test plan (Phase 3 NIC health): + +| Test case | Description | Realized by | +|---|---|---| +| TC2 | phase3-pass-labels-node | `test_phase3.sh` ("PHASE3_CHECK_SCRIPT all checks pass -> =passed") | +| TC3 | skip-phase3-passlabels-all | `test_phase3.sh` ("SKIP_NIC_VALIDATION=true ." + case-insensitive) | +| TC4 | shellcheck-clean | `test_phase3.sh` ("shellcheck PHASE3_CHECK_SCRIPT .") | +| TC5 | nic-count-mismatch | `test_phase3.sh` ("PHASE3_CHECK_SCRIPT nic-count mismatch .") | +| TC6 | link-down-one-nic | `test_phase3.sh` ("PHASE3_CHECK_SCRIPT one NIC link DOWN .") | +| TC7 | rdma-state-not-active | `test_phase3.sh` ("PHASE3_CHECK_SCRIPT one rdma link INIT .") | +| TC8 | empty-gid-table | `test_phase3.sh` ("PHASE3_CHECK_SCRIPT empty GID table .") | +| TC10 | annotation-size-truncation | `test_phase3.sh` ("PHASE3_CHECK_SCRIPT large failure list truncates .") | +| TC11 | tools-missing-image | `test_phase3.sh` ("PHASE3_CHECK_SCRIPT ibv_devinfo unresponsive .") | +| TC12 | nic-not-allocated-timeout | `test_phase3.sh` ("no conditions + PHASE3_JOB_WAIT_TIME=0 -> reason=nic-not-allocated .") | + +These tests realize the following cases from +the test plan (Phase 4 pairwise +RDMA bandwidth): + +| Test case | Description | Realized by | +|---|---|---| +| TC1 | pairing-roundrobin-even | `test_phase4.sh` ("pairing round-robin even: [a,b,c,d] -> pairs=2 .") | +| TC2 | pairing-roundrobin-odd | `test_phase4.sh` ("pairing round-robin odd: [a,b,c,d,e] -> pairs=2 unpaired=node-e") | +| TC3 | per-rail-annotation-written | `test_phase4.sh` ("single pair all rails pass -> both nodes passed + per-rail annotations") | +| TC4 | skip-phase4-passlabels-all | `test_phase4.sh` ("SKIP_RAIL_BANDWIDTH_TEST=true ." + case-insensitive) | +| TC5 | single-rail-fail | `test_phase4.sh` ("single rail fail (rail 5 below threshold) -> failed-rails=5 .") | +| TC6 | all-rails-fail-one-pair | `test_phase4.sh` ("all rails fail on one pair -> failed-rails=0,1,2,3,4,5,6,7") | +| TC7 | ib-write-bw-crash | `test_phase4.sh` ("client Failed + no BW line -> reason=ib-write-bw-crashed") | +| TC8 | parse-failure | `test_phase4.sh` ("client Complete but empty log -> reason=parse-failed") | +| TC9 | single-node-input | `test_phase4.sh` ("single-node input -> unpaired pass-label .") | +| TC10 | empty-input | `test_phase4.sh` ("empty input list is a no-op .") | +| TC11 | rail-count-override | `test_phase4.sh` ("PHASE4_RAIL_COUNT=4 -> rails 0-3 annotated; rails 4-7 absent") | +| TC12 | server-pod-unready-timeout | `test_phase4.sh` ("server pod IP never set + PHASE4_PAIR_WAIT_TIME=0 -> reason=peer-pod-unready") | +| TC15 | concurrency-cap-honored | `test_phase4.sh` ("16 nodes (8 pairs) with cap=8 -> .") | + +These tests realize the following cases from +the test plan (Phase 4.5 cross-node +connectivity matrix pre-flight): + +| Test case | Description | Realized by | +|---|---|---| +| TC2 | all-checks-pass | `test_phase4_5.sh` ("all checks pass on a healthy 2-pod cluster .") | +| TC4 | single-pair-ssh-fail | `test_phase4_5.sh` ("single SSH mesh pair fails .") | +| TC5 | dns-fail-fwd | `test_phase4_5.sh` ("DNS forward miss .") | +| TC6 | mpi-spawn-fail | `test_phase4_5.sh` ("mpirun --hostfile no-op fails .") | +| TC7 | rccl-topology-timeout | `test_phase4_5.sh` ("RCCL probe times out (exit 124) .") | +| TC8 | worker-replicas-1 | `test_phase4_5.sh` ("WORKER_REPLICAS=1 self-pair only .") | +| TC9 | annotation-includes-all-failed-classes | `test_phase4_5.sh` ("all four checks fail .") | diff --git a/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-bw-below-threshold.log b/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-bw-below-threshold.log new file mode 100644 index 000000000..7f0f994b6 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-bw-below-threshold.log @@ -0,0 +1,29 @@ +# Sample phase2.log fixture -- bandwidth-below-threshold fail. +# Emitted by the cluster-validation-phase2-job-config Job container +# when all_reduce_perf completes cleanly BUT the measured +# Avg bus bandwidth is below PHASE2_BW_THRESHOLD. validate-single-test.sh +# emits the "phase2 bandwidth below threshold" marker line and exits +# non-zero, which surfaces as Job=Failed=True with this marker present +# in the container log. +# +# This fixture also stresses the threshold-too-high inject case from +# the design doc test plan (set PHASE2_BW_THRESHOLD=9999) -- the log +# shape is identical; only the threshold parameter differs. +[2026-05-19 17:33:42] mpirun --np 8 --host localhost --allow-run-as-root --mca btl ^vader,openib /root/rccl-tests/build/all_reduce_perf -b 1K -e 2G -f 2 -g 1 -n 6 -w 20 +# nThread 1 nGpus 1 minBytes 1024 maxBytes 2147483648 step: 2(factor) warmup iters: 20 iters: 6 validation: 1 graph: 0 +# Using devices +# Rank 0 Group 0 Pid 52 device 0 [0xc0] AMD Instinct MI300X +# Rank 1 Group 0 Pid 53 device 1 [0xc1] AMD Instinct MI300X +# +# out-of-place in-place +# size count type redop time algbw busbw error time algbw busbw error +# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) + 536870912 134217728 float sum 5421.10 99.04 173.32 N/A 5402.66 99.37 173.91 N/A + 1073741824 268435456 float sum 10821.85 99.22 173.65 N/A 10791.41 99.50 174.13 N/A + 2147483648 536870912 float sum 21724.81 98.85 172.99 N/A 21680.45 99.05 173.34 N/A +# Errors: 0 +# Avg bus bandwidth: 173.7 +# +[phase2] validate-single-test.sh "phase2_intranode_all_reduce" 200 +[phase2] measured-bw=173.7 threshold=200 +phase2 bandwidth below threshold: measured=173.7 threshold=200 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-failed-no-marker.log b/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-failed-no-marker.log new file mode 100644 index 000000000..4eabdde7e --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-failed-no-marker.log @@ -0,0 +1,11 @@ +# Sample phase2.log fixture -- Failed=True with no recognized marker. +# Simulates a scenario where the Job ends Failed=True but neither the +# crash marker nor the bw-threshold marker appears in the container +# log -- e.g. container killed by OOM before the mpirun wrapper line +# could be emitted, validator script missing, init-container failure. +# PHASE2_SCRIPT defaults this case to reason=rccl-crash per design +# section 6 "Any non-zero exit from the container is treated as a +# crash signal unless the validator explicitly flagged bw-below-threshold". +[2026-05-19 17:36:11] starting Phase 2 RCCL Job for node smc300x-ccs-aus-gpuf268 +[2026-05-19 17:36:11] command prepared +container terminated -- no further output diff --git a/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-pass.log b/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-pass.log new file mode 100644 index 000000000..ddd8712c7 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-pass.log @@ -0,0 +1,35 @@ +# Sample phase2.log fixture -- passing run. +# Emitted by the cluster-validation-phase2-job-config Job container +# when all_reduce_perf completes cleanly and the bandwidth +# check inside validate-single-test.sh passes the PHASE2_BW_THRESHOLD. +# +# Trimmed for fixture purposes -- only the final "Avg bus bandwidth" +# line and the validator's PASSED marker matter for PHASE2_SCRIPT's +# log parsing (it uses awk to pluck the Avg bus bandwidth value and +# greps for "phase2 all_reduce_perf PASSED" as a sanity signal). +[2026-05-19 17:32:10] mpirun --np 8 --host localhost --allow-run-as-root --mca btl ^vader,openib /root/rccl-tests/build/all_reduce_perf -b 1K -e 2G -f 2 -g 1 -n 6 -w 20 +# nThread 1 nGpus 1 minBytes 1024 maxBytes 2147483648 step: 2(factor) warmup iters: 20 iters: 6 validation: 1 graph: 0 +# Using devices +# Rank 0 Group 0 Pid 42 device 0 [0xc0] AMD Instinct MI300X +# Rank 1 Group 0 Pid 43 device 1 [0xc1] AMD Instinct MI300X +# Rank 2 Group 0 Pid 44 device 2 [0xc2] AMD Instinct MI300X +# Rank 3 Group 0 Pid 45 device 3 [0xc3] AMD Instinct MI300X +# Rank 4 Group 0 Pid 46 device 4 [0xc4] AMD Instinct MI300X +# Rank 5 Group 0 Pid 47 device 5 [0xc5] AMD Instinct MI300X +# Rank 6 Group 0 Pid 48 device 6 [0xc6] AMD Instinct MI300X +# Rank 7 Group 0 Pid 49 device 7 [0xc7] AMD Instinct MI300X +# +# out-of-place in-place +# size count type redop time algbw busbw error time algbw busbw error +# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) + 1024 256 float sum 13.21 0.08 0.14 N/A 12.10 0.08 0.15 N/A + 2048 512 float sum 13.84 0.15 0.26 N/A 13.05 0.16 0.27 N/A + 536870912 134217728 float sum 2189.41 245.21 429.13 N/A 2178.40 246.45 431.30 N/A + 1073741824 268435456 float sum 4304.55 249.45 436.54 N/A 4290.10 250.29 438.01 N/A + 2147483648 536870912 float sum 8556.43 250.95 439.16 N/A 8527.81 251.79 440.64 N/A +# Errors: 0 +# Avg bus bandwidth: 234.7 +# +[phase2] validate-single-test.sh "phase2_intranode_all_reduce" 200 +[phase2] measured-bw=234.7 threshold=200 +phase2 all_reduce_perf PASSED diff --git a/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-rccl-crash.log b/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-rccl-crash.log new file mode 100644 index 000000000..aff461ce0 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase2/phase2-rccl-crash.log @@ -0,0 +1,28 @@ +# Sample phase2.log fixture -- mpirun crash (non-zero exit). +# Emitted by the cluster-validation-phase2-job-config Job container +# when all_reduce_perf aborts mid-run (xGMI link failure, +# RCCL assertion, segfault, etc.). The Job's container script wraps +# mpirun and prints a "phase2 mpirun exited " marker on non-zero +# exit; PHASE2_SCRIPT keys off this exact phrase to classify the +# failure as rccl-crash. +[2026-05-19 17:34:55] mpirun --np 8 --host localhost --allow-run-as-root --mca btl ^vader,openib /root/rccl-tests/build/all_reduce_perf -b 1K -e 2G -f 2 -g 1 -n 6 -w 20 +# nThread 1 nGpus 1 minBytes 1024 maxBytes 2147483648 step: 2(factor) warmup iters: 20 iters: 6 validation: 1 graph: 0 +# Using devices +# Rank 0 Group 0 Pid 61 device 0 [0xc0] AMD Instinct MI300X +# Rank 1 Group 0 Pid 62 device 1 [0xc1] AMD Instinct MI300X +# Rank 2 Group 0 Pid 63 device 2 [0xc2] AMD Instinct MI300X +NCCL WARN [Rank 3] proxyConnect transport=0 reconnect failed +NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 +NCCL ERROR misc/xgmi.cc:412 GPU 3 xGMI link 2 reported HW error +[Rank 3] RCCL FATAL: xgmi-init-failure: cannot initialize peer device 4 +-------------------------------------------------------------------------- +Primary job terminated normally, but 1 process returned +a non-zero exit code. Per user-direction, the job has been aborted. +-------------------------------------------------------------------------- +-------------------------------------------------------------------------- +mpirun detected that one or more processes exited with non-zero status, thus causing +the job to be terminated. The first process to exit was: + Process name: [[40132,1],3] + Exit code: 1 +-------------------------------------------------------------------------- +phase2 mpirun exited 1 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/check5_nicctl_empty.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/check5_nicctl_empty.txt new file mode 100644 index 000000000..e69de29bb diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/check5_nicctl_match.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/check5_nicctl_match.txt new file mode 100644 index 000000000..fc14ed77c --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/check5_nicctl_match.txt @@ -0,0 +1,2 @@ +ionic0: 1.117.5 +ionic1: 1.117.5 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/check5_nicctl_mismatch.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/check5_nicctl_mismatch.txt new file mode 100644 index 000000000..0d24ba04c --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/check5_nicctl_mismatch.txt @@ -0,0 +1,2 @@ +ionic0: 1.117.5 +ionic1: 1.117.9 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/ibv-devices.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/ibv-devices.txt new file mode 100644 index 000000000..287531387 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/ibv-devices.txt @@ -0,0 +1,10 @@ + device node GUID + ------ ---------------- + rocep5s0 aabbccddeeff0000 + rocep6s0 aabbccddeeff0001 + rocep7s0 aabbccddeeff0002 + rocep8s0 aabbccddeeff0003 + rocep9s0 aabbccddeeff0004 + rocep10s0 aabbccddeeff0005 + rocep11s0 aabbccddeeff0006 + rocep12s0 aabbccddeeff0007 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/ibv-devinfo-empty-gid.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/ibv-devinfo-empty-gid.txt new file mode 100644 index 000000000..e037a4ae7 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/ibv-devinfo-empty-gid.txt @@ -0,0 +1,17 @@ +hca_id: rocep5s0 + transport: InfiniBand (0) + fw_ver: 1.0.0 + node_guid: aabb:ccdd:eeff:0000 + sys_image_guid: aabb:ccdd:eeff:0000 + vendor_id: 0x1dd8 + vendor_part_id: 1 + hw_ver: 0x3 + phys_port_cnt: 1 + port: 1 + state: PORT_DOWN (1) + max_mtu: 4096 (5) + active_mtu: 4096 (5) + sm_lid: 0 + port_lid: 0 + port_lmc: 0x00 + link_layer: Ethernet diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/ibv-devinfo-pass.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/ibv-devinfo-pass.txt new file mode 100644 index 000000000..1eb18dac6 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/ibv-devinfo-pass.txt @@ -0,0 +1,19 @@ +hca_id: rocep5s0 + transport: InfiniBand (0) + fw_ver: 1.0.0 + node_guid: aabb:ccdd:eeff:0000 + sys_image_guid: aabb:ccdd:eeff:0000 + vendor_id: 0x1dd8 + vendor_part_id: 1 + hw_ver: 0x3 + phys_port_cnt: 1 + port: 1 + state: PORT_ACTIVE (4) + max_mtu: 4096 (5) + active_mtu: 4096 (5) + sm_lid: 0 + port_lid: 0 + port_lmc: 0x00 + link_layer: Ethernet + GID[ 0]: fe80:0000:0000:0000:a8bb:ccff:fedd:eeff, RoCE v1 + GID[ 1]: fe80:0000:0000:0000:a8bb:ccff:fedd:eeff, RoCE v2 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/ip-link-one-down.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/ip-link-one-down.txt new file mode 100644 index 000000000..e3c9b4b43 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/ip-link-one-down.txt @@ -0,0 +1,9 @@ +lo UNKNOWN 00:00:00:00:00:00 +enP5p1s0f0 UP aa:bb:cc:00:00:00 +enP6p1s0f0 UP aa:bb:cc:00:00:01 +enP7p1s0f0 DOWN aa:bb:cc:00:00:02 +enP8p1s0f0 UP aa:bb:cc:00:00:03 +enP9p1s0f0 UP aa:bb:cc:00:00:04 +enP10p1s0f0 UP aa:bb:cc:00:00:05 +enP11p1s0f0 UP aa:bb:cc:00:00:06 +enP12p1s0f0 UP aa:bb:cc:00:00:07 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/ip-link-pass.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/ip-link-pass.txt new file mode 100644 index 000000000..64fb04b21 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/ip-link-pass.txt @@ -0,0 +1,9 @@ +lo UNKNOWN 00:00:00:00:00:00 +enP5p1s0f0 UP aa:bb:cc:00:00:00 +enP6p1s0f0 UP aa:bb:cc:00:00:01 +enP7p1s0f0 UP aa:bb:cc:00:00:02 +enP8p1s0f0 UP aa:bb:cc:00:00:03 +enP9p1s0f0 UP aa:bb:cc:00:00:04 +enP10p1s0f0 UP aa:bb:cc:00:00:05 +enP11p1s0f0 UP aa:bb:cc:00:00:06 +enP12p1s0f0 UP aa:bb:cc:00:00:07 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/lspci-count-mismatch.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/lspci-count-mismatch.txt new file mode 100644 index 000000000..71ab0247c --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/lspci-count-mismatch.txt @@ -0,0 +1,7 @@ +0000:05:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:06:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:07:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:08:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:09:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:0a:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:0b:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/lspci-empty.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/lspci-empty.txt new file mode 100644 index 000000000..e69de29bb diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/lspci-pass.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/lspci-pass.txt new file mode 100644 index 000000000..f991635ab --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/lspci-pass.txt @@ -0,0 +1,8 @@ +0000:05:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:06:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:07:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:08:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:09:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:0a:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:0b:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) +0000:0c:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03) diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/rdma-link-one-init.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/rdma-link-one-init.txt new file mode 100644 index 000000000..09c7fb881 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/rdma-link-one-init.txt @@ -0,0 +1,8 @@ +link rocep5s0/1 state ACTIVE physical_state LINK_UP netdev enP5p1s0f0 +link rocep6s0/1 state ACTIVE physical_state LINK_UP netdev enP6p1s0f0 +link rocep7s0/1 state INIT physical_state LINK_UP netdev enP7p1s0f0 +link rocep8s0/1 state ACTIVE physical_state LINK_UP netdev enP8p1s0f0 +link rocep9s0/1 state ACTIVE physical_state LINK_UP netdev enP9p1s0f0 +link rocep10s0/1 state ACTIVE physical_state LINK_UP netdev enP10p1s0f0 +link rocep11s0/1 state ACTIVE physical_state LINK_UP netdev enP11p1s0f0 +link rocep12s0/1 state ACTIVE physical_state LINK_UP netdev enP12p1s0f0 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase3/rdma-link-pass.txt b/example/gpu-validation-cluster/tests/fixtures/phase3/rdma-link-pass.txt new file mode 100644 index 000000000..5b448d3c8 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase3/rdma-link-pass.txt @@ -0,0 +1,8 @@ +link rocep5s0/1 state ACTIVE physical_state LINK_UP netdev enP5p1s0f0 +link rocep6s0/1 state ACTIVE physical_state LINK_UP netdev enP6p1s0f0 +link rocep7s0/1 state ACTIVE physical_state LINK_UP netdev enP7p1s0f0 +link rocep8s0/1 state ACTIVE physical_state LINK_UP netdev enP8p1s0f0 +link rocep9s0/1 state ACTIVE physical_state LINK_UP netdev enP9p1s0f0 +link rocep10s0/1 state ACTIVE physical_state LINK_UP netdev enP10p1s0f0 +link rocep11s0/1 state ACTIVE physical_state LINK_UP netdev enP11p1s0f0 +link rocep12s0/1 state ACTIVE physical_state LINK_UP netdev enP12p1s0f0 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-below-threshold.log b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-below-threshold.log new file mode 100644 index 000000000..57c8053e5 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-below-threshold.log @@ -0,0 +1,17 @@ +# Sample ib_write_bw client log fixture -- below-threshold result. +# BW average column 4 is 180.50, well under the default 380 Gbps +# threshold. PHASE4_DRIVER_SCRIPT classifies this as +# reason="below-threshold:180.50" (the literal BW value is suffixed +# onto the reason string for diagnostics) and writes a per-rail +# annotation with the same value. +[phase4-client] starting ib_write_bw -d rdma_dev_5 -i 1 -F 10.42.0.19 +--------------------------------------------------------------------------------------- + RDMA_Write BW Test + Device : rdma_dev_5 + Mtu : 4096[B] + Link type : Ethernet +--------------------------------------------------------------------------------------- + #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] + 65536 5000 192.18 180.50 0.344162 +--------------------------------------------------------------------------------------- +[phase4-client] ib_write_bw exit=0 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-crashed.log b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-crashed.log new file mode 100644 index 000000000..262ec2afa --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-crashed.log @@ -0,0 +1,15 @@ +# Sample ib_write_bw client log fixture -- crashed run. +# ib_write_bw exited non-zero before emitting the "BW average" line. +# PHASE4_DRIVER_SCRIPT's parse step finds no value -> classifies as +# reason="ib-write-bw-crashed" (when paired with Failed=True from +# the client Job's terminal condition). +[phase4-client] starting ib_write_bw -d rdma_dev_2 -i 1 -F 10.42.0.20 +--------------------------------------------------------------------------------------- + RDMA_Write BW Test + Device : rdma_dev_2 +--------------------------------------------------------------------------------------- +ibv_create_qp failed: Cannot allocate memory +Couldn't create QP +Unable to create QP. +ethernet_client_exchange_data: Couldn't read remote address +[phase4-client] ib_write_bw exit=1 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-empty.log b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-empty.log new file mode 100644 index 000000000..7e092eaf8 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-empty.log @@ -0,0 +1,10 @@ +# Sample ib_write_bw client log fixture -- empty / no BW average line. +# The ib_write_bw process exited 0 (so the client Job is +# Complete=True) but the log was truncated before the result row was +# written. PHASE4_DRIVER_SCRIPT's parse step finds no value -> the +# driver classifies the rail as reason="parse-failed" (design §6). +[phase4-client] starting ib_write_bw -d rdma_dev_7 -i 1 -F 10.42.0.21 +--------------------------------------------------------------------------------------- + RDMA_Write BW Test + Device : rdma_dev_7 +--------------------------------------------------------------------------------------- diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-pass-high.log b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-pass-high.log new file mode 100644 index 000000000..c636719b9 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-pass-high.log @@ -0,0 +1,14 @@ +# Sample ib_write_bw client log fixture -- passing run with a higher BW +# value used to distinguish per-rail annotations in mixed-pass tests. +# Same shape as ib-write-bw-pass.log; BW average column 4 is 392.18. +[phase4-client] starting ib_write_bw -d rdma_dev_0 -i 1 -F 10.42.0.18 +--------------------------------------------------------------------------------------- + RDMA_Write BW Test + Device : rdma_dev_0 + Mtu : 4096[B] + Link type : Ethernet +--------------------------------------------------------------------------------------- + #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] + 65536 5000 394.55 392.18 0.747823 +--------------------------------------------------------------------------------------- +[phase4-client] ib_write_bw exit=0 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-pass.log b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-pass.log new file mode 100644 index 000000000..1670aaf53 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase4/ib-write-bw-pass.log @@ -0,0 +1,33 @@ +# Sample ib_write_bw client log fixture -- passing run >= 380 Gbps threshold. +# Emitted by the cluster-validation-phase4-client-job container +# when ib_write_bw completes cleanly against the server +# pod on the paired node, using one specific rail (PHASE4_IB_DEV_PREFIX +# + RAIL_IDX). PHASE4_DRIVER_SCRIPT only consumes the "BW average" line: +# it greps for "BW average" then takes column 4 of the next data row. +# +# Trimmed to the minimum needed by _phase4_parse_bw_average -- the header +# line containing "BW average", followed by a data row whose first +# column is a numeric byte count. +[phase4-client] starting ib_write_bw -d rdma_dev_3 -i 1 -F 10.42.0.17 +--------------------------------------------------------------------------------------- + RDMA_Write BW Test + Dual-port : OFF Device : rdma_dev_3 + Number of qps : 1 Transport type : IB + Connection type : RC Using SRQ : OFF + PCIe relax order: ON + TX depth : 128 + CQ Moderation : 1 + Mtu : 4096[B] + Link type : Ethernet + GID index : 3 + Max inline data : 0[B] + rdma_cm QPs : OFF + Data ex. method : Ethernet +--------------------------------------------------------------------------------------- + local address: LID 0000 QPN 0x0042 PSN 0x12af34 RKey 0x180100 VAddr 0x007fa9b8000000 + remote address: LID 0000 QPN 0x0051 PSN 0x6e74c2 RKey 0x180101 VAddr 0x007f0fc8000000 +--------------------------------------------------------------------------------------- + #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] + 65536 5000 391.04 388.42 0.740434 +--------------------------------------------------------------------------------------- +[phase4-client] ib_write_bw exit=0 diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4_5/README.md b/example/gpu-validation-cluster/tests/fixtures/phase4_5/README.md new file mode 100644 index 000000000..4ca641190 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase4_5/README.md @@ -0,0 +1,29 @@ +# Phase 4.5 fixtures + +Canned stdout bodies for the `kubectl exec` calls issued by +`PHASE45_PREFLIGHT_SCRIPT`. The test harness feeds these into the +mock `kubectl`'s exec-response queue (see +`lib/kubectl_mock.sh::kubectl_mock_queue_exec`). + +The pre-flight script issues exec calls in this order (per healthy +2-pod run): + +| # | Phase | Stdout shape we care about | +|---|-------|----------------------------| +| 1 | launcher->worker SSH readiness (one big exec) | -- (only exit code matters) | +| 2.N+1 | N*N SSH mesh (one exec per (src, dst_ip) pair) | -- (only exit code) | +| -- | DNS forward+reverse (one exec) | `DNS::fwd= rev=` lines, or empty when clean | +| -- | MPI no-op spawn (one exec) | -- (only exit code) | +| -- | RCCL topology probe (wrapped in `timeout 60`) | NCCL INFO lines (empty is fine), but exit code drives classification | + +## Fixtures + +- `dns-clean.txt` -- empty body; clean DNS run produces no MISS lines. +- `dns-fwd-miss.txt` -- one host fails forward resolution. +- `rccl-pass.txt` -- a handful of `NCCL INFO` lines, simulating a + fast topology discovery. +- `rccl-empty.txt` -- empty stdout; the grep|head pipeline produces + nothing on a non-NCCL-aware host. We pair this with a non-zero + exit code for the `rccl_topo_failed` branch and with exit 124 for + the soft-fail `rccl_topo_timeout` branch (the host-side `timeout` + shim writes nothing on timeout either). diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4_5/dns-clean.txt b/example/gpu-validation-cluster/tests/fixtures/phase4_5/dns-clean.txt new file mode 100644 index 000000000..e69de29bb diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4_5/dns-fwd-miss.txt b/example/gpu-validation-cluster/tests/fixtures/phase4_5/dns-fwd-miss.txt new file mode 100644 index 000000000..16112ac0c --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase4_5/dns-fwd-miss.txt @@ -0,0 +1 @@ +DNS:worker-b:fwd=MISS rev=SKIP diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4_5/rccl-empty.txt b/example/gpu-validation-cluster/tests/fixtures/phase4_5/rccl-empty.txt new file mode 100644 index 000000000..e69de29bb diff --git a/example/gpu-validation-cluster/tests/fixtures/phase4_5/rccl-pass.txt b/example/gpu-validation-cluster/tests/fixtures/phase4_5/rccl-pass.txt new file mode 100644 index 000000000..4117e0a01 --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase4_5/rccl-pass.txt @@ -0,0 +1,4 @@ +NCCL INFO comm 0x55a 0xabcd nranks 8 cudaDev 0 busId 1000 commId 0xff - Init COMPLETE +NCCL INFO comm 0x55b 0xabcd nranks 8 cudaDev 1 busId 2000 commId 0xff - Init COMPLETE +NCCL INFO Channel topology: ring [0] 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 0 +NCCL INFO Connected all rings diff --git a/example/gpu-validation-cluster/tests/fixtures/phase5/.gitkeep b/example/gpu-validation-cluster/tests/fixtures/phase5/.gitkeep new file mode 100644 index 000000000..4bb3c50ee --- /dev/null +++ b/example/gpu-validation-cluster/tests/fixtures/phase5/.gitkeep @@ -0,0 +1,6 @@ +# Placeholder so the empty fixtures directory is tracked. +# Phase 5 tests () seed kubectl mock state +# inline via kubectl_mock_set_phase5_* helpers rather than reading +# canned fixture files; the directory is reserved for future negative +# fixtures (e.g. canned launcher logs, worker logs) if the test scope +# expands to log-content assertions. diff --git a/example/gpu-validation-cluster/tests/lib/assert.sh b/example/gpu-validation-cluster/tests/lib/assert.sh new file mode 100755 index 000000000..1b5915211 --- /dev/null +++ b/example/gpu-validation-cluster/tests/lib/assert.sh @@ -0,0 +1,288 @@ +#!/bin/bash +# Hand-rolled bash assertion library for the gpu-validation-cluster +# test suite. Provides a minimal, dependency-free +# alternative to `bats` so tests can run in any environment that has +# bash >= 4 on PATH. +# +# Usage: +# source "$(dirname "$0")/lib/assert.sh" +# +# it "label_phase_passed writes the passed label" && { +# run label_phase_passed node-a amd.com/x +# assert_status 0 +# assert_kubectl_call "label node node-a amd.com/x=passed --overwrite" +# } +# +# Conventions: +# * Each test file is a normal bash script. It sources this library, +# then issues `it "" && { . }` blocks. The library tracks +# per-test pass/fail and prints a summary at exit. +# * `run ` executes the SUT under test in a sub-shell so a +# non-zero exit cannot abort the harness; captures stdout, stderr, +# and exit status into globals (LAST_STDOUT, LAST_STDERR, +# LAST_STATUS). +# * Assertion failures are non-fatal -- they record the failure and +# let the rest of the test (and the rest of the file) continue, so +# a single broken expectation does not mask later regressions. + +set -uo pipefail + +# --- counters and state --------------------------------------------- +ASSERT_TESTS_TOTAL=0 +ASSERT_TESTS_FAILED=0 +ASSERT_CURRENT_TEST="" +ASSERT_CURRENT_FAILED=0 +ASSERT_FAILED_NAMES=() + +# Globals populated by `run`. +LAST_STDOUT="" +LAST_STDERR="" +LAST_STATUS=0 + +# --- test lifecycle ------------------------------------------------- + +# it +# Begins a new test case. Always returns 0 so callers chain with `&&`. +# A test starts with no failures; failures accumulate as assertions +# fire. +it() { + # Flush the previous test if any. + _assert_finalize_current + ASSERT_CURRENT_TEST="$1" + ASSERT_CURRENT_FAILED=0 + ASSERT_TESTS_TOTAL=$((ASSERT_TESTS_TOTAL + 1)) + printf ' - %s ... ' "$ASSERT_CURRENT_TEST" + return 0 +} + +# _assert_finalize_current +# Internal: emit PASS/FAIL for the in-flight test (called by `it` and +# at end-of-file via `assert_summary`). +_assert_finalize_current() { + if [[ -z "$ASSERT_CURRENT_TEST" ]]; then + return 0 + fi + if [[ "$ASSERT_CURRENT_FAILED" -ne 0 ]]; then + printf 'FAIL\n' + ASSERT_TESTS_FAILED=$((ASSERT_TESTS_FAILED + 1)) + ASSERT_FAILED_NAMES+=("$ASSERT_CURRENT_TEST") + else + printf 'PASS\n' + fi + ASSERT_CURRENT_TEST="" + ASSERT_CURRENT_FAILED=0 +} + +# assert_summary +# Emit the per-file totals on stdout and return 0 if all tests passed, +# 1 otherwise. Call at the end of every test file. +assert_summary() { + _assert_finalize_current + echo + echo " ----------------------------------------------------------" + if [[ "$ASSERT_TESTS_FAILED" -eq 0 ]]; then + echo " ${ASSERT_TESTS_TOTAL} test(s) passed" + return 0 + fi + echo " ${ASSERT_TESTS_FAILED}/${ASSERT_TESTS_TOTAL} test(s) FAILED:" + local n + for n in "${ASSERT_FAILED_NAMES[@]}"; do + echo " * ${n}" + done + return 1 +} + +# _assert_fail +# Record a failure on the current test. Multiple calls accumulate so +# you can see every broken expectation in one run. +_assert_fail() { + ASSERT_CURRENT_FAILED=$((ASSERT_CURRENT_FAILED + 1)) + # Newline so the per-test PASS/FAIL is not glued onto the message. + if [[ "$ASSERT_CURRENT_FAILED" -eq 1 ]]; then + printf '\n' + fi + echo " FAIL: $*" >&2 +} + +# --- command runner ------------------------------------------------- + +# run +# Execute a command in a subshell, capturing stdout/stderr/status into +# LAST_STDOUT / LAST_STDERR / LAST_STATUS so subsequent assertions can +# inspect the result without re-running the SUT. +run() { + local out_file err_file + out_file=$(mktemp) + err_file=$(mktemp) + # `set +e` so a failing SUT just sets LAST_STATUS; the harness + # stays alive. + set +e + ( "$@" ) >"$out_file" 2>"$err_file" + LAST_STATUS=$? + set -e + LAST_STDOUT=$(cat "$out_file") + LAST_STDERR=$(cat "$err_file") + rm -f "$out_file" "$err_file" +} + +# --- value assertions ----------------------------------------------- + +assert_status() { + local expected="$1" + if [[ "$LAST_STATUS" -ne "$expected" ]]; then + _assert_fail "expected exit status ${expected}, got ${LAST_STATUS} (stderr: ${LAST_STDERR})" + fi +} + +assert_stdout_equals() { + local expected="$1" + if [[ "$LAST_STDOUT" != "$expected" ]]; then + _assert_fail "stdout mismatch + expected: [${expected}] + actual: [${LAST_STDOUT}]" + fi +} + +assert_stdout_empty() { + if [[ -n "$LAST_STDOUT" ]]; then + _assert_fail "expected empty stdout, got: [${LAST_STDOUT}]" + fi +} + +assert_stdout_contains() { + local needle="$1" + if ! grep -qF -- "$needle" <<<"$LAST_STDOUT"; then + _assert_fail "expected stdout to contain [${needle}], got: [${LAST_STDOUT}]" + fi +} + +assert_stdout_not_contains() { + local needle="$1" + if grep -qF -- "$needle" <<<"$LAST_STDOUT"; then + _assert_fail "expected stdout NOT to contain [${needle}], got: [${LAST_STDOUT}]" + fi +} + +assert_stderr_contains() { + local needle="$1" + if ! grep -qF -- "$needle" <<<"$LAST_STDERR"; then + _assert_fail "expected stderr to contain [${needle}], got: [${LAST_STDERR}]" + fi +} + +assert_stderr_not_contains() { + local needle="$1" + if grep -qF -- "$needle" <<<"$LAST_STDERR"; then + _assert_fail "expected stderr NOT to contain [${needle}], got: [${LAST_STDERR}]" + fi +} + +assert_equals() { + local expected="$1" + local actual="$2" + if [[ "$expected" != "$actual" ]]; then + _assert_fail "expected [${expected}], got [${actual}]" + fi +} + +assert_not_equals() { + local unexpected="$1" + local actual="$2" + if [[ "$unexpected" == "$actual" ]]; then + _assert_fail "expected NOT [${unexpected}], got [${actual}]" + fi +} + +assert_file_exists() { + local path="$1" + if [[ ! -f "$path" ]]; then + _assert_fail "expected file to exist: ${path}" + fi +} + +assert_file_empty() { + local path="$1" + if [[ ! -f "$path" ]]; then + _assert_fail "expected file (empty) to exist: ${path}" + return + fi + if [[ -s "$path" ]]; then + _assert_fail "expected file ${path} to be empty, got: [$(cat "$path")]" + fi +} + +assert_file_contains() { + local path="$1" + local needle="$2" + if [[ ! -f "$path" ]]; then + _assert_fail "expected file to exist for contains check: ${path}" + return + fi + if ! grep -qF -- "$needle" "$path"; then + _assert_fail "expected ${path} to contain [${needle}], got: [$(cat "$path")]" + fi +} + +# --- kubectl-mock-aware assertions ---------------------------------- +# These look at the call log produced by `lib/kubectl_mock.sh` +# (KUBECTL_CALLS_FILE). + +# assert_kubectl_no_calls +# Verify the mock kubectl recorded zero invocations. +assert_kubectl_no_calls() { + local path="${KUBECTL_CALLS_FILE:-}" + if [[ -z "$path" ]]; then + _assert_fail "KUBECTL_CALLS_FILE not set -- did you source lib/kubectl_mock.sh?" + return + fi + if [[ -s "$path" ]]; then + _assert_fail "expected zero kubectl calls, got: +$(cat "$path")" + fi +} + +# assert_kubectl_call_count +assert_kubectl_call_count() { + local expected="$1" + local path="${KUBECTL_CALLS_FILE:-}" + local actual=0 + if [[ -n "$path" && -f "$path" ]]; then + actual=$(wc -l <"$path" | tr -d ' ') + fi + if [[ "$actual" -ne "$expected" ]]; then + _assert_fail "expected ${expected} kubectl call(s), got ${actual}: +$(cat "$path" 2>/dev/null || true)" + fi +} + +# assert_kubectl_call +# Each recorded call is a single line consisting of all positional +# args joined by a single space (see kubectl_mock.sh). The match is +# exact equality of at least one recorded line. +assert_kubectl_call() { + local expected="$1" + local path="${KUBECTL_CALLS_FILE:-}" + if [[ -z "$path" || ! -f "$path" ]]; then + _assert_fail "no kubectl call log to search (expected: ${expected})" + return + fi + if ! grep -qxF -- "$expected" "$path"; then + _assert_fail "expected kubectl call [${expected}] not found. Actual log: +$(cat "$path")" + fi +} + +# assert_kubectl_call_contains +# Looser variant: substring match against any recorded call line. +assert_kubectl_call_contains() { + local needle="$1" + local path="${KUBECTL_CALLS_FILE:-}" + if [[ -z "$path" || ! -f "$path" ]]; then + _assert_fail "no kubectl call log to search (needle: ${needle})" + return + fi + if ! grep -qF -- "$needle" "$path"; then + _assert_fail "expected some kubectl call to contain [${needle}]. Actual log: +$(cat "$path")" + fi +} diff --git a/example/gpu-validation-cluster/tests/lib/extract_script.sh b/example/gpu-validation-cluster/tests/lib/extract_script.sh new file mode 100755 index 000000000..32700bb3e --- /dev/null +++ b/example/gpu-validation-cluster/tests/lib/extract_script.sh @@ -0,0 +1,110 @@ +#!/bin/bash +# Extracts a multi-line `data.: |` script block from a Kubernetes +# ConfigMap YAML and writes the body to stdout. Pure bash + awk so the +# test suite does not depend on PyYAML / yq. +# +# Usage: +# extract_configmap_data /path/to/configmap.yaml KEY_NAME +# +# Limitations (acceptable for the cluster-validation-config layout): +# * The ConfigMap is the only document containing `data:` at column 0 +# (or column 2 if nested under `metadata:`). We anchor on a `data:` +# line at column 0, then read keys at column 2. +# * The block scalar must use `|` (literal). `>` (folded) is not +# supported. The actual file uses `|` everywhere. +# * Trailing blank lines in the block are dropped by awk default; +# this matches kubectl's own normalization. + +# extract_cronjob_orchestrator +# Extract the multi-line bash body from the `submit-mpijob` container's +# `args: - |` block in cluster-validation-job.yaml. We anchor on the +# container name line so we only pick up that one args block (not the +# other init/launcher containers in the same file). +# +# Approach: find `name: submit-mpijob` (at any indent), then scan +# forward for the first `args:` line, then the `- |` line, then capture +# the body until indentation falls below the body's indent. +extract_cronjob_orchestrator() { + local yaml_path="$1" + if [[ ! -f "$yaml_path" ]]; then + echo "extract_cronjob_orchestrator: file not found: $yaml_path" >&2 + return 2 + fi + awk ' + function leading_spaces(s, i) { + for (i = 1; i <= length(s); i++) { + if (substr(s, i, 1) != " ") return i - 1 + } + return length(s) + } + BEGIN { state = 0; body_indent = -1 } + # state 0: looking for the submit-mpijob container + state == 0 && /name:[[:space:]]*submit-mpijob[[:space:]]*$/ { + state = 1; next + } + # state 1: looking for the args: line + state == 1 && /^[[:space:]]+args:[[:space:]]*$/ { state = 2; next } + # state 2: looking for the literal-scalar opener `- |` + state == 2 && /^[[:space:]]+-[[:space:]]+\|[[:space:]]*$/ { + state = 3; body_indent = -1; next + } + # state 3: capture body lines until indent < body_indent (and line + # is non-blank). Blank lines are preserved verbatim. + state == 3 { + if ($0 == "") { print ""; next } + ls = leading_spaces($0) + if (body_indent < 0) { + body_indent = ls + } + if (ls < body_indent) { + state = 4 + next + } + print substr($0, body_indent + 1) + } + ' "$yaml_path" +} + +extract_configmap_data() { + local yaml_path="$1" + local key="$2" + if [[ ! -f "$yaml_path" ]]; then + echo "extract_configmap_data: file not found: $yaml_path" >&2 + return 2 + fi + awk -v want="$key" ' + BEGIN { in_data = 0; in_block = 0; block_indent = 4 } + function leading_spaces(s, i) { + for (i = 1; i <= length(s); i++) { + if (substr(s, i, 1) != " ") return i - 1 + } + return length(s) + } + # Start of the top-level data: block (column 0). + /^data:[[:space:]]*$/ { in_data = 1; in_block = 0; next } + # Leave data when we hit another top-level YAML key. + in_data && /^[A-Za-z0-9_-]+:/ && !/^data:/ { in_data = 0; in_block = 0 } + in_data { + # Match ` KEY: |` -- exactly two-space indent for data keys. + if (match($0, /^ ([A-Za-z0-9_-]+):[[:space:]]*\|[[:space:]]*$/, m)) { + if (m[1] == want) { + in_block = 1 + next + } else if (in_block) { + in_block = 0 + } + } + if (in_block) { + # Blank lines stay part of the block. + if ($0 == "") { print ""; next } + ls = leading_spaces($0) + # Any line less indented than block_indent ends the block + # (matches YAML block-scalar semantics: a less-indented + # non-blank line terminates the scalar). + if (ls < block_indent) { in_block = 0; next } + # Strip exactly block_indent leading spaces. + print substr($0, block_indent + 1) + } + } + ' "$yaml_path" +} diff --git a/example/gpu-validation-cluster/tests/lib/kubectl_mock.sh b/example/gpu-validation-cluster/tests/lib/kubectl_mock.sh new file mode 100755 index 000000000..f19b532a4 --- /dev/null +++ b/example/gpu-validation-cluster/tests/lib/kubectl_mock.sh @@ -0,0 +1,924 @@ +#!/bin/bash +# kubectl mock infrastructure for the gpu-validation-cluster bash +# test suite. +# +# Why a real executable on PATH rather than a bash function override: +# The helper library under test is sourced into the test process, +# but the orchestrator's `_run_phase_generic` writes per-phase +# scripts to /tmp and `source`s them. In CI environments where a +# real `kubectl` exists on PATH, a bash function override is only +# visible to the current shell; sub-shells that look up `kubectl` +# via execvp will find the real binary and try to talk to a real +# cluster. Putting a fake `kubectl` on PATH as the FIRST entry +# guarantees every kubectl invocation -- in any sub-shell, sourced +# script, or process substitution -- hits the mock. +# +# Usage: +# source "$(dirname "$0")/lib/kubectl_mock.sh" +# kubectl_mock_init # creates tmpdir, installs PATH +# kubectl_mock_reset # zero call log + canned state +# kubectl_mock_set_label \ +# # serve for jsonpath get +# kubectl_mock_fail label 1 # next `kubectl label` exits 1 +# .run SUT. +# assert_kubectl_call "label node node-a x=passed --overwrite" +# kubectl_mock_cleanup # on EXIT +# +# Call log format: +# One line per kubectl invocation, all positional args +# space-separated, e.g. `label node node-a amd.com/x=passed --overwrite`. + +# --- module state --------------------------------------------------- +KUBECTL_MOCK_DIR="" +KUBECTL_CALLS_FILE="" +KUBECTL_STATE_FILE="" +KUBECTL_FAIL_DIR="" +KUBECTL_MOCK_ORIG_PATH="" + +# kubectl_mock_init +# One-time setup per test process: creates a temp dir, writes a +# `kubectl` shim into it, prepends it to PATH. Registers an EXIT trap +# to clean up unless the caller already manages cleanup. +kubectl_mock_init() { + if [[ -n "$KUBECTL_MOCK_DIR" && -d "$KUBECTL_MOCK_DIR" ]]; then + return 0 + fi + KUBECTL_MOCK_DIR=$(mktemp -d -t kubectl-mock-XXXXXX) + KUBECTL_CALLS_FILE="${KUBECTL_MOCK_DIR}/calls.log" + KUBECTL_STATE_FILE="${KUBECTL_MOCK_DIR}/state" + KUBECTL_FAIL_DIR="${KUBECTL_MOCK_DIR}/fail" + mkdir -p "$KUBECTL_FAIL_DIR" + : >"$KUBECTL_CALLS_FILE" + : >"$KUBECTL_STATE_FILE" + + KUBECTL_MOCK_ORIG_PATH="$PATH" + + # Render the shim. Note we use single-quoted heredoc so the inner + # script sees its own variables at runtime via the env we export. + cat >"${KUBECTL_MOCK_DIR}/kubectl" <<'KCT' +#!/bin/bash +# Mock kubectl. All invocations are recorded; verbs `label`, +# `annotate`, and `get` (with -o jsonpath=) are honored for behavioral +# tests. Any other verb returns 99 so an accidental real-world call +# in a test does not silently succeed. +# +# stdin handling: callers (especially `kubectl apply -f -`) pipe data +# into us. Drain stdin to /dev/null before exiting so the upstream +# command (e.g. `sed`) does not receive SIGPIPE. Without this, runs +# under `set -o pipefail` see intermittent rc=141 from the pipe and +# treat the apply as failed -- a classic flake. +if [[ ! -t 0 ]]; then + cat >/dev/null 2>&1 || true +fi + +CALLS="${KUBECTL_CALLS_FILE:?KUBECTL_CALLS_FILE must be set}" +STATE="${KUBECTL_STATE_FILE:?KUBECTL_STATE_FILE must be set}" +FAILDIR="${KUBECTL_FAIL_DIR:?KUBECTL_FAIL_DIR must be set}" + +# Capture original argv into ARGS BEFORE any I/O so concurrent mock +# invocations cannot corrupt our view. (Phase 4's bounded-parallel +# pair_runners produce real concurrent invocations.) Previous code +# appended to CALLS then re-read it with `tail -n1`, which is racy: +# two concurrent writers can interleave printfs, and `tail -n1` may +# return another shell's last write entirely. +ARGS=( "$@" ) + +# Build the call-log line in memory, then append it with a SINGLE +# write so the per-line shape is preserved even if other mock procs +# are writing concurrently. POSIX guarantees that write up to +# PIPE_BUF bytes (typically 4096) to an O_APPEND fd is atomic, and +# kubectl arg lines fit easily under that limit. +call_line="$1" +shift || true +for a in "$@"; do + call_line+=" $a" +done +printf '%s\n' "$call_line" >>"$CALLS" + +verb="${ARGS[0]}" +case "$verb" in + label|annotate) + if [[ -f "${FAILDIR}/${verb}" ]]; then + ec=$(cat "${FAILDIR}/${verb}") + # one-shot: remove after consuming + rm -f "${FAILDIR}/${verb}" + exit "$ec" + fi + if [[ -f "${FAILDIR}/${verb}.sticky" ]]; then + ec=$(cat "${FAILDIR}/${verb}.sticky") + exit "$ec" + fi + exit 0 + ;; + get) + # Expected shape: get node -o jsonpath={.} + # We respond by looking up the requested label in state. + # State format, one entry per line: + # |= + if [[ -f "${FAILDIR}/get" ]]; then + ec=$(cat "${FAILDIR}/get") + rm -f "${FAILDIR}/get" + exit "$ec" + fi + if [[ -f "${FAILDIR}/get.sticky" ]]; then + ec=$(cat "${FAILDIR}/get.sticky") + exit "$ec" + fi + + # ---------------------------------------------------------- + # Phase-1-specific early route: + # get pods -l job-name= -o jsonpath={.items[-1:].metadata.name} + # The generic `-l` arm below short-circuits with `exit 0` for + # selector listings, so this pattern -- which combines -l with + # a jsonpath -- must be detected first. + # ---------------------------------------------------------- + p_jobname="" + p_jsonpath="" + p_is_pods=0 + # Phase-5 selector pieces. The Kubeflow MPIJob + # selector shape is + # training.kubeflow.org/job-name=,training.kubeflow.org/replica-type= + # optionally combined with --field-selector spec.nodeName=. + # We capture each piece independently of the early-route + # job-name= parser above so the Phase-5 routes can match without + # disturbing existing Phase 1/4 jobname semantics. + p_kf_jobname="" + p_kf_replica="" + p_field_node="" + p_first_positional="" + p_seen_pods=0 + pi=1 + while [[ $pi -lt ${#ARGS[@]} ]]; do + case "${ARGS[$pi]}" in + pods|pod) + p_is_pods=1 + p_seen_pods=1 + ;; + -l) + pj=$((pi + 1)) + if [[ $pj -lt ${#ARGS[@]} ]]; then + psel="${ARGS[$pj]}" + if [[ "$psel" == "job-name="* ]]; then + p_jobname="${psel#job-name=}" + fi + # Walk comma-separated selector fragments and + # pull out the Kubeflow keys we care about. + IFS=',' read -r -a _sel_parts <<<"$psel" + for _sp in "${_sel_parts[@]}"; do + case "$_sp" in + training.kubeflow.org/job-name=*) + p_kf_jobname="${_sp#training.kubeflow.org/job-name=}" + ;; + training.kubeflow.org/replica-type=*) + p_kf_replica="${_sp#training.kubeflow.org/replica-type=}" + ;; + esac + done + fi + ;; + --field-selector) + pj=$((pi + 1)) + if [[ $pj -lt ${#ARGS[@]} ]]; then + _fs="${ARGS[$pj]}" + if [[ "$_fs" == "spec.nodeName="* ]]; then + p_field_node="${_fs#spec.nodeName=}" + fi + fi + ;; + --field-selector=*) + _fs="${ARGS[$pi]#--field-selector=}" + if [[ "$_fs" == "spec.nodeName="* ]]; then + p_field_node="${_fs#spec.nodeName=}" + fi + ;; + -o) + pj=$((pi + 1)) + if [[ $pj -lt ${#ARGS[@]} ]]; then + p_jsonpath="${ARGS[$pj]}" + fi + ;; + *) + # First non-flag positional after the verb. Used by + # the Phase-5 exit-code route which targets a single + # pod by name: `kubectl get pod -o .`. + if [[ "$p_seen_pods" -eq 1 && -z "$p_first_positional" \ + && "${ARGS[$pi]}" != -* ]]; then + p_first_positional="${ARGS[$pi]}" + fi + ;; + esac + pi=$((pi + 1)) + done + + # ---------------------------------------------------------- + # Phase-5 per-worker pod lookups. + # PHASE5_SCRIPT issues: + # a) get pods -l training.kubeflow.org/job-name=, + # training.kubeflow.org/replica-type=worker + # --field-selector spec.nodeName= + # -o jsonpath='{.items[0].metadata.name}' + # -> worker pod name on that node (or empty) + # b) get pods -l training.kubeflow.org/job-name=, + # training.kubeflow.org/replica-type=launcher + # -o jsonpath='{.items[0].metadata.name}' + # -> launcher pod name (or empty) + # c) get pod -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}' + # -> per-worker container exit code (or empty) + # ---------------------------------------------------------- + if [[ "$p_is_pods" -eq 1 && -n "$p_kf_jobname" \ + && "$p_kf_replica" == "worker" && -n "$p_field_node" \ + && "$p_jsonpath" == *".items"*".metadata.name"* ]]; then + val="" + while IFS= read -r line; do + if [[ "$line" == "phase5-worker-pod|${p_kf_jobname}|${p_field_node}="* ]]; then + val="${line#phase5-worker-pod|${p_kf_jobname}|${p_field_node}=}" + fi + done <"$STATE" + printf '%s' "$val" + exit 0 + fi + if [[ "$p_is_pods" -eq 1 && -n "$p_kf_jobname" \ + && "$p_kf_replica" == "launcher" \ + && "$p_jsonpath" == *".items"*".metadata.name"* ]]; then + val="" + while IFS= read -r line; do + if [[ "$line" == "phase5-launcher-pod|${p_kf_jobname}="* ]]; then + val="${line#phase5-launcher-pod|${p_kf_jobname}=}" + fi + done <"$STATE" + printf '%s' "$val" + exit 0 + fi + if [[ "$p_is_pods" -eq 1 && -n "$p_first_positional" \ + && "$p_jsonpath" == *"terminated.exitCode"* ]]; then + val="" + while IFS= read -r line; do + if [[ "$line" == "phase5-pod-exit|${p_first_positional}="* ]]; then + val="${line#phase5-pod-exit|${p_first_positional}=}" + fi + done <"$STATE" + printf '%s' "$val" + exit 0 + fi + if [[ "$p_is_pods" -eq 1 && -n "$p_jobname" \ + && "$p_jsonpath" == *".items"*".metadata.name"* ]]; then + val="" + while IFS= read -r line; do + if [[ "$line" == "pod-for-job|${p_jobname}="* ]]; then + val="${line#pod-for-job|${p_jobname}=}" + fi + done <"$STATE" + printf '%s' "$val" + exit 0 + fi + + # ---------------------------------------------------------- + # Phase-4-specific routes: + # get pods -l job-name= -o jsonpath={.items[0].status.podIP} + # -- look up pod-ip|= from state. + # get pods -l job-name= --no-headers (no -o) + # -- list one pod-name line per seeded pod-for-job entry; + # PHASE4_DRIVER_SCRIPT only uses `wc -l` on the result + # to count whether ANY pod exists (admission check). + # Both patterns are pod selectors with -l job-name=., so they + # must be detected before the generic -l arm below short-circuits. + # ---------------------------------------------------------- + if [[ "$p_is_pods" -eq 1 && -n "$p_jobname" \ + && "$p_jsonpath" == *".items"*".status.podIP"* ]]; then + val="" + while IFS= read -r line; do + if [[ "$line" == "pod-ip|${p_jobname}="* ]]; then + val="${line#pod-ip|${p_jobname}=}" + fi + done <"$STATE" + printf '%s' "$val" + exit 0 + fi + if [[ "$p_is_pods" -eq 1 && -n "$p_jobname" \ + && -z "$p_jsonpath" ]]; then + # No -o flag: emit one line per seeded pod-for-job entry + # (matches `kubectl get pods -l job-name=X --no-headers` shape + # closely enough for `wc -l`-based existence checks). + while IFS= read -r line; do + if [[ "$line" == "pod-for-job|${p_jobname}="* ]]; then + pod_name="${line#pod-for-job|${p_jobname}=}" + printf '%s\n' "$pod_name" + fi + done <"$STATE" + exit 0 + fi + + # ---------------------------------------------------------- + # Phase-4.5 worker-pod listing routes. + # PHASE45_PREFLIGHT_SCRIPT issues three pod listings: + # 1. get pods -n NS -l -o jsonpath={.items[*].metadata.name} + # -> space-separated pod names (KUBECTL_MOCK_POD_NAMES) + # 2. get pods -n NS -l -o jsonpath={.items[*].status.podIP} + # -> space-separated pod IPs (KUBECTL_MOCK_POD_IPS) + # 3. get pods -n NS -l + # -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' + # -> one node name per line (KUBECTL_MOCK_NODE_NAMES, + # \n-separated input) + # Plus the MPIJob discovery: + # 4. get mpijob -o jsonpath={.items[*].metadata.name} + # -> space-separated mpijob names (KUBECTL_MOCK_MPIJOB_NAMES) + # The pre-flight script does NOT use the kubeflow selector by + # `job-name=`, so it bypasses the earlier early-routes above. + # We key by the jsonpath shape rather than the selector content + # so tests don't have to mirror the exact label string. + # ---------------------------------------------------------- + p_is_mpijob=0 + p_mpijob_name="" + for ai_idx in "${!ARGS[@]}"; do + ai="${ARGS[$ai_idx]}" + case "$ai" in + mpijob|mpijobs) + p_is_mpijob=1 + # The next non-flag positional after `mpijob` is the + # job name when present (e.g. `get mpijob `). + ai_next=$((ai_idx + 1)) + if [[ $ai_next -lt ${#ARGS[@]} ]]; then + nxt="${ARGS[$ai_next]}" + if [[ "$nxt" != -* ]]; then + p_mpijob_name="$nxt" + fi + fi + ;; + esac + done + if [[ "$p_is_mpijob" -eq 1 \ + && "$p_jsonpath" == *".items"*".metadata.name"* ]]; then + while IFS= read -r line; do + if [[ "$line" == "phase45-mpijob-names="* ]]; then + printf '%s' "${line#phase45-mpijob-names=}" + fi + done <"$STATE" + exit 0 + fi + + # ---------------------------------------------------------- + # Phase-5 MPIJob terminal-condition polling. + # PHASE5_SCRIPT polls both Succeeded and Failed conditions + # every 5s instead of `kubectl wait --for=condition=Succeeded`, + # so a Failed MPIJob short-circuits within ~5s instead of + # blocking for MPIJOB_WAIT_TIME. + # Shape: + # get mpijob -o jsonpath={.status.conditions[?(@.type=="Succeeded")].status} + # get mpijob -o jsonpath={.status.conditions[?(@.type=="Failed")].status} + # Seed via: kubectl_mock_set_mpijob_condition + # ---------------------------------------------------------- + if [[ "$p_is_mpijob" -eq 1 && -n "$p_mpijob_name" \ + && -n "$p_jsonpath" ]]; then + mp_cond="" + case "$p_jsonpath" in + *'@.type=="Succeeded"'*) mp_cond="Succeeded" ;; + *'@.type=="Failed"'*) mp_cond="Failed" ;; + esac + if [[ -n "$mp_cond" ]]; then + val="" + while IFS= read -r line; do + if [[ "$line" == "mpijob|${p_mpijob_name}=${mp_cond}="* ]]; then + val="${line#mpijob|${p_mpijob_name}=${mp_cond}=}" + fi + done <"$STATE" + printf '%s' "$val" + exit 0 + fi + fi + if [[ "$p_is_pods" -eq 1 && -z "$p_jobname" && \ + "$p_jsonpath" == *".items"*".metadata.name"* ]]; then + while IFS= read -r line; do + if [[ "$line" == "phase45-pod-names="* ]]; then + printf '%s' "${line#phase45-pod-names=}" + fi + done <"$STATE" + exit 0 + fi + if [[ "$p_is_pods" -eq 1 && -z "$p_jobname" && \ + "$p_jsonpath" == *".items"*".status.podIP"* ]]; then + while IFS= read -r line; do + if [[ "$line" == "phase45-pod-ips="* ]]; then + printf '%s' "${line#phase45-pod-ips=}" + fi + done <"$STATE" + exit 0 + fi + if [[ "$p_is_pods" -eq 1 && -z "$p_jobname" && \ + "$p_jsonpath" == *"range"*"nodeName"* ]]; then + # node-names entry stores the lines joined by '\n' literal + # (encoded so the one-line STATE format survives); decode + # by replacing the literal '\n' marker with real newlines. + while IFS= read -r line; do + if [[ "$line" == "phase45-node-names="* ]]; then + enc="${line#phase45-node-names=}" + printf '%s' "$enc" | sed 's||\ +|g' + fi + done <"$STATE" + exit 0 + fi + + # Walk args to find `node ` and `-o jsonpath=.`. + node="" + jp="" + i=1 + while [[ $i -lt ${#ARGS[@]} ]]; do + case "${ARGS[$i]}" in + node|nodes) + j=$((i + 1)) + if [[ $j -lt ${#ARGS[@]} ]]; then + node="${ARGS[$j]}" + fi + ;; + -o) + j=$((i + 1)) + if [[ $j -lt ${#ARGS[@]} ]]; then + jp="${ARGS[$j]}" + fi + ;; + -l) + # selector-based list; emit names of state-tracked nodes + # that have the matching label=value (very small subset + # of real selector semantics, but enough for DRY_RUN + # tests). + j=$((i + 1)) + if [[ $j -lt ${#ARGS[@]} ]]; then + sel="${ARGS[$j]}" + sel_key="${sel%%=*}" + sel_val="${sel#*=}" + # Sticky DRY_RUN harness writes selector hits in + # STATE using the same format. Match exact equality. + while IFS= read -r line; do + n="${line%%|*}" + rest="${line#*|}" + k="${rest%%=*}" + v="${rest#*=}" + if [[ "$k" == "$sel_key" && "$v" == "$sel_val" ]]; then + printf 'node/%s\n' "$n" + fi + done <"$STATE" + fi + exit 0 + ;; + esac + i=$((i + 1)) + done + + # jsonpath of the form {.metadata.labels.} + if [[ -n "$jp" ]]; then + inner="${jp#jsonpath=}" + inner="${inner#\{}" + inner="${inner%\}}" + + # ---------------------------------------------------------- + # Phase-1-specific jsonpath shapes (added for): + # * {.status.conditions[?(@.type=="Complete")].status} + # -> look up state line: job|=Complete= + # * {.status.conditions[?(@.type=="Failed")].status} + # -> look up state line: job|=Failed= + # * {.items[-1:].metadata.name} (with -l job-name=X) + # -> look up state line: pod-for-job|= + # ---------------------------------------------------------- + cond_type="" + case "$inner" in + *'@.type=="Complete"'*) cond_type="Complete" ;; + *'@.type=="Failed"'*) cond_type="Failed" ;; + esac + if [[ -n "$cond_type" ]]; then + # We need the job name from ARGS. Re-walk to find it. + jobname="" + ii=1 + while [[ $ii -lt ${#ARGS[@]} ]]; do + case "${ARGS[$ii]}" in + job|jobs) + jj=$((ii + 1)) + if [[ $jj -lt ${#ARGS[@]} ]]; then + jobname="${ARGS[$jj]}" + fi + ;; + esac + ii=$((ii + 1)) + done + if [[ -n "$jobname" ]]; then + val="" + while IFS= read -r line; do + if [[ "$line" == "job|${jobname}=${cond_type}="* ]]; then + val="${line#job|${jobname}=${cond_type}=}" + fi + done <"$STATE" + printf '%s' "$val" + fi + exit 0 + fi + + # Note: pod-by-job-name lookup is handled by the early + # route at the top of the `get` arm (above), because the + # generic `-l` selector path exits before we reach here. + + # Default: treat as a node-label lookup + # ({.metadata.labels.}). + inner="${inner#.metadata.labels.}" + # Unescape backslash-dots back to dots. + key="${inner//\\./.}" + val="" + while IFS= read -r line; do + n="${line%%|*}" + rest="${line#*|}" + if [[ "$n" == "$node" && "$rest" == "${key}="* ]]; then + val="${rest#${key}=}" + fi + done <"$STATE" + printf '%s' "$val" + fi + exit 0 + ;; + delete|apply|patch|create) + # Allowed but no-op for the DRY_RUN orchestrator tests. + # Fail-injection support added for (PHASE1_SCRIPT + # job-creation-failure path needs to drive `apply` non-zero). + # Same one-shot vs. sticky semantics as label|annotate above. + if [[ -f "${FAILDIR}/${verb}" ]]; then + ec=$(cat "${FAILDIR}/${verb}") + rm -f "${FAILDIR}/${verb}" + exit "$ec" + fi + if [[ -f "${FAILDIR}/${verb}.sticky" ]]; then + ec=$(cat "${FAILDIR}/${verb}.sticky") + exit "$ec" + fi + exit 0 + ;; + wait) + # `kubectl wait` for PHASE45_PREFLIGHT_SCRIPT. + # Honors the same one-shot vs sticky failure-injection knobs + # as label/annotate above. Default is success so the script + # proceeds to the SSH-mesh / DNS / MPI / RCCL checks under test. + if [[ -f "${FAILDIR}/wait" ]]; then + ec=$(cat "${FAILDIR}/wait") + rm -f "${FAILDIR}/wait" + exit "$ec" + fi + if [[ -f "${FAILDIR}/wait.sticky" ]]; then + ec=$(cat "${FAILDIR}/wait.sticky") + exit "$ec" + fi + exit 0 + ;; + exec) + # `kubectl exec [-n NS] POD -- CMD .` for PHASE45_PREFLIGHT_SCRIPT + #. The pre-flight script issues many exec calls + # back-to-back (launcher->worker readiness probe, N*N SSH mesh, + # DNS, MPI spawn, RCCL topology). Tests need to control the + # exit code AND stdout body of each one independently. + # + # Mechanism: an in-order response queue. Each entry seeds one + # exec response; entries are consumed FIFO. When the queue is + # empty, exec returns 0 with empty stdout (the harmless + # "everything is fine" default). + # + # Queue file: ${KUBECTL_FAIL_DIR}/exec-queue + # one line per pending response: + # | + # Counter file: ${KUBECTL_FAIL_DIR}/exec-cursor + # integer, next line index to consume (1-based) + # + # Note: we co-locate the queue under KUBECTL_FAIL_DIR rather + # than KUBECTL_MOCK_DIR because the latter is intentionally + # NOT exported (only the file paths the shim needs are), and + # the exec arm needs the queue path visible to sub-shells + # spawned by the SUT. + queue_file="${KUBECTL_FAIL_DIR}/exec-queue" + cursor_file="${KUBECTL_FAIL_DIR}/exec-cursor" + ec=0 + body_b64="" + if [[ -f "$queue_file" ]]; then + idx=1 + if [[ -f "$cursor_file" ]]; then + idx=$(cat "$cursor_file") + fi + line=$(sed -n "${idx}p" "$queue_file") + if [[ -n "$line" ]]; then + ec="${line%%|*}" + body_b64="${line#*|}" + echo $((idx + 1)) >"$cursor_file" + fi + fi + if [[ -n "$body_b64" ]]; then + # The decoded body is written verbatim. If it does not + # already end in a newline, append one -- the SUT's DNS + # check uses `while read -r` over the captured output and + # would otherwise drop a final unterminated line. + decoded=$(printf '%s' "$body_b64" | base64 -d 2>/dev/null || echo "") + if [[ -n "$decoded" ]]; then + printf '%s' "$decoded" + # Add trailing newline only if missing. + last_char="${decoded: -1}" + if [[ "$last_char" != $'\n' ]]; then + printf '\n' + fi + fi + fi + exit "$ec" + ;; + logs) + # `kubectl logs [--tail=N]` for PHASE2_SCRIPT. + # The script greps a single pod's container log for marker lines + # to classify failure reasons. We serve canned log content from + # STATE keyed by pod name; the --tail=N flag is honored as a + # tail-count, defaulting to "all" when absent. + if [[ -f "${FAILDIR}/logs" ]]; then + ec=$(cat "${FAILDIR}/logs") + rm -f "${FAILDIR}/logs" + exit "$ec" + fi + if [[ -f "${FAILDIR}/logs.sticky" ]]; then + ec=$(cat "${FAILDIR}/logs.sticky") + exit "$ec" + fi + pod_name="" + tail_n="" + li=1 + while [[ $li -lt ${#ARGS[@]} ]]; do + case "${ARGS[$li]}" in + --tail=*) tail_n="${ARGS[$li]#--tail=}" ;; + -*) : ;; + *) + # First non-flag positional after `logs` is the pod + # name. PHASE2_SCRIPT never passes a -n namespace, so + # this is unambiguous. + if [[ -z "$pod_name" ]]; then + pod_name="${ARGS[$li]}" + fi + ;; + esac + li=$((li + 1)) + done + # State lookup: line shape is + # pod-log|= + # base64 keeps newlines and quotes intact in the single-line + # state file. When the seed helper is not used, we serve an + # empty body (the SUT treats absent logs as "no markers found" + # and falls back to the default reason). + encoded="" + while IFS= read -r line; do + if [[ "$line" == "pod-log|${pod_name}="* ]]; then + encoded="${line#pod-log|${pod_name}=}" + fi + done <"$STATE" + if [[ -n "$encoded" ]]; then + decoded=$(printf '%s' "$encoded" | base64 -d 2>/dev/null || echo "") + if [[ -n "$tail_n" ]]; then + printf '%s' "$decoded" | tail -n "$tail_n" + else + printf '%s' "$decoded" + fi + fi + exit 0 + ;; +esac +exit 99 +KCT + chmod +x "${KUBECTL_MOCK_DIR}/kubectl" + + export KUBECTL_CALLS_FILE KUBECTL_STATE_FILE KUBECTL_FAIL_DIR + export PATH="${KUBECTL_MOCK_DIR}:${PATH}" + trap kubectl_mock_cleanup EXIT +} + +# kubectl_mock_cleanup +# Tear down the mock. Safe to call multiple times; idempotent. +kubectl_mock_cleanup() { + if [[ -n "$KUBECTL_MOCK_DIR" && -d "$KUBECTL_MOCK_DIR" ]]; then + rm -rf "$KUBECTL_MOCK_DIR" + fi + if [[ -n "$KUBECTL_MOCK_ORIG_PATH" ]]; then + export PATH="$KUBECTL_MOCK_ORIG_PATH" + fi + KUBECTL_MOCK_DIR="" + KUBECTL_CALLS_FILE="" + KUBECTL_STATE_FILE="" + KUBECTL_FAIL_DIR="" + KUBECTL_MOCK_ORIG_PATH="" +} + +# kubectl_mock_reset +# Zero out call log, label state, and any pending fail injections. +# Use between tests to keep them independent. +kubectl_mock_reset() { + : >"$KUBECTL_CALLS_FILE" + : >"$KUBECTL_STATE_FILE" + # `rm -f $KUBECTL_FAIL_DIR/*` also clears the Phase 4.5 exec queue + # / exec cursor, which live under KUBECTL_FAIL_DIR + # for export visibility -- see the `exec` arm of the kubectl shim. + rm -f "$KUBECTL_FAIL_DIR"/* +} + +# kubectl_mock_set_label +# Seed the canned label state served by `kubectl get node . jsonpath`. +kubectl_mock_set_label() { + local node="$1" + local key="$2" + local val="$3" + printf '%s|%s=%s\n' "$node" "$key" "$val" >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_fail +# Cause the next `kubectl ` to exit with . One-shot +# (cleared after the first match). Verb is one of: label, annotate, get. +kubectl_mock_fail() { + local verb="$1" + local ec="$2" + echo "$ec" >"${KUBECTL_FAIL_DIR}/${verb}" +} + +# kubectl_mock_fail_sticky +# Same as kubectl_mock_fail but persistent until the next reset. +kubectl_mock_fail_sticky() { + local verb="$1" + local ec="$2" + echo "$ec" >"${KUBECTL_FAIL_DIR}/${verb}.sticky" +} + +# kubectl_mock_set_job_condition +# Seed the canned response for +# kubectl get job -o jsonpath='{.status.conditions[?(@.type=="")].status}' +# cond_type is "Complete" or "Failed"; status is typically "True" or "". +# Used by PHASE1_SCRIPT tests. +kubectl_mock_set_job_condition() { + local job_name="$1" + local cond_type="$2" + local status="$3" + printf 'job|%s=%s=%s\n' "$job_name" "$cond_type" "$status" \ + >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_set_pod_for_job +# Seed the canned response for +# kubectl get pods -l job-name= -o jsonpath='{.items[-1:].metadata.name}' +# Used by PHASE1_SCRIPT tests. +kubectl_mock_set_pod_for_job() { + local job_name="$1" + local pod_name="$2" + printf 'pod-for-job|%s=%s\n' "$job_name" "$pod_name" \ + >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_set_pod_ip_for_job +# Seed the canned response for +# kubectl get pods -l job-name= -o jsonpath='{.items[0].status.podIP}' +# Used by PHASE4_DRIVER_SCRIPT tests -- the driver polls +# for the server pod's IP before submitting the client Job. Empty +# string means "no pod IP yet"; pair with kubectl_mock_set_pod_for_job +# absent to simulate "no pod ever created" (admission rejected). +kubectl_mock_set_pod_ip_for_job() { + local job_name="$1" + local pod_ip="$2" + printf 'pod-ip|%s=%s\n' "$job_name" "$pod_ip" \ + >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_set_pod_log +# Seed the canned response for `kubectl logs [--tail=N]`. +# The second arg may be either a path to a file (preferred for fixtures) +# or a literal string. The body is base64-encoded into state to survive +# the one-line-per-entry format. Used by PHASE2_SCRIPT tests. +kubectl_mock_set_pod_log() { + local pod_name="$1" + local src="$2" + local body="" + if [[ -f "$src" ]]; then + body=$(cat "$src") + else + body="$src" + fi + local encoded + encoded=$(printf '%s' "$body" | base64 | tr -d '\n') + printf 'pod-log|%s=%s\n' "$pod_name" "$encoded" \ + >>"$KUBECTL_STATE_FILE" +} + +# --- Phase 4.5 seed helpers --------------------------- +# PHASE45_PREFLIGHT_SCRIPT discovers worker pods via three pod +# listings (names, IPs, nodeNames) plus an MPIJob discovery. Tests +# seed those answers verbatim instead of teaching the mock the +# kubeflow selector grammar. + +# kubectl_mock_set_mpijob_names +# Drives `kubectl get mpijob -o jsonpath='{.items[*].metadata.name}'`. +kubectl_mock_set_mpijob_names() { + printf 'phase45-mpijob-names=%s\n' "$1" >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_set_pod_names +# Drives `kubectl get pods -n NS -l . -o jsonpath='{.items[*].metadata.name}'` +# when there is no `job-name=` selector (Phase 4.5 uses Kubeflow +# training labels, not job-name=). +kubectl_mock_set_pod_names() { + printf 'phase45-pod-names=%s\n' "$1" >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_set_pod_ips +# Drives `kubectl get pods -n NS -l . -o jsonpath='{.items[*].status.podIP}'` +# when there is no `job-name=` selector. +kubectl_mock_set_pod_ips() { + printf 'phase45-pod-ips=%s\n' "$1" >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_set_node_names [ .] +# Drives `kubectl get pods -n NS -l . -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}'`. +# Stored with a literal `` separator so the one-line-per-entry +# STATE format survives; the mock decodes back to real newlines. +kubectl_mock_set_node_names() { + local first=1 + local encoded="" + local n + for n in "$@"; do + if [[ "$first" -eq 1 ]]; then + encoded="$n" + first=0 + else + encoded="${encoded}${n}" + fi + done + # Trailing newline mirrors the real jsonpath output shape. + encoded="${encoded}" + printf 'phase45-node-names=%s\n' "$encoded" >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_queue_exec [stdout] +# Enqueue the next response for `kubectl exec .`. Responses are +# consumed FIFO; when the queue is empty the mock falls back to exit 0 +# with empty stdout. The optional stdout body is base64-encoded into +# the queue so embedded newlines and quotes survive the one-line shape. +kubectl_mock_queue_exec() { + local ec="$1" + local body="${2:-}" + local encoded="" + if [[ -n "$body" ]]; then + encoded=$(printf '%s' "$body" | base64 | tr -d '\n') + fi + printf '%s|%s\n' "$ec" "$encoded" \ + >>"${KUBECTL_FAIL_DIR}/exec-queue" +} + +# --- Phase 5 seed helpers ----------------------------- +# PHASE5_SCRIPT looks up per-worker pods by (job_name, node_name) and +# reads per-pod terminated.exitCode for failure-attribution. Tests +# seed these answers verbatim instead of teaching the mock the full +# Kubeflow selector grammar. + +# kubectl_mock_set_phase5_worker_pod_for_node +# Drives the Phase-5 per-node worker-pod lookup: +# get pods -l training.kubeflow.org/job-name=, +# training.kubeflow.org/replica-type=worker +# --field-selector spec.nodeName= +# -o jsonpath='{.items[0].metadata.name}' +# An empty simulates "no worker pod scheduled on that node" +# (the SUT swallows the empty result via `|| true`). +kubectl_mock_set_phase5_worker_pod_for_node() { + local job_name="$1" + local node="$2" + local pod_name="$3" + printf 'phase5-worker-pod|%s|%s=%s\n' \ + "$job_name" "$node" "$pod_name" >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_set_phase5_launcher_pod +# Drives the Phase-5 launcher-pod lookup: +# get pods -l training.kubeflow.org/job-name=, +# training.kubeflow.org/replica-type=launcher +# -o jsonpath='{.items[0].metadata.name}' +# Used by the launcher-log collection step. +kubectl_mock_set_phase5_launcher_pod() { + local job_name="$1" + local pod_name="$2" + printf 'phase5-launcher-pod|%s=%s\n' \ + "$job_name" "$pod_name" >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_set_mpijob_condition +# Seed the canned response for +# kubectl get mpijob -o jsonpath='{.status.conditions[?(@.type=="")].status}' +# cond_type is "Succeeded" or "Failed"; status is typically "True" or "". +# Used by the new PHASE5_SCRIPT wait-loop (polls both conditions; exits +# as soon as either reaches True instead of waiting for the full +# MPIJOB_WAIT_TIME budget). +kubectl_mock_set_mpijob_condition() { + local job_name="$1" + local cond_type="$2" + local status="$3" + printf 'mpijob|%s=%s=%s\n' "$job_name" "$cond_type" "$status" \ + >>"$KUBECTL_STATE_FILE" +} + +# kubectl_mock_set_phase5_pod_exit_code +# Drives the Phase-5 per-pod exit-code lookup: +# get pod -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}' +# An unset entry causes the mock to emit empty stdout, which the SUT +# coerces to "unknown" -- matches the production fallback. +kubectl_mock_set_phase5_pod_exit_code() { + local pod_name="$1" + local exit_code="$2" + printf 'phase5-pod-exit|%s=%s\n' \ + "$pod_name" "$exit_code" >>"$KUBECTL_STATE_FILE" +} diff --git a/example/gpu-validation-cluster/tests/run_all.sh b/example/gpu-validation-cluster/tests/run_all.sh new file mode 100755 index 000000000..61e3ac08a --- /dev/null +++ b/example/gpu-validation-cluster/tests/run_all.sh @@ -0,0 +1,66 @@ +#!/bin/bash +# Test runner entry point for the gpu-validation-cluster bash unit +# tests. Discovers every `test_*.sh` peer file, runs them +# sequentially, and aggregates results. +# +# Exit code: +# 0 -- every test file reported 0 failures +# 1 -- at least one test file reported failures +# +# Usage: +# ./tests/run_all.sh # run all +# ./tests/run_all.sh test_xyz.sh # run a specific file +# +# Each test file is its own bash process so a fatal error in one file +# cannot take the rest of the suite down with it. + +set -uo pipefail + +TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) +cd "$TEST_DIR" + +if [[ $# -gt 0 ]]; then + FILES=("$@") +else + # Sort for deterministic ordering across hosts. + mapfile -t FILES < <(ls test_*.sh 2>/dev/null | sort) +fi + +if [[ "${#FILES[@]}" -eq 0 ]]; then + echo "run_all.sh: no test_*.sh files found in ${TEST_DIR}" >&2 + exit 1 +fi + +TOTAL_FILES=0 +FAILED_FILES=0 +FAILED_NAMES=() + +for f in "${FILES[@]}"; do + TOTAL_FILES=$((TOTAL_FILES + 1)) + if [[ ! -f "$f" ]]; then + echo "run_all.sh: skipping missing file $f" >&2 + FAILED_FILES=$((FAILED_FILES + 1)) + FAILED_NAMES+=("$f (missing)") + continue + fi + if ! bash "$f"; then + FAILED_FILES=$((FAILED_FILES + 1)) + FAILED_NAMES+=("$f") + fi + echo +done + +echo "==================================================================" +echo " Suite summary" +echo "==================================================================" +echo " Files run: ${TOTAL_FILES}" +echo " Files passed: $((TOTAL_FILES - FAILED_FILES))" +echo " Files failed: ${FAILED_FILES}" +if [[ "${FAILED_FILES}" -gt 0 ]]; then + echo " Failed files:" + for n in "${FAILED_NAMES[@]}"; do + echo " * ${n}" + done + exit 1 +fi +exit 0 diff --git a/example/gpu-validation-cluster/tests/test_orchestrator_dry_run.sh b/example/gpu-validation-cluster/tests/test_orchestrator_dry_run.sh new file mode 100755 index 000000000..80a9471cd --- /dev/null +++ b/example/gpu-validation-cluster/tests/test_orchestrator_dry_run.sh @@ -0,0 +1,301 @@ +#!/bin/bash +# Orchestrator DRY_RUN=1 tests. +# +# Covers the three scenarios from the design doc §7 "Orchestrator +# dry-run" bullet and the explicit list in the description: +# 1. All five skip flags true -> exits 0, NO kubectl write calls +# 2. Phases 1+2 enabled, 3-5 skipped -> ends after Phase 2 filter +# 3. Phase 3 enabled but no amd-nic=true nodes -> empty pool, exits 0 +# +# Approach: the orchestrator body is embedded in the CronJob YAML +# (cluster-validation-job.yaml). We extract it once with +# `extract_cronjob_orchestrator`, drop the body to a tmp file, then +# invoke it as a sub-shell with: +# * DRY_RUN=1 in the environment +# * a mock `kubectl` first on PATH (lib/kubectl_mock.sh) +# * a candidate-label state pre-seeded so Phase 0 can find nodes +# to walk the pipeline with +# * the PHASE_NODE_LABEL_SCRIPT body fed via the env var the +# orchestrator reads (`echo "$PHASE_NODE_LABEL_SCRIPT" > /tmp/.`) +# * every other ConfigMap key the orchestrator reads supplied as +# a normal env var +# +# We then assert on: +# * exit status of the orchestrator +# * the kubectl-call log (DRY_RUN must produce zero writes) +# * the orchestrator's stdout (the planned phase order is the only +# contract for engineers reading the cronjob log) + +set -uo pipefail + +TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd -- "${TEST_DIR}/../../.." && pwd) +CONFIGMAP="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-config.yaml" +CRONJOB="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-job.yaml" + +# shellcheck source=./lib/assert.sh +source "${TEST_DIR}/lib/assert.sh" +# shellcheck source=./lib/kubectl_mock.sh +source "${TEST_DIR}/lib/kubectl_mock.sh" +# shellcheck source=./lib/extract_script.sh +source "${TEST_DIR}/lib/extract_script.sh" + +echo "================================================================" +echo " test_orchestrator_dry_run.sh" +echo " CronJob: ${CRONJOB}" +echo "================================================================" + +ORCH_SCRIPT=$(mktemp -t orchestrator-XXXXXX.sh) +trap 'rm -f "$ORCH_SCRIPT"; kubectl_mock_cleanup' EXIT + +extract_cronjob_orchestrator "$CRONJOB" >"$ORCH_SCRIPT" +if [[ ! -s "$ORCH_SCRIPT" ]]; then + echo "FATAL: orchestrator extraction produced empty output" >&2 + exit 1 +fi +if ! bash -n "$ORCH_SCRIPT"; then + echo "FATAL: extracted orchestrator has bash syntax errors" >&2 + exit 1 +fi + +# Read PHASE_NODE_LABEL_SCRIPT once for re-use across tests. The +# orchestrator sources whatever string is present in the env var, so +# we pass the helper library through that channel. +PHASE_NODE_LABEL_SCRIPT=$(extract_configmap_data "$CONFIGMAP" \ + "PHASE_NODE_LABEL_SCRIPT") + +kubectl_mock_init + +# Common env shape supplied to every orchestrator run. Each test may +# override specific keys (notably the SKIP_* flags). The set mirrors +# what the ConfigMap envFrom in the real CronJob delivers. +_run_orchestrator() { + # Args: + local skip1="$1" skip2="$2" skip3="$3" skip4="$4" skip5="$5" + # Sub-shell so per-test env vars do not leak across tests. + DRY_RUN=1 \ + LOG_DIR="$(mktemp -d -t orch-log-XXXXXX)" \ + CANDIDATE_LABEL="amd.com/cluster-validation-candidate=true" \ + SUCCESS_LABEL="amd.com/cluster-validation-status=passed" \ + FAILURE_LABEL="amd.com/cluster-validation-status=failed" \ + TIMESTAMP_ANNOTATION="amd.com/cluster-validation-last-run-timestamp" \ + PHASE1_LABEL_KEY="amd.com/gpu-hw-acceptance" \ + PHASE2_LABEL_KEY="amd.com/gpu-mesh-validation" \ + PHASE3_LABEL_KEY="amd.com/nic-health" \ + PHASE4_LABEL_KEY="amd.com/rail-bandwidth" \ + PHASE5_LABEL_KEY="amd.com/cluster-validation-status" \ + PHASE_FAILURE_REASON_ANNOTATION_SUFFIX="-failure-reason" \ + PHASE_NODE_LABEL_SCRIPT="$PHASE_NODE_LABEL_SCRIPT" \ + SKIP_GPU_HW_ACCEPTANCE="$skip1" \ + SKIP_GPU_MESH_VALIDATION="$skip2" \ + SKIP_NIC_VALIDATION="$skip3" \ + SKIP_RAIL_BANDWIDTH_TEST="$skip4" \ + SKIP_RCCL_TEST="$skip5" \ + DEBUG_DELAY="0" \ + bash "$ORCH_SCRIPT" +} + +# Suppress the bash-shim's `set -uo pipefail` propagation from tripping +# on our env-var references in tests that intentionally leave some +# vars unset. +set +u + +# ------------------------------------------------------------------- +# Scenario 1: all 5 skip flags true -> exit 0, no kubectl writes +# ------------------------------------------------------------------- + +it "DRY_RUN with all 5 skip flags true exits 0 and writes nothing to kubectl" && { + kubectl_mock_reset + # Seed one candidate node so Phase 0 has something to discover. + kubectl_mock_set_label node-1 \ + amd.com/cluster-validation-candidate true + run _run_orchestrator true true true true true + assert_status 0 + # Every skip flag is true -> the only kubectl calls the orchestrator + # makes are read-only (`get` for nodes_with_label / candidate + # listing). Assert that NO label/annotate WRITES happened. + if grep -E "^(label|annotate)( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "DRY_RUN produced kubectl write calls: +$(grep -E '^(label|annotate)( |$)' "$KUBECTL_CALLS_FILE")" + fi + # Sanity: orchestrator printed the DRY_RUN banner. + assert_stdout_contains "DRY_RUN=1 -- no kubectl writes" + # Pass-through trail for every skipped phase. + assert_stdout_contains "SKIP_GPU_HW_ACCEPTANCE=true" + assert_stdout_contains "SKIP_GPU_MESH_VALIDATION=true" + assert_stdout_contains "SKIP_NIC_VALIDATION=true" + assert_stdout_contains "SKIP_RAIL_BANDWIDTH_TEST=true" + assert_stdout_contains "SKIP_RCCL_TEST=true" +} + +# ------------------------------------------------------------------- +# Scenario 2: phases 1+2 enabled, 3-5 skipped -> ends after Phase 2 +# ------------------------------------------------------------------- + +it "DRY_RUN with phases 1+2 enabled and 3-5 skipped ends after Phase 2 filter" && { + kubectl_mock_reset + kubectl_mock_set_label node-1 \ + amd.com/cluster-validation-candidate true + # Mark node-1 as having passed Phase 1 and Phase 2 so the + # filter_passed_nodes gate downstream of each run_phaseN stub + # carries it forward. The DRY_RUN stubs are no-ops, so the + # filter must already see "=passed" on the prior labels for the + # node to survive into the downstream pass-through. + kubectl_mock_set_label node-1 amd.com/gpu-hw-acceptance passed + kubectl_mock_set_label node-1 amd.com/gpu-mesh-validation passed + run _run_orchestrator false false true true true + assert_status 0 + # No kubectl write calls -- DRY_RUN. + if grep -E "^(label|annotate)( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "DRY_RUN produced kubectl write calls: +$(grep -E '^(label|annotate)( |$)' "$KUBECTL_CALLS_FILE")" + fi + # Phase 1 + Phase 2 must have run (banners present). + assert_stdout_contains "[Phase 1] DRY_RUN -- skipping run_phase1" + assert_stdout_contains "[Phase 2] DRY_RUN -- skipping run_phase2" + # Phases 3, 4, 5 must take the SKIP_* pass-through branch (not the + # active-phase branch), i.e. they emit the documented "pass-through" + # log line. Phase 4.5 has no skip flag; it pass-throughs. + assert_stdout_contains "[Phase 3] SKIP_NIC_VALIDATION=true -- pass-through" + assert_stdout_contains "[Phase 4] SKIP_RAIL_BANDWIDTH_TEST=true -- pass-through" + assert_stdout_contains "[Phase 5] SKIP_RCCL_TEST=true -- pass-through" + # Phase 1 banner must precede Phase 2 banner in the log. + # The orchestrator now prefixes every line with a wrapper + # timestamp, so the inner `[Phase N] DRY_RUN` banner is no + # longer at column 0 -- drop the `^` anchor. + line_p1=$(printf '%s\n' "$LAST_STDOUT" | grep -n "\[Phase 1\] DRY_RUN" | head -1 | cut -d: -f1) + line_p2=$(printf '%s\n' "$LAST_STDOUT" | grep -n "\[Phase 2\] DRY_RUN" | head -1 | cut -d: -f1) + if [[ -z "$line_p1" || -z "$line_p2" || "$line_p1" -ge "$line_p2" ]]; then + _assert_fail "Phase 1 banner must precede Phase 2 banner (p1=${line_p1}, p2=${line_p2})" + fi +} + +# ------------------------------------------------------------------- +# Scenario 3: phase 3 enabled but no amd-nic=true nodes -> empty pool +# ------------------------------------------------------------------- + +it "DRY_RUN with Phase 3 enabled but no amd-nic nodes -> Phase 3 empty pool, exit 0" && { + kubectl_mock_reset + kubectl_mock_set_label node-1 \ + amd.com/cluster-validation-candidate true + # Mark node-1 passed for Phase 1+2 so the pool reaches Phase 3. + kubectl_mock_set_label node-1 amd.com/gpu-hw-acceptance passed + kubectl_mock_set_label node-1 amd.com/gpu-mesh-validation passed + # Intentionally do NOT set feature.node.kubernetes.io/amd-nic=true + # on any node. Phase 3 intersect must return an empty set. + run _run_orchestrator false false false true true + assert_status 0 + # No kubectl write calls -- DRY_RUN. + if grep -E "^(label|annotate)( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "DRY_RUN produced kubectl write calls: +$(grep -E '^(label|annotate)( |$)' "$KUBECTL_CALLS_FILE")" + fi + # Phase 3 must take the "no NIC-capable nodes" branch. + assert_stdout_contains "[Phase 3] no NIC-capable nodes -- skipping" + # And the run_phase3 stub must NOT have been reached. + assert_stdout_not_contains "[Phase 3] DRY_RUN -- skipping run_phase3" +} + +# ------------------------------------------------------------------- +# Bonus contract checks +# ------------------------------------------------------------------- + +it "DRY_RUN exits 0 when no candidate nodes are present (empty Phase 0 pool)" && { + kubectl_mock_reset + # No candidate nodes seeded. + run _run_orchestrator false false false false false + assert_status 0 + assert_stdout_contains "empty candidate pool after Phase 0" + # Empty pool -> no phase banners (we exit before Phase 1). + assert_stdout_not_contains "[Phase 1] DRY_RUN" +} + +it "DRY_RUN preserves cleanup pass-through (no MPIJob deletions, no log collection)" && { + kubectl_mock_reset + kubectl_mock_set_label node-1 \ + amd.com/cluster-validation-candidate true + run _run_orchestrator true true true true true + assert_status 0 + # Cleanup / log collection branches both honor DRY_RUN by short-circuiting. + assert_stdout_contains "[Cleanup] DRY_RUN -- skipping old-MPIJob cleanup" + assert_stdout_contains "[Logs] DRY_RUN -- skipping launcher log collection" +} + +# ------------------------------------------------------------------- +# every orchestrator stdout/stderr line carries a per-line +# [YYYY-MM-DD HH:MM:SS.mmm] timestamp prefix. +# +# The orchestrator container args install a `_ts_prefix` reader that +# pipes everything through a `printf '[%s] %s\n' "$(date '+%F %T.%3N')"` +# loop. Sourced phase scripts and helpers inherit the redirect, so +# every echo / kubectl output line lands in the per-run log with the +# same prefix shape -- which is what we assert here. +# +# Per-line readers consuming the cron pod log (operators, log scrapers, +# downstream automation) depend on this format being deterministic, so +# treat any unprefixed line as a regression. +# ------------------------------------------------------------------- +it "every orchestrator stdout line carries a [YYYY-MM-DD HH:MM:SS.mmm] timestamp prefix" && { + kubectl_mock_reset + kubectl_mock_set_label node-1 \ + amd.com/cluster-validation-candidate true + run _run_orchestrator true true true true true + assert_status 0 + # Every non-empty line must match the wrapper format. Allow an + # unprefixed final newline only. + bad_lines=$(printf '%s\n' "$LAST_STDOUT" \ + | grep -nvE '^\[[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}\] ' \ + | grep -v '^[0-9]*:$' || true) + if [[ -n "$bad_lines" ]]; then + _assert_fail "found stdout lines missing the [YYYY-MM-DD HH:MM:SS.mmm] prefix: +${bad_lines}" + fi + # Smoke check: at least one prefixed line was actually emitted + # (so the assertion above can't trivially pass on an empty stream). + if ! printf '%s\n' "$LAST_STDOUT" | grep -qE '^\[[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}\] '; then + _assert_fail "orchestrator stdout was empty or never produced a prefixed line" + fi +} + +# ------------------------------------------------------------------- +# configs/*.yaml must apply standalone: parse as valid YAML AND have +# no leftover __XXX__ placeholder tokens. Guards against accidental +# re-introduction of render-time substitution markers. Operators who +# want to deploy without gpu-cluster.sh need this contract intact. +# ------------------------------------------------------------------- +it "configs/*.yaml parse standalone with no __PLACEHOLDER__ tokens" && { + CFG_DIR=$(cd -- "${TEST_DIR}/../configs" && pwd) + python3 - "$CFG_DIR" <<'PY' +import os, sys, yaml +cfg_dir = sys.argv[1] +checked = 0 +for name in ("cluster-validation-config.yaml", + "cluster-validation-job.yaml", + "nad-per-rail.yaml"): + path = os.path.join(cfg_dir, name) + docs = list(yaml.safe_load_all(open(path))) + checked += 1 + for doc in docs: + if not isinstance(doc, dict): + continue + data = doc.get("data") or {} + for k, v in data.items(): + if "__" in str(v): + if any(tok in str(v) for tok in ( + "__WORKER_REPLICAS__", "__LAUNCHER_REPLICAS__", + "__SLOTS_PER_WORKER__", "__GPU_PER_WORKER__", + "__PF_NIC_PER_WORKER__", "__VF_NIC_PER_WORKER__", + "__NODE_VALIDATION_INTERVAL_MINS__", + "__NODE_SELECTOR_LABELS__", + "__SKIP_GPU_HW_ACCEPTANCE__", "__SKIP_GPU_MESH_VALIDATION__", + "__SKIP_NIC_VALIDATION__", "__SKIP_RAIL_BANDWIDTH_TEST__", + "__SKIP_RCCL_TEST__", "__CRONJOB_SCHEDULE__")): + sys.exit(f"placeholder leak in {name} data[{k}]: {v!r}") +print(f"{checked} YAML files parsed standalone; no placeholder tokens") +PY + if [[ $? -ne 0 ]]; then + _assert_fail "configs/*.yaml standalone check failed" + fi +} + +assert_summary diff --git a/example/gpu-validation-cluster/tests/test_phase1.sh b/example/gpu-validation-cluster/tests/test_phase1.sh new file mode 100755 index 000000000..bf5baf633 --- /dev/null +++ b/example/gpu-validation-cluster/tests/test_phase1.sh @@ -0,0 +1,1058 @@ +#!/bin/bash +# Unit tests for PHASE1_SCRIPT against +# the mocked kubectl harness and result.json fixtures. Supersedes the +#/764 single-Job-per-node design. +# +# Scope (matches the plan and the multi-stage contract +# documented at the top of PHASE1_SCRIPT in cluster-validation-config.yaml): +# +# Verdict model: pass/fail is taken SOLELY from the Kubernetes Job +# condition (Complete=True -> pass, Failed=True -> fail). The per-stage +# result.json is never read -- it is node-local and not reliably +# reachable cross-node, so the Job condition (cluster-scoped API state) +# is the single source of truth. failed-subtest is therefore always +# "unknown" on failure, and the failure reason is the generic +# "job-failed". +# +# Carry-over (still relevant under multi-stage): +# * empty input list (no-op pass) +# * single node pass (one stage) +# * single node fail (Job Failed=True) +# * mixed pass/fail across multiple nodes (one stage) +# * Complete job, no result file read -> pass from Job condition +# * Failed job -> failed with reason=job-failed, subtest=unknown +# * SKIP_GPU_HW_ACCEPTANCE=true -> no Jobs, no CMs, pass-label all +# * parallel-submit: N input nodes -> N submissions before any wait +# * configmap-creation-failure -> reason=configmap-creation-failed +# * PHASE_NODES env fallback +# +# New: +# * missing required env var (GPU_VALIDATION_STAGES_JSON / GPU_PER_WORKER / +# PHASE1_LABEL_KEY) -> fail-fast every input node with reason +# phase1-missing-env:. +# * GPU_VALIDATION_STAGES_JSON empty array -> fail every node with reason +# phase1-stages-empty-or-invalid +# * GPU_VALIDATION_STAGES_JSON missing per-stage required field -> fail +# every node with reason phase1-stages-missing-fields:. +# * multi-stage all-pass: every stage emits its own per-stage annotation +# and the final aggregate label is =passed +# * multi-stage first-fails: failing stage records its annotation, the +# node is dropped, NO further stages submit Jobs/CMs for that node, +# failed-subtest is unknown (Job-condition verdict), aggregate label +# is =failed +# * multi-stage cleanup: each stage deletes its Job + per-stage CM +# +# How PHASE1_SCRIPT is exercised: +# +# The script body is a block-scalar inside cluster-validation-config.yaml. +# We extract it with lib/extract_script.sh, then patch two hardcoded +# absolute paths so the test can run as a non-root user without +# touching /test-runner-configs or /var/log/cluster-validation: +# /test-runner-configs/cluster-validation-test-runner-job-config.yaml +# -> ${TPL_DIR}/cluster-validation-test-runner-job-config.yaml +# /var/log/cluster-validation +# -> ${RESULTS_ROOT} (a per-test tmpdir) +# The script uses `local` / `declare -A`, so we wrap the patched body +# in a function `__phase1_run` and invoke that. The helper library +# (PHASE_NODE_LABEL_SCRIPT) is sourced first so label_phase_passed / +# label_phase_failed / annotate_phase_value are defined. +# +# `kubectl` is the mock from lib/kubectl_mock.sh. Test-runner Job +# "completion" is simulated by seeding state via +# kubectl_mock_set_job_condition + kubectl_mock_set_pod_for_job. The +# verdict is read from the Job condition; result.json is not consulted. + +set -uo pipefail + +TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd -- "${TEST_DIR}/../../.." && pwd) +CONFIGMAP="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-config.yaml" + +# shellcheck source=./lib/assert.sh +source "${TEST_DIR}/lib/assert.sh" +# shellcheck source=./lib/kubectl_mock.sh +source "${TEST_DIR}/lib/kubectl_mock.sh" +# shellcheck source=./lib/extract_script.sh +source "${TEST_DIR}/lib/extract_script.sh" + +echo "================================================================" +echo " test_phase1.sh" +echo " ConfigMap: ${CONFIGMAP}" +echo "================================================================" + +# --- one-time setup ------------------------------------------------- + +PHASE1_DIR=$(mktemp -d -t phase1-tests-XXXXXX) +TPL_DIR="${PHASE1_DIR}/tpl" +RESULTS_ROOT="${PHASE1_DIR}/results" +PHASE1_BODY="${PHASE1_DIR}/phase1-body.sh" +HELPER_SCRIPT="${PHASE1_DIR}/phase-helpers.sh" +mkdir -p "$TPL_DIR" "$RESULTS_ROOT" + +trap 'rm -rf "$PHASE1_DIR"; kubectl_mock_cleanup' EXIT + +# Minimal job template stand-in. PHASE1_SCRIPT pipes a sed-rendered copy +# to `kubectl apply -f -`; the four sed placeholders are listed below. +# Since kubectl is mocked, the only constraints on contents are: +# * the file must exist (`[[ ! -f "$job_template" ]]` guard) +# * the sed expressions must not produce non-zero +cat >"${TPL_DIR}/cluster-validation-test-runner-job-config.yaml" <<'YAML' +apiVersion: batch/v1 +kind: Job +metadata: + name: cluster-validation-test-runner-job +spec: + template: + spec: + nodeName: $$NODE + containers: + - name: test-runner + image: $$TEST_RUNNER_IMAGE + resources: + limits: + amd.com/gpu: $$GPU_PER_WORKER + envFrom: + - configMapRef: + name: $$PHASE1_CONFIG_MAP +YAML + +# Extract PHASE1_SCRIPT and patch the hardcoded paths so the test can +# run as a non-root user without /test-runner-configs or +# /var/log/cluster-validation existing. Also wrap the body in a function +# so `local` / `declare -A` work, and pin the timestamp the script puts +# into Job/CM names so seeded mock state always matches what the script +# looks up. +RAW_PHASE1=$(extract_configmap_data "$CONFIGMAP" "PHASE1_SCRIPT") +if [[ -z "$RAW_PHASE1" ]]; then + echo "FATAL: PHASE1_SCRIPT extraction produced empty output" >&2 + exit 1 +fi + +PATCHED_PHASE1=$(printf '%s\n' "$RAW_PHASE1" \ + | sed "s|/test-runner-configs/cluster-validation-test-runner-job-config.yaml|${TPL_DIR}/cluster-validation-test-runner-job-config.yaml|g" \ + | sed "s|/var/log/cluster-validation|${RESULTS_ROOT}|g" \ + | sed 's|ts=\$(date +%Y%m%d-%H%M%S)|ts="${PHASE1_TEST_TS:-$(date +%Y%m%d-%H%M%S)}"|') + +{ + printf '__phase1_run() {\n' + printf '%s\n' "$PATCHED_PHASE1" + printf '}\n' +} > "$PHASE1_BODY" + +if ! bash -n "$PHASE1_BODY"; then + echo "FATAL: patched PHASE1_SCRIPT has bash syntax errors" >&2 + exit 1 +fi + +extract_configmap_data "$CONFIGMAP" "PHASE_NODE_LABEL_SCRIPT" \ + > "$HELPER_SCRIPT" +if [[ ! -s "$HELPER_SCRIPT" ]]; then + echo "FATAL: PHASE_NODE_LABEL_SCRIPT extraction produced empty output" >&2 + exit 1 +fi +if ! bash -n "$HELPER_SCRIPT"; then + echo "FATAL: extracted helper script has bash syntax errors" >&2 + exit 1 +fi + +kubectl_mock_init + +# Suffix for failure-reason annotation; mirrors ConfigMap default. +export PHASE_FAILURE_REASON_ANNOTATION_SUFFIX="-failure-reason" + +# shellcheck disable=SC1090 +source "$HELPER_SCRIPT" +# shellcheck disable=SC1090 +source "$PHASE1_BODY" + +for fn in label_phase_passed label_phase_failed annotate_phase_value __phase1_run; do + if ! declare -F "$fn" >/dev/null; then + echo "FATAL: required function $fn not defined after sourcing" >&2 + exit 1 + fi +done + +# --- per-test helpers ----------------------------------------------- + +# Default single-stage config used by most tests. Single recipe so a +# pass/fail decision is a clean signal. Tests that need multi-stage +# override GPU_VALIDATION_STAGES_JSON after _reset_phase1_env. +# +# Stage Name is "gst-single" -- matches the deployed default. The Image +# value is irrelevant to the mock (used only for $$TEST_RUNNER_IMAGE +# sed substitution) but must be non-empty so the per-stage field +# validator does not flag it as missing. +_default_stages_json() { + cat <<'JSON' +[{"Name":"gst-single","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"RVS","Recipe":"gst_single","Iterations":1,"TimeoutSeconds":60,"Arguments":""}] +JSON +} + +# Per-test reset: wipe the kubectl call log + seeded state, wipe any +# leftover result.json files, and re-export the baseline env. Tests +# override individual pieces (notably SKIP_GPU_HW_ACCEPTANCE and +# GPU_VALIDATION_STAGES_JSON) before calling __phase1_run. +_reset_phase1_env() { + kubectl_mock_reset + rm -rf "${RESULTS_ROOT:?}"/* + export PHASE1_LABEL_KEY="amd.com/gpu-hw-acceptance" + # TEST_RUNNER_IMAGE / TEST_RUNNER_JOB_WAIT_TIME / + # GPU_VALIDATION_TESTS_JSON were removed. Each stage now carries its + # own Image + TimeoutSeconds inside GPU_VALIDATION_STAGES_JSON. + export GPU_VALIDATION_STAGES_JSON + GPU_VALIDATION_STAGES_JSON="$(_default_stages_json)" + export GPU_PER_WORKER="8" + # Pin the timestamp PHASE1_SCRIPT puts into Job/CM names so seeded + # state always matches what the script looks up. + export PHASE1_TEST_TS="testts0001" + # result-file discovery uses `find -newermt @stage_start` + # to skip stale artifacts. Tests seed fixtures BEFORE invoking + # __phase1_run, so pin stage_start to epoch=1 -- any current-time + # mtime trivially passes the filter. + export PHASE1_TEST_STAGE_START_EPOCH="1" + unset SKIP_GPU_HW_ACCEPTANCE PHASE_NODES +} + +_phase1_now_ts() { + printf '%s' "testts0001" +} + +# Compute the 6-char SHA1 prefix PHASE1_SCRIPT uses for k8s names. +_phase1_node_hash() { + echo -n "$1" | sha1sum | cut -c1-6 +} + +# Job name format: +# cvf-tr-${stage_name}-${node_hash}-${ts} +# Computed unconditionally (no length-conditional bypass), since the +# stage_name + ts contribution makes the 63-char ceiling marginal for +# any realistic node hostname. +_phase1_expected_job_name() { + local node="$1" ts="$2" stage_name="$3" + local h + h=$(_phase1_node_hash "$node") + printf '%s' "cvf-tr-${stage_name}-${h}-${ts}" +} + +# ConfigMap name format: +# cvf-phase1-${stage_name}-${node_hash}-${ts} +_phase1_expected_cm_name() { + local node="$1" ts="$2" stage_name="$3" + local h + h=$(_phase1_node_hash "$node") + printf '%s' "cvf-phase1-${stage_name}-${h}-${ts}" +} + +# Seed mock state so a (stage, node) Job completes with the given +# result.json fixture. Returns nothing; helpers compute deterministic +# Job/Pod names from the inputs. The recipe defaults to stage_name +# with hyphens converted to underscores (matches the canonical +# Name->Recipe mapping in the default stages config); pass an explicit +# 6th arg when the recipe differs from the stage Name. +# Pass path: Job condition Complete=True. The verdict is taken SOLELY +# from the Job condition -- result.json is never read -- so no fixture is +# seeded. (A 4th "fixture" arg is accepted and ignored for backward +# compatibility with existing call sites.) +_seed_job_pass() { + local node="$1" ts="$2" pod="$3" stage_name="${5:-gst-single}" + local job + job=$(_phase1_expected_job_name "$node" "$ts" "$stage_name") + kubectl_mock_set_job_condition "$job" "Complete" "True" + kubectl_mock_set_pod_for_job "$job" "$pod" +} + +# Fail path: Job condition Failed=True. Verdict comes from the Job +# condition; no result.json is consulted. +_seed_job_failed() { + local node="$1" ts="$2" pod="$3" stage_name="${4:-gst-single}" + local job + job=$(_phase1_expected_job_name "$node" "$ts" "$stage_name") + kubectl_mock_set_job_condition "$job" "Failed" "True" + kubectl_mock_set_pod_for_job "$job" "$pod" +} + +# Alias kept for call-site readability: a Complete job with no result +# file is just the normal pass path now (file is never read). +_seed_job_no_result() { _seed_job_pass "$@"; } + +# Suppress the -u trap for tests that intentionally leave optional env +# vars unset (PHASE_NODES, SKIP_GPU_HW_ACCEPTANCE). +set +u + +# ------------------------------------------------------------------- +# 1. Empty input list -> no-op, exit 0, no kubectl side effects. +# ------------------------------------------------------------------- + +it "PHASE1_SCRIPT with empty input list is a no-op and returns 0" && { + _reset_phase1_env + run __phase1_run + assert_status 0 + assert_kubectl_no_calls + assert_stdout_contains "no input nodes -- nothing to do" +} + +# ------------------------------------------------------------------- +# 2. SKIP_GPU_HW_ACCEPTANCE=true -> every input node pass-labeled, +# NO Test Runner Job / ConfigMap created, no parsing entered. +# ------------------------------------------------------------------- + +it "SKIP_GPU_HW_ACCEPTANCE=true pass-labels every input node, no Jobs/CMs created" && { + _reset_phase1_env + export SKIP_GPU_HW_ACCEPTANCE="true" + run __phase1_run node-a node-b node-c + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/gpu-hw-acceptance=passed --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/gpu-hw-acceptance=passed --overwrite" + assert_kubectl_call \ + "label node node-c amd.com/gpu-hw-acceptance=passed --overwrite" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP_GPU_HW_ACCEPTANCE=true must not submit any Jobs/CMs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^create( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP_GPU_HW_ACCEPTANCE=true must not create any CMs: +$(grep -E '^create( |$)' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^get job( |$|s)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP must not poll Jobs: +$(grep -E '^get job' "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "SKIP_GPU_HW_ACCEPTANCE=true -- pass-labeling" +} + +it "SKIP_GPU_HW_ACCEPTANCE accepts case-insensitive value (TRUE)" && { + _reset_phase1_env + export SKIP_GPU_HW_ACCEPTANCE="TRUE" + run __phase1_run node-x + assert_status 0 + assert_kubectl_call \ + "label node node-x amd.com/gpu-hw-acceptance=passed --overwrite" +} + +# ------------------------------------------------------------------- +# 3. Single node pass: Job Complete=True + result.json says all pass +# -> per-stage annotation =passed, aggregate label =passed, NO +# failed-subtest annotation. Verifies per-stage CM is created and +# cleaned up. +# ------------------------------------------------------------------- + +it "single node pass: per-stage annotation + aggregate label, no failure annotation" && { + _reset_phase1_env + ts=$(_phase1_now_ts) + pod="cvf-pod-node-a-001" + _seed_job_pass "node-a" "$ts" "$pod" "" "gst-single" + cm_name=$(_phase1_expected_cm_name "node-a" "$ts" "gst-single") + job_name=$(_phase1_expected_job_name "node-a" "$ts" "gst-single") + run __phase1_run node-a + assert_status 0 + # Per-stage annotation (=passed for gst-single). + assert_kubectl_call \ + "annotate node node-a amd.com/gpu-hw-acceptance-stage-gst-single=passed --overwrite" + # Aggregate label =passed. + assert_kubectl_call \ + "label node node-a amd.com/gpu-hw-acceptance=passed --overwrite" + # Per-stage ConfigMap was created (dry-run + apply pipeline). + assert_kubectl_call_contains \ + "create configmap ${cm_name} --from-literal=GPU_VALIDATION_TESTS_JSON=" + # Cleanup: stage's Job + CM deleted. + assert_kubectl_call \ + "delete job ${job_name} --ignore-not-found=true --wait=false" + assert_kubectl_call \ + "delete configmap ${cm_name} --ignore-not-found=true --wait=false" + # No failed label or failed-subtest annotation. + if grep -F "node-a amd.com/gpu-hw-acceptance=failed" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "pass-path must not write the failed label: +$(grep failed "$KUBECTL_CALLS_FILE")" + fi + if grep -F "annotate node node-a amd.com/gpu-hw-acceptance-failed-subtest" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "pass-path must not write a failed-subtest annotation: +$(grep failed-subtest "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "stage=gst-single done: passed=1 failed=0" +} + +# ------------------------------------------------------------------- +# 4. Single node fail: Job condition Failed=True -> per-stage +# annotation =failed, aggregate label =failed, generic job-failed +# reason, failed-subtest=unknown (result.json is never read, so the +# specific sub-test is not derived). +# ------------------------------------------------------------------- + +it "single node fail: Job Failed=True -> stage annotation failed + aggregate failed" && { + _reset_phase1_env + ts=$(_phase1_now_ts) + pod="cvf-pod-node-b-002" + _seed_job_failed "node-b" "$ts" "$pod" "gst-single" + run __phase1_run node-b + assert_status 0 + # Per-stage annotation. + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-hw-acceptance-stage-gst-single=failed --overwrite" + # Aggregate label. + assert_kubectl_call \ + "label node node-b amd.com/gpu-hw-acceptance=failed --overwrite" + # failure-reason carries the stage prefix + generic job-failed reason. + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-hw-acceptance-failure-reason=stage-gst-single:job-failed --overwrite" + # failed-subtest is unknown (no result.json parse). + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-hw-acceptance-failed-subtest=unknown --overwrite" + assert_stdout_contains "stage=gst-single done: passed=0 failed=1" +} + +# ------------------------------------------------------------------- +# 5. Mixed pass/fail across 3 nodes (single stage): node-b's Job fails, +# node-a + node-c pass. Per-stage annotations + aggregate labels written +# independently from each node's Job condition. +# ------------------------------------------------------------------- + +it "mixed pass/fail across 3 nodes labels each node independently" && { + _reset_phase1_env + ts=$(_phase1_now_ts) + pod_a="cvf-pod-node-a-mix" + pod_b="cvf-pod-node-b-mix" + pod_c="cvf-pod-node-c-mix" + _seed_job_pass "node-a" "$ts" "$pod_a" "" "gst-single" + _seed_job_failed "node-b" "$ts" "$pod_b" "gst-single" + _seed_job_pass "node-c" "$ts" "$pod_c" "" "gst-single" + run __phase1_run node-a node-b node-c + assert_status 0 + # Per-stage annotations. + assert_kubectl_call \ + "annotate node node-a amd.com/gpu-hw-acceptance-stage-gst-single=passed --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-hw-acceptance-stage-gst-single=failed --overwrite" + assert_kubectl_call \ + "annotate node node-c amd.com/gpu-hw-acceptance-stage-gst-single=passed --overwrite" + # Aggregate labels. + assert_kubectl_call \ + "label node node-a amd.com/gpu-hw-acceptance=passed --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/gpu-hw-acceptance=failed --overwrite" + assert_kubectl_call \ + "label node node-c amd.com/gpu-hw-acceptance=passed --overwrite" + # failed-subtest on node-b only (=unknown, no result parse). + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-hw-acceptance-failed-subtest=unknown --overwrite" + if grep -F "annotate node node-a amd.com/gpu-hw-acceptance-failed-subtest" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "node-a (passed) must not get a failed-subtest annotation" + fi + if grep -F "annotate node node-c amd.com/gpu-hw-acceptance-failed-subtest" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "node-c (passed) must not get a failed-subtest annotation" + fi + assert_stdout_contains "stage=gst-single done: passed=2 failed=1" +} + +# ------------------------------------------------------------------- +# 6. Job Complete=True but result.json unreadable/missing -> the Job +# condition is the source of truth, so the node PASSES. The result file +# is enrichment-only (it names the failed sub-test); its absence must +# NOT flip a Complete job to failed. This is the multi-node case: the +# orchestrator cannot read a worker's node-local result file, but +# `kubectl get job` (k8s API) still reports Complete=True. +# ------------------------------------------------------------------- + +it "Complete job (no result file read) -> node PASSES from Job condition" && { + _reset_phase1_env + ts=$(_phase1_now_ts) + pod="cvf-pod-node-d-noresult" + _seed_job_pass "node-d" "$ts" "$pod" "" "gst-single" + kubectl_mock_set_pod_log "$pod" "amd-test-runner: all GPUs healthy" + run __phase1_run node-d + assert_status 0 + assert_kubectl_call \ + "annotate node node-d amd.com/gpu-hw-acceptance-stage-gst-single=passed --overwrite" + assert_kubectl_call \ + "label node node-d amd.com/gpu-hw-acceptance=passed --overwrite" + assert_stdout_contains "PASS (Job Complete=True)" + # The result file is never consulted, so no MISSING-file diagnostic. + assert_stderr_not_contains "MISSING result file" + # The test-runner pod log is persisted on the pass path too. + assert_stdout_contains "pod-log saved=" +} + +# ------------------------------------------------------------------- +# 6b. Job Failed=True -> stage failed with generic reason=job-failed, +# subtest=unknown (no result.json parse). +# ------------------------------------------------------------------- + +it "Failed job -> failed with reason=job-failed (no result parse)" && { + _reset_phase1_env + ts=$(_phase1_now_ts) + pod="cvf-pod-node-d-failnoresult" + _seed_job_failed "node-d" "$ts" "$pod" "gst-single" + run __phase1_run node-d + assert_status 0 + assert_kubectl_call \ + "annotate node node-d amd.com/gpu-hw-acceptance-stage-gst-single=failed --overwrite" + assert_kubectl_call \ + "label node node-d amd.com/gpu-hw-acceptance=failed --overwrite" + assert_kubectl_call \ + "annotate node node-d amd.com/gpu-hw-acceptance-failure-reason=stage-gst-single:job-failed --overwrite" + assert_kubectl_call \ + "annotate node node-d amd.com/gpu-hw-acceptance-failed-subtest=unknown --overwrite" + assert_stderr_contains "FAILED (Job Failed=True)" +} + +# ------------------------------------------------------------------- +# 6c. Failed job with pod logs present -> the captured pod-log tail is +# surfaced in the diagnostic (logs are for triage only, never parsed +# for the verdict). +# ------------------------------------------------------------------- + +it "Failed job with pod logs -> pod-log tail surfaced" && { + _reset_phase1_env + ts=$(_phase1_now_ts) + pod="cvf-pod-node-d-podlog" + _seed_job_failed "node-d" "$ts" "$pod" "gst-single" + kubectl_mock_set_pod_log "$pod" "amd-test-runner: FATAL could not open device /dev/kfd" + run __phase1_run node-d + assert_status 0 + assert_stderr_contains "pod-log tail:" + assert_stderr_contains "FATAL could not open device /dev/kfd" +} + +# ------------------------------------------------------------------- +# 8. Parallel-submit: N input nodes -> exactly N (CM+Job) submissions +# BEFORE any `kubectl get job` poll. Per (stage, node) the script +# emits one `create configmap`, one apply for the CM, and one apply +# for the Job -- so 3 nodes * 1 stage -> 3 creates + 6 applies +# before any poll. +# ------------------------------------------------------------------- + +it "parallel-submit: N nodes -> N CMs + N Jobs submitted, all before any wait poll" && { + _reset_phase1_env + ts=$(_phase1_now_ts) + pod_a="cvf-pod-node-a-par" + pod_b="cvf-pod-node-b-par" + pod_c="cvf-pod-node-c-par" + _seed_job_pass "node-a" "$ts" "$pod_a" "" "gst-single" + _seed_job_pass "node-b" "$ts" "$pod_b" "" "gst-single" + _seed_job_pass "node-c" "$ts" "$pod_c" "" "gst-single" + run __phase1_run node-a node-b node-c + assert_status 0 + # 3 create-configmap calls. + n_create=$(grep -cE "^create configmap " "$KUBECTL_CALLS_FILE" || true) + assert_equals "3" "$n_create" + # 6 apply calls (3 CM applies + 3 Job applies). + n_apply=$(grep -cE "^apply( |$)" "$KUBECTL_CALLS_FILE" || true) + assert_equals "6" "$n_apply" + # All submits precede any `get job` poll. + last_apply_line=$(grep -nE "^apply" "$KUBECTL_CALLS_FILE" \ + | tail -1 | cut -d: -f1) + first_getjob_line=$(grep -nE "^get job" "$KUBECTL_CALLS_FILE" \ + | head -1 | cut -d: -f1) + if [[ -z "$first_getjob_line" ]]; then + _assert_fail "expected at least one 'get job' poll call" + elif [[ "$last_apply_line" -ge "$first_getjob_line" ]]; then + _assert_fail "submits must all precede any poll (last apply=${last_apply_line}, first get-job=${first_getjob_line}): +$(cat "$KUBECTL_CALLS_FILE")" + fi +} + +# ------------------------------------------------------------------- +# 9. ConfigMap-creation failure: `kubectl apply` returns non-zero +# sticky, so the CM apply in the create|apply pipeline fails first. +# -> stage failed with reason=configmap-creation-failed, +# failed-subtest=unknown; no Job apply for that node, no wait/parse. +# ------------------------------------------------------------------- + +it "kubectl apply failure -> node failed with reason=configmap-creation-failed" && { + _reset_phase1_env + # Sticky apply failure trips the CM create|apply pipeline FIRST + # (CM apply runs before Job apply within the per-stage submit loop). + kubectl_mock_fail_sticky apply 1 + run __phase1_run node-z + assert_status 0 + # Per-stage annotation =failed and aggregate label =failed are + # still attempted via the helper library, which itself fails because + # `kubectl label` and `kubectl annotate` are not failure-injected + # in this test -- only apply. + assert_kubectl_call \ + "annotate node node-z amd.com/gpu-hw-acceptance-stage-gst-single=failed --overwrite" + assert_kubectl_call \ + "label node node-z amd.com/gpu-hw-acceptance=failed --overwrite" + assert_kubectl_call \ + "annotate node node-z amd.com/gpu-hw-acceptance-failure-reason=stage-gst-single:configmap-creation-failed --overwrite" + assert_kubectl_call \ + "annotate node node-z amd.com/gpu-hw-acceptance-failed-subtest=unknown --overwrite" + # No `get job` poll (submit-failed entries skip the wait/parse phases). + if grep -E "^get job" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "submit-failed job must not be polled: +$(grep -E '^get job' "$KUBECTL_CALLS_FILE")" + fi + assert_stderr_contains "failed to create configmap=" +} + +# ------------------------------------------------------------------- +# 10. Missing required env var -> every input node labeled =failed with +# reason=phase1-missing-env:.; no Jobs/CMs submitted. +# ------------------------------------------------------------------- + +it "missing GPU_VALIDATION_STAGES_JSON -> all input nodes labeled failed, no submissions" && { + _reset_phase1_env + unset GPU_VALIDATION_STAGES_JSON + run __phase1_run node-y + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-y amd.com/gpu-hw-acceptance=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/gpu-hw-acceptance-failure-reason=phase1-missing-env:GPU_VALIDATION_STAGES_JSON" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "missing-env path must not submit Jobs/CMs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^create( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "missing-env path must not create CMs: +$(grep -E '^create( |$)' "$KUBECTL_CALLS_FILE")" + fi + assert_stderr_contains "required env var(s) unset:" +} + +it "missing GPU_PER_WORKER -> all input nodes labeled failed" && { + _reset_phase1_env + unset GPU_PER_WORKER + run __phase1_run node-y + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-y amd.com/gpu-hw-acceptance=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/gpu-hw-acceptance-failure-reason=phase1-missing-env:GPU_PER_WORKER" +} + +# ------------------------------------------------------------------- +# 10b. GPU_VALIDATION_STAGES_JSON empty array / invalid JSON / missing +# per-stage field. Each shape exercises a distinct fail-fast branch +# in the stages-validation block. +# ------------------------------------------------------------------- + +it "empty GPU_VALIDATION_STAGES_JSON array -> all nodes labeled failed (stages-empty-or-invalid)" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + export GPU_VALIDATION_STAGES_JSON="[]" + run __phase1_run node-y + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-y amd.com/gpu-hw-acceptance=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/gpu-hw-acceptance-failure-reason=phase1-stages-empty-or-invalid" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "empty-stages path must not submit Jobs/CMs" + fi +} + +it "invalid JSON in GPU_VALIDATION_STAGES_JSON -> all nodes labeled failed" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + export GPU_VALIDATION_STAGES_JSON="not-a-json {" + run __phase1_run node-y + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-y amd.com/gpu-hw-acceptance=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/gpu-hw-acceptance-failure-reason=phase1-stages-empty-or-invalid" +} + +it "GPU_VALIDATION_STAGES_JSON missing per-stage required field -> all nodes labeled failed" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + # Stage with no `Image` field -- per-stage validator should reject. + export GPU_VALIDATION_STAGES_JSON='[{"Name":"gst-single","Framework":"RVS","Recipe":"gst_single","TimeoutSeconds":60}]' + run __phase1_run node-y + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-y amd.com/gpu-hw-acceptance=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/gpu-hw-acceptance-failure-reason=phase1-stages-missing-fields" + # Reason should name the missing field for at least one stage. + assert_kubectl_call_contains "stage[0].Image" +} + +# ------------------------------------------------------------------- +# 11. PHASE_NODES env-var fallback: when positional args are empty +# but PHASE_NODES is exported, the script uses that list. +# ------------------------------------------------------------------- + +it "PHASE_NODES env var is used when no positional args are given" && { + _reset_phase1_env + ts=$(_phase1_now_ts) + pod="cvf-pod-env-fallback" + _seed_job_pass "node-env" "$ts" "$pod" "" "gst-single" + export PHASE_NODES="node-env" + run __phase1_run # NB: no positional args + assert_status 0 + assert_kubectl_call \ + "label node node-env amd.com/gpu-hw-acceptance=passed --overwrite" + assert_kubectl_call \ + "annotate node node-env amd.com/gpu-hw-acceptance-stage-gst-single=passed --overwrite" +} + +# ------------------------------------------------------------------- +# 12. Multi-stage all-pass: 2 stages, single node, both +# pass. Both per-stage annotations written, aggregate label +# =passed, no failed-subtest. Verifies stage-2 Job submission +# happens AFTER stage-1 completes. +# ------------------------------------------------------------------- + +it "multi-stage all-pass: both per-stage annotations + aggregate passed" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + export GPU_VALIDATION_STAGES_JSON='[ + {"Name":"gst-single","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"RVS","Recipe":"gst_single","Iterations":1,"TimeoutSeconds":60,"Arguments":""}, + {"Name":"xgmi-lvl1","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"AGFHC","Recipe":"xgmi_lvl1","Iterations":1,"TimeoutSeconds":60,"Arguments":""} + ]' + ts=$(_phase1_now_ts) + pod1="cvf-pod-ms-pass-s1" + pod2="cvf-pod-ms-pass-s2" + _seed_job_pass "node-m" "$ts" "$pod1" "" "gst-single" + _seed_job_pass "node-m" "$ts" "$pod2" "" "xgmi-lvl1" + + run __phase1_run node-m + assert_status 0 + + # Per-stage annotations for both stages. + assert_kubectl_call \ + "annotate node node-m amd.com/gpu-hw-acceptance-stage-gst-single=passed --overwrite" + assert_kubectl_call \ + "annotate node node-m amd.com/gpu-hw-acceptance-stage-xgmi-lvl1=passed --overwrite" + # Aggregate label =passed. + assert_kubectl_call \ + "label node node-m amd.com/gpu-hw-acceptance=passed --overwrite" + # No failed-subtest annotation on the pass path. + if grep -F "annotate node node-m amd.com/gpu-hw-acceptance-failed-subtest" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-pass multi-stage must not write failed-subtest annotation" + fi + # Stage-2 Job was actually submitted (one CM per stage * 1 node = 2 CMs). + cm1=$(_phase1_expected_cm_name "node-m" "$ts" "gst-single") + cm2=$(_phase1_expected_cm_name "node-m" "$ts" "xgmi-lvl1") + assert_kubectl_call_contains "create configmap ${cm1} --from-literal=" + assert_kubectl_call_contains "create configmap ${cm2} --from-literal=" + # Stage-2 progress banner appears after stage-1. + assert_stdout_contains "stage=gst-single done: passed=1 failed=0" + assert_stdout_contains "stage=xgmi-lvl1 done: passed=1 failed=0" +} + +# ------------------------------------------------------------------- +# 13. Multi-stage stop-on-first-failure: stage 1 fails on +# hbm_lvl1; stage 2 MUST NOT submit a Job/CM for that node. +# Aggregate label =failed, failure-reason carries the first +# failing stage name. +# ------------------------------------------------------------------- + +it "multi-stage first-fails: stage-2 NOT submitted, aggregate failed" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + export GPU_VALIDATION_STAGES_JSON='[ + {"Name":"gst-single","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"RVS","Recipe":"gst_single","Iterations":1,"TimeoutSeconds":60,"Arguments":""}, + {"Name":"xgmi-lvl1","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"AGFHC","Recipe":"xgmi_lvl1","Iterations":1,"TimeoutSeconds":60,"Arguments":""} + ]' + ts=$(_phase1_now_ts) + pod1="cvf-pod-ms-fail-s1" + # Stage 1's Job fails (Failed=True); stage 2 must never be submitted + # (stop-on-first-failure). Verdict comes from the Job condition only. + _seed_job_failed "node-f" "$ts" "$pod1" "gst-single" + + run __phase1_run node-f + assert_status 0 + + # Stage 1 annotation =failed. + assert_kubectl_call \ + "annotate node node-f amd.com/gpu-hw-acceptance-stage-gst-single=failed --overwrite" + # Stage 2 annotation MUST NOT be written -- node was dropped before + # stage 2's per-stage iteration submitted anything. + if grep -F "amd.com/gpu-hw-acceptance-stage-xgmi-lvl1" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "stage-2 annotation must not be written when stage-1 failed: +$(grep xgmi-lvl1 "$KUBECTL_CALLS_FILE")" + fi + # Stage-2 Job/CM MUST NOT be submitted: no create configmap or apply + # mentioning the stage-2 name. + cm2=$(_phase1_expected_cm_name "node-f" "$ts" "xgmi-lvl1") + job2=$(_phase1_expected_job_name "node-f" "$ts" "xgmi-lvl1") + if grep -F "create configmap ${cm2}" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "stage-2 CM must not be created (stage-1 failed): +$(grep "${cm2}" "$KUBECTL_CALLS_FILE")" + fi + # Aggregate label =failed. + assert_kubectl_call \ + "label node node-f amd.com/gpu-hw-acceptance=failed --overwrite" + # failed-subtest is unknown (no result parse). + assert_kubectl_call \ + "annotate node node-f amd.com/gpu-hw-acceptance-failed-subtest=unknown --overwrite" + # failure-reason carries the first failing stage name + generic reason. + assert_kubectl_call \ + "annotate node node-f amd.com/gpu-hw-acceptance-failure-reason=stage-gst-single:job-failed --overwrite" + # Skipped-stage progress banner. + assert_stdout_contains "stage=xgmi-lvl1 skipped (no alive nodes left)" +} + +# ------------------------------------------------------------------- +# 14. Multi-stage cleanup: per-stage Job + per-stage CM are deleted +# after each stage. With 2 stages * 1 node we expect 2 delete-job +# and 2 delete-configmap calls (one per stage). +# ------------------------------------------------------------------- + +it "multi-stage cleanup: each stage deletes its Job and per-stage CM" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + export GPU_VALIDATION_STAGES_JSON='[ + {"Name":"gst-single","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"RVS","Recipe":"gst_single","Iterations":1,"TimeoutSeconds":60,"Arguments":""}, + {"Name":"xgmi-lvl1","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"AGFHC","Recipe":"xgmi_lvl1","Iterations":1,"TimeoutSeconds":60,"Arguments":""} + ]' + ts=$(_phase1_now_ts) + pod1="cvf-pod-cleanup-s1" + pod2="cvf-pod-cleanup-s2" + _seed_job_pass "node-c1" "$ts" "$pod1" "" "gst-single" + _seed_job_pass "node-c1" "$ts" "$pod2" "" "xgmi-lvl1" + job1=$(_phase1_expected_job_name "node-c1" "$ts" "gst-single") + job2=$(_phase1_expected_job_name "node-c1" "$ts" "xgmi-lvl1") + cm1=$(_phase1_expected_cm_name "node-c1" "$ts" "gst-single") + cm2=$(_phase1_expected_cm_name "node-c1" "$ts" "xgmi-lvl1") + + run __phase1_run node-c1 + assert_status 0 + + # Per-stage Job + CM deletes for BOTH stages. + assert_kubectl_call \ + "delete job ${job1} --ignore-not-found=true --wait=false" + assert_kubectl_call \ + "delete configmap ${cm1} --ignore-not-found=true --wait=false" + assert_kubectl_call \ + "delete job ${job2} --ignore-not-found=true --wait=false" + assert_kubectl_call \ + "delete configmap ${cm2} --ignore-not-found=true --wait=false" + + # Exactly 2 delete-job and 2 delete-configmap calls (no leaks). + n_del_job=$(grep -cE "^delete job " "$KUBECTL_CALLS_FILE" || true) + n_del_cm=$(grep -cE "^delete configmap " "$KUBECTL_CALLS_FILE" || true) + assert_equals "2" "$n_del_job" + assert_equals "2" "$n_del_cm" +} + +# ------------------------------------------------------------------- +# 15. Skip-single-stage: middle stage Skip=true. Stage 0 +# runs and passes. Stage 1 is annotated skipped without any Job +# submission. Stage 2 still runs and passes. Aggregate label +# =passed (the node had at least one non-skip stage that passed). +# ------------------------------------------------------------------- + +it "skip-single-stage: middle stage Skip=true -> annotation=skipped, others run, aggregate=passed" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + export GPU_VALIDATION_STAGES_JSON='[ + {"Name":"gst-single","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"RVS","Recipe":"gst_single","Iterations":1,"TimeoutSeconds":60,"Arguments":""}, + {"Name":"xgmi-lvl1","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"AGFHC","Recipe":"xgmi_lvl1","Iterations":1,"TimeoutSeconds":60,"Arguments":"","Skip":true}, + {"Name":"pcie-lvl1","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"AGFHC","Recipe":"pcie_lvl1","Iterations":1,"TimeoutSeconds":60,"Arguments":""} + ]' + ts=$(_phase1_now_ts) + pod1="cvf-pod-skip-mid-s1" + pod3="cvf-pod-skip-mid-s3" + _seed_job_pass "node-sk" "$ts" "$pod1" "" "gst-single" + _seed_job_pass "node-sk" "$ts" "$pod3" "" "pcie-lvl1" + + run __phase1_run node-sk + assert_status 0 + + # Stage 0 ran and passed -> annotation present. + assert_kubectl_call \ + "annotate node node-sk amd.com/gpu-hw-acceptance-stage-gst-single=passed --overwrite" + # Stage 1 was skipped -> annotation=skipped, NO Job/CM submitted. + assert_kubectl_call \ + "annotate node node-sk amd.com/gpu-hw-acceptance-stage-xgmi-lvl1=skipped --overwrite" + cm_skipped=$(_phase1_expected_cm_name "node-sk" "$ts" "xgmi-lvl1") + if grep -F "create configmap ${cm_skipped}" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "skipped stage must not create configmap: +$(grep "${cm_skipped}" "$KUBECTL_CALLS_FILE")" + fi + if grep -F "delete configmap ${cm_skipped}" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "skipped stage must not delete configmap (it never created one): +$(grep "${cm_skipped}" "$KUBECTL_CALLS_FILE")" + fi + # Stage 2 still ran and passed -> annotation present. + assert_kubectl_call \ + "annotate node node-sk amd.com/gpu-hw-acceptance-stage-pcie-lvl1=passed --overwrite" + # Aggregate label =passed (>=1 non-skip stage was submitted+passed). + assert_kubectl_call \ + "label node node-sk amd.com/gpu-hw-acceptance=passed --overwrite" + # No failed-subtest annotation on this path. + if grep -F "annotate node node-sk amd.com/gpu-hw-acceptance-failed-subtest" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "skip-single-stage must not write failed-subtest annotation" + fi + assert_stdout_contains "stage=xgmi-lvl1 idx=1 SKIPPED (Skip=true)" +} + +# ------------------------------------------------------------------- +# 16. Skip-all-stages: every stage Skip=true. No Jobs +# submitted, no CMs created. Every stage annotated skipped. +# Aggregate label =skipped (third value, not passed). +# ------------------------------------------------------------------- + +it "skip-all-stages: all Skip=true -> no Jobs submitted, aggregate label=skipped" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + export GPU_VALIDATION_STAGES_JSON='[ + {"Name":"gst-single","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"RVS","Recipe":"gst_single","Iterations":1,"TimeoutSeconds":60,"Arguments":"","Skip":true}, + {"Name":"xgmi-lvl1","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"AGFHC","Recipe":"xgmi_lvl1","Iterations":1,"TimeoutSeconds":60,"Arguments":"","Skip":true} + ]' + + run __phase1_run node-allsk + assert_status 0 + + # Every stage annotated skipped. + assert_kubectl_call \ + "annotate node node-allsk amd.com/gpu-hw-acceptance-stage-gst-single=skipped --overwrite" + assert_kubectl_call \ + "annotate node node-allsk amd.com/gpu-hw-acceptance-stage-xgmi-lvl1=skipped --overwrite" + # Aggregate label =skipped (tri-state: not passed, not failed). + assert_kubectl_call \ + "label node node-allsk amd.com/gpu-hw-acceptance=skipped --overwrite" + # Must NOT label =passed or =failed. + if grep -F "label node node-allsk amd.com/gpu-hw-acceptance=passed" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-skipped path must not write aggregate label=passed" + fi + if grep -F "label node node-allsk amd.com/gpu-hw-acceptance=failed" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-skipped path must not write aggregate label=failed" + fi + # No failed-subtest annotation on the all-skipped path. + if grep -F "annotate node node-allsk amd.com/gpu-hw-acceptance-failed-subtest" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-skipped path must not write failed-subtest annotation" + fi + # No Job/CM submissions of any kind. + if grep -E "^create configmap " "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-skipped path must not create configmaps: +$(grep -E '^create configmap ' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-skipped path must not apply Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^get job" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-skipped path must not poll Jobs: +$(grep -E '^get job' "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "passed=0 failed=0 skipped=1" +} + +# ------------------------------------------------------------------- +# 17. Skip-bad-type: non-boolean Skip ("true" string) +# must fail fast at validation, BEFORE any stage iteration runs. +# Every input node labeled =failed with the Skip-type reason. +# ------------------------------------------------------------------- + +it "skip-bad-type: non-boolean Skip -> fail-fast, no Job/CM submitted" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + # Quoted "true" must be rejected (boolean type check), even though + # an accidentally-stringified value is a likely real-world mistake + # since YAML/JSON conversion can produce strings. + export GPU_VALIDATION_STAGES_JSON='[ + {"Name":"gst-single","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"RVS","Recipe":"gst_single","Iterations":1,"TimeoutSeconds":60,"Arguments":"","Skip":"true"} + ]' + + run __phase1_run node-bskip + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-bskip amd.com/gpu-hw-acceptance=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/gpu-hw-acceptance-failure-reason=phase1-stages-bad-skip-type" + # Reason must name the offending stage index. + assert_kubectl_call_contains "stage[0].Skip=true" + # Fail-fast: no submissions of any kind happened. + if grep -E "^create configmap " "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "bad-Skip-type path must not create configmaps" + fi + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "bad-Skip-type path must not apply Jobs" + fi +} + +# ------------------------------------------------------------------- +# 18. Skip+fail interleave: Skip stage between two real +# stages. Stage 0 passes, stage 1 is skipped, stage 2 fails. +# Skipped stage MUST NOT count toward the "first failing stage" +# bookkeeping -- aggregate failed-subtest comes from stage 2's +# recipe, not the skipped stage. +# ------------------------------------------------------------------- + +it "skip-then-fail: skipped stage does not poison failure-reason" && { + if ! command -v jq >/dev/null 2>&1; then + echo " SKIP: jq not on PATH" >&2 + return 0 + fi + _reset_phase1_env + export GPU_VALIDATION_STAGES_JSON='[ + {"Name":"gst-single","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"RVS","Recipe":"gst_single","Iterations":1,"TimeoutSeconds":60,"Arguments":""}, + {"Name":"xgmi-lvl1","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"AGFHC","Recipe":"xgmi_lvl1","Iterations":1,"TimeoutSeconds":60,"Arguments":"","Skip":true}, + {"Name":"pcie-lvl1","Image":"docker.io/rocm/test-runner:v1.4.0","Framework":"AGFHC","Recipe":"pcie_lvl1","Iterations":1,"TimeoutSeconds":60,"Arguments":""} + ]' + ts=$(_phase1_now_ts) + pod1="cvf-pod-skf-s1" + pod3="cvf-pod-skf-s3" + _seed_job_pass "node-skf" "$ts" "$pod1" "" "gst-single" + # Stage 2's Job fails -- the failure-reason must name stage 2, not the + # skipped stage 1. + _seed_job_failed "node-skf" "$ts" "$pod3" "pcie-lvl1" + + run __phase1_run node-skf + assert_status 0 + + # Stage 0 passed annotation. + assert_kubectl_call \ + "annotate node node-skf amd.com/gpu-hw-acceptance-stage-gst-single=passed --overwrite" + # Stage 1 skipped annotation. + assert_kubectl_call \ + "annotate node node-skf amd.com/gpu-hw-acceptance-stage-xgmi-lvl1=skipped --overwrite" + # Stage 2 failed annotation. + assert_kubectl_call \ + "annotate node node-skf amd.com/gpu-hw-acceptance-stage-pcie-lvl1=failed --overwrite" + # Aggregate label =failed. + assert_kubectl_call \ + "label node node-skf amd.com/gpu-hw-acceptance=failed --overwrite" + # failure-reason names stage 2 (the first real failure), NOT the + # skipped stage 1. + assert_kubectl_call_contains \ + "amd.com/gpu-hw-acceptance-failure-reason=stage-pcie-lvl1:job-failed" + # failed-subtest is unknown (no result parse). + assert_kubectl_call \ + "annotate node node-skf amd.com/gpu-hw-acceptance-failed-subtest=unknown --overwrite" +} + +assert_summary diff --git a/example/gpu-validation-cluster/tests/test_phase2.sh b/example/gpu-validation-cluster/tests/test_phase2.sh new file mode 100755 index 000000000..f47c73d1c --- /dev/null +++ b/example/gpu-validation-cluster/tests/test_phase2.sh @@ -0,0 +1,615 @@ +#!/bin/bash +# Unit tests for PHASE2_SCRIPT against the mocked +# kubectl harness and phase2.log fixtures. +# +# Scope (from the design doc §4 +# "Code Path: PHASE2_SCRIPT" / §7 "Testing Strategy"): +# * pass case (BW above threshold) [TC2] +# * bus-bw-below-threshold fail [TC5] +# * rccl-crash (mpirun non-zero exit) [TC6] +# * timeout (Job stays pending past PHASE2_JOB_WAIT_TIME) [TC9 + TC10] +# * SKIP_GPU_MESH_VALIDATION=true short-circuit [TC3] +# * threshold-too-high inject (PHASE2_BW_THRESHOLD=9999) [TC5] +# * empty input list [TC7] +# * missing-env fast-fail +# * job-template-missing fast-fail +# * Failed=True with no recognized marker -> default rccl-crash +# * PHASE_NODES env-var fallback (when no positional args) +# +# How PHASE2_SCRIPT is exercised: +# +# The script body is a block-scalar inside cluster-validation-config.yaml. +# We extract it with lib/extract_script.sh, then patch the one hardcoded +# absolute path so the test can run as a non-root user without touching +# /phase2-configs: +# /phase2-configs/cluster-validation-phase2-job-config.yaml +# -> ${TPL_DIR}/cluster-validation-phase2-job-config.yaml +# The script uses `local` / `declare -A`, so we wrap the patched body +# in a function `__phase2_run` and invoke that. The helper library +# (PHASE_NODE_LABEL_SCRIPT) is sourced first so label_phase_passed / +# label_phase_failed / annotate_phase_value are defined. +# +# `kubectl` is the mock from lib/kubectl_mock.sh. Job "completion" is +# simulated by seeding state via kubectl_mock_set_job_condition (Phase 1 +# helper, reused) and kubectl_mock_set_pod_for_job, plus the new +# kubectl_mock_set_pod_log helper for the `kubectl logs` arm added in +# this change. +# +# Timeouts are exercised by SETTING THE TIMEOUT TO 0 -- the poll-wait +# loop checks `elapsed >= timeout` BEFORE the first kubectl get, so +# a 0-second budget short-circuits to TIMEOUT on the first iteration +# without sleeping. Without this, the only way to hit the timeout +# branch would be to actually wait for PHASE2_JOB_WAIT_TIME seconds. + +set -uo pipefail + +TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd -- "${TEST_DIR}/../../.." && pwd) +CONFIGMAP="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-config.yaml" +FIXTURES_DIR="${TEST_DIR}/fixtures/phase2" + +# shellcheck source=./lib/assert.sh +source "${TEST_DIR}/lib/assert.sh" +# shellcheck source=./lib/kubectl_mock.sh +source "${TEST_DIR}/lib/kubectl_mock.sh" +# shellcheck source=./lib/extract_script.sh +source "${TEST_DIR}/lib/extract_script.sh" + +echo "================================================================" +echo " test_phase2.sh" +echo " ConfigMap: ${CONFIGMAP}" +echo " Fixtures: ${FIXTURES_DIR}" +echo "================================================================" + +# --- one-time setup ------------------------------------------------- + +# Per-process tmp dirs: +# TPL_DIR -- holds a placeholder phase2 job template (the real +# template lives in cluster-validation-phase2-job-config +# ConfigMap,; for these tests PHASE2_SCRIPT +# only needs the file to exist + be sed-able, then +# `kubectl apply` is mocked). +# PHASE2_BODY -- the patched, function-wrapped script we source. +PHASE2_DIR=$(mktemp -d -t phase2-tests-XXXXXX) +TPL_DIR="${PHASE2_DIR}/tpl" +PHASE2_BODY="${PHASE2_DIR}/phase2-body.sh" +HELPER_SCRIPT="${PHASE2_DIR}/phase-helpers.sh" +mkdir -p "$TPL_DIR" + +trap 'rm -rf "$PHASE2_DIR"; kubectl_mock_cleanup' EXIT + +# Minimal job template stand-in. The real template lives in the +# cluster-validation-phase2-job-config ConfigMap. PHASE2_SCRIPT pipes a +# sed-rendered copy to `kubectl apply -f -`; since kubectl is mocked, +# the only constraints on contents are: +# * the file must exist (`[[ ! -f "$job_template" ]]` guard) +# * the sed expressions must not produce non-zero -- both are fine on +# any plain-text file with the substitution markers present. +cat >"${TPL_DIR}/cluster-validation-phase2-job-config.yaml" <<'YAML' +apiVersion: batch/v1 +kind: Job +metadata: + name: cluster-validation-phase2-job +spec: + template: + spec: + nodeSelector: + kubernetes.io/hostname: $$NODE + containers: + - name: phase2-rccl + image: $$ROCE_WORKLOAD_IMAGE + resources: + limits: + amd.com/gpu: 8 +YAML + +# Extract PHASE2_SCRIPT and patch the one hardcoded path so the test +# can run as a non-root user without /phase2-configs existing. Also +# pin the timestamp so seeded mock state always matches what the script +# generates -- the production script calls `date +%Y%m%d-%H%M%S` once +# per submit; we replace that with a fixed test value. +RAW_PHASE2=$(extract_configmap_data "$CONFIGMAP" "PHASE2_SCRIPT") +if [[ -z "$RAW_PHASE2" ]]; then + echo "FATAL: PHASE2_SCRIPT extraction produced empty output" >&2 + exit 1 +fi + +PATCHED_PHASE2=$(printf '%s\n' "$RAW_PHASE2" \ + | sed "s|/phase2-configs/cluster-validation-phase2-job-config.yaml|${TPL_DIR}/cluster-validation-phase2-job-config.yaml|g" \ + | sed 's|ts=\$(date +%Y%m%d-%H%M%S)|ts="${PHASE2_TEST_TS:-$(date +%Y%m%d-%H%M%S)}"|') + +# Wrap in a function so `local` and `declare -A` (used heavily inside +# PHASE2_SCRIPT) are valid. The orchestrator sources the +# script body inside run_phase2, which is a function, so this matches +# production wiring. +{ + printf '__phase2_run() {\n' + printf '%s\n' "$PATCHED_PHASE2" + printf '}\n' +} > "$PHASE2_BODY" + +if ! bash -n "$PHASE2_BODY"; then + echo "FATAL: patched PHASE2_SCRIPT has bash syntax errors" >&2 + exit 1 +fi + +# Extract the helper library (label_phase_passed/failed, +# annotate_phase_value) once; sourced before every test so each test +# gets a fresh function definition. +extract_configmap_data "$CONFIGMAP" "PHASE_NODE_LABEL_SCRIPT" \ + > "$HELPER_SCRIPT" +if [[ ! -s "$HELPER_SCRIPT" ]]; then + echo "FATAL: PHASE_NODE_LABEL_SCRIPT extraction produced empty output" >&2 + exit 1 +fi +if ! bash -n "$HELPER_SCRIPT"; then + echo "FATAL: extracted helper script has bash syntax errors" >&2 + exit 1 +fi + +kubectl_mock_init + +# Suffix for failure-reason annotation; mirrors ConfigMap default. +export PHASE_FAILURE_REASON_ANNOTATION_SUFFIX="-failure-reason" + +# shellcheck disable=SC1090 +source "$HELPER_SCRIPT" +# shellcheck disable=SC1090 +source "$PHASE2_BODY" + +# Sanity: required functions are defined. +for fn in label_phase_passed label_phase_failed annotate_phase_value __phase2_run; do + if ! declare -F "$fn" >/dev/null; then + echo "FATAL: required function $fn not defined after sourcing" >&2 + exit 1 + fi +done + +# Per-test reset: wipe the kubectl call log and any seeded state, and +# re-export the baseline env PHASE2_SCRIPT reads. Tests override pieces +# of this (notably SKIP_GPU_MESH_VALIDATION and PHASE2_BW_THRESHOLD) +# before calling __phase2_run. +_reset_phase2_env() { + kubectl_mock_reset + export PHASE2_LABEL_KEY="amd.com/gpu-mesh-validation" + export ROCE_WORKLOAD_IMAGE="docker.io/rocm/roce-workload:test" + # 60s is large enough that the poll loop's first iteration (which + # checks Complete=True / Failed=True immediately, seeded by the + # tests) breaks out before any sleep. Tests that exercise the + # timeout branch override this to 0. + export PHASE2_JOB_WAIT_TIME="60" + export PHASE2_BW_THRESHOLD="200" + # Pin the timestamp PHASE2_SCRIPT puts into job names so seeded + # state always matches what the script looks up. + export PHASE2_TEST_TS="testts0001" + unset SKIP_GPU_MESH_VALIDATION PHASE_NODES +} + +# Suppress the -u trap for tests that intentionally leave optional env +# vars unset (PHASE_NODES, SKIP_GPU_MESH_VALIDATION). +set +u + +# Helper: compute the job name PHASE2_SCRIPT will generate for +# with the pinned PHASE2_TEST_TS. Mirrors PHASE2_SCRIPT exactly: +# cvf-phase2-${node}-${ts} (when short enough) +# cvf-phase2-${sha1(node)|6}-${ts} (when over 63 chars) +_phase2_expected_job_name() { + local node="$1" ts="$2" max_len=63 prefix="cvf-phase2" + local jn="${prefix}-${node}-${ts}" + if [[ "${#jn}" -gt "$max_len" ]]; then + local h + h=$(echo -n "$node" | sha1sum | cut -c1-6) + jn="${prefix}-${h}-${ts}" + fi + printf '%s' "$jn" +} + +# Seed mock state for one job: Complete=True + pod-for-job + canned pod log. +_seed_job_complete() { + local node="$1" ts="$2" pod="$3" log_fixture="$4" + local job + job=$(_phase2_expected_job_name "$node" "$ts") + kubectl_mock_set_job_condition "$job" "Complete" "True" + kubectl_mock_set_pod_for_job "$job" "$pod" + kubectl_mock_set_pod_log "$pod" "${FIXTURES_DIR}/${log_fixture}" +} + +# Seed mock state for one job: Failed=True + pod-for-job + canned pod log. +_seed_job_failed() { + local node="$1" ts="$2" pod="$3" log_fixture="$4" + local job + job=$(_phase2_expected_job_name "$node" "$ts") + kubectl_mock_set_job_condition "$job" "Failed" "True" + kubectl_mock_set_pod_for_job "$job" "$pod" + kubectl_mock_set_pod_log "$pod" "${FIXTURES_DIR}/${log_fixture}" +} + +# Seed mock state for one job: neither Complete nor Failed seeded -> +# kubectl returns empty string for both jsonpath queries -> PHASE2_SCRIPT +# loops until elapsed >= PHASE2_JOB_WAIT_TIME. Tests pair this with +# PHASE2_JOB_WAIT_TIME=0 to force an immediate TIMEOUT classification. +_seed_job_pending() { + : # no state seeding -- absence == empty jsonpath response +} + +ts=$(printf '%s' "testts0001") + +# ------------------------------------------------------------------- +# 1. Empty input list -> no-op, exit 0, no kubectl side effects. +# ------------------------------------------------------------------- + +it "PHASE2_SCRIPT with empty input list is a no-op and returns 0" && { + _reset_phase2_env + run __phase2_run + assert_status 0 + assert_kubectl_no_calls + assert_stdout_contains "no input nodes -- nothing to do" +} + +# ------------------------------------------------------------------- +# 2. SKIP_GPU_MESH_VALIDATION=true -> every input node pass-labeled, +# NO Phase 2 Job submission, no kubectl get/logs/apply work. +# ------------------------------------------------------------------- + +it "SKIP_GPU_MESH_VALIDATION=true pass-labels every input node, no Jobs created" && { + _reset_phase2_env + export SKIP_GPU_MESH_VALIDATION="true" + run __phase2_run node-a node-b node-c + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/gpu-mesh-validation=passed --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/gpu-mesh-validation=passed --overwrite" + assert_kubectl_call \ + "label node node-c amd.com/gpu-mesh-validation=passed --overwrite" + # No `kubectl apply` (Job submission) anywhere. + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP_GPU_MESH_VALIDATION=true must not submit any Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^get job" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP must not poll Jobs: +$(grep -E '^get job' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^logs " "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP must not fetch pod logs: +$(grep -E '^logs ' "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "SKIP_GPU_MESH_VALIDATION=true -- pass-labeling" +} + +it "SKIP_GPU_MESH_VALIDATION accepts case-insensitive value (TRUE)" && { + _reset_phase2_env + export SKIP_GPU_MESH_VALIDATION="TRUE" + run __phase2_run node-x + assert_status 0 + assert_kubectl_call \ + "label node node-x amd.com/gpu-mesh-validation=passed --overwrite" +} + +# ------------------------------------------------------------------- +# 3. Missing required env var (ROCE_WORKLOAD_IMAGE) -> every input +# node labeled =failed with reason=phase2-missing-env:.; no Jobs +# submitted. +# ------------------------------------------------------------------- + +it "missing required env var -> all input nodes labeled failed, no Jobs submitted" && { + _reset_phase2_env + unset ROCE_WORKLOAD_IMAGE + run __phase2_run node-y + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-y amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/gpu-mesh-validation-failure-reason=phase2-missing-env:ROCE_WORKLOAD_IMAGE" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "missing-env path must not submit Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + assert_stderr_contains "required env var(s) unset:" +} + +# ------------------------------------------------------------------- +# 4. Missing job template -> every input node labeled failed with +# reason=job-template-missing; no Jobs submitted. +# ------------------------------------------------------------------- + +it "missing job template -> all input nodes labeled failed, reason=job-template-missing" && { + _reset_phase2_env + # Hide the template the patched script expects. Restore before the + # next test by recreating the file inline. + mv "${TPL_DIR}/cluster-validation-phase2-job-config.yaml" \ + "${TPL_DIR}/cluster-validation-phase2-job-config.yaml.hidden" + run __phase2_run node-a node-b + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-a amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/gpu-mesh-validation-failure-reason=job-template-missing --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-mesh-validation-failure-reason=job-template-missing --overwrite" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "missing-template path must not submit Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + # Restore for the rest of the suite. + mv "${TPL_DIR}/cluster-validation-phase2-job-config.yaml.hidden" \ + "${TPL_DIR}/cluster-validation-phase2-job-config.yaml" +} + +# ------------------------------------------------------------------- +# 5. Pass case: Complete=True + pass log -> =passed label + measured-bw +# annotation parsed from "Avg bus bandwidth: " line. +# ------------------------------------------------------------------- + +it "single node pass: Complete=True + pass log -> =passed, measured-bw annotated" && { + _reset_phase2_env + pod="cvf-pod-node-a-pass" + _seed_job_complete "node-a" "$ts" "$pod" "phase2-pass.log" + run __phase2_run node-a + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/gpu-mesh-validation=passed --overwrite" + # measured-bw annotation parsed from "Avg bus bandwidth: 234.7". + assert_kubectl_call \ + "annotate node node-a amd.com/gpu-mesh-validation-measured-bw=234.7 --overwrite" + # No failed label or failure-reason annotation on the pass path. + if grep -F "node-a amd.com/gpu-mesh-validation=failed" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "pass-path must not write the failed label: +$(grep failed "$KUBECTL_CALLS_FILE")" + fi + if grep -F "annotate node node-a amd.com/gpu-mesh-validation-failure-reason" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "pass-path must not write a failure-reason annotation: +$(grep failure-reason "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "PASS (measured-bw=234.7)" +} + +# ------------------------------------------------------------------- +# 6. bus-bw-below-threshold: Failed=True + bw-below-threshold log +# -> =failed, reason=bus-bw-below-threshold, measured-bw annotated. +# ------------------------------------------------------------------- + +it "Failed + bw-below-threshold marker -> =failed reason=bus-bw-below-threshold + measured-bw" && { + _reset_phase2_env + pod="cvf-pod-node-b-bw" + _seed_job_failed "node-b" "$ts" "$pod" "phase2-bw-below-threshold.log" + run __phase2_run node-b + assert_status 0 + assert_kubectl_call \ + "label node node-b amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-mesh-validation-failure-reason=bus-bw-below-threshold --overwrite" + # Measured BW from log fixture is 173.7. + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-mesh-validation-measured-bw=173.7 --overwrite" + assert_stdout_contains "FAIL reason=bus-bw-below-threshold" +} + +# ------------------------------------------------------------------- +# 7. Threshold-too-high inject: PHASE2_BW_THRESHOLD=9999, but the +# classification is driven by the container log marker (validator +# runs inside the Job container, not in PHASE2_SCRIPT) -- so the +# log we serve is the same bw-below-threshold fixture, and the +# classification is still bus-bw-below-threshold. This test pins +# the contract: PHASE2_SCRIPT never re-runs the validator; it only +# classifies by log markers. +# ------------------------------------------------------------------- + +it "PHASE2_BW_THRESHOLD=9999 inject still classifies via log marker (bus-bw-below-threshold)" && { + _reset_phase2_env + export PHASE2_BW_THRESHOLD="9999" + pod="cvf-pod-node-c-9999" + _seed_job_failed "node-c" "$ts" "$pod" "phase2-bw-below-threshold.log" + run __phase2_run node-c + assert_status 0 + assert_kubectl_call \ + "label node node-c amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-c amd.com/gpu-mesh-validation-failure-reason=bus-bw-below-threshold --overwrite" + # The diagnostic line should echo the inject value. + assert_stdout_contains "threshold=9999" +} + +# ------------------------------------------------------------------- +# 8. rccl-crash: Failed=True + mpirun-exited marker -> =failed, +# reason=rccl-crash. +# ------------------------------------------------------------------- + +it "Failed + mpirun-exited marker -> =failed reason=rccl-crash" && { + _reset_phase2_env + pod="cvf-pod-node-d-crash" + _seed_job_failed "node-d" "$ts" "$pod" "phase2-rccl-crash.log" + run __phase2_run node-d + assert_status 0 + assert_kubectl_call \ + "label node node-d amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-d amd.com/gpu-mesh-validation-failure-reason=rccl-crash --overwrite" + assert_stdout_contains "FAIL reason=rccl-crash" +} + +# ------------------------------------------------------------------- +# 9. Failed=True with NO recognized marker in log -> default rccl-crash. +# Defends the "any non-zero exit is treated as a crash signal unless +# the validator explicitly flagged bw-below-threshold" contract from +# design §6. +# ------------------------------------------------------------------- + +it "Failed + no recognized marker -> default reason=rccl-crash" && { + _reset_phase2_env + pod="cvf-pod-node-e-nomark" + _seed_job_failed "node-e" "$ts" "$pod" "phase2-failed-no-marker.log" + run __phase2_run node-e + assert_status 0 + assert_kubectl_call \ + "label node node-e amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-e amd.com/gpu-mesh-validation-failure-reason=rccl-crash --overwrite" + assert_stdout_contains "default, no marker found" +} + +# ------------------------------------------------------------------- +# 10. Timeout: no Job conditions seeded + PHASE2_JOB_WAIT_TIME=0 +# -> first iteration of the poll loop hits elapsed >= timeout +# immediately -> classified TIMEOUT, reason=timeout, and the +# hung Job is explicitly deleted at cleanup. +# ------------------------------------------------------------------- + +it "no conditions + PHASE2_JOB_WAIT_TIME=0 -> reason=timeout + cleanup delete" && { + _reset_phase2_env + export PHASE2_JOB_WAIT_TIME="0" + _seed_job_pending # no seeded Complete/Failed -> empty jsonpath responses + run __phase2_run node-f + assert_status 0 + assert_kubectl_call \ + "label node node-f amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-f amd.com/gpu-mesh-validation-failure-reason=timeout --overwrite" + # Cleanup: hung job must be deleted. + expected_job=$(_phase2_expected_job_name "node-f" "$ts") + assert_kubectl_call \ + "delete job ${expected_job} --ignore-not-found=true --wait=false" + assert_stdout_contains "TIMEOUT after 0s" + assert_stdout_contains "deleting hung job" +} + +# ------------------------------------------------------------------- +# 11. Mixed pass/fail across multiple nodes: 3 nodes -- one pass, +# one bw-below-threshold, one rccl-crash. Verifies each node is +# labeled independently and the per-node measured-bw annotation +# fires for the two nodes whose logs contain an Avg bus bandwidth +# line. +# ------------------------------------------------------------------- + +it "mixed pass/fail across 3 nodes labels each node independently" && { + _reset_phase2_env + pod_a="cvf-pod-node-a-mix" + pod_b="cvf-pod-node-b-mix" + pod_c="cvf-pod-node-c-mix" + _seed_job_complete "node-a" "$ts" "$pod_a" "phase2-pass.log" + _seed_job_failed "node-b" "$ts" "$pod_b" "phase2-bw-below-threshold.log" + _seed_job_failed "node-c" "$ts" "$pod_c" "phase2-rccl-crash.log" + run __phase2_run node-a node-b node-c + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/gpu-mesh-validation=passed --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/gpu-mesh-validation-measured-bw=234.7 --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-mesh-validation-failure-reason=bus-bw-below-threshold --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/gpu-mesh-validation-measured-bw=173.7 --overwrite" + assert_kubectl_call \ + "label node node-c amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-c amd.com/gpu-mesh-validation-failure-reason=rccl-crash --overwrite" + assert_stdout_contains "passed=1 failed=2" +} + +# ------------------------------------------------------------------- +# 12. Parallel-submit: N input nodes -> exactly N `kubectl apply` +# invocations BEFORE any `kubectl get job` poll. Verifies no +# per-node serialization of submit+wait (mirror of PHASE1_SCRIPT +# contract). +# ------------------------------------------------------------------- + +it "parallel-submit: N input nodes -> N submits, all before any wait poll" && { + _reset_phase2_env + pod_a="cvf-pod-node-a-par" + pod_b="cvf-pod-node-b-par" + pod_c="cvf-pod-node-c-par" + _seed_job_complete "node-a" "$ts" "$pod_a" "phase2-pass.log" + _seed_job_complete "node-b" "$ts" "$pod_b" "phase2-pass.log" + _seed_job_complete "node-c" "$ts" "$pod_c" "phase2-pass.log" + run __phase2_run node-a node-b node-c + assert_status 0 + n_apply=$(grep -cE "^apply( |$)" "$KUBECTL_CALLS_FILE" || true) + assert_equals "3" "$n_apply" + last_apply_line=$(grep -nE "^apply" "$KUBECTL_CALLS_FILE" \ + | tail -1 | cut -d: -f1) + first_getjob_line=$(grep -nE "^get job" "$KUBECTL_CALLS_FILE" \ + | head -1 | cut -d: -f1) + if [[ -z "$first_getjob_line" ]]; then + _assert_fail "expected at least one 'get job' poll call" + elif [[ "$last_apply_line" -ge "$first_getjob_line" ]]; then + _assert_fail "submits must all precede any poll (last apply=${last_apply_line}, first get-job=${first_getjob_line}): +$(cat "$KUBECTL_CALLS_FILE")" + fi +} + +# ------------------------------------------------------------------- +# 13. Job-creation failure: `kubectl apply` returns non-zero for the +# single input node -> node failed with reason=job-creation-failed, +# and NO wait/poll/log work for that job. +# ------------------------------------------------------------------- + +it "kubectl apply failure -> node failed with reason=job-creation-failed" && { + _reset_phase2_env + kubectl_mock_fail_sticky apply 1 + run __phase2_run node-z + assert_status 0 + assert_kubectl_call \ + "label node node-z amd.com/gpu-mesh-validation=failed --overwrite" + assert_kubectl_call \ + "annotate node node-z amd.com/gpu-mesh-validation-failure-reason=job-creation-failed --overwrite" + if grep -E "^get job" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "submit-failed job must not be polled: +$(grep -E '^get job' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^logs " "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "submit-failed job must not fetch pod logs: +$(grep -E '^logs ' "$KUBECTL_CALLS_FILE")" + fi + assert_stderr_contains "kubectl apply failed for job=" +} + +# ------------------------------------------------------------------- +# 14. PHASE_NODES env-var fallback: when positional args are empty +# but PHASE_NODES is exported, the script uses that list. +# ------------------------------------------------------------------- + +it "PHASE_NODES env var is used when no positional args are given" && { + _reset_phase2_env + pod="cvf-pod-env-fallback" + _seed_job_complete "node-env" "$ts" "$pod" "phase2-pass.log" + export PHASE_NODES="node-env" + run __phase2_run # NB: no positional args + assert_status 0 + assert_kubectl_call \ + "label node node-env amd.com/gpu-mesh-validation=passed --overwrite" +} + +# ------------------------------------------------------------------- +# 15. PHASE2_RCCL_ENV_VARS sanity: the ConfigMap value contains no +# IB/fabric-specific tunables (single-node intra-node test). This +# is a static-content check on the ConfigMap, not on PHASE2_SCRIPT +# behavior, but lives here because it's part of the test plan for +# this change (rccl-env-no-ib-vars, TC4 test plan). +# ------------------------------------------------------------------- + +it "PHASE2_RCCL_ENV_VARS contains no IB/fabric tunables (rccl-env-no-ib-vars)" && { + rccl_env_body=$(extract_configmap_data "$CONFIGMAP" "PHASE2_RCCL_ENV_VARS") + if [[ -z "$rccl_env_body" ]]; then + _assert_fail "PHASE2_RCCL_ENV_VARS extraction produced empty output" + fi + # The check skips comment lines so the design-doc note about + # "No NCCL_NET_PLUGIN -- single node, ." doesn't trip the grep. + # Only non-comment env vars are checked against the forbidden + # patterns. + non_comment=$(echo "$rccl_env_body" | grep -v -E "^[[:space:]]*#") + for forbidden in NCCL_NET_PLUGIN NCCL_IB_ IONIC_ NCCL_SOCKET_IFNAME; do + if echo "$non_comment" | grep -q "$forbidden"; then + _assert_fail "PHASE2_RCCL_ENV_VARS must not contain ${forbidden} (single-node, no fabric): +$non_comment" + fi + done +} + +assert_summary diff --git a/example/gpu-validation-cluster/tests/test_phase3.sh b/example/gpu-validation-cluster/tests/test_phase3.sh new file mode 100755 index 000000000..cf7a1efab --- /dev/null +++ b/example/gpu-validation-cluster/tests/test_phase3.sh @@ -0,0 +1,1738 @@ +#!/bin/bash +# Unit tests for Phase 3 (per-node NIC health check). Covers both the +# in-Job PHASE3_CHECK_SCRIPT body and the outer-driver +# PHASE3_SCRIPT. +# +# Scope (post-refactor: PHASE3_CHECK_SCRIPT emits a PHASE3_RESULT marker +# on stdout; PHASE3_SCRIPT parses `kubectl logs job/` and owns all +# label/annotate writes -- the in-Job container no longer calls kubectl): +# * shellcheck on PHASE3_CHECK_SCRIPT (skipped if shellcheck absent). +# * PHASE3_CHECK_SCRIPT unit tests with mocked +# lspci / ip / rdma / ibv_devices / ibv_devinfo: +# - all 4 checks pass -> stdout has PHASE3_RESULT status=passed +# - NIC count mismatch -> status=failed reason includes nic-count +# - PF/VF mix is collapsed to PFs -> 8 PFs + many VFs => count==8 PASS +# - 1 NIC link DOWN -> status=failed reason link-state +# - 1 RDMA link INIT -> status=failed reason rdma-state +# - 1 device empty GID table -> status=failed reason gid-table +# - 1 device ibv_devinfo unresponsive -> status=failed reason ibv-devinfo +# - partial failure -> reason has only the failing class +# - annotation size truncation -> reason/failed_nics tokens <= MAX_BYTES +# - NODE_NAME unset is informational only (no exit-2 behavior) +# * PHASE3_SCRIPT outer-driver tests: +# - empty input list -> no-op, return 0 +# - SKIP_NIC_VALIDATION=true -> pass-label every input node, no Jobs +# - SKIP_NIC_VALIDATION case-insensitive +# - missing required env -> all-fail with reason +# - missing job template -> all-fail with reason +# - kubectl apply failure -> reason=job-creation-failed +# - timeout -> reason=nic-not-allocated + cleanup +# - Job Complete + PHASE3_RESULT status=passed +# -> orchestrator writes label=passed (no annotation) +# - Job Failed + PHASE3_RESULT status=failed reason=. failed_nics=. +# -> orchestrator writes label=failed + failure-reason + failed-nics +# - Job Complete but NO PHASE3_RESULT line in logs +# -> orchestrator writes label=failed reason=no-result-line +# - parallel-submit ordering: every apply precedes the first poll +# - PHASE_NODES env-var fallback +# +# Implementation notes: +# +# The CHECK script is extracted with lib/extract_script.sh and sourced +# directly -- it uses no `local` / `declare -A`, so no function wrapping +# is needed. To mock lspci/ip/rdma/ibv_*, we prepend a per-test shim +# directory to PATH that contains tiny scripts which `cat` the right +# fixture file (selected via env vars exported by each test). +# +# The kubectl shim is the existing lib/kubectl_mock.sh; only the +# orchestrator-side (PHASE3_SCRIPT) calls it now. Per-job log content +# is seeded via `kubectl_mock_set_pod_log "job/" ""`, +# leveraging the mock's first-positional-token = pod-name dispatch +# (job/ is just a token). +# +# PHASE3_SCRIPT uses `local` / `declare -A`, so we wrap the extracted +# body in a function `__phase3_run` -- same shape as test_phase2.sh. +# +# Timeouts are exercised by setting PHASE3_JOB_WAIT_TIME=0 -- the wait +# loop checks `elapsed >= timeout` BEFORE the first kubectl get, so +# a 0-second budget short-circuits to TIMEOUT on the first iteration +# without sleeping. Same trick used by test_phase2.sh. + +set -uo pipefail + +TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd -- "${TEST_DIR}/../../.." && pwd) +CONFIGMAP="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-config.yaml" +FIXTURES_DIR="${TEST_DIR}/fixtures/phase3" + +# shellcheck source=./lib/assert.sh +source "${TEST_DIR}/lib/assert.sh" +# shellcheck source=./lib/kubectl_mock.sh +source "${TEST_DIR}/lib/kubectl_mock.sh" +# shellcheck source=./lib/extract_script.sh +source "${TEST_DIR}/lib/extract_script.sh" + +echo "================================================================" +echo " test_phase3.sh" +echo " ConfigMap: ${CONFIGMAP}" +echo " Fixtures: ${FIXTURES_DIR}" +echo "================================================================" + +# --- one-time setup ------------------------------------------------- + +PHASE3_DIR=$(mktemp -d -t phase3-tests-XXXXXX) +TPL_DIR="${PHASE3_DIR}/tpl" +SHIM_DIR="${PHASE3_DIR}/shims" +CHECK_BODY="${PHASE3_DIR}/phase3-check-body.sh" +PHASE3_BODY="${PHASE3_DIR}/phase3-body.sh" +HELPER_SCRIPT="${PHASE3_DIR}/phase-helpers.sh" +mkdir -p "$TPL_DIR" "$SHIM_DIR" + +trap 'rm -rf "$PHASE3_DIR"; kubectl_mock_cleanup' EXIT + +# --- shim binaries for in-Job tooling ------------------------------- +# +# Each shim reads a per-test env var that names a fixture file under +# FIXTURES_DIR (or an absolute path). Missing/empty env var -> +# the shim emits nothing and exits 0 (matches behavior of the real +# tool on a misconfigured system: no devices found). +# +# Putting these on PATH lets PHASE3_CHECK_SCRIPT call them in any +# sub-shell or process substitution (e.g. `while . done < <(rdma .)`) +# without needing function overrides. + +_make_shim() { + local name="$1" + local body="$2" + cat >"${SHIM_DIR}/${name}" <` +# (per-iface mode). Both forms are used by PHASE3_CHECK_SCRIPT. +_make_shim "ip" ' +mode="list" +target_iface="" +saw_show=0 +for a in "$@"; do + case "$a" in + show) saw_show=1 ;; + -*|link|-br) : ;; + *) target_iface="$a" ;; + esac +done +if [[ "$saw_show" -eq 1 && -n "$target_iface" ]]; then + mode="show-iface" +fi +fixture="${IP_LINK_FIXTURE:-}" +if [[ -z "$fixture" || ! -f "$fixture" ]]; then + exit 0 +fi +case "$mode" in + list) + cat "$fixture" + ;; + show-iface) + # Emit only the line matching target_iface (or nothing if absent). + grep -E "^[[:space:]]*${target_iface}[[:space:]]" "$fixture" || true + ;; +esac +exit 0 +' + +# rdma shim: only the `rdma link show` form is used. +_make_shim "rdma" ' +fixture="${RDMA_LINK_FIXTURE:-}" +if [[ -n "$fixture" && -f "$fixture" ]]; then + cat "$fixture" +fi +exit 0 +' + +# ibv_devices shim: emit IBV_DEVICES_FIXTURE content. +_make_shim "ibv_devices" ' +fixture="${IBV_DEVICES_FIXTURE:-}" +if [[ -n "$fixture" && -f "$fixture" ]]; then + cat "$fixture" +fi +exit 0 +' + +# NOTE: the nicctl shim that previously lived here was removed when +# Check 5 was rewritten to read firmware from +# /sys/class/infiniband//fw_ver. The pre-flight gate no longer +# probes nicctl either (only `ibv_devinfo` remains), so no nicctl +# binary is shimmed at all -- the production script must not call it. + +# ibv_devinfo shim: rc override precedence -> IBV_DEVINFO_RC_ wins +# over IBV_DEVINFO_RC_DEFAULT, which wins over fixture serving (so the +# "tool unresponsive" branch can be exercised even when a fallback +# fixture is configured). Otherwise serves a per-device fixture if +# IBV_DEVINFO_FIXTURE_[_V] is set, else falls back to +# IBV_DEVINFO_FIXTURE_DEFAULT[_V]. The `-v` flag selects the verbose +# fixture (so empty-GID cases can serve a different body for the +# verbose listing). +_make_shim "ibv_devinfo" ' +dev="" +verbose=0 +i=1 +while [[ $i -le $# ]]; do + a="${!i}" + case "$a" in + -d) + j=$((i + 1)) + if [[ $j -le $# ]]; then + dev="${!j}" + fi + ;; + -v) verbose=1 ;; + esac + i=$((i + 1)) +done +# Per-device rc override: takes precedence over fixture serving so the +# "unresponsive driver" path can be exercised against the same default +# fixture set the pass case uses. +if [[ -n "$dev" ]]; then + rc_var="IBV_DEVINFO_RC_${dev}" + if [[ -n "${!rc_var:-}" ]]; then + exit "${!rc_var}" + fi +fi +if [[ -n "${IBV_DEVINFO_RC_DEFAULT:-}" ]]; then + # optional stderr message so the pre-flight + # `ibv_devinfo 2>&1 >/dev/null` capture carries a representative + # error string (e.g. "libibverbs: failed to load driver ionic_rdma"). + if [[ -n "${IBV_DEVINFO_STDERR_DEFAULT:-}" ]]; then + echo "${IBV_DEVINFO_STDERR_DEFAULT}" >&2 + fi + exit "${IBV_DEVINFO_RC_DEFAULT}" +fi +fixture="" +if [[ -n "$dev" ]]; then + var="IBV_DEVINFO_FIXTURE_${dev}" + if [[ "$verbose" -eq 1 ]]; then + var="${var}_V" + fi + fixture="${!var:-}" +fi +if [[ -z "$fixture" ]]; then + if [[ "$verbose" -eq 1 ]]; then + fixture="${IBV_DEVINFO_FIXTURE_DEFAULT_V:-${IBV_DEVINFO_FIXTURE_DEFAULT:-}}" + else + fixture="${IBV_DEVINFO_FIXTURE_DEFAULT:-}" + fi +fi +if [[ -n "$fixture" && -f "$fixture" ]]; then + cat "$fixture" +fi +exit 0 +' + +# Job template stand-in for PHASE3_SCRIPT (mirror of test_phase2.sh). +# Real template lives in cluster-validation-job.yaml (cluster-validation-phase3-job-config +# ConfigMap,). PHASE3_SCRIPT only needs a sed-able file to +# exist -- the actual `kubectl apply` is mocked. +cat >"${TPL_DIR}/cluster-validation-phase3-job-config.yaml" <<'YAML' +apiVersion: batch/v1 +kind: Job +metadata: + name: cluster-validation-phase3-job +spec: + template: + spec: + nodeSelector: + kubernetes.io/hostname: $$NODE + containers: + - name: nic-health + # image is sed-substituted from + # ROCE_WORKLOAD_IMAGE by PHASE3_SCRIPT. + image: $$ROCE_WORKLOAD_IMAGE + resources: + limits: + amd.com/nic: $$EXPECTED_NIC_COUNT +YAML + +# --- extract scripts under test ------------------------------------- + +RAW_CHECK=$(extract_configmap_data "$CONFIGMAP" "PHASE3_CHECK_SCRIPT") +if [[ -z "$RAW_CHECK" ]]; then + echo "FATAL: PHASE3_CHECK_SCRIPT extraction produced empty output" >&2 + exit 1 +fi +printf '%s\n' "$RAW_CHECK" > "$CHECK_BODY" +if ! bash -n "$CHECK_BODY"; then + echo "FATAL: extracted PHASE3_CHECK_SCRIPT has bash syntax errors" >&2 + exit 1 +fi + +RAW_PHASE3=$(extract_configmap_data "$CONFIGMAP" "PHASE3_SCRIPT") +if [[ -z "$RAW_PHASE3" ]]; then + echo "FATAL: PHASE3_SCRIPT extraction produced empty output" >&2 + exit 1 +fi +# Patch the one hardcoded path so the test can run as a non-root user +# without /phase3-configs existing. Also pin the timestamp PHASE3_SCRIPT +# embeds in job names so seeded mock state always matches what the +# script generates. +PATCHED_PHASE3=$(printf '%s\n' "$RAW_PHASE3" \ + | sed "s|/phase3-configs/cluster-validation-phase3-job-config.yaml|${TPL_DIR}/cluster-validation-phase3-job-config.yaml|g" \ + | sed 's|ts=\$(date +%Y%m%d-%H%M%S)|ts="${PHASE3_TEST_TS:-$(date +%Y%m%d-%H%M%S)}"|') + +# Wrap in a function so `local` / `declare -A` (used heavily inside +# PHASE3_SCRIPT) are valid. +{ + printf '__phase3_run() {\n' + printf '%s\n' "$PATCHED_PHASE3" + printf '}\n' +} > "$PHASE3_BODY" + +if ! bash -n "$PHASE3_BODY"; then + echo "FATAL: patched PHASE3_SCRIPT has bash syntax errors" >&2 + exit 1 +fi + +# Extract the helper library (label_phase_passed/failed, +# annotate_phase_value) once; sourced before the outer-driver tests. +extract_configmap_data "$CONFIGMAP" "PHASE_NODE_LABEL_SCRIPT" \ + > "$HELPER_SCRIPT" +if [[ ! -s "$HELPER_SCRIPT" ]]; then + echo "FATAL: PHASE_NODE_LABEL_SCRIPT extraction produced empty output" >&2 + exit 1 +fi +if ! bash -n "$HELPER_SCRIPT"; then + echo "FATAL: extracted helper script has bash syntax errors" >&2 + exit 1 +fi + +kubectl_mock_init + +# Suffix for failure-reason annotation; mirrors ConfigMap default. +export PHASE_FAILURE_REASON_ANNOTATION_SUFFIX="-failure-reason" + +# Check 5 (driver/firmware compat) defaults to enabled in the +# ConfigMap (fail-closed production stance). The legacy test bodies below +# do not yet supply Check 5 fixtures -- per-test ConfigMap-default-enabled +# Check 5 cases land with. Disable Check 5 at the harness level +# so the pre-refactor cases continue to test checks 1-4 in isolation; the +# pre-flight `nicctl --help` / `ibv_devinfo` shims above still cover the +# pre-flight gate. Individual tests that exercise Check 5 are expected to +# locally `export PHASE3_DRIVER_FW_CHECK_ENABLED=true` and provide +# PHASE3_DRIVER_SYSFS_PATH + NICCTL_FW_FIXTURE. +export PHASE3_DRIVER_FW_CHECK_ENABLED="false" + +# Prepend the shim dir AFTER kubectl_mock_init so the mock kubectl +# (which init prepends) still wins for `kubectl`, and our shims win +# for lspci/ip/rdma/ibv_*. +export PATH="${SHIM_DIR}:${PATH}" + +# shellcheck disable=SC1090 +source "$HELPER_SCRIPT" +# shellcheck disable=SC1090 +source "$PHASE3_BODY" + +# Sanity: required functions are defined. +for fn in label_phase_passed label_phase_failed __phase3_run; do + if ! declare -F "$fn" >/dev/null; then + echo "FATAL: required function $fn not defined after sourcing" >&2 + exit 1 + fi +done + +# Suppress the -u trap for tests that intentionally leave optional env +# vars unset (SKIP_NIC_VALIDATION, PHASE_NODES, fixture vars). +set +u + +# ------------------------------------------------------------------- +# PART A: shellcheck on PHASE3_CHECK_SCRIPT +# ------------------------------------------------------------------- +# +# Static analysis. Skip cleanly if shellcheck is not on PATH so the +# suite still runs in minimal CI containers. SC1091 (cannot follow +# sourced file) is irrelevant -- there are no `source` calls in the +# CHECK script. + +it "shellcheck PHASE3_CHECK_SCRIPT (skip if shellcheck not on PATH)" && { + if ! command -v shellcheck >/dev/null 2>&1; then + echo " SKIP: shellcheck not on PATH" + else + run shellcheck --severity=warning "$CHECK_BODY" + assert_status 0 + fi +} + +# ------------------------------------------------------------------- +# PART B: PHASE3_CHECK_SCRIPT in-Job behavior +# ------------------------------------------------------------------- +# +# Each test: +# * resets kubectl mock state +# * exports PHASE3_* config + per-shim fixture pointers +# * runs PHASE3_CHECK_SCRIPT in a sub-shell (so the in-script `exit` +# does not kill the harness) +# * asserts on exit code + kubectl call log + +_reset_check_env() { + kubectl_mock_reset + unset LSPCI_FIXTURE IP_LINK_FIXTURE RDMA_LINK_FIXTURE \ + IBV_DEVICES_FIXTURE IBV_DEVINFO_FIXTURE_DEFAULT \ + IBV_DEVINFO_FIXTURE_DEFAULT_V \ + IBV_DEVINFO_RC_DEFAULT IBV_DEVINFO_STDERR_DEFAULT \ + ROCE_WORKLOAD_IMAGE PHASE3_DRIVER_FW_STRICT \ + PHASE3_IB_SYSFS_DIR PHASE3_NET_SYSFS_DIR + # Clear any per-device fixture / rc overrides from earlier tests. + while IFS= read -r v; do + unset "$v" + done < <(compgen -v | grep -E '^IBV_DEVINFO_(FIXTURE|RC)_rocep' || true) + export NODE_NAME="node-under-test" + export PHASE3_LABEL_KEY="amd.com/nic-health" + export PHASE3_AMD_NIC_PCI_IDS="1dd8:1002" + export PHASE3_EXPECTED_NIC_COUNT="8" + export PHASE3_MIN_GID_COUNT="1" + export PHASE3_ANNOTATION_MAX_BYTES="250" + # Default sysfs roots point at empty tmp dirs so the check doesn't + # accidentally read the host /sys (test runs are typically on hosts + # without ionic interfaces, but we want hermetic behavior). + export PHASE3_IB_SYSFS_DIR="${PHASE3_DIR}/mock_sysfs_ib_empty" + export PHASE3_NET_SYSFS_DIR="${PHASE3_DIR}/mock_sysfs_net_empty" + mkdir -p "$PHASE3_IB_SYSFS_DIR" "$PHASE3_NET_SYSFS_DIR" +} + +# Per-suite shared "drivers" directory: the production sysfs places +# `/sys/class///device/driver` as a symlink whose TARGET's +# basename is the kernel driver name (e.g. `ionic`). The CHECK script +# resolves that basename via `basename $(readlink ...)`. We point our +# fake symlinks at paths inside this directory so the basenames match +# the driver name exactly (e.g. `/ionic`). The target +# files do not need to exist -- bash's `readlink` (no -f) returns the +# raw target path so `basename` works on a dangling link. +PHASE3_FAKE_DRIVERS_DIR="${PHASE3_DIR}/fake-drivers" +mkdir -p "$PHASE3_FAKE_DRIVERS_DIR" + +# _seed_sysfs_net +# Populate a fake /sys/class/net// tree under : +# //operstate (file with operstate text) +# //device/driver -> / +_seed_sysfs_net() { + local root="$1" iface="$2" drv="$3" state="$4" + local d="${root}/${iface}" + mkdir -p "${d}/device" + printf '%s\n' "$state" > "${d}/operstate" + ln -sfn "${PHASE3_FAKE_DRIVERS_DIR}/${drv}" "${d}/device/driver" +} + +# _seed_sysfs_ib +# Populate a fake /sys/class/infiniband// tree: +# //fw_ver (file with firmware string) +# //device/driver -> / +# Pass an empty to skip the fw_ver file entirely (simulates +# unreadable / missing fw_ver, which the script tolerates by +# skipping that device). +_seed_sysfs_ib() { + local root="$1" dev="$2" drv="$3" fw="$4" + local d="${root}/${dev}" + mkdir -p "${d}/device" + if [[ -n "$fw" ]]; then + printf '%s\n' "$fw" > "${d}/fw_ver" + fi + ln -sfn "${PHASE3_FAKE_DRIVERS_DIR}/${drv}" "${d}/device/driver" +} + +# _new_sysfs_net_root / _new_sysfs_ib_root +# Allocate a fresh per-test mock sysfs root + export PHASE3_*_SYSFS_DIR +# to point at it. Stashes the path in a global variable so the test +# body can also reference it via $NEW_SYSFS_NET_ROOT / $NEW_SYSFS_IB_ROOT. +# Do NOT call via command substitution: command-substitution subshells +# would discard the export. The functions print the path on stdout so +# tests written as `var=$(_new_sysfs_X_root)` still see the path string, +# but the export only persists when the function is called in the +# current shell (e.g. just `_new_sysfs_X_root` and then read +# $NEW_SYSFS_X_ROOT or $PHASE3_X_SYSFS_DIR). +NEW_SYSFS_NET_ROOT="" +NEW_SYSFS_IB_ROOT="" +_new_sysfs_net_root() { + local p + p=$(mktemp -d -p "$PHASE3_DIR" mock_sysfs_net-XXXXXX) + export PHASE3_NET_SYSFS_DIR="$p" + NEW_SYSFS_NET_ROOT="$p" + printf '%s' "$p" +} +_new_sysfs_ib_root() { + local p + p=$(mktemp -d -p "$PHASE3_DIR" mock_sysfs_ib-XXXXXX) + export PHASE3_IB_SYSFS_DIR="$p" + NEW_SYSFS_IB_ROOT="$p" + printf '%s' "$p" +} + +# Run the CHECK script in a sub-shell. `bash` (not `source`) so the +# script's own `exit` only ends the sub-shell. +_run_check() { + bash "$CHECK_BODY" +} + +# ------------------------------------------------------------------- +# B1. All 4 checks pass -> =passed, exit 0, no failure annotation. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT all checks pass -> stdout PHASE3_RESULT status=passed, exit 0" && { + _reset_check_env + export LSPCI_FIXTURE="${FIXTURES_DIR}/lspci-pass.txt" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-pass.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + run _run_check + assert_status 0 + assert_stdout_contains "PHASE3_RESULT status=passed" + if grep -F "PHASE3_RESULT status=failed" <<<"$LAST_STDOUT" >/dev/null; then + _assert_fail "pass-path must not emit a failed marker: +${LAST_STDOUT}" + fi + # New contract: PHASE3_CHECK_SCRIPT never invokes kubectl. + assert_kubectl_no_calls + assert_stdout_contains "PHASE3_CHECK_SCRIPT done: PASS" +} + +# ------------------------------------------------------------------- +# B2. NIC count mismatch (lspci returns 7, expected 8). +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT nic-count mismatch -> PHASE3_RESULT status=failed reason nic-count" && { + _reset_check_env + export LSPCI_FIXTURE="${FIXTURES_DIR}/lspci-count-mismatch.txt" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-pass.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + run _run_check + assert_status 1 + assert_stdout_contains "PHASE3_RESULT status=failed" + assert_stdout_contains "reason=nic-count:expected=8,actual=7" + assert_stdout_contains "CHECK 1 FAIL" + assert_kubectl_no_calls +} + +# ------------------------------------------------------------------- +# B2b. PF-only count filter: a host that exposes 8 PFs (.0) plus many +# VFs/SR-IOV/sub-functions (.1+) must collapse to nic_count=8 so the +# check passes. Catches the regression where raw `lspci -d :` +# wc -l would have returned 48 against an 8-NIC node with 5 VFs each. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT 8 PFs + 40 VFs -> nic_count=8 PASS" && { + _reset_check_env + # Build a fixture with 8 PFs (function .0) interleaved with VFs (.1.5). + # PFs carry device ID 1002 (the allowlisted PF); VFs carry 1003 so the + # `[1dd8:1002]` filter excludes them by device ID. + pf_vf_fixture="${PHASE3_DIR}/lspci-pfs-and-vfs.txt" + : >"$pf_vf_fixture" + for bus in 05 06 07 08 09 0a 0b 0c; do + printf '0000:%s:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03)\n' \ + "$bus" >>"$pf_vf_fixture" + for fn in 1 2 3 4 5; do + printf '0000:%s:00.%d Ethernet controller [0200]: Pensando Systems DSC Virtual Function [1dd8:1003] (rev 03)\n' \ + "$bus" "$fn" >>"$pf_vf_fixture" + done + done + export LSPCI_FIXTURE="$pf_vf_fixture" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-pass.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + run _run_check + assert_status 0 + assert_stdout_contains "CHECK 1 PASS: nic_count=8" + assert_stdout_contains "PHASE3_RESULT status=passed" +} + +# ------------------------------------------------------------------- +# B2c. Real Pensando layout: each card exposes 6 PCI functions of the +# same vendor -- 2 PCI bridges (Salina Upstream + Virtual Downstream), +# 3 Processing accelerators (TAWK IPC, Register/Memory, DSC PDS Core), +# and 1 Ethernet controller. Raw lspci returns 48 lines on an 8-card +# node; only the 8 Ethernet controllers should be counted. Catches the +# regression that produced `actual=24` on the live smc300x-ccs node +# when the previous /\.0$/ filter admitted bridges and accelerators. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT real Pensando 6-functions-per-card layout -> nic_count=8 PASS" && { + _reset_check_env + # Each Pensando card exposes 6 PCI functions sharing the 1dd8 vendor + # but distinct device IDs: 0008 + 1001 (PCI bridges), 1012 + 100f + + # 100c (Processing accelerators), and 1002 (Ethernet controller PF). + # The `[1dd8:1002]` allowlist must keep only the 8 Ethernet PFs. + real_pensando_fixture="${PHASE3_DIR}/lspci-real-pensando.txt" + : >"$real_pensando_fixture" + for bus in 08 25 45 68 88 a5 c5 e8; do + bridge_a=$(printf '%02x' $((16#${bus} - 2))) + bridge_b=$(printf '%02x' $((16#${bus} - 1))) + printf '0000:%s:00.0 PCI bridge [0604]: AMD Pensando Systems DSC3 Salina Upstream Port [1dd8:0008]\n' "$bridge_a" >>"$real_pensando_fixture" + printf '0000:%s:00.0 PCI bridge [0604]: AMD Pensando Systems DSC Virtual Downstream Port [1dd8:1001]\n' "$bridge_b" >>"$real_pensando_fixture" + printf '0000:%s:00.0 Processing accelerators [1200]: AMD Pensando Systems TAWK IPC Device [1dd8:1012]\n' "$bus" >>"$real_pensando_fixture" + printf '0000:%s:00.1 Processing accelerators [1200]: AMD Pensando Systems Register/Memory Resource Device [1dd8:100f]\n' "$bus" >>"$real_pensando_fixture" + printf '0000:%s:00.2 Processing accelerators [1200]: AMD Pensando Systems DSC PDS Core Management [1dd8:100c]\n' "$bus" >>"$real_pensando_fixture" + printf '0000:%s:00.3 Ethernet controller [0200]: AMD Pensando Systems DSC Ethernet Controller [1dd8:1002]\n' "$bus" >>"$real_pensando_fixture" + done + export LSPCI_FIXTURE="$real_pensando_fixture" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-pass.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + run _run_check + assert_status 0 + assert_stdout_contains "CHECK 1 PASS: nic_count=8" + assert_stdout_contains "PHASE3_RESULT status=passed" +} + +# ------------------------------------------------------------------- +# B2d. Multi device-ID allowlist: PHASE3_AMD_NIC_PCI_IDS accepts a +# comma-separated list (e.g. "1dd8:1002,1dd8:1003") so a fleet with +# a mixed PF generation -- 4 cards on 1002 + 4 cards on 1003 -- counts +# to 8 PFs. Catches the regression where the list was not regex-joined +# and only the first vendor:device pair was honoured. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT multi-device PCI ID list -> nic_count counts both IDs" && { + _reset_check_env + export PHASE3_AMD_NIC_PCI_IDS="1dd8:1002,1dd8:1003" + mixed_fixture="${PHASE3_DIR}/lspci-mixed-device-ids.txt" + : >"$mixed_fixture" + for bus in 05 06 07 08; do + printf '0000:%s:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1002] (rev 03)\n' \ + "$bus" >>"$mixed_fixture" + done + for bus in 09 0a 0b 0c; do + printf '0000:%s:00.0 Ethernet controller [0200]: Pensando Systems DSC Ethernet Controller [1dd8:1003] (rev 04)\n' \ + "$bus" >>"$mixed_fixture" + done + # Add a 1dd8:1004 line that is *not* in the allowlist -- must NOT + # be counted. + printf '0000:0d:00.0 Ethernet controller [0200]: Pensando Systems DSC Future Function [1dd8:1004]\n' \ + >>"$mixed_fixture" + export LSPCI_FIXTURE="$mixed_fixture" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-pass.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + run _run_check + assert_status 0 + assert_stdout_contains "CHECK 1 PASS: nic_count=8" + assert_stdout_contains "PHASE3_RESULT status=passed" +} + +# ------------------------------------------------------------------- +# B3. One ip link DOWN -> =failed, failed-nics carries that NIC, +# reason includes link-state:=DOWN. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT one NIC link DOWN -> PHASE3_RESULT status=failed reason link-state, failed_nics set" && { + _reset_check_env + export LSPCI_FIXTURE="${FIXTURES_DIR}/lspci-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-pass.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + # Lay down 8 ionic netdevs in mock sysfs; one (enP7p1s0f0) is down. + _new_sysfs_net_root >/dev/null + net_root="$PHASE3_NET_SYSFS_DIR" + _seed_sysfs_net "$net_root" "enP5p1s0f0" "ionic" "up" + _seed_sysfs_net "$net_root" "enP6p1s0f0" "ionic" "up" + _seed_sysfs_net "$net_root" "enP7p1s0f0" "ionic" "down" + _seed_sysfs_net "$net_root" "enP8p1s0f0" "ionic" "up" + _seed_sysfs_net "$net_root" "enP9p1s0f0" "ionic" "up" + _seed_sysfs_net "$net_root" "enP10p1s0f0" "ionic" "up" + _seed_sysfs_net "$net_root" "enP11p1s0f0" "ionic" "up" + _seed_sysfs_net "$net_root" "enP12p1s0f0" "ionic" "up" + # A non-ionic interface that MUST be ignored even though its + # operstate would fail. + _seed_sysfs_net "$net_root" "eth0" "e1000" "down" + run _run_check + assert_status 1 + assert_stdout_contains "PHASE3_RESULT status=failed" + assert_stdout_contains "link-state:enP7p1s0f0=down" + assert_stdout_contains "failed_nics=enP7p1s0f0" + assert_stdout_contains "CHECK 2 FAIL" + # eth0 is non-ionic -> must not appear in failed_nics. + if grep -F "eth0" <<<"$LAST_STDOUT" >/dev/null; then + _assert_fail "non-ionic eth0 must not be enumerated by Check 2: +${LAST_STDOUT}" + fi + assert_kubectl_no_calls +} + +# ------------------------------------------------------------------- +# B4. One rdma link in INIT state -> =failed, reason rdma-state. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT one rdma link INIT -> PHASE3_RESULT status=failed reason rdma-state" && { + _reset_check_env + export LSPCI_FIXTURE="${FIXTURES_DIR}/lspci-pass.txt" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-one-init.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + run _run_check + assert_status 1 + assert_stdout_contains "PHASE3_RESULT status=failed" + assert_stdout_contains "rdma-state:rocep7s0/1=INIT" + assert_stdout_contains "CHECK 3 FAIL" + assert_kubectl_no_calls +} + +# ------------------------------------------------------------------- +# B5. Empty GID table on one device (verbose listing has 0 `GID[` lines). +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT empty GID table on one device -> PHASE3_RESULT status=failed reason gid-table" && { + _reset_check_env + export LSPCI_FIXTURE="${FIXTURES_DIR}/lspci-pass.txt" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-pass.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + # Default is pass; override rocep7s0's verbose listing to the + # empty-GID body. Non-verbose form is still served (responds) so + # only Check 4's GID-count branch trips, not the unresponsive + # branch. + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_rocep7s0_V="${FIXTURES_DIR}/ibv-devinfo-empty-gid.txt" + run _run_check + assert_status 1 + assert_stdout_contains "PHASE3_RESULT status=failed" + assert_stdout_contains "gid-table:rocep7s0=0" + assert_stdout_contains "failed_nics=rocep7s0" + assert_stdout_contains "CHECK 4 FAIL" + assert_kubectl_no_calls +} + +# ------------------------------------------------------------------- +# B6. Tools missing: ibv_devinfo returns non-zero for one device +# (simulates the image regression case where the device file is +# present but the driver tool errors out). +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT ibv_devinfo unresponsive on one device -> PHASE3_RESULT status=failed reason ibv-devinfo" && { + _reset_check_env + export LSPCI_FIXTURE="${FIXTURES_DIR}/lspci-pass.txt" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-pass.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + # Non-zero exit for rocep6s0 only -- the non-verbose form is what + # the script tests for responsiveness, so the rc override must + # apply to that call (no -v fixture is checked first per shim). + # We force rc=1 for rocep6s0 across all invocations by clearing + # its fixture overrides and setting the rc. + unset IBV_DEVINFO_FIXTURE_rocep6s0 IBV_DEVINFO_FIXTURE_rocep6s0_V + export IBV_DEVINFO_RC_rocep6s0="1" + run _run_check + assert_status 1 + assert_stdout_contains "PHASE3_RESULT status=failed" + assert_stdout_contains "ibv-devinfo:rocep6s0=unresponsive" + assert_stdout_contains "failed_nics=rocep6s0" + assert_kubectl_no_calls +} + +# ------------------------------------------------------------------- +# B7. Partial failure: counts and links OK, but rdma INIT trips Check 3 +# while Checks 1/2/4 stay green. The PHASE3_RESULT marker must +# carry only Check 3's reason -- no spurious entries from passing +# checks. (Aggregate is still failed: any single check failing is +# a node-level fail.) +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT partial failure (rdma only) -> PHASE3_RESULT reason has only rdma-state, no others" && { + _reset_check_env + export LSPCI_FIXTURE="${FIXTURES_DIR}/lspci-pass.txt" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-one-init.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + run _run_check + assert_status 1 + assert_kubectl_no_calls + # The PHASE3_RESULT line must include rdma-state but no other class. + result_line=$(grep -E '^PHASE3_RESULT status=failed' <<<"$LAST_STDOUT" \ + | tail -n 1 || true) + if [[ -z "$result_line" ]]; then + _assert_fail "expected a PHASE3_RESULT status=failed line; stdout: +${LAST_STDOUT}" + fi + for forbidden in "nic-count:" "link-state:" "ibv-devinfo:" "gid-table:"; do + if grep -qF -- "$forbidden" <<<"$result_line"; then + _assert_fail "partial-failure PHASE3_RESULT must not include '${forbidden}', got: ${result_line}" + fi + done + if ! grep -qF "rdma-state:" <<<"$result_line"; then + _assert_fail "partial-failure must include 'rdma-state:', got: ${result_line}" + fi +} + +# ------------------------------------------------------------------- +# B8. Marker truncation: many failures -> reason= and failed_nics= +# values on the PHASE3_RESULT line must NOT exceed +# PHASE3_ANNOTATION_MAX_BYTES (the orchestrator forwards these +# verbatim to the annotation, which has a 256-byte k8s ceiling). +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT large failure list truncates PHASE3_RESULT tokens to MAX_BYTES" && { + _reset_check_env + # Force a worst case: every check fails on many synthetic devices. + # Build a giant rdma fixture so the joined reason string would + # easily exceed 250 bytes. + big_rdma="${PHASE3_DIR}/rdma-link-many-init.txt" + : >"$big_rdma" + for i in $(seq 1 60); do + printf 'link rocep%ds0/1 state INIT physical_state LINK_UP netdev enP%dp1s0f0\n' "$i" "$i" \ + >>"$big_rdma" + done + export LSPCI_FIXTURE="${FIXTURES_DIR}/lspci-pass.txt" + export IP_LINK_FIXTURE="${FIXTURES_DIR}/ip-link-pass.txt" + export RDMA_LINK_FIXTURE="$big_rdma" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export PHASE3_ANNOTATION_MAX_BYTES="250" + run _run_check + assert_status 1 + assert_kubectl_no_calls + # Extract the PHASE3_RESULT line and inspect the value lengths. + result_line=$(grep -E '^PHASE3_RESULT status=failed' <<<"$LAST_STDOUT" \ + | tail -n 1 || true) + if [[ -z "$result_line" ]]; then + _assert_fail "expected a PHASE3_RESULT status=failed line; stdout: +${LAST_STDOUT}" + fi + reason_v=$(sed -nE 's/.*reason=([^ ]+).*/\1/p' <<<"$result_line") + nics_v=$(sed -nE 's/.*failed_nics=([^ ]+).*/\1/p' <<<"$result_line") + if [[ -z "$reason_v" ]]; then + _assert_fail "PHASE3_RESULT line missing reason=...: ${result_line}" + fi + if [[ "${#reason_v}" -gt 250 ]]; then + _assert_fail "PHASE3_RESULT reason length ${#reason_v} > 250" + fi + if [[ -n "$nics_v" && "${#nics_v}" -gt 250 ]]; then + _assert_fail "PHASE3_RESULT failed_nics length ${#nics_v} > 250" + fi +} + +# ------------------------------------------------------------------- +# PART C: PHASE3_SCRIPT outer-driver behavior +# ------------------------------------------------------------------- + +_reset_phase3_env() { + kubectl_mock_reset + export PHASE3_LABEL_KEY="amd.com/nic-health" + export PHASE3_EXPECTED_NIC_COUNT="8" + # PHASE3_SCRIPT sed-substitutes $$ROCE_WORKLOAD_IMAGE. + # Default to a recognizable test tag so tests that exercise the + # submit path render a valid (placeholder-free) Job template. + export ROCE_WORKLOAD_IMAGE="docker.io/rocm/roce-workload:test-tag" + # 60s is large enough that the poll loop's first iteration (which + # checks Complete=True / Failed=True immediately, seeded by tests) + # breaks out before any sleep. + export PHASE3_JOB_WAIT_TIME="60" + # Pin the timestamp PHASE3_SCRIPT puts into job names so seeded + # state always matches what the script looks up. + export PHASE3_TEST_TS="testts0001" + unset SKIP_NIC_VALIDATION PHASE_NODES +} + +# Helper: compute the job name PHASE3_SCRIPT will generate for +# with the pinned PHASE3_TEST_TS. Mirrors PHASE3_SCRIPT exactly: +# cvf-phase3-${node}-${ts} (when short enough) +# cvf-phase3-${sha1(node)|6}-${ts} (when over 63 chars) +_phase3_expected_job_name() { + local node="$1" ts="$2" max_len=63 prefix="cvf-phase3" + local jn="${prefix}-${node}-${ts}" + if [[ "${#jn}" -gt "$max_len" ]]; then + local h + h=$(echo -n "$node" | sha1sum | cut -c1-6) + jn="${prefix}-${h}-${ts}" + fi + printf '%s' "$jn" +} + +# Seed mock state for one job: Complete=True AND a PHASE3_RESULT +# status=passed line in the job's pod log. PHASE3_SCRIPT calls +# `kubectl logs job/ --tail=20` and greps for PHASE3_RESULT to +# decide pass/fail labeling, so both bits must be seeded for the +# orchestrator to take the pass branch. +_seed_job_complete() { + local node="$1" ts="$2" + local job + job=$(_phase3_expected_job_name "$node" "$ts") + kubectl_mock_set_job_condition "$job" "Complete" "True" + kubectl_mock_set_pod_log "job/${job}" "PHASE3_RESULT status=passed" +} + +# Seed mock state for one job: Failed=True AND a PHASE3_RESULT +# status=failed line in the job's pod log (with reason + failed_nics +# tokens the orchestrator forwards to the failure-reason / failed-nics +# annotations). Optional 3rd/4th args customize the marker content. +_seed_job_failed() { + local node="$1" ts="$2" + local reason="${3:-link-state:enP7p1s0f0=DOWN}" + local failed_nics="${4:-enP7p1s0f0}" + local job + job=$(_phase3_expected_job_name "$node" "$ts") + kubectl_mock_set_job_condition "$job" "Failed" "True" + kubectl_mock_set_pod_log "job/${job}" \ + "PHASE3_RESULT status=failed reason=${reason} failed_nics=${failed_nics}" +} + +ts=$(printf '%s' "testts0001") + +# ------------------------------------------------------------------- +# C1. Empty input list -> no-op, exit 0, no kubectl side effects. +# ------------------------------------------------------------------- + +it "PHASE3_SCRIPT with empty input list is a no-op and returns 0" && { + _reset_phase3_env + run __phase3_run + assert_status 0 + assert_kubectl_no_calls + assert_stdout_contains "no input nodes -- nothing to do" +} + +# ------------------------------------------------------------------- +# C2. SKIP_NIC_VALIDATION=true -> every input node pass-labeled, +# NO Phase 3 Job submission, no kubectl get/apply work. +# ------------------------------------------------------------------- + +it "SKIP_NIC_VALIDATION=true pass-labels every input node, no Jobs created" && { + _reset_phase3_env + export SKIP_NIC_VALIDATION="true" + run __phase3_run node-a node-b node-c + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/nic-health=passed --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/nic-health=passed --overwrite" + assert_kubectl_call \ + "label node node-c amd.com/nic-health=passed --overwrite" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP_NIC_VALIDATION=true must not submit any Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^get job" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP must not poll Jobs: +$(grep -E '^get job' "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "SKIP_NIC_VALIDATION=true -- pass-labeling" +} + +it "SKIP_NIC_VALIDATION accepts case-insensitive value (TRUE)" && { + _reset_phase3_env + export SKIP_NIC_VALIDATION="TRUE" + run __phase3_run node-x + assert_status 0 + assert_kubectl_call \ + "label node node-x amd.com/nic-health=passed --overwrite" +} + +# ------------------------------------------------------------------- +# C3. Missing required env var (PHASE3_EXPECTED_NIC_COUNT) -> every +# input node labeled =failed with reason=phase3-missing-env:.; +# no Jobs submitted. +# ------------------------------------------------------------------- + +it "missing required env -> all input nodes labeled failed, no Jobs submitted" && { + _reset_phase3_env + unset PHASE3_EXPECTED_NIC_COUNT + run __phase3_run node-y + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-y amd.com/nic-health=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/nic-health-failure-reason=phase3-missing-env:PHASE3_EXPECTED_NIC_COUNT" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "missing-env path must not submit Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + assert_stderr_contains "required env var(s) unset:" +} + +# ------------------------------------------------------------------- +# C4. Missing job template -> every input node labeled =failed with +# reason=job-template-missing; no Jobs submitted. +# ------------------------------------------------------------------- + +it "missing job template -> all input nodes labeled failed, reason=job-template-missing" && { + _reset_phase3_env + mv "${TPL_DIR}/cluster-validation-phase3-job-config.yaml" \ + "${TPL_DIR}/cluster-validation-phase3-job-config.yaml.hidden" + run __phase3_run node-a node-b + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-a amd.com/nic-health=failed --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/nic-health-failure-reason=job-template-missing --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/nic-health=failed --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/nic-health-failure-reason=job-template-missing --overwrite" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "missing-template path must not submit Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + # Restore for the rest of the suite. + mv "${TPL_DIR}/cluster-validation-phase3-job-config.yaml.hidden" \ + "${TPL_DIR}/cluster-validation-phase3-job-config.yaml" +} + +# ------------------------------------------------------------------- +# C5. kubectl apply failure -> node failed with reason=job-creation-failed. +# PHASE3_SCRIPT must NOT poll for that job since it never landed. +# ------------------------------------------------------------------- + +it "kubectl apply failure -> node failed with reason=job-creation-failed" && { + _reset_phase3_env + kubectl_mock_fail_sticky apply 1 + run __phase3_run node-z + assert_status 0 + assert_kubectl_call \ + "label node node-z amd.com/nic-health=failed --overwrite" + assert_kubectl_call \ + "annotate node node-z amd.com/nic-health-failure-reason=job-creation-failed --overwrite" + if grep -E "^get job" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "submit-failed job must not be polled: +$(grep -E '^get job' "$KUBECTL_CALLS_FILE")" + fi + assert_stderr_contains "kubectl apply failed for job=" +} + +# ------------------------------------------------------------------- +# C6. Timeout: no Job conditions seeded + PHASE3_JOB_WAIT_TIME=0 +# -> first iteration of the poll loop hits elapsed >= timeout +# immediately -> classified TIMEOUT, reason=nic-not-allocated, +# and the hung Job is explicitly deleted at cleanup. Mirrors the +# PHASE2_SCRIPT TC9/TC10 pattern. +# ------------------------------------------------------------------- + +it "no conditions + PHASE3_JOB_WAIT_TIME=0 -> reason=nic-not-allocated + cleanup delete" && { + _reset_phase3_env + export PHASE3_JOB_WAIT_TIME="0" + run __phase3_run node-f + assert_status 0 + assert_kubectl_call \ + "label node node-f amd.com/nic-health=failed --overwrite" + assert_kubectl_call \ + "annotate node node-f amd.com/nic-health-failure-reason=nic-not-allocated --overwrite" + expected_job=$(_phase3_expected_job_name "node-f" "$ts") + assert_kubectl_call \ + "delete job ${expected_job} --ignore-not-found=true --wait=false" + assert_stdout_contains "TIMEOUT after 0s" + assert_stdout_contains "deleting hung job" +} + +# ------------------------------------------------------------------- +# C7. Complete=True with PHASE3_RESULT status=passed in pod logs -> +# orchestrator parses the log marker and writes the =passed label. +# No failure-reason annotation is written. +# ------------------------------------------------------------------- + +it "Job Complete=True + passed marker -> orchestrator labels node passed, no annotation" && { + _reset_phase3_env + _seed_job_complete "node-pass" "$ts" + run __phase3_run node-pass + assert_status 0 + assert_kubectl_call \ + "label node node-pass amd.com/nic-health=passed --overwrite" + if grep -F "annotate node node-pass amd.com/nic-health-failure-reason" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "passed-marker path must not write failure-reason: +$(grep 'annotate node node-pass' "$KUBECTL_CALLS_FILE")" + fi + if grep -F "annotate node node-pass amd.com/nic-health-failed-nics" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "passed-marker path must not write failed-nics: +$(grep 'annotate node node-pass' "$KUBECTL_CALLS_FILE")" + fi +} + +# ------------------------------------------------------------------- +# C8. Failed=True with PHASE3_RESULT status=failed in pod logs -> +# orchestrator parses reason= and failed_nics= tokens, writes the +# =failed label + failure-reason annotation + failed-nics annotation. +# ------------------------------------------------------------------- + +it "Job Failed=True + failed marker -> orchestrator labels failed + writes reason + failed-nics" && { + _reset_phase3_env + _seed_job_failed "node-jobfail" "$ts" \ + "rdma-state:rocep7s0/1=INIT" "rocep7s0" + run __phase3_run node-jobfail + assert_status 0 + assert_kubectl_call \ + "label node node-jobfail amd.com/nic-health=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/nic-health-failure-reason=rdma-state:rocep7s0/1=INIT" + assert_kubectl_call_contains \ + "amd.com/nic-health-failed-nics=rocep7s0" +} + +# ------------------------------------------------------------------- +# C8b. Complete=True but NO PHASE3_RESULT line in pod logs (e.g. the +# container exited 0 before the marker was emitted, or logs were +# truncated) -> orchestrator labels =failed with reason=no-result-line +# so the missing-signal failure is visible at the cluster level. +# ------------------------------------------------------------------- + +it "Job Complete=True but missing PHASE3_RESULT marker -> labeled failed reason=no-result-line" && { + _reset_phase3_env + # Seed only the Complete condition; do NOT seed a log marker. + expected_job=$(_phase3_expected_job_name "node-nomarker" "$ts") + kubectl_mock_set_job_condition "$expected_job" "Complete" "True" + run __phase3_run node-nomarker + assert_status 0 + assert_kubectl_call \ + "label node node-nomarker amd.com/nic-health=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/nic-health-failure-reason=no-result-line" +} + +# ------------------------------------------------------------------- +# C9. Parallel-submit ordering: N input nodes -> exactly N `kubectl apply` +# invocations BEFORE any `kubectl get job` poll. Mirrors PHASE1 / +# PHASE2 contract. +# ------------------------------------------------------------------- + +it "parallel-submit: N input nodes -> N submits, all before any wait poll" && { + _reset_phase3_env + _seed_job_complete "node-a" "$ts" + _seed_job_complete "node-b" "$ts" + _seed_job_complete "node-c" "$ts" + run __phase3_run node-a node-b node-c + assert_status 0 + n_apply=$(grep -cE "^apply( |$)" "$KUBECTL_CALLS_FILE" || true) + assert_equals "3" "$n_apply" + last_apply_line=$(grep -nE "^apply" "$KUBECTL_CALLS_FILE" \ + | tail -1 | cut -d: -f1) + first_getjob_line=$(grep -nE "^get job" "$KUBECTL_CALLS_FILE" \ + | head -1 | cut -d: -f1) + if [[ -z "$first_getjob_line" ]]; then + _assert_fail "expected at least one 'get job' poll call" + elif [[ "$last_apply_line" -ge "$first_getjob_line" ]]; then + _assert_fail "submits must all precede any poll (last apply=${last_apply_line}, first get-job=${first_getjob_line}): +$(cat "$KUBECTL_CALLS_FILE")" + fi +} + +# ------------------------------------------------------------------- +# C10. PHASE_NODES env-var fallback: when positional args are empty +# but PHASE_NODES is exported, the script uses that list. +# ------------------------------------------------------------------- + +it "PHASE_NODES env var is used when no positional args are given" && { + _reset_phase3_env + _seed_job_complete "node-env" "$ts" + export PHASE_NODES="node-env" + run __phase3_run # NB: no positional args + assert_status 0 + # Job was submitted for that node. + expected_job=$(_phase3_expected_job_name "node-env" "$ts") + assert_kubectl_call_contains "$expected_job" + # And the orchestrator parsed the passed marker and wrote the label. + assert_kubectl_call \ + "label node node-env amd.com/nic-health=passed --overwrite" +} + +# ------------------------------------------------------------------- +# PART D: PHASE3_SCRIPT sed pipeline rendering +# +# These tests do NOT exercise __phase3_run end-to-end (the kubectl +# mock's `apply` arm drains stdin and discards it, so rendered YAML +# is not observable through the mock). Instead, they reach into +# the rendering helper logic by sed-substituting the real +# cluster-validation-job.yaml-embedded Phase 3 template the same way +# PHASE3_SCRIPT does (sed pipeline pinned to: $$NODE, metadata.name +# rename, $$EXPECTED_NIC_COUNT, $$ROCE_WORKLOAD_IMAGE). This is the +# most direct way to gate the new substitution + the new host-sys +# volume/mount. +# +# The fixture template is extracted live from the source manifest +# (configs/cluster-validation-job.yaml) so a regression in either +# the template (placeholder removed) or the sed pipeline (key +# dropped) lights up here. +# ------------------------------------------------------------------- + +# Extract the embedded Phase 3 Job template from the real source +# manifest (configs/cluster-validation-job.yaml) so the tests below +# render the actual shipped template, not the test stand-in +# fixture written above at TPL_DIR. +PHASE3_JOB_SOURCE_YAML="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-job.yaml" +REAL_PHASE3_TPL=$(mktemp) +trap 'rm -f "$REAL_PHASE3_TPL"' EXIT +python3 - "$PHASE3_JOB_SOURCE_YAML" >"$REAL_PHASE3_TPL" <<'PYEOF' +import sys, yaml +for d in yaml.safe_load_all(open(sys.argv[1])): + if (d and d.get("kind") == "ConfigMap" + and d.get("metadata", {}).get("name") + == "cluster-validation-phase3-job-config"): + sys.stdout.write(d["data"]["cluster-validation-phase3-job-config.yaml"]) + break +PYEOF +if [[ ! -s "$REAL_PHASE3_TPL" ]]; then + echo "FATAL: could not extract Phase 3 Job template from ${PHASE3_JOB_SOURCE_YAML}" >&2 + exit 1 +fi + +# Render the real template the same way PHASE3_SCRIPT does. Mirrors +# the sed expression in cluster-validation-config.yaml PHASE3_SCRIPT +# (around line ~2441 post-refactor). Caller sets NODE, JOB_NAME, +# EXPECTED_NIC_COUNT, IMG. +_render_phase3_real() { + local node="$1" job_name="$2" nic_count="$3" img="$4" + sed "s|\$\$NODE|${node}|g; \ + s/^ name: cluster-validation-phase3-job/ name: ${job_name}/; \ + s|\$\$EXPECTED_NIC_COUNT|${nic_count}|g; \ + s|\$\$ROCE_WORKLOAD_IMAGE|${img}|g" \ + "$REAL_PHASE3_TPL" +} + +# D1. With a populated ROCE_WORKLOAD_IMAGE, the $$-placeholder is +# fully substituted (no residual $$ tokens anywhere in the +# rendered template). +it "PHASE3 template render: \$\$ROCE_WORKLOAD_IMAGE is substituted (no residual placeholders)" && { + rendered=$(_render_phase3_real "node-x" "cvf-phase3-node-x-ts" "8" \ + "docker.io/rocm/roce-workload:my-tag-123") + img_line=$(printf '%s\n' "$rendered" | grep -E '^[[:space:]]+image:' || true) + if [[ "$img_line" != *"image: docker.io/rocm/roce-workload:my-tag-123"* ]]; then + _assert_fail "expected image: docker.io/rocm/roce-workload:my-tag-123 in rendered template, got: $img_line" + fi + if printf '%s\n' "$rendered" | grep -qE '\$\$'; then + _assert_fail "rendered template still contains \$\$ placeholders: +$(printf '%s\n' "$rendered" | grep -E '\$\$')" + fi +} + +# D2. ROCE_WORKLOAD_IMAGE env var injected into nic-health container +# (so Check 5 can substring-match each NIC's fw_ver against the +# full image reference) AND the old host-sys hostPath volume + +# /sys-host mount are ABSENT. The new Check 5 reads +# /sys/class/infiniband//fw_ver from the pod's auto-mounted +# sysfs (kernel sysfs is netns-agnostic for class paths), so a +# dedicated hostPath mount is no longer required. Assert here that +# the volume is gone so we don't accidentally re-introduce it. +# +# Implementation note: we write the rendered YAML to a tmp file +# and pass its path to python via argv. Mixing a heredoc-supplied +# python script with stdin-piped YAML breaks because the heredoc +# itself becomes python's stdin. +it "PHASE3 template render: ROCE_WORKLOAD_IMAGE env injected, host-sys volume absent" && { + _render_phase3_real "node-y" "cvf-phase3-node-y-ts" "8" \ + "docker.io/rocm/roce-workload:any-tag-42" > "${TPL_DIR}/d2_rendered.yaml" + out=$(python3 - "${TPL_DIR}/d2_rendered.yaml" <<'PYEOF' +import sys, yaml +d = yaml.safe_load(open(sys.argv[1])) +spec = d["spec"]["template"]["spec"] +vols = spec.get("volumes", []) or [] +host_sys_vol = next((v for v in vols if v.get("name") == "host-sys"), None) +assert host_sys_vol is None, \ + f"host-sys volume must be absent after Check 5 rewrite; got: {host_sys_vol}" +c = spec["containers"][0] +assert c["name"] == "nic-health", f"first container is {c['name']}" +mounts = c.get("volumeMounts", []) or [] +host_sys_mt = next((m for m in mounts if m.get("name") == "host-sys"), None) +assert host_sys_mt is None, \ + f"host-sys volumeMount must be absent; got: {host_sys_mt}" +# ROCE_WORKLOAD_IMAGE env var must be injected into the container +# with the rendered image value (so Check 5's substring match has a +# reference to compare fw_ver against). +envs = c.get("env", []) or [] +roce_env = next((e for e in envs if e.get("name") == "ROCE_WORKLOAD_IMAGE"), None) +assert roce_env is not None, f"ROCE_WORKLOAD_IMAGE env var must be set; envs={envs}" +assert roce_env.get("value") == "docker.io/rocm/roce-workload:any-tag-42", \ + f"ROCE_WORKLOAD_IMAGE value={roce_env.get('value')}" +print("OK") +PYEOF +) + rc=$? + assert_equals "0" "$rc" + assert_equals "OK" "$out" +} + +# D3. The container image MUST NOT be the obsolete +# network-operator-utils tag anywhere in the source template +# (post-refactor, this image is gone). Guards against accidental +# revert. +it "PHASE3 template source: network-operator-utils:v1.1.0 has been removed" && { + if grep -F 'docker.io/rocm/network-operator-utils:v1.1.0' \ + "$PHASE3_JOB_SOURCE_YAML" >/dev/null; then + _assert_fail "regressed: network-operator-utils:v1.1.0 still referenced in ${PHASE3_JOB_SOURCE_YAML}" + fi + # And the obsolete patchable comment must be gone too ( + # design doc §4: removed because the image is now centrally + # pinned via ROCE_WORKLOAD_IMAGE). + if grep -F 'patchable: cluster-validation-framework.images.nic-health' \ + "$PHASE3_JOB_SOURCE_YAML" >/dev/null; then + _assert_fail "regressed: obsolete patchable comment for nic-health still present in ${PHASE3_JOB_SOURCE_YAML}" + fi +} + +# D4. PHASE3_SCRIPT sed pipeline includes the new substitution +# expression. Guards against the sed key being dropped from the +# pipeline (which would leave a literal $$ROCE_WORKLOAD_IMAGE in +# the rendered template and the kubelet would fail to pull it). +it "PHASE3_SCRIPT sed pipeline substitutes \$\$ROCE_WORKLOAD_IMAGE" && { + # The sed expression lives in cluster-validation-config.yaml + # under PHASE3_SCRIPT. The literal stored in the ConfigMap key + # is `s|\$\$ROCE_WORKLOAD_IMAGE|${image}|g` (backslash-escaped + # so YAML/sed don't expand the placeholders). To gate on the + # presence of the substitution in PHASE3_SCRIPT specifically + # (not the unrelated PHASE2/PHASE4/PHASE5 occurrences), we + # extract PHASE3_SCRIPT and grep its body. + raw=$(extract_configmap_data "$CONFIGMAP" "PHASE3_SCRIPT") + if ! printf '%s\n' "$raw" \ + | grep -F 's|\$\$ROCE_WORKLOAD_IMAGE|${image}|g' >/dev/null; then + _assert_fail "PHASE3_SCRIPT sed pipeline missing \$\$ROCE_WORKLOAD_IMAGE substitution in ${CONFIGMAP}" + fi +} + +# D5. End-to-end shape check: a full render with a typical image +# tag produces a parseable Kubernetes Job manifest (no broken +# YAML from the sed substitution, no leftover $$ tokens, image +# is the expected value). +it "PHASE3 template render: yields a parseable Job manifest with the expected image" && { + img_in="docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_rccl-7.0.2_anp-v1.2.0_ainic-1.117.1-a-63" + _render_phase3_real "node-z" "cvf-phase3-node-z-ts" "8" "$img_in" \ + > "${TPL_DIR}/d5_rendered.yaml" + # Pass img_in to python via env so the heredoc can stay + # single-quoted (no shell expansion). + out=$(IMG="$img_in" python3 - "${TPL_DIR}/d5_rendered.yaml" <<'PYEOF' +import os, sys, yaml +img_in = os.environ["IMG"] +d = yaml.safe_load(open(sys.argv[1])) +assert d["kind"] == "Job", f"kind={d.get('kind')}" +assert d["metadata"]["name"] == "cvf-phase3-node-z-ts", \ + f"name={d['metadata']['name']}" +c = d["spec"]["template"]["spec"]["containers"][0] +assert c["image"] == img_in, f"image={c['image']}" +assert c["resources"]["limits"]["amd.com/nic"] == 8, \ + f"nic limit={c['resources']['limits']['amd.com/nic']}" +ns = d["spec"]["template"]["spec"]["nodeSelector"] +assert ns["kubernetes.io/hostname"] == "node-z", \ + f"nodeSelector={ns}" +print("OK") +PYEOF +) + rc=$? + assert_equals "0" "$rc" + assert_equals "OK" "$out" +} + +# ------------------------------------------------------------------- +# PART E: PHASE3_CHECK_SCRIPT Check 5 (firmware <-> workload-image +# alignment) in-pod behavior. The new Check 5 reads +# /sys/class/infiniband//fw_ver per ionic device and requires +# the running firmware string to appear as a substring of +# $ROCE_WORKLOAD_IMAGE. There is no compat map / nicctl / +# driver-version sysfs file anymore. +# +# Each Check 5 test: +# * _reset_check5_env -- shared reset + Check 5 opt-in + PASS +# fixtures for checks 1-4 so any failure has to come from Check 5 +# (or pre-flight). The per-test mock /sys/class/infiniband/ tree +# is laid down via _seed_sysfs_ib (defined in PART B above). +# * supplies ROCE_WORKLOAD_IMAGE and optionally +# PHASE3_DRIVER_FW_STRICT. +# ------------------------------------------------------------------- + +_reset_check5_env() { + _reset_check_env + export PHASE3_DRIVER_FW_CHECK_ENABLED="true" + # Pass-fixture defaults for checks 1-4 so any failure has to come + # from Check 5 (or pre-flight). IP_LINK_FIXTURE is harmless now + # that Check 2 reads sysfs (the `ip` shim is unused) but is kept + # for symmetry with legacy tests. + export LSPCI_FIXTURE="${FIXTURES_DIR}/lspci-pass.txt" + export RDMA_LINK_FIXTURE="${FIXTURES_DIR}/rdma-link-pass.txt" + export IBV_DEVICES_FIXTURE="${FIXTURES_DIR}/ibv-devices.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + export IBV_DEVINFO_FIXTURE_DEFAULT_V="${FIXTURES_DIR}/ibv-devinfo-pass.txt" + # Each test that exercises Check 5 must populate a per-test mock + # ib sysfs tree via _new_sysfs_ib_root + _seed_sysfs_ib. +} + +# Default fw / image pair used by most Check 5 tests. +CHECK5_DEFAULT_FW="1.117.5-a-56" +CHECK5_DEFAULT_IMAGE="docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_rccl-7.0.2_anp-v1.2.0_ainic-1.117.5-a-56" + +# Lay down N ionic devices in the per-test ib sysfs root, all with +# the same fw string. +_seed_ionic_fw_uniform() { + local root="$1" count="$2" fw="$3" + local i + for ((i=0; i/dev/null + ib_root="$PHASE3_IB_SYSFS_DIR" + _seed_ionic_fw_uniform "$ib_root" 8 "$CHECK5_DEFAULT_FW" + export ROCE_WORKLOAD_IMAGE="$CHECK5_DEFAULT_IMAGE" + run _run_check + assert_status 0 + assert_stdout_contains "PHASE3_RESULT status=passed" + # observed_fw= field present and lists every device. Order is + # not stable (bash associative array iteration) so check each + # ionic_N=fw substring individually. + assert_stdout_contains "observed_fw=" + for i in 0 1 2 3 4 5 6 7; do + assert_stdout_contains "ionic_${i}=${CHECK5_DEFAULT_FW}" + done + # PASS path must not carry a mismatch token anywhere. + if grep -F "fw-image-mismatch" <<<"$LAST_STDOUT" >/dev/null; then + _assert_fail "PASS path must not emit fw-image-mismatch: +${LAST_STDOUT}" + fi + assert_kubectl_no_calls +} + +# ------------------------------------------------------------------- +# E2. FAIL strict=true: fw does not appear in workload image tag. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT Check 5 FAIL strict=true: fw not in image -> fw-image-mismatch" && { + _reset_check5_env + _new_sysfs_ib_root >/dev/null + ib_root="$PHASE3_IB_SYSFS_DIR" + _seed_ionic_fw_uniform "$ib_root" 8 "$CHECK5_DEFAULT_FW" + # Image tag carries a DIFFERENT fw value -- substring match fails. + export ROCE_WORKLOAD_IMAGE="docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_ainic-1.118.0-a-99" + # strict defaults to true; assert explicitly for clarity. + export PHASE3_DRIVER_FW_STRICT="true" + run _run_check + assert_status 1 + assert_stdout_contains "PHASE3_RESULT status=failed" + # Reason carries the image-tag suffix (after the last colon), not + # the full registry prefix. + assert_stdout_contains "fw-image-mismatch:ionic_0=${CHECK5_DEFAULT_FW}/image=ubuntu24_rocm-7.0.2_ainic-1.118.0-a-99" + assert_stdout_contains "observed_fw=" + assert_stdout_contains "ionic_0=${CHECK5_DEFAULT_FW}" + # failed_nics must list every ionic device since they all mismatch. + for i in 0 1 2 3 4 5 6 7; do + assert_stdout_contains "ionic_${i}" + done +} + +# ------------------------------------------------------------------- +# E3. FAIL -> PASS via strict=false: same mismatch as E2, but +# warn-only mode suppresses the mismatch from the reason marker +# while still surfacing observed_fw= and the warning log line. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT Check 5 strict=false: mismatch surfaces in observed_fw + log, marker is passed" && { + _reset_check5_env + _new_sysfs_ib_root >/dev/null + ib_root="$PHASE3_IB_SYSFS_DIR" + _seed_ionic_fw_uniform "$ib_root" 8 "$CHECK5_DEFAULT_FW" + export ROCE_WORKLOAD_IMAGE="docker.io/rocm/roce-workload:ubuntu24_rocm-7.0.2_ainic-1.118.0-a-99" + export PHASE3_DRIVER_FW_STRICT="false" + run _run_check + assert_status 0 + assert_stdout_contains "PHASE3_RESULT status=passed" + assert_stdout_contains "observed_fw=" + assert_stdout_contains "ionic_0=${CHECK5_DEFAULT_FW}" + # Warning log line is still emitted (operator-visible) even in + # warn-only mode -- the gate just doesn't trip the marker. + assert_stdout_contains "[Phase 3] CHECK 5 MISMATCH: dev=ionic_0 fw=${CHECK5_DEFAULT_FW} not in image=" + # The marker's reason field MUST NOT carry fw-image-mismatch + # tokens in warn-only mode. (The mismatch line is on stdout only + # via the log message; check the PHASE3_RESULT line specifically.) + result_line=$(grep -E '^PHASE3_RESULT ' <<<"$LAST_STDOUT" | tail -n 1 || true) + if [[ "$result_line" == *"fw-image-mismatch"* ]]; then + _assert_fail "warn-only marker must not carry fw-image-mismatch token: ${result_line}" + fi +} + +# ------------------------------------------------------------------- +# E4. Check disabled: marker is byte-identical to the legacy +# pre-Check-5 contract -- no observed_fw= field at all. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT Check 5 disabled -> marker has no observed_fw= field" && { + _reset_check5_env + # Override the per-test enable: disable Check 5 explicitly. + export PHASE3_DRIVER_FW_CHECK_ENABLED="false" + # Seed the ib sysfs tree anyway -- the script must not read it + # when the check is disabled (the tree being present is not an + # implicit enable). + _new_sysfs_ib_root >/dev/null + ib_root="$PHASE3_IB_SYSFS_DIR" + _seed_ionic_fw_uniform "$ib_root" 8 "$CHECK5_DEFAULT_FW" + export ROCE_WORKLOAD_IMAGE="$CHECK5_DEFAULT_IMAGE" + run _run_check + assert_status 0 + assert_stdout_contains "PHASE3_RESULT status=passed" + # observed_fw= field MUST be absent when Check 5 is disabled. + result_line=$(grep -E '^PHASE3_RESULT ' <<<"$LAST_STDOUT" | tail -n 1 || true) + if [[ "$result_line" == *"observed_fw="* ]]; then + _assert_fail "disabled Check 5 must not emit observed_fw= field; got: ${result_line}" + fi + if [[ "$result_line" == *"fw-image-mismatch"* ]]; then + _assert_fail "disabled Check 5 must not emit fw-image-mismatch; got: ${result_line}" + fi +} + +# ------------------------------------------------------------------- +# E5. Missing ROCE_WORKLOAD_IMAGE env: "no data" => SKIP, never FAIL. +# Phase 3 still passes (Checks 1-4 are clean in this fixture). +# observed_fw= is still emitted so the read values are visible. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT Check 5 missing ROCE_WORKLOAD_IMAGE -> SKIP, no fw-image-env-missing fail" && { + _reset_check5_env + _new_sysfs_ib_root >/dev/null + ib_root="$PHASE3_IB_SYSFS_DIR" + _seed_ionic_fw_uniform "$ib_root" 8 "$CHECK5_DEFAULT_FW" + # ROCE_WORKLOAD_IMAGE intentionally unset. + unset ROCE_WORKLOAD_IMAGE + run _run_check + assert_status 0 + assert_stdout_contains "PHASE3_RESULT status=passed" + # observed_fw= is still emitted so operators see the read values. + assert_stdout_contains "observed_fw=" + assert_stdout_contains "ionic_0=${CHECK5_DEFAULT_FW}" + assert_stdout_contains "CHECK 5 SKIP: ROCE_WORKLOAD_IMAGE env not set" + # No-data is a skip, not a fail. The legacy hard-fail reason must + # not appear anywhere on the marker line. + assert_stdout_not_contains "fw-image-env-missing" +} + +# ------------------------------------------------------------------- +# E6. Zero ionic devices in sysfs (empty tree): "no data" => SKIP, +# never FAIL. Phase 3 still passes (Checks 1-4 are clean). +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT Check 5 zero ionic devices -> SKIP, no sysfs-error fail" && { + _reset_check5_env + # Allocate empty ib sysfs root (no _seed_sysfs_ib calls). + _new_sysfs_ib_root >/dev/null + export ROCE_WORKLOAD_IMAGE="$CHECK5_DEFAULT_IMAGE" + run _run_check + assert_status 0 + assert_stdout_contains "PHASE3_RESULT status=passed" + assert_stdout_contains "CHECK 5 SKIP: no fw_ver readable under" + assert_stdout_not_contains "sysfs-error:no-fw-ver" +} + +# ------------------------------------------------------------------- +# E7. Partial fw read: 8 devices in sysfs, but 2 have no fw_ver file +# (simulates unreadable / driver bug). The 6 readable devices +# all match the image tag -> PASS. observed_fw= lists only the +# 6 successfully-read devices. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT Check 5 partial fw read: 6/8 readable + all match -> PASS, observed_fw lists 6" && { + _reset_check5_env + _new_sysfs_ib_root >/dev/null + ib_root="$PHASE3_IB_SYSFS_DIR" + # 6 devices with readable fw matching the image tag. + for i in 0 1 2 3 4 5; do + _seed_sysfs_ib "$ib_root" "ionic_${i}" "ionic" "$CHECK5_DEFAULT_FW" + done + # 2 devices with NO fw_ver file (empty arg skips file creation). + _seed_sysfs_ib "$ib_root" "ionic_6" "ionic" "" + _seed_sysfs_ib "$ib_root" "ionic_7" "ionic" "" + export ROCE_WORKLOAD_IMAGE="$CHECK5_DEFAULT_IMAGE" + run _run_check + assert_status 0 + assert_stdout_contains "PHASE3_RESULT status=passed" + assert_stdout_contains "observed_fw=" + # The 6 readable devices appear in observed_fw=. + for i in 0 1 2 3 4 5; do + assert_stdout_contains "ionic_${i}=${CHECK5_DEFAULT_FW}" + done + # The 2 unreadable devices DO NOT appear in observed_fw=. + result_line=$(grep -E '^PHASE3_RESULT ' <<<"$LAST_STDOUT" | tail -n 1 || true) + for i in 6 7; do + if [[ "$result_line" == *"ionic_${i}="* ]]; then + _assert_fail "ionic_${i} (no fw_ver) must not appear in observed_fw=: ${result_line}" + fi + done +} + +# ------------------------------------------------------------------- +# E8. Driver filter: a non-ionic IB device (e.g. mlx5) must be +# skipped even if its fw_ver is populated -- only ionic devices +# participate in Check 5. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT Check 5 non-ionic IB device is skipped" && { + _reset_check5_env + _new_sysfs_ib_root >/dev/null + ib_root="$PHASE3_IB_SYSFS_DIR" + _seed_ionic_fw_uniform "$ib_root" 8 "$CHECK5_DEFAULT_FW" + # A mlx5 IB device that would mismatch the image -- must be ignored. + _seed_sysfs_ib "$ib_root" "mlx5_0" "mlx5_core" "99.99.99-NOT-IN-IMAGE" + export ROCE_WORKLOAD_IMAGE="$CHECK5_DEFAULT_IMAGE" + run _run_check + assert_status 0 + assert_stdout_contains "PHASE3_RESULT status=passed" + # The mlx5 device must NOT appear in observed_fw=, nor cause a mismatch. + result_line=$(grep -E '^PHASE3_RESULT ' <<<"$LAST_STDOUT" | tail -n 1 || true) + if [[ "$result_line" == *"mlx5_0"* ]]; then + _assert_fail "non-ionic mlx5_0 must not appear in marker: ${result_line}" + fi + if grep -F "fw-image-mismatch" <<<"$LAST_STDOUT" >/dev/null; then + _assert_fail "non-ionic device must not trip fw-image-mismatch: +${LAST_STDOUT}" + fi +} + +# ------------------------------------------------------------------- +# E9. Pre-flight FAIL via broken ibv_devinfo -> single +# preflight-failed:ibv_devinfo= reason, checks 1-5 skipped. +# This is the only pre-flight gate left -- nicctl is no longer +# probed. +# ------------------------------------------------------------------- + +it "PHASE3_CHECK_SCRIPT pre-flight: broken ibv_devinfo -> FAIL preflight-failed:ibv_devinfo, checks skipped" && { + _reset_check5_env + _new_sysfs_ib_root >/dev/null + ib_root="$PHASE3_IB_SYSFS_DIR" + _seed_ionic_fw_uniform "$ib_root" 8 "$CHECK5_DEFAULT_FW" + export ROCE_WORKLOAD_IMAGE="$CHECK5_DEFAULT_IMAGE" + # Force ibv_devinfo to fail (non-zero exit + representative stderr). + export IBV_DEVINFO_RC_DEFAULT="1" + export IBV_DEVINFO_STDERR_DEFAULT="libibverbs: failed to load driver ionic_rdma" + run _run_check + assert_status 1 + assert_stdout_contains "PHASE3_RESULT status=failed" + assert_stdout_contains "preflight-failed:ibv_devinfo=libibverbs:" + # Pre-flight short-circuit: no per-check failure lines from + # checks 1-5 (the script `exit 1`s before any of them runs). + for forbidden in "CHECK 1 PASS" "CHECK 1 FAIL" "CHECK 2 FAIL" "CHECK 3 FAIL" "CHECK 4 FAIL" "CHECK 5 FAIL"; do + if grep -qF -- "$forbidden" <<<"$LAST_STDOUT"; then + _assert_fail "pre-flight short-circuit must skip [${forbidden}]; stdout: +${LAST_STDOUT}" + fi + done +} + +# ------------------------------------------------------------------- +# PART F: PHASE3_SCRIPT outer-driver marker parser -- observed_fw +# annotation writes. The new Check 5 only emits observed_fw= on the +# marker (observed_driver was removed when the compat-map path was +# retired), so the orchestrator's parser only writes the +# `observed-fw` annotation. +# +# These exercise the orchestrator's _phase3_parse_and_label code path +# (the sed-extraction of observed_fw= from the PHASE3_RESULT line +# and the matching annotate_phase_value call). The in-pod check body +# is irrelevant here -- the harness seeds the marker line directly +# via the kubectl mock's `kubectl_mock_set_pod_log job/` hook. +# ------------------------------------------------------------------- + +# F1. Marker carries observed_fw on a pass line -> orchestrator +# writes the observed-fw annotation and the =passed label. No +# failure-reason / failed-nics annotations. +it "PHASE3_SCRIPT marker has observed_fw (pass) -> observed-fw annotation written" && { + _reset_phase3_env + expected_job=$(_phase3_expected_job_name "node-obs-fw" "$ts") + kubectl_mock_set_job_condition "$expected_job" "Complete" "True" + kubectl_mock_set_pod_log "job/${expected_job}" \ + "PHASE3_RESULT status=passed observed_fw=ionic_0=1.117.5-a-56,ionic_1=1.117.5-a-56" + run __phase3_run node-obs-fw + assert_status 0 + assert_kubectl_call \ + "label node node-obs-fw amd.com/nic-health=passed --overwrite" + assert_kubectl_call_contains \ + "amd.com/nic-health-observed-fw=ionic_0=1.117.5-a-56,ionic_1=1.117.5-a-56" + # No failure-reason / failed-nics on a pass line. + if grep -F "amd.com/nic-health-failure-reason" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "pass+observed-fw path must not write failure-reason: +$(grep amd.com/nic-health "$KUBECTL_CALLS_FILE")" + fi + if grep -F "amd.com/nic-health-failed-nics" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "pass+observed-fw path must not write failed-nics: +$(grep amd.com/nic-health "$KUBECTL_CALLS_FILE")" + fi +} + +# F2. Marker carries no observed_fw field (e.g. Check 5 disabled +# in-pod) -> orchestrator writes no observed-fw annotation. +# Today's pass label still goes through. The absence branch is +# intentionally a no-op so any prior observed-fw annotation +# stays as last-known-good for operators flipping the check off. +it "PHASE3_SCRIPT marker has no observed_fw field -> no observed-fw annotation written" && { + _reset_phase3_env + _seed_job_complete "node-obs-none" "$ts" # legacy pass-only marker + run __phase3_run node-obs-none + assert_status 0 + assert_kubectl_call \ + "label node node-obs-none amd.com/nic-health=passed --overwrite" + if grep -F "amd.com/nic-health-observed-fw" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "absent-observed_fw path must not write observed-fw annotation: +$(grep amd.com/nic-health "$KUBECTL_CALLS_FILE")" + fi + # observed-driver annotation is dead -- must never be written. + if grep -F "amd.com/nic-health-observed-driver" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "observed-driver annotation must never be written: +$(grep amd.com/nic-health "$KUBECTL_CALLS_FILE")" + fi +} + +# F3. Marker on a FAILED node with fw-image-mismatch reason + +# observed_fw -> orchestrator writes: +# * label=failed +# * failure-reason= +# * failed-nics= +# * observed-fw= +# i.e. today's failure annotations PLUS the new observed-fw +# annotation (per the design "regardless of pass/fail outcome +# whenever the marker carried them"). +it "PHASE3_SCRIPT marker on failed node with fw-image-mismatch -> failure annotations + observed-fw written" && { + _reset_phase3_env + expected_job=$(_phase3_expected_job_name "node-fw-mismatch" "$ts") + kubectl_mock_set_job_condition "$expected_job" "Failed" "True" + kubectl_mock_set_pod_log "job/${expected_job}" \ + "PHASE3_RESULT status=failed reason=fw-image-mismatch:ionic_1=1.117.5-a-56/image=ainic-1.118.0-a-99 failed_nics=ionic_1 observed_fw=ionic_0=1.117.5-a-56,ionic_1=1.117.5-a-56" + run __phase3_run node-fw-mismatch + assert_status 0 + assert_kubectl_call \ + "label node node-fw-mismatch amd.com/nic-health=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/nic-health-failure-reason=fw-image-mismatch:ionic_1=1.117.5-a-56/image=ainic-1.118.0-a-99" + assert_kubectl_call_contains \ + "amd.com/nic-health-failed-nics=ionic_1" + assert_kubectl_call_contains \ + "amd.com/nic-health-observed-fw=ionic_0=1.117.5-a-56,ionic_1=1.117.5-a-56" +} + +assert_summary diff --git a/example/gpu-validation-cluster/tests/test_phase4.sh b/example/gpu-validation-cluster/tests/test_phase4.sh new file mode 100755 index 000000000..0293c8f12 --- /dev/null +++ b/example/gpu-validation-cluster/tests/test_phase4.sh @@ -0,0 +1,1095 @@ +#!/bin/bash +# Unit tests for PHASE4_DRIVER_SCRIPT against the +# mocked kubectl harness and sample ib_write_bw client log fixtures. +# +# +# Scope (from the design doc §7 +# "Testing Strategy" + test plan): +# * pairing-roundrobin-even -- input [a,b,c,d] -> full mesh +# (3 rounds, 2 disjoint pairs/round, 6 pairs total) [TP TC1] +# * pairing-roundrobin-odd -- input [a,b,c,d,e] -> full mesh +# (5 rounds, one node sits out per round, 10 pairs total) [TP TC2] +# * per-rail-annotation-written -- single pair, all rails pass, +# both nodes labeled passed with +# per-(rail, round) annotations [TP TC3] +# * skip-phase4-passlabels-all -- SKIP_RAIL_BANDWIDTH_TEST=true [TP TC4] +# * single-rail-fail -- inject low BW on rail 5 -> failed-rails=5 +# + triangulation entry [TP TC5] +# * all-rails-fail-one-pair -- all 8 rails below threshold [TP TC6] +# * ib-write-bw-crash -- Failed=True + no BW line in log [TP TC7] +# * parse-failure -- Complete=True + empty log [TP TC8] +# * single-node-input -- only one input node -> unpaired=true [TP TC9] +# * empty-input -- no nodes -> no-op [TP TC10] +# * rail-count-override -- PHASE4_RAIL_COUNT=4 limits annotations [TP TC11] +# * server-pod-unready timeout -- server pod IP never set [TP TC12] +# * missing required env var -> all input nodes labeled failed +# * job templates missing -> all input nodes labeled failed +# * concurrency-cap-honored -- 8-node input, mocked apply, peak +# concurrent pair_runners <= PHASE4_MAX_CONCURRENT_PAIRS=4 [TP TC15] +# +# full-mesh schedule tests (these use PHASE4_RAIL_COUNT=0 +# to exercise the scheduler alone without seeding per-rail Job state; +# the driver still walks the schedule, emits the per-round log lines, +# and runs pair_runners that are no-ops because the rail loop iterates +# zero times): +# * mesh-schedule-N2 -- 1 round, 1 pair +# * mesh-schedule-N4 -- 3 rounds, 2 disjoint pairs/round, union C(4,2)=6 +# * mesh-schedule-N5 -- 5 rounds, one node sits out, union C(5,2)=10 +# * mesh-schedule-N6 -- 5 rounds, 3 disjoint pairs/round, union C(6,2)=15 +# * mesh-schedule-N7 -- 7 rounds, one node sits out, union C(7,2)=21 +# * mesh-schedule-N8 -- 7 rounds, 4 disjoint pairs/round, union C(8,2)=28 +# * mesh-disjoint-property -- no node appears in more than one pair per +# round (verified for every N above) +# +# How PHASE4_DRIVER_SCRIPT is exercised (mirrors test_phase2.sh / +# test_phase3.sh conventions): +# +# The driver body is a block-scalar inside cluster-validation-config.yaml. +# We extract it with lib/extract_script.sh, then patch: +# 1) /phase4-configs/cluster-validation-phase4-server-job-config.yaml +# -> ${TPL_DIR}/cluster-validation-phase4-server-job-config.yaml +# 2) /phase4-configs/cluster-validation-phase4-client-job-config.yaml +# -> ${TPL_DIR}/cluster-validation-phase4-client-job-config.yaml +# 3) /tmp state dir prefix -> ${PHASE4_STATE_TMP_BASE} so the test +# owns the tree (helps with concurrency-cap accounting and +# teardown). +# The driver uses `local` heavily, so we wrap the patched body in a +# function `__phase4_run` and invoke that. The helper library +# (PHASE_NODE_LABEL_SCRIPT) is sourced first so label_phase_passed / +# label_phase_failed / annotate_phase_value are defined. +# +# `kubectl` is the mock from lib/kubectl_mock.sh. Per-rail Job +# "completion" is simulated by seeding: +# * server Job: pod-ip|= (driver waits for podIP) +# pod-for-job|= (driver inspects pod +# count when timing out) +# job|=Complete=True (optional; server Job's +# terminal status is not +# checked by the driver, +# but the cleanup delete +# fires either way) +# * client Job: pod-for-job|= +# pod-log|= (parse "BW average") +# job|=Complete=True OR Failed=True +# +# The driver's wait loops include `sleep 2` / `sleep 5` between +# polls. To keep test runtime reasonable we shim `sleep` on PATH as +# a no-op (the mock state is set up before invocation, so the first +# poll always sees what it needs). + +set -uo pipefail + +TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd -- "${TEST_DIR}/../../.." && pwd) +CONFIGMAP="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-config.yaml" +FIXTURES_DIR="${TEST_DIR}/fixtures/phase4" + +# shellcheck source=./lib/assert.sh +source "${TEST_DIR}/lib/assert.sh" +# shellcheck source=./lib/kubectl_mock.sh +source "${TEST_DIR}/lib/kubectl_mock.sh" +# shellcheck source=./lib/extract_script.sh +source "${TEST_DIR}/lib/extract_script.sh" + +echo "================================================================" +echo " test_phase4.sh" +echo " ConfigMap: ${CONFIGMAP}" +echo " Fixtures: ${FIXTURES_DIR}" +echo "================================================================" + +# --- one-time setup ------------------------------------------------- + +PHASE4_DIR=$(mktemp -d -t phase4-tests-XXXXXX) +TPL_DIR="${PHASE4_DIR}/tpl" +SHIM_DIR="${PHASE4_DIR}/shims" +PHASE4_BODY="${PHASE4_DIR}/phase4-body.sh" +HELPER_SCRIPT="${PHASE4_DIR}/phase-helpers.sh" +PHASE4_STATE_TMP_BASE="${PHASE4_DIR}/state-base" +mkdir -p "$TPL_DIR" "$SHIM_DIR" "$PHASE4_STATE_TMP_BASE" + +trap 'rm -rf "$PHASE4_DIR"; kubectl_mock_cleanup' EXIT + +# Minimal Job template stand-ins for server + client. The real +# templates live in cluster-validation-phase4-job-config. +# PHASE4_DRIVER_SCRIPT pipes a sed-rendered copy to `kubectl apply -f -`; +# since kubectl is mocked, the only constraints on contents are: +# * the file must exist (`[[ ! -f "$server_tmpl" || ! -f "$client_tmpl" ]]`) +# * sed substitutions on $$NODE, $$RAIL_IDX, etc. must not error +# plain-text body with the substitution markers is fine +# * the rename anchor `^ name: cluster-validation-phase4-server-job` / +# `-client-job` must match so _phase4_render emits the per-(role, +# node, rail) job name on stderr (RENDERED_JOB_NAME=.). +cat >"${TPL_DIR}/cluster-validation-phase4-server-job-config.yaml" <<'YAML' +apiVersion: batch/v1 +kind: Job +metadata: + name: cluster-validation-phase4-server-job +spec: + template: + metadata: + annotations: + k8s.v1.cni.cncf.io/networks: $$NAD_NAME + spec: + nodeSelector: + kubernetes.io/hostname: $$NODE + containers: + - name: phase4-ib-server + image: $$ROCE_WORKLOAD_IMAGE + env: + - name: RAIL_IDX + value: "$$RAIL_IDX" +YAML + +cat >"${TPL_DIR}/cluster-validation-phase4-client-job-config.yaml" <<'YAML' +apiVersion: batch/v1 +kind: Job +metadata: + name: cluster-validation-phase4-client-job +spec: + template: + metadata: + annotations: + k8s.v1.cni.cncf.io/networks: $$NAD_NAME + spec: + nodeSelector: + kubernetes.io/hostname: $$NODE + containers: + - name: phase4-ib-client + image: $$ROCE_WORKLOAD_IMAGE + env: + - name: RAIL_IDX + value: "$$RAIL_IDX" + - name: PEER_POD_IP + value: "$$PEER_POD_IP" +YAML + +# --- sleep shim ----------------------------------------------------- +# PHASE4_DRIVER_SCRIPT polls with `sleep 2` (pod-IP wait) and `sleep 5` +# (job-terminal wait). Tests pre-seed the mock state so the first +# iteration of every poll loop succeeds; the sleep is therefore wasted +# wall-clock. A no-op `sleep` shim keeps the full suite under one +# second. The shim is installed on PATH BEFORE kubectl_mock_init's +# PATH prepend so the mock kubectl (which kubectl_mock_init places at +# the head) still wins for `kubectl`. +cat >"${SHIM_DIR}/sleep" <<'EOF' +#!/bin/bash +# no-op sleep -- test suite pre-seeds all poll state, so wall-clock +# waiting buys nothing here. Accepts and ignores all args. +exit 0 +EOF +chmod +x "${SHIM_DIR}/sleep" + +# Extract PHASE4_DRIVER_SCRIPT and patch the two hardcoded template +# paths so the test can run as a non-root user without /phase4-configs +# existing. Also rewrite the state-dir prefix from /tmp/phase4-. to +# our test-owned base dir so we can introspect concurrency state and +# the test harness can audit / clean up after itself. +RAW_PHASE4=$(extract_configmap_data "$CONFIGMAP" "PHASE4_DRIVER_SCRIPT") +if [[ -z "$RAW_PHASE4" ]]; then + echo "FATAL: PHASE4_DRIVER_SCRIPT extraction produced empty output" >&2 + exit 1 +fi + +PATCHED_PHASE4=$(printf '%s\n' "$RAW_PHASE4" \ + | sed "s|/phase4-configs/cluster-validation-phase4-server-job-config.yaml|${TPL_DIR}/cluster-validation-phase4-server-job-config.yaml|g" \ + | sed "s|/phase4-configs/cluster-validation-phase4-client-job-config.yaml|${TPL_DIR}/cluster-validation-phase4-client-job-config.yaml|g" \ + | sed "s|\"/tmp/phase4-\${phase4_ts}-\$\$\"|\"${PHASE4_STATE_TMP_BASE}/run-\${phase4_ts}-\$\$\"|g") + +# Wrap in a function so `local` / `local -a` (used heavily inside +# PHASE4_DRIVER_SCRIPT) are valid. The orchestrator +# sources the driver inside run_phase4, which is a function, so this +# matches production wiring. +{ + printf '__phase4_run() {\n' + printf '%s\n' "$PATCHED_PHASE4" + printf '}\n' +} > "$PHASE4_BODY" + +if ! bash -n "$PHASE4_BODY"; then + echo "FATAL: patched PHASE4_DRIVER_SCRIPT has bash syntax errors" >&2 + exit 1 +fi + +# Extract the helper library (label_phase_passed/failed, +# annotate_phase_value) once. +extract_configmap_data "$CONFIGMAP" "PHASE_NODE_LABEL_SCRIPT" \ + > "$HELPER_SCRIPT" +if [[ ! -s "$HELPER_SCRIPT" ]]; then + echo "FATAL: PHASE_NODE_LABEL_SCRIPT extraction produced empty output" >&2 + exit 1 +fi +if ! bash -n "$HELPER_SCRIPT"; then + echo "FATAL: extracted helper script has bash syntax errors" >&2 + exit 1 +fi + +kubectl_mock_init + +# Suffix for failure-reason annotation; mirrors ConfigMap default. +export PHASE_FAILURE_REASON_ANNOTATION_SUFFIX="-failure-reason" + +# Prepend the shim dir AFTER kubectl_mock_init so the mock kubectl +# (which init prepends) still wins for `kubectl`, and our `sleep` +# shim wins over /bin/sleep. The shim dir goes second-from-front so +# /usr/bin/ still resolves for things like `date`, `awk`, `sed`. +export PATH="${SHIM_DIR}:${PATH}" + +# shellcheck disable=SC1090 +source "$HELPER_SCRIPT" +# shellcheck disable=SC1090 +source "$PHASE4_BODY" + +# Sanity: required functions are defined. +for fn in label_phase_passed label_phase_failed annotate_phase_value \ + __phase4_run; do + if ! declare -F "$fn" >/dev/null; then + echo "FATAL: required function $fn not defined after sourcing" >&2 + exit 1 + fi +done + +# Suppress the -u trap for tests that intentionally leave optional env +# vars unset (PHASE_NODES, SKIP_RAIL_BANDWIDTH_TEST). +set +u + +# Helper: compute the per-(role, node, rail, round) Job name +# PHASE4_DRIVER_SCRIPT generates inside _phase4_render. Mirrors the +# driver exactly (round suffix added to disambiguate the +# same (role, node, rail) repeated across mesh rounds): +# cvf-p4-${role}-${node}-r${rail_idx}-rd${round_idx} (when short enough) +# cvf-p4-${role}-${sha1(node)|6}-r${rail_idx}-rd${round_idx} (when too long) +_phase4_expected_job_name() { + local role="$1" node="$2" rail="$3" round="${4:-0}" + local max_len=63 + local jn="cvf-p4-${role}-${node}-r${rail}-rd${round}" + if [[ "${#jn}" -gt "$max_len" ]]; then + local h + h=$(echo -n "$node" | sha1sum | cut -c1-6) + jn="cvf-p4-${role}-${h}-r${rail}-rd${round}" + fi + printf '%s' "$jn" +} + +# Seed every server + client Job for one pair across rails 0.rail_count-1 +# (in one specific round, default 0) as "all pass" with the same client +# log fixture. Server pod IP is canned (any non-empty string works; the +# driver just substitutes it into the client template). The client Job +# is seeded Complete=True and its pod log carries the BW-average value. +_seed_pair_all_pass() { + local node_a="$1" node_b="$2" rail_count="$3" log_fixture="$4" + local round="${5:-0}" + local r + for (( r=0; r < rail_count; r++ )); do + local sjob cjob spod cpod + sjob=$(_phase4_expected_job_name "server" "$node_a" "$r" "$round") + cjob=$(_phase4_expected_job_name "client" "$node_b" "$r" "$round") + spod="pod-${sjob}" + cpod="pod-${cjob}" + kubectl_mock_set_pod_for_job "$sjob" "$spod" + kubectl_mock_set_pod_ip_for_job "$sjob" "10.42.0.${r}0" + kubectl_mock_set_pod_for_job "$cjob" "$cpod" + kubectl_mock_set_job_condition "$cjob" "Complete" "True" + kubectl_mock_set_pod_log "$cpod" "${FIXTURES_DIR}/${log_fixture}" + done +} + +# Seed one specific rail of a pair (in one specific round, default 0) +# with a custom log + terminal state. +# terminal: "Complete" or "Failed". If `seed_server` is "yes" (default) +# the server Job is seeded with a pod IP. Pass "no" to simulate +# server-pod-unready (driver will time out the pod-IP wait). +_seed_pair_one_rail() { + local node_a="$1" node_b="$2" rail="$3" terminal="$4" log_fixture="$5" + local seed_server="${6:-yes}" + local round="${7:-0}" + local sjob cjob spod cpod + sjob=$(_phase4_expected_job_name "server" "$node_a" "$rail" "$round") + cjob=$(_phase4_expected_job_name "client" "$node_b" "$rail" "$round") + spod="pod-${sjob}" + cpod="pod-${cjob}" + if [[ "$seed_server" == "yes" ]]; then + kubectl_mock_set_pod_for_job "$sjob" "$spod" + kubectl_mock_set_pod_ip_for_job "$sjob" "10.42.0.${rail}0" + fi + kubectl_mock_set_pod_for_job "$cjob" "$cpod" + kubectl_mock_set_job_condition "$cjob" "$terminal" "True" + if [[ -n "$log_fixture" ]]; then + kubectl_mock_set_pod_log "$cpod" "${FIXTURES_DIR}/${log_fixture}" + fi +} + +# Helper: extract the schedule emitted by PHASE4_DRIVER_SCRIPT from +# the LAST_STDOUT string captured by `run`. Emits one TSV line per +# round on stdout ( ...), sorted by +# round_idx. Used by mesh-schedule tests below. +_phase4_extract_schedule() { + grep -oE '\[Phase 4\] round [0-9]+ START: .*$' <<<"$LAST_STDOUT" \ + | sed -E 's/^\[Phase 4\] round ([0-9]+) START: (.*)$/\1\t\2/' \ + | sort -n +} + +# Helper: assert that every pair in `pairs_csv` is unordered-unique +# across the supplied multi-line schedule and that the union covers +# every C(N,2) pair derivable from the input node list. Args: +# $1 -- schedule TSV (roundpairs) produced by _phase4_extract_schedule +# $2 -- expected total pair count (e.g. 6 for N=4) +# $3.. -- the input node list (sorted) +# Fails the test via _assert_fail on any violation; prints nothing on success. +_phase4_assert_full_mesh() { + local sched="$1" + local expected="$2" + shift 2 + local -a nodes=("$@") + local n="${#nodes[@]}" + + # Build expected pair set (unordered, sorted "a,b" with a N-1 (circle of size N). + # For odd N -> N rounds (circle of size N+1 with a bye slot; + # circle_size - 1 = N rounds). With the circle algorithm every + # round has either floor(N/2) or (N-1)/2 real pairs, so the START + # line is always emitted (never an empty round). + local expected_rounds + if (( n % 2 == 0 )); then + expected_rounds=$(( n - 1 )) + else + expected_rounds="$n" + fi + if [[ "$round_count" -ne "$expected_rounds" ]]; then + _assert_fail "expected ${expected_rounds} rounds, got ${round_count}" + return 1 + fi + return 0 +} + +# Per-test reset: wipe the kubectl call log and any seeded state, and +# re-export the baseline env PHASE4_DRIVER_SCRIPT reads. Tests override +# pieces of this (notably SKIP_RAIL_BANDWIDTH_TEST, PHASE4_RAIL_COUNT, +# PHASE4_PAIR_WAIT_TIME) before calling __phase4_run. +_reset_phase4_env() { + kubectl_mock_reset + export PHASE4_LABEL_KEY="amd.com/rail-bandwidth" + export PHASE4_RAIL_COUNT="8" + export PHASE4_BW_THRESHOLD="380" + # Large enough that the first poll iteration sees seeded state and + # exits before the (no-op) sleep would fire. Tests that exercise + # the timeout branch override this to 0. + export PHASE4_PAIR_WAIT_TIME="60" + export PHASE4_MAX_CONCURRENT_PAIRS="8" + export PHASE4_NAD_NAME_PREFIX="amd-host-device-nad-rail-" + export PHASE4_IB_DEV_PREFIX="rdma_dev_" + export ROCE_WORKLOAD_IMAGE="docker.io/rocm/roce-workload:test" + unset SKIP_RAIL_BANDWIDTH_TEST PHASE_NODES +} + +# ------------------------------------------------------------------- +# 1. Empty input list -> no-op, exit 0, no kubectl side effects. +# [TP TC10] +# ------------------------------------------------------------------- + +it "empty input list is a no-op and returns 0" && { + _reset_phase4_env + run __phase4_run + assert_status 0 + assert_kubectl_no_calls + assert_stdout_contains "no input nodes -- nothing to do" +} + +# ------------------------------------------------------------------- +# 2. SKIP_RAIL_BANDWIDTH_TEST=true -> every input node pass-labeled, +# NO Phase 4 Job submission, no kubectl get/logs/apply work. +# [TP TC4] +# ------------------------------------------------------------------- + +it "SKIP_RAIL_BANDWIDTH_TEST=true pass-labels every input node, no Jobs created" && { + _reset_phase4_env + export SKIP_RAIL_BANDWIDTH_TEST="true" + run __phase4_run node-a node-b node-c + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/rail-bandwidth=passed --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/rail-bandwidth=passed --overwrite" + assert_kubectl_call \ + "label node node-c amd.com/rail-bandwidth=passed --overwrite" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP must not submit any Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^get job" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP must not poll Jobs: +$(grep -E '^get job' "$KUBECTL_CALLS_FILE")" + fi + if grep -E "^logs " "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "SKIP must not fetch pod logs: +$(grep -E '^logs ' "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "SKIP_RAIL_BANDWIDTH_TEST=true -- pass-labeling" +} + +# Case-insensitive variant: matches the `${VAR,}` lowercasing the +# driver does before comparing against "true". +it "SKIP_RAIL_BANDWIDTH_TEST accepts case-insensitive value (TRUE)" && { + _reset_phase4_env + export SKIP_RAIL_BANDWIDTH_TEST="TRUE" + run __phase4_run node-x + assert_status 0 + assert_kubectl_call \ + "label node node-x amd.com/rail-bandwidth=passed --overwrite" +} + +# ------------------------------------------------------------------- +# 3. Missing required env var -> every input node labeled =failed with +# reason=phase4-missing-env:.; no Jobs submitted. +# ------------------------------------------------------------------- + +it "missing required env var -> all input nodes labeled failed, no Jobs submitted" && { + _reset_phase4_env + unset PHASE4_RAIL_COUNT + run __phase4_run node-y node-z + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-y amd.com/rail-bandwidth=failed --overwrite" + assert_kubectl_call_contains \ + "amd.com/rail-bandwidth-failure-reason=phase4-missing-env:PHASE4_RAIL_COUNT" + assert_kubectl_call \ + "label node node-z amd.com/rail-bandwidth=failed --overwrite" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "missing-env path must not submit Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + assert_stderr_contains "required env var(s) unset:" +} + +# ------------------------------------------------------------------- +# 4. Missing job templates -> every input node labeled failed with +# reason=job-template-missing; no Jobs submitted. +# ------------------------------------------------------------------- + +it "missing job templates -> all input nodes labeled failed, reason=job-template-missing" && { + _reset_phase4_env + mv "${TPL_DIR}/cluster-validation-phase4-server-job-config.yaml" \ + "${TPL_DIR}/cluster-validation-phase4-server-job-config.yaml.hidden" + run __phase4_run node-a node-b + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_call \ + "label node node-a amd.com/rail-bandwidth=failed --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-failure-reason=job-template-missing --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/rail-bandwidth=failed --overwrite" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "missing-template path must not submit Jobs: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + # Restore for the rest of the suite. + mv "${TPL_DIR}/cluster-validation-phase4-server-job-config.yaml.hidden" \ + "${TPL_DIR}/cluster-validation-phase4-server-job-config.yaml" +} + +# ------------------------------------------------------------------- +# 5. full-mesh round-robin EVEN [a,b,c,d] -> 3 rounds, +# 2 disjoint pairs/round, 6 pairs total = C(4,2). Stable ordering: +# driver sort -u's the input first. +# [TP TC1] +# ------------------------------------------------------------------- + +it "pairing full-mesh even: [a,b,c,d] -> 3 rounds, 2 disjoint pairs/round, 6 total" && { + _reset_phase4_env + # RAIL_COUNT=0 short-circuits the per-rail Job machinery inside + # pair_runner so the scheduler can be exercised without seeding + # any kubectl mock state. The scheduler still emits the round + # START lines and the aggregation walks zero rails per node. + export PHASE4_RAIL_COUNT="0" + # Input is intentionally OUT OF ORDER to prove sort-stability. + run __phase4_run node-c node-a node-d node-b + assert_status 0 + assert_stdout_contains "sorted input (4 nodes): node-a node-b node-c node-d" + assert_stdout_contains "schedule: rounds=3 total_pairs=6 unpaired=" + # All 4 nodes pass-labeled (rail loop is a no-op). + for n in node-a node-b node-c node-d; do + assert_kubectl_call \ + "label node ${n} amd.com/rail-bandwidth=passed --overwrite" + done + # The unpaired annotation must NOT appear on the even path. + if grep -F "amd.com/rail-bandwidth-unpaired=true" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "even-count input must not write unpaired annotation: +$(grep unpaired "$KUBECTL_CALLS_FILE")" + fi + # Full-mesh property: every C(4,2)=6 unordered pair appears + # exactly once across the 3 rounds, with no node repeated within + # a round. + sched=$(_phase4_extract_schedule) + _phase4_assert_full_mesh "$sched" 6 node-a node-b node-c node-d +} + +# ------------------------------------------------------------------- +# 6. full-mesh round-robin ODD [a,b,c,d,e] -> 5 rounds; +# one node sits out per round (bye slot); 10 pairs total = C(5,2). +# No node is permanently unpaired -- the per-N=1 "unpaired" +# annotation must not appear. [TP TC2] +# ------------------------------------------------------------------- + +it "pairing full-mesh odd: [a,b,c,d,e] -> 5 rounds, one node sits out, 10 total" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="0" + run __phase4_run node-a node-b node-c node-d node-e + assert_status 0 + assert_stdout_contains "schedule: rounds=5 total_pairs=10 unpaired=" + # All 5 nodes pass-labeled (rail loop is a no-op). + for n in node-a node-b node-c node-d node-e; do + assert_kubectl_call \ + "label node ${n} amd.com/rail-bandwidth=passed --overwrite" + done + # Odd N must NOT trigger the unpaired annotation under full mesh. + if grep -F "amd.com/rail-bandwidth-unpaired=true" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "odd-count input must not write unpaired annotation under full mesh: +$(grep unpaired "$KUBECTL_CALLS_FILE")" + fi + sched=$(_phase4_extract_schedule) + _phase4_assert_full_mesh "$sched" 10 \ + node-a node-b node-c node-d node-e +} + +# ------------------------------------------------------------------- +# 7. Single-node input -> all unpaired path; node pass-labeled with +# unpaired=true annotation. No pair_runners forked, no Jobs. +# [TP TC9] +# ------------------------------------------------------------------- + +it "single-node input -> unpaired pass-label, no Jobs created" && { + _reset_phase4_env + run __phase4_run node-solo + assert_status 0 + assert_kubectl_call \ + "label node node-solo amd.com/rail-bandwidth=passed --overwrite" + assert_kubectl_call \ + "annotate node node-solo amd.com/rail-bandwidth-unpaired=true --overwrite" + if grep -E "^apply( |$)" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "single-node path must not submit any Job: +$(grep -E '^apply( |$)' "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "schedule: rounds=0 total_pairs=0 unpaired=node-solo" +} + +# ------------------------------------------------------------------- +# 8. Per-rail annotations written on the all-pass single-pair path. +# Both nodes get `rail-{N}=` annotations for every rail and a +# `peer=` diagnostic annotation. Both nodes labeled +# passed; no failed-rails annotation appears. [TP TC3] +# ------------------------------------------------------------------- + +it "single pair all rails pass -> both nodes passed + per-(rail,round) annotations" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="8" + # N=2 yields a 1-round full-mesh schedule with the single pair + # (node-a, node-b). Seed all rails passing. + _seed_pair_all_pass "node-a" "node-b" 8 "ib-write-bw-pass.log" + run __phase4_run node-a node-b + assert_status 0 + # Both nodes pass-labeled. + assert_kubectl_call \ + "label node node-a amd.com/rail-bandwidth=passed --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/rail-bandwidth=passed --overwrite" + # per-(rail, round) annotations carry both BW and peer + # in the value. Spot-check rail 0 and rail 7 of round 0. BW value + # comes from the ib-write-bw-pass.log fixture (388.42). + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-rail-0-round-0=388.42/peer=node-b --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-rail-7-round-0=388.42/peer=node-b --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/rail-bandwidth-rail-0-round-0=388.42/peer=node-a --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/rail-bandwidth-rail-7-round-0=388.42/peer=node-a --overwrite" + # Failed-rails / triangulation annotations must NOT appear on + # the all-pass path. + if grep -F "amd.com/rail-bandwidth-failed-rails" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-pass path must not write failed-rails annotation: +$(grep failed-rails "$KUBECTL_CALLS_FILE")" + fi + if grep -F "amd.com/rail-bandwidth-triangulation" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-pass path must not write triangulation annotation: +$(grep triangulation "$KUBECTL_CALLS_FILE")" + fi + if grep -F "amd.com/rail-bandwidth=failed" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "all-pass path must not write the failed label: +$(grep failed "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "pass=2 fail=0" +} + +# ------------------------------------------------------------------- +# 9. Single-rail-fail: rails 0.4,6,7 pass; rail 5 returns a BW value +# BELOW the 380 Gbps threshold. Both nodes labeled =failed and the +# failed-rails annotation pins out rail 5 specifically. Per-rail BW +# annotation for rail 5 is still written (preserving the measured +# value for diagnostics). [TP TC5] +# ------------------------------------------------------------------- + +it "single rail fail (rail 5 below threshold) -> failed-rails=5, triangulation, BW preserved" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="8" + # N=2 -> 1 round. Pass-seed every rail first, then overwrite rail + # 5's client log with the below-threshold fixture. (Both seed + # calls add a NEW pod-log entry; the get loop in the mock takes + # the LAST matching line, so the second seed wins.) + _seed_pair_all_pass "node-a" "node-b" 8 "ib-write-bw-pass.log" + _seed_pair_one_rail "node-a" "node-b" 5 "Complete" \ + "ib-write-bw-below-threshold.log" + run __phase4_run node-a node-b + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/rail-bandwidth=failed --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/rail-bandwidth=failed --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-failed-rails=5 --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/rail-bandwidth-failed-rails=5 --overwrite" + # Rail 5 round-0 BW value preserved with peer identifier. + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-rail-5-round-0=180.50/peer=node-b --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/rail-bandwidth-rail-5-round-0=180.50/peer=node-a --overwrite" + # triangulation annotation pins out the failing + # (peer, rail, round) measurement on each side. + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-triangulation=peer=node-b/rail=5/round=0 --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/rail-bandwidth-triangulation=peer=node-a/rail=5/round=0 --overwrite" + # failure-reason annotation written by label_phase_failed records + # the failed-rails CSV. + assert_kubectl_call_contains \ + "amd.com/rail-bandwidth-failure-reason=failed-rails:5" +} + +# ------------------------------------------------------------------- +# 10. All rails fail on one pair: every rail's BW is below threshold. +# failed-rails annotation is the full 0,1,2,3,4,5,6,7 CSV. +# [TP TC6] +# ------------------------------------------------------------------- + +it "all rails fail on one pair -> failed-rails=0,1,2,3,4,5,6,7" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="8" + # N=2 -> 1 round (round 0). All 8 rails seeded below threshold. + _seed_pair_all_pass "node-a" "node-b" 8 "ib-write-bw-below-threshold.log" + # Mark every client Job Failed=True instead of Complete=True so we + # also cover the Failed-branch parse path. _seed_pair_all_pass + # already seeded Complete=True; we layer Failed=True on top, and + # the driver's lookup-last-line semantics means Failed=True wins. + # NB: `local` is invalid at file scope; use plain assignments. + for (( rr=0; rr < 8; rr++ )); do + cjob_all=$(_phase4_expected_job_name "client" "node-b" "$rr" "0") + kubectl_mock_set_job_condition "$cjob_all" "Failed" "True" + done + run __phase4_run node-a node-b + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/rail-bandwidth=failed --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-failed-rails=0,1,2,3,4,5,6,7 --overwrite" + assert_kubectl_call \ + "label node node-b amd.com/rail-bandwidth=failed --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/rail-bandwidth-failed-rails=0,1,2,3,4,5,6,7 --overwrite" + # per-(rail, round) BW value preserved on every rail + # with peer identifier (180.50 from fixture, round 0). + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-rail-0-round-0=180.50/peer=node-b --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-rail-7-round-0=180.50/peer=node-b --overwrite" + # triangulation lists every failing (peer, rail, round) + # measurement -- in this case 8 entries on each side, all round 0. + assert_kubectl_call_contains \ + "amd.com/rail-bandwidth-triangulation=peer=node-b/rail=0/round=0,peer=node-b/rail=1/round=0" +} + +# ------------------------------------------------------------------- +# 11. ib-write-bw-crashed: client Job Failed=True + log has no BW line +# -> rail recorded reason=ib-write-bw-crashed. [TP TC7] +# ------------------------------------------------------------------- + +it "client Failed + no BW line -> reason=ib-write-bw-crashed" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="1" + _seed_pair_one_rail "node-a" "node-b" 0 "Failed" \ + "ib-write-bw-crashed.log" + run __phase4_run node-a node-b + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/rail-bandwidth=failed --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-failed-rails=0 --overwrite" + # The driver's record-reason path stores "ib-write-bw-crashed" on + # disk; the failure-reason annotation surfaces failed-rails:0 (the + # driver does NOT propagate the per-rail reason into the + # composite failure-reason -- it only lists the failed rail + # indexes). Spot-check the log line that proves the classification. + assert_stdout_contains "reason=ib-write-bw-crashed" +} + +# ------------------------------------------------------------------- +# 12. parse-failure: Complete=True but log has no "BW average" line. +# -> reason=parse-failed. [TP TC8] +# ------------------------------------------------------------------- + +it "client Complete but empty log -> reason=parse-failed" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="1" + _seed_pair_one_rail "node-a" "node-b" 0 "Complete" \ + "ib-write-bw-empty.log" + run __phase4_run node-a node-b + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/rail-bandwidth=failed --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-failed-rails=0 --overwrite" + assert_stdout_contains "reason=parse-failed" +} + +# ------------------------------------------------------------------- +# 13. server-pod-unready timeout: NO server pod IP seeded -> driver +# times out the pod-IP wait. With PHASE4_PAIR_WAIT_TIME=0 the +# wait loop's `elapsed >= timeout` check fires on the first +# iteration. The driver distinguishes nad-missing (no pod ever +# created) from peer-pod-unready (pod created but no IP); since +# we seed pod-for-job but NOT pod-ip, the pod-count probe +# returns 1 -> reason=peer-pod-unready. [TP TC12] +# ------------------------------------------------------------------- + +it "server pod IP never set + PHASE4_PAIR_WAIT_TIME=0 -> reason=peer-pod-unready" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="1" + export PHASE4_PAIR_WAIT_TIME="0" + # Seed pod-for-job so the timeout-time pod-count probe sees 1 pod + # (so the classifier returns peer-pod-unready, not nad-missing). + # Do NOT seed pod-ip-for-job -- that's what triggers the timeout. + # NB: `local` is invalid at file scope; use plain assignments. + sjob=$(_phase4_expected_job_name "server" "node-a" "0") + kubectl_mock_set_pod_for_job "$sjob" "pod-${sjob}" + run __phase4_run node-a node-b + assert_status 0 + assert_kubectl_call \ + "label node node-a amd.com/rail-bandwidth=failed --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-failed-rails=0 --overwrite" + # The "server pod IP wait failed rc=X reason=Y" log line is emitted + # to stderr by pair_runner (driver line 2317). Match on stderr. + assert_stderr_contains "reason=peer-pod-unready" + # Server Job must be cleaned up on the timeout path. + assert_kubectl_call \ + "delete job ${sjob} --ignore-not-found=true --wait=false" +} + +# Variant: NO pod seeded at all -> pod-count probe returns 0 -> +# classifier returns nad-missing. +it "server pod never created + PHASE4_PAIR_WAIT_TIME=0 -> reason=nad-missing" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="1" + export PHASE4_PAIR_WAIT_TIME="0" + # Seed nothing for the server Job -- the get-pods listing returns + # zero lines -> pod_count=0 -> classifier returns nad-missing. + run __phase4_run node-a node-b + assert_status 0 + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-failed-rails=0 --overwrite" + # Same as above -- the reason string is on stderr. + assert_stderr_contains "reason=nad-missing" +} + +# ------------------------------------------------------------------- +# 14. rail-count override: PHASE4_RAIL_COUNT=4 -> only rails 0-3 +# are tested; rails 4-7 are NOT in any annotation. [TP TC11] +# ------------------------------------------------------------------- + +it "PHASE4_RAIL_COUNT=4 -> rails 0-3 annotated; rails 4-7 absent" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="4" + _seed_pair_all_pass "node-a" "node-b" 4 "ib-write-bw-pass.log" + run __phase4_run node-a node-b + assert_status 0 + # rails 0-3 annotated for round 0 (N=2 -> 1 round). + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-rail-0-round-0=388.42/peer=node-b --overwrite" + assert_kubectl_call \ + "annotate node node-a amd.com/rail-bandwidth-rail-3-round-0=388.42/peer=node-b --overwrite" + # Rails 4-7 must NOT be annotated (any round). + for rail in 4 5 6 7; do + if grep -F "amd.com/rail-bandwidth-rail-${rail}-" \ + "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "RAIL_COUNT=4 leaked rail-${rail} annotation: +$(grep "rail-${rail}-" "$KUBECTL_CALLS_FILE")" + fi + done +} + +# ------------------------------------------------------------------- +# 15. Concurrency cap honored: 16-node input -> 8 pairs. With +# PHASE4_MAX_CONCURRENT_PAIRS=8, all 8 pair_runners can run in +# parallel (no slot-wait sleep iterations recorded). We can't +# easily observe peak concurrency without instrumentation, but +# we CAN verify (a) the driver logs all 8 pair forks, (b) the +# final pass/fail aggregation accounts for all 16 nodes, and +# (c) the unpaired counter is 0. [TP TC15] +# ------------------------------------------------------------------- + +it "8 nodes (full mesh, cap=4) -> 28 pairs across 7 rounds, all 8 nodes labeled" && { + _reset_phase4_env + # full mesh on N=8 produces 7 rounds with 4 disjoint + # pairs/round = 28 pairs total. With PHASE4_MAX_CONCURRENT_PAIRS=4 + # all 4 pairs in each round can dispatch in parallel. We exercise + # the scheduler alone (PHASE4_RAIL_COUNT=0) so we do not have to + # seed 28 * 0 mock Jobs; the schedule, dispatch, and aggregation + # paths still all run. + export PHASE4_RAIL_COUNT="0" + export PHASE4_MAX_CONCURRENT_PAIRS="4" + nodes8="" + for i8 in 01 02 03 04 05 06 07 08; do + nodes8="$nodes8 node-${i8}" + done + # shellcheck disable=SC2086 + run __phase4_run $nodes8 + assert_status 0 + assert_stdout_contains "schedule: rounds=7 total_pairs=28 unpaired=" + assert_stdout_contains "dispatching pair_runners (cap=4, rounds=7)" + # All 28 pair forks logged (monotonic global counter). + for pair_idx in 0 1 2 3 27; do + assert_stdout_contains "forking pair #${pair_idx}" + done + # All 7 rounds emitted a START line. + for ri in 0 1 2 3 4 5 6; do + assert_stdout_contains "round ${ri} START:" + assert_stdout_contains "round ${ri} DONE" + done + assert_stdout_contains "pass=8 fail=0" + # Full-mesh property verified. + sched=$(_phase4_extract_schedule) + _phase4_assert_full_mesh "$sched" 28 \ + node-01 node-02 node-03 node-04 \ + node-05 node-06 node-07 node-08 +} + +# Defensive guard: PHASE4_MAX_CONCURRENT_PAIRS=0 is invalid; the driver +# promotes to 1 (serial) with a logged warning rather than deadlocking. +it "PHASE4_MAX_CONCURRENT_PAIRS=0 is promoted to 1 with a warning" && { + _reset_phase4_env + # PHASE4_RAIL_COUNT=0 exercises the scheduler / cap + # promotion logic without per-rail kubectl mocking. + export PHASE4_RAIL_COUNT="0" + export PHASE4_MAX_CONCURRENT_PAIRS="0" + run __phase4_run node-a node-b + assert_status 0 + assert_stderr_contains "PHASE4_MAX_CONCURRENT_PAIRS=0 invalid -- promoting to 1" + assert_stdout_contains "dispatching pair_runners (cap=1" +} + +# ------------------------------------------------------------------- +# 16. PHASE_NODES env-var fallback: when positional args are empty +# but PHASE_NODES is exported, the driver uses that list. +# ------------------------------------------------------------------- + +it "PHASE_NODES env var is used when no positional args are given" && { + _reset_phase4_env + export PHASE_NODES="node-env-only" + run __phase4_run # NB: no positional args + assert_status 0 + # Single-node fallback -> unpaired pass-label. + assert_kubectl_call \ + "label node node-env-only amd.com/rail-bandwidth=passed --overwrite" + assert_kubectl_call \ + "annotate node node-env-only amd.com/rail-bandwidth-unpaired=true --overwrite" +} + +# ------------------------------------------------------------------- +# Full-mesh pair-generator tests. +# +# These tests exercise the circle-algorithm scheduler emitted by +# PHASE4_DRIVER_SCRIPT for a range of N values (even and odd). For +# each N we verify: +# (a) the correct number of rounds is emitted +# (b) the correct total pair count = C(N,2) +# (c) within every round, no node appears in more than one pair +# (disjointness property) +# (d) the union over all rounds equals the C(N,2) pair set +# (full-mesh coverage) +# +# These run with PHASE4_RAIL_COUNT=0 so the rail loop inside +# pair_runner is a no-op -- no kubectl Job mocks needed. The schedule +# generation, round-by-round dispatch logging, and per-node +# aggregation paths are all exercised. +# ------------------------------------------------------------------- + +_phase4_mesh_test() { + local n="$1" + shift + local -a nodes=("$@") + local expected=$(( n * (n - 1) / 2 )) + # Circle-algorithm round count: N-1 for even N (circle of size N); + # N for odd N (circle of size N+1 with a bye slot). Both produce + # a full mesh (union covers C(N,2)). + local expected_rounds + if (( n % 2 == 0 )); then + expected_rounds=$(( n - 1 )) + else + expected_rounds="$n" + fi + _reset_phase4_env + export PHASE4_RAIL_COUNT="0" + # shellcheck disable=SC2086 + run __phase4_run "${nodes[@]}" + assert_status 0 + assert_stdout_contains "schedule: rounds=${expected_rounds} total_pairs=${expected} unpaired=" + sched=$(_phase4_extract_schedule) + _phase4_assert_full_mesh "$sched" "$expected" "${nodes[@]}" +} + +it "mesh-schedule N=2 -> 1 round, 1 pair = (a,b)" && { + _phase4_mesh_test 2 node-a node-b +} + +it "mesh-schedule N=4 -> 3 rounds, 2 disjoint pairs/round, 6 total" && { + _phase4_mesh_test 4 node-a node-b node-c node-d +} + +it "mesh-schedule N=5 (odd) -> 5 rounds (N), one node sits out per round, 10 total" && { + # Circle algorithm: odd N needs N rounds (circle_size=N+1 with + # bye), not N-1. (N-1)*(N-1)/2 = 8 pair-slots, which is short of + # C(5,2)=10; N rounds * (N-1)/2 pairs = 10 = C(5,2). + _phase4_mesh_test 5 node-a node-b node-c node-d node-e +} + +it "mesh-schedule N=6 -> 5 rounds, 3 disjoint pairs/round, 15 total" && { + _phase4_mesh_test 6 node-a node-b node-c node-d node-e node-f +} + +it "mesh-schedule N=7 (odd) -> 7 rounds (N), one node sits out per round, 21 total" && { + # See N=5 note: odd N requires N (not N-1) rounds with the + # circle/bye algorithm to cover C(N,2) pairs. + _phase4_mesh_test 7 node-a node-b node-c node-d node-e node-f node-g +} + +it "mesh-schedule N=8 -> 7 rounds, 4 disjoint pairs/round, 28 total" && { + _phase4_mesh_test 8 node-a node-b node-c node-d \ + node-e node-f node-g node-h +} + +# Disjointness property: for odd N=5 specifically, verify that each +# round names exactly 2 real pairs (one node sits out), and the +# sit-out node rotates through all five nodes across the five rounds. +it "mesh-schedule N=5 odd: each node sits out exactly once across the 5 rounds" && { + _reset_phase4_env + export PHASE4_RAIL_COUNT="0" + run __phase4_run node-a node-b node-c node-d node-e + assert_status 0 + # Each round must include exactly 2 pairs (4 nodes), leaving 1 node out. + # We tally bench presences: any node missing from a round is sitting out. + sched=$(_phase4_extract_schedule) + unset sitouts + declare -A sitouts=() + while IFS=$'\t' read -r round_idx round_pairs; do + [[ -z "$round_idx" ]] && continue + unset present + declare -A present=() + for p in $round_pairs; do + a="${p%,*}"; b="${p#*,}" + present[$a]=1 + present[$b]=1 + done + sitout_this_round="" + for n in node-a node-b node-c node-d node-e; do + if [[ -z "${present[$n]:-}" ]]; then + # The driver schedule for N=5 must leave exactly one + # node out per round. + if [[ -n "$sitout_this_round" ]]; then + _assert_fail "round ${round_idx} leaves more than one node out: ${sitout_this_round} and ${n}" + fi + sitout_this_round="$n" + sitouts[$n]=$(( ${sitouts[$n]:-0} + 1 )) + fi + done + done <<< "$sched" + # Every node must sit out exactly once. + for n in node-a node-b node-c node-d node-e; do + if [[ "${sitouts[$n]:-0}" -ne 1 ]]; then + _assert_fail "node ${n} sits out ${sitouts[$n]:-0} time(s); expected 1" + fi + done +} + +assert_summary diff --git a/example/gpu-validation-cluster/tests/test_phase4_5.sh b/example/gpu-validation-cluster/tests/test_phase4_5.sh new file mode 100755 index 000000000..dda445b3c --- /dev/null +++ b/example/gpu-validation-cluster/tests/test_phase4_5.sh @@ -0,0 +1,534 @@ +#!/bin/bash +# Unit tests for PHASE45_PREFLIGHT_SCRIPT (.* body) against +# the mocked kubectl harness. +# +# Scope (from the design doc §7 +# "Testing Strategy" + test plan): +# * SSH-mesh pair-iteration: all-pass / single-pair-fail / all-fail +# -- covers test-plan TC2 (all-checks-pass) and TC4 +# (single-pair-ssh-fail) for the SSH mesh branch. +# * WORKER_REPLICAS=1 degenerate case: self-pair only, no divide-by- +# zero, no hang -- covers test-plan TC8. +# * DNS forward-miss fixture -- covers test-plan TC5. +# * MPI spawn fail -- covers test-plan TC6. +# * RCCL topology hard-fail (non-zero, non-124 exit) and soft-fail +# (124 timeout) -- covers test-plan TC7 (rccl-topology-timeout) +# and the hard-fail companion path from design §4 verdict block. +# * Annotation classification: multi-class union, comma-joined +# (`ssh-mesh,dns,mpi-spawn,rccl-topology`) annotated on every +# participating node -- covers test-plan TC9 +# (annotation-includes-all-failed-classes). +# * Verdict-block contracts: +# - happy path: zero annotate calls, exit 0 +# - hard-failed: exit 1, annotate ran with the comma-joined +# classes, --overwrite present on every annotate +# - soft-only (rccl_topo_timeout alone): annotate ran with +# "rccl-topology", exit 0 (design §6 carve-out) +# +# How PHASE45_PREFLIGHT_SCRIPT is exercised (mirrors test_phase4.sh +# conventions): +# +# The script body is a block-scalar inside cluster-validation-config.yaml. +# We extract it with lib/extract_script.sh, then wrap it in a +# function `__phase45_run` so the test can drive it under `set -u` +# without polluting the harness's globals (the body itself runs +# under `set -euo pipefail`). +# +# `kubectl` is the mock from lib/kubectl_mock.sh: +# * `kubectl get mpijob .` -- seeded with kubectl_mock_set_mpijob_names. +# * `kubectl get pods -n NS -l -o jsonpath=.` -- seeded +# with kubectl_mock_set_pod_names / set_pod_ips / set_node_names +# (Phase 4.5 uses Kubeflow training labels, NOT job-name=, so +# it takes the dedicated phase45-* state routes added for this +# change rather than the job-name=-selector early routes). +# * `kubectl wait .` -- default pass; can be failed via +# kubectl_mock_fail wait . +# * `kubectl exec .` -- per-call response queue via +# kubectl_mock_queue_exec. Each test calls _queue_*_path +# helpers (defined below) to enqueue exactly the responses the +# script will consume in order. +# * `kubectl annotate node .` -- recorded; pass-through. +# +# We also shim host-side `timeout` (used to wrap the RCCL probe) +# as a no-op pass-through (`timeout 60 kubectl exec .` -> just +# exec the inner command), because the mock kubectl already returns +# the desired exit code (including 124 for the soft-fail timeout +# simulation) on its own. + +set -uo pipefail + +TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd -- "${TEST_DIR}/../../.." && pwd) +CONFIGMAP="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-config.yaml" +FIXTURES_DIR="${TEST_DIR}/fixtures/phase4_5" + +# shellcheck source=./lib/assert.sh +source "${TEST_DIR}/lib/assert.sh" +# shellcheck source=./lib/kubectl_mock.sh +source "${TEST_DIR}/lib/kubectl_mock.sh" +# shellcheck source=./lib/extract_script.sh +source "${TEST_DIR}/lib/extract_script.sh" + +echo "================================================================" +echo " test_phase4_5.sh" +echo " ConfigMap: ${CONFIGMAP}" +echo " Fixtures: ${FIXTURES_DIR}" +echo "================================================================" + +# --- one-time setup ------------------------------------------------- + +PHASE45_DIR=$(mktemp -d -t phase4_5-tests-XXXXXX) +SHIM_DIR="${PHASE45_DIR}/shims" +PHASE45_BODY="${PHASE45_DIR}/phase4_5-body.sh" +mkdir -p "$SHIM_DIR" + +trap 'rm -rf "$PHASE45_DIR"; kubectl_mock_cleanup' EXIT + +# Shims: +# * `timeout` -- strip the leading duration arg and exec the rest. +# The mock kubectl directly returns 124 (or anything) +# when the test wants to drive the rccl_topo_timeout +# branch. +# * `sleep` -- no-op; the SSH-readiness loop inside `bash -c` is +# never reached because that whole exec is mocked, +# but defensive-shim anyway in case any subshell +# evaluation does reach a real sleep. +cat >"${SHIM_DIR}/timeout" <<'EOF' +#!/bin/bash +# Strip the duration arg (and an optional --foreground / -k form) +# and exec the rest. The Phase 4.5 caller is: +# timeout 60 kubectl exec -n NS POD -- bash -c '.' +# i.e. a plain `timeout `, so the simple shift is +# sufficient. +shift || true +exec "$@" +EOF +chmod +x "${SHIM_DIR}/timeout" + +cat >"${SHIM_DIR}/sleep" <<'EOF' +#!/bin/bash +exit 0 +EOF +chmod +x "${SHIM_DIR}/sleep" + +# Extract PHASE45_PREFLIGHT_SCRIPT and wrap in a function. The body +# uses no `local`, but wrapping in a function keeps its globals +# (ssh_mesh_failed, dns_failed, etc.) out of the harness scope so +# back-to-back `it` blocks don't leak state across tests. +RAW_PHASE45=$(extract_configmap_data "$CONFIGMAP" "PHASE45_PREFLIGHT_SCRIPT") +if [[ -z "$RAW_PHASE45" ]]; then + echo "FATAL: PHASE45_PREFLIGHT_SCRIPT extraction produced empty output" >&2 + exit 1 +fi + +{ + printf '__phase45_run() {\n' + printf '%s\n' "$RAW_PHASE45" + printf '}\n' +} > "$PHASE45_BODY" + +if ! bash -n "$PHASE45_BODY"; then + echo "FATAL: patched PHASE45_PREFLIGHT_SCRIPT has bash syntax errors" >&2 + exit 1 +fi + +kubectl_mock_init + +# Shim dir goes after the mock kubectl so `kubectl` still resolves +# to the mock; `timeout` and `sleep` come from our shims. +export PATH="${SHIM_DIR}:${PATH}" + +# shellcheck disable=SC1090 +source "$PHASE45_BODY" + +if ! declare -F __phase45_run >/dev/null; then + echo "FATAL: __phase45_run not defined after sourcing" >&2 + exit 1 +fi + +# Suppress -u so tests that intentionally exercise optional env can +# leave knobs unset. +set +u + +# --- per-test helpers ----------------------------------------------- + +# _reset_phase45_env +# Wipe mock state and re-export the baseline env PHASE45_PREFLIGHT_SCRIPT +# reads. The defaults below mirror what the launcher init-container +# manifest passes in production (cluster-validation-job.yaml). +_reset_phase45_env() { + kubectl_mock_reset + export WORKER_REPLICAS="2" + export WAIT_FOR_WORKERS="true" + export ENABLE_SSH_CHECK="true" + export WORKER_READY_TIMEOUT="300" + export SSH_CHECK_TIMEOUT="60" + export SSH_CHECK_INTERVAL="5" + export JOB_NAME="cluster-validation-mpi" + export PERF_TEST_DIR="/opt/rccl-tests/build" + # Mirror the production discovery: 1 MPIJob revision exists. + kubectl_mock_set_mpijob_names "cluster-validation-mpi-1" +} + +# _seed_2pod_topology +# Two healthy worker pods on two distinct nodes. Default fabric layout +# used by most tests; individual tests override the pod count with +# _seed_1pod_topology when exercising the degenerate path. +_seed_2pod_topology() { + kubectl_mock_set_pod_names "worker-pod-a worker-pod-b" + kubectl_mock_set_pod_ips "10.42.0.10 10.42.0.11" + kubectl_mock_set_node_names "node-a" "node-b" +} + +_seed_1pod_topology() { + export WORKER_REPLICAS="1" + kubectl_mock_set_pod_names "worker-pod-a" + kubectl_mock_set_pod_ips "10.42.0.10" + kubectl_mock_set_node_names "node-a" +} + +# --- exec-queue helpers --------------------------------------------- +# PHASE45_PREFLIGHT_SCRIPT issues exec calls in a fixed order. These +# helpers enqueue the right number of responses for each phase so the +# test bodies can compose them by intent rather than counting calls. + +# _queue_launcher_ssh_pass / _fail +_queue_launcher_ssh_pass() { kubectl_mock_queue_exec 0; } +_queue_launcher_ssh_fail() { kubectl_mock_queue_exec 1; } + +# _queue_mesh_all_pass +# One exec response per (src, dst_ip) pair. The mesh loop is +# `for src in WORKER_PODS; for dst_ip in WORKER_IPS`. For N pods +# that's N*N pairs. +_queue_mesh_all_pass() { + local n="$1" + local total=$((n * n)) + local i + for (( i = 0; i < total; i++ )); do + kubectl_mock_queue_exec 0 + done +} + +# _queue_mesh_one_fail +# All N*N exec responses succeed except the one at +# (0-based, in row-major (src major) order). +_queue_mesh_one_fail() { + local n="$1" + local fail_idx="$2" + local total=$((n * n)) + local i + for (( i = 0; i < total; i++ )); do + if [[ "$i" -eq "$fail_idx" ]]; then + kubectl_mock_queue_exec 1 + else + kubectl_mock_queue_exec 0 + fi + done +} + +# _queue_mesh_all_fail +_queue_mesh_all_fail() { + local n="$1" + local total=$((n * n)) + local i + for (( i = 0; i < total; i++ )); do + kubectl_mock_queue_exec 1 + done +} + +# _queue_dns_clean / _queue_dns_fwd_miss +# DNS check is one exec that streams DNS::. lines for any +# miss; clean runs produce no output. The script captures stdout into +# dns_misses[] and uses the array length as the verdict. +_queue_dns_clean() { kubectl_mock_queue_exec 0; } +_queue_dns_fwd_miss() { + kubectl_mock_queue_exec 0 "$(cat "${FIXTURES_DIR}/dns-fwd-miss.txt")" +} + +# _queue_mpi_pass / _queue_mpi_fail +_queue_mpi_pass() { kubectl_mock_queue_exec 0; } +_queue_mpi_fail() { kubectl_mock_queue_exec 1; } + +# _queue_rccl_pass / _queue_rccl_hard_fail / _queue_rccl_timeout +# RCCL is one exec wrapped by host-side `timeout 60 .`. The shim +# strips the duration and execs the inner command; the mock kubectl +# returns whatever exit code we enqueue. Exit 124 -> soft-fail +# (rccl_topo_timeout=true); any other non-zero -> hard-fail +# (rccl_topo_failed=true). +_queue_rccl_pass() { + kubectl_mock_queue_exec 0 "$(cat "${FIXTURES_DIR}/rccl-pass.txt")" +} +_queue_rccl_hard_fail() { kubectl_mock_queue_exec 1; } +_queue_rccl_timeout() { kubectl_mock_queue_exec 124; } + +# --- all-pass scaffold --------------------------------------------- +# Every test starts from a healthy 2-pod cluster and overrides one +# phase's exec response to inject the failure under test. This helper +# enqueues the full "all pass" sequence; tests that want to mutate +# one phase reset the queue (via _reset_phase45_env -> kubectl_mock_reset) +# and re-enqueue piece by piece. +_queue_all_pass_2pod() { + _queue_launcher_ssh_pass # 1 call + _queue_mesh_all_pass 2 # 4 calls + _queue_dns_clean # 1 call + _queue_mpi_pass # 1 call + _queue_rccl_pass # 1 call +} + +_queue_all_pass_1pod() { + _queue_launcher_ssh_pass # 1 call + _queue_mesh_all_pass 1 # 1 call + _queue_dns_clean # 1 call + _queue_mpi_pass # 1 call + _queue_rccl_pass # 1 call +} + +# --- assertion helpers ---------------------------------------------- + +# _assert_no_annotate +# Verify zero `kubectl annotate .` calls were recorded. +_assert_no_annotate() { + local path="${KUBECTL_CALLS_FILE:-}" + if grep -q '^annotate ' "$path"; then + _assert_fail "expected no annotate calls; got: +$(grep '^annotate ' "$path")" + fi +} + +# _assert_annotate_classes +# Verify EVERY recorded `annotate node` call carries the exact +# `amd.com/phase4_5-failure-reason=` value AND --overwrite. +# Also verify at least one such call exists. +_assert_annotate_classes() { + local reason="$1" + local path="${KUBECTL_CALLS_FILE:-}" + local match_line="amd.com/phase4_5-failure-reason=${reason}" + if ! grep -F -- "$match_line" "$path" >/dev/null; then + _assert_fail "expected annotate call with [${match_line}], got: +$(grep '^annotate ' "$path" || echo '')" + return + fi + # Every annotate call must use --overwrite (design §4 verdict + # block: annotations replace stale values from prior failed runs). + while IFS= read -r line; do + if [[ "$line" == "annotate "* && "$line" != *"--overwrite"* ]]; then + _assert_fail "annotate call missing --overwrite: ${line}" + fi + done <"$path" +} + +# _assert_annotated_nodes . +# Verify each named node received an annotate call. +_assert_annotated_nodes() { + local path="${KUBECTL_CALLS_FILE:-}" + local n + for n in "$@"; do + if ! grep -E "^annotate node ${n} " "$path" >/dev/null; then + _assert_fail "expected annotate for node ${n}; got: +$(grep '^annotate ' "$path" || echo '')" + fi + done +} + +# ========================================================== +# TESTS +# ========================================================== + +# --- TC2 (all-checks-pass) for 2 pods ------------------------------- +it "all checks pass on a healthy 2-pod cluster -> exit 0, no annotate" && { + _reset_phase45_env + _seed_2pod_topology + _queue_all_pass_2pod + run __phase45_run + assert_status 0 + _assert_no_annotate + assert_stdout_contains "Phase 4.5 pre-flight passed: ssh-mesh, dns, mpi-spawn, rccl-topology all OK." +} + +# --- TC4 (single-pair-ssh-fail) -- mesh loop continues, annotates --- +it "single SSH mesh pair fails -> ssh_mesh_failed=true, classes=ssh-mesh, exit 1" && { + _reset_phase45_env + _seed_2pod_topology + _queue_launcher_ssh_pass # 1 + # 4 mesh probes; fail the 2nd pair (worker-pod-a -> 10.42.0.11) + _queue_mesh_one_fail 2 1 # 4 + _queue_dns_clean # 1 + _queue_mpi_pass # 1 + _queue_rccl_pass # 1 + run __phase45_run + assert_status 1 + assert_stdout_contains "WARN: N*N SSH mesh check failed for 1 pair(s)" + assert_stdout_contains "worker-pod-a->10.42.0.11" + assert_stdout_contains "FATAL: Phase 4.5 pre-flight failed: ssh-mesh" + _assert_annotate_classes "ssh-mesh" + _assert_annotated_nodes node-a node-b +} + +# --- mesh all-fail: all pairs broken -> still single class --------- +it "all SSH mesh pairs fail -> failed_pairs=4, classes=ssh-mesh, exit 1" && { + _reset_phase45_env + _seed_2pod_topology + _queue_launcher_ssh_pass # 1 + _queue_mesh_all_fail 2 # 4 + _queue_dns_clean # 1 + _queue_mpi_pass # 1 + _queue_rccl_pass # 1 + run __phase45_run + assert_status 1 + assert_stdout_contains "WARN: N*N SSH mesh check failed for 4 pair(s)" + _assert_annotate_classes "ssh-mesh" +} + +# --- TC8 (worker-replicas-1) -- degenerate self-pair -------------- +it "WORKER_REPLICAS=1 self-pair only -> no divide-by-zero, exit 0" && { + _reset_phase45_env + _seed_1pod_topology + _queue_all_pass_1pod + run __phase45_run + assert_status 0 + # Self-pair (worker-pod-a -> 10.42.0.10) should mesh-OK exactly + # once (1 src * 1 dst). + assert_stdout_contains "__ mesh OK: worker-pod-a -> 10.42.0.10" + assert_stdout_contains "Phase 4.5 pre-flight passed" + _assert_no_annotate +} + +# --- TC5 (dns-fail-fwd) --------------------------------------------- +it "DNS forward miss -> dns_failed=true, classes=dns, exit 1" && { + _reset_phase45_env + _seed_2pod_topology + _queue_launcher_ssh_pass # 1 + _queue_mesh_all_pass 2 # 4 + _queue_dns_fwd_miss # 1 -- emits DNS:worker-b:fwd=MISS rev=SKIP + _queue_mpi_pass # 1 + _queue_rccl_pass # 1 + run __phase45_run + assert_status 1 + assert_stdout_contains "WARN: DNS check recorded 1 miss(es)" + assert_stdout_contains "DNS:worker-b:fwd=MISS rev=SKIP" + assert_stdout_contains "FATAL: Phase 4.5 pre-flight failed: dns" + _assert_annotate_classes "dns" +} + +# --- TC6 (mpi-spawn-fail) ------------------------------------------- +it "mpirun --hostfile no-op fails -> mpi_spawn_failed=true, classes=mpi-spawn, exit 1" && { + _reset_phase45_env + _seed_2pod_topology + _queue_launcher_ssh_pass # 1 + _queue_mesh_all_pass 2 # 4 + _queue_dns_clean # 1 + _queue_mpi_fail # 1 + _queue_rccl_pass # 1 + run __phase45_run + assert_status 1 + assert_stdout_contains "WARN: mpirun --hostfile no-op spawn failed (mpi_spawn_failed=true)" + assert_stdout_contains "FATAL: Phase 4.5 pre-flight failed: mpi-spawn" + _assert_annotate_classes "mpi-spawn" +} + +# --- RCCL hard-fail (non-timeout non-zero) -------------------------- +it "RCCL probe non-zero exit -> rccl_topo_failed=true, classes=rccl-topology, exit 1" && { + _reset_phase45_env + _seed_2pod_topology + _queue_launcher_ssh_pass # 1 + _queue_mesh_all_pass 2 # 4 + _queue_dns_clean # 1 + _queue_mpi_pass # 1 + _queue_rccl_hard_fail # 1 + run __phase45_run + assert_status 1 + assert_stdout_contains "WARN: RCCL topology probe failed with exit 1 (rccl_topo_failed=true)" + assert_stdout_contains "FATAL: Phase 4.5 pre-flight failed: rccl-topology" + _assert_annotate_classes "rccl-topology" +} + +# --- TC7 (rccl-topology-timeout) -- soft-fail per design §6 --------- +it "RCCL probe times out (exit 124) -> annotate rccl-topology, exit 0" && { + _reset_phase45_env + _seed_2pod_topology + _queue_launcher_ssh_pass # 1 + _queue_mesh_all_pass 2 # 4 + _queue_dns_clean # 1 + _queue_mpi_pass # 1 + _queue_rccl_timeout # 1 -- exit 124 + run __phase45_run + # Soft-fail: annotate but do NOT abort. Exit must be 0 so the + # launcher init-container proceeds and Phase 5 is allowed to + # exercise the warm-cache path (design §6). + assert_status 0 + assert_stdout_contains "WARN: RCCL topology probe timed out after 60s (rccl_topo_timeout=true)" + assert_stdout_contains "WARN: Phase 4.5 pre-flight soft-failed: rccl-topology" + assert_stdout_contains "Phase 4.5 proceeding past soft-fail" + _assert_annotate_classes "rccl-topology" + _assert_annotated_nodes node-a node-b +} + +# --- TC9 (annotation-includes-all-failed-classes) ------------------- +it "all four checks fail -> classes=ssh-mesh,dns,mpi-spawn,rccl-topology, exit 1" && { + _reset_phase45_env + _seed_2pod_topology + _queue_launcher_ssh_pass # 1 + _queue_mesh_all_fail 2 # 4 + _queue_dns_fwd_miss # 1 + _queue_mpi_fail # 1 + _queue_rccl_hard_fail # 1 + run __phase45_run + assert_status 1 + # Order is fixed by the verdict block: + # ssh-mesh, dns, mpi-spawn, rccl-topology + _assert_annotate_classes "ssh-mesh,dns,mpi-spawn,rccl-topology" + assert_stdout_contains "FATAL: Phase 4.5 pre-flight failed: ssh-mesh,dns,mpi-spawn,rccl-topology" + _assert_annotated_nodes node-a node-b +} + +# --- mixed hard + soft: hard wins; reason includes both ------------- +it "hard fail + rccl timeout -> classes include both, exit 1 (hard wins)" && { + _reset_phase45_env + _seed_2pod_topology + _queue_launcher_ssh_pass # 1 + _queue_mesh_one_fail 2 0 # 4 (fail first pair) + _queue_dns_clean # 1 + _queue_mpi_pass # 1 + _queue_rccl_timeout # 1 -- soft-fail + run __phase45_run + # Verdict: hard_failed=true (ssh-mesh) -> exit 1; annotation set + # is union of HARD + SOFT classes, so rccl-topology is included + # via the soft-fail branch. + assert_status 1 + _assert_annotate_classes "ssh-mesh,rccl-topology" + assert_stdout_contains "FATAL: aborting MPIJob via non-zero init-container exit" +} + +# --- ENABLE_SSH_CHECK=false -> the whole pre-flight body is skipped - +it "ENABLE_SSH_CHECK=false skips the entire pre-flight body, exit 0" && { + _reset_phase45_env + _seed_2pod_topology + export ENABLE_SSH_CHECK="false" + # No exec responses needed -- the script's `if [ "$ENABLE_SSH_CHECK" = "true" ]` + # gate skips every kubectl exec when false. + run __phase45_run + assert_status 0 + _assert_no_annotate + # Sanity: the verdict banner should NOT appear when the body was + # skipped wholesale. + assert_stdout_not_contains "Phase 4.5 pre-flight passed" + assert_stdout_not_contains "FATAL: Phase 4.5" +} + +# --- WAIT_FOR_WORKERS=false -> kubectl wait is NOT called ----------- +it "WAIT_FOR_WORKERS=false skips kubectl wait but still runs checks" && { + _reset_phase45_env + _seed_2pod_topology + export WAIT_FOR_WORKERS="false" + _queue_all_pass_2pod + run __phase45_run + assert_status 0 + # No `wait` verb recorded. + if grep -q '^wait ' "$KUBECTL_CALLS_FILE"; then + _assert_fail "expected no kubectl wait call, got: +$(grep '^wait ' "$KUBECTL_CALLS_FILE")" + fi + assert_stdout_contains "Phase 4.5 pre-flight passed" +} + +assert_summary diff --git a/example/gpu-validation-cluster/tests/test_phase5.sh b/example/gpu-validation-cluster/tests/test_phase5.sh new file mode 100755 index 000000000..c9c93a4b0 --- /dev/null +++ b/example/gpu-validation-cluster/tests/test_phase5.sh @@ -0,0 +1,791 @@ +#!/bin/bash +# Unit tests for PHASE5_SCRIPT +# against the mocked kubectl harness. +# +# Scope (from the design doc §7 +# "Testing Strategy" + test plan): +# * refactor-safety baseline: SKIP_RCCL_TEST=false on a healthy +# multi-node input -> passed-label path, helpers called (not raw +# kubectl label), per-worker logs written. Covers test-plan TC1 +# and TC2. +# * dynamic-worker-replicas: input list of 3 nodes -> MPIJob render +# gets WORKER_REPLICAS=3 (not the static config.json max). Covers +# test-plan TC3. +# * skip-flag short-circuit: SKIP_RCCL_TEST=true -> MPIJob NOT +# submitted, all input nodes pass-labeled via helpers, candidate +# label removed. Covers test-plan TC4. +# * per-worker-log-dump: pass path leaves +# ${LOG_DIR}/worker--.log behind for every input node. +# Covers test-plan TC5. +# * mpijob-fails: kubectl wait non-zero -> every input node gets +# failed label via the helper; the candidate-label removal still +# runs. Covers test-plan TC6. +# * per-worker-exit-attribution: one node's worker pod terminated +# with exit 137 -> that node's failure-reason annotation contains +# worker-pod=,exit=137; another node with a different exit +# gets a different reason. Covers test-plan TC7. +# * mpijob-apply-fails: kubectl apply non-zero on the MPIJob render +# -- this leaves no worker pods, so the SUT falls into the +# mpijob-failed labelling loop with `worker-pod=unknown,exit=unknown` +# annotations on every input node. Covers the broader "apply +# failure surfaces to operator" intent of test-plan TC8. +# * sub-min-pool: input of 1 node with PHASE5_MIN_WORKERS=2 -> +# MPIJob NOT submitted, script returns 0, no label changes, +# no kubectl apply/wait/label calls. Covers test-plan TC10. +# * empty-input: empty positional input -> return 0, no MPIJob, +# no labels. Covers test-plan TC11. +# * min-workers-override: PHASE5_MIN_WORKERS=1 + 1-node input -> +# guard passes; MPIJob render and wait happen. Covers +# test-plan TC12. +# * launcher-log-collection: launcher pod log saved to +# ${LOG_DIR}/launcher-.log on both pass and fail paths. +# Covers test-plan TC14. +# +# Out of scope (covered elsewhere or by integration tests): +# * The full RCCL bandwidth-threshold check (TC19) -- requires +# real GPUs. +# * The orchestrator wiring (TC15-17) -- belongs to the +# test_orchestrator_dry_run.sh harness. +# * Performance/bandwidth assertions (TC19) -- multi-node testbed only. +# +# How PHASE5_SCRIPT is exercised (mirrors test_phase2.sh / +# test_phase4_5.sh conventions): +# +# The script body is a block-scalar inside cluster-validation-config.yaml. +# We extract it with lib/extract_script.sh, patch the hardcoded MPIJob +# template path so the test can run as a non-root user, pin the +# timestamp date generation so per-node assertions can match the +# generated job name, then wrap the patched body in a function +# `__phase5_run` so its `local`s + the trailing `run_phase5_main "$@"` +# call execute under a controlled scope. +# +# The PHASE_NODE_LABEL_SCRIPT helper library is sourced first so +# label_phase_passed / label_phase_failed are defined for the +# PHASE5_SCRIPT body to call. +# +# `kubectl` is the mock from lib/kubectl_mock.sh. Phase-5 extends +# the mock with three new state-seed helpers: +# * kubectl_mock_set_phase5_worker_pod_for_node +# * kubectl_mock_set_phase5_launcher_pod +# * kubectl_mock_set_phase5_pod_exit_code +# plus the existing kubectl_mock_fail_sticky for driving +# the apply/wait failure paths. + +set -uo pipefail + +TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd -- "${TEST_DIR}/../../.." && pwd) +CONFIGMAP="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-config.yaml" +FIXTURES_DIR="${TEST_DIR}/fixtures/phase5" + +# shellcheck source=./lib/assert.sh +source "${TEST_DIR}/lib/assert.sh" +# shellcheck source=./lib/kubectl_mock.sh +source "${TEST_DIR}/lib/kubectl_mock.sh" +# shellcheck source=./lib/extract_script.sh +source "${TEST_DIR}/lib/extract_script.sh" + +echo "================================================================" +echo " test_phase5.sh" +echo " ConfigMap: ${CONFIGMAP}" +echo " Fixtures: ${FIXTURES_DIR}" +echo "================================================================" + +# --- one-time setup ------------------------------------------------- + +# Per-process tmp dirs: +# PHASE5_DIR -- root tmpdir; cleaned on EXIT. +# MPI_TPL_DIR -- holds a placeholder MPIJob template the SUT +# `sed`s into `kubectl apply -f -`. kubectl is mocked +# so the only constraints are: file exists, sed +# expressions don't fail, substitution markers are +# present (so render assertions can scrape them out +# of the recorded apply payload via the call log). +# LOG_DIR -- writable target for launcher + per-worker logs. +# PHASE5_BODY -- the patched, function-wrapped script we source. +# HELPER_SCRIPT -- PHASE_NODE_LABEL_SCRIPT extracted for helper fns. +PHASE5_DIR=$(mktemp -d -t phase5-tests-XXXXXX) +MPI_TPL_DIR="${PHASE5_DIR}/mpi-configs" +PHASE5_LOG_DIR="${PHASE5_DIR}/log" +PHASE5_BODY="${PHASE5_DIR}/phase5-body.sh" +HELPER_SCRIPT="${PHASE5_DIR}/phase-helpers.sh" +mkdir -p "$MPI_TPL_DIR" "$PHASE5_LOG_DIR" + +trap 'rm -rf "$PHASE5_DIR"; kubectl_mock_cleanup' EXIT + +# Minimal MPIJob template stand-in. The real template lives in the +# cluster-validation-mpijob-config ConfigMap. PHASE5_SCRIPT pipes a +# sed-rendered copy to `kubectl apply -f -`; the mock kubectl drains +# stdin to /dev/null so the substitution markers below only need to +# be present + sed-safe. +cat >"${MPI_TPL_DIR}/cluster-validation-mpijob-config.yaml" <<'YAML' +apiVersion: kubeflow.org/v2beta1 +kind: MPIJob +metadata: + name: cluster-validation-mpi-job +spec: + slotsPerWorker: $$SLOTS_PER_WORKER + runPolicy: + cleanPodPolicy: Running + mpiReplicaSpecs: + Launcher: + replicas: $$LAUNCHER_REPLICAS + template: + spec: + containers: + - name: launcher + image: $$ROCE_WORKLOAD_IMAGE + Worker: + replicas: $$WORKER_REPLICAS + template: + metadata: + annotations: + k8s.v1.cni.cncf.io/networks: $$NAD_ANNOTATION + spec: + nodeName: $$LOG_STORE_NODE_NAME + containers: + - name: worker + image: $$ROCE_WORKLOAD_IMAGE + resources: + limits: + amd.com/gpu: $$GPU_PER_WORKER + amd.com/pf_nic: $$PF_NIC_PER_WORKER + amd.com/vf_nic: $$VF_NIC_PER_WORKER +YAML + +# Extract PHASE5_SCRIPT and patch two things: +# 1. /mpi-configs/. -> ${MPI_TPL_DIR}/. +# (test-only path; kubectl apply is mocked so this is purely +# about letting `sed -f` read a real file as a non-root user.) +# 2. ts=$(date +%Y%m%d-%H%M) -> ts="${PHASE5_TEST_TS:-.}" +# Pinning the timestamp makes the job name `cluster-validation-mpi-job-` +# deterministic so tests can seed worker-pod + launcher-pod state +# that the SUT will look up later. +RAW_PHASE5=$(extract_configmap_data "$CONFIGMAP" "PHASE5_SCRIPT") +if [[ -z "$RAW_PHASE5" ]]; then + echo "FATAL: PHASE5_SCRIPT extraction produced empty output" >&2 + exit 1 +fi + +PATCHED_PHASE5=$(printf '%s\n' "$RAW_PHASE5" \ + | sed "s|/mpi-configs/cluster-validation-mpijob-config.yaml|${MPI_TPL_DIR}/cluster-validation-mpijob-config.yaml|g" \ + | sed 's|ts=\$(date +%Y%m%d-%H%M)|ts="${PHASE5_TEST_TS:-$(date +%Y%m%d-%H%M)}"|') + +# Wrap in a function so `local` declarations inside run_phase5_main +# are valid AND the call site below executes against the function's +# positional args (which we forward from the test caller). Wrapping +# also keeps run_phase5_main's globals (ts, new_job, NAD_*, .) out +# of the harness scope. +# +# Production orchestrator (cluster-validation-job.yaml `run_phase5`) +# sources PHASE5_SCRIPT to REGISTER the run_phase5_main function and +# then invokes `run_phase5_main "$nodes"` explicitly. PHASE5_SCRIPT +# no longer self-invokes at source time (the trailing +# `run_phase5_main "$@"` was removed because the orchestrator's +# explicit call already does the work, and the source-time call was +# running Phase 5 a second time per cron tick). +# +# Mirror that contract here: write the body (just function defs) into +# the wrapper, then APPEND an explicit `run_phase5_main "$@"` call so +# tests invoking `__phase5_run ` exercise the same path. +{ + printf '__phase5_run() {\n' + printf '%s\n' "$PATCHED_PHASE5" + printf ' run_phase5_main "$@"\n' + printf '}\n' +} > "$PHASE5_BODY" + +if ! bash -n "$PHASE5_BODY"; then + echo "FATAL: patched PHASE5_SCRIPT has bash syntax errors" >&2 + exit 1 +fi + +# Extract the helper library (label_phase_passed/failed, +# annotate_phase_value) once; sourced into the harness so the +# wrapped PHASE5_SCRIPT body can call the helpers by name. +extract_configmap_data "$CONFIGMAP" "PHASE_NODE_LABEL_SCRIPT" \ + > "$HELPER_SCRIPT" +if [[ ! -s "$HELPER_SCRIPT" ]]; then + echo "FATAL: PHASE_NODE_LABEL_SCRIPT extraction produced empty output" >&2 + exit 1 +fi +if ! bash -n "$HELPER_SCRIPT"; then + echo "FATAL: extracted helper script has bash syntax errors" >&2 + exit 1 +fi + +kubectl_mock_init + +# Suffix for failure-reason annotation; mirrors ConfigMap default in +# the cluster-validation-config envFrom (PHASE_FAILURE_REASON_ANNOTATION_SUFFIX). +export PHASE_FAILURE_REASON_ANNOTATION_SUFFIX="-failure-reason" + +# shellcheck disable=SC1090 +source "$HELPER_SCRIPT" +# shellcheck disable=SC1090 +source "$PHASE5_BODY" + +# Sanity: required functions are defined. +for fn in label_phase_passed label_phase_failed __phase5_run; do + if ! declare -F "$fn" >/dev/null; then + echo "FATAL: required function $fn not defined after sourcing" >&2 + exit 1 + fi +done + +# Suppress -u for the tests; PHASE5_SCRIPT references optional env +# (SKIP_RCCL_TEST default-empty) that would trip strict mode otherwise. +set +u + +# --- per-test helpers ----------------------------------------------- + +# _reset_phase5_env +# Wipe mock state and re-export the baseline env PHASE5_SCRIPT reads. +# Defaults mirror what the launcher container's envFrom would inject +# in production (cluster-validation-config.yaml). Tests override +# individual knobs (SKIP_RCCL_TEST, PHASE5_MIN_WORKERS, .) before +# invoking __phase5_run. +_reset_phase5_env() { + kubectl_mock_reset + : >"${PHASE5_LOG_DIR}"/.keep 2>/dev/null || true + # Wipe any leftover log files from a prior test so per-test + # assert_file_exists checks reflect this test's writes only. + find "$PHASE5_LOG_DIR" -mindepth 1 -delete 2>/dev/null || true + + # Phase 5 core env + export PHASE5_LABEL_KEY="amd.com/cluster-validation-status" + export PHASE5_MIN_WORKERS="2" + export WORKER_REPLICAS="8" + export LAUNCHER_REPLICAS="1" + export MPIJOB_WAIT_TIME="60" + export DEBUG_DELAY="0" # don't actually sleep on the fail path + export LOG_DIR="${PHASE5_LOG_DIR}" + + # MPIJob render inputs (consumed by the sed pipeline) + export ROCE_WORKLOAD_IMAGE="docker.io/rocm/test-image:phase5" + export LOG_STORE_NODE_NAME="log-store-node" + export SLOTS_PER_WORKER="8" + export GPU_PER_WORKER="8" + export PF_NIC_PER_WORKER="1" + export VF_NIC_PER_WORKER="0" + export PF_NIC_NAD_NAME="default/rail" + export VF_NIC_NAD_NAME="default/vf-rail" + + # Candidate label (the SUT strips the `=.` suffix via parameter + # expansion before issuing `kubectl label node $n ${KEY}-`). + export CANDIDATE_LABEL="amd.com/gpu-validation-candidate=true" + + # Pinned timestamp so the SUT's `new_job` name is deterministic + # and we can seed mock state keyed by it. + export PHASE5_TEST_TS="testts" + + # Phase 5 reads SKIP_RCCL_TEST with `,` lowercase expansion; + # default to empty (= run MPIJob) unless a test overrides. + unset SKIP_RCCL_TEST +} + +# _expected_job_name +# Mirrors PHASE5_SCRIPT exactly: `cluster-validation-mpi-job-${ts}`. +_expected_job_name() { + printf 'cluster-validation-mpi-job-%s' "${PHASE5_TEST_TS}" +} + +# --- assertion helpers ---------------------------------------------- + +# _assert_no_kubectl_verb +# Verify zero `kubectl .` calls were recorded. +_assert_no_kubectl_verb() { + local verb="$1" + local path="${KUBECTL_CALLS_FILE:-}" + if grep -q "^${verb} " "$path"; then + _assert_fail "expected no kubectl ${verb} calls; got: +$(grep "^${verb} " "$path")" + fi +} + +# _assert_label_call +# Verify `kubectl label node = --overwrite` +# was issued (the path the helpers take). +_assert_label_call() { + local node="$1" + local value="$2" + assert_kubectl_call "label node ${node} ${PHASE5_LABEL_KEY}=${value} --overwrite" +} + +# _assert_failure_reason_for_node +# Verify the per-node failure-reason annotation was written via the +# helper's annotate path: +# annotate node -failure-reason= --overwrite +_assert_failure_reason_for_node() { + local node="$1" + local reason="$2" + local key="${PHASE5_LABEL_KEY}${PHASE_FAILURE_REASON_ANNOTATION_SUFFIX}" + assert_kubectl_call "annotate node ${node} ${key}=${reason} --overwrite" +} + +# _assert_candidate_label_preserved +# Retry-trap contract: PHASE5_SCRIPT must NOT remove +# `amd.com/cluster-validation-candidate=true` on either pass or fail. +# The candidate label is the *eligibility* gate; PHASE5_LABEL_KEY is +# the *verdict*. Phase 0's timestamp/interval filter gates re-selection +# on its own -- stripping the candidate label would permanently +# quarantine the node and break the cron-style retry workflow. +# Assert that NO `kubectl label node - --overwrite` +# call was recorded for this node. +_assert_candidate_label_preserved() { + local node="$1" + local cand_key="${CANDIDATE_LABEL%%=*}" + local pattern="label node ${node} ${cand_key}- --overwrite" + if grep -qxF -- "$pattern" "$KUBECTL_CALLS_FILE"; then + _assert_fail "Phase 5 must not strip candidate label on ${node}; recorded call: ${pattern}" + fi +} + +# _assert_mpijob_explicit_delete +# Cleanup contract: PHASE5_SCRIPT must explicitly +# `kubectl delete mpijob --ignore-not-found --wait=false` +# on the way out so a Pending/Failed MPIJob does not linger across +# cron ticks (the launcher Job's pods can hold amd.com/gpu and +# amd.com/nic reservations that block Phase 0 selection on the same +# nodes via the busy-pool check). +_assert_mpijob_explicit_delete() { + local job="$1" + assert_kubectl_call "delete mpijob ${job} --ignore-not-found --wait=false" +} + +# ========================================================== +# TESTS +# ========================================================== + +# --- TC4 (skip-rccl-passlabels-all) --------------------------------- +# SKIP_RCCL_TEST=true short-circuits BEFORE the MPIJob path. Every +# input node must be pass-labeled via the helper (not raw kubectl +# label). Retry-trap contract: the candidate label +# is intentionally PRESERVED across all Phase 5 exits. No MPIJob +# ever lands. +it "SKIP_RCCL_TEST=true labels every input node passed; no MPIJob submitted; candidate label preserved" && { + _reset_phase5_env + export SKIP_RCCL_TEST="true" + run __phase5_run node-a node-b node-c + assert_status 0 + # No MPIJob render / submit / wait. + _assert_no_kubectl_verb apply + _assert_no_kubectl_verb wait + # Every input node pass-labeled via the helper. + _assert_label_call node-a passed + _assert_label_call node-b passed + _assert_label_call node-c passed + # Candidate label preserved on every input node (no `-` strip). + _assert_candidate_label_preserved node-a + _assert_candidate_label_preserved node-b + _assert_candidate_label_preserved node-c + assert_stdout_contains "SKIP_RCCL_TEST is set to true. Skipping MPI Job RCCL test." + assert_stdout_contains "(RCCL test skipped)" +} + +# --- TC4 case-insensitive: SKIP_RCCL_TEST=TRUE same as true -------- +# The `,` parameter expansion in PHASE5_SCRIPT lowercases the value +# before comparing; an upper-case value MUST still take the skip path. +it "SKIP_RCCL_TEST=TRUE (uppercase) still takes the skip path" && { + _reset_phase5_env + export SKIP_RCCL_TEST="TRUE" + run __phase5_run node-a + assert_status 0 + _assert_no_kubectl_verb apply + _assert_label_call node-a passed +} + +# --- TC10 (sub-min-pool-skips) -------------------------------------- +# 1-node input + default PHASE5_MIN_WORKERS=2 -> the guard fires. +# No MPIJob is submitted, no labels are written, no candidate labels +# are stripped. Script returns 0 so downstream cleanup can run. +it "sub-min pool (1 node, MIN_WORKERS=2): no MPIJob, no labels, return 0" && { + _reset_phase5_env + # PHASE5_MIN_WORKERS=2 from _reset_phase5_env + run __phase5_run node-a + assert_status 0 + _assert_no_kubectl_verb apply + _assert_no_kubectl_verb wait + _assert_no_kubectl_verb label + _assert_no_kubectl_verb annotate + assert_stdout_contains "only 1 node(s) survived prior phases" + assert_stdout_contains "skipping MPIJob (no label changes)" +} + +# --- TC11 (empty-input-skips) --------------------------------------- +# Empty positional args -> input_count=0 -> the guard fires (0 < 2). +# Same no-op contract as the sub-min case. +it "empty input: no MPIJob, no labels, return 0" && { + _reset_phase5_env + run __phase5_run + assert_status 0 + _assert_no_kubectl_verb apply + _assert_no_kubectl_verb wait + _assert_no_kubectl_verb label + _assert_no_kubectl_verb annotate + assert_stdout_contains "only 0 node(s) survived prior phases" +} + +# --- TC12 (min-workers-override) ------------------------------------ +# Operator opts in to a single-node MPIJob via PHASE5_MIN_WORKERS=1. +# Now a 1-node input passes the guard; apply + wait + label all run. +it "PHASE5_MIN_WORKERS=1 + 1 node: guard passes, MPIJob submitted" && { + _reset_phase5_env + export PHASE5_MIN_WORKERS="1" + # Seed mock state so the SUT can find the worker pod + launcher pod + # for label / log dump steps. The wait loop polls Succeeded/Failed + # conditions on the MPIJob; seed Succeeded=True so the loop exits + # immediately with job_status=passed. + job=$(_expected_job_name) + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 0 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + kubectl_mock_set_mpijob_condition "$job" Succeeded True + run __phase5_run node-a + assert_status 0 + # MPIJob render + poll both ran. + assert_kubectl_call_contains "apply" + assert_kubectl_call_contains "get mpijob ${job}" + # Single-node degenerate run: label_phase_passed for node-a. + _assert_label_call node-a passed + # Cleanup deletes the MPIJob explicitly. + _assert_mpijob_explicit_delete "$job" +} + +# --- refactor-safety baseline (TC1 + TC2 + TC5) --------------------- +# Happy path: 2 nodes, SKIP_RCCL_TEST unset, MPIJob succeeds. Labels +# go through the helper (TC2), per-worker logs are saved (TC5), the +# orchestrator output retains the existing pre-refactor banner lines +# (TC1). +it "refactor-safety baseline: 2-node pass path writes labels via helpers + per-worker logs" && { + _reset_phase5_env + job=$(_expected_job_name) + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-b worker-pod-b + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 0 + kubectl_mock_set_phase5_pod_exit_code worker-pod-b 0 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + # Drive the poll loop straight to passed via Succeeded=True. + kubectl_mock_set_mpijob_condition "$job" Succeeded True + run __phase5_run node-a node-b + assert_status 0 + + # TC2: labels written via helper (the helper itself issues the + # raw kubectl label, but it always carries --overwrite -- a + # marker the pre-refactor raw kubectl label loop did NOT use + # on every call. The presence of `--overwrite` on EVERY + # PHASE5_LABEL_KEY=passed write confirms the helper path.) + _assert_label_call node-a passed + _assert_label_call node-b passed + + # TC2 (label key contract): the label key is the unchanged + # PHASE5_LABEL_KEY value -- preserved across the refactor. + assert_kubectl_call_contains "amd.com/cluster-validation-status=passed" + + # TC5: per-worker log files exist on disk for each input node. + assert_file_exists "${PHASE5_LOG_DIR}/worker-node-a-${job}.log" + assert_file_exists "${PHASE5_LOG_DIR}/worker-node-b-${job}.log" + + # Retry-trap contract: candidate label is preserved across + # the pass path. Phase 0's timestamp/interval gate handles + # re-selection; stripping eligibility would break the cron model. + _assert_candidate_label_preserved node-a + _assert_candidate_label_preserved node-b + + # Pre-refactor banner lines preserved (TC1 functional equivalence). + assert_stdout_contains "===Step 3: Submitting MPIJob===" + assert_stdout_contains "===Step 4: Waiting for MPIJob completion===" + assert_stdout_contains "===Step 5: Labeling nodes based on MPIJob result===" + assert_stdout_contains "[MPIJob Result: Passed]" + + # Explicit MPIJob delete on exit. + _assert_mpijob_explicit_delete "$job" +} + +# --- TC3 (dynamic-worker-replicas) ---------------------------------- +# input_count=3 -> actual_worker_replicas=3 -> the MPIJob render +# substitutes `$$WORKER_REPLICAS` with 3 (NOT the WORKER_REPLICAS=8 +# default from _reset_phase5_env, which is the Phase 0 max). We +# scrape the rendered text from the SUT's own log line because the +# mock kubectl drains stdin to /dev/null. +it "dynamic worker-replicas: 3-node input -> MPIJob rendered with 3 workers" && { + _reset_phase5_env + job=$(_expected_job_name) + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-b worker-pod-b + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-c worker-pod-c + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 0 + kubectl_mock_set_phase5_pod_exit_code worker-pod-b 0 + kubectl_mock_set_phase5_pod_exit_code worker-pod-c 0 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + kubectl_mock_set_mpijob_condition "$job" Succeeded True + run __phase5_run node-a node-b node-c + assert_status 0 + # The SUT echoes the worker count immediately before submit. + assert_stdout_contains "[MPIJob: Submitted for 3 worker node(s)]" + # And the input_count debug line records the same value. + assert_stdout_contains "input_count=3" + # Three pass-labels. + _assert_label_call node-a passed + _assert_label_call node-b passed + _assert_label_call node-c passed +} + +# --- TC6 (mpijob-fails-labels-failed) ------------------------------- +# kubectl wait non-zero -> every input node gets failed label. +# The candidate-label cleanup loop and per-worker log dump both +# still run on the fail path. +it "MPIJob Failed condition -> every input node labeled failed; candidate label preserved" && { + _reset_phase5_env + job=$(_expected_job_name) + # Seed two worker pods so the failure-attribution loop has data + # to annotate (one pod per node, distinct exit codes). + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-b worker-pod-b + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 1 + kubectl_mock_set_phase5_pod_exit_code worker-pod-b 1 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + # New wait loop polls Succeeded AND Failed conditions; seed + # Failed=True so the loop short-circuits on its first iteration + # with job_status=failed. + kubectl_mock_set_mpijob_condition "$job" Failed True + run __phase5_run node-a node-b + assert_status 0 + _assert_label_call node-a failed + _assert_label_call node-b failed + # Retry-trap contract: candidate label preserved even on fail. + _assert_candidate_label_preserved node-a + _assert_candidate_label_preserved node-b + assert_stdout_contains "[MPIJob Result: Failed]" + _assert_mpijob_explicit_delete "$job" +} + +# --- TC7 (per-worker-exit-attribution) ----------------------------- +# Distinct exit codes per worker pod -> distinct failure-reason +# annotations per node. Confirms the worker-pod=.,exit=. shape +# AND that the two nodes get DIFFERENT reasons (no accidental +# cross-node bleed). +it "per-worker exit attribution: each failed node carries its own worker-pod=...,exit=... reason" && { + _reset_phase5_env + job=$(_expected_job_name) + # node-a's worker exited 137 (OOM); node-b's exited 1. + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-b worker-pod-b + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 137 + kubectl_mock_set_phase5_pod_exit_code worker-pod-b 1 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + kubectl_mock_set_mpijob_condition "$job" Failed True + run __phase5_run node-a node-b + assert_status 0 + _assert_label_call node-a failed + _assert_label_call node-b failed + # Per-node distinct reasons: + _assert_failure_reason_for_node node-a "worker-pod=worker-pod-a,exit=137" + _assert_failure_reason_for_node node-b "worker-pod=worker-pod-b,exit=1" + # Sanity: node-a is NOT annotated with node-b's reason and vice + # versa. Without this guard a buggy loop could write the same + # annotation to every node and the previous asserts would still + # pass (each independent line would satisfy assert_kubectl_call). + cross_a="annotate node node-a ${PHASE5_LABEL_KEY}${PHASE_FAILURE_REASON_ANNOTATION_SUFFIX}=worker-pod=worker-pod-b,exit=1 --overwrite" + cross_b="annotate node node-b ${PHASE5_LABEL_KEY}${PHASE_FAILURE_REASON_ANNOTATION_SUFFIX}=worker-pod=worker-pod-a,exit=137 --overwrite" + if grep -qxF -- "$cross_a" "$KUBECTL_CALLS_FILE"; then + _assert_fail "node-a was annotated with node-b's reason -- per-worker attribution bled across nodes" + fi + if grep -qxF -- "$cross_b" "$KUBECTL_CALLS_FILE"; then + _assert_fail "node-b was annotated with node-a's reason -- per-worker attribution bled across nodes" + fi +} + +# --- per-worker exit attribution: pod-missing fallback -------------- +# When the worker-pod lookup returns empty (race against cleanup, +# eviction, etc.) the SUT must annotate with `worker-pod=unknown,exit=unknown` +# rather than leak an empty-string annotation. This is the design's +# "deterministic fallback" guarantee from §4. +it "missing worker pod -> failure-reason=worker-pod=unknown,exit=unknown" && { + _reset_phase5_env + # input_count=1 would trip the default MIN_WORKERS=2 guard; opt in + # to a degenerate single-node MPIJob so the unknown-attribution + # branch is reachable from a 1-node input. + export PHASE5_MIN_WORKERS="1" + job=$(_expected_job_name) + # NO kubectl_mock_set_phase5_worker_pod_for_node seeded for node-a + # -- the lookup returns empty, triggering the unknown/unknown branch. + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + kubectl_mock_set_mpijob_condition "$job" Failed True + run __phase5_run node-a + assert_status 0 + _assert_label_call node-a failed + _assert_failure_reason_for_node node-a "worker-pod=unknown,exit=unknown" +} + +# --- TC8 (mpijob-apply-fails) --------------------------------------- +# kubectl apply non-zero -- the MPIJob never lands. The SUT does +# NOT special-case apply failure; it falls through to `kubectl wait`, +# which fails (the wait mock returns success by default, but with no +# real MPIJob the production behavior is wait timeout). For our mock +# we drive both apply and wait sticky-failed so the SUT goes down +# the failed-job labelling path. Worker-pod lookups return empty +# (no pods were ever created), so the fallback unknown/unknown +# annotation is written -- matching the design's "apply failure +# surfaces to operator via failure annotation" intent. +it "kubectl apply fails -> nodes labeled failed with unknown attribution" && { + _reset_phase5_env + # Use a very short wait window so the timeout branch fires + # quickly when MPIJob never appears (apply failed -> no + # Succeeded/Failed condition is ever set). + export MPIJOB_WAIT_TIME="1" + kubectl_mock_fail_sticky apply 1 + # No worker pods seeded -- lookup returns empty -> unknown path. + job=$(_expected_job_name) + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + run __phase5_run node-a node-b + assert_status 0 + _assert_label_call node-a failed + _assert_label_call node-b failed + _assert_failure_reason_for_node node-a "worker-pod=unknown,exit=unknown" + _assert_failure_reason_for_node node-b "worker-pod=unknown,exit=unknown" + # The timeout branch should log the new diagnostic line. + assert_stdout_contains "did not reach terminal state" +} + +# --- TC14 (launcher-log-still-collected) ---------------------------- +# Launcher pod logs land at ${LOG_DIR}/launcher-${new_job}.log on +# the pass path. The fail-path equivalent is exercised implicitly +# by the mpijob-fails test above (which also calls the launcher-log +# collection block), but we pin the on-disk file here for the pass +# path where it is most observable. +it "launcher log saved to LOG_DIR/launcher-.log on pass path" && { + _reset_phase5_env + job=$(_expected_job_name) + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-b worker-pod-b + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 0 + kubectl_mock_set_phase5_pod_exit_code worker-pod-b 0 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + kubectl_mock_set_mpijob_condition "$job" Succeeded True + run __phase5_run node-a node-b + assert_status 0 + assert_file_exists "${PHASE5_LOG_DIR}/launcher-${job}.log" +} + +# --- SKIP_RCCL_TEST=false explicit (regression for, expansion) --- +# Explicit `false` must take the MPIJob branch, not the skip branch. +# Guards against a `,`-expansion regression that would treat any +# non-empty value as "true". +it "SKIP_RCCL_TEST=false takes the MPIJob branch (no skip short-circuit)" && { + _reset_phase5_env + export SKIP_RCCL_TEST="false" + job=$(_expected_job_name) + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-b worker-pod-b + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 0 + kubectl_mock_set_phase5_pod_exit_code worker-pod-b 0 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + kubectl_mock_set_mpijob_condition "$job" Succeeded True + run __phase5_run node-a node-b + assert_status 0 + # MUST hit the MPIJob path (apply + poll), not the skip banner. + assert_kubectl_call_contains "apply" + assert_kubectl_call_contains "get mpijob" + assert_stdout_not_contains "Skipping MPI Job RCCL test" +} + +# --- poll loop short-circuits on Succeeded --------------- +# When the MPIJob carries Succeeded=True on the first poll, the wait +# loop must exit on iteration 1 (no 5s sleep). Verifies Fix #3. +it "wait loop: Succeeded=True short-circuits to passed within one poll" && { + _reset_phase5_env + job=$(_expected_job_name) + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-b worker-pod-b + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 0 + kubectl_mock_set_phase5_pod_exit_code worker-pod-b 0 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + kubectl_mock_set_mpijob_condition "$job" Succeeded True + start=$(date +%s) + run __phase5_run node-a node-b + end=$(date +%s) + assert_status 0 + assert_stdout_contains "[MPIJob Result: Passed]" + elapsed=$((end - start)) + if (( elapsed > 4 )); then + _assert_fail "wait loop took ${elapsed}s on Succeeded=True; expected <5s (no sleep on first poll)" + fi +} + +# --- poll loop short-circuits on Failed ------------------ +# Confirms the new mechanism does NOT block for MPIJOB_WAIT_TIME when +# the MPIJob is in a clear Failed terminal state. The pre-fix code used +# `kubectl wait --for=condition=Succeeded`, which would sit out the +# full 240s timeout on failure. +it "wait loop: Failed=True short-circuits to failed within one poll" && { + _reset_phase5_env + # Set MPIJOB_WAIT_TIME high to prove the short-circuit (not just + # the budget expiring) is what ended the loop. + export MPIJOB_WAIT_TIME="120" + job=$(_expected_job_name) + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-b worker-pod-b + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 1 + kubectl_mock_set_phase5_pod_exit_code worker-pod-b 1 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + kubectl_mock_set_mpijob_condition "$job" Failed True + start=$(date +%s) + run __phase5_run node-a node-b + end=$(date +%s) + assert_status 0 + assert_stdout_contains "[MPIJob Result: Failed]" + assert_stdout_not_contains "did not reach terminal state" + elapsed=$((end - start)) + if (( elapsed > 4 )); then + _assert_fail "wait loop took ${elapsed}s on Failed=True; expected <5s (no sleep on first poll, NOT the 120s timeout)" + fi + _assert_label_call node-a failed + _assert_label_call node-b failed + _assert_candidate_label_preserved node-a + _assert_candidate_label_preserved node-b + _assert_mpijob_explicit_delete "$job" +} + +# --- poll loop hits its budget when neither condition fires +# MPIJOB_WAIT_TIME=0 makes the `while ($(date +%s) < end)` check false +# on the first evaluation, so the loop body never executes and the +# timeout branch fires immediately. job_status folds to failed. +it "wait loop: timeout branch fires when neither condition is True" && { + _reset_phase5_env + export MPIJOB_WAIT_TIME="0" + job=$(_expected_job_name) + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-a worker-pod-a + kubectl_mock_set_phase5_worker_pod_for_node "$job" node-b worker-pod-b + kubectl_mock_set_phase5_pod_exit_code worker-pod-a 1 + kubectl_mock_set_phase5_pod_exit_code worker-pod-b 1 + kubectl_mock_set_phase5_launcher_pod "$job" launcher-pod-x + # Deliberately NO mpijob condition seeded. + run __phase5_run node-a node-b + assert_status 0 + assert_stdout_contains "did not reach terminal state within 0s" + assert_stdout_contains "[MPIJob Result: Failed]" + _assert_label_call node-a failed + _assert_label_call node-b failed + _assert_candidate_label_preserved node-a + _assert_candidate_label_preserved node-b +} + +# --- explicit MPIJob cleanup runs on the skip path too --- +# The SKIP_RCCL_TEST=true short-circuit returns before the MPIJob is +# ever submitted, so there is nothing to delete. Verify NO `delete +# mpijob` call is issued in that path (avoid noise/false failures +# against a non-existent MPIJob name). +it "skip path: no explicit MPIJob delete (nothing was submitted)" && { + _reset_phase5_env + export SKIP_RCCL_TEST="true" + run __phase5_run node-a node-b + assert_status 0 + _assert_no_kubectl_verb apply + # No delete mpijob call -- there is no $new_job in this branch. + if grep -q "^delete mpijob" "$KUBECTL_CALLS_FILE"; then + _assert_fail "skip path issued an unexpected delete mpijob call" + fi +} + +assert_summary diff --git a/example/gpu-validation-cluster/tests/test_phase_node_label_script.sh b/example/gpu-validation-cluster/tests/test_phase_node_label_script.sh new file mode 100755 index 000000000..36341e841 --- /dev/null +++ b/example/gpu-validation-cluster/tests/test_phase_node_label_script.sh @@ -0,0 +1,312 @@ +#!/bin/bash +# Unit tests for PHASE_NODE_LABEL_SCRIPT +# against the kubectl mock harness. +# +# Covers the four functions the orchestrator depends on: +# * label_phase_passed +# * label_phase_failed +# * annotate_phase_value +# * filter_passed_nodes +# +# Also covers contract invariants from design §5: +# * helpers use --overwrite for idempotency +# * label_phase_failed writes the failure-reason annotation +# * filter_passed_nodes drops nodes that are not exactly "=passed" +# * argument validation: empty args / wrong arity return non-zero +# with NO kubectl side effects + +set -uo pipefail + +TEST_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd -- "${TEST_DIR}/../../.." && pwd) +CONFIGMAP="${REPO_ROOT}/example/gpu-validation-cluster/configs/cluster-validation-config.yaml" + +# shellcheck source=./lib/assert.sh +source "${TEST_DIR}/lib/assert.sh" +# shellcheck source=./lib/kubectl_mock.sh +source "${TEST_DIR}/lib/kubectl_mock.sh" +# shellcheck source=./lib/extract_script.sh +source "${TEST_DIR}/lib/extract_script.sh" + +echo "================================================================" +echo " test_phase_node_label_script.sh" +echo " ConfigMap: ${CONFIGMAP}" +echo "================================================================" + +# --- one-time setup ------------------------------------------------- +HELPER_SCRIPT=$(mktemp -t phase-helpers-XXXXXX.sh) +trap 'rm -f "$HELPER_SCRIPT"; kubectl_mock_cleanup' EXIT + +extract_configmap_data "$CONFIGMAP" "PHASE_NODE_LABEL_SCRIPT" \ + > "$HELPER_SCRIPT" + +if [[ ! -s "$HELPER_SCRIPT" ]]; then + echo "FATAL: PHASE_NODE_LABEL_SCRIPT extraction produced empty output" >&2 + exit 1 +fi +if ! bash -n "$HELPER_SCRIPT"; then + echo "FATAL: extracted helper script has bash syntax errors" >&2 + exit 1 +fi + +kubectl_mock_init + +# Source the helper library AFTER kubectl_mock_init so the kubectl +# shim is on PATH when the helpers call it. The failure-reason +# annotation suffix must mirror the ConfigMap default so we can assert +# on the expected annotation key. +export PHASE_FAILURE_REASON_ANNOTATION_SUFFIX="-failure-reason" +# shellcheck disable=SC1090 +source "$HELPER_SCRIPT" + +# Sanity: the four required functions are defined. +for fn in label_phase_passed label_phase_failed annotate_phase_value filter_passed_nodes; do + if ! declare -F "$fn" >/dev/null; then + echo "FATAL: helper function $fn not defined after sourcing" >&2 + exit 1 + fi +done + +# ------------------------------------------------------------------- +# label_phase_passed +# ------------------------------------------------------------------- + +it "label_phase_passed writes =passed with --overwrite" && { + kubectl_mock_reset + run label_phase_passed node-a amd.com/gpu-hw-acceptance + assert_status 0 + assert_kubectl_call_count 1 + assert_kubectl_call \ + "label node node-a amd.com/gpu-hw-acceptance=passed --overwrite" +} + +it "label_phase_passed exits non-zero on empty node arg" && { + kubectl_mock_reset + run label_phase_passed "" amd.com/gpu-hw-acceptance + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "label_phase_passed exits non-zero on empty key arg" && { + kubectl_mock_reset + run label_phase_passed node-a "" + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "label_phase_passed exits non-zero on wrong arity (1 arg)" && { + kubectl_mock_reset + run label_phase_passed node-a + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "label_phase_passed exits non-zero on wrong arity (3 args)" && { + kubectl_mock_reset + run label_phase_passed node-a amd.com/x extra + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "label_phase_passed propagates non-zero when kubectl label fails" && { + kubectl_mock_reset + kubectl_mock_fail label 1 + run label_phase_passed node-a amd.com/gpu-hw-acceptance + assert_not_equals 0 "$LAST_STATUS" + # The label call was still made. + assert_kubectl_call_count 1 +} + +# ------------------------------------------------------------------- +# label_phase_failed +# ------------------------------------------------------------------- + +it "label_phase_failed writes =failed AND failure-reason annotation" && { + kubectl_mock_reset + run label_phase_failed node-b amd.com/nic-health "kernel oops" + assert_status 0 + assert_kubectl_call_count 2 + assert_kubectl_call \ + "label node node-b amd.com/nic-health=failed --overwrite" + assert_kubectl_call \ + "annotate node node-b amd.com/nic-health-failure-reason=kernel oops --overwrite" +} + +it "label_phase_failed with empty reason still writes label, skips annotation" && { + kubectl_mock_reset + run label_phase_failed node-b amd.com/nic-health "" + assert_status 0 + assert_kubectl_call_count 1 + assert_kubectl_call \ + "label node node-b amd.com/nic-health=failed --overwrite" +} + +it "label_phase_failed rejects empty node" && { + kubectl_mock_reset + run label_phase_failed "" amd.com/x "reason" + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "label_phase_failed rejects empty key" && { + kubectl_mock_reset + run label_phase_failed node-a "" "reason" + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "label_phase_failed rejects wrong arity (2 args)" && { + kubectl_mock_reset + run label_phase_failed node-a amd.com/x + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "label_phase_failed returns non-zero when kubectl annotate fails" && { + kubectl_mock_reset + kubectl_mock_fail annotate 1 + run label_phase_failed node-b amd.com/nic-health "kernel oops" + assert_not_equals 0 "$LAST_STATUS" + # The label call still ran; the annotate call ran but failed. + assert_kubectl_call_count 2 +} + +# ------------------------------------------------------------------- +# annotate_phase_value +# ------------------------------------------------------------------- + +it "annotate_phase_value writes -=" && { + kubectl_mock_reset + run annotate_phase_value node-c amd.com/rail-bandwidth rail0 94.2 + assert_status 0 + assert_kubectl_call_count 1 + assert_kubectl_call \ + "annotate node node-c amd.com/rail-bandwidth-rail0=94.2 --overwrite" +} + +it "annotate_phase_value allows empty value" && { + kubectl_mock_reset + run annotate_phase_value node-c amd.com/x rail0 "" + assert_status 0 + assert_kubectl_call \ + "annotate node node-c amd.com/x-rail0= --overwrite" +} + +it "annotate_phase_value rejects empty node / phase_key / sub_key" && { + kubectl_mock_reset + run annotate_phase_value "" amd.com/x rail0 v + assert_not_equals 0 "$LAST_STATUS" + kubectl_mock_reset + run annotate_phase_value node-c "" rail0 v + assert_not_equals 0 "$LAST_STATUS" + kubectl_mock_reset + run annotate_phase_value node-c amd.com/x "" v + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "annotate_phase_value rejects wrong arity" && { + kubectl_mock_reset + run annotate_phase_value node-c amd.com/x rail0 + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "annotate_phase_value returns non-zero when kubectl fails" && { + kubectl_mock_reset + kubectl_mock_fail annotate 1 + run annotate_phase_value node-c amd.com/x rail0 v + assert_not_equals 0 "$LAST_STATUS" +} + +# ------------------------------------------------------------------- +# filter_passed_nodes +# ------------------------------------------------------------------- + +it "filter_passed_nodes returns only nodes with =passed" && { + kubectl_mock_reset + kubectl_mock_set_label node-x amd.com/phase-test passed + kubectl_mock_set_label node-y amd.com/phase-test failed + # node-z is intentionally unlabeled. + run filter_passed_nodes "node-x node-y node-z" amd.com/phase-test + assert_status 0 + assert_stdout_equals "node-x" +} + +it "filter_passed_nodes preserves input order across multiple passed nodes" && { + kubectl_mock_reset + kubectl_mock_set_label node-x amd.com/phase-test passed + kubectl_mock_set_label node-y amd.com/phase-test passed + kubectl_mock_set_label node-z amd.com/phase-test passed + run filter_passed_nodes "node-y node-x node-z" amd.com/phase-test + assert_status 0 + assert_stdout_equals "node-y node-x node-z" +} + +it "filter_passed_nodes handles dotted label keys (jsonpath escaping)" && { + kubectl_mock_reset + kubectl_mock_set_label node-x amd.com/gpu-hw-acceptance passed + run filter_passed_nodes "node-x" amd.com/gpu-hw-acceptance + assert_status 0 + assert_stdout_equals "node-x" +} + +it "filter_passed_nodes drops nodes with non-passed label values" && { + kubectl_mock_reset + kubectl_mock_set_label node-x amd.com/phase-test pending + kubectl_mock_set_label node-y amd.com/phase-test unknown + run filter_passed_nodes "node-x node-y" amd.com/phase-test + assert_status 0 + assert_stdout_empty +} + +it "filter_passed_nodes returns 0 on empty input and emits no output" && { + kubectl_mock_reset + run filter_passed_nodes "" amd.com/phase-test + assert_status 0 + assert_stdout_empty + # kubectl is only invoked per-input-node; empty input -> no calls. + assert_kubectl_no_calls +} + +it "filter_passed_nodes rejects empty key" && { + kubectl_mock_reset + run filter_passed_nodes "node-a" "" + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +it "filter_passed_nodes rejects wrong arity (1 arg)" && { + kubectl_mock_reset + run filter_passed_nodes "node-a" + assert_not_equals 0 "$LAST_STATUS" + assert_kubectl_no_calls +} + +# ------------------------------------------------------------------- +# Cross-cutting contract invariants +# ------------------------------------------------------------------- + +it "all helper write paths use --overwrite (idempotency contract)" && { + kubectl_mock_reset + label_phase_passed node-a amd.com/x >/dev/null 2>&1 + label_phase_failed node-a amd.com/x "r" >/dev/null 2>&1 + annotate_phase_value node-a amd.com/x sub v >/dev/null 2>&1 + # Every recorded line must end with `--overwrite`. + if grep -v -- "--overwrite$" "$KUBECTL_CALLS_FILE" >/dev/null; then + _assert_fail "found a kubectl write call missing --overwrite: +$(grep -v -- "--overwrite$" "$KUBECTL_CALLS_FILE")" + fi +} + +it "helper diagnostics go to stderr, not stdout" && { + kubectl_mock_reset + run label_phase_passed node-a amd.com/x + # The helper's success-log line ("node=node-a amd.com/x=passed") + # must NOT appear on stdout (would corrupt pipe consumers like + # filter_passed_nodes). + assert_stdout_empty + assert_stderr_contains "node=node-a" +} + +assert_summary