Sync gpu-validation-cluster: multi-phase validation framework + robustness fixes by yansun1996 · Pull Request #583 · ROCm/gpu-operator

yansun1996 · 2026-06-24T06:03:12Z

Summary

Syncs the example/gpu-validation-cluster example from the pensando source repo into ROCm. ROCm's copy was at the single-phase state (#1206-equivalent); this brings it up to the current multi-phase Cluster Validation Framework.

Cherry-picks two commits (authored by Yan Sun), scoped to example/gpu-validation-cluster/ only:

Multi-Phase GPU Cluster Validation for Datacenter Bringup (pensando #1473) — the 5-phase gated pipeline (GPU HW acceptance, intra-node mesh, NIC health, pairwise rail bandwidth, multi-node RCCL), CronJob orchestrator, Ansible day-2 playbooks, and a full bats-style unit-test suite under tests/.
Self-healing bringup + robustness fixes (pensando #1539).

Scope / notes

example/ only. docs-internal/ design docs and knowledge notes from the source commits were intentionally excluded (not part of the open-source distribution).
No internal references. A cleanup commit strips internal JIRA IDs (GPUOP-xxx) and docs-internal/ path mentions from comments and test fixtures — comment-only, no functional impact.
Applies cleanly: ROCm's folder was byte-identical to the pensando #1206 baseline, so the two commits layer on without conflict.

Validation

config.json parses; the cluster-validation YAMLs are structurally valid.
Functionally exercised on a 2-node lab cluster (Phase 4 pairwise ib_write_bw and Phase 5 multi-node RCCL all_reduce_perf both pass).

🤖 Generated with Claude Code

…(#1473) * GPUOP-689: Add codie design docs for 7 multi-phase validation stories Adds design documents for all Stories under Epic GPUOP-689 (Multi-Phase GPU Cluster Validation for Datacenter Bringup): - GPUOP-690: Configuration framework + multi-phase CronJob orchestrator - GPUOP-691: Phase 1 — Per-Node GPU Hardware Acceptance - GPUOP-692: Phase 2 — Intra-Node GPU Collective Test - GPUOP-693: Phase 3 — Per-Node NIC Health Check - GPUOP-694: Phase 4 — Pairwise Rail Bandwidth Test - GPUOP-695: Phase 4.5 — Cross-Node Connectivity Matrix Test - GPUOP-696: Phase 5 — Multi-Node RCCL Collective Test Each design references the shared GPUOP-690 contract (uniform label keys, helper functions, per-phase skip flags, run_phaseN orchestrator stubs) so the 6 phase Stories can land independently. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-689: Add codie test plans for 7 multi-phase validation stories Adds test plans for each Story under Epic GPUOP-689: - GPUOP-690: Configuration Framework (16 cases: 5 P0, 9 P1, 2 P2) - GPUOP-691: Phase 1 GPU HW Acceptance (15 cases: 5 P0, 9 P1, 1 P2) - GPUOP-692: Phase 2 GPU Mesh (14 cases: 4 P0, 10 P1) - GPUOP-693: Phase 3 NIC Health (16 cases: 5 P0, 10 P1, 1 P2) - GPUOP-694: Phase 4 Pairwise Rail BW (20 cases: 5 P0, 12 P1, 3 P2) - GPUOP-695: Phase 4.5 Connectivity Mesh (14 cases: 4 P0, 9 P1, 1 P2) - GPUOP-696: Phase 5 Multi-Node RCCL (19 cases: 6 P0, 11 P1, 2 P2) Total: 114 test cases across Positive, Negative, Boundary, Error Recovery, Concurrency, Integration, Upgrade/Downgrade, and Performance categories. Per-test-case files deferred — will be generated during /codie:task-test-plan for each Sub-task as engineers implement, since test-step CLIs depend on actual code landed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-754: Replace skip-tests block with 5 per-phase skip flags Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-754: Add unit tests for config.json schema extension Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-755: Add per-phase env vars and label-key constants to ConfigMap Replace the single SKIP_GPU_VALIDATION placeholder with five per-phase skip flags (SKIP_GPU_HW_ACCEPTANCE, SKIP_GPU_MESH_VALIDATION, SKIP_NIC_VALIDATION, SKIP_RAIL_BANDWIDTH_TEST, SKIP_RCCL_TEST) aligned with the new config.json schema landed in GPUOP-754. Add uniform PHASE{1..5}_LABEL_KEY constants and PHASE_FAILURE_REASON_ANNOTATION_SUFFIX so every downstream phase consumes the same label keys rather than hard-coding their own. Owned by GPUOP-690; this is the ConfigMap env-var surface that downstream Sub-tasks (helper library, orchestrator, per-phase scripts) read from. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-755: Add unit tests for ConfigMap env-var surface 30 test cases covering the cluster-validation-config ConfigMap contract introduced by GPUOP-755: - 6 SKIP_* placeholder presence + templating-sentinel value checks - 6 PHASE{N}_LABEL_KEY + PHASE_FAILURE_REASON_ANNOTATION_SUFFIX value assertions against the design doc - 2 backward-incompat guards (legacy SKIP_GPU_VALIDATION absent) - 5 type/format checks (label namespacing, kebab-case) - 9 regression guards (surrounding ConfigMap content preserved) - 2 YAML parse / duplicate-key checks Framework: Python 3 + unittest + PyYAML. Fallback grep harness included for envs without PyYAML. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-756: Add PHASE_NODE_LABEL_SCRIPT helper library Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-756: Add unit tests for PHASE_NODE_LABEL_SCRIPT Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-757: Rewrite CronJob as phased orchestrator with run_phaseN stubs + DRY_RUN Replace the linear submit-mpijob container args with a phased state machine implementing the design in GPUOP-690 §4. The orchestrator: * sources PHASE_NODE_LABEL_SCRIPT helpers from the ConfigMap * walks Phase 0 -> 1 -> 2 -> 3 -> 4 -> 4.5 -> 5 * gates each phase on SKIP_<PHASE> flags from the ConfigMap * narrows the live-node pool between phases via filter_passed_nodes keyed on PHASE{N}_LABEL_KEY * intersects with feature.node.kubernetes.io/amd-nic=true before Phase 3 so NIC-requiring phases only see NIC-capable nodes * tracks per-phase failures and exits non-zero at the end if any phase produced failed nodes (fail-loud, design §6) * provides DRY_RUN=1 that replaces run_phaseN with no-ops, skips Phase 0 candidate labeling, and prints the planned phase order + per-phase pool without any kubectl writes * cleans up old MPIJobs and collects launcher logs only when Phase 5 actually ran run_phaseN stubs print a banner and source $PHASE{N}_SCRIPT (from the ConfigMap) if defined; otherwise no-op. Downstream Stories (GPUOP-691..696) slot in their per-phase scripts via ConfigMap without orchestrator changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-757: Add unit tests for CronJob phased orchestrator 68 test cases across 9 categories, framework Python 3 + unittest + PyYAML + bash mock-kubectl-on-PATH (matches GPUOP-754/755/756). Maps to GPUOP-690 Story test cases TC1, TC2, TC3, TC4, TC5, TC6, TC8, TC11, TC12, TC13, TC14. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-758: Default candidate selection to amd-gpu=true only Drop the joined amd-nic=true requirement from the default node selector so a datacenter early in bringup can run Phase 1 (GPU HW acceptance) and Phase 2 (intra-node GPU mesh) before any NIC infrastructure exists. NIC-requiring phases (3, 4) narrow the candidate set inside their own run_phaseN scripts via intersect ... amd-nic=true, per GPUOP-690 design sections 4 and 5. Files: - configs/config.json: default node-selector-labels reduced to [feature.node.kubernetes.io/amd-gpu=true]. - configs/cluster-validation-config.yaml: inline ConfigMap docs updated to explain the new default and where NIC narrowing happens; NODE_SELECTOR_LABELS placeholder unchanged. - gpu-cluster.sh: operator's templated jq fallback default for node-selector-labels reduced to the GPU-only list to match. Resource-busy checks and NODE_VALIDATION_INTERVAL_MINS skip behavior are unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-758: Add unit tests for candidate-selection default Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-759: Add bash unit tests for helper library + orchestrator DRY_RUN Adds a hand-rolled bash test harness under example/gpu-validation-cluster/tests/ covering the PHASE_NODE_LABEL_SCRIPT helper library (GPUOP-756) and the multi-phase CronJob orchestrator's DRY_RUN=1 mode (GPUOP-757). Helper library tests (26 cases): * label_phase_passed / label_phase_failed / annotate_phase_value / filter_passed_nodes positive paths * argument validation (empty / wrong arity returns non-zero, no kubectl side effects) * kubectl failure propagation * contract invariants: --overwrite on every write, diagnostics to stderr Orchestrator DRY_RUN tests (5 cases) from design doc section 7: * all 5 skip flags true -> exit 0, zero kubectl writes * phases 1+2 enabled, 3-5 skipped -> Phase 2 followed by SKIP_* pass-through * Phase 3 enabled but no amd-nic=true nodes -> empty pool, exit 0 * empty Phase 0 candidate pool -> exit 0 * cleanup / log collection honor DRY_RUN Infrastructure: * lib/assert.sh -- hand-rolled assertions (no bats dependency) * lib/kubectl_mock.sh -- recording kubectl shim installed on PATH (function override is unreliable for sourced sub-shells) * lib/extract_script.sh -- pure-bash/awk extractor for YAML block scalars (no PyYAML / yq dependency) * run_all.sh -- entry point; exits 0 only when every file reports zero failures All 31 tests pass locally with bash 5.x. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-760: Add multi-phase pipeline docs and SKIP_GPU_VALIDATION migration note Document the 5-phase validation pipeline (phases / labels / skip flags table), new per-phase amd.com/* label keys, the incremental-bringup default posture, DRY_RUN=1 orchestrator testing mode, and the migration mapping from the removed skip-gpu-validation flag to the new 5-flag skip-tests block. Update the per-node status legend to include the per-phase labels and failure-reason annotations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-760: Add unit tests for README + migration note Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-761: Extend GPU_VALIDATION_TESTS_JSON with AGFHC recipes; bump TEST_RUNNER_JOB_WAIT_TIME Add three AGFHC TestCases (xgmi_lvl1, pcie_lvl1, hbm_lvl1) to the MANUAL TestCases array under TestConfig.GPU_HEALTH_CHECK.TestLocationTrigger.global.TestParameters. All use StopOnFailure: true, TimeoutSeconds: 600, Iterations: 1 per GPUOP-691 design doc section 4. Bump TEST_RUNNER_JOB_WAIT_TIME from 1200s to 3600s to cover the longer combined test set (RVS gst_single ~20 min + 3 AGFHC level-1 recipes, ~50 min worst case) plus pod scheduling / result upload headroom. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-761: Add unit tests for GPU_VALIDATION_TESTS_JSON + TEST_RUNNER_JOB_WAIT_TIME 28 test cases (5 structural, 8 content, 4 numeric/budget, 6 regression, 3 boundary, 2 cross-source) covering the ConfigMap contract for the expanded TestCases array and bumped wait-time budget. All 28 cases mapped to GPUOP-691 Story test plan TC1/TC2/TC4/TC5/TC9. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-762: Add PHASE1_SCRIPT to cluster-validation-config ConfigMap Implements GPUOP-691 design §4 -> "Code Path: PHASE1_SCRIPT". Sourced by GPUOP-690 orchestrator's run_phase1 via _run_phase_generic. Behavior: 1. Parallel-submit Test Runner Job per input node (existing template). 2. Poll-wait all Jobs up to TEST_RUNNER_JOB_WAIT_TIME. 3. Parse result.json (local hostPath, fallback kubectl cp) for failed sub-test name; recipe-not-found surfaced as a distinct reason. 4. Label via GPUOP-690 helpers (label_phase_passed / label_phase_failed + annotate_phase_value failed-subtest=<name>). 5. SKIP_GPU_HW_ACCEPTANCE=true early-exit pass-labels every input node. 6. Cleanup hung (timed-out) Jobs at the end. Missing result.json -> failed reason test-runner-did-not-emit-results, subtest=unknown. Missing required env -> all nodes failed with reason phase1-missing-env:<list>. kubectl apply failure on a single node fails just that node with reason job-creation-failed; other nodes proceed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-763: Wire run_phase1 orchestrator to source PHASE1_SCRIPT Replace the generic-dispatch stub for run_phase1 (delivered by GPUOP-690.4 as a one-liner delegating to _run_phase_generic) with a dedicated function body that: * prints a Phase 1 banner with input nodes * returns 0 cleanly when the input pool is empty * returns 0 cleanly when PHASE1_SCRIPT is unset (so a partial rollout that has not yet shipped PHASE1_SCRIPT does not break the orchestrator) * writes $PHASE1_SCRIPT to /tmp/run-phase1.sh, chmod +x, and sources it with the input nodes as positional args * exports PHASE_NODES (documented input handle for per-phase scripts) * preserves the existing fail-tolerant pattern (warn on non-zero rc, always return 0 so other phases can continue) run_phase2..run_phase5 remain on _run_phase_generic until their respective PHASE{N}_SCRIPT bodies land. The DRY_RUN run_phase1 override at the dry-run block is unchanged. Files: example/gpu-validation-cluster/configs/cluster-validation-job.yaml Design: docs/codie/designs/GPUOP-691-phase1-gpu-hw-acceptance-design.md Section 4 -> "Code Path: orchestrator wiring" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-763: Add unit tests for run_phase1 orchestrator wiring 25 unit tests covering the dedicated run_phase1 body added in commit 8d51b91: * 14 contract tests (bash parse, function presence, source-level grep for tmp-path write, chmod, source-with-nodes, PHASE_NODES export, empty/unset guards, fail-tolerant warn pattern, regression guards that other phases still use the generic dispatch, DRY_RUN override intact, call site unchanged, no kubectl in the body) * 8 behavioral tests (probe-script-driven: empty input, unset PHASE1_SCRIPT, single node, multi-node positional expansion, PHASE_NODES exported visible to child processes, tmp file exists + executable, non-zero rc warns but returns 0, zero rc no warning) * 1 DRY_RUN override test (override at line 585 still shadows the new dedicated body) * 2 regression tests (filter_passed_nodes call unchanged; existing GPUOP-759 orchestrator dry-run suite still passes) Suggested test file: example/gpu-validation-cluster/tests/test_run_phase1_wiring.sh Helper extension needed: extract_bash_function() in lib/extract_script.sh Mapped to Story TCs: TC2, TC3, TC4, TC8, TC11, TC13 (downstream test cases TC1, TC5-7, TC9-12, TC14, TC15 are out of scope for the wiring Sub-task; they live in PHASE1_SCRIPT / GPUOP-764). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-764: Add bash unit tests for PHASE1_SCRIPT Realizes test cases TC1, TC2, TC4, TC5, TC6, TC7, TC8, TC10 from the GPUOP-691 Phase 1 test plan against the PHASE1_SCRIPT body embedded in configs/cluster-validation-config.yaml (GPUOP-762). New files: - tests/test_phase1.sh: 12 tests covering empty input, single-node pass, single-node hbm fail, mixed pass/fail across 2 nodes, missing result.json -> result-parse-failed, recipe-not-found -> recipe-not- found, SKIP_GPU_HW_ACCEPTANCE=true short-circuit, job-creation failure, poll-wait timeout, dry-run skip annotation, idempotent --overwrite labels, helper-failure propagation. The test sources PHASE1_SCRIPT after sed-patching the hardcoded job-template path, results root, and timestamp generation so job names are deterministic (PHASE1_TEST_TS) and templates resolve to test fixtures. - tests/fixtures/phase1/result-pass.json: 4 sub-tests all PASS. - tests/fixtures/phase1/result-hbm-fail.json: hbm_lvl1 FAIL, others PASS. - tests/fixtures/phase1/result-recipe-not-found.json: error string matches the recipe-not-found jq regex. Mock extensions (tests/lib/kubectl_mock.sh): - Early-route detection for 'kubectl get pods -l job-name=X -o jsonpath={.items[-1:].metadata.name}' so the pod lookup runs before the generic '-l' selector arm exits. - jsonpath handler for '{.status.conditions[?(@.type=="Complete"|"Failed")].status}' on job objects, seeded via new kubectl_mock_set_job_condition helper. - Fail-injection support for apply|delete|patch|create verbs (was previously hardcoded exit 0); mirrors one-shot / .sticky semantics used by label|annotate. - SIGPIPE-safe stdin drain at the top of the mock shim. Without this, callers like 'sed ... | kubectl apply -f -' under 'set -o pipefail' intermittently saw rc=141 because the mock exited before reading stdin, and the script then took the job-creation-failed path. - New helpers: kubectl_mock_set_job_condition, kubectl_mock_set_pod_for_job. All 43 tests across the suite pass (test_assert: 5, test_phase1: 12, test_phase0_helpers: 26). Verified stable across 10 sequential runs after the SIGPIPE fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-765: Remove obsolete TEST_RUNNER_SUCCESS_LABEL / TEST_RUNNER_FAILURE_LABEL references Phase 1 now uses the uniform PHASE1_LABEL_KEY (= amd.com/gpu-hw-acceptance) written via the GPUOP-690 PHASE_NODE_LABEL_SCRIPT helper library. The legacy `TEST_RUNNER_SUCCESS_LABEL` / `TEST_RUNNER_FAILURE_LABEL` env vars in cluster-validation-config.yaml are obsolete; delete the env vars and the entire `GPU_VALIDATION_TEST_SCRIPT` block (whose role is fully replaced by `PHASE1_SCRIPT` sourced from the orchestrator's run_phase1 in cluster-validation-job.yaml). Replace the deleted blocks with inline migration-note comments so deployers and downstream consumers can trace the rename. Update the now-stale `GPU_VALIDATION_TEST_SCRIPT` reference in the surviving PHASE1_SCRIPT polling-cadence comment. Verified: * YAML still parses (1 doc in config, 6 in job). * GPU_VALIDATION_TESTS_JSON / TESTS_JSON still valid JSON. * PHASE1_SCRIPT, PHASE_NODE_LABEL_SCRIPT, CRONJOB_CANDIDATE_NODES_SELECTION_SCRIPT, WAIT_FOR_WORKERS_SCRIPT, VALIDATE_RCCL_TEST_SCRIPT, RCCL_ENV_VARS all pass `bash -n`. * No matches for the removed identifiers remain in either configs YAML except the intentional migration-note comments. Design: docs/codie/designs/GPUOP-691-phase1-gpu-hw-acceptance-design.md section 4 (recipe expansion section, last paragraph). Parent: GPUOP-691. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-765: Add unit tests for legacy label-var / script-block removal 38 test cases covering: * Removal contract (top-level keys + inline shell-script references for both removed env vars and the removed legacy script block) * Migration-note guard (deletion is documented inline so future readers can trace the rename without git archaeology) * Replacement contract (PHASE1_LABEL_KEY, helper library, PHASE1_SCRIPT cover the roles vacated by the removed symbols) * Companion-file guard (cluster-validation-job.yaml is also clean) * Regression guards for every sibling ConfigMap key added or preserved by GPUOP-755 / GPUOP-756 / GPUOP-761 / GPUOP-763 * File-parse / bash-syntax structural checks on every embedded script Framework: Python 3 + unittest + PyYAML (matches GPUOP-755 / GPUOP-756 / GPUOP-761 / GPUOP-763 conventions for the same artifact); includes a grep-based shell fallback harness for environments without Python. Story TC mapping: TC1, TC2, TC4, TC5, TC13. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-766: Document Phase 1 GPU HW acceptance in README Add dedicated Phase 1 subsection covering scope, the 4 sub-tests (RVS gst_single + AGFHC xgmi_lvl1/pcie_lvl1/hbm_lvl1), the amd.com/gpu-hw-acceptance result label, sample failure annotation output (-failure-reason and -failed-subtest), the failure-reason catalog, and the SKIP_GPU_HW_ACCEPTANCE short-circuit behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-766: Add unit tests for Phase 1 README docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-767: Add cluster-validation-phase2-job-config ConfigMap New ConfigMap holding the batch/v1.Job template for the Phase 2 single-node 8-GPU RCCL all_reduce_perf test. Mirrors the test-runner Job pattern but uses RCCL_WORKLOAD_IMAGE, requests all 8 GPUs, and runs mpirun --host localhost with the PHASE2_* sizing knobs from cluster-validation-config. Reuses VALIDATE_RCCL_TEST_SCRIPT for bandwidth threshold check. Uses \$\$NODE placeholder for sed substitution by PHASE2_SCRIPT. See docs/codie/designs/GPUOP-692-phase2-gpu-mesh-validation-design.md section 4 -> "Code Path: Phase 2 Job template ConfigMap". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-767: Add unit tests for cluster-validation-phase2-job-config 52 test cases covering structural / content / resource / command / boundary (no-IB invariant) / rendering / regression categories for the new Phase 2 per-node Job template ConfigMap. Mapped to 9 Story-level test cases in GPUOP-692 test plan. Python 3 + unittest + PyYAML harness, with a grep-only fallback for CI environments without Python. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-768: Add Phase 2 env vars + PHASE2_RCCL_ENV_VARS to cluster-validation-config Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-768: Add unit tests for Phase 2 env vars + PHASE2_RCCL_ENV_VARS Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-769: Add PHASE2_SCRIPT to cluster-validation-config Mirror of PHASE1_SCRIPT. Drives Phase 2 (intra-node GPU collective) per-node Job lifecycle: 1. Sed-render the cluster-validation-phase2-job-config template ($$NODE, $$RCCL_WORKLOAD_IMAGE) and kubectl apply per input node. 2. Poll-wait every Job up to PHASE2_JOB_WAIT_TIME. 3. Classify by Job condition + container log markers: Complete=True -> passed "phase2 mpirun exited" -> rccl-crash "phase2 bandwidth below threshold" -> bus-bw-below-threshold pending/active past timeout -> timeout 4. Label via GPUOP-690 helpers (label_phase_passed / label_phase_failed). Annotate measured Avg bus bandwidth on both pass and bw-fail paths. 5. SKIP_GPU_MESH_VALIDATION=true short-circuits to pass-label-all. 6. Delete hung Jobs after the wait budget. Depends on GPUOP-690 helpers (PHASE_NODE_LABEL_SCRIPT), GPUOP-767 (cluster-validation-phase2-job-config template), GPUOP-768 (PHASE2_* env vars). Orchestrator wiring is GPUOP-770; /phase2-configs volume mount is GPUOP-771. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-769: Add unit tests for PHASE2_SCRIPT 37 unit test cases split into two layers: * Contract (24 cases): source-level grep / bash -n on extracted PHASE2_SCRIPT body. Confirms every design-doc clause is present (env validation list, sed substitutions, helper invocations, classify markers, cleanup guard). * Behavioral (13 cases): bash + lib/kubectl_mock.sh, sourcing the extracted script under a stripped-down harness with PHASE_NODE_LABEL_SCRIPT helpers. Covers empty input, skip mode, missing env / template, single-node pass / bw-fail / rccl-crash / timeout, three-node mixed, long-hostname hash fallback, submit failure, helper failure tolerance. Coverage maps to 9 of 14 Story TCs from GPUOP-692-test-plan.md (TC1-3, TC5-10); TC4 is GPUOP-768 (env-vars), TC11-12 are GPUOP-770 (orchestrator wiring), TC13-14 are integration / performance and out of unit-test scope. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-770: Wire run_phase2 to source PHASE2_SCRIPT Replace the generic-dispatch run_phase2 stub (delivered by GPUOP-690.4) with a dedicated function that sources $PHASE2_SCRIPT, mirroring the run_phase1 wiring landed by GPUOP-763. See design doc GPUOP-692 section 4 -> "Code Path: orchestrator wiring". Single-function-body change in cluster-validation-job.yaml. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-770: Add unit tests for run_phase2 orchestrator wiring 27 test cases (15 contract, 8 behavioral wiring, 1 DRY_RUN override, 3 regression) covering the dedicated run_phase2 function added in cluster-validation-job.yaml. Mirrors the GPUOP-763 unit-test plan for run_phase1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-771: Implement cluster-validation-job.yaml Mount cluster-validation-phase2-job-config ConfigMap into the CronJob submit-mpijob container at /phase2-configs, paralleling the existing mpi-configs and test-runner-configs mounts so that PHASE2_SCRIPT can sed-render the Phase 2 per-node Job template. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-771: Add unit tests for cluster-validation-job.yaml Unit test plan covering the phase2-configs volumeMount + volume addition to the CronJob submit-mpijob container: YAML-structure parse, source-text grep, kubectl client-side schema validation, and sibling-mount regression guards. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-772: Bash unit tests + sample phase2.log fixtures for PHASE2_SCRIPT Adds tests/test_phase2.sh exercising PHASE2_SCRIPT (GPUOP-769) end-to-end against the existing kubectl mock harness, plus four sample phase2.log fixtures under tests/fixtures/phase2/ for the pass / bw-below-threshold / rccl-crash / failed-no-marker classification paths. Coverage (design GPUOP-692 section 7, test plan GPUOP-692-test-plan.md): * pass (BW above threshold) -> =passed + measured-bw annotation * bus-bw-below-threshold fail (also covers PHASE2_BW_THRESHOLD=9999 inject) * rccl-crash via mpirun-exited marker * default rccl-crash on Failed=True without a recognized marker * timeout via PHASE2_JOB_WAIT_TIME=0 + cleanup delete * SKIP_GPU_MESH_VALIDATION=true short-circuit (case-insensitive) * missing-env fast-fail (phase2-missing-env reason) * missing job template -> job-template-missing reason * parallel-submit ordering pin (all applies precede any get-job poll) * job-creation-failed when kubectl apply returns non-zero * PHASE_NODES env-var fallback when no positional args * PHASE2_RCCL_ENV_VARS contains no IB/fabric tunables (TC4) Extends lib/kubectl_mock.sh with a `logs` verb (the original mock only honored label/annotate/get/apply/delete) and a kubectl_mock_set_pod_log seed helper so per-pod log content can be served as base64-encoded state, mirroring the kubectl_mock_set_pod_for_job pattern from GPUOP-764. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-773: Document Phase 2 intra-node GPU mesh validation in README Add a Phase 2 subsection to example/gpu-validation-cluster/README.md mirroring the structure of the existing Phase 1 subsection: scope, single-node mpirun --np 8 --host localhost all_reduce_perf workload, amd.com/gpu-mesh-validation label, ConfigMap env vars (PHASE2_*), failure-reason annotation values (bus-bw-below-threshold, rccl-crash, xgmi-init-failure, timeout, job-creation-failed), and the SKIP_GPU_MESH_VALIDATION short-circuit behavior. Includes a Threshold Tuning subsection covering: the default 200 GB/s for MI300 xGMI, how to override PHASE2_BW_THRESHOLD via the cluster-validation-config ConfigMap, expected ranges for MI300X/A, MI250/X, MI210, and MI100 SKUs, and a 4-step tuning workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-774: Implement cluster-validation-phase3-job-config Add cluster-validation-phase3-job-config ConfigMap with the per-node NIC health Job template (batch/v1.Job) used by GPUOP-693 Phase 3. Per docs/codie/designs/GPUOP-693-phase3-nic-health-design.md section 4 -> Code Path: Phase 3 Job template: - serviceAccountName: cluster-validation-sa (preexisting, has node label/annotate perms via cluster-validation-role). - securityContext.privileged: true (required for rdma/ibv tooling). - Requests amd.com/nic: $$EXPECTED_NIC_COUNT (sole-tenant on NICs). - Image: docker.io/rocm/network-operator-utils:v1.1.0 (same as Phase 5 launcher init-container). - envFrom cluster-validation-config -- sources PHASE3_CHECK_SCRIPT body, PHASE3_LABEL_KEY, PHASE3_EXPECTED_NIC_COUNT, and the GPUOP-690 label/annotate helper conventions. - NODE_NAME via downward API for in-pod kubectl self-labelling. - Placeholders $$NODE and $$EXPECTED_NIC_COUNT are sed-substituted by PHASE3_SCRIPT (separate Sub-task) at submit time. - backoffLimit: 0, ttlSecondsAfterFinished: 300 (matches Phase 2). Scope is limited to the Job template ConfigMap only. The PHASE3_CHECK_SCRIPT body, PHASE3_* env vars, and orchestrator wiring are tracked under sibling Sub-tasks (GPUOP-693.2/.3+). Verified with kubectl apply --dry-run=client on the outer file (all 8 resources accepted) and on the rendered inner Job template. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-774: Add unit tests for cluster-validation-phase3-job-config 50 test cases covering structural / content / security / resource / command / boundary (no-scope-creep) / rendering / regression categories for the new Phase 3 per-node NIC health Job template ConfigMap. Mapped to 8 Story-level test cases in GPUOP-693 test plan. Python 3 + unittest + PyYAML harness (matches sibling GPUOP-767 Phase 2 ConfigMap tests), with a grep-only fallback for CI environments without Python. Verifies the Job template contract: image docker.io/rocm/network-operator-utils:v1.1.0, privileged: true, amd.com/nic: $$EXPECTED_NIC_COUNT, serviceAccountName cluster-validation-sa, envFrom cluster-validation-config, NODE_NAME downward-API, and the three-line script that materializes and executes $PHASE3_CHECK_SCRIPT. Scope guards confirm the template does NOT inline the PHASE3_CHECK_SCRIPT body, PHASE3_* env values, kubectl label calls, or amd-nic=true nodeSelector -- those are owned by sibling Sub-tasks (GPUOP-693.2 / .3 / .4). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-775: Add PHASE3_CHECK_SCRIPT to cluster-validation-config In-Job per-node NIC health check with 4 structural validations: NIC count (lspci vendor 1dd8), ip link state UP, rdma link state ACTIVE, ibv_devinfo + GID table non-empty. Self-labels the node via in-pod kubectl (cluster-validation-sa). On any failure writes ${PHASE3_LABEL_KEY}=failed plus failure-reason / failed-nics annotations (each truncated to 250 bytes for K8s safety) and exits non-zero. Matches GPUOP-693 design Section 4 -> Code Path: PHASE3_CHECK_SCRIPT. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-775: Add unit tests for cluster-validation-config Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-776: Add Phase 3 ConfigMap env vars Add four Phase 3 (per-node NIC health) ConfigMap data keys to cluster-validation-config so the Phase 3 Job (GPUOP-774 template) can project them via envFrom: * PHASE3_EXPECTED_NIC_COUNT="8" - PCI function count per node * PHASE3_AMD_NIC_VENDOR_ID="1dd8" - Pensando PCI vendor ID * PHASE3_MIN_GID_COUNT="1" - min GID table entries per dev * PHASE3_JOB_WAIT_TIME="120" - per-node Job wallclock budget Values match the defensive defaults already in PHASE3_CHECK_SCRIPT (GPUOP-775) so existing behavior is preserved; this Sub-task makes them ConfigMap-tunable. Design: docs/codie/designs/GPUOP-693-phase3-nic-health-design.md section 4 -> "Code Path: Phase 3 ConfigMap variables". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-776: Add unit tests for cluster-validation-config Phase 3 env vars 32 unit test cases (18 P0, 13 P1, 1 P2) covering presence, value, type discipline, boundary, script-default consistency with PHASE3_CHECK_SCRIPT (GPUOP-775), and regression guards for the 4 new Phase 3 ConfigMap data keys: * PHASE3_EXPECTED_NIC_COUNT * PHASE3_AMD_NIC_VENDOR_ID * PHASE3_MIN_GID_COUNT * PHASE3_JOB_WAIT_TIME Framework: Python 3 + unittest + PyYAML (matches sibling Phase 1/2/3 unit-test plans). Test harness + grep-only fallback inline. Maps to Story GPUOP-693 test plan cases #1, #2, #3, #5, #8, #12. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-777: Implement PHASE3_SCRIPT orchestrator + run_phase3 wiring Adds the outer Phase 3 driver (submit + wait, no result fetch) to cluster-validation-config and replaces the generic run_phase3 stub with dedicated wiring mirroring the run_phase1/run_phase2 pattern. Mounts cluster-validation-phase3-job-config (GPUOP-774) at /phase3-configs so PHASE3_SCRIPT can sed-render the per-node Job template. amd-nic=true intersection and SKIP_NIC_VALIDATION pass- through stay on the orchestrator side (already in place). Per design GPUOP-693 §4 (Code Path: PHASE3_SCRIPT + orchestrator wiring), the in-pod PHASE3_CHECK_SCRIPT (GPUOP-775) self-labels passed/failed via in-pod kubectl, so PHASE3_SCRIPT only writes labels for the cases that prevent the in-pod kubectl from running: submit-failed -> job-creation-failed; timeout -> nic-not-allocated. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-777: Add unit tests for PHASE3_SCRIPT orchestrator 40 test cases across 8 categories covering submit/wait, no-result- fetch contract, SKIP_NIC_VALIDATION pass-through, timeout-> nic-not-allocated fallback, run_phase3 wiring, and phase3-configs volume mount. Maps to 14 cases in GPUOP-693-test-plan.md. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-778: Add unit tests for run_phase3 wiring + /phase3-configs mount Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-779: Add bash unit tests for Phase 3 NIC health check Covers PHASE3_CHECK_SCRIPT (in-Job 4-check NIC health + self-labelling, GPUOP-775) and PHASE3_SCRIPT (outer-driver submit/wait, GPUOP-777) per the GPUOP-693 test plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-779: Add unit tests for Phase 3 NIC health bash test suite Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-779: Unit test results — PASS (21/21) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-780: Add Phase 3 README section + sample failure annotations Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-781: Add Phase 4 server + client Job template ConfigMap Implements GPUOP-694.1 per design doc section 4 ("Code Path: Phase 4 Job templates"). Adds new cluster-validation-phase4-job-config ConfigMap with two batch/v1.Job templates (server and client) for ib_write_bw pairwise rail bandwidth measurement. Templates parameterized on $$NODE, $$PEER_POD_IP, $$RAIL_IDX, and $$NAD_NAME for sed-rendering by PHASE4_DRIVER_SCRIPT (sibling sub-task). Image rocm/roce-workload via $$ROCE_WORKLOAD_IMAGE. envFrom pulls PHASE4_* config vars (GPUOP-694.3). Per-rail NIC requested via amd.com/nic and k8s.v1.cni.cncf.io/networks Multus annotation. Server waits for client connect; client parses "BW average" line and tees output to /shared for driver log collection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-781: Add unit tests for cluster-validation-phase4-job-config * GPUOP-782: Add per-rail NetworkAttachmentDefinition manifests Adds configs/nad-per-rail.yaml with 8 NADs (amd-host-device-nad-rail-0..7) for Phase 4 pairwise rail bandwidth testing. Each NAD uses the host-device CNI plugin pinned to one rail's RDMA netdev (rdma_dev_N), matching PHASE4_IB_DEV_PREFIX. Conditional per the Sub-task description: skip applying if the cluster already ships per-rail NADs via the network-operator. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-782: Add unit tests for per-rail NAD manifests 48 manifest contract assertions across 9 categories: structural, naming, metadata, CNI config, per-rail consistency, no-scope-creep, conditional deployment, Phase 4 cross-Sub-task, and regression guards. Framework: Python 3 + unittest + PyYAML (matches sibling GPUOP-781 / GPUOP-774 conventions). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-783: Add Phase 4 pairwise rail bandwidth ConfigMap env vars Extend cluster-validation-config ConfigMap with 6 Phase 4 (pairwise rail bandwidth test) tunables consumed by the PHASE4_DRIVER_SCRIPT and per-pair server/client Jobs: - PHASE4_RAIL_COUNT = "8" (rails per node) - PHASE4_BW_THRESHOLD = "380" (Gbps minimum per rail) - PHASE4_PAIR_WAIT_TIME = "180" (per-rail wallclock seconds) - PHASE4_MAX_CONCURRENT_PAIRS = "8" (pair-runner concurrency cap) - PHASE4_NAD_NAME_PREFIX = "amd-host-device-nad-rail-" - PHASE4_IB_DEV_PREFIX = "rdma_dev_" Defaults match the MI300 fleet baseline validated on smc300x-ccs-aus-gpuf268. Section follows the Phase 2 / Phase 3 env-var layout and ownership convention (envFrom projection, no code change required to tune). See GPUOP-694 design §4 -> "Code Path: Phase 4 ConfigMap variables". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-783: Add unit tests for Phase 4 ConfigMap env vars 44 test cases (26 P0, 14 P1, 4 P2) covering the 6 new PHASE4_* keys added to cluster-validation-config: - Presence (8 tests): every key from the design doc lands - Value (6 tests): exact defaults per design §1, §2, §4 - Type (7 tests): str discipline + positive-int + NAD/IB-dev prefix format regex - Boundary (8 tests): lower/upper bounds + non-empty prefixes - Phase 4 contract (3 tests): rail-count == NIC-count, label-key consistency, NAD naming match - Regression (9 tests): Phase 1/2/3 blocks and shared helpers (PHASE_NODE_LABEL_SCRIPT, PHASE_FAILURE_REASON_ANNOTATION_SUFFIX) intact; no duplicate keys - File smoke (3 tests): no tabs, comment provenance header, block ordering after PHASE3_SCRIPT Framework: Python 3 + unittest + PyYAML (matches sibling unit-test plans GPUOP-774/-775/-776 for Phase 3). Test harness reference + grep-only CI fallback included in the document. Story TP map covers cases #3, #5, #6, #7, #11, #12, #13, #14, #15, #20; downstream cases (pairing, driver-script, smoke) deferred to sibling Sub-tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-784: Implement PHASE4_DRIVER_SCRIPT in cluster-validation-config Adds the Phase 4 (pairwise rail bandwidth) driver script to the cluster-validation-config ConfigMap per GPUOP-694 design §4 -> "Code Path: PHASE4_DRIVER_SCRIPT". Behavior: 1. Sort + round-robin pair input nodes (already gated to amd.com/nic-health=passed upstream); odd-count last node is pass-labeled with `unpaired=true` annotation. 2. Fork pair_runner per pair, bounded by PHASE4_MAX_CONCURRENT_PAIRS (defaults to 8). 3. pair_runner iterates rails 0..(PHASE4_RAIL_COUNT-1) serially: render+apply server Job (pinned to node A), wait for pod IP, render+apply client Job with that IP (pinned to node B), wait both, parse "BW average" Gbps from client log, record per- (node, rail) result via a tmpdir state surface, cleanup. 4. Aggregate: a node passes iff ALL its rails were >= PHASE4_BW_THRESHOLD. 5. Label via GPUOP-690 helpers (label_phase_passed / label_phase_failed / annotate_phase_value); write per-rail BW annotations and a failed-rails CSV summary. 6. SKIP_RAIL_BANDWIDTH_TEST=true short-circuits to pass-labelling every input node with no Jobs created. Error reasons surfaced per-(node, rail): peer-pod-unready, nad-missing, ib-write-bw-crashed, parse-failed, api-throttled (with up to 3-retry exponential backoff on 429), below-threshold:<bw>. Consumes: PHASE4_* env vars (GPUOP-694.3), Job templates at /phase4-configs/cluster-validation-phase4-{server,client}-job-config.yaml (GPUOP-694.1, mounted by GPUOP-694.6), GPUOP-690 helpers. Wiring to run_phase4 (GPUOP-694.5) and the /phase4-configs mount (GPUOP-694.6) are tracked under their own sub-tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-784: Add unit tests for cluster-validation-config * GPUOP-785: Wire run_phase4 orchestrator to source PHASE4_DRIVER_SCRIPT Replace the generic _run_phase_generic stub for Phase 4 with a dedicated run_phase4 wiring function (mirrors run_phase1/2/3) that sources PHASE4_DRIVER_SCRIPT (GPUOP-694.4). Add /phase4-configs volumeMount + ConfigMap volume so PHASE4_DRIVER_SCRIPT can read the server + client Job templates from cluster-validation-phase4-job-config (GPUOP-694.1 / GPUOP-781). See docs/codie/designs/GPUOP-694-phase4-rail-bandwidth-design.md section 4 -> "Code Path: orchestrator wiring". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-785: Add unit tests for run_phase4 orchestrator wiring 26 unit test cases covering: - run_phase4 interface contract (sources PHASE4_DRIVER_SCRIPT, passes nodes as positional args, exports PHASE_NODES) - Error paths (driver non-zero rc, unset/empty driver script) - Boundary cases (empty/single/many input nodes) - DRY_RUN override - /phase4-configs CronJob volume mount + ConfigMap binding - Phase 4 dispatch site SKIP guard + Phase-3 pool gating - Sibling-phase regression (phases 1-3 unchanged, run_phase5 still uses _run_phase_generic) - Static analysis (YAML parse, bash -n, shellcheck) Maps 6 unit tests to Story TP cases in docs/codie/test-plans/GPUOP-694-test-plan.md. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-787: Add Phase 4 bash unit tests + ib_write_bw log fixtures Adds test_phase4.sh with 19 cases covering PHASE4_DRIVER_SCRIPT: pairing (round-robin even/odd, single-node, empty), all-pass single pair with per-rail BW annotations, single-rail-below- threshold, all-rails-fail, ib_write_bw crash, parse failure, server-pod-unready / nad-missing timeout paths, rail-count override, missing required env, missing job templates, SKIP short-circuit, 16-node concurrency cap honored, and the PHASE_NODES env-var fallback. Sample ib_write_bw client log fixtures live under tests/fixtures/phase4/ (pass, pass-high, below-threshold, crashed, empty). Extends lib/kubectl_mock.sh with two Phase-4-specific routes (pod-ip jsonpath, pods --no-headers shape) plus a concurrency- safe argv capture: the previous tail-back-the-call-log approach races under bounded-parallel pair_runners; now ARGS is captured from "$@" before any I/O and the call-log line is appended with a single atomic write. README.md updated with Phase 4 coverage section and the GPUOP-694 test-plan TC mapping. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-788: Document Phase 4 (Pairwise Rail Bandwidth) in README Adds a Phase 4 section to the gpu-validation-cluster README covering scope, the pairwise per-rail ib_write_bw model, the 380 Gbps configurable threshold, sample per-rail annotations (passing and failing nodes), per-(pair, rail) failure reasons, unpaired-node behavior, SKIP_RAIL_BANDWIDTH_TEST short-circuit semantics, and concurrency-cap tuning guidance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-788: Add unit tests for Phase 4 README documentation Adds a 28-case unit-test plan covering section presence, ConfigMap defaults accuracy, sample annotation correctness, markdown structural lint, failure-reason vocabulary alignment with PHASE4_DRIVER_SCRIPT, SKIP and concurrency content, and pipeline-table regression. Tests are designed to run via tests/test_phase4_readme.sh with a stdlib-only Python md-lint helper. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-789: Rename WAIT_FOR_WORKERS_SCRIPT to PHASE45_PREFLIGHT_SCRIPT Rename the ConfigMap key in cluster-validation-config.yaml from WAIT_FOR_WORKERS_SCRIPT to PHASE45_PREFLIGHT_SCRIPT and update the wait-for-worker-pods init-container in cluster-validation-job.yaml to source the new env var name. Pure rename; no behavior change. Makes room for the enhanced N×N SSH mesh, DNS, MPI spawn, and RCCL topology preflight checks landing in subsequent GPUOP-695 sub-tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-789: Add unit tests for cluster-validation-config rename Test plan covers the PHASE45_PREFLIGHT_SCRIPT rename: presence of new key, removal of old key, init-container env-var sourcing, envFrom wiring preservation, YAML parse, bash -n on extracted body, and byte-exact body invariance vs baseline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-790: Add N*N SSH mesh check to PHASE45_PREFLIGHT_SCRIPT Extends PHASE45_PREFLIGHT_SCRIPT (cluster-validation-config ConfigMap) so that, after the existing launcher->worker SSH readiness wait, every worker pod is probed against every worker IP via kubectl exec + ssh -o ConnectTimeout=5. Failed pairs are collected in failed_pairs and reported all at once at the end of the loop rather than aborting on the first failure, giving the operator a complete fabric picture for triage. See design doc GPUOP-695 §4 (Code Path: Enhanced PHASE45_PREFLIGHT_SCRIPT). The retry-with-timeout launcher->worker readiness probe is preserved as Phase 1; the new N*N matrix is Phase 2. UserKnownHostsFile=/dev/null is added to keep known_hosts from poisoning subsequent pairs when pods share a home volume. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-790: Add unit tests for N*N SSH mesh probe 23 test cases covering the new N*N mesh loop in PHASE45_PREFLIGHT_SCRIPT: cardinality, flag-shape assertions (StrictHostKeyChecking=no, UserKnownHostsFile=/dev/null, ConnectTimeout=5), continue-past-first-failure semantics under set -euo pipefail, degenerate single-pod path, ENABLE_SSH_CHECK short-circuit, and the additive kubectl_mock.sh extensions (kubectl exec / wait verbs, training.kubeflow.org/job-name= selector route, kubectl_mock_set_pair_fail helper). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-791: Add DNS forward+reverse check to PHASE45_PREFLIGHT_SCRIPT Per design GPUOP-695 §4 and Sub-task scope, add a Phase 3 DNS validation block to the launcher init-container script. From the FIRST worker pod, resolve every worker node hostname forward (getent hosts), then reverse-resolve the returned address. Misses are recorded in dns_misses[] and logged as WARN. The check does NOT short-circuit -- it iterates every hostname and continues to subsequent checks. Overall verdict gating (annotate-and-evict on failure) is owned by a separate Sub-task. WORKER_NODES is derived from .spec.nodeName (deduplicated). The in-pod loop runs under the existing `set -euo pipefail` script, so each getent call is guarded with `|| echo MISS` to keep the loop alive. Verified: yaml.safe_load() loads the ConfigMap and `bash -n` on the extracted PHASE45_PREFLIGHT_SCRIPT passes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-791: Add unit tests for DNS forward+reverse check 27 unit test cases covering the new Phase 3 DNS block of PHASE45_PREFLIGHT_SCRIPT: - 5 positive (resolution paths, FIRST_POD targeting, deduplication, renamed-key extraction) - 6 negative (forward miss, reverse miss, multi-miss, all-fail, no-exit-1 regression guard) - 4 boundary (N=1, empty pods, duplicate nodeName, hostname-with-dot) - 5 mock/dependency (jsonpath, JOB_LABELS, FIRST_POD reuse, fwd-then-rev order, awk first-token extraction) - 3 set -e safety (getent failure non-fatal, kubectl exec failure non-fatal, no-unbound-variable defense) - 4 integration (post-mesh sequencing, mesh-fail-skips-dns, ENABLE_SSH_CHECK=false skip, single-node smoke on smc300x-ccs-aus-gpuf268) Each case is mapped to the Story-level test plan (GPUOP-695) by test number. Framework: bash + tests/lib/{kubectl_mock,assert, extract_script}.sh, layered on top of the GPUOP-790 mock extensions. Recommended test file: example/gpu-validation-cluster/tests/test_phase45_dns.sh. Mock extensions required (additive on top of GPUOP-790): - spec.nodeName jsonpath route in kubectl get pods mock - in-pod getent shim driven by kubectl_mock_set_dns_fwd / kubectl_mock_set_dns_rev helpers - kubectl exec mock must run `bash -c <script>` with the getent shim's PATH prepended Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-792: Add mpirun --hostfile no-op spawn check to PHASE45_PREFLIGHT_SCRIPT Per design GPUOP-695 §4 and Sub-task scope, add a Phase 4 MPI spawn validation block to the launcher init-container script. From the FIRST worker pod, write $WORKER_IPS (one per line) to /tmp/hf and invoke `mpirun --hostfile /tmp/hf --np $(wc -l < /tmp/hf) --allow-run-as-root true`. A non-zero exit from the kubectl exec sets mpi_spawn_failed=true. This catches OMPI launcher misconfiguration (missing orted, broken PATH, btl/oob misconfig, partial mca install) before the heavier RCCL probe in a subsequent Sub-task. Mirrors the record-and-continue pattern established by GPUOP-790 (SSH mesh) and GPUOP-791 (DNS): the check MUST NOT short-circuit -- it sets the flag, logs WARN, and continues. Overall verdict gating (annotate-and-evict) is owned by a separate Sub-task. mpi_spawn_failed is initialised to false so later annotate logic can read it under `set -u`. Blank-line entries in the hostfile are filtered via `awk "NF"` to harden against an empty $WORKER_IPS. Verified: yaml.safe_load_all() loads the ConfigMap and `bash -n` on the extracted PHASE45_PREFLIGHT_SCRIPT passes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-792: Add unit tests for mpirun --hostfile no-op spawn check 30 unit test cases covering the new Phase 4 MPI block of PHASE45_PREFLIGHT_SCRIPT: - 7 positive (single-exec on FIRST_POD, hostfile content, np-from-wc-l, mpirun flag set, renamed-key extraction) - 5 negative (mpirun-nonzero, missing binary, exec-failure, no-exit-1 regression guard, warn-continues) - 4 boundary (worker-replicas=1, empty WORKER_IPS, trailing whitespace, IPv6 IPs) - 5 mock/dependency (FIRST_POD/WORKER_IPS reuse, hostfile via tr+awk, --allow-run-as-root present, payload is `true`) - 4 set -e safety (mpirun fail non-fatal, init-before-use ordering, no-unbound-variable on success, side-channel contract with verdict-gating Sub-task) - 5 integration (post-DNS sequencing, dns-misses-do-not-skip-mpi, mesh-fail-skips-mpi, ENABLE_SSH_CHECK=false skip, single-node smoke on smc300x-ccs-aus-gpuf268) Each case is mapped to the Story-level test plan (GPUOP-695) by test number. Framework: bash + tests/lib/{kubectl_mock,assert, extract_script}.sh, layered on top of the GPUOP-790 + GPUOP-791 mock extensions. Recommended test file: example/gpu-validation-cluster/tests/test_phase45_mpi_spawn.sh. Mock extensions required (additive on top of GPUOP-790/791): - In-pod mpirun shim driven by kubectl_mock_set_mpirun_exit / kubectl_mock_mpirun_calls / kubectl_mock_mpirun_capture_hostfile - kubectl_mock_set_exec_exit for simulating pod eviction / exec failure distinct from mpirun failure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-793: Add minimal RCCL topology probe to PHASE45_PREFLIGHT_SCRIPT Phase 5 of the PHASE45_PREFLIGHT_SCRIPT now runs a minimal RCCL topology probe via mpirun + all_reduce_perf with NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=INIT, grepping the first 50 "NCCL INFO comm|topology" lines into the launcher log. The probe is wrapped in a 60s timeout: on timeout (exit 124) the soft-fail path sets rccl_topo_timeout=true and continues per design GPUOP-695 §6 (first-run topology discovery can be slow while NCCL warms caches). Any other non-zero exit sets rccl_topo_failed=true and continues. Both flags are initialised to false for set -u safety. Verdict gating (annotate-and-evict) remains owned by a separate Sub-task, matching the WARN-and-continue pattern already used by the DNS and mpirun no-op spawn checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-793: Add unit tests for RCCL topology probe Unit test plan for the new Phase 5 RCCL topology probe block in PHASE45_PREFLIGHT_SCRIPT. Mirrors the bash-mock-kubectl harness already used by test_phase{1..4}.sh: extract the block scalar, shim kubectl + timeout for deterministic exit codes (0, 1, 124, 137), assert the WARN-and-continue contract is honoured. 15 active cases across interface contract, error conditions (timeout vs generic non-zero), boundary (single worker IP, empty grep matches, set -u flag init), dependency mocks (FIRST_POD, PERF_TEST_DIR, timeout wrapping), error recovery (does-not-short-circuit), and integration (banner ordering). 9 cases map to GPUOP-695 Story TCs 2, 7, 8, 14. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-794: Implement PHASE45_PREFLIGHT_SCRIPT verdict block Adds the aggregate-verdict / annotate-on-failure / abort-MPIJob block at the tail of PHASE45_PREFLIGHT_SCRIPT in cluster-validation-config.yaml. Behaviour (design GPUOP-695 §4 verdict block, §5, §6): * Aggregates four failure flags set by the preceding Phase 4.5 checks (ssh_mesh_failed, dns_failed, mpi_spawn_failed, rccl_topo_failed, rccl_topo_timeout) into a comma-joined reason string and annotates every participating worker node with amd.com/phase4_5-failure-reason=<comma-separated classes> using --overwrite so the next CronJob tick replaces stale values. * HARD failures (ssh-mesh, dns, mpi-spawn, rccl-topology non-timeout) exit 1 -> init-container fails -> launcher fails -> MPIJob fails via the existing failure path. No new orchestrator code needed. * SOFT failure: RCCL timeout (exit 124) per design §6 is annotated but does NOT abort -- first-run NCCL cache warming can be slow and the real test is Phase 5. Without this carve-out every cold-cache run would fail-loop Phase 4.5. * The previously-immediate `exit 1` in the SSH-mesh block is replaced with the same record-and-continue pattern used by DNS/MPI/RCCL so the verdict block sees the complete fabric picture before annotating. * NO new label key is introduced -- per design §5, Phase 4.5 is a gate, not a tested capability per node. Phase 4 labels stay frozen so the audit trail is preserved. * Uses `if; then; fi` rather than `[ ... ] && ...` because the script runs under `set -e` and a short-circuit whose left side is false would abort the script before the verdict block could annotate. * `kubectl annotate` is guarded with a fallback echo so a single failing annotate (e.g. node renamed) does not block the remaining loop iterations or the final FATAL exit. Sanity-checked: * YAML well-formedness via yaml.safe_load_all -- OK. * Embedded script syntax via `bash -n` -- OK. * Verdict logic behavioural sweep (6 scenarios: all-pass, ssh-only, all-hard, only-timeout, timeout+ssh, hard-rccl-beats-timeout) -- all annotation strings and exit codes match the design intent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-795: Implement bash unit tests for PHASE45_PREFLIGHT_SCRIPT Adds test_phase4_5.sh covering the Phase 4.5 pre-flight gate (SSH mesh, DNS, MPI spawn, RCCL topology + verdict block) by extracting the script body from cluster-validation-config.yaml and driving it against the mocked kubectl harness. 12 cases realize test-plan TC2, TC4, TC5, TC6, TC7, TC8, TC9 from docs/codie/test-plans/GPUOP-695-test-plan.md plus extra coverage for the verdict block's hard-vs-soft fail classification (design GPUOP-695 §4 / §6). Mock infrastructure (lib/kubectl_mock.sh) gains three new arms: * `kubectl wait` -- default pass; failure-injectable via FAIL_DIR. * `kubectl exec` -- per-call FIFO response queue under FAIL_DIR; each entry carries (exit_code, base64 stdout). * `kubectl get` -- new phase45-* state routes for the pod-name / pod-IP / nodeName listings the pre-flight uses (Kubeflow training labels, not job-name=). Plus seed helpers: kubectl_mock_set_mpijob_names / set_pod_names / set_pod_ips / set_node_names / queue_exec. Fixtures in tests/fixtures/phase4_5/ carry the canned stdout the DNS and RCCL probes consume. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-795: Record test results for PHASE45_PREFLIGHT_SCRIPT unit tests 12/12 new cases pass; full suite (111/111 across 7 files) remains green. Maps each case to its parent Story test-plan TC entry. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-796: Implement README + Phase 4.5 documentation Add a dedicated Phase 4.5 (Cross-Node Connectivity Matrix Test) section to the validation cluster README documenting the four pre-flight checks (N x N SSH mesh, DNS forward+reverse, mpirun no-op spawn, RCCL topology probe), the amd.com/phase4_5-failure-reason annotation contract, the shared SKIP_RCCL_TEST flag with Phase 5, and the launcher-init-container execution model (no separate pod). Also fix the Phase 4.5 row in the phase table, cross-link the Day-3 incremental-bringup bullet, and document the WAIT_FOR_WORKERS_SCRIPT -> PHASE45_PREFLIGHT_SCRIPT migration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-796: Add unit tests for Phase 4.5 README documentation Add a 26-case unit-test plan for the Phase 4.5 README section covering section presence and heading hierarchy, four-checks table accuracy, sample annotation correctness, Markdown structural lint, SKIP_RCCL_TEST shared-flag content, cross-file drift guards (ConfigMap key + design-doc reason tokens), and pipeline-table / Day-3 cross-link regression. Mirrors the sibling GPUOP-788 README test plan convention and the bash + md_lint.py harness used by other Phase X.8 documentation sub-tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-797: Implement PHASE5_SCRIPT ConfigMap key Lift the pre-refactor inlined Phase 5 body (Steps 3-5 of the submit-mpijob container args plus launcher log retrieval) out of the CronJob args and into a new PHASE5_SCRIPT key in cluster-validation-config.yaml. The orchestrator already wires run_phase5() to source PHASE5_SCRIPT via _run_phase_generic (landed in GPUOP-757); this Sub-task supplies the script body. Per the GPUOP-696 design doc §4 ("Code Path: Lift existing Phase 5 logic into PHASE5_SCRIPT") and the JIRA Sub-task scope note, this Sub-task PRESERVES the body verbatim -- no behavior change. Subsequent Sub-tasks (GPUOP-696.2..696.8) own the substantive refactors: * parameterising on the input node set (dynamic actual_worker_replicas from $@, drop reliance on outer-scope passed_nodes / passed_count) * swapping inlined kubectl-label loops for the label_phase_passed / label_phase_failed helpers (GPUOP-690 label library) * adding the PHASE5_MIN_WORKERS guard * per-worker exit-code annotation + per-worker log dump by node name Two minimal source-vs-exec adjustments (both documented inline): 1. A 2-line input shim at the top -- `passed_nodes="$@"` and `passed_count=$(echo "$passed_nodes" | wc -w)` -- lets the verbatim body run unchanged under the orchestrator's `source $script_path $nodes` calling convention. The pre-refactor body referenced outer-scope `passed_nodes` / `passed_count`; the shim re-binds them from positional args so the verbatim body sees the same variable names. 2. The SKIP_RCCL_TEST early-exit path swaps `exit 0` for `return 0` so the orchestrator can still run cleanup_old_mpijobs, collect_launcher_logs, and the fail-loud exit accounting after Phase 5 returns. `exit 0` in the original code ended the CronJob pod because the code was inlined; under source-from-orchestrator it would kill the orchestrator mid-flight. Files: example/gpu-validation-cluster/configs/cluster-validation-config.yaml Validation: * `python3 -c 'yaml.safe_load(...)'` clean (69 keys, PHASE5_SCRIPT is a ~6000-char literal scalar) * verbatim diff vs commit 1657d978^ Phase 5 block = only the two documented adjustments above + trivial trailing- whitespace trim on one line; the ✅/❌ glyphs are preserved Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-797: Add unit tests for PHASE5_SCRIPT verbatim extraction 25 unit test cases covering the verbatim-preserve scope of GPUOP-696.1: input shim correctness, all sed substitutions in the MPIJob render pipeline, NAD-list rendering (PF/VF/empty branches), the SKIP_RCCL_TEST short-circuit (including the documented exit-0 -> return-0 source-vs-exec adjustment), the pass + fail labeling paths, launcher-log retrieval, and the integration gates that pin extraction fidelity (orchestrator wiring, old block removed from CronJob args, YAML parses clean, verbatim-body diff vs pre-refactor baseline is empty modulo the two documented adjustments). Out-of-scope refactors (helper-based labeling, PHASE5_MIN_WORKERS, per-worker exit-code annotation, per-worker log dumps, dynamic worker count from $@ rather than outer-scope passed_count) are explicitly deferred to the unit-test plans of GPUOP-696.2..696.8 and listed under "Unmapped Story TCs" with their owning Sub-tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-798: Parameterise PHASE5_SCRIPT input + dynamic worker-replicas GPUOP-696.2: encapsulate the Phase 5 body in `run_phase5_main`, which takes the surviving Phase 4.5 node set via its own positional args instead of reading the outer-scope `passed_nodes`. `input_count` (renamed from `passed_count`) now drives `actual_worker_replicas`, which renders the MPIJob `worker-replicas` field. The `WORKER_REPLICAS` ConfigMap value consequently becomes a MAXIMUM (Phase 0 node-selection budget) rather than a target for this script. The script body invokes `run_phase5_main "$@"` at the bottom so the existing orchestrator contract (`source "$script_path" $nodes` in _run_phase_generic) continues to work unchanged. GPUOP-696.6 will move that invocation into the orchestrator. Out of scope (other Sub-tasks): label helper swap (GPUOP-696.3 = -799), per-worker exit-code annotation (-696.4 = -800), PHASE5_MIN_WORKERS guard (-696.5 = -801), orchestrator wiring + old inlined block removal (-696.6 = -802). Validated: YAML parse OK; extracted PHASE5_SCRIPT passes `bash -n`; smoke test confirms input_count derives from function args, body ignores outer-scope passed_nodes, and node-iteration loops use function-local input_nodes only. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-798: Add unit tests for PHASE5_SCRIPT parameterisation 29 active unit test cases covering the GPUOP-696.2 parameterisation contract: run_phase5_main function definition, input-from-function-args binding, dynamic actual_worker_replicas in the rendered MPIJob manifest, no-outer-scope-leak (negative tests with decoy passed_nodes/passed_count in caller scope), and regression guards for the unchanged pre-parameterisation surface (NAD rendering, kubectl wait timeout, launcher log capture, deterministic mpijob name). 22 of 29 cases map to 11 Story-level test cases in GPUOP-696-test-plan.md. 8 Story TCs are out of scope (owned by GPUOP-696.3..696.6 -- helper-based labeling, per-worker annotation, PHASE5_MIN_WORKERS guard, orchestrator wiring + old-block deletion, multi-node testbed required). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-799: Replace inlined kubectl label with GPUOP-690 helper calls Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-799: Add unit tests for PHASE5_SCRIPT GPUOP-690 helper migration 31 active unit test cases covering the GPUOP-696.3 helper-call migration contract: skip / pass / fail label paths invoke label_phase_passed / label_phase_failed with the correct shape ("$n", "$PHASE5_LABEL_KEY", "$phase5_fail_reason"), the existing amd.com/cluster-validation-status label key value is preserved end to end (via the PHASE5_LABEL_KEY indirection), the graceful-degradation wrapper matches the Phase 1-4 convention, candidate-label-removal loops are intentionally left as raw kubectl label, and the previous CLUSTER_VALIDATION_STATUS_LABEL local variable + its reads are gone from the script body. 26 of 31 cases map to 10 Story-level test cases in GPUOP-696-test-plan.md. 9 Story TCs are out of scope (owned by GPUOP-797, GPUOP-696.4..696.6, or require a multi-node testbed). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-800: Per-worker exit-code annotation + per-worker log dump In the PHASE5_SCRIPT MPIJob-fail branch, replace the placeholder `phase5_fail_reason="mpijob-failed"` shared across all input nodes with a per-node lookup that derives `worker-pod=<name>,exit=<code>` from the worker pod that ran on each node and passes it as the `<reason>` arg to `label_phase_failed`. Missing pod or exit code falls back to `unknown` for a deterministic annotation. After launcher-log collection, add a per-worker log dump loop that saves each input node's worker pod logs to `${LOG_DIR}/worker-${node}-${new_job}.log`. Runs on both pass and fail paths so operators can correlate every node's annotation with its log file regardless of outcome. Factor the per-node worker-pod lookup into a function-local `_phase5_worker_pod_for_node` helper used by both the failure-branch label loop and the per-worker log dump loop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-801: Add PHASE5_MIN_WORKERS guard Add PHASE5_MIN_WORKERS ConfigMap key (default "2") and enforce it inside PHASE5_SCRIPT's run_phase5_main. When the surviving Phase 4.5 node set is smaller than PHASE5_MIN_WORKERS, log and return 0 without submitting an MPIJob and without writing any PHASE5 labels. RCCL collectives are multi-node by definition, so a single-worker MPIJob is degenerate; operators can override the floor to 1 for opt-in single-node plumbing validation. The guard runs after the SKIP_RCCL_TEST short-circuit so the existing "skip just labels everything passed" semantics are preserved, and an empty input set takes the same skip branch (matches design GPUOP-696 sect 6 "Empty input set" handling). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-801: Add unit tests for PHASE5_MIN_WORKERS guard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * GPUOP-802: Wire run_phase5 with dedicated PHASE5_SCRIPT sourcing Replace the generic _run_phase_generic dispatch stub for run_phase5 with a dedicated function that mirrors the run_phase1..4 wiring pattern landed via GPUOP-763, GPUOP-770, GPUOP-777, and GPUOP-785. Per design doc GPUOP-696 section 4 ("Code Path: orchestrator wiring + cleanup of old in-CronJob block"), run_phase5 now: * writes PHASE5_SCRIPT to /tmp/run-phase5.sh, chmods +x * sources the script once to register run_phase5_main * invokes run_phase5_main "$nodes" with the input node list as a single space-separated argument * exports PHASE_NODES for parity with run_phase1..4 * defends against PHASE5_SCRIPT not yet being shipped (no-op return) and against the script not defining run_phase5_main (warn + re…

…(#1539) * fix(gpu-validation-cluster): wait for cert-manager webhook before GPU-operator install The server bringup raced the cert-manager admission webhook: install_cert_manager returned without waiting for readiness, and install_amd_gpu_operator then applied Certificate/Issuer objects that the webhook must validate. During post-start k3s flapping the webhook briefly loses its endpoints, so the helm install failed with "no endpoints available for service cert-manager-webhook" and (under set -e) aborted before installing the network operator, driver, mpi-operator, and CVF. Add wait_for_cert_manager_webhook (rollout status + Endpoints poll) and call it after the cert-manager install and before each GPU-operator install attempt. Replace the helm-list skip with a helm-status check so a failed release is uninstalled and retried instead of being treated as already installed. Plan: docs-internal/knowledge/plans/2026-06-12-cvf-cert-manager-webhook-race.md Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * fix(gpu-validation-cluster): make server bringup self-healing and verified Plain setup-cluster.yml could report success while installing no operators: the install ran as a single set -e pass inside a backgrounded gpu-cluster.sh, and any kubectl/helm blip during post-start k3s instability aborted it with no resume. The instability was partly self-inflicted -- configure_server_registries did pkill -9 k3s AND launched its own k3s server via docker exec -d, fighting the entrypoint supervisor for port 6443 ("stabilized after N restarts"). Changes to gpu-cluster.sh: - configure_server_registries: only kill k3s and let the supervisor relaunch it (single source-of-truth args), removing the competing manual restart. - Factor post-start steps into run_bringup_steps and run them in a bounded retry loop (subshell set -e + outer set +e capture) so the idempotent sequence self-heals once k3s settles. - wait_for_cert_manager_webhook before the GPU-operator install; GPU-operator install now checks helm status (uninstall+retry a failed release) instead of treating a failed release as installed. - Progress-aware pull waits: wait_for_rollout (rollout status under a deadline, fast-fail on auth/not-found pull errors) and wait_for_path (deadline-based, server-mode fatal-pull fast-fail) replace fixed-count loops in prepare_multus_artifacts. Changes to setup-cluster.yml: - Verify cert-manager, amd-gpu-operator, amd-network-operator helm releases reach STATUS=deployed and the CVF CronJob exists (when enabled), failing loudly instead of reporting false success. Plan: docs-internal/knowledge/plans/2026-06-12-cvf-cert-manager-webhook-race.md Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * fix(gpu-validation-cluster): robust docker apt setup + zero-match selector diagnostic Two issues found during a fresh-machine (docker fully purged) redeploy: 1. Docker install failed on a node that still had docker apt source(s) from prior provisioning. setup-cluster.yml used the deprecated apt_key (global trusted.gpg) and an apt_repository entry with no signed-by; a pre-existing /etc/apt/sources.list.d/docker.list (and an Ansible-generated download_docker_com_linux_ubuntu.list) declared the same repo with a conflicting/empty signed-by, so `apt update` aborted with "Conflicting values set for option Signed-By ... download.docker.com" before docker could install. Now: probe docker first; when (re)installing, remove ALL pre-existing docker source files BEFORE the initial apt cache update, write the key to /etc/apt/keyrings/docker.asc (force refresh), and add one canonical docker.list with explicit signed-by. A healthy host with docker already present is left untouched. 2. A custom node-selector label (e.g. cvf-candidate=true) that nothing applies automatically made Phase 0 match 0 nodes, surfacing only as a generic "Insufficient candidates". The candidate-selection script now warns explicitly when the selector matches 0 nodes, dumps current node labels, and prints the kubectl label command; the skip message states the cause (zero-match vs all-busy/recent). README 0.2 documents that custom labels must be applied by the operator. Plan: docs-internal/knowledge/plans/2026-06-12-cvf-cert-manager-webhook-race.md Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * feat(gpu-validation-cluster): optional skip of network operator for GPU-only clusters Add network-operator.install-network-operator (default true). When false, gpu-cluster.sh skips the network-operator helm install, the NetworkConfig CR, Multus artifact prep, and per-rail NAD apply; the playbook's post-install verification drops amd-network-operator from the required-releases list. cert-manager, the GPU operator, mpi-operator, and the CVF CronJob still install cleanly -- verified end to end on a 2-node lab (GPU+mpi+CVF up, no kube-amd-network namespace, CVF trigger selects nodes and submits Phase 1). Important jq fix: read the flag with `if . == null then true else . end`, NOT `// true`. jq's // alternative operator treats false as empty, so `false // true` => true and the flag would silently never take effect. Plan: docs-internal/knowledge/plans/2026-06-12-cvf-cert-manager-webhook-race.md Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * fix(gpu-validation-cluster): Phase 1 verdict from Job condition + always persist pod logs Phase 1 GPU HW-acceptance misreported passing stages as failed (stage-xgmi-lvl1:test-runner-did-not-emit-results) whenever the orchestrator could not read the per-stage result.json. That happened two ways: a multi-node cluster where the orchestrator runs on a different node than the worker (result.json is written only to the worker's node-local hostPath), and a single node where the UID-1001 orchestrator could not read the root-owned result file. Make the Kubernetes Job condition the sole source of truth for Phase 1 pass/fail (Complete=True -> pass, Failed=True -> fail), matching the original pre-refactor CVF design. result.json is no longer read at all, which also removes a latent multi-node mis-attribution bug where the glob-by-name lookup could match a different node's same-named file. Always persist each test-runner pod log to results_root (/var/log/cluster-validation/<job>_pod.log) for triage, pass or fail, via kubectl logs (node-agnostic). Restore runAsUser:0 on the orchestrator so it can write that root-owned hostPath -- this also fixes the orchestrator's own cronjob-*.log write and Phase 5 launcher/worker log persistence. Remove now-dead result-parsing code (_phase1_parse_result, find/glob lookup, kubectl cp fallback, gz decompress, stage_start_epoch). Rewrite Phase 1 unit tests to seed Job conditions instead of result fixtures; drop the unused result-*.json fixtures. Plan: docs-internal/knowledge/plans/2026-06-12-cvf-phase1-result-detection.md Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * fix(gpu-validation-cluster): address PR review nits (comment accuracy + idempotent change reporting) - gpu-cluster.sh: correct the rollout-wait comment to match actual behavior (per-attempt cap is 60s; the final attempt shrinks to the remaining time before the hard deadline -- not a ~30s floor). - setup-cluster.yml: report the docker apt-source cleanup task as changed only when files were actually removed, instead of changed_when: true on every run. Plan: docs-internal/knowledge/plans/2026-06-12-cvf-phase1-result-detection.md Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com>

…for open-source Removes internal GPUOP-xxx ticket IDs and docs-internal/ path references from comments and test fixtures so the open-source distribution carries no pointers to pensando-internal trackers/design docs. Comment-only changes; no functional impact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

yansun1996 and others added 3 commits June 24, 2026 02:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync gpu-validation-cluster: multi-phase validation framework + robustness fixes#583

Sync gpu-validation-cluster: multi-phase validation framework + robustness fixes#583
yansun1996 wants to merge 3 commits into
ROCm:mainfrom
yansun1996:cvf-sync-rocm

yansun1996 commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yansun1996 commented Jun 24, 2026

Summary

Scope / notes

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant