Skip to content

Sync gpu-validation-cluster: multi-phase validation framework + robustness fixes#583

Open
yansun1996 wants to merge 3 commits into
ROCm:mainfrom
yansun1996:cvf-sync-rocm
Open

Sync gpu-validation-cluster: multi-phase validation framework + robustness fixes#583
yansun1996 wants to merge 3 commits into
ROCm:mainfrom
yansun1996:cvf-sync-rocm

Conversation

@yansun1996

Copy link
Copy Markdown
Member

Summary

Syncs the example/gpu-validation-cluster example from the pensando source repo into ROCm. ROCm's copy was at the single-phase state (#1206-equivalent); this brings it up to the current multi-phase Cluster Validation Framework.

Cherry-picks two commits (authored by Yan Sun), scoped to example/gpu-validation-cluster/ only:

  1. Multi-Phase GPU Cluster Validation for Datacenter Bringup (pensando #1473) — the 5-phase gated pipeline (GPU HW acceptance, intra-node mesh, NIC health, pairwise rail bandwidth, multi-node RCCL), CronJob orchestrator, Ansible day-2 playbooks, and a full bats-style unit-test suite under tests/.
  2. Self-healing bringup + robustness fixes (pensando #1539).

Scope / notes

  • example/ only. docs-internal/ design docs and knowledge notes from the source commits were intentionally excluded (not part of the open-source distribution).
  • No internal references. A cleanup commit strips internal JIRA IDs (GPUOP-xxx) and docs-internal/ path mentions from comments and test fixtures — comment-only, no functional impact.
  • Applies cleanly: ROCm's folder was byte-identical to the pensando #1206 baseline, so the two commits layer on without conflict.

Validation

  • config.json parses; the cluster-validation YAMLs are structurally valid.
  • Functionally exercised on a 2-node lab cluster (Phase 4 pairwise ib_write_bw and Phase 5 multi-node RCCL all_reduce_perf both pass).

🤖 Generated with Claude Code

yansun1996 and others added 3 commits June 24, 2026 02:19
…(#1473)

* GPUOP-689: Add codie design docs for 7 multi-phase validation stories

Adds design documents for all Stories under Epic GPUOP-689
(Multi-Phase GPU Cluster Validation for Datacenter Bringup):

- GPUOP-690: Configuration framework + multi-phase CronJob orchestrator
- GPUOP-691: Phase 1 — Per-Node GPU Hardware Acceptance
- GPUOP-692: Phase 2 — Intra-Node GPU Collective Test
- GPUOP-693: Phase 3 — Per-Node NIC Health Check
- GPUOP-694: Phase 4 — Pairwise Rail Bandwidth Test
- GPUOP-695: Phase 4.5 — Cross-Node Connectivity Matrix Test
- GPUOP-696: Phase 5 — Multi-Node RCCL Collective Test

Each design references the shared GPUOP-690 contract (uniform label keys,
helper functions, per-phase skip flags, run_phaseN orchestrator stubs)
so the 6 phase Stories can land independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-689: Add codie test plans for 7 multi-phase validation stories

Adds test plans for each Story under Epic GPUOP-689:

- GPUOP-690: Configuration Framework      (16 cases: 5 P0, 9 P1, 2 P2)
- GPUOP-691: Phase 1 GPU HW Acceptance    (15 cases: 5 P0, 9 P1, 1 P2)
- GPUOP-692: Phase 2 GPU Mesh             (14 cases: 4 P0, 10 P1)
- GPUOP-693: Phase 3 NIC Health           (16 cases: 5 P0, 10 P1, 1 P2)
- GPUOP-694: Phase 4 Pairwise Rail BW     (20 cases: 5 P0, 12 P1, 3 P2)
- GPUOP-695: Phase 4.5 Connectivity Mesh  (14 cases: 4 P0, 9 P1, 1 P2)
- GPUOP-696: Phase 5 Multi-Node RCCL      (19 cases: 6 P0, 11 P1, 2 P2)

Total: 114 test cases across Positive, Negative, Boundary,
Error Recovery, Concurrency, Integration, Upgrade/Downgrade,
and Performance categories.

Per-test-case files deferred — will be generated during
/codie:task-test-plan for each Sub-task as engineers implement,
since test-step CLIs depend on actual code landed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-754: Replace skip-tests block with 5 per-phase skip flags

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-754: Add unit tests for config.json schema extension

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-755: Add per-phase env vars and label-key constants to ConfigMap

Replace the single SKIP_GPU_VALIDATION placeholder with five
per-phase skip flags (SKIP_GPU_HW_ACCEPTANCE,
SKIP_GPU_MESH_VALIDATION, SKIP_NIC_VALIDATION,
SKIP_RAIL_BANDWIDTH_TEST, SKIP_RCCL_TEST) aligned with the new
config.json schema landed in GPUOP-754.

Add uniform PHASE{1..5}_LABEL_KEY constants and
PHASE_FAILURE_REASON_ANNOTATION_SUFFIX so every downstream phase
consumes the same label keys rather than hard-coding their own.

Owned by GPUOP-690; this is the ConfigMap env-var surface that
downstream Sub-tasks (helper library, orchestrator, per-phase
scripts) read from.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-755: Add unit tests for ConfigMap env-var surface

30 test cases covering the cluster-validation-config ConfigMap
contract introduced by GPUOP-755:

- 6 SKIP_* placeholder presence + templating-sentinel value checks
- 6 PHASE{N}_LABEL_KEY + PHASE_FAILURE_REASON_ANNOTATION_SUFFIX
  value assertions against the design doc
- 2 backward-incompat guards (legacy SKIP_GPU_VALIDATION absent)
- 5 type/format checks (label namespacing, kebab-case)
- 9 regression guards (surrounding ConfigMap content preserved)
- 2 YAML parse / duplicate-key checks

Framework: Python 3 + unittest + PyYAML. Fallback grep harness
included for envs without PyYAML.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-756: Add PHASE_NODE_LABEL_SCRIPT helper library

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-756: Add unit tests for PHASE_NODE_LABEL_SCRIPT

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-757: Rewrite CronJob as phased orchestrator with run_phaseN stubs + DRY_RUN

Replace the linear submit-mpijob container args with a phased state
machine implementing the design in GPUOP-690 §4. The orchestrator:

  * sources PHASE_NODE_LABEL_SCRIPT helpers from the ConfigMap
  * walks Phase 0 -> 1 -> 2 -> 3 -> 4 -> 4.5 -> 5
  * gates each phase on SKIP_<PHASE> flags from the ConfigMap
  * narrows the live-node pool between phases via filter_passed_nodes
    keyed on PHASE{N}_LABEL_KEY
  * intersects with feature.node.kubernetes.io/amd-nic=true before
    Phase 3 so NIC-requiring phases only see NIC-capable nodes
  * tracks per-phase failures and exits non-zero at the end if any
    phase produced failed nodes (fail-loud, design §6)
  * provides DRY_RUN=1 that replaces run_phaseN with no-ops, skips
    Phase 0 candidate labeling, and prints the planned phase order
    + per-phase pool without any kubectl writes
  * cleans up old MPIJobs and collects launcher logs only when
    Phase 5 actually ran

run_phaseN stubs print a banner and source $PHASE{N}_SCRIPT (from
the ConfigMap) if defined; otherwise no-op. Downstream Stories
(GPUOP-691..696) slot in their per-phase scripts via ConfigMap
without orchestrator changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-757: Add unit tests for CronJob phased orchestrator

68 test cases across 9 categories, framework Python 3 + unittest +
PyYAML + bash mock-kubectl-on-PATH (matches GPUOP-754/755/756). Maps
to GPUOP-690 Story test cases TC1, TC2, TC3, TC4, TC5, TC6, TC8,
TC11, TC12, TC13, TC14.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-758: Default candidate selection to amd-gpu=true only

Drop the joined amd-nic=true requirement from the default node selector
so a datacenter early in bringup can run Phase 1 (GPU HW acceptance)
and Phase 2 (intra-node GPU mesh) before any NIC infrastructure exists.
NIC-requiring phases (3, 4) narrow the candidate set inside their own
run_phaseN scripts via intersect ... amd-nic=true, per GPUOP-690 design
sections 4 and 5.

Files:
  - configs/config.json: default node-selector-labels reduced to
    [feature.node.kubernetes.io/amd-gpu=true].
  - configs/cluster-validation-config.yaml: inline ConfigMap docs
    updated to explain the new default and where NIC narrowing
    happens; NODE_SELECTOR_LABELS placeholder unchanged.
  - gpu-cluster.sh: operator's templated jq fallback default for
    node-selector-labels reduced to the GPU-only list to match.

Resource-busy checks and NODE_VALIDATION_INTERVAL_MINS skip behavior
are unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-758: Add unit tests for candidate-selection default

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-759: Add bash unit tests for helper library + orchestrator DRY_RUN

Adds a hand-rolled bash test harness under example/gpu-validation-cluster/tests/
covering the PHASE_NODE_LABEL_SCRIPT helper library (GPUOP-756) and the
multi-phase CronJob orchestrator's DRY_RUN=1 mode (GPUOP-757).

Helper library tests (26 cases):
* label_phase_passed / label_phase_failed / annotate_phase_value /
  filter_passed_nodes positive paths
* argument validation (empty / wrong arity returns non-zero, no kubectl
  side effects)
* kubectl failure propagation
* contract invariants: --overwrite on every write, diagnostics to stderr

Orchestrator DRY_RUN tests (5 cases) from design doc section 7:
* all 5 skip flags true -> exit 0, zero kubectl writes
* phases 1+2 enabled, 3-5 skipped -> Phase 2 followed by SKIP_* pass-through
* Phase 3 enabled but no amd-nic=true nodes -> empty pool, exit 0
* empty Phase 0 candidate pool -> exit 0
* cleanup / log collection honor DRY_RUN

Infrastructure:
* lib/assert.sh -- hand-rolled assertions (no bats dependency)
* lib/kubectl_mock.sh -- recording kubectl shim installed on PATH
  (function override is unreliable for sourced sub-shells)
* lib/extract_script.sh -- pure-bash/awk extractor for YAML block
  scalars (no PyYAML / yq dependency)
* run_all.sh -- entry point; exits 0 only when every file reports
  zero failures

All 31 tests pass locally with bash 5.x.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-760: Add multi-phase pipeline docs and SKIP_GPU_VALIDATION migration note

Document the 5-phase validation pipeline (phases / labels / skip flags
table), new per-phase amd.com/* label keys, the incremental-bringup
default posture, DRY_RUN=1 orchestrator testing mode, and the migration
mapping from the removed skip-gpu-validation flag to the new 5-flag
skip-tests block. Update the per-node status legend to include the
per-phase labels and failure-reason annotations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-760: Add unit tests for README + migration note

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-761: Extend GPU_VALIDATION_TESTS_JSON with AGFHC recipes; bump TEST_RUNNER_JOB_WAIT_TIME

Add three AGFHC TestCases (xgmi_lvl1, pcie_lvl1, hbm_lvl1) to the MANUAL
TestCases array under
TestConfig.GPU_HEALTH_CHECK.TestLocationTrigger.global.TestParameters.
All use StopOnFailure: true, TimeoutSeconds: 600, Iterations: 1 per
GPUOP-691 design doc section 4.

Bump TEST_RUNNER_JOB_WAIT_TIME from 1200s to 3600s to cover the longer
combined test set (RVS gst_single ~20 min + 3 AGFHC level-1 recipes,
~50 min worst case) plus pod scheduling / result upload headroom.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-761: Add unit tests for GPU_VALIDATION_TESTS_JSON + TEST_RUNNER_JOB_WAIT_TIME

28 test cases (5 structural, 8 content, 4 numeric/budget, 6 regression,
3 boundary, 2 cross-source) covering the ConfigMap contract for the
expanded TestCases array and bumped wait-time budget. All 28 cases
mapped to GPUOP-691 Story test plan TC1/TC2/TC4/TC5/TC9.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-762: Add PHASE1_SCRIPT to cluster-validation-config ConfigMap

Implements GPUOP-691 design §4 -> "Code Path: PHASE1_SCRIPT".
Sourced by GPUOP-690 orchestrator's run_phase1 via _run_phase_generic.

Behavior:
1. Parallel-submit Test Runner Job per input node (existing template).
2. Poll-wait all Jobs up to TEST_RUNNER_JOB_WAIT_TIME.
3. Parse result.json (local hostPath, fallback kubectl cp) for failed
   sub-test name; recipe-not-found surfaced as a distinct reason.
4. Label via GPUOP-690 helpers (label_phase_passed / label_phase_failed
   + annotate_phase_value failed-subtest=<name>).
5. SKIP_GPU_HW_ACCEPTANCE=true early-exit pass-labels every input node.
6. Cleanup hung (timed-out) Jobs at the end.

Missing result.json -> failed reason test-runner-did-not-emit-results,
subtest=unknown. Missing required env -> all nodes failed with reason
phase1-missing-env:<list>. kubectl apply failure on a single node fails
just that node with reason job-creation-failed; other nodes proceed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-763: Wire run_phase1 orchestrator to source PHASE1_SCRIPT

Replace the generic-dispatch stub for run_phase1 (delivered by
GPUOP-690.4 as a one-liner delegating to _run_phase_generic) with
a dedicated function body that:

  * prints a Phase 1 banner with input nodes
  * returns 0 cleanly when the input pool is empty
  * returns 0 cleanly when PHASE1_SCRIPT is unset (so a partial
    rollout that has not yet shipped PHASE1_SCRIPT does not break
    the orchestrator)
  * writes $PHASE1_SCRIPT to /tmp/run-phase1.sh, chmod +x, and
    sources it with the input nodes as positional args
  * exports PHASE_NODES (documented input handle for per-phase
    scripts)
  * preserves the existing fail-tolerant pattern (warn on non-zero
    rc, always return 0 so other phases can continue)

run_phase2..run_phase5 remain on _run_phase_generic until their
respective PHASE{N}_SCRIPT bodies land. The DRY_RUN run_phase1
override at the dry-run block is unchanged.

Files: example/gpu-validation-cluster/configs/cluster-validation-job.yaml

Design: docs/codie/designs/GPUOP-691-phase1-gpu-hw-acceptance-design.md
        Section 4 -> "Code Path: orchestrator wiring"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-763: Add unit tests for run_phase1 orchestrator wiring

25 unit tests covering the dedicated run_phase1 body added in
commit 8d51b91:

  * 14 contract tests (bash parse, function presence, source-level
    grep for tmp-path write, chmod, source-with-nodes, PHASE_NODES
    export, empty/unset guards, fail-tolerant warn pattern,
    regression guards that other phases still use the generic
    dispatch, DRY_RUN override intact, call site unchanged, no
    kubectl in the body)
  * 8 behavioral tests (probe-script-driven: empty input, unset
    PHASE1_SCRIPT, single node, multi-node positional expansion,
    PHASE_NODES exported visible to child processes, tmp file
    exists + executable, non-zero rc warns but returns 0, zero rc
    no warning)
  * 1 DRY_RUN override test (override at line 585 still shadows
    the new dedicated body)
  * 2 regression tests (filter_passed_nodes call unchanged;
    existing GPUOP-759 orchestrator dry-run suite still passes)

Suggested test file:
  example/gpu-validation-cluster/tests/test_run_phase1_wiring.sh

Helper extension needed:
  extract_bash_function() in lib/extract_script.sh

Mapped to Story TCs: TC2, TC3, TC4, TC8, TC11, TC13
(downstream test cases TC1, TC5-7, TC9-12, TC14, TC15 are out of
scope for the wiring Sub-task; they live in PHASE1_SCRIPT /
GPUOP-764).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-764: Add bash unit tests for PHASE1_SCRIPT

Realizes test cases TC1, TC2, TC4, TC5, TC6, TC7, TC8, TC10 from the
GPUOP-691 Phase 1 test plan against the PHASE1_SCRIPT body embedded in
configs/cluster-validation-config.yaml (GPUOP-762).

New files:
- tests/test_phase1.sh: 12 tests covering empty input, single-node pass,
  single-node hbm fail, mixed pass/fail across 2 nodes, missing
  result.json -> result-parse-failed, recipe-not-found -> recipe-not-
  found, SKIP_GPU_HW_ACCEPTANCE=true short-circuit, job-creation
  failure, poll-wait timeout, dry-run skip annotation, idempotent
  --overwrite labels, helper-failure propagation. The test sources
  PHASE1_SCRIPT after sed-patching the hardcoded job-template path,
  results root, and timestamp generation so job names are deterministic
  (PHASE1_TEST_TS) and templates resolve to test fixtures.
- tests/fixtures/phase1/result-pass.json: 4 sub-tests all PASS.
- tests/fixtures/phase1/result-hbm-fail.json: hbm_lvl1 FAIL, others PASS.
- tests/fixtures/phase1/result-recipe-not-found.json: error string
  matches the recipe-not-found jq regex.

Mock extensions (tests/lib/kubectl_mock.sh):
- Early-route detection for
  'kubectl get pods -l job-name=X -o jsonpath={.items[-1:].metadata.name}'
  so the pod lookup runs before the generic '-l' selector arm exits.
- jsonpath handler for
  '{.status.conditions[?(@.type=="Complete"|"Failed")].status}' on
  job objects, seeded via new kubectl_mock_set_job_condition helper.
- Fail-injection support for apply|delete|patch|create verbs (was
  previously hardcoded exit 0); mirrors one-shot / .sticky semantics
  used by label|annotate.
- SIGPIPE-safe stdin drain at the top of the mock shim. Without this,
  callers like 'sed ... | kubectl apply -f -' under 'set -o pipefail'
  intermittently saw rc=141 because the mock exited before reading
  stdin, and the script then took the job-creation-failed path.
- New helpers: kubectl_mock_set_job_condition,
  kubectl_mock_set_pod_for_job.

All 43 tests across the suite pass (test_assert: 5, test_phase1: 12,
test_phase0_helpers: 26). Verified stable across 10 sequential runs
after the SIGPIPE fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-765: Remove obsolete TEST_RUNNER_SUCCESS_LABEL / TEST_RUNNER_FAILURE_LABEL references

Phase 1 now uses the uniform PHASE1_LABEL_KEY (= amd.com/gpu-hw-acceptance)
written via the GPUOP-690 PHASE_NODE_LABEL_SCRIPT helper library. The
legacy `TEST_RUNNER_SUCCESS_LABEL` / `TEST_RUNNER_FAILURE_LABEL` env vars
in cluster-validation-config.yaml are obsolete; delete the env vars and
the entire `GPU_VALIDATION_TEST_SCRIPT` block (whose role is fully
replaced by `PHASE1_SCRIPT` sourced from the orchestrator's run_phase1
in cluster-validation-job.yaml). Replace the deleted blocks with
inline migration-note comments so deployers and downstream consumers
can trace the rename. Update the now-stale `GPU_VALIDATION_TEST_SCRIPT`
reference in the surviving PHASE1_SCRIPT polling-cadence comment.

Verified:
* YAML still parses (1 doc in config, 6 in job).
* GPU_VALIDATION_TESTS_JSON / TESTS_JSON still valid JSON.
* PHASE1_SCRIPT, PHASE_NODE_LABEL_SCRIPT,
  CRONJOB_CANDIDATE_NODES_SELECTION_SCRIPT, WAIT_FOR_WORKERS_SCRIPT,
  VALIDATE_RCCL_TEST_SCRIPT, RCCL_ENV_VARS all pass `bash -n`.
* No matches for the removed identifiers remain in either configs
  YAML except the intentional migration-note comments.

Design: docs/codie/designs/GPUOP-691-phase1-gpu-hw-acceptance-design.md
        section 4 (recipe expansion section, last paragraph).
Parent: GPUOP-691.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-765: Add unit tests for legacy label-var / script-block removal

38 test cases covering:
* Removal contract (top-level keys + inline shell-script references
  for both removed env vars and the removed legacy script block)
* Migration-note guard (deletion is documented inline so future
  readers can trace the rename without git archaeology)
* Replacement contract (PHASE1_LABEL_KEY, helper library, PHASE1_SCRIPT
  cover the roles vacated by the removed symbols)
* Companion-file guard (cluster-validation-job.yaml is also clean)
* Regression guards for every sibling ConfigMap key added or
  preserved by GPUOP-755 / GPUOP-756 / GPUOP-761 / GPUOP-763
* File-parse / bash-syntax structural checks on every embedded script

Framework: Python 3 + unittest + PyYAML (matches GPUOP-755 /
GPUOP-756 / GPUOP-761 / GPUOP-763 conventions for the same artifact);
includes a grep-based shell fallback harness for environments
without Python.

Story TC mapping: TC1, TC2, TC4, TC5, TC13.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-766: Document Phase 1 GPU HW acceptance in README

Add dedicated Phase 1 subsection covering scope, the 4 sub-tests
(RVS gst_single + AGFHC xgmi_lvl1/pcie_lvl1/hbm_lvl1), the
amd.com/gpu-hw-acceptance result label, sample failure annotation
output (-failure-reason and -failed-subtest), the failure-reason
catalog, and the SKIP_GPU_HW_ACCEPTANCE short-circuit behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-766: Add unit tests for Phase 1 README docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-767: Add cluster-validation-phase2-job-config ConfigMap

New ConfigMap holding the batch/v1.Job template for the Phase 2
single-node 8-GPU RCCL all_reduce_perf test. Mirrors the
test-runner Job pattern but uses RCCL_WORKLOAD_IMAGE, requests
all 8 GPUs, and runs mpirun --host localhost with the PHASE2_*
sizing knobs from cluster-validation-config. Reuses
VALIDATE_RCCL_TEST_SCRIPT for bandwidth threshold check. Uses
\$\$NODE placeholder for sed substitution by PHASE2_SCRIPT.

See docs/codie/designs/GPUOP-692-phase2-gpu-mesh-validation-design.md
section 4 -> "Code Path: Phase 2 Job template ConfigMap".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-767: Add unit tests for cluster-validation-phase2-job-config

52 test cases covering structural / content / resource / command /
boundary (no-IB invariant) / rendering / regression categories for
the new Phase 2 per-node Job template ConfigMap. Mapped to 9
Story-level test cases in GPUOP-692 test plan. Python 3 + unittest
+ PyYAML harness, with a grep-only fallback for CI environments
without Python.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-768: Add Phase 2 env vars + PHASE2_RCCL_ENV_VARS to cluster-validation-config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-768: Add unit tests for Phase 2 env vars + PHASE2_RCCL_ENV_VARS

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-769: Add PHASE2_SCRIPT to cluster-validation-config

Mirror of PHASE1_SCRIPT. Drives Phase 2 (intra-node GPU collective)
per-node Job lifecycle:
  1. Sed-render the cluster-validation-phase2-job-config template
     ($$NODE, $$RCCL_WORKLOAD_IMAGE) and kubectl apply per input node.
  2. Poll-wait every Job up to PHASE2_JOB_WAIT_TIME.
  3. Classify by Job condition + container log markers:
       Complete=True                          -> passed
       "phase2 mpirun exited"                 -> rccl-crash
       "phase2 bandwidth below threshold"     -> bus-bw-below-threshold
       pending/active past timeout            -> timeout
  4. Label via GPUOP-690 helpers (label_phase_passed / label_phase_failed).
     Annotate measured Avg bus bandwidth on both pass and bw-fail paths.
  5. SKIP_GPU_MESH_VALIDATION=true short-circuits to pass-label-all.
  6. Delete hung Jobs after the wait budget.

Depends on GPUOP-690 helpers (PHASE_NODE_LABEL_SCRIPT), GPUOP-767
(cluster-validation-phase2-job-config template), GPUOP-768 (PHASE2_*
env vars). Orchestrator wiring is GPUOP-770; /phase2-configs volume
mount is GPUOP-771.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-769: Add unit tests for PHASE2_SCRIPT

37 unit test cases split into two layers:
* Contract (24 cases): source-level grep / bash -n on extracted
  PHASE2_SCRIPT body. Confirms every design-doc clause is present
  (env validation list, sed substitutions, helper invocations,
  classify markers, cleanup guard).
* Behavioral (13 cases): bash + lib/kubectl_mock.sh, sourcing the
  extracted script under a stripped-down harness with
  PHASE_NODE_LABEL_SCRIPT helpers. Covers empty input, skip mode,
  missing env / template, single-node pass / bw-fail / rccl-crash /
  timeout, three-node mixed, long-hostname hash fallback, submit
  failure, helper failure tolerance.

Coverage maps to 9 of 14 Story TCs from GPUOP-692-test-plan.md
(TC1-3, TC5-10); TC4 is GPUOP-768 (env-vars), TC11-12 are GPUOP-770
(orchestrator wiring), TC13-14 are integration / performance and
out of unit-test scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-770: Wire run_phase2 to source PHASE2_SCRIPT

Replace the generic-dispatch run_phase2 stub (delivered by
GPUOP-690.4) with a dedicated function that sources $PHASE2_SCRIPT,
mirroring the run_phase1 wiring landed by GPUOP-763. See design
doc GPUOP-692 section 4 -> "Code Path: orchestrator wiring".

Single-function-body change in cluster-validation-job.yaml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-770: Add unit tests for run_phase2 orchestrator wiring

27 test cases (15 contract, 8 behavioral wiring, 1 DRY_RUN
override, 3 regression) covering the dedicated run_phase2
function added in cluster-validation-job.yaml. Mirrors the
GPUOP-763 unit-test plan for run_phase1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-771: Implement cluster-validation-job.yaml

Mount cluster-validation-phase2-job-config ConfigMap into the
CronJob submit-mpijob container at /phase2-configs, paralleling
the existing mpi-configs and test-runner-configs mounts so that
PHASE2_SCRIPT can sed-render the Phase 2 per-node Job template.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-771: Add unit tests for cluster-validation-job.yaml

Unit test plan covering the phase2-configs volumeMount + volume
addition to the CronJob submit-mpijob container: YAML-structure
parse, source-text grep, kubectl client-side schema validation,
and sibling-mount regression guards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-772: Bash unit tests + sample phase2.log fixtures for PHASE2_SCRIPT

Adds tests/test_phase2.sh exercising PHASE2_SCRIPT (GPUOP-769) end-to-end
against the existing kubectl mock harness, plus four sample phase2.log
fixtures under tests/fixtures/phase2/ for the pass / bw-below-threshold /
rccl-crash / failed-no-marker classification paths.

Coverage (design GPUOP-692 section 7, test plan GPUOP-692-test-plan.md):
* pass (BW above threshold) -> =passed + measured-bw annotation
* bus-bw-below-threshold fail (also covers PHASE2_BW_THRESHOLD=9999 inject)
* rccl-crash via mpirun-exited marker
* default rccl-crash on Failed=True without a recognized marker
* timeout via PHASE2_JOB_WAIT_TIME=0 + cleanup delete
* SKIP_GPU_MESH_VALIDATION=true short-circuit (case-insensitive)
* missing-env fast-fail (phase2-missing-env reason)
* missing job template -> job-template-missing reason
* parallel-submit ordering pin (all applies precede any get-job poll)
* job-creation-failed when kubectl apply returns non-zero
* PHASE_NODES env-var fallback when no positional args
* PHASE2_RCCL_ENV_VARS contains no IB/fabric tunables (TC4)

Extends lib/kubectl_mock.sh with a `logs` verb (the original mock only
honored label/annotate/get/apply/delete) and a kubectl_mock_set_pod_log
seed helper so per-pod log content can be served as base64-encoded
state, mirroring the kubectl_mock_set_pod_for_job pattern from GPUOP-764.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-773: Document Phase 2 intra-node GPU mesh validation in README

Add a Phase 2 subsection to example/gpu-validation-cluster/README.md
mirroring the structure of the existing Phase 1 subsection: scope,
single-node mpirun --np 8 --host localhost all_reduce_perf workload,
amd.com/gpu-mesh-validation label, ConfigMap env vars (PHASE2_*),
failure-reason annotation values (bus-bw-below-threshold, rccl-crash,
xgmi-init-failure, timeout, job-creation-failed), and the
SKIP_GPU_MESH_VALIDATION short-circuit behavior.

Includes a Threshold Tuning subsection covering: the default 200 GB/s
for MI300 xGMI, how to override PHASE2_BW_THRESHOLD via the
cluster-validation-config ConfigMap, expected ranges for MI300X/A,
MI250/X, MI210, and MI100 SKUs, and a 4-step tuning workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-774: Implement cluster-validation-phase3-job-config

Add cluster-validation-phase3-job-config ConfigMap with the per-node
NIC health Job template (batch/v1.Job) used by GPUOP-693 Phase 3.

Per docs/codie/designs/GPUOP-693-phase3-nic-health-design.md section
4 -> Code Path: Phase 3 Job template:
- serviceAccountName: cluster-validation-sa (preexisting, has node
  label/annotate perms via cluster-validation-role).
- securityContext.privileged: true (required for rdma/ibv tooling).
- Requests amd.com/nic: $$EXPECTED_NIC_COUNT (sole-tenant on NICs).
- Image: docker.io/rocm/network-operator-utils:v1.1.0 (same as
  Phase 5 launcher init-container).
- envFrom cluster-validation-config -- sources PHASE3_CHECK_SCRIPT
  body, PHASE3_LABEL_KEY, PHASE3_EXPECTED_NIC_COUNT, and the
  GPUOP-690 label/annotate helper conventions.
- NODE_NAME via downward API for in-pod kubectl self-labelling.
- Placeholders $$NODE and $$EXPECTED_NIC_COUNT are sed-substituted
  by PHASE3_SCRIPT (separate Sub-task) at submit time.
- backoffLimit: 0, ttlSecondsAfterFinished: 300 (matches Phase 2).

Scope is limited to the Job template ConfigMap only. The
PHASE3_CHECK_SCRIPT body, PHASE3_* env vars, and orchestrator
wiring are tracked under sibling Sub-tasks (GPUOP-693.2/.3+).

Verified with kubectl apply --dry-run=client on the outer file
(all 8 resources accepted) and on the rendered inner Job template.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-774: Add unit tests for cluster-validation-phase3-job-config

50 test cases covering structural / content / security / resource /
command / boundary (no-scope-creep) / rendering / regression
categories for the new Phase 3 per-node NIC health Job template
ConfigMap. Mapped to 8 Story-level test cases in GPUOP-693 test
plan. Python 3 + unittest + PyYAML harness (matches sibling
GPUOP-767 Phase 2 ConfigMap tests), with a grep-only fallback for
CI environments without Python.

Verifies the Job template contract: image
docker.io/rocm/network-operator-utils:v1.1.0, privileged: true,
amd.com/nic: $$EXPECTED_NIC_COUNT, serviceAccountName
cluster-validation-sa, envFrom cluster-validation-config,
NODE_NAME downward-API, and the three-line script that materializes
and executes $PHASE3_CHECK_SCRIPT.

Scope guards confirm the template does NOT inline the
PHASE3_CHECK_SCRIPT body, PHASE3_* env values, kubectl label
calls, or amd-nic=true nodeSelector -- those are owned by sibling
Sub-tasks (GPUOP-693.2 / .3 / .4).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-775: Add PHASE3_CHECK_SCRIPT to cluster-validation-config

In-Job per-node NIC health check with 4 structural validations:
NIC count (lspci vendor 1dd8), ip link state UP, rdma link state
ACTIVE, ibv_devinfo + GID table non-empty. Self-labels the node
via in-pod kubectl (cluster-validation-sa). On any failure writes
${PHASE3_LABEL_KEY}=failed plus failure-reason / failed-nics
annotations (each truncated to 250 bytes for K8s safety) and
exits non-zero. Matches GPUOP-693 design Section 4 -> Code Path:
PHASE3_CHECK_SCRIPT.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-775: Add unit tests for cluster-validation-config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-776: Add Phase 3 ConfigMap env vars

Add four Phase 3 (per-node NIC health) ConfigMap data keys to
cluster-validation-config so the Phase 3 Job (GPUOP-774 template)
can project them via envFrom:

  * PHASE3_EXPECTED_NIC_COUNT="8"   - PCI function count per node
  * PHASE3_AMD_NIC_VENDOR_ID="1dd8" - Pensando PCI vendor ID
  * PHASE3_MIN_GID_COUNT="1"        - min GID table entries per dev
  * PHASE3_JOB_WAIT_TIME="120"      - per-node Job wallclock budget

Values match the defensive defaults already in PHASE3_CHECK_SCRIPT
(GPUOP-775) so existing behavior is preserved; this Sub-task makes
them ConfigMap-tunable.

Design: docs/codie/designs/GPUOP-693-phase3-nic-health-design.md
        section 4 -> "Code Path: Phase 3 ConfigMap variables".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-776: Add unit tests for cluster-validation-config Phase 3 env vars

32 unit test cases (18 P0, 13 P1, 1 P2) covering presence, value,
type discipline, boundary, script-default consistency with
PHASE3_CHECK_SCRIPT (GPUOP-775), and regression guards for the 4
new Phase 3 ConfigMap data keys:

  * PHASE3_EXPECTED_NIC_COUNT
  * PHASE3_AMD_NIC_VENDOR_ID
  * PHASE3_MIN_GID_COUNT
  * PHASE3_JOB_WAIT_TIME

Framework: Python 3 + unittest + PyYAML (matches sibling Phase 1/2/3
unit-test plans). Test harness + grep-only fallback inline.

Maps to Story GPUOP-693 test plan cases #1, #2, #3, #5, #8, #12.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-777: Implement PHASE3_SCRIPT orchestrator + run_phase3 wiring

Adds the outer Phase 3 driver (submit + wait, no result fetch) to
cluster-validation-config and replaces the generic run_phase3 stub
with dedicated wiring mirroring the run_phase1/run_phase2 pattern.
Mounts cluster-validation-phase3-job-config (GPUOP-774) at
/phase3-configs so PHASE3_SCRIPT can sed-render the per-node Job
template. amd-nic=true intersection and SKIP_NIC_VALIDATION pass-
through stay on the orchestrator side (already in place).

Per design GPUOP-693 §4 (Code Path: PHASE3_SCRIPT + orchestrator
wiring), the in-pod PHASE3_CHECK_SCRIPT (GPUOP-775) self-labels
passed/failed via in-pod kubectl, so PHASE3_SCRIPT only writes
labels for the cases that prevent the in-pod kubectl from running:
submit-failed -> job-creation-failed; timeout -> nic-not-allocated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-777: Add unit tests for PHASE3_SCRIPT orchestrator

40 test cases across 8 categories covering submit/wait, no-result-
fetch contract, SKIP_NIC_VALIDATION pass-through, timeout->
nic-not-allocated fallback, run_phase3 wiring, and phase3-configs
volume mount. Maps to 14 cases in GPUOP-693-test-plan.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-778: Add unit tests for run_phase3 wiring + /phase3-configs mount

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-779: Add bash unit tests for Phase 3 NIC health check

Covers PHASE3_CHECK_SCRIPT (in-Job 4-check NIC health + self-labelling,
GPUOP-775) and PHASE3_SCRIPT (outer-driver submit/wait, GPUOP-777)
per the GPUOP-693 test plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-779: Add unit tests for Phase 3 NIC health bash test suite

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-779: Unit test results — PASS (21/21)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-780: Add Phase 3 README section + sample failure annotations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-781: Add Phase 4 server + client Job template ConfigMap

Implements GPUOP-694.1 per design doc section 4 ("Code Path: Phase 4
Job templates"). Adds new cluster-validation-phase4-job-config
ConfigMap with two batch/v1.Job templates (server and client) for
ib_write_bw pairwise rail bandwidth measurement.

Templates parameterized on $$NODE, $$PEER_POD_IP, $$RAIL_IDX, and
$$NAD_NAME for sed-rendering by PHASE4_DRIVER_SCRIPT (sibling
sub-task). Image rocm/roce-workload via $$ROCE_WORKLOAD_IMAGE.
envFrom pulls PHASE4_* config vars (GPUOP-694.3). Per-rail NIC
requested via amd.com/nic and k8s.v1.cni.cncf.io/networks Multus
annotation. Server waits for client connect; client parses
"BW average" line and tees output to /shared for driver log
collection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-781: Add unit tests for cluster-validation-phase4-job-config

* GPUOP-782: Add per-rail NetworkAttachmentDefinition manifests

Adds configs/nad-per-rail.yaml with 8 NADs
(amd-host-device-nad-rail-0..7) for Phase 4 pairwise rail bandwidth
testing. Each NAD uses the host-device CNI plugin pinned to one
rail's RDMA netdev (rdma_dev_N), matching PHASE4_IB_DEV_PREFIX.

Conditional per the Sub-task description: skip applying if the
cluster already ships per-rail NADs via the network-operator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-782: Add unit tests for per-rail NAD manifests

48 manifest contract assertions across 9 categories:
structural, naming, metadata, CNI config, per-rail consistency,
no-scope-creep, conditional deployment, Phase 4 cross-Sub-task,
and regression guards. Framework: Python 3 + unittest + PyYAML
(matches sibling GPUOP-781 / GPUOP-774 conventions).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-783: Add Phase 4 pairwise rail bandwidth ConfigMap env vars

Extend cluster-validation-config ConfigMap with 6 Phase 4
(pairwise rail bandwidth test) tunables consumed by the
PHASE4_DRIVER_SCRIPT and per-pair server/client Jobs:

- PHASE4_RAIL_COUNT           = "8"     (rails per node)
- PHASE4_BW_THRESHOLD         = "380"   (Gbps minimum per rail)
- PHASE4_PAIR_WAIT_TIME       = "180"   (per-rail wallclock seconds)
- PHASE4_MAX_CONCURRENT_PAIRS = "8"     (pair-runner concurrency cap)
- PHASE4_NAD_NAME_PREFIX      = "amd-host-device-nad-rail-"
- PHASE4_IB_DEV_PREFIX        = "rdma_dev_"

Defaults match the MI300 fleet baseline validated on
smc300x-ccs-aus-gpuf268. Section follows the Phase 2 / Phase 3
env-var layout and ownership convention (envFrom projection,
no code change required to tune).

See GPUOP-694 design §4 -> "Code Path: Phase 4 ConfigMap variables".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-783: Add unit tests for Phase 4 ConfigMap env vars

44 test cases (26 P0, 14 P1, 4 P2) covering the 6 new PHASE4_*
keys added to cluster-validation-config:

  - Presence (8 tests): every key from the design doc lands
  - Value (6 tests):    exact defaults per design §1, §2, §4
  - Type (7 tests):     str discipline + positive-int +
                        NAD/IB-dev prefix format regex
  - Boundary (8 tests): lower/upper bounds + non-empty prefixes
  - Phase 4 contract (3 tests): rail-count == NIC-count,
                        label-key consistency, NAD naming match
  - Regression (9 tests): Phase 1/2/3 blocks and shared
                        helpers (PHASE_NODE_LABEL_SCRIPT,
                        PHASE_FAILURE_REASON_ANNOTATION_SUFFIX)
                        intact; no duplicate keys
  - File smoke (3 tests): no tabs, comment provenance header,
                        block ordering after PHASE3_SCRIPT

Framework: Python 3 + unittest + PyYAML (matches sibling
unit-test plans GPUOP-774/-775/-776 for Phase 3). Test harness
reference + grep-only CI fallback included in the document.

Story TP map covers cases #3, #5, #6, #7, #11, #12, #13, #14,
#15, #20; downstream cases (pairing, driver-script, smoke)
deferred to sibling Sub-tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-784: Implement PHASE4_DRIVER_SCRIPT in cluster-validation-config

Adds the Phase 4 (pairwise rail bandwidth) driver script to the
cluster-validation-config ConfigMap per GPUOP-694 design §4 ->
"Code Path: PHASE4_DRIVER_SCRIPT".

Behavior:
  1. Sort + round-robin pair input nodes (already gated to
     amd.com/nic-health=passed upstream); odd-count last node is
     pass-labeled with `unpaired=true` annotation.
  2. Fork pair_runner per pair, bounded by
     PHASE4_MAX_CONCURRENT_PAIRS (defaults to 8).
  3. pair_runner iterates rails 0..(PHASE4_RAIL_COUNT-1) serially:
     render+apply server Job (pinned to node A), wait for pod IP,
     render+apply client Job with that IP (pinned to node B), wait
     both, parse "BW average" Gbps from client log, record per-
     (node, rail) result via a tmpdir state surface, cleanup.
  4. Aggregate: a node passes iff ALL its rails were
     >= PHASE4_BW_THRESHOLD.
  5. Label via GPUOP-690 helpers (label_phase_passed /
     label_phase_failed / annotate_phase_value); write per-rail BW
     annotations and a failed-rails CSV summary.
  6. SKIP_RAIL_BANDWIDTH_TEST=true short-circuits to pass-labelling
     every input node with no Jobs created.

Error reasons surfaced per-(node, rail):
  peer-pod-unready, nad-missing, ib-write-bw-crashed, parse-failed,
  api-throttled (with up to 3-retry exponential backoff on 429),
  below-threshold:<bw>.

Consumes: PHASE4_* env vars (GPUOP-694.3), Job templates at
/phase4-configs/cluster-validation-phase4-{server,client}-job-config.yaml
(GPUOP-694.1, mounted by GPUOP-694.6), GPUOP-690 helpers.

Wiring to run_phase4 (GPUOP-694.5) and the /phase4-configs mount
(GPUOP-694.6) are tracked under their own sub-tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-784: Add unit tests for cluster-validation-config

* GPUOP-785: Wire run_phase4 orchestrator to source PHASE4_DRIVER_SCRIPT

Replace the generic _run_phase_generic stub for Phase 4 with a
dedicated run_phase4 wiring function (mirrors run_phase1/2/3) that
sources PHASE4_DRIVER_SCRIPT (GPUOP-694.4). Add /phase4-configs
volumeMount + ConfigMap volume so PHASE4_DRIVER_SCRIPT can read
the server + client Job templates from
cluster-validation-phase4-job-config (GPUOP-694.1 / GPUOP-781).

See docs/codie/designs/GPUOP-694-phase4-rail-bandwidth-design.md
section 4 -> "Code Path: orchestrator wiring".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-785: Add unit tests for run_phase4 orchestrator wiring

26 unit test cases covering:
- run_phase4 interface contract (sources PHASE4_DRIVER_SCRIPT,
  passes nodes as positional args, exports PHASE_NODES)
- Error paths (driver non-zero rc, unset/empty driver script)
- Boundary cases (empty/single/many input nodes)
- DRY_RUN override
- /phase4-configs CronJob volume mount + ConfigMap binding
- Phase 4 dispatch site SKIP guard + Phase-3 pool gating
- Sibling-phase regression (phases 1-3 unchanged, run_phase5
  still uses _run_phase_generic)
- Static analysis (YAML parse, bash -n, shellcheck)

Maps 6 unit tests to Story TP cases in
docs/codie/test-plans/GPUOP-694-test-plan.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-787: Add Phase 4 bash unit tests + ib_write_bw log fixtures

Adds test_phase4.sh with 19 cases covering PHASE4_DRIVER_SCRIPT:
pairing (round-robin even/odd, single-node, empty), all-pass
single pair with per-rail BW annotations, single-rail-below-
threshold, all-rails-fail, ib_write_bw crash, parse failure,
server-pod-unready / nad-missing timeout paths, rail-count
override, missing required env, missing job templates, SKIP
short-circuit, 16-node concurrency cap honored, and the
PHASE_NODES env-var fallback.

Sample ib_write_bw client log fixtures live under
tests/fixtures/phase4/ (pass, pass-high, below-threshold,
crashed, empty).

Extends lib/kubectl_mock.sh with two Phase-4-specific routes
(pod-ip jsonpath, pods --no-headers shape) plus a concurrency-
safe argv capture: the previous tail-back-the-call-log approach
races under bounded-parallel pair_runners; now ARGS is captured
from "$@" before any I/O and the call-log line is appended with
a single atomic write.

README.md updated with Phase 4 coverage section and the
GPUOP-694 test-plan TC mapping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-788: Document Phase 4 (Pairwise Rail Bandwidth) in README

Adds a Phase 4 section to the gpu-validation-cluster README covering
scope, the pairwise per-rail ib_write_bw model, the 380 Gbps
configurable threshold, sample per-rail annotations (passing and
failing nodes), per-(pair, rail) failure reasons, unpaired-node
behavior, SKIP_RAIL_BANDWIDTH_TEST short-circuit semantics, and
concurrency-cap tuning guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-788: Add unit tests for Phase 4 README documentation

Adds a 28-case unit-test plan covering section presence, ConfigMap
defaults accuracy, sample annotation correctness, markdown structural
lint, failure-reason vocabulary alignment with PHASE4_DRIVER_SCRIPT,
SKIP and concurrency content, and pipeline-table regression. Tests are
designed to run via tests/test_phase4_readme.sh with a stdlib-only
Python md-lint helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-789: Rename WAIT_FOR_WORKERS_SCRIPT to PHASE45_PREFLIGHT_SCRIPT

Rename the ConfigMap key in cluster-validation-config.yaml from
WAIT_FOR_WORKERS_SCRIPT to PHASE45_PREFLIGHT_SCRIPT and update the
wait-for-worker-pods init-container in cluster-validation-job.yaml
to source the new env var name. Pure rename; no behavior change.

Makes room for the enhanced N×N SSH mesh, DNS, MPI spawn, and RCCL
topology preflight checks landing in subsequent GPUOP-695 sub-tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-789: Add unit tests for cluster-validation-config rename

Test plan covers the PHASE45_PREFLIGHT_SCRIPT rename: presence of
new key, removal of old key, init-container env-var sourcing, envFrom
wiring preservation, YAML parse, bash -n on extracted body, and
byte-exact body invariance vs baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-790: Add N*N SSH mesh check to PHASE45_PREFLIGHT_SCRIPT

Extends PHASE45_PREFLIGHT_SCRIPT (cluster-validation-config
ConfigMap) so that, after the existing launcher->worker SSH
readiness wait, every worker pod is probed against every worker
IP via kubectl exec + ssh -o ConnectTimeout=5. Failed pairs are
collected in failed_pairs and reported all at once at the end of
the loop rather than aborting on the first failure, giving the
operator a complete fabric picture for triage.

See design doc GPUOP-695 §4 (Code Path: Enhanced
PHASE45_PREFLIGHT_SCRIPT). The retry-with-timeout launcher->worker
readiness probe is preserved as Phase 1; the new N*N matrix is
Phase 2. UserKnownHostsFile=/dev/null is added to keep known_hosts
from poisoning subsequent pairs when pods share a home volume.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-790: Add unit tests for N*N SSH mesh probe

23 test cases covering the new N*N mesh loop in
PHASE45_PREFLIGHT_SCRIPT: cardinality, flag-shape assertions
(StrictHostKeyChecking=no, UserKnownHostsFile=/dev/null,
ConnectTimeout=5), continue-past-first-failure semantics under
set -euo pipefail, degenerate single-pod path, ENABLE_SSH_CHECK
short-circuit, and the additive kubectl_mock.sh extensions
(kubectl exec / wait verbs, training.kubeflow.org/job-name=
selector route, kubectl_mock_set_pair_fail helper).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-791: Add DNS forward+reverse check to PHASE45_PREFLIGHT_SCRIPT

Per design GPUOP-695 §4 and Sub-task scope, add a Phase 3 DNS
validation block to the launcher init-container script. From the
FIRST worker pod, resolve every worker node hostname forward
(getent hosts), then reverse-resolve the returned address. Misses
are recorded in dns_misses[] and logged as WARN. The check does
NOT short-circuit -- it iterates every hostname and continues to
subsequent checks. Overall verdict gating (annotate-and-evict on
failure) is owned by a separate Sub-task.

WORKER_NODES is derived from .spec.nodeName (deduplicated). The
in-pod loop runs under the existing `set -euo pipefail` script, so
each getent call is guarded with `|| echo MISS` to keep the loop
alive.

Verified: yaml.safe_load() loads the ConfigMap and `bash -n` on
the extracted PHASE45_PREFLIGHT_SCRIPT passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-791: Add unit tests for DNS forward+reverse check

27 unit test cases covering the new Phase 3 DNS block of
PHASE45_PREFLIGHT_SCRIPT:
- 5 positive (resolution paths, FIRST_POD targeting, deduplication,
  renamed-key extraction)
- 6 negative (forward miss, reverse miss, multi-miss, all-fail,
  no-exit-1 regression guard)
- 4 boundary (N=1, empty pods, duplicate nodeName, hostname-with-dot)
- 5 mock/dependency (jsonpath, JOB_LABELS, FIRST_POD reuse, fwd-then-rev
  order, awk first-token extraction)
- 3 set -e safety (getent failure non-fatal, kubectl exec failure
  non-fatal, no-unbound-variable defense)
- 4 integration (post-mesh sequencing, mesh-fail-skips-dns,
  ENABLE_SSH_CHECK=false skip, single-node smoke on
  smc300x-ccs-aus-gpuf268)

Each case is mapped to the Story-level test plan (GPUOP-695) by
test number. Framework: bash + tests/lib/{kubectl_mock,assert,
extract_script}.sh, layered on top of the GPUOP-790 mock
extensions. Recommended test file:
example/gpu-validation-cluster/tests/test_phase45_dns.sh.

Mock extensions required (additive on top of GPUOP-790):
- spec.nodeName jsonpath route in kubectl get pods mock
- in-pod getent shim driven by kubectl_mock_set_dns_fwd /
  kubectl_mock_set_dns_rev helpers
- kubectl exec mock must run `bash -c <script>` with the getent
  shim's PATH prepended

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-792: Add mpirun --hostfile no-op spawn check to PHASE45_PREFLIGHT_SCRIPT

Per design GPUOP-695 §4 and Sub-task scope, add a Phase 4 MPI spawn
validation block to the launcher init-container script. From the
FIRST worker pod, write $WORKER_IPS (one per line) to /tmp/hf and
invoke `mpirun --hostfile /tmp/hf --np $(wc -l < /tmp/hf)
--allow-run-as-root true`. A non-zero exit from the kubectl exec
sets mpi_spawn_failed=true.

This catches OMPI launcher misconfiguration (missing orted, broken
PATH, btl/oob misconfig, partial mca install) before the heavier
RCCL probe in a subsequent Sub-task. Mirrors the
record-and-continue pattern established by GPUOP-790 (SSH mesh)
and GPUOP-791 (DNS): the check MUST NOT short-circuit -- it sets
the flag, logs WARN, and continues. Overall verdict gating
(annotate-and-evict) is owned by a separate Sub-task.

mpi_spawn_failed is initialised to false so later annotate logic
can read it under `set -u`. Blank-line entries in the hostfile
are filtered via `awk "NF"` to harden against an empty
$WORKER_IPS.

Verified: yaml.safe_load_all() loads the ConfigMap and `bash -n`
on the extracted PHASE45_PREFLIGHT_SCRIPT passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-792: Add unit tests for mpirun --hostfile no-op spawn check

30 unit test cases covering the new Phase 4 MPI block of
PHASE45_PREFLIGHT_SCRIPT:
- 7 positive (single-exec on FIRST_POD, hostfile content,
  np-from-wc-l, mpirun flag set, renamed-key extraction)
- 5 negative (mpirun-nonzero, missing binary, exec-failure,
  no-exit-1 regression guard, warn-continues)
- 4 boundary (worker-replicas=1, empty WORKER_IPS, trailing
  whitespace, IPv6 IPs)
- 5 mock/dependency (FIRST_POD/WORKER_IPS reuse, hostfile via
  tr+awk, --allow-run-as-root present, payload is `true`)
- 4 set -e safety (mpirun fail non-fatal, init-before-use
  ordering, no-unbound-variable on success, side-channel
  contract with verdict-gating Sub-task)
- 5 integration (post-DNS sequencing, dns-misses-do-not-skip-mpi,
  mesh-fail-skips-mpi, ENABLE_SSH_CHECK=false skip, single-node
  smoke on smc300x-ccs-aus-gpuf268)

Each case is mapped to the Story-level test plan (GPUOP-695) by
test number. Framework: bash + tests/lib/{kubectl_mock,assert,
extract_script}.sh, layered on top of the GPUOP-790 + GPUOP-791
mock extensions. Recommended test file:
example/gpu-validation-cluster/tests/test_phase45_mpi_spawn.sh.

Mock extensions required (additive on top of GPUOP-790/791):
- In-pod mpirun shim driven by kubectl_mock_set_mpirun_exit /
  kubectl_mock_mpirun_calls / kubectl_mock_mpirun_capture_hostfile
- kubectl_mock_set_exec_exit for simulating pod eviction / exec
  failure distinct from mpirun failure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-793: Add minimal RCCL topology probe to PHASE45_PREFLIGHT_SCRIPT

Phase 5 of the PHASE45_PREFLIGHT_SCRIPT now runs a minimal RCCL
topology probe via mpirun + all_reduce_perf with NCCL_DEBUG=INFO
and NCCL_DEBUG_SUBSYS=INIT, grepping the first 50
"NCCL INFO comm|topology" lines into the launcher log. The probe
is wrapped in a 60s timeout: on timeout (exit 124) the soft-fail
path sets rccl_topo_timeout=true and continues per design
GPUOP-695 §6 (first-run topology discovery can be slow while NCCL
warms caches). Any other non-zero exit sets rccl_topo_failed=true
and continues. Both flags are initialised to false for set -u
safety. Verdict gating (annotate-and-evict) remains owned by a
separate Sub-task, matching the WARN-and-continue pattern already
used by the DNS and mpirun no-op spawn checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-793: Add unit tests for RCCL topology probe

Unit test plan for the new Phase 5 RCCL topology probe block in
PHASE45_PREFLIGHT_SCRIPT. Mirrors the bash-mock-kubectl harness
already used by test_phase{1..4}.sh: extract the block scalar,
shim kubectl + timeout for deterministic exit codes (0, 1, 124,
137), assert the WARN-and-continue contract is honoured. 15
active cases across interface contract, error conditions
(timeout vs generic non-zero), boundary (single worker IP,
empty grep matches, set -u flag init), dependency mocks
(FIRST_POD, PERF_TEST_DIR, timeout wrapping), error recovery
(does-not-short-circuit), and integration (banner ordering).
9 cases map to GPUOP-695 Story TCs 2, 7, 8, 14.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-794: Implement PHASE45_PREFLIGHT_SCRIPT verdict block

Adds the aggregate-verdict / annotate-on-failure / abort-MPIJob block
at the tail of PHASE45_PREFLIGHT_SCRIPT in cluster-validation-config.yaml.

Behaviour (design GPUOP-695 §4 verdict block, §5, §6):

* Aggregates four failure flags set by the preceding Phase 4.5 checks
  (ssh_mesh_failed, dns_failed, mpi_spawn_failed, rccl_topo_failed,
  rccl_topo_timeout) into a comma-joined reason string and annotates
  every participating worker node with
    amd.com/phase4_5-failure-reason=<comma-separated classes>
  using --overwrite so the next CronJob tick replaces stale values.
* HARD failures (ssh-mesh, dns, mpi-spawn, rccl-topology non-timeout)
  exit 1 -> init-container fails -> launcher fails -> MPIJob fails
  via the existing failure path. No new orchestrator code needed.
* SOFT failure: RCCL timeout (exit 124) per design §6 is annotated
  but does NOT abort -- first-run NCCL cache warming can be slow and
  the real test is Phase 5. Without this carve-out every cold-cache
  run would fail-loop Phase 4.5.
* The previously-immediate `exit 1` in the SSH-mesh block is replaced
  with the same record-and-continue pattern used by DNS/MPI/RCCL so
  the verdict block sees the complete fabric picture before annotating.
* NO new label key is introduced -- per design §5, Phase 4.5 is a gate,
  not a tested capability per node. Phase 4 labels stay frozen so the
  audit trail is preserved.
* Uses `if; then; fi` rather than `[ ... ] && ...` because the script
  runs under `set -e` and a short-circuit whose left side is false
  would abort the script before the verdict block could annotate.
* `kubectl annotate` is guarded with a fallback echo so a single
  failing annotate (e.g. node renamed) does not block the remaining
  loop iterations or the final FATAL exit.

Sanity-checked:
* YAML well-formedness via yaml.safe_load_all -- OK.
* Embedded script syntax via `bash -n` -- OK.
* Verdict logic behavioural sweep (6 scenarios: all-pass, ssh-only,
  all-hard, only-timeout, timeout+ssh, hard-rccl-beats-timeout) --
  all annotation strings and exit codes match the design intent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-795: Implement bash unit tests for PHASE45_PREFLIGHT_SCRIPT

Adds test_phase4_5.sh covering the Phase 4.5 pre-flight gate (SSH mesh,
DNS, MPI spawn, RCCL topology + verdict block) by extracting the
script body from cluster-validation-config.yaml and driving it against
the mocked kubectl harness. 12 cases realize test-plan TC2, TC4, TC5,
TC6, TC7, TC8, TC9 from docs/codie/test-plans/GPUOP-695-test-plan.md
plus extra coverage for the verdict block's hard-vs-soft fail
classification (design GPUOP-695 §4 / §6).

Mock infrastructure (lib/kubectl_mock.sh) gains three new arms:
* `kubectl wait`   -- default pass; failure-injectable via FAIL_DIR.
* `kubectl exec`   -- per-call FIFO response queue under FAIL_DIR;
                      each entry carries (exit_code, base64 stdout).
* `kubectl get`    -- new phase45-* state routes for the pod-name /
                      pod-IP / nodeName listings the pre-flight uses
                      (Kubeflow training labels, not job-name=).
Plus seed helpers: kubectl_mock_set_mpijob_names / set_pod_names /
set_pod_ips / set_node_names / queue_exec.

Fixtures in tests/fixtures/phase4_5/ carry the canned stdout the
DNS and RCCL probes consume.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-795: Record test results for PHASE45_PREFLIGHT_SCRIPT unit tests

12/12 new cases pass; full suite (111/111 across 7 files) remains
green. Maps each case to its parent Story test-plan TC entry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-796: Implement README + Phase 4.5 documentation

Add a dedicated Phase 4.5 (Cross-Node Connectivity Matrix Test)
section to the validation cluster README documenting the four
pre-flight checks (N x N SSH mesh, DNS forward+reverse, mpirun
no-op spawn, RCCL topology probe), the
amd.com/phase4_5-failure-reason annotation contract, the shared
SKIP_RCCL_TEST flag with Phase 5, and the launcher-init-container
execution model (no separate pod). Also fix the Phase 4.5 row in
the phase table, cross-link the Day-3 incremental-bringup bullet,
and document the WAIT_FOR_WORKERS_SCRIPT -> PHASE45_PREFLIGHT_SCRIPT
migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-796: Add unit tests for Phase 4.5 README documentation

Add a 26-case unit-test plan for the Phase 4.5 README section
covering section presence and heading hierarchy, four-checks
table accuracy, sample annotation correctness, Markdown
structural lint, SKIP_RCCL_TEST shared-flag content, cross-file
drift guards (ConfigMap key + design-doc reason tokens), and
pipeline-table / Day-3 cross-link regression. Mirrors the
sibling GPUOP-788 README test plan convention and the bash +
md_lint.py harness used by other Phase X.8 documentation
sub-tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-797: Implement PHASE5_SCRIPT ConfigMap key

Lift the pre-refactor inlined Phase 5 body (Steps 3-5 of the
submit-mpijob container args plus launcher log retrieval) out of
the CronJob args and into a new PHASE5_SCRIPT key in
cluster-validation-config.yaml. The orchestrator already wires
run_phase5() to source PHASE5_SCRIPT via _run_phase_generic
(landed in GPUOP-757); this Sub-task supplies the script body.

Per the GPUOP-696 design doc §4 ("Code Path: Lift existing
Phase 5 logic into PHASE5_SCRIPT") and the JIRA Sub-task scope
note, this Sub-task PRESERVES the body verbatim -- no behavior
change. Subsequent Sub-tasks (GPUOP-696.2..696.8) own the
substantive refactors:

  * parameterising on the input node set (dynamic
    actual_worker_replicas from $@, drop reliance on
    outer-scope passed_nodes / passed_count)
  * swapping inlined kubectl-label loops for the
    label_phase_passed / label_phase_failed helpers
    (GPUOP-690 label library)
  * adding the PHASE5_MIN_WORKERS guard
  * per-worker exit-code annotation + per-worker log dump by
    node name

Two minimal source-vs-exec adjustments (both documented inline):

  1. A 2-line input shim at the top -- `passed_nodes="$@"` and
     `passed_count=$(echo "$passed_nodes" | wc -w)` -- lets the
     verbatim body run unchanged under the orchestrator's
     `source $script_path $nodes` calling convention. The
     pre-refactor body referenced outer-scope `passed_nodes` /
     `passed_count`; the shim re-binds them from positional
     args so the verbatim body sees the same variable names.
  2. The SKIP_RCCL_TEST early-exit path swaps `exit 0` for
     `return 0` so the orchestrator can still run
     cleanup_old_mpijobs, collect_launcher_logs, and the
     fail-loud exit accounting after Phase 5 returns. `exit 0`
     in the original code ended the CronJob pod because the
     code was inlined; under source-from-orchestrator it would
     kill the orchestrator mid-flight.

Files:
  example/gpu-validation-cluster/configs/cluster-validation-config.yaml

Validation:
  * `python3 -c 'yaml.safe_load(...)'` clean (69 keys,
    PHASE5_SCRIPT is a ~6000-char literal scalar)
  * verbatim diff vs commit 1657d978^ Phase 5 block = only the
    two documented adjustments above + trivial trailing-
    whitespace trim on one line; the ✅/❌ glyphs are preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-797: Add unit tests for PHASE5_SCRIPT verbatim extraction

25 unit test cases covering the verbatim-preserve scope of
GPUOP-696.1: input shim correctness, all sed substitutions in
the MPIJob render pipeline, NAD-list rendering (PF/VF/empty
branches), the SKIP_RCCL_TEST short-circuit (including the
documented exit-0 -> return-0 source-vs-exec adjustment), the
pass + fail labeling paths, launcher-log retrieval, and the
integration gates that pin extraction fidelity (orchestrator
wiring, old block removed from CronJob args, YAML parses clean,
verbatim-body diff vs pre-refactor baseline is empty modulo the
two documented adjustments).

Out-of-scope refactors (helper-based labeling,
PHASE5_MIN_WORKERS, per-worker exit-code annotation, per-worker
log dumps, dynamic worker count from $@ rather than
outer-scope passed_count) are explicitly deferred to the
unit-test plans of GPUOP-696.2..696.8 and listed under
"Unmapped Story TCs" with their owning Sub-tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-798: Parameterise PHASE5_SCRIPT input + dynamic worker-replicas

GPUOP-696.2: encapsulate the Phase 5 body in `run_phase5_main`, which
takes the surviving Phase 4.5 node set via its own positional args
instead of reading the outer-scope `passed_nodes`. `input_count` (renamed
from `passed_count`) now drives `actual_worker_replicas`, which renders
the MPIJob `worker-replicas` field. The `WORKER_REPLICAS` ConfigMap
value consequently becomes a MAXIMUM (Phase 0 node-selection budget)
rather than a target for this script.

The script body invokes `run_phase5_main "$@"` at the bottom so the
existing orchestrator contract (`source "$script_path" $nodes` in
_run_phase_generic) continues to work unchanged. GPUOP-696.6 will move
that invocation into the orchestrator.

Out of scope (other Sub-tasks): label helper swap (GPUOP-696.3 = -799),
per-worker exit-code annotation (-696.4 = -800), PHASE5_MIN_WORKERS guard
(-696.5 = -801), orchestrator wiring + old inlined block removal
(-696.6 = -802).

Validated: YAML parse OK; extracted PHASE5_SCRIPT passes `bash -n`;
smoke test confirms input_count derives from function args, body
ignores outer-scope passed_nodes, and node-iteration loops use
function-local input_nodes only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-798: Add unit tests for PHASE5_SCRIPT parameterisation

29 active unit test cases covering the GPUOP-696.2 parameterisation
contract: run_phase5_main function definition, input-from-function-args
binding, dynamic actual_worker_replicas in the rendered MPIJob manifest,
no-outer-scope-leak (negative tests with decoy passed_nodes/passed_count
in caller scope), and regression guards for the unchanged
pre-parameterisation surface (NAD rendering, kubectl wait timeout,
launcher log capture, deterministic mpijob name).

22 of 29 cases map to 11 Story-level test cases in GPUOP-696-test-plan.md.
8 Story TCs are out of scope (owned by GPUOP-696.3..696.6 -- helper-based
labeling, per-worker annotation, PHASE5_MIN_WORKERS guard, orchestrator
wiring + old-block deletion, multi-node testbed required).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-799: Replace inlined kubectl label with GPUOP-690 helper calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-799: Add unit tests for PHASE5_SCRIPT GPUOP-690 helper migration

31 active unit test cases covering the GPUOP-696.3 helper-call
migration contract: skip / pass / fail label paths invoke
label_phase_passed / label_phase_failed with the correct shape
("$n", "$PHASE5_LABEL_KEY", "$phase5_fail_reason"), the existing
amd.com/cluster-validation-status label key value is preserved
end to end (via the PHASE5_LABEL_KEY indirection), the
graceful-degradation wrapper matches the Phase 1-4 convention,
candidate-label-removal loops are intentionally left as raw
kubectl label, and the previous CLUSTER_VALIDATION_STATUS_LABEL
local variable + its reads are gone from the script body.

26 of 31 cases map to 10 Story-level test cases in
GPUOP-696-test-plan.md. 9 Story TCs are out of scope (owned by
GPUOP-797, GPUOP-696.4..696.6, or require a multi-node testbed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-800: Per-worker exit-code annotation + per-worker log dump

In the PHASE5_SCRIPT MPIJob-fail branch, replace the placeholder
`phase5_fail_reason="mpijob-failed"` shared across all input nodes
with a per-node lookup that derives `worker-pod=<name>,exit=<code>`
from the worker pod that ran on each node and passes it as the
`<reason>` arg to `label_phase_failed`. Missing pod or exit code
falls back to `unknown` for a deterministic annotation.

After launcher-log collection, add a per-worker log dump loop that
saves each input node's worker pod logs to
`${LOG_DIR}/worker-${node}-${new_job}.log`. Runs on both pass and
fail paths so operators can correlate every node's annotation with
its log file regardless of outcome.

Factor the per-node worker-pod lookup into a function-local
`_phase5_worker_pod_for_node` helper used by both the failure-branch
label loop and the per-worker log dump loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-801: Add PHASE5_MIN_WORKERS guard

Add PHASE5_MIN_WORKERS ConfigMap key (default "2") and enforce it
inside PHASE5_SCRIPT's run_phase5_main. When the surviving Phase 4.5
node set is smaller than PHASE5_MIN_WORKERS, log and return 0 without
submitting an MPIJob and without writing any PHASE5 labels. RCCL
collectives are multi-node by definition, so a single-worker MPIJob
is degenerate; operators can override the floor to 1 for opt-in
single-node plumbing validation.

The guard runs after the SKIP_RCCL_TEST short-circuit so the existing
"skip just labels everything passed" semantics are preserved, and an
empty input set takes the same skip branch (matches design GPUOP-696
sect 6 "Empty input set" handling).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-801: Add unit tests for PHASE5_MIN_WORKERS guard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* GPUOP-802: Wire run_phase5 with dedicated PHASE5_SCRIPT sourcing

Replace the generic _run_phase_generic dispatch stub for run_phase5
with a dedicated function that mirrors the run_phase1..4 wiring
pattern landed via GPUOP-763, GPUOP-770, GPUOP-777, and GPUOP-785.

Per design doc GPUOP-696 section 4 ("Code Path: orchestrator
wiring + cleanup of old in-CronJob block"), run_phase5 now:

  * writes PHASE5_SCRIPT to /tmp/run-phase5.sh, chmods +x
  * sources the script once to register run_phase5_main
  * invokes run_phase5_main "$nodes" with the input node list
    as a single space-separated argument
  * exports PHASE_NODES for parity with run_phase1..4
  * defends against PHASE5_SCRIPT not yet being shipped
    (no-op return) and against the script not defining
    run_phase5_main (warn + re…
…(#1539)

* fix(gpu-validation-cluster): wait for cert-manager webhook before GPU-operator install

The server bringup raced the cert-manager admission webhook: install_cert_manager
returned without waiting for readiness, and install_amd_gpu_operator then applied
Certificate/Issuer objects that the webhook must validate. During post-start k3s
flapping the webhook briefly loses its endpoints, so the helm install failed with
"no endpoints available for service cert-manager-webhook" and (under set -e)
aborted before installing the network operator, driver, mpi-operator, and CVF.

Add wait_for_cert_manager_webhook (rollout status + Endpoints poll) and call it
after the cert-manager install and before each GPU-operator install attempt.
Replace the helm-list skip with a helm-status check so a failed release is
uninstalled and retried instead of being treated as already installed.

Plan: docs-internal/knowledge/plans/2026-06-12-cvf-cert-manager-webhook-race.md

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

* fix(gpu-validation-cluster): make server bringup self-healing and verified

Plain setup-cluster.yml could report success while installing no operators:
the install ran as a single set -e pass inside a backgrounded gpu-cluster.sh,
and any kubectl/helm blip during post-start k3s instability aborted it with no
resume. The instability was partly self-inflicted -- configure_server_registries
did pkill -9 k3s AND launched its own k3s server via docker exec -d, fighting
the entrypoint supervisor for port 6443 ("stabilized after N restarts").

Changes to gpu-cluster.sh:
- configure_server_registries: only kill k3s and let the supervisor relaunch it
  (single source-of-truth args), removing the competing manual restart.
- Factor post-start steps into run_bringup_steps and run them in a bounded
  retry loop (subshell set -e + outer set +e capture) so the idempotent
  sequence self-heals once k3s settles.
- wait_for_cert_manager_webhook before the GPU-operator install; GPU-operator
  install now checks helm status (uninstall+retry a failed release) instead of
  treating a failed release as installed.
- Progress-aware pull waits: wait_for_rollout (rollout status under a deadline,
  fast-fail on auth/not-found pull errors) and wait_for_path (deadline-based,
  server-mode fatal-pull fast-fail) replace fixed-count loops in
  prepare_multus_artifacts.

Changes to setup-cluster.yml:
- Verify cert-manager, amd-gpu-operator, amd-network-operator helm releases
  reach STATUS=deployed and the CVF CronJob exists (when enabled), failing
  loudly instead of reporting false success.

Plan: docs-internal/knowledge/plans/2026-06-12-cvf-cert-manager-webhook-race.md

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

* fix(gpu-validation-cluster): robust docker apt setup + zero-match selector diagnostic

Two issues found during a fresh-machine (docker fully purged) redeploy:

1. Docker install failed on a node that still had docker apt source(s) from
   prior provisioning. setup-cluster.yml used the deprecated apt_key (global
   trusted.gpg) and an apt_repository entry with no signed-by; a pre-existing
   /etc/apt/sources.list.d/docker.list (and an Ansible-generated
   download_docker_com_linux_ubuntu.list) declared the same repo with a
   conflicting/empty signed-by, so `apt update` aborted with "Conflicting
   values set for option Signed-By ... download.docker.com" before docker could
   install. Now: probe docker first; when (re)installing, remove ALL
   pre-existing docker source files BEFORE the initial apt cache update, write
   the key to /etc/apt/keyrings/docker.asc (force refresh), and add one
   canonical docker.list with explicit signed-by. A healthy host with docker
   already present is left untouched.

2. A custom node-selector label (e.g. cvf-candidate=true) that nothing applies
   automatically made Phase 0 match 0 nodes, surfacing only as a generic
   "Insufficient candidates". The candidate-selection script now warns
   explicitly when the selector matches 0 nodes, dumps current node labels, and
   prints the kubectl label command; the skip message states the cause
   (zero-match vs all-busy/recent). README 0.2 documents that custom labels
   must be applied by the operator.

Plan: docs-internal/knowledge/plans/2026-06-12-cvf-cert-manager-webhook-race.md

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

* feat(gpu-validation-cluster): optional skip of network operator for GPU-only clusters

Add network-operator.install-network-operator (default true). When false,
gpu-cluster.sh skips the network-operator helm install, the NetworkConfig CR,
Multus artifact prep, and per-rail NAD apply; the playbook's post-install
verification drops amd-network-operator from the required-releases list.
cert-manager, the GPU operator, mpi-operator, and the CVF CronJob still install
cleanly -- verified end to end on a 2-node lab (GPU+mpi+CVF up, no
kube-amd-network namespace, CVF trigger selects nodes and submits Phase 1).

Important jq fix: read the flag with `if . == null then true else . end`, NOT
`// true`. jq's // alternative operator treats false as empty, so `false //
true` => true and the flag would silently never take effect.

Plan: docs-internal/knowledge/plans/2026-06-12-cvf-cert-manager-webhook-race.md

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

* fix(gpu-validation-cluster): Phase 1 verdict from Job condition + always persist pod logs

Phase 1 GPU HW-acceptance misreported passing stages as failed
(stage-xgmi-lvl1:test-runner-did-not-emit-results) whenever the
orchestrator could not read the per-stage result.json. That happened
two ways: a multi-node cluster where the orchestrator runs on a
different node than the worker (result.json is written only to the
worker's node-local hostPath), and a single node where the UID-1001
orchestrator could not read the root-owned result file.

Make the Kubernetes Job condition the sole source of truth for Phase 1
pass/fail (Complete=True -> pass, Failed=True -> fail), matching the
original pre-refactor CVF design. result.json is no longer read at all,
which also removes a latent multi-node mis-attribution bug where the
glob-by-name lookup could match a different node's same-named file.

Always persist each test-runner pod log to results_root
(/var/log/cluster-validation/<job>_pod.log) for triage, pass or fail,
via kubectl logs (node-agnostic). Restore runAsUser:0 on the
orchestrator so it can write that root-owned hostPath -- this also fixes
the orchestrator's own cronjob-*.log write and Phase 5 launcher/worker
log persistence.

Remove now-dead result-parsing code (_phase1_parse_result, find/glob
lookup, kubectl cp fallback, gz decompress, stage_start_epoch). Rewrite
Phase 1 unit tests to seed Job conditions instead of result fixtures;
drop the unused result-*.json fixtures.

Plan: docs-internal/knowledge/plans/2026-06-12-cvf-phase1-result-detection.md

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

* fix(gpu-validation-cluster): address PR review nits (comment accuracy + idempotent change reporting)

- gpu-cluster.sh: correct the rollout-wait comment to match actual
  behavior (per-attempt cap is 60s; the final attempt shrinks to the
  remaining time before the hard deadline -- not a ~30s floor).
- setup-cluster.yml: report the docker apt-source cleanup task as
  changed only when files were actually removed, instead of
  changed_when: true on every run.

Plan: docs-internal/knowledge/plans/2026-06-12-cvf-phase1-result-detection.md

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com>
…for open-source

Removes internal GPUOP-xxx ticket IDs and docs-internal/ path references
from comments and test fixtures so the open-source distribution carries
no pointers to pensando-internal trackers/design docs. Comment-only
changes; no functional impact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant