From ac1fc5b824dff120fe229cc5a57ca22f39777501 Mon Sep 17 00:00:00 2001 From: Yan Sun Date: Mon, 8 Jun 2026 11:24:11 -0700 Subject: [PATCH 1/4] GPUOP-689: Multi-Phase GPU Cluster Validation for Datacenter Bringup (#1473) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * GPUOP-689: Add codie design docs for 7 multi-phase validation stories Adds design documents for all Stories under Epic GPUOP-689 (Multi-Phase GPU Cluster Validation for Datacenter Bringup): - GPUOP-690: Configuration framework + multi-phase CronJob orchestrator - GPUOP-691: Phase 1 — Per-Node GPU Hardware Acceptance - GPUOP-692: Phase 2 — Intra-Node GPU Collective Test - GPUOP-693: Phase 3 — Per-Node NIC Health Check - GPUOP-694: Phase 4 — Pairwise Rail Bandwidth Test - GPUOP-695: Phase 4.5 — Cross-Node Connectivity Matrix Test - GPUOP-696: Phase 5 — Multi-Node RCCL Collective Test Each design references the shared GPUOP-690 contract (uniform label keys, helper functions, per-phase skip flags, run_phaseN orchestrator stubs) so the 6 phase Stories can land independently. Co-Authored-By: Claude Opus 4.6 * GPUOP-689: Add codie test plans for 7 multi-phase validation stories Adds test plans for each Story under Epic GPUOP-689: - GPUOP-690: Configuration Framework (16 cases: 5 P0, 9 P1, 2 P2) - GPUOP-691: Phase 1 GPU HW Acceptance (15 cases: 5 P0, 9 P1, 1 P2) - GPUOP-692: Phase 2 GPU Mesh (14 cases: 4 P0, 10 P1) - GPUOP-693: Phase 3 NIC Health (16 cases: 5 P0, 10 P1, 1 P2) - GPUOP-694: Phase 4 Pairwise Rail BW (20 cases: 5 P0, 12 P1, 3 P2) - GPUOP-695: Phase 4.5 Connectivity Mesh (14 cases: 4 P0, 9 P1, 1 P2) - GPUOP-696: Phase 5 Multi-Node RCCL (19 cases: 6 P0, 11 P1, 2 P2) Total: 114 test cases across Positive, Negative, Boundary, Error Recovery, Concurrency, Integration, Upgrade/Downgrade, and Performance categories. Per-test-case files deferred — will be generated during /codie:task-test-plan for each Sub-task as engineers implement, since test-step CLIs depend on actual code landed. Co-Authored-By: Claude Opus 4.6 * GPUOP-754: Replace skip-tests block with 5 per-phase skip flags Co-Authored-By: Claude Opus 4.6 * GPUOP-754: Add unit tests for config.json schema extension Co-Authored-By: Claude Opus 4.6 * GPUOP-755: Add per-phase env vars and label-key constants to ConfigMap Replace the single SKIP_GPU_VALIDATION placeholder with five per-phase skip flags (SKIP_GPU_HW_ACCEPTANCE, SKIP_GPU_MESH_VALIDATION, SKIP_NIC_VALIDATION, SKIP_RAIL_BANDWIDTH_TEST, SKIP_RCCL_TEST) aligned with the new config.json schema landed in GPUOP-754. Add uniform PHASE{1..5}_LABEL_KEY constants and PHASE_FAILURE_REASON_ANNOTATION_SUFFIX so every downstream phase consumes the same label keys rather than hard-coding their own. Owned by GPUOP-690; this is the ConfigMap env-var surface that downstream Sub-tasks (helper library, orchestrator, per-phase scripts) read from. Co-Authored-By: Claude Opus 4.6 * GPUOP-755: Add unit tests for ConfigMap env-var surface 30 test cases covering the cluster-validation-config ConfigMap contract introduced by GPUOP-755: - 6 SKIP_* placeholder presence + templating-sentinel value checks - 6 PHASE{N}_LABEL_KEY + PHASE_FAILURE_REASON_ANNOTATION_SUFFIX value assertions against the design doc - 2 backward-incompat guards (legacy SKIP_GPU_VALIDATION absent) - 5 type/format checks (label namespacing, kebab-case) - 9 regression guards (surrounding ConfigMap content preserved) - 2 YAML parse / duplicate-key checks Framework: Python 3 + unittest + PyYAML. Fallback grep harness included for envs without PyYAML. Co-Authored-By: Claude Opus 4.6 * GPUOP-756: Add PHASE_NODE_LABEL_SCRIPT helper library Co-Authored-By: Claude Opus 4.6 * GPUOP-756: Add unit tests for PHASE_NODE_LABEL_SCRIPT Co-Authored-By: Claude Opus 4.6 * GPUOP-757: Rewrite CronJob as phased orchestrator with run_phaseN stubs + DRY_RUN Replace the linear submit-mpijob container args with a phased state machine implementing the design in GPUOP-690 §4. The orchestrator: * sources PHASE_NODE_LABEL_SCRIPT helpers from the ConfigMap * walks Phase 0 -> 1 -> 2 -> 3 -> 4 -> 4.5 -> 5 * gates each phase on SKIP_ flags from the ConfigMap * narrows the live-node pool between phases via filter_passed_nodes keyed on PHASE{N}_LABEL_KEY * intersects with feature.node.kubernetes.io/amd-nic=true before Phase 3 so NIC-requiring phases only see NIC-capable nodes * tracks per-phase failures and exits non-zero at the end if any phase produced failed nodes (fail-loud, design §6) * provides DRY_RUN=1 that replaces run_phaseN with no-ops, skips Phase 0 candidate labeling, and prints the planned phase order + per-phase pool without any kubectl writes * cleans up old MPIJobs and collects launcher logs only when Phase 5 actually ran run_phaseN stubs print a banner and source $PHASE{N}_SCRIPT (from the ConfigMap) if defined; otherwise no-op. Downstream Stories (GPUOP-691..696) slot in their per-phase scripts via ConfigMap without orchestrator changes. Co-Authored-By: Claude Opus 4.6 * GPUOP-757: Add unit tests for CronJob phased orchestrator 68 test cases across 9 categories, framework Python 3 + unittest + PyYAML + bash mock-kubectl-on-PATH (matches GPUOP-754/755/756). Maps to GPUOP-690 Story test cases TC1, TC2, TC3, TC4, TC5, TC6, TC8, TC11, TC12, TC13, TC14. Co-Authored-By: Claude Opus 4.6 * GPUOP-758: Default candidate selection to amd-gpu=true only Drop the joined amd-nic=true requirement from the default node selector so a datacenter early in bringup can run Phase 1 (GPU HW acceptance) and Phase 2 (intra-node GPU mesh) before any NIC infrastructure exists. NIC-requiring phases (3, 4) narrow the candidate set inside their own run_phaseN scripts via intersect ... amd-nic=true, per GPUOP-690 design sections 4 and 5. Files: - configs/config.json: default node-selector-labels reduced to [feature.node.kubernetes.io/amd-gpu=true]. - configs/cluster-validation-config.yaml: inline ConfigMap docs updated to explain the new default and where NIC narrowing happens; NODE_SELECTOR_LABELS placeholder unchanged. - gpu-cluster.sh: operator's templated jq fallback default for node-selector-labels reduced to the GPU-only list to match. Resource-busy checks and NODE_VALIDATION_INTERVAL_MINS skip behavior are unchanged. Co-Authored-By: Claude Opus 4.6 * GPUOP-758: Add unit tests for candidate-selection default Co-Authored-By: Claude Opus 4.6 * GPUOP-759: Add bash unit tests for helper library + orchestrator DRY_RUN Adds a hand-rolled bash test harness under example/gpu-validation-cluster/tests/ covering the PHASE_NODE_LABEL_SCRIPT helper library (GPUOP-756) and the multi-phase CronJob orchestrator's DRY_RUN=1 mode (GPUOP-757). Helper library tests (26 cases): * label_phase_passed / label_phase_failed / annotate_phase_value / filter_passed_nodes positive paths * argument validation (empty / wrong arity returns non-zero, no kubectl side effects) * kubectl failure propagation * contract invariants: --overwrite on every write, diagnostics to stderr Orchestrator DRY_RUN tests (5 cases) from design doc section 7: * all 5 skip flags true -> exit 0, zero kubectl writes * phases 1+2 enabled, 3-5 skipped -> Phase 2 followed by SKIP_* pass-through * Phase 3 enabled but no amd-nic=true nodes -> empty pool, exit 0 * empty Phase 0 candidate pool -> exit 0 * cleanup / log collection honor DRY_RUN Infrastructure: * lib/assert.sh -- hand-rolled assertions (no bats dependency) * lib/kubectl_mock.sh -- recording kubectl shim installed on PATH (function override is unreliable for sourced sub-shells) * lib/extract_script.sh -- pure-bash/awk extractor for YAML block scalars (no PyYAML / yq dependency) * run_all.sh -- entry point; exits 0 only when every file reports zero failures All 31 tests pass locally with bash 5.x. Co-Authored-By: Claude Opus 4.6 * GPUOP-760: Add multi-phase pipeline docs and SKIP_GPU_VALIDATION migration note Document the 5-phase validation pipeline (phases / labels / skip flags table), new per-phase amd.com/* label keys, the incremental-bringup default posture, DRY_RUN=1 orchestrator testing mode, and the migration mapping from the removed skip-gpu-validation flag to the new 5-flag skip-tests block. Update the per-node status legend to include the per-phase labels and failure-reason annotations. Co-Authored-By: Claude Opus 4.6 * GPUOP-760: Add unit tests for README + migration note Co-Authored-By: Claude Opus 4.6 * GPUOP-761: Extend GPU_VALIDATION_TESTS_JSON with AGFHC recipes; bump TEST_RUNNER_JOB_WAIT_TIME Add three AGFHC TestCases (xgmi_lvl1, pcie_lvl1, hbm_lvl1) to the MANUAL TestCases array under TestConfig.GPU_HEALTH_CHECK.TestLocationTrigger.global.TestParameters. All use StopOnFailure: true, TimeoutSeconds: 600, Iterations: 1 per GPUOP-691 design doc section 4. Bump TEST_RUNNER_JOB_WAIT_TIME from 1200s to 3600s to cover the longer combined test set (RVS gst_single ~20 min + 3 AGFHC level-1 recipes, ~50 min worst case) plus pod scheduling / result upload headroom. Co-Authored-By: Claude Opus 4.6 * GPUOP-761: Add unit tests for GPU_VALIDATION_TESTS_JSON + TEST_RUNNER_JOB_WAIT_TIME 28 test cases (5 structural, 8 content, 4 numeric/budget, 6 regression, 3 boundary, 2 cross-source) covering the ConfigMap contract for the expanded TestCases array and bumped wait-time budget. All 28 cases mapped to GPUOP-691 Story test plan TC1/TC2/TC4/TC5/TC9. Co-Authored-By: Claude Opus 4.6 * GPUOP-762: Add PHASE1_SCRIPT to cluster-validation-config ConfigMap Implements GPUOP-691 design §4 -> "Code Path: PHASE1_SCRIPT". Sourced by GPUOP-690 orchestrator's run_phase1 via _run_phase_generic. Behavior: 1. Parallel-submit Test Runner Job per input node (existing template). 2. Poll-wait all Jobs up to TEST_RUNNER_JOB_WAIT_TIME. 3. Parse result.json (local hostPath, fallback kubectl cp) for failed sub-test name; recipe-not-found surfaced as a distinct reason. 4. Label via GPUOP-690 helpers (label_phase_passed / label_phase_failed + annotate_phase_value failed-subtest=). 5. SKIP_GPU_HW_ACCEPTANCE=true early-exit pass-labels every input node. 6. Cleanup hung (timed-out) Jobs at the end. Missing result.json -> failed reason test-runner-did-not-emit-results, subtest=unknown. Missing required env -> all nodes failed with reason phase1-missing-env:. kubectl apply failure on a single node fails just that node with reason job-creation-failed; other nodes proceed. Co-Authored-By: Claude Opus 4.6 * GPUOP-763: Wire run_phase1 orchestrator to source PHASE1_SCRIPT Replace the generic-dispatch stub for run_phase1 (delivered by GPUOP-690.4 as a one-liner delegating to _run_phase_generic) with a dedicated function body that: * prints a Phase 1 banner with input nodes * returns 0 cleanly when the input pool is empty * returns 0 cleanly when PHASE1_SCRIPT is unset (so a partial rollout that has not yet shipped PHASE1_SCRIPT does not break the orchestrator) * writes $PHASE1_SCRIPT to /tmp/run-phase1.sh, chmod +x, and sources it with the input nodes as positional args * exports PHASE_NODES (documented input handle for per-phase scripts) * preserves the existing fail-tolerant pattern (warn on non-zero rc, always return 0 so other phases can continue) run_phase2..run_phase5 remain on _run_phase_generic until their respective PHASE{N}_SCRIPT bodies land. The DRY_RUN run_phase1 override at the dry-run block is unchanged. Files: example/gpu-validation-cluster/configs/cluster-validation-job.yaml Design: docs/codie/designs/GPUOP-691-phase1-gpu-hw-acceptance-design.md Section 4 -> "Code Path: orchestrator wiring" Co-Authored-By: Claude Opus 4.6 * GPUOP-763: Add unit tests for run_phase1 orchestrator wiring 25 unit tests covering the dedicated run_phase1 body added in commit 8d51b91: * 14 contract tests (bash parse, function presence, source-level grep for tmp-path write, chmod, source-with-nodes, PHASE_NODES export, empty/unset guards, fail-tolerant warn pattern, regression guards that other phases still use the generic dispatch, DRY_RUN override intact, call site unchanged, no kubectl in the body) * 8 behavioral tests (probe-script-driven: empty input, unset PHASE1_SCRIPT, single node, multi-node positional expansion, PHASE_NODES exported visible to child processes, tmp file exists + executable, non-zero rc warns but returns 0, zero rc no warning) * 1 DRY_RUN override test (override at line 585 still shadows the new dedicated body) * 2 regression tests (filter_passed_nodes call unchanged; existing GPUOP-759 orchestrator dry-run suite still passes) Suggested test file: example/gpu-validation-cluster/tests/test_run_phase1_wiring.sh Helper extension needed: extract_bash_function() in lib/extract_script.sh Mapped to Story TCs: TC2, TC3, TC4, TC8, TC11, TC13 (downstream test cases TC1, TC5-7, TC9-12, TC14, TC15 are out of scope for the wiring Sub-task; they live in PHASE1_SCRIPT / GPUOP-764). Co-Authored-By: Claude Opus 4.6 * GPUOP-764: Add bash unit tests for PHASE1_SCRIPT Realizes test cases TC1, TC2, TC4, TC5, TC6, TC7, TC8, TC10 from the GPUOP-691 Phase 1 test plan against the PHASE1_SCRIPT body embedded in configs/cluster-validation-config.yaml (GPUOP-762). New files: - tests/test_phase1.sh: 12 tests covering empty input, single-node pass, single-node hbm fail, mixed pass/fail across 2 nodes, missing result.json -> result-parse-failed, recipe-not-found -> recipe-not- found, SKIP_GPU_HW_ACCEPTANCE=true short-circuit, job-creation failure, poll-wait timeout, dry-run skip annotation, idempotent --overwrite labels, helper-failure propagation. The test sources PHASE1_SCRIPT after sed-patching the hardcoded job-template path, results root, and timestamp generation so job names are deterministic (PHASE1_TEST_TS) and templates resolve to test fixtures. - tests/fixtures/phase1/result-pass.json: 4 sub-tests all PASS. - tests/fixtures/phase1/result-hbm-fail.json: hbm_lvl1 FAIL, others PASS. - tests/fixtures/phase1/result-recipe-not-found.json: error string matches the recipe-not-found jq regex. Mock extensions (tests/lib/kubectl_mock.sh): - Early-route detection for 'kubectl get pods -l job-name=X -o jsonpath={.items[-1:].metadata.name}' so the pod lookup runs before the generic '-l' selector arm exits. - jsonpath handler for '{.status.conditions[?(@.type=="Complete"|"Failed")].status}' on job objects, seeded via new kubectl_mock_set_job_condition helper. - Fail-injection support for apply|delete|patch|create verbs (was previously hardcoded exit 0); mirrors one-shot / .sticky semantics used by label|annotate. - SIGPIPE-safe stdin drain at the top of the mock shim. Without this, callers like 'sed ... | kubectl apply -f -' under 'set -o pipefail' intermittently saw rc=141 because the mock exited before reading stdin, and the script then took the job-creation-failed path. - New helpers: kubectl_mock_set_job_condition, kubectl_mock_set_pod_for_job. All 43 tests across the suite pass (test_assert: 5, test_phase1: 12, test_phase0_helpers: 26). Verified stable across 10 sequential runs after the SIGPIPE fix. Co-Authored-By: Claude Opus 4.6 * GPUOP-765: Remove obsolete TEST_RUNNER_SUCCESS_LABEL / TEST_RUNNER_FAILURE_LABEL references Phase 1 now uses the uniform PHASE1_LABEL_KEY (= amd.com/gpu-hw-acceptance) written via the GPUOP-690 PHASE_NODE_LABEL_SCRIPT helper library. The legacy `TEST_RUNNER_SUCCESS_LABEL` / `TEST_RUNNER_FAILURE_LABEL` env vars in cluster-validation-config.yaml are obsolete; delete the env vars and the entire `GPU_VALIDATION_TEST_SCRIPT` block (whose role is fully replaced by `PHASE1_SCRIPT` sourced from the orchestrator's run_phase1 in cluster-validation-job.yaml). Replace the deleted blocks with inline migration-note comments so deployers and downstream consumers can trace the rename. Update the now-stale `GPU_VALIDATION_TEST_SCRIPT` reference in the surviving PHASE1_SCRIPT polling-cadence comment. Verified: * YAML still parses (1 doc in config, 6 in job). * GPU_VALIDATION_TESTS_JSON / TESTS_JSON still valid JSON. * PHASE1_SCRIPT, PHASE_NODE_LABEL_SCRIPT, CRONJOB_CANDIDATE_NODES_SELECTION_SCRIPT, WAIT_FOR_WORKERS_SCRIPT, VALIDATE_RCCL_TEST_SCRIPT, RCCL_ENV_VARS all pass `bash -n`. * No matches for the removed identifiers remain in either configs YAML except the intentional migration-note comments. Design: docs/codie/designs/GPUOP-691-phase1-gpu-hw-acceptance-design.md section 4 (recipe expansion section, last paragraph). Parent: GPUOP-691. Co-Authored-By: Claude Opus 4.6 * GPUOP-765: Add unit tests for legacy label-var / script-block removal 38 test cases covering: * Removal contract (top-level keys + inline shell-script references for both removed env vars and the removed legacy script block) * Migration-note guard (deletion is documented inline so future readers can trace the rename without git archaeology) * Replacement contract (PHASE1_LABEL_KEY, helper library, PHASE1_SCRIPT cover the roles vacated by the removed symbols) * Companion-file guard (cluster-validation-job.yaml is also clean) * Regression guards for every sibling ConfigMap key added or preserved by GPUOP-755 / GPUOP-756 / GPUOP-761 / GPUOP-763 * File-parse / bash-syntax structural checks on every embedded script Framework: Python 3 + unittest + PyYAML (matches GPUOP-755 / GPUOP-756 / GPUOP-761 / GPUOP-763 conventions for the same artifact); includes a grep-based shell fallback harness for environments without Python. Story TC mapping: TC1, TC2, TC4, TC5, TC13. Co-Authored-By: Claude Opus 4.6 * GPUOP-766: Document Phase 1 GPU HW acceptance in README Add dedicated Phase 1 subsection covering scope, the 4 sub-tests (RVS gst_single + AGFHC xgmi_lvl1/pcie_lvl1/hbm_lvl1), the amd.com/gpu-hw-acceptance result label, sample failure annotation output (-failure-reason and -failed-subtest), the failure-reason catalog, and the SKIP_GPU_HW_ACCEPTANCE short-circuit behavior. Co-Authored-By: Claude Opus 4.6 * GPUOP-766: Add unit tests for Phase 1 README docs Co-Authored-By: Claude Opus 4.6 * GPUOP-767: Add cluster-validation-phase2-job-config ConfigMap New ConfigMap holding the batch/v1.Job template for the Phase 2 single-node 8-GPU RCCL all_reduce_perf test. Mirrors the test-runner Job pattern but uses RCCL_WORKLOAD_IMAGE, requests all 8 GPUs, and runs mpirun --host localhost with the PHASE2_* sizing knobs from cluster-validation-config. Reuses VALIDATE_RCCL_TEST_SCRIPT for bandwidth threshold check. Uses \$\$NODE placeholder for sed substitution by PHASE2_SCRIPT. See docs/codie/designs/GPUOP-692-phase2-gpu-mesh-validation-design.md section 4 -> "Code Path: Phase 2 Job template ConfigMap". Co-Authored-By: Claude Opus 4.6 * GPUOP-767: Add unit tests for cluster-validation-phase2-job-config 52 test cases covering structural / content / resource / command / boundary (no-IB invariant) / rendering / regression categories for the new Phase 2 per-node Job template ConfigMap. Mapped to 9 Story-level test cases in GPUOP-692 test plan. Python 3 + unittest + PyYAML harness, with a grep-only fallback for CI environments without Python. Co-Authored-By: Claude Opus 4.6 * GPUOP-768: Add Phase 2 env vars + PHASE2_RCCL_ENV_VARS to cluster-validation-config Co-Authored-By: Claude Opus 4.6 * GPUOP-768: Add unit tests for Phase 2 env vars + PHASE2_RCCL_ENV_VARS Co-Authored-By: Claude Opus 4.6 * GPUOP-769: Add PHASE2_SCRIPT to cluster-validation-config Mirror of PHASE1_SCRIPT. Drives Phase 2 (intra-node GPU collective) per-node Job lifecycle: 1. Sed-render the cluster-validation-phase2-job-config template ($$NODE, $$RCCL_WORKLOAD_IMAGE) and kubectl apply per input node. 2. Poll-wait every Job up to PHASE2_JOB_WAIT_TIME. 3. Classify by Job condition + container log markers: Complete=True -> passed "phase2 mpirun exited" -> rccl-crash "phase2 bandwidth below threshold" -> bus-bw-below-threshold pending/active past timeout -> timeout 4. Label via GPUOP-690 helpers (label_phase_passed / label_phase_failed). Annotate measured Avg bus bandwidth on both pass and bw-fail paths. 5. SKIP_GPU_MESH_VALIDATION=true short-circuits to pass-label-all. 6. Delete hung Jobs after the wait budget. Depends on GPUOP-690 helpers (PHASE_NODE_LABEL_SCRIPT), GPUOP-767 (cluster-validation-phase2-job-config template), GPUOP-768 (PHASE2_* env vars). Orchestrator wiring is GPUOP-770; /phase2-configs volume mount is GPUOP-771. Co-Authored-By: Claude Opus 4.6 * GPUOP-769: Add unit tests for PHASE2_SCRIPT 37 unit test cases split into two layers: * Contract (24 cases): source-level grep / bash -n on extracted PHASE2_SCRIPT body. Confirms every design-doc clause is present (env validation list, sed substitutions, helper invocations, classify markers, cleanup guard). * Behavioral (13 cases): bash + lib/kubectl_mock.sh, sourcing the extracted script under a stripped-down harness with PHASE_NODE_LABEL_SCRIPT helpers. Covers empty input, skip mode, missing env / template, single-node pass / bw-fail / rccl-crash / timeout, three-node mixed, long-hostname hash fallback, submit failure, helper failure tolerance. Coverage maps to 9 of 14 Story TCs from GPUOP-692-test-plan.md (TC1-3, TC5-10); TC4 is GPUOP-768 (env-vars), TC11-12 are GPUOP-770 (orchestrator wiring), TC13-14 are integration / performance and out of unit-test scope. Co-Authored-By: Claude Opus 4.6 * GPUOP-770: Wire run_phase2 to source PHASE2_SCRIPT Replace the generic-dispatch run_phase2 stub (delivered by GPUOP-690.4) with a dedicated function that sources $PHASE2_SCRIPT, mirroring the run_phase1 wiring landed by GPUOP-763. See design doc GPUOP-692 section 4 -> "Code Path: orchestrator wiring". Single-function-body change in cluster-validation-job.yaml. Co-Authored-By: Claude Opus 4.6 * GPUOP-770: Add unit tests for run_phase2 orchestrator wiring 27 test cases (15 contract, 8 behavioral wiring, 1 DRY_RUN override, 3 regression) covering the dedicated run_phase2 function added in cluster-validation-job.yaml. Mirrors the GPUOP-763 unit-test plan for run_phase1. Co-Authored-By: Claude Opus 4.6 * GPUOP-771: Implement cluster-validation-job.yaml Mount cluster-validation-phase2-job-config ConfigMap into the CronJob submit-mpijob container at /phase2-configs, paralleling the existing mpi-configs and test-runner-configs mounts so that PHASE2_SCRIPT can sed-render the Phase 2 per-node Job template. Co-Authored-By: Claude Opus 4.6 * GPUOP-771: Add unit tests for cluster-validation-job.yaml Unit test plan covering the phase2-configs volumeMount + volume addition to the CronJob submit-mpijob container: YAML-structure parse, source-text grep, kubectl client-side schema validation, and sibling-mount regression guards. Co-Authored-By: Claude Opus 4.6 * GPUOP-772: Bash unit tests + sample phase2.log fixtures for PHASE2_SCRIPT Adds tests/test_phase2.sh exercising PHASE2_SCRIPT (GPUOP-769) end-to-end against the existing kubectl mock harness, plus four sample phase2.log fixtures under tests/fixtures/phase2/ for the pass / bw-below-threshold / rccl-crash / failed-no-marker classification paths. Coverage (design GPUOP-692 section 7, test plan GPUOP-692-test-plan.md): * pass (BW above threshold) -> =passed + measured-bw annotation * bus-bw-below-threshold fail (also covers PHASE2_BW_THRESHOLD=9999 inject) * rccl-crash via mpirun-exited marker * default rccl-crash on Failed=True without a recognized marker * timeout via PHASE2_JOB_WAIT_TIME=0 + cleanup delete * SKIP_GPU_MESH_VALIDATION=true short-circuit (case-insensitive) * missing-env fast-fail (phase2-missing-env reason) * missing job template -> job-template-missing reason * parallel-submit ordering pin (all applies precede any get-job poll) * job-creation-failed when kubectl apply returns non-zero * PHASE_NODES env-var fallback when no positional args * PHASE2_RCCL_ENV_VARS contains no IB/fabric tunables (TC4) Extends lib/kubectl_mock.sh with a `logs` verb (the original mock only honored label/annotate/get/apply/delete) and a kubectl_mock_set_pod_log seed helper so per-pod log content can be served as base64-encoded state, mirroring the kubectl_mock_set_pod_for_job pattern from GPUOP-764. Co-Authored-By: Claude Opus 4.6 * GPUOP-773: Document Phase 2 intra-node GPU mesh validation in README Add a Phase 2 subsection to example/gpu-validation-cluster/README.md mirroring the structure of the existing Phase 1 subsection: scope, single-node mpirun --np 8 --host localhost all_reduce_perf workload, amd.com/gpu-mesh-validation label, ConfigMap env vars (PHASE2_*), failure-reason annotation values (bus-bw-below-threshold, rccl-crash, xgmi-init-failure, timeout, job-creation-failed), and the SKIP_GPU_MESH_VALIDATION short-circuit behavior. Includes a Threshold Tuning subsection covering: the default 200 GB/s for MI300 xGMI, how to override PHASE2_BW_THRESHOLD via the cluster-validation-config ConfigMap, expected ranges for MI300X/A, MI250/X, MI210, and MI100 SKUs, and a 4-step tuning workflow. Co-Authored-By: Claude Opus 4.6 * GPUOP-774: Implement cluster-validation-phase3-job-config Add cluster-validation-phase3-job-config ConfigMap with the per-node NIC health Job template (batch/v1.Job) used by GPUOP-693 Phase 3. Per docs/codie/designs/GPUOP-693-phase3-nic-health-design.md section 4 -> Code Path: Phase 3 Job template: - serviceAccountName: cluster-validation-sa (preexisting, has node label/annotate perms via cluster-validation-role). - securityContext.privileged: true (required for rdma/ibv tooling). - Requests amd.com/nic: $$EXPECTED_NIC_COUNT (sole-tenant on NICs). - Image: docker.io/rocm/network-operator-utils:v1.1.0 (same as Phase 5 launcher init-container). - envFrom cluster-validation-config -- sources PHASE3_CHECK_SCRIPT body, PHASE3_LABEL_KEY, PHASE3_EXPECTED_NIC_COUNT, and the GPUOP-690 label/annotate helper conventions. - NODE_NAME via downward API for in-pod kubectl self-labelling. - Placeholders $$NODE and $$EXPECTED_NIC_COUNT are sed-substituted by PHASE3_SCRIPT (separate Sub-task) at submit time. - backoffLimit: 0, ttlSecondsAfterFinished: 300 (matches Phase 2). Scope is limited to the Job template ConfigMap only. The PHASE3_CHECK_SCRIPT body, PHASE3_* env vars, and orchestrator wiring are tracked under sibling Sub-tasks (GPUOP-693.2/.3+). Verified with kubectl apply --dry-run=client on the outer file (all 8 resources accepted) and on the rendered inner Job template. Co-Authored-By: Claude Opus 4.6 * GPUOP-774: Add unit tests for cluster-validation-phase3-job-config 50 test cases covering structural / content / security / resource / command / boundary (no-scope-creep) / rendering / regression categories for the new Phase 3 per-node NIC health Job template ConfigMap. Mapped to 8 Story-level test cases in GPUOP-693 test plan. Python 3 + unittest + PyYAML harness (matches sibling GPUOP-767 Phase 2 ConfigMap tests), with a grep-only fallback for CI environments without Python. Verifies the Job template contract: image docker.io/rocm/network-operator-utils:v1.1.0, privileged: true, amd.com/nic: $$EXPECTED_NIC_COUNT, serviceAccountName cluster-validation-sa, envFrom cluster-validation-config, NODE_NAME downward-API, and the three-line script that materializes and executes $PHASE3_CHECK_SCRIPT. Scope guards confirm the template does NOT inline the PHASE3_CHECK_SCRIPT body, PHASE3_* env values, kubectl label calls, or amd-nic=true nodeSelector -- those are owned by sibling Sub-tasks (GPUOP-693.2 / .3 / .4). Co-Authored-By: Claude Opus 4.6 * GPUOP-775: Add PHASE3_CHECK_SCRIPT to cluster-validation-config In-Job per-node NIC health check with 4 structural validations: NIC count (lspci vendor 1dd8), ip link state UP, rdma link state ACTIVE, ibv_devinfo + GID table non-empty. Self-labels the node via in-pod kubectl (cluster-validation-sa). On any failure writes ${PHASE3_LABEL_KEY}=failed plus failure-reason / failed-nics annotations (each truncated to 250 bytes for K8s safety) and exits non-zero. Matches GPUOP-693 design Section 4 -> Code Path: PHASE3_CHECK_SCRIPT. Co-Authored-By: Claude Opus 4.6 * GPUOP-775: Add unit tests for cluster-validation-config Co-Authored-By: Claude Opus 4.6 * GPUOP-776: Add Phase 3 ConfigMap env vars Add four Phase 3 (per-node NIC health) ConfigMap data keys to cluster-validation-config so the Phase 3 Job (GPUOP-774 template) can project them via envFrom: * PHASE3_EXPECTED_NIC_COUNT="8" - PCI function count per node * PHASE3_AMD_NIC_VENDOR_ID="1dd8" - Pensando PCI vendor ID * PHASE3_MIN_GID_COUNT="1" - min GID table entries per dev * PHASE3_JOB_WAIT_TIME="120" - per-node Job wallclock budget Values match the defensive defaults already in PHASE3_CHECK_SCRIPT (GPUOP-775) so existing behavior is preserved; this Sub-task makes them ConfigMap-tunable. Design: docs/codie/designs/GPUOP-693-phase3-nic-health-design.md section 4 -> "Code Path: Phase 3 ConfigMap variables". Co-Authored-By: Claude Opus 4.6 * GPUOP-776: Add unit tests for cluster-validation-config Phase 3 env vars 32 unit test cases (18 P0, 13 P1, 1 P2) covering presence, value, type discipline, boundary, script-default consistency with PHASE3_CHECK_SCRIPT (GPUOP-775), and regression guards for the 4 new Phase 3 ConfigMap data keys: * PHASE3_EXPECTED_NIC_COUNT * PHASE3_AMD_NIC_VENDOR_ID * PHASE3_MIN_GID_COUNT * PHASE3_JOB_WAIT_TIME Framework: Python 3 + unittest + PyYAML (matches sibling Phase 1/2/3 unit-test plans). Test harness + grep-only fallback inline. Maps to Story GPUOP-693 test plan cases #1, #2, #3, #5, #8, #12. Co-Authored-By: Claude Opus 4.6 * GPUOP-777: Implement PHASE3_SCRIPT orchestrator + run_phase3 wiring Adds the outer Phase 3 driver (submit + wait, no result fetch) to cluster-validation-config and replaces the generic run_phase3 stub with dedicated wiring mirroring the run_phase1/run_phase2 pattern. Mounts cluster-validation-phase3-job-config (GPUOP-774) at /phase3-configs so PHASE3_SCRIPT can sed-render the per-node Job template. amd-nic=true intersection and SKIP_NIC_VALIDATION pass- through stay on the orchestrator side (already in place). Per design GPUOP-693 §4 (Code Path: PHASE3_SCRIPT + orchestrator wiring), the in-pod PHASE3_CHECK_SCRIPT (GPUOP-775) self-labels passed/failed via in-pod kubectl, so PHASE3_SCRIPT only writes labels for the cases that prevent the in-pod kubectl from running: submit-failed -> job-creation-failed; timeout -> nic-not-allocated. Co-Authored-By: Claude Opus 4.6 * GPUOP-777: Add unit tests for PHASE3_SCRIPT orchestrator 40 test cases across 8 categories covering submit/wait, no-result- fetch contract, SKIP_NIC_VALIDATION pass-through, timeout-> nic-not-allocated fallback, run_phase3 wiring, and phase3-configs volume mount. Maps to 14 cases in GPUOP-693-test-plan.md. Co-Authored-By: Claude Opus 4.6 * GPUOP-778: Add unit tests for run_phase3 wiring + /phase3-configs mount Co-Authored-By: Claude Opus 4.6 * GPUOP-779: Add bash unit tests for Phase 3 NIC health check Covers PHASE3_CHECK_SCRIPT (in-Job 4-check NIC health + self-labelling, GPUOP-775) and PHASE3_SCRIPT (outer-driver submit/wait, GPUOP-777) per the GPUOP-693 test plan. Co-Authored-By: Claude Opus 4.6 * GPUOP-779: Add unit tests for Phase 3 NIC health bash test suite Co-Authored-By: Claude Opus 4.6 * GPUOP-779: Unit test results — PASS (21/21) Co-Authored-By: Claude Opus 4.6 * GPUOP-780: Add Phase 3 README section + sample failure annotations Co-Authored-By: Claude Opus 4.6 * GPUOP-781: Add Phase 4 server + client Job template ConfigMap Implements GPUOP-694.1 per design doc section 4 ("Code Path: Phase 4 Job templates"). Adds new cluster-validation-phase4-job-config ConfigMap with two batch/v1.Job templates (server and client) for ib_write_bw pairwise rail bandwidth measurement. Templates parameterized on $$NODE, $$PEER_POD_IP, $$RAIL_IDX, and $$NAD_NAME for sed-rendering by PHASE4_DRIVER_SCRIPT (sibling sub-task). Image rocm/roce-workload via $$ROCE_WORKLOAD_IMAGE. envFrom pulls PHASE4_* config vars (GPUOP-694.3). Per-rail NIC requested via amd.com/nic and k8s.v1.cni.cncf.io/networks Multus annotation. Server waits for client connect; client parses "BW average" line and tees output to /shared for driver log collection. Co-Authored-By: Claude Opus 4.6 * GPUOP-781: Add unit tests for cluster-validation-phase4-job-config * GPUOP-782: Add per-rail NetworkAttachmentDefinition manifests Adds configs/nad-per-rail.yaml with 8 NADs (amd-host-device-nad-rail-0..7) for Phase 4 pairwise rail bandwidth testing. Each NAD uses the host-device CNI plugin pinned to one rail's RDMA netdev (rdma_dev_N), matching PHASE4_IB_DEV_PREFIX. Conditional per the Sub-task description: skip applying if the cluster already ships per-rail NADs via the network-operator. Co-Authored-By: Claude Opus 4.6 * GPUOP-782: Add unit tests for per-rail NAD manifests 48 manifest contract assertions across 9 categories: structural, naming, metadata, CNI config, per-rail consistency, no-scope-creep, conditional deployment, Phase 4 cross-Sub-task, and regression guards. Framework: Python 3 + unittest + PyYAML (matches sibling GPUOP-781 / GPUOP-774 conventions). Co-Authored-By: Claude Opus 4.6 * GPUOP-783: Add Phase 4 pairwise rail bandwidth ConfigMap env vars Extend cluster-validation-config ConfigMap with 6 Phase 4 (pairwise rail bandwidth test) tunables consumed by the PHASE4_DRIVER_SCRIPT and per-pair server/client Jobs: - PHASE4_RAIL_COUNT = "8" (rails per node) - PHASE4_BW_THRESHOLD = "380" (Gbps minimum per rail) - PHASE4_PAIR_WAIT_TIME = "180" (per-rail wallclock seconds) - PHASE4_MAX_CONCURRENT_PAIRS = "8" (pair-runner concurrency cap) - PHASE4_NAD_NAME_PREFIX = "amd-host-device-nad-rail-" - PHASE4_IB_DEV_PREFIX = "rdma_dev_" Defaults match the MI300 fleet baseline validated on smc300x-ccs-aus-gpuf268. Section follows the Phase 2 / Phase 3 env-var layout and ownership convention (envFrom projection, no code change required to tune). See GPUOP-694 design §4 -> "Code Path: Phase 4 ConfigMap variables". Co-Authored-By: Claude Opus 4.6 * GPUOP-783: Add unit tests for Phase 4 ConfigMap env vars 44 test cases (26 P0, 14 P1, 4 P2) covering the 6 new PHASE4_* keys added to cluster-validation-config: - Presence (8 tests): every key from the design doc lands - Value (6 tests): exact defaults per design §1, §2, §4 - Type (7 tests): str discipline + positive-int + NAD/IB-dev prefix format regex - Boundary (8 tests): lower/upper bounds + non-empty prefixes - Phase 4 contract (3 tests): rail-count == NIC-count, label-key consistency, NAD naming match - Regression (9 tests): Phase 1/2/3 blocks and shared helpers (PHASE_NODE_LABEL_SCRIPT, PHASE_FAILURE_REASON_ANNOTATION_SUFFIX) intact; no duplicate keys - File smoke (3 tests): no tabs, comment provenance header, block ordering after PHASE3_SCRIPT Framework: Python 3 + unittest + PyYAML (matches sibling unit-test plans GPUOP-774/-775/-776 for Phase 3). Test harness reference + grep-only CI fallback included in the document. Story TP map covers cases #3, #5, #6, #7, #11, #12, #13, #14, #15, #20; downstream cases (pairing, driver-script, smoke) deferred to sibling Sub-tasks. Co-Authored-By: Claude Opus 4.6 * GPUOP-784: Implement PHASE4_DRIVER_SCRIPT in cluster-validation-config Adds the Phase 4 (pairwise rail bandwidth) driver script to the cluster-validation-config ConfigMap per GPUOP-694 design §4 -> "Code Path: PHASE4_DRIVER_SCRIPT". Behavior: 1. Sort + round-robin pair input nodes (already gated to amd.com/nic-health=passed upstream); odd-count last node is pass-labeled with `unpaired=true` annotation. 2. Fork pair_runner per pair, bounded by PHASE4_MAX_CONCURRENT_PAIRS (defaults to 8). 3. pair_runner iterates rails 0..(PHASE4_RAIL_COUNT-1) serially: render+apply server Job (pinned to node A), wait for pod IP, render+apply client Job with that IP (pinned to node B), wait both, parse "BW average" Gbps from client log, record per- (node, rail) result via a tmpdir state surface, cleanup. 4. Aggregate: a node passes iff ALL its rails were >= PHASE4_BW_THRESHOLD. 5. Label via GPUOP-690 helpers (label_phase_passed / label_phase_failed / annotate_phase_value); write per-rail BW annotations and a failed-rails CSV summary. 6. SKIP_RAIL_BANDWIDTH_TEST=true short-circuits to pass-labelling every input node with no Jobs created. Error reasons surfaced per-(node, rail): peer-pod-unready, nad-missing, ib-write-bw-crashed, parse-failed, api-throttled (with up to 3-retry exponential backoff on 429), below-threshold:. Consumes: PHASE4_* env vars (GPUOP-694.3), Job templates at /phase4-configs/cluster-validation-phase4-{server,client}-job-config.yaml (GPUOP-694.1, mounted by GPUOP-694.6), GPUOP-690 helpers. Wiring to run_phase4 (GPUOP-694.5) and the /phase4-configs mount (GPUOP-694.6) are tracked under their own sub-tasks. Co-Authored-By: Claude Opus 4.6 * GPUOP-784: Add unit tests for cluster-validation-config * GPUOP-785: Wire run_phase4 orchestrator to source PHASE4_DRIVER_SCRIPT Replace the generic _run_phase_generic stub for Phase 4 with a dedicated run_phase4 wiring function (mirrors run_phase1/2/3) that sources PHASE4_DRIVER_SCRIPT (GPUOP-694.4). Add /phase4-configs volumeMount + ConfigMap volume so PHASE4_DRIVER_SCRIPT can read the server + client Job templates from cluster-validation-phase4-job-config (GPUOP-694.1 / GPUOP-781). See docs/codie/designs/GPUOP-694-phase4-rail-bandwidth-design.md section 4 -> "Code Path: orchestrator wiring". Co-Authored-By: Claude Opus 4.6 * GPUOP-785: Add unit tests for run_phase4 orchestrator wiring 26 unit test cases covering: - run_phase4 interface contract (sources PHASE4_DRIVER_SCRIPT, passes nodes as positional args, exports PHASE_NODES) - Error paths (driver non-zero rc, unset/empty driver script) - Boundary cases (empty/single/many input nodes) - DRY_RUN override - /phase4-configs CronJob volume mount + ConfigMap binding - Phase 4 dispatch site SKIP guard + Phase-3 pool gating - Sibling-phase regression (phases 1-3 unchanged, run_phase5 still uses _run_phase_generic) - Static analysis (YAML parse, bash -n, shellcheck) Maps 6 unit tests to Story TP cases in docs/codie/test-plans/GPUOP-694-test-plan.md. Co-Authored-By: Claude Opus 4.6 * GPUOP-787: Add Phase 4 bash unit tests + ib_write_bw log fixtures Adds test_phase4.sh with 19 cases covering PHASE4_DRIVER_SCRIPT: pairing (round-robin even/odd, single-node, empty), all-pass single pair with per-rail BW annotations, single-rail-below- threshold, all-rails-fail, ib_write_bw crash, parse failure, server-pod-unready / nad-missing timeout paths, rail-count override, missing required env, missing job templates, SKIP short-circuit, 16-node concurrency cap honored, and the PHASE_NODES env-var fallback. Sample ib_write_bw client log fixtures live under tests/fixtures/phase4/ (pass, pass-high, below-threshold, crashed, empty). Extends lib/kubectl_mock.sh with two Phase-4-specific routes (pod-ip jsonpath, pods --no-headers shape) plus a concurrency- safe argv capture: the previous tail-back-the-call-log approach races under bounded-parallel pair_runners; now ARGS is captured from "$@" before any I/O and the call-log line is appended with a single atomic write. README.md updated with Phase 4 coverage section and the GPUOP-694 test-plan TC mapping. Co-Authored-By: Claude Opus 4.6 * GPUOP-788: Document Phase 4 (Pairwise Rail Bandwidth) in README Adds a Phase 4 section to the gpu-validation-cluster README covering scope, the pairwise per-rail ib_write_bw model, the 380 Gbps configurable threshold, sample per-rail annotations (passing and failing nodes), per-(pair, rail) failure reasons, unpaired-node behavior, SKIP_RAIL_BANDWIDTH_TEST short-circuit semantics, and concurrency-cap tuning guidance. Co-Authored-By: Claude Opus 4.6 * GPUOP-788: Add unit tests for Phase 4 README documentation Adds a 28-case unit-test plan covering section presence, ConfigMap defaults accuracy, sample annotation correctness, markdown structural lint, failure-reason vocabulary alignment with PHASE4_DRIVER_SCRIPT, SKIP and concurrency content, and pipeline-table regression. Tests are designed to run via tests/test_phase4_readme.sh with a stdlib-only Python md-lint helper. Co-Authored-By: Claude Opus 4.6 * GPUOP-789: Rename WAIT_FOR_WORKERS_SCRIPT to PHASE45_PREFLIGHT_SCRIPT Rename the ConfigMap key in cluster-validation-config.yaml from WAIT_FOR_WORKERS_SCRIPT to PHASE45_PREFLIGHT_SCRIPT and update the wait-for-worker-pods init-container in cluster-validation-job.yaml to source the new env var name. Pure rename; no behavior change. Makes room for the enhanced N×N SSH mesh, DNS, MPI spawn, and RCCL topology preflight checks landing in subsequent GPUOP-695 sub-tasks. Co-Authored-By: Claude Opus 4.6 * GPUOP-789: Add unit tests for cluster-validation-config rename Test plan covers the PHASE45_PREFLIGHT_SCRIPT rename: presence of new key, removal of old key, init-container env-var sourcing, envFrom wiring preservation, YAML parse, bash -n on extracted body, and byte-exact body invariance vs baseline. Co-Authored-By: Claude Opus 4.6 * GPUOP-790: Add N*N SSH mesh check to PHASE45_PREFLIGHT_SCRIPT Extends PHASE45_PREFLIGHT_SCRIPT (cluster-validation-config ConfigMap) so that, after the existing launcher->worker SSH readiness wait, every worker pod is probed against every worker IP via kubectl exec + ssh -o ConnectTimeout=5. Failed pairs are collected in failed_pairs and reported all at once at the end of the loop rather than aborting on the first failure, giving the operator a complete fabric picture for triage. See design doc GPUOP-695 §4 (Code Path: Enhanced PHASE45_PREFLIGHT_SCRIPT). The retry-with-timeout launcher->worker readiness probe is preserved as Phase 1; the new N*N matrix is Phase 2. UserKnownHostsFile=/dev/null is added to keep known_hosts from poisoning subsequent pairs when pods share a home volume. Co-Authored-By: Claude Opus 4.6 * GPUOP-790: Add unit tests for N*N SSH mesh probe 23 test cases covering the new N*N mesh loop in PHASE45_PREFLIGHT_SCRIPT: cardinality, flag-shape assertions (StrictHostKeyChecking=no, UserKnownHostsFile=/dev/null, ConnectTimeout=5), continue-past-first-failure semantics under set -euo pipefail, degenerate single-pod path, ENABLE_SSH_CHECK short-circuit, and the additive kubectl_mock.sh extensions (kubectl exec / wait verbs, training.kubeflow.org/job-name= selector route, kubectl_mock_set_pair_fail helper). Co-Authored-By: Claude Opus 4.6 * GPUOP-791: Add DNS forward+reverse check to PHASE45_PREFLIGHT_SCRIPT Per design GPUOP-695 §4 and Sub-task scope, add a Phase 3 DNS validation block to the launcher init-container script. From the FIRST worker pod, resolve every worker node hostname forward (getent hosts), then reverse-resolve the returned address. Misses are recorded in dns_misses[] and logged as WARN. The check does NOT short-circuit -- it iterates every hostname and continues to subsequent checks. Overall verdict gating (annotate-and-evict on failure) is owned by a separate Sub-task. WORKER_NODES is derived from .spec.nodeName (deduplicated). The in-pod loop runs under the existing `set -euo pipefail` script, so each getent call is guarded with `|| echo MISS` to keep the loop alive. Verified: yaml.safe_load() loads the ConfigMap and `bash -n` on the extracted PHASE45_PREFLIGHT_SCRIPT passes. Co-Authored-By: Claude Opus 4.6 * GPUOP-791: Add unit tests for DNS forward+reverse check 27 unit test cases covering the new Phase 3 DNS block of PHASE45_PREFLIGHT_SCRIPT: - 5 positive (resolution paths, FIRST_POD targeting, deduplication, renamed-key extraction) - 6 negative (forward miss, reverse miss, multi-miss, all-fail, no-exit-1 regression guard) - 4 boundary (N=1, empty pods, duplicate nodeName, hostname-with-dot) - 5 mock/dependency (jsonpath, JOB_LABELS, FIRST_POD reuse, fwd-then-rev order, awk first-token extraction) - 3 set -e safety (getent failure non-fatal, kubectl exec failure non-fatal, no-unbound-variable defense) - 4 integration (post-mesh sequencing, mesh-fail-skips-dns, ENABLE_SSH_CHECK=false skip, single-node smoke on smc300x-ccs-aus-gpuf268) Each case is mapped to the Story-level test plan (GPUOP-695) by test number. Framework: bash + tests/lib/{kubectl_mock,assert, extract_script}.sh, layered on top of the GPUOP-790 mock extensions. Recommended test file: example/gpu-validation-cluster/tests/test_phase45_dns.sh. Mock extensions required (additive on top of GPUOP-790): - spec.nodeName jsonpath route in kubectl get pods mock - in-pod getent shim driven by kubectl_mock_set_dns_fwd / kubectl_mock_set_dns_rev helpers - kubectl exec mock must run `bash -c