Skip to content

AIPCC:15489: Add support for Gaudi accelerator in mapt's IBM Cloud module#834

Open
deekay2310 wants to merge 3 commits into
redhat-developer:mainfrom
deekay2310:ibmcloud-gaudi3
Open

AIPCC:15489: Add support for Gaudi accelerator in mapt's IBM Cloud module#834
deekay2310 wants to merge 3 commits into
redhat-developer:mainfrom
deekay2310:ibmcloud-gaudi3

Conversation

@deekay2310

@deekay2310 deekay2310 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds ibmcloud ibm-gaudi create/destroy commands for provisioning Intel Gaudi 3 accelerator instances on IBM Cloud VPC.

  • New target ibm-gaudi with auto-provisioning via RHEL AI image on gx3d-160x1792x8 profile (us-south-2)
  • SSH readiness check with retry loop before returning connection details
  • Uses cloud-user as default SSH user for the RHEL AI image
  • Full VPC networking stack (VPC, subnet, security group, floating IP, public gateway)
  • Supports IBM COS S3-compatible backend for Pulumi state

Resolves: AIPCC-15489

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds an IBM Cloud "ibm-gaudi" CLI command with create/destroy subcommands, a Pulumi-based provisioning action supporting auto-provisioned or existing-subnet networking with SSH key generation and optional OpenTelemetry log shipping via cloud-init, plus accompanying documentation.

Changes

IBM Gaudi3 provisioning feature

Layer / File(s) Summary
GaudiArgs contract and scaffolding
pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go
Defines exported GaudiArgs and internal structs, embeds the cloud-config template, and sets OTEL collector version constant.
New/Destroy entrypoints
pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go
New resolves stack prefix, zone or subnet, and runs the deploy stack; Destroy tears down the stack and cleans state.
Deployment paths
pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go
Implements deploy and deployWithExistingSubnet for networking, SSH key generation, VM instance provisioning, floating IP association, optional OTEL user data, output export, and SSH readiness checks.
OTEL-enabled cloud-init template
pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config
Adds conditional OTEL Collector config, auth token file, systemd override, and RPM install logic via runcmd.
CLI command wiring
cmd/mapt/cmd/ibmcloud/hosts/ibm-gaudi.go, cmd/mapt/cmd/ibmcloud/ibmcloud.go
Adds ibm-gaudi command with create/destroy subcommands binding flags to Viper and invoking New/Destroy, registered in the ibmcloud command tree.
Documentation
docs/ibmcloud/ibm-gaudi.md
Documents environment variables, networking modes, create/destroy usage, OTEL configuration, and COS backend usage.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as ibm-gaudi CLI
    participant Action as ibmgaudi.New
    participant Stack as Pulumi Stack
    participant IBMCloud as IBM Cloud APIs

    User->>CLI: create --subnet-id/--otel-* flags
    CLI->>CLI: bind flags to Viper, build GaudiArgs
    CLI->>Action: New(ContextArgs, GaudiArgs)
    Action->>Action: resolve zone or subnet, build deploy function
    Action->>Stack: create/update stack
    Stack->>IBMCloud: provision resource group, network, SSH key, VM instance
    IBMCloud-->>Stack: instance host, floating IP
    Stack->>IBMCloud: attach cloud-init user data (optional OTEL config)
    Stack->>IBMCloud: run SSH readiness check
    Stack-->>Action: stack outputs (host, username, private key)
    Action->>Action: manageResults writes outputs
    Action-->>CLI: success/error
    CLI-->>User: provisioning result
Loading
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly summarizes the main change: adding Gaudi accelerator support to the IBM Cloud module.
Description check ✅ Passed The description is directly related and accurately describes the new ibm-gaudi create/destroy provisioning flow.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@deekay2310 deekay2310 changed the title WIP: AIPCC:15489: Add support for Gaudi accelerator in mapt's IBM Cloud module AIPCC:15489: Add support for Gaudi accelerator in mapt's IBM Cloud module Jul 2, 2026
@deekay2310

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (2)
pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config (1)

94-98: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Unquoted user-controlled key/value in extra attrs.

{{$k}} and {{$v}} from OtelExtraAttrs (user-supplied via --otel-extra-attrs) are interpolated into YAML unquoted/partially quoted. A key containing YAML-significant characters (:, #, newline) could corrupt the generated config or inject unintended fields.

♻️ Proposed fix
 {{- range $k, $v := .OtelExtraAttrs}}
-            - key: {{$k}}
+            - key: "{{$k}}"
               value: "{{$v}}"
               action: upsert
 {{- end}}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config` around lines 94 - 98,
The OtelExtraAttrs entries in the cloud-config template are being interpolated
into YAML without safely quoting the user-controlled key and value, which can
break the generated config or allow injected fields. Update the template loop in
the cloud-config rendering block so the OtelExtraAttrs key/value are emitted
using YAML-safe quoting/escaping, preserving the existing upsert structure while
preventing special characters from being interpreted as YAML syntax.
pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go (1)

173-233: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy lift

Large duplication between deploy and deployWithExistingSubnet.

Instance-args assembly, OTEL user-data injection (188-194 / 285-291), floating-IP association (202-215 / 299-312), username/host exports, and the readiness remote.NewCommand block (217-231 / 314-328) are nearly identical. Extracting a shared helper (e.g. taking VPC/zone/subnet/floating-IP/security-group inputs) would reduce drift risk as this evolves.

Also applies to: 273-330

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go` around lines 173 - 233,
There is significant duplication between the deploy paths, especially around
instanceArgs setup, OTEL user-data injection in gaudiUserData handling,
floating-IP association with ibmcloud.NewIsInstanceNetworkInterfaceFloatingIp,
ctx.Export of username/host, and the remote.NewCommand readiness check. Extract
the shared provisioning logic into a helper used by both deploy and
deployWithExistingSubnet, parameterized by the varying inputs like VPC, zone,
subnet, security group, and floating IP, so future changes only need to be made
in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/ibmcloud/ibm-gaudi.md`:
- Around line 60-64: Update the IBM Cloud Gaudi SSH documentation to use the
current default SSH user `cloud-user` instead of `root`. In the table entry for
`username` and in the SSH access example, replace the stale `root` references
with `cloud-user` so the docs match the current RHEL AI image behavior. Use the
existing SSH example section and the `host`/`username` table to locate and
update both places consistently.

In `@pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config`:
- Around line 99-103: The otlphttp exporter in the IBM Gaudi cloud-config
disables TLS verification with insecure_skip_verify, which should be removed.
Update the exporter configuration to use proper certificate validation for the
OTEL endpoint in this cloud-config block, and if custom trust is needed, wire in
the appropriate CA/cert settings instead of bypassing verification.
- Around line 128-144: The otelcol-contrib cloud-init install flow in the runcmd
block has no failure stopping behavior, so a bad download or RPM install can be
ignored and the script keeps going. Add explicit error handling to this shell
fragment (for example, stop on failures and check the curl and rpm steps before
proceeding) so the install only continues after each command succeeds. Use the
runcmd shell block and the curl/rpm install sequence as the key points to
update.

In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go`:
- Around line 188-194: The OTEL user-data gating in gaudi user data generation
is broader than the cloud-config template, so the collector install can be
skipped while config files are still written. Update the condition around
gaudiUserData/UserData assignment in the IBM Gaudi flow to match the template’s
requirements, including otelIndex, or adjust the template to match the intended
behavior. Check the gaudiUserData helper and the instance setup block to keep
the gating consistent in both places.
- Around line 382-388: Stop logging the generated private key in debug output;
the current ApplyT callback on pk.PrivateKeyPem in the ibm-gaudi flow prints
sensitive material. Update the debug branch to log a non-sensitive identifier
instead, such as the public key or a fingerprint, and apply the same fix in the
analogous ApplyT/debug logging blocks in ibm-power and ibm-z so no private key
values are emitted anywhere.

---

Nitpick comments:
In `@pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config`:
- Around line 94-98: The OtelExtraAttrs entries in the cloud-config template are
being interpolated into YAML without safely quoting the user-controlled key and
value, which can break the generated config or allow injected fields. Update the
template loop in the cloud-config rendering block so the OtelExtraAttrs
key/value are emitted using YAML-safe quoting/escaping, preserving the existing
upsert structure while preventing special characters from being interpreted as
YAML syntax.

In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go`:
- Around line 173-233: There is significant duplication between the deploy
paths, especially around instanceArgs setup, OTEL user-data injection in
gaudiUserData handling, floating-IP association with
ibmcloud.NewIsInstanceNetworkInterfaceFloatingIp, ctx.Export of username/host,
and the remote.NewCommand readiness check. Extract the shared provisioning logic
into a helper used by both deploy and deployWithExistingSubnet, parameterized by
the varying inputs like VPC, zone, subnet, security group, and floating IP, so
future changes only need to be made in one place.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: d85efaa9-9a57-456b-b2de-bdc9fbb4fc4f

📥 Commits

Reviewing files that changed from the base of the PR and between a2f7a23 and c4577bb.

📒 Files selected for processing (5)
  • cmd/mapt/cmd/ibmcloud/hosts/ibm-gaudi.go
  • cmd/mapt/cmd/ibmcloud/ibmcloud.go
  • docs/ibmcloud/ibm-gaudi.md
  • pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config
  • pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go

Comment on lines +60 to +64
| File | Description |
|---|---|
| `host` | Floating IP of the instance (direct SSH) |
| `username` | SSH username (`root`) |
| `id_rsa` | Private key for the instance |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

SSH username doc likely stale — should be cloud-user, not root.

Per the PR's commit history, the default SSH user for the RHEL AI image was changed to cloud-user. This doc still documents root as the username (Line 63) and uses root@... in the SSH access example (Lines 73-75), which will mislead users and cause failed SSH connections.

📝 Proposed fix
-| `username` | SSH username (`root`) |
+| `username` | SSH username (`cloud-user`) |
-ssh -i ${OUTPUT}/id_rsa \
-    -o StrictHostKeyChecking=no \
-    root@$(cat ${OUTPUT}/host)
+ssh -i ${OUTPUT}/id_rsa \
+    -o StrictHostKeyChecking=no \
+    cloud-user@$(cat ${OUTPUT}/host)

Also applies to: 68-76

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/ibmcloud/ibm-gaudi.md` around lines 60 - 64, Update the IBM Cloud Gaudi
SSH documentation to use the current default SSH user `cloud-user` instead of
`root`. In the table entry for `username` and in the SSH access example, replace
the stale `root` references with `cloud-user` so the docs match the current RHEL
AI image behavior. Use the existing SSH example section and the
`host`/`username` table to locate and update both places consistently.

Comment thread pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config
Comment on lines +128 to +144
runcmd:
- |
PROXY_URL=""
if ! curl -sf --connect-timeout 5 --head {{.OtelEndpoint}} > /dev/null 2>&1; then
PROXY_URL="http://squid.corp.redhat.com:3128"
fi
RPM_URL="https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v{{.OtelColVersion}}/otelcol-contrib_{{.OtelColVersion}}_linux_{{.OtelArch}}.rpm"
HTTPS_PROXY="$PROXY_URL" curl -fsSL -o /tmp/otelcol-contrib.rpm "$RPM_URL"
rpm -Uvh /tmp/otelcol-contrib.rpm
rm -f /tmp/otelcol-contrib.rpm
chown -R otelcol-contrib:otelcol-contrib /etc/otelcol-contrib
if [ -n "$PROXY_URL" ]; then
printf '[Service]\nEnvironment="HTTPS_PROXY=%s/"\nEnvironment="NO_PROXY=10.*,192.168.*,localhost,127.0.0.1"\n' "$PROXY_URL" \
> /etc/systemd/system/otelcol-contrib.service.d/proxy.conf
fi
systemctl daemon-reload
systemctl enable --now otelcol-contrib

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

No error handling in runcmd install script.

Lack of set -e/explicit checks after curl and rpm -Uvh means a failed download silently proceeds to install a stale/missing RPM, potentially leaving the instance without a functioning otelcol-contrib service and no clear failure signal in cloud-init logs.

🛡️ Proposed fix
   - |
+    set -e
     PROXY_URL=""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
runcmd:
- |
PROXY_URL=""
if ! curl -sf --connect-timeout 5 --head {{.OtelEndpoint}} > /dev/null 2>&1; then
PROXY_URL="http://squid.corp.redhat.com:3128"
fi
RPM_URL="https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v{{.OtelColVersion}}/otelcol-contrib_{{.OtelColVersion}}_linux_{{.OtelArch}}.rpm"
HTTPS_PROXY="$PROXY_URL" curl -fsSL -o /tmp/otelcol-contrib.rpm "$RPM_URL"
rpm -Uvh /tmp/otelcol-contrib.rpm
rm -f /tmp/otelcol-contrib.rpm
chown -R otelcol-contrib:otelcol-contrib /etc/otelcol-contrib
if [ -n "$PROXY_URL" ]; then
printf '[Service]\nEnvironment="HTTPS_PROXY=%s/"\nEnvironment="NO_PROXY=10.*,192.168.*,localhost,127.0.0.1"\n' "$PROXY_URL" \
> /etc/systemd/system/otelcol-contrib.service.d/proxy.conf
fi
systemctl daemon-reload
systemctl enable --now otelcol-contrib
runcmd:
- |
set -e
PROXY_URL=""
if ! curl -sf --connect-timeout 5 --head {{.OtelEndpoint}} > /dev/null 2>&1; then
PROXY_URL="http://squid.corp.redhat.com:3128"
fi
RPM_URL="https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v{{.OtelColVersion}}/otelcol-contrib_{{.OtelColVersion}}_linux_{{.OtelArch}}.rpm"
HTTPS_PROXY="$PROXY_URL" curl -fsSL -o /tmp/otelcol-contrib.rpm "$RPM_URL"
rpm -Uvh /tmp/otelcol-contrib.rpm
rm -f /tmp/otelcol-contrib.rpm
chown -R otelcol-contrib:otelcol-contrib /etc/otelcol-contrib
if [ -n "$PROXY_URL" ]; then
printf '[Service]\nEnvironment="HTTPS_PROXY=%s/"\nEnvironment="NO_PROXY=10.*,192.168.*,localhost,127.0.0.1"\n' "$PROXY_URL" \
> /etc/systemd/system/otelcol-contrib.service.d/proxy.conf
fi
systemctl daemon-reload
systemctl enable --now otelcol-contrib
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config` around lines 128 - 144,
The otelcol-contrib cloud-init install flow in the runcmd block has no failure
stopping behavior, so a bad download or RPM install can be ignored and the
script keeps going. Add explicit error handling to this shell fragment (for
example, stop on failures and check the curl and rpm steps before proceeding) so
the install only continues after each command succeeds. Use the runcmd shell
block and the curl/rpm install sequence as the key points to update.

Comment on lines +188 to +194
if r.otelAppCode != "" && r.otelAuthToken != "" {
ud, err := gaudiUserData(r.otelAppCode, r.otelAuthToken, r.otelEndpoint, r.otelIndex, r.otelExtraAttrs)
if err != nil {
return fmt.Errorf("failed to render user data: %w", err)
}
instanceArgs.UserData = pulumi.StringPtr(ud)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Confirm the OTEL user-data gating matches the cloud-config template.

User data is rendered whenever otelAppCode != "" && otelAuthToken != "", but the embedded cloud-config guards the collector installation runcmd with {{- if and .AppCode .OtelAuthToken .OtelIndex}}. With otelIndex empty, the config files and auth token are written but the collector is never installed/enabled — a silent no-op. Verify this is intended.

Also applies to: 285-291

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go` around lines 188 - 194,
The OTEL user-data gating in gaudi user data generation is broader than the
cloud-config template, so the collector install can be skipped while config
files are still written. Update the condition around gaudiUserData/UserData
assignment in the IBM Gaudi flow to match the template’s requirements, including
otelIndex, or adjust the template to match the intended behavior. Check the
gaudiUserData helper and the instance setup block to keep the gating consistent
in both places.

Comment on lines +382 to +388
if mCtx.Debug() {
pk.PrivateKeyPem.ApplyT(
func(privateKey string) error {
logging.Debugf("%s", privateKey)
return nil
})
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check whether other ibmcloud actions also debug-log private keys
rg -nP -C3 'PrivateKeyPem\.ApplyT' pkg/provider/ibmcloud

Repository: redhat-developer/mapt

Length of output: 1762


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== matching debug logs =="
rg -n -C2 'logging\.Debugf\("%s", privateKey\)' pkg/provider/ibmcloud

echo
echo "== nearby context in gaudi =="
sed -n '370,395p' pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go

echo
echo "== nearby context in ibm-power =="
sed -n '330,350p' pkg/provider/ibmcloud/action/ibm-power/ibm-power.go

echo
echo "== nearby context in ibm-z =="
sed -n '384,398p' pkg/provider/ibmcloud/action/ibm-z/ibm-z.go

echo
echo "== search for fingerprint/public key logging patterns =="
rg -n 'fingerprint|PublicKey|private key|SSH key|authorized_keys' pkg/provider/ibmcloud

Repository: redhat-developer/mapt

Length of output: 3804


Stop logging the generated private key. This emits pk.PrivateKeyPem into debug logs; log the public key or a fingerprint instead. The same pattern also appears in pkg/provider/ibmcloud/action/ibm-power/ibm-power.go and pkg/provider/ibmcloud/action/ibm-z/ibm-z.go.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go` around lines 382 - 388,
Stop logging the generated private key in debug output; the current ApplyT
callback on pk.PrivateKeyPem in the ibm-gaudi flow prints sensitive material.
Update the debug branch to log a non-sensitive identifier instead, such as the
public key or a fingerprint, and apply the same fix in the analogous
ApplyT/debug logging blocks in ibm-power and ibm-z so no private key values are
emitted anywhere.

//go:embed cloud-config
var CloudConfig []byte

var otelColVersion = "0.151.0"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this here? Did you rebase the branch, now whole otel is managed through its own intetration package

outputUserPrivateKey = "icgUserPrivatekey"

defaultProfile = "gx3d-160x1792x8gaudi3"
defaultImage = "ibm-redhat-ai-intel"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What image is this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants