AIPCC:15489: Add support for Gaudi accelerator in mapt's IBM Cloud module#834
AIPCC:15489: Add support for Gaudi accelerator in mapt's IBM Cloud module#834deekay2310 wants to merge 3 commits into
Conversation
📝 WalkthroughWalkthroughAdds an IBM Cloud "ibm-gaudi" CLI command with create/destroy subcommands, a Pulumi-based provisioning action supporting auto-provisioned or existing-subnet networking with SSH key generation and optional OpenTelemetry log shipping via cloud-init, plus accompanying documentation. ChangesIBM Gaudi3 provisioning feature
Estimated code review effort: 4 (Complex) | ~60 minutes Sequence Diagram(s)sequenceDiagram
participant User
participant CLI as ibm-gaudi CLI
participant Action as ibmgaudi.New
participant Stack as Pulumi Stack
participant IBMCloud as IBM Cloud APIs
User->>CLI: create --subnet-id/--otel-* flags
CLI->>CLI: bind flags to Viper, build GaudiArgs
CLI->>Action: New(ContextArgs, GaudiArgs)
Action->>Action: resolve zone or subnet, build deploy function
Action->>Stack: create/update stack
Stack->>IBMCloud: provision resource group, network, SSH key, VM instance
IBMCloud-->>Stack: instance host, floating IP
Stack->>IBMCloud: attach cloud-init user data (optional OTEL config)
Stack->>IBMCloud: run SSH readiness check
Stack-->>Action: stack outputs (host, username, private key)
Action->>Action: manageResults writes outputs
Action-->>CLI: success/error
CLI-->>User: provisioning result
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (2)
pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config (1)
94-98: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winUnquoted user-controlled key/value in extra attrs.
{{$k}}and{{$v}}fromOtelExtraAttrs(user-supplied via--otel-extra-attrs) are interpolated into YAML unquoted/partially quoted. A key containing YAML-significant characters (:,#, newline) could corrupt the generated config or inject unintended fields.♻️ Proposed fix
{{- range $k, $v := .OtelExtraAttrs}} - - key: {{$k}} + - key: "{{$k}}" value: "{{$v}}" action: upsert {{- end}}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config` around lines 94 - 98, The OtelExtraAttrs entries in the cloud-config template are being interpolated into YAML without safely quoting the user-controlled key and value, which can break the generated config or allow injected fields. Update the template loop in the cloud-config rendering block so the OtelExtraAttrs key/value are emitted using YAML-safe quoting/escaping, preserving the existing upsert structure while preventing special characters from being interpreted as YAML syntax.pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go (1)
173-233: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy liftLarge duplication between
deployanddeployWithExistingSubnet.Instance-args assembly, OTEL user-data injection (188-194 / 285-291), floating-IP association (202-215 / 299-312), username/host exports, and the readiness
remote.NewCommandblock (217-231 / 314-328) are nearly identical. Extracting a shared helper (e.g. taking VPC/zone/subnet/floating-IP/security-group inputs) would reduce drift risk as this evolves.Also applies to: 273-330
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go` around lines 173 - 233, There is significant duplication between the deploy paths, especially around instanceArgs setup, OTEL user-data injection in gaudiUserData handling, floating-IP association with ibmcloud.NewIsInstanceNetworkInterfaceFloatingIp, ctx.Export of username/host, and the remote.NewCommand readiness check. Extract the shared provisioning logic into a helper used by both deploy and deployWithExistingSubnet, parameterized by the varying inputs like VPC, zone, subnet, security group, and floating IP, so future changes only need to be made in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/ibmcloud/ibm-gaudi.md`:
- Around line 60-64: Update the IBM Cloud Gaudi SSH documentation to use the
current default SSH user `cloud-user` instead of `root`. In the table entry for
`username` and in the SSH access example, replace the stale `root` references
with `cloud-user` so the docs match the current RHEL AI image behavior. Use the
existing SSH example section and the `host`/`username` table to locate and
update both places consistently.
In `@pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config`:
- Around line 99-103: The otlphttp exporter in the IBM Gaudi cloud-config
disables TLS verification with insecure_skip_verify, which should be removed.
Update the exporter configuration to use proper certificate validation for the
OTEL endpoint in this cloud-config block, and if custom trust is needed, wire in
the appropriate CA/cert settings instead of bypassing verification.
- Around line 128-144: The otelcol-contrib cloud-init install flow in the runcmd
block has no failure stopping behavior, so a bad download or RPM install can be
ignored and the script keeps going. Add explicit error handling to this shell
fragment (for example, stop on failures and check the curl and rpm steps before
proceeding) so the install only continues after each command succeeds. Use the
runcmd shell block and the curl/rpm install sequence as the key points to
update.
In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go`:
- Around line 188-194: The OTEL user-data gating in gaudi user data generation
is broader than the cloud-config template, so the collector install can be
skipped while config files are still written. Update the condition around
gaudiUserData/UserData assignment in the IBM Gaudi flow to match the template’s
requirements, including otelIndex, or adjust the template to match the intended
behavior. Check the gaudiUserData helper and the instance setup block to keep
the gating consistent in both places.
- Around line 382-388: Stop logging the generated private key in debug output;
the current ApplyT callback on pk.PrivateKeyPem in the ibm-gaudi flow prints
sensitive material. Update the debug branch to log a non-sensitive identifier
instead, such as the public key or a fingerprint, and apply the same fix in the
analogous ApplyT/debug logging blocks in ibm-power and ibm-z so no private key
values are emitted anywhere.
---
Nitpick comments:
In `@pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config`:
- Around line 94-98: The OtelExtraAttrs entries in the cloud-config template are
being interpolated into YAML without safely quoting the user-controlled key and
value, which can break the generated config or allow injected fields. Update the
template loop in the cloud-config rendering block so the OtelExtraAttrs
key/value are emitted using YAML-safe quoting/escaping, preserving the existing
upsert structure while preventing special characters from being interpreted as
YAML syntax.
In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go`:
- Around line 173-233: There is significant duplication between the deploy
paths, especially around instanceArgs setup, OTEL user-data injection in
gaudiUserData handling, floating-IP association with
ibmcloud.NewIsInstanceNetworkInterfaceFloatingIp, ctx.Export of username/host,
and the remote.NewCommand readiness check. Extract the shared provisioning logic
into a helper used by both deploy and deployWithExistingSubnet, parameterized by
the varying inputs like VPC, zone, subnet, security group, and floating IP, so
future changes only need to be made in one place.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: d85efaa9-9a57-456b-b2de-bdc9fbb4fc4f
📒 Files selected for processing (5)
cmd/mapt/cmd/ibmcloud/hosts/ibm-gaudi.gocmd/mapt/cmd/ibmcloud/ibmcloud.godocs/ibmcloud/ibm-gaudi.mdpkg/provider/ibmcloud/action/ibm-gaudi/cloud-configpkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go
| | File | Description | | ||
| |---|---| | ||
| | `host` | Floating IP of the instance (direct SSH) | | ||
| | `username` | SSH username (`root`) | | ||
| | `id_rsa` | Private key for the instance | |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
SSH username doc likely stale — should be cloud-user, not root.
Per the PR's commit history, the default SSH user for the RHEL AI image was changed to cloud-user. This doc still documents root as the username (Line 63) and uses root@... in the SSH access example (Lines 73-75), which will mislead users and cause failed SSH connections.
📝 Proposed fix
-| `username` | SSH username (`root`) |
+| `username` | SSH username (`cloud-user`) |-ssh -i ${OUTPUT}/id_rsa \
- -o StrictHostKeyChecking=no \
- root@$(cat ${OUTPUT}/host)
+ssh -i ${OUTPUT}/id_rsa \
+ -o StrictHostKeyChecking=no \
+ cloud-user@$(cat ${OUTPUT}/host)Also applies to: 68-76
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/ibmcloud/ibm-gaudi.md` around lines 60 - 64, Update the IBM Cloud Gaudi
SSH documentation to use the current default SSH user `cloud-user` instead of
`root`. In the table entry for `username` and in the SSH access example, replace
the stale `root` references with `cloud-user` so the docs match the current RHEL
AI image behavior. Use the existing SSH example section and the
`host`/`username` table to locate and update both places consistently.
| runcmd: | ||
| - | | ||
| PROXY_URL="" | ||
| if ! curl -sf --connect-timeout 5 --head {{.OtelEndpoint}} > /dev/null 2>&1; then | ||
| PROXY_URL="http://squid.corp.redhat.com:3128" | ||
| fi | ||
| RPM_URL="https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v{{.OtelColVersion}}/otelcol-contrib_{{.OtelColVersion}}_linux_{{.OtelArch}}.rpm" | ||
| HTTPS_PROXY="$PROXY_URL" curl -fsSL -o /tmp/otelcol-contrib.rpm "$RPM_URL" | ||
| rpm -Uvh /tmp/otelcol-contrib.rpm | ||
| rm -f /tmp/otelcol-contrib.rpm | ||
| chown -R otelcol-contrib:otelcol-contrib /etc/otelcol-contrib | ||
| if [ -n "$PROXY_URL" ]; then | ||
| printf '[Service]\nEnvironment="HTTPS_PROXY=%s/"\nEnvironment="NO_PROXY=10.*,192.168.*,localhost,127.0.0.1"\n' "$PROXY_URL" \ | ||
| > /etc/systemd/system/otelcol-contrib.service.d/proxy.conf | ||
| fi | ||
| systemctl daemon-reload | ||
| systemctl enable --now otelcol-contrib |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
No error handling in runcmd install script.
Lack of set -e/explicit checks after curl and rpm -Uvh means a failed download silently proceeds to install a stale/missing RPM, potentially leaving the instance without a functioning otelcol-contrib service and no clear failure signal in cloud-init logs.
🛡️ Proposed fix
- |
+ set -e
PROXY_URL=""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| runcmd: | |
| - | | |
| PROXY_URL="" | |
| if ! curl -sf --connect-timeout 5 --head {{.OtelEndpoint}} > /dev/null 2>&1; then | |
| PROXY_URL="http://squid.corp.redhat.com:3128" | |
| fi | |
| RPM_URL="https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v{{.OtelColVersion}}/otelcol-contrib_{{.OtelColVersion}}_linux_{{.OtelArch}}.rpm" | |
| HTTPS_PROXY="$PROXY_URL" curl -fsSL -o /tmp/otelcol-contrib.rpm "$RPM_URL" | |
| rpm -Uvh /tmp/otelcol-contrib.rpm | |
| rm -f /tmp/otelcol-contrib.rpm | |
| chown -R otelcol-contrib:otelcol-contrib /etc/otelcol-contrib | |
| if [ -n "$PROXY_URL" ]; then | |
| printf '[Service]\nEnvironment="HTTPS_PROXY=%s/"\nEnvironment="NO_PROXY=10.*,192.168.*,localhost,127.0.0.1"\n' "$PROXY_URL" \ | |
| > /etc/systemd/system/otelcol-contrib.service.d/proxy.conf | |
| fi | |
| systemctl daemon-reload | |
| systemctl enable --now otelcol-contrib | |
| runcmd: | |
| - | | |
| set -e | |
| PROXY_URL="" | |
| if ! curl -sf --connect-timeout 5 --head {{.OtelEndpoint}} > /dev/null 2>&1; then | |
| PROXY_URL="http://squid.corp.redhat.com:3128" | |
| fi | |
| RPM_URL="https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v{{.OtelColVersion}}/otelcol-contrib_{{.OtelColVersion}}_linux_{{.OtelArch}}.rpm" | |
| HTTPS_PROXY="$PROXY_URL" curl -fsSL -o /tmp/otelcol-contrib.rpm "$RPM_URL" | |
| rpm -Uvh /tmp/otelcol-contrib.rpm | |
| rm -f /tmp/otelcol-contrib.rpm | |
| chown -R otelcol-contrib:otelcol-contrib /etc/otelcol-contrib | |
| if [ -n "$PROXY_URL" ]; then | |
| printf '[Service]\nEnvironment="HTTPS_PROXY=%s/"\nEnvironment="NO_PROXY=10.*,192.168.*,localhost,127.0.0.1"\n' "$PROXY_URL" \ | |
| > /etc/systemd/system/otelcol-contrib.service.d/proxy.conf | |
| fi | |
| systemctl daemon-reload | |
| systemctl enable --now otelcol-contrib |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/provider/ibmcloud/action/ibm-gaudi/cloud-config` around lines 128 - 144,
The otelcol-contrib cloud-init install flow in the runcmd block has no failure
stopping behavior, so a bad download or RPM install can be ignored and the
script keeps going. Add explicit error handling to this shell fragment (for
example, stop on failures and check the curl and rpm steps before proceeding) so
the install only continues after each command succeeds. Use the runcmd shell
block and the curl/rpm install sequence as the key points to update.
| if r.otelAppCode != "" && r.otelAuthToken != "" { | ||
| ud, err := gaudiUserData(r.otelAppCode, r.otelAuthToken, r.otelEndpoint, r.otelIndex, r.otelExtraAttrs) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to render user data: %w", err) | ||
| } | ||
| instanceArgs.UserData = pulumi.StringPtr(ud) | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Confirm the OTEL user-data gating matches the cloud-config template.
User data is rendered whenever otelAppCode != "" && otelAuthToken != "", but the embedded cloud-config guards the collector installation runcmd with {{- if and .AppCode .OtelAuthToken .OtelIndex}}. With otelIndex empty, the config files and auth token are written but the collector is never installed/enabled — a silent no-op. Verify this is intended.
Also applies to: 285-291
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go` around lines 188 - 194,
The OTEL user-data gating in gaudi user data generation is broader than the
cloud-config template, so the collector install can be skipped while config
files are still written. Update the condition around gaudiUserData/UserData
assignment in the IBM Gaudi flow to match the template’s requirements, including
otelIndex, or adjust the template to match the intended behavior. Check the
gaudiUserData helper and the instance setup block to keep the gating consistent
in both places.
| if mCtx.Debug() { | ||
| pk.PrivateKeyPem.ApplyT( | ||
| func(privateKey string) error { | ||
| logging.Debugf("%s", privateKey) | ||
| return nil | ||
| }) | ||
| } |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check whether other ibmcloud actions also debug-log private keys
rg -nP -C3 'PrivateKeyPem\.ApplyT' pkg/provider/ibmcloudRepository: redhat-developer/mapt
Length of output: 1762
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== matching debug logs =="
rg -n -C2 'logging\.Debugf\("%s", privateKey\)' pkg/provider/ibmcloud
echo
echo "== nearby context in gaudi =="
sed -n '370,395p' pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go
echo
echo "== nearby context in ibm-power =="
sed -n '330,350p' pkg/provider/ibmcloud/action/ibm-power/ibm-power.go
echo
echo "== nearby context in ibm-z =="
sed -n '384,398p' pkg/provider/ibmcloud/action/ibm-z/ibm-z.go
echo
echo "== search for fingerprint/public key logging patterns =="
rg -n 'fingerprint|PublicKey|private key|SSH key|authorized_keys' pkg/provider/ibmcloudRepository: redhat-developer/mapt
Length of output: 3804
Stop logging the generated private key. This emits pk.PrivateKeyPem into debug logs; log the public key or a fingerprint instead. The same pattern also appears in pkg/provider/ibmcloud/action/ibm-power/ibm-power.go and pkg/provider/ibmcloud/action/ibm-z/ibm-z.go.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/provider/ibmcloud/action/ibm-gaudi/ibm-gaudi.go` around lines 382 - 388,
Stop logging the generated private key in debug output; the current ApplyT
callback on pk.PrivateKeyPem in the ibm-gaudi flow prints sensitive material.
Update the debug branch to log a non-sensitive identifier instead, such as the
public key or a fingerprint, and apply the same fix in the analogous
ApplyT/debug logging blocks in ibm-power and ibm-z so no private key values are
emitted anywhere.
| //go:embed cloud-config | ||
| var CloudConfig []byte | ||
|
|
||
| var otelColVersion = "0.151.0" |
There was a problem hiding this comment.
why is this here? Did you rebase the branch, now whole otel is managed through its own intetration package
| outputUserPrivateKey = "icgUserPrivatekey" | ||
|
|
||
| defaultProfile = "gx3d-160x1792x8gaudi3" | ||
| defaultImage = "ibm-redhat-ai-intel" |
Summary
Adds
ibmcloud ibm-gaudicreate/destroy commands for provisioning Intel Gaudi 3 accelerator instances on IBM Cloud VPC.ibm-gaudiwith auto-provisioning via RHEL AI image ongx3d-160x1792x8profile (us-south-2)cloud-useras default SSH user for the RHEL AI imageResolves: AIPCC-15489