[26.04_linux-nvidia-bos] NVIDIA: VR: SAUCE: cxl: guard unlinked memdev endpoints #482

Open
nirmoy wants to merge 1 commit into NVIDIA:26.04_linux-nvidia-bos from nirmoy:codex/nvbug6274048-cxl-guards-bos

nirmoy commented Jul 3, 2026

Collaborator

Summary

  • Treat both NULL and error-valued cxlmd->endpoint pointers as unlinked before entering CXL HDM and region helper paths.
  • Port the functional hdm.c and region.c changes from aff4ccee4530 onto BOS/Resolute.
  • Omit the source drivers/cxl/pci.c hunk because this branch already guards cxl_reset_done() against both NULL and ERR_PTR endpoints.

Root cause

cxlmd->endpoint starts as ERR_PTR(-ENXIO) until endpoint-port registration links the memdev to a real cxl_port. The affected helpers checked only for NULL, allowing early CXL consumers to pass the error pointer into functions such as device_find_child().

The BOS region-management backport exposes these helpers before endpoint linkage, producing the observed NVIDIA probe failure and boot delay.
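
A minimal userspace sketch of why a NULL-only check misses the unlinked state. It is not the patch itself: `ERR_PTR()` and `IS_ERR_OR_NULL()` are re-implemented here as stand-ins for the kernel helpers in include/linux/err.h, the two structs are stripped down to the one field that matters, and `memdev_init()`, `memdev_link()`, `endpoint_linked_null_only()`, and `endpoint_linked()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's include/linux/err.h helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return !ptr || (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Simplified models; the real structs carry many more fields. */
struct cxl_port {
	int id;
};

struct cxl_memdev {
	struct cxl_port *endpoint;
};

/* Hypothetical: models the state before endpoint-port registration. */
static inline void memdev_init(struct cxl_memdev *cxlmd)
{
	cxlmd->endpoint = ERR_PTR(-ENXIO);
}

/* Hypothetical: models registration linking the memdev to a real port. */
static inline void memdev_link(struct cxl_memdev *cxlmd, struct cxl_port *port)
{
	assert(port != NULL);
	cxlmd->endpoint = port;
}

/* NULL-only guard: an ERR_PTR value is non-NULL, so it slips through. */
static inline bool endpoint_linked_null_only(const struct cxl_memdev *cxlmd)
{
	return cxlmd->endpoint != NULL;
}

/* Stricter guard: NULL and error pointers both count as unlinked. */
static inline bool endpoint_linked(const struct cxl_memdev *cxlmd)
{
	return !IS_ERR_OR_NULL(cxlmd->endpoint);
}
```

In this model, a freshly initialized memdev passes `endpoint_linked_null_only()` even though no port exists yet, while `endpoint_linked()` rejects it until `memdev_link()` has run.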

Validation

  • Verified the guard source check fails on the unmodified BOS branch and passes after the port.
  • git diff --check: pass.
  • Strict checkpatch.pl: 0 errors, 0 warnings.
  • Zero-context patch ID for the two changed files matches the corresponding subset of aff4ccee4530.
  • Confirmed CONFIG_CXL_BUS, CONFIG_CXL_MEM, CONFIG_CXL_PCI, and CONFIG_CXL_REGION are enabled in the generated amd64 BOS configuration.
  • Forced amd64 rebuild of drivers/cxl/core/hdm.o, region.o, and drivers/cxl/core/built-in.a: pass.

Tracking

cxlmd->endpoint starts as ERR_PTR(-ENXIO) until endpoint port registration
links the memdev to a real cxl_port.

Treat NULL and error pointers as "endpoint not linked" before dereferencing
cxlmd->endpoint in CXL helper paths.

The BOS region-management backport exposes these helpers before endpoint
linkage.

This backports commit aff4cce ("NVIDIA: VR: SAUCE: cxl: Guard unlinked
memdev endpoints"). Its PCI hunk is omitted because BOS already guards
cxl_reset_done().

Fixes: 29317f8 ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>

github-actions Bot commented Jul 3, 2026

Contributor

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ✅ All checks passed

Details
Checking 1 commits...

Cherry-pick digest:
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 2c512ff1698a │ [SAUCE] cxl: guard unlinked memdev endpoints                     │ N/A        │ N/A     │ nirmoyd                   │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

nirmoy commented Jul 3, 2026

Collaborator Author

BaseOS Kernel Review

Summary

No issues found across the reviewed commits.

Findings: no problems found

Latest watcher review: open review

Generated test plan: open test plan

Kernel deb build: successful (download debs, 4 files)

Head: 2c512ff1698a

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.

nirmoy marked this pull request as ready for review July 3, 2026 15:56
nirmoy added the "help wanted" label Jul 3, 2026

nirmoy commented Jul 3, 2026

Collaborator Author

Strata/Vera boot smoke (2026-07-03)

  • PR 482 ARM64 debs at 2c512ff1698a booted successfully on strata-vera-eb5-107 as 7.0.6-pr482-g2c512ff1698a using a one-shot GRUB entry; the persistent default remains the known 6.16.12 kernel.
  • BMC serial and the post-boot journal showed no panic, Oops, device_find_child, or cxl_get_committed_decoder signature. Boot completed in 2m55.785s (firmware 2m2.365s, loader 8.764s, kernel 22.932s, userspace 21.721s).
  • Four NVIDIA GPUs remained visible on PCI. The installed nvidia/600.09 DKMS source does not compile against 7.0.6 (__assign_str tracepoint API change), so no NVIDIA module was available and nvidia-smi could not run. No CXL memdevs were enumerated; therefore this is a boot smoke test, not a functional reproduction of the endpoint/region race.
  • mlx5_core reported 80 firmware-internal-error lines on both the prior and test boots, so those messages are pre-existing rather than introduced by this change.

arighi commented Jul 3, 2026

Collaborator

This looks like a fix that can go upstream. Is there any plan to post it to LKML?

Other than that, LGTM.

Acked-by: Andrea Righi <arighi@nvidia.com>
