[cuda.compute][cuda.coop]: Replace all usages of device arrays outside examples with new wrapper that does not depend on cupy or numba-cuda #9653

Open
NaderAlAwar wants to merge 3 commits into NVIDIA:main from NaderAlAwar:replace-cupy-with-cuda-core

Conversation

@NaderAlAwar

Contributor

Description

closes #9646

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@NaderAlAwar NaderAlAwar requested review from a team as code owners June 30, 2026 21:11
@NaderAlAwar NaderAlAwar requested a review from gonidelis June 30, 2026 21:11
@NaderAlAwar NaderAlAwar requested a review from jrhemstad June 30, 2026 21:11
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 30, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 30, 2026
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e68f2e5c-da96-4635-8aa6-a50ee505aa65

📥 Commits

Reviewing files that changed from the base of the PR and between ba194bf and 054f446.

📒 Files selected for processing (8)
  • python/cuda_cccl/tests/_utils/device_array.py
  • python/cuda_cccl/tests/compute/conftest.py
  • python/cuda_cccl/tests/compute/test_radix_sort.py
  • python/cuda_cccl/tests/compute/test_raw_op.py
  • python/cuda_cccl/tests/compute/test_scan.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce.py
  • python/cuda_cccl/tests/compute/test_unique_by_key.py
  • python/cuda_cccl/tests/compute/test_zip_iterator.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • python/cuda_cccl/tests/compute/conftest.py
  • python/cuda_cccl/tests/compute/test_unique_by_key.py
  • python/cuda_cccl/tests/_utils/device_array.py
  • python/cuda_cccl/tests/compute/test_raw_op.py
  • python/cuda_cccl/tests/compute/test_scan.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Reorganized Python test extras: standard CUDA test extras no longer install CuPy; CuPy is now included only in test-with-examples-* extras.
  • Bug Fixes
    • Updated CI (Linux/Windows) to install the correct example-based extras before running CuPy-based Python example tests.
  • Documentation
    • Clarified in setup docs which extras to use to run CuPy-based cuda.compute examples.
  • Tests
    • Updated example test runner to skip compute.examples.* when CuPy isn’t installed.
    • Migrated many CUDA compute/cooperative tests to a unified device-buffer helper for consistent GPU data handling.
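The example-test gating described in the Tests bullets could be sketched roughly as follows. This is a minimal illustration, not the code in test_examples.py: `module_available`, `should_skip_example`, and the two example module names are hypothetical.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def should_skip_example(module_name: str, have_cupy: bool) -> bool:
    """Skip only the CuPy-based compute examples when CuPy is missing."""
    return module_name.startswith("compute.examples.") and not have_cupy


if __name__ == "__main__":
    have_cupy = module_available("cupy")
    # Hypothetical module names, used only to show both outcomes.
    for name in ("compute.examples.reduce", "coop.examples.block_scan"):
        action = "skip" if should_skip_example(name, have_cupy) else "run"
        print(f"{name}: {action}")
```

Checking availability with `find_spec` avoids importing CuPy just to decide whether to skip; in a pytest runner the same decision is often expressed with `pytest.importorskip("cupy")`.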

Walkthrough

This PR removes CuPy as a test dependency by moving CUDA buffer handling to a new DeviceArray helper built on cuda.core. Compute and coop tests are updated, and packaging, CI, docs, and example-test gating are adjusted for separate test-with-examples extras.

Changes

DeviceArray migration and packaging

| Layer / File(s) | Summary |
|---|---|
| DeviceArray utility implementation: python/cuda_cccl/tests/_utils/*.py | Adds the shared DeviceArray helper, shape/stride utilities, stream validation, host/device copy methods, and CUDA array interface support. |
| Compute tests migrated to DeviceArray: python/cuda_cccl/tests/compute/*.py | Replaces CuPy/numba.cuda allocations, transfers, and stream handling with DeviceArray, NumPy-side expectations, and cuda.core stream/device helpers. |
| coop/_experimental tests migrated to DeviceArray: python/cuda_cccl/tests/coop/_experimental/*.py | Replaces numba.cuda allocations and synchronizations with DeviceArray-based allocations, transfers, and host checks. |
| Packaging, CI, docs, and example skipping: python/cuda_cccl/pyproject.toml, ci/*.sh, ci/windows/*.ps1, docs/python/setup.rst, python/cuda_cccl/tests/test_examples.py | Splits test extras from CuPy-bearing test-with-examples extras, updates install scripts and docs, adds pytest path config, and skips compute example tests when CuPy is unavailable. |
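To make the "shape/stride utilities" and "CUDA array interface support" items more concrete, here is a minimal sketch of what such utilities might compute. It is not the PR's DeviceArray code: the function names are hypothetical, and the pointer passed in is a plain integer rather than a real device allocation. The dictionary layout follows version 3 of the CUDA Array Interface.

```python
def c_contiguous_strides(shape, itemsize):
    """Byte strides of a row-major (C-contiguous) array with this shape."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def cuda_array_interface(ptr, shape, typestr, itemsize, strides=None):
    """Build a version-3 __cuda_array_interface__ dictionary.

    `strides=None` tells consumers the buffer is C-contiguous, so
    contiguous strides are normalized to None here.
    """
    if strides is not None and tuple(strides) == c_contiguous_strides(shape, itemsize):
        strides = None
    return {
        "shape": tuple(shape),
        "typestr": typestr,
        "data": (ptr, False),  # (device pointer as int, read-only flag)
        "strides": strides,
        "version": 3,
    }


if __name__ == "__main__":
    # 0x1000 is a placeholder integer, not a real device address.
    print(cuda_array_interface(0x1000, (2, 3), "<f4", 4, strides=(12, 4)))
```

An object that returns such a dictionary from its `__cuda_array_interface__` attribute can be consumed by libraries that understand the protocol without either side depending on CuPy or numba-cuda.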

Assessment against linked issues

Objective Addressed Explanation
Replace the CuPy test dependency by relying on cuda.core [#9646]
Keep CuPy only for example test coverage via separate extras [#9646]

Out-of-scope changes

None

Suggested reviewers

  • wmaxey
  • kkraus14
  • bernhardmgruber

Comment @coderabbitai help to get the list of available commands.

@NaderAlAwar

Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Contributor
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
python/cuda_cccl/tests/compute/conftest.py (1)

73-77: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

suggestion: cuda_stream fixture never releases the stream.

device.create_stream() returns a cuda.core.Stream that owns the underlying CUDA stream and must be destroyed via close(). The fixture returns it without a yield/cleanup step, so the stream handle leaks on every test invocation. Add a teardown using yield followed by stream.close().

♻️ Suggested fix
 @pytest.fixture(scope="function")
 def cuda_stream() -> Stream:
     device = Device()
     device.set_current()
-    return device.create_stream()
+    stream = device.create_stream()
+    yield stream
+    stream.close()
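The yield-based teardown pattern in the suggested fix can be illustrated without a GPU. The sketch below uses a hypothetical `FakeStream` in place of cuda.core's Stream and drives the generator by hand, roughly the way pytest drives a yield-style fixture; in the real conftest.py the generator would carry the `@pytest.fixture` decorator instead.

```python
class FakeStream:
    """Hypothetical stand-in for a stream object that owns a native handle."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def stream_fixture(create_stream=FakeStream):
    """Generator with the same shape as a yield-style pytest fixture."""
    stream = create_stream()
    yield stream    # the test body runs while the generator is suspended here
    stream.close()  # teardown: runs when the generator is resumed


def run_with_fixture(test_body):
    """Setup, run the test body, then resume the generator for teardown."""
    gen = stream_fixture()
    stream = next(gen)
    try:
        test_body(stream)
    finally:
        next(gen, None)  # executes the code after `yield`
    return stream


if __name__ == "__main__":
    used = run_with_fixture(lambda s: None)
    print("closed after test:", used.closed)
```

With a plain `return`, nothing runs after the test finishes, so cleanup depends on garbage collection; with `yield`, the code after it becomes the fixture's teardown.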

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 34e471e4-3fc6-48d3-adde-05f213b52bf6

📥 Commits

Reviewing files that changed from the base of the PR and between 3d5f235 and ba194bf.

📒 Files selected for processing (47)
  • ci/test_cuda_cccl_examples_python.sh
  • ci/test_cuda_compute_minimal_python.sh
  • ci/windows/test_cuda_cccl_examples_python.ps1
  • docs/python/setup.rst
  • python/cuda_cccl/pyproject.toml
  • python/cuda_cccl/tests/_utils/__init__.py
  • python/cuda_cccl/tests/_utils/device_array.py
  • python/cuda_cccl/tests/compute/conftest.py
  • python/cuda_cccl/tests/compute/test_binary_search.py
  • python/cuda_cccl/tests/compute/test_deferred_annotations.py
  • python/cuda_cccl/tests/compute/test_histogram.py
  • python/cuda_cccl/tests/compute/test_iterators.py
  • python/cuda_cccl/tests/compute/test_merge_sort.py
  • python/cuda_cccl/tests/compute/test_nested_struct.py
  • python/cuda_cccl/tests/compute/test_no_numba.py
  • python/cuda_cccl/tests/compute/test_permutation_iterator.py
  • python/cuda_cccl/tests/compute/test_radix_sort.py
  • python/cuda_cccl/tests/compute/test_raw_op.py
  • python/cuda_cccl/tests/compute/test_reduce.py
  • python/cuda_cccl/tests/compute/test_scan.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce.py
  • python/cuda_cccl/tests/compute/test_segmented_sort.py
  • python/cuda_cccl/tests/compute/test_select.py
  • python/cuda_cccl/tests/compute/test_shuffle_iterator.py
  • python/cuda_cccl/tests/compute/test_three_way_partition.py
  • python/cuda_cccl/tests/compute/test_transform.py
  • python/cuda_cccl/tests/compute/test_unique_by_key.py
  • python/cuda_cccl/tests/compute/test_zip_iterator.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_exchange.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_load.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_load_store_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_merge_sort.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_merge_sort_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_radix_sort.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_radix_sort_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_reduce.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_reduce_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_scan.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_scan_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_store.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_merge_sort.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_merge_sort_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_reduce.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_reduce_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_scan.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_scan_api.py
  • python/cuda_cccl/tests/test_examples.py

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuda_cccl/tests/compute/test_unique_by_key.py (1)

385-406: 🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

critical: redundant h_in_keys reassignment drops the duplicate-key bound, defeating the test's intent.

Line 393 re-generates h_in_keys = random_array(num_items, np.int32) without max_value=20, overwriting the bounded array created at line 387 (max_value=20) that was specifically meant to produce duplicate keys for unique_by_key to dedupe. With this overwrite, keys are effectively unbounded random int32 values and will almost never collide, so the test no longer exercises actual deduplication — it still passes, but silently stops testing the intended behavior. This looks like a leftover from the refactor (d_in_keys = numba.cuda.to_device(h_in_keys) likely became this duplicate generation by mistake).

🐛 Proposed fix
     h_in_keys = random_array(num_items, np.int32, max_value=20)
     h_in_items = random_array(num_items, np.float32)
     h_out_keys = np.empty(num_items, dtype=np.int32)
     h_out_items = np.empty(num_items, dtype=np.float32)
     h_out_num_selected = np.empty(1, np.int32)

-    h_in_keys = random_array(num_items, np.int32)
     d_in_keys = DeviceArray.from_numpy(h_in_keys, stream=cuda_stream)
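Why the dropped bound matters can be shown with a small host-side sketch. It assumes unique_by_key follows the usual CUB semantics of keeping the first element of each run of equal consecutive keys; `unique_by_key` and `random_keys` below are hypothetical pure-Python stand-ins, not the PR's helpers.

```python
import random


def unique_by_key(keys, items):
    """Host reference: keep the first element of each run of equal keys."""
    out_keys, out_items = [], []
    for key, item in zip(keys, items):
        if not out_keys or out_keys[-1] != key:
            out_keys.append(key)
            out_items.append(item)
    return out_keys, out_items


def random_keys(n, max_value, seed=0):
    """Hypothetical stand-in for random_array: n ints in [0, max_value)."""
    rng = random.Random(seed)
    return [rng.randrange(max_value) for _ in range(n)]


num_items = 1000
items = list(range(num_items))

bounded = random_keys(num_items, 20)        # small range: repeats are common
unbounded = random_keys(num_items, 2**31)   # wide range: repeats are very rare

bounded_out, _ = unique_by_key(bounded, items)
unbounded_out, _ = unique_by_key(unbounded, items)

print("bounded keys kept:  ", len(bounded_out), "of", num_items)
print("unbounded keys kept:", len(unbounded_out), "of", num_items)
```

With the bounded keys, some adjacent pairs are equal and get collapsed. With keys drawn from the full range, the output is almost always the same length as the input, so the deduplication path is effectively untested.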
🧹 Nitpick comments (2)
python/cuda_cccl/tests/compute/conftest.py (1)

73-77: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

suggestion: cuda_stream fixture never closes the stream it creates. cuda.core.Stream.close() destroys the stream if owned; relying solely on GC across hundreds of test invocations using this fixture is fragile. Consider a yield-based fixture that closes the stream on teardown.

♻️ Possible fix
 @pytest.fixture(scope="function")
-def cuda_stream() -> Stream:
+def cuda_stream():
     device = Device()
     device.set_current()
-    return device.create_stream()
+    stream = device.create_stream()
+    yield stream
+    stream.close()

As per path instructions for python/cuda_cccl/**/*: "Focus on ... memory ownership ...".

Source: Path instructions

python/cuda_cccl/tests/compute/test_radix_sort.py (1)

49-53: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

suggestion: get_compute_capability_major() duplicates the same Device(); set_current(); compute_capability snippet inlined in test_zip_iterator.py (and similar patterns elsewhere). Consider moving this helper into tests/_utils so all test files share one implementation instead of re-deriving CC checks per file.
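A shared helper along the lines suggested here might look like the sketch below. It is an assumption-laden illustration: the import path for `Device` may differ between cuda.core releases, `device_factory` is a hypothetical parameter added so the helper can be exercised without a GPU, and `FakeDevice` exists only for that purpose.

```python
def get_compute_capability_major(device_factory=None):
    """Return the major compute capability of the current device."""
    if device_factory is None:
        # Assumption: Device is importable from cuda.core; some releases
        # expose it under cuda.core.experimental instead.
        from cuda.core import Device

        device_factory = Device
    device = device_factory()
    device.set_current()
    major, _minor = device.compute_capability
    return major


class FakeDevice:
    """Hypothetical stand-in so the helper can run without a GPU."""

    compute_capability = (9, 0)

    def set_current(self):
        pass


if __name__ == "__main__":
    print(get_compute_capability_major(FakeDevice))
```

Placing one such function in tests/_utils would let test_radix_sort.py, test_zip_iterator.py, and similar files import it instead of repeating the Device(); set_current(); compute_capability sequence.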


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: edf5ef8f-50ba-41eb-a5c3-f2a88afbb30d

📥 Commits

Reviewing files that changed from the base of the PR and between 3d5f235 and ba194bf.

📒 Files selected for processing (47)
  • ci/test_cuda_cccl_examples_python.sh
  • ci/test_cuda_compute_minimal_python.sh
  • ci/windows/test_cuda_cccl_examples_python.ps1
  • docs/python/setup.rst
  • python/cuda_cccl/pyproject.toml
  • python/cuda_cccl/tests/_utils/__init__.py
  • python/cuda_cccl/tests/_utils/device_array.py
  • python/cuda_cccl/tests/compute/conftest.py
  • python/cuda_cccl/tests/compute/test_binary_search.py
  • python/cuda_cccl/tests/compute/test_deferred_annotations.py
  • python/cuda_cccl/tests/compute/test_histogram.py
  • python/cuda_cccl/tests/compute/test_iterators.py
  • python/cuda_cccl/tests/compute/test_merge_sort.py
  • python/cuda_cccl/tests/compute/test_nested_struct.py
  • python/cuda_cccl/tests/compute/test_no_numba.py
  • python/cuda_cccl/tests/compute/test_permutation_iterator.py
  • python/cuda_cccl/tests/compute/test_radix_sort.py
  • python/cuda_cccl/tests/compute/test_raw_op.py
  • python/cuda_cccl/tests/compute/test_reduce.py
  • python/cuda_cccl/tests/compute/test_scan.py
  • python/cuda_cccl/tests/compute/test_segmented_reduce.py
  • python/cuda_cccl/tests/compute/test_segmented_sort.py
  • python/cuda_cccl/tests/compute/test_select.py
  • python/cuda_cccl/tests/compute/test_shuffle_iterator.py
  • python/cuda_cccl/tests/compute/test_three_way_partition.py
  • python/cuda_cccl/tests/compute/test_transform.py
  • python/cuda_cccl/tests/compute/test_unique_by_key.py
  • python/cuda_cccl/tests/compute/test_zip_iterator.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_exchange.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_load.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_load_store_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_merge_sort.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_merge_sort_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_radix_sort.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_radix_sort_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_reduce.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_reduce_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_scan.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_scan_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_block_store.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_merge_sort.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_merge_sort_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_reduce.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_reduce_api.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_scan.py
  • python/cuda_cccl/tests/coop/_experimental/test_warp_scan_api.py
  • python/cuda_cccl/tests/test_examples.py

Comment thread python/cuda_cccl/tests/compute/test_segmented_reduce.py
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

🥳 CI Workflow Results

🟩 Finished in 3h 11m: Pass: 100%/506 | Total: 6d 16h | Max: 1h 56m | Hits: 100%/653836

See results here.

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Replace the CuPy test dependency by relying on cuda.core

1 participant