Skip to content

Add fixed_capacity_map to cudax#7705

Open
srinivasyadav18 wants to merge 100 commits into
NVIDIA:mainfrom
srinivasyadav18:cuco_static_map
Open

Add fixed_capacity_map to cudax#7705
srinivasyadav18 wants to merge 100 commits into
NVIDIA:mainfrom
srinivasyadav18:cuco_static_map

Conversation

@srinivasyadav18

@srinivasyadav18 srinivasyadav18 commented Feb 18, 2026

Copy link
Copy Markdown
Contributor

Description

Part of #7463

This PR migrates cuCollections static_map's insert and contains operations into cudax as cuda::experimental::cuco::fixed_capacity_map.

Minimal scope: implements insert, contains, clear, and trivial accessors, with capacity validation provided by make_valid_capacity and is_valid_capacity. Tests mirror the cuCollections layout and use a parameterized matrix covering key type, probing scheme, CG size, and bucket size.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot Bot commented Feb 18, 2026

Copy link
Copy Markdown
Contributor

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Feb 18, 2026
@PointKernel PointKernel self-requested a review February 18, 2026 20:10
Comment thread cudax/include/cuda/experimental/__cuco/__detail/bitwise_compare.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__detail/equal_wrapper.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/detail/prime.hpp
Comment thread cudax/include/cuda/experimental/__cuco/__detail/utils.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__detail/utils.hpp Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__static_map/kernels.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__detail/extent.cuh Outdated
@copy-pr-bot

copy-pr-bot Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@PointKernel PointKernel marked this pull request as ready for review June 3, 2026 18:11
@PointKernel PointKernel requested a review from a team as a code owner June 3, 2026 18:11
@PointKernel PointKernel requested a review from andralex June 3, 2026 18:11
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL Jun 3, 2026
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a complete open-addressing hash table infrastructure for CUDA Experimental, comprising device reference operations, grid kernels, host orchestration, and a public static_map container with static/dynamic capacity modes and optional key erasure, plus comprehensive test coverage.

Changes

Open-addressing and static_map port

Layer / File(s) Summary
Type traits and bitwise comparison
cudax/include/cuda/experimental/__cuco/traits.hpp, cudax/include/cuda/experimental/__cuco/__detail/bitwise_compare.cuh
is_bitwise_comparable, is_tuple_like traits and aligned __bitwise_compare template support bitwise-safe type detection and fast equality paths (4/8-byte specializations via reinterpretation, general memcmp fallback).
Prime utilities and capacity rounding
cudax/include/cuda/experimental/__cuco/__detail/prime.hpp, cudax/include/cuda/experimental/__cuco/capacity.cuh
Deterministic 64-bit primality testing via trial division + Miller–Rabin, modular arithmetic with __int128 fast path, and make_valid_capacity rounding for linear/double-hashing with overflow guards.
Probing schemes and iterator base
cudax/include/cuda/experimental/__cuco/__detail/probing_scheme_base.cuh, cudax/include/cuda/experimental/__cuco/probing_scheme.cuh
__probing_scheme_base<CgSize> and __probing_iterator for bucket traversal; public linear_probing and double_hashing templates with cooperative-group tile-rank stride distribution.
Sentinel types and kernel utilities
cudax/include/cuda/experimental/__cuco/types.cuh, cudax/include/cuda/experimental/__cuco/__detail/types.cuh, cudax/include/cuda/experimental/__cuco/__detail/utils.cuh, cudax/include/cuda/experimental/__cuco/__detail/utils.hpp
Strong-type sentinel wrappers (empty_key, empty_value, erased_key), mdspan extent aliases, and grid-launch helpers (global thread ID, grid stride, occupancy sizing, tile-size traits).
Equality wrapper for probing
cudax/include/cuda/experimental/__cuco/__detail/equal_wrapper.cuh
Combines __bitwise_compare sentinel checks with key equality, returning three-way results and branching on insert vs. query mode for duplicate control.
Slot storage and device reference core
cudax/include/cuda/experimental/__cuco/__open_addressing/slot_storage_ref.cuh, cudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_ref_impl.cuh
__slot_storage_ref non-owning bucket view and __open_addressing_ref_impl device-side operations (probing, CAS-based insert with packed_cas/back_to_back_cas/cas_dependent_write dispatch, contains, cooperative-group variants).
Grid kernels for bulk operations
cudax/include/cuda/experimental/__cuco/__open_addressing/kernels.cuh
Grid-stride conditional __insert_if_n, __fill, and __contains_if_n kernels with _CgSize==1 direct vs. _CgSize!=1 tiled cooperative execution paths.
Host orchestration and memory
cudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_impl.cuh
Device-allocated slot buffer, async/sync clear/insert/contains with stream refs, device counter for success counting, bucket-count computation from capacity or load factor.
Public static_map container
cudax/include/cuda/experimental/__cuco/static_map.cuh, cudax/include/cuda/experimental/__cuco/static_map_ref.cuh
SFINAE-selected constructors for static/dynamic capacity and erasure modes; clear, insert, contains forwarding; device-side static_map_ref with trivially-copyable ref semantics.
Capacity, insert, and sentinel tests
cudax/test/cuco/static_map/test_capacity.cu, cudax/test/cuco/static_map/test_insert_and_contains.cu, cudax/test/cuco/static_map/test_key_sentinel.cu, cudax/test/cuco/static_map/test_shared_memory.cu, cudax/test/cuco/utility/test_capacity.cu, cudax/test/CMakeLists.txt
Validates dynamic/static capacity computation, insert/contains workflows, shared-memory sizing via capacity_v, sentinel handling, and load-factor rounding; updates strong_type.cuh documentation.

Assessment against linked issues

Objective Addressed Explanation
Port OpenAddressing [#7463]
Port static_map [#7463]

Suggested labels

cudax

Suggested reviewers

  • andralex
  • pciolkosz
  • gevtushenko

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

🧹 Nitpick comments (3)
cudax/include/cuda/experimental/__cuco/__detail/extent.cuh (1)

131-148: ⚡ Quick win

suggestion: Mark these header variable templates inline. They are namespace-scope constexpr definitions in a header, and the local CCCL rule requires the explicit inline spelling for this pattern.

As per coding guidelines, "All constexpr variables at namespace/global scope must use inline, including template variables."

cudax/include/cuda/experimental/__cuco/probing_scheme.cuh (1)

24-31: ⚡ Quick win

suggestion: Wrap this header with the standard CCCL prologue/epilogue pair. The file enters code directly after its includes and never closes with #include <cuda/std/__cccl/epilogue.h>, unlike the other new headers in this cohort.

As per coding guidelines, "The last included header before code must be #include <cuda/std/__cccl/prologue.h>, and #include <cuda/std/__cccl/epilogue.h> must be at the end of a file."

Also applies to: 264-264

cudax/include/cuda/experimental/__cuco/__static_map/kernels.cuh (1)

119-235: suggestion: Please attach benchmark results for this fast path before merge. This shared-memory kernel adds a new execution path and tuning heuristic, so we need the perf numbers that justify it on the supported toolchains and architectures. As per coding guidelines, "Do not commit SASS code changes without running benchmarks to check for performance regressions."


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d2eb011c-f333-4929-a09b-f09102640ec3

📥 Commits

Reviewing files that changed from the base of the PR and between 75c7b14 and 134736e.

📒 Files selected for processing (22)
  • cudax/include/cuda/experimental/__cuco/__detail/bitwise_compare.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/equal_wrapper.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/extent.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/prime.hpp
  • cudax/include/cuda/experimental/__cuco/__detail/probing_scheme_base.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/types.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/utils.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/utils.hpp
  • cudax/include/cuda/experimental/__cuco/__open_addressing/functors.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/kernels.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_impl.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_ref_impl.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/slot_storage_ref.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/types.cuh
  • cudax/include/cuda/experimental/__cuco/__static_map/kernels.cuh
  • cudax/include/cuda/experimental/__cuco/__utility/strong_type.cuh
  • cudax/include/cuda/experimental/__cuco/probing_scheme.cuh
  • cudax/include/cuda/experimental/__cuco/static_map.cuh
  • cudax/include/cuda/experimental/__cuco/static_map_ref.cuh
  • cudax/include/cuda/experimental/__cuco/traits.hpp
  • cudax/test/CMakeLists.txt
  • cudax/test/cuco/static_map/test_static_map.cu

Comment thread cudax/include/cuda/experimental/__cuco/detail/prime.hpp
Comment thread cudax/include/cuda/experimental/__cuco/__detail/utils.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/detail/utility/cuda.cuh
Comment thread cudax/include/cuda/experimental/__cuco/detail/utility/cuda.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__open_addressing/functors.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map_ref.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map_ref.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map_ref.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/traits.hpp Outdated

@PointKernel PointKernel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my end

@PointKernel

Copy link
Copy Markdown
Member

/ok to test 40b602a

@github-actions

This comment has been minimized.

@PointKernel

Copy link
Copy Markdown
Member

/ok to test 058c039

@github-actions

This comment has been minimized.

@srinivasyadav18

Copy link
Copy Markdown
Contributor Author

/ok to test e986cab


public:
//! @brief Constructs an open addressing implementation with the given capacity.
_CCCL_HOST __open_addressing_impl(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, use _CCCL_API/_CCCL_HOST_API/_CCCL_DEVICE_API instead of _CCCL_HOST_DEVICE/_CCCL_HOST/_CCCL_DEVICE

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Updated to the _API family. This file only used _CCCL_HOST, so it is now _CCCL_HOST_API. For host+device cases I am keeping _CCCL_HOST_DEVICE_API rather than _CCCL_API: cudax was intentionally moved off _CCCL_API to _CCCL_HOST_DEVICE_API in #8955 because _CCCL_API carries _CCCL_TILE, which is not usable in tile mode.

Comment on lines +223 to +225
[[maybe_unused]] const auto __status = CUB_NS_QUALIFIER::DeviceTransform::Fill(
__slots.data(), static_cast<detail::__index_type>(__n), __empty_slot_sentinel, __stream);
_CCCL_ASSERT(__status == cudaSuccess, "cuco: failed to clear slot storage");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these _CCCL_THROW(::cuda::cuda_error, ...) instead?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done via _CCCL_TRY_CUDA_API (cccl's CUCO_CUDA_TRY, throws cuda_error). Dropped noexcept from the three async methods so the throw is valid.

#ifndef _CUDAX___CUCO_DETAIL_OPEN_ADDRESSING_KERNELS_CUH
#define _CUDAX___CUCO_DETAIL_OPEN_ADDRESSING_KERNELS_CUH

#include <cuda/__cccl_config>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use this config instead, please

Suggested change
#include <cuda/__cccl_config>
#include <cuda/std/detail/__config>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<cuda/__cccl_config> is the convention the __cuco headers follow, and usage is roughly even across cudax (~138 vs ~139). Is there a best-practice guideline on which one to use?

Comment thread cudax/include/cuda/experimental/__cuco/detail/utility/cuda.cuh Outdated
Comment on lines +117 to +120
[[nodiscard]] _CCCL_HOST static ::cuda::device_memory_pool_ref __default_memory_resource()
{
return ::cuda::device_default_memory_pool(::cuda::device_ref{::cuda::__driver::__ctxGetDevice()});
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this is not ideal

const _KeyEqual& __pred = {},
const _ProbingScheme& __probing_scheme = {},
_MemoryResource __mr = __default_memory_resource(),
::cuda::stream_ref __stream = cudaStream_t{nullptr})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cccl runtime, we pretend like there is no default stream and usually put it as the first parameter

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default MR and stream are intentional, aiming for STL canonical ergonomics: a user can write auto m = cuco::static_map{empty_key{-1}, 100}; and get a working device map without thinking about streams or memory resources, much like std::unordered_map<int>{100} hides the allocator, or thrust::sort(thrust::device, ...) handles its own temp allocations and stream internally. That said, I'm not sure how valuable that STL style convenience really is for the GPU case, so I'm open to dropping the defaults and taking the stream as an explicit first parameter if that's the cccl runtime preference. WDYT?

empty_value<_Tp> __empty_value_sentinel,
const _KeyEqual& __pred = {},
const _ProbingScheme& __probing_scheme = {},
_MemoryResource __mr = __default_memory_resource(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory resource should be explicit

//! @brief Erases all elements from the container. After this call, `size()` returns zero.
//!
//! @param __stream CUDA stream this operation is executed in
void clear(::cuda::stream_ref __stream = cudaStream_t{nullptr})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void clear(::cuda::stream_ref __stream = cudaStream_t{nullptr})
void clear(::cuda::stream_ref __stream)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we used the same approach as @pciolkosz used in cuda::buffer, this class should store the stream it was created with to perform these operations

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid storing the stream as a member. The map has to support being used across multiple streams (e.g. built on one stream, then inserts/queries issued on others) since we have downstream use cases relying on that, so binding it to the creation stream would break them. The stream should stay a per-operation argument.

Comment thread cudax/include/cuda/experimental/__cuco/probing_scheme.cuh Outdated
@github-actions

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 3h 37m: Pass: 100%/55 | Total: 8h 34m | Max: 56m 50s | Hits: 75%/46418

See results here.

//! stencil returns true.
template <int _CgSize, int _BlockSize, class _InputIt, class _StencilIt, class _Predicate, class _Ref>
_CCCL_KERNEL_ATTRIBUTES void
__insert_if_n(_InputIt __first, detail::__index_type __n, _StencilIt __stencil, _Predicate __pred, _Ref __ref)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these kernels are missing launch bounds

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed

@PointKernel

Copy link
Copy Markdown
Member

/ok to test 2d119f5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuco cuCollections

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

5 participants