Add fixed_capacity_map to cudax#7705
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
0edb761 to
65368db
Compare
65368db to
b0d0702
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a complete open-addressing hash table infrastructure for CUDA Experimental, comprising device reference operations, grid kernels, host orchestration, and a public ChangesOpen-addressing and static_map port
Assessment against linked issues
Suggested labels
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 18
🧹 Nitpick comments (3)
cudax/include/cuda/experimental/__cuco/__detail/extent.cuh (1)
131-148: ⚡ Quick winsuggestion: Mark these header variable templates
inline. They are namespace-scopeconstexprdefinitions in a header, and the local CCCL rule requires the explicitinlinespelling for this pattern.As per coding guidelines, "All
constexprvariables at namespace/global scope must useinline, includingtemplatevariables."cudax/include/cuda/experimental/__cuco/probing_scheme.cuh (1)
24-31: ⚡ Quick winsuggestion: Wrap this header with the standard CCCL prologue/epilogue pair. The file enters code directly after its includes and never closes with
#include <cuda/std/__cccl/epilogue.h>, unlike the other new headers in this cohort.As per coding guidelines, "The last included header before code must be
#include <cuda/std/__cccl/prologue.h>, and#include <cuda/std/__cccl/epilogue.h>must be at the end of a file."Also applies to: 264-264
cudax/include/cuda/experimental/__cuco/__static_map/kernels.cuh (1)
119-235: suggestion: Please attach benchmark results for this fast path before merge. This shared-memory kernel adds a new execution path and tuning heuristic, so we need the perf numbers that justify it on the supported toolchains and architectures. As per coding guidelines, "Do not commit SASS code changes without running benchmarks to check for performance regressions."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d2eb011c-f333-4929-a09b-f09102640ec3
📒 Files selected for processing (22)
cudax/include/cuda/experimental/__cuco/__detail/bitwise_compare.cuhcudax/include/cuda/experimental/__cuco/__detail/equal_wrapper.cuhcudax/include/cuda/experimental/__cuco/__detail/extent.cuhcudax/include/cuda/experimental/__cuco/__detail/prime.hppcudax/include/cuda/experimental/__cuco/__detail/probing_scheme_base.cuhcudax/include/cuda/experimental/__cuco/__detail/types.cuhcudax/include/cuda/experimental/__cuco/__detail/utils.cuhcudax/include/cuda/experimental/__cuco/__detail/utils.hppcudax/include/cuda/experimental/__cuco/__open_addressing/functors.cuhcudax/include/cuda/experimental/__cuco/__open_addressing/kernels.cuhcudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_impl.cuhcudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_ref_impl.cuhcudax/include/cuda/experimental/__cuco/__open_addressing/slot_storage_ref.cuhcudax/include/cuda/experimental/__cuco/__open_addressing/types.cuhcudax/include/cuda/experimental/__cuco/__static_map/kernels.cuhcudax/include/cuda/experimental/__cuco/__utility/strong_type.cuhcudax/include/cuda/experimental/__cuco/probing_scheme.cuhcudax/include/cuda/experimental/__cuco/static_map.cuhcudax/include/cuda/experimental/__cuco/static_map_ref.cuhcudax/include/cuda/experimental/__cuco/traits.hppcudax/test/CMakeLists.txtcudax/test/cuco/static_map/test_static_map.cu
|
/ok to test 40b602a |
This comment has been minimized.
This comment has been minimized.
|
/ok to test 058c039 |
This comment has been minimized.
This comment has been minimized.
|
/ok to test e986cab |
|
|
||
| public: | ||
| //! @brief Constructs an open addressing implementation with the given capacity. | ||
| _CCCL_HOST __open_addressing_impl( |
There was a problem hiding this comment.
Please, use _CCCL_API/_CCCL_HOST_API/_CCCL_DEVICE_API instead of _CCCL_HOST_DEVICE/_CCCL_HOST/_CCCL_DEVICE
There was a problem hiding this comment.
Good point. Updated to the _API family. This file only used _CCCL_HOST, so it is now _CCCL_HOST_API. For host+device cases I am keeping _CCCL_HOST_DEVICE_API rather than _CCCL_API: cudax was intentionally moved off _CCCL_API to _CCCL_HOST_DEVICE_API in #8955 because _CCCL_API carries _CCCL_TILE, which is not usable in tile mode.
| [[maybe_unused]] const auto __status = CUB_NS_QUALIFIER::DeviceTransform::Fill( | ||
| __slots.data(), static_cast<detail::__index_type>(__n), __empty_slot_sentinel, __stream); | ||
| _CCCL_ASSERT(__status == cudaSuccess, "cuco: failed to clear slot storage"); |
There was a problem hiding this comment.
Shouldn't these _CCCL_THROW(::cuda::cuda_error, ...) instead?
There was a problem hiding this comment.
Done via _CCCL_TRY_CUDA_API (cccl's CUCO_CUDA_TRY, throws cuda_error). Dropped noexcept from the three async methods so the throw is valid.
| #ifndef _CUDAX___CUCO_DETAIL_OPEN_ADDRESSING_KERNELS_CUH | ||
| #define _CUDAX___CUCO_DETAIL_OPEN_ADDRESSING_KERNELS_CUH | ||
|
|
||
| #include <cuda/__cccl_config> |
There was a problem hiding this comment.
Use this config instead, please
| #include <cuda/__cccl_config> | |
| #include <cuda/std/detail/__config> |
There was a problem hiding this comment.
<cuda/__cccl_config> is the convention the __cuco headers follow, and usage is roughly even across cudax (~138 vs ~139). Is there a best-practice guideline on which one to use?
| [[nodiscard]] _CCCL_HOST static ::cuda::device_memory_pool_ref __default_memory_resource() | ||
| { | ||
| return ::cuda::device_default_memory_pool(::cuda::device_ref{::cuda::__driver::__ctxGetDevice()}); | ||
| } |
There was a problem hiding this comment.
Again, this is not ideal
| const _KeyEqual& __pred = {}, | ||
| const _ProbingScheme& __probing_scheme = {}, | ||
| _MemoryResource __mr = __default_memory_resource(), | ||
| ::cuda::stream_ref __stream = cudaStream_t{nullptr}) |
There was a problem hiding this comment.
In cccl runtime, we pretend like there is no default stream and usually put it as the first parameter
There was a problem hiding this comment.
The default MR and stream are intentional, aiming for STL canonical ergonomics: a user can write auto m = cuco::static_map{empty_key{-1}, 100}; and get a working device map without thinking about streams or memory resources, much like std::unordered_map<int>{100} hides the allocator, or thrust::sort(thrust::device, ...) handles its own temp allocations and stream internally. That said, I'm not sure how valuable that STL style convenience really is for the GPU case, so I'm open to dropping the defaults and taking the stream as an explicit first parameter if that's the cccl runtime preference. WDYT?
| empty_value<_Tp> __empty_value_sentinel, | ||
| const _KeyEqual& __pred = {}, | ||
| const _ProbingScheme& __probing_scheme = {}, | ||
| _MemoryResource __mr = __default_memory_resource(), |
There was a problem hiding this comment.
The memory resource should be explicit
| //! @brief Erases all elements from the container. After this call, `size()` returns zero. | ||
| //! | ||
| //! @param __stream CUDA stream this operation is executed in | ||
| void clear(::cuda::stream_ref __stream = cudaStream_t{nullptr}) |
There was a problem hiding this comment.
| void clear(::cuda::stream_ref __stream = cudaStream_t{nullptr}) | |
| void clear(::cuda::stream_ref __stream) |
There was a problem hiding this comment.
If we used the same approach as @pciolkosz used in cuda::buffer, this class should store the stream it was created with to perform these operations
There was a problem hiding this comment.
I'd avoid storing the stream as a member. The map has to support being used across multiple streams (e.g. built on one stream, then inserts/queries issued on others) since we have downstream use cases relying on that, so binding it to the creation stream would break them. The stream should stay a per-operation argument.
🥳 CI Workflow Results🟩 Finished in 3h 37m: Pass: 100%/55 | Total: 8h 34m | Max: 56m 50s | Hits: 75%/46418See results here. |
| //! stencil returns true. | ||
| template <int _CgSize, int _BlockSize, class _InputIt, class _StencilIt, class _Predicate, class _Ref> | ||
| _CCCL_KERNEL_ATTRIBUTES void | ||
| __insert_if_n(_InputIt __first, detail::__index_type __n, _StencilIt __stencil, _Predicate __pred, _Ref __ref) |
There was a problem hiding this comment.
All these kernels are missing launch bounds
|
/ok to test 2d119f5 |
Description
Part of #7463
This PR migrates cuCollections
static_map'sinsertandcontainsoperations into cudax ascuda::experimental::cuco::fixed_capacity_map.Minimal scope: implements
insert,contains,clear, and trivial accessors, with capacity validation provided bymake_valid_capacityandis_valid_capacity. Tests mirror the cuCollections layout and use a parameterized matrix covering key type, probing scheme, CG size, and bucket size.Checklist