Nvbench compare process bulk data by oleksandr-pavlyk · Pull Request #386 · NVIDIA/nvbench

oleksandr-pavlyk · 2026-06-04T17:57:05Z

closes #320
closes #393
closes #384

This PR builds on top of #392 (changes to C++ sources in nvbench/ and testing/ folders) and contains only Python changes.

Also add comment within percentile_rank to document precondition on input values checked with assert statement. Also, sharpened the comment around percentile_rank function

Prepare duplicate heavy input and check sort-based quartile computation result with selection-based one. std::nth_element only guarantees that the nth element is the value that would appear there in sorted order; it does not fully sort equal partitions. Bugs in the selection implementation, especially when selecting Q1 from the left half and Q3 from the right half after selecting the median, are more likely to show up when many samples equal the quartile values.

…ulated Cold measurement can discard throttled trials before incrementing the accepted sample count, then stop on timeout with zero recorded samples. In that case, only emit the sample-size summary and skip derived timing, bandwidth, clock, and bulk summaries that require accepted samples. This avoids divide-by-zero mean calculations and quartile/IQR computation over empty sample vectors. Keep timeout diagnostics reachable for zero-sample runs and add an explicit warning when no accepted cold samples were recorded. Factor timeout warning emission into a private helper so the zero-sample and normal paths share the same diagnostic logic. Suppress low-sample relative stdev noise Add a statistics helper that returns no relative standard-deviation noise until there are enough samples for a meaningful estimate. Use it for cold CPU/GPU and CPU-only summaries so the low-sample +inf stdev sentinel is not published as real relative noise or used for max-noise timeout warnings. Add statistics coverage for suppressing the low-sample sentinel and computing relative stdev noise once the sample threshold is reached. compute_standard_deviation_noise return nullopt if standard deviation is not finite Test verify that noise is nullopt when not enough samples are accumulated Added statistics::has_enough_samples_for_noise_estimate(...) Used it in standard_deviation, compute_standard_deviation_noise, compute_robust_noise. Added timeout diagnostics in cold and CPU-only paths. if max-noise is configured and the run timed out before enough samples exist to estimate noise, the log now says that explicitly, otherwise the existing “over noise threshold” warning remains unchanged. Added a statistics test assertion for the new sample-count predicate.

Add fixed expected-value assertions for quartile tests around the sort/selection switch point, including duplicate-heavy inputs. This keeps the tests from only proving that both implementations agree with each other.

Introduce new header file with inline implementation. Use it from measure_cold.cuh and measure_cpu_only.cxx

… constant Expose quartile threshold value, use it in testing to test around that value.

Keep legacy stdev/relative summary tags present even when too few samples are available to compute a meaningful standard-deviation noise estimate. Use the standard-deviation unavailable sentinel for those values so existing summary consumers continue to see the expected tags. Factor the sentinel into the statistics helpers and use it from both standard_deviation() and stdev_noise_or_sentinel(), keeping the schema compatibility behavior explicit and tested.

…navailable Add a focused test target, nvbench.test.measure_timeout_warnings, covering: - NaN stdev noise -> “unable to estimate noise” - negative stdev noise -> “unable to estimate noise” - +inf stdev noise -> “over noise threshold”

check_noise_warning() now takes std::optional<nvbench::float64_t>, matching the production helper, and the test now covers std::nullopt explicitly in addition to NaN, negative, and +inf.

Document that percentile helpers return quiet NaNs for NaN-containing inputs. Make quartile expected-value tests compute ranks from the documented round(p / 100 * (n - 1)) rule instead of reusing statistics::percentile_rank(), so rank regressions are caught independently. Extend timeout-warning coverage to exercise the too-few-samples max-noise path in addition to unavailable, invalid, and infinite stdev-noise inputs.

oleksandr-pavlyk · 2026-06-29T21:12:53Z

@coderabbitai full review

coderabbitai · 2026-06-29T21:24:22Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added a new nvbench-compare documentation page with result classes, matching rules, and TOML/config guidance.
- Improved Python CLI tools to load plotting/output dependencies lazily and exit with clear errors when tooling is missing.
Bug Fixes
- Enhanced timeout/no-accepted-samples warnings during benchmark summary generation.
- Improved noise/relative-noise computation and more consistent handling of NaN/invalid/insufficient-sample inputs in statistics (percentiles/quartiles/robust noise).
Tests
- Added/expanded unit tests for timeout warnings, statistics edge cases, and tooling-dependency import behavior.

Walkthrough

Adds a compare tool reference page, centralizes C++ timeout-warning logging and statistics helpers, and updates Python scripts and wheel packaging to use lazy optional dependencies under cuda.bench.scripts.

C++ statistics and timeout-warning refactor

Layer / File(s)	Summary
statistics helpers and NaN handling `nvbench/detail/statistics.cuh`	Adds shared noise-gating and sentinel helpers, changes percentile and quartile behavior for NaN-containing inputs, and centralizes quartile selection threshold use.
centralized timeout warning logging `nvbench/detail/measure_timeout_warnings.cuh`, `nvbench/detail/measure_cold.cuh`, `nvbench/detail/measure_cold.cu`, `nvbench/detail/measure_cpu_only.cxx`	Introduces shared timeout-warning helpers and wires them into cold and CPU-only measurement code.
statistics and timeout-warning tests `testing/CMakeLists.txt`, `testing/measure_timeout_warnings.cu`, `testing/statistics.cu`	Adds CUDA tests for timeout warnings and expands statistics coverage for NaNs, duplicate-heavy quartiles, and new noise helpers.

Python tools packaging and lazy dependency loading

Layer / File(s)	Summary
tooling dependency loader `python/scripts/nvbench_tooling_deps.py`	Defines the tooling dependency dataclass, missing-dependency error, and helper for lazy importing optional packages with install hints.
lazy script loading `python/scripts/nvbench_histogram.py`, `python/scripts/nvbench_plot_bwutil.py`, `python/scripts/nvbench_walltime.py`	Updates the histogram, plot_bwutil, and walltime scripts to use package-aware imports, lazy tooling loaders, and explicit exit codes on missing optional dependencies.
wheel packaging and tooling tests `python/pyproject.toml`, `python/test/test_nvbench_tooling_deps.py`	Moves the scripts wheel package under `cuda/bench/scripts`, retargets console entry points, and adds tests for packaged imports and tooling-dependency error handling.

nvbench-compare documentation

Layer / File(s)	Summary
compare reference documentation `docs/nvbench_compare.md`	Documents nvbench-compare behavior, command-line usage, matching, timing summaries, bulk-debug Python, comparison decisions, AMBG handling, and configuration keys.

Assessment against linked issues

Objective	Addressed	Explanation
Distribution-based comparison in `nvbench_compare.py` with `SAME`/`AMBG`/`INSUFFICIENT` handling [`#320`]	❓	No code changes to `nvbench_compare.py` are present; only documentation for the tool was added.
Use a more descriptive scripts package name in the wheel and update script entry points [`#393`]	✅
Add optional tool dependencies in package metadata and make scripts fail gracefully without them [`#384`]	✅

Assessment against linked issues: Out-of-scope changes

Code Change	Explanation
C++ statistics helper changes and NaN/noise gating (`nvbench/detail/statistics.cuh` lines 57-423)	The linked issues do not mention modifying measurement statistics internals.
Centralized timeout-warning logging and cold/CPU-only measurement refactor (`nvbench/detail/measure_timeout_warnings.cuh`, `nvbench/detail/measure_cold.cu`, `nvbench/detail/measure_cpu_only.cxx`)	These changes are unrelated to the linked packaging and compare-tool objectives.
New timeout-warning and statistics tests (`testing/measure_timeout_warnings.cu`, `testing/statistics.cu`, `testing/CMakeLists.txt`)	The linked issues do not require these test additions.

_{Comment @coderabbitai help to get the list of available commands.}

Teach nvbench_compare to parse GPU timing summaries into structured values and prefer the robust median/IQR summaries when both compared measurements provide them. Fall back to the existing mean/stdev summaries when robust summaries are not available. Classify comparisons with the larger available relative noise estimate instead of the smaller one, keep unavailable noise distinct from encoded infinite noise, and report improvements separately from regressions. Keep the process exit code as success for completed comparisons; regression counts are reported in the summary instead of being used as the process status. Make plotting tolerate unavailable noise by leaving gaps in confidence bands, sort plotted series by the plotted axis, and avoid reusing pyplot state across plot calls. Add focused Python tests for robust-summary preference, unavailable-noise classification, non-finite timing centers, plot-along handling when the selected axis is absent, and the exit-code contract.

Teach nvbench_compare to keep the order of --benchmark and --axis arguments so axis filters can apply either globally or to the most recent benchmark. Build a filter plan from the ordered CLI arguments and apply the same plan to table output and plotting labels. Add explicit --reference-devices and --compare-devices filters. The filters accept all, a single device id, or a comma-separated list of ids; ordered lists and duplicates are preserved so selected reference and compare devices can be paired by position. Device-section mismatches remain fatal for unfiltered all-vs-all comparisons, but become warnings when the user explicitly selects devices and the selected device counts match. Match duplicate benchmark states by occurrence within each filtered device section instead of matching only by state name across the whole benchmark. This keeps repeated axis values and filtered duplicate states aligned between the reference and compare inputs, and reports mismatched occurrence counts instead of silently dropping extra states. Add Python tests for duplicate-state matching, axis filtering before matching, device filter parsing and validation, explicit cross-device pairing, and benchmark-scoped axis filters. Original commit messages folded into this change: Tweaks for nvbench_compare 1. When JSON files contain multiple entries with the same name and axis values, make sure that scripts compares corresponding entries. Previous logic would extract the first entry from ref data, and would compare measurements for each state in cmp against the first entry from ref. The change introduces a counter to know which nth entry we process for a particular axis value, and retrieve corresponding entry in ref. Scope occurrence matching by device. Device pairing in nvbench_compare.py is strictly index-based under --ignore-devices, reused IDs in a different order no longer pair against the wrong reference device. Require devices in ref and cmp to have the same cardinality Handle mismatch when number of duplicates in ref data is not same as in cmp data Use pytest monkeypatch fixture to pretend third-party package dependencies are available during test run for nvbench_compare without introducing test-time dependency Added the happy-path test and fixed its direct-call setup by initializing the device globals that main() normally populates. Fix to filter-before-matching. - compare_benches() now pairs devices by selected position instead of taking a device id. - For each device pair, compare_benches() now builds: - ref_device_states: matching reference device and axis filters - cmp_device_states: matching compare device and axis filters - State occurrence counts and duplicate occurrence matching now operate only on those filtered per-device lists. - Removed the later matches_axis_filters() skip inside the compare-state loop because filtering now happens before matching. Added a regression test where ref/cmp have duplicate state names in opposite order, and --axis keeps only one of them. The test verifies the kept compare state is matched against the kept reference state, not the first unfiltered occurrence. Introduce device filtering in nvbench_compare - --reference-devices all|ID|ID,ID,... - --compare-devices all|ID|ID,ID,... - Integer lists preserve order and duplicates. - Requested IDs are validated against the file-level device list. - Filtered reference/compare device counts must match before comparison. - compare_benches() pairs selected reference and compare devices by position. - Each benchmark validates that requested device IDs are present in its own devices list. Implemented benchmark-scoped --axis handling. - --axis and --benchmark now share an ordered argparse action, so their relative CLI order is preserved. - -a before any -b becomes a global axis filter. - -a after -b <name> applies to that most recent benchmark only. - Repeated -b entries are treated as separate filter scopes and combined as alternatives for that benchmark. - Device filtering remains global and is applied independently. Allow non-matching devices for explicit device selection Now the device-section equality check remains fatal only for unfiltered all-vs-all comparisons. If either --reference-devices or --compare-devices is explicit, mismatched selected device metadata is printed as a warning, but comparison proceeds after the selected device counts have been validated. Fix for resolve_benchmark_device_ids, add comments The return value of resolve_benchmark_device_ids now always owns its list. Use monkeypatch class in set_test_devices helper Stricted device id validation Test for device id validation

Introduce GpuTimingData, SummaryComparison, ComparisonStats, and ComparisonRunData to make timing extraction, classification, and run-level state explicit. Load sample-time and SM-frequency bulk data from JSON binary output into GpuTimingData when available, preserving count validation between paired sample and frequency arrays. Move GPU timing comparison logic into compare_gpu_timings(), prefer robust median/IQR data when available, and fall back to mean/stdev summaries otherwise. Keep missing or invalid noise on the unknown path. Replace module-level comparison counters and selected-device globals with per-run data passed into compare_benches(). Update tests to validate timing classification, bulk-data loading, device pairing, filtered duplicate matching, and summary counters through the new structures.

It is not emitted just yet, but the code becomes ready for it when it starts being emitted

Store JSON-bin sample time and frequency metadata in GpuTimingData instead of reading the binary files during summary extraction. Add Float32BinarySource and lazy cached accessors for samples and frequencies. Use np.fromfile by default, but allow tests and alternate callers to inject a float32 reader returning any buffer-compatible object convertable to "<f4" data type. Treat optional bulk-data failures as unavailable evidence instead of aborting comparison: unreadable files, invalid buffers, count mismatches, and mismatched sample/frequency metadata now emit RuntimeWarning and return None. Update nvbench_compare tests to verify lazy loading, cache reuse, injected reader behavior, warning-based degradation, and count mismatch handling.

Also add a test to check that STDOUT also works.

Reject boolean summary float payloads instead of coercing them to 1.0/0.0, while keeping numeric strings accepted for NVBench JSON compatibility. Add regression coverage for generated bulk-debug Python filenames that require escaping, and strengthen the plot-along test to assert log-log axes and confidence-band rendering.

Move Float32BinarySource material-payload detection into the source object. Default file-backed sources still use resolved file size so missing or empty sidecars remain unavailable, but positive-count sources with custom readers are treated as material and proceed through the lazy read path. Add regression coverage for virtual bulk sources whose custom reader provides data without a local sidecar file.

Derive absolute standard deviation from relative stdev and mean when nv/cold/time/gpu/stdev/absolute is absent. This lets older JSON files that only contain mean and relative stdev still construct timing intervals. Also allow mean/stdev intervals to be built without min/max summaries, using min/max only as optional clipping bounds when present. This restores SAME classification for legacy fixture data instead of treating those rows as missing-interval AMBG cases. Update nvbench_compare tests to cover derived stdev handling and the legacy mean/stdev comparison path.

Add a shared nvbench_tooling_deps helper for importing packages required by NVBench console tools. Missing tooling packages now raise a dedicated error with an install recipe instead of failing with a raw ImportError. Update script imports to work both as installed package modules and as direct source-tree scripts by using the __package__ import pattern for nvbench_json and the new tooling helper. Defer nvbench-compare dependencies to the points where they are needed: NumPy/colorama during normal comparison setup, tabulate during table rendering, jsondiff only for device mismatch reporting, and plotting packages only for plot modes. Update tests to initialize compare tooling when calling internals directly and add coverage for the tooling dependency loader. Closes NVIDIA#384

Introduce optional dependency categories in the wheel metadata, with cuda-bench[tools] encompasing all dependencies of nvbench scripts. Closes NVIDIA#393

Move `load_nvbench_compare_tooling()` after: - config resolution / --dump-config - filter parsing - device-filter parsing - input arity validation

test_nvbench_tooling_deps.py now has a smoke test that builds a temporary package layout matching the wheel mapping: cuda/bench/scripts/nvbench_tooling_deps.py and imports: cuda.bench.scripts.nvbench_tooling_deps That covers the cuda.bench.scripts.* path without requiring a wheel build/install inside this unit test.

Add a shared nvbench_tooling_deps helper that imports packages required by console tools and reports missing packages with an actionable install recipe. Update NVBench scripts to support both direct source-tree execution and installed package execution through the __package__ import pattern. Defer third-party tooling imports until they are needed, including lazy loading for compare tables, device diffs, and plotting paths. Make loaders resilient to partial initialization failures so a later retry can complete any dependency that failed previously. Update tests to cover direct internal use, packaged cuda.bench.scripts imports, dependency-loader error messages, and cheap CLI validation before tooling dependencies are loaded.

require_tooling_dependency() now only translates ModuleNotFoundError when the missing module is the requested top-level package. Other import failures are re-raised unchanged. This helps in situation where third-party dependency is installed but broken for whatever reason. Previously we would intercept it and suggest to run pip install cuda-bench[tools], but that was already done.

round in Python returns nearest even, while in C++ it behaves as x -> floor(x + 0.5)

Factor finite-value checks through a common helper so positive and non-negative finite predicates share the same None/finite guard. Add a comment explaining the positive lower-bound clamp used for mean/stdev intervals, since later ratio and log-distance checks require positive bounds. Also quote the --axis pow2 example in CLI help so shell users can copy the example safely.

Implemented robo-review feedback suggesting to expand coverage of lazy loading helper.

oleksandr-pavlyk · 2026-06-30T11:42:26Z

@CodeRabbit full review

oleksandr-pavlyk added 16 commits June 26, 2026 16:28

Replace forwarding with semantically more accurate std::move

214d286

Also add comment within percentile_rank to document precondition on input values checked with assert statement. Also, sharpened the comment around percentile_rank function

Test quartile values across selection threshold

d9cdd8b

Add fixed expected-value assertions for quartile tests around the sort/selection switch point, including duplicate-heavy inputs. This keeps the tests from only proving that both implementations agree with each other.

Add NaN guards to percentiles and quartiles computation routines

04290dd

Add tests for handling of NaNs in quartile routine inputs

86eb2a8

Add comment re magic sort/select threshold value

7069a6b

Refactor logic of emitting warnings between cold and cpu-only measures

b0932b0

Introduce new header file with inline implementation. Use it from measure_cold.cuh and measure_cpu_only.cxx

Add static assertion that ValueType is a floating-point type

6b85a9b

Check consistency of sort- vs. select-based quartiles using threshold…

caa2f46

… constant Expose quartile threshold value, use it in testing to test around that value.

Collapsed two branches with identical bodies

b6b0dd1

test_compute_standard_deviation_noise exercises other invalid inputs

5546726

Test nullopt explicitly in warning check test

36d8c5b

check_noise_warning() now takes std::optional<nvbench::float64_t>, matching the production helper, and the test now covers std::nullopt explicitly in addition to NaN, negative, and +inf.

oleksandr-pavlyk mentioned this pull request Jun 29, 2026

Nvbench compare improvements #383

Closed

oleksandr-pavlyk force-pushed the nvbench-compare-process-bulk-data branch from f18d48d to f0db899 Compare June 29, 2026 19:59

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

oleksandr-pavlyk force-pushed the nvbench-compare-process-bulk-data branch from 18869e2 to 42927d9 Compare June 30, 2026 11:35

oleksandr-pavlyk added 7 commits June 30, 2026 06:37

Test for timeout warnings for min-samples and min-time

454e3fe

Make nvbench_compare read bulk data, if available

0baa699

Introduce UNDECIDED comparison status

a11b541

It is not emitted just yet, but the code becomes ready for it when it starts being emitted

oleksandr-pavlyk added 18 commits June 30, 2026 06:40

Test to use --bulk-debug-python stdout, not STDOUT

9ace525

Also add a test to check that STDOUT also works.

--threshold-diff is treated as % value, as documented

9bb5021

Install scripts into cuda/bench.scripts in prefix

49b2bb2

Introduce optional dependency categories in the wheel metadata, with cuda-bench[tools] encompasing all dependencies of nvbench scripts. Closes NVIDIA#393

Move lazy load after cheap CLI arg validation

21661b9

Move `load_nvbench_compare_tooling()` after: - config resolution / --dump-config - filter parsing - device-filter parsing - input arity validation

Harden tooling dependency import tests

4ca228a

colorama is not imported if --no-color is used

c2c46e9

make Python percentile_rank align C++ implementation

4f6dd09

round in Python returns nearest even, while in C++ it behaves as x -> floor(x + 0.5)

Implement typing for NumPy arrays friendly to lazy loading

f0878fc

percentile_rank use integer arithmetic and avoid use of floats

6740e59

Fix unexpected AttributeError when --no-color is used

a240460

Implemented robo-review feedback suggesting to expand coverage of lazy loading helper.

oleksandr-pavlyk force-pushed the nvbench-compare-process-bulk-data branch from 42927d9 to a240460 Compare June 30, 2026 11:41

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

Add nvbench_plot_bwutil to mock layout in test

77dae7c

oleksandr-pavlyk added python type: enhancement New feature or request. labels Jun 30, 2026

oleksandr-pavlyk self-assigned this Jun 30, 2026

oleksandr-pavlyk added this to CCCL Jun 30, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 30, 2026

oleksandr-pavlyk moved this from Todo to In Progress in CCCL Jun 30, 2026

jrhemstad requested a review from gevtushenko June 30, 2026 15:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Nvbench compare process bulk data#386

Nvbench compare process bulk data#386
oleksandr-pavlyk wants to merge 84 commits into
NVIDIA:mainfrom
oleksandr-pavlyk:nvbench-compare-process-bulk-data

oleksandr-pavlyk commented Jun 4, 2026 •

edited

Loading

Uh oh!

oleksandr-pavlyk commented Jun 29, 2026

Uh oh!

This comment was marked as outdated.

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading

Summary by CodeRabbit

Walkthrough

Assessment against linked issues

Assessment against linked issues: Out-of-scope changes

Uh oh!

This comment was marked as resolved.

Uh oh!

oleksandr-pavlyk commented Jun 30, 2026

Uh oh!

This comment was marked as outdated.

This comment was marked as resolved.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

oleksandr-pavlyk commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oleksandr-pavlyk commented Jun 29, 2026

Uh oh!

This comment was marked as outdated.

coderabbitai Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Assessment against linked issues

Assessment against linked issues: Out-of-scope changes

Uh oh!

This comment was marked as resolved.

Uh oh!

oleksandr-pavlyk commented Jun 30, 2026

Uh oh!

This comment was marked as outdated.

This comment was marked as resolved.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

oleksandr-pavlyk commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading