Skip to content

Nvbench compare process bulk data#386

Draft
oleksandr-pavlyk wants to merge 84 commits into
NVIDIA:mainfrom
oleksandr-pavlyk:nvbench-compare-process-bulk-data
Draft

Nvbench compare process bulk data#386
oleksandr-pavlyk wants to merge 84 commits into
NVIDIA:mainfrom
oleksandr-pavlyk:nvbench-compare-process-bulk-data

Conversation

@oleksandr-pavlyk

@oleksandr-pavlyk oleksandr-pavlyk commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

closes #320
closes #393
closes #384

This PR builds on top of #392 (changes to C++ sources in nvbench/ and testing/ folders) and contains only Python changes.

Also add comment within percentile_rank to document precondition
on input values checked with assert statement.

Also, sharpened the comment around percentile_rank function
Prepare duplicate heavy input and check sort-based
quartile computation result with selection-based one.

std::nth_element only guarantees that the nth element
is the value that would appear there in sorted order;
it does not fully sort equal partitions. Bugs in the
selection implementation, especially when selecting Q1
from the left half and Q3 from the right half after
selecting the median, are more likely to show up when
many samples equal the quartile values.
…ulated

Cold measurement can discard throttled trials before incrementing the accepted
sample count, then stop on timeout with zero recorded samples. In that case,
only emit the sample-size summary and skip derived timing, bandwidth, clock, and
bulk summaries that require accepted samples.

This avoids divide-by-zero mean calculations and quartile/IQR computation over
empty sample vectors.

Keep timeout diagnostics reachable for zero-sample runs and add an explicit
warning when no accepted cold samples were recorded. Factor timeout warning
emission into a private helper so the zero-sample and normal paths share the
same diagnostic logic.

Suppress low-sample relative stdev noise

Add a statistics helper that returns no relative standard-deviation noise until
there are enough samples for a meaningful estimate. Use it for cold CPU/GPU and
CPU-only summaries so the low-sample +inf stdev sentinel is not published as
real relative noise or used for max-noise timeout warnings.

Add statistics coverage for suppressing the low-sample sentinel and computing
relative stdev noise once the sample threshold is reached.

compute_standard_deviation_noise return nullopt if standard deviation is not finite

Test verify that noise is nullopt when not enough samples are accumulated

Added statistics::has_enough_samples_for_noise_estimate(...)

Used it in standard_deviation, compute_standard_deviation_noise,
compute_robust_noise.

Added timeout diagnostics in cold and CPU-only paths.
if max-noise is configured and the run timed out before enough
samples exist to estimate noise, the log now says that explicitly,
otherwise the existing “over noise threshold” warning remains
unchanged.

Added a statistics test assertion for the new sample-count
predicate.
Add fixed expected-value assertions for quartile tests around the
sort/selection switch point, including duplicate-heavy inputs. This keeps the
tests from only proving that both implementations agree with each other.
Introduce new header file with inline implementation. Use it
from measure_cold.cuh and measure_cpu_only.cxx
… constant

Expose quartile threshold value, use it in testing to test around that value.
Keep legacy stdev/relative summary tags present even when too few
samples are available to compute a meaningful standard-deviation noise
estimate. Use the standard-deviation unavailable sentinel for those
values so existing summary consumers continue to see the expected tags.

Factor the sentinel into the statistics helpers and use it from both
standard_deviation() and stdev_noise_or_sentinel(), keeping the schema
compatibility behavior explicit and tested.
…navailable

Add a focused test target, nvbench.test.measure_timeout_warnings, covering:

  - NaN stdev noise -> “unable to estimate noise”
  - negative stdev noise -> “unable to estimate noise”
  - +inf stdev noise -> “over noise threshold”
check_noise_warning() now takes std::optional<nvbench::float64_t>,
matching the production helper, and the test now covers
std::nullopt explicitly in addition to NaN, negative, and +inf.
Document that percentile helpers return quiet NaNs for NaN-containing inputs.

Make quartile expected-value tests compute ranks from the documented
round(p / 100 * (n - 1)) rule instead of reusing statistics::percentile_rank(),
so rank regressions are caught independently.

Extend timeout-warning coverage to exercise the too-few-samples max-noise path
in addition to unavailable, invalid, and infinite stdev-noise inputs.
@oleksandr-pavlyk oleksandr-pavlyk force-pushed the nvbench-compare-process-bulk-data branch from f18d48d to f0db899 Compare June 29, 2026 19:59
@oleksandr-pavlyk

Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai

This comment was marked as outdated.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added a new nvbench-compare documentation page with result classes, matching rules, and TOML/config guidance.
    • Improved Python CLI tools to load plotting/output dependencies lazily and exit with clear errors when tooling is missing.
  • Bug Fixes
    • Enhanced timeout/no-accepted-samples warnings during benchmark summary generation.
    • Improved noise/relative-noise computation and more consistent handling of NaN/invalid/insufficient-sample inputs in statistics (percentiles/quartiles/robust noise).
  • Tests
    • Added/expanded unit tests for timeout warnings, statistics edge cases, and tooling-dependency import behavior.

Walkthrough

Adds a compare tool reference page, centralizes C++ timeout-warning logging and statistics helpers, and updates Python scripts and wheel packaging to use lazy optional dependencies under cuda.bench.scripts.

C++ statistics and timeout-warning refactor

Layer / File(s) Summary
statistics helpers and NaN handling
nvbench/detail/statistics.cuh
Adds shared noise-gating and sentinel helpers, changes percentile and quartile behavior for NaN-containing inputs, and centralizes quartile selection threshold use.
centralized timeout warning logging
nvbench/detail/measure_timeout_warnings.cuh, nvbench/detail/measure_cold.cuh, nvbench/detail/measure_cold.cu, nvbench/detail/measure_cpu_only.cxx
Introduces shared timeout-warning helpers and wires them into cold and CPU-only measurement code.
statistics and timeout-warning tests
testing/CMakeLists.txt, testing/measure_timeout_warnings.cu, testing/statistics.cu
Adds CUDA tests for timeout warnings and expands statistics coverage for NaNs, duplicate-heavy quartiles, and new noise helpers.

Python tools packaging and lazy dependency loading

Layer / File(s) Summary
tooling dependency loader
python/scripts/nvbench_tooling_deps.py
Defines the tooling dependency dataclass, missing-dependency error, and helper for lazy importing optional packages with install hints.
lazy script loading
python/scripts/nvbench_histogram.py, python/scripts/nvbench_plot_bwutil.py, python/scripts/nvbench_walltime.py
Updates the histogram, plot_bwutil, and walltime scripts to use package-aware imports, lazy tooling loaders, and explicit exit codes on missing optional dependencies.
wheel packaging and tooling tests
python/pyproject.toml, python/test/test_nvbench_tooling_deps.py
Moves the scripts wheel package under cuda/bench/scripts, retargets console entry points, and adds tests for packaged imports and tooling-dependency error handling.

nvbench-compare documentation

Layer / File(s) Summary
compare reference documentation
docs/nvbench_compare.md
Documents nvbench-compare behavior, command-line usage, matching, timing summaries, bulk-debug Python, comparison decisions, AMBG handling, and configuration keys.

Assessment against linked issues

Objective Addressed Explanation
Distribution-based comparison in nvbench_compare.py with SAME/AMBG/INSUFFICIENT handling [#320] No code changes to nvbench_compare.py are present; only documentation for the tool was added.
Use a more descriptive scripts package name in the wheel and update script entry points [#393]
Add optional tool dependencies in package metadata and make scripts fail gracefully without them [#384]

Assessment against linked issues: Out-of-scope changes

Code Change Explanation
C++ statistics helper changes and NaN/noise gating (nvbench/detail/statistics.cuh lines 57-423) The linked issues do not mention modifying measurement statistics internals.
Centralized timeout-warning logging and cold/CPU-only measurement refactor (nvbench/detail/measure_timeout_warnings.cuh, nvbench/detail/measure_cold.cu, nvbench/detail/measure_cpu_only.cxx) These changes are unrelated to the linked packaging and compare-tool objectives.
New timeout-warning and statistics tests (testing/measure_timeout_warnings.cu, testing/statistics.cu, testing/CMakeLists.txt) The linked issues do not require these test additions.

Comment @coderabbitai help to get the list of available commands.

coderabbitai[bot]

This comment was marked as resolved.

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the nvbench-compare-process-bulk-data branch from 18869e2 to 42927d9 Compare June 30, 2026 11:35
Teach nvbench_compare to parse GPU timing summaries into structured values and
prefer the robust median/IQR summaries when both compared measurements provide
them. Fall back to the existing mean/stdev summaries when robust summaries are
not available.

Classify comparisons with the larger available relative noise estimate instead
of the smaller one, keep unavailable noise distinct from encoded infinite noise,
and report improvements separately from regressions. Keep the process exit code
as success for completed comparisons; regression counts are reported in the
summary instead of being used as the process status.

Make plotting tolerate unavailable noise by leaving gaps in confidence bands,
sort plotted series by the plotted axis, and avoid reusing pyplot state across
plot calls.

Add focused Python tests for robust-summary preference, unavailable-noise
classification, non-finite timing centers, plot-along handling when the selected
axis is absent, and the exit-code contract.
Teach nvbench_compare to keep the order of --benchmark and --axis arguments so
axis filters can apply either globally or to the most recent benchmark. Build a
filter plan from the ordered CLI arguments and apply the same plan to table
output and plotting labels.

Add explicit --reference-devices and --compare-devices filters. The filters
accept all, a single device id, or a comma-separated list of ids; ordered lists
and duplicates are preserved so selected reference and compare devices can be
paired by position. Device-section mismatches remain fatal for unfiltered
all-vs-all comparisons, but become warnings when the user explicitly selects
devices and the selected device counts match.

Match duplicate benchmark states by occurrence within each filtered device
section instead of matching only by state name across the whole benchmark. This
keeps repeated axis values and filtered duplicate states aligned between the
reference and compare inputs, and reports mismatched occurrence counts instead
of silently dropping extra states.

Add Python tests for duplicate-state matching, axis filtering before matching,
device filter parsing and validation, explicit cross-device pairing, and
benchmark-scoped axis filters.

Original commit messages folded into this change:

Tweaks for nvbench_compare

1. When JSON files contain multiple entries with the same name and axis values,
   make sure that scripts compares corresponding entries.

   Previous logic would extract the first entry from ref data, and would compare
   measurements for each state in cmp against the first entry from ref. The
   change introduces a counter to know which nth entry we process for a
   particular axis value, and retrieve corresponding entry in ref.

Scope occurrence matching by device.

Device pairing in nvbench_compare.py is strictly index-based under
--ignore-devices, reused IDs in a different order no longer pair against the
wrong reference device.

Require devices in ref and cmp to have the same cardinality

Handle mismatch when number of duplicates in ref data is not same as in cmp data

Use pytest monkeypatch fixture to pretend third-party package dependencies are
available during test run for nvbench_compare without introducing test-time
dependency

Added the happy-path test and fixed its direct-call setup by initializing the
device globals that main() normally populates.

Fix to filter-before-matching.

 - compare_benches() now pairs devices by selected position instead of taking a
   device id.
 - For each device pair, compare_benches() now builds:
     - ref_device_states: matching reference device and axis filters
     - cmp_device_states: matching compare device and axis filters
 - State occurrence counts and duplicate occurrence matching now operate only
   on those filtered per-device lists.
 - Removed the later matches_axis_filters() skip inside the compare-state loop
   because filtering now happens before matching.

Added a regression test where ref/cmp have duplicate state names in opposite
order, and --axis keeps only one of them. The test verifies the kept compare
state is matched against the kept reference state, not the first unfiltered
occurrence.

Introduce device filtering in nvbench_compare

 - --reference-devices all|ID|ID,ID,...
 - --compare-devices all|ID|ID,ID,...
 - Integer lists preserve order and duplicates.
 - Requested IDs are validated against the file-level device list.
 - Filtered reference/compare device counts must match before comparison.
 - compare_benches() pairs selected reference and compare devices by position.
 - Each benchmark validates that requested device IDs are present in its own
   devices list.

Implemented benchmark-scoped --axis handling.

  - --axis and --benchmark now share an ordered argparse action, so their
    relative CLI order is preserved.
  - -a before any -b becomes a global axis filter.
  - -a after -b <name> applies to that most recent benchmark only.
  - Repeated -b entries are treated as separate filter scopes and combined as
    alternatives for that benchmark.
  - Device filtering remains global and is applied independently.

Allow non-matching devices for explicit device selection

Now the device-section equality check remains fatal only for unfiltered
all-vs-all comparisons. If either --reference-devices or --compare-devices is
explicit, mismatched selected device metadata is printed as a warning, but
comparison proceeds after the selected device counts have been validated.

Fix for resolve_benchmark_device_ids, add comments

The return value of resolve_benchmark_device_ids now always owns its list.

Use monkeypatch class in set_test_devices helper

Stricted device id validation

Test for device id validation
Introduce GpuTimingData, SummaryComparison, ComparisonStats, and
ComparisonRunData to make timing extraction, classification, and run-level
state explicit.

Load sample-time and SM-frequency bulk data from JSON binary output into
GpuTimingData when available, preserving count validation between paired
sample and frequency arrays.

Move GPU timing comparison logic into compare_gpu_timings(), prefer robust
median/IQR data when available, and fall back to mean/stdev summaries otherwise.
Keep missing or invalid noise on the unknown path.

Replace module-level comparison counters and selected-device globals with
per-run data passed into compare_benches(). Update tests to validate timing
classification, bulk-data loading, device pairing, filtered duplicate matching,
and summary counters through the new structures.
It is not emitted just yet, but the code becomes ready for it
when it starts being emitted
Store JSON-bin sample time and frequency metadata in GpuTimingData instead of
reading the binary files during summary extraction.

Add Float32BinarySource and lazy cached accessors for samples and frequencies.
Use np.fromfile by default, but allow tests and alternate callers to inject a
float32 reader returning any buffer-compatible object convertable to "<f4" data
type.

Treat optional bulk-data failures as unavailable evidence instead of aborting
comparison: unreadable files, invalid buffers, count mismatches, and mismatched
sample/frequency metadata now emit RuntimeWarning and return None.

Update nvbench_compare tests to verify lazy loading, cache reuse, injected
reader behavior, warning-based degradation, and count mismatch handling.
Also add a test to check that STDOUT also works.
Reject boolean summary float payloads instead of coercing them to 1.0/0.0,
while keeping numeric strings accepted for NVBench JSON compatibility.

Add regression coverage for generated bulk-debug Python filenames that require
escaping, and strengthen the plot-along test to assert log-log axes and
confidence-band rendering.
Move Float32BinarySource material-payload detection into the source object.
Default file-backed sources still use resolved file size so missing or empty
sidecars remain unavailable, but positive-count sources with custom readers are
treated as material and proceed through the lazy read path.

Add regression coverage for virtual bulk sources whose custom reader provides
data without a local sidecar file.
Derive absolute standard deviation from relative stdev and mean when
nv/cold/time/gpu/stdev/absolute is absent. This lets older JSON files
that only contain mean and relative stdev still construct timing
intervals.

Also allow mean/stdev intervals to be built without min/max summaries,
using min/max only as optional clipping bounds when present. This
restores SAME classification for legacy fixture data instead of treating
those rows as missing-interval AMBG cases.

Update nvbench_compare tests to cover derived stdev handling and the
legacy mean/stdev comparison path.
Add a shared nvbench_tooling_deps helper for importing packages required
by NVBench console tools. Missing tooling packages now raise a dedicated
error with an install recipe instead of failing with a raw ImportError.

Update script imports to work both as installed package modules and as
direct source-tree scripts by using the __package__ import pattern for
nvbench_json and the new tooling helper.

Defer nvbench-compare dependencies to the points where they are needed:
NumPy/colorama during normal comparison setup, tabulate during table
rendering, jsondiff only for device mismatch reporting, and plotting
packages only for plot modes.

Update tests to initialize compare tooling when calling internals
directly and add coverage for the tooling dependency loader.

Closes NVIDIA#384
Introduce optional dependency categories in the wheel metadata,
with cuda-bench[tools] encompasing all dependencies of nvbench
scripts.

Closes NVIDIA#393
Move `load_nvbench_compare_tooling()` after:
  - config resolution / --dump-config
  - filter parsing
  - device-filter parsing
  - input arity validation
test_nvbench_tooling_deps.py now has a smoke test that builds a
temporary package layout matching the wheel mapping:

cuda/bench/scripts/nvbench_tooling_deps.py

and imports:

cuda.bench.scripts.nvbench_tooling_deps

That covers the cuda.bench.scripts.* path without requiring a wheel
build/install inside this unit test.
Add a shared nvbench_tooling_deps helper that imports packages required
by console tools and reports missing packages with an actionable install
recipe.

Update NVBench scripts to support both direct source-tree execution and
installed package execution through the __package__ import pattern. Defer
third-party tooling imports until they are needed, including lazy loading
for compare tables, device diffs, and plotting paths.

Make loaders resilient to partial initialization failures so a later
retry can complete any dependency that failed previously.

Update tests to cover direct internal use, packaged cuda.bench.scripts
imports, dependency-loader error messages, and cheap CLI validation before
tooling dependencies are loaded.
require_tooling_dependency() now only translates
ModuleNotFoundError when the missing module is the
requested top-level package. Other import
failures are re-raised unchanged.

This helps in situation where third-party dependency
is installed but broken for whatever reason. Previously
we would intercept it and suggest to run
pip install cuda-bench[tools], but that was already done.
round in Python returns nearest even, while in C++ it
behaves as x -> floor(x + 0.5)
Factor finite-value checks through a common helper so positive and
non-negative finite predicates share the same None/finite guard. Add a
comment explaining the positive lower-bound clamp used for mean/stdev
intervals, since later ratio and log-distance checks require positive
bounds.

Also quote the --axis pow2 example in CLI help so shell users can copy
the example safely.
Implemented robo-review feedback suggesting to expand coverage
of lazy loading helper.
@oleksandr-pavlyk oleksandr-pavlyk force-pushed the nvbench-compare-process-bulk-data branch from 42927d9 to a240460 Compare June 30, 2026 11:41
@oleksandr-pavlyk

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit full review

@coderabbitai

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as resolved.

@oleksandr-pavlyk oleksandr-pavlyk added python type: enhancement New feature or request. labels Jun 30, 2026
@oleksandr-pavlyk oleksandr-pavlyk self-assigned this Jun 30, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 30, 2026
@oleksandr-pavlyk oleksandr-pavlyk moved this from Todo to In Progress in CCCL Jun 30, 2026
@jrhemstad jrhemstad requested a review from gevtushenko June 30, 2026 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python type: enhancement New feature or request.

Projects

Status: In Progress

1 participant