Nvbench compare process bulk data#386
Draft
oleksandr-pavlyk wants to merge 84 commits into
Draft
Conversation
Also add comment within percentile_rank to document precondition on input values checked with assert statement. Also, sharpened the comment around percentile_rank function
Prepare duplicate heavy input and check sort-based quartile computation result with selection-based one. std::nth_element only guarantees that the nth element is the value that would appear there in sorted order; it does not fully sort equal partitions. Bugs in the selection implementation, especially when selecting Q1 from the left half and Q3 from the right half after selecting the median, are more likely to show up when many samples equal the quartile values.
…ulated Cold measurement can discard throttled trials before incrementing the accepted sample count, then stop on timeout with zero recorded samples. In that case, only emit the sample-size summary and skip derived timing, bandwidth, clock, and bulk summaries that require accepted samples. This avoids divide-by-zero mean calculations and quartile/IQR computation over empty sample vectors. Keep timeout diagnostics reachable for zero-sample runs and add an explicit warning when no accepted cold samples were recorded. Factor timeout warning emission into a private helper so the zero-sample and normal paths share the same diagnostic logic. Suppress low-sample relative stdev noise Add a statistics helper that returns no relative standard-deviation noise until there are enough samples for a meaningful estimate. Use it for cold CPU/GPU and CPU-only summaries so the low-sample +inf stdev sentinel is not published as real relative noise or used for max-noise timeout warnings. Add statistics coverage for suppressing the low-sample sentinel and computing relative stdev noise once the sample threshold is reached. compute_standard_deviation_noise return nullopt if standard deviation is not finite Test verify that noise is nullopt when not enough samples are accumulated Added statistics::has_enough_samples_for_noise_estimate(...) Used it in standard_deviation, compute_standard_deviation_noise, compute_robust_noise. Added timeout diagnostics in cold and CPU-only paths. if max-noise is configured and the run timed out before enough samples exist to estimate noise, the log now says that explicitly, otherwise the existing “over noise threshold” warning remains unchanged. Added a statistics test assertion for the new sample-count predicate.
Add fixed expected-value assertions for quartile tests around the sort/selection switch point, including duplicate-heavy inputs. This keeps the tests from only proving that both implementations agree with each other.
Introduce new header file with inline implementation. Use it from measure_cold.cuh and measure_cpu_only.cxx
… constant Expose quartile threshold value, use it in testing to test around that value.
Keep legacy stdev/relative summary tags present even when too few samples are available to compute a meaningful standard-deviation noise estimate. Use the standard-deviation unavailable sentinel for those values so existing summary consumers continue to see the expected tags. Factor the sentinel into the statistics helpers and use it from both standard_deviation() and stdev_noise_or_sentinel(), keeping the schema compatibility behavior explicit and tested.
…navailable Add a focused test target, nvbench.test.measure_timeout_warnings, covering: - NaN stdev noise -> “unable to estimate noise” - negative stdev noise -> “unable to estimate noise” - +inf stdev noise -> “over noise threshold”
check_noise_warning() now takes std::optional<nvbench::float64_t>, matching the production helper, and the test now covers std::nullopt explicitly in addition to NaN, negative, and +inf.
Document that percentile helpers return quiet NaNs for NaN-containing inputs. Make quartile expected-value tests compute ranks from the documented round(p / 100 * (n - 1)) rule instead of reusing statistics::percentile_rank(), so rank regressions are caught independently. Extend timeout-warning coverage to exercise the too-few-samples max-noise path in addition to unavailable, invalid, and infinite stdev-noise inputs.
f18d48d to
f0db899
Compare
Collaborator
Author
|
@coderabbitai full review |
This comment was marked as outdated.
This comment was marked as outdated.
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds a compare tool reference page, centralizes C++ timeout-warning logging and statistics helpers, and updates Python scripts and wheel packaging to use lazy optional dependencies under C++ statistics and timeout-warning refactor
Python tools packaging and lazy dependency loading
nvbench-compare documentation
Assessment against linked issues
Assessment against linked issues: Out-of-scope changes
Comment |
18869e2 to
42927d9
Compare
Teach nvbench_compare to parse GPU timing summaries into structured values and prefer the robust median/IQR summaries when both compared measurements provide them. Fall back to the existing mean/stdev summaries when robust summaries are not available. Classify comparisons with the larger available relative noise estimate instead of the smaller one, keep unavailable noise distinct from encoded infinite noise, and report improvements separately from regressions. Keep the process exit code as success for completed comparisons; regression counts are reported in the summary instead of being used as the process status. Make plotting tolerate unavailable noise by leaving gaps in confidence bands, sort plotted series by the plotted axis, and avoid reusing pyplot state across plot calls. Add focused Python tests for robust-summary preference, unavailable-noise classification, non-finite timing centers, plot-along handling when the selected axis is absent, and the exit-code contract.
Teach nvbench_compare to keep the order of --benchmark and --axis arguments so
axis filters can apply either globally or to the most recent benchmark. Build a
filter plan from the ordered CLI arguments and apply the same plan to table
output and plotting labels.
Add explicit --reference-devices and --compare-devices filters. The filters
accept all, a single device id, or a comma-separated list of ids; ordered lists
and duplicates are preserved so selected reference and compare devices can be
paired by position. Device-section mismatches remain fatal for unfiltered
all-vs-all comparisons, but become warnings when the user explicitly selects
devices and the selected device counts match.
Match duplicate benchmark states by occurrence within each filtered device
section instead of matching only by state name across the whole benchmark. This
keeps repeated axis values and filtered duplicate states aligned between the
reference and compare inputs, and reports mismatched occurrence counts instead
of silently dropping extra states.
Add Python tests for duplicate-state matching, axis filtering before matching,
device filter parsing and validation, explicit cross-device pairing, and
benchmark-scoped axis filters.
Original commit messages folded into this change:
Tweaks for nvbench_compare
1. When JSON files contain multiple entries with the same name and axis values,
make sure that scripts compares corresponding entries.
Previous logic would extract the first entry from ref data, and would compare
measurements for each state in cmp against the first entry from ref. The
change introduces a counter to know which nth entry we process for a
particular axis value, and retrieve corresponding entry in ref.
Scope occurrence matching by device.
Device pairing in nvbench_compare.py is strictly index-based under
--ignore-devices, reused IDs in a different order no longer pair against the
wrong reference device.
Require devices in ref and cmp to have the same cardinality
Handle mismatch when number of duplicates in ref data is not same as in cmp data
Use pytest monkeypatch fixture to pretend third-party package dependencies are
available during test run for nvbench_compare without introducing test-time
dependency
Added the happy-path test and fixed its direct-call setup by initializing the
device globals that main() normally populates.
Fix to filter-before-matching.
- compare_benches() now pairs devices by selected position instead of taking a
device id.
- For each device pair, compare_benches() now builds:
- ref_device_states: matching reference device and axis filters
- cmp_device_states: matching compare device and axis filters
- State occurrence counts and duplicate occurrence matching now operate only
on those filtered per-device lists.
- Removed the later matches_axis_filters() skip inside the compare-state loop
because filtering now happens before matching.
Added a regression test where ref/cmp have duplicate state names in opposite
order, and --axis keeps only one of them. The test verifies the kept compare
state is matched against the kept reference state, not the first unfiltered
occurrence.
Introduce device filtering in nvbench_compare
- --reference-devices all|ID|ID,ID,...
- --compare-devices all|ID|ID,ID,...
- Integer lists preserve order and duplicates.
- Requested IDs are validated against the file-level device list.
- Filtered reference/compare device counts must match before comparison.
- compare_benches() pairs selected reference and compare devices by position.
- Each benchmark validates that requested device IDs are present in its own
devices list.
Implemented benchmark-scoped --axis handling.
- --axis and --benchmark now share an ordered argparse action, so their
relative CLI order is preserved.
- -a before any -b becomes a global axis filter.
- -a after -b <name> applies to that most recent benchmark only.
- Repeated -b entries are treated as separate filter scopes and combined as
alternatives for that benchmark.
- Device filtering remains global and is applied independently.
Allow non-matching devices for explicit device selection
Now the device-section equality check remains fatal only for unfiltered
all-vs-all comparisons. If either --reference-devices or --compare-devices is
explicit, mismatched selected device metadata is printed as a warning, but
comparison proceeds after the selected device counts have been validated.
Fix for resolve_benchmark_device_ids, add comments
The return value of resolve_benchmark_device_ids now always owns its list.
Use monkeypatch class in set_test_devices helper
Stricted device id validation
Test for device id validation
Introduce GpuTimingData, SummaryComparison, ComparisonStats, and ComparisonRunData to make timing extraction, classification, and run-level state explicit. Load sample-time and SM-frequency bulk data from JSON binary output into GpuTimingData when available, preserving count validation between paired sample and frequency arrays. Move GPU timing comparison logic into compare_gpu_timings(), prefer robust median/IQR data when available, and fall back to mean/stdev summaries otherwise. Keep missing or invalid noise on the unknown path. Replace module-level comparison counters and selected-device globals with per-run data passed into compare_benches(). Update tests to validate timing classification, bulk-data loading, device pairing, filtered duplicate matching, and summary counters through the new structures.
It is not emitted just yet, but the code becomes ready for it when it starts being emitted
Store JSON-bin sample time and frequency metadata in GpuTimingData instead of reading the binary files during summary extraction. Add Float32BinarySource and lazy cached accessors for samples and frequencies. Use np.fromfile by default, but allow tests and alternate callers to inject a float32 reader returning any buffer-compatible object convertable to "<f4" data type. Treat optional bulk-data failures as unavailable evidence instead of aborting comparison: unreadable files, invalid buffers, count mismatches, and mismatched sample/frequency metadata now emit RuntimeWarning and return None. Update nvbench_compare tests to verify lazy loading, cache reuse, injected reader behavior, warning-based degradation, and count mismatch handling.
Also add a test to check that STDOUT also works.
Reject boolean summary float payloads instead of coercing them to 1.0/0.0, while keeping numeric strings accepted for NVBench JSON compatibility. Add regression coverage for generated bulk-debug Python filenames that require escaping, and strengthen the plot-along test to assert log-log axes and confidence-band rendering.
Move Float32BinarySource material-payload detection into the source object. Default file-backed sources still use resolved file size so missing or empty sidecars remain unavailable, but positive-count sources with custom readers are treated as material and proceed through the lazy read path. Add regression coverage for virtual bulk sources whose custom reader provides data without a local sidecar file.
Derive absolute standard deviation from relative stdev and mean when nv/cold/time/gpu/stdev/absolute is absent. This lets older JSON files that only contain mean and relative stdev still construct timing intervals. Also allow mean/stdev intervals to be built without min/max summaries, using min/max only as optional clipping bounds when present. This restores SAME classification for legacy fixture data instead of treating those rows as missing-interval AMBG cases. Update nvbench_compare tests to cover derived stdev handling and the legacy mean/stdev comparison path.
Add a shared nvbench_tooling_deps helper for importing packages required by NVBench console tools. Missing tooling packages now raise a dedicated error with an install recipe instead of failing with a raw ImportError. Update script imports to work both as installed package modules and as direct source-tree scripts by using the __package__ import pattern for nvbench_json and the new tooling helper. Defer nvbench-compare dependencies to the points where they are needed: NumPy/colorama during normal comparison setup, tabulate during table rendering, jsondiff only for device mismatch reporting, and plotting packages only for plot modes. Update tests to initialize compare tooling when calling internals directly and add coverage for the tooling dependency loader. Closes NVIDIA#384
Introduce optional dependency categories in the wheel metadata, with cuda-bench[tools] encompasing all dependencies of nvbench scripts. Closes NVIDIA#393
Move `load_nvbench_compare_tooling()` after: - config resolution / --dump-config - filter parsing - device-filter parsing - input arity validation
test_nvbench_tooling_deps.py now has a smoke test that builds a temporary package layout matching the wheel mapping: cuda/bench/scripts/nvbench_tooling_deps.py and imports: cuda.bench.scripts.nvbench_tooling_deps That covers the cuda.bench.scripts.* path without requiring a wheel build/install inside this unit test.
Add a shared nvbench_tooling_deps helper that imports packages required by console tools and reports missing packages with an actionable install recipe. Update NVBench scripts to support both direct source-tree execution and installed package execution through the __package__ import pattern. Defer third-party tooling imports until they are needed, including lazy loading for compare tables, device diffs, and plotting paths. Make loaders resilient to partial initialization failures so a later retry can complete any dependency that failed previously. Update tests to cover direct internal use, packaged cuda.bench.scripts imports, dependency-loader error messages, and cheap CLI validation before tooling dependencies are loaded.
require_tooling_dependency() now only translates ModuleNotFoundError when the missing module is the requested top-level package. Other import failures are re-raised unchanged. This helps in situation where third-party dependency is installed but broken for whatever reason. Previously we would intercept it and suggest to run pip install cuda-bench[tools], but that was already done.
round in Python returns nearest even, while in C++ it behaves as x -> floor(x + 0.5)
Factor finite-value checks through a common helper so positive and non-negative finite predicates share the same None/finite guard. Add a comment explaining the positive lower-bound clamp used for mean/stdev intervals, since later ratio and log-distance checks require positive bounds. Also quote the --axis pow2 example in CLI help so shell users can copy the example safely.
Implemented robo-review feedback suggesting to expand coverage of lazy loading helper.
42927d9 to
a240460
Compare
Collaborator
Author
|
@CodeRabbit full review |
This comment was marked as outdated.
This comment was marked as outdated.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
closes #320
closes #393
closes #384
This PR builds on top of #392 (changes to C++ sources in nvbench/ and testing/ folders) and contains only Python changes.