Adds GEMM Profiling Guide to TE#2863
Greptile Summary
This PR adds a GEMM profiling guide to Transformer Engine documentation along with a companion benchmark tool (
Confidence Score: 4/5
Safe to merge after addressing the FP8Delayed pre-quantized documentation mismatch; the benchmark tool itself works correctly.
The benchmark code correctly excludes FP8Delayed when --pre-quantize is passed, but the tutorial note and both pre-quantized example outputs (B300 and H200) show an FP8Delayed ms column that users will not see when they run the documented commands. This directly misleads anyone following the H200 pre-quantized walkthrough. The rest of the benchmark logic, toctree wiring, and plot generation look correct.
docs/examples/gemm_profiling/gemm_profiling.rst: the pre-quantized sections (note at line 120, B300 output at line 143, H200 output at line 291) need to be reconciled with the actual code behaviour, which skips FP8Delayed entirely in pre-quantized mode.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[CLI args] --> B{Mode?}
B -->|model config args| C[run_model_config_benchmarks]
B -->|--shapes or default| D[run_benchmarks]
C --> F[compute_gemm_shapes]
F --> G[_benchmark_single_shape per shape]
G --> H{pre_quantize?}
H -->|False| I[autocast: FP8Current/Delayed/Block/MXFP8/FP4]
H -->|True| J[prequantized: FP8Current/Block/MXFP8/FP4 — FP8Delayed SKIPPED]
I --> K[print summary + speedup table]
J --> K
K --> L[create_model_config_plot]
D --> M{pre_quantize?}
M -->|False| N[autocast benchmarks]
M -->|True| O[prequantized — FP8Delayed SKIPPED]
N --> P[create_plot]
O --> P
Reviews (10): Last reviewed commit: "Regenerate GEMM speedup figures with Del..."
Hi @jomitchellnv, I see that this PR is open, but "Documentation" job is failing. If you fix it, please ping me and I'll review it.
@pggPL they should be fixed now I hope
/te-ci L1 pytorch
Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
| loc="upper right", | ||
| fontsize=8, | ||
| ncol=2, |
--verify-dgrad plot silently uses approximation instead of measured values
When --verify-dgrad is passed, run_model_config_benchmarks benchmarks and records actual Dgrad timings into dgrad_results, and the printed table correctly shows those measured values. However, create_model_config_plot is never given dgrad_results — the call site only passes fprop_results and wgrad_results. Inside the plot function, Fprop+Dgrad bar height is always computed as fp.avg_time_ms * 2 (the approximation), so the chart silently contradicts the table when --verify-dgrad is used.
Fix: add dgrad_results and verify_dgrad parameters to create_model_config_plot, and when verify_dgrad=True, use fprop_ms[j] + dgrad_ms[j] instead of fprop_ms[j] * 2 for each op bar.
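The proposed fix can be sketched as below. This is a minimal illustration, not the code in benchmark_gemm.py: the helper name `fprop_dgrad_bar_heights` and the plain lists of per-op milliseconds are assumptions made for the example. Only the `verify_dgrad` switch and the `fprop_ms[j] + dgrad_ms[j]` versus `fprop_ms[j] * 2` rule come from the comment above.

```python
from typing import List, Optional, Sequence


def fprop_dgrad_bar_heights(
    fprop_ms: Sequence[float],
    dgrad_ms: Optional[Sequence[float]] = None,
    verify_dgrad: bool = False,
) -> List[float]:
    """Return the Fprop+Dgrad bar height (ms) for each op.

    Hypothetical helper: the real plot function may organise its data differently.
    """
    if verify_dgrad:
        if dgrad_ms is None or len(dgrad_ms) != len(fprop_ms):
            raise ValueError("verify_dgrad=True needs one measured Dgrad time per op")
        # Measured path: bar heights match the values printed in the table.
        return [f + d for f, d in zip(fprop_ms, dgrad_ms)]
    # Approximation path: assume Dgrad costs the same as Fprop.
    return [f * 2 for f in fprop_ms]
```

Passing the measured Dgrad list through to the plot keeps the chart and the printed table on the same data source, so the two cannot drift apart.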
pggPL left a comment:
I'm super happy about this change, I think we really need that.
I added some comments.
| mxfp8/mxfp8.rst | ||
| nvfp4/nvfp4.rst No newline at end of file | ||
| nvfp4/nvfp4.rst | ||
| gemm_profiling/gemm_profiling.rst |
In the Features section we want only text with very short code snippets: it should read like an academic-style handbook and stay concise. Tutorials and examples is a better place for code the user should run and for more elaborate text.
I see that describing the real impact of the precisions is very valuable, and a user reading the docs today cannot estimate it. So what I propose is:
- add a short speedups.rst section here with a short introduction, 1-2 graphs/tables, and a link to gemm_profiling.rst with a short description of what the user can find there,
- add gemm_profiling.rst to examples, which would basically be this whole tutorial.
I will elaborate on this in a separate comment, but I think we would need 2 pictures/tables in speedups.rst:
- one with Hopper recipes: bf16, fp8 tensorwise, fp8 blockwise,
- one with Blackwell recipes: bf16, fp8 tensorwise, mxfp8, nvfp4
for some reasonable shape sizes.
And then link to the GEMM profiling example.
|
|
||
| .. code-block:: text | ||
|
|
||
| ========================================================================================== |
FP8 block scaling on Blackwell is emulated via MXFP8 tensor cores, and we support it mostly for backward compatibility. You can see this on the FP8 blockwise docs page. This recipe is aimed at Hopper.
Also, I see that FP8 tensorwise is omitted. Is there a specific reason for that? I think it is still widely used (I think also on Blackwell, but I am not sure).
So I think we should run 2 experiments: 1 for Hopper and 1 for Blackwell, each with the appropriate recipes.
Yes -- will do and update accordingly thanks.
| numbers is the overhead from dynamic quantization, Hadamard transforms, and block | ||
| scaling that occurs in each training step. | ||
|
|
||
| An interesting result: **FP8 Block Scaling beats MXFP8 in raw kernel throughput** |
I think this comment should be removed, since we do not aim to run both of them on the same device: one is for Hopper, one is for Blackwell.
Done, it's gone. We replaced it with a device-specific note explaining that FP8 Block targets Hopper and MXFP8/NVFP4 target Blackwell.
| Wgrad shapes have a different aspect ratio -- the token dimension moves from M to K -- | ||
| so they must always be benchmarked separately. | ||
|
|
||
| By default, the tool approximates Dgrad time as equal to Fprop time (since the FLOP |
This is weird, I think we should run dgrad - even if the number of FLOPs is the same, some tensors are transposed and it may impact the result.
Yea ok I fixed it --> done
| this benchmarks Dgrad shapes separately and reports the actual difference. | ||
|
|
||
|
|
||
| Speedup Is Shape-Dependent |
This is really important section and should not be in appendix imo
| Computes Fprop and Wgrad shapes, benchmarks each across enabled | ||
| precisions, and prints per-layer / full-model speedup estimates. | ||
|
|
||
| When *verify_dgrad* is True, Dgrad shapes are benchmarked separately |
As commented later, I think we should always benchmark it separately.
Is it auto-generated by the script?
Yeah, it's auto-generated.
| quantization and kernel dispatch for these precision modes -- this is how you derive | ||
| the matrix multiplications your model runs and measure where your time goes. | ||
|
|
||
| A companion benchmark tool is provided at ``benchmarks/gemm/benchmark_gemm.py``. |
This should be a link, maybe even to GitHub.
One more thought: we should consider adding MoE (grouped GEMM).
Benchmark tool:
- Always benchmark Dgrad separately (remove --verify-dgrad flag)
- Pass measured Dgrad data to plot instead of 2x Fprop approximation
- Add FP8 CurrentScaling and DelayedScaling benchmark support
- Add FP8Block to shape mode (was missing, only in model-config mode)
- Add --no-fp8-current and --no-fp8-delayed CLI flags

Documentation:
- Restructure: concise speedups.rst in features/, full tutorial in examples/
- Add device-specific precision recipes (Hopper vs Blackwell)
- Add Hopper (H200) benchmark results alongside Blackwell (B300)
- Remove misleading FP8 Block vs MXFP8 comparison (different target devices)
- Rename "How Shapes Are Derived" to appendix, promote key sections
- Convert benchmark tool references to GitHub links
- Refresh all benchmark numbers with FP8 Current/Delayed columns

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>
| print("\nEstimated GEMM Speedups:") | ||
| bf16_total = full_model.get("BF16", 0) | ||
| if run_fp8 and bf16_total > 0: | ||
| fp8_total = full_model.get("MXFP8", 0) | ||
| if fp8_total > 0: | ||
| print(f" MXFP8 vs BF16: {bf16_total / fp8_total:.2f}x") | ||
| if run_fp4 and run_fp8: | ||
| fp8_total = full_model.get("MXFP8", 0) | ||
| fp4_total = full_model.get("NVFP4", 0) | ||
| if fp8_total > 0 and fp4_total > 0: | ||
| print(f" NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x") | ||
| if run_fp4 and bf16_total > 0: | ||
| fp4_total = full_model.get("NVFP4", 0) | ||
| if fp4_total > 0: | ||
| print(f" NVFP4 vs BF16: {bf16_total / fp4_total:.2f}x") | ||
| print(sep) |
The "Estimated GEMM Speedups" block only prints results for MXFP8 and NVFP4. When the tool is run with --no-fp8 --no-fp4 (the documented H200 invocation), run_fp8 and run_fp4 are both False, so all three if guards evaluate to False and the section emits nothing, directly contradicting the tutorial output in docs/examples/gemm_profiling/gemm_profiling.rst, which shows FP8Delayed vs BF16: 1.69x, FP8Current vs BF16: 1.58x, and FP8Block vs BF16: 1.40x for that exact invocation.
| print("\nEstimated GEMM Speedups:") | |
| bf16_total = full_model.get("BF16", 0) | |
| if run_fp8 and bf16_total > 0: | |
| fp8_total = full_model.get("MXFP8", 0) | |
| if fp8_total > 0: | |
| print(f" MXFP8 vs BF16: {bf16_total / fp8_total:.2f}x") | |
| if run_fp4 and run_fp8: | |
| fp8_total = full_model.get("MXFP8", 0) | |
| fp4_total = full_model.get("NVFP4", 0) | |
| if fp8_total > 0 and fp4_total > 0: | |
| print(f" NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x") | |
| if run_fp4 and bf16_total > 0: | |
| fp4_total = full_model.get("NVFP4", 0) | |
| if fp4_total > 0: | |
| print(f" NVFP4 vs BF16: {bf16_total / fp4_total:.2f}x") | |
| print(sep) | |
| print("\nEstimated GEMM Speedups:") | |
| bf16_total = full_model.get("BF16", 0) | |
| if bf16_total > 0: | |
| for p in precisions[1:]: # all enabled precisions except BF16 | |
| p_total = full_model.get(p, 0) | |
| if p_total > 0: | |
| print(f" {p} vs BF16: {bf16_total / p_total:.2f}x") | |
| if run_fp4 and run_fp8: | |
| fp8_total = full_model.get("MXFP8", 0) | |
| fp4_total = full_model.get("NVFP4", 0) | |
| if fp8_total > 0 and fp4_total > 0: | |
| print(f" NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x") | |
| print(sep) |
| :doc:`full tutorial </examples/gemm_profiling/gemm_profiling>` for usage details. | ||
|
|
||
|
|
||
| Recommended Precision Recipes by Device |
| targets Hopper, where it runs natively. | ||
|
|
||
|
|
||
| Speedup Is Shape-Dependent |
I would like to have pictures first because they are the most important
Done. Moved both example sections (B300 and H200) with their figures above the "Speedup Is Shape-Dependent" text section, so readers see the benchmark plots first.
| which has K=N=hidden_size and no expansion) may see little to no benefit from lower | ||
| precision, because the GEMM is too small for the faster kernel to outrun the | ||
| quantization cost. | ||
| - **Batch size and sequence length also matter** -- they determine M, the token |
I think this is similar to the "Models with large hidden dimensions and intermediate sizes" bullet.
Done. Merged the separate "Batch size and sequence length also matter" bullet into the first bullet, which now reads "Models with large hidden dimensions, intermediate sizes, and token counts (micro_batch_size * sequence_length)...", eliminating the redundancy.
| For a 5B-parameter model (hidden=4096, intermediate=16384, 24 layers), MXFP8 delivers | ||
| ~1.42x and NVFP4 delivers ~1.98x over BF16 in autocast mode. FP8 DelayedScaling | ||
| reaches 1.64x, outperforming both FP8 CurrentScaling (1.39x) and MXFP8 on Blackwell. | ||
| In pre-quantized mode (raw kernel throughput), NVFP4 reaches 3.48x -- the gap is |
It's worth mentioning that the speedup could be larger and is limited by quantization, but pre-quantized mode has not been introduced at this point in the text.
Moved the pre-quantized reference out of the body text and into a note callout that explains the concept (raw kernel throughput, --pre-quantize flag). The Sphinx tabs above also show the pre-quantized graph, so the reader sees the concept visually before the text discusses it.
| FP8 CurrentScaling, FP8 DelayedScaling, and FP8 Block Scaling. FP8 DelayedScaling | ||
| delivers ~1.69x over BF16, followed by FP8 CurrentScaling at ~1.58x and FP8 Block | ||
| Scaling at ~1.40x. FP8 Block Scaling runs natively on Hopper and is the only | ||
| block-scaled FP8 recipe available on this device. In pre-quantized mode (raw kernel |
Maybe also insert the graph with pre-quantization and expose both via Sphinx tabs.
I think this is an important point for people to understand: the speedup of NVFP4 is big, but quantization can diminish it.
Done. Both B300 and H200 sections now use .. tabs:: with Autocast and Pre-quantized tabs showing the corresponding graphs.
Added a .. note:: callout under the B300 example that highlights this explicitly: it shows the 1.98x autocast vs 3.48x kernel-only gap and explains what causes it.
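To put a rough number on that gap, the sketch below estimates what share of autocast GEMM time is not spent in the low-precision kernel. It is an illustration under two assumptions that the thread does not confirm: both speedups are measured against the same BF16 baseline, and the whole difference is quantization-related overhead. The function name is hypothetical.

```python
def quantization_overhead_fraction(autocast_speedup: float, kernel_speedup: float) -> float:
    """Estimate the share of autocast time spent outside the GEMM kernel.

    With BF16 time T, autocast takes T / autocast_speedup and the kernel alone
    takes T / kernel_speedup, so the overhead share of autocast time is
    1 - autocast_speedup / kernel_speedup. Assumes a shared BF16 baseline.
    """
    if autocast_speedup <= 0 or kernel_speedup <= 0:
        raise ValueError("speedups must be positive")
    return 1.0 - autocast_speedup / kernel_speedup


# NVFP4 on B300, using the two speedups quoted in the thread.
fraction = quantization_overhead_fraction(1.98, 3.48)
print(f"Estimated overhead share of autocast time: {fraction:.1%}")  # prints 43.1%
```

Under those assumptions, a little over 40% of the NVFP4 autocast time in this configuration would be attributable to per-step quantization work and not to the matrix multiply itself.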
| flag is mutually exclusive with model config arguments. | ||
|
|
||
|
|
||
| What Precision Does Each GEMM Run At? |
I think we can remove this section.
| at the same precision in both configs. | ||
|
|
||
|
|
||
| Understanding the Speedup Calculation |
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>
| print("\nEstimated GEMM Speedups:") | ||
| bf16_total = full_model.get("BF16", 0) | ||
| if run_fp8 and bf16_total > 0: | ||
| fp8_total = full_model.get("MXFP8", 0) | ||
| if fp8_total > 0: | ||
| print(f" MXFP8 vs BF16: {bf16_total / fp8_total:.2f}x") | ||
| if run_fp4 and run_fp8: | ||
| fp8_total = full_model.get("MXFP8", 0) | ||
| fp4_total = full_model.get("NVFP4", 0) | ||
| if fp8_total > 0 and fp4_total > 0: | ||
| print(f" NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x") | ||
| if run_fp4 and bf16_total > 0: | ||
| fp4_total = full_model.get("NVFP4", 0) | ||
| if fp4_total > 0: | ||
| print(f" NVFP4 vs BF16: {bf16_total / fp4_total:.2f}x") | ||
| print(sep) |
"Estimated GEMM Speedups" never prints FP8Current/FP8Delayed/FP8Block entries
The three if guards only check run_fp8 (MXFP8) and run_fp4 (NVFP4). There are no corresponding branches for run_fp8_current, run_fp8_delayed, or run_fp8_block. On H200 — the documented use case that passes --no-fp8 --no-fp4 — both flags are False, so all three guards fail and the section emits nothing but its header. This directly contradicts the tutorial in docs/examples/gemm_profiling/gemm_profiling.rst (lines 259–263), which shows FP8Delayed vs BF16: 1.69x, FP8Current vs BF16: 1.58x, and FP8Block vs BF16: 1.40x for exactly that invocation.
Fixed — the hardcoded MXFP8/NVFP4 speedup branches are replaced with a loop over all active precisions, so it now prints {prec} vs BF16 for every enabled precision. Works correctly on Hopper with --no-fp8 --no-fp4
| .. code-block:: text | ||
|
|
||
| GEMM Benchmark (Model Config Mode) on NVIDIA B300 SXM6 AC | ||
| Timing method: CUDA events | ||
| Warmup iterations: 10, Timed iterations: 100 | ||
| Mode: Autocast (includes quantization overhead) | ||
|
|
||
| ========================================================================================== | ||
| Model Config: hidden=4096, intermediate=16384, heads=32, layers=24 | ||
| Tokens per step: M = 31 x 512 = 15,872 | ||
| ========================================================================================== | ||
|
|
||
| Fprop Shapes: | ||
| ------------------------------------------------------------------------------------------ | ||
| Op Shape BF16 ms FP8Current ms FP8Delayed ms MXFP8 ms NVFP4 ms | ||
| ------------------------------------------------------------------------------------------ | ||
| QKV Proj 15872x4096x12288 1.071 0.605 0.503 0.579 0.392 | ||
| Attn Out 15872x4096x4096 0.307 0.317 0.231 0.269 0.256 | ||
| MLP Up 15872x4096x16384 1.393 0.924 0.850 0.924 0.635 | ||
| MLP Down 15872x16384x4096 1.426 1.033 0.901 1.076 0.649 | ||
| ------------------------------------------------------------------------------------------ | ||
| Fprop sum (ms): 4.196 2.879 2.486 2.847 1.932 | ||
|
|
||
| ========================================================================================== | ||
| Per-Layer GEMM Time: | ||
| BF16 ms FP8Current ms FP8Delayed ms MXFP8 ms NVFP4 ms | ||
| Fprop: 4.196 2.879 2.486 2.847 1.932 | ||
| Dgrad: 4.290 3.063 2.621 3.045 2.189 | ||
| Fprop + Dgrad: 8.486 5.941 5.107 5.892 4.122 | ||
| Wgrad: 4.272 3.205 2.695 3.092 2.331 | ||
| Per-layer total: 12.758 9.147 7.802 8.984 6.453 | ||
|
|
||
| Full Model (24 layers): | ||
| Total GEMM time (ms): 306.192 219.522 187.246 215.608 154.869 | ||
|
|
||
| Estimated GEMM Speedups: | ||
| MXFP8 vs BF16: 1.42x | ||
| NVFP4 vs MXFP8: 1.39x | ||
| NVFP4 vs BF16: 1.98x | ||
| ========================================================================================== |
B300 example output is stale and contradicts the current code
The documented B300 Estimated GEMM Speedups block (lines 97–100) shows only MXFP8 vs BF16, plus cross-precision lines like NVFP4 vs MXFP8: 1.39x that the current code never produces. The actual code at line 1483–1489 of benchmark_gemm.py iterates precisions[1:] and emits each precision vs BF16 only — no cross-precision pairs. Running the command as documented (no --no-fp8-current or --no-fp8-delayed flags) on a B300 would also print FP8Current vs BF16 and FP8Delayed vs BF16, which are absent from the docs. The Fprop table further omits the FP8Block column even though the shown command does not pass --no-fp8-block. This means users following the tutorial will see materially different output than documented.
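As a cross-check, the per-precision loop can be replayed on the full-model totals quoted in the B300 output above. This is a standalone sketch, not the code in benchmark_gemm.py: `speedup_lines` is a hypothetical helper, and the dictionary holds only the five columns that appear in the quoted output (so FP8Block is absent here, as it is in the documented table).

```python
from typing import Dict, List

# Full-model GEMM totals (ms) copied from the quoted B300 autocast output.
full_model: Dict[str, float] = {
    "BF16": 306.192,
    "FP8Current": 219.522,
    "FP8Delayed": 187.246,
    "MXFP8": 215.608,
    "NVFP4": 154.869,
}


def speedup_lines(totals: Dict[str, float], baseline: str = "BF16") -> List[str]:
    """Format one '<precision> vs <baseline>' line per non-baseline precision."""
    base = totals.get(baseline, 0)
    if base <= 0:
        return []
    lines = []
    for name, total in totals.items():
        if name != baseline and total > 0:
            lines.append(f"  {name} vs {baseline}: {base / total:.2f}x")
    return lines


for line in speedup_lines(full_model):
    print(line)
# Prints:
#   FP8Current vs BF16: 1.39x
#   FP8Delayed vs BF16: 1.64x
#   MXFP8 vs BF16: 1.42x
#   NVFP4 vs BF16: 1.98x
```

The result has four "vs BF16" lines and no NVFP4 vs MXFP8 line, which is the mismatch with the documented block that this comment points out.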
pggPL left a comment:
I'm ok with the tutorial, but the speedups part needs polishing
|
|
||
| .. tabs:: | ||
|
|
||
| .. tab:: Autocast |
Autocast/pre-quantized is not defined here, so the user does not know what it means.
Good catch. Added a definition block at the top of the page (before any figures) defining both: Autocast = the end-to-end speedup seen in training, including per-step quantization; Pre-quantized = raw GEMM kernel throughput with inputs already in the target format (the hardware upper bound).
|
|
||
| .. tab:: Pre-quantized | ||
|
|
||
| .. figure:: gemm_profiling/img/b300_model_config_speedup_prequant.png |
No delayed scaling in H200 pre-quantized.
You're right, and this was actually a bug. In pre-quantized mode FP8Delayed had no pre-quantized variant and silently fell back to the autocast te.Linear path, so it shouldn't have appeared at all. Fixed benchmark_gemm.py to skip DelayedScaling under --pre-quantize (it differs from CurrentScaling only in how the scale is computed each step, which pre-quantized mode skips). Regenerating the pre-quantized plots now so the FP8Delayed bar is gone.
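The selection rule described in this reply can be sketched as follows. It is an assumption-laden illustration, not the tool's code: the function name `enabled_precisions` is hypothetical, the `run_*` flag names follow those mentioned elsewhere in this review, and any device-specific exclusions the real tool applies are not modelled.

```python
from typing import List


def enabled_precisions(
    pre_quantize: bool,
    run_fp8_current: bool = True,
    run_fp8_delayed: bool = True,
    run_fp8_block: bool = True,
    run_fp8: bool = True,   # MXFP8
    run_fp4: bool = True,   # NVFP4
) -> List[str]:
    """Return the precisions to benchmark, with BF16 always first."""
    candidates = [
        ("FP8Current", run_fp8_current),
        # DelayedScaling has no pre-quantized variant, so skip it in that mode
        # rather than silently falling back to the autocast path.
        ("FP8Delayed", run_fp8_delayed and not pre_quantize),
        ("FP8Block", run_fp8_block),
        ("MXFP8", run_fp8),
        ("NVFP4", run_fp4),
    ]
    return ["BF16"] + [name for name, enabled in candidates if enabled]


print(enabled_precisions(pre_quantize=False))
print(enabled_precisions(pre_quantize=True))
```

Filtering the list once, before any benchmarking or plotting, means the table and the chart both iterate over the same set of precisions and an unsupported mode cannot leak into either.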
| **Quantization overhead matters.** In pre-quantized mode (raw kernel throughput), | ||
| NVFP4 reaches 3.48x over BF16 -- nearly double the 1.98x seen in autocast mode. | ||
| The gap is the cost of dynamic quantization, Hadamard transforms, and block scaling | ||
| that occurs each training step. Use the ``--pre-quantize`` flag to see the kernel |
Mentioning the --pre-quantize flag when the script has not been introduced does not make sense. We should explain the difference between pre-quantized and autocast above and remove it.
Done. Removed the flag reference and the standalone note, and moved the autocast vs pre-quantized explanation to the top of the page as you suggested.
|
|
||
| **The speedup from lower-precision GEMMs depends directly on the matrix dimensions, | ||
| which are determined by your model config.** Larger matrices amortize the fixed | ||
| overhead of quantization (format conversion, block scaling, Hadamard transforms) |
"(format conversion, block scaling, Hadamard transforms)": what do format conversion and block scaling mean in this context?
It seems that Claude generated it without understanding.
Removed the unclear jargon. Replaced it with plain language: the per-step quantization cost is "converting the input tensors to the low-precision format and computing their scaling factors."
| run to see how each choice affects low-precision gains. | ||
|
|
||
| See the :doc:`full tutorial </examples/gemm_profiling/gemm_profiling>` for detailed | ||
| analysis on both Blackwell and Hopper, including Fprop vs Dgrad comparisons, autocast |
Similar: what is "including Fprop vs Dgrad comparisons"?
I reworded it, but it was a comparison that I added. I now have per-operation Fprop, Dgrad, and Wgrad breakdowns.
| Speedup Is Shape-Dependent | ||
| ---------------------------- | ||
|
|
||
| **The speedup from lower-precision GEMMs depends directly on the matrix dimensions, |
This section seems to be super verbose: it basically says "the bigger the shape, the bigger the speedup" in 15 sentences.
I trimmed it down a lot.
I need to grab an H200 and a B300 node to regenerate the plots, then it should be OK.
- Define autocast vs pre-quantized modes upfront before the figures
- Remove the --pre-quantize flag reference and the standalone note
- Replace unclear quantization-overhead jargon with plain language
- Condense the verbose "Speedup Is Shape-Dependent" section
- Reword "Fprop vs Dgrad comparisons" to per-operation breakdowns
- Fix benchmark_gemm.py: skip FP8 DelayedScaling in pre-quantized mode (it has no pre-quantized variant and silently fell back to the autocast path, producing a misleading bar in the pre-quantized plots)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
Re-ran the model-config benchmark on B300 (SM100) and H200 (SM90) with the pre-quantized DelayedScaling fix applied, and synced the numbers in speedups.rst:
- B300 autocast: now includes FP8Block (1.30x); FP8Current 1.41x, FP8Delayed 1.61x, MXFP8 1.44x, NVFP4 2.03x
- B300 pre-quantized: FP8Delayed bar removed, FP8Block (1.82x) added; NVFP4 3.55x
- H200 autocast: FP8Current 1.57x, FP8Delayed 1.69x, FP8Block 1.41x
- H200 pre-quantized: FP8Delayed removed; FP8Block dropped (no Hopper prequant support); raw FP8 1.92x

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
f8ebbd2 to
36cc7fa
Compare
Description
Adds a GEMM profiling guide to the Transformer Engine documentation and a companion benchmark tool. The guide
explains how to derive all 12 per-layer GEMM shapes (Fprop, Dgrad, Wgrad) from transformer model
hyperparameters, benchmark them across precisions (BF16, FP8 Block, MXFP8, NVFP4), and interpret the resulting
speedup estimates.
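As a rough illustration of the shape derivation, the sketch below produces 12 shapes (4 ops x 3 passes) from model hyperparameters. It is not the tool's `compute_gemm_shapes`: the function name `derive_gemm_shapes`, the assumption that QKV is a fused projection with N = 3 * hidden_size (no grouped-query attention), and the exact (M, K, N) ordering chosen for Dgrad and Wgrad are assumptions made for this example. The Fprop shapes and the token count match the B300 table quoted in the review thread.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int, int]  # (M, K, N) for C[M, N] = A[M, K] @ B[K, N]


def derive_gemm_shapes(
    hidden_size: int,
    intermediate_size: int,
    micro_batch_size: int,
    sequence_length: int,
) -> Dict[str, List[Tuple[str, Shape]]]:
    """Derive per-layer Fprop, Dgrad, and Wgrad GEMM shapes (hypothetical helper)."""
    tokens = micro_batch_size * sequence_length  # M in the forward pass
    fprop = [
        ("QKV Proj", (tokens, hidden_size, 3 * hidden_size)),  # assumes fused QKV, no GQA
        ("Attn Out", (tokens, hidden_size, hidden_size)),
        ("MLP Up", (tokens, hidden_size, intermediate_size)),
        ("MLP Down", (tokens, intermediate_size, hidden_size)),
    ]
    # Dgrad: dX[M, K] = dY[M, N] @ W^T, so the reduction runs over the forward N.
    dgrad = [(op, (m, n, k)) for op, (m, k, n) in fprop]
    # Wgrad: dW[K, N] = X^T[K, M] @ dY[M, N], so the token count becomes the reduction dim.
    wgrad = [(op, (k, m, n)) for op, (m, k, n) in fprop]
    return {"Fprop": fprop, "Dgrad": dgrad, "Wgrad": wgrad}


def gemm_flops(shape: Shape) -> int:
    """Multiply-add FLOPs for one GEMM: 2 * M * K * N."""
    m, k, n = shape
    return 2 * m * k * n


shapes = derive_gemm_shapes(4096, 16384, micro_batch_size=31, sequence_length=512)
for pass_name, ops in shapes.items():
    for op, (m, k, n) in ops:
        print(f"{pass_name:6s} {op:9s} {m}x{k}x{n}")
```

All three passes of a given op have the same FLOP count under this model, but the operand layouts differ, which is why the review asks for Dgrad and Wgrad to be timed separately.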
The benchmark tool supports two modes: model config mode (derives shapes automatically from hidden_size,
intermediate_size, etc.) and manual shape mode (explicit MxKxN triplets). It measures both autocast performance
(realistic end-to-end with quantization overhead) and pre-quantized kernel-only throughput, using CUDA events
or torch.profiler timing backends.
Changes
- Add benchmarks/gemm/benchmark_gemm.py: standalone GEMM benchmark tool supporting BF16, FP8 Block, MXFP8, and NVFP4 precisions with autocast and pre-quantized modes, CUDA event and torch.profiler timing, Nsight Systems integration, and bar-chart output
- Add docs/features/low_precision_training/gemm_profiling/gemm_profiling.rst: documentation covering GEMM shape derivation from model configs, forward/backward pass shape conventions, precision mapping per GEMM pass, speedup calculation methodology, and a worked example on B300
- Add benchmark result plots (img/model_config_speedup.png, img/model_config_speedup_prequant.png)
- Update docs/features/low_precision_training/index.rst toctree to include the new guide