Adds GEMM Profiling Guide to TE #2863

Open
jomitchellnv wants to merge 7 commits into NVIDIA:main from jomitchellnv:jm/gemm-blog

@jomitchellnv
Contributor

Description

Adds a GEMM profiling guide to the Transformer Engine documentation and a companion benchmark tool. The guide
explains how to derive all 12 per-layer GEMM shapes (Fprop, Dgrad, Wgrad) from transformer model
hyperparameters, benchmark them across precisions (BF16, FP8 Block, MXFP8, NVFP4), and interpret the resulting
speedup estimates.

The benchmark tool supports two modes: model config mode (derives shapes automatically from hidden_size,
intermediate_size, etc.) and manual shape mode (explicit MxKxN triplets). It measures both autocast performance
(realistic end-to-end with quantization overhead) and pre-quantized kernel-only throughput, using CUDA events
or torch.profiler timing backends.
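
For illustration, here is a minimal sketch of how the 12 per-layer shapes (4 ops x 3 passes) could be derived from the model config. The function name `derive_gemm_shapes`, the non-grouped QKV width of `3 * hidden_size`, and the Dgrad/Wgrad dimension ordering are assumptions made for this sketch; the tool's own `compute_gemm_shapes` may differ. The Fprop shapes agree with the B300 example in the tutorial (M = 31 x 512 = 15,872).

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int, int]  # (M, K, N)


def derive_gemm_shapes(
    hidden_size: int,
    intermediate_size: int,
    micro_batch_size: int,
    sequence_length: int,
) -> Dict[str, List[Tuple[str, Shape]]]:
    """Hypothetical helper: derive Fprop/Dgrad/Wgrad shapes for one layer."""
    tokens = micro_batch_size * sequence_length  # M, the token dimension

    # (op name, in_features K, out_features N); assumes plain multi-head
    # attention, so the fused QKV projection is 3 * hidden_size wide.
    linear_layers = [
        ("QKV Proj", hidden_size, 3 * hidden_size),
        ("Attn Out", hidden_size, hidden_size),
        ("MLP Up", hidden_size, intermediate_size),
        ("MLP Down", intermediate_size, hidden_size),
    ]

    shapes: Dict[str, List[Tuple[str, Shape]]] = {"fprop": [], "dgrad": [], "wgrad": []}
    for name, k, n in linear_layers:
        shapes["fprop"].append((name, (tokens, k, n)))  # Y  = X @ W
        shapes["dgrad"].append((name, (tokens, n, k)))  # dX = dY @ W^T
        shapes["wgrad"].append((name, (k, tokens, n)))  # dW = X^T @ dY (tokens move to K)
    return shapes


if __name__ == "__main__":
    for pass_name, ops in derive_gemm_shapes(4096, 16384, 31, 512).items():
        for op, (m, k, n) in ops:
            print(f"{pass_name:5s} {op:9s} {m}x{k}x{n}")
```

Keeping all three passes in one table makes it easy to see that Dgrad swaps K and N while Wgrad moves the token dimension into K.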

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add benchmarks/gemm/benchmark_gemm.py — standalone GEMM benchmark tool supporting BF16, FP8 Block, MXFP8, and
    NVFP4 precisions with autocast and pre-quantized modes, CUDA event and torch.profiler timing, Nsight Systems
    integration, and bar-chart output

  • Add docs/features/low_precision_training/gemm_profiling/gemm_profiling.rst — documentation covering GEMM
    shape derivation from model configs, forward/backward pass shape conventions, precision mapping per GEMM pass,
    speedup calculation methodology, and a worked example on B300

  • Add benchmark result plots (img/model_config_speedup.png, img/model_config_speedup_prequant.png)

  • Update docs/features/low_precision_training/index.rst toctree to include the new guide

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@jomitchellnv jomitchellnv changed the title from "adds blog post" to "Adds GEMM Profiling Guide to TE" on Apr 9, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 9, 2026

Greptile Summary

This PR adds a GEMM profiling guide to Transformer Engine documentation along with a companion benchmark tool (benchmarks/gemm/benchmark_gemm.py). It covers shape derivation for all 12 per-layer GEMMs, benchmarking across BF16/FP8/MXFP8/NVFP4 precisions, and interpretation of speedup results.

  • benchmark_gemm.py: 1883-line standalone tool supporting model-config and manual-shape modes, two timing backends (CUDA events and torch.profiler), autocast and pre-quantized measurement modes, Nsight Systems integration, and bar-chart output.
  • gemm_profiling.rst: 589-line tutorial with worked examples on B300 and H200; contains a documentation inaccuracy where the pre-quantized examples show an FP8Delayed ms column that the code does not produce.
  • speedups.rst + toctree updates: New summary page added to the low-precision-training section; docs/index.rst correctly adds the tutorial to the toctree.

Confidence Score: 4/5

Safe to merge after addressing the FP8Delayed pre-quantized documentation mismatch; the benchmark tool itself works correctly.

The benchmark code correctly excludes FP8Delayed when --pre-quantize is passed, but the tutorial note and both pre-quantized example outputs (B300 and H200) show a FP8Delayed ms column that users will not see when they run the documented commands. This directly misleads anyone following the H200 pre-quantized walkthrough. The rest of the benchmark logic, toctree wiring, and plot generation look correct.

docs/examples/gemm_profiling/gemm_profiling.rst — the pre-quantized sections (note at line 120, B300 output at line 143, H200 output at line 291) need to be reconciled with the actual code behaviour that skips FP8Delayed entirely in pre-quantized mode.

Important Files Changed

Filename Overview
benchmarks/gemm/benchmark_gemm.py New 1883-line GEMM benchmark tool supporting BF16, FP8 (Current/Delayed/Block), MXFP8, NVFP4; FP8Delayed is correctly excluded from pre-quantized runs but the docs misrepresent this.
docs/examples/gemm_profiling/gemm_profiling.rst New 589-line tutorial RST. Pre-quantized B300/H200 examples claim FP8Delayed is included, contradicting the code. H200 speedup ordering also mismatches code iteration order.
docs/features/low_precision_training/speedups.rst New summary page covering autocast vs pre-quantized results for B300 and H200 with benchmark plots; no logic issues.
docs/features/low_precision_training/index.rst Adds speedups.rst to the low_precision_training toctree; straightforward change.
docs/index.rst Adds examples/gemm_profiling/gemm_profiling.rst to the top-level toctree; resolves the prior missing-toctree issue.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CLI args] --> B{Mode?}
    B -->|model config args| C[run_model_config_benchmarks]
    B -->|--shapes or default| D[run_benchmarks]
    C --> F[compute_gemm_shapes]
    F --> G[_benchmark_single_shape per shape]
    G --> H{pre_quantize?}
    H -->|False| I[autocast: FP8Current/Delayed/Block/MXFP8/FP4]
    H -->|True| J[prequantized: FP8Current/Block/MXFP8/FP4 — FP8Delayed SKIPPED]
    I --> K[print summary + speedup table]
    J --> K
    K --> L[create_model_config_plot]
    D --> M{pre_quantize?}
    M -->|False| N[autocast benchmarks]
    M -->|True| O[prequantized — FP8Delayed SKIPPED]
    N --> P[create_plot]
    O --> P

Reviews (10): Last reviewed commit: "Regenerate GEMM speedup figures with Del..."

Comment thread benchmarks/gemm/benchmark_gemm.py Outdated
Comment thread benchmarks/gemm/benchmark_gemm.py
Comment thread benchmarks/gemm/benchmark_gemm.py
@pggPL pggPL self-requested a review April 10, 2026 14:00
@pggPL
Collaborator

pggPL commented Apr 13, 2026

Hi @jomitchellnv, I see that this PR is open, but "Documentation" job is failing. If you fix it, please ping me and I'll review it.

@jomitchellnv
Contributor Author

@pggPL they should be fixed now I hope

@jomitchellnv
Contributor Author

/te-ci L1 pytorch

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Comment on lines +1405 to +1407
loc="upper right",
fontsize=8,
ncol=2,
Contributor

P1 --verify-dgrad plot silently uses approximation instead of measured values

When --verify-dgrad is passed, run_model_config_benchmarks benchmarks and records actual Dgrad timings into dgrad_results, and the printed table correctly shows those measured values. However, create_model_config_plot is never given dgrad_results — the call site only passes fprop_results and wgrad_results. Inside the plot function, Fprop+Dgrad bar height is always computed as fp.avg_time_ms * 2 (the approximation), so the chart silently contradicts the table when --verify-dgrad is used.

Fix: add dgrad_results and verify_dgrad parameters to create_model_config_plot, and when verify_dgrad=True, use fprop_ms[j] + dgrad_ms[j] instead of fprop_ms[j] * 2 for each op bar.
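
A minimal sketch of the bar-height logic this fix describes. The helper name `fprop_dgrad_bar_heights` is hypothetical and is not taken from benchmark_gemm.py; it only illustrates falling back to the 2x Fprop approximation when no measured Dgrad timings are supplied.

```python
from typing import List, Optional


def fprop_dgrad_bar_heights(
    fprop_ms: List[float],
    dgrad_ms: Optional[List[float]] = None,
) -> List[float]:
    """Hypothetical helper: height of each "Fprop + Dgrad" bar, per op."""
    if dgrad_ms is None:
        # No measured Dgrad timings: approximate Dgrad as equal to Fprop.
        return [2.0 * f for f in fprop_ms]
    if len(dgrad_ms) != len(fprop_ms):
        raise ValueError("fprop_ms and dgrad_ms must have one entry per op")
    # Measured Dgrad timings available: use them so the plot matches the table.
    return [f + d for f, d in zip(fprop_ms, dgrad_ms)]


if __name__ == "__main__":
    print(fprop_dgrad_bar_heights([1.0, 2.0]))               # approximation
    print(fprop_dgrad_bar_heights([1.0, 2.0], [1.5, 2.5]))   # measured
```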

Collaborator

@pggPL pggPL left a comment

I'm super happy about this change, I think we really need that.

I added some comments.

mxfp8/mxfp8.rst
nvfp4/nvfp4.rst No newline at end of file
nvfp4/nvfp4.rst
gemm_profiling/gemm_profiling.rst
Collaborator

In the Features section we want only text with very short code snippets: one should be able to read it like an academic handbook, and we want it to be concise. The Tutorials and examples section is a better place for code the user should run and for more elaborate text.

I see that describing the real impact of the precisions is very valuable, and a user reading the docs today cannot estimate it. So what I propose is:

  • add a short speedups.rst section here with a short introduction, 1-2 graphs/tables, a link to gemm_profiling.rst, and a short description of what the user can find there,
  • add gemm_profiling.rst to examples, which would basically be this whole tutorial.

I will elaborate on this in a separate comment, but I think we need 2 pictures/tables in speedups.rst, each for some reasonable shape sizes:

  • one with Hopper recipes: bf16, fp8 tensorwise, fp8 blockwise,
  • one with Blackwell recipes: bf16, fp8 tensorwise, mxfp8, nvfp4.

And then link to the GEMM profiling example.


.. code-block:: text

==========================================================================================
Collaborator

FP8 block scaling on Blackwell is emulated via MXFP8 tensor cores, and we support it mostly for backward compatibility. You can see this on the FP8 blockwise docs page. This recipe is aimed at Hopper.

Also, I see that FP8 tensorwise is omitted. Is there a specific reason for that? I think it is still widely used (I think on Blackwell too, but I am not sure).

So I think we should run 2 experiments: one for Hopper and one for Blackwell, each with the appropriate recipes.

Contributor Author

Yes -- will do and update accordingly thanks.

Comment thread benchmarks/gemm/benchmark_gemm.py
numbers is the overhead from dynamic quantization, Hadamard transforms, and block
scaling that occurs in each training step.

An interesting result: **FP8 Block Scaling beats MXFP8 in raw kernel throughput**
Collaborator

I think this comment should be removed, since we do not aim to run both of them on the same device: one is for Hopper, one is for Blackwell.

Contributor Author

Done, it's gone. We replaced it with a device-specific note explaining that FP8 Block targets Hopper and MXFP8/NVFP4 target Blackwell.

Comment thread docs/examples/gemm_profiling/gemm_profiling.rst
Wgrad shapes have a different aspect ratio -- the token dimension moves from M to K --
so they must always be benchmarked separately.

By default, the tool approximates Dgrad time as equal to Fprop time (since the FLOP
Collaborator

This is weird, I think we should run dgrad - even if the number of FLOPs is the same, some tensors are transposed and it may impact the result.

Contributor Author

Yea ok I fixed it --> done

this benchmarks Dgrad shapes separately and reports the actual difference.


Speedup Is Shape-Dependent
Collaborator

This is really important section and should not be in appendix imo

Contributor Author

ok for sure

Comment thread benchmarks/gemm/benchmark_gemm.py Outdated
Computes Fprop and Wgrad shapes, benchmarks each across enabled
precisions, and prints per-layer / full-model speedup estimates.

When *verify_dgrad* is True, Dgrad shapes are benchmarked separately
Collaborator

as commented later, I think we should always benchmark it separately

Contributor Author

done

Collaborator

is it auto generated by the script?

Contributor Author

Yeah, it's auto-generated.

quantization and kernel dispatch for these precision modes -- this is how you derive
the matrix multiplications your model runs and measure where your time goes.

A companion benchmark tool is provided at ``benchmarks/gemm/benchmark_gemm.py``.
Collaborator

This should be a link, maybe even to GitHub.

@pggPL
Collaborator

pggPL commented May 4, 2026

One more thought - we should consider adding MoE - grouped gemm.

Jonathan Mitchell and others added 2 commits May 4, 2026 13:23
  Benchmark tool:
  - Always benchmark Dgrad separately (remove --verify-dgrad flag)
  - Pass measured Dgrad data to plot instead of 2x Fprop approximation
  - Add FP8 CurrentScaling and DelayedScaling benchmark support
  - Add FP8Block to shape mode (was missing, only in model-config mode)
  - Add --no-fp8-current and --no-fp8-delayed CLI flags

  Documentation:
  - Restructure: concise speedups.rst in features/, full tutorial in examples/
  - Add device-specific precision recipes (Hopper vs Blackwell)
  - Add Hopper (H200) benchmark results alongside Blackwell (B300)
  - Remove misleading FP8 Block vs MXFP8 comparison (different target devices)
  - Rename "How Shapes Are Derived" to appendix, promote key sections
  - Convert benchmark tool references to GitHub links
  - Refresh all benchmark numbers with FP8 Current/Delayed columns

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>
Comment on lines +1483 to +1498
print("\nEstimated GEMM Speedups:")
bf16_total = full_model.get("BF16", 0)
if run_fp8 and bf16_total > 0:
    fp8_total = full_model.get("MXFP8", 0)
    if fp8_total > 0:
        print(f" MXFP8 vs BF16: {bf16_total / fp8_total:.2f}x")
if run_fp4 and run_fp8:
    fp8_total = full_model.get("MXFP8", 0)
    fp4_total = full_model.get("NVFP4", 0)
    if fp8_total > 0 and fp4_total > 0:
        print(f" NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x")
if run_fp4 and bf16_total > 0:
    fp4_total = full_model.get("NVFP4", 0)
    if fp4_total > 0:
        print(f" NVFP4 vs BF16: {bf16_total / fp4_total:.2f}x")
print(sep)
Contributor

P1 The "Estimated GEMM Speedups" block only prints results for MXFP8 and NVFP4. When the tool is run with --no-fp8 --no-fp4 (the documented H200 invocation), run_fp8 and run_fp4 are both False, so all three if guards evaluate to False and the section emits nothing — directly contradicting the tutorial output in docs/examples/gemm_profiling/gemm_profiling.rst which shows FP8Delayed vs BF16: 1.69x, FP8Current vs BF16: 1.58x, and FP8Block vs BF16: 1.40x for that exact invocation.

Suggested change
- print("\nEstimated GEMM Speedups:")
- bf16_total = full_model.get("BF16", 0)
- if run_fp8 and bf16_total > 0:
-     fp8_total = full_model.get("MXFP8", 0)
-     if fp8_total > 0:
-         print(f" MXFP8 vs BF16: {bf16_total / fp8_total:.2f}x")
- if run_fp4 and run_fp8:
-     fp8_total = full_model.get("MXFP8", 0)
-     fp4_total = full_model.get("NVFP4", 0)
-     if fp8_total > 0 and fp4_total > 0:
-         print(f" NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x")
- if run_fp4 and bf16_total > 0:
-     fp4_total = full_model.get("NVFP4", 0)
-     if fp4_total > 0:
-         print(f" NVFP4 vs BF16: {bf16_total / fp4_total:.2f}x")
- print(sep)
+ print("\nEstimated GEMM Speedups:")
+ bf16_total = full_model.get("BF16", 0)
+ if bf16_total > 0:
+     for p in precisions[1:]:  # all enabled precisions except BF16
+         p_total = full_model.get(p, 0)
+         if p_total > 0:
+             print(f" {p} vs BF16: {bf16_total / p_total:.2f}x")
+ if run_fp4 and run_fp8:
+     fp8_total = full_model.get("MXFP8", 0)
+     fp4_total = full_model.get("NVFP4", 0)
+     if fp8_total > 0 and fp4_total > 0:
+         print(f" NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x")
+ print(sep)

Collaborator

@pggPL pggPL left a comment

I left some comments

:doc:`full tutorial </examples/gemm_profiling/gemm_profiling>` for usage details.


Recommended Precision Recipes by Device
Collaborator

This can be removed

Contributor Author

done

targets Hopper, where it runs natively.


Speedup Is Shape-Dependent
Collaborator

I would like to have pictures first because they are the most important

Contributor Author

Done. Moved both example sections (B300 and H200) with their figures above the "Speedup Is Shape-Dependent" text section, so readers see the benchmark plots first.

which has K=N=hidden_size and no expansion) may see little to no benefit from lower
precision, because the GEMM is too small for the faster kernel to outrun the
quantization cost.
- **Batch size and sequence length also matter** -- they determine M, the token
Collaborator

I think this is similar to Models with large hidden dimensions and intermediate sizes

Contributor Author

Done. Merged the separate "Batch size and sequence length also matter" bullet into the first bullet — it now
reads "Models with large hidden dimensions, intermediate sizes, and token counts (micro_batch_size *
sequence_length)..." — eliminating the redundancy.
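
To make the shape dependence concrete, here is a small sketch that computes per-operation Fprop speedups. The timings are copied from the B300 autocast example output quoted later in this conversation (which a later review comment flags as stale), so treat the numbers as illustrative only; the helper name `per_op_speedup` is hypothetical.

```python
from typing import Dict

# Fprop timings in ms, copied from the quoted B300 autocast example output.
FPROP_MS: Dict[str, Dict[str, float]] = {
    "QKV Proj": {"BF16": 1.071, "FP8Current": 0.605, "NVFP4": 0.392},
    "Attn Out": {"BF16": 0.307, "FP8Current": 0.317, "NVFP4": 0.256},
    "MLP Up": {"BF16": 1.393, "FP8Current": 0.924, "NVFP4": 0.635},
    "MLP Down": {"BF16": 1.426, "FP8Current": 1.033, "NVFP4": 0.649},
}


def per_op_speedup(
    timings: Dict[str, Dict[str, float]],
    precision: str,
    baseline: str = "BF16",
) -> Dict[str, float]:
    """Hypothetical helper: baseline time divided by low-precision time, per op."""
    return {op: t[baseline] / t[precision] for op, t in timings.items()}


if __name__ == "__main__":
    for precision in ("FP8Current", "NVFP4"):
        for op, speedup in per_op_speedup(FPROP_MS, precision).items():
            print(f"{precision:10s} {op:9s} {speedup:.2f}x")
```

In these numbers the smallest GEMM (Attn Out, K=N=hidden_size) is slightly slower under FP8 CurrentScaling than under BF16, while the larger MLP and QKV GEMMs gain noticeably.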

For a 5B-parameter model (hidden=4096, intermediate=16384, 24 layers), MXFP8 delivers
~1.42x and NVFP4 delivers ~1.98x over BF16 in autocast mode. FP8 DelayedScaling
reaches 1.64x, outperforming both FP8 CurrentScaling (1.39x) and MXFP8 on Blackwell.
In pre-quantized mode (raw kernel throughput), NVFP4 reaches 3.48x -- the gap is
Collaborator

It's worth mentioning that the speedup could be higher and is held back by quantization, but pre-quantized mode has not been introduced at this point in the text.

Contributor Author

Moved the pre-quantized reference out of the body text and into a note callout that explains the concept
(raw kernel throughput, --pre-quantize flag). The Sphinx tabs above also show the pre-quantized graph, so the reader sees the concept visually before the text discusses it.

FP8 CurrentScaling, FP8 DelayedScaling, and FP8 Block Scaling. FP8 DelayedScaling
delivers ~1.69x over BF16, followed by FP8 CurrentScaling at ~1.58x and FP8 Block
Scaling at ~1.40x. FP8 Block Scaling runs natively on Hopper and is the only
block-scaled FP8 recipe available on this device. In pre-quantized mode (raw kernel
Collaborator

Maybe also insert the graph with pre-quantization and expose both via Sphinx tabs.

Collaborator

I think this is an important point for people to understand: the NVFP4 speedup is big, but quantization can diminish it.

Contributor Author

Done — both B300 and H200 sections now use .. tabs:: with Autocast and Pre-quantized tabs showing the
corresponding graphs.

Added a .. note:: callout under the B300 example that highlights this explicitly — shows the 1.98x autocast
vs 3.48x kernel-only gap and explains what causes it.
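
As a rough worked example of that gap: if one assumes the BF16 baseline time is the same in both modes (an assumption, not something stated in this thread), the share of autocast time spent outside the GEMM kernel follows directly from the two speedups. The helper name below is hypothetical.

```python
def quantization_overhead_fraction(autocast_speedup: float, kernel_only_speedup: float) -> float:
    """Hypothetical helper: fraction of autocast time not spent in the GEMM kernel.

    With t_autocast = t_bf16 / autocast_speedup and
    t_kernel = t_bf16 / kernel_only_speedup (same BF16 baseline assumed),
    the overhead share is 1 - t_kernel / t_autocast.
    """
    if autocast_speedup <= 0 or kernel_only_speedup <= 0:
        raise ValueError("speedups must be positive")
    return 1.0 - autocast_speedup / kernel_only_speedup


if __name__ == "__main__":
    # NVFP4 on B300, using the figures quoted in the note: 1.98x vs 3.48x.
    share = quantization_overhead_fraction(1.98, 3.48)
    print(f"Estimated overhead share of autocast time: {share:.1%}")
```

Under that assumption, roughly 43% of the NVFP4 autocast GEMM time would be quantization-related overhead rather than kernel execution.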

flag is mutually exclusive with model config arguments.


What Precision Does Each GEMM Run At?
Collaborator

I think we can remove this section.

Contributor Author

removed

at the same precision in both configs.


Understanding the Speedup Calculation
Collaborator

this also can be removed

Contributor Author

removed

Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>
Comment on lines +1483 to +1498
print("\nEstimated GEMM Speedups:")
bf16_total = full_model.get("BF16", 0)
if run_fp8 and bf16_total > 0:
    fp8_total = full_model.get("MXFP8", 0)
    if fp8_total > 0:
        print(f" MXFP8 vs BF16: {bf16_total / fp8_total:.2f}x")
if run_fp4 and run_fp8:
    fp8_total = full_model.get("MXFP8", 0)
    fp4_total = full_model.get("NVFP4", 0)
    if fp8_total > 0 and fp4_total > 0:
        print(f" NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x")
if run_fp4 and bf16_total > 0:
    fp4_total = full_model.get("NVFP4", 0)
    if fp4_total > 0:
        print(f" NVFP4 vs BF16: {bf16_total / fp4_total:.2f}x")
print(sep)
Contributor

P1 "Estimated GEMM Speedups" never prints FP8Current/FP8Delayed/FP8Block entries

The three if guards only check run_fp8 (MXFP8) and run_fp4 (NVFP4). There are no corresponding branches for run_fp8_current, run_fp8_delayed, or run_fp8_block. On H200 — the documented use case that passes --no-fp8 --no-fp4 — both flags are False, so all three guards fail and the section emits nothing but its header. This directly contradicts the tutorial in docs/examples/gemm_profiling/gemm_profiling.rst (lines 259–263), which shows FP8Delayed vs BF16: 1.69x, FP8Current vs BF16: 1.58x, and FP8Block vs BF16: 1.40x for exactly that invocation.

Contributor Author

Fixed — the hardcoded MXFP8/NVFP4 speedup branches are replaced with a loop over all active precisions, so it now prints {prec} vs BF16 for every enabled precision. Works correctly on Hopper with --no-fp8 --no-fp4

Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>
Comment on lines +62 to +101
.. code-block:: text

   GEMM Benchmark (Model Config Mode) on NVIDIA B300 SXM6 AC
   Timing method: CUDA events
   Warmup iterations: 10, Timed iterations: 100
   Mode: Autocast (includes quantization overhead)

   ==========================================================================================
   Model Config: hidden=4096, intermediate=16384, heads=32, layers=24
   Tokens per step: M = 31 x 512 = 15,872
   ==========================================================================================

   Fprop Shapes:
   ------------------------------------------------------------------------------------------
   Op        Shape               BF16 ms  FP8Current ms  FP8Delayed ms  MXFP8 ms  NVFP4 ms
   ------------------------------------------------------------------------------------------
   QKV Proj  15872x4096x12288      1.071          0.605          0.503     0.579     0.392
   Attn Out  15872x4096x4096       0.307          0.317          0.231     0.269     0.256
   MLP Up    15872x4096x16384      1.393          0.924          0.850     0.924     0.635
   MLP Down  15872x16384x4096      1.426          1.033          0.901     1.076     0.649
   ------------------------------------------------------------------------------------------
   Fprop sum (ms):                 4.196          2.879          2.486     2.847     1.932

   ==========================================================================================
   Per-Layer GEMM Time:
                                 BF16 ms  FP8Current ms  FP8Delayed ms  MXFP8 ms  NVFP4 ms
   Fprop:                          4.196          2.879          2.486     2.847     1.932
   Dgrad:                          4.290          3.063          2.621     3.045     2.189
   Fprop + Dgrad:                  8.486          5.941          5.107     5.892     4.122
   Wgrad:                          4.272          3.205          2.695     3.092     2.331
   Per-layer total:               12.758          9.147          7.802     8.984     6.453

   Full Model (24 layers):
   Total GEMM time (ms):         306.192        219.522        187.246   215.608   154.869

   Estimated GEMM Speedups:
     MXFP8 vs BF16: 1.42x
     NVFP4 vs MXFP8: 1.39x
     NVFP4 vs BF16: 1.98x
   ==========================================================================================
Contributor

P1 B300 example output is stale and contradicts the current code

The documented B300 Estimated GEMM Speedups block (lines 97–100) shows only MXFP8 vs BF16, plus cross-precision lines like NVFP4 vs MXFP8: 1.39x that the current code never produces. The actual code at line 1483–1489 of benchmark_gemm.py iterates precisions[1:] and emits each precision vs BF16 only — no cross-precision pairs. Running the command as documented (no --no-fp8-current or --no-fp8-delayed flags) on a B300 would also print FP8Current vs BF16 and FP8Delayed vs BF16, which are absent from the docs. The Fprop table further omits the FP8Block column even though the shown command does not pass --no-fp8-block. This means users following the tutorial will see materially different output than documented.
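
For reference, a small sketch of what a loop over every enabled precision would print for the full-model totals quoted in that example output. This is not the code from benchmark_gemm.py; `speedup_lines` is a hypothetical helper, and the totals are simply the "Total GEMM time (ms)" row from the quoted block.

```python
from typing import Dict, List

# "Total GEMM time (ms)" row from the quoted B300 autocast example (24 layers).
FULL_MODEL_MS: Dict[str, float] = {
    "BF16": 306.192,
    "FP8Current": 219.522,
    "FP8Delayed": 187.246,
    "MXFP8": 215.608,
    "NVFP4": 154.869,
}


def speedup_lines(full_model: Dict[str, float], baseline: str = "BF16") -> List[str]:
    """Hypothetical helper: one '<precision> vs <baseline>' line per precision."""
    base = full_model.get(baseline, 0)
    lines: List[str] = []
    if base > 0:
        for precision, total in full_model.items():
            # Skip the baseline itself and any precision with no recorded time.
            if precision != baseline and total > 0:
                lines.append(f"{precision} vs {baseline}: {base / total:.2f}x")
    return lines


if __name__ == "__main__":
    for line in speedup_lines(FULL_MODEL_MS):
        print(line)
```

With these totals the loop yields four lines (FP8Current, FP8Delayed, MXFP8, NVFP4 vs BF16) and no cross-precision pair such as NVFP4 vs MXFP8, which is the mismatch the comment points out.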

Collaborator

@pggPL pggPL left a comment

I'm ok with the tutorial, but the speedups part needs polishing


.. tabs::

.. tab:: Autocast
Collaborator

Autocast/pre-quantized is not defined here, so the user does not know what it means.

Contributor Author

Good catch. Added a definition block at the top of the page (before any figures) defining both: Autocast = the end-to-end speedup seen in training, including per-step quantization; Pre-quantized = raw GEMM kernel throughput with inputs already in the target format (the hardware upper bound).


.. tab:: Pre-quantized

.. figure:: gemm_profiling/img/b300_model_config_speedup_prequant.png
Collaborator

no delayed scaling in H200 pre quantized

Contributor Author

You're right, and this was actually a bug. In pre-quantized mode FP8Delayed had no pre-quantized variant and silently fell back to the autocast te.Linear path, so it shouldn't have appeared at all. Fixed benchmark_gemm.py to skip DelayedScaling under --pre-quantize (it differs from CurrentScaling only in how the scale is computed each step, which pre-quantized mode skips). Regenerating the pre-quantized plots rn so the FP8Delayed bar is gone.
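
A minimal sketch of the selection rule described here, assuming the precision names used in the benchmark tables. The function `select_precisions` and both constants are hypothetical and are not the actual implementation in benchmark_gemm.py.

```python
from typing import FrozenSet, List

# Order and names follow the benchmark tables; treated here as an assumption.
ALL_PRECISIONS: List[str] = ["BF16", "FP8Current", "FP8Delayed", "FP8Block", "MXFP8", "NVFP4"]

# Precisions with no pre-quantized variant. DelayedScaling differs from
# CurrentScaling only in how the scale is computed each step, and
# pre-quantized mode skips that step entirely.
NO_PREQUANT_VARIANT: FrozenSet[str] = frozenset({"FP8Delayed"})


def select_precisions(pre_quantize: bool, disabled: FrozenSet[str] = frozenset()) -> List[str]:
    """Hypothetical helper: precisions to benchmark for the given mode."""
    selected = []
    for precision in ALL_PRECISIONS:
        if precision in disabled:
            continue  # turned off by the user (e.g. a --no-* style flag)
        if pre_quantize and precision in NO_PREQUANT_VARIANT:
            continue  # skip instead of silently falling back to autocast
        selected.append(precision)
    return selected


if __name__ == "__main__":
    print("autocast:     ", select_precisions(pre_quantize=False))
    print("pre-quantized:", select_precisions(pre_quantize=True))
```

Filtering up front, rather than inside the timing loop, keeps the printed table and the plot consistent because both consume the same list.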

**Quantization overhead matters.** In pre-quantized mode (raw kernel throughput),
NVFP4 reaches 3.48x over BF16 -- nearly double the 1.98x seen in autocast mode.
The gap is the cost of dynamic quantization, Hadamard transforms, and block scaling
that occurs each training step. Use the ``--pre-quantize`` flag to see the kernel
Collaborator

Mentioning the --pre-quantize flag before the script has been introduced does not make sense. We should explain the difference between pre-quantized and autocast above and remove the flag reference.

Contributor Author

Done. Removed the flag reference and the standalone note, and moved the autocast vs pre-quantized explanation to the top of the page as you suggested


**The speedup from lower-precision GEMMs depends directly on the matrix dimensions,
which are determined by your model config.** Larger matrices amortize the fixed
overhead of quantization (format conversion, block scaling, Hadamard transforms)
Collaborator

"(format conversion, block scaling, Hadamard transforms)": what do format conversion and block scaling mean in this context?

It seems that Claude generated this without understanding it.

Contributor Author

Removed the unclear jargon. Replaced it with plain language: the per-step quantization cost is "converting the input tensors to the low-precision format and computing their scaling factors."

run to see how each choice affects low-precision gains.

See the :doc:`full tutorial </examples/gemm_profiling/gemm_profiling>` for detailed
analysis on both Blackwell and Hopper, including Fprop vs Dgrad comparisons, autocast
Collaborator

similar - what is "including Fprop vs Dgrad comparisons"?

Contributor Author

I reworded it but it was a comparison that I added. I now have per operation fprop, dgrad, and wgrad breakdowns

Speedup Is Shape-Dependent
----------------------------

**The speedup from lower-precision GEMMs depends directly on the matrix dimensions,
Collaborator

This section seems to be super verbose: it basically says "the bigger the shape, the bigger the speedup" in 15 sentences.

Contributor Author

I trimmed it down a lot.

Collaborator

@pggPL pggPL left a comment

I've added some comments

@jomitchellnv
Contributor Author

i need to grab an H200 and B300 node to regenerate the plots then it should be ok

@github-actions github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 28, 2026
jomitchellnv and others added 2 commits May 28, 2026 15:32
- Define autocast vs pre-quantized modes upfront before the figures
- Remove the --pre-quantize flag reference and the standalone note
- Replace unclear quantization-overhead jargon with plain language
- Condense the verbose "Speedup Is Shape-Dependent" section
- Reword "Fprop vs Dgrad comparisons" to per-operation breakdowns
- Fix benchmark_gemm.py: skip FP8 DelayedScaling in pre-quantized mode
  (it has no pre-quantized variant and silently fell back to the
  autocast path, producing a misleading bar in the pre-quantized plots)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
Re-ran the model-config benchmark on B300 (SM100) and H200 (SM90) with the
pre-quantized DelayedScaling fix applied, and synced the numbers in speedups.rst:

- B300 autocast: now includes FP8Block (1.30x); FP8Current 1.41x, FP8Delayed
  1.61x, MXFP8 1.44x, NVFP4 2.03x
- B300 pre-quantized: FP8Delayed bar removed, FP8Block (1.82x) added; NVFP4 3.55x
- H200 autocast: FP8Current 1.57x, FP8Delayed 1.69x, FP8Block 1.41x
- H200 pre-quantized: FP8Delayed removed; FP8Block dropped (no Hopper prequant
  support); raw FP8 1.92x

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>