Adds GEMM Profiling Guide to TE by jomitchellnv · Pull Request #2863 · NVIDIA/TransformerEngine

jomitchellnv · 2026-04-09T21:56:44Z

Description

Adds a GEMM profiling guide to the Transformer Engine documentation and a companion benchmark tool. The guide
explains how to derive all 12 per-layer GEMM shapes (Fprop, Dgrad, Wgrad) from transformer model
hyperparameters, benchmark them across precisions (BF16, FP8 Block, MXFP8, NVFP4), and interpret the resulting
speedup estimates.
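
As a rough illustration of the shape derivation described above, the sketch below computes the 12 per-layer shapes (4 ops x 3 passes) for a dense transformer layer. The function name `derive_gemm_shapes`, the `(M, K, N)` ordering, and the Dgrad/Wgrad transposition convention are assumptions for illustration, not the tool's actual API. The fused QKV width of `3 * hidden` assumes standard multi-head attention without grouped-query heads. The example values are taken from the B300 output quoted later in this thread.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int, int]  # (M, K, N) for an (M x K) @ (K x N) GEMM


def derive_gemm_shapes(hidden: int, intermediate: int, tokens: int) -> Dict[str, Dict[str, Shape]]:
    """Hypothetical helper: per-layer GEMM shapes for a dense transformer layer."""
    # (K, N) of each linear op in the forward pass; M is the token count.
    ops = {
        "QKV Proj": (hidden, 3 * hidden),
        "Attn Out": (hidden, hidden),
        "MLP Up": (hidden, intermediate),
        "MLP Down": (intermediate, hidden),
    }
    shapes = {}
    for name, (k, n) in ops.items():
        shapes[name] = {
            "fprop": (tokens, k, n),  # Y = X @ W
            "dgrad": (tokens, n, k),  # dX = dY @ W^T, so K and N swap
            "wgrad": (k, tokens, n),  # dW = X^T @ dY, so tokens move from M to K
        }
    return shapes


# hidden=4096, intermediate=16384, M = 31 x 512 tokens per step
shapes = derive_gemm_shapes(hidden=4096, intermediate=16384, tokens=31 * 512)
for op, passes in shapes.items():
    print(op, passes)
```

The Fprop entries reproduce the shapes in the quoted benchmark table (for example, QKV Proj as 15872x4096x12288). The Dgrad and Wgrad layouts are one plausible convention and may differ from what the tool prints.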

The benchmark tool supports two modes: model config mode (derives shapes automatically from hidden_size,
intermediate_size, etc.) and manual shape mode (explicit MxKxN triplets). It measures both autocast performance
(realistic end-to-end with quantization overhead) and pre-quantized kernel-only throughput, using CUDA events
or torch.profiler timing backends.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

- Add benchmarks/gemm/benchmark_gemm.py: standalone GEMM benchmark tool supporting BF16, FP8 Block, MXFP8, and NVFP4 precisions with autocast and pre-quantized modes, CUDA event and torch.profiler timing, Nsight Systems integration, and bar-chart output
- Add docs/features/low_precision_training/gemm_profiling/gemm_profiling.rst: documentation covering GEMM shape derivation from model configs, forward/backward pass shape conventions, precision mapping per GEMM pass, speedup calculation methodology, and a worked example on B300
- Add benchmark result plots (img/model_config_speedup.png, img/model_config_speedup_prequant.png)
- Update docs/features/low_precision_training/index.rst toctree to include the new guide

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-04-09T22:00:01Z

Greptile Summary

This PR adds a GEMM profiling guide to Transformer Engine documentation along with a companion benchmark tool (benchmarks/gemm/benchmark_gemm.py). It covers shape derivation for all 12 per-layer GEMMs, benchmarking across BF16/FP8/MXFP8/NVFP4 precisions, and interpretation of speedup results.

benchmark_gemm.py: 1883-line standalone tool supporting model-config and manual-shape modes, two timing backends (CUDA events and torch.profiler), autocast and pre-quantized measurement modes, Nsight Systems integration, and bar-chart output.
gemm_profiling.rst: 589-line tutorial with worked examples on B300 and H200; contains a documentation inaccuracy where the pre-quantized examples show an FP8Delayed ms column that the code does not produce.
speedups.rst + toctree updates: New summary page added to the low-precision-training section; docs/index.rst correctly adds the tutorial to the toctree.

Confidence Score: 4/5

Safe to merge after addressing the FP8Delayed pre-quantized documentation mismatch; the benchmark tool itself works correctly.

The benchmark code correctly excludes FP8Delayed when --pre-quantize is passed, but the tutorial note and both pre-quantized example outputs (B300 and H200) show a FP8Delayed ms column that users will not see when they run the documented commands. This directly misleads anyone following the H200 pre-quantized walkthrough. The rest of the benchmark logic, toctree wiring, and plot generation look correct.

docs/examples/gemm_profiling/gemm_profiling.rst — the pre-quantized sections (note at line 120, B300 output at line 143, H200 output at line 291) need to be reconciled with the actual code behaviour that skips FP8Delayed entirely in pre-quantized mode.

Important Files Changed

Filename	Overview
benchmarks/gemm/benchmark_gemm.py	New 1883-line GEMM benchmark tool supporting BF16, FP8 (Current/Delayed/Block), MXFP8, NVFP4; FP8Delayed is correctly excluded from pre-quantized runs but the docs misrepresent this.
docs/examples/gemm_profiling/gemm_profiling.rst	New 589-line tutorial RST. Pre-quantized B300/H200 examples claim FP8Delayed is included, contradicting the code. H200 speedup ordering also mismatches code iteration order.
docs/features/low_precision_training/speedups.rst	New summary page covering autocast vs pre-quantized results for B300 and H200 with benchmark plots; no logic issues.
docs/features/low_precision_training/index.rst	Adds speedups.rst to the low_precision_training toctree; straightforward change.
docs/index.rst	Adds examples/gemm_profiling/gemm_profiling.rst to the top-level toctree; resolves the prior missing-toctree issue.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CLI args] --> B{Mode?}
    B -->|model config args| C[run_model_config_benchmarks]
    B -->|--shapes or default| D[run_benchmarks]
    C --> F[compute_gemm_shapes]
    F --> G[_benchmark_single_shape per shape]
    G --> H{pre_quantize?}
    H -->|False| I[autocast: FP8Current/Delayed/Block/MXFP8/FP4]
    H -->|True| J[prequantized: FP8Current/Block/MXFP8/FP4 — FP8Delayed SKIPPED]
    I --> K[print summary + speedup table]
    J --> K
    K --> L[create_model_config_plot]
    D --> M{pre_quantize?}
    M -->|False| N[autocast benchmarks]
    M -->|True| O[prequantized — FP8Delayed SKIPPED]
    N --> P[create_plot]
    O --> P

Reviews (10): Last reviewed commit: "Regenerate GEMM speedup figures with Del..."

pggPL · 2026-04-13T10:13:29Z

Hi @jomitchellnv, I see that this PR is open, but "Documentation" job is failing. If you fix it, please ping me and I'll review it.

jomitchellnv · 2026-04-27T18:39:07Z

@pggPL they should be fixed now I hope

jomitchellnv · 2026-04-27T18:40:38Z

/te-ci L1 pytorch

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

greptile-apps · 2026-05-01T18:05:19Z

+        loc="upper right",
+        fontsize=8,
+        ncol=2,


--verify-dgrad plot silently uses approximation instead of measured values

When --verify-dgrad is passed, run_model_config_benchmarks benchmarks and records actual Dgrad timings into dgrad_results, and the printed table correctly shows those measured values. However, create_model_config_plot is never given dgrad_results — the call site only passes fprop_results and wgrad_results. Inside the plot function, Fprop+Dgrad bar height is always computed as fp.avg_time_ms * 2 (the approximation), so the chart silently contradicts the table when --verify-dgrad is used.

Fix: add dgrad_results and verify_dgrad parameters to create_model_config_plot, and when verify_dgrad=True, use fprop_ms[j] + dgrad_ms[j] instead of fprop_ms[j] * 2 for each op bar.
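
A minimal sketch of that fix, assuming per-op timing lists in milliseconds. The helper name `fprop_dgrad_bar_heights` and its signature are hypothetical (the actual `create_model_config_plot` signature is not shown in this thread), and the sample timings are made up for illustration.

```python
from typing import List, Optional


def fprop_dgrad_bar_heights(
    fprop_ms: List[float],
    dgrad_ms: Optional[List[float]] = None,
    verify_dgrad: bool = False,
) -> List[float]:
    """Hypothetical helper: height of the Fprop+Dgrad bar for each op."""
    if verify_dgrad:
        if dgrad_ms is None or len(dgrad_ms) != len(fprop_ms):
            raise ValueError("verify_dgrad=True requires one Dgrad timing per op")
        # Use the measured Dgrad time so the chart agrees with the printed table.
        return [f + d for f, d in zip(fprop_ms, dgrad_ms)]
    # Fallback approximation: assume Dgrad costs the same as Fprop.
    return [f * 2 for f in fprop_ms]


# Illustrative timings only, not measured values.
heights = fprop_dgrad_bar_heights([1.0, 2.0], dgrad_ms=[1.5, 2.5], verify_dgrad=True)
approx = fprop_dgrad_bar_heights([1.0, 2.0])
print(heights, approx)
```

Raising on a missing or mismatched `dgrad_ms` is a design choice in this sketch: it keeps the plot from silently falling back to the approximation again.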

pggPL

I'm super happy about this change, I think we really need that.

I added some comments.

pggPL · 2026-05-04T10:10:35Z

   mxfp8/mxfp8.rst
-   nvfp4/nvfp4.rst
+   nvfp4/nvfp4.rst
+   gemm_profiling/gemm_profiling.rst


In the Features section we want only text with very short code snippets: it should read like an academic-style handbook, and we want it to be concise. The Tutorials and examples section is a better place for code the user should run and for more elaborate text.

Describing the real impact of each precision is very valuable, and a user reading the docs today cannot estimate it. So what I propose is:

add a short speedups.rst section here with a brief introduction, 1-2 graphs/tables, and a link to gemm_profiling.rst with a short description of what the user can find there,

add gemm_profiling.rst to examples, which would basically be this whole tutorial.

I will elaborate on this in a separate comment, but I think we would need 2 pictures/tables in speedups.rst:

one with Hopper recipes: bf16, fp8 tensorwise, fp8 blockwise,

one with Blackwell recipes: bf16, fp8 tensorwise, mxfp8, nvfp4
for some reasonable shape sizes.

And then link to gemm profiling example.

pggPL · 2026-05-04T10:14:26Z

+
+.. code-block:: text
+
+    ==========================================================================================


FP8 block scaling on Blackwell is emulated via MXFP8 tensor cores, and we support it mostly for backward compatibility (see the fp8 blockwise docs page). This recipe is aimed at Hopper.

Also, I see that FP8 tensorwise is omitted. Is there a specific reason for that? I think it is still widely used (I think also on Blackwell, but I am not sure).

So I think we should run 2 experiments: one for Hopper and one for Blackwell, each with the appropriate recipes.

Yes -- will do and update accordingly thanks.

pggPL · 2026-05-04T10:17:48Z

+numbers is the overhead from dynamic quantization, Hadamard transforms, and block
+scaling that occurs in each training step.
+
+An interesting result: **FP8 Block Scaling beats MXFP8 in raw kernel throughput**


I think this comment should be removed, since we do not aim to run both of them on the same device: one is for Hopper, one is for Blackwell.

Done, it's gone. We replaced it with a device-specific note explaining that FP8 Block targets Hopper and MXFP8/NVFP4 target Blackwell.

pggPL · 2026-05-04T10:21:30Z

+Wgrad shapes have a different aspect ratio -- the token dimension moves from M to K --
+so they must always be benchmarked separately.
+
+By default, the tool approximates Dgrad time as equal to Fprop time (since the FLOP


This is weird. I think we should run Dgrad: even if the number of FLOPs is the same, some tensors are transposed, and that may impact the result.

OK, I fixed it. Done.

pggPL · 2026-05-04T10:22:01Z

+this benchmarks Dgrad shapes separately and reports the actual difference.
+
+
+Speedup Is Shape-Dependent


This is a really important section and should not be in the appendix, IMO.

OK, for sure.

pggPL · 2026-05-04T10:22:57Z

+    Computes Fprop and Wgrad shapes, benchmarks each across enabled
+    precisions, and prints per-layer / full-model speedup estimates.
+
+    When *verify_dgrad* is True, Dgrad shapes are benchmarked separately


as commented later, I think we should always benchmark it separately

pggPL · 2026-05-04T10:27:18Z

Is it auto-generated by the script?

Yes, it's auto-generated.

pggPL · 2026-05-04T10:28:39Z

+quantization and kernel dispatch for these precision modes -- this is how you derive
+the matrix multiplications your model runs and measure where your time goes.
+
+A companion benchmark tool is provided at ``benchmarks/gemm/benchmark_gemm.py``.


This should be a link, maybe even to GitHub.

pggPL · 2026-05-04T10:52:37Z

One more thought - we should consider adding MoE - grouped gemm.

Benchmark tool:
- Always benchmark Dgrad separately (remove --verify-dgrad flag)
- Pass measured Dgrad data to plot instead of 2x Fprop approximation
- Add FP8 CurrentScaling and DelayedScaling benchmark support
- Add FP8Block to shape mode (was missing, only in model-config mode)
- Add --no-fp8-current and --no-fp8-delayed CLI flags

Documentation:
- Restructure: concise speedups.rst in features/, full tutorial in examples/
- Add device-specific precision recipes (Hopper vs Blackwell)
- Add Hopper (H200) benchmark results alongside Blackwell (B300)
- Remove misleading FP8 Block vs MXFP8 comparison (different target devices)
- Rename "How Shapes Are Derived" to appendix, promote key sections
- Convert benchmark tool references to GitHub links
- Refresh all benchmark numbers with FP8 Current/Delayed columns

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>

greptile-apps · 2026-05-07T18:11:45Z

+    print("\nEstimated GEMM Speedups:")
+    bf16_total = full_model.get("BF16", 0)
+    if run_fp8 and bf16_total > 0:
+        fp8_total = full_model.get("MXFP8", 0)
+        if fp8_total > 0:
+            print(f"  MXFP8 vs BF16:  {bf16_total / fp8_total:.2f}x")
+    if run_fp4 and run_fp8:
+        fp8_total = full_model.get("MXFP8", 0)
+        fp4_total = full_model.get("NVFP4", 0)
+        if fp8_total > 0 and fp4_total > 0:
+            print(f"  NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x")
+    if run_fp4 and bf16_total > 0:
+        fp4_total = full_model.get("NVFP4", 0)
+        if fp4_total > 0:
+            print(f"  NVFP4 vs BF16:  {bf16_total / fp4_total:.2f}x")
+    print(sep)


The "Estimated GEMM Speedups" block only prints results for MXFP8 and NVFP4. When the tool is run with --no-fp8 --no-fp4 (the documented H200 invocation), run_fp8 and run_fp4 are both False, so all three if guards evaluate to False and the section emits nothing — directly contradicting the tutorial output in docs/examples/gemm_profiling/gemm_profiling.rst which shows FP8Delayed vs BF16: 1.69x, FP8Current vs BF16: 1.58x, and FP8Block vs BF16: 1.40x for that exact invocation.

Suggested change

-    print("\nEstimated GEMM Speedups:")
-    bf16_total = full_model.get("BF16", 0)
-    if run_fp8 and bf16_total > 0:
-        fp8_total = full_model.get("MXFP8", 0)
-        if fp8_total > 0:
-            print(f"  MXFP8 vs BF16:  {bf16_total / fp8_total:.2f}x")
-    if run_fp4 and run_fp8:
-        fp8_total = full_model.get("MXFP8", 0)
-        fp4_total = full_model.get("NVFP4", 0)
-        if fp8_total > 0 and fp4_total > 0:
-            print(f"  NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x")
-    if run_fp4 and bf16_total > 0:
-        fp4_total = full_model.get("NVFP4", 0)
-        if fp4_total > 0:
-            print(f"  NVFP4 vs BF16:  {bf16_total / fp4_total:.2f}x")
-    print(sep)
+    print("\nEstimated GEMM Speedups:")
+    bf16_total = full_model.get("BF16", 0)
+    if bf16_total > 0:
+        for p in precisions[1:]:  # all enabled precisions except BF16
+            p_total = full_model.get(p, 0)
+            if p_total > 0:
+                print(f"  {p} vs BF16: {bf16_total / p_total:.2f}x")
+    if run_fp4 and run_fp8:
+        fp8_total = full_model.get("MXFP8", 0)
+        fp4_total = full_model.get("NVFP4", 0)
+        if fp8_total > 0 and fp4_total > 0:
+            print(f"  NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x")
+    print(sep)

pggPL

I left some comments

pggPL · 2026-05-13T14:06:08Z

+:doc:`full tutorial </examples/gemm_profiling/gemm_profiling>` for usage details.
+
+
+Recommended Precision Recipes by Device


This can be removed

pggPL · 2026-05-13T14:06:49Z

+   targets Hopper, where it runs natively.
+
+
+Speedup Is Shape-Dependent


I would like to have pictures first because they are the most important

Done. Moved both example sections (B300 and H200) with their figures above the "Speedup Is Shape-Dependent" text section, so readers see the benchmark plots first.

pggPL · 2026-05-13T14:07:34Z

+  which has K=N=hidden_size and no expansion) may see little to no benefit from lower
+  precision, because the GEMM is too small for the faster kernel to outrun the
+  quantization cost.
+- **Batch size and sequence length also matter** -- they determine M, the token


I think this is similar to Models with large hidden dimensions and intermediate sizes

Done. Merged the separate "Batch size and sequence length also matter" bullet into the first bullet — it now
reads "Models with large hidden dimensions, intermediate sizes, and token counts (micro_batch_size *
sequence_length)..." — eliminating the redundancy.

pggPL · 2026-05-13T14:09:15Z

+For a 5B-parameter model (hidden=4096, intermediate=16384, 24 layers), MXFP8 delivers
+~1.42x and NVFP4 delivers ~1.98x over BF16 in autocast mode. FP8 DelayedScaling
+reaches 1.64x, outperforming both FP8 CurrentScaling (1.39x) and MXFP8 on Blackwell.
+In pre-quantized mode (raw kernel throughput), NVFP4 reaches 3.48x -- the gap is


It's worth mentioning that the speedup could be larger and is held back by quantization overhead, but pre-quantized mode has not been introduced at this point in the text.

Moved the pre-quantized reference out of the body text and into a note callout that explains the concept
(raw kernel throughput, --pre-quantize flag). The Sphinx tabs above also show the pre-quantized graph, so the reader sees the concept visually before the text discusses it.

pggPL · 2026-05-13T14:14:04Z

+FP8 CurrentScaling, FP8 DelayedScaling, and FP8 Block Scaling. FP8 DelayedScaling
+delivers ~1.69x over BF16, followed by FP8 CurrentScaling at ~1.58x and FP8 Block
+Scaling at ~1.40x. FP8 Block Scaling runs natively on Hopper and is the only
+block-scaled FP8 recipe available on this device. In pre-quantized mode (raw kernel


Maybe also insert the graph with pre-quantization and expose both via Sphinx tabs.

I think this is an important point for people to understand: the speedup of NVFP4 is big, but quantization can diminish it.

Done — both B300 and H200 sections now use .. tabs:: with Autocast and Pre-quantized tabs showing the
corresponding graphs.

Added a .. note:: callout under the B300 example that highlights this explicitly — shows the 1.98x autocast
vs 3.48x kernel-only gap and explains what causes it.

pggPL · 2026-05-13T14:17:16Z

+flag is mutually exclusive with model config arguments.
+
+
+What Precision Does Each GEMM Run At?


I think we can remove this section.

pggPL · 2026-05-13T14:17:26Z

+at the same precision in both configs.
+
+
+Understanding the Speedup Calculation


This can also be removed.

Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>

greptile-apps · 2026-05-13T16:03:22Z

+    print("\nEstimated GEMM Speedups:")
+    bf16_total = full_model.get("BF16", 0)
+    if run_fp8 and bf16_total > 0:
+        fp8_total = full_model.get("MXFP8", 0)
+        if fp8_total > 0:
+            print(f"  MXFP8 vs BF16:  {bf16_total / fp8_total:.2f}x")
+    if run_fp4 and run_fp8:
+        fp8_total = full_model.get("MXFP8", 0)
+        fp4_total = full_model.get("NVFP4", 0)
+        if fp8_total > 0 and fp4_total > 0:
+            print(f"  NVFP4 vs MXFP8: {fp8_total / fp4_total:.2f}x")
+    if run_fp4 and bf16_total > 0:
+        fp4_total = full_model.get("NVFP4", 0)
+        if fp4_total > 0:
+            print(f"  NVFP4 vs BF16:  {bf16_total / fp4_total:.2f}x")
+    print(sep)


"Estimated GEMM Speedups" never prints FP8Current/FP8Delayed/FP8Block entries

The three if guards only check run_fp8 (MXFP8) and run_fp4 (NVFP4). There are no corresponding branches for run_fp8_current, run_fp8_delayed, or run_fp8_block. On H200 — the documented use case that passes --no-fp8 --no-fp4 — both flags are False, so all three guards fail and the section emits nothing but its header. This directly contradicts the tutorial in docs/examples/gemm_profiling/gemm_profiling.rst (lines 259–263), which shows FP8Delayed vs BF16: 1.69x, FP8Current vs BF16: 1.58x, and FP8Block vs BF16: 1.40x for exactly that invocation.

Fixed — the hardcoded MXFP8/NVFP4 speedup branches are replaced with a loop over all active precisions, so it now prints {prec} vs BF16 for every enabled precision. Works correctly on Hopper with --no-fp8 --no-fp4
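
A sketch of what such a loop could look like, following the shape of the suggested change. The function name `speedups_vs_bf16` is hypothetical, and the totals are the "Total GEMM time (ms)" row from the B300 autocast output quoted in this thread.

```python
from typing import Dict, List


def speedups_vs_bf16(full_model: Dict[str, float], precisions: List[str]) -> Dict[str, float]:
    """Hypothetical helper: BF16 total time divided by each other precision's total."""
    bf16_total = full_model.get("BF16", 0)
    result = {}
    if bf16_total > 0:
        for prec in precisions:
            if prec == "BF16":
                continue
            total = full_model.get(prec, 0)
            if total > 0:  # skip precisions that were disabled or not measured
                result[prec] = bf16_total / total
    return result


# Full-model totals in ms (24 layers) from the quoted B300 autocast run.
totals = {
    "BF16": 306.192,
    "FP8Current": 219.522,
    "FP8Delayed": 187.246,
    "MXFP8": 215.608,
    "NVFP4": 154.869,
}
for prec, speedup in speedups_vs_bf16(totals, list(totals)).items():
    print(f"  {prec} vs BF16: {speedup:.2f}x")
```

Because the loop is driven by the list of enabled precisions instead of the `run_fp8` and `run_fp4` flags, a Hopper run with only the FP8 Current/Delayed/Block recipes enabled still produces one line per recipe.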

Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>

greptile-apps · 2026-05-13T16:21:25Z

+.. code-block:: text
+
+    GEMM Benchmark (Model Config Mode) on NVIDIA B300 SXM6 AC
+    Timing method: CUDA events
+    Warmup iterations: 10, Timed iterations: 100
+    Mode: Autocast (includes quantization overhead)
+
+    ==========================================================================================
+    Model Config: hidden=4096, intermediate=16384, heads=32, layers=24
+    Tokens per step: M = 31 x 512 = 15,872
+    ==========================================================================================
+
+    Fprop Shapes:
+    ------------------------------------------------------------------------------------------
+    Op                     Shape                       BF16 ms FP8Current ms FP8Delayed ms   MXFP8 ms   NVFP4 ms
+    ------------------------------------------------------------------------------------------
+    QKV Proj               15872x4096x12288              1.071      0.605      0.503      0.579      0.392
+    Attn Out               15872x4096x4096               0.307      0.317      0.231      0.269      0.256
+    MLP Up                 15872x4096x16384              1.393      0.924      0.850      0.924      0.635
+    MLP Down               15872x16384x4096              1.426      1.033      0.901      1.076      0.649
+    ------------------------------------------------------------------------------------------
+    Fprop sum (ms):                                     4.196      2.879      2.486      2.847      1.932
+
+    ==========================================================================================
+    Per-Layer GEMM Time:
+                                      BF16 ms FP8Current ms FP8Delayed ms   MXFP8 ms   NVFP4 ms
+    Fprop:                              4.196      2.879      2.486      2.847      1.932
+    Dgrad:                              4.290      3.063      2.621      3.045      2.189
+    Fprop + Dgrad:                      8.486      5.941      5.107      5.892      4.122
+    Wgrad:                              4.272      3.205      2.695      3.092      2.331
+    Per-layer total:                   12.758      9.147      7.802      8.984      6.453
+
+    Full Model (24 layers):
+    Total GEMM time (ms):             306.192    219.522    187.246    215.608    154.869
+
+    Estimated GEMM Speedups:
+      MXFP8 vs BF16:  1.42x
+      NVFP4 vs MXFP8: 1.39x
+      NVFP4 vs BF16:  1.98x
+    ==========================================================================================


B300 example output is stale and contradicts the current code

The documented B300 Estimated GEMM Speedups block (lines 97–100) shows only MXFP8 vs BF16, plus cross-precision lines like NVFP4 vs MXFP8: 1.39x that the current code never produces. The actual code at line 1483–1489 of benchmark_gemm.py iterates precisions[1:] and emits each precision vs BF16 only — no cross-precision pairs. Running the command as documented (no --no-fp8-current or --no-fp8-delayed flags) on a B300 would also print FP8Current vs BF16 and FP8Delayed vs BF16, which are absent from the docs. The Fprop table further omits the FP8Block column even though the shown command does not pass --no-fp8-block. This means users following the tutorial will see materially different output than documented.
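
For reference, the totals in the quoted output follow from simple addition, which can be checked with a short sketch. The variable names are illustrative; the numbers are the BF16 column of the "Per-Layer GEMM Time" table in the quoted output.

```python
# BF16 per-layer timings (ms) from the quoted B300 autocast output.
fprop_ms, dgrad_ms, wgrad_ms = 4.196, 4.290, 4.272
num_layers = 24

# Per-layer total is the sum of the three passes; the full-model
# estimate scales that by the number of transformer layers.
per_layer_ms = fprop_ms + dgrad_ms + wgrad_ms
full_model_ms = per_layer_ms * num_layers

print(f"Per-layer total: {per_layer_ms:.3f} ms")
print(f"Full model ({num_layers} layers): {full_model_ms:.3f} ms")
```

This only covers GEMM time, as the quoted output does; attention, normalization, and communication costs are outside the estimate.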

pggPL

I'm ok with the tutorial, but the speedups part needs polishing

pggPL · 2026-05-13T19:52:45Z

+
+.. tabs::
+
+   .. tab:: Autocast


This autocast/pre-quantized is not defined and user does not know what that means.

Good catch. Added a definition block at the top of the page (before any figures) defining both: Autocast = the end-to-end speedup seen in training, including per-step quantization; Pre-quantized = raw GEMM kernel throughput with inputs already in the target format (the hardware upper bound).

pggPL · 2026-05-13T19:54:11Z

+
+   .. tab:: Pre-quantized
+
+      .. figure:: gemm_profiling/img/b300_model_config_speedup_prequant.png


no delayed scaling in H200 pre quantized

You're right, and this was actually a bug. In pre-quantized mode FP8Delayed had no pre-quantized variant and silently fell back to the autocast te.Linear path, so it shouldn't have appeared at all. Fixed benchmark_gemm.py to skip DelayedScaling under --pre-quantize (it differs from CurrentScaling only in how the scale is computed each step, which pre-quantized mode skips). I am regenerating the pre-quantized plots now so the FP8Delayed bar is gone.
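
The selection logic described here might look like the following sketch. The function name `active_precisions` and the precision labels are assumptions based on the flowchart and column headers in this thread, not the tool's actual code.

```python
from typing import List

# Labels as they appear in the benchmark output columns (assumed complete).
AUTOCAST_PRECISIONS = ["BF16", "FP8Current", "FP8Delayed", "FP8Block", "MXFP8", "NVFP4"]


def active_precisions(pre_quantize: bool) -> List[str]:
    """Hypothetical helper: precisions to benchmark in the given mode."""
    if pre_quantize:
        # DelayedScaling has no pre-quantized variant: it differs from
        # CurrentScaling only in how the scale is computed each step,
        # and pre-quantized mode skips that step entirely.
        return [p for p in AUTOCAST_PRECISIONS if p != "FP8Delayed"]
    return list(AUTOCAST_PRECISIONS)


print(active_precisions(pre_quantize=True))
```

Filtering the list up front, instead of letting the benchmark fall back to another path, means a skipped recipe never shows up as a column or a bar.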

pggPL · 2026-05-13T19:55:52Z

+   **Quantization overhead matters.** In pre-quantized mode (raw kernel throughput),
+   NVFP4 reaches 3.48x over BF16 -- nearly double the 1.98x seen in autocast mode.
+   The gap is the cost of dynamic quantization, Hadamard transforms, and block scaling
+   that occurs each training step. Use the ``--pre-quantize`` flag to see the kernel


Mentioning the --pre-quantize flag when we have not defined the script does not make sense. We should explain the difference between pre-quantized and autocast above and remove it.

Done. Removed the flag reference and the standalone note, and moved the autocast vs pre-quantized explanation to the top of the page as you suggested

pggPL · 2026-05-13T19:58:21Z

+
+**The speedup from lower-precision GEMMs depends directly on the matrix dimensions,
+which are determined by your model config.** Larger matrices amortize the fixed
+overhead of quantization (format conversion, block scaling, Hadamard transforms)


"(format conversion, block scaling, Hadamard transforms)": what do format conversion and block scaling mean in this context?

It seems that Claude generated it without understanding.

Removed the unclear jargon. Replaced it with plain language: the per-step quantization cost is "converting the input tensors to the low-precision format and computing their scaling factors."

pggPL · 2026-05-13T19:58:50Z

+run to see how each choice affects low-precision gains.
+
+See the :doc:`full tutorial </examples/gemm_profiling/gemm_profiling>` for detailed
+analysis on both Blackwell and Hopper, including Fprop vs Dgrad comparisons, autocast


Similar: what is "including Fprop vs Dgrad comparisons"?

I reworded it, but it was a comparison that I added. I now have per-operation Fprop, Dgrad, and Wgrad breakdowns.

pggPL · 2026-05-13T20:05:49Z

+Speedup Is Shape-Dependent
+----------------------------
+
+**The speedup from lower-precision GEMMs depends directly on the matrix dimensions,


This section seems to be super verbose: it basically says "the bigger the shape, the bigger the speedup" in 15 sentences.

I trimmed it down a lot.

pggPL

I've added some comments

jomitchellnv · 2026-05-28T21:03:26Z

I need to grab an H200 and a B300 node to regenerate the plots, then it should be OK.

- Define autocast vs pre-quantized modes upfront before the figures
- Remove the --pre-quantize flag reference and the standalone note
- Replace unclear quantization-overhead jargon with plain language
- Condense the verbose "Speedup Is Shape-Dependent" section
- Reword "Fprop vs Dgrad comparisons" to per-operation breakdowns
- Fix benchmark_gemm.py: skip FP8 DelayedScaling in pre-quantized mode (it has no pre-quantized variant and silently fell back to the autocast path, producing a misleading bar in the pre-quantized plots)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

Re-ran the model-config benchmark on B300 (SM100) and H200 (SM90) with the pre-quantized DelayedScaling fix applied, and synced the numbers in speedups.rst:
- B300 autocast: now includes FP8Block (1.30x); FP8Current 1.41x, FP8Delayed 1.61x, MXFP8 1.44x, NVFP4 2.03x
- B300 pre-quantized: FP8Delayed bar removed, FP8Block (1.82x) added; NVFP4 3.55x
- H200 autocast: FP8Current 1.57x, FP8Delayed 1.69x, FP8Block 1.41x
- H200 pre-quantized: FP8Delayed removed; FP8Block dropped (no Hopper prequant support); raw FP8 1.92x

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

jomitchellnv changed the title ~~adds blog post~~ Adds GEMM Profiling Guide to TE Apr 9, 2026

greptile-apps Bot reviewed Apr 9, 2026

pggPL self-requested a review April 10, 2026 14:00

ptrendx assigned pggPL Apr 21, 2026

jomitchellnv force-pushed the jm/gemm-blog branch from 2533a50 to 64c353d Compare May 1, 2026 18:02

adds blog post

88a9e0b

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

jomitchellnv force-pushed the jm/gemm-blog branch from 64c353d to 88a9e0b Compare May 1, 2026 18:02

greptile-apps Bot reviewed May 1, 2026

pggPL reviewed May 4, 2026

Jonathan Mitchell and others added 2 commits May 4, 2026 13:23

fixes failing test

6132b5a

Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>

greptile-apps Bot reviewed May 7, 2026

pggPL reviewed May 13, 2026

cleanup per comments

237b199

Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>

greptile-apps Bot reviewed May 13, 2026

greptile

394ed95

Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>

greptile-apps Bot reviewed May 13, 2026

pggPL reviewed May 13, 2026

github-actions Bot added the community-contribution label May 28, 2026

jomitchellnv and others added 2 commits May 28, 2026 15:32

jomitchellnv force-pushed the jm/gemm-blog branch from f8ebbd2 to 36cc7fa Compare May 28, 2026 22:34

