Adds GEMM Profiling Guide to TE #2863
Open
jomitchellnv wants to merge 8 commits into NVIDIA:main from jomitchellnv:jm/gemm-blog

8 commits:
81c10a5 adds blog post
8f2e635 Address review comments on GEMM profiling guide
1fbde91 fixes failing test
a492269 cleanup per comments
dab2f9c greptile
39714e3 Address review comments on speedups.rst (jomitchellnv)
03b5742 Regenerate GEMM speedup figures with DelayedScaling fix (jomitchellnv)
1a16e4a Apply suggestion from @pggPL (pggPL)
Binary file added
BIN
+92.5 KB
docs/examples/gemm_profiling/img/b300_model_config_speedup_prequant.png
Binary file added
BIN
+83.8 KB
docs/examples/gemm_profiling/img/h200_model_config_speedup_prequant.png
Binary file added
BIN
+94.4 KB
...eatures/low_precision_training/gemm_profiling/img/b300_model_config_speedup.png
Binary file added
BIN
+91.4 KB
...ow_precision_training/gemm_profiling/img/b300_model_config_speedup_prequant.png
Binary file added
BIN
+86.8 KB
...eatures/low_precision_training/gemm_profiling/img/h200_model_config_speedup.png
Binary file added
BIN
+80.5 KB
...ow_precision_training/gemm_profiling/img/h200_model_config_speedup_prequant.png
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,114 @@ | ||
| .. | ||
| Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
|
|
||
| See LICENSE for license information. | ||
|
|
||
| GEMM Speedups Across Precisions | ||
| ================================= | ||
|
|
||
| Transformer Engine supports multiple low-precision formats for the linear-layer GEMMs | ||
| that dominate transformer training time: BF16, FP8 tensor-wise scaling | ||
| (CurrentScaling, DelayedScaling), FP8 Block Scaling, MXFP8, and NVFP4. Each step down | ||
| in precision can accelerate the 12 GEMMs per transformer layer (4 Fprop + 4 Dgrad + | ||
| 4 Wgrad), but the actual speedup depends on your model's matrix dimensions. | ||
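To make the count of 12 concrete, the sketch below lists one plausible set of GEMM shapes for a single layer. It assumes the four linear layers are a fused QKV projection (without grouped-query attention), the attention output projection, and the two MLP projections (FC1 and FC2). That layer set, the helper ``layer_gemm_shapes``, and the token count of 8192 are illustrative assumptions, not taken from the benchmark tool; the hidden and intermediate sizes are those of the 5B example on this page.

```python
from typing import List, NamedTuple


class Gemm(NamedTuple):
    layer: str
    op: str  # "fprop", "dgrad", or "wgrad"
    m: int   # rows of the output matrix
    n: int   # columns of the output matrix
    k: int   # reduction (inner) dimension

    @property
    def flops(self) -> int:
        # One multiply and one add per (m, n, k) triple.
        return 2 * self.m * self.n * self.k


def layer_gemm_shapes(hidden: int, intermediate: int, tokens: int) -> List[Gemm]:
    """Enumerate the GEMMs of one transformer layer (assumed layer set)."""
    # (name, in_features, out_features) -- hypothetical naming
    linears = [
        ("qkv", hidden, 3 * hidden),
        ("proj", hidden, hidden),
        ("fc1", hidden, intermediate),
        ("fc2", intermediate, hidden),
    ]
    gemms = []
    for name, fin, fout in linears:
        gemms.append(Gemm(name, "fprop", tokens, fout, fin))  # Y  = X  @ W^T
        gemms.append(Gemm(name, "dgrad", tokens, fin, fout))  # dX = dY @ W
        gemms.append(Gemm(name, "wgrad", fout, fin, tokens))  # dW = dY^T @ X
    return gemms


# tokens = micro_batch_size * sequence_length (hypothetical value)
gemms = layer_gemm_shapes(hidden=4096, intermediate=16384, tokens=8192)
for g in gemms:
    print(f"{g.layer:>4} {g.op:<5} M={g.m:<6} N={g.n:<6} K={g.k:<6} "
          f"GFLOP={g.flops / 1e9:.1f}")
```

The three GEMMs of a given linear layer have the same FLOP count but different shapes, which is one reason Fprop, Dgrad, and Wgrad can show different speedups.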
|
|
||
| A benchmark tool is provided at | ||
| `benchmarks/gemm/benchmark_gemm.py <https://github.com/NVIDIA/TransformerEngine/blob/main/benchmarks/gemm/benchmark_gemm.py>`__ | ||
| to measure GEMM performance for your specific model config. See the | ||
| :doc:`full tutorial </examples/gemm_profiling/gemm_profiling>` for usage details. | ||
|
|
||
| The benchmark reports two numbers for each precision: | ||
|
|
||
| - **Autocast** -- the end-to-end speedup seen in real training. It includes both the | ||
| GEMM and the per-step quantization work: converting the input tensors to the | ||
| low-precision format and computing their scaling factors. | ||
| - **Pre-quantized** -- the raw GEMM kernel throughput with inputs already in the target | ||
| format, excluding per-step quantization. Because the | ||
| inputs are already quantized, recipes that differ only in how scaling factors are | ||
| computed (e.g. DelayedScaling vs CurrentScaling) collapse to the same kernel here. | ||
|
|
||
|
|
||
| Example: 5B Model on B300 (Blackwell) | ||
| --------------------------------------- | ||
|
|
||
| .. tabs:: | ||
|
|
||
| .. tab:: Autocast | ||
|
|
||
| .. figure:: gemm_profiling/img/b300_model_config_speedup.png | ||
| :align: center | ||
| :width: 80% | ||
| :alt: Autocast model config benchmark showing per-layer GEMM time breakdown across precisions. | ||
|
|
||
| Autocast model config benchmark on NVIDIA B300 -- per-layer GEMM time breakdown by | ||
| precision and operation (Fprop+Dgrad and Wgrad). | ||
|
|
||
| .. tab:: Pre-quantized | ||
|
|
||
| .. figure:: gemm_profiling/img/b300_model_config_speedup_prequant.png | ||
| :align: center | ||
| :width: 80% | ||
| :alt: Pre-quantized model config benchmark showing raw GEMM kernel throughput on B300. | ||
|
|
||
| Pre-quantized model config benchmark on NVIDIA B300 -- raw GEMM kernel throughput | ||
| without quantization overhead. | ||
|
|
||
| For a 5B-parameter model (hidden=4096, intermediate=16384, 24 layers), MXFP8 delivers | ||
| ~1.44x and NVFP4 delivers ~2.03x over BF16 in autocast mode. FP8 DelayedScaling | ||
| reaches ~1.61x, outperforming both FP8 CurrentScaling (~1.41x) and MXFP8 on Blackwell. | ||
|
|
||
| In pre-quantized mode the gap widens: NVFP4 reaches 3.55x over BF16, roughly 1.75x its | ||
| autocast speedup. The difference is the per-step quantization cost, which the | ||
| pre-quantized number excludes. | ||
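The share of autocast time that goes to quantization can be estimated from the two reported speedups. The sketch below is a back-of-the-envelope calculation, not something the benchmark prints: it assumes both speedups are measured against the same BF16 baseline time and that autocast time is simply pre-quantized GEMM time plus quantization time. The helper name ``quantization_share`` is illustrative.

```python
def quantization_share(autocast_speedup: float, prequant_speedup: float) -> float:
    """Fraction of autocast time not explained by the pre-quantized GEMM.

    With BF16 time T, autocast time is T / autocast_speedup and GEMM-only
    time is T / prequant_speedup, so the remaining share is
    1 - autocast_speedup / prequant_speedup.
    """
    if autocast_speedup <= 0 or prequant_speedup <= 0:
        raise ValueError("speedups must be positive")
    return 1.0 - autocast_speedup / prequant_speedup


# NVFP4 on B300, using the speedups quoted in this section.
share = quantization_share(autocast_speedup=2.03, prequant_speedup=3.55)
print(f"{share:.1%} of NVFP4 autocast time is quantization overhead")
```

Under these assumptions, roughly 43% of the NVFP4 autocast time on B300 is quantization work rather than the GEMM itself.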
|
|
||
| Example: 5B Model on H200 (Hopper) | ||
| ------------------------------------- | ||
|
|
||
| .. tabs:: | ||
|
|
||
| .. tab:: Autocast | ||
|
|
||
| .. figure:: gemm_profiling/img/h200_model_config_speedup.png | ||
| :align: center | ||
| :width: 80% | ||
| :alt: Autocast model config benchmark showing per-layer GEMM time breakdown across precisions on H200. | ||
|
|
||
| Autocast model config benchmark on NVIDIA H200 NVL -- per-layer GEMM time breakdown by | ||
| precision and operation (Fprop+Dgrad and Wgrad). | ||
|
|
||
| .. tab:: Pre-quantized | ||
|
|
||
| .. figure:: gemm_profiling/img/h200_model_config_speedup_prequant.png | ||
| :align: center | ||
| :width: 80% | ||
| :alt: Pre-quantized model config benchmark showing raw GEMM kernel throughput on H200. | ||
|
|
||
| Pre-quantized model config benchmark on NVIDIA H200 NVL -- raw GEMM kernel throughput | ||
| without quantization overhead. | ||
|
|
||
| For the same 5B-parameter model on H200 (Hopper), the available precisions are BF16, | ||
| FP8 CurrentScaling, FP8 DelayedScaling, and FP8 Block Scaling. FP8 DelayedScaling | ||
| delivers ~1.69x over BF16, followed by FP8 CurrentScaling at ~1.57x and FP8 Block | ||
| Scaling at ~1.41x. FP8 Block Scaling runs natively on Hopper and is the only | ||
| block-scaled FP8 recipe available on this device. In pre-quantized mode, raw FP8 | ||
| reaches 1.92x over BF16. | ||
|
|
||
|
|
||
| Speedup Is Shape-Dependent | ||
| ---------------------------- | ||
|
|
||
| The speedup from lower precision depends on the matrix dimensions set by your model | ||
| config. Large GEMMs -- from big hidden and intermediate sizes and high token counts | ||
| (``micro_batch_size * sequence_length``) -- amortize the fixed quantization overhead | ||
| over more compute and see meaningful speedups. Small GEMMs (e.g. the attention output | ||
| projection, with ``K=N=hidden_size`` and no expansion) may see little benefit or even a | ||
| slowdown when the overhead outweighs what the faster kernel saves. | ||
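One rough way to see the amortization effect is to compare GEMM FLOPs against the number of input elements that must be quantized. The sketch below is a simplified model, not the benchmark's method: it assumes both operands (an M x K activation and a K x N weight) are quantized on every call, and it ignores kernel-launch latency, caching of quantized weights, and memory bandwidth. The token counts are hypothetical.

```python
def flops_per_quantized_element(m: int, n: int, k: int) -> float:
    """GEMM FLOPs divided by the number of input elements to quantize."""
    gemm_flops = 2 * m * n * k          # multiply-add per (m, n, k) triple
    quantized_elements = m * k + k * n  # activation (M x K) + weight (K x N)
    return gemm_flops / quantized_elements


hidden, intermediate = 4096, 16384  # the 5B config used on this page

for tokens in (512, 8192):  # hypothetical micro_batch_size * sequence_length
    proj = flops_per_quantized_element(tokens, hidden, hidden)       # K = N = hidden
    fc1 = flops_per_quantized_element(tokens, intermediate, hidden)  # 4x expansion
    print(f"tokens={tokens:<5} proj={proj:8.1f}  fc1={fc1:8.1f}  FLOPs per quantized element")
```

Higher values mean more math per quantized element, so the overhead is a smaller share of the total. In this model the ratio grows with both the token count and the output width, which is consistent with the attention output projection benefiting least.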
|
|
||
| This is why you should benchmark with your actual config: the theoretical tensor-core | ||
| speedup (e.g. 2x for FP4 vs FP8) is an upper bound that assumes the GEMM is large enough | ||
| to saturate the hardware. It also makes the tool useful for architecture co-design -- | ||
| run candidate configs through it before committing to a training run. | ||
|
|
||
| See the :doc:`full tutorial </examples/gemm_profiling/gemm_profiling>` for detailed | ||
| analysis on both Blackwell and Hopper, including per-operation (Fprop, Dgrad, Wgrad) | ||
| breakdowns and manual shape mode for non-standard architectures. | ||