CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.
Updated Apr 2, 2026 - Python
Hands-on Jupyter notebooks for deep learning with TensorFlow, covering fundamental concepts, model training, and applied tabular projects.
Standalone LLM inference benchmarking pipelines on AMD GPUs using ROCm, vLLM, MAD, and data visualization scripts.
Artifact-backed LLM serving performance lab for vLLM baselines, official metrics, GuideLLM checks, and SGLang/PD scaffolding.
One-shot script to audit GPU, CUDA, PyTorch, CPU, and disk performance before debugging a slow or broken ML environment.
benchHUB is a Python-based project to parse, aggregate, and visualize system and performance benchmarks. It includes a Streamlit dashboard to display and compare results.
Reproducible GPT-2 distributed-training benchmarks on 1-8 V100 GPUs using Slurm, PyTorch, DeepSpeed, NCCL, NVTX, and Nsight Systems.
Run a 2-min local benchmark → predict how long your AI job will take on a cloud GPU (T4/V100/A100). No guessing, no wasted money.