Kernel Lab is a small research playground for learning and comparing operator implementations across three stages:
- Torch reference implementation
- Triton implementation
- CUDA extension implementation
Every operator follows the same workflow: correctness first, then benchmarks, then profiling.
- Keep the code readable enough for study and iteration.
- Make local macOS development possible without requiring a GPU.
- Keep the remote Linux + NVIDIA validation path obvious and repeatable.
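One way to serve both the GPU-free macOS goal and the Linux + NVIDIA goal is to detect at runtime which backends can be used, so tests and benchmarks skip the rest. The following is a minimal sketch, not the project's actual code: `has_module`, `cuda_available`, and `usable_backends` are hypothetical helper names, and the sketch assumes the CUDA backend only needs a visible CUDA device (it does not check whether the extension has been built).

```python
import importlib.util
from typing import List


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def cuda_available() -> bool:
    """Return True only if torch is installed and reports a usable CUDA device."""
    if not has_module("torch"):
        return False
    import torch  # imported lazily so this file also loads on machines without torch

    return torch.cuda.is_available()


def usable_backends() -> List[str]:
    """List the backends that could run on this machine (hypothetical helper)."""
    backends: List[str] = []
    if has_module("torch"):
        backends.append("reference")  # the Torch baseline also runs on CPU
    if cuda_available():
        if has_module("triton"):
            backends.append("triton")
        # Assumption: a CUDA device is enough; the built extension is not checked here.
        backends.append("cuda")
    return backends
```

On a macOS laptop with only CPU Torch installed, this would report just the reference backend, which is the case the default tests need to cover.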
kernel_lab/
├── ops/
│   ├── registry.py
│   ├── references/
│   ├── triton/
│   ├── cuda/
│   └── common/
├── tests/
├── benchmarks/
├── scripts/
└── docs/
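The layout above includes `ops/registry.py`, whose contents are not shown here. As an illustration only, a registry that maps an operator name and a backend name to a callable could look like the sketch below. Everything in it is an assumption: the names `register`, `get`, and `available`, the backend labels `triton` and `cuda` (only `reference` appears in the benchmark command), and the plain-Python softmax that stands in for the Torch baseline so the sketch has no dependencies.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: (operator name, backend name) -> implementation.
_REGISTRY: Dict[Tuple[str, str], Callable] = {}

# Assumed backend labels, one per stage of the workflow.
BACKENDS = ("reference", "triton", "cuda")


def register(op: str, backend: str) -> Callable[[Callable], Callable]:
    """Decorator that records an implementation under (op, backend)."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")

    def decorator(fn: Callable) -> Callable:
        key = (op, backend)
        if key in _REGISTRY:
            raise KeyError(f"{op}/{backend} is already registered")
        _REGISTRY[key] = fn
        return fn

    return decorator


def get(op: str, backend: str = "reference") -> Callable:
    """Look up an implementation, failing loudly for unfinished backends."""
    try:
        return _REGISTRY[(op, backend)]
    except KeyError:
        raise NotImplementedError(f"{op} has no {backend} implementation yet") from None


def available(op: str) -> List[str]:
    """Return the backends registered for an operator, in stage order."""
    return [b for b in BACKENDS if (op, b) in _REGISTRY]


@register("softmax", "reference")
def softmax_reference(xs: List[float]) -> List[float]:
    """Plain-Python stand-in for the Torch baseline (numerically stable form)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Raising `NotImplementedError` for a missing backend makes it easy for tests and benchmarks to tell a placeholder apart from a genuine failure.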
Implement one operator at a time.
- Start in kernel_lab/ops/references/ with the Torch baseline.
- Add the Triton version in kernel_lab/ops/triton/.
- Add the CUDA binding and kernels in kernel_lab/ops/cuda/.
- Validate with pytest.
- Compare with the benchmark scripts.
- Profile on the Linux + NVIDIA server with ncu or nsys.
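The first two stages of this workflow, correctness and then timing, can be sketched without any GPU. The example below is a dependency-free illustration, not the project's test or benchmark code: `check_close` and `bench` are hypothetical helpers, and both softmax functions are plain-Python stand-ins (the "candidate" plays the role a Triton or CUDA kernel would play against the Torch baseline). Profiling with `ncu` or `nsys` is outside the scope of the sketch.

```python
import math
import time
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def softmax_reference(xs: Vector) -> Vector:
    """Baseline: subtract the max before exponentiating for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_candidate(xs: Vector) -> Vector:
    """Stand-in for an optimized kernel: the direct exp / sum formulation."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def check_close(ref: Callable, cand: Callable, inputs: Sequence[Vector],
                rtol: float = 1e-6, atol: float = 1e-9) -> None:
    """Step 1: raise AssertionError if the candidate disagrees with the reference."""
    for xs in inputs:
        expected, actual = ref(xs), cand(xs)
        if len(expected) != len(actual):
            raise AssertionError(f"length mismatch for input {xs}")
        for e, a in zip(expected, actual):
            if not math.isclose(a, e, rel_tol=rtol, abs_tol=atol):
                raise AssertionError(f"mismatch for input {xs}: {a} != {e}")


def bench(fn: Callable, xs: Vector, repeats: int = 50) -> float:
    """Step 2: return the best wall-clock time in seconds over several runs."""
    fn(xs)  # warm-up call, excluded from the measurement
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(xs)
        best = min(best, time.perf_counter() - start)
    return best


cases = [[0.0, 1.0, 2.0], [-3.0, 0.5, 4.0]]
check_close(softmax_reference, softmax_candidate, cases)  # correctness first
timings: Dict[str, float] = {
    "reference": bench(softmax_reference, cases[0]),
    "candidate": bench(softmax_candidate, cases[0]),
}
```

Running the comparison before the timing loop keeps a fast but wrong implementation from ever reaching the benchmark table. On a real GPU backend the timing step would also need device synchronization, which this CPU-only sketch does not model.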
Install the package in editable mode:

    pip install -e ".[dev]"

Run the default tests:

    pytest

Run a baseline benchmark on CPU:

    python benchmarks/bench_softmax.py --backend reference --device cpu

When you are on a Linux + NVIDIA machine, you can build the CUDA extension with:

    python setup.py build_ext --inplace

- softmax
- rmsnorm
- rope
- swiglu (placeholder)
- attention_toy (placeholder)

The Triton and CUDA directories currently provide templates and integration points, not finished optimized kernels.