All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Complete Doxygen coverage for all public headers (30 headers)
- Inline code comments across memory, device, and algo layers
- Structured logging infrastructure (ERROR/WARN/INFO/DEBUG/TRACE)
- User guides: SPARSE.md, QUANTIZATION.md, ARCHITECTURE.md
- Updated README with v2.x features
- Transformer optimization features
- KV cache management improvements
- Flash attention support
- Advanced quantization infrastructure
- FP8 support (E4M3, E5M2) for H100/H200
- Quantization-aware training (QAT) support
- Calibration framework (MinMax, Percentile)
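
For readers unfamiliar with the two calibration strategies named above, the following is a minimal host-side sketch of symmetric int8 calibration. It is not this project's API: `QuantParams`, `calibrate_minmax`, `calibrate_percentile`, and `quantize` are hypothetical names chosen for illustration, and the percentile variant assumes the nearest-rank definition applied to absolute values.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: symmetric per-tensor scale, real ~= scale * quantized.
struct QuantParams {
    float scale;
};

// MinMax calibration: the largest observed magnitude maps to 127.
inline QuantParams calibrate_minmax(const std::vector<float>& samples) {
    float max_abs = 0.0f;
    for (float v : samples) max_abs = std::max(max_abs, std::fabs(v));
    // Fall back to a scale of 1 for empty or all-zero input.
    return QuantParams{max_abs > 0.0f ? max_abs / 127.0f : 1.0f};
}

// Percentile calibration: clip at the p-th percentile of |x| (nearest rank),
// which makes the scale less sensitive to rare outliers.
inline QuantParams calibrate_percentile(std::vector<float> samples, double p) {
    if (samples.empty()) return QuantParams{1.0f};
    for (float& v : samples) v = std::fabs(v);
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    std::size_t rank = static_cast<std::size_t>(std::ceil(p / 100.0 * n));
    rank = std::clamp<std::size_t>(rank, 1, n);
    const float clip = samples[rank - 1];
    return QuantParams{clip > 0.0f ? clip / 127.0f : 1.0f};
}

// Round to nearest and saturate to the symmetric int8 range [-127, 127].
inline std::int8_t quantize(float x, QuantParams q) {
    const float r = std::round(x / q.scale);
    return static_cast<std::int8_t>(std::clamp(r, -127.0f, 127.0f));
}
```

With MinMax, a single outlier stretches the scale and costs resolution for typical values; the percentile variant trades that for saturation of the clipped outliers.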
- Performance tooling
- Kernel profiling utilities
- Memory bandwidth analysis
- Sparse solver acceleration
- CG, GMRES, BiCGSTAB iterative solvers
- Jacobi and ILU(0) preconditioners
- RCM bandwidth reduction reordering
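
As a rough illustration of how an iterative solver and a preconditioner from the entries above fit together, here is a CPU-only sketch of conjugate gradient with a Jacobi (inverse-diagonal) preconditioner on a CSR matrix. It is not the library's implementation or interface: `CsrMatrix`, `spmv`, `inverse_diagonal`, and `pcg_jacobi` are hypothetical names, and the sketch assumes a symmetric positive definite matrix with a nonzero diagonal.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical CSR container used only for this sketch.
struct CsrMatrix {
    std::size_t n;
    std::vector<std::size_t> row_ptr;  // size n + 1
    std::vector<std::size_t> col_idx;
    std::vector<double> values;
};

// Sparse matrix-vector product y = A * x.
inline Vec spmv(const CsrMatrix& A, const Vec& x) {
    Vec y(A.n, 0.0);
    for (std::size_t i = 0; i < A.n; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.values[k] * x[A.col_idx[k]];
    return y;
}

inline double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Jacobi preconditioner: M^{-1} = diag(A)^{-1}.
inline Vec inverse_diagonal(const CsrMatrix& A) {
    Vec inv(A.n, 1.0);
    for (std::size_t i = 0; i < A.n; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            if (A.col_idx[k] == i) inv[i] = 1.0 / A.values[k];
    return inv;
}

// Preconditioned CG; x holds the initial guess on entry and the solution on
// exit. Stops when ||r|| <= tol * ||b||. Returns the iteration count.
inline std::size_t pcg_jacobi(const CsrMatrix& A, const Vec& b, Vec& x,
                              double tol, std::size_t max_iter) {
    const Vec m_inv = inverse_diagonal(A);
    Vec r = b;
    const Vec Ax = spmv(A, x);
    for (std::size_t i = 0; i < A.n; ++i) r[i] -= Ax[i];
    Vec z(A.n);
    for (std::size_t i = 0; i < A.n; ++i) z[i] = m_inv[i] * r[i];
    Vec p = z;
    double rz = dot(r, z);
    const double b_norm = std::sqrt(dot(b, b));
    const double stop = tol * (b_norm > 0.0 ? b_norm : 1.0);
    std::size_t it = 0;
    while (it < max_iter && std::sqrt(dot(r, r)) > stop) {
        const Vec Ap = spmv(A, p);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < A.n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        for (std::size_t i = 0; i < A.n; ++i) z[i] = m_inv[i] * r[i];
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        for (std::size_t i = 0; i < A.n; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
        ++it;
    }
    return it;
}
```

GMRES and BiCGSTAB follow the same pattern of repeated sparse products and preconditioner applications, but also handle nonsymmetric matrices.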
- Architecture refactoring
- Five-layer architecture finalization
- Numerical computing enhancements
- Precision management
- Comprehensive testing and validation
- Test coverage improvements
- Transformer and inference optimization
- Optimized GEMM operations
- Error handling and recovery
- Timeout management
- Circuit breaker patterns
- Retry logic with backoff
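
The retry entry above can be pictured with a small sketch of capped exponential backoff. This is an assumption about the general technique, not the project's actual interface: `RetryPolicy`, `backoff_delay`, and `retry_with_backoff` are hypothetical names, and the defaults are arbitrary.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical policy: all names and defaults are illustrative only.
struct RetryPolicy {
    int max_attempts = 5;
    std::chrono::milliseconds initial_delay{10};
    double multiplier = 2.0;
    std::chrono::milliseconds max_delay{1000};
};

// Delay before retry number `retry` (0-based):
// initial_delay * multiplier^retry, capped at max_delay.
inline std::chrono::milliseconds backoff_delay(const RetryPolicy& p, int retry) {
    double ms = static_cast<double>(p.initial_delay.count());
    for (int i = 0; i < retry; ++i) ms *= p.multiplier;
    ms = std::min(ms, static_cast<double>(p.max_delay.count()));
    return std::chrono::milliseconds(
        static_cast<std::chrono::milliseconds::rep>(ms));
}

// Calls `op` until it returns true or the attempts are exhausted,
// sleeping between attempts. Returns true on success.
inline bool retry_with_backoff(const std::function<bool()>& op,
                               const RetryPolicy& p) {
    for (int attempt = 0; attempt < p.max_attempts; ++attempt) {
        if (op()) return true;
        if (attempt + 1 < p.max_attempts)
            std::this_thread::sleep_for(backoff_delay(p, attempt));
    }
    return false;
}
```

A circuit breaker would typically wrap a helper like this and stop calling `op` altogether after repeated failures; adding random jitter to the delay is a common refinement that the sketch omits.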
- Production hardening
- Memory pool improvements
- Error reporting enhancements
- Extended algorithm support
- Additional parallel primitives
- Comprehensive enhancement
- Feature parity across modules
- New algorithm implementations
- Performance optimizations
- Testing and quality infrastructure
- Comprehensive test coverage
- Documentation system
- API reference generation
- Developer experience improvements
- CMake build enhancements
- Benchmarking and testing
- Performance benchmarks
- Performance and training optimizations
- Training-specific utilities
- Fault tolerance
- Checkpoint and recovery
- Multi-node support
- MPI integration (MpiContext, rank discovery)
- Topology detection (NIC enumeration, RDMA capability)
- Cross-node communicators (MultiNodeContext, HierarchicalAllReduce)
- Build system improvements
- Auto-detect CPU cores for parallel builds
- Ninja generator support
- NCCL integration
- Multi-GPU collectives
- Toolchain upgrade
- Modern C++ support
- Multi-GPU support
- Device memory pools
- Five-layer architecture
  - Layer 0: cuda::memory - Buffer, unique_ptr, MemoryPool
  - Layer 1: cuda::device - Pure device kernels
  - Layer 2: cuda::algo - Algorithm wrappers
  - Layer 3: cuda::api - High-level API
- Core algorithms: reduce, scan, sort
- Image processing: brightness, gaussian_blur, sobel_edge
- Matrix operations: add, mult
- Convolution: 2D convolution