Vui Seng Chua vuiseng9

in/vuiseng9

vuiseng9/README.md

Distributed & Parallel

Megatron, Transformed! A Hands-on Tutorial on Replicating Empirical Trends in Distributed Training and Model Parallelism. 60+ runs across DP → ZeRO → TP → SP → CP → PP → VPP → EP.
MLPerf Training Rundowns: v5.1 (Nov'25) on Llama 3.1-8B, Flux.1, and v6.0 (June'26) on newly-added MoE DeepSeek-V3, GPT-OSS.
(In Progress) GPU-Initiated EP Comm: MoE dispatch/combine with PyTorch Symmetric Memory.

Narrow Precision Training

Quantized Training in FP4 and FP8: Concepts and PyTorch Implementation using cuBLASLt and Microxcaling.
Proof of concept: nvfp4 forward + mxfp8 backward recipe in Transformer Engine, faster than nvfp4-QAT.

Modeling Front

moe-lab: Rigorous MoE design ablations you can run at home. Notably, DeepSeek-V3 Router load-biasing turns out to be surprisingly effective. Implemented autograd for F.grouped_mm, >20% speedup on average.

Model Optimization for Efficient Inference

Post-Training Statistical Calibration for Higher Activation Sparsity, [ENLSP 2024 Spotlight 7, Paper, Oral, Code, Integrated]
Pre-LLM explosion — Unified HuggingFace Trainer for Joint Pruning, Quantization, and Distillation (JPQD), integrating OpenVINO NNCF and runtime. 16× more BERT serving throughput on Xeon Sapphire Rapids. See MLPerf Inference 3.0 submission. Applicable to vision, audio models.

Perhaps useful: dlbp, dockerhub, HuggingFace

Pinned

fp4-training Public

mxfp8/nvfp4 training, from concept to implementation (cuBLASLt + Microxcaling).

Python 3
megatron-tutorials Public

Hands-on Megatron-LM tutorials on ablating parallelism and scaling trends. DP → ZeRO → TP → SP → CP → PP → VPP → EP

Shell 2
mlperf-t6.0-rundown Public

Plots & Takeaways from MLPerf Training v6.0 on new MoE workloads (DeepSeek-V3, GPT-OSS), scaling efficiency and MXFP4 recipe debut.

Python
ep-comm Public

Implementation of MoE & Expert-Parallel (EP) communication using PyTorch Symmetric Memory.
faster-qat Public

Revisiting QAT: QAT vs. native NVFP4/MXFP8 fine-tuning.

Dockerfile
moe-lab Public

Hands-on MoE ablations (load balancing, resolution, shared experts, etc.) using a subclassed HF Transformer on a single GPU.

Python