vuiseng9/README.md

Distributed & Parallel

  • Megatron, Transformed! A Hands-on Tutorial on Replicating Empirical Trends in Distributed Training and Model Parallelism. 60+ runs across DP → ZeRO → TP → SP → CP → PP → VPP → EP.
  • MLPerf Training Rundowns: v5.1 (Nov'25) on Llama 3.1-8B and Flux.1, and v6.0 (June'26) on the newly added MoE workloads DeepSeek-V3 and GPT-OSS.
  • (In Progress) GPU-Initiated EP Comm: MoE dispatch/combine with PyTorch Symmetric Memory.
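For readers new to the dispatch/combine terminology used above, here is a minimal single-process sketch in plain Python. It assumes top-1 routing and uses hypothetical helper names (`dispatch`, `combine`, `moe_forward`); it only illustrates the data movement pattern and says nothing about how the ep-comm project or PyTorch Symmetric Memory implement it. In a real expert-parallel setup the two steps would be cross-GPU exchanges rather than local list operations.

```python
from collections import defaultdict


def dispatch(tokens, expert_ids):
    """Group tokens by their assigned expert, remembering each original position."""
    buckets = defaultdict(list)
    for pos, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid].append((pos, tok))
    return buckets


def combine(buckets, expert_fns, num_tokens):
    """Run each expert on its bucket and scatter results back to token order."""
    out = [None] * num_tokens
    for eid, items in buckets.items():
        for pos, tok in items:
            out[pos] = expert_fns[eid](tok)
    return out


def moe_forward(tokens, expert_ids, expert_fns):
    """Top-1 MoE layer: dispatch, apply experts, combine (illustrative only)."""
    return combine(dispatch(tokens, expert_ids), expert_fns, len(tokens))


# Two toy "experts" standing in for expert MLPs (hypothetical).
experts = [lambda x: x + 1, lambda x: x * 10]
result = moe_forward([1, 2, 3, 4], [0, 1, 1, 0], experts)
```

Tokens 1 and 4 go to expert 0, tokens 2 and 3 go to expert 1, and the outputs come back in the original token order.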

Narrow Precision Training

Modeling Front

  • moe-lab: Rigorous MoE design ablations you can run at home. Notably, DeepSeek-V3 router load-biasing turns out to be surprisingly effective. Also implements autograd for F.grouped_mm, for a >20% speedup on average.
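As a rough illustration of the router load-biasing idea mentioned above, the sketch below follows the general scheme described for DeepSeek-V3: a per-expert bias is added to the routing scores for expert selection only, and after each batch the bias is nudged down for overloaded experts and up for underloaded ones. The function names, the fixed `step` size, and the plain-list data layout are assumptions for illustration, not moe-lab's code.

```python
def route_top_k(scores, bias, k):
    """Select k experts per token by (score + bias); returns expert ids per token."""
    chosen = []
    for tok_scores in scores:
        biased = [s + b for s, b in zip(tok_scores, bias)]
        order = sorted(range(len(biased)), key=lambda e: biased[e], reverse=True)
        chosen.append(order[:k])
    return chosen


def update_bias(bias, chosen, step=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    num_experts = len(bias)
    load = [0] * num_experts
    for ids in chosen:
        for e in ids:
            load[e] += 1
    mean_load = sum(load) / num_experts
    new_bias = []
    for b, l in zip(bias, load):
        if l > mean_load:
            new_bias.append(b - step)
        elif l < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias, load


# Toy batch: every token prefers expert 0 (hypothetical scores).
scores = [[0.9, 0.1, 0.0]] * 4
bias = [0.0, 0.0, 0.0]
chosen = route_top_k(scores, bias, k=1)
# An exaggerated step so the effect shows after a single update.
bias, load = update_bias(bias, chosen, step=0.5)
chosen_after = route_top_k(scores, bias, k=1)
```

The bias only influences which experts are picked. In the scheme as described for DeepSeek-V3, the gating weights applied to expert outputs are still computed from the unbiased scores, which this sketch leaves out.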

Model Optimization for Efficient Inference

Perhaps useful: dlbp, dockerhub, HuggingFace

Pinned

  1. fp4-training (Public)

    MXFP8/NVFP4 training, from concept to implementation (cuBLASLt + Microxcaling).

    Python 3

  2. megatron-tutorials (Public)

    Hands-on Megatron-LM tutorials on ablating parallelism and scaling trends. DP → ZeRO → TP → SP → CP → PP → VPP → EP

    Shell 2

  3. mlperf-t6.0-rundown (Public)

    Plots & Takeaways from MLPerf Training v6.0 on new MoE workloads (DeepSeek-V3, GPT-OSS), scaling efficiency, and the MXFP4 recipe debut.

    Python

  4. ep-comm (Public)

    Implementation of MoE & Expert-Parallel (EP) communication using PyTorch Symmetric Memory.

  5. faster-qat (Public)

    Revisiting QAT: QAT vs. native NVFP4/MXFP8 fine-tuning.

    Dockerfile

  6. moe-lab (Public)

    Hands-on MoE ablations (load balancing, resolution, shared experts, etc.) using a subclassed HF Transformer on a single GPU.

    Python