Atandra Bharati atandra2000

Atandra Bharati

Deep Learning Research Engineer rebuilding frontier AI architectures from scratch — LLMs, latent diffusion, multimodal, video understanding. PyTorch-first, single-GPU heroics, paper-faithful reproductions.

🧭 Open to roles

Applied ML · ML Research · Research Engineer · GenAI Engineering. Remote-friendly; available worldwide.

🛠️ Core stack

Languages & core ML

Architectures: Transformers · GQA · MLA · RoPE · SwiGLU · RMSNorm · MoE · Gated Delta Net · MTP · Diffusion UNet · VAE · GAN · CycleGAN · ST-GCN

Optimization & numerics: BF16 · Flash Attention 2 · torch.compile · Gradient checkpointing · μP scaling · WSD LR · Chunked cross-entropy · Disk-backed token caching

Hardware validated: A100 80GB · RTX 5090 · RTX 6000 Ada · RTX 3090 · P100 · T4 (2×)

🔭 Now

Shipping the Autonomous ML Research Engineer platform (15 phases, 23 agents) and exploring a paper on mixture-of-depths routing for sub-1B parameter LLMs.

Highlights

78% peak memory reduction (92 GB → 20 GB) for LLM pretraining via gradient checkpointing, chunked cross-entropy, and disk-backed token caching — enabling 2× batch-size headroom on a single A100 80GB.
Best training loss 0.0947 (epoch 16 of a 42-epoch run) on Stable Diffusion 1.x (860M UNet), trained from scratch across a 7-phase curriculum on 2× RTX 5090.
878 passing tests, 15 cooperating phases, 23 agents, 61 tools, 186 models in the Autonomous ML Research Engineer platform — a full research loop from paper to conclusions, with self-repair and provider-agnostic LLM routing.
12 end-to-end projects spanning LLMs, generative vision, multimodal AI, and video — every project engineered for single-GPU feasibility.
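
The chunked cross-entropy mentioned in the first highlight can be sketched as follows. This is a dependency-free illustration of the idea, not the code from the repos: logits are computed and reduced one chunk of tokens at a time, so the full tokens × vocabulary logit matrix never exists at once. All function names and the toy data are hypothetical, and a real PyTorch version would also need to handle the backward pass.

```python
import math


def logits_for(hidden_row, weight):
    # Project one hidden state onto the vocabulary: one logit per weight row.
    return [sum(h * w for h, w in zip(hidden_row, w_row)) for w_row in weight]


def token_nll(logits, target):
    # Negative log-likelihood of `target` via a numerically stable log-sum-exp.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]


def full_cross_entropy(hidden, weight, targets):
    # Baseline: materialize logits for every token before reducing.
    all_logits = [logits_for(h, weight) for h in hidden]
    return sum(token_nll(z, t) for z, t in zip(all_logits, targets)) / len(hidden)


def chunked_cross_entropy(hidden, weight, targets, chunk_size):
    # Same mean loss, but at most chunk_size rows of logits are alive at a time.
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        chunk_logits = [logits_for(h, weight) for h in hidden[start:stop]]
        total += sum(token_nll(z, t) for z, t in zip(chunk_logits, targets[start:stop]))
    return total / len(hidden)


# Toy data (illustrative only): 6 tokens, hidden size 3, vocabulary of 5.
hidden = [[math.sin(i + j) for j in range(3)] for i in range(6)]
weight = [[math.cos(v * (j + 1)) for j in range(3)] for v in range(5)]
targets = [i % 5 for i in range(6)]

if __name__ == "__main__":
    print(full_cross_entropy(hidden, weight, targets))
    print(chunked_cross_entropy(hidden, weight, targets, chunk_size=2))
```

The two functions return the same mean loss up to floating-point rounding; the saving is in peak memory, which scales with `chunk_size` × vocabulary instead of tokens × vocabulary.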

Projects

Category	Project	Highlight	Stack / hardware	Repo
Architecture	DeepSeek-v3-Lite (422M)	MLA + aux-loss-free MoE + MTP, end-to-end with inference absorption	PyTorch · μP · 8.4B-token Chinchilla recipe	→
Architecture	LLaMA-3-Lite (515M)	GQA · RoPE · fused SwiGLU · RMSNorm · Flash-Attn 2 · chunked CE	PyTorch · BF16 · A100 80GB	→
Architecture	FusionLLM (415.6M active / 868.6M stored)	MLA + Gated Delta Net + MoE + MTP in a 24-layer hybrid	PyTorch · NorMuon + CautiousAdamW · WSD + μP	→
Generative vision	Stable Diffusion 1.x (860M UNet)	Best loss 0.0947 at epoch 16; 42-epoch run	PyTorch · BF16 · 2× RTX 5090	→
Generative vision	FaceAgingCycleGAN (AdaIN-conditioned)	31 epochs on IMDB-WIKI; per-layer age conditioning, 3-scale PatchGAN	PyTorch · RTX 6000 Ada	→
Generative vision	FaceGenerationVAE (β-VAE)	50 epochs on CelebA; recon MSE 0.0152, KL annealing 0→1	PyTorch · bilinear-upsample decoder	→
Generative vision	DCGAN-Face-Generation	50 epochs on 202k CelebA; D loss → ln 2 ≈ 0.693 equilibrium	PyTorch · 2× T4	→
Multimodal	VisionLangModel (PaliGemma-style)	Trained end-to-end on COCO 2014 captions; zero pre-trained weights	PyTorch · P100	→
NLP	TranslationLM (EN→IT seq2seq)	20 epochs on OPUS Books; cross-attention visualizations, custom SentencePiece BPE	PyTorch · T4	→
Foundations	GPT-From-Scratch	200-line educational GPT-2, trained on Tiny Shakespeare	PyTorch	→
Agentic / research infra	Autonomous ML Research Engineer	15-phase multi-agent platform: paper → plan → patch → train → evaluate → iterate → report	PyTorch · Ollama Cloud · multi-agent · 878 tests	→
In progress	ActionRecognition (ST-GCN)	Pose + ST-GCN pipeline ready; NTU RGB+D 120 benchmark pending	PyTorch	→
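
Two figures in the table can be sanity-checked with quick arithmetic. The sketch below assumes the DCGAN discriminator loss is reported as the mean of the real and fake binary cross-entropy terms, and that "Chinchilla recipe" refers to the common rule of thumb of roughly 20 training tokens per parameter; neither assumption is confirmed by the table itself.

```python
import math


def bce(prob, label):
    # Binary cross-entropy for one prediction `prob` against a 0/1 `label`.
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


# DCGAN row: at equilibrium the discriminator cannot tell real from fake,
# so it outputs 0.5 for every input.
real_term = bce(0.5, 1)
fake_term = bce(0.5, 0)
d_loss_mean = (real_term + fake_term) / 2  # equals ln 2 when the terms are averaged
d_loss_sum = real_term + fake_term         # equals 2 ln 2 if an implementation sums them

# DeepSeek-v3-Lite row: token budget under the assumed 20 tokens/parameter heuristic.
TOKENS_PER_PARAM = 20  # assumption, not stated in the table
params = 422e6
token_budget = params * TOKENS_PER_PARAM

if __name__ == "__main__":
    print(f"D loss (mean of terms): {d_loss_mean:.3f}")
    print(f"Token budget: {token_budget / 1e9:.2f}B")
```

Under these assumptions the averaged discriminator loss matches the ln 2 ≈ 0.693 quoted for DCGAN-Face-Generation, and the token budget lands at about 8.4B for a 422M-parameter model.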

Writing

"Multi-Head Latent Attention — A Technical Deep-Dive" — 643-line reference covering KV-cache math, low-rank compression algebra, the absorption-trick derivation, decoupled RoPE mechanics, and SDPA vs manual attention path trade-offs in DeepSeek-V2/V3. (read)

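As a taste of the KV-cache math that article covers, here is a small sketch comparing per-token cache size for standard multi-head attention, GQA, and MLA. The formulas follow the usual accounting (keys plus values per head for MHA and GQA; one compressed latent plus one decoupled RoPE key for MLA). The function names and the shapes in the demo are hypothetical, chosen to be roughly DeepSeek-V2-like, and are not taken from the article.

```python
def kv_elems_mha(n_heads, head_dim):
    # Multi-head attention caches one key and one value vector per head.
    return 2 * n_heads * head_dim


def kv_elems_gqa(n_kv_heads, head_dim):
    # Grouped-query attention caches keys and values only for the shared KV heads.
    return 2 * n_kv_heads * head_dim


def kv_elems_mla(latent_dim, rope_dim):
    # MLA caches the compressed KV latent plus a single decoupled RoPE key.
    return latent_dim + rope_dim


def cache_bytes(elems_per_token_per_layer, n_layers, seq_len, bytes_per_elem=2):
    # Total cache size for one sequence; 2 bytes per element corresponds to BF16.
    return elems_per_token_per_layer * n_layers * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Illustrative shapes only (assumptions, not figures from the article).
    mha = kv_elems_mha(n_heads=128, head_dim=128)
    gqa = kv_elems_gqa(n_kv_heads=8, head_dim=128)
    mla = kv_elems_mla(latent_dim=512, rope_dim=64)
    print("elements per token per layer:", mha, gqa, mla)
    print("MHA / MLA ratio:", round(mha / mla, 1))
    print("MLA cache bytes, 60 layers, 4096 tokens:", cache_bytes(mla, 60, 4096))
```

The comparison only counts cached elements; it says nothing about the extra up-projection compute MLA needs at decode time, which is what the absorption trick addresses.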