MS Data Science @ Northeastern University. I build end-to-end ML systems, from ETL pipelines and predictive models to production-deployed LLM applications. Focused on retail analytics, agentic AI, and turning messy data into decisions.
ListingLens — Amazon Seller Intelligence Platform
Multi-stage NLP pipeline processing 250 reviews/product · BERT sentiment scoring · XGBoost return risk classifier (96.5% acc, 0.997 ROC-AUC) · RAG pipeline with FAISS + LLaMA 3 70B · Deployed on Railway + Vercel
Distributed Backtesting Engine — Algorithmic Trading
PySpark parallel framework · 123 strategies × 100 S&P 500 stocks · 12,300 backtests on 303K real market records · 5-step data governance pipeline · 9-panel Plotly BI dashboards
Spotify Breakout Predictor — Viral Music Classification
99.2% accuracy · 0.998 ROC-AUC · temporal + 5-fold cross-validation · TikTok views as dominant predictor (41% importance) · Interactive Streamlit dashboard
Languages · Python · R · SQL · Java · JavaScript
ML / AI · Scikit-learn · XGBoost · PyTorch · TensorFlow · HuggingFace · LangChain · FAISS · RAG Pipelines
Data Engineering · PySpark · ETL · Data Warehousing · Snowflake · MySQL · PostgreSQL
Deployment · FastAPI · Next.js · Railway · Vercel · AWS (Certified Cloud Practitioner, CLF-C02)
Data Science Associate · Compatible Solutions (Jul 2024 – Jun 2025)
- Built ETL pipelines in Python and SQL processing 100K+ records from CRM and transactional systems, enabling faster analytics for business teams
- Engineered 15+ behavioral features (purchase frequency, recency scores, seasonality indices), improving forecast accuracy by 20% over legacy baseline
- Designed 5+ KPI dashboards using Matplotlib and Seaborn, reducing manual reporting and supporting data-driven stakeholder decisions
- Implemented data validation pipelines achieving 95%+ data quality scores across BI reporting systems
Data Science Intern · Yhills / IIT Hyderabad (Mar 2023 – May 2023)
- Built ML models for H1N1 vaccine prediction (84% accuracy) and NYC taxi fare prediction (RMSE: $3.20) using Random Forest and XGBoost on 50K+ records
- Engineered 25+ features from temporal, geographic, and demographic data, improving model performance by 30%
- Communicated insights via visual dashboards to non-technical stakeholders using Matplotlib and Seaborn
- LLM applications & agentic workflows
- Scalable ML systems & deep learning
- Distributed data processing (Spark + cloud)

