Het415/README.md

Het Prajapati

Data Scientist · ML Engineer · NLP & Agentic AI

LinkedIn · GitHub · Email · Boston


MS Data Science @ Northeastern University. I build end-to-end ML systems, from ETL pipelines and predictive models to production-deployed LLM applications. Focused on retail analytics, agentic AI, and turning messy data into decisions.


Projects

ListingLens — Amazon Seller Intelligence Platform

Multi-stage NLP pipeline processing 250 reviews/product · BERT sentiment scoring · XGBoost return risk classifier (96.5% acc, 0.997 ROC-AUC) · RAG pipeline with FAISS + LLaMA 3 70B · Deployed on Railway + Vercel

Distributed Backtesting Engine — Algorithmic Trading

PySpark parallel framework · 123 strategies × 100 S&P 500 stocks · 12,300 backtests on 303K real market records · 5-step data governance pipeline · 9-panel Plotly BI dashboards

Spotify Breakout Predictor — Viral Music Classification

99.2% accuracy · 0.998 ROC-AUC · temporal + 5-fold cross-validation · TikTok views as dominant predictor (41% importance) · Interactive Streamlit dashboard


Stack

Languages · Python · R · SQL · Java · JavaScript

ML / AI · Scikit-learn · XGBoost · PyTorch · TensorFlow · HuggingFace · LangChain · FAISS · RAG Pipelines

Data Engineering · PySpark · ETL · Data Warehousing · Snowflake · MySQL · PostgreSQL

Deployment · FastAPI · Next.js · Railway · Vercel · AWS (CLF-C02)


Experience

Data Science Associate · Compatible Solutions (Jul 2024 – Jun 2025)

  • Built ETL pipelines in Python and SQL processing 100K+ records from CRM and transactional systems, enabling faster analytics for business teams
  • Engineered 15+ behavioral features (purchase frequency, recency scores, seasonality indices), improving forecast accuracy by 20% over legacy baseline
  • Designed 5+ KPI dashboards using Matplotlib and Seaborn, reducing manual reporting and supporting data-driven stakeholder decisions
  • Implemented data validation pipelines achieving 95%+ data quality scores across BI reporting systems

Data Science Intern · Yhills / IIT Hyderabad (Mar 2023 – May 2023)

  • Built ML models for H1N1 vaccine prediction (84% accuracy) and NYC taxi fare prediction (RMSE: $3.20) using Random Forest and XGBoost on 50K+ records
  • Engineered 25+ features from temporal, geographic, and demographic data, improving model performance by 30%
  • Communicated insights via visual dashboards to non-technical stakeholders using Matplotlib and Seaborn

Currently Exploring

  • LLM applications & agentic workflows
  • Scalable ML systems & deep learning
  • Distributed data processing (Spark + cloud)

Pinned

  1. listinglens Public

    AI-powered Amazon seller intelligence platform — BERT sentiment, XGBoost return risk prediction, and RAG chatbot grounded in real customer reviews.

    TypeScript 1

  2. algorithmic-trading-backtest Public

    Distributed trading strategy backtesting with PySpark - 12,300 backtests on 100 stocks

    Jupyter Notebook

  3. Spotify_predictor_enhanced Public

    A machine learning tool to predict upcoming Spotify breakout artists using TikTok and Shazam data. Features a Random Forest backend and a Streamlit frontend.

    Jupyter Notebook