This repository is a hands-on machine learning lab built around Jupyter notebooks. It covers the path from first-principles algorithm implementation to applied projects using scikit-learn, NLTK, Keras, and TensorFlow.
The notebooks are organized by topic: regression, classification, decision trees, random forests, KNN, Naive Bayes, SVM, NLP, neural networks, TensorFlow, and end-to-end prediction projects. Several notebooks implement core algorithms manually and then compare the result with library implementations, making the repository useful for both learning the mathematics and practicing real modeling workflows.
- 49 Jupyter notebooks across classical ML, NLP, and neural network topics
- 20+ bundled CSV datasets and prediction outputs
- Manual implementations for linear regression, gradient descent, decision tree splitting, Naive Bayes, and neural network forward/backward passes
- Applied projects for Titanic survival prediction, Twitter airline sentiment analysis, Boston/CCPP-style regression, MNIST digit classification, and breast cancer classification
- Uses `numpy`, `pandas`, `matplotlib`, `scikit-learn`, `nltk`, `keras`, and TensorFlow v1-style APIs
flowchart LR
A["Data Exploration"] --> B["Preprocessing"]
B --> C["Model Training"]
C --> D["Evaluation"]
D --> E["Prediction Output"]
B --> B1["Scaling"]
B --> B2["Text Cleaning"]
B --> B3["Missing Value Handling"]
C --> C1["From-Scratch Algorithms"]
C --> C2["scikit-learn Models"]
C --> C3["Neural Networks"]
D --> D1["Accuracy"]
D --> D2["Confusion Matrix"]
D --> D3["Classification Report"]
D --> D4["Regression Score"]
.
|-- Classification Measures/
| |-- Confusion Matrix.ipynb
| `-- iris.csv
|-- Decision tree/
| |-- Code Using Sklearn Decision Tree.ipynb
| |-- Decision Tree Implementation.ipynb
| |-- DecisionTreeImplementation_Base File.ipynb
| |-- decision_tree_ta.ipynb
| `-- iris.pdf
|-- Feature Scaling/
| `-- Feature Scaling in Sklearn.ipynb
|-- KNN/
| |-- KNN.ipynb
| |-- Cross_Validation.ipynb
| `-- KNN_from_scratch.ipynb
|-- Keras/
| `-- Keras_Intro.ipynb
|-- Linear Regression/
| |-- Analysis of LR using dummy Data.ipynb
| |-- diabetes.ipynb
| |-- linear_regression_by_diffrentiation.ipynb
| `-- diabetes_train.csv / diabetes_test.csv
|-- Logistic Regression/
| `-- Logistic regression examples
|-- MultiVariable Regression and Gradient Descent/
| |-- Gradient Descent.ipynb
| `-- Complex Boundaries.ipynb
|-- NLP/
| |-- NLTK.ipynb
| `-- Movie_review.ipynb
|-- NLP-2/
| `-- Movie review classification notebooks
|-- Naive Bayes/
| `-- Naive Bayes from scratch and sklearn comparison
|-- Neural Network-2/
| `-- Neural network forward/backward propagation notebooks
|-- Neural Networks - 1/
| `-- MLP Classifier in Sklearn.ipynb
|-- Project - Logistic Regression/
| `-- Logistic Regression - Titanic Dataset.ipynb
|-- Project Twitter Sentiment Analysis/
| `-- Twitter US Airline Sentiment Analysis.ipynb
|-- Projects - Gradient Descent/
| `-- Boston and Combined Cycle Power Plant regression notebooks
|-- Random Forests/
| `-- Random forest and decision tree comparison notebooks
|-- SVM/
| `-- SVM decision-boundary notebooks
`-- Tensor Flow/
  |-- MNIST Tensorflow.ipynb
  |-- Digit prediction notebooks
  |-- input_data.py
  `-- MNIST_data/
| Area | What It Covers | Representative Files |
|---|---|---|
| Linear Regression | Closed-form slope/intercept, cost function, R2-style score, sklearn comparison | Linear Regression/linear_regression_by_diffrentiation.ipynb |
| Gradient Descent | Manual gradient descent loops, cost tracking, multivariable regression | MultiVariable Regression and Gradient Descent/Gradient Descent.ipynb, Projects - Gradient Descent/Gradient Descent - Boston Dataset.ipynb |
| Logistic Regression | Classification with sklearn logistic regression and prediction export | Project - Logistic Regression/Logistic Regression - Titanic Dataset.ipynb |
| Decision Trees | Entropy, information gain, categorical binning, sklearn tree usage | Decision tree/Decision Tree Implementation.ipynb, Decision tree/Code Using Sklearn Decision Tree.ipynb |
| Random Forests | Titanic data preprocessing, decision tree vs random forest comparison | Random Forests/Random Forest vs Decision Trees.ipynb |
| KNN | Breast cancer classification, cross-validation over neighbor counts | KNN/KNN.ipynb, KNN/Cross_Validation.ipynb |
| Naive Bayes | From-scratch probability tables, Laplace smoothing, sklearn comparison | Naive Bayes/Implementation of Naive Bayes .ipynb |
| SVM | SVM classification on Iris/dummy data, visual decision boundaries | SVM/SVM-Iris.ipynb, SVM/SVM_Dummy_data.ipynb |
| NLP | Tokenization, stopword removal, POS tagging, lemmatization, text classification | NLP/NLTK.ipynb, NLP-2/movie_review_by_sklearn.ipynb |
| Neural Networks | Forward propagation, hidden-layer experiments, MLPClassifier | Neural Network-2/forward_propagation.ipynb, Neural Networks - 1/MLP Classifier in Sklearn.ipynb |
| Keras | Dense neural network for breast cancer classification | Keras/Keras_Intro.ipynb |
| TensorFlow | TensorFlow v1-style variables, placeholders, MNIST digit prediction | Tensor Flow/MNIST Tensorflow.ipynb, Tensor Flow/Digit_prediction_using_neural_network.ipynb |
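As an illustration of the first row in the table, the closed-form slope and intercept and the R2-style score can be sketched in plain Python. This is a minimal sketch, not the notebook's code: the function names `fit_line` and `r2_score` and the toy data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r2_score(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical toy data lying close to y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
preds = [slope * x + intercept for x in xs]
print(slope, intercept, r2_score(ys, preds))
```

A score close to 1 means the fitted line explains most of the variance in `ys`; the same numbers can be cross-checked against scikit-learn's `LinearRegression`.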
Location: Project - Logistic Regression/
This notebook builds a logistic regression classifier for Titanic survival prediction. It performs categorical conversion for gender and embarked port, fills missing age values, removes high-cardinality/non-numeric columns, trains a LogisticRegression model, and writes predictions to output.csv.
Key ideas:
- Binary classification
- Missing value handling
- Basic categorical encoding
- Prediction export
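The preprocessing steps described above can be sketched without any libraries. This is only an illustration under stated assumptions: the column names `Sex`, `Embarked`, and `Age`, the integer codes, and mean imputation for missing ages are all choices made for the example, and the notebook may handle them differently.

```python
# Hypothetical rows shaped like Titanic records; None marks a missing value.
rows = [
    {"Sex": "male", "Embarked": "S", "Age": 22.0},
    {"Sex": "female", "Embarked": "C", "Age": None},
    {"Sex": "female", "Embarked": "Q", "Age": 30.0},
]

SEX_CODES = {"male": 0, "female": 1}      # assumed mapping
PORT_CODES = {"S": 0, "C": 1, "Q": 2}     # assumed mapping


def preprocess(rows):
    """Encode categorical columns as integers and fill missing ages with the mean age."""
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    mean_age = sum(known_ages) / len(known_ages)
    features = []
    for r in rows:
        age = r["Age"] if r["Age"] is not None else mean_age
        features.append([SEX_CODES[r["Sex"]], PORT_CODES[r["Embarked"]], age])
    return features


X = preprocess(rows)
print(X)
```

The resulting numeric rows are the kind of input a `LogisticRegression` model expects; in a real run the mean age should be computed on the training file and reused for the test file.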
Location: Project Twitter Sentiment Analysis/
This project classifies airline-related tweets by sentiment. It cleans raw tweet text, removes stopwords and punctuation, applies POS-aware lemmatization, vectorizes text with TF-IDF n-grams, and trains SVM / Multinomial Naive Bayes classifiers.
Key ideas:
- NLP preprocessing with NLTK
- Lemmatization and POS tagging
- `TfidfVectorizer` with n-grams
- Text classification with SVM and Naive Bayes
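The cleaning and n-gram steps can be sketched in plain Python. This is a simplified illustration: the stopword set is a tiny hypothetical stand-in for NLTK's English list, lemmatization is skipped, and no TF-IDF weights are computed; it only shows which unigram and bigram features a vectorizer would count.

```python
import string

# Tiny hypothetical stopword set; a real run would presumably use NLTK's list.
STOPWORDS = {"the", "a", "is", "was", "to", "my", "and"}


def clean_tokens(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def ngrams(tokens, n):
    """Return contiguous n-token sequences joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tweet = "The flight was delayed, and my bag is lost!"
tokens = clean_tokens(tweet)
features = tokens + ngrams(tokens, 2)
print(features)
```

With scikit-learn, passing `ngram_range=(1, 2)` to `TfidfVectorizer` produces unigram and bigram features of this shape and additionally weights them by TF-IDF.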
Location: Projects - Gradient Descent/
These notebooks apply gradient descent to regression datasets such as Boston-style housing data and Combined Cycle Power Plant data. They demonstrate how model parameters are iteratively updated, how cost changes during training, and how predictions are saved.
Key ideas:
- Batch gradient descent
- Multivariable linear regression
- Cost minimization
- Regression prediction files
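The update loop behind these ideas can be sketched as follows. This is a minimal batch gradient descent for multivariable linear regression with a mean-squared-error cost; the learning rate, epoch count, and noiseless toy data are assumptions chosen so the example converges, not values taken from the notebooks.

```python
def predict(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def cost(X, y, weights, bias):
    """Mean squared error over the whole dataset."""
    return sum((predict(weights, bias, x) - t) ** 2 for x, t in zip(X, y)) / len(X)


def gradient_descent(X, y, learning_rate=0.05, epochs=2000):
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    history = []  # cost after each full pass, for plotting or inspection
    for _ in range(epochs):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        # Gradients of the MSE cost with respect to each weight and the bias
        grad_w = [2 / n * sum(e * x[j] for e, x in zip(errors, X)) for j in range(d)]
        grad_b = 2 / n * sum(errors)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
        history.append(cost(X, y, weights, bias))
    return weights, bias, history


# Hypothetical noiseless data generated from y = 3*x1 - 2*x2 + 1
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [3 * a - 2 * b + 1 for a, b in X]
weights, bias, history = gradient_descent(X, y)
print(weights, bias, history[0], history[-1])
```

Every update uses the full dataset, which is what makes this "batch" gradient descent. On real data with features on different scales, scaling the inputs first usually allows a larger learning rate.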
Location: Tensor Flow/
The TensorFlow notebooks work with the bundled MNIST gzip files and input_data.py. They use TensorFlow v1-style placeholders, variables, sessions, and softmax classification to predict handwritten digits.
Key ideas:
- TensorFlow graph execution
- Placeholders and variables
- Softmax classification
- MNIST image data loading
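The softmax step can be shown in isolation. The sketch below is plain Python rather than TensorFlow graph code, and the ten scores are hypothetical; it only demonstrates how raw class scores become probabilities and a predicted digit.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the ten digit classes 0-9
logits = [0.1, 0.3, 2.5, 0.0, -1.2, 0.4, 0.2, 1.1, 0.0, -0.5]
probs = softmax(logits)
predicted_digit = probs.index(max(probs))
print(predicted_digit, round(sum(probs), 6))
```

In the TensorFlow v1 style the same computation runs inside the graph and is evaluated through a session, with the scores coming from the network's output layer.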
Locations: KNN/, Keras/, Logistic Regression/, Neural Networks - 1/
Multiple notebooks use the sklearn breast cancer dataset to compare classical and neural-network approaches, including KNN, logistic regression, Keras dense networks, and sklearn MLP.
Key ideas:
- Train/test split
- Standard scaling
- KNN neighbor search
- Dense neural networks
- Model evaluation
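Standard scaling and the KNN neighbor search can be sketched together. The code is a from-scratch illustration, not the notebooks' scikit-learn code: the helper names, the six training points, and the placeholder labels 0 and 1 are assumptions made for the example.

```python
import math
from collections import Counter


def standardize(train, test):
    """Scale features to zero mean and unit variance using training statistics only."""
    n, d = len(train), len(train[0])
    means = [sum(row[j] for row in train) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in train) / n) for j in range(d)]

    def scale(rows):
        return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in rows]

    return scale(train), scale(test)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points nearest to the query."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical two-feature data on very different scales; labels are placeholders
train_X = [[1.0, 200.0], [1.2, 220.0], [0.9, 210.0], [3.0, 800.0], [3.2, 790.0], [2.9, 820.0]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[1.1, 215.0], [3.1, 805.0]]

train_scaled, test_scaled = standardize(train_X, test_X)
predictions = [knn_predict(train_scaled, train_y, q) for q in test_scaled]
print(predictions)
```

Scaling matters for KNN because Euclidean distance is otherwise dominated by whichever feature has the largest numeric range. The test rows are scaled with the training means and standard deviations so no information leaks from the test split.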
This repository is especially useful because several notebooks build ML logic manually before leaning on libraries:
- Linear regression slope, intercept, cost, and coefficient-of-determination style score
- Batch gradient descent for simple and multivariable regression
- Decision tree splitting with entropy and information gain
- Naive Bayes probability estimation with Laplace smoothing
- Neural network forward propagation with NumPy
- Basic hidden-layer neural network training logic
That mix helps connect the math behind each model with the API-level workflow used in real projects.
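As one example from the list above, entropy and information gain for a candidate split can be computed in a few lines. This is a minimal sketch with hypothetical labels; the function names are chosen for the example and do not come from the notebooks.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted


labels = ["a", "a", "b", "b"]
perfect_split = [["a", "a"], ["b", "b"]]   # each child is pure
useless_split = [["a", "b"], ["a", "b"]]   # each child is as mixed as the parent
print(information_gain(labels, perfect_split), information_gain(labels, useless_split))
```

A decision tree built this way evaluates each candidate split and keeps the one with the highest gain; a split that leaves the class mix unchanged gains nothing.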
| Dataset / File | Used For |
|---|---|
| Classification Measures/iris.csv | Confusion matrix and classification metric practice |
| Linear Regression/data.csv | Simple linear regression from scratch |
| Linear Regression/diabetes_train.csv, diabetes_test.csv | Diabetes regression experiments |
| MultiVariable Regression and Gradient Descent/data.csv | Basic gradient descent experiments |
| Projects - Gradient Descent/boston_test.csv | Boston-style regression prediction |
| Project - Logistic Regression/titanic_train.csv, titanic_test.csv | Titanic survival classification |
| Random Forests/titanic.csv and split CSVs | Decision tree / random forest comparison |
| Project Twitter Sentiment Analysis/train.csv, test.csv | Airline sentiment classification |
| Tensor Flow/MNIST_data/*.gz | MNIST digit classification |
Several notebooks also use built-in scikit-learn datasets, including Iris, Breast Cancer Wisconsin, Boston-style housing data, and Diabetes.
Recommended:
- Python 3.7 or compatible Python 3.x environment
- Jupyter Notebook or JupyterLab
- Core packages: `numpy`, `pandas`, `matplotlib`, `scikit-learn`, `nltk`, `pydotplus`, `keras`, `tensorflow`
For the TensorFlow notebooks, the code uses TensorFlow v1-style APIs such as:

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

Use a TensorFlow version that supports `tensorflow.compat.v1`.
Clone the repository:

    git clone https://github.com/devthedevil/Machine-Learning.git
    cd Machine-Learning

Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate

Install common dependencies:

    pip install jupyter numpy pandas matplotlib scikit-learn nltk pydotplus keras tensorflow

Download common NLTK resources used by the NLP notebooks:

    import nltk
    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")
    nltk.download("averaged_perceptron_tagger")
    nltk.download("movie_reviews")

Start Jupyter:

    jupyter notebook

Then open any notebook from the topic folders.
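Because the repository has no pinned dependency file, a quick check that the packages are importable can save time before opening a notebook. The sketch below is a hypothetical helper, not part of the repository; the mapping from pip names to import names is an assumption based on the install command.

```python
import importlib.util

# Maps pip package names to the module names they are imported under.
PACKAGES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "nltk": "nltk",
    "pydotplus": "pydotplus",
    "keras": "keras",
    "tensorflow": "tensorflow",
}


def missing_packages(packages=PACKAGES):
    """Return pip names whose import module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
print("Missing:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates the module without importing it, so the check stays fast even for heavy packages such as TensorFlow.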
- Run notebooks from inside their own folder when they reference local files such as `train.csv`, `data.csv`, `boston_test.csv`, or `MNIST_data/`.
- Some notebooks generate output files such as `output.csv`, `twitter.csv`, and prediction CSVs.
- The TensorFlow notebooks depend on the local `Tensor Flow/input_data.py` helper and bundled `Tensor Flow/MNIST_data/` files.
- The repository does not currently include a `requirements.txt`; the package list above was inferred from notebook imports.
- The `CNN/Untitled.ipynb` notebook is currently empty.
For a smooth learning progression:
- Start with `Linear Regression/linear_regression_by_diffrentiation.ipynb`
- Move to `MultiVariable Regression and Gradient Descent/Gradient Descent.ipynb`
- Explore `Classification Measures/Confusion Matrix.ipynb`
- Study `Decision tree/Decision Tree Implementation.ipynb`
- Compare with `Decision tree/Code Using Sklearn Decision Tree.ipynb`
- Try `KNN/KNN.ipynb` and `Naive Bayes/Implementation of Naive Bayes .ipynb`
- Open `Project - Logistic Regression/Logistic Regression - Titanic Dataset.ipynb`
- Continue to `Project Twitter Sentiment Analysis/Twitter US Airline Sentiment Analysis.ipynb`
- Finish with `Keras/Keras_Intro.ipynb` and the TensorFlow MNIST notebooks
Across the notebooks, the project uses:
- Train/test splitting
- Cross-validation
- Confusion matrices
- Classification reports
- Accuracy scores
- Regression score comparisons
- Cost-function tracking during gradient descent
- Visual decision boundaries for SVM experiments
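Two of these measures, the confusion matrix and accuracy, are simple enough to compute by hand. The sketch below uses hypothetical binary labels and follows the same row and column convention as scikit-learn's `confusion_matrix` (rows are true classes, columns are predicted classes).

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes, in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm, accuracy(cm))
```

The off-diagonal cells show which classes are confused with each other, which a single accuracy number hides.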
- No centralized dependency file is included.
- Most work is notebook-based rather than packaged into reusable Python modules.
- Some notebooks use older APIs, including TensorFlow v1-style code and older scikit-learn defaults.
- Several notebooks depend on being executed from a specific folder because CSV paths are relative.
- The CNN notebook is empty and can be removed or replaced with a complete convolutional neural network example.
- Add `requirements.txt` or `environment.yml`
- Move reusable logic into Python modules under a `src/` folder
- Add notebook execution checks with `nbconvert`
- Convert major projects into clean scripts or pipelines
- Add exploratory data analysis sections to project notebooks
- Add saved visualizations for model comparison
- Modernize TensorFlow notebooks to TensorFlow 2 / Keras
- Add a completed CNN notebook for image classification
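The notebook execution check from the list above could start as a small script. This is a sketch under assumptions: the helper names and the `/tmp/executed-notebooks` output folder are hypothetical, and it relies on the `jupyter nbconvert --to notebook --execute` command being available on the path.

```python
import subprocess
from pathlib import Path


def find_notebooks(root):
    """All .ipynb files under root, skipping Jupyter checkpoint copies."""
    return sorted(
        p for p in Path(root).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in p.parts
    )


def nbconvert_command(notebook_name, output_dir):
    """Arguments that execute one notebook and write the executed copy to output_dir."""
    return [
        "jupyter", "nbconvert", "--to", "notebook", "--execute",
        "--output-dir", str(output_dir), notebook_name,
    ]


def run_all(root=".", output_dir="/tmp/executed-notebooks"):
    """Execute every notebook from its own folder; return the ones that failed."""
    failures = []
    for nb in find_notebooks(root):
        # cwd is the notebook's folder so relative CSV paths resolve
        result = subprocess.run(nbconvert_command(nb.name, output_dir), cwd=nb.parent)
        if result.returncode != 0:
            failures.append(nb)
    return failures


print(nbconvert_command("KNN.ipynb", "/tmp/executed-notebooks"))
```

Calling `run_all()` from the repository root would execute each notebook in turn. Notebooks with the same file name in different folders would overwrite each other in a single output folder, so a fuller version might mirror the folder structure.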
- Practical understanding of core supervised learning algorithms
- Ability to implement ML fundamentals from scratch
- Experience with data preprocessing, feature engineering, and model evaluation
- Familiarity with NLP workflows using NLTK and scikit-learn
- Exposure to neural-network workflows in Keras and TensorFlow
- Comfort working with Jupyter notebooks, CSV datasets, and iterative experiments