devthedevil/Machine-Learning

Machine Learning

Python Jupyter scikit-learn TensorFlow

This repository is a hands-on machine learning lab built around Jupyter notebooks. It covers the path from first-principles algorithm implementation to applied projects using scikit-learn, NLTK, Keras, and TensorFlow.

The notebooks are organized by topic: regression, classification, decision trees, random forests, KNN, Naive Bayes, SVM, NLP, neural networks, TensorFlow, and end-to-end prediction projects. Several notebooks implement core algorithms manually and then compare the result with library implementations, making the repository useful for both learning the mathematics and practicing real modeling workflows.

Repository Snapshot

  • 49 Jupyter notebooks across classical ML, NLP, and neural network topics
  • 20+ bundled CSV datasets and prediction outputs
  • Manual implementations for linear regression, gradient descent, decision tree splitting, Naive Bayes, and neural network forward/backward passes
  • Applied projects for Titanic survival prediction, Twitter airline sentiment analysis, Boston/CCPP-style regression, MNIST digit classification, and breast cancer classification
  • Uses numpy, pandas, matplotlib, scikit-learn, nltk, keras, and TensorFlow v1-style APIs

Learning Flow

flowchart LR
    A["Data Exploration"] --> B["Preprocessing"]
    B --> C["Model Training"]
    C --> D["Evaluation"]
    D --> E["Prediction Output"]

    B --> B1["Scaling"]
    B --> B2["Text Cleaning"]
    B --> B3["Missing Value Handling"]

    C --> C1["From-Scratch Algorithms"]
    C --> C2["scikit-learn Models"]
    C --> C3["Neural Networks"]

    D --> D1["Accuracy"]
    D --> D2["Confusion Matrix"]
    D --> D3["Classification Report"]
    D --> D4["Regression Score"]

Project Structure

.
|-- Classification Measures/
|   |-- Confusion Matrix.ipynb
|   `-- iris.csv
|-- Decision tree/
|   |-- Code Using Sklearn Decision Tree.ipynb
|   |-- Decision Tree Implementation.ipynb
|   |-- DecisionTreeImplementation_Base File.ipynb
|   |-- decision_tree_ta.ipynb
|   `-- iris.pdf
|-- Feature Scaling/
|   `-- Feature Scaling in Sklearn.ipynb
|-- KNN/
|   |-- KNN.ipynb
|   |-- Cross_Validation.ipynb
|   `-- KNN_from_scratch.ipynb
|-- Keras/
|   `-- Keras_Intro.ipynb
|-- Linear Regression/
|   |-- Analysis of LR using dummy Data.ipynb
|   |-- diabetes.ipynb
|   |-- linear_regression_by_diffrentiation.ipynb
|   `-- diabetes_train.csv / diabetes_test.csv
|-- Logistic Regression/
|   `-- Logistic regression examples
|-- MultiVariable Regression and Gradient Descent/
|   |-- Gradient Descent.ipynb
|   `-- Complex Boundaries.ipynb
|-- NLP/
|   |-- NLTK.ipynb
|   `-- Movie_review.ipynb
|-- NLP-2/
|   `-- Movie review classification notebooks
|-- Naive Bayes/
|   `-- Naive Bayes from scratch and sklearn comparison
|-- Neural Network-2/
|   `-- Neural network forward/backward propagation notebooks
|-- Neural Networks - 1/
|   `-- MLP Classifier in Sklearn.ipynb
|-- Project - Logistic Regression/
|   `-- Logistic Regression - Titanic Dataset.ipynb
|-- Project Twitter Sentiment Analysis/
|   `-- Twitter US Airline Sentiment Analysis.ipynb
|-- Projects - Gradient Descent/
|   `-- Boston and Combined Cycle Power Plant regression notebooks
|-- Random Forests/
|   `-- Random forest and decision tree comparison notebooks
|-- SVM/
|   `-- SVM decision-boundary notebooks
`-- Tensor Flow/
    |-- MNIST Tensorflow.ipynb
    |-- Digit prediction notebooks
    |-- input_data.py
    `-- MNIST_data/

Topic Guide

| Area | What It Covers | Representative Files |
|---|---|---|
| Linear Regression | Closed-form slope/intercept, cost function, R2-style score, sklearn comparison | Linear Regression/linear_regression_by_diffrentiation.ipynb |
| Gradient Descent | Manual gradient descent loops, cost tracking, multivariable regression | MultiVariable Regression and Gradient Descent/Gradient Descent.ipynb, Projects - Gradient Descent/Gradient Descent - Boston Dataset.ipynb |
| Logistic Regression | Classification with sklearn logistic regression and prediction export | Project - Logistic Regression/Logistic Regression - Titanic Dataset.ipynb |
| Decision Trees | Entropy, information gain, categorical binning, sklearn tree usage | Decision tree/Decision Tree Implementation.ipynb, Decision tree/Code Using Sklearn Decision Tree.ipynb |
| Random Forests | Titanic data preprocessing, decision tree vs random forest comparison | Random Forests/Random Forest vs Decision Trees.ipynb |
| KNN | Breast cancer classification, cross-validation over neighbor counts | KNN/KNN.ipynb, KNN/Cross_Validation.ipynb |
| Naive Bayes | From-scratch probability tables, Laplace smoothing, sklearn comparison | Naive Bayes/Implementation of Naive Bayes .ipynb |
| SVM | SVM classification on Iris/dummy data, visual decision boundaries | SVM/SVM-Iris.ipynb, SVM/SVM_Dummy_data.ipynb |
| NLP | Tokenization, stopword removal, POS tagging, lemmatization, text classification | NLP/NLTK.ipynb, NLP-2/movie_review_by_sklearn.ipynb |
| Neural Networks | Forward propagation, hidden-layer experiments, MLPClassifier | Neural Network-2/forward_propagation.ipynb, Neural Networks - 1/MLP Classifier in Sklearn.ipynb |
| Keras | Dense neural network for breast cancer classification | Keras/Keras_Intro.ipynb |
| TensorFlow | TensorFlow v1-style variables, placeholders, MNIST digit prediction | Tensor Flow/MNIST Tensorflow.ipynb, Tensor Flow/Digit_prediction_using_neural_network.ipynb |
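
To illustrate the closed-form fit listed in the Linear Regression row, here is a minimal sketch in plain Python. It is not taken from the notebook: the function names and the sample points are hypothetical, and the notebook itself may compute the same quantities with NumPy.

```python
# Illustrative sketch only; names and data are hypothetical, not from the notebook.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(xs, slope, intercept):
    return [slope * x + intercept for x in xs]


def r2_score(ys, preds):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]  # made-up, roughly y = 2x + 1
slope, intercept = fit_line(xs, ys)
score = r2_score(ys, predict(xs, slope, intercept))
print(slope, intercept, score)
```

The same slope and intercept should match what scikit-learn's LinearRegression reports on a single feature, which is the comparison the notebook description refers to.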

Highlight Projects

Titanic Survival Prediction

Location: Project - Logistic Regression/

This notebook builds a logistic regression classifier for Titanic survival prediction. It performs categorical conversion for gender and embarked port, fills missing age values, removes high-cardinality/non-numeric columns, trains a LogisticRegression model, and writes predictions to output.csv.

Key ideas:

  • Binary classification
  • Missing value handling
  • Basic categorical encoding
  • Prediction export
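
The preprocessing steps described above can be sketched without any libraries. This is an assumption-laden illustration: the column names follow the common Titanic CSV layout, the integer codes are arbitrary, the dropped column is a stand-in for the high-cardinality fields, and mean imputation for age is assumed because the description does not say which fill value the notebook uses.

```python
from statistics import mean

# Hypothetical rows shaped like the Titanic CSV; the values are made up.
rows = [
    {"Name": "A", "Sex": "male", "Age": 22.0, "Embarked": "S", "Fare": 7.25},
    {"Name": "B", "Sex": "female", "Age": None, "Embarked": "C", "Fare": 71.28},
    {"Name": "C", "Sex": "female", "Age": 26.0, "Embarked": "Q", "Fare": 7.92},
]

SEX_CODES = {"male": 0, "female": 1}      # assumed encoding
PORT_CODES = {"S": 0, "C": 1, "Q": 2}     # assumed encoding
DROP_COLUMNS = {"Name"}                   # assumed high-cardinality column


def preprocess(rows):
    """Encode categoricals, fill missing ages, and drop non-numeric columns."""
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    fill_age = mean(known_ages)  # assumption: mean imputation
    cleaned = []
    for r in rows:
        out = {k: v for k, v in r.items() if k not in DROP_COLUMNS}
        out["Sex"] = SEX_CODES[out["Sex"]]
        out["Embarked"] = PORT_CODES[out["Embarked"]]
        if out["Age"] is None:
            out["Age"] = fill_age
        cleaned.append(out)
    return cleaned


features = preprocess(rows)
print(features[1])
```

After this step every remaining column is numeric, which is what a scikit-learn LogisticRegression model needs as input.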

Twitter US Airline Sentiment Analysis

Location: Project Twitter Sentiment Analysis/

This project classifies airline-related tweets by sentiment. It cleans raw tweet text, removes stopwords and punctuation, applies POS-aware lemmatization, vectorizes text with TF-IDF n-grams, and trains SVM / Multinomial Naive Bayes classifiers.

Key ideas:

  • NLP preprocessing with NLTK
  • Lemmatization and POS tagging
  • TfidfVectorizer with n-grams
  • Text classification with SVM and Naive Bayes
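
To show what TF-IDF weighting over n-grams computes, here is a small plain-Python sketch. It is not the project's code, which uses scikit-learn's TfidfVectorizer. The whitespace tokenizer, function names, and sample tweets are hypothetical. The weighting uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn documents its default behavior.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def tfidf(docs, n_min=1, n_max=2):
    """Return one {term: weight} dict per document, L2-normalized."""
    term_lists = [ngrams(d.lower().split(), n_min, n_max) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))  # count each term once per document
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Made-up example tweets, already cleaned.
docs = ["flight delayed again", "great flight crew", "delayed and rude crew"]
vectors = tfidf(docs)
print(sorted(vectors[0].items(), key=lambda kv: -kv[1]))
```

Terms that appear in fewer documents receive larger weights, so "again" outweighs "flight" in the first tweet.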

Gradient Descent Regression Projects

Location: Projects - Gradient Descent/

These notebooks apply gradient descent to regression datasets such as Boston-style housing data and Combined Cycle Power Plant data. They demonstrate how model parameters are iteratively updated, how cost changes during training, and how predictions are saved.

Key ideas:

  • Batch gradient descent
  • Multivariable linear regression
  • Cost minimization
  • Regression prediction files
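
The update loop these notebooks demonstrate can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: mean squared error as the cost (some treatments use half of it), a hypothetical function name, and a tiny noise-free dataset generated from a known line so convergence is easy to check.

```python
def batch_gradient_descent(X, y, learning_rate=0.05, iterations=2000):
    """Fit y ~ X.w + b by batch gradient descent on mean squared error."""
    n_samples = len(X)
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    cost_history = []
    for _ in range(iterations):
        preds = [sum(w * x for w, x in zip(weights, row)) + bias for row in X]
        errors = [p - t for p, t in zip(preds, y)]
        cost_history.append(sum(e * e for e in errors) / n_samples)
        # Gradients of the mean squared error, computed over the full batch.
        grad_w = [
            2 / n_samples * sum(e * row[j] for e, row in zip(errors, X))
            for j in range(n_features)
        ]
        grad_b = 2 / n_samples * sum(errors)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias, cost_history


# Made-up data generated from y = 2*x1 - 1*x2 + 0.5 (no noise).
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [0.5, 2.0], [2.0, 2.0]]
y = [2 * a - b + 0.5 for a, b in X]
weights, bias, cost_history = batch_gradient_descent(X, y)
print(weights, bias, cost_history[0], cost_history[-1])
```

Recording the cost at every iteration is what makes it possible to plot how the cost falls during training. On real data with features on very different scales, scaling the inputs first usually lets a single learning rate work for all weights.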

MNIST Digit Prediction

Location: Tensor Flow/

The TensorFlow notebooks work with the bundled MNIST gzip files and input_data.py. They use TensorFlow v1-style placeholders, variables, sessions, and softmax classification to predict handwritten digits.

Key ideas:

  • TensorFlow graph execution
  • Placeholders and variables
  • Softmax classification
  • MNIST image data loading
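
The softmax classification step can be illustrated independently of TensorFlow. The sketch below is plain Python with hypothetical function names and made-up logits for the ten digit classes. It shows only the math of turning scores into probabilities and a cross-entropy loss, not the notebooks' graph code.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])


def predict_digit(logits):
    """Index of the most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up scores for digits 0-9.
logits = [0.1, 2.3, -1.0, 0.4, 0.0, 5.2, 0.7, -0.3, 1.1, 0.2]
probs = softmax(logits)
print(predict_digit(logits), cross_entropy(probs, 5))
```

The loss is small when the correct digit already has high probability and grows as the model assigns it less, which is the quantity a softmax classifier minimizes during training.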

Breast Cancer Classification

Locations: KNN/, Keras/, Logistic Regression/, Neural Networks - 1/

Multiple notebooks use the sklearn breast cancer dataset to compare classical and neural-network approaches, including KNN, logistic regression, Keras dense networks, and sklearn MLP.

Key ideas:

  • Train/test split
  • Standard scaling
  • KNN neighbor search
  • Dense neural networks
  • Model evaluation
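
As a sketch of why standard scaling matters for KNN, the example below standardizes features with statistics from the training rows only and then classifies by majority vote among the nearest neighbors. It is plain Python, and the function names and two-feature data are hypothetical. The notebooks themselves rely on scikit-learn for these steps.

```python
import math
from collections import Counter


def fit_scaler(rows):
    """Return per-column means and population standard deviations."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return means, stds


def transform(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_predict(train_X, train_y, query, k=3):
    """Majority label among the k training rows closest to the query."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(row, query))), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up data: two features on very different scales, labels 0 and 1.
train_X = [[12.0, 450.0], [13.0, 520.0], [11.5, 400.0],
           [20.0, 1200.0], [19.0, 1100.0], [21.5, 1400.0]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[12.5, 480.0], [20.5, 1250.0]]

means, stds = fit_scaler(train_X)           # fit on training rows only
scaled_train = transform(train_X, means, stds)
scaled_test = transform(test_X, means, stds)
preds = [knn_predict(scaled_train, train_y, q) for q in scaled_test]
print(preds)
```

Without scaling, the second feature would dominate the Euclidean distance simply because its values are larger.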

From-Scratch Implementations

This repository is especially useful because several notebooks build ML logic manually before leaning on libraries:

  • Linear regression slope, intercept, cost, and coefficient-of-determination style score
  • Batch gradient descent for simple and multivariable regression
  • Decision tree splitting with entropy and information gain
  • Naive Bayes probability estimation with Laplace smoothing
  • Neural network forward propagation with NumPy
  • Basic hidden-layer neural network training logic

That mix helps connect the math behind each model with the API-level workflow used in real projects.
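
As one example of that math, decision tree splitting with entropy and information gain can be sketched in a few lines of plain Python. The function names and the toy weather-style data are hypothetical and are not taken from the notebooks.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy reduction from splitting the labels by a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Made-up toy data: "outlook" separates the classes perfectly, "windy" not at all.
labels = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["t", "f", "t", "f"]

gains = {
    "outlook": information_gain(outlook, labels),
    "windy": information_gain(windy, labels),
}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

A tree builder would split on the feature with the highest gain and then repeat the same calculation inside each resulting group.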

Datasets Included

| Dataset / File | Used For |
|---|---|
| Classification Measures/iris.csv | Confusion matrix and classification metric practice |
| Linear Regression/data.csv | Simple linear regression from scratch |
| Linear Regression/diabetes_train.csv, diabetes_test.csv | Diabetes regression experiments |
| MultiVariable Regression and Gradient Descent/data.csv | Basic gradient descent experiments |
| Projects - Gradient Descent/boston_test.csv | Boston-style regression prediction |
| Project - Logistic Regression/titanic_train.csv, titanic_test.csv | Titanic survival classification |
| Random Forests/titanic.csv and split CSVs | Decision tree / random forest comparison |
| Project Twitter Sentiment Analysis/train.csv, test.csv | Airline sentiment classification |
| Tensor Flow/MNIST_data/*.gz | MNIST digit classification |

Several notebooks also use built-in scikit-learn datasets, including Iris, Breast Cancer Wisconsin, Boston housing style data, and Diabetes.

Prerequisites

Recommended:

  • Python 3.7 or compatible Python 3.x environment
  • Jupyter Notebook or JupyterLab
  • Core packages:
    • numpy
    • pandas
    • matplotlib
    • scikit-learn
    • nltk
    • pydotplus
    • keras
    • tensorflow

For the TensorFlow notebooks, the code uses TensorFlow v1-style APIs such as:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

Use a TensorFlow release that ships the tensorflow.compat.v1 module; TensorFlow 2.x releases include it.

Setup

Clone the repository:

git clone https://github.com/devthedevil/Machine-Learning.git
cd Machine-Learning

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install common dependencies:

pip install jupyter numpy pandas matplotlib scikit-learn nltk pydotplus keras tensorflow

Download common NLTK resources used by the NLP notebooks:

import nltk

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")
nltk.download("movie_reviews")

Start Jupyter:

jupyter notebook

Then open any notebook from the topic folders.

Running Notes

  • Run notebooks from inside their own folder when they reference local files such as train.csv, data.csv, boston_test.csv, or MNIST_data/.
  • Some notebooks generate output files such as output.csv, twitter.csv, and prediction CSVs.
  • The TensorFlow notebooks depend on the local Tensor Flow/input_data.py helper and bundled Tensor Flow/MNIST_data/ files.
  • The repository does not currently include a requirements.txt; the package list above was inferred from notebook imports.
  • The CNN/Untitled.ipynb notebook is currently empty.

Suggested Reading Path

For a smooth learning progression:

  1. Start with Linear Regression/linear_regression_by_diffrentiation.ipynb
  2. Move to MultiVariable Regression and Gradient Descent/Gradient Descent.ipynb
  3. Explore Classification Measures/Confusion Matrix.ipynb
  4. Study Decision tree/Decision Tree Implementation.ipynb
  5. Compare with Decision tree/Code Using Sklearn Decision Tree.ipynb
  6. Try KNN/KNN.ipynb and Naive Bayes/Implementation of Naive Bayes .ipynb
  7. Open Project - Logistic Regression/Logistic Regression - Titanic Dataset.ipynb
  8. Continue to Project Twitter Sentiment Analysis/Twitter US Airline Sentiment Analysis.ipynb
  9. Finish with Keras/Keras_Intro.ipynb and the TensorFlow MNIST notebooks

Evaluation Techniques Used

Across the notebooks, the project uses:

  • Train/test splitting
  • Cross-validation
  • Confusion matrices
  • Classification reports
  • Accuracy scores
  • Regression score comparisons
  • Cost-function tracking during gradient descent
  • Visual decision boundaries for SVM experiments
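
Several of these measures derive from the same table of counts. The sketch below builds a confusion matrix by hand and reads accuracy, precision, and recall from it. It is plain Python with hypothetical function names and made-up Iris-style labels. Rows are true classes and columns are predicted classes, the same orientation scikit-learn uses.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision_recall(matrix, i):
    """Precision and recall for the class at row/column i."""
    true_positive = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum
    actual = sum(matrix[i])                    # row sum
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


# Made-up labels and predictions.
labels = ["setosa", "versicolor", "virginica"]
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix, accuracy(matrix), precision_recall(matrix, 2))
```

A classification report is essentially this precision and recall calculation repeated for every class.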

Current Limitations

  • No centralized dependency file is included.
  • Most work is notebook-based rather than packaged into reusable Python modules.
  • Some notebooks use older APIs, including TensorFlow v1-style code and older scikit-learn defaults.
  • Several notebooks depend on being executed from a specific folder because CSV paths are relative.
  • The CNN notebook is empty and can be removed or replaced with a complete convolutional neural network example.

Possible Improvements

  • Add requirements.txt or environment.yml
  • Move reusable logic into Python modules under a src/ folder
  • Add notebook execution checks with nbconvert
  • Convert major projects into clean scripts or pipelines
  • Add exploratory data analysis sections to project notebooks
  • Add saved visualizations for model comparison
  • Modernize TensorFlow notebooks to TensorFlow 2 / Keras
  • Add a completed CNN notebook for image classification

What This Repository Demonstrates

  • Practical understanding of core supervised learning algorithms
  • Ability to implement ML fundamentals from scratch
  • Experience with data preprocessing, feature engineering, and model evaluation
  • Familiarity with NLP workflows using NLTK and scikit-learn
  • Exposure to neural-network workflows in Keras and TensorFlow
  • Comfort working with Jupyter notebooks, CSV datasets, and iterative experiments
