devthedevil/Machine-learning-Lab

Machine Learning Lab

Python | scikit-learn | Jupyter | TensorFlow

This repository is a module-based machine learning lab that explores foundational algorithms, feature engineering, model comparison, NLP preprocessing, clustering, genetic algorithms, and deep learning for time-series prediction.

The project is organized as seven lab modules (mod1 through mod7). Each module focuses on a different part of the machine learning workflow: data preparation, model training, optimization, dimensionality reduction, evaluation, and applied prediction.

Lab Overview

flowchart LR
    A["Input Data"] --> B["Preprocessing"]
    B --> C["Feature Engineering"]
    C --> D["Model Training"]
    D --> E["Evaluation"]
    E --> F["Visualization"]

    C --> C1["PCA"]
    C --> C2["Mutual Information"]
    C --> C3["Sequential Selection"]

    D --> D1["Classical ML"]
    D --> D2["Genetic Algorithm"]
    D --> D3["KMeans Clustering"]
    D --> D4["LSTM RNN"]

    E --> E1["Accuracy"]
    E --> E2["Classification Report"]
    E --> E3["Confusion Matrix"]
    E --> E4["Homogeneity / V-measure"]

Repository Structure

.
|-- ml.jpeg
|-- mod1/
|   `-- Gradient Descent/
|       `-- student_scores.csv
|-- mod2/
|   |-- iris.data
|   |-- ae.train
|   |-- ae.test
|   |-- size_ae.train
|   `-- size_ae.test
|-- mod3/
|   |-- GA.ipynb
|   |-- ga_m.py
|   |-- GA-on-SVM-for-Y1.py
|   |-- dataset.xlsx
|   |-- GA_guideline.pdf
|   `-- GA parameters details.pdf
|-- mod4/
|   |-- IG.py
|   |-- PCA.py
|   |-- Ranking_of_feature.py
|   |-- feature_forward_and_backward.py
|   |-- feature_backward.py
|   |-- dermatology.data
|   |-- dermatology.names
|   `-- dermatology_csv.csv
|-- mod5/
|   |-- logistic_regression.py
|   |-- decision_tree_classification.py
|   |-- random_forest_classification.py
|   |-- svm.py
|   |-- Untitled.ipynb
|   |-- Untitled1.ipynb
|   `-- dermatology_csv.csv
|-- mod6/
|   |-- Text_Preprocessing.py
|   `-- Text_Preprocessing_Clustering.py
`-- mod7/
    |-- rnn.py
    |-- Google_Stock_Price_Train.csv
    `-- Google_Stock_Price_Test.csv

Module Guide

| Module | Focus | Main Files | What It Demonstrates |
|--------|-------|------------|----------------------|
| mod1 | Gradient descent dataset | student_scores.csv | Small supervised learning dataset with study hours, scores, and pass/fail labels |
| mod2 | Core classification datasets | iris.data, ae.train, ae.test | Iris data and train/test numeric datasets for lab experiments |
| mod3 | Genetic algorithms | GA.ipynb, ga_m.py, GA-on-SVM-for-Y1.py, dataset.xlsx | Binary encoding, roulette-wheel selection, crossover, mutation, elitist survival, and a draft GA-driven SVM parameter search |
| mod4 | Feature engineering | IG.py, PCA.py, Ranking_of_feature.py, feature_forward_and_backward.py | Mutual information, PCA, variance-based feature ranking, forward/backward feature selection |
| mod5 | Classifier comparison | logistic_regression.py, decision_tree_classification.py, random_forest_classification.py, svm.py | Dermatology classification with multiple algorithms, imputation, scaling, reports, and confusion matrices |
| mod6 | NLP preprocessing and clustering | Text_Preprocessing.py, Text_Preprocessing_Clustering.py | Tokenization, stemming, lemmatization, stopwords, POS tagging, n-grams, TF-IDF, KMeans clustering |
| mod7 | Deep learning time-series prediction | rnn.py, Google stock CSV files | LSTM-based Google stock-price prediction using Keras/TensorFlow |

Key Topics Covered

  • Gradient descent datasets and supervised learning basics
  • Iris and other train/test datasets for classification experiments
  • Genetic algorithm implementation from scratch
  • GA operators: population generation, roulette-wheel selection, crossover, mutation, and elitism
  • Feature selection with mutual information
  • Dimensionality reduction with Principal Component Analysis
  • Sequential forward and backward feature selection
  • Logistic regression, decision trees, random forests, and SVM
  • Confusion matrices and classification reports
  • NLP text cleaning with NLTK
  • TF-IDF vectorization and KMeans text clustering
  • LSTM recurrent neural network for stock-price prediction

Highlight Labs

1. Genetic Algorithm From Scratch

Location: mod3/GA.ipynb, mod3/ga_m.py

The genetic algorithm lab optimizes a mathematical objective function:

f(x) = x^3 + 9

The implementation uses:

  • 6-bit binary chromosomes
  • Population size of 10
  • Roulette-wheel parent selection
  • Single-point crossover
  • Mutation
  • Replacement of weak offspring with strong parent solutions

This is one of the strongest parts of the repository because it shows the optimization loop explicitly instead of hiding it behind a library.
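
The loop described above can be sketched in a few dozen lines of standard-library Python. This is an illustrative sketch, not the code in ga_m.py: it assumes the goal is to maximize f(x), that a chromosome decodes to an unsigned integer in 0..63, and that the mutation rate, generation count, and all function names are free choices made here.

```python
import random

BITS = 6              # chromosome length, from the lab description
POP_SIZE = 10         # population size, from the lab description
MUTATION_RATE = 0.05  # assumed per-bit mutation probability
GENERATIONS = 30      # assumed stopping rule


def decode(chrom):
    """Interpret a list of 0/1 bits as an unsigned integer (assumption)."""
    return int("".join(map(str, chrom)), 2)


def fitness(chrom):
    """Objective from the lab, f(x) = x^3 + 9, maximized here by assumption."""
    return decode(chrom) ** 3 + 9


def roulette_select(population, rng):
    """Pick one parent with probability proportional to its fitness."""
    weights = [fitness(c) for c in population]
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(a, b, rng):
    """Single-point crossover: swap tails after a random cut point."""
    point = rng.randint(1, BITS - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chrom, rng):
    """Flip each bit independently with probability MUTATION_RATE."""
    return [bit ^ 1 if rng.random() < MUTATION_RATE else bit for bit in chrom]


def evolve(seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(BITS)] for _ in range(POP_SIZE)]
    best_history = []
    for _ in range(GENERATIONS):
        offspring = []
        while len(offspring) < POP_SIZE:
            p1 = roulette_select(population, rng)
            p2 = roulette_select(population, rng)
            c1, c2 = crossover(p1, p2, rng)
            offspring += [mutate(c1, rng), mutate(c2, rng)]
        # Elitist survival: keep the POP_SIZE fittest of parents + offspring,
        # so strong parents displace weak offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:POP_SIZE]
        best_history.append(fitness(population[0]))
    return population[0], best_history


best, history = evolve()
print("best x:", decode(best), "f(x):", fitness(best))
```

Because survivors are drawn from parents and offspring together, the best fitness recorded per generation can never decrease. Roulette-wheel selection works here without rescaling only because f(x) is strictly positive on the assumed range.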

2. Dermatology Classification Suite

Location: mod5/

The dermatology classification scripts train multiple models on a dataset with 34 features and a target class column.

Models included:

  • Logistic Regression
  • Decision Tree with entropy criterion
  • Random Forest
  • Linear Support Vector Machine

Each script follows a similar pipeline:

flowchart LR
    A["dermatology_csv.csv"] --> B["Train/Test Split"]
    B --> C["StandardScaler"]
    C --> D["Mean Imputation"]
    D --> E["Classifier"]
    E --> F["Predictions"]
    F --> G["Classification Report"]
    F --> H["Confusion Matrix Heatmap"]
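
A minimal sketch of this pipeline with scikit-learn follows. It is not taken from the mod5 scripts: the data is a synthetic stand-in generated with make_classification (366 rows and 34 features as in the README; the six classes, the injected missing values, the 25% test split, and the choice of logistic regression are assumptions). The step order follows the flowchart, which works because StandardScaler ignores NaN values when fitting and passes them through for the imputer to fill.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for dermatology_csv.csv (assumed: 6 classes).
X, y = make_classification(
    n_samples=366, n_features=34, n_informative=10,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# Inject a few missing values so the imputation step has work to do.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.02] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale, then fill missing entries with the (scaled) column mean, then classify.
model = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="mean"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)  # seaborn.heatmap(cm, annot=True) would draw this as a heatmap
```

Swapping LogisticRegression for DecisionTreeClassifier(criterion="entropy"), RandomForestClassifier, or LinearSVC gives the other three variants while leaving the rest of the pipeline unchanged.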

3. Feature Selection and PCA

Location: mod4/

This module explores how feature preparation affects model performance.

Included techniques:

  • Mutual information-based feature selection in IG.py
  • PCA-based dimensionality reduction in PCA.py
  • Variance-based feature ranking in Ranking_of_feature.py
  • Sequential forward/backward feature selection in feature_forward_and_backward.py

The main dataset is dermatology_csv.csv, which contains 366 records, 34 feature columns, and 1 class label column.
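
To make three of these techniques concrete, here is a hedged sketch using scikit-learn on synthetic data shaped like the dermatology set. It does not reproduce IG.py, PCA.py, or Ranking_of_feature.py; the top-10 cutoff, the 95% variance target, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 366 rows, 34 features (class count is an assumption).
X, y = make_classification(
    n_samples=366, n_features=34, n_informative=8,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# Mutual information between each feature and the label; higher is more informative.
mi = mutual_info_classif(X, y, random_state=0)
top_mi = np.argsort(mi)[::-1][:10]

# Variance-based ranking: features sorted by raw variance, largest first.
variances = X.var(axis=0)
top_var = np.argsort(variances)[::-1][:10]

# PCA on standardized features, keeping enough components for 95% of the variance.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_std)

print("top features by mutual information:", top_mi)
print("top features by variance:", top_var)
print("PCA kept", X_reduced.shape[1], "of", X.shape[1], "dimensions")
```

Mutual information and variance ranking keep a subset of the original columns, whereas PCA replaces them with linear combinations, so the two approaches trade interpretability against compactness differently.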

4. NLP Preprocessing and Clustering

Location: mod6/

Text_Preprocessing.py demonstrates standard NLP preprocessing steps:

  • Lowercasing
  • Word tokenization
  • Punctuation removal
  • Stemming
  • Lemmatization
  • Stopword removal
  • Sentence tokenization
  • POS tagging
  • Chunking
  • N-gram extraction
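
Several of these steps can be illustrated without any third-party dependency. The sketch below is a simplified stand-in, not the NLTK calls used in Text_Preprocessing.py: the stopword set is a tiny hand-picked sample, tokenization is plain whitespace splitting, and the function names are hypothetical. Stemming, lemmatization, POS tagging, and chunking need NLTK's models and are left out.

```python
import re
import string

# Tiny illustrative stopword list; NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "over"}


def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def split_sentences(text):
    """Naive sentence split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "The quick brown fox jumps over the lazy dog. The dog is not amused!"
sentences = split_sentences(text)
tokens = preprocess(text)
bigrams = ngrams(tokens, 2)

print(sentences)
print(tokens)
print(bigrams[:3])
```

The naive sentence splitter breaks on abbreviations such as "Dr. Smith", which is one reason a trained tokenizer like NLTK's is preferable for real text.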

Text_Preprocessing_Clustering.py uses the 20 Newsgroups dataset, converts documents into TF-IDF features, clusters them with KMeans, and evaluates clustering with homogeneity and V-measure.
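
The same TF-IDF, KMeans, and scoring sequence can be tried on a toy corpus before downloading 20 Newsgroups. This sketch assumes scikit-learn and uses six invented sentences with made-up topic labels; the cluster count and vectorizer settings are illustrative, not those of the script.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import homogeneity_score, v_measure_score

# Toy corpus with two invented topics (0 = space, 1 = hockey).
docs = [
    "the rocket launch reached orbit around the moon",
    "nasa plans a new space shuttle mission to orbit",
    "astronauts on the space station study the moon",
    "the hockey team won the game in overtime",
    "the goalie saved every shot in the hockey game",
    "fans cheered as the team scored the winning goal",
]
labels_true = [0, 0, 0, 1, 1, 1]

# Turn each document into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Cluster the vectors; n_init is set explicitly for stable behavior across versions.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_pred = km.fit_predict(X)

# Both scores compare cluster assignments with the known topics, ignoring label names.
h = homogeneity_score(labels_true, labels_pred)
v = v_measure_score(labels_true, labels_pred)
print(f"homogeneity={h:.3f}  v_measure={v:.3f}")
```

Homogeneity is high when each cluster contains documents from a single topic; V-measure is the harmonic mean of homogeneity and completeness, so it also penalizes splitting one topic across clusters.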

5. LSTM Stock Price Prediction

Location: mod7/rnn.py

This lab builds a stacked LSTM model for Google stock-price prediction.

Pipeline:

  • Reads Google stock training and test CSV files
  • Uses the Open price as the training signal
  • Scales values with MinMaxScaler
  • Builds 60-day timestep sequences
  • Trains a 4-layer LSTM network with dropout
  • Predicts 2017 stock prices
  • Plots real vs predicted stock prices
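
The scaling and 60-day windowing steps are the part of this pipeline that most often goes wrong, so they are sketched here with NumPy alone. The price series is synthetic and the variable names are assumptions; the manual min-max formula matches what MinMaxScaler does with its default (0, 1) range, though rnn.py may configure it differently.

```python
import numpy as np

TIMESTEPS = 60  # window length, from the pipeline description

# Synthetic stand-in for the "Open" column of Google_Stock_Price_Train.csv.
rng = np.random.default_rng(0)
open_prices = 300 + np.cumsum(rng.normal(0, 2, size=500))

# Min-max scale to [0, 1].
lo, hi = open_prices.min(), open_prices.max()
scaled = (open_prices - lo) / (hi - lo)

# Each sample holds the previous 60 scaled prices; the target is the next price.
X_train = np.array([scaled[i - TIMESTEPS:i] for i in range(TIMESTEPS, len(scaled))])
y_train = scaled[TIMESTEPS:]

# Keras LSTM layers expect input shaped (samples, timesteps, features).
X_train = X_train.reshape(X_train.shape[0], TIMESTEPS, 1)


def inverse_scale(values):
    """Map scaled predictions back to price units for plotting."""
    return values * (hi - lo) + lo


print(X_train.shape, y_train.shape)
```

At prediction time the first test-day window needs the last 60 training prices, so the training and test series are usually concatenated before windowing, and the test inputs are scaled with the minimum and maximum learned from the training data.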

Datasets

| Dataset | Location | Shape / Notes |
|---------|----------|---------------|
| Student scores | mod1/Gradient Descent/student_scores.csv | 25 rows plus header; columns: Hours, Scores, Pass |
| Iris | mod2/iris.data | 150 Iris samples plus trailing newline; 4 numeric features and species label |
| AE train/test | mod2/ae.train, mod2/ae.test | Numeric train/test files; size metadata is stored in size_ae.train and size_ae.test |
| GA/SVM workbook | mod3/dataset.xlsx | Main sheet has 1,297 rows and columns X1-X8, Y1, Y2 |
| Dermatology | mod4/dermatology_csv.csv, mod5/dermatology_csv.csv | 366 records, 34 feature columns, and target class |
| Google stock prices | mod7/Google_Stock_Price_Train.csv, mod7/Google_Stock_Price_Test.csv | 2012-2016 training data and January 2017 test data |

Prerequisites

Recommended environment:

  • Python 3.x
  • Jupyter Notebook or JupyterLab
  • Core packages:
    • numpy
    • pandas
    • matplotlib
    • seaborn
    • scikit-learn
    • nltk
    • mlxtend
    • tensorflow
    • openpyxl

There is currently no requirements.txt, so the list above is inferred from the scripts and notebooks.

Setup

Clone the repository:

git clone https://github.com/devthedevil/Machine-learning-Lab.git
cd Machine-learning-Lab

Create a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install common dependencies:

pip install numpy pandas matplotlib seaborn scikit-learn nltk mlxtend tensorflow openpyxl jupyter

Download common NLTK resources:

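import nltk

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")
# NLTK 3.9 and later look for these renamed resources instead:
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")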

How to Run

Most scripts use relative file paths, so run them from inside their module folder.

Example:

cd mod7
python rnn.py

For notebook-based material:

jupyter notebook

Then open:

  • mod3/GA.ipynb
  • mod5/Untitled.ipynb
  • mod5/Untitled1.ipynb

Some .py files contain notebook magics such as %matplotlib inline or !pip install, which are not valid Python. Either paste the file contents into a Jupyter notebook cell, or remove the magic lines before executing the file as a plain Python script.
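
Removing the magic lines can be done mechanically. The helper below is a hypothetical utility, not part of the repository: it comments out any line whose first non-blank character is % or !, which is an assumption that holds for typical notebook exports but would also catch a continuation line that begins with the % operator.

```python
import re

# Lines that start (after indentation) with % or ! are IPython-only syntax.
MAGIC_LINE = re.compile(r"^\s*[%!]")


def strip_magics(source):
    """Comment out IPython magic and shell-escape lines in Python source text."""
    cleaned = []
    for line in source.splitlines():
        if MAGIC_LINE.match(line):
            stripped = line.lstrip()
            indent = line[: len(line) - len(stripped)]
            cleaned.append(f"{indent}# {stripped}")
        else:
            cleaned.append(line)
    return "\n".join(cleaned) + "\n"


example = "%matplotlib inline\n!pip install mlxtend\nimport math\nprint(math.pi)\n"
result = strip_magics(example)
print(result)
```

To apply it to a script, read the file with pathlib.Path.read_text(), pass the text through strip_magics, and write the result to a new file so the original stays untouched.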

Current Execution Notes

A review of the repository for syntax errors and runtime requirements found the following:

  • mod3/ga_m.py and mod3/GA.ipynb contain the clearest complete genetic algorithm implementation.
  • mod3/GA-on-SVM-for-Y1.py is a draft for GA-based SVM hyperparameter optimization. Its dataset-loading block is commented out, and it references variables/API names that need cleanup before direct execution.
  • mod4/PCA.py, mod4/feature_forward_and_backward.py, and the mod5 classifier scripts contain notebook-only syntax such as %matplotlib inline or !pip install.
  • mod4/feature_forward_and_backward.py also contains a typo in the backward-selection section and should be cleaned before running as a script.
  • mod6/Text_Preprocessing.py expects a local 20_newsgroups folder for load_files.
  • mod6/Text_Preprocessing_Clustering.py uses fetch_20newsgroups, which may download data when run for the first time.
  • mod7/rnn.py is a complete LSTM script, but it may take time to train because it runs 100 epochs.

Suggested Learning Path

  1. Start with mod1/Gradient Descent/student_scores.csv to understand the simplest supervised-learning data.
  2. Review mod2/iris.data and the train/test datasets.
  3. Open mod3/GA.ipynb to study the genetic algorithm loop from scratch.
  4. Move to mod4/IG.py and mod4/PCA.py for feature selection and dimensionality reduction.
  5. Compare classifiers in mod5/.
  6. Explore text preprocessing in mod6/Text_Preprocessing.py.
  7. Finish with the LSTM sequence model in mod7/rnn.py.

Suggested Improvements

  • Add requirements.txt or environment.yml
  • Convert notebook-style .py files into real notebooks or remove notebook magic commands from scripts
  • Fix runtime issues in GA-on-SVM-for-Y1.py, feature_forward_and_backward.py, and Text_Preprocessing.py
  • Add module-specific README files for each lab
  • Add saved plots for PCA, confusion matrices, clustering metrics, and RNN predictions
  • Add a single src/ package for reusable preprocessing/modeling utilities
  • Add lightweight tests or notebook execution checks
  • Add clearer dataset source notes for ae.train, ae.test, and dataset.xlsx

What This Lab Demonstrates

  • Understanding of the full ML workflow from preprocessing to evaluation
  • Ability to implement optimization logic manually
  • Familiarity with classical ML models and model comparison
  • Experience with feature selection, PCA, and dimensionality reduction
  • Practical exposure to NLP preprocessing and document clustering
  • Experience building and training an LSTM with TensorFlow/Keras
  • Comfort working with scripts, notebooks, CSV datasets, Excel files, and visualizations
