Machine Learning Lab

This repository is a module-based machine learning lab that explores foundational algorithms, feature engineering, model comparison, NLP preprocessing, clustering, genetic algorithms, and deep learning for time-series prediction.

The project is organized as seven lab modules (mod1 through mod7). Each module focuses on a different part of the machine learning workflow: data preparation, model training, optimization, dimensionality reduction, evaluation, and applied prediction.

Lab Overview

flowchart LR
    A["Input Data"] --> B["Preprocessing"]
    B --> C["Feature Engineering"]
    C --> D["Model Training"]
    D --> E["Evaluation"]
    E --> F["Visualization"]

    C --> C1["PCA"]
    C --> C2["Mutual Information"]
    C --> C3["Sequential Selection"]

    D --> D1["Classical ML"]
    D --> D2["Genetic Algorithm"]
    D --> D3["KMeans Clustering"]
    D --> D4["LSTM RNN"]

    E --> E1["Accuracy"]
    E --> E2["Classification Report"]
    E --> E3["Confusion Matrix"]
    E --> E4["Homogeneity / V-measure"]

Repository Structure

.
|-- ml.jpeg
|-- mod1/
|   `-- Gradient Descent/
|       `-- student_scores.csv
|-- mod2/
|   |-- iris.data
|   |-- ae.train
|   |-- ae.test
|   |-- size_ae.train
|   `-- size_ae.test
|-- mod3/
|   |-- GA.ipynb
|   |-- ga_m.py
|   |-- GA-on-SVM-for-Y1.py
|   |-- dataset.xlsx
|   |-- GA_guideline.pdf
|   `-- GA parameters details.pdf
|-- mod4/
|   |-- IG.py
|   |-- PCA.py
|   |-- Ranking_of_feature.py
|   |-- feature_forward_and_backward.py
|   |-- feature_backward.py
|   |-- dermatology.data
|   |-- dermatology.names
|   `-- dermatology_csv.csv
|-- mod5/
|   |-- logistic_regression.py
|   |-- decision_tree_classification.py
|   |-- random_forest_classification.py
|   |-- svm.py
|   |-- Untitled.ipynb
|   |-- Untitled1.ipynb
|   `-- dermatology_csv.csv
|-- mod6/
|   |-- Text_Preprocessing.py
|   `-- Text_Preprocessing_Clustering.py
`-- mod7/
    |-- rnn.py
    |-- Google_Stock_Price_Train.csv
    `-- Google_Stock_Price_Test.csv

Module Guide

| Module | Focus | Main Files | What It Demonstrates |
| --- | --- | --- | --- |
| `mod1` | Gradient descent dataset | `student_scores.csv` | Small supervised learning dataset with study hours, scores, and pass/fail labels |
| `mod2` | Core classification datasets | `iris.data`, `ae.train`, `ae.test` | Iris data and train/test numeric datasets for lab experiments |
| `mod3` | Genetic algorithms | `GA.ipynb`, `ga_m.py`, `GA-on-SVM-for-Y1.py`, `dataset.xlsx` | Binary encoding, roulette-wheel selection, crossover, mutation, elitist survival, and GA-driven SVM parameter search draft |
| `mod4` | Feature engineering | `IG.py`, `PCA.py`, `Ranking_of_feature.py`, `feature_forward_and_backward.py` | Mutual information, PCA, variance-based feature ranking, forward/backward feature selection |
| `mod5` | Classifier comparison | `logistic_regression.py`, `decision_tree_classification.py`, `random_forest_classification.py`, `svm.py` | Dermatology classification with multiple algorithms, imputation, scaling, reports, and confusion matrices |
| `mod6` | NLP preprocessing and clustering | `Text_Preprocessing.py`, `Text_Preprocessing_Clustering.py` | Tokenization, stemming, lemmatization, stopwords, POS tagging, n-grams, TF-IDF, KMeans clustering |
| `mod7` | Deep learning time-series prediction | `rnn.py`, Google stock CSV files | LSTM-based Google stock-price prediction using Keras/TensorFlow |

Key Topics Covered

Gradient descent datasets and supervised learning basics
Iris and other train/test datasets for classification experiments
Genetic algorithm implementation from scratch
GA operators: population generation, roulette-wheel selection, crossover, mutation, and elitism
Feature selection with mutual information
Dimensionality reduction with Principal Component Analysis
Sequential forward and backward feature selection
Logistic regression, decision trees, random forests, and SVM
Confusion matrices and classification reports
NLP text cleaning with NLTK
TF-IDF vectorization and KMeans text clustering
LSTM recurrent neural network for stock-price prediction

Highlight Labs

1. Genetic Algorithm From Scratch

Location: mod3/GA.ipynb, mod3/ga_m.py

The genetic algorithm lab optimizes a mathematical objective function:

f(x) = x^3 + 9

The implementation uses:

6-bit binary chromosomes
Population size of 10
Roulette-wheel parent selection
Single-point crossover
Mutation
Replacement of weak offspring with strong parent solutions

This is one of the strongest parts of the repository because it shows the optimization loop explicitly instead of hiding it behind a library.
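
The loop described above can be sketched in plain Python. This is a minimal illustration, not the code in ga_m.py: it assumes the goal is to maximize f(x), with x decoded from the 6 bits as an integer in 0..63. The mutation rate, generation count, seed, and the exact elitist rule (keep the best 10 of parents plus offspring) are assumed values chosen for the example.

```python
import random

CHROMOSOME_BITS = 6      # 6-bit chromosomes encode x in 0..63
POPULATION_SIZE = 10
MUTATION_RATE = 0.05     # assumed value, not taken from the lab
GENERATIONS = 30         # assumed value, not taken from the lab


def fitness(chromosome):
    """Decode the bit list to an integer x and evaluate f(x) = x^3 + 9."""
    x = int("".join(map(str, chromosome)), 2)
    return x ** 3 + 9


def roulette_select(population, scores, rng):
    """Pick one parent with probability proportional to its fitness."""
    return rng.choices(population, weights=scores, k=1)[0]


def crossover(a, b, rng):
    """Single-point crossover: swap the tails after a random cut point."""
    point = rng.randint(1, CHROMOSOME_BITS - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rng):
    """Flip each bit independently with probability MUTATION_RATE."""
    return [bit ^ 1 if rng.random() < MUTATION_RATE else bit for bit in chromosome]


def run_ga(seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(CHROMOSOME_BITS)]
                  for _ in range(POPULATION_SIZE)]
    history = []  # best fitness after each generation
    for _ in range(GENERATIONS):
        scores = [fitness(c) for c in population]
        offspring = []
        while len(offspring) < POPULATION_SIZE:
            p1 = roulette_select(population, scores, rng)
            p2 = roulette_select(population, scores, rng)
            c1, c2 = crossover(p1, p2, rng)
            offspring += [mutate(c1, rng), mutate(c2, rng)]
        # Assumed elitist rule: strong parents survive in place of weak offspring.
        pool = population + offspring
        population = sorted(pool, key=fitness, reverse=True)[:POPULATION_SIZE]
        history.append(fitness(population[0]))
    return population[0], history


best, history = run_ga()
print("best chromosome:", best, "fitness:", fitness(best))
```

Because parents compete with their offspring for survival, the best fitness recorded in `history` can never decrease from one generation to the next.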

2. Dermatology Classification Suite

Location: mod5/

The dermatology classification scripts train multiple models on a dataset with 34 features and a target class column.

Models included:

Logistic Regression
Decision Tree with entropy criterion
Random Forest
Linear Support Vector Machine

Each script follows a similar pipeline:

flowchart LR
    A["dermatology_csv.csv"] --> B["Train/Test Split"]
    B --> C["StandardScaler"]
    C --> D["Mean Imputation"]
    D --> E["Classifier"]
    E --> F["Predictions"]
    F --> G["Classification Report"]
    F --> H["Confusion Matrix Heatmap"]
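
As a rough sketch of the pipeline in the flowchart, the steps can be chained with scikit-learn. The example below uses synthetic data instead of dermatology_csv.csv so that it runs on its own; the six synthetic classes, the 2% missing-value rate, the 25% test split, and the choice of logistic regression as the classifier are all assumptions made for illustration, and the individual scripts may wire the steps differently.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset: 366 rows, 34 features, 6 assumed classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 6, size=366)
X = rng.normal(size=(366, 34)) + y[:, None]   # shift features by class so they are learnable
X[rng.random(X.shape) < 0.02] = np.nan        # sprinkle in missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Same order as the flowchart: scale, then impute with the column mean, then classify.
# StandardScaler ignores NaNs when fitting and passes them through when transforming.
model = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="mean"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
print(cm)
```

Wrapping the steps in one pipeline means the scaler and imputer are fitted on the training split only, which avoids leaking test-set statistics into training. The matrix in `cm` can be passed to a plotting function such as seaborn's `heatmap` to produce the final chart.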

3. Feature Selection and PCA

Location: mod4/

This module explores how feature preparation affects model performance.

Included techniques:

Mutual information-based feature selection in IG.py
PCA-based dimensionality reduction in PCA.py
Variance-based feature ranking in Ranking_of_feature.py
Sequential forward/backward feature selection in feature_forward_and_backward.py

The main dataset is dermatology_csv.csv, which contains 366 records, 34 feature columns, and 1 class label column.
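
To make the mutual-information ranking concrete, the sketch below computes the score from scratch for discrete features. It is not the code in IG.py (scikit-learn's `mutual_info_classif` is a common library choice for this step); the toy columns and their names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Estimate I(X; Y) in nats from two equal-length sequences of discrete values."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)), written with raw counts
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi


def rank_features(columns, labels):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = {name: mutual_information(values, labels)
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data with hypothetical column names
labels = [0, 0, 1, 1, 0, 0, 1, 1]
columns = {
    "copies_label": [0, 0, 1, 1, 0, 0, 1, 1],   # fully determines the label
    "unrelated":    [0, 1, 0, 1, 0, 1, 0, 1],   # independent of the label
}
ranking = rank_features(columns, labels)
print(ranking)
```

A feature that fully determines a balanced binary label scores ln 2 (about 0.693 nats), while a feature that is statistically independent of the label scores 0, so sorting by this value puts the most informative features first.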

4. NLP Preprocessing and Clustering

Location: mod6/

Text_Preprocessing.py demonstrates standard NLP preprocessing steps:

Lowercasing
Word tokenization
Punctuation removal
Stemming
Lemmatization
Stopword removal
Sentence tokenization
POS tagging
Chunking
N-gram extraction
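
A few of these steps can be illustrated with the standard library alone. The sketch below covers lowercasing, word tokenization, punctuation removal, stopword removal, and n-gram extraction; it is a simplified stand-in for the NLTK calls in the script, and the tiny stopword set is an assumption for the example. Stemming, lemmatization, POS tagging, and chunking need NLTK's models and are left out.

```python
import re

# Tiny illustrative stopword set; NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "is", "on", "and", "of", "to", "in"}


def preprocess(text):
    """Lowercase, keep alphanumeric tokens (dropping punctuation), remove stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """Return all contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("The cat sat on the mat, and the dog barked!")
bigrams = ngrams(tokens, 2)
print(tokens)    # ['cat', 'sat', 'mat', 'dog', 'barked']
print(bigrams)   # [('cat', 'sat'), ('sat', 'mat'), ('mat', 'dog'), ('dog', 'barked')]
```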

Text_Preprocessing_Clustering.py uses the 20 Newsgroups dataset, converts documents into TF-IDF features, clusters them with KMeans, and evaluates clustering with homogeneity and V-measure.
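
The two clustering metrics can be computed by hand from label counts, which helps when interpreting the numbers the script reports. The following is a from-scratch sketch of the standard definitions (homogeneity = 1 - H(C|K)/H(C), completeness = 1 - H(K|C)/H(K), V-measure = their harmonic mean), not the scikit-learn functions themselves; the four-document example is made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given) for two equal-length label sequences."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())


def homogeneity_completeness_v(true_labels, cluster_ids):
    """Return (homogeneity, completeness, V-measure) for a clustering."""
    h_c = entropy(true_labels)
    h_k = entropy(cluster_ids)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(true_labels, cluster_ids) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(cluster_ids, true_labels) / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Two true topics; the clustering splits the second topic across two clusters.
true_labels = [0, 0, 1, 1]
cluster_ids = [0, 0, 1, 2]
h, c, v = homogeneity_completeness_v(true_labels, cluster_ids)
print(round(h, 3), round(c, 3), round(v, 3))   # 1.0 0.667 0.8
```

In this example every cluster contains a single topic, so homogeneity is perfect, but one topic is split across two clusters, which lowers completeness and therefore the V-measure.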

5. LSTM Stock Price Prediction

Location: mod7/rnn.py

This lab builds a stacked LSTM model for Google stock-price prediction.

Pipeline:

Reads Google stock training and test CSV files
Uses the Open price as the training signal
Scales values with MinMaxScaler
Builds 60-day timestep sequences
Trains a 4-layer LSTM network with dropout
Predicts 2017 stock prices
Plots real vs predicted stock prices
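
The scaling and 60-day windowing steps are the least obvious part of this pipeline, so here is a minimal pure-Python sketch of them. It uses a synthetic price series in place of the Open column, and the helper names are hypothetical; rnn.py itself relies on MinMaxScaler and array slicing.

```python
def min_max_scale(values):
    """Scale numbers to the [0, 1] range, as MinMaxScaler does by default.

    Assumes the series is not constant (max > min).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def make_windows(series, timesteps=60):
    """Turn a 1-D series into (window, next_value) training pairs."""
    X, y = [], []
    for i in range(timesteps, len(series)):
        X.append(series[i - timesteps:i])   # the previous `timesteps` values
        y.append(series[i])                 # the value the model should predict
    return X, y


# Synthetic stand-in for the Open price column: 200 trading days
prices = [100.0 + 0.5 * day for day in range(200)]
scaled = min_max_scale(prices)
X, y = make_windows(scaled, timesteps=60)
print(len(X), len(X[0]))   # 140 60
```

Each training sample holds the 60 scaled prices that precede the target day, so a series of N values yields N - 60 samples. Keras LSTM layers expect input shaped as (samples, timesteps, features), so these windows would be reshaped to (140, 60, 1) before training.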

Datasets

| Dataset | Location | Shape / Notes |
| --- | --- | --- |
| Student scores | `mod1/Gradient Descent/student_scores.csv` | 25 rows plus header; columns: `Hours`, `Scores`, `Pass` |
| Iris | `mod2/iris.data` | 150 Iris samples plus trailing newline; 4 numeric features and species label |
| AE train/test | `mod2/ae.train`, `mod2/ae.test` | Numeric train/test files; size metadata is stored in `size_ae.train` and `size_ae.test` |
| GA/SVM workbook | `mod3/dataset.xlsx` | Main sheet has 1,297 rows and columns `X1`-`X8`, `Y1`, `Y2` |
| Dermatology | `mod4/dermatology_csv.csv`, `mod5/dermatology_csv.csv` | 366 records, 34 feature columns, and target `class` |
| Google stock prices | `mod7/Google_Stock_Price_Train.csv`, `mod7/Google_Stock_Price_Test.csv` | 2012-2016 training data and January 2017 test data |

Prerequisites

Recommended environment:

Python 3.x
Jupyter Notebook or JupyterLab
Core packages:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- nltk
- mlxtend
- tensorflow
- openpyxl

There is currently no requirements.txt, so the list above is inferred from the scripts and notebooks.

Setup

Clone the repository:

git clone https://github.com/devthedevil/Machine-learning-Lab.git
cd Machine-learning-Lab

Create a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install common dependencies:

pip install numpy pandas matplotlib seaborn scikit-learn nltk mlxtend tensorflow openpyxl jupyter

Download common NLTK resources:

import nltk

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

How to Run

Most scripts use relative file paths, so run them from inside their module folder.

Example:

cd mod7
python rnn.py

For notebook-based material:

jupyter notebook

Then open:

mod3/GA.ipynb
mod5/Untitled.ipynb
mod5/Untitled1.ipynb

Some .py files contain notebook magics such as %matplotlib inline or !pip install. Run those cells in Jupyter, or remove the magic lines before executing them as plain Python scripts.
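
Removing the magic lines can be automated. The helper below is a hypothetical sketch, not part of the repository: it comments out any line whose first non-space character is `%` or `!`. It assumes such lines are always notebook magics or shell escapes, which can misfire on a continuation line that happens to start with `%`, so review the result before relying on it.

```python
import re

# A line whose first non-space character is % or ! is treated as an
# IPython magic or shell escape (an assumption; see the note above).
MAGIC_LINE = re.compile(r"^(\s*)([%!].*)$")


def strip_notebook_magics(source):
    """Comment out notebook-only lines so the rest runs under plain `python`."""
    cleaned = []
    for line in source.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            # Keep the original indentation and text, but disable the line.
            cleaned.append(f"{match.group(1)}# {match.group(2)}")
        else:
            cleaned.append(line)
    return "\n".join(cleaned) + "\n"


example = (
    "%matplotlib inline\n"
    "!pip install mlxtend\n"
    "import numpy as np\n"
    "print(np.pi)\n"
)
cleaned = strip_notebook_magics(example)
print(cleaned)
```

To apply it to a file, read the script's text, pass it through `strip_notebook_magics`, and write the result to a new file rather than overwriting the original.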

Current Execution Notes

A review of the scripts and notebooks for syntax problems and runtime requirements found the following:

mod3/ga_m.py and mod3/GA.ipynb contain the clearest complete genetic algorithm implementation.
mod3/GA-on-SVM-for-Y1.py is a draft for GA-based SVM hyperparameter optimization. Its dataset-loading block is commented out, and it references variables/API names that need cleanup before direct execution.
mod4/PCA.py, mod4/feature_forward_and_backward.py, and the mod5 classifier scripts contain notebook-only syntax such as %matplotlib inline or !pip install.
mod4/feature_forward_and_backward.py also contains a typo in the backward-selection section and should be cleaned before running as a script.
mod6/Text_Preprocessing.py expects a local 20_newsgroups folder for load_files.
mod6/Text_Preprocessing_Clustering.py uses fetch_20newsgroups, which may download data when run for the first time.
mod7/rnn.py is a complete LSTM script, but it may take time to train because it runs 100 epochs.

Suggested Learning Path

Start with mod1/Gradient Descent/student_scores.csv to understand the simplest supervised-learning data.
Review mod2/iris.data and the train/test datasets.
Open mod3/GA.ipynb to study the genetic algorithm loop from scratch.
Move to mod4/IG.py and mod4/PCA.py for feature selection and dimensionality reduction.
Compare classifiers in mod5/.
Explore text preprocessing in mod6/Text_Preprocessing.py.
Finish with the LSTM sequence model in mod7/rnn.py.

Suggested Improvements

Add requirements.txt or environment.yml
Convert notebook-style .py files into real notebooks or remove notebook magic commands from scripts
Fix runtime issues in GA-on-SVM-for-Y1.py, feature_forward_and_backward.py, and Text_Preprocessing.py
Add module-specific README files for each lab
Add saved plots for PCA, confusion matrices, clustering metrics, and RNN predictions
Add a single src/ package for reusable preprocessing/modeling utilities
Add lightweight tests or notebook execution checks
Add clearer dataset source notes for ae.train, ae.test, and dataset.xlsx

What This Lab Demonstrates

Understanding of the full ML workflow from preprocessing to evaluation
Ability to implement optimization logic manually
Familiarity with classical ML models and model comparison
Experience with feature selection and PCA-based dimensionality reduction
Practical exposure to NLP preprocessing and document clustering
Experience building and training an LSTM with TensorFlow/Keras
Comfort working with scripts, notebooks, CSV datasets, Excel files, and visualizations
