This repository is a module-based machine learning lab that explores foundational algorithms, feature engineering, model comparison, NLP preprocessing, clustering, genetic algorithms, and deep learning for time-series prediction.
The project is organized as seven lab modules (mod1 through mod7). Each module focuses on a different part of the machine learning workflow: data preparation, model training, optimization, dimensionality reduction, evaluation, and applied prediction.
flowchart LR
A["Input Data"] --> B["Preprocessing"]
B --> C["Feature Engineering"]
C --> D["Model Training"]
D --> E["Evaluation"]
E --> F["Visualization"]
C --> C1["PCA"]
C --> C2["Mutual Information"]
C --> C3["Sequential Selection"]
D --> D1["Classical ML"]
D --> D2["Genetic Algorithm"]
D --> D3["KMeans Clustering"]
D --> D4["LSTM RNN"]
E --> E1["Accuracy"]
E --> E2["Classification Report"]
E --> E3["Confusion Matrix"]
E --> E4["Homogeneity / V-measure"]
.
|-- ml.jpeg
|-- mod1/
|   `-- Gradient Descent/
|       `-- student_scores.csv
|-- mod2/
|   |-- iris.data
|   |-- ae.train
|   |-- ae.test
|   |-- size_ae.train
|   `-- size_ae.test
|-- mod3/
|   |-- GA.ipynb
|   |-- ga_m.py
|   |-- GA-on-SVM-for-Y1.py
|   |-- dataset.xlsx
|   |-- GA_guideline.pdf
|   `-- GA parameters details.pdf
|-- mod4/
|   |-- IG.py
|   |-- PCA.py
|   |-- Ranking_of_feature.py
|   |-- feature_forward_and_backward.py
|   |-- feature_backward.py
|   |-- dermatology.data
|   |-- dermatology.names
|   `-- dermatology_csv.csv
|-- mod5/
|   |-- logistic_regression.py
|   |-- decision_tree_classification.py
|   |-- random_forest_classification.py
|   |-- svm.py
|   |-- Untitled.ipynb
|   |-- Untitled1.ipynb
|   `-- dermatology_csv.csv
|-- mod6/
|   |-- Text_Preprocessing.py
|   `-- Text_Preprocessing_Clustering.py
`-- mod7/
    |-- rnn.py
    |-- Google_Stock_Price_Train.csv
    `-- Google_Stock_Price_Test.csv
| Module | Focus | Main Files | What It Demonstrates |
|---|---|---|---|
| mod1 | Gradient descent dataset | student_scores.csv | Small supervised learning dataset with study hours, scores, and pass/fail labels |
| mod2 | Core classification datasets | iris.data, ae.train, ae.test | Iris data and train/test numeric datasets for lab experiments |
| mod3 | Genetic algorithms | GA.ipynb, ga_m.py, GA-on-SVM-for-Y1.py, dataset.xlsx | Binary encoding, roulette-wheel selection, crossover, mutation, elitist survival, and a draft of GA-driven SVM parameter search |
| mod4 | Feature engineering | IG.py, PCA.py, Ranking_of_feature.py, feature_forward_and_backward.py | Mutual information, PCA, variance-based feature ranking, forward/backward feature selection |
| mod5 | Classifier comparison | logistic_regression.py, decision_tree_classification.py, random_forest_classification.py, svm.py | Dermatology classification with multiple algorithms, imputation, scaling, reports, and confusion matrices |
| mod6 | NLP preprocessing and clustering | Text_Preprocessing.py, Text_Preprocessing_Clustering.py | Tokenization, stemming, lemmatization, stopwords, POS tagging, n-grams, TF-IDF, KMeans clustering |
| mod7 | Deep learning time-series prediction | rnn.py, Google stock CSV files | LSTM-based Google stock-price prediction using Keras/TensorFlow |
- Gradient descent datasets and supervised learning basics
- Iris and other train/test datasets for classification experiments
- Genetic algorithm implementation from scratch
- GA operators: population generation, roulette-wheel selection, crossover, mutation, and elitism
- Feature selection with mutual information
- Dimensionality reduction with Principal Component Analysis
- Sequential forward and backward feature selection
- Logistic regression, decision trees, random forests, and SVM
- Confusion matrices and classification reports
- NLP text cleaning with NLTK
- TF-IDF vectorization and KMeans text clustering
- LSTM recurrent neural network for stock-price prediction
Location: mod3/GA.ipynb, mod3/ga_m.py
The genetic algorithm lab optimizes a mathematical objective function:
f(x) = x^3 + 9
The implementation uses:
- 6-bit binary chromosomes
- Population size of 10
- Roulette-wheel parent selection
- Single-point crossover
- Mutation
- Replacement of weak offspring with strong parent solutions
This is one of the strongest parts of the repository because it shows the optimization loop explicitly instead of hiding it behind a library.
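The loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in ga_m.py: the mutation rate, the number of generations, the choice to maximize f(x), and every function name below are assumptions made for the example.

```python
import random

BITS = 6              # chromosome length, as listed above
POP_SIZE = 10         # population size, as listed above
MUTATION_RATE = 0.05  # assumed value; not taken from the lab code


def decode(chromosome):
    """Interpret a list of 0/1 bits as an unsigned integer (0..63)."""
    return int("".join(map(str, chromosome)), 2)


def fitness(chromosome):
    """Objective from the text: f(x) = x^3 + 9 (maximized in this sketch)."""
    x = decode(chromosome)
    return x ** 3 + 9


def roulette_select(population, rng):
    """Pick one parent with probability proportional to its fitness."""
    weights = [fitness(c) for c in population]
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(a, b, rng):
    """Single-point crossover: swap tails after a random cut point."""
    point = rng.randint(1, BITS - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rng):
    """Flip each bit independently with probability MUTATION_RATE."""
    return [bit ^ 1 if rng.random() < MUTATION_RATE else bit for bit in chromosome]


def evolve(generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(BITS)] for _ in range(POP_SIZE)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < POP_SIZE:
            p1 = roulette_select(population, rng)
            p2 = roulette_select(population, rng)
            c1, c2 = crossover(p1, p2, rng)
            offspring += [mutate(c1, rng), mutate(c2, rng)]
        # Elitist survival: keep the POP_SIZE fittest of parents + offspring,
        # so weak offspring are replaced by stronger parents.
        population = sorted(population + offspring, key=fitness, reverse=True)[:POP_SIZE]
    return population[0]


best = evolve()
print(decode(best), fitness(best))
```

Because survivors are drawn from parents and offspring together, the best fitness in the population never decreases from one generation to the next, which is the practical effect of the replacement rule in the list above.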
Location: mod5/
The dermatology classification scripts train multiple models on a dataset with 34 features and a target class column.
Models included:
- Logistic Regression
- Decision Tree with entropy criterion
- Random Forest
- Linear Support Vector Machine
Each script follows a similar pipeline:
flowchart LR
A["dermatology_csv.csv"] --> B["Train/Test Split"]
B --> C["StandardScaler"]
C --> D["Mean Imputation"]
D --> E["Classifier"]
E --> F["Predictions"]
F --> G["Classification Report"]
F --> H["Confusion Matrix Heatmap"]
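The last two stages of the pipeline can be illustrated without any library. The sketch below builds a confusion matrix from true and predicted labels and derives per-class precision and recall, which are the core numbers a classification report shows. The function names and the toy labels are made up for illustration; the lab scripts use scikit-learn for this step.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_report(matrix, labels):
    """Precision and recall per class, derived from the confusion matrix."""
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        report[label] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return report


# Made-up labels for illustration only; not results from the dermatology data.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
labels = [1, 2, 3]

cm = confusion_matrix(y_true, y_pred, labels)
accuracy = sum(cm[i][i] for i in range(len(labels))) / len(y_true)
report = per_class_report(cm, labels)
print(cm, round(accuracy, 3), report)
```

Reading the matrix row by row shows which true classes are confused with which predictions, which is the same information the heatmap presents visually.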
Location: mod4/
This module explores how feature preparation affects model performance.
Included techniques:
- Mutual information-based feature selection in IG.py
- PCA-based dimensionality reduction in PCA.py
- Variance-based feature ranking in Ranking_of_feature.py
- Sequential forward/backward feature selection in feature_forward_and_backward.py
The main dataset is dermatology_csv.csv, which contains 366 records, 34 feature columns, and 1 class label column.
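To make the idea behind mutual information-based selection concrete, here is a small sketch that scores discrete features against a class label and ranks them. It is a plain count-based estimator written for illustration, not the estimator used in IG.py, and the column names "a" and "b" and the toy values are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, target):
    """I(X;Y) in nats for two equal-length sequences of discrete values."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    px = Counter(feature)
    py = Counter(target)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def rank_features(columns, target):
    """Return feature names sorted from most to least informative, plus scores."""
    scores = {name: mutual_information(values, target) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Toy data: "a" copies the label exactly, "b" is independent of it.
target = [0, 0, 1, 1]
columns = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
order, scores = rank_features(columns, target)
print(order, scores)
```

A feature that determines the label gets the maximum score (the entropy of the label), while an independent feature scores zero, so keeping the top-ranked columns discards the least informative ones first.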
Location: mod6/
Text_Preprocessing.py demonstrates standard NLP preprocessing steps:
- Lowercasing
- Word tokenization
- Punctuation removal
- Stemming
- Lemmatization
- Stopword removal
- Sentence tokenization
- POS tagging
- Chunking
- N-gram extraction
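Several of these steps are simple enough to sketch with the standard library alone. The example below covers lowercasing, punctuation removal, whitespace tokenization, stopword removal, and n-gram extraction. It is an illustration under stated assumptions: the stopword set is a tiny hand-written stand-in rather than the NLTK corpus, and the function names are hypothetical.

```python
import string

# Tiny illustrative stopword set; a stand-in for NLTK's stopword corpus.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("The quick brown fox jumps over the lazy dog.")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

Stemming, lemmatization, POS tagging, and chunking need trained models or lexicons, which is why the lab relies on NLTK for those steps.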
Text_Preprocessing_Clustering.py uses the 20 Newsgroups dataset, converts documents into TF-IDF features, clusters them with KMeans, and evaluates clustering with homogeneity and V-measure.
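The two clustering metrics named above compare cluster assignments with known class labels using entropy. The sketch below computes homogeneity, completeness, and V-measure (their harmonic mean) from scratch. It follows the standard definitions and is meant only to show what the scores measure; the function names and toy labels are assumptions, and the lab script would normally obtain these values from scikit-learn.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity_completeness_v(classes, clusters):
    h_c, h_k = entropy(classes), entropy(clusters)
    # Homogeneity: each cluster contains members of a single class.
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    # Completeness: all members of a class land in the same cluster.
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Toy labels: every point in its own cluster is perfectly homogeneous
# but splits each class in two, which lowers completeness.
classes = [0, 0, 1, 1]
clusters = [0, 1, 2, 3]
h, c, v = homogeneity_completeness_v(classes, clusters)
print(h, c, v)
```

Both scores are invariant to how cluster IDs are numbered, so KMeans output can be compared with newsgroup labels directly without matching cluster indices to classes.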
Location: mod7/rnn.py
This lab builds a stacked LSTM model for Google stock-price prediction.
Pipeline:
- Reads Google stock training and test CSV files
- Uses the Open price as the training signal
- Scales values with MinMaxScaler
- Builds 60-day timestep sequences
- Trains a 4-layer LSTM network with dropout
- Predicts 2017 stock prices
- Plots real vs predicted stock prices
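The scaling and windowing steps in this pipeline are the part that most often causes off-by-one mistakes, so here is a plain-Python sketch of them. It uses a synthetic series in place of the Open column, and the function names are hypothetical; rnn.py itself uses MinMaxScaler for scaling.

```python
def min_max_scale(values):
    """Rescale values linearly into [0, 1], the default range of MinMaxScaler."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def make_windows(series, timesteps=60):
    """Pair each value with the `timesteps` values that precede it."""
    X, y = [], []
    for i in range(timesteps, len(series)):
        X.append(series[i - timesteps:i])  # inputs: the previous 60 values
        y.append(series[i])                # target: the value that follows
    return X, y


# Synthetic stand-in for the Open column; not real stock data.
prices = [100.0 + i for i in range(70)]
scaled = min_max_scale(prices)
X, y = make_windows(scaled, timesteps=60)
print(len(X), len(X[0]), y[0])
```

A series of N values yields N - 60 training samples. Before being passed to a Keras LSTM layer, the inputs would be reshaped to (samples, timesteps, features), with a single feature here.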
| Dataset | Location | Shape / Notes |
|---|---|---|
| Student scores | mod1/Gradient Descent/student_scores.csv | 25 rows plus header; columns: Hours, Scores, Pass |
| Iris | mod2/iris.data | 150 Iris samples plus trailing newline; 4 numeric features and species label |
| AE train/test | mod2/ae.train, mod2/ae.test | Numeric train/test files; size metadata is stored in size_ae.train and size_ae.test |
| GA/SVM workbook | mod3/dataset.xlsx | Main sheet has 1,297 rows and columns X1-X8, Y1, Y2 |
| Dermatology | mod4/dermatology_csv.csv, mod5/dermatology_csv.csv | 366 records, 34 feature columns, and target class |
| Google stock prices | mod7/Google_Stock_Price_Train.csv, mod7/Google_Stock_Price_Test.csv | 2012-2016 training data and January 2017 test data |
Recommended environment:
- Python 3.x
- Jupyter Notebook or JupyterLab
- Core packages: numpy, pandas, matplotlib, seaborn, scikit-learn, nltk, mlxtend, tensorflow, openpyxl
There is currently no requirements.txt, so the list above is inferred from the scripts and notebooks.
Clone the repository:

git clone https://github.com/devthedevil/Machine-learning-Lab.git
cd Machine-learning-Lab

Create a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install common dependencies:

pip install numpy pandas matplotlib seaborn scikit-learn nltk mlxtend tensorflow openpyxl jupyter

Download common NLTK resources:

import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

Most scripts use relative file paths, so run them from inside their module folder.

Example:

cd mod7
python rnn.py

For notebook-based material:

jupyter notebook

Then open:

- mod3/GA.ipynb
- mod5/Untitled.ipynb
- mod5/Untitled1.ipynb

Some .py files contain notebook magics such as %matplotlib inline or !pip install. Run those cells in Jupyter, or remove the magic lines before executing them as plain Python scripts.
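Removing the magic lines can be automated. The helper below is a minimal sketch, assuming that every magic sits on its own line and starts with % or ! after optional indentation; the function name is hypothetical and nothing like it ships with the repository.

```python
def strip_notebook_magics(source):
    """Drop lines whose first non-blank character is % or !.

    Caveat: a continuation line of a multi-line expression that happens
    to start with % would also be removed, so review the result.
    """
    kept = [
        line for line in source.splitlines()
        if not line.lstrip().startswith(("%", "!"))
    ]
    return "\n".join(kept) + "\n"


example = "%matplotlib inline\n!pip install mlxtend\nimport pandas as pd\nx = 10 % 3\n"
cleaned = strip_notebook_magics(example)
print(cleaned)
```

To apply it to a script, read the file's text, pass it through the function, and write the result to a new file so the original stays available for comparison.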
During analysis, the repository was checked for syntax and runtime expectations:
- mod3/ga_m.py and mod3/GA.ipynb contain the clearest complete genetic algorithm implementation.
- mod3/GA-on-SVM-for-Y1.py is a draft for GA-based SVM hyperparameter optimization. Its dataset-loading block is commented out, and it references variables/API names that need cleanup before direct execution.
- mod4/PCA.py, mod4/feature_forward_and_backward.py, and the mod5 classifier scripts contain notebook-only syntax such as %matplotlib inline or !pip install.
- mod4/feature_forward_and_backward.py also contains a typo in the backward-selection section and should be cleaned before running as a script.
- mod6/Text_Preprocessing.py expects a local 20_newsgroups folder for load_files.
- mod6/Text_Preprocessing_Clustering.py uses fetch_20newsgroups, which may download data when run for the first time.
- mod7/rnn.py is a complete LSTM script, but it may take time to train because it runs 100 epochs.
- Start with mod1/Gradient Descent/student_scores.csv to understand the simplest supervised-learning data.
- Review mod2/iris.data and the train/test datasets.
- Open mod3/GA.ipynb to study the genetic algorithm loop from scratch.
- Move to mod4/IG.py and mod4/PCA.py for feature selection and dimensionality reduction.
- Compare classifiers in mod5/.
- Explore text preprocessing in mod6/Text_Preprocessing.py.
- Finish with the LSTM sequence model in mod7/rnn.py.
- Add requirements.txt or environment.yml
- Convert notebook-style .py files into real notebooks or remove notebook magic commands from scripts
- Fix runtime issues in GA-on-SVM-for-Y1.py, feature_forward_and_backward.py, and Text_Preprocessing.py
- Add module-specific README files for each lab
- Add saved plots for PCA, confusion matrices, clustering metrics, and RNN predictions
- Add a single src/ package for reusable preprocessing/modeling utilities
- Add lightweight tests or notebook execution checks
- Add clearer dataset source notes for ae.train, ae.test, and dataset.xlsx
- Understanding of the full ML workflow from preprocessing to evaluation
- Ability to implement optimization logic manually
- Familiarity with classical ML models and model comparison
- Experience with feature selection, PCA, and dimensionality reduction
- Practical exposure to NLP preprocessing and document clustering
- Experience building and training an LSTM with TensorFlow/Keras
- Comfort working with scripts, notebooks, CSV datasets, Excel files, and visualizations
