GitHub - is-leeroy-jenkins/Sake: A modular machine learning framework for budget execution & data analysis built in Python with Scikit, XGBoost, PyTorch, and TensorFlow.

sake

A modular machine learning framework for budget execution & data analysis built in Python with Scikit, XGBoost, PyTorch, and TensorFlow. Designed for rapid experimentation, visualization, and benchmarking of both classification and regression models, it provides a structured yet extensible workflow that’s equally useful for teaching, prototyping, and real-world application development.

Demo

☁️ Google (Cloud)

🧱 Databricks

A data engineering, analytics, and artificial intelligence collaborative workspace
Codebase

🕸️ Streamlit (Web)

🧪 How to Run

git clone https://github.com/<your-username>/sake.git
cd sake
pip install -r requirements.txt
jupyter notebook balances.ipynb

Option A — Google Colab (no local setup)

1. Click the **Open In Colab** badge above.
2. Upload your CSV or mount Google Drive.
3. Set `DATA_PATH` near the top of the notebook.
4. **Runtime → Run all**.

Option B — Local (conda or venv)

# 1) Create environment
conda create -n sake python=3.11 -y
conda activate sake

# 2) Install dependencies
pip install -U pip wheel setuptools
pip install pandas numpy scipy matplotlib seaborn scikit-learn jupyter

# 3) Launch Jupyter
jupyter notebook

Open ipynb/schedule-x.ipynb and run cells top-to-bottom.

This section describes how to clone the repository, install dependencies, and launch the Sake Streamlit application locally.
The Streamlit app provides an interactive interface for exploring Account Balances data, performing statistical analysis, and training machine-learning models as described in this project.

📥 Clone the Repository

First, clone the Sake repository from GitHub and navigate into the project directory:

git clone https://github.com/<your-username>/sake.git
cd sake

🐍 Create a Virtual Environment (Recommended)

It is strongly recommended to run Sake in an isolated Python virtual environment.

Windows (PowerShell):

python -m venv .venv
.venv\Scripts\Activate.ps1

macOS / Linux:

python3 -m venv .venv
source .venv/bin/activate

Verify that the virtual environment is active before proceeding.

📦 Install Dependencies

Install all required Python packages using the provided requirements.txt file:

pip install --upgrade pip
pip install -r requirements.txt

Note: Ensure that `streamlit` is listed in requirements.txt. If it is not, install it manually:
pip install streamlit

▶️ Launch the Streamlit Application

Once dependencies are installed, start the Streamlit app by running:

streamlit run app.py

Streamlit will start a local development server and automatically open the application in your default web browser. If it does not open automatically, the terminal will display a local URL similar to:

http://localhost:8501

Open that address in your browser to access the app.

🔬 Data Source

File A (Account Balances) is published monthly by agencies on USAspending.
It is required by the DATA Act.
It is pulled automatically from the Governmentwide Treasury Account Symbol Adjusted Trial Balance System (GTAS).
It contains budgetary resource, obligation, and outlay data for all relevant Treasury Account Symbols (TAS) in a reporting agency.
It includes both award and non-award spending (grouped together) and crosswalks with the SF 133 report.

🔄 Unified Evaluation Pipeline

A single interface `train_and_evaluate()` to:

Train models
Cross-validate with nested k-fold
Generate predictions
Output evaluation plots & performance metrics
Store results & timings for meta-analysis
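
The README does not show the signature of `train_and_evaluate()`, so the following is only a minimal sketch of what such a unified interface might look like for a regression task, assuming scikit-learn estimators. The parameter names, the returned dictionary, and the use of plain k-fold cross-validation (a simplification of the nested scheme mentioned above) are assumptions made for illustration.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split


def train_and_evaluate(model, X, y, cv=5, test_size=0.25, random_state=0):
    """Hypothetical sketch: cross-validate, fit, predict, and time one regressor."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Cross-validate on the training split only, so the test split stays unseen.
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")

    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    return {
        "model": type(model).__name__,
        "cv_r2_mean": float(np.mean(cv_scores)),
        "r2": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
        "mse": mean_squared_error(y_test, y_pred),
        "fit_seconds": fit_seconds,
        "predict_seconds": predict_seconds,
    }


# Synthetic data stands in for real File A columns.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
result = train_and_evaluate(Ridge(alpha=1.0), X, y)
print(result)
```

Returning a flat dictionary per model keeps the results easy to collect into a table later; the actual pipeline may structure its output differently.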

📊 Using the Application

After launch, the Streamlit interface will guide you through:

1. **Uploading File A (Account Balances)**: Upload an Excel file containing GTAS / USAspending Account Balances data.
2. **Exploring the Data**: View previews, column summaries, descriptive statistics, and distributions.
3. **Statistical Analysis**: Perform correlation analysis, hypothesis testing, and ANOVA-style comparisons.
4. **Feature Engineering**: Apply dimensionality reduction techniques such as PCA, Truncated SVD, or Factor Analysis.
5. **Model Training & Evaluation**: Train and evaluate regression, classification, or clustering models using a unified `train_and_evaluate()` pipeline with optional cross-validation and diagnostics.
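
As an illustration of the feature engineering step, here is a minimal sketch of PCA with scikit-learn. It is not taken from the app's source; the simulated matrix stands in for numeric File A columns, and scaling before PCA is an assumed (though common) preprocessing choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulated matrix (rows = accounts): 5 columns built from 2 underlying factors.
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Standardize so that columns with large dollar amounts do not dominate.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

print(components.shape, round(explained, 3))
```

Because the simulated columns are linear combinations of two factors, two components capture essentially all of the variance; real budget data will usually need more.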

🛑 Stopping the App

To stop the Streamlit server, return to the terminal and press:

Ctrl + C

🧪 Supported Python Versions

Python 3.9 – 3.12 (64-bit recommended)
Tested on Windows and macOS

📊 Rich Visualization Toolkit

Confusion Matrix Heatmaps 🔥
ROC & Precision-Recall Curves 📈
Actual vs. Predicted Scatterplots 🎯
Residual Analysis & Error Distribution 🎭
Feature Importance Charts 📊
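
The first two plot types are driven by a small set of arrays. The sketch below, which assumes scikit-learn and uses synthetic data, computes those arrays without drawing anything; how Sake renders them is not shown in this README.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data as a stand-in for a real target.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

matrix = confusion_matrix(y_test, y_pred)          # rows = actual, columns = predicted
fpr, tpr, thresholds = roc_curve(y_test, y_score)  # points along the ROC curve
auc = roc_auc_score(y_test, y_score)

print(matrix)
print(f"AUC = {auc:.3f}")
```

The `matrix` array can be passed to a heatmap function, and `fpr` against `tpr` gives the ROC curve.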

⏱️ Timing & Benchmarking

Automatically logs fit and predict durations
Model performance rankings across tasks
Output available in tabular format for export
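
A minimal sketch of how fit and predict durations might be logged, ranked, and exported follows. The column names and the choice of R² as the ranking metric are assumptions; the data is synthetic.

```python
import time

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rows = []
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    rows.append(
        {
            "model": type(model).__name__,
            "r2": r2_score(y_test, predictions),
            "fit_seconds": fit_seconds,
            "predict_seconds": predict_seconds,
        }
    )

# Rank by held-out R² and render the table as CSV text for export.
benchmark = pd.DataFrame(rows).sort_values("r2", ascending=False).reset_index(drop=True)
csv_text = benchmark.to_csv(index=False)
print(benchmark)
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock; single-run timings are still noisy, so repeated runs give more reliable rankings.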

💡 Custom Dataset Support

Accepts CSVs, Excel files, or Pandas DataFrames
Label encoding, numeric coercion, missing data handling
Drop-in replacement for datasets via parameter injection
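
The three preprocessing steps named above can be sketched with pandas as follows. The column names and values are hypothetical placeholders rather than the real File A schema, and median imputation is just one reasonable choice for missing values.

```python
import pandas as pd

# Hypothetical raw frame with text-formatted numbers and missing entries.
raw = pd.DataFrame(
    {
        "Availability": ["Annual", "Multi-Year", None, "Annual"],
        "Obligations": ["1,200.50", "300", "n/a", "45.25"],
        "Outlays": [1000.0, None, 20.0, 40.0],
    }
)

clean = raw.copy()

# Numeric coercion: strip thousands separators, turn unparseable text into NaN.
clean["Obligations"] = pd.to_numeric(
    clean["Obligations"].str.replace(",", "", regex=False), errors="coerce"
)

# Missing data handling: median for numeric columns, a sentinel label for categories.
for column in ("Obligations", "Outlays"):
    clean[column] = clean[column].fillna(clean[column].median())
clean["Availability"] = clean["Availability"].fillna("Unknown")

# Label encoding: map each category to an integer code (alphabetical order).
clean["AvailabilityCode"] = clean["Availability"].astype("category").cat.codes

print(clean)
```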

🧪 Research Ready

Benchmark dozens of models easily
Plug-in architecture for testing experimental models
Use in classrooms to demo interpretability, overfitting, and variance
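
The README does not document the plug-in mechanism. Assuming experimental models only need to follow scikit-learn's estimator convention (`fit` and `predict`), a toy plug-in might look like the sketch below; `MedianRegressor` is a hypothetical name invented for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score


class MedianRegressor(RegressorMixin, BaseEstimator):
    """Toy experimental model: always predicts the training median."""

    def fit(self, X, y):
        self.median_ = float(np.median(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.median_)


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)

model = MedianRegressor().fit(X, y)
preds = model.predict(X)

# Because it follows the estimator convention, standard tooling accepts it.
scores = cross_val_score(MedianRegressor(), X, y, cv=3)
print(preds[:3], scores)
```

A constant baseline like this is also a useful classroom reference point: any real model should beat it.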

📊 Descriptive Statistics

| Statistic | Description | Use in Budget Analysis |
| --- | --- | --- |
| Mean | Average value | Average Outlays, Obligations, etc., across accounts |
| Median | Middle value | Robust central tendency in skewed financial data |
| Mode | Most frequent value | Identify common MainAccountCodes or Availability categories |
| Standard Deviation | Spread around the mean | Indicates variability in execution rates or balances |
| Variance | Square of standard deviation | Used in statistical tests and model diagnostics |
| Range | Difference between max and min | Measures total spread of financial metrics |
| Interquartile Range (IQR) | Spread of middle 50% of data | Identifies budget outliers and extreme accounts |
| Skewness | Asymmetry of distribution | Skewed obligations suggest a few accounts dominate totals |
| Kurtosis | Tail heaviness ("peakedness") of distribution | High values indicate outlier-prone financial data |
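
These statistics are all available on a pandas Series. The sketch below uses made-up values with one large outlier to show how each is computed; a real run would use a numeric File A column.

```python
import pandas as pd

# Illustrative values only, with one extreme account at the end.
outlays = pd.Series([10.0, 12.0, 12.0, 15.0, 18.0, 22.0, 95.0], name="Outlays")

q1, q3 = outlays.quantile([0.25, 0.75])
summary = {
    "mean": outlays.mean(),
    "median": outlays.median(),
    "mode": outlays.mode().iloc[0],
    "std": outlays.std(),        # sample standard deviation (ddof=1)
    "variance": outlays.var(),   # sample variance (ddof=1)
    "range": outlays.max() - outlays.min(),
    "iqr": q3 - q1,
    "skewness": outlays.skew(),
    "kurtosis": outlays.kurt(),  # excess kurtosis: about 0 for a normal distribution
}

for name, value in summary.items():
    print(f"{name:>9}: {value:.3f}")
```

Note how the single outlier pulls the mean well above the median and produces positive skewness, which is the pattern the table describes for skewed financial data.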

🔍 Inferential Statistics

| Metric | Description | Use in Budget Analysis |
| --- | --- | --- |
| Pearson Correlation | Linear relationship between variables | E.g., TotalResources vs. Obligations |
| Spearman Correlation | Monotonic (rank-based) relationship | More robust to non-linear trends in financial execution |
| t-test | Compare means between 2 groups | Discretionary vs. Mandatory accounts' execution rates |
| ANOVA | Compare means across multiple groups | Obligations across availability periods or account types |
| Chi-square Test | Categorical independence | Are Main Account Codes related to availability or a specific agency? |
| Confidence Intervals | Estimate range of a population mean | Upper and lower bounds on expected obligations or recoveries |
| Regression Coefficients (p-values) | Test variable significance | Are Recoveries a significant predictor of UnobligatedBalance? |
| F-statistic (overall regression) | Test whole model fit | Determines the combined influence of all predictors |
| Z-score / Outlier Tests | Deviation from the mean in standard-deviation units | Identify abnormal balances or lapse rates |
| Boxplots | Visual outlier detection | Discover obligation anomalies within agencies |
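
Several of these tests can be sketched with `scipy.stats`. All numbers below are simulated; the variable names mirror the table's examples but are not real budget data, and Welch's variant of the t-test is an assumed choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Correlation: obligations simulated as roughly 80% of resources plus noise.
total_resources = rng.uniform(1e6, 5e6, size=60)
obligations = 0.8 * total_resources + rng.normal(0, 1e5, size=60)

pearson_r, pearson_p = stats.pearsonr(total_resources, obligations)
spearman_rho, spearman_p = stats.spearmanr(total_resources, obligations)

# Two-sample t-test (Welch's variant, which does not assume equal variances).
discretionary = rng.normal(0.90, 0.05, size=40)
mandatory = rng.normal(0.80, 0.05, size=40)
t_stat, t_p = stats.ttest_ind(discretionary, mandatory, equal_var=False)

# One-way ANOVA across three simulated availability groups.
annual = rng.normal(0.85, 0.05, size=30)
multi_year = rng.normal(0.85, 0.05, size=30)
no_year = rng.normal(0.85, 0.05, size=30)
f_stat, anova_p = stats.f_oneway(annual, multi_year, no_year)

print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
print(f"t-test p = {t_p:.3g}, ANOVA p = {anova_p:.3g}")
```

A small p-value from the t-test or ANOVA indicates that the group means likely differ; it does not measure how large or important the difference is.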

✅ Classification:

| Model | Module |
| --- | --- |
| Logistic Regression | `sklearn.linear_model.LogisticRegression` |
| SVM | `sklearn.svm.SVC` |
| Decision Tree | `sklearn.tree.DecisionTreeClassifier` |
| Random Forest | `sklearn.ensemble.RandomForestClassifier` |
| XGBoost Classifier | `xgboost.XGBClassifier` |
| K-Nearest Neighbors | `sklearn.neighbors.KNeighborsClassifier` |
| Gaussian Naive Bayes | `sklearn.naive_bayes.GaussianNB` |
| Extra Trees | `sklearn.ensemble.ExtraTreesClassifier` |
| Bagging | `sklearn.ensemble.BaggingClassifier` |
| AdaBoost | `sklearn.ensemble.AdaBoostClassifier` |

📉 Regression:

| Model | Module |
| --- | --- |
| Linear Regression | `sklearn.linear_model.LinearRegression` |
| Ridge Regression | `sklearn.linear_model.Ridge` |
| Lasso Regression | `sklearn.linear_model.Lasso` |
| ElasticNet | `sklearn.linear_model.ElasticNet` |
| Support Vector Regressor | `sklearn.svm.SVR` |
| Decision Tree Regressor | `sklearn.tree.DecisionTreeRegressor` |
| Random Forest Regressor | `sklearn.ensemble.RandomForestRegressor` |
| Gradient Boosting Regressor | `sklearn.ensemble.GradientBoostingRegressor` |
| XGBoost Regressor | `xgboost.XGBRegressor` |
| K-Nearest Neighbors | `sklearn.neighbors.KNeighborsRegressor` |
| AdaBoost Regressor | `sklearn.ensemble.AdaBoostRegressor` |
| Extra Trees Regressor | `sklearn.ensemble.ExtraTreesRegressor` |

📦 Dependencies

| Package | Description | Link |
| --- | --- | --- |
| numpy | Numerical computing library | numpy.org |
| pandas | Data manipulation and DataFrames | pandas.pydata.org |
| matplotlib | Plotting and visualization | matplotlib.org |
| seaborn | Statistical data visualization | seaborn.pydata.org |
| scikit-learn | ML modeling and metrics | scikit-learn.org |
| xgboost | Gradient boosting framework (optional) | xgboost.readthedocs.io |
| torch | PyTorch deep learning library | pytorch.org |
| tensorflow | End-to-end ML platform | tensorflow.org |
| openai | OpenAI's Python API client | openai-python |
| requests | HTTP requests for API and web access | requests.readthedocs.io |
| PySimpleGUI | GUI framework for desktop apps | pysimplegui.readthedocs.io |
| typing | Type hinting (part of the Python standard library) | typing Docs |
| pyodbc | ODBC database connector | pyodbc GitHub |
| PyMuPDF | PDF document parser (imported as `fitz`) | pymupdf |
| pillow | Image processing library | python-pillow.org |
| openpyxl | Excel file processing | openpyxl Docs |
| soundfile | Read/write sound file formats | pysoundfile |
| sounddevice | Audio I/O interface | sounddevice Docs |
| loguru | Structured, elegant logging | loguru GitHub |
| statsmodels | Statistical tests and regression diagnostics | statsmodels.org |
| python-dotenv | Load environment variables from `.env` (imported as `dotenv`) | python-dotenv GitHub |

📁 Customize Dataset

Replace the dataset ingestion cell with the following, substituting your own file name and target column:

import pandas as pd
df = pd.read_csv("your_dataset.csv")
X = df.drop("target_column", axis=1)
y = df["target_column"]

📊 Outputs

R², MAE, MSE for each model
Bar plots of performance scores
Visual predicted vs. actual scatter charts
Residual error analysis
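
For reference, the three regression metrics can be computed directly from residuals. The sketch below uses a tiny hand-checkable example with NumPy; the values are illustrative, not results from Sake.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

residuals = y_true - y_pred  # [0.5, 0.0, -0.5, -1.0]

mae = np.mean(np.abs(residuals))  # mean absolute error
mse = np.mean(residuals ** 2)     # mean squared error
# R² = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.3f} MSE={mse:.3f} R2={r2:.3f}")
# prints: MAE=0.500 MSE=0.375 R2=0.925
```

These match what `sklearn.metrics.mean_absolute_error`, `mean_squared_error`, and `r2_score` return for the same inputs.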

Disclaimer: Sake is intended for analytical exploration, research, and educational purposes.
It is not an official government product; validate results against authoritative sources before use.

📝 License

Sake is published under the license provided in LICENSE.txt.
