Author: Mohin Hasin
GitHub: mohin-io
Email: mohinhasin999@gmail.com
Project Start Date: October 2025
This project tackles real-world portfolio optimization by addressing limitations in classical mean-variance optimization:
- Integer constraints: Discrete trading units (can't buy fractional shares)
- Transaction costs: Fixed and proportional costs per trade
- Cardinality constraints: Limited number of assets to reduce monitoring overhead
Traditional portfolio optimization assumes frictionless markets and continuous asset allocation. In practice:
- Small funds face minimum lot sizes
- Transaction costs make frequent rebalancing expensive
- Fund managers need concentrated portfolios for operational efficiency
We enhance mixed-integer optimization (MIO) with machine learning:
- Asset Pre-selection: Clustering algorithms identify diverse asset subsets
- Constraint Prediction: ML models predict which constraints will bind
- Guided Search: Genetic algorithms and simulated annealing find near-optimal solutions faster
- Create virtual environment (Python 3.10+)
- Install core dependencies:
  - Data: `yfinance`, `pandas`, `numpy`
  - Optimization: `pyomo`, `pulp`, `cvxpy`
  - ML: `scikit-learn`, `xgboost`
  - Forecasting: `statsmodels`, `arch` (GARCH)
  - Visualization: `matplotlib`, `seaborn`, `plotly`
  - Dashboard: `streamlit`, `dash`
  - API: `fastapi`, `uvicorn`
  - Testing: `pytest`, `pytest-cov`
# Responsibilities:
# - Fetch historical price data (Yahoo Finance API)
# - Support multiple asset classes (stocks, ETFs, crypto)
# - Handle missing data, splits, dividends
# - Cache data locally for reproducibility

Implementation:
- Create `AssetDataLoader` class with methods:
  - `fetch_prices(tickers, start_date, end_date)`
  - `compute_returns(prices, frequency='daily')`
  - `handle_missing_data(method='forward_fill')`
- Store raw data in `data/raw/`
- Visual Output: Price time series plot for sample portfolio
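A dependency-free sketch of what `compute_returns` and `handle_missing_data(method='forward_fill')` might do for a single price series. The helper names `forward_fill` and `log_returns` are hypothetical, and the choice of log-returns is an assumption; the planned loader would more likely operate on pandas DataFrames.

```python
import math
from typing import Optional, Sequence


def forward_fill(prices: Sequence[Optional[float]]) -> list:
    """Replace each missing price (None) with the last observed price."""
    filled, last = [], None
    for price in prices:
        if price is not None:
            last = price
        filled.append(last)  # stays None until the first observation arrives
    return filled


def log_returns(prices: Sequence[float]) -> list:
    """Log-return between consecutive prices: ln(P_t / P_{t-1})."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


# Illustrative (made-up) prices with one gap.
raw = [100.0, None, 102.0, 101.0]
clean = forward_fill(raw)
returns = log_returns(clean)
```

Forward-filling before computing returns makes the gap day show a zero return, which is one reason to cap how many consecutive gaps get filled.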
# Responsibilities:
# - Compute rolling windows for out-of-sample testing
# - Calculate risk factors (size, value, momentum)
# - Generate covariance matrices
# - Handle outliers and winsorization

Implementation:
- Create `DataPreprocessor` class:
  - `compute_rolling_windows(window_size=252)`
  - `calculate_factors()`
  - `winsorize_returns(quantile=0.01)`
- Store processed data in `data/processed/`
- Visual Output: Correlation heatmap, factor exposures
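As an illustration of `winsorize_returns(quantile=0.01)`, the sketch below clips a return series to its empirical 1% and 99% levels. The standalone function `winsorize` and its nearest-rank cut-off rule are assumptions; other quantile conventions would give slightly different bounds.

```python
def winsorize(values, quantile=0.01):
    """Clip values to the [quantile, 1 - quantile] empirical range."""
    ordered = sorted(values)
    n = len(ordered)
    # Nearest-rank cut-offs on the sorted sample (an assumed convention).
    lower = ordered[round(quantile * (n - 1))]
    upper = ordered[round((1 - quantile) * (n - 1))]
    return [min(max(v, lower), upper) for v in values]


# Made-up series: 99 ordinary observations plus one outlier at each end.
series = [-10.0] + [float(i) for i in range(1, 100)] + [1000.0]
clipped = winsorize(series, quantile=0.01)
```

Only the two extreme observations change; everything inside the cut-offs passes through untouched.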
Models to Implement:
- ARIMA: Autoregressive Integrated Moving Average
- VAR: Vector Autoregression (captures cross-asset dynamics)
- ML Baseline: Random Forest Regressor
Implementation:
- Create `ReturnsForecast` class with methods:
  - `fit_arima(order=(1,0,1))`
  - `fit_var(lags=5)`
  - `fit_ml_model(features=['momentum', 'volatility'])`
- Out-of-sample evaluation (train on 80%, test on 20%)
- Visual Output: Predicted vs actual returns scatter plot
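The 80/20 out-of-sample evaluation can be sketched with a chronological split and a simple AR(1) model fitted by least squares. This is a stand-in, not the planned ARIMA(1,0,1): it has no moving-average term, and the function names are hypothetical.

```python
def chronological_split(series, train_frac=0.8):
    """Split without shuffling so the test set lies strictly after the training set."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]


def fit_ar1(train):
    """Least-squares fit of r_t = c + phi * r_{t-1}."""
    x, y = train[:-1], train[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    return mean_y - phi * mean_x, phi


def one_step_forecasts(c, phi, train, test):
    """Predict each test point from the previous actual observation."""
    preds, prev = [], train[-1]
    for actual in test:
        preds.append(c + phi * prev)
        prev = actual
    return preds


# Noise-free toy series generated by r_t = 0.01 + 0.5 * r_{t-1}.
series = [0.1]
for _ in range(9):
    series.append(0.01 + 0.5 * series[-1])

train, test = chronological_split(series)
c, phi = fit_ar1(train)
preds = one_step_forecasts(c, phi, train, test)
```

Plotting `preds` against `test` would give the predicted-vs-actual scatter; on real returns the points will be far noisier than in this noise-free toy.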
Models:
- GARCH(1,1): Generalized Autoregressive Conditional Heteroskedasticity
- EGARCH: Exponential GARCH (asymmetric shocks)
Implementation:
- Create `VolatilityForecast` class:
  - `fit_garch(p=1, q=1)`
  - `fit_egarch(p=1, q=1)`
- Forecast rolling volatility
- Visual Output: Realized vs forecasted volatility time series
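For reference, the GARCH(1,1) conditional variance follows the recursion σ²ₜ = ω + α·r²ₜ₋₁ + β·σ²ₜ₋₁. The sketch below only runs that recursion for parameters that are assumed to be given; estimating ω, α, β (for example with the `arch` package) is not shown, and the function name is hypothetical.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variances for a GARCH(1,1) process with known parameters.

    Returns len(returns) + 1 values; the last one is the one-step-ahead forecast.
    """
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be < 1 for a finite unconditional variance")
    # Start from the unconditional variance omega / (1 - alpha - beta).
    sigma2 = [omega / (1 - alpha - beta)]
    for r in returns:
        sigma2.append(omega + alpha * r ** 2 + beta * sigma2[-1])
    return sigma2


# Made-up parameters and returns, for illustration only.
variances = garch11_variance([0.0, 0.02], omega=1e-5, alpha=0.1, beta=0.8)
```

A quiet day (return 0.0) lets the variance decay toward ω, while the 2% move pushes the next forecast back up, which is the volatility clustering the model is meant to capture.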
Methods:
- Sample Covariance: Baseline
- Ledoit-Wolf Shrinkage: Reduces estimation error
- Factor Models: Fama-French 3-factor
Implementation:
- Create `CovarianceEstimator` class
- Compare condition numbers across methods
- Visual Output: Eigenvalue distribution plot
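To see why shrinkage helps the condition number, the sketch below shrinks a sample covariance toward a scaled identity with a hand-picked intensity `delta`. This is a simplification: Ledoit-Wolf estimates the intensity from the data, whereas here it is fixed by assumption, and the condition-number helper only handles the 2x2 case.

```python
import math


def shrink_to_identity(cov, delta):
    """Return (1 - delta) * cov + delta * mu * I, with mu the average variance."""
    n = len(cov)
    mu = sum(cov[i][i] for i in range(n)) / n
    return [[(1 - delta) * cov[i][j] + (delta * mu if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def condition_number_2x2(m):
    """Largest over smallest eigenvalue of a symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mid = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return (mid + radius) / (mid - radius)


# Two highly correlated assets (made-up numbers): a nearly singular matrix.
sample = [[0.04, 0.038], [0.038, 0.04]]
shrunk = shrink_to_identity(sample, delta=0.2)

cond_sample = condition_number_2x2(sample)
cond_shrunk = condition_number_2x2(shrunk)
```

Shrinkage leaves the average variance unchanged but pulls the eigenvalues together, so `cond_shrunk` comes out well below `cond_sample`.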
Mathematical Model:
maximize: μᵀw - λ·(wᵀΣw) - transaction_costs(w, w_prev)
subject to:
1. Σwᵢ = 1 (budget constraint)
2. wᵢ ∈ {0, l, 2l, ..., u} (integer lots)
3. Σyᵢ ≤ k (cardinality: max k assets)
4. yᵢ ∈ {0,1}, wᵢ ≤ yᵢ (binary indicators)
5. wᵢ ≥ 0 (long-only)
where:
transaction_costs = Σ(fixed_cost·yᵢ + proportional_cost·|wᵢ - w_prev,ᵢ|)
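On a tiny instance the model above can be checked by exhaustive enumeration, which is useful as a test oracle for the real solver. The sketch assumes the budget is divided into `lots` equal lots (so l = 1/lots and u = 1) and sets yᵢ = 1 exactly when wᵢ > 0; the function names and input numbers are illustrative, and this is not how the MIO solver itself works.

```python
from itertools import product


def objective(w, w_prev, mu, sigma, lam, fixed_cost, prop_cost):
    """mu'w - lam * w'Sigma w - transaction costs, following the model above."""
    n = len(w)
    expected_return = sum(mu[i] * w[i] for i in range(n))
    risk = sum(w[i] * sigma[i][j] * w[j] for i in range(n) for j in range(n))
    costs = sum(fixed_cost * (w[i] > 0) + prop_cost * abs(w[i] - w_prev[i])
                for i in range(n))
    return expected_return - lam * risk - costs


def brute_force(mu, sigma, w_prev, lam, k, lots, fixed_cost, prop_cost):
    """Try every way of spreading `lots` equal lots across the assets."""
    best = None
    for counts in product(range(lots + 1), repeat=len(mu)):
        if sum(counts) != lots:             # budget: weights sum to 1
            continue
        if sum(c > 0 for c in counts) > k:  # cardinality: at most k assets held
            continue
        w = [c / lots for c in counts]
        value = objective(w, w_prev, mu, sigma, lam, fixed_cost, prop_cost)
        if best is None or value > best[0]:
            best = (value, w)
    return best


# Made-up inputs for three assets, starting from an all-cash position.
mu = [0.10, 0.08, 0.05]
sigma = [[0.04, 0.01, 0.00], [0.01, 0.02, 0.00], [0.00, 0.00, 0.01]]
w_prev = [0.0, 0.0, 0.0]
best_value, best_w = brute_force(mu, sigma, w_prev, lam=2.5, k=2, lots=4,
                                 fixed_cost=0.001, prop_cost=0.002)
```

The search space grows as (lots + 1)ⁿ, so this only works for a handful of assets, which is exactly why the plan turns to MIO solvers and heuristics.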
Implementation:
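- Create `MIOOptimizer` class using `pyomo` or `pulp`
- Support solvers: CPLEX (if licensed), CBC, GLPK. Note that PuLP models only linear expressions and that CBC and GLPK are MILP solvers, so the quadratic risk term wᵀΣw needs a linearization for them, or a MIQP-capable solver such as CPLEX
- Parameters:
  - `risk_aversion` (λ)
  - `max_assets` (k)
  - `lot_size` (l)
  - `fixed_cost`, `proportional_cost`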
- Visual Output: Efficient frontier with transaction costs
- Implement solver wrappers for multiple backends
- Timeout handling for large problems
- Log solver statistics (iterations, gap, runtime)
Purpose: Pre-select diverse assets to reduce problem size
Algorithms:
- K-Means: Group assets by return correlation
- Hierarchical Clustering: Dendrogram-based selection
Implementation:
- Create `AssetClusterer` class:
  - `fit_kmeans(n_clusters=10)`
  - `fit_hierarchical(linkage='ward')`
  - `select_representatives(n_per_cluster=2)`
- Feature engineering: Use correlation matrix + volatility
- Visual Output: Dendrogram, cluster scatter plot (t-SNE)
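Once cluster labels exist, `select_representatives` has to pick a few assets from each cluster. The plan does not say by what criterion, so the sketch below assumes the lowest-volatility assets are kept; the standalone function signature, tickers, and numbers are illustrative only.

```python
from collections import defaultdict


def select_representatives(labels, volatility, n_per_cluster=2):
    """Keep the n lowest-volatility tickers from each cluster.

    labels: ticker -> cluster id, volatility: ticker -> annualized volatility.
    """
    clusters = defaultdict(list)
    for ticker, cluster_id in labels.items():
        clusters[cluster_id].append(ticker)

    chosen = []
    for cluster_id in sorted(clusters):
        ranked = sorted(clusters[cluster_id], key=lambda t: volatility[t])
        chosen.extend(ranked[:n_per_cluster])
    return chosen


# Made-up cluster assignments and volatilities.
labels = {"AAPL": 0, "MSFT": 0, "GOOGL": 0, "XOM": 1, "CVX": 1}
volatility = {"AAPL": 0.28, "MSFT": 0.24, "GOOGL": 0.30, "XOM": 0.26, "CVX": 0.27}
shortlist = select_representatives(labels, volatility, n_per_cluster=1)
```

With ten clusters and two representatives each, the optimizer would see at most 20 candidates regardless of how large the starting universe is.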
Purpose: Predict which constraints will bind to prune search space
ML Model:
- Random Forest Classifier: Predicts if cardinality/cost constraints are active
- Features: Market volatility, portfolio turnover, asset dispersion
Implementation:
- Train on historical optimization solutions
- Use predictions to initialize heuristic search
- Visual Output: Feature importance plot
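Training on historical optimization solutions first requires labels. One plausible labeling rule, assumed here, is that the cardinality constraint binds when a solution holds exactly `max_assets` positions; the function names and sample solutions are hypothetical.

```python
def cardinality_binds(weights, max_assets, tol=1e-9):
    """True when the solution uses exactly max_assets holdings (constraint is active)."""
    held = sum(1 for w in weights if w > tol)
    return held == max_assets


def build_labels(solutions, max_assets):
    """Turn past weight vectors into 0/1 classifier targets."""
    return [int(cardinality_binds(w, max_assets)) for w in solutions]


# Made-up past solutions for a three-asset universe with k = 2.
past_solutions = [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0], [0.25, 0.0, 0.75]]
labels = build_labels(past_solutions, max_assets=2)
```

Each label would be paired with the features listed above (market volatility, turnover, dispersion) measured at the time the corresponding problem was solved.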
Purpose: Combinatorial search for asset selection
Algorithm:
1. Initialize population of random portfolios
2. Evaluate fitness (Sharpe ratio - costs)
3. Selection (tournament selection)
4. Crossover (blend weights from two parents)
5. Mutation (randomly adjust weights)
6. Repeat for N generations
Implementation:
- Create `GeneticOptimizer` class:
  - `initialize_population(size=100)`
  - `evaluate_fitness()`
  - `evolve(generations=50)`
- ML guidance: Use clustering to seed initial population
- Visual Output: Fitness convergence plot
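The six steps above can be sketched in a few dozen lines. Several details are assumptions rather than the planned design: fitness is return over volatility minus a per-holding cost, a `repair` step enforces the cardinality limit, elitism keeps the best portfolio, and the population is seeded randomly rather than from clusters.

```python
import math
import random


def repair(weights, k):
    """Keep the k largest weights, zero the rest, renormalize to sum to 1."""
    keep = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    clipped = [max(weights[i], 0.0) if i in keep else 0.0 for i in range(len(weights))]
    total = sum(clipped)
    if total == 0.0:  # degenerate case: put everything in the first kept asset
        clipped[keep[0]] = total = 1.0
    return [w / total for w in clipped]


def fitness(weights, mu, sigma, fixed_cost):
    """Return/volatility ratio minus a fixed cost per held asset (assumed form)."""
    n = len(weights)
    ret = sum(mu[i] * weights[i] for i in range(n))
    var = sum(weights[i] * sigma[i][j] * weights[j] for i in range(n) for j in range(n))
    return ret / math.sqrt(var) - fixed_cost * sum(w > 0 for w in weights)


def evolve(mu, sigma, k, size=30, generations=40, fixed_cost=0.01, seed=0):
    rng = random.Random(seed)
    n = len(mu)
    score = lambda w: fitness(w, mu, sigma, fixed_cost)
    population = [repair([rng.random() for _ in range(n)], k) for _ in range(size)]
    history = []
    for _ in range(generations):
        best = max(population, key=score)
        history.append(score(best))
        children = [best]  # elitism: the best portfolio survives unchanged
        while len(children) < size:
            p1 = max(rng.sample(population, 3), key=score)  # tournament selection
            p2 = max(rng.sample(population, 3), key=score)
            a = rng.random()
            child = [a * x + (1 - a) * y for x, y in zip(p1, p2)]  # blend crossover
            i = rng.randrange(n)
            child[i] = max(child[i] + rng.gauss(0.0, 0.1), 0.0)    # mutation
            children.append(repair(child, k))
        population = children
    best = max(population, key=score)
    history.append(score(best))
    return best, history


# Made-up expected returns and a diagonally dominant covariance matrix.
mu = [0.10, 0.08, 0.05, 0.07]
sigma = [[0.040, 0.005, 0.005, 0.005],
         [0.005, 0.030, 0.005, 0.005],
         [0.005, 0.005, 0.020, 0.005],
         [0.005, 0.005, 0.005, 0.050]]
best, history = evolve(mu, sigma, k=2)
```

Because of elitism, `history` never decreases, so it can be plotted directly as the fitness convergence curve.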
Purpose: Escape local optima in non-convex cost landscape
Implementation:
- Create `SimulatedAnnealingOptimizer` class
- Cooling schedule: Exponential decay
- Visual Output: Energy landscape over iterations
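A minimal sketch of simulated annealing with an exponential cooling schedule and the Metropolis acceptance rule, written for energy minimization. The starting temperature, decay rate, and the toy one-dimensional energy function are all assumptions chosen for illustration; a portfolio version would use the negative objective as energy and a weight perturbation as the neighbor move.

```python
import math
import random


def temperature(step, t0=1.0, decay=0.95):
    """Exponential cooling: T_step = t0 * decay ** step."""
    return t0 * decay ** step


def accept(delta, temp, rng):
    """Always accept improvements; accept a worse move with probability exp(-delta / temp)."""
    return delta <= 0 or rng.random() < math.exp(-delta / temp)


def anneal(energy, neighbor, start, steps=200, seed=0):
    rng = random.Random(seed)
    current = best = start
    for step in range(steps):
        candidate = neighbor(current, rng)
        if accept(energy(candidate) - energy(current), temperature(step), rng):
            current = candidate
        if energy(current) < energy(best):
            best = current  # remember the best state ever visited
    return best


# Toy problem: minimize (x - 3)^2 over the integers by stepping +/- 1.
energy = lambda x: (x - 3) ** 2
neighbor = lambda x, rng: x + rng.choice([-1, 1])
start = 0
result = anneal(energy, neighbor, start)
```

Early on, the high temperature lets the search accept uphill moves and leave local optima; as the temperature decays the rule approaches pure hill climbing.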
Rolling Window Strategy:
for each window t:
1. Train forecasting models on data[t-train_window:t]
2. Forecast returns, volatility for period t+1
3. Solve optimization problem
4. Execute trades (simulate slippage)
5. Record performance metrics
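The window arithmetic in the loop above is easy to get off by one. With Python's half-open slices, `data[t - train_window:t]` ends at index `t - 1`, so the first unseen observation sits at index `t`. The generator below makes that explicit; its name and the mean forecast are placeholders, not the planned models.

```python
def rolling_windows(n_obs, train_window, step=1):
    """Yield (training slice, index of the first out-of-sample observation)."""
    for t in range(train_window, n_obs, step):
        yield slice(t - train_window, t), t


# Made-up return series; the "model" is just the training-window mean.
data = [0.01, 0.02, -0.01, 0.03, 0.00, 0.02]
forecasts = {}
for train_slice, t in rolling_windows(len(data), train_window=4):
    train = data[train_slice]
    forecasts[t] = sum(train) / len(train)
```

Each forecast uses only observations strictly before the period it predicts, which is the property that keeps the backtest free of look-ahead bias.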
Implementation:
- Create `Backtester` class:
  - `run_backtest(start_date, end_date, rebalance_freq='monthly')`
  - `calculate_metrics()`: Sharpe, Sortino, max drawdown, turnover
- Transaction cost accounting
- Visual Output: Cumulative returns plot
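Three of the metrics named in `calculate_metrics()` can be sketched as standalone functions. The conventions are assumptions: the Sharpe ratio is annualized by the square root of the number of periods per year, turnover is one-way (half the sum of absolute weight changes), and Sortino is left out for brevity.

```python
import math


def sharpe_ratio(returns, periods_per_year=12, risk_free=0.0):
    """Annualized mean excess return over sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the cumulative wealth curve (a negative number)."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1 + r
        peak = max(peak, wealth)
        worst = min(worst, wealth / peak - 1)
    return worst


def turnover(w_old, w_new):
    """One-way turnover: half the sum of absolute weight changes."""
    return 0.5 * sum(abs(a - b) for a, b in zip(w_old, w_new))


# Made-up monthly returns and a full switch from one asset mix to another.
drawdown = max_drawdown([0.10, -0.50, 0.20])
traded = turnover([0.5, 0.5], [1.0, 0.0])
sharpe = sharpe_ratio([0.01, 0.03])
```

For an after-cost Sharpe ratio, transaction costs need to be deducted from each period's return before these functions are called.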
Strategies to Compare:
- Naïve Mean-Variance: No transaction costs, fractional weights
- Exact MIO Solver: CPLEX/Gurobi with strict optimality
- Equal-Weight Portfolio: 1/N allocation
- ML-Guided Heuristics: Our approach
Metrics:
- Risk-adjusted return (Sharpe ratio)
- Realized transaction costs
- CPU runtime
- Portfolio turnover
Visual Output: Side-by-side performance table + bar charts
Required Plots:
- Price & Returns: Time series of asset prices and log-returns
- Correlation Matrix: Heatmap of asset correlations
- Factor Exposures: Bar chart of size/value/momentum loadings
- Forecasting Performance: Predicted vs actual scatter plots
- Efficient Frontier: Risk-return trade-off with transaction costs
- Portfolio Weights: Stacked area chart over time
- Performance Metrics: Cumulative returns, drawdowns
- Heuristic Convergence: GA/SA fitness over iterations
- Runtime Comparison: Bar chart of solver times
- Cost Analysis: Transaction costs breakdown
Implementation:
- Create reusable plotting functions
- Save all figures to `outputs/figures/` with timestamps
- Use consistent styling (seaborn 'whitegrid')
Interactive Features:
- Sidebar: Adjust risk aversion, max assets, transaction costs
- Tab 1: Data exploration (price charts, correlations)
- Tab 2: Forecasting results (predicted returns, volatility)
- Tab 3: Optimization results (portfolio weights, efficient frontier)
- Tab 4: Backtesting (cumulative returns, metrics table)
- Tab 5: Heuristic comparison (runtime, performance trade-offs)
Implementation:
streamlit run src/visualization/dashboard.py

Endpoints:
- `POST /optimize`: Submit optimization request
  `{ "tickers": ["AAPL", "GOOGL", "MSFT"], "risk_aversion": 2.5, "max_assets": 5, "method": "genetic_algorithm" }`
- `GET /portfolio/{job_id}`: Retrieve optimization results
- `GET /backtest/{portfolio_id}`: Get backtest metrics
Implementation:
- Async task queue (Celery or FastAPI background tasks)
- Input validation with Pydantic models
- Error handling and logging
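The validation rules for `POST /optimize` might look like the sketch below. To stay dependency-free it uses a standard-library dataclass as a stand-in for the planned Pydantic model. The field names come from the example payload, but the specific rules and every method name other than `genetic_algorithm` are assumptions.

```python
from dataclasses import dataclass

# Only "genetic_algorithm" appears in the plan; the other names are hypothetical.
ALLOWED_METHODS = {"genetic_algorithm", "simulated_annealing", "mio"}


@dataclass(frozen=True)
class OptimizeRequest:
    tickers: list
    risk_aversion: float
    max_assets: int
    method: str = "genetic_algorithm"

    def __post_init__(self):
        # Reject requests the optimizer could not act on.
        if not self.tickers:
            raise ValueError("tickers must not be empty")
        if self.risk_aversion <= 0:
            raise ValueError("risk_aversion must be positive")
        if self.max_assets < 1:
            raise ValueError("max_assets must be at least 1")
        if self.method not in ALLOWED_METHODS:
            raise ValueError(f"unknown method: {self.method}")


payload = {"tickers": ["AAPL", "GOOGL", "MSFT"], "risk_aversion": 2.5,
           "max_assets": 5, "method": "genetic_algorithm"}
request = OptimizeRequest(**payload)
```

Rejecting bad input at the boundary keeps invalid parameters from reaching a long-running optimization job.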
Dockerfile:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "src.api.main:app", "--host", "0.0.0.0", "--port", "8000"]

docker-compose.yml:
services:
  api:
    build: .
    ports:
      - "8000:8000"
  dashboard:
    build: .
    command: streamlit run src/visualization/dashboard.py
    ports:
      - "8501:8501"

- `test_data_loader.py`: Verify data fetching
- `test_forecasting.py`: Check model predictions
- `test_optimization.py`: Validate constraint satisfaction
- `test_heuristics.py`: Ensure convergence
Target Coverage: >80%
# Project Title
[Badge: Python Version] [Badge: License] [Badge: Tests Passing]
## 🚀 Quick Start
# Installation and 3-line demo
## 📊 Key Results
[Embed: Performance comparison chart]
[Embed: Portfolio weights visualization]
## 🏗️ Architecture
[Diagram: System components]
## 📈 Methodology
# Brief explanation of MIO + ML approach
## 🔧 Usage
# Code examples
## 📂 Project Structure
# Directory tree
## 🤝 Contributing
## 📄 License

- Baseline Run: 10 assets, 3-year backtest
- Scalability Test: 50+ assets
- Cost Sensitivity: Vary transaction costs
- Constraint Analysis: Cardinality limits (5, 10, 15 assets)
- 10+ high-quality plots (PNG, 300 DPI)
- Each with descriptive filename (e.g., `efficient_frontier_with_costs.png`)
- Captions in `outputs/figures/README.md`

- `docs/PLAN.md` (this file)
- `docs/METHODOLOGY.md`: Mathematical formulations
- `docs/RESULTS.md`: Simulation outcomes
- `README.md`: Project showcase
- Initial setup: "chore: initialize project structure and planning docs"
- Data module: "feat: implement asset data loader with Yahoo Finance integration"
- Preprocessing: "feat: add data preprocessing with factor computation"
- Forecasting (returns): "feat: implement ARIMA and VAR returns forecasting"
- Forecasting (volatility): "feat: add GARCH volatility forecasting"
- Covariance estimation: "feat: implement Ledoit-Wolf covariance estimation"
- MIO core: "feat: build mixed-integer optimization solver with transaction costs"
- Clustering: "feat: add K-Means and hierarchical asset clustering"
- Constraint prediction: "feat: implement ML-based constraint prediction"
- Genetic algorithm: "feat: add genetic algorithm heuristic optimizer"
- Simulated annealing: "feat: implement simulated annealing optimizer"
- Backtesting engine: "feat: create rolling window backtesting framework"
- Benchmarks: "feat: add benchmark strategies for comparison"
- Visualization: "feat: generate static plots for all components"
- Dashboard: "feat: build Streamlit interactive dashboard"
- API: "feat: implement FastAPI optimization service"
- Docker: "chore: add Docker containerization setup"
- Tests: "test: add unit tests with >80% coverage"
- README: "docs: create comprehensive README with visuals and quickstart"
- Final simulation: "feat: run complete backtest and generate all outputs"
git config user.name "mohin-io"
git config user.email "mohinhasin999@gmail.com"

- Push after every 3-5 logical commits
- Ensure all tests pass before pushing
- Tag releases: `v1.0.0-alpha`, `v1.0.0-beta`, `v1.0.0`
- MIO solver handles 50+ assets in <5 minutes
- Heuristics achieve 95%+ of exact solver performance
- ML guidance reduces search time by 30%+
- Backtested Sharpe ratio > 1.0 (after costs)
- README has embedded visuals (no external links)
- All plots have clear titles, labels, legends
- Code is PEP8 compliant and documented
- Dashboard runs smoothly on Streamlit Cloud
- GitHub profile pinned repository
- LinkedIn post with key visualizations
- Portfolio page with live demo link
- Bertsimas & Shioda (2009): "Algorithm for cardinality-constrained quadratic optimization"
- Woodside-Oriakhi et al. (2011): "Heuristic algorithms for the cardinality constrained efficient frontier"
- Ledoit & Wolf (2004): "Honey, I shrunk the sample covariance matrix"
- Advanced Extensions:
  - Multi-period optimization (dynamic programming)
  - Reinforcement learning for adaptive rebalancing
  - Factor-based risk models (Barra)
- Real-World Deployment:
  - Connect to brokerage APIs (Alpaca, Interactive Brokers)
  - Real-time data streams (WebSocket)
  - Production monitoring (Prometheus + Grafana)
- Research Publications:
  - Blog post series on Medium
  - Conference submission (IEEE CIS, ICAIF)
  - Kaggle kernel/competition
Last Updated: October 3, 2025
Status: Planning Phase Complete ✅
Next Phase: Implementation Start (Data Infrastructure)