This project is a comprehensive data science exploration of the European boat market, designed to provide the marketing team of a yacht sales platform with actionable insights to power their weekly newsletter. I conducted an end-to-end data science workflow, moving from business-centric Exploratory Data Analysis (EDA) to advanced predictive modeling and market segmentation.
- Primary Dataset: Sourced from Kaggle (2021). It contains nearly 10,000 listings of yachts and boats across Europe, including technical specifications (material, age, size) and engagement metrics (number of views in the last 7 days).
- Macro-Economic Data: To enrich the analysis, complementary time-series data was pulled from FRED (Federal Reserve Economic Data) via the Quandl API to analyze the Producer Price Index (PPI) trends within the industry.
To identify distinct market segments and buyer profiles, I implemented a robust clustering process:
- Dimensionality Reduction: Used PCA (Principal Component Analysis) to reduce the number of features while preserving 80% of the variance, with the aim of more stable and interpretable clusters.
- Optimization: Applied the Elbow Technique to determine the optimal number of clusters ($k=6$).
- Impact: Identified 6 unique boat segments based on price-to-size ratios, allowing the marketing team to tailor newsletter content for specific owner profiles.
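
The clustering steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the data is synthetic, the four feature columns are hypothetical stand-ins for the listing attributes, and the range of candidate `k` values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the listing features (hypothetical columns such as
# price, length, width, age); the real cleaned dataset is not reproduced here.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
X[:, 1] += 0.8 * X[:, 0]  # correlate two columns so PCA has something to compress

# Standardize so that no single feature dominates the principal components.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction (here 80%).
pca = PCA(n_components=0.80, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

# Elbow technique: record the inertia for each candidate k,
# then look for the point where the curve flattens.
inertias = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_pca)
    inertias[k] = km.inertia_

# Final model with the k reported in this project (k = 6).
labels = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(X_pca)
```

Plotting `inertias` against `k` gives the elbow curve; on the synthetic data above the elbow will not necessarily fall at 6, since that value comes from the real listings.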
I developed models to test if physical boat attributes could forecast user engagement:
- Approach: Progressed from Simple to Multiple Linear Regression using scikit-learn, incorporating Price, Age, and Boat Area ($m^2$).
- Key Finding: With an $R^2$ of 0.025, the linear model explains only about 2.5% of the variance in "Visits". This suggests that engagement is driven by more complex, possibly non-linear factors (such as location or manufacturer prestige) rather than just physical dimensions or price.
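
A multiple linear regression of this kind might look like the sketch below. The data is synthetic and deliberately noisy so that the score comes out low, and the train/test split is an assumption; the notebooks may evaluate the model differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical predictors standing in for Price, Age, and Boat Area (m^2).
X = np.column_stack([
    rng.lognormal(mean=11, sigma=1, size=n),  # price
    rng.integers(0, 50, size=n),              # age in years
    rng.uniform(5, 100, size=n),              # area in m^2
])

# Synthetic "visits" target dominated by noise, so the predictors
# carry very little signal (mimicking a low-R^2 situation).
y = 150 + 0.2 * X[:, 1] + rng.normal(scale=100, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# R^2 is the share of variance in y explained by the model on held-out data.
r2 = r2_score(y_test, model.predict(X_test))
```

An `r2` close to zero means the predictors explain almost none of the variation in the target, which is the situation the key finding describes.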
Analyzed long-term economic trends for the boat industry to help sellers time their listings:
- Testing: Performed the Augmented Dickey-Fuller Test to check for stationarity.
- Transformation: Applied Differencing to stationarize the data, successfully reducing the $p$-value from 0.96 to near zero ($2.59 \times 10^{-16}$), making the series ready for forecasting.
- Analysis: Python (Pandas, NumPy, Scipy).
- Machine Learning: Scikit-Learn (LinearRegression, KMeans, PCA).
- Time-Series: Statsmodels (ADF Test, Decomposition, ACF).
- Visualization: Seaborn, Matplotlib, Scikit-Plot, and Tableau Public.
- Data Wrangling: Standardization, handling missing values, and outlier detection.
- 01 Management: Project Brief and strategic marketing goals.
- 02 Data: Raw datasets, cleaned versions (Excel/CSV), and standardized data for PCA.
- 03 Scripts: Jupyter Notebooks covering the entire pipeline: EDA, Regression (Simple & Multiple), Clustering, and Time-Series Analysis.
- 04 Analysis: Visualizations, Tableau Dashboards, and the final executive report.
🚀 Presentation Tableau Public
*Note: This project was developed as part of a professional Data Analytics certification by CareerFoundry.