Data Engineer · Analytics Pipelines · Sports Data · BI & Visualization
Remote · Corpus Christi, TX
Data Engineer with 5+ years designing and operating analytics pipelines on GCP and AWS, with a focus on sports data infrastructure, real-time event processing, and cloud-based ETL systems.
Built production pipelines handling live MLB game feeds at Sportradar/Synergy Sports. At Vikua, delivered GCP analytical models that cut time-to-insight by 45%, maintained 99.7% pipeline uptime, and reduced cloud compute costs by 18% across 6 client environments.
Currently completing an MIT MicroMasters in Statistical Modeling & Computation. Fluent in English and Spanish.
💡 My focus: turning technical execution into measurable business impact.
- Languages: Python, SQL, Ruby
- Cloud: GCP (BigQuery, Cloud Composer, Cloud Storage), AWS (S3, Redshift), Azure SQL
- ETL & Orchestration: Airflow, dbt, Tray.io, Zapier, REST APIs, Pandas, Terraform
- Streaming: Kafka / Redpanda, Apache Flink (event time, windowing)
- Lakehouse: Apache Iceberg, Parquet, DuckDB
- BI & Viz: Power BI, Metabase, Plotly, Streamlit, ParaView, QGIS
- Sports Data: Statcast, pybaseball, MLB StatsAPI, pitch-by-pitch tracking
- Quality / Ops: Great Expectations, pytest, GitHub Actions, OpenLineage
End-to-end MLB platform: Statcast ingestion → Bronze/Silver/Gold (DuckDB) → XGBoost CSW model (+SHAP) → AI scouting reports backed by Claude → Streamlit dashboard. Live demo · Repo
12-page interactive app on real injury data (Real Madrid 2021–2025): medallion stack, feature store, point-in-time-correct features, drift detection, CI/CD — with explicit epistemic limits. Live demo · Repo
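For readers unfamiliar with point-in-time-correct features, here is a minimal sketch of the idea, not the code from the repo: a feature for a given date may only use observations recorded strictly before that date, so the model never sees the future. The function name `latest_before`, the `workload` history, and its values are hypothetical.

```python
from bisect import bisect_left
from datetime import date


def latest_before(history, as_of):
    """Return the most recent value observed strictly before `as_of`, or None.

    `history` is a list of (observed_on, value) tuples sorted by observed_on.
    """
    dates = [observed_on for observed_on, _ in history]
    # Index of the first observation dated on or after `as_of`;
    # everything to its left is safe to use as a feature.
    idx = bisect_left(dates, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical weekly workload readings for one player (illustrative values).
workload = [
    (date(2024, 1, 5), 310),
    (date(2024, 1, 12), 420),
    (date(2024, 1, 19), 390),
]
```

With this history, `latest_before(workload, date(2024, 1, 12))` returns 310 rather than 420, because the reading taken on the 12th is not yet known when building a feature for that day. A date earlier than every reading returns `None`.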
End-to-end soccer tracking pipeline: ingestion, validation, Parquet transforms, an analytics layer, CI/CD, an Airflow DAG, and Terraform that provisions a 3-layer AWS S3 data lake. Live demo · Repo
PII-safe pipeline unifying heterogeneous sources into a Master User Model in BigQuery (Bronze/Silver/Gold), with SHA-256 hashing and boolean masking, orchestrated via Cloud Composer. Repo
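A rough sketch of the hashing and masking step, using only the Python standard library. It assumes that "boolean masking" means replacing a sensitive value with a flag that records only whether it was present; the field names, the salt handling, and the `pseudonymize` helper are hypothetical and not taken from the repo.

```python
import hashlib

# Hypothetical field lists; the real Master User Model schema is not shown here.
HASHED_FIELDS = ("email",)
MASKED_FIELDS = ("phone",)


def pseudonymize(record: dict, salt: str) -> dict:
    """Return a copy of `record` with PII hashed or reduced to presence flags."""
    out = dict(record)
    for field in HASHED_FIELDS:
        value = out.pop(field, None)
        if value:
            # Normalize first so " Ana@Example.com " and "ana@example.com"
            # produce the same join key across sources.
            normalized = value.strip().lower()
            digest = hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()
        else:
            digest = None
        out[f"{field}_sha256"] = digest
    for field in MASKED_FIELDS:
        # Boolean masking (assumed meaning): keep only whether a value existed.
        out[f"has_{field}"] = bool(out.pop(field, None))
    return out
```

Hashing a normalized value keeps records joinable across sources without storing the raw identifier, while the boolean flag keeps the analytical signal (for example, "has a phone on file") and drops the value itself.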
SQL models + interactive Metabase dashboards tracking income, expenses, and profitability across multiple construction projects. Repo
Dual-path (Flink streaming + dbt/Iceberg batch) engine for pitcher fatigue, bullpen readiness, and matchup leverage, with a reconciliation layer measuring where each architecture wins. Architecture complete; implementation in progress. Repo
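To illustrate the reconciliation idea in plain Python (this is not the Flink or dbt code, and the project itself is still in progress), the sketch below counts pitches per pitcher in event-time tumbling windows and then lists the windows where two result sets disagree. The 60-second window size, the `(pitcher_id, event_ts)` event shape, and all function names are assumptions made for the example.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed window size, for illustration only


def window_start(event_ts: int, size: int = WINDOW_SECONDS) -> int:
    """Start of the tumbling window that contains `event_ts` (epoch seconds)."""
    return event_ts - (event_ts % size)


def pitches_per_window(events):
    """Count pitches per (pitcher_id, window_start).

    Windows are keyed on event time, so arrival order does not change the result.
    """
    counts = Counter()
    for pitcher_id, event_ts in events:
        counts[(pitcher_id, window_start(event_ts))] += 1
    return counts


def reconcile(streaming, batch):
    """Return {key: (streaming_count, batch_count)} for every key that differs."""
    return {
        key: (streaming.get(key, 0), batch.get(key, 0))
        for key in streaming.keys() | batch.keys()
        if streaming.get(key, 0) != batch.get(key, 0)
    }
```

If the streaming path closes a window before a late event arrives, the batch path will include that event and `reconcile` reports the affected window, which is one way to measure where each architecture wins.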
- Sports data infrastructure & real-time event pipelines
- Cloud data architecture & orchestration (Airflow, Terraform, dbt)
- Streaming vs batch trade-offs (Flink, Kafka, Iceberg)
- BI automation, geospatial & scientific visualization
📧 ivanfgruber@gmail.com 🌐 linkedin.com/in/ifrg
"Architecture is not about storing data — it's about how data flows to create value."