A fully containerised batch ETL stack that ingests large-scale Google Analytics 4 data, transforms it with Spark, orchestrates the workflow with Airflow, and lands fact/dimension tables in Postgres for downstream ML/BI analysis.
| Layer | Stack | Highlights |
|---|---|---|
| Data Lake | MinIO (S3-compatible) | Fast, cheap storage for raw/staging Parquet files |
| Transform | Apache Spark (PySpark) | Data modeling implemented in PySpark, optimized for scale |
| Orchestrator | Apache Airflow | Idempotent, Airflow-native DAGs |
| Serve | Postgres | Data warehouse with event, session & user dimensions |
| Dev Ex | Docker Compose, custom images, Dev Container | Reproducible, production-style setup |
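"Idempotent" here means re-running a DAG for the same date replaces that date's partition instead of appending duplicate rows. A minimal pure-Python sketch of the pattern (the actual jobs apply it via Spark writes and Postgres loads; `load_partition` is a hypothetical name):

```python
def load_partition(warehouse: dict, run_date: str, rows: list) -> dict:
    """Idempotent load: replace the partition for run_date, never append to it."""
    # delete-then-insert semantics: a retry for the same date is a no-op
    warehouse[run_date] = list(rows)
    return warehouse

wh = {}
load_partition(wh, "2021-01-01", ["event_a", "event_b"])
load_partition(wh, "2021-01-01", ["event_a", "event_b"])  # retry: no duplicates
```

Because each run targets a single date partition, Airflow can safely retry or backfill any day without corrupting the warehouse.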
┌─────────────┐  Extract   ┌───────────┐  Transform  ┌───────┐  Load   ┌──────────┐
│ External DB │ ─────────► │ MinIO(S3) │ ──────────► │ Spark │ ──────► │ Postgres │
└─────────────┘  (bronze)  └───────────┘  (staging)  └───────┘         └──────────┘
Sample GA4 dataset extracted from the BigQuery public datasets (92 days × ~55,000 rows/day ≈ 5M rows; Parquet table with 68 columns)
| Table | Grain | Purpose |
|---|---|---|
| base_select | raw events | Cleans field names; selects fields by design for downstream use |
| base_ga4_events | cleaned events | Flattens and normalizes records; standardizes and dedups fields |
| stg_ga4_events | keyed events | Adds surrogate keys, sanitizes event parameters, preps for dims |
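GA4 exports store each event's parameters as a repeated list of key/value structs, where the value sits in exactly one typed slot (`string_value`, `int_value`, ...). A minimal pure-Python sketch of the flatten-and-dedup step performed by `base_ga4_events` (the real job does this in PySpark; field names follow the GA4 export schema):

```python
def flatten_event_params(event_params: list) -> dict:
    """Flatten GA4's repeated key/value structs into one flat dict.

    Duplicate keys are deduped: the first occurrence wins.
    """
    flat = {}
    for param in event_params:
        value = param["value"]
        # GA4 populates exactly one of the typed slots; take the non-null one
        typed = next(v for v in (value.get("string_value"),
                                 value.get("int_value"),
                                 value.get("double_value"))
                     if v is not None)
        flat.setdefault(param["key"], typed)
    return flat

params = [
    {"key": "page_title", "value": {"string_value": "Home"}},
    {"key": "engagement_time_msec", "value": {"int_value": 1200}},
    {"key": "page_title", "value": {"string_value": "Duplicate"}},
]
# flatten_event_params(params) → {"page_title": "Home", "engagement_time_msec": 1200}
```

Flattening once in a base model means every downstream staging table can treat parameters as ordinary columns.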
Prereqs: Docker >= 24
# Clone this repo
git clone https://github.com/Creative-Ataraxia/GA4-Analytical-Pipeline.git
# Run the entire stack with Docker Compose
cd GA4-Analytical-Pipeline
docker compose up -d --build
# Copy your own GA4 export parquet into project-root/raw-data/
# Get the exact dataset I used from: https://cloud.google.com/bigquery/public-data
# Once all containers are healthy,
# open the Airflow UI at http://localhost:8080 and manually trigger the DAG 'ga4_batch_dag' to run the pipeline
# Verify row counts in Postgres
docker exec -it postgres psql -U postgres -d ga4_data -c "SELECT COUNT(*) FROM stg_ga4_events;"
cd GA4-Analytical-Pipeline, reopen in the Dev Container, and you'll be in the exact same Airflow 3.0.2 + Python 3.10 + Spark 4.0.0 shell I developed this pipeline in. Try submitting the Spark jobs locally:
spark-submit --master local[4] /opt/spark-apps/staging/stg_ga4_events.py <YYYY-MM-DD> <days:int>
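The two positional arguments are a start date and the number of days to process. A minimal sketch of how a job can turn them into a list of partition dates to iterate over (`parse_window` is a hypothetical name; the repo's actual argument handling may differ):

```python
from datetime import date, timedelta

def parse_window(start: str, days: str) -> list:
    """Turn the CLI args <YYYY-MM-DD> <days:int> into the dates to process."""
    first = date.fromisoformat(start)
    return [first + timedelta(days=i) for i in range(int(days))]

# e.g. spark-submit ... stg_ga4_events.py 2021-01-01 3
# → processes 2021-01-01, 2021-01-02, 2021-01-03
```

Driving the job from an explicit date window keeps runs idempotent and makes backfills a matter of passing an older start date.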
Happy to answer any questions via issues. Connect with me on LinkedIn; my GitHub page
This project is licensed under the MIT License; see the LICENSE file for details.


