Batch ETL Pipeline for Retail Analytics (Google Analytics 4)

Overview

A fully containerised batch ETL stack that ingests ~5M rows of Google Analytics 4 event data, transforms it with Spark, orchestrates the workflow with Airflow, and lands fact/dimension tables in Postgres for downstream ML/BI analysis.

Tech Stack

| Layer | Stack | Highlights |
| --- | --- | --- |
| Data Lake | MinIO (S3-compatible) | Fast & cheap raw/staging Parquet files |
| Transform | Apache Spark (PySpark) | Data modeling implemented in PySpark, optimized for scale |
| Orchestrator | Apache Airflow | Idempotent, Airflow-native DAGs |
| Serve | Postgres | Data warehouse with event, session & user dimensions |
| Dev Ex | Docker Compose, custom images, Dev Container | Ready for production |

Architecture

┌─────────────┐  Extract   ┌────────────┐  Transform  ┌─────────┐   Load   ┌──────────┐
│ External DB │───────────►│ MinIO (S3) │────────────►│  Spark  │─────────►│ Postgres │
└─────────────┘  (bronze)  └────────────┘  (staging)  └─────────┘          └──────────┘

Data Sources

Sample GA4 dataset extracted from the BigQuery public datasets (92 days × ~55,000 rows/day ≈ 5M rows; Parquet table with 68 columns)
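With one export per day, the raw (bronze) layer naturally splits into daily partitions. A minimal sketch of how those object prefixes might be derived: the 2020-11-01 start date matches the public GA4 sample's 92-day window, but the bucket and prefix names (`bronze`, `ga4`) are assumptions for illustration, not taken from this repo.

```python
from datetime import date, timedelta

def partition_keys(start: date, days: int, bucket: str = "bronze", prefix: str = "ga4"):
    """Yield one S3-style object prefix per exported day (hypothetical layout)."""
    for offset in range(days):
        day = start + timedelta(days=offset)
        yield f"s3://{bucket}/{prefix}/date={day.isoformat()}/"

# All 92 daily partitions, starting from the sample dataset's first day
keys = list(partition_keys(date(2020, 11, 1), 92))
```

Date-based prefixes like this keep backfills idempotent: re-running a day overwrites exactly one partition.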

Data Models

| Table | Grain | Purpose |
| --- | --- | --- |
| `base_select` | raw events | Clean field names; select only the fields needed downstream |
| `base_ga4_events` | cleaned events | Flatten & normalize records; standardize and dedup fields |
| `stg_ga4_events` | keyed events | Add surrogate key, sanitize event parameters, prep for dims |
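As a rough illustration of the `stg_ga4_events` step, a surrogate key can be a deterministic hash over an event's natural identifiers, which also makes dedup trivial. This is a plain-Python sketch of that idea only; the actual job is PySpark, and the field names used here are assumptions, not taken from the source.

```python
import hashlib

# Assumed natural-key fields; the real job may key on different columns.
NATURAL_KEY_FIELDS = ("event_date", "event_timestamp", "user_pseudo_id", "event_name")

def surrogate_key(event: dict) -> str:
    """Deterministic surrogate key: SHA-256 over the pipe-joined natural key."""
    natural_key = "|".join(str(event.get(f, "")) for f in NATURAL_KEY_FIELDS)
    return hashlib.sha256(natural_key.encode("utf-8")).hexdigest()

def dedupe(events: list) -> list:
    """Keep the first event per surrogate key, mirroring a dedup-by-key step."""
    seen, out = set(), []
    for event in events:
        key = surrogate_key(event)
        if key not in seen:
            seen.add(key)
            out.append(event)
    return out
```

In Spark the same idea would typically be `sha2(concat_ws("|", ...), 256)` plus `dropDuplicates` on the key column.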

Airflow DAG Screenshot

Airflow DAG

Loaded Postgres Data Screenshot

Postgres Screenshot

Usage

Prereqs: Docker >= 24

# Clone this repo
git clone https://github.com/Creative-Ataraxia/GA4-Analytical-Pipeline.git

# Run the entire stack with Docker Compose
cd GA4-Analytical-Pipeline
docker compose up -d --build

# Copy your own GA4 export parquet into project-root/raw-data/
# The exact dataset I used is available here: https://cloud.google.com/bigquery/public-data

# Once all containers are healthy,
# open the Airflow UI at http://localhost:8080 and manually trigger the 'ga4_batch_dag' DAG to run the pipeline

# Verify row counts in Postgres
docker exec -it postgres psql -U postgres -d ga4_data -c "SELECT COUNT(*) FROM stg_ga4_events;"

Experiment Locally

  1. `cd GA4-Analytical-Pipeline` and reopen in the Dev Container; you'll get the exact Airflow 3.0.2 + Python 3.10 + Spark 4.0.0 shell this pipeline was developed in.
  2. Try submitting the Spark jobs locally: `spark-submit --master local[4] /opt/spark-apps/staging/stg_ga4_events.py <YYYY-MM-DD> <days:int>`
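The `<YYYY-MM-DD> <days:int>` positional arguments suggest each job processes a date window. A hedged sketch of how such arguments might be turned into the dates to process (the argument handling here is an assumption for illustration, not the repo's actual code):

```python
from datetime import date, timedelta

def parse_window(argv: list) -> list:
    """Turn e.g. ['2020-11-01', '3'] into the list of dates the job would process."""
    start = date.fromisoformat(argv[0])  # first positional: start date (YYYY-MM-DD)
    days = int(argv[1])                  # second positional: number of days to process
    return [start + timedelta(days=i) for i in range(days)]
```

Passing the window on the command line keeps runs idempotent: the same arguments always reprocess the same slice of data.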

About

Happy to answer any questions via issues. Connect with me on LinkedIn, or visit my GitHub page.

License

This project is licensed under the MIT License; see the LICENSE file for details.
