This project implements a modern data engineering pipeline for analyzing YouTube trending video data.
It demonstrates the Medallion Architecture (Bronze → Silver → Gold) using:
- Apache Airflow (3.x) → Orchestration & scheduling
- Apache Spark → Scalable ETL transformations
- Local filesystem → Data lake layers (Bronze/Silver/Gold)
- Postgres → Serving layer for analytics
- Streamlit + Altair (via SQLAlchemy) → Interactive BI dashboard
The pipeline ingests raw JSON/CSV datasets, cleans and enriches them, computes derived metrics, and publishes analytics‑ready tables for visualization.
---
Youtube_DE_Project/
│
├── bronze/ # Raw input data (JSON/CSV)
├── silver/ # Cleaned, normalized data
├── gold/ # Aggregated, analytics-ready data
│
├── dags/
│ └── youtube_pipeline_dag.py # Airflow DAG definition
│
├── scripts/
│ ├── json_to_silver.py # Raw JSON → Silver layer
│ ├── csv_to_silver.py # Raw CSV → Silver layer
│ ├── silver_to_gold.py # Silver → Gold transformations
│ ├── gold_to_postgres.py # Load Gold into Postgres
│ ├── dashboard.py # Streamlit + Altair dashboard
│ └── clear_outputs.py # Utility to clear old outputs
│
├── logs/ # Airflow logs
├── airflow.cfg # Airflow config
├── airflow.db # Airflow metadata DB (SQLite for local)
├── postgresql-42.7.4.jar # JDBC driver for Spark → Postgres
├── requirements.txt # Python dependencies
└── venv_spark/ # Virtual environment
- Python 3.10+

  sudo apt update
  sudo apt install -y python3 python3-venv python3-dev
- Java (JDK 11 or higher) → required for PySpark

  sudo apt install -y openjdk-11-jdk
  java -version
- Apache Spark
  For local development, installing pyspark via pip is enough.
  For the full CLI (spark-submit, spark-shell):

  wget https://downloads.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
  tar xvf spark-3.4.1-bin-hadoop3.tgz
  mv spark-3.4.1-bin-hadoop3 ~/spark

  Environment variables for Spark:

  export SPARK_HOME=~/spark
  export PATH=$SPARK_HOME/bin:$PATH
  spark-shell --version
- Apache Airflow (3.x)
  Installed via requirements.txt, but must be initialized manually:

  airflow db migrate
  airflow standalone
- Postgres

  sudo apt install -y postgresql postgresql-contrib libpq-dev
👉 Always check Postgres status before running the pipeline:
sudo systemctl status postgresql
If Postgres is inactive, start it with:
sudo systemctl start postgresql
- Working with Postgres
Log into Postgres shell (psql):
psql -U postgres -d youtube_gold
(If you set a different user, replace postgres with that username.)
List all databases:
\l
Create database:
CREATE DATABASE youtube_gold;
Connect to a database:
\c youtube_gold
List all tables:
\dt
Run a sample query:
SELECT * FROM videos_gold LIMIT 10;
Exit psql:
\q
- Environment Variables for Postgres
  For security, set Postgres credentials as environment variables:

  export PGUSER=postgres
  export PGPASSWORD=your_password
These are automatically picked up by dashboard.py.
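As a rough sketch (not the exact code in dashboard.py), the dashboard can assemble its SQLAlchemy connection URL from these variables; the function name and the localhost:5432 defaults here are illustrative assumptions:

```python
import os

def pg_connection_url(database: str = "youtube_gold") -> str:
    """Build a SQLAlchemy connection URL from PGUSER/PGPASSWORD,
    falling back to common local defaults (illustrative helper)."""
    user = os.environ.get("PGUSER", "postgres")
    password = os.environ.get("PGPASSWORD", "")
    host = os.environ.get("PGHOST", "localhost")
    return f"postgresql+psycopg2://{user}:{password}@{host}:5432/{database}"
```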
- JDBC Driver
  Already included (postgresql-42.7.4.jar). If missing:

  wget https://jdbc.postgresql.org/download/postgresql-42.7.4.jar -P .
git clone https://github.com/Arjun-M-101/Youtube_DE_Project.git
cd Youtube_DE_Project

python3 -m venv venv_spark
source venv_spark/bin/activate
pip install -r requirements.txt

airflow db migrate
airflow standalone

This starts:
- Scheduler
- Webserver (http://localhost:8080)
- Triggerer
- Workers
👉 Dataset link: YouTube Trending Video Dataset on Kaggle https://www.kaggle.com/datasets/datasnaek/youtube-new
- Drop Kaggle YouTube trending .csv files into bronze/raw_statistics/
- Drop category .json files into bronze/raw_statistics_reference/
The clear_outputs.py script can be used to reset the project state. Before running the pipeline from scratch, clear old outputs (if already present):

python scripts/clear_outputs.py

In the Airflow UI, enable and trigger youtube_pipeline.
streamlit run scripts/dashboard.py
Airflow DAG Success Example
Streamlit Dashboard Example
- `bronze/`:
  - Stores raw `.csv` and `.json` files.
  - No transformations, just schema ingest and landing.
- `json_to_silver.py`:
  - Explodes `items` array in JSON.
  - Extracts `id`, `category_name`.
  - Adds `region`.
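For reference, the flattening that json_to_silver.py performs with Spark's explode can be illustrated in plain Python; the abridged JSON below mirrors the shape of the Kaggle category files, and the "US" region value is just a placeholder for the region inferred from the filename:

```python
import json

# Abridged shape of a Kaggle category reference file.
raw = json.loads("""
{
  "items": [
    {"id": "1",  "snippet": {"title": "Film & Animation"}},
    {"id": "10", "snippet": {"title": "Music"}}
  ]
}
""")

# Equivalent of exploding the items array and extracting id / category_name.
categories = [
    {"id": item["id"], "category_name": item["snippet"]["title"], "region": "US"}
    for item in raw["items"]
]
```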
- `csv_to_silver.py`:
  - Casts numeric fields (`views`, `likes`, `comment_count`, etc.).
  - Casts flags to boolean (`comments_disabled`, `ratings_disabled`, etc.).
  - Normalizes `publish_time` → timestamp, `trending_date` → date.
  - Adds `region`.
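A plain-Python illustration of the date normalization, assuming the layouts used by the Kaggle CSVs (`trending_date` as yy.dd.mm, `publish_time` as ISO-8601 with a trailing Z); the actual job uses Spark column functions:

```python
from datetime import datetime, date

def parse_trending_date(raw: str) -> date:
    # trending_date uses a yy.dd.mm layout, e.g. "17.14.11" → 2017-11-14.
    return datetime.strptime(raw, "%y.%d.%m").date()

def parse_publish_time(raw: str) -> datetime:
    # publish_time is ISO-8601 with milliseconds and a trailing Z.
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
```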
- `silver_to_gold.py`:
  - Joins videos with categories.
  - Adds derived metric: `engagement_ratio = (likes + comment_count) / views`.
  - Adds `region`.
  - Unifies all regions into one dataset.
  - Partitioned by `region`.
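The derived metric, shown as a plain-Python helper for clarity; the zero-views guard is an assumption, and the Spark job expresses the same formula with column arithmetic:

```python
def engagement_ratio(likes: int, comment_count: int, views: int) -> float:
    """engagement_ratio = (likes + comment_count) / views,
    guarding against videos with no recorded views."""
    if views == 0:
        return 0.0
    return (likes + comment_count) / views

engagement_ratio(1_000, 250, 50_000)  # → 0.025
```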
- `gold_to_postgres.py`:
  - Loads Gold dataset into Postgres table `videos_gold`.
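A sketch of the JDBC options the loader would pass to Spark's writer; the helper name, host, and port are illustrative assumptions for a default local Postgres install:

```python
def jdbc_options(database: str, table: str, user: str, password: str) -> dict:
    """Options for Spark's JDBC writer (illustrative helper)."""
    return {
        "url": f"jdbc:postgresql://localhost:5432/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }

# Inside the Spark job (needs the bundled postgresql JDBC jar on the classpath):
# gold_df.write.format("jdbc").options(
#     **jdbc_options("youtube_gold", "videos_gold", "postgres", "secret")
# ).mode("overwrite").save()
```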
- `dashboard.py`:
  - Streamlit + Altair visualizations:
    - Top categories by views
    - Views over time
    - Likes vs Comments scatter
    - Engagement ratio distribution
- Demonstrates Medallion Architecture (Bronze → Silver → Gold) using PySpark for batch ETL.
- Uses Airflow 3.x to orchestrate and schedule the full pipeline end‑to‑end.
- Implements Spark → Postgres loading via JDBC as a simple serving layer for analytics.
- Exposes metrics through an interactive Streamlit + Altair dashboard backed by SQL queries.
- Fully reproducible on a single machine, while keeping the design compatible with a future move to S3/EMR/MWAA/RDS‑style cloud services.
- Local filesystem vs cloud storage:
  Chose local folders for Bronze/Silver/Gold to keep the setup lightweight for learning. The same structure can later be mapped one‑to‑one to S3 buckets or prefixes for scalability and durability.
- Postgres vs data warehouse:
  Used local Postgres as a simple serving layer that is easy to install and query from Streamlit. In a real production setup, this would likely be replaced by Redshift, Snowflake, or BigQuery for larger data volumes and more concurrent users.
- Airflow standalone vs managed Airflow:
  Ran Airflow in standalone mode to avoid infrastructure overhead. In production, a managed service like MWAA would be preferred for autoscaling, centralized logging, and better reliability.
- Spark local mode vs Spark cluster:
  Executed Spark jobs in local mode because the Kaggle dataset fits comfortably on a single machine. For true YouTube‑scale data, the same code would be deployed on a Spark cluster (EMR, Databricks, or Kubernetes).
- Streamlit prototype vs enterprise BI:
  Used Streamlit for fast, code‑first dashboarding that is ideal for demos and portfolios. In an organization, the gold tables would typically be consumed by BI tools such as Superset, Tableau, or Power BI for governed, multi‑user reporting.