This project implements a modern data engineering pipeline for analyzing YouTube trending video data.
It demonstrates the Medallion Architecture (Bronze → Silver → Gold) using:
- Apache Airflow (3.x) → Orchestration & scheduling
- Apache Spark → Scalable ETL transformations
- Local filesystem → Data lake layers (Bronze/Silver/Gold)
- Postgres → Serving layer for analytics
- Streamlit + Altair (via SQLAlchemy) → Interactive BI dashboard
The pipeline ingests raw JSON/CSV datasets, cleans and enriches them, computes derived metrics, and publishes analytics‑ready tables for visualization.
---
Youtube_DE_Project/
│
├── bronze/ # Raw input data (JSON/CSV)
├── silver/ # Cleaned, normalized data
├── gold/ # Aggregated, analytics-ready data
│
├── dags/
│ └── youtube_pipeline_dag.py # Airflow DAG definition
│
├── scripts/
│ ├── json_to_silver.py # Raw JSON → Silver layer
│ ├── csv_to_silver.py # Raw CSV → Silver layer
│ ├── silver_to_gold.py # Silver → Gold transformations
│ ├── gold_to_postgres.py # Load Gold into Postgres
│ ├── dashboard.py # Streamlit + Altair dashboard
│ └── clear_outputs.py # Utility to clear old outputs
│
├── logs/ # Airflow logs
├── airflow.cfg # Airflow config
├── airflow.db # Airflow metadata DB (SQLite for local)
├── postgresql-42.7.4.jar # JDBC driver for Spark → Postgres
├── requirements.txt # Python dependencies
└── venv_spark/ # Virtual environment
- Python 3.10+

  sudo apt update
  sudo apt install -y python3 python3-venv python3-dev
- Java (JDK 11 or higher) → required for PySpark

  sudo apt install -y openjdk-11-jdk
  java -version
- Apache Spark
  For local development, installing pyspark via pip is enough.
  For the full CLI (spark-submit, spark-shell):

  wget https://downloads.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
  tar xvf spark-3.4.1-bin-hadoop3.tgz
  mv spark-3.4.1-bin-hadoop3 ~/spark

  Environment variables for Spark:

  export SPARK_HOME=~/spark
  export PATH=$SPARK_HOME/bin:$PATH
  spark-shell --version
- Apache Airflow (3.x)
  Installed via requirements.txt, but must be initialized manually:

  airflow db migrate
  airflow standalone
- Postgres

  sudo apt install -y postgresql postgresql-contrib libpq-dev
👉 Always check Postgres status before running the pipeline:
sudo systemctl status postgresql
If Postgres is inactive, start it with:
sudo systemctl start postgresql
- Working with Postgres
Log into Postgres shell (psql):
psql -U postgres -d youtube_gold
(If you set a different user, replace postgres with that username.)
List all databases:
\l
Create database:
CREATE DATABASE youtube_gold;
Connect to a database:
\c youtube_gold
List all tables:
\dt
Run a sample query:
SELECT * FROM videos_gold LIMIT 10;
Exit psql:
\q
- Environment Variables for Postgres
  For security, set Postgres credentials as environment variables:

  export PGUSER=postgres
  export PGPASSWORD=your_password
These are automatically picked up by dashboard.py.
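As a rough sketch (not the exact code in dashboard.py), the dashboard can assemble its SQLAlchemy connection URL from these variables; the function name and the localhost:5432 defaults here are illustrative assumptions:

```python
import os

def pg_connection_url(database: str = "youtube_gold") -> str:
    """Build a SQLAlchemy connection URL from PGUSER/PGPASSWORD,
    falling back to common local defaults (illustrative helper)."""
    user = os.environ.get("PGUSER", "postgres")
    password = os.environ.get("PGPASSWORD", "")
    host = os.environ.get("PGHOST", "localhost")
    return f"postgresql+psycopg2://{user}:{password}@{host}:5432/{database}"
```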
- JDBC Driver
  Already included (postgresql-42.7.4.jar). If missing:

  wget https://jdbc.postgresql.org/download/postgresql-42.7.4.jar -P .
git clone https://github.com/Arjun-M-101/Youtube_DE_Project.git
cd Youtube_DE_Project

python3 -m venv venv_spark
source venv_spark/bin/activate
pip install -r requirements.txt

airflow db migrate
airflow standalone

This starts:
- Scheduler
- Webserver (http://localhost:8080)
- Triggerer
- Workers
👉 Dataset link: YouTube Trending Video Dataset on Kaggle https://www.kaggle.com/datasets/datasnaek/youtube-new
- Drop Kaggle YouTube trending .csv files into bronze/raw_statistics/
- Drop category .json files into bronze/raw_statistics_reference/
The clear_outputs.py script can be used to reset the project state. Before running the pipeline from scratch, clear old outputs (if already present):

python scripts/clear_outputs.py

In the Airflow UI, enable and trigger youtube_pipeline.
streamlit run scripts/dashboard.py
Airflow DAG Success Example
Streamlit Dashboard Example
- `bronze/`:
  - Stores raw `.csv` and `.json` files.
  - No transformations, just schema ingest and landing.
- `json_to_silver.py`:
  - Explodes `items` array in JSON.
  - Extracts `id`, `category_name`.
  - Adds `region`.
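For reference, the flattening that json_to_silver.py performs with Spark's explode can be illustrated in plain Python; the abridged JSON below mirrors the shape of the Kaggle category files, and the "US" region value is just a placeholder for the region inferred from the filename:

```python
import json

# Abridged shape of a Kaggle category reference file.
raw = json.loads("""
{
  "items": [
    {"id": "1",  "snippet": {"title": "Film & Animation"}},
    {"id": "10", "snippet": {"title": "Music"}}
  ]
}
""")

# Equivalent of exploding the items array and extracting id / category_name.
categories = [
    {"id": item["id"], "category_name": item["snippet"]["title"], "region": "US"}
    for item in raw["items"]
]
```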
- `csv_to_silver.py`:
  - Casts numeric fields (`views`, `likes`, `comment_count`, etc.).
  - Casts flags to boolean (`comments_disabled`, `ratings_disabled`, etc.).
  - Normalizes `publish_time` → timestamp, `trending_date` → date.
  - Adds `region`.
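A plain-Python illustration of the date normalization, assuming the layouts used by the Kaggle CSVs (`trending_date` as yy.dd.mm, `publish_time` as ISO-8601 with a trailing Z); the actual job uses Spark column functions:

```python
from datetime import datetime, date

def parse_trending_date(raw: str) -> date:
    # trending_date uses a yy.dd.mm layout, e.g. "17.14.11" → 2017-11-14.
    return datetime.strptime(raw, "%y.%d.%m").date()

def parse_publish_time(raw: str) -> datetime:
    # publish_time is ISO-8601 with milliseconds and a trailing Z.
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
```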
- `silver_to_gold.py`:
  - Joins videos with categories.
  - Adds derived metric: `engagement_ratio = (likes + comment_count) / views`.
  - Adds `region`.
  - Unifies all regions into one dataset.
  - Partitioned by `region`.
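The derived metric, shown as a plain-Python helper for clarity; the zero-views guard is an assumption, and the Spark job expresses the same formula with column arithmetic:

```python
def engagement_ratio(likes: int, comment_count: int, views: int) -> float:
    """engagement_ratio = (likes + comment_count) / views,
    guarding against videos with no recorded views."""
    if views == 0:
        return 0.0
    return (likes + comment_count) / views

engagement_ratio(1_000, 250, 50_000)  # → 0.025
```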
- `gold_to_postgres.py`:
  - Loads Gold dataset into Postgres table `videos_gold`.
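A sketch of the JDBC options the loader would pass to Spark's writer; the helper name, host, and port are illustrative assumptions for a default local Postgres install:

```python
def jdbc_options(database: str, table: str, user: str, password: str) -> dict:
    """Options for Spark's JDBC writer (illustrative helper)."""
    return {
        "url": f"jdbc:postgresql://localhost:5432/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }

# Inside the Spark job (needs the bundled postgresql JDBC jar on the classpath):
# gold_df.write.format("jdbc").options(
#     **jdbc_options("youtube_gold", "videos_gold", "postgres", "secret")
# ).mode("overwrite").save()
```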
- `dashboard.py`:
  - Streamlit + Altair visualizations:
    - Top categories by views
    - Views over time
    - Likes vs Comments scatter
    - Engagement ratio distribution
- Demonstrates Medallion Architecture (Bronze → Silver → Gold) using PySpark for batch ETL.
- Uses Airflow 3.x to orchestrate and schedule the full pipeline end‑to‑end.
- Implements Spark → Postgres loading via JDBC as a simple serving layer for analytics.
- Exposes metrics through an interactive Streamlit + Altair dashboard backed by SQL queries.
- Fully reproducible on a single machine, while keeping the design compatible with a future move to S3/EMR/MWAA/RDS‑style cloud services.
- Local filesystem vs cloud storage:
  Chose local folders for Bronze/Silver/Gold to keep the setup lightweight for learning. The same structure can later be mapped one‑to‑one to S3 buckets or prefixes for scalability and durability.
- Postgres vs data warehouse:
  Used local Postgres as a simple serving layer that is easy to install and query from Streamlit. In a real production setup, this would likely be replaced by Redshift, Snowflake, or BigQuery for larger data volumes and more concurrent users.
- Airflow standalone vs managed Airflow:
  Ran Airflow in standalone mode to avoid infrastructure overhead. In production, a managed service like MWAA would be preferred for autoscaling, centralized logging, and better reliability.
- Spark local mode vs Spark cluster:
  Executed Spark jobs in local mode because the Kaggle dataset fits comfortably on a single machine. For true YouTube‑scale data, the same code would be deployed on a Spark cluster (EMR, Databricks, or Kubernetes).
- Streamlit prototype vs enterprise BI:
  Used Streamlit for fast, code‑first dashboarding that is ideal for demos and portfolios. In an organization, the gold tables would typically be consumed by BI tools such as Superset, Tableau, or Power BI for governed, multi‑user reporting.