
📊 YouTube Data Engineering Pipeline (Batch Processing)

🚀 Overview

This project implements a modern data engineering pipeline for analyzing YouTube trending video data.
It demonstrates the Medallion Architecture (Bronze → Silver → Gold) using:

  • Apache Airflow (3.x) → Orchestration & scheduling
  • Apache Spark → Scalable ETL transformations
  • Local filesystem → Data lake layers (Bronze/Silver/Gold)
  • Postgres → Serving layer for analytics
  • Streamlit + Altair (via SQLAlchemy) → Interactive BI dashboard

The pipeline ingests raw JSON/CSV datasets, cleans and enriches them, computes derived metrics, and publishes analytics‑ready tables for visualization.


🏗️ Architecture

Architecture diagram: raw JSON/CSV → Bronze → Silver → Gold (local filesystem) → Postgres → Streamlit dashboard.

📂 Project Structure

Youtube_DE_Project/
│
├── bronze/                  # Raw input data (JSON/CSV)
├── silver/                  # Cleaned, normalized data
├── gold/                    # Aggregated, analytics-ready data
│
├── dags/
│   └── youtube_pipeline_dag.py   # Airflow DAG definition
│
├── scripts/
│   ├── json_to_silver.py         # Raw JSON → Silver layer
│   ├── csv_to_silver.py          # Raw CSV → Silver layer
│   ├── silver_to_gold.py         # Silver → Gold transformations
│   ├── gold_to_postgres.py       # Load Gold into Postgres
│   ├── dashboard.py              # Streamlit + Altair dashboard
│   └── clear_outputs.py          # Utility to clear old outputs
│
├── logs/                   # Airflow logs
├── airflow.cfg             # Airflow config
├── airflow.db              # Airflow metadata DB (SQLite for local)
├── postgresql-42.7.4.jar   # JDBC driver for Spark → Postgres
├── requirements.txt        # Python dependencies
└── venv_spark/             # Virtual environment

🛠️ Prerequisites & Installation

System Packages

  • Python 3.10+

    sudo apt update
    sudo apt install -y python3 python3-venv python3-dev
  • Java (JDK 11 or higher) → required for PySpark

    sudo apt install -y openjdk-11-jdk
    java -version
  • Apache Spark
    For local dev, installing pyspark via pip is enough.
    For full CLI (spark-submit, spark-shell):

    wget https://downloads.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
    tar xvf spark-3.4.1-bin-hadoop3.tgz
    mv spark-3.4.1-bin-hadoop3 ~/spark

    Environment Variables for Spark

    export SPARK_HOME=~/spark
    export PATH=$SPARK_HOME/bin:$PATH
    spark-shell --version
  • Apache Airflow (3.x)
    Installed via requirements.txt, but must be initialized manually:

    airflow db migrate
    airflow standalone
  • Postgres

    sudo apt install -y postgresql postgresql-contrib libpq-dev

    👉 Always check Postgres status before running the pipeline:

    sudo systemctl status postgresql

    If Postgres is inactive, start it with:

    sudo systemctl start postgresql
    • Working with Postgres

      Log into the Postgres shell (psql):

      psql -U postgres

      (If you set a different user, replace postgres with that username.)

      List all databases:

      \l

      Create database:

      CREATE DATABASE youtube_gold;

      Connect to a database:

      \c youtube_gold

      List all tables:

      \dt

      Run a sample query:

      SELECT * FROM videos_gold LIMIT 10;

      Exit psql:

      \q
    • Environment Variables for Postgres
      For security, set Postgres credentials as environment variables:

      export PGUSER=postgres
      export PGPASSWORD=your_password

      These are automatically picked up by dashboard.py.
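For illustration, a minimal sketch of how a script like dashboard.py can pick these variables up and build a SQLAlchemy-style connection URL (the function name, host, and port here are assumptions, not the actual code):

```python
import os

def pg_connection_url(database: str = "youtube_gold") -> str:
    """Build a Postgres connection URL from PGUSER/PGPASSWORD env vars."""
    user = os.environ.get("PGUSER", "postgres")
    password = os.environ.get("PGPASSWORD", "")
    return f"postgresql://{user}:{password}@localhost:5432/{database}"
```

Keeping credentials in the environment means nothing sensitive is hard-coded in the repo.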

    • JDBC Driver
      Already included (postgresql-42.7.4.jar). If missing:

      wget https://jdbc.postgresql.org/download/postgresql-42.7.4.jar -P .

⚙️ Setup Instructions

- Clone the repo

git clone https://github.com/Arjun-M-101/Youtube_DE_Project.git
cd Youtube_DE_Project

- Create virtual environment

python3 -m venv venv_spark
source venv_spark/bin/activate
pip install -r requirements.txt

- Initialize Airflow

airflow db migrate
airflow standalone

This starts the Airflow scheduler and web UI in a single process and prints the generated admin login credentials to the console.

- Place raw data

👉 Dataset link: YouTube Trending Video Dataset on Kaggle https://www.kaggle.com/datasets/datasnaek/youtube-new

  • Drop Kaggle YouTube trending .csv files into bronze/raw_statistics/
  • Drop category .json files into bronze/raw_statistics_reference/

- Fresh Setup (optional)

The clear_outputs.py script resets the project state. Before running the pipeline from scratch, clear any old outputs:

python scripts/clear_outputs.py

- Trigger the DAG

In the Airflow UI, enable and trigger youtube_pipeline.

- Launch the dashboard

streamlit run scripts/dashboard.py

- Sample Outputs (Airflow + Dashboard)

Airflow DAG Success Example
Streamlit Dashboard Example

🔄 Data Flow

Bronze Layer (Raw Landing)

  • Stores raw .csv and .json files.
  • No transformations; files are ingested and landed as-is.

Silver Layer (Cleaned & Normalized)

  • json_to_silver.py:
    • Explodes items array in JSON.
    • Extracts id, category_name.
    • Adds region.
  • csv_to_silver.py:
    • Casts numeric fields (views, likes, comment_count, etc.).
    • Casts flags to boolean (comments_disabled, ratings_disabled, etc.).
    • Normalizes publish_time → timestamp, trending_date → date.
    • Adds region.

Gold Layer (Analytics-Ready)

  • silver_to_gold.py:
    • Joins videos with categories.
    • Adds derived metric: engagement_ratio = (likes + comment_count) / views.
    • Adds region.
    • Unifies all regions into one dataset.
    • Partitioned by region.
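The derived metric is simple enough to state directly. In silver_to_gold.py it is a Spark column expression; here it is in plain Python (the zero-view guard is an assumption about how the script handles that edge case):

```python
def engagement_ratio(likes: int, comment_count: int, views: int) -> float:
    """engagement_ratio = (likes + comment_count) / views, guarding views == 0."""
    if views == 0:
        return 0.0
    return (likes + comment_count) / views

# Spark equivalent (sketch):
#   df.withColumn("engagement_ratio",
#                 (F.col("likes") + F.col("comment_count")) / F.col("views"))
```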

Serving Layer

  • gold_to_postgres.py:
    • Loads Gold dataset into Postgres table videos_gold.
  • dashboard.py:
    • Streamlit + Altair visualizations:
      • Top categories by views
      • Views over time
      • Likes vs Comments scatter
      • Engagement ratio distribution
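As a sketch of the Spark → Postgres hand-off, the JDBC settings gold_to_postgres.py would need look roughly like this (URL, credentials, and table name follow this README; the script itself is the source of truth):

```python
# Connection settings for Spark's JDBC writer (values are illustrative).
jdbc_url = "jdbc:postgresql://localhost:5432/youtube_gold"
jdbc_properties = {
    "user": "postgres",           # or the PGUSER value
    "password": "your_password",  # or the PGPASSWORD value
    "driver": "org.postgresql.Driver",
}

# Inside gold_to_postgres.py the write itself would be (sketch):
#   gold_df.write.jdbc(url=jdbc_url, table="videos_gold",
#                      mode="overwrite", properties=jdbc_properties)
```

Spark needs the postgresql-42.7.4.jar driver on its classpath for this write to work.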

✅ Key Takeaways

  • Demonstrates Medallion Architecture (Bronze → Silver → Gold) using PySpark for batch ETL.
  • Uses Airflow 3.x to orchestrate and schedule the full pipeline end‑to‑end.
  • Implements Spark → Postgres loading via JDBC as a simple serving layer for analytics.
  • Exposes metrics through an interactive Streamlit + Altair dashboard backed by SQL queries.
  • Fully reproducible on a single machine, while keeping the design compatible with a future move to S3/EMR/MWAA/RDS‑style cloud services.

⚖️ Trade‑offs & Design Decisions

  • Local filesystem vs cloud storage:
    Chose local folders for Bronze/Silver/Gold to keep the setup lightweight for learning. The same structure can later be mapped one‑to‑one to S3 buckets or prefixes for scalability and durability.

  • Postgres vs data warehouse:
    Used local Postgres as a simple serving layer that is easy to install and query from Streamlit. In a real production setup, this would likely be replaced by Redshift, Snowflake, or BigQuery for larger data volumes and more concurrent users.

  • Airflow standalone vs managed Airflow:
    Ran Airflow in standalone mode to avoid infrastructure overhead. In production, a managed service like MWAA would be preferred for autoscaling, centralized logging, and better reliability.

  • Spark local mode vs Spark cluster:
    Executed Spark jobs in local mode because the Kaggle dataset fits comfortably on a single machine. For true YouTube‑scale data, the same code would be deployed on a Spark cluster (EMR, Databricks, or Kubernetes).

  • Streamlit prototype vs enterprise BI:
    Used Streamlit for fast, code‑first dashboarding that is ideal for demos and portfolios. In an organization, the gold tables would typically be consumed by BI tools such as Superset, Tableau, or Power BI for governed, multi‑user reporting.
