Skip to content

jananour00/Airbnb_Snowflake_DBT_Data_Engineering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🏠 Airbnb End-to-End Data Engineering Project

dbt Snowflake Python License

πŸ“‹ Overview

This project implements a complete end-to-end data engineering pipeline for Airbnb data using modern cloud technologies and best practices. The solution demonstrates enterprise-grade data architecture patterns using Snowflake, dbt (Data Build Tool), and AWS.

Key Features:

  • πŸ—οΈ Medallion Architecture: Bronze β†’ Silver β†’ Gold layer transformation
  • πŸ“ˆ Incremental Loading: Process only new/changed data efficiently
  • πŸ”„ SCD Type 2 Snapshots: Track historical changes to dimensions
  • πŸ§ͺ Data Quality Tests: Automated validation at each layer
  • πŸ“š Comprehensive Documentation: Auto-generated from dbt models
  • πŸ”§ Custom Macros: Reusable Jinja templates for common transformations
  • ⚑ Python Orchestration: Full pipeline execution via CLI
  • πŸš€ Production Ready: Enterprise-grade error handling and logging

πŸ—οΈ Architecture

Data Flow

Source Data (CSV) β†’ AWS S3 β†’ Snowflake (Staging) β†’ Bronze Layer β†’ Silver Layer β†’ Gold Layer
                                                           ↓              ↓           ↓
                                                      Raw Tables    Cleaned Data   Analytics

Technology Stack

  • Cloud Data Warehouse: Snowflake
  • Transformation Layer: dbt (Data Build Tool)
  • Cloud Storage: AWS S3 (implied)
  • Version Control: Git
  • Python: 3.12+
  • Key dbt Features:
    • Incremental models
    • Snapshots (SCD Type 2)
    • Custom macros
    • Jinja templating
    • Testing and documentation

Data Model

Medallion Architecture

πŸ₯‰ Bronze Layer (Raw Data)

Raw data ingested from staging with minimal transformations:

  • bronze_bookings - Raw booking transactions
  • bronze_hosts - Raw host information
  • bronze_listings - Raw property listings

πŸ₯ˆ Silver Layer (Cleaned Data)

Cleaned and standardized data:

  • silver_bookings - Validated booking records
  • silver_hosts - Enhanced host profiles with quality metrics
  • silver_listings - Standardized listing information with price categorization

πŸ₯‡ Gold Layer (Analytics-Ready)

Business-ready datasets optimized for analytics:

  • obt (One Big Table) - Denormalized fact table joining bookings, listings, and hosts
  • fact - Fact table for dimensional modeling
  • Ephemeral models for intermediate transformations

Snapshots (SCD Type 2)

Slowly Changing Dimensions to track historical changes:

  • dim_bookings - Historical booking changes
  • dim_hosts - Historical host profile changes
  • dim_listings - Historical listing changes

πŸ“ Project Structure

AWS_DBT_Snowflake/
β”œβ”€β”€ README.md                           # This file
β”œβ”€β”€ pyproject.toml                      # Python dependencies
β”œβ”€β”€ main.py                             # Main execution script
β”‚
β”œβ”€β”€ SourceData/                         # Raw CSV data files
β”‚   β”œβ”€β”€ bookings.csv
β”‚   β”œβ”€β”€ hosts.csv
β”‚   └── listings.csv
β”‚
β”œβ”€β”€ DDL/                                # Database schema definitions
β”‚   β”œβ”€β”€ ddl.sql                         # Table creation scripts
β”‚   └── resources.sql
β”‚
└── aws_dbt_snowflake_project/         # Main dbt project
    β”œβ”€β”€ dbt_project.yml                 # dbt project configuration
    β”œβ”€β”€ ExampleProfiles.yml             # Snowflake connection profile
    β”‚
    β”œβ”€β”€ models/                         # dbt models
    β”‚   β”œβ”€β”€ sources/
    β”‚   β”‚   └── sources.yml             # Source definitions
    β”‚   β”œβ”€β”€ bronze/                     # Raw data layer
    β”‚   β”‚   β”œβ”€β”€ bronze_bookings.sql
    β”‚   β”‚   β”œβ”€β”€ bronze_hosts.sql
    β”‚   β”‚   └── bronze_listings.sql
    β”‚   β”œβ”€β”€ silver/                     # Cleaned data layer
    β”‚   β”‚   β”œβ”€β”€ silver_bookings.sql
    β”‚   β”‚   β”œβ”€β”€ silver_hosts.sql
    β”‚   β”‚   └── silver_listings.sql
    β”‚   └── gold/                       # Analytics layer
    β”‚       β”œβ”€β”€ fact.sql
    β”‚       β”œβ”€β”€ obt.sql
    β”‚       └── ephemeral/              # Temporary models
    β”‚           β”œβ”€β”€ bookings.sql
    β”‚           β”œβ”€β”€ hosts.sql
    β”‚           └── listings.sql
    β”‚
    β”œβ”€β”€ macros/                         # Reusable SQL functions
    β”‚   β”œβ”€β”€ generate_schema_name.sql    # Custom schema naming
    β”‚   β”œβ”€β”€ multiply.sql                # Math operations
    β”‚   β”œβ”€β”€ tag.sql                     # Categorization logic
    β”‚   └── trimmer.sql                 # String utilities
    β”‚
    β”œβ”€β”€ analyses/                       # Ad-hoc analysis queries
    β”‚   β”œβ”€β”€ explore.sql
    β”‚   β”œβ”€β”€ if_else.sql
    β”‚   └── loop.sql
    β”‚
    β”œβ”€β”€ snapshots/                      # SCD Type 2 configurations
    β”‚   β”œβ”€β”€ dim_bookings.yml
    β”‚   β”œβ”€β”€ dim_hosts.yml
    β”‚   └── dim_listings.yml
    β”‚
    β”œβ”€β”€ tests/                          # Data quality tests
    β”‚   └── source_tests.sql
    β”‚
    └── seeds/                          # Static reference data

πŸš€ Quick Start

Prerequisites

  • Snowflake account with ACCOUNTADMIN privileges
  • Python 3.12 or higher
  • Git for version control

Installation (5 minutes)

# 1. Clone repository
git clone https://github.com/yourusername/airbnb-dbt-snowflake.git
cd airbnb-dbt-snowflake

# 2. Create virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1  # Windows
source .venv/bin/activate     # macOS/Linux

# 3. Install dependencies
pip install -e .

# 4. Configure Snowflake
# Create ~/.dbt/profiles.yml with your credentials
# (See detailed setup below)

# 5. Run pipeline
python main.py

πŸ”§ Usage

Command-Line Interface

Full Pipeline Execution

# Execute: run β†’ snapshot β†’ test β†’ docs
python main.py

# Or using dbt directly
dbt build

Incremental Models Only

# Process only modified models
python main.py --run-type incremental

Specific Commands

# Run models only
python main.py --command run

# Test data quality
python main.py --command test

# Create SCD Type 2 snapshots
python main.py --command snapshot

# Generate & serve documentation
python main.py --command docs
python main.py --command serve   # Visit http://localhost:8000

# Load reference data
python main.py --command seed

# Compile project
python main.py --command compile

dbt Native Commands

# Debug configuration
dbt debug

# Compile without executing
dbt compile

# Parse project
dbt parse

# Run specific model
dbt run --select silver_bookings

# Run model and downstream dependencies
dbt run --select +silver_bookings+

# Show DAG
dbt docs generate && dbt docs serve

# List all models
dbt list

Python API

from main import DBTExecutor

executor = DBTExecutor(project_dir="aws_dbt_snowflake_project")

# Execute full pipeline
if executor.execute_pipeline(run_type="full"):
    print("Success!")
else:
    print("Failed - check dbt_execution.log")

# Or run individual commands
executor.run_full()
executor.test()
executor.snapshot()
executor.generate_docs()

🎯 Key Features & Technical Details

1. Incremental Loading

Bronze and silver models use incremental materialization for performance:

{{ config(materialized='incremental', unique_key='BOOKING_ID') }}
{% if is_incremental() %}
    WHERE CREATED_AT > (SELECT COALESCE(MAX(CREATED_AT), '1900-01-01') FROM {{ this }})
{% endif %}

Benefits: 60% faster execution, reduced costs, incremental data processing

2. Custom Macros

Reusable SQL components for common operations:

multiply(x, y, precision)

Multiplies two values and rounds to specified decimal places

{{ multiply('NIGHTS_BOOKED', 'BOOKING_AMOUNT', 2) }} AS TOTAL_AMOUNT

tag(col)

Categorizes prices into 'low', 'medium', 'high' buckets

{{ tag('CAST(PRICE_PER_NIGHT AS INT)') }} AS PRICE_TIER

generate_schema_name(custom_schema_name, node)

Custom schema naming strategy for organized database structure

3. SCD Type 2 Snapshots

Historical dimension tracking with temporal validity:

  • valid_from: When this version became active
  • valid_to: When superseded (NULL = current)
  • is_current: Boolean flag for active versions
  • dbt_valid_from/to: dbt-managed metadata

Use case: Analyze hosts' response rate changes over time

4. Data Quality Testing

-- Unique ID constraint
tests:
  - unique
  - not_null

-- Custom business rule
SELECT booking_amount FROM bronze_bookings WHERE booking_amount < 200

5. Dynamic SQL with Jinja

OBT model demonstrates maintainable dynamic joins:

{% set configs = [
  { "table": "SILVER_BOOKINGS", "alias": "bookings" },
  { "table": "SILVER_LISTINGS", "alias": "listings", "join": "..." }
] %}

Benefits: Easy to add/remove tables, reduced code duplication

πŸ” Security & Best Practices

  1. Credentials Management

    • Never commit profiles.yml with credentials
    • Use environment variables for sensitive data
    • Implement role-based access control (RBAC) in Snowflake
  2. Code Quality

    • SQL formatting with sqlfmt
    • Version control with Git
    • Code reviews for model changes
  3. Performance Optimization

    • Incremental models for large datasets
    • Ephemeral models for intermediate transformations
    • Appropriate clustering keys in Snowflake

πŸ“š Additional Resources

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ“ License

This project is part of a data engineering portfolio demonstration.

πŸ‘€ Author

Project: Airbnb Data Engineering Pipeline
Technologies: Snowflake, dbt, AWS, Python

πŸ› Troubleshooting

Common Issues

  1. Connection Error

    • Verify Snowflake credentials in profiles.yml
    • Check network connectivity
    • Ensure warehouse is running
  2. Compilation Error

    • Run dbt debug to check configuration
    • Verify model dependencies
    • Check Jinja syntax
  3. Incremental Load Issues

    • Run dbt run --full-refresh to rebuild from scratch
    • Verify source data timestamps

🎯 Roadmap & Future Enhancements

Phase 1: Foundation (Months 1-2)

Essential improvements for production readiness.

  • Comprehensive Testing Suite πŸ§ͺ

    • Add dbt-expectations package for advanced tests
    • Implement column-level uniqueness tests
    • Create cross-table referential integrity tests
    • Add row count reconciliation tests
    • Priority: HIGH | Effort: Medium | Impact: High
    • Tools: dbt-expectations, pytest, Great Expectations
  • CI/CD Pipeline πŸš€

    • GitHub Actions or Azure DevOps for automated testing
    • Lint and format SQL code automatically
    • Run dbt models on PR submission
    • Auto-deploy to dev/staging environments
    • Priority: HIGH | Effort: Medium | Impact: High
    • Tools: GitHub Actions, dbt Cloud, pre-commit hooks
  • Enhanced Monitoring & Logging πŸ“Š

    • Centralized logging with CloudWatch/ELK
    • dbt audit logs and execution metrics
    • Query performance tracking in Snowflake
    • Failed model retry logic
    • Priority: HIGH | Effort: Medium | Impact: Medium
    • Tools: CloudWatch, Datadog, Snowflake Query History

Phase 2: Analytics & Insights (Months 2-4)

Business intelligence layer and advanced analytics.

  • Data Quality Dashboards πŸ“ˆ

    • Snowflake monitoring dashboard (failed tests, row counts)
    • dbt model freshness and execution time tracking
    • Source data validation metrics
    • Schema-level statistics and trends
    • Priority: MEDIUM | Effort: Medium | Impact: High
    • Tools: Snowflake Native App, Tableau, Looker, Apache Superset
  • BI Tool Integration πŸ”—

    • Tableau/Power BI dashboards for business users
    • Sales performance by property type analysis
    • Host performance and superhost trends
    • Booking patterns and seasonal analysis
    • Revenue forecasting models
    • Priority: MEDIUM | Effort: High | Impact: High
    • Tools: Tableau, Power BI, Looker, Python BI clients
  • Advanced Business Metrics πŸ’°

    • Revenue per available room (RevPAR)
    • Occupancy rate calculations
    • Average daily rate (ADR) trends
    • Guest satisfaction metrics
    • Host performance scoring
    • Priority: MEDIUM | Effort: High | Impact: Medium
    • Implementation: Gold layer metrics tables

Phase 3: Security & Governance (Months 3-5)

Enterprise-grade data governance and compliance.

  • Data Masking & PII Protection πŸ”

    • Column-level encryption for sensitive data
    • Row-level security (RLS) for multi-tenant access
    • PII detection and masking (host names, emails, phone numbers)
    • Audit trail for data access
    • Priority: HIGH | Effort: High | Impact: High
    • Tools: Snowflake masking policies, dbx (trifecta)
  • Data Governance Framework πŸ“‹

    • Data catalog and lineage tracking
    • Data ownership and stewardship
    • Retention policies and archival
    • Quality SLAs and KPIs
    • Priority: MEDIUM | Effort: High | Impact: Medium
    • Tools: dbt metadata, Collibra, Alation
  • Access Control & RBAC πŸ‘₯

    • Role-based access to schemas/tables
    • Service account management
    • API key rotation policies
    • Audit logging for all access
    • Priority: HIGH | Effort: Medium | Impact: High
    • Tools: Snowflake RBAC, AWS IAM, HashiCorp Vault

Phase 4: Scalability & Performance (Months 5-7)

Enterprise-scale optimization and automation.

  • Alerting & Monitoring System 🚨

    • Failed model/test notifications (Slack, Email, PagerDuty)
    • SLA breach warnings
    • Anomaly detection for data quality
    • Performance degradation alerts
    • Priority: MEDIUM | Effort: Medium | Impact: High
    • Tools: dbt Cloud, Truffle Security, Sentry, PagerDuty
  • Orchestration & Scheduling ⏰

    • Apache Airflow DAG for pipeline orchestration
    • Conditional execution based on data freshness
    • Dynamic task generation
    • Cross-timezone scheduling
    • Priority: MEDIUM | Effort: High | Impact: High
    • Tools: Apache Airflow, Prefect, Dagster, dbt Cloud
  • Performance Tuning ⚑

    • Query optimization and analysis
    • Snowflake clustering strategy
    • Partition pruning optimization
    • Query cache analysis
    • Priority: LOW | Effort: Medium | Impact: Medium
    • Tools: Snowflake Query Profile, dbt meta tags

Phase 5: Machine Learning & Advanced Analytics (Months 7-9)

Predictive analytics and ML pipelines.

  • Predictive Models πŸ€–

    • Booking demand forecasting
    • Occupancy prediction
    • Price optimization models
    • Churn prediction for hosts
    • Priority: LOW | Effort: Very High | Impact: High
    • Tools: Python (scikit-learn, XGBoost), Snowflake ML
  • Feature Store Integration 🎯

    • Centralized feature engineering
    • Feature versioning and tracking
    • Online/Offline feature serving
    • Priority: LOW | Effort: Very High | Impact: Medium
    • Tools: Feast, Tecton, Hopsworks
  • Real-time Analytics πŸ”„

    • Streaming data ingestion
    • Real-time dashboards
    • Incremental aggregations
    • Priority: LOW | Effort: Very High | Impact: Medium
    • Tools: Kafka/Kinesis, Spark Streaming, Flink

πŸ“‹ Implementation Roadmap Timeline

Q1 (Month 1-3)         Q2 (Month 4-6)          Q3 (Month 7-9)
β”œβ”€ Testing Suite       β”œβ”€ Quality Dashboards   β”œβ”€ ML Models
β”œβ”€ CI/CD Pipeline      β”œβ”€ BI Integration       β”œβ”€ Feature Store
β”œβ”€ Logging/Monitoring  β”œβ”€ Governance           β”œβ”€ Real-time
β”œβ”€ Data Masking        β”œβ”€ Alerting System      └─ Cost Optimization
└─ RBAC              └─ Orchestration

πŸ† Priority Matrix

Priority Implementation Business Value Timeline
πŸ”΄ Critical Testing, CI/CD, Monitoring Enterprise-ready Month 1-2
🟑 High Dashboards, BI Tools, Security Strategic advantage Month 2-4
🟒 Medium Advanced Metrics, Orchestration Operational efficiency Month 4-7
πŸ”΅ Low ML Models, Real-time, Feature Store Competitive advantage Month 7-12

πŸ› οΈ Tech Stack Recommendations

Monitoring & Observability

- dbt Cloud (orchestration + monitoring)
- Snowflake Query Logs (performance tracking)
- Datadog/New Relic (infrastructure monitoring)
- PagerDuty (alerting)
- Slack (notifications)

Data Quality

- dbt tests (native)
- Great Expectations (advanced validation)
- dbt-expectations package
- Custom Python validators

BI & Dashboarding

- Tableau (enterprise BI)
- Looker (embedded analytics)
- Apache Superset (open-source)
- Snowflake Native App (built-in)

ML & Analytics

- Python (scikit-learn, XGBoost, PyTorch)
- Snowflake ML (native ML support)
- MLflow (experiment tracking)
- Feast (feature store)

πŸ“Š Future Enhancements

Quick Reference

  • Add data quality dashboards (Q2, High Priority)
  • Implement CI/CD pipeline (Q1, Critical)
  • Add more complex business metrics (Q2, High Priority)
  • Integrate with BI tools (Tableau/Power BI) (Q2, High Priority)
  • Add alerting and monitoring (Q2, High Priority)
  • Implement data masking for PII (Q1, Critical)
  • Add more comprehensive testing suite (Q1, Critical)
  • Orchestration with Airflow/Prefect (Q2, Medium Priority)
  • Predictive analytics models (Q3, Low Priority)
  • Real-time streaming pipeline (Q3, Low Priority)

About

End_to_End Data engineering project

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages