This project implements a complete end-to-end data engineering pipeline for Airbnb data using modern cloud technologies and best practices. The solution demonstrates enterprise-grade data architecture patterns using Snowflake, dbt (Data Build Tool), and AWS.
Key Features:
- Medallion Architecture: Bronze → Silver → Gold layer transformation
- Incremental Loading: Process only new/changed data efficiently
- SCD Type 2 Snapshots: Track historical changes to dimensions
- Data Quality Tests: Automated validation at each layer
- Comprehensive Documentation: Auto-generated from dbt models
- Custom Macros: Reusable Jinja templates for common transformations
- Python Orchestration: Full pipeline execution via CLI
- Production Ready: Enterprise-grade error handling and logging
Source Data (CSV) → AWS S3 → Snowflake (Staging) → Bronze Layer → Silver Layer → Gold Layer
                                                        ↓              ↓             ↓
                                                   Raw Tables    Cleaned Data    Analytics
- Cloud Data Warehouse: Snowflake
- Transformation Layer: dbt (Data Build Tool)
- Cloud Storage: AWS S3 (implied)
- Version Control: Git
- Python: 3.12+
- Key dbt Features:
- Incremental models
- Snapshots (SCD Type 2)
- Custom macros
- Jinja templating
- Testing and documentation
Raw data ingested from staging with minimal transformations:
- `bronze_bookings` - Raw booking transactions
- `bronze_hosts` - Raw host information
- `bronze_listings` - Raw property listings
Cleaned and standardized data:
- `silver_bookings` - Validated booking records
- `silver_hosts` - Enhanced host profiles with quality metrics
- `silver_listings` - Standardized listing information with price categorization
Business-ready datasets optimized for analytics:
- `obt` (One Big Table) - Denormalized fact table joining bookings, listings, and hosts
- `fact` - Fact table for dimensional modeling
- Ephemeral models for intermediate transformations
Slowly Changing Dimensions to track historical changes:
- `dim_bookings` - Historical booking changes
- `dim_hosts` - Historical host profile changes
- `dim_listings` - Historical listing changes
AWS_DBT_Snowflake/
├── README.md                          # This file
├── pyproject.toml                     # Python dependencies
├── main.py                            # Main execution script
│
├── SourceData/                        # Raw CSV data files
│   ├── bookings.csv
│   ├── hosts.csv
│   └── listings.csv
│
├── DDL/                               # Database schema definitions
│   ├── ddl.sql                        # Table creation scripts
│   └── resources.sql
│
└── aws_dbt_snowflake_project/         # Main dbt project
    ├── dbt_project.yml                # dbt project configuration
    ├── ExampleProfiles.yml            # Snowflake connection profile
    │
    ├── models/                        # dbt models
    │   ├── sources/
    │   │   └── sources.yml            # Source definitions
    │   ├── bronze/                    # Raw data layer
    │   │   ├── bronze_bookings.sql
    │   │   ├── bronze_hosts.sql
    │   │   └── bronze_listings.sql
    │   ├── silver/                    # Cleaned data layer
    │   │   ├── silver_bookings.sql
    │   │   ├── silver_hosts.sql
    │   │   └── silver_listings.sql
    │   └── gold/                      # Analytics layer
    │       ├── fact.sql
    │       ├── obt.sql
    │       └── ephemeral/             # Temporary models
    │           ├── bookings.sql
    │           ├── hosts.sql
    │           └── listings.sql
    │
    ├── macros/                        # Reusable SQL functions
    │   ├── generate_schema_name.sql   # Custom schema naming
    │   ├── multiply.sql               # Math operations
    │   ├── tag.sql                    # Categorization logic
    │   └── trimmer.sql                # String utilities
    │
    ├── analyses/                      # Ad-hoc analysis queries
    │   ├── explore.sql
    │   ├── if_else.sql
    │   └── loop.sql
    │
    ├── snapshots/                     # SCD Type 2 configurations
    │   ├── dim_bookings.yml
    │   ├── dim_hosts.yml
    │   └── dim_listings.yml
    │
    ├── tests/                         # Data quality tests
    │   └── source_tests.sql
    │
    └── seeds/                         # Static reference data
- Snowflake account with ACCOUNTADMIN privileges
- Python 3.12 or higher
- Git for version control
# 1. Clone repository
git clone https://github.com/yourusername/airbnb-dbt-snowflake.git
cd airbnb-dbt-snowflake
# 2. Create virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1 # Windows
source .venv/bin/activate # macOS/Linux
# 3. Install dependencies
pip install -e .
# 4. Configure Snowflake
# Create ~/.dbt/profiles.yml with your credentials
# (See detailed setup below)
# 5. Run pipeline
python main.py

# Execute: run → snapshot → test → docs
python main.py
# Or using dbt directly
dbt build

# Process only modified models
python main.py --run-type incremental

# Run models only
python main.py --command run
# Test data quality
python main.py --command test
# Create SCD Type 2 snapshots
python main.py --command snapshot
# Generate & serve documentation
python main.py --command docs
python main.py --command serve # Visit http://localhost:8000
# Load reference data
python main.py --command seed
# Compile project
python main.py --command compile

# Debug configuration
dbt debug
# Compile without executing
dbt compile
# Parse project
dbt parse
# Run specific model
dbt run --select silver_bookings
# Run model together with its upstream and downstream dependencies
dbt run --select +silver_bookings+
# Show DAG
dbt docs generate && dbt docs serve
# List all models
dbt list

from main import DBTExecutor

executor = DBTExecutor(project_dir="aws_dbt_snowflake_project")

# Execute full pipeline
if executor.execute_pipeline(run_type="full"):
    print("Success!")
else:
    print("Failed - check dbt_execution.log")

# Or run individual commands
executor.run_full()
executor.test()
executor.snapshot()
executor.generate_docs()

Bronze and silver models use incremental materialization for performance:
{{ config(materialized='incremental', unique_key='BOOKING_ID') }}
{% if is_incremental() %}
WHERE CREATED_AT > (SELECT COALESCE(MAX(CREATED_AT), '1900-01-01') FROM {{ this }})
{% endif %}

Benefits: 60% faster execution, reduced costs, incremental data processing
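The incremental filter above can be read as a high-water-mark pattern. The following is a minimal Python sketch of that idea, not the project's implementation: the function names `incremental_rows` and `merge_on_key` are hypothetical, rows are plain dictionaries, and the upsert step assumes the warehouse merges on `unique_key`.

```python
from datetime import datetime


def incremental_rows(source_rows, target_rows):
    """Return source rows newer than the newest CREATED_AT already in the target."""
    # Mirrors COALESCE(MAX(CREATED_AT), '1900-01-01'): an empty target
    # falls back to a very old date, so every source row qualifies.
    high_water_mark = max(
        (row["CREATED_AT"] for row in target_rows),
        default=datetime(1900, 1, 1),
    )
    return [row for row in source_rows if row["CREATED_AT"] > high_water_mark]


def merge_on_key(target_rows, new_rows, unique_key="BOOKING_ID"):
    """Upsert new rows into the target (assumed merge behaviour for unique_key)."""
    merged = {row[unique_key]: row for row in target_rows}
    merged.update({row[unique_key]: row for row in new_rows})
    return list(merged.values())
```

On the first run the target is empty and all rows load; on later runs only rows with a newer `CREATED_AT` pass the filter. Rows that arrive late with an older timestamp would be skipped by this pattern, which is one reason a periodic full refresh can be useful.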
Reusable SQL components for common operations:
multiply: Multiplies two values and rounds to the specified number of decimal places
{{ multiply('NIGHTS_BOOKED', 'BOOKING_AMOUNT', 2) }} AS TOTAL_AMOUNT

tag: Categorizes prices into 'low', 'medium', and 'high' buckets
{{ tag('CAST(PRICE_PER_NIGHT AS INT)') }} AS PRICE_TIER

generate_schema_name: Custom schema naming strategy for an organized database structure
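The multiply and tag macros expand to SQL expressions at compile time. As a rough illustration of the values those expressions are meant to produce, here is a small Python sketch. The bucket thresholds (`low_max=100`, `medium_max=200`) are assumptions chosen for the example; the real cut-offs live in `macros/tag.sql` and are not shown in this README.

```python
def multiply(a, b, precision=2):
    """Multiply two values and round the product to `precision` decimals."""
    return round(a * b, precision)


def tag(price, low_max=100, medium_max=200):
    """Bucket a price as 'low', 'medium', or 'high'.

    The thresholds are hypothetical defaults, not the macro's actual values.
    """
    if price < low_max:
        return "low"
    if price < medium_max:
        return "medium"
    return "high"


# Example: three nights at 120.50 per night, and the tier for that nightly price.
total_amount = multiply(3, 120.5)
price_tier = tag(int(120.5))
```

Note that Python's `round` can differ from SQL `ROUND` on exact ties, so this sketch is for intuition only.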
Historical dimension tracking with temporal validity:
- valid_from: When this version became active
- valid_to: When superseded (NULL = current)
- is_current: Boolean flag for active versions
- dbt_valid_from/to: dbt-managed metadata
Use case: Analyze hosts' response rate changes over time
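To make the SCD Type 2 mechanics concrete, here is a minimal Python sketch of one snapshot run. It is an illustration under stated assumptions, not how dbt implements snapshots internally: `apply_scd2` is a hypothetical helper, and the column names `HOST_ID` and `RESPONSE_RATE` in the usage note are assumed for the example.

```python
from datetime import datetime


def apply_scd2(history, incoming, key, tracked, now):
    """Return a new history list after applying one snapshot run.

    history:  existing versions with valid_from / valid_to / is_current
    incoming: current source records
    key:      business key column
    tracked:  columns whose change opens a new version
    """
    result = [dict(row) for row in history]  # copy so the input is untouched
    current = {row[key]: row for row in result if row["valid_to"] is None}
    for record in incoming:
        existing = current.get(record[key])
        if existing is not None and all(existing[c] == record[c] for c in tracked):
            continue  # nothing changed, keep the current version open
        if existing is not None:
            # Close the superseded version.
            existing["valid_to"] = now
            existing["is_current"] = False
        # Open a new current version.
        result.append(
            {**record, "valid_from": now, "valid_to": None, "is_current": True}
        )
    return result
```

For example, applying a record with `RESPONSE_RATE` 90 and later one with 95 for the same `HOST_ID` yields two rows: the first closed at the time of the second run, the second open with `valid_to` set to `None`. Querying the closed and open rows in order gives the host's response rate over time.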
# Unique ID constraint
tests:
  - unique
  - not_null

-- Custom business rule
SELECT booking_amount FROM bronze_bookings WHERE booking_amount < 200

OBT model demonstrates maintainable dynamic joins:
{% set configs = [
  { "table": "SILVER_BOOKINGS", "alias": "bookings" },
  { "table": "SILVER_LISTINGS", "alias": "listings", "join": "..." }
] %}

Benefits: Easy to add/remove tables, reduced code duplication
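The same config-driven idea can be sketched in Python to show how a list of table entries turns into a join chain. This is an illustration only: `build_obt_sql` is a hypothetical helper, the `LEFT JOIN` type is an assumption, and the join conditions in `EXAMPLE_CONFIGS` are made up for the example because the README elides them.

```python
def build_obt_sql(configs):
    """Build a SELECT that joins every configured table onto the first one."""
    base, *joined = configs
    lines = ["SELECT *", f"FROM {base['table']} AS {base['alias']}"]
    for cfg in joined:
        # Each later entry carries its own join condition.
        lines.append(
            f"LEFT JOIN {cfg['table']} AS {cfg['alias']} ON {cfg['join']}"
        )
    return "\n".join(lines)


# Hypothetical join conditions, assumed for illustration.
EXAMPLE_CONFIGS = [
    {"table": "SILVER_BOOKINGS", "alias": "bookings"},
    {
        "table": "SILVER_LISTINGS",
        "alias": "listings",
        "join": "bookings.LISTING_ID = listings.LISTING_ID",
    },
    {
        "table": "SILVER_HOSTS",
        "alias": "hosts",
        "join": "listings.HOST_ID = hosts.HOST_ID",
    },
]

if __name__ == "__main__":
    print(build_obt_sql(EXAMPLE_CONFIGS))
```

Adding a table means appending one dictionary to the list; the loop that emits the joins does not change.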
- Credentials Management
  - Never commit `profiles.yml` with credentials
  - Use environment variables for sensitive data
  - Implement role-based access control (RBAC) in Snowflake
- Code Quality
  - SQL formatting with `sqlfmt`
  - Version control with Git
  - Code reviews for model changes
- Performance Optimization
  - Incremental models for large datasets
  - Ephemeral models for intermediate transformations
  - Appropriate clustering keys in Snowflake
- dbt Documentation: https://docs.getdbt.com/
- Snowflake Documentation: https://docs.snowflake.com/
- dbt Best Practices: https://docs.getdbt.com/guides/best-practices
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is part of a data engineering portfolio demonstration.
Project: Airbnb Data Engineering Pipeline
Technologies: Snowflake, dbt, AWS, Python
- Connection Error
  - Verify Snowflake credentials in `profiles.yml`
  - Check network connectivity
  - Ensure warehouse is running
- Compilation Error
  - Run `dbt debug` to check configuration
  - Verify model dependencies
  - Check Jinja syntax
- Incremental Load Issues
  - Run `dbt run --full-refresh` to rebuild from scratch
  - Verify source data timestamps
Essential improvements for production readiness.
- Comprehensive Testing Suite
  - Add `dbt-expectations` package for advanced tests
  - Implement column-level uniqueness tests
  - Create cross-table referential integrity tests
  - Add row count reconciliation tests
  - Priority: HIGH | Effort: Medium | Impact: High
  - Tools: dbt-expectations, pytest, Great Expectations
- CI/CD Pipeline
  - GitHub Actions or Azure DevOps for automated testing
  - Lint and format SQL code automatically
  - Run dbt models on PR submission
  - Auto-deploy to dev/staging environments
  - Priority: HIGH | Effort: Medium | Impact: High
  - Tools: GitHub Actions, dbt Cloud, pre-commit hooks
- Enhanced Monitoring & Logging
  - Centralized logging with CloudWatch/ELK
  - dbt audit logs and execution metrics
  - Query performance tracking in Snowflake
  - Failed model retry logic
  - Priority: HIGH | Effort: Medium | Impact: Medium
  - Tools: CloudWatch, Datadog, Snowflake Query History
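The planned row count reconciliation and referential integrity tests could take roughly the following shape. This is a hedged Python sketch of the checks themselves, assuming rows are available as dictionaries; `reconcile_row_counts` and `orphaned_keys` are hypothetical names, and in the dbt project these checks would more likely be written as SQL tests.

```python
def reconcile_row_counts(source_count, target_count, tolerance=0):
    """Return True when source and target row counts agree within a tolerance."""
    return abs(source_count - target_count) <= tolerance


def orphaned_keys(child_rows, parent_rows, key):
    """Return child key values that have no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows if row[key] not in parent_keys})


# Example: every listing should point at an existing host (assumed column name).
hosts = [{"HOST_ID": 1}, {"HOST_ID": 2}]
listings = [{"HOST_ID": 1}, {"HOST_ID": 3}]
missing_hosts = orphaned_keys(listings, hosts, "HOST_ID")
```

An empty result from `orphaned_keys` corresponds to a passing referential integrity test; any returned key identifies a row to investigate.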
Business intelligence layer and advanced analytics.
- Data Quality Dashboards
  - Snowflake monitoring dashboard (failed tests, row counts)
  - dbt model freshness and execution time tracking
  - Source data validation metrics
  - Schema-level statistics and trends
  - Priority: MEDIUM | Effort: Medium | Impact: High
  - Tools: Snowflake Native App, Tableau, Looker, Apache Superset
- BI Tool Integration
  - Tableau/Power BI dashboards for business users
  - Sales performance by property type analysis
  - Host performance and superhost trends
  - Booking patterns and seasonal analysis
  - Revenue forecasting models
  - Priority: MEDIUM | Effort: High | Impact: High
  - Tools: Tableau, Power BI, Looker, Python BI clients
- Advanced Business Metrics
  - Revenue per available room (RevPAR)
  - Occupancy rate calculations
  - Average daily rate (ADR) trends
  - Guest satisfaction metrics
  - Host performance scoring
  - Priority: MEDIUM | Effort: High | Impact: Medium
  - Implementation: Gold layer metrics tables
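For reference, the three hospitality metrics listed above follow standard definitions: occupancy is nights sold over nights available, ADR is revenue over nights sold, and RevPAR is revenue over nights available (equivalently ADR × occupancy). The Python sketch below illustrates them under the assumption that "rooms" map to listing-nights in this dataset; the function names are hypothetical and the Gold layer would compute these in SQL.

```python
def occupancy_rate(nights_booked, nights_available):
    """Share of available listing-nights that were booked."""
    if nights_available <= 0:
        raise ValueError("nights_available must be positive")
    return nights_booked / nights_available


def average_daily_rate(revenue, nights_booked):
    """ADR: revenue earned per booked night (0.0 when nothing was booked)."""
    return revenue / nights_booked if nights_booked else 0.0


def revpar(revenue, nights_available):
    """RevPAR: revenue earned per available night."""
    if nights_available <= 0:
        raise ValueError("nights_available must be positive")
    return revenue / nights_available


# Example: 9,000 in revenue from 60 booked nights out of 100 available.
occ = occupancy_rate(60, 100)
adr = average_daily_rate(9000, 60)
rev = revpar(9000, 100)
```

Because RevPAR equals ADR multiplied by occupancy, computing all three gives a simple consistency check for a metrics table.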
Enterprise-grade data governance and compliance.
- Data Masking & PII Protection
  - Column-level encryption for sensitive data
  - Row-level security (RLS) for multi-tenant access
  - PII detection and masking (host names, emails, phone numbers)
  - Audit trail for data access
  - Priority: HIGH | Effort: High | Impact: High
  - Tools: Snowflake masking policies, dbx (trifecta)
- Data Governance Framework
  - Data catalog and lineage tracking
  - Data ownership and stewardship
  - Retention policies and archival
  - Quality SLAs and KPIs
  - Priority: MEDIUM | Effort: High | Impact: Medium
  - Tools: dbt metadata, Collibra, Alation
- Access Control & RBAC
  - Role-based access to schemas/tables
  - Service account management
  - API key rotation policies
  - Audit logging for all access
  - Priority: HIGH | Effort: Medium | Impact: High
  - Tools: Snowflake RBAC, AWS IAM, HashiCorp Vault
Enterprise-scale optimization and automation.
- Alerting & Monitoring System
  - Failed model/test notifications (Slack, Email, PagerDuty)
  - SLA breach warnings
  - Anomaly detection for data quality
  - Performance degradation alerts
  - Priority: MEDIUM | Effort: Medium | Impact: High
  - Tools: dbt Cloud, Truffle Security, Sentry, PagerDuty
- Orchestration & Scheduling
  - Apache Airflow DAG for pipeline orchestration
  - Conditional execution based on data freshness
  - Dynamic task generation
  - Cross-timezone scheduling
  - Priority: MEDIUM | Effort: High | Impact: High
  - Tools: Apache Airflow, Prefect, Dagster, dbt Cloud
- Performance Tuning
  - Query optimization and analysis
  - Snowflake clustering strategy
  - Partition pruning optimization
  - Query cache analysis
  - Priority: LOW | Effort: Medium | Impact: Medium
  - Tools: Snowflake Query Profile, dbt meta tags
Predictive analytics and ML pipelines.
- Predictive Models
  - Booking demand forecasting
  - Occupancy prediction
  - Price optimization models
  - Churn prediction for hosts
  - Priority: LOW | Effort: Very High | Impact: High
  - Tools: Python (scikit-learn, XGBoost), Snowflake ML
- Feature Store Integration
  - Centralized feature engineering
  - Feature versioning and tracking
  - Online/Offline feature serving
  - Priority: LOW | Effort: Very High | Impact: Medium
  - Tools: Feast, Tecton, Hopsworks
- Real-time Analytics
  - Streaming data ingestion
  - Real-time dashboards
  - Incremental aggregations
  - Priority: LOW | Effort: Very High | Impact: Medium
  - Tools: Kafka/Kinesis, Spark Streaming, Flink
| Q1 (Month 1-3) | Q2 (Month 4-6) | Q3 (Month 7-9) |
|---|---|---|
| Testing Suite | Quality Dashboards | ML Models |
| CI/CD Pipeline | BI Integration | Feature Store |
| Logging/Monitoring | Governance | Real-time |
| Data Masking | Alerting System | Cost Optimization |
| RBAC | Orchestration | |

| Priority | Implementation | Business Value | Timeline |
|---|---|---|---|
| 🔴 Critical | Testing, CI/CD, Monitoring | Enterprise-ready | Month 1-2 |
| 🟡 High | Dashboards, BI Tools, Security | Strategic advantage | Month 2-4 |
| 🟢 Medium | Advanced Metrics, Orchestration | Operational efficiency | Month 4-7 |
| 🔵 Low | ML Models, Real-time, Feature Store | Competitive advantage | Month 7-12 |
- dbt Cloud (orchestration + monitoring)
- Snowflake Query Logs (performance tracking)
- Datadog/New Relic (infrastructure monitoring)
- PagerDuty (alerting)
- Slack (notifications)
- dbt tests (native)
- Great Expectations (advanced validation)
- dbt-expectations package
- Custom Python validators
- Tableau (enterprise BI)
- Looker (embedded analytics)
- Apache Superset (open-source)
- Snowflake Native App (built-in)
- Python (scikit-learn, XGBoost, PyTorch)
- Snowflake ML (native ML support)
- MLflow (experiment tracking)
- Feast (feature store)
- Add data quality dashboards (Q2, High Priority)
- Implement CI/CD pipeline (Q1, Critical)
- Add more complex business metrics (Q2, High Priority)
- Integrate with BI tools (Tableau/Power BI) (Q2, High Priority)
- Add alerting and monitoring (Q2, High Priority)
- Implement data masking for PII (Q1, Critical)
- Add more comprehensive testing suite (Q1, Critical)
- Orchestration with Airflow/Prefect (Q2, Medium Priority)
- Predictive analytics models (Q3, Low Priority)
- Real-time streaming pipeline (Q3, Low Priority)