This page answers what you are building, why each layer exists, and how the pieces relate. It complements the hands-on LEARNING_GUIDE.md (commands and troubleshooting) and the README.md architecture diagram.
The repository is a local, open-source mirror of a common humanitarian / INGO data stack: land diverse raw files, govern them through a medallion path, model business-ready tables with dbt, and expose KPIs to analysts.
You practice skills that transfer directly to Databricks + Delta + dbt in production:
- Ingesting CSV.gz, Parquet, JSON, Avro without losing lineage back to the file.
- Writing Delta tables so you get ACID writes, schema evolution, and time travel (in the Spark path).
- Separating engineering transforms (Silver cleaning, typed marts) from ad-hoc analytics (SQL files, dashboards).
- Testing at the Python boundary and inside dbt.
The domain is synthetic Pakistani social protection and Afghan refugee programme-style data (beneficiaries, payments, complaints, inventory, donors, surveys, protection). The numbers are fictional; the grains, joins, and quality patterns are realistic practice material.
Purpose: Preserve a faithful, queryable copy of what arrived from upstream, with minimal opinion. Bronze answers: What did we receive, when, and in what shape?
In this repo: python ingest/ingest.py reads files under data_large/, normalizes them into Spark DataFrames, and writes Delta under delta_lake/bronze/.
Why Delta here: Bronze is the handoff between file diversity (many libraries) and downstream SQL/Spark. Delta gives you a table abstraction over Parquet with transaction logs—closer to how cloud lakehouses store “raw” data than a pile of loose files.
Purpose: Apply governance: deduplicate, cast types, resolve keys, handle nulls, and encode business rules that everyone should agree on before metrics. Silver answers: What do we believe is true about each entity at this grain?
In this repo: notebooks/delta_lake_operations.ipynb reads Bronze Delta, cleans and promotes to delta_lake/silver/.
Why a notebook (vs only dbt): In many organizations, heavy row-level Spark (window functions, complex parsing) still lives in notebooks or jobs; dbt then consumes the stable Silver contract. This split matches that pattern and keeps dbt models simpler.
Purpose: Curate marts for consumption: star-ish shapes, KPI-ready joins, documented grains. Gold answers: What does the business read every Monday?
In this repo: dbt builds stg_* → int_* → mart_* and materializes marts as tables in DuckDB (pspl.duckdb). dbt reads Silver via delta_scan (DuckDB Delta extension), so Gold is still logically downstream of the lake—even though the query engine switches from Spark to DuckDB.
Why DuckDB for Gold here: It keeps the portfolio lightweight on a laptop while teaching the same dbt patterns (ref, source, tests, docs) you would run against Databricks SQL in production.
| Concern | Spark + Delta (this repo) | DuckDB + dbt |
|---|---|---|
| Scale-out, huge scans | Yes | Laptop-scale |
| Multi-format ingest to a lake | Yes | Possible but not the focus here |
| Cheap local modeling + tests | Heavy | Excellent |
delta_scan from SQL |
Native in Spark | Via DuckDB extension |
So: Spark owns the lake writes; DuckDB owns the warehouse-shaped marts for this demo. In production you might collapse some of this into Databricks-only, but the separation of concerns (lake vs warehouse) is still a common pattern.
- Staging (
stg_*): One-to-one with Silver sources, light renaming and type hygiene. Think “adapter layer” to your sources. - Intermediate (
int_*): Reusable building blocks—joins, aggregates, or bridges used by multiple marts. - Marts (
mart_*): Business-facing tables with clear grain (e.g. one row per district/program/month for payment KPIs).
Why tests matter: dbt tests encode contracts (“this key is unique”, “no nulls in this column”). They turn tribal knowledge into CI-friendly assertions—just like you want before promoting models in production.
The sql/ folder holds standalone queries that read the same DuckDB objects dbt built. They demonstrate window functions, CTEs, and reporting logic without hiding inside a model.
The Streamlit app in dashboard/streamlit_app.py reads those marts too, but adds interactive charts for onboarding and stakeholder-style storytelling.
DELTA_LAKE_PATH: Absolute path to thedelta_lake/directory using forward slashes, required for dbt to parsedelta_scan('{{ env_var('DELTA_LAKE_PATH') }}/silver/...')indbt/models/sources.yml. The Windows pipeline script sets this for you.JAVA_HOME: Required for PySpark.HADOOP_HOME/winutils(Windows): Spark expects Hadoop-style filesystem hooks; the provided scripts configure this automatically.
A tight story:
- Ingest multi-format sources to Bronze Delta with explicit readers and tests at the Python edge.
- Notebook enforces Silver quality (dedupe, types) and documents Spark/Delta behavior including time travel.
- dbt defines versioned transformations and tests from Silver to Gold marts in DuckDB—portable to
dbt-databrickswith adapter and path changes. - KPIs + dashboard show how analysts consume marts without re-implementing business logic.
That is a complete lakehouse-shaped arc on a laptop.
- Scope and cloud mapping: SCOPE_AND_CLOUD.md.
- Run the pipeline: LEARNING_GUIDE.md — Section 5.
- Commands in the browser: getting-started.html.
- Notebook environment: JUPYTER_AND_NOTEBOOKS.md.
- Column-level detail: data_dictionary.md.