Audience: Trainers building PowerPoint / Google Slides / Marp.
Depth: ~55 slides ≈ 1.5–2h talk + labs separate. Extend with screenshots from dbt docs, Delta folders, Streamlit.
Full speaker notes (same slide order): TRAINER_SLIDES_WITH_SPEAKER_NOTES.md · Course marketing (10-day cohort): COURSE_MARKETING.md
Convention: Each block is one slide. Title = # ... line; bullets = talking points.
- Pakistani social protection — data engineering portfolio (synthetic)
- Trainer: [name] · Cohort: [date]
- Outcomes: run a lakehouse-shaped pipeline end-to-end (local open-source stack)
- Files → Bronze Delta → Silver Delta → dbt marts (DuckDB) → KPIs + dashboard
- Same shape as Databricks/ADF production stacks
- Beginners: vocabulary + order of operations
- Practitioners: Spark + dbt integration patterns
- Leads: trade-offs + cloud migration narrative
- Numbers are not official statistics
- We teach engineering patterns: grain, lineage, tests, reruns
- Python 3.11, JDK 11/17, Git, 8 GB+ RAM
- Windows: PowerShell policy + `winutils` story (demo script)
- `data_large/` · `ingest/` · `delta_lake/` · `notebooks/` · `dbt/` · `sql/` · `dashboard/` · `docs/`
- “What landed” from upstream
- Low-opinion transforms; preserve auditability
- Conformed entities: types, keys, dedupe
- Engineering disagreements surface here
- Subject-area tables / KPI grains
- Fewer joins for analysts / BI
- Spark: lake-scale writes + Delta
- DuckDB + dbt: fast modeling on a laptop
- Isolation · reproducible pins (`requirements.txt`)
- Avoid wrong global Python (3.14 wheel gaps)
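A pre-flight check can make the "wrong global Python" point concrete before a lab starts. The sketch below is an illustration, not a script from the repo: the function names are hypothetical, and the required version is taken from the prerequisites bullet (Python 3.11).

```python
import sys

REQUIRED = (3, 11)  # version named in the prerequisites bullet


def python_ok(version_info=sys.version_info, required=REQUIRED):
    """True when the interpreter's major.minor matches the pinned version."""
    return tuple(version_info[:2]) == required


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """True inside a venv: sys.prefix then differs from sys.base_prefix."""
    return prefix != base_prefix


if __name__ == "__main__":
    print("python ok:", python_ok(), "| venv active:", in_virtualenv())
```

Running it from the activated environment should report both checks as true; a global 3.14 interpreter would fail the first one.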
- Driver + JVM even in `local[*]`
- Lazy transforms vs actions
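For beginners, laziness can be shown without a Spark session. The following is a plain-Python analogy using generators, not the Spark API: building the pipeline does no work, and only consuming it (the "action") triggers the reads. All names are illustrative.

```python
log = []  # records when rows are actually read


def read_rows():
    """Stand-in for a source scan; logs each row as it is produced."""
    for i in range(5):
        log.append(f"read {i}")
        yield i


# "Transformations": describing the pipeline reads nothing yet.
pipeline = (x * 10 for x in read_rows() if x % 2 == 0)
rows_read_before_action = len(log)

# "Action": consuming the pipeline triggers every read.
result = list(pipeline)
rows_read_after_action = len(log)

print(rows_read_before_action, rows_read_after_action, result)
```

The analogy stops at evaluation order; Spark additionally plans and optimises the whole chain before running it.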
- `_delta_log` · ACID batch writes
- Path to time travel and schema evolution (concept)
- Interactive exploration vs headless `nbconvert --execute`
- CI = same notebook, no UI
- `source` · `ref` · tests · docs
- In-process SQL analytics
- `delta_scan` bridges Silver paths into SQL
- Same DAG, different shells
- Classroom parity on Windows
- Unit tests at pure Python boundary
- Properties = invariants
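"Properties = invariants" can be demonstrated with a pure function and random inputs. This is a minimal sketch: `clean_name` is a hypothetical normaliser (not a function from the repo), and the standard library's `random` module stands in for a dedicated property-testing library.

```python
import random
import string


def clean_name(raw):
    """Hypothetical normaliser: trim, collapse whitespace, title-case."""
    return " ".join(raw.split()).title()


def check_idempotent(fn, trials=200, seed=0):
    """Property: applying fn twice gives the same result as applying it once."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    alphabet = string.ascii_letters + "   "
    for _ in range(trials):
        raw = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
        once = fn(raw)
        assert fn(once) == once, f"not idempotent for {raw!r}"
    return True


check_idempotent(clean_name)
```

The point for the room: the test states a rule that must hold for every input, rather than listing a handful of examples.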
- Consumer of marts — not a new truth layer
- Interactivity for stakeholder rehearsal
- Every metric answer needs a grain sentence
- Tie to `data_dictionary.md`
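A grain sentence ("one row per district per month") is checkable: the grain columns must be unique. The sketch below uses synthetic rows and illustrative column names, which are assumptions and not the repo's schema.

```python
from collections import Counter


def grain_violations(rows, grain):
    """Return key tuples that appear more than once for the stated grain."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Synthetic rows; column names are illustrative only.
rows = [
    {"district": "A", "month": "2024-01", "payments": 10},
    {"district": "A", "month": "2024-02", "payments": 12},
    {"district": "A", "month": "2024-02", "payments": 12},  # breaks the grain
]
violations = grain_violations(rows, grain=("district", "month"))
print(violations)
```

An empty result means the table honours its grain sentence; anything else means a metric built on it may double count.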
- “What if this job runs twice?”
- Bronze overwrite vs merge (real world)
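The "runs twice" question can be answered with three toy write modes. This is a conceptual sketch on in-memory lists, not Delta or Spark code; the function names and the `id` key are hypothetical.

```python
def append(table, batch):
    """Blind append: a rerun duplicates every row."""
    return table + batch


def overwrite(table, batch):
    """Full overwrite: a rerun replaces the table, so the result is stable."""
    return list(batch)


def merge(table, batch, key):
    """Upsert by key: existing keys are updated, new keys are inserted."""
    merged = {row[key]: row for row in table}
    merged.update({row[key]: row for row in batch})
    return list(merged.values())


batch = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]

appended = append(append([], batch), batch)           # job ran twice
overwritten = overwrite(overwrite([], batch), batch)  # job ran twice
merged = merge(merge([], batch, "id"), batch, "id")   # job ran twice

print(len(appended), len(overwritten), len(merged))
```

Only the append path changes its answer on the second run, which is the behaviour an idempotent pipeline has to avoid.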
- dbt tests + documented columns
- Fail fast vs silent drift
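Fail fast versus silent drift can be shown with a small contract check. The expected columns below are a hypothetical contract invented for illustration; in the pipeline itself this role is played by dbt tests.

```python
EXPECTED_COLUMNS = {"beneficiary_id": int, "district": str}  # hypothetical contract


def validate_row(row, expected=EXPECTED_COLUMNS):
    """Raise immediately on missing, unexpected, or mistyped columns."""
    missing = sorted(expected.keys() - row.keys())
    extra = sorted(row.keys() - expected.keys())
    if missing or extra:
        raise ValueError(f"schema drift: missing={missing} extra={extra}")
    for column, expected_type in expected.items():
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} should be {expected_type.__name__}")
    return row


validate_row({"beneficiary_id": 1, "district": "A"})  # passes quietly
```

A row with a renamed or extra column stops the run at the boundary instead of flowing into marts unnoticed.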
- Rolling averages · rankings · running totals
- Appears in KPI SQL + marts
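The three window patterns can be previewed in plain Python before showing the SQL. This is a sketch of the ideas only; the KPI SQL in the repo may frame its windows differently, and the input values are made up.

```python
from collections import deque
from itertools import accumulate


def rolling_mean(values, window):
    """Mean over the current value and up to window-1 preceding values."""
    buf = deque(maxlen=window)
    out = []
    for value in values:
        buf.append(value)
        out.append(sum(buf) / len(buf))
    return out


def running_total(values):
    """Cumulative sum in input order."""
    return list(accumulate(values))


def dense_rank_desc(values):
    """Rank 1 for the largest value; ties share a rank, with no gaps."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values), reverse=True))}
    return [order[v] for v in values]


monthly = [10, 20, 30, 20]  # made-up monthly counts
print(rolling_mean(monthly, 2), running_total(monthly), dense_rank_desc(monthly))
```

Each function mirrors one SQL idea: a bounded frame, an unbounded preceding frame, and a ranking over an ordering.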
- CSV.gz vs Parquet vs JSON vs Avro
- When schema hurts / helps
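Where schema "helps" is easiest to see in a round trip. The sketch below uses only the standard library, so it covers CSV.gz and JSON; Parquet and Avro need third-party packages and are left out. The row is synthetic and its field names are illustrative.

```python
import csv
import gzip
import io
import json

rows = [{"beneficiary_id": 1, "amount": 2500.0, "active": True}]  # synthetic

# CSV.gz: compact and universal, but every value comes back as a string.
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
buf.seek(0)
with gzip.open(buf, "rt", newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))

# JSON: numbers and booleans survive, but nothing enforces the shape.
json_rows = json.loads(json.dumps(rows))

print(csv_rows[0], json_rows[0])
```

The CSV reader hands back strings for the number and the boolean, which is why the Silver layer has to cast deliberately.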
- `python ingest/ingest.py --dataset beneficiaries`
- Faster feedback loop in class
- Open `delta_lake/bronze/.../_delta_log`
- “This is a real Delta table, not CSV cosplay”
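To go one step past opening the folder, the commit files can be summarised with a few lines of Python. Delta commit files are newline-delimited JSON in which each line holds one action (`add`, `remove`, `metaData`, `protocol`, `commitInfo`). The function name is hypothetical and the log directory is passed in by the caller, since the exact table path depends on the dataset.

```python
import json
from collections import Counter
from pathlib import Path


def summarise_delta_log(log_dir):
    """Count Delta actions (add, remove, metaData, ...) per commit file."""
    summary = {}
    for commit in sorted(Path(log_dir).glob("*.json")):
        actions = Counter()
        for line in commit.read_text(encoding="utf-8").splitlines():
            if line.strip():
                # Each line is a JSON object keyed by its action type.
                actions.update(json.loads(line).keys())
        summary[commit.name] = dict(actions)
    return summary
```

Calling it on a table's `_delta_log` directory after a rerun shows a new numbered commit appearing, which sets up the time-travel discussion.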
- Why duplicates exist (retries, upstream bugs)
- Show before/after row counts (conceptual)
- Ranking / partitions tied to business questions
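The dedupe rule "keep the latest record per key" is the same idea as ranking rows within a partition and keeping rank 1. A minimal pure-Python sketch follows; the column names and rows are synthetic assumptions, and the Silver notebooks may implement this in Spark instead.

```python
def dedupe_latest(rows, key, order_by):
    """Keep one row per key: the one with the greatest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Synthetic input: id 1 arrives twice because of an upstream retry.
raw = [
    {"id": 1, "loaded_at": "2024-01-01", "amount": 100},
    {"id": 1, "loaded_at": "2024-01-02", "amount": 120},
    {"id": 2, "loaded_at": "2024-01-01", "amount": 250},
]
clean = dedupe_latest(raw, key="id", order_by="loaded_at")
print(len(raw), "->", len(clean))
```

Printing the before and after counts gives the row-count comparison to show on the slide.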
- dbt parse-time requirement
- Forward slashes on Windows
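The forward-slash rule is easy to apply with `pathlib`. The sketch assumes the dbt project reads the Silver location from an environment variable; the variable name `SILVER_PATH` and the example path are hypothetical, not taken from the repo.

```python
import os
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path):
    """Convert a Windows path to forward-slash form."""
    return PureWindowsPath(windows_path).as_posix()


silver = to_forward_slashes(r"C:\repo\delta_lake\silver")  # hypothetical path
os.environ["SILVER_PATH"] = silver  # hypothetical variable name
print(silver)
```

Setting the variable before invoking dbt matters because the value is needed when the project is parsed, not only when models run.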
- 1:1 with Silver sources
- Rename + cast discipline
- Reusable joins / bridges
- DRY for multiple marts
- KPI-ready grains
- Materialized as tables here
- “This is your communication artifact with analysts”
- Schema tests vs singular SQL tests
- Where to enforce business rules
- Analyst-ready recipes
- Portable to BI tools / Databricks SQL
- Heatmaps = patterns
- Ranked bars = snapshots
- Volume + rate = anti-ambiguity
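Why volume and rate belong together can be shown with two made-up districts that share a rate but differ hugely in volume. The numbers and names below are invented for illustration only.

```python
def volume_and_rate(successes, attempts):
    """Report the denominator alongside the rate so neither is read alone."""
    rate = successes / attempts if attempts else None
    return {"attempts": attempts, "successes": successes, "rate": rate}


# Invented figures: same rate, very different volumes.
small = volume_and_rate(successes=9, attempts=10)
large = volume_and_rate(successes=900, attempts=1000)
print(small, large)
```

A chart that shows only the rate would rank these two districts as identical, which is the ambiguity the slide warns about.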
- Ingestion runbook · dbt runbook
- Day-2 on-call mindset (local simulation)
- `make clean` / `clean-artifacts.ps1`
- Windows file handles + Spark shutdown
- `ingest.py` ↔ ADF-triggered Databricks job
- Local Delta paths ↔ ADLS Gen2 + Unity Catalog
- `dbt-duckdb` ↔ `dbt-databricks`
- Adapter swap + catalog paths
- `make all` ↔ Databricks Workflows / ADF pipeline
- Synthetic data avoids privacy incidents
- Production: masking, row access policies, audit logs
- Shuffle cost · partition pruning
- “Not optimising prematurely; measuring first”
- “What breaks if Silver is missing?”
- Add a test or document a mart grain in YAML
- Correctness · grain · lineage hygiene · communication
- Walk file → KPI on a dashboard (60 seconds)
- Where do you test: Python vs dbt vs integration?
- Migrate this repo to Databricks — what changes first?
- Incremental models · SCD2 · streaming ingestion
- Unity Catalog governance patterns
- Read `CONCEPTS_AND_PURPOSE.md`
- Run one stage with intentional failure + fix write-up
- Point to `docs/training/COMPLETE_TECHNICAL_TRAINER_GUIDE.md`
- Thank you / office hours
- Screenshot: `pytest` green
- Screenshot: Streamlit heatmap
- Screenshot: `dbt test` failure example
- Diagram: README Mermaid architecture (export PNG)
Deck outline ends. Speaker notes: TRAINER_SLIDES_WITH_SPEAKER_NOTES.md. Deep technical notes: COMPLETE_TECHNICAL_TRAINER_GUIDE.md.