GEG-ETHZ/kedro-pipeline-template

Kedro Pipeline Template — GEG ETH Zurich

A standardized, production-ready Kedro pipeline template for the Geothermal Energy and Geofluids (GEG) group at ETH Zurich. Runs locally out of the box or on Google Cloud Platform with a single flag.

Why use this template?

  • Standardized structure: Uniform project layout and Data Catalog across all GEG pipelines.
  • Version-controlled logic: Python transformations tracked in Git for full reproducibility.
  • Multi-environment: Switch between local and GCP execution with one flag — no code changes needed.
  • Data governance: Centralized catalog manages GCS and BigQuery connections. No hardcoded paths or secrets.
  • Quality built-in: Pre-commit hooks run linting (ruff), type checking (mypy), and secret scanning (detect-secrets) automatically.

Prerequisites

| Tool | Purpose | Install |
|---|---|---|
| Python ≥ 3.10 | Runtime | python.org |
| uv | Package & environment management | curl -LsSf https://astral.sh/uv/install.sh \| sh |
| Google Cloud CLI | GCP authentication (GCP mode only) | cloud.google.com |

Setup

1. Clone and set up the environment

git clone <this-repo-url>
cd kedro-pipeline
make setup

This creates a virtual environment, installs all dependencies from uv.lock, and sets up the code quality hooks.

Hooks run automatically on every git commit. To run them manually:

make check

Running the pipeline

Local mode (default — no cloud needed)

Data is read from and written to the local data/ directory.

make run

Place your raw CSV files in:

  • data/01_raw/experiment_material_fluid_flowrate_01.csv
  • data/01_raw/experiment_material_fluid_pressure_01.csv

Results are written to data/08_reporting/.
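
Before a local run, it can help to confirm that both input files listed above are in place. The following is a minimal sketch and not part of the template: the helper name `missing_raw_files` is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the list above; adjust them if your catalog differs.
EXPECTED_RAW_FILES = (
    "experiment_material_fluid_flowrate_01.csv",
    "experiment_material_fluid_pressure_01.csv",
)


def missing_raw_files(raw_dir: Path) -> list[str]:
    """Return the expected raw file names that are absent from raw_dir."""
    return [name for name in EXPECTED_RAW_FILES if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files(Path("data/01_raw"))
    if missing:
        print("Missing raw inputs:", ", ".join(missing))
    else:
        print("All raw inputs found.")
```

Run it from the project root so that the relative path `data/01_raw` resolves correctly.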


GCP mode

Data is read from Google Cloud Storage and results are written to both GCS and BigQuery.

Step 1 — Authenticate with Google Cloud

gcloud auth application-default login

Step 2 — Configure environment variables

Copy the example file and fill in your GCP details:

cp .env.example .env

Edit .env:

GCS_BUCKET=your-gcs-bucket-name   # GCS bucket (without gs://)
GCP_PROJECT=your-gcp-project-id   # GCP project ID
GBQ_DATASET=your_bigquery_dataset  # BigQuery dataset name
# GBQ_LOCATION=europe-west6        # Optional: BigQuery region (default: europe-west6)

Step 3 — Run pipeline

make run-gcp

The .env variables are automatically loaded by the Makefile.
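
A missing variable may only surface partway through a GCP run, so a quick pre-flight check can be useful. The sketch below assumes the simple KEY=value format shown in Step 2. The helpers `parse_env` and `missing_required` are hypothetical and are not part of the template or its Makefile.

```python
from pathlib import Path

# Required keys as listed in the .env example above; GBQ_LOCATION is optional.
REQUIRED_KEYS = ("GCS_BUCKET", "GCP_PROJECT", "GBQ_DATASET")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, ignoring blank lines and '#' comments.

    Assumption: values never contain '#', as in the example file.
    """
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline and full-line comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def missing_required(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    values = parse_env(env_file.read_text()) if env_file.is_file() else {}
    print("Missing:", missing_required(values) or "none")
```

This only checks that the variables are set; it does not verify that the bucket, project, or dataset actually exist.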

GCP data paths

| Layer | Location |
|---|---|
| Raw inputs | gs://<GCS_BUCKET>/kedro-pipeline/data/01_raw/ |
| Intermediate | gs://<GCS_BUCKET>/kedro-pipeline/data/02_intermediate/ |
| Primary | gs://<GCS_BUCKET>/kedro-pipeline/data/03_primary/ |
| Reporting (CSV) | gs://<GCS_BUCKET>/kedro-pipeline/data/08_reporting/permeability.csv |
| Reporting (BQ) | <GCP_PROJECT>.<GBQ_DATASET>.permeability |
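
The layout above can be expressed as a small helper, which may be handy when inspecting outputs from a notebook. This is a sketch only: `gcs_layer_uri` and `bq_table_id` are hypothetical names that do not exist in the template, and the pipeline itself is expected to resolve these locations through its catalog rather than through code like this.

```python
# Layer directory names as they appear in the GCP data paths above.
LAYERS = ("01_raw", "02_intermediate", "03_primary", "08_reporting")


def gcs_layer_uri(bucket: str, layer: str) -> str:
    """Build the gs:// URI of a data layer, following the layout above."""
    if layer not in LAYERS:
        raise ValueError(f"Unknown layer: {layer!r}")
    return f"gs://{bucket}/kedro-pipeline/data/{layer}/"


def bq_table_id(project: str, dataset: str, table: str = "permeability") -> str:
    """Build a fully qualified BigQuery table ID: project.dataset.table."""
    return f"{project}.{dataset}.{table}"


if __name__ == "__main__":
    # Placeholder values, matching those in the .env example.
    print(gcs_layer_uri("your-gcs-bucket-name", "01_raw"))
    print(bq_table_id("your-gcp-project-id", "your_bigquery_dataset"))
```

The bucket name is passed without the `gs://` prefix, mirroring the convention used for `GCS_BUCKET` in `.env`.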

Project structure

kedro-pipeline/
├── conf/
│   ├── base/           # Shared config — local filesystem defaults
│   │   ├── catalog.yml
│   │   └── parameters.yml
│   ├── gcp/            # GCP overrides — activated with --env gcp
│   │   └── catalog.yml
│   └── local/          # Your personal overrides (gitignored)
├── data/               # Local data (gitignored — only .gitkeep files committed)
│   ├── 01_raw/
│   ├── 02_intermediate/
│   ├── 03_primary/
│   └── 08_reporting/
├── src/
│   └── kedro_pipeline/
│       └── pipelines/
│           └── data_processing/
│               ├── nodes.py      # Pure transformation functions
│               └── pipeline.py   # Pipeline assembly
├── tests/
├── .env.example        # GCP environment variable template
└── pyproject.toml

Common commands

| Task | Command |
|---|---|
| Project setup | make setup |
| Run pipeline locally | make run |
| Run pipeline on GCP | make run-gcp |
| Run a single node | uv run kedro run --nodes preprocess_flow_node |
| Visualize pipeline | make run-viz |
| Run tests | make test |
| Jupyter notebook | uv run kedro jupyter notebook |
| Lint & format | make check |
