Dimi20cen/RentPredictor

Swiss rental price prediction project with:

  • an XGBoost model,
  • reproducible train/evaluate/predict scripts,
  • and a Streamlit app for interactive inference.

Results and Demo

Key Results (current baseline run)

  • MAE: 347.02 CHF
  • RMSE: 748.05 CHF
  • R²: 0.7856

These metrics come from the reproducible CLI pipeline (train.py / evaluate.py) on data/processed/02_featured_data.pkl.

Demo

streamlit run app.py

Dataset

Data lineage

  • Raw scrape (not committed): ~22,515 listings from ImmoScout24 (notebooks/01_eda.ipynb).
  • Cleaned residential set: ~16,399 rows after filtering/cleaning (data/processed/01_cleaned_data.pkl).
  • Featured modeling set: engineered dataset used for training/inference (data/processed/02_featured_data.pkl).
  • External enrichment: Swiss municipal/cantonal tax data from data/external/tax_data_2025.csv.

Files in this repository

  • data/processed/01_cleaned_data.pkl / .csv: output of EDA + cleaning.
  • data/processed/02_featured_data.pkl: canonical training/evaluation dataset.
  • data/processed/rentals_ready_for_modeling.csv: modeling-ready export.
  • data/external/tax_data_2025.csv: tax feature source.

Important preprocessing notes

  • The tax feature (tax_rate) is merged in via a city/commune mapping.
  • If an exact city match fails, the notebooks fall back to the canton-level median.
  • Training and app inference both expect raw columns such as Zip, Canton, and SubType before encoding.
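The fallback logic above can be sketched in pandas. This is an illustrative sketch, not the repository's code: the column names (city, canton, tax_rate) are assumptions about the schema of data/external/tax_data_2025.csv.

```python
import pandas as pd

def merge_tax_rate(listings: pd.DataFrame, tax: pd.DataFrame) -> pd.DataFrame:
    """Attach tax_rate by exact city match, falling back to the canton median."""
    # Exact city match first; unmatched rows get NaN.
    merged = listings.merge(tax[["city", "tax_rate"]], on="city", how="left")
    # Canton-level median as the fallback value for unmatched cities.
    canton_median = tax.groupby("canton")["tax_rate"].median()
    merged["tax_rate"] = merged["tax_rate"].fillna(merged["canton"].map(canton_median))
    return merged
```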

Replicate Locally

1) Create environment

conda env create -f environment.dev.yml
conda activate swiss-rental

For Streamlit Cloud deployment, requirements.txt is provided with runtime dependencies.

2) Train artifacts

python scripts/train.py \
  --data data/processed/02_featured_data.pkl \
  --model-out models/xgb_rent_model.pkl \
  --encoder-out models/zip_encoder.pkl \
  --metrics-out models/training_metrics.json \
  --feature-columns-out models/feature_columns.json \
  --manifest-out models/model_manifest.json \
  --split-strategy random

3) Evaluate saved artifacts

python scripts/evaluate.py \
  --data data/processed/02_featured_data.pkl \
  --model models/xgb_rent_model.pkl \
  --encoder models/zip_encoder.pkl \
  --manifest models/model_manifest.json \
  --metrics-out models/evaluation_metrics.json \
  --split-strategy random

4) (Optional) Batch prediction

python scripts/predict.py \
  --input-csv path/to/input_features.csv \
  --output-csv predictions.csv \
  --model models/xgb_rent_model.pkl \
  --encoder models/zip_encoder.pkl \
  --manifest models/model_manifest.json \
  --feature-columns models/feature_columns.json

CLI Workflow

  • scripts/train.py: trains the model and encoder, and writes training metrics plus the feature-column list.
  • scripts/evaluate.py: evaluates the saved model/encoder on a deterministic split (plus segmented metrics).
  • scripts/predict.py: batch inference from an input CSV, adding predicted_rent_chf, model_version, and validation-warning fields.
  • scripts/healthcheck.py: validates model artifact integrity against models/model_manifest.json.

Validation split strategies:

  • --split-strategy random: standard random holdout.
  • --split-strategy group_zip: group-based holdout by ZIP (harder, more realistic generalization check).
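The difference between the two strategies can be sketched with scikit-learn. This is not the repository's implementation; the column name Zip matches the raw column noted above, but the test-size and seed values are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

def split_indices(df, strategy="random", test_size=0.2, seed=42):
    """Return (train_idx, test_idx) under either split strategy."""
    idx = np.arange(len(df))
    if strategy == "random":
        # Standard random holdout: rows from the same ZIP can land in both sets.
        return train_test_split(idx, test_size=test_size, random_state=seed)
    if strategy == "group_zip":
        # Group holdout: every ZIP appears in exactly one of the two sets,
        # so evaluation measures generalization to unseen ZIPs.
        splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
        train_pos, test_pos = next(splitter.split(idx, groups=df["Zip"]))
        return idx[train_pos], idx[test_pos]
    raise ValueError(f"unknown strategy: {strategy}")
```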

Deployment

Containerized deployment is supported with startup artifact checks.

One-command VPS deploy

From repo root on your VPS:

./scripts/deploy_vps.sh

This script will:

  • pull latest origin/master (fast-forward only)
  • rebuild the Docker image
  • replace the running rentpredictor container
  • print container status and recent logs

Optional: skip the git pull if you have already synced the code:

SKIP_PULL=1 ./scripts/deploy_vps.sh

Manual Docker run

docker build -t rentpredictor .
docker run --rm -p 8501:8501 rentpredictor

At container startup:

  • scripts/entrypoint.sh runs scripts/healthcheck.py
  • healthcheck verifies checksums in models/model_manifest.json
  • Streamlit starts only if artifacts are valid
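A checksum-based healthcheck like the one described can be sketched as follows. This is a minimal sketch that assumes a manifest shaped like {"artifacts": {"models/xgb_rent_model.pkl": "<sha256>"}}; the actual layout of models/model_manifest.json may differ.

```python
import hashlib
import json
from pathlib import Path

def verify_artifacts(manifest_path: str) -> bool:
    """Return True only if every listed artifact matches its recorded SHA-256."""
    manifest = json.loads(Path(manifest_path).read_text())
    for rel_path, expected in manifest["artifacts"].items():
        digest = hashlib.sha256(Path(rel_path).read_bytes()).hexdigest()
        if digest != expected:
            return False  # stale or tampered artifact: refuse to start
    return True
```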

Quality Checks

Run tests only:

python -m unittest discover -s tests -p "test_*.py" -v

Run full local gate:

./scripts/gate.sh

CI:

  • GitHub Actions gate workflow: .github/workflows/gate.yml

Data quality:

  • scripts/data_quality_report.py runs in ./scripts/gate.sh and writes models/data_quality_report.json.
  • Gate fails if critical schema checks fail or configured out-of-range ratios are exceeded.
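An out-of-range ratio check of the kind the gate enforces can be sketched like this. The threshold and column names here are illustrative, not the repository's configuration.

```python
import pandas as pd

def out_of_range_ratio(series: pd.Series, lo: float, hi: float) -> float:
    """Fraction of non-null values outside [lo, hi]."""
    values = series.dropna()
    if values.empty:
        return 0.0
    return float(((values < lo) | (values > hi)).mean())

def gate_passes(df: pd.DataFrame, checks: dict, max_ratio: float = 0.01) -> bool:
    """Fail the gate if any column's out-of-range ratio exceeds the limit."""
    return all(out_of_range_ratio(df[col], lo, hi) <= max_ratio
               for col, (lo, hi) in checks.items())
```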

Productionization Decisions

  • Script-first pipeline over notebook-only execution:

    • Decision: scripts/train.py, scripts/evaluate.py, scripts/predict.py are canonical.
    • Why: reproducible runs, easier CI integration, lower handoff friction.
  • Contract-first input validation:

    • Decision: strict shared schema checks in src/ml_pipeline.py.
    • Why: fail fast on bad inputs and avoid silent inference/training corruption.
  • Explicit artifact integrity and versioning:

    • Decision: models/model_manifest.json with checksum verification at startup/inference.
    • Why: prevents stale/tampered artifact mixes and improves deploy safety.
  • Feature alignment as a first-class artifact:

    • Decision: persist models/feature_columns.json and always align inference features.
    • Why: removes train/serve mismatch risk from one-hot columns.
  • Two evaluation split modes:

    • Decision: support random and group_zip split strategies.
    • Why: report both optimistic baseline and harder generalization-to-unseen-ZIP behavior.
  • Deployment startup guardrails:

    • Decision: healthcheck before app start (scripts/entrypoint.sh + scripts/healthcheck.py).
    • Why: fail early if deployment artifacts are inconsistent.
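The feature-alignment decision above amounts to reindexing inference inputs against the persisted column list. This is a sketch under assumptions (the layout of models/feature_columns.json is not shown here), not the repository's implementation.

```python
import pandas as pd

def align_features(X: pd.DataFrame, feature_columns: list) -> pd.DataFrame:
    """Reorder to the training-time columns: missing one-hot columns are
    added as 0, and columns the model never saw are dropped."""
    return X.reindex(columns=feature_columns, fill_value=0)
```

Reindexing with fill_value=0 is what removes the train/serve mismatch risk: a ZIP or canton dummy absent from a given batch still appears, zero-filled, in the exact position the model expects.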

Model Report

  • Detailed before/after metrics and tradeoff case study:
    • docs/model_report.md

Notebook Flow

  1. notebooks/01_eda.ipynb
  2. notebooks/02_features_and_baseline.ipynb
  3. notebooks/03_ml_and_interpretability.ipynb

Project Structure

RentPredictor/
  app.py
  scripts/
    train.py
    evaluate.py
    predict.py
  src/
    ml_pipeline.py
    mappings.py
  data/
    external/
    processed/
  models/
  notebooks/
  docs/

Notes

  • Batch prediction input must include raw preprocessing columns, including Zip, Canton, and SubType.
  • models/feature_columns.json is required for robust feature alignment in app/CLI inference.
  • models/model_manifest.json is used for artifact integrity validation and model version metadata.
