Swiss rental price prediction project with:
- an XGBoost model,
- reproducible train/evaluate/predict scripts,
- and a Streamlit app for interactive inference.
Contents:

- Results and Demo
- Dataset
- Replicate Locally
- CLI Workflow
- Deployment
- Quality Checks
- Model Report
- Productionization Decisions
- Notebooks
- Project Structure
- Notes
## Results and Demo

- MAE: 347.02 CHF
- RMSE: 748.05 CHF
- R²: 0.7856
These metrics come from the reproducible CLI pipeline (`scripts/train.py` / `scripts/evaluate.py`) on `data/processed/02_featured_data.pkl`.
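Metrics of this kind can be reproduced with scikit-learn. A minimal sketch on dummy arrays (not the project's data, just the three formulas):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Dummy targets/predictions for illustration only (not project data).
y_true = np.array([1500.0, 2200.0, 1800.0, 2600.0])
y_pred = np.array([1450.0, 2300.0, 1750.0, 2500.0])

mae = mean_absolute_error(y_true, y_pred)           # mean |error| in CHF
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
r2 = r2_score(y_true, y_pred)                       # share of variance explained
print(round(mae, 2), round(rmse, 2), round(r2, 4))
```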
- Live demo: https://dimy.dev/projects/rentpredictor
- Local interactive demo:

  ```
  streamlit run app.py
  ```

## Dataset

- Raw scrape (not committed): ~22,515 listings from ImmoScout24 (`notebooks/01_eda.ipynb`), scraped with ImmoScraper, 12/2025.
- Cleaned residential set: ~16,399 rows after filtering/cleaning (`data/processed/01_cleaned_data.pkl`).
- Featured modeling set: engineered dataset used for training/inference (`data/processed/02_featured_data.pkl`).
- External enrichment: Swiss municipal/cantonal tax data from `data/external/tax_data_2025.csv`.

Key data files:

- `data/processed/01_cleaned_data.pkl`/`.csv`: output of EDA + cleaning.
- `data/processed/02_featured_data.pkl`: canonical training/evaluation dataset.
- `data/processed/rentals_ready_for_modeling.csv`: modeling-ready export.
- `data/external/tax_data_2025.csv`: tax feature source.
- Tax feature (`tax_rate`) is merged by city/commune mapping.
- If an exact city match fails, notebooks apply a canton-level median fallback.
- Training and app inference both expect raw columns such as `Zip`, `Canton`, and `SubType` before encoding.
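The city-then-canton-median fallback can be sketched with pandas. The column names (`City`, `Canton`, `tax_rate`) and the toy tables below are illustrative; the actual notebook logic may differ:

```python
import pandas as pd

# Hypothetical listings and tax tables for illustration.
listings = pd.DataFrame({
    "City": ["Zurich", "Kleindorf", "Geneva"],
    "Canton": ["ZH", "ZH", "GE"],
})
tax = pd.DataFrame({
    "City": ["Zurich", "Winterthur", "Geneva"],
    "Canton": ["ZH", "ZH", "GE"],
    "tax_rate": [0.10, 0.12, 0.14],
})

# 1) Exact city/commune match.
merged = listings.merge(tax[["City", "tax_rate"]], on="City", how="left")

# 2) Canton-level median fallback for cities without an exact match.
canton_median = tax.groupby("Canton")["tax_rate"].median()
merged["tax_rate"] = merged["tax_rate"].fillna(
    merged["Canton"].map(canton_median)
)
print(merged["tax_rate"].tolist())  # Kleindorf falls back to the ZH median
```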
## Replicate Locally

```
conda env create -f environment.dev.yml
conda activate swiss-rental
```

For Streamlit Cloud deployment, `requirements.txt` is provided with the runtime dependencies.
## CLI Workflow

Train:

```
python scripts/train.py \
  --data data/processed/02_featured_data.pkl \
  --model-out models/xgb_rent_model.pkl \
  --encoder-out models/zip_encoder.pkl \
  --metrics-out models/training_metrics.json \
  --feature-columns-out models/feature_columns.json \
  --manifest-out models/model_manifest.json \
  --split-strategy random
```

Evaluate:

```
python scripts/evaluate.py \
  --data data/processed/02_featured_data.pkl \
  --model models/xgb_rent_model.pkl \
  --encoder models/zip_encoder.pkl \
  --manifest models/model_manifest.json \
  --metrics-out models/evaluation_metrics.json \
  --split-strategy random
```

Predict:

```
python scripts/predict.py \
  --input-csv path/to/input_features.csv \
  --output-csv predictions.csv \
  --model models/xgb_rent_model.pkl \
  --encoder models/zip_encoder.pkl \
  --manifest models/model_manifest.json \
  --feature-columns models/feature_columns.json
```

Script roles:

- `scripts/train.py`: trains the model + encoder and writes training metrics/feature columns.
- `scripts/evaluate.py`: evaluates the saved model/encoder on a deterministic split (+ segmented metrics).
- `scripts/predict.py`: batch inference from an input CSV with `predicted_rent_chf`, `model_version`, and validation warning fields.
- `scripts/healthcheck.py`: validates model artifact integrity from `models/model_manifest.json`.
Validation split strategies:

- `--split-strategy random`: standard random holdout.
- `--split-strategy group_zip`: group-based holdout by ZIP (a harder, more realistic generalization check).
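A ZIP-grouped holdout of the kind `group_zip` describes can be sketched with scikit-learn's `GroupShuffleSplit`; the project's actual split implementation may differ, and the data below is synthetic:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical feature matrix with a ZIP code per row as the group key.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
zips = rng.choice([8001, 8002, 8003, 8004, 8005], size=100)

# Every listing from a given ZIP lands entirely in train or entirely in test,
# so evaluation measures generalization to unseen ZIPs.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=zips))

assert set(zips[train_idx]).isdisjoint(set(zips[test_idx]))  # no ZIP on both sides
```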
## Deployment

Containerized deployment is supported, with artifact checks at startup.
From the repo root on your VPS:

```
./scripts/deploy_vps.sh
```

This script will:

- pull the latest `origin/master` (fast-forward only)
- rebuild the Docker image
- replace the running `rentpredictor` container
- print container status and recent logs

Optional: skip the git pull if you have already synced the code:

```
SKIP_PULL=1 ./scripts/deploy_vps.sh
```

Local Docker build and run:

```
docker build -t rentpredictor .
docker run --rm -p 8501:8501 rentpredictor
```

At container startup:

- `scripts/entrypoint.sh` runs `scripts/healthcheck.py`
- the healthcheck verifies checksums in `models/model_manifest.json`
- Streamlit starts only if the artifacts are valid
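Manifest checksum verification of this kind can be sketched with `hashlib`. The manifest schema below (an `artifacts` mapping from file path to SHA-256 digest) is an assumption for illustration; see `scripts/healthcheck.py` for the real one:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path: Path) -> bool:
    """True only if every listed artifact matches its recorded digest."""
    manifest = json.loads(manifest_path.read_text())
    return all(
        sha256_of(Path(rel)) == digest
        for rel, digest in manifest["artifacts"].items()
    )

# Demo with a temporary artifact + manifest.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.pkl"
    artifact.write_bytes(b"fake model bytes")
    manifest_path = Path(tmp) / "manifest.json"
    manifest_path.write_text(
        json.dumps({"artifacts": {str(artifact): sha256_of(artifact)}})
    )
    print(verify_manifest(manifest_path))  # True while the bytes match
```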
## Quality Checks

Run tests only:

```
python -m unittest discover -s tests -p "test_*.py" -v
```

Run the full local gate:

```
./scripts/gate.sh
```

CI:

- GitHub Actions gate workflow: `.github/workflows/gate.yml`

Data quality:

- `scripts/data_quality_report.py` runs in `./scripts/gate.sh` and writes `models/data_quality_report.json`.
- The gate fails if critical schema checks fail or configured out-of-range ratios are exceeded.
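An out-of-range-ratio check of the kind the gate performs can be sketched as follows. The bounds, threshold, and column names here are invented for illustration; the real configuration lives in `scripts/data_quality_report.py`:

```python
import pandas as pd

# Hypothetical per-column bounds and gate threshold (illustrative only).
BOUNDS = {"rent_chf": (300, 20000), "rooms": (1, 10)}
MAX_OUT_OF_RANGE_RATIO = 0.01  # gate fails above 1% out-of-range rows

def out_of_range_ratios(df: pd.DataFrame) -> dict:
    """Share of rows outside the configured [low, high] bounds, per column."""
    return {
        col: float(((df[col] < lo) | (df[col] > hi)).mean())
        for col, (lo, hi) in BOUNDS.items()
    }

df = pd.DataFrame({"rent_chf": [1500, 2200, 150000, 1800], "rooms": [3, 4, 5, 2]})
ratios = out_of_range_ratios(df)
gate_ok = all(r <= MAX_OUT_OF_RANGE_RATIO for r in ratios.values())
print(ratios, gate_ok)  # one of four rents is out of range, so the gate fails
```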
## Productionization Decisions

- Script-first pipeline over notebook-only execution:
  - Decision: `scripts/train.py`, `scripts/evaluate.py`, and `scripts/predict.py` are canonical.
  - Why: reproducible runs, easier CI integration, lower handoff friction.
- Contract-first input validation:
  - Decision: strict shared schema checks in `src/ml_pipeline.py`.
  - Why: fail fast on bad inputs and avoid silent inference/training corruption.
- Explicit artifact integrity and versioning:
  - Decision: `models/model_manifest.json` with checksum verification at startup/inference.
  - Why: prevents stale/tampered artifact mixes and improves deploy safety.
- Feature alignment as a first-class artifact:
  - Decision: persist `models/feature_columns.json` and always align inference features.
  - Why: removes train/serve mismatch risk from one-hot columns.
- Two evaluation split modes:
  - Decision: support `random` and `group_zip` split strategies.
  - Why: report both an optimistic baseline and harder generalization-to-unseen-ZIP behavior.
- Deployment startup guardrails:
  - Decision: healthcheck before app start (`scripts/entrypoint.sh` + `scripts/healthcheck.py`).
  - Why: fail early if deployment artifacts are inconsistent.
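Feature alignment against a persisted column list can be sketched with pandas `DataFrame.reindex`; the actual helper in `src/ml_pipeline.py` may differ, and the column names below are illustrative:

```python
import pandas as pd

# Columns persisted at training time (illustrative subset of feature_columns.json).
feature_columns = ["Rooms", "Area", "Canton_ZH", "Canton_GE", "SubType_flat"]

# One-hot-encoded inference frame: Canton_GE is missing, one column is extra.
X_inf = pd.DataFrame({
    "Rooms": [3.5],
    "Area": [82.0],
    "Canton_ZH": [1],
    "SubType_flat": [1],
    "Unexpected": [7],  # dropped by the alignment step
})

# Reindex to the training schema: missing one-hot columns become 0,
# unknown columns are dropped, and column order matches training exactly.
X_aligned = X_inf.reindex(columns=feature_columns, fill_value=0)
print(list(X_aligned.columns))
```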
## Model Report

- Detailed before/after metrics and a tradeoff case study: `docs/model_report.md`
## Notebooks

- `notebooks/01_eda.ipynb`
- `notebooks/02_features_and_baseline.ipynb`
- `notebooks/03_ml_and_interpretability.ipynb`
## Project Structure

```
RentPredictor/
  app.py
  scripts/
    train.py
    evaluate.py
    predict.py
  src/
    ml_pipeline.py
    mappings.py
  data/
    external/
    processed/
  models/
  notebooks/
  docs/
```
## Notes

- Batch prediction input must include the raw preprocessing columns, including `Zip`, `Canton`, and `SubType`.
- `models/feature_columns.json` is required for robust feature alignment in app/CLI inference.
- `models/model_manifest.json` is used for artifact integrity validation and model version metadata.