End-to-end machine learning pipeline to predict seller churn for the Olist e-commerce marketplace — enabling proactive retention strategies and measurable revenue protection.
Olist's marketplace depends on an active seller base. High churn erodes GMV and increases acquisition costs. This project builds a dual-stage predictive system that flags at-risk sellers before they churn, giving account managers a prioritised intervention list with estimated revenue impact.
| Metric | Value |
|---|---|
| Overall Churn Rate | 85.0% |
| Never-Activated Sellers | 515 (61.2%) — onboarding failure |
| Dormant Sellers | 201 (23.9%) — retention failure |
| Active Sellers | 140 (16.6%) |
| Revenue at Risk | R$ 272,607 |
| High / Critical Risk Sellers | 669 (79.5%) |
| Pre-Activation Model AUC | 0.975 (Logistic Regression) |
| Retention Model AUC | 0.706 (Gradient Boosting) |
| Active Sellers Targeted for Intervention | 33 |
```
Raw CSVs
   │
   ▼
DataLoader ──► DataPreprocessor ──► ChurnAnalyzer (labels + cohorts)
                                         │
                                         ▼
                                  FeatureEngineer
                                     /      \
                        Pre-Activation      Retention
                            Features        Features
                               │                │
                               ▼                ▼
                         ChurnModeler ──────► ChurnModeler
                      (LogReg / RF / GBM)  (LogReg / RF / GBM)
                               │                │
                               └───────┬────────┘
                                       ▼
                                 ModelEvaluator
                              (ROC · PR · CM · FI)
                                       │
                          ┌────────────┴────────────┐
                          ▼                         ▼
                    Risk Scoring             InsightsReporter
                 (overall_churn_risk)   (churn_insights_report.md)
                          │
                          ▼
              InterventionPrioritizer
           (intervention_priority_list.csv)
                          │
                          ▼
              📊 Interactive Dashboard
               (dashboard/index.html)
```
```bash
# 1. Clone and enter the project
git clone <repo>
cd olist-ecommerce

# 2. Copy and configure environment variables
cp .env.example .env    # set DATA_PATH to your Olist CSV folder

# 3. Install dependencies (uv required)
make install

# 4. Run the full pipeline
make run-pipeline

# 5. Generate the interactive dashboard
make dashboard          # → opens dashboard/index.html

# 6. View generated reports
ls outputs/             # seller_master, risk_scores, segments, cohorts
ls outputs/figures/     # ROC, PR, confusion matrix, feature importance
```

```bash
# 1. Clone and configure env (same as above)
git clone <repo>
cd olist-ecommerce
cp .env.example .env

# 2. Build the image (one-time, ~2 min)
make docker-build

# 3. Run the full training pipeline
make docker-pipeline

# 4. Launch the MLflow UI → http://localhost:5000
make docker-mlflow
```

Raw CSVs are read from `./data/raw/` on your host via a bind mount — no copying into the image required.
The project ships a multi-stage `Dockerfile` and a `docker-compose.yml` that orchestrates three independent services.

```
┌─────────────────────────────────────────────────────────┐
│ builder stage (python:3.10-slim + build-essential)      │
│   └─ uv sync → resolves & installs deps into .venv      │
└───────────────────────────┬─────────────────────────────┘
                            │ COPY .venv only
┌───────────────────────────▼─────────────────────────────┐
│ runtime stage (python:3.10-slim, no compiler tools)     │
│   ├─ runs as non-root user (appuser)                    │
│   ├─ bind mount: ./data → /app/data (host CSVs)         │
│   └─ named volumes: outputs · models · mlruns · logs    │
└─────────────────────────────────────────────────────────┘
```
| Service | Profile | Description |
|---|---|---|
| `pipeline` | `pipeline` | Runs the full training pipeline |
| `inference` | `inference` | Scores sellers with saved models |
| `mlflow` | `mlflow` | Experiment tracking UI on port 5000 |
Services use Compose profiles — nothing starts by default. Activate the one you need.
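The profile pattern looks like this in a `docker-compose.yml` fragment (service bodies elided — the real file also defines images, bind mounts, and named volumes):

```yaml
services:
  pipeline:
    profiles: ["pipeline"]    # starts only with: docker compose --profile pipeline up
  inference:
    profiles: ["inference"]
  mlflow:
    profiles: ["mlflow"]
    ports:
      - "5000:5000"
```

Because every service carries a profile, `docker compose up` with no `--profile` flag starts nothing, which is why the Make targets below activate one profile each.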
```bash
make docker-build      # Build the image (needed once, or after dep changes)
make docker-pipeline   # Train — reads ./data/raw, writes to volumes
make docker-inference  # Score — uses saved models from the models volume
make docker-mlflow     # Start MLflow UI → http://localhost:5000
make docker-down       # Stop & remove containers
make docker-clean      # ⚠ Remove containers, volumes AND the image
```

You only need `make docker-build` after:

- Changing `Dockerfile`
- Updating `pyproject.toml` or `uv.lock` (dependency changes)
- Modifying files in `src/`, `scripts/`, or `config/`

Changes to `docker-compose.yml` never require a rebuild.
After running the pipeline, generate a self-contained stakeholder dashboard:

```bash
make dashboard
# → dashboard/index.html
```

The dashboard is a single HTML file (no server required) with five tabs:
| Tab | Contents |
|---|---|
| 🏠 Overview | 6 KPI cards · key insight pills · status & risk donuts |
| 📅 Cohort Analysis | Monthly churn vs activation trend · GMV by cohort |
| 🗂️ Segmentation | Churn by business segment · lead type · behaviour profile · state |
| 🤖 Model Performance | Metric bars per model · cross-model comparison chart · pipeline architecture |
| 🎯 Interventions | Searchable & filterable priority table for 30 high-risk sellers |
Open dashboard/index.html in any browser — no dependencies, no server.
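The single-file design boils down to template substitution: pipeline outputs are inlined into the HTML so nothing is fetched at view time. A dependency-free sketch of the idea (the `__DATA__` token and JSON payload shape are assumptions for illustration, not the real template's contract):

```python
import json
from pathlib import Path

def render_dashboard(template_path: Path, out_path: Path, payload: dict) -> None:
    """Inline pipeline outputs into the HTML template as a JSON blob.

    The real scripts/generate_dashboard.py reads the outputs/ CSVs; here a
    hypothetical __DATA__ token stands in for the template's injection point.
    """
    html = template_path.read_text(encoding="utf-8")
    html = html.replace("__DATA__", json.dumps(payload))
    out_path.write_text(html, encoding="utf-8")
```

Because all data is baked into the file, the result can be emailed or dropped on a shared drive and still work offline.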
```
olist-ecommerce/
├── 📁 config/
│   └── settings.py              # Centralised config via pydantic-settings (.env)
├── 📁 src/                      # Core library — importable modules
│   ├── pipeline.py              # End-to-end orchestrator + all pipeline classes
│   ├── features.py              # Feature engineering (pre-activation + retention)
│   ├── models.py                # Model training (LogReg, RF, GBM comparison)
│   ├── evaluation.py            # Evaluation + chart generation → outputs/figures/
│   ├── reports.py               # Stakeholder Markdown report generation
│   └── validation/
│       └── schemas.py           # Pydantic data validation schemas
├── 📁 scripts/
│   ├── run_pipeline.py          # Entry point: runs src/pipeline.main()
│   └── generate_dashboard.py    # Reads outputs/ CSVs → dashboard/index.html
├── 📁 dashboard/                # Dashboard source (index.html is gitignored)
│   └── template.html            # HTML/JS/CSS template (Chart.js, dark theme)
├── 📁 notebooks/
│   └── poc_churn_analysis.ipynb # Exploratory analysis
├── 📁 tests/
│   └── test_pipeline.py         # Unit tests
├── 📁 docs/                     # Project documentation
│   ├── dataset.md               # Dataset description
│   ├── glossary.md              # Domain glossary
│   ├── project.md               # Architecture & design notes
│   └── poc.md                   # POC findings
├── 📁 outputs/                  # ← Generated on pipeline run (gitignored)
│   ├── figures/                 # Charts: ROC, PR, confusion matrix, feature importance
│   ├── seller_master.csv        # Full seller dataset with churn labels
│   ├── seller_risk_scores.csv   # Per-seller risk scores
│   ├── cohort_analysis.csv      # Monthly cohort stats
│   ├── segment_analysis_*.csv   # Churn by segment, lead type, state, profile
│   ├── intervention_priority_list.csv
│   └── analysis_summary.txt
├── 📁 models/                   # ← Generated on pipeline run (gitignored)
│   ├── pre_activation_model.joblib
│   └── retention_model.joblib
├── 📁 data/                     # (gitignored) — place Olist CSVs here
│   └── raw/
├── .env.example                 # Required environment variables template
├── Dockerfile                   # Multi-stage image (builder → runtime)
├── docker-compose.yml           # pipeline · inference · mlflow services
├── .dockerignore                # Keeps build context lean
├── Makefile                     # Developer shortcuts (local + Docker)
└── requirements.txt             # Pinned dependencies
```
Predicts whether a newly onboarded seller will never make a sale (61% of the dataset).
| Model | AUC-ROC | Accuracy | F1 |
|---|---|---|---|
| Logistic Regression ✅ | 0.975 | 0.930 | 0.944 |
| Random Forest | 0.962 | 0.937 | 0.947 |
| Gradient Boosting | 0.953 | 0.918 | 0.933 |
Predicts whether an activated seller will go dormant (60+ days without an order).
| Model | AUC-ROC | Accuracy | F1 |
|---|---|---|---|
| Logistic Regression | 0.647 | 0.597 | 0.638 |
| Random Forest | 0.673 | 0.597 | 0.638 |
| Gradient Boosting ✅ | 0.706 | 0.645 | 0.694 |
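AUC-ROC, the metric both stages are selected on, is the probability that a randomly chosen churner receives a higher risk score than a randomly chosen retained seller. A small stdlib illustration of that rank-based definition (toy scores, not the project's data):

```python
from itertools import product

def auc_roc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Equivalent to the Wilcoxon/Mann-Whitney statistic.
    """
    pairs = list(product(scores_pos, scores_neg))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# Toy example: risk scores for 3 churned vs 3 retained sellers.
# 8 of 9 pairs are ranked correctly.
print(auc_roc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

This pairwise view explains why an AUC of 0.706 for the retention model is usable but far from the 0.975 of the pre-activation model: the retention signal orders at-risk sellers correctly only about 7 times in 10.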
Charts for each model (ROC curve, Precision-Recall, Confusion Matrix, Feature Importance) are saved to `outputs/figures/` on every run.
```bash
make install          # Install dependencies via uv
make run-pipeline     # Run the full end-to-end pipeline
make run-inference    # Score sellers using saved models
make dashboard        # Generate dashboard/index.html from latest outputs
make test             # Run unit tests
make test-coverage    # Run tests with HTML coverage report
make lint             # flake8 + black + isort + bandit checks
make format           # Auto-format with black + isort
make ci-check         # Full local CI simulation (lint → typecheck → tests)
make pre-commit-run   # Run all pre-commit hooks over the codebase
make mlflow-ui        # Open MLflow UI at http://localhost:5000
make clean            # Remove __pycache__, .pytest_cache, htmlcov
make clean-outputs    # Remove generated CSVs, reports, and figures
make setup-dirs       # Create required directories from scratch
```

```bash
make docker-build     # Build the Docker image
make docker-pipeline  # Run training pipeline in a container
make docker-inference # Run inference in a container
make docker-mlflow    # Start MLflow UI at http://localhost:5000
make docker-down      # Stop & remove containers
make docker-clean     # ⚠ Remove containers, volumes AND the image
```

All settings are managed through `config/settings.py` and read from `.env`. No hardcoded paths anywhere in the codebase.
```bash
# .env
DATA_PATH=./data/raw             # Olist raw CSV files
OUTPUT_PATH=./outputs            # Generated CSVs and text files
MODELS_PATH=./models             # Saved .joblib models
FIGURES_PATH=./outputs/figures   # Charts and plots
```

See `.env.example` for the full list of configurable values.
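The project loads these values through pydantic-settings in `config/settings.py`. A dependency-free sketch of the same idea, where environment variables override the defaults above (stdlib stand-in, not the project's actual class):

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

def _env_path(var: str, default: str):
    """Factory: read a path from the environment, falling back to a default."""
    return field(default_factory=lambda: Path(os.environ.get(var, default)))

@dataclass(frozen=True)
class Settings:
    """Stdlib stand-in for the pydantic-settings class in config/settings.py."""
    data_path: Path = _env_path("DATA_PATH", "./data/raw")
    output_path: Path = _env_path("OUTPUT_PATH", "./outputs")
    models_path: Path = _env_path("MODELS_PATH", "./models")
    figures_path: Path = _env_path("FIGURES_PATH", "./outputs/figures")

settings = Settings()
```

Centralising paths like this is what lets the same code run locally and inside the Docker containers, where the bind mount and named volumes supply different locations via the environment.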
```bash
make test             # Run all unit tests
make test-coverage    # Run with coverage (open htmlcov/index.html)
```

Cairo Cananea

- Blog: cairocananea.com.br
- LinkedIn: Cairo Cananea
- GitHub: Cairo Cananea