Turning Brazil's largest public e-commerce dataset into an early-warning system for late deliveries.
- Overview
- Key Features
- Dataset
- Pipeline
- Project Structure
- SQL Exploratory Analysis
- Machine Learning Model
- Model Performance
- Power BI Dashboard
- Getting Started
- Key Insights
- Roadmap
- License
- Author
Late deliveries are one of the most expensive problems in e-commerce — they drive support tickets, refunds, and lost trust. This project builds an end-to-end pipeline on top of the Olist Brazilian e-commerce dataset (~100K real orders, 2016–2018) to flag orders likely to arrive late, using only information available at the time of purchase.
The pipeline goes: raw CSVs → SQL feature engineering → trained classifier → batch prediction script → Power BI dashboard.
- 🗃️ SQL-driven EDA — delivery rates, regional delay patterns, seller-level performance, and shipping-cost correlation, all in
delay.sql - 🎯 Binary delay classifier — predicts
is_delayedusing a trained Random Forest model - 📊 Baseline comparison — Random Forest benchmarked against a Logistic Regression baseline
- ⚡ Lightweight inference script —
scripts/predict.pyloads the saved model and scores new orders in a single run - 📈 Power BI dashboards — pre-built
.pbixreports for delay trends and prediction results - 🌎 State-level granularity — every Brazilian state is one-hot encoded as a model feature
Built on the Brazilian E-Commerce Public Dataset by Olist.
|
|
olist_orders_dataset.csv
|
Order status, timestamps, delivery dates
|
|
olist_order_items_dataset.csv
|
Items, prices, freight values, seller info
|
|
olist_customers_dataset.csv
|
Customer location and ID mappings
|
|
olist_sellers_dataset.csv
|
Seller location data
|
|
olist_products_dataset.csv
|
Product categories and dimensions
|
|
olist_geolocation_dataset.csv
|
ZIP code geolocation data
|
|
olist_order_payments_dataset.csv
|
Payment type and installment info
|
|
olist_order_reviews_dataset.csv
|
Customer review scores
|
|
product_category_name_translation.csv
|
Portuguese → English category names
|
Raw Olist CSVs
│
▼
delay.sql ──▶ joins tables, engineers is_delayed label
│
▼
final.csv ──▶ clean, model-ready dataset
│
▼
Model training ──▶ Random Forest + Logistic Regression (trained offline)
│
▼
models/*.pkl ──▶ rf_model.pkl, logistic_model.pkl, model_columns.pkl
│
▼
scripts/predict.py ──▶ scores new_data.csv → predicted_output.csv
│
▼
dashboard/*.pbix ──▶ Power BI visualization
supply-chain-delay-prediction/
├── dashboard/
│ ├── delay.pbix # Power BI: delay & delivery analysis
│ └── predicted_output.pbix # Power BI: prediction results view
├── data/
│ ├── new_data.csv # Unseen orders for inference
│ └── predicted_output.csv # Output of predict.py
├── models/
│ ├── rf_model.pkl # Trained Random Forest classifier
│ ├── logistic_model.pkl # Trained Logistic Regression baseline
│ └── model_columns.pkl # Exact feature order expected by the model
├── scripts/
│ └── predict.py # Loads model, scores new data, writes output
├── delay.sql # SQL EDA + feature engineering queries
├── final.csv # Model-ready dataset (output of delay.sql)
└── olist_*.csv # Raw Olist source tables
Note: the repo ships the trained model artifacts and an inference script (
predict.py). The training script itself isn't included here — only the resulting.pklfiles are.
delay.sql runs the analysis that feeds the model. It covers:
- Overall on-time vs. delivered percentage
- Order volume by customer state
- Average delivery time by state
- Average delay vs. the estimated delivery date
- Seller-level delay ranking (sellers with 50+ orders only)
- Freight cost vs. delivery time correlation
- Low / Medium / High shipping-cost category breakdown
- Final join that produces
final.csvand theis_delayedlabel
Derived features:
|
|
delivery_time
|
Days from purchase to actual delivery
|
|
delay
|
Days between actual delivery and the
*
estimated
*
delivery date
|
|
is_delayed
|
1
if delivered after the estimated date, else
0
|
The classifier predicts is_delayed using only features known at purchase time — so it can't see delivery_time itself (that would be cheating).
Actual model inputs (confirmed from model_columns.pkl):
freight_value— shipping costcustomer_state— one-hot encoded across all 27 Brazilian states
That's it — a deliberately lean feature set. price, seller_id, and order timestamps are carried through the pipeline for reference but are not fed into the model.
python scripts/predict.pyThis loads rf_model.pkl + model_columns.pkl, reads data/new_data.csv, aligns columns, predicts, and writes data/predicted_output.csv.
| | Logistic Regression (baseline) | ~69% | | ** Random Forest ** | ** ~87% ** |
The Random Forest model outperforms the linear baseline by a wide margin, indicating the relationship between state, freight cost, and delay risk is non-linear.
The dashboard/ folder contains two Power BI files (open with Power BI Desktop):
delay.pbix— delay rates and trends across states and sellerspredicted_output.pbix— visualizes model predictions against actual outcomes
- Python 3.8+
pandas,scikit-learn(to unpickle and run the model)- A SQL-compatible database (MySQL/PostgreSQL) if you want to re-run
delay.sql - Power BI Desktop, to open the
.pbixdashboards
git clone https://github.com/arka562/supply-chain-delay-prediction.git
cd supply-chain-delay-prediction
pip install pandas scikit-learncd scripts
python predict.pyOutput lands in data/predicted_output.csv.
Load the Olist CSVs into your database, then execute the queries in delay.sql against your orders, order_items, and customers tables.
- States in Brazil's North and Northeast show consistently higher average delivery delays than the South and Southeast.
- Seller order volume doesn't predict delay risk well — geography matters more.
- Freight cost and delivery time don't move in a straight line; cheap shipping isn't reliably slower.
- A lean two-feature model (freight cost + state) is enough to beat a linear baseline by ~18 points of accuracy — most of the predictive signal is geographic.
- Commit the training script (currently only inference is in the repo)
- Add a
requirements.txt - Incorporate product category and order timestamp features
- Add cross-validation and a confusion matrix / precision-recall report
- Automate the SQL → CSV → model → dashboard refresh
This project's code has no license file yet — add one (e.g. MIT) if you want others to reuse it freely. The underlying Olist dataset is distributed under CC BY-NC-SA 4.0 via Kaggle, which restricts commercial use.
Arkaprava GitHub: @arka562