📦 Supply Chain Delay Prediction

Predicting E-Commerce Delivery Delays with SQL, ML & Power BI

Turning Brazil's largest public e-commerce dataset into an early-warning system for late deliveries.

📑 Table of Contents

Overview
Key Features
Dataset
Pipeline
Project Structure
SQL Exploratory Analysis
Machine Learning Model
Model Performance
Power BI Dashboard
Getting Started
Key Insights
Roadmap
License
Author

🔍 Overview

Late deliveries are one of the most expensive problems in e-commerce — they drive support tickets, refunds, and lost trust. This project builds an end-to-end pipeline on top of the Olist Brazilian e-commerce dataset (~100K real orders, 2016–2018) to flag orders likely to arrive late, using only information available at the time of purchase.

The pipeline goes: raw CSVs → SQL feature engineering → trained classifier → batch prediction script → Power BI dashboard.

✨ Key Features

🗃️ SQL-driven EDA — delivery rates, regional delay patterns, seller-level performance, and shipping-cost correlation, all in delay.sql
🎯 Binary delay classifier — predicts is_delayed using a trained Random Forest model
📊 Baseline comparison — Random Forest benchmarked against a Logistic Regression baseline
⚡ Lightweight inference script — scripts/predict.py loads the saved model and scores new orders in a single run
📈 Power BI dashboards — pre-built .pbix reports for delay trends and prediction results
🌎 State-level granularity — every Brazilian state is one-hot encoded as a model feature

📊 Dataset

Built on the Brazilian E-Commerce Public Dataset by Olist.

| File | Description |
| --- | --- |

🏗️ Pipeline

Raw Olist CSVs
      │
      ▼
  delay.sql            ──▶  joins tables, engineers is_delayed label
      │
      ▼
  final.csv             ──▶  clean, model-ready dataset
      │
      ▼
  Model training         ──▶  Random Forest + Logistic Regression (trained offline)
      │
      ▼
  models/*.pkl            ──▶  rf_model.pkl, logistic_model.pkl, model_columns.pkl
      │
      ▼
  scripts/predict.py       ──▶  scores new_data.csv → predicted_output.csv
      │
      ▼
  dashboard/*.pbix           ──▶  Power BI visualization

📂 Project Structure

supply-chain-delay-prediction/
├── dashboard/
│   ├── delay.pbix                 # Power BI: delay & delivery analysis
│   └── predicted_output.pbix      # Power BI: prediction results view
├── data/
│   ├── new_data.csv                # Unseen orders for inference
│   └── predicted_output.csv        # Output of predict.py
├── models/
│   ├── rf_model.pkl                 # Trained Random Forest classifier
│   ├── logistic_model.pkl           # Trained Logistic Regression baseline
│   └── model_columns.pkl            # Exact feature order expected by the model
├── scripts/
│   └── predict.py                    # Loads model, scores new data, writes output
├── delay.sql                          # SQL EDA + feature engineering queries
├── final.csv                          # Model-ready dataset (output of delay.sql)
└── olist_*.csv                        # Raw Olist source tables

Note: the repo ships the trained model artifacts and an inference script (predict.py). The training script itself isn't included here — only the resulting .pkl files are.

🔍 SQL Exploratory Analysis

delay.sql runs the analysis that feeds the model. It covers:

Overall on-time vs. delivered percentage
Order volume by customer state
Average delivery time by state
Average delay vs. the estimated delivery date
Seller-level delay ranking (sellers with 50+ orders only)
Freight cost vs. delivery time correlation
Low / Medium / High shipping-cost category breakdown
Final join that produces final.csv and the is_delayed label
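
The label-building step from the list above can be sketched as follows. This is a minimal illustration, not the contents of delay.sql: it assumes is_delayed is 1 when order_delivered_customer_date falls after order_estimated_delivery_date (both are columns of the Olist orders table), and it uses Python's built-in sqlite3 module as a stand-in for MySQL/PostgreSQL. The sample rows are made up, and julianday is SQLite-specific, so the date arithmetic would need adapting for another engine.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL/PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_id TEXT PRIMARY KEY,
        order_purchase_timestamp TEXT,
        order_delivered_customer_date TEXT,
        order_estimated_delivery_date TEXT
    )
    """
)

# Hypothetical sample rows, not taken from the dataset.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("a1", "2018-01-01 10:00:00", "2018-01-09 10:00:00", "2018-01-15 00:00:00"),
        ("b2", "2018-01-02 09:00:00", "2018-01-22 09:00:00", "2018-01-18 00:00:00"),
    ],
)

# Assumed label rule: an order is delayed when it reached the customer
# after the estimated delivery date. ISO-formatted timestamps compare
# correctly as plain strings.
rows = conn.execute(
    """
    SELECT
        order_id,
        julianday(order_delivered_customer_date)
            - julianday(order_purchase_timestamp) AS delivery_time_days,
        CASE
            WHEN order_delivered_customer_date > order_estimated_delivery_date
            THEN 1 ELSE 0
        END AS is_delayed
    FROM orders
    WHERE order_delivered_customer_date IS NOT NULL
    ORDER BY order_id
    """
).fetchall()

for order_id, delivery_time_days, is_delayed in rows:
    print(order_id, round(delivery_time_days, 1), is_delayed)
```

Undelivered orders are filtered out here because they have no delivery date to compare against. Whether delay.sql does the same is not stated in this README.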

Derived features:

| Feature | Meaning |
| --- | --- |

🤖 Machine Learning Model

The classifier predicts is_delayed using only features known at purchase time, so it never sees delivery_time itself: that value is only known after delivery and would leak the outcome into the features.

Actual model inputs (confirmed from model_columns.pkl):

freight_value — shipping cost
customer_state: one-hot encoded across all 27 Brazilian federative units (26 states plus the Federal District)

That's it — a deliberately lean feature set. price, seller_id, and order timestamps are carried through the pipeline for reference but are not fed into the model.
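
To make the two-input feature set concrete, here is a dependency-free sketch of one-hot encoding a single order. The function name encode_order and the column naming pattern customer_state_<UF> are assumptions for illustration (the pattern mirrors what pandas.get_dummies would produce); the actual names stored in model_columns.pkl may differ.

```python
from typing import Dict, List


def encode_order(
    freight_value: float, customer_state: str, states: List[str]
) -> Dict[str, float]:
    """Return one feature row: freight_value plus one indicator per state."""
    row = {"freight_value": float(freight_value)}
    for state in states:
        # At most one indicator is 1.0; an unknown state leaves all at 0.0.
        row[f"customer_state_{state}"] = 1.0 if state == customer_state else 0.0
    return row


# Three states shown for brevity; the real model uses all 27.
STATES = ["MG", "RJ", "SP"]
features = encode_order(15.10, "RJ", STATES)
print(features)
```

With all 27 state columns plus freight_value, each order becomes a 28-value row.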

Inference

python scripts/predict.py

This loads rf_model.pkl + model_columns.pkl, reads data/new_data.csv, aligns columns, predicts, and writes data/predicted_output.csv.
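
The "aligns columns" step is the part most likely to go wrong, so here is a rough sketch of what it involves. It is not the code in predict.py: align_features is a hypothetical helper, and the sketch assumes model_columns.pkl unpickles to a plain list of column names. An in-memory pickle stands in for the real file so the example runs on its own.

```python
import io
import pickle
from typing import Dict, List


def align_features(row: Dict[str, float], model_columns: List[str]) -> List[float]:
    """Order a feature row to match model_columns.

    Columns the model expects but the row lacks are filled with 0.0;
    columns the model never saw are dropped.
    """
    return [float(row.get(column, 0.0)) for column in model_columns]


# Stand-in for models/model_columns.pkl: pickle a hypothetical column list
# into memory and load it back the way a script would load the real file.
buffer = io.BytesIO()
pickle.dump(["freight_value", "customer_state_RJ", "customer_state_SP"], buffer)
buffer.seek(0)
model_columns = pickle.load(buffer)

# A new order whose columns arrive in a different order, with one column
# (customer_state_XX) that was never seen during training.
new_order = {"customer_state_SP": 1.0, "freight_value": 22.5, "customer_state_XX": 1.0}
vector = align_features(new_order, model_columns)
print(vector)
```

On a pandas DataFrame, reindex(columns=model_columns, fill_value=0) performs the same alignment in one call.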

📈 Model Performance

| Model | Accuracy |
| --- | --- |
| Logistic Regression (baseline) | ~69% |
| **Random Forest** | **~87%** |

The Random Forest outperforms the linear baseline by roughly 18 percentage points of accuracy, which suggests that the relationship between state, freight cost, and delay risk is non-linear.
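
For readers who want to reproduce an accuracy figure, the sketch below shows how accuracy, precision, and recall fall out of the four confusion-matrix counts. The labels are invented for illustration and are not results from this project; the helper names are likewise hypothetical.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives/negatives, treating 1 (delayed) as positive."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


def accuracy(c: Dict[str, int]) -> float:
    """Share of all orders classified correctly."""
    total = sum(c.values())
    return (c["tp"] + c["tn"]) / total if total else 0.0


def precision(c: Dict[str, int]) -> float:
    """Share of orders flagged as delayed that really were delayed."""
    flagged = c["tp"] + c["fp"]
    return c["tp"] / flagged if flagged else 0.0


def recall(c: Dict[str, int]) -> float:
    """Share of truly delayed orders that the model flagged."""
    delayed = c["tp"] + c["fn"]
    return c["tp"] / delayed if delayed else 0.0


# Hypothetical labels: 1 = delayed, 0 = on time.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(accuracy(counts), precision(counts), recall(counts))
```

Reporting recall on the delayed class next to accuracy shows how many late orders the model actually catches, which a single accuracy number does not reveal.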

📊 Power BI Dashboard

The dashboard/ folder contains two Power BI files (open with Power BI Desktop):

delay.pbix — delay rates and trends across states and sellers
predicted_output.pbix — visualizes model predictions against actual outcomes

🚀 Getting Started

Prerequisites

Python 3.8+
pandas, scikit-learn (to unpickle and run the model)
A SQL-compatible database (MySQL/PostgreSQL) if you want to re-run delay.sql
Power BI Desktop, to open the .pbix dashboards

Setup

git clone https://github.com/arka562/supply-chain-delay-prediction.git
cd supply-chain-delay-prediction
pip install pandas scikit-learn

Run a prediction

cd scripts
python predict.py

Output lands in data/predicted_output.csv.

Re-run the SQL EDA

Load the Olist CSVs into your database, then execute the queries in delay.sql against your orders, order_items, and customers tables.
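
If no database server is at hand, the loading step can be tried with Python's built-in sqlite3 module. The sketch below is an assumption-laden stand-in rather than the project's own loader: load_csv is a hypothetical helper, every column is stored as text, and the queries in delay.sql may need dialect changes to run on SQLite. The demo writes a tiny made-up CSV so it runs without the real files.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_csv(conn: sqlite3.Connection, table: str, csv_path: Path) -> int:
    """Create `table` from the CSV header, insert every row as text,
    and return the number of rows now in the table."""
    # utf-8-sig tolerates an optional byte-order mark at the start of the file.
    with csv_path.open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Demo with a made-up two-row file named like the real customers CSV.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "olist_customers_dataset.csv"
    sample.write_text("customer_id,customer_state\nc1,SP\nc2,RJ\n", encoding="utf-8")
    conn = sqlite3.connect(":memory:")
    loaded = load_csv(conn, "customers", sample)
    print(loaded)
```

For the real data, the same call would be repeated for olist_orders_dataset.csv, olist_order_items_dataset.csv, and olist_customers_dataset.csv, loading them into the orders, order_items, and customers tables respectively.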

📌 Key Insights

States in Brazil's North and Northeast show consistently higher average delivery delays than the South and Southeast.
Seller order volume doesn't predict delay risk well — geography matters more.
Freight cost and delivery time don't move in a straight line; cheap shipping isn't reliably slower.
A lean two-feature model (freight cost + state) is enough to beat a linear baseline by ~18 points of accuracy — most of the predictive signal is geographic.

🗺️ Roadmap

Commit the training script (currently only inference is in the repo)
Add a requirements.txt
Incorporate product category and order timestamp features
Add cross-validation and a confusion matrix / precision-recall report
Automate the SQL → CSV → model → dashboard refresh

📄 License

This project's code has no license file yet — add one (e.g. MIT) if you want others to reuse it freely. The underlying Olist dataset is distributed under CC BY-NC-SA 4.0 via Kaggle, which restricts commercial use.

👤 Author

Arkaprava (GitHub: @arka562)
