Skip to content

arka562/supply-chain-delay-prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

📦 Supply Chain Delay Prediction

Predicting E-Commerce Delivery Delays with SQL, ML & Power BI

Turning Brazil's largest public e-commerce dataset into an early-warning system for late deliveries.

Python Pandas scikit-learn SQL Power BI


📑 Table of Contents


🔍 Overview

Late deliveries are one of the most expensive problems in e-commerce — they drive support tickets, refunds, and lost trust. This project builds an end-to-end pipeline on top of the Olist Brazilian e-commerce dataset (~100K real orders, 2016–2018) to flag orders likely to arrive late, using only information available at the time of purchase.

The pipeline goes: raw CSVs → SQL feature engineering → trained classifier → batch prediction script → Power BI dashboard.


✨ Key Features

  • 🗃️ SQL-driven EDA — delivery rates, regional delay patterns, seller-level performance, and shipping-cost correlation, all in delay.sql
  • 🎯 Binary delay classifier — predicts is_delayed using a trained Random Forest model
  • 📊 Baseline comparison — Random Forest benchmarked against a Logistic Regression baseline
  • Lightweight inference scriptscripts/predict.py loads the saved model and scores new orders in a single run
  • 📈 Power BI dashboards — pre-built .pbix reports for delay trends and prediction results
  • 🌎 State-level granularity — every Brazilian state is one-hot encoded as a model feature

📊 Dataset

Built on the Brazilian E-Commerce Public Dataset by Olist.

| File | Description | |

|

| | olist_orders_dataset.csv | Order status, timestamps, delivery dates | | olist_order_items_dataset.csv | Items, prices, freight values, seller info | | olist_customers_dataset.csv | Customer location and ID mappings | | olist_sellers_dataset.csv | Seller location data | | olist_products_dataset.csv | Product categories and dimensions | | olist_geolocation_dataset.csv | ZIP code geolocation data | | olist_order_payments_dataset.csv | Payment type and installment info | | olist_order_reviews_dataset.csv | Customer review scores | | product_category_name_translation.csv | Portuguese → English category names |


🏗️ Pipeline

Raw Olist CSVs
      │
      ▼
  delay.sql            ──▶  joins tables, engineers is_delayed label
      │
      ▼
  final.csv             ──▶  clean, model-ready dataset
      │
      ▼
  Model training         ──▶  Random Forest + Logistic Regression (trained offline)
      │
      ▼
  models/*.pkl            ──▶  rf_model.pkl, logistic_model.pkl, model_columns.pkl
      │
      ▼
  scripts/predict.py       ──▶  scores new_data.csv → predicted_output.csv
      │
      ▼
  dashboard/*.pbix           ──▶  Power BI visualization

📂 Project Structure

supply-chain-delay-prediction/
├── dashboard/
│   ├── delay.pbix                 # Power BI: delay & delivery analysis
│   └── predicted_output.pbix      # Power BI: prediction results view
├── data/
│   ├── new_data.csv                # Unseen orders for inference
│   └── predicted_output.csv        # Output of predict.py
├── models/
│   ├── rf_model.pkl                 # Trained Random Forest classifier
│   ├── logistic_model.pkl           # Trained Logistic Regression baseline
│   └── model_columns.pkl            # Exact feature order expected by the model
├── scripts/
│   └── predict.py                    # Loads model, scores new data, writes output
├── delay.sql                          # SQL EDA + feature engineering queries
├── final.csv                          # Model-ready dataset (output of delay.sql)
└── olist_*.csv                        # Raw Olist source tables

Note: the repo ships the trained model artifacts and an inference script (predict.py). The training script itself isn't included here — only the resulting .pkl files are.


🔍 SQL Exploratory Analysis

delay.sql runs the analysis that feeds the model. It covers:

  • Overall on-time vs. delivered percentage
  • Order volume by customer state
  • Average delivery time by state
  • Average delay vs. the estimated delivery date
  • Seller-level delay ranking (sellers with 50+ orders only)
  • Freight cost vs. delivery time correlation
  • Low / Medium / High shipping-cost category breakdown
  • Final join that produces final.csv and the is_delayed label

Derived features:

| Feature | Meaning | |

|

| | delivery_time | Days from purchase to actual delivery | | delay | Days between actual delivery and the * estimated * delivery date | | is_delayed | 1 if delivered after the estimated date, else 0 |


🤖 Machine Learning Model

The classifier predicts is_delayed using only features known at purchase time — so it can't see delivery_time itself (that would be cheating).

Actual model inputs (confirmed from model_columns.pkl):

  • freight_value — shipping cost
  • customer_state — one-hot encoded across all 27 Brazilian states

That's it — a deliberately lean feature set. price, seller_id, and order timestamps are carried through the pipeline for reference but are not fed into the model.

Inference

python scripts/predict.py

This loads rf_model.pkl + model_columns.pkl, reads data/new_data.csv, aligns columns, predicts, and writes data/predicted_output.csv.


📈 Model Performance

| Model | Accuracy | |

|

| | Logistic Regression (baseline) | ~69% | | ** Random Forest ** | ** ~87% ** |

The Random Forest model outperforms the linear baseline by a wide margin, indicating the relationship between state, freight cost, and delay risk is non-linear.


📊 Power BI Dashboard

The dashboard/ folder contains two Power BI files (open with Power BI Desktop):

  • delay.pbix — delay rates and trends across states and sellers
  • predicted_output.pbix — visualizes model predictions against actual outcomes

🚀 Getting Started

Prerequisites

  • Python 3.8+
  • pandas, scikit-learn (to unpickle and run the model)
  • A SQL-compatible database (MySQL/PostgreSQL) if you want to re-run delay.sql
  • Power BI Desktop, to open the .pbix dashboards

Setup

git clone https://github.com/arka562/supply-chain-delay-prediction.git
cd supply-chain-delay-prediction
pip install pandas scikit-learn

Run a prediction

cd scripts
python predict.py

Output lands in data/predicted_output.csv.

Re-run the SQL EDA

Load the Olist CSVs into your database, then execute the queries in delay.sql against your orders, order_items, and customers tables.


📌 Key Insights

  • States in Brazil's North and Northeast show consistently higher average delivery delays than the South and Southeast.
  • Seller order volume doesn't predict delay risk well — geography matters more.
  • Freight cost and delivery time don't move in a straight line; cheap shipping isn't reliably slower.
  • A lean two-feature model (freight cost + state) is enough to beat a linear baseline by ~18 points of accuracy — most of the predictive signal is geographic.

🗺️ Roadmap

  • Commit the training script (currently only inference is in the repo)
  • Add a requirements.txt
  • Incorporate product category and order timestamp features
  • Add cross-validation and a confusion matrix / precision-recall report
  • Automate the SQL → CSV → model → dashboard refresh

📄 License

This project's code has no license file yet — add one (e.g. MIT) if you want others to reuse it freely. The underlying Olist dataset is distributed under CC BY-NC-SA 4.0 via Kaggle, which restricts commercial use.


👤 Author

Arkaprava GitHub: @arka562

About

Supply chain delay predictor using the Brazilian Olist e-commerce dataset — SQL-based EDA, binary classification ML model, and interactive dashboard to identify and predict late deliveries by state, seller, and shipping cost.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages