A curated, opinionated reference to the most widely used public retail datasets, written from a product manager's perspective. Each dataset page covers what the dataset is, its license, its schema, a real sample row, and how a retail PM would actually use it.
- datasets/ — one deep-dive page per dataset
- scorecards/ — PM scorecard templates (RFM, delivery SLA, promo lift, forecast accuracy)
- regulatory/ — GDPR / CCPA / LGPD / PCI checklist for retail data
- samples/ — small sample CSVs for the freely licensed datasets
- scripts/ — fetch scripts for restricted datasets
- notebooks/ — Jupyter quickstarts (load → profile → visualize)
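
To give a feel for the load → profile part of the notebook flow, here is a minimal sketch using only the Python standard library. The column names and rows are invented for illustration and do not come from any dataset in this repo; `load` and `profile` are hypothetical helper names, not functions shipped in notebooks/. The visualize step is left out because it needs a plotting library.

```python
import csv
import io
from collections import Counter

# Made-up rows in the shape of a small retail CSV (not real dataset content).
SAMPLE_CSV = """invoice_id,sku,quantity,unit_price,country
A001,SKU-1,2,3.50,United Kingdom
A001,SKU-2,1,,United Kingdom
A002,SKU-1,-1,3.50,France
A003,SKU-3,4,1.25,
"""


def load(text):
    """Load step: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def profile(rows):
    """Profile step: missing count, distinct count, and most common value per column."""
    report = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        present = [v for v in values if v != ""]  # empty string means missing
        top = Counter(present).most_common(1)
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "top": top[0][0] if top else None,
        }
    return report


rows = load(SAMPLE_CSV)
report = profile(rows)
for col, stats in report.items():
    print(f"{col:<12} missing={stats['missing']} distinct={stats['distinct']} top={stats['top']}")
```

Even on four rows the profile surfaces what a PM should ask about first: a missing price, a missing country, and a negative quantity that is probably a return.
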
| # | Dataset | What it's about | License |
|---|---|---|---|
| 1 | UCI Online Retail / II | UK non-store online retailer, 2009–2011, ~1M invoice lines, gifts + wholesale | CC BY 4.0 |
| 2 | Brazilian E-Commerce by Olist | ~100K real BR orders 2016–2018, 9 relational tables incl. reviews & geo | CC BY-NC-SA 4.0 |
| 3 | Instacart Online Grocery | 3M+ grocery orders, 200K users, reorder patterns | Instacart Open Dataset (non-commercial, no redistribution) |
| 4 | Amazon Customer Reviews | 75GB+ product reviews & metadata across categories | Amazon CR License (research only) |
| 5 | Walmart M5 Forecasting | Daily sales for 3,049 SKUs × 10 stores, CA/TX/WI | Kaggle Competition Rules |
| 6 | H&M Personalized Fashion | 2yr fashion transactions + customer metadata + product images | H&M Competition Rules |
| 7 | Dunnhumby — Complete Journey | 2yr household grocery transactions + promo + demographics | Dunnhumby Source Files (academic/non-commercial) |
| 8 | Google Analytics Sample (GA4) | 3mo obfuscated event-level data, Google Merchandise Store | Google APIs ToS |
| 9 | US Census Monthly Retail Trade | Monthly/annual retail sales by NAICS, quarterly e-commerce | Public domain |
| 10 | Kaggle E-Commerce Data (UK) | 541K invoice lines from a UK online retailer (mirror of UCI Online Retail) | CC0 |
| 11 | Retail Transactions (synthetic) | Synthetic transactions for basket/segmentation prototyping | CC0 |
- US Census MRTS — macro context, 30 min
- UCI Online Retail — simplest schema, get hands-on
- Olist — multi-table modeling + review/funnel signals
- Instacart + Dunnhumby — basket-level intuition
- M5 + H&M — forecasting & fashion-specific dynamics
- GA4 sample — read your own site's analytics fluently
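
As a hands-on starting point for the UCI step, the sketch below computes a basic RFM (recency, frequency, monetary) summary per customer. It assumes the column names of the original UCI Online Retail file (`InvoiceNo`, `CustomerID`, `InvoiceDate`, `Quantity`, `UnitPrice`) and the UCI convention that cancelled invoices start with `C`. The rows, the snapshot date, and the `rfm` helper are invented for illustration, and dates are simplified to day precision.

```python
from collections import defaultdict
from datetime import datetime

# Invented invoice lines using UCI Online Retail column names (not real data).
LINES = [
    {"InvoiceNo": "536001", "CustomerID": "10001", "InvoiceDate": "2011-11-01", "Quantity": 2, "UnitPrice": 5.0},
    {"InvoiceNo": "536001", "CustomerID": "10001", "InvoiceDate": "2011-11-01", "Quantity": 1, "UnitPrice": 2.5},
    {"InvoiceNo": "536050", "CustomerID": "10001", "InvoiceDate": "2011-12-01", "Quantity": 3, "UnitPrice": 4.0},
    {"InvoiceNo": "536020", "CustomerID": "10002", "InvoiceDate": "2011-10-15", "Quantity": 10, "UnitPrice": 1.0},
    {"InvoiceNo": "C536021", "CustomerID": "10002", "InvoiceDate": "2011-10-16", "Quantity": -10, "UnitPrice": 1.0},
]


def rfm(lines, snapshot):
    """Return recency (days), frequency (distinct invoices), and monetary (spend) per customer."""
    last_seen = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for line in lines:
        # Simplifying assumption: ignore cancellations instead of netting them out.
        if line["InvoiceNo"].startswith("C"):
            continue
        customer = line["CustomerID"]
        when = datetime.strptime(line["InvoiceDate"], "%Y-%m-%d")
        if customer not in last_seen or when > last_seen[customer]:
            last_seen[customer] = when
        invoices[customer].add(line["InvoiceNo"])
        spend[customer] += line["Quantity"] * line["UnitPrice"]
    return {
        customer: {
            "recency_days": (snapshot - last_seen[customer]).days,
            "frequency": len(invoices[customer]),
            "monetary": round(spend[customer], 2),
        }
        for customer in last_seen
    }


scores = rfm(LINES, snapshot=datetime(2011, 12, 10))
for customer, score in sorted(scores.items()):
    print(customer, score)
```

Whether cancellations should be dropped or subtracted from spend is a product decision; skipping them here only keeps the sketch short.
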
Each dataset page is split into:
- From the dataset — content verbatim or paraphrased from the source page, with link.
- General knowledge — my PM framing and use-case suggestions.
Both are clearly labeled inline.
The writeups, scorecards, checklists, and code in this repo are licensed under CC BY 4.0. The datasets themselves are not: each retains its own license, reproduced on its dataset page.
To add a sample to samples/, the dataset's license must allow redistribution. When in doubt, use scripts/fetch_samples.py instead.
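
One way to make the "when in doubt" rule mechanical is a conservative allowlist check, sketched below. The license labels are copied from the dataset table, but the allowlist itself and the `sample_destination` helper are assumptions for illustration, not existing code in scripts/. Anything not on the allowlist, including unknown datasets and licenses that permit sharing only under extra conditions, is routed to the fetch script.

```python
# License labels as they appear in the dataset table (subset shown).
DATASET_LICENSES = {
    "UCI Online Retail / II": "CC BY 4.0",
    "Brazilian E-Commerce by Olist": "CC BY-NC-SA 4.0",
    "Instacart Online Grocery": "Instacart Open Dataset (non-commercial, no redistribution)",
    "US Census Monthly Retail Trade": "Public domain",
    "Retail Transactions (synthetic)": "CC0",
}

# Assumption: only these labels count as clearly redistributable.
REDISTRIBUTABLE = {"CC BY 4.0", "CC0", "Public domain"}


def sample_destination(dataset):
    """Return where a sample for this dataset should live."""
    license_label = DATASET_LICENSES.get(dataset)  # None for unknown datasets
    if license_label in REDISTRIBUTABLE:
        return "samples/"
    # In doubt: do not commit data, fetch it at runtime instead.
    return "scripts/fetch_samples.py"


for name in DATASET_LICENSES:
    print(f"{name}: {sample_destination(name)}")
```

The check deliberately errs toward the fetch script; a maintainer who has read a license's full terms can widen the allowlist.
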