A curated collection of simple, ready-to-use datasets for machine learning, data analysis, and tutorials.

These datasets are designed to:
- ✅ be easy to load
- ✅ require minimal preprocessing
- ✅ work great for beginners and demos

This repository helps you:
- learn machine learning faster
- practice exploratory data analysis (EDA)
- build quick prototypes
- create tutorials and demos

👉 No heavy data cleaning: just start working with data.

These datasets work well with MLJAR Studio, a desktop application for data science that combines AI and Python in one place. It lets users load data, build machine learning models, and generate reports without complex setup. It is beginner-friendly, helping users move from data to insights quickly, while still giving advanced users full control.

Classification datasets:

| Dataset | Type | Rows | Target |
|---|---|---|---|
| adult | tabular | 48k | income |
| bank-marketing | tabular | 45k | subscribed |
| breast_cancer_wisconsin | tabular | 569 | diagnosis |
| credit | tabular | 1k | credit risk |
| diabetes | tabular | 768 | outcome |
| employee_attrition | tabular | 1.5k | attrition |
| ionosphere | tabular | 351 | class |
| sonar | tabular | 208 | object |
| spam | tabular | 4.6k | spam |
| spect | tabular | 267 | diagnosis |
| 2d_circles | synthetic | small | class |
| 2d_simple | synthetic | small | class |
| 3d_spheres | synthetic | small | class |

Multiclass classification datasets:

| Dataset | Type | Rows | Target |
|---|---|---|---|
| digits | tabular | 1.8k | digit |
| iris | tabular | 150 | species |
| wine | tabular | 178 | class |
| glass | tabular | 214 | type |
| mnist | image | 70k | digit |

Regression datasets:

| Dataset | Type | Rows | Target |
|---|---|---|---|
| housing | tabular | 506 | price |
| house_prices | tabular | 1.4k | sale price |
| housing_california | tabular | 20k | price |
| regression_1 | synthetic | small | value |
| regression_2 | synthetic | small | value |
| us_house_prices_1950_2024 | time series | ~900 | price |

Time series datasets:

| Dataset | Frequency | Rows | Target |
|---|---|---|---|
| air-passengers | monthly | 144 | passengers |
| aep-hourly-energy-consumption | hourly | ~121k | MW |
| nyc-taxi-demand | 30 min | 10k | demand |
| bitcoin-historical-data | 4H | ~17k | price |
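
Time series files usually need their date column parsed before any forecasting work. The following is a minimal sketch of that step, not the layout of any file in this repository: it builds a synthetic monthly series with 144 rows, and the column names `month` and `passengers` are hypothetical, so check the actual CSV header before reusing it.

```python
import pandas as pd

# Synthetic stand-in for a monthly series with 144 rows (12 years).
# Assumption: the column names "month" and "passengers" are hypothetical.
raw = pd.DataFrame({
    "month": pd.date_range("2000-01-01", periods=144, freq="MS").strftime("%Y-%m"),
    "passengers": list(range(100, 244)),
})

# Parse the text dates and use them as the index.
raw["month"] = pd.to_datetime(raw["month"], format="%Y-%m")
series = raw.set_index("month")["passengers"]

# Aggregate the monthly values into yearly totals.
yearly = series.groupby(series.index.year).sum()

# Hold out the last 12 months for evaluation, keeping time order intact.
train, test = series.iloc[:-12], series.iloc[-12:]

print(len(yearly), len(train), len(test))
```

Splitting by position rather than at random keeps future observations out of the training set, which matters for any forecasting exercise.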

Business datasets:

| Dataset | Rows | Use Case |
|---|---|---|
| online-retail | ~500k | e-commerce analysis |
| superstore-sales | 51k | business analytics |
| telco-customer-churn | 7k | churn prediction |
| sp500-company-financials | 500 | financial analysis |

Text (NLP) datasets:

| Dataset | Rows | Task |
|---|---|---|
| amazon-fine-food-reviews | 10k | sentiment / NLP |
| imdb | 50k | sentiment analysis |

Other datasets:

| Dataset | Rows | Task |
|---|---|---|
| higgs | 11M | classification |
| occupancy | 20k | classification |
| world_happiness_report | ~150/year | regression |
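
The `higgs` dataset is listed with 11M rows, so reading it in a single call may need a lot of memory. One option is to read the file in chunks with the `chunksize` parameter of `pandas.read_csv`. The sketch below shows the pattern on a small in-memory CSV; the column names `label` and `feature` are hypothetical stand-ins, not the real schema.

```python
import io
from collections import Counter

import pandas as pd

# Small in-memory CSV standing in for a large file.
# Assumption: the columns "label" and "feature" are hypothetical.
csv_text = "label,feature\n" + "\n".join(f"{i % 2},{i * 0.5}" for i in range(1000))

total_rows = 0
label_counts = Counter()

# With chunksize set, read_csv yields DataFrames of at most 250 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    total_rows += len(chunk)
    # Accumulate per-chunk class counts instead of keeping all rows in memory.
    label_counts.update(chunk["label"].value_counts().to_dict())

print(total_rows)
print(dict(label_counts))
```

For a real file, replace the `io.StringIO` object with the file path or URL and pick a chunk size that fits comfortably in memory.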
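
Load a dataset with pandas directly from its raw GitHub URL:

```python
import pandas as pd

# Raw CSV file for the adult dataset in this repository
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv"
df = pd.read_csv(url)

# Preview the first five rows
print(df.head())
```

With these datasets you can practice: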
- classification models
- regression models
- time series forecasting
- NLP (text analysis)
- business analytics
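
To switch between datasets while practicing, the URL from the loading snippet can be wrapped in a small helper. This is a sketch under one assumption: that every dataset folder contains a file named `data.csv`, as the `adult` folder does. The function names `dataset_url` and `load_dataset` are hypothetical and not part of this repository.

```python
import pandas as pd

# Base of the raw-file URLs, taken from the adult loading example.
BASE_URL = "https://raw.githubusercontent.com/pplonski/datasets-for-start/master"


def dataset_url(name: str, filename: str = "data.csv") -> str:
    """Build the raw GitHub URL for a file in a dataset folder.

    Assumption: the folder holds a file called data.csv, as adult does.
    Pass a different filename if a folder is organized differently.
    """
    return f"{BASE_URL}/{name}/{filename}"


def load_dataset(name: str, **read_csv_kwargs) -> pd.DataFrame:
    """Download a dataset into a DataFrame (needs network access)."""
    return pd.read_csv(dataset_url(name), **read_csv_kwargs)
```

Calling `load_dataset("adult")` is then equivalent to the loading snippet; if a folder uses another file name, the download fails with an HTTP error and the file name should be checked in the repository.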

If you have a dataset that is:
- simple
- clean
- useful for learning

feel free to open a pull request 🚀

MIT License

Datasets come from public sources (UCI, Kaggle, etc.). Please check individual datasets for details.