📦 Datasets for Start

A curated collection of simple, ready-to-use datasets for machine learning, data analysis, and tutorials.

These datasets are designed to:

✅ be easy to load
✅ require minimal preprocessing
✅ work great for beginners and demos

🚀 Why this repo?

This repository helps you:

- learn machine learning faster
- practice exploratory data analysis (EDA)
- build quick prototypes
- create tutorials and demos

👉 No heavy data cleaning — just start working with data.

🧠 Use with MLJAR Studio

These datasets work well with MLJAR Studio.

MLJAR Studio is a desktop application designed for data science, combining AI and Python in one place. It lets users easily load data, build machine learning models, and generate reports without complex setup. It is especially beginner-friendly, helping users move from data to insights quickly while still giving advanced users full control.

👉 https://mljar.com/

📊 Dataset Overview

🔵 Binary Classification

Dataset	Type	Rows	Target
adult	tabular	48k	income
bank-marketing	tabular	45k	subscribed
breast_cancer_wisconsin	tabular	569	diagnosis
credit	tabular	1k	credit risk
diabetes	tabular	768	outcome
employee_attrition	tabular	1.5k	attrition
ionosphere	tabular	351	class
sonar	tabular	208	object
spam	tabular	4.6k	spam
spect	tabular	267	diagnosis
2d_circles	synthetic	small	class
2d_simple	synthetic	small	class
3d_spheres	synthetic	small	class

🟣 Multiclass Classification

Dataset	Type	Rows	Target
digits	tabular	1.8k	digit
iris	tabular	150	species
wine	tabular	178	class
glass	tabular	214	type
mnist	image	70k	digit

🟢 Regression

Dataset	Type	Rows	Target
housing	tabular	506	price
house_prices	tabular	1.4k	sale price
housing_california	tabular	20k	price
regression_1	synthetic	small	value
regression_2	synthetic	small	value
us_house_prices_1950_2024	time series	~900	price

🟡 Time Series

Dataset	Frequency	Rows	Target
air-passengers	monthly	144	passengers
aep-hourly-energy-consumption	hourly	~121k	MW
nyc-taxi-demand	30 min	10k	demand
bitcoin-historical-data	4H	~17k	price
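
Time series files usually need their timestamp column parsed before you can resample or plot them. The sketch below is one way to do that with pandas. It assumes hypothetical column names `timestamp` and `value`, and it uses a tiny inline CSV in place of a real file. `load_series` is an illustrative helper, not part of this repository, and the actual column names in each dataset may differ.

```python
import io

import pandas as pd

# Toy CSV used only for illustration; the column names are hypothetical
# and the values are not taken from any dataset in this repository.
csv_text = """timestamp,value
2024-01-01 00:00,10
2024-01-01 12:00,20
2024-01-02 00:00,30
"""


def load_series(source, time_col: str = "timestamp") -> pd.DataFrame:
    """Read a CSV, parse `time_col` as datetimes, and use it as a sorted index."""
    df = pd.read_csv(source, parse_dates=[time_col], index_col=time_col)
    return df.sort_index()


series = load_series(io.StringIO(csv_text))

# Aggregate to one row per calendar day by averaging the values.
daily = series.resample("D").mean()
print(daily)
```

To use it on a real file, pass a path or URL instead of the `StringIO` object and set `time_col` to the name of that file's timestamp column.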

🟠 Business / Tabular

Dataset	Rows	Use Case
online-retail	~500k	e-commerce analysis
superstore-sales	51k	business analytics
telco-customer-churn	7k	churn prediction
sp500-company-financials	500	financial analysis

🟤 NLP

Dataset	Rows	Task
amazon-fine-food-reviews	10k	sentiment / NLP
imdb	50k	sentiment analysis

🌍 Other

Dataset	Rows	Task
higgs	11M	classification
occupancy	20k	classification
world_happiness_report	~150/year	regression

⚡ Quick Start

```python
import pandas as pd

# Load the adult dataset straight from the raw GitHub URL
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv"
df = pd.read_csv(url)

# Show the first five rows
print(df.head())
```
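
To load other datasets, it may be convenient to build the URL from the folder name. The helper below is a minimal sketch; `dataset_url` is a hypothetical function, not part of this repository. It assumes each folder holds a file named `data.csv`, which the Quick Start confirms only for `adult`, so check the folder and pass a different `filename` if needed.

```python
from urllib.parse import quote

# Base URL taken from the Quick Start example.
BASE_URL = "https://raw.githubusercontent.com/pplonski/datasets-for-start/master"


def dataset_url(name: str, filename: str = "data.csv") -> str:
    """Build a raw-file URL for a dataset folder.

    Assumption: the folder contains a file called `filename`. Only
    adult/data.csv is confirmed by the Quick Start; verify other folders.
    """
    if not name or "/" in name:
        raise ValueError(f"expected a top-level folder name, got {name!r}")
    return f"{BASE_URL}/{quote(name)}/{quote(filename)}"


if __name__ == "__main__":
    print(dataset_url("adult"))
```

The result can be passed straight to `pd.read_csv`, for example `pd.read_csv(dataset_url("adult"))`.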

📊 What can you build?

With these datasets you can practice:

- classification models
- regression models
- time series forecasting
- NLP (text analysis)
- business analytics
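
When practicing classification, it can help to first check what accuracy a trivial model would reach, so later results have a reference point. The sketch below computes that majority-class baseline with pandas. `majority_baseline` and the toy frame are hypothetical and for illustration only; they are not taken from any dataset in this repository.

```python
import pandas as pd


def majority_baseline(df: pd.DataFrame, target: str) -> float:
    """Return the accuracy of always predicting the most frequent class."""
    if target not in df.columns:
        raise KeyError(f"column {target!r} not found")
    if df.empty:
        raise ValueError("cannot compute a baseline on an empty frame")
    # value_counts sorts by frequency, so the first entry is the majority class.
    counts = df[target].value_counts(dropna=False)
    return float(counts.iloc[0] / len(df))


# Toy frame for illustration only.
toy = pd.DataFrame(
    {"feature": [1, 2, 3, 4, 5], "label": ["a", "a", "a", "b", "b"]}
)
print(majority_baseline(toy, "label"))
```

A trained model is only useful if it clearly beats this number on held-out data.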

🤝 Contributing

If you have a dataset that is simple, clean, and useful for learning, feel free to open a pull request 🚀

📄 License

MIT License

Datasets come from public sources (UCI, Kaggle, etc.). Please check individual datasets for details.
