nbadb

The most comprehensive open NBA database available.

Extractor coverage	Public model	Derived outputs	Docs site
Current `nba_api` runtime surface	Generated star-schema outputs	Generated `agg_` and `analytics_` surfaces	Guides, references, diagrams, and lineage pages

📊 What's Inside

nbadb exposes an analytics-first warehouse surface rather than a thin mirror of raw upstream payloads.

Surface	What it covers
**`dim_`**	Stable identity and lookup context for players, teams, games, seasons, arenas, officials, and other conformed dimensions
**`fact_`**	Event and measurement tables across box scores, tracking, shot charts, play-by-play, standings, matchups, and specialty feeds
**`bridge_`**	Many-to-many connectors where public entities legitimately fan out
**`agg_`**	Reusable rollups for season, career, pace, efficiency, and other repeated reporting needs
**`analytics_`**	Convenience outputs for notebooks, dashboards, and quick exploratory analysis

For the current public contract, use the generated docs surfaces: Schema Reference, Data Dictionary, and Lineage.
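
As a rough illustration of how the `fact_` and `dim_` surfaces are meant to combine, the sketch below joins a fact table to a dimension and rolls the result up per player. It runs against an in-memory SQLite database from the Python standard library so it is self-contained. Only the `dim_player` table name comes from this README; `fact_player_game` and every column name are assumptions made for the example, so check the Schema Reference for the real contract.

```python
import sqlite3

# In-memory stand-in for the warehouse. `dim_player` is named in this README;
# `fact_player_game` and all column names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_player (player_key INTEGER PRIMARY KEY, player_name TEXT);
CREATE TABLE fact_player_game (game_id TEXT, player_key INTEGER, pts INTEGER);
INSERT INTO dim_player VALUES (1, 'Player A'), (2, 'Player B');
INSERT INTO fact_player_game VALUES
    ('G1', 1, 30), ('G2', 1, 20), ('G1', 2, 12), ('G2', 2, 18);
""")

# Facts carry the measurements; the dimension supplies readable identity.
rows = con.execute("""
    SELECT d.player_name, COUNT(*) AS games, AVG(f.pts) AS avg_pts
    FROM fact_player_game AS f
    JOIN dim_player AS d ON d.player_key = f.player_key
    GROUP BY d.player_name
    ORDER BY avg_pts DESC
""").fetchall()

for name, games, avg_pts in rows:
    print(f"{name}: {games} games, {avg_pts:.1f} points per game")
```

A rollup like this is the kind of repeated query that an `agg_` table would precompute.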

🏀 Data Coverage

Data spans the 1946-47 season to the present and is kept current by the daily pipeline.

Game-level — box scores (traditional, advanced, misc, four factors, hustle, tracking), play-by-play, shot charts, rotations, win probability, game context, scoring runs
Player-level — career stats, season splits, matchups, awards, draft combine measurements, player tracking (speed, distance, touches, passes, rebounding, shooting), estimated metrics
Team-level — game logs, matchups, splits, clutch stats, franchise history, IST standings, playoff picture, pace and efficiency, player dashboards
League-level — leaders, hustle stats, lineup visualizations, shot locations by zone, synergy play types, league-wide tracking

📦 Output Formats

Format	Path	Description
DuckDB	`nba.duckdb`	Primary analytics engine — columnar storage and fast SQL queries
SQLite	`nba.sqlite`	Portable single-file relational database
Parquet	`parquet/`	Zstd-compressed columnar files, partitioned by season
CSV	`csv/`	Universal flat files for any tool

🚀 Quick Start

pip install nbadb    # or: uv add nbadb

# Full build from scratch (1946-present, ~2-4 hours)
nbadb init

# Daily incremental update (~5-15 minutes)
nbadb daily

# Export to all formats
nbadb export

# Query with natural language
nbadb ask "who led the league in scoring last season"

# Upload to Kaggle
nbadb upload

⌨️ CLI Reference

Command	Description
`nbadb init`	Full pipeline — extract all endpoints, stage, transform, export
`nbadb daily`	Incremental update for recent games
`nbadb monthly`	Dimension refresh + recent data
`nbadb backfill`	Retry failed work and fill extraction gaps
`nbadb migrate`	Run schema migrations
`nbadb run-quality`	Execute data quality checks and generate a report
`nbadb export`	Re-export DuckDB → SQLite / Parquet / CSV
`nbadb upload`	Push the dataset to Kaggle
`nbadb download`	Pull the Kaggle dataset and seed local DuckDB
`nbadb extract-completeness`	Report endpoint coverage gaps
`nbadb docs-autogen`	Regenerate generator-owned schema, data dictionary, ER, and lineage artifacts
`nbadb schema [TABLE]`	Show schema for a table or list all star tables
`nbadb status`	Pipeline status, row counts, and watermarks
`nbadb ask QUESTION`	Natural-language query interface (read-only)

Run `nbadb --help` or `nbadb <command> --help` for full option details.

For docs-site maintenance, regenerate generator-owned artifacts from the repo root with:

uv run nbadb docs-autogen --docs-root docs/content/docs

🤖 AI Query Interface

`nbadb ask` translates natural-language questions into read-only DuckDB queries:

nbadb ask "top 5 players by career three-pointers made"
nbadb ask "which teams had the best home record in 2023-24"
nbadb ask "LeBron James career averages by season"

Queries run against the star schema with safety guards (read-only mode, query limits, SQL injection protection).

📓 Kaggle Notebooks

Ten analysis notebooks are published on Kaggle, all powered by this dataset:

Notebook	Description
NBA Aging Curves	Peak, prime, and decline — career trajectory modeling
Defense Decoded	Tracking + hustle + synergy PCA to quantify defense
Draft Combine Analysis	What pre-draft measurements actually predict
Game Prediction	Stacking ensemble model for game outcomes
MVP Predictor	Explainable ML for MVP voting prediction
Play-by-Play Insights	Win probability, scoring runs, and clutch analysis
Player Archetypes	UMAP + GMM clustering — 8 data-driven player types
Player Dashboard	Interactive explorer with 50+ metrics
Player Similarity	Find any player's statistical twin
Shot Chart Analysis	The geography of scoring and the 3-point revolution

🏗️ Architecture

flowchart LR
    A["NBA API + static sources"] -->|"extract"| B["Stage\nDuckDB staging"]
    B --> C["Transform"]
    C --> D["Warehouse\nDimensions / facts / bridges"]
    C --> E["Derived outputs\nAggregates / analytics"]
    D & E --> F["Export"]
    F --> G["DuckDB"]
    F --> H["SQLite"]
    F --> I["Parquet / CSV"]

Polars for all DataFrame operations with zero-copy Arrow interchange to DuckDB
3-tier Pandera validation — raw → staging → star
SQL-first transforms for the star surface, with dependency-ordered execution
SCD Type 2 for `dim_player` and `dim_team_history` (surrogate keys, `valid_from`/`valid_to`)
Checkpoint/resume for interrupted transform runs
Watermark tracking for incremental extraction
Proxy rotation via proxywhirl with circuit-breaker failover

Read more in the full Architecture Guide.
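
Dependency-ordered execution with checkpoint/resume, as listed above, can be sketched with the standard-library `graphlib` module. This is an illustration of the idea and not nbadb's scheduler. The graph, every table name other than `dim_player` and `dim_team_history`, and the in-memory `done` set standing in for a checkpoint store are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical transform graph: each key lists the tables it reads from.
depends_on = {
    "dim_player": set(),
    "dim_team_history": set(),
    "fact_box_score": {"dim_player", "dim_team_history"},
    "agg_player_season": {"fact_box_score"},
    "analytics_player_summary": {"agg_player_season", "dim_player"},
}

# static_order() yields every table after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())

done = {"dim_player"}  # stand-in for checkpoints left by an interrupted run
built = []
for table in order:
    if table in done:
        continue  # a resumed run skips work that already completed
    built.append(table)
    done.add(table)

print("execution order:", order)
print("built on resume:", built)
```

`TopologicalSorter` raises `CycleError` if the graph contains a cycle, which surfaces a circular transform dependency before any work runs.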

🔧 Tech Stack

Component	Technology
Language	Python 3.13
Package Manager	uv
DataFrames	Polars 1.38
Validation	Pandera (Polars backend)
Analytics DB	DuckDB 1.4
Relational DB	SQLModel + SQLite
HTTP / Proxy	proxywhirl
CLI	Typer + Rich + Textual
Type Checking	ty
Linting	Ruff
Docs	Fumadocs + Next.js
CI	GitHub Actions (SHA-pinned)

📖 Documentation

Full documentation lives at nbadb.w4w.dev.

Getting Started — install, run the pipeline, and learn where to go next
Architecture — pipeline stages, validation layers, and state tables
Schema Reference — curated star-surface guide plus generated raw/staging/star references
Data Dictionary — glossary plus generated raw/staging/star field references
Diagrams — ER, endpoint map, and pipeline visuals
Lineage — trace endpoints and staging inputs to final tables
Guides — onboarding, query recipes, Parquet, Kaggle, and troubleshooting

📄 License

MIT
