The most comprehensive open NBA database available.
| Extractor coverage | Public model | Derived outputs | Docs site |
|---|---|---|---|
| Current nba_api runtime surface | Generated star-schema outputs | Generated `agg_*` and `analytics_*` surfaces | Guides, references, diagrams, and lineage pages |
nbadb exposes an analytics-first warehouse surface rather than a thin mirror of raw upstream payloads.
| Surface | What it covers |
|---|---|
| `dim_*` | Stable identity and lookup context for players, teams, games, seasons, arenas, officials, and other conformed dimensions |
| `fact_*` | Event and measurement tables across box scores, tracking, shot charts, play-by-play, standings, matchups, and specialty feeds |
| `bridge_*` | Many-to-many connectors where public entities legitimately fan out |
| `agg_*` | Reusable rollups for season, career, pace, efficiency, and other repeated reporting needs |
| `analytics_*` | Convenience outputs for notebooks, dashboards, and quick exploratory analysis |
For the current public contract, use the generated docs surfaces: Schema Reference, Data Dictionary, and Lineage.
All data spans the 1946-47 season to the present and is auto-updated by the daily pipeline.
- Game-level – box scores (traditional, advanced, misc, four factors, hustle, tracking), play-by-play, shot charts, rotations, win probability, game context, scoring runs
- Player-level – career stats, season splits, matchups, awards, draft combine measurements, player tracking (speed, distance, touches, passes, rebounding, shooting), estimated metrics
- Team-level – game logs, matchups, splits, clutch stats, franchise history, IST standings, playoff picture, pace and efficiency, player dashboards
- League-level – leaders, hustle stats, lineup visualizations, shot locations by zone, synergy play types, league-wide tracking
| Format | Path | Description |
|---|---|---|
| DuckDB | `nba.duckdb` | Primary analytics engine – columnar storage and fast SQL queries |
| SQLite | `nba.sqlite` | Portable single-file relational database |
| Parquet | `parquet/` | Zstd-compressed columnar files, partitioned by season |
| CSV | `csv/` | Universal flat files for any tool |
> **Tip:** install with `pip install nbadb` (or `uv add nbadb`)

```bash
# Full build from scratch (1946-present, ~2-4 hours)
nbadb init

# Daily incremental update (~5-15 minutes)
nbadb daily

# Export to all formats
nbadb export

# Query with natural language
nbadb ask "who led the league in scoring last season"

# Upload to Kaggle
nbadb upload
```

| Command | Description |
|---|---|
| `nbadb init` | Full pipeline – extract all endpoints, stage, transform, export |
| `nbadb daily` | Incremental update for recent games |
| `nbadb monthly` | Dimension refresh + recent data |
| `nbadb backfill` | Retry failed work and fill extraction gaps |
| `nbadb migrate` | Run schema migrations |
| `nbadb run-quality` | Execute data quality checks and generate a report |
| `nbadb export` | Re-export DuckDB → SQLite / Parquet / CSV |
| `nbadb upload` | Push the dataset to Kaggle |
| `nbadb download` | Pull the Kaggle dataset and seed local DuckDB |
| `nbadb extract-completeness` | Report endpoint coverage gaps |
| `nbadb docs-autogen` | Regenerate generator-owned schema, data dictionary, ER, and lineage artifacts |
| `nbadb schema [TABLE]` | Show schema for a table or list all star tables |
| `nbadb status` | Pipeline status, row counts, and watermarks |
| `nbadb ask QUESTION` | Natural-language query interface (read-only) |
Run `nbadb --help` or `nbadb <command> --help` for full option details.
For docs-site maintenance, regenerate generator-owned artifacts from the repo root with:
```bash
uv run nbadb docs-autogen --docs-root docs/content/docs
```

`nbadb ask` translates natural-language questions into read-only DuckDB queries:

```bash
nbadb ask "top 5 players by career three-pointers made"
nbadb ask "which teams had the best home record in 2023-24"
nbadb ask "LeBron James career averages by season"
```

Queries run against the star schema with safety guards (read-only mode, query limits, SQL injection protection).
Ten analysis notebooks are published on Kaggle, all powered by this dataset:
| Notebook | Description |
|---|---|
| NBA Aging Curves | Peak, prime, and decline – career trajectory modeling |
| Defense Decoded | Tracking + hustle + synergy PCA to quantify defense |
| Draft Combine Analysis | What pre-draft measurements actually predict |
| Game Prediction | Stacking ensemble model for game outcomes |
| MVP Predictor | Explainable ML for MVP voting prediction |
| Play-by-Play Insights | Win probability, scoring runs, and clutch analysis |
| Player Archetypes | UMAP + GMM clustering – 8 data-driven player types |
| Player Dashboard | Interactive explorer with 50+ metrics |
| Player Similarity | Find any player's statistical twin |
| Shot Chart Analysis | The geography of scoring and the 3-point revolution |
```mermaid
flowchart LR
    A["NBA API + static sources"] -->|"extract"| B["Stage\nDuckDB staging"]
    B --> C["Transform"]
    C --> D["Warehouse\nDimensions / facts / bridges"]
    C --> E["Derived outputs\nAggregates / analytics"]
    D & E --> F["Export"]
    F --> G["DuckDB"]
    F --> H["SQLite"]
    F --> I["Parquet / CSV"]
```
- Polars for all DataFrame operations with zero-copy Arrow interchange to DuckDB
- 3-tier Pandera validation – raw → staging → star
- SQL-first transforms for the star surface, with dependency-ordered execution
- SCD Type 2 for `dim_player` and `dim_team_history` (surrogate keys, `valid_from`/`valid_to`)
- Checkpoint/resume for interrupted transform runs
- Watermark tracking for incremental extraction
- Proxy rotation via proxywhirl with circuit-breaker failover
Read more in the full Architecture Guide.
| Component | Technology |
|---|---|
| Language | Python 3.13 |
| Package Manager | uv |
| DataFrames | Polars 1.38 |
| Validation | Pandera (Polars backend) |
| Analytics DB | DuckDB 1.4 |
| Relational DB | SQLModel + SQLite |
| HTTP / Proxy | proxywhirl |
| CLI | Typer + Rich + Textual |
| Type Checking | ty |
| Linting | Ruff |
| Docs | Fumadocs + Next.js |
| CI | GitHub Actions (SHA-pinned) |
Full documentation lives at nbadb.w4w.dev.
- Getting Started – install, run the pipeline, and learn where to go next
- Architecture – pipeline stages, validation layers, and state tables
- Schema Reference – curated star-surface guide plus generated raw/staging/star references
- Data Dictionary – glossary plus generated raw/staging/star field references
- Diagrams – ER, endpoint map, and pipeline visuals
- Lineage – trace endpoints and staging inputs to final tables
- Guides – onboarding, query recipes, Parquet, Kaggle, and troubleshooting
MIT
