nbadb

The most comprehensive open NBA database available.

| Extractor coverage | Public model | Derived outputs | Docs site |
| --- | --- | --- | --- |
| Current nba_api runtime surface | Generated star-schema outputs | Generated `agg_*` and `analytics_*` surfaces | Guides, references, diagrams, and lineage pages |

📊 What's Inside

nbadb exposes an analytics-first warehouse surface rather than a thin mirror of raw upstream payloads.

| Surface | What it covers |
| --- | --- |
| `dim_*` | Stable identity and lookup context for players, teams, games, seasons, arenas, officials, and other conformed dimensions |
| `fact_*` | Event and measurement tables across box scores, tracking, shot charts, play-by-play, standings, matchups, and specialty feeds |
| `bridge_*` | Many-to-many connectors where public entities legitimately fan out |
| `agg_*` | Reusable rollups for season, career, pace, efficiency, and other repeated reporting needs |
| `analytics_*` | Convenience outputs for notebooks, dashboards, and quick exploratory analysis |

For the current public contract, use the generated docs surfaces: Schema Reference, Data Dictionary, and Lineage.
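As a rough illustration of how the `dim_*` and `fact_*` surfaces fit together, here is a minimal sketch of a star-schema join. It uses an in-memory SQLite database from the Python standard library as a stand-in for the warehouse, and every table and column name in it (`dim_player`, `fact_box_score`, `player_id`, `full_name`, `pts`) is a hypothetical placeholder, not the published nbadb schema. Check the Schema Reference for the real names.

```python
import sqlite3

# In-memory stand-in for the warehouse. All table and column names below are
# hypothetical illustrations, not the published nbadb schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_player (player_id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE fact_box_score (game_id TEXT, player_id INTEGER, pts INTEGER);
    INSERT INTO dim_player VALUES (1, 'Player A'), (2, 'Player B');
    INSERT INTO fact_box_score VALUES
        ('G1', 1, 30), ('G2', 1, 20), ('G1', 2, 10);
    """
)

# Facts carry the measurements; the dimension supplies the readable identity.
rows = conn.execute(
    """
    SELECT d.full_name, COUNT(*) AS games, AVG(f.pts) AS avg_pts
    FROM fact_box_score AS f
    JOIN dim_player AS d USING (player_id)
    GROUP BY d.full_name
    ORDER BY avg_pts DESC
    """
).fetchall()
print(rows)
```

The same join shape should carry over to the DuckDB file, with the real dimension and fact names substituted.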

πŸ€ Data Coverage

All data spans from the 1946-47 season to present (auto-updating via the daily pipeline).

  • Game-level: box scores (traditional, advanced, misc, four factors, hustle, tracking), play-by-play, shot charts, rotations, win probability, game context, scoring runs
  • Player-level: career stats, season splits, matchups, awards, draft combine measurements, player tracking (speed, distance, touches, passes, rebounding, shooting), estimated metrics
  • Team-level: game logs, matchups, splits, clutch stats, franchise history, IST standings, playoff picture, pace and efficiency, player dashboards
  • League-level: leaders, hustle stats, lineup visualizations, shot locations by zone, synergy play types, league-wide tracking

📦 Output Formats

| Format | Path | Description |
| --- | --- | --- |
| DuckDB | `nba.duckdb` | Primary analytics engine: columnar storage and fast SQL queries |
| SQLite | `nba.sqlite` | Portable single-file relational database |
| Parquet | `parquet/` | Zstd-compressed columnar files, partitioned by season |
| CSV | `csv/` | Universal flat files for any tool |
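Because the SQLite export is a single file, it can be read with nothing but the Python standard library. The sketch below opens a SQLite file in read-only mode so that exploratory queries cannot modify it. The helper name `open_read_only`, the throwaway `demo.sqlite` file, and the `dim_team` columns are assumptions made for illustration. Point the helper at your own exported `nba.sqlite` instead.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open a SQLite file in read-only mode via a URI connection string."""
    return sqlite3.connect(f"{db_path.resolve().as_uri()}?mode=ro", uri=True)


# Demo against a throwaway file standing in for an exported nba.sqlite.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.sqlite"
    writer = sqlite3.connect(demo_path)
    writer.execute("CREATE TABLE dim_team (team_id INTEGER, name TEXT)")
    writer.execute("INSERT INTO dim_team VALUES (1, 'Team A')")
    writer.commit()
    writer.close()

    reader = open_read_only(demo_path)
    teams = reader.execute("SELECT team_id, name FROM dim_team").fetchall()
    try:
        reader.execute("DELETE FROM dim_team")
        write_blocked = False
    except sqlite3.OperationalError:
        write_blocked = True  # read-only connections reject writes
    reader.close()

print(teams, write_blocked)
```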

🚀 Quick Start

Tip

pip install nbadb    # or: uv add nbadb

# Full build from scratch (1946-present, ~2-4 hours)
nbadb init

# Daily incremental update (~5-15 minutes)
nbadb daily

# Export to all formats
nbadb export

# Query with natural language
nbadb ask "who led the league in scoring last season"

# Upload to Kaggle
nbadb upload

⌨️ CLI Reference

| Command | Description |
| --- | --- |
| `nbadb init` | Full pipeline: extract all endpoints, stage, transform, export |
| `nbadb daily` | Incremental update for recent games |
| `nbadb monthly` | Dimension refresh + recent data |
| `nbadb backfill` | Retry failed work and fill extraction gaps |
| `nbadb migrate` | Run schema migrations |
| `nbadb run-quality` | Execute data quality checks and generate a report |
| `nbadb export` | Re-export DuckDB → SQLite / Parquet / CSV |
| `nbadb upload` | Push the dataset to Kaggle |
| `nbadb download` | Pull the Kaggle dataset and seed local DuckDB |
| `nbadb extract-completeness` | Report endpoint coverage gaps |
| `nbadb docs-autogen` | Regenerate generator-owned schema, data dictionary, ER, and lineage artifacts |
| `nbadb schema [TABLE]` | Show schema for a table or list all star tables |
| `nbadb status` | Pipeline status, row counts, and watermarks |
| `nbadb ask QUESTION` | Natural-language query interface (read-only) |

Run nbadb --help or nbadb <command> --help for full option details.

For docs-site maintenance, regenerate generator-owned artifacts from the repo root with:

uv run nbadb docs-autogen --docs-root docs/content/docs

🤖 AI Query Interface

nbadb ask translates natural-language questions into read-only DuckDB queries:

nbadb ask "top 5 players by career three-pointers made"
nbadb ask "which teams had the best home record in 2023-24"
nbadb ask "LeBron James career averages by season"

Queries run against the star schema with safety guards (read-only mode, query limits, SQL injection protection).
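To make the idea of read-only guards and query limits concrete, here is a minimal sketch of one way such guards could work. It is not nbadb's implementation: it swaps in SQLite's authorizer hook from the Python standard library in place of DuckDB, and the names `allow_reads_only`, `run_guarded`, `MAX_ROWS`, and `fact_shot` are all hypothetical.

```python
import sqlite3

MAX_ROWS = 100  # hypothetical cap; the limit nbadb actually applies is not stated here


def allow_reads_only(action, arg1, arg2, db_name, source):
    """Authorizer callback: permit SELECT-related actions, deny everything else."""
    allowed = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}
    return sqlite3.SQLITE_OK if action in allowed else sqlite3.SQLITE_DENY


def run_guarded(conn, sql, max_rows=MAX_ROWS):
    """Run one statement and return at most max_rows rows."""
    cursor = conn.execute(sql)  # execute() rejects multi-statement input
    return cursor.fetchmany(max_rows)


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE fact_shot (shot_id INTEGER);"
    "INSERT INTO fact_shot VALUES (1), (2), (3);"
)
conn.set_authorizer(allow_reads_only)  # guards apply from this point on

limited = run_guarded(conn, "SELECT shot_id FROM fact_shot ORDER BY shot_id", max_rows=2)

try:
    run_guarded(conn, "DROP TABLE fact_shot")
    write_denied = False
except sqlite3.DatabaseError:
    write_denied = True  # the authorizer refused the statement

print(limited, write_denied)
```

Enforcing the policy at the database layer, rather than by inspecting the SQL text, means a generated query cannot slip a write past a string filter.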

📓 Kaggle Notebooks

Ten analysis notebooks are published on Kaggle, all powered by this dataset:

| Notebook | Description |
| --- | --- |
| NBA Aging Curves | Peak, prime, and decline: career trajectory modeling |
| Defense Decoded | Tracking + hustle + synergy PCA to quantify defense |
| Draft Combine Analysis | What pre-draft measurements actually predict |
| Game Prediction | Stacking ensemble model for game outcomes |
| MVP Predictor | Explainable ML for MVP voting prediction |
| Play-by-Play Insights | Win probability, scoring runs, and clutch analysis |
| Player Archetypes | UMAP + GMM clustering: 8 data-driven player types |
| Player Dashboard | Interactive explorer with 50+ metrics |
| Player Similarity | Find any player's statistical twin |
| Shot Chart Analysis | The geography of scoring and the 3-point revolution |

πŸ—οΈ Architecture

flowchart LR
    A["NBA API + static sources"] -->|"extract"| B["Stage\nDuckDB staging"]
    B --> C["Transform"]
    C --> D["Warehouse\nDimensions / facts / bridges"]
    C --> E["Derived outputs\nAggregates / analytics"]
    D & E --> F["Export"]
    F --> G["DuckDB"]
    F --> H["SQLite"]
    F --> I["Parquet / CSV"]
  • Polars for all DataFrame operations with zero-copy Arrow interchange to DuckDB
  • 3-tier Pandera validation: raw → staging → star
  • SQL-first transforms for the star surface, with dependency-ordered execution
  • SCD Type 2 for dim_player and dim_team_history (surrogate keys, valid_from/valid_to)
  • Checkpoint/resume for interrupted transform runs
  • Watermark tracking for incremental extraction
  • Proxy rotation via proxywhirl with circuit-breaker failover
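
The SCD Type 2 bullet above can be illustrated with an "as of" lookup. The sketch below is a minimal example under stated assumptions: it uses in-memory SQLite in place of DuckDB, the columns `team_sk`, `team_id`, and `team_name` are hypothetical, and it assumes half-open validity intervals with `NULL` in `valid_to` marking the current row. The real `dim_team_history` layout may differ.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical layout: one row per version of a team, keyed by a surrogate.
    CREATE TABLE dim_team_history (
        team_sk INTEGER PRIMARY KEY,   -- surrogate key, unique per version
        team_id INTEGER,               -- natural key, stable across versions
        team_name TEXT,
        valid_from TEXT,               -- ISO dates compare correctly as text
        valid_to TEXT                  -- NULL marks the current version (assumed)
    );
    INSERT INTO dim_team_history VALUES
        (1, 10, 'Old Name', '1990-01-01', '2005-01-01'),
        (2, 10, 'New Name', '2005-01-01', NULL);
    """
)


def team_name_as_of(team_id: int, on_date: str) -> Optional[str]:
    """Return the team name that was valid on on_date, or None if no version matches."""
    row = conn.execute(
        """
        SELECT team_name FROM dim_team_history
        WHERE team_id = ?
          AND valid_from <= ?
          AND (valid_to IS NULL OR ? < valid_to)
        """,
        (team_id, on_date, on_date),
    ).fetchone()
    return row[0] if row else None


print(team_name_as_of(10, "2000-06-01"))
print(team_name_as_of(10, "2005-01-01"))
```

Joining facts to the version that was valid on the game date, rather than to the latest row, is what keeps historical box scores attached to the team identity of their era.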

Read more in the full Architecture Guide.

🔧 Tech Stack

| Component | Technology |
| --- | --- |
| Language | Python 3.13 |
| Package Manager | uv |
| DataFrames | Polars 1.38 |
| Validation | Pandera (Polars backend) |
| Analytics DB | DuckDB 1.4 |
| Relational DB | SQLModel + SQLite |
| HTTP / Proxy | proxywhirl |
| CLI | Typer + Rich + Textual |
| Type Checking | ty |
| Linting | Ruff |
| Docs | Fumadocs + Next.js |
| CI | GitHub Actions (SHA-pinned) |

📖 Documentation

Full documentation lives at nbadb.w4w.dev.

  • Getting Started: install, run the pipeline, and learn where to go next
  • Architecture: pipeline stages, validation layers, and state tables
  • Schema Reference: curated star-surface guide plus generated raw/staging/star references
  • Data Dictionary: glossary plus generated raw/staging/star field references
  • Diagrams: ER, endpoint map, and pipeline visuals
  • Lineage: trace endpoints and staging inputs to final tables
  • Guides: onboarding, query recipes, Parquet, Kaggle, and troubleshooting

📄 License

MIT

About

Data Extraction (from https://stats.nba.com) and Processing Scripts to Produce the NBA Database on Kaggle (https://kaggle.com/wyattowalsh/basketball)
