Fetch Hacker News data into local Parquet files and query with SQL.
- Async fetching - Fast concurrent downloads from HN Firebase API
- Incremental sync - Only fetches new items on subsequent runs
- Sorted Parquet files - Optimized for DuckDB with zstd compression and zonemap filtering
- Built-in SQL shell - Query data directly without external tools
- Checkpoint/resume - Interruption-safe, continues where it left off
uv sync
# Fetch recent items (last 10,000)
uv run hn-sql fetch --start -10000
# Check status
uv run hn-sql stats
# Query the data
uv run hn-sql query "SELECT type, count(*) FROM hn GROUP BY type"
# Sync new items (run anytime)
uv run hn-sql fetchhn-sql fetch # Sync from checkpoint (or start fresh)
hn-sql fetch --start -1000 # Start from last 1000 items
hn-sql fetch --start 1000 --end 2000 # Fetch specific range
hn-sql fetch -c 200 -b 5000 # Custom concurrency/batch sizehn-sql query "SELECT count(*) FROM hn"
hn-sql query "SELECT title, score FROM hn WHERE type='story' ORDER BY score DESC LIMIT 10"
hn-sql query -i # Interactive SQL shellIn interactive mode: .help, .schema, .tables, .quit
hn-sql stats # Show data coverage, storage info, and checkpoint status
hn-sql stats -t # Include partition treeShows: item count, ID range, file stats, and checkpoint info (including partition_style).
hn-sql reset # Delete checkpoint only
hn-sql reset --data # Delete checkpoint and all data
hn-sql reset --data -y # Skip confirmationConsolidates all data into a single sorted Parquet file for optimal query performance.
hn-sql migrate --dry-run # Preview what will happen
hn-sql migrate # Run migration (creates data/items_v2/)
hn-sql migrate --swap # Migrate and swap directories
hn-sql migrate --swap -y # Skip confirmationWhat it does:
- Reads all existing data (handles both hive-partitioned and flat chunk files)
- Sorts by ID (correlates with time, enables DuckDB zonemap filtering)
- Writes a single consolidated
hn.parquetfile todata/items_v2/ - With
--swap: moves old data todata/items_old/, new todata/items/ - Updates checkpoint's
partition_styleto"flat"
When to use:
- After initial sync to consolidate into optimized format
- Periodically (e.g., daily) to merge accumulated chunk files
- To migrate from old hive-partitioned format (year/month directories)
Directory layout after --swap:
data/
items/ # Active data (single hn.parquet)
items_old/ # Backup of previous data (delete when verified)
Incremental sync workflow:
After migration, each fetch adds new items as chunk-*.parquet files. Queries read both hn.parquet and chunks seamlessly. Run migrate --swap periodically to reconsolidate.
hn-sql api # Start API server on port 8000
hn-sql api -p 8080 # Custom portBrowse http://localhost:8000/docs for interactive API documentation.
The hn table contains all item types (story, comment, job, poll, pollopt):
| Column | Type | Description |
|---|---|---|
| id | int64 | Item ID |
| type | string | story, comment, job, poll, pollopt |
| by | string | Author username (quote as "by" in SQL) |
| time | timestamp | Creation time (UTC) |
| title | string | Title (stories/jobs/polls) |
| url | string | External URL |
| text | string | Content (HTML) |
| score | int32 | Points |
| descendants | int32 | Comment count |
| parent | int64 | Parent item ID |
| kids | list | Child comment IDs |
| dead | bool | Dead/flagged item |
| deleted | bool | Deleted item |
| poll | int64 | Parent poll (for pollopts) |
| parts | list | Poll option IDs (for polls) |