## Problem

The download/transform pipeline discourages incremental experimentation. Developers can't easily work on a single data source without triggering the entire pipeline. Caching behavior is undocumented and inconsistent across sources. New contributors have no way to understand what gets downloaded, where it goes, how to clear it, or how long it takes.
This is an umbrella issue. Each checkbox could become its own issue.
## 1. Selective download
- [ ] **Add `tag` fields to all items in `download.yaml`** — the `kghub_downloader` framework already supports a `tags` parameter, but zero items in `download.yaml` use it. Group by data source (e.g., `mediadive`, `bacdive`, `ontology`, `ncbitaxon`, `chebi`, `metatraits`)
- [ ] **Expose `--tags` in the `kg download` CLI** — pass through to `download_from_yaml(tags=...)` so users can run `kg download -t mediadive` instead of downloading all 39 items
- [ ] **Parity with transform** — `kg transform -s mediadive` already works per-source; `kg download` should have equivalent granularity
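A minimal sketch of what tagged entries in `download.yaml` could look like — the URLs below are placeholders (not the repo's real items), and the exact key name should be checked against the `kghub_downloader` schema:

```yaml
# Hypothetical entries. Each item gets a tag naming its data source,
# so download_from_yaml(..., tags=["mediadive"]) fetches only that group.
-
  url: https://mediadive.example.org/rest/media
  local_name: mediadive.json
  tag: mediadive
-
  url: http://purl.obolibrary.org/obo/chebi.owl.gz
  local_name: chebi.owl.gz
  tag: chebi
```

With the CLI pass-through in place, `kg download -t mediadive` would then fetch only the first item.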
## 2. Caching and incremental updates
- [ ] **Document the two-layer caching for MediaDive** — there's a `requests_cache` SQLite file (per-URL, 50MB) AND output JSONs in `data/raw/mediadive/` (all-or-nothing, 43MB). The PR description doesn't distinguish cached vs. uncached behavior
- [ ] **Fix the `.gitignore` mismatch** — `.gitignore` has `mediadive_cache.sqlite`, but the bulk downloader creates `mediadive_bulk_cache.sqlite`. The 50MB cache file is not gitignored
- [ ] **Gitignore the output JSONs** — `data/raw/mediadive/*.json` (43MB of generated data) is not gitignored
- [ ] **Audit caching behavior for all data sources** — which sources have HTTP caching? Which check for existing files? Which re-download unconditionally? Document the answers
- [ ] **Support per-item updates for MediaDive output** — currently all 4 output JSONs are full dumps. If MediaDive adds one medium, all 3,333 get re-dumped. Consider an incremental JSON merge or per-medium files
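As a rough illustration of the per-medium-files option (function name and layout are hypothetical, not the repo's actual code), the dump step could write one JSON file per medium and skip files already on disk:

```python
import json
from pathlib import Path


def dump_media_incrementally(media: list, out_dir: str) -> list:
    """Write one JSON file per medium, skipping media already dumped.

    Hypothetical sketch: if MediaDive adds one medium, only that one
    file gets written instead of re-dumping all 3,333.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for medium in media:
        path = out / f"medium_{medium['id']}.json"
        if path.exists():  # naive freshness check; a hash or mtime check is stricter
            continue
        path.write_text(json.dumps(medium, indent=2))
        written.append(path.name)
    return written
```

The trade-off: per-medium files make updates cheap but turn one read into thousands of small reads; an incremental merge into the existing bulk JSONs keeps the single-file layout at the cost of a more complex dump step.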
## 3. Documentation per data source
Each data source should have a brief doc (README section, table, or per-source markdown) covering:
- [ ] **What it downloads** — URLs, expected file count, approximate size
- [ ] **Where output goes** — paths relative to the repo root
- [ ] **How to clear/refresh** — which files and caches to delete
- [ ] **Expected timing** — cold download, cached, and transform. For reference: NCBITaxon is 12GB+ and dominates download time; MediaDive bulk is ~43MB and trivial by comparison
- [ ] **Dependencies** — does the transform need other sources' downloads to be present?
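One possible shape for a per-source entry, sketched for MediaDive using only the figures cited above (cells marked TBD would be filled in during the caching audit):

```markdown
### mediadive

| Question          | Answer                                                        |
|-------------------|---------------------------------------------------------------|
| What it downloads | DSMZ MediaDive REST API, bulk JSON (~43MB)                    |
| Where output goes | `data/raw/mediadive/` (4 JSON files)                          |
| How to clear      | delete `data/raw/mediadive/*.json` and the SQLite cache file  |
| Expected timing   | ~4 min cold; cached: TBD                                      |
| Dependencies      | TBD                                                           |
```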
## 4. MediaDive parallel download (PR #527) specific

- [ ] **Assess whether parallelization is worth the complexity** — with `requests_cache` in place, the only time parallelism matters is a cold-start first download. The 274-line threading addition (semaphores, thread-safe sessions, retry-after) benefits a ~4 min one-time operation on a 43MB dataset, in a repo where NCBITaxon is 12GB
- [ ] **Add execution instructions to the PR** — how to run, where output goes, how to clear the destination, how to clear the cache
- [ ] **Respect the API operator** — MediaDive is a small academic API at DSMZ. Even at 5 workers, saturating their endpoint with concurrent requests for a one-time bulk download is poor etiquette when sequential + caching achieves the same result on every subsequent run
- [ ] **Consider extracting the good parts without the threading** — the User-Agent header, Retry-After handling, and parameterized retry logic are all valuable independent of parallelization
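A stdlib-only sketch of those good parts — sequential, self-identifying, and Retry-After-aware. The function names, retry counts, and contact address are illustrative, not PR #527's actual code:

```python
import time
import urllib.request
from urllib.error import HTTPError


def retry_after_seconds(headers, default: float = 1.0) -> float:
    """Parse a numeric Retry-After header, falling back to a default.

    HTTP-date values (the other legal form) also fall back to the default.
    """
    try:
        return float(headers.get("Retry-After", default))
    except (TypeError, ValueError):
        return default


def polite_get(url: str, retries: int = 3,
               user_agent: str = "kg-pipeline/0.1 (maintainer@example.org)") -> bytes:
    """Sequential GET that identifies itself and honors Retry-After on 429/503."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except HTTPError as err:
            if err.code in (429, 503) and attempt < retries:
                time.sleep(retry_after_seconds(err.headers))
                continue
            raise
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Combined with `requests_cache` (or any skip-if-present check), this keeps every request after the first run nearly free, with no thread-safety machinery to maintain.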
## 5. Pipeline-wide ergonomics
- [ ] **`kg download` should report what it will do before doing it** — list items, expected sizes, estimated time. A dry-run mode (`--dry-run`) would help
- [ ] **`kg download` should skip already-present files by default** — some sources may already do this, but the behavior is inconsistent and undocumented
- [ ] **Post-download hooks (like MediaDive bulk) should be opt-in** — currently `_post_download_mediadive_bulk` runs automatically after every `kg download` if `mediadive.json` exists. This surprises developers who just wanted to refresh an ontology file
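The dry-run idea could be as small as a pure planning function the CLI calls before (or instead of) downloading. A sketch with a hypothetical item schema (`local_name`, optional `tag` and `size_mb` — not the repo's actual field names):

```python
def plan_download(items, tags=None, existing=frozenset()):
    """Report what `kg download` would do, without doing it.

    items:    parsed download.yaml entries (hypothetical schema)
    tags:     optional tag filter, mirroring download_from_yaml(tags=...)
    existing: local_names already present under data/raw/
    """
    selected = [it for it in items if not tags or it.get("tag") in tags]
    fetch = [it for it in selected if it["local_name"] not in existing]
    skip = [it for it in selected if it["local_name"] in existing]
    return {
        "fetch": [it["local_name"] for it in fetch],
        "skip": [it["local_name"] for it in skip],
        "approx_mb": sum(it.get("size_mb", 0) for it in fetch),
    }
```

A `--dry-run` flag would print this report and exit; the default path would download only the `fetch` list, which gives skip-if-present behavior for free and makes it uniform across sources.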
## Context

- `download.yaml` has 39 items totaling ~21GB in `data/raw/`
- Transform already supports per-source selection (the `-s` flag); download does not
- The `kghub_downloader` library supports tags, but they're unused here