Download and data management ergonomics: selective execution, caching, documentation #533

@turbomam

Description

Problem

The download/transform pipeline discourages incremental experimentation. Developers can't easily work on a single data source without triggering the entire pipeline. Caching behavior is undocumented and inconsistent across sources. New contributors have no way to understand what gets downloaded, where it goes, how to clear it, or how long it takes.

This is an umbrella issue. Each checkbox could become its own issue.


1. Selective download

  • Add tag fields to all items in download.yaml — the kghub_downloader framework already supports a tags parameter, but zero items in download.yaml use it. Group by data source (e.g., mediadive, bacdive, ontology, ncbitaxon, chebi, metatraits)
  • Expose --tags in the kg download CLI — pass through to download_from_yaml(tags=...) so users can run kg download -t mediadive instead of downloading all 39 items
  • Parity with transform — kg transform -s mediadive already works per-source; kg download should have equivalent granularity
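
A minimal sketch of the tag filtering described above, assuming item entries carry a per-item tag field (the item shape, tag names, and URLs here are illustrative, not the real download.yaml contents):

```python
# Sketch: filter download items by tag before handing them to the downloader,
# so `kg download -t mediadive` touches only that source's items.
# Item shape ({"url": ..., "tag": ...}) is an assumption for illustration.

def filter_by_tags(items, tags):
    """Keep items whose 'tag' matches any requested tag; no tags = keep all."""
    if not tags:
        return list(items)
    wanted = set(tags)
    return [item for item in items if item.get("tag") in wanted]

items = [
    {"url": "https://example.org/mediadive.json", "tag": "mediadive"},
    {"url": "https://example.org/chebi.owl.gz", "tag": "chebi"},
    {"url": "https://example.org/taxdump.tar.gz", "tag": "ncbitaxon"},
]

selected = filter_by_tags(items, ["mediadive"])
print([item["url"] for item in selected])
```

The CLI flag would then just forward the parsed tag list to download_from_yaml(tags=...), which the kghub_downloader framework already accepts per the issue.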

2. Caching and incremental updates

  • Document the two-layer caching for MediaDive — there's a requests_cache SQLite file (per-URL, 50MB) AND output JSONs in data/raw/mediadive/ (all-or-nothing, 43MB). The PR description doesn't distinguish cached vs uncached behavior
  • Fix .gitignore mismatch — .gitignore has mediadive_cache.sqlite but the bulk downloader creates mediadive_bulk_cache.sqlite. The 50MB cache file is not gitignored
  • Gitignore the output JSONs — data/raw/mediadive/*.json (43MB of generated data) is not gitignored
  • Audit caching behavior for all data sources — which sources have HTTP caching? Which check for existing files? Which re-download unconditionally? Document the answers
  • Support per-item updates for MediaDive output — currently all 4 output JSONs are full dumps. If MediaDive adds one medium, all 3,333 get re-dumped. Consider incremental JSON merge or per-medium files
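
One way the per-item update in the last bullet could look, as a sketch: merge newly fetched records into the existing dump keyed by id, instead of re-dumping all ~3,333 media. The record shape and key name are assumptions, not the actual MediaDive schema:

```python
# Sketch: incremental JSON merge for a bulk dump. Only changed/new records
# are replaced; everything else is carried over untouched.
import json

def merge_media(existing, fetched, key="id"):
    """Update or insert records by key; untouched records are kept as-is."""
    by_id = {rec[key]: rec for rec in existing}
    for rec in fetched:
        by_id[rec[key]] = rec          # overwrite changed, add new
    return sorted(by_id.values(), key=lambda r: r[key])

# Illustrative records, not real MediaDive data.
old = [{"id": 1, "name": "LB"}, {"id": 2, "name": "M9"}]
new = [{"id": 2, "name": "M9 (updated)"}, {"id": 3, "name": "TSB"}]
merged = merge_media(old, new)
print(json.dumps(merged))
```

Per-medium files would make this even simpler (one file touched per changed medium), at the cost of many small files in data/raw/mediadive/.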

3. Documentation per data source

Each data source should have a brief doc (README section, table, or per-source markdown) covering:

  • What it downloads — URLs, expected file count, approximate size
  • Where output goes — paths relative to repo root
  • How to clear/refresh — which files and caches to delete
  • Expected timing — cold download, cached, and transform. For reference: NCBITaxon is 12GB+ and dominates download time; MediaDive bulk is ~43MB and trivial by comparison
  • Dependencies — does the transform need other sources' downloads to be present?
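
The "where output goes / approximate size" rows of such a doc could be generated from the filesystem rather than written by hand. A sketch, assuming per-source directories under data/raw/ as mentioned elsewhere in this issue (the demo uses a throwaway directory so it runs anywhere):

```python
# Sketch: summarize one data source's output directory for a per-source doc.
from pathlib import Path
import tempfile

def summarize_source(source_dir: Path) -> dict:
    """File count and total byte size for a source's output directory."""
    files = [p for p in source_dir.rglob("*") if p.is_file()]
    return {
        "path": str(source_dir),
        "files": len(files),
        "bytes": sum(p.stat().st_size for p in files),
    }

# Demo against a temporary directory standing in for data/raw/mediadive/
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "media.json").write_text("{}")
    (root / "strains.json").write_text("[]")
    summary = summarize_source(root)
    print(summary["files"], summary["bytes"])
```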

4. MediaDive parallel download (PR #527) specific

  • Assess whether parallelization is worth the complexity — with requests_cache in place, the only time parallel matters is a cold-start first download. The 274-line threading addition (semaphores, thread-safe sessions, retry-after) benefits a ~4 min one-time operation on a 43MB dataset, in a repo where NCBITaxon is 12GB
  • Add execution instructions to PR — how to run, where output goes, how to clear destination, how to clear cache
  • Respect the API operator — MediaDive is a small academic API at DSMZ. Even at 5 workers, saturating their endpoint with concurrent requests for a one-time bulk download is poor etiquette when sequential + caching achieves the same result on every subsequent run
  • Consider extracting the good parts without the threading — User-Agent header, Retry-After handling, parameterized retry logic are all valuable independent of parallelization
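
The last bullet's "good parts without the threading" might look like this sketch: a User-Agent constant plus a sequential retry loop that honors Retry-After. The fetch function is injected so the sketch stays self-contained; the contact address and retry counts are placeholders, not values from PR #527:

```python
# Sketch: sequential download helper keeping the reusable pieces of PR #527
# (User-Agent, Retry-After handling, bounded retries) without any threading.
import time

USER_AGENT = "kg-microbe-downloader (contact: maintainer@example.org)"  # placeholder

def fetch_with_retry(fetch, url, max_retries=3, sleep=time.sleep):
    """Call fetch(url); on a 429-style status, honor Retry-After and retry."""
    for attempt in range(max_retries + 1):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        sleep(float(headers.get("Retry-After", "1")))
    return status, body

# Fake endpoint: rate-limits the first call, succeeds on the second.
calls = []
def fake_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        return 429, {"Retry-After": "0"}, b""
    return 200, {}, b"ok"

status, body = fetch_with_retry(fake_fetch, "https://example.org/medium/1")
print(status, body)
```

Sequential requests with this kind of backoff are also the polite option for a small academic API like DSMZ's, per the etiquette point above.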

5. Pipeline-wide ergonomics

  • kg download should report what it will do before doing it — list items, expected sizes, estimated time. A dry-run mode (--dry-run) would help
  • kg download should skip already-present files by default — some sources may already do this, but the behavior is inconsistent and undocumented
  • Post-download hooks (like MediaDive bulk) should be opt-in — currently _post_download_mediadive_bulk runs automatically after every kg download if mediadive.json exists. This surprises developers who just wanted to refresh an ontology file
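
A dry-run planner covering the first two bullets could be as small as this sketch; the item fields and sizes are illustrative assumptions, not the real download.yaml schema:

```python
# Sketch: plan a download run without touching the network, splitting items
# into to-download vs. already-present. Supports both --dry-run reporting
# and skip-existing-by-default behavior.
from pathlib import Path

def plan_downloads(items, output_dir, skip_existing=True):
    """Return (to_download, skipped) based on which destinations exist."""
    to_download, skipped = [], []
    for item in items:
        dest = Path(output_dir) / item["local_name"]
        if skip_existing and dest.exists():
            skipped.append(item)
        else:
            to_download.append(item)
    return to_download, skipped

# Illustrative items; sizes are made up for the report.
items = [
    {"url": "https://example.org/a.json", "local_name": "a.json", "size_mb": 43},
    {"url": "https://example.org/b.owl", "local_name": "b.owl", "size_mb": 12000},
]
todo, skipped = plan_downloads(items, "/nonexistent-dir")
print(f"{len(todo)} to download (~{sum(i['size_mb'] for i in todo)} MB), "
      f"{len(skipped)} already present")
```

With per-item size metadata in download.yaml, the same plan could also print the estimated time split the issue asks for.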

Context

  • PR #527 (Parallelize MediaDive bulk download, ~20x faster) prompted this investigation
  • download.yaml has 39 items totaling ~21GB in data/raw/
  • Transform already supports per-source execution (-s flag); download does not
  • The kghub_downloader library supports tags but they're unused here
