Download and data management ergonomics: selective execution, caching, documentation #533

@turbomam

Description

Problem

The download/transform pipeline discourages incremental experimentation. Developers can't easily work on a single data source without triggering the entire pipeline. Caching behavior is undocumented and inconsistent across sources. New contributors have no way to understand what gets downloaded, where it goes, how to clear it, or how long it takes.

This is an umbrella issue. Each checkbox could become its own issue.


1. Selective download

  • Add tag fields to all items in download.yaml — the kghub_downloader framework already supports a tags parameter, but zero items in download.yaml use it. Group by data source (e.g., mediadive, bacdive, ontology, ncbitaxon, chebi, metatraits)
  • Expose --tags in the kg download CLI — pass through to download_from_yaml(tags=...) so users can run kg download -t mediadive instead of downloading all 39 items
  • Parity with transform — kg transform -s mediadive already works per-source; kg download should have equivalent granularity
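
A minimal sketch of the tag filtering described above, assuming item entries carry a per-item tag field (the item shape, tag names, and URLs here are illustrative, not the real download.yaml contents):

```python
# Sketch: filter download items by tag before handing them to the downloader,
# so `kg download -t mediadive` touches only that source's items.
# Item shape ({"url": ..., "tag": ...}) is an assumption for illustration.

def filter_by_tags(items, tags):
    """Keep items whose 'tag' matches any requested tag; no tags = keep all."""
    if not tags:
        return list(items)
    wanted = set(tags)
    return [item for item in items if item.get("tag") in wanted]

items = [
    {"url": "https://example.org/mediadive.json", "tag": "mediadive"},
    {"url": "https://example.org/chebi.owl.gz", "tag": "chebi"},
    {"url": "https://example.org/taxdump.tar.gz", "tag": "ncbitaxon"},
]

selected = filter_by_tags(items, ["mediadive"])
print([item["url"] for item in selected])
```

The CLI flag would then just forward the parsed tag list to download_from_yaml(tags=...), which the kghub_downloader framework already accepts per the issue.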

2. Caching and incremental updates

  • Document the two-layer caching for MediaDive — there's a requests_cache SQLite file (per-URL, 50MB) AND output JSONs in data/raw/mediadive/ (all-or-nothing, 43MB). The PR description doesn't distinguish cached vs uncached behavior
  • Fix .gitignore mismatch — .gitignore has mediadive_cache.sqlite but the bulk downloader creates mediadive_bulk_cache.sqlite. The 50MB cache file is not gitignored
  • Gitignore the output JSONs — data/raw/mediadive/*.json (43MB of generated data) is not gitignored
  • Audit caching behavior for all data sources — which sources have HTTP caching? Which check for existing files? Which re-download unconditionally? Document the answers
  • Support per-item updates for MediaDive output — currently all 4 output JSONs are full dumps. If MediaDive adds one medium, all 3,333 get re-dumped. Consider incremental JSON merge or per-medium files
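
One way the per-item update in the last bullet could look, as a sketch: merge newly fetched records into the existing dump keyed by id, instead of re-dumping all ~3,333 media. The record shape and key name are assumptions, not the actual MediaDive schema:

```python
# Sketch: incremental JSON merge for a bulk dump. Only changed/new records
# are replaced; everything else is carried over untouched.
import json

def merge_media(existing, fetched, key="id"):
    """Update or insert records by key; untouched records are kept as-is."""
    by_id = {rec[key]: rec for rec in existing}
    for rec in fetched:
        by_id[rec[key]] = rec          # overwrite changed, add new
    return sorted(by_id.values(), key=lambda r: r[key])

# Illustrative records, not real MediaDive data.
old = [{"id": 1, "name": "LB"}, {"id": 2, "name": "M9"}]
new = [{"id": 2, "name": "M9 (updated)"}, {"id": 3, "name": "TSB"}]
merged = merge_media(old, new)
print(json.dumps(merged))
```

Per-medium files would make this even simpler (one file touched per changed medium), at the cost of many small files in data/raw/mediadive/.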

3. Documentation per data source

Each data source should have a brief doc (README section, table, or per-source markdown) covering:

  • What it downloads — URLs, expected file count, approximate size
  • Where output goes — paths relative to repo root
  • How to clear/refresh — which files and caches to delete
  • Expected timing — cold download, cached, and transform. For reference: NCBITaxon is 12GB+ and dominates download time; MediaDive bulk is ~43MB and trivial by comparison
  • Dependencies — does the transform need other sources' downloads to be present?
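
The "where output goes / approximate size" rows of such a doc could be generated from the filesystem rather than written by hand. A sketch, assuming per-source directories under data/raw/ as mentioned elsewhere in this issue (the demo uses a throwaway directory so it runs anywhere):

```python
# Sketch: summarize one data source's output directory for a per-source doc.
from pathlib import Path
import tempfile

def summarize_source(source_dir: Path) -> dict:
    """File count and total byte size for a source's output directory."""
    files = [p for p in source_dir.rglob("*") if p.is_file()]
    return {
        "path": str(source_dir),
        "files": len(files),
        "bytes": sum(p.stat().st_size for p in files),
    }

# Demo against a temporary directory standing in for data/raw/mediadive/
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "media.json").write_text("{}")
    (root / "strains.json").write_text("[]")
    summary = summarize_source(root)
    print(summary["files"], summary["bytes"])
```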

4. MediaDive parallel download (PR #527) specific

  • Assess whether parallelization is worth the complexity — with requests_cache in place, the only time parallel matters is a cold-start first download. The 274-line threading addition (semaphores, thread-safe sessions, retry-after) benefits a ~4 min one-time operation on a 43MB dataset, in a repo where NCBITaxon is 12GB
  • Add execution instructions to PR — how to run, where output goes, how to clear destination, how to clear cache
  • Respect the API operator — MediaDive is a small academic API at DSMZ. Even at 5 workers, saturating their endpoint with concurrent requests for a one-time bulk download is poor etiquette when sequential + caching achieves the same result on every subsequent run
  • Consider extracting the good parts without the threading — User-Agent header, Retry-After handling, parameterized retry logic are all valuable independent of parallelization
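
The last bullet's "good parts without the threading" might look like this sketch: a User-Agent constant plus a sequential retry loop that honors Retry-After. The fetch function is injected so the sketch stays self-contained; the contact address and retry counts are placeholders, not values from PR #527:

```python
# Sketch: sequential download helper keeping the reusable pieces of PR #527
# (User-Agent, Retry-After handling, bounded retries) without any threading.
import time

USER_AGENT = "kg-microbe-downloader (contact: maintainer@example.org)"  # placeholder

def fetch_with_retry(fetch, url, max_retries=3, sleep=time.sleep):
    """Call fetch(url); on a 429-style status, honor Retry-After and retry."""
    for attempt in range(max_retries + 1):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        sleep(float(headers.get("Retry-After", "1")))
    return status, body

# Fake endpoint: rate-limits the first call, succeeds on the second.
calls = []
def fake_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        return 429, {"Retry-After": "0"}, b""
    return 200, {}, b"ok"

status, body = fetch_with_retry(fake_fetch, "https://example.org/medium/1")
print(status, body)
```

Sequential requests with this kind of backoff are also the polite option for a small academic API like DSMZ's, per the etiquette point above.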

5. Pipeline-wide ergonomics

  • kg download should report what it will do before doing it — list items, expected sizes, estimated time. A dry-run mode (--dry-run) would help
  • kg download should skip already-present files by default — some sources may already do this, but the behavior is inconsistent and undocumented
  • Post-download hooks (like MediaDive bulk) should be opt-in — currently _post_download_mediadive_bulk runs automatically after every kg download if mediadive.json exists. This surprises developers who just wanted to refresh an ontology file
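
A dry-run planner covering the first two bullets could be as small as this sketch; the item fields and sizes are illustrative assumptions, not the real download.yaml schema:

```python
# Sketch: plan a download run without touching the network, splitting items
# into to-download vs. already-present. Supports both --dry-run reporting
# and skip-existing-by-default behavior.
from pathlib import Path

def plan_downloads(items, output_dir, skip_existing=True):
    """Return (to_download, skipped) based on which destinations exist."""
    to_download, skipped = [], []
    for item in items:
        dest = Path(output_dir) / item["local_name"]
        if skip_existing and dest.exists():
            skipped.append(item)
        else:
            to_download.append(item)
    return to_download, skipped

# Illustrative items; sizes are made up for the report.
items = [
    {"url": "https://example.org/a.json", "local_name": "a.json", "size_mb": 43},
    {"url": "https://example.org/b.owl", "local_name": "b.owl", "size_mb": 12000},
]
todo, skipped = plan_downloads(items, "/nonexistent-dir")
print(f"{len(todo)} to download (~{sum(i['size_mb'] for i in todo)} MB), "
      f"{len(skipped)} already present")
```

With per-item size metadata in download.yaml, the same plan could also print the estimated time split the issue asks for.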

Context

  • PR #527 (Parallelize MediaDive bulk download, ~20x faster) prompted this investigation
  • download.yaml has 39 items totaling ~21GB in data/raw/
  • Transform already supports per-source execution (-s flag); download does not
  • The kghub_downloader library supports tags but they're unused here
