Merge master and fix lint for fix_metatraits#535
Merge master and fix lint for fix_metatraits#535turbomam wants to merge 9 commits intofix_metatraitsfrom
Conversation
Replace sequential per-medium API loops with a 20-worker thread pool, reducing ~3,333 serial requests per phase (~30 min each) to ~1-2 min. Applies to both detailed recipe and medium-strain association downloads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per Mark's comments: - Lower DEFAULT_MAX_WORKERS from 20 → 5 (polite for small academic API at DSMZ) - Add USER_AGENT constant identifying kg-microbe so API operator can reach out - Respect Retry-After header on 429 responses instead of fixed retry_delay - Switch from futures-dict to executor.map in both download functions - Tighten return type annotations on _fetch_* helpers to tuple[str, dict] Per Marcin's comments: - Make max_workers, retry_count, retry_delay parameters on download_detailed_media, download_medium_strains, and download_mediadive_bulk so callers can tune them - Pass retry_count and retry_delay through _fetch_* helpers into get_json_from_api (previously helpers always used defaults, ignoring caller overrides) - Add tests/test_mediadive_bulk_download.py: 8 tests covering default values, Retry-After behaviour, retry parameter propagation, and concurrency bounds 136 passed, 25 skipped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consolidate chemical mappings into unified utility
The bulk downloader creates mediadive_bulk_cache.sqlite (50MB) but only mediadive_cache.sqlite was gitignored. Also ignore the generated output JSONs in data/raw/mediadive/. See #533 Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ownload Parallelize MediaDive bulk download (~20x faster)
Accept master's version of mediadive_bulk_download.py which includes _make_session() helper and expanded get_json_from_api signature from the parallelization PR. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Fix 3x S110 try-except-pass with noqa suppressions (cleanup code) - Fix E501 line too long in parallel processing message - Fix D203 blank line before class docstring in test file (ruff --fix) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR merges recent master updates into the fix_metatraits branch and resolves CI lint failures, including adopting the updated parallelized MediaDive bulk download utilities and adding/adjusting lint suppressions and formatting fixes.
Changes:
- Merge conflict resolution bringing in the newer
mediadive_bulk_download.pyAPI (session helper, retry/header handling, parallel fetching). - Lint fixes in MetaTraits cleanup code (
# noqa: S110) and a long f-string split. - Minor lint-related adjustments (zip strict arg,
.gitignoreupdates) and a new MediaDive bulk download test file.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
kg_microbe/utils/mediadive_bulk_download.py |
Introduces parallel download, session helper, retry behavior (incl. Retry-After), and new function parameters. |
tests/test_mediadive_bulk_download.py |
Adds unit tests for defaults, retry behavior, and concurrency bounds. |
kg_microbe/utils/robot_utils.py |
Lint-driven update to zip(..., strict=...) in ROBOT CLI call construction. |
kg_microbe/transform_utils/metatraits/metatraits.py |
Adds noqa suppressions for intentional cleanup exception silencing and splits a long print f-string. |
.gitignore |
Ignores MediaDive bulk cache DB and raw output directory. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix 429 exhaustion bug: all-429 retry loop now returns {} instead of
falling off the end and returning None
- Guard Retry-After parsing: handle HTTP-date values that can't be
cast to float, fall back to retry_delay
- Fix docstring: session param creates new session per call, not
module-level
- Remove unused requests_per_second parameter from both download
functions (was documented but never implemented)
- Fix isinstance(terms, List) -> isinstance(terms, list) in
robot_utils.py for Python 3.10+ compatibility
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
Closing as superseded. Since this PR was opened, The three genuinely unique bug fixes in |
Summary
fix_metatraitsto resolve conflict from PR Parallelize MediaDive bulk download (~20x faster) #527 (MediaDive parallelization)Changes
mediadive_bulk_download.pywith_make_session()helper and expandedget_json_from_apisignaturenoqa: S110suppressions for cleanup code where silencing exceptions is intentionalContext
CI on
fix_metatraitshas been failing (lint) and the branch has a merge conflict with master after #527 and #530 were merged. This PR brings the branch up to date so it can pass CI and be evaluated for merge.