Replace fasterq-dump with sracat-rs #49

Open: wwood wants to merge 1 commit into main from replace-fasterq-dump-with-sracat-rs

wwood (Owner) commented Jul 1, 2026

Summary

Migrates all .sra extraction to sracat-rs, replacing both fasterq-dump (sorted/default path) and the legacy sracat binary (--unsorted paths). The legacy sracat dependency is dropped entirely.

Changes

  • kingfisher/__init__.py (extract)
    • Sorted path: fasterq-dump → sracat-rs --qual -1/-2/--single-out.
    • --unsorted file outputs: rewritten to sracat-rs split mode, writing native .fasta/.fastq directly (no more .fna renames, no FIFO/-z machinery; gzip is now a pigz pass).
    • --unsorted --stdout: routes single/orphan reads to /dev/stdout so single-end runs work.
    • Empty -1/-2 files (created eagerly by sracat-rs for single-end runs) are dropped before further processing.
    • Removed now-unused import gzip.
  • Dependencies: sracat → sracat-rs in pixi.toml; regenerated pixi.lock (sracat-rs 0.0.3) and admin/environment.yml. Excluded gawk from the generated requirements.txt.
  • Docs/CLI: mention sracat-rs instead of fasterq-dump.
  • Tests: regenerated the .sra-extraction expectations in test_get_sra_and_aws.py. Since sorted and unsorted now run identical sracat-rs extraction, their previously-distinct expected values converge. ENA/NGDC tests are unaffected (they download fastq.gz directly).
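
For reviewers, the sorted-path call and the empty-file cleanup listed above could be sketched roughly as follows. This is an illustrative sketch, not the code in kingfisher/__init__.py: the helper names `build_sracat_rs_command` and `drop_empty_outputs` and the example file names are hypothetical, and only the flags quoted in this PR (`--qual`, `--threads`, `-1`, `-2`, `--single-out`) are used.

```python
import os
import shlex


def build_sracat_rs_command(sra_path, r1, r2, single, threads=1):
    """Return an argv list mirroring the sracat-rs call quoted in this PR.

    Hypothetical helper: only flags that appear in the PR are used.
    """
    return [
        "sracat-rs", "--qual",
        "--threads", str(threads),
        "-1", r1,
        "-2", r2,
        "--single-out", single,
        sra_path,
    ]


def drop_empty_outputs(paths):
    """Delete zero-byte files and return the paths that still hold reads."""
    kept = []
    for path in paths:
        if not os.path.exists(path):
            # Nothing was written for this output, so there is nothing to keep.
            continue
        if os.path.getsize(path) == 0:
            # For example, an eagerly created -1/-2 file from a single-end run.
            os.remove(path)
        else:
            kept.append(path)
    return kept


if __name__ == "__main__":
    # Example file names are placeholders, not paths used by kingfisher.
    cmd = build_sracat_rs_command(
        "run.sra", "run_1.fastq", "run_2.fastq", "run.fastq", threads=4
    )
    print(shlex.join(cmd))
```

Building an argv list rather than a formatted string avoids quoting problems when paths contain spaces; whether kingfisher's `extern.run` accepts a list is not something this PR states, so the sketch only prints the command.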

Testing

Run on the aqua HPC queue in the dev pixi env:

  • 8 offline extract tests — pass
  • 14 network download→extract tests — pass

Not run: aws_cp-marked tests (need AWS credentials) and pure .sra-download / bioproject / prefetch-listing tests (don't touch extraction).

🤖 Generated with Claude Code

Use sracat-rs (github.com/wwood/sracat-rs) for all .sra extraction,
replacing fasterq-dump in the sorted/default path and the legacy sracat
binary in the --unsorted paths. sracat-rs writes split mates via
-1/-2/--single-out in native .fasta/.fastq, so the .fna renames and the
FIFO/-z gzip machinery are gone (gzip is now a pigz pass). --unsorted
--stdout routes single/orphan reads to /dev/stdout so single-end runs work.

Empty -1/-2 files (created eagerly by sracat-rs for single-end runs) are
dropped before further processing.

deps: swap sracat -> sracat-rs in pixi.toml, regenerate pixi.lock and
admin/environment.yml. Exclude gawk from the generated requirements.txt.

docs/cli: mention sracat-rs instead of fasterq-dump.

tests: regenerate the .sra-extraction expectations in test_get_sra_and_aws.py
to match sracat-rs output (sorted and unsorted now extract identically, so
their expected values converge). All extraction-affected tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ac851eaa4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread: kingfisher/__init__.py, lines +594 to +595

    extern.run("sracat-rs --qual --threads {} -1 {} -2 {} --single-out {} {}".format(
        threads, r1, r2, single, sra_file_abs))

P2: Preserve sorted extraction when --unsorted is absent

For default get/extract calls where users do not pass --unsorted, this now invokes sracat-rs, the same unordered extractor used by the unsorted branch, so the documented/default distinction is lost. The CLI still advertises --unsorted as the opt-in path for arbitrary SRA storage order, and tests were changed so the non-unsorted FASTQ output matches the unsorted sizes; workflows that rely on fasterq-dump's original spot order now silently receive arbitrary-order reads.
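
One way to check the ordering concern above would be to compare read names between an older fasterq-dump output and the new sracat-rs output for the same run. The following is a minimal sketch, assuming plain four-line FASTQ records and assuming both tools emit comparable read names, which this PR does not confirm; `read_names` and `compare_order` are hypothetical helpers, not part of kingfisher.

```python
import io


def read_names(handle):
    """Yield the read name from each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 != 0:
            continue
        # Header line: drop the leading '@' and any description after the name.
        fields = line.strip()[1:].split()
        if fields:
            yield fields[0]


def compare_order(handle_a, handle_b):
    """Classify two FASTQ streams by read content and read order."""
    names_a = list(read_names(handle_a))
    names_b = list(read_names(handle_b))
    if names_a == names_b:
        return "identical order"
    if sorted(names_a) == sorted(names_b):
        return "same reads, different order"
    return "different reads"


if __name__ == "__main__":
    # Tiny in-memory example with two reads in swapped order.
    a = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n")
    b = io.StringIO("@r2\nTTTT\n+\nIIII\n@r1\nACGT\n+\nIIII\n")
    print(compare_order(a, b))
```

The sketch holds all names in memory, so it suits small test runs rather than full-size archives.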
