Gina Magro 2026-05-20
This project presents a reproducible population genomics workflow that uses publicly available
sequencing data from the NCBI Sequence Read Archive (SRA). The workflow uses the Chinook salmon
(Oncorhynchus tshawytscha) genome as a demonstration reference and is inspired by the study
"Genomic evidence for domestication selection in three hatchery populations of Chinook salmon",
which investigated genetic differentiation between hatchery and wild salmon populations across
Southeast Alaska. The full study data are available through NCBI's SRA database under the project
ID PRJNA1069051. Further information and analysis can be found in
Chinook_6Sample_Data_Report.html.
Every script can be called independently with manually supplied arguments.
Each script also has default arguments that match the workflow structure
in the run_pipeline script.
Full Workflow Usage:
./run_pipeline.sh <Accession_list.csv>
chinook_salmon/
│
├── scripts/
│ ├── sra_to_FASTQ.sh
│ ├── qc_trim_FASTQ.sh
│ ├── align_FASTQ_reads.sh
│ ├── variant_calling.sh
│ ├── postprocess_vcf.sh
│ ├── population_structure.R
│ ├── variant_summary.R
│ └── config.sh
│
├── data/
├── logs/
├── plots/
├── results/
│
├── run_pipeline.sh
│
├── SraAccList_Chickamin.csv
├── SraAccList_Whitman.csv
└── Merged_SraAccList.csv
The run_pipeline script automates the workflow in the steps described below. Each script can also be run independently.
This bash script takes a list of SRA accession numbers as its argument.
It temporarily downloads the SRA accessions and converts them to
FASTQ files, which are saved to the data directory under FASTQ. It includes
logging, skips samples that are already complete, and cleans up the temporary downloads.
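The behaviour described above (download, convert, skip finished samples, clean up) could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of the project's script: it assumes one run accession per line, uses prefetch and fasterq-dump from sra-tools, and treats an existing non-empty read-1 FASTQ as "complete". The names fetch_accessions, run, DRY_RUN, and the tmp_sra scratch directory are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only; not the contents of the project's SRA download script.
# DRY_RUN=1 (hypothetical switch) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fetch_accessions() {
    acc_list=$1                    # text/CSV file, one accession per line (assumed)
    out_dir=${2:-data/FASTQ}       # where the FASTQ files end up
    tmp_dir=${3:-data/tmp_sra}     # hypothetical scratch directory for downloads

    run mkdir -p "$out_dir" "$tmp_dir"

    # tr drops Windows line endings; the || test keeps a final line with no newline.
    tr -d '\r' < "$acc_list" | while IFS= read -r acc || [ -n "$acc" ]; do
        # Keep only run accessions (SRR/ERR/DRR); this also skips headers and blanks.
        case $acc in
            [SED]RR[0-9]*) ;;
            *) continue ;;
        esac

        # Assumed completeness check: a non-empty read-1 FASTQ already exists.
        if [ -s "$out_dir/${acc}_1.fastq" ]; then
            echo "skip $acc (FASTQ already present)"
            continue
        fi

        run prefetch "$acc" --output-directory "$tmp_dir"
        run fasterq-dump "$tmp_dir/$acc" --split-files --outdir "$out_dir"
        run rm -rf "$tmp_dir/$acc"   # cleanup of the temporary download
    done
}
```

Setting DRY_RUN=1 before calling fetch_accessions Merged_SraAccList.csv prints the planned commands without downloading anything, which is a cheap way to check an accession list first.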
Usage:
./sra_to_FASTQ.sh <accessionFile_OR_AccessionFileList>
This bash script reads raw FASTQ files, performs quality control, and trims reads using the
default parameters. It produces cleaned FASTQ files ready for alignment, saved to the data
directory under CLEAN_FASTQ. Data quality reports generated by FastQC are saved to the data
directory under QC_REPORTS.
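A minimal sketch of how such a QC-and-trim step might look, assuming fastp performs the trimming with its default settings and FastQC is run on the trimmed reads. The paired-end naming (_1.fastq / _2.fastq), the .clean.fastq suffix, and the names qc_trim, run, and DRY_RUN are illustrative assumptions rather than details taken from the project's script.

```shell
#!/usr/bin/env bash
# Sketch only. DRY_RUN=1 (hypothetical switch) prints commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

qc_trim() {
    raw_dir=$1      # directory with raw paired FASTQ files
    clean_dir=$2    # output directory for trimmed reads
    qc_dir=$3       # output directory for fastp and FastQC reports

    run mkdir -p "$clean_dir" "$qc_dir"

    for r1 in "$raw_dir"/*_1.fastq; do
        [ -e "$r1" ] || continue                 # no match: the glob stays literal
        sample=$(basename "$r1" _1.fastq)
        r2="$raw_dir/${sample}_2.fastq"          # assumed mate-file naming

        # fastp with default trimming/filtering; -j/-h set the report paths.
        run fastp -i "$r1" -I "$r2" \
            -o "$clean_dir/${sample}_1.clean.fastq" \
            -O "$clean_dir/${sample}_2.clean.fastq" \
            -j "$qc_dir/${sample}.fastp.json" \
            -h "$qc_dir/${sample}.fastp.html"

        # FastQC report on the trimmed reads (the output directory must exist).
        run fastqc -o "$qc_dir" \
            "$clean_dir/${sample}_1.clean.fastq" \
            "$clean_dir/${sample}_2.clean.fastq"
    done
}
```

For example, with DRY_RUN=1 set, qc_trim data/FASTQ data/CLEAN_FASTQ data/QC_REPORTS lists one fastp and one fastqc command per sample.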
Usage:
./qc_trim_FASTQ.sh <Path_to_RAW_FASTQ_Dir> <Path_to_clean_fastq_dir> <Path_to_qc_reports> <Path_to_logs>
This bash script aligns the clean FASTQ files to the reference genome and produces
sorted, indexed BAM files.
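The alignment of one sample might be sketched as below. This assumes paired-end reads aligned with bwa mem, and it assumes samtools is available for sorting and indexing; samtools is not named in this README's environment list, so treat it as an assumption. The function name align_sample and the DRY_RUN switch are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. In bash, consider `set -o pipefail` so a bwa failure is not
# masked by a successful samtools sort at the end of the pipe.

align_sample() {
    ref=$1    # reference FASTA
    r1=$2     # cleaned read-1 FASTQ
    r2=$3     # cleaned read-2 FASTQ
    bam=$4    # output path for the sorted BAM

    # DRY_RUN=1 (hypothetical switch) prints the plan without running anything.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "bwa mem $ref $r1 $r2 | samtools sort -o $bam -"
        echo "samtools index $bam"
        return 0
    fi

    # Build the BWA index once; bwa index writes <ref>.bwt among other files.
    [ -e "$ref.bwt" ] || bwa index "$ref"

    # Align, coordinate-sort, then index the BAM.
    bwa mem "$ref" "$r1" "$r2" | samtools sort -o "$bam" -
    samtools index "$bam"
}
```

A caller would loop over the cleaned FASTQ pairs and invoke align_sample once per sample, writing each BAM to the output directory.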
Usage:
./align_FASTQ_reads.sh <clean_FASTQ_dir> <reference_sequence.fa> <Output_Directory> <Log_Directory>
This bash script calls all variants from the aligned BAM files using default parameters
in bcftools. It produces VCF files saved to the data directory under variants.
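Per-sample variant calling with bcftools is commonly written as an mpileup piped into call. The sketch below assumes that pattern with the multiallelic caller (-m) and variant-only output (-v); the exact options used by the project's script are not stated here, and the names call_variants and DRY_RUN are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: one compressed, indexed VCF per BAM file in a directory.

call_variants() {
    bam_dir=$1    # directory with sorted, indexed BAM files
    ref=$2        # reference FASTA used for alignment
    vcf_dir=$3    # output directory for per-sample VCFs

    # DRY_RUN=1 (hypothetical switch) prints the plan without running anything.
    [ "${DRY_RUN:-0}" = "1" ] || mkdir -p "$vcf_dir"

    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue                # no match: the glob stays literal
        sample=$(basename "$bam" .bam)
        vcf="$vcf_dir/$sample.vcf.gz"

        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "bcftools mpileup -f $ref $bam | bcftools call -mv -Oz -o $vcf"
            echo "bcftools index $vcf"
            continue
        fi

        # Pile up reads against the reference, call variants, write bgzipped VCF.
        bcftools mpileup -f "$ref" "$bam" | bcftools call -mv -Oz -o "$vcf"
        bcftools index "$vcf"                    # index is needed for a later merge
    done
}
```

Indexing each VCF here is a design choice: bcftools merge expects indexed, compressed inputs, so doing it at call time keeps the next step simple.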
Usage:
./variant_calling.sh <bam_dir> <reference.fa> <vcf_output_dir> <log_dir>
This bash script merges all VCF files and saves the result to merged.vcf.gz under the variants directory.
It then filters the variant calls, removing variants with quality scores below 30 or read depths outside the range 10-200.
The filtered version is saved to filtered.vcf.gz under the variants directory.
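The merge-then-filter logic can be sketched with bcftools merge and a bcftools view include expression. Two assumptions are made here: "read depth" is interpreted as the site-level INFO/DP tag (a per-sample FORMAT/DP filter would need a different expression), and the per-sample VCFs are bgzipped and indexed. The names postprocess_vcf, FILTER_EXPR, run, and DRY_RUN are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. DRY_RUN=1 (hypothetical switch) prints commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Keep sites with QUAL >= 30 and total depth between 10 and 200 (INFO/DP assumed).
FILTER_EXPR='QUAL>=30 && INFO/DP>=10 && INFO/DP<=200'

postprocess_vcf() {
    vcf_dir=$1

    # Collect per-sample VCFs in "$@", skipping outputs from an earlier run.
    set --
    for f in "$vcf_dir"/*.vcf.gz; do
        case $(basename "$f") in
            merged.vcf.gz|filtered.vcf.gz) continue ;;
        esac
        [ -e "$f" ] && set -- "$@" "$f"
    done

    # bcftools merge needs at least two input files.
    if [ "$#" -lt 2 ]; then
        echo "need at least two per-sample VCFs in $vcf_dir" >&2
        return 1
    fi

    run bcftools merge -Oz -o "$vcf_dir/merged.vcf.gz" "$@"
    run bcftools view -i "$FILTER_EXPR" -Oz \
        -o "$vcf_dir/filtered.vcf.gz" "$vcf_dir/merged.vcf.gz"
}
```

Note that the dry-run output drops shell quoting, so the printed filter expression is for reading, not for pasting back into a terminal unquoted.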
Usage:
./postprocess_vcf.sh <VCF_DIR>
This Rscript loads the filtered VCF data to perform:
- Variant summary statistics
- Basic QC
- Depth / Quality visualization
- Missingness analysis
The goal of this script is to be used for visualization, exploration, and description of data to be
included in reporting. All plots are saved under the plots directory.
This Rscript loads the VCF to:
- Convert the VCF to GDS format
- Filter SNPs
- Perform LD pruning
- Run principal component analysis (PCA)
- Visualize population structure
The goal of this script is to be used for visualization, exploration, and description of data to be
included in reporting. All plots are saved under the plots directory.
This report presents a reproducible population genomics workflow for Chinook salmon (Oncorhynchus tshawytscha) using publicly available low-coverage whole-genome sequencing (lcWGS) data. The analysis was inspired by the study "Genomic evidence for domestication selection in three hatchery populations of Chinook salmon", which investigated genetic differentiation between hatchery and wild salmon populations across Southeast Alaska. The study aimed to use genetic variation to explain the mechanisms behind fitness reduction and domestication selection in hatchery fish. All data were retrieved from NCBI's SRA database under the project ID PRJNA1069051. The original study examined population structure using a cohort of 192 individuals across three hatchery-wild population pairs grouped by location. This report adapts the general methodology to a smaller, educational-style dataset.
The primary input required for this workflow is a text or CSV file containing the SRA accession numbers of the sequencing samples to be analyzed. The pipeline uses these accession identifiers to automatically retrieve raw sequencing data from the NCBI Sequence Read Archive (SRA).
The reference genome path is defined in the pipeline configuration settings and may need to be updated when analyzing a different organism or genome assembly. To do so, open config.sh in the scripts directory in any text editor and adjust the file path of REF. Reference genomes should be downloaded into the project folder before running the pipeline. For this analysis, the Chinook salmon reference genome assembly is set as the default. The original publication referenced the Chinook salmon genome assembly Otsh_v1.0; GFA_002872995. Since that assembly is no longer the current standard reference, this workflow uses the updated Chinook salmon genome assembly GCA_002872995.1.
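As an illustration of the REF setting, a config fragment might look like the sketch below. Only the variable name REF comes from this README; the default path is a placeholder, and the check_ref helper is a hypothetical addition that makes a missing reference fail early with a clear message.

```shell
#!/usr/bin/env bash
# Illustrative fragment in the style of scripts/config.sh.
# The default path below is a placeholder, not the project's actual setting.
REF="${REF:-reference/chinook_reference.fa}"

# Fail early with a clear message if the reference FASTA is missing or empty.
check_ref() {
    if [ ! -s "$1" ]; then
        echo "Reference genome not found: $1 (edit REF in scripts/config.sh)" >&2
        return 1
    fi
}
```

Because the assignment only applies when REF is unset, a different assembly can also be selected for a single run by exporting REF before starting the pipeline, without editing the file.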
The bash scripts require the following tools, which can be installed together in a single conda environment:
conda create -n chinook_pipeline \
bcftools \
bwa \
fastqc \
fastp \
sra-tools \
-c bioconda -c conda-forge
conda activate chinook_pipeline
The R scripts require the following packages:
install.packages(c(
"vcfR",
"tidyverse",
"ggplot2",
"here",
"knitr"
))
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SNPRelate")
For any large-scale analysis, this pipeline requires a large amount of available storage, because intermediate data are saved in several formats so that each step of the workflow can be explored. Retaining these files allows quality-control checks of sequence quality and alignment after the pipeline has run.
Future development of this workflow could focus on improving scalability, automation, and computational efficiency for larger population genomics datasets. Intermediate files are currently retained on purpose to support reproducibility and quality-control inspection; future versions may add automated cleanup and compression to reduce storage requirements. Additional improvements may include workflow parallelization, containerized environments (Docker/Singularity), expanded variant filtering options, and support for larger cohort-based analyses. A possible next step is to use this project structure to build a Nextflow workflow, which would allow parallelization. This could be added as a branch of the current project so that both options remain available.
Gina Magro Bioinformatics / Computational Biology Pipeline Project