gmagro24/Population_Comparasion_Pipelines

A Reproducible Population Genomics Workflow for Chinook Salmon

Gina Magro 2026-05-20

Salmon Genetic Population Comparison Pipeline

This project presents a reproducible population genomics workflow built on publicly available sequencing data from the NCBI Sequence Read Archive (SRA). It uses Chinook salmon (Oncorhynchus tshawytscha) as the demonstration species and reference genome, and is inspired by the study "Genomic evidence for domestication selection in three hatchery populations of Chinook salmon", which investigated genetic differentiation between hatchery and wild salmon populations across Southeast Alaska. The full study data are available through NCBI's SRA database under the project ID PRJNA1069051. Further information and analysis can be found in Chinook_6Sample_Data_Report.html.

Workflow Structure

All scripts can be called independently with manually supplied arguments. Each script also has default arguments that match the workflow structure in the run_pipeline.sh script.
Full Workflow Usage:

 ./run_pipeline.sh <Accession_list.csv> 
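
The layout of the accession list is not spelled out above. As a minimal sketch, assuming one run accession per line (optionally with a header line or extra comma-separated columns), the input could be sanity-checked before a full run. `read_accessions` is a hypothetical helper for illustration and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the valid run accessions found in a list file.
# Assumption: the accession is the first comma-separated field on each line.
read_accessions() {
    local file="$1" line acc
    # "|| [ -n "$line" ]" also handles a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Keep the first field, then drop whitespace, carriage returns, quotes.
        acc="$(printf '%s' "${line%%,*}" | tr -d '[:space:]"')"
        # Accept SRR/ERR/DRR run accessions; headers and blanks are skipped.
        if [[ "$acc" =~ ^[SED]RR[0-9]+$ ]]; then
            printf '%s\n' "$acc"
        fi
    done < "$file"
}
```

For example, `read_accessions Merged_SraAccList.csv | wc -l` would report how many usable accessions the list contains before any download starts.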

Project Structure

chinook_salmon/  
│  
├── scripts/  
│ ├── sra_to_FASTQ.sh  
│ ├── qc_trim_FASTQ.sh  
│ ├── align_FASTQ_reads.sh  
│ ├── variant_calling.sh  
│ ├── postprocess_vcf.sh  
│ ├── population_structure.R  
│ ├── variant_summary.R  
│ └── config.sh  
│  
├── data/  
├── logs/  
├── plots/    
├── results/  
│   
├── run_pipeline.sh  
│  
├── SraAccList_Chickamin.csv  
├── SraAccList_Whitman.csv  
└── Merged_SraAccList.csv 

Workflow Structure

The workflow is automated in the steps described below. Each script can also be run independently.

1. sra_to_FASTQ.sh

This Bash script takes a list of SRA accession numbers as its argument. It temporarily downloads the SRA accessions and converts them to FASTQ files saved in the FASTQ directory under data. It includes logging, skipping of already-completed samples, and cleanup.
Usage:

./sra_to_FASTQ.sh <accessionFile_OR_AccessionFileList> 
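
The download, skip, and cleanup behaviour described above could look roughly like the following. This is a sketch under stated assumptions, not the script's actual code: it assumes paired-end runs, uses `prefetch` and `fasterq-dump` from sra-tools, and the names `FASTQ_DIR`, `SRA_TMP`, `is_complete`, and `fetch_one` are illustrative.

```shell
#!/usr/bin/env bash
# Assumed locations; the real script may use different paths.
FASTQ_DIR="${FASTQ_DIR:-data/FASTQ}"
SRA_TMP="${SRA_TMP:-data/sra_tmp}"

# True when both mate files for an accession exist and are non-empty.
is_complete() {
    local acc="$1"
    [ -s "$FASTQ_DIR/${acc}_1.fastq" ] && [ -s "$FASTQ_DIR/${acc}_2.fastq" ]
}

# Download one run, convert it to FASTQ, then remove the temporary download.
fetch_one() {
    local acc="$1"
    if is_complete "$acc"; then
        echo "skip $acc (FASTQ already present)"
        return 0
    fi
    mkdir -p "$FASTQ_DIR" "$SRA_TMP"
    prefetch --output-directory "$SRA_TMP" "$acc" || return 1
    fasterq-dump --split-files --outdir "$FASTQ_DIR" "$SRA_TMP/$acc" || return 1
    rm -rf "${SRA_TMP:?}/$acc"   # ":?" guards against an empty SRA_TMP
}
```

Checking for finished output before downloading is what makes a rerun after an interruption cheap: only the missing samples are fetched again.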

2. qc_trim_FASTQ.sh

This Bash script reads raw FASTQ files, performs quality control, and trims reads using the default parameters. It produces cleaned FASTQ files, ready for alignment, saved in the CLEAN_FASTQ directory under data.
Data quality reports generated by FastQC are saved in the QC_REPORTS directory under data.
Usage:

./qc_trim_FASTQ.sh <Path_to_RAW_FASTQ_Dir> <Path_to_clean_fastq_dir> <Path_to_qc_reports> <Path_to_logs> 
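
As a rough illustration of this step for a single paired-end sample, the sketch below trims with fastp at its default settings and then runs FastQC on the cleaned reads. The `_1.fastq`/`_2.fastq` input naming, the `.clean.fastq` output suffix, the report file names, and the function name `trim_pair` are all assumptions; whether the actual script runs FastQC on raw or cleaned reads is not stated here.

```shell
#!/usr/bin/env bash
# Trim one read pair and write QC reports (illustrative sketch only).
# Usage: trim_pair <raw_R1.fastq> <clean_dir> <qc_dir>
trim_pair() {
    local r1="$1" clean_dir="$2" qc_dir="$3"
    local r2="${r1%_1.fastq}_2.fastq"     # mate file derived from R1's name
    local sample
    sample="$(basename "$r1" _1.fastq)"
    mkdir -p "$clean_dir" "$qc_dir"       # fastqc needs an existing out dir
    fastp -i "$r1" -I "$r2" \
          -o "$clean_dir/${sample}_1.clean.fastq" \
          -O "$clean_dir/${sample}_2.clean.fastq" \
          --html "$qc_dir/${sample}_fastp.html" \
          --json "$qc_dir/${sample}_fastp.json" || return 1
    fastqc -o "$qc_dir" \
          "$clean_dir/${sample}_1.clean.fastq" \
          "$clean_dir/${sample}_2.clean.fastq"
}
```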

3. align_FASTQ_reads.sh

This Bash script aligns clean FASTQ files to the reference genome and produces sorted, indexed BAM files.
Usage:

./align_FASTQ_reads.sh <clean_FASTQ_dir> <reference_sequence.fa> <Output_Directory> <Log_Directory>  
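
One common way to get from cleaned reads to a sorted, indexed BAM is `bwa mem` piped into `samtools sort`, followed by `samtools index`. The sketch below shows that pattern for one sample. It is an assumption about the approach, not the script's confirmed contents: in particular, samtools is assumed to be installed (it is not in the conda command listed in this README), and the function name, read-group values, and output naming are illustrative.

```shell
#!/usr/bin/env bash
set -o pipefail   # make a bwa failure fail the whole pipe, not just samtools

# Align one read pair and write <sample>.sorted.bam plus its index.
# Usage: align_pair <R1.fastq> <R2.fastq> <reference.fa> <out_dir>
align_pair() {
    local r1="$1" r2="$2" ref="$3" out_dir="$4"
    local sample
    sample="$(basename "$r1")"
    sample="${sample%%_*}"        # text before the first underscore
    mkdir -p "$out_dir"
    # The reference must be indexed once beforehand with: bwa index "$ref"
    # The read group tags each read with its sample name for later merging.
    bwa mem -R "@RG\tID:${sample}\tSM:${sample}" "$ref" "$r1" "$r2" \
        | samtools sort -o "$out_dir/${sample}.sorted.bam" - || return 1
    samtools index "$out_dir/${sample}.sorted.bam"
}
```

Piping directly into the sort avoids writing an intermediate SAM file, which matters for storage on larger cohorts.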

4. variant_calling.sh

This Bash script calls variants from the aligned BAM files using the default bcftools parameters. It produces VCF files saved in the variants directory under data.
Usage:

./variant_calling.sh <bam_dir> <reference.fa> <vcf_output_dir> <log_dir> 
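
A per-sample bcftools call typically chains `bcftools mpileup` into `bcftools call`. The following is a hedged sketch of that pattern; the choice of the multiallelic caller with variants-only output (`-mv`), the `.sorted.bam` input suffix, and the function name `call_variants` are assumptions rather than details taken from the script.

```shell
#!/usr/bin/env bash
set -o pipefail   # a failing mpileup should fail the whole pipe

# Call variants for one BAM and write a compressed, indexed VCF.
# Usage: call_variants <sample.sorted.bam> <reference.fa> <vcf_dir>
call_variants() {
    local bam="$1" ref="$2" vcf_dir="$3"
    local sample
    sample="$(basename "$bam" .sorted.bam)"
    mkdir -p "$vcf_dir"
    # mpileup computes genotype likelihoods; call -m is the multiallelic
    # caller, -v keeps variant sites only, -Oz writes bgzip-compressed VCF.
    bcftools mpileup -f "$ref" "$bam" \
        | bcftools call -mv -Oz -o "$vcf_dir/${sample}.vcf.gz" || return 1
    bcftools index "$vcf_dir/${sample}.vcf.gz"
}
```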

5. postprocess_vcf.sh

This Bash script merges all VCF files and saves the result as merged.vcf.gz in the variants directory. It then filters the variant calls, removing those with quality scores below 30 or read depths outside the range 10-200. The filtered version is saved as filtered.vcf.gz in the variants directory.
Usage:

 ./postprocess_vcf.sh <VCF_DIR>  
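
The merge and filter thresholds above can be expressed with `bcftools merge` and a `bcftools view -i` include expression. This sketch assumes the depth cut-off applies to the site-level `INFO/DP` field (the script may instead filter per-sample depth) and that the per-sample VCFs are already indexed, which `bcftools merge` requires. The variable and function names are illustrative.

```shell
#!/usr/bin/env bash
# Thresholds taken from the description above.
MIN_QUAL=30
MIN_DP=10
MAX_DP=200

# Build the bcftools include expression from the thresholds.
filter_expr() {
    printf 'QUAL>=%s && INFO/DP>=%s && INFO/DP<=%s' \
        "$MIN_QUAL" "$MIN_DP" "$MAX_DP"
}

# Merge per-sample VCFs, then keep only sites passing the filter.
postprocess() {
    local vcf_dir="$1" f inputs=()
    for f in "$vcf_dir"/*.vcf.gz; do
        case "$(basename "$f")" in
            merged.vcf.gz|filtered.vcf.gz) ;;   # skip outputs of earlier runs
            *) inputs+=("$f") ;;
        esac
    done
    bcftools merge -Oz -o "$vcf_dir/merged.vcf.gz" "${inputs[@]}" || return 1
    bcftools view -i "$(filter_expr)" -Oz \
        -o "$vcf_dir/filtered.vcf.gz" "$vcf_dir/merged.vcf.gz" || return 1
    bcftools index -f "$vcf_dir/filtered.vcf.gz"
}
```

Excluding `merged.vcf.gz` and `filtered.vcf.gz` from the input list keeps a second run from merging its own previous output back in.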

6. variant_summary.R

This Rscript loads the filtered VCF data to perform:
- Variant summary statistics
- Basic QC
- Depth / quality visualization
- Missingness analysis

The goal of this script is to support visualization, exploration, and description of the data for reporting. All plots are saved under the plots directory.

7. population_structure.R

This Rscript loads the VCF to:
- Convert the VCF to GDS format
- Filter SNPs
- Perform LD pruning
- Run Principal Component Analysis (PCA)
- Visualize population structure

The goal of this script is to support visualization, exploration, and description of the data for reporting. All plots are saved under the plots directory.

Report Chinook_6Sample_Data_Report

This report presents a reproducible population genomics workflow for Chinook salmon (Oncorhynchus tshawytscha) using publicly available low-coverage whole-genome sequencing (lcWGS) data. The analysis was inspired by the study "Genomic evidence for domestication selection in three hatchery populations of Chinook salmon", which investigated genetic differentiation between hatchery and wild salmon populations across Southeast Alaska. The goal of that study was to use genetic variation to explain the mechanisms behind fitness reduction and domestication selection in hatchery fish. All data were retrieved from NCBI's SRA database under the project ID PRJNA1069051. The original study examined population structure using a cohort of 192 individuals across 3 hatchery-wild population pairs grouped by location. This report adapts the general methodology to a smaller, educational-style dataset.

Pipeline Configuration / User Input

The primary input required for this workflow is a text/csv file containing the SRA accession numbers of the sequencing samples to be analyzed. The pipeline uses these accession identifiers to automatically retrieve raw sequencing data from the NCBI Sequence Read Archive (SRA). The reference genome path is defined in the pipeline configuration settings and may need to be updated when analyzing a different organism or genome assembly. To do this, open config.sh in the scripts directory with any text editor and adjust the file path assigned to REF. Reference genomes should be downloaded into the project folder before running the pipeline. For this analysis, the Chinook salmon reference genome assembly is the default. The original publication referenced the Chinook salmon genome assembly Otsh_v1.0; GFA_002872995. Since that assembly is no longer the current standard reference, this workflow uses the updated Chinook salmon genome assembly GCA_002872995.1.
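
As an illustration of the configuration described above, a `config.sh` could define `REF` with an overridable default and check that the file exists before any step runs. Only the variable name `REF` comes from this README; the path below is a placeholder and `check_ref` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch of scripts/config.sh. The default path is a placeholder, not the
# repository's actual value; an exported REF overrides it.
REF="${REF:-data/reference/your_assembly.fa}"

# Fail early with a clear message if the reference genome is missing.
check_ref() {
    if [ ! -s "$REF" ]; then
        echo "Reference not found: $REF (edit REF in scripts/config.sh)" >&2
        return 1
    fi
}
```

With this layout, a different assembly can be tried for one run without editing the file, for example `REF=/path/to/other.fa ./run_pipeline.sh <Accession_list.csv>`, provided the pipeline scripts source the config rather than hard-coding the path.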

Required Packages

The Bash scripts require the following tools, which can be installed together in a single conda environment:

conda create -n chinook_pipeline \
    bcftools \
    bwa \
    fastqc \
    fastp \
    sra-tools \
    -c bioconda -c conda-forge

conda activate chinook_pipeline

The R scripts require the following packages:

install.packages(c(
  "vcfR",
  "tidyverse",
  "ggplot2",
  "here",
  "knitr"
))

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SNPRelate")

Known Limitations

For any large-scale analysis, this pipeline will require a large amount of available storage, because data are saved in several intermediate formats so that each step of the workflow can be explored. Retaining these files allows quality-control checks of sequence quality and alignment after the pipeline has run.

Future Improvements

Future development of this workflow could focus on improving scalability, automation, and computational efficiency for larger population genomics datasets. Intermediate files are currently retained on purpose to support reproducibility and quality-control inspection; future versions may incorporate automated cleanup and compression strategies to reduce storage requirements. Additional improvements may include workflow parallelization, containerized environments (Docker/Singularity), expanded variant filtering options, and support for larger cohort-based analyses. A next step could be to use this project structure to build a Nextflow workflow, which would allow for parallelization. This could be added as a branch of the current project so that both options remain available.

Author

Gina Magro
Bioinformatics / Computational Biology Pipeline Project

About

This project presents a reproducible population genomics workflow built on publicly available sequencing data from the NCBI Sequence Read Archive (SRA). Bash scripts obtain and process the data, while R scripts produce the visualizations and population analysis.
