Skip to content

parklab/Illumina-TwoColorBias

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NovaSeq 6000 bias correction script

This repository contains code to correct the bias towards T>G/A>C single-base substitutions in NovaSeq6000 data.

Installation

  1. Clone this repository to a folder, and add the path to the cloned folder to your PATH variable (i.e. PATH=$PATH:/path/to/cloned/repository/).
  2. Make sure you have the following installed (or available)
    • R
    • snakemake
    • bedtools
  3. Update the config.yaml with the reference genome that you used to make mutation calls

R package requirements

Please make sure you have the following packages installed. This code runs with R version 4.4.2 (2024-10-31) -- "Pile of Leaves" or higher.

CRAN

  • optparse
  • tidyverse (includes dplyr, tidyr, purrr, tibble, readr)
  • data.table
  • ggplot2
  • stringr
  • scales
  • cowplot
  • pheatmap
  • openxlsx

Bioconductor

  • GenomicRanges
  • Biostrings
  • BSgenome and a genome assembly package (e.g. BSgenome.Hsapiens.UCSC.hg19)
  • seqLogo

Instructions

Upstream processing

Please see the individual folders for different upstream processing steps

  • alignment Alignment
  • variant_calling Variant calling
  • pileup Pileup of mutations
  • analysis Different analyses, namely strand bias and platform enrichment

Training position weight matrices (PWMs) to model sequence biases

If you have variants from (ideally) sequencing the sample sample across two platforms (i.e., NovaSeq 6000 vs HiSeq X10, as in the paper), you can train your own model

$ Rscript src/tg_bias_removal_main.r -h
Usage: src/tg_bias_removal_main.r [options]

Builds a per-SBS sequence-context motif model from paired NovaSeq / HiSeq mutation calls and filters putative T>G sequencing artifacts.

Options:
        --novaseq-mutations=NOVASEQ-MUTATIONS
                Path to NovaSeq mutations file (tab-delimited).
                           Required columns: chrom, start, end, ref, alt,
                           base_conversion, af, dp. [required]

        --hiseq-mutations=HISEQ-MUTATIONS
                Path to HiSeq mutations file (tab-delimited).
                           Same required columns as --novaseq-mutations. [required]

        --outdir=OUTDIR
                Output directory. Created if absent. [default: .]

        --run-name=RUN-NAME
                Prefix for all output files. [default: run]

        --genome=GENOME
                BSgenome package name. Must be installed.
                           [default: BSgenome.Hsapiens.UCSC.hg19]

        --window-size=WINDOW-SIZE
                Sequence window width (must be odd). [default: 41]

        --alpha=ALPHA
                Likelihood-ratio filtering threshold. [default: 0.05]

        --extract-sequences
                Extract ref_seq from --genome rather than reading
                           from input files.

        --panels=PANELS
                Comma-separated list of panels to generate.
                           A = pre-correction counts, B = AF distributions,
                           C = trinucleotide spectrum (pre-correction),
                           D = k-mer counts, E = post-correction counts + AF,
                           F = base composition pointrange (bootstrap),
                           G = trinucleotide spectrum (post-correction).
                           [default: A,B,C,D,E,F,G]

        --n-bootstraps=N-BOOTSTRAPS
                Number of bootstrap resamples for Panel F. [default: 50]

        --empirical-gc
                Compute GC background frequencies empirically from
                           the ref_seq windows rather than using human genome
                           defaults (G/C = 0.205, A/T = 0.295). Recommended
                           for non-human genomes.

        -h, --help
                Show this help message and exit

A PWM trained in this paper is available at bias_models/fu_et_al.ns6000_vs_hsx10.pwm_model.rds. The mutations used to train this model are derived from Tables S1.2 and S1.5 from Viswanadham et al. 2025 (10.1016/j.celrep.2025.116458).

Applying PWMs to remove artifacts

$ Rscript src/tg_apply_pwm.r -h
Usage: src/tg_apply_pwm.r [options]

Applies a pre-trained per-SBS PWM model to new NovaSeq-only mutations and partitions them into kept / removed without requiring HiSeq data.

Options:
        --mutations=MUTATIONS
                Path to new NovaSeq mutations file (tab-delimited).
                           Required columns: chrom, start, end, ref, alt,
                           base_conversion, AF, dp. [required]

        --pwm-model=PWM-MODEL
                Path to .pwm_model.rds produced by tg_bias_removal.R.
                           [required]

        --outdir=OUTDIR
                Output directory. Created if absent. [default: .]

        --run-name=RUN-NAME
                Prefix for all output files. [default: run]

        --genome=GENOME
                BSgenome package name. Must be installed.
                           [default: BSgenome.Hsapiens.UCSC.hg19]

        --window-size=WINDOW-SIZE
                Sequence window width (must be odd). Must match the
                           window used to train the PWM model. [default: 41]

        --alpha=ALPHA
                Likelihood-ratio filtering threshold. [default: 0.05]

        --extract-sequences
                Extract ref_seq from --genome rather than reading
                           from input file.

        --panels=PANELS
                Comma-separated list of diagnostic panels to generate.
                           A = mutation counts, B = AF distributions,
                           C = likelihood ratio distributions,
                           D = trinucleotide spectrum (kept vs removed),
                           E = sequence motif bits,
                           F = trinucleotide spectrum before vs after correction.
                           [default: A,B,C,D,E,F]

        --threshold-mode=THRESHOLD-MODE
                How to determine the filtering threshold per SBS type.
                           'saved'   = use the threshold saved during training
                                       (exact reproduction of training results).
                           'rederive' = re-score the saved null (HiSeq) sequences
                                       against the model PWMs and compute a fresh
                                       threshold at --alpha (allows different alpha
                                       than training).
                           [default: saved]

        -h, --help
                Show this help message and exit

Citation

If you use this pipeline, please cite

A recurrent sequencing artifact on Illumina sequencers with two-color fluorescent dye chemistry and its impact on somatic variant detection
Beverly J. Fu, Vinayak V Viswanadham, Dominika Maziec, Hu Jin, Peter J. Park
bioRxiv 2025.09.27.678978; doi: https://doi.org/10.1101/2025.09.27.678978

This work was recently accepted, so an updated citation will be provided.

About

A repository containing code to correct T>N mutation bias in NovaSeq 6000 callsets

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors