NovaSeq 6000 bias correction script

This repository contains code to correct the bias towards T>G/A>C single-base substitutions in NovaSeq6000 data.

Installation

Clone this repository to a folder, and add the path to the cloned folder to your PATH variable (i.e. PATH=$PATH:/path/to/cloned/repository/).
Make sure you have the following installed (or available)
- R
- snakemake
- bedtools
Update the config.yaml with the reference genome that you used to make mutation calls

R package requirements

Please make sure you have the following packages installed. This code runs with R version 4.4.2 (2024-10-31) -- "Pile of Leaves" or higher.

CRAN

optparse
tidyverse (includes dplyr, tidyr, purrr, tibble, readr)
data.table
ggplot2
stringr
scales
cowplot
pheatmap
openxlsx

Bioconductor

GenomicRanges
Biostrings
BSgenome and a genome assembly package (e.g. BSgenome.Hsapiens.UCSC.hg19)
seqLogo

Instructions

Upstream processing

Please see the individual folders for different upstream processing steps

alignment Alignment
variant_calling Variant calling
pileup Pileup of mutations
analysis Different analyses, namely strand bias and platform enrichment

Training position weight matrices (PWMs) to model sequence biases

If you have variants from (ideally) sequencing the sample sample across two platforms (i.e., NovaSeq 6000 vs HiSeq X10, as in the paper), you can train your own model

$ Rscript src/tg_bias_removal_main.r -h
Usage: src/tg_bias_removal_main.r [options]

Builds a per-SBS sequence-context motif model from paired NovaSeq / HiSeq mutation calls and filters putative T>G sequencing artifacts.

Options:
        --novaseq-mutations=NOVASEQ-MUTATIONS
                Path to NovaSeq mutations file (tab-delimited).
                           Required columns: chrom, start, end, ref, alt,
                           base_conversion, af, dp. [required]

        --hiseq-mutations=HISEQ-MUTATIONS
                Path to HiSeq mutations file (tab-delimited).
                           Same required columns as --novaseq-mutations. [required]

        --outdir=OUTDIR
                Output directory. Created if absent. [default: .]

        --run-name=RUN-NAME
                Prefix for all output files. [default: run]

        --genome=GENOME
                BSgenome package name. Must be installed.
                           [default: BSgenome.Hsapiens.UCSC.hg19]

        --window-size=WINDOW-SIZE
                Sequence window width (must be odd). [default: 41]

        --alpha=ALPHA
                Likelihood-ratio filtering threshold. [default: 0.05]

        --extract-sequences
                Extract ref_seq from --genome rather than reading
                           from input files.

        --panels=PANELS
                Comma-separated list of panels to generate.
                           A = pre-correction counts, B = AF distributions,
                           C = trinucleotide spectrum (pre-correction),
                           D = k-mer counts, E = post-correction counts + AF,
                           F = base composition pointrange (bootstrap),
                           G = trinucleotide spectrum (post-correction).
                           [default: A,B,C,D,E,F,G]

        --n-bootstraps=N-BOOTSTRAPS
                Number of bootstrap resamples for Panel F. [default: 50]

        --empirical-gc
                Compute GC background frequencies empirically from
                           the ref_seq windows rather than using human genome
                           defaults (G/C = 0.205, A/T = 0.295). Recommended
                           for non-human genomes.

        -h, --help
                Show this help message and exit

A PWM trained in this paper is available at bias_models/fu_et_al.ns6000_vs_hsx10.pwm_model.rds. The mutations used to train this model are derived from Tables S1.2 and S1.5 from Viswanadham et al. 2025 (10.1016/j.celrep.2025.116458).

Applying PWMs to remove artifacts

$ Rscript src/tg_apply_pwm.r -h
Usage: src/tg_apply_pwm.r [options]

Applies a pre-trained per-SBS PWM model to new NovaSeq-only mutations and partitions them into kept / removed without requiring HiSeq data.

Options:
        --mutations=MUTATIONS
                Path to new NovaSeq mutations file (tab-delimited).
                           Required columns: chrom, start, end, ref, alt,
                           base_conversion, AF, dp. [required]

        --pwm-model=PWM-MODEL
                Path to .pwm_model.rds produced by tg_bias_removal.R.
                           [required]

        --outdir=OUTDIR
                Output directory. Created if absent. [default: .]

        --run-name=RUN-NAME
                Prefix for all output files. [default: run]

        --genome=GENOME
                BSgenome package name. Must be installed.
                           [default: BSgenome.Hsapiens.UCSC.hg19]

        --window-size=WINDOW-SIZE
                Sequence window width (must be odd). Must match the
                           window used to train the PWM model. [default: 41]

        --alpha=ALPHA
                Likelihood-ratio filtering threshold. [default: 0.05]

        --extract-sequences
                Extract ref_seq from --genome rather than reading
                           from input file.

        --panels=PANELS
                Comma-separated list of diagnostic panels to generate.
                           A = mutation counts, B = AF distributions,
                           C = likelihood ratio distributions,
                           D = trinucleotide spectrum (kept vs removed),
                           E = sequence motif bits,
                           F = trinucleotide spectrum before vs after correction.
                           [default: A,B,C,D,E,F]

        --threshold-mode=THRESHOLD-MODE
                How to determine the filtering threshold per SBS type.
                           'saved'   = use the threshold saved during training
                                       (exact reproduction of training results).
                           'rederive' = re-score the saved null (HiSeq) sequences
                                       against the model PWMs and compute a fresh
                                       threshold at --alpha (allows different alpha
                                       than training).
                           [default: saved]

        -h, --help
                Show this help message and exit

Citation

If you use this pipeline, please cite

A recurrent sequencing artifact on Illumina sequencers with two-color fluorescent dye chemistry and its impact on somatic variant detection
Beverly J. Fu, Vinayak V Viswanadham, Dominika Maziec, Hu Jin, Peter J. Park
bioRxiv 2025.09.27.678978; doi: https://doi.org/10.1101/2025.09.27.678978

This work was recently accepted, so an updated citation will be provided.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NovaSeq 6000 bias correction script

Installation

R package requirements

CRAN

Bioconductor

Instructions

Upstream processing

Training position weight matrices (PWMs) to model sequence biases

Applying PWMs to remove artifacts

Citation

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
alignment		alignment
analysis		analysis
bias_models		bias_models
config		config
pileups		pileups
src		src
variant_calling		variant_calling
workflows		workflows
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

NovaSeq 6000 bias correction script

Installation

R package requirements

CRAN

Bioconductor

Instructions

Upstream processing

Training position weight matrices (PWMs) to model sequence biases

Applying PWMs to remove artifacts

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages