This repository contains code to correct the bias towards T>G/A>C single-base substitutions in NovaSeq6000 data.
- Clone this repository to a folder, and add the path to the cloned folder to your
PATHvariable (i.e.PATH=$PATH:/path/to/cloned/repository/). - Make sure you have the following installed (or available)
Rsnakemakebedtools
- Update the
config.yamlwith the reference genome that you used to make mutation calls
Please make sure you have the following packages installed. This code runs with R version 4.4.2 (2024-10-31) -- "Pile of Leaves" or higher.
optparsetidyverse(includes dplyr, tidyr, purrr, tibble, readr)data.tableggplot2stringrscalescowplotpheatmapopenxlsx
GenomicRangesBiostringsBSgenomeand a genome assembly package (e.g.BSgenome.Hsapiens.UCSC.hg19)seqLogo
Please see the individual folders for different upstream processing steps
alignmentAlignmentvariant_callingVariant callingpileupPileup of mutationsanalysisDifferent analyses, namely strand bias and platform enrichment
If you have variants from (ideally) sequencing the sample sample across two platforms (i.e., NovaSeq 6000 vs HiSeq X10, as in the paper), you can train your own model
$ Rscript src/tg_bias_removal_main.r -h
Usage: src/tg_bias_removal_main.r [options]
Builds a per-SBS sequence-context motif model from paired NovaSeq / HiSeq mutation calls and filters putative T>G sequencing artifacts.
Options:
--novaseq-mutations=NOVASEQ-MUTATIONS
Path to NovaSeq mutations file (tab-delimited).
Required columns: chrom, start, end, ref, alt,
base_conversion, af, dp. [required]
--hiseq-mutations=HISEQ-MUTATIONS
Path to HiSeq mutations file (tab-delimited).
Same required columns as --novaseq-mutations. [required]
--outdir=OUTDIR
Output directory. Created if absent. [default: .]
--run-name=RUN-NAME
Prefix for all output files. [default: run]
--genome=GENOME
BSgenome package name. Must be installed.
[default: BSgenome.Hsapiens.UCSC.hg19]
--window-size=WINDOW-SIZE
Sequence window width (must be odd). [default: 41]
--alpha=ALPHA
Likelihood-ratio filtering threshold. [default: 0.05]
--extract-sequences
Extract ref_seq from --genome rather than reading
from input files.
--panels=PANELS
Comma-separated list of panels to generate.
A = pre-correction counts, B = AF distributions,
C = trinucleotide spectrum (pre-correction),
D = k-mer counts, E = post-correction counts + AF,
F = base composition pointrange (bootstrap),
G = trinucleotide spectrum (post-correction).
[default: A,B,C,D,E,F,G]
--n-bootstraps=N-BOOTSTRAPS
Number of bootstrap resamples for Panel F. [default: 50]
--empirical-gc
Compute GC background frequencies empirically from
the ref_seq windows rather than using human genome
defaults (G/C = 0.205, A/T = 0.295). Recommended
for non-human genomes.
-h, --help
Show this help message and exit
A PWM trained in this paper is available at bias_models/fu_et_al.ns6000_vs_hsx10.pwm_model.rds. The mutations used to train this model are derived from Tables S1.2 and S1.5 from Viswanadham et al. 2025 (10.1016/j.celrep.2025.116458).
$ Rscript src/tg_apply_pwm.r -h
Usage: src/tg_apply_pwm.r [options]
Applies a pre-trained per-SBS PWM model to new NovaSeq-only mutations and partitions them into kept / removed without requiring HiSeq data.
Options:
--mutations=MUTATIONS
Path to new NovaSeq mutations file (tab-delimited).
Required columns: chrom, start, end, ref, alt,
base_conversion, AF, dp. [required]
--pwm-model=PWM-MODEL
Path to .pwm_model.rds produced by tg_bias_removal.R.
[required]
--outdir=OUTDIR
Output directory. Created if absent. [default: .]
--run-name=RUN-NAME
Prefix for all output files. [default: run]
--genome=GENOME
BSgenome package name. Must be installed.
[default: BSgenome.Hsapiens.UCSC.hg19]
--window-size=WINDOW-SIZE
Sequence window width (must be odd). Must match the
window used to train the PWM model. [default: 41]
--alpha=ALPHA
Likelihood-ratio filtering threshold. [default: 0.05]
--extract-sequences
Extract ref_seq from --genome rather than reading
from input file.
--panels=PANELS
Comma-separated list of diagnostic panels to generate.
A = mutation counts, B = AF distributions,
C = likelihood ratio distributions,
D = trinucleotide spectrum (kept vs removed),
E = sequence motif bits,
F = trinucleotide spectrum before vs after correction.
[default: A,B,C,D,E,F]
--threshold-mode=THRESHOLD-MODE
How to determine the filtering threshold per SBS type.
'saved' = use the threshold saved during training
(exact reproduction of training results).
'rederive' = re-score the saved null (HiSeq) sequences
against the model PWMs and compute a fresh
threshold at --alpha (allows different alpha
than training).
[default: saved]
-h, --help
Show this help message and exit
If you use this pipeline, please cite
A recurrent sequencing artifact on Illumina sequencers with two-color fluorescent dye chemistry and its impact on somatic variant detection
Beverly J. Fu, Vinayak V Viswanadham, Dominika Maziec, Hu Jin, Peter J. Park
bioRxiv 2025.09.27.678978; doi: https://doi.org/10.1101/2025.09.27.678978
This work was recently accepted, so an updated citation will be provided.