LUMC/rose-dt


rose-dt

rose-dt (Reciprocal Soft-Clip Events Detection Tool) is a tool for detection of pairs of soft clip events whose soft-clipped sequences align to each other's non-soft-clipped parts (i.e. reciprocal). These events may be indicative of an underlying tandem duplication event.

Rationale

rose-dt was written for detection of tandem duplication events (relative to a reference sequence) with specific characteristics:

  • The event occurs within the boundary of a known region.
  • The exact location of the duplication may vary within that region.
  • The size of the duplicated sequence is not known a priori; it may be shorter or longer than the read length.
  • The duplicated sequences may not be exact copies of each other.

These are the characteristics of duplication events found in the FLT3 gene, which may be clinically relevant in the diagnosis of acute myeloid leukemia. In that gene, such duplications are known to occur within the exon 14-15 region and have been the target of several assays used in the diagnosis of the disease. Spencer et al. (2013) reported the analysis of targeted exome sequencing data using Pindel for detection of these events. Rustagi et al. (2016) developed an algorithm for detection of such events from whole exome sequencing data.

Here we develop rose-dt, a tool meant to detect tandem duplication events from mRNA-seq data. Though initially designed for detection of FLT3 tandem duplications, it has been shown to detect much larger duplication events as well, namely in the KMT2A gene. rose-dt is developed as part of the Hamlet pipeline but can be used as a standalone tool.

Description

rose-dt detects tandem duplication events using reciprocal soft clip events (RoSEs). We show a short description of the steps using the schema below.

Detection Scheme

Let us suppose we have a sample containing a tandem duplication event. If the duplicated sequence is longer than our read length (section A, left), our reads will have parts of themselves mapped to one duplicate of the sequence. It can be the 3'-end (red bar) or the 5'-end (yellow bar). If the duplicated sequence is shorter than our read length (section A, right), we still have cases where the duplicated sequence is partially covered by our reads on the 3'-end (purple bar) or the 5'-end (green bar). We may also have cases where our read spans the entire duplicated sequence (blue bar).

What happens when we map these reads to the reference sequence? Since our reference sequence does not contain the duplication (section B), the sequences that span the tandem duplications will be soft clipped (bent lines). This is generally true regardless of the length of the duplicated sequence. There is an additional case when reads span the entire duplicated sequence. For those cases, we may also see insertion events instead of soft clip events (blue bar); these are currently ignored by rose-dt.

What rose-dt does is try to map the soft clipped sequences back based on where they are found (section C). For soft clip sequences originating from the start of a read (termed 'start' type, shown as yellow and green bent lines), if they originate from a tandem duplication, we can expect them to map to the other end of the duplicated region. Close to the region at which these 'start' type sequences map, we also expect soft clipped sequences that go the other way around. These sequences originate from the end of reads (termed 'end' type, shown as red and purple bent lines) and can be mapped near the origins of the 'start' types.
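To make the 'start' and 'end' terminology concrete, here is a minimal Python sketch of how soft clips could be pulled out of a single alignment record. It is not taken from the rose-dt source (which is written in Rust); the function name `soft_clips` and its arguments are hypothetical. The clip positions follow the definitions given for `rose_start_pos` and `rose_end_pos` in the Usage section, assuming zero-based coordinates.

```python
import re

# One CIGAR operation: a length followed by an operation code
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(pos, cigar, seq):
    """Return (clip_type, clip_pos, clip_seq) tuples for one mapped read.

    pos   : zero-based leftmost mapped reference position of the read
    cigar : CIGAR string, e.g. "5S20M"
    seq   : read sequence as stored in the alignment record
    """
    # Hard-clipped bases are absent from seq, so they are dropped here
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    events = []
    if ops and ops[0][1] == "S":
        n = ops[0][0]
        # 'start' type: 1 bp upstream of the 5'-most mapped position
        events.append(("start", pos - 1, seq[:n]))
    if len(ops) > 1 and ops[-1][1] == "S":
        n = ops[-1][0]
        # Reference bases consumed by the mapped part of the read
        ref_len = sum(length for length, op in ops if op in "MDN=X")
        # 'end' type: 1 bp downstream of the 3'-most mapped position
        events.append(("end", pos + ref_len, seq[-n:]))
    return events
```

For example, a read mapped at position 10 with CIGAR `5S20M` would yield one 'start' event at position 9 carrying the first five bases of the read.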

Both 'start' and 'end' soft clip events that map close to each other's origins are required to call one reciprocal event. Alignment is done using a modified Smith-Waterman algorithm, with the following criteria:

  • For 'start' soft clip types, we consider it to be mapped if and only if its 3'-most base can be mapped to a single location in the region of the reference sequence starting from the position at which the 5'-most base of the originating read is mapped, to the end of the sequence. This position is represented by a single, unique, maximum score in the Smith-Waterman matrix.

  • For 'end' soft clip types, we consider it to be mapped if and only if its 5'-most base can be mapped to a single location in the region of the reference sequence starting from the beginning of the sequence to the position at which the 3'-most base of the originating read is mapped. Similarly, this position is represented by a single, unique, maximum score in the Smith-Waterman matrix.

Note that in both cases, we only check the uniqueness of the maximum score in the alignment matrix. This means that we ignore possible variations that may arise when performing the traceback, as we are only concerned with the positions that may bound the underlying tandem duplication events.
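The uniqueness check can be illustrated with a plain Smith-Waterman score matrix. The sketch below is an assumption-laden illustration, not the modified algorithm used by rose-dt: it uses a linear gap penalty, the function name `unique_best_end` is hypothetical, and the mismatch and gap values are placeholders. It also does not restrict the reference region by read position, nor require that the maximum falls on the last base of the clip. It only shows how a mapping can be rejected when the maximum score is not unique, without any traceback.

```python
def unique_best_end(clip, ref, match=5, mismatch=-4, gap=-8):
    """Return (score, ref_index) of the unique best local alignment end.

    ref_index is the zero-based reference position of the last aligned
    base. Returns None when the maximum score occurs in more than one
    cell of the matrix, or when nothing aligns at all.
    """
    best, best_cells = 0, []
    prev = [0] * (len(ref) + 1)
    for i in range(1, len(clip) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            step = match if clip[i - 1] == ref[j - 1] else mismatch
            score = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            curr[j] = score
            if score > best:
                # New maximum: forget all previously recorded cells
                best, best_cells = score, [(i, j)]
            elif score == best and score > 0:
                # Tie with the current maximum: no longer unique
                best_cells.append((i, j))
        prev = curr
    if len(best_cells) != 1:
        return None
    _, j = best_cells[0]
    return best, j - 1
```

With these scores, `unique_best_end("ACGT", "TTTACGTTTT")` finds a single maximum ending at reference index 6, while `unique_best_end("ACGT", "ACGTGGACGT")` returns `None` because the clip matches equally well in two places. An 'end' type clip, where the 5'-most base matters, could be handled by reversing both sequences before calling the same function.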

Several parameters can be adjusted for the detection of such events:

  • Minimum soft clip count (--min-soft-clip-count, default: 2) determines the minimum number of soft clip events required at a given position for that position to be included in constructing the reciprocal events.

  • Minimum alignment score ratio (--min-score-ratio, default: 0.5). This is a proxy for the minimum length of the alignments, calculated as the ratio of the alignment score to the maximum possible alignment score (5 * ). Higher values of this parameter correspond to requiring higher similarity between the duplicated sequences.

  • Maximum fuzziness (--max-fuzziness, default: 5) sets how many base pairs upstream and downstream of a given soft clip mapping are searched for its reciprocal soft clip event.
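As a rough illustration of how the count and fuzziness parameters could interact, the following Python sketch filters soft clip positions by a minimum count and then tests whether a 'start' and an 'end' event point at each other within a given distance. The function names are hypothetical and the comparison uses raw coordinates; any offset conventions rose-dt applies between clip positions and anchor positions are not modelled here.

```python
from collections import Counter


def frequent_positions(clip_positions, min_soft_clip_count=2):
    """Keep positions seen at least min_soft_clip_count times."""
    counts = Counter(clip_positions)
    return {pos: n for pos, n in counts.items() if n >= min_soft_clip_count}


def is_reciprocal(start_pos, start_anchor, end_pos, end_anchor, max_fuzziness=5):
    """Check that each clip type maps close to the other type's origin."""
    return (
        abs(start_anchor - end_pos) <= max_fuzziness
        and abs(end_anchor - start_pos) <= max_fuzziness
    )
```

Under these assumptions, a position that carries only a single soft clip is dropped before pairing, and two events whose anchors land more than `max_fuzziness` bases from each other's origins are not reported as a RoSE.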

Usage

rose-dt requires the following files as input:

  • mRNA-seq reads mapped to a reference transcript or transcriptome
  • A FASTA file of that reference transcript or transcriptome (the same one used for creating the alignment index).

The RNA-seq reads can be supplied via stdin or as a sorted, indexed BAM file. Supply it via stdin if you only want to detect the events without any additional context for a given transcript. Supply it as a BAM file if you would also like to measure the depth coverage and/or perform the analysis on a subset region of the transcript.

We used BAM files created using BWA MEM version 0.7.16a-r1181 to test our use cases. Other aligners may or may not produce BAM files that work.

The FASTA file can contain a single transcript or a full transcriptome. If a full transcriptome is used, you must supply the name of the transcript to which the reads are mapped. You can also index the FASTA file (for example using samtools faidx) to speed up access.

See the complete help via rose-dt --help for exact instructions on running the tool and for the other tweakable parameters.

The output of the tool is a tab-delimited file containing the following columns:

  • td_starts: The 5' bound of the predicted tandem duplication event. A pair of numbers means that the bound can not be determined exactly but instead lies in the region delimited by those numbers (zero-based, half-open coordinates).

  • td_ends: The 3' bound of the predicted tandem duplication event. A pair of numbers also means that the bound can not be determined exactly, similar to td_starts.

  • rose_start_count: The number of the 'start' type soft clip events of a RoSE.

  • rose_end_count: The number of the 'end' type soft clip events of a RoSE.

  • rose_start_pos: The position at which the 'start' type soft clip events occur. This is calculated as 1 bp upstream of the 5'-most mapped position of a given read.

  • rose_start_anchor_pos: The 3'-most position at which 'start' type soft clip events can be mapped.

  • rose_end_pos: The position at which the 'end' type soft clip events occur. This is calculated as 1 bp downstream of the 3'-most mapped position of a given read.

  • rose_end_anchor_pos: The 5'-most position at which 'end' type soft clip events can be mapped.

  • boundary_type: 'fuzzy' or 'exact'; a shorthand for the 'fuzziness' column.

  • fuzziness: When the value is 0 (boundary type 'exact'), the detected tandem duplication event corresponds exactly to the anchor positions of each soft clip type. When the value is larger than 0 (boundary type 'fuzzy'), the detected tandem duplication occurs within some distance of one or both of the anchor positions.

Each line in the output corresponds to a single RoSE event.
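The output can be consumed with any tab-delimited reader. Below is a minimal Python sketch that assumes the file starts with a header row naming the columns listed above, and that the count, position, and fuzziness columns hold single integers. The `SAMPLE` content is made up purely to exercise the parser and is not real rose-dt output. The `td_starts` and `td_ends` values are kept as strings, since they may hold either one number or a pair.

```python
import csv
import io

# Made-up example content, only to exercise the parser
SAMPLE = (
    "td_starts\ttd_ends\trose_start_count\trose_end_count\trose_start_pos\t"
    "rose_start_anchor_pos\trose_end_pos\trose_end_anchor_pos\t"
    "boundary_type\tfuzziness\n"
    "1787\t1847\t12\t9\t1786\t1846\t1847\t1787\texact\t0\n"
)

# Columns assumed to hold a single integer per row
INT_COLUMNS = (
    "rose_start_count",
    "rose_end_count",
    "rose_start_pos",
    "rose_start_anchor_pos",
    "rose_end_pos",
    "rose_end_anchor_pos",
    "fuzziness",
)


def read_roses(handle):
    """Parse tab-delimited RoSE rows into dictionaries, one per event."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


roses = read_roses(io.StringIO(SAMPLE))
# Keep only events whose boundaries match the anchor positions exactly
exact = [r for r in roses if r["boundary_type"] == "exact"]
```

To read a real result file, pass an open file handle to `read_roses` instead of the in-memory sample.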

Installation

Since rose-dt is still in beta, please follow the development setup instructions below.

Development Setup

This tool is written in the Rust programming language (v1.25 or above). If you do not have the Rust toolchain installed yet, you will need to install it. The Rust official installation guide is easy to follow and should get you going in no time.

If you already have the toolchain installed, ensure that you are on version 1.25 or above and execute the following commands:

# Clone the repository and cd into it
$ git clone https://git.lumc.nl/hem/rose-dt.git
$ cd rose-dt

# Install the dependencies
$ cargo update

# Run the test suite (if you want to tweak the code)
$ cargo test --all

# Build the release version (to ./target/release/rose-dt)
$ cargo build --release

Extras

Plotting Results

A Python script for visualizing the results of rose-dt is available in the extras directory of the source. Dependencies of this script can be installed in a Conda environment using the provided environment-rose-dt-plot.yml file. Briefly, the steps are as follows:

# Create a conda environment and activate it (follow the instructions on https://conda.io/docs/).

# Install the required packages.
$ conda env update -f extras/environment-rose-dt-plot.yml

# Read through the script documentation and then run it.
$ python extras/rose-dt-plot.py --help

Here are some examples of the plotting script output.

A sample without any RoSEs

Example rose-dt output: no events

This plot shows only the background non-soft clip coverage per position (grey area, maximum and minimum value denoted on the right Y axis) and the ratio of all soft clip events to the non-soft clip events per position (yellow area, values denoted on the left Y axis).

No reciprocal events / RoSEs are detected here.

A sample containing a single RoSE

Example rose-dt output: single exact event

Similar to the sample without any RoSEs, here the Y axis represents the ratio of the soft clip count to the non-soft clip count and the X axis represents the position in the transcript. However, a single RoSE is shown here, represented by the two arcs with the same color.

The two arcs represent the components of the detected RoSE: one resulting from soft clips at the 5' end of reads (bottom arc) and one resulting from soft clips at the 3' end of reads (top arc). The dots from which the arcs originate represent the location of the soft clips, while the arrows pointed to by the arcs represent the location to which those soft clip sequences can be mapped uniquely.

Here, we have an indication that the sequence bounded by the RoSE is duplicated next to each other.

Both arcs are shown using solid lines, which means that each soft clip sequence maps to the exact location at which it is expected to map.

A sample containing multiple RoSEs

Example rose-dt output: multiple mix event

Sometimes a sample may have more than one duplication event, which are expected to be detected as multiple RoSEs. Here we see an example of such a sample.

The second event, denoted by the purple arcs, is shown with dashed lines since the soft clip sequences do not map exactly at the expected position, but rather several bases away from it (the maximum allowed distance is controlled using the --max-fuzziness flag and by default is set to 5). Also note that this second RoSE has a lower coverage ratio than the first.

The plotting script also allows for a different mode of plotting, in which two tracks are shown: one for each half of the RoSE. In this mode, the non-soft clip coverage values and soft clip ratio peaks are shown as mirrors of each other.

For example, the same sample containing multiple events above will be shown like this:

Example rose-dt output: multiple mix event per end
