TriTanc

Taxonomy, ANI and coverage aware metagenomic sequence binner

Metagenomic Contig Clustering Pipeline

A taxonomy-aware metagenomic contig clustering pipeline that integrates ANI (Average Nucleotide Identity), coverage correlation, tetranucleotide frequency (TNF), protein similarity, and taxonomic classification to bin assembled contigs into high-quality genomic clusters.

Motivation

Metagenomic binning involves processing and grouping the assembled contigs into genome bins, and forms a foundational step in creating a metagenome-assembled-genomes (MAGs). Established tools such as MetaBAT2, MaxBin2, VAMB, SemiBin2, and COMEBin predominantly rely on two signals: tetranucleotide frequency (TNF) as a proxy for sequence composition, and coverage depth across sequencing samples. While effective, TNF-based similarity can conflate phylogenetically distinct organisms that share sequence composition biases, deep learning approaches require sufficient sample sizes to generalize, and taxonomy is generally ignored entirely or used as a soft-regularization arm.

TriTanc combines five signals: coverage correlation across multiple samples, taxonomic assignments, average nucleotide identity, tetranucleotide frequency, and protein similarity. Taxonomic assignment acts as a hard clustering gate, and contigs are co-binned if they share a genus or finer assignment to ensure evolutionary coherence. ANI provides direct sequence-level evidence of genomic relatedness, TNF captures sequence composition, protein similarity bridges divergent contigs that share functional gene content, and coverage correlation confirms that co-binned contigs share the same ecological dynamics across samples.

This multi-signal, rule-based architecture is intentionally interpretable: every binning decision can be traced to explicit thresholds on measurable quantities, with no black-box model weights or training data required. A hierarchical tiered design (main clustering → secondary clustering → recovery → coverage-only recovery) maximises contig placement without sacrificing bin purity.

Integrated CheckM2 quality assessment and dRep dereplication ensure that the final output consists of non-redundant, quality-tiered MAGs ready for downstream pangenomic or ecological analysis.

What's New

TNF signal — canonical RC-collapsed 4-mer frequencies added as a third composition signal; contigs < 1,000 bp are TNF-neutral
Protein similarity — ORFs called with pyrodigal (Prodigal, meta mode), followed by MMseqs2 all-vs-all protein search; protein edges fill gaps where skani ANI is sparse
Multimodal graph (_build_multimodal_graph) — Tiers 1/2 now combine ANI edges, protein-similarity edges, and taxonomy edges in priority order; hybrid gating replaces the old hard/soft coverage switch
Leiden community detection — leidenalg + python-igraph replace naive connected-components; tunable resolution at each tier (--leiden-res-main, --leiden-res-secondary, --leiden-res-t4)
Centroid-based coverage — Tier 3 and recovery use cluster centroids (not single representatives) for more stable coverage correlation
Lineage-aware recovery — taxonomy fallback traverses the full lineage rather than leaf taxon only
--coverage-as-tiebreaker — disables the hard coverage gate for low-coverage datasets
--min-af — alignment fraction filter for skani pairs
--min-checkm2-bp — minimum bin size (bp) for CheckM2 submission
Output restructured — bins in clusters/bins/, representatives in clusters/representatives/, CheckM2-eligible symlinks in clusters/thresh_bins/

Time taken to run the script

For the three inputs, the script can take inputs from pre-computed results, or can run the tools needed to generate these inputs. For coverage, jgi_summarize_bam_contig_depths is used if the BAMs are provided. For taxonomic information, mmseqs taxonomy is run if the relevant input is absent, and an mmseqs2-compatible database is provided. skani is run if needed, and the script will automatically run this. Time taken for running the different tools here: skani ~ few minutes < jgi_summarize_bam_contig_depths ~ go out for a coffee break < mmseqs taxonomy ~ go have lunch

Protein similarity (pyrodigal + MMseqs2 protein search) adds ~5–30 min depending on assembly size. Supply --prot-sim <file> to skip if pre-computed, or pass --skip-prot-sim to omit entirely.

The script is also compatible with the taxometer results from VAMB.

How It Works

The pipeline proceeds through the following steps:

Taxonomy assignment - Runs MMseqs2 Taxonomy against a reference database (e.g., GTDB), or accepts a pre-computed taxonomy file (MMseqs2 or Taxometer format)
ANI calculation - Runs skani triangle (all-vs-all) to compute pairwise ANI between contigs
Protein similarity - Calls ORFs with pyrodigal then runs MMseqs2 all-vs-all protein search to produce contig-level protein similarity scores
TNF computation - Computes canonical RC-collapsed tetranucleotide frequencies for all contigs ≥ 1,000 bp
Depth profiling - Runs jgi_summarize_bam_contig_depths on BAM files to build a per-sample coverage matrix
Tier 1 (main clustering) - Clusters long contigs (≥ 5,000 bp) assigned at genus/species level using a multimodal graph (ANI + protein + taxonomy edges) with hybrid ANI/coverage/TNF gating; Leiden community detection
Tier 2 (secondary clustering) - Second-pass multimodal clustering for above-genus assigned contigs (order/family/class/phylum)
Tier 3 (recovery) - Assigns short and unclassified contigs to existing clusters via ANI hits to cluster centroids and/or shared lineage taxonomy
Tier 4 (coverage-only recovery) - Places remaining unassigned contigs using coverage + TNF correlation against cluster centroids, without requiring ANI evidence
Quality assessment - Runs CheckM2 for completeness/contamination estimates (bins ≥ --min-checkm2-bp). I need to combine the quality assessment summary to the clustering summary
Dereplication - Runs dRep at 95% ANI to produce a non-redundant bin set
Checkpointing - Saves intermediate results as Parquet/JSON files to allow resuming interrupted runs

Requirements

Python

Python 3.10+ is required.

Python Packages

Replace pip with mamba/conda as you wish. I recommend creating a new environment for this.

pip install numpy pandas networkx biopython scipy statsmodels pyarrow igraph leidenalg

Package	Version	Purpose
`numpy`	≥1.23	Vectorized depth/TNF matrix operations
`pandas`	≥1.5	Data I/O and manipulation
`networkx`	≥2.8	Graph construction
`biopython`	≥1.79	FASTA parsing and writing
`scipy`	≥1.9	Spearman correlation
`statsmodels`	≥0.13	BH-FDR multiple testing correction
`pyarrow`	≥10.0	Parquet checkpointing
`igraph`	≥0.10	Fast graph backend for Leiden
`leidenalg`	≥0.10	Leiden community detection

External Tools

Install via conda (recommended):

mamba install -c bioconda mmseqs2 skani metabat2 samtools checkm2 drep vamb pyrodigal

Tool	Required When	Source
`mmseqs`	`--taxonomy` not provided, or protein similarity step	MMseqs2
`skani`	`--ani` not provided	skani
`pyrodigal`	`--prot-sim` not provided and `--skip-prot-sim` not set	pyrodigal
`jgi_summarize_bam_contig_depths`	`--depth` not provided	Part of MetaBAT2
`taxometer`	`alternate to mmseqs taxonomy`	Part of VAMB - https://github.com/RasmussenLab/vamb
`samtools`	`--depth` not provided	HTSlib
`checkm2`	`--skip-checkm2` not set	CheckM2
`dRep`	`--skip-drep` not set	dRep

Usage

Fully Automatic (minimum inputs)

python tritanc_v10.py \
  --fasta assembly.fasta \
  --bams sample1.bam sample2.bam sample3.bam \
  --mmseqs-db /path/to/GTDB \
  --outdir results/

Skip Individual Steps with Pre-computed Files

python tritanc_v10.py \
  --fasta assembly.fasta \
  --taxonomy saliva_tax.tsv \
  --taxonomy-format taxometer \
  --ani skani.tsv \
  --prot-sim prot_sim.tsv \
  --depth depth_matrix.txt \
  --outdir results/

Skip Protein Similarity (faster, slightly lower recovery)

python tritanc_v10.py \
  --fasta assembly.fasta \
  --taxonomy saliva_tax.tsv \
  --ani skani.tsv \
  --depth depth_matrix.txt \
  --skip-prot-sim \
  --outdir results/

Low-Coverage Datasets (disable hard coverage gate)

python tritanc_v10.py \
  --fasta assembly.fasta \
  --bams sample*.bam \
  --mmseqs-db /path/to/GTDB \
  --coverage-as-tiebreaker \
  --outdir results/

Arguments

Required

Argument	Description
`--fasta`	Assembly FASTA file containing all contigs
`--outdir`	Output directory

Step-skipping Inputs

Supplying these pre-computed files skips the corresponding pipeline step:

Argument	Skips
`--taxonomy`	MMseqs2 taxonomy run
`--ani`	skani triangle run
`--prot-sim`	pyrodigal + MMseqs2 protein similarity run
`--depth`	jgi depth profiling step

Conditionally Required

Argument	Required When
`--bams`	`--depth` is not supplied
`--mmseqs-db`	`--taxonomy` is not supplied

Examples for Tunable Parameters

Argument	Default	Description
`--threads`	`8`	Threads for MMseqs2, skani, and pyrodigal
`--min-len`	`5000`	Min contig length (bp) for main clustering
`--ani-threshold`	`95.0`	ANI (%) threshold; adaptive override
`--cov-threshold`	adaptive	Spearman r threshold; overrides all tiers
`--min-af`	`0.0`	Min skani alignment fraction (0 = disabled)
`--coverage-as-tiebreaker`	off	Disable hard coverage gate globally
`--tnf-gate-main`	`0.93`	Min TNF cosine similarity for hybrid gating in Tiers 1/2
`--min-score`	`0.0`	Taxometer min per-level confidence score (0–1)
`--taxonomy-format`	`mmseqs2`	Taxonomy file format: `mmseqs2` or `taxometer`
`--min-prot-sim`	`50.0`	Min protein identity (%) to retain a hit

Leiden Resolution

Argument	Default	Description
`--leiden-res-main`	`3.5`	Leiden resolution for Tier 1 (main clustering)
`--leiden-res-secondary`	`2.0`	Leiden resolution for Tier 2 (secondary clustering)
`--leiden-res-t4`	`1.5`	Leiden resolution for Tier 4 (coverage-only recovery)

Change the resolutions as needed for clustering. Higher resolutions lead to finer splits, lower resolutions allow coarser clusters.

Recovery

Argument	Default	Description
`--skip-cov-recovery`	off	Skip Tier 4 coverage-only recovery
`--cov-recovery-r`	`0.80`	Min Spearman r for Tier 4
`--cov-recovery-tnf-min`	`0.90`	Min TNF cosine similarity for Tier 4

Post-processing

Argument	Description
`--skip-checkm2`	Skip CheckM2 (also skips dRep)
`--skip-drep`	Skip dRep dereplication
`--skip-prot-sim`	Skip protein similarity step entirely
`--checkm2-db`	Path to CheckM2 diamond DB (or set `CHECKM2DB` env variable)
`--min-checkm2-bp`	Minimum total bin size (bp) for CheckM2 input list (default: `100000`)

Checkpointing

Argument	Description
`--checkpoint-dir`	Directory to store/load intermediate checkpoints (default: `<outdir>/checkpoints/`)
`--no-cache`	Ignore existing checkpoints and recompute everything

Outputs

outdir/
├── taxonomy/                       # MMseqs2 output (if run)
├── ani/                            # skani triangle output (if run)
├── protein_similarity/             # pyrodigal + MMseqs2 protein output (if run)
├── depth/                          # jgi depth matrix (if run)
├── clusters/
│   ├── bins/
│   │   └── cluster_NNNN.fasta      # All contigs in cluster
│   ├── representatives/
│   │   └── cluster_NNNN_representative.fasta
│   └── thresh_bins/                # Symlinks to bins >= --min-checkm2-bp
├── unassigned/
│   └── unassigned_contigs.fasta
├── cluster_summary.tsv             # Per-contig assignments + taxonomy + CheckM2 quality
├── checkm2_bin_list.txt            # Input list for CheckM2
├── checkm2/                        # CheckM2 results (if run)
└── drep/                           # dRep dereplication results (if run)

Output description

`cluster_summary.tsv` Columns

Column	Description
`contig`	Contig ID
`cluster`	Assigned cluster ID or `unassigned`
`is_rep`	Whether this contig is the cluster representative
`contig_len`	Contig length (bp)
`cluster_bp`	Total size of the cluster bin (bp)
`mean_depth`	Mean coverage depth across samples
`rank`	Taxonomic rank of this contig's assignment
`name`	Taxon name
`lineage`	Full lineage (semicolon-delimited)
`rep_name`	Taxon name of the cluster representative
`rep_rank`	Taxonomic rank of the cluster representative
`rep_lineage`	Full lineage of the cluster representative

Quality Thresholds

Tier	Completeness	Contamination
High quality	≥ 90%	≤ 5%
Medium quality	≥ 50%	≤ 10%
Low (flagged)	< 50%	> 10%

dRep dereplication is run at 95% ANI on medium + high quality bins only.

Clustering Logic Summary

flowchart TD
    A["Long contigs (≥2 kbp) + genus/species taxonomy"]
    --> B["Tier 1 — Main clustering\nMultimodal graph (ANI + protein + taxonomy edges)\nHybrid ANI/cov/TNF gating | Leiden community detection"]
    --> C["Above-genus contigs\norder / family / class / phylum"]
    --> D["Tier 2 — Secondary clustering\nMultimodal graph | Leiden community detection"]
    --> E["Short / unclassified / secondary singletons"]
    --> F["Tier 3 — Recovery\nANI to cluster centroids + lineage-aware taxonomy"]
    --> G["Still unassigned contigs"]
    --> H["Tier 4 — Coverage-only recovery\nCoverage + TNF correlation to centroids"]
    H --> J["optional Checkm2 bin QC on thresholded bins"]
    J --> K["optional dRep based dereplication"]
    H --> I["Unassigned contigs → unassigned_contigs.fasta"]

Other thoughts

Was thinking about calling it STRATA based on geological layering which is a perfect metaphor for hierarchical tiered binning (Tiers 1–5), and it's visually distinctive. I might at a later date :)
Haven't really benchmarked with other binners, but my belief is that it might be competitive
Various hyperparameters might make it difficult to get the thing running like you want it right away, but as with any other tool, tuning is sort of needed

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
README.md		README.md
tritanc.py		tritanc.py

Folders and files

Latest commit

History

Repository files navigation

TriTanc

Metagenomic Contig Clustering Pipeline

Motivation

What's New

Time taken to run the script

How It Works

Requirements

Python

Python Packages

External Tools

Usage

Fully Automatic (minimum inputs)

Skip Individual Steps with Pre-computed Files

Skip Protein Similarity (faster, slightly lower recovery)

Low-Coverage Datasets (disable hard coverage gate)

Arguments

Required

Step-skipping Inputs

Conditionally Required

Examples for Tunable Parameters

Leiden Resolution

Recovery

Post-processing

Checkpointing

Outputs

Output description

cluster_summary.tsv Columns

Quality Thresholds

Clustering Logic Summary

Other thoughts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`cluster_summary.tsv` Columns

Packages