Metax is a cross-domain metagenomic taxonomic profiler designed to deliver accurate, robust, and interpretable community composition analyses across bacteria, archaea, eukaryotes, and viruses. Unlike existing profilers, Metax integrates probabilistic modeling of genome coverage to distinguish true community members from artifacts caused by reference contamination, local genomic similarity, or reagent-derived DNA fragments.
Through comprehensive benchmarks on more than 500 samples, Metax demonstrated:
- Species-level accuracy across all domains of life
- Robustness to shallow sequencing and low-biomass, host-dominated samples
- Contamination detection, including reagent-borne DNA and reference misassemblies
- Clinical and environmental relevance, e.g. identifying cross-kingdom interactions and clarifying tumor microbiome signals
By unifying coverage-informed presence probability estimation with EM-based abundance refinement, Metax overcomes the challenges of ambiguous read mapping, database contamination, and kitome DNA fragments. These properties make it a powerful tool for microbiome research, clinical metagenomics, and identifying genome misassemblies.
- Install the package using conda:
  Ensure that bioconda is included in the Conda channel source file (`~/.condarc`):

      channels:
        - conda-forge
        - bioconda
      channel_priority: strict

  Then create the environment and activate it:

      conda create -n metax -c zldeng metax
      conda activate metax
- Taxonomy dmp files
  - Create a `metax_dmp` directory.
  - Download the NCBI taxonomy dump (`taxdump.tar.gz`) from NCBI.
  - Extract its contents directly into `metax_dmp`.
  - Optional: to use an alternative taxonomy source (e.g. GTDB or ICTV), replace the extracted taxdump files in `metax_dmp` with your own dmp files.
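The extraction step can be scripted. The following is a minimal sketch, assuming `taxdump.tar.gz` has already been downloaded; the function name `extract_taxdump` and the `EXPECTED` list are illustrative and not part of Metax. `names.dmp` and `nodes.dmp` are standard members of the NCBI dump, but which files Metax actually reads is not stated here.

```python
import tarfile
from pathlib import Path

# Standard members of the NCBI taxonomy dump. Whether Metax needs
# additional files is an open assumption of this sketch.
EXPECTED = ("names.dmp", "nodes.dmp")


def extract_taxdump(archive: str, dmp_dir: str = "metax_dmp") -> list[str]:
    """Extract the archive directly into dmp_dir; return expected files that are missing."""
    target = Path(dmp_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                # Drop any leading directories so files land directly in dmp_dir.
                member.name = Path(member.name).name
                tar.extract(member, path=target)
    return [name for name in EXPECTED if not (target / name).exists()]


if __name__ == "__main__":
    missing = extract_taxdump("taxdump.tar.gz")
    print("missing files:", missing or "none")
```

An empty return value only indicates that the two expected files are present; it does not validate their contents.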
- Reference database
A pre-built reference database is available here. It is based on the RefSeq snapshot of 10 August 2022 and includes the top genomes for each NCBI taxonomic identifier (txid), prioritizing assemblies flagged as "representative" or "reference" and then selecting the highest assembly level (Complete Genome > Chromosome > Scaffold > Contig). In total, it contains 33,143 genomes from bacteria, archaea, viruses, fungi, protozoa, and Homo sapiens (bavfph).
Another pre-built reference database is available for CAMI II data benchmarks.
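The genome selection rule described above can be expressed as a sort key. This is only a sketch of that ordering, not the script used to build the released database: the dictionary keys (`txid`, `refseq_category`, `assembly_level`) are assumed to mirror NCBI assembly summary fields, and the number of genomes kept per txid (`n`) is a hypothetical parameter, since the README does not state it.

```python
# Lower rank means higher priority (Complete Genome > Chromosome > Scaffold > Contig).
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}
PREFERRED = {"reference genome", "representative genome"}


def rank_key(assembly: dict) -> tuple[int, int]:
    """Flagged assemblies sort first, then by assembly level."""
    flagged = 0 if assembly.get("refseq_category") in PREFERRED else 1
    level = LEVEL_RANK.get(assembly.get("assembly_level"), len(LEVEL_RANK))
    return (flagged, level)


def top_per_txid(assemblies: list[dict], n: int = 1) -> dict[str, list[dict]]:
    """Group assemblies by txid and keep the n best-ranked ones in each group."""
    by_txid: dict[str, list[dict]] = {}
    for assembly in assemblies:
        by_txid.setdefault(assembly["txid"], []).append(assembly)
    return {txid: sorted(group, key=rank_key)[:n] for txid, group in by_txid.items()}
```

Because Python's sort is stable, assemblies with identical keys keep their input order, so any further tie-breaking would have to be added explicitly.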
A customized reference database can be created by following these steps:
- Prepare the genomes in FASTA format. The header of each sequence should follow the format:

      >genome_id|txid|species_txid|sequence_id[|genome_size]

  Each genome must have a unique genome_id, and each sequence a unique sequence_id. The genome_size field is optional, but it is required for creating a subsampled reference database with `metax index`. When using the NCBI taxonomy, `txid` should be the genome's NCBI taxonomy ID, and `species_txid` the species' NCBI taxonomy ID. If you choose a different taxonomy source (e.g. GTDB, ICTV), use the corresponding IDs from your taxonomy dump files.
- Run the following command to build the database. It may take a long time to complete, so run it in a tmux or screen session.

      metax index <fasta_file> -o <database_dir>

  This command produces a database named `metax_db` in `<database_dir>`.

  To subsample each genome before indexing, provide `-f <fraction>` (where `0 < fraction < 1`). This creates a database that uses less disk space and memory and reduces profiling runtime, but it may be less sensitive for taxa with low read counts. With this option, the CLI will:
  - extract evenly distributed, non-overlapping segments of length `-l/--segment-length` (50 Kbp by default) across each genome until at least the requested fraction is collected;
  - write a subsampled FASTA alongside the index;
  - skip generating read-level classifications, and scale reported counts during profiling to account for the reduced genome size.

  Use `-s/--seed` (default `42`) to make the segment selection reproducible, and `-m/--min-length` (default 3 Kbp) to discard subsampled segments shorter than the threshold for genomes more than ten times longer than that value. Supply `-t/--threads` to subsample genomes in parallel; omit it (or set it to 1) to run sequentially. Add `-z/--compress` to gzip the subsampled FASTA once the index has been built (the uncompressed file is removed after compression).

      build_db <fasta_file> -o <database_dir>
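Before indexing a custom FASTA, it can help to check the headers against the format given in the first step. The sketch below is a hypothetical pre-flight check, not part of Metax: `parse_header` and `check_headers` are illustrative names, and treating `genome_size` as an integer number of bases is an assumption.

```python
FIELDS = ("genome_id", "txid", "species_txid", "sequence_id")


def parse_header(line: str) -> dict:
    """Split '>genome_id|txid|species_txid|sequence_id[|genome_size]' into a dict."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header")
    parts = line[1:].strip().split("|")
    if len(parts) not in (4, 5) or any(not part for part in parts[:4]):
        raise ValueError(f"expected 4 or 5 non-empty fields, got {parts!r}")
    record = dict(zip(FIELDS, parts))
    # Assumption: genome_size is an integer; int() raises ValueError otherwise.
    record["genome_size"] = int(parts[4]) if len(parts) == 5 else None
    return record


def check_headers(lines) -> list[str]:
    """Report malformed headers, duplicate sequence_ids, and genome_ids with conflicting txids."""
    problems = []
    seen_sequences = set()
    genome_txid = {}
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line
        try:
            record = parse_header(line)
        except ValueError as error:
            problems.append(f"line {number}: {error}")
            continue
        if record["sequence_id"] in seen_sequences:
            problems.append(f"line {number}: duplicate sequence_id {record['sequence_id']}")
        seen_sequences.add(record["sequence_id"])
        known = genome_txid.setdefault(record["genome_id"], record["txid"])
        if known != record["txid"]:
            problems.append(f"line {number}: genome_id {record['genome_id']} has conflicting txids")
    return problems
```

For a file on disk, `check_headers(open("genomes.fasta"))` would stream the lines; an empty list means none of these particular problems were found.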
- Pathogen host map file for pathogen detection mode
You may optionally provide a custom pathogen host mapping file to enable Metax to prioritize detection of microorganisms relevant to a specific host of interest.
The mapping file should be a tab-delimited table with the following columns (see data/pathogen_host_disease.txt for the format):
txid (microbial taxon ID), host_txids (associated host taxon IDs), host (host name or label), and diseases (associated diseases).
The diseases column can be left blank.
For convenience, we also provide a precompiled virus host mapping file (data/pathogen_host_disease.txt) generated from the Virus-Host Database.
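A custom mapping file can be sanity-checked before it is passed to Metax. The sketch below reads the four columns listed above; `read_host_map` is a hypothetical helper, and it assumes numeric taxon IDs (as in the NCBI taxonomy) so that a possible header line can be skipped. The separator used inside `host_txids` is not specified here, so that field is kept as a raw string.

```python
import csv

COLUMNS = ("txid", "host_txids", "host", "diseases")


def read_host_map(path: str) -> dict[str, dict]:
    """Return {txid: row} for a tab-delimited pathogen host mapping file."""
    records = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            # Skip blank lines and a possible header (assumes numeric txids).
            if not row or not row[0].strip().isdigit():
                continue
            if len(row) < 3:
                raise ValueError(f"txid {row[0]}: expected at least txid, host_txids, host")
            # The diseases column may be left blank, so pad short rows.
            padded = row + [""] * (len(COLUMNS) - len(row))
            records[row[0].strip()] = dict(zip(COLUMNS, padded))
    return records
```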
- Test data
Usage: metax profile [OPTIONS] --outprefix <PREFIX> [-- <EXTRA_ARGS>...]
Arguments:
[EXTRA_ARGS]... Additional arguments passed directly to maCMD (use after `--`).
Options:
--db <DB> Path to the maCMD reference database (metax_db.json).
--dmp-dir <DMP_DIR> Directory containing the NCBI-style taxonomy dump (dmp files).
-i, --in-seq <READS> Comma-separated list of input read files (one or two for Illumina paired-end).
-o, --outprefix <PREFIX> Prefix for output files.
-t, --threads <THREADS> Number of threads to use for alignment and profiling. [default: 20]
-r, --resume Resume profiling by reusing existing alignment output if present.
--reuse-sam <SAM> Existing SAM (or compressed SAM) file to reuse instead of running maCMD.
--sequencer <TYPE> Sequencer type (e.g. Nanopore, PacBio, Illumina). [default: Illumina]
-p, --is-paired Treat Illumina inputs as paired-end reads (expects two files).
--strain Enable strain-level profiling outputs.
--mode <MODE> Alignment mode preset: default, recall, or precision. [default: default] [possible values: recall, precision, default]
--batch-size <N> Maximum number of reads to process per batch.
--identity <FLOAT> Minimum alignment identity threshold for retaining a read.
-m, --mapped-len <LEN> Minimum mapped read length threshold.
-b, --breadth <FRACTION> Minimum breadth of coverage required to report a genome.
--chunk-breadth <FRACTION> Manually set the minimum chunk breadth (overrides automatic estimate).
-f, --fraction <FRACTION> Minimum aligned fraction a read must cover to be considered.
-l, --lowbiomass Apply heuristics tuned for low biomass samples.
-k, --keep-raw Retain the unfiltered rprofile.txt output alongside the final profile.
--by-aligned Estimate the minimum chunk breadth using the number of aligned reads.
-z, --compress-sam Compress the generated SAM file after profiling completes.
--pathogen-host <TSV> Optional TSV mapping pathogen taxids to host metadata for annotation.
--host <TAXID> NCBI taxid of the host organism (enables pathogen-specific profiling).
--verbose Log the full command line parameters.
-h, --help Print help
-V, --version Print version

metax profile --dmp-dir <dump_dir> \
  --db <reference_db> \
  -i <r1>[,<r2>] \
  -o <output_prefix> \
  [other options ...]

<dump_dir>: path to the metax_dmp folder where the dump files are located.
<reference_db>: path to the JSON file of the database (e.g. metax_bavfph/metax_db.json)
The first run (sample) takes a bit longer; subsequent runs will be substantially faster by using the cached database.
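When profiling many samples, the `metax profile` command line can be assembled programmatically. The sketch below uses only flags listed in the help text above (`--db`, `--dmp-dir`, `-i`, `-o`, `-t`, `-p`, and `--` for extra arguments); the helper name `build_profile_command` and the file names in the usage part are illustrative.

```python
import shlex


def build_profile_command(db, dmp_dir, reads, outprefix, threads=20, paired=False, extra=()):
    """Return the argument list for one `metax profile` run."""
    if not reads:
        raise ValueError("at least one read file is required")
    if paired and len(reads) != 2:
        raise ValueError("paired-end mode expects exactly two read files")
    cmd = [
        "metax", "profile",
        "--db", db,
        "--dmp-dir", dmp_dir,
        "-i", ",".join(reads),  # comma-separated list, as described in the help text
        "-o", outprefix,
        "-t", str(threads),
    ]
    if paired:
        cmd.append("-p")
    if extra:
        cmd += ["--", *extra]  # passed through to the aligner unchanged
    return cmd


if __name__ == "__main__":
    cmd = build_profile_command(
        "metax_bavfph/metax_db.json", "metax_dmp",
        ["sample_R1.fastq.gz", "sample_R2.fastq.gz"], "sample", paired=True,
    )
    print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with unusual file names.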
Usage: metax [OPTIONS] [EXTRA_ARGS]...
A taxonomy profiler for metagenomic data

Options:
    --db                 PATH      The reference database file.
    --dmp_dir            PATH      The directory of dmp files.
    --in_seq         -i  TEXT      The input read files separated with comma.
  * --outprefix      -o  TEXT      The prefix of output files. [required]
    --threads        -t  INTEGER   Number of threads to use.
    --resume         -r            Resume from the last run.
    --reuse_sam          PATH      The sam file to reuse for profiling.
    --sequencer          [Illumina|Nanopore|PacBio|assembly]
                                   Sequencer used to generate the reads. Default: Illumina
    --is_paired      -p            Whether the reads are paired or not?
    --strain                       Whether profile on strain level? (experimental)
    --mode               [recall|precision|default]
                                   The mode of the profiler. recall: ensure high recall,
                                   precision: ensure high precision, default: use the default mode.
    --batch_size         INTEGER   Reduce memory consumption with smaller batch size.
                                   (Default: 5000 for short, 1000 for long reads)
    --identity           FLOAT     The sequence identity (matched bases/gap compressed len) cutoff
                                   to consider a valid assignment.
                                   (Default: 0.95 for short, 0.86 for long reads)
    --mapped_len     -m  INTEGER   The mapped length cutoff to consider a valid assignment.
                                   (Default: 50 for short, 250 for long reads)
    --breadth        -b  FLOAT     The genome breadth coverage cutoff to consider the presence
                                   of a genome.
    --chunk_breadth      FLOAT     The genome chunk breadth coverage cutoff to consider the
                                   presence of a genome.
    --fraction       -f  FLOAT     The fraction of matched based in a read to consider a valid
                                   alignment. (Default: 0.6)
    --lowbiomass     -l            Is a low biomass sample? (No coverage filter by default for
                                   low biomass sample)
    --keep_raw       -k            Keep raw profiling file without statistical filtering.
    --pathogen_host      PATH      The pathogen host table file
    --host               TEXT      The host taxid for pathogen detection
    --version                      Show the version and exit.
    --help           -h            Show this message and exit.

metax --dmp_dir <dump_dir> \
  --db <reference_db> \
  -i <r1>[,<r2>] \
  -o <output_prefix> \
  [other options ...]

<dump_dir>: path to the metax_dmp folder where the dump files are located.
<reference_db>: path to the JSON file of the database (e.g. metax_bavfph/metax_db.json)
- Final taxonomy profile:
*.profile.txt
column 1: Taxon name
column 2: Taxon ID
column 3: Taxon rank
column 4: Number of reads
column 5: Depth of coverage
column 6: Abundance
column 7: Breadth of coverage (B)
column 8: Expected breadth of coverage (EB)
column 9: Likelihood of presence based on breadth
column 10: Fixed chunk breadth of coverage
column 11: Flex chunk breadth of coverage
column 12: Expected flex chunk breadth of coverage (ECB)
column 13: Likelihood of presence based on flex chunk breadth
If pathogen detection mode is enabled, the output profile will also include 3 extra columns as below:
column 14: The host names
column 15: The host taxonomy IDs
column 16: The relevant diseases
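The column layout above maps naturally onto named fields. The following sketch loads a `*.profile.txt` file under two assumptions that the README does not confirm: the file is tab-delimited, and any header line can be recognised because its abundance field is not numeric. The short field names and the function `read_profile` are illustrative only.

```python
PROFILE_COLUMNS = (
    "name", "taxid", "rank", "reads", "depth", "abundance",
    "breadth", "expected_breadth", "breadth_likelihood",
    "fixed_chunk_breadth", "flex_chunk_breadth",
    "expected_flex_chunk_breadth", "flex_chunk_likelihood",
)
# Present only when pathogen detection mode is enabled (columns 14-16).
HOST_COLUMNS = ("hosts", "host_taxids", "diseases")


def read_profile(path, rank=None):
    """Return profile rows as dicts, sorted by decreasing abundance."""
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < len(PROFILE_COLUMNS):
                continue  # blank or truncated line
            row = dict(zip(PROFILE_COLUMNS + HOST_COLUMNS, fields))
            try:
                row["abundance"] = float(row["abundance"])
            except ValueError:
                continue  # most likely a header line
            if rank is None or row["rank"] == rank:
                rows.append(row)
    return sorted(rows, key=lambda r: r["abundance"], reverse=True)
```

Passing `rank` restricts the result to one taxonomic rank; the exact rank labels written by Metax should be checked in a real output file before filtering on them.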
- Reads taxonomy classification:
*.classify.txt
column 1: Read name
column 2: Name of the most likely taxon
column 3: taxonomy ID of the most likely taxon
column 4: Rank of the taxon
column 5: Names of all possible taxa that the reads originated from
column 6: Taxonomy IDs of all possible taxa
column 7: Likelihood for each of those possible taxa
Columns 2 and 5 are not included in versions >= 9.12.
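Read-level classifications can be summarised per taxon. The sketch below counts reads by the taxonomy ID of the most likely taxon and handles both layouts described above: seven columns, or five columns when the two name columns are absent. It assumes a tab-delimited file without a header line; `count_reads_per_taxid` is an illustrative name.

```python
from collections import Counter


def count_reads_per_taxid(path):
    """Count reads by the taxonomy ID of their most likely taxon."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 7:
                taxid = fields[2]  # column 3 in the full layout
            elif len(fields) == 5:
                taxid = fields[1]  # name columns 2 and 5 are absent
            else:
                continue  # unexpected layout; skip rather than guess
            counts[taxid] += 1
    return counts
```

`counts.most_common(10)` then lists the ten taxa with the most assigned reads.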
- What platforms and operating systems does Metax support?
Metax currently supports Linux on x86-64 (64-bit Intel/AMD) systems. Other architectures (e.g., ARM/macOS) are not yet officially supported.
- Why do I get the error: "Processor 6174 is not supported by this build"?
This error indicates that your CPU does not support some modern instruction sets required by Metax.
We thank Gary Robertson for IT support, Dr. Mohammad-Hadi Foroughmand-Araabi for advice on statistical formulations, and Hesham Almessady for software testing.