PanHOG

PanHOG is a flexible, phylogeny-aware command-line toolkit for classifying and annotating Hierarchical Orthologous Groups (HOGs) across multiple genomes in pangenomic datasets. It supports core/shell/private gene classification, heatmap visualization, pan-proteome generation, functional annotation, saturation analysis, Ka/Ks selection analysis, phylogenetic LCA analysis, and supermatrix generation. Options can be supplied as command-line arguments or through an external YAML configuration file.
---

Changelog

[v0.2.0] - 2026-02-27

New Features

  • Summary Statistics (--summary): Generates a summary statistics table and stacked bar chart showing Gene/HOG ratios across pangenome compartments (core, single-copy, shell, private, cloud).
  • Random HOG Matrix (--random-hog-matrix N): Generates an absent/single/multi-copy heatmap for N randomly sampled HOGs across all species.
  • Ka/Ks Analysis (--kaks): Performs codon-based selection analysis (dN/dS) on orthologous groups using the Nei-Gojobori (1986) method.
    • Supports core, shell, private, or all HOG types via --kaks-type.
    • Calculation methods: biopython (built-in) or kakscalculator (external).
    • Aligners: mafft or muscle via --aligner.
    • Back-translation: naive (built-in) or pal2nal (external) via --backtrans.
    • Optional pairwise mode with --reference species.
  • Phylogenetic LCA Analysis (--species-tree): Maps HOGs to their Lowest Common Ancestor on a species tree, generating per-node HOG counts and a phylogeny-aware classification.
  • Supermatrix Generation (--supermatrix): Concatenates aligned single-copy orthologs into a supermatrix for phylogenomic inference, with partition file for RAxML/IQ-TREE.
  • Configurable Tool Paths: Added --mafft-path, --muscle-path, --pal2nal-path, --blastp-path, --makeblastdb-path, --kakscalculator-path for HPC/custom installations.
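
To make the LCA mapping idea concrete, here is a minimal sketch of assigning each HOG to the lowest common ancestor of the species that carry it, then counting HOGs per node. It is not PanHOG's implementation. The tree is given as a hypothetical child-to-parent map instead of being parsed from Newick, and all node, species, and HOG names are made up.

```python
# Hypothetical illustration: the species tree is a child -> parent map
# rather than a parsed Newick file, and all names are invented.
from collections import Counter

PARENT = {
    "spA": "n1", "spB": "n1",   # spA and spB form clade n1
    "n1": "root", "spC": "root",
}

def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca(species, parent):
    """Lowest common ancestor of a non-empty collection of leaf names."""
    species = list(species)
    # Start from the first leaf's ancestor path (ordered leaf -> root) ...
    common = path_to_root(species[0], parent)
    for sp in species[1:]:
        ancestors = set(path_to_root(sp, parent))
        # ... and keep only nodes that are ancestors of every other leaf too.
        common = [n for n in common if n in ancestors]
    return common[0]  # first surviving node is the deepest shared ancestor

hog_members = {
    "HOG1": {"spA", "spB", "spC"},
    "HOG2": {"spA", "spB"},
    "HOG3": {"spC"},
}
hog_lca = {hog: lca(members, PARENT) for hog, members in hog_members.items()}
per_node_counts = Counter(hog_lca.values())
```

In this toy tree, a HOG found only in spA and spB maps to their shared ancestor n1, while a HOG found in a single species maps to that leaf.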

Improvements

  • Graceful Dependency Handling: Optional dependencies (PyYAML, BioPython, matplotlib/seaborn) are now handled with try/except, providing clear error messages when missing.
  • Extended FASTA Support: Now recognizes .pep.fa and uppercase extensions (.FASTA, .FA, .PEP).
  • Conda Packaging: Added conda-recipe/meta.yaml, environment.yml, and updated setup.py for single-command install via conda or pip.
  • Updated Config Examples: Example YAML config files now include all new analysis options.

[Released] - 2025-09-21

New Features

  • Functional Annotation (--funano): Added a major feature to perform functional annotation of pangenome compartments (core, shell, private, etc.) against the UniProt/Swiss-Prot database.
    • Supports annotation of all compartments (--funano 1) or specific ones (--funano 2-6).
    • Automatically downloads the database if a local copy is not provided via --uniprot-db.
    • Generates detailed annotation reports, raw BLAST results, and protein sequence files.
  • PAV and Count Matrix Generation:
    • Added --pav flag to generate a Presence/Absence Variant (PAV) matrix.
    • Added --matrix flag to generate a gene copy number (Count) matrix.

Improvements

  • Restructured Output: The output directory structure has been completely reorganized for better clarity. All results are now saved within a main results/ directory, with sub-folders for annotations/, blast_results/, peptides/, and compartments/.
  • Organized Classification: Pangenome classification files (core.HOGs.tsv, etc.) are now neatly stored in results/compartments/panhog_classification/.
  • Improved Cloud Gene File: The cloud.unassigned_genes.tsv file now correctly includes a header.

Key Features

| Feature | Flag | Description |
| --- | --- | --- |
| Global Pangenome Classification | --pan | Classify HOGs into core, single-copy, shell, private, and cloud |
| Clade-Specific Analysis | --clade sp1,sp2,... | Pangenome classification on a subset of species |
| Pan-Proteome Construction | --proteome | Extract FASTA sequences for pangenome compartments |
| Gene Variation Heatmap | --genevar | Heatmap of gene copy number variation across species |
| Saturation Analysis | --saturation | Bootstrapped core/pan-genome growth curves |
| Clade-pair Saturation | --saturation-cladepair | Compare saturation between two clades |
| Summary Statistics | --summary | Summary table and stacked bar chart |
| Random HOG Matrix | --random-hog-matrix N | Heatmap of N randomly sampled HOGs |
| PAV Matrix | --pav | Presence/Absence Variant matrix |
| Count Matrix | --matrix | Gene copy number matrix |
| Functional Annotation | --funano | BLAST-based annotation against UniProt |
| Ka/Ks Analysis | --kaks | Selection analysis on orthologous groups |
| Phylogenetic LCA | --species-tree | Map HOGs to species tree nodes |
| Supermatrix | --supermatrix | Concatenated single-copy orthologs for phylogenomics |
| Config File | --config | YAML-based configuration |

Installation

Option 1: Conda (Recommended)

# Create environment with all dependencies
conda env create -f environment.yml
conda activate panhog

# Install PanHOG
pip install .

Option 2: Pip

pip install panhog

Option 3: From Source

git clone https://github.com/yykaya/PanHOG.git
cd PanHOG
pip install -e .

After installation, the panhog and pangenehog commands are available on your PATH (inside the active environment, if you installed into a conda or virtual environment).


Documentation


Quick Start

Basic pangenome classification

panhog --hog N0.tsv --fasta ./peptides/ --pan -o results/ -p run1_

Full pipeline with all analyses

panhog --hog N0.tsv --fasta ./peptides/ --pan \
  --proteome ALL --genevar ALL --saturation \
  --pav --matrix --summary --random-hog-matrix 1000 \
  -o results/ -p full_

Ka/Ks selection analysis

panhog --hog N0.tsv --fasta ./peptides/ --pan \
  --kaks --cds ./cds_fasta/ --kaks-type core \
  --aligner mafft --backtrans naive \
  -o results/ -p kaks_

Phylogenetic LCA analysis

panhog --hog N0.tsv --fasta ./peptides/ --pan \
  --species-tree species_tree.nwk \
  -o results/ -p phylo_

With YAML config file

panhog --hog N0.tsv --fasta ./peptides/ --config config.yaml

Clade-specific analysis

panhog --hog N0.tsv --fasta ./peptides/ \
  --clade species1,species2,species3 -o results/ -p cladeA_

Pangene annotation pipeline

pangenehog --hog N0.tsv --fasta ./peptides/ --pan -o results/

Command-Line Options

Main Arguments

| Argument | Description | Default |
| --- | --- | --- |
| --hog | Path to the input HOGs TSV file (e.g., N0.tsv). | Required |
| --fasta | Path to the directory containing protein FASTA files. | Required |
| -o, --output | Directory where all results will be saved. | . (current dir) |
| -p, --prefix | A prefix to add to all output file names. | "" (none) |
| --config | Path to a YAML configuration file for advanced settings. | config.yaml |
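
For orientation, the sketch below shows one way to turn a HOG table into per-species gene counts. It assumes the OrthoFinder-style N0.tsv layout: three metadata columns (HOG, OG, Gene Tree Parent Clade) followed by one column per species holding comma-separated gene IDs. The function name `read_hog_counts` and the example rows are hypothetical, and PanHOG's own parser may behave differently.

```python
import csv
import io

def read_hog_counts(handle, n_meta_cols=3):
    """Return {hog_id: {species: gene_count}} from an N0.tsv-style table.

    Assumes the first `n_meta_cols` columns are metadata and every
    remaining column is one species whose cells hold comma-separated
    gene IDs (an empty cell means the gene is absent).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    species = header[n_meta_cols:]
    counts = {}
    for row in reader:
        if not row:
            continue
        # Pad short rows in case trailing empty cells were dropped.
        cells = row[n_meta_cols:] + [""] * (len(header) - len(row))
        counts[row[0]] = {
            sp: len([g for g in cell.split(",") if g.strip()])
            for sp, cell in zip(species, cells)
        }
    return counts

# Tiny made-up table standing in for a real N0.tsv file.
example = (
    "HOG\tOG\tGene Tree Parent Clade\tspA\tspB\tspC\n"
    "N0.HOG0000001\tOG0000001\tn0\ta1, a2\tb1\tc1\n"
    "N0.HOG0000002\tOG0000002\tn0\ta3\t\t\n"
)
counts = read_hog_counts(io.StringIO(example))
```

With a real file, the same function could be called as `read_hog_counts(open("N0.tsv", newline=""))`.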

Analysis Modes

| Argument | Description | Default |
| --- | --- | --- |
| --pan | Performs a global pangenome classification across all species. | Enabled if --clade is not used. |
| --clade | Performs a pangenome classification on a specific subset of species. Provide a comma-separated list. | None |
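
The difference between the two modes can be sketched with a small classifier that takes per-species copy numbers and, optionally, a clade subset. The thresholds are illustrative assumptions, not PanHOG's documented definitions: core means present in every species considered, single-copy means exactly one copy in every species, private means present in exactly one species, and shell covers everything in between. Cloud (unassigned genes) is not modelled here, and single-copy is returned as its own label for simplicity.

```python
def classify_hog(counts, species=None):
    """Classify one HOG from its {species: copy_number} mapping.

    `species` optionally restricts the analysis to a clade (a subset of
    the keys). Thresholds are illustrative assumptions only.
    """
    if species is not None:
        counts = {sp: counts.get(sp, 0) for sp in species}
    present = [sp for sp, n in counts.items() if n > 0]
    if not present:
        return "absent"
    if len(present) == len(counts):
        single = all(n == 1 for n in counts.values())
        return "single-copy" if single else "core"
    if len(present) == 1:
        return "private"
    return "shell"

hog = {"spA": 1, "spB": 1, "spC": 0}
global_class = classify_hog(hog)                 # all three species
clade_class = classify_hog(hog, ["spA", "spB"])  # clade subset only
```

The same HOG can land in different compartments depending on the species set: here it is shell across all three species but single-copy within the spA/spB clade.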

New Analyses (v0.2.0)

| Argument | Description | Default |
| --- | --- | --- |
| --summary | Generate summary statistics table and stacked bar chart. | False |
| --random-hog-matrix | Generate random HOG matrix heatmap with N HOGs. | None |
| --kaks | Perform Ka/Ks (dN/dS) selection analysis. | False |
| --cds | Path to directory containing CDS FASTA files (required for --kaks). | None |
| --kaks-type | HOG type for Ka/Ks analysis: core, shell, private, all. | core |
| --kaks-method | Ka/Ks calculation method: biopython or kakscalculator. | biopython |
| --aligner | Protein alignment tool: mafft or muscle. | mafft |
| --backtrans | Back-translation method: naive (built-in) or pal2nal. | naive |
| --reference | Reference species for pairwise Ka/Ks analysis. | None |
| --species-tree | Path to Newick species tree for phylogenetic LCA analysis. | None |
| --supermatrix | Generate supermatrix from single-copy orthologs. | False |
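
Ka/Ks analysis needs a codon alignment, which is why a protein aligner and a back-translation step appear together. The sketch below shows what a naive back-translation could look like: each residue in the gapped protein row is replaced by the next codon of its CDS, and each gap by three gap characters. This is an assumption about what "naive" means here, not PanHOG's code, and it does not check that each codon actually encodes the aligned residue.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def backtranslate(aligned_protein, cds, gap="-"):
    """Thread the codons of `cds` onto one gapped protein alignment row."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    # Drop a trailing stop codon if the CDS has exactly one extra codon.
    if len(codons) == n_residues + 1 and codons[-1].upper() in STOP_CODONS:
        codons = codons[:-1]
    if len(codons) != n_residues or len(cds) % 3 != 0:
        raise ValueError("CDS length does not match the protein sequence")
    codon_iter = iter(codons)
    return "".join(
        gap * 3 if aa == gap else next(codon_iter) for aa in aligned_protein
    )

codon_row = backtranslate("M-K", "ATGAAA")
```

The length check catches CDS and peptide files that are out of sync, a common source of silent errors in codon-based analyses.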

Tool Paths

| Argument | Description | Default |
| --- | --- | --- |
| --mafft-path | Path to MAFFT executable. | mafft |
| --muscle-path | Path to MUSCLE executable. | muscle |
| --pal2nal-path | Path to PAL2NAL script. | pal2nal.pl |
| --blastp-path | Path to BLASTP executable. | blastp |
| --makeblastdb-path | Path to makeblastdb executable. | makeblastdb |
| --kakscalculator-path | Path to KaKs_Calculator executable. | KaKs_Calculator |

Functional Annotation

| Argument | Description | Default |
| --- | --- | --- |
| --funano | Performs functional annotation against UniProt. Options: 0 (disabled), 1 (all), 2 (core), 3 (single-copy), 4 (shell), 5 (private), 6 (cloud). | 0 |
| --uniprot-db | Path to a local UniProt/Swiss-Prot FASTA file. If not provided, it will be downloaded automatically. | None |
| --threads | Number of CPU threads to use for the BLASTP search. | 4 |
| --keep-uniprot | If specified, the downloaded UniProt database will not be deleted after the run. | False |

Downstream Analyses & Visualizations

| Argument | Description | Default |
| --- | --- | --- |
| --proteome | Constructs a pan-proteome FASTA file. Use --proteome (written as --proteome ALL in the Quick Start) for all species or --proteome sp1,sp2 for a subset. | None |
| --genevar | Generates a heatmap of gene copy number variation. Use --genevar (written as --genevar ALL in the Quick Start) for all species or --genevar sp1,sp2 for a subset. | None |
| --saturation | Performs a bootstrapped saturation analysis to plot core and pan-genome growth. | False |
| --saturation-cladepair | Performs saturation analysis for two pre-defined clades (--clade1, --clade2) on the same plot. | False |
| --zscore | Applies Z-score normalization to the --genevar heatmap. | False |
| --log | Applies log2(count+1) transformation to the --genevar heatmap. | False |
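
The two heatmap transformations can be illustrated on a single row of copy numbers. This sketch assumes the Z-score is computed per HOG (row-wise) with the population standard deviation; PanHOG may normalize along a different axis or use a different estimator.

```python
import math
from statistics import mean, pstdev

def log2_transform(row):
    """Apply log2(count + 1) to each copy number in a row."""
    return [math.log2(n + 1) for n in row]

def zscore(row):
    """Row-wise Z-score; a constant row maps to all zeros."""
    mu, sigma = mean(row), pstdev(row)
    if sigma == 0:
        return [0.0 for _ in row]
    return [(n - mu) / sigma for n in row]

copy_numbers = [0, 1, 3]
logged = log2_transform(copy_numbers)
scaled = zscore([1, 3])
```

The log transform compresses large gene families so they do not dominate the color scale, while the Z-score emphasizes which species deviate from a HOG's typical copy number.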

Output Matrices

| Argument | Description | Default |
| --- | --- | --- |
| --pav | Generates a Presence/Absence Variant (PAV) matrix (1 for present, 0 for absent). | False |
| --matrix | Generates a gene copy number (Count) matrix. | False |
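
The relationship between the two matrices is simple: the PAV matrix is the count matrix with every non-zero entry set to 1. A minimal sketch, with a hypothetical TSV layout (the "HOG" header label and the column order are assumptions):

```python
def to_pav(count_matrix):
    """Convert {hog: {species: copy_number}} into a 1/0 presence matrix."""
    return {
        hog: {sp: int(n > 0) for sp, n in row.items()}
        for hog, row in count_matrix.items()
    }

def to_tsv(matrix, species):
    """Render a matrix as tab-separated text with one row per HOG."""
    lines = ["\t".join(["HOG", *species])]
    for hog in sorted(matrix):
        lines.append("\t".join([hog, *(str(matrix[hog][sp]) for sp in species)]))
    return "\n".join(lines) + "\n"

counts = {"HOG1": {"spA": 2, "spB": 1}, "HOG2": {"spA": 0, "spB": 3}}
pav = to_pav(counts)
pav_text = to_tsv(pav, ["spA", "spB"])
```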

Saturation Plot Customization

| Argument | Description | Default |
| --- | --- | --- |
| -b, --bootstrap | Number of random sampling iterations for saturation analysis. | 100 |
| --clade1 / --clade2 | Comma-separated species lists for --saturation-cladepair. | (pre-defined lists) |
| --marker-* / --color-* | A range of options to customize markers and colors for saturation plots. | (various) |
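
To show what the sampling iterations control, the sketch below computes mean core and pan sizes as genomes are added in random order. It assumes a common resampling scheme (random genome orderings, averaged over iterations); the function name, data layout, and seed handling are hypothetical and PanHOG's procedure may differ.

```python
import random

def saturation_curves(pav, species, n_boot=100, seed=0):
    """Mean core and pan HOG counts for 1..N genomes.

    `pav` maps each HOG to the set of species in which it is present.
    """
    species = list(species)
    rng = random.Random(seed)
    n = len(species)
    core_sum = [0] * n
    pan_sum = [0] * n
    for _ in range(n_boot):
        order = rng.sample(species, n)  # one random genome addition order
        for k in range(1, n + 1):
            chosen = set(order[:k])
            for members in pav.values():
                hits = len(members & chosen)
                if hits > 0:
                    pan_sum[k - 1] += 1   # seen in at least one genome
                if hits == k:
                    core_sum[k - 1] += 1  # seen in every genome so far
    core = [total / n_boot for total in core_sum]
    pan = [total / n_boot for total in pan_sum]
    return core, pan

pav = {"H1": {"spA", "spB", "spC"}, "H2": {"spA", "spB"}, "H3": {"spC"}}
core, pan = saturation_curves(pav, ["spA", "spB", "spC"], n_boot=20)
```

More iterations smooth the curves at the cost of runtime, since each iteration rescans every HOG for every genome count.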

Configuration File

You can use a YAML config file to set advanced parameters like colors, markers, labels, and clade definitions.

Sample config.yaml

# Input files
hog: "N0.tsv"
fasta: "/path/to/peptide/fasta"

# Output settings
output: "./results"
prefix: "my_analysis_"

# Analysis options
pan: true
summary: true
random_hog_matrix: 1000
bootstrap: 10000

# Ka/Ks analysis
# kaks: true
# cds: "/path/to/cds/fasta"
# kaks_type: "core"
# aligner: "mafft"

# Phylogeny LCA
# species_tree: "species_tree.nwk"

# Supermatrix
# supermatrix: true

# Saturation plot customization
marker_core: "o"
marker_pan: "s"

Run using config:

panhog --hog N0.tsv --fasta ./peptides/ --config config.yaml

Any command-line flag will override the corresponding config value.
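
One common way to implement this precedence is to give every flag a default of None and let only explicitly supplied flags replace config values. The sketch below illustrates that pattern and is not PanHOG's code: the `config` dict stands in for a parsed YAML file, only three of the documented flags are modelled, and `merge_settings` is a hypothetical helper.

```python
import argparse

def merge_settings(config, argv):
    """Overlay explicitly given command-line flags on config values."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag not given" apart from a real value.
    parser.add_argument("-o", "--output", default=None)
    parser.add_argument("-p", "--prefix", default=None)
    parser.add_argument("-b", "--bootstrap", type=int, default=None)
    cli = vars(parser.parse_args(argv))
    merged = dict(config)
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged

# Stand-in for the parsed contents of config.yaml.
config = {"output": "./results", "prefix": "my_analysis_", "bootstrap": 10000}
settings = merge_settings(config, ["-b", "500"])
```

Here the bootstrap count from the command line wins, while output and prefix keep their config values.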


Output Files

  • core.HOGs.tsv, shell.HOGs.tsv, gt-specific.HOGs.tsv, single-copy.HOGs.tsv
  • cloud.unassigned_genes.tsv
  • private_genes_<species>.txt
  • pan_proteome.fa
  • summary_stats.tsv, summary_stats.png (v0.2.0)
  • random_hog_matrix.png (v0.2.0)
  • pav_matrix.tsv, count_matrix.tsv
  • genevar_heatmap.[png|pdf|svg]
  • saturation_analysis.[png|pdf|svg]
  • kaks_results.tsv (v0.2.0)
  • phylogeny_lca_results.tsv (v0.2.0)
  • supermatrix.fasta, supermatrix_partitions.txt (v0.2.0)

Recommendations

  • Use --summary for a quick overview of your pangenome composition.
  • Use --random-hog-matrix 1000 to visualize HOG presence patterns across species.
  • Use --proteome to extract FASTA of shared pangenes.
  • Use --saturation-cladepair for insight into core/pan genome expansion across defined clades.
  • Use --genevar with --zscore to highlight population-scale gene family expansions or contractions.
  • Use --kaks with --kaks-type core to identify genes under selection in the core genome.
  • Use --species-tree with a well-supported phylogeny for evolutionary context.
  • Use --supermatrix to generate input for phylogenomic tree inference.
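
For readers unfamiliar with supermatrices, the sketch below concatenates per-locus alignments and records where each locus starts and ends. It is an illustration under stated assumptions, not PanHOG's implementation: missing species are padded with gaps, loci are concatenated in sorted order, and partition lines follow a RAxML-style "model, name = start-end" layout in which the model label "LG" is only a placeholder.

```python
def build_supermatrix(alignments, species, missing="-", model="LG"):
    """Concatenate {locus: {species: aligned_seq}} into one matrix.

    Returns ({species: concatenated_seq}, [partition lines]).
    """
    matrix = {sp: [] for sp in species}
    partitions = []
    start = 1
    for locus in sorted(alignments):
        aln = alignments[locus]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{locus}: sequences are not aligned to one length")
        length = lengths.pop()
        for sp in species:
            # Species missing from this locus are filled with gap characters.
            matrix[sp].append(aln.get(sp, missing * length))
        partitions.append(f"{model}, {locus} = {start}-{start + length - 1}")
        start += length
    return {sp: "".join(parts) for sp, parts in matrix.items()}, partitions

alignments = {
    "HOG1": {"spA": "MK-", "spB": "MKL"},
    "HOG2": {"spA": "AA"},
}
supermatrix, partitions = build_supermatrix(alignments, ["spA", "spB"])
```

Partition coordinates are 1-based and inclusive, so the second locus begins right after the last column of the first.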

Contact & Citation

This tool is currently in beta. For questions, contributions, or citation requests, please contact the developer or include the GitHub link in your reference.


Happy pangenomics with PanHOG!
