PanHOG is a phylogeny-aware command-line toolkit for classifying and annotating Hierarchical Orthologous Groups (HOGs) across multiple genomes in a pangenomic dataset. It supports core/shell/private gene classification, heatmap visualization, pan-proteome generation, functional annotation, saturation analysis, Ka/Ks selection analysis, phylogenetic LCA analysis, and supermatrix generation. Options can be supplied as command-line arguments or through an external YAML configuration file.
---
Changelog
[v0.2.0] - 2026-02-27
New Features
Summary Statistics (--summary): Generates a summary statistics table and stacked bar chart showing Gene/HOG ratios across pangenome compartments (core, single-copy, shell, private, cloud).
Random HOG Matrix (--random-hog-matrix N): Generates an absent/single/multi-copy heatmap for N randomly sampled HOGs across all species.
Ka/Ks Analysis (--kaks): Performs codon-based selection analysis (dN/dS) on orthologous groups using the Nei-Gojobori (1986) method.
Supports core, shell, private, or all HOG types via --kaks-type.
Calculation methods: biopython (built-in) or kakscalculator (external).
Aligners: mafft or muscle via --aligner.
Back-translation: naive (built-in) or pal2nal (external) via --backtrans.
Optional pairwise mode with --reference species.
Phylogenetic LCA Analysis (--species-tree): Maps HOGs to their Lowest Common Ancestor on a species tree, generating per-node HOG counts and a phylogeny-aware classification.
Supermatrix Generation (--supermatrix): Concatenates aligned single-copy orthologs into a supermatrix for phylogenomic inference, with partition file for RAxML/IQ-TREE.
Graceful Dependency Handling: Optional dependencies (PyYAML, BioPython, matplotlib/seaborn) are now handled with try/except, providing clear error messages when missing.
Extended FASTA Support: Now recognizes .pep.fa and uppercase extensions (.FASTA, .FA, .PEP).
Conda Packaging: Added conda-recipe/meta.yaml, environment.yml, and updated setup.py for single-command install via conda or pip.
Updated Config Examples: Example YAML config files now include all new analysis options.
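The graceful dependency handling described above can be pictured with a short sketch. This is an illustration of the general try/except pattern only, not PanHOG's actual code; the helper names `optional_import` and `require` are hypothetical.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require(module, pip_name, feature):
    """Fail with a readable message when a feature lacks its dependency."""
    if module is None:
        raise RuntimeError(
            f"{feature} requires '{pip_name}'. Install it with: pip install {pip_name}"
        )
    return module


# Optional dependency: import name 'yaml', distributed on PyPI as 'PyYAML'.
yaml = optional_import("yaml")

# Only fail when the user actually asks for the feature, for example:
# require(yaml, "PyYAML", "YAML configuration loading")
```

Deferring the error until the feature is requested keeps the rest of the toolkit usable when an optional package is absent.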
[Released] - 2025-09-21
New Features
Functional Annotation (--funano): Added a major feature to perform functional annotation of pangenome compartments (core, shell, private, etc.) against the UniProt/Swiss-Prot database.
Supports annotation of all compartments (--funano 1) or specific ones (--funano 2-6).
Automatically downloads the database if a local copy is not provided via --uniprot-db.
Generates detailed annotation reports, raw BLAST results, and protein sequence files.
PAV and Count Matrix Generation:
Added --pav flag to generate a Presence/Absence Variant (PAV) matrix.
Added --matrix flag to generate a gene copy number (Count) matrix.
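The relationship between the two matrices can be sketched as follows: a PAV matrix is the count matrix reduced to 0/1. The data layout (a dict of dicts) and the function name `counts_to_pav` are assumptions for illustration and do not reflect PanHOG's output format.

```python
def counts_to_pav(count_matrix):
    """Convert {hog_id: {species: copy_number}} into a 0/1 presence/absence matrix."""
    return {
        hog: {species: int(copies > 0) for species, copies in row.items()}
        for hog, row in count_matrix.items()
    }


# Hypothetical copy-number counts for two HOGs across three species.
counts = {
    "HOG0001": {"sp1": 1, "sp2": 2, "sp3": 1},
    "HOG0002": {"sp1": 0, "sp2": 3, "sp3": 0},
}
pav = counts_to_pav(counts)
for hog, row in pav.items():
    print(hog, row)
```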
Improvements
Restructured Output: The output directory structure has been completely reorganized for better clarity. All results are now saved within a main results/ directory, with sub-folders for annotations/, blast_results/, peptides/, and compartments/.
Organized Classification: Pangenome classification files (core.HOGs.tsv, etc.) are now neatly stored in results/compartments/panhog_classification/.
Improved Cloud Gene File: The cloud.unassigned_genes.tsv file now correctly includes a header.
Key Features
| Feature | Flag | Description |
| --- | --- | --- |
| Global Pangenome Classification | --pan | Classify HOGs into core, single-copy, shell, private, and cloud |
| Clade-Specific Analysis | --clade sp1,sp2,... | Pangenome classification on a subset of species |
| Pan-Proteome Construction | --proteome | Extract FASTA sequences for pangenome compartments |
| Gene Variation Heatmap | --genevar | Heatmap of gene copy number variation across species |
| Saturation Analysis | --saturation | Bootstrapped core/pan-genome growth curves |
| Clade-pair Saturation | --saturation-cladepair | Compare saturation between two clades |
| Summary Statistics | --summary | Summary table and stacked bar chart |
| Random HOG Matrix | --random-hog-matrix N | Heatmap of N randomly sampled HOGs |
| PAV Matrix | --pav | Presence/Absence Variant matrix |
| Count Matrix | --matrix | Gene copy number matrix |
| Functional Annotation | --funano | BLAST-based annotation against UniProt |
| Ka/Ks Analysis | --kaks | Selection analysis on orthologous groups |
| Phylogenetic LCA | --species-tree | Map HOGs to species tree nodes |
| Supermatrix | --supermatrix | Concatenated single-copy orthologs for phylogenomics |
| Config File | --config | YAML-based configuration |
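The compartment names listed above can be illustrated with a minimal sketch. The rules used here are assumptions based on common pangenome usage (core = present in every species, single-copy = core with exactly one copy per species, private = present in one species, shell = everything in between); PanHOG's actual thresholds may differ. Cloud genes are those not assigned to any HOG, so they cannot be derived from HOG counts and are left out.

```python
def classify_hog(copies_by_species):
    """Classify one HOG from its per-species copy numbers (assumed rules)."""
    n_species = len(copies_by_species)
    present = [n for n in copies_by_species.values() if n > 0]
    if not present:
        return "absent"
    if len(present) == n_species:
        # Present everywhere: single-copy is the strict subset of core.
        return "single-copy" if all(n == 1 for n in present) else "core"
    if len(present) == 1:
        return "private"
    return "shell"


# Hypothetical copy-number rows for four HOGs across three species.
examples = {
    "HOG_A": {"sp1": 1, "sp2": 1, "sp3": 1},
    "HOG_B": {"sp1": 2, "sp2": 1, "sp3": 1},
    "HOG_C": {"sp1": 1, "sp2": 0, "sp3": 4},
    "HOG_D": {"sp1": 0, "sp2": 0, "sp3": 2},
}
labels = {hog: classify_hog(row) for hog, row in examples.items()}
print(labels)
```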
Installation
Option 1: Conda (Recommended)
# Create environment with all dependencies
conda env create -f environment.yml
conda activate panhog
# Install PanHOG
pip install .
Option 2: Pip
pip install panhog
Option 3: From Source
git clone https://github.com/yykaya/PanHOG.git
cd PanHOG
pip install -e .
After installation, the panhog and pangenehog commands are available on the command line in the active environment.
Use --summary for a quick overview of your pangenome composition.
Use --random-hog-matrix 1000 to visualize HOG presence patterns across species.
Use --proteome to extract FASTA sequences for the pangenome compartments.
Use --saturation-cladepair to compare core/pan-genome growth between two defined clades.
Use --genevar with --zscore to highlight gene copy number expansions or contractions across species.
Use --kaks with --kaks-type core to identify genes under selection in the core genome.
Use --species-tree with a well-supported phylogeny for evolutionary context.
Use --supermatrix to generate input for phylogenomic tree inference.
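To make the supermatrix idea concrete, here is a minimal sketch of concatenating per-gene alignments and recording partition boundaries. It is not PanHOG's implementation: the function name `build_supermatrix`, the input layout, and the `WAG` model label are assumptions, and padding missing species with gaps is a defensive choice rather than documented behavior. The partition lines follow the `MODEL, name = start-end` style read by RAxML and IQ-TREE.

```python
def build_supermatrix(alignments, species, model="WAG"):
    """Concatenate {gene: {species: aligned_seq}} into one sequence per species.

    Returns (supermatrix, partition_lines). Species missing from a gene
    are padded with gap characters of that gene's alignment length.
    """
    pieces = {sp: [] for sp in species}
    partitions = []
    start = 1
    for gene in sorted(alignments):
        aln = alignments[gene]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        length = lengths.pop()
        for sp in species:
            pieces[sp].append(aln.get(sp, "-" * length))
        end = start + length - 1
        partitions.append(f"{model}, {gene} = {start}-{end}")
        start = end + 1
    supermatrix = {sp: "".join(parts) for sp, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical aligned single-copy orthologs for two species.
alignments = {
    "HOG0001": {"sp1": "MK-", "sp2": "MKL"},
    "HOG0002": {"sp1": "AA"},
}
matrix, parts = build_supermatrix(alignments, ["sp1", "sp2"])
print(matrix)
print("\n".join(parts))
```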
Contact & Citation
This tool is currently in beta. For questions or contributions, please contact the developer. To cite PanHOG, include the GitHub link in your reference.
Happy pangenomics with PanHOG!