PanHOG

PanHOG is a phylogeny-aware command-line toolkit for classifying and annotating Hierarchical Orthologous Groups (HOGs) across multiple genomes in pangenomic datasets. It supports core/shell/private gene classification, heatmap visualization, pan-proteome generation, functional annotation, saturation analysis, Ka/Ks selection analysis, phylogenetic LCA analysis, and supermatrix generation. Options can be supplied as command-line arguments or through an external YAML configuration file.

---

Changelog

[v0.2.0] - 2026-02-27

New Features

Summary Statistics (--summary): Generates a summary statistics table and stacked bar chart showing Gene/HOG ratios across pangenome compartments (core, single-copy, shell, private, cloud).
Random HOG Matrix (--random-hog-matrix N): Generates an absent/single/multi-copy heatmap for N randomly sampled HOGs across all species.
Ka/Ks Analysis (--kaks): Performs codon-based selection analysis (dN/dS) on orthologous groups using the Nei-Gojobori (1986) method.
- Supports core, shell, private, or all HOG types via --kaks-type.
- Calculation methods: biopython (built-in) or kakscalculator (external).
- Aligners: mafft or muscle via --aligner.
- Back-translation: naive (built-in) or pal2nal (external) via --backtrans.
- Optional pairwise mode against a single reference species via --reference.
Phylogenetic LCA Analysis (--species-tree): Maps HOGs to their Lowest Common Ancestor on a species tree, generating per-node HOG counts and a phylogeny-aware classification.
Supermatrix Generation (--supermatrix): Concatenates aligned single-copy orthologs into a supermatrix for phylogenomic inference, with partition file for RAxML/IQ-TREE.
Configurable Tool Paths: Added --mafft-path, --muscle-path, --pal2nal-path, --blastp-path, --makeblastdb-path, --kakscalculator-path for HPC/custom installations.

Improvements

Graceful Dependency Handling: Optional dependencies (PyYAML, BioPython, matplotlib/seaborn) are now handled with try/except, providing clear error messages when missing.
Extended FASTA Support: Now recognizes .pep.fa and uppercase extensions (.FASTA, .FA, .PEP).
Conda Packaging: Added conda-recipe/meta.yaml, environment.yml, and updated setup.py for single-command install via conda or pip.
Updated Config Examples: Example YAML config files now include all new analysis options.

[Released] - 2025-09-21

New Features

Functional Annotation (--funano): Annotates pangenome compartments (core, shell, private, etc.) against the UniProt/Swiss-Prot database.
- Supports annotation of all compartments (--funano 1) or specific ones (--funano 2-6).
- Automatically downloads the database if a local copy is not provided via --uniprot-db.
- Generates detailed annotation reports, raw BLAST results, and protein sequence files.
PAV and Count Matrix Generation:
- Added --pav flag to generate a Presence/Absence Variant (PAV) matrix.
- Added --matrix flag to generate a gene copy number (Count) matrix.

Improvements

Restructured Output: The output directory structure has been completely reorganized for better clarity. All results are now saved within a main results/ directory, with sub-folders for annotations/, blast_results/, peptides/, and compartments/.
Organized Classification: Pangenome classification files (core.HOGs.tsv, etc.) are now neatly stored in results/compartments/panhog_classification/.
Improved Cloud Gene File: The cloud.unassigned_genes.tsv file now correctly includes a header.

Key Features

Feature	Flag	Description
Global Pangenome Classification	`--pan`	Classify HOGs into core, single-copy, shell, private, and cloud
Clade-Specific Analysis	`--clade sp1,sp2,...`	Pangenome classification on a subset of species
Pan-Proteome Construction	`--proteome`	Extract FASTA sequences for pangenome compartments
Gene Variation Heatmap	`--genevar`	Heatmap of gene copy number variation across species
Saturation Analysis	`--saturation`	Bootstrapped core/pan-genome growth curves
Clade-pair Saturation	`--saturation-cladepair`	Compare saturation between two clades
Summary Statistics	`--summary`	Summary table and stacked bar chart
Random HOG Matrix	`--random-hog-matrix N`	Heatmap of N randomly sampled HOGs
PAV Matrix	`--pav`	Presence/Absence Variant matrix
Count Matrix	`--matrix`	Gene copy number matrix
Functional Annotation	`--funano`	BLAST-based annotation against UniProt
Ka/Ks Analysis	`--kaks`	Selection analysis on orthologous groups
Phylogenetic LCA	`--species-tree`	Map HOGs to species tree nodes
Supermatrix	`--supermatrix`	Concatenated single-copy orthologs for phylogenomics
Config File	`--config`	YAML-based configuration

Installation

Option 1: Conda (Recommended)

# Create environment with all dependencies
conda env create -f environment.yml
conda activate panhog

# Install PanHOG
pip install .

Option 2: Pip

pip install panhog

Option 3: From Source

git clone https://github.com/yykaya/PanHOG.git
cd PanHOG
pip install -e .

After installation, the panhog and pangenehog commands are available on your PATH whenever the environment you installed into is active.

Documentation

Configuration Guide - Detailed guide for configuring PanHOG with YAML files
Pangene Integration Guide - Instructions for using the pangene integration module

Quick Start

Basic pangenome classification

panhog --hog N0.tsv --fasta ./peptides/ --pan -o results/ -p run1_
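
To make the compartment names concrete, here is a minimal Python sketch of how a single HOG could be assigned to a compartment from its per-species gene counts. The definitions used (core = present in every species, single-copy = exactly one gene in every species, private = present in one species only, shell = everything in between) are assumptions for illustration; PanHOG's actual thresholds may differ. The `classify_hog` function is hypothetical and not part of the PanHOG API. Cloud genes are not covered because they are genes that belong to no HOG at all.

```python
def classify_hog(copy_counts):
    """Classify one HOG from a {species: gene_count} mapping.

    Assumed definitions (not confirmed against PanHOG's source):
      single-copy: exactly one gene in every species
      core:        at least one gene in every species
      private:     genes in exactly one species
      shell:       genes in some, but not all, species
    """
    present = [n for n in copy_counts.values() if n > 0]
    if not present:
        return "absent"  # no species carries this HOG
    if len(present) == len(copy_counts):
        return "single-copy" if all(n == 1 for n in present) else "core"
    if len(present) == 1:
        return "private"
    return "shell"


# Hypothetical counts for three species
example = {
    "HOG_A": {"sp1": 1, "sp2": 1, "sp3": 1},
    "HOG_B": {"sp1": 2, "sp2": 1, "sp3": 1},
    "HOG_C": {"sp1": 2, "sp2": 1, "sp3": 0},
    "HOG_D": {"sp1": 0, "sp2": 3, "sp3": 0},
}
labels = {hog: classify_hog(counts) for hog, counts in example.items()}
```

In this sketch single-copy HOGs are reported separately from core, although by definition they are a subset of it.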

Full pipeline with all analyses

panhog --hog N0.tsv --fasta ./peptides/ --pan \
  --proteome ALL --genevar ALL --saturation \
  --pav --matrix --summary --random-hog-matrix 1000 \
  -o results/ -p full_

Ka/Ks selection analysis

panhog --hog N0.tsv --fasta ./peptides/ --pan \
  --kaks --cds ./cds_fasta/ --kaks-type core \
  --aligner mafft --backtrans naive \
  -o results/ -p kaks_
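
The command above aligns proteins and then back-translates the alignment to codons before computing Ka/Ks. As a rough illustration of what a naive back-translation step involves, the sketch below threads the gaps of an aligned protein onto its unaligned CDS. Whether `--backtrans naive` works exactly this way is an assumption, and `back_translate` is a hypothetical helper, not a PanHOG function.

```python
def back_translate(aligned_protein, cds):
    """Thread gaps from an aligned protein onto its unaligned CDS.

    Each residue consumes one codon; each gap becomes '---'.
    Trailing codons (for example a stop codon) are dropped.
    """
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues > len(codons):
        raise ValueError("CDS is too short for the aligned protein")
    next_codon = iter(codons)
    return "".join("---" if aa == "-" else next(next_codon)
                   for aa in aligned_protein)


# 'M-K' aligned protein, CDS with a trailing stop codon
codon_alignment = back_translate("M-K", "ATGAAATAA")
```

This sketch does not check that each codon actually translates to the aligned residue, which a tool such as PAL2NAL does verify.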

Phylogenetic LCA analysis

panhog --hog N0.tsv --fasta ./peptides/ --pan \
  --species-tree species_tree.nwk \
  -o results/ -p phylo_
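
For readers unfamiliar with LCA mapping, the following sketch shows the idea on a tiny tree: each HOG is assigned to the deepest tree node that is an ancestor of every species carrying it, and HOGs are then counted per node. The tree is given as a hypothetical child-to-parent dictionary rather than parsed from Newick, and all names here are illustrative assumptions, not PanHOG internals.

```python
from collections import Counter


def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(species, parent):
    """Deepest node that is an ancestor of (or equal to) every species."""
    species = list(species)
    if not species:
        raise ValueError("need at least one species")
    # Candidates are ordered from leaf to root, so index 0 stays the deepest.
    candidates = path_to_root(species[0], parent)
    for sp in species[1:]:
        ancestors = set(path_to_root(sp, parent))
        candidates = [n for n in candidates if n in ancestors]
    return candidates[0]


# Hypothetical tree: ((sp1,sp2)n1,sp3)root
parent = {"sp1": "n1", "sp2": "n1", "n1": "root", "sp3": "root"}
hog_species = {"HOG1": ["sp1", "sp2"], "HOG2": ["sp1", "sp3"], "HOG3": ["sp3"]}
per_node = Counter(lowest_common_ancestor(sp, parent)
                   for sp in hog_species.values())
```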

With YAML config file

panhog --hog N0.tsv --fasta ./peptides/ --config config.yaml

Clade-specific analysis

panhog --hog N0.tsv --fasta ./peptides/ \
  --clade species1,species2,species3 -o results/ -p cladeA_

Pangene annotation pipeline

pangenehog --hog N0.tsv --fasta ./peptides/ --pan -o results/

Command-Line Options

Main Arguments

Argument	Description	Default
`--hog`	Path to the input HOGs TSV file (e.g., `N0.tsv`).	Required
`--fasta`	Path to the directory containing protein FASTA files.	Required
`-o`, `--output`	Directory where all results will be saved.	`.` (current dir)
`-p`, `--prefix`	A prefix to add to all output file names.	`""` (none)
`--config`	Path to a YAML configuration file for advanced settings.	`config.yaml`

Analysis Modes

Argument	Description	Default
`--pan`	Performs a global pangenome classification across all species.	Enabled if `--clade` is not used.
`--clade`	Performs a pangenome classification on a specific subset of species. Provide a comma-separated list.	`None`

New Analyses (v0.2.0)

Argument	Description	Default
`--summary`	Generate summary statistics table and stacked bar chart.	`False`
`--random-hog-matrix`	Generate random HOG matrix heatmap with N HOGs.	`None`
`--kaks`	Perform Ka/Ks (dN/dS) selection analysis.	`False`
`--cds`	Path to directory containing CDS FASTA files (required for `--kaks`).	`None`
`--kaks-type`	HOG type for Ka/Ks analysis: `core`, `shell`, `private`, `all`.	`core`
`--kaks-method`	Ka/Ks calculation method: `biopython` or `kakscalculator`.	`biopython`
`--aligner`	Protein alignment tool: `mafft` or `muscle`.	`mafft`
`--backtrans`	Back-translation method: `naive` (built-in) or `pal2nal`.	`naive`
`--reference`	Reference species for pairwise Ka/Ks analysis.	`None`
`--species-tree`	Path to Newick species tree for phylogenetic LCA analysis.	`None`
`--supermatrix`	Generate supermatrix from single-copy orthologs.	`False`

Tool Paths

Argument	Description	Default
`--mafft-path`	Path to MAFFT executable.	`mafft`
`--muscle-path`	Path to MUSCLE executable.	`muscle`
`--pal2nal-path`	Path to PAL2NAL script.	`pal2nal.pl`
`--blastp-path`	Path to BLASTP executable.	`blastp`
`--makeblastdb-path`	Path to makeblastdb executable.	`makeblastdb`
`--kakscalculator-path`	Path to KaKs_Calculator executable.	`KaKs_Calculator`

Functional Annotation

Argument	Description	Default
`--funano`	Performs functional annotation against UniProt. Options: `0` (disabled), `1` (all), `2` (core), `3` (single-copy), `4` (shell), `5` (private), `6` (cloud).	`0`
`--uniprot-db`	Path to a local UniProt/Swiss-Prot FASTA file. If not provided, it will be downloaded automatically.	`None`
`--threads`	Number of CPU threads to use for the BLASTP search.	`4`
`--keep-uniprot`	If specified, the downloaded UniProt database will not be deleted after the run.	`False`

Downstream Analyses & Visualizations

Argument	Description	Default
`--proteome`	Constructs a pan-proteome FASTA file. Use `--proteome` for all species or `--proteome sp1,sp2` for a subset.	`None`
`--genevar`	Generates a heatmap of gene copy number variation. Use `--genevar` for all species or `--genevar sp1,sp2` for a subset.	`None`
`--saturation`	Performs a bootstrapped saturation analysis to plot core and pan-genome growth.	`False`
`--saturation-cladepair`	Performs saturation analysis for two pre-defined clades (`--clade1`, `--clade2`) on the same plot.	`False`
`--zscore`	Applies Z-score normalization to the `--genevar` heatmap.	`False`
`--log`	Applies log2(count+1) transformation to the `--genevar` heatmap.	`False`
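
The two heatmap transformations can be sketched in a few lines of Python. The log transform follows the log2(count+1) formula given in the table. The z-score shown here is computed per row (one HOG across species) with the population standard deviation, which is an assumption; PanHOG may normalize along a different axis. Both function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def log2_transform(counts):
    """Apply log2(count + 1) to a list of gene copy numbers."""
    return [math.log2(c + 1) for c in counts]


def zscore(values):
    """Z-score one row; a constant row maps to all zeros."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


row = [0, 1, 3]                 # copy numbers of one HOG in three species
logged = log2_transform(row)    # log2 of 1, 2 and 4
scaled = zscore(row)
```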

Output Matrices

Argument	Description	Default
`--pav`	Generates a Presence/Absence Variant (PAV) matrix (1 for present, 0 for absent).	`False`
`--matrix`	Generates a gene copy number (Count) matrix.	`False`
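
As an illustration of how the two matrices relate, the sketch below derives a count matrix from a HOG table and then reduces it to presence/absence. It assumes an OrthoFinder-style layout for the input (three metadata columns followed by one column per species, with comma-separated gene IDs in each cell); that layout and the function names are assumptions, not a description of PanHOG's parser.

```python
import csv
import io

# Assumed metadata columns in an OrthoFinder-style N0.tsv
META_COLUMNS = ("HOG", "OG", "Gene Tree Parent Clade")


def count_matrix(tsv_text):
    """Return {hog: {species: gene_count}} from tab-separated HOG text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    species = [c for c in reader.fieldnames if c not in META_COLUMNS]
    counts = {}
    for row in reader:
        counts[row["HOG"]] = {
            sp: len([g for g in (row[sp] or "").split(",") if g.strip()])
            for sp in species
        }
    return counts


def pav_matrix(counts):
    """Collapse counts to 1 (present) or 0 (absent)."""
    return {hog: {sp: int(n > 0) for sp, n in row.items()}
            for hog, row in counts.items()}


example = (
    "HOG\tOG\tGene Tree Parent Clade\tsp1\tsp2\n"
    "N0.HOG0000000\tOG0000000\tn0\ta1, a2\tb1\n"
    "N0.HOG0000001\tOG0000001\tn0\t\tb2\n"
)
counts = count_matrix(example)
pav = pav_matrix(counts)
```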

Saturation Plot Customization

Argument	Description	Default
`-b`, `--bootstrap`	Number of random sampling iterations for saturation analysis.	`100`
`--clade1` / `--clade2`	Comma-separated species lists for `--saturation-cladepair`.	(pre-defined lists)
`--marker-*` / `--color-*`	Options for customizing markers and colors in saturation plots.	(various)
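
The idea behind a bootstrapped saturation analysis can be sketched as follows: genomes are added in a random order, the core (intersection) and pan (union) sizes are recorded after each addition, and the curves are averaged over many random orders. This is a generic illustration under the assumption that each iteration is a random permutation of species; it is not PanHOG's implementation, and `saturation_curves` is a hypothetical name.

```python
import random


def saturation_curves(presence, n_bootstrap=100, seed=0):
    """Mean core and pan sizes after adding 1..N genomes in random order.

    presence: {species: set of HOG ids}, with at least one species.
    """
    rng = random.Random(seed)
    species = sorted(presence)
    core_sum = [0] * len(species)
    pan_sum = [0] * len(species)
    for _ in range(n_bootstrap):
        order = species[:]
        rng.shuffle(order)
        core = set(presence[order[0]])
        pan = set()
        for i, sp in enumerate(order):
            core &= presence[sp]   # HOGs shared by all genomes so far
            pan |= presence[sp]    # HOGs seen in any genome so far
            core_sum[i] += len(core)
            pan_sum[i] += len(pan)
    core_mean = [s / n_bootstrap for s in core_sum]
    pan_mean = [s / n_bootstrap for s in pan_sum]
    return core_mean, pan_mean


presence = {"sp1": {"A", "B"}, "sp2": {"A", "C"}, "sp3": {"A"}}
core_mean, pan_mean = saturation_curves(presence, n_bootstrap=50)
```

The core curve can only fall and the pan curve can only rise as genomes are added, which is why a flattening pan curve is read as a sign of a closed pangenome.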

Configuration File

You can use a YAML config file to set advanced parameters like colors, markers, labels, and clade definitions.

Sample `config.yaml`

# Input files
hog: "N0.tsv"
fasta: "/path/to/peptide/fasta"

# Output settings
output: "./results"
prefix: "my_analysis_"

# Analysis options
pan: true
summary: true
random_hog_matrix: 1000
bootstrap: 10000

# Ka/Ks analysis
# kaks: true
# cds: "/path/to/cds/fasta"
# kaks_type: "core"
# aligner: "mafft"

# Phylogeny LCA
# species_tree: "species_tree.nwk"

# Supermatrix
# supermatrix: true

# Saturation plot customization
marker_core: "o"
marker_pan: "s"

Run using config:

panhog --hog N0.tsv --fasta ./peptides/ --config config.yaml

Any command-line flag will override the corresponding config value.
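
One common way to implement this kind of precedence is to give every CLI option a default of None, so that flags the user did not pass can be told apart from explicit values. The sketch below shows that pattern with three of the options documented above; it is an assumption about how such merging could be done, not PanHOG's code. The config is shown as a plain dict, as would be returned by a YAML loader such as `yaml.safe_load`.

```python
import argparse

# Built-in defaults as documented in the option tables
DEFAULTS = {"output": ".", "prefix": "", "bootstrap": 100}


def merge_settings(config, argv):
    """Merge defaults < config file < command-line flags."""
    parser = argparse.ArgumentParser()
    # default=None marks "flag not given on the command line"
    parser.add_argument("-o", "--output", default=None)
    parser.add_argument("-p", "--prefix", default=None)
    parser.add_argument("-b", "--bootstrap", type=int, default=None)
    cli = vars(parser.parse_args(argv))

    merged = dict(DEFAULTS)
    merged.update(config)
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged


config = {"prefix": "my_analysis_", "bootstrap": 10000}
settings = merge_settings(config, ["-b", "500"])
```

Here `bootstrap` ends up as 500 (from the command line), `prefix` as `my_analysis_` (from the config), and `output` as `.` (the default).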

Output Files

core.HOGs.tsv, shell.HOGs.tsv, gt-specific.HOGs.tsv, single-copy.HOGs.tsv
cloud.unassigned_genes.tsv
private_genes_<species>.txt
pan_proteome.fa
summary_stats.tsv, summary_stats.png (v0.2.0)
random_hog_matrix.png (v0.2.0)
pav_matrix.tsv, count_matrix.tsv
genevar_heatmap.[png|pdf|svg]
saturation_analysis.[png|pdf|svg]
kaks_results.tsv (v0.2.0)
phylogeny_lca_results.tsv (v0.2.0)
supermatrix.fasta, supermatrix_partitions.txt (v0.2.0)
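
To show how a supermatrix and its partition file fit together, the sketch below concatenates per-locus alignments and records each locus's column range in RAxML-style `MODEL, name = start-end` lines. The input structure, the `LG` model label, and the gap-filling of missing species are assumptions for illustration; the exact content of PanHOG's supermatrix_partitions.txt may differ.

```python
def build_supermatrix(alignments, model="LG"):
    """Concatenate aligned loci and build partition lines.

    alignments: {locus: {species: aligned_sequence}}
    Returns ({species: concatenated_sequence}, [partition_line, ...]).
    """
    species = sorted({sp for aln in alignments.values() for sp in aln})
    matrix = {sp: [] for sp in species}
    partitions = []
    start = 1
    for locus in sorted(alignments):
        aln = alignments[locus]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{locus}: sequences differ in length")
        length = lengths.pop()
        for sp in species:
            # Species missing from a locus are padded with gaps
            matrix[sp].append(aln.get(sp, "-" * length))
        partitions.append(f"{model}, {locus} = {start}-{start + length - 1}")
        start += length
    return {sp: "".join(parts) for sp, parts in matrix.items()}, partitions


alignments = {
    "HOG1": {"sp1": "MK-", "sp2": "MKL"},
    "HOG2": {"sp1": "AA"},
}
supermatrix, partitions = build_supermatrix(alignments)
```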

Recommendations

Use --summary for a quick overview of your pangenome composition.
Use --random-hog-matrix 1000 to visualize HOG presence patterns across species.
Use --proteome to extract FASTA sequences of shared pangenes.
Use --saturation-cladepair for insight into core/pan genome expansion across defined clades.
Use --genevar with --zscore to highlight population-scale gene family expansions or contractions.
Use --kaks with --kaks-type core to identify genes under selection in the core genome.
Use --species-tree with a well-supported phylogeny for evolutionary context.
Use --supermatrix to generate input for phylogenomic tree inference.

Contact & Citation

This tool is currently in beta. For questions or contributions, please contact the developer. To cite PanHOG, please include the GitHub link in your reference.

Happy pangenomics with PanHOG!
