Peptide–MHC presentation, cross-reactivity, and motif tools — the applied peptide–MHC layer on top
of the seqtree fuzzy-search substrate. mhcmatch
productionizes the reference seqtree.pmhc methodology (anchor-masked TCR-facing homology,
presentation-aware E-values, allele guessing) and adds a pseudosequence-based cross-allele
diffusion model that rescues rare alleles by borrowing from groove-similar frequent ones.
The mathematical/statistical theory is in appendix/mhcmatch.tex; the
development plan is in ROADMAP.md.
- MHC restriction & presentation — rank presenting alleles for a peptide (single / set / all, human & mouse), flag non-binders, and scan a whole protein for presented peptides.
- Large-scale similarity search — find similar peptides across big sets / proteomes, either by same-MHC binding (presentation signature) or similar TCR recognition (anchor-masked, TCR-facing); neoantigen molecular mimicry with per-allele E-values.
- Anchor / TCR-facing split — decompose a peptide into anchor and TCR-facing parts (
Xmasks). - Near-exact source lookup — find the self peptide a neoantigen derives from + its parent protein / mutated position, against a reference proteome.
- Motif logos — per-allele information-content logos with length distributions.
- Pseudosequence diffusion — allele similarity, clustering, and kernel-shrinkage pooling over 34-mer groove pseudosequences (rare-allele rescue).
bash setup.sh # repo-local .venv + editable install (uses sibling ../seqtree if present)
bash setup.sh --tests # + pytest
bash setup.sh --logo # + logomaker/matplotlib for rendering logosimport mhcmatch
# build from the isalgo/pmhc_data table (full or shortlist tier)
store = mhcmatch.Store.from_pmhc("pmhc_full.tsv.gz", species="human")
store.restriction("NLVPMVATV") # ranked presenting alleles + binder flags
store.is_binder("NLVPMVATV", "HLA-A*02:01")
store.scan_protein(my_protein, cls="mhc1") # presented peptides in a protein
store.decompose("NLVPMVATV", cls="mhc1") # (tcr_facing, presentation) with X masks
# similarity at scale
mhcmatch.search.search("NLVPMVATV", big_peptide_set, mode="tcr") # TCR-facing homologs
mhcmatch.search.find_mimics("EAAGIGILTV", self_set, bacterial_sets={...})
# near-exact source of a neoantigen
pm = mhcmatch.Proteome.from_fasta("UP000005640_9606.fasta.gz")
pm.find_source("NLVPMVATV", max_subs=1)
# pseudosequence allele similarity + rare-allele diffusion
ps = mhcmatch.Pseudoseq("mhc1")
ps.neighbors("HLA-A*02:01", candidates=store.alleles("mhc1"))
# diffusion-powered forward scorer (rescues rare alleles by borrowing from groove-neighbours)
am = store.anchor_model("mhc1") # learned anchor weights + bounded-prior shrinkage
am.score("NLVPMVATV", "HLA-A*02:01") # anchor log-odds; am.score(..., raw=True) disables borrowing
mhcmatch.logo.motif(store, "HLA-A*02:01", "mhc1")mhcmatch decompose NLVPMVATV # anchor / TCR-facing split (no data)
set -x MHCMATCH_PMHC /path/to/pmhc_data # or pass --pmhc to each command
mhcmatch restriction NLVPMVATV --allele 'A*02:01' --diffuse # allele name auto-resolved; rare-aware
mhcmatch scan my_protein.fasta --correction bh # presented windows, BH-FDR controlled
mhcmatch source MKTAYIAKW --proteome UP000005640_9606.fasta.gz
mhcmatch logo 'HLA-A*02:01'- Reference ligands:
isalgo/pmhc_data(full / shortlist tiers) — pass the path toStore.from_pmhcor setMHCMATCH_PMHC. - Pseudosequences: 34-mer groove pseudosequences vendored in
src/mhcmatch/data/(see itsPROVENANCE.md). - Reference proteomes: not bundled — supply a UniProt reference proteome FASTA
(UP000005640 human / UP000000589 mouse) to
Proteome.from_fasta.
Alpha (v0). See ROADMAP.md for phased plans (tuned thresholds, learned anchor
weights, future stability/affinity/cleavage/immunogenicity predictors, and the NetMHCpan /
MixMHCpred benchmark).