A structure-aware, machine learning-driven bioinformatics pipeline for predicting antibody-accessible epitopes on the Porcine Circovirus Type 2 (PCV2) capsid protein (ORF2).
https://pcv2.epitope.aiconceptlimited.com.ng/
This project implements a research-grade computational framework integrating:
- Evolutionary sequence analysis
- Structural biology (PDB-based features)
- Physicochemical characterization
- Machine learning (XGBoost)
to identify potential B-cell epitopes on the PCV2 capsid protein.
- Epitope discovery
- Vaccine target identification
- Viral antigen characterization
- Immunoinformatics research
Epitope prediction is treated as a multi-modal biological inference problem:
Sequence → Evolution → Structure → Features → ML → Prediction → Validation
| Component | Source |
|---|---|
| Protein sequences | NCBI (Entrez API) |
| Reference sequence | UniProt |
| Protein structures | PDB (3R0R, 6EZG) |
| Epitope validation | IEDB |
NCBI Retrieval
↓
Sequence Cleaning (capsid-only filtering)
↓
Multiple Sequence Alignment (MAFFT)
↓
Feature Engineering
- Conservation
- Entropy
- SASA
- Residue Depth
- Electrostatics
↓
Feature Matrix Construction
↓
Epitope Labeling (IEDB)
↓
Machine Learning (XGBoost)
↓
Prediction
↓
3D + Sequence Visualization (Streamlit)
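The conservation and entropy features named in the Feature Engineering step can be illustrated with a minimal sketch. It assumes the MAFFT alignment has already been loaded as equal-length strings, that gaps are written as `-` and ignored, and that conservation is the frequency of the most common residue in a column. The function names and the toy alignment are hypothetical and are not taken from the project's scripts.

```python
import math
from collections import Counter

GAP = "-"  # assumption: gaps are encoded as '-' and excluded from the counts


def column_stats(column):
    """Return (conservation, shannon_entropy) for one alignment column."""
    residues = [aa for aa in column if aa != GAP]
    if not residues:
        return 0.0, 0.0
    total = len(residues)
    probs = [n / total for n in Counter(residues).values()]
    conservation = max(probs)  # frequency of the most common residue
    entropy = sum(-p * math.log2(p) for p in probs)  # in bits
    return conservation, entropy


def alignment_features(aligned_seqs):
    """Compute per-column (conservation, entropy) for an aligned set of sequences."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    return [column_stats([seq[i] for seq in aligned_seqs]) for i in range(length)]


# Toy alignment for illustration only; these are not real PCV2 sequences.
toy_alignment = ["MTYPR", "MTYPR", "MSYPK", "MT-PR"]
features = alignment_features(toy_alignment)

for position, (cons, ent) in enumerate(features, start=1):
    print(f"column {position}: conservation={cons:.2f}, entropy={ent:.2f}")
```

Fully conserved columns give a conservation of 1.0 and an entropy of 0, while variable columns push entropy up, which is why the two features carry related but not identical information.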
- Conservation score (frequency-based)
- Shannon entropy (sequence variability)
- Solvent Accessible Surface Area (SASA)
- Residue depth
- Secondary structure (loop/helix/sheet)
- Electrostatics
- Hydrophobicity
- Charge distribution
- Sliding window (±2 residues)
- Spatial neighborhood aggregation
- Model: XGBoost Classifier
- Input: Residue-level feature matrix
- Output: Probability of epitope per residue
- Imbalanced dataset handling
- Threshold tuning (default: 0.25)
- Feature importance extraction
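How the imbalance handling and the 0.25 threshold might fit together is sketched below in plain Python. The README does not state which imbalance strategy is used; the sketch assumes the common heuristic of weighting the positive class by the negative-to-positive ratio, which is the value usually passed to XGBoost's `scale_pos_weight` parameter. The helper names, the inclusive `>=` comparison, and the toy labels and probabilities are all hypothetical.

```python
def positive_class_weight(labels):
    """Ratio of negatives to positives, a common choice for scale_pos_weight."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive (epitope) residues in the labels")
    return negatives / positives


def call_epitopes(probabilities, threshold=0.25):
    """Return 1-based residue positions whose probability meets the threshold."""
    return [pos for pos, p in enumerate(probabilities, start=1) if p >= threshold]


# Toy data for illustration only; not output from the actual model.
toy_labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
toy_probs = [0.05, 0.30, 0.80, 0.10, 0.24, 0.25, 0.60, 0.02, 0.40, 0.01]

weight = positive_class_weight(toy_labels)
predicted = call_epitopes(toy_probs)

print("positive class weight:", weight)
print("predicted epitope positions:", predicted)
```

A threshold below 0.5 trades precision for recall, which is a reasonable choice when positive labels are scarce, but the right value depends on the validation data.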
| Metric | Value |
|---|---|
| Total residues | ~162-245 |
| Predicted epitopes | ~24 |
| Validated (IEDB overlap) | ~4 |
| ROC-AUC | ~0.70-0.75 |
- Predictions compared with IEDB experimental epitopes
- Overlap analysis performed at residue level
- Overlapping residues → validated epitopes
- Non-overlapping residues → novel candidate epitopes
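The residue-level overlap analysis reduces to set operations, as the following minimal sketch shows. It assumes IEDB epitopes are available as inclusive `(start, end)` ranges in the same numbering as the predictions. The function names and the positions are hypothetical and do not represent real PCV2 epitopes.

```python
def expand_ranges(ranges):
    """Expand inclusive (start, end) ranges into a set of residue positions."""
    positions = set()
    for start, end in ranges:
        positions.update(range(start, end + 1))
    return positions


def overlap_report(predicted_positions, iedb_ranges):
    """Split predicted residues into IEDB-validated and novel candidates."""
    predicted = set(predicted_positions)
    known = expand_ranges(iedb_ranges)
    return {
        "validated": sorted(predicted & known),  # predicted and in IEDB
        "novel": sorted(predicted - known),      # predicted, not in IEDB
    }


# Toy positions for illustration only; not real PCV2 epitope coordinates.
toy_predicted = [58, 59, 60, 61, 130, 131, 190]
toy_iedb_ranges = [(55, 59), (128, 130)]

report = overlap_report(toy_predicted, toy_iedb_ranges)
print("validated:", report["validated"])
print("novel candidates:", report["novel"])
```

The comparison is only meaningful if both sides use the same residue numbering, so any offset between the reference sequence and the IEDB records has to be resolved first.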
Predicted epitopes are enriched in:
- Surface-exposed regions (high SASA)
- Loop/coil structures
- High-entropy (variable) regions
This aligns with known principles of antibody binding.
pcv2_epitope_project/
│
├── data/                   # Metadata, mappings, IEDB data
├── sequences/              # FASTA + alignments
├── structures/             # PDB files (3R0R, 6EZG)
├── features/               # Engineered features
├── results/                # Predictions + evaluation
├── models/                 # Trained ML model
├── scripts/                # Feature + analysis scripts
├── pipeline/               # Automation scripts
│
├── dashboard.py            # Streamlit interface
└── run_smart_pipeline.py   # Full pipeline runner
git clone https://github.com/YOUR_USERNAME/pcv2-epitope-platform.git
cd pcv2-epitope-platform
python -m venv pcv2_env
source pcv2_env/bin/activate
pip install -r requirements.txt

python run_smart_pipeline.py

streamlit run dashboard.py

- Epitope probability plots
- Sequence visualization (UniProt-aligned)
- 3D structure mapping (Py3Dmol)
- IEDB validation overlay
- Epitope clustering
- Limited experimentally validated epitopes (class imbalance)
- Predictions are computational (require lab validation)
- Sequence-to-structure mapping introduces approximation
- Graph Neural Networks (GNN)
- Transformer-based protein models
- Improved structural alignment
- REST API deployment
- Continuous data updates (automated pipeline)
Open to collaborations in:
- Bioinformatics
- Immunoinformatics
- Vaccine design
- Structural biology
This system provides computational predictions and should not replace experimental validation.
Abubakar Bioinformatics & Computational Biology
- NCBI (sequence data)
- RCSB PDB (structural data)
- IEDB (epitope data)
- Biopython, XGBoost, Streamlit communities