Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites
ArChIPelago is a computational framework that combines multiple position weight matrices (PWMs) into a joint model using classic machine learning techniques — from linear regression to ensembles of decision trees — to improve prediction of transcription factor binding sites in genomic sequences.
Using 704 ChIP-Seq datasets spanning 36 orthologous human-mouse transcription factor pairs and 1558 mono- and dinucleotide PWMs from the HOCOMOCO v11 motif collection, we show that ArChIPelago consistently outperforms the best available individual PWMs as well as sparse local inhomogeneous mixture (Slim) models. Models trained on human data generalize to mouse orthologs, demonstrating cross-species applicability.
This repository contains the full analysis pipeline, pre-trained models, and code to reproduce all results from:
Kravchenko P., Vorontsov I.E., Makeev V.J., Kulakovskiy I.V., and Penzar D.D. (2026). Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites.
Want to scan your own sequences? See the companion tool ArChIPelago-TFBS-finder for a ready-to-use CLI that applies pre-trained models to any FASTA file.
|
Pipeline at a glance:
|
Key results:
|
ArChIPelago/
│
├── ArChIPelago_code/ # Main analysis pipeline
│ ├── config.yml # ← EDIT THIS: set paths to your data
│ │
│ ├── archipielago/ # Python package (used by all notebooks)
│ │ ├── config.py # Config loading & path validation
│ │ ├── io.py # FASTA/BED I/O, train/test splitting
│ │ ├── scanning.py # SPRY-SARUS wrapper, feature matrix construction
│ │ ├── training.py # RF training, evaluation, cross-validation
│ │ └── scoring.py # Scorer classes (sklearn + PRROC)
│ │
│ ├── tests/ # Unit tests (pytest)
│ │ ├── conftest.py # Shared fixtures
│ │ ├── test_config.py
│ │ ├── test_io.py
│ │ ├── test_scanning.py
│ │ ├── test_training.py
│ │ └── test_scoring.py
│ │
│ ├── 5_CTCF_demo_pipeline.ipynb # ← START HERE: end-to-end CTCF demo
│ ├── 0_Data_preparation and_test_train_split.ipynb
│ ├── 1_Scanning_with_CHIPMUNK_feature_generation_MONO_DI.ipynb
│ ├── 2_ArChIPelago_and_Slim_training.ipynb
│ ├── 3_Biasaway_QC.ipynb
│ ├── 4_Plot_generation_and_analysis.ipynb
│ ├── run_simulation.py # End-to-end pipeline verification script
│ ├── scorer_module.py # Backward-compat shim → archipielago.scoring
│ └── environment_rpy_2.yml # Conda environment specification
│
├── Figures/ # Publication figure R scripts
│ ├── Figure 2 and S1/ # Model comparison boxplots (Fig. 2, S1)
│ ├── Figure 3 and S3/ # PWM vs ensemble scatter + delta (Fig. 3, S3)
│ └── Figure 4/ # Slim/diChIPMunk comparison (Fig. 4, S4)
│
├── ArChIPelago-TFBS-finder/ # Standalone scanning tool (git submodule)
├── sarus/ # SPRY-SARUS PWM scanner (git submodule)
├── seqtk/ # seqtk sequence toolkit (git submodule)
│
├── Slim/ # Slim/diChIPMunk model tools (Java)
│ ├── SlimDimont.jar # diChIPMunk: de novo diPWM discovery
│ ├── TrainAndApplySlim.jar # Slim: sparse local inhomogeneous mixture models
│ ├── ytilib/ # Ruby utilities for diChIPMunk
│ └── jdk8u232-b09/ # Bundled OpenJDK 8 (required by Slim jars)
│
├── Table_1.xlsx # ChIP-seq experiment identifiers (Sup. Table 1)
├── Table_2.xlsx # TF metadata: experiments, peaks, GC% (Sup. Table 2)
└── Table_3.xlsx # ArChIPelago performance metrics (Sup. Table 3)
Run the full pipeline on CTCF without any external data preprocessing:
# 1. Clone
git clone --recurse-submodules https://github.com/autosome-ru/ArChIPelago.git
cd ArChIPelago
# 2. Set up environment
conda env create -f ArChIPelago_code/environment_rpy_2.yml
conda activate ArChIPelago
# 3. Download CTCF data from Zenodo (DOI: 10.5281/zenodo.14927304)
# Extract train/, test/, PWMs_mono_HUMAN/, PWMs_di_HUMAN/
# Edit ArChIPelago_code/config.yml with the paths
# 4. Run the demo
cd ArChIPelago_code
jupyter lab 5_CTCF_demo_pipeline.ipynbThe demo notebook walks through every step:
| Step | What it does |
|---|---|
| Load sequences | Positive (ChIP-seq peaks) and negative (GC-matched background) FASTA |
| PWM scanning | SPRY-SARUS scan with HOCOMOCO v11 mono + di PWMs |
| Feature matrix | Build and select top-1000 features by RF importance |
| Train model | RandomForestClassifier(n_estimators=100, max_depth=6) |
| Evaluate | auROC, auPRC on held-out test chromosomes |
| Predict | Per-sequence binding probabilities |
| Visualize | Publication-quality ROC and PR curves |
- Conda (Miniconda or Anaconda)
- Java 8+ (for SPRY-SARUS PWM scanning; OpenJDK 8 is bundled in
Slim/jdk8u232-b09/) - Git with submodule support
# Clone with all submodules (sarus, seqtk, ArChIPelago-TFBS-finder)
git clone --recurse-submodules https://github.com/autosome-ru/ArChIPelago.git
cd ArChIPelago
# Create and activate the conda environment
conda env create -f ArChIPelago_code/environment_rpy_2.yml
conda activate ArChIPelago
# Verify Java
java -version # should print 1.8 or higher
# or use the bundled JDK:
Slim/jdk8u232-b09/bin/java -version
# Edit configuration
nano ArChIPelago_code/config.ymlNote on R: The environment includes R 4.3.1 and rpy2 3.5.11 for PRROC-based auROC/auPRC computation (Grau et al. 2015), as used in the publication. If R is not needed, the sklearn-based scorers in
archipielago/scoring.pyprovide equivalent functionality.
Edit ArChIPelago_code/config.yml to set paths for your system:
paths:
genome_human: "/path/to/hg38.fa" # UCSC hg38 reference genome
genome_mouse: "/path/to/mm10.fa" # UCSC mm10 reference genome
repeatmasker: "/path/to/track_out.bed" # RepeatMasker BED (mouse only)
macs_peaks: "/path/to/macs/" # GTRD MACS peak interval files
output_dir: "Release/TF-ML" # Output directory
tools:
sarus_jar: "../sarus/releases/sarus-2.2.3.jar"
java_bin: "java"
slim_apply_jar: "../Slim/TrainAndApplySlim.jar" # Slim model training
slim_java_bin: "../Slim/jdk8u232-b09/bin/java" # JDK 8 for Slim jars| Archive | Contents | Required for |
|---|---|---|
train.tar.gz |
Training FASTA sequences (36 TFs) | Notebooks 2, 5 |
test.tar.gz |
Test FASTA sequences (36 TFs) | Notebooks 2, 5 |
PWMs_mono_HUMAN.tar.gz |
Mononucleotide PWMs per TF | Notebooks 1, 2, 5 |
PWMs_di_HUMAN.tar.gz |
Dinucleotide PWMs per TF | Notebooks 1, 2, 5 |
hocomoco11.tar.gz |
HOCOMOCO v11 benchmark data | Notebook 1 |
Models.tar.gz |
Pre-trained RF models (sklearn 1.3) | Inference only |
Slim.tar.gz |
Slim jar files and bundled JDK 8 | Notebook 2 (Slim training) |
Slim_models.tar.gz |
Pre-trained Slim models | Notebook 2 (comparison) |
macs.tar.gz |
GTRD MACS peak intervals | Notebook 0 |
Archipelago_intermediate_files.tar.gz |
Pre-computed feature matrices | Skip notebooks 0–1 |
# Human hg38 (~3 GB)
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz && gunzip hg38.fa.gz
# Mouse mm10 (~2.7 GB)
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz && gunzip mm10.fa.gzDownload track_out.bed from the UCSC Table Browser (group: Repeats, track: RepeatMasker) and set paths.repeatmasker in config.yml.
Run notebooks sequentially. Each reads config.yml for external paths.
| Step | Notebook | Input | Output |
|---|---|---|---|
| 0 | Data preparation | GTRD BED + hg38/mm10 FASTA | Train/test FASTA and BED files |
| 1 | PWM scanning | FASTA + HOCOMOCO PWMs | Per-TF SPRY-SARUS feature matrices |
| 2 | Model training | Feature matrices | Trained RF/Slim models, auROC/auPRC scores |
| 3 | BiasAway QC | Negative sequences | GC-content quality control plots |
| 4 | Figures | Model scores | Publication figures |
Data preparation (Notebook 0): ChIP-Seq peaks from GTRD were called with MACS (Zhang et al. 2008). Peak lists were filtered by tags >= 10 and sorted by -log10(P value). Putative binding regions of 301 bp [-150;+150] were extracted centred at the peak summit. Repeat-overlapping peaks were removed using RepeatMasker (Smit et al. 2013-2015) via pybedtools subtract(f=0.7, N=True).
Train/test split: Training uses chromosomes 2–7, 9–20; testing uses chromosomes 1, 8, 21 (human) and chromosomes 1, 8, 19 (mouse), following Zhou and Troyanskaya (2015). Up to 10,000 positives per TF; negatives (1:100 class balance) sampled from peaks of unrelated TF families (TFClass; Wingender et al. 2018), GC-matched using BiasAway (Khan et al. 2021). Sex chromosomes were excluded.
PWM scanning: Genomic regions were scanned with SPRY-SARUS (Kulakovskiy et al. 2016) using --skipn --show-non-matching --output-scoring-mode score besthit. Log-odds best-hit scores serve as features.
Model parameters (Random Forest): max_depth=6, max_samples=0.8, n_estimators=100 (selected via GridSearchCV). Features were scale-transformed with sklearn.preprocessing.StandardScaler.
Slim models (Notebook 2): ArChIPelago was benchmarked against sparse local inhomogeneous mixture (Slim) models (Grau et al. 2013), trained side-by-side from extended 1001 bp genomic regions around the same peak summits. Slim models were trained using TrainAndApplySlim.jar with the bundled JDK 8 (Slim/jdk8u232-b09/bin/java). Three Slim model orders were compared: markov_order=0 (equivalent to monoPWM), markov_order=1 (equivalent to diPWM), and LSlim with markov_order=-5 (limited Slim; Keilwagen and Grau 2015). Peak signal annotations were derived from the -10*log10(pvalue) MACS output field. Slim predictions used max_score for performance assessment. Additionally, diPWMs were constructed de novo from positive sequences using diChIPMunk (run_dichiphorde8.rb). The RF model trained on all available PWMs was then augmented with Slim and diChIPMunk features to test whether combining diverse model types improves prediction beyond PWM ensembles alone (see Fig. 4).
The Slim Python wrapper is available at: https://github.com/autosome-ru/ArChIPelago
Supported TFs (36): ANDR, AP2A, CEBPB, COE1, CTCF, E2F4, ERG, ESR1, FLI1, GATA1, GATA2, GATA3, GCR, HNF4A, IRF1, IRF4, JUND, MAFK, MAX, MYC, P53, PPARG, PRGR, REST, RUNX1, RXRA, SOX2, SPI1, SRF, STA5A, STAT1, STAT3, TAL1, TF65, TFE2, USF2.
Compute requirements: Full training across all 36 TFs was performed on 100 AMD EPYC 7662 cores (~8 hours). For a single TF on a laptop, use
n_jobs=4and expect ~30 min.
All reusable pipeline functions are consolidated in ArChIPelago_code/archipielago/:
# Configuration
from archipielago.config import load_config, get_path
# I/O
from archipielago.io import fasta_iter, load_fasta, save_fasta, make_train_test_beds
# PWM scanning and feature construction
from archipielago.scanning import run_sarus, build_feature_matrix, select_top_features
# Model training and evaluation
from archipielago.training import train_rf, evaluate_model, cross_validate_model
# Scorer classes (sklearn and R/PRROC)
from archipielago.scoring import SklearnROCAUC, SklearnPRAUC, PRROC_PRAUC, PRROC_ROCAUCfrom archipielago import io, scanning, training
from archipielago.config import load_config
cfg = load_config()
# Load sequences
train_pos = io.load_fasta("train/CTCF_HUMAN.PEAKS000001.macs.train.mfa")
# Scan with SPRY-SARUS
scanning.run_sarus("train.fasta", "motif.pwm", cfg['tools']['sarus_jar'],
"scores.txt", pwm_type="mono")
# Build features and train
X = scanning.build_feature_matrix("scan_dir/", mode="mono_di")
model = training.train_rf(X, labels, n_estimators=100, random_state=42)
# Evaluate
results = training.evaluate_model(model, X_test, y_test)
print(f"auROC: {results['roc_auc']:.4f}, auPRC: {results['pr_auc']:.4f}")conda activate ArChIPelago
cd ArChIPelago_code
pytest tests/ -vThe test suite covers all package modules (59 tests). Tests run without Zenodo data, external tools, or reference genomes. PRROC scorer tests are skipped automatically when R/rpy2 is unavailable.
R scripts for all publication figures are in Figures/. Generated PDFs and images are not tracked in git — run the scripts to regenerate them.
| Figure | Script | Description |
|---|---|---|
| Fig. 2 | Figure 2 and S1/Figure_2_21012026_H_H.R |
Model comparison boxplots (human test set) |
| Fig. S1 | Figure 2 and S1/Figure_2_210120267_H_M.R |
Model comparison boxplots (mouse validation) |
| Fig. 3 | Figure 3 and S3/Figure_3_H_H_20012026.R |
PWM vs RF ensemble scatter + improvement (human) |
| Fig. S3 | Figure 3 and S3/Figure_S3_H_M_20012026.R |
PWM vs RF ensemble scatter + improvement (mouse) |
| Fig. 4 | Figure 4/Figure_4_H_H_7012026.R |
Slim/diChIPMunk comparison, delta (human) |
| Fig. S4 | Figure 4/Figure_S4_H_M_7012026.R |
Slim/diChIPMunk comparison, delta (mouse) |
| Fig. 4 (abs) | Figure 4/Figure_4_H_H_7012026_abs_values.R |
Absolute values variant |
| Fig. S4 (abs) | Figure 4/Figure_S4_H_M_7012026_abs_values.R |
Absolute values variant |
The R scripts read from HUMAN_MOUSE_total_100k.csv (generated by Notebook 4). All mono+di comparisons use the mononucleotide single-PWM baseline, consistent with the analysis in the manuscript.
Developed and tested on:
- Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-113-generic x86_64)
- Python 3.8.18, scikit-learn 1.3.2, numpy 1.24.4, pandas 2.0.3
- R 4.3.1 with PRROC, ggplot2, patchwork
- Java 8 (OpenJDK 8u232-b09, bundled)
ArChIPelago is distributed under WTFPL. If you prefer a standard license, treat it as CC-BY.
Kravchenko P., Vorontsov I.E., Makeev V.J., Kulakovskiy I.V., and Penzar D.D. (2026). Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites.
Zenodo data archive: 10.5281/zenodo.14927304
