ArChIPelago

Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites

ArChIPelago is a computational framework that combines multiple position weight matrices (PWMs) into a joint model using classic machine learning techniques — from linear regression to ensembles of decision trees — to improve prediction of transcription factor binding sites in genomic sequences.

Using 704 ChIP-Seq datasets spanning 36 orthologous human-mouse transcription factor pairs and 1558 mono- and dinucleotide PWMs from the HOCOMOCO v11 motif collection, we show that ArChIPelago consistently outperforms the best available individual PWMs as well as sparse local inhomogeneous mixture (Slim) models. Models trained on human data generalize to mouse orthologs, demonstrating cross-species applicability.

This repository contains the full analysis pipeline, pre-trained models, and code to reproduce all results from:

Kravchenko P., Vorontsov I.E., Makeev V.J., Kulakovskiy I.V., and Penzar D.D. (2026). Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites.

Want to scan your own sequences? See the companion tool ArChIPelago-TFBS-finder for a ready-to-use CLI that applies pre-trained models to any FASTA file.

Overview

Pipeline at a glance:

Extract ChIP-seq peaks from GTRD, split by chromosome into train/test sets
Scan 301 bp peak regions with HOCOMOCO v11 PWMs (mono- and dinucleotide) using SPRY-SARUS
Build feature matrices from log-odds best-hit PWM scores
Train ensemble classifiers (Random Forest, XGBoost, Logistic Regression, Bagging)
Evaluate with auROC and auPRC using the PRROC R package
Compare against single-PWM baselines, Slim models, and diChIPMunk

Key results:

704 ChIP-Seq datasets, 36 human-mouse TF pairs
Median 17 mono + 17 di PWMs per TF (up to 198+99)
Median auROC 0.891 (vs 0.869 single best PWM)
Median auPRC improvement +0.043 over best PWM
Cross-species validation on mouse orthologs
Outperforms single best PWM for 34 of 36 TFs

Repository Structure

ArChIPelago/
│
├── ArChIPelago_code/                # Main analysis pipeline
│   ├── config.yml                   # ← EDIT THIS: set paths to your data
│   │
│   ├── archipielago/                # Python package (used by all notebooks)
│   │   ├── config.py                # Config loading & path validation
│   │   ├── io.py                    # FASTA/BED I/O, train/test splitting
│   │   ├── scanning.py              # SPRY-SARUS wrapper, feature matrix construction
│   │   ├── training.py              # RF training, evaluation, cross-validation
│   │   └── scoring.py               # Scorer classes (sklearn + PRROC)
│   │
│   ├── tests/                       # Unit tests (pytest)
│   │   ├── conftest.py              # Shared fixtures
│   │   ├── test_config.py
│   │   ├── test_io.py
│   │   ├── test_scanning.py
│   │   ├── test_training.py
│   │   └── test_scoring.py
│   │
│   ├── 5_CTCF_demo_pipeline.ipynb   # ← START HERE: end-to-end CTCF demo
│   ├── 0_Data_preparation and_test_train_split.ipynb
│   ├── 1_Scanning_with_CHIPMUNK_feature_generation_MONO_DI.ipynb
│   ├── 2_ArChIPelago_and_Slim_training.ipynb
│   ├── 3_Biasaway_QC.ipynb
│   ├── 4_Plot_generation_and_analysis.ipynb
│   ├── run_simulation.py            # End-to-end pipeline verification script
│   ├── scorer_module.py             # Backward-compat shim → archipielago.scoring
│   └── environment_rpy_2.yml        # Conda environment specification
│
├── Figures/                         # Publication figure R scripts
│   ├── Figure 2 and S1/             # Model comparison boxplots (Fig. 2, S1)
│   ├── Figure 3 and S3/             # PWM vs ensemble scatter + delta (Fig. 3, S3)
│   └── Figure 4/                    # Slim/diChIPMunk comparison (Fig. 4, S4)
│
├── ArChIPelago-TFBS-finder/         # Standalone scanning tool (git submodule)
├── sarus/                           # SPRY-SARUS PWM scanner (git submodule)
├── seqtk/                           # seqtk sequence toolkit (git submodule)
│
├── Slim/                            # Slim/diChIPMunk model tools (Java)
│   ├── SlimDimont.jar               # diChIPMunk: de novo diPWM discovery
│   ├── TrainAndApplySlim.jar        # Slim: sparse local inhomogeneous mixture models
│   ├── ytilib/                      # Ruby utilities for diChIPMunk
│   └── jdk8u232-b09/                # Bundled OpenJDK 8 (required by Slim jars)
│
├── Table_1.xlsx                     # ChIP-seq experiment identifiers (Sup. Table 1)
├── Table_2.xlsx                     # TF metadata: experiments, peaks, GC% (Sup. Table 2)
└── Table_3.xlsx                     # ArChIPelago performance metrics (Sup. Table 3)

Quick Start — CTCF Demo

Run the full pipeline on CTCF without any external data preprocessing:

# 1. Clone
git clone --recurse-submodules https://github.com/autosome-ru/ArChIPelago.git
cd ArChIPelago

# 2. Set up environment
conda env create -f ArChIPelago_code/environment_rpy_2.yml
conda activate ArChIPelago

# 3. Download CTCF data from Zenodo (DOI: 10.5281/zenodo.14927304)
#    Extract train/, test/, PWMs_mono_HUMAN/, PWMs_di_HUMAN/
#    Edit ArChIPelago_code/config.yml with the paths

# 4. Run the demo
cd ArChIPelago_code
jupyter lab 5_CTCF_demo_pipeline.ipynb

The demo notebook walks through every step:

Step	What it does
Load sequences	Positive (ChIP-seq peaks) and negative (GC-matched background) FASTA
PWM scanning	SPRY-SARUS scan with HOCOMOCO v11 mono + di PWMs
Feature matrix	Build and select top-1000 features by RF importance
Train model	`RandomForestClassifier(n_estimators=100, max_depth=6)`
Evaluate	auROC, auPRC on held-out test chromosomes
Predict	Per-sequence binding probabilities
Visualize	Publication-quality ROC and PR curves

Installation

Prerequisites

Conda (Miniconda or Anaconda)
Java 8+ (for SPRY-SARUS PWM scanning; OpenJDK 8 is bundled in Slim/jdk8u232-b09/)
Git with submodule support

Step-by-step

# Clone with all submodules (sarus, seqtk, ArChIPelago-TFBS-finder)
git clone --recurse-submodules https://github.com/autosome-ru/ArChIPelago.git
cd ArChIPelago

# Create and activate the conda environment
conda env create -f ArChIPelago_code/environment_rpy_2.yml
conda activate ArChIPelago

# Verify Java
java -version   # should print 1.8 or higher
# or use the bundled JDK:
Slim/jdk8u232-b09/bin/java -version

# Edit configuration
nano ArChIPelago_code/config.yml

Note on R: The environment includes R 4.3.1 and rpy2 3.5.11 for PRROC-based auROC/auPRC computation (Grau et al. 2015), as used in the publication. If R is not needed, the sklearn-based scorers in archipielago/scoring.py provide equivalent functionality.

Configuration

Edit ArChIPelago_code/config.yml to set paths for your system:

paths:
  genome_human: "/path/to/hg38.fa"       # UCSC hg38 reference genome
  genome_mouse: "/path/to/mm10.fa"       # UCSC mm10 reference genome
  repeatmasker: "/path/to/track_out.bed" # RepeatMasker BED (mouse only)
  macs_peaks:   "/path/to/macs/"         # GTRD MACS peak interval files
  output_dir:   "Release/TF-ML"         # Output directory

tools:
  sarus_jar:      "../sarus/releases/sarus-2.2.3.jar"
  java_bin:       "java"
  slim_apply_jar: "../Slim/TrainAndApplySlim.jar"   # Slim model training
  slim_java_bin:  "../Slim/jdk8u232-b09/bin/java"   # JDK 8 for Slim jars

Data Download

Zenodo archive (required)

DOI: 10.5281/zenodo.14927304

Archive	Contents	Required for
`train.tar.gz`	Training FASTA sequences (36 TFs)	Notebooks 2, 5
`test.tar.gz`	Test FASTA sequences (36 TFs)	Notebooks 2, 5
`PWMs_mono_HUMAN.tar.gz`	Mononucleotide PWMs per TF	Notebooks 1, 2, 5
`PWMs_di_HUMAN.tar.gz`	Dinucleotide PWMs per TF	Notebooks 1, 2, 5
`hocomoco11.tar.gz`	HOCOMOCO v11 benchmark data	Notebook 1
`Models.tar.gz`	Pre-trained RF models (sklearn 1.3)	Inference only
`Slim.tar.gz`	Slim jar files and bundled JDK 8	Notebook 2 (Slim training)
`Slim_models.tar.gz`	Pre-trained Slim models	Notebook 2 (comparison)
`macs.tar.gz`	GTRD MACS peak intervals	Notebook 0
`Archipelago_intermediate_files.tar.gz`	Pre-computed feature matrices	Skip notebooks 0–1

Reference genomes (for Notebook 0 only)

# Human hg38 (~3 GB)
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz && gunzip hg38.fa.gz

# Mouse mm10 (~2.7 GB)
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz && gunzip mm10.fa.gz

RepeatMasker (for Notebook 0 only)

Download track_out.bed from the UCSC Table Browser (group: Repeats, track: RepeatMasker) and set paths.repeatmasker in config.yml.

Full Reproduction — Notebooks 0–4

Run notebooks sequentially. Each reads config.yml for external paths.

Step	Notebook	Input	Output
0	Data preparation	GTRD BED + hg38/mm10 FASTA	Train/test FASTA and BED files
1	PWM scanning	FASTA + HOCOMOCO PWMs	Per-TF SPRY-SARUS feature matrices
2	Model training	Feature matrices	Trained RF/Slim models, auROC/auPRC scores
3	BiasAway QC	Negative sequences	GC-content quality control plots
4	Figures	Model scores	Publication figures

Data preparation (Notebook 0): ChIP-Seq peaks from GTRD were called with MACS (Zhang et al. 2008). Peak lists were filtered by tags >= 10 and sorted by -log10(P value). Putative binding regions of 301 bp [-150;+150] were extracted centred at the peak summit. Repeat-overlapping peaks were removed using RepeatMasker (Smit et al. 2013-2015) via pybedtools subtract(f=0.7, N=True).

Train/test split: Training uses chromosomes 2–7, 9–20; testing uses chromosomes 1, 8, 21 (human) and chromosomes 1, 8, 19 (mouse), following Zhou and Troyanskaya (2015). Up to 10,000 positives per TF; negatives (1:100 class balance) sampled from peaks of unrelated TF families (TFClass; Wingender et al. 2018), GC-matched using BiasAway (Khan et al. 2021). Sex chromosomes were excluded.

PWM scanning: Genomic regions were scanned with SPRY-SARUS (Kulakovskiy et al. 2016) using --skipn --show-non-matching --output-scoring-mode score besthit. Log-odds best-hit scores serve as features.

Model parameters (Random Forest): max_depth=6, max_samples=0.8, n_estimators=100 (selected via GridSearchCV). Features were scale-transformed with sklearn.preprocessing.StandardScaler.

Slim models (Notebook 2): ArChIPelago was benchmarked against sparse local inhomogeneous mixture (Slim) models (Grau et al. 2013), trained side-by-side from extended 1001 bp genomic regions around the same peak summits. Slim models were trained using TrainAndApplySlim.jar with the bundled JDK 8 (Slim/jdk8u232-b09/bin/java). Three Slim model orders were compared: markov_order=0 (equivalent to monoPWM), markov_order=1 (equivalent to diPWM), and LSlim with markov_order=-5 (limited Slim; Keilwagen and Grau 2015). Peak signal annotations were derived from the -10*log10(pvalue) MACS output field. Slim predictions used max_score for performance assessment. Additionally, diPWMs were constructed de novo from positive sequences using diChIPMunk (run_dichiphorde8.rb). The RF model trained on all available PWMs was then augmented with Slim and diChIPMunk features to test whether combining diverse model types improves prediction beyond PWM ensembles alone (see Fig. 4).

The Slim Python wrapper is available at: https://github.com/autosome-ru/ArChIPelago

Supported TFs (36): ANDR, AP2A, CEBPB, COE1, CTCF, E2F4, ERG, ESR1, FLI1, GATA1, GATA2, GATA3, GCR, HNF4A, IRF1, IRF4, JUND, MAFK, MAX, MYC, P53, PPARG, PRGR, REST, RUNX1, RXRA, SOX2, SPI1, SRF, STA5A, STAT1, STAT3, TAL1, TF65, TFE2, USF2.

Compute requirements: Full training across all 36 TFs was performed on 100 AMD EPYC 7662 cores (~8 hours). For a single TF on a laptop, use n_jobs=4 and expect ~30 min.

`archipielago` Python Package

All reusable pipeline functions are consolidated in ArChIPelago_code/archipielago/:

# Configuration
from archipielago.config import load_config, get_path

# I/O
from archipielago.io import fasta_iter, load_fasta, save_fasta, make_train_test_beds

# PWM scanning and feature construction
from archipielago.scanning import run_sarus, build_feature_matrix, select_top_features

# Model training and evaluation
from archipielago.training import train_rf, evaluate_model, cross_validate_model

# Scorer classes (sklearn and R/PRROC)
from archipielago.scoring import SklearnROCAUC, SklearnPRAUC, PRROC_PRAUC, PRROC_ROCAUC

Minimal example

from archipielago import io, scanning, training
from archipielago.config import load_config

cfg = load_config()

# Load sequences
train_pos = io.load_fasta("train/CTCF_HUMAN.PEAKS000001.macs.train.mfa")

# Scan with SPRY-SARUS
scanning.run_sarus("train.fasta", "motif.pwm", cfg['tools']['sarus_jar'],
                   "scores.txt", pwm_type="mono")

# Build features and train
X = scanning.build_feature_matrix("scan_dir/", mode="mono_di")
model = training.train_rf(X, labels, n_estimators=100, random_state=42)

# Evaluate
results = training.evaluate_model(model, X_test, y_test)
print(f"auROC: {results['roc_auc']:.4f}, auPRC: {results['pr_auc']:.4f}")

Running Tests

conda activate ArChIPelago
cd ArChIPelago_code
pytest tests/ -v

The test suite covers all package modules (59 tests). Tests run without Zenodo data, external tools, or reference genomes. PRROC scorer tests are skipped automatically when R/rpy2 is unavailable.

Publication Figures

R scripts for all publication figures are in Figures/. Generated PDFs and images are not tracked in git — run the scripts to regenerate them.

Figure	Script	Description
Fig. 2	`Figure 2 and S1/Figure_2_21012026_H_H.R`	Model comparison boxplots (human test set)
Fig. S1	`Figure 2 and S1/Figure_2_210120267_H_M.R`	Model comparison boxplots (mouse validation)
Fig. 3	`Figure 3 and S3/Figure_3_H_H_20012026.R`	PWM vs RF ensemble scatter + improvement (human)
Fig. S3	`Figure 3 and S3/Figure_S3_H_M_20012026.R`	PWM vs RF ensemble scatter + improvement (mouse)
Fig. 4	`Figure 4/Figure_4_H_H_7012026.R`	Slim/diChIPMunk comparison, delta (human)
Fig. S4	`Figure 4/Figure_S4_H_M_7012026.R`	Slim/diChIPMunk comparison, delta (mouse)
Fig. 4 (abs)	`Figure 4/Figure_4_H_H_7012026_abs_values.R`	Absolute values variant
Fig. S4 (abs)	`Figure 4/Figure_S4_H_M_7012026_abs_values.R`	Absolute values variant

The R scripts read from HUMAN_MOUSE_total_100k.csv (generated by Notebook 4). All mono+di comparisons use the mononucleotide single-PWM baseline, consistent with the analysis in the manuscript.

Platform

Developed and tested on:

Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-113-generic x86_64)
Python 3.8.18, scikit-learn 1.3.2, numpy 1.24.4, pandas 2.0.3
R 4.3.1 with PRROC, ggplot2, patchwork
Java 8 (OpenJDK 8u232-b09, bundled)

License

ArChIPelago is distributed under WTFPL. If you prefer a standard license, treat it as CC-BY.

Citation

Kravchenko P., Vorontsov I.E., Makeev V.J., Kulakovskiy I.V., and Penzar D.D. (2026). Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites.

Zenodo data archive: 10.5281/zenodo.14927304

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
ArChIPelago-TFBS-finder @ e4e4cc1		ArChIPelago-TFBS-finder @ e4e4cc1
ArChIPelago_code		ArChIPelago_code
Figures		Figures
Slim		Slim
sarus @ dd9310b		sarus @ dd9310b
seqtk @ c9458ba		seqtk @ c9458ba
.gitignore		.gitignore
.gitmodules		.gitmodules
Archipelago.png		Archipelago.png
Kravchenko_et_al_Supplementary_Materials.pdf		Kravchenko_et_al_Supplementary_Materials.pdf
README.md		README.md
Table_1.xlsx		Table_1.xlsx
Table_2.xlsx		Table_2.xlsx
Table_3.xlsx		Table_3.xlsx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ArChIPelago

Overview

Repository Structure

Quick Start — CTCF Demo

Installation

Prerequisites

Step-by-step

Configuration

Data Download

Zenodo archive (required)

Reference genomes (for Notebook 0 only)

RepeatMasker (for Notebook 0 only)

Full Reproduction — Notebooks 0–4

`archipielago` Python Package

Minimal example

Running Tests

Publication Figures

Platform

License

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

ArChIPelago

Overview

Repository Structure

Quick Start — CTCF Demo

Installation

Prerequisites

Step-by-step

Configuration

Data Download

Zenodo archive (required)

Reference genomes (for Notebook 0 only)

RepeatMasker (for Notebook 0 only)

Full Reproduction — Notebooks 0–4

archipielago Python Package

Minimal example

Running Tests

Publication Figures

Platform

License

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`archipielago` Python Package

Packages