Python utilities for cyto - format conversion and sample aggregation.
pycyto provides command-line tools for working with cyto outputs:
- convert: Transform MTX format to h5ad (AnnData)
- aggregate: Combine multi-probe cyto outputs into unified sample-level datasets
uv tool install pycyto
pycyto --help
Convert Matrix Market (MTX) format from cyto to h5ad for downstream analysis.
pycyto convert <mtx_directory> <output.h5ad>
Arguments:
- mtx_directory: Path to MTX directory (containing matrix.mtx, features.tsv, barcodes.tsv) or direct path to a .mtx file
- output.h5ad: Output h5ad file path
Options:
- --compress / --no-compress: Enable gzip compression in h5ad (default: enabled)
- --integer / --no-integer: Store counts as int32 instead of float32 (default: float32)
Examples:
# Convert MTX directory to h5ad
pycyto convert cyto_out/counts/BC001.counts.mtx sample.h5ad
# Convert without compression
pycyto convert cyto_out/counts/BC001.counts.mtx sample.h5ad --no-compress
# Store as integers for memory efficiency
pycyto convert cyto_out/counts/BC001.counts.mtx sample.h5ad --integer
Input structure (from cyto ibu count --format mtx):
mtx_directory/
├── matrix.mtx # Sparse count matrix (gene × cell)
├── features.tsv # Gene/feature names
└── barcodes.tsv # Cell barcodes
Output: AnnData object (cell × gene) in CSR sparse format, ready for scanpy/seurat workflows
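Before converting, it can help to confirm that the input directory holds the three files listed above. The following is a minimal sketch using only the Python standard library; `missing_mtx_files` is a hypothetical helper for illustration and is not part of pycyto.

```python
from pathlib import Path

# File names taken from the input structure described above.
REQUIRED_FILES = ("matrix.mtx", "features.tsv", "barcodes.tsv")


def missing_mtx_files(mtx_directory):
    """Return the required file names that are absent from mtx_directory."""
    root = Path(mtx_directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means all three files are present; anything else names the files to look for before running pycyto convert.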
Aggregate multi-probe cyto outputs across flex barcodes into unified sample-level datasets. Designed specifically for single-modal (GEX) and multi-modal Flex experiments (GEX + CRISPR).
pycyto aggregate <config.json> <cyto_outdir> <output_directory>
Arguments:
- config.json: JSON configuration specifying sample structure and barcode assignments
- cyto_outdir: Directory containing cyto workflow outputs
- output_directory: Where to write aggregated files
Options:
- --compress / --no-compress: Compress output h5ad files (default: no compression)
- --threads INT: Number of parallel sample processing threads (default: -1 for all cores)
- --verbose: Enable detailed logging
What it does:
- Concatenates data across multiple flex probe barcodes per sample
- Merges GEX and CRISPR modalities when both present
- Adds guide assignments to GEX cell metadata
- Filters CRISPR data to match filtered GEX cells (useful if using alternative guide-assignment algorithms)
- Preserves per-cell read/UMI statistics
Output structure (per sample):
output_directory/
└── sample_name/
├── sample_name_gex.h5ad # Gene expression data
├── sample_name_crispr.h5ad # Guide RNA counts (GEX-filtered)
├── sample_name_assignments.parquet # Guide assignments per cell
└── sample_name_reads.parquet # Read/UMI statistics per barcode
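The layout above can be turned into a small path builder, for example to check which files a run produced. This is a sketch based only on the tree shown here; `expected_outputs` is a hypothetical helper, and which of these files are written for single-modality samples is not stated in this document.

```python
from pathlib import Path

# Suffixes copied from the per-sample output tree above.
OUTPUT_SUFFIXES = ("gex.h5ad", "crispr.h5ad", "assignments.parquet", "reads.parquet")


def expected_outputs(output_directory, sample_name):
    """Build the per-sample output paths listed in the tree above."""
    sample_dir = Path(output_directory) / sample_name
    return [sample_dir / f"{sample_name}_{suffix}" for suffix in OUTPUT_SUFFIXES]
```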
Configuration files require:
- libraries: Named paths to feature files (probe lists, guide libraries)
- samples: Array of sample specifications
Each sample requires:
- experiment: Experiment identifier (must match cyto output directory names)
- sample: Unique sample name (used for output files)
- mode: Processing mode (gex, crispr, or gex+crispr)
- features: Which library/libraries to use (must match mode with +)
- barcodes: Flex probe barcode assignments
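The requirements above can be checked before running aggregation. The sketch below is not pycyto's own validation; `check_config` is a hypothetical helper, and it assumes that "must match mode" means features has the same number of `+`-separated parts as mode and that each part names a key in libraries.

```python
REQUIRED_TOP_LEVEL = ("libraries", "samples")
REQUIRED_SAMPLE_KEYS = ("experiment", "sample", "mode", "features", "barcodes")
VALID_MODES = ("gex", "crispr", "gex+crispr")


def check_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = [
        f"missing top-level key {key!r}"
        for key in REQUIRED_TOP_LEVEL
        if key not in config
    ]
    libraries = config.get("libraries", {})
    seen_names = set()
    for index, sample in enumerate(config.get("samples", [])):
        missing = [key for key in REQUIRED_SAMPLE_KEYS if key not in sample]
        if missing:
            problems.append(f"sample {index}: missing keys {missing}")
            continue
        # Sample names are used for output files, so they must be unique.
        if sample["sample"] in seen_names:
            problems.append(f"sample {index}: duplicate name {sample['sample']!r}")
        seen_names.add(sample["sample"])
        if sample["mode"] not in VALID_MODES:
            problems.append(f"sample {index}: unknown mode {sample['mode']!r}")
        # Assumption: one library per modality, joined with '+'.
        features = sample["features"].split("+")
        if len(features) != len(sample["mode"].split("+")):
            problems.append(f"sample {index}: features do not match mode")
        problems.extend(
            f"sample {index}: unknown library {name!r}"
            for name in features
            if name not in libraries
        )
    return problems
```

Load the file with `json.load` and pass the resulting dict; an empty list means none of these checks failed.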
The experiment and mode fields are matched against output directories with the following path structure:
cyto_output_directory/
└── [experiment_name]_[mode]_Lane*/
└── ...
Note: All Lanes for an Experiment+Mode will be concatenated. If you have differing barcode poolings based on Lane, you will need to adjust your Experiment name to reflect that.
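To see which lane directories a given experiment and mode would pick up, the naming pattern above can be applied to a directory listing. This is a sketch, not pycyto's lookup code: `lane_directories` is a hypothetical helper, and upper-casing the mode is an assumption based on the directory names used in the examples in this document (for example exp_GEX_Lane1).

```python
from fnmatch import fnmatchcase


def lane_directories(dir_names, experiment, mode):
    """Return the directory names matching [experiment]_[mode]_Lane*."""
    # Assumption: directory names carry the mode in upper case (GEX, CRISPR).
    pattern = f"{experiment}_{mode.upper()}_Lane*"
    return sorted(name for name in dir_names if fnmatchcase(name, pattern))
```

Every directory returned for one experiment and mode would be concatenated, which is why lanes with different barcode poolings need distinct experiment names.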
Single barcode:
"barcodes": "BC001"
Barcode range (expands to BC001, BC002, BC003):
"barcodes": "BC1..3"
Non-contiguous selection:
"barcodes": "BC001|BC003|BC005"
Combined range and selection:
"barcodes": "BC1..3|BC005|BC7..9"
Multi-modal pairing (GEX + CRISPR on same cells):
"mode": "gex+crispr",
"features": "GEX_PROBE_LIST+CRISPR_PROBE_LIST",
"barcodes": "BC1..8+CR1..8"
Pairs BC001+CR001, BC002+CR002, ..., BC008+CR008
Multiple independent combinations:
"barcodes": "BC001+CR001|BC002+CR002"
Processes BC001+CR001 as one pair, BC002+CR002 as another
{
"libraries": {
"GEX_PROBES": "./gex_probes.tsv"
},
"samples": [
{
"experiment": "exp1",
"mode": "gex",
"features": "GEX_PROBES",
"sample": "control",
"barcodes": "BC1..4"
},
{
"experiment": "exp1",
"mode": "gex",
"features": "GEX_PROBES",
"sample": "treatment",
"barcodes": "BC5..8"
}
]
}
{
"libraries": {
"GEX_PROBES": "./gex_probes.tsv",
"GUIDE_LIBRARY": "./guides.tsv"
},
"samples": [
{
"experiment": "perturbseq_screen",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDE_LIBRARY",
"sample": "screen_replicate1",
"barcodes": "BC1..8+CR1..8"
},
{
"experiment": "perturbseq_screen",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDE_LIBRARY",
"sample": "screen_replicate2",
"barcodes": "BC9..16+CR9..16"
}
]
}
{
"libraries": {
"GEX_PROBES": "./gex_probes.tsv",
"GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "timecourse_20250101",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDES",
"sample": "day0",
"barcodes": "BC1..4+CR1..4"
},
{
"experiment": "timecourse_20250101",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDES",
"sample": "day7",
"barcodes": "BC5..7+CR5..7"
},
{
"experiment": "timecourse_20250101",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDES",
"sample": "day14",
"barcodes": "BC8..12+CR8..12"
}
]
}
# 1. Run cyto workflow
cyto workflow gex -c probes.tsv -w whitelist.txt -o cyto_out sample.vbq
# 2. Convert probe barcode to h5ad
pycyto convert cyto_out/counts/BC001.counts.mtx sample_BC001.h5ad
# 1. Run cyto for GEX and CRISPR
cyto workflow gex -c gex_probes.tsv -w whitelist.txt -p probes.txt -o cyto_out/perturbseq_GEX_Lane1 sample.vbq
cyto workflow crispr -c guides.tsv -w whitelist.txt -p probes.txt -o cyto_out/perturbseq_CRISPR_Lane1 sample.vbq
# 2. Create aggregation config
cat > config.json << 'EOF'
{
"libraries": {
"GEX": "./gex_probes.tsv",
"GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "perturbseq",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "mysample",
"barcodes": "BC1..16+CR1..16"
}
]
}
EOF
# 3. Aggregate (merges GEX + CRISPR modalities)
pycyto aggregate config.json ./cyto_out ./aggr
# Output:
# ./aggr/mysample/
# ├── mysample_gex.h5ad # Gene expression with guide annotations
# ├── mysample_crispr.h5ad # Guide counts (filtered to GEX cells)
# ├── mysample_assignments.parquet # Guide assignments
# └── mysample_reads.parquet # QC statistics
# 1. Run cyto workflows
cyto workflow gex -c probes.tsv -w whitelist.txt -p probes.txt -o cyto_out/exp_GEX_Lane1 samples.vbq
cyto workflow crispr -c guides.tsv -w whitelist.txt -p probes.txt -o cyto_out/exp_CRISPR_Lane1 samples.vbq
# 2. Create config assigning barcodes to biological samples
cat > config.json << 'EOF'
{
"libraries": {
"GEX": "./gex_probes.tsv",
"GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "exp",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "control_rep1",
"barcodes": "BC1..2+CR1..2"
},
{
"experiment": "exp",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "control_rep2",
"barcodes": "BC3..4+CR3..4"
},
{
"experiment": "exp",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "treatment_rep1",
"barcodes": "BC5..6+CR5..6"
},
{
"experiment": "exp",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "treatment_rep2",
"barcodes": "BC7..8+CR7..8"
}
]
}
EOF
# 3. Aggregate all samples in parallel
pycyto aggregate config.json ./cyto_out ./aggr
For each sample, aggregate combines data across all specified flex barcodes:
GEX mode: Concatenates gene expression matrices
CRISPR mode: Concatenates guide count matrices
GEX + CRISPR mode:
- Merges guide assignments into GEX .obs metadata
- Filters CRISPR data to cells present in filtered GEX data
- Adds read/UMI statistics for both modalities
The aggregated GEX h5ad includes:
- experiment: Experiment identifier
- sample: Sample name
- flex_barcode: Original flex probe barcode (e.g., "BC001")
- lane_id: Sequencing lane identifier
- assignment: Assigned guide(s) from CRISPR data
- moi: Multiplicity of infection (number of guides per cell)
- umis: Guide UMI counts
- n_reads_gex: Total GEX reads per cell
- n_umis_gex: Total GEX UMIs per cell
- n_reads_crispr: Total CRISPR reads per cell
- n_umis_crispr: Total CRISPR UMIs per cell
When pairing GEX and CRISPR data:
- CRISPR barcodes (CR) are automatically converted to match GEX format (BC) for cell matching
- Cells are matched on: cell_barcode + flex_barcode + lane_id
- Only cells present in filtered GEX data are retained in CRISPR output
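The matching rules above can be illustrated on plain tuples. This is a sketch of the idea rather than pycyto's implementation: `match_key` and `filter_crispr_to_gex` are hypothetical names, cells are represented as (cell_barcode, flex_barcode, lane_id) tuples, and the CR to BC conversion is assumed to be a simple prefix swap that keeps the number.

```python
def match_key(cell_barcode, flex_barcode, lane_id):
    """Build the matching key, rewriting a CR flex barcode to its BC form."""
    # Assumption: CR007 corresponds to BC007 (prefix swap, same number).
    if flex_barcode.startswith("CR"):
        flex_barcode = "BC" + flex_barcode[2:]
    return (cell_barcode, flex_barcode, lane_id)


def filter_crispr_to_gex(gex_cells, crispr_cells):
    """Keep CRISPR cells whose key also appears among the filtered GEX cells."""
    gex_keys = {match_key(*cell) for cell in gex_cells}
    return [cell for cell in crispr_cells if match_key(*cell) in gex_keys]
```

Because the key includes flex_barcode and lane_id, the same cell barcode seen under a different probe barcode or lane is treated as a different cell.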
aggregate expects cyto outputs organized as:
cyto_outdir/
├── {experiment}_GEX_Lane*/
│ └── counts/
│ ├── BC001.h5ad
│ ├── BC002.h5ad
│ └── ...
└── {experiment}_CRISPR_Lane*/
├── counts/
│ ├── CR001.h5ad
│ └── ...
└── assignments/
├── CR001.assignments.tsv
└── ...
Where {experiment} matches the experiment field in your config.
Process only specific barcode combinations by modifying the config:
{
"samples": [
{
"experiment": "exp1",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "high_quality_subset",
"barcodes": "BC001+CR001|BC003+CR003|BC007+CR007"
}
]
}
Reference different probe/guide libraries:
{
"libraries": {
"TISSUE_PANEL": "./tissue_probes.tsv",
"IMMUNE_PANEL": "./immune_probes.tsv",
"SCREEN_GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "exp1",
"mode": "gex+crispr",
"features": "TISSUE_PANEL+SCREEN_GUIDES",
"sample": "tissue_sample",
"barcodes": "BC1..8+CR1..8"
},
{
"experiment": "exp1",
"mode": "gex+crispr",
"features": "IMMUNE_PANEL+SCREEN_GUIDES",
"sample": "pbmc_sample",
"barcodes": "BC9..16+CR9..16"
}
]
}
# Use all available cores
pycyto aggregate config.json cyto_out output --threads -1
# Limit to 8 samples processed simultaneously
pycyto aggregate config.json cyto_out output --threads 8
# Single-threaded (minimal memory)
pycyto aggregate config.json cyto_out output --threads 1
Problem: "Expected Feature file does not exist"
Solution: Ensure MTX directory contains matrix.mtx, features.tsv, and barcodes.tsv
Problem: "Invalid barcode format"
Solution: Barcodes must be BC/CR/AB followed by numbers. Use range syntax: BC1..8 not BC1-8
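To check a barcodes value before running aggregation, the syntax described in this document (`..` for ranges, `|` for separate selections, `+` for modality pairing) can be expanded with a few lines of Python. This is a sketch under stated assumptions, not pycyto's parser: `expand_token` and `expand_barcodes` are hypothetical names, and the three-digit zero padding is inferred from the examples (BC1..3 expanding to BC001, BC002, BC003).

```python
import re

# Prefixes taken from the solution above: BC, CR, or AB followed by numbers.
_TOKEN = re.compile(r"^(BC|CR|AB)(\d+)(?:\.\.(\d+))?$")


def expand_token(token):
    """Expand 'BC1..3' to ['BC001', 'BC002', 'BC003'] and 'BC001' to ['BC001']."""
    match = _TOKEN.match(token)
    if match is None:
        raise ValueError(f"Invalid barcode format: {token!r}")
    prefix, start = match.group(1), int(match.group(2))
    stop = start if match.group(3) is None else int(match.group(3))
    if stop < start:
        raise ValueError(f"Descending range: {token!r}")
    # Assumption: barcode names are zero-padded to three digits.
    return [f"{prefix}{number:03d}" for number in range(start, stop + 1)]


def expand_barcodes(spec):
    """Expand a barcodes value into tuples, one tuple per barcode or pair."""
    groups = []
    for selection in spec.split("|"):
        columns = [expand_token(part) for part in selection.split("+")]
        if len({len(column) for column in columns}) != 1:
            raise ValueError(f"Paired ranges differ in length: {selection!r}")
        # Pair the i-th barcode of each modality, e.g. BC001 with CR001.
        groups.extend(zip(*columns))
    return groups


print(expand_barcodes("BC1..2+CR1..2"))
# [('BC001', 'CR001'), ('BC002', 'CR002')]
```

A value such as BC1-8 raises ValueError here, which mirrors the "Invalid barcode format" problem described above.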
Problem: "No data found to process for sample"
Solution:
- Verify experiment name in config matches cyto output directory prefix
- Check that barcode specifications match available probe barcodes in cyto output
- Ensure cyto outputs are in expected directory structure
Problem: Out of memory during aggregation
Solution:
- Reduce --threads to process fewer samples concurrently
- Use --compress to reduce output file sizes
- Process samples in smaller batches with separate configs
- Parallel aggregation: Samples are processed independently in parallel (one per thread)
- Lazy loading: Uses anndata experimental lazy loading to minimize memory overhead
- Compression: Optional compression reduces h5ad file sizes ~40-50%
- Integer storage: --integer flag in convert reduces memory vs float32
- cyto: Main processing pipeline - https://github.com/arcinstitute/cyto
- anndata: Annotated data structures - https://anndata.readthedocs.io
If you use pycyto in your research, please cite:
Teyssier, N. and Dobin, A. (2025). cyto: ultra-high throughput processing
of 10x-flex single cell sequencing. bioRxiv.
- Issues: https://github.com/arcinstitute/pycyto/issues
- Documentation: Run pycyto --help or pycyto <command> --help