This file provides guidance to Claude Code when working with pycyto code.
pycyto is a Python companion tool for cyto, providing utilities to work with cyto outputs:
- convert: Transform Matrix Market (MTX) format to h5ad (AnnData) for downstream analysis
- aggregate: Combine multi-probe cyto outputs into unified sample-level datasets
pycyto is designed specifically for 10x Genomics Flex workflows where a single biological sample is split across multiple flex probe barcodes (BC001-BC016 for GEX, CR001-CR016 for CRISPR). The main challenge is intelligently aggregating data across these technical replicates while handling multi-modal experiments (GEX + CRISPR).
uv tool install pycyto
Or from source:
cd pycyto
uv pip install -e .

- __main__.py - CLI entry point using Typer (convert, aggregate commands)
- config.py - Configuration parsing and barcode expansion DSL
- aggregate.py - Multi-modal sample aggregation logic
- convert.py - Simple MTX to h5ad conversion utilities

- polars - Fast dataframe operations for configuration handling
- anndata - Single-cell data structures
- anndata.experimental - Lazy loading for memory efficiency
- pandas - Legacy support for anndata .obs manipulation
Purpose: Parse JSON configuration files that specify how to map flex probe barcodes to biological samples.
Key challenge: Users need a concise way to specify barcode ranges and pairings without listing hundreds of individual barcodes.
Pycyto supports two barcode naming conventions:
Flex-V1 (16-plex):
- Format: BC001-BC016, CR001-CR016, AB001-AB016
- Use case: Standard 16-plex multiplexing for GEX, CRISPR, or antibody capture
Flex-V2 (384-plex):
- Format: [ABCD]-[ABCDEFGH][01-12] or [ABCD]_[ABCDEFGH][01-12]
- Examples: A-A01, B-C05, D-H12 or A_A01, B_C05, D_H12
- Both hyphen and underscore separators are supported
- Structure: 4 sets × 8 rows × 12 columns = 384 unique barcodes
- Sets: A, B, C, D
- Rows: A-H (8 rows per set, like a 96-well plate)
- Columns: 01-12 (12 columns per row)
- Use case: High-throughput 384-plex multiplexing
- Note: No BC/CR dual naming for Flex-V2 (single naming scheme per barcode set)
The barcode format is automatically detected from the naming convention - no config flag needed.
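As a rough illustration of name-based detection, the sketch below classifies a single barcode name with two regular expressions. The function name detect_format and the exact patterns are assumptions made for this example; pycyto's real detection logic may differ (for instance, it may inspect the data rather than one name).

```python
import re

# Hypothetical patterns derived from the naming conventions described above.
# The V1 pattern accepts any three digits; it does not enforce the 001-016 range.
V1_PATTERN = re.compile(r"(BC|CR|AB)\d{3}")
V2_PATTERN = re.compile(r"[A-D][-_][A-H](0[1-9]|1[0-2])")


def detect_format(barcode: str) -> str:
    """Classify a barcode name as 'flex-v1' or 'flex-v2' (illustrative only)."""
    if V1_PATTERN.fullmatch(barcode):
        return "flex-v1"
    if V2_PATTERN.fullmatch(barcode):
        return "flex-v2"
    raise ValueError(f"Unrecognized barcode name: {barcode!r}")
```

For example, detect_format("BC001") yields "flex-v1", while both "A-A01" and "D_H12" yield "flex-v2".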
The config parser implements a mini-language for specifying barcodes:
Flex-V1 Range syntax:
"BC1..8" → expands to BC001, BC002, BC003, ..., BC008

Flex-V2 Range syntax:
"A-A01..A-A12" → expands to A-A01, A-A02, ..., A-A12
"A-A01..A-H12" → expands to all 96 barcodes in set A (full plate)

Selection syntax (both V1 and V2):
"BC1|BC3|BC5" → expands to BC001, BC003, BC005
"A-A01|A-A05|A-C03" → expands to A-A01, A-A05, A-C03

Combined syntax:
"BC1..4|BC7|BC9..12" → expands to BC001, BC002, BC003, BC004, BC007, BC009, BC010, BC011, BC012
"A-A01..A-A12|B-B01..B-B12" → expands to ranges from sets A and B

Pairing syntax (for multi-modal experiments, Flex-V1 only):
"BC1..8+CR1..8" → pairs BC001+CR001, BC002+CR002, ..., BC008+CR008

Multiple independent pairings:
"BC001+CR001|BC002+CR002" → two separate pairings

Flex-V2 Multi-Modal: Since Flex-V2 uses a single naming scheme (no BC/CR dual naming), no + pairing is needed:
"A-A01..A-H12" → uses same barcodes for both GEX and CRISPR

The parser must handle ambiguity with the | operator:
- Does BC1..4|BC5+CR1..4|CR5 mean two pairings OR ranges within each component?
- Solution: Check if each |-separated part has the expected number of + separators
- If all parts are complete pairings (correct number of +), treat as multiple combinations
- Otherwise, treat | as selection within components
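The disambiguation rule can be sketched in a few lines. This is a simplified illustration rather than pycyto's actual parser: the function name split_entry is hypothetical, and it assumes the expected number of + separators is the number of libraries minus one.

```python
def split_entry(entry: str, nlib: int) -> list[list[str]]:
    """Split a barcode entry into combinations of per-library components.

    Assumption for this sketch: a complete pairing has exactly nlib - 1
    '+' separators.
    """
    expected_plus = nlib - 1
    parts = entry.split("|")
    # Case 1: every |-separated part is itself a complete pairing,
    # so each part is an independent combination.
    if expected_plus > 0 and all(p.count("+") == expected_plus for p in parts):
        return [p.split("+") for p in parts]
    # Case 2: '|' is a selection inside each component, so split on '+' only.
    return [entry.split("+")]
```

With nlib=2, "BC001+CR001|BC002+CR002" becomes two combinations, whereas "BC1..4|BC5+CR1..4|CR5" stays a single combination whose components are "BC1..4|BC5" and "CR1..4|CR5".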
- parse_config(config_path) - Main entry point, returns polars DataFrame with expanded barcodes
- _parse_barcodes(entry, nlib) - Handles the barcode expansion DSL
- _expand_barcode_component(component) - Expands a single component like BC1..8
- _expand_range(range_str) - Expands 5..7 to [5, 6, 7]
- _expand_selection(selection) - Expands 1|3|5..7 to [1, 3, 5, 6, 7]
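To make the expansion behaviour concrete, here is a minimal standalone sketch of range, selection, and component expansion. The function names mirror the ones listed above without the leading underscore, but the bodies are assumptions written for this example and are not copied from config.py. In particular, the Flex-V2 helper assumes ranges run row by row within a single set (A01..A12, then B01..B12, and so on).

```python
import re

ROWS = "ABCDEFGH"  # plate rows within a Flex-V2 set
NCOLS = 12         # plate columns per row

_V1_PART = re.compile(r"([A-Z]+)(\d+)(?:\.\.(\d+))?")
_V2_BARCODE = re.compile(r"([A-D])([-_])([A-H])(0[1-9]|1[0-2])")


def expand_range(range_str: str) -> list[int]:
    """Expand "5..7" to [5, 6, 7]; a bare number becomes a one-item list."""
    if ".." in range_str:
        start, end = range_str.split("..")
        return list(range(int(start), int(end) + 1))
    return [int(range_str)]


def expand_selection(selection: str) -> list[int]:
    """Expand "1|3|5..7" to [1, 3, 5, 6, 7]."""
    out: list[int] = []
    for part in selection.split("|"):
        out.extend(expand_range(part))
    return out


def expand_v1_component(component: str) -> list[str]:
    """Expand a Flex-V1 component such as "BC1..4|BC7" to zero-padded names."""
    out: list[str] = []
    for part in component.split("|"):
        m = _V1_PART.fullmatch(part)
        if m is None:
            raise ValueError(f"Cannot parse Flex-V1 component part: {part!r}")
        prefix = m.group(1)
        start = int(m.group(2))
        end = int(m.group(3)) if m.group(3) else start
        out.extend(f"{prefix}{i:03d}" for i in range(start, end + 1))
    return out


def expand_v2_range(range_str: str) -> list[str]:
    """Expand "A-A01..A-H12" row by row (assumed ordering) within one set."""
    start, end = range_str.split("..")
    ms, me = _V2_BARCODE.fullmatch(start), _V2_BARCODE.fullmatch(end)
    if ms is None or me is None or ms.group(1) != me.group(1):
        raise ValueError(f"Cannot parse Flex-V2 range: {range_str!r}")
    set_id, sep = ms.group(1), ms.group(2)

    def index(m: re.Match) -> int:
        # Linear position of a well within the set, counted row by row.
        return ROWS.index(m.group(3)) * NCOLS + int(m.group(4)) - 1

    return [
        f"{set_id}{sep}{ROWS[i // NCOLS]}{i % NCOLS + 1:02d}"
        for i in range(index(ms), index(me) + 1)
    ]
```

Under these assumptions, expand_v1_component("BC1..4|BC7|BC9..12") yields the nine barcodes shown in the combined syntax example, and expand_v2_range("A-A01..A-H12") yields 96 names.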
Returns a polars DataFrame where each row represents a single flex barcode assignment:
experiment | sample | mode | bc_component | bc_idx | features | probe_set | feature_path | expected_prefix
Critical columns:
- bc_component: The specific flex barcode (BC001, CR001, etc.)
- bc_idx: Index within the pairing (for tracking which barcodes go together)
- expected_prefix: Used for directory matching (e.g., "experiment_GEX_Lane")
Purpose: Aggregate cyto outputs across multiple flex barcodes and lanes into unified sample-level datasets.
Key challenge: Cyto outputs are organized by experiment_MODE_Lane# directories, with one file per flex barcode. For multi-modal experiments (GEX + CRISPR), data must be merged at the cell level.
- Directory Discovery: For each sample, find all matching cyto output directories
  - Uses regex to match {experiment}_{MODE}_Lane* patterns
  - Aggregates across all lanes for an experiment+mode combination
- Data Loading: Load data for all barcodes assigned to a sample
  - GEX: Filtered h5ad files (BC001.filt.h5ad)
  - CRISPR: Unfiltered h5ad files (CR001.h5ad) + assignment files (CR001.assignments.tsv)
  - Reads: Per-barcode read statistics (BC001.reads.tsv.zst)
- Multi-Modal Merging (GEX + CRISPR case):
  - Concatenate all GEX data across barcodes
  - Concatenate all CRISPR data and assignments
  - Convert CRISPR barcodes: CR → BC for cell matching
  - Merge guide assignments into GEX .obs metadata
  - Filter CRISPR data to only cells present in filtered GEX data
- Output Generation: Write sample-level files
  - {sample}_gex.h5ad - Gene expression with guide annotations
  - {sample}_crispr.h5ad - Guide counts (GEX-filtered)
  - {sample}_assignments.parquet - Guide assignments per cell
  - {sample}_reads.parquet - Read/UMI statistics
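The last two merging steps (attaching guide assignments to GEX cells and filtering CRISPR data to GEX cells) amount to a left join and a membership filter on cell identifiers. The sketch below shows that logic with plain dictionaries instead of AnnData objects; the function names and data shapes are hypothetical and chosen only to keep the example dependency-free.

```python
from typing import Optional


def merge_assignments(
    gex_cells: list[str], assignments: dict[str, str]
) -> dict[str, Optional[str]]:
    """Left-join guide assignments onto GEX cells; unassigned cells get None."""
    return {cell: assignments.get(cell) for cell in gex_cells}


def filter_crispr_to_gex(
    crispr_counts: dict[str, dict[str, int]], gex_cells: list[str]
) -> dict[str, dict[str, int]]:
    """Keep only CRISPR cells that also appear in the filtered GEX data."""
    keep = set(gex_cells)
    return {cell: counts for cell, counts in crispr_counts.items() if cell in keep}


# Cell identifiers are assumed to be already converted to the BC naming.
gex_cells = ["ACGT-BC001-1", "TTGA-BC001-1"]
assignments = {"ACGT-BC001-1": "guide_7", "CCCC-BC001-1": "guide_2"}
crispr_counts = {
    "ACGT-BC001-1": {"guide_7": 12},
    "CCCC-BC001-1": {"guide_2": 5},  # not in filtered GEX, so dropped
}

merged = merge_assignments(gex_cells, assignments)
filtered = filter_crispr_to_gex(crispr_counts, gex_cells)
```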
The problem (Flex-V1 only): CRISPR libraries can use either BC or CR flex barcode prefixes depending on the experimental design. When pairing GEX (which always uses BC) with CRISPR data that uses CR prefixes, the cell barcodes need to be matched correctly.
Why separate prefixes (Flex-V1): CRISPR and GEX can be run with different flex barcode sets:
- Same barcodes: BC1..8+BC9..16 (CRISPR uses BC)
- Different barcodes: BC1..8+CR1..8 (CRISPR uses CR)
The solution:
- Detect barcode format (Flex-V1 vs Flex-V2) from GEX data
- For Flex-V1:
  - Detect if CRISPR uses CR prefixes by checking assignment data
  - If CR detected: cell barcodes like ACGTACGT-CR001-1 are converted to ACGTACGT-BC001-1
  - If BC detected: no conversion needed, barcodes already match
- For Flex-V2:
  - No conversion needed - single naming scheme per barcode set
  - Both GEX and CRISPR use the same barcode identifiers (e.g., A-A01)
- Matching is always done on cell_barcode + flex_barcode + lane_id
See _process_gex_crispr_set() (around lines 140-160) for the detection and conversion logic.
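A minimal sketch of the Flex-V1 detection and conversion step, assuming cell barcodes follow the sequence-flexbarcode-lane layout shown above (for example ACGTACGT-CR001-1). The helper names uses_cr_prefix and cr_to_bc are hypothetical and are not taken from aggregate.py.

```python
import re

# Matches the flex barcode segment of a cell barcode such as "ACGTACGT-CR001-1".
_CR_SEGMENT = re.compile(r"-CR(\d{3})-")


def uses_cr_prefix(cell_barcodes: list[str]) -> bool:
    """Return True if any cell barcode carries a CR flex barcode segment."""
    return any(_CR_SEGMENT.search(b) for b in cell_barcodes)


def cr_to_bc(cell_barcode: str) -> str:
    """Rewrite the CR segment to BC; barcodes without a CR segment are unchanged."""
    return _CR_SEGMENT.sub(r"-BC\1-", cell_barcode, count=1)


crispr_cells = ["ACGTACGT-CR001-1", "TTGACCAA-CR002-1"]
if uses_cr_prefix(crispr_cells):
    crispr_cells = [cr_to_bc(b) for b in crispr_cells]
```

After conversion, the CRISPR cell barcodes can be compared directly against GEX cell barcodes, since the sequence, flex barcode number, and lane segments all line up.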
The aggregation process is designed to minimize memory usage:
- Uses anndata.experimental.read_lazy() to avoid loading full matrices into memory
- Explicit del statements after concatenation to free memory immediately
- Parallel processing at the sample level (not barcode level) to avoid memory explosion
- aggregate_data(config, cyto_outdir, outdir, ...) - Main entry point, orchestrates parallel processing
- process_sample(sample, config, ...) - Processes a single sample (runs in parallel)
- _process_gex_crispr_set(...) - Handles GEX + CRISPR merging and filtering
- _filter_crispr_adata_to_gex_barcodes(...) - Filters CRISPR to GEX cells
- _load_*_for_experiment_sample(...) - Load specific data types from cyto outputs
Samples are processed in parallel using multiprocessing with spawn context:
- One process per sample (controlled by --threads)
- Each worker initializes its own logger
- No shared state between workers
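The pattern described above can be sketched with the standard library as follows. This is an illustrative outline, not pycyto's code: process_sample_stub, _init_worker, and run_parallel are hypothetical names, and the stub stands in for the real per-sample work.

```python
import logging
import multiprocessing as mp


def _init_worker() -> None:
    """Runs once in each worker process: set up that worker's own logger."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(processName)s %(levelname)s %(message)s",
    )


def process_sample_stub(sample: str) -> str:
    """Placeholder for per-sample work; receives everything it needs as arguments."""
    logging.getLogger(__name__).info("processing %s", sample)
    return f"{sample}: done"


def run_parallel(samples: list[str], threads: int) -> list[str]:
    """Process samples in a pool of spawned workers, one task per sample."""
    ctx = mp.get_context("spawn")  # fresh interpreter per worker, no inherited state
    with ctx.Pool(processes=threads, initializer=_init_worker) as pool:
        return pool.map(process_sample_stub, samples)
```

With the spawn start method, a call such as run_parallel(["control", "treated"], threads=2) should be made from code guarded by if __name__ == "__main__":, because each worker re-imports the main module.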
Configurations are JSON files with two main sections:
Maps logical names to feature file paths:
{
"libraries": {
"GEX_PROBES": "./gex_probes.tsv",
"GUIDE_LIBRARY": "./guides.tsv"
}
}
Specifies how to aggregate barcodes into samples:
{
"samples": [
{
"experiment": "exp1",
"sample": "sample_name",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDE_LIBRARY",
"barcodes": "BC1..8+CR1..8"
}
]
}
Key fields:
- experiment: Must match cyto output directory prefix
- mode: gex, crispr, or gex+crispr
- features: Library names from libraries section (use + for multi-modal)
- barcodes: Barcode specification using the DSL (use + for pairing)
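As a quick consistency check on a configuration of this shape, the sketch below verifies that each sample's features refer to entries in libraries and that the number of features matches the number of modes. The function validate_config and these particular checks are assumptions for illustration; they are not part of pycyto's parse_config.

```python
import json


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems found in a config dict."""
    errors: list[str] = []
    libraries = config.get("libraries", {})
    for i, sample in enumerate(config.get("samples", [])):
        name = sample.get("sample", f"#{i}")
        modes = sample.get("mode", "").split("+")
        features = sample.get("features", "").split("+")
        # One feature library is expected per modality in this sketch.
        if len(modes) != len(features):
            errors.append(f"{name}: {len(modes)} mode(s) but {len(features)} feature set(s)")
        for feature in features:
            if feature not in libraries:
                errors.append(f"{name}: unknown library {feature!r}")
    return errors


config = json.loads("""
{
  "libraries": {"GEX_PROBES": "./gex_probes.tsv", "GUIDE_LIBRARY": "./guides.tsv"},
  "samples": [
    {"experiment": "exp1", "sample": "sample_name", "mode": "gex+crispr",
     "features": "GEX_PROBES+GUIDE_LIBRARY", "barcodes": "BC1..8+CR1..8"}
  ]
}
""")
problems = validate_config(config)
```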
The aggregate command expects cyto outputs organized as:
cyto_outdir/
├── {experiment}_GEX_Lane1/
│ ├── counts/
│ │ ├── BC001.filt.h5ad
│ │ └── BC002.filt.h5ad
│ └── stats/reads/
│ ├── BC001.reads.tsv.zst
│ └── BC002.reads.tsv.zst
└── {experiment}_CRISPR_Lane1/
├── counts/
│ ├── CR001.h5ad
│ └── CR002.h5ad
└── assignments/
├── CR001.assignments.tsv
└── CR002.assignments.tsv
Note: For single-lane runs, the _Lane{N} suffix is optional. Directories named {experiment}_GEX or {experiment}_CRISPR (without Lane suffix) will be treated as Lane 1. This makes simple runs more convenient:
cyto_outdir/
├── {experiment}_GEX/ # Treated as Lane 1
│ └── counts/
│ ├── BC001.filt.h5ad
│ └── BC002.filt.h5ad
└── {experiment}_CRISPR/ # Treated as Lane 1
└── counts/
├── CR001.h5ad
└── CR002.h5ad
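The directory naming rule, including the optional lane suffix, can be expressed as a single regular expression. The sketch below is one possible formulation; match_lane is a hypothetical helper and not a function from aggregate.py.

```python
import re
from typing import Optional


def match_lane(dirname: str, experiment: str, mode: str) -> Optional[int]:
    """Return the lane number for a matching directory name, else None.

    Directories without a _Lane{N} suffix are treated as Lane 1.
    """
    pattern = re.compile(
        rf"{re.escape(experiment)}_{re.escape(mode)}(?:_Lane(\d+))?"
    )
    m = pattern.fullmatch(dirname)
    if m is None:
        return None
    return int(m.group(1)) if m.group(1) else 1


dirs = ["exp1_GEX_Lane1", "exp1_GEX_Lane2", "exp1_CRISPR", "other_GEX_Lane1"]
gex_lanes = {d: match_lane(d, "exp1", "GEX") for d in dirs}
```

Here only the two exp1_GEX directories resolve to lane numbers (1 and 2), and match_lane("exp1_CRISPR", "exp1", "CRISPR") resolves to 1.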
{
"libraries": { "GEX": "./gex_probes.tsv" },
"samples": [
{
"experiment": "exp1",
"mode": "gex",
"features": "GEX",
"sample": "control",
"barcodes": "BC1..4"
}
]
}

{
"libraries": {
"GEX": "./gex_probes.tsv",
"GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "perturbseq",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "screen_rep1",
"barcodes": "BC1..8+CR1..8"
}
]
}

{
"libraries": { "GEX": "./gex_probes.tsv" },
"samples": [
{
"experiment": "exp1",
"mode": "gex",
"features": "GEX",
"sample": "poolA_sample1",
"barcodes": "A-A01..A-H12"
},
{
"experiment": "exp1",
"mode": "gex",
"features": "GEX",
"sample": "poolB_sample1",
"barcodes": "B-A01..B-H12"
}
]
}

{
"libraries": {
"GEX": "./gex_probes.tsv",
"GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "perturbseq",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "screen_poolA",
"barcodes": "A-A01..A-H12"
}
]
}
Note: Flex-V2 uses a single naming scheme, so no + pairing is needed for multi-modal experiments.
Run tests with:
pytest tests/
Example configurations are available in examples/:
- aggregation.json - Simple multi-modal example with explicit pairings
- ngn2_agg.json - Developmental timecourse with range syntax
- cyto: Main processing pipeline - https://github.com/arcinstitute/cyto
- anndata: Data structures - https://anndata.readthedocs.io