A Python package for simulating realistic spatial transcriptomics data — useful for benchmarking deconvolution, segmentation, and transcript-assignment methods.

The simulator generates a synthetic tissue section in four stages:

    SimulationConfig
           │
           ▼
    1. Gene Parameters       sample μ (expression level) and θ (overdispersion)
                             per gene per cell type using Gamma priors
           │
           ▼
    2. Cell Geometry         place nuclei on a jittered hex grid,
                             grow log-normal polygons, tile boundaries with Voronoi
           │
           ▼
    3. Count Matrix          draw transcript counts from NegBinom(μ·scale, θ)
                             per cell per gene
           │
           ▼
    4. Transcript Locations  place each transcript inside its cell polygon;
                             a configurable fraction leaks outside (leakage model)

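
Stages 1 and 3 can be sketched with NumPy alone. This is an illustration, not the package's implementation: the Gamma shape and scale values, the log-normal size factor, and the name `sample_counts` are all assumptions, and θ is assumed to be the negative binomial size parameter (so the variance is μ + μ²/θ).

```python
import numpy as np

def sample_counts(mu, theta, cell_type, scale, rng):
    """Draw a (n_cells, n_genes) count matrix from NegBinom(mu * scale, theta).

    mu, theta : (n_celltypes, n_genes) arrays of per-type gene parameters
    cell_type : (n_cells,) integer array of cell type indices
    scale     : (n_cells,) per-cell size scale factors
    """
    mean = mu[cell_type] * scale[:, None]   # per-cell, per-gene expected count
    size = theta[cell_type]                 # assumed NB size parameter
    # NumPy parameterises the NB by (n, p); this choice gives E[count] = mean.
    p = size / (size + mean)
    return rng.negative_binomial(size, p)

rng = np.random.default_rng(42)
n_celltypes, n_genes, n_cells = 3, 5, 4

# Stage 1: Gamma priors for mu and theta (hyperparameters are illustrative).
mu = rng.gamma(shape=2.0, scale=5.0, size=(n_celltypes, n_genes))
theta = rng.gamma(shape=2.0, scale=1.0, size=(n_celltypes, n_genes))

# Hypothetical per-cell metadata: a cell type and a size scale factor.
cell_type = rng.integers(0, n_celltypes, size=n_cells)
scale = rng.lognormal(mean=0.0, sigma=0.25, size=n_cells)

# Stage 3: one NB draw per cell per gene.
counts = sample_counts(mu, theta, cell_type, scale, rng)
```

Multiplying the mean by `scale` while leaving θ untouched is one way to make larger cells receive proportionally more transcripts without changing the overdispersion.
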
Key design choices:
- Cell-type-specific marker, housekeeping, and silent gene classes
- Per-cell size scaling so larger cells receive more transcripts proportionally
- Per-cell-type and per-gene leakage probability for realistic cross-boundary contamination
- Spatial prototype system for inserting structured niches (clusters, rings, chains)
- Shapely 2.x geometry throughout; no rasterisation
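
The leakage idea from the list above can be illustrated with pure Shapely geometry. The sketch below assumes leaked transcripts fall uniformly in a band of width `leak_dist` around the cell; the package's actual leakage model, and the way it combines per-cell-type and per-gene probabilities, may differ. The function names and parameters are hypothetical.

```python
import numpy as np
from shapely.geometry import Point, Polygon

def sample_point_in_region(region, rng, max_tries=10_000):
    """Rejection-sample a uniform point inside `region` from its bounding box."""
    minx, miny, maxx, maxy = region.bounds
    for _ in range(max_tries):
        pt = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
        if region.contains(pt):
            return pt
    raise RuntimeError("no point found; region may be empty or very thin")

def place_transcripts(cell, n, leak_prob, leak_dist, rng):
    """Place n transcripts; each leaks outside `cell` with probability leak_prob."""
    # Band just outside the cell boundary where leaked transcripts land.
    band = cell.buffer(leak_dist).difference(cell)
    points, leaked = [], []
    for _ in range(n):
        is_leak = bool(rng.random() < leak_prob)
        points.append(sample_point_in_region(band if is_leak else cell, rng))
        leaked.append(is_leak)
    return points, np.array(leaked)

rng = np.random.default_rng(0)
cell = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
points, leaked = place_transcripts(cell, n=200, leak_prob=0.1, leak_dist=1.0, rng=rng)
```

Everything here stays in vector geometry (`buffer`, `difference`, `contains`), so no rasterisation is needed.
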

To install:

    # Recommended: create a dedicated conda environment
    conda env create -f environment.yml
    conda activate STPuppeteer

    # Or install into an existing Python >=3.10 environment
    pip install -e .

Requires Python ≥ 3.10 and Shapely ≥ 2.0.

Basic usage:

    from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

    config = SimulationConfig(
        n_cells=200,
        n_celltype=3,
        celltype_proportion=[0.5, 0.3, 0.2],
        n_genes=500,
        n_markers=[100, 80, 60],               # marker genes per cell type
        leakage_by_celltype=[0.1, 0.15, 0.05],
        seed=42,
    )

    sim = SpotlessSimulator(config)
    sim.run_full_simulation()

    # Access results
    sim.cell_gdf     # GeoDataFrame: cell polygons, cell type, size scaling
    sim.gpar_df      # DataFrame: gene parameters (μ, θ, gene_type)
    sim.count_array  # ndarray: (n_cells × n_genes) count matrix
    sim.trs_df       # DataFrame: transcript coordinates + metadata

    # Save
    sim.save_simple("output/")           # Parquet, NPY, CSV
    sim.save_spatialdata("output.zarr")  # SpatialData/Zarr (requires spatialdata)
    sim.save_xenium("output_xenium/")    # 10x Xenium-compatible Parquet format

See `examples/example.py` for a runnable end-to-end script.

| Attribute | Type | Description |
|---|---|---|
| `sim.cell_gdf` | GeoDataFrame | Cell and nucleus polygons, cell type, centroid, area, scale factor |
| `sim.gpar_df` | DataFrame | Per-gene μ, θ, gene type (marker / housekeeping / silent), gene leakage |
| `sim.count_array` | ndarray | Count matrix (cells × genes) |
| `sim.trs_df` | DataFrame | Transcript x/y/z location, cell/gene assignment, leakage and nucleus overlap flags |

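
When benchmarking a transcript-assignment method, it can help to rebuild a cells × genes matrix from per-transcript assignments and compare it with the reference count matrix. The sketch below uses plain integer arrays as stand-ins for the cell and gene assignment columns of the transcript table; the actual column names are not shown here (see REFERENCE.md), and `counts_from_assignments` is a hypothetical helper, not part of the package. Whether leaked transcripts still count toward their cell of origin is also an assumption to verify.

```python
import numpy as np

def counts_from_assignments(cell_idx, gene_idx, n_cells, n_genes):
    """Tally one count per transcript into a (n_cells, n_genes) matrix."""
    counts = np.zeros((n_cells, n_genes), dtype=np.int64)
    # np.add.at accumulates correctly even when (cell, gene) pairs repeat.
    np.add.at(counts, (cell_idx, gene_idx), 1)
    return counts

# Toy assignments: six transcripts across three cells and four genes.
cell_idx = np.array([0, 0, 1, 2, 2, 2])
gene_idx = np.array([1, 1, 0, 3, 3, 0])

rebuilt = counts_from_assignments(cell_idx, gene_idx, n_cells=3, n_genes=4)
# rebuilt:
# [[0 2 0 0]
#  [1 0 0 0]
#  [1 0 0 2]]
```

With real output, `np.array_equal(rebuilt, reference)` against the reference matrix gives a quick consistency check.
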
For full parameter and API reference see REFERENCE.md.