STPuppeteer

A Python package for simulating realistic spatial transcriptomics data — useful for benchmarking deconvolution, segmentation, and transcript-assignment methods.

How it works

The simulator generates a synthetic tissue section in four stages:

SimulationConfig
      │
      ▼
1. Gene Parameters     — sample μ (expression level) and θ (overdispersion)
                          per gene per cell type using Gamma priors
      │
      ▼
2. Cell Geometry       — place nuclei on a jittered hex grid,
                          grow log-normal polygons, tile boundaries with Voronoi
      │
      ▼
3. Count Matrix        — draw transcript counts from NegBinom(μ·scale, θ)
                          per cell per gene
      │
      ▼
4. Transcript Locations — place each transcript inside its cell polygon;
                           a configurable fraction leaks outside (leakage model)
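
Stages 1 and 3 can be illustrated with a minimal NumPy sketch. It is not the package's implementation: the Gamma shape/scale values, the log-normal size factors, and all variable names are assumptions chosen for illustration, and θ is assumed to be the dispersion in the mean/dispersion form of the negative binomial (variance = m + m²/θ).

```python
import numpy as np

rng = np.random.default_rng(42)
n_cells, n_genes, n_celltype = 200, 500, 3

# Stage 1 (assumed form): Gamma priors give one mean (mu) and one dispersion
# (theta) per cell type and gene. Shape/scale values are illustrative only.
mu = rng.gamma(shape=2.0, scale=5.0, size=(n_celltype, n_genes))
theta = rng.gamma(shape=2.0, scale=1.0, size=(n_celltype, n_genes))

# Assign a cell type to every cell and draw a per-cell size factor
# (hypothetical stand-in for the area-based scaling), normalised to mean 1.
cell_type = rng.choice(n_celltype, size=n_cells, p=[0.5, 0.3, 0.2])
scale = rng.lognormal(mean=0.0, sigma=0.3, size=n_cells)
scale /= scale.mean()

# Stage 3: NegBinom(mu * scale, theta). NumPy uses the (n, p) parametrisation,
# so set n = theta and p = theta / (theta + m) to obtain mean m.
mean = scale[:, None] * mu[cell_type]        # (n_cells, n_genes)
disp = theta[cell_type]                      # (n_cells, n_genes)
counts = rng.negative_binomial(disp, disp / (disp + mean))
```

Multiplying μ by the size factor before sampling keeps the dispersion unchanged while shifting the expected count, which is one simple way to make larger cells receive proportionally more transcripts.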

Key design choices:

- Cell-type-specific marker, housekeeping, and silent gene classes
- Per-cell size scaling so larger cells receive more transcripts proportionally
- Per-cell-type and per-gene leakage probability for realistic cross-boundary contamination
- Spatial prototype system for inserting structured niches (clusters, rings, chains)
- Shapely 2.x geometry throughout; no rasterisation
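
The per-cell-type leakage idea can be sketched as one Bernoulli draw per transcript. This is an illustration under assumptions, not the package's leakage model: `transcript_celltype` is a hypothetical input, the per-gene leakage term is ignored because the README does not say how the two probabilities are combined, and the sketch only flags leaked transcripts without moving them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Leakage probabilities per cell type, as in the Quickstart config.
leakage_by_celltype = np.array([0.1, 0.15, 0.05])

# Hypothetical input: cell type index of the source cell of each transcript.
transcript_celltype = rng.integers(0, 3, size=30_000)

# One Bernoulli draw per transcript, using its source cell type's probability.
p_leak = leakage_by_celltype[transcript_celltype]
is_leaked = rng.random(p_leak.shape) < p_leak

# The observed leak fraction per cell type should be close to the configured value.
observed = np.array(
    [is_leaked[transcript_celltype == k].mean() for k in range(3)]
)
```

Comparing `observed` with `leakage_by_celltype` is a quick sanity check that can be applied in the same way to the leakage flags of a simulated dataset.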

Installation

# Recommended: create a dedicated conda environment
conda env create -f environment.yml
conda activate STPuppeteer

# Or install into an existing Python >=3.10 environment
pip install -e .

Requires Python ≥ 3.10 and Shapely ≥ 2.0.

Quickstart

from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

config = SimulationConfig(
    n_cells=200,
    n_celltype=3,
    celltype_proportion=[0.5, 0.3, 0.2],
    n_genes=500,
    n_markers=[100, 80, 60],           # marker genes per cell type
    leakage_by_celltype=[0.1, 0.15, 0.05],
    seed=42,
)

sim = SpotlessSimulator(config)
sim.run_full_simulation()

# Access results
sim.cell_gdf      # GeoDataFrame — cell polygons, cell type, size scaling
sim.gpar_df       # DataFrame    — gene parameters (μ, θ, gene_type)
sim.count_array   # ndarray      — (n_cells × n_genes) count matrix
sim.trs_df        # DataFrame    — transcript coordinates + metadata

# Save
sim.save_simple("output/")           # Parquet, NPY, CSV
sim.save_spatialdata("output.zarr")  # SpatialData/Zarr (requires spatialdata)
sim.save_xenium("output_xenium/")    # 10x Xenium-compatible Parquet format

See examples/example.py for a runnable end-to-end script.

Output summary

| Attribute | Type | Description |
| --- | --- | --- |
| `sim.cell_gdf` | GeoDataFrame | Cell and nucleus polygons, cell type, centroid, area, scale factor |
| `sim.gpar_df` | DataFrame | Per-gene μ, θ, gene type (marker / housekeeping / silent), gene leakage |
| `sim.count_array` | ndarray | Count matrix (cells × genes) |
| `sim.trs_df` | DataFrame | Transcript x/y/z location, cell/gene assignment, leakage and nucleus overlap flags |
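
As a hedged example of working with the count matrix, the helper below computes a few basic summaries from any (cells × genes) array. `summarize_counts` is a hypothetical function written for this README, not part of the package, and the small `demo` array stands in for `sim.count_array`.

```python
import numpy as np

def summarize_counts(count_array: np.ndarray) -> dict:
    """Basic QC summaries for a (cells x genes) count matrix."""
    counts = np.asarray(count_array)
    if counts.ndim != 2:
        raise ValueError("expected a 2-D (cells x genes) array")
    detected = counts > 0
    return {
        "n_cells": counts.shape[0],
        "n_genes": counts.shape[1],
        "counts_per_cell": counts.sum(axis=1),             # total transcripts per cell
        "genes_detected_per_cell": detected.sum(axis=1),   # genes with count > 0
        "fraction_cells_expressing": detected.mean(axis=0),  # per gene
    }

# Stand-in data: 2 cells x 3 genes.
demo = np.array([[0, 3, 1],
                 [2, 0, 0]])
summary = summarize_counts(demo)
```

After a real run, passing `sim.count_array` instead of `demo` should give the same summaries for the simulated tissue, assuming the array has the (n_cells × n_genes) layout described above.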

For full parameter and API reference see REFERENCE.md.
