pkharchenko/STpuppeteer
STPuppeteer

A Python package for simulating realistic spatial transcriptomics data — useful for benchmarking deconvolution, segmentation, and transcript-assignment methods.

How it works

The simulator generates a synthetic tissue section in four stages:

SimulationConfig
      │
      ▼
1. Gene Parameters     — sample μ (expression level) and θ (overdispersion)
                          per gene per cell type using Gamma priors
      │
      ▼
2. Cell Geometry       — place nuclei on a jittered hex grid,
                          grow log-normal polygons, tile boundaries with Voronoi
      │
      ▼
3. Count Matrix        — draw transcript counts from NegBinom(μ·scale, θ)
                          per cell per gene
      │
      ▼
4. Transcript Locations — place each transcript inside its cell polygon;
                           a configurable fraction leaks outside (leakage model)
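
As a rough illustration of stage 2, the sketch below places nuclei on a jittered hexagonal lattice and assigns arbitrary points to the Voronoi cell of their nearest nucleus. It is not the package's implementation: STPuppeteer builds Shapely polygons, whereas this stand-in uses plain NumPy and nearest-nucleus lookup (which is equivalent to Voronoi membership) and omits the log-normal polygon step. The function names, the `spacing` and `jitter` parameters, and the uniform jitter model are all assumptions made for the example.

```python
import numpy as np

def jittered_hex_grid(n_cols, n_rows, spacing=10.0, jitter=0.2, seed=None):
    """Return (n_rows * n_cols, 2) nucleus centres on a jittered hex lattice."""
    rng = np.random.default_rng(seed)
    cols, rows = np.meshgrid(np.arange(n_cols), np.arange(n_rows))
    # Odd rows are shifted by half a spacing; rows sit sqrt(3)/2 * spacing apart.
    x = (cols + 0.5 * (rows % 2)) * spacing
    y = rows * spacing * np.sqrt(3) / 2
    centres = np.column_stack([x.ravel(), y.ravel()]).astype(float)
    # Assumed jitter model: uniform offsets, expressed as a fraction of spacing.
    centres += rng.uniform(-jitter, jitter, size=centres.shape) * spacing
    return centres

def voronoi_cell_of(points, centres):
    """Index of the nearest centre for each point, i.e. its Voronoi cell."""
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

centres = jittered_hex_grid(n_cols=5, n_rows=4, seed=0)

# Assign 1000 random probe points to cells.
probes = np.random.default_rng(1).uniform(0, 40, size=(1000, 2))
owner = voronoi_cell_of(probes, centres)
```

With a jitter fraction well below 0.5 the centres stay distinct, so every nucleus falls in its own cell.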

Key design choices:

  • Cell-type-specific marker, housekeeping, and silent gene classes
  • Per-cell size scaling, so larger cells receive proportionally more transcripts
  • Per-cell-type and per-gene leakage probabilities for realistic cross-boundary contamination
  • Spatial prototype system for inserting structured niches (clusters, rings, chains)
  • Shapely 2.x geometry throughout; no rasterisation
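
The per-cell size scaling in the list above, together with the NegBinom(μ·scale, θ) draw from stage 3, can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the package's code: the Gamma shape and scale values are made up, θ is treated as the negative binomial size parameter (variance = mean + mean²/θ), and the scale factor is taken to be cell area divided by mean area. The function names are hypothetical.

```python
import numpy as np

def sample_gene_params(n_types, n_genes, rng):
    """Draw mu and theta per cell type and gene from Gamma priors (assumed values)."""
    mu = rng.gamma(shape=2.0, scale=5.0, size=(n_types, n_genes))
    theta = rng.gamma(shape=4.0, scale=0.5, size=(n_types, n_genes))
    return mu, theta

def draw_counts(mu, theta, cell_type, size_scale, rng):
    """Draw a (n_cells, n_genes) matrix with mean mu * scale and size theta."""
    mean = mu[cell_type] * size_scale[:, None]
    size = theta[cell_type]
    # NumPy parameterises by (n, p); p = size / (size + mean) gives the target mean.
    p = size / (size + mean)
    return rng.negative_binomial(size, p)

rng = np.random.default_rng(42)
mu, theta = sample_gene_params(n_types=3, n_genes=50, rng=rng)
cell_type = rng.choice(3, size=200, p=[0.5, 0.3, 0.2])

# Assumed scaling rule: cell area relative to the mean area.
area = rng.lognormal(mean=4.0, sigma=0.3, size=200)
size_scale = area / area.mean()

counts = draw_counts(mu, theta, cell_type, size_scale, rng)
```

Because the scale factor multiplies the mean only, a cell twice the average size expects twice the transcripts while the dispersion parameter is unchanged.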

Installation

# Recommended: create a dedicated conda environment
conda env create -f environment.yml
conda activate STPuppeteer

# Or install into an existing Python >=3.10 environment
pip install -e .

Requires Python ≥ 3.10 and Shapely ≥ 2.0.
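
To confirm that an existing environment meets these two requirements before installing, a small standard-library check along these lines may help. The helper name `check_requirements` is hypothetical and not part of STPuppeteer.

```python
import sys
from importlib import metadata

def check_requirements(min_python=(3, 10), min_shapely_major=2):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    try:
        shapely_version = metadata.version("shapely")
    except metadata.PackageNotFoundError:
        problems.append("Shapely is not installed")
    else:
        # Only the major version is compared, which is enough for ">= 2.0".
        if int(shapely_version.split(".")[0]) < min_shapely_major:
            problems.append(f"Shapely 2.0+ required, found {shapely_version}")
    return problems

print(check_requirements() or "environment looks OK")
```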

Quickstart

from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

config = SimulationConfig(
    n_cells=200,
    n_celltype=3,
    celltype_proportion=[0.5, 0.3, 0.2],
    n_genes=500,
    n_markers=[100, 80, 60],           # marker genes per cell type
    leakage_by_celltype=[0.1, 0.15, 0.05],
    seed=42,
)

sim = SpotlessSimulator(config)
sim.run_full_simulation()

# Access results
sim.cell_gdf      # GeoDataFrame — cell polygons, cell type, size scaling
sim.gpar_df       # DataFrame    — gene parameters (μ, θ, gene_type)
sim.count_array   # ndarray      — (n_cells × n_genes) count matrix
sim.trs_df        # DataFrame    — transcript coordinates + metadata

# Save
sim.save_simple("output/")           # Parquet, NPY, CSV
sim.save_spatialdata("output.zarr")  # SpatialData/Zarr (requires spatialdata)
sim.save_xenium("output_xenium/")    # 10x Xenium-compatible Parquet format

See examples/example.py for a runnable end-to-end script.

Output summary

| Attribute | Type | Description |
|---|---|---|
| sim.cell_gdf | GeoDataFrame | Cell and nucleus polygons, cell type, centroid, area, scale factor |
| sim.gpar_df | DataFrame | Per-gene μ, θ, gene type (marker / housekeeping / silent), gene leakage |
| sim.count_array | ndarray | Count matrix (cells × genes) |
| sim.trs_df | DataFrame | Transcript x/y/z location, cell/gene assignment, leakage and nucleus overlap flags |
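
When benchmarking a transcript-assignment method, it is often useful to rebuild a cells × genes matrix from transcript-level records and to see how much signal leakage removes. The sketch below does this on stand-in arrays rather than on real simulator output. The names `cell_idx`, `gene_idx`, and `leaked` are hypothetical; the actual column names in `sim.trs_df` are documented in REFERENCE.md and may differ.

```python
import numpy as np

def counts_from_transcripts(cell_idx, gene_idx, n_cells, n_genes):
    """Accumulate one count per transcript into a (cells x genes) matrix."""
    counts = np.zeros((n_cells, n_genes), dtype=np.int64)
    # np.add.at handles repeated (cell, gene) pairs, unlike fancy-index "+=".
    np.add.at(counts, (cell_idx, gene_idx), 1)
    return counts

# Stand-in transcript records: six transcripts, three cells, three genes.
cell_idx = np.array([0, 0, 1, 2, 2, 2])   # hypothetical source-cell index
gene_idx = np.array([0, 1, 1, 0, 0, 2])   # hypothetical gene index
leaked = np.array([False, False, True, False, True, False])  # hypothetical flag

all_counts = counts_from_transcripts(cell_idx, gene_idx, 3, 3)
in_cell_counts = counts_from_transcripts(
    cell_idx[~leaked], gene_idx[~leaked], 3, 3
)
leak_fraction = leaked.mean()
```

Comparing `all_counts` with `in_cell_counts` shows which cell and gene entries lose transcripts to leakage in this toy data.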

For full parameter and API reference see REFERENCE.md.
