Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
fb063f6
feat: initial implementation
noamteyssier Feb 24, 2026
91514ea
feat: introduce testing and fix bug in unique
noamteyssier Feb 24, 2026
9c694be
chore: remove unused deps
noamteyssier Feb 24, 2026
759048b
tests: clean up typing
noamteyssier Feb 24, 2026
720a774
feat: added 1vrest
noamteyssier Feb 24, 2026
13f92bb
feat: clean up implementation
noamteyssier Feb 24, 2026
966b946
feat: wrap fc and pcc in numba
noamteyssier Feb 24, 2026
bf92d17
fix: edge cases in threads
noamteyssier Feb 24, 2026
bb2db2b
feat: reuse ntc sparse column indices
noamteyssier Feb 24, 2026
52127e7
refactor: clean up syntax
noamteyssier Feb 24, 2026
9876fde
fix: correctly set numba threads
noamteyssier Feb 24, 2026
ad6b9eb
style: fmt
noamteyssier Feb 24, 2026
8e315ca
feat: update dep to github
noamteyssier Feb 24, 2026
239bdd6
refactor: rename ntc to ref
noamteyssier Feb 25, 2026
1c50c7e
fix: test in numba threads
noamteyssier Feb 25, 2026
ecd4735
chore: added claude md
noamteyssier Feb 25, 2026
8fadea3
feat: added on-target effect
noamteyssier Feb 25, 2026
9011059
refactor: skip missing on-target genes without erroring
noamteyssier Feb 25, 2026
cf5f3c4
feat: added logging
noamteyssier Feb 25, 2026
f6ed57f
perf: improve group gene map
noamteyssier Feb 25, 2026
dec265f
fix: include feature name in output
noamteyssier Feb 25, 2026
6638588
refactor: rename group_ to target_
noamteyssier Feb 25, 2026
f3ce79c
perf: improve performance of building group_gene_map
noamteyssier Feb 25, 2026
48ba8a3
feat: added correct mean and log1p conditioning with tests
noamteyssier Feb 25, 2026
a481779
docs: added docstring on usage of pdex
noamteyssier Feb 25, 2026
9ede4bf
feat: improve API with configurable log1p setting and explicitly defi…
noamteyssier Feb 25, 2026
94dc498
perf: improve performance on sparse matrix path for pseudobulk
noamteyssier Feb 25, 2026
9e52ca4
docs: update md
noamteyssier Feb 25, 2026
1577950
style: fmt
noamteyssier Feb 25, 2026
1a069e9
refactor: replace pdex implementation with pdex2 rewrite. Merges the …
noamteyssier Feb 25, 2026
3002d0d
dep: update all dependencies and python version
noamteyssier Feb 25, 2026
ec1c961
chore: remove ruff config
noamteyssier Feb 25, 2026
33e8cb5
docs: update new repo with claude.md
noamteyssier Feb 25, 2026
e8944c0
docs: update pdex documentation for 0.2.0
noamteyssier Feb 25, 2026
c7232b8
ci: ty checking + matrix on python version
noamteyssier Feb 25, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 8 additions & 5 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,9 @@ jobs:

install-job:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.13", "3.12", "3.11"]

steps:
- uses: actions/checkout@v4
Expand All @@ -21,7 +24,7 @@ jobs:
with:
enable-cache: true
cache-dependency-glob: "pyproject.toml"
python-version: "3.12"
python-version: ${{ matrix.python-version }}

- name: install dependencies
run: |
Expand All @@ -40,7 +43,7 @@ jobs:
with:
enable-cache: true
cache-dependency-glob: "pyproject.toml"
python-version: "3.12"
python-version: "3.13"

- name: install dependencies
run: |
Expand All @@ -63,15 +66,15 @@ jobs:
with:
enable-cache: true
cache-dependency-glob: "pyproject.toml"
python-version: "3.12"
python-version: "3.13"

- name: install dependencies
run: |
uv sync --all-extras --dev

- name: run type checking
run: |
uv run pyright
uv run ty check

pytest:
runs-on: ubuntu-latest
Expand All @@ -86,7 +89,7 @@ jobs:
with:
enable-cache: true
cache-dependency-glob: "pyproject.toml"
python-version: "3.12"
python-version: "3.13"

- name: install dependencies
run: |
Expand Down
2 changes: 1 addition & 1 deletion .python-version
Original file line number Diff line number Diff line change
@@ -1 +1 @@
3.10
3.13
99 changes: 99 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

**Important:** This file must be kept up to date with the codebase. Any time the public API, output schema, modes, parameters, or architecture changes, update the relevant sections here before closing the task.

## Project Overview

`pdex` is a Python library for Parallel Differential Expression (PDEX) analysis in single-cell genomics, focused on conditional screens.
It computes per-gene statistics comparing perturbation groups against a reference using Mann-Whitney U tests with FDR correction.
It also provides functionality for per-gene statistics on 1-vs-rest comparisons and on-target single-gene comparisons.

## Commands

```bash
# Install / sync dependencies
uv sync

# Run all tests
uv run pytest -v

# Run a specific test file
uv run pytest tests/test_pdex.py

# Run a single test by name
uv run pytest tests/test_pdex.py::TestPdexRefMode::test_columns

# Lint and format
uv run ruff format

# Type check
uv run ty check
```

## Architecture

### Core Pipeline (`src/pdex/__init__.py`)

The main entry point is `pdex(adata, groupby, mode, threads, is_log1p, geometric_mean, as_pandas, **kwargs)`, which:

1. Validates the `groupby` column in `adata.obs`
2. Extracts unique groups (filters NaN and empty strings)
3. Identifies a reference group (defaults to `"non-targeting"` in `"ref"` and `"on_target"` modes)
4. For each non-reference group, slices the expression matrix, computes pseudobulk (mean), fold change, percent change, and Mann-Whitney U statistic vs the reference
5. Applies per-group FDR correction (scipy) and returns a Polars DataFrame (or pandas if `as_pandas=True`)

Three modes:

- `"ref"`: each non-reference group vs a single reference group (reference group is excluded from output)
- `"all"`: each group vs all remaining cells (1-vs-rest)
- `"on_target"`: each non-reference group vs the reference, but only at the single gene targeted by that group (requires `gene_col=` kwarg)

Unexpected `**kwargs` for any mode trigger a `UserWarning`.

### Key Files

| File | Role |
| ---------------------- | ------------------------------------------------------------------------------------------------------- |
| `src/pdex/__init__.py` | `pdex()` entry point and full pipeline logic |
| `src/pdex/_math.py` | Numba JIT-compiled `fold_change()`, `percent_change()`, and `mwu()` wrappers; `pseudobulk()` dispatcher |
| `src/pdex/_utils.py` | `set_numba_threadpool()` — sets Numba thread count before JIT warmup; `_detect_is_log1p()` heuristic |

### Performance Design

- Numba JIT compilation accelerates per-cell/per-gene math (`fold_change`, `percent_change`, `_log1p_col_mean`, `_expm1_vec`)
- `numba-mwu` (external dep) provides a Numba-accelerated Mann-Whitney U implementation
- Sparse CSR matrices are handled by reusing pre-computed non-targeting column indices to avoid redundant dense conversion
- Parallelism is controlled via `threads` passed to `set_numba_threadpool()`

### Output Schema

The returned Polars DataFrame (or pandas DataFrame when `as_pandas=True`) has columns:

| Column | Type | Description |
| ------------------- | ----- | --------------------------------------------------------------------- |
| `target` | str | Perturbation group name |
| `feature` | str | Gene name |
| `target_mean` | float | Pseudobulk mean for the target group, always in natural (count) space |
| `ref_mean` | float | Pseudobulk mean for the reference, always in natural (count) space |
| `target_membership` | int | Number of cells in the target group |
| `ref_membership` | int | Number of cells in the reference |
| `fold_change` | float | log2(target_mean / ref_mean) — computed from pseudobulk means |
| `percent_change` | float | (target_mean - ref_mean) / ref_mean — computed from pseudobulk means |
| `p_value` | float | Mann-Whitney U p-value (per-cell vectors) |
| `statistic` | float | Mann-Whitney U statistic |
| `fdr` | float | FDR-corrected p-value, applied per-group across genes |

`target_mean` and `ref_mean` are always in natural (count) space regardless of `is_log1p` or `geometric_mean`.
FDR is corrected within each group (across genes), not globally across all (group, gene) pairs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This statement about FDR correction is accurate for ref and all modes, but not for on_target mode, where correction happens across groups. To avoid confusion, it would be helpful to clarify how FDR is handled in each case.

Suggested change
FDR is corrected within each group (across genes), not globally across all (group, gene) pairs.
FDR is corrected within each group (across genes) for `"ref"` and `"all"` modes, not globally across all (group, gene) pairs. For `"on_target"` mode, FDR is corrected across all tested groups.


### Public API (`__all__`)

```python
from pdex import pdex, DEFAULT_REFERENCE
```

## Dependencies

Managed with `uv`. Build backend: `hatchling`. Key packages: `anndata`, `numba`, `numba-mwu`, `polars`, `pyarrow`, `scipy`, `tqdm`. Dev tools: `pytest`, `ruff`, `ty`.
180 changes: 77 additions & 103 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,128 +1,102 @@
# pdex

parallel differential expression for single-cell perturbation sequencing
Parallel differential expression for single-cell perturbation sequencing.

## Installation

Add to your `pyproject.toml` file with [`uv`](https://github.com/astral-sh/uv)

```bash
# add to pyproject.toml
uv add pdex
```

## Summary
# add to env
uv pip install pdex
```

This is a python package for performing parallel differential expression between multiple groups and a control.
## Overview

It is optimized for very large datasets and very large numbers of perturbations.
`pdex` computes per-gene differential expression statistics between perturbation groups in single-cell data using Mann-Whitney U tests with FDR correction. It was originally designed for CRISPR screen and perturbation sequencing datasets with many groups and large cell counts.

It makes use of shared memory to parallelize the computation to a high number of threads and minimizes the [IPC](https://en.wikipedia.org/wiki/Inter-process_communication) between processes to reduce overhead.
It supports dense and sparse (CSR) expression matrices, and uses [numba-mwu](https://github.com/noamteyssier/numba-mwu) for Numba-accelerated Mann-Whitney U computation.

It supports the following metrics:
## Modes

- Wilcoxon Rank Sum
- Anderson-Darling
- T-Test
| Mode | Description |
| ------------- | ------------------------------------------------------------------- |
| `"ref"` | Each group vs a single reference group (default: `"non-targeting"`) |
| `"all"` | Each group vs all remaining cells (1-vs-rest) |
| `"on_target"` | Each group vs the reference at its single target gene only |

## Backed vs In-Memory AnnData
## Usage

pdex adapts its execution strategy based on how the AnnData object is stored:
### Reference mode (default)

- **In-memory AnnData** (e.g., loaded via `sc.read_h5ad(path)` without `backed="r"`):
pdex uses a shared-memory multiprocessing workflow. Each worker process gets
access to the full expression matrix through shared memory, which minimizes
serialization overhead. Parallelism is configured via the `num_workers`
parameter (process count). `num_threads` is ignored in this mode because numba
kernels operate on per-target slices entirely in memory.
- **Backed AnnData** (`adata.X` is an on-disk HDF5 dataset): pdex automatically
switches to the low-memory chunked implementation. Gene chunks are streamed
from disk, the reference group is computed once per chunk, and targets are
processed in parallel via `num_workers` (thread pool). Within each target,
Wilcoxon metrics can additionally use numba parallelization controlled by
`num_threads`. This mode avoids loading the entire matrix into RAM while still
enabling both target-level and gene-level parallelism.
```python
import anndata as ad
from pdex import pdex

If a backed dataset is supplied without enabling low-memory mode, pdex raises a
helpful error explaining that chunked processing is required. Conversely, you can
force the chunked path for large in-memory matrices by passing `low_memory=True`.
adata = ad.read_h5ad("screen.h5ad")

## Parallelization
results = pdex(
adata,
groupby="guide",
mode="ref",
is_log1p=False,
)
```

`parallel_differential_expression` exposes two orthogonal knobs for controlling
parallel execution:
### 1-vs-rest mode

- `num_workers` controls the number of Python threads that process targets within
each gene chunk. `None` (default in low-memory mode) enables an auto-detected
worker count based on available CPUs, while `1` disables thread-level parallelism.
- `num_threads` controls the numba thread pool used by the Wilcoxon kernel. `None`
lets numba auto-detect the optimal size, whereas `1` turns numba parallelization
off. This setting is only used in low-memory mode and only when `metric="wilcoxon"`.
When pdex detects non-integer expression values in a gene chunk (for example, after
log-normalization), it automatically disables numba for that chunk, logs a warning,
and falls back to the SciPy implementation to preserve correct rank ordering.
```python
results = pdex(
adata,
groupby="guide",
mode="all",
is_log1p=False,
)
```

These strategies can be combined: for example, `num_workers=2, num_threads=8`
runs two target threads that share an eight-thread numba pool. When the metric
does not support numba acceleration, pdex automatically logs a warning and
falls back to thread-only execution.
### On-target mode

## Usage
Requires a column in `adata.obs` mapping each group to its target gene:

```python
import anndata as ad
import numpy as np
import pandas as pd

from pdex import parallel_differential_expression

PERT_COL = "perturbation"
CONTROL_VAR = "control"

N_CELLS = 1000
N_GENES = 100
N_PERTS = 10
MAX_UMI = 1e6


def build_random_anndata(
n_cells: int = N_CELLS,
n_genes: int = N_GENES,
n_perts: int = N_PERTS,
pert_col: str = PERT_COL,
control_var: str = CONTROL_VAR,
) -> ad.AnnData:
"""Sample a random AnnData object."""
return ad.AnnData(
X=np.random.randint(0, MAX_UMI, size=(n_cells, n_genes)),
obs=pd.DataFrame(
{
pert_col: np.random.choice(
[f"pert_{i}" for i in range(n_perts)] + [control_var],
size=n_cells,
replace=True,
),
}
),
)


def main():
adata = build_random_anndata()

# Run pdex with default metric (wilcoxon)
results = parallel_differential_expression(
adata,
reference=CONTROL_VAR,
groupby_key=PERT_COL,
)
assert results.shape[0] == N_GENES * N_PERTS

# Run pdex with alt metric (anderson)
results = parallel_differential_expression(
adata,
reference=CONTROL_VAR,
groupby_key=PERT_COL,
metric="anderson"
)
assert results.shape[0] == N_GENES * N_PERTS
results = pdex(
adata,
groupby="guide",
mode="on_target",
gene_col="target_gene",
is_log1p=False,
)
```

## Parameters

| Parameter | Type | Default | Description |
| ---------------- | -------------- | ----------------- | ---------------------------------------------------------- |
| `adata` | `AnnData` | required | Annotated data matrix (dense or sparse CSR) |
| `groupby` | `str` | required | Column in `adata.obs` defining groups |
| `mode` | `str` | `"ref"` | Comparison mode: `"ref"`, `"all"`, or `"on_target"` |
| `threads` | `int` | `0` | Numba thread count (`0` = all CPUs) |
| `is_log1p` | `bool \| None` | `None` | Whether data is log1p-transformed. Auto-detected if `None` |
| `geometric_mean` | `bool` | `True` | Use geometric mean for pseudobulk (vs arithmetic) |
| `as_pandas` | `bool` | `False` | Return a pandas DataFrame instead of Polars |
| `reference` | `str` | `"non-targeting"` | Reference group name (modes: `ref`, `on_target`) |
| `gene_col` | `str` | — | Column mapping groups to target genes (mode: `on_target`) |

## Output

Returns a Polars DataFrame (or pandas if `as_pandas=True`) with one row per (group, gene) pair:

| Column | Description |
| ------------------- | -------------------------------------------------- |
| `target` | Perturbation group name |
| `feature` | Gene name |
| `target_mean` | Pseudobulk mean for the target group (count space) |
| `ref_mean` | Pseudobulk mean for the reference (count space) |
| `target_membership` | Number of cells in the target group |
| `ref_membership` | Number of cells in the reference |
| `fold_change` | log2(target_mean / ref_mean) |
| `percent_change` | (target_mean - ref_mean) / ref_mean |
| `p_value` | Mann-Whitney U p-value |
| `statistic` | Mann-Whitney U statistic |
| `fdr` | FDR-corrected p-value (per-group, across genes) |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The description for FDR correction is accurate for ref and all modes, but can be misleading for on_target mode. In on_target mode, FDR is corrected across groups, not per-group across genes (since there's only one gene per group). It would be clearer to specify this distinction.

Suggested change
| `fdr` | FDR-corrected p-value (per-group, across genes) |
| `fdr` | FDR-corrected p-value. For `ref` and `all` modes, this is applied per-group across genes. For `on_target` mode, it is applied across groups. |

Loading