-
Notifications
You must be signed in to change notification settings - Fork 11
Integrate pdex2 #68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate pdex2 #68
Changes from all commits
fb063f6
91514ea
9c694be
759048b
720a774
13f92bb
966b946
bf92d17
bb2db2b
52127e7
9876fde
ad6b9eb
8e315ca
239bdd6
1c50c7e
ecd4735
8fadea3
9011059
cf5f3c4
f6ed57f
dec265f
6638588
f3ce79c
48ba8a3
a481779
9ede4bf
94dc498
9e52ca4
1577950
1a069e9
3002d0d
ec1c961
33e8cb5
e8944c0
c7232b8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1 @@ | ||
| 3.10 | ||
| 3.13 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| # CLAUDE.md | ||
|
|
||
| This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. | ||
|
|
||
| **Important:** This file must be kept up to date with the codebase. Any time the public API, output schema, modes, parameters, or architecture changes, update the relevant sections here before closing the task. | ||
|
|
||
| ## Project Overview | ||
|
|
||
| `pdex` is a Python library for Parallel Differential Expression (PDEX) analysis in single-cell genomics, focused on conditional screens. | ||
| It computes per-gene statistics comparing perturbation groups against a reference using Mann-Whitney U tests with FDR correction. | ||
| It also provides functionality for per-gene statistics on 1-vs-rest comparisons and on-target single-gene comparisons. | ||
|
|
||
| ## Commands | ||
|
|
||
| ```bash | ||
| # Install / sync dependencies | ||
| uv sync | ||
|
|
||
| # Run all tests | ||
| uv run pytest -v | ||
|
|
||
| # Run a specific test file | ||
| uv run pytest tests/test_pdex.py | ||
|
|
||
| # Run a single test by name | ||
| uv run pytest tests/test_pdex.py::TestPdexRefMode::test_columns | ||
|
|
||
| # Lint and format | ||
| uv run ruff format | ||
|
|
||
| # Type check | ||
| uv run ty check | ||
| ``` | ||
|
|
||
| ## Architecture | ||
|
|
||
| ### Core Pipeline (`src/pdex/__init__.py`) | ||
|
|
||
| The main entry point is `pdex(adata, groupby, mode, threads, is_log1p, geometric_mean, as_pandas, **kwargs)`, which: | ||
|
|
||
| 1. Validates the `groupby` column in `adata.obs` | ||
| 2. Extracts unique groups (filters NaN and empty strings) | ||
| 3. Identifies a reference group (defaults to `"non-targeting"` in `"ref"` and `"on_target"` modes) | ||
| 4. For each non-reference group, slices the expression matrix, computes pseudobulk (mean), fold change, percent change, and Mann-Whitney U statistic vs the reference | ||
| 5. Applies per-group FDR correction (scipy) and returns a Polars DataFrame (or pandas if `as_pandas=True`) | ||
|
|
||
| Three modes: | ||
|
|
||
| - `"ref"`: each non-reference group vs a single reference group (reference group is excluded from output) | ||
| - `"all"`: each group vs all remaining cells (1-vs-rest) | ||
| - `"on_target"`: each non-reference group vs the reference, but only at the single gene targeted by that group (requires `gene_col=` kwarg) | ||
|
|
||
| Unexpected `**kwargs` for any mode trigger a `UserWarning`. | ||
|
|
||
| ### Key Files | ||
|
|
||
| | File | Role | | ||
| | ---------------------- | ------------------------------------------------------------------------------------------------------- | | ||
| | `src/pdex/__init__.py` | `pdex()` entry point and full pipeline logic | | ||
| | `src/pdex/_math.py` | Numba JIT-compiled `fold_change()`, `percent_change()`, and `mwu()` wrappers; `pseudobulk()` dispatcher | | ||
| | `src/pdex/_utils.py` | `set_numba_threadpool()` — sets Numba thread count before JIT warmup; `_detect_is_log1p()` heuristic | | ||
|
|
||
| ### Performance Design | ||
|
|
||
| - Numba JIT compilation accelerates per-cell/per-gene math (`fold_change`, `percent_change`, `_log1p_col_mean`, `_expm1_vec`) | ||
| - `numba-mwu` (external dep) provides a Numba-accelerated Mann-Whitney U implementation | ||
| - Sparse CSR matrices are handled by reusing pre-computed non-targeting column indices to avoid redundant dense conversion | ||
| - Parallelism is controlled via `threads` passed to `set_numba_threadpool()` | ||
|
|
||
| ### Output Schema | ||
|
|
||
| The returned Polars DataFrame (or pandas DataFrame when `as_pandas=True`) has columns: | ||
|
|
||
| | Column | Type | Description | | ||
| | ------------------- | ----- | --------------------------------------------------------------------- | | ||
| | `target` | str | Perturbation group name | | ||
| | `feature` | str | Gene name | | ||
| | `target_mean` | float | Pseudobulk mean for the target group, always in natural (count) space | | ||
| | `ref_mean` | float | Pseudobulk mean for the reference, always in natural (count) space | | ||
| | `target_membership` | int | Number of cells in the target group | | ||
| | `ref_membership` | int | Number of cells in the reference | | ||
| | `fold_change` | float | log2(target_mean / ref_mean) — computed from pseudobulk means | | ||
| | `percent_change` | float | (target_mean - ref_mean) / ref_mean — computed from pseudobulk means | | ||
| | `p_value` | float | Mann-Whitney U p-value (per-cell vectors) | | ||
| | `statistic` | float | Mann-Whitney U statistic | | ||
| | `fdr` | float | FDR-corrected p-value, applied per-group across genes | | ||
|
|
||
| `target_mean` and `ref_mean` are always in natural (count) space regardless of `is_log1p` or `geometric_mean`. | ||
| FDR is corrected within each group (across genes), not globally across all (group, gene) pairs. | ||
|
|
||
| ### Public API (`__all__`) | ||
|
|
||
| ```python | ||
| from pdex import pdex, DEFAULT_REFERENCE | ||
| ``` | ||
|
|
||
| ## Dependencies | ||
|
|
||
| Managed with `uv`. Build backend: `hatchling`. Key packages: `anndata`, `numba`, `numba-mwu`, `polars`, `pyarrow`, `scipy`, `tqdm`. Dev tools: `pytest`, `ruff`, `ty`. | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -1,128 +1,102 @@ | ||||||
| # pdex | ||||||
|
|
||||||
| parallel differential expression for single-cell perturbation sequencing | ||||||
| Parallel differential expression for single-cell perturbation sequencing. | ||||||
|
|
||||||
| ## Installation | ||||||
|
|
||||||
| Add to your `pyproject.toml` file with [`uv`](https://github.com/astral-sh/uv) | ||||||
|
|
||||||
| ```bash | ||||||
| # add to pyproject.toml | ||||||
| uv add pdex | ||||||
| ``` | ||||||
|
|
||||||
| ## Summary | ||||||
| # add to env | ||||||
| uv pip install pdex | ||||||
| ``` | ||||||
|
|
||||||
| This is a python package for performing parallel differential expression between multiple groups and a control. | ||||||
| ## Overview | ||||||
|
|
||||||
| It is optimized for very large datasets and very large numbers of perturbations. | ||||||
| `pdex` computes per-gene differential expression statistics between perturbation groups in single-cell data using Mann-Whitney U tests with FDR correction. It was originally designed for CRISPR screen and perturbation sequencing datasets with many groups and large cell counts. | ||||||
|
|
||||||
| It makes use of shared memory to parallelize the computation to a high number of threads and minimizes the [IPC](https://en.wikipedia.org/wiki/Inter-process_communication) between processes to reduce overhead. | ||||||
| It supports dense and sparse (CSR) expression matrices, and uses [numba-mwu](https://github.com/noamteyssier/numba-mwu) for Numba-accelerated Mann-Whitney U computation. | ||||||
|
|
||||||
| It supports the following metrics: | ||||||
| ## Modes | ||||||
|
|
||||||
| - Wilcoxon Rank Sum | ||||||
| - Anderson-Darling | ||||||
| - T-Test | ||||||
| | Mode | Description | | ||||||
| | ------------- | ------------------------------------------------------------------- | | ||||||
| | `"ref"` | Each group vs a single reference group (default: `"non-targeting"`) | | ||||||
| | `"all"` | Each group vs all remaining cells (1-vs-rest) | | ||||||
| | `"on_target"` | Each group vs the reference at its single target gene only | | ||||||
|
|
||||||
| ## Backed vs In-Memory AnnData | ||||||
| ## Usage | ||||||
|
|
||||||
| pdex adapts its execution strategy based on how the AnnData object is stored: | ||||||
| ### Reference mode (default) | ||||||
|
|
||||||
| - **In-memory AnnData** (e.g., loaded via `sc.read_h5ad(path)` without `backed="r"`): | ||||||
| pdex uses a shared-memory multiprocessing workflow. Each worker process gets | ||||||
| access to the full expression matrix through shared memory, which minimizes | ||||||
| serialization overhead. Parallelism is configured via the `num_workers` | ||||||
| parameter (process count). `num_threads` is ignored in this mode because numba | ||||||
| kernels operate on per-target slices entirely in memory. | ||||||
| - **Backed AnnData** (`adata.X` is an on-disk HDF5 dataset): pdex automatically | ||||||
| switches to the low-memory chunked implementation. Gene chunks are streamed | ||||||
| from disk, the reference group is computed once per chunk, and targets are | ||||||
| processed in parallel via `num_workers` (thread pool). Within each target, | ||||||
| Wilcoxon metrics can additionally use numba parallelization controlled by | ||||||
| `num_threads`. This mode avoids loading the entire matrix into RAM while still | ||||||
| enabling both target-level and gene-level parallelism. | ||||||
| ```python | ||||||
| import anndata as ad | ||||||
| from pdex import pdex | ||||||
|
|
||||||
| If a backed dataset is supplied without enabling low-memory mode, pdex raises a | ||||||
| helpful error explaining that chunked processing is required. Conversely, you can | ||||||
| force the chunked path for large in-memory matrices by passing `low_memory=True`. | ||||||
| adata = ad.read_h5ad("screen.h5ad") | ||||||
|
|
||||||
| ## Parallelization | ||||||
| results = pdex( | ||||||
| adata, | ||||||
| groupby="guide", | ||||||
| mode="ref", | ||||||
| is_log1p=False, | ||||||
| ) | ||||||
| ``` | ||||||
|
|
||||||
| `parallel_differential_expression` exposes two orthogonal knobs for controlling | ||||||
| parallel execution: | ||||||
| ### 1-vs-rest mode | ||||||
|
|
||||||
| - `num_workers` controls the number of Python threads that process targets within | ||||||
| each gene chunk. `None` (default in low-memory mode) enables an auto-detected | ||||||
| worker count based on available CPUs, while `1` disables thread-level parallelism. | ||||||
| - `num_threads` controls the numba thread pool used by the Wilcoxon kernel. `None` | ||||||
| lets numba auto-detect the optimal size, whereas `1` turns numba parallelization | ||||||
| off. This setting is only used in low-memory mode and only when `metric="wilcoxon"`. | ||||||
| When pdex detects non-integer expression values in a gene chunk (for example, after | ||||||
| log-normalization), it automatically disables numba for that chunk, logs a warning, | ||||||
| and falls back to the SciPy implementation to preserve correct rank ordering. | ||||||
| ```python | ||||||
| results = pdex( | ||||||
| adata, | ||||||
| groupby="guide", | ||||||
| mode="all", | ||||||
| is_log1p=False, | ||||||
| ) | ||||||
| ``` | ||||||
|
|
||||||
| These strategies can be combined: for example, `num_workers=2, num_threads=8` | ||||||
| runs two target threads that share an eight-thread numba pool. When the metric | ||||||
| does not support numba acceleration, pdex automatically logs a warning and | ||||||
| falls back to thread-only execution. | ||||||
| ### On-target mode | ||||||
|
|
||||||
| ## Usage | ||||||
| Requires a column in `adata.obs` mapping each group to its target gene: | ||||||
|
|
||||||
| ```python | ||||||
| import anndata as ad | ||||||
| import numpy as np | ||||||
| import pandas as pd | ||||||
|
|
||||||
| from pdex import parallel_differential_expression | ||||||
|
|
||||||
| PERT_COL = "perturbation" | ||||||
| CONTROL_VAR = "control" | ||||||
|
|
||||||
| N_CELLS = 1000 | ||||||
| N_GENES = 100 | ||||||
| N_PERTS = 10 | ||||||
| MAX_UMI = 1e6 | ||||||
|
|
||||||
|
|
||||||
| def build_random_anndata( | ||||||
| n_cells: int = N_CELLS, | ||||||
| n_genes: int = N_GENES, | ||||||
| n_perts: int = N_PERTS, | ||||||
| pert_col: str = PERT_COL, | ||||||
| control_var: str = CONTROL_VAR, | ||||||
| ) -> ad.AnnData: | ||||||
| """Sample a random AnnData object.""" | ||||||
| return ad.AnnData( | ||||||
| X=np.random.randint(0, MAX_UMI, size=(n_cells, n_genes)), | ||||||
| obs=pd.DataFrame( | ||||||
| { | ||||||
| pert_col: np.random.choice( | ||||||
| [f"pert_{i}" for i in range(n_perts)] + [control_var], | ||||||
| size=n_cells, | ||||||
| replace=True, | ||||||
| ), | ||||||
| } | ||||||
| ), | ||||||
| ) | ||||||
|
|
||||||
|
|
||||||
| def main(): | ||||||
| adata = build_random_anndata() | ||||||
|
|
||||||
| # Run pdex with default metric (wilcoxon) | ||||||
| results = parallel_differential_expression( | ||||||
| adata, | ||||||
| reference=CONTROL_VAR, | ||||||
| groupby_key=PERT_COL, | ||||||
| ) | ||||||
| assert results.shape[0] == N_GENES * N_PERTS | ||||||
|
|
||||||
| # Run pdex with alt metric (anderson) | ||||||
| results = parallel_differential_expression( | ||||||
| adata, | ||||||
| reference=CONTROL_VAR, | ||||||
| groupby_key=PERT_COL, | ||||||
| metric="anderson" | ||||||
| ) | ||||||
| assert results.shape[0] == N_GENES * N_PERTS | ||||||
| results = pdex( | ||||||
| adata, | ||||||
| groupby="guide", | ||||||
| mode="on_target", | ||||||
| gene_col="target_gene", | ||||||
| is_log1p=False, | ||||||
| ) | ||||||
| ``` | ||||||
|
|
||||||
| ## Parameters | ||||||
|
|
||||||
| | Parameter | Type | Default | Description | | ||||||
| | ---------------- | -------------- | ----------------- | ---------------------------------------------------------- | | ||||||
| | `adata` | `AnnData` | required | Annotated data matrix (dense or sparse CSR) | | ||||||
| | `groupby` | `str` | required | Column in `adata.obs` defining groups | | ||||||
| | `mode` | `str` | `"ref"` | Comparison mode: `"ref"`, `"all"`, or `"on_target"` | | ||||||
| | `threads` | `int` | `0` | Numba thread count (`0` = all CPUs) | | ||||||
| | `is_log1p` | `bool \| None` | `None` | Whether data is log1p-transformed. Auto-detected if `None` | | ||||||
| | `geometric_mean` | `bool` | `True` | Use geometric mean for pseudobulk (vs arithmetic) | | ||||||
| | `as_pandas` | `bool` | `False` | Return a pandas DataFrame instead of Polars | | ||||||
| | `reference` | `str` | `"non-targeting"` | Reference group name (modes: `ref`, `on_target`) | | ||||||
| | `gene_col` | `str` | — | Column mapping groups to target genes (mode: `on_target`) | | ||||||
|
|
||||||
| ## Output | ||||||
|
|
||||||
| Returns a Polars DataFrame (or pandas if `as_pandas=True`) with one row per (group, gene) pair: | ||||||
|
|
||||||
| | Column | Description | | ||||||
| | ------------------- | -------------------------------------------------- | | ||||||
| | `target` | Perturbation group name | | ||||||
| | `feature` | Gene name | | ||||||
| | `target_mean` | Pseudobulk mean for the target group (count space) | | ||||||
| | `ref_mean` | Pseudobulk mean for the reference (count space) | | ||||||
| | `target_membership` | Number of cells in the target group | | ||||||
| | `ref_membership` | Number of cells in the reference | | ||||||
| | `fold_change` | log2(target_mean / ref_mean) | | ||||||
| | `percent_change` | (target_mean - ref_mean) / ref_mean | | ||||||
| | `p_value` | Mann-Whitney U p-value | | ||||||
| | `statistic` | Mann-Whitney U statistic | | ||||||
| | `fdr` | FDR-corrected p-value (per-group, across genes) | | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The description for FDR correction is accurate for
Suggested change
|
||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This statement about FDR correction is accurate for
refandallmodes, but not foron_targetmode, where correction happens across groups. To avoid confusion, it would be helpful to clarify how FDR is handled in each case.