feat: experimental vectorized and numba parallelized implementation by drbh · Pull Request #44 · ArcInstitute/pdex

drbh · 2025-08-06T23:49:48Z

This PR contains an experimental implementation of parallel_differential_expression that uses NumPy vectorization, numba.prange, and @njit to squeeze more performance out of the CPU. In empirical testing, this sped up some operations by an order of magnitude.
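To illustrate the general technique (not the code in this PR), here is a minimal sketch of a per-gene Mann-Whitney U statistic where the loop over genes is parallelized with numba.prange. The function name `u_statistics`, the dense `(cells, genes)` array layout, and the fallback shim for environments without numba are all assumptions made for the example.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption for this sketch: fall back to plain Python if numba is absent.
    prange = range

    def njit(**_options):
        return lambda fn: fn


@njit(parallel=True)
def u_statistics(target, reference):
    """U statistic of target vs reference cells, one value per gene (column)."""
    n_genes = target.shape[1]
    out = np.empty(n_genes, dtype=np.float64)
    for g in prange(n_genes):  # genes are independent, so iterations can run in parallel
        t = target[:, g]
        r = np.sort(reference[:, g])
        u = 0.0
        for i in range(t.shape[0]):
            # Reference values strictly below t[i] count 1, ties count 0.5.
            lo = np.searchsorted(r, t[i], side="left")
            hi = np.searchsorted(r, t[i], side="right")
            u += lo + 0.5 * (hi - lo)
        out[g] = u
    return out


target = np.array([[1.0, 5.0], [3.0, 5.0]])
reference = np.array([[2.0, 5.0], [4.0, 5.0]])
print(u_statistics(target, reference))
```

Sorting the reference column once and using binary search keeps the per-gene cost at roughly O((n + m) log m) instead of comparing every pair of cells.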

The changes include a USE_EXPERIMENTAL environment variable that enables opt-in usage and transparently replaces parallel_differential_expression, and a new bench_expr.py that compares the reference implementation with the experimental one.
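An environment-variable opt-in of this kind could be wired up roughly as follows. This is a sketch under assumptions: the helper names `_reference_impl`, `_experimental_impl`, and `_use_experimental` are hypothetical, and the set of values that count as "enabled" is a guess; the PR does not state which values pdex accepts or whether the variable is read at import time or at call time.

```python
import os


def _reference_impl(data):
    # Stand-in for the existing batch-based implementation (hypothetical).
    return ("reference", data)


def _experimental_impl(data):
    # Stand-in for the vectorized/numba implementation (hypothetical).
    return ("experimental", data)


def _use_experimental() -> bool:
    # Assumption: any of these values turns the experimental path on.
    value = os.environ.get("USE_EXPERIMENTAL", "")
    return value.strip().lower() in {"1", "true", "yes"}


def parallel_differential_expression(data):
    # Callers keep using the same public name; the env var picks the backend.
    impl = _experimental_impl if _use_experimental() else _reference_impl
    return impl(data)


os.environ["USE_EXPERIMENTAL"] = "1"
print(parallel_differential_expression([1, 2, 3])[0])
```

Reading the variable at call time, as done here, lets a benchmark toggle between the two paths within one process; reading it once at import time would be the simpler alternative.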

Running benches

uv run python -m pytest tests/bench_expr.py

Current limitations: only the wilcoxon metric is implemented in parallel_differential_expression_vec.

More realistic workload

In a slightly bigger example this reduces the compute time for a dataset of 100,000 cells, 18,080 genes and 150 perturbations from ~5 mins to ~25 seconds on my MacBook M3.

(The reference run uses num_workers=16 and batch_size=100.)

uv run compare.py
============================================================
Benchmarking with 100000 cells, 18080 genes, 150 perturbations
============================================================

1. Reference implementation (batch processing):
INFO:pdex._single_cell:Precomputing masks for each target gene
Identifying target masks: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 151/151 [00:00<00:00, 451.40it/s]
INFO:pdex._single_cell:Precomputing variable indices for each feature
Identifying variable indices: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18080/18080 [00:00<00:00, 7455807.33it/s]
INFO:pdex._single_cell:Creating shared memory memory matrix for parallel computing
INFO:pdex._single_cell:Creating generator of all combinations: N=2730080
INFO:pdex._single_cell:Creating generator of all batches: N=27301
INFO:pdex._single_cell:Initializing parallel processing pool
INFO:pdex._single_cell:Processing batches
Processing batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27301/27301 [04:50<00:00, 94.10it/s]
INFO:pdex._single_cell:Flattening results
INFO:pdex._single_cell:Closing shared memory pool
   Time: 299.028 seconds

2. Vectorized implementation:
INFO:pdex._single_cell:vectorized processing: 151 targets, 18080 genes
INFO:pdex._single_cell:Processing 150 targets
Processing targets: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [00:19<00:00,  7.59it/s]
   Time: 25.581 seconds
   Speedup: 11.7x

============================================================
Correctness Verification:
============================================================
✅ vec: Column 'target_mean' values match within 1e-06 tolerance
✅ vec: Column 'reference_mean' values match within 1e-06 tolerance
✅ vec: Column 'percent_change' values match within 0.01 tolerance
✅ vec: Column 'fold_change' values match within 1e-06 tolerance
✅ vec: Results match reference

============================================================
Performance Summary:
============================================================
Implementation                 Time (s)     Speedup
----------------------------------------------------
reference                      299.028      1.0       x
vec                            25.581       11.7      x
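The correctness verification in the output above compares result columns within a per-column tolerance. A check of that shape could look like the sketch below. The column names and tolerance values are taken from the log; treating the tolerance as absolute, and the helper name `columns_match`, are assumptions, since compare.py is not shown here.

```python
import numpy as np

# Column names and tolerances as they appear in the log output.
TOLERANCES = {
    "target_mean": 1e-6,
    "reference_mean": 1e-6,
    "percent_change": 1e-2,
    "fold_change": 1e-6,
}


def columns_match(ref, new, tolerances=TOLERANCES):
    """Return {column: bool} for whether two result tables agree per column."""
    report = {}
    for col, atol in tolerances.items():
        # Assumption: absolute tolerance only; NaNs in the same position count as equal.
        report[col] = bool(
            np.allclose(ref[col], new[col], rtol=0.0, atol=atol, equal_nan=True)
        )
    return report


ref = {c: np.array([1.0, 2.0, np.nan]) for c in TOLERANCES}
new = {c: np.array([1.0, 2.0 + 5e-7, np.nan]) for c in TOLERANCES}
report = columns_match(ref, new)
print(report, all(report.values()))
```

The same function works on pandas DataFrames, since indexing by column name returns array-like values in both cases.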

noamteyssier · 2025-08-08T05:03:58Z

This is awesome, thanks @drbh !

I’ll do some testing and try to get this merged asap.

noamteyssier · 2025-08-15T23:14:45Z

thanks for the PR @drbh !

I'm going to test this more in a few different contexts and will eventually just make this the stable execution path for wilcoxon.

cheers!

drbh added 12 commits August 6, 2025 12:24

feat: prefer vectorized ops

380eb3e

fix: update readme

adb5c92

fix: adjust debug logs and output on readme

1baa608

feat: prefer numba operations

d6bbb27

feat: prefer numba always

f57f334

fix: avoid dev changes

f136324

fix: add missing newline

21acd9f

fix: place behind env

3c81d63

fix: improve benches

df173d4

fix: small refactor cleanups

99bfec5

fix: adjust dev dependencies

9a94766

fix:improve precision and correctness

f838aff

noamteyssier added 2 commits August 15, 2025 16:11

fix(typing): quiet pyright

1634366

chore(semver): bump

0dff9f6

noamteyssier merged commit eef6f3c into ArcInstitute:main Aug 15, 2025

noamteyssier merged 14 commits into ArcInstitute:main from drbh:main
