Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4038 +/- ##
==========================================
+ Coverage 79.85% 79.89% +0.03%
==========================================
Files 121 121
Lines 12924 12938 +14
==========================================
+ Hits 10321 10337 +16
+ Misses 2603 2601 -2
Flags with carried forward coverage won't be shown. Click here to find out more.
|
|
Hey, good catch for the OVO column ordering. I have no idea where the bug comes from, it seems to be related to group labels for PBMC that |
|
So it seems to be due to np.argsort and np.sort not sorting the weird characters (" ", "+", "/", "#", etc) of the bulk_labels the same way. I will ship a fix sometime today or this weekend. NB: I spent a bit of time looking into the PBMC dataset and it seems rather unusual: a lot of values are negative leading to non-defined logfoldchanges, on top of which there seems to be very little value diversity in the dataset. As a result, a lot of genes end up with identical ranksum, but because illico and scanpy do not compute z-score the exact same way (mathematically equivalent but programmatically different), gene ordering is impacted. The gene ordering impacts the position of NaN values in the results, but non-NaN values do match because they are very similar. I am not sure as if this dataset is a good test case. NB2: details of the programmatic differences: U = ranksum - n_tgt * (n_tgt+ 1)/2
z = U - n_ref * n_tgt / 2scanpy does: z =
ranksum - n_tgt * (n_tgt + 1 + n_ref) / 2Those are mathematically equivalent but result in differences of the order of 1.e-9 approximately, changing gene orders when ranksums are equal. |
Interesting, good observation but glad we are aware of this now. This could be another target for scanpy 2.0. I will try a different dataset but glad to have this documented. |
|
Forget my previous message which was actually kind of off topic. This PBMC dataset was actually a very good test case.
Example: let's take NB: what is the situation with the Dask issue ? Current version is not dask-compatible. |
That should come next, I want to keep this scoped for now to the in-memory stuff. Another thing if you're doing a batch of fixes this weekend - supporting Also for your |
I'm seeing some instances of https://github.com/satijalab/seurat/blob/main/R/differential_expression.R#L1089 i.e., they calculate the mean expression with this "+1" offset so don't need the correction. So we have three different things here, I guess. I think just giving the option to match scanpy (since that seems a project goal) is good, and then we can maybe revisit what seurat does as a new parameter for scanpy 2.0.
For this, I would be open to also making this part of a scanpy 2.0 set of default i.e., using Overall, my ideal scenario would be that Short of exact matches, like I said we can just document changes. It sounds like the biggest things that would close the gap are:
I think we can patch the Overall, this is really cool and I'm super happy that things seem to be pretty close! |
|
Hey, I just released version 0.5.0rc1 which should fix the issues identified on this test case:
Regarding the numerical precision question: so far I force the Testing locally, it seems everything now runs smoothly for PMBC. Let me know how it goes with the rest of the CI. |
|
@remydubois I think the issue is that you are sorting / cutting off by In any case, I think I got a little lost-in-the-sauce. I think matching p-values and scores should be enough because scanpy can internally handle LFC when it wraps I'm pushing the results of With this in mind, I am going to push ahead with integrating BTW: https://remydubois.github.io/illico/api.html looks both incomplete and like it's leaking internals. |
|
Ok everything passing locally with just getting P-value and z-score from I assume the LFC calculation (i.e., actually calculating the means) is the less computationally intensive part? |
Ok ! That's good news to me. Because this was bugging me out a little bit: I checked out on your def test_illico(test, corr_method, exp_post_agg):
from illico.asymptotic_wilcoxon import asymptotic_wilcoxon
pbmc = pbmc68k_reduced()
pbmc = ad.AnnData(pbmc.X.copy(), obs=pbmc.obs.copy(), var=pbmc.var.copy())
# ... Rest of the test. They all pass on my machineThen, all tests passed (LFC, pvals, scores, pvals_adj, up to By the way, wouldn't it be safer for
Very happy to read it's moving forward 🚀
Yes it's neglectable, I could not measure its impact on the overall runtime. |
|
@remydubois Everything seems to pass with |
|
@ilan-gold good news !
That makes sense. Does it mean any specifics for illico releases (i.e: should I release/not release some features in the v0.6.0) ? |
I think you should be pretty good releasing what you want - we can always catch up here e.g., when we do a proper assessment of dask. |
| dynamic = [ "version" ] | ||
| dependencies = [ | ||
| "anndata>=0.10.8", | ||
| "anndata>=0.11", |
There was a problem hiding this comment.
which feature does this PR need?
There was a problem hiding this comment.
This comes from illico but it seemed as good a time as any to require this. I will split this out into a separate PR though, fair point, there are probably version checks floating around our code base
There was a problem hiding this comment.
There is no reason for illico to require 0.11. I can lower the lower bound to 0.10.8 in the next release (v0.6.0).
There was a problem hiding this comment.
No worries, really, We should upgrade to 0.11 anyway for the next scanpy minor release. I'd rather keep the community moving on this front.
Co-authored-by: Philipp A. <flying-sheep@web.de>
|
@flying-sheep Re-requesting your review here because I want two things:
So to that end (see previous commit), if we are in preview-mode, I set it manually inside |
| method = settings.preset.rank_genes_groups.method | ||
| if settings.preset is Preset.ScanpyV2Preview: | ||
| method = "wilcoxon_illico" |
There was a problem hiding this comment.
Sorry for the tone of this comment, I’m not intending it to be as sassy as it reads lol
that makes no sense.
This configuration exists exactly for line 720 do do what you’re now doing in the two lines after.
scanpy/src/scanpy/_settings/presets.py
Lines 241 to 243 in ecec3c2
So that configuration is now a lie and ignored because you hardcode the default here instead, why?
Re-requesting your review here
if you press the button, I’ll actually see that!
There was a problem hiding this comment.
Apologies, just forgot to push the button after the comment :)
My thinking goes that I do not want people using wilcoxon_illico parameter and therefore don’t want it hardcoded as the scanpy 2.0 default in the preset.
I would start a deprecation cycle for wilcoxon_illico once 2.0 gets closer.
But I think this might just be too complicated and not worth the slightly-cleaner preset appearance.
We could just make the preset itself wilcoxon_illico and then deprecate people passing in willcoxon_illico manually, instead pointing the preset.
Sorry for the confusion :/ This was sort of a bad middle ground between having wilcoxon_illico with no plan for it, and not having it at all, instead relying purely on the preset.
There was a problem hiding this comment.
Oof, I must have failed to read half of your comment above, sorry! I get it now!
But wouldn’t it be easier to just not have "wilcoxon_ilico" at all then? Just check for if method == "wilcoxon" and settings.preset == Preset.ScanpyV2Preview?
TODOs:
See: https://github.com/scverse/scanpy/actions/runs/24088645078/job/70268566419?pr=4038
ovo"column" in scores (cc @remydubois)illico(see thefilterwarningson the new test)ovrresults are so far offexp_post_aggfeature feat: allow exponentiation post agg for log-fold-change inrank_genes_groups#4037) toillico, we will add theillicobackend (or all changes have been documented)illico#4012