Hc/fix impossible schemes#43
Conversation
K-Meech
left a comment
There was a problem hiding this comment.
Thanks @HenryCrosswell - this is looking great! Your additions are filtering the impossible schemes really nicely, which will stop any incorrect data making it to the later processing stages.
I've put a few comments below - most are minor wording suggestions, but the main one is we need to make sure un-genotyped individuals can pass through the standardise process without erroring.
| pooch_data_path("standardised-data-forbidden-genotypes.csv") | ||
| ) | ||
|
|
||
| with pytest.raises(TypeError): |
There was a problem hiding this comment.
By adding with pytest.raises(TypeError), this test is allowed to pass even though it throws an error during standardise_pyrat_csv and never reaches pd.testing.assert_frame_equal. This means it is no longer testing that genotypes are standardised correctly.
You'll need to remove the with pytest.raises(TypeError) and rather add a fix in _remove_impossible_breeding_schemes instead. Otherwise, at the moment, any ungenotyped individuals that enter standardise_pyrat_csv will cause the processing to stop early (we need these ungenotyped individuals later in the historical stats part - see this issue: #8)
| genotype_father = standardised_df_row["genotype_father"] | ||
| genotype_mother = standardised_df_row["genotype_mother"] | ||
| genotype_offspring = standardised_df_row["genotype_offspring"] | ||
|
|
There was a problem hiding this comment.
To keep rows with un-genotyped offspring, you could add something like this here:
# If the offspring are un-genotyped, we keep the row as there is no way
# of checking if the breeding scheme was valid
if pd.isna(genotype_offspring):
return pop
Hopefully this should make the test_standardise_genotypes test pass, although I see that there is one remaining impossible breeding scheme in the test file I missed. You can change: pyrat-data-forbidden-genotypes.csv / standardised-data-forbidden-genotypes.csv ID-011 genotype_mother to wt_het on your GIN branch.
| ].agg("_".join, axis=1) | ||
|
|
||
|
|
||
| def _remove_impossible_breeding_schemes( |
There was a problem hiding this comment.
As this function doesn't directly remove the impossible breeding schemes, but rather identifies them with True / False. I'd maybe re-name this? Something like:
| def _remove_impossible_breeding_schemes( | |
| def _is_impossible_breeding_scheme( |
| Parameters | ||
| ---------- | ||
| standardised_df_row : pd.Series | ||
| rom from standardised_dataframe (pd.DataFrame): standardised PyRAT df |
There was a problem hiding this comment.
| rom from standardised_dataframe (pd.DataFrame): standardised PyRAT df | |
| row from standardised_dataframe (pd.DataFrame): standardised PyRAT df |
| ) -> bool: | ||
| """Retrieves parent genotypes and pulls the mendalian ratios from | ||
| BreedingScheme. | ||
| Compares offspring to these ratios, removing those which are not possible. | ||
| e.g. hom x hom parents cannot produce wt x wt offspring. |
There was a problem hiding this comment.
Maybe modify this so it's clear that this particular function isn't removing them?
| ) -> bool: | |
| """Retrieves parent genotypes and pulls the mendalian ratios from | |
| BreedingScheme. | |
| Compares offspring to these ratios, removing those which are not possible. | |
| e.g. hom x hom parents cannot produce wt x wt offspring. | |
| ) -> bool: | |
| """ | |
| Checks whether the given row contains an impossible breeding scheme. | |
| Retrieves parent genotypes and pulls the mendalian ratios from | |
| BreedingScheme. Compares offspring to these ratios, returning True for | |
| those which are not possible. | |
| e.g. hom x hom parents cannot produce wt offspring. |
| to_ignore = standard_csv.apply(_remove_impossible_breeding_schemes, axis=1) | ||
| standard_csv = standard_csv[~to_ignore] |
There was a problem hiding this comment.
Mostly personal preference, but maybe change the variable name here to make it super clear what is being ignored.
| to_ignore = standard_csv.apply(_remove_impossible_breeding_schemes, axis=1) | |
| standard_csv = standard_csv[~to_ignore] | |
| impossible_breeding_schemes = standard_csv.apply(_is_impossible_breeding_scheme, axis=1) | |
| standard_csv = standard_csv[~impossible_breeding_schemes] |
Before submitting a pull request (PR), please read the contributing guide.
Please fill out as much of this template as you can, but if you have any problems or questions, just leave a comment and we will help out :)
Description
What is this PR
Why is this PR needed?
Since the PyRAT db is manually populated, there is a possibility that someone could have incorrectly entered genotypes for the offspring.
What does this PR do?
This adds a function to standardise.py that checks whether the offsprings genotype is a possible combination of the two parents, using the Mendelian ratios.
References
closes issue #10
How has this PR been tested?
Added new test to test_standardise, along with new test data which includes a raw file with impossible schemes, and a standardised file that asserts that those schemes have been removed.
Is this a breaking change?
No
Does this PR require an update to the documentation?
Not yet.
Checklist: