Skip to content

Hc/fix impossible schemes#43

Draft
HenryCrosswell wants to merge 6 commits into
mainfrom
hc/fix_impossible_schemes
Draft

Hc/fix impossible schemes#43
HenryCrosswell wants to merge 6 commits into
mainfrom
hc/fix_impossible_schemes

Conversation

@HenryCrosswell

Copy link
Copy Markdown

Before submitting a pull request (PR), please read the contributing guide.

Please fill out as much of this template as you can, but if you have any problems or questions, just leave a comment and we will help out :)

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
Since the PyRAT db is manually populated, there is a possibility that someone could have incorrectly entered genotypes for the offspring.

What does this PR do?
This adds a function to standardise.py that checks whether the offsprings genotype is a possible combination of the two parents, using the Mendelian ratios.

References

closes issue #10

How has this PR been tested?

Added new test to test_standardise, along with new test data which includes a raw file with impossible schemes, and a standardised file that asserts that those schemes have been removed.

Is this a breaking change?

No

Does this PR require an update to the documentation?

Not yet.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

@HenryCrosswell HenryCrosswell requested a review from a team June 8, 2026 14:18
@HenryCrosswell HenryCrosswell self-assigned this Jun 8, 2026

@K-Meech K-Meech left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HenryCrosswell - this is looking great! Your additions are filtering the impossible schemes really nicely, which will stop any incorrect data making it to the later processing stages.

I've put a few comments below - most are minor wording suggestions, but the main one is we need to make sure un-genotyped individuals can pass through the standardise process without erroring.

pooch_data_path("standardised-data-forbidden-genotypes.csv")
)

with pytest.raises(TypeError):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By adding with pytest.raises(TypeError), this test is allowed to pass even though it throws an error during standardise_pyrat_csv and never reaches pd.testing.assert_frame_equal. This means it is no longer testing that genotypes are standardised correctly.

You'll need to remove the with pytest.raises(TypeError) and rather add a fix in _remove_impossible_breeding_schemes instead. Otherwise, at the moment, any ungenotyped individuals that enter standardise_pyrat_csv will cause the processing to stop early (we need these ungenotyped individuals later in the historical stats part - see this issue: #8)

genotype_father = standardised_df_row["genotype_father"]
genotype_mother = standardised_df_row["genotype_mother"]
genotype_offspring = standardised_df_row["genotype_offspring"]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep rows with un-genotyped offspring, you could add something like this here:

# If the offspring are un-genotyped, we keep the row as there is no way
# of checking if the breeding scheme was valid
if pd.isna(genotype_offspring):
    return pop

Hopefully this should make the test_standardise_genotypes test pass, although I see that there is one remaining impossible breeding scheme in the test file I missed. You can change: pyrat-data-forbidden-genotypes.csv / standardised-data-forbidden-genotypes.csv ID-011 genotype_mother to wt_het on your GIN branch.

].agg("_".join, axis=1)


def _remove_impossible_breeding_schemes(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this function doesn't directly remove the impossible breeding schemes, but rather identifies them with True / False. I'd maybe re-name this? Something like:

Suggested change
def _remove_impossible_breeding_schemes(
def _is_impossible_breeding_scheme(

Parameters
----------
standardised_df_row : pd.Series
rom from standardised_dataframe (pd.DataFrame): standardised PyRAT df

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rom from standardised_dataframe (pd.DataFrame): standardised PyRAT df
row from standardised_dataframe (pd.DataFrame): standardised PyRAT df

Comment on lines +394 to +398
) -> bool:
"""Retrieves parent genotypes and pulls the mendalian ratios from
BreedingScheme.
Compares offspring to these ratios, removing those which are not possible.
e.g. hom x hom parents cannot produce wt x wt offspring.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe modify this so it's clear that this particular function isn't removing them?

Suggested change
) -> bool:
"""Retrieves parent genotypes and pulls the mendalian ratios from
BreedingScheme.
Compares offspring to these ratios, removing those which are not possible.
e.g. hom x hom parents cannot produce wt x wt offspring.
) -> bool:
"""
Checks whether the given row contains an impossible breeding scheme.
Retrieves parent genotypes and pulls the mendalian ratios from
BreedingScheme. Compares offspring to these ratios, returning True for
those which are not possible.
e.g. hom x hom parents cannot produce wt offspring.

Comment on lines +97 to +98
to_ignore = standard_csv.apply(_remove_impossible_breeding_schemes, axis=1)
standard_csv = standard_csv[~to_ignore]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly personal preference, but maybe change the variable name here to make it super clear what is being ignored.

Suggested change
to_ignore = standard_csv.apply(_remove_impossible_breeding_schemes, axis=1)
standard_csv = standard_csv[~to_ignore]
impossible_breeding_schemes = standard_csv.apply(_is_impossible_breeding_scheme, axis=1)
standard_csv = standard_csv[~impossible_breeding_schemes]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants