Symmetry and Data Augmentation #626
@Scienfitz just to make sure - I guess since this is marked as a draft, you do not require a PR review for now, right? Is there anything else that we can assist with?
@AVHopp yes exactly, and it will always be like that for PRs that I open in draft: ignore until requested or asked in any other way
Pull Request Overview
This PR implements automatic data augmentation for measurements when constraints support symmetry assumptions, particularly for permutation and dependency invariance constraints. This enhancement helps surrogate models better learn from symmetric relationships in the data without requiring users to manually generate augmented points.
- Adds `consider_data_augmentation` flags to both surrogate models and relevant constraints to control augmentation behavior
- Integrates augmentation logic into the Bayesian recommender workflow, applying it before model fitting when configured
- Provides comprehensive examples and documentation showing the performance benefits of augmentation
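As a rough illustration of what augmentation under permutation invariance means, the following sketch adds, for every measurement, the rows obtained by permuting the invariant parameter columns. It uses plain dictionaries rather than the BayBE API; the function name `augment_permutations` and the column names are hypothetical and not taken from this PR.

```python
from itertools import permutations


def augment_permutations(rows, columns):
    """Return the rows plus all copies with the given columns permuted.

    Hypothetical helper for illustration only, not part of BayBE.
    """
    augmented = []
    seen = set()
    for row in rows:
        values = [row[c] for c in columns]
        for perm in permutations(values):
            new_row = {**row, **dict(zip(columns, perm))}
            key = tuple(sorted(new_row.items()))
            if key not in seen:  # skip duplicates, e.g. when values are equal
                seen.add(key)
                augmented.append(new_row)
    return augmented


measurements = [
    {"Solvent1": "A", "Solvent2": "B", "Yield": 0.7},
    {"Solvent1": "C", "Solvent2": "C", "Yield": 0.2},
]
augmented = augment_permutations(measurements, ["Solvent1", "Solvent2"])
```

The first measurement gains a swapped copy with the same target value, while the second one (identical labels) produces no new row.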
Reviewed Changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_measurement_augmentation.py | New test file verifying augmentation is applied when configured |
| examples/Constraints_Discrete/augmentation.py | New example demonstrating augmentation effects on optimization performance |
| docs/userguide/surrogates.md | Documentation updates explaining data augmentation feature |
| docs/userguide/constraints.md | Documentation updates for augmentation flags in constraints |
| docs/scripts/build_examples.py | Build script improvement to ignore __pycache__ folders |
| baybe/utils/dataframe.py | Added documentation note about constraint considerations |
| baybe/utils/augmentation.py | Cleaned up duplicate example in docstring |
| baybe/surrogates/gaussian_process/core.py | Added consider_data_augmentation flag with temporary default |
| baybe/surrogates/base.py | Added base consider_data_augmentation flag to surrogate interface |
| baybe/searchspace/core.py | Core augmentation logic and augment_measurements method |
| baybe/recommenders/pure/bayesian/base.py | Integration of augmentation into Bayesian recommender workflow |
| baybe/recommenders/pure/base.py | Minor cleanup of validation logic |
| baybe/constraints/discrete.py | Added consider_data_augmentation flags to constraint classes |
| baybe/constraints/base.py | Moved augmentation flag to base constraint class |
| CHANGELOG.md | Documented new features and changes |
| # Validate compatibility of surrogate symmetries with searchspace | ||
| if hasattr(self._surrogate_model, "symmetries"): | ||
| for s in self._surrogate_model.symmetries: | ||
| s.validate_searchspace_context(searchspace) |
Important: Validation so far is only part of the recommend call here in the recommenders. Validation has not been included in the Campaign yet. This is due to two factors:
- To properly validate the compatibility of symmetries and searchspace, there needs to be a mechanism that can iterate over all possible recommenders of a meta recommender. Otherwise, this upfront validation already fails for the two-phase recommender if the second recommender has symmetries.
- There would be double validation with the campaign and the recommend call, so the context info of whether validation was already performed needs to be passed somewhere. This is likely fixable with a settings mechanism, which is not yet available.
@AdrianSosic I see now that the 2nd point could be solved with the Settings mechanism, but I have no idea how to solve issue 1.
In the absence of that, it's not really possible to turn it into an upfront validation, so I would probably not change the validation for the moment unless you have a smarter idea.
+1 for being pragmatic and not trying to come up with something potentially convoluted right now. Even if we find a better way for the validation later, including it is just a plain improvement without negative consequences to users, so we can add it later without problems.
The _autoreplicate converter on main wraps surrogates in a CompositeSurrogate. Access the inner template for symmetry validation and augmentation.
Use full module paths (e.g., baybe.symmetries.base.Symmetry) instead of short paths via __init__.py re-exports, which Sphinx cannot resolve.
`DiscretePermutationInvarianceConstraint` was always internally applying a DiscreteNoLabelDuplicates constraint to remove the diagonal elements, which is not correct and can always be achieved separately by explicitly using `DiscreteNoLabelDuplicates`
Co-authored-by: Alexander V. Hopp <alexander.hopp@merckgroup.com>
@AdrianSosic appreciate your review
AdrianSosic left a comment
Good things take time ...
| return tuple(x) | ||
|
|
||
|
|
||
| def normalize_convertible2str_sequence( |
My comment from here is actually still unresolved:
- The name is super hard to read and breaks our conventions (took me ages to understand that the part with the "2" is to be read as one piece)
- The function doesn't care at all about whether the content is bools/str, so why put it in the name
- Should mention that this is attrs-format
So how about:
- Using a proper generic as input/output type of the sequence/tuple
- Renaming the thing to `to_sorted_tuple`
- Optionally turning the docstring into something like `Attrs-converter transforming sequences into sorted tuples` or similar?
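A minimal sketch of what the suggested rename with a generic signature could look like. It assumes a plain single-argument converter; the converter in the PR is wrapped with `takes_self=True, takes_field=True` and therefore receives additional arguments, which are left out here. The `SupportsLessThan` protocol is a hypothetical helper introduced only to bound the type variable.

```python
from collections.abc import Iterable
from typing import Any, Protocol, TypeVar


class SupportsLessThan(Protocol):
    """Hypothetical protocol for anything that `sorted` can order."""

    def __lt__(self, other: Any, /) -> bool: ...


T = TypeVar("T", bound=SupportsLessThan)


def to_sorted_tuple(values: Iterable[T], /) -> tuple[T, ...]:
    """Attrs-style converter transforming sequences into sorted tuples."""
    return tuple(sorted(values))
```

The element type is preserved from input to output, so the name no longer needs to mention strings or booleans.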
| alias="values", | ||
| converter=Converter(_convert_values, takes_self=True, takes_field=True), # type: ignore | ||
| validator=( | ||
| converter=Converter( # type: ignore[misc,call-overload] # mypy: Converter |
what is the mypy part about? Haven't seen this before
| converter=Converter( # type: ignore[misc,call-overload] # mypy: Converter | ||
| normalize_convertible2str_sequence, takes_self=True, takes_field=True | ||
| ), | ||
| validator=( # type: ignore[arg-type] # mypy: validator tuple |
Why are ignores suddenly needed (and what is the mypy part about)? Due to a newer mypy release?
| from baybe.parameters.validation import validate_unique_values | ||
| from baybe.settings import active_settings | ||
| from baybe.utils.interval import InfiniteIntervalError, Interval | ||
| from baybe.utils.validation import validate_is_finite |
Question: this PR moves some stuff around (like the validate_is_finite and others), which technically represents breaking changes. Changelog entry or no?
| from baybe.symmetries.base import Symmetry | ||
| from baybe.symmetries.dependency import DependencySymmetry | ||
| from baybe.symmetries.mirror import MirrorSymmetry | ||
| from baybe.symmetries.permutation import PermutationSymmetry | ||
|
|
||
| __all__ = [ | ||
| "DependencySymmetry", | ||
| "MirrorSymmetry", | ||
| "PermutationSymmetry", | ||
| "Symmetry", | ||
| ] |
For other components, we don't expose the base class in the namespace. So for consistency, I suggest:
| from baybe.symmetries.base import Symmetry | |
| from baybe.symmetries.dependency import DependencySymmetry | |
| from baybe.symmetries.mirror import MirrorSymmetry | |
| from baybe.symmetries.permutation import PermutationSymmetry | |
| __all__ = [ | |
| "DependencySymmetry", | |
| "MirrorSymmetry", | |
| "PermutationSymmetry", | |
| "Symmetry", | |
| ] | |
| from baybe.symmetries.dependency import DependencySymmetry | |
| from baybe.symmetries.mirror import MirrorSymmetry | |
| from baybe.symmetries.permutation import PermutationSymmetry | |
| __all__ = [ | |
| "DependencySymmetry", | |
| "MirrorSymmetry", | |
| "PermutationSymmetry", | |
| ] |
| A dataframe with the augmented measurements, including the original | ||
| ones. |
| A dataframe with the augmented measurements, including the original | |
| ones. | |
| A dataframe with the augmented measurements, including the original ones. |
| def augment_measurements( | ||
| self, | ||
| measurements: pd.DataFrame, | ||
| parameters: Iterable[Parameter] | None = None, |
Looking at your code, I think Sequence is probably the safer choice here, given that you already (unnoticed) broke the Iterable contract by iterating multiple times over it 😬 What do you think?
| parameters: Iterable[Parameter] | None = None, | |
| parameters: Sequence[Parameter] | None = None, |
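The concern about iterating more than once can be reproduced with a small standalone sketch. The function `count_twice` is hypothetical and only mimics the pattern of looping over the same argument twice; it is not code from this PR.

```python
from collections.abc import Iterable


def count_twice(items: Iterable[str]) -> tuple[int, int]:
    """Iterate over the input two times and report both counts."""
    first = sum(1 for _ in items)
    second = sum(1 for _ in items)  # silently empty for one-shot iterators
    return first, second


names = ["x", "y", "z"]
from_list = count_twice(names)  # a list can be re-iterated
from_generator = count_twice(n for n in names)  # a generator is exhausted after one pass
```

With a list both passes see three items, whereas the generator yields nothing on the second pass, which is the kind of silent failure a `Sequence` annotation guards against.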
| measurements: The dataframe containing the measurements to be | ||
| augmented. | ||
| parameters: Optional parameter objects carrying additional information. | ||
| Only required by specific augmentation implementations. |
| measurements: The dataframe containing the measurements to be | |
| augmented. | |
| parameters: Optional parameter objects carrying additional information. | |
| Only required by specific augmentation implementations. | |
| measurements: The dataframe containing the measurements to be augmented. | |
| parameters: Optional parameter objects carrying additional information. | |
| Only required by some symmetry classes. |
| parameters_missing = set(self.parameter_names).difference( | ||
| searchspace.parameter_names | ||
| ) | ||
| if parameters_missing: | ||
| raise IncompatibleSearchSpaceError( | ||
| f"The symmetry of type '{self.__class__.__name__}' was set up with the " | ||
| f"following parameters that are not present in the search space: " | ||
| f"{parameters_missing}." | ||
| ) |
A bit shorter
| parameters_missing = set(self.parameter_names).difference( | |
| searchspace.parameter_names | |
| ) | |
| if parameters_missing: | |
| raise IncompatibleSearchSpaceError( | |
| f"The symmetry of type '{self.__class__.__name__}' was set up with the " | |
| f"following parameters that are not present in the search space: " | |
| f"{parameters_missing}." | |
| ) | |
| if missing := set(self.parameter_names) - set(searchspace.parameter_names): | |
| raise IncompatibleSearchSpaceError( | |
| f"The symmetry of type '{self.__class__.__name__}' was set up with the " | |
| f"following parameters that are not present in the search space: " | |
| f"{missing}." | |
| ) |
| - Interpoint constraints for continuous search spaces | ||
| - Transfer learning benchmarks for shifted and inverted Hartmann functions | ||
| - Coding convention instructions for agentic developers (`AGENTS.md`, `CLAUDE.md`) | ||
| - Symmetry classes (`PermutationSymmetry`, `MirrorSymmetry`, `DependencySymmetry`) |
Maybe add the base class? You could even mention the symmetry framework as a whole first and only then mention the classes, because the bullet does not make clear that the framework itself is new (you cannot distinguish it from the case where we later add a fourth class).
| condition: Condition = field(validator=instance_of(Condition)) | ||
| """The condition specifying the active range of the causing parameter.""" | ||
|
|
||
| affected_parameter_names: tuple[str, ...] = field( |
I'm still puzzled why you chose the terminology causing and affected when the class is named Dependent... and you talk about dependent parameters in the docstring. Wouldn't it make much more sense to call the attribute dependent_... and correspondingly use independent for the other parameter types, which furthermore avoids the clash with the actual causal terminology from causal modeling?
| """The condition specifying the active range of the causing parameter.""" | ||
|
|
||
| affected_parameter_names: tuple[str, ...] = field( | ||
| converter=Converter( # type: ignore[misc,call-overload] # mypy: Converter |
Again: what is this mypy comment?
| """The parameters affected by the dependency.""" | ||
|
|
||
| n_discretization_points: int = field( | ||
| default=3, validator=(instance_of(int), ge(2)), kw_only=True |
I can understand that you want to make the concept usable in conti spaces, but I would suggest not choosing any default then. I think we all agree that there is no obvious "default" that one could possibly choose to well-approximate a continuous function with discrete points, and the switch from disc to conti is one that moves from an exact representation of the symmetry to an approximate one. So I think this is not something that should be hidden from the user but where they should explicitly opt in, don't you agree?
| parameters: Iterable[Parameter] | None = None, | ||
| ) -> pd.DataFrame: | ||
| # See base class. | ||
| if not self.use_data_augmentation: |
Since this check has to happen in each subclass, let's perhaps turn it into the template pattern?
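One way the template pattern mentioned here could look, as a simplified sketch: the public method performs the shared `use_data_augmentation` check once and delegates to a private hook. The classes below are stand-ins operating on lists of dictionaries, not the actual BayBE classes, and the hook name `_augment_measurements` is an assumption.

```python
from abc import ABC, abstractmethod


class Symmetry(ABC):
    """Simplified stand-in; not the actual BayBE class."""

    def __init__(self, use_data_augmentation: bool = True):
        self.use_data_augmentation = use_data_augmentation

    def augment_measurements(self, measurements: list[dict]) -> list[dict]:
        """Template method: perform the shared check, then delegate."""
        if not self.use_data_augmentation:
            return measurements
        return self._augment_measurements(measurements)

    @abstractmethod
    def _augment_measurements(self, measurements: list[dict]) -> list[dict]:
        """Subclass hook containing only the symmetry-specific logic."""


class MirrorSymmetry(Symmetry):
    """Toy mirror symmetry around zero for a single column "x"."""

    def _augment_measurements(self, measurements):
        mirrored = [{**m, "x": -m["x"]} for m in measurements if m["x"] != 0]
        return measurements + mirrored
```

Subclasses then only implement the hook and cannot forget the flag check.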
| # values that are not active, as rows containing them should be | ||
| # augmented. | ||
| param = next( | ||
| cast(DiscreteParameter, p) |
Why a cast? I think this needs to be a proper validation instead, no?
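To make the difference concrete: `typing.cast` only informs the type checker and performs no runtime check, whereas an `isinstance` validation fails loudly. The classes and the helper `get_discrete` below are hypothetical stand-ins, not the BayBE implementations.

```python
from typing import cast


class Parameter:
    def __init__(self, name: str):
        self.name = name


class DiscreteParameter(Parameter):
    def __init__(self, name: str, values: tuple):
        super().__init__(name)
        self.values = values


def get_discrete(parameters, name: str) -> DiscreteParameter:
    """Look up a parameter by name and validate that it is discrete."""
    param = next(p for p in parameters if p.name == name)
    if not isinstance(param, DiscreteParameter):
        raise TypeError(
            f"Parameter '{name}' must be discrete, got {type(param).__name__}."
        )
    return param


continuous = Parameter("temperature")
unchecked = cast(DiscreteParameter, continuous)  # no runtime effect at all
```

Here `unchecked` is still the very same non-discrete object, so any later access to `values` would fail far away from the actual mistake.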
Let's take this opportunity to make all the df function arguments in this file positional-only
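For reference, positional-only arguments are declared with a `/` in the signature. The toy function below is hypothetical and only demonstrates the calling behavior; it is not one of the utilities in the file.

```python
def apply_augmentation(df, /, column_groups):
    """Toy function: `df` is positional-only, `column_groups` is not."""
    return df, column_groups


data = [{"a": 1, "b": 2}]
result = apply_augmentation(data, column_groups=[["a", "b"]])  # accepted

try:
    apply_augmentation(df=data, column_groups=[["a", "b"]])
except TypeError:
    keyword_call_rejected = True  # `df` cannot be passed by keyword
else:
    keyword_call_rejected = False
```

This keeps the name of the dataframe argument out of the public contract, so it can be renamed later without breaking callers.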
| def df_apply_permutation_augmentation( | ||
| df: pd.DataFrame, | ||
| column_groups: Sequence[Sequence[str]], | ||
| permutation_groups: Sequence[Sequence[str]], |
I vaguely remember you and @AVHopp having a discussion about this but: do we really want to change the semantics of this function? I know there is a changelog entry for it, but still: this is strictly speaking the worst case of breaking change. The function has the same name, same number of arguments, and same argument types, same return type, but will silently do something very different from the original version --> silent bug.
I'm asking because there is no real need for this change in the first place!?
This is probably minor, but still: Do we want to make any claims about the ordering of dataframe content produced by the utilities in this function? Right now, some append the augmented values, while others squeeze them in.
|
|
||
| Note: | ||
| This function does not consider constraints and might provide unexpected or | ||
| invalid data if certain constraints are present. |
IMO this is more confusing than not having it. On this level, the concept of constraints does not even exist. The situation would be different if one of the arguments was a SearchSpace and the function was silently ignoring its constraints, but this is not the case here.
| Note: | |
| This function does not consider constraints and might provide unexpected or | |
| invalid data if certain constraints are present. |
| ) | ||
|
|
||
|
|
||
| def validate_unique_values( # noqa: DOC101, DOC103 |
We now have two validate_unique_values functions. I think you forgot to delete the existing one?
Implements #621
New: A `Symmetry` class which is part of baybe `Surrogate`s. Three distinct symmetries are included; for more info check the userguide, and for a demonstration of the effect see the new example. The ability to perform data augmentation has been included for all symmetries.
I have left some initial comments on design questions that are still open or where I am kind of indifferent and just had to choose one. Feel free to leave an opinion there first so the large-scale design picture can be finalized independent of small comments.
TODO
Other Notes
Symmetries and constraints are conceptually so similar that they should probably have the same interface. The design here has been done from scratch completely ignoring the constraint interface because it is already known to be not optimal and needs refactoring.
`parameters` or similar because some symmetries allow single and some multiple such parameters. Instead the parameters are treated like the objectives treat target(s)
Unrelated Bugfix
I noticed that the permutation constraint also removed the diagonal in its filtering process. However, this seems unreasonable since the diagonal is a set of points that are unique and have no invariant equivalent, hence nothing needs removing. Turns out there was an automatic removal of the diagonal because internally `DiscretePermutationInvarianceConstraint` also always applied a `DiscreteNoLabelDuplicates` constraint. I think the rationale was that label duplicates don't make sense in these mixture situations, so they need removing. However, this has nothing to do with the invariance and is achieved anyway in mixture use cases by adding a no-label-duplicate constraint explicitly. So it was removed from the `DiscretePermutationInvarianceConstraint`, which now leads to the expected amount of removed points (one of the matrix triangles).
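The counting argument can be checked with a small standalone sketch on a 3x3 grid of two label parameters. It uses plain `itertools` rather than the BayBE constraint classes, and the filtering rule `p1 <= p2` is just one assumed way of picking a representative per unordered pair.

```python
from itertools import product

labels = ["A", "B", "C"]
grid = list(product(labels, repeat=2))  # full 3x3 grid of (p1, p2) combinations

# Permutation invariance alone: keep one representative per unordered pair.
# The diagonal (p1 == p2) has no invariant equivalent, so it is kept in full.
invariant = [(p1, p2) for p1, p2 in grid if p1 <= p2]

# Explicitly excluding label duplicates removes the diagonal as a separate step.
no_duplicates = [(p1, p2) for p1, p2 in invariant if p1 != p2]
```

Of the nine grid points, invariance alone removes three (one triangle) and keeps six, while additionally dropping label duplicates leaves three.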