
Symmetry and Data Augmentation #626

Open
Scienfitz wants to merge 24 commits into dev/symmetry from feature/invariance_augmentation

Conversation

@Scienfitz

@Scienfitz Scienfitz commented Aug 21, 2025

Collaborator

Implements #621

New: A Symmetry class that is part of BayBE surrogates. Three distinct symmetries are included; for more information, see the user guide, and for a demonstration of the effect, see the new example. Data augmentation is supported for all symmetries.
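
To illustrate what data augmentation means for a permutation symmetry, here is a minimal sketch in plain Python. It is not the implementation from this PR: the helper name `augment_permutations` and the list-of-dicts data layout are assumptions made only for illustration.

```python
from itertools import permutations


def augment_permutations(rows, columns):
    """Return the rows plus every distinct permutation of the given columns.

    Hypothetical helper for illustration only, not BayBE's implementation.
    """
    seen = set()
    augmented = []
    for row in rows:
        values = [row[c] for c in columns]
        for perm in permutations(values):
            # Copy the row and overwrite the symmetric columns with the permutation
            candidate = {**row, **dict(zip(columns, perm))}
            key = tuple(sorted(candidate.items()))
            if key not in seen:  # skip duplicates, e.g. rows with x1 == x2
                seen.add(key)
                augmented.append(candidate)
    return augmented


measurements = [
    {"x1": 1, "x2": 2, "target": 0.5},
    {"x1": 3, "x2": 3, "target": 0.9},
]
augmented = augment_permutations(measurements, ["x1", "x2"])
```

The first row gains one swapped copy with the same target value, while the second row lies on the diagonal and gains nothing, so `augmented` holds three rows.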

I have left some initial comments on design questions that are still open or where I am kind of indifferent and just had to choose one. Feel free to leave an opinion there first so the large-scale design picture can be finalized independent of small comments.

TODO

  • CHANGELOG (after architecture is locked in)
  • Remember to compress finalized svg picture

Other Notes
Symmetries and constraints are conceptually so similar that they should probably have the same interface. The design here was done from scratch, deliberately ignoring the constraint interface, because the latter is already known to be suboptimal and in need of refactoring.

  • There is no overarching shared attribute "parameters" or similar, because some symmetries allow a single such parameter and others allow multiple. Instead, the parameters are treated the way objectives treat their target(s).
  • Contrary to the dependency constraint, the dependency symmetry can only hold one set of dependencies. The constraint should be refactored to look the same.

Unrelated Bugfix
I noticed that the permutation constraint also removed the diagonal in its filtering process. However, this seems unreasonable, since the diagonal is a set of points that are unique and have no invariant equivalent, hence nothing needs removing. It turns out that the diagonal was removed automatically because, internally, DiscretePermutationInvarianceConstraint also always applied a DiscreteNoLabelDuplicates constraint. I think the rationale was that label duplicates don't make sense in these mixture situations, so they need removing. However, this has nothing to do with the invariance and is achieved anyway in mixture use cases by explicitly adding a no-label-duplicates constraint. So it was removed from DiscretePermutationInvarianceConstraint, which now leads to the expected number of removed points (one of the matrix triangles).
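
The difference between the two filters can be sketched on a small 3x3 grid. This is plain Python for illustration only; the two helper functions are hypothetical stand-ins and do not reproduce the actual constraint classes.

```python
from itertools import product

labels = ["A", "B", "C"]
grid = list(product(labels, repeat=2))  # all 9 candidate points


def filter_permutation_invariant(points):
    """Keep one representative per set of permutation-equivalent points."""
    seen = set()
    kept = []
    for point in points:
        key = tuple(sorted(point))  # (A, B) and (B, A) share the same key
        if key not in seen:
            seen.add(key)
            kept.append(point)
    return kept


def filter_no_label_duplicates(points):
    """Drop points in which a label occurs more than once (the diagonal)."""
    return [p for p in points if len(set(p)) == len(p)]


# Invariance alone removes one triangle and keeps the diagonal
invariant_only = filter_permutation_invariant(grid)

# Mixture use case: additionally remove the diagonal with a separate filter
mixture_case = filter_no_label_duplicates(invariant_only)
```

With invariance alone, six of the nine points remain (the diagonal plus one triangle). Only the explicit second filter reduces this to the three off-diagonal representatives.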

@Scienfitz Scienfitz self-assigned this Aug 21, 2025
@Scienfitz Scienfitz added the new feature New functionality label Aug 21, 2025
@Scienfitz Scienfitz linked an issue Aug 21, 2025 that may be closed by this pull request
@AVHopp

AVHopp commented Aug 25, 2025

Collaborator

@Scienfitz just to make sure - I guess since this is marked as a draft, you do not require a PR review for now, right? Is there anything else that we can assist with?

@Scienfitz

Collaborator Author

@AVHopp Yes, exactly, and it will always be like that for PRs that I open as drafts: ignore them until a review is requested or asked for in some other way.

@Scienfitz Scienfitz force-pushed the feature/invariance_augmentation branch from 00cfef8 to 5ba4cb1 Compare September 10, 2025 08:10
@Scienfitz Scienfitz force-pushed the feature/invariance_augmentation branch 4 times, most recently from db98f64 to aedafa7 Compare September 25, 2025 10:41
@Scienfitz Scienfitz marked this pull request as ready for review September 25, 2025 11:01
Copilot AI review requested due to automatic review settings September 25, 2025 11:01

Copilot AI left a comment

Contributor

Pull Request Overview

This PR implements automatic data augmentation for measurements when constraints support symmetry assumptions, particularly for permutation and dependency invariance constraints. This enhancement helps surrogate models better learn from symmetric relationships in the data without requiring users to manually generate augmented points.

  • Adds consider_data_augmentation flags to both surrogate models and relevant constraints to control augmentation behavior
  • Integrates augmentation logic into the Bayesian recommender workflow, applying it before model fitting when configured
  • Provides comprehensive examples and documentation showing the performance benefits of augmentation

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/test_measurement_augmentation.py | New test file verifying augmentation is applied when configured |
| examples/Constraints_Discrete/augmentation.py | New example demonstrating augmentation effects on optimization performance |
| docs/userguide/surrogates.md | Documentation updates explaining data augmentation feature |
| docs/userguide/constraints.md | Documentation updates for augmentation flags in constraints |
| docs/scripts/build_examples.py | Build script improvement to ignore __pycache__ folders |
| baybe/utils/dataframe.py | Added documentation note about constraint considerations |
| baybe/utils/augmentation.py | Cleaned up duplicate example in docstring |
| baybe/surrogates/gaussian_process/core.py | Added consider_data_augmentation flag with temporary default |
| baybe/surrogates/base.py | Added base consider_data_augmentation flag to surrogate interface |
| baybe/searchspace/core.py | Core augmentation logic and augment_measurements method |
| baybe/recommenders/pure/bayesian/base.py | Integration of augmentation into Bayesian recommender workflow |
| baybe/recommenders/pure/base.py | Minor cleanup of validation logic |
| baybe/constraints/discrete.py | Added consider_data_augmentation flags to constraint classes |
| baybe/constraints/base.py | Moved augmentation flag to base constraint class |
| CHANGELOG.md | Documented new features and changes |


Comment thread baybe/surrogates/gaussian_process/core.py Outdated
Comment thread tests/test_measurement_augmentation.py Outdated
Comment thread docs/scripts/build_examples.py
@Scienfitz Scienfitz force-pushed the feature/invariance_augmentation branch from 583b7be to 98c29e8 Compare September 25, 2025 11:10
@Scienfitz

This comment was marked as outdated.

@Scienfitz Scienfitz marked this pull request as draft September 30, 2025 11:26
@Scienfitz Scienfitz changed the title Add Auto-Augmentation of Measurements in the Presence of Invariance Constraints Symmetry and Data Augmentation Oct 9, 2025
@Scienfitz Scienfitz force-pushed the feature/invariance_augmentation branch from 98c29e8 to 0e05880 Compare October 24, 2025 18:12
@Scienfitz Scienfitz force-pushed the feature/invariance_augmentation branch from 46bc49c to 859ca3b Compare October 31, 2025 18:23
Comment thread baybe/symmetries.py Outdated
Comment thread baybe/symmetries.py Outdated
# Validate compatibility of surrogate symmetries with searchspace
if hasattr(self._surrogate_model, "symmetries"):
    for s in self._surrogate_model.symmetries:
        s.validate_searchspace_context(searchspace)

Collaborator Author

Important: So far, validation only happens as part of the recommend call here in the recommenders. Validation has not been included in the Campaign yet. This is due to two factors:

  • To properly validate the compatibility of symmetries and search space, there needs to be a mechanism that can iterate over all possible recommenders of a meta recommender. Otherwise, this upfront validation already fails for the two-phase recommender if the second recommender has symmetries.
  • There would be double validation, in the campaign and in the recommend call, so the context information of whether validation was already performed needs to be passed somewhere. This is likely fixable with the settings mechanism, which is not yet available.

Collaborator Author

@AdrianSosic I see now that the second point could be solved with the Settings mechanism, but I have no idea how to solve the first one.

In the absence of that, it's not really possible to turn this into an upfront validation, so I would probably not change the validation for the moment unless you have a smarter idea.

Collaborator

+1 for being pragmatic and not trying to come up with something potentially convoluted right now. Even if we find a better way for the validation later, including it is just a plain improvement without negative consequences to users, so we can add it later without problems.

@Scienfitz Scienfitz marked this pull request as ready for review November 3, 2025 17:31
@Scienfitz Scienfitz requested a review from Copilot November 4, 2025 08:59

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Scienfitz Scienfitz requested a review from Copilot November 4, 2025 11:55

This comment was marked as resolved.

This comment was marked as outdated.

Scienfitz and others added 23 commits April 30, 2026 21:01
The _autoreplicate converter on main wraps surrogates in a
CompositeSurrogate. Access the inner template for symmetry
validation and augmentation.

Use full module paths (e.g., baybe.symmetries.base.Symmetry)
instead of short paths via __init__.py re-exports, which
Sphinx cannot resolve.

`DiscretePermutationInvarianceConstraint` was always internally applying a DiscreteNoLabelDuplicates constraint to remove the diagonal elements, which is not correct and can always be achieved separately by explicitly using `DiscreteNoLabelDuplicates`
Co-authored-by: Alexander V. Hopp <alexander.hopp@merckgroup.com>
@Scienfitz Scienfitz force-pushed the feature/invariance_augmentation branch from eb0e90a to 19af1b3 Compare April 30, 2026 19:02
@Scienfitz Scienfitz changed the base branch from dev/symmetry to main April 30, 2026 19:02
@Scienfitz Scienfitz changed the base branch from main to dev/symmetry April 30, 2026 19:03
@Scienfitz Scienfitz requested a review from AdrianSosic May 7, 2026 18:12
@Scienfitz

Collaborator Author

@AdrianSosic appreciate your review

@AdrianSosic AdrianSosic left a comment

Collaborator

Good things take time ...

Comment thread baybe/utils/conversion.py
return tuple(x)


def normalize_convertible2str_sequence(

Collaborator

My comment from here is actually still unresolved:

  • The name is super-hard to read and breaks our conventions (took me ages to understand that the part with the "2" is to be read as one piece)
  • The function doesn't care at all about whether the content is bools/str, so why put it in the name
  • Should mention that this is attrs-format

So how about:

  1. Using a proper generic as input/output type of the sequence/tuple
  2. rename the thing to to_sorted_tuple
  3. Optionally turn docstrings into something like Attrs-converter transforming sequences into sorted tuples or similar?
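
A rough sketch of what the proposed rename and generic typing could look like is given below. It assumes a plain single-argument converter, whereas the current function is registered with takes_self=True and takes_field=True, so the real signature would need the extra arguments. The body is an assumption, not the code from this PR.

```python
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def to_sorted_tuple(x: Iterable[T], /) -> tuple[T, ...]:
    """Attrs-converter transforming iterables into sorted tuples.

    Sketch only: a strict type checker may additionally require T to be
    bound to a protocol that supports ordering comparisons.
    """
    return tuple(sorted(x))


names = to_sorted_tuple(["b", "a", "c"])
numbers = to_sorted_tuple({3, 1, 2})
```

The function name no longer mentions the element type, and the generic makes the input and output element types agree.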

alias="values",
converter=Converter(_convert_values, takes_self=True, takes_field=True), # type: ignore
validator=(
converter=Converter( # type: ignore[misc,call-overload] # mypy: Converter

Collaborator

what is the mypy part about? Haven't seen this before

converter=Converter( # type: ignore[misc,call-overload] # mypy: Converter
normalize_convertible2str_sequence, takes_self=True, takes_field=True
),
validator=( # type: ignore[arg-type] # mypy: validator tuple

Collaborator

Why are ignores suddenly needed (and what is the mypy part about)? Is it due to a newer mypy release?

from baybe.parameters.validation import validate_unique_values
from baybe.settings import active_settings
from baybe.utils.interval import InfiniteIntervalError, Interval
from baybe.utils.validation import validate_is_finite

Collaborator

Question: this PR moves some stuff around (like the validate_is_finite and others), which technically represents breaking changes. Changelog entry or no?

Comment on lines +3 to +13
from baybe.symmetries.base import Symmetry
from baybe.symmetries.dependency import DependencySymmetry
from baybe.symmetries.mirror import MirrorSymmetry
from baybe.symmetries.permutation import PermutationSymmetry

__all__ = [
    "DependencySymmetry",
    "MirrorSymmetry",
    "PermutationSymmetry",
    "Symmetry",
]

Collaborator

For other components, we don't expose the base class in the namespace. So for consistency, I suggest:

Suggested change
- from baybe.symmetries.base import Symmetry
- from baybe.symmetries.dependency import DependencySymmetry
- from baybe.symmetries.mirror import MirrorSymmetry
- from baybe.symmetries.permutation import PermutationSymmetry
- __all__ = [
-     "DependencySymmetry",
-     "MirrorSymmetry",
-     "PermutationSymmetry",
-     "Symmetry",
- ]
+ from baybe.symmetries.dependency import DependencySymmetry
+ from baybe.symmetries.mirror import MirrorSymmetry
+ from baybe.symmetries.permutation import PermutationSymmetry
+ __all__ = [
+     "DependencySymmetry",
+     "MirrorSymmetry",
+     "PermutationSymmetry",
+ ]

Comment thread baybe/surrogates/base.py
Comment on lines +141 to +142
A dataframe with the augmented measurements, including the original
ones.

Collaborator

Suggested change
- A dataframe with the augmented measurements, including the original
- ones.
+ A dataframe with the augmented measurements, including the original ones.

Comment thread baybe/symmetries/base.py
def augment_measurements(
    self,
    measurements: pd.DataFrame,
    parameters: Iterable[Parameter] | None = None,

Collaborator

Looking at your code, I think Sequence is probably the safer choice here, given that you already (unnoticedly) broke the Iterable contract by iterating multiple times over it 😬 What do you think?

Suggested change
- parameters: Iterable[Parameter] | None = None,
+ parameters: Sequence[Parameter] | None = None,
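
The concern about iterating over an Iterable more than once can be reproduced with a few lines of plain Python. The function below is a hypothetical stand-in, not the code under review.

```python
from collections.abc import Iterable


def count_twice(items: Iterable[str]) -> tuple[int, int]:
    """Iterate over the argument twice and report both counts."""
    first = sum(1 for _ in items)
    second = sum(1 for _ in items)  # a one-shot iterator is already exhausted here
    return first, second


names = ["x1", "x2", "x3"]
from_generator = count_twice(n for n in names)  # second pass sees nothing
from_list = count_twice(names)  # a list can be traversed repeatedly
```

A generator satisfies the Iterable annotation but silently yields nothing on the second pass, whereas any Sequence such as a list or tuple can be traversed repeatedly.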

Comment thread baybe/symmetries/base.py
Comment on lines +58 to +61
measurements: The dataframe containing the measurements to be
augmented.
parameters: Optional parameter objects carrying additional information.
Only required by specific augmentation implementations.

Collaborator

Suggested change
- measurements: The dataframe containing the measurements to be
-     augmented.
- parameters: Optional parameter objects carrying additional information.
-     Only required by specific augmentation implementations.
+ measurements: The dataframe containing the measurements to be augmented.
+ parameters: Optional parameter objects carrying additional information.
+     Only required by some symmetry classes.

Comment thread baybe/symmetries/base.py
Comment on lines +77 to +85
parameters_missing = set(self.parameter_names).difference(
    searchspace.parameter_names
)
if parameters_missing:
    raise IncompatibleSearchSpaceError(
        f"The symmetry of type '{self.__class__.__name__}' was set up with the "
        f"following parameters that are not present in the search space: "
        f"{parameters_missing}."
    )

Collaborator

A bit shorter

Suggested change
- parameters_missing = set(self.parameter_names).difference(
-     searchspace.parameter_names
- )
- if parameters_missing:
-     raise IncompatibleSearchSpaceError(
-         f"The symmetry of type '{self.__class__.__name__}' was set up with the "
-         f"following parameters that are not present in the search space: "
-         f"{parameters_missing}."
-     )
+ if missing := set(self.parameter_names) - set(searchspace.parameter_names):
+     raise IncompatibleSearchSpaceError(
+         f"The symmetry of type '{self.__class__.__name__}' was set up with the "
+         f"following parameters that are not present in the search space: "
+         f"{missing}."
+     )
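
For reference, the shortened pattern behaves as in this standalone sketch. ValueError stands in for IncompatibleSearchSpaceError here, and the function name is hypothetical.

```python
def validate_parameter_names(required, available):
    """Raise if any required name is absent from the available names."""
    # The walrus operator binds the set difference and tests it for emptiness.
    # Unlike set.difference(), the "-" operator needs a set on both sides.
    if missing := set(required) - set(available):
        raise ValueError(
            f"Parameters not present in the search space: {sorted(missing)}."
        )


validate_parameter_names(["x1"], ["x1", "x2"])  # passes silently
```

An empty set is falsy, so the error branch is taken only when at least one name is missing.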

Comment thread CHANGELOG.md
- Interpoint constraints for continuous search spaces
- Transfer learning benchmarks for shifted and inverted Hartmann functions
- Coding convention instructions for agentic developers (`AGENTS.md`, `CLAUDE.md`)
- Symmetry classes (`PermutationSymmetry`, `MirrorSymmetry`, `DependencySymmetry`)

Collaborator

Maybe add the base class? You could even mention the symmetry framework as a whole first and only then the classes, because it is not clear from the bullet that the framework itself is new (you cannot distinguish it from the case where we later add a fourth class).

@AdrianSosic AdrianSosic left a comment

Collaborator

And some more

condition: Condition = field(validator=instance_of(Condition))
"""The condition specifying the active range of the causing parameter."""

affected_parameter_names: tuple[str, ...] = field(

Collaborator

I'm still puzzled why you chose the terminology causing and affected when the class is named Dependent... and you talk about dependent parameters in the docstring. Wouldn't it make much more sense to call the attribute dependent_... and correspondingly use independent for the other parameter types, which furthermore avoids the clash with the actual causal terminology from causal modeling?

"""The condition specifying the active range of the causing parameter."""

affected_parameter_names: tuple[str, ...] = field(
converter=Converter( # type: ignore[misc,call-overload] # mypy: Converter

Collaborator

Again: what is this mypy comment?

"""The parameters affected by the dependency."""

n_discretization_points: int = field(
default=3, validator=(instance_of(int), ge(2)), kw_only=True

Collaborator

I can understand that you want to make the concept usable in continuous spaces, but I would suggest not choosing any default then. I think we all agree that there is no obvious "default" that one could possibly choose to approximate a continuous function well with discrete points, and the switch from discrete to continuous is one that moves from an exact representation of the symmetry to an approximate one. So I think this is not something that should be hidden from the user but something they should explicitly opt into, don't you agree?

    parameters: Iterable[Parameter] | None = None,
) -> pd.DataFrame:
    # See base class.
    if not self.use_data_augmentation:

Collaborator

Since this check has to happen in each subclass, let's perhaps turn it into the template pattern?
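
One way the template pattern could look is sketched below. The class names, the list-of-dicts data layout, and the mirror_point attribute are assumptions for illustration; only the use_data_augmentation flag and the augment_measurements method name are taken from the diff.

```python
from abc import ABC, abstractmethod


class SymmetrySketch(ABC):
    """Hypothetical base class holding the shared flag check."""

    def __init__(self, use_data_augmentation: bool = True):
        self.use_data_augmentation = use_data_augmentation

    def augment_measurements(self, measurements: list[dict]) -> list[dict]:
        """Template method: the flag is checked once for all subclasses."""
        if not self.use_data_augmentation:
            return measurements
        return self._augment(measurements)

    @abstractmethod
    def _augment(self, measurements: list[dict]) -> list[dict]:
        """Symmetry-specific augmentation, implemented by each subclass."""


class MirrorSketch(SymmetrySketch):
    """Hypothetical mirror symmetry of one column around a mirror point."""

    def __init__(self, column, mirror_point=0.0, use_data_augmentation=True):
        super().__init__(use_data_augmentation)
        self.column = column
        self.mirror_point = mirror_point

    def _augment(self, measurements):
        # Points on the mirror point map onto themselves and are skipped
        mirrored = [
            {**row, self.column: 2 * self.mirror_point - row[self.column]}
            for row in measurements
            if row[self.column] != self.mirror_point
        ]
        return measurements + mirrored


data = [{"x": 1.0, "y": 2.0}, {"x": 0.0, "y": 5.0}]
augmented_on = MirrorSketch("x").augment_measurements(data)
augmented_off = MirrorSketch("x", use_data_augmentation=False).augment_measurements(data)
```

Subclasses then only implement the private hook and cannot forget the flag check.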

# values that are not active, as rows containing them should be
# augmented.
param = next(
cast(DiscreteParameter, p)

Collaborator

Why a cast? I think this needs to be a proper validation instead, no?
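
The difference between the two approaches can be shown with stand-in classes. The classes and helper functions below are hypothetical and only mirror the names used in the diff.

```python
from typing import cast


class Parameter:
    """Stand-in for a generic parameter."""

    def __init__(self, name: str):
        self.name = name


class DiscreteParameter(Parameter):
    """Stand-in for a parameter with a finite set of values."""

    def __init__(self, name: str, values: tuple):
        super().__init__(name)
        self.values = values


def as_discrete_cast(p: Parameter) -> DiscreteParameter:
    # typing.cast only informs the type checker and performs no runtime check
    return cast(DiscreteParameter, p)


def as_discrete_validated(p: Parameter) -> DiscreteParameter:
    # An explicit check fails early with a clear message
    if not isinstance(p, DiscreteParameter):
        raise TypeError(f"Parameter '{p.name}' must be discrete.")
    return p


continuous = Parameter("temperature")
unchecked = as_discrete_cast(continuous)  # passes, but has no 'values' attribute
```

The cast version lets the wrong object through and fails later at the first attribute access, whereas the validated version raises immediately.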

Collaborator

Let's take this opportunity to make all the df function arguments in this file positional-only.

def df_apply_permutation_augmentation(
    df: pd.DataFrame,
    column_groups: Sequence[Sequence[str]],
    permutation_groups: Sequence[Sequence[str]],

Collaborator

I vaguely remember you and @AVHopp having a discussion about this but: do we really want to change the semantics of this function? I know there is a changelog entry for it, but still: this is strictly speaking the worst case of breaking change. The function has the same name, same number of arguments, and same argument types, same return type, but will silently do something very different from the original version --> silent bug.

I'm asking because there is no real need for this change in the first place!?

Collaborator

This is probably minor, but still: Do we want to make any claims about the ordering of dataframe content produced by the utilities in this function? Right now, some append the augmented values, while others squeeze them in.

Comment thread baybe/utils/dataframe.py
Comment on lines +195 to +198

Note:
This function does not consider constraints and might provide unexpected or
invalid data if certain constraints are present.

Collaborator

IMO this is more confusing than not having it. On this level, the concept of constraints does not even exist. The situation would be different if one of the arguments was a SearchSpace and the function was silently ignoring its constraints, but this is not the case here.

Suggested change
- Note:
-     This function does not consider constraints and might provide unexpected or
-     invalid data if certain constraints are present.

Comment thread baybe/utils/validation.py
)


def validate_unique_values( # noqa: DOC101, DOC103

Collaborator

We now have two validate_unique_values functions. I think you forgot to delete the existing one?


Labels

dev new feature New functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Data Augmentation for Invariant Contraints

6 participants