Symmetry and Data Augmentation by Scienfitz · Pull Request #626 · emdgroup/baybe

Scienfitz · 2025-08-21T15:01:05Z

Implements #621

New: A Symmetry class which is part of baybe Surrogates. Three distinct symmetries are included; for more info, check the user guide, and for a demonstration of the effect, see the new example. Data augmentation is supported for all symmetries.

I have left some initial comments on design questions that are still open or where I am kind of indifferent and just had to choose one. Feel free to leave an opinion there first so the large-scale design picture can be finalized independent of small comments.

TODO

CHANGELOG (after architecture is logged in)
Remember to compress finalized svg picture

Other Notes
Symmetries and constraints are conceptually so similar that they should probably have the same interface. The design here has been done from scratch completely ignoring the constraint interface because it is already known to be not optimal and needs refactoring.

There is no overarching shared `parameters` attribute or similar because some symmetries allow a single such parameter and some allow multiple. Instead, the parameters are treated the way the objectives treat target(s).
Contrary to dependency constraint, dependency symmetry can only hold 1 set of dependencies. The constraint should be refactored to look the same.

Unrelated Bugfix
I noticed that the permutation constraint also removed the diagonal in its filtering process. However, this seems unreasonable since the diagonal is a set of points that are unique and have no invariant equivalent, hence nothing needs removing. It turns out the diagonal was removed automatically because, internally, DiscretePermutationInvarianceConstraint always also applied a DiscreteNoLabelDuplicates constraint. I think the rationale was that label duplicates don't make sense in these mixture situations, so they need removing. However, this has nothing to do with the invariance and is achieved anyway in mixture use cases by explicitly adding a no-label-duplicates constraint. So it was removed from the DiscretePermutationInvarianceConstraint, which now leads to the expected number of removed points (one of the matrix triangles).
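
To make the counting argument concrete, here is a minimal sketch in plain Python (it deliberately does not use the baybe constraint classes, and the three labels are made up for illustration). It shows that permutation-invariance filtering alone keeps the diagonal and removes exactly one matrix triangle, while dropping label duplicates is a separate step.

```python
from itertools import product

# Illustrative labels for two interchangeable parameters (assumption).
labels = ["A", "B", "C"]

# Full Cartesian grid of the two parameters: 3 * 3 = 9 points.
grid = list(product(labels, repeat=2))

# Keep one representative per permutation-equivalence class by sorting
# each point. Diagonal points such as ("A", "A") form their own class.
representatives = sorted({tuple(sorted(p)) for p in grid})

# Points removed by invariance filtering alone: one matrix triangle.
n_removed = len(grid) - len(representatives)

# Dropping label duplicates is an independent, additional filter that
# removes the diagonal as well.
no_duplicates = [p for p in representatives if len(set(p)) == len(p)]

print(len(grid), len(representatives), n_removed, len(no_duplicates))
```

For `n` labels, the invariance filter keeps `n * (n + 1) / 2` points, and the additional no-duplicates filter reduces this to `n * (n - 1) / 2`.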

AVHopp · 2025-08-25T07:31:31Z

@Scienfitz just to make sure - I guess since this is marked as a draft, you do not require a PR review for now, right? Is there anything else that we can assist with?

Scienfitz · 2025-08-26T15:04:57Z

@AVHopp yes exactly, and it will always be like that for PRs that I open as draft: ignore until a review is requested or asked for in any other way

Copilot

Pull Request Overview

This PR implements automatic data augmentation for measurements when constraints support symmetry assumptions, particularly for permutation and dependency invariance constraints. This enhancement helps surrogate models better learn from symmetric relationships in the data without requiring users to manually generate augmented points.

Adds consider_data_augmentation flags to both surrogate models and relevant constraints to control augmentation behavior
Integrates augmentation logic into the Bayesian recommender workflow, applying it before model fitting when configured
Provides comprehensive examples and documentation showing the performance benefits of augmentation

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/test_measurement_augmentation.py` | New test file verifying augmentation is applied when configured |
| `examples/Constraints_Discrete/augmentation.py` | New example demonstrating augmentation effects on optimization performance |
| `docs/userguide/surrogates.md` | Documentation updates explaining data augmentation feature |
| `docs/userguide/constraints.md` | Documentation updates for augmentation flags in constraints |
| `docs/scripts/build_examples.py` | Build script improvement to ignore `__pycache__` folders |
| `baybe/utils/dataframe.py` | Added documentation note about constraint considerations |
| `baybe/utils/augmentation.py` | Cleaned up duplicate example in docstring |
| `baybe/surrogates/gaussian_process/core.py` | Added `consider_data_augmentation` flag with temporary default |
| `baybe/surrogates/base.py` | Added base `consider_data_augmentation` flag to surrogate interface |
| `baybe/searchspace/core.py` | Core augmentation logic and `augment_measurements` method |
| `baybe/recommenders/pure/bayesian/base.py` | Integration of augmentation into Bayesian recommender workflow |
| `baybe/recommenders/pure/base.py` | Minor cleanup of validation logic |
| `baybe/constraints/discrete.py` | Added `consider_data_augmentation` flags to constraint classes |
| `baybe/constraints/base.py` | Moved augmentation flag to base constraint class |
| `CHANGELOG.md` | Documented new features and changes |

Scienfitz · 2025-11-03T17:22:53Z

+        # Validate compatibility of surrogate symmetries with searchspace
+        if hasattr(self._surrogate_model, "symmetries"):
+            for s in self._surrogate_model.symmetries:
+                s.validate_searchspace_context(searchspace)


Important: Validation so far is only part of the recommend call here in the recommenders. Validation has not been included in the Campaign yet. This is due to two factors:

1. To properly validate the compatibility of symmetries and searchspace, there needs to be a mechanism that can iterate over all possible recommenders of a metarecommender. Otherwise, this upfront validation already fails for the two-phase recommender if the second recommender has symmetries.

2. There would be double validation with the campaign and the recommend call, so the context info of whether validation was already performed needs to be passed somewhere. Likely fixable with the settings mechanism, which is not yet available.

@AdrianSosic I see now that the 2nd point could be solved with the Settings mechanism but I have no idea how to solve issue 1.

In the absence of that, it's not really possible to turn it into an upfront validation, so I would probably not change the validation for now unless you have a smarter idea.

+1 for being pragmatic and not trying to come up with something potentially convoluted right now. Even if we find a better way for the validation later, including it is just a plain improvement without negative consequences to users, so we can add it later without problems.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Scienfitz · 2026-05-18T08:43:30Z

@AdrianSosic appreciate your review

AdrianSosic

Good things take time ...

AdrianSosic · 2026-05-18T17:21:17Z

    return tuple(x)


+def normalize_convertible2str_sequence(


My comment from here is actually still unresolved:

- The name is super-hard to read and breaks our conventions (took me ages to understand that the part with the "2" is to be read as one piece)
- The function doesn't care at all about whether the content is bools/str, so why put it in the name
- Should mention that this is attrs-format

So how about:

- Using a proper generic as input/output type of the sequence/tuple
- Rename the thing to to_sorted_tuple
- Optionally turn docstrings into something like "Attrs-converter transforming sequences into sorted tuples" or similar?
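
A minimal sketch of what the proposal could look like, under stated assumptions: the name `to_sorted_tuple` and the docstring come from the comment above, while the `SupportsLessThan` protocol and the plain one-argument signature are illustrative. The converter in the PR is registered with `takes_self=True, takes_field=True`, which this sketch leaves out.

```python
from collections.abc import Iterable
from typing import Any, Protocol, TypeVar


class SupportsLessThan(Protocol):
    """Anything that can be ordered with ``<`` (illustrative helper type)."""

    def __lt__(self, other: Any, /) -> bool: ...


# Generic element type: the output tuple holds the same type as the input.
T = TypeVar("T", bound=SupportsLessThan)


def to_sorted_tuple(values: Iterable[T], /) -> tuple[T, ...]:
    """Attrs-converter transforming sequences into sorted tuples."""
    return tuple(sorted(values))


print(to_sorted_tuple(["b", "c", "a"]))
print(to_sorted_tuple((3, 1, 2)))
```

The generic keeps the element type visible to the type checker without mentioning `str` or `bool` in the function name.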

AdrianSosic · 2026-05-18T17:22:14Z

        alias="values",
-        converter=Converter(_convert_values, takes_self=True, takes_field=True),  # type: ignore
-        validator=(
+        converter=Converter(  # type: ignore[misc,call-overload]  # mypy: Converter


what is the mypy part about? Haven't seen this before

AdrianSosic · 2026-05-18T17:23:01Z

+        converter=Converter(  # type: ignore[misc,call-overload]  # mypy: Converter
+            normalize_convertible2str_sequence, takes_self=True, takes_field=True
+        ),
+        validator=(  # type: ignore[arg-type]  # mypy: validator tuple


why suddenly ignores needed (and what is the mypy part about)? Due to newer mypy release?

AdrianSosic · 2026-05-18T17:27:37Z

+from baybe.parameters.validation import validate_unique_values
 from baybe.settings import active_settings
 from baybe.utils.interval import InfiniteIntervalError, Interval
+from baybe.utils.validation import validate_is_finite


Question: this PR moves some stuff around (like the validate_is_finite and others), which technically represents breaking changes. Changelog entry or no?

AdrianSosic · 2026-05-18T17:29:28Z

+from baybe.symmetries.base import Symmetry
+from baybe.symmetries.dependency import DependencySymmetry
+from baybe.symmetries.mirror import MirrorSymmetry
+from baybe.symmetries.permutation import PermutationSymmetry
+
+__all__ = [
+    "DependencySymmetry",
+    "MirrorSymmetry",
+    "PermutationSymmetry",
+    "Symmetry",
+]


For other components, we don't expose the base class in the namespace. So for consistency, I suggest:

Suggested change

-from baybe.symmetries.base import Symmetry
 from baybe.symmetries.dependency import DependencySymmetry
 from baybe.symmetries.mirror import MirrorSymmetry
 from baybe.symmetries.permutation import PermutationSymmetry

 __all__ = [
     "DependencySymmetry",
     "MirrorSymmetry",
     "PermutationSymmetry",
-    "Symmetry",
 ]

AdrianSosic · 2026-06-22T07:54:23Z

+            A dataframe with the augmented measurements, including the original
+            ones.


Suggested change

-            A dataframe with the augmented measurements, including the original
-            ones.
+            A dataframe with the augmented measurements, including the original ones.

AdrianSosic · 2026-06-22T08:09:13Z

+    def augment_measurements(
+        self,
+        measurements: pd.DataFrame,
+        parameters: Iterable[Parameter] | None = None,


Looking at your code, I think Sequence is probably the safer choice here, given that you have already (without noticing) broken the Iterable contract by iterating over it multiple times 😬 What do you think?

Suggested change

-        parameters: Iterable[Parameter] | None = None,
+        parameters: Sequence[Parameter] | None = None,
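
The risk with `Iterable` can be reproduced in a few lines. The following sketch uses a made-up helper, `count_twice`, that iterates over its argument twice: a list survives both passes, whereas a generator (which is a perfectly valid `Iterable`) is already exhausted on the second pass.

```python
from collections.abc import Iterable


def count_twice(items: Iterable[str]) -> tuple[int, int]:
    """Iterate over the input twice and report how many items each pass saw."""
    first = sum(1 for _ in items)
    second = sum(1 for _ in items)
    return first, second


names = ["x", "y", "z"]

# A list can be iterated any number of times.
from_list = count_twice(names)

# A generator is a valid Iterable but is consumed by the first pass.
from_generator = count_twice(n for n in names)

print(from_list)       # (3, 3)
print(from_generator)  # (3, 0)
```

Annotating the argument as `Sequence` lets a type checker flag the generator call, so the second pass cannot silently see nothing.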

AdrianSosic · 2026-06-22T08:11:32Z

+            measurements: The dataframe containing the measurements to be
+                augmented.
+            parameters: Optional parameter objects carrying additional information.
+                Only required by specific augmentation implementations.


Suggested change

-            measurements: The dataframe containing the measurements to be
-                augmented.
+            measurements: The dataframe containing the measurements to be augmented.
             parameters: Optional parameter objects carrying additional information.
-                Only required by specific augmentation implementations.
+                Only required by some symmetry classes.

AdrianSosic · 2026-06-22T08:13:57Z

+        parameters_missing = set(self.parameter_names).difference(
+            searchspace.parameter_names
+        )
+        if parameters_missing:
+            raise IncompatibleSearchSpaceError(
+                f"The symmetry of type '{self.__class__.__name__}' was set up with the "
+                f"following parameters that are not present in the search space: "
+                f"{parameters_missing}."
+            )


A bit shorter

Suggested change

-        parameters_missing = set(self.parameter_names).difference(
-            searchspace.parameter_names
-        )
-        if parameters_missing:
+        if missing := set(self.parameter_names) - set(searchspace.parameter_names):
             raise IncompatibleSearchSpaceError(
                 f"The symmetry of type '{self.__class__.__name__}' was set up with the "
                 f"following parameters that are not present in the search space: "
-                f"{parameters_missing}."
+                f"{missing}."
             )
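
For reference, the shortened pattern can be tried in isolation. This is a standalone sketch: the exception class is a local stand-in for the one used in the PR, and `validate_parameter_names` is a hypothetical free function rather than the actual method.

```python
from collections.abc import Iterable


class IncompatibleSearchSpaceError(Exception):
    """Local stand-in for the exception raised in the PR (assumption)."""


def validate_parameter_names(
    symmetry_names: Iterable[str], searchspace_names: Iterable[str]
) -> None:
    """Raise if the symmetry refers to parameters the search space lacks."""
    # The walrus operator binds the set difference and tests it for
    # emptiness in one step; an empty set is falsy.
    if missing := set(symmetry_names) - set(searchspace_names):
        raise IncompatibleSearchSpaceError(
            f"Parameters not present in the search space: {sorted(missing)}."
        )


validate_parameter_names(["a"], ["a", "b"])  # passes silently
```

Note that the `-` operator requires both operands to be sets, which is why the right-hand side is wrapped in `set(...)`, whereas `.difference()` accepts any iterable.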

AdrianSosic · 2026-06-22T08:20:15Z

 - Interpoint constraints for continuous search spaces
 - Transfer learning benchmarks for shifted and inverted Hartmann functions
 - Coding convention instructions for agentic developers (`AGENTS.md`, `CLAUDE.md`)
+- Symmetry classes (`PermutationSymmetry`, `MirrorSymmetry`, `DependencySymmetry`)


Maybe add the base class? You could even mention the symmetry framework as a whole first and only then mention the classes, because it is not clear from the bullet that the framework itself is new (you cannot distinguish it from the case where we later add a fourth class).

AdrianSosic

And some more

AdrianSosic · 2026-06-22T08:39:43Z

+    condition: Condition = field(validator=instance_of(Condition))
+    """The condition specifying the active range of the causing parameter."""
+
+    affected_parameter_names: tuple[str, ...] = field(


I'm still puzzled why you chose the terminology causing and affected when the class is named Dependent... and you talk about dependent parameters in the docstring. Wouldn't it make much more sense to call the attribute dependent_... and correspondingly use independent for the other parameter types, which furthermore avoids the clash with the actual causal terminology from causal modeling?

AdrianSosic · 2026-06-22T08:40:16Z

+    """The condition specifying the active range of the causing parameter."""
+
+    affected_parameter_names: tuple[str, ...] = field(
+        converter=Converter(  # type: ignore[misc,call-overload]  # mypy: Converter


Again: what is this mypy comment?

AdrianSosic · 2026-06-22T08:43:28Z

+    """The parameters affected by the dependency."""
+
+    n_discretization_points: int = field(
+        default=3, validator=(instance_of(int), ge(2)), kw_only=True


I can understand that you want to make the concept usable in continuous spaces, but I would suggest not choosing any default then. I think we all agree that there is no obvious "default" that one could possibly choose to approximate a continuous function well with discrete points, and the switch from discrete to continuous is one that moves from an exact representation of the symmetry to an approximate one. So I think this is not something that should be hidden from the user but something they should explicitly opt in to, don't you agree?

AdrianSosic · 2026-06-22T08:44:27Z

+        parameters: Iterable[Parameter] | None = None,
+    ) -> pd.DataFrame:
+        # See base class.
+        if not self.use_data_augmentation:


Since this check has to happen in each subclass, let's perhaps turn it into the template pattern?
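
A rough sketch of the template pattern for this case, using toy classes and plain lists of dicts instead of dataframes so that it runs on its own. The class names `ToySymmetry` and `ToyMirrorSymmetry` and the hook name `_augment` are illustrative assumptions; only `use_data_augmentation` and `augment_measurements` are taken from the diff.

```python
from abc import ABC, abstractmethod

Row = dict[str, float]


class ToySymmetry(ABC):
    """Illustrative base class, not the actual baybe implementation."""

    def __init__(self, use_data_augmentation: bool = True) -> None:
        self.use_data_augmentation = use_data_augmentation

    def augment_measurements(self, rows: list[Row]) -> list[Row]:
        """Template method: the shared guard lives here, exactly once."""
        if not self.use_data_augmentation:
            return rows
        return self._augment(rows)

    @abstractmethod
    def _augment(self, rows: list[Row]) -> list[Row]:
        """Hook containing only the symmetry-specific logic."""


class ToyMirrorSymmetry(ToySymmetry):
    """Mirrors the column ``x`` around zero (toy example)."""

    def _augment(self, rows: list[Row]) -> list[Row]:
        # Points on the mirror plane have no distinct mirror image.
        mirrored = [{**r, "x": -r["x"]} for r in rows if r["x"] != 0]
        return rows + mirrored


rows = [{"x": 1.0, "y": 2.0}, {"x": 0.0, "y": 3.0}]
print(ToyMirrorSymmetry().augment_measurements(rows))
print(ToyMirrorSymmetry(use_data_augmentation=False).augment_measurements(rows))
```

With this layout, a new subclass cannot forget the flag check because it only overrides the hook.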

AdrianSosic · 2026-06-22T08:47:32Z

+        # values that are not active, as rows containing them should be
+        # augmented.
+        param = next(
+            cast(DiscreteParameter, p)


Why a cast? I think this needs to be a proper validation instead, no?
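
To illustrate the difference, here is a small sketch with stand-in classes (they only mimic the names from the diff and are not the baybe parameter classes; `as_discrete` is a hypothetical helper). `typing.cast` has no runtime effect, so a wrong type passes through unnoticed, while an `isinstance` check fails loudly.

```python
from typing import cast


class Parameter:
    """Stand-in base class (assumption, not the baybe class)."""


class DiscreteParameter(Parameter):
    """Stand-in for a discrete parameter."""


class ContinuousParameter(Parameter):
    """Stand-in for a continuous parameter."""


def as_discrete(p: Parameter) -> DiscreteParameter:
    """Validate at runtime instead of merely informing the type checker."""
    if not isinstance(p, DiscreteParameter):
        raise TypeError(
            f"Expected a discrete parameter, got '{type(p).__name__}'."
        )
    return p


p = ContinuousParameter()

# cast() returns its argument unchanged and performs no check at all.
unchecked = cast(DiscreteParameter, p)
print(type(unchecked).__name__)  # ContinuousParameter

# The validating helper rejects the same object.
try:
    as_discrete(p)
except TypeError as exc:
    print(exc)
```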

AdrianSosic · 2026-06-22T10:06:43Z

Let's take this opportunity to make all the df function arguments in this file positional-only.
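
As a reminder of the syntax, a brief sketch with a made-up utility (`apply_toy_scaling` is hypothetical and works on a list of dicts rather than a dataframe): everything before the `/` in the signature can only be passed positionally, so the argument name stops being part of the public interface.

```python
def apply_toy_scaling(df: list[dict], /, factor: int = 1) -> list[dict]:
    """Hypothetical utility; ``df`` is positional-only because of the ``/``."""
    return [{key: value * factor for key, value in row.items()} for row in df]


data = [{"a": 1, "b": 2}]

# Positional use works as usual; later arguments may still be keywords.
scaled = apply_toy_scaling(data, factor=3)
print(scaled)

# Passing the positional-only argument by keyword raises a TypeError.
try:
    apply_toy_scaling(df=data)
except TypeError as exc:
    print(f"Rejected: {exc}")
```

This makes it possible to rename such an argument later without breaking callers.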

AdrianSosic · 2026-06-22T10:09:38Z

 def df_apply_permutation_augmentation(
    df: pd.DataFrame,
-    column_groups: Sequence[Sequence[str]],
+    permutation_groups: Sequence[Sequence[str]],


I vaguely remember you and @AVHopp having a discussion about this but: do we really want to change the semantics of this function? I know there is a changelog entry for it, but still: this is strictly speaking the worst case of breaking change. The function has the same name, same number of arguments, and same argument types, same return type, but will silently do something very different from the original version --> silent bug.

I'm asking because there is no real need for this change in the first place!?

AdrianSosic · 2026-06-22T10:15:18Z

This is probably minor, but still: Do we want to make any claims about the ordering of dataframe content produced by the utilities in this function? Right now, some append the augmented values, while others squeeze them in.

AdrianSosic · 2026-06-22T10:20:42Z

+
+    Note:
+        This function does not consider constraints and might provide unexpected or
+        invalid data if certain constraints are present.


IMO this is more confusing than not having it. On this level, the concept of constraints does not even exist. The situation would be different if one of the arguments was a SearchSpace and the function was silently ignoring its constraints, but this is not the case here.

Suggested change

-    Note:
-        This function does not consider constraints and might provide unexpected or
-        invalid data if certain constraints are present.

AdrianSosic · 2026-06-22T10:27:28Z

+        )
+
+
+def validate_unique_values(  # noqa: DOC101, DOC103


We now have two validate_unique_values functions. I think you forgot to delete the existing one?

Scienfitz self-assigned this Aug 21, 2025

Scienfitz added the new feature New functionality label Aug 21, 2025

Scienfitz linked an issue Aug 21, 2025 that may be closed by this pull request

Add Data Augmentation for Invariant Contraints #621

Open

Scienfitz mentioned this pull request Aug 24, 2025

Data Input With Invariant Parameters #291

Closed

Scienfitz force-pushed the feature/invariance_augmentation branch from 00cfef8 to 5ba4cb1 Compare September 10, 2025 08:10

Scienfitz force-pushed the feature/invariance_augmentation branch 4 times, most recently from db98f64 to aedafa7 Compare September 25, 2025 10:41

Scienfitz marked this pull request as ready for review September 25, 2025 11:01

Scienfitz requested review from AVHopp and AdrianSosic as code owners September 25, 2025 11:01

Copilot AI review requested due to automatic review settings September 25, 2025 11:01

Copilot AI reviewed Sep 25, 2025

Comment thread baybe/surrogates/gaussian_process/core.py Outdated

Comment thread tests/test_measurement_augmentation.py Outdated

Comment thread docs/scripts/build_examples.py

Scienfitz force-pushed the feature/invariance_augmentation branch from 583b7be to 98c29e8 Compare September 25, 2025 11:10

Scienfitz marked this pull request as draft September 30, 2025 11:26

Scienfitz changed the title ~~Add Auto-Augmentation of Measurements in the Presence of Invariance Constraints~~ Symmetry and Data Augmentation Oct 9, 2025

Scienfitz force-pushed the feature/invariance_augmentation branch from 98c29e8 to 0e05880 Compare October 24, 2025 18:12

Scienfitz force-pushed the feature/invariance_augmentation branch from 46bc49c to 859ca3b Compare October 31, 2025 18:23

Scienfitz commented Nov 3, 2025

Scienfitz marked this pull request as ready for review November 3, 2025 17:31

Scienfitz requested a review from Copilot November 4, 2025 08:59

Copilot AI reviewed Nov 4, 2025

Scienfitz requested a review from Copilot November 4, 2025 11:55

Copilot AI mentioned this pull request Nov 4, 2025

Fix max_value calculation in permutation_symmetries strategy #688

Closed

Scienfitz and others added 23 commits April 30, 2026 21:01

Update permutation augmentation utility interface

0c8dd26

Add mirror augmentation utility

388f81a

Add Symmetry domain model

61e9a49

Add Parameter.is_equivalent and apply in PermutationSymmetry

f612319

Integrate symmetries into surrogates and recommenders

a982cb8

Update constraints for symmetry support

bccda12

Add hypothesis strategies for symmetries and conditions

0d7da57

Add symmetry tests

756809b

Add symmetry documentation

6f4855f

Add symmetry example

e221574

Handle CompositeSurrogate in symmetry integration

b1f704b

The _autoreplicate converter on main wraps surrogates in a CompositeSurrogate. Access the inner template for symmetry validation and augmentation.

Fix mypy errors in categorical validator and dependency type ignore

9c894a9

Add symmetry validation tests

2dac5a4

Update CHANGELOG

f853266

Replace deprecated set_random_seed with Settings in example

3247695

Fix Sphinx cross-references for symmetry classes

dda3997

Use full module paths (e.g., baybe.symmetries.base.Symmetry) instead of short paths via __init__.py re-exports, which Sphinx cannot resolve.

Fix bug in permutation constraint

08a09b8

`DiscretePermutationInvarianceConstraint` was always internally applying a DiscreteNoLabelDuplicates constraint to remove the diagonal elements, which is not correct and can always be achieved separately by explicitly using `DiscreteNoLabelDuplicates`

Improve docstring

1adf5c1

Co-authored-by: Alexander V. Hopp <alexander.hopp@merckgroup.com>

Add docstrings to to_symmetries and to_symmetry methods

180521c

Add use_data_augmentation to symmetry summary

fe9cb4c

Rework imports

88035b3

Improve example

dbb2693

Rename partial functions

19af1b3

Scienfitz force-pushed the feature/invariance_augmentation branch from eb0e90a to 19af1b3 Compare April 30, 2026 19:02

Scienfitz changed the base branch from dev/symmetry to main April 30, 2026 19:02

Scienfitz changed the base branch from main to dev/symmetry April 30, 2026 19:03

Scienfitz requested a review from AdrianSosic May 7, 2026 18:12

AdrianSosic requested changes Jun 22, 2026
