Two problems:
- Most dataframe validation is currently rather weak: it only checks the columns relevant to the respective context. The presence or absence of other columns is not explicitly addressed, which can lead to severe silent bugs in certain cases, e.g. when information that is part of the parameter configuration is not considered processed, or information that is not part of it is. One such arbitrary example is shown below, but more paths exist.
- Validation errors are raised eagerly, i.e. execution stops at the first violation encountered.
Envisioned solution:
- By default, all validation should be conservative / strict, with opt-out possibilities. In other places, we already have the `allow_extra`/`allow_missing` flags, which default to `False`. We should either roll out this approach consistently across the code base or refine it further.
- Use `ExceptionGroup`s wherever appropriate. For example, if there are both missing and extra columns, all violations should be collected and reported simultaneously.
Probably, this requires some refactoring / cleanup of the validation logic, centralizing the checks into clean single-responsibility utility functions.
Minimal Example
import pandas as pd
from baybe import Campaign
from baybe.objectives import SingleTargetObjective
from baybe.parameters import CategoricalParameter, NumericalDiscreteParameter
from baybe.searchspace import SearchSpace
from baybe.targets import NumericalTarget
parameters = [
    NumericalDiscreteParameter("Temperature", [300, 350, 400]),
    CategoricalParameter("Catalyst", ["A", "B", "C"]),
]
df_space = pd.DataFrame({"Temperature": [300, 350, 400], "Catalyst": ["A", "B", "C"]})
searchspace = SearchSpace.from_dataframe(df_space, parameters)
objective = SingleTargetObjective(target=NumericalTarget(name="Yield", mode="MAX"))
campaign = Campaign(searchspace=searchspace, objective=objective)
batch = pd.DataFrame(
    {
        "Temperature": [300, 350],
        "Catalyst": ["A", "B"],
        "Yield": [0.72, 0.85],
        "Notes": ["run ok", "re-run"],  # <-- extra column: not a parameter or target
    }
)
campaign.add_measurements(batch)
assert "Notes" in campaign._measurements_exp # <-- passes but should not