Strict dataframe validation #830

Description

@AdrianSosic

Two problems:

  • Most dataframe validation is currently rather weak: it only checks the columns relevant to the respective context. The presence or absence of other columns is not explicitly checked, which can lead to severe silent bugs, e.g. when information that is part of the parameter configuration is not processed, or when information that is not part of it is processed anyway. One arbitrary example is shown below, but more such paths exist.
  • Validation errors are raised eagerly, i.e. execution stops at the first violation encountered, so any further violations go unreported.

Envisioned solution:

  • By default, all validation should be conservative / strict, with opt-out possibilities. In other places, we already have the allow_extra/allow_missing flags, which default to False. We should either roll out this approach consistently across the code base or refine it further.
  • Use ExceptionGroups wherever appropriate. For example, if there are both missing and extra columns, all violations should be collected and reported simultaneously.

Probably, this requires some refactoring / cleanup of the validation logic, centralizing the checks into clean single-responsibility utility functions.

Minimal Example

import pandas as pd

from baybe import Campaign
from baybe.objectives import SingleTargetObjective
from baybe.parameters import CategoricalParameter, NumericalDiscreteParameter
from baybe.searchspace import SearchSpace
from baybe.targets import NumericalTarget

parameters = [
    NumericalDiscreteParameter("Temperature", [300, 350, 400]),
    CategoricalParameter("Catalyst", ["A", "B", "C"]),
]
df_space = pd.DataFrame({"Temperature": [300, 350, 400], "Catalyst": ["A", "B", "C"]})
searchspace = SearchSpace.from_dataframe(df_space, parameters)
objective = SingleTargetObjective(target=NumericalTarget(name="Yield", mode="MAX"))
campaign = Campaign(searchspace=searchspace, objective=objective)

batch = pd.DataFrame(
    {
        "Temperature": [300, 350],
        "Catalyst": ["A", "B"],
        "Yield": [0.72, 0.85],
        "Notes": ["run ok", "re-run"],  # <-- extra column: not a parameter or target
    }
)

campaign.add_measurements(batch)
assert "Notes" in campaign._measurements_exp  # <-- passes but should not


Labels: enhancement (Expand / change existing functionality)
