Dataframe backend lock-in #795

Description

@AdrianSosic

Problem

Pandas is hardcoded throughout SubspaceDiscrete. The class stores exp_rep and comp_rep as pd.DataFrame, constructs them using pandas operations, and returns them as pandas objects. All downstream consumers (recommenders, constraints, Campaign) assume pandas.

Polars support was bolted on and is controlled by the BAYBE_DEACTIVATE_POLARS environment variable, but it is only partial:

  • Polars is used for some internal construction steps, then collected back to pandas.
  • Row ordering differs between the pandas and Polars code paths, so the same search space can come back in a different row order depending on which backend is active.
  • There is no abstraction over the tabular type at the API boundary — callers must work with pandas regardless of their preferred backend.
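To make the row-ordering point concrete, here is a minimal, backend-free sketch. It does not use BayBE, pandas, or Polars. The parameter names and the helper functions are hypothetical, and the two builders only stand in for two code paths that enumerate the same Cartesian product in a different order. The assumption is that a canonical sort (or an equivalent explicit ordering contract) is one way to make the result independent of the construction path.

```python
from itertools import product

# Hypothetical parameter values, chosen only for illustration.
params = {"temperature": [10, 20], "solvent": ["A", "B"]}


def rows_first_param_slowest(params):
    """Cartesian product in which the first parameter varies slowest."""
    names = list(params)
    return [dict(zip(names, combo)) for combo in product(*params.values())]


def rows_first_param_fastest(params):
    """Same set of rows, but the first parameter varies fastest."""
    names = list(params)
    rev_names = names[::-1]
    rows = []
    for combo in product(*(params[n] for n in rev_names)):
        by_name = dict(zip(rev_names, combo))
        # Restore the original column order so only the row order differs.
        rows.append({n: by_name[n] for n in names})
    return rows


def canonical(rows):
    """Sort rows by their column values to get a path-independent order."""
    return sorted(rows, key=lambda r: tuple(r.values()))


a = rows_first_param_slowest(params)
b = rows_first_param_fastest(params)

print(a == b)                        # False: same rows, different order
print(canonical(a) == canonical(b))  # True: identical after canonical sort
```

Anything that indexes rows by position (for example, mapping a recommendation back to a row number) would see different rows at the same index in `a` and `b`, which is the kind of subtle bug described above.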

Why it matters

  • Users who prefer Polars or PyArrow are forced to convert at every interaction with BayBE's search space API.
  • Row ordering inconsistencies between backends create subtle bugs that are difficult to diagnose.
  • True lazy evaluation is impossible with pandas as the internal representation. Pandas executes eagerly; wrapping it with a lazy abstraction does not make it lazy. This blocks the performance gains needed for large spaces (see sub-issue "Eager search space materialization" #796).
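The eager-versus-lazy distinction in the last bullet can be illustrated with plain Python. This is only a sketch under assumed names (`make_row`, `built`, `levels` are hypothetical) and says nothing about how BayBE builds its search space. A counter records how many rows are actually constructed when a consumer asks for just the first five.

```python
from itertools import islice, product

# Counts how many rows each strategy actually constructs.
built = {"eager": 0, "lazy": 0}


def make_row(kind, combo):
    """Stand-in for the per-row work of building a search space."""
    built[kind] += 1
    return combo


# Three hypothetical parameters with 10 levels each: 1000 combinations.
levels = [range(10)] * 3

# Eager: every row is built up front. Wrapping the finished list in an
# iterator afterwards does not undo that work.
eager_rows = [make_row("eager", c) for c in product(*levels)]
eager_view = iter(eager_rows)
first_eager = list(islice(eager_view, 5))

# Lazy: rows are built only when the consumer pulls them.
lazy_rows = (make_row("lazy", c) for c in product(*levels))
first_lazy = list(islice(lazy_rows, 5))

print(first_eager == first_lazy)  # True: both yield the same first rows
print(built)                      # {'eager': 1000, 'lazy': 5}
```

Both consumers see the same five rows, but the eager path has already paid for all 1000. An eagerly materialized DataFrame behind a lazy-looking interface is in the same position as `eager_view` here.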
