- Expose arrow streams
- Preserve metadata (roles) of getml dataframes when converting to pandas, arrow, parquet
- Support python 3.13
- More efficient build chain
- Expose `GETML_CMAKE_FRESH_PRESET`
- Use uv for env management
- Introduce mise
- Fix slow parquet writing (by using the exposed arrow stream instead of materializing batches as arrow tables on the engine side)
- Fix XGBoost deprecation warnings
- Support numpy>=2.0.0
- Fix parquet reader for parquet files containing custom metadata
- Set unit when setting time_stamp role directly in `from_...` factories
- Overhaul and better integration of API documentation and web page:
- Switch from sphinx to mkdocs
- Restructuring of User Guide, multiple amendments to documentation
- Introduce strict typing regimen for feature learning aggregations and loss functions
- Clean up and maintenance of example notebooks, make them executable in Colab
- More informative progress bar and status updates using rich
- Completely reworked IO
- Leveraging PyArrow to improve reliability, speed and maintainability
- Introduce reflect-cpp for parsing and de/serialization
- Introduce overhauled getML Docker runtime available from Docker Hub, allowing for easy setup
- See docker-related section of the new getML documentation for details
- Complete rework of the build pipeline (docker and linux native)
- Ruff for linting and formatting
- Hatch for python package management
- Generalization of `Placeholder.join`'s `on` argument
- Improved timestamp handling
- Slicing improvements
- Slicing of `DataFrames` returned wrong results: Remove short circuit for slices with upper bound
- Introduce set semantics for slicing of `DataFrame` (return empty collections instead of erroring)
- Fix displaying of parameter lists with values that exceed the presentable width
- Fix displaying of `DataFrames` with one row or less
- Fix progress bar output on Google Colab
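
The set semantics for slicing noted above (an out-of-range slice returns an empty collection instead of raising) can be illustrated with a minimal sketch. `ColumnStore` is a hypothetical stand-in written for this example; it is not getML's `DataFrame` and makes no claim about how the engine implements slicing.

```python
class ColumnStore:
    """Hypothetical table-like container (assumption; not getML's DataFrame)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices clamps start and stop to the valid range, so a
            # slice that lies entirely past the end selects nothing.
            start, stop, step = key.indices(len(self._rows))
            return ColumnStore(self._rows[i] for i in range(start, stop, step))
        raise TypeError("only slices are supported in this sketch")

    def to_list(self):
        return list(self._rows)


store = ColumnStore(["a", "b", "c"])
in_range = store[0:2].to_list()   # rows 0 and 1
past_end = store[5:10].to_list()  # empty, no error raised
```

This mirrors how Python's built-in sequences treat slices, which is presumably the behavior users expect from a frame-like object.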
- Accelerated feature learning through Fastboost
- Improved modelling on huge datasets through ScaleGBMClassifier and ScaleGBMRegressor
- Advanced trend aggregations using EWMATrend aggregations
- Faster JSON parsing using YYJSON
- Minor bugfixes
- Implement `tqdm` for progress bars
- Minor bugfixes
- Use websockets instead of polling
- Size threshold for better visualization of feature code
- Faster reading of memory-mapped data, relevant for all feature learners and predictors
- Introduce CategoryTrimmer as preprocessor
- Support for SQL transpilation: TSQL, Postgres, MySQL, BigQuery, Spark
- Support for memory mapping
- Enhance data processing by introducing Spark (e.g. spark_sql) and Arrow (e.g. from_arrow())
- Integrate Vcpkg for dependency management
- Improve code transpilation for seasonal variables
- Better control of predictor training and hyperparameter optimization through introduction of early stopping (e.g. in ScaleGBMClassifier)
- Introduce TREND aggregation
- Better progress logging
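
The early stopping mentioned above follows a general pattern that can be sketched independently of any predictor. The function name `early_stopping_round` and the `patience` parameter are assumptions made for this illustration; they are not taken from the getML API, and the sketch does not describe how ScaleGBMClassifier implements the feature.

```python
def early_stopping_round(validation_losses, patience=3):
    """Return (best_round, best_loss) for a sequence of validation losses.

    Iteration stops once the loss has failed to improve for `patience`
    consecutive rounds (hypothetical parameter, chosen for illustration).
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(validation_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience window.
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop training.
            break
    return best_round, best_loss


# The loss stalls after round 2, so later rounds are never evaluated.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.2]
best_round, best_loss = early_stopping_round(losses, patience=3)
```

Note the trade-off visible in the example data: a small `patience` saves training time but can miss a late improvement such as the final value in `losses`.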
- Introduction of Containers for data storage
- Complete overhaul of the API including Views, StarSchema, TimeSeries
- Add subroles for fine grained data control
- Improved model evaluation through Plots and Scores container
- Introduce slicing of Views
- Add datetime() utility
- Add the Mapping and TextFieldSplitter preprocessors
- Add the Fastprop feature learner
- Overhaul the way RelMT and Relboost generate features, making them more efficient
- Significant improvement of project management:
- project.restart(), project.suspend(), and project.switch()
- multiple project support
- Add custom `__getattr__` and `__dir__` methods to DataFrame, enabling column retrieval through autocomplete
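
How custom `__getattr__` and `__dir__` methods enable column retrieval through autocomplete can be shown with a small sketch. The `Frame` class below is hypothetical and only demonstrates the Python mechanism; getML's actual `DataFrame` implementation may differ.

```python
class Frame:
    """Hypothetical minimal frame (assumption; not getML's DataFrame)."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        # Private names are rejected early to avoid infinite recursion
        # if _columns has not been set yet (e.g. during unpickling).
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(f"no column named {name!r}") from None

    def __dir__(self):
        # Advertise column names next to the regular attributes so that
        # interactive shells can offer them as completions.
        return sorted(set(super().__dir__()) | set(self._columns))


frame = Frame({"price": [1.0, 2.5], "region": ["north", "south"]})
prices = frame.price            # resolved through __getattr__
completions = dir(frame)        # includes "price" and "region"
```

Raising `AttributeError` for unknown names keeps `hasattr` and `getattr` with a default working as usual.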
- Introduce new feature learner:
- RelMTModel [now RelMT],
- RelMTTimeSeries [now integrated in TimeSeries]
- Extend dataframe handling: delete(), exists()
- Data set provisioning: load_air_pollution(), load_atherosclerosis(), load_biodegradability(), load_consumer_expenditures(), load_interstate94(), load_loans(), load_occupancy()
- High-level hyperopt handlers: tune_feature_learners(), tune_predictors()
- Improve pipeline functionality: delete(), exists(), Columns
- Introduce preprocessors: EmailDomain, Imputation, Seasonal, Substring
- Add pipeline functionality: Pipeline, list_pipelines(), Features, Metrics, SQLCode, Scores
- Better control of hyperparameter optimization: burn_in, kernels, optimization
- Handling of time stamps: time
- Improve database I/O: connect_odbc(), copy_table(), list_connections(), read_s3(), sniff_s3()
- Enable S3 access: set_s3_access_key_id(), set_s3_secret_access_key()
- New Feature Learner: MultirelTimeSeries, RelboostTimeSeries [now both integrated in TimeSeries]
- Add XGBoostClassifier and XGBoostRegressor for improved predictive power
- Overhaul of documentation
- Introduction of "getML in one minute" (now Quickstart) and "How to use this guide" (now User Guide)
- Introduction of User Guide (now Concepts) to include data annotation, feature engineering, hyperparameter optimization and more
- Integration with additional databases like Greenplum, MariaDB, MySQL, and extended PostgreSQL support
- Include hotfix for new domain getml.com
- Rework hyperopt design and handling, added load_hyperopt()
- Improved dataframe handling: add to_placeholder() and nrows()
- Rename Autosql to Multirel
- Boolean and categorical columns: Add support for boolean columns and operators, along with enhanced categorical column handling.
- Introduce API improvements: fitting, saving/loading of models, data transformation
- Add support for various aggregation functions such as MEDIAN, VAR, STDDEV, and COUNT_DISTINCT
- Move from closed beta to pip
- Introduce basic hyperopt algorithms: LatinHypercubeSearch, RandomSearch
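
The difference between the two basic search strategies named above can be sketched in plain Python. The functions below illustrate the general techniques only. Their names, signatures, and the example parameter names `learning_rate` and `subsample` are assumptions for this sketch and do not reflect the getML `LatinHypercubeSearch` or `RandomSearch` classes.

```python
import random


def random_search(bounds, n_samples, rng):
    """Draw every parameter independently and uniformly within its bounds."""
    return [
        {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        for _ in range(n_samples)
    ]


def latin_hypercube_search(bounds, n_samples, rng):
    """Split each parameter range into n_samples strata and use each once."""
    samples = [{} for _ in range(n_samples)]
    for name, (low, high) in bounds.items():
        width = (high - low) / n_samples
        strata = list(range(n_samples))
        rng.shuffle(strata)  # pair strata randomly across parameters
        for sample, stratum in zip(samples, strata):
            # Pick a random point inside the assigned stratum.
            sample[name] = low + (stratum + rng.random()) * width
    return samples


rng = random.Random(0)
# Hypothetical search space, chosen only for illustration.
bounds = {"learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)}
lhs_points = latin_hypercube_search(bounds, 5, rng)
random_points = random_search(bounds, 5, rng)
```

Latin hypercube sampling guarantees that every stratum of every parameter is covered exactly once, whereas purely random draws can cluster and leave parts of the range unexplored.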