Building robust ML baselines with `tabular_pipeline()`

`tabular_pipeline()` takes a scikit-learn estimator and returns a complete scikit-learn `Pipeline` made of a `TableVectorizer` followed by the given estimator. If the estimator is a linear model (e.g., `Ridge`, `LogisticRegression`), `tabular_pipeline()` also inserts a `SimpleImputer` and a `SquashingScaler` between the `TableVectorizer` and the estimator.

>>> from sklearn.linear_model import LinearRegression
>>> from skrub import tabular_pipeline
>>> tabular_pipeline(LinearRegression())
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('squashingscaler', SquashingScaler(max_absolute_value=5)),
                ('linearregression', LinearRegression())])

The function can also be called with the name of the task instead of an estimator. Passing "regression" or "regressor" builds a pipeline around a HistGradientBoostingRegressor, and passing "classification" or "classifier" builds one around a HistGradientBoostingClassifier.

>>> from skrub import tabular_pipeline
>>> tabular_pipeline("regression")
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(...)),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(...))])

The pipeline prepared by tabular_pipeline() is a strong first baseline for most problems, but may not beat properly tuned ad-hoc pipelines.

Parameter values chosen for `TableVectorizer` by `tabular_pipeline()`

| Parameter | `RandomForest` models | `HistGradientBoosting` models | Linear models and others |
|---|---|---|---|
| Low-cardinality encoder | `OrdinalEncoder` | Native support ⁽¹⁾ | `OneHotEncoder` |
| High-cardinality encoder | `StringEncoder` | `StringEncoder` | `StringEncoder` |
| Numeric preprocessor | No processing | No processing | `SquashingScaler` |
| Date preprocessor | `DatetimeEncoder` | `DatetimeEncoder` | `DatetimeEncoder` with spline encoding |
| Missing value strategy | Native support ⁽²⁾ | Native support | `SimpleImputer` |

Note

⁽¹⁾ If the installed scikit-learn version is lower than 1.4, OrdinalEncoder is used instead, since native support for categorical features is not available.

⁽²⁾ If the installed scikit-learn version is lower than 1.4, SimpleImputer is used instead, since native support for missing values is not available.
