Building robust ML baselines with tabular_pipeline()

tabular_pipeline() takes a scikit-learn estimator and returns a complete scikit-learn Pipeline in which a TableVectorizer is followed by that estimator. If the estimator is a linear model (e.g., Ridge, LogisticRegression), tabular_pipeline() also inserts a SimpleImputer and a StandardScaler between the TableVectorizer and the estimator.

>>> from sklearn.linear_model import LinearRegression
>>> from skrub import tabular_pipeline
>>> tabular_pipeline(LinearRegression())
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

The function also accepts the name of the task as a string ("regression" or "regressor", "classification" or "classifier"). In that case it builds a pipeline that uses a HistGradientBoostingRegressor or a HistGradientBoostingClassifier, respectively.

>>> from skrub import tabular_pipeline
>>> tabular_pipeline("regression")
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(...)),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])

The pipeline prepared by tabular_pipeline() is a strong first baseline for most problems, but may not beat properly tuned ad-hoc pipelines.

TableVectorizer parameter values chosen by tabular_pipeline()

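========================  ===================  ===========================  ====================================
Parameter                 RandomForest models  HistGradientBoosting models  Linear models and others
========================  ===================  ===========================  ====================================
Low-cardinality encoder   OrdinalEncoder       Native support (1)           OneHotEncoder
High-cardinality encoder  StringEncoder        StringEncoder                StringEncoder
Numeric preprocessor      No processing        No processing                StandardScaler
Date preprocessor         DatetimeEncoder      DatetimeEncoder              DatetimeEncoder with spline encoding
Missing value strategy    Native support (2)   Native support               SimpleImputer
========================  ===================  ===========================  ====================================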

Note

(1) If the installed scikit-learn version is lower than 1.4, OrdinalEncoder is used instead, since native support for categorical features is not available.

(2) If the installed scikit-learn version is lower than 1.4, SimpleImputer is used instead, since native support for missing values is not available.