Strong baseline pipelines#

TableVectorizer#

In tabular machine learning pipelines, practitioners often convert categorical features to numerical features using various encodings (OneHotEncoder, OrdinalEncoder, etc.).

The TableVectorizer inspects the data type of each column and maps it to an appropriate encoder, producing numeric features suitable for machine learning models.

More precisely, the TableVectorizer maps columns to one of the following four groups by default:

  • High-cardinality categorical columns: StringEncoder

  • Low-cardinality categorical columns: scikit-learn OneHotEncoder

  • Numerical columns: “passthrough” (no transformation)

  • Datetime columns: DatetimeEncoder

High-cardinality categorical columns are those with more than 40 unique values; all other categorical columns are considered low-cardinality. This threshold can be changed through the cardinality_threshold parameter of TableVectorizer.

To change the encoder or alter default parameters, instantiate an encoder and pass it to TableVectorizer.

>>> from skrub import TableVectorizer, DatetimeEncoder, TextEncoder
>>> datetime_enc = DatetimeEncoder(periodic_encoding="circular")
>>> text_enc = TextEncoder()
>>> table_vec = TableVectorizer(datetime=datetime_enc, high_cardinality=text_enc)

The TableVectorizer is used in Encoding: from a dataframe to a numerical matrix for machine learning, while the docstring of the class provides more details on the parameters and usage, as well as various examples.

tabular_pipeline()#

tabular_pipeline() is a function that takes a scikit-learn estimator or the name of a task (regression/regressor, classification/classifier) and returns a full scikit-learn pipeline containing a TableVectorizer followed by the given estimator. If only the name of the task is given, a HistGradientBoostingRegressor or HistGradientBoostingClassifier is used.

>>> from skrub import tabular_pipeline
>>> tabular_pipeline("regression")
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(...)),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])
>>> from sklearn.linear_model import LinearRegression
>>> tabular_pipeline(LinearRegression())
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

If the estimator is a linear model (e.g., Ridge, LogisticRegression), tabular_pipeline() adds a StandardScaler and a SimpleImputer to the pipeline. The pipeline prepared by tabular_pipeline() is a strong first baseline for most problems, but may not beat properly tuned ad-hoc pipelines.

TableVectorizer parameter choices made by tabular_pipeline()#

|                          | RandomForest models | HistGradientBoosting models | Linear models and others             |
|--------------------------|---------------------|-----------------------------|--------------------------------------|
| Low-cardinality encoder  | OrdinalEncoder      | Native support (1)          | OneHotEncoder                        |
| High-cardinality encoder | StringEncoder       | StringEncoder               | StringEncoder                        |
| Numerical preprocessor   | No processing       | No processing               | StandardScaler                       |
| Date preprocessor        | DatetimeEncoder     | DatetimeEncoder             | DatetimeEncoder with spline encoding |
| Missing value strategy   | Native support (2)  | Native support              | SimpleImputer                        |

Note

(1) If the installed scikit-learn version is lower than 1.4, OrdinalEncoder is used instead, since native support for categorical features is not available.

(2) If the installed scikit-learn version is lower than 1.4, SimpleImputer is used instead, since native support for missing values is not available.