Strong baseline pipelines#

TableVectorizer#

In tabular machine learning pipelines, practitioners often convert categorical features to numerical features using various encodings (OneHotEncoder, OrdinalEncoder, etc.).

The TableVectorizer inspects the data type of each column and maps it to an appropriate encoder, producing numeric features suitable for machine learning models.

More precisely, the TableVectorizer maps columns to one of the following four groups by default:

  • High-cardinality categorical columns: StringEncoder

  • Low-cardinality categorical columns: scikit-learn OneHotEncoder

  • Numerical columns: “passthrough” (no transformation)

  • Datetime columns: DatetimeEncoder

High-cardinality categorical columns are those with more than 40 unique values; all other categorical columns are considered low-cardinality. This threshold can be changed through the cardinality_threshold parameter of TableVectorizer.

To change the encoder or alter default parameters, instantiate an encoder and pass it to TableVectorizer.

>>> from skrub import TableVectorizer, DatetimeEncoder, TextEncoder
>>> datetime_enc = DatetimeEncoder(periodic_encoding="circular")
>>> text_enc = TextEncoder()
>>> table_vec = TableVectorizer(datetime=datetime_enc, high_cardinality=text_enc)

The TableVectorizer is used in Encoding: from a dataframe to a numerical matrix for machine learning, while the docstring of the class provides more details on the parameters and usage, as well as various examples.

tabular_pipeline()#

tabular_pipeline() is a function that takes a scikit-learn estimator or the name of a task (regression/regressor, classification/classifier) and returns a full scikit-learn pipeline containing a TableVectorizer followed by the given estimator. If only the name of the task is given, a HistGradientBoostingRegressor or HistGradientBoostingClassifier is used.

>>> from skrub import tabular_pipeline
>>> tabular_pipeline("regression")
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(...)),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])
>>> from sklearn.linear_model import LinearRegression
>>> tabular_pipeline(LinearRegression())
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

If the estimator is a linear model (e.g., Ridge, LogisticRegression), tabular_pipeline() adds a StandardScaler and a SimpleImputer to the pipeline. The pipeline prepared by tabular_pipeline() is a strong first baseline for most problems, but may not beat properly tuned ad-hoc pipelines.

TableVectorizer parameter choices made by tabular_pipeline()#

|                          | RandomForest models | HistGradientBoosting models | Linear models and others             |
|--------------------------|---------------------|-----------------------------|--------------------------------------|
| Low-cardinality encoder  | OrdinalEncoder      | Native support (1)          | OneHotEncoder                        |
| High-cardinality encoder | StringEncoder       | StringEncoder               | StringEncoder                        |
| Numerical preprocessor   | No processing       | No processing               | StandardScaler                       |
| Date preprocessor        | DatetimeEncoder     | DatetimeEncoder             | DatetimeEncoder with spline encoding |
| Missing value strategy   | Native support (2)  | Native support              | SimpleImputer                        |

Note

(1) If the installed scikit-learn version is lower than 1.4, OrdinalEncoder is used instead, since native support for categorical features is not available.

(2) If the installed scikit-learn version is lower than 1.4, SimpleImputer is used instead, since native support for missing values is not available.