tabular_pipeline
skrub.tabular_pipeline(estimator, *, n_jobs=None)
Get a simple machine-learning pipeline for tabular data.
Given either a scikit-learn estimator or one of the special-cased strings 'regressor', 'regression', 'classifier', 'classification', this function creates a scikit-learn pipeline that extracts numeric features, imputes missing values and scales the data if necessary, then applies the estimator.

Note: The heuristics used by tabular_pipeline to define an appropriate preprocessing based on the estimator may change in future releases.

Changed in version 0.6.0: The high-cardinality encoder has been changed from MinHashEncoder to StringEncoder.

- Parameters:
- estimator : {"regressor", "regression", "classifier", "classification"} or sklearn.base.BaseEstimator
  The estimator to use as the final step in the pipeline. Based on the type of estimator, the preceding preprocessing steps and their respective parameters are chosen. The possible values are:
  - 'regressor' or 'regression': a HistGradientBoostingRegressor is used as the final step;
  - 'classifier' or 'classification': a HistGradientBoostingClassifier is used as the final step;
  - a scikit-learn estimator: the provided estimator is used as the final step.
- n_jobs : int, default=None
  Number of jobs to run in parallel in the TableVectorizer step. None means 1 unless in a joblib parallel_backend context. -1 means using all processors.
- Returns:
  Pipeline
  A scikit-learn Pipeline chaining the preprocessing steps and the provided estimator.
Notes
tabular_pipeline returns a scikit-learn Pipeline with several steps:

- A TableVectorizer transforms the tabular data into numeric features. Its parameters are chosen depending on the provided estimator.
- An optional SimpleImputer imputes missing values by their mean and adds binary columns that indicate which values were missing. This step is only added if the estimator cannot handle missing values itself.
- An optional StandardScaler centers and rescales the data. This step is not added (because it is unnecessary) when the estimator is a tree ensemble such as a random forest or gradient boosting.
- The last step is the provided estimator.
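For instance, the steps actually present can be inspected on the Pipeline object. A minimal sketch (the step names follow scikit-learn's automatic lowercase naming, as seen in the Examples below):

>>> from skrub import tabular_pipeline
>>> pipeline = tabular_pipeline('classifier')
>>> [name for name, _ in pipeline.steps]
['tablevectorizer', 'histgradientboostingclassifier']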
The parameter values for the TableVectorizer might differ depending on the version of scikit-learn:

- Support for categorical features in HistGradientBoostingClassifier and HistGradientBoostingRegressor was added in scikit-learn 1.4. Therefore, before this version, an OrdinalEncoder is used for low-cardinality features.
- Support for missing values in RandomForestClassifier and RandomForestRegressor was added in scikit-learn 1.4. Therefore, before this version, a SimpleImputer is used to impute missing values.
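As a quick check, and assuming scikit-learn >= 1.4 is installed, the low-cardinality encoder selected for the gradient boosting models is ToCategorical; with an older scikit-learn the same attribute would show an OrdinalEncoder instead:

>>> from skrub import tabular_pipeline
>>> vectorizer = tabular_pipeline('classifier').named_steps['tablevectorizer']
>>> vectorizer.low_cardinality  # doctest: +SKIP
ToCategorical()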
Read more in the User Guide.
Examples
>>> from skrub import tabular_pipeline
We can easily get a default pipeline for regression or classification:
>>> tabular_pipeline('regression')
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=StringEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(categorical_features='from_dtype'))])
When requesting a 'regression', the last step of the pipeline is set to a HistGradientBoostingRegressor.

>>> tabular_pipeline('classification')
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=StringEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingclassifier',
                 HistGradientBoostingClassifier(categorical_features='from_dtype'))])
When requesting a 'classification', the last step of the pipeline is set to a HistGradientBoostingClassifier.

This pipeline can be applied to rich tabular data:
>>> import pandas as pd
>>> X = pd.DataFrame(
...     {
...         "last_visit": ["2020-01-02", "2021-04-01", "2024-12-05", "2023-08-10"],
...         "medication": [None, "metformin", "paracetamol", "gliclazide"],
...         "insulin_prescriptions": ["N/A", 13, 0, 17],
...         "fasting_glucose": [35, 140, 44, 137],
...     }
... )
>>> y = [0, 1, 0, 1]
>>> X
   last_visit   medication insulin_prescriptions  fasting_glucose
0  2020-01-02         None                   N/A               35
1  2021-04-01    metformin                    13              140
2  2024-12-05  paracetamol                     0               44
3  2023-08-10   gliclazide                    17              137
>>> model = tabular_pipeline('classifier').fit(X, y)
>>> model.predict(X)
array([0, 0, 0, 0])
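The n_jobs argument is forwarded to the TableVectorizer step. A minimal sketch (here -1 requests all processors; the attribute access relies on the usual scikit-learn convention of storing init parameters on the estimator):

>>> parallel_model = tabular_pipeline('classifier', n_jobs=-1)
>>> parallel_model.named_steps['tablevectorizer'].n_jobs
-1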
Rather than using the default estimator, we can provide our own scikit-learn estimator:
>>> from sklearn.linear_model import LogisticRegression
>>> model = tabular_pipeline(LogisticRegression())
>>> model.fit(X, y)
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
By applying only the first pipeline step, we can see the transformed data that is sent to the supervised estimator (see the TableVectorizer documentation for details):

>>> model.named_steps['tablevectorizer'].transform(X)
   last_visit_year  last_visit_month  ...  insulin_prescriptions  fasting_glucose
0           2020.0               1.0  ...                    NaN             35.0
1           2021.0               4.0  ...                   13.0            140.0
2           2024.0              12.0  ...                    0.0             44.0
3           2023.0               8.0  ...                   17.0            137.0
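The output column names can also be retrieved from the fitted vectorizer. A sketch, assuming the vectorizer exposes the standard scikit-learn get_feature_names_out method (output abbreviated and indicative only):

>>> model.named_steps['tablevectorizer'].get_feature_names_out()  # doctest: +SKIP
['last_visit_year', 'last_visit_month', ..., 'fasting_glucose']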
The parameters of the TableVectorizer depend on the provided estimator.

>>> tabular_pipeline(LogisticRegression())
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
For a LogisticRegression, we get (as checked in the sketch below):

- A default configuration of the TableVectorizer, which is intended to work well for a wide variety of downstream estimators. This configuration adds spline periodic features to datetime columns.
- A SimpleImputer, as the LogisticRegression cannot handle missing values.
- A StandardScaler for centering and standard scaling of numerical features.
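These steps can be read directly off the pipeline; a short check based on the repr shown above:

>>> lr_model = tabular_pipeline(LogisticRegression())
>>> list(lr_model.named_steps)
['tablevectorizer', 'simpleimputer', 'standardscaler', 'logisticregression']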
On the other hand, for the HistGradientBoostingClassifier (generated with the string "classifier"):

>>> tabular_pipeline('classifier')
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=StringEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingclassifier',
                 HistGradientBoostingClassifier(categorical_features='from_dtype'))])
- A StringEncoder is used as the high_cardinality encoder. This encoder strikes a good balance between quality and performance in most situations.
- The low_cardinality encoder does not one-hot encode features. The HistGradientBoostingClassifier has built-in support for categorical data, which is more efficient than one-hot encoding. Therefore the selected encoder, ToCategorical, simply makes sure that those features have a categorical dtype so that the HistGradientBoostingClassifier recognizes them as such.
- There is no spline encoding of datetimes.
- There is no missing-value imputation, because the classifier has its own (better) mechanism for dealing with missing values, and no standard scaling, because it is unnecessary for tree ensembles.
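Again, the absence of these steps is easy to verify on the pipeline itself; a minimal sketch:

>>> hgb_model = tabular_pipeline('classifier')
>>> 'simpleimputer' in hgb_model.named_steps
False
>>> 'standardscaler' in hgb_model.named_steps
False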
Gallery examples
Encoding: from a dataframe to a numerical matrix for machine learning
Use case: developing locally and deploying to production