tabular_pipeline
skrub.tabular_pipeline(estimator, *, n_jobs=None)
Get a simple machine-learning pipeline for tabular data.
Given either a scikit-learn estimator or one of the special-cased strings 'regressor', 'regression', 'classifier', 'classification', this function creates a scikit-learn pipeline that extracts numeric features, imputes missing values and scales the data if necessary, then applies the estimator.

Note: The heuristics used by tabular_pipeline to define an appropriate preprocessing based on the estimator may change in future releases.

Changed in version 0.6.0: The high-cardinality encoder has been changed from MinHashEncoder to StringEncoder.

- Parameters:
- estimator : {"regressor", "regression", "classifier", "classification"} or sklearn.base.BaseEstimator
  The estimator to use as the final step in the pipeline. Based on the type of estimator, the preceding preprocessing steps and their respective parameters are chosen. The possible values are:
  - 'regressor' or 'regression': a HistGradientBoostingRegressor is used as the final step;
  - 'classifier' or 'classification': a HistGradientBoostingClassifier is used as the final step;
  - a scikit-learn estimator: the provided estimator is used as the final step.
- n_jobs : int, default=None
  Number of jobs to run in parallel in the TableVectorizer step. None means 1 unless in a joblib parallel_backend context. -1 means using all processors.
- Returns:
  Pipeline
  A scikit-learn Pipeline chaining the preprocessing steps and the provided estimator.
Notes
tabular_pipeline returns a scikit-learn Pipeline with several steps:

- A TableVectorizer transforms the tabular data into numeric features. Its parameters are chosen depending on the provided estimator.
- An optional SimpleImputer imputes missing values by their mean and adds binary columns that indicate which values were missing. This step is only added if the estimator cannot handle missing values itself.
- An optional StandardScaler centers and rescales the data. This step is not added (because it is unnecessary) when the estimator is a tree ensemble such as a random forest or gradient boosting.
- The last step is the provided estimator.
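For instance, the steps actually present can be inspected on the Pipeline object. A minimal sketch (the step names follow scikit-learn's automatic lowercase naming, as seen in the Examples below):

>>> from skrub import tabular_pipeline
>>> pipeline = tabular_pipeline('classifier')
>>> [name for name, _ in pipeline.steps]
['tablevectorizer', 'histgradientboostingclassifier']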
The parameter values for the TableVectorizer might differ depending on the version of scikit-learn:

- Support for categorical features in HistGradientBoostingClassifier and HistGradientBoostingRegressor was added in scikit-learn 1.4. Therefore, before this version, an OrdinalEncoder is used for low-cardinality features.
- Support for missing values in RandomForestClassifier and RandomForestRegressor was added in scikit-learn 1.4. Therefore, before this version, a SimpleImputer is used to impute missing values.
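As a quick check, and assuming scikit-learn >= 1.4 is installed, the low-cardinality encoder selected for the gradient boosting models is ToCategorical; with an older scikit-learn the same attribute would show an OrdinalEncoder instead:

>>> from skrub import tabular_pipeline
>>> vectorizer = tabular_pipeline('classifier').named_steps['tablevectorizer']
>>> vectorizer.low_cardinality  # doctest: +SKIP
ToCategorical()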
Read more in the User Guide.
Examples
>>> from skrub import tabular_pipeline
We can easily get a default pipeline for regression or classification:
>>> tabular_pipeline('regression')
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=StringEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(categorical_features='from_dtype'))])
When requesting a 'regression', the last step of the pipeline is set to a HistGradientBoostingRegressor.

>>> tabular_pipeline('classification')
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=StringEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingclassifier',
                 HistGradientBoostingClassifier(categorical_features='from_dtype'))])
When requesting a 'classification', the last step of the pipeline is set to a HistGradientBoostingClassifier.

This pipeline can be applied to rich tabular data:
>>> import pandas as pd
>>> X = pd.DataFrame(
...     {
...         "last_visit": ["2020-01-02", "2021-04-01", "2024-12-05", "2023-08-10"],
...         "medication": [None, "metformin", "paracetamol", "gliclazide"],
...         "insulin_prescriptions": ["N/A", 13, 0, 17],
...         "fasting_glucose": [35, 140, 44, 137],
...     }
... )
>>> y = [0, 1, 0, 1]
>>> X
   last_visit   medication insulin_prescriptions  fasting_glucose
0  2020-01-02         None                   N/A               35
1  2021-04-01    metformin                    13              140
2  2024-12-05  paracetamol                     0               44
3  2023-08-10   gliclazide                    17              137
>>> model = tabular_pipeline('classifier').fit(X, y)
>>> model.predict(X)
array([0, 0, 0, 0])
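The n_jobs argument is forwarded to the TableVectorizer step. A minimal sketch (here -1 requests all processors; the attribute access relies on the usual scikit-learn convention of storing init parameters on the estimator):

>>> parallel_model = tabular_pipeline('classifier', n_jobs=-1)
>>> parallel_model.named_steps['tablevectorizer'].n_jobs
-1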
Rather than using the default estimator, we can provide our own scikit-learn estimator:
>>> from sklearn.linear_model import LogisticRegression
>>> model = tabular_pipeline(LogisticRegression())
>>> model.fit(X, y)
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
By applying only the first pipeline step, we can see the transformed data that is sent to the supervised estimator (see the TableVectorizer documentation for details):

>>> model.named_steps['tablevectorizer'].transform(X)
   last_visit_year  last_visit_month  ...  insulin_prescriptions  fasting_glucose
0           2020.0               1.0  ...                    NaN             35.0
1           2021.0               4.0  ...                   13.0            140.0
2           2024.0              12.0  ...                    0.0             44.0
3           2023.0               8.0  ...                   17.0            137.0
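The output column names can also be retrieved from the fitted vectorizer. A sketch, assuming the vectorizer exposes the standard scikit-learn get_feature_names_out method (output abbreviated and indicative only):

>>> model.named_steps['tablevectorizer'].get_feature_names_out()  # doctest: +SKIP
['last_visit_year', 'last_visit_month', ..., 'fasting_glucose']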
The parameters of the TableVectorizer depend on the provided estimator.

>>> tabular_pipeline(LogisticRegression())
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
For a LogisticRegression, we get (as checked in the sketch below):

- A default configuration of the TableVectorizer, which is intended to work well for a wide variety of downstream estimators. This configuration adds spline periodic features to datetime columns.
- A SimpleImputer, as the LogisticRegression cannot handle missing values.
- A StandardScaler for centering and standard scaling of numerical features.
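These steps can be read directly off the pipeline; a short check based on the repr shown above:

>>> lr_model = tabular_pipeline(LogisticRegression())
>>> list(lr_model.named_steps)
['tablevectorizer', 'simpleimputer', 'standardscaler', 'logisticregression']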
On the other hand, for the HistGradientBoostingClassifier (generated with the string "classifier"):

>>> tabular_pipeline('classifier')
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=StringEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingclassifier',
                 HistGradientBoostingClassifier(categorical_features='from_dtype'))])
- A StringEncoder is used as the high_cardinality encoder. This encoder strikes a good balance between quality and performance in most situations.
- The low_cardinality encoder does not one-hot encode features. The HistGradientBoostingClassifier has built-in support for categorical data, which is more efficient than one-hot encoding. Therefore the selected encoder, ToCategorical, simply makes sure that those features have a categorical dtype so that the HistGradientBoostingClassifier recognizes them as such.
- There is no spline encoding of datetimes.
- There is no missing-value imputation, because the classifier has its own (better) mechanism for dealing with missing values, and no standard scaling, because it is unnecessary for tree ensembles.
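Again, the absence of these steps is easy to verify on the pipeline itself; a minimal sketch:

>>> hgb_model = tabular_pipeline('classifier')
>>> 'simpleimputer' in hgb_model.named_steps
False
>>> 'standardscaler' in hgb_model.named_steps
False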
Gallery examples
Encoding: from a dataframe to a numerical matrix for machine learning
Use case: developing locally and deploying to production