tabular_learner
- skrub.tabular_learner(estimator, *, n_jobs=None)
Get a simple machine-learning pipeline for tabular data.
Given a scikit-learn estimator, this function creates a machine-learning pipeline that preprocesses tabular data to extract numeric features, impute missing values and scale the data if necessary, then applies the estimator.

Instead of an actual estimator, estimator can also be one of the special-cased strings 'regressor', 'regression', 'classifier' or 'classification' to use a HistGradientBoostingRegressor or a HistGradientBoostingClassifier with default parameters.

tabular_learner returns a scikit-learn Pipeline with several steps:

- A TableVectorizer transforms the tabular data into numeric features. Its parameters are chosen depending on the provided estimator.
- An optional SimpleImputer imputes missing values by their mean and adds binary columns that indicate which values were missing. This step is only added if the estimator cannot handle missing values itself.
- An optional StandardScaler centers and rescales the data. This step is not added (because it is unnecessary) when the estimator is a tree ensemble such as a random forest or gradient boosting.
- The last step is the provided estimator.
Read more in the User Guide.
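For illustration, the step names of the returned pipeline reflect this structure; a minimal sketch based on the LogisticRegression example detailed further below:

>>> from sklearn.linear_model import LogisticRegression
>>> from skrub import tabular_learner
>>> list(tabular_learner(LogisticRegression()).named_steps)
['tablevectorizer', 'simpleimputer', 'standardscaler', 'logisticregression']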
Note
tabular_learner is a recent addition and the heuristics used to define an appropriate preprocessing based on the estimator may change in future releases.
- Parameters:
- estimator : {'regressor', 'regression', 'classifier', 'classification'} or scikit-learn estimator
The estimator to use as the final step in the pipeline. Based on the type of estimator, the previous preprocessing steps and their respective parameters are chosen. The possible values are:
- 'regressor' or 'regression': a HistGradientBoostingRegressor is used as the final step;
- 'classifier' or 'classification': a HistGradientBoostingClassifier is used as the final step;
- a scikit-learn estimator: the provided estimator is used as the final step.
- n_jobs : int, default=None
  Number of jobs to run in parallel in the TableVectorizer step. None means 1 unless in a joblib parallel_backend context. -1 means using all processors.
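For example, to dispatch the vectorization work across all available cores (a minimal sketch; whether this helps depends on the data and the joblib backend):

>>> model = tabular_learner('classifier', n_jobs=-1)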
- Returns:
  - Pipeline
    A scikit-learn Pipeline chaining the preprocessing steps described above and the provided estimator.
Notes
The parameter values for the TableVectorizer might differ depending on the version of scikit-learn:

- support for categorical features in HistGradientBoostingClassifier and HistGradientBoostingRegressor was added in scikit-learn 1.4. Therefore, before this version, an OrdinalEncoder is used for low-cardinality features.
- support for missing values in RandomForestClassifier and RandomForestRegressor was added in scikit-learn 1.4. Therefore, before this version, a SimpleImputer is used to impute missing values.
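One way to check which configuration was selected for the installed versions is to inspect the TableVectorizer step of the returned pipeline; a minimal sketch (the output shown corresponds to scikit-learn >= 1.4 and may differ on older versions):

>>> tabular_learner('regressor').named_steps['tablevectorizer']
TableVectorizer(high_cardinality=MinHashEncoder(), low_cardinality=ToCategorical())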
Examples
>>> from skrub import tabular_learner
We can easily get a default pipeline for regression or classification:
>>> tabular_learner('regression')
Pipeline(steps=[('tablevectorizer', TableVectorizer(high_cardinality=MinHashEncoder(), low_cardinality=ToCategorical())), ('histgradientboostingregressor', HistGradientBoostingRegressor(categorical_features='from_dtype'))])
When requesting a 'regression', the last step of the pipeline is set to a HistGradientBoostingRegressor.

>>> tabular_learner('classification')
Pipeline(steps=[('tablevectorizer', TableVectorizer(high_cardinality=MinHashEncoder(), low_cardinality=ToCategorical())), ('histgradientboostingclassifier', HistGradientBoostingClassifier(categorical_features='from_dtype'))])
When requesting a 'classification', the last step of the pipeline is set to a HistGradientBoostingClassifier.

This pipeline can be applied to rich tabular data:
>>> import pandas as pd
>>> X = pd.DataFrame(
...     {
...         "last_visit": ["2020-01-02", "2021-04-01", "2024-12-05", "2023-08-10"],
...         "medication": [None, "metformin", "paracetamol", "gliclazide"],
...         "insulin_prescriptions": ["N/A", 13, 0, 17],
...         "fasting_glucose": [35, 140, 44, 137],
...     }
... )
>>> y = [0, 1, 0, 1]
>>> X
   last_visit   medication insulin_prescriptions  fasting_glucose
0  2020-01-02         None                   N/A               35
1  2021-04-01    metformin                    13              140
2  2024-12-05  paracetamol                     0               44
3  2023-08-10   gliclazide                    17              137
>>> model = tabular_learner('classifier').fit(X, y)
>>> model.predict(X)
array([0, 0, 0, 0])
Rather than using the default estimator, we can provide our own scikit-learn estimator:
>>> from sklearn.linear_model import LogisticRegression
>>> model = tabular_learner(LogisticRegression())
>>> model.fit(X, y)
Pipeline(steps=[('tablevectorizer', TableVectorizer()), ('simpleimputer', SimpleImputer(add_indicator=True)), ('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
By applying only the first pipeline step, we can see the transformed data that is sent to the supervised estimator (see the TableVectorizer documentation for details):

>>> model.named_steps['tablevectorizer'].transform(X)
   last_visit_year  last_visit_month  ...  insulin_prescriptions  fasting_glucose
0           2020.0               1.0  ...                    NaN             35.0
1           2021.0               4.0  ...                   13.0            140.0
2           2024.0              12.0  ...                    0.0             44.0
3           2023.0               8.0  ...                   17.0            137.0
The parameters of the TableVectorizer depend on the provided estimator.

>>> tabular_learner(LogisticRegression())
Pipeline(steps=[('tablevectorizer', TableVectorizer()), ('simpleimputer', SimpleImputer(add_indicator=True)), ('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
We see that for the LogisticRegression we get the default configuration of the TableVectorizer, which is intended to work well for a wide variety of downstream estimators. Moreover, as the LogisticRegression cannot handle missing values, an imputation step is added. Finally, as many models require the inputs to be centered and on the same scale, centering and standard scaling are added.

On the other hand, for the HistGradientBoostingClassifier (generated with the string 'classifier'):

>>> tabular_learner('classifier')
Pipeline(steps=[('tablevectorizer', TableVectorizer(high_cardinality=MinHashEncoder(), low_cardinality=ToCategorical())), ('histgradientboostingclassifier', HistGradientBoostingClassifier(categorical_features='from_dtype'))])
A MinHashEncoder is used as the high_cardinality encoder. This encoder provides good performance when the supervised estimator is based on a decision tree or an ensemble of trees, as is the case for the HistGradientBoostingClassifier. Unlike the default GapEncoder, the MinHashEncoder does not produce interpretable features. However, it is much faster and uses less memory.

The low_cardinality encoder does not one-hot encode features. The HistGradientBoostingClassifier has built-in support for categorical data which is more efficient than one-hot encoding. Therefore the selected encoder, ToCategorical, simply makes sure that those features have a categorical dtype so that the HistGradientBoostingClassifier recognizes them as such.

There is no missing-value imputation because the classifier has its own (better) mechanism for dealing with missing values.

There is no standard scaling, which is unnecessary for trees and ensembles of trees.
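The resulting pipeline is therefore roughly equivalent to assembling the same steps by hand; a minimal sketch, not the exact construction used internally, assuming MinHashEncoder and ToCategorical are importable from skrub's top level and reusing the X, y defined above:

>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.pipeline import make_pipeline
>>> from skrub import MinHashEncoder, TableVectorizer, ToCategorical
>>> manual = make_pipeline(
...     TableVectorizer(high_cardinality=MinHashEncoder(), low_cardinality=ToCategorical()),
...     HistGradientBoostingClassifier(categorical_features='from_dtype'),
... )
>>> manual.fit(X, y).predict(X)
array([0, 0, 0, 0])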
Gallery examples
Encoding: from a dataframe to a numerical matrix for machine learning