Strong baseline pipelines
TableVectorizer
In tabular machine learning pipelines, practitioners often convert categorical features to numerical features using various encodings (OneHotEncoder, OrdinalEncoder, etc.). The TableVectorizer parses the data type of each column and maps each column to an encoder, in order to produce numeric features for machine learning models. More precisely, the TableVectorizer maps columns to one of the following four groups by default:

- High-cardinality categorical columns: StringEncoder
- Low-cardinality categorical columns: scikit-learn OneHotEncoder
- Numerical columns: "passthrough" (no transformation)
- Datetime columns: DatetimeEncoder

High-cardinality categorical columns are those with more than 40 unique values, while all other categorical columns are considered low-cardinality; this threshold can be changed by setting the cardinality_threshold parameter of TableVectorizer.
To change an encoder or alter its default parameters, instantiate the encoder and pass it to TableVectorizer:

>>> from skrub import TableVectorizer, DatetimeEncoder, TextEncoder
>>> datetime_enc = DatetimeEncoder(periodic_encoding="circular")
>>> text_enc = TextEncoder()
>>> table_vec = TableVectorizer(datetime=datetime_enc, high_cardinality=text_enc)

The TableVectorizer is used in Encoding: from a dataframe to a numerical matrix for machine learning, while the docstring of the class provides more details on the parameters and usage, as well as various examples.
tabular_pipeline()
tabular_pipeline() is a function that, given a scikit-learn estimator or the name of a task (regression/regressor, classification/classifier), returns a full scikit-learn pipeline containing a TableVectorizer followed by the given estimator, or by a HistGradientBoostingRegressor/HistGradientBoostingClassifier if only the name of the task is given.

>>> from skrub import tabular_pipeline
>>> tabular_pipeline("regression")
Pipeline(steps=[('tablevectorizer', TableVectorizer(...)),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])
>>> from sklearn.linear_model import LinearRegression
>>> tabular_pipeline(LinearRegression())
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

If the estimator is a linear model (e.g., Ridge, LogisticRegression), tabular_pipeline() adds a StandardScaler and a SimpleImputer to the pipeline.

The pipeline prepared by tabular_pipeline() is a strong first baseline for most problems, but may not beat properly tuned ad-hoc pipelines.

|  | RandomForest | HistGradientBoosting | Linear models and others |
|---|---|---|---|
| Low-cardinality encoder | OrdinalEncoder | Native support (1) | OneHotEncoder |
| High-cardinality encoder | StringEncoder | StringEncoder | StringEncoder |
| Numerical preprocessor | No processing | No processing | StandardScaler |
| Date preprocessor | DatetimeEncoder | DatetimeEncoder | DatetimeEncoder with periodic_encoding='spline' |
| Missing value strategy | Native support (2) | Native support | SimpleImputer |

Note

(1) If the installed scikit-learn version is lower than 1.4, an OrdinalEncoder is used, since native support for categorical features is not available.

(2) If the installed scikit-learn version is lower than 1.4, a SimpleImputer is used, since native support for missing values is not available.