Building robust ML baselines with tabular_pipeline()
tabular_pipeline() is a function that, given a scikit-learn estimator, returns a full scikit-learn Pipeline containing a TableVectorizer followed by that estimator. If the estimator is a linear model (e.g., Ridge, LogisticRegression), tabular_pipeline() also adds a SimpleImputer and a StandardScaler to the pipeline, between the TableVectorizer and the estimator.
>>> from sklearn.linear_model import LinearRegression
>>> from skrub import tabular_pipeline
>>> tabular_pipeline(LinearRegression())
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])
It is also possible to call the function with the name of the task to perform ("regression"/"regressor" or "classification"/"classifier") to build a pipeline that uses a HistGradientBoostingRegressor or a HistGradientBoostingClassifier, respectively.
>>> from skrub import tabular_pipeline
>>> tabular_pipeline("regression")
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(...),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])
The pipeline prepared by tabular_pipeline() is a strong first baseline for most problems, but may not beat properly tuned ad-hoc pipelines.
| Parameter | | | Linear models and others |
|---|---|---|---|
| Low-cardinality encoder | Native support (1) | | |
| High-cardinality encoder | | | |
| Numeric preprocessor | No processing | No processing | |
| Date preprocessor | | | |
| Missing value strategy | Native support (2) | Native support | |
Note
(1) If the installed scikit-learn version is lower than 1.4, OrdinalEncoder is used, since native support for categorical features is not available.
(2) If the installed scikit-learn version is lower than 1.4, SimpleImputer is used, since native support for missing values is not available.