.. currentmodule:: skrub

.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
.. |tabular_pipeline| replace:: :func:`~skrub.tabular_pipeline`
.. |HistGradientBoostingRegressor| replace:: :class:`~sklearn.ensemble.HistGradientBoostingRegressor`
.. |HistGradientBoostingClassifier| replace:: :class:`~sklearn.ensemble.HistGradientBoostingClassifier`
.. |Pipeline| replace:: :class:`~sklearn.pipeline.Pipeline`
.. |StandardScaler| replace:: :class:`~sklearn.preprocessing.StandardScaler`
.. |SimpleImputer| replace:: :class:`~sklearn.impute.SimpleImputer`

.. _user_guide_tabular_pipeline:

Building robust ML baselines with |tabular_pipeline|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

|tabular_pipeline| is a function that, given a scikit-learn estimator, returns
a full scikit-learn |Pipeline| that contains a |TableVectorizer| followed by the
given estimator. If the estimator is a linear model (e.g., ``Ridge``,
``LogisticRegression``), |tabular_pipeline| also adds a |SimpleImputer| and a
|StandardScaler| to the pipeline, between the |TableVectorizer| and the
estimator.

>>> from sklearn.linear_model import LinearRegression
>>> from skrub import tabular_pipeline
>>> tabular_pipeline(LinearRegression())  # doctest: +SKLEARN_VERSION >= "1.4" +ELLIPSIS
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

It is also possible to call the function with the name of the task to perform
(``regression``/``regressor``, ``classification``/``classifier``) to build a
pipeline that uses a
|HistGradientBoostingRegressor|/|HistGradientBoostingClassifier|.


>>> from skrub import tabular_pipeline
>>> tabular_pipeline("regression")  # doctest: +SKLEARN_VERSION >= "1.4" +ELLIPSIS
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(...),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])

The pipeline prepared by |tabular_pipeline| is a strong first baseline for most
problems, but may not beat properly tuned ad-hoc pipelines.

.. list-table:: Parameters chosen for the :class:`TableVectorizer` by :func:`tabular_pipeline`
   :header-rows: 1
   :widths: 25 25 25 25

   * - Parameter
     - ``RandomForest`` models
     - ``HistGradientBoosting`` models
     - Linear models and others
   * - Low-cardinality encoder
     - :class:`~sklearn.preprocessing.OrdinalEncoder`
     - Native support :sup:`(1)`
     - :class:`~sklearn.preprocessing.OneHotEncoder`
   * - High-cardinality encoder
     - :class:`StringEncoder`
     - :class:`StringEncoder`
     - :class:`StringEncoder`
   * - Numeric preprocessor
     - No processing
     - No processing
     - :class:`~sklearn.preprocessing.StandardScaler`
   * - Date preprocessor
     - :class:`DatetimeEncoder`
     - :class:`DatetimeEncoder`
     - :class:`DatetimeEncoder` with spline encoding
   * - Missing value strategy
     - Native support :sup:`(2)`
     - Native support
     - :class:`~sklearn.impute.SimpleImputer`

.. note::
   :sup:`(1)` If the installed scikit-learn version is lower than 1.4,
   :class:`~sklearn.preprocessing.OrdinalEncoder` is used instead, since native
   support for categorical features is not available.

   :sup:`(2)` If the installed scikit-learn version is lower than 1.4,
   :class:`~sklearn.impute.SimpleImputer` is used instead, since native support
   for missing values is not available.