.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
.. |DatetimeEncoder| replace:: :class:`~skrub.DatetimeEncoder`
.. |StringEncoder| replace:: :class:`~skrub.StringEncoder`
.. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder`
.. |OrdinalEncoder| replace:: :class:`~sklearn.preprocessing.OrdinalEncoder`
.. |TextEncoder| replace:: :class:`~skrub.TextEncoder`
.. |tabular_pipeline| replace:: :func:`~skrub.tabular_pipeline`
.. |HistGradientBoostingRegressor| replace:: :class:`~sklearn.ensemble.HistGradientBoostingRegressor`
.. |HistGradientBoostingClassifier| replace:: :class:`~sklearn.ensemble.HistGradientBoostingClassifier`
.. |StandardScaler| replace:: :class:`~sklearn.preprocessing.StandardScaler`
.. |SimpleImputer| replace:: :class:`~sklearn.impute.SimpleImputer`

.. _userguide_tablevectorizer:

Strong baseline pipelines
--------------------------------------------------------

|TableVectorizer|
~~~~~~~~~~~~~~~~~
In tabular machine learning pipelines, practitioners often convert categorical features to numerical features
using various encodings (|OneHotEncoder|, |OrdinalEncoder|, etc.).

The |TableVectorizer| parses the data type of each column and maps each column to an encoder, in order
to produce numeric features for machine learning models.

More precisely, the |TableVectorizer| maps columns to one of the following four groups by default:

- **High-cardinality categorical columns**: |StringEncoder|
- **Low-cardinality categorical columns**: scikit-learn |OneHotEncoder|
- **Numerical columns**: "passthrough" (no transformation)
- **Datetime columns**: |DatetimeEncoder|

**High cardinality** categorical columns are those with more than 40 unique values,
while all other categorical columns are considered **low cardinality**: the
threshold can be changed by setting the ``cardinality_threshold`` parameter of
|TableVectorizer|.

To change the encoder or alter default parameters, instantiate an encoder and pass
it to |TableVectorizer|.

>>> from skrub import TableVectorizer, DatetimeEncoder, TextEncoder

>>> datetime_enc = DatetimeEncoder(periodic_encoding="circular")
>>> text_enc = TextEncoder()
>>> table_vec = TableVectorizer(datetime=datetime_enc, high_cardinality=text_enc)

The |TableVectorizer| is used in :ref:`example_encodings`, while the
docstring of the class provides more details on the parameters and usage, as well
as various examples.


|tabular_pipeline|
~~~~~~~~~~~~~~~~~~
The |tabular_pipeline| is a function that, given a scikit-learn estimator or the
name of the task (``regression``/``regressor``, ``classification``/``classifier``),
returns a full scikit-learn pipeline that contains a |TableVectorizer|
followed by the given estimator, or a
|HistGradientBoostingRegressor|/|HistGradientBoostingClassifier| if only
the name of the task is given.


>>> from skrub import tabular_pipeline
>>> tabular_pipeline("regression") # doctest: +SKLEARN_VERSION >= "1.4" +ELLIPSIS
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(...),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])

>>> from sklearn.linear_model import LinearRegression
>>> tabular_pipeline(LinearRegression()) # doctest: +SKLEARN_VERSION >= "1.4" +ELLIPSIS
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

If the estimator is a linear model (e.g., ``Ridge``, ``LogisticRegression``),
|tabular_pipeline| adds a |StandardScaler| and a |SimpleImputer| to the pipeline.
The pipeline prepared by |tabular_pipeline| is a strong first baseline for most
problems, but may not beat properly tuned ad-hoc pipelines.

.. list-table:: Parameter values choice of :class:`TableVectorizer` when using the :func:`tabular_pipeline` function
   :header-rows: 1

   * -
     - ``RandomForest`` models
     - ``HistGradientBoosting`` models
     - Linear models and others
   * - Low-cardinality encoder
     - :class:`~sklearn.preprocessing.OrdinalEncoder`
     - Native support :sup:`(1)`
     - :class:`~sklearn.preprocessing.OneHotEncoder`
   * - High-cardinality encoder
     - :class:`StringEncoder`
     - :class:`StringEncoder`
     - :class:`StringEncoder`
   * - Numerical preprocessor
     - No processing
     - No processing
     - :class:`~sklearn.preprocessing.StandardScaler`
   * - Date preprocessor
     - :class:`DatetimeEncoder`
     - :class:`DatetimeEncoder`
     - :class:`DatetimeEncoder` with spline encoding
   * - Missing value strategy
     - Native support :sup:`(2)`
     - Native support
     - :class:`~sklearn.impute.SimpleImputer`

.. note::
  :sup:`(1)` if scikit-learn installed is lower than 1.4, then
  :class:`~sklearn.preprocessing.OrdinalEncoder` is used since native support
  for categorical features is not available.

  :sup:`(2)` if scikit-learn installed is lower than 1.4, then
  :class:`~sklearn.impute.SimpleImputer` is used since native support
  for missing values is not available.