End-to-end predictive models#
Create baseline predictive models on heterogeneous datasets#
Crafting a machine-learning pipeline is a rather daunting task. Choosing the final learner of such a pipeline is usually the easiest part. However, that choice imposes constraints on the preprocessing steps required ahead of the learner. Programmatically defining these steps is the part that requires the most expertise and is cumbersome to write.
The tabular_learner() function is a factory that, given a scikit-learn estimator, returns a pipeline combining this estimator with the appropriate preprocessing steps. These steps correspond to a TableVectorizer that handles heterogeneous data and, depending on the capabilities of the final estimator, a missing-value imputer and/or a standard scaler. The next section (Turning a dataframe into a numeric feature matrix) provides more details regarding the TableVectorizer. The parameters of the TableVectorizer are chosen based on the type of the final estimator.
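For instance, a minimal sketch of building such a pipeline, assuming a recent skrub release that exposes tabular_learner at the package level:

```python
from sklearn.linear_model import Ridge
from skrub import tabular_learner

# Build a full pipeline around a linear model: tabular_learner wraps the
# estimator with a TableVectorizer and, for linear models, a
# missing-value imputer and a standard scaler.
model = tabular_learner(Ridge())
print(model)
```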
|  | HistGradientBoostingClassifier / HistGradientBoostingRegressor | RandomForestClassifier / RandomForestRegressor | Linear models and others |
|---|---|---|---|
| Low-cardinality encoder | Native support (1) | OrdinalEncoder | OneHotEncoder |
| High-cardinality encoder | MinHashEncoder | MinHashEncoder | GapEncoder |
| Numerical preprocessor | No processing | No processing | StandardScaler |
| Date preprocessor | DatetimeEncoder | DatetimeEncoder | DatetimeEncoder |
| Missing value strategy | Native support (2) | Native support | SimpleImputer |
Note

(1) If the installed scikit-learn version is lower than 1.4, then the OrdinalEncoder is used, since native support for categorical features is not available.

(2) If the installed scikit-learn version is lower than 1.4, then the SimpleImputer is used, since native support for missing values is not available.
With tree-based models, the MinHashEncoder is used for high-cardinality categorical features. Unlike the default GapEncoder, it does not provide interpretable features, but it is much faster. For low-cardinality features, these models rely on either the native categorical support of the model or the OrdinalEncoder.
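A sketch of what this looks like with a gradient-boosting model; the resulting pipeline steps follow the table above:

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from skrub import tabular_learner

# With a tree-based model, high-cardinality categorical columns are
# encoded with a MinHashEncoder, and numerical columns are not scaled.
model = tabular_learner(HistGradientBoostingClassifier())
print(model)
```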
With linear models or unknown models, the default values of the different parameters are used. Therefore, the GapEncoder is used for high-cardinality categorical features and the OneHotEncoder for low-cardinality ones. If the final estimator does not support missing values, a SimpleImputer is added before the final estimator. Finally, a StandardScaler is added to the pipeline. These choices may not be optimal in all cases but they are methodologically safe.
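For example, such a pipeline handles a heterogeneous dataframe end to end; the column names and toy data below are purely illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from skrub import tabular_learner

model = tabular_learner(LogisticRegression())

# Toy dataframe: a string column and a numerical column with a missing
# value, which the SimpleImputer step fills before scaling and fitting.
X = pd.DataFrame({
    "city": ["Paris", "London", "Berlin", "Paris"],
    "temperature": [12.5, None, 9.5, 11.0],
})
y = [1, 0, 0, 1]
model.fit(X, y)
```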
Turning a dataframe into a numeric feature matrix#
A dataframe can contain columns of all kinds of types. We usually refer to such data as “heterogeneous” data. A good numerical representation of these columns helps with analytics and statistical learning.
The TableVectorizer provides a turn-key solution by applying different data-specific encoders to the different columns. It makes reasonable heuristic choices that are not necessarily optimal, since it is not aware of the learner used for the machine-learning task. However, it typically provides a very good baseline. The TableVectorizer handles the following types of data:
- numerical data, represented with the data types bool, int, and float;
- categorical data, represented with the data types str or categorical (e.g. pandas.CategoricalDtype or polars.datatypes.Categorical);
- date and time data, represented by a datetime data type (e.g. numpy.datetime64, pandas.DatetimeTZDtype, polars.datatypes.Datetime).
Categorical data are subdivided into two groups: columns containing a large number of categories (high-cardinality columns) and columns containing a small number of categories (low-cardinality columns). A column is considered high-cardinality if the number of unique values is greater than a given threshold, which is controlled by the parameter cardinality_threshold.
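Putting this together, a sketch of vectorizing a small heterogeneous dataframe; the threshold value of 10 is illustrative only:

```python
import pandas as pd
from skrub import TableVectorizer

# Hypothetical dataframe mixing the supported data types.
X = pd.DataFrame({
    "flag": [True, False, True],              # numerical (bool)
    "size": [1.5, 2.0, 3.5],                  # numerical (float)
    "kind": ["a", "b", "a"],                  # categorical (str)
    "when": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01"]
    ),                                        # date and time
})

# A column is treated as high-cardinality when its number of unique
# values exceeds cardinality_threshold (set to 10 here for illustration).
vectorizer = TableVectorizer(cardinality_threshold=10)
X_out = vectorizer.fit_transform(X)
```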
Each group of data types defined earlier is associated with a specific init parameter (e.g. numeric, datetime, etc.). The values of these parameters follow the same convention:

- when set to "passthrough", the input columns are output as they are;
- when set to "drop", the input columns are dropped;
- when set to a compatible scikit-learn transformer (implementing the fit, transform, and fit_transform methods), the transformer is applied to each column independently. The transformer is cloned (using sklearn.base.clone()) before its fit method is called.
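As an illustration, a sketch of overriding these parameters. Only numeric and datetime are named explicitly above, so the low_cardinality parameter used here is an assumption about the API:

```python
from sklearn.preprocessing import OneHotEncoder
from skrub import TableVectorizer

vectorizer = TableVectorizer(
    numeric="passthrough",          # keep numerical columns as they are
    datetime="drop",                # discard datetime columns
    low_cardinality=OneHotEncoder(  # assumed parameter name; the encoder
        sparse_output=False,        # is cloned, then applied to each
        handle_unknown="ignore",    # low-cardinality column independently
    ),
)
```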
Examples#
The following examples provide an in-depth look at how to use the TableVectorizer class and the tabular_learner() function.