End-to-end predictive models

Turning a dataframe into a numerical feature matrix

A dataframe can contain columns of all kinds of types. A good numerical representation of these columns helps analytics and statistical learning.

The TableVectorizer gives a turn-key solution by applying different data-specific encoders to the different columns. It makes reasonable heuristic choices that are not necessarily optimal, since it is not aware of the learner used for the machine learning task. However, it typically already provides a very good baseline.

The function tabular_learner() goes the extra mile by creating a machine-learning model that works well on tabular data. This model combines a TableVectorizer with a provided scikit-learn estimator. If the final estimator does not natively support missing values, a missing-value imputation step is added before it. The parameters of the TableVectorizer are chosen based on the type of the final estimator.

Parameter values chosen for the TableVectorizer when using the tabular_learner() function

| | RandomForest models | HistGradientBoosting models | Linear models and others |
|---|---|---|---|
| Low-cardinality encoder | OrdinalEncoder | Native support (1) | OneHotEncoder |
| High-cardinality encoder | MinHashEncoder | MinHashEncoder | GapEncoder |
| Numerical preprocessor | No processing | No processing | StandardScaler |
| Date preprocessor | DatetimeEncoder | DatetimeEncoder | DatetimeEncoder |
| Missing value strategy | Native support (2) | Native support | SimpleImputer |

Note

(1) If the installed version of scikit-learn is lower than 1.4, the OrdinalEncoder is used, since native support for categorical features is not available.

(2) If the installed version of scikit-learn is lower than 1.4, a SimpleImputer is used, since native support for missing values is not available.
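The choices listed above, together with notes (1) and (2), can be read as a small lookup. The sketch below is purely illustrative: the names CHOICES and expected_choices and the family labels are hypothetical and are not part of skrub, and the code only restates the choices documented here rather than how skrub implements them.

```python
# Hypothetical restatement of the documented choices, keyed by estimator family.
CHOICES = {
    "random_forest": {
        "low_cardinality": "OrdinalEncoder",
        "high_cardinality": "MinHashEncoder",
        "numeric": "No processing",
        "datetime": "DatetimeEncoder",
        "missing_values": "Native support",
    },
    "hist_gradient_boosting": {
        "low_cardinality": "Native support",
        "high_cardinality": "MinHashEncoder",
        "numeric": "No processing",
        "datetime": "DatetimeEncoder",
        "missing_values": "Native support",
    },
    "linear_or_other": {
        "low_cardinality": "OneHotEncoder",
        "high_cardinality": "GapEncoder",
        "numeric": "StandardScaler",
        "datetime": "DatetimeEncoder",
        "missing_values": "SimpleImputer",
    },
}


def expected_choices(family, sklearn_version=(1, 4)):
    """Return the documented choices, applying notes (1) and (2)."""
    choices = dict(CHOICES[family])
    if sklearn_version < (1, 4):
        if family == "hist_gradient_boosting":
            # Note (1): no native categorical support before 1.4.
            choices["low_cardinality"] = "OrdinalEncoder"
        if family == "random_forest":
            # Note (2): no native missing-value support before 1.4.
            choices["missing_values"] = "SimpleImputer"
    return choices


print(expected_choices("random_forest", sklearn_version=(1, 3)))
```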

With tree-based models, the MinHashEncoder is used for high-cardinality categorical features. Unlike the default GapEncoder, it does not provide interpretable features, but it is much faster. For low-cardinality features, these models rely on either the native support of the model or the OrdinalEncoder.

With linear models or unknown models, the default values of the different parameters are used. Therefore, the GapEncoder is used for high-cardinality categorical features and the OneHotEncoder for low-cardinality ones. If the final estimator does not support missing values, a SimpleImputer is added before the final estimator. Finally, a StandardScaler is added to the pipeline. These choices may not be optimal in all cases, but they are methodologically safe.