Encoding: creating feature matrices

Encoding, or vectorizing, creates numerical features from the data, converting dataframes, strings, dates… Different encoders are suited to different types of data.

Turning a dataframe into a numerical feature matrix

A dataframe can comprise columns of many different types. A good numerical representation of these columns helps analytics and statistical learning.

The TableVectorizer gives a turn-key solution by applying a different, data-specific encoder to each column. It makes reasonable heuristic choices that are not necessarily optimal (since it is not aware of the learner used for the machine-learning task). However, it typically provides a very good baseline.
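For instance, a minimal sketch of vectorizing a heterogeneous dataframe, using the employee_salaries dataset that ships with skrub:

```python
# Turn a mixed-type dataframe into a numerical feature matrix.
from skrub import TableVectorizer
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
df = dataset.X  # mixes numerical, categorical, and datetime columns

vectorizer = TableVectorizer()
X = vectorizer.fit_transform(df)  # an all-numerical feature matrix
```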

The function tabular_learner() goes the extra mile by creating a machine-learning model that works well on tabular data. This model combines a TableVectorizer with a given scikit-learn estimator. Depending on whether or not the final estimator natively supports missing values, a missing-value imputation step is added before it. The parameters of the TableVectorizer are chosen based on the type of the final estimator.
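A minimal sketch of building and using such a model (the choice of estimator here is illustrative):

```python
# tabular_learner returns a regular scikit-learn Pipeline combining a
# TableVectorizer with the given estimator.
from sklearn.ensemble import HistGradientBoostingRegressor
from skrub import tabular_learner

model = tabular_learner(HistGradientBoostingRegressor())
# model.fit(df, y) and model.predict(df_new) work directly on dataframes
```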

Choice of TableVectorizer parameters when using the tabular_learner() function

|                          | RandomForest models | HistGradientBoosting models | Linear models and others |
|--------------------------|---------------------|-----------------------------|--------------------------|
| Low-cardinality encoder  | OrdinalEncoder      | Native support (1)          | OneHotEncoder            |
| High-cardinality encoder | MinHashEncoder      | MinHashEncoder              | GapEncoder               |
| Numerical preprocessor   | No processing       | No processing               | StandardScaler           |
| Date preprocessor        | DatetimeEncoder     | DatetimeEncoder             | DatetimeEncoder          |
| Missing value strategy   | Native support (2)  | Native support              | SimpleImputer            |

Note

(1) If the installed scikit-learn version is lower than 1.4, the OrdinalEncoder is used, since native support for categorical features is not available.

(2) If the installed scikit-learn version is lower than 1.4, the SimpleImputer is used, since native support for missing values is not available.

With tree-based models, the MinHashEncoder is used for high-cardinality categorical features. It does not provide interpretable features, unlike the default GapEncoder, but it is much faster. For low-cardinality features, these models rely either on the native categorical support of the model or on the OrdinalEncoder.
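A minimal sketch; printing the pipeline shows the tree-friendly choices described above (the estimator is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from skrub import tabular_learner

model = tabular_learner(RandomForestClassifier())
print(model)  # the TableVectorizer is configured with OrdinalEncoder
              # (low cardinality) and MinHashEncoder (high cardinality)
```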

With linear models or unknown models, the default values of the different parameters are used: the GapEncoder for high-cardinality categorical features and the OneHotEncoder for low-cardinality ones. If the final estimator does not support missing values, a SimpleImputer is added before it. Finally, a StandardScaler is added to the pipeline. These choices may not be optimal in all cases, but they are methodologically safe.
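For example, a sketch of the pipeline built for a linear model (the step names shown are indicative and may vary across versions):

```python
from sklearn.linear_model import Ridge
from skrub import tabular_learner

model = tabular_learner(Ridge())
print(model.named_steps)
# -> tablevectorizer, simpleimputer, standardscaler, ridge
```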

Encoding open-ended entries and dirty categories

String columns can be seen as categories for statistical analysis, but standard tools to represent categories fail if these strings are not normalized into a small number of well-identified forms, if they have typos, or if there are too many categories.

Skrub provides encoders that represent open-ended strings and dirty categories well, e.g. to replace the OneHotEncoder (see the sketch after this list):

  • GapEncoder: infers latent categories and represents the data on these. Very interpretable, but sometimes slow.

  • MinHashEncoder: a very scalable encoding of strings that captures their similarities. Particularly useful on large databases and well suited for learners such as trees (boosted trees or random forests).

  • SimilarityEncoder: a simple encoder that represents the similarities of each string with all the different categories in the data. Useful when there is a small number of categories, but we still want to capture the links between them (eg: “west”, “north”, “north-west”).
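A minimal sketch comparing two of these encoders on a column of dirty city names; it assumes a recent skrub version in which these encoders transform a single column (older versions accepted a dataframe of string columns):

```python
import pandas as pd
from skrub import GapEncoder, MinHashEncoder

city = pd.Series(
    ["London", "Londn", "Paris", "Pari", "Berlin", "Berlin "], name="city"
)

gap = GapEncoder(n_components=3)          # latent, interpretable topics
X_gap = gap.fit_transform(city)

minhash = MinHashEncoder(n_components=8)  # fast, similarity-preserving hashes
X_min = minhash.fit_transform(city)
```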

Encoding dates

The DatetimeEncoder encodes dates and times: it represents them as the time in seconds since a fixed date, but also adds features useful to capture regularities: day of the week, month of the year…
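A minimal sketch, assuming a recent skrub version in which the DatetimeEncoder transforms a single datetime column (the resolution and add_weekday parameters may differ across versions):

```python
import pandas as pd
from skrub import DatetimeEncoder

dates = pd.to_datetime(
    pd.Series(["2024-01-15 10:30", "2024-06-01 08:00", "2024-12-24 18:45"])
)

encoder = DatetimeEncoder(resolution="hour", add_weekday=True)
features = encoder.fit_transform(dates)
# -> numerical columns such as year, month, day, hour, weekday,
#    plus the total number of seconds since a fixed reference date
```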