Encoding: creating feature matrices#
Encoding, or vectorizing, creates numerical features from the data, converting dataframes, strings, dates, and so on. Different encoders suit different types of data.
Turning a dataframe into a numerical feature matrix#
A dataframe can comprise columns of all kinds of types. A good numerical representation of these columns helps analytics and statistical learning.
The TableVectorizer gives a turn-key solution by applying different data-specific encoders to the different columns. It makes reasonable heuristic choices that are not necessarily optimal (since it is not aware of the learner used for the machine learning task). However, it already provides a typically very good baseline.
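The idea of routing each column to a data-specific encoder can be illustrated with a minimal, standard-library sketch. This is not the TableVectorizer implementation: the function `choose_encoder`, its `cardinality_threshold` argument, the returned labels, and the toy table are all hypothetical names introduced here for illustration.

```python
from datetime import date
from numbers import Number


def choose_encoder(values, cardinality_threshold):
    """Return a label for the kind of encoding a column would get.

    Hypothetical heuristic for illustration only, not skrub's actual rules.
    """
    present = [v for v in values if v is not None]
    # datetime.datetime is a subclass of date, so both are caught here.
    if present and all(isinstance(v, date) for v in present):
        return "datetime features"
    if present and all(isinstance(v, Number) for v in present):
        return "numeric passthrough"
    # Everything else is treated as text: decide on the number of categories.
    n_unique = len(set(present))
    if n_unique < cardinality_threshold:
        return "low-cardinality encoder"
    return "high-cardinality encoder"


# Toy table stored as a dict of columns; None marks a missing value.
table = {
    "salary": [52000.0, 61000.5, None, 48000.0],
    "hired": [date(2019, 3, 1), date(2021, 7, 15), date(2020, 1, 9), None],
    "department": ["HR", "IT", "IT", "HR"],
    "job_title": ["Data analyst", "Sr. data analyst", "Nurse II", "Bus driver"],
}

# The threshold is tiny only because the toy table has four rows.
plan = {
    name: choose_encoder(column, cardinality_threshold=3)
    for name, column in table.items()
}
for name, encoder in plan.items():
    print(f"{name}: {encoder}")
```

A real vectorizer would then fit the chosen encoder on each column and concatenate the outputs into one numerical matrix; the sketch stops at the routing decision.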
The function tabular_learner() goes the extra mile by creating a machine-learning model that works well on tabular data. This model combines a TableVectorizer with a provided scikit-learn estimator. Depending on whether or not the final estimator natively supports missing values, a missing value imputer step is added before the final estimator. The parameters of the TableVectorizer are chosen based on the type of the final estimator.
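| | RandomForest models | HistGradientBoosting models | Linear models and others |
|---|---|---|---|
| Low-cardinality encoder | OrdinalEncoder | Native support (1) | OneHotEncoder |
| High-cardinality encoder | MinHashEncoder | MinHashEncoder | GapEncoder |
| Numerical preprocessor | No processing | No processing | StandardScaler |
| Date preprocessor | DatetimeEncoder | DatetimeEncoder | DatetimeEncoder |
| Missing value strategy | Native support (2) | Native support | SimpleImputer |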
Note
(1) If the installed scikit-learn version is lower than 1.4, then OrdinalEncoder is used, since native support for categorical features is not available.
(2) If the installed scikit-learn version is lower than 1.4, then SimpleImputer is used, since native support for missing values is not available.
With tree-based models, the MinHashEncoder is used for high-cardinality categorical features. Unlike the default GapEncoder, it does not provide interpretable features, but it is much faster. For low-cardinality features, these models rely on either the native support of the model or the OrdinalEncoder.
With linear models or unknown models, the default values of the different parameters are used. Therefore, the GapEncoder is used for high-cardinality categorical features and the OneHotEncoder for low-cardinality ones. If the final estimator does not support missing values, a SimpleImputer is added before the final estimator. Finally, a StandardScaler is added to the pipeline. Those choices may not be optimal in all cases but they are methodologically safe.
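The selection rules described above can be summarized as a small decision function. This is only a sketch of the logic as stated in the text, not the code of tabular_learner(): the function `plan_pipeline` and its boolean flags are hypothetical, and the encoder names are returned as plain strings rather than instantiated objects.

```python
def plan_pipeline(tree_based, native_categories=False, native_missing=False):
    """List the preprocessing choices for a final estimator, as strings.

    Hypothetical helper mirroring the prose above; the real function
    inspects the estimator itself instead of taking flags.
    """
    if tree_based:
        steps = {
            # Rely on the model's own categorical handling when it has one.
            "low_cardinality": (
                "native support" if native_categories else "OrdinalEncoder"
            ),
            "high_cardinality": "MinHashEncoder",
            "scaler": None,  # numerical columns are left unprocessed
        }
    else:
        # Linear or unknown models: default encoders plus scaling.
        steps = {
            "low_cardinality": "OneHotEncoder",
            "high_cardinality": "GapEncoder",
            "scaler": "StandardScaler",
        }
    # An imputer is only needed when the estimator cannot handle missing values.
    steps["imputer"] = None if native_missing else "SimpleImputer"
    return steps


boosting = plan_pipeline(tree_based=True, native_categories=True, native_missing=True)
linear = plan_pipeline(tree_based=False)
print(boosting)
print(linear)
```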
Encoding open-ended entries and dirty categories#
String columns can be seen as categories for statistical analysis, but standard tools to represent categories fail if these strings are not normalized into a small number of well-identified forms, if they have typos, or if there are too many categories.
Skrub provides encoders that represent open-ended strings or dirty categories well, e.g. to replace OneHotEncoder:
- GapEncoder: infers latent categories and represents the data on these. Very interpretable, sometimes slow.
- MinHashEncoder: a very scalable encoding of strings that captures their similarities. Particularly useful on large databases and well suited for learners such as trees (boosted trees or random forests).
- SimilarityEncoder: a simple encoder that represents each string by its similarities with all the different categories in the data. Useful when there are a small number of categories, but we still want to capture the links between them (e.g. “west”, “north”, “north-west”).
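To see why similarity-aware encodings capture links that one-hot encoding misses, here is a minimal sketch that represents a string by its similarity to each known category. It assumes one particular similarity, the Jaccard overlap of character 3-grams, which is not necessarily the measure used by skrub; the helpers `char_ngrams`, `ngram_similarity`, and `encode` are hypothetical names.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a string, padded with one space per side."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


categories = ["west", "north", "north-west"]


def encode(value):
    """One feature per known category: similarity of `value` to that category."""
    return [round(ngram_similarity(value, category), 2) for category in categories]


# The last entry is a typo that one-hot encoding would treat as a new category.
for value in categories + ["nort-west"]:
    print(value, encode(value))
```

With one-hot encoding, “west” and “north-west” would be as unrelated as any other pair; here they share 3-grams and therefore get a non-zero similarity, and the misspelled entry still lands close to “north-west”.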
Encoding dates#
The DatetimeEncoder encodes dates and times: it represents them as time in seconds since a fixed date, and also adds features useful to capture regularities: day of the week, month of the year…
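The kind of features described above can be sketched with the standard library alone. This is an illustration of the idea, not the DatetimeEncoder code: the function `datetime_features` and its output keys are hypothetical, and the Unix epoch (1970-01-01 UTC) is assumed as the fixed reference date.

```python
from datetime import datetime, timezone


def datetime_features(moment):
    """Turn one timezone-aware datetime into a few numerical features."""
    return {
        # Linear time: seconds elapsed since the assumed reference date.
        "total_seconds": moment.timestamp(),
        # Calendar components, useful to capture periodic patterns.
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "hour": moment.hour,
        "weekday": moment.isoweekday(),  # 1 = Monday ... 7 = Sunday
    }


sample = datetime(2024, 2, 29, 14, 30, tzinfo=timezone.utc)
features = datetime_features(sample)
for name, value in features.items():
    print(f"{name}: {value}")
```

The total-seconds feature lets a model pick up long-term trends, while the calendar components expose weekly or yearly regularities that a single increasing number would hide.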