Encoding: creating feature matrices
Encoding, or vectorizing, creates numerical features from the data: it converts dataframes, strings, dates… into feature matrices. Different encoders suit different types of data.
Encoding open-ended entries and dirty categories
String columns can be seen as categories for statistical analysis, but standard tools for representing categories fail if the strings are not normalized into a small number of well-identified forms, if they contain typos, or if there are too many categories.
Skrub provides encoders that represent open-ended strings or dirty categories well, e.g. as replacements for OneHotEncoder:
GapEncoder: infers latent categories and represents the data on these. Very interpretable, sometimes slow.
MinHashEncoder: a very scalable encoding of strings that captures their similarities. Particularly useful on large databases and well suited to learners such as trees (boosted trees or random forests).
SimilarityEncoder: a simple encoder that represents each string by its similarities to all the different categories in the data. Useful when there is a small number of categories but we still want to capture the links between them (e.g. “west”, “north”, “north-west”).
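To make the idea of encoding a string by its similarities to known categories concrete, here is a minimal pure-Python sketch. It is not skrub's implementation: the helper names (`ngrams`, `ngram_similarity`, `similarity_encode`) and the choice of Jaccard similarity over character 3-grams are assumptions made for illustration only.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a string, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0 to 1)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both strings are too short to produce any n-gram.
        return 1.0
    return len(grams_a & grams_b) / len(union)


def similarity_encode(values, categories, n=3):
    """Encode each value as its similarity to every reference category."""
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


# Hypothetical categories, taken from the example in the text above.
categories = ["west", "north", "north-west"]
encoded = similarity_encode(["north-west", "nort", "east"], categories)

for value, row in zip(["north-west", "nort", "east"], encoded):
    print(value, [round(x, 2) for x in row])
```

Each row has one column per category, so "north-west" gets non-zero weight on "west" and "north" as well as on itself, and the misspelled "nort" still lands close to "north". A one-hot encoding would treat all of these as unrelated.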
Encoding dates
The DatetimeEncoder
encodes dates and times: it represents them as
time in seconds since a fixed date, and also adds features useful for
capturing regularities: day of the week, month of the year…
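The kind of features described above can be sketched with the standard library alone. This is an illustration of the idea, not the DatetimeEncoder itself: the function name `datetime_features`, the feature names, and the choice of the Unix epoch as the fixed reference date are all assumptions.

```python
from datetime import datetime, timezone

# Assumed reference date: the Unix epoch, in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_features(dt):
    """Turn one datetime into a dict of numeric features."""
    if dt.tzinfo is None:
        # Treat naive datetimes as UTC so the subtraction below is valid.
        dt = dt.replace(tzinfo=timezone.utc)
    return {
        "total_seconds": (dt - EPOCH).total_seconds(),  # linear time trend
        "year": dt.year,
        "month": dt.month,           # month of the year, 1 to 12
        "day": dt.day,
        "hour": dt.hour,
        "weekday": dt.isoweekday(),  # day of the week, Monday=1 to Sunday=7
    }


features = datetime_features(datetime(2024, 3, 15, 12, 30))
print(features)
```

The seconds-since-reference column lets a model pick up long-term trends, while the calendar columns expose periodic patterns such as weekday or seasonal effects.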