Encoding: creating feature matrices#
Encoding, or vectorizing, creates numerical features from data such as dataframes, strings, and dates. Different encoders are suited to different types of data.
Turning a dataframe into a numerical feature matrix#
A dataframe can comprise columns of many different types. A good numerical representation of these columns helps analytics and statistical learning.
TableVectorizer gives a turn-key solution by applying a
different, data-specific encoder to each column. It makes
heuristic choices that are not necessarily optimal, but it is typically a
very good baseline.
Encoding open-ended entries and dirty categories#
String columns can be seen as categories for statistical analysis, but standard tools to represent categories fail if these strings are not normalized into a small number of well-identified forms, if they have typos, or if there are too many categories.
Skrub provides encoders that represent open-ended strings and dirty
categories well:
GapEncoder: infers latent categories and represents the data on them. Very interpretable, but sometimes slow
MinHashEncoder: a very scalable encoding of strings that captures their similarities. Particularly useful on large databases and well suited to learners such as trees (boosted trees or random forests)
SimilarityEncoder: a simple encoder that represents each string by its similarities with all the distinct categories in the data. Useful when there is a small number of categories but we still want to capture the links between them (eg: “west”, “north”, “north-west”)
DatetimeEncoder encodes dates and times: it represents them as
time in seconds since a fixed date, but also adds features useful to
capture regularities: day of the week, month of the year…