Encoding: creating feature matrices
Encoding, or vectorizing, creates numerical features from the data, converting dataframes, strings, dates… Different encoders are suited to different types of data.
String entries: categories and open-ended entries
Summary
StringEncoder is a good default when working with high-cardinality string features, as it provides good performance on both categorical features (e.g., work titles, city names) and free-flowing text (reviews, comments), while being very efficient and quick to fit. GapEncoder provides better performance on dirty categories, while TextEncoder works better on free-flowing text or when external context helps. However, both encoders are much slower to execute, and TextEncoder requires additional dependencies. MinHashEncoder may scale better on large datasets, but its performance is generally not as good as that of the other methods.
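For instance, the encoder used for high-cardinality string columns can be chosen when building a TableVectorizer. A minimal sketch (the parameter names high_cardinality and cardinality_threshold assume a recent skrub version; the data is made up):

```python
import pandas as pd
from skrub import StringEncoder, TableVectorizer

df = pd.DataFrame({
    "job_title": ["Senior Data Engineer", "Data Scientist", "Police Officer III",
                  "Firefighter", "Software Engineer II"],
    "salary": [110_000, 105_000, 98_000, 92_000, 101_000],
})

vectorizer = TableVectorizer(
    # Columns with at least this many unique values are treated as
    # high-cardinality; lowered here so this tiny example qualifies.
    cardinality_threshold=4,
    high_cardinality=StringEncoder(n_components=2),
)
features = vectorizer.fit_transform(df)  # numeric dataframe, ready for a learner
```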
Non-normalized entries and dirty categories
String columns can be seen as categories for statistical analysis, but standard tools to represent categories fail if these strings are not normalized into a small number of well-identified forms, if they have typos, or if there are too many categories.
Skrub provides encoders that represent open-ended strings or dirty categories well, e.g. to replace OneHotEncoder:

- GapEncoder: infers latent categories and represents the data on these. Very interpretable, but sometimes slow (see the sketch after this list).
- MinHashEncoder: a very scalable encoding of strings capturing their similarities. Particularly useful on large databases and well suited for learners such as trees (boosted trees or random forests).
- SimilarityEncoder: a simple encoder that works by representing a string’s similarities with all the different categories in the data. Useful when there is a small number of categories, but we still want to capture the links between them (e.g. “west”, “north”, “north-west”).
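As an illustration, here is a sketch of GapEncoder on a small, made-up dirty-category column (it assumes a recent skrub version, in which encoders operate on a single pandas column):

```python
import pandas as pd
from skrub import GapEncoder

# Morphological variants of the same underlying categories.
dirty = pd.Series(
    ["Police Officer III", "police officer", "Fire Fighter II", "Firefighter"],
    name="employee_position",
)

encoder = GapEncoder(n_components=2)
embeddings = encoder.fit_transform(dirty)  # one column per latent category
# get_feature_names_out() labels each latent category with its most
# activated words, which is what makes the encoding interpretable.
print(encoder.get_feature_names_out())
```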
Text with diverse entries
When strings in a column are not dirty categories but rather diverse entries of text (names, open-ended or free-flowing text), it is useful to use methods that can address the variety of terms that can appear. Skrub provides two encoders that represent such string columns as embeddings: TextEncoder and StringEncoder.
Depending on the task and dataset, this approach may lead to significant improvements in the quality of predictions, albeit with potential increases in memory usage and computation time in the case of TextEncoder.
StringEncoder: Vectorizing text
A lightweight solution for handling diverse strings is to first apply a tf-idf vectorization, then follow it with a dimensionality reduction algorithm such as TruncatedSVD to limit the number of features: StringEncoder implements this operation. Since tf-idf produces sparse vectors as output, applying TruncatedSVD also makes it possible to concatenate the resulting features to dataframes, which would not be possible otherwise.
In simpler terms, StringEncoder builds a sparse matrix that counts the number of times each word appears across all documents (where a document in this case is a string in the column to encode), and then reduces this sparse matrix to a limited number of features for training.
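Conceptually, this is close to the following scikit-learn pipeline (a sketch of the underlying idea, not skrub’s exact implementation):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "great product, fast delivery",
    "slow delivery, poor packaging",
    "fast shipping and great value",
]

tfidf_svd = make_pipeline(
    TfidfVectorizer(),             # sparse matrix of term weights
    TruncatedSVD(n_components=2),  # dense, low-dimensional output
)
dense_features = tfidf_svd.fit_transform(texts)  # shape (3, 2)
```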
TextEncoder: Using pretrained language models
Skrub integrates language models as scikit-learn transformers, allowing them to be easily plugged into TableVectorizer and Pipeline.
These language models are pre-trained deep-learning encoders that have been fine-tuned specifically for embedding tasks. Note that skrub does not provide a simple way to fine-tune language models directly on your dataset.
Warning
These encoders require installing additional dependencies around torch. See the “deep learning dependencies” section in the Install guide for more details.
With TextEncoder, a wrapper around the sentence-transformers package, you can use any sentence embedding model available on the HuggingFace Hub or locally stored on your disk. This means you can fine-tune a model using the sentence-transformers library and then use it with TextEncoder like any other pre-trained model. For more information, see the sentence-transformers fine-tuning guide.
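A short sketch of such usage (the model name below is just an example of a sentence-embedding model from the Hub; the first call downloads the model weights):

```python
import pandas as pd
from skrub import TextEncoder  # requires the optional deep-learning dependencies

reviews = pd.Series(
    [
        "The staff was friendly and the room was spotless.",
        "Terrible experience: noisy, dirty, and overpriced.",
        "Average stay, nothing special but no complaints.",
        "Loved the breakfast; would definitely come back.",
    ],
    name="review",
)

encoder = TextEncoder(
    # Any sentence-embedding model from the HuggingFace Hub, or a local path.
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    n_components=2,  # reduce the raw embeddings to 2 features
)
embeddings = encoder.fit_transform(reviews)  # one dense row per review
```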
References
See “Vectorizing string entries for data processing on tables: when are larger language models better?” [3] for a comparison between large language models and string-based encoders (such as the MinHashEncoder) in the context of dirty categories versus diverse entries regimes.

[3] L. Grinsztajn, M. Kim, E. Oyallon, G. Varoquaux. Vectorizing string entries for data processing on tables: when are larger language models better? 2023.
Encoding dates
The DatetimeEncoder encodes dates and times: it represents them as time in seconds since a fixed date, and also adds features useful to capture regularities: day of the week, month of the year…
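A minimal sketch (parameter and output column names assume a recent skrub version):

```python
import pandas as pd
from skrub import DatetimeEncoder

pickup = pd.to_datetime(pd.Series(
    ["2024-01-15 08:30", "2024-06-01 17:45", "2024-12-24 23:10"],
    name="pickup",
))

# resolution controls the finest extracted component; add_weekday adds
# the day of the week as an extra feature.
encoder = DatetimeEncoder(resolution="hour", add_weekday=True)
features = encoder.fit_transform(pickup)
# Expected columns include pickup_year, pickup_month, pickup_day,
# pickup_hour, pickup_weekday and pickup_total_seconds (seconds since
# a fixed reference date).
print(features.columns.tolist())
```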