Encoding string and text columns as numeric features#

In skrub, categorical features are features that are not parsed as numbers or datetimes: they may have a Categorical dtype, or they may simply be strings. Such features are very common in practice, and several strategies exist to handle them.

A common approach is to apply the OneHotEncoder or the OrdinalEncoder to categorical features, but both have limitations: the OneHotEncoder becomes expensive when the number of distinct values grows large, while the OrdinalEncoder imposes an order on features that may have no inherent ordering.

To address these shortcomings and generalize to more columns, skrub implements four different transformers, each with its own pros and cons.

  • StringEncoder: the default encoder, strong in most cases: a strong and quick baseline for both short strings with high cardinality and long text. It computes n-gram frequencies using tf-idf vectorization, followed by truncated SVD (Latent Semantic Analysis); a rough sketch of this idea appears after this list. This is the default encoder used by the TableVectorizer and the tabular_pipeline().

  • TextEncoder: language model-based, strong on text but expensive to run: This encoder encodes string features using pretrained language models from the HuggingFace Hub. It is a wrapper around sentence-transformers, compatible with the scikit-learn API and usable in pipelines. Best for free-flowing text, or when column values carry context the pretrained model is likely to know about (e.g., names of cities). Note that this encoder can take a very long time to train, especially on large datasets and on CPU. The TextEncoder has additional dependencies that are not included in the standard skrub installation; refer to Install for instructions on how to prepare the environment.

  • MinHashEncoder: very fast encoder, but not as effective as the others: This encoder decomposes strings into n-grams, then applies the MinHash method to convert them into numeric features. It is fast to train, but its features may yield worse results than those of the other encoders.

  • GapEncoder: an interpretable, if slower, encoder: The GapEncoder estimates “latent categories” on the training data by finding common n-grams between strings, then encodes the categories as real numbers. The estimated categories can be inspected via .get_feature_names_out(), which makes the resulting features easier to interpret. This encoder may require a long time to train.

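As a rough illustration of the idea behind the StringEncoder (not its exact implementation: the analyzer, n-gram range and other settings below are assumptions for the sake of the sketch), a similar representation can be built with scikit-learn's TfidfVectorizer and TruncatedSVD:

>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import make_pipeline
>>> # character n-gram tf-idf followed by a truncated SVD projection (LSA);
>>> # these settings are illustrative, not necessarily skrub's defaults
>>> lsa = make_pipeline(
...     TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
...     TruncatedSVD(n_components=2),
... )
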
All encoders work like regular scikit-learn transformers, and all take an n_components parameter that specifies how many features to generate for each input column.

>>> import pandas as pd
>>> from skrub import StringEncoder
>>> X = pd.Series([
...   "The professor snatched a good interview out of the jaws of these questions.",
...   "Bookmarking this to watch later.",
...   "When you don't know the lyrics of the song except the chorus",
... ], name='video comments')
>>> encoder = StringEncoder(n_components=2)

The result of .fit_transform() is a new dataframe that contains as many columns as the number of components specified (here, 2). Features generated by each encoder (except the GapEncoder) are always named after the original column (here, "video comments"), followed by the index of the resulting feature.

>>> encoder.fit_transform(X)
   video comments_0  video comments_1
0          1.322969         -0.163066
1          0.379689          1.659318
2          1.306402         -0.317126
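
The other encoders expose the same single-column interface and (as noted above) the same naming scheme for their output columns. For instance, the MinHashEncoder can be applied to the same column; the TextEncoder works the same way, but needs its optional dependencies and downloads a pretrained model on first use:

>>> from skrub import MinHashEncoder
>>> # hash-based features: the values are not meaningful to read,
>>> # so only the shape of the output is shown here
>>> hashes = MinHashEncoder(n_components=2).fit_transform(X)
>>> hashes.shape
(3, 2)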

The GapEncoder names the columns after the categories it estimates from the data, which are built by capturing combinations of substrings that frequently co-occur. More information on the functioning and the theoretical background of the GapEncoder is available in the documentation of the encoder itself.

>>> from skrub import GapEncoder
>>> GapEncoder(n_components=2).fit_transform(X)
   video comments: bookmarking, except, lyrics  video comments: professor, questions, interview
0                                     0.000786                                         1.360704
1                                     0.559531                                         0.000717
2                                     0.982307                                         0.099680
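
Because the GapEncoder is a scikit-learn compatible transformer, the estimated categories can also be retrieved after fitting through .get_feature_names_out() (a small sketch; the exact names depend on the data the encoder was fitted on):

>>> gap = GapEncoder(n_components=2).fit(X)
>>> # returns names such as "video comments: professor, questions, interview"
>>> names = gap.get_feature_names_out()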

Comparing the categorical encoders included in skrub#

Encoder         Training time  Performance on categorical data  Performance on text data  Notes
StringEncoder   Fast           Good                             Good
TextEncoder     Very slow      Mediocre to good                 Very good                 Requires the transformers package to be installed
GapEncoder      Slow           Good                             Mediocre to good          Interpretable
MinHashEncoder  Very fast      Mediocre to good                 Mediocre
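
To apply one of these encoders to all high-cardinality string columns of a dataframe, rather than to a single column, it can be passed to the TableVectorizer. The sketch below assumes the high_cardinality parameter name of recent skrub releases; check the TableVectorizer documentation for the exact signature in your version:

>>> from skrub import MinHashEncoder, TableVectorizer
>>> # swap the default StringEncoder for a MinHashEncoder on
>>> # high-cardinality string columns (parameter name assumed, see above)
>>> vectorizer = TableVectorizer(high_cardinality=MinHashEncoder())
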
This example and this blog post include a more systematic analysis of each method. The docstrings of each encoder provide additional details on how they work.