Feature engineering for categorical data#

In skrub, categorical features correspond to columns whose data type is neither numeric nor datetime. This includes string, categorical, and object data types.
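
In practice these encoders are applied column by column. As a minimal sketch, assuming parameter names from recent skrub releases, a TableVectorizer can dispatch each column to an encoder based on its data type:

```python
import pandas as pd
from skrub import StringEncoder, TableVectorizer

df = pd.DataFrame(
    {
        "city": ["Paris, FR", "London, UK", "Berlin, DE"],  # string column
        "population": [2_100_000, 8_800_000, 3_600_000],  # numeric column
    }
)

# Columns with at least `cardinality_threshold` unique values are treated as
# high-cardinality and routed to the `high_cardinality` encoder; the threshold
# is lowered here only so this toy column qualifies.
vectorizer = TableVectorizer(
    cardinality_threshold=2,
    high_cardinality=StringEncoder(n_components=2),
)
features = vectorizer.fit_transform(df)
```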

StringEncoder#

A strong and quick baseline for both high-cardinality short strings and long text. This encoder computes n-gram frequencies with tf-idf vectorization, then reduces the dimensionality with truncated SVD (Latent Semantic Analysis).
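
A minimal sketch of direct use on a single column (the number of components shown is illustrative; skrub's default is larger):

```python
import pandas as pd
from skrub import StringEncoder

cities = pd.Series(
    ["Paris, FR", "paris", "London, UK", "london", "Berlin"], name="city"
)

# tf-idf over character n-grams, followed by truncated SVD down to
# `n_components` dense features per string.
encoder = StringEncoder(n_components=3)
embeddings = encoder.fit_transform(cities)
print(embeddings.shape)  # (5, 3)
```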

TextEncoder#

This encoder encodes string features using pretrained language models from the HuggingFace Hub. It is a wrapper around sentence-transformers that is compatible with the scikit-learn API and usable in pipelines. It works best on free-flowing text, and when the column values carry context that the pretrained model has seen (e.g., names of cities). Note that this encoder can take a very long time to train, especially on large datasets and on CPU.
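
A sketch of typical use, assuming the optional dependencies are installed (e.g., `pip install skrub[transformers]`); the model name below is an assumption, and any sentence-transformers checkpoint from the Hub should work:

```python
import pandas as pd
from skrub import TextEncoder

reviews = pd.Series(
    [
        "Great product, arrived quickly.",
        "Terrible support, would not recommend.",
    ],
    name="review",
)

# The pretrained model is downloaded on first use, then each string is
# embedded into a dense vector; this is slow on CPU and faster on GPU.
encoder = TextEncoder(model_name="sentence-transformers/all-MiniLM-L6-v2")
embeddings = encoder.fit_transform(reviews)
```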

MinHashEncoder#

This encoder decomposes strings into n-grams, then applies the MinHash method to convert them into numerical features. It is fast to train, but its features may yield worse results than those of the other encoders.
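
A minimal sketch of the same per-column pattern (the number of components is illustrative):

```python
import pandas as pd
from skrub import MinHashEncoder

cities = pd.Series(["Paris, FR", "paris", "London, UK", "london"], name="city")

# Each string is decomposed into character n-grams, and the minimum hash of
# the n-gram set under several hash functions becomes the feature vector, so
# strings sharing many n-grams end up with similar rows.
encoder = MinHashEncoder(n_components=8)
hashes = encoder.fit_transform(cities)
print(hashes.shape)  # (4, 8)
```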

GapEncoder#

The GapEncoder estimates “latent categories” on the training data by finding common n-grams between strings, then encodes each string by its activations on these categories. The categories can be inspected via .get_feature_names_out(), which makes the features more interpretable. This encoder may require a long time to train.
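
A sketch of the interpretability hook; the exact feature-name strings vary, since skrub labels each latent category with the n-grams that dominate it:

```python
import pandas as pd
from skrub import GapEncoder

cities = pd.Series(
    ["Paris, FR", "paris", "London, UK", "london", "Madrid, ES"], name="city"
)

# Fit a small number of latent categories on the training strings; each row
# of the output holds the string's activations on those categories.
encoder = GapEncoder(n_components=2)
activations = encoder.fit_transform(cities)

# Column names are built from the dominant n-grams of each category, which
# is what makes the encoding interpretable.
print(encoder.get_feature_names_out())
```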

Comparison of the Categorical Encoders#

| Encoder | Training time | Performance on categorical data | Performance on text data | Notes |
|---|---|---|---|---|
| StringEncoder | Fast | Good | Good | |
| TextEncoder | Very slow | Mediocre to good | Very good | Requires the transformers dependency. |
| GapEncoder | Slow | Good | Mediocre to good | Interpretable |
| MinHashEncoder | Very fast | Mediocre to good | Mediocre | |

This example and this blog post include a more systematic analysis of each method.