# Feature engineering for categorical data
In Skrub, categorical features correspond to columns whose data type is neither numeric nor datetime. This includes string, categorical, and object data types.
## StringEncoder
A strong and quick baseline for both short, high-cardinality strings and long text. This encoder computes ngram frequencies using tf-idf vectorization, followed by truncated SVD (Latent Semantic Analysis).
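A minimal sketch of how this might look in practice, assuming the encoder accepts a single pandas column and an `n_components` parameter controlling the output dimensionality:

```python
import pandas as pd

from skrub import StringEncoder

X = pd.Series(
    ["open source software", "open data", "data engineering"],
    name="description",
)

# Vectorize character ngrams with tf-idf, then project onto a small
# number of SVD components (LSA). `n_components` is assumed here.
encoder = StringEncoder(n_components=2)
X_enc = encoder.fit_transform(X)
print(X_enc.shape)  # one row per string, one column per SVD component
```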
## TextEncoder
This encoder encodes string features using pretrained language models from the HuggingFace Hub. It is a wrapper around sentence-transformers that is compatible with the scikit-learn API and usable in pipelines. It works best on free-flowing text and on columns whose values carry context the pretrained model has seen (e.g., names of cities). Note that this encoder can take a very long time to train, especially on large datasets and on CPU.
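A sketch of the intended usage, assuming the optional deep-learning dependencies are installed and that a `model_name` parameter selects the sentence-transformers model; the model shown is an assumption for illustration, not necessarily the default:

```python
import pandas as pd

from skrub import TextEncoder

X = pd.Series(
    ["Paris is the capital of France.", "Berlin is the capital of Germany."],
    name="sentence",
)

# Downloads the pretrained model from the HuggingFace Hub on first use.
# The model name below is an assumption, not confirmed as the default.
encoder = TextEncoder(model_name="intfloat/e5-small-v2")
embeddings = encoder.fit_transform(X)  # one embedding vector per string
```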
## MinHashEncoder
This encoder decomposes strings into ngrams, then applies the MinHash method to convert them into numerical features. It is fast to train, but its features may yield worse results than those of other methods.
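A minimal sketch, assuming the encoder takes an `n_components` parameter for the number of hash dimensions:

```python
import pandas as pd

from skrub import MinHashEncoder

X = pd.Series(["London", "Londres", "Paris"], name="city")

# Hash the character ngrams of each string; similar strings share
# ngrams, so their MinHash signatures tend to collide.
encoder = MinHashEncoder(n_components=10)  # assumed parameter name
X_enc = encoder.fit_transform(X)
```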
## GapEncoder
The GapEncoder estimates “latent categories” on the training data by finding common ngrams between strings, then encodes the categories as real numbers. Grouped features can be retrieved via .get_feature_names_out(), which makes the encoding more interpretable. This encoder may require a long time to train.
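A sketch of fitting the encoder and inspecting its latent categories; the `n_components` parameter is assumed to set how many categories are estimated:

```python
import pandas as pd

from skrub import GapEncoder

X = pd.Series(
    ["police officer", "police sergeant", "fire fighter", "fire chief"],
    name="job_title",
)

# Estimate latent categories from the ngrams shared across strings.
encoder = GapEncoder(n_components=2)  # assumed parameter name
X_enc = encoder.fit_transform(X)

# Each output column is named after the ngrams most active in its
# latent category, which is what makes the encoding interpretable.
print(encoder.get_feature_names_out())
```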
## Comparison of the Categorical Encoders
| Encoder | Training time | Performance on categorical data | Performance on text data | Notes |
|---|---|---|---|---|
| StringEncoder | Fast | Good | Good | |
| TextEncoder | Very slow | Mediocre to good | Very good | Requires the `sentence-transformers` optional dependency |
| GapEncoder | Slow | Good | Mediocre to good | Interpretable |
| MinHashEncoder | Very fast | Mediocre to good | Mediocre | |
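In practice these encoders are typically plugged into skrub's TableVectorizer, which routes string columns with many unique values to whichever encoder is chosen. A sketch under that assumption; the `high_cardinality` and `cardinality_threshold` parameter names are taken to be TableVectorizer's, and the tiny dataset is illustrative only:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

from skrub import StringEncoder, TableVectorizer

df = pd.DataFrame(
    {
        "job_title": ["police officer", "fire fighter", "software engineer"],
        "salary": [50_000, 55_000, 90_000],  # toy target
    }
)

# Send high-cardinality string columns to StringEncoder; the low
# threshold is only so this toy column qualifies as high-cardinality.
pipeline = make_pipeline(
    TableVectorizer(
        high_cardinality=StringEncoder(n_components=2),  # assumed parameter name
        cardinality_threshold=2,
    ),
    HistGradientBoostingRegressor(),
)
pipeline.fit(df[["job_title"]], df["salary"])
```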
A dedicated example and blog post include a more systematic analysis of each method.