Real-world datasets rarely contain only numeric values. We frequently encounter categorical features—values that belong to discrete categories, such as names, occupations, geographic locations, or clothing sizes. Text data also falls into this category, since each unique string can be considered a categorical value.
The challenge is that machine learning models require numeric input. How do we convert these categorical values into numeric features that preserve their information and enable our models to make good predictions?
This chapter explores the various strategies and tools available in skrub to encode categorical features, with an eye toward choosing the best method for our specific use case.
The way we encode categorical features significantly impacts our machine learning pipeline: using the appropriate encoder ensures we make the best use of the categorical information while keeping the model efficient and interpretable.
OneHotEncoder: Creates a binary indicator column for each unique category, where 1 denotes the presence of the category and 0 its absence.
Pros:
- Simple and easy to interpret: each output column corresponds to exactly one category.
- Makes no assumption about an ordering of the categories.
Cons:
- The number of output columns grows with the number of unique categories, which quickly becomes unmanageable for high-cardinality features.
- Treats every pair of distinct categories as equally different, so no similarity between categories is captured.
The OneHotEncoder is used by default by the skrub TableVectorizer for categorical features with fewer than 40 unique values.
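As a rough illustration (the toy "city" column below is made up, and the `sparse_output` argument assumes scikit-learn 1.2 or later), one-hot encoding with scikit-learn's OneHotEncoder looks like this:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"city": ["Paris", "London", "Paris", "Madrid"]})

# sparse_output=False returns a dense array that is easy to inspect
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X)

print(encoder.get_feature_names_out())
# ['city_London' 'city_Madrid' 'city_Paris']
print(X_encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```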
OrdinalEncoder: Assigns each category a numerical value (0, 1, 2, …).
Pros:
- Produces a single compact column, regardless of the number of unique categories.
- Very fast and memory-efficient.
Cons:
- Imposes an arbitrary order on the categories; unless the categories are truly ordinal (e.g., clothing sizes), models may read meaning into the distances between codes.
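A minimal sketch with scikit-learn's OrdinalEncoder (the "size" column and explicit category order are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# Passing an explicit order avoids the default alphabetical mapping;
# the integer codes only make sense because sizes are genuinely ordered.
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
print(encoder.fit_transform(X).ravel())
# [0. 1. 2. 1.]
```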
All the categorical encoders in skrub are designed to encode any number of unique values using a fixed number of components: this number is controlled by the parameter n_components in each transformer.
StringEncoder: Applies term frequency-inverse document frequency (tf-idf) vectorization to character n-grams, followed by truncated singular value decomposition (SVD) for dimensionality reduction. This method is also known as Latent Semantic Analysis (LSA).
Pros:
- Fast to fit, with strong predictive performance on high-cardinality string columns.
- Requires no pretrained models or extra dependencies.
Cons:
- Captures only character-level (lexical) similarity between strings, not their meaning.
- The resulting components are not directly interpretable.
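A hedged sketch of the StringEncoder on a single string column (the job titles are invented, and the exact call signature may vary slightly across skrub versions):

```python
import pandas as pd
from skrub import StringEncoder

job_titles = pd.Series(
    [
        "Senior Software Engineer",
        "Software Engineer II",
        "Police Officer III",
        "Police Sergeant",
    ],
    name="job_title",
)

# n_components fixes the output width, no matter how many unique strings appear
encoder = StringEncoder(n_components=2)
embeddings = encoder.fit_transform(job_titles)
print(embeddings.shape)  # one dense row of 2 components per input string
```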
TextEncoder: Uses pretrained language models from the HuggingFace Hub to generate dense vector representations of text.
Pros:
- Captures the semantic meaning of text, which helps with free-form data such as reviews, comments, or descriptions.
Cons:
- Much slower and more computationally demanding than the other encoders.
- Requires downloading pretrained models and installing additional deep-learning dependencies.
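A hedged sketch of the TextEncoder (it assumes skrub's optional deep-learning dependencies are installed, e.g. `pip install skrub[transformers]`; the model name shown is just one example of a model available on the HuggingFace Hub, and the first call downloads it):

```python
import pandas as pd
from skrub import TextEncoder

reviews = pd.Series(
    [
        "Great product, fast delivery",
        "Terrible quality, broke after a week",
        "Does exactly what the description promises",
        "Average at best, would not buy again",
    ],
    name="review",
)

# model_name picks a pretrained model from the HuggingFace Hub;
# n_components reduces the raw embedding to a small, fixed number of columns.
encoder = TextEncoder(model_name="intfloat/e5-small-v2", n_components=2)
embeddings = encoder.fit_transform(reviews)
print(embeddings.shape)  # (4, 2)
```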
MinHashEncoder: Decomposes strings into n-grams and applies the MinHash algorithm for quick dimension reduction.
Pros:
- Extremely fast and scales well to very large datasets.
Cons:
- The resulting features are not interpretable.
- Works best with tree-based models; performance can suffer with linear models.
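A hedged sketch with the MinHashEncoder (the company names are invented; depending on the skrub version the encoder may expect a single column, as here, or a dataframe):

```python
import pandas as pd
from skrub import MinHashEncoder

companies = pd.Series(
    [
        "Goldman Sachs Group",
        "Goldman Sachs International",
        "JP Morgan Chase",
        "Morgan Stanley",
    ],
    name="company",
)

# Each string is decomposed into character n-grams and hashed into
# n_components values; strings sharing n-grams get similar signatures.
encoder = MinHashEncoder(n_components=10)
signatures = encoder.fit_transform(companies)
print(signatures.shape)  # (4, 10)
```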
GapEncoder: Estimates latent categories by finding common n-gram patterns across values, then encodes these patterns as numeric features.
Pros:
- Produces interpretable features: each component can be labelled by the substrings (n-grams) it captures.
Cons:
- Training is considerably slower than with the other skrub encoders.
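A hedged sketch with the GapEncoder, whose components can be inspected through get_feature_names_out (the city strings are invented; Series-vs-dataframe input again depends on the skrub version):

```python
import pandas as pd
from skrub import GapEncoder

cities = pd.Series(
    ["London, UK", "London, England", "Paris, France", "Paris FR", "Lyon, France"],
    name="city",
)

encoder = GapEncoder(n_components=2)
activations = encoder.fit_transform(cities)

# Each latent category is labelled by its most representative substrings,
# which is what makes the GapEncoder output interpretable.
print(encoder.get_feature_names_out())
print(activations.shape)  # (5, 2)
```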
Encoding categorical features is a critical step in preparing data for machine learning. The skrub library provides multiple encoders to handle different scenarios:
- Use the StringEncoder as the default for high-cardinality categorical features. It offers the best balance of speed, performance, and robustness across diverse datasets.
- Use the OneHotEncoder for low-cardinality features (fewer than 40 unique values) to keep the feature space manageable.
- Use the TextEncoder if you are working with true textual data (reviews, comments, descriptions) and have sufficient computational resources.
- Use the GapEncoder when interpretability is important and the additional training time is acceptable.
- Use the MinHashEncoder when you need maximum speed and are working with very large datasets.

The TableVectorizer integrates these encoders automatically, dispatching columns to the appropriate encoder based on their data type and cardinality. This automation makes it easy to process mixed-type datasets efficiently while still allowing fine-grained control when needed. By default, the TableVectorizer uses the OneHotEncoder for categorical features with fewer than 40 unique values and the StringEncoder for those with 40 or more.
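As a hedged sketch of that dispatching (the dataset is a toy one; the `high_cardinality` and `cardinality_threshold` parameter names follow recent skrub releases and may differ in older versions, and the threshold is lowered here only so the tiny example actually contains a high-cardinality column):

```python
import pandas as pd
from skrub import MinHashEncoder, TableVectorizer

df = pd.DataFrame(
    {
        "employee_position_title": [
            "Police Officer III",
            "Senior Architect",
            "Bus Operator",
            "Fleet Mechanic II",
        ],
        "gender": ["F", "M", "F", "M"],
        "salary": [69222.18, 97392.47, 42053.83, 51917.10],
    }
)

# Override the high-cardinality encoder with a MinHashEncoder while keeping the
# default one-hot encoding for low-cardinality columns; the threshold is lowered
# so that the position title (4 unique values) counts as high-cardinality here.
vectorizer = TableVectorizer(
    high_cardinality=MinHashEncoder(n_components=5),
    cardinality_threshold=3,
)
X = vectorizer.fit_transform(df)
print(X.shape)  # salary kept numeric, gender one-hot encoded, title min-hashed
```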
For a comprehensive empirical comparison of these methods, refer to the categorical encoders benchmark.
In the next chapter we will bring together all the encoders described so far in a single object, the TableVectorizer.