11 Mixed data: dealing with categories
11.1 Introduction
Real-world datasets rarely contain only numeric values. We frequently encounter categorical features—values that belong to discrete categories, such as names, occupations, geographic locations, or clothing sizes. Text data also falls into this category, since each unique string can be considered a categorical value.
The challenge is that machine learning models require numeric input. How do we convert these categorical values into numeric features that preserve their information and enable our models to make good predictions?
This chapter explores the various strategies and tools available in skrub to encode categorical features, with an eye on choosing the best method for our specific use case.
11.2 Why categorical encoders matter
The way we encode categorical features significantly impacts our machine learning pipeline:
- Performance: The encoding choice directly affects how well our model learns from categorical information
- Efficiency: Some encodings create many features (potentially thousands), which increases computation time and memory usage
- Interpretability: Different encoders provide varying levels of transparency in what features represent
- Scalability: Not all methods scale well to high-cardinality features (those with many unique values)
Using the appropriate encoder ensures we’re making the best use of categorical information while keeping our model efficient and interpretable.
11.3 Categorical encoders: pros and cons
11.3.1 One-Hot Encoding and Ordinal Encoding (scikit-learn)
OneHotEncoder: Creates a binary indicator column for each unique category, where 1 denotes the presence of the category and 0 its absence.
Pros:
- Straightforward and intuitive
- Works well for low-cardinality features (few unique values)
- Produces sparse matrices that can save memory
Cons:
- Becomes impractical with high-cardinality features (creates hundreds or thousands of columns)
- Results in large, mostly zero-valued matrices when the output is dense, which is the case when working with dataframes
- Increases overfitting risk and computational overhead
The OneHotEncoder is used by default by the skrub TableVectorizer for categorical features with fewer than 40 unique values.
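As a quick illustration, here is a minimal sketch using scikit-learn's public API (version 1.2 or later for the sparse_output parameter); the column values are made up:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A toy low-cardinality column (three unique sizes).
X = pd.DataFrame({"size": ["S", "M", "L", "M", "S"]})

# sparse_output=False returns a dense array, which is what dataframe
# pipelines typically end up with.
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(X)

print(encoder.get_feature_names_out())  # ['size_L' 'size_M' 'size_S']
print(encoded.shape)                     # (5, 3): one column per unique category
```

With only three categories this is perfectly manageable; the same column with thousands of unique values would produce thousands of mostly-zero columns.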
OrdinalEncoder: Assigns each category a numerical value (0, 1, 2, …).
Pros:
- Very memory-efficient
- Creates only one output column per input column
- Fast to compute
Cons:
- Introduces artificial ordering among categories that may not exist in reality
- Can mislead models into thinking some categories are “greater than” others
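A minimal scikit-learn sketch on made-up data makes the artificial ordering visible:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"city": ["Paris", "London", "Tokyo", "Paris"]})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(X)

# A single output column, but "Tokyo" (2.0) is now "greater than"
# "London" (0.0), an ordering with no real-world meaning.
print(encoded.ravel())  # [1. 0. 2. 1.]
```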
11.3.2 Categorical encoders in skrub
All the categorical encoders in skrub are designed to encode any number of unique values using a fixed number of components: this number is controlled by the parameter n_components in each transformer.
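For instance, encoding a column with 1,000 unique identifiers still yields exactly n_components output columns. The sketch below uses the StringEncoder described in the next subsection, assumes a recent skrub release in which these encoders operate on a single pandas column, and uses made-up data:

```python
import pandas as pd
from skrub import StringEncoder

# 1,000 distinct identifiers: far too many for one-hot encoding.
products = pd.Series([f"product_{i:04d}" for i in range(1000)], name="product")

encoder = StringEncoder(n_components=10)
embeddings = encoder.fit_transform(products)

print(embeddings.shape)  # (1000, 10): the output width is fixed by n_components
```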
11.3.3 StringEncoder
Approach: Applies term frequency-inverse document frequency (tf-idf) vectorization to character n-grams, followed by truncated singular value decomposition (SVD) for dimensionality reduction. This method is also known as Latent Semantic Analysis.
Pros:
- The best all-rounder: Performs well on both categorical and text data
- Fast training time
- Robust and generalizes well across different datasets
- No artificial ordering introduced
Cons:
- Less interpretable than one-hot encoding or ordinal encoding
- May not capture semantic relationships as well as language model-based approaches
- Performance depends on the nature of the categorical data
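A minimal usage sketch (made-up job titles, again assuming a recent skrub release where StringEncoder accepts a single pandas column):

```python
import pandas as pd
from skrub import StringEncoder

job_titles = pd.Series(
    ["Senior Data Scientist", "Data Scientist", "Software Engineer",
     "Senior Software Engineer", "Engineering Manager"],
    name="job_title",
)

# tf-idf over character n-grams followed by truncated SVD (LSA).
encoder = StringEncoder(n_components=3)
embeddings = encoder.fit_transform(job_titles)

# Titles that share many character n-grams ("Data Scientist" vs
# "Senior Data Scientist") get similar rows in the encoded output.
print(embeddings.shape)  # (5, 3)
```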
11.3.4 TextEncoder
Approach: Uses pretrained language models from HuggingFace Hub to generate dense vector representations of text.
Pros:
- Exceptional performance on free-flowing text and natural language
- Captures semantic meaning and context
- Leverages knowledge from large-scale language model pretraining
- Can excel on datasets where domain-specific information aligns with pretraining data
Cons:
- Very computationally expensive: Significantly slower than other methods
- Requires heavy dependencies (PyTorch, transformers)
- Models are large and require downloading
- Impractical for CPU-only environments
- Performance on traditional categorical data (non-text, such as IDs) is not much better than simpler methods
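The sketch below assumes the optional deep-learning dependencies (PyTorch, transformers) are installed and that the model name matches the default documented at the time of writing; the first call downloads the model from the HuggingFace Hub, and the review texts are made up:

```python
import pandas as pd
from skrub import TextEncoder

reviews = pd.Series(
    ["The battery lasts all day and charges quickly.",
     "Stopped working after two weeks, very disappointing.",
     "Decent value for the price."],
    name="review",
)

# Embeds each string with a pretrained language model, then reduces the
# embedding to n_components dimensions.
encoder = TextEncoder(model_name="intfloat/e5-small-v2", n_components=2)
embeddings = encoder.fit_transform(reviews)

print(embeddings.shape)  # (3, 2)
```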
11.3.5 MinHashEncoder
Approach: Decomposes strings into n-grams and applies the MinHash algorithm for quick dimension reduction.
Pros:
- Very fast training time
- Simple and lightweight
- Minimal memory overhead
- Good for quick prototyping or very large-scale datasets
Cons:
- Performance generally lags behind StringEncoder and TextEncoder
- Less nuanced feature representation
- Less robust across different types of data
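A minimal sketch on a larger, made-up column (again assuming a recent skrub release with single-column encoders), where speed is the main draw:

```python
import pandas as pd
from skrub import MinHashEncoder

# 100,000 semi-structured identifiers; hashing character n-grams keeps
# the transformation fast even at this scale.
user_ids = pd.Series(
    [f"user-{i % 5000:05d}-{i % 97:02d}" for i in range(100_000)],
    name="user_id",
)

encoder = MinHashEncoder(n_components=10)
hashes = encoder.fit_transform(user_ids)

print(hashes.shape)  # (100000, 10)
```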
11.3.6 GapEncoder
Approach: Estimates latent categories by finding common n-gram patterns across values, then encodes these patterns as numeric features.
Pros:
- Interpretable: Column names reflect the estimated categories
- Can group similar strings intelligently
- Reasonable performance across datasets
Cons:
- Slower training time compared to StringEncoder and MinHashEncoder
- Performance is on par with or slightly worse than the faster StringEncoder
- Interpretability comes at the cost of training speed
- May require more computational resources for large datasets
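A minimal sketch with made-up city strings (assuming a recent skrub release); the feature names expose the estimated latent categories, which is the main selling point:

```python
import pandas as pd
from skrub import GapEncoder

cities = pd.Series(
    ["London, UK", "London, United Kingdom", "Paris, France",
     "Paris, FR", "New York, USA", "New York City"],
    name="city",
)

encoder = GapEncoder(n_components=3)
embeddings = encoder.fit_transform(cities)

# Each output column is named after the recurring n-gram patterns of the
# latent category it captures (something like "city: london, united, uk").
print(encoder.get_feature_names_out())
print(embeddings.shape)  # (6, 3)
```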
11.4 Conclusion
Encoding categorical features is a critical step in preparing data for machine learning. The skrub library provides multiple encoders to handle different scenarios:
- Start with StringEncoder as a default for high-cardinality categorical features. It offers the best balance of speed, performance, and robustness across diverse datasets.
- Use OneHotEncoder for low-cardinality features (< 40 unique values) to keep the feature space manageable.
- Choose TextEncoder if you're working with true textual data (reviews, comments, descriptions) and have sufficient computational resources.
- Consider GapEncoder when interpretability is important and the additional training time is acceptable.
- Use MinHashEncoder when you need maximum speed and are working with very large datasets.
The TableVectorizer integrates these encoders automatically, dispatching columns to the appropriate encoder based on their data type and cardinality. This automation makes it easy to process mixed-type datasets efficiently while still allowing fine-grained control when needed. By default, the TableVectorizer uses the OneHotEncoder for categorical features with fewer than 40 unique values, and the StringEncoder for those with 40 or more.
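A minimal sketch of this dispatching (made-up data; the parameter names follow recent skrub documentation, and the cardinality threshold is lowered here only so the toy column qualifies as high-cardinality):

```python
import pandas as pd
from skrub import TableVectorizer, StringEncoder

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],            # numeric: passed through
    "size": ["S", "M", "M", "L", "S"],       # 3 unique values: low cardinality
    "job_title": ["Data Scientist", "Sr. Data Scientist", "Software Engineer",
                  "Engineering Manager", "Product Manager"],  # 5 unique values
})

# Columns with at least `cardinality_threshold` unique values go to the
# `high_cardinality` transformer; the other string columns are one-hot encoded.
vectorizer = TableVectorizer(
    cardinality_threshold=4,
    high_cardinality=StringEncoder(n_components=2),
)
features = vectorizer.fit_transform(df)

# The numeric column is kept, "size" is one-hot encoded, and "job_title"
# is embedded into 2 numeric columns by the StringEncoder.
print(features.shape)
```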
For a comprehensive empirical comparison of these methods, refer to the categorical encoders benchmark.
In the next chapter we will combine all the encoders discussed so far into a single object, the TableVectorizer.