Real-world datasets rarely contain only numeric values. We frequently encounter categorical features—values that belong to discrete categories, such as names, occupations, geographic locations, or clothing sizes. Text data also falls into this category, since each unique string can be considered a categorical value.
The challenge is that machine learning models require numeric input. How do we convert these categorical values into numeric features that preserve their information and enable our models to make good predictions?
This chapter explores the various strategies and tools available in skrub to encode categorical features, with an eye toward choosing the best method for our specific use case.
The way we encode categorical features significantly impacts our machine learning pipeline: using the appropriate encoder ensures we make the best use of the categorical information while keeping the model efficient and interpretable.
OneHotEncoder: Creates a binary indicator column for each unique category, where 1 denotes the presence of the category and 0 its absence.
Pros:
- Simple and easy to interpret: each output column corresponds to exactly one category.
- Makes no assumption about an ordering of the categories.
Cons:
- The number of output columns grows with the number of unique categories, which quickly becomes unmanageable for high-cardinality features.
- Treats every pair of distinct categories as equally different, so no similarity between categories is captured.
The OneHotEncoder is used by default by the skrub TableVectorizer for categorical features with fewer than 40 unique values.
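As a rough illustration (the toy "city" column below is made up, and the `sparse_output` argument assumes scikit-learn 1.2 or later), one-hot encoding with scikit-learn's OneHotEncoder looks like this:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"city": ["Paris", "London", "Paris", "Madrid"]})

# sparse_output=False returns a dense array that is easy to inspect
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X)

print(encoder.get_feature_names_out())
# ['city_London' 'city_Madrid' 'city_Paris']
print(X_encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```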
OrdinalEncoder: Assigns each category a numerical value (0, 1, 2, …).
Pros:
- Produces a single compact column, regardless of the number of unique categories.
- Very fast and memory-efficient.
Cons:
- Imposes an arbitrary order on the categories; unless the categories are truly ordinal (e.g., clothing sizes), models may read meaning into the distances between codes.
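A minimal sketch with scikit-learn's OrdinalEncoder (the "size" column and explicit category order are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# Passing an explicit order avoids the default alphabetical mapping;
# the integer codes only make sense because sizes are genuinely ordered.
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
print(encoder.fit_transform(X).ravel())
# [0. 1. 2. 1.]
```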
All the categorical encoders in skrub are designed to encode any number of unique values using a fixed number of components: this number is controlled by the parameter n_components in each transformer.
StringEncoder: Applies term frequency-inverse document frequency (tf-idf) vectorization to character n-grams, followed by truncated singular value decomposition (SVD) for dimensionality reduction. This method is also known as Latent Semantic Analysis (LSA).
Pros:
- Fast to fit, with strong predictive performance on high-cardinality string columns.
- Requires no pretrained models or extra dependencies.
Cons:
- Captures only character-level (lexical) similarity between strings, not their meaning.
- The resulting components are not directly interpretable.
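A hedged sketch of the StringEncoder on a single string column (the job titles are invented, and the exact call signature may vary slightly across skrub versions):

```python
import pandas as pd
from skrub import StringEncoder

job_titles = pd.Series(
    [
        "Senior Software Engineer",
        "Software Engineer II",
        "Police Officer III",
        "Police Sergeant",
    ],
    name="job_title",
)

# n_components fixes the output width, no matter how many unique strings appear
encoder = StringEncoder(n_components=2)
embeddings = encoder.fit_transform(job_titles)
print(embeddings.shape)  # one dense row of 2 components per input string
```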
TextEncoder: Uses pretrained language models from the HuggingFace Hub to generate dense vector representations of text.
Pros:
- Captures the semantic meaning of text, which helps with free-form data such as reviews, comments, or descriptions.
Cons:
- Much slower and more computationally demanding than the other encoders.
- Requires downloading pretrained models and installing additional deep-learning dependencies.
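A hedged sketch of the TextEncoder (it assumes skrub's optional deep-learning dependencies are installed, e.g. `pip install skrub[transformers]`; the model name shown is just one example of a model available on the HuggingFace Hub, and the first call downloads it):

```python
import pandas as pd
from skrub import TextEncoder

reviews = pd.Series(
    [
        "Great product, fast delivery",
        "Terrible quality, broke after a week",
        "Does exactly what the description promises",
        "Average at best, would not buy again",
    ],
    name="review",
)

# model_name picks a pretrained model from the HuggingFace Hub;
# n_components reduces the raw embedding to a small, fixed number of columns.
encoder = TextEncoder(model_name="intfloat/e5-small-v2", n_components=2)
embeddings = encoder.fit_transform(reviews)
print(embeddings.shape)  # (4, 2)
```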
MinHashEncoder: Decomposes strings into n-grams and applies the MinHash algorithm for quick dimension reduction.
Pros:
- Extremely fast and scales well to very large datasets.
Cons:
- The resulting features are not interpretable.
- Works best with tree-based models; performance can suffer with linear models.
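A hedged sketch with the MinHashEncoder (the company names are invented; depending on the skrub version the encoder may expect a single column, as here, or a dataframe):

```python
import pandas as pd
from skrub import MinHashEncoder

companies = pd.Series(
    [
        "Goldman Sachs Group",
        "Goldman Sachs International",
        "JP Morgan Chase",
        "Morgan Stanley",
    ],
    name="company",
)

# Each string is decomposed into character n-grams and hashed into
# n_components values; strings sharing n-grams get similar signatures.
encoder = MinHashEncoder(n_components=10)
signatures = encoder.fit_transform(companies)
print(signatures.shape)  # (4, 10)
```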
GapEncoder: Estimates latent categories by finding common n-gram patterns across values, then encodes these patterns as numeric features.
Pros:
- Produces interpretable features: each component can be labelled by the substrings (n-grams) it captures.
Cons:
- Training is considerably slower than with the other skrub encoders.
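A hedged sketch with the GapEncoder, whose components can be inspected through get_feature_names_out (the city strings are invented; Series-vs-dataframe input again depends on the skrub version):

```python
import pandas as pd
from skrub import GapEncoder

cities = pd.Series(
    ["London, UK", "London, England", "Paris, France", "Paris FR", "Lyon, France"],
    name="city",
)

encoder = GapEncoder(n_components=2)
activations = encoder.fit_transform(cities)

# Each latent category is labelled by its most representative substrings,
# which is what makes the GapEncoder output interpretable.
print(encoder.get_feature_names_out())
print(activations.shape)  # (5, 2)
```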
Encoding categorical features is a critical step in preparing data for machine learning. The skrub library provides multiple encoders to handle different scenarios:
- Use the StringEncoder as the default for high-cardinality categorical features. It offers the best balance of speed, performance, and robustness across diverse datasets.
- Use the OneHotEncoder for low-cardinality features (fewer than 40 unique values) to keep the feature space manageable.
- Use the TextEncoder if you are working with true textual data (reviews, comments, descriptions) and have sufficient computational resources.
- Use the GapEncoder when interpretability is important and the additional training time is acceptable.
- Use the MinHashEncoder when you need maximum speed and are working with very large datasets.

The TableVectorizer integrates these encoders automatically, dispatching columns to the appropriate encoder based on their data type and cardinality. This automation makes it easy to process mixed-type datasets efficiently while still allowing fine-grained control when needed. By default, the TableVectorizer uses the OneHotEncoder for categorical features with fewer than 40 unique values and the StringEncoder for those with 40 or more.
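As a hedged sketch of that dispatching (the dataset is a toy one; the `high_cardinality` and `cardinality_threshold` parameter names follow recent skrub releases and may differ in older versions, and the threshold is lowered here only so the tiny example actually contains a high-cardinality column):

```python
import pandas as pd
from skrub import MinHashEncoder, TableVectorizer

df = pd.DataFrame(
    {
        "employee_position_title": [
            "Police Officer III",
            "Senior Architect",
            "Bus Operator",
            "Fleet Mechanic II",
        ],
        "gender": ["F", "M", "F", "M"],
        "salary": [69222.18, 97392.47, 42053.83, 51917.10],
    }
)

# Override the high-cardinality encoder with a MinHashEncoder while keeping the
# default one-hot encoding for low-cardinality columns; the threshold is lowered
# so that the position title (4 unique values) counts as high-cardinality here.
vectorizer = TableVectorizer(
    high_cardinality=MinHashEncoder(n_components=5),
    cardinality_threshold=3,
)
X = vectorizer.fit_transform(df)
print(X.shape)  # salary kept numeric, gender one-hot encoded, title min-hashed
```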
For a comprehensive empirical comparison of these methods, refer to the categorical encoders benchmark.
In the next chapter we will bring together all the encoders described so far in a single object, the TableVectorizer.