Encoding Categorical Features

The Categorical Encoding Problem

Machine learning models need numeric input, but we have:

  • Names: “Alice”, “Bob”
  • Occupations: “engineer”, “teacher”
  • Locations: “NYC”, “LA”

How do we convert these to numbers?

One-Hot Encoding

Creates binary indicator columns:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd 
from skrub import ApplyToCols

X = pd.DataFrame({"color": ["red", "blue", "red"]})
encoder = OneHotEncoder(sparse_output=False)
encoded = ApplyToCols(encoder).fit_transform(X)

Pros: Intuitive, works well for few categories
Cons: Explodes with many categories, creates sparse matrices

Ordinal Encoding

Assigns sequential numbers:

pd.Categorical(X["color"]).codes
# categories are sorted alphabetically: blue=0, red=1
array([1, 0, 1], dtype=int8)

Pros: Memory efficient
Cons: Introduces artificial ordering
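
The same integer-coding idea is available in scikit-learn as OrdinalEncoder, which fits into pipelines more naturally than pd.Categorical. A minimal sketch (the example column is the one from above):

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

X = pd.DataFrame({"color": ["red", "blue", "red"]})
encoder = OrdinalEncoder()
codes = encoder.fit_transform(X)
# Categories are sorted alphabetically, so blue -> 0.0 and red -> 1.0.
print(codes.ravel())  # [1. 0. 1.]

Note that the model still sees red > blue, which is the artificial ordering mentioned above.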

The Challenge: High-Cardinality Features

What if you have 1000+ unique values (IDs, free text)?

  • One-Hot would create thousands of columns
  • Ordinal misleads models with false ordering
  • Need a smarter approach
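
The column explosion is easy to demonstrate: one-hot encoding a column of unique IDs produces one column per ID. A small illustration (the ID column is made up for the example):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# 1000 distinct IDs -> one-hot produces 1000 columns.
X_ids = pd.DataFrame({"user_id": [f"id_{i}" for i in range(1000)]})
encoded = OneHotEncoder(sparse_output=False).fit_transform(X_ids)
print(encoded.shape)  # (1000, 1000)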

StringEncoder: Best All-Rounder

from skrub import StringEncoder

encoder = StringEncoder(n_components=10)
encoded = ApplyToCols(encoder).fit_transform(X)

  • Uses TF-IDF + SVD for dimensionality reduction
  • Fixed output dimension (n_components)
  • No artificial ordering
  • Fast and robust
  • Not very good with free-flowing text
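
To see why the output dimension is fixed, here is a rough sketch of the TF-IDF + SVD idea using plain scikit-learn (this approximates what StringEncoder does internally; the corpus and parameters are illustrative, not skrub's defaults):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["engineer", "senior engineer", "teacher", "math teacher", "nurse"]
# Character n-grams capture substring overlap between similar categories.
tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)).fit_transform(corpus)
# SVD compresses the sparse n-gram matrix down to a fixed number of columns.
embedding = TruncatedSVD(n_components=3).fit_transform(tfidf)
print(embedding.shape)  # (5, 3)

However many distinct strings appear, every one maps to the same small number of components.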

TextEncoder: For Natural Language

from skrub import TextEncoder

encoder = TextEncoder()
encoded = ApplyToCols(encoder).fit_transform(X)

Pros: Captures semantic meaning, works very well on text
Cons: Very slow, requires heavy dependencies

MinHashEncoder: Fast Alternative

from skrub import MinHashEncoder

encoder = MinHashEncoder()
encoded = ApplyToCols(encoder).fit_transform(X)

Pros: Very fast
Cons: Encoding quality is usually lower than the alternatives
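
MinHashEncoder applies min-hashing to the character n-grams of each string. A toy sketch of that underlying idea (the helper names and the use of md5 are mine, not skrub's implementation):

import hashlib

def char_ngrams(s, n=3):
    # All length-n substrings of s.
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash(s, n_components=4):
    # One hash family per output dimension; keep the minimum hash
    # over the string's n-grams. Similar strings share n-grams, so
    # their signatures tend to agree in some slots.
    sig = []
    for seed in range(n_components):
        h = min(
            int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16)
            for g in char_ngrams(s)
        )
        sig.append(h)
    return sig

print(minhash("engineer"))

This is cheap because it only hashes substrings, which is also why the resulting features carry less signal than the learned representations above.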

GapEncoder: Interpretable

from skrub import GapEncoder

encoder = GapEncoder(n_components=10)
encoded = ApplyToCols(encoder).fit_transform(X)

Pros: Interpretable results
Cons: Slower, performance similar to StringEncoder
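
GapEncoder factorizes n-gram counts so that each latent component is a weighted bag of n-grams, and the heaviest n-grams make the component readable. A rough sketch of why that is interpretable, substituting scikit-learn's NMF for skrub's Gamma-Poisson model (corpus and parameters are illustrative):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

corpus = ["engineer", "software engineer", "teacher", "science teacher"]
vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
counts = vec.fit_transform(corpus)

nmf = NMF(n_components=2, init="nndsvda", max_iter=500)
activations = nmf.fit_transform(counts)

# Printing each component's top n-grams shows what it "means".
vocab = vec.get_feature_names_out()
for k, row in enumerate(nmf.components_):
    top = vocab[np.argsort(row)[-3:]]
    print(f"component {k}: {list(top)}")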

Encoder Comparison

Encoder         Speed      Performance  Interpretability  Use Case
--------------  ---------  -----------  ----------------  -----------------------
OneHotEncoder   Fast       Good         High              Low cardinality
StringEncoder   Fast       Excellent    Low               Default choice
TextEncoder     Slow       Excellent    Medium            Real text data
MinHashEncoder  Very fast  Fair         Low               Quick prototyping
GapEncoder      Slow       Good         High              Interpretability needed

In summary

  • Use OneHotEncoder for < 40 unique values
  • Use StringEncoder for high-cardinality by default
  • Use TextEncoder for true natural language
  • Use GapEncoder when interpretability matters
  • None of these introduce artificial ordering
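
The decision rules above can be collapsed into a small helper (the function name and the cardinality threshold of 40 come straight from this summary; treat both as rules of thumb, not hard limits):

def pick_encoder(n_unique, is_free_text=False, need_interpretability=False):
    """Heuristic encoder choice mirroring the summary above."""
    if is_free_text:
        return "TextEncoder"
    if need_interpretability:
        return "GapEncoder"
    if n_unique < 40:
        return "OneHotEncoder"
    return "StringEncoder"

print(pick_encoder(5))      # OneHotEncoder
print(pick_encoder(5000))   # StringEncoder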