Machine learning models need numeric input, but real tables often contain non-numeric columns such as categories, IDs, and free text.
How to convert these to numbers?
Creates binary indicator columns:
- Pros: Intuitive, works well for few categories
- Cons: Explodes with many categories, creates sparse matrices
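
To make the "binary indicator columns" idea concrete, here is a minimal pure-Python sketch. The function name `one_hot_encode` and the sample data are hypothetical and only illustrate the technique; in practice a library encoder such as scikit-learn's `OneHotEncoder` would normally be used.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct value."""
    # One output column per distinct category, in sorted order.
    categories = sorted(set(values))
    # Each row has a single 1 in the column matching its category.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]  # hypothetical sample column
categories, rows = one_hot_encode(colors)

for value, row in zip(colors, rows):
    print(value, row)
```

The number of output columns equals the number of distinct values, which is why this approach becomes unwieldy for high-cardinality columns.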
Assigns sequential numbers:
- Pros: Memory efficient
- Cons: Introduces artificial ordering
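
The artificial-ordering problem is easy to see in a small sketch. The helper `ordinal_encode` and the sample data below are hypothetical; codes are assigned in alphabetical order here, which is one common convention but an assumption of this example.

```python
def ordinal_encode(values):
    """Map each distinct value to an integer code (alphabetical order)."""
    categories = sorted(set(values))
    code_of = {category: code for code, category in enumerate(categories)}
    return code_of, [code_of[v] for v in values]


sizes = ["medium", "small", "large", "small"]  # hypothetical sample column
code_of, codes = ordinal_encode(sizes)

print(code_of)
print(codes)
```

Each value takes a single integer, so the result is compact, but a model reading these codes would see "large" < "medium" < "small", an order that comes from the alphabet rather than from the data.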
What if you have 1000+ unique values (IDs, free text)?
TextEncoder:
- Pros: Captures semantic meaning, works very well on text
- Cons: Very slow, requires heavy dependencies
MinHashEncoder:
- Pros: Very fast
- Cons: Encodings are usually of very low quality
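
The speed of hashing-based encoders such as MinHashEncoder comes from needing no fitting step: every string is hashed independently. The sketch below shows the general min-hash idea on character 3-grams. It is an illustration under stated assumptions, not the library's actual implementation: the function names, the padding scheme, and the use of `hashlib.blake2b` as the hash family are all choices made for this example.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of character n-grams of the space-padded string."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def seeded_hash(ngram, seed):
    """Deterministic 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        ngram.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(text, n_components=8, n=3):
    """One component per hash function: the minimum hash over all n-grams."""
    grams = char_ngrams(text, n)
    if not grams:  # empty input has no n-grams
        return [0] * n_components
    return [min(seeded_hash(g, seed) for g in grams)
            for seed in range(n_components)]


def shared_fraction(a, b):
    """Fraction of matching components, an estimate of n-gram overlap."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


london = minhash_encode("London", n_components=64)
londen = minhash_encode("Londen", n_components=64)
tokyo = minhash_encode("Tokyo", n_components=64)
print(shared_fraction(london, londen), shared_fraction(london, tokyo))
```

Strings that share many n-grams tend to agree on more components, so typos and variants land close together. The raw components carry no meaning beyond that overlap, which is consistent with the low encoding quality noted above.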
GapEncoder:
- Pros: Interpretable results
- Cons: Slower, performance similar to StringEncoder
| Encoder | Speed | Performance | Interpretability | Use Case |
|---|---|---|---|---|
| OneHotEncoder | Fast | Good | High | Low cardinality |
| StringEncoder | Fast | Excellent | Low | Default choice |
| TextEncoder | Slow | Excellent | Medium | Real text data |
| MinHashEncoder | Very Fast | Fair | Low | Quick prototyping |
| GapEncoder | Slow | Good | High | Interpretability needed |
- OneHotEncoder for < 40 unique values
- StringEncoder for high-cardinality by default
- TextEncoder for true natural language
- GapEncoder when interpretability matters
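
These rules of thumb can be written down as a small decision helper. The function `choose_encoder`, its parameter names, and the precedence between the rules (cardinality first, then natural language, then interpretability) are assumptions made for this sketch. Only the threshold of 40 and the encoder names come from the guidelines above.

```python
def choose_encoder(n_unique, is_natural_language=False,
                   need_interpretability=False, low_cardinality_threshold=40):
    """Return the name of the suggested encoder for one column."""
    # Few distinct values: indicator columns stay manageable.
    if n_unique < low_cardinality_threshold:
        return "OneHotEncoder"
    # Genuine free text benefits from a semantic encoder, at a speed cost.
    if is_natural_language:
        return "TextEncoder"
    # Interpretable components are worth the slower fit when needed.
    if need_interpretability:
        return "GapEncoder"
    # Default for high-cardinality string columns.
    return "StringEncoder"


print(choose_encoder(5))
print(choose_encoder(1200))
print(choose_encoder(1200, is_natural_language=True))
print(choose_encoder(1200, need_interpretability=True))
```

The helper returns names rather than encoder objects so that it runs without any dependency; mapping a name to the actual class is left to the caller.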