Convert any mixed-type dataframe into numeric features for ML:

- Categorical features need to be encoded into numeric features
- Datetime features can be transformed into more informative features
- Data should be sanitized

The TableVectorizer starts by cleaning:

- Numeric values are converted to float32 for efficiency

In practice, this step is a Cleaner followed by conversion to float32.

After cleaning, columns are routed to appropriate transformers:
| Column Type | Cardinality | Transformer |
|---|---|---|
| Numeric | - | Passthrough |
| Datetime | - | DatetimeEncoder |
| String/Category | ≤ 40 | OneHotEncoder |
| String/Category | > 40 | StringEncoder |
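
The routing rule in the table can be sketched as a plain function. This is an illustration only, not skrub's implementation: `route_column` and `CARDINALITY_THRESHOLD` are hypothetical names, and the type checks are deliberately simplified (a real dataframe column carries a dtype, which this sketch replaces with `isinstance` checks on Python values).

```python
from datetime import datetime

# Threshold taken from the table above (hypothetical constant name).
CARDINALITY_THRESHOLD = 40


def route_column(values):
    """Return the name of the transformer the table assigns to a column.

    Hypothetical helper for illustration; it is not part of skrub.
    """
    non_null = [v for v in values if v is not None]
    # Datetime columns go to the DatetimeEncoder, whatever their cardinality.
    if non_null and all(isinstance(v, datetime) for v in non_null):
        return "DatetimeEncoder"
    # Numeric columns are passed through unchanged.
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        return "Passthrough"
    # Everything else is treated as string/categorical and routed by cardinality.
    n_unique = len(set(non_null))
    if n_unique <= CARDINALITY_THRESHOLD:
        return "OneHotEncoder"
    return "StringEncoder"


columns = {
    "price": [9.5, 12.0, 7.25],
    "sold_at": [datetime(2024, 1, 5), datetime(2024, 2, 1)],
    "color": ["red", "blue", "red"],
    "comment": [f"free text {i}" for i in range(100)],
}
routes = {name: route_column(values) for name, values in columns.items()}
```

Here `color` has 2 distinct values and stays below the threshold, while `comment` has 100 distinct values and is routed to the high-cardinality transformer.
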
Each of these transformers can be replaced with a custom one:

```python
from skrub import DatetimeEncoder, StringEncoder, TableVectorizer

# Add periodic encoding
datetime_enc = DatetimeEncoder(periodic_encoding="circular")

# Change the number of output components of the StringEncoder
string_enc = StringEncoder(n_components=10)

tv = TableVectorizer(
    datetime=datetime_enc,
    high_cardinality=string_enc,
)
```

More about the encoders in Chapter 4!
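
To give an intuition for what a circular periodic encoding does, here is a minimal standard-library sketch. It is not the DatetimeEncoder's implementation, and the feature names are hypothetical; it only shows the general idea of mapping a periodic quantity such as the hour onto the unit circle, so that 23:00 and 00:00 end up close together.

```python
import math
from datetime import datetime


def circular_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def datetime_features(ts):
    """Extract plain and circular features from a datetime.

    Hypothetical helper; the output names are chosen for illustration.
    """
    hour_sin, hour_cos = circular_encode(ts.hour, 24)
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "hour_sin": hour_sin,
        "hour_cos": hour_cos,
    }


features = datetime_features(datetime(2024, 1, 1, 6))

# Distance on the circle between 23:00 and 00:00, and between 00:00 and 01:00.
late_to_midnight = math.dist(circular_encode(23, 24), circular_encode(0, 24))
midnight_to_one = math.dist(circular_encode(0, 24), circular_encode(1, 24))
```

With the raw hour, 23 and 0 are 23 units apart; with the circular encoding, the two distances computed above are equal, which is the property a model can exploit for periodic patterns.
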
More about selectors in Chapter 3!
Columns in specific_transformers are forwarded directly to the transformer without any cleaning or parsing.
TableVectorizer – handles preprocessing and feature engineering