TableVectorizerMachine learning models typically require numeric input features. When working with real-world datasets, we often have a mix of data types: numbers, text, dates, and categorical values. The TableVectorizer automates the entire process of converting a heterogeneous dataframe into a matrix of numeric features ready for machine learning.
Instead of manually specifying how to handle each column, the TableVectorizer automatically detects the data type of each column and applies the appropriate transformation to encode the column using numerical features.
The TableVectorizer operates in two phases:
First, it runs a Cleaner on the input data to:
float32 to reduce the computational cost.This ensures that each column has the correct data type before encoding.
After cleaning, the TableVectorizer categorizes columns and dispatches them to the appropriate transformer based on their data type and cardinality. Categorical columns with a cardinality (i.e., number of unique values) larger than 40 (by default)) are considered “high cardinality”, while all other categorical columns are “low cardinality”.
The TableVectorizer uses the following default transformers for each column type:
DatetimeEncoder to extract meaningful temporal featuresOneHotEncoder to create binary indicator variablesStringEncoder to create dense numeric representationsThe cardinality threshold that splits columns in “high” and “low” cardinality can be changed by setting the relative parameter:
The TableVectorizer forwards several parameters to the internal Cleaner, which behave in the same way:
drop_null_fraction: Fraction of nulls above which a column is dropped (default: 1.0)drop_if_constant: Drop columns with only one unique value (default: False)drop_if_unique: Drop string/categorical columns where all values are unique (default: False)datetime_format: Format string for parsing datesTableVectorizerThe TableVectorizer applies whatever transformer is provided to each of the numeric, datetime, high_cardinality, and low_cardinality paramters. To tweak the default parameters of the transformers a new transformer should be provided:
from skrub import TableVectorizer, DatetimeEncoder, StringEncoder
from sklearn.preprocessing import OneHotEncoder
# Create custom transformers
datetime_enc = DatetimeEncoder(periodic_encoding="circular")
string_enc = StringEncoder(n_components=10)
# Pass them to TableVectorizer
tv = TableVectorizer(
datetime=datetime_enc,
high_cardinality=string_enc,
)This allows to, for example, change the number of parameters in the StringEncoder, provide a custom datetime format for the DatetimeEncoder, or use a completely different encoder such as the TextEncoder.
TableVectorizer only to a subset of columnsBy default, the TableVectorizer is applied to all the columns in the given dataframe. In some cases, it may be important to keep specific columns “as is”, so that they are not modified by the transformer.
This can be done by wrapping the vectorizer into an ApplyToCols object.
For example, in this case we might want to avoid modifying the two *_id columns.
import pandas as pd
from skrub import ApplyToCols
import skrub.selectors as s
df = pd.DataFrame(
{
"metric_1": [10.5, 20.3, 30.1, 40.2],
"metric_2": [5.1, 15.6, None, 35.8],
"metric_3": [1.1, 3.3, 2.6, .8],
"num_id": [101, 102, 103, 104],
"str_id": ["A101", "A102", "A103", "A104"],
"description": ["apple", None, "cherry", "date"],
"name": ["Alice", "Bob", "Charlie", "David"],
}
)
df| metric_1 | metric_2 | metric_3 | num_id | str_id | description | name | |
|---|---|---|---|---|---|---|---|
| 0 | 10.5 | 5.1 | 1.1 | 101 | A101 | apple | Alice |
| 1 | 20.3 | 15.6 | 3.3 | 102 | A102 | None | Bob |
| 2 | 30.1 | NaN | 2.6 | 103 | A103 | cherry | Charlie |
| 3 | 40.2 | 35.8 | 0.8 | 104 | A104 | date | David |
We can use ApplyToCols and the skrub selectors as follows:
Original
num_id str_id
0 101 A101
1 102 A102
2 103 A103
3 104 A104
Encoded
num_id str_id
0 101 A101
1 102 A102
2 103 A103
3 104 A104
The id strings are the same, while all other columns have been encoded as expected.
specific_transformers for more low-level controlFor fine-grained control, we can specify transformers for specific columns using the specific_transformers parameter. This is useful when we want to override the default behavior for particular columns:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
df = pd.DataFrame({
"occupation": ["engineer", "teacher", "doctor"],
"salary": [100000, 50000, 150000]
})
# Create a custom transformer for the 'occupation' column
specific_transformers = [(OrdinalEncoder(), ["occupation"])]
tv = TableVectorizer(specific_transformers=specific_transformers)
result = tv.fit_transform(df)Important notes about specific_transformers:
ApplyToCols and the selectors API (explained in the previous chapters), or the skrub Data Ops.The TableVectorizer is a self-contained feature engineering engine that
The idea behind the TableVectorizer is that you should be able to provide any dataframe, and get a good feature matrix based on that dataframe as a result.
The TableVectorizer makes use of most of the objects that have been explained so far, and is an important part of the tabular_pipeline explained in the next chapter.
TableVectorizer from its componentsPath to the exercise: content/exercises/08_feat_eng_table_vect.ipynb
Replicate the behavior of a TableVectorizer using ApplyToCols, the skrub selectors, and the given transformers.
Notes on the implementation:
Cleaner that uses default parameters, except for numeric_dtype="float32".threshold to 4.StringEncoder. In this exercise, set n_components to 2.OneHotEncoder, and the first category in binary features is dropped (hint: check the docs of the OneHotEncoder for the drop parameter). Set sparse_output=True.cardinality_below is one of the skrub selectors.DatetimeEncoder.Pipeline (or with make_pipeline).Use the following dataframe to test the result.
import pandas as pd
import datetime
data = {
"int": [15, 56, 63, 12, 44],
"float": [5.2, 2.4, 6.2, 10.45, 9.0],
"str1": ["public", "private", "private", "private", "public"],
"str2": ["officer", "manager", "lawyer", "chef", "teacher"],
"bool": [True, False, True, False, True],
"datetime-col": [
"2020-02-03T12:30:05",
"2021-03-15T00:37:15",
"2022-02-13T17:03:25",
"2023-05-22T08:45:55",
]
+ [None],
}
df = pd.DataFrame(data)
df| int | float | str1 | str2 | bool | datetime-col | |
|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | public | officer | True | 2020-02-03T12:30:05 |
| 1 | 56 | 2.40 | private | manager | False | 2021-03-15T00:37:15 |
| 2 | 63 | 6.20 | private | lawyer | True | 2022-02-13T17:03:25 |
| 3 | 12 | 10.45 | private | chef | False | 2023-05-22T08:45:55 |
| 4 | 44 | 9.00 | public | teacher | True | None |
Use the following PassThrough transformer where needed.
You can test the correctness of your solution by comparing it with the equivalent TableVectorizer:
| int | float | str1_public | str2_0 | str2_1 | bool | datetime-col_year | datetime-col_month | datetime-col_day | datetime-col_hour | datetime-col_total_seconds | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | 5.20 | 1.0 | 0.820969 | -0.926891 | 1.0 | 2020.0 | 2.0 | 3.0 | 12.0 | 1.580733e+09 |
| 1 | 56.0 | 2.40 | 0.0 | 0.820970 | -0.926892 | 0.0 | 2021.0 | 3.0 | 15.0 | 0.0 | 1.615769e+09 |
| 2 | 63.0 | 6.20 | 0.0 | 0.862893 | -0.936523 | 1.0 | 2022.0 | 2.0 | 13.0 | 17.0 | 1.644772e+09 |
| 3 | 12.0 | 10.45 | 0.0 | 1.029679 | 1.353004 | 0.0 | 2023.0 | 5.0 | 22.0 | 8.0 | 1.684745e+09 |
| 4 | 44.0 | 9.00 | 1.0 | 1.419119 | 0.660169 | 1.0 | NaN | NaN | NaN | NaN | NaN |
# Solution
cleaner = ApplyToCols(Cleaner())
high_cardinality = ApplyToCols(
StringEncoder(n_components=2), cols=~s.cardinality_below(4) & (s.string())
)
low_cardinality = ApplyToCols(
OneHotEncoder(sparse_output=False, drop="if_binary"),
cols=s.cardinality_below(4) & s.string(),
)
numeric = ApplyToCols(PassThrough(), cols=s.numeric())
datetime = ApplyToCols(DatetimeEncoder(), cols=s.any_date())
my_table_vectorizer = make_pipeline(
cleaner, numeric, high_cardinality, low_cardinality, datetime
)
my_table_vectorizer.fit_transform(df)| int | float | str1_public | str2_0 | str2_1 | bool | datetime-col_year | datetime-col_month | datetime-col_day | datetime-col_hour | datetime-col_total_seconds | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | 1.0 | 0.820966 | -0.926893 | True | 2020.0 | 2.0 | 3.0 | 12.0 | 1.580733e+09 |
| 1 | 56 | 2.40 | 0.0 | 0.820970 | -0.926890 | False | 2021.0 | 3.0 | 15.0 | 0.0 | 1.615769e+09 |
| 2 | 63 | 6.20 | 0.0 | 0.862890 | -0.936528 | True | 2022.0 | 2.0 | 13.0 | 17.0 | 1.644772e+09 |
| 3 | 12 | 10.45 | 0.0 | 1.029685 | 1.353004 | False | 2023.0 | 5.0 | 22.0 | 8.0 | 1.684745e+09 |
| 4 | 44 | 9.00 | 1.0 | 1.419118 | 0.660162 | True | NaN | NaN | NaN | NaN | NaN |