13 All the pre-processing in one place: TableVectorizer
13.1 Introduction
Machine learning models typically require numeric input features. When working with real-world datasets, we often have a mix of data types: numbers, text, dates, and categorical values. The TableVectorizer automates the entire process of converting a heterogeneous dataframe into a matrix of numeric features ready for machine learning.
Instead of manually specifying how to handle each column, the TableVectorizer automatically detects the data type of each column and applies the appropriate transformation to encode the column using numerical features.
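For instance, here is a minimal sketch (the dataframe below is made up for illustration):
from skrub import TableVectorizer
import pandas as pd

# A small dataframe mixing numbers, categories, and date strings
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "city": ["Paris", "London", "Paris", "Madrid"],
        "signup": ["2021-01-05", "2022-03-14", "2020-11-30", "2023-07-01"],
    }
)

tv = TableVectorizer()
features = tv.fit_transform(df)  # an all-numeric feature table, ready for a model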
13.2 How does the TableVectorizer work?
The TableVectorizer operates in two phases:
13.2.1 Phase 1: Data cleaning and type detection
First, it runs a Cleaner on the input data to:
- Detect and parse datetime columns (possibly with custom datetime formats)
- Detect and parse numerical columns written as strings
- Handle missing values represented as strings (e.g., “N/A”)
- Clean up categorical columns to have consistent typing
- Remove uninformative columns (those with only nulls, constant values, or all unique values)
- Finally, convert all numerical features to float32 to reduce the computational cost
This ensures that each column has the correct data type before encoding.
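As an illustration, the same Cleaner can also be used on its own; a minimal sketch (the toy data is made up, and the exact set of null strings recognized may depend on your skrub version):
from skrub import Cleaner
import pandas as pd

messy = pd.DataFrame(
    {
        "when": ["2024-01-01", "2024-02-15", None],  # dates stored as strings
        "amount": ["1.5", "2.0", "N/A"],  # numbers stored as strings
    }
)
cleaned = Cleaner(numeric_dtype="float32").fit_transform(messy)
print(cleaned.dtypes)  # "when" parsed as datetime, "amount" as float32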
13.2.2 Phase 2: Column dispatch and encoding
After cleaning, the TableVectorizer categorizes columns and dispatches them to the appropriate transformer based on their data type and cardinality. Categorical columns with a cardinality (i.e., number of unique values) larger than 40 (by default) are considered “high cardinality”, while all other categorical columns are “low cardinality”.
The TableVectorizer uses the following default transformers for each column type:
- Numeric columns: Left untouched (passthrough); they’re already in the right format
- Datetime columns: Transformed by DatetimeEncoder to extract meaningful temporal features
- Low-cardinality categorical/string columns: Transformed with OneHotEncoder to create binary indicator variables
- High-cardinality categorical/string columns: Transformed with StringEncoder to create dense numeric representations
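After fitting, we can check which transformer was assigned to each column. A sketch, assuming your skrub version exposes the fitted transformers_ mapping on the TableVectorizer:
from skrub import TableVectorizer
import pandas as pd

df = pd.DataFrame(
    {
        "age": [25, 32, 47],
        "city": ["Paris", "London", "Paris"],
    }
)
tv = TableVectorizer()
tv.fit(df)
# transformers_ maps each column name to the fitted transformer chosen for it
for column, transformer in tv.transformers_.items():
    print(column, "->", transformer)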
13.3 Key Parameters
13.3.1 Cardinality threshold
The cardinality threshold that splits columns into “high” and “low” cardinality can be changed by setting the cardinality_threshold parameter:
from skrub import TableVectorizer
tv = TableVectorizer(cardinality_threshold=10)  # Adjust the threshold
13.3.2 Data cleaning parameters
The TableVectorizer forwards several parameters to the internal Cleaner, which behave in the same way:
- drop_null_fraction: Fraction of nulls above which a column is dropped (default: 1.0)
- drop_if_constant: Drop columns with only one unique value (default: False)
- drop_if_unique: Drop string/categorical columns where all values are unique (default: False)
- datetime_format: Format string for parsing dates
tv = TableVectorizer(
    drop_null_fraction=0.9,  # Drop columns with more than 90% nulls
    drop_if_constant=True,
    datetime_format="%Y-%m-%d",
)
13.3.3 Customizing the transformers used by TableVectorizer
The TableVectorizer applies whatever transformer is provided via each of the numeric, datetime, high_cardinality, and low_cardinality parameters. To tweak the default parameters of one of these transformers, provide a new transformer instance:
from skrub import TableVectorizer, DatetimeEncoder, StringEncoder
from sklearn.preprocessing import OneHotEncoder
# Create custom transformers
datetime_enc = DatetimeEncoder(periodic_encoding="circular")
string_enc = StringEncoder(n_components=10)
# Pass them to TableVectorizer
tv = TableVectorizer(
    datetime=datetime_enc,
    high_cardinality=string_enc,
)
This makes it possible, for example, to change the number of components in the StringEncoder, tweak the periodic encoding of the DatetimeEncoder, or use a completely different encoder such as the TextEncoder.
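For example, a hedged sketch of swapping in the TextEncoder (it relies on pre-trained language-model embeddings, so it requires skrub's optional deep-learning dependencies and downloads a model on first use):
from skrub import TableVectorizer, TextEncoder

# Encode high-cardinality text columns with language-model embeddings
tv = TableVectorizer(high_cardinality=TextEncoder())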
13.3.4 Applying the TableVectorizer only to a subset of columns
By default, the TableVectorizer is applied to all the columns in the given dataframe. In some cases, it may be important to keep specific columns “as is”, so that they are not modified by the transformer.
This can be done by wrapping the vectorizer into an ApplyToCols object.
For example, in this case we might want to avoid modifying the two *_id columns.
import pandas as pd
from skrub import ApplyToCols
import skrub.selectors as s
df = pd.DataFrame(
    {
        "metric_1": [10.5, 20.3, 30.1, 40.2],
        "metric_2": [5.1, 15.6, None, 35.8],
        "metric_3": [1.1, 3.3, 2.6, 0.8],
        "num_id": [101, 102, 103, 104],
        "str_id": ["A101", "A102", "A103", "A104"],
        "description": ["apple", None, "cherry", "date"],
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)
df
|   | metric_1 | metric_2 | metric_3 | num_id | str_id | description | name |
|---|---|---|---|---|---|---|---|
| 0 | 10.5 | 5.1 | 1.1 | 101 | A101 | apple | Alice |
| 1 | 20.3 | 15.6 | 3.3 | 102 | A102 | None | Bob |
| 2 | 30.1 | NaN | 2.6 | 103 | A103 | cherry | Charlie |
| 3 | 40.2 | 35.8 | 0.8 | 104 | A104 | date | David |
We can use ApplyToCols and the skrub selectors as follows:
tv = ApplyToCols(TableVectorizer(), cols=s.all() - s.glob("*_id"))
df_enc = tv.fit_transform(df)
print("Original")
print(df[["num_id", "str_id"]])
print("\nEncoded")
print(df_enc[["num_id", "str_id"]])
Original
num_id str_id
0 101 A101
1 102 A102
2 103 A103
3 104 A104
Encoded
num_id str_id
0 101 A101
1 102 A102
2 103 A103
3 104 A104
The id strings are the same, while all other columns have been encoded as expected.
13.3.5 Using specific_transformers for more low-level control
For fine-grained control, we can specify transformers for specific columns using the specific_transformers parameter. This is useful when we want to override the default behavior for particular columns:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
df = pd.DataFrame(
    {
        "occupation": ["engineer", "teacher", "doctor"],
        "salary": [100000, 50000, 150000],
    }
)
# Create a custom transformer for the 'occupation' column
specific_transformers = [(OrdinalEncoder(), ["occupation"])]
tv = TableVectorizer(specific_transformers=specific_transformers)
result = tv.fit_transform(df)
Important notes about specific_transformers:
- Columns specified here bypass the default categorization logic
- The transformer receives the column as-is, without any preprocessing
- The transformer must be able to handle the column’s current data type and values
- For more complex transformations, consider using ApplyToCols and the selectors API (explained in the previous chapters), or the skrub Data Ops; see the sketch below.
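For instance, a minimal sketch of the ApplyToCols alternative, reproducing the example above with the selectors-style API:
from skrub import ApplyToCols
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame(
    {
        "occupation": ["engineer", "teacher", "doctor"],
        "salary": [100000, 50000, 150000],
    }
)
# Apply the encoder only to the "occupation" column; others pass through
encode_occupation = ApplyToCols(OrdinalEncoder(), cols=["occupation"])
result = encode_occupation.fit_transform(df)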
13.4 Conclusions
The TableVectorizer is a self-contained feature engineering engine that:
- Cleans your data to have a consistent representation of data types and null values, and
- Encodes all columns depending on their data type and characteristics, using good defaults.
The idea behind the TableVectorizer is that you should be able to provide any dataframe, and get a good feature matrix based on that dataframe as a result.
The TableVectorizer makes use of most of the objects that have been explained so far, and is an important part of the tabular_pipeline explained in the next chapter.
14 Exercise: implementing a TableVectorizer from its components
Path to the exercise: content/exercises/08_feat_eng_table_vect.ipynb
Replicate the behavior of a TableVectorizer using ApplyToCols, the skrub selectors, and the given transformers.
from skrub import Cleaner, ApplyToCols, StringEncoder, DatetimeEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
import skrub.selectors as s
Notes on the implementation:
- In the first step, the TableVectorizer cleans the data to parse datetimes and other dtypes with a Cleaner that uses default parameters, except for numeric_dtype="float32".
- Numeric features are left untouched, i.e., they use a PassThrough transformer.
- String and categorical features are split into high- and low-cardinality features.
- For this exercise, set the cardinality threshold to 4.
- High-cardinality features are transformed with a StringEncoder. In this exercise, set n_components to 2.
- Low-cardinality features are transformed with a OneHotEncoder, and the first category in binary features is dropped (hint: check the docs of the OneHotEncoder for the drop parameter). Set sparse_output=False.
- Remember that cardinality_below is one of the skrub selectors.
- Datetimes are transformed by a default DatetimeEncoder.
- Everything should be wrapped in a scikit-learn Pipeline (e.g., with make_pipeline).
Use the following dataframe to test the result.
import pandas as pd
import datetime
data = {
    "int": [15, 56, 63, 12, 44],
    "float": [5.2, 2.4, 6.2, 10.45, 9.0],
    "str1": ["public", "private", "private", "private", "public"],
    "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
    "bool": [True, False, True, False, True],
    "datetime-col": [
        "2020-02-03T12:30:05",
        "2021-03-15T00:37:15",
        "2022-02-13T17:03:25",
        "2023-05-22T08:45:55",
    ]
    + [None],
}
df = pd.DataFrame(data)
df
|   | int | float | str1 | str2 | bool | datetime-col |
|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | public | officer | True | 2020-02-03T12:30:05 |
| 1 | 56 | 2.40 | private | manager | False | 2021-03-15T00:37:15 |
| 2 | 63 | 6.20 | private | lawyer | True | 2022-02-13T17:03:25 |
| 3 | 12 | 10.45 | private | chef | False | 2023-05-22T08:45:55 |
| 4 | 44 | 9.00 | public | teacher | True | None |
Use the following PassThrough transformer where needed.
from skrub._apply_to_cols import SingleColumnTransformer

class PassThrough(SingleColumnTransformer):
    def fit_transform(self, column, y=None):
        return column

    def transform(self, column):
        return column
You can test the correctness of your solution by comparing it with the equivalent TableVectorizer:
from skrub import TableVectorizer
tv = TableVectorizer(
    high_cardinality=StringEncoder(n_components=2), cardinality_threshold=4
)
tv.fit_transform(df)
|   | int | float | str1_public | str2_0 | str2_1 | bool | datetime-col_year | datetime-col_month | datetime-col_day | datetime-col_hour | datetime-col_total_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | 5.20 | 1.0 | 0.820974 | -0.926883 | 1.0 | 2020.0 | 2.0 | 3.0 | 12.0 | 1.580733e+09 |
| 1 | 56.0 | 2.40 | 0.0 | 0.820970 | -0.926897 | 0.0 | 2021.0 | 3.0 | 15.0 | 0.0 | 1.615769e+09 |
| 2 | 63.0 | 6.20 | 0.0 | 0.862896 | -0.936519 | 1.0 | 2022.0 | 2.0 | 13.0 | 17.0 | 1.644772e+09 |
| 3 | 12.0 | 10.45 | 0.0 | 1.029678 | 1.353008 | 0.0 | 2023.0 | 5.0 | 22.0 | 8.0 | 1.684745e+09 |
| 4 | 44.0 | 9.00 | 1.0 | 1.419116 | 0.660171 | 1.0 | NaN | NaN | NaN | NaN | NaN |
# Write your code here
#
#
#
#
#
#
#
#
#
#
#
#
Solution
cleaner = ApplyToCols(Cleaner())
high_cardinality = ApplyToCols(
    StringEncoder(n_components=2), cols=~s.cardinality_below(4) & s.string()
)
low_cardinality = ApplyToCols(
    OneHotEncoder(sparse_output=False, drop="if_binary"),
    cols=s.cardinality_below(4) & s.string(),
)
numeric = ApplyToCols(PassThrough(), cols=s.numeric())
datetime = ApplyToCols(DatetimeEncoder(), cols=s.any_date())
my_table_vectorizer = make_pipeline(
    cleaner, numeric, high_cardinality, low_cardinality, datetime
)
my_table_vectorizer.fit_transform(df)
|   | int | float | str1_public | str2_0 | str2_1 | bool | datetime-col_year | datetime-col_month | datetime-col_day | datetime-col_hour | datetime-col_total_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | 1.0 | 0.820968 | -0.926893 | True | 2020.0 | 2.0 | 3.0 | 12.0 | 1.580733e+09 |
| 1 | 56 | 2.40 | 0.0 | 0.820969 | -0.926902 | False | 2021.0 | 3.0 | 15.0 | 0.0 | 1.615769e+09 |
| 2 | 63 | 6.20 | 0.0 | 0.862895 | -0.936510 | True | 2022.0 | 2.0 | 13.0 | 17.0 | 1.644772e+09 |
| 3 | 12 | 10.45 | 0.0 | 1.029677 | 1.353003 | False | 2023.0 | 5.0 | 22.0 | 8.0 | 1.684745e+09 |
| 4 | 44 | 9.00 | 1.0 | 1.419119 | 0.660171 | True | NaN | NaN | NaN | NaN | NaN |