All the pre-processing in one place: `TableVectorizer`

Introduction

Machine learning models typically require numeric input features. When working with real-world datasets, we often have a mix of data types: numbers, text, dates, and categorical values. The TableVectorizer automates the entire process of converting a heterogeneous dataframe into a matrix of numeric features ready for machine learning.

Instead of manually specifying how to handle each column, the TableVectorizer automatically detects the data type of each column and applies the appropriate transformation to encode the column using numerical features.

How does the TableVectorizer work?

The TableVectorizer operates in two phases:

Phase 1: Data cleaning and type detection

First, it runs a Cleaner on the input data to:

Detect and parse datetime columns (possibly, with custom datetime formats)
Detect and parse numerical columns written as strings
Handle missing values represented as strings (e.g., “N/A”)
Clean up categorical columns to have consistent typing
Remove uninformative columns (those with only nulls, constant values, or all unique values)
Finally, convert all numerical features to float32 to reduce the computational cost.

This ensures that each column has the correct data type before encoding.

Phase 2: Column dispatch and encoding

After cleaning, the TableVectorizer categorizes columns and dispatches them to the appropriate transformer based on their data type and cardinality. Categorical columns with a cardinality (i.e., number of unique values) larger than 40 (by default)) are considered “high cardinality”, while all other categorical columns are “low cardinality”.

The TableVectorizer uses the following default transformers for each column type:

Numeric columns: Left untouched (passthrough) - they’re already in the right format
Datetime columns: Transformed by DatetimeEncoder to extract meaningful temporal features
Low-cardinality categorical/string columns: Transformed with OneHotEncoder to create binary indicator variables
High-cardinality categorical/string columns: Transformed with StringEncoder to create dense numeric representations

Key Parameters

Cardinality threshold

The cardinality threshold that splits columns in “high” and “low” cardinality can be changed by setting the relative parameter:

from skrub import TableVectorizer

tv = TableVectorizer(cardinality_threshold=10)  # Adjust the threshold

Data cleaning parameters

The TableVectorizer forwards several parameters to the internal Cleaner, which behave in the same way:

drop_null_fraction: Fraction of nulls above which a column is dropped (default: 1.0)
drop_if_constant: Drop columns with only one unique value (default: False)
drop_if_unique: Drop string/categorical columns where all values are unique (default: False)
datetime_format: Format string for parsing dates

tv = TableVectorizer(
    drop_null_fraction=0.9,  # Drop columns that are 90% null
    drop_if_constant=True,
    datetime_format="%Y-%m-%d"
)

Customizing the transformers used by `TableVectorizer`

The TableVectorizer applies whatever transformer is provided to each of the numeric, datetime, high_cardinality, and low_cardinality paramters. To tweak the default parameters of the transformers a new transformer should be provided:

from skrub import TableVectorizer, DatetimeEncoder, StringEncoder
from sklearn.preprocessing import OneHotEncoder

# Create custom transformers
datetime_enc = DatetimeEncoder(periodic_encoding="circular")
string_enc = StringEncoder(n_components=10)

# Pass them to TableVectorizer
tv = TableVectorizer(
    datetime=datetime_enc,
    high_cardinality=string_enc,
)

This allows to, for example, change the number of parameters in the StringEncoder, provide a custom datetime format for the DatetimeEncoder, or use a completely different encoder such as the TextEncoder.

Applying the `TableVectorizer` only to a subset of columns

By default, the TableVectorizer is applied to all the columns in the given dataframe. In some cases, it may be important to keep specific columns “as is”, so that they are not modified by the transformer.

This can be done by wrapping the vectorizer into an ApplyToCols object.

For example, in this case we might want to avoid modifying the two *_id columns.

import pandas as pd
from skrub import ApplyToCols
import skrub.selectors as s

df = pd.DataFrame(
    {
        "metric_1": [10.5, 20.3, 30.1, 40.2],
        "metric_2": [5.1, 15.6, None, 35.8],
        "metric_3": [1.1, 3.3, 2.6, .8],
        "num_id": [101, 102, 103, 104],
        "str_id": ["A101", "A102", "A103", "A104"],
        "description": ["apple", None, "cherry", "date"],
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)
df

	metric_1	metric_2	metric_3	num_id	str_id	description	name
0	10.5	5.1	1.1	101	A101	apple	Alice
1	20.3	15.6	3.3	102	A102	None	Bob
2	30.1	NaN	2.6	103	A103	cherry	Charlie
3	40.2	35.8	0.8	104	A104	date	David

We can use ApplyToCols and the skrub selectors as follows:

tv = ApplyToCols(TableVectorizer(), cols=s.all()-s.glob("*_id"))
df_enc = tv.fit_transform(df)

print("Original")
print(df[["num_id", "str_id"]])
print("\nEncoded")
print(df_enc[["num_id", "str_id"]])

Original
   num_id str_id
0     101   A101
1     102   A102
2     103   A103
3     104   A104

Encoded
   num_id str_id
0     101   A101
1     102   A102
2     103   A103
3     104   A104

The id strings are the same, while all other columns have been encoded as expected.

Using `specific_transformers` for more low-level control

For fine-grained control, we can specify transformers for specific columns using the specific_transformers parameter. This is useful when we want to override the default behavior for particular columns:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", "teacher", "doctor"],
    "salary": [100000, 50000, 150000]
})

# Create a custom transformer for the 'occupation' column
specific_transformers = [(OrdinalEncoder(), ["occupation"])]

tv = TableVectorizer(specific_transformers=specific_transformers)
result = tv.fit_transform(df)

Important notes about specific_transformers:

Columns specified here bypass the default categorization logic
The transformer receives the column as-is, without any preprocessing
The transformer must be able to handle the column’s current data type and values
For more complex transformations, consider using ApplyToCols and the selectors API (explained in the previous chapters), or the skrub Data Ops.

Conclusions

The TableVectorizer is a self-contained feature engineering engine that

Cleans your data to have consistent representation of data types and null values, and
Encodes all columns depending on their data type and characteristics using good defaults.

The idea behind the TableVectorizer is that you should be able to provide any dataframe, and get a good feature matrix based on that dataframe as a result.

The TableVectorizer makes use of most of the objects that have been explained so far, and is an important part of the tabular_pipeline explained in the next chapter.

Exercise: implementing a `TableVectorizer` from its components

Path to the exercise: content/exercises/08_feat_eng_table_vect.ipynb

Replicate the behavior of a TableVectorizer using ApplyToCols, the skrub selectors, and the given transformers.

from skrub import Cleaner, ApplyToCols, StringEncoder, DatetimeEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
import skrub.selectors as s

Notes on the implementation:

In the first step, the TableVectorizer cleans the data to parse datetimes and other dtypes with a Cleaner that uses default parameters, except for numeric_dtype="float32".
Numeric features are left untouched, i.e., they use a Passthrough transformer.
String and categorical feature are split into high and low cardinality features.
For this exercise, set the the cardinality threshold to 4.
High cardinality features are transformed with a StringEncoder. In this exercise, set n_components to 2.
Low cardinality features are transformed with a OneHotEncoder, and the first category in binary features is dropped (hint: check the docs of the OneHotEncoder for the drop parameter). Set sparse_output=True.
Remember cardinality_below is one of the skrub selectors.
Datetimes are transformed by a default DatetimeEncoder.
Everything should be wrapped in a scikit-learn Pipeline (or with make_pipeline).

Use the following dataframe to test the result.

import pandas as pd
import datetime

data = {
    "int": [15, 56, 63, 12, 44],
    "float": [5.2, 2.4, 6.2, 10.45, 9.0],
    "str1": ["public", "private", "private", "private", "public"],
    "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
    "bool": [True, False, True, False, True],
    "datetime-col": [
        "2020-02-03T12:30:05",
        "2021-03-15T00:37:15",
        "2022-02-13T17:03:25",
        "2023-05-22T08:45:55",
    ]
    + [None],
}
df = pd.DataFrame(data)
df

	int	float	str1	str2	bool	datetime-col
0	15	5.20	public	officer	True	2020-02-03T12:30:05
1	56	2.40	private	manager	False	2021-03-15T00:37:15
2	63	6.20	private	lawyer	True	2022-02-13T17:03:25
3	12	10.45	private	chef	False	2023-05-22T08:45:55
4	44	9.00	public	teacher	True	None

Use the following PassThrough transformer where needed.

from skrub._single_column_transformer import SingleColumnTransformer


class PassThrough(SingleColumnTransformer):
    def fit_transform(self, column, y=None):
        return column

    def transform(self, column):
        return column

You can test the correctness of your solution by comparing it with the equivalent TableVectorizer:

from skrub import TableVectorizer

tv = TableVectorizer(
    high_cardinality=StringEncoder(n_components=2), cardinality_threshold=4
)
tv.fit_transform(df)

	int	float	str1_public	str2_0	str2_1	bool	datetime-col_year	datetime-col_month	datetime-col_day	datetime-col_hour	datetime-col_total_seconds
0	15.0	5.20	1.0	0.820969	-0.926891	1.0	2020.0	2.0	3.0	12.0	1.580733e+09
1	56.0	2.40	0.0	0.820970	-0.926892	0.0	2021.0	3.0	15.0	0.0	1.615769e+09
2	63.0	6.20	0.0	0.862893	-0.936523	1.0	2022.0	2.0	13.0	17.0	1.644772e+09
3	12.0	10.45	0.0	1.029679	1.353004	0.0	2023.0	5.0	22.0	8.0	1.684745e+09
4	44.0	9.00	1.0	1.419119	0.660169	1.0	NaN	NaN	NaN	NaN	NaN

# Write your code here
#
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
#

# Solution
cleaner = ApplyToCols(Cleaner())
high_cardinality = ApplyToCols(
    StringEncoder(n_components=2), cols=~s.cardinality_below(4) & (s.string())
)
low_cardinality = ApplyToCols(
    OneHotEncoder(sparse_output=False, drop="if_binary"),
    cols=s.cardinality_below(4) & s.string(),
)
numeric = ApplyToCols(PassThrough(), cols=s.numeric())
datetime = ApplyToCols(DatetimeEncoder(), cols=s.any_date())

my_table_vectorizer = make_pipeline(
    cleaner, numeric, high_cardinality, low_cardinality, datetime
)

my_table_vectorizer.fit_transform(df)

	int	float	str1_public	str2_0	str2_1	bool	datetime-col_year	datetime-col_month	datetime-col_day	datetime-col_hour	datetime-col_total_seconds
0	15	5.20	1.0	0.820966	-0.926893	True	2020.0	2.0	3.0	12.0	1.580733e+09
1	56	2.40	0.0	0.820970	-0.926890	False	2021.0	3.0	15.0	0.0	1.615769e+09
2	63	6.20	0.0	0.862890	-0.936528	True	2022.0	2.0	13.0	17.0	1.644772e+09
3	12	10.45	0.0	1.029685	1.353004	False	2023.0	5.0	22.0	8.0	1.684745e+09
4	44	9.00	1.0	1.419118	0.660162	True	NaN	NaN	NaN	NaN	NaN

All the pre-processing in one place: TableVectorizer