13  All the pre-processing in one place: TableVectorizer

13.1 Introduction

Machine learning models typically require numeric input features. When working with real-world datasets, we often have a mix of data types: numbers, text, dates, and categorical values. The TableVectorizer automates the entire process of converting a heterogeneous dataframe into a matrix of numeric features ready for machine learning.

Instead of requiring you to specify how to handle each column manually, the TableVectorizer automatically detects the data type of each column and applies an appropriate transformation to encode it as numeric features.

13.2 How does the TableVectorizer work?

The TableVectorizer operates in two phases:

13.2.1 Phase 1: Data cleaning and type detection

First, it runs a Cleaner on the input data to:

  • Detect and parse datetime columns (optionally using a custom datetime format)
  • Detect and parse numerical columns written as strings
  • Handle missing values represented as strings (e.g., “N/A”)
  • Clean up categorical columns to have consistent typing
  • Remove uninformative columns (those with only nulls, constant values, or all unique values)
  • Finally, convert all numerical features to float32 to reduce the computational cost.

This ensures that each column has the correct data type before encoding.
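
To make this concrete, here is a minimal sketch of running the Cleaner on its own; the dataframe and its column names are invented for illustration:

from skrub import Cleaner
import pandas as pd

messy = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-02-01", "2024-03-01"],  # dates stored as strings
        "value": ["1.5", "2.0", "N/A"],  # numbers stored as strings, with a string null
        "constant": ["x", "x", "x"],  # uninformative: a single repeated value
    }
)

cleaned = Cleaner(drop_if_constant=True).fit_transform(messy)
# "date" is parsed as a datetime, "value" as a float with a proper null,
# and the constant column is dropped
print(cleaned.dtypes)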

13.2.2 Phase 2: Column dispatch and encoding

After cleaning, the TableVectorizer categorizes the columns and dispatches them to the appropriate transformer based on their data type and cardinality. Categorical columns with a cardinality (i.e., number of unique values) larger than 40 (by default) are considered “high cardinality”, while all other categorical columns are “low cardinality”.

The TableVectorizer uses the following default transformers for each column type (a short example follows the list):

  • Numeric columns: Left untouched (passthrough) - they’re already in the right format
  • Datetime columns: Transformed by DatetimeEncoder to extract meaningful temporal features
  • Low-cardinality categorical/string columns: Transformed with OneHotEncoder to create binary indicator variables
  • High-cardinality categorical/string columns: Transformed with StringEncoder to create dense numeric representations
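
As a quick illustration of these defaults, here is a minimal sketch with one column of each kind (the dataframe is made up for the example):

import pandas as pd
from skrub import TableVectorizer

df = pd.DataFrame(
    {
        "age": [25, 32, 47],  # numeric: passed through
        "joined": ["2021-05-01", "2022-06-15", "2023-01-20"],  # parsed, then encoded as datetime features
        "city": ["Paris", "London", "Paris"],  # low cardinality: one-hot encoded
    }
)

out = TableVectorizer().fit_transform(df)
print(out.columns.tolist())  # numeric, one-hot, and datetime feature columns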

13.3 Key parameters

13.3.1 Cardinality threshold

The cardinality threshold that splits columns into “high” and “low” cardinality can be changed by setting the cardinality_threshold parameter:

from skrub import TableVectorizer

tv = TableVectorizer(cardinality_threshold=10)  # Adjust the threshold
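
To see the effect of the threshold, here is a small sketch: a column with five unique values is one-hot encoded under the default threshold, but is routed to the high-cardinality encoder once the threshold drops below its cardinality (the tiny StringEncoder is only there to keep the example compact):

import pandas as pd
from skrub import StringEncoder, TableVectorizer

df = pd.DataFrame({"color": ["red", "green", "blue", "yellow", "purple"] * 4})

# Default threshold (40): 5 unique values count as low cardinality,
# so the column is one-hot encoded
print(TableVectorizer().fit_transform(df).columns.tolist())

# Threshold of 4: the same column now counts as high cardinality
# and goes through the StringEncoder
tv = TableVectorizer(
    cardinality_threshold=4, high_cardinality=StringEncoder(n_components=2)
)
print(tv.fit_transform(df).columns.tolist())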

13.3.2 Data cleaning parameters

The TableVectorizer forwards several parameters to the internal Cleaner, which behave in the same way:

  • drop_null_fraction: Fraction of nulls above which a column is dropped (default: 1.0)
  • drop_if_constant: Drop columns with only one unique value (default: False)
  • drop_if_unique: Drop string/categorical columns where all values are unique (default: False)
  • datetime_format: Format string for parsing dates

tv = TableVectorizer(
    drop_null_fraction=0.9,  # Drop columns that are 90% null
    drop_if_constant=True,
    datetime_format="%Y-%m-%d"
)

13.3.3 Customizing the transformers used by TableVectorizer

The TableVectorizer applies whatever transformer is passed to each of the numeric, datetime, high_cardinality, and low_cardinality parameters. To tweak the default parameters of a transformer, pass a configured instance:

from skrub import TableVectorizer, DatetimeEncoder, StringEncoder

# Create custom transformers
datetime_enc = DatetimeEncoder(periodic_encoding="circular")
string_enc = StringEncoder(n_components=10)

# Pass them to TableVectorizer
tv = TableVectorizer(
    datetime=datetime_enc,
    high_cardinality=string_enc,
)

This makes it possible, for example, to change the number of components in the StringEncoder, provide a custom datetime format to the DatetimeEncoder, or use a completely different encoder such as the TextEncoder, as sketched below.
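
For instance, a minimal sketch of swapping in the TextEncoder; note that it relies on a pre-trained language model, so it needs skrub's optional deep-learning dependencies and downloads model weights on first use:

from skrub import TableVectorizer, TextEncoder

# Embed high-cardinality string columns with a pre-trained language model
tv = TableVectorizer(high_cardinality=TextEncoder())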

13.3.4 Applying the TableVectorizer only to a subset of columns

By default, the TableVectorizer is applied to all the columns in the given dataframe. In some cases, it may be important to keep specific columns “as is”, so that they are not modified by the transformer.

This can be done by wrapping the vectorizer in an ApplyToCols object.

For example, in this case we might want to avoid modifying the two *_id columns.

import pandas as pd
from skrub import ApplyToCols, TableVectorizer
import skrub.selectors as s

df = pd.DataFrame(
    {
        "metric_1": [10.5, 20.3, 30.1, 40.2],
        "metric_2": [5.1, 15.6, None, 35.8],
        "metric_3": [1.1, 3.3, 2.6, .8],
        "num_id": [101, 102, 103, 104],
        "str_id": ["A101", "A102", "A103", "A104"],
        "description": ["apple", None, "cherry", "date"],
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)
df
   metric_1  metric_2  metric_3  num_id str_id description     name
0      10.5       5.1       1.1     101   A101       apple    Alice
1      20.3      15.6       3.3     102   A102        None      Bob
2      30.1       NaN       2.6     103   A103      cherry  Charlie
3      40.2      35.8       0.8     104   A104        date    David

We can use ApplyToCols and the skrub selectors as follows:

tv = ApplyToCols(TableVectorizer(), cols=s.all() - s.glob("*_id"))
df_enc = tv.fit_transform(df)

print("Original")
print(df[["num_id", "str_id"]])
print("\nEncoded")
print(df_enc[["num_id", "str_id"]])
Original
   num_id str_id
0     101   A101
1     102   A102
2     103   A103
3     104   A104

Encoded
   num_id str_id
0     101   A101
1     102   A102
2     103   A103
3     104   A104

The id columns are unchanged, while all other columns have been encoded as expected.

13.3.5 Using specific_transformers for more low-level control

For fine-grained control, we can specify transformers for specific columns using the specific_transformers parameter. This is useful when we want to override the default behavior for particular columns:

from sklearn.preprocessing import OrdinalEncoder
from skrub import TableVectorizer
import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", "teacher", "doctor"],
    "salary": [100000, 50000, 150000]
})

# Create a custom transformer for the 'occupation' column
specific_transformers = [(OrdinalEncoder(), ["occupation"])]

tv = TableVectorizer(specific_transformers=specific_transformers)
result = tv.fit_transform(df)

Important notes about specific_transformers:

  • Columns specified here bypass the default categorization logic
  • The transformer receives the column as-is, without any preprocessing
  • The transformer must be able to handle the column’s current data type and values
  • For more complex transformations, consider using ApplyToCols and the selectors API (explained in the previous chapters), or the skrub Data Ops; a sketch of the ApplyToCols route follows below.
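
As a point of comparison, here is a sketch of the ApplyToCols route for the same occupation column, reusing the df defined above; the pipeline layout shown is one possible arrangement, not the only one:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from skrub import ApplyToCols, TableVectorizer

# Encode "occupation" ordinally first, then let the TableVectorizer
# handle every remaining column with its defaults
pipe = make_pipeline(
    ApplyToCols(OrdinalEncoder(), cols=["occupation"]),
    TableVectorizer(),
)
result = pipe.fit_transform(df)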

13.4 Conclusions

The TableVectorizer is a self-contained feature engineering engine that

  1. Cleans your data to have consistent representation of data types and null values, and
  2. Encodes all columns depending on their data type and characteristics using good defaults.

The idea behind the TableVectorizer is that you should be able to provide any dataframe and get a good feature matrix as a result.

The TableVectorizer makes use of most of the objects that have been explained so far, and is an important part of the tabular_pipeline explained in the next chapter.

14 Exercise: implementing a TableVectorizer from its components

Path to the exercise: content/exercises/08_feat_eng_table_vect.ipynb

Replicate the behavior of a TableVectorizer using ApplyToCols, the skrub selectors, and the given transformers.

from skrub import Cleaner, ApplyToCols, StringEncoder, DatetimeEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
import skrub.selectors as s

Notes on the implementation:

  • In the first step, the TableVectorizer cleans the data to parse datetimes and other dtypes with a Cleaner that uses default parameters, except for numeric_dtype="float32".
  • Numeric features are left untouched, i.e., they use a PassThrough transformer.
  • String and categorical features are split into high- and low-cardinality features.
  • For this exercise, set the cardinality threshold to 4.
  • High-cardinality features are transformed with a StringEncoder. In this exercise, set n_components to 2.
  • Low-cardinality features are transformed with a OneHotEncoder, and the first category in binary features is dropped (hint: check the docs of the OneHotEncoder for the drop parameter). Set sparse_output=False, so that the output stays a dense dataframe.
  • Remember cardinality_below is one of the skrub selectors.
  • Datetimes are transformed by a default DatetimeEncoder.
  • Everything should be wrapped in a scikit-learn Pipeline (or with make_pipeline).

Use the following dataframe to test the result.

import pandas as pd
import datetime

data = {
    "int": [15, 56, 63, 12, 44],
    "float": [5.2, 2.4, 6.2, 10.45, 9.0],
    "str1": ["public", "private", "private", "private", "public"],
    "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
    "bool": [True, False, True, False, True],
    "datetime-col": [
        "2020-02-03T12:30:05",
        "2021-03-15T00:37:15",
        "2022-02-13T17:03:25",
        "2023-05-22T08:45:55",
    ]
    + [None],
}
df = pd.DataFrame(data)
df
   int  float     str1     str2   bool         datetime-col
0   15   5.20   public  officer   True  2020-02-03T12:30:05
1   56   2.40  private  manager  False  2021-03-15T00:37:15
2   63   6.20  private   lawyer   True  2022-02-13T17:03:25
3   12  10.45  private     chef  False  2023-05-22T08:45:55
4   44   9.00   public  teacher   True                 None

Use the following PassThrough transformer where needed.

from skrub._apply_to_cols import SingleColumnTransformer


class PassThrough(SingleColumnTransformer):
    """Return each column unchanged."""

    def fit_transform(self, column, y=None):
        return column

    def transform(self, column):
        return column

You can test the correctness of your solution by comparing it with the equivalent TableVectorizer:

from skrub import TableVectorizer

tv = TableVectorizer(
    high_cardinality=StringEncoder(n_components=2), cardinality_threshold=4
)
tv.fit_transform(df)
    int  float  str1_public    str2_0    str2_1  bool  datetime-col_year  datetime-col_month  datetime-col_day  datetime-col_hour  datetime-col_total_seconds
0  15.0   5.20          1.0  0.820974 -0.926883   1.0             2020.0                 2.0               3.0               12.0                1.580733e+09
1  56.0   2.40          0.0  0.820970 -0.926897   0.0             2021.0                 3.0              15.0                0.0                1.615769e+09
2  63.0   6.20          0.0  0.862896 -0.936519   1.0             2022.0                 2.0              13.0               17.0                1.644772e+09
3  12.0  10.45          0.0  1.029678  1.353008   0.0             2023.0                 5.0              22.0                8.0                1.684745e+09
4  44.0   9.00          1.0  1.419116  0.660171   1.0                NaN                 NaN               NaN                NaN                         NaN
# Write your code here

# Solution
# Clean the data first (the real TableVectorizer additionally sets numeric_dtype="float32")
cleaner = ApplyToCols(Cleaner())
# String columns whose cardinality is not below 4 go to the StringEncoder
high_cardinality = ApplyToCols(
    StringEncoder(n_components=2), cols=~s.cardinality_below(4) & s.string()
)
# String columns with cardinality below 4 are one-hot encoded,
# dropping the first category of binary columns
low_cardinality = ApplyToCols(
    OneHotEncoder(sparse_output=False, drop="if_binary"),
    cols=s.cardinality_below(4) & s.string(),
)
# Numeric columns are left untouched
numeric = ApplyToCols(PassThrough(), cols=s.numeric())
# Datetime columns are expanded into temporal features
datetime = ApplyToCols(DatetimeEncoder(), cols=s.any_date())

my_table_vectorizer = make_pipeline(
    cleaner, numeric, high_cardinality, low_cardinality, datetime
)

my_table_vectorizer.fit_transform(df)
    int  float  str1_public    str2_0    str2_1   bool  datetime-col_year  datetime-col_month  datetime-col_day  datetime-col_hour  datetime-col_total_seconds
0    15   5.20          1.0  0.820968 -0.926893   True             2020.0                 2.0               3.0               12.0                1.580733e+09
1    56   2.40          0.0  0.820969 -0.926902  False             2021.0                 3.0              15.0                0.0                1.615769e+09
2    63   6.20          0.0  0.862895 -0.936510   True             2022.0                 2.0              13.0               17.0                1.644772e+09
3    12  10.45          0.0  1.029677  1.353003  False             2023.0                 5.0              22.0                8.0                1.684745e+09
4    44   9.00          1.0  1.419119  0.660171   True                NaN                 NaN               NaN                NaN                         NaN

The outputs match up to two details: the Cleaner above keeps its default numeric_dtype, so the int and bool columns retain their original dtypes instead of being cast to float32, and the StringEncoder components differ by tiny numerical amounts between the two fits.