TableVectorizer: All preprocessing in one place

The Goal

Convert any mixed-type dataframe into numeric features for ML:

import pandas as pd
from skrub import TableVectorizer

df = pd.DataFrame({
    "age": [25, 30, 35],
    "salary": [50000, 65000, 75000],
    "hire_date": ["2020-01-15", "2021-06-20", "2022-03-10"],
    "department": ["Sales", "Engineering", "Sales"]
})

tv = TableVectorizer()
X_numeric = tv.fit_transform(df)

Phase 1: Cleaning

The TableVectorizer starts by cleaning the data:

  • Parse datetime columns (with custom formats)
  • Detect numbers written as strings (“123.45”)
  • Handle missing value markers (“N/A”, “?”)
  • Enforce consistent types in categorical columns
  • Drop uninformative columns
  • Convert to float32 for efficiency
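To make two of these steps concrete, here is a minimal pure-Python sketch of recognizing missing-value markers and numbers written as strings. It only illustrates the idea and is not skrub's implementation; the NULL_MARKERS set and the clean_column helper are hypothetical names chosen for this example.

```python
# Illustrative only: not skrub's implementation.
# NULL_MARKERS is an assumed set of markers for this example.
NULL_MARKERS = {"N/A", "?", ""}


def clean_column(values):
    """Replace null markers with None; convert to float if every other value parses."""
    stripped = [
        None if v is None or v.strip() in NULL_MARKERS else v.strip()
        for v in values
    ]
    try:
        return [None if v is None else float(v) for v in stripped]
    except ValueError:
        # At least one value is not a number: keep the column as strings.
        return stripped


prices = clean_column(["123.45", "N/A", "7"])
cities = clean_column(["Paris", "?", "Rome"])
print(prices)  # [123.45, None, 7.0]
print(cities)  # ['Paris', None, 'Rome']
```

A column becomes numeric only when every non-missing value parses as a number; a single non-numeric entry keeps the whole column as text.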

Phase 2: Column Dispatch

After cleaning, columns are routed to appropriate transformers:

Column Type       Cardinality   Transformer
Numeric           -             Passthrough
Datetime          -             DatetimeEncoder
String/Category   < 40          OneHotEncoder
String/Category   ≥ 40          StringEncoder
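The routing rule can be written as a small function. This is a simplified sketch, not skrub's code: choose_transformer and the kind labels are hypothetical, and it assumes (following the skrub documentation for cardinality_threshold) that a column with strictly fewer unique values than the threshold counts as low-cardinality.

```python
def choose_transformer(kind, n_unique=None, cardinality_threshold=40):
    """Return the name of the default transformer for a column.

    kind is one of "numeric", "datetime", "string" (hypothetical labels).
    """
    if kind == "numeric":
        return "passthrough"
    if kind == "datetime":
        return "DatetimeEncoder"
    if kind == "string":
        # Assumption: strictly fewer unique values than the threshold
        # counts as low cardinality.
        if n_unique < cardinality_threshold:
            return "OneHotEncoder"
        return "StringEncoder"
    raise ValueError(f"unknown column kind: {kind!r}")


department = ["Sales", "Engineering", "Sales"]
low = choose_transformer("string", n_unique=len(set(department)))
high = choose_transformer("string", n_unique=12, cardinality_threshold=10)
print(low)   # OneHotEncoder
print(high)  # StringEncoder
```

The second call shows the effect of lowering the threshold: a column with 12 distinct values would be one-hot encoded by default, but is treated as high-cardinality once the threshold is 10.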

The Cardinality Threshold

# Change the number of unique values at which a categorical column
# counts as "high-cardinality" (default: 40)
tv = TableVectorizer(cardinality_threshold=10)

Customizing Parameters

Data cleaning parameters:

tv = TableVectorizer(
    drop_null_fraction=0.9,     # Drop columns that are more than 90% null
    drop_if_constant=True,      # Drop columns with a single constant value
    datetime_format="%Y-%m-%d", # Parse date strings with this strptime format
)
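As a rough illustration of what these two settings mean, the sketch below computes a null fraction and parses a date with the same format string using only the standard library. It is not skrub's implementation: null_fraction and should_drop are hypothetical helpers, and the strict comparison is an assumption that applies to thresholds below 1.0.

```python
from datetime import datetime


def null_fraction(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


def should_drop(values, drop_null_fraction=0.9):
    # Assumption (for thresholds below 1.0): drop the column when its
    # null fraction is strictly above the threshold.
    return null_fraction(values) > drop_null_fraction


mostly_null = [None] * 19 + ["x"]   # 95% null
half_null = [None, "a", None, "b"]  # 50% null
print(should_drop(mostly_null))  # True
print(should_drop(half_null))    # False

# The format string uses the standard strptime directives.
hire_date = datetime.strptime("2020-01-15", "%Y-%m-%d")
print(hire_date.year, hire_date.month, hire_date.day)  # 2020 1 15
```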

Customizing Transformers

from skrub import DatetimeEncoder, StringEncoder

datetime_enc = DatetimeEncoder(periodic_encoding="circular")
string_enc = StringEncoder(n_components=10)

tv = TableVectorizer(
    datetime=datetime_enc,
    high_cardinality=string_enc
)
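The idea behind a circular encoding is to place a periodic value, such as the month, on the unit circle so that the end of a cycle sits next to its start. The sketch below shows the principle with the standard library only; circular_encode is a hypothetical helper, and the exact features DatetimeEncoder produces may differ.

```python
import math


def circular_encode(value, period):
    """Map a periodic value to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


december = circular_encode(12, period=12)
january = circular_encode(1, period=12)
june = circular_encode(6, period=12)

# December and January are neighbours on the circle, although 12 and 1
# are far apart as plain integers.
near = math.dist(december, january)
far = math.dist(january, june)
print(near < far)  # True
```

With a plain integer month, a model sees December and January as eleven units apart; with the two circular features they are adjacent.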

Applying to a Subset of Columns

from skrub import ApplyToCols
import skrub.selectors as s

# Don't transform ID columns
tv = ApplyToCols(
    TableVectorizer(),
    cols=s.all() - s.glob("*_id")
)
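The selector expression reads as "every column except those whose name matches *_id". The standard-library sketch below mimics that selection on a list of made-up column names; it illustrates the pattern only, and skrub's own matching rules may differ in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical column names for illustration.
columns = ["customer_id", "age", "order_id", "department"]

# Similar in spirit to s.all() - s.glob("*_id"): keep every column
# whose name does not match the pattern.
selected = [c for c in columns if not fnmatchcase(c, "*_id")]
print(selected)  # ['age', 'department']
```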

Fine-Grained Control: specific_transformers

from sklearn.preprocessing import OrdinalEncoder

specific = [(OrdinalEncoder(), ["occupation"])]
tv = TableVectorizer(specific_transformers=specific)
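For readers unfamiliar with ordinal encoding, the sketch below shows the kind of mapping it produces: each distinct category gets an integer code. It is a simplified pure-Python illustration with made-up data, not scikit-learn's implementation; OrdinalEncoder returns floating-point codes by default and handles details such as unknown categories that are ignored here.

```python
def fit_ordinal(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


occupation = ["nurse", "engineer", "teacher", "engineer"]  # made-up data
codes = fit_ordinal(occupation)
encoded = [codes[v] for v in occupation]
print(codes)    # {'engineer': 0, 'nurse': 1, 'teacher': 2}
print(encoded)  # [1, 0, 2, 0]
```

Unlike one-hot encoding, this yields a single output column, which is why it can be worth forcing on a specific column even when the default routing would choose something else.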

Key Takeaways

  • One object handles all preprocessing
  • Automatic type detection and routing
  • Configurable cleaning and encoding
  • Smart defaults based on column characteristics
  • Easy to customize per-column behavior