14  Building a tabular pipeline

14.1 Introduction

Up until now we have covered how to clean data with the Cleaner, extract features from different column types, and handle categorical features with specialized encoders. In this chapter we show how to combine all of these preprocessing techniques into a complete machine learning pipeline.

A pipeline ensures that:

  • Data transformations are applied consistently across training and test sets
  • Data leakage is avoided by fitting transformers only on training data
  • The workflow is reproducible and deployable
  • Preprocessing steps are properly chained together

In this chapter, we explore two approaches: building custom pipelines with TableVectorizer, and using the tabular_pipeline function for quick, well-tuned baselines.

14.2 Manual pipeline construction with TableVectorizer

The TableVectorizer can serve as the foundation of a custom scikit-learn pipeline, with cleaning and feature engineering handled by a single object. Not all models require scaling and imputation, so these steps are outside the TableVectorizer’s scope.

We combine it with other preprocessing steps and a final estimator:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from skrub import TableVectorizer

model = make_pipeline(
    TableVectorizer(),           # Feature engineering
    SimpleImputer(),             # Handle missing values
    StandardScaler(),            # Normalize features
    LogisticRegression()         # Final estimator
)

This approach gives complete control over which preprocessing steps to use and in what order. We can customize the TableVectorizer parameters (cardinality threshold, custom encoders, etc.) and add additional preprocessing steps as needed.
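
For instance, here is a sketch of a customized configuration (parameter names taken from recent skrub releases; adjust them to your installed version):

from skrub import StringEncoder, TableVectorizer

# Treat columns with fewer than 20 unique values as low-cardinality,
# and encode the remaining high-cardinality string columns with a
# StringEncoder instead of the default.
vectorizer = TableVectorizer(
    cardinality_threshold=20,
    high_cardinality=StringEncoder(),
)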

In this example we used LogisticRegression as the estimator. With a different estimator, such as HistGradientBoostingClassifier, the scaling and imputation steps could be dropped entirely.
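
A minimal sketch of that leaner pipeline:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

# Gradient-boosted trees handle missing values natively and are
# insensitive to feature scale, so TableVectorizer alone suffices.
model = make_pipeline(
    TableVectorizer(),
    HistGradientBoostingClassifier(),
)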

14.3 The tabular_pipeline

For many common use cases, we can skip the manual pipeline construction and use the tabular_pipeline function. This function automatically creates an appropriate pipeline based on the estimator we provide:

from skrub import tabular_pipeline
from sklearn.linear_model import LogisticRegression

# Create a complete pipeline for a specific estimator
model = tabular_pipeline(LogisticRegression())

Or, we can use a string to get a pre-configured pipeline with a default estimator:

# Classification with HistGradientBoostingClassifier
model = tabular_pipeline('classification')

# Regression with HistGradientBoostingRegressor
model = tabular_pipeline('regression')
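
Either way, the returned object is an ordinary scikit-learn estimator. A minimal usage sketch, assuming a feature table X and a target y:

model = tabular_pipeline("classification")
model.fit(X, y)                 # fits preprocessing and estimator in one call
predictions = model.predict(X)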

14.4 How tabular_pipeline adapts to different estimators

The tabular_pipeline function configures the preprocessing pipeline based on the estimator type:

14.4.1 For linear models (e.g., LogisticRegression, Ridge)

  • TableVectorizer: Uses the default configuration, with the addition of spline-encoded datetime features from the DatetimeEncoder
  • SimpleImputer: Added because linear models cannot handle missing values
  • SquashingScaler: Normalizes numeric features to improve convergence and performance
  • Estimator: The provided linear model

This configuration ensures numeric features are properly scaled and missing values are handled appropriately.
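
Roughly, the generated pipeline corresponds to the following manual construction (a sketch based on the description above; the exact parameters tabular_pipeline chooses may differ):

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from skrub import DatetimeEncoder, SquashingScaler, TableVectorizer

model = make_pipeline(
    TableVectorizer(datetime=DatetimeEncoder(periodic_encoding="spline")),
    SimpleImputer(),    # linear models cannot handle missing values
    SquashingScaler(),  # robust scaling of numeric features
    LogisticRegression(),
)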

14.4.2 For tree-based ensemble models (RandomForest, HistGradientBoosting)

  • TableVectorizer: Configured specifically for tree models
    • Low-cardinality categorical features: Either kept as categorical (HistGradientBoosting) or ordinal encoded (RandomForest)
    • High-cardinality features: StringEncoder for robust feature extraction
    • Datetime features: No spline encoding (unnecessary for trees)
  • Scaler: Not added (unnecessary for tree-based models)
  • Estimator: The provided tree-based estimator

This configuration leverages the native capabilities of tree models while still providing effective feature engineering through the StringEncoder.
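
We can check this by inspecting the pipeline that tabular_pipeline generates for a tree-based estimator (the exact step names may vary between skrub versions):

from sklearn.ensemble import HistGradientBoostingClassifier
from skrub import tabular_pipeline

model = tabular_pipeline(HistGradientBoostingClassifier())
# The steps include a TableVectorizer and the estimator,
# but no scaler is added for tree models.
print(model.named_steps)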

14.5 Conclusions: why the tabular_pipeline is useful

  1. Smart Configuration: Automatically selects preprocessing parameters appropriate for the estimator
  2. Simplicity: One-line creation of a complete, well-tuned baseline
  3. Robustness: Handles edge cases like missing values and mixed data types automatically

As a rule of thumb:

  • Use tabular_pipeline when you want a quick, well-tuned baseline to benchmark against or as a starting point
  • Build manual pipelines when you need fine-grained control over preprocessing steps or want to experiment with custom transformers

Both approaches produce scikit-learn compatible pipelines that can be used with cross-validation, hyperparameter tuning, and other standard workflows.
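
For example, since make_pipeline names its steps after the lowercased class names, hyperparameters can be tuned with GridSearchCV. A sketch using the manual pipeline from Section 14.2:

from sklearn.model_selection import GridSearchCV

# Tune the regularization strength of the final LogisticRegression step.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(model, param_grid, cv=5)
# search.fit(X, y) then behaves like the best model found.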

15 Exercise

Path to the exercise: content/exercises/09_tabular_pipeline.ipynb

In this exercise we’re going to use the TableVectorizer and tabular_pipeline to replicate the behavior of a traditional scikit-learn pipeline.

First, let’s load the dataset:

import pandas as pd

X = pd.read_csv("../data/adult_census/data.csv")
y = pd.read_csv("../data/adult_census/target.csv")
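
The target is read as a single-column DataFrame, whereas scikit-learn expects a one-dimensional array and will otherwise emit a DataConversionWarning during cross-validation. We flatten it first:

# Convert the single-column target DataFrame into a 1D array
y = y.values.ravel()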

This is the pipeline that needs to be replicated.

  • It uses LogisticRegression as the classifier, i.e., a linear model.
  • It scales numerical features using a StandardScaler.
  • Categorical features are one-hot-encoded.
  • Missing values are imputed using a SimpleImputer.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

categorical_columns = selector(dtype_include="category")(X)
numerical_columns = selector(dtype_include="number")(X)

ct = make_column_transformer(
    (StandardScaler(), numerical_columns),
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
)

model_base = make_pipeline(ct, SimpleImputer(), LogisticRegression())
# model_base

Use the TableVectorizer and make_pipeline to write a pipeline named model_tv, which includes all the steps necessary for the LogisticRegression to work (i.e., scaling and imputing missing values).

from skrub import TableVectorizer
# Write your code here
# 
# 
# 
# 
# 
# 
# 
# 
# 
from skrub import TableVectorizer

tv = TableVectorizer()

model_tv = make_pipeline(tv, SimpleImputer(), StandardScaler(), LogisticRegression())
# model_tv

Now use the tabular_pipeline to get a new pipeline named model_tp.

from skrub import tabular_pipeline
# Write your code here
# 
# 
# 
from skrub import tabular_pipeline

model_tp = tabular_pipeline(LogisticRegression())
# model_tp

For reference, let’s also create a pipeline that uses HistGradientBoostingClassifier. This can be done by passing the string “classification” to tabular_pipeline.

model_hgb = tabular_pipeline("classification")
# model_hgb

Finally, let’s evaluate the different models and see how they perform:

from sklearn.model_selection import cross_val_score

results_base = cross_val_score(model_base, X, y)
print(f"Base model: {results_base.mean():.4f}")

results_tv = cross_val_score(model_tv, X, y)
print(f"TableVectorizer: {results_tv.mean():.4f}")

results_tp = cross_val_score(model_tp, X, y)
print(f"Tabular pipeline: {results_tp.mean():.4f}")

results_hgb = cross_val_score(model_hgb, X, y)
print(f"HGB model: {results_hgb.mean():.4f}")
Base model: 0.8150
TableVectorizer: 0.8523
Tabular pipeline: 0.8528
HGB model: 0.8737

Unsurprisingly, the HistGradientBoostingClassifier-based model outperforms the others, though it is much slower to train. The TableVectorizer and tabular_pipeline models perform almost identically, as expected given their nearly identical preprocessing, while the hand-written base pipeline trails behind.