14 Building a tabular pipeline
14.1 Introduction
Up until now we have covered how to clean data with the Cleaner, extract features from different column types, and handle categorical features with specialized encoders. In this chapter we show how to combine all of these preprocessing techniques into a complete machine learning pipeline.
A pipeline ensures that:
- Data transformations are applied consistently across training and test sets
- Data leakage is avoided by fitting transformers only on training data
- The workflow is reproducible and deployable
- Preprocessing steps are properly chained together
In this chapter, we explore two approaches: building custom pipelines with TableVectorizer, and using the tabular_pipeline function for quick, well-tuned baselines.
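To make the leakage point concrete, here is a minimal sketch (assuming purely numeric features X and labels y, not the dataset used later in this chapter): the scaler statistics are computed on the training split only, and the test split is transformed with those same statistics.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)          # scaler mean/std estimated on training data only
score = pipe.score(X_test, y_test)  # the fitted transformation is reused on the test set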
14.2 Manual pipeline construction with TableVectorizer
The TableVectorizer can serve as the foundation of a custom scikit-learn pipeline, with cleaning and feature engineering handled by a single object. Scaling and imputation are not required by all models, so they are outside the TableVectorizer's scope.
We combine it with other preprocessing steps and a final estimator:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from skrub import TableVectorizer
model = make_pipeline(
    TableVectorizer(),     # Feature engineering
    SimpleImputer(),       # Handle missing values
    StandardScaler(),      # Normalize features
    LogisticRegression(),  # Final estimator
)

This approach gives complete control over which preprocessing steps to use and in what order. We can customize the TableVectorizer parameters (cardinality threshold, custom encoders, etc.) and add further preprocessing steps as needed.
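As an illustration, here is a hypothetical customization; the parameter names (cardinality_threshold, high_cardinality) follow recent skrub releases and may differ in older versions:

from skrub import TableVectorizer, StringEncoder

vectorizer = TableVectorizer(
    cardinality_threshold=20,          # columns with fewer unique values are treated as low-cardinality
    high_cardinality=StringEncoder(),  # encoder applied to high-cardinality string columns
)
model = make_pipeline(vectorizer, SimpleImputer(), StandardScaler(), LogisticRegression())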
In this example we used LogisticRegression as our estimator, but with a different estimator, such as the HistGradientBoostingClassifier, the scaling and imputation steps could have been skipped.
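For instance, since the HistGradientBoostingClassifier handles missing values natively and is insensitive to feature scale, a sketch of the equivalent pipeline needs only two steps:

from sklearn.ensemble import HistGradientBoostingClassifier

model_hgb = make_pipeline(
    TableVectorizer(),                 # Feature engineering
    HistGradientBoostingClassifier(),  # No imputer or scaler required
)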
14.3 The tabular_pipeline
For many common use cases, we can skip the manual pipeline construction and use the tabular_pipeline function. This function automatically creates an appropriate pipeline based on the estimator we provide:
from skrub import tabular_pipeline
from sklearn.linear_model import LogisticRegression
# Create a complete pipeline for a specific estimator
model = tabular_pipeline(LogisticRegression())

Or, we can use a string to get a pre-configured pipeline with a default estimator:
# Classification with HistGradientBoostingClassifier
model = tabular_pipeline('classification')
# Regression with HistGradientBoostingRegressor
model = tabular_pipeline('regression')

14.4 How tabular_pipeline adapts to different estimators
The tabular_pipeline function configures the preprocessing pipeline based on the estimator type:
14.4.1 For linear models (e.g., LogisticRegression, Ridge)
- TableVectorizer: Uses the default configuration, except for the addition of spline-encoded datetime features by the DatetimeEncoder
- SimpleImputer: Added because linear models cannot handle missing values
- SquashingScaler: Normalizes numeric features to improve convergence and performance
- Estimator: The provided linear model
This configuration ensures numeric features are properly scaled and missing values are handled appropriately.
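One way to check this configuration is simply to print the pipeline returned by tabular_pipeline; the exact step names and parameters depend on the skrub version installed:

from skrub import tabular_pipeline
from sklearn.linear_model import LogisticRegression

model = tabular_pipeline(LogisticRegression())
print(model)  # should list the vectorizer, imputer, scaler, and estimator steps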
14.4.2 For tree-based ensemble models (RandomForest, HistGradientBoosting)
- TableVectorizer: Configured specifically for tree models
  - Low-cardinality categorical features: Either kept as categorical (HistGradientBoosting) or ordinal-encoded (RandomForest)
  - High-cardinality features: StringEncoder for robust feature extraction
  - Datetime features: No spline encoding (unnecessary for trees)
- Scaler: Not added (unnecessary for tree-based models)
- Estimator: The provided tree-based estimator
This configuration leverages the native capabilities of tree models while still providing effective feature engineering through the StringEncoder.
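Again, printing the pipeline lets us verify that no imputer or scaler step was added for a tree-based estimator (a sketch; step names depend on the skrub version):

from sklearn.ensemble import HistGradientBoostingClassifier

model_tree = tabular_pipeline(HistGradientBoostingClassifier())
print(model_tree)  # no imputer or scaler step should appear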
14.5 Conclusions: why the tabular_pipeline is useful
- Smart Configuration: Automatically selects preprocessing parameters appropriate for the estimator
- Simplicity: One-line creation of a complete, well-tuned baseline
- Robustness: Handles edge cases like missing values and mixed data types automatically
- Use tabular_pipeline when you want a quick, well-tuned baseline to benchmark against or as a starting point
- Build manual pipelines when you need fine-grained control over preprocessing steps or want to experiment with custom transformers
Both approaches produce scikit-learn compatible pipelines that can be used with cross-validation, hyperparameter tuning, and other standard workflows.
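For example, the manual pipeline from section 14.2 can be tuned with GridSearchCV out of the box (a sketch; make_pipeline names each step after its lowercased class, hence the logisticregression__ prefix):

from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    model,  # the make_pipeline(...) model from section 14.2
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=5,
)
# search.fit(X, y) re-fits all preprocessing steps inside each cross-validation fold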
15 Exercise
Path to the exercise: content/exercises/09_tabular_pipeline.ipynb
In this exercise we’re going to use the TableVectorizer and tabular_pipeline to replicate the behavior of a traditional scikit-learn pipeline.
First, let’s load the dataset:

import pandas as pd

X = pd.read_csv("../data/adult_census/data.csv")
# squeeze the single-column target DataFrame into a 1-d Series, as scikit-learn expects
y = pd.read_csv("../data/adult_census/target.csv").squeeze()
This is the pipeline that needs to be replicated.
- It uses LogisticRegression as the classifier, i.e., a linear model.
- It scales numerical features using a StandardScaler.
- Categorical features are one-hot-encoded.
- Missing values are imputed using a SimpleImputer.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
categorical_columns = selector(dtype_include="category")(X)
numerical_columns = selector(dtype_include="number")(X)
ct = make_column_transformer(
    (StandardScaler(), numerical_columns),
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
)
model_base = make_pipeline(ct, SimpleImputer(), LogisticRegression())
# model_base

Use the TableVectorizer and make_pipeline to write a pipeline named model_tv, which includes all the steps necessary for the LogisticRegression to work (i.e., scaling and imputing missing values).
from skrub import TableVectorizer
# Write your code here
#
#
#
#
#
#
#
#
# from skrub import TableVectorizer
tv = TableVectorizer()
model_tv = make_pipeline(tv, SimpleImputer(), StandardScaler(), LogisticRegression())
# model_tv

Now use the tabular_pipeline to get a new pipeline named model_tp.
from skrub import tabular_pipeline
# Write your code here
#
#
# from skrub import tabular_pipeline
model_tp = tabular_pipeline(LogisticRegression())
# model_tp

For reference, let’s also create a pipeline that uses HistGradientBoostingClassifier. This can be done by passing the string "classification" to tabular_pipeline.
model_hgb = tabular_pipeline("classification")
# model_hgb

Finally, let’s evaluate the different models and see how they perform:
from sklearn.model_selection import cross_val_score
results_base = cross_val_score(model_base, X, y)
print(f"Base model: {results_base.mean():.4f}")
results_tv = cross_val_score(model_tv, X, y)
print(f"TableVectorizer: {results_tv.mean():.4f}")
results_tp = cross_val_score(model_tp, X, y)
print(f"Tabular pipeline: {results_tp.mean():.4f}")
results_hgb = cross_val_score(model_hgb, X, y)
print(f"HGB model: {results_hgb.mean():.4f}")/Users/rcap/work/skrub-tutorials/.pixi/envs/doc/lib/python3.14/site-packages/sklearn/utils/validation.py:1352: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
Base model: 0.8150
TableVectorizer: 0.8523
Tabular pipeline: 0.8528
HGB model: 0.8737
Unsurprisingly, the model that uses the HistGradientBoostingClassifier outperforms the other models, at the cost of a longer training time. The other pipelines reach very similar scores, which is to be expected since they apply very similar preprocessing steps.
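To check the training-time observation on your own machine, one possible sketch uses cross_validate, which records fit times alongside test scores (actual numbers depend on hardware and library versions):

from sklearn.model_selection import cross_validate

for name, model in [("base", model_base), ("tv", model_tv), ("hgb", model_hgb)]:
    cv = cross_validate(model, X, y)
    print(f"{name}: score={cv['test_score'].mean():.4f}, fit time={cv['fit_time'].mean():.2f}s")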