Up until now we have covered how to clean data with the Cleaner, extract features from different column types, and handle categorical features with specialized encoders. In this section we will show how we can combine all these preprocessing techniques into a complete machine learning pipeline.
A pipeline ensures that:
In this chapter, we explore two approaches: building custom pipelines with TableVectorizer, and using the tabular_pipeline function for quick, well-tuned baselines.
TableVectorizer
The TableVectorizer can be the foundation of a custom scikit-learn pipeline, where cleaning and feature engineering are handled by a single object. Scaling and imputation are not required by all models, so they are outside the TableVectorizer's scope.
We combine it with other preprocessing steps and a final estimator:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from skrub import TableVectorizer

model = make_pipeline(
    TableVectorizer(),     # Feature engineering
    SimpleImputer(),       # Handle missing values
    StandardScaler(),      # Normalize features
    LogisticRegression(),  # Final estimator
)

This approach gives complete control over which preprocessing steps to use and in what order. We can customize the TableVectorizer parameters (cardinality threshold, custom encoders, etc.) and add further preprocessing steps as needed.
In this example we used LogisticRegression as our estimator. With a different estimator, such as HistGradientBoostingClassifier, the scaling and imputation steps could be omitted, because that model handles missing values natively and is insensitive to feature scale.
tabular_pipeline
For many common use cases, we can skip the manual pipeline construction and use the tabular_pipeline function. This function automatically creates an appropriate pipeline based on the estimator we provide:
Or, we can use a string to get a pre-configured pipeline with a default estimator:
tabular_pipeline adapts to different estimators
The tabular_pipeline function configures the preprocessing pipeline based on the estimator type:
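For linear models such as LogisticRegression, datetime columns are encoded with the DatetimeEncoder, and the configuration ensures numeric features are properly scaled and missing values are handled appropriately.
For tree-based models such as HistGradientBoostingClassifier, the configuration leverages the native capabilities of tree models while still providing effective feature engineering through the StringEncoder.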
When tabular_pipeline is useful
Use tabular_pipeline when you want a quick, well-tuned baseline to benchmark against or as a starting point.
Both approaches produce scikit-learn compatible pipelines that can be used with cross-validation, hyperparameter tuning, and other standard workflows.
Path to the exercise: content/exercises/09_tabular_pipeline.ipynb
In this exercise we’re going to use the TableVectorizer and tabular_pipeline to replicate the behavior of a traditional scikit-learn pipeline.
First, let’s load the dataset:
This is the pipeline that needs to be replicated.
- Use LogisticRegression as the classifier, i.e., a linear model.
- Scale numerical features with StandardScaler.
- Impute missing values with SimpleImputer.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

# Select columns by dtype
categorical_columns = selector(dtype_include="category")(X)
numerical_columns = selector(dtype_include="number")(X)

# Scale numerical columns and one-hot encode categorical columns
ct = make_column_transformer(
    (StandardScaler(), numerical_columns),
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
)
model_base = make_pipeline(ct, SimpleImputer(), LogisticRegression())
# model_base

Use the TableVectorizer and make_pipeline to write a pipeline named model_tv, which includes all the steps necessary for the LogisticRegression to work (i.e., scaling and imputing missing values).
Now use the tabular_pipeline to get a new pipeline named model_tp.
For reference, let’s also create a pipeline that uses HistGradientBoostingClassifier. This can be done by passing the string “classification” to tabular_pipeline.
Finally, let’s evaluate the different models and see how they perform:
import numpy as np
from sklearn.model_selection import cross_val_score

# Flatten y to 1-D: passing a column vector makes scikit-learn emit a
# DataConversionWarning on every fit.
y = np.ravel(y)

results_base = cross_val_score(model_base, X, y)
print(f"Base model: {results_base.mean():.4f}")

results_tv = cross_val_score(model_tv, X, y)
print(f"TableVectorizer: {results_tv.mean():.4f}")

results_tp = cross_val_score(model_tp, X, y)
print(f"Tabular pipeline: {results_tp.mean():.4f}")

results_hgb = cross_val_score(model_hgb, X, y)
print(f"HGB model: {results_hgb.mean():.4f}")

Base model: 0.8150
TableVectorizer: 0.8523
Tabular pipeline: 0.8528
HGB model: 0.8725
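Unsurprisingly, the model that uses HGB achieves the highest mean cross-validated accuracy (0.8725), while being much slower to train. The two skrub-based linear pipelines, model_tv and model_tp, perform almost identically (0.8523 and 0.8528), which is to be expected since they apply very similar preprocessing. Both score higher than the base scikit-learn pipeline (0.8150).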