Building complete tabular pipelines

Building a baseline model with preprocessing

A complete ML pipeline for tabular data chains several steps:

Feature engineering (cleaning, encoding)
Handling missing values
Scaling features
Training the model

Manual Pipeline Construction

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from skrub import TableVectorizer

model = make_pipeline(
    TableVectorizer(),      # Feature engineering
    SimpleImputer(),        # Handle missing values
    StandardScaler(),       # Normalize features
    LogisticRegression()    # Final estimator
)

The tabular_pipeline Function

Quick one-liner for well-tuned baselines:

from sklearn.linear_model import LogisticRegression
from skrub import tabular_pipeline

# Option 1: Provide an estimator
model = tabular_pipeline(LogisticRegression())

# Option 2: Use preset strings
model_clf = tabular_pipeline("classification")
model_reg = tabular_pipeline("regression")

For Linear Models

For a linear estimator, tabular_pipeline configures the pipeline with:

TableVectorizer optimized for linear models
- DatetimeEncoder with spline-encoded datetimes
SimpleImputer (linear models cannot handle missing values)
SquashingScaler for robust scaling
Your estimator (LogisticRegression, Ridge, etc.)

For Tree-Based Models

For a tree-based estimator, tabular_pipeline configures the pipeline with:

TableVectorizer optimized for trees
- Low-cardinality: kept as categorical (HistGB) or ordinal-encoded (RandomForest)
- High-cardinality: StringEncoder
- No spline encoding (unnecessary for trees)
No scaling (trees are insensitive to feature scale)
No imputation (HistGradientBoosting, for example, handles missing values natively)
Your estimator

When to Use Which Approach

Use tabular_pipeline:
- Quick baseline for benchmarking
- Starting point for experimentation
- No custom preprocessing needed

Use a manual pipeline:
- Fine-grained control needed
- Custom preprocessing steps
- Experimenting with different transformers

Customizing the pipeline

Tip

Customizing the pipeline is unnecessary most of the time!

from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from skrub import tabular_pipeline

# Create a model made of two steps: dimensionality reduction, then a regressor
model_pipeline = make_pipeline(PCA(n_components=20), Ridge())

# Pass the two-step model to tabular_pipeline as the final estimator
full_pipeline = tabular_pipeline(model_pipeline)

# List the step names of the resulting pipeline
[name for name, _ in full_pipeline.steps]

Example: Complete Workflow

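from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from skrub import tabular_pipeline

# X (a dataframe of features) and y (the target) are assumed to be loaded already

# Create pipeline
model = tabular_pipeline(LogisticRegression())

# Evaluate with 5-fold cross-validation (accuracy by default for classifiers)
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.4f}")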

That’s it!

What to remember

TableVectorizer handles column cleaning and encoding
Manual pipelines give full control
tabular_pipeline provides smart defaults
Different configurations for different model types
Always use pipelines to prevent data leakage
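The point about data leakage can be made concrete with plain scikit-learn. The sketch below, on random data generated for the illustration, shows that cross-validation refits the whole pipeline on each training split, so the scaler never sees the held-out rows. Scaling the full dataset before splitting would instead use statistics computed on the test rows as well.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression())

# return_estimator=True keeps the pipeline fitted on each training split.
results = cross_validate(model, X, y, cv=5, return_estimator=True)

# Mean computed on all rows: what a scaler fitted before splitting would learn.
full_data_mean = X.mean(axis=0)

# Means learned by the scaler inside each fold's fitted pipeline.
fold_means = [
    fitted.named_steps["standardscaler"].mean_ for fitted in results["estimator"]
]

for i, fold_mean in enumerate(fold_means):
    print(f"fold {i}: scaler mean = {fold_mean.round(3)}")
print(f"all rows: mean = {full_data_mean.round(3)}")
```

Each fold's scaler has its own mean, computed from that fold's training rows only, and none of them matches the mean over all rows.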

Time for the quiz!