Building complete tabular pipelines

Building a baseline model with preprocessing

Beyond a single preprocessing step, a complete ML pipeline needs:

  • Feature engineering (cleaning, encoding)
  • Handling missing values
  • Scaling features
  • Training the model

Manual Pipeline Construction

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from skrub import TableVectorizer

model = make_pipeline(
    TableVectorizer(),      # Feature engineering
    SimpleImputer(),        # Handle missing values
    StandardScaler(),       # Normalize features
    LogisticRegression()    # Final estimator
)

The tabular_pipeline Function

tabular_pipeline builds a reasonable baseline pipeline in one line:

from sklearn.linear_model import LogisticRegression
from skrub import tabular_pipeline

# Option 1: Provide an estimator
model = tabular_pipeline(LogisticRegression())

# Option 2: Use preset strings
model_clf = tabular_pipeline("classification")
model_reg = tabular_pipeline("regression")

For Linear Models

When given a linear model, tabular_pipeline configures the pipeline with:

  • TableVectorizer optimized for linear models
    • DatetimeEncoder with spline-encoded datetimes
  • SimpleImputer (linear models need this)
  • SquashingScaler for robust scaling
  • Your estimator (LogisticRegression, Ridge, etc.)

For Tree-Based Models

When given a tree-based model, tabular_pipeline configures the pipeline with:

  • TableVectorizer optimized for trees
    • Low-cardinality: Keep as categorical (HistGB) or ordinal (RandomForest)
    • High-cardinality: StringEncoder
    • No spline encoding (unnecessary)
  • No scaling (trees don’t need scaling)
  • No imputation when the estimator handles missing values natively (e.g. HistGradientBoosting models)
  • Your estimator

When to Use Which Approach

Use tabular_pipeline when:

  • You want a quick baseline for benchmarking
  • You need a starting point for experimentation
  • You don’t need custom preprocessing

Use a manual pipeline when:

  • You need fine-grained control
  • You have custom preprocessing steps
  • You are experimenting with different transformers

Customizing the pipeline

Tip

This is unnecessary most of the time!

from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from skrub import tabular_pipeline

# Create a pipeline that includes two steps
# (PCA with n_components=20 needs at least 20 features after vectorization)
model_pipeline = make_pipeline(PCA(n_components=20), Ridge())

# Pass the new model to tabular_pipeline
full_pipeline = tabular_pipeline(model_pipeline)

# List the step names to see what tabular_pipeline added in front
[name for name, _ in full_pipeline.steps]

Example: Complete Workflow

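from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from skrub import tabular_pipeline

# X is the feature table (e.g. a DataFrame) and y the target, loaded beforehand

# Create pipeline
model = tabular_pipeline(LogisticRegression())

# Evaluate with cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Score: {scores.mean():.4f}")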

That’s it!

What to remember

  • TableVectorizer handles the cleaning and encoding of mixed-type columns
  • Manual pipelines give full control
  • tabular_pipeline provides smart defaults
  • Different configurations for different model types
  • Always use pipelines to prevent data leakage
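The last point can be checked directly. The sketch below uses plain scikit-learn (a StandardScaler rather than the skrub transformers, chosen only to keep the example small) and synthetic data. It shows that, inside cross-validation, each scaler is fit on the training rows of its fold only, never on the held-out rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, perfectly balanced toy data (100 rows, 3 features)
rng = np.random.default_rng(0)
n = 100
y = np.arange(n) % 2
X = rng.normal(size=(n, 3)) + y[:, None]

model = make_pipeline(StandardScaler(), LogisticRegression())

# Keep the fitted pipeline of every fold so the scalers can be inspected
results = cross_validate(model, X, y, cv=5, return_estimator=True)

rows_seen = [
    int(est.named_steps["standardscaler"].n_samples_seen_)
    for est in results["estimator"]
]
print(rows_seen)  # each value is the training-fold size, smaller than n
```

Scaling the full table before splitting would instead let statistics from the test rows influence training, which is the leakage the pipeline avoids.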

Time for the quiz!