Building complete tabular pipelines

The Complete Pipeline

A complete ML pipeline covers every step from raw table to trained model:

  • Feature engineering (cleaning, encoding)
  • Handling missing values
  • Scaling features
  • Training the model

Manual Pipeline Construction

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from skrub import TableVectorizer

model = make_pipeline(
    TableVectorizer(),      # Feature engineering
    SimpleImputer(),        # Handle missing values
    StandardScaler(),       # Normalize features
    LogisticRegression()    # Final estimator
)

The tabular_pipeline Function

Quick one-liner for well-tuned baselines:

from skrub import tabular_pipeline
from sklearn.linear_model import LogisticRegression

# Option 1: Provide an estimator
model = tabular_pipeline(LogisticRegression())

# Option 2: Use preset strings
model_clf = tabular_pipeline("classification")
model_reg = tabular_pipeline("regression")

For Linear Models

Configures pipeline with:

  • TableVectorizer with spline-encoded datetimes
  • SimpleImputer (linear models need this)
  • SquashingScaler for robust scaling
  • Your estimator (LogisticRegression, Ridge, etc.)

For Tree-Based Models

Configures pipeline with:

  • TableVectorizer optimized for trees
    • Low-cardinality: Keep as categorical (HistGB) or ordinal (RandomForest)
    • High-cardinality: StringEncoder
    • No spline encoding (unnecessary)
  • No scaling (trees don’t need it)
  • Your estimator

When to Use Which Approach

Use tabular_pipeline when:

  • You want a quick baseline for benchmarking
  • You need a starting point for experimentation
  • You don't need custom preprocessing

Use a manual pipeline when:

  • You need fine-grained control
  • You have custom preprocessing steps
  • You are experimenting with different transformers

Example: Complete Workflow

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from skrub import tabular_pipeline

# Create pipeline (X is the raw feature table, y the target)
model = tabular_pipeline(LogisticRegression())

# Evaluate with cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Score: {scores.mean():.4f}")
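
Because the whole pipeline is one estimator, it also plugs directly into hyperparameter search. A sketch with plain sklearn components on synthetic data (the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Step names are the lowercased class names, so parameters are
# addressed as "<step>__<param>"
search = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```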

Key Advantages of Pipelines

✓ Prevent data leakage (fit only on training data)
✓ Consistent transformations (train & test)
✓ Reproducible workflows
✓ Deployable units
✓ Compatible with hyperparameter tuning
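
The leakage point deserves emphasis. A short sklearn-only sketch contrasting a leaky workflow with a pipelined one, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Leaky: the scaler sees the full dataset, including rows that
# later become test folds
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Safe: the pipeline refits the scaler inside each CV training fold
safe = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=5
)
```

With standardization the score gap is usually small, but the leaky pattern silently biases evaluation for stronger transformers (target encoding, feature selection), so the pipelined form should be the default.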

Key Takeaways

  • TableVectorizer handles all feature engineering
  • Manual pipelines give full control
  • tabular_pipeline provides smart defaults
  • Different configurations for different model types
  • Always use pipelines to prevent data leakage