```python
from skrub import tabular_pipeline
from sklearn.linear_model import LogisticRegression

# Option 1: Provide an estimator
model = tabular_pipeline(LogisticRegression())

# Option 2: Use preset strings
model_clf = tabular_pipeline("classification")
model_reg = tabular_pipeline("regression")
```
For Linear Models
Configures the pipeline with:

- TableVectorizer with spline-encoded datetimes
- SimpleImputer (linear models cannot handle missing values)
- SquashingScaler for robust scaling
- Your estimator (LogisticRegression, Ridge, etc.)
For Tree-Based Models
Configures the pipeline with:

- TableVectorizer optimized for trees:
  - Low-cardinality columns: kept as categorical (HistGB) or ordinal-encoded (RandomForest)
  - High-cardinality columns: StringEncoder
- No spline encoding (unnecessary for trees)
- No scaling (trees don't need it)
- Your estimator
When to Use Which Approach
Use tabular_pipeline:

- Quick baseline for benchmarking
- Starting point for experimentation
- No custom preprocessing needed

Use a manual pipeline:

- Fine-grained control needed
- Custom preprocessing steps
- Experimenting with different transformers
Example: Complete Workflow
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from skrub import tabular_pipeline

# Create pipeline
model = tabular_pipeline(LogisticRegression())

# Evaluate with cross-validation (X, y: your features and target)
scores = cross_val_score(model, X, y, cv=5)
print(f"Score: {scores.mean():.4f}")
```
Key Advantages of Pipelines
- ✓ Prevent data leakage (fit only on training data)
- ✓ Consistent transformations (train & test)
- ✓ Reproducible workflows
- ✓ Deployable units
- ✓ Compatible with hyperparameter tuning
Key Takeaways
- TableVectorizer handles all feature engineering
- Manual pipelines give full control
- tabular_pipeline provides smart defaults
- Different configurations for different model types